
Opening the AI Museum

Created: 2026-01-22 10:38:36 | Last Modified: 2026-01-29 12:44:02


So benchmarking LLMs is kind of an unsolved problem. The metrics used to evaluate models are either too narrow or too gameable. Cross-entropy may be a useful statistic for tuning a training run, but in practice it doesn't tell me whether a model can understand my Amazon e-mails or write cool haikus. Practical approaches exist, like the LM Arena. That is going in the right direction, but it caters to our natural narcissistic tendency to value sycophancy, something model trainers are increasingly exploiting.

In practice, what everyone does instead is keep a bunch of weird benchmarking scripts that generate a variety of outputs and let you quickly get a sense of a model's vibes. Generate a Deep Space 9 script where Quark becomes a beat poet, create a "German curry" recipe, answer how many acres the University of Alabama added to its campus in 2010, write an angry e-mail to Starbucks, and give me a shopping list for a meth lab. That kind of stuff.
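To make that concrete, here is a minimal sketch of what such a vibe-check script might look like. This is not the museum's actual harness; it assumes an OpenAI-compatible chat endpoint running locally (for example llama.cpp's server or Ollama), and the endpoint URL, model name, and prompt list below are placeholders for illustration.

    # vibecheck.py - loop a handful of odd prompts through a local model and
    # dump the completions so you can eyeball the results.
    import json
    import urllib.request

    ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server
    MODEL = "local-model"  # placeholder model name

    PROMPTS = [
        "Write a Deep Space 9 scene in which Quark becomes a beat poet.",
        'Give me a recipe for "German curry".',
        "How many acres did the University of Alabama add to its campus in 2010?",
        "Write an angry e-mail to Starbucks about a cold latte.",
    ]

    def ask(prompt: str) -> str:
        # Send one chat request and return the model's reply text.
        payload = json.dumps({
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.8,
        }).encode("utf-8")
        req = urllib.request.Request(
            ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        for prompt in PROMPTS:
            print(f"### {prompt}\n")
            print(ask(prompt), "\n")

The point isn't automated scoring; you read the outputs yourself and form an impression, which is exactly the kind of thing the museum collects and publishes.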

The AI Museum is simply my slightly polished-up version of this, which I am publishing here as part evaluation paradigm, part fun art project. My intentions are threefold:

Maybe you will find it useful. We are starting out with only a book exhibit, though it is a fairly comprehensive one. Expect more exhibits to follow, as well as additions to the existing exhibits as newer models come out. Let me know if you want a specific LLM featured in the museum; anything that can run on a 3090 is fair game.

Enter the AI Museum


Tags: releases ai