Opening the AI Museum
So benchmarking LLMs is kind of an unsolved problem. The metrics used to evaluate models are either too narrow or too gameable. Cross-entropy may be a useful statistic for tuning a training run, but in practice it doesn't tell me whether a model can understand my Amazon e-mails or write cool haikus (a quick sketch of what it actually measures is below). Practical approaches exist, like the LM Arena. This is going in the ...
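
For the curious, cross-entropy here is just the average negative log-probability a model assigns to each true next token. A minimal sketch, assuming PyTorch and Hugging Face `transformers` (gpt2 and the sample sentence are arbitrary placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Placeholder text standing in for, say, one of those Amazon e-mails.
text = "Your order has shipped and will arrive on Tuesday."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model shift targets internally and
    # return the mean cross-entropy over all predicted next tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"cross-entropy: {outputs.loss.item():.3f} nats/token")
print(f"perplexity:    {torch.exp(outputs.loss).item():.1f}")
```

A lower number just means the model is less surprised by the text; two models with identical loss can still feel wildly different to actually use, which is the point.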
