Discussion about this post

Priank Ravichandar

Love the breakdown of each benchmark! The fact that models can be optimized to score higher makes a lot of the claims of "improved performance" questionable. Newer models aren't necessarily more capable; they may just be better at gaming the benchmarks.

Jacob Ballon

Great article! Whether training on benchmark data corrupts a model's output depends on answering an underlying question: is learning the test answers and then taking the test a bad thing? More simply, are frontier models cheating? Our tendency to anthropomorphize is evident here: most people answer with an emphatic yes for humans (I'm inclined to as well) and assume the same holds for AI. When humans receive test answers in advance, they gain a clear edge, but not in the same way an LLM would. For a human, rote memorization maps the input (the question) directly to the output (the answer); with an answer bank, it's fair to assume humans will skip the requisite reasoning stages on the way to a correct answer. Can we extrapolate this to LLMs? Maybe, but it's not as clear-cut. Unless training an LLM on AIME data produces a model that is de facto deterministic (i.e., given a question from that narrow dataset, it simply reproduces the memorized answer, in which case AIs are cheating the way humans cheat, which is no good), I don't see an obvious issue with training on benchmark data. Am I right to view this as studying with a study guide and then taking the test? I have no technical background, so go easy on me if I'm talking out of my ass lol
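
For the curious: the concern raised here is usually called "contamination," and one common way labs screen for it is an n-gram overlap check between training documents and benchmark items. Below is a minimal sketch of that idea in pure Python; the function names, n-gram size, and the corpus and benchmark strings are all made up for illustration, not any lab's actual pipeline.

```python
# Illustrative sketch of an n-gram overlap contamination check:
# flag a benchmark item if it shares any long word n-gram with a
# training document. Real pipelines add normalization, dedup, etc.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_doc: str, benchmark_items: list[str], n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training doc."""
    train_grams = ngrams(train_doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return hits / len(benchmark_items)

# Made-up data: a scraped page that happens to quote one test question verbatim.
train_doc = ("forum post quoting: find the number of ordered pairs of "
             "integers such that the system has a solution")
benchmark = [
    "find the number of ordered pairs of integers such that the system has a solution",
    "compute the remainder when the product is divided by 1000",
]

print(f"{contamination_rate(train_doc, benchmark):.0%} of items flagged")  # -> 50%
```

An item that overlaps like this may have been memorized rather than reasoned through, which is why a model can be "deterministic" on contaminated questions while still generalizing elsewhere.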

12 more comments...
