Love the breakdown of each benchmark! The fact that models can be optimized to score higher makes a lot of the claims of "improved performance" questionable. Newer models may not necessarily be more capable. They may just be better at gaming the benchmarks.
Great article! The claim that training on benchmark data corrupts a model’s output rests on an underlying question: is learning the test answers and then taking the test actually a bad thing? More simply, are frontier models cheating? Our tendency to anthropomorphize is evident here: most people answer with an emphatic yes for humans (I’m inclined to as well) and so assume the same for AI. When humans receive test answers in advance, they gain a clear edge, but not in the same way an LLM would. For a human, rote memorization takes us straight from the input (the question) to the output (the answer); with an answer bank, it’s fair to assume a person will skip the reasoning steps that should precede a correct answer. Can we extrapolate this to LLMs? Maybe, but it’s not as clear-cut. Unless training an LLM on AIME data makes it effectively deterministic on that narrow dataset (i.e., given one of those questions, it simply regurgitates the memorized answer, in which case AIs are cheating the same way humans cheat, which is no good), I don’t see an obvious problem with training on benchmark data. Am I right to view this as studying with a study guide and then taking the test? I have no technical background, so go easy on me if I’m talking out of my ass lol
Great article!
Where are the laws and regulations when it comes to disclosing training data? I don’t understand how this is not a prerequisite for these tests at least. Validity and truth go hand in hand.
Imagining that real regulation will ever touch AI within 100 years? Haha! No, no, we have more important things to do. And no, that regulation doesn't exist.
I’d recommend taking a closer look at GDPval. I dug into the paper a little and thought it was quite impressive.
They developed the test set cooperatively with professionals from the domains that contribute most significantly to GDP (e.g. financial, legal, educational), asking them for the most important knowledge-work tasks. Another set of experts from each domain produced ground-truth examples to compare the models against, and a third group of experts acted as evaluators, ranking the outputs by preference. The study was double-blind, so the experts had no idea which deliverable was the AI’s work, and they published strong results from competitors, which is a good signal of honesty.
The live benchmark is a little funkier with LLM judging, but the initial release was all blinded human comparison. And if they do RL or fine-tuning in *this* domain, the domain of the most important tasks in the most economically relevant fields, isn’t that the whole point?
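If it helps to make that grading step concrete, here is roughly how I picture the blinded comparisons being aggregated; the records, field names, and numbers below are placeholders I made up, not anything from the paper:

```python
# Toy sketch (my own invention, not GDPval's actual code): aggregate blinded
# pairwise verdicts, where an expert grader preferred either the model's
# deliverable, the human expert's deliverable, or called it a tie.
from collections import Counter

# (model_name, grader_verdict) pairs; all verdicts are fabricated for illustration.
judgments = [
    ("model-a", "model"), ("model-a", "human"), ("model-a", "tie"),
    ("model-b", "human"), ("model-b", "model"), ("model-b", "model"),
]

def win_or_tie_rate(records, model):
    """Fraction of blinded comparisons the model won or tied."""
    verdicts = Counter(v for m, v in records if m == model)
    total = sum(verdicts.values())
    return (verdicts["model"] + verdicts["tie"]) / total if total else 0.0

for name in ("model-a", "model-b"):
    print(name, round(win_or_tie_rate(judgments, name), 2))
```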
And since you’re asking about other benchmarks to review, I’d take a look at Epoch AI’s general capabilities index and their paper about extracting a general capability vector from aggregate benchmark metrics. It seemed like the steelman case for benchmarks being useful.
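For what it’s worth, my loose reading of the “capability vector” idea is something like a first principal component over a models-by-benchmarks score matrix. The sketch below uses fabricated numbers and only illustrates that reading, not Epoch AI’s actual method:

```python
# Toy illustration of extracting a single "general capability" score per model
# from a matrix of benchmark results. Scores are fabricated; the technique
# (standardize columns, project onto the first principal component) is the point.
import numpy as np

scores = np.array([      # rows: models, columns: benchmarks (made-up values)
    [62.0, 71.0, 55.0, 80.0],
    [48.0, 60.0, 41.0, 66.0],
    [75.0, 83.0, 69.0, 90.0],
])

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)  # standardize each benchmark
_, _, vt = np.linalg.svd(z, full_matrices=False)         # principal axes of the score matrix
capability = z @ vt[0]                                   # projection onto the first component
print(capability)  # one aggregate capability number per model
```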
Excellent article! This is exactly the kind of analysis I want to see, and I'm glad to be a paid subscriber. For a while now I've felt sure that the big LLM vendors have been gaming the system by training on benchmark data, and your article further supports that belief.
Thanks for this excellent article; the overfitting example about whether or not it's safe to drive with a broken light in particular made me laugh out loud.
The more serious point you raise, though, is reliability. OpenAI have published a few papers recently in which they make serious claims that have not been independently peer-reviewed. Without that independent peer review, how do we know whether what they're saying is reliable, or indeed valid? We don't. And I worry this is potentially the new direction for many of these companies: just publish whitepapers or preprints and say "there, that's enough", when we know it isn't.
Model benchmarking has become an object lesson in Goodhart's law. We've basically optimized 'better' into semantic collapse.
Very true. We need to measure the *cost of performance* as well. On the other hand, though, you only cherry-picked benchmarks that support your point. This year, LLMs were tested by the very organizers of various competitions on *new* or *private* math problems that require cleverness and originality: Frontier Math, the International Mathematical Olympiad, the International Collegiate Programming Contest, the Miklós Schweitzer competition. The models are obtaining elite results; they are not copying their training data. Many mathematicians are now using LLMs to work on problems from the Erdős problems database. Just search the internet discussions.
So yes, like almost all humans, LLMs cannot do arithmetic "in their heads", since they are trained on language. And they make logical mistakes. But just as often they now come up with clever ideas and solve very difficult problems. There is clearly a huge advance going on, even though we agree that many benchmarks are worthless.
Oh, Frontier Math, that's the one OpenAI funded: they were setting records on it, and then it suddenly came out that they had paid for the benchmark. That is a good one! I should have included it too.
https://the-decoder.com/openai-quietly-funded-independent-math-benchmark-before-setting-record-with-o3/
I chose the GPT-5.2 benchmarks that are making the most waves in the news and on social media right now. Which other benchmarks that GPT-5.2 is beating should I look at?
Gemini obtains great results on Frontier Math as well. Did Google also fund Epoch AI? Of course not, and the dataset is private. All these models are really good. You should look at all the competitions I mentioned. Also look at matharena.ai: they regularly test models and flag possible data contamination. For GPT-5.2 they report Miklós Schweitzer competition results (90%). And mark my words: AIME 2026 will be a perfect score, or nearly, for all the main models. It's too easy for them now.