GPT-5.2 and Meaningless Benchmarks
Why ARC-AGI-2, AIME, and GDPval don’t measure real capability
Yesterday (December 11, 2025), OpenAI dropped another model: GPT-5.2. It’s presented as their most advanced model yet, with state-of-the-art results across several important benchmarks. It totally outperforms Gemini 3.0 and Opus 4.5. Wow! What an achievement!
Now imagine we’re back in 2015, when deep learning was just starting to gain momentum. A group of NLP researchers sees these benchmark numbers, is impressed, and concludes the models have finally solved the most complex challenges out there. They’re excited to see the architecture behind this breakthrough, so important for humanity. Nope: not allowed. It’s proprietary, and nothing is disclosed about what’s inside.
Ok, well, can we at least see the training data, to make sure the model hasn’t seen the benchmarks before? Absolutely not. The researchers wonder: is this a joke? Did someone mistake a Reddit troll post for actual research? Those numbers are meaningless without reproducibility and transparency. Nope, in 2025, that’s the way we do “science”.
Want to support this newsletter?
Vote in our LLM-built website competition and join the course as a paid subscriber!
No hype. Limitations explained. Deliverables defined. All code published.
A small anecdote from the past
Let me tell you a story from my days working for a large automotive company, building chatbots. It was around 2019. One project involved an external partner tasked with building a task-oriented bot. Back then, these bots were trained on hypothetical questions to detect intent, and the answer was predefined and hard-coded based on the detected intent. My job was mostly making sure the contractor did their work properly.
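To picture the setup: such a bot was essentially an intent classifier bolted onto a lookup table. Here is a minimal sketch, with made-up intents, keywords, and replies:

```python
# Minimal sketch of a 2019-style task-oriented bot: a classifier picks
# an intent, and the reply is hard-coded per intent. Everything here
# (intents, keywords, replies) is hypothetical.
RESPONSES = {
    "tire_change": "You can book a tire change at your nearest dealership.",
    "fallback": "Sorry, I didn't get that.",
}

def classify_intent(utterance: str) -> str:
    # Stand-in for the trained classifier; the real one was an ML model
    # trained on hypothetical user utterances.
    return "tire_change" if "tire" in utterance.lower() else "fallback"

def answer(utterance: str) -> str:
    return RESPONSES[classify_intent(utterance)]

print(answer("I want to exchange tires"))  # -> the tire-change reply
```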
The external developers generated a bunch of potential utterances and, at my request, split them into train, validation, and test. Proudly, they presented the results: 99.9% intent classification accuracy. The business owner was thrilled: “I tried the bot last week and it wasn’t able to answer a single question, and now it’s almost perfect.”
“We made sure test and train data are separated,” said the external project manager.
I smell BS. I open their training and test data, and what do I see?
Train: “I want to exchange tires.”
Test: “I want to exchange tires”
The only difference was the full stop at the end. Their results weren’t accepted, and they were sent back to do a proper evaluation.
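A leakage check that would have caught this fits in a dozen lines. The two utterances are from the example above; everything else is a sketch:

```python
import string

def normalize(utterance: str) -> str:
    # Lowercase and strip punctuation so trivial variants collapse.
    table = str.maketrans("", "", string.punctuation)
    return utterance.lower().translate(table).strip()

train = {"I want to exchange tires."}
test = {"I want to exchange tires"}

# Any overlap after normalization means the test set is contaminated.
leaked = {normalize(t) for t in train} & {normalize(t) for t in test}
print(leaked)  # {'i want to exchange tires'} -> contaminated
```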
Machine learning approaches based on distributional semantics have always been good at picking up patterns. Those patterns aren’t always semantic: they can be punctuation, word order, sentence length, or some frequent preposition that shows up in context but is meaningless for the actual intent.
That project kept going, and it wasn’t easy for them. At some point, we had a pretty funny overfitting case. They added to the training data a bunch of sentences starting with “Is it ok?”, for example: “Is it ok to close the window while the process is still running?” The problem was that they didn’t add any negative examples, and the answer to all those questions was always “it is ok to do it.”
I remember raining on their parade again when I asked whether it was ok to drive with broken headlights at night. The model had learned: if the question starts with “Is it ok?”, then agree.
Luckily, back then you could always demand to see all the receipts, refuse to extend funding, or demand improvements when the work wasn’t done properly.
Nowadays, companies get billions for showing unvalidated numbers. In my example, the external developers weren’t trying to cheat us; those were honest mistakes. If they had been trying to cheat, that would have been fraud, and we would never have done business with them again.
Can we say the same about the current situation? I doubt it.
But aren’t modern models completely different from what you used back then? No, not really. Back then we fine-tuned transformer classifiers like BERT, and today’s LLMs belong to the same transformer architecture family. What has changed is that modern, massive transformers can pick up far more complex patterns than “Is it ok?” at the beginning of a sentence.
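For reference, the 2019-era recipe looks roughly like this in today’s Hugging Face API; it’s a sketch, and the number of intent labels is made up:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Roughly the 2019 recipe: take a pretrained BERT encoder and fine-tune
# a small classification head on labeled utterances. num_labels is a
# made-up number of intents.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=5)

batch = tokenizer(["I want to exchange tires"], return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # predicted intent id (head untrained here)
```

The classification head is tiny; all the pattern-matching power sits in the pretrained encoder, the same family of machinery that now powers GPT-5.2.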
Now let’s take a look at the spectacular GPT-5.2 benchmarks. There are a whole lot of them, but I’ll choose three that made the most waves: 100% on AIME 2025 (no tools), 52.9% on ARC-AGI-2 (Verified) for GPT-5.2 Thinking (54.2% for GPT-5.2 Pro), and GDPval (wins or ties).
If you like this newsletter, consider upgrading your subscription or visiting the shop
Check out our website:
https://www.airealist.org/
Paid subscribers get:
Priority answers to your messages within 48 hours
Free participation in the online training “How to build a website with AI” on December 16
Founding members receive even more:
A 45-minute one-on-one call with me
High-priority personal chat where I reply to your questions within 24 hours
Support independent research and AI opinions that don’t follow the hype.
— or check out the anti-hype shop
GPT-5.2 AND ARC-AGI-2
The benchmark presents a series of puzzles. On the right, you see examples of input and output, and on the left you see the test you need to solve. For example, here you need to colour the shapes based on the number of enclosed empty cells within them.
This is a good task for pattern learning, because the patterns can indeed be very complex here. ARC-AGI-2 keeps the actual test sets private, but there are plenty of puzzles online and a public ARC-AGI-2 training and evaluation set.
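To make “pattern” concrete: the rule in the example above (colour by the number of enclosed empty cells) is a small deterministic algorithm. Here is a toy sketch of the counting step; the grid encoding is my own simplification, not the benchmark’s actual format:

```python
from collections import deque

def enclosed_empty_cells(grid):
    # Count empty cells (0) not reachable from the border: a flood fill
    # from the outside marks everything "open"; the rest is enclosed.
    h, w = len(grid), len(grid[0])
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if grid[r][c] == 0 and (r in (0, h - 1) or c in (0, w - 1)))
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    total_empty = sum(row.count(0) for row in grid)
    return total_empty - len(seen)

# A 5x5 ring of 1s with a single enclosed empty cell in the middle.
grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(enclosed_empty_cells(grid))  # 1
```

A classic flood fill solves this one; the open question is whether a model scoring well has learned such procedures or absorbed pattern statistics from the public training set.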
So can one fine-tune a model for high performance on this specific task? Absolutely.
A group of Nvidia researchers just fine-tuned a 4B Qwen model to a 27.64% score on the benchmark. The fine-tuned model is called NVARC, and it sits right there on the leaderboard, just above gpt-5.2-medium.
Is a fine-tuned Qwen-4B truly near AGI? I highly doubt it.
If ARC-AGI-2 taught me one thing about GPT-5.2, it’s that the top results come from heavy test-time reasoning. ARC reports efficiency as dollars per task, with costs estimated from retail token pricing. The leaderboard also shows “reasoning system” trend lines, where higher reasoning settings generally trade higher cost for higher accuracy. The more tokens spent, the better the results.
For example: Gemini 3 Pro scores 31.1% at $0.811 per task. To beat it, GPT-5.2 Pro Medium spends $8.99 per task for 38.5%, and GPT-5.2 High spends $1.39 for 43.3%. Those dollars reflect only visible tokens; ARC does not count internal traces that never become tokens, so the true compute could be higher.
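A quick back-of-the-envelope on those quoted numbers makes the trade-off explicit:

```python
# Dollars per accuracy point, using the ARC-AGI-2 leaderboard numbers
# quoted above (cost per task in USD, accuracy in %).
runs = {
    "Gemini 3 Pro":       (0.811, 31.1),
    "GPT-5.2 Pro Medium": (8.99, 38.5),
    "GPT-5.2 High":       (1.39, 43.3),
}
for name, (cost, acc) in runs.items():
    print(f"{name}: ${cost / acc:.3f} per accuracy point")
# Gemini 3 Pro: $0.026 per accuracy point
# GPT-5.2 Pro Medium: $0.233 per accuracy point
# GPT-5.2 High: $0.032 per accuracy point
```

Even in the best case, GPT-5.2 pays more per point of accuracy than Gemini 3 Pro; the Pro Medium tier pays roughly nine times more.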
Most likely there is no architectural breakthrough to read off this leaderboard alone; what is clearly visible is more compute spent at inference. To stay competitive, pricing remains near or slightly under Gemini 3 Pro’s, depending on the tier. That likely means GPT-5.2 runs at a big loss at the high-reasoning settings.
GPT-5.2 AND AIME
AIME is a benchmark that tests a model’s ability to solve highly complex mathematical tasks. It features 30 challenging problems from 2025 with integer answers (000-999), requiring precise calculation and mathematical insight across algebra, geometry, number theory, and combinatorics.
GPT-5.2 Thinking is reported at 100% on AIME 2025 (no tools, i.e. no Python interpreter or calculator). Here is an example of a problem:
And yet, in simple sanity checks, the model can still struggle with basic decimal subtraction.
But could OpenAI fine-tune their model on AIME 2025 to get 100%?
They don’t even need to. The questions and answers are all over the internet. These thirty questions are public; OpenAI could have simply trained on them, or fine-tuned an already-trained model on them if the problems were published after the knowledge cutoff.
Why are we even talking about this benchmark for a model that does not publish its training data? The only reason not to assume AIME 2025 was shown to the model beforehand is that OpenAI pinkie-promises it wasn’t.
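Contamination checks of this kind are standard, but they require reading the training corpus, which is exactly what we are not allowed to do. A minimal sketch; both strings below are invented for illustration:

```python
def ngrams(text: str, n: int = 8) -> set:
    # Word-level n-grams, a common proxy for verbatim memorization.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    # Flag the item if any long n-gram also appears in the training doc.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

# Hypothetical strings for illustration only.
problem = "Find the sum of all integer bases b greater than nine for which ..."
scraped = "AIME 2025 solutions: find the sum of all integer bases b greater than nine for which ..."
print(contaminated(problem, scraped))  # True -> the item likely leaked
```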
GPT-5.2 AND GDPVAL
A lot of people claim this is the most meaningful benchmark because it targets real-world tasks developed by industry professionals. It “covers 1,320 tasks across 44 occupations, sourced to cover the majority of Work Activities tracked by O*NET for each occupation.”
The grading can be done either by human evaluators:
“To grade the 220 open-sourced gold subset, we conducted blinded expert pairwise comparisons, where experts in the relevant occupation were presented with a request and reference files and asked to rank two or more unlabeled work deliverables.”
Or by a model trained to replicate expert judgments. In the GDPval paper, OpenAI describes an experimental automated grader for pairwise comparisons.
It works in a similar pairwise-comparison manner to LMArena, but with a different crowd: GDPval uses occupational experts for its gold set, while LMArena uses community votes.
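For reference, this is roughly how pairwise votes become a leaderboard. LMArena fits a Bradley-Terry model; an Elo-style update is the simpler classic stand-in:

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    # One Elo update: pairwise outcomes nudge two ratings toward
    # whatever ordering the votes imply.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models start equal; model A wins one comparison.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```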
Can a model be fine-tuned for this one?
Absolutely. The setup is practically inviting RL: once you have a scalable evaluator that approximates expert preferences, it can be used as a reward model.
We already know this is effective because of the Llama Drama in April. Meta got heat for putting a customized “experimental chat version” of Llama 4 Maverick on LMArena that was “optimized for conversationality” and “human preference.” That is exactly the kind of optimization that can move leaderboard scores without making the underlying model broadly better.
Eventually, if the automated grader is good enough, fine-tuning on this task becomes straightforward, because the neural evaluator effectively becomes the reward model.
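Here is a sketch of why. The grader below is a deliberately dumb stand-in that prefers longer deliverables, exactly the kind of spurious preference a policy can learn to exploit; every name in it is hypothetical:

```python
def automated_grader(task: str, a: str, b: str) -> int:
    # Stand-in for a neural grader trained on expert pairwise judgments.
    # This toy version prefers the longer deliverable: a spurious
    # preference, but the optimization loop cannot tell the difference.
    return 0 if len(a) >= len(b) else 1

def reward(task: str, candidate: str, reference: str) -> float:
    # Pairwise preference turned into a scalar reward for RL fine-tuning.
    return 1.0 if automated_grader(task, candidate, reference) == 0 else 0.0

task = "Draft a quarterly financial summary"
reference = "A concise, correct two-page summary."
candidates = ["A concise draft.", "A much, much longer draft " * 20]
print([reward(task, c, reference) for c in candidates])
# [0.0, 1.0] -> verbosity wins; actual quality is never measured
```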
The result is that the model gets optimized for those 44 occupations, and might not generalize as well outside that distribution.
And guess who created this benchmark? Yep, OpenAI.
All in all, OpenAI currently seems to be fighting for its life: ramping up compute, releasing suspicious benchmark results, and pausing other projects, even postponing its adult content. Time will tell how good this model truly is. To give it the benefit of the doubt, it might surprise us, but I’m not holding my breath.
Love the breakdown of each benchmark! The fact that models can be optimized to score higher makes a lot of the claims of "improved performance" questionable. Newer models may not necessarily be more capable. They may just be better at gaming the benchmarks.
Great article! Whether training on benchmark data corrupts a model’s output hinges on an assumption: is this process of learning the test answers and then taking the test a bad thing? More simply, are frontier models cheating? Our tendency to anthropomorphize here is evident: most answer with an emphatic yes (I’m inclined to do this too) and thus think the same for AI. When humans receive test answers in advance, they gain a clear edge, but not in the same way that an LLM would. For a human, rote memorization takes us from an input (the question) to the output (the answer). With an answer bank, it’s totally fair to assume that humans will skip the requisite reasoning stages before arriving at a correct answer. Can we extrapolate this to LLMs? Maybe, but it’s not as clear-cut. Unless training an LLM on AIME data creates a model that is de facto deterministic (i.e., given an input from this narrow dataset, it will always produce the memorized output; in that case, AIs are cheating the way humans cheat, which is no good), I don’t see an obvious issue with training on benchmark data. Am I right to view this as training with a study guide and then taking a test? I have no technical background, so go easy on me if I’m talking out of my ass lol