AI Realist

OpenAI’s Open-Weight Models: Overhyped & Shockingly Underperforming

The first evaluations of OpenAI’s open-weight models

Maria Sukhareva
Aug 07, 2025

OpenAI made waves yesterday by releasing their first open-weight models since GPT-2 back in 2019.

“You see now!” proclaimed the hypers. “They’re delivering! The models are here!”
But before we join the applause, let’s take a moment to actually test the model, evaluate its performance, and see how usable it truly is.

The landscape of small open-weight models has changed drastically in the past few years, especially thanks to Chinese researchers. We now have powerful contenders like Qwen-3, Xiaomi’s MiMo, Kimi-K2, and DeepSeek-R1 (with R2 just around the corner). So the real question is: Is OpenAI’s open-weight model a game changer? And more importantly: what is it actually good for in this already crowded ecosystem?

But before any evaluation can begin, let’s make sure everyone can run it easily. The simplest way to test the model is through Ollama. If you do not code, just download it from here:

https://ollama.com/download

After downloading it, you should be able to install it and see a simple UI with the two models, gpt-oss-120b and gpt-oss-20b.
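
If you would rather script your tests than click through the UI, Ollama also exposes a local REST API. Below is a minimal Python sketch; it assumes Ollama is running on its default port and that the 20B model is available under the tag gpt-oss:20b (adjust the name to whatever your installation lists).

```python
# Minimal sketch: send one prompt to a local gpt-oss model through Ollama's
# REST API. Assumes Ollama is running on its default port (11434) and that
# the 20B model was pulled under the tag "gpt-oss:20b" -- adjust the name to
# whatever your installation shows.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str, model: str = "gpt-oss:20b") -> str:
    """Send a single prompt and return the model's full reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

print(ask("In one sentence, what does the MMLU benchmark measure?"))
```

The same call works for gpt-oss-120b, provided you have a machine that can host it.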

You can follow along with this article and the benchmarks by asking questions directly to the model.

When I tried it, the model hallucinated beautifully. Unfortunately, I am not Elliot M. Glick, a venture capitalist, just a humble NLP enthusiast with limited resources, doing my best to bring you quality content.


Consider becoming a paid subscriber, so I can keep hallucinating less and testing more.

Or if you're unsure, check out my anti-hype shop - all income goes straight back into supporting this blog: paying for token consumption, model subscriptions, and all the other fun costs.
Every item comes with a free one-month subscription to the newsletter. Two items - two months. You get the idea.


We’ll take a look at the top benchmarks and run some tests in our own language to get a feel for how realistic those benchmarks actually are.
Sometimes, benchmark results conflict with users’ empirical observations, and that’s not surprising. Many benchmarks end up in the training data themselves, as discussed in this paper:

Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Or the providers simply train on the benchmarks to make sure the performance looks high: llama-drama


The model description

The OpenAI blog describes the models as follows:

These aren’t the smallest models, but they’re not behemoths either. The 20B variant can run locally on newer laptops like the M3 Mac, though the 120B will likely require remote deployment (e.g., Azure or similar). Compared to the highly portable 7B/8B Qwen or LLaMA models, they’re less practical for local use for now.

That said, it's only a matter of time before personal hardware catches up.

If you want me to talk more about expert routing, architecture details, or sparsity tricks, let me know in the comments.
For now, here’s the TL;DR:

20B: Local-friendly on modern high-end laptops
120B: Deployable, but not local, yet not absurdly large either

So let’s dive in—benchmark by benchmark, test by test.


1. General Knowledge and Language Understanding

1. MMLU (Massive Multitask Language Understanding)

Benchmark Description: 57 multiple-choice tasks across STEM, humanities, and professional fields. It measures a model’s broad knowledge and reasoning from high school to expert levels.

Importance: The MMLU standardizes comparison of AI model performance across diverse domains. It guides improvements by revealing model strengths and weaknesses for real-world applications.

Probability of Contamination: Contamination risk is moderate to high due to MMLU’s public sources overlapping with web-scraped training data. Models may memorize similar questions, potentially inflating scores by 5-10%.

Task Example:

Which of the following is a primary source of energy for Earth’s climate system?

A) Geothermal heat
B) Solar radiation
C) Tidal forces
D) Cosmic background radiation

(Correct answer: B)
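
To get a feel for how such benchmarks are scored, here is a rough sketch that sends the example question above to the local model and checks the letter it picks. The answer-with-one-letter instruction and the naive letter extraction are my own simplifications, not the official MMLU harness.

```python
# Rough sketch of scoring one MMLU-style question against the local model.
# This is a simplification, not the official MMLU harness: real evaluations
# use fixed prompt templates and often log-likelihood scoring over choices.
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

QUESTION = "Which of the following is a primary source of energy for Earth's climate system?"
CHOICES = {
    "A": "Geothermal heat",
    "B": "Solar radiation",
    "C": "Tidal forces",
    "D": "Cosmic background radiation",
}
GOLD = "B"

def ask(prompt: str, model: str = "gpt-oss:20b") -> str:
    payload = {"model": model, "prompt": prompt, "stream": False}
    r = requests.post(OLLAMA_URL, json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["response"]

prompt = (
    QUESTION + "\n"
    + "\n".join(f"{letter}) {text}" for letter, text in CHOICES.items())
    + "\nAnswer with a single letter: A, B, C, or D."
)
reply = ask(prompt)
match = re.search(r"\b([ABCD])\b", reply)  # naive letter extraction
predicted = match.group(1) if match else None
print("correct" if predicted == GOLD else f"wrong: expected {GOLD}, got {predicted}")
```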

Performance:

Reported by OpenAI [source]:

The reported performance is state-of-the-art and comparable to OpenAI’s proprietary models. Even better - it outperforms o3-mini.

Discrepancies in Community Evaluations:

The performance on English-centric tasks is good enough, comparable to proprietary models.
However, as the Bender Rule reminds us, you always need to specify which languages you’re claiming state of the art for.

When the community tested the model on multilingual benchmarks, the results were disappointing - performance dropped significantly. That said, some contradictory data points have emerged, so the full picture is still unfolding.

The German variant of the MMLU benchmark shows very good performance:

OpenAI’s own report on multilingual MMLU also shows a performance drop in non-English languages, but it’s nowhere near as dramatic as what the community is reporting.

Yet we should remember that dataset contamination is likely high for this benchmark, and a separate evaluation on German showed the following:

Let’s put it to the test using our Ollama UI and our own languages.
As a native Russian speaker, I’ll evaluate the model with Russian prompts, following the same analysis style as above.

For this, I used the set of creative prompts from my em-dash study:

The mystery of em‑dashes: part two with quantitative evidence (Maria Sukhareva, Jul 5)

I ran five creative prompts (translated into Russian with Grok-3) and then had o3 evaluate the grammar of the outputs. The results were as expected: the grammar quality was lacking, similar to what we saw in the German study.
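
For anyone who wants to reproduce this kind of spot check, the pipeline looks roughly like the sketch below: the local gpt-oss model generates the Russian text, and a stronger model grades the grammar. The exact prompt, the 1-to-10 rubric, and the use of o3 through the OpenAI Python client are my assumptions for illustration; any strong judge model you trust will do.

```python
# Sketch of the generate-then-judge loop: the local gpt-oss model writes in
# Russian, a stronger model grades the grammar. The prompt, the 1-10 rubric
# and the choice of o3 as judge are illustrative assumptions.
import requests
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

OLLAMA_URL = "http://localhost:11434/api/generate"
judge = OpenAI()

def generate_local(prompt: str, model: str = "gpt-oss:20b") -> str:
    payload = {"model": model, "prompt": prompt, "stream": False}
    r = requests.post(OLLAMA_URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["response"]

def grade_grammar(text: str) -> str:
    """Ask the judge model to rate Russian grammar on a 1-10 scale."""
    result = judge.chat.completions.create(
        model="o3",  # assumption: any strong judge model you trust works here
        messages=[{
            "role": "user",
            "content": "Rate the grammatical correctness of the following Russian text "
                       "on a scale from 1 to 10 and list the errors you find:\n\n" + text,
        }],
    )
    return result.choices[0].message.content

# Illustrative creative prompt: "Write a short story about a lighthouse keeper."
story = generate_local("Напиши короткий рассказ о смотрителе маяка.")
print(grade_grammar(story))
```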

This is by no means a representative study, just a small example of how you can manually test a model to get a feel for its behavior and draw your own observations. In many cases, this kind of hands-on evaluation can be more insightful than simply staring at benchmark scores.

For instance, one interesting detail I noticed: the model really loves em dashes.
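
If you want to quantify that impression instead of eyeballing it, a few lines of Python are enough; the sample text below is made up, so swap in the model’s actual outputs.

```python
# A crude but handy metric: em dashes per 1,000 characters of model output.
# The sample string is made up; feed in the model's actual responses.
EM_DASH = "\u2014"  # the em dash character

def em_dash_rate(text: str) -> float:
    """Return em dashes per 1,000 characters (0.0 for empty text)."""
    return 1000 * text.count(EM_DASH) / len(text) if text else 0.0

sample = "Маяк — одинокий страж — мигал в тумане."  # "The lighthouse, a lonely sentinel, blinked in the fog."
print(f"{em_dash_rate(sample):.1f} em dashes per 1,000 characters")
```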

Here is the chat with the evaluation.

Of course, programmers can do this much more easily using API calls—but here, we’re keeping it simple with exercises anyone can try.

There are consistent reports from the community pointing to poor performance on Russian multi-task processing.

There are also reports that the model is practically unusable for Finnish.

Conclusion for general tasks:

It does appear that OpenAI’s open-weight models were heavily optimized for English, particularly for English-language benchmarks. This likely means their performance in other languages will be underwhelming. While we’ll have to wait for more thorough evaluations in the coming days, early signs suggest there are stronger multilingual models available, especially ones that can be run locally. Until then, you can easily try the models yourself and form your own opinion - that is always better than trusting benchmarks.


2. Coding Performance

Coding performance follows the same pattern: impressive benchmarks, but underwhelming results according to community feedback.


The rest of this post is paywalled and dives deeper into the usability of these models for coding, as well as the abundance of hallucinations in their outputs.

It includes a look at their abysmal coding performance and some hallucination nightmares, with a few fun (and painful) examples.

If you're a student and can’t afford access, feel free to drop me a message - I'll make sure you get in.

This post is for paid subscribers.

© 2025 Maria Sukhareva