🦋 On LLM Popcorn, Collecting Butterflies, and the Sad State of LLM Research
Why We Must Answer Why, Not Just What
The panel discussion with Eduard Hovy, Mirella Lapata, Yue Zhang, and Dan Roth started with a very harsh but incredibly necessary speech by Eduard Hovy.
His speech was interrupted by applause multiple times. The state of current LLM research is indeed disturbing.
We see a flood of papers published on mere observations:
LLMs do this, LLMs do that.
They benefit from synthetic datasets that are diverse but factually incorrect.
They exhibit strong sycophantic tendencies—agreeing with the user regardless of correctness.
They still randomly produce toxic content and hate speech.
They show signs of alignment collapse, where they simply refuse to answer because that’s the safest route.
Much of this work focuses on the what, which is important and useful for applications, but ignores the why.
As Eduard Hovy put it: this is LLM popcorn.
Just like popcorn fills you up briefly and leaves you hungry again an hour later, this research gives you something to chew on, but no lasting insight. You keep reading paper after paper hoping for an answer, but you’re left intellectually starving.
He also called this trend butterfly collecting — a collection of pretty, dead observations, like a collector amassing lifeless butterflies without any understanding of their behavior, ecosystem, or function. That’s what we are seeing in today’s LLM research: a growing museum of outputs, disconnected from any meaningful explanation.
We can keep sitting here for the next 30 years, pinning butterflies and logging observations. Or we can change the paradigm. We can start asking why.
Instead of only testing whether LLMs generalize, we need to understand why they do or don’t. We should stop obsessing over the outputs and start investigating the mechanisms behind them.
Disagreement on the Panel
Mirella Lapata disagreed. In her view, generalization isn’t the goal at all. If a model “knows” everything, what matters is whether it can adapt and learn on the fly.
She argued that learning from instructions won’t get us there. A child doesn’t learn to play the violin from reading a manual. Similarly, LLMs won’t become adaptive or creative by reading more annotated datasets. We need to observe how humans, and especially children, learn and adapt, and bring those insights into AI.
Yue Zhang made a crucial point in response: to be adaptive, you must first generalize. And he was optimistic that LLMs can do this to some extent.
Dan Roth: LLMs Generalize Semantically, Not Structurally
Dan Roth brought a more pragmatic perspective.
He told a story about finding his own biography online. It was full of awards… which he had never received. Those awards belonged to Eduard Hovy. The model generalized beautifully here!
What the model did was classic behavior: semantic generalization. Two well-known NLP researchers with overlapping research areas. The model blurred the distinction and attributed Hovy’s achievements to Roth.
This is what current LLMs excel at: pattern-matching based on surface-level similarity.
But what they cannot do is understand the structure of problems.
He gave the example of multiplication: it’s extremely hard for models to multiply five-digit numbers. If they eventually learn, bump it up to seven digits and they fail again. There is no real abstraction, no understanding of the underlying operation.
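To make the kind of probe Roth describes concrete, here is a minimal sketch of how one might measure it. The `ask_model` helper is a hypothetical placeholder for whatever LLM API you use; the exact-match scoring and the trial count are my assumptions, not anything the panel specified.

```python
# Hypothetical probe for n-digit multiplication, as described above.
# ask_model() is a placeholder for whatever LLM API you actually call.
import random
import re


def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to your model and return its raw reply."""
    raise NotImplementedError("wire this up to an LLM of your choice")


def multiplication_accuracy(n_digits: int, trials: int = 100) -> float:
    """Exact-match accuracy on random n-digit x n-digit multiplications."""
    correct = 0
    for _ in range(trials):
        low, high = 10 ** (n_digits - 1), 10 ** n_digits - 1
        a, b = random.randint(low, high), random.randint(low, high)
        reply = ask_model(f"What is {a} * {b}? Answer with the number only.")
        answer = re.sub(r"[^\d]", "", reply)  # strip commas, spaces, prose
        correct += int(answer == str(a * b))
    return correct / trials


# The pattern Roth points at: once 5 digits look solved, try 7.
# print(multiplication_accuracy(5), multiplication_accuracy(7))
```

If the model had actually abstracted the carrying algorithm, accuracy would not collapse the moment the digit count grows.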
That’s where all the excitement around agentic AI comes from.
Models aren’t doing the task. They’re just being trained to pick which external tool should do the task. And that’s easier than teaching the model to actually understand or solve the task.
Agentic AI has potential. But let’s not pretend it’s a step toward AGI. It’s a bandaid to compensate for model limitations.
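To illustrate the routing point, here is a toy sketch. It is not a real agent framework, and `choose_tool` stands in for the model’s decision; the only thing it shows is where the actual competence lives.

```python
# Toy sketch of "agentic" delegation: the model only picks the tool,
# the tool does the actual work. choose_tool() is a stand-in for the
# LLM's routing decision; nothing here calls a real model.
from typing import Callable, Dict


def calculator(task: str) -> str:
    """Deterministic external tool: the thing that actually gets it right."""
    a, b = task.split("*")
    return str(int(a) * int(b))


def choose_tool(task: str) -> str:
    """Stand-in for the model: decide who should handle the task."""
    return "calculator" if "*" in task else "model"


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}


def agent(task: str) -> str:
    tool = choose_tool(task)
    if tool in TOOLS:
        return TOOLS[tool](task)  # the tool, not the model, solves it
    return "model answers directly (and may well be wrong)"


print(agent("40321 * 59817"))  # correct, but only because Python did the math
```

Routing like this is genuinely useful. But notice where the correctness comes from: the calculator, not the model, which is exactly Roth’s point.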
But Humans Make Mistakes Too…
And here came one of the best moments.
A lot of LLM evangelists say: “But humans make mistakes too!”
Dan Roth responded instantly:
“Sure. I make mistakes. But would you use a calculator that was correct only 70% of the time?”
Exactly.
The Limits of Transformers
Eduard Hovy said: let’s build LLMs that can adapt, do causal reasoning, and handle complex tasks autonomously.
The other panelists immediately pointed to evidence that transformers are fundamentally incapable of that.
One of ACL 2024’s best papers was titled Mission: Impossible Language Models.
The paper digs into what transformer language models can and cannot learn, and the panelists read it as strong evidence that certain capabilities are simply out of reach for the architecture.
We Need to Stop Collecting Butterflies
All in all, we need to shift the research paradigm. We need to stop churning out popcorn papers and pretending they feed our understanding. We need to stop running around like tech bros who are scared of their own inventions, mumbling about how LLMs are too complex, too mysterious, too dangerous to understand.
Would you drive a car if no one could explain why it moves when you press the pedal?
Would you use your phone if nobody could tell you why it can connect to someone across the globe?
Then why are we okay with “we just don’t know how LLMs work”?
We should not settle for that. We can do better than butterfly collecting.
We need real science.