AI after the GPT-5 flop: World models, SSMs, and the long road to AGI
What lies beyond Transformers?
In 2013, when I was a research assistant at Goethe University Frankfurt, my professor told me to look into word embeddings. The first thing I did was google what “embedding” means, then google “word embedding.” It took time to understand what was going on. Then we moved from word embeddings to encoder–decoder LSTMs with attention:
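In modern PyTorch terms, that attention add-on looked roughly like the sketch below; the class name and shapes are my own illustration of Bahdanau-style additive attention, not code from any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Bahdanau-style additive attention: the decoder scores every encoder
# state against its own state and takes a weighted sum as "context."
class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
        scores = self.v(torch.tanh(
            self.W_enc(encoder_states) + self.W_dec(decoder_state).unsqueeze(1)
        )).squeeze(-1)                        # (batch, src_len)
        weights = F.softmax(scores, dim=-1)   # one weight per source position
        context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)
        return context, weights               # context feeds the next decoder step
```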
And just when we figured that part out, attention became “all you need,” recurrence was declared unnecessary, and we all learned what BERT is: a bidirectional Transformer trained with masked language modeling. A few years later I remember arguing with a salesperson from Microsoft that they should stop pretending GPT-2 was a super-secret, dangerous technology, since similar autoregressive Transformers would be released as open source within half a year anyway.
OpenAI did in fact stage GPT-2’s release, publishing the full model in late 2019. Then came GPT-3 (2020), EleutherAI’s open GPT-Neo/J line (2021), and finally InstructGPT (2022), which showed exceptional fluency and pushed the state of the art.
Now we have GPT-5, and I personally think we’re near the limits of what Transformers alone can deliver in natural language. Which means we’re back where we were with word embeddings: searching for the next thing. If you’d asked me eight years ago about the next breakthrough, I would have told you about LSTMs, vanishing gradients, and the sequential nature of language. Attention originally arrived as an add-on to RNNs, so keeping an eye on LSTMs back then wasn’t a bad bet. Today I’ll briefly outline a few trends beyond Transformers, skipping incremental stuff like “agents wrapped around the same LLM” or “just smaller LLMs.”
Please don’t expect a slogan like “Transformers are dead, SSMs are the new SOTA.” No. Transformers are the current state of the art. What follows are rising trends that may or may not become the next SOTA.
The “World Models” Hype
“World model” is not a single architecture. It is a goal: learn a representation of the environment’s state and its dynamics so you can predict what happens next and often plan actions. In practice, a “world model” usually means a learned state transition model (and sometimes reward/termination heads), not a particular network family.
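To make that concrete, here is a minimal sketch of the usual recipe: an encoder maps observations into a latent state, a dynamics network predicts the next latent state from (state, action), and small heads predict reward and termination. Everything here, from the name LatentWorldModel to the layer sizes, is my own illustrative choice, not any particular published architecture.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Illustrative world model: encode, predict dynamics, decode reward/done."""

    def __init__(self, obs_dim: int, action_dim: int, latent_dim: int = 128):
        super().__init__()
        # encoder: raw observation -> latent state
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # dynamics: (latent state, action) -> next latent state
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # optional heads: predicted reward and episode termination
        self.reward_head = nn.Linear(latent_dim, 1)
        self.done_head = nn.Linear(latent_dim, 1)

    def forward(self, obs, action):
        z = self.encoder(obs)
        z_next = self.dynamics(torch.cat([z, action], dim=-1))
        return z_next, self.reward_head(z_next), self.done_head(z_next)

# "Planning" is then just rolling the dynamics forward in latent space,
# without ever touching the real environment:
model = LatentWorldModel(obs_dim=16, action_dim=4)
obs, action = torch.randn(1, 16), torch.randn(1, 4)
z_next, reward, done_logit = model(obs, action)
```

The point is the interface rather than the layers: anything that maps (state, action) to a predicted next state, and lets you roll that prediction forward, fits the definition.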
For example, Sora is a world model because it aims to capture the physics of the world and generate what that “world” looks like. OpenAI’s website says about Sora:
We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.
However, current video generators still fall short at reproducing physics. This is one of my early tests with Sora, where I tried to animate my cats. You can see that it is not exactly the best generation of the “world” in terms of feline physics.
World models are frequently built as video models and often use Transformers under the hood, but not exclusively: