The hypothesis:
RAG does not add new knowledge to a model; it merely extracts knowledge that is already embedded in it. I'd love to see this tested rigorously, but in the meantime, here's why I think it's true.
What is “grounding with RAG”?
RAG pipelines add an external search layer to an LLM, and that search engine operates independently of the model itself. The pipeline has three steps (a minimal code sketch follows the list):
Document retrieval – An independent search system (vector, BM25, etc.) fetches snippets that appear relevant.
Context injection – The retrieved snippets are concatenated into the prompt as “context” and sent to the LLM.
Answer generation – The LLM reads the combined prompt and produces an answer.
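To make the three steps concrete, here is a minimal sketch. The keyword-overlap retriever, the document list, and the model name are placeholders of mine rather than anyone's production setup; the point is that "grounding" amounts to string concatenation before a single model call.

```python
# Minimal RAG sketch: retrieve, inject, generate. The keyword-overlap retriever,
# the document list, and the model name are placeholders; real systems use
# vector search or BM25, but the overall structure is the same.
from openai import OpenAI

DOCS = [
    "Our Q3 refund policy allows returns within 30 days of purchase.",
    "The warehouse in Austin ships orders placed before 2 p.m. the same day.",
    "Premium subscribers get priority support via the #help channel.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 1 - document retrieval: rank snippets by naive keyword overlap."""
    q_terms = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(q_terms & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    snippets = retrieve(query)
    # Step 2 - context injection: the snippets are simply concatenated into the prompt.
    prompt = (
        "Answer using only the context below.\n\n"
        "Context:\n" + "\n".join(f"- {s}" for s in snippets) +
        f"\n\nQuestion: {query}"
    )
    # Step 3 - answer generation: the model reads the longer prompt and responds.
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("What is the refund policy?"))
```

Nothing in this flow updates the model's weights; the only thing that changes between a plain query and a "grounded" one is the length of the prompt.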
Vendors often promote this as a way to “teach” the model new information on the fly, but no learning occurs. The model simply receives a longer prompt, paraphrases the supplied text, and returns the result.
RAG as a way to bring new knowledge to an LLM
RAG, and especially the snippets it retrieves, is often advertised as a method to teach a language model new information. In reality, nothing is being taught: the LLM simply treats the injected snippets as part of the prompt, paraphrases them, and returns the result.
Expectation: RAG injects brand-new knowledge.
Reality: the model paraphrases the snippets that search engine supplied.
The core sales pitch rests on the assumption that an LLM is smart enough to inspect your data, think, analyze, and produce high-quality output. The problem starts when mere paraphrasing is not enough and the model has to draw logical conclusions from what it has seen. Because the data is brand new, it cannot rely on anything it has memorized, so it searches its internal representations for similar knowledge and produces hallucinations.
I hypothesise that RAG does not add new knowledge to the model; it simply extracts the knowledge already embedded within it.
RAG bots, especially industrial ones, are known to be extremely prone to hallucinations. The paper "CRAG: Comprehensive RAG Benchmark" showed that RAG still suffers from very poor factual accuracy and is riddled with hallucinations.
Throughout my career, I have developed many in-house bots that relied on information the model did not already know, and those bots were a nightmare of hallucinations. More recently, I've noticed that the bots I built for my Substack, e.g. the series on the best papers for AI realists, perform excellently compared with the domain-specific bots.
This prompted me to search the literature for studies that might support my hypothesis.
It has already been shown that LLMs rely heavily on their internal knowledge, and simple prompting à la “only use the provided context” does not stop them from tapping into it. If we treat prompts as queries that extract that internal knowledge, the behaviour makes sense. When prompted, we do not make the model think, reason, or invent new ideas; we merely cause it to search its memory for a suitable answer and paraphrase it to match the prompt.
Thus, a model cannot be prohibited from using its internal knowledge, because that knowledge is the only thing it uses, and this aligns with the results of the following study:
What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models
If a model already knows a topic well, adding extra context can introduce noise and lead to incorrect answers; by contrast, for low-popularity questions, RAG often improves responses. This is exactly how a knowledge extractor behaves: it steers the model toward the most relevant memories:
When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
The other paper “Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models” argues that:
“We find that stronger RALMs (retrieval augmented language models) emerge with the Dunning-Kruger effect, persistently favoring their faulty internal memory even when correct evidence is provided. Besides, RALMs exhibit an availability bias towards common knowledge”
The behaviour described further in the paper is consistent with how a knowledge extractor would behave:
Confirmation bias:
We find RALMs exhibit confirmation bias, being more inclined to choose evidence that is consistent with their own internal memory, regardless of whether it is correct or incorrect
In this context, “confirmation bias” could be an anthropomorphic term for a system that returns a false positive to a search query.
Availability bias:
We show that RALMs exhibit an availability bias towards common knowledge, preferring knowledge that is easily accessible in memory.
Availability bias is another anthropomorphic term; in effect it says that answers are better when the prompt actually manages to retrieve them from the model's memory.
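A cheap way to observe this pull toward internal memory is to hand the model a retrieved snippet that contradicts a fact it knows extremely well and watch which source wins. A minimal sketch, assuming the OpenAI chat API; the snippet, question, and model name are placeholders:

```python
# Knowledge-conflict probe in the spirit of the studies above: give the model
# retrieved "evidence" that contradicts a fact it already knows well and see
# whether it follows the snippet or its internal memory. The snippet, question,
# and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

conflicting_snippet = "According to the 2024 city registry, the Eiffel Tower is located in Rome."
question = "In which city is the Eiffel Tower located?"

prompt = (
    "Use ONLY the context below. Do not rely on prior knowledge.\n\n"
    f"Context: {conflicting_snippet}\n\n"
    f"Question: {question}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
reply = resp.choices[0].message.content or ""
print(reply)

# If the extraction view is right, popular facts like this one will often be
# answered from memory ("Paris") despite the instruction, while obscure facts
# will follow the snippet -- matching the popularity effect reported above.
print("followed context" if "rome" in reply.lower() else "followed internal memory")
```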
Unlike other information extraction methods, combining retrieval with a language model produces an interesting effect. An LLM must always produce an answer; it cannot simply return a “not found” error, so it generates a response based on whatever information it has retrieved.
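To see the contrast, compare an ordinary search layer with the generation step. A classical retriever can legitimately return nothing; the toy overlap threshold and document below are my own placeholders, but they show an option the model call that follows in a RAG pipeline simply does not have.

```python
# A classical search layer can abstain when nothing matches; the generation
# step cannot. The overlap threshold and the single document are placeholders.

def strict_retrieve(query: str, docs: list[str], min_overlap: int = 1) -> list[str]:
    q_terms = set(query.lower().split())
    return [d for d in docs if len(q_terms & set(d.lower().split())) >= min_overlap]

docs = ["Invoices are archived for seven years in the Frankfurt data centre."]
print(strict_retrieve("quarterly hyperdrive maintenance schedule", docs))  # [] -- an honest "not found"

# The model call that follows in a RAG pipeline has no empty-result equivalent:
# given an empty or irrelevant context, it still returns fluent text assembled
# from whatever related material sits in its internal memory.
```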
In conclusion, if LLMs truly lack reasoning, common sense, and the ability to invent new ideas or think logically, then context engineering and prompt-based methods will never substantially improve their performance; meaningful gains will come only from supplying more data during pre-training.
Practical implications if the hypothesis is true:
Unreliable on novel context: RAG bots will remain unreliable whenever they are augmented with context entirely new to the LLM.
Best on familiar data: These bots will work best on material already well represented in the model’s training set e.g., public scientific papers.
Limits of context engineering: The current buzz around “context engineering” will not solve the problem; it will simply magnify RAG’s inherent limitations.
Proposed method to test the hypothesis:
We take a model with a fixed knowledge cut-off, e.g. GPT-3.5-Turbo (April 2023), and evaluate three system variants on two matched question sets (≈ 500 items each):
Canonical set: answers drawn from well-cited papers published before the cut-off.
Novel set: answers drawn from papers published after the cut-off (e.g. NeurIPS 2024, ACL 2025).
For each set we run Zero-shot, RAG-Canonical, and RAG-Novel (all sharing the same prompt template) and measure exact-match accuracy plus hallucination rate, using paired significance tests. The hypothesis predicts RAG-Canonical ≫ Zero-shot, while RAG-Novel ≈ Zero-shot (or worse) with more hallucinations; any statistically significant lift for RAG-Novel would falsify the “extraction-only” view.
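A sketch of the comparison harness under these assumptions: the system functions and the question-set format are placeholders, exact match stands in for whatever answer-matching rule is used, and hallucination-rate scoring (which needs a human or LLM judge) is left out. The paired test is an exact McNemar test on discordant pairs.

```python
# Sketch of the comparison harness. The system functions, the question-set
# format, and exact match as the scoring rule are assumptions for illustration;
# hallucination-rate scoring would need a separate (human or LLM) judge.
from scipy.stats import binomtest

def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(questions, system_a, system_b):
    """Paired comparison of two systems; exact McNemar test on discordant pairs."""
    both = a_only = b_only = 0
    for q in questions:                      # each q: {"question": ..., "answer": ...}
        a_ok = exact_match(system_a(q["question"]), q["answer"])
        b_ok = exact_match(system_b(q["question"]), q["answer"])
        if a_ok and b_ok:
            both += 1
        elif a_ok:
            a_only += 1
        elif b_ok:
            b_only += 1
    n = len(questions)
    acc_a = (both + a_only) / n
    acc_b = (both + b_only) / n
    # Under H0 (no difference between systems) the discordant pairs split 50/50.
    discordant = a_only + b_only
    p_value = binomtest(a_only, discordant, 0.5).pvalue if discordant else 1.0
    return acc_a, acc_b, p_value

# Predicted outcome under the extraction-only view (names are placeholders):
#   evaluate(canonical_set, rag_canonical, zero_shot)  -> large lift, small p-value
#   evaluate(novel_set,     rag_novel,     zero_shot)  -> no lift (or worse)
```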
Previous studies:
The mystery of em‑dashes: part two with quantitative evidence
Grok 4: A Good Expensive Model