ChatGPT has not always been overloaded with em dashes. To prove it, I ran 10 queries asking GPT‑3.5, GPT‑4o, and GPT‑4.1 to generate 10 stories of at least 10 sentences each. GPT‑3.5 produced 0 em dashes, GPT‑4.1 used 14, and GPT‑4o 16.
There are two obvious sources for those extra dashes: (a) pretraining, where the base model is trained on a massive text corpus, and (b) RLHF fine‑tuning. Google’s Ngram Viewer shows that em dashes peaked in popularity during the 20th‑century print era; they are common in polished fiction, legal writing, and older scientific articles. It is therefore plausible that the GPT‑4 models simply saw more em-dash‑heavy book data during pretraining than GPT‑3.5 did.
But there is also a purely mechanical incentive:
In the tokenizer used by GPT‑4 the sequence “ —” (leading space + em dash) is one token, whereas spelling the join out costs more: a comma plus “ and” is two tokens, and a connective like “; however,” is three.
Fewer tokens mean cheaper inference and a lower training loss per token, and therefore a higher reward during RLHF.
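A quick way to sanity-check the token arithmetic is OpenAI’s tiktoken library. A minimal sketch, assuming the cl100k_base vocabulary (other tokenizers can split these strings differently):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
for joiner in [" —", ", and", "; however,"]:
    ids = enc.encode(joiner)
    print(f"{joiner!r}: {len(ids)} token(s) -> {ids}")
```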
Because the RLHF reward model implicitly favors concise, “smooth” prose, the em dash becomes the cheapest, safest way to glue clauses together. Over billions of training steps that micro‑advantage snowballs into a stylistic tic.
The curse spread quickly: Gemini, Claude, and Mistral all dash more than their predecessors. That could mean they train on similar book‑heavy corpora, that they are ingesting AI‑generated web text already contaminated with dashes, or both.
Given the volume of new AI text being published every day, any future crawl will be even more dash‑saturated, so unlearning the habit will be difficult.
Sidebar: A crash‑course in tokens
What even is a token?
Think of a token as the smallest chunk of text the model "sees" at once. It can be a full word ("elephant"), a sub‑word ("phan"), a piece of punctuation, or even a leading space. GPT‑style tokenizers are built with byte‑pair encoding (BPE) or the newer SentencePiece/Unigram approach: they greedily merge the most common byte sequences in a giant text corpus until they hit a target vocabulary size (e.g., ~100k entries for GPT‑4’s cl100k_base).
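To see the merging logic in miniature, here is a toy BPE loop in Python. It is purely didactic: real training runs over billions of bytes and stores the learned merges, and the sample string and the six-merge budget are arbitrary choices of mine:

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
# Didactic only; real BPE training runs over a huge byte corpus.
from collections import Counter

def merge(tokens, pair):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

symbols = list("she sells seashells — she sells more shells")
for step in range(1, 7):  # six merges stands in for a vocab-size target
    pair = Counter(zip(symbols, symbols[1:])).most_common(1)[0][0]
    symbols = merge(symbols, pair)
    print(f"merge {step}: {pair} -> {pair[0] + pair[1]!r}")
print(symbols)
```

Run enough merges over enough books and a frequent glyph like U+2014 earns a vocabulary slot of its own.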
Why does the em dash get its own solo entry?
Because books and legal documents use it a lot. When a character shows up thousands of times, BPE learns that merging it into a single token saves space. That’s why the Unicode character U+2014 ends up with a dedicated ID (often two dedicated IDs: one for "—" and one for " —", with the leading space).
Leading‑space trickery.
Most common words (" the", " and", " to") and punctuation (",", " —") are stored with their leading space. This lets the model mark word boundaries without spending an extra token on the space itself.
Cheapness in practice.
Compare two clause joiners, measured in tokens under cl100k_base: the dash joiner " —" is a single token, while a spelled‑out connective like "; however," costs three (the sketch below runs the numbers on full sentences). One glyph vs. three! If you’re generating 1,000‑token answers millions of times per day, that 1‑token delta per sentence is real money.
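A minimal sketch of that comparison with tiktoken; the two sentences are my own stand‑ins, so treat the exact counts as illustrative:

```python
# Whole-sentence token cost under cl100k_base (pip install tiktoken).
# The example sentences are illustrative stand-ins, not measured data.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
variants = [
    "The storm passed — we kept walking.",
    "The storm passed; however, we kept walking.",
]
for text in variants:
    ids = enc.encode(text)
    print(f"{len(ids):2d} tokens | {text}")
```

The dash version should come out a couple of tokens shorter; multiply that by millions of generations and the real‑money claim follows.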
Training loss is averaged per token.
When the model can express the same idea in fewer tokens, its loss per token usually falls, which subtly rewards the shorter construction during gradient descent. RLHF then compounds the effect: human raters like prose that feels fluent and avoids comma splices; the dash both shortens the text and sounds polished, so the reward model says “yes, more of that.”
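As a back‑of‑the‑envelope illustration of that pressure (the flat 2.0 nats per token is a number invented purely for the arithmetic; the joiner token counts match the tiktoken check above):

```python
# Toy loss arithmetic: fewer joiner tokens mean fewer NLL terms in the
# summed sequence loss. The per-token figure is invented for illustration.
NLL_PER_TOKEN = 2.0                         # assumed flat loss, in nats
JOINER_TOKENS = {" —": 1, "; however,": 3}  # counts under cl100k_base

for joiner, n in JOINER_TOKENS.items():
    print(f"{joiner!r}: {n} loss term(s), ~{n * NLL_PER_TOKEN:.1f} nats")
# One loss term instead of three for the same meaning: a tiny
# per-sentence edge that compounds over billions of updates.
```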
It is possible that we are stuck with em dashes
AI text with em dashes floods the web, so future crawls ingest that text and the dash gets an even bigger “prior.”
Token‑budget pressure intensifies as context windows grow. With massive token windows coming soon, every saved token is amplified; the em dash is a micro‑optimization that scales.
We risk a stylistic monoculture. If every model is trained on the same dash-heavy texts, the distinction between AI‑ and human‑authored prose blurs. Future training pipelines will need stronger de‑duplication and perhaps explicit style‑balancing objectives to stop the em-dash snowball.
Maybe Morse code is the one true language of the future.
Maria, I think you are missing the point entirely. With reinforcement learning, our precious LLMs are starting to figure out that concise, precise answers are key. So em dashes go up (as they are a stylistic way to provide clarity). Be careful with endogeneity bias.