
Are LLMs Just Souped-Up Markov Chains?


If you’ve ever dabbled in text generation, you know the thrill: feed a system some text, and watch it spit out eerily coherent new sentences. Both Markov chains and Large Language Models (LLMs) can do this—which has led many people to wonder whether modern AI is just an overcomplicated version of a technique we’ve had since the early 1900s.

Spoiler alert: they’re related, but calling an LLM a “fancy Markov chain” is like calling a jet engine a “fancy windmill.” Sure, both involve spinning things, but the engineering? Completely different league.

Let’s dig into what makes these two text generators tick, where they overlap, and why one can write poetry while the other… well, let’s just say it tries its best.

The Markov Chain: Simple, Fast, and Surprisingly Fun

A Markov chain is built on one elegant idea: the future depends only on the present, not the entire past. In text generation terms, this means looking at the last few words to guess what comes next.

Here’s how it works:

  • Bigram model (n=2): conditions on just the last word. “…saw” → likely next word: “a” (35%), “the” (25%), “something” (15%)
  • Trigram model (n=3): conditions on the last two words. “…saw a” → likely next word: “dog” (40%), “movie” (30%), “ghost” (10%)

Train it on enough text, and a 4-gram or 5-gram model can produce shockingly readable paragraphs. The magic trick? It’s just counting which words tend to follow others and building a probability table.
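
To make the counting concrete, here’s a minimal sketch of a word-level Markov chain in Python. The tiny corpus, the default order, and the function names are all illustrative, not any canonical implementation:

```python
import random
from collections import defaultdict

def train_markov(text, n=3):
    """Build the probability table: last (n-1) words -> counts of each next word."""
    words = text.split()
    table = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - n + 1):
        context = tuple(words[i : i + n - 1])
        table[context][words[i + n - 1]] += 1
    return table

def generate(table, length=20):
    """Sample text by repeatedly picking a next word in proportion to its count."""
    context = random.choice(list(table.keys()))
    output = list(context)
    for _ in range(length):
        followers = table.get(tuple(output[-len(context):]))
        if not followers:
            break  # dead end: this context never appeared in training
        words, counts = zip(*followers.items())
        output.append(random.choices(words, weights=counts)[0])
    return " ".join(output)

corpus = "I just saw a dog . I just saw the movie . I just saw a ghost ."
print(generate(train_markov(corpus)))
```

Counts in, samples out. The table is the entire model; there’s no notion of meaning anywhere.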

The catch: Memory is limited. Once you’ve generated a few sentences, the model has completely forgotten how it started. That character you introduced in paragraph one? Gone by paragraph three. Markov chains drift like a conversation with someone who keeps forgetting what you’re talking about.

The LLM: A Different Beast Entirely

Modern LLMs—think GPT, Claude, or Gemini—are also predicting the next word (or “token”). But the machinery under the hood is radically different.

1. Embeddings: Words Become Math

LLMs don’t see text as words. They convert everything into high-dimensional vectors—essentially, clouds of numbers that capture meaning. The word “king” lives near “queen,” and “Paris” lives near “France.” This lets the model understand relationships between concepts, not just memorize sequences.
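
A toy illustration of the geometry: the 2-D vectors below are hand-picked for this example (real embeddings are learned and have hundreds or thousands of dimensions), and cosine similarity measures how closely two meanings point in the same direction:

```python
import numpy as np

# Hand-picked toy vectors for illustration; real embeddings are learned
# from data and are nowhere near this tidy.
emb = {
    "king":   np.array([0.90, 0.80]),
    "queen":  np.array([0.88, 0.82]),
    "banana": np.array([0.10, -0.90]),
}

def cosine(a, b):
    """Cosine similarity: ~1 means pointing the same way, ~-1 means opposite."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["king"], emb["queen"]))   # ~1.0: related concepts sit close together
print(cosine(emb["king"], emb["banana"]))  # ~-0.58: unrelated concepts point apart
```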

2. Attention: The Game-Changer

Here’s the killer feature: attention mechanisms. Instead of only looking at the last few words, an LLM can scan back across thousands of tokens and decide which ones matter right now.

Writing about a character named Sarah who was introduced 2,000 words ago? The attention mechanism lets the model “remember” her name and relevant details. This is why LLMs can maintain coherent narratives, follow complex instructions, and reference earlier parts of a conversation.
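
Here’s a bare-bones numpy sketch of scaled dot-product attention, the core computation inside transformer models. The random vectors stand in for the learned query/key/value projections a real LLM would compute:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Each position's output is a mix of all values,
    weighted by how well its query matches every key."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (seq, seq) similarity matrix
    weights = softmax(scores)                # each row sums to 1: "where to look"
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, dim = 6, 8                    # six tokens, 8-dim vectors: tiny, for illustration
Q = rng.normal(size=(seq_len, dim))    # in a real model these come from learned projections
K = rng.normal(size=(seq_len, dim))
V = rng.normal(size=(seq_len, dim))

output, weights = attention(Q, K, V)
print(weights[-1].round(2))  # how much the last token "looks at" every position
```

Unlike a Markov window, nothing here caps how far back a token can look; every position in the context gets a weighted vote.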

3. Continuous, Compressed State

A Markov chain’s “state” is explicit: literally the last N words. An LLM’s state is a massive set of internal calculations representing the meaning and context of everything it has seen. It’s not storing “the quick brown fox”—it’s storing an abstract representation of fox-ness, quickness, and brownness that can flexibly inform future predictions.

The Similarities (Yes, There Are Some)

At the highest level of abstraction, both are sequential predictors. Both look at what came before and probabilistically choose what comes next. Technically, since an LLM with a fixed context window bases its next prediction only on what currently sits inside that window (not the entire raw history), you could squint and say it satisfies the Markov property: it behaves like a Markov chain of astronomically high order.

But that’s like saying both a calculator and a human brain “process information.” True, but wildly reductive.

The Differences That Actually Matter

Context Length

  • Markov: A fixed window of the last 1–4 words (for bigram through 5-gram models). Anything earlier? Forgotten.
  • LLM: Context windows of 100,000+ tokens. Can reference information from pages ago.

Understanding

  • Markov: Pure pattern matching. Never seen “red fox” before? Can’t generate it, even if it has seen “brown fox” a thousand times.
  • LLM: Captures semantic relationships. Knows “red” and “brown” are both colors and can generalize accordingly.

State Representation

  • Markov: Explicit, discrete, and explodes exponentially with longer contexts; a 10-gram model is computationally infeasible (see the quick arithmetic after this list).
  • LLM: Implicit, continuous, compressed into billions of learned parameters. Scales efficiently.
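
The “computationally infeasible” claim is easy to sanity-check. Assuming an illustrative vocabulary of 50,000 words, a 10-gram model has to condition on the previous 9 words:

```python
vocab_size = 50_000     # illustrative vocabulary size (an assumption, not a measured figure)
context_words = 9       # a 10-gram model conditions on the previous 9 words
possible_contexts = vocab_size ** context_words
print(f"{possible_contexts:.2e}")  # ~1.95e+42 distinct contexts
```

No table can enumerate that many rows; an LLM sidesteps the explosion by compressing context into a fixed set of learned parameters instead.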

Long-Range Coherence

  • Markov: Generates sentences that look good locally but drift into nonsense globally.
  • LLM: Maintains themes, follows plot lines, and stays on topic across entire documents.

The Verdict

Are LLMs just incredibly fancy Markov chains?

Kind of, in the way a spaceship is a fancy bicycle. Both get you somewhere, but the underlying physics—and the distances you can travel—are incomparable.

Markov chains are elegant, lightweight, and perfect for certain creative tasks. They’re the statistical equivalent of a party trick that still impresses. But LLMs, with their attention mechanisms and semantic understanding, represent a fundamental paradigm shift in how machines process and generate language.

So next time someone dismisses modern AI as “just statistics,” you can smile knowingly. Yes, it’s statistics—but it’s statistics with a very weird trick that changes everything.

This post is licensed under CC BY 4.0 by the author.