Danish Khan
P(next)

How LLMs Work, Explained

What actually happens between your prompt and the response.

You use large language models every day. You type a question, and a few seconds later, coherent text appears. But what happens in those few seconds? The answer involves splitting your words into pieces, mapping those pieces into a high-dimensional space where meaning has geometry, routing information through layers of attention, and sampling from probability distributions. All within a fixed memory budget. None of this requires magic. Every step is math you can see. This explainer walks through each step, from the moment you press Enter to the moment the response appears, with interactive demos you can play with.

I.

Tokenization

Before a language model can do anything with your text, it has to convert it into numbers. Not whole words, but pieces of words. The process is called tokenization, and it determines everything from how much your API call costs to why ChatGPT struggles with counting letters.

The dominant algorithm is Byte Pair Encoding (Sennrich et al., 2016). It starts with individual bytes and iteratively merges the most frequent pair into a new token. After enough merges, common words like "the" become single tokens while rare words get split into subwords: "unhappiness" becomes ["un", "happiness"] and "indistinguishable" might become ["ind", "ist", "ingu", "ish", "able"].
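The merge loop is short enough to sketch in full. Here is a minimal toy version that trains on a three-word corpus, starting from individual characters rather than raw bytes for readability (real BPE implementations operate on bytes and track merges in a learned vocabulary):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters (real BPE starts from bytes).
tokens = list("low lower lowest")
for _ in range(4):                     # four merge rounds
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
print(tokens)
```

After four merges, "low" has already fused into a single token because it is the most frequent substring, while the rarer suffixes remain split, which is exactly the common-words-stay-whole, rare-words-get-split behavior described above.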

Vocabulary size is a tradeoff. A small vocabulary (32K tokens, like Llama) means more tokens per sentence, which means slower inference and higher cost. A large vocabulary (100K tokens, like GPT-4) means fewer tokens per sentence but a bigger embedding table that uses more memory.[1]

Language inequality is real in tokenization. BPE merges are trained on English-heavy corpora, so English gets roughly 1 token per word. Hindi, Japanese, and other non-Latin scripts often require 3-4 tokens per word. The same question can cost roughly 3x more in Hindi than in English.

The strawberry problem. When you ask "How many r's in strawberry?", the model sees something like ["str", "aw", "berry"]. The letters are split across token boundaries. The model never sees individual characters. This is why character-level tasks are unreliable.

BPE Tokenizer Explorer

Toggle between examples to see how BPE tokenizes different text. Notice how Hindi and Japanese produce far more tokens per character than English.

II.

Embeddings

A token ID is just an index into a lookup table. Token 4344 means nothing on its own. The model needs to convert each token into a vector, a list of numbers that encodes meaning. These vectors are called embeddings, and they are where the magic starts.

Each token maps to a vector of 1,536 to 12,288 dimensions, depending on model size. These vectors are learned during training, not hand-designed. Similar words end up close together: "king" near "queen", "dog" near "puppy". The geometry is remarkably structured.

Mikolov et al. showed in 2013 that vector arithmetic captures analogies: vector("king") - vector("man") + vector("woman") lands close to vector("queen"). This works because the model learned that "king" and "queen" differ in the same directional way as "man" and "woman".[2]
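You can see the mechanics of that arithmetic with hand-built toy vectors. These 3-dimensional embeddings are constructed so the analogy works exactly (a real model learns thousands of dimensions from data, and the result is only approximately closest):

```python
import math

# Toy 3-d embeddings, hand-built so the analogy resolves cleanly.
# Dimension 0 ~ "royalty", dimension 1 ~ "maleness", dimension 2 ~ filler.
vecs = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "dog":   [0.0, 0.5, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# vector("king") - vector("man") + vector("woman")
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# Nearest neighbor among the words not used in the arithmetic
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

The "royalty" component survives the subtraction while the "maleness" component is swapped out, which is the directional-difference intuition in miniature.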

Embeddings are why semantic search works. You can search for "affordable sedan for commuting" and find a document that says "budget-friendly four-door car for daily drives" even though they share zero keywords. Both phrases land in the same region of vector space.

Embedding Space Explorer

Royalty
Animals
Emotions
Food
Programming
Countries
Click any word to see its nearest neighbors. Use the buttons above to see vector arithmetic in action: the result lands near the expected word.

III.

Attention

Embeddings give each token a meaning. But meaning depends on context. The word "bank" means something different in "river bank" versus "bank account." The mechanism that lets each token look at every other token and decide what is relevant is called self-attention. It is the core innovation of the Transformer architecture (Vaswani et al., 2017).

Each token computes three vectors: a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I carry?"). The attention score between two tokens is the dot product of the query and key, divided by the square root of the head dimension and normalized through softmax.
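For a single head, that computation fits in a few lines. A pure-Python sketch, with tiny hand-picked 2-dimensional vectors (real models compute Q, K, and V by multiplying each embedding with learned weight matrices):

```python
import math

def softmax(xs):
    """Exponentiate (shifted for stability) and normalize to sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: lists of d-dimensional vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:                                    # each token's query...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]                      # ...scores every key
        weights = softmax(scores)                  # normalize to sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])    # weighted sum of values
    return out

# Three tokens: token 2's query points the same way as token 0's key,
# so token 2's output is dominated by token 0's value.
Q = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
K = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
```

Note the nested loop over queries and keys: every token scores every other token, which is exactly where the O(n²) cost discussed below comes from.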

Multi-head attention runs multiple attention computations in parallel (32 to 96 heads in modern models). Each head learns different relationships. One might learn syntax (subject-verb agreement), another coreference (what "it" refers to), another positional patterns (attending to nearby tokens).[3]

The computational cost is O(n²) where n is sequence length. Every token computes an attention score with every other token. Doubling the sequence length quadruples the computation. This is the fundamental bottleneck that limits context window sizes.

The key insight. The transformer replaced recurrence (processing tokens one at a time, like RNNs) with attention (processing all tokens simultaneously). This made training massively parallelizable on GPUs, which is why transformers scaled where RNNs could not.

Attention Heatmap Explorer

Toggle sentences and attention heads. In "The cat sat on the mat because it was tired", the coreference head assigns high weight from "it" to "cat".

IV.

Generation

A language model does not generate text the way you write it. It predicts one token at a time. At each step, the model outputs a probability distribution over its entire vocabulary: a list of roughly 100,000 numbers, one per vocabulary token, that sum to one. The next token is sampled from this distribution. How you sample determines whether the output is creative or repetitive, coherent or chaotic.

Temperature divides the raw logits by T before softmax. T < 1 sharpens the distribution (more confident, less creative). T > 1 flattens it (more random, more creative). T approaching 0 is effectively greedy decoding: always pick the highest-probability token.
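That is the entire temperature mechanism; a minimal sketch with four made-up logits shows the sharpening and flattening directly:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)                        # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.1]              # raw model outputs for 4 tokens
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

At T = 0.5 the top token absorbs most of the probability mass; at T = 2.0 the distribution is nearly flat, which is why high temperatures produce more varied (and more error-prone) text.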

Top-k keeps only the k highest-probability tokens and zeros out the rest. Top-p (nucleus sampling, Holtzman et al., 2020) keeps tokens whose cumulative probability reaches p. Top-p adapts automatically: when the model is confident, few tokens pass the filter. When it is uncertain, many do.[4]
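Both filters are a few lines each. A simplified sketch (it renormalizes after filtering and does not handle probability ties specially, which production samplers do):

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens, zero the rest, renormalize."""
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(filtered)
    return [q / total for q in filtered]

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k_filter(probs, 2))    # only the top 2 tokens survive
print(top_p_filter(probs, 0.8))  # tokens kept until cumulative mass >= 0.8
```

On this confident distribution, top-p with p = 0.8 happens to keep the same two tokens as top-k with k = 2; on a flatter distribution, top-p would automatically keep more, which is the adaptivity described above.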

Temperature 0 does not mean "perfectly accurate." It means "always pick the most likely token." If the model's probability distribution is wrong, temperature 0 will confidently give you the wrong answer every time. This is why temperature 0 is not a substitute for model quality.

Token Sampling Playground

1.0
20
1.0
Adjust temperature, top-k, and top-p. Watch how the probability bars change and tokens get filtered out. Low temperature concentrates probability; high temperature spreads it.

V.

Training

A freshly initialized LLM is just random numbers. Its embeddings are random. Its attention weights are random. Ask it anything and you get noise. Training is the process of turning this noise into something useful, and it happens in three distinct phases.

Phase 1: Pretraining. The model reads trillions of tokens from the internet: Common Crawl, Wikipedia, books, code. The objective is simple: predict the next token. The model gradually learns grammar, facts, reasoning patterns, and coding ability, all from prediction. GPT-4 reportedly cost over $100M in compute and took months on thousands of GPUs.
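The objective driven down across those trillions of tokens is per-token cross-entropy, which can be stated in a couple of lines (a sketch of the loss for a single prediction, with a made-up 4-token vocabulary):

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for one prediction: the negative log-probability
    the model assigned to the token that actually came next."""
    return -math.log(probs[target_id])

# The model assigned 40% to the token that actually followed.
loss = next_token_loss([0.1, 0.4, 0.3, 0.2], target_id=1)
print(round(loss, 3))
```

Training nudges the weights so this number shrinks, averaged over every position in the corpus; a perfect prediction (probability 1 on the right token) gives a loss of zero.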

Phase 2: Supervised Fine-tuning (SFT). Humans write high-quality instruction-response pairs, typically 10K to 100K examples. The model learns to follow instructions, not just complete random web text. This phase takes days, not months, and turns a "text completer" into a "question answerer."

Phase 3: RLHF. Humans rank multiple model outputs for the same prompt. A reward model is trained on these rankings. The LLM is then optimized to produce outputs the reward model scores highly (Ouyang et al., 2022). This is what makes the model helpful and safe rather than just capable.[5]

The data volume shrinks dramatically across phases: pretraining uses ~15 trillion tokens, SFT uses ~100K examples, RLHF uses ~50K comparisons. But each phase matters critically. RLHF is why GPT-4-base and GPT-4-chat behave completely differently despite being the same architecture trained on the same pretraining data.

Training Pipeline Visualizer

Toggle between training phases. Watch how the loss curve shape, data volume, and compute cost change dramatically. Pretraining takes months; RLHF takes weeks.

VI.

Context Windows

Every LLM has a context window: the maximum number of tokens it can process at once. GPT-4 Turbo supports 128K tokens. Claude supports 200K. Llama 3 starts at 8K. But these numbers do not just limit how much text you can input. They determine how much memory the model uses, how fast it responds, and how much your API call costs.

During generation, the model caches the key and value tensors (the KV cache) for all previously generated tokens, so it does not recompute them at each step. For a model like Llama 70B, the KV cache requires roughly 2.6 MB per token. At 8K tokens, that is 20 GB. At 128K tokens, that is 330 GB.
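The arithmetic behind those figures is straightforward. A back-of-envelope sketch, assuming fp16/bf16 storage and a Llama-2-70B-like shape (80 layers, 64 attention heads of dimension 128; these shape numbers are illustrative assumptions):

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_param=2):
    """KV cache size: 2 tensors (K and V) per layer, one entry per token.
    bytes_per_param=2 assumes fp16/bf16 storage."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_param * n_tokens

# Full multi-head attention: all 64 heads keep their own K and V.
per_token = kv_cache_bytes(1, n_layers=80, n_kv_heads=64, head_dim=128)
print(per_token / 1e6, "MB per token")        # ~2.6 MB

for n in (8_192, 131_072):
    gb = kv_cache_bytes(n, 80, 64, 128) / 1e9
    print(f"{n:>7} tokens -> {gb:.0f} GB")

# Grouped Query Attention with 8 KV heads shrinks the cache 8x.
gqa_gb = kv_cache_bytes(131_072, 80, 8, 128) / 1e9
```

The per-token cost is fixed by the model's shape, so the cache grows strictly linearly with context length; that linear growth against a fixed pool of GPU memory is the serving constraint the next paragraphs describe.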

Attention computation is O(n²): doubling context length quadruples the attention FLOPs. This is why long-context queries cost more and respond slower. Techniques like Grouped Query Attention (Ainslie et al., 2023) reduce the KV cache by sharing keys and values across groups of query heads, but the fundamental quadratic attention cost remains.[6]

When you paste a 100-page PDF into Claude and ask a question, the model is not "remembering" the document. It is holding the entire document in its KV cache: hundreds of gigabytes of GPU memory, just for your conversation. This is why long-context queries are expensive to serve.

Context Window Memory Calculator

8K
Drag the slider to increase context length. Watch the KV cache memory grow. Larger models need more memory per token; longer contexts multiply that cost.

The machine that reads by predicting. Every step we walked through, from tokenization and embeddings to attention and sampling, serves a single objective: predict the next token. There is no explicit module for "understanding." No separate system for "reasoning." The model learns all of these as side effects of getting better at prediction. Whether that is intelligence is a question this explainer cannot answer. But now you know the mechanism.

Based on the work of Vaswani et al. (Attention Is All You Need, 2017), Mikolov et al. (Word2Vec, 2013), Holtzman et al. (Nucleus Sampling, 2020), Ouyang et al. (InstructGPT, 2022), Sennrich et al. (BPE, 2016), Ainslie et al. (Grouped Query Attention, 2023), and Shazeer (Multi-Query Attention, 2019).