You use large language models every day. You type a question, and a few seconds later, coherent text appears. But what happens in those few seconds? The answer involves splitting your words into pieces, mapping those pieces into a high-dimensional space where meaning has geometry, routing information through layers of attention, and sampling from probability distributions. All within a fixed memory budget. None of this requires magic. Every step is math you can see. This explainer walks through each step, from the moment you press Enter to the moment the response appears, with interactive demos you can play with.
Tokenization
Before a language model can do anything with your text, it has to convert it into numbers. Not whole words, but pieces of words. The process is called tokenization, and it determines everything from how much your API call costs to why ChatGPT struggles with counting letters.
The dominant algorithm is Byte Pair Encoding (Sennrich et al., 2016). It starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. After enough merges, common words like "the" become single tokens while rare words get split into subwords: "unhappiness" becomes ["un", "happiness"] and "indistinguishable" might become ["ind", "ist", "ingu", "ish", "able"].
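One merge step can be sketched in a few lines of Python. The corpus and frequencies below are invented for illustration; real tokenizers operate on raw bytes and train over billions of words, but the core loop is the same: count adjacent pairs, fuse the winner everywhere.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Rewrite every word, fusing each occurrence of pair into one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
corpus = {tuple("the"): 5, tuple("that"): 3}
pair = most_frequent_pair(corpus)   # ('t', 'h'), seen 8 times
corpus = merge_pair(corpus, pair)   # {('th', 'e'): 5, ('th', 'a', 't'): 3}
```

Run this loop a few thousand times on real text and "th", then "the", become single vocabulary entries.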
Vocabulary size is a tradeoff. A small vocabulary (32K tokens, like Llama) means more tokens per sentence, which means slower inference and higher cost. A large vocabulary (100K tokens, like GPT-4) means fewer tokens per sentence but a bigger embedding table that uses more memory.[1]
Language inequality is real in tokenization. BPE merges are trained on English-heavy corpora, so English gets roughly 1 token per word. Hindi, Japanese, and other non-Latin scripts often require 3-4 tokens per word. The same question can cost three to four times as much in Hindi as in English.
The strawberry problem. When you ask "How many r's in strawberry?", the model sees something like ["str", "aw", "berry"]. The letters are split across token boundaries. The model never sees individual characters. This is why character-level tasks are unreliable.
BPE Tokenizer Explorer
Embeddings
A token ID is just an index into a lookup table. Token 4344 means nothing on its own. The model needs to convert each token into a vector, a list of numbers that encodes meaning. These vectors are called embeddings, and they are where meaning first takes shape.
Each token maps to a vector of 1,536 to 12,288 dimensions, depending on model size. These vectors are learned during training, not hand-designed. Similar words end up close together: "king" near "queen", "dog" near "puppy". The geometry is remarkably structured.
Mikolov et al. showed in 2013 that vector arithmetic captures analogies: vector("king") - vector("man") + vector("woman") lands close to vector("queen"). This works because the model learned that "king" and "queen" differ in the same directional way as "man" and "woman".[2]
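A toy version of that arithmetic, with hand-picked 3-dimensional vectors standing in for learned embeddings (real models use thousands of dimensions, and the directions are learned rather than chosen):

```python
import numpy as np

# Hypothetical embeddings: the "gender" and "royalty" directions are made
# consistent on purpose so the analogy works out.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.3]),
    "woman": np.array([0.1, 0.2, 0.3]),
    "dog":   np.array([0.2, 0.5, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: cosine(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```

In a real model the same lookup runs over a vocabulary of 100K tokens and the nearest neighbor is only approximately "queen", which is why the result is stated as "lands close to" rather than "equals".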
Embeddings are why semantic search works. You can search for "affordable sedan for commuting" and find a document that says "budget-friendly four-door car for daily drives" even though they share zero keywords. Both phrases land in the same region of vector space.
Embedding Space Explorer
Attention
Embeddings give each token a meaning. But meaning depends on context. The word "bank" means something different in "river bank" versus "bank account." The mechanism that lets each token look at every other token and decide what is relevant is called self-attention. It is the core innovation of the Transformer architecture (Vaswani et al., 2017).
Each token computes three vectors: a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I carry?"). The attention score between two tokens is the dot product of one token's query with the other's key, scaled by the square root of the key dimension and normalized across all tokens with softmax.
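A single attention head fits in a few lines of NumPy. The projection matrices Wq, Wk, Wv are random here purely for illustration; in a real model they are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over token embeddings X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): every query vs. every key
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ V                       # context-mixed representation per token

rng = np.random.default_rng(0)
n, d = 4, 8                                  # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)          # shape (4, 8)
```

The (n, n) score matrix is where the O(n²) cost discussed below comes from: every token scores every other token.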
Multi-head attention runs multiple attention computations in parallel (32 to 96 heads in modern models). Each head learns different relationships. One might learn syntax (subject-verb agreement), another coreference (what "it" refers to), another positional patterns (attending to nearby tokens).[3]
The computational cost is O(n²) where n is sequence length. Every token computes an attention score with every other token. Doubling the sequence length quadruples the computation. This is the fundamental bottleneck that limits context window sizes.
The key insight. The transformer replaced recurrence (processing tokens one at a time, like RNNs) with attention (processing all tokens simultaneously). This made training massively parallelizable on GPUs, which is why transformers scaled where RNNs could not.
Attention Heatmap Explorer
Generation
A language model does not generate text the way you write it. It predicts one token at a time. At each step, the model outputs a probability distribution over its entire vocabulary: a list of 100,000 numbers that sum to one. The next token is sampled from this distribution. How you sample determines whether the output is creative or repetitive, coherent or chaotic.
Temperature divides the raw logits by T before softmax. T < 1 sharpens the distribution (more confident, less creative). T > 1 flattens it (more random, more creative). T approaching 0 is effectively greedy decoding: always pick the highest-probability token.
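You can see the effect directly by applying different temperatures to the same logits (the values below are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical raw model outputs

# Lower T concentrates mass on the top token; higher T spreads it out.
probs = {T: softmax(logits / T) for T in (0.5, 1.0, 2.0)}
for T, p in probs.items():
    print(f"T={T}: {p.round(3)}")
```

At T = 0.5 the top token takes roughly 84% of the probability mass; at T = 2.0 it drops to about 43%, leaving real odds for the alternatives.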
Top-k keeps only the k highest-probability tokens and zeros out the rest. Top-p (nucleus sampling, Holtzman et al., 2020) keeps tokens whose cumulative probability reaches p. Top-p adapts automatically: when the model is confident, few tokens pass the filter. When it is uncertain, many do.[4]
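A minimal sketch of the top-p filter (the probabilities are invented; in practice this runs over the full 100K-token distribution at every generation step):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]        # most likely first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()       # renormalize over the survivors

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, p=0.9)      # tail token zeroed out, rest rescaled
```

With a confident distribution like [0.95, 0.03, ...], only one token survives the same p = 0.9 filter; with a flat distribution, dozens would. That is the adaptivity top-k lacks.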
Temperature 0 does not mean "perfectly accurate." It means "always pick the most likely token." If the model's probability distribution is wrong, temperature 0 will confidently give you the wrong answer every time. This is why temperature 0 is not a substitute for model quality.
Token Sampling Playground
Training
A freshly initialized LLM is just random numbers. Its embeddings are random. Its attention weights are random. Ask it anything and you get noise. Training is the process of turning this noise into something useful, and it happens in three distinct phases.
Phase 1: Pretraining. The model reads trillions of tokens from the internet: Common Crawl, Wikipedia, books, code. The objective is simple: predict the next token. The model gradually learns grammar, facts, reasoning patterns, and coding ability, all from prediction. GPT-4 reportedly cost over $100M in compute and took months on thousands of GPUs.
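The pretraining objective itself is small enough to write down: average cross-entropy between the model's predicted distribution and the token that actually came next. Random logits stand in for real model outputs here.

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Average cross-entropy of predicting each next token (the pretraining loss)."""
    # logits: (seq_len, vocab) scores; target_ids: the true next token per position.
    shifted = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))       # toy: 5 positions, 100-token vocabulary
targets = rng.integers(0, 100, size=5)
loss = next_token_loss(logits, targets)  # near log(100) ≈ 4.6 for random logits
```

Everything in pretraining, across trillions of tokens and thousands of GPUs, is gradient descent on this one number.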
Phase 2: Supervised Fine-tuning (SFT). Humans write high-quality instruction-response pairs, typically 10K to 100K examples. The model learns to follow instructions, not just complete random web text. This phase takes days, not months, and turns a "text completer" into a "question answerer."
Phase 3: RLHF. Humans rank multiple model outputs for the same prompt. A reward model is trained on these rankings. The LLM is then optimized to produce outputs the reward model scores highly (Ouyang et al., 2022). This is what makes the model helpful and safe rather than just capable.[5]
The data volume shrinks dramatically across phases: pretraining uses ~15 trillion tokens, SFT uses ~100K examples, RLHF uses ~50K comparisons. But each phase matters critically. RLHF is why GPT-4-base and GPT-4-chat behave completely differently despite being the same architecture trained on the same pretraining data.
Training Pipeline Visualizer
Context Windows
Every LLM has a context window: the maximum number of tokens it can process at once. GPT-4 Turbo supports 128K tokens. Claude supports 200K. Llama 3 starts at 8K. But these numbers do not just limit how much text you can input. They determine how much memory the model uses, how fast it responds, and how much your API call costs.
During generation, the model caches the key and value tensors (the KV cache) for all previously generated tokens, so it does not recompute them at each step. For a model like Llama 70B, the KV cache requires roughly 2.6 MB per token. At 8K tokens, that is 20 GB. At 128K tokens, that is 330 GB.
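Those figures follow directly from the tensor shapes. A back-of-the-envelope calculator, assuming Llama-2-70B-style dimensions (80 layers, 64 heads of dimension 128), fp16 storage, and no grouped-query sharing:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GiB: one key and one value vector per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # ~2.6 MB here
    return per_token * context_tokens / 2**30

# 80 layers, 64 KV heads of dimension 128, fp16 (2 bytes per value):
print(kv_cache_gib(80, 64, 128, 8_192))     # → 20.0
print(kv_cache_gib(80, 64, 128, 131_072))   # → 320.0
```

Grouped Query Attention shrinks this by reducing n_kv_heads (e.g. 8 shared KV heads instead of 64 cuts the cache by 8x), which is exactly the lever discussed next.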
Attention computation is O(n²): doubling context length quadruples the attention FLOPs. This is why long-context queries cost more and respond slower. Techniques like Grouped Query Attention (Ainslie et al., 2023) reduce the KV cache by sharing keys across heads, but the fundamental quadratic attention cost remains.[6]
When you paste a 100-page PDF into Claude and ask a question, the model is not "remembering" the document. It is holding the entire document in its KV cache: hundreds of gigabytes of GPU memory, just for your conversation. This is why long-context queries are expensive to serve.
Context Window Memory Calculator
The machine that reads by predicting. Every step we walked through (tokenization, embeddings, attention, sampling) serves a single objective: predict the next token. There is no explicit module for "understanding." No separate system for "reasoning." The model learns all of these as side effects of getting better at prediction. Whether that is intelligence is a question this explainer cannot answer. But now you know the mechanism.
Based on the work of Vaswani et al. (Attention Is All You Need, 2017), Mikolov et al. (Word2Vec, 2013), Holtzman et al. (Nucleus Sampling, 2020), Ouyang et al. (InstructGPT, 2022), Sennrich et al. (BPE, 2016), and Shazeer (Multi-Query Attention, 2019).