You use large language models every day. You type a question, and a few seconds later, coherent text appears. But what happens in those few seconds? The answer involves splitting your words into pieces, mapping those pieces into a high-dimensional space where meaning has geometry, routing information through layers of attention, and sampling from probability distributions. All within a fixed memory budget. None of this requires magic. Every step is math you can see. This explainer walks through each step, from the moment you press Enter to the moment the response appears, with interactive demos you can play with.
Tokenization
Before a language model can do anything with your text, it has to convert it into numbers. Not whole words, but pieces of words. The process is called tokenization, and it determines everything from how much your API call costs to why ChatGPT struggles with counting letters.
The dominant algorithm is Byte Pair Encoding (Sennrich et al., 2016). It starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. After enough merges, common words like "the" become single tokens while rare words get split into subwords: "unhappiness" becomes ["un", "happiness"] and "indistinguishable" might become ["ind", "ist", "ingu", "ish", "able"].
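One merge step can be sketched in a few lines of Python. The corpus and frequencies below are invented for illustration; real tokenizers operate on raw bytes and train over billions of words, but the core loop is the same: count adjacent pairs, fuse the winner everywhere.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Rewrite every word, fusing each occurrence of pair into one symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
corpus = {tuple("the"): 5, tuple("that"): 3}
pair = most_frequent_pair(corpus)   # ('t', 'h'), seen 8 times
corpus = merge_pair(corpus, pair)   # {('th', 'e'): 5, ('th', 'a', 't'): 3}
```

Run this loop a few thousand times on real text and "th", then "the", become single vocabulary entries.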
Vocabulary size is a tradeoff. A small vocabulary (32K tokens, like Llama) means more tokens per sentence, which means slower inference and higher cost. A large vocabulary (100K tokens, like GPT-4) means fewer tokens per sentence but a bigger embedding table that uses more memory.[1]
Language inequality is real in tokenization. BPE merges are trained on English-heavy corpora, so English gets roughly 1 token per word. Hindi, Japanese, and other non-Latin scripts often require 3-4 tokens per word. The same question can cost three to four times as much in Hindi as in English.
The strawberry problem. When you ask "How many r's in strawberry?", the model sees something like ["str", "aw", "berry"]. The letters are split across token boundaries. The model never sees individual characters. This is why character-level tasks are unreliable.
BPE Tokenizer Explorer
Embeddings
A token ID is just an index into a lookup table. Token 4344 means nothing on its own. The model needs to convert each token into a vector, a list of numbers that encodes meaning. These vectors are called embeddings, and they are where meaning first takes shape.
Each token maps to a vector of 1,536 to 12,288 dimensions, depending on model size. These vectors are learned during training, not hand-designed. Similar words end up close together: "king" near "queen", "dog" near "puppy". The geometry is remarkably structured.
Mikolov et al. showed in 2013 that vector arithmetic captures analogies: vector("king") - vector("man") + vector("woman") lands close to vector("queen"). This works because the model learned that "king" and "queen" differ in the same directional way as "man" and "woman".[2]
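A toy version of that arithmetic, with hand-picked 3-dimensional vectors standing in for learned embeddings (real models use thousands of dimensions, and the directions are learned rather than chosen):

```python
import numpy as np

# Hypothetical embeddings: the "gender" and "royalty" directions are made
# consistent on purpose so the analogy works out.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.3]),
    "woman": np.array([0.1, 0.2, 0.3]),
    "dog":   np.array([0.2, 0.5, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: cosine(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```

In a real model the same lookup runs over a vocabulary of 100K tokens and the nearest neighbor is only approximately "queen", which is why the result is stated as "lands close to" rather than "equals".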
Embeddings are why semantic search works. You can search for "affordable sedan for commuting" and find a document that says "budget-friendly four-door car for daily drives" even though they share zero keywords. Both phrases land in the same region of vector space.
Embedding Space Explorer
Attention
Embeddings give each token a meaning. But meaning depends on context. The word "bank" means something different in "river bank" versus "bank account." The mechanism that lets each token look at every other token and decide what is relevant is called self-attention. It is the core innovation of the Transformer architecture (Vaswani et al., 2017).
Each token computes three vectors: a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I carry?"). The attention score between two tokens is the dot product of one token's query with the other's key, scaled by the square root of the key dimension and normalized across all tokens with softmax.
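A single attention head fits in a few lines of NumPy. The projection matrices Wq, Wk, Wv are random here purely for illustration; in a real model they are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over token embeddings X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): every query vs. every key
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ V                       # context-mixed representation per token

rng = np.random.default_rng(0)
n, d = 4, 8                                  # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)          # shape (4, 8)
```

The (n, n) score matrix is where the O(n²) cost discussed below comes from: every token scores every other token.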
Multi-head attention runs multiple attention computations in parallel (32 to 96 heads in modern models). Each head learns different relationships. One might learn syntax (subject-verb agreement), another coreference (what "it" refers to), another positional patterns (attending to nearby tokens).[3]
The computational cost is O(n²) where n is sequence length. Every token computes an attention score with every other token. Doubling the sequence length quadruples the computation. This is the fundamental bottleneck that limits context window sizes.
The key insight. The transformer replaced recurrence (processing tokens one at a time, like RNNs) with attention (processing all tokens simultaneously). This made training massively parallelizable on GPUs, which is why transformers scaled where RNNs could not.
Attention Heatmap Explorer
Generation
A language model does not generate text the way you write it. It predicts one token at a time. At each step, the model outputs a probability distribution over its entire vocabulary: a list of 100,000 numbers that sum to one. The next token is sampled from this distribution. How you sample determines whether the output is creative or repetitive, coherent or chaotic.
Temperature divides the raw logits by T before softmax. T < 1 sharpens the distribution (more confident, less creative). T > 1 flattens it (more random, more creative). T approaching 0 is effectively greedy decoding: always pick the highest-probability token.
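You can see the effect directly by applying different temperatures to the same logits (the values below are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical raw model outputs

# Lower T concentrates mass on the top token; higher T spreads it out.
probs = {T: softmax(logits / T) for T in (0.5, 1.0, 2.0)}
for T, p in probs.items():
    print(f"T={T}: {p.round(3)}")
```

At T = 0.5 the top token takes roughly 84% of the probability mass; at T = 2.0 it drops to about 43%, leaving real odds for the alternatives.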
Top-k keeps only the k highest-probability tokens and zeros out the rest. Top-p (nucleus sampling, Holtzman et al., 2020) keeps tokens whose cumulative probability reaches p. Top-p adapts automatically: when the model is confident, few tokens pass the filter. When it is uncertain, many do.[4]
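A minimal sketch of the top-p filter (the probabilities are invented; in practice this runs over the full 100K-token distribution at every generation step):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]        # most likely first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # include the token that crosses p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()       # renormalize over the survivors

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, p=0.9)      # tail token zeroed out, rest rescaled
```

With a confident distribution like [0.95, 0.03, ...], only one token survives the same p = 0.9 filter; with a flat distribution, dozens would. That is the adaptivity top-k lacks.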
Temperature 0 does not mean "perfectly accurate." It means "always pick the most likely token." If the model's probability distribution is wrong, temperature 0 will confidently give you the wrong answer every time. This is why temperature 0 is not a substitute for model quality.
Token Sampling Playground
Training
A freshly initialized LLM is just random numbers. Its embeddings are random. Its attention weights are random. Ask it anything and you get noise. Training is the process of turning this noise into something useful, and it happens in three distinct phases.
Phase 1: Pretraining. The model reads trillions of tokens from the internet: Common Crawl, Wikipedia, books, code. The objective is simple: predict the next token. The model gradually learns grammar, facts, reasoning patterns, and coding ability, all from prediction. GPT-4 reportedly cost over $100M in compute and took months on thousands of GPUs.
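The pretraining objective itself is small enough to write down: average cross-entropy between the model's predicted distribution and the token that actually came next. Random logits stand in for real model outputs here.

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Average cross-entropy of predicting each next token (the pretraining loss)."""
    # logits: (seq_len, vocab) scores; target_ids: the true next token per position.
    shifted = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 100))       # toy: 5 positions, 100-token vocabulary
targets = rng.integers(0, 100, size=5)
loss = next_token_loss(logits, targets)  # near log(100) ≈ 4.6 for random logits
```

Everything in pretraining, across trillions of tokens and thousands of GPUs, is gradient descent on this one number.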
Phase 2: Supervised Fine-tuning (SFT). Humans write high-quality instruction-response pairs, typically 10K to 100K examples. The model learns to follow instructions, not just complete random web text. This phase takes days, not months, and turns a "text completer" into a "question answerer."
Phase 3: RLHF. Humans rank multiple model outputs for the same prompt. A reward model is trained on these rankings. The LLM is then optimized to produce outputs the reward model scores highly (Ouyang et al., 2022). This is what makes the model helpful and safe rather than just capable.[5]
The data volume shrinks dramatically across phases: pretraining uses ~15 trillion tokens, SFT uses ~100K examples, RLHF uses ~50K comparisons. But each phase matters critically. RLHF is why GPT-4-base and GPT-4-chat behave completely differently despite being the same architecture trained on the same pretraining data.
Training Pipeline Visualizer
Context Windows
Every LLM has a context window: the maximum number of tokens it can process at once. GPT-4 Turbo supports 128K tokens. Claude supports 200K. Llama 3 starts at 8K. But these numbers do not just limit how much text you can input. They determine how much memory the model uses, how fast it responds, and how much your API call costs.
During generation, the model caches the key and value tensors (the KV cache) for all previously generated tokens, so it does not recompute them at each step. For a model like Llama 70B, the KV cache requires roughly 2.6 MB per token. At 8K tokens, that is 20 GB. At 128K tokens, that is 330 GB.
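Those figures follow directly from the tensor shapes. A back-of-the-envelope calculator, assuming Llama-2-70B-style dimensions (80 layers, 64 heads of dimension 128), fp16 storage, and no grouped-query sharing:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GiB: one key and one value vector per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # ~2.6 MB here
    return per_token * context_tokens / 2**30

# 80 layers, 64 KV heads of dimension 128, fp16 (2 bytes per value):
print(kv_cache_gib(80, 64, 128, 8_192))     # → 20.0
print(kv_cache_gib(80, 64, 128, 131_072))   # → 320.0
```

Grouped Query Attention shrinks this by reducing n_kv_heads (e.g. 8 shared KV heads instead of 64 cuts the cache by 8x), which is exactly the lever discussed next.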
Attention computation is O(n²): doubling context length quadruples the attention FLOPs. This is why long-context queries cost more and respond slower. Techniques like Grouped Query Attention (Ainslie et al., 2023) reduce the KV cache by sharing keys across heads, but the fundamental quadratic attention cost remains.[6]
When you paste a 100-page PDF into Claude and ask a question, the model is not "remembering" the document. It is holding the entire document in its KV cache: hundreds of gigabytes of GPU memory, just for your conversation. This is why long-context queries are expensive to serve.
Context Window Memory Calculator
The machine that reads by predicting. Every step we walked through (tokenization, embeddings, attention, sampling) serves a single objective: predict the next token. There is no explicit module for "understanding." No separate system for "reasoning." The model learns all of these as side effects of getting better at prediction. Whether that is intelligence is a question this explainer cannot answer. But now you know the mechanism.
Based on the work of Vaswani et al. (Attention Is All You Need, 2017), Mikolov et al. (Word2Vec, 2013), Holtzman et al. (Nucleus Sampling, 2020), Ouyang et al. (InstructGPT, 2022), Sennrich et al. (BPE, 2016), and Shazeer (Multi-Query Attention, 2019).