You write a careful system prompt. You tell the model to always respond in JSON. For the first few turns it works perfectly. By turn fifteen, it's back to prose. You didn't change anything. The model just... forgot.

This isn't a bug in any particular model. It's a structural property of how transformers process long sequences. And understanding why it happens is the first step to fixing it.

The attention curve is a U-shape

In 2023, Liu et al. published a paper called "Lost in the Middle" that made one of the cleanest observations in LLM research. They gave models a long context with a key fact buried at various positions, then asked the model to retrieve it. The result was a U-shaped curve: models attended well to information at the beginning of the context, attended well to the end, and largely ignored the middle.

Think about what this means for a conversation. Your system prompt is at the beginning. The user's latest message is at the end. Everything in between, all those earlier turns, the clarifications, the examples you provided three messages ago, sits in the dead zone. The model can see it, technically. It just doesn't pay much attention to it.

This is why instruction drift happens gradually. The system prompt doesn't disappear. It gets diluted. Each new turn pushes the conversation further from the system prompt and adds more tokens to the middle. The model's effective attention to your original instructions decays with every exchange.

Context rot is worse than you think

Chroma's context rot study quantified this across 18 leading models, including GPT-4, Claude, and Gemini. The numbers are stark.

Even with perfect retrieval of the relevant information, performance degrades by 13.9% to 85% as input length grows. Models that scored over 95% accuracy on short prompts fell to 60-70% on longer contexts. Adding full conversation history (around 113K tokens) dropped accuracy by 30% compared to a focused 300-token version of the same information.

The degradation patterns vary by model. Claude models decay the slowest but tend to refuse on very long tasks. GPT-4.1 showed erratic outputs, at one point inserting lowercase fragments where proper nouns should be. Gemini 1.5 Pro nosedived starting at just 500-750 words of context.

The takeaway: a bigger context window doesn't mean a better context window. Having 200K tokens of capacity means nothing if the model stops paying attention after 8K.

What's actually happening: the KV cache problem

To understand why this happens, you need to understand the KV cache. During inference, the model stores key-value pairs for every token it has processed. This is the mechanism that lets it "remember" earlier parts of the conversation without reprocessing them from scratch.

But the attention mechanism isn't uniform. Each layer has a fixed number of attention heads, each with a fixed capacity. As the sequence grows, every token competes for attention with every other token. Your carefully worded instruction to "always respond in JSON" is competing with hundreds of user messages, tool outputs, and previous responses. It's not that the model forgets. It's that your instruction gets outcompeted.
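A toy calculation makes the competition concrete. This is a simplification of real multi-head attention, but it captures the core dynamic: if an instruction token holds a fixed attention logit while the sequence grows, its post-softmax share of attention still collapses.

```python
import math

def softmax_weight(target_logit, other_logits):
    """Post-softmax attention weight received by the target token."""
    exps = [math.exp(x) for x in [target_logit] + other_logits]
    return exps[0] / sum(exps)

# A salient instruction token (logit 3.0) competing with n ordinary
# tokens (logit 0.0 each): its share of attention shrinks as n grows,
# even though its own logit never changes.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} competitors -> weight {softmax_weight(3.0, [0.0] * n):.4f}")
```

The instruction's logit never drops; it simply gets divided by an ever-larger denominator. That is "outcompeted" in one line of math.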

Xiao et al. discovered something interesting about this in their attention sink paper (2023). The first few tokens in a sequence receive disproportionately high attention regardless of their content. This is why system prompts work at all: they benefit from this positional advantage. But the advantage is finite. Pile on enough tokens and even the attention sink gets overwhelmed.

Writing in the Margins: a structural fix

The most elegant approach to this problem came from Russak et al. in 2024 with a technique called Writing in the Margins (WiM).

The idea is simple once you see it. Instead of processing the entire context as one sequence and hoping the model attends to the right parts, WiM processes the context in chunks and generates intermediate notes ("margins") for each chunk. These margins are mini-summaries that extract only the task-relevant information. Then, instead of feeding the full context to the model for the final answer, it feeds just the margins plus the question.

Imagine you're reading a 200-page contract to answer one specific question. You could read the whole thing and try to hold it all in your head. Or you could read it section by section, jotting a one-line note in the margin whenever you hit something relevant, then answer the question using only your notes. The second approach is WiM.
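The chunk-and-margin loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `summarize` and `answer` are placeholders standing in for real LLM calls, and the fixed-size character chunking is an assumption for brevity.

```python
def chunks(text, size=500):
    """Split the long context into fixed-size pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize(piece, question):
    # Placeholder for an LLM call: "note anything in this chunk
    # that is relevant to the question" (stubbed as a prefix here).
    return piece[:80]

def answer(margins, question):
    # Placeholder for the final LLM call, which sees only the
    # margin notes plus the question, never the full context.
    return f"Answer to {question!r} from {len(margins)} margin notes"

def writing_in_the_margins(context, question):
    margins = [summarize(p, question) for p in chunks(context)]
    return answer(margins, question)
```

The key property: the final call's context scales with the number of relevant notes, not with the length of the original document.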

The results: 7.5% accuracy improvement on multi-hop reasoning tasks (HotpotQA, MultiHop-RAG) and over 30% F1 improvement on aggregation tasks. No fine-tuning. Works on off-the-shelf models. The overhead is marginal because the margin generation happens during the chunked prefill that the model does anyway.

What works in practice (2025-2026)

Research techniques like WiM are important, but most people building with LLMs need practical solutions today. Here's what actually works, ranked by impact.

1. Put instructions last, not just first

The U-shaped attention curve has two peaks: beginning and end. Most people only use the beginning (system prompt). But you can exploit recency bias by repeating your critical constraints at the end of the context, right before the model generates.

Concretely: if your system prompt says "respond in JSON," add a final user message or assistant prefill that says "Remember: JSON only." This alone eliminates most instruction drift in production systems. It's ugly but it works because you're putting the instruction where the model is actually looking.
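A minimal sketch of the trick, assuming the familiar role/content message-list format; the reminder string here is an example, and in practice it should restate whatever constraint your system prompt already carries.

```python
def with_trailing_reminder(messages, reminder="Remember: respond in JSON only."):
    """Repeat the critical constraint as the final user message, placing it
    at the high-attention end of the context right before generation."""
    return messages + [{"role": "user", "content": reminder}]

history = [
    {"role": "system", "content": "You are a helpful assistant. Respond in JSON."},
    {"role": "user", "content": "Summarize this report."},
]
final_messages = with_trailing_reminder(history)
```

An assistant prefill (seeding the start of the model's reply, e.g. with an opening brace) achieves the same end-of-context placement where the API supports it.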

2. Structure creates salience

A wall of text gives attention nothing to latch onto. Structured formatting (XML tags, markdown headers, numbered lists) creates visual landmarks for the attention mechanism. This isn't metaphorical: the tokenizer produces distinct tokens for structural elements like <instruction> tags, and the model learns during training that these tokens signal important boundaries.

Anthropic's long context prompting guide explicitly recommends XML delimiters for this reason. The difference between "respond only in JSON format" buried in a paragraph and <output_format>JSON only</output_format> is measurable.
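A small helper makes the pattern concrete. The tag names below are arbitrary choices for illustration, not a required schema; what matters is that each section gets an explicit, paired delimiter.

```python
def tagged(name, body):
    """Wrap a prompt section in XML-style delimiters the model can anchor on."""
    return f"<{name}>\n{body}\n</{name}>"

prompt = "\n\n".join([
    tagged("instructions", "Summarize the key obligations in the contract."),
    tagged("document", "...contract text here..."),
    tagged("output_format", "JSON only"),
])
```

Putting the output-format constraint in its own tagged section, last, combines this technique with the recency trick from point 1.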

3. Compaction over accumulation

The single biggest cause of instruction drift in agent systems is context accumulation. Every tool call, every intermediate result, every previous turn gets appended to the context. By turn 30, the model is processing 50K tokens, of which maybe 2K actually matter.

The fix is compaction: periodically summarize the conversation history, discard the raw messages, and continue with just the summary plus the system prompt. Anthropic calls this context engineering and considers it more important than prompt engineering. Claude Code does this automatically, compressing conversation history while preserving architectural decisions and unresolved issues.

The key insight: you're not losing information by compacting. You're losing noise. And noise is what causes the model to forget your instructions.
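One way to sketch compaction, using character count as a crude token proxy (a real system would use a tokenizer) and a placeholder `summarize` function standing in for an LLM summarization call:

```python
def maybe_compact(messages, summarize, max_chars=200_000):
    """Keep the system prompt and the latest turn; replace everything in
    between with a summary once the history exceeds the budget."""
    if sum(len(m["content"]) for m in messages) <= max_chars:
        return messages
    system, middle, latest = messages[0], messages[1:-1], messages[-1]
    summary = {
        "role": "user",
        "content": "Summary of the conversation so far: " + summarize(middle),
    }
    return [system, summary, latest]
```

In practice the summarization prompt matters a great deal: it should be told explicitly to preserve decisions, constraints, and unresolved issues, since anything it drops is gone for good.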

4. Multi-agent delegation

If compaction is a band-aid, multi-agent architectures are the proper fix. Instead of one model processing an ever-growing context, a manager agent delegates subtasks to worker agents, each running in a fresh context window. The worker gets a clean context with just the system prompt and the specific subtask. It does its work, returns a summary, and the manager continues.

This is how Claude Code, Cursor, and most production coding agents work in 2025. Each sub-agent sees the minimum context required. No context rot because there's no context to rot.
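A toy sketch of the delegation pattern. Here `llm` is a placeholder for a real model call, and the message format is an assumption; the point is that each worker call starts from zero accumulated context.

```python
def run_worker(llm, system_prompt, subtask):
    """Each worker starts from a clean context: system prompt + one subtask."""
    return llm([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": subtask},
    ])

def run_manager(llm, system_prompt, task, subtasks):
    """The manager only ever sees short worker summaries, never the workers'
    raw intermediate context."""
    summaries = [run_worker(llm, system_prompt, s) for s in subtasks]
    briefing = task + "\n\nWorker summaries:\n" + "\n".join(summaries)
    return llm([
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": briefing},
    ])
```

The trade-off is coordination overhead: the manager can only act on what the summaries contain, so the subtask descriptions and summary instructions carry real design weight.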

5. Prompt caching changes the economics

OpenAI and Anthropic now cache the KV states of system prompts across requests. This means your system prompt, the part you want the model to attend to most strongly, gets processed once and reused. Anthropic's cache reduces costs by up to 90% on cached tokens; OpenAI offers a 50% discount on cached tokens for prompts over 1,024 tokens.

This isn't just a cost optimization. It's architecturally significant. Cached prompts occupy the same position in the attention pattern every time, reinforcing the attention sink effect. The model doesn't just "remember" the system prompt; it consistently allocates attention to it because the KV states are identical across requests.
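For concreteness, a sketch of what opting in looks like on Anthropic's Messages API, where cacheable blocks are marked with a `cache_control` field. The model name is a placeholder and the field names reflect the docs at time of writing, so verify against the current API reference before relying on them.

```python
# Sketch of an Anthropic Messages API request body with prompt caching
# enabled. The `cache_control` marker asks the API to cache the KV states
# of this system block so later requests reuse them.
long_system_prompt = "You are a contracts analyst. Respond in JSON. ..." * 100

request = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    "messages": [{"role": "user", "content": "Summarize clause 4."}],
}
```

OpenAI's equivalent is automatic: identical prompt prefixes past the minimum length are cached without any request-side markers.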

The deeper lesson

Instruction drift isn't really about forgetting. Models don't forget the way humans do. They attend. Attention is a finite resource, and every token you add to the context competes for it.

Once you internalize this, the solutions become obvious. Keep contexts short. Put critical instructions where attention is highest (beginning and end). Use structure to create attention anchors. Compress aggressively. And when one context window isn't enough, use multiple clean ones.

The models aren't getting worse at following instructions. We're just getting better at understanding where their attention actually goes.


References