GPT-5.2 can process over a million tokens of input. Claude can hold 200K tokens in its context window. Gemini 2.5 Pro handles a million. But ask any of them to write a 10,000-word essay, and you'll get a strange result: the model will either stop early, start repeating itself, or quietly degrade in quality after the first couple thousand words.
The asymmetry is striking. These models can read a novel but can't write one. And the reason turns out to be surprisingly simple.
The training data ceiling
LLMs learn to generate text by imitating their training data. During supervised fine-tuning (SFT), models see thousands of instruction-response pairs: "summarize this article," "write a function that sorts a list," "explain quantum entanglement." The responses in these pairs are almost always short. Across standard SFT datasets, the longest outputs are around 2,000 words.
The model learns an implicit length distribution. It learns that responses are a few hundred words, maybe a thousand, rarely more. When you ask for 10,000 words, you're asking it to extrapolate beyond anything it has seen. It's like training a runner exclusively on 5K races and then entering them in a marathon. The cardiovascular system might be capable, but the pacing strategy, the mental model of "how long this is supposed to take," is all wrong.
This isn't a theoretical observation. Bai et al. demonstrated it empirically in their LongWriter paper (ICLR 2025). They showed that models with 128K-token context windows could only produce outputs of 2,000 words or fewer, with quality dropping sharply at the boundary. The architecture supported longer output. The training data didn't.
AgentWrite: plan, then write in pieces
The first solution Bai et al. proposed was AgentWrite, a divide-and-conquer agent pipeline. The idea is simple. Instead of asking the model to write 10,000 words in one pass, you break the task into two stages.
First, the model creates a detailed writing plan: section titles, key points for each section, and target word counts per section. This is a short-output task, well within the model's comfort zone.
Second, the model writes each section independently, following the plan. Each section is a few hundred words. The model is good at this. You concatenate the sections and get a coherent long document.
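The two stages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `llm` is a stub standing in for whatever chat-completion call you use, and the plan format is fabricated for clarity (a real pipeline would parse the model's own outline).

```python
# Sketch of an AgentWrite-style plan-then-write pipeline.
# `llm` is a placeholder; swap in a real model API call.

def llm(prompt: str) -> str:
    # Replace with a real model call (OpenAI, Anthropic, a local model, ...).
    return f"[model output for: {prompt[:40]}...]"

def make_plan(task: str, total_words: int, n_sections: int) -> list[dict]:
    """Stage 1: get section titles and per-section word budgets (stubbed here)."""
    per_section = total_words // n_sections
    return [{"title": f"Section {i + 1}", "words": per_section}
            for i in range(n_sections)]

def write_long(task: str, total_words: int = 10_000, n_sections: int = 10) -> str:
    plan = make_plan(task, total_words, n_sections)
    sections = []
    for spec in plan:
        prompt = (f"Task: {task}\n"
                  f"Write the section titled '{spec['title']}' "
                  f"in about {spec['words']} words.")
        sections.append(llm(prompt))  # Stage 2: short, in-distribution calls
    return "\n\n".join(sections)      # concatenate into one long document

doc = write_long("Write a 10,000-word essay on tides")
```

Each individual call stays well inside the model's trained length distribution; only the concatenated result is long.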
AgentWrite produced outputs of up to 20,000 words using models that normally capped out at 2,000. No fine-tuning, no architecture changes. Just a smarter way of using the model's existing capabilities.
The analogy is to how humans write long documents. Nobody writes a book by starting at page one and typing continuously until the end. You outline first. You write chapters. You revise. AgentWrite does the same thing, mechanically.
The real fix: training data
AgentWrite solved the immediate problem, but it also revealed something deeper: if the ceiling comes from training data, why not just fix the training data?
Using AgentWrite's pipeline, the authors generated LongWriter-6k: 6,000 SFT examples with output lengths ranging from 2,000 to 32,000 words. They incorporated this dataset into standard SFT training. The result: models could now generate 10,000+ words natively, without the agent pipeline, without quality degradation.
The best model, LongWriter-9B-DPO, was further refined with Direct Preference Optimization. On their LongBench-Write benchmark, this 9-billion-parameter model outperformed much larger proprietary models. The DPO step alone added a 5% overall quality improvement and an 18% improvement in depth and breadth of content.
The takeaway is almost embarrassingly simple: models can write long if they've seen long writing during training. The output ceiling was never architectural. It was a data gap.
Where things stand now (early 2026)
The output token limits across frontier models have expanded dramatically since mid-2024. GPT-5.2 supports 128K output tokens. Claude supports 128K in extended thinking mode. Gemini 2.5 Pro and Flash both support 65K. These numbers would have seemed absurd two years ago.
But raw token limits don't tell the whole story. The quality of generation at those lengths still varies. HelloBench, a benchmark specifically designed to evaluate long-form generation, found that even well-performing models like GPT-4o struggle to maintain quality past 4,000 words. More troublingly, models enhanced for long-context understanding sometimes perform worse at long-form generation: in some architectures, the two skills appear to be inversely correlated.
Several new approaches have emerged beyond the original AgentWrite paradigm.
Chain of Agents (Google Research) uses worker agents that process chunks sequentially, passing summaries to the next agent, with a manager agent synthesizing the final output. It requires no fine-tuning and works with off-the-shelf models.
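Schematically, the pattern looks like the following. This is a toy sketch with a stubbed `llm` call; the prompts and agent roles are illustrative, not Google's exact implementation.

```python
# Chain-of-Agents sketch: worker agents read chunks sequentially, each passing
# a running summary forward; a manager agent synthesizes the final answer.

def llm(prompt: str) -> str:
    return f"[summary of {len(prompt)} chars]"  # stub for a real model call

def chain_of_agents(document: str, question: str, chunk_size: int = 4_000) -> str:
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    summary = ""
    for chunk in chunks:  # worker agents, one chunk at a time
        summary = llm(f"Previous notes: {summary}\n"
                      f"New chunk: {chunk}\n"
                      f"Update the notes relevant to: {question}")
    # Manager agent synthesizes the final output from the accumulated notes.
    return llm(f"Notes: {summary}\nAnswer the question: {question}")
```

Because each worker sees only one chunk plus a compact summary, every call stays within a comfortable context size regardless of total document length.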
Recursive Language Models treat a long prompt as an external environment that the LLM can programmatically examine, decompose, and recursively call itself over. They maintain strong performance even at the 10-million-token scale, often outperforming other approaches by double-digit margins.
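The recursion skeleton underlying this idea can be shown in a toy form. The real Recursive Language Models work is far more sophisticated (the model writes code to inspect and slice its own prompt), but the divide-recurse-combine shape, with a stubbed `llm`, looks like this:

```python
def llm(prompt: str) -> str:
    return f"[answer from {len(prompt)} chars]"  # stub for a real model call

def recursive_answer(prompt: str, limit: int = 8_000) -> str:
    # Base case: the prompt fits comfortably in one call.
    if len(prompt) <= limit:
        return llm(prompt)
    # Recursive case: split, solve each half, then combine the sub-answers.
    mid = len(prompt) // 2
    left = recursive_answer(prompt[:mid], limit)
    right = recursive_answer(prompt[mid:], limit)
    return llm(f"Combine these partial answers:\n{left}\n{right}")
```

The depth of recursion grows only logarithmically with prompt size, which is what makes very large inputs tractable.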
NEXUSSUM (ACL 2025) uses a hierarchical multi-agent framework that achieved a 30% improvement in long-form narrative summarization, particularly on BookSum, where hierarchical processing mitigates context truncation.
The multimodal extension is happening too. LongWriter-V (ACM MM 2025) adapted the same paradigm for vision-language models, creating a 22K-example dataset with multiple input images and outputs up to 10,000 words. A 7B parameter model outperformed GPT-4o on their benchmarks.
The practical lesson
If you need an LLM to produce long-form output today, there are three tiers of solutions.
The simplest is plan-then-generate: have the model outline the document, then write each section independently. This works with any model, requires no fine-tuning, and is easy to implement. The quality ceiling is determined by the model's per-section capability, which is usually good.
If you can fine-tune, include long-output examples in your SFT data. Even a few thousand examples with outputs in the 2K-32K word range will dramatically extend the model's native output capability. This is the LongWriter insight, and it remains the most impactful finding in this space.
If you need truly unbounded generation, multi-agent architectures (Chain of Agents, hierarchical expansion) handle the orchestration, ensuring each agent works within its comfortable range while the system as a whole produces arbitrarily long, coherent output.
The underlying lesson applies beyond just output length. Whenever an LLM seems unable to do something, the first question to ask is: did it see enough examples of this during training? The answer is usually no, and the fix is usually data.
References
- Bai et al., LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, ICLR 2025
- Huang et al., HelloBench: Evaluating Long Text Generation Capabilities, 2024
- Zhang et al., Recursive Language Models, 2025
- Huang et al., Chain of Agents: Large Language Models Collaborating on Long-Context Tasks, Google Research, 2024
- Tu et al., LongWriter-V: Enabling Ultra-Long Generation in Vision-Language Models, ACM MM 2025
- Wei et al., LongGenBench: Long-Form Generation Benchmark, ICLR 2025