Context window

The maximum number of tokens a model can process in a single inference pass — both the input prompt and the generated output combined. Larger context windows allow longer documents, conversations, and code to be processed. Expressed in tokens (e.g. 32K, 128K).
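Because input and output share the same budget, a caller has to leave room for the completion when sizing a prompt. A minimal sketch of that check; the per-word token estimate is a rough assumption (real tokenizers count differently), and `fits_in_context` is a hypothetical helper, not any particular API:

```python
# Sketch: checking whether a prompt plus the requested completion fits in a
# model's context window. The token count here is a crude whitespace-based
# estimate; use a real tokenizer (e.g. tiktoken) in practice.

def fits_in_context(prompt: str, max_output_tokens: int,
                    context_window: int = 32_768) -> bool:
    # ~1.3 tokens per whitespace-separated word is a rough English heuristic
    est_prompt_tokens = int(len(prompt.split()) * 1.3)
    return est_prompt_tokens + max_output_tokens <= context_window
```

If the check fails, the usual options are truncating the prompt, summarizing older content, or retrieving only the most relevant passages.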

The active context is not stored directly as raw text; it is represented as the KV cache, the computed attention state (key and value vectors) for every token processed so far. This lives in VRAM alongside the model weights during GPU inference. Because the KV cache grows linearly with every token in the context, a large context window has a direct memory cost: a 128K context can consume anywhere from a few to tens of gigabytes of VRAM on its own, depending on the model's architecture and cache precision, independent of the weights themselves.
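The cost follows directly from the model's shape: one key and one value vector per token, per layer, per KV head. The sketch below assumes a Llama-3-8B-like configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache); these numbers are illustrative, not taken from the text above:

```python
# Sketch: estimating KV-cache VRAM for a given context length. The cache
# stores a key and a value vector per token, per layer, per KV head.
# Model dimensions below are assumptions (a Llama-3-8B-like config with
# grouped-query attention); substitute your model's actual numbers.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # 2x for the separate key and value tensors; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

print(kv_cache_bytes(131_072) / 2**30)  # GiB needed for a full 128K context
```

Under these assumptions a full 128K context costs 16 GiB for the cache alone, which is why quantized KV caches and grouped-query attention matter so much for long-context serving.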

The “Lost in the Middle” problem

Having a large context window doesn’t mean a model uses all of it equally well. A 2023 Stanford/Meta paper found that performance follows a U-shaped curve: models recall information best when it sits at the very beginning or end of the context, and accuracy for information in the middle drops substantially, by as much as 30% on multi-document question-answering tasks. The effect is rooted in how attention decays over distance under positional encodings such as RoPE, combined with training-data patterns that mix tasks requiring uniform recall with tasks that prioritize recent context.
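This U-shape is typically measured with a "needle in a haystack"-style probe: the same fact is planted at different relative depths in a long filler context and recall is checked per depth. A minimal sketch, where `ask_model` is a hypothetical stand-in for a real completion call:

```python
# Sketch of a needle-in-a-haystack probe for the U-shaped recall curve:
# insert one fact ("needle") at varying depths in distractor text, then
# check whether the model's answer contains it at each depth.

FILLER = "The sky was clear that day. " * 2000  # distractor text
NEEDLE = "The secret code is 4821."

def build_prompt(depth: float) -> str:
    """Insert the needle at a relative depth in [0, 1] of the context."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + FILLER[cut:] + "\nWhat is the secret code?"

def recall_curve(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    # Per the paper, recall is typically high at depths 0.0 and 1.0
    # and noticeably lower around 0.5
    return {d: "4821" in ask_model(build_prompt(d)) for d in depths}
```

Running this against a real model, rather than examining only the final score on a long-context benchmark, makes the positional sensitivity visible directly.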

However, this is an active area of research and the picture has improved:

  • Larger models show reduced severity. Studies have found that larger, more capable models exhibit a flatter curve or eliminate the U-shape entirely, suggesting the problem is partly a function of model scale.

  • “Found in the Middle” (NeurIPS 2024) proposed Multi-scale Positional Encoding (Ms-PoE), a plug-and-play technique that adjusts position index scaling across attention heads without fine-tuning. It improved middle-context retrieval accuracy by up to 15 percentage points.

  • “Never Lost in the Middle” (2024) proposed position-agnostic decompositional training, teaching models to break queries into sub-questions that retrieve information without relying on positional cues.

  • A 2025 paper reframes the effect as an emergent adaptation rather than a bug — arguing it arises from training data that mixes long-term and short-term memory demands, and that the U-shape is a learned behaviour rather than a pure architectural flaw.
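The Ms-PoE idea above can be sketched in a few lines: each attention head divides its RoPE position indices by a per-head ratio, so some heads see a compressed distance axis with weaker long-range decay. The ratio range (1.2 to 1.8) and the linear spacing across heads are assumptions based on the paper's description, not a faithful reimplementation:

```python
# Sketch of the core Ms-PoE idea (assumed from the paper's description):
# give every attention head its own down-scaling of the position indices
# before RoPE is applied, so heads attend at different effective distances.

import numpy as np

def ms_poe_positions(seq_len: int, n_heads: int,
                     min_ratio: float = 1.2, max_ratio: float = 1.8):
    """Return per-head position indices, shape (n_heads, seq_len)."""
    positions = np.arange(seq_len, dtype=np.float64)
    # One scaling ratio per head, linearly spaced across heads (assumption)
    ratios = np.linspace(min_ratio, max_ratio, n_heads)
    # Dividing positions by a larger ratio compresses relative distances,
    # which weakens RoPE's long-range attention decay for that head
    return positions[None, :] / ratios[:, None]
```

Because only the position indices change, a technique like this can be applied to a pretrained model at inference time, which is what makes it "plug-and-play".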

References

  • Context Window — Google ML Glossary

  • Lost in the Middle — Liu et al., 2023
