agents · Q7 · to verify
Liu et al. 2024 — language models exhibit U-shaped position bias on long inputs ('Lost in the Middle')
Language models, including those marketed as long-context, perform worst when the relevant information sits in the middle of a long input and best when it sits near the beginning or end, giving a U-shaped accuracy curve. A large context window measured in tokens does not guarantee that the model uses that window well.
Accuracy on multi-document QA and key-value retrieval as a function of the position of relevant information within the input context
U-shaped position effect: highest accuracy when relevant information is at the beginning or end, substantially lower when it is in the middle of the context (specific point estimates not yet extracted for verification)
- Sample
- Multiple open- and closed-source LLMs across multi-document QA and synthetic key-value retrieval tasks (specific N not yet extracted for verification)
- Methodology
- Controlled-position manipulation: relevant document/key placed at varying positions within a long input; accuracy measured at each position.
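The controlled-position manipulation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `Example`, `build_context`, `accuracy_by_position`, the `Document [i]` prompt layout, and the substring-match scoring are all assumptions made for this note, and `answer_fn` is a hypothetical stand-in for whatever model call is being evaluated.

```python
from typing import Callable, Dict, List, TypedDict


class Example(TypedDict):
    """One QA item (hypothetical schema for this sketch)."""
    question: str
    answer: str
    relevant: str            # the document that contains the answer
    distractors: List[str]   # documents that do not contain the answer


def build_context(distractors: List[str], relevant: str, position: int) -> str:
    """Place the relevant document at index `position` among the distractors."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position out of range")
    docs = distractors[:position] + [relevant] + distractors[position:]
    return "\n\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))


def accuracy_by_position(
    examples: List[Example],
    answer_fn: Callable[[str, str], str],  # (context, question) -> model output
) -> Dict[int, float]:
    """Sweep the relevant document across every slot and score each slot."""
    if not examples:
        raise ValueError("need at least one example")
    # Only sweep slots that exist for every example.
    n_slots = min(len(ex["distractors"]) for ex in examples) + 1
    results: Dict[int, float] = {}
    for pos in range(n_slots):
        hits = 0
        for ex in examples:
            context = build_context(ex["distractors"], ex["relevant"], pos)
            prediction = answer_fn(context, ex["question"])
            # Assumed scoring rule: gold answer appears in the output.
            hits += ex["answer"].lower() in prediction.lower()
        results[pos] = hits / len(examples)
    return results
```

Plotting the returned dictionary (position on the x-axis, accuracy on the y-axis) is what would reveal a U-shape if one exists. Everything other than the position of the relevant document is held fixed, which is why the resulting curve can be read as a position effect.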
Figures
Accuracy by position of relevant document in input context — characteristic U-shape across models
Figure in the paper (TACL 2024) showing position-vs-accuracy curves; not extracted as image
What this means
- Establishes the canonical 'capacity ≠ capability' distinction for long-context LLMs: the marketing claim ('we have a 1M-token context window') does not entail the usage claim ('the model uses 1M tokens well').
- Counter-evidence for any encyclopedia framing that treats context-window size as the load-bearing variable in extended-session work. The real variable is position-conditional accuracy across the window.
- Pairs with the Laban et al. multi-turn-degradation finding: a larger window does not by itself fix how the model uses its input, and coherence across turns does not improve simply because more tokens fit.
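One way to turn "position-conditional accuracy" into a single comparable number is an edge-minus-middle gap. The sketch below is an assumption of this note, not a metric defined by Liu et al.; `edge_middle_gap` is a hypothetical helper, and the values in the usage line are toy numbers, not results from the paper.

```python
from typing import Sequence


def edge_middle_gap(accuracies: Sequence[float]) -> float:
    """Mean accuracy at the first and last positions minus the mean
    accuracy over all interior positions. A positive value means the
    model does better at the edges than in the middle."""
    if len(accuracies) < 3:
        raise ValueError("need at least three positions")
    edge = (accuracies[0] + accuracies[-1]) / 2
    interior = accuracies[1:-1]
    middle = sum(interior) / len(interior)
    return edge - middle


# Toy values for illustration only; NOT taken from Liu et al.
toy_curve = [0.8, 0.6, 0.5, 0.6, 0.8]
gap = edge_middle_gap(toy_curve)
```

A gap near zero would indicate a flat curve; a large positive gap at a given context length is the pattern the note calls U-shaped. Two models with the same window size can differ widely on this number, which is the point of separating capacity from capability.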
Source
Lost in the Middle: How Language Models Use Long Contexts
Transactions of the Association for Computational Linguistics · Nelson F. Liu et al. · 2024 · peer-reviewed
Context
- What came before
- Vendor messaging through 2023-2024 treated context-window expansion as the load-bearing capability for long-document and long-conversation tasks. The Liu et al. finding (preprint 2023; TACL 2024) is the canonical demonstration that this framing is wrong.
- What comes next
- Verify exact accuracy-by-position numbers and the model list. Connect to the multi-turn-degradation literature (Laban et al. 2025) as the two halves of the long-context-capability story: position-bias within input, plus turn-degradation across dialogue.
- Where this lands
- Encyclopedia Part I (foundations: what AI does differently from prior software; capacity vs capability), Part II (workforce: practical implications for extended knowledge work), Part V (research frontier: what long-context benchmarks should measure).