
agents · Q7 · to verify

Liu et al. 2024 — language models exhibit U-shaped position bias on long inputs ('Lost in the Middle')

Language models, including those marketed as long-context, perform worst when the relevant information sits in the middle of a long input, showing a U-shaped position bias that favors the beginning and the end. Long-context capacity (how many tokens fit in the window) does not entail long-context capability (how well the model uses those tokens).

Accuracy on multi-document QA and key-value retrieval as a function of the position of the relevant information within the input context
U-shaped position effect: accuracy is highest when the relevant information is at the beginning or end of the context and substantially lower when it is in the middle (specific point estimates not yet extracted for verification)
Sample
Multiple open- and closed-source LLMs evaluated on multi-document QA and synthetic key-value retrieval tasks (specific N not yet extracted for verification)
Methodology
Controlled-position manipulation: relevant document/key placed at varying positions within a long input; accuracy measured at each position.
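The controlled-position manipulation can be sketched for the synthetic key-value task roughly as follows. This is a minimal illustration, not the authors' code. The JSON-of-random-UUIDs layout is loosely modeled on the paper's synthetic task. The prompt wording, the scoring rule (a hit when the gold value appears anywhere in the model output), and the names `build_kv_prompt`, `accuracy_by_position`, and `oracle_model` are all assumptions for illustration. `oracle_model` is a hypothetical stand-in for a real LLM call: it parses the JSON directly, so it has no position bias by construction.

```python
import json
import random
import uuid
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def build_kv_prompt(num_pairs: int, gold_position: int,
                    rng: random.Random) -> Tuple[str, str]:
    """Build one synthetic key-value retrieval prompt (hypothetical helper).

    The queried key is placed at index `gold_position` among `num_pairs`
    random UUID key-value pairs; all other pairs act as distractors.
    """
    def rand_uuid() -> str:
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(rand_uuid(), rand_uuid()) for _ in range(num_pairs)]
    query_key, answer = pairs[gold_position]
    # Python dicts keep insertion order, so json.dumps preserves the chosen position.
    context = json.dumps(dict(pairs), indent=1)
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"{context}\n\n"
        f'Key: "{query_key}"\n'
        "Corresponding value:"
    )
    return prompt, answer


def accuracy_by_position(model: Callable[[str], str], num_pairs: int,
                         positions: List[int], trials: int,
                         seed: int = 0) -> Dict[int, float]:
    """Fraction of trials at each gold position where the answer appears in the output."""
    rng = random.Random(seed)
    correct: Dict[int, int] = defaultdict(int)
    for pos in positions:
        for _ in range(trials):
            prompt, answer = build_kv_prompt(num_pairs, pos, rng)
            # Assumed scoring rule: substring match of the gold value.
            if answer in model(prompt):
                correct[pos] += 1
    return {pos: correct[pos] / trials for pos in positions}


def oracle_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: parses the JSON and looks the key up.

    It is position-blind by construction; swap in a real model to measure bias.
    """
    table = json.loads(prompt[prompt.index("{"):prompt.rindex("}") + 1])
    key = prompt.split('Key: "')[1].split('"')[0]
    return table[key]


if __name__ == "__main__":
    # Gold key at the start, middle, and end of a 30-pair context.
    print(accuracy_by_position(oracle_model, num_pairs=30,
                               positions=[0, 15, 29], trials=5))
```

With the oracle stub every position scores 1.0. Only when `model` wraps a real inference call does the returned dictionary carry information, and plotting it against position yields the kind of accuracy-by-position curve the paper reports. The same loop would apply to multi-document QA by placing the answer-bearing document at index k among distractor documents.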

Figures

  • Accuracy by position of relevant document in input context — characteristic U-shape across models

    Figure in the paper (TACL 2024) showing position-vs-accuracy curves; the image itself was not extracted

What this means

  • Establishes the canonical 'capacity ≠ capability' distinction for long-context LLMs: the marketing claim ('we have a 1M-token context window') does not entail the usage claim ('the model uses 1M tokens well').
  • Counter-evidence for any encyclopedia framing that treats context-window size as the load-bearing variable in extended-session work. The real variable is position-conditional accuracy across the window.
  • Pairs with the Laban et al. multi-turn-degradation finding: capacity does not solve usage; sequential coherence does not improve with more tokens.

Source

Lost in the Middle: How Language Models Use Long Contexts

Transactions of the Association for Computational Linguistics · Nelson F. Liu et al. · 2024 · peer-reviewed

Context

What came before
Vendor messaging through 2023-2024 treated context-window expansion as the load-bearing capability for long-document and long-conversation tasks. The Liu et al. finding (preprint 2023; TACL 2024) is the canonical demonstration that this framing is wrong.
What comes next
Verify exact accuracy-by-position numbers and the model list. Connect to the multi-turn-degradation literature (Laban et al. 2025) as the two halves of the long-context-capability story: position-bias within input, plus turn-degradation across dialogue.
Where this lands
Encyclopedia Part I (foundations — what AI does differently than prior software; capacity vs capability), Part II (workforce — practical implications for extended knowledge work), Part V (research frontier — what long-context benchmarks should measure).