peopleanalyst

agents · Q6 · to verify

Laban et al. 2025 — top LLMs degrade an average 39% from single-turn to multi-turn

Across six generation tasks, top open- and closed-weight LLMs lose an average of 39% in performance when moving from single-turn to multi-turn conversation. Underspecification compounds across turns, models lock in to early incorrect framings, and they struggle to course-correct when later turns provide updated information.

Mean performance drop, single-turn vs multi-turn, across six generation tasks for top open-weight and closed-weight LLMs: ~39% average degradation
Sample
Six generation tasks × top open- and closed-weight LLMs (exact model count and per-task N not yet extracted for verification)
Methodology
Benchmark comparison of single-turn vs underspecified multi-turn delivery of the same generation tasks; measured task-completion quality at end of dialogue against single-turn baselines.
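
The comparison above can be sketched as a small calculation. This is a minimal illustration, assuming the headline figure is a mean of per-task relative drops against the single-turn baseline; the card does not confirm how the paper aggregates. The function names and the scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean


def relative_drop(single_turn: float, multi_turn: float) -> float:
    """Fractional drop from the single-turn baseline (0.25 means 25% worse)."""
    if single_turn <= 0:
        raise ValueError("single-turn baseline must be positive")
    return (single_turn - multi_turn) / single_turn


def mean_degradation(scores: dict[str, tuple[float, float]]) -> float:
    """Average the per-task relative drops.

    `scores` maps a task name to (single_turn_score, multi_turn_score).
    Assumption: tasks are weighted equally.
    """
    return mean(relative_drop(s, m) for s, m in scores.values())


# Hypothetical placeholder scores, NOT figures from Laban et al.
example_scores = {
    "task_a": (80.0, 60.0),  # 25% relative drop
    "task_b": (50.0, 25.0),  # 50% relative drop
}

print(f"{mean_degradation(example_scores):.1%}")
```

Averaging relative rather than absolute drops keeps tasks with different score scales comparable, which matters once the per-task breakdown is verified and plugged in.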

What this means

  • Load-bearing finding for the AHI program's longitudinal claims: multi-turn degradation is a coherence-across-turns problem, not a context-window-capacity problem — extending the window does not address it.
  • Mechanism named in the paper — early-turn lock-in plus poor course correction — is structurally close to the persona-drift literature's anchoring failures and the sycophancy literature's preference-capture dynamics; the three may be one phenomenon viewed from three angles.
  • Encyclopedia consequence: any framework that treats 'long context' purely as token capacity (the consulting-vendor framing) misses the load-bearing failure mode.

Source

LLMs Get Lost in Multi-Turn Conversation

arXiv (preprint) · Philippe Laban et al. · 2025 · peer-reviewed

Context

What came before
The long-context-window arms race (100K → 200K → 1M+ tokens) framed extended context as a solved problem in capacity terms. The Laban et al. finding inverts the framing: capability across many turns does not scale with capacity.
What comes next
Verify the exact per-task degradation breakdown, the model list, and per-model N. Connect to Liu et al. 2024 ('Lost in the Middle': capacity vs capability) and the Sharma et al. sycophancy work, which together form a multi-turn-coherence failure cluster.
Where this lands
Encyclopedia Part I §1.x (why LLMs are different from prior software — coherence-across-turns is not a stable property), Part II (workforce — extended-session knowledge work is the AI's structural weak spot), Part V (research frontier — multi-turn benchmark proliferation).