peopleanalyst


gen-ai · Q7 · to verify

Shumailov et al. 2024 — model collapse from recursively generated data (Nature)

Generative AI models trained on data that includes their own previous outputs progressively forget the true data distribution over generations — in particular, low-probability ('tail') events disappear first, and after enough iterations the model converges on a degenerate distribution with little resemblance to the original.

Distribution distance from original training corpus across model generations under recursive self-training (perplexity drift; loss of distributional tails)
Tails of the data distribution are lost within a handful of generations; convergence to a degenerate distribution is theoretically inevitable in the recursive-self-training regime. Specific numerical values for perplexity drift were not extracted to verification; see provenance.
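The tail-loss claim can be illustrated with a toy discrete distribution. The sketch below is not the paper's code or experimental setup. It assumes a hypothetical 1,000-item vocabulary with Zipf-like probabilities, and it stands in for "training" with a simple maximum-likelihood refit to a finite sample; `resample_generations`, `vocab_size`, and the sample sizes are illustrative choices. Once a rare item draws zero samples, its estimated probability is zero and it cannot reappear in later generations.

```python
import random
from collections import Counter


def resample_generations(probs, n_samples, n_generations, seed=0):
    """Refit a categorical distribution to its own samples, generation after generation.

    Returns the final probabilities and the number of categories with
    non-zero probability at each generation (generation 0 = original).
    """
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(n_generations):
        # "Generate data" from the current model.
        draws = rng.choices(categories, weights=probs, k=n_samples)
        counts = Counter(draws)
        # "Train" the next model: maximum-likelihood estimate from the samples.
        probs = [counts[c] / n_samples for c in categories]
        support_sizes.append(sum(p > 0 for p in probs))
    return probs, support_sizes


# Hypothetical toy vocabulary with a Zipf-like long tail (illustrative only).
vocab_size = 1000
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
total = sum(weights)
true_probs = [w / total for w in weights]

final_probs, support_sizes = resample_generations(
    true_probs, n_samples=2000, n_generations=10
)
print(support_sizes)  # surviving categories per generation
```

The printed list gives the count of surviving categories at each generation. In this setup it can only shrink or stay flat, which is the tail-first erosion the card describes, reduced to its simplest form.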
Sample
Simulation across multiple model families (Gaussian mixture models, variational autoencoders, large language models) with iterative self-training cycles. Specific N of iterations / models not extracted to verification.
Methodology
Theoretical analysis plus empirical demonstration of recursive-training degeneration across multiple model families; trained successive generations of models on data sampled from prior model generations and measured distributional drift.
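The recursive-training loop described above can be sketched with the simplest possible "model", a single Gaussian refit to a finite sample drawn from the previous generation's fit. This is a minimal illustration under stated assumptions, not a reproduction of the paper's experiments: the function name `recursive_gaussian_fit`, the sample size of 20, and the 300 generations are all hypothetical choices made to keep the effect visible.

```python
import random
import statistics


def recursive_gaussian_fit(mu, sigma, n_samples, n_generations, seed=0):
    """Fit a Gaussian to samples from the previous generation's Gaussian.

    Returns a list of (mu, sigma) pairs, one per generation, starting
    with the original parameters at index 0.
    """
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(n_generations):
        # Generation k produces the training data for generation k + 1.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Maximum-likelihood refit: sample mean and population std deviation.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append((mu, sigma))
    return history


# Assumed toy settings: small samples make the drift show up quickly.
history = recursive_gaussian_fit(
    mu=0.0, sigma=1.0, n_samples=20, n_generations=300
)
for generation in (0, 10, 100, 300):
    gen_mu, gen_sigma = history[generation]
    print(generation, gen_mu, gen_sigma)
```

Each refit loses a little variance on average and adds a little random drift to the mean, so over many generations the fitted standard deviation tends toward zero while the mean wanders away from its starting value. Larger samples slow this down in the toy setting but do not remove the finite-sample error that drives it.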

What this means

  • The 'model collapse' phenomenon is the digital-ecological analog of niche-construction-induced variance collapse: the AI's outputs become its own training environment, and the loop systematically erodes diversity.
  • Implies that uncontrolled use of LLM-generated web content as future training data creates a feedback loop that caps the intelligence of future models at the level of the current model.
  • Provides a load-bearing mechanism for the encyclopedia's Part I §1.3 'methodology gap' — software engineering and knowledge work that uses AI outputs without provenance discipline is a model-collapse-like substrate for the human-AI system.

Source

AI models collapse when trained on recursively generated data

Nature · Ilia Shumailov et al. · 2024-07-24 · peer-reviewed

Context

What came before
Pre-2024 LLM training discourse treated web-scale text as an essentially infinite, externally-sourced training substrate. The implicit assumption was that successive model generations could continue scaling on more of the same kind of data.
What comes next
Verification of the specific numerical drift rates (iterations to tail loss, perplexity-curve shapes). Comparison with Cito & Bork 2025 'code collapse' analogue for software ecosystems. Empirical work on whether commercial provider data-filtering pipelines (e.g., detecting and filtering AI-generated text during training data curation) actually prevent the collapse trajectory.
Where this lands
Encyclopedia Part I §1.3 (methodology gap / why this isn't software-as-usual) and Part V (research frontier — feedback-loop measurement).