gen-ai · Q7 · to verify
Shumailov et al. 2024 — model collapse from recursively generated data (Nature)
Generative AI models trained on data that includes their own previous outputs progressively forget the true data distribution over generations — in particular, low-probability ('tail') events disappear first, and after enough iterations the model converges on a degenerate distribution with little resemblance to the original.
Distribution distance from original training corpus across model generations under recursive self-training (perplexity drift; loss of distributional tails)
Tails of the data distribution are lost within a handful of generations; convergence to a degenerate distribution is theoretically inevitable in the recursive-self-training regime. Specific numerical values for perplexity drift were not extracted to verification; see provenance.
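One simplified way to see why convergence to a degenerate distribution is unavoidable in this regime is the worked case below. It assumes a single one-dimensional Gaussian that is refit by maximum likelihood from M fresh samples of the previous generation's fit. This setup and the symbols M, n, and \hat\sigma_n^2 are illustrative choices made here, not figures or notation taken from the paper.

```latex
\begin{align}
X_1,\dots,X_M \mid \hat\mu_n,\hat\sigma_n^2 &\sim \mathcal{N}(\hat\mu_n,\hat\sigma_n^2), \\
\hat\sigma_{n+1}^2 &= \frac{1}{M}\sum_{i=1}^{M}\bigl(X_i-\bar X\bigr)^2, \\
\mathbb{E}\bigl[\hat\sigma_{n+1}^2 \mid \hat\sigma_n^2\bigr] &= \frac{M-1}{M}\,\hat\sigma_n^2, \\
\mathbb{E}\bigl[\hat\sigma_n^2\bigr] &= \Bigl(1-\frac{1}{M}\Bigr)^{n}\sigma_0^2 \;\longrightarrow\; 0 \quad (n\to\infty).
\end{align}
```

Under these assumptions the expected variance shrinks geometrically with each generation, so the fitted distribution narrows and its tails go first. A larger M slows the decay but does not stop it.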
- Sample
- Simulation across multiple model families (Gaussian mixture models, variational autoencoders, large language models) with iterative self-training cycles. Specific N of iterations / models not extracted to verification.
- Methodology
- Theoretical analysis plus empirical demonstration of recursive-training degeneration across multiple model families; trained successive generations of models on data sampled from prior model generations and measured distributional drift.
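The recursive-training loop described above can be sketched with a toy categorical model. This is a minimal illustration, not the paper's experimental code: the distribution, the sample size of 200, the 30 generations, and the function names `recursive_refit` and `support_size` are all hypothetical choices made for this example. Each generation samples a synthetic corpus from the current model and refits by empirical frequency, so any category that draws zero samples is lost for good.

```python
import random
from collections import Counter


def recursive_refit(probs, samples_per_gen, generations, seed=0):
    """Repeatedly sample from a categorical distribution and refit it.

    Returns a list of distributions, one per generation
    (index 0 is the original distribution).
    """
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    current = list(probs)
    history = [current]
    for _ in range(generations):
        # "Generate" a synthetic corpus from the current model.
        draws = rng.choices(categories, weights=current, k=samples_per_gen)
        counts = Counter(draws)
        # "Train" the next model: the maximum-likelihood estimate of a
        # categorical distribution is its empirical frequency table.
        current = [counts.get(c, 0) / samples_per_gen for c in categories]
        history.append(current)
    return history


def support_size(dist):
    """Number of categories that still have non-zero probability."""
    return sum(1 for p in dist if p > 0)


# Hypothetical toy data: three common categories plus a long tail of
# 50 rare ones that together hold 10% of the probability mass.
original = [0.4, 0.3, 0.2] + [0.1 / 50] * 50
history = recursive_refit(original, samples_per_gen=200, generations=30, seed=0)
sizes = [support_size(d) for d in history]
print(sizes)
```

The printed list is the number of surviving categories per generation. It can only stay flat or fall, because a category with zero estimated probability is never sampled again; the rare tail categories are the ones that drop out earliest.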
What this means
- The 'model collapse' phenomenon is the digital-ecological analog of niche-construction-induced variance collapse: the AI's outputs become its own training environment, and the loop systematically erodes diversity.
- Implies that uncontrolled use of LLM-generated web content as future training data creates a feedback loop that risks capping future models at, or degrading them below, the level of the current model.
- Provides a load-bearing mechanism for the encyclopedia's Part I §1.3 'methodology gap' — software engineering and knowledge work that uses AI outputs without provenance discipline is a model-collapse-like substrate for the human-AI system.
Source
AI models collapse when trained on recursively generated data
Nature · Ilia Shumailov et al. · 2024-07-24 · peer-reviewed
Context
- What came before
- Pre-2024 LLM training discourse treated web-scale text as an essentially infinite, externally-sourced training substrate. The implicit assumption was that successive model generations could continue scaling on more of the same kind of data.
- What comes next
- Verification of the specific numerical drift rates (iterations to tail loss, perplexity-curve shapes). Comparison with the Cito & Bork 2025 'code collapse' analogue for software ecosystems. Empirical work on whether commercial providers' data-filtering pipelines (e.g., detection of AI-generated text during training data curation) actually prevent the collapse trajectory.
- Where this lands
- Encyclopedia Part I §1.3 (methodology gap / why this isn't software-as-usual) and Part V (research frontier — feedback-loop measurement).