Article deep-extraction

Turn an open-access research article into structured, table-cited evidence — automatically. Resolves a real full-text PDF (OpenAlex / Unpaywall / Europe PMC / publisher patterns, %PDF-validated), reads the design, sample, and the effect sizes printed in the results tables, and emits curator-gated effect-size proposals mapped to canonical constructs. Reproduces hand-sourcing with table-level provenance; never auto-promotes.

Algorithm · origin: principia · source: people-analyst/devplane/docs/CAPABILITIES/article-deep-extraction.md

Type: algorithm + pipeline
Origin repo(s): principia (scripts/extract-article.ts + scripts/extraction-to-proposals.ts)
Extraction readiness: live (CLI pipeline; reuses Principia's own OA-discovery, PDF, and LLM primitives; does not fork meta-factory's book pipeline)
Depends on: OpenAlex / Unpaywall / Europe PMC (OA-PDF discovery), pdfplumber (text), an LLM (structured extraction), the construct registry (endpoint mapping), the curator-gated promote path
Last reviewed: 2026-06-08

What it is

Where most ingestion keeps only an article's metadata and abstract, this capability reads the paper itself: the design, the sample, and the effect sizes printed in the results tables. It was built to close the corpus's depth frontier: journal articles had been the shallowest-mined medium in the portfolio.

Given a DOI, the pipeline:

  • Resolves a real full-text PDF — OpenAlex best_oa_location, then the locations[] array, Unpaywall, Europe PMC search-by-DOI, and publisher direct-PDF patterns (Frontiers, PLOS, PMC) — validating the %PDF magic bytes so a landing-page HTML never masquerades as a paper.
  • Extracts text (pdfplumber) and runs an article-tuned structured extraction (methods → results → discussion ontology, not a book's chapter framing).
  • Emits a structured record — study-design facets (the 8-facet taxonomy), sample (N, k), the effects read from the results tables (statistic, value, k, N, and the table they came from), key claims, and license-gated salient passages.
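
The resolution step in the first bullet can be sketched as an ordered fallback with a magic-byte check. This is a minimal illustration, not the code in scripts/extract-article.ts: `Candidate`, `Fetcher`, `ResolvedPdf`, and `resolvePdf` are hypothetical names, and the downloader is injected so the sketch makes no claims about how the real pipeline calls OpenAlex, Unpaywall, or Europe PMC.

```typescript
// Illustrative sketch only: Candidate, Fetcher, ResolvedPdf and resolvePdf are
// hypothetical names, not the identifiers used in scripts/extract-article.ts.

/** True when the payload starts with the PDF magic bytes "%PDF" (0x25 0x50 0x44 0x46). */
function isPdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46];
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

/** One candidate full-text URL and the discovery source that proposed it. */
interface Candidate {
  source: string; // e.g. "openalex.best_oa_location", "unpaywall", "europepmc"
  url: string;
}

/** Injected downloader; returns null when nothing could be fetched. */
type Fetcher = (url: string) => Promise<Uint8Array | null>;

interface ResolvedPdf extends Candidate {
  bytes: Uint8Array;
}

/** Try candidates in priority order and return the first payload that is really a PDF. */
async function resolvePdf(
  candidates: Candidate[],
  fetchBytes: Fetcher
): Promise<ResolvedPdf | null> {
  for (const candidate of candidates) {
    let bytes: Uint8Array | null = null;
    try {
      bytes = await fetchBytes(candidate.url);
    } catch {
      bytes = null; // network error or block: fall through to the next source
    }
    // A landing page served as HTML fails this check and is skipped.
    if (bytes !== null && isPdf(bytes)) {
      return { ...candidate, bytes };
    }
  }
  return null; // nothing fetchable: the caller can route the DOI elsewhere
}
```

Passing the candidates in the order listed above (OpenAlex best_oa_location, locations[], Unpaywall, Europe PMC, publisher patterns) reproduces the described priority; a `null` result corresponds to the articles that cannot be reached.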

A second stage maps each effect's free-text endpoints to canonical construct IDs and emits curator-gated effect-size proposals — it never auto-promotes (the write path stays human-in-the-loop).
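
A rough sketch of that second stage, under stated assumptions: a plain alias table stands in for whatever matching the real extraction-to-proposals script performs, and the `pending_review` status, the `EffectSizeProposal` shape, and the function names are hypothetical. The point it illustrates is that every output is a proposal awaiting review, and effects whose endpoints cannot be mapped are set aside rather than guessed.

```typescript
// Illustrative sketch: the alias table, the proposal shape and the "pending_review"
// status are assumptions for this example, not Principia's actual registry or schema.

interface ExtractedEffect {
  from: string; // free-text endpoint as read from the paper
  to: string;
  statistic_type: string;
  value: number;
  location: string; // the table the value came from
}

interface EffectSizeProposal {
  from_construct: string;
  to_construct: string;
  statistic_type: string;
  value: number;
  location: string;
  status: "pending_review"; // the only status this stage can emit
}

/** Lower-case the label and collapse punctuation and whitespace to single spaces. */
function normalizeLabel(label: string): string {
  return label.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

/** Map both endpoints to construct IDs; keep unmapped effects for a human to look at. */
function toProposals(effects: ExtractedEffect[], aliases: Map<string, string>) {
  const proposals: EffectSizeProposal[] = [];
  const unmapped: ExtractedEffect[] = [];
  for (const effect of effects) {
    const fromId = aliases.get(normalizeLabel(effect.from));
    const toId = aliases.get(normalizeLabel(effect.to));
    if (fromId === undefined || toId === undefined) {
      unmapped.push(effect); // no canonical construct found: do not guess
      continue;
    }
    proposals.push({
      from_construct: fromId,
      to_construct: toId,
      statistic_type: effect.statistic_type,
      value: effect.value,
      location: effect.location,
      status: "pending_review",
    });
  }
  return { proposals, unmapped };
}
```

Because the proposal type admits no status other than `pending_review`, promotion has to happen in a separate, curator-driven step, which mirrors the human-in-the-loop write path described above.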

Data shape

extract-article --doi <doi>  →  ArticleExtraction {
  design: { design_family, causal_identification, time_structure, data_mode, analysis_math },
  sample: { n, k_studies, population, setting, country },
  effects: [{ from, to, statistic_type, value, k_studies, n_total, location, corrected }],
  salient_passages: [{ section, quote }], key_claims, limitations
}
extraction-to-proposals  →  curator EffectSize proposals (mapped to construct.<id>)
  • Stateless per article; the extraction JSON is the cache (idempotent / resumable).
  • Curator-gated: proposals are reviewed before promotion; a verify-job backstop independently checks each promoted claim against its source.
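
The "extraction JSON is the cache" property can be sketched as a lookup keyed by DOI that only runs the expensive extraction on a miss. This is an assumption-laden illustration: `JsonStore`, `cacheKey`, and `extractOnce` are hypothetical names, the key format is invented for the example, and an in-memory `Map` stands in for the JSON files a CLI pipeline would presumably write.

```typescript
// Illustrative sketch: JsonStore, cacheKey and extractOnce are hypothetical names,
// and the key format is an assumption, not the pipeline's actual file layout.

/** Minimal key/value store; a Map<string, string> satisfies it, as could a directory of JSON files. */
interface JsonStore {
  get(key: string): string | undefined;
  set(key: string, value: string): unknown;
}

/** Derive a stable, filename-safe key from a DOI (DOIs are case-insensitive). */
function cacheKey(doi: string): string {
  const bare = doi
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, "");
  return bare.replace(/[^a-z0-9._-]+/g, "_") + ".json";
}

/** Return the stored extraction if one exists; otherwise extract, store, and return it. */
async function extractOnce<T>(
  doi: string,
  store: JsonStore,
  extract: (doi: string) => Promise<T>
): Promise<T> {
  const key = cacheKey(doi);
  const cached = store.get(key);
  if (cached !== undefined) {
    return JSON.parse(cached) as T; // re-running the same DOI does no new work
  }
  const fresh = await extract(doi);
  store.set(key, JSON.stringify(fresh));
  return fresh;
}
```

With this shape, an interrupted bulk pass can simply be restarted: articles that already have a stored record are skipped, and only the remainder is extracted.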

Why it matters

  • Reproduces hand-sourcing, automatically. On a real test it pulled the same pooled correlations a human researcher had hand-read from the same paper — with table-level provenance — and a single bulk pass grew the registry by ~150 effects.
  • The honest ceiling is named. ~42% of "open access" articles resolve to a fetchable full-text PDF; the rest are landing-page / bronze OA or bot-blocked and route to the acquisition queue. The pipeline reports what it could and couldn't reach rather than silently truncating.
  • It compounds. Sourcing waves add evidence one agent at a time; this adds it one article at a time, automatically, and keeps paying off as new open-access primaries are harvested.
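
The reporting behaviour behind the "honest ceiling" bullet can be illustrated with a small tally over per-article outcomes. This is a sketch under assumptions: the `Outcome` and `CoverageReport` shapes, the free-text `reason` field, and `summarize` are hypothetical and only show the idea of counting what was and was not reachable while collecting the unreachable DOIs for the acquisition queue.

```typescript
// Illustrative sketch: Outcome, CoverageReport and summarize are hypothetical;
// the reason strings are whatever labels the caller chooses to record.

type Outcome =
  | { doi: string; status: "resolved" }
  | { doi: string; status: "unresolved"; reason: string };

interface CoverageReport {
  total: number;
  resolved: number;
  resolvedRate: number; // fraction between 0 and 1
  byReason: Record<string, number>;
  acquisitionQueue: string[]; // DOIs that need another acquisition route
}

/** Count resolved and unresolved articles and list the DOIs that were not reached. */
function summarize(outcomes: Outcome[]): CoverageReport {
  const report: CoverageReport = {
    total: outcomes.length,
    resolved: 0,
    resolvedRate: 0,
    byReason: {},
    acquisitionQueue: [],
  };
  for (const outcome of outcomes) {
    if (outcome.status === "resolved") {
      report.resolved += 1;
    } else {
      report.byReason[outcome.reason] = (report.byReason[outcome.reason] ?? 0) + 1;
      report.acquisitionQueue.push(outcome.doi);
    }
  }
  report.resolvedRate = report.total === 0 ? 0 : report.resolved / report.total;
  return report;
}
```

Every input article ends up either in the resolved count or in the queue with a reason, so nothing drops out of the report silently.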