Article deep-extraction
Turn an open-access research article into structured, table-cited evidence — automatically. Resolves a real full-text PDF (OpenAlex / Unpaywall / Europe PMC / publisher patterns, %PDF-validated), reads the design, sample, and the effect sizes printed in the results tables, and emits curator-gated effect-size proposals mapped to canonical constructs. Reproduces hand-sourcing with table-level provenance; never auto-promotes.
Type: algorithm + pipeline
Origin repo(s): principia (scripts/extract-article.ts + scripts/extraction-to-proposals.ts)
Extraction readiness: live (CLI pipeline; reuses Principia's own OA-discovery, PDF, and LLM primitives — does not fork meta-factory's book pipeline)
Depends on: OpenAlex / Unpaywall / Europe PMC (OA-PDF discovery), pdfplumber (text), an LLM (structured extraction), the construct registry (endpoint mapping), the curator-gated promote path
Last reviewed: 2026-06-08
What it is
Turn an open-access research article into structured, table-cited evidence, automatically. Where most ingestion keeps only an article's metadata and abstract, this reads the paper itself: the design, the sample, and the effect sizes printed in the results tables. It was built to close the corpus's depth frontier: journal articles had been the shallowest-mined medium in the portfolio.
Given a DOI, the pipeline:
- Resolves a real full-text PDF: OpenAlex best_oa_location, then the locations[] array, Unpaywall, Europe PMC search-by-DOI, and publisher direct-PDF patterns (Frontiers, PLOS, PMC), validating the %PDF magic bytes so a landing-page HTML never masquerades as a paper.
- Extracts text (pdfplumber) and runs an article-tuned structured extraction (methods → results → discussion ontology, not a book's chapter framing).
- Emits a structured record — study-design facets (the 8-facet taxonomy), sample (N, k), the effects read from the results tables (statistic, value, k, N, and the table they came from), key claims, and license-gated salient passages.
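The resolve-and-validate step above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/extract-article.ts: the names PdfCandidate, Resolution, FetchBytes, looksLikePdf, and resolvePdf are all hypothetical, the downloader is injected so the sketch makes no network calls of its own, and the header check assumes the %PDF bytes sit at offset 0.

```typescript
// Illustrative sketch only: every name here is hypothetical, not Principia's API.

/** One place a full-text PDF might live, listed in the order it should be tried. */
interface PdfCandidate {
  source: string; // e.g. "openalex:best_oa_location", "unpaywall", "europepmc"
  url: string;
}

/** Outcome of a resolution attempt, including why earlier candidates were skipped. */
interface Resolution {
  pdf: { source: string; url: string; bytes: Uint8Array } | null;
  rejected: { source: string; reason: string }[];
}

/** Injected downloader; returns null when the request fails or is blocked. */
type FetchBytes = (url: string) => Promise<Uint8Array | null>;

/** True when the payload begins with the four magic bytes "%PDF". */
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%", "P", "D", "F"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

/** Walks the candidates in order and returns the first one that is a real PDF. */
async function resolvePdf(
  candidates: PdfCandidate[],
  fetchBytes: FetchBytes
): Promise<Resolution> {
  const rejected: Resolution["rejected"] = [];
  for (const candidate of candidates) {
    const bytes = await fetchBytes(candidate.url);
    if (bytes === null) {
      rejected.push({ source: candidate.source, reason: "fetch failed" });
    } else if (!looksLikePdf(bytes)) {
      // Typically a landing page served as HTML instead of the paper itself.
      rejected.push({ source: candidate.source, reason: "missing %PDF header" });
    } else {
      return { pdf: { source: candidate.source, url: candidate.url, bytes }, rejected };
    }
  }
  return { pdf: null, rejected }; // nothing fetchable: caller can queue for acquisition
}
```

Keeping the rejected list alongside the result is one way to let a caller report which sources were tried and why each was skipped, rather than failing silently.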
A second stage maps each effect's free-text endpoints to canonical construct IDs and emits curator-gated effect-size proposals — it never auto-promotes (the write path stays human-in-the-loop).
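A rough sketch of that second stage, under stated assumptions: the registry is modelled as a plain lookup from normalized label to construct ID, matching is exact after normalization, and every proposal carries a fixed pending status. The names ConstructRegistry, Proposal, normalizeLabel, and toProposals, the field types, and the "pending_review" status value are hypothetical; the real mapping in scripts/extraction-to-proposals.ts may work differently.

```typescript
// Illustrative sketch only: types and names are assumptions, not Principia's API.

/** An effect as read from a results table (field names follow the data shape). */
interface ExtractedEffect {
  from: string; // free-text endpoint label
  to: string; // free-text endpoint label
  statistic_type: string;
  value: number;
  location: string; // e.g. the table the value was read from
}

/** Hypothetical registry: normalized label -> canonical "construct.<id>". */
type ConstructRegistry = Map<string, string>;

/** A proposal awaiting curator review; there is no "promoted" state here. */
interface Proposal {
  from_construct: string;
  to_construct: string;
  statistic_type: string;
  value: number;
  location: string;
  status: "pending_review";
}

/** Lowercases, trims, and collapses whitespace so label variants compare equal. */
function normalizeLabel(label: string): string {
  return label.trim().toLowerCase().replace(/\s+/g, " ");
}

/** Maps both endpoints of each effect; effects with an unknown endpoint are set aside. */
function toProposals(
  effects: ExtractedEffect[],
  registry: ConstructRegistry
): { proposals: Proposal[]; unmapped: ExtractedEffect[] } {
  const proposals: Proposal[] = [];
  const unmapped: ExtractedEffect[] = [];
  for (const effect of effects) {
    const from = registry.get(normalizeLabel(effect.from));
    const to = registry.get(normalizeLabel(effect.to));
    if (from === undefined || to === undefined) {
      unmapped.push(effect); // never guess a construct ID
      continue;
    }
    proposals.push({
      from_construct: from,
      to_construct: to,
      statistic_type: effect.statistic_type,
      value: effect.value,
      location: effect.location,
      status: "pending_review",
    });
  }
  return { proposals, unmapped };
}
```

Because the only status the type allows is a pending one, this stage has no way to write a promoted record, which mirrors the human-in-the-loop write path described above.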
Data shape
extract-article --doi <doi> → ArticleExtraction {
  design: { design_family, causal_identification, time_structure, data_mode, analysis_math },
  sample: { n, k_studies, population, setting, country },
  effects: [{ from, to, statistic_type, value, k_studies, n_total, location, corrected }],
  salient_passages: [{ section, quote }], key_claims, limitations
}
extraction-to-proposals → curator EffectSize proposals (mapped to construct.<id>)
- Stateless per article; the extraction JSON is the cache (idempotent / resumable).
- Curator-gated: proposals are reviewed before promotion; a verify-job backstop independently checks each promoted claim against its source.
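The "extraction JSON is the cache" property might look something like the sketch below. It assumes the stored JSON is keyed by a lowercased DOI and that a hit skips the expensive extraction entirely. ExtractionStore, MemoryStore, and extractOnce are hypothetical names, and the in-memory store stands in for whatever file layout the real pipeline uses.

```typescript
// Illustrative sketch only: names and storage layout are assumptions.

/** Minimal key-value view of wherever the extraction JSON is kept. */
interface ExtractionStore {
  get(key: string): string | undefined; // stored JSON text, if any
  set(key: string, json: string): void;
}

/** In-memory stand-in; a file-backed store could implement the same interface. */
class MemoryStore implements ExtractionStore {
  private items = new Map<string, string>();
  get(key: string): string | undefined {
    return this.items.get(key);
  }
  set(key: string, json: string): void {
    this.items.set(key, json);
  }
}

/** Returns the stored record when present; otherwise extracts once and stores it. */
async function extractOnce<T>(
  doi: string,
  store: ExtractionStore,
  extract: (doi: string) => Promise<T>
): Promise<{ record: T; cached: boolean }> {
  const key = doi.trim().toLowerCase(); // DOIs are case-insensitive
  const hit = store.get(key);
  if (hit !== undefined) {
    return { record: JSON.parse(hit) as T, cached: true };
  }
  const record = await extract(doi);
  store.set(key, JSON.stringify(record));
  return { record, cached: false };
}
```

With this shape, re-running a bulk pass after an interruption only pays for the articles that have no stored record yet.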
Why it matters
- Reproduces hand-sourcing, automatically. On a real test it pulled the same pooled correlations a human researcher had hand-read from the same paper — with table-level provenance — and a single bulk pass grew the registry by ~150 effects.
- The honest ceiling is named. ~42% of "open access" articles resolve to a fetchable full-text PDF; the rest are landing-page / bronze OA or bot-blocked and route to the acquisition queue. The pipeline reports what it could and couldn't reach rather than silently truncating.
- It compounds. Sourcing waves add evidence one agent at a time; this adds it one article at a time, automatically, and keeps paying off as new open-access primaries are harvested.