Christianity, sex, and shame — engineering critique
Staff-engineer review of the corpus pipeline that produces the christianity-sex-shame research outputs — dual-grade ingestion, retrieval against thinly-curated sources, and reproducibility of the synthesis chain.
By Mike West
Engineering Critique — Christianity, Sex, and Shame Research Pipeline
Adversarial staff-engineer review of the system that produces the christianity-sex-shame research outputs. What the architecture commits to, what silently fails, and what would harden the pipeline before any of this is treated as evidence.
Reviewer: senior staff engineer, retrieval + research-data systems.
Audience: the engineers and researchers who maintain Vela's corpus pipeline and the agents that synthesise against it.
Scope: dual-grade ingestion correctness for the patristic / Augustinian / purity-culture corpus; retrieval against thinly-curated sources; reproducibility of the synthesis chain (corpus → notes → literature map → literature review → public introduction → protocol); BibTeX / literature-map auditability.
Date: 2026-05-22.
This memo is deliberately blunt. The pipeline produced a synthesis that reads like a serious literature review, on a topic where careless retrieval can produce confidently wrong claims about consequential history. Several of the pipeline's well-documented failure modes are visible inside the christianity-sex-shame outputs themselves. Fix the silent-failure issues before the literature review is treated as evidence by anyone outside the Vela team.
1. TL;DR
- The retrieval substrate is not safe for thinly-curated topics. The case study at docs/research/2026-04-23-christianity-sex-hangup.md documents the exact failure mode the christianity-sex-shame corpus is exposed to: when curator-grade picks are concentrated in a small set of secondary sources (MacCulloch Lower than the Angels; O'Donnell Augustine: A New Biography; Brown; a handful of Great Courses lecture transcripts), retrieval against those sources produces a consensus synthesis of those sources' worldviews, which the synthesis layer then reports as "what the historical scholarship shows." It does not show that. It shows what your three loudest authors think. The 2026-05-15 rerun update is honest about this: even after adding Pagels, Boswell, Brooten, Harper, Jordan, and Dale Martin to the corpus, only three of the six surface in top-15 retrieval. The corpus expanded; the retrieval did not.
- Curator-grade vs. research-bulk-grade is the right architecture, and it is silently violated on this thread. Dual-grade ingestion (AGENTS.md, the dual-grade default under Corpus ingestion) is the engineered defense against thin-curation retrieval failure. Inspection of christianity-sex-shame-bibliography.bib and the corpus syntheses reveals that the load-bearing sources are research-bulk-grade chunks (visible in the synthesis as research-bulk-chunk theme tags). The patristic and Augustinian corpus does not appear to have curator-grade picks that would steer retrieval toward the specific arguments at stake in the review. Either the curator-grade picks were never made, or they were made but did not survive into the score column visible in the case study. Either is a bug.
- The bulk-chunk text quality on the Boswell, Brooten, and Dale Martin sources is dominated by bibliography and footnote pollution. The 2026-05-15 rerun note explicitly identifies this: "their bulk chunks are dominated by bibliography/footnote/tab-polluted text that embeds poorly." This is not a retrieval bug; it is an ingestion bug. The pipeline chunks PDFs into ~500-word paragraph-respecting segments and embeds whatever it gets; for academic books with dense citation apparatus, large fractions of pages are reference lists, footnote columns, or two-column-bibliography tables that compress into low-information embeddings. The retrieval correctly does not surface these chunks, but it then has no other chunks from those authors to surface, so the author is functionally invisible. In the update note's own words: "Necessary fix, not sufficient." The literature review then synthesises against a corpus where Boswell is shelved but unreachable.
- The synthesis layer reports per-passage scores ("emb 0.640, theme 0.000") but no calibration. What does emb 0.640 mean? Is it 30th percentile or 90th percentile against a representative query distribution? The case study reports these numbers in every section evidence block; no document in the pipeline tells the reader what range of emb scores corresponds to a high-quality vs. low-quality retrieval. Without calibration, the numbers are decorative. Any reader of the synthesis is implicitly trusting the embedding model to have ordered relevance correctly across an unmeasured score distribution. This is not how high-stakes retrieval is operated.
- Reproducibility is asserted but not engineered. The literature review states that the corpus syntheses were "cross-validated from two independent parallel drafts." The two drafts came from different LLM substrates (ChatGPT Deep Research and browser Claude Opus 4.7) prompted at different times with different (undocumented) queries against different (undocumented) corpus snapshots. A second researcher running the pipeline tomorrow does not have the inputs to reproduce either draft.
- The BibTeX / literature-map / literature-review chain has no integrity check. The literature map has 80+ rows; the BibTeX file has 100+ entries; the literature review has ~70 in-text citations; the public introduction names ~12 authors by surname. There is no script that asserts the four sets are mutually consistent: for example, that every in-text citation in the literature review resolves to a BibTeX entry, and every BibTeX entry surfaces in either the literature map or the literature review. The single most useful 50-line script for this corpus does not exist.
2. The retrieval failure mode, in detail
2a. What the case study actually documents
docs/research/2026-04-23-christianity-sex-hangup.md is the touchstone document. Its top-of-file update (verbatim):
The structural-coverage hypothesis stated below — that the corpus was missing Pagels, Boswell, Brooten, Harper, Jordan, and Dale Martin — was tested in a post-ingest rerun. All six are now corpus-resident at 100% embedding coverage, but only three (Harper, Jordan, Pagels) surface in top-15 hits across the rerun probes. Boswell, Brooten, and Dale Martin remain absent from retrieval results even when given author-specific topical queries, because their bulk chunks are dominated by bibliography/footnote/tab-polluted text that embeds poorly. Necessary fix, not sufficient.
This is the engineered behaviour of the pipeline. Read it carefully: author-specific topical queries do not surface the author even after the author is corpus-resident. The pipeline succeeded at ingestion; it failed at retrieval; the synthesis layer cannot tell the difference; the literature review treats the absence of these authors as a feature of the field rather than a feature of the pipeline.
Three of the most important authors for the topic the literature review covers are systematically unreachable by the pipeline's retrieval: Boswell (Christianity, Social Tolerance, and Homosexuality, 1980), Brooten (on early-Christian women's sexuality, 1996), and Dale Martin (The Corinthian Body, 1995; Sex and the Single Savior, 2006). The literature review does not name this constraint. It says (§9 footnote) that the corpus is incomplete; it does not say the corpus contains these authors but cannot reach them.
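The reachability failure described above can be checked mechanically. The following is a minimal sketch, not the pipeline's code: `search` stands in for whatever retrieval entry point Vela exposes, and the hit shape (`source_code` key), the probe queries, and the author-to-source mapping are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Set

# Assumed shape of one retrieval hit, e.g. {"source_code": "LTA", "score": 0.54}.
Hit = Dict[str, object]


def unreachable_authors(
    search: Callable[[str, int], List[Hit]],
    author_queries: Dict[str, Iterable[str]],
    author_sources: Dict[str, Set[str]],
    top_k: int = 15,
) -> List[str]:
    """Return authors for whom no probe query surfaces any of their sources."""
    missing = []
    for author, queries in author_queries.items():
        sources = author_sources[author]
        reached = any(
            hit["source_code"] in sources
            for query in queries
            for hit in search(query, top_k)
        )
        if not reached:
            missing.append(author)
    return missing
```

An author who is corpus-resident but appears in the returned list is exactly the "shelved but unreachable" case: ingestion succeeded and retrieval did not.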
2b. The score-distribution problem
Inspection of the case study's section evidence blocks shows that retrieval scores cluster in the 0.45–0.55 range for the patristic and Augustinian topical queries. Example (pre-christian-greco-roman section, top 5 retrievals):
score: 0.544 (emb 0.640, theme 0.000)
score: 0.530 (emb 0.623, theme 0.000)
score: 0.524 (emb 0.616, theme 0.000)
score: 0.518 (emb 0.610, theme 0.000)
score: 0.514 (emb 0.604, theme 0.000)
The delta between rank 1 and rank 5 is 0.030 in combined score (0.036 in embedding score): about three hundredths of an unscaled cosine, not a percentile gap. The synthesis layer treats rank 1 as the primary source for the section's claim. But under any reasonable noise model on a corpus of ~1,692 passages embedded by text-embedding-3-small, those five chunks are not statistically distinguishable. Any of at least three of the five chunks could be promoted to "primary support" with no change in retrieval architecture, only a different query phrasing. The synthesis is brittle to query phrasing at the rank-1 layer, and the synthesis layer does not surface this brittleness.
For a topic where the correct interpretation is contested by primary sources the pipeline cannot reach, this is a critical reliability gap. The retrieval is not deciding between "Brown 1988 vs. Harper 2013" on a contested question — both authors are corpus-resident and broadly consonant. The retrieval is deciding between "LTA passage 060 vs. LTA passage 099" within a single author whose worldview already dominates the corpus.
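To make the brittleness concrete, the sketch below computes the raw rank-1-to-rank-k gap and the same gap expressed as a percentile difference against a background score distribution. The background sample, the function names, and the 0.10 percentile threshold are assumptions for illustration; the pipeline would need to supply a real distribution from representative queries.

```python
from bisect import bisect_left
from typing import Sequence


def percentile_rank(score: float, background: Sequence[float]) -> float:
    """Fraction of background scores strictly below `score` (0.0 to 1.0)."""
    ordered = sorted(background)
    return bisect_left(ordered, score) / len(ordered)


def top_k_spread(scores: Sequence[float], k: int = 5) -> float:
    """Gap between the rank-1 and rank-k scores, in raw score units."""
    ranked = sorted(scores, reverse=True)[:k]
    return ranked[0] - ranked[-1]


def is_brittle(
    scores: Sequence[float],
    background: Sequence[float],
    k: int = 5,
    min_percentile_gap: float = 0.10,  # assumed threshold, not a calibrated value
) -> bool:
    """True when rank 1 and rank k are closer than the required percentile gap."""
    ranked = sorted(scores, reverse=True)[:k]
    gap = percentile_rank(ranked[0], background) - percentile_rank(ranked[-1], background)
    return gap < min_percentile_gap
```

Applied to the five scores quoted above, `top_k_spread` returns roughly 0.030. Whether that gap is meaningful depends entirely on the background distribution, which is the calibration the case study does not report.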
2c. Theme tags are doing zero work
Nearly every retrieval in the case study shows theme 0.000 (§6d covers the handful of SIN exceptions). The theme score is the curator-grade contribution to the combined retrieval score (per the dual-grade architecture). It is identically zero for every retrieval in the patristic / Augustinian sections. In effect, the dual-grade architecture has silently downgraded to single-grade research-bulk-chunk retrieval, because no curator-grade theme picks exist for these passages. This may be appropriate for breadth queries; it is not appropriate for the central historical claims of the synthesis.
The fix is curator-grade picks on the 30–60 passages that actually carry the patristic / Augustinian argument across LTA, AUB, SIN, GCCG, GCHI, SAP. With curator picks present, retrieval would prefer passages a human has flagged as on-topic; without them, retrieval prefers whatever the embedding model happens to surface, which on this corpus is heavily MacCulloch.
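A small sketch shows why a zero theme score collapses dual-grade into single-grade ranking. The linear form and both weights are assumptions: 0.85 on the embedding score reproduces the five rows quoted in §2b to within rounding, but that is an inference from those rows, not a documented Vela formula, and the theme weight is a placeholder. The passage IDs are hypothetical.

```python
from typing import List, Tuple

# Assumed weights, NOT documented Vela values. 0.85 matches the five
# reported rows to within rounding; the theme weight is a placeholder.
EMB_WEIGHT = 0.85
THEME_WEIGHT = 0.15


def combined_score(emb: float, theme: float) -> float:
    """Hypothetical linear combination of embedding and theme scores."""
    return EMB_WEIGHT * emb + THEME_WEIGHT * theme


def rank(passages: List[Tuple[str, float, float]]) -> List[str]:
    """Rank (passage_id, emb, theme) rows by combined score, best first."""
    ordered = sorted(passages, key=lambda p: combined_score(p[1], p[2]), reverse=True)
    return [passage_id for passage_id, _, _ in ordered]
```

With every theme score at zero, `rank` is just embedding order. Give a lower-embedding passage a curator pick (theme 1.0 in this sketch) and it overtakes the rank-1 chunk, which is the steering the dual-grade design is supposed to provide.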
2d. Author concentration in retrieved evidence
The case study's section evidence blocks return overwhelmingly to two sources: LTA (MacCulloch's Lower than the Angels) and AUB (O'Donnell's Augustine: A New Biography). I counted (in the case study text I have): of the 105 retrieval rows shown across the seven main sections, ~62 are LTA, ~15 are AUB, ~12 are Great Courses lecture transcripts (GCCG, GCHI, GCJC, etc.), and ~16 are everything else. The synthesis is therefore approximately "what does Diarmaid MacCulloch think about Christian sexuality" with footnote-level supplementation. MacCulloch is a serious historian and his book is the most cited modern survey of the topic. But the synthesis is not what its public introduction implies, namely a survey of the historical scholarship.
A retrieval-system audit would report this as a source-concentration metric. A max-author-share ceiling of about 0.40 is a defensible threshold for a survey (it is the threshold the regression test in §8 uses); this synthesis sits at roughly 0.60 (about 62 of 105 rows) on a single author. The synthesis layer does not surface this.
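The concentration metric is a few lines of code. This is a minimal sketch; the function name is illustrative, and the input is assumed to be one source code per retrieval row, however the pipeline chooses to export them.

```python
from collections import Counter
from typing import Iterable, Tuple


def max_source_share(source_codes: Iterable[str]) -> Tuple[str, float]:
    """Return the most frequent source code and its share of all rows."""
    counts = Counter(source_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no retrieval rows supplied")
    source, n = counts.most_common(1)[0]
    return source, n / total
```

Fed the approximate counts above (62 LTA, 15 AUB, 12 Great Courses, 16 other), it reports LTA at 62/105, about 0.59, well above a 0.40 ceiling.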
3. Dual-grade ingestion — correctness on this corpus
The dual-grade default (curator + research-bulk) is the right architecture. Its correctness on the christianity-sex-shame corpus is unverified.
3a. Curator-grade coverage
The AGENTS.md spec says "~15 picks per source" for curator-grade. The corpus has ~100 books (per the literature map's claim of 1,692 passages). If each book had 15 curator picks, the curator-grade subset would be ~1,500 passages. Inspection of the case study's retrievals — every visible retrieval reports themes: research-bulk-chunk and theme score: 0.000 — suggests curator-grade coverage is empirically near zero for this corpus, or that the curator picks did not match the case study's topical queries.
The fix is to either:
- Run a curator pass over the patristic / Augustinian / purity-culture sources (LTA, AUB, SIN, GCCG, SAP, LSC, TDM, and the relevant Great Courses transcripts), flagging the 15–30 passages per source that actually carry the load-bearing arguments: concupiscence, Pelagian controversy, East/West divergence, shame-to-sin transition, Reformation reinterpretation, evangelical purity culture as inheritance.
- Or run a directed theme-tagging pass via prompted LLM, with human spot-check, against the bulk chunks, adding theme tags like augustinian-concupiscence, pelagian-controversy, purity-culture-formation, shame-to-sin, east-west-divergence. Each theme tag becomes a retrievable handle.
Either fix unblocks the dual-grade architecture for this corpus. The current state does not exercise the architecture's safety design.
3b. Research-bulk chunking quality on the academic sources
The 2026-05-15 rerun update names the problem: "bulk chunks are dominated by bibliography/footnote/tab-polluted text." This is the standard PDF-extraction failure mode on academic books — the chunker doesn't distinguish body text from reference apparatus, and reference apparatus dominates the page count of any seriously-cited academic monograph.
The standard fix is layout-aware extraction (Unstructured.io, Marker, Mathpix) that separates body text, footnotes, references, and tables before chunking. The Vela pipeline appears to use a paragraph-respecting chunker over raw extracted text, which would produce exactly the failure mode the rerun documents. Three of the most important primary sources for the topic the literature review covers (Boswell, Brooten, Dale Martin) are functionally absent from retrieval because of this. This is a P0 ingestion-correctness fix for any further work on this thread.
A practical first-pass: re-extract the three problem sources through a layout-aware extractor, verify the body-text percentage exceeds 50% before re-chunking, re-embed, then re-run the case study probes. If author-specific queries now return author content, the fix is correct. If they do not, the embedding model is failing on the body-text characteristics and the fix is a different embedding model or query-specific reranking, not a different chunker.
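The body-text gate in that first pass could be approximated with a line-level heuristic like the one below. This is a rough sketch under stated assumptions: the regular expressions, the two-signal rule, and the function names are illustrative choices, not the pipeline's logic, and a layout-aware extractor would classify regions far more reliably than pattern matching on extracted text.

```python
import re

# Heuristic signals that a line is citation apparatus rather than body text.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
CITATION_MARKERS = re.compile(
    r"\b(pp?\.|ibid\.?|op\. cit\.|vol\.|eds?\.|trans\.)", re.IGNORECASE
)
LEADING_NOTE_NUMBER = re.compile(r"^\s*\d{1,3}[.\s]")


def looks_like_reference(line: str) -> bool:
    """Flag a line when at least two citation-like signals are present."""
    signals = 0
    signals += bool(YEAR.search(line))
    signals += bool(CITATION_MARKERS.search(line))
    signals += bool(LEADING_NOTE_NUMBER.match(line))
    signals += "\t" in line  # tab pollution from table or column extraction
    return signals >= 2


def body_text_fraction(chunk: str) -> float:
    """Share of non-empty lines in a chunk that do not look like references."""
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    body = sum(1 for ln in lines if not looks_like_reference(ln))
    return body / len(lines)
```

A chunk with `body_text_fraction` below 0.5 would be held back from embedding until re-extracted. Requiring two signals keeps an ordinary sentence that merely mentions a year from being flagged.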
3c. Licensure status on the corpus
Per AGENTS.md, Mike-direct-upload defaults to licensed. Inspection of the case study's source codes (LTA, AUB, SIN, TDM, LSC, the Great Courses transcripts) suggests these are all direct-upload books, which is correct under the policy. The licensure-status discipline is good on this thread. Not all threads in the repository have this property; this one does. It is noted for the record because most of an engineering critique is negative; this finding is positive.
4. Reproducibility of the synthesis chain
4a. What "cross-validated from two independent parallel drafts" actually means
The literature review header reads: "Cross-validated from two independent parallel drafts · 24 April 2026." The reconciliation note at the end of the review identifies the two drafts as ChatGPT Deep Research and browser Claude Opus 4.7 1M context, both produced 2026-04-24.
For this claim to be reproducible, the following inputs must be specified:
- The exact ChatGPT Deep Research conversation transcript (prompt + model version + retrieval-augmentation settings).
- The exact browser Claude conversation transcript (prompt + model version + tool-call sequence).
- The corpus snapshot the syntheses were synthesised against (Vela passage IDs and source files at the moment of synthesis).
- The reconciliation rules used during the merge ("the more conservative position is adopted") operationalised as a script or as a documented procedure with examples.
I have access to none of these. A second researcher attempting to reproduce the synthesis would not be able to reproduce it. The "cross-validation" is a quality-control gesture, not a reproducibility guarantee. This is fine for an internal program document. It is not fine for the venue submissions the literature review's "Downstream artifacts" table names.
The fix is straightforward: archive the prompts, model versions, and corpus snapshot in docs/research/_provenance/ and reference the archive from each synthesis document. The information exists somewhere — chat history, browser tab transcripts — but it is not durably stored against the documents that depend on it.
4b. The notes-to-literature-review chain
The chain is approximately:
- Three corpus syntheses produced 2026-04-23 (the touchstones for the review's claims).
- Two parallel literature maps produced 2026-04-24 (independent of each other; both consume the syntheses).
- One merged literature map produced 2026-04-24 (reconciles the two parallel maps).
- One literature review produced 2026-04-24 (synthesises the merged map against the syntheses).
- One public introduction produced 2026-04-24 (popular-register version of the review's central claims).
- One intervention protocol produced 2026-04-24 (operationalises the K.5 research question from the review).
The chain commits in a single dated batch. Each artifact references the prior layers. There is no version-pin from any later document to the specific commit hash of the corpus passages it depends on. If a passage's text changes (re-ingestion, OCR correction, re-chunking), the literature review's footnote [shame-vs-sin.7] still resolves to "passage 7 of the shame-vs-sin section of the case study," but the underlying text may have shifted. The Vela platform's passage IDs are keyed by (source_code, RC-NNN), not by content hash; an OCR or chunking change preserves the ID and changes the content. The references silently drift.
The fix is content-hashed citations. The literature review should resolve [shame-vs-sin.7] to a tuple of (passage_id, content_hash_at_synthesis_time). If the hash drifts, the reference goes stale loudly. This is the same logic as content-addressing in package managers; it does not exist here.
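A content-hashed citation can be sketched in a few lines. The class name, the field names, and the `LTA:RC-060` ID format are assumptions for illustration; only the (passage_id, content_hash_at_synthesis_time) tuple comes from the recommendation above.

```python
import hashlib
import unicodedata
from dataclasses import dataclass


def content_hash(text: str) -> str:
    """SHA-256 of the passage text after Unicode and whitespace normalization."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PinnedCitation:
    label: str              # e.g. "shame-vs-sin.7"
    passage_id: str         # e.g. "LTA:RC-060" (ID format is an assumption)
    hash_at_synthesis: str  # content_hash() of the text the synthesis saw


def is_stale(citation: PinnedCitation, current_text: str) -> bool:
    """True when the passage text no longer matches the pinned hash."""
    return content_hash(current_text) != citation.hash_at_synthesis
```

Normalizing whitespace before hashing is a design choice: a pure reflow does not invalidate the citation, while an OCR correction or re-chunking that changes the words does.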
4c. The BibTeX file is not auditable
christianity-sex-shame-bibliography.bib is ~727 lines, ~100 entries. Inspection shows the standard BibTeX-from-LLM failure modes:
- DOIs are stated but not verified within the file (the file contains DOIs as strings; no field indicates "DOI-resolves: yes/no/last-checked-YYYY-MM-DD").
- Some entries contain authors-only-by-first-author + et-al, losing co-authorship attribution. For a literature review where co-authorship matters (Grubbs lab; Pargament lab; the Meston / Coates collaboration), this is consequential.
- Page ranges are sometimes present, sometimes absent.
The pipeline does not include a validate-bibtex.ts script that does the obvious work: resolve each DOI, fetch the canonical metadata, compare against the file, surface drift. CrossRef has a free API for this; a 100-line script catches the load-bearing errors.
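A sketch of the core of such a validator follows. It assumes the `.bib` file has already been parsed into one dict of fields per entry (the parser is not shown) and uses the public CrossRef REST endpoint `https://api.crossref.org/works/{doi}`; the function names and the mismatch messages are illustrative. The comparison is kept separate from the network call so it can be tested offline.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List


def fetch_crossref(doi: str) -> Dict:
    """Fetch the CrossRef work record for a DOI (network call)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)["message"]


def compare_entry(entry: Dict[str, str], record: Dict) -> List[str]:
    """Return mismatches between a parsed .bib entry and a CrossRef record."""
    problems = []
    titles = record.get("title") or [""]
    if entry.get("title", "").strip().lower() != titles[0].strip().lower():
        problems.append("title differs")
    date_parts = record.get("issued", {}).get("date-parts") or [[None]]
    if str(date_parts[0][0]) != entry.get("year", "").strip():
        problems.append("year differs")
    # Every CrossRef family name should appear somewhere in the author field.
    families = [a.get("family", "") for a in record.get("author", [])]
    entry_authors = entry.get("author", "").lower()
    missing = [f for f in families if f and f.lower() not in entry_authors]
    if missing:
        problems.append("authors missing from entry: " + ", ".join(missing))
    return problems
```

The author check is a plain substring match, so it flags "first author plus et al." entries that drop co-authors but can miss near-collisions between surnames. An edit-distance comparison would be the next refinement.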
5. Auditability — what a second researcher would need
A competent reviewer attempting to audit the literature review's claims needs the following artifacts, none of which currently exist:
1. Per-claim provenance ledger. Every assertion in the literature review (e.g., "Higher religiosity predicts lower sexual desire, and this relationship is statistically explained by sex guilt") should resolve to either (a) a corpus passage with content hash, (b) an external citation with DOI verified at synthesis time. Currently the literature review uses APA in-text cites for externals and prose-only references for corpus material. The corpus material is the most contested part of the synthesis and has the least traceable provenance.
2. Retrieval-query log. For each section of the case study, the literature reviewer needs to know what queries were issued, what passages were returned, what passages were used in the synthesis, and what passages were not used despite being returned (and why). The case study reports a Retrieved passages (15) block per section but does not report the query phrasing that produced the block. Two different query phrasings produce different top-15s on the score distribution I described in §2b.
3. Corpus snapshot identifier. "The corpus as of 2026-04-23" needs to be a concrete artifact: a snapshot hash of the mosaic_passages table at synthesis time, with all source files present and their content hashes. Vela has the database; it does not have the snapshotting discipline that makes the synthesis auditable a year from now.
4. Synthesis-prompt log. The Claude prompts that produced the corpus syntheses are not archived against the synthesis outputs. A different prompt (even a different ordering of the same prompt) produces a meaningfully different synthesis. The pipeline does not preserve this.
5. Disagreement log for the reconciliation step. The literature review's reconciliation note says "the more conservative position is adopted and the discrepancy noted" when the two parallel drafts disagreed. The noted discrepancies are partial; the full disagreement set is not preserved. A reviewer cannot evaluate whether the conservative-default rule was applied consistently.
Each of these is a 1–3 day engineering investment. None require new infrastructure beyond the existing repo + Supabase + filesystem. The artifacts simply need to be produced and stored against the documents that depend on them.
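As one example of how small these artifacts are, a corpus snapshot identifier can be a single hash over every passage's ID and content hash. This is a sketch that assumes the passages can be exported as (passage_id, text) pairs; how they are read out of the mosaic_passages table is not shown, and the function name is illustrative.

```python
import hashlib
from typing import Iterable, Tuple


def snapshot_id(passages: Iterable[Tuple[str, str]]) -> str:
    """Hash (passage_id, text) pairs into one order-independent identifier."""
    digest = hashlib.sha256()
    for passage_id, text in sorted(passages):
        text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        digest.update(f"{passage_id}\t{text_hash}\n".encode("utf-8"))
    return digest.hexdigest()
```

Sorting before hashing makes the identifier independent of export order, so two runs over the same corpus agree and any change to a single passage's text produces a different identifier.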
6. Silent failures and small-but-load-bearing bugs
6a. The themes: research-bulk-chunk blanket tag obscures dual-grade signal
Every retrieval row in the case study has themes: *research-bulk-chunk; <source-code-lowercase>*. The second tag is the source code, not a curator-grade theme. The retrieval reranker therefore has no real theme signal to work with, and the theme score is 0.000 in almost every row (§6d covers the few exceptions). The blanket research-bulk-chunk tag should be invisible to the reranker (or weighted to zero contribution), so that the theme score reflects only true curator-grade picks. As currently structured, even if curator picks exist, they may be diluted by the blanket tag in the score combination.
A quick grep against the production retrieval query would confirm whether the reranker filters or includes the research-bulk-chunk token. If it includes it, the reranker is treating every passage as theme-tagged-equally, which is the opposite of what dual-grade is supposed to do.
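The intended behaviour can be stated as a short sketch. The scoring rule here (fraction of query themes matched) is a hypothetical stand-in for whatever the production reranker computes; the point is only that the blanket tag and the source-code tag are removed before any theme comparison.

```python
from typing import Iterable, Set

BLANKET_TAG = "research-bulk-chunk"


def curator_themes(tags: Iterable[str], source_code: str) -> Set[str]:
    """Drop the blanket bulk tag and the source-code tag; keep real themes."""
    ignored = {BLANKET_TAG, source_code.lower()}
    return {tag.strip().lower() for tag in tags} - ignored


def theme_score(query_themes: Iterable[str], tags: Iterable[str], source_code: str) -> float:
    """Fraction of query themes matched by curator-grade tags (hypothetical rule)."""
    wanted = {theme.strip().lower() for theme in query_themes}
    if not wanted:
        return 0.0
    return len(wanted & curator_themes(tags, source_code)) / len(wanted)
```

Under this rule a passage tagged only with the blanket tag and its source code always scores 0.0, so a non-zero theme score can only come from a curator pick.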
6b. The synthesis layer can quote partial passages without flagging truncation
Multiple retrievals in the case study show snippets ending in … mid-word or mid-clause. Example (post-augustine-contestation.5):
Augustine's rejection of Pelagius is doubly complex. First, there was the rivalry for the affections and attention of the well-connected Ro…
The literature review then quotes from this passage: "Augustine's rejection of Pelagius is doubly complex, involving both theological disagreement and 'rivalry for the affections and attention of the well-connected Roman aristocracy.'" The quoted text includes the word "Roman" which appears truncated in the retrieval display. Either the synthesis layer is correctly retrieving the full passage (likely; the … is a display truncation) or it is hallucinating the completion. A reviewer cannot distinguish these from the artifacts. The fix is to expose the full passage text in the synthesis trace, not just the truncated display.
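Once the full passage text is exposed, a verbatim check on quoted fragments is straightforward. The sketch below is illustrative: the function names are hypothetical, and it assumes quotes are meant to be verbatim apart from whitespace, case, and curly apostrophes.

```python
from typing import List


def _normalize(text: str) -> str:
    """Fold curly apostrophes, collapse whitespace, and lowercase."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    return " ".join(text.split()).lower()


def unverified_fragments(quoted_fragments: List[str], passage_text: str) -> List[str]:
    """Return the quoted fragments that do not occur in the passage text."""
    haystack = _normalize(passage_text)
    return [q for q in quoted_fragments if _normalize(q) not in haystack]
```

Run against the truncated display snippet above, the fragment ending in "Roman aristocracy" comes back unverified, which is the correct answer for that input: the display text alone cannot support the quote, and only the full passage can.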
6c. The Coates et al. citation has a year drift
christianity-sex-shame-bibliography.bib and the literature review variously cite the Coates / Meston purity-culture × NSE work as 2025 (Coates et al. 2025, Journal of Sexual Medicine supplement) and 2026 (Coates et al. 2026, Journal of Sex Research). The literature map has both as separate entries (C10 and C11), which is correct if they are distinct publications. The literature review text uses 2026 throughout. The public introduction references the finding without a year. This is a minor citation-hygiene issue but it is exactly the kind of inconsistency the BibTeX-validator script (§4c) would catch automatically.
6d. The case study's theme 0.048 rows in the SIN source
A handful of SIN (Sin: The Early History of an Idea by Fredriksen) retrievals in the case study show non-zero theme scores (~0.048). Inspection suggests these are isolated curator-grade picks that survived; they are also a tiny minority of the retrievals. This is evidence that curator-grade tagging exists at low coverage on this corpus, which makes §3a's recommendation tractable: the infrastructure works, it just hasn't been used.
6e. The protocol's measure-licensing footnote is wrong
The intervention protocol (§ Measure battery) lists "Measure battery licensing: ~$0 (most are public; a few have academic-use fees)." This is approximately true for academic-IRB-cleared research. It is not true for an industry-funded RCT, where Mosher's Sex Guilt subscale, the Lawrance & Byers GMSEX, and several others are not free for commercial use. The literature review's downstream-artifacts table names "Vela product surface (course or app) if evidence supports scaling," which is commercial use. The protocol's funding line does not reflect this.
7. Performance and operational traps
The christianity-sex-shame thread does not have an active runtime pipeline — the syntheses are batch artifacts produced once and committed to disk. The performance concerns of, say, a Granger panel-VAR or a Hawkes-fit pipeline do not apply. The operational traps that do apply:
7a. Re-running the case study should be a single command
The case study cost $0.135 to produce (per its provenance index). It cannot, presently, be re-run by anyone other than the original operator without reconstructing the prompts, the corpus snapshot, the model version, and the retrieval parameters. Re-running on a corrected corpus (e.g., after Boswell / Brooten / Dale Martin are properly ingested) should be a single command along the lines of npm run research:corpus-probe -- --thread=christianity-sex-shame. No such command exists today.
The infrastructure exists. Vela has a research-corpus-density script (npm run research:corpus-density) and an ingest-batches log. What does not exist is a thread-level probe runner that produces the case study's section evidence blocks programmatically against the current corpus. The single 200-line script that does this would make the synthesis-chain auditable.
7b. The literature-review → magazine-piece chain is one-directional
The literature map's downstream-artifacts table lists "Magazine pieces" as a future deliverable. The pipeline does not enforce that magazine pieces produced from the synthesis cite the corpus passages and external sources the synthesis depended on. If a magazine piece is published claiming "research shows" some result, the chain from claim back to corpus is asserted by the editor, not enforced by the pipeline. This is the same provenance gap as §5; it surfaces under operational pressure when the magazine surface starts to scale.
7c. The protocol's clerical-partnership specification will not survive contact with reality
The intervention protocol (§ Partnerships required) names specific clergy networks ("Jamie Lee Finch network; The Liturgists community"; "Nadia Bolz-Weber's House for All Sinners and Saints alumni"; "Schermer Sellers' Seattle Pacific network"). These are real networks. The protocol does not specify whether anyone in those networks has been contacted, what response was received, or what the substitution plan is if these partnerships fail to materialise. This is an operational risk the literature review treats as program infrastructure. It is not yet program infrastructure; it is a partner-recruitment hypothesis.
8. Test coverage
The christianity-sex-shame pipeline has approximately zero automated tests. Here is the minimal test set that would catch the failure modes documented above:
1. Retrieval-coverage test for every named author in the literature review. A test_author_reachability.py that issues author-specific topical queries for every author cited in the literature review (Brown, Brundage, MacCulloch, O'Donnell, Pagels, Burrus, Perisanidi, Stan & Turcescu, Grubbs, Woo, Murray, Klein, Ortiz, Sawatsky, Muskrat, Coates, Mahoney, Pargament, Anderson & Koc, Etengoff, Lefevor, McKiernan, Kaplan, Rosmarin & Pirutinsky, Sellers, Ruether, Coakley, Bolz-Weber, …) against the corpus and asserts top-5 includes at least one passage from a source by that author. Would have caught the Boswell/Brooten/Dale Martin invisibility on day one.
2. Source-concentration metric per synthesis. A regression test that asserts no single source (LTA, AUB, etc.) provides more than 40% of the retrievals in any case-study section. Currently LTA provides ~60% of retrievals in the case study. The test would fail; the fix is curator-grade picks elsewhere.
3. Score-distribution sanity check. A test that asserts the rank-1 retrieval sits at least one quantile step above the rank-5 retrieval, where the quantiles are empirically estimated against the corpus's score distribution. If the top-5 fall within about 0.03 of each other in raw score, as in §2b, the synthesis is brittle to query phrasing, and the test should mark the section as low-confidence.
4. BibTeX integrity test. A validate_bibliography.py that resolves every DOI in christianity-sex-shame-bibliography.bib via CrossRef, compares the canonical record to the file, and asserts (a) DOI resolves, (b) title matches, (c) authors match within edit distance ≤ 3, (d) year matches, (e) venue matches. Would catch the Coates 2025/2026 drift, would catch any LLM-hallucinated DOIs, would surface stale references.
5. Citation-chain consistency test. A test that asserts every in-text APA citation in christianity-sex-shame-literature-review.md resolves to a BibTeX entry in christianity-sex-shame-bibliography.bib, and every BibTeX entry surfaces in either the literature review or the literature map. Would catch orphan citations and dropped references.
6. Provenance-snapshot test. A test that asserts each document in the chain (case study → literature map → literature review → public introduction → protocol) names a corpus-snapshot identifier in its metadata, and the identifier is resolvable to a mosaic_passages snapshot. Would force the §4 provenance work to be operational.
7. Tiny-corpus smoke test. A 30-passage fixture corpus containing known-quality patristic content; assert the synthesis pipeline produces a known-good output against it. Would catch ingestion regressions when sources are re-ingested.
Pick #1 and #4. Each is 1–2 days of engineering investment. The combination catches roughly 70% of the failure modes in this memo.
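The citation-chain consistency test is also small enough to sketch here. This version matches on (first-author surname, year) pairs, which is an assumption about how in-text APA citations map to BibTeX entries. The regular expressions are deliberately simple, will miss nested braces and unusual name forms, and filter month names only to avoid reading dates as citations.

```python
import re
from typing import Dict, Set, Tuple

Cite = Tuple[str, str]  # (first-author surname, year)

ENTRY = re.compile(r"@\w+\s*\{[^@]*")
AUTHOR = re.compile(r'\bauthor\s*=\s*[{"]([^}"]*)[}"]', re.IGNORECASE)
YEAR = re.compile(r'\byear\s*=\s*[{"]?(\d{4})', re.IGNORECASE)
IN_TEXT = re.compile(
    r"\b([A-Z][\w'-]+)(?:\s+et al\.|\s+(?:&|and)\s+[A-Z][\w'-]+)?,?\s+\(?((?:19|20)\d{2})\)?"
)
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def bib_cites(bib_text: str) -> Set[Cite]:
    """Extract (first-author surname, year) from each BibTeX entry."""
    cites = set()
    for entry in ENTRY.findall(bib_text):
        author, year = AUTHOR.search(entry), YEAR.search(entry)
        if not (author and year):
            continue
        first = author.group(1).split(" and ")[0].strip()
        if not first:
            continue
        surname = first.split(",")[0] if "," in first else first.split()[-1]
        cites.add((surname.strip(), year.group(1)))
    return cites


def in_text_cites(text: str) -> Set[Cite]:
    """Extract APA-style (surname, year) pairs, skipping month-name dates."""
    return {(s, y) for s, y in IN_TEXT.findall(text) if s not in MONTHS}


def consistency_report(review_text: str, bib_text: str) -> Dict[str, Set[Cite]]:
    """Citations with no BibTeX entry, and entries never cited in the text."""
    cited, listed = in_text_cites(review_text), bib_cites(bib_text)
    return {"orphan_citations": cited - listed, "uncited_entries": listed - cited}
```

To cover the "literature review or literature map" condition, pass the concatenated text of both documents as `review_text`. Both result sets should be empty on a consistent chain.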
9. Ranked fix list
| # | Fix | Effort | Impact |
|---|---|---|---|
| 1 | Layout-aware re-ingestion of Boswell, Brooten, Dale Martin (and any other academic source where bulk chunks are footnote-polluted). Re-embed, re-probe with author-specific queries, verify top-5 includes author content. | 2-3 days | Critical — three of the most important primary sources for this thread are currently unreachable. Literature review cannot be defended without this. |
| 2 | Curator-grade tagging pass on patristic / Augustinian / purity-culture sources (LTA, AUB, SIN, GCCG, SAP, LSC, TDM + relevant Great Courses). 30-60 picks per source; ~15-20 hours of human labelling. | 1-2 weeks | High — exercises the dual-grade architecture as designed; reduces source-concentration; makes retrieval defensible. |
| 3 | Author-reachability + source-concentration regression tests (test items #1 and #2 above). Run against the corpus snapshot at each major synthesis. | 1 day | High — catches the §2 retrieval failure modes structurally. |
| 4 | BibTeX validator with CrossRef DOI resolution + author/title/year/venue matching (test item #4). Run on commit; fail CI on drift. | 1 day | High — catches the §4c and §6c citation-hygiene issues; raises the floor on bibliographic integrity. |
| 5 | Corpus-snapshot identifier in every synthesis document's header. Format: corpus_snapshot: <hash> referencing a snapshot in docs/research/_provenance/. | 0.5 day for format; 1 day for snapshot infrastructure | Medium — makes the synthesis chain auditable at a year's distance. |
| 6 | Citation-chain consistency test (test item #5). Asserts orphan-free BibTeX. | 0.5 day | Medium — minor citation hygiene; closes the loop on the bibliography. |
| 7 | Score-distribution diagnostic per case-study section. Surface rank-1-vs-rank-5 score delta to the synthesis output. Mark low-confidence sections. | 0.5 day | Medium — does not change the synthesis, but tells readers when to trust the top-1 retrieval and when not to. |
| 8 | Programmatic re-runner for the case study (§7a). One command, current corpus, fresh probes. | 2 days | Medium — makes "the corpus has changed; what does the synthesis look like now?" a tractable question. |
| 9 | Synthesis-prompt log (§5.4) archived against each synthesis document. | 0.5 day for the format; 0 day for retroactive | Medium — retroactive is hard; new syntheses are easy. Establish the discipline going forward. |
| 10 | Theme-tag reranker audit (§6a). Confirm the research-bulk-chunk blanket tag is not contributing to the theme score. | 0.5 day | Low-to-medium — narrow correctness check; either confirms current behaviour is correct or surfaces a meaningful bug. |
10. Closing note
The christianity-sex-shame thread is the most ambitious synthesis Vela has shipped from the corpus pipeline. It is also the most exposed to the pipeline's known failure modes. The case study's own update note documents the central retrieval failure honestly — three of six recovery sources remain unreachable — and the synthesis layer was not built to surface that constraint to downstream documents that depend on it.
What is in the repository is a useful internal program document and a credible draft for further investment. It is not, as currently engineered, a survey of the field that should be cited externally without the §4 / §5 provenance work and the §9 ranked-fix-list top three (re-ingestion, curator-grade tagging, regression tests).
The good news: the pipeline architecture is right. Dual-grade ingestion is the correct defense; corpus content-addressing is the correct provenance discipline; layout-aware extraction is a known fix for the footnote-pollution problem; CrossRef-based BibTeX validation is a one-day script. None of this requires a rewrite. It requires the discipline of running the architecture as designed instead of letting the synthesis layer paper over the gaps.
Fix items #1 and #2 before any further citation of the literature review by anyone outside the Vela team. The rest is housekeeping; #1 and #2 are correctness.