research / ai-human-interaction / audience tiers
Engineering critique
Engineering reviewer's lens on the Penwright Measurement Framework, the Adaptive Authorship Control Kernel (F-19), and the instrumentation discipline.
Engineering critique — the AI–Human Interaction program
A reviewer's lens on the systems underneath the empirical record. What the architecture commits to, where the commitments are well-founded, and what an engineering reviewer should interrogate before treating downstream findings as evidence.
— 2026-05-05
1. System under review
The AI–Human Interaction program is presented as a research surface — twelve papers across three tiers, a literature-mapping pass, a measurement framework, a longitudinal claim. The empirical record sits on instruments built underneath: the Penwright authorship system shipped inside Vela (app/labs/penwright/), the Adaptive Authorship Control Kernel (F-19) as the central registry of measurement and intervention, the Penwright Measurement Framework as the instrumentation layer, and the production-data pipeline that records every authorial event. The empirical claims downstream of these instruments — about dependency, about genre-specific effects, about longitudinal capability development — are only as good as the instruments. This document is the engineering side of that ledger.
The program is unusual in two respects. First, the lead apparatus is a production system whose users are real authors and whose data is collected continuously, not in a discrete study window. Second, the system was built before most of its empirical claims had been tested — it is research-instrument and product simultaneously. An engineering reviewer needs to reckon with both at once.
The stack is shared with Vela: Next.js 16 App Router on Vercel (Fluid Compute), Supabase for Postgres + auth + storage, Modal for GPU-bound work, Anthropic for inference. There is no separate AHI repo; Penwright lives inside Vela's monorepo at /labs/penwright, and graduates when the design stabilizes. The integration is the schema, and the schema is shared. That choice is load-bearing for this critique.
2. What's well-built
Kernel-as-central-registry pattern. F-19 — the Adaptive Authorship Control Kernel — registers measurement, intervention, and genre-aware behavior in one place rather than spreading them across feature flags, prompts, and ad-hoc analysis scripts. The decision is correct for a research instrument: every construct an empirical paper invokes traces back to a kernel registration, which is auditable and version-pinned. The alternative — distributed measurement, where each surface owns its own scoring — is the path that produces "the same construct measured five different ways across five different studies," which is the field's default failure mode and Principia's reason to exist.
Genre as a fork, not a flag. Penwright's commitment that memoir / nonfiction / fiction never collapse into a single skill model is enforced at the architecture level: copy, schema enums, prompts, and metrics each fork by genre rather than sharing a code path with a runtime parameter. This is an engineering decision in service of a research claim — the field's existing AI-writing instruments tend to collapse genres because the engineering is easier — and the cost of the decision is real (more code paths, more test surface) but appropriate. The four failure modes (output-only optimization · over-automation · weak measurement · ignoring genre differences) are encoded as veto conditions, not aspirations.
Three measurement layers. The framework distinguishes output (the writing itself), process (the trajectory through Penwright), and transfer (independent-writing baseline measured under Practice / Constraint Mode without AI scaffolding). The transfer layer is where the load-bearing longitudinal claim — "better writer with Penwright, than without it, in 6 months" — gets its evidence. Without it, the program would only produce within-system findings; with it, the program can speak to capability transfer outside the AI environment. Practice / Constraint Mode is the engineering substrate for this. It exists in the spec; the production discipline of requiring periodic transfer-layer samples from longitudinal participants is in flight, not yet codified.
Authorship Packets as structured input. Replacing freeform prompts with five-field packets (intent · structure · key ideas · relevant passages · counterpositions) makes the input itself measurable. Packet completeness, packet shape, and the relationship between packet structure and downstream output quality are all extractable without after-the-fact NLP guessing. The structure is data. Most existing AI-writing tools cannot say what their users typed because the typing is freeform and the analysis surface is whatever the tool happened to log. Penwright pays the cost (writers learn a new input pattern) for the analyzability gain.
Reflection Layer. Structured post-session prompts inviting the writer to articulate what worked, what didn't, and what to try next surface metacognitive engagement as measurable signal. This is the engineering substrate for the Metacognitive index. The risk that the prompts produce performed reflection rather than genuine engagement is real (see §3), but the alternative — leaving metacognition unmeasured — concedes the entire frontier between output quality and capability development. The framework's commitment is that this distinction is the program's reason to exist.
Pre-registration discipline. Hypotheses for the empirical papers (5, 7, 8 in the twelve-paper plan) are required to specify yes-world and no-world consequences before data collection. The Paper 7 (Genre Effects) preregistration is load-bearing for the rest of the program: if AI effects collapse across memoir / nonfiction / fiction, every other paper's genre-aware analysis is over-engineering; if they don't, the genre fork is non-negotiable. The discipline forces the program to commit before it has the data, which is the only honest version of preregistration.
3. Failure modes and stress conditions
Auto-ethnography is feature-then-bug. The principal investigator is also Penwright's designer and most active user. For descriptive work — what does this system do, what patterns appear when a writer uses it for months — that is a strength: the PI has unmatched ethnographic depth. For causal claims — Penwright caused this writer to develop capability X — it is fatal without mitigation. The mitigation named is the external-operator pilot (PA-009 in the assignment queue): 5–10 outside Penwright users recruited and onboarded under the same instrumentation. Until the pilot's data accumulates, every causal claim in the program is auto-ethnography in form, regardless of its content.
The kernel-vs-framework reconciliation is pending. The Penwright Measurement Framework names six dimensions, six indices, three layers, and a five-step learning loop in VISION-PENWRIGHT-MEASUREMENT.md. The F-19 kernel registry has its own naming and shape (vela ASN-1112 — the reconciliation pass). Until the reconciliation lands, a researcher who reads the measurement vision and writes queries against the kernel will hit drift: the framework speaks one vocabulary, the running registry speaks an adjacent-but-different one. This is the canonical "I thought I was working with X but the runtime has Y" problem. The fix is queued; the fix is not yet shipped.
Production system as research instrument is a tension. Frozen analytical datasets are a baseline expectation for serious empirical work; a reviewer running the analysis against tip-of-master will hit different numbers each week as new sessions accumulate, schema migrations land, and corpus filters shift. Vela has a partial answer (the Erotic-Writing program's frozen dataset rule, ASN-1040) but no platform-level export pipeline that pins a Penwright snapshot for a specific paper's analysis. Without it, papers cannot be re-run by an independent reviewer against the same evidence the original analysis used. The fix is a periodic Parquet export under a research-program S3 bucket with the snapshot SHA recorded — known shape, not yet built.
Reflection-prompt response bias. The Reflection Layer asks the writer to articulate what worked. The literature on self-report measures — including the cognitive-apprenticeship and working-alliance traditions the program draws from — is unambiguous that self-report is contaminated by demand characteristics, social-desirability bias, and the writer's own theory of what the researcher wants to hear. The Metacognitive index that depends on the reflection prompts inherits all of this. A reviewer should ask: what evidence supports the reflection-derived index against an external metacognitive criterion? Today the answer is "the index is in design; calibration against external criteria is queued."
Small-N until the pilot recruits. The current cohort that produces the program's data is one writer (the PI) plus a small number of incidental Penwright users. Statistical claims about distributional patterns, genre differences, or longitudinal trajectories require N that the cohort doesn't yet have. The framework can be specified at any N; the empirical record needs the external-operator pilot to deliver before causal claims become defensible. This is honest in the program's own framing — Paper 8 is gated on pilot data — but it bears restating: the empirical record is sparse, and it should not be over-read in the meantime.
LLM-as-evidence discipline is partially specified. The framework relies on LLM-derived scoring for several indices (Writing Quality components, Genre Awareness, Authorial Voice). The discipline requires verification protocols — when an LLM scores a piece of writing as having low coherence, what's the ground truth, and how is the LLM's scoring calibrated against it? The current answer is "human spot-check + multi-LLM agreement" (see methodology §1.2) for the literature-review pass; the analogue for the production-instrument pass is in design. Without it, an LLM-scored index is downstream of a model that has its own evolving training distribution, and "the index dropped 0.2 points" could mean the writer changed or the model did.
4. What I would interrogate
If I were a peer engineering reviewer reading the empirical findings derived from this stack, my pre-acceptance questions would be:
-
Cohort and confound disclosure. For any empirical claim, what cohort produced the data? Was the cohort the PI alone, the PI plus incidental users, or post-pilot external operators? Are findings stratified by cohort phase? The auto-ethnography threat is mitigated by naming the threat for every paper, not just the longitudinal one.
-
Genre stratification as test. No claim that collapses memoir / nonfiction / fiction passes review. The four-failure-mode veto means ignoring genre differences is a non-negotiable disqualifier — but a reviewer needs to see the stratification in every analysis surface, not just be told that the framework forbids collapse.
-
Index-against-criterion validation. Each of the six derived indices (Writing Quality · Independence · Integration · Metacognitive · Genre-Awareness · Authorial-Voice) needs a validation pass against external criteria — expert ratings, established instruments, or other independent measures. An index without external validation is an opinion of the framework about itself.
-
Measurement reliability over time. Indices derived from LLMs are subject to model-drift effects. A reviewer should expect periodic test-retest checks against frozen evaluation samples (not the production stream) to separate writer-level change from model-level change.
-
Reproducibility — frozen analytical datasets. Per-paper Parquet snapshots, schema-locked, with the snapshot SHA recorded in the paper's preregistration. Without these, the empirical record is a moving target.
-
The kernel-spec divergence ledger. The F-19 kernel and the measurement framework should be reconciled before any paper that invokes a framework construct is filed. A standing "kernel-vs-spec divergence" doc — analogous to the schema-vs-spec ledger Vela needs — would tighten this. It does not currently exist.
5. Trade-offs the future inherits
The program chose production-system-as-instrument over a discrete-study apparatus. The benefit is realism — Penwright users are writers doing real work, not subjects in a 90-minute lab session. The cost is reproducibility friction (frozen datasets owed) and confound complexity (writers' work outside Penwright is invisible to the instrument). The trade-off is correct for the program's longitudinal-capability question, which a discrete study cannot answer; it commits the program to the engineering work of pinning datasets and making transfer-layer measurement disciplined.
The program chose Penwright-inside-Vela during the early build, with graduation deferred until the design stabilizes. The benefit is shared infrastructure (one Postgres, one auth surface, one deploy pipeline). The cost is that AHI claims rest on Vela's substrate — Vela's schema migrations, Vela's licensure-tier filter, Vela's PostgREST FK ambiguity history. A future graduation will require either a clean-room reimplementation of the kernel + framework or a careful repo split that preserves measurement continuity. The cost is paid in advance every month the lock-in persists.
The program chose the longitudinal capability test — "better writer with Penwright, than without it, in 6 months" — as the load-bearing claim. The benefit is that the framing forecloses the easy-fluency win that other AI-writing products optimize toward. The cost is patience: the test cannot be run until enough writers have used Penwright for long enough, which is at minimum half a year out from pilot recruitment, in a field where most papers ship inside three months. The program commits to the slower test on the bet that it produces evidence the field cannot otherwise produce. A reviewer should treat the empirical record as pre-data until that timeline elapses.
Companion artifacts: peer-review.md (audience-tier 1, forthcoming); product.md (audience-tier 3, forthcoming). Vision specs at vela/docs/VISION-PENWRIGHT-AUTHORSHIP.md and vela/docs/VISION-PENWRIGHT-MEASUREMENT.md are the load-bearing internal sources.