peopleanalyst


Peer-review framing

Senior referee's lens on the AHI program if submitted today as a sequenced series. Positioning against named bodies of theory (HCI, CSCW, educational psychology, working-alliance theory, transactive memory, niche construction, distributed cognition, phenomenology of skill) and an explicit threats-to-validity register.


Peer-review framing — the AI–Human Interaction research program

A senior referee's lens on the program if it were submitted today as a sequenced series. Candid about what would pass, what would be sent back, and what threats to validity the program acknowledges.

— 2026-05-05


Program under review: the AI–Human Interaction program at peopleanalyst.com/research/ai-human-interaction — a twelve-paper Penwright Research Program across three tiers (foundational theory · measurement and mechanism · longitudinal empirical studies), built on the Penwright authorship system as the lead empirical apparatus, with a broader frame extending to AI as a long-term cognitive partner across professions, domains, and life stages.

Reviewer: senior referee, AI-and-society methods track.

Hypothetical outlet: Computers in Human Behavior, Human Factors, ACM CSCW, or a measurement-methods venue depending on which of the twelve papers is under review.

Recommendation in advance: Major revisions / publishable as a sequenced series with discipline. The program's measurement framework and pre-registration discipline are unusually rigorous for the field. The longitudinal-causal portion that gives the program most of its theoretical value is empirically inaccessible until the external-operator pilot delivers. Until then, the publishable surface is descriptive. The publication queue needs to reflect that.


1. What the program claims

The program advances four contributions:

  1. A measurement framework for AI-augmented capability development — six skill dimensions, six derived indices, three measurement layers, a five-step learning loop, and four non-negotiable failure modes acting as veto conditions on every measurement decision.
  2. A pre-registered twelve-paper empirical program built on Penwright as a production instrument that records continuous authorship telemetry rather than discrete-study data.
  3. A theoretical bridge between mainstream Human-AI Interaction research and bodies of theory the field has under-engaged: phenomenology of skill, cognitive apprenticeship, working-alliance theory, transactive memory, niche construction, distributed cognition, companion-species studies, indigenous relational ontologies.
  4. A load-bearing longitudinal claim: "better writer with Penwright, than without it, in six months," operationalized as capability transfer to independent writing under no-AI scaffolding (Practice / Constraint Mode), measured against an external baseline.

A peer reviewer should take each contribution on its own terms.


2. Strengths

  • The four-failure-mode veto is rare. Output-only optimization · over-automation · weak measurement · ignoring genre differences are operationalized as veto conditions on the measurement framework itself, not as soft aspirations. The HAI literature is full of programs that drift toward output-quality optimization because output is what's easy to measure; this program structurally forecloses that drift. A reviewer should reward the discipline.
  • Pre-registration with yes-world / no-world consequences. Paper 5 (Dependency) and Paper 7 (Genre Effects) are queued as OSF preregistration candidates with falsifiable predictions specified before data collection. This is the standard that preregistration aspires to and rarely achieves; the program is set up to deliver it.
  • Genre-aware analysis required as a structural commitment. Memoir / nonfiction / fiction never collapse into a single skill model. The architectural commitment (F-19's genre-fork pattern) makes the discipline enforceable rather than performative. Most HAI work on writing collapses genre because the engineering is easier; the program's refusal to do so is a non-trivial methodological move.
  • Auto-ethnography → external pilot ladder. The principal investigator is also the system designer and most active user; the program acknowledges this as descriptive depth and explicit causal threat, and stages an external-operator pilot (PA-009) as the load-bearing mitigation. The framing is correct; the program does not pretend the threat is not there.
  • Capability transfer as the outcome variable. The longitudinal claim is operationalized against no-AI writing samples under Constraint Mode, not against in-system performance. This is the discipline that distinguishes the program from output-fluency-optimization products. A reviewer should treat this as the program's core methodological contribution.
  • Kernel-as-central-registry (F-19). The architectural commitment that every measurement event runs through one registry — versioned, auditable, genre-forked — is the right substrate for a program that intends to produce a coherent twelve-paper trajectory. The alternative (distributed measurement, per-paper scoring) is the path that produces "the same construct measured five different ways."

3. Major concerns

3.1 Field-positioning breadth is correct; depth of engagement varies

The program names eight under-engaged bodies of theory it bridges. That is the right list. A reviewer reading the literature-map and the per-topic reviews will find some bridges deeply scaffolded and others name-checked. Specifically:

  • Working-alliance theory (Bordin 1979; the contemporary literature on alliance ruptures, repair, and goal-task-bond formation): the analogy from psychotherapeutic process to long-term human-AI interaction is novel and potentially load-bearing. The program needs to specify what is analogous (rupture-and-repair sequences, drift in shared goals, the asymmetric maintenance burden) and what is not (the AI is not a therapist, the writer is not a client in the clinical sense). The current literature-review treatment is competent but not yet program-level.
  • Transactive memory (Wegner 1987; Lewis 2003 and the team-cognition literature): the analogy is intuitive — the writer offloads certain memory functions to the AI, the AI offloads certain context functions to the writer. The program's claim about how AI participates in transactive systems, vs. how transactive memory has been studied in human dyads and teams, is still under-specified.
  • Niche construction (Odling-Smee, Laland & Feldman 2003): the deepest of the under-engaged bridges in the literature reviews. The program-level claim — that Penwright is a niche-construction site, that writers and the system co-constitute the writer's authorial environment — is theoretically rich but needs sharpening into a testable formulation.
  • Phenomenology of skill (Dreyfus, Merleau-Ponty, Heersmink): connects to the Reflection Layer and the Metacognitive index, but the operationalization gap is wide. Dreyfus's five-stage skill-acquisition model and the contemporary phenomenology of attention (Stiegler, Citton, Hayles) appear in the literature reviews; their linkage to specific Penwright measurement constructs is not yet tight.
  • HCI and CSCW: the program's strongest field-positioning. The ironies-of-automation literature, working-coupling, transparency vs. incorporation, and distributed-cognition treatments are all well engaged. A reviewer in this space would find the framing familiar and competent.

A peer reviewer would press for tighter scaffolding on the first four. Each is a paper-sized contribution waiting to be written; the program currently has them as background scaffolding rather than as foreground.

3.2 The longitudinal capability claim is methodologically the right shape but empirically inaccessible

The claim — "better writer with Penwright, than without it, in six months" — is methodologically correct. Capability transfer to no-AI work is the right outcome variable; critics of AI-writing tools have been asking for this measurement for years and it is rare. The program is set up to deliver it.

It cannot deliver yet. The external-operator pilot (PA-009) needs 5–10 writers recruited, onboarded under IRB-clean consent + paid compensation, and tracked under measurement discipline for at least six months. As of the date of this review, the protocol is in design and recruitment has not begun. Every causal claim the program makes pre-pilot is auto-ethnography of one. A peer reviewer would not accept causal claims at this stage.

This is a sequencing problem, not a fatal one. The program's descriptive papers (Paper 1 in two registers, Paper 3 on the Authorship Packet Model, Paper 4 on the Measurement Framework) are publishable now with appropriate scope. The longitudinal papers (Paper 5, Paper 7, Paper 8) need to wait for the pilot.

3.3 LLM-as-evidence discipline is partial

The program's literature-review pass uses multi-LLM synthesis with explicit confidence-flagging (A/B/C/D claim-status) and primary-source verification — that is the right discipline (methodology §1.2-1.3). The production-instrument pass relies on LLM-derived scoring for several of the six indices (Writing Quality components, Genre-Awareness, Authorial-Voice). The analogous discipline for production scoring — calibration against external criteria, inter-LLM agreement, human-validation rate, model-version pinning — is in design, not in production.

A reviewer would press on this for any paper that invokes LLM-derived scores. The minimum acceptable bar is a per-index validation pass against expert human ratings on a held-out sample, with reliability statistics (Cohen's κ, ICC, or analogous) reported per genre. The program acknowledges this as queued work; the publication queue should reflect that no LLM-scored paper files before the validation pass lands.
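As an illustration of the per-genre reliability report asked for above, the following is a minimal sketch of Cohen's κ computed separately for each genre. It assumes categorical scores and one LLM rating plus one expert human rating per sample; the record layout and the function names (`cohens_kappa`, `kappa_by_genre`) are hypothetical and are not taken from the Penwright codebase. Continuous scores would call for ICC instead, which this sketch does not cover.

```python
from collections import Counter, defaultdict


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of samples where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_by_genre(records):
    """records: iterable of (genre, llm_label, human_label) tuples (assumed layout)."""
    grouped = defaultdict(lambda: ([], []))
    for genre, llm_label, human_label in records:
        grouped[genre][0].append(llm_label)
        grouped[genre][1].append(human_label)
    return {genre: cohens_kappa(llm, human) for genre, (llm, human) in grouped.items()}


# Hypothetical held-out sample; labels and genres are illustrative only.
held_out = [
    ("memoir", "hi", "hi"), ("memoir", "hi", "lo"),
    ("memoir", "lo", "lo"), ("memoir", "lo", "lo"),
    ("fiction", "hi", "hi"), ("fiction", "lo", "lo"),
]
report = kappa_by_genre(held_out)
```

Reporting the statistic per genre rather than pooled keeps the validation pass consistent with the program's commitment never to collapse genres into a single skill model.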

3.4 Single-system generalization is a real concern

Penwright is one authorship system with one set of design decisions. Even after the longitudinal pilot delivers, findings will be findings about this particular instrument. Generalization claims — to AI writing tools more broadly, to AI-augmented capability development across coding / design / research / education / clinical practice — require either replication on adjacent systems or theoretical scaffolding that explains why findings should travel.

The program acknowledges the concern; the mitigation is currently weaker than the auto-ethnography one. The Adaptive Authorship Control Kernel (F-19) is positioned as a portable substrate that future siblings could reuse, but until a sibling exists and runs the analogous measurement, the substrate-portability claim is a forward-looking conjecture. The published-paper trajectory needs at least one paper that engages the generalization-bounds question explicitly, separating the generalization the program will actually demonstrate from the generalization its discussion sections merely project.

3.5 The "twelve-paper program" framing risks publication-rate over-promising

Twelve papers from one production instrument over a multi-year window is ambitious but not unreasonable, if each paper claims appropriate scope. A peer reviewer reading the sub-paper plan (penwright-sub-paper-plan.md) should look for evidence that papers won't double-count the same data and that early-tier descriptive findings won't be re-claimed as late-tier causal findings.

The program documentation names a shared-dataset discipline — that's the right answer in spirit. The discipline needs to be operationalized as: (a) per-paper frozen Parquet snapshots pinned to preregistration SHAs, (b) pre-registered exclusion criteria for which sessions count toward which paper, (c) explicit data-overlap declarations between papers in each manuscript. None of (a)–(c) is shipped yet.
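One way items (a) through (c) could be made concrete is a small per-paper manifest. The sketch below is an assumption about shape, not the program's actual tooling: the field names, the `min_words` exclusion rule, and the placeholder preregistration SHAs are all hypothetical. It hashes the raw bytes of a frozen snapshot, so it works for a Parquet file read from disk without needing a Parquet library.

```python
import hashlib


def snapshot_digest(snapshot_bytes):
    """SHA-256 hex digest of a frozen snapshot's raw bytes."""
    return hashlib.sha256(snapshot_bytes).hexdigest()


def apply_exclusions(sessions, min_words):
    """Keep session ids meeting a pre-registered threshold (hypothetical rule)."""
    return [s["id"] for s in sessions if s["words"] >= min_words]


def build_manifest(paper_id, prereg_sha, snapshot_bytes, session_ids):
    """Record which frozen data and which sessions a given paper draws on."""
    return {
        "paper_id": paper_id,
        "prereg_sha": prereg_sha,  # commit or registration hash the snapshot is pinned to
        "snapshot_sha256": snapshot_digest(snapshot_bytes),
        "session_ids": sorted(set(session_ids)),
    }


def overlap_declaration(manifest_a, manifest_b):
    """List the sessions two papers share, for inclusion in each manuscript."""
    shared = sorted(set(manifest_a["session_ids"]) & set(manifest_b["session_ids"]))
    return {
        "papers": [manifest_a["paper_id"], manifest_b["paper_id"]],
        "shared_sessions": shared,
        "shared_count": len(shared),
    }


# Illustrative values only; real snapshots would be read with open(path, "rb").read().
paper_4 = build_manifest("paper-4", "abc123", b"snapshot-a", ["s1", "s2", "s3"])
paper_5 = build_manifest("paper-5", "def456", b"snapshot-b", ["s3", "s4"])
declaration = overlap_declaration(paper_4, paper_5)
```

A manifest of this kind lets a referee check that a manuscript's data matches its preregistration and see at a glance which sessions are counted in more than one paper.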

3.6 Methodological generalization claims need bounds

The AHI program blurb states that "the methods generalize beyond writing — to coding, design, research, education, and clinical practice." A peer reviewer would treat this as forward-looking conjecture, not as a contribution claim. The program should distinguish demonstrated generalization (none yet, by design) from projected generalization. The latter belongs in a discussion section, not in the contribution claim of any submitted paper.


4. Threats-to-validity register

The program acknowledges three load-bearing threats. Each is correctly named; mitigations vary.

4.1 Auto-ethnography of the principal investigator

Threat: the PI is Penwright's designer and most active user. For descriptive work this is depth; for causal claims it is unmitigable confounding within the existing cohort.

Mitigation: external-operator pilot (PA-009) — 5–10 outside Penwright users recruited under IRB-clean consent + paid compensation, onboarded under the same instrumentation, tracked for 6+ months.

Reviewer's note: the mitigation is the right shape and is queued. Until pilot data accumulates, no paper that requires causal inference about Penwright's effect on the writer should clear review. Descriptive papers about what Penwright is and how it works are unaffected by the threat.

4.2 Single-system generalization

Threat: findings hold for Penwright as instantiated and may not transfer to other AI-writing systems or to AI-augmented capability development more broadly.

Mitigation: theoretical scaffolding that explains why findings should travel; eventual replication on adjacent surfaces using the Adaptive Authorship Control Kernel as portable substrate; an explicit generalization-bounds paper in the twelve-paper queue.

Reviewer's note: this mitigation is currently weaker than 4.1. Theoretical scaffolding is in the literature reviews; the substrate-portability replication is forward-looking; the generalization-bounds paper is not yet specifically named in the sub-paper plan. A reviewer would expect the program to either name which of the twelve papers carries this load, or add a thirteenth that does.

4.3 Longitudinal-data run-in window

Threat: the load-bearing capability claim cannot be measured until 6+ months of pilot data accumulates. During the run-in, the program produces descriptive evidence but cannot defend causal claims.

Mitigation: publication-queue discipline — papers requiring longitudinal data are gated on data accumulating; descriptive papers are sequenced first; pre-data papers are explicitly labeled.

Reviewer's note: this is honest and unusual. Most programs would be tempted to claim early. A reviewer should reward the discipline AND require the publication queue to enforce it visibly — no pre-data Paper 8 submission, even as a working paper, until the run-in window has elapsed.
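The gating rule described in this mitigation could be enforced mechanically rather than by convention. The following is a minimal sketch under stated assumptions: the `QueuedPaper` type, the `may_submit` function, and the 183-day figure (a stand-in for "6+ months") are hypothetical, and the pilot start date is invented for illustration since recruitment has not begun.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class QueuedPaper:
    paper_id: str
    needs_longitudinal_data: bool
    run_in_days: int = 0  # pilot run-in required before submission (assumed unit)


def may_submit(paper: QueuedPaper, pilot_start: Optional[date], today: date) -> bool:
    """Descriptive papers pass; longitudinal papers wait out the run-in window."""
    if not paper.needs_longitudinal_data:
        return True
    if pilot_start is None:
        return False  # the pilot has not started, so no run-in has elapsed
    return (today - pilot_start).days >= paper.run_in_days


paper_4 = QueuedPaper("paper-4", needs_longitudinal_data=False)
paper_8 = QueuedPaper("paper-8", needs_longitudinal_data=True, run_in_days=183)

# Hypothetical dates for illustration.
blocked = may_submit(paper_8, pilot_start=None, today=date(2026, 5, 5))
cleared = may_submit(paper_8, pilot_start=date(2026, 1, 1), today=date(2026, 7, 3))
```

Publishing the queue in a form like this would make each paper's pre-data or post-data status visible, which is the enforcement the reviewer's note asks for.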


5. Questions for the principal investigator

  1. LLM-scoring validation. Provide per-index precision/recall/Cohen's κ against expert human raters on a held-out sample, stratified by genre. Without this, no paper that invokes an LLM-scored index can be evaluated.
  2. Field-positioning depth. Working-alliance theory, transactive memory, niche construction, phenomenology of skill — name which of the twelve papers carries each bridge as foreground rather than background. If none, justify why the program-level positioning rests on bridges no individual paper develops.
  3. Generalization bounds. Name the paper (existing or to-be-added) that engages the single-system generalization question explicitly.
  4. Shared-dataset operationalization. Provide per-paper frozen Parquet snapshots, pre-registered exclusion criteria, and explicit data-overlap declarations between papers.
  5. Pilot pre-registration. When the external-operator pilot recruits, file a separate pre-registration covering recruitment channel, cohort composition, retention criteria, and what counts as "successful onboarding" before the cohort accumulates causal evidence.
  6. Publication queue. Make the pre-data vs. post-data status of each paper explicit in the sub-paper plan, with the run-in-window condition for each post-data paper named.
  7. Genre-fork power analysis. Paper 7 is load-bearing for the rest of the program. Provide a pre-data power analysis specifying the cohort size needed to detect plausible genre differences with adequate power.
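
For question 7, a first-pass power calculation can be sketched with the standard normal approximation for a two-sided, two-group comparison of means. This is a rough sketch, not the analysis the program would file: it assumes independent groups and a standardized effect size (Cohen's d), it ignores the repeated-measures structure that continuous telemetry would provide, and the effect sizes passed in are hypothetical inputs rather than estimates from Penwright data.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    which slightly understates the n an exact t-test calculation would give.
    """
    if effect_size_d <= 0:
        raise ValueError("effect size must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Hypothetical effect sizes for a difference between two genres.
medium = n_per_group(0.5)  # 63 per group under this approximation
large = n_per_group(0.8)   # 25 per group under this approximation
```

A within-writer design with many sessions per writer could need fewer participants than these between-group figures suggest, so the pre-data analysis should state which design it assumes.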

6. Verdict

Major revisions / publishable as a sequenced series with discipline.

The program's measurement framework, pre-registration discipline, and four-failure-mode veto are unusually rigorous for the field. The auto-ethnography → external pilot ladder is correct. The longitudinal capability claim is the right outcome variable.

What the program cannot do yet is defend the causal portion that gives it most of its theoretical value. The pilot has not delivered; the LLM-scoring validation is not in production; the generalization-bounds paper is not specifically slotted; the shared-dataset discipline is not operationalized.

Three concrete things would move the verdict from "major revisions" to "accept the descriptive papers now, hold the causal papers for the pilot":

  1. Pilot recruitment begins, with the cohort pre-registered separately.
  2. LLM-scoring validation passes ship before any LLM-scored index appears in a submitted paper.
  3. The publication queue makes pre-data vs. post-data status visible per paper, with run-in windows specified.

With those three, the descriptive papers (Paper 1 in two registers, Paper 3, Paper 4) are publishable now, and the causal papers (Paper 5, 7, 8) sit in the queue against pilot data accumulating. Without them, the program's empirical claims outpace what the data can support.

The dataset will be a real contribution. The framework's discipline is exemplary. The patience the longitudinal claim demands is unusual in a field that ships in three months. The program should be evaluated on whether it holds the discipline.


Reviewer signed off, 2026-05-05. Conflicts of interest: this review is self-authored as the audience-tier-1 (peer-review) framing of the program — a pre-acceptance internal critique rather than a third-party review. Companion artifacts: engineering.md (audience-tier 2, shipped 2026-05-05), product.md (audience-tier 3, shipped 2026-05-05), penwright-paper-01-public.md (general-audience, shipped). Vision specs at vela/docs/VISION-PENWRIGHT-AUTHORSHIP.md and vela/docs/VISION-PENWRIGHT-MEASUREMENT.md are the load-bearing internal sources.