Peer-review framing
By Mike West
Program: DevPlane research program — coordination cost in heterogeneous AI tool ecosystems.
Lead study: C1 — Risk Compensation in Human-AI Coordination (docs/research/PROPOSAL.md).
Apparatus: Coordination-event log (docs/research/instrumentation-spec.md, DP-63), continuous production telemetry on a single-operator multi-agent codebase.
Audience for this memo: an experienced referee in HCI, CSCW, empirical software engineering, or human-factors-of-automation. The work is being positioned for review at a venue with one foot in each tradition (CSCW, CHI, Human Factors, IEEE TSE, Empirical Software Engineering); this document anticipates the questions such a referee will ask and answers them honestly.
The intent here is not to defend the program; it is to articulate it in the language a referee uses. Where the program has weaknesses, this memo names them before the referee does.
1. What the program is
A field study, single-operator, of Ironies of Automation (Bainbridge 1983) in a setting the original cockpit literature did not anticipate: a developer supervising a heterogeneous fleet of AI coding agents (Cursor, Claude Code, others) through a coordination layer (DevPlane) on a real, multi-month, multi-project codebase. Dependent variables are standard human-factors quantities — vigilance proxy (dwell time per assignment), agent- and operator-attributable failure rates, time-to-detect for failures that escape into shipped state — measured continuously via a coordination-event log embedded in the operator's day-to-day tooling.
The program registers three studies: C1 (lead), risk compensation against a specific intervention (auto-resolve heuristic for assignment-status reconciliation) via Bayesian structural time-series with a ≥30-day run-in and ≥90-day post-intervention window; A1, descriptive auto-ethnography of stigmergic coordination patterns; B1, dispatch-quality calibration over operator-session duration. The program does not claim novel theory; it claims novel measurement of an existing prediction in a setting where the prediction has not been formally tested.
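For referees less familiar with the model class, a generic local-linear-trend structural time-series can be written as follows. This is the textbook form, offered as a sketch; the exact C1 specification lives in PROPOSAL.md, and the symbols here (y_t for the daily outcome, x_t for optional covariates, T_0 for the intervention date) are illustrative assumptions rather than the program's own notation.

```latex
\begin{aligned}
y_t &= \mu_t + \beta^{\top} x_t + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}(0, \sigma_\varepsilon^2), \\
\mu_{t+1} &= \mu_t + \delta_t + \eta_t, & \eta_t &\sim \mathcal{N}(0, \sigma_\eta^2), \\
\delta_{t+1} &= \delta_t + \zeta_t, & \zeta_t &\sim \mathcal{N}(0, \sigma_\zeta^2), \\
\tau_t &= y_t - \tilde{y}_t, & t &> T_0 .
\end{aligned}
```

Here \tilde{y}_t is the posterior predictive counterfactual from a model fitted only on the run-in (t ≤ T_0), so the trend learned during the ≥30-day run-in is what gets projected across the ≥90-day post-intervention window, and \tau_t is the pointwise effect reported with its credible interval.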
2. Position against the named literatures
The contribution is the intersection, not any single axis. The program reads as a small advance on each axis individually; it is more interesting where they meet.
2.1 Cockpit and process-control HCI
Bainbridge (1983, Automatica) named the central irony: automating the easy parts makes the residual harder while eroding the practice and situation awareness needed for it. The literature elaborates Bainbridge along several axes: Endsley (1995, Human Factors) on situation awareness and the SAGAT instrument; Lee & See (2004, Human Factors) on trust calibration as a two-sided problem; Parasuraman & Manzey (2010, Human Factors) on complacency and automation bias as attentional phenomena. The risk-compensation strand — Peltzman (1975, JPE); Wilde's risk homeostasis theory (1982, Risk Analysis) — sits adjacent. The empirical record across automotive and non-automotive settings supports partial offset more often than full homeostasis. C1 predicts and tests partial offset specifically.
Contribution. The cockpit literature was built around a narrow operator-supervises-one-or-two-systems pattern (one cockpit, one anesthesia rig). The AI-agents-on-a-codebase topology is different — one operator, N concurrent agents with overlapping authority over the same artifact, mediated through a coordination surface rather than a single instrument panel. The Ironies prediction in this regime has not, to my knowledge, been formally tested.
2.2 Computer-supported cooperative work
Schmidt & Bannon (1992, CSCW) name articulation work (the meta-work of dividing, scheduling, and integrating cooperative work), which is exactly what the coordination layer is absorbing. Suchman's Plans and Situated Actions (1987) reframed workplace activity from plan-execution to situated improvisation, with the implication that coordination tools treating plans as authoritative will systematically misrepresent how work happens. Grudin (1988, Proceedings of CSCW '88) named the coordination-cost asymmetry: the people who do the coordination work are not the people who get the benefit. The stigmergy strand draws from Hayes-Roth's blackboard architectures (1985, Artificial Intelligence); Bolici, Howison, and Crowston's stigmergic-coordination work in FLOSS is one of the few empirical treatments at sustained operational scale.
Contribution. Telemetry on stigmergic coordination among heterogeneous AI agents at sustained operational scale; a contemporary instance of Grudin's coordination-cost asymmetry; a setting where Suchman's situated-action framing can be tested against formal-plan framing using continuous data on dispatch-text fidelity to shipped output.
2.3 Empirical software engineering
Brooks (1975) and Conway (1968) supply the original super-linear coordination-cost arguments; the field has been empirically testing them for decades. Herbsleb & Mockus (IEEE TSE) on coordination costs in distributed development is the methodological template — operationalize coordination cost as time-to-completion deltas under specified conditions — applied at smaller scale with richer per-event data here. Christensen and Bird on socio-technical coordination, and the CMU empirical-SE tradition (Vasilescu et al.) on telemetry-based research, supply the discipline inherited: acknowledged confounds, skeptical treatment of activity metrics, attention to selection effects in repository data.
Contribution. Rich methodology and large-N observational corpora exist; operational telemetry on the multi-agent operator role, even at very small scale, does not. DevPlane provides small N at very high resolution. The tradeoff is explicit; the program does not pretend to speak about software engineering in general.
2.4 Behavioral decision-making
The trust-calibration mechanism mediating C1 depends on this literature. Kahneman & Tversky's heuristics-and-biases program supplies the broad framing. Tetlock (Expert Political Judgment 2005; Superforecasting 2015) provides the canonical instruments — Brier scores, calibration diagrams, the calibration-versus-resolution distinction. Klein's naturalistic decision-making offers a complementary framing for time-pressured operational settings. Decision-fatigue findings (Vohs, Baumeister, with subsequent replication concerns) and deliberate-practice findings (Ericsson) make opposite predictions on within-session quality trajectories.
Contribution. Continuous data on a single operator making thousands of micro-decisions over months in an operational setting. The behavioral literature has rich theory and laboratory evidence; the dispatch corpus tests calibration, fatigue, and continuation-bias predictions the laboratory can only approximate.
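The calibration instruments named above can be sketched in a few lines. The example assumes the operator records a probability that each dispatched assignment will ship correctly; that elicitation step, the function names, and the ten-bin default are assumptions for illustration, not part of the instrumentation spec. The decomposition is the standard reliability / resolution / uncertainty split, which is exact only when forecasts within a bin share one value.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes, n_bins=10):
    """Return (reliability, resolution, uncertainty).

    Lower reliability means better calibration; higher resolution means the
    forecasts separate successes from failures more sharply.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group forecasts into equal-width probability bins.
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        idx = min(int(f * n_bins), n_bins - 1)
        bins[idx].append((f, o))

    reliability = 0.0
    resolution = 0.0
    for members in bins.values():
        k = len(members)
        mean_forecast = sum(f for f, _ in members) / k
        observed_rate = sum(o for _, o in members) / k
        reliability += k * (mean_forecast - observed_rate) ** 2
        resolution += k * (observed_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty
```

When every forecast in a bin is identical, the Brier score equals reliability minus resolution plus uncertainty, which is what makes the calibration-versus-resolution distinction reportable as separate numbers.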
2.5 Multi-agent systems and stigmergic coordination
Hayes-Roth's blackboard architectures are the direct theoretical ancestor of the kanban-as-shared-artifact pattern DevPlane uses. Recent multi-agent LLM work (MetaGPT, ChatDev, AutoGen, LangGraph, CrewAI) is rapidly growing but largely engineering-focused rather than empirical; how multi-agent LLM coordination actually fails in production with a human in the loop is under-studied.
Contribution. A1's stigmergic-drift question — does coordination through shared artifacts produce systematically different failure signatures than direct-communication coordination — is a question this literature implies but rarely tests with real production data on heterogeneous agents.
3. Threats to validity — explicit register
A peer-review-shaped framing must lead with the threats. The program's design treats them as constraints to mitigate, not as objections to deflect.
3.1 Auto-ethnography of the principal investigator
The PI is also the operator of the system being studied. For the descriptive arms (A1 stigmergic-drift; false-complete base-rate reporting) this is auto-ethnography — a recognized tradition in CSCW and human-factors field work, but requiring explicit acknowledgment in every reported finding. For the causal arms (C1, future C-arm interventions), it is a substantive threat to internal and external validity; PI awareness of the experimental design will affect dispatch behavior in ways the design cannot fully mask.
Mitigation. The auto-ethnography flag travels with every finding. The C1 intervention (auto-resolve heuristic, ASN-953/979 lineage) was planned as a product improvement, not introduced for the study, which limits but does not eliminate awareness contamination. The pre-registered coding rule for failure attribution (docs/research/methodology.md §3) constrains post-hoc reinterpretation. External-operator replication is a committed successor study, not a hypothetical safeguard.
3.2 Single-operator generalization risk
Findings from N=1 do not transfer mechanically to multi-operator teams, operators with different prior experience, or different agent fleets. Multi-operator settings have additional coordination dynamics (cross-operator communication, role differentiation, social dynamics) absent here.
Mitigation. The theoretical framing names the architectural features doing the work — heterogeneous-agent topology, stigmergic-coordination surface, single-supervisor pattern — so transfer can be tested rather than assumed. C1 results should be read as a demonstration that the Ironies prediction can be detected in this regime, not as a measurement of magnitude in any other regime. Effect sizes are reported as point estimates with credible intervals; classical power analysis does not apply at N=1, and the program does not pretend otherwise.
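As a minimal sketch of what "point estimate with credible interval" means at N=1, the following compares a failure rate before and after an intervention using independent Beta posteriors. This is a deliberately simplified stand-in, not the structural time-series model C1 commits to: it ignores trend and autocorrelation. The uniform Beta(1, 1) prior, the function name, and any counts passed to it are assumptions for illustration only.

```python
import random


def rate_difference_interval(pre_fail, pre_n, post_fail, post_n,
                             draws=20000, level=0.95, seed=0):
    """Posterior mean and equal-tailed interval for (post rate - pre rate).

    Each rate gets a Beta(1 + failures, 1 + successes) posterior, i.e. a
    uniform prior updated with binomial counts. Draws are Monte Carlo samples.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    diffs = []
    for _ in range(draws):
        p_pre = rng.betavariate(1 + pre_fail, 1 + pre_n - pre_fail)
        p_post = rng.betavariate(1 + post_fail, 1 + post_n - post_fail)
        diffs.append(p_post - p_pre)

    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1 - tail) * draws) - 1]
    point = sum(diffs) / draws
    return point, lower, upper
```

The interval describes uncertainty about this operator's rates in this corpus; it says nothing about between-operator variation, which is why it is not a substitute for a power analysis.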
3.3 Instrumentation observation effects
Logging coordination events changes operator behavior. This is the Hawthorne effect in its standard form, and it cannot be eliminated in a self-instrumented study.
Mitigation. Logging is passive and append-only. No real-time UI surfacing of the corpus to the operator; the operator does not see weekly aggregates during the run-in or post-intervention windows. The structural time-series model absorbs the run-in trend, which captures adaptation-to-being-instrumented as part of baseline. The paper names this confound in §1, not in limitations.
3.4 Selection bias in agent choice
The agent fleet (Cursor, Claude Code, occasional others) is selected by the operator, not assigned at random. Vendors update models without notice; capabilities shift across the study window.
Mitigation. Vendor-side changes are logged when known and cross-checked against the event log for unexplained shifts in agent-attributable failure rate. The C1 intervention is coordination-layer, not agent-layer; the prediction concerns the operator's response to a perceived reliability change in the coordination surface, which is partly decoupled from underlying model capability. Vendor comparison is not a primary finding — if Cursor outperforms Claude Code on some metric in this corpus, that is reported as a property of the corpus.
3.5 Construct validity of the vigilance proxy
Dwell time per assignment is the H2 dependent variable and is a proxy for latent operator vigilance. The proxy may not track the construct uniformly (the operator may learn to dispatch faster without becoming less vigilant; dwell time may inflate from external interruption).
Mitigation. Dwell-time effects are reported as proxy results, not direct measurements of vigilance. Time-to-detect for failures that escape to shipped state is a complementary signal — Bainbridge specifically predicts that detection time worsens even if rate is unchanged, testable independently of dwell time.
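The independence of the two signals can be illustrated with a small sketch: time-to-detect is computed only from the timestamps of escaped failures, so it can move while the failure rate stays flat. The pair layout (shipped_at, detected_at) and the choice of the median as summary are assumptions; the instrumentation spec may define the quantity differently.

```python
from datetime import datetime
from statistics import median


def median_hours_to_detect(escaped_failures):
    """Median gap, in hours, between a failure shipping and being detected.

    escaped_failures: iterable of (shipped_at, detected_at) datetime pairs.
    Returns None when the window contains no escaped failures.
    """
    gaps = [(detected - shipped).total_seconds() / 3600.0
            for shipped, detected in escaped_failures]
    return median(gaps) if gaps else None


def escaped_failure_rate(escaped_failures, assignments_shipped):
    """Share of shipped assignments that later turned out to be failures."""
    if assignments_shipped == 0:
        return None
    return len(list(escaped_failures)) / assignments_shipped
```

Two windows with the same number of escaped failures per shipped assignment can still differ in median_hours_to_detect, which is the pattern the Bainbridge prediction would produce.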
3.6 Failure-attribution coding and multiple comparisons
The agent / operator / shared-attribution boundary is judgment-laden; inter-rater reliability is impossible at N=1. The coding rule is pre-registered (PROPOSAL §6); all raw evidence — dispatch text, agent output, merge outcome — is preserved so external re-coding is tractable, with a successor assignment committed for that purpose. The primary analysis is pre-registered; secondary analyses are flagged exploratory. OSF deposit precedes unblinding to post-intervention rate data; revisions to the analysis plan increment a version stamp and preserve the original in git history.
3.7 Scope limits
The program does not test whether AI agents are good or bad at coding, does not compare vendors as a primary finding, does not make general productivity claims, does not generalize to multi-operator teams, and does not make policy or buying recommendations. These are scope limits, not evasions; a referee should hold the program to claims within its scope.
4. What the program contributes that existing literatures do not
Four contributions, evaluable independently:
- Continuous production telemetry on a real operator running real heterogeneous AI agents on a real multi-month codebase. The closest analogues (Copilot RCTs, GitHub-based observational studies) capture either single-tool single-task settings or large-N low-resolution repository telemetry. Neither covers the multi-agent operator role at high temporal resolution.
- A coordination-event log as research instrument, designed before the study rather than reverse-engineered from existing telemetry. The log records dispatch text (hashed, separately scrubbable), agent self-reports stored separately from ground-truth merge outcomes, operator interventions with kind and dwell, conflict and auto-resolve signals, session start and end. The deliberate separation of agent self-report from outcome operationalizes the false-complete construct, which several adjacent literatures need but few have measured.
- A failure-mode taxonomy with prevalence rates as reference point for subsequent comparative studies. The base-rate report on agent false-complete events (how often agent self-reports of completion are accurate when checked against shipped state) is a parameter the multi-agent LLM, trust-calibration, and empirical-SE literatures each need as input.
- A formal test of Ironies of Automation in an AI-coordination setting. The prediction is forty years old; its application to AI coding tools is largely informal. C1 commits a pre-registered design with explicit yes-world and no-world consequences before data collection. Negative results publish with equal prominence.
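The separation of agent self-report from ground-truth outcome described in the second and third contributions can be sketched as two record types joined only at analysis time. Every class, field, and function name below is hypothetical; the actual schema is defined in docs/research/instrumentation-spec.md and may differ. SHA-256 is likewise an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSelfReport:
    """What the agent claims; never overwritten by the outcome."""
    assignment_id: str
    agent: str
    claimed_complete: bool


@dataclass(frozen=True)
class MergeOutcome:
    """Ground truth from shipped state, stored separately."""
    assignment_id: str
    shipped_ok: bool


def hash_dispatch(text: str) -> str:
    """Store a digest of the dispatch text instead of the text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def false_complete_rate(reports, outcomes):
    """Share of 'complete' self-reports contradicted by the merge outcome.

    Only assignments with a recorded outcome are counted. Returns None when
    no claimed-complete assignment has been checked yet.
    """
    truth = {o.assignment_id: o.shipped_ok for o in outcomes}
    checked = [r for r in reports
               if r.claimed_complete and r.assignment_id in truth]
    if not checked:
        return None
    contradicted = sum(1 for r in checked if not truth[r.assignment_id])
    return contradicted / len(checked)
```

Keeping the two records in separate stores means the base rate is computed by a join, not read from a field an agent or the operator could have edited after the fact.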
5. Anticipated referee questions
Why field rather than laboratory? Risk compensation is longitudinal and system-level; one-to-two-hour laboratory tasks produce the wrong distribution of operator behavior. The construct cannot be measured at meaningful magnitude in the laboratory.
The maturation confound. Bayesian structural time-series with a ≥30-day run-in absorbs the trend; the trend coefficient is reported. The H1 prediction is that the post-intervention trajectory sits above the agent-improvement-only counterfactual but below the pre-intervention trend. Maturation alone would predict the trajectory tracks or exceeds the counterfactual.
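One way to make the H1 band concrete is to express the observed post-intervention level as a fraction of the gap between the two reference trajectories. The sketch below assumes an outcome where higher is worse (for example a failure rate), which the paragraph above does not state explicitly; the function names and category labels are illustrative, not the pre-registered decision rule.

```python
def offset_fraction(observed, counterfactual, pre_trend):
    """Fraction of the expected improvement that was offset.

    0.0 means the observed value sits on the agent-improvement-only
    counterfactual; 1.0 means it sits on the extrapolated pre-intervention
    trend. Values are levels at one time point or averages over a window.
    """
    gap = pre_trend - counterfactual
    if gap == 0:
        raise ValueError("reference trajectories coincide; fraction undefined")
    return (observed - counterfactual) / gap


def classify_offset(fraction):
    """Map an offset fraction onto the three qualitative readings."""
    if fraction <= 0:
        return "no offset"
    if fraction < 1:
        return "partial offset"
    return "full or over-compensation"
```

Under this reading, H1 corresponds to a fraction strictly between 0 and 1. In practice the fraction would be computed on posterior draws so that it inherits a credible interval instead of being a single number.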
Auto-ethnography in causal clothing. Conceded for descriptive arms; flagged on every causal finding; external-operator replication is a committed successor study, not a hedge. C1 stands on detectability in this case; generalization is a separate empirical question.
Insufficient agent-side signal. The proposal pre-registers backup interventions in priority order (structured handoff protocol, capacity-regulation feature, conflict-detection system). Each is a candidate for a sequential H1 test.
Publication path. C1 targets a venue that takes field studies of human-AI coordination seriously — Human Factors, CSCW proceedings, possibly Empirical Software Engineering. OSF pre-registration is timed before unblinding to post-intervention rate data. The replication footprint — the export and analysis scripts that produce the published numbers from the released corpus — is a closure requirement; if a result cannot be re-derived from the repository as of the published commit, it is not yet a report (docs/research/methodology.md §5).
6. Where this memo ends
A peer-review-framing memo does not argue the work is correct. It argues the work is positioned, that its threats are named, and that the contribution it claims is the contribution it can defend. C1 is small in scope and large in risk to its own conclusions. A referee should hold the program to its pre-registered predictions, to the auto-ethnography flag traveling with every causal claim, and to the replication-footprint standard the methodology document commits to.
Null results (H0) will be published with equal prominence, together with an analysis of why this regime may differ from cockpit and process-control cases; that is a finding too. Positive results are single-operator, and generalization is a separate study. Either way, the apparatus and corpus persist; the next study is cheaper than the first.
References (engaged in this memo)
- Bainbridge, L. (1983). Ironies of Automation. Automatica, 19(6), 775–779.
- Bolici, F., Howison, J., & Crowston, K. Stigmergic coordination in FLOSS development.
- Brooks, F. P. (1975). The Mythical Man-Month. Addison-Wesley.
- Christensen, H. B., & Bird, C. (socio-technical coordination in software development).
- Conway, M. E. (1968). How do committees invent? Datamation, 14(4), 28–31.
- Endsley, M. R. (1995). Toward a theory of situation awareness. Human Factors, 37(1), 32–64.
- Grudin, J. (1988). Why CSCW applications fail: Problems in the design and evaluation of organizational interfaces. Proceedings of CSCW '88.
- Hayes-Roth, B. (1985). A blackboard architecture for control. Artificial Intelligence, 26(3), 251–321.
- Herbsleb, J. D., & Mockus, A. Speed and communication in globally distributed software development. IEEE TSE.
- Klein, G. (recognition-primed decision model).
- Lee, J. D., & See, K. A. (2004). Trust in automation. Human Factors, 46(1), 50–80.
- MetaGPT; ChatDev (multi-agent LLM framework empirical reports).
- Mosier, K. L., & Skitka, L. J. (1996). Human decision makers and automated decision aids.
- Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation. Human Factors, 52(3), 381–410.
- Peltzman, S. (1975). The effects of automobile safety regulation. JPE, 83(4), 677–725.
- Schmidt, K., & Bannon, L. (1992). Taking CSCW seriously: Articulation work. CSCW, 1(1–2), 7–40.
- Suchman, L. (1987). Plans and Situated Actions. Cambridge University Press.
- Tetlock, P. E. (2005). Expert Political Judgment; Tetlock & Gardner (2015), Superforecasting.
- Vasilescu, B., et al. (CMU empirical-SE telemetry methodology).
- Wilde, G. J. S. (1982). The theory of risk homeostasis. Risk Analysis, 2(4), 209–225.
Full review with confidence flags lives at docs/research/LITERATURE-REVIEW.md. Uncertain references will be reconciled against scite / Google Scholar / deep-research outputs before submission.