Product implications
What the research program tells a product builder thinking about coordination tooling for human-AI software development. What to build into DevPlane next, what to build into adjacent tools, what the broader product space gets wrong — traceable to architectural decisions and pre-registered predictions, with honest pre-data caveats where applicable.
By Mike West
Product implications — what the DevPlane research program tells us to build next
A product strategist's lens on the research program to date. What the architectural commitments and pre-registered predictions imply for the next two quarters of build, where the roadmap is following the empirical record, and where it is making explicit bets ahead of evidence.
— 2026-05-08
1. The state of the record, plainly
The DevPlane research program is in pre-instrumentation setup. The three-arm structure is named; the lead study (C1, risk compensation in human-AI coordination) is specified at proposal grade with pre-registered hypotheses and a Bayesian structural time-series analysis plan; and the five-literature review with confidence flags is published. The apparatus is production telemetry from a real operator running real agents on a real, multi-month codebase, but the coordination-event log itself (DP-63) has not yet shipped. Per the pipeline-status snapshot of 2026-04-30, nothing is producing research data yet. A partial pre-instrumentation corpus exists in git history and devplane_log_completion records. It is recoverable for the dispatch, merge_outcome, and rework event types but not for dwell_ms or operator_inspect, which require the live event log.
That state matters for everything in this memo. Most of the implications below are not "the data told us to build X" — there is not yet enough live data to dictate features. Most are "the architecture and the pre-registered predictions told us what the data, when it arrives, will be capable of adjudicating, and the product needs to be built so the adjudication is honest." That is product strategy in service of a not-yet-running empirical engine, and it is its own discipline. Where I make concrete recommendations, I tag whether they are anchored in an architectural decision, an empirical claim that already has partial corpus support, or a pre-registered prediction whose evidence is still ahead.
2. What to build into DevPlane next, anchored to the program
2.1 The coordination-event log enables what single-tool dashboards cannot — instrument the gaps that prove it
Architectural decision. The coordination-event log (instrumentation-spec.md, DP-63) is specified to capture dispatch, claim, agent self-report, intervention, merge outcome, and dwell against a stable assignment identity — across heterogeneous tools. The instrument is not "what Cursor saw" or "what Claude Code saw"; it is what the joint system did, which is precisely the construct laboratory studies and single-tool dashboards cannot reproduce.
Implication. Three things become measurable that no single AI coding tool can measure on its own, and the product should expose all three as first-class operator surfaces once the log is live:
- Self-report-vs-outcome divergence per agent-tuple. The methodology stores agent self-reports separately from ground-truth merge outcomes specifically because their divergence is the research signal, and that same divergence is how the product operationalizes false-complete. A "false-complete rate per (agent × model × prompt-template)" panel is one query off the schema once events are flowing. No single-tool dashboard can compute this because no single tool sees both the agent's "done" and the reviewer's verdict on a different tool's commit.
- Total time-to-merge decomposed into operator and agent components. The C-arm's published goal includes this decomposition explicitly. It is the only honest answer to "is the agent fast?" — agent wall-clock is not the same as net cycle time, and net cycle time is the only metric that survives contact with the Ironies of Automation prediction.
- Operator dwell time per assignment as a vigilance proxy (H2 of the C1 pre-registration). Single-tool dashboards measure tool-specific time-on-tool. The coordination-event log measures attention allocation across a queue, which is a different and more useful construct.
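As a rough illustration of the first view, the false-complete rate can be computed by joining self-reports to merge outcomes on the assignment identity. This is a minimal sketch, not the DP-63 schema: the `Event` fields, the `self_report` event-type name, and the `claimed_done` / `merged_clean` flags are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real instrumentation-spec fields may differ.
    assignment_id: str
    event_type: str  # "dispatch", "self_report", or "merge_outcome"
    agent: str = ""
    model: str = ""
    prompt_template: str = ""
    claimed_done: Optional[bool] = None  # set on self_report events
    merged_clean: Optional[bool] = None  # set on merge_outcome events


def false_complete_rates(events):
    """Rate of 'agent said done, merge outcome said otherwise' per agent tuple."""
    tuple_by_assignment, claimed, outcome = {}, {}, {}
    for e in events:
        if e.event_type == "dispatch":
            tuple_by_assignment[e.assignment_id] = (e.agent, e.model, e.prompt_template)
        elif e.event_type == "self_report":
            claimed[e.assignment_id] = bool(e.claimed_done)
        elif e.event_type == "merge_outcome":
            outcome[e.assignment_id] = bool(e.merged_clean)

    # counts[key] = [false_completes, claimed_done_with_known_outcome]
    counts = defaultdict(lambda: [0, 0])
    for assignment_id, key in tuple_by_assignment.items():
        # Only assignments with a "done" claim AND a recorded outcome are scored.
        if claimed.get(assignment_id) and assignment_id in outcome:
            counts[key][1] += 1
            if not outcome[assignment_id]:
                counts[key][0] += 1
    return {key: fc / n for key, (fc, n) in counts.items()}
```

Keeping the self-report and the outcome as separate events, joined only at query time, is what lets the divergence be measured at all. Assignments with no recorded outcome are excluded rather than counted as failures, which is a design choice of this sketch and not a rule taken from the methodology.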
Concrete action. Ship DP-63 first, without gating it on other work, then surface these three views as the first analytic layer on top: not as a separate research dashboard, but as part of the operator's day-to-day product surface. The research benefit is that the operator's normal use of the surface produces the corpus; the product benefit is that nothing else on the market shows these views. Pre-data caveat: the false-complete rate and the dwell-time distribution are real numbers that can be computed and reported; their causal relationship to a coordination intervention is what C1 is built to test, and that test cannot run until the run-in period has accumulated.
2.2 The two-phase actor handoff is a thesis about review-trust — make it portable
Architectural decision. The two-phase actor handoff (builder → reviewer, where the reviewer must produce an artifact only the reviewer can produce — committing with explicit paths, pushing, firing the second devplane_log_completion) is documented in the methodology as both operationally useful and methodologically useful. Operationally: the reviewer catches what the builder cannot self-see. Methodologically: it makes the boundary between agent-attributable and operator-attributable failure auditable after the fact, which is exactly the construct C1 needs to pre-register.
Implication. Two-phase handoff is a thesis about review-trust: that the integrity of a review depends on the reviewer doing something the builder cannot, not on the reviewer being asked to "look at" something the builder also looks at. This is the inverse of how most LLM-driven review tools work today — they layer a second agent on the same artifact at the same checkpoint, and the second agent has the same incentive structure as the first. The DevPlane pattern enforces the asymmetry by requiring a specific commit posture only the reviewer is positioned to produce. It is a small structural commitment with a large epistemic payoff.
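The asymmetry can be sketched as a small guard on completion logging. This is an illustrative model only: `Card`, `log_completion`, and the phase names are hypothetical stand-ins, and nothing here reflects the actual signature of devplane_log_completion.

```python
from dataclasses import dataclass, field


class HandoffError(ValueError):
    """Raised when a completion would violate the two-phase boundary."""


@dataclass
class Card:
    card_id: str
    builder: str
    # Each entry: (actor, phase, committed_paths)
    completions: list = field(default_factory=list)


def log_completion(card, actor, phase, committed_paths=()):
    """Record a phase completion, enforcing the builder/reviewer asymmetry."""
    if phase == "build":
        if actor != card.builder:
            raise HandoffError("only the assigned builder can close the build phase")
    elif phase == "review":
        if not any(p == "build" for _, p, _ in card.completions):
            raise HandoffError("review cannot complete before build")
        if actor == card.builder:
            raise HandoffError("the builder cannot review their own work")
        if not committed_paths:
            # The reviewer must produce an artifact only the reviewer can produce.
            raise HandoffError("review completion requires explicit committed paths")
    else:
        raise HandoffError(f"unknown phase: {phase}")
    card.completions.append((actor, phase, tuple(committed_paths)))


def is_done(card):
    """A card is done only when both phases have been logged."""
    phases = [p for _, p, _ in card.completions]
    return "build" in phases and "review" in phases
```

Because each completion records its actor and the reviewer's committed paths, the attribution boundary stays recoverable after the fact, which is the property the research design relies on.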
Concrete action. Three product moves, ordered by dependency:
- Make the boundary visible in the UI. The kanban already encodes builder/reviewer phases on dual-stage cards. The next step is showing the operator the per-card attribution — "the builder said done; the reviewer found N corrections; here are the diffs the reviewer authored that the builder did not" — as a first-class card detail rather than a research export.
- Expose the protocol as a primitive other tools can adopt. The convergence brainstorm (2026-05-02) already filed the multi-axis reviewer (DP-68) and origin marker (DP-71) as portable substrate. Two-phase actor handoff belongs in the same family. Cross-application mapping (DP-93–DP-96) is the workstream; the product implication is that the pattern should be packaged as a reusable substrate before it is generalized to a SaaS product.
- Resist the temptation to collapse it. The two-phase pattern adds latency. The product pressure to "let one agent commit and self-review" will be constant. The pre-registered C1 design depends on the boundary remaining auditable, and the broader Arm-A descriptive work depends on the corpus carrying separate self-report and outcome signals. If the protocol gets collapsed for UX reasons, the research instrument degrades silently.
Pre-data caveat: the causal claim that two-phase handoff reduces operator-attributable failure has not been tested. The pattern is currently defended on operational grounds (it catches things) and methodological grounds (it preserves attribution). A formal test of its causal impact would be a candidate C-arm intervention after C1 clears.
2.3 The C1 study is positioned to falsify a load-bearing product assumption — build the surfaces that let it
Pre-registered prediction. C1 (PROPOSAL.md) pre-registers three hypotheses, two directional and one null, with both yes-world and no-world consequences explicitly written down. H1 (primary): a coordination-protocol improvement that reduces agent-attributable failure rate produces a non-trivial offsetting increase in operator-attributable failure rate, such that net system error reduction is meaningfully smaller than the agent-side improvement alone. H2 (secondary): operator dwell time per assignment decreases following the improvement. H0 (the null): agent-side improvement passes through with no detectable change in operator behavior.
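The arithmetic behind H1 can be written down in a few lines. The notation is mine, not PROPOSAL.md's: $a_0, a_1$ are the agent-attributable failure rates before and after the improvement, and $o_0, o_1$ are the operator-attributable rates. The sketch ignores the shared-attribution category and is not the Bayesian structural time-series model itself.

```latex
\begin{aligned}
\Delta_a &= a_0 - a_1 && \text{(agent-side improvement)}\\
\Delta_o &= o_1 - o_0 && \text{(operator-side offset)}\\
\Delta_{\text{net}} &= (a_0 + o_0) - (a_1 + o_1) = \Delta_a - \Delta_o
\end{aligned}
```

Under H1, $\Delta_o > 0$ by a non-trivial margin, so $\Delta_{\text{net}} < \Delta_a$. Under H0, $\Delta_o \approx 0$ and $\Delta_{\text{net}} \approx \Delta_a$. A measurement that records only $\Delta_a$ cannot distinguish the two worlds.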
Implication. C1 is positioned to falsify the dominant productivity argument for AI coding tools — the argument that agent-side improvements compound without offset, justified by agent-side measurement. The yes world says vendors and buyers are systematically overstating net effect because they are counting only one side of the ledger. The no world says operators of heterogeneous AI tool ecosystems are not behaving like cockpit pilots and the productivity claims survive. Either result has product consequences. The product I would build differs depending on which world we are in.
Concrete action. Build the operator-side measurement layer DevPlane needs to be honest about its own findings, regardless of which world C1 lands in:
- A net-effect view, not an agent-effect view, as the headline operator metric. Today most coordination tools (DevPlane included) lead with cards-shipped, time-to-PR, and similar agent-side counters. The C1 design implies the headline should be net cycle time per merged unit with a visible decomposition into agent and operator components — and a small chart showing how the decomposition has shifted over time. If H1 lands, this view is the moat. If H0 lands, this view is still the most honest summary the field has.
- A "rework cause" coding affordance on every reverted card. The pre-committed coding rule for failure attribution (agent-attributable / operator-attributable / shared-attribution) is a weekly research task today. Surfacing it as a one-click affordance on the kanban — on rework cards specifically, gated to operator role — turns the research task into a product feature. The rupture-repair instrumentation already filed as DP-79 is the substrate.
- A pre-registration link on the DevPlane research surface. The proposal commits to OSF deposit before unblinding to post-intervention rate data. Once filed, the product should surface the OSF link inside DevPlane itself — at minimum on the research/about page — because that link is the credibility instrument that distinguishes this work from vendor-funded productivity studies.
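The net-effect headline in the first item above can be sketched as a per-assignment decomposition. The rule used here, that each event type determines who holds the card until the next event, is an assumption made for illustration, as is the `self_report` event-type name; the C-arm's actual decomposition rule may differ.

```python
# Who is assumed to hold the card after each event type (hypothetical mapping).
HOLDER_AFTER = {
    "dispatch": "agent",
    "rework": "agent",
    "self_report": "operator",
}


def decompose_cycle(events):
    """Split one assignment's cycle time into agent and operator components.

    events: list of (t_seconds, event_type) tuples for a single assignment,
    with the final event being the merge outcome.
    """
    ordered = sorted(events)
    totals = {"agent": 0.0, "operator": 0.0}
    # Attribute each interval to whoever held the card at its start.
    for (t0, kind), (t1, _next_kind) in zip(ordered, ordered[1:]):
        holder = HOLDER_AFTER.get(kind)
        if holder is not None:
            totals[holder] += t1 - t0
    totals["net"] = ordered[-1][0] - ordered[0][0]
    return totals
```

For an assignment that is dispatched, reported done, sent back for rework, reported done again, and then merged, the agent and operator components sum to the net cycle time, which makes the view easy to sanity-check before charting how the split shifts over time.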
Pre-data caveat: H1 requires ≥30 days of run-in plus ≥90 days of post-intervention observation. The intervention itself (auto-resolve heuristic, lineage ASN-953/979) is queued but cannot deploy until run-in baseline is captured. Anything I write here about the C1 result is a prediction, not a finding.
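The timing constraint is simple enough to encode as a guard the product could check before showing any C1 result. This is a sketch under assumptions: the function and key names are hypothetical, and it treats the 30-day and 90-day minimums as the only gates, which the proposal may not.

```python
from datetime import date, timedelta

MIN_RUN_IN = timedelta(days=30)  # run-in baseline before the intervention
MIN_POST = timedelta(days=90)    # post-intervention observation window


def c1_readiness(log_start, today, intervention=None):
    """Report which C1 steps the calendar currently permits.

    log_start: date the live event log began producing data.
    intervention: date the intervention deployed, or None if not yet deployed.
    """
    # Run-in is measured up to the intervention, or up to today if none yet.
    run_in_elapsed = (intervention or today) - log_start
    run_in_ok = run_in_elapsed >= MIN_RUN_IN
    post_ok = intervention is not None and (today - intervention) >= MIN_POST
    return {
        "may_deploy_intervention": intervention is None and run_in_ok,
        "may_run_h1_analysis": run_in_ok and post_ok,
    }
```

Surfacing the two flags next to any C1 chart would keep a partially accumulated window from being read as a finding.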
3. What to build into adjacent tools, anchored to portable patterns
The Cross-Application Mapping work (DP-93 through DP-96) already identified that the multi-axis reviewer, the origin marker, and the sycophancy circuit-breaker port directly across the portfolio. Three further implications follow from the research program rather than from the feature brainstorm.
3.1 Heterogeneous-tool ecosystems share the same shape of problem. The OVERVIEW commits explicitly: methods generalize beyond AI agents — multi-tool ops dashboards, hospital handoff systems, distributed scientific instruments share the coordination-cost shape. That is a product line, not just a research disclaimer. The first portable instrument is the assignment-registry-plus-event-log substrate; vela and meta-factory should adopt it not because they have AI agents but because they have heterogeneous tools whose joint behavior is currently unmeasured. Anchored in: architectural decision (methodology §1).
3.2 Self-report-vs-outcome separation is a discipline, not a UI. Tools across the portfolio routinely conflate "the agent said it shipped" with "it shipped." The methodology mandate to store self-report and merge outcome separately is not AI-agent-specific: any system where a worker reports completion and a separate check verifies it has the same shape. Penwright's measurement framework, vela's adaptive authorship kernel, and the principia construct registry would all benefit from the same discipline. Anchored in: methodology §2, §3.
3.3 Pre-registered yes/no worlds work on product decisions, not just studies. Writing down the yes-world and the no-world for a load-bearing claim before you collect evidence is a research move; it is also a product-strategy move. Most roadmaps name the bet but not the disconfirming evidence. The methodology has exported a usable template — pre-registered predictions with falsifiable, operationalized constructs — that the rest of the portfolio could inherit. Anchored in: methodology §3.
4. What the broader product space gets wrong
Three places where the dominant framing in the AI-coding-tool market is at odds with the architectural commitments and pre-registered predictions of this program.
4.1 Productivity claims grounded in agent-side measurement alone are systematically suspect. This is the program's headline argument and it is not yet empirically settled. But the architectural premise (that net effect is the meaningful construct, that operator-attributable failure exists as a measurable category, and that operators reallocate vigilance dynamically within limited attentional capacity) already implies that any productivity claim grounded only in lines-produced, tasks-completed, or time-to-PR is missing a term. Buyers should treat such claims as upper bounds, not point estimates, until C1-style evidence accumulates.
4.2 Single-tool dashboards cannot diagnose multi-tool coordination problems by construction. Cursor's dashboard cannot see what Claude Code did to Cursor's commit. Claude Code's dashboard cannot see what Cursor's reviewer found. The coordination-event log is the answer; vendors who add operator-vigilance metrics to single-tool dashboards are giving operators a more refined view of one part of an instrument that is still missing its other half. The product correction is a coordination layer, not an instrumented agent.
4.3 Review-trust collapse is the failure mode no one is naming. Most LLM review tools today give operators a second agent on the same artifact at the same checkpoint, with a UI that nudges toward acceptance. The Nature 2026 finding (warmth-tuning raises errors 10–30 pp at vulnerability moments; cited in the convergence brainstorm killed list) is the lab evidence; the Ironies of Automation prediction is the field theory; DevPlane's two-phase pattern is the structural answer. The broader space is moving toward warmer review, not toward asymmetric review. If H1 lands in C1, the broader space is moving in the wrong direction at scale.
5. Pre-data caveats, by category
Where I am citing architecture, not findings. Sections 2.1, 2.2, 2.3 (action items), and all of section 3 are anchored in architectural decisions documented in PROGRAM, PROPOSAL, instrumentation-spec, and methodology. They do not depend on C1 results.
Where I am citing pre-registered predictions, not findings. Section 2.3 (the framing of net-effect vs agent-effect) and section 4.1 are anchored in C1's H1, which has not been tested. The yes-world consequences are conditional. The no-world is equally consequential, and the research discipline requires taking it equally seriously.
Where I am citing partial corpus. Self-report-vs-outcome divergence (section 2.1) is partially recoverable from devplane_log_completion history pre-DP-63, but rate estimates from that source are an upper bound on what the live log will produce, because dwell and operator-inspect events are missing entirely.
Where I am making explicit bets ahead of evidence. The cross-portfolio export claims in section 3 — that the same substrate ports usefully into vela, meta-factory, and beyond — are extrapolations from the architectural framing. The cross-application mapping work has begun; the empirical claim that the substrate produces useful measurement in those settings has not been tested.
6. What this memo refuses
This memo does not recommend features popular in the AI-coding-tool market but inconsistent with the program's architectural commitments. No warmth-tuned review. No agent-side-only headline metrics. No collapsing of two-phase handoff for UX velocity. No vendor-comparison surfaces as primary product. Each would produce short-term product motion at the cost of degrading the research instrument the product is built around. The discipline that makes the research credible is the same discipline that makes the product distinctive; trading one against the other is a category error.
Net recommendation: ship DP-63 unblocked, surface the three views the coordination-event log enables (false-complete rate, decomposed cycle time, dwell-time distribution) as first-class operator surfaces, keep the two-phase actor handoff structurally protected, and build the net-effect headline view that will be honest in either C1 world.