peopleanalyst

General-audience explainer

By Mike West

DevPlane · Audience tiers · General audience · source: people-analyst/devplane/docs/research/reviews/general-audience-explainer.md

What Happens When You Run a Fleet of AI Coding Agents

The everyday job of supervising several AI coders at once, a forty-year-old prediction from cockpit research that may be quietly recreating itself in software development, and what an honest study of that question looks like before the data is in.


There is a particular kind of moment that happens in a small office at about three in the afternoon. You have three AI coding agents running. One has reported that it finished the database migration. Another has gone quiet on a refactor that has been open for two hours. A third is about to finish something that, on closer look, overlaps in a way you didn't anticipate with what the first one already shipped. The agents are, in some sense, doing the work. You are doing — what, exactly?

The honest answer is that you are doing a very specific kind of cognitive labor that has no good name yet. You are coordinating. You are deciding what to dispatch and to whom. You are reading agent output for the kind of confident-sounding mistake an agent will produce while telling you the work is done. You are catching the moment when two agents have, between them, half-broken the same file. You are the only thing in the room that knows what is actually going on across the whole system.

The economic argument for AI coding tools — the one in the press releases, in the analyst reports, increasingly assumed by enterprise buyers — is that the agents do most of the work and the human does some lighter version of what the human used to do. Lines of code shipped goes up. Time-to-pull-request goes down. Productivity rises. The argument has an implicit assumption inside it: the coordination cost is small. The agent is the load-bearing piece; the human is a residual.

This research program is, at its core, a careful empirical test of that assumption. Not a hot take. Not a vendor benchmark. A pre-registered, falsifiable, production-telemetry study of where coordination cost actually lives when one human runs multiple AI coding agents on a real codebase, and whether the productivity story is missing a term that forty years of human-factors research suggests it might be missing.

The honest position right now: the apparatus is built, the predictions are written down, and the data is being accumulated. The answer is not in. The point of saying any of this in public, this early, is that the discipline of writing the prediction down before the data arrives is most of what separates research from product validation. The thing you are reading is the prediction.

The instrument behind the question

Most people studying multi-agent AI coordination today do it in a laboratory: a manufactured task, an hour or two of an engineer working with a chatbot, a survey afterward. That setting has a famous problem. The phenomenon being measured is largely a long-running, system-level adaptation, and you cannot reproduce a long-running adaptation in a one-hour task. You end up sampling the wrong distribution of the behavior you set out to measure.

The alternative is field telemetry, which is harder to do but produces data that means something. DevPlane, the dashboard holding the agents, the queue, the assignments, and the merge outcomes, was built for a different reason (one operator needed a coordination layer to keep N agents from crashing into each other) and ends up, almost as a side effect, producing the right data. Every dispatch the operator sends, every agent self-report, every merge outcome, and every operator intervention is logged. The agents are the heterogeneous fleet a working software professional actually uses today: Cursor, Claude Code, and others.
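To make the four logged event types concrete, here is a minimal sketch of what one record in such a log could look like. It is an illustration only: the class names, field names, and the `events_for` helper are hypothetical and are not taken from DevPlane's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventKind(Enum):
    # The four event types named in the text; the string values are made up.
    DISPATCH = "dispatch"                            # operator sends work to an agent
    AGENT_REPORT = "agent_report"                    # agent self-reports its status
    MERGE_OUTCOME = "merge_outcome"                  # what actually shipped
    OPERATOR_INTERVENTION = "operator_intervention"  # operator steps in


@dataclass(frozen=True)
class CoordinationEvent:
    kind: EventKind
    assignment_id: str   # hypothetical identifier for the unit of work
    actor: str           # hypothetical label: "operator" or an agent name
    timestamp: datetime
    detail: str = ""


def events_for(log, assignment_id):
    """Return the events for one assignment, oldest first."""
    return sorted(
        (e for e in log if e.assignment_id == assignment_id),
        key=lambda e: e.timestamp,
    )
```

Keeping the actor on every record is what would let a later analysis ask whether a given failure traces back to an agent or to the operator.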

That telemetry, after enough months, gives you something the laboratory cannot: a continuous record of one human running real agents on real software, with ground truth on what shipped, and attribution clean enough to ask whether a particular failure was the agent's fault or the operator's. The instrument is not the contribution. The instrument exists so the contribution — the test — can happen.

The forty-year-old prediction

In 1983, a researcher named Lisanne Bainbridge published a short paper in the journal Automatica titled Ironies of Automation. It is one of the most-cited papers in human-factors research, and its argument is simple enough to hold in one sentence: automating the easy parts of a task makes the remaining manual parts harder, not easier.

The reasoning is not mysterious. The operator of an automated system loses practice on the routine cases, because the automation now handles them. The operator loses the situational awareness that came from doing those routine cases — the small knowledge of what's normal that used to live in the operator's hands. And the operator is left with the residual exceptional cases, which resisted automation precisely because they are difficult, novel, or ill-specified. The harder problem stays. The base of practice underneath it erodes.

Forty years of subsequent work in cockpits, in nuclear control rooms, in cars, in operating rooms, has elaborated this observation into a literature with measurable phenomena. Complacency — the operator stops checking what the automation is doing because the automation has been right so often. Automation bias — the operator believes the automated output even when their own judgment would have been better. The trust calibration gap — the operator's confidence in the automation drifts away from the automation's actual reliability. And the most empirically concrete of the phenomena, risk compensation — when an automated system gets more reliable, the operator's vigilance drops, and the net change in total error is smaller than the gain in the automation alone.

Risk compensation has a substantial, though debated, empirical record outside automation. After seatbelts became mandatory, drivers were reported to drive a little less carefully. After antilock brakes arrived, drivers were observed following a little closer. As protective gear in contact sports got better, players took bigger hits. The pattern isn't moralistic and it isn't universal, but it is real, and the working hypothesis from the literature is that it shows up wherever a reliable automated subsystem sits inside a system that includes an attentive human.

The question this research program puts in front of itself is whether the same dynamic is operative in human-AI software development. When the agents get more reliable, when the coordination protocol cuts the agent-to-agent collisions, when the queue stops drifting — does the operator's vigilance fall, in measurable ways, by enough to partially offset the gain?

If yes: the productivity numbers being reported about AI coding tools are systematically too high, because they are agent-side numbers and the offsetting decrement on the operator side isn't being counted. If no: the operator running heterogeneous AI agents is doing something different from the pilot in the cockpit, and figuring out why would itself be a contribution.
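The two worlds can be written as simple bookkeeping. The sketch below assumes, for illustration, that the change in total error splits additively into an agent-side and an operator-side part; the symbols are introduced here and are not the program's own notation.

```latex
\begin{align*}
\Delta E_{\text{net}} &= \Delta E_{\text{agent}} + \Delta E_{\text{operator}} \\[4pt]
\text{Risk compensation (partial offset):}\quad
  & \Delta E_{\text{agent}} < 0,\;\; 0 < \Delta E_{\text{operator}} < -\Delta E_{\text{agent}}
  \;\Rightarrow\; \Delta E_{\text{agent}} < \Delta E_{\text{net}} < 0 \\[4pt]
\text{Null world:}\quad
  & \Delta E_{\text{operator}} = 0
  \;\Rightarrow\; \Delta E_{\text{net}} = \Delta E_{\text{agent}}
\end{align*}
```

With made-up numbers, purely to show the arithmetic: if agent-side failures fall by 10 per 100 units of work and operator-side failures rise by 4, the net improvement is 6, so 40 percent of the agent-side gain has been offset.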

Both worlds are written down. Both worlds get reported with equal prominence. That is the discipline.

What the lead study is actually doing

The program has three arms — agent-to-agent coordination, the operator's cognitive load, and the joint human-machine system. The lead study is in the third arm, and it has a specific shape worth describing in plain terms.

The setting. The dashboard is in production. One operator runs it on a portfolio of working software repositories, day after day, doing real work. The instrumentation runs continuously. There is no manufactured task, no contrived scenario. The data is just what happens.

The intervention. A specific, well-bounded improvement to how the agents coordinate through the queue gets deployed. This particular improvement is the auto-resolve heuristic for assignment-status reconciliation — a rule that catches when an assignment card says "open" but the work has actually shipped, and reconciles automatically. The improvement is small, well-understood, has a clear before-and-after signature, and reduces a specific kind of agent-coordination failure. The point is not the heuristic itself; it's that the intervention is the kind of clean, single-step change that lets you ask whether the operator's behavior changes in response.
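The rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DevPlane's implementation: the `Assignment` type, the `"open"` and `"done"` status values, and the `merged` flag standing in for "the work has actually shipped" are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Assignment:
    id: str
    status: str                 # hypothetical values: "open" or "done"
    merged: bool                # stand-in for ground truth that the work shipped
    auto_resolved: bool = False


def auto_resolve(assignments):
    """Return a new list in which stale "open" cards for shipped work are closed."""
    reconciled = []
    for a in assignments:
        if a.status == "open" and a.merged:
            # The card disagrees with reality: mark it done and record that
            # the reconciliation was automatic rather than an operator action.
            a = replace(a, status="done", auto_resolved=True)
        reconciled.append(a)
    return reconciled
```

Flagging auto-resolved cards separately matters for the study design: it keeps the automated reconciliation distinguishable from anything the operator did by hand.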

What gets measured. Four things. How often agents produce output that doesn't pass review. How often the operator is the proximate cause of a defect or rework. Total error per unit of work shipped — the net. And how long the operator dwells on each assignment, as a measurable proxy for vigilance. The first three are the substance. The fourth is the mechanism check — if vigilance is the load-bearing variable, dwell time is one of the few things you can measure that tracks it.
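As a rough sketch of how the four measures could be computed from per-assignment records: the record fields below are hypothetical, total error is assumed here to be the simple sum of agent-side and operator-side failures, and median dwell time is just one reasonable summary of the vigilance proxy.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class AssignmentRecord:
    agent_failed_review: bool     # agent output did not pass review
    operator_caused_defect: bool  # operator was the proximate cause of a defect or rework
    dwell_seconds: float          # time the operator spent on the assignment
    shipped: bool                 # the work was merged


def summarize(records):
    """Compute the four measures for one observation window."""
    shipped = sum(r.shipped for r in records)
    if shipped == 0:
        raise ValueError("no shipped work in this window")
    agent = sum(r.agent_failed_review for r in records)
    operator = sum(r.operator_caused_defect for r in records)
    return {
        "agent_failure_rate": agent / shipped,
        "operator_failure_rate": operator / shipped,
        # Assumption: an assignment with both kinds of failure counts twice.
        "total_error_rate": (agent + operator) / shipped,
        "median_dwell_seconds": median(r.dwell_seconds for r in records),
    }
```

Running `summarize` separately on the pre-intervention and post-intervention windows gives the before-and-after comparison the prediction is stated against.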

The prediction, written down before the data arrives. Following the intervention: agent-side failures drop (the intervention worked on what it was designed to fix). Operator-side failures rise, not catastrophically but measurably, by enough that the net improvement is meaningfully smaller than the agent-side improvement alone would suggest. Operator dwell time per assignment shortens. The mix of failures that escape into shipped work shifts toward the harder end: failures that used to be caught early now get caught later, or not at all.

The null world, also written down. Following the intervention: agent-side failures drop. Operator-side failures don't move. Dwell time doesn't move. The improvement carries through cleanly. Risk compensation, in this specific regime, with this specific kind of system, with this specific operator-tooling pair, does not appear. That would be a real finding. It would tell us that something about the heterogeneous-AI-agent setting makes operators behave differently from cockpit pilots and process-control engineers, and the question of why would be the next study.

What this program is being careful not to do

A research program is partly defined by what it refuses to claim. This program does not claim that AI coding agents are good or bad at coding; it does not compare specific vendors or models; it does not make claims about software productivity in general; it does not claim anything about the internals of the language models the agents are built on. The setting is one operator running heterogeneous agents through one coordination layer. Findings here are findings about that. Generalization to multi-operator teams, to other coordination architectures, to other AI tools — every one of those is its own study.

And — this is the methodologically uncomfortable part, but the integrity of the program depends on naming it — the principal investigator is the operator. The person being studied is the person doing the studying. In the descriptive parts of the work this is fine; auto-ethnography is a recognized methodology and the data is logged passively, without the operator getting real-time feedback about their own behavior. In the causal parts, where the claim becomes "the intervention caused the operator behavior change," the auto-ethnography is an explicit threat to validity that future replications with external operators are designed to address. The single-operator finding is a case finding. It will be reported as a case finding. Generalization is what the second-phase study, with at least one external operator, is for.

The honest pre-data state

Here is where things stand as of the writing of this piece.

The instrumentation is live. The coordination-event log is recording. The auto-resolve heuristic was deployed at a specific, recorded timestamp, so the log before that timestamp forms the pre-intervention baseline and the post-intervention observation window is now running. The pre-registered prediction is in the proposal document, dated and signed. The analysis pipeline (Bayesian structural time-series for the counterfactual; bootstrap two-sample tests for the secondary dwell-time outcome; survival analysis for the tertiary time-to-detect outcome) is being built in parallel.
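For readers unfamiliar with the bootstrap step, here is a minimal sketch of one common form of it: a percentile bootstrap confidence interval for the difference in mean dwell time between two windows. It is not the program's pipeline. The function name and defaults are hypothetical, and the sketch treats observations as independent, which real dwell-time data logged over time may not be.

```python
import random
from statistics import mean


def bootstrap_mean_diff(before, after, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(after) - mean(before)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each window with replacement, keeping its original size.
        b = rng.choices(before, k=len(before))
        a = rng.choices(after, k=len(after))
        diffs.append(mean(a) - mean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(after) - mean(before), lo, hi
```

If the whole interval sits below zero, dwell time shortened after the intervention by more than resampling noise alone would explain; an interval that straddles zero is consistent with the null world.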

What is not yet known: whether the prediction is right. Whether the effect is detectable above the noise floor of single-operator (N = 1) production data. Whether the operator's behavior change, if it happens, is large enough to matter operationally or only large enough to matter statistically. Whether some confound (a model update from a vendor, a new tool entering the operator's workflow, a maturation effect from the operator simply getting better at the role over time) will turn out to be the actual driver of any change observed.

It would be very easy, given the four decades of supportive literature on risk compensation in adjacent domains, to write this up as if the conclusion were already in. The conclusion is not in. For this specific regime, it will be in when there are at least ninety days of post-intervention data and the analysis has been run against the pre-registered prediction.

If the prediction holds, the productivity story being told about AI coding tools today is missing a term, and the size of the term will be a number anyone deploying these tools at scale should know. If the prediction fails, the net productivity gains from AI coding tools carry through more cleanly than the cockpit-automation literature would have predicted, and that itself is a useful finding: the operator of multiple AI agents is not the pilot in the cockpit, and the next thing to figure out is what the role actually is and where its specific failure modes live.

Either way, the methodology — continuous production telemetry on a real operator, pre-registered predictions, falsifiable constructs, decomposed measurement of total time-to-merge into operator and agent components — is the portable contribution. The methods generalize beyond AI agents. Any team running heterogeneous tools through a coordination layer — the operations dashboard at a hospital with three handoff systems, the lab running multiple instruments through a single LIMS, the NOC supervising automated response across a fleet — has the same shape of problem.

What this means if you're not a researcher

If you are reading this as someone whose company is buying AI coding tools, or whose engineering organization is running them at scale, the practical thing to take from this piece is not a verdict — there isn't one yet — but a posture.

The productivity numbers your vendor reports are agent-side numbers. They are measurements of what the agent did. They are not, and structurally cannot be, measurements of what the coupled human-agent system did, because the operator-side decrement (if there is one) lives in a different place from the agent-side measurement. If risk compensation is operative at the magnitude this program's prediction suggests, the agent-side numbers may overstate the net effect by a non-trivial fraction. They may not. The point is that you cannot tell from the agent-side numbers alone.

The thing to ask, when somebody quotes you a percentage gain on AI-assisted development, is: how was the operator-side decrement measured, if at all? If the answer is "it wasn't" — and right now, in most reports, it wasn't — the percentage gain is an upper bound on the net effect, not the net effect itself. That is a critique of a measurement gap the entire field shares, not of any specific tool. Closing that gap is what the program is for.

This is a slow program; it is designed to be slow. Coordination cost punishes rushed measurement, and the academic literature on adjacent phenomena is a graveyard of fast studies whose effects the slower studies couldn't replicate. Public updates will be honest about where the program is — including, plausibly, an update that says the prediction was wrong, and here is what we think the right model is instead.

The honest answer to what happens when you run a fleet of AI coding agents is: nobody, including the people running them, fully knows yet. The point of the program is to find out.