peopleanalyst

research / devplane / audience tiers

Engineering critique

Adversarial review of the DevPlane research instrument: the two-phase actor handoff, the coordination-event log schema, multi-tool sync drift across claude-code/cursor/codex, and the completion-block protocol. Names specific failure modes (fire-and-forget write drops, build-agent-skips-review shortcut, agent_started step-function pre/post DP-102, polling-bound merge detection) and a ranked fix list before the corpus carries weight.

By Mike West

DevPlane·Audience tiers·Engineering critique·source: people-analyst/devplane/docs/research/reviews/engineering-critique.md

Engineering critique — DevPlane research instrument (post-DP-102 reconciliation)

Reviewer: an engineer who has built coordination tooling before and is being asked to trust the C1 corpus Scope: coordination-events.ts + capture sites in dispatch.ts, agent-workflow.ts, completion.ts, agents.ts, git-watcher.ts; the kanban two-phase actor handoff; multi-tool sync (cursor / codex / claude-code); the completion-block protocol

The instrument is real and producing data. The corpus is the spine of C1, and the design has at least four classes of issue an external reviewer will spot in the first hour. Two are subtle enough that a casual read of the spec won't surface them.


1. TL;DR

  1. The append-only log is fire-and-forget at the call site (scheduleInsert, coordination-events.ts:211) and silently drops events on Node < 22 or any sqlite open failure. No per-process write-success counter exists. We cannot distinguish "no event happened" from "event happened, write failed."
  2. The two-phase actor handoff is event-shaped but state-machine-soft. Nothing structurally prevents a build agent from supplying its own SHA on the first call and skipping review; the column guard at completion.ts:381 is bypassable by stamping any seven-hex string. Intent, not enforcement.
  3. agent_started had three call sites until DP-102. The spec now declares processQueue canonical; reconciliation columns and reconcile-coordination-agent-started.ts let us mark duplicates after the fact. Pre-DP-102 rows are not retroactively reconciled by default — a reviewer comparing the run-in window to post-DP-102 sees a step-function unrelated to operator behavior.
  4. Multi-tool sync drifts because each tool has its own ground truth. The "agent finished" moment lives in three clocks (claude-code subprocess, cursor cloud webhook, codex API roundtrip); we treat all three as equivalent agent_self_report. Cross-channel dwell/latency analyses pool incomparable distributions.
  5. The completion-block protocol beats free-text on outcomes and silently misses process. Status, SHA, blockers, follow-ups, migrations, env-vars are schema-validated (validateCompletionBlock, completion.ts:78). What's not in the block — half-completions, partial commits, agent retries, mid-run model swaps, operator edits to agent output before commit — is invisible. Self-report is separated from ground truth on the outcome axis, not on the process axis.

The first three are fixable in days. The fourth is a known limitation for the threats-to-validity register. The fifth is a scope question — process telemetry would contaminate the corpus.


2. The two-phase actor handoff

2a. Where it works

Build-agent and review-agent are different actors; the corpus needs to attribute work to each. The handoff is two log_completion calls against the same ASN: the first moves the card to review with a stashed COMPLETION_BLOCK activity entry; the second moves it to done and routes follow-ups (completion.ts:378-562). The separation lets merge_outcome be a system-actor event keyed on a verifiable git SHA, rather than an agent self-claim — exactly the right distinction for the false-complete signal.

The kanban column model (kanban.ts:12-20) gives the protocol a state to live in: agent-queue → in-progress → review → done, with rework as the failed-review escape and blocked as the halt. Each transition is logged with before_state / after_state snapshots, so a reviewer reconstructing the timeline has column transitions in-band.

2b. Where it fails

The phase boundary is enforced by intent, not by structure. completion.ts:381:

if (card && reviewCol && card.column !== reviewCol.column && !hasRealSha) { … first-call branch … }
else if (card && hasRealSha) { … second-call ship branch … }

A build agent that supplies a 7-40 hex string on its first call gets routed to the second-call ship branch, which moves the card to done and emits merge_outcome. The remote-SHA check (verifyShaOnRemote) produces warnings, not hard gates (completion.ts:222-223). The "build agents don't stamp SHAs on the first call" rule is prompt policy, not code policy. Fix: gate the second-call branch on card.column === reviewCol.column or on a distinct actor registry ID from the first call. Without it, a misbehaving build agent skips review and merge_outcome misattributes the actor to completion-protocol (system) when ground truth is build-only-agent.

Race on rapid double-call. Both branches issue moveCard and captureCoordinationEvent without a per-card lock. Two concurrent /api/agent/completion/log calls — a build agent retrying after a network blip — can both pass the column predicate and emit two agent_self_report rows. The card-level processing lock (cardProcessingLockKey) covers dispatch, not completion. Hasn't fired in practice; still a real race, and the corpus has no signal to distinguish retry from two real calls.

Rework-loop telemetry is partial. review → rework emits a rework event (completion.ts:327). The return move rework → in-progress comes from /api/agent/claim, which emits agent_claim — there is no protocol-level "rework retry began" event. Reconstructing "how many times did this assignment cycle" requires joining rework events with subsequent agent_claim events on assignment_id, ordered by event_ts. Doable but undocumented.


3. Coordination-event log schema

3a. What the schema captures well

Actor split (actor_kind × actor_id) cleanly separates operator/agent/system. before_state / after_state snapshots reconstruct column timelines from the log alone. dispatch_text_hash always present, raw gated on DEVPLANE_RESEARCH_STORE_DISPATCH_TEXT=1. agent_self_report and ground_truth_outcome are separate columns — the move that lets the false-complete signal exist. Append-only with reconciliation pointers (superseded_by_event_id); schema_version on every row.

3b. What the schema misses

No write-success counter. scheduleInsert queues a microtask that opens (or memoizes) the SQLite handle and inserts. If node:sqlite is unavailable, the call logs once and silently no-ops (coordination-events.ts:159-167). Missing migration files: log and skip. INSERT throws (constraint, disk full, schema mismatch): log and continue. The corpus has no row that says "I tried to capture an event and could not." This is the single most reviewer-vulnerable property of the instrument. Fix: a process-local counter incremented at every scheduleInsert entry and decremented on successful insert, exposed via /api/research/health.

Dwell is computed at write-time, not from session timeline. dwellForOperator (coordination-events.ts:311) computes Date.now() - prev.lastOperatorWallMs. The session file is rewritten on every operator action. Two operator events firing within the same microtask both read the same prev.lastOperatorWallMs; the second event's dwell is wrong by the duration of the first event's processing. Tens-of-ms noise — tolerable, but flag it in the analysis plan.

session_id rotation on idle is heuristic. SESSION_IDLE_MS = 30 * 60 * 1000 rotates on the next operator event, not at the actual idle moment, so the session_end event timestamp is the wake-up timestamp, not the sleep timestamp. Analyses that bucket by session should flag the first event of any session and re-derive its dwell.

The notes column is JSON-as-string with no schema. Half the capture sites stuff structured JSON into notes ({ source: "/api/agent/claim" }, { trigger: "status_partial_review" }, { commit_sha, stage }), and the export default-scrubs the column. Anyone analyzing the corpus needs --no-scrub, then JSON-parses with try/catch, mixed with rows where notes is genuinely free-form. Promote routing-source to a typed column (source_path, via).

correlation_id exists but is sparsely populated. Only the DP-70 briefing-shell capture sites set it. The join from dispatch → agent_started → agent_self_report → merge_outcome for a single attempt is assignment_id + event_ts ordering — which has the rework-loop ambiguity from §2b. Populate correlation_id per dispatch cycle and propagate.


4. Multi-tool sync — where the abstractions leak

DevPlane treats claude-code, cursor, and codex as parallel assignTo channels through one dispatch path (dispatch.ts:301-303). Each has a dispatchToX adapter; from the kanban's perspective they are interchangeable. From the corpus's perspective they are not.

4a. Where the abstraction holds

Every channel emits a dispatch event with the same shape (dispatch.ts:332). Column transitions are the same. The completion-block protocol is the same once the agent reaches it. agent_artifact_committed is symmetric.

4b. Where it leaks

The "agent started" wall-clock is incomparable across channels. For Claude Code, agent_started fires after processQueue confirms the subprocess spawn — local, single-process. For Cursor and Codex, agent_started fires after the dispatch API call returns; the actual cloud agent might begin execution seconds or minutes later. agent_self_report - agent_started is meaningful latency for Claude Code, noisy proxy for the cloud channels. Cross-channel pooled analysis is implicitly mixing local-subprocess wall-clock against cloud-API roundtrip + cloud-side queue + cloud-side execution + webhook delay.

Cursor and Codex completions are detected by polling, not pushed. A cloud agent that completes between two polls has its agent_self_report timestamped at the poll moment. Polling interval is the lower bound on event-time accuracy.

Each tool has its own retry/restart semantics that the corpus doesn't see. Cursor and Codex retry internally on infra errors; we see one start, one finish. Claude Code retries are observable in principle (a second subprocess spawn), but the dispatch path doesn't differentiate retry from first attempt. execution.attempts exists on the card; there's no event for "attempt 2 began."

External merge events are inferred from git-watcher polling. git-watcher.ts parses ASN tokens out of commit messages with \b(ASN-\d+(?:-DB)?|DP-\d+)\b. A merge without an ASN token is invisible. A Cursor cloud agent that switches commit-message conventions silently breaks merge_outcome capture for its assignments.


5. The completion-block protocol — structured close-out

5a. Where structured close-out beats free-text

validateCompletionBlock (completion.ts:78) imposes a typed shape: machine-routable follow-ups (manual_action / new_assignment / verify), staged migrations and env vars as separate kanban actions, a parseable status enum (complete | partial | blocked) the false-complete reconciler can compare against ground truth. agent_self_report events store the verbatim block as JSON, so the corpus has both structured field and prose. That's the right design.

5b. Where it doesn't

Schema validation is permissive. Unknown follow_up.type values are silently filtered (completion.ts:89-95); migrations and env-vars take any string array; summary and blockers are unconstrained free text. Audit "what did agents claim went wrong" is back to NLP on free text for half the signal.

The block is a self-report. The content — the agent's claim about what it did — is not independently corroborated except for the SHA. A complete status with summary "Refactored X and updated tests" and a verified SHA reaches merge_outcome even if the SHA changed an unrelated file. The corpus says "agent claimed complete, code was merged"; it cannot say "agent's claim matches the diff."

No timestamp on the block. The event-row event_ts is server-write time, not agent-finish time. For Claude Code that's tight; for cloud agents the gap is measurable and unmeasured. Add submitted_at to the block schema in v1.1.

No agent-version field. The block does not include model version, agent harness version, or dispatch prompt hash. Joining back to the dispatch event via assignment_id is ambiguous when the agent was redispatched mid-rework. Add dispatch_correlation_id to the block and propagate.


6. Failure modes and stress conditions

6a. Race conditions

  • Concurrent completion-log calls for the same ASN can both pass the column predicate and double-write events (§2b). The card-level processing lock (cardProcessingLockKey) covers dispatch, not completion.
  • Concurrent claim calls — the agent-claims overlap check (peerInProgressClaimOverlap) gates business logic, but the kanban file write itself is loadBoard → mutate → save and not transactionally locked. Two racing claims can lose-update.
  • Coordination-event session-file rewrites are non-atomic (fs.writeFileSync). A crash mid-write leaves truncated JSON; readSessionFile swallows the parse error and treats it as "no session," forcing a phantom session_start.

6b. Partial-write recovery

  • SQLite boundaries are per-INSERT. A multi-event capture (e.g. dispatch + agent_started from processQueue, dispatch.ts:332 and :352) runs as two separate microtasks. A crash between them leaves dispatch without agent_started, and nothing distinguishes "process crashed" from "the second event genuinely never happened." Fix: wrap related captures in a transaction or share a correlation_id for post-hoc gap detection.
  • The kanban JSON file is written before the event row. A crash in the window produces card-state-changed-but-no-event. The reverse is impossible by construction.

6c. Agent-side instrumentation drift

  • Cursor / Codex update their APIs. Pollers ship today's response shape. A breaking change in Cursor's status enum drops events into "unknown status" without a corpus signal. Emit a system event when the poller doesn't recognize a shape.
  • Commit-message convention drift. git-watcher.ts depends on \b(ASN-\d+(?:-DB)?|DP-\d+)\b. An agent that omits the prefix silently breaks merge_outcome. Add an alert for commits that look work-shaped but lack an ASN token.

6d. Sync-loop edge cases

  • Process restarts during dispatch. Pg advisory locks (lock.ts) are session-scoped; if the process dies, the lock releases when pg detects the disconnect — seconds to minutes via tcp-keepalive. A second devplane instance starting in that window can re-dispatch and double-fire dispatch.
  • Two devplane instances on the same project. The DEVPLANE_AUTH_DISABLED=1 ... --port=4001 headless-capture pattern is exactly this. Both processes write to the same coordination-events.sqlite. Concurrent writers serialize on the file lock, but a stale dbMemo handle (coordination-events.ts:97) will not refresh. Long-lived parallel processes are unsupported; document it.

7. Ranked fix list

#FixEffortImpact
1Per-process write-success counter for scheduleInsert, exposed via /api/research/health. The corpus must answer "how many capture calls did you drop?" with a number. (§3b)2-4 hoursHigh — single biggest reviewer attack surface.
2Hard-gate the second-call ship branch in processCompletionBlock on card.column === reviewCol.column, not just on hasRealSha. (§2b)1 hourHigh — closes the build-agent skips-review shortcut.
3Generate correlation_id on every dispatch event and propagate it through agent_started, agent_self_report, agent_artifact_committed, merge_outcome, rework. (§3b, §5b, §6b)1 dayHigh — rework loop is ambiguous and per-attempt latency uncomputable without it.
4Run reconcile-coordination-agent-started.ts over the seven-day pre-DP-102 run-in window for every active project and write a reconciled_through_event_id watermark into the export manifest. (§1 bullet 3)2 hours per projectHigh — otherwise the run-in window has a step-function in agent_started that contaminates rate analyses.
5Promote notes-smuggled fields to typed columns (source_path, via, outbox_kind, trigger). Default-export typed; default-scrub free-form. (§3b)1 dayMedium — moves a class of blocker out of --no-scrub territory.
6Emit a system unknown_status_observed event from cursor and codex pollers when API shape drifts. (§6c)4 hoursMedium — defensive against silent poll-loop drift.
7Document in methodology.md and the C1 threats-to-validity register: (a) latency comparisons across actor_id channels are not pooled, (b) cross-channel agent_started → agent_self_report is incomparable, (c) merge detection depends on commit-message convention. (§4b, §6c)4 hoursMedium — the limitation exists either way; preempt the reviewer.
8Add submitted_at and dispatch_correlation_id to the completion-block schema as optional v1.1 fields; old blocks remain valid. (§5b)1 dayMedium — closes the time-of-finish unknown for cloud agents.
9Document and enforce "single devplane instance per project," or add an instance-id column to the events table with a startup heartbeat. (§6d)4 hours doc-onlyLow — operator-only anomaly today, cheap to make explicit.

Honourable mentions:

  • dwell_ms write-time computation has a microtask race; flag first-of-session events in the analysis plan and re-derive dwell from the prior session's last event.
  • session_end timestamp is wake-up time, not sleep time; flag explicitly.
  • validateCompletionBlock silently drops unknown follow_up types; emit a system event when it does.
  • The git-watcher's ASN-token regex is the load-bearing pattern for merge_outcome accuracy; pin it in a test that round-trips every commit-message format shipped to date.

8. Closing note

The instrument is correctly designed in its bones. Two-phase handoff is the right model for C1; append-only with separate self-report and ground-truth columns is the right schema for false-complete; the completion block is the right structured close-out. The spec already documents most limitations — unusual in research instrumentation.

Issues are overwhelmingly plumbing, not measurement-theoretic. Most reviewer-vulnerable: silent capture failure (fix #1). Most internally consequential: rework-loop ambiguity (fix #3). Most underdocumented: multi-tool latency incomparability (fix #7). Address those three before treating the corpus as analysis-ready.

The write-success counter is the cheapest correctness win on this list; do it tomorrow morning.