peopleanalyst

Tools · Software engineering

Incident Postmortem

Describe an incident — get a blameless postmortem with typed action items.

How it works

Corpus-grounded (Google SRE via the software-engineering cluster). Blameless: systems/process not people — impact, timeline, systemic root cause, contributing factors, what went well, and action items typed prevent/detect/mitigate. Uses only the facts given (gaps → placeholders).

You bring

{ incident, cluster? }

You get

{ incident_summary, impact, timeline[]{when, what}, root_cause, contributing_factors[], what_went_well[], action_items[]{action, type, owner_hint}, riskiest_assumptions[], grounded_in, provenance }

Use it for

See it work

example output

Incident: a payments API that returned 503s for ~47 minutes after a config change drained the database connection pool.

Incident Postmortem — Payments API 503s (blameless)

Incident: The payments API returned 503 errors for ~47 minutes after a configuration change exhausted the database connection pool.

Impact: Checkout failed for an unknown share of customers for ~47 minutes; [revenue impact not quantified in the input]. Severity: a customer-facing outage on a critical path.

Timeline

  • [T0] — A config change lowering the connection-pool ceiling is rolled out.
  • [T0 + ~5 min] — Error rate climbs; the API begins returning 503s.
  • [T0 + ~15 min] — On-call paged on elevated 5xx.
  • [T0 + ~40 min] — The change is identified as the trigger; rollback initiated.
  • [T0 + ~47 min] — Pool recovers; error rate returns to baseline.

(Timestamps are bracketed where the input didn't specify them.)

Root cause

The systemic cause was a configuration path that let a pool-sizing change reach production without a load-validated guardrail — not the engineer who shipped it. The lowered ceiling couldn't serve peak concurrency, so connections queued and requests timed out into 503s.

Contributing factors

  • No staging load test that exercises peak connection concurrency.
  • Connection-pool limits weren't covered by the config-review checklist.
  • Alerting fired on 5xx rate but not on pool saturation, delaying diagnosis.

What went well

  • On-call responded and escalated within the paging window.
  • The rollback was clean and fully restored service.

Action items

  • Add a pool-saturation alertdetect — owner: platform / observability team.
  • Gate pool-config changes behind a peak-concurrency load testprevent — owner: payments platform team.
  • Add a circuit-breaker / graceful-degrade path on pool exhaustionmitigate — owner: payments service team.
  • Add connection-pool limits to the config-review checklistprevent — owner: on-call / SRE.

Riskiest assumptions / gaps

  • Customer and revenue impact weren't quantified in the input.
  • Exact timestamps were inferred and should be confirmed against logs.

Grounded in: Google SRE / "Site Reliability Engineering"; "The DevOps Handbook" — blameless postmortem, trigger vs. systemic cause.

Run it now

Write a blameless postmortem

Turn an incident into a blameless postmortem — impact, timeline, systemic root cause, contributing factors, what went well, and typed action items. (Uses only what you give it.)

Prefer code? Call it over the API or hand it to your AI agent via MCP — POST /api/bicycle/incident-postmortem · write_postmortem. API & agent access →

← All tools