Tools · Software engineering
Incident Postmortem
Describe an incident — get a blameless postmortem with typed action items.
How it works
Corpus-grounded (Google SRE via the software-engineering cluster). Blameless: systems/process not people — impact, timeline, systemic root cause, contributing factors, what went well, and action items typed prevent/detect/mitigate. Uses only the facts given (gaps → placeholders).
You bring
{ incident, cluster? }
You get
{ incident_summary, impact, timeline[]{when, what}, root_cause, contributing_factors[], what_went_well[], action_items[]{action, type, owner_hint}, riskiest_assumptions[], grounded_in, provenance }
Use it for
- →SWE-guide reader: turn an outage into a blameless postmortem draft
- →Separate the trigger from the systemic root cause
- →Get prevent/detect/mitigate action items, not blame
See it work
example outputIncident: a payments API that returned 503s for ~47 minutes after a config change drained the database connection pool.
Incident Postmortem — Payments API 503s (blameless)
Incident: The payments API returned 503 errors for ~47 minutes after a configuration change exhausted the database connection pool.
Impact: Checkout failed for an unknown share of customers for ~47 minutes; [revenue impact not quantified in the input]. Severity: a customer-facing outage on a critical path.
Timeline
- [T0] — A config change lowering the connection-pool ceiling is rolled out.
- [T0 + ~5 min] — Error rate climbs; the API begins returning 503s.
- [T0 + ~15 min] — On-call paged on elevated 5xx.
- [T0 + ~40 min] — The change is identified as the trigger; rollback initiated.
- [T0 + ~47 min] — Pool recovers; error rate returns to baseline.
(Timestamps are bracketed where the input didn't specify them.)
Root cause
The systemic cause was a configuration path that let a pool-sizing change reach production without a load-validated guardrail — not the engineer who shipped it. The lowered ceiling couldn't serve peak concurrency, so connections queued and requests timed out into 503s.
Contributing factors
- No staging load test that exercises peak connection concurrency.
- Connection-pool limits weren't covered by the config-review checklist.
- Alerting fired on 5xx rate but not on pool saturation, delaying diagnosis.
What went well
- On-call responded and escalated within the paging window.
- The rollback was clean and fully restored service.
Action items
- Add a pool-saturation alert — detect — owner: platform / observability team.
- Gate pool-config changes behind a peak-concurrency load test — prevent — owner: payments platform team.
- Add a circuit-breaker / graceful-degrade path on pool exhaustion — mitigate — owner: payments service team.
- Add connection-pool limits to the config-review checklist — prevent — owner: on-call / SRE.
Riskiest assumptions / gaps
- Customer and revenue impact weren't quantified in the input.
- Exact timestamps were inferred and should be confirmed against logs.
Grounded in: Google SRE / "Site Reliability Engineering"; "The DevOps Handbook" — blameless postmortem, trigger vs. systemic cause.
Run it now
Write a blameless postmortem
Turn an incident into a blameless postmortem — impact, timeline, systemic root cause, contributing factors, what went well, and typed action items. (Uses only what you give it.)
Prefer code? Call it over the API or hand it to your AI agent via MCP — POST /api/bicycle/incident-postmortem · write_postmortem. API & agent access →