peopleanalyst

Tools · Software engineering

SLO Definer

Describe a service — get user-centric SLIs, SLOs, and an error-budget policy.

How it works

Corpus-grounded (Google SRE via the software-engineering cluster). Chooses SLIs that reflect real user experience, sets SLOs (target + window), an error-budget policy (what happens when it's spent), and burn-rate alerting guidance — targets tied to user need, not 100%.

You bring

{ service, cluster? }

You get

{ service_summary, slis[]{sli, definition, measurement}, slos[]{objective, target, window}, error_budget_policy, alerting_notes[], riskiest_assumptions[], grounded_in, provenance }

Use it for

See it work

example output

Service: a payments checkout API that authorizes card transactions for online merchants.

SLOs — payments checkout API

Service: authorizes card transactions for online merchants; user-facing impact is a shopper trying to pay. Grounded in Google SRE (software-engineering cluster). Targets tied to user experience, not 100%.

SLIs (what the user actually feels)

SLIDefinitionMeasurement
Availability% of authorize requests that return a valid response (not 5xx / timeout)Ratio of good : total at the load balancer
Latency% of authorize requests served < 800msServer-side request-duration histogram, p-based
Correctness% of authorizations with no double-charge / wrong-amountReconciliation against settlement ledger

SLOs (target + window)

  • Availability: 99.95% over a rolling 28-day window.
  • Latency: 99% of requests < 800ms over 28 days.
  • Correctness: 99.999% over 28 days (money is unforgiving).

Error-budget policy

The 99.95% availability target allows ~21 minutes/month of unavailability.

  • Budget remaining: ship features normally.
  • Budget < 25%: freeze risky launches; redirect to reliability work.
  • Budget exhausted: change freeze except reliability/security fixes until the window recovers; blameless review required.

Alerting notes

  • Alert on burn rate, not every blip: page on a fast burn (2% of budget in 1 hour) and a slow burn (10% in 6 hours).
  • Multi-window alerts to cut false pages; correctness violations page immediately regardless of budget.

Riskiest assumptions

  • That 800ms reflects the shopper's real checkout patience.
  • That settlement reconciliation is timely enough to catch correctness misses fast.

Run it now

Define SLOs & SLIs

Turn a service into reliability targets — the SLIs that reflect user experience, SLOs (target + window), an error-budget policy, and alerting guidance.

Prefer code? Call it over the API or hand it to your AI agent via MCP — POST /api/bicycle/slo-definer · define_slos. API & agent access →

← All tools