SLOs and On-Call
- An SLI measures what users experience, not what your service does internally.
- Three SLOs per service is usually right. Ten is theatre.
- Burn-rate alerts (fast + slow) beat "error rate > X for Y minutes": they catch both the short severe incident and the slow bleed that a fixed threshold misses.
- Every alert has a runbook link. No runbook = delete the alert.
- On-call handover is a checklist, not a chat message.
- Post-incident reviews are blameless but concrete: action items have owners and dates, or they are not action items.
SLI, SLO, SLA, error budget
- SLI — a ratio of good events to total events. "Requests that returned 2xx/3xx in under 300 ms / all requests."
- SLO — a target for the SLI over a window. "99.5% over 30 rolling days."
- SLA — an SLO with consequences (money). If no money moves, it's an SLO.
- Error budget — 1 - SLO, expressed in bad events you are allowed. For 99.5% / 30d on a service doing 10 M req/day, that's 1.5 M bad requests (0.5% of 300 M) before you blow it.
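The error-budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `error_budget_events` and its parameters are illustrative, not part of any tool mentioned here.

```python
def error_budget_events(slo: float, requests_per_day: float, window_days: int) -> float:
    """Bad events allowed over the SLO window: (1 - SLO) * total events."""
    total_events = requests_per_day * window_days
    return (1.0 - slo) * total_events


# The example from the text: 99.5% over 30 days at 10 M requests/day.
budget = error_budget_events(slo=0.995, requests_per_day=10_000_000, window_days=30)
print(f"{budget:,.0f} bad requests allowed")
```

The same function makes it easy to see how quickly the budget shrinks when the target tightens: at 99.9% the same traffic allows only a fifth as many bad requests.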
SLOs do two jobs:
- Tell you when the user experience has degraded below an agreed bar.
- Govern the rate at which you ship change. Budget spent → slow the release train.
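One way to make the second job mechanical is a small policy function that maps remaining budget to a release posture. The sketch below is an assumption for illustration only: the 25% threshold and the three posture names are hypothetical and should come from your own error-budget policy.

```python
def release_posture(budget_events: float, bad_events_so_far: float,
                    slow_below: float = 0.25) -> str:
    """Map remaining error budget to a release posture.

    slow_below is a hypothetical threshold: the fraction of budget left
    under which the release train slows down.
    """
    remaining = 1.0 - bad_events_so_far / budget_events
    if remaining <= 0.0:
        return "freeze"   # budget spent: only reliability fixes ship
    if remaining < slow_below:
        return "slow"     # budget nearly spent: slow the release train
    return "normal"


print(release_posture(budget_events=1_500_000, bad_events_so_far=300_000))
```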
Picking SLIs
Common starting menu:
| Category | Typical SLI | Measured at |
|---|---|---|
| Availability | non-5xx / total (success ratio) | Load balancer, or client, not the app |
| Latency | requests under T ms / total | LB for user-facing, app for internal |
| Quality | requests without fallback / total (e.g. no cache miss, no degraded mode) | App-internal metric |
| Freshness | data younger than T / queries | Data system, e.g. pipeline lag |
| Coverage | enqueued jobs processed within T / enqueued | Queue metrics |
| Correctness | (sampled) correct results / sampled total | Offline job running known-good inputs |
Measuring an availability SLI with Prometheus
# Good events (non-5xx)
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
# Total events
sum(rate(http_requests_total{service="api"}[5m]))
# SLI = good / total
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
Latency SLI using histograms
sum(rate(http_request_duration_seconds_bucket{service="api",le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))
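The two ratios above can be combined into the single SLI used as the example in the definitions ("returned 2xx/3xx in under 300 ms / all requests"). The following Python sketch shows the good-events-over-total logic on raw request records. The `Request` type and its field names are hypothetical, and returning `None` for an empty window is an assumption; some teams treat no traffic as 100%.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def is_good(req: Request, threshold_ms: float = 300.0) -> bool:
    """A good event: 2xx/3xx response returned in under threshold_ms."""
    return 200 <= req.status < 400 and req.latency_ms < threshold_ms


def sli(requests: Iterable[Request]) -> Optional[float]:
    """Ratio of good events to total events; None when there is no traffic."""
    reqs = list(requests)
    if not reqs:
        return None
    return sum(1 for r in reqs if is_good(r)) / len(reqs)


sample = [
    Request(200, 120.0),  # good
    Request(302, 250.0),  # good
    Request(200, 450.0),  # bad: too slow
    Request(503, 80.0),   # bad: server error
]
print(sli(sample))
```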
Burn-rate alerting
The old way: "page me if error rate > 1% for 10 minutes." Problems:
- A 5% error spike that clears in 9 minutes never fires under "> 1% for 10m", yet each one spends about 0.2% of a 99.5% / 30-day budget, and repeated spikes add up unseen.
- A slow 0.3% bleed over 72 hours never fires either, yet it quietly spends about 6% of that budget; a 0.9% bleed would exhaust it in under 17 days without ever paging.
Burn-rate alerting expresses alerts in terms of how fast you are consuming the error budget: burn rate = observed error rate / (1 - SLO). A burn rate of 1 means you'll exhaust the budget in exactly the SLO window (30 days). A burn rate of 14.4 means you'll exhaust it in about 2 days (30 / 14.4 ≈ 2.1).
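As a quick check of those numbers, here is a minimal Python sketch of the burn-rate relationship, assuming the usual definition (observed error rate divided by the budget, 1 - SLO). The function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(burn: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / burn


# A 7.2% error rate against a 99.5% SLO burns at roughly 14.4x.
rate = burn_rate(error_rate=0.072, slo=0.995)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```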
Multi-window, multi-burn-rate (the Google SRE formula)
groups:
  - name: api-slo
    rules:
      # Fast burn: 2% of the 30-day budget in 1h, confirm over 5m.
      # 0.005 is the error budget (1 - 0.995); 14.4 is the burn-rate threshold.
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="api"}[1h]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="api"}[5m]))
          ) > (14.4 * 0.005)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "API burning error budget fast (14.4x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"
      # Slow burn: 5% of the 30-day budget in 6h, confirm over 30m
      - alert: APIErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{service="api"}[6h]))
          ) > (6 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30m]))
            / sum(rate(http_requests_total{service="api"}[30m]))
          ) > (6 * 0.005)
        for: 15m
        labels: { severity: ticket }
        # Same symptom as the fast burn, so it reuses the same runbook.
        annotations:
          summary: "API burning error budget slowly (6x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"
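The two-window trick (long + short) requires both windows to exceed the threshold. The long window makes sure enough budget has actually been spent to matter; the short window confirms the burn is still happening, so the alert stops firing within minutes of recovery instead of staying red until the 1-hour or 6-hour average drains.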
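The logic of the two-window condition can be sketched in Python, independent of Prometheus. This is an illustration under the assumptions of the rules above (99.5% SLO, same burn threshold applied to both windows); the function name and parameters are hypothetical.

```python
def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 burn_threshold: float,
                 slo: float = 0.995) -> bool:
    """Fire only when BOTH windows exceed burn_threshold * error budget."""
    limit = burn_threshold * (1.0 - slo)
    return long_window_error_rate > limit and short_window_error_rate > limit


# Fast-burn threshold 14.4x on a 99.5% SLO means an error-rate limit of 7.2%.
print(should_alert(0.10, 0.10, 14.4))   # sustained and ongoing
print(should_alert(0.10, 0.001, 14.4))  # long window still hot, burn has stopped
print(should_alert(0.01, 0.50, 14.4))   # short spike, not yet significant
```

Only the first case fires: the second has already recovered, and the third has not yet spent enough budget over the long window.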
Burn-rate targets translate directly to "how much of the budget you've just spent":
- Burn 14.4x → spent 2% of 30-day budget in 1 hour. Page now.
- Burn 6x → spent 5% in 6 hours. Ticket; investigate today.
- Burn 1x → on track. Do nothing special.
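The percentages in the list above follow from one formula: budget spent = burn rate × alert window / SLO window. A short Python sketch, with an illustrative function name and a 30-day window assumed:

```python
def budget_spent(burn: float, hours: float, window_days: float = 30.0) -> float:
    """Fraction of the error budget consumed at `burn` sustained for `hours`."""
    return burn * hours / (window_days * 24.0)


for burn, hours in [(14.4, 1), (6, 6), (1, 720)]:
    print(f"burn {burn}x for {hours}h -> {budget_spent(burn, hours):.1%} of budget")
```

Working backwards is just as useful when choosing thresholds: decide what fraction of the budget justifies a page within a given window, then solve for the burn rate.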
Alert hygiene
Every alert should satisfy these, or be deleted. This is the only alert checklist that matters.
- Actionable. The recipient has a concrete next step. "Disk free is low" is actionable; "something might be weird" is not.
- Specific. The symptom is named ("api 5xx burn fast"), not the cause (which you don't know yet). One alert, one symptom.
- Runbooked. The `runbook_url` annotation points to a page that begins with a decision tree, not a wall of text.
- Severity-tagged. `page` or `ticket`. If you have more than those two, you have none.
- De-duplicated. Grouping in Alertmanager so the pager gets one message per incident, not 30.
- Owned. A `team` label routes to the right pager. No "default" team. Unowned alerts become ignored alerts.
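The mechanical parts of this checklist (runbook, severity, owner) can be linted in CI. The sketch below assumes each rule has already been loaded into a plain dict shaped like a Prometheus alerting rule; the function name `lint_alert` and the example team name are hypothetical. Actionability and specificity still need a human reviewer.

```python
ALLOWED_SEVERITIES = {"page", "ticket"}


def lint_alert(rule: dict) -> list[str]:
    """Return the checklist violations for one alerting rule (empty = clean)."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if not annotations.get("runbook_url"):
        problems.append("missing runbook_url annotation")
    if labels.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity must be 'page' or 'ticket'")
    if labels.get("team") in (None, "", "default"):
        problems.append("missing or default team label")
    return problems


good_rule = {
    "alert": "APIErrorBudgetFastBurn",
    "labels": {"severity": "page", "team": "api"},  # team name is illustrative
    "annotations": {"runbook_url": "https://wiki.internal/runbooks/api-5xx"},
}
bad_rule = {"alert": "SomethingWeird", "labels": {"severity": "warning"}}

print(lint_alert(good_rule))
print(lint_alert(bad_rule))
```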
Quarterly cull
Once a quarter, list every alert that fired and ask:
- Did the recipient take action beyond "ack"? If not: delete or lower severity.
- Was there a real user impact? If not: this is a cause-alert, convert to ticket.
- Did it fire more than twice with the same root cause? Fix the cause or raise the threshold.
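If the firing history can be exported, the three questions reduce to a small triage function. This is a sketch under assumptions: the `AlertHistory` record and its fields are hypothetical, and filling them in (especially "acted on" and "user impact") still takes human judgement per firing.

```python
from dataclasses import dataclass


@dataclass
class AlertHistory:
    name: str
    fired: int                # firings this quarter
    acted_on: int             # firings where someone did more than ack
    user_impacting: int       # firings with real user impact
    max_same_root_cause: int  # most firings sharing one root cause


def cull_recommendations(h: AlertHistory) -> list[str]:
    """Apply the three quarterly-cull questions to one alert."""
    if h.fired == 0:
        return []
    recs = []
    if h.acted_on == 0:
        recs.append("delete or lower severity")
    if h.user_impacting == 0:
        recs.append("cause-alert: convert to ticket")
    if h.max_same_root_cause > 2:
        recs.append("fix the cause or raise the threshold")
    return recs


noisy = AlertHistory("DiskLatencyHigh", fired=9, acted_on=0,
                     user_impacting=0, max_same_root_cause=9)
print(cull_recommendations(noisy))
```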
On-call handover template
Paste this into the team wiki, fill in before you hand over. Ten minutes of writing saves an hour of "wait, what?" from the next on-call.
## On-call handover — <from> → <to> — <date, timezone>
### Active incidents
- INC-2024-0142 — partial outage of ingest tier.
  Current state: mitigated, rollback in flight. ETA: 18:00 UTC.
  Next action needed from YOU: verify ingest p99 under 300 ms after
  rollback completes; if not, escalate to @ingest-team.
  Ticket: https://... Slack: #inc-2024-0142
### Tickets you may get paged on
- A known pg_repack window is running on DB04 22:00–02:00 UTC.
  Expected: writes may queue briefly, don't fail over. Runbook: ...
### Silences active
- `alertname=CertExpiresSoon,instance=legacy01` until 2024-11-10.
  Reason: retiring host. Owner: @infra-team.
### Deploys in flight / recently
- api v2.7.3 rolled out 09:00 UTC — healthy.
- auth v1.4.0 scheduled 14:00 UTC today — @you to watch.
### Known unknowns
- We saw one unexplained 30s latency spike on billing at 03:12 UTC;
  no alert. Follow-up ticket OPS-9123.
### Unresolved runbook gaps
- Loki compactor alert runbook is stubby. If it pages, ping @logs-team.
Post-incident review
A post-incident review (PIR / postmortem / retro — pick one name, use it) exists to change behavior. If the document ends with a list of platitudes, the incident will repeat.
Structure
- Summary (3 sentences). What happened, user impact, duration.
- Impact. Number of affected requests, customers, revenue. Concrete numbers from the telemetry.
- Timeline. UTC times, in sequence, with what was known at that moment. Not "X was the cause"; "at T+12m we suspected X".
- What went well. Mechanisms that worked — mention them so the team knows to keep them.
- What went badly. Tools, alerts, runbooks, handoffs. Be concrete.
- 5 whys. Or an equivalent causal chain. Stop when further "why" is physics.
- Action items. Each has an owner, a due date, and a ticket ID. No owner, no date → it's not real.
Tone
- Blameless — the person who typed the command was operating within an environment that allowed the outage.
- Concrete — "operator fatigued" is not a cause; "no pre-deploy canary" is.
- Public within the org — hiding PIRs by team is a cultural smell.
Action-item format
| # | Action | Owner | Due | Ticket |
|---|--------|-------|-----|--------|
| 1 | Add pre-deploy canary on ingest | @alice | 2024-12-01 | OPS-9200 |
| 2 | Fix alert: currently fires 5m after real impact | @bob | 2024-11-20 | OPS-9201 |
| 3 | Runbook: rollback procedure is wrong on step 4 | @carol | 2024-11-15 | OPS-9202 |
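The "owner, due date, ticket" rule is easy to enforce before a review is published. A minimal Python sketch, assuming action items are available as structured records; the `ActionItem` type and `missing_fields` helper are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> list[str]:
    """Fields that must be filled in before this counts as an action item."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


complete = ActionItem("Add pre-deploy canary on ingest",
                      owner="@alice", due=date(2024, 12, 1), ticket="OPS-9200")
vague = ActionItem("Improve monitoring", ticket="OPS-9203")  # ticket ID is illustrative

print(missing_fields(complete))
print(missing_fields(vague))
```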
Related: Incident: The First 15 Minutes for the acute response pattern, and Change Window Runbook for planned changes.
Anti-patterns
| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Twenty SLOs per service | Noise, nobody can tell which one matters | 2–3 per service; add more only with a clear owner |
| Alert on CPU > 80% | Cause, not symptom; most of the time harmless | Alert on user-facing SLI; investigate CPU during triage |
| SLO measured from the service itself | Service reports healthy while failing for users | Measure from the LB, the client, or a blackbox prober |
| "Non-actionable" severity | Nobody acks it; becomes background noise; the actionable alerts get lost | Delete, or lower to dashboard-only |
| PIR with no action items | Same outage in 3 months | If the review generated no concrete change, do it again |
| Handover in Slack DM | No record; receiver misses it; context is gone | A wiki page per week, linked from the on-call rota |
| Runbook is "call @alex" | Alex leaves. Runbook is now empty. | Write the procedure down; Alex reviews; any two on-calls should be able to execute it |