SLOs and On-Call

Pick SLIs that tell the truth, write SLOs you can actually measure, alert on error-budget burn, ruthlessly cull low-value alerts, hand over cleanly between shifts, and run post-incident reviews that change behavior.

If you only remember six things
  • An SLI measures what users experience, not what your service does internally.
  • Three SLOs per service is usually right. Ten is theatre.
  • Burn-rate alerts (fast + slow) beat "error rate > X for Y minutes" on every axis.
  • Every alert has a runbook link. No runbook = delete the alert.
  • On-call handover is a checklist, not a chat message.
  • Post-incidents are blameless but concrete — action items have owners and dates, or they are not action items.

SLI, SLO, SLA, error budget

SLOs do two jobs:

  1. Tell you when the user experience has degraded below an agreed bar.
  2. Govern the rate at which you ship change. Budget spent → slow the release train.

Target picking. Don't aim for 99.99% because it sounds good. Each extra nine costs roughly an order of magnitude more to deliver, and most users can't tell 99.9% from 99.99%. Start at 99% or 99.5%, measure for a quarter, then revisit.
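
To make the cost of extra nines concrete, the sketch below converts an availability target into the downtime it allows. The function name `allowed_downtime_minutes` and the 30-day default window are illustrative assumptions, not something this guide prescribes.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage that the error budget allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO does not promise.
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    for target in (0.99, 0.995, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} allows {minutes:.1f} min of downtime per 30 days")
```

At 99% the budget is roughly 432 minutes per 30 days; at 99.99% it is a little over 4 minutes, which leaves almost no room for a human to respond.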

Picking SLIs

Common starting menu:

| Category | Typical SLI | Measured at |
|----------|-------------|-------------|
| Availability | non-5xx / total (success ratio) | Load balancer or client, not the app |
| Latency | requests under T ms / total | LB for user-facing, app for internal |
| Quality | requests without fallback / total (e.g. no cache miss, no degraded mode) | App-internal metric |
| Freshness | data younger than T / queries | Data system, e.g. pipeline lag |
| Coverage | enqueued jobs processed within T / enqueued | Queue metrics |
| Correctness | (sampled) correct results / sampled total | Offline job running known-good inputs |

Measuring an availability SLI with Prometheus

# Good events (non-5xx)
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))

# Total events
sum(rate(http_requests_total{service="api"}[5m]))

# SLI = good / total
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

Latency SLI using histograms

sum(rate(http_request_duration_seconds_bucket{service="api",le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

Measure availability and latency as two SLIs. If you combine them into "good = 2xx AND fast", you can't tell at a glance whether you have an error spike or a slowdown.
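
The same separation can be sketched outside Prometheus. The example below is a minimal illustration that assumes request records are available as in-memory objects; the `Request` type, the function names, and the 0.3 s threshold (mirroring `le="0.3"`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Share of requests that did not return a 5xx status."""
    if not requests:
        return None  # no traffic means no signal, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_s: float = 0.3) -> Optional[float]:
    """Share of requests served at or under the latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)


if __name__ == "__main__":
    sample = [Request(200, 0.1), Request(200, 0.5), Request(200, 0.7), Request(500, 0.05)]
    print("availability:", availability_sli(sample))
    print("latency:", latency_sli(sample))
```

With this sample the two numbers diverge (one error, two slow requests), which is exactly the distinction a combined "good and fast" SLI would hide.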

Burn-rate alerting

The old way: "page me if error rate > 1% for 10 minutes." Problems:

Burn-rate alerting expresses alerts in terms of how fast you are consuming the error budget. A burn rate of 1 means you'll exhaust the budget in exactly the SLO window (30 days). A burn rate of 14.4 means you'll exhaust it in about 2 days (30 / 14.4 ≈ 2.1).
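
As a rough sketch of that definition: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names below are illustrative, and the 99.5% target in the usage example is an assumption chosen for the arithmetic.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if the burn rate holds steady."""
    if rate <= 0.0:
        return float("inf")  # not burning at all
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.072, slo_target=0.995)
    print(f"burn rate {rate:.1f}, budget gone in {days_to_exhaustion(rate):.1f} days")
```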

Multi-window, multi-burn-rate (the Google SRE formula)

groups:
  - name: api-slo
    rules:
      # Fast burn: 2% of the 30-day budget in 1h, confirm over 5m
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
              / sum(rate(http_requests_total{service="api"}[1h]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="api"}[5m]))
          ) > (14.4 * 0.005)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "API burning error budget fast (14.4x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"

      # Slow burn: 5% of the 30-day budget in 6h, confirm over 30m
      - alert: APIErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[6h]))
              / sum(rate(http_requests_total{service="api"}[6h]))
          ) > (6 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30m]))
              / sum(rate(http_requests_total{service="api"}[30m]))
          ) > (6 * 0.005)
        for: 15m
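        labels: { severity: ticket }
        annotations:
          summary: "API burning error budget slowly (6x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"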

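The two-window trick (long + short) does two things. The long window confirms that a meaningful share of the budget is gone; the short window confirms that the burn is still happening. Together they keep you from paging on a 1-hour blip that has already cleared, and they let the alert reset within minutes of recovery instead of staying red until the long window drains.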
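
The condition in those rules reduces to a small predicate. This is a sketch of the logic only, not of how Prometheus evaluates it; `should_alert` and its parameters are hypothetical names.

```python
def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    burn_rate_threshold: float,
    slo_target: float,
) -> bool:
    """Fire only when both windows exceed the burn-rate threshold."""
    limit = burn_rate_threshold * (1.0 - slo_target)
    return long_window_error_ratio > limit and short_window_error_ratio > limit


if __name__ == "__main__":
    # Incident already over: the long window is still high, the short one recovered.
    print(should_alert(0.10, 0.001, burn_rate_threshold=14.4, slo_target=0.995))
    # Incident ongoing: both windows are above the limit.
    print(should_alert(0.10, 0.10, burn_rate_threshold=14.4, slo_target=0.995))
```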

Burn-rate targets translate directly to "how much of the budget you've just spent":
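
The arithmetic is: fraction of budget spent = burn rate × alert window / SLO window. A minimal sketch, assuming a 30-day SLO window; the function name is illustrative, and the third row (burn rate 1 over 3 days) is an extra example rather than one of the alerts defined in this document.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float, slo_window_days: float = 30.0) -> float:
    """Fraction of the error budget spent by burning at `burn_rate` for the alert window."""
    slo_window_hours = slo_window_days * 24.0
    return burn_rate * alert_window_hours / slo_window_hours


if __name__ == "__main__":
    for rate, hours in ((14.4, 1), (6, 6), (1, 72)):
        spent = budget_consumed(rate, hours)
        print(f"burn rate {rate} for {hours}h spends {spent:.0%} of the budget")
```

So a 14.4x burn sustained for 1 hour spends 2% of a 30-day budget, and a 6x burn sustained for 6 hours spends 5%.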

Alert hygiene

Every alert should satisfy these, or be deleted. This is the only alert checklist that matters.

Quarterly cull

Once a quarter, list every alert that fired and ask:

  1. Did the recipient take action beyond "ack"? If not: delete or lower severity.
  2. Was there a real user impact? If not: this is a cause-alert, convert to ticket.
  3. Did it fire more than twice with the same root cause? Fix the cause or raise the threshold.
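
The three questions can be run mechanically over an export of the quarter's alert history. The sketch below assumes you can produce one record per firing with the fields shown; the `Firing` type and its field names are hypothetical, and it applies questions 1 and 2 per alert (no firing was acted on, no firing had user impact), which is one possible reading.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Firing:
    alert: str
    action_taken: bool  # recipient did something beyond "ack"
    user_impact: bool   # real user impact was confirmed
    root_cause: str


def cull_recommendations(firings: Iterable[Firing]) -> Dict[str, List[str]]:
    """Map each alert name to the cull actions its firing history suggests."""
    by_alert: Dict[str, List[Firing]] = {}
    for firing in firings:
        by_alert.setdefault(firing.alert, []).append(firing)

    result: Dict[str, List[str]] = {}
    for alert, items in by_alert.items():
        recs: List[str] = []
        if not any(f.action_taken for f in items):
            recs.append("delete or lower severity")
        if not any(f.user_impact for f in items):
            recs.append("convert to ticket")
        causes = Counter(f.root_cause for f in items)
        if any(count > 2 for count in causes.values()):
            recs.append("fix the cause or raise the threshold")
        result[alert] = recs
    return result


if __name__ == "__main__":
    history = [Firing("DiskFull", True, False, "log rotation")] * 3
    print(cull_recommendations(history))
```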

On-call handover template

Paste this into the team wiki, fill in before you hand over. Ten minutes of writing saves an hour of "wait, what?" from the next on-call.

## On-call handover — <from> → <to> — <date, timezone>

### Active incidents
- INC-2024-0142 — partial outage of ingest tier.
  Current state: mitigated, rollback in flight. ETA: 18:00 UTC.
  Next action needed from YOU: verify ingest p99 under 300 ms after
  rollback completes; if not, escalate to @ingest-team.
  Ticket: https://...  Slack: #inc-2024-0142

### Tickets you may get paged on
- A known pg_repack window is running on DB04 22:00–02:00 UTC.
  Expected: writes may queue briefly, don't fail over. Runbook: ...

### Silences active
- `alertname=CertExpiresSoon,instance=legacy01` until 2024-11-10.
  Reason: retiring host. Owner: @infra-team.

### Deploys in flight / recently
- api  v2.7.3 rolled out 09:00 UTC — healthy.
- auth v1.4.0 scheduled 14:00 UTC today — @you to watch.

### Known unknowns
- We saw one unexplained 30s latency spike on billing at 03:12 UTC;
  no alert. Follow-up ticket OPS-9123.

### Unresolved runbook gaps
- Loki compactor alert runbook is stubby. If it pages, ping @logs-team.

Handover is a handoff, not a broadcast. The receiving on-call acknowledges in writing that they have read and understood it. That acknowledgement is the signal that the outgoing on-call is released.
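
If the handover page lives somewhere scriptable, a completeness check is cheap. The sketch below assumes the page is available as text and uses the `###` section headings from the template above; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names.

```python
from typing import List

# Section headings taken from the handover template.
REQUIRED_SECTIONS = (
    "Active incidents",
    "Tickets you may get paged on",
    "Silences active",
    "Deploys in flight / recently",
    "Known unknowns",
    "Unresolved runbook gaps",
)


def missing_sections(handover_text: str) -> List[str]:
    """Return the template sections that are absent from a handover page."""
    headings = {
        line[len("### "):].strip()
        for line in handover_text.splitlines()
        if line.startswith("### ")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


if __name__ == "__main__":
    draft = "### Active incidents\n- none\n### Silences active\n- none\n"
    print(missing_sections(draft))
```

A section with nothing to report should still be present and say "none", so the receiver knows it was considered rather than forgotten.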

Post-incident review

A post-incident review (PIR / postmortem / retro — pick one name, use it) exists to change behavior. If the document ends with a list of platitudes, the incident will repeat.

Structure

  1. Summary (3 sentences). What happened, user impact, duration.
  2. Impact. Number of affected requests, customers, revenue. Concrete numbers from the telemetry.
  3. Timeline. UTC times, in sequence, with what was known at that moment. Not "X was the cause"; "at T+12m we suspected X".
  4. What went well. Mechanisms that worked — mention them so the team knows to keep them.
  5. What went badly. Tools, alerts, runbooks, handoffs. Be concrete.
  6. 5 whys. Or an equivalent causal chain. Stop when further "why" is physics.
  7. Action items. Each has an owner, a due date, and a ticket ID. No owner, no date → it's not real.

Tone

Action-item format

| # | Action | Owner | Due | Ticket |
|---|--------|-------|-----|--------|
| 1 | Add pre-deploy canary on ingest | @alice | 2024-12-01 | OPS-9200 |
| 2 | Fix alert: currently fires 5m after real impact | @bob   | 2024-11-20 | OPS-9201 |
| 3 | Runbook: rollback procedure is wrong on step 4 | @carol | 2024-11-15 | OPS-9202 |
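
The "no owner, no date" rule is easy to enforce before a review is published. A minimal sketch, assuming action items are captured as structured records; the `ActionItem` type and `missing_fields` helper are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Names of the required fields that are empty on this action item."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


if __name__ == "__main__":
    items = [
        ActionItem("Add pre-deploy canary on ingest", "@alice", date(2024, 12, 1), "OPS-9200"),
        ActionItem("Improve monitoring"),
    ]
    for item in items:
        gaps = missing_fields(item)
        status = "ok" if not gaps else "not an action item yet, missing: " + ", ".join(gaps)
        print(f"{item.action}: {status}")
```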

Related: Incident: The First 15 Minutes for the acute response pattern, and Change Window Runbook for planned changes.

Anti-patterns

| Anti-pattern | Why it's bad | Fix |
|--------------|--------------|-----|
| Twenty SLOs per service | Noise, nobody can tell which one matters | 2–3 per service; add more only with a clear owner |
| Alert on CPU > 80% | Cause, not symptom; most of the time harmless | Alert on user-facing SLI; investigate CPU during triage |
| SLO measured from the service itself | Service reports healthy while failing for users | Measure from the LB, the client, or a blackbox prober |
| "Non-actionable" severity | Nobody acks it; becomes background noise; the actionable alerts get lost | Delete, or lower to dashboard-only |
| PIR with no action items | Same outage in 3 months | If the review generated no concrete change, do it again |
| Handover in Slack DM | No record; receiver misses it; context is gone | A wiki page per week, linked from the on-call rota |
| Runbook is "call @alex" | Alex leaves. Runbook is now empty. | Write the procedure down; Alex reviews; any two on-calls should be able to execute it |