SLOs and On-Call

Pick SLIs that tell the truth, write SLOs you can actually measure, alert on error-budget burn, ruthlessly cull low-value alerts, hand over cleanly between shifts, and run post-incident reviews that change behavior.

If you only remember six things

An SLI measures what users experience, not what your service does internally.
Three SLOs per service is usually right. Ten is theatre.
Burn-rate alerts (fast + slow) beat "error rate > X for Y minutes" on every axis.
Every alert has a runbook link. No runbook = delete the alert.
On-call handover is a checklist, not a chat message.
Post-incidents are blameless but concrete — action items have owners and dates, or they are not action items.

On this page

SLI, SLO, SLA, error budget
Picking SLIs
Burn-rate alerting
Alert hygiene
On-call handover template
Post-incident review
Anti-patterns

SLI, SLO, SLA, error budget

SLI — a ratio of good events to total events. "Requests that returned 2xx/3xx in under 300 ms / all requests."
SLO — a target for the SLI over a window. "99.5% over 30 rolling days."
SLA — an SLO with consequences (money). If no money moves, it's an SLO.
Error budget — 1 - SLO, expressed in bad events you are allowed. For 99.5% / 30d on a service doing 10 M req/day, that's 1.5 M bad requests before you blow it.
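
The budget arithmetic above can be checked in a few lines. This is a minimal sketch; the function name `error_budget_events` and its parameters are illustrative assumptions, not part of any library.

```python
def error_budget_events(slo: float, requests_per_day: int, window_days: int) -> int:
    """Return how many bad events the SLO tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_events = requests_per_day * window_days
    # The budget is the complement of the SLO; round to absorb float noise.
    return round((1.0 - slo) * total_events)


# 99.5% over 30 days at 10 M requests/day
print(error_budget_events(0.995, 10_000_000, 30))  # 1500000
```

Expressing the budget as a count of events, rather than a percentage, makes it easier to compare against what a single incident actually cost.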

SLOs do two jobs:

Tell you when the user experience has degraded below an agreed bar.
Govern the rate at which you ship change. Budget spent → slow the release train.

Target picking. Don't aim for 99.99% because it sounds good. The extra nines cost an order of magnitude each, and most users can't tell 99.9% from 99.99%. Start at 99% or 99.5%, measure a quarter, revisit.
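
To get a feel for what each extra nine buys, it can help to translate targets into minutes of full outage per window. The sketch below assumes a 30-day window and treats the whole budget as total downtime under uniform traffic, which is a simplification; `allowed_bad_minutes` is a hypothetical helper name.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes


def allowed_bad_minutes(slo_percent: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of total outage the target tolerates in one window."""
    return round((100.0 - slo_percent) / 100.0 * window_minutes, 2)


for target in (99.0, 99.5, 99.9, 99.99):
    print(f"{target}% -> {allowed_bad_minutes(target)} minutes per 30 days")
```

Going from 99.9% to 99.99% shrinks the allowance from roughly 43 minutes to roughly 4, which is less than most teams need to get a human to a keyboard.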

Picking SLIs

Common starting menu:

Category	Typical SLI	Measured at
Availability	non-5xx / total (success ratio)	Load balancer, or client, not the app
Latency	requests under `T` ms / total	LB for user-facing, app for internal
Quality	requests without fallback / total (e.g. no cache miss, no degraded mode)	App-internal metric
Freshness	data younger than `T` / queries	Data system, e.g. pipeline lag
Coverage	enqueued jobs processed within `T` / enqueued	Queue metrics
Correctness	(sampled) correct results / sampled total	Offline job running known-good inputs

Measuring an availability SLI with Prometheus

# Good events (non-5xx)
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))

# Total events
sum(rate(http_requests_total{service="api"}[5m]))

# SLI = good / total
sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))

Latency SLI using histograms

sum(rate(http_request_duration_seconds_bucket{service="api",le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count{service="api"}[5m]))

Measure availability and latency as two SLIs. If you combine them into "good = 2xx AND fast", you can't tell at a glance whether you have an error spike or a slowdown.
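
The value of keeping the two SLIs separate can be illustrated outside Prometheus. The following is a sketch over an in-memory sample; the `Request` record, the function names, and the sample data are assumptions made for illustration, and the 0.3 s threshold mirrors the `le="0.3"` bucket used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    duration_s: float  # observed latency in seconds


def _ratio(good: int, total: int) -> float:
    if total == 0:
        raise ValueError("no requests in window")
    return good / total


def availability_sli(requests: list[Request]) -> float:
    """Share of requests that did not return a 5xx."""
    return _ratio(sum(1 for r in requests if r.status < 500), len(requests))


def latency_sli(requests: list[Request], threshold_s: float = 0.3) -> float:
    """Share of requests at or under the threshold, regardless of status."""
    return _ratio(sum(1 for r in requests if r.duration_s <= threshold_s), len(requests))


# Hypothetical sample: 2 server errors, 1 slow success, 7 fast successes.
sample = [Request(500, 0.05)] * 2 + [Request(200, 0.9)] + [Request(200, 0.1)] * 7
print(availability_sli(sample))  # 0.8
print(latency_sli(sample))       # 0.9
```

A combined "2xx AND fast" ratio for the same sample would be 0.7, which hides that most of the damage is errors rather than slowness.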

Burn-rate alerting

The old way: "page me if error rate > 1% for 10 minutes." Problems:

A 9-minute incident at 5% errors never fires under "> 1% for 10m", because it clears before the 10 minutes are up. Against a 99.5% SLO it still burned budget at ten times the sustainable rate while it lasted.
A slow 0.3% bleed never fires at all. Against a 99.5% SLO that is a 0.6x burn, so 72 hours of it quietly spends about 6% of the 30-day budget with nobody looking.

Burn-rate alerting expresses alerts in terms of how fast you are consuming the error budget. A burn rate of 1 = you'll exhaust the budget in exactly the SLO window (30 days). A burn rate of 14.4 = you'll exhaust it in about 2 days (30 / 14.4 ≈ 2.1 days, or 50 hours).
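
The definition reduces to two small formulas: burn rate is the observed error rate divided by the budget (1 - SLO), and time to exhaustion is the window divided by the burn rate. A minimal sketch, with hypothetical function names and an assumed 99.5% SLO:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if this burn rate holds."""
    return window_days / rate


rate = burn_rate(error_rate=0.072, slo=0.995)
print(round(rate, 1))                      # 14.4
print(round(days_to_exhaustion(rate), 2))  # 2.08
```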

Multi-window, multi-burn-rate (the Google SRE formula)

groups:
  - name: api-slo
    rules:
      # Fast burn: 2% of the 30-day budget in 1h, confirm over 5m
      - alert: APIErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
              / sum(rate(http_requests_total{service="api"}[1h]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="api"}[5m]))
          ) > (14.4 * 0.005)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "API burning error budget fast (14.4x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"

      # Slow burn: 5% of budget in 6h, confirm over 30m
      - alert: APIErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[6h]))
              / sum(rate(http_requests_total{service="api"}[6h]))
          ) > (6 * 0.005)
          and
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30m]))
              / sum(rate(http_requests_total{service="api"}[30m]))
          ) > (6 * 0.005)
        for: 15m
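        labels: { severity: ticket }
        annotations:
          summary: "API burning error budget slowly (6x)"
          runbook_url: "https://wiki.internal/runbooks/api-5xx"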

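The two-window trick (long + short) does two jobs. The long window proves that a meaningful share of the budget is already gone, so a brief spike alone does not page. The short window proves the burn is still happening, so the alert stops firing soon after a blip clears instead of lingering until the long window drains.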
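
The condition in the rules above can be restated as plain logic. This sketch assumes the error rates for each window have already been computed; `should_fire` is a hypothetical name, and the `for:` hold time from the rules is deliberately not modelled.

```python
def should_fire(long_error_rate: float, short_error_rate: float,
                burn_threshold: float, slo: float = 0.995) -> bool:
    """Fire only when both windows exceed the burn-rate threshold."""
    limit = burn_threshold * (1.0 - slo)
    return long_error_rate > limit and short_error_rate > limit


# Ongoing incident: both the 1h and the 5m window are hot.
print(should_fire(0.10, 0.12, burn_threshold=14.4))   # True
# Blip that already cleared: 1h average still high, last 5m clean.
print(should_fire(0.10, 0.001, burn_threshold=14.4))  # False
# Brief spike: last 5m hot, but the 1h average shows little budget spent.
print(should_fire(0.01, 0.12, burn_threshold=14.4))   # False
```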

Burn-rate targets translate directly to "how much of the budget you've just spent":

Burn 14.4x → spent 2% of 30-day budget in 1 hour. Page now.
Burn 6x → spent 5% in 6 hours. Ticket; investigate today.
Burn 1x → spending exactly what the SLO allows; the budget runs out as the window ends. Do nothing special.
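
The percentages in the list above follow from one formula: budget spent = burn rate × elapsed time / SLO window. A short sketch, assuming a 30-day window; `budget_spent` is an illustrative name.

```python
def budget_spent(burn: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget spent at `burn` for `hours`."""
    return burn * hours / (window_days * 24)


print(f"{budget_spent(14.4, 1):.0%}")  # 2%
print(f"{budget_spent(6, 6):.0%}")     # 5%
print(f"{budget_spent(1, 720):.0%}")   # 100%
```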

Alert hygiene

Every alert should satisfy these, or be deleted. This is the only alert checklist that matters.

Actionable. The recipient has a concrete next step. "Disk free is low" is actionable; "something might be weird" is not.
Specific. The symptom is named ("api 5xx burn fast"), not the cause (which you don't know yet). One alert, one symptom.
Runbooked. The runbook_url annotation points to a page that begins with a decision tree, not a wall of text.
Severity-tagged. page or ticket. If you have more than those two, you have none.
De-duplicated. Grouping in Alertmanager so the pager gets one message per incident, not 30.
Owned. A team label routes to the right pager. No "default" team. Unowned alerts become ignored alerts.
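
Three of these checks (runbooked, severity-tagged, owned) are mechanical enough to lint in CI. The sketch below assumes each rule has already been parsed into a dict shaped like a Prometheus alerting rule; the `lint_alert` function and the sample rule are hypothetical, and the other three criteria still need a human.

```python
ALLOWED_SEVERITIES = {"page", "ticket"}


def lint_alert(rule: dict) -> list[str]:
    """Return hygiene problems for one parsed alerting rule (empty = clean)."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if not annotations.get("runbook_url"):
        problems.append("missing runbook_url annotation")
    if labels.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity must be page or ticket")
    if labels.get("team") in (None, "", "default"):
        problems.append("missing or default team label")
    return problems


# Hypothetical rule with a valid severity but no runbook and no owner.
example_rule = {"alert": "ExampleAlert", "labels": {"severity": "ticket"}}
print(lint_alert(example_rule))
```

Failing the build on a non-empty result is one way to keep unowned or runbook-less alerts from ever reaching the pager.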

Quarterly cull

Once a quarter, list every alert that fired and ask:

Did the recipient take action beyond "ack"? If not: delete or lower severity.
Was there a real user impact? If not: this is a cause-alert, convert to ticket.
Did it fire more than twice with the same root cause? Fix the cause or raise the threshold.
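
The three questions can be applied mechanically to an export of the quarter's firings. This is a sketch under stated assumptions: the `Firing` record and its fields are hypothetical, the answers are filled in by reviewers, and the questions are evaluated per alert across all of its firings.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Firing:
    alert: str
    action_taken: bool  # did the recipient do more than ack?
    user_impact: bool   # did users actually notice?
    root_cause: str     # tag assigned during review


def cull_recommendations(firings: list[Firing]) -> dict[str, list[str]]:
    """Map each alert name to the cull actions its history suggests."""
    by_alert: dict[str, list[Firing]] = defaultdict(list)
    for firing in firings:
        by_alert[firing.alert].append(firing)

    recommendations = {}
    for alert, items in by_alert.items():
        notes = []
        if not any(f.action_taken for f in items):
            notes.append("delete or lower severity")
        if not any(f.user_impact for f in items):
            notes.append("convert to ticket")
        repeats = Counter(f.root_cause for f in items)
        if any(count > 2 for count in repeats.values()):
            notes.append("fix the cause or raise the threshold")
        recommendations[alert] = notes
    return recommendations


# Hypothetical quarter: one noisy cause-alert, one healthy symptom alert.
history = [Firing("DiskFull", True, False, "log rotation")] * 3
history.append(Firing("APIFastBurn", True, True, "bad deploy"))
print(cull_recommendations(history))
```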

On-call handover template

Paste this into the team wiki and fill it in before you hand over. Ten minutes of writing saves an hour of "wait, what?" from the next on-call.

## On-call handover — <from> → <to> — <date, timezone>

### Active incidents
- INC-2024-0142 — partial outage of ingest tier.
  Current state: mitigated, rollback in flight. ETA: 18:00 UTC.
  Next action needed from YOU: verify ingest p99 under 300 ms after
  rollback completes; if not, escalate to @ingest-team.
  Ticket: https://...  Slack: #inc-2024-0142

### Tickets you may get paged on
- A known pg_repack window is running on DB04 22:00–02:00 UTC.
  Expected: writes may queue briefly, don't fail over. Runbook: ...

### Silences active
- `alertname=CertExpiresSoon,instance=legacy01` until 2024-11-10.
  Reason: retiring host. Owner: @infra-team.

### Deploys in flight / recently
- api  v2.7.3 rolled out 09:00 UTC — healthy.
- auth v1.4.0 scheduled 14:00 UTC today — @you to watch.

### Known unknowns
- We saw one unexplained 30s latency spike on billing at 03:12 UTC;
  no alert. Follow-up ticket OPS-9123.

### Unresolved runbook gaps
- Loki compactor alert runbook is stubby. If it pages, ping @logs-team.

Handover is a hand-off, not a broadcast. The receiving on-call acknowledges in writing that they have read and understood it. That acknowledgement is the signal that the outgoing on-call is released.
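
If the handover note lives in a wiki as markdown, the checklist and the acknowledgement can both be verified before the outgoing on-call is released. A minimal sketch: the section names are copied from the template above, while `handover_problems` and the `acknowledged_by` parameter are assumptions about how a team might record the acknowledgement.

```python
from typing import Optional

REQUIRED_SECTIONS = (
    "Active incidents",
    "Tickets you may get paged on",
    "Silences active",
    "Deploys in flight / recently",
    "Known unknowns",
    "Unresolved runbook gaps",
)


def handover_problems(markdown: str, acknowledged_by: Optional[str]) -> list[str]:
    """List what still blocks releasing the outgoing on-call."""
    headings = {
        line[4:].strip()
        for line in markdown.splitlines()
        if line.startswith("### ")
    }
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in headings]
    if not acknowledged_by:
        problems.append("receiving on-call has not acknowledged")
    return problems


# Hypothetical incomplete note with no acknowledgement yet.
note = "### Active incidents\n- none\n### Silences active\n- none\n"
for problem in handover_problems(note, acknowledged_by=None):
    print(problem)
```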

Post-incident review

A post-incident review (PIR / postmortem / retro — pick one name, use it) exists to change behavior. If the document ends with a list of platitudes, the incident will repeat.

Structure

Summary (3 sentences). What happened, user impact, duration.
Impact. Number of affected requests, customers, revenue. Concrete numbers from the telemetry.
Timeline. UTC times, in sequence, with what was known at that moment. Not "X was the cause"; "at T+12m we suspected X".
What went well. Mechanisms that worked — mention them so the team knows to keep them.
What went badly. Tools, alerts, runbooks, handoffs. Be concrete.
5 whys. Or an equivalent causal chain. Stop when further "why" is physics.
Action items. Each has an owner, a due date, and a ticket ID. No owner, no date → it's not real.
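
The "no owner, no date" rule is simple enough to enforce when a review is filed. The sketch below uses a hypothetical `ActionItem` record and `is_real` check; the second sample item is invented to show a platitude being rejected.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None


def is_real(item: ActionItem) -> bool:
    """An action item counts only with an owner, a due date, and a ticket."""
    return bool(item.owner and item.due and item.ticket)


items = [
    ActionItem("Add pre-deploy canary on ingest", "@alice", date(2024, 12, 1), "OPS-9200"),
    ActionItem("Improve monitoring"),  # platitude: nobody owns it
]
print([i.action for i in items if not is_real(i)])  # ['Improve monitoring']
```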

Tone

Blameless — the person who typed the command was operating within an environment that allowed the outage.
Concrete — "operator fatigued" is not a cause; "no pre-deploy canary" is.
Public within the org — hiding PIRs by team is a cultural smell.

Action-item format

| # | Action | Owner | Due | Ticket |
|---|--------|-------|-----|--------|
| 1 | Add pre-deploy canary on ingest | @alice | 2024-12-01 | OPS-9200 |
| 2 | Fix alert: currently fires 5m after real impact | @bob   | 2024-11-20 | OPS-9201 |
| 3 | Runbook: rollback procedure is wrong on step 4 | @carol | 2024-11-15 | OPS-9202 |

Related: "Incident: The First 15 Minutes" for the acute response pattern, and "Change Window Runbook" for planned changes.

Anti-patterns

Anti-pattern	Why it's bad	Fix
Twenty SLOs per service	Noise, nobody can tell which one matters	2–3 per service; add more only with a clear owner
Alert on CPU > 80%	Cause, not symptom; most of the time harmless	Alert on user-facing SLI; investigate CPU during triage
SLO measured from the service itself	Service reports healthy while failing for users	Measure from the LB, the client, or a blackbox prober
"Non-actionable" severity	Nobody acks it; becomes background noise; the actionable alerts get lost	Delete, or lower to dashboard-only
PIR with no action items	Same outage in 3 months	If the review generated no concrete change, do it again
Handover in Slack DM	No record; receiver misses it; context is gone	A wiki page per week, linked from the on-call rota
Runbook is "call @alex"	Alex leaves. Runbook is now empty.	Write the procedure down; Alex reviews; any two on-calls should be able to execute it