Incident: First 15 Minutes
- Stabilize before you optimize. Stop the damage first, root-cause later.
- Freeze ongoing changes until they are proven unrelated.
- Facts beat guesses. Capture timestamps, symptoms, and scope early.
- Keep a comms cadence even if the update is "investigating, no new ETA yet".
- Preserve evidence before rebooting services, clearing queues, or deleting artifacts.
Declare the incident and set severity
The first question is not "what is the root cause?" It is "how bad is the impact right now?" Severity determines who gets pulled in and how often you update.
| Question | Why it matters |
|---|---|
| Who is affected? | One team, one region, one tenant, or all users? |
| What is broken? | Login, data writes, one API, the whole site, background processing? |
| Is there active data loss or security risk? | This can change the incident from degraded service to emergency containment. |
| Is a recent change the likely trigger? | If yes, rollback may be the fastest path to stability. |
Incident header
- Incident ID: INC-2026-041
- Start time: 14:07 UTC
- Severity: SEV2
- Impact: all user logins failing in production
- Suspected trigger: auth stack rollout at 13:58 UTC
- Incident lead: alice
- Next update: 14:15 UTC
If you already use service-level objectives, match the severity language to the guide in SLOs & On-Call. Consistency matters more than perfection in the first few minutes.
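The triage questions above can be reduced to a provisional severity call. The sketch below is only an illustration: the function name, the `all`/`many`/`few` scope labels, and the SEV1/SEV2/SEV3 mapping are assumptions, not definitions from this page or from SLOs & On-Call. Substitute whatever your own severity guide says.

```shell
# Provisional severity from two triage answers.
# ASSUMPTION: this mapping is illustrative only; use your organization's
# published severity definitions instead.
provisional_severity() {
  local scope="$1"      # hypothetical labels: all, many, few
  local data_risk="$2"  # yes or no: active data loss or security risk
  if [ "$data_risk" = "yes" ]; then
    echo "SEV1"
  elif [ "$scope" = "all" ] || [ "$scope" = "many" ]; then
    echo "SEV2"
  else
    echo "SEV3"
  fi
}

provisional_severity all no   # prints SEV2
```

The result is a starting point to revise as facts arrive, not a final classification.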
Freeze risky change
When production is unstable, stop adding new variables. Freeze deploy pipelines, manual changes, schema migrations, and unrelated rollouts until you know what you are dealing with.
- pause or disable automated deploy jobs if they are still running
- tell responders not to make "quick fixes" on random hosts without coordination
- identify the last known-good version, image, or config revision immediately
[FREEZE][INC-2026-041] 14:09 UTC
All unrelated production deploys paused.
Current focus: restore authentication.
Any manual host changes must be coordinated in the incident channel.
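One way to make a freeze enforceable rather than advisory is a marker file that deploy wrappers check before running. This is a minimal sketch that assumes your deploy jobs go through a script you control. `FREEZE_FILE`, `freeze_on`, `freeze_off`, and `deploy_allowed` are hypothetical names, and real CI systems usually have their own pause mechanism that should be preferred.

```shell
# ASSUMPTION: deploy jobs are wrapped by a script that can call deploy_allowed.
# The marker path and function names are hypothetical.
FREEZE_FILE="${FREEZE_FILE:-/var/tmp/deploy-freeze}"

freeze_on() {
  # Record when (UTC), for which incident, and by whom the freeze was set.
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "${USER:-unknown}" > "$FREEZE_FILE"
}

freeze_off() {
  rm -f "$FREEZE_FILE"
}

deploy_allowed() {
  # Succeeds only when no freeze marker exists.
  if [ -e "$FREEZE_FILE" ]; then
    echo "deploys frozen: $(cat "$FREEZE_FILE")" >&2
    return 1
  fi
  return 0
}
```

A deploy wrapper would then start with `deploy_allowed || exit 1`, and the incident lead would run `freeze_on INC-2026-041` at the moment the freeze is announced.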
If the incident follows a planned rollout, jump straight to the prewritten rollback or stop conditions from Change Window Runbook instead of improvising a new plan under stress.
Gather facts fast
Gather enough evidence to choose the next action. Do not try to read the whole platform in minute five. Aim to answer four questions: what changed, what is failing, how broad is it, and what still works.
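date -u -Is  # UTC timestamp to anchor the incident timeline
hostname -f  # which host these facts came from
uptime  # load average and time since last boot
systemctl --failed  # units currently in a failed state
journalctl -p err --since "-15 min" --no-pager | tail -50  # most recent error-level log lines
ss -lntup  # listening TCP/UDP sockets and owning processes
df -h  # disk space
df -i  # inode exhaustion
curl -fsS --max-time 10 https://app.example.com/health || true  # health check with a timeout; never abort the sweep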
Then add the service-specific checks that match the blast radius:
- web outage: health endpoints, load balancer status, TLS, recent deploy
- auth outage: IdP status, token issuance, SSO path, Kerberos or directory health
- database incident: connection count, replication lag, error rate, write path
- mail or queue issue: queue depth, relay health, consumers stalled or not
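To answer "what is failing" and "what still works" in one pass, it can help to wrap each check so it prints a uniform, timestamped result and never aborts the sweep. The `probe` helper below is a hypothetical sketch, and the `/login` URL in the commented example is a placeholder, since only the `/health` endpoint appears in this page.

```shell
# Run one check, print a timestamped OK/FAIL line, and keep going regardless.
probe() {
  local label="$1"
  shift
  local status="OK"
  "$@" >/dev/null 2>&1 || status="FAIL"
  printf '%s %-5s %s\n' "$(date -u +%H:%M:%SZ)" "$status" "$label"
}

# Example sweep (URLs are placeholders; adjust to your service):
# probe "health endpoint" curl -fsS --max-time 5 https://app.example.com/health
# probe "login endpoint"  curl -fsS --max-time 5 https://app.example.com/login
# probe "root disk has space" test "$(df --output=pcent / | tail -1 | tr -dc 0-9)" -lt 95
```

The output pastes directly into the incident channel, which keeps the fact-gathering visible to everyone instead of living in one responder's terminal.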
Once you know which domain is actually failing, branch to Service Troubleshooting and Generic for the next steps.
Stop the bleeding
The question here is not "what is elegant?" It is "what safely reduces user harm fastest?" Common options:
- roll back the last deploy
- drain or remove a bad node from service
- disable a broken feature flag or WAF rule
- fail over to a healthy replica or secondary path
- rate-limit, shed traffic, or put up a maintenance page to protect data integrity
Mitigation order
1. Revert or disable the suspected bad change
2. Remove obviously bad instances from rotation
3. Confirm core user path recovers
4. Keep the freeze in place until stability is proven
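Step 3 is easy to declare too early on a single lucky request. A rough sketch of a stricter check is to require several consecutive successes before calling the path recovered. The function name and the thresholds in the usage note are assumptions; pick values that match your traffic and your check.

```shell
# Succeed only after `needed` consecutive passes of the check command,
# giving up after `max_tries` attempts. Any failure resets the streak.
confirm_recovery() {
  local needed="$1" max_tries="$2" delay="$3"
  shift 3
  local streak=0 try=0
  while [ "$try" -lt "$max_tries" ]; do
    try=$((try + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$needed" ]; then return 0; fi
    else
      streak=0
    fi
    sleep "$delay"
  done
  return 1
}
```

For example, `confirm_recovery 5 30 10 curl -fsS --max-time 5 https://app.example.com/health` would require five passes in a row, ten seconds apart. The check should exercise the core user path itself; a generic health endpoint can stay green while the real path is still failing.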
If the mitigation depends on backups, replica health, or DR tooling, use the shortest trusted path described in Backup & Restore rather than inventing a one-off recovery sequence.
Comms cadence and escalation
Good incident comms are short, factual, and time-bound. Do not wait for certainty before sending the first update.
[SEV2][INC-2026-041] 14:15 UTC
Impact: production login failures for most users.
Current action: rollback of auth rollout in progress.
Observed: health endpoint still up; login endpoint returning 500.
Next update: 14:25 UTC.
Escalate early when any of these are true:
- customer-facing impact is broad or sustained
- data integrity or security may be involved
- the primary responder lacks the access or expertise to execute the likely fix
- the service owner or platform dependency team is not already engaged
Keep updates on a fixed cadence, often every 10 to 15 minutes, until the service is clearly stable. Use dashboards from Observability Overview to keep the status factual instead of narrative.
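A small formatter can keep updates in the same shape every time and compute the next update time from the cadence, so nobody has to do clock arithmetic under stress. This is a sketch that assumes GNU `date` (for the `-d` option). The function name and argument order are hypothetical.

```shell
# Print a status update whose "Next update" is now + cadence minutes (UTC).
# Requires GNU date for -d. The optional 7th argument (epoch seconds)
# overrides "now" so the output can be reproduced or tested.
status_update() {
  local sev="$1" id="$2" cadence_min="$3" impact="$4" action="$5" observed="$6"
  local now="${7:-$(date -u +%s)}"
  local next=$((now + cadence_min * 60))
  printf '[%s][%s] %s UTC\n' "$sev" "$id" "$(date -u -d "@$now" +%H:%M)"
  printf 'Impact: %s\n' "$impact"
  printf 'Current action: %s\n' "$action"
  printf 'Observed: %s\n' "$observed"
  printf 'Next update: %s UTC.\n' "$(date -u -d "@$next" +%H:%M)"
}
```

Calling `status_update SEV2 INC-2026-041 10 "production login failures for most users." "rollback of auth rollout in progress." "health endpoint still up; login endpoint returning 500."` produces a block in the same layout as the example update in this section.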
Evidence handling
Evidence is fragile. Reboots, log rotation, autoscaling, and queue cleanup can erase the clues you need later. Save what matters before heavy remediation if doing so does not delay an urgent mitigation.
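STAMP=$(date -u +%Y%m%d-%H%M%S)  # UTC, to match the incident timeline
DIR="/var/tmp/incident-$STAMP"
mkdir -p "$DIR"
journalctl --since "-30 min" --no-pager > "$DIR/journal.txt"  # recent logs, all priorities
systemctl status myapp -l --no-pager > "$DIR/systemctl.txt"  # unit state and last log lines
ss -lntup > "$DIR/sockets.txt"  # listening sockets and owning processes
cp -p /etc/myapp/config.yml "$DIR/" 2>/dev/null || true  # config as deployed; -p keeps mode and mtime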
Also record:
- exact timestamps of symptoms and mitigations
- commit SHA, image ID, package version, or config revision involved
- screenshots or exported graphs showing impact and recovery
- which actions were taken, by whom, and in what order
This material becomes the backbone of the post-incident review and keeps the next response from repeating the same guesswork.
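Recording which actions were taken, by whom, and in what order is simplest when it costs one command. The sketch below appends tab-separated lines to a timeline file. The `TIMELINE` path and the `log_action` name are hypothetical, and a shared incident document or chat channel can serve the same purpose if your team already has one.

```shell
# Append one line per action: UTC timestamp, actor, free-text description.
# ASSUMPTION: the file path and function name are placeholders.
TIMELINE="${TIMELINE:-/var/tmp/incident-timeline.log}"

log_action() {
  local actor="$1"
  shift
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$actor" "$*" >> "$TIMELINE"
}
```

For example, `log_action alice "rollback of auth rollout started"` adds one entry. Because entries are appended in call order with UTC timestamps, the file can be pasted into the review without reconstruction from memory.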
Common first-15 mistakes
| Mistake | Why it hurts | Better move |
|---|---|---|
| Chasing root cause before mitigating impact | Users stay broken while responders debate theory. | Stabilize first, then investigate deeply. |
| Letting deploys continue during the incident | New changes hide the original trigger and expand the blast radius. | Freeze change until the service is stable. |
| Multiple people making untracked host changes | No one knows which action actually helped or hurt. | One lead, explicit assignments, and written updates. |
| No early severity call | The wrong people get paged too late, or nobody owns comms. | Set a provisional severity and revise later if needed. |
| Skipping evidence capture entirely | Post-incident review becomes guesswork and folklore. | Save the obvious logs, config, and version markers early. |
| Giving bare updates like "still investigating" with no impact, action, or next update time | Stakeholders lose confidence and responders get interrupted for status. | Always include impact, current action, and next update time, even when there is no new ETA. |
Related pages: Observability Overview, SLOs & On-Call, Generic, Service Troubleshooting, Backup & Restore.