Incident: First 15 Minutes

The first quarter-hour determines whether an incident stays small or becomes chaos. Triage the impact, freeze risky change, reduce harm, and keep a clear record.

First-15 priorities
  • Stabilize before you optimize. Stop the damage first, root-cause later.
  • Freeze ongoing changes until they are proven unrelated.
  • Facts beat guesses. Capture timestamps, symptoms, and scope early.
  • Keep a comms cadence even if the update is "investigating, no new ETA yet".
  • Preserve evidence before rebooting services, clearing queues, or deleting artifacts.

Declare the incident and set severity

The first question is not "what is the root cause?" It is "how bad is the impact right now?" Severity determines who gets pulled in and how often you update.

QuestionWhy it matters
Who is affected? One team, one region, one tenant, or all users?
What is broken? Login, data writes, one API, the whole site, background processing?
Is there active data loss or security risk? This can change the incident from degraded service to emergency containment.
Is a recent change the likely trigger? If yes, rollback may be the fastest path to stability.
Incident header
- Incident ID: INC-2026-041
- Start time: 14:07 UTC
- Severity: SEV2
- Impact: all user logins failing in production
- Suspected trigger: auth stack rollout at 13:58 UTC
- Incident lead: alice
- Next update: 14:15 UTC

If you already use service-level objectives, match the severity language to the guide in SLOs & On-Call. Consistency matters more than perfection in the first few minutes.

Freeze risky change

When prod is unstable, stop adding new variables. Freeze deploy pipelines, manual changes, schema migrations, and unrelated rollouts until you know what you are dealing with.

[FREEZE][INC-2026-041] 14:09 UTC
All unrelated production deploys paused.
Current focus: restore authentication.
Any manual host changes must be coordinated in the incident channel.

If the incident follows a planned rollout, jump straight to the prewritten rollback or stop conditions from Change Window Runbook instead of improvising a new plan under stress.

Gather facts fast

Gather enough evidence to choose the next action. Do not try to read the whole platform in minute five. Aim to answer four questions: what changed, what is failing, how broad is it, and what still works.

date -Is
hostname -f
uptime
systemctl --failed
journalctl -p err --since "-15 min" --no-pager | tail -50
ss -lntup
df -h
df -i
curl -fsS https://app.example.com/health || true

Then add the service-specific checks that match the blast radius:

Use Service Troubleshooting and Generic as the next branch once you know which domain is actually failing.

Stop the bleeding

The question here is not "what is elegant?" It is "what safely reduces user harm fastest?" Common options:

Prefer reversible mitigations first. A quick rollback, traffic drain, or config toggle is usually safer than editing three hosts by hand while the outage clock is running.
Mitigation order
1. Revert or disable the suspected bad change
2. Remove obviously bad instances from rotation
3. Confirm core user path recovers
4. Keep the freeze in place until stability is proven

If the mitigation depends on backups, replica health, or DR tooling, use the shortest trusted path described in Backup & Restore rather than inventing a one-off recovery sequence.

Comms cadence and escalation

Good incident comms are short, factual, and time-bound. Do not wait for certainty before sending the first update.

[SEV2][INC-2026-041] 14:15 UTC
Impact: production login failures for most users.
Current action: rollback of auth rollout in progress.
Observed: health endpoint still up; login endpoint returning 500.
Next update: 14:25 UTC.

Escalate early when any of these are true:

Keep updates on a fixed cadence, often every 10 to 15 minutes, until the service is clearly stable. Use dashboards from Observability Overview to keep the status factual instead of narrative.

Evidence handling

Evidence is fragile. Reboots, log rotation, autoscaling, and queue cleanup can erase the clues you need later. Save what matters before heavy remediation if doing so does not delay an urgent mitigation.

STAMP=$(date +%Y%m%d-%H%M%S)
DIR="/var/tmp/incident-$STAMP"
mkdir -p "$DIR"

journalctl --since "-30 min" --no-pager > "$DIR/journal.txt"
systemctl status myapp -l --no-pager > "$DIR/systemctl.txt"
ss -lntup > "$DIR/sockets.txt"
cp /etc/myapp/config.yml "$DIR/" 2>/dev/null || true

Also record:

This material becomes the backbone of the post-incident review and keeps the next response from repeating the same guesswork.

Common first-15 mistakes

MistakeWhy it hurtsBetter move
Chasing root cause before mitigating impact Users stay broken while responders debate theory. Stabilize first, then investigate deeply.
Letting deploys continue during the incident New changes hide the original trigger and expand the blast radius. Freeze change until the service is stable.
Multiple people making untracked host changes No one knows which action actually helped or hurt. One lead, explicit assignments, and written updates.
No early severity call The wrong people get paged too late, or nobody owns comms. Set a provisional severity and revise later if needed.
Skipping evidence capture entirely Post-incident review becomes guesswork and folklore. Save the obvious logs, config, and version markers early.
Giving vague updates like "still investigating" Stakeholders lose confidence and responders get interrupted for status. Always include impact, current action, and next update time.

Related pages: Observability Overview, SLOs & On-Call, Generic, Service Troubleshooting, Backup & Restore.