Incident: First 15 Minutes

The first quarter-hour determines whether an incident stays small or becomes chaos. Triage the impact, freeze risky change, reduce harm, and keep a clear record.

First-15 priorities

Stabilize before you optimize. Stop the damage first, root-cause later.
Freeze ongoing changes until they are proven unrelated.
Facts beat guesses. Capture timestamps, symptoms, and scope early.
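Keep a comms cadence even when there is no news. "Still investigating, no ETA yet" is acceptable as long as the update also states impact, current action, and the next update time.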
Preserve evidence before rebooting services, clearing queues, or deleting artifacts.

On this page

Declare the incident and set severity
Freeze risky change
Gather facts fast
Stop the bleeding
Comms cadence and escalation
Evidence handling
Common first-15 mistakes

Declare the incident and set severity

The first question is not "what is the root cause?" It is "how bad is the impact right now?" Severity determines who gets pulled in and how often you update.

Question	Why it matters
Who is affected?	Scope drives severity: one team, one region, one tenant, or all users.
What is broken?	Naming the failing path (login, data writes, one API, the whole site, background processing) tells you which owners to pull in.
Is there active data loss or security risk?	This can change the incident from degraded service to emergency containment.
Is a recent change the likely trigger?	If yes, rollback may be the fastest path to stability.

Incident header
- Incident ID: INC-2026-041
- Start time: 14:07 UTC
- Severity: SEV2
- Impact: all user logins failing in production
- Suspected trigger: auth stack rollout at 13:58 UTC
- Incident lead: alice
- Next update: 14:15 UTC

If you already use service-level objectives, match the severity language to the guide in SLOs & On-Call. Consistency matters more than perfection in the first few minutes.

Freeze risky change

When prod is unstable, stop adding new variables. Freeze deploy pipelines, manual changes, schema migrations, and unrelated rollouts until you know what you are dealing with.

- pause or disable automated deploy jobs if they are still running
- tell responders not to make "quick fixes" on random hosts without coordination
- identify the last known-good version, image, or config revision immediately

[FREEZE][INC-2026-041] 14:09 UTC
All unrelated production deploys paused.
Current focus: restore authentication.
Any manual host changes must be coordinated in the incident channel.

If the incident follows a planned rollout, jump straight to the prewritten rollback or stop conditions from Change Window Runbook instead of improvising a new plan under stress.
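
One way to make a freeze enforceable rather than advisory is a sentinel file that deploy scripts check before they run. The sketch below is hypothetical: the `FREEZE_FILE` path and the `freeze_on`, `freeze_off`, and `deploy_allowed` names are assumptions made for illustration, not features of any particular CI system. A real pipeline would more likely use its own pause mechanism.

```shell
#!/usr/bin/env bash
# Hypothetical freeze guard: deploy scripts call deploy_allowed before acting.
# FREEZE_FILE is an assumed path; point it somewhere every deploy job can read.
FREEZE_FILE="${FREEZE_FILE:-/var/tmp/deploy-freeze}"

# Record which incident froze deploys, who did it, and when (UTC).
freeze_on() {
  local incident="$1" who="$2"
  printf '%s frozen by %s at %s\n' \
    "$incident" "$who" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$FREEZE_FILE"
}

# Lift the freeze once stability is proven.
freeze_off() {
  rm -f "$FREEZE_FILE"
}

# Return 0 if deploys may proceed, 1 (with the reason on stderr) if frozen.
deploy_allowed() {
  if [ -e "$FREEZE_FILE" ]; then
    echo "deploys frozen: $(cat "$FREEZE_FILE")" >&2
    return 1
  fi
  return 0
}
```

A deploy script would then start with something like `deploy_allowed || exit 1`, so a forgotten pipeline fails loudly instead of shipping a change mid-incident.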

Gather facts fast

Gather enough evidence to choose the next action. Do not try to read the whole platform in minute five. Aim to answer four questions: what changed, what is failing, how broad is it, and what still works.

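date -u -Is                          # UTC timestamp, to match the incident notes
hostname -f                          # which host these facts came from
uptime                               # load average and time since last boot
systemctl --failed                   # units currently in a failed state
journalctl -p err --since "-15 min" --no-pager | tail -50   # errors and worse, last 15 minutes
ss -lntup                            # listening TCP/UDP sockets and their owning processes
df -h                                # disk space
df -i                                # inode exhaustion
curl -fsS --max-time 5 https://app.example.com/health || true   # health check; bounded so it cannot hang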
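
A single hung command (for example `df` against a stalled network mount) can cost minutes you do not have. The following is a minimal sketch of a wrapper that bounds each check and labels its output, assuming the coreutils `timeout` command is installed. `run_check` is a hypothetical helper name, not part of any standard tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one triage command with a time limit and a labeled
# header, so the output can be pasted into the incident channel as-is.
# Usage: run_check SECONDS COMMAND [ARGS...]
run_check() {
  local limit="$1"; shift
  local rc=0
  printf '=== %s ===\n' "$*"
  # GNU timeout exits with status 124 when the limit is reached.
  timeout "$limit" "$@" 2>&1 || rc=$?
  if [ "$rc" -eq 124 ]; then
    printf '(timed out after %ss)\n' "$limit"
  elif [ "$rc" -ne 0 ]; then
    printf '(exit status %s)\n' "$rc"
  fi
  return 0   # a failing check is a finding, not a reason to stop triage
}
```

For instance, `run_check 5 df -h` prints a header line followed by the command's output, and notes a timeout or non-zero exit status instead of stalling the rest of the checks.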

Then add the service-specific checks that match the blast radius:

- web outage: health endpoints, load balancer status, TLS, recent deploy
- auth outage: IdP status, token issuance, SSO path, Kerberos or directory health
- database incident: connection count, replication lag, error rate, write path
- mail or queue issue: queue depth, relay health, consumers stalled or not

Use Service Troubleshooting and Generic as the next branch once you know which domain is actually failing.

Stop the bleeding

The question here is not "what is elegant?" It is "what safely reduces user harm fastest?" Common options:

- roll back the last deploy
- drain or remove a bad node from service
- disable a broken feature flag or WAF rule
- fail over to a healthy replica or secondary path
- rate-limit, shed traffic, or put up a maintenance page to protect data integrity

Prefer reversible mitigations first. A quick rollback, traffic drain, or config toggle is usually safer than editing three hosts by hand while the outage clock is running.

Mitigation order
1. Revert or disable the suspected bad change
2. Remove obviously bad instances from rotation
3. Confirm core user path recovers
4. Keep the freeze in place until stability is proven

If the mitigation depends on backups, replica health, or DR tooling, use the shortest trusted path described in Backup & Restore rather than inventing a one-off recovery sequence.
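
Confirming that the core user path has recovered is easier to trust when "recovered" means several consecutive successes rather than one lucky request. The sketch below is one way to express that under stated assumptions: `wait_stable` is a hypothetical helper, and the thresholds are illustrative rather than recommended values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only after a probe command passes NEED times in
# a row; give up after MAX attempts. Any failure resets the streak to zero.
# Usage: wait_stable NEED MAX PAUSE_SECONDS COMMAND [ARGS...]
wait_stable() {
  local need="$1" max="$2" pause="$3"; shift 3
  local streak=0 attempt=0
  while [ "$attempt" -lt "$max" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$need" ]; then
        return 0
      fi
    else
      streak=0
    fi
    sleep "$pause"
  done
  return 1
}
```

A possible call is `wait_stable 5 30 2 curl -fsS --max-time 5 https://app.example.com/login`, where the login URL is a placeholder for whatever request represents the core user path. Probing the user path itself matters because a health endpoint can stay green while logins fail.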

Comms cadence and escalation

Good incident comms are short, factual, and time-bound. Do not wait for certainty before sending the first update.

[SEV2][INC-2026-041] 14:15 UTC
Impact: production login failures for most users.
Current action: rollback of auth rollout in progress.
Observed: health endpoint still up; login endpoint returning 500.
Next update: 14:25 UTC.

Escalate early when any of these are true:

- customer-facing impact is broad or sustained
- data integrity or security may be involved
- the primary responder lacks the access or expertise to execute the likely fix
- the service owner or platform dependency team is not already engaged

Keep updates on a fixed cadence, often every 10 to 15 minutes, until the service is clearly stable. Use dashboards from Observability Overview to keep the status factual instead of narrative.
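
The update format shown earlier in this section is simple enough to template, which also makes it harder to send an update with a field missing. This is a minimal sketch only: `status_update` is a hypothetical helper, and the `date -u -d "@EPOCH"` form assumes GNU or BusyBox `date`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a status update and refuse to emit one that lacks
# impact, current action, or observations.
# Usage: status_update SEV ID CADENCE_MIN IMPACT ACTION OBSERVED [NOW_EPOCH]
status_update() {
  local sev="$1" id="$2" cadence="$3" impact="$4" action="$5" observed="$6"
  local now="${7:-$(date -u +%s)}"   # optional epoch override, handy for testing
  if [ -z "$impact" ] || [ -z "$action" ] || [ -z "$observed" ]; then
    echo "status_update: impact, action, and observed are all required" >&2
    return 1
  fi
  # The next update time is derived from the cadence, so it is never omitted.
  local next=$((now + cadence * 60))
  printf '[%s][%s] %s UTC\n' "$sev" "$id" "$(date -u -d "@$now" +%H:%M)"
  printf 'Impact: %s\n' "$impact"
  printf 'Current action: %s\n' "$action"
  printf 'Observed: %s\n' "$observed"
  printf 'Next update: %s UTC.\n' "$(date -u -d "@$next" +%H:%M)"
}
```

Deriving the next update time from the cadence, instead of typing it by hand, keeps the promise in the message consistent with the schedule the incident lead actually intends to keep.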

Evidence handling

Evidence is fragile. Reboots, log rotation, autoscaling, and queue cleanup can erase the clues you need later. Save what matters before heavy remediation, as long as doing so does not delay an urgent mitigation.

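umask 077                                # captured logs and config may contain secrets
STAMP=$(date -u +%Y%m%d-%H%M%S)          # UTC, to match the incident timestamps
DIR="/var/tmp/incident-$STAMP"
mkdir -p "$DIR"

journalctl --since "-30 min" --no-pager > "$DIR/journal.txt"         # all priorities, last 30 minutes
systemctl status myapp -l --no-pager > "$DIR/systemctl.txt" 2>&1     # unit state plus recent log lines
ss -lntup > "$DIR/sockets.txt"                                       # listening sockets before any restart
cp -p /etc/myapp/config.yml "$DIR/" 2>/dev/null || true              # keep mtime; skip quietly if absent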
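
If the captured files may later feed a security or compliance review, a checksum manifest makes truncation or tampering detectable. The sketch below assumes GNU `find`, `sort`, `xargs`, `sha256sum`, and `tar` are available. `seal_evidence` is a hypothetical helper, not part of the capture commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a SHA-256 manifest for every file in an evidence
# directory, then pack the directory into a tarball next to it.
# Usage: seal_evidence DIR   (prints the path of the archive it created)
seal_evidence() {
  local dir="${1%/}"
  [ -d "$dir" ] || { echo "seal_evidence: not a directory: $dir" >&2; return 1; }
  (
    cd "$dir" || exit 1
    # Hash regular files only; skip a manifest left by an earlier run.
    find . -type f ! -name SHA256SUMS -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS
  ) || return 1
  tar -czf "$dir.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
  echo "$dir.tar.gz"
}
```

Running `sha256sum -c SHA256SUMS` inside the unpacked directory later verifies that nothing changed since capture. Copying the archive off the affected host is a separate step that this sketch does not cover.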

Also record:

- exact timestamps of symptoms and mitigations
- commit SHA, image ID, package version, or config revision involved
- screenshots or exported graphs showing impact and recovery
- which actions were taken, by whom, and in what order

This material becomes the backbone of the post-incident review and keeps the next response from repeating the same guesswork.
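
Recording which actions were taken, by whom, and in what order works best when it costs one command. The following is a minimal sketch of an append-only timeline. `log_action` and the `INCIDENT_LOG` path are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one tab-separated line per action to a timeline
# file: UTC timestamp, actor, free-text description.
# INCIDENT_LOG is an assumed path; override it per incident.
INCIDENT_LOG="${INCIDENT_LOG:-/var/tmp/incident-timeline.tsv}"

# Usage: log_action WHO DESCRIPTION...
log_action() {
  local who="$1"; shift
  printf '%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$who" "$*" >> "$INCIDENT_LOG"
}
```

For example, `log_action alice "paused unrelated production deploys"` appends one line. Because entries are only ever appended, the file order is the order of events, which is what the post-incident review needs.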

Common first-15 mistakes

Mistake	Why it hurts	Better move
Chasing root cause before mitigating impact	Users stay broken while responders debate theory.	Stabilize first, then investigate deeply.
Letting deploys continue during the incident	New changes hide the original trigger and expand the blast radius.	Freeze change until the service is stable.
Multiple people making untracked host changes	No one knows which action actually helped or hurt.	One lead, explicit assignments, and written updates.
No early severity call	The wrong people get paged too late, or nobody owns comms.	Set a provisional severity and revise later if needed.
Skipping evidence capture entirely	Post-incident review becomes guesswork and folklore.	Save the obvious logs, config, and version markers early.
Giving vague updates such as a bare "still investigating"	Stakeholders lose confidence and responders get interrupted for status.	Always include impact, current action, and next update time.