It's 3AM.
You get paged.
Something is broken.
You don't yet know what.
It feels random in the moment. It usually isn't.
The problem is rarely a lack of tools.
It's how slowly clarity arrives in the first 15 minutes.
This runbook is for that window.
Most incidents don't escalate because they are complex.
They escalate because the first 15 minutes are wasted.
- Nobody knows what is actually broken.
- Everyone opens different dashboards.
- Recent changes are checked too late.
- People debug before stabilizing.
The first 15 minutes
Acknowledge
- Ack the alert
- Stop alert storms if possible
- Take ownership
Answer 3 questions fast
- What is broken?
- Who is impacted?
- Since when?
If you can't answer these yet, don't deep-dive.
Check recent changes
- Deploys (last 30-60 min or latest; sketch below)
- Config changes
- Feature flags
- Infra events
- Third-party changes (CMS, payments, etc.)
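If your stack runs on Git and Kubernetes (an assumption, not a requirement), a minimal sketch for pulling the last hour of changes into one terminal:

```python
# Minimal sketch: surface the last hour of changes in one place.
# Assumes a Git checkout and a Kubernetes deploy target; names are placeholders.
import subprocess

def recent_changes(deployment: str = "payment-service") -> None:
    # Commits from the last hour.
    subprocess.run(
        ["git", "log", "--since=60 minutes ago", "--oneline"], check=False
    )
    # Recent rollout revisions for the suspect deployment.
    subprocess.run(
        ["kubectl", "rollout", "history", f"deployment/{deployment}"], check=False
    )

if __name__ == "__main__":
    recent_changes()
```

Whatever your stack is, the point is the same: one command, one screen, all recent changes.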
Stabilize before diagnosing
- Roll back the deploy (sketch below)
- Scale up
- Disable the feature
- Route traffic away
- Revert the offending change
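A minimal stabilization sketch, again assuming Kubernetes; the service name is a placeholder:

```python
# Two fast stabilizers on a Kubernetes stack (an assumption, not a prescription):
# roll back first, scale if you need headroom while diagnosing.
import subprocess

def rollback(deployment: str) -> None:
    # Revert to the previous rollout revision.
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}"], check=True
    )

def scale(deployment: str, replicas: int) -> None:
    # Buy headroom; you can right-size later.
    subprocess.run(
        ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}"],
        check=True,
    )

# Example (hypothetical service name):
# rollback("payment-service")
# scale("payment-service", 10)
```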
At 3AM, speed matters more than perfection.
Narrow scope
- One service or many?
- One region or global?
- One endpoint or all traffic? (tally sketch below)
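A minimal tally sketch for structured JSON logs, to answer all three scope questions from whatever log stream you can reach (the field names are assumptions):

```python
# Minimal sketch: tally recent errors by service, region, and endpoint.
# Assumes JSON-lines logs like {"service": ..., "region": ..., "path": ..., "level": ...}.
import json
import sys
from collections import Counter

def tally(stream) -> Counter:
    counts: Counter = Counter()
    for line in stream:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if event.get("level") != "error":
            continue
        counts[(event.get("service"), event.get("region"), event.get("path"))] += 1
    return counts

if __name__ == "__main__":
    for key, n in tally(sys.stdin).most_common(10):
        print(n, key)
```

Pipe any JSON-lines log into it; the top offenders tell you whether the blast radius is one endpoint or everything.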
Triage patterns
Error spike after deploy
- Likely: bad release
- Action: roll back
- Verify: errors drop
Latency increase (no errors)
- Likely: slow dependency or database
- Check: p95/p99, downstream calls (percentile sketch below)
- Action: isolate the slow path
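If you have raw latency samples but no dashboard handy, the standard library gets you p95/p99 in a few lines:

```python
# Minimal sketch: p95/p99 from raw latency samples (milliseconds).
# The sample data is a placeholder.
import statistics

samples = [120, 130, 115, 900, 125, 118, 122, 2100, 119, 127] * 10

# quantiles(n=100) returns the 99 cut points between percentiles 1..99.
cuts = statistics.quantiles(samples, n=100)
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f}ms p99={p99:.0f}ms")
```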
CPU / memory saturation
- Likely: traffic spike or inefficient code
- Check: recent deploy, load pattern
- Action: scale or roll back
More triage patterns
Dependency timeout
- Likely: downstream service degraded
- Action: fail fast, reduce load, isolate (timeout sketch below)
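A minimal fail-fast sketch: hard timeouts plus a crude circuit breaker. The URL and thresholds are placeholders:

```python
# Fail fast instead of stacking up slow calls behind a degraded dependency.
import time
import requests

FAILURE_LIMIT = 5       # consecutive failures before opening the circuit
COOL_OFF_SECONDS = 30   # how long to fail fast before retrying

_failures = 0
_opened_at = 0.0

def call_downstream(url):
    global _failures, _opened_at
    # Circuit open: return immediately rather than waiting on a dead service.
    if _failures >= FAILURE_LIMIT and time.time() - _opened_at < COOL_OFF_SECONDS:
        return None
    try:
        # (connect timeout, read timeout) - never wait indefinitely.
        resp = requests.get(url, timeout=(3.05, 5))
        resp.raise_for_status()
        _failures = 0
        return resp.json()
    except requests.RequestException:
        _failures += 1
        _opened_at = time.time()
        return None
```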
Database connection exhaustion
- Likely: connection leak or spike
- Check: active connections vs the limit (query sketch below)
- Action: restart / increase pool (temporary)
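For Postgres (an assumption; adapt the queries to your database), checking active connections against the limit is two statements:

```python
# Minimal check sketch: active Postgres connections vs the server limit.
# The DSN is a placeholder.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:pass@db-host/app")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    active = cur.fetchone()[0]
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
print(f"{active}/{limit} connections in use")
conn.close()
```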
Queue backlog
- Likely: consumers stuck or slow
- Check: worker health, processing rate (drain-time sketch below)
- Action: scale consumers / fix the blocker
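A minimal drain-time estimate, assuming a Redis list as the queue (queue name and sampling window are placeholders):

```python
# Sample the backlog twice, compute the drain rate, estimate time to empty.
import time
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

before = r.llen("jobs")
time.sleep(30)                  # sampling window
after = r.llen("jobs")

rate = (before - after) / 30    # jobs drained per second
if rate <= 0:
    print("Backlog is flat or growing - check consumer health first.")
else:
    print(f"~{after / rate / 60:.0f} minutes to drain at the current rate")
```

A backlog that is shrinking is a capacity problem; one that is flat or growing is a stuck-consumer problem. The two need different actions.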
CDN / cache issue
- Likely: cache miss or config change
- Check: hit ratio, origin load (ratio sketch below)
- Action: restore caching / reduce origin pressure
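The hit-ratio check is plain arithmetic; the counters below are placeholders for whatever your CDN or cache exports:

```python
# Cache hit ratio from raw counters. A sudden drop points at cache config;
# every lost point of hit ratio lands directly on the origin.
hits, misses = 94_200, 31_800

ratio = hits / (hits + misses)
print(f"hit ratio: {ratio:.1%}")   # 74.8% with these placeholder numbers

origin_requests = misses           # what the origin now has to absorb
print(f"origin absorbing {origin_requests} requests")
```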
Incident communication
Current impact: <who/what is affected>
Suspected area: <service / dependency / unknown>
Action in progress: <what you're doing>
Next update: <time>
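If a bot or script posts your updates, a minimal sketch that renders the same template (the values mirror the example below):

```python
# Render the four-field status update above; standard library only.
def status_update(impact: str, suspect: str, action: str, next_update: str) -> str:
    return (
        f"Current impact: {impact}\n"
        f"Suspected area: {suspect}\n"
        f"Action in progress: {action}\n"
        f"Next update: {next_update}"
    )

print(status_update(
    "Checkout failures for EU users",
    "Recent payment service deploy",
    "Rolling back",
    "10 minutes",
))
```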
Example:
Current impact: Checkout failures for EU users
Suspected area: Recent payment service deploy
Action in progress: Rolling back
Next update: 10 minutes

What not to do at 3AM
Most incidents get worse here.
- Don't debug everything at once
- Don't chase multiple theories
- Don't ignore recent changes
- Don't restart blindly
- Don't add people without roles
- Don't optimize before stabilizing
- Don't assume the alert is correct
Before your next incident
Check whether your alerts are still trustworthy.
Calculate alert fatigue

Closing
At 3AM, you don't need perfect answers.
You need:
- fast clarity
- simple decisions
- reduced impact
Root cause can wait. Impact can't.
If your alerts feel noisy, measure it.
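One crude way to measure it, assuming you can tag past alerts as actionable or not (the alert list is a placeholder):

```python
# What fraction of recent alerts were noise? Export your own alert history.
alerts = [
    {"name": "checkout-5xx", "actionable": True},
    {"name": "disk-70-percent", "actionable": False},
    {"name": "cpu-blip", "actionable": False},
    {"name": "payment-latency", "actionable": True},
]

actionable = sum(a["actionable"] for a in alerts)
noise = 1 - actionable / len(alerts)
print(f"{noise:.0%} of alerts were noise")  # 50% here - each one trains people to ignore the pager
```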
The runbook helps during the incident. The calculator helps you see whether your alerting system is training people to ignore the next one.
Open the Alert Fatigue Calculator ->