It's 3AM.
You get paged.
Something is broken.
You don't yet know what.
It feels random in the moment. It usually isn't.
The problem is rarely a lack of tools.
It's how slowly clarity arrives in the first 15 minutes.
This runbook is for that window.
Most incidents don't escalate because they are complex.
They escalate because the first 15 minutes are wasted.
- Nobody knows what is actually broken.
- Everyone opens different dashboards.
- Recent changes are checked too late.
- People debug before stabilizing.
The first 15 minutes
Acknowledge
- Ack the alert
- Stop alert storms if possible
- Take ownership
Answer 3 questions fast
- What is broken?
- Who is impacted?
- Since when?
If you can't answer these yet, don't deep-dive.
Check recent changes
- Deploys (last 30-60 min or latest; sketch below)
- Config changes
- Feature flags
- Infra events
- Third-party changes (CMS, payments, etc.)
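If your stack runs on Git and Kubernetes (an assumption, not a requirement), a minimal sketch for pulling the last hour of changes into one terminal:

```python
# Minimal sketch: surface the last hour of changes in one place.
# Assumes a Git checkout and a Kubernetes deploy target; names are placeholders.
import subprocess

def recent_changes(deployment: str = "payment-service") -> None:
    # Commits from the last hour.
    subprocess.run(
        ["git", "log", "--since=60 minutes ago", "--oneline"], check=False
    )
    # Recent rollout revisions for the suspect deployment.
    subprocess.run(
        ["kubectl", "rollout", "history", f"deployment/{deployment}"], check=False
    )

if __name__ == "__main__":
    recent_changes()
```

Whatever your stack is, the point is the same: one command, one screen, all recent changes.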
Stabilize before diagnosing
- Roll back the deploy (sketch below)
- Scale up
- Disable the feature
- Route traffic away
- Revert the offending change
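A minimal stabilization sketch, again assuming Kubernetes; the service name is a placeholder:

```python
# Two fast stabilizers on a Kubernetes stack (an assumption, not a prescription):
# roll back first, scale if you need headroom while diagnosing.
import subprocess

def rollback(deployment: str) -> None:
    # Revert to the previous rollout revision.
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}"], check=True
    )

def scale(deployment: str, replicas: int) -> None:
    # Buy headroom; you can right-size later.
    subprocess.run(
        ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}"],
        check=True,
    )

# Example (hypothetical service name):
# rollback("payment-service")
# scale("payment-service", 10)
```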
At 3AM, speed matters more than perfection.
Narrow scope
- One service or many?
- One region or global?
- One endpoint or all traffic? (tally sketch below)
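A minimal tally sketch for structured JSON logs, to answer all three scope questions from whatever log stream you can reach (the field names are assumptions):

```python
# Minimal sketch: tally recent errors by service, region, and endpoint.
# Assumes JSON-lines logs like {"service": ..., "region": ..., "path": ..., "level": ...}.
import json
import sys
from collections import Counter

def tally(stream) -> Counter:
    counts: Counter = Counter()
    for line in stream:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if event.get("level") != "error":
            continue
        counts[(event.get("service"), event.get("region"), event.get("path"))] += 1
    return counts

if __name__ == "__main__":
    for key, n in tally(sys.stdin).most_common(10):
        print(n, key)
```

Pipe any JSON-lines log into it; the top offenders tell you whether the blast radius is one endpoint or everything.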
Triage patterns
Error spike after deploy
- Likely: bad release
- Action: roll back
- Verify: errors drop
Latency increase (no errors)
- Likely: slow dependency or database
- Check: p95/p99, downstream calls (percentile sketch below)
- Action: isolate the slow path
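If you have raw latency samples but no dashboard handy, the standard library gets you p95/p99 in a few lines:

```python
# Minimal sketch: p95/p99 from raw latency samples (milliseconds).
# The sample data is a placeholder.
import statistics

samples = [120, 130, 115, 900, 125, 118, 122, 2100, 119, 127] * 10

# quantiles(n=100) returns the 99 cut points between percentiles 1..99.
cuts = statistics.quantiles(samples, n=100)
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.0f}ms p99={p99:.0f}ms")
```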
CPU / memory saturation
- Likely: traffic spike or inefficient code
- Check: recent deploy, load pattern
- Action: scale or roll back
More triage patterns
Dependency timeout
- Likely: downstream service degraded
- Action: fail fast, reduce load, isolate (timeout sketch below)
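A minimal fail-fast sketch: hard timeouts plus a crude circuit breaker. The URL and thresholds are placeholders:

```python
# Fail fast instead of stacking up slow calls behind a degraded dependency.
import time
import requests

FAILURE_LIMIT = 5       # consecutive failures before opening the circuit
COOL_OFF_SECONDS = 30   # how long to fail fast before retrying

_failures = 0
_opened_at = 0.0

def call_downstream(url):
    global _failures, _opened_at
    # Circuit open: return immediately rather than waiting on a dead service.
    if _failures >= FAILURE_LIMIT and time.time() - _opened_at < COOL_OFF_SECONDS:
        return None
    try:
        # (connect timeout, read timeout) - never wait indefinitely.
        resp = requests.get(url, timeout=(3.05, 5))
        resp.raise_for_status()
        _failures = 0
        return resp.json()
    except requests.RequestException:
        _failures += 1
        _opened_at = time.time()
        return None
```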
Database connection exhaustion
- Likely: connection leak or spike
- Check: active connections vs the limit (query sketch below)
- Action: restart / increase pool (temporary)
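For Postgres (an assumption; adapt the queries to your database), checking active connections against the limit is two statements:

```python
# Minimal check sketch: active Postgres connections vs the server limit.
# The DSN is a placeholder.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:pass@db-host/app")
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    active = cur.fetchone()[0]
    cur.execute("SHOW max_connections;")
    limit = int(cur.fetchone()[0])
print(f"{active}/{limit} connections in use")
conn.close()
```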
Queue backlog
- Likely: consumers stuck or slow
- Check: worker health, processing rate (drain-time sketch below)
- Action: scale consumers / fix the blocker
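A minimal drain-time estimate, assuming a Redis list as the queue (queue name and sampling window are placeholders):

```python
# Sample the backlog twice, compute the drain rate, estimate time to empty.
import time
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

before = r.llen("jobs")
time.sleep(30)                  # sampling window
after = r.llen("jobs")

rate = (before - after) / 30    # jobs drained per second
if rate <= 0:
    print("Backlog is flat or growing - check consumer health first.")
else:
    print(f"~{after / rate / 60:.0f} minutes to drain at the current rate")
```

A backlog that is shrinking is a capacity problem; one that is flat or growing is a stuck-consumer problem. The two need different actions.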
CDN / cache issue
- Likely: cache miss or config change
- Check: hit ratio, origin load (ratio sketch below)
- Action: restore caching / reduce origin pressure
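The hit-ratio check is plain arithmetic; the counters below are placeholders for whatever your CDN or cache exports:

```python
# Cache hit ratio from raw counters. A sudden drop points at cache config;
# every lost point of hit ratio lands directly on the origin.
hits, misses = 94_200, 31_800

ratio = hits / (hits + misses)
print(f"hit ratio: {ratio:.1%}")   # 74.8% with these placeholder numbers

origin_requests = misses           # what the origin now has to absorb
print(f"origin absorbing {origin_requests} requests")
```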
Incident communication
Current impact: <who/what is affected>
Suspected area: <service / dependency / unknown>
Action in progress: <what you're doing>
Next update: <time>
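If a bot or script posts your updates, a minimal sketch that renders the same template (the values mirror the example below):

```python
# Render the four-field status update above; standard library only.
def status_update(impact: str, suspect: str, action: str, next_update: str) -> str:
    return (
        f"Current impact: {impact}\n"
        f"Suspected area: {suspect}\n"
        f"Action in progress: {action}\n"
        f"Next update: {next_update}"
    )

print(status_update(
    "Checkout failures for EU users",
    "Recent payment service deploy",
    "Rolling back",
    "10 minutes",
))
```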
Example:
Current impact: Checkout failures for EU users
Suspected area: Recent payment service deploy
Action in progress: Rolling back
Next update: 10 minutes

What not to do at 3AM
Most incidents get worse here.
- Don't debug everything at once
- Don't chase multiple theories
- Don't ignore recent changes
- Don't restart blindly
- Don't add people without roles
- Don't optimize before stabilizing
- Don't assume the alert is correct
Before your next incident
Check whether your alerts are still trustworthy.
Calculate alert fatigue

Closing
At 3AM, you don't need perfect answers.
You need:
- fast clarity
- simple decisions
- reduced impact
Root cause can wait. Impact can't.
If your alerts feel noisy, measure it.
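One crude way to measure it, assuming you can tag past alerts as actionable or not (the alert list is a placeholder):

```python
# What fraction of recent alerts were noise? Export your own alert history.
alerts = [
    {"name": "checkout-5xx", "actionable": True},
    {"name": "disk-70-percent", "actionable": False},
    {"name": "cpu-blip", "actionable": False},
    {"name": "payment-latency", "actionable": True},
]

actionable = sum(a["actionable"] for a in alerts)
noise = 1 - actionable / len(alerts)
print(f"{noise:.0%} of alerts were noise")  # 50% here - each one trains people to ignore the pager
```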
The runbook helps during the incident. The calculator helps you see whether your alerting system is training people to ignore the next one.
Open the Alert Fatigue Calculator ->