During an outage, nobody wants to design a process — they want to follow one. Predefine the first 30 minutes as fixed phases: detect and confirm (0-5 min), activate fallback (5-15 min), verify against gates (15-25 min), and communicate and stabilize (25-30 min). Write it down before you need it.
A Runbook Is Not the Same as a Drill
Running a fallback drill answers a different question than this runbook does. A drill proves the fallback path works when you rehearse it on your own schedule. A runbook tells a stressed, half-awake engineer exactly what to do when the primary model is failing right now and users are already affected.
Teams that only have a drill often freeze during the real thing anyway, because the drill lives in a doc nobody re-reads during an incident. A runbook is written to be followed under pressure: short, sequential, and free of judgment calls that could have been made in advance.
The First 30 Minutes, Phase by Phase
0-5 minutes: Detect and confirm
- Confirm the alert against a real signal (error rate, timeout rate, or provider status page) — do not act on a single failed request.
- Name an incident owner. One person drives the next 25 minutes; everyone else feeds them information.
- Open an incident channel and post the trigger condition that was crossed, with a timestamp.
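The "not a single failed request" rule can be written down as a check, so the trigger is not re-argued live. The following is a minimal sketch, not a definitive implementation: the `WindowStats` shape, the function name, and every threshold value are illustrative assumptions, and it assumes your metrics system can give you request, error, and timeout counts for a recent window.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts over a recent window (hypothetical shape, e.g. last 5 minutes)."""
    total: int
    errors: int
    timeouts: int


def trigger_crossed(stats: WindowStats,
                    min_requests: int = 50,          # assumed value, tune per service
                    max_error_rate: float = 0.05,    # assumed value
                    max_timeout_rate: float = 0.10   # assumed value
                    ) -> bool:
    """Return True only when a rate threshold is crossed on enough traffic."""
    if stats.total < min_requests:
        # Too little traffic to tell an outage from one failed request.
        return False
    error_rate = stats.errors / stats.total
    timeout_rate = stats.timeouts / stats.total
    return error_rate > max_error_rate or timeout_rate > max_timeout_rate


# One failed request on its own does not confirm the alert.
print(trigger_crossed(WindowStats(total=1, errors=1, timeouts=0)))
# A sustained error rate on real traffic does.
print(trigger_crossed(WindowStats(total=200, errors=30, timeouts=0)))
```

The minimum-request floor is what keeps a quiet period with one unlucky failure from paging the team; the provider status page remains a separate, manual signal.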
5-15 minutes: Activate fallback
- Flip the predefined feature flag or config switch — do not hand-edit code during an incident.
- Confirm traffic is actually routing to the fallback model, not just that the flag is set.
- Post the model pair in the incident channel: primary down, fallback active, timestamp.
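Confirming that traffic is routing, rather than that the flag is set, means looking at what actually served recent requests. A rough sketch of that check follows, assuming each response log entry records the model that served it; the function names, the model names, and the 95% threshold are hypothetical.

```python
from collections import Counter
from typing import Sequence


def fallback_share(served_models: Sequence[str], fallback_model: str) -> float:
    """Fraction of recent responses that were served by the fallback model."""
    if not served_models:
        # No observed traffic means routing is unconfirmed, not confirmed.
        return 0.0
    return Counter(served_models)[fallback_model] / len(served_models)


def routing_confirmed(served_models: Sequence[str],
                      fallback_model: str,
                      min_share: float = 0.95  # assumed threshold
                      ) -> bool:
    """True when enough recent responses came from the fallback model."""
    return fallback_share(served_models, fallback_model) >= min_share


# Hypothetical sample: the flag is set, but half of traffic still hits the primary.
recent = ["fallback-model"] * 10 + ["primary-model"] * 10
print(fallback_share(recent, "fallback-model"))
print(routing_confirmed(recent, "fallback-model"))
```

Treating an empty sample as unconfirmed is deliberate: "no data" during an incident should never read as "working".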
15-25 minutes: Verify against your gates
- Run the smallest possible smoke check against your highest-risk user flow.
- Check the same four gates you'd use in a drill: quality, latency, cost, safety. You are not re-running the full eval suite — you are confirming nothing is badly broken.
- If any gate fails clearly, escalate to a wider incident instead of continuing to patch quietly.
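The four gates can be reduced to pass/fail comparisons that the verifier reports without interpretation. This is an illustrative sketch only: the `Gate` structure, the metrics chosen for each gate, and all limits and observed values are assumptions, and your real gates should come from your existing eval slices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gate:
    """One pass/fail check; all limits used here are illustrative."""
    name: str
    observed: float
    limit: float
    higher_is_better: bool = False

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.observed >= self.limit
        return self.observed <= self.limit


def failed_gates(gates: List[Gate]) -> List[str]:
    """Names of the gates that did not pass."""
    return [g.name for g in gates if not g.passed()]


def next_action(gates: List[Gate]) -> str:
    """Escalate if any gate failed, otherwise continue to the next phase."""
    return "escalate" if failed_gates(gates) else "continue"


# Hypothetical smoke-check results on the highest-risk flow.
gates = [
    Gate("quality", observed=0.90, limit=0.85, higher_is_better=True),
    Gate("latency", observed=4000, limit=2500),   # p95 in ms, assumed metric
    Gate("cost", observed=0.004, limit=0.01),     # cost per request, assumed metric
    Gate("safety", observed=0.0, limit=0.0),      # flagged-output rate, assumed metric
]
print(failed_gates(gates))
print(next_action(gates))
```

Keeping the decision to "escalate" or "continue" mirrors the runbook's intent: the verifier reports results, and the incident owner does not have to weigh partial failures under pressure.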
25-30 minutes: Communicate and stabilize
- Post a plain-language status update: what broke, what's active now, what users might notice.
- Decide explicitly: stay on fallback, or attempt to restore primary. Don't leave this ambiguous.
- Set a recheck timer (commonly 30-60 minutes) rather than waiting for someone to remember to look again.
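The status update and the recheck timer can be produced together, so the explicit decision and the next check time always land in the same message. The sketch below is one possible shape, not a prescribed format: the function name, field wording, and example text are all hypothetical, and posting the message to your incident channel is left out.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_DECISIONS = ("stay on fallback", "attempt primary restore")


def status_update(what_broke: str, active_now: str, user_impact: str,
                  decision: str, now: datetime,
                  recheck_minutes: int = 30) -> str:
    """Build a plain-language status message that includes the recheck time."""
    if decision not in ALLOWED_DECISIONS:
        # Force an explicit choice instead of an ambiguous "monitoring".
        raise ValueError(f"decision must be one of {ALLOWED_DECISIONS}")
    recheck_at = now + timedelta(minutes=recheck_minutes)
    return "\n".join([
        f"What broke: {what_broke}",
        f"Active now: {active_now}",
        f"What users might notice: {user_impact}",
        f"Decision: {decision}",
        f"Next recheck: {recheck_at.strftime('%H:%M UTC')}",
    ])


# Hypothetical incident details, timestamps in UTC.
print(status_update(
    what_broke="primary model is timing out",
    active_now="fallback model is serving all traffic",
    user_impact="slightly slower responses",
    decision="stay on fallback",
    now=datetime.now(timezone.utc),
))
```

Rejecting any decision outside the two allowed values is the code equivalent of "don't leave this ambiguous".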
A common mistake is spending the first 15 minutes debating whether the outage is "real" instead of following the confirm step. A runbook only saves time if the team agrees in advance to trust the trigger condition instead of re-litigating it live.
The Copy-Paste Runbook
Keep this in your incident tooling (Slack canvas, PagerDuty runbook, or repo docs/runbooks/) where the on-call engineer will actually see it during an alert, not buried in a wiki.
## Model Fallback Runbook - First 30 Minutes
### 0-5 min: Detect and confirm
- [ ] Confirm trigger condition against dashboard (not a single request)
- [ ] Name incident owner
- [ ] Open incident channel, post trigger + timestamp
### 5-15 min: Activate fallback
- [ ] Flip predefined fallback flag/config
- [ ] Confirm traffic is routing to fallback model
- [ ] Post model pair + timestamp to incident channel
### 15-25 min: Verify against gates
- [ ] Run smoke check on highest-risk user flow
- [ ] Quality gate: pass / fail
- [ ] Latency gate: pass / fail
- [ ] Cost gate: pass / fail
- [ ] Safety gate: pass / fail
- [ ] Any gate failed -> escalate, do not patch quietly
### 25-30 min: Communicate and stabilize
- [ ] Post plain-language status update
- [ ] Decide: stay on fallback / attempt primary restore
- [ ] Set recheck timer (30-60 min)
Who Holds Which Role
Even a 30-minute incident benefits from named roles, decided before the incident — not assigned live.
- Incident owner: makes the activate/escalate/stabilize decisions. One person, no committee.
- Verifier: runs the gate checks and reports pass/fail, not opinions.
- Communicator: owns the status update, freeing the owner to focus on the technical decision.
On small teams, one person may hold two roles. What matters is that the split is decided in advance, so nobody is negotiating responsibility while the clock is running.
After Minute 30
This runbook ends at stabilization on purpose — it is not a full postmortem process. Once the incident is stable, move to your normal incident review, and use it to update the trigger conditions, gates, or model pair before the next monthly drill. A runbook that never changes after an incident is a sign nobody is reading the postmortem.
Conclusion
Drills prove the path works in calm conditions. Runbooks make sure the same path gets followed when conditions are not calm. Keep both, but don't confuse them — the runbook is what your on-call engineer opens at 3 a.m., and it needs to work without anyone having to think clearly first.
Pair this with "Run a Model Fallback Drill Before You Need One", "When Your AI Model Gets Pulled", and "AI Evals in Production" for the eval slices this runbook's gates depend on.