
Model Fallback Runbook: The First 30 Minutes

A fallback drill tells you the switch works. It doesn't tell your on-call engineer what to actually do at 3 a.m. when the primary model starts failing. This is the runbook for that moment — a minute-by-minute script for the first 30 minutes of a real AI provider outage.

Last reviewed: Jul 3 2026


TL;DR

During an outage, nobody wants to design a process — they want to follow one. Predefine the first 30 minutes as fixed phases: detect and confirm (0-5 min), activate fallback (5-15 min), verify against gates (15-25 min), and communicate and stabilize (25-30 min). Write it down before you need it.

A Runbook Is Not the Same as a Drill

Running a fallback drill answers a different question than this runbook does. A drill proves the fallback path works when you rehearse it on your own schedule. A runbook tells a stressed, half-awake engineer exactly what to do when the primary model is failing right now and users are already affected.

Teams that only have a drill often freeze during the real thing anyway, because the drill lives in a doc nobody re-reads during an incident. A runbook is written to be followed under pressure: short, sequential, and free of judgment calls that could have been made in advance.


The First 30 Minutes, Phase by Phase

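0-5 minutes: Detect and confirm

Confirm the trigger condition against the dashboard rather than a single failed request, name an incident owner, and open the incident channel with the trigger and a timestamp.

5-15 minutes: Activate fallback

Flip the predefined fallback flag or config, confirm that traffic is routing to the fallback model, and post the model pair and a timestamp to the incident channel.

15-25 minutes: Verify against your gates

Run a smoke check on the highest-risk user flow, then mark the quality, latency, cost, and safety gates as pass or fail. If any gate fails, escalate rather than patching quietly.

25-30 minutes: Communicate and stabilize

Post a plain-language status update, decide whether to stay on the fallback or attempt a primary restore, and set a recheck timer for 30 to 60 minutes.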

Common mistake

Spending the first 15 minutes debating whether the outage is "real" instead of following the confirm step. A runbook only saves time if the team agrees in advance to trust the trigger condition instead of re-litigating it live.
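
Trusting the trigger is easier when the trigger is an explicit computation rather than a feeling. The sketch below is one way to express "confirm against the dashboard, not a single request" in code. The `TriggerCondition` class, the 20% error-rate threshold, and the 50-request minimum are all hypothetical placeholders for whatever your team agrees on in advance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerCondition:
    """Hypothetical trigger: error rate over a dashboard window."""
    max_error_rate: float = 0.20  # assumed threshold, tune for your traffic
    min_requests: int = 50        # windows smaller than this are not trusted


def trigger_fired(outcomes: list[bool], condition: TriggerCondition) -> bool:
    """Return True when the windowed error rate reaches the threshold.

    `outcomes` holds one entry per request in the window:
    True for a successful request, False for a failed one.
    """
    if len(outcomes) < condition.min_requests:
        # A handful of failed requests is not enough evidence of an outage.
        return False
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes) >= condition.max_error_rate


if __name__ == "__main__":
    window = [True] * 70 + [False] * 30  # 30% of requests failing
    print(trigger_fired(window, TriggerCondition()))
```

With a rule like this written down, the confirm step becomes a lookup: either the window breaches the agreed threshold or it does not, and nobody has to argue about it at 3 a.m.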


The Copy-Paste Runbook

Keep this in your incident tooling (Slack canvas, PagerDuty runbook, or repo docs/runbooks/) where the on-call engineer will actually see it during an alert, not buried in a wiki.

## Model Fallback Runbook - First 30 Minutes

### 0-5 min: Detect and confirm
- [ ] Confirm trigger condition against dashboard (not a single request)
- [ ] Name incident owner
- [ ] Open incident channel, post trigger + timestamp

### 5-15 min: Activate fallback
- [ ] Flip predefined fallback flag/config
- [ ] Confirm traffic is routing to fallback model
- [ ] Post model pair + timestamp to incident channel

### 15-25 min: Verify against gates
- [ ] Run smoke check on highest-risk user flow
- [ ] Quality gate: pass / fail
- [ ] Latency gate: pass / fail
- [ ] Cost gate: pass / fail
- [ ] Safety gate: pass / fail
- [ ] Any gate failed -> escalate, do not patch quietly

### 25-30 min: Communicate and stabilize
- [ ] Post plain-language status update
- [ ] Decide: stay on fallback / attempt primary restore
- [ ] Set recheck timer (30-60 min)
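
The "any gate failed -> escalate" rule in the 15-25 minute phase can also be made mechanical. The following is a minimal sketch, assuming each gate is recorded as a simple pass/fail result; `GateResult` and `verify_gates` are hypothetical names, and how each gate is actually measured depends on your own evals and dashboards.

```python
from dataclasses import dataclass

# The four gates named in the runbook checklist.
REQUIRED_GATES = ("quality", "latency", "cost", "safety")


@dataclass(frozen=True)
class GateResult:
    name: str         # one of REQUIRED_GATES
    passed: bool
    detail: str = ""  # optional note for the incident channel


def verify_gates(results: list[GateResult]) -> tuple[str, list[str]]:
    """Return ("proceed", []) if every gate passed, else ("escalate", blockers).

    A required gate that nobody ran is treated as a blocker, so a skipped
    check cannot silently count as a pass.
    """
    seen = {r.name for r in results}
    failed = [r.name for r in results if not r.passed]
    missing = [name for name in REQUIRED_GATES if name not in seen]
    blockers = failed + missing
    return ("escalate" if blockers else "proceed", blockers)


if __name__ == "__main__":
    results = [
        GateResult("quality", True),
        GateResult("latency", False, "latency above agreed budget"),
        GateResult("cost", True),
        GateResult("safety", True),
    ]
    print(verify_gates(results))
```

Treating a missing gate as a failure is a design choice in this sketch, not something the runbook mandates. It keeps the decision conservative when the on-call engineer runs out of time before finishing every check.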

Who Holds Which Role

Even a 30-minute incident benefits from named roles, decided before the incident — not assigned live.

On small teams, one person may hold two roles. What matters is that the split is decided in advance, so nobody is negotiating responsibility while the clock is running.


After Minute 30

This runbook ends at stabilization on purpose — it is not a full postmortem process. Once the incident is stable, move to your normal incident review, and use it to update the trigger conditions, gates, or model pair before the next monthly drill. A runbook that never changes after an incident is a sign nobody is reading the postmortem.


Conclusion

Drills prove the path works in calm conditions. Runbooks make sure the same path gets followed when conditions are not calm. Keep both, but don't confuse them — the runbook is what your on-call engineer opens at 3 a.m., and it needs to work without anyone having to think clearly first.

Related reading

Pair this with "Run a Model Fallback Drill Before You Need One", "When Your AI Model Gets Pulled", and "AI Evals in Production" for the eval slices this runbook's gates depend on.

