[Image: Operations team running an AI incident drill with dashboards and a rollback plan]

The AI Incident Drill: A 7‑Step Production Runbook for UK SMEs

AI features don’t fail like normal software. A vendor blip, a prompt change, a data update, or a model upgrade can turn yesterday’s “green” pilot into today’s customer‑visible wobble. The fix is not heroics—it’s a runbook you can rehearse. Below is a plain‑English production playbook any UK SME or charity can adopt in a week, with the guardrails, checklists and KPIs that keep incidents small and recoveries fast.

We draw on battle‑tested operations ideas—canary releases, circuit breakers, error budgets—and combine them with simple business controls such as spend caps and go/no‑go rules. This aligns with recent government and industry guidance emphasising “secure by design” AI operations, not just model selection. cisa.gov

The 7‑step AI production runbook

1) Set the guardrails: SLOs and an error budget

Decide your service levels before you ship. Pick three measurable promises your users care about:

  • Reliability: e.g., “95% of answers under 2.0s; 99.9% uptime during business hours.”
  • Quality: e.g., “At least 90% of onboarding emails pass our acceptance checklist.” Tie this to your own AI Quality Scoreboard.
  • Cost: e.g., “£0.05 max per ticket reply; daily cap £120 with auto cut‑off.”

Protect these promises with an error budget (the tiny amount of failure you tolerate each month). Use it to decide if a risky change can ship. This is standard reliability practice and pairs naturally with canary rollouts. sre.google
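As a rough illustration of the arithmetic, the sketch below (Python, with made-up targets, request counts and field names) shows how an SLO definition can be turned into a monthly error budget and a simple go/no-go signal.

```python
# Minimal sketch: turn an availability SLO into a monthly error budget and a
# simple go/no-go signal. Targets and request counts are illustrative only.

from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the month."""
        return 1.0 - self.target

def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Share of the monthly error budget still unspent (1.0 = untouched)."""
    allowed_failures = slo.error_budget * total_requests
    if allowed_failures <= 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed_failures)

uptime = Slo(name="business-hours availability", target=0.999)
remaining = budget_remaining(uptime, total_requests=120_000, failed_requests=30)

# Example go/no-go rule: only ship a risky change while most of the budget is intact.
print(f"{remaining:.0%} of the error budget left; risky change OK: {remaining > 0.5}")
```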

2) Pre‑flight checks that prevent 80% of issues

  • Rate limits and bursts: confirm your vendor’s quotas and set retries with backoff; expect per‑second enforcement even if limits are expressed per minute.
  • Timeouts: default to short timeouts for the model call and longer timeouts for any retrieval step.
  • Spend guardrails: set per‑day caps and a kill‑switch that degrades the feature instead of taking the whole service down. For a deeper cost pattern, see Beating AI Bill Shock.
  • Prompt and data versioning: store prompts and RAG index versions so you can roll back.

Vendors often recommend backoff, token budgeting and usage tiers to avoid throttling surprises—use these patterns from day one. help.openai.com
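The sketch below shows roughly how these pre-flight checks fit together in code: exponential backoff with jitter, a short timeout, and a daily spend cap that degrades the feature rather than failing hard. The model call, error type and cost figures are placeholders for whatever vendor SDK and pricing you actually use.

```python
# Sketch only: retries with exponential backoff and jitter, a short timeout, and a
# daily spend cap that degrades the feature instead of failing hard. The vendor
# call, error type, and costs below are stand-ins for your real SDK and pricing.

import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever throttling error your vendor SDK raises (e.g. HTTP 429)."""

def call_model(prompt: str, timeout_s: float) -> tuple[str, float]:
    """Placeholder for the real model call; returns (reply, cost in GBP)."""
    return f"stub reply to: {prompt}", 0.002

def fallback_response(prompt: str) -> str:
    """Safe degraded answer: a cached reply, a template, or a human-review handoff."""
    return "Thanks, we've queued this for a member of the team to review."

DAILY_CAP_GBP = 120.00   # the example cap from step 1
spend_today_gbp = 0.0    # in practice, read this from your metrics store

def call_with_guardrails(prompt: str, max_retries: int = 4, timeout_s: float = 10.0) -> str:
    global spend_today_gbp
    if spend_today_gbp >= DAILY_CAP_GBP:
        return fallback_response(prompt)          # kill-switch: degrade, don't fall over

    for attempt in range(max_retries):
        try:
            reply, cost_gbp = call_model(prompt, timeout_s=timeout_s)
            spend_today_gbp += cost_gbp
            return reply
        except (RateLimitError, TimeoutError):
            time.sleep(2 ** attempt + random.random())   # backoff: ~1s, 2s, 4s, 8s plus jitter

    return fallback_response(prompt)              # give up gracefully after the last retry

print(call_with_guardrails("Summarise this support ticket"))
```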

3) Ship safely with a canary

A canary release exposes a small slice of real traffic (say 1–5%) to the new version. You watch leading indicators—errors, latency, cost per request, and any quality checks—before expanding. If the canary misbehaves, you roll back quickly and learn cheaply. This is industry‑standard practice for reducing blast radius during change. sre.google

  • Start small: 1% of traffic for 30–60 minutes, then 5%, then 25%.
  • Define pass/fail: e.g., “error rate no more than 0.5 percentage points worse than control; p95 latency under 3s; cost within +10%.”
  • Avoid overlapping canaries; they make signals noisy and reversions harder to reason about. sre.google
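A minimal sketch of a canary gate, assuming a deterministic traffic split and the pass/fail thresholds above; the metric names and numbers are illustrative, not recommendations.

```python
# Illustrative canary gate: send ~1% of traffic to the new version, then compare
# it against control using the pass/fail rules above. Names and numbers are examples.

import hashlib

CANARY_SHARE = 0.01   # start at 1%, widen to 5% then 25% only if the gate passes

def route(request_id: str) -> str:
    """Deterministic split: the same request id always lands in the same group."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_SHARE * 100 else "control"

def canary_passes(control: dict, canary: dict) -> bool:
    """The gate from this step: error rate, p95 latency, and cost thresholds."""
    return (
        canary["error_rate"] <= control["error_rate"] + 0.005          # <= 0.5pp worse
        and canary["p95_latency_s"] < 3.0                              # p95 under 3s
        and canary["cost_per_100"] <= control["cost_per_100"] * 1.10   # within +10%
    )

control = {"error_rate": 0.010, "p95_latency_s": 2.1, "cost_per_100": 0.90}
canary = {"error_rate": 0.012, "p95_latency_s": 2.4, "cost_per_100": 0.95}
print("group for ticket-48213:", route("ticket-48213"))
print("gate:", "expand canary" if canary_passes(control, canary) else "roll back")
```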

For the organisational side of go‑lives, pair your canary with the approach in The Quiet Cutover.

4) Add circuit breakers and graceful degradation

When a dependency is slow or failing, a circuit breaker trips and stops more failing calls, letting the system recover. For AI, that means switching to cached answers, a simpler model, or a human‑review queue rather than hammering the API. The circuit breaker pattern is a mainstream cloud reliability design you can ask your developers or platform to enable. learn.microsoft.com

  • Trip rules: “Open” after X failures in Y seconds; “half‑open” to test recovery.
  • Degrade deliberately: show a safe template, hide the AI bit of the UI, or add a “Try again later” with a timestamp.
  • Alert on breaker state changes so the team sees issues early.
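A hand-rolled breaker is rarely the right production answer (most platforms and resilience libraries provide one), but this minimal sketch shows the closed, open and half-open states your team would configure; the thresholds are examples.

```python
# Minimal circuit-breaker sketch (closed -> open -> half-open) around a model call.
# Real deployments normally use a resilience library or the platform's built-in
# breaker; thresholds here are only examples.

import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window_s: float = 30.0, cooldown_s: float = 60.0):
        self.max_failures = max_failures      # "open" after X failures...
        self.window_s = window_s              # ...within Y seconds
        self.cooldown_s = cooldown_s          # how long to stay open before a half-open probe
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                       # closed: call as normal
        if time.time() - self.opened_at >= self.cooldown_s:
            return True                                       # half-open: allow one probe
        return False                                          # open: serve the fallback

    def record(self, success: bool) -> None:
        now = time.time()
        if success:
            self.failures.clear()
            self.opened_at = None                             # probe succeeded: close again
            return
        if self.opened_at is not None:
            self.opened_at = now                              # half-open probe failed: re-open
            return
        self.failures = [t for t in self.failures if now - t < self.window_s] + [now]
        if len(self.failures) >= self.max_failures:
            self.opened_at = now                              # tripped: alert on this change

breaker = CircuitBreaker()
if breaker.allow():
    print("call the model, then breaker.record(success=...)")
else:
    print("degrade: cached answer, simpler model, or human-review queue")
```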

5) Observe the right things (without logging secrets)

  • Metrics to graph: request count, error rate, latency p95, cost per 100 calls, token usage, and content safety flags.
  • Tracing: tag each request with model version, prompt version, and RAG index version.
  • Privacy: never log raw personal or sensitive data; mask or hash fields. Government guidance on generative AI use reinforces avoiding sensitive inputs in third‑party tools. gov.uk
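As an illustration of “observe without logging secrets”, the sketch below emits one structured log entry per request: version tags for rollback, the metrics above, and a hashed user reference instead of raw personal data. Every field name and version string is an example.

```python
# Sketch of one structured log entry per request: version tags for rollback,
# dashboard metrics, and a hashed reference instead of raw personal data.
# Every field name and version string below is an example.

import hashlib
import json

def log_request(user_email: str, tokens: int, latency_s: float,
                cost_gbp: float, safety_flagged: bool) -> str:
    entry = {
        "model_version": "vendor-model-2024-06",      # tag everything you might roll back
        "prompt_version": "onboarding-email-v12",
        "rag_index_version": "kb-2024-09-30",
        "user_ref": hashlib.sha256(user_email.encode()).hexdigest()[:12],  # never the raw value
        "tokens": tokens,
        "latency_s": round(latency_s, 3),
        "cost_gbp": round(cost_gbp, 4),
        "safety_flagged": safety_flagged,
    }
    return json.dumps(entry)   # ship to your usual logging and metrics pipeline

print(log_request("jane@example.org", tokens=1_116, latency_s=1.42,
                  cost_gbp=0.011, safety_flagged=False))
```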

6) Drill the incident: 30 minutes, every fortnight

Table‑top “what if?” exercises beat real‑world panic. Use a free, structured exercise format and adapt a scenario to AI (vendor outage, cost surge, prompt regression). Practising roles, comms, and the rollback makes everyone faster when it’s real. nicybersecuritycentre.gov.uk

  • Pick a scenario: “API throttled”, “RAG index corrupted”, “prompt hotfix gone wrong”.
  • Assign roles: Incident Lead, Comms, Fixer, Customer Liaison, Scribe.
  • Stop at minute 30: either recovered, degraded safely, or rolled back. Capture actions.

7) Close the loop with suppliers

Share your SLOs and drill outcomes with vendors. Ask for better error codes, burst capacity, and clarifications on failover. External guidance emphasises secure, well‑operated AI systems across the lifecycle—deployment and operations included. cisa.gov

Copy‑paste: the 30‑minute rollback test
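A minimal sketch of what the timed rollback test might automate, assuming a feature-flag or config store and a handful of smoke checks of your own; every name and version below is a placeholder.

```python
# Hypothetical sketch of what the timed rollback test could automate: point the
# service back at the previous prompt and model versions, run a few smoke checks,
# and report elapsed time against the 30-minute budget. Every name is a placeholder
# for your own feature-flag store and acceptance checks.

import time

ROLLBACK_BUDGET_S = 30 * 60   # the drill's 30-minute limit

def set_active_versions(prompt_version: str, model_version: str) -> None:
    """Placeholder: update your config or feature-flag store to the previous versions."""
    print(f"rolled back to prompt={prompt_version}, model={model_version}")

def smoke_checks() -> bool:
    """Placeholder: a handful of known requests that must pass the acceptance checklist."""
    return True

start = time.time()
set_active_versions(prompt_version="onboarding-email-v11", model_version="vendor-model-2024-05")
recovered = smoke_checks()
elapsed = time.time() - start

verdict = "within" if elapsed <= ROLLBACK_BUDGET_S else "over"
print(f"recovered: {recovered}; took {elapsed:.0f}s ({verdict} the 30-minute budget)")
```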

Risk vs cost: cheap safeguards that pay for themselves

Failure mode | Impact | Low‑cost safeguard
Vendor throttling or outage | Errors, timeouts, unhappy users | Retry with backoff; small canary; circuit breaker to cached response; pre‑agreed burst with supplier. sre.google
Prompt or model change reduces quality | Off‑brand responses, rework | Version prompts; 24‑hour shadow test; quality spot‑checks before rollout; UAT sign‑off—see 5‑Day UAT.
Cost spike from longer contexts | Budget blown, CFO alarmed | Daily cap and feature degrade; shorten contexts; index the right data; see Bill‑Shock Guardrails.

KPIs and alerts you actually need

  • Reliability: error rate by route; p95 latency; % requests served by fallback.
  • Quality: pass rate on your acceptance checklist; top 5 failure reasons this week.
  • Cost: cost per 100 requests; daily spend vs cap; top 5 costly prompts.
  • Safety: % flagged by content policy; human‑review queue size and ageing.
  • Drill readiness: time‑to‑rollback; time‑to‑degrade; last exercise date.

Publish these in a one‑page “AI Service Health” board and review weekly; it keeps deployments honest and makes go/no‑go decisions calmer. For a structured way to set targets, reuse the scorecard approach in The AI Quality Scoreboard.

Procurement questions for your vendor or platform

  • Rate limits and bursts: what are the per‑second and per‑minute limits; what error codes and headers are sent; can we request temporary bursts during incidents? help.openai.com
  • Operational transparency: do you offer per‑region status and incident post‑mortems? What’s the ETA on fix types (capacity, networking, model rollback)?
  • Controls: can we set per‑key spend caps, hard maximum tokens per request, and default timeouts?
  • Change management: how much notice before model or policy changes; is there a “compatibility” mode?
  • Security by design: how do you align to recognised guidance for secure AI development and operation? cisa.gov

Your first fortnight: a simple adoption plan

Week 1: Make failure cheap

  1. Define SLOs and error budget; pick your go/no‑go thresholds.
  2. Add the kill‑switch and per‑day cost cap.
  3. Introduce canary stages to your release process and tag traffic “control vs canary”.
  4. Turn on circuit breakers for the model call and retrieval step. learn.microsoft.com
  5. Mask sensitive fields in logs; add a “privacy checklist” to release notes. gov.uk

Week 2: Build muscle memory

  1. Run a 30‑minute drill using a vendor‑throttling scenario.
  2. Create a one‑page “AI Service Health” view with the KPIs above.
  3. Shadow‑test your next prompt or model for 24 hours; hold a go/no‑go (see the sketch after this list).
  4. Document a rollback that a duty manager can execute without a developer.
  5. Schedule a quarterly “bigger” exercise using free materials. nicybersecuritycentre.gov.uk
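For the shadow test in item 3, the pattern is to run the candidate prompt on a copy of real traffic while users only ever see the current version. A minimal sketch, with placeholder helpers and version names:

```python
# Rough sketch of the shadow pattern: the candidate prompt version runs on a copy
# of real traffic, both replies are logged for the go/no-go review, and users only
# ever see the current version. The model call and versions below are placeholders.

import json

def call_model(prompt: str) -> str:
    """Placeholder for the real vendor call."""
    return f"stub reply to: {prompt}"

def handle_request(user_input: str) -> str:
    live = call_model(f"[onboarding-email-v12] {user_input}")        # current version
    try:
        shadow = call_model(f"[onboarding-email-v13] {user_input}")  # candidate version
        print(json.dumps({"input": user_input, "live": live, "shadow": shadow}))
    except Exception:
        pass   # the shadow copy must never break the user-facing path
    return live                                                      # users only see v12

print(handle_request("Draft a welcome email for a new member"))
```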

When you’re ready for the wider go‑live, borrow the techniques in The Quiet Cutover and lock in sign‑off with the 5‑Day UAT plan.

Why this works

You’re not betting on perfect models. You’re investing in a repeatable way to ship small changes, watch the right signals, and recover quickly. That’s the essence of healthy operations, and it’s exactly how reliable teams ship software and services at scale. sre.google

Book a 30‑min call, or email: team@youraiconsultant.london