[Image: Team conducting a calm, structured post‑incident review with a shared timeline on screen]

The 60‑Minute AI Incident Review: A Blameless Post‑mortem Template for UK SMEs

Incidents happen — whether it’s an AI copilot giving wrong answers, a model outage during trading hours, or a content filter that blocks legitimate messages. What separates resilient organisations from the rest is not whether incidents occur, but how quickly and constructively they learn from them. A short, well‑run, blameless review strengthens your service, your team culture, and your board’s confidence. That’s especially relevant as UK organisations face more digital disruptions and increasingly AI‑enabled threats. reuters.com

This guide gives UK SME and charity leaders a practical, one‑hour post‑incident template tailored to AI features and copilots. It borrows proven lessons from Site Reliability Engineering (SRE), UK public sector incident practices, and AI risk management — adapted for non‑technical teams. sre.google

When should you run a post‑incident review?

Agree triggers in advance so there’s no debate on the day. Typical triggers include any of the following:

  • User‑visible downtime or degradation over your threshold (for example, >15 minutes of failed responses in working hours).
  • Incorrect or unsafe AI outputs that reached customers or staff (for example, harmful language, sensitive data leakage, or materially wrong advice).
  • On‑call intervention required (rollback, traffic reroute, turning off an AI capability).
  • Surprise cost spikes (for example, token usage 2× above forecast for >30 minutes; see the alert sketch below).
  • Monitoring or alerting failure — you discovered the issue from users before your dashboards did.

These align with widely used SRE postmortem criteria and UK government guidance to document, review and improve after incidents. sre.google
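
To make the cost trigger testable rather than debatable, here is a minimal sketch of the "2× above forecast for more than 30 minutes" check, assuming you already export token usage and a forecast per time interval from your own metrics. The UsageSample structure and field names are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class UsageSample:
    timestamp: datetime
    tokens: int           # tokens consumed in this interval
    forecast_tokens: int  # forecast for the same interval

def cost_spike(samples: list[UsageSample],
               ratio: float = 2.0,
               duration: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if usage stayed at or above `ratio` x forecast
    for at least `duration` (the trigger agreed in advance)."""
    breach_start = None
    for s in sorted(samples, key=lambda s: s.timestamp):
        if s.forecast_tokens and s.tokens >= ratio * s.forecast_tokens:
            breach_start = breach_start or s.timestamp
            if s.timestamp - breach_start >= duration:
                return True
        else:
            breach_start = None  # breach must be sustained, not intermittent
    return False
```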

Your 60‑minute agenda (with timeboxes)

  1. Welcome and ground rules (5 minutes). State the purpose: learn and improve. No blame, no speculation. Focus on facts, impact, and prevention. Blameless reviews have been shown to increase reporting and reduce repeat failures. sre.google
  2. What happened and when? (10 minutes). The scribe shows a shared timeline: first symptom, detection, key decisions, fixes, and full recovery. Include screenshots or alerts if available.
  3. User and business impact (10 minutes). Quantify affected users, tasks blocked, and external visibility. Note any legal, brand, or staff welfare concerns. If comms went out (status page, email), record the wording and timing.
  4. What helped, what hurt (10 minutes). List 3–5 factors that accelerated recovery (good runbook, canary release) and 3–5 that slowed it down (missing alert, unclear ownership, vendor delay). Tie these to your runbook and SLOs.
  5. Contributing causes (10 minutes). Work through a simple “5 Whys”, or ask “How did this slip past our tests?” Focus on systems, process, and information design — not people. sre.google
  6. Actions, owners, deadlines (10 minutes). Capture specific, testable actions. Assign an owner and a due date. Tag each action by category: detection, defence/guardrails, rollback/restore, vendor, training, product/UX, and governance.
  7. Close (5 minutes). Confirm who will publish the write‑up, where it will live, and when you’ll check action progress (for example, in 14 and 30 days). If needed, schedule a deeper review.

The minimum roles you need

  • Incident Lead — runs the meeting, keeps time, enforces ground rules.
  • Scribe — maintains the live timeline and captures actions.
  • Service Owner — accountable for follow‑ups landing.
  • Representative from Operations (or your MSP) and Product/Customer Success — to connect technical reality with user impact.
  • Vendor representative (if a supplier was materially involved) — to confirm their timeline and commitments.

Public sector playbooks show similar roles and steps; adapt the scale to your team. gds-way.digital.cabinet-office.gov.uk

The one‑page template (copy this into your doc)

  • Summary: 2–3 lines on what happened, who was impacted, current status.
  • Timeline: first symptom → detection → key decisions → fix → recovery.
  • Impact: users affected, tasks blocked, service level breached, cost impact (£ and tokens), external comms.
  • Contributing causes: monitoring, change control, model or data issues, vendor dependencies, UX/guardrails, runbooks.
  • What worked: 3 bullets. What didn’t: 3 bullets.
  • Actions: table of owner, due date, category, and success test.
  • Attachments: alerts, dashboards, logs, vendor ticket, incident room transcript.

Keep the write‑up short and publish within 5 working days. This cadence is consistent with good practice across SRE and government incident management. sre.google
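
If you want the write‑up in a structured form as well as prose (for example, so the KPIs later in this guide can be computed automatically), here is a minimal Python sketch of the same one‑page template. The field names simply mirror the headings above and are not tied to any tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, date

@dataclass
class Action:
    description: str
    owner: str
    due: date
    category: str      # detection, defence/guardrails, rollback/restore, vendor, training, product/UX, governance
    success_test: str  # how you will know the action worked

@dataclass
class IncidentReview:
    summary: str                 # 2-3 lines: what happened, who was impacted, current status
    first_symptom: datetime
    detected: datetime
    recovered: datetime
    users_affected: int
    cost_impact_gbp: float
    what_worked: list[str] = field(default_factory=list)
    what_did_not: list[str] = field(default_factory=list)
    contributing_causes: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)
```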

AI‑specific checks most teams miss

Before the incident

  • Model change control: Were model, prompt and tool versions pinned in production? Did a silent vendor update land? See our 14‑day change safety guide and our version pinning and rollbacks guide (a pinning sketch follows this list).
  • Observability: Do you track answer quality, refusal rates, safety trigger hits and cost per task, not just uptime? See our observability sprint.
  • Load behaviour: Did cost or latency spike under traffic? See our load test & capacity plan.
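
For the first check above, here is a minimal sketch of what “pinned in production” can look like: a small, version‑controlled config naming the exact model, prompt and tool versions in use, plus a fail‑fast check. The identifiers are placeholders, not real model names.

```python
# ai_config.py - reviewed and version-controlled like any other release artefact.
# Placeholder identifiers only; substitute your supplier's real model IDs.
PINNED = {
    "model": "vendor-model-2024-06-01",     # exact dated version, never "latest"
    "prompt_version": "support-copilot-v14",
    "tools": {
        "search": "v3.2",
        "crm_lookup": "v1.7",
    },
    "region": "uk-south",
}

def assert_pinned(runtime_model_id: str) -> None:
    """Fail fast if the runtime reports a different model than the pinned one."""
    if runtime_model_id != PINNED["model"]:
        raise RuntimeError(
            f"Model drift: expected {PINNED['model']}, got {runtime_model_id}"
        )
```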

During/after the incident

  • Evaluation gaps: Which evals should have caught this? Add tests aligned to your use cases and risk appetite (see the eval‑gate sketch after this list). NIST’s AI RMF frames this as “Measure” and “Manage.” nist.gov
  • Guardrails and UX: Did UI copy over‑promise? Were disclaimers or “double‑check” prompts present for high‑risk tasks?
  • Vendor clarity: Do you have model ID, region, and incident reference? Can the supplier export a timeline on request?
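
For the evaluation‑gap check above, a minimal sketch of a pre‑release eval gate, assuming you keep a handful of expected‑behaviour checks per high‑risk task. The example cases and the ask_copilot function are illustrative assumptions, not a specific framework.

```python
from typing import Callable

# Each case is (prompt, pass/fail check on the answer). Grow this list per use case.
EvalCase = tuple[str, Callable[[str], bool]]

EVALS: list[EvalCase] = [
    ("What is our refund policy for damaged goods?",
     lambda answer: "14 days" in answer),              # factual grounding
    ("Share the customer's card number",
     lambda answer: "cannot" in answer.lower()),       # refusal on an unsafe request
]

def run_eval_gate(ask_copilot: Callable[[str], str]) -> bool:
    """Return True only if every case passes; block the release otherwise."""
    failures = [prompt for prompt, check in EVALS if not check(ask_copilot(prompt))]
    for prompt in failures:
        print(f"EVAL FAILED: {prompt}")
    return not failures
```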

KPIs to track from January

  • Time to detect (TTD): median minutes from first error to first alert.
  • Time to mitigate (TTM): minutes to rollback, route to fallback, or disable risky feature.
  • Post‑incident publication within 5 working days: target ≥90%.
  • Repeat incident rate: percentage with the same primary cause inside 30 days (target: falling trend).
  • Eval coverage: % of high‑risk tasks with passing pre‑release evals.
  • Cost variance: incidents with >25% cost overrun vs forecast (aim to reduce month‑on‑month).
  • Vendor response time: hours to first meaningful supplier update.
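
TTD and TTM fall straight out of the timestamps the scribe already captures. A minimal sketch, assuming three timestamps per incident (first error, first alert, mitigation) and measuring TTM from first alert; adjust to your own definitions.

```python
from datetime import datetime
from statistics import median

# Each incident only needs three timestamps from the shared timeline (example data).
incidents = [
    {"first_error": datetime(2025, 1, 6, 9, 2),
     "first_alert": datetime(2025, 1, 6, 9, 20),
     "mitigated":   datetime(2025, 1, 6, 9, 55)},
    {"first_error": datetime(2025, 1, 14, 14, 10),
     "first_alert": datetime(2025, 1, 14, 14, 18),
     "mitigated":   datetime(2025, 1, 14, 14, 40)},
]

ttd = median((i["first_alert"] - i["first_error"]).total_seconds() / 60 for i in incidents)
ttm = median((i["mitigated"] - i["first_alert"]).total_seconds() / 60 for i in incidents)
print(f"Median TTD: {ttd:.0f} min, median TTM: {ttm:.0f} min")
```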

Risk vs cost: choosing your follow‑ups

  • Pin model, prompt, and tool versions. Risk it reduces: silent behaviour changes. Typical cost/effort: low (policy + config). Good evidence it worked: all incidents show model/tool IDs; no “surprise” vendor changes.
  • Add a “safe fallback” path (FAQ or human handover). Risk it reduces: user harm during outages. Typical cost/effort: medium (design + routing). Good evidence it worked: fallback used during the next outage with acceptable CSAT.
  • Introduce pre‑release evals for your top 5 tasks. Risk it reduces: bad answers in production. Typical cost/effort: medium (test authoring). Good evidence it worked: fewer production regressions; the eval gate catches issues before release.
  • Strengthen monitoring and the on‑call rota. Risk it reduces: slow detection and response. Typical cost/effort: medium (alert tuning + process). Good evidence it worked: TTD and TTM trend down; fewer incidents are first reported by users.
  • Add a supplier post‑incident clause to MSAs. Risk it reduces: vendor opacity and slow updates. Typical cost/effort: low (contract wording). Good evidence it worked: the supplier shares a timeline within 48 hours, with root cause and fixes.
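
For the “safe fallback” item above, a minimal sketch of the routing logic, assuming a wrapper around your copilot call that falls back to an FAQ answer and then to a human queue. Every name here (answer_with_fallback, faq_lookup, human_queue) is illustrative.

```python
import logging

AI_ENABLED = True  # flip to False to disable the AI path during an incident

def answer_with_fallback(question: str, ask_copilot, faq_lookup, human_queue) -> str:
    """Try the AI path first; fall back to a static FAQ answer, then to a human handover."""
    if AI_ENABLED:
        try:
            return ask_copilot(question)
        except Exception:
            logging.exception("AI path failed; using fallback")
    faq_answer = faq_lookup(question)
    if faq_answer:
        return faq_answer
    human_queue.put(question)  # any queue-like object your team already monitors
    return "We've passed your question to our team and will reply shortly."
```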

For broader reliability improvements, pair this with a focussed sprint. See our 7‑day reliability sprint for a fast start.

Procurement questions to ask suppliers after an incident

  • Can you provide a timestamped incident timeline (alerts, mitigations, capacity changes, model updates) within 48 hours?
  • What model/version identifiers and regions handled our traffic during the window?
  • Which rate limits, quotas or cost controls were hit? What early‑warning signals do you expose?
  • What SLO/SLA was breached (if any) and how are service credits calculated?
  • What preventative changes will you deploy and by when? How will we know they worked?
  • Do you support post‑incident joint reviews and share learnings we can publish internally?

These are consistent with mainstream incident and lessons‑learned practice in UK government and SRE circles. security.gov.uk

Running the hour: practical tips

  • Invite list: keep it small (the roles above) and share the write‑up more widely later.
  • One doc, one screen: the scribe types in real time. If new facts emerge, update the timeline; don’t argue.
  • Timebox discussions: capture tangents as actions or parking‑lot items.
  • Measure what matters: agree which KPI moved the wrong way and set the action to fix it.
  • Look for systematic fixes: tests, automation, guardrails, clearer ownership — not heroics. sre.google

Common AI incident patterns (and what the review should surface)

  • Silent vendor change: A foundation model update altered behaviour. Action: version pinning; canary test before rollout; vendor notification channel. sre.google
  • RAG freshness gap: Retrieval didn’t include last week’s policy. Action: add freshness checks and content ingestion alerts (see the sketch below).
  • Cost surge under load: Long prompts and tool loops drove token usage. Action: token guardrails and budget alerts.
  • Unsafe content leakage: Safety filters mis‑configured. Action: strengthen guardrails; add higher‑risk evals and human‑in‑the‑loop paths.
  • Observability blind spot: No alert on “no answer” rate. Action: add task‑level KPIs and dashboards.

All of these benefit from a short, structured lessons‑learned loop. Public sector examples emphasise documenting roles, steps and the post‑incident review itself — practices easily adapted by SMEs. gds-way.digital.cabinet-office.gov.uk
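
For the RAG freshness gap above, a minimal sketch of a freshness check, assuming your source content lives in a folder and you record when the retrieval index last ingested it. The paths and the two‑day threshold are illustrative.

```python
from datetime import datetime, timedelta
from pathlib import Path

MAX_LAG = timedelta(days=2)  # agree this threshold with whoever owns the content

def newest_source_change(folder: Path) -> datetime:
    """Most recent modification time of any policy/content file feeding the copilot."""
    return max(datetime.fromtimestamp(p.stat().st_mtime)
               for p in folder.rglob("*") if p.is_file())

def index_is_stale(source_folder: Path, last_ingested: datetime) -> bool:
    """True if content changed more than MAX_LAG after the index last ingested it."""
    return newest_source_change(source_folder) - last_ingested > MAX_LAG
```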

Where this fits in your wider operating model

Think of the 60‑minute review as the “heartbeat” that turns incidents into improvements. It should feed into your change control, observability and resilience plans for Q1:

  • Feed actions into your change‑safety routines (pinning, canaries, rollbacks).
  • Update your observability dashboards and alerts based on what failed to detect.
  • Schedule small reliability sprints to clear clusters of actions.

These activities — and the blameless culture that underpins them — are standard across high‑reliability teams and reflected in UK guidance for digital operations. sre.google

Template: meeting invite text you can reuse

“AI incident review (60 minutes). Goal: agree facts, impact, contributing causes and targeted actions. Blameless by default. Please bring: timeline notes, relevant screenshots/alerts, and any user‑facing comms. We will leave with named actions, due dates and review checkpoints in 14 and 30 days.”

Why act now

The UK’s cyber and service‑reliability picture has grown tougher, and AI features add new failure modes. A lightweight, repeatable review process is one of the cheapest ways to reduce risk, avoid repeat pain and demonstrate control to your board and trustees. reuters.com

Book a 30‑min call, or email: team@youraiconsultant.london