Operations team reviewing an AI service runbook and dashboards

From pilot to always‑on: the AI runbook, SLOs and on‑call your UK SME needs by January

Why this piece, and why now

Many UK SMEs and charities launched AI pilots in 2025, and some of those pilots have quietly become features customers now rely on: triaging inboxes, routing cases, summarising files, drafting sensitive letters. The risk is no longer “will the AI work?” but “what happens at 09:02 on Monday when it doesn’t?” Production‑grade services need clear reliability targets, playbooks and fallbacks, especially when you depend on third‑party AI APIs that can change behaviour, latency or availability without notice. The Site Reliability Engineering (SRE) approach gives us plain‑English tools to manage that risk in small organisations too: SLIs, SLOs and error budgets. sre.google

In this guide you’ll get a minimum viable AI runbook, three practical SLOs any SME can measure, a one‑hour tabletop drill, and a 30‑day rollout plan. Where we reference frameworks, we choose those you can adopt without extra headcount, like Google’s SLO model and NIST’s AI Risk Management Framework guidance on measurement and monitoring. sre.google

What “production‑ready” means for AI features

Production‑ready does not mean “perfect answers”. It means you’ve agreed what “good enough” looks like for users and how you’ll recover when it isn’t. In practice:

  • Clear service boundaries: what the AI does, what it never does, who owns it out of hours.
  • SLIs/SLOs that reflect user journeys—such as response time, successful resolution rate, and cost per request—plus an error budget with pre‑agreed actions if you burn through it. docs.cloud.google.com
  • Observability: basic logging of prompts, versions, latencies, costs and outcomes, without collecting unnecessary personal data; dashboards for weekly review. The NIST AI RMF playbook calls this “Measure” and “Manage” across the lifecycle. nist.gov
  • Graceful degradation: sensible timeouts, cached answers for repeat queries, a secondary model, and a human path. Cloud reliability playbooks and ML lenses put these in the reliability pillar. docs.aws.amazon.com
  • Practised incident response: roles, comms templates, and a 60‑minute drill. You can adapt mainstream incident‑management guidance for small teams. atlassian.com

Your minimum viable AI runbook (7 pages, one owner)

Keep it short so people actually use it. Store it where your team lives (SharePoint, Confluence, Notion). Review monthly.

  1. Service overview: plain‑English promise, in‑scope/out‑of‑scope tasks, user groups, hours, data sources, and the product owner.
  2. Dependencies map: model provider(s), embeddings/vector store, prompt libraries, key APIs, analytics, cost/account limits, and status pages you’ll watch.
  3. SLOs and error budget: the three SLOs below, measurement windows, what happens when the budget is half used and when it’s exhausted. docs.cloud.google.com
  4. Run procedures: daily checks, cost checks, version bumps; what to log each week (see “KPIs” section).
  5. Incident procedures: severity levels, who’s on‑call, first 15 minutes, comms template, escalation tree (including vendor). atlassian.com
  6. Fallback playbook: timeouts and retries, switch to backup model or cached answer, disable AI button, route to human. docs.aws.amazon.com
  7. Post‑incident learning: blameless review, follow‑ups allocated with dates, and how to adjust SLOs or prompts based on evidence. sre.google
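
If it helps to keep the headline facts consistent across tools, a short machine‑readable summary can sit alongside the written runbook. The Python sketch below is purely illustrative: every name and value is a placeholder we have invented for this example, not a standard.

```python
# Illustrative runbook "front matter". The written document stays the source of
# truth; a small structured summary is just easy to surface in dashboards or chat.
RUNBOOK_SUMMARY = {
    "service": "inbox-triage-assistant",          # hypothetical service name
    "owner": "product-owner@example.org",         # placeholder contact
    "hours": "Mon-Fri 09:00-18:00, best effort otherwise",
    "dependencies": [
        "primary model API", "backup model API",
        "vector store", "analytics", "vendor status pages",
    ],
    "slos": {
        "responsiveness": "95% of responses under 2.5s in hours",
        "quality": "90% weekly-sample pass rate, rolling 30 days",
        "cost": "agreed cost band per successful request, rolling 30 days",
    },
    "escalation": ["on-call resolver", "incident lead", "vendor support"],
}
```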

Set three SLOs that fit AI work

SLOs are targets for user‑centred indicators. Don’t aim for 100%; you need an error budget to allow for safe change and vendor variability. Start loose, then tighten. sre.google

1) Responsiveness SLO

  • Indicator: time to first meaningful response presented to a user.
  • Target: 95% under 2.5s during 09:00–18:00, Mon–Fri; 95% under 5s outside hours. Measure client‑side if possible so you see what users see. sre.google
  • Why: users tolerate slightly slower AI if it’s useful, but they won’t wait indefinitely; clear peak‑time targets align teams.
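
If you already log latency per request, the indicator is a few lines of arithmetic. A minimal sketch, assuming a list of per‑request latencies taken from your own logs (field names and figures are illustrative):

```python
from statistics import quantiles

def responsiveness_sli(latencies_ms, threshold_ms=2500):
    """Share of requests answered within the threshold, plus the p95 latency.

    2.5s matches the in-hours target above; swap in 5000 for the out-of-hours check.
    """
    if not latencies_ms:
        return None, None
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms) / len(latencies_ms)
    p95 = quantiles(latencies_ms, n=20)[-1]   # 95th-percentile latency
    return within, p95

# Example: a handful of in-hours samples (made-up numbers).
share_ok, p95 = responsiveness_sli([820, 1400, 2300, 2600, 1900, 4100, 750])
print(f"{share_ok:.0%} within target, p95 = {p95:.0f} ms")
```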

2) Resolution Quality SLO

  • Indicator: proportion of AI outputs that meet acceptance criteria on a weekly human sample (for example, “fit to send without edits” for 50 sampled emails).
  • Target: 90% pass rate over a rolling 30 days; fallbacks and human escalations do not count against you if they happen before a user sees an error.
  • Why: quality, not just availability, drives trust; the NIST AI RMF emphasises evaluation methods and human oversight. nist.gov
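
The indicator itself is simple arithmetic once the human review is done. A minimal sketch, assuming you record each week’s sample results as pass and review counts (numbers are illustrative):

```python
def quality_sli(sample_results):
    """Pass rate across the weekly review samples in the rolling 30-day window.

    `sample_results` holds (passed, reviewed) counts per weekly sample; fallbacks
    and human escalations are excluded before counting, as agreed above.
    """
    passed = sum(p for p, _ in sample_results)
    reviewed = sum(r for _, r in sample_results)
    return passed / reviewed if reviewed else None

# Example: four weekly samples of 50 emails each.
print(f"{quality_sli([(47, 50), (44, 50), (46, 50), (45, 50)]):.0%} pass rate")
```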

3) Cost per Request SLO

  • Indicator: total model/API spend divided by successful requests.
  • Target: within an agreed band (for example, £0.002–£0.02 per request depending on use case) over a rolling 30 days.
  • Why: reliability work without cost control just moves the problem; link your error budget to a cost budget too so teams know when to switch models or cache. cloud.google.com
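
The arithmetic is deliberately simple. A small sketch, assuming you log spend and an outcome tag per request (the band uses the example figures above):

```python
def cost_per_successful_request(total_spend_gbp, successful_requests):
    """Cost SLI: total model/API spend divided by successful requests."""
    if successful_requests == 0:
        return None
    return total_spend_gbp / successful_requests

def within_cost_band(cost, low=0.002, high=0.02):
    """True if the rolling-30-day cost per request sits inside the agreed band."""
    return cost is not None and low <= cost <= high

# Example: £84 of spend across 6,200 successful requests in the window.
cost = cost_per_successful_request(total_spend_gbp=84.0, successful_requests=6200)
print(f"£{cost:.4f} per request, in band: {within_cost_band(cost)}")
```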

Document your error budget policy: what you’ll pause when you overspend (for example, a release freeze or a mandatory switch to a smaller model). Rolling windows work better than calendar months for small teams. cloud.google.com
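
To make the policy concrete, here is a minimal sketch of an error budget check on a rolling window, wired to the half-used and exhausted actions mentioned above (the thresholds and numbers are illustrative):

```python
def error_budget_burn(good_events, total_events, slo_target=0.95):
    """Fraction of the rolling-window error budget already consumed.

    With a 95% target, 5% of events may miss the SLO before the budget is gone.
    """
    if total_events == 0:
        return 0.0
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad if allowed_bad else float("inf")

# Example: 30-day window, 10,000 requests, 380 missed the responsiveness target.
burn = error_budget_burn(good_events=10_000 - 380, total_events=10_000)
if burn >= 1.0:
    print("Budget exhausted: freeze non-urgent changes and stabilise first.")
elif burn >= 0.5:
    print("Half the budget used: trigger the pre-agreed review actions.")
else:
    print(f"{burn:.0%} of the error budget used.")
```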

Instrument the bare minimum (safely)

Log just enough to run the service and learn from incidents. For each request capture: timestamp, user journey, prompt/prompt‑template ID, model/version, latency, cost, outcome tag (success, fallback, human), and a hashed session/device ID; store minimal examples for weekly QA. The NIST AI RMF playbook provides practical measurement prompts and documentation patterns you can adapt, even in small organisations. nist.gov
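
As a concrete illustration of that log line, here is a minimal sketch in Python. The field names mirror the list above, and the hashing helper is one assumption about how you might group sessions without storing raw session or device IDs:

```python
import hashlib
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AIRequestLog:
    """One record per request: the fields listed above, and nothing more."""
    timestamp: str
    journey: str             # e.g. "inbox-triage" (your own journey names)
    prompt_template_id: str  # a reference, not the prompt text itself
    model_version: str
    latency_ms: int
    cost_gbp: float
    outcome: str             # "success" | "fallback" | "human"
    session_hash: str        # hashed, never the raw session or device ID

def hash_session(raw_id: str, salt: str = "rotate-this-salt") -> str:
    """One-way hash so sessions can be grouped without keeping identifiers."""
    return hashlib.sha256((salt + raw_id).encode()).hexdigest()[:16]

record = AIRequestLog(
    timestamp=datetime.now(timezone.utc).isoformat(),
    journey="inbox-triage",
    prompt_template_id="triage-v3",
    model_version="provider-model-2025-01",
    latency_ms=1840,
    cost_gbp=0.006,
    outcome="success",
    session_hash=hash_session("session-123"),
)
print(asdict(record))
```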

Keep personal data out by default; if you must process it, keep visibility narrow and retention short in your runbook. Align team behaviour with clear acceptance criteria for outputs—see our piece on AI that behaves to define these simply for non‑technical reviewers.

Graceful degradation beats heroics

Most user pain comes from slow or flaky responses. Decide your “grace ladder” now and put it in the runbook, in this order:

  1. Retry once with a short timeout; show a friendly spinner with a guarantee (for example, “we’ll answer in under 3 seconds or show you options”).
  2. Serve from cache if the same question appears frequently (for example, HR policies or key grant criteria).
  3. Fail over to a smaller/cheaper model with a stricter prompt.
  4. Hide the AI button for affected journeys; route to a human form or queue.

These patterns are standard reliability practice in ML workloads and keep users moving while you fix the incident. docs.aws.amazon.com
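
To make the ladder concrete, here is a minimal sketch of the same four steps in order. `call_primary` and `call_backup` stand in for your own provider integrations, and the cache is just a dictionary for illustration:

```python
import time

CACHE = {}  # pre-warmed answers for frequent questions (e.g. HR policies)

def answer(question, call_primary, call_backup, timeout_s=3.0):
    """Walk the grace ladder: retry once, then cache, then backup model, then human."""
    # Step 1: primary model, one retry, each attempt bounded by a short timeout.
    for _ in range(2):
        try:
            start = time.monotonic()
            text = call_primary(question, timeout=timeout_s)
            return {"answer": text, "outcome": "success",
                    "latency_s": time.monotonic() - start}
        except TimeoutError:
            continue
    # Step 2: serve from cache if this question appears frequently.
    if question in CACHE:
        return {"answer": CACHE[question], "outcome": "fallback-cache"}
    # Step 3: fail over to a smaller/cheaper model with a stricter prompt.
    try:
        return {"answer": call_backup(question, timeout=timeout_s),
                "outcome": "fallback-model"}
    except TimeoutError:
        pass
    # Step 4: hide the AI path for this journey and route to a human queue.
    return {"answer": None, "outcome": "human",
            "message": "We've passed this to the team and will reply shortly."}
```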

For a deeper reliability sprint that hardens these behaviours, see our 7‑day AI Reliability Sprint.

Run a one‑hour tabletop drill this week

Pick a realistic but low‑drama scenario: your provider increases latency for 30 minutes during Monday peak, and your cost per request doubles for two hours. In the drill:

  1. Assign roles (incident lead, comms lead, resolver, business owner) and start a timer.
  2. Walk the first 15 minutes: identify the SLO at risk, check dashboards, confirm severity, publish a quick internal update.
  3. Decide the fallback: switch model, raise timeout, or hide the feature—based on the runbook’s grace ladder.
  4. Draft the customer note using your template; decide if you’ll publish a status update.
  5. Close with learning: what data was missing, which steps were fuzzy, what to automate.

You don’t need fancy tools: the National Cyber Security Centre encourages simple, facilitated exercises to build muscle memory, and its Exercise in a Box approach is designed for non‑specialists and scales to SMEs. ncsc.gov.uk

KPIs for your weekly AI reliability review

  • Responsiveness: 95th‑percentile time to first token/answer by journey; slowest 10 journeys called out.
  • Resolution quality: sample pass rate on human review; top three failure reasons with examples.
  • Error budget: spent/remaining by SLO; actions if ≥50% spent.
  • Cost per request: by journey and by model; requests routed to cache or backup.
  • Fallback utilisation: rate of backup model, cached answers, and “button hidden” events.
  • Ops health: time to detect (TTD), time to mitigate (TTM), incidents per 1,000 requests.
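
If you log requests as sketched earlier, several of these KPIs fall straight out of the week’s records; quality pass rates and TTD/TTM still come from your human review and incident notes. A minimal, illustrative aggregation:

```python
from collections import Counter
from statistics import quantiles

def weekly_review(records):
    """Roll a week of request log records (as sketched earlier) into review KPIs."""
    if not records:
        return {}
    latencies = sorted(r["latency_ms"] for r in records)
    outcomes = Counter(r["outcome"] for r in records)
    successes = outcomes.get("success", 0)
    total_cost = sum(r["cost_gbp"] for r in records)
    p95 = quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else latencies[0]
    return {
        "requests": len(records),
        "p95_latency_ms": p95,
        "fallback_rate": (len(records) - successes) / len(records),
        "cost_per_successful_request": total_cost / successes if successes else None,
        "outcomes": dict(outcomes),
    }

# Example with two made-up records.
print(weekly_review([
    {"latency_ms": 1840, "cost_gbp": 0.006, "outcome": "success"},
    {"latency_ms": 5200, "cost_gbp": 0.004, "outcome": "fallback"},
]))
```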

Agree decisions in the meeting: release go/no‑go, model changes, or prompt updates. If error budgets are consistently blown, freeze non‑urgent changes until you stabilise—this is standard SRE hygiene for small teams too. cloud.google.com

A 30‑day rollout plan (fits around normal work)

Week 1 — Baseline and boundaries

  • Write your two‑page runbook (overview + incident procedures). Appoint an owner.
  • Draft SLOs and a simple error budget policy; get sign‑off from product and operations. docs.cloud.google.com
  • Switch on basic logging for latency, cost, outcomes; keep personal data out by default. nist.gov

Week 2 — Fallbacks and drills

  • Implement one cache and a backup model path for your highest‑volume journey.
  • Run the one‑hour tabletop drill; fix the gaps you uncover. ncsc.gov.uk
  • Prepare your comms template and internal status update process. atlassian.com

Week 3 — Dashboards and reviews

  • Build a lightweight dashboard for the KPIs above; share with leaders weekly.
  • Hold your first reliability review; decide one improvement and one cost action.
  • Test your error budget policy with a simulated overspend. cloud.google.com

Week 4 — Harden and handover

  • Extend the runbook to the full seven pages; add a vendor status page list.
  • Tighten SLO targets if you’re comfortably inside them; else stabilise first. cloud.google.com
  • Confirm on‑call rota, escalation contacts and holiday cover.

Go‑live checklist for directors and trustees

  • Owner named for the AI service with out‑of‑hours cover.
  • Three SLOs approved with a written error budget policy. docs.cloud.google.com
  • Fallbacks tested: cache, backup model, and “hide button” path. docs.aws.amazon.com
  • Dashboards working: latency, cost, quality sample, error budget.
  • Tabletop drill completed in the last 30 days; follow‑ups closed. ncsc.gov.uk
  • Post‑incident routine agreed: blameless review within 5 working days. sre.google

If you want a tighter financial lens, pair this runbook with our 90‑day AI Cost Guardrail and, for safer launches, run new features in shadow‑mode first.

Common objections (and quick answers)

“We’re small—SLOs feel overkill.”

One page of targets prevents hours of unplanned firefighting later. SLOs and error budgets exist to balance reliability with feature work; even micro‑teams benefit. docs.cloud.google.com

“We’ll just call the vendor if it breaks.”

Good—add vendor status pages and escalation to your runbook—but your users will still expect continuity during a vendor wobble. Fallbacks are your safety net. docs.aws.amazon.com

“We did a pilot, users loved it—why change?”

Pilots lack the stress of Monday peaks, cost spikes and model changes. Production readiness means you’ve rehearsed incidents and set thresholds everyone understands. sre.google

When you’re ready to raise the bar, follow our reliability sprint to move from “working” to “dependable”.

Book a 30‑minute call, or email: team@youraiconsultant.london