If your AI pilot works “most of the time”, it is not ready for customers, case workers or fee earners. This one‑week sprint turns a promising demo into a dependable service using three foundations: measurable service levels, safe fallbacks, and a plain‑English runbook. It’s written for non‑technical leaders who need reliability without stalling delivery.
We lean on proven operations practices (service level objectives and error budgets), UK cyber guidance for AI systems, and pragmatic cost governance so you can scale without surprises. sre.google
What “production‑ready” looks like for AI in an SME/charity
- Explicit SLOs for quality, speed and availability (e.g., 95% “acceptable” answers on a scored test set; 95th percentile response under 3 seconds; error budget reviewed weekly). sre.google
- Golden dataset that represents your work: 50–200 typical queries with expected answers and a scoring rubric, maintained by the business.
- Safe fallbacks: if the model is slow, uncertain or blocked, you degrade gracefully—route to a template, search result, or a human queue rather than failing loudly.
- Observability and logs that show inputs, outputs, guardrail triggers and costs (with redaction for sensitive data). Use standard logging advice adapted for AI events. ncsc.gov.uk
- Cost guardrails you can explain to finance: budget per request and per month, plus a unit cost KPI (such as “cost per resolved enquiry”). finops.org
- Runbook anyone can follow: common failure modes, who to call, how to roll back, and how to switch models if needed—aligned to UK guidance on operating AI securely. gov.uk
The 7‑day AI Reliability Sprint
Day 1 — Define the service and set SLOs
- Service boundary: write one sentence on what the AI does and what it never does (e.g., “summarises internal policies; never gives employment law advice”).
- Pick 3–5 SLIs to track: quality score on golden set, P95 latency, automation rate (or human handoff rate), deflection rate, and cost per successful task. Set pragmatic SLO targets and an error budget to allow learning without chaos; a short sketch after this list shows how to write these targets down as data. sre.google
- Decision fences you’ll stick to: which models are allowed, which data sources are in scope, and who signs off changes.
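To make the Day 1 output concrete, the SLOs can be written down as data rather than prose, so a dashboard or alert can read them directly. Below is a minimal sketch in Python; the metric names, targets and worked numbers are illustrative assumptions for your team to replace, not a standard.

```python
from dataclasses import dataclass

@dataclass
class SLO:
    """One service level objective: the thing we measure and the target we promise."""
    name: str
    target: float        # e.g. 0.95 means 95% of events in the window must be "good"
    window_days: int = 30

def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left in the window (1.0 = untouched, 0.0 = spent)."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1 - slo.target) * total_events   # how many "bad" events the budget allows
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return max(0.0, 1 - actual_bad / allowed_bad)

# Illustrative targets: replace with the numbers your team signs off on Day 1.
SLOS = [
    SLO("quality: acceptable answers on the golden set", target=0.95),
    SLO("latency: responses under 3 seconds", target=0.95),
    SLO("availability: successful requests", target=0.995),
]

if __name__ == "__main__":
    # Example week: 970 acceptable answers out of 1,000 scored requests.
    quality = SLOS[0]
    left = error_budget_remaining(quality, good_events=970, total_events=1000)
    print(f"{quality.name}: {left:.0%} of the error budget left")   # prints 40%
```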
Day 2 — Build a golden dataset and a simple scoring rubric
- Collect 50–200 real, permission‑cleared cases or queries from your inboxes and ticketing tools.
- For each, capture the “good answer”, one unacceptable answer, and a 1–5 scoring rubric explained in plain English.
- Run your pilot over the golden set three times to set a baseline; your quality SLO will use this benchmark going forward. A minimal scoring sketch follows this list.
- Use external guidance to sanity‑check risks and measurement. The NIST AI RMF and the Generative AI Profile list typical risks and controls you can translate into acceptance criteria. nist.gov
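One way to picture the golden dataset: each entry is a small record holding the query, the expected good answer, one known‑bad answer, and the plain‑English rubric. The sketch below assumes that shape; `score_answer` is a placeholder for your human reviewers (or an automatic scorer you have checked against them), not a real library.

```python
import statistics
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query: str            # a real, permission-cleared question
    good_answer: str      # what an acceptable answer looks like
    bad_answer: str       # an example of an unacceptable answer
    rubric_note: str      # plain-English reminder of what 1-5 means for this case

def score_answer(case: GoldenCase, model_answer: str) -> int:
    """Placeholder: in practice a human reviewer (or an automatic scorer you have
    checked against reviewers) applies the 1-5 rubric. Returns a dummy score here."""
    return 4

def baseline_run(cases: list[GoldenCase], ask_model) -> float:
    """Run the pilot over the whole golden set once and report the share of answers
    scoring 4 or above ("acceptable"). Do this three times and keep all three numbers."""
    scores = [score_answer(c, ask_model(c.query)) for c in cases]
    acceptable = sum(1 for s in scores if s >= 4) / len(scores)
    print(f"median score {statistics.median(scores)}, acceptable rate {acceptable:.0%}")
    return acceptable

if __name__ == "__main__":
    cases = [GoldenCase(
        query="What is our refund policy?",
        good_answer="Summarises the 14-day policy and points to the full document.",
        bad_answer="Invents a 30-day policy that does not exist.",
        rubric_note="5 = accurate and sourced in plain English; 1 = wrong or invented")]
    baseline_run(cases, ask_model=lambda q: "stub answer from the pilot")
```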
Day 3 — Observability, logging and safety controls
- Log the essentials: timestamp, use case, model/version, input category (not raw personal data), guardrail triggers, score, and token cost; a structured‑log sketch follows this list. Follow UK logging advice and keep sensitive data out of logs unless needed for incident response. ncsc.gov.uk
- Dashboards: show SLOs at a glance—quality score trend, P95 latency, handoff rate, and cost per successful task.
- Safety controls: set limits for prompt length, response length, and rate; enable content filters and PII redaction before data leaves your tenant. The UK AI Cyber Security Code of Practice emphasises asset inventory, version control and the ability to restore to a known good state—apply that to model, data and prompt versions. gov.uk
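Those essentials fit naturally into one structured log record per request, with categories rather than raw text so personal data stays out of the logs. The sketch below uses Python's standard logging module; the field names and example values are assumptions to adapt to your own tooling.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("ai_service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_ai_event(*, use_case: str, model_version: str, prompt_version: str,
                 input_category: str, guardrail_triggers: list[str],
                 quality_score: int | None, token_cost_gbp: float,
                 latency_ms: int) -> str:
    """Write one structured record per request. Categories only: no raw prompts,
    outputs or personal data go into the log line."""
    trace_id = str(uuid.uuid4())
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "use_case": use_case,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "input_category": input_category,          # e.g. "policy_question"
        "guardrail_triggers": guardrail_triggers,  # e.g. ["pii_redacted"]
        "quality_score": quality_score,            # filled in later if sampled for review
        "token_cost_gbp": round(token_cost_gbp, 4),
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))
    return trace_id

# Example: one request on the policy-summary use case.
log_ai_event(use_case="policy_summary", model_version="model-2025-01",
             prompt_version="v3", input_category="policy_question",
             guardrail_triggers=[], quality_score=None,
             token_cost_gbp=0.012, latency_ms=1400)
```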
Day 4 — Plan fallbacks and a kill‑switch
- Fallback triggers: if P95 latency breaches 3 seconds, return a trusted template; if confidence is low or a guardrail trips, hand off to a human queue; if costs spike, switch to a smaller model for low‑risk intents. One way to wire these checks is sketched after this list.
- Kill‑switch: one toggle to disable the AI path and route everything to the existing process. Keep this in your runbook with named owners.
- These controls align with international secure AI deployment guidance backed by the NCSC and partners. cisa.gov
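The triggers and the kill‑switch can be expressed as a handful of explicit checks in front of the model call, so the behaviour is visible, testable and easy to explain in the runbook. The sketch below is illustrative only: the thresholds and the `call_model`, `call_smaller_model`, `trusted_template` and `human_queue` hooks stand in for your own system.

```python
KILL_SWITCH_ON = False        # one toggle, owned by a named person in the runbook
LATENCY_SLO_MS = 3000         # P95 target from Day 1
MONTHLY_BUDGET_GBP = 500.0    # ceiling from Day 6

def handle_request(query, intent, spend_to_date_gbp, recent_p95_ms,
                   call_model, call_smaller_model, trusted_template, human_queue):
    """Decide whether to use the AI path, degrade gracefully, or fall back entirely."""
    # 1. Kill-switch: route everything to the existing, non-AI process.
    if KILL_SWITCH_ON:
        return human_queue(query, reason="kill_switch")

    # 2. Cost spike: degrade low-risk intents to a cheaper model when burn is high.
    if spend_to_date_gbp > 0.9 * MONTHLY_BUDGET_GBP and intent == "low_risk":
        return call_smaller_model(query)

    # 3. Latency breach: serve a trusted template rather than keep users waiting.
    if recent_p95_ms > LATENCY_SLO_MS:
        return trusted_template(query)

    answer, confidence, guardrail_tripped = call_model(query)

    # 4. Low confidence or a guardrail trip: hand off to the human queue.
    if guardrail_tripped or confidence < 0.7:
        return human_queue(query, reason="low_confidence_or_guardrail")

    return answer
```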
Day 5 — Data and model hygiene
- Asset inventory: list models, prompts, datasets, evaluations and logs; assign an owner and retention period to each (a simple register sketch follows this list). The UK code of practice calls for tracking and protecting these assets. gov.uk
- Versioning: label model and prompt versions; record when and why they changed; re‑run the golden set after every change.
- Supplier alignment: ask your vendor how they expose model version changes and deprecations, and how you can roll back.
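A lightweight way to keep the inventory honest is a single register, kept under version control, with one row per asset. The shape below is an assumption; the same columns work just as well in a spreadsheet.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Asset:
    """One row in the AI asset register: a model, prompt, dataset, evaluation or log store."""
    name: str
    kind: str             # "model", "prompt", "dataset", "evaluation" or "logs"
    version: str
    owner: str
    retention: str        # plain-English retention period
    last_changed: date
    change_reason: str

REGISTER = [
    Asset("policy-summariser prompt", "prompt", "v3", "Head of Operations",
          "keep all versions for 2 years", date(2025, 1, 10),
          "tightened instruction to refuse employment law advice"),
    Asset("golden dataset", "dataset", "2025-01", "Service manager",
          "review quarterly", date(2025, 1, 5),
          "added 20 new enquiry types"),
]

def changes_since(register: list[Asset], cutoff: date) -> list[Asset]:
    """Anything changed since the cutoff should trigger a golden-set re-run."""
    return [a for a in register if a.last_changed >= cutoff]
```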
Day 6 — Cost guardrails and unit economics
- Set a unit cost KPI that suits your context: “cost per resolved enquiry”, “cost per case triaged”, or “cost per thousand words summarised”. The FinOps community recommends unit metrics to tie spend to value. finops.org
- Budget rules: monthly ceiling; alerts at 50/75/90%; auto‑degrade to cheaper patterns for low‑risk intents when burn is high. The arithmetic is sketched after this list.
- Showback to teams: weekly email with usage, unit cost and SLO scores. Expect behaviour change once teams can see their own numbers.
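The budget rules reduce to arithmetic your finance lead can check line by line. The sketch below uses illustrative figures (a £500 monthly ceiling, £380 spent so far, 1,900 resolved enquiries); replace them with your own.

```python
MONTHLY_CEILING_GBP = 500.0
ALERT_THRESHOLDS = (0.50, 0.75, 0.90)   # alert at 50/75/90% of the ceiling

def budget_alerts(spend_to_date_gbp: float) -> list[str]:
    """Which alert levels has this month's spend already crossed?"""
    burn = spend_to_date_gbp / MONTHLY_CEILING_GBP
    return [f"crossed {t:.0%} of the monthly ceiling" for t in ALERT_THRESHOLDS if burn >= t]

def unit_cost(total_spend_gbp: float, successful_tasks: int) -> float:
    """Unit cost KPI, for example cost per resolved enquiry."""
    return total_spend_gbp / successful_tasks if successful_tasks else 0.0

# Example month so far: £380 spent, 1,900 enquiries resolved via the AI path.
print(budget_alerts(380.0))                                   # 50% and 75% alerts fired
print(f"£{unit_cost(380.0, 1900):.3f} per resolved enquiry")  # £0.200
```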
Day 7 — Rehearse the runbook (tabletop exercise)
- Run a 45‑minute tabletop using three scenarios: supplier outage, unexpected cost spike, and harmful output detected.
- Use the NCSC's free Exercise in a Box materials to structure the session; they are designed so non‑technical teams can run them. ncsc.gov.uk
- Record gaps, fix owners and deadlines, then repeat monthly until scores stabilise.
Decision checklist you can complete this afternoon
- People: Who approves prompt/model changes? Who owns the runbook, and who is on call?
- Quality: What’s your acceptable score on the golden dataset? Who can change the rubric?
- Latency: What’s the P95 target? What happens if it’s breached for 15 minutes?
- Safety: Which categories are blocked? What are the escalation paths?
- Cost: What’s your unit cost KPI and monthly ceiling? What’s your degrade plan when burn increases?
- Transparency: If you’re in the public sector or delivering for it, how will you produce and publish an Algorithmic Transparency Record? The UK ATRS guidance and template explain scope, exemptions and how to publish. gov.uk
Vendor due‑diligence questions (shortlist in 10 minutes)
- Reliability: Do you publish model/version identifiers and deprecation notices? Can we pin and roll back?
- Observability: Which events and safety signals can we log? Do you expose per‑request costs?
- Controls: What rate limits, output filters, and redaction tools exist? Can we enforce geography/tenancy?
- Testing: How do you support our golden set and regression checks? Can we run canaries before a full cutover?
- Security posture: How do you align with the UK AI Cyber Security Code of Practice and the joint secure AI development guidelines? gov.uk
Risks and the controls that keep you out of trouble
| Risk | Practical control | Owner | KPI / SLO |
|---|---|---|---|
| Unstable quality after updates | Pin model version; re‑run golden set before rollout; error budget policy | Product owner | ≥95% acceptable answers on golden set; change failure rate |
| Unexpected cost spikes | Budget alerts; per‑request cap; auto‑degrade model for low‑risk intents | Ops/Finance | Unit cost within ±10% month‑on‑month; spend vs cap |
| Slow or unavailable model | Fallback templates; cached answers; human queue; vendor SLA | Ops | P95 latency under 3s; availability ≥99.5% |
| Unsafe outputs | Input/output filters; prompt rules; red team questions in golden set | Safeguarding lead | 0 critical safety incidents per month |
| Weak audit trail | Structured logs for model, prompt and decision; secure retention | DPO/IT | 100% requests have trace ID; log retention policy applied |
These controls reflect mainstream SRE practice for service levels and incident response, adapted to AI systems. sre.google
KPIs to show the board next month
- Reliability: P95 latency; availability; number of incidents; error budget burn (a worked example follows this list). sre.google
- Quality: Golden‑set score; rework rate (how often humans edit AI output); escalation rate.
- Safety: Blocked requests; policy overrides; training refresh cadence.
- Cost: Unit cost KPI; % of spend on baseline vs premium models; forecast vs actual. finops.org
- Adoption: % of teams hitting their SLOs; satisfaction rating from users; time saved.
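If “error budget burn” is new to the board, a worked example helps: a 99.5% availability SLO over 30 days allows about 3.6 hours of downtime, so a two‑hour outage burns just over half of that month's budget. The figures below are assumptions for illustration.

```python
def error_budget_burn(slo_target: float, window_hours: float, downtime_hours: float) -> float:
    """Share of the availability error budget consumed in the window."""
    allowed_downtime = (1 - slo_target) * window_hours
    return downtime_hours / allowed_downtime

# A 99.5% availability SLO over a 30-day window allows 0.5% of 720 hours = 3.6 hours down.
burn = error_budget_burn(slo_target=0.995, window_hours=30 * 24, downtime_hours=2.0)
print(f"{burn:.0%} of this month's error budget burned")   # about 56%
```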
Runbook outline (copy and fill in names)
- Detect: Alerts for SLO breaches, cost caps, and safety triggers. Duty manager paged.
- Stabilise: Switch to fallback template or human queue; enable smaller model where safe.
- Diagnose: Check dashboard for model/version changes, latency trends, and guardrail hits.
- Decide: Roll back model/prompt; hold changes until golden set passes again.
- Document: Incident note with root cause, customer impact, and follow‑ups.
- Exercise: Run a 30‑minute tabletop next week to rehearse the scenario. ncsc.gov.uk
A note for public sector teams and suppliers
Central government teams are expected to publish Algorithmic Transparency Records for tools with public impact. If you sell into—or partner with—the public sector, align your playbook to the ATRS template now so publishing a record is a 30‑minute task, not a 3‑week scramble. gov.uk
Sources
We drew on the Google SRE approach to SLOs and error budgets, UK government guidance on securing AI systems and publishing algorithmic transparency records, the NIST AI Risk Management Framework, and FinOps unit economics. See citations throughout for details. sre.google