Delivery & Ops Playbook

The Go‑Live Gate for AI: A One‑Page Production Readiness Review for UK SMEs

Moving an AI pilot into production is where most of the value is won—or lost. What works in a demo often fails in the real world: requests spike, inputs get messy, a provider changes a model version, or a prompt injection slips through. This article gives UK SMEs and charities a practical “Go‑Live Gate”: a one‑page review you can run in under two hours to decide if an AI feature is ready to serve real users this week.

We draw on widely used operational practices—service level objectives and error budgets, progressive rollouts (shadow and canary), and observability—plus new UK guidance on the cyber security of AI systems. The aim is simple: launch faster, reduce surprises, and keep costs and risks within your appetite. sre.google

Why a “Go‑Live Gate” now?

  • Reliability expectations are unchanged. Users still expect fast, correct, and stable responses; SLOs and error budgets are how modern teams balance reliability and speed of change. sre.google
  • Rollouts must be progressive. Shadow mode lets you test with real traffic without impacting users; canary releases ramp safely from a small percentage to everyone. docs.aws.amazon.com
  • UK guidance is getting specific. The government’s AI Cyber Security Code of Practice sets baseline expectations for asset inventories, incident plans, and securing APIs and infrastructure—practical points for your go‑live checklist. gov.uk
  • Risk management is becoming operational. NIST’s AI RMF frames day‑to‑day actions—govern, map, measure, manage—that translate cleanly into a production gate. nist.gov

The one‑page production readiness review

Print these 12 checks. If any are a “no”, fix before launch—or commit to a small canary and time‑boxed shadow test.

1) Reliability & performance

  • SLO agreed (availability and latency). Example: 99.5% success rate, p95 latency ≤ 2.5s for assisted chat; define an error budget and a pause rule if it’s consumed. sre.google
  • Fallbacks documented for timeouts, provider limits, and classification failures (e.g., “try second model”, “return safe default”, “route to human”); see the sketch after this list.
  • Load expectations clear: peak requests per minute and concurrency tested against a synthetic spike of at least 2× normal.
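
To make the fallback item concrete, here is a minimal Python sketch of a fallback chain: second model, then safe default, then human hand‑off. The names call_primary, call_backup and route_to_human are placeholders for your own integrations, and the 2.5s timeout simply mirrors the example SLO above.

```python
import time

class ProviderError(Exception):
    """Raised on timeouts, rate limits, or classification failures."""

def call_primary(prompt: str, timeout: float) -> str:
    # Placeholder: swap in your primary provider's API call.
    raise ProviderError("primary unavailable (stub)")

def call_backup(prompt: str, timeout: float) -> str:
    # Placeholder: swap in your second-choice model.
    return f"[backup answer to: {prompt[:40]}]"

def route_to_human(prompt: str) -> None:
    # Placeholder: push to your support queue or ticketing system.
    print("Routed to human review:", prompt[:60])

def call_with_fallbacks(prompt: str, timeout_s: float = 2.5) -> dict:
    """Try the primary model, then the backup, then a safe default plus human hand-off."""
    for name, call in (("primary", call_primary), ("backup", call_backup)):
        start = time.monotonic()
        try:
            answer = call(prompt, timeout=timeout_s)
            return {"answer": answer, "source": name,
                    "latency_s": round(time.monotonic() - start, 3)}
        except ProviderError:
            continue  # documented fallback: try the next option in the chain
    route_to_human(prompt)
    return {"answer": "Sorry, I can't help with that right now; a colleague will follow up.",
            "source": "safe_default", "latency_s": None}

print(call_with_fallbacks("What are your opening hours?"))
```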

2) Safety & security

  • Threats mapped including prompt injection, data leakage, and model misuse; controls in place (input/output filtering, rate limits, isolation of secrets). Align with the UK AI Cyber Security Code of Practice principles on design, threat evaluation and secure infrastructure. gov.uk
  • Incident runbook includes owner, comms template, and a known good state you can restore to in under 30 minutes. gov.uk
  • Observability wired: prompts, model/version, latency, token counts, filter hits, and user outcomes are logged with privacy in mind; alerts on error spikes and cost anomalies. security.gov.uk
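
A minimal sketch of what “observability wired” can look like in practice: one structured record per request, with the prompt hashed rather than stored raw, plus example alert thresholds. The field names, model name and figures are illustrative, not a standard schema.

```python
import hashlib, json, time
from dataclasses import dataclass, asdict

@dataclass
class AIEvent:
    """One structured record per request; the prompt is hashed, not stored raw."""
    prompt_sha256: str
    model: str
    model_version: str
    latency_s: float
    tokens_in: int
    tokens_out: int
    filter_hits: int     # how many input/output safety filters triggered
    outcome: str         # e.g. "success", "fallback", "handed_off", "error"
    cost_gbp: float
    timestamp: float

def log_event(prompt: str, **fields) -> AIEvent:
    event = AIEvent(prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
                    timestamp=time.time(), **fields)
    print(json.dumps(asdict(event)))  # ship to your log pipeline instead of stdout
    return event

# Illustrative alert thresholds; tie them to the SLO above and the spend caps in the next section.
ALERTS = {"error_rate_5min": 0.02, "p95_latency_s": 2.5, "hourly_spend_gbp": 5.00}

log_event("What are your opening hours?", model="example-model", model_version="2025-01",
          latency_s=1.2, tokens_in=180, tokens_out=90, filter_hits=0,
          outcome="success", cost_gbp=0.004)
```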

3) Cost & scalability

  • Unit economics defined: cost per successful outcome (not per request) with a target and alert threshold.
  • Guardrails live: per‑minute and daily spend caps, token ceilings for long prompts, and safe degradation if limits are hit (sketch after this list).
  • Capacity plan for seasonal peaks and provider throttling; pre‑approved switch to an alternate model or cached answers for the top queries.
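
As a sketch of the guardrails item above: per‑minute and daily spend caps plus a token ceiling, with a “degrade” signal when limits are hit. The cap values are placeholders; set them from your own unit economics.

```python
import time
from collections import deque

class SpendGuard:
    """Illustrative guardrails: per-minute and daily spend caps plus a token ceiling.
    The figures are examples, not recommendations."""

    def __init__(self, per_minute_gbp=0.50, daily_gbp=25.00, max_prompt_tokens=4000):
        self.per_minute_gbp = per_minute_gbp
        self.daily_gbp = daily_gbp
        self.max_prompt_tokens = max_prompt_tokens
        self.recent = deque()       # (timestamp, cost) pairs for the last minute
        self.spent_today = 0.0

    def allow(self, prompt_tokens: int, est_cost_gbp: float) -> str:
        now = time.time()
        while self.recent and now - self.recent[0][0] > 60:
            self.recent.popleft()
        minute_spend = sum(cost for _, cost in self.recent)
        if prompt_tokens > self.max_prompt_tokens:
            return "degrade"        # e.g. truncate context or answer from cache
        if self.spent_today + est_cost_gbp > self.daily_gbp:
            return "degrade"
        if minute_spend + est_cost_gbp > self.per_minute_gbp:
            return "retry_later"    # queue the request or return a cached answer
        self.recent.append((now, est_cost_gbp))
        self.spent_today += est_cost_gbp
        return "allow"

guard = SpendGuard()
print(guard.allow(prompt_tokens=1200, est_cost_gbp=0.01))  # "allow"
```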

4) People & process

  • RACI agreed for on‑call, data owner, and product sign‑off; change window set for rollout.
  • Shadow → canary plan documented: duration, traffic %, success criteria, and rollback trigger (example plan after this list). docs.aws.amazon.com
  • Post‑launch review scheduled (day 7) to decide expand, iterate, or roll back.
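
One way to document the shadow → canary plan so the criteria are unambiguous is to write it down as data. The stages and thresholds below are examples to agree with your team, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class RolloutStage:
    name: str
    traffic_pct: int
    min_duration_hours: int
    success_criteria: dict   # metric -> threshold that must hold before progressing
    rollback_trigger: str

# Example plan matching the 7-day schedule below; every threshold is a placeholder.
PLAN = [
    RolloutStage("shadow", 0, 24,
                 {"output_match_rate": 0.90, "p95_latency_s": 2.5},
                 "n/a - users never see shadow responses"),
    RolloutStage("canary", 10, 24,
                 {"success_rate": 0.995, "p95_latency_s": 2.5, "cost_per_outcome_gbp": 0.05},
                 "error-budget burn rate > 5%/hr, or any safety incident"),
    RolloutStage("expand", 50, 48,
                 {"success_rate": 0.995, "p95_latency_s": 2.5},
                 "same triggers as the canary stage"),
]
```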

Your 7‑day go‑live plan

  1. Day 1 — Baseline: confirm SLOs, error budget, and the metrics you’ll track (success rate, p95 latency, cost per outcome).
  2. Day 2 — Controls: implement input/output filters, rate limits, and secrets isolation; rehearse the incident runbook. Align with the UK Code’s principles on secure design and infrastructure. gov.uk
  3. Day 3 — Observability: ensure logs include prompts, versions, outcomes; create alerts for error spikes and spend anomalies. security.gov.uk
  4. Day 4 — Shadow test: mirror a slice of live traffic to the new version; compare outputs and latencies without impacting users. docs.aws.amazon.com
  5. Day 5 — Canary 10%: roll to a small cohort or traffic slice with a clear rollback rule (e.g., error budget burn rate > 5%/hr); see the burn‑rate sketch after this plan. cloud.google.com
  6. Day 6 — Decision: review metrics; if green, progress to 25–50%; if amber, iterate; if red, roll back and capture learnings.
  7. Day 7 — Review: share the post‑launch summary, confirm next increments, and log improvements to prompts, retrieval, or guardrails.
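
The Day 5 rollback rule can be checked with a few lines of arithmetic. The sketch below assumes a 30‑day budget period and that traffic during the measurement window is typical for the service; both are assumptions to adjust.

```python
def budget_burn_per_hour(failed: int, total: int, slo: float = 0.995,
                         window_hours: float = 1.0, period_hours: float = 30 * 24) -> float:
    """Share of the whole period's error budget consumed per hour, measured over a short window.

    Assumes traffic in the window is typical, so the period's budget is estimated as
    (1 - slo) * total * (period_hours / window_hours) failed requests."""
    budget = (1 - slo) * total * (period_hours / window_hours)
    if budget == 0:
        return 0.0
    return (failed / budget) / window_hours

# Canary hour: 1,000 requests and 20 failures against a 99.5% SLO with a 30-day budget period.
rate = budget_burn_per_hour(failed=20, total=1000)
print(f"{rate:.1%} of the error budget burned per hour")  # ~0.6%/hr, well under the 5%/hr trigger
roll_back = rate > 0.05
```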

Want more detail on sign‑off? Pair this with our 5‑day UAT for AI features and AI quality scoreboard.

Risk and cost at a glance

For each risk, note the early warning signs, the impact if ignored, and a low‑effort control.

  • Provider change (model behaviour or rate limits). Early warning: quality dip, error spikes, higher latency. Impact if ignored: poor answers, timeouts, user churn. Low‑effort control: pin versions where possible; add shadow tests and a “champion/challenger” model switch.
  • Prompt injection or unsafe output. Early warning: filter hits rise; unexpected citations or requests. Impact if ignored: data leak, reputational harm. Low‑effort control: input/output filtering; least‑privilege design; clear incident plan per the UK Code. gov.uk
  • Cost runaway. Early warning: token spikes; long prompts; retrieval misses. Impact if ignored: budget blown; forced rollback. Low‑effort control: spend caps; token ceilings; cache or short‑prompt patterns; weekly cost‑per‑outcome KPI.
  • Data drift. Early warning: more hand‑offs to humans; lower success rate. Impact if ignored: quality decay; growing support backlog. Low‑effort control: shadow new versions monthly; refresh retrieval sources; track success rate by topic.
  • Observability gaps. Early warning: you can’t trace failures end‑to‑end. Impact if ignored: slow incidents; guesswork decisions. Low‑effort control: log prompts, versions, outcomes; alert on errors and spend; follow UK observability guidance. security.gov.uk

The rollout mechanics explained (in plain English)

Shadow testing

Shadow mode quietly sends a copy of real requests to your new AI version and logs the results; users still see the old version. You compare accuracy, latency and safety filter hits, without risking user impact. It’s the cleanest way to catch surprises before a live rollout. docs.aws.amazon.com
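
A minimal sketch of shadow mirroring, assuming a simple threaded service; current_version and candidate_version are placeholders for your old and new model calls. The key property is that the shadow call runs off the request path, so its failures never reach the user.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def current_version(prompt: str) -> str:
    # Placeholder for the version users see today.
    return "live answer"

def candidate_version(prompt: str) -> str:
    # Placeholder for the new version under evaluation.
    return "shadow answer"

def shadow_compare(prompt: str, live: str) -> None:
    """Runs off the request path; any failure here never reaches the user."""
    try:
        shadow = candidate_version(prompt)
        print({"prompt_len": len(prompt), "outputs_match": live.strip() == shadow.strip()})
    except Exception as exc:
        print({"shadow_error": str(exc)})

def handle_request(prompt: str) -> str:
    live = current_version(prompt)              # users only ever get the current version
    _pool.submit(shadow_compare, prompt, live)  # a copy of the request goes to the candidate
    return live

print(handle_request("What are your opening hours?"))
```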

Canary releases

Canarying moves a small percentage of traffic (or a small user cohort) to the new version—say 10%, then 25%, 50%, and 100%—so you can stop or roll back if reliability, costs or safety drift outside thresholds. cloud.google.com
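
A common way to implement the cohort split is to hash a stable identifier into 100 buckets: a user stays in the canary as the percentage ramps up rather than flipping between versions per request. A small sketch:

```python
import hashlib

def canary_bucket(user_id: str, canary_pct: int) -> str:
    """Deterministically assign a user to the canary or stable version.

    Hashing the user ID keeps each user on the same version as the percentage
    ramps (10 -> 25 -> 50 -> 100), because bucket < 10 implies bucket < 25, and so on."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "stable"

for uid in ["alice@example.org", "bob@example.org", "carol@example.org"]:
    print(uid, "->", canary_bucket(uid, canary_pct=10))
```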

SLOs and error budgets

An SLO sets the target reliability; the error budget is the small amount of failure you’re willing to tolerate. If the canary burns the budget too quickly, you pause feature work and fix reliability first. It’s a fair, pre‑agreed rule that keeps teams aligned. sre.google
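
To make the numbers tangible, here is what a 99.5% SLO implies over a 30‑day month, using an illustrative traffic volume:

```python
# What a 99.5% monthly SLO means in concrete numbers (illustrative volumes).
slo = 0.995
monthly_requests = 60_000                          # example traffic for a small service

error_budget_requests = (1 - slo) * monthly_requests
error_budget_minutes = (1 - slo) * 30 * 24 * 60    # if you think in downtime terms instead

print(f"Allowed failed requests this month: {error_budget_requests:.0f}")  # 300
print(f"Equivalent downtime budget: {error_budget_minutes:.0f} minutes")   # 216 minutes
```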

Vendor and contract questions to close before go‑live

  • Which model version(s) will serve our traffic, and how are changes communicated? Can we pin versions during a canary?
  • What are your rate limits and how do you throttle under load? Do you offer burst pools for peak periods?
  • What availability and latency SLAs apply? How do you measure them? What credits are offered for breach?
  • How do we export logs of prompts, responses, and latency for our observability stack?
  • What built‑in safety filters exist and how can we tune them? Are block‑lists and allow‑lists supported?
  • Can we run a shadow test for a week without extra fees? What guardrails protect our data during shadowing? docs.aws.amazon.com
  • How do you handle incidents? Do you operate a 24/7 on‑call and publish a post‑incident report?
  • Do you provide regional failover or alternative endpoints if your primary region degrades?
  • What’s your deprecation policy for models, APIs and safety features?
  • Can we cap spend at the project and daily level, and set per‑request ceilings?

Use these alongside the UK government’s AI cyber security principles for secure design, threat evaluation, asset tracking and incident readiness. gov.uk

KPIs to review weekly for the first 90 days

  • Quality: task success rate; first‑pass resolution; human hand‑off rate; top 10 failure reasons.
  • Reliability: success rate vs SLO; p95 latency; error‑budget burn; safety filter hit rate.
  • Cost: cost per successful outcome; average tokens per request; cache hit rate; top 5 expensive prompts.
  • Safety: blocked requests by category; incidents opened/closed; mean time to recover.
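
If you already log one structured record per request, the weekly scoreboard falls out of a short rollup. A sketch with made‑up figures:

```python
import math

# One dict per request outcome; in practice these come from your log store.
events = [
    {"outcome": "success", "cost_gbp": 0.004, "handed_off": False, "latency_s": 1.1},
    {"outcome": "success", "cost_gbp": 0.006, "handed_off": False, "latency_s": 2.0},
    {"outcome": "error",   "cost_gbp": 0.005, "handed_off": True,  "latency_s": 4.2},
]

total = len(events)
successes = [e for e in events if e["outcome"] == "success"]
success_rate = len(successes) / total
handoff_rate = sum(e["handed_off"] for e in events) / total
cost_per_outcome = sum(e["cost_gbp"] for e in events) / max(len(successes), 1)  # per *successful* outcome
latencies = sorted(e["latency_s"] for e in events)
p95_latency = latencies[math.ceil(0.95 * total) - 1]  # naive nearest-rank percentile

print(f"success {success_rate:.1%} | hand-off {handoff_rate:.1%} | "
      f"cost/outcome £{cost_per_outcome:.3f} | p95 {p95_latency:.1f}s")
```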

If you prefer a structured scoreboard, adapt our AI quality scoreboard.

Who needs to be in the room

  • Product owner (acceptance and go/no‑go), Ops/on‑call (runbook and alerts), Data lead (inputs, logs, privacy), Customer team (hand‑offs), and a Director/Trustee sponsor for escalation.
  • Agree a 30‑minute change window for the canary and who can authorise rollback on the call.

For incident rehearsal, borrow the structure from our AI incident drill.

Make it UK‑ready (without the red tape)

You do not need a heavy compliance project to go live responsibly. Map your risks, measure what matters, and manage changes through small, controlled rollouts—the same governance rhythm recommended by NIST’s AI RMF. If your service touches sensitive processes, the UK Code of Practice offers practical, operational requirements to fold into your gate as you scale. nist.gov

Need help pressure‑testing your gate? Our earlier pieces on UAT and feature flags slot straight in.

Appendix: the one‑pager to print

  • Reliability: SLOs set • error budget defined • fallbacks listed • load assumptions tested.
  • Security: threats mapped • filters/rate limits on • incident runbook rehearsed • restore plan in 30 mins. gov.uk
  • Observability: prompts/models/outcomes logged • alerts for errors and spend • dashboards ready. security.gov.uk
  • Rollout: shadow plan • canary % and criteria • rollback trigger • comms template. docs.aws.amazon.com
  • Cost: per‑outcome KPI • caps and ceilings • cache strategy • weekly review cadence.
  • People: owners named • change window • post‑launch review scheduled.