Delivery & Ops Playbook

The Quiet Cutover: a 30‑day plan to take your AI pilot live without surprises

Many UK SMEs and charities now have at least one working AI pilot — a summariser for project reports, a customer helpdesk assistant, or a retrieval tool for policies and procedures. The sticking point is turning that promising pilot into a reliable, affordable production service without disrupting day‑to‑day operations.

This article gives you a practical, non‑technical cutover plan you can run in 30 days. It borrows proven ideas from cloud “well‑architected” guidance and site reliability engineering, adapted for smaller teams and budgets. If you follow the steps, your launch should feel… quiet: no late‑night dramas, predictable costs, clear rollback, and measurable outcomes.

What “production‑ready” means for AI in an SME

Big tech teams use a Production Readiness Review (PRR) before launch: a short, structured check that non‑functional basics are in place — reliability targets, monitoring, on‑call, rollback, and runbooks. Google’s SRE handbook treats PRR as a prerequisite for live support, which is a sensible standard to emulate at SME scale. See Google’s SRE PRR overview.

Cloud providers publish simple reliability checklists you can adapt. Microsoft’s Azure Well‑Architected Reliability and the design review checklist are especially accessible, and AWS’s Reliability Pillar covers recovery, change management and test drills in plain language. These resources translate well to AI services built on vendor APIs or cloud‑hosted models.

Why the fuss? UK data shows cyber disruptions and operational incidents are rising, and boards are expected to treat resilience as a leadership duty. The 2025 Cyber Security Breaches Survey highlights low awareness of practical NCSC guidance among SMEs and charities, while recent headlines have underscored the cost of outages. NCSC’s Response & Recovery toolkit is a helpful companion to this playbook.

The 30‑day quiet cutover plan

Run this as a lightweight project with a named owner, a weekly decision gate, and clear exit criteria. You can deliver it alongside normal work if you keep the scope tight.

Week 1 — PRR basics and decision gates

  • Define success and boundaries: one primary outcome (for example, “reduce average email triage time by 40%”). List out‑of‑scope features to avoid last‑minute creep.
  • Service Level Objective (SLO): pick a simple target for the first month (for example, 99.0% uptime for the AI API plus a maximum of 2% escalations due to poor answers). Azure and AWS reliability guides show how to set pragmatic targets without over‑engineering; a minimal sketch of this check follows this list.
  • Operational readiness: set up a light‑touch on‑call rota, define office‑hours alerting, and write a one‑page runbook covering how to restart, how to roll back, and who to call. See Google’s production best practices for a sensible minimum.
  • Change and rollback: agree a single‑step rollback that you’ve tested (for example, “switch Helpdesk traffic back to human‑only queue”). No rollback, no go‑live.
  • Data and access: confirm what the AI can and cannot see. Create a test user with read‑only access to sample data. If you use RAG, verify your index hasn’t picked up sensitive documents by mistake.
  • Cost guardrails: set a monthly budget and a per‑day cap with alerting. Decide what happens at 80% spend (throttle usage, narrow scope, or pause). The FinOps Framework 2025 suggests managing spend across “Scopes” (cloud, SaaS, AI APIs) so finance sees one picture.
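
To make the month‑one SLO target concrete, here is a minimal sketch in Python. The 99.0% uptime and 2% escalation figures are the example targets above; the counts fed in are placeholders you would pull from your own logs or ticket system.

```python
# Minimal month-one SLO check. The targets mirror the example above;
# the counts are placeholders you would pull from your own logs.

SLO_UPTIME_PCT = 99.0         # minimum monthly uptime for the AI API
SLO_MAX_ESCALATION_PCT = 2.0  # maximum share of answers escalated for poor quality

def slo_report(minutes_up: int, minutes_total: int,
               escalations: int, answers: int) -> dict:
    uptime_pct = 100.0 * minutes_up / minutes_total
    escalation_pct = 100.0 * escalations / max(answers, 1)
    return {
        "uptime_pct": round(uptime_pct, 2),
        "escalation_pct": round(escalation_pct, 2),
        "uptime_ok": uptime_pct >= SLO_UPTIME_PCT,
        "escalations_ok": escalation_pct <= SLO_MAX_ESCALATION_PCT,
    }

if __name__ == "__main__":
    # Example month: roughly 30 days of minutes, 400 answers, 6 escalations.
    print(slo_report(minutes_up=43_000, minutes_total=43_200,
                     escalations=6, answers=400))
```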

Week 2 — Shadow mode and dry runs

  • Shadow traffic: run the AI alongside humans. Don’t expose answers to customers yet; just compare. Aim for a handful of representative scenarios, not perfection.
  • Failure drills: simulate the three most likely failures: vendor API timeout, bad answer, and cost spike. Prove your fallback works (a minimal fallback sketch follows this list). Azure’s reliability checklist explicitly recommends failure mode analysis and drills.
  • Observability lite: put only three things on the dashboard: answer accuracy sample, daily cost vs budget, and error rate. Keep raw logs accessible for diagnosis. NCSC’s security monitoring principles back this minimalist, evolving approach.
  • Team enablement: 45‑minute briefing for front‑line staff: what the AI does, when to override, how to flag “unsafe/incorrect”. Provide a single feedback button or form.
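
Here is a minimal sketch of the most important drill: a vendor API timeout falling back to the human queue. call_vendor_api and route_to_human_queue are hypothetical stand‑ins for your real API client and ticket queue; in the drill you deliberately make the call fail to prove the fallback path actually runs.

```python
# Sketch of the "vendor timeout -> fall back to human queue" drill.
# call_vendor_api() and route_to_human_queue() are hypothetical stand-ins
# for your real API client and ticket queue.
import time

TIMEOUT_SECONDS = 10

def call_vendor_api(question: str, timeout: float) -> str:
    # Placeholder: for the drill, make the real call fail (or sleep past
    # the timeout) to prove the fallback path actually runs.
    raise TimeoutError("simulated vendor timeout")

def route_to_human_queue(question: str, reason: str) -> str:
    print(f"[fallback] '{question}' sent to the human queue ({reason})")
    return "Thanks - a colleague will reply shortly."

def answer(question: str) -> str:
    start = time.monotonic()
    try:
        return call_vendor_api(question, timeout=TIMEOUT_SECONDS)
    except (TimeoutError, ConnectionError) as exc:
        elapsed = time.monotonic() - start
        return route_to_human_queue(question, f"{exc} after {elapsed:.1f}s")

if __name__ == "__main__":
    print(answer("How do I reset my password?"))
```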

Week 3 — Soft launch to a small audience

  • Blue/green or percentage rollout: start with 10–20% of eligible queries or one internal team. Increase daily if KPIs hold (a minimal routing sketch follows this list).
  • Daily stand‑up (15 minutes): yesterday’s incidents, cost vs budget, top feedback, decision to widen or hold. Keep a simple log.
  • Content review loop: sample 25–50 answers per day. Tag issues as “accuracy”, “tone”, or “missing data” to spot patterns.
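
One simple way to implement the percentage rollout is to hash a stable identifier (the ticket or user ID) into 100 buckets and compare against the current rollout percentage, so the same person always gets the same route. A minimal sketch, assuming you keep the percentage in one obvious place you can change daily:

```python
# Minimal percentage rollout: hash a stable ID into 100 buckets and
# compare against today's rollout percentage. Deterministic, so the
# same ticket or user always gets the same route at a given percentage.
import hashlib

ROLLOUT_PERCENT = 10  # start at 10-20%; raise daily if KPIs hold

def bucket(stable_id: str) -> int:
    digest = hashlib.sha256(stable_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def use_ai(stable_id: str, rollout_percent: int = ROLLOUT_PERCENT) -> bool:
    return bucket(stable_id) < rollout_percent

if __name__ == "__main__":
    sample = [f"ticket-{n}" for n in range(1000)]
    share = sum(use_ai(s) for s in sample) / len(sample)
    print(f"{share:.0%} of sample tickets routed to the AI")
```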

Week 4 — Full launch and stabilisation

  • Announce quietly: brief staff and key customers on the benefit and the rollback option. Avoid marketing fanfare until you have two weeks of stable KPIs.
  • Freeze non‑essential changes: hold back new prompts, data sources or model swaps for two weeks unless they fix a severity‑1 issue.
  • Shift from daily to twice‑weekly reviews: keep incident and cost guardrails; agree the next small improvement (for example, add one new data source to RAG).

One‑page Production Readiness Review (PRR) — SME edition

Use this as your go/no‑go checklist. If you cannot tick an item, fix it or change the launch scope.

  • Purpose and scope are written in one paragraph.
  • SLOs defined for uptime, average response time, and quality (simple pass rate on a small test set).
  • Runbook exists and includes restart, fallback, and contacts.
  • Rollback tested once under supervision.
  • Monitoring shows accuracy sample, error rate, and daily cost.
  • Access and data verified for least privilege and correct document set.
  • Budget and caps configured with alerts and a named owner.
  • On‑call cover named for office hours; escalation path agreed.
  • Known gaps documented and accepted by sponsor.

If you need a broader launch plan with vendor comparison and cost baselining, see our 12‑week pilot‑to‑production guide and the 5‑day evaluation sprint.

Cost governance that actually works in month one

Cost spikes usually come from traffic bursts, prompt inflation, or silent feature creep (more sources, more tokens). Recent surveys show most IT leaders still struggle with cloud costs, and AI adds pressure. Treat cost as a launch risk, not a finance afterthought.

  1. Set a “Levelised” lens: look beyond per‑request price to the total monthly cost to deliver a unit of business value (for example, “per answered ticket”). The FinOps community calls this unifying view a Cloud+ practice; their Framework 2025 adds “Scopes” to cover cloud, SaaS and AI APIs together.
  2. Guardrails on day one: daily spend cap, alert at 60/80/100%, block extra‑long prompts, and set maximum concurrency. If you use Azure or AWS, pair caps with budget alerts in their native cost tools (a minimal sketch of the threshold arithmetic follows this list).
  3. Three KPIs for finance: cost per successful outcome, % of budget used to date, and variance vs forecast.
  4. Weekly “cost change control”: any new data source, prompt template, or model must state expected impact in £ and tokens. If unknown, run a 48‑hour A/B test first.
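
To show what the guardrail arithmetic looks like, here is a minimal sketch that checks the 60/80/100% thresholds and computes cost per successful outcome. The budget and spend figures are illustrative; in practice the spend number would come from your provider’s cost export or billing tools.

```python
# Illustrative guardrail arithmetic: budget alert thresholds plus cost
# per successful outcome. The figures below are made up; real spend
# would come from your provider's cost export.

MONTHLY_BUDGET_GBP = 300.00
ALERT_THRESHOLDS_PCT = (60, 80, 100)

def budget_alerts(spend_to_date: float, budget: float = MONTHLY_BUDGET_GBP):
    used_pct = 100.0 * spend_to_date / budget
    crossed = [t for t in ALERT_THRESHOLDS_PCT if used_pct >= t]
    return round(used_pct, 1), crossed

def cost_per_successful_outcome(spend_to_date: float, successes: int) -> float:
    # "Successful outcome" is whatever unit you chose, e.g. an answered ticket.
    return round(spend_to_date / max(successes, 1), 2)

if __name__ == "__main__":
    used, crossed = budget_alerts(spend_to_date=252.40)
    print(f"{used}% of budget used; thresholds crossed: {crossed}")
    print(f"£{cost_per_successful_outcome(252.40, successes=610)} per answered ticket")
```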

Tip: if your helpdesk is the first use case, see our 30‑day AI helpdesk case study for realistic cost baselines and deflection rates.

Risks and pre‑emptive controls

  • Unreliable vendor/API. What it looks like: time‑outs or degraded answers during peak. Pre‑emptive control: timeout handling and fallback to a human queue; dual‑region or secondary provider if feasible; publish an SLO. Trigger to act: 3× 5‑minute outages in a week, or weekly uptime below 99%.
  • Cost spike. What it looks like: spend doubles vs forecast. Pre‑emptive control: daily caps, throttled concurrency, shorter prompts; switch to a cheaper model for non‑critical flows. Trigger to act: 80% of the monthly budget reached before day 20.
  • Bad answers. What it looks like: customer complaints or staff rework. Pre‑emptive control: shadow QA sample; maintain a “do not answer” list and fallback templates; improve the retrieval set. Trigger to act: quality sample pass rate drops below 90% for two days.
  • Data exposure. What it looks like: the model sees documents it shouldn’t. Pre‑emptive control: least‑privilege access; test with a low‑privilege user; review the index build scope. Trigger to act: any failed access test or an unexpected document in results.
  • Team fatigue. What it looks like: the support queue builds; incident response slows. Pre‑emptive control: office‑hours rota and a simple escalation path; keep “on‑call” to a shared team diary. Trigger to act: more than two incidents per on‑call shift.
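
The “trigger to act” entries can be checked mechanically rather than from memory. A minimal sketch for the quality trigger (pass rate below 90% for two consecutive days), assuming you record one sample pass‑rate figure per day:

```python
# Quality trigger from the list above: act if the daily sample pass
# rate is below 90% for two consecutive days. One figure per day,
# oldest first; the numbers are illustrative.

QUALITY_FLOOR_PCT = 90.0

def quality_trigger(daily_pass_rates: list) -> bool:
    below = [rate < QUALITY_FLOOR_PCT for rate in daily_pass_rates]
    return any(a and b for a, b in zip(below, below[1:]))

if __name__ == "__main__":
    print(quality_trigger([96.0, 93.5, 88.0, 87.2]))  # True: two bad days in a row
    print(quality_trigger([96.0, 88.0, 92.0, 89.0]))  # False: bad days are not consecutive
```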

These thresholds map well to cloud reliability guidance from AWS and Azure, and to SRE practices on manageable on‑call load.

KPIs to track in your first 30 days

  • Operational: uptime (%), median response time, error rate (% of requests with fallback invoked); a simple roll‑up sketch follows this list.
  • Quality: daily sample pass rate (% acceptable answers on a fixed test set), top three failure themes.
  • Value: deflection rate (% issues resolved without human), average handling time saved (minutes), cost per successful outcome (£).
  • Team: incidents per week, time to acknowledge, time to resolve, staff confidence score (quick pulse via a 1–5 emoji scale).
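
As an illustration of how little tooling the first 30 days need, here is a minimal roll‑up sketch over a simple daily request log. The field names are assumptions; adapt them to whatever your ticket or logging system exports.

```python
# Daily KPI roll-up from a simple request log. The field names are
# assumptions; adapt them to whatever your ticket or logging system exports.
from typing import Dict, List

def daily_kpis(log: List[Dict]) -> Dict[str, float]:
    total = len(log)
    fallbacks = sum(1 for r in log if r["fallback"])
    deflected = sum(1 for r in log if r["resolved_without_human"])
    sampled = [r for r in log if r.get("sample_pass") is not None]
    passed = sum(1 for r in sampled if r["sample_pass"])
    return {
        "error_rate_pct": round(100.0 * fallbacks / max(total, 1), 1),
        "deflection_rate_pct": round(100.0 * deflected / max(total, 1), 1),
        "sample_pass_rate_pct": round(100.0 * passed / max(len(sampled), 1), 1),
    }

if __name__ == "__main__":
    log = [
        {"fallback": False, "resolved_without_human": True,  "sample_pass": True},
        {"fallback": True,  "resolved_without_human": False, "sample_pass": None},
        {"fallback": False, "resolved_without_human": True,  "sample_pass": False},
        {"fallback": False, "resolved_without_human": False, "sample_pass": None},
    ]
    print(daily_kpis(log))
```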

Keep the dashboard boring and stable. If a KPI becomes noisy or ignored, remove it.

Minimal procurement checks before you cut over

You don’t need a 400‑page tender to launch sensibly. Ask vendors these five questions and capture the answers in your runbook:

  1. Reliability: “What is your historical uptime and how do we check service status?” Link their public status page in your runbook.
  2. Cost controls: “Which controls can we configure ourselves (rate limits, budget caps, alerting)?” If weak, plan your own throttling.
  3. Support path: “When something breaks at 10am UK time, who do we call? What’s the typical response time?”
  4. Roadmap stability: “Any near‑term changes that could affect prompts, APIs or token pricing in the next quarter?”
  5. Export/portability: “If we leave, how do we export data and prompts?” Capture the steps.

Government buyers lean on the Technology Code of Practice to make technology choices repeatable and transparent — the spirit of it translates well to SMEs: define user needs, consider open standards, and have a purchasing strategy you can explain.

Playbook extras: retrieval, prompts, and model choice

Three pragmatic rules prevent most operational surprises:

  • Retrieval: start with a small, curated document set. Expand only after a week of stable quality. Our 6‑week RAG blueprint shows how to structure this without over‑engineering.
  • Prompts: keep prompts short and explicit. Maintain one approved template per task and a “do not answer” rule with human hand‑off text (a minimal template sketch follows this list).
  • Model choice: begin with the smallest model that passes your test set. Step up only if a documented gap remains. Azure and AWS both provide reliability patterns and resiliency guidance you can reuse.
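
To illustrate the “one approved template per task” rule, here is a minimal sketch with a hypothetical template, a small “do not answer” list, and the human hand‑off text. The topics and wording are placeholders for your own.

```python
# One approved template per task, plus a "do not answer" rule with a
# human hand-off. Topics and wording are placeholders for your own.

APPROVED_TEMPLATE = (
    "You are the helpdesk assistant. Answer only from the provided documents. "
    "If the answer is not in the documents, say you do not know.\n\n"
    "Documents:\n{documents}\n\nQuestion: {question}"
)

DO_NOT_ANSWER_TOPICS = {"legal advice", "medical advice", "payroll disputes"}
HANDOFF_TEXT = "I can't help with that here, so I've passed your question to a colleague."

def build_prompt(question: str, documents: str, topic: str):
    # Returning None tells the caller to send HANDOFF_TEXT and route to a human.
    if topic.lower() in DO_NOT_ANSWER_TOPICS:
        return None
    return APPROVED_TEMPLATE.format(documents=documents, question=question)

if __name__ == "__main__":
    print(build_prompt("How do I claim expenses?", "Expenses policy v3 ...", "expenses"))
    print(build_prompt("Can I sue my landlord?", "", "legal advice") or HANDOFF_TEXT)
```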

Who does what (RACI‑lite)

  • Product owner (sponsor): sets the outcome and approves go/no‑go at each weekly gate.
  • Ops lead: owns runbook, monitoring, on‑call rota, and rollback test.
  • Front‑line lead: trains users, reviews answer samples, collects feedback.
  • Finance partner: owns budget, caps and variance review.
  • Vendor contact: confirms support path and communicates incidents.

Seven decisions to make in the kick‑off meeting

  1. Primary outcome and KPI for month one.
  2. Launch scope (audience, hours, channels).
  3. Fallback and rollback method.
  4. Budget and caps with named owner.
  5. Monitoring metrics and where the dashboard lives.
  6. On‑call rota and escalation path.
  7. Weekly gate schedule and decision forum.

FAQs we hear from UK SME leaders

Can we launch without 24/7 on‑call?

Yes. Start with office‑hours coverage and an escalation path. If usage grows beyond your risk appetite, extend cover. SRE practice suggests keeping incidents per shift low to avoid burnout; build towards that rather than starting there.

What if we cannot afford a second AI vendor?

Don’t force multi‑vendor. Prioritise robust fallback to human handling, test it, and carry a small backlog of “plan B” changes (for example, a switch to a smaller, cheaper model for non‑critical flows) you can apply in a day.

How polished should our dashboard be?

Not very. A simple sheet or shared page is enough if it has the three essentials: quality sample, errors, spend. Improve it only if people actually use it.

Ready to launch quietly?

If you want a partner for the first month, we can run a 3‑day PRR, set guardrails, and co‑pilot the soft launch. We’ve done this for SMEs moving from pilots to real value with predictable costs.

Further reading: AWS Well‑Architected, Azure Reliability, SRE PRR, and NCSC’s Response & Recovery.