If your AI assistant, search copilot or triage bot behaves differently on Monday than it did on Friday, you don’t have a product—you have a pilot. December is the perfect time to harden your release process so January’s updates don’t break customer journeys or budgets. This practical guide shows directors, ops leads and trustees how to ship AI changes safely in two weeks without grinding delivery to a halt.
Why this matters now: model providers retire and replace versions on a regular cadence. OpenAI and Anthropic both publish deprecation schedules and retirement dates—miss one and your service can fail overnight unless you’ve pinned versions and tested a migration path. Build change control around these calendars, not after the fact. platform.openai.com
What “good” looks like for AI releases
- Stable behaviour: You pin model versions and prompts, run a short canary, then graduate traffic in stages—protecting your service level objectives (SLOs) and error budget. sre.google
- Predictable cost: You set per-route token and latency thresholds; any spike during canary auto-pauses the rollout.
- Safe updates: Your process follows recognised secure-by-design guidance for AI systems (secure design, development, deployment and operation). cisa.gov
If you’re still defining SLOs and on-call, start there first, then return to this playbook. See our AI runbook & SLOs guide and the 7‑day reliability sprint for foundations.
The 14‑day safe‑release playbook
Days 1–2 — Inventory and pin
- List every changeable AI asset: model IDs/versions, prompts, tools/functions, RAG index versions, safety policies, feature flags.
- Pin versions in config and note “provider retirement” dates in your change calendar (OpenAI/Anthropic deprecations); a config sketch follows this list. platform.openai.com
- Create a “kill switch” flag per AI feature. Keep flags centralised and auditable (avoid toggle sprawl). martinfowler.com
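To make the pinning and kill-switch items concrete, here is a minimal sketch of what a pinned configuration could look like. The model IDs, version labels, dates and flag names are placeholders, not real provider values; substitute your own from the vendor’s deprecation pages.

```python
# release_config.py - an illustrative sketch of pinned AI release config.
# Every value below is a placeholder; pull real snapshot IDs and retirement
# dates from your provider's deprecation pages.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AIFeatureConfig:
    feature: str               # e.g. "customer_triage"
    model_id: str              # explicit snapshot, never a "latest" alias
    prompt_version: str        # pinned prompt template version
    rag_index_version: str     # pinned retrieval index build
    retirement_date: date      # from the provider's deprecation calendar
    kill_switch: bool = False  # flip to True to disable the feature instantly

RELEASE_CONFIG = {
    "customer_triage": AIFeatureConfig(
        feature="customer_triage",
        model_id="example-model-2025-06-01",    # placeholder snapshot ID
        prompt_version="triage-prompt-v14",
        rag_index_version="kb-index-2025-11-28",
        retirement_date=date(2026, 6, 1),       # placeholder retirement date
    ),
}
```

Keeping all of this in one versioned file gives you an audit trail for free: every model, prompt or flag change is a reviewable diff.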
Days 3–4 — Golden tests and baselines
- Assemble a golden set of 50–200 real, representative queries for each workflow: customer service, triage, internal search, document drafting.
- Record baseline metrics on your current production version: success rate, harmful output rate, latency, cost per request, average tokens (see the sketch after this list).
- Include adversarial prompts from OWASP LLM Top 10 (prompt injection, sensitive info leakage) to spot regressions early. owasp.org
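A baseline-recording sketch, assuming a `call_model` wrapper around your own provider client and a placeholder `is_harmful` check; both names are illustrative, not real library calls.

```python
# golden_baseline.py - record baseline metrics over a golden set.
import statistics
import time

def is_harmful(text: str) -> bool:
    # Placeholder: replace with your safety classifier or moderation endpoint.
    return False

def run_baseline(golden_set: list[dict], call_model) -> dict:
    """Each golden_set item: {"query": ..., "expected": ..., "adversarial": bool}."""
    results = []
    for item in golden_set:
        start = time.monotonic()
        response = call_model(item["query"])  # assumed to return {"text", "tokens", "cost_gbp"}
        latency = time.monotonic() - start
        results.append({
            "success": item["expected"].lower() in response["text"].lower(),  # crude check; swap in your own grader
            "harmful": is_harmful(response["text"]),
            "latency_s": latency,
            "cost_gbp": response["cost_gbp"],
            "tokens": response["tokens"],
        })
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "harmful_rate": sum(r["harmful"] for r in results) / len(results),
        "p95_latency_s": statistics.quantiles([r["latency_s"] for r in results], n=20)[18],
        "cost_per_100_gbp": 100 * statistics.mean(r["cost_gbp"] for r in results),
        "avg_tokens": statistics.mean(r["tokens"] for r in results),
    }
```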
Days 5–6 — Pre‑release evaluation
- Run the candidate model/prompt against the golden set. Only proceed if it meets or beats baseline on your blocking metrics (below).
- Have a written rollback trigger: e.g., harmful output rate > 0.5% or £/100 requests +15% vs baseline.
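A sketch of that gate as code, reusing the metric keys from the baseline sketch above. The 0.5% and +15% figures mirror the example rollback triggers in this guide; replace them with your own thresholds.

```python
# release_gate.py - pre-release gate comparing candidate metrics to baseline.

def passes_gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Both dicts use the keys produced by run_baseline()."""
    failures = []
    if candidate["success_rate"] < baseline["success_rate"]:
        failures.append("task success below baseline")
    if candidate["harmful_rate"] > 0.005:  # harmful output rate above 0.5%
        failures.append("harmful output rate above 0.5%")
    if candidate["cost_per_100_gbp"] > 1.15 * baseline["cost_per_100_gbp"]:
        failures.append("cost per 100 requests up more than 15% vs baseline")
    return (len(failures) == 0, failures)

# Example: block the release and record why.
# ok, reasons = passes_gate(candidate_metrics, baseline_metrics)
# if not ok:
#     raise SystemExit("Release blocked: " + "; ".join(reasons))
```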
Days 7–8 — Shadow and “dark launch”
- Shadow traffic: run the new version behind the scenes and compare outputs and costs, but show users only the old version (see the sketch after this list).
- Dark launch parts of the workflow to test system load before exposing to users. martinfowler.com
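A shadow-mode sketch: the user always receives the current version’s answer, while the candidate runs in the background and only its comparison data is logged. `call_current`, `call_candidate` and `log_shadow_diff` are assumed wrappers around your own clients and logging, not real library functions.

```python
# shadow_mode.py - run the candidate invisibly; users only see the current version.
import concurrent.futures

# Small background pool so the candidate call never blocks the user path.
_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(query: str, call_current, call_candidate, log_shadow_diff) -> str:
    current = call_current(query)  # the answer the user actually sees

    def _shadow():
        try:
            candidate = call_candidate(query)           # new model/prompt, invisible to users
            log_shadow_diff(query, current, candidate)  # outputs, tokens, cost, latency
        except Exception as exc:                        # shadow failures must never surface
            log_shadow_diff(query, current, {"error": str(exc)})

    _shadow_pool.submit(_shadow)
    return current["text"]
```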
Days 9–11 — Canary release
- Release to 1% of traffic for 60–120 minutes, then 5%, then 20%, pausing if any blocking metric is breached. Track SLO impact and “spend” only a sliver of your error budget during canary (a staged-ramp sketch follows this list). sre.google
- Only run one canary at a time; overlapping canaries cause signal noise and slow incident response. sre.google
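A staged-ramp sketch under the same assumptions as earlier: `set_traffic_split`, `read_live_metrics`, `breaches` and `trigger_kill_switch` stand in for your own release tooling, and the stages and hold times are illustrative.

```python
# canary_ramp.py - staged canary (1% -> 5% -> 20%) that auto-pauses on a breach.
import time

STAGES = [(0.01, 90 * 60), (0.05, 90 * 60), (0.20, 90 * 60)]  # (traffic share, hold seconds)

def run_canary(set_traffic_split, read_live_metrics, breaches, trigger_kill_switch,
               check_every_s: int = 300) -> bool:
    for share, hold_s in STAGES:
        set_traffic_split(candidate_share=share)
        deadline = time.monotonic() + hold_s
        while time.monotonic() < deadline:
            metrics = read_live_metrics()   # success, harmful, latency, cost, SLO burn rate
            failed = breaches(metrics)      # returns names of breached blocking metrics
            if failed:
                set_traffic_split(candidate_share=0.0)           # route everything back
                trigger_kill_switch(reason="; ".join(failed))    # and stop the rollout
                return False                # paused; investigate before retrying smaller
            time.sleep(check_every_s)
    return True                             # stable; safe to graduate to 50% then 100%
```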
Days 12–14 — Graduate and tidy
- Gradually increase to 50% then 100% once stable for a full business day.
- Remove temporary flags, update the runbook, and archive results for your audit trail.
Blocking metrics that auto‑pause a rollout
Pick three to five you’ll actually watch; wire them to your alerting. The moment any threshold is crossed, your release tooling should pause automatically and flip the kill switch.
| Metric | Why it matters | Example threshold |
|---|---|---|
| Task success rate | Prevents “nice but wrong” answers going live | Must be ≥ baseline (e.g., 92%) |
| Harmful output rate | Red flags for safety and brand risk | ≤ 0.5% on golden/adversarial set |
| P95 latency | User experience; timeouts drive abandonment | ≤ 2.5s for assist, ≤ 800ms for search |
| Cost per 100 requests | Stops a silent bill shock | ≤ +10% vs baseline during canary |
| SLO burn rate | Protects your error budget; triggers change freeze if overspending | Burn rate ≤ 2x over 1h window |
Using SLOs and an error‑budget policy gives teams permission to pause features when reliability dips—without endless debate. sre.google
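One common way to compute the burn rate in the table above, following the usual SRE convention: observed error rate divided by the error rate your SLO allows. A burn rate of 1.0 spends the budget exactly on schedule; 2x over an hour means you are spending it twice as fast.

```python
# burn_rate.py - sketch of the burn-rate check behind the blocking-metrics table.

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    allowed_error_rate = 1.0 - slo_target        # e.g. 0.005 for a 99.5% SLO
    observed_error_rate = errors / max(total, 1)
    return observed_error_rate / allowed_error_rate

# Example: 99.5% SLO, 1-hour window, 12 failures out of 1,000 requests.
# burn_rate(12, 1000, 0.995) is about 2.4 -> above the 2x threshold, pause the rollout.
```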
Versioning, deprecations and vendor change control
- Track provider lifecycles: Put OpenAI and Anthropic deprecation pages in your release calendar; rehearse migrations 30–60 days before retirement dates. platform.openai.com
- Prefer explicit model IDs/snapshots: Avoid “latest” aliases in production; they can change without warning.
- Contractual asks: Request 60–90 days’ notice for model retirement, a like‑for‑like fallback, and a migration guide with eval tips. The UK government’s AI Cyber Security Code of Practice backs lifecycle documentation and secure updates across design, deployment and maintenance. gov.uk
Security note: incorporate AI‑specific risks (e.g., prompt injection, data poisoning) into change reviews. OWASP’s GenAI Top 10 is a useful quick reference for test design and mitigations. owasp.org
Release patterns that reduce risk (and when to use them)
Canary release
Expose a small, time‑boxed slice of traffic (1% → 5% → 20%) to the change and compare key metrics to control. Best for model upgrades, new tools and major prompt rewrites. sre.google
Feature flags / kill switches
Turn features on/off instantly without redeploying. Keep flags short‑lived and centralised to avoid “toggle spaghetti”. martinfowler.com
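A minimal sketch of the pattern: flags live in one auditable store and the request path only asks whether the flag is on. `flag_store` and the flag name are illustrative, not a specific flag product.

```python
# kill_switch.py - centralised, short-lived kill switch on the request path.

def answer_with_ai(query: str, flag_store, ai_path, fallback_path) -> str:
    # Flags live in one auditable store; the code only ever asks "is it on?".
    if flag_store.is_enabled("customer_triage_ai"):  # flag name is illustrative
        return ai_path(query)
    return fallback_path(query)  # previous version or non-AI route, no redeploy needed
```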
Dark launching
Run the new path invisibly to users to measure load and cost before a public switch‑on; ideal before peak trading or fundraising periods. martinfowler.com
Shadow mode
Run the new agent alongside the old and compare outputs; promote once discrepancies are understood. Complements our shadow‑mode walkthrough.
Security by design, not bolt‑on
Build security into your release gates. UK and international authorities emphasise secure design, deployment and operation for AI (including change control, monitoring and secure updates). If your change introduces new connectors, tools or data sources, treat it like any other supply‑chain change: threat model it and test with guardrails before going live. cisa.gov
For UK context, DSIT’s AI Cyber Security Code of Practice sets baseline lifecycle measures and documentation expectations your board can adopt as policy without slowing delivery. gov.uk
Procurement questions for safer upgrades
- What are your model version identifiers and planned retirement dates in the next 12 months? Do you provide 60–90 days’ notice and migration guides? platform.openai.com
- Can we pin a specific version in production and move on our timetable?
- What safeguards exist against OWASP LLM Top 10 risks (prompt injection, sensitive info disclosure) during updates? owasp.org
- Do you support staged rollouts, shadow traffic or test sandboxes that mimic production costs and rate limits?
- What telemetry do you expose so we can enforce SLOs and error‑budget policies during canaries? sre.google
If answers are vague, run a two‑week vendor bake‑off with canaries and golden tests. See our 2‑week vendor bake‑off framework.
Runbook: rollback in five minutes
- Detect: Alert fires on a blocking metric (harmful output, latency, cost, SLO burn rate).
- Pause: Release tool auto‑pauses and pages on‑call.
- Kill switch: Flip the flag to disable the feature or route traffic back to the previous version.
- Communicate: Post a status update in your agreed channel; note user impact and ETA.
- Stabilise: Revert configuration, re‑pin the prior model ID, and clear cache if applicable.
- Review: Log incident, attach canary metrics and golden test diffs; decide whether to retry at a smaller canary or hold a change freeze under your error‑budget policy. sre.google
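Scripting the steps above is a useful drill aid: on-call runs one command instead of remembering six steps under pressure. This sketch assumes helper objects (`flag_store`, `config_store`, `cache`, `notify`, `incident_log`) that stand in for your own release tooling and status channels.

```python
# rollback.py - the five-minute rollback as a single, rehearsable script.
from datetime import datetime, timezone

def rollback(feature: str, previous_model_id: str, flag_store, config_store,
             cache, notify, incident_log) -> None:
    flag_store.disable(feature)                              # kill switch: stop new AI traffic
    config_store.pin(feature, model_id=previous_model_id)    # re-pin the prior model ID
    cache.clear(feature)                                     # drop responses from the bad version
    notify(f"{feature}: rolled back to {previous_model_id}; assessing user impact")  # status update
    incident_log.record(                                     # evidence for the post-incident review
        feature=feature,
        rolled_back_to=previous_model_id,
        at=datetime.now(timezone.utc).isoformat(),
    )
```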
Costs and risks: quick view
| Risk | Common trigger | Mitigation | Owner |
|---|---|---|---|
| Behaviour drift | Unpinned “latest” model | Pin versions; golden tests; canary gates | Engineering/Operations |
| Cost spike | Larger context or different routing | Token budgets; pause if £/100 req +10% | Ops/Finance |
| Security regression | Prompt/tool change | OWASP LLM tests; secure‑by‑design checklist | Security lead |
| Forced migration | Model retirement | Calendarise deprecations; rehearse switch | Product/Engineering |
For deeper cost control, see our AI cost guardrails.
KPIs to review weekly
- Release frequency and mean time to rollback (target: rollback ≤ 5 minutes).
- Canary halt rate (aim to halt early, not late; fewer false negatives).
- Cost per 100 requests by route vs baseline (target: within ±10%).
- Golden set pass rate, including adversarial items (target: ≥ baseline).
- SLO attainment and error‑budget use (target: stay within budget). sre.google
Swipe file: your pre‑release checklist
- Change ticket created, owner named, rollback trigger defined.
- Model ID and prompt pinned; retirement dates noted. platform.openai.com
- Golden tests passed (functional + adversarial). owasp.org
- Shadow/dark launch completed with no capacity alarms. martinfowler.com
- Canary plan set (1% → 5% → 20%) with blocking metrics and alerting. sre.google
- Runbook open; kill switch confirmed working.