[Image: A release dashboard showing a staged rollout of an AI feature with canary and rollback controls]
Delivery & Ops

Ship AI Changes Safely: version pinning, canaries and rollbacks for UK SMEs in 14 days

If your AI assistant, search copilot or triage bot behaves differently on Monday than it did on Friday, you don’t have a product—you have a pilot. December is the perfect time to harden your release process so January’s updates don’t break customer journeys or budgets. This practical guide shows directors, ops leads and trustees how to ship AI changes safely in two weeks without grinding delivery to a halt.

Why this matters now: model providers retire and replace versions on a regular cadence. OpenAI and Anthropic both publish deprecation schedules and retirement dates—miss one and your service can fail overnight unless you’ve pinned versions and tested a migration path. Build change control around these calendars, not after the fact. platform.openai.com

What “good” looks like for AI releases

  • Stable behaviour: You pin model versions and prompts, run a short canary, then graduate traffic in stages—protecting your service level objectives (SLOs) and error budget. sre.google
  • Predictable cost: You set per-route token and latency thresholds; any spike during canary auto-pauses the rollout.
  • Safe updates: Your process follows recognised secure-by-design guidance for AI systems (secure design, development, deployment and operation). cisa.gov

If you’re still defining SLOs and on-call, start there first, then return to this playbook. See our AI runbook & SLOs guide and the 7‑day reliability sprint for foundations.

The 14‑day safe‑release playbook

Days 1–2 — Inventory and pin

  • List every changeable AI asset: model IDs/versions, prompts, tools/functions, RAG index versions, safety policies, feature flags.
  • Pin versions in config. Note “provider retirement” dates in your change calendar (OpenAI/Anthropic deprecations). platform.openai.com
  • Create a “kill switch” flag per AI feature. Keep flags centralised and auditable (avoid toggle sprawl). martinfowler.com
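
For illustration, the sketch below shows one way to keep pinned versions, retirement dates and kill switches in a single auditable config. The route names, model IDs and dates are placeholders, not recommendations for any particular provider.

```python
# pinned_config.py - a minimal sketch of pinned AI assets plus kill switches.
# Route names, model IDs and dates are illustrative placeholders.
import json

AI_CONFIG = {
    "routes": {
        "customer_assist": {
            "model_id": "provider-model-2024-08-06",  # explicit snapshot, never a "latest" alias
            "prompt_version": "assist-prompt-v12",
            "rag_index_version": "kb-2024-11-30",
            "retirement_date": "2025-06-01",          # from the provider's deprecation page
            "kill_switch": False,                     # flip to True to disable the feature
        },
        "internal_search": {
            "model_id": "provider-model-2024-04-09",
            "prompt_version": "search-prompt-v4",
            "rag_index_version": "kb-2024-11-30",
            "retirement_date": "2025-03-15",
            "kill_switch": False,
        },
    }
}

def save_config(path: str = "ai_config.json") -> None:
    """Persist the pinned config so every change is a reviewable, auditable diff."""
    with open(path, "w") as f:
        json.dump(AI_CONFIG, f, indent=2)

if __name__ == "__main__":
    save_config()
    print("Pinned config written - review and commit it like any other change.")
```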

Days 3–4 — Golden tests and baselines

  • Assemble a golden set of 50–200 real, representative queries for each workflow: customer service, triage, internal search, document drafting.
  • Record baseline metrics on your current production version: success rate, harmful output rate, latency, cost per request, average tokens.
  • Include adversarial prompts from OWASP LLM Top 10 (prompt injection, sensitive info leakage) to spot regressions early. owasp.org
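
A baseline run does not need special tooling. The sketch below is one way to capture the metrics listed above over a golden set; `call_model` is a stub standing in for your provider's SDK, and the field names are assumptions.

```python
# golden_baseline.py - sketch: record baseline metrics on the current production version.
# `call_model` is a stub; replace it with your provider's SDK call.
import json
import statistics
import time

def call_model(query: str) -> dict:
    """Stub standing in for the production model; returns answer text plus usage data."""
    return {"text": "stub answer", "flagged": False, "cost_gbp": 0.001, "total_tokens": 350}

def run_baseline(golden_set: list[dict]) -> dict:
    results = []
    for item in golden_set:
        start = time.perf_counter()
        response = call_model(item["query"])
        results.append({
            "success": item["expected_marker"] in response["text"],  # crude pass/fail check
            "harmful": response["flagged"],
            "latency_s": time.perf_counter() - start,
            "cost_gbp": response["cost_gbp"],
            "tokens": response["total_tokens"],
        })
    latencies = sorted(r["latency_s"] for r in results)
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "harmful_rate": sum(r["harmful"] for r in results) / len(results),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "avg_tokens": statistics.mean(r["tokens"] for r in results),
        "cost_per_100": 100 * statistics.mean(r["cost_gbp"] for r in results),
    }

if __name__ == "__main__":
    golden = json.load(open("golden_set.json"))  # 50-200 real, representative queries
    json.dump(run_baseline(golden), open("baseline_metrics.json", "w"), indent=2)
```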

Days 5–6 — Pre‑release evaluation

  • Run the candidate model/prompt against the golden set. Only proceed if it meets or beats baseline on your blocking metrics (below).
  • Have a written rollback trigger: e.g., harmful output rate > 0.5% or £/100 requests +15% vs baseline.
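
The go/no-go decision can be a small, explicit gate rather than a judgment call. A minimal sketch, reusing the baseline file from the previous step and the example thresholds from this guide:

```python
# release_gate.py - sketch of a pre-release gate comparing candidate metrics to baseline.
# Thresholds mirror the examples in this guide; tune them per workflow.
import json

BLOCKING_RULES = [
    ("success_rate",  lambda cand, base: cand >= base),         # must meet or beat baseline
    ("harmful_rate",  lambda cand, base: cand <= 0.005),        # <= 0.5% on golden/adversarial set
    ("p95_latency_s", lambda cand, base: cand <= base * 1.10),  # no more than 10% slower
    ("cost_per_100",  lambda cand, base: cand <= base * 1.15),  # rollback trigger: +15% vs baseline
]

def gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons). Any failed rule blocks the rollout."""
    failures = [
        f"{metric}: candidate={candidate[metric]} baseline={baseline[metric]}"
        for metric, rule in BLOCKING_RULES
        if not rule(candidate[metric], baseline[metric])
    ]
    return (not failures, failures)

if __name__ == "__main__":
    baseline = json.load(open("baseline_metrics.json"))
    candidate = json.load(open("candidate_metrics.json"))
    ok, reasons = gate(candidate, baseline)
    print("PROCEED" if ok else "BLOCKED:\n" + "\n".join(reasons))
```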

Days 7–8 — Shadow and “dark launch”

  • Shadow traffic: run the new version behind the scenes, compare outputs and costs but show users the old version.
  • Dark launch parts of the workflow to test system load before exposing to users. martinfowler.com
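
In code, shadow traffic amounts to "call both versions, log the comparison, return the old answer". The sketch below assumes two callables for the current and candidate versions and an illustrative log format; the shadow path must never affect what users see.

```python
# shadow.py - sketch of shadow traffic: users get the current version's answer,
# while the candidate runs in the background for comparison only.
import json
import logging
import threading
import time

log = logging.getLogger("shadow")

def handle_request(query: str, current_model, candidate_model) -> str:
    start = time.perf_counter()
    answer = current_model(query)
    current_latency = time.perf_counter() - start

    def run_shadow() -> None:
        try:
            shadow_start = time.perf_counter()
            shadow_answer = candidate_model(query)
            log.info(json.dumps({
                "query": query,
                "current_latency_s": round(current_latency, 3),
                "shadow_latency_s": round(time.perf_counter() - shadow_start, 3),
                "answers_match": shadow_answer.strip() == answer.strip(),
            }))
        except Exception:
            log.exception("shadow call failed")  # the shadow path must never affect users

    threading.Thread(target=run_shadow, daemon=True).start()
    return answer  # users only ever see the current version's answer
```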

Days 9–11 — Canary release

  • Release to 1% of traffic for 60–120 minutes, then 5%, then 20%, pausing if any blocking metric is breached. Track SLO impact and “spend” only a sliver of your error budget during canary. sre.google
  • Only run one canary at a time; overlapping canaries cause signal noise and slow incident response. sre.google
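
A staged canary can be expressed as a loop over traffic shares with a blocking check at each stage. In this sketch, `set_traffic_split`, `collect_canary_metrics` and `flip_kill_switch` are placeholders for your release tooling, and `gate` is the pre-release check from earlier.

```python
# canary.py - sketch of a staged canary (1% -> 5% -> 20%) with auto-pause.
# set_traffic_split, collect_canary_metrics and flip_kill_switch are placeholders
# for your release tooling; gate is the pre-release check from release_gate.py.
import time

STAGES = [(0.01, 90 * 60), (0.05, 90 * 60), (0.20, 90 * 60)]  # (traffic share, soak seconds)

def run_canary(baseline: dict, set_traffic_split, collect_canary_metrics,
               flip_kill_switch, gate) -> bool:
    for share, soak_seconds in STAGES:
        set_traffic_split(candidate_share=share)
        time.sleep(soak_seconds)  # observe each stage for 60-120 minutes
        metrics = collect_canary_metrics(window_seconds=soak_seconds)
        ok, reasons = gate(metrics, baseline)
        if not ok:
            set_traffic_split(candidate_share=0.0)  # pause the rollout immediately
            flip_kill_switch(enabled=True)          # and fall back to the pinned version
            print(f"Canary halted at {share:.0%}: {reasons}")
            return False
        print(f"Stage {share:.0%} passed; proceeding.")
    return True  # safe to graduate to 50%, then 100%
```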

Days 12–14 — Graduate and tidy

  • Gradually increase to 50% then 100% once stable for a full business day.
  • Remove temporary flags, update the runbook, and archive results for your audit trail.

Blocking metrics that auto‑pause a rollout

Pick three to five you’ll actually watch; wire them to your alerting. The moment any threshold is crossed, your release tooling should pause automatically and flip the kill switch.

Metric | Why it matters | Example threshold
Task success rate | Prevents “nice but wrong” answers going live | Must be ≥ baseline (e.g., 92%)
Harmful output rate | Red flags for safety and brand risk | ≤ 0.5% on golden/adversarial set
P95 latency | User experience; timeouts drive abandonment | ≤ 2.5s for assist, ≤ 800ms for search
Cost per 100 requests | Stops a silent bill shock | ≤ +10% vs baseline during canary
SLO burn rate | Protects your error budget; triggers a change freeze if overspending | Burn rate ≤ 2x over a 1h window

Using SLOs and an error‑budget policy gives teams permission to pause features when reliability dips—without endless debate. sre.google
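
If burn rate is unfamiliar, it is simply how fast you are consuming the error budget relative to plan. A minimal sketch, assuming a 99.5% SLO and a simple count of good and bad requests over the last hour:

```python
# burn_rate.py - sketch of an error-budget burn-rate check over a 1-hour window.
# Assumes a 99.5% SLO; swap in your own target and request counters.

SLO_TARGET = 0.995               # 99.5% of requests should succeed
ERROR_BUDGET = 1.0 - SLO_TARGET  # 0.5% of requests may fail

def burn_rate(bad_requests: int, total_requests: int) -> float:
    """Burn rate = observed error rate / error budget; 1.0 means 'on plan'."""
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / ERROR_BUDGET

if __name__ == "__main__":
    # Example: 120 failed requests out of 10,000 in the last hour -> 2.4x burn.
    rate = burn_rate(bad_requests=120, total_requests=10_000)
    print(f"1h burn rate: {rate:.1f}x")
    if rate > 2.0:  # threshold from the table above
        print("Pause the rollout and flip the kill switch.")
```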

Versioning, deprecations and vendor change control

  • Track provider lifecycles: Put OpenAI and Anthropic deprecation pages in your release calendar; rehearse migrations 30–60 days before retirement dates. platform.openai.com
  • Prefer explicit model IDs/snapshots: Avoid “latest” aliases in production; they can change without warning.
  • Contractual asks: Request 60–90 days’ notice for model retirement, a like‑for‑like fallback, and a migration guide with eval tips. UK government’s AI Cyber Security Code of Practice backs lifecycle documentation and secure updates across design, deployment and maintenance. gov.uk
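
A small script can turn those deprecation pages into an early warning. The sketch below reads retirement dates from a pinned config like the one sketched earlier and flags anything inside a 60-day rehearsal window; the file name and field names are assumptions.

```python
# retirement_check.py - sketch: warn when a pinned model nears its retirement date.
# Reads the pinned config sketched earlier; dates come from provider deprecation pages.
import json
from datetime import date, timedelta

REHEARSAL_WINDOW = timedelta(days=60)  # start migration rehearsals 60 days out

def check_retirements(config_path: str = "ai_config.json") -> list[str]:
    config = json.load(open(config_path))
    warnings = []
    for route, settings in config["routes"].items():
        retirement = date.fromisoformat(settings["retirement_date"])
        if retirement - date.today() <= REHEARSAL_WINDOW:
            warnings.append(f"{route}: {settings['model_id']} retires on {retirement}; "
                            f"schedule a migration canary now.")
    return warnings

if __name__ == "__main__":
    for warning in check_retirements():
        print(warning)
```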

Security note: incorporate AI‑specific risks (e.g., prompt injection, data poisoning) into change reviews. OWASP’s GenAI Top 10 is a useful quick reference for test design and mitigations. owasp.org

Release patterns that reduce risk (and when to use them)

Canary release

Expose a small, time‑boxed slice of traffic (1% → 5% → 20%) to the change and compare key metrics to control. Best for model upgrades, new tools and major prompt rewrites. sre.google

Feature flags / kill switches

Turn features on/off instantly without redeploying. Keep flags short‑lived and centralised to avoid “toggle spaghetti”. martinfowler.com
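
One way to avoid toggle spaghetti is to give every flag an owner and an expiry date in a single registry, then review stale flags weekly. A sketch with hypothetical flag names:

```python
# flags.py - sketch of a centralised flag registry with owners and expiry dates,
# so kill switches stay auditable and temporary flags get removed on time.
from datetime import date

FLAGS = {
    "assist_new_model_canary": {"enabled": False, "owner": "ops-lead",
                                "expires": date(2025, 1, 31)},  # temporary: remove after rollout
    "assist_kill_switch":      {"enabled": False, "owner": "on-call",
                                "expires": None},               # permanent operational control
}

def is_enabled(name: str) -> bool:
    return FLAGS[name]["enabled"]

def stale_flags(today: date | None = None) -> list[str]:
    """Flags past their expiry date are toggle sprawl and should be deleted or renewed."""
    today = today or date.today()
    return [name for name, flag in FLAGS.items()
            if flag["expires"] is not None and flag["expires"] < today]

if __name__ == "__main__":
    for name in stale_flags():
        print(f"Stale flag: {name} - remove it or renew it with a change ticket.")
```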

Dark launching

Run the new path invisibly to users to measure load and cost before a public switch‑on; ideal before peak trading or fundraising periods. martinfowler.com

Shadow mode

Run the new agent alongside the old and compare outputs; promote once discrepancies are understood. Complements our shadow‑mode walkthrough.

Security by design, not bolt‑on

Build security into your release gates. UK and international authorities emphasise secure design, deployment and operation for AI (including change control, monitoring and secure updates). If your change introduces new connectors, tools or data sources, treat it like any other supply‑chain change: threat model it and test with guardrails before going live. cisa.gov

For UK context, DSIT’s AI Cyber Security Code of Practice sets baseline lifecycle measures and documentation expectations your board can adopt as policy without slowing delivery. gov.uk

Procurement questions for safer upgrades

  1. What are your model version identifiers and planned retirement dates in the next 12 months? Do you provide 60–90 days’ notice and migration guides? platform.openai.com
  2. Can we pin a specific version in production and move on our timetable?
  3. What safeguards exist against OWASP LLM Top 10 risks (prompt injection, sensitive info disclosure) during updates? owasp.org
  4. Do you support staged rollouts, shadow traffic or test sandboxes that mimic production costs and rate limits?
  5. What telemetry do you expose so we can enforce SLOs and error‑budget policies during canaries? sre.google

If answers are vague, run a two‑week vendor bake‑off with canaries and golden tests. See our 2‑week vendor bake‑off framework.

Runbook: rollback in five minutes

  1. Detect: Alert fires on a blocking metric (harmful output, latency, cost, SLO burn rate).
  2. Pause: Release tool auto‑pauses and pages on‑call.
  3. Kill switch: Flip the flag to disable the feature or route traffic back to the previous version.
  4. Communicate: Post a status update in your agreed channel; note user impact and ETA.
  5. Stabilise: Revert configuration, re‑pin the prior model ID, and clear cache if applicable.
  6. Review: Log incident, attach canary metrics and golden test diffs; decide whether to retry at a smaller canary or hold a change freeze under your error‑budget policy. sre.google
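
Steps 2, 3 and 5 are easiest when scripted in advance, so on-call is not improvising under pressure. A sketch of a one-command rollback against a pinned config like the one sketched earlier; the route and model IDs are placeholders, and paging and cache clearing are left to your own tooling.

```python
# rollback.py - sketch of a one-command rollback for a single AI route.
# Assumes the pinned config sketched earlier; route and model IDs are placeholders.
import json

def rollback(route: str, previous_model_id: str,
             config_path: str = "ai_config.json") -> None:
    config = json.load(open(config_path))

    # Step 3 - kill switch: stop serving the new version immediately.
    config["routes"][route]["kill_switch"] = True

    # Step 5 - stabilise: re-pin the prior model ID.
    config["routes"][route]["model_id"] = previous_model_id

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

    print(f"{route}: kill switch on, re-pinned to {previous_model_id}.")
    print("Next: post the status update, clear caches if applicable, log the incident.")

if __name__ == "__main__":
    rollback(route="customer_assist", previous_model_id="provider-model-2024-05-13")
```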

Costs and risks: quick view

Risk | Common trigger | Mitigation | Owner
Behaviour drift | Unpinned “latest” model | Pin versions; golden tests; canary gates | Engineering/Operations
Cost spike | Larger context or different routing | Token budgets; pause if £/100 req +10% | Ops/Finance
Security regression | Prompt/tool change | OWASP LLM tests; secure-by-design checklist | Security lead
Forced migration | Model retirement | Calendarise deprecations; rehearse switch | Product/Engineering

For deeper cost control, see our AI cost guardrails.

KPIs to review weekly

  • Release frequency and mean time to rollback (target: rollback ≤ 5 minutes).
  • Canary halt rate (halting early is healthy; the failure to avoid is a canary that should have halted but didn’t).
  • Cost per 100 requests by route vs baseline (target: within ±10%).
  • Golden set pass rate, including adversarial items (target: ≥ baseline).
  • SLO attainment and error‑budget use (target: stay within budget). sre.google

Swipe file: your pre‑release checklist

  • Change ticket created, owner named, rollback trigger defined.
  • Model ID and prompt pinned; retirement dates noted. platform.openai.com
  • Golden tests passed (functional + adversarial). owasp.org
  • Shadow/dark launch completed with no capacity alarms. martinfowler.com
  • Canary plan set (1% → 5% → 20%) with blocking metrics and alerting. sre.google
  • Runbook open; kill switch confirmed working.
Book a 30-min call, or email: team@youraiconsultant.london