If your AI assistant, search copilot or triage bot behaves differently on Monday than it did on Friday, you don’t have a product—you have a pilot. December is the perfect time to harden your release process so January’s updates don’t break customer journeys or budgets. This practical guide shows directors, ops leads and trustees how to ship AI changes safely in two weeks without grinding delivery to a halt.
Why this matters now: model providers retire and replace versions on a regular cadence. OpenAI and Anthropic both publish deprecation schedules and retirement dates—miss one and your service can fail overnight unless you’ve pinned versions and tested a migration path. Build change control around these calendars, not after the fact. platform.openai.com
What “good” looks like for AI releases
- Stable behaviour: You pin model versions and prompts, run a short canary, then graduate traffic in stages—protecting your service level objectives (SLOs) and error budget. sre.google
- Predictable cost: You set per-route token and latency thresholds; any spike during canary auto-pauses the rollout.
- Safe updates: Your process follows recognised secure-by-design guidance for AI systems (secure design, development, deployment and operation). cisa.gov
If you’re still defining SLOs and on-call, start there first, then return to this playbook. See our AI runbook & SLOs guide and the 7‑day reliability sprint for foundations.
The 14‑day safe‑release playbook
Days 1–2 — Inventory and pin
- List every changeable AI asset: model IDs/versions, prompts, tools/functions, RAG index versions, safety policies, feature flags.
- Pin versions in config and note “provider retirement” dates in your change calendar (OpenAI/Anthropic deprecations); a config sketch follows this list. platform.openai.com
- Create a “kill switch” flag per AI feature. Keep flags centralised and auditable (avoid toggle sprawl). martinfowler.com
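To make the pinning and kill-switch items concrete, here is a minimal sketch of what a pinned configuration could look like. The model IDs, version labels, dates and flag names are placeholders, not real provider values; substitute your own from the vendor’s deprecation pages.

```python
# release_config.py - an illustrative sketch of pinned AI release config.
# Every value below is a placeholder; pull real snapshot IDs and retirement
# dates from your provider's deprecation pages.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AIFeatureConfig:
    feature: str               # e.g. "customer_triage"
    model_id: str              # explicit snapshot, never a "latest" alias
    prompt_version: str        # pinned prompt template version
    rag_index_version: str     # pinned retrieval index build
    retirement_date: date      # from the provider's deprecation calendar
    kill_switch: bool = False  # flip to True to disable the feature instantly

RELEASE_CONFIG = {
    "customer_triage": AIFeatureConfig(
        feature="customer_triage",
        model_id="example-model-2025-06-01",    # placeholder snapshot ID
        prompt_version="triage-prompt-v14",
        rag_index_version="kb-index-2025-11-28",
        retirement_date=date(2026, 6, 1),       # placeholder retirement date
    ),
}
```

Keeping all of this in one versioned file gives you an audit trail for free: every model, prompt or flag change is a reviewable diff.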
Days 3–4 — Golden tests and baselines
- Assemble a golden set of 50–200 real, representative queries for each workflow: customer service, triage, internal search, document drafting.
- Record baseline metrics on your current production version: success rate, harmful output rate, latency, cost per request, average tokens (see the sketch after this list).
- Include adversarial prompts from OWASP LLM Top 10 (prompt injection, sensitive info leakage) to spot regressions early. owasp.org
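A baseline-recording sketch, assuming a `call_model` wrapper around your own provider client and a placeholder `is_harmful` check; both names are illustrative, not real library calls.

```python
# golden_baseline.py - record baseline metrics over a golden set.
import statistics
import time

def is_harmful(text: str) -> bool:
    # Placeholder: replace with your safety classifier or moderation endpoint.
    return False

def run_baseline(golden_set: list[dict], call_model) -> dict:
    """Each golden_set item: {"query": ..., "expected": ..., "adversarial": bool}."""
    results = []
    for item in golden_set:
        start = time.monotonic()
        response = call_model(item["query"])  # assumed to return {"text", "tokens", "cost_gbp"}
        latency = time.monotonic() - start
        results.append({
            "success": item["expected"].lower() in response["text"].lower(),  # crude check; swap in your own grader
            "harmful": is_harmful(response["text"]),
            "latency_s": latency,
            "cost_gbp": response["cost_gbp"],
            "tokens": response["tokens"],
        })
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "harmful_rate": sum(r["harmful"] for r in results) / len(results),
        "p95_latency_s": statistics.quantiles([r["latency_s"] for r in results], n=20)[18],
        "cost_per_100_gbp": 100 * statistics.mean(r["cost_gbp"] for r in results),
        "avg_tokens": statistics.mean(r["tokens"] for r in results),
    }
```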
Days 5–6 — Pre‑release evaluation
- Run the candidate model/prompt against the golden set. Only proceed if it meets or beats baseline on your blocking metrics (below).
- Have a written rollback trigger: e.g., harmful output rate > 0.5% or £/100 requests +15% vs baseline.
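A sketch of that gate as code, reusing the metric keys from the baseline sketch above. The 0.5% and +15% figures mirror the example rollback triggers in this guide; replace them with your own thresholds.

```python
# release_gate.py - pre-release gate comparing candidate metrics to baseline.

def passes_gate(candidate: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Both dicts use the keys produced by run_baseline()."""
    failures = []
    if candidate["success_rate"] < baseline["success_rate"]:
        failures.append("task success below baseline")
    if candidate["harmful_rate"] > 0.005:  # harmful output rate above 0.5%
        failures.append("harmful output rate above 0.5%")
    if candidate["cost_per_100_gbp"] > 1.15 * baseline["cost_per_100_gbp"]:
        failures.append("cost per 100 requests up more than 15% vs baseline")
    return (len(failures) == 0, failures)

# Example: block the release and record why.
# ok, reasons = passes_gate(candidate_metrics, baseline_metrics)
# if not ok:
#     raise SystemExit("Release blocked: " + "; ".join(reasons))
```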
Days 7–8 — Shadow and “dark launch”
- Shadow traffic: run the new version behind the scenes and compare outputs and costs, but show users only the old version (see the sketch after this list).
- Dark launch parts of the workflow to test system load before exposing to users. martinfowler.com
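A shadow-mode sketch: the user always receives the current version’s answer, while the candidate runs in the background and only its comparison data is logged. `call_current`, `call_candidate` and `log_shadow_diff` are assumed wrappers around your own clients and logging, not real library functions.

```python
# shadow_mode.py - run the candidate invisibly; users only see the current version.
import concurrent.futures

# Small background pool so the candidate call never blocks the user path.
_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_request(query: str, call_current, call_candidate, log_shadow_diff) -> str:
    current = call_current(query)  # the answer the user actually sees

    def _shadow():
        try:
            candidate = call_candidate(query)           # new model/prompt, invisible to users
            log_shadow_diff(query, current, candidate)  # outputs, tokens, cost, latency
        except Exception as exc:                        # shadow failures must never surface
            log_shadow_diff(query, current, {"error": str(exc)})

    _shadow_pool.submit(_shadow)
    return current["text"]
```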
Days 9–11 — Canary release
- Release to 1% of traffic for 60–120 minutes, then 5%, then 20%, pausing if any blocking metric is breached. Track SLO impact and “spend” only a sliver of your error budget during canary (a staged-ramp sketch follows this list). sre.google
- Only run one canary at a time; overlapping canaries cause signal noise and slow incident response. sre.google
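A staged-ramp sketch under the same assumptions as earlier: `set_traffic_split`, `read_live_metrics`, `breaches` and `trigger_kill_switch` stand in for your own release tooling, and the stages and hold times are illustrative.

```python
# canary_ramp.py - staged canary (1% -> 5% -> 20%) that auto-pauses on a breach.
import time

STAGES = [(0.01, 90 * 60), (0.05, 90 * 60), (0.20, 90 * 60)]  # (traffic share, hold seconds)

def run_canary(set_traffic_split, read_live_metrics, breaches, trigger_kill_switch,
               check_every_s: int = 300) -> bool:
    for share, hold_s in STAGES:
        set_traffic_split(candidate_share=share)
        deadline = time.monotonic() + hold_s
        while time.monotonic() < deadline:
            metrics = read_live_metrics()   # success, harmful, latency, cost, SLO burn rate
            failed = breaches(metrics)      # returns names of breached blocking metrics
            if failed:
                set_traffic_split(candidate_share=0.0)           # route everything back
                trigger_kill_switch(reason="; ".join(failed))    # and stop the rollout
                return False                # paused; investigate before retrying smaller
            time.sleep(check_every_s)
    return True                             # stable; safe to graduate to 50% then 100%
```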
Days 12–14 — Graduate and tidy
- Gradually increase to 50% then 100% once stable for a full business day.
- Remove temporary flags, update the runbook, and archive results for your audit trail.
Blocking metrics that auto‑pause a rollout
Pick three to five you’ll actually watch; wire them to your alerting. The moment any threshold is crossed, your release tooling should pause automatically and flip the kill switch.
| Metric | Why it matters | Example threshold |
|---|---|---|
| Task success rate | Prevents “nice but wrong” answers going live | Must be ≥ baseline (e.g., 92%) |
| Harmful output rate | Red flags for safety and brand risk | ≤ 0.5% on golden/adversarial set |
| P95 latency | User experience; timeouts drive abandonment | ≤ 2.5s for assist, ≤ 800ms for search |
| Cost per 100 requests | Stops a silent bill shock | ≤ +10% vs baseline during canary |
| SLO burn rate | Protects your error budget; triggers change freeze if overspending | Burn rate ≤ 2x over 1h window |
Using SLOs and an error‑budget policy gives teams permission to pause features when reliability dips—without endless debate. sre.google
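One common way to compute the burn rate in the table above, following the usual SRE convention: observed error rate divided by the error rate your SLO allows. A burn rate of 1.0 spends the budget exactly on schedule; 2x over an hour means you are spending it twice as fast.

```python
# burn_rate.py - sketch of the burn-rate check behind the blocking-metrics table.

def burn_rate(errors: int, total: int, slo_target: float) -> float:
    allowed_error_rate = 1.0 - slo_target        # e.g. 0.005 for a 99.5% SLO
    observed_error_rate = errors / max(total, 1)
    return observed_error_rate / allowed_error_rate

# Example: 99.5% SLO, 1-hour window, 12 failures out of 1,000 requests.
# burn_rate(12, 1000, 0.995) is about 2.4 -> above the 2x threshold, pause the rollout.
```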
Versioning, deprecations and vendor change control
- Track provider lifecycles: Put OpenAI and Anthropic deprecation pages in your release calendar; rehearse migrations 30–60 days before retirement dates. platform.openai.com
- Prefer explicit model IDs/snapshots: Avoid “latest” aliases in production; they can change without warning.
- Contractual asks: Request 60–90 days’ notice for model retirement, a like‑for‑like fallback, and a migration guide with eval tips. The UK government’s AI Cyber Security Code of Practice backs lifecycle documentation and secure updates across design, deployment and maintenance. gov.uk
Security note: incorporate AI‑specific risks (e.g., prompt injection, data poisoning) into change reviews. OWASP’s GenAI Top 10 is a useful quick reference for test design and mitigations. owasp.org
Release patterns that reduce risk (and when to use them)
Canary release
Expose a small, time‑boxed slice of traffic (1% → 5% → 20%) to the change and compare key metrics to control. Best for model upgrades, new tools and major prompt rewrites. sre.google
Feature flags / kill switches
Turn features on/off instantly without redeploying. Keep flags short‑lived and centralised to avoid “toggle spaghetti”. martinfowler.com
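A minimal sketch of the pattern: flags live in one auditable store and the request path only asks whether the flag is on. `flag_store` and the flag name are illustrative, not a specific flag product.

```python
# kill_switch.py - centralised, short-lived kill switch on the request path.

def answer_with_ai(query: str, flag_store, ai_path, fallback_path) -> str:
    # Flags live in one auditable store; the code only ever asks "is it on?".
    if flag_store.is_enabled("customer_triage_ai"):  # flag name is illustrative
        return ai_path(query)
    return fallback_path(query)  # previous version or non-AI route, no redeploy needed
```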
Dark launching
Run the new path invisibly to users to measure load and cost before a public switch‑on; ideal before peak trading or fundraising periods. martinfowler.com
Shadow mode
Run the new agent alongside the old and compare outputs; promote once discrepancies are understood. Complements our shadow‑mode walkthrough.
Security by design, not bolt‑on
Build security into your release gates. UK and international authorities emphasise secure design, deployment and operation for AI (including change control, monitoring and secure updates). If your change introduces new connectors, tools or data sources, treat it like any other supply‑chain change: threat model it and test with guardrails before going live. cisa.gov
For UK context, DSIT’s AI Cyber Security Code of Practice sets baseline lifecycle measures and documentation expectations your board can adopt as policy without slowing delivery. gov.uk
Procurement questions for safer upgrades
- What are your model version identifiers and planned retirement dates in the next 12 months? Do you provide 60–90 days’ notice and migration guides? platform.openai.com
- Can we pin a specific version in production and move on our timetable?
- What safeguards exist against OWASP LLM Top 10 risks (prompt injection, sensitive info disclosure) during updates? owasp.org
- Do you support staged rollouts, shadow traffic or test sandboxes that mimic production costs and rate limits?
- What telemetry do you expose so we can enforce SLOs and error‑budget policies during canaries? sre.google
If answers are vague, run a two‑week vendor bake‑off with canaries and golden tests. See our 2‑week vendor bake‑off framework.
Runbook: rollback in five minutes
- Detect: Alert fires on a blocking metric (harmful output, latency, cost, SLO burn rate).
- Pause: Release tool auto‑pauses and pages on‑call.
- Kill switch: Flip the flag to disable the feature or route traffic back to the previous version.
- Communicate: Post a status update in your agreed channel; note user impact and ETA.
- Stabilise: Revert configuration, re‑pin the prior model ID, and clear cache if applicable.
- Review: Log incident, attach canary metrics and golden test diffs; decide whether to retry at a smaller canary or hold a change freeze under your error‑budget policy. sre.google
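Scripting the steps above is a useful drill aid: on-call runs one command instead of remembering six steps under pressure. This sketch assumes helper objects (`flag_store`, `config_store`, `cache`, `notify`, `incident_log`) that stand in for your own release tooling and status channels.

```python
# rollback.py - the five-minute rollback as a single, rehearsable script.
from datetime import datetime, timezone

def rollback(feature: str, previous_model_id: str, flag_store, config_store,
             cache, notify, incident_log) -> None:
    flag_store.disable(feature)                              # kill switch: stop new AI traffic
    config_store.pin(feature, model_id=previous_model_id)    # re-pin the prior model ID
    cache.clear(feature)                                     # drop responses from the bad version
    notify(f"{feature}: rolled back to {previous_model_id}; assessing user impact")  # status update
    incident_log.record(                                     # evidence for the post-incident review
        feature=feature,
        rolled_back_to=previous_model_id,
        at=datetime.now(timezone.utc).isoformat(),
    )
```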
Costs and risks: quick view
| Risk | Common trigger | Mitigation | Owner |
|---|---|---|---|
| Behaviour drift | Unpinned “latest” model | Pin versions; golden tests; canary gates | Engineering/Operations |
| Cost spike | Larger context or different routing | Token budgets; pause if £/100 req +10% | Ops/Finance |
| Security regression | Prompt/tool change | OWASP LLM tests; secure‑by‑design checklist | Security lead |
| Forced migration | Model retirement | Calendarise deprecations; rehearse switch | Product/Engineering |
For deeper cost control, see our AI cost guardrails.
KPIs to review weekly
- Release frequency and mean time to rollback (target: rollback ≤ 5 minutes).
- Canary halt rate (aim to halt early, not late; fewer false negatives).
- Cost per 100 requests by route vs baseline (target: within ±10%).
- Golden set pass rate, including adversarial items (target: ≥ baseline).
- SLO attainment and error‑budget use (target: stay within budget). sre.google
Swipe file: your pre‑release checklist
- Change ticket created, owner named, rollback trigger defined.
- Model ID and prompt pinned; retirement dates noted. platform.openai.com
- Golden tests passed (functional + adversarial). owasp.org
- Shadow/dark launch completed with no capacity alarms. martinfowler.com
- Canary plan set (1% → 5% → 20%) with blocking metrics and alerting. sre.google
- Runbook open; kill switch confirmed working.