You don’t need a big platform team to ship AI features safely. With a few disciplined habits—feature flags, a small canary, and clear rollback rules—UK SMEs and charities can release AI updates during busy periods without sleepless nights or surprise costs.
Why AI needs special care: model behaviour can vary with small input changes, vendors may upgrade models underneath you, and the surrounding data drifts over time. These factors create operational risk beyond “normal” software and are well documented in research on technical debt in ML systems. papers.nips.cc
This article gives you a practical rollout-and-rollback playbook you can apply this month. It includes checklists, success metrics, a decision tree, and procurement questions you can send to your flagging-tool vendor or cloud provider. For broader go-live tactics, see also our pieces on quiet cutovers, 5‑day UAT, and AI incident drills.
The one-page idea
- Wrap every risky AI change in a flag you can switch off instantly.
- Release to a small, representative slice of real users (the canary) and compare against the control version before proceeding. sre.google
- Define clear “stop” thresholds up front (quality, latency, cost) and automate alerts.
- When in doubt, roll back first, investigate second. Keep a clean path to a known good state. gov.uk
Step‑by‑step rollout and rollback for AI features
1) Choose the right flag type
Use three simple categories:
- Release flags to decouple deployment from release (keep the code deployed but dark until ready).
- Ops/kill‑switch flags to turn off a capability instantly if quality or costs go sideways.
- Experiment flags for evaluation or A/B tests on prompts or UX variants.
These patterns are well established in modern delivery practice, but they carry complexity if left unmanaged—so plan for retirement from day one. martinfowler.com
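If you have no flag tooling yet, the pattern is small enough to sketch in a few lines. The snippet below is a minimal illustration of the three categories as a simple in-house registry; the names (`FLAGS`, `is_enabled`), owners and dates are placeholders, not any vendor's API.

```python
# Minimal illustration of the three flag categories, using an in-memory
# registry. Names, owners and dates are illustrative only.
from datetime import date

FLAGS = {
    # Release flag: code is deployed but stays dark until flipped on.
    "ai_summary_release": {"type": "release", "enabled": False,
                           "owner": "jo@example.org", "expires": date(2025, 7, 31)},
    # Ops/kill-switch flag: default ON, flipped OFF to disable the capability instantly.
    "ai_summary_kill_switch": {"type": "ops", "enabled": True,
                               "owner": "oncall@example.org", "expires": None},
    # Experiment flag: routes a cohort to a prompt variant for comparison.
    "prompt_v1_1_experiment": {"type": "experiment", "enabled": True,
                               "owner": "product@example.org", "expires": date(2025, 7, 14)},
}

def is_enabled(name: str) -> bool:
    """Fail safe: unknown or expired flags behave as 'off'."""
    flag = FLAGS.get(name)
    if flag is None:
        return False
    if flag["expires"] and date.today() > flag["expires"]:
        return False  # expired flags default off, which prompts retirement
    return flag["enabled"]

# Example gate at the call site:
if is_enabled("ai_summary_release") and is_enabled("ai_summary_kill_switch"):
    pass  # new AI path
else:
    pass  # existing fallback path
```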
2) Define “go/stop” thresholds before you start
Write down the exact conditions to proceed, pause, or roll back. Tie them to your service objectives and error budget so a small canary won’t risk your overall reliability.
- Quality guardrails: hallucination rate under X%, escalation rate to human under Y%, sensitive‑term block rate, and manual spot checks.
- Performance: p95 latency under Z ms; timeouts below A%.
- Cost: tokens or API spend per task under £B; unit economics compared to the control.
Canarying is specifically designed to protect your error budget by limiting impact to a small slice of traffic. sre.google
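Writing the stop rules down as data makes the canary review a mechanical comparison rather than a debate during an incident. A possible shape, with placeholder numbers standing in for your own X/Y/Z/A/B values:

```python
# Pre-agreed stop rules as data. All numbers are examples to replace with
# your own thresholds from step 2.
STOP_RULES = {
    "hallucination_rate": 0.02,     # manual spot-check failure rate, max 2%
    "human_escalation_rate": 0.15,  # max 15% of canary tasks escalated
    "p95_latency_ms": 2500,
    "timeout_rate": 0.01,
    "cost_per_task_gbp": 0.05,
}

def canary_decision(canary: dict, control: dict) -> str:
    """Return 'rollback', 'pause' or 'proceed' from measured canary metrics."""
    for metric, limit in STOP_RULES.items():
        if canary.get(metric, 0) > limit:
            return "rollback"       # hard breach: flip the kill switch
    # Softer rule: unit cost more than 20% worse than control is a pause.
    if canary.get("cost_per_task_gbp", 0) > 1.2 * control.get("cost_per_task_gbp", 0):
        return "pause"
    return "proceed"
```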
3) Build a safe fallback
Decide exactly what happens when the flag is turned off. Options include: revert to the previous model/prompt, show the old rule‑based path, or route to a human queue. Add a circuit breaker around external model calls so the system fails fast and avoids cascading timeouts if the model or network misbehaves. docs.aws.amazon.com
4) Plan the canary
Keep it simple:
- Stage 0: internal staff only (1–2 hours in business hours).
- Stage 1: 1% of production traffic for 2–4 hours; review metrics against thresholds.
- Stage 2: 10% for the rest of the day.
- Stage 3: 50% next business day; final 100% after sign‑off.
Only run one canary at a time and always compare canary vs control. sre.google
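Sticky assignment matters here: a user should not flip between canary and control mid-session. A minimal sketch, assuming you identify users by a stable ID and express the stages as traffic percentages:

```python
import hashlib

# Canary stages as data: (name, % of traffic). Staff-only Stage 0 is better
# handled with an explicit allow-list than a percentage.
STAGES = [("stage_1", 1), ("stage_2", 10), ("stage_3", 50), ("full", 100)]
CURRENT_STAGE = "stage_1"

def in_canary(user_id: str, percent: int) -> bool:
    """Sticky assignment: the same user always lands in the same bucket."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

percent = dict(STAGES)[CURRENT_STAGE]
# use_new_model = in_canary(user_id, percent)
```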
5) Make changes observable
- Record flag name, owner, intended retirement date, and change history.
- Annotate monitoring dashboards when flags flip, and tag logs with model version and prompt version (see the logging sketch after this list).
- Capture a small, anonymised sample of canary inputs/outputs for manual review by operations or the service owner.
- Keep an audit trail for prompts, data sources, and model changes—this is encouraged in UK government guidance for AI operations. gov.uk
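One way to make this concrete is a single structured log line per AI call, tagged with the flag, model version and prompt version; the field names below are suggestions, not a standard.

```python
import datetime
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_canary")

def log_ai_call(flag: str, model_version: str, prompt_version: str,
                latency_ms: float, outcome: str) -> None:
    """Emit one structured line per AI call so dashboards and incident
    reviews can filter by flag, model version and prompt version."""
    logger.info(json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "flag": flag,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "latency_ms": round(latency_ms, 1),
        "outcome": outcome,   # e.g. "ok", "timeout", "escalated"
    }))

# log_ai_call("ai_summary_release", "vendor-model-2025-05", "prompt-v1.1", 830.2, "ok")
```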
6) Roll forward or roll back—use this decision tree
If any threshold is breached for more than 5 minutes: flip the kill switch and revert. If metrics recover and the root cause is understood (e.g., a single prompt rule), you may retry Stage 1 once; otherwise stop the release and open an incident. Don’t chase a live incident while leaving the risky flag on.
For issues caused by upstream model/API instability, a circuit breaker plus an ops flag lets you degrade gracefully while you contact the provider. docs.aws.amazon.com
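The "breached for more than 5 minutes" rule is easy to encode so nobody has to watch a clock mid-incident. A sketch, deliberately leaving the retry-once and open-an-incident decisions with a human:

```python
import time

BREACH_WINDOW_S = 5 * 60   # "more than 5 minutes" from the rule above
_breach_started = None

def check_breach(threshold_breached: bool) -> str:
    """Return 'revert' once any stop threshold has been breached continuously
    for the window; 'watch' while the clock runs; 'ok' when metrics recover."""
    global _breach_started
    now = time.monotonic()
    if not threshold_breached:
        _breach_started = None
        return "ok"
    if _breach_started is None:
        _breach_started = now
    return "revert" if now - _breach_started >= BREACH_WINDOW_S else "watch"
```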
7) Retire the flag
Long‑lived flags become hidden complexity that increases risk—especially in AI systems where behaviour depends on data, configuration and external services. Clear them out within 14–30 days, or earlier if they were only needed as a kill switch. papers.nips.cc
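A monthly review is easier when a script produces the cleanup list. This sketch assumes the small registry from step 1 and treats any non-ops flag without an expiry date as overdue:

```python
from datetime import date

def flags_to_retire(flags: dict) -> list[str]:
    """List flags past their expiry date, plus any non-ops flag with no expiry
    at all, so the monthly review has a concrete cleanup list."""
    overdue = []
    for name, flag in flags.items():
        expires = flag.get("expires")
        if expires is not None and expires < date.today():
            overdue.append(name)
        elif expires is None and flag.get("type") != "ops":
            overdue.append(name)
    return overdue

# print(flags_to_retire(FLAGS))   # FLAGS is the registry sketched in step 1
```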
When to use flags vs blue‑green vs version bumps
- Use feature flags when the change is narrow (a specific prompt, safety filter, or model parameter) and you need instant rollback without redeploying.
- Use blue‑green/canary infrastructure when the change is broad (new service, major dependency change) or affects your whole request path. A small, measured canary gives early warning without risking everyone. martinfowler.com
- Use a semantic version bump when you can run old and new side‑by‑side and route traffic explicitly, e.g., “v1” vs “v1.1” prompts or model IDs (see the sketch below).
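Side-by-side versioning can be as simple as keeping both prompts deployed and routing by configuration, so rollback is a one-line change; `PROMPTS` and `ROUTING` are illustrative names, not a particular product's feature.

```python
# Both prompt versions stay deployed; traffic is routed by configuration.
PROMPTS = {
    "v1":   "Summarise the enquiry in plain English.",
    "v1.1": "Summarise the enquiry in plain English, flagging any safeguarding concerns.",
}
ROUTING = {"default": "v1", "canary_cohort": "v1.1"}

def prompt_for(cohort: str) -> tuple[str, str]:
    """Return (version, prompt text); unknown cohorts get the default route."""
    version = ROUTING.get(cohort, ROUTING["default"])
    return version, PROMPTS[version]

# version, prompt = prompt_for("canary_cohort")
```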
For the go‑live choreography around comms, freeze windows and back‑out plans, see our quiet cutover plan.
Procurement questions for your flagging tool or platform team
Copy/paste these into your next vendor email or Jira ticket:
- Speed: What’s the worst‑case time from flipping a flag to effect in production? Sub‑second is ideal for kill switches.
- Targeting: Can we safely target by cohort (staff, region, account, device) and keep cohorts sticky during canary testing? sre.google
- Audit: Do we get a full audit log of flag changes, who changed them, and when?
- Approvals: Can non‑engineers (e.g., product owners, operations) flip predefined ops flags with guardrails and approvals?
- Observability: Can flag flips be exported as events to our monitoring/alerting tool to help correlate incidents?
- Safety: Can we enforce default‑off for risky flags and require an expiry date and owner? Are there built‑in TTLs to prevent zombie flags?
- Resilience: What’s the failure mode if the flag service is unavailable? Do clients cache values safely?
- Security & access: SSO support, role‑based permissions, and least‑privilege controls for who can toggle what.
- Data location: Where is configuration and audit data stored? UK/EU hosting options?
- Cost transparency: Predictable pricing per MAU/request; limits and overage rules to avoid surprises.
Risks and what to do about them
| Risk | Why it matters | Controls that work | Indicative cost |
|---|---|---|---|
| Long‑lived flags accumulate and collide | Multiple code paths, hard to test; surprises on flip day | Force expiry dates; monthly flag reviews; delete code when retired | 2–4 hours of team time per month; saves incident hours later |
| Upstream model/API outage or slowdown | Timeouts cascade and burn your error budget | Circuit breaker + ops flag; cached responses for non‑critical paths | Low engineering effort; large reduction in incident blast radius docs.aws.amazon.com |
| Silent behaviour change after vendor update | Quality or safety degrades without code changes | Pin versions where possible; canary for config/prompt updates; keep evaluation samples | Lightweight; relies on team discipline and monitoring |
| No clear rollback threshold | Debate during incidents, slower response | Pre‑agreed SLOs, stop rules, and on‑call authority to flip the switch | Free; set once and reuse sre.google |
| Weak audit trail | Hard to explain incidents or meet governance expectations | Log flag flips; record prompt/model versions; keep change notes | Minimal; aligns with UK guidance on secure AI operations gov.uk |
KPIs to track for 72 hours and 30 days
- Quality: manual spot‑check pass rate on canary sample; escalation to human; blocked sensitive outputs.
- Reliability: error rate, timeouts, p95/p99 latency, and retried requests.
- Cost: average cost per resolved task vs control; daily spend vs budget.
- User impact: task completion rate, deflection from human support, NPS/CSAT trend.
- Ops speed: time from breach to flag flip (MTTR); time to full rollout.
If any KPI drifts after go‑live, run a small “shadow canary” with the previous configuration to isolate whether the model, prompt, or inputs changed. Research on ML technical debt highlights how configuration and data dependencies can surprise you later—treat monitoring as an ongoing practice, not a launch task. papers.nips.cc
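For the cost and drift comparisons, two small helpers are enough to start with; the 15% tolerance and the field values are assumptions to adjust for your own KPIs, and the direction check only suits metrics where higher is worse (cost, latency, errors).

```python
# Sketch of the 72-hour comparison: cost per resolved task plus a simple
# drift check against the control. Numbers below are placeholders.
def cost_per_resolved_task(total_spend_gbp: float, resolved_tasks: int) -> float:
    return total_spend_gbp / max(resolved_tasks, 1)

def drifted(canary_value: float, control_value: float, tolerance: float = 0.15) -> bool:
    """True if the canary is more than `tolerance` worse than control,
    for metrics where a higher value is worse."""
    if control_value == 0:
        return canary_value > 0
    return (canary_value - control_value) / control_value > tolerance

# drifted(cost_per_resolved_task(42.10, 500), cost_per_resolved_task(35.80, 480))
```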
The 10‑minute checklist
- Flag created with owner, purpose, default‑off, and expiry date.
- Known fallback defined and tested end‑to‑end.
- Canary stages and stop thresholds written down and shared.
- Dashboards annotated; alerts wired to the on‑call rota.
- Prompts, model IDs, and safety filters versioned; changes logged. gov.uk
- “Go/No‑Go” meeting in business hours; only one canary at a time. sre.google
- Post‑release cleanup booked: delete dead code and retire flags within 30 days. martinfowler.com
Why this approach aligns with UK best practice
The UK’s AI Cyber Security Code of Practice encourages secure‑by‑design operations, incident readiness, and the ability to restore to a known good state. Using flags with canarying, clear SLOs, and a kill‑switch meets these expectations while keeping change risk small and reversible. gov.uk
For teams formalising quality gates, our AI quality scoreboard post offers a quick path to measurable SLAs.