[Image: Operations team reviewing a dashboard showing AI copilot suggestions vs human decisions]
Case study & walkthrough

Shadow Mode First: A 10‑Day Walkthrough to Launch an AI Copilot in Your Back Office

“Shadow mode” means your AI runs alongside your existing process, generating suggestions that humans can accept, edit or ignore — but the AI does not make the final call yet. It is the fastest, lowest‑risk way for a UK SME or charity to prove value, measure quality, and build staff confidence before switching anything on for real.

This article gives you a concrete, day‑by‑day plan to launch a back‑office copilot in 10 days. We focus on a single workflow (for example: drafting responses to common enquiries, checking invoices, preparing case notes, or summarising long documents). You’ll get KPIs, a light governance pack, cost guardrails, procurement questions, and a simple go/no‑go.

What success looks like in 10 days

  • A defined workflow with a sample of real cases and a small group of trained staff.
  • An AI copilot producing side‑by‑side suggestions, captured in a simple log for review.
  • A quality pack: accuracy and usefulness scores, examples, and triage rules for when the AI should abstain.
  • Guardrails on cost per successful outcome, with rollback switches ready to flip. See our feature‑flags playbook.
  • A one‑page go/no‑go with KPIs and risks, signed off by the sponsor. If you need a template, use the AI Quality Scoreboard and the Go‑Live Gate.

Day 0: Pick one workflow and one success metric

Choose a frequent, painful task where a strong draft saves time. Avoid edge cases in week one. Good examples: first draft of a standard email, short summary of a long PDF, or first pass on expense categorisation. Agree a single primary KPI and a cost ceiling for week one:

Workflow | Primary KPI (week 1) | Good result | Cost guardrail
Email/case response drafts | Acceptance rate | ≥ 45% of AI drafts accepted with minor edits | ≤ £0.30 per accepted draft
Document summaries | Time saved | ≥ 30% reduction vs baseline minutes | ≤ £0.15 per usable summary
Invoice or claim checks | Precision on flagged issues | ≥ 80% of AI flags are correct | ≤ £0.20 per correct flag

Define the abstain rule: when the AI should say “I’m not confident — please handle manually”. This protects users and builds trust early.
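
To make the abstain rule concrete, here is a minimal Python sketch. It assumes you can see the case's file type, a rough word count and some confidence signal before a suggestion is shown; the field names and thresholds are illustrative, not taken from any particular product.

```python
# A minimal sketch of an abstain rule. All field names, file types and
# thresholds are illustrative assumptions, not a prescribed configuration.

ABSTAIN_MESSAGE = "I’m not confident — please handle manually."

ALLOWED_FILE_TYPES = {"pdf", "docx", "txt", "eml"}
MIN_WORDS = 50          # too little context to draft from
MAX_WORDS = 20_000      # too long: route to a person (and protect costs)
MIN_CONFIDENCE = 0.6    # however your tool or heuristic expresses confidence

def copilot_or_abstain(file_type: str, word_count: int, confidence: float) -> str | None:
    """Return the abstain message when the case is outside the copilot's comfort zone,
    or None when it is safe to show a suggestion (the human still decides)."""
    if file_type.lower() not in ALLOWED_FILE_TYPES:
        return ABSTAIN_MESSAGE
    if word_count < MIN_WORDS or word_count > MAX_WORDS:
        return ABSTAIN_MESSAGE
    if confidence < MIN_CONFIDENCE:
        return ABSTAIN_MESSAGE
    return None
```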

Day 1: Assemble your “two‑pizza” team and set ground rules

Keep it small: a sponsor, a product owner, 2–4 front‑line staff, and someone who can export data and wire up a log. Agree:

  • Decision rights: who can sign off the go/no‑go and budget changes.
  • Data boundaries: what documents or fields the copilot may read, and what must be excluded or masked.
  • Operational guardrails: a weekly cost cap and the ability to switch the copilot off by workflow or user group. If you need a pattern, see our feature‑flags guidance.

Give staff a 45‑minute briefing on what the copilot will and won’t do, how to rate suggestions, and how to escalate to a person. GOV.UK’s service manual has clear, non‑technical language on testing with users and running simple experiments that you can borrow for your briefing materials: A/B testing guidance.

Day 2: Prepare a small, representative dataset

Export 50–100 recent, typical cases for the workflow. Remove anything sensitive that staff would not normally see. Label 10 “golden examples” that represent great outcomes so you can compare the copilot’s suggestions against them.

Capture a 2–3 sentence context note for each case: what matters, any constraints, and how success is judged. This makes review sessions faster and more consistent.

Day 3: Instrument your pilot (lightweight and auditable)

You do not need a data platform to start. Create a simple log (for example, a protected sheet or a shared tracker) that records the following per case (a minimal logging sketch follows the list):

  • Timestamp, user, case ID (or pseudonym), and the source document length.
  • What the copilot suggested (stored safely), whether it was accepted/edited/rejected, and why.
  • Time taken with and without the copilot (rough minutes), plus a 1–5 usefulness score.
  • Model/config used, so you can attribute costs and quality.
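
If a spreadsheet feels fiddly, the same log can live in a small script. Below is a minimal Python sketch that appends one row per reviewed case to a CSV file; the file path, column names and example values are assumptions that mirror the bullet list above, not a prescribed schema.

```python
# A lightweight, auditable pilot log as a CSV file. A protected spreadsheet
# works just as well; the column names simply mirror the fields listed above.
import csv
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("copilot_shadow_log.csv")  # illustrative location

FIELDS = [
    "timestamp", "user", "case_id", "source_doc_words",
    "suggestion_ref",            # pointer to the stored suggestion, not the text itself
    "decision",                  # accepted / edited / rejected
    "decision_reason",
    "minutes_with_copilot", "minutes_without_copilot",
    "usefulness_1_to_5",
    "model_config",              # so cost and quality can be attributed
]

def log_case(**row) -> None:
    """Append one reviewed case to the shadow-mode log."""
    is_new = not LOG_PATH.exists()
    row.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    with LOG_PATH.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

# Example (illustrative values):
# log_case(user="reviewer_a", case_id="CASE-0042", source_doc_words=1800,
#          suggestion_ref="drafts/0042.txt", decision="edited",
#          decision_reason="tone too formal", minutes_with_copilot=6,
#          minutes_without_copilot=15, usefulness_1_to_5=4,
#          model_config="model-x-small-v1")
```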

Make sure access to logs is role‑based and that you can delete entries on request. The UK’s National Cyber Security Centre has practical, plain‑English guidance on building and operating AI systems safely, including logging and access control: Secure AI system development.

Day 4: Define quality criteria and a fast feedback loop

For your workflow, agree three or four quality criteria that reviewers will rate from 1–5, such as factual accuracy, tone, structure, and policy alignment. Add a free‑text box: “What would have made this suggestion useful?”

Set weekly thresholds using our AI Quality Scoreboard. For example: “Go” if acceptance ≥ 45% and accuracy ≥ 4/5; “Hold” if either falls below target; “Stop” if there is a critical issue (for example, unsafe advice).
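
As a rough illustration of how that weekly check could be automated, here is a small Python sketch. It assumes review records with a decision field, a 1–5 accuracy score and a critical-issue flag, and reuses the example thresholds above; counting “edited” alongside “accepted” mirrors the “accepted with minor edits” definition from Day 0. All names and numbers are illustrative.

```python
# A minimal sketch of the weekly Go/Hold/Stop check described above.
# Record field names and thresholds are illustrative assumptions.

def weekly_verdict(reviews: list[dict]) -> str:
    if any(r.get("critical_issue") for r in reviews):
        return "Stop"   # e.g. unsafe advice observed
    if not reviews:
        return "Hold"   # nothing to judge yet
    # "accepted with minor edits" counts as acceptance
    accepted = sum(1 for r in reviews if r["decision"] in ("accepted", "edited"))
    acceptance_rate = accepted / len(reviews)
    mean_accuracy = sum(r["accuracy_1_to_5"] for r in reviews) / len(reviews)
    if acceptance_rate >= 0.45 and mean_accuracy >= 4.0:
        return "Go"
    return "Hold"
```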

Day 5: Switch on shadow mode for 5–10 staff

Enable the copilot for a small group. The AI appears where staff already work (email client, case system, document viewer) and proposes drafts or flags — staff still decide. Make it clear in the UI what the AI can do, and show examples so people know what “good” looks like.

Keep the experience predictable. If your workflow is a single action (“Summarise this document”), expose it as a button with a short confirmation step rather than a free‑text chat. If it needs a few structured inputs (tone, length, audience), use a small form. We explain this trade‑off in our latest post on channel fit, but the short version is: choose the interface that speeds people to the outcome, not the one that looks most futuristic.

Day 6: Sample review — 10 cases, 30 minutes

Run a short review with the sponsor and two front‑line staff. Look at 10 recent cases side‑by‑side: the copilot’s suggestion, the final human outcome, and the rating. Ask:

  • Where did we save the most time?
  • Where did people reject the AI, and why?
  • What would make it 20% faster next week: better prompt presets, clearer inputs, or additional reference documents?

Agree one small change to ship tomorrow. Keep changes reversible and behind switches so you can compare like with like. When you’re close to a decision, use our 5‑day UAT plan for a quick sign‑off.

Day 7: Costs, capacity, and “no‑surprises” guardrails

Shadow mode rarely breaks the bank if you cap sessions and route only eligible cases to the AI. Set and communicate these rules (a sketch of the caps and switches follows the list):

  • Eligible cases only: minimum document quality, allowed file types, and an abstain path when confidence is low.
  • Session limits: number of suggestions per case and maximum length for long documents to prevent runaway costs.
  • Per‑user caps: weekly token or spend ceiling, with a friendly warning at 80% usage.
  • Rollback switches: ability to turn the copilot off by workflow, user group, or time of day.
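
Here is a minimal Python sketch of those caps and switches. The limits, group names and in-memory sets are illustrative assumptions; in practice the same checks might live in your vendor's admin settings rather than your own code.

```python
# A sketch of "no-surprises" guardrails: a per-user weekly spend cap with a
# warning at 80%, a suggestions-per-case limit, and off-switches by workflow
# or user group. All names, limits and the in-memory storage are illustrative.

WEEKLY_SPEND_CAP_GBP = 5.00        # per user, per week (illustrative)
WARN_AT = 0.8                      # friendly warning at 80% of the cap
MAX_SUGGESTIONS_PER_CASE = 3

disabled_workflows: set[str] = set()     # e.g. {"invoice_checks"}
disabled_user_groups: set[str] = set()   # e.g. {"night_shift"}

def may_suggest(workflow: str, user_group: str, user_weekly_spend_gbp: float,
                suggestions_so_far: int) -> tuple[bool, str]:
    """Return (allowed, message) before the copilot produces another suggestion."""
    if workflow in disabled_workflows or user_group in disabled_user_groups:
        return False, "Copilot is switched off for this workflow or group."
    if user_weekly_spend_gbp >= WEEKLY_SPEND_CAP_GBP:
        return False, "Weekly cost cap reached; continue manually this week."
    if suggestions_so_far >= MAX_SUGGESTIONS_PER_CASE:
        return False, "Suggestion limit reached for this case."
    if user_weekly_spend_gbp >= WARN_AT * WEEKLY_SPEND_CAP_GBP:
        return True, "Heads up: you have used over 80% of this week's copilot budget."
    return True, ""
```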

Document these in a one‑pager the sponsor can share with finance and the DPO. It should list the cost ceiling per successful outcome, the overall weekly budget, and the escalation path if something goes wrong.

Day 8: Expand the dataset and tighten prompts — lightly

Add 20–30 more representative cases and, if needed, one or two policy or style documents. Prefer simple, visible controls (tone, reading level, relevant policy link) over long hidden prompts. If you’re tempted to rebuild the UI as a chatbot, pause and check whether a button or a small form would be faster for staff.

Update your tracker to show time to outcome per case and per person. This helps spot whether the copilot is genuinely saving time or simply moving work around.
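
If your tracker is the CSV log sketched on Day 3, a short script can produce the per-person view. The column names below are the same assumed ones from that sketch; adapt them to whatever your tracker actually uses.

```python
# A sketch of the "time to outcome" view: average minutes saved per person,
# computed from the shadow-mode log described on Day 3 (assumed column names).
import csv
from collections import defaultdict

def time_saved_by_person(log_path: str = "copilot_shadow_log.csv") -> dict:
    totals = defaultdict(lambda: {"with": 0.0, "without": 0.0, "cases": 0})
    with open(log_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            t = totals[row["user"]]
            t["with"] += float(row["minutes_with_copilot"])
            t["without"] += float(row["minutes_without_copilot"])
            t["cases"] += 1
    report = {}
    for user, t in totals.items():
        saved = t["without"] - t["with"]
        report[user] = {
            "cases": t["cases"],
            "avg_minutes_saved": round(saved / t["cases"], 1),
            "pct_faster": round(100 * saved / t["without"], 1) if t["without"] else 0.0,
        }
    return report
```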

Day 9: Decide on the go/no‑go

Bring the sponsor, product owner and two reviewers together for 45 minutes. Present the KPI pack, three representative examples, the cost summary, and the risks with mitigations. Here’s a simple decision frame (a worked sketch follows the table):

Criteria | Go | Hold | Stop
Primary KPI | Meets or exceeds threshold for 3 straight days | Within 10% of target or highly variable | Below target with no improvement trend
Cost per successful outcome | At or below guardrail | Up to 20% over, with a clear fix | Over guardrail with unclear driver
Operational risk | No critical issues; escalation works | Minor issues with a plan | Critical issue observed
People | Team acceptance ≥ 4/5 | 3–3.9/5 with specific concerns | < 3/5 or material pushback
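
If you prefer to codify the frame rather than eyeball it, here is a hedged Python sketch. The input names and thresholds simply restate the table above; treat it as a starting point for discussion, not a substitute for the sponsor's judgement.

```python
# A sketch that codifies the decision frame: score each criterion Go/Hold/Stop,
# then take the most cautious of the four. Inputs and thresholds are illustrative.

ORDER = {"Go": 0, "Hold": 1, "Stop": 2}

def decide(kpi_days_on_target: int, kpi_within_10pct: bool,
           cost_vs_guardrail: float,          # e.g. 1.15 means 15% over guardrail
           cost_fix_identified: bool,
           critical_issue: bool, minor_issues_with_plan: bool,
           team_acceptance: float) -> str:
    verdicts = []

    # Primary KPI
    if kpi_days_on_target >= 3:
        verdicts.append("Go")
    elif kpi_within_10pct:
        verdicts.append("Hold")
    else:
        verdicts.append("Stop")

    # Cost per successful outcome
    if cost_vs_guardrail <= 1.0:
        verdicts.append("Go")
    elif cost_vs_guardrail <= 1.2 and cost_fix_identified:
        verdicts.append("Hold")
    else:
        verdicts.append("Stop")

    # Operational risk
    verdicts.append("Stop" if critical_issue
                    else ("Hold" if minor_issues_with_plan else "Go"))

    # People (team acceptance out of 5)
    verdicts.append("Go" if team_acceptance >= 4.0
                    else ("Hold" if team_acceptance >= 3.0 else "Stop"))

    # Overall verdict is the most cautious of the four
    return max(verdicts, key=ORDER.get)
```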

If “Go”, choose a small, real slice for production (for example, one team, one region, or out‑of‑hours only) and set a review date in two weeks. If “Hold” or “Stop”, keep the pilot in shadow mode and fix the one or two biggest issues first.

Day 10: Prepare production like adults

Before you flip the switch for a real slice, run a one‑page production review: logging switched on, a support path, cost caps, and a rollback plan. We’ve bundled this thinking in our Go‑Live Gate. Keep it boring and predictable.

Production checklist (cut‑down)

  • Access and audit: Who can use the copilot on day one? Is access logged and revocable? Is the audit trail retained appropriately?
  • Abstain and escalate: Do users know when and how the AI will abstain, and how to reach a person or an alternative channel?
  • Quality monitoring: Is there a weekly sample review and a named owner to act on findings?
  • Cost monitoring: Are dashboards or alerts in place for cost per successful outcome and total weekly spend?
  • Communications: Do staff have a two‑minute crib sheet and know how to rate suggestions?

If you sell to or work with public sector partners, the Technology Code of Practice is a helpful lens for procurement and integration questions — even for SMEs.

How to measure value without a data team

Use a small, stable KPI set and review weekly. Don’t chase vanity metrics like “containment”. Focus on outcomes your team and board care about; a worked cost example follows the table.

Metric | Why it matters | How to capture it | Good first‑month signal
Acceptance rate | Shows whether suggestions are genuinely useful | Binary tick from the user; store reason if rejected | ≥ 45% accepted with minor edits
Time to outcome | Quantifies time saved vs baseline | Rough minutes; sample weekly | ≥ 20–30% faster
Escalation success | Ensures recovery paths work | Track when users switch to a human or alternative channel, and whether the case is resolved | ≥ 80% successful escalations
Cost per successful outcome | Keeps spend aligned to value | Divide model spend by accepted outputs | At or under guardrail
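
To make the last row concrete, here is a tiny worked example; the numbers are purely illustrative.

```python
# Worked example with illustrative numbers: £12.40 of model spend in a week
# and 52 accepted drafts gives roughly £0.24 per successful outcome,
# inside the £0.30 guardrail set on Day 0.
weekly_model_spend_gbp = 12.40   # illustrative
accepted_outputs = 52            # illustrative
cost_per_successful_outcome = weekly_model_spend_gbp / accepted_outputs  # ≈ 0.24
```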

Procurement questions that keep you out of trouble

When reviewing vendors or integrators for a shadow‑mode pilot, ask:

  1. Channel fit: Show us your product delivering this workflow as a button or small form, not just chat. How do you support both modes?
  2. Quality and review: Can your product capture acceptance rates, reviewer comments, and per‑case costs out‑of‑the‑box?
  3. Guardrails: How do you enforce abstain rules, cost caps and per‑user limits? Can we switch features off quickly?
  4. Data handling: Where is data processed and stored? How long are logs kept? Can we delete a case on request? For sensible security prompts, see the NCSC guidance above.
  5. UAT support: Do you provide a short test plan and sign‑off checklist? We expect something akin to our 5‑day UAT.

Risks and how to mitigate them this week

  • Over‑reliance on chat: If staff are typing long prompts, pause and add visible controls (tone, length, audience) or convert to a form.
  • Hallucinated facts: Use the abstain rule and require a link or quote for factual claims in high‑risk workflows. Review a 10‑case sample weekly.
  • Quiet cost creep: Cap sessions, limit long documents, and track cost per successful outcome daily in week one.
  • Process drift: Put the copilot where the work happens. Avoid bouncing between tools. Keep the suggestion short and editable.
  • People anxiety: Make the AI optional, collect feedback, and showcase wins in the week‑two all‑hands. Small, real examples beat long decks.

A simple rollout plan after “Go”

If you approve a small production slice, widen access gradually and keep costs and quality visible. A practical 4‑week path:

  1. Week 1: One team, limited hours (for example, out‑of‑hours) and strict cost caps.
  2. Week 2: Two teams or one additional workflow. Keep the weekly sample review and fix one UX issue.
  3. Week 3: Broaden documents the AI can read; add a preset or control instead of a longer prompt.
  4. Week 4: Decide whether to automate a tiny sub‑step end‑to‑end. Use feature flags and a clear rollback. If in doubt, hold — stability beats speed.

Keep your go/no‑go discipline. If metrics slip for a week, pause, fix, and resume. It’s better to be boring than to rebuild trust later.

Book a 30‑minute call, or email: team@youraiconsultant.london

Further reading (short, practical)