[Image: Product manager and service lead reviewing AI copilot UX on laptops, with sticky notes highlighting moments of confusion]
Product & UX Playbook

“Why did it do that?” Fix the moment of confusion in your AI copilot: a 12‑point UX checklist for UK SMEs

If you’ve shipped an AI assistant or “copilot” this year, you’ve probably heard users say: “Why did it do that?” These micro‑interruptions — brief pauses where trust dips and the user hesitates — quietly drive up cost‑to‑serve and erode confidence. The good news: most are fixable with UX patterns you already know, adapted to AI’s quirks.

This article gives non‑technical leaders a practical checklist, decision aids, KPIs and a 30‑day plan to remove confusion, reduce retries and increase satisfaction. We reference established interaction research on response times and human‑AI guidelines so you can align your team fast. For latency and perceived‑performance thresholds, see Google’s RAIL model: under 100 ms feels instant, around 1 s keeps focus, and over 10 s risks abandonment (web.dev).

What’s a “micro‑interruption” — and why it matters

  • Definition: A short moment where the assistant behaves unexpectedly, answers too slowly, or hides its reasoning, prompting the user to pause, reread or retry.
  • Business impact: More after‑call work, extra emails and tickets, duplicated effort, lower completion rates and avoidable escalations.
  • Tell‑tale signs: Users ask “What is it doing?”, “Where did this come from?”, or “Can I trust this?”; agents re‑phrase and resend the same prompt; managers see response times creep up as users add screenshots and long context.

A 12‑point UX checklist to prevent “Why did it do that?”

Organised by the four phases in Microsoft’s Guidelines for Human‑AI Interaction (microsoft.com). Use it to review your copilot this week.

1) At first use — set expectations

  1. State scope and limits plainly. On the first screen, list 3–5 tasks your copilot is good at and 2–3 things it can’t do yet. Link to “What sources does it use?” and “What happens with my data?” (A minimal configuration sketch follows this list.)
  2. Make source boundaries visible. Label replies with the dataset used (e.g., “Policies folder, updated 18 Dec”) and a toggle to show sources.
  3. Show the path to human help. Provide a prominent “Escalate to a person” option when the assistant is unsure.
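
To make points 1–3 concrete, here is a minimal sketch, assuming a hypothetical CopilotProfile object that the first‑use screen and the source labels could both render from; every name is illustrative rather than a reference to any particular product.

```typescript
// Hypothetical shape for the "first use" information a copilot should surface.
interface CopilotProfile {
  goodAt: string[];          // 3–5 tasks the assistant handles well
  notYet: string[];          // 2–3 known limits, stated plainly
  sources: { name: string; updated: string }[]; // datasets with freshness dates
  escalationLabel: string;   // the always-visible route to a person
}

const profile: CopilotProfile = {
  goodAt: ["Summarise HR policies", "Draft customer replies", "Find the latest benefits PDF"],
  notYet: ["Legal advice", "Payroll changes"],
  sources: [{ name: "Policies folder", updated: "18 Dec" }],
  escalationLabel: "Escalate to a person",
};

// Render a plain-text first-use notice from the single config above.
function firstUseNotice(p: CopilotProfile): string {
  return [
    `Good at: ${p.goodAt.join(", ")}`,
    `Not yet: ${p.notYet.join(", ")}`,
    `Sources: ${p.sources.map((s) => `${s.name} (updated ${s.updated})`).join("; ")}`,
    `Stuck? ${p.escalationLabel}`,
  ].join("\n");
}

console.log(firstUseNotice(profile));
```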

2) During interaction — maintain control and clarity

  4. Confirm intent fast. Give immediate visual feedback on input (a pressed‑button state, a typing indicator) and show a progress state within ~1 s; avoid blank screens (web.dev). A timing sketch follows this list.
  5. Chunk long answers. Deliver a short, scannable summary first, then expandable detail and links to sources.
  6. Offer pivots, not prompts. Give 2–3 follow‑up buttons like “Tighten to 100 words” or “Show policy excerpt” so users don’t have to engineer prompts.
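
A minimal timing sketch for point 4, assuming hypothetical UI hooks (showTyping, showProgress, offerAlternative); the thresholds are the RAIL‑style bands discussed in this article, not values taken from any specific framework.

```typescript
// Hypothetical UI hooks; in a real product these would update the interface.
type UiHooks = {
  showTyping: () => void;                  // immediate acknowledgement (aim: well under 100 ms)
  showProgress: (msg: string) => void;     // visible progress by ~1 s
  offerAlternative: (msg: string) => void; // fallback if the answer approaches ~10 s
};

async function answerWithFeedback(ask: () => Promise<string>, ui: UiHooks): Promise<string> {
  ui.showTyping(); // acknowledge the input straight away, before any network work

  // Escalating reassurance if the answer is slow to arrive.
  const progressTimer = setTimeout(() => ui.showProgress("Still working on your answer…"), 1000);
  const fallbackTimer = setTimeout(
    () => ui.offerAlternative("Taking longer than usual. Want me to email the answer instead?"),
    10000
  );

  try {
    return await ask();
  } finally {
    clearTimeout(progressTimer);
    clearTimeout(fallbackTimer);
  }
}

// Example: a fake 2-second answer triggers the progress message but not the fallback.
answerWithFeedback(
  () => new Promise((resolve) => setTimeout(() => resolve("Here is the policy summary."), 2000)),
  { showTyping: () => console.log("…"), showProgress: console.log, offerAlternative: console.log }
).then(console.log);
```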

3) When it’s wrong — recover gracefully

  7. Admit uncertainty. Use clear language (“I’m not confident about this answer”) and propose a next step (“Would you like me to search the ‘HR Policies’ folder or ask HR Support?”) (microsoft.com).
  8. One‑tap feedback that matters. Replace generic thumbs‑up/down with reasons (“Wrong source”, “Out of date”, “Too slow”, “Not UK‑relevant”) and route these signals to your backlog weekly; a reason‑code sketch follows this list.
  9. Easy undo. Provide a visible “Restore previous version” and log what changed and why.
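
A sketch of point 8’s reason codes, assuming a hypothetical in‑memory store; the codes mirror the examples above and the weekly triage is simply a count per reason.

```typescript
// Reason codes you will actually act on, instead of a bare thumbs-down.
type FeedbackReason = "wrong_source" | "out_of_date" | "too_slow" | "not_uk_relevant";

interface FeedbackEvent {
  sessionId: string;
  reason: FeedbackReason;
  comment?: string;
  at: Date;
}

const feedbackLog: FeedbackEvent[] = []; // stand-in for whatever store you already use

function recordFeedback(sessionId: string, reason: FeedbackReason, comment?: string): void {
  feedbackLog.push({ sessionId, reason, comment, at: new Date() });
}

// Weekly triage: count events per reason so the loudest problem tops the backlog.
function weeklyTriage(events: FeedbackEvent[]): Record<FeedbackReason, number> {
  const counts = { wrong_source: 0, out_of_date: 0, too_slow: 0, not_uk_relevant: 0 };
  for (const e of events) counts[e.reason] += 1;
  return counts;
}

recordFeedback("s-101", "out_of_date", "Quotes the 2023 mileage rate");
recordFeedback("s-102", "too_slow");
console.log(weeklyTriage(feedbackLog)); // { wrong_source: 0, out_of_date: 1, too_slow: 1, not_uk_relevant: 0 }
```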

4) Over time — earn trust

  10. Change with consent. When you update behaviour or sources, show a short “What’s new” with a link to release notes and a way to opt out for a period.
  11. Explain improvements in plain English. “We now pull from SharePoint ‘Benefits 2025’ and prioritise official PDFs over slides.”
  12. Close the loop. If you escalate to a person, let the user know the outcome and, where appropriate, teach the assistant from that resolution.

Set latency expectations that feel fast to humans, not just machines

Users judge responsiveness in bands. Aim for immediate acknowledgement (under 100 ms), keep people in flow with a first visible response by ~1 s, and assume attention will wander after ~10 s unless you provide clear progress and alternatives. These targets align with the RAIL performance model and widely used UX thresholds (web.dev).

Practical ways to hit these targets:

  • Pre‑fetch context for signed‑in users so you can render a skeleton answer immediately.
  • Stream the first sentence and pin a “still drafting” badge until complete (sketched after this list); avoid showing raw, unreviewed text for regulated answers.
  • Offer a switch between “Quick summary” and “Full evidence” for longer tasks so the user stays in control of time vs detail.
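
A streaming sketch for the second bullet, assuming the model response arrives as an async iterable of text chunks (how your vendor exposes this will differ); it surfaces the first sentence as soon as it is complete and keeps a “still drafting” flag until the stream ends.

```typescript
// Assumed shape: the vendor streams the answer as chunks of text.
type AnswerStream = AsyncIterable<string>;

interface RenderState {
  text: string;
  stillDrafting: boolean; // drive a "still drafting" badge from this flag
}

async function renderStreamed(
  stream: AnswerStream,
  onUpdate: (state: RenderState) => void
): Promise<string> {
  let text = "";
  let firstSentenceShown = false;
  for await (const chunk of stream) {
    text += chunk;
    // Show something meaningful as soon as the first sentence is complete.
    if (!firstSentenceShown && /[.!?]\s/.test(text)) firstSentenceShown = true;
    if (firstSentenceShown) onUpdate({ text, stillDrafting: true });
  }
  onUpdate({ text, stillDrafting: false }); // the badge comes off only when drafting ends
  return text;
}

// Fake stream for illustration; a regulated answer would be held back for review instead.
async function* fakeStream() {
  for (const chunk of ["Short answer: yes. ", "The policy allows ", "25 days plus bank holidays."]) {
    yield chunk;
  }
}

renderStreamed(fakeStream(), (s) => console.log(s.stillDrafting ? "[drafting]" : "[done]", s.text));
```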

Decision helper: which interaction pattern fits your task?

If the task is high‑risk or legally sensitive

  • Use a guided form + generated draft (structured inputs, controlled outputs).
  • Require human approval before sending or publishing (a simple approval gate is sketched after this list).
  • Show sources and confidence by default.
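
For the high‑risk path, a minimal sketch of a draft that cannot be sent until a named person approves it; the types and statuses are illustrative assumptions, not a prescribed workflow.

```typescript
type DraftStatus = "draft" | "approved" | "sent";

interface HighRiskDraft {
  body: string;
  sources: string[];                       // shown by default for high-risk answers
  confidence: "high" | "medium" | "low";   // surfaced alongside the draft
  status: DraftStatus;
  approvedBy?: string;
}

function approve(draft: HighRiskDraft, reviewer: string): HighRiskDraft {
  return { ...draft, status: "approved", approvedBy: reviewer };
}

// Sending is only possible once a human has approved the draft.
function send(draft: HighRiskDraft): HighRiskDraft {
  if (draft.status !== "approved") {
    throw new Error("High-risk drafts need human approval before sending.");
  }
  return { ...draft, status: "sent" };
}

const redundancyLetter: HighRiskDraft = {
  body: "Draft consultation letter…",
  sources: ["HR Policies > Redundancy 2025.pdf"],
  confidence: "medium",
  status: "draft",
};

console.log(send(approve(redundancyLetter, "hr.lead@example.co.uk")).status); // "sent"
```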

If the task is exploratory or creative

  • Use a chat or outline pattern with quick pivots (“shorter/longer”, “change tone to supportive”).
  • Let users bookmark good paths and copy the “recipe” (prompt + settings) for reuse.

For concrete examples of answer patterns and evaluation, see our case study on launching an AI answers widget in 21 days and our 9 AI content quality tests.

Measure what leaders care about

Directors and trustees need simple, comparable measures. The GOV.UK Service Manual recommends publishing four core KPIs for digital services: completion rate, user satisfaction, digital take‑up and cost per transaction (gov.uk). These map neatly to AI assistants as well; a sketch showing how to compute them from session logs follows the table.

| KPI | Definition for your copilot | Target for Q1 | How to capture |
| --- | --- | --- | --- |
| Completion rate | % of sessions that reach the intended outcome without human intervention | +10–15% vs December baseline | Analytics funnels; tagged intents |
| User satisfaction | Simple, in‑context survey (Very satisfied → Very dissatisfied) post‑interaction | ≥ 70% satisfied | End‑of‑session survey; monthly publish |
| Digital take‑up | % of cases handled by the copilot vs email/phone/manual | +20% shift to digital | Channel analytics and call‑centre stats |
| Cost per transaction | Fully‑loaded cost per successfully completed AI‑assisted case | −15% by end of Q1 | Finance model plus volume and token/compute logs |
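
A sketch of how the four measures could be computed from session logs, assuming a hypothetical Session record; the field names are illustrative and your analytics tool will already hold most of this.

```typescript
// Hypothetical per-session record; map your own analytics fields onto it.
interface Session {
  completed: boolean;                              // reached the intended outcome without a human stepping in
  satisfied?: boolean;                             // from the end-of-session survey, when answered
  channel: "copilot" | "email" | "phone" | "manual";
  cost: number;                                    // fully-loaded cost in pounds for this case
}

function kpis(sessions: Session[]) {
  const copilot = sessions.filter((s) => s.channel === "copilot");
  const surveyed = copilot.filter((s) => s.satisfied !== undefined);
  const completedCopilot = copilot.filter((s) => s.completed);
  return {
    completionRate: completedCopilot.length / copilot.length,
    userSatisfaction: surveyed.filter((s) => s.satisfied).length / surveyed.length,
    digitalTakeUp: copilot.length / sessions.length,
    costPerTransaction:
      completedCopilot.reduce((sum, s) => sum + s.cost, 0) / completedCopilot.length,
  };
}

// Three illustrative sessions standing in for a December baseline.
const december: Session[] = [
  { completed: true, satisfied: true, channel: "copilot", cost: 1.2 },
  { completed: false, satisfied: false, channel: "copilot", cost: 2.1 },
  { completed: true, channel: "phone", cost: 6.5 },
];

console.log(kpis(december));
// completionRate 0.5, userSatisfaction 0.5, digitalTakeUp ≈0.67, costPerTransaction 1.2
```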

See also GOV.UK guidance on setting performance metrics and measuring user satisfaction for ongoing services (gov.uk).

Risk and cost table: what typically breaks, and the low‑effort fix

| Risk | What users experience | Low‑effort fix (this week) | Owner |
| --- | --- | --- | --- |
| Slow or invisible loading | Blank screen, repeated clicks | Acknowledge input instantly; show progress by 1 s; offer “I’ll email this” if >10 s | Product + Eng |
| Unclear source of truth | “Where did this come from?” | Always show sources; stamp freshness date; prioritise official docs | Product |
| Over‑confident tone | Feels assertive when uncertain | Standardise uncertainty phrases and offer suggested next steps | Content design |
| No recovery path | Users stuck after a wrong answer | One‑tap “Escalate to a person” and “Try different source” options | Ops |
| Hidden changes | Behaviour shifts without notice | Release notes in‑product with opt‑out window | Product |

These align with established human‑AI design guidance on setting expectations, supporting effective interaction, managing failure and improving over time (microsoft.com).

A 30‑day plan to clean up confusion

Week 1 — Observe and baseline

  • Pick three high‑volume tasks. Shadow 10 real sessions for each. Note each “Why did it do that?” moment and latency at three points: intent acknowledged, first visible response, completion (an instrumentation sketch follows this list).
  • Set your December baseline for the four KPIs above and agree Q1 targets.
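
To baseline the three latency checkpoints, a minimal sketch assuming you can log a timestamp at submit, acknowledgement, first visible response and completion; percentile here is the usual nearest‑rank calculation, so the numbers line up with the p50/p95 questions in the procurement section below.

```typescript
// Timestamps (ms since epoch) captured per session at the three checkpoints.
interface LatencySample {
  submitted: number;
  acknowledged: number;
  firstResponse: number;
  completed: number;
}

// Nearest-rank percentile over a list of millisecond durations.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, index)];
}

function latencyReport(samples: LatencySample[]) {
  const ackMs = samples.map((s) => s.acknowledged - s.submitted);
  const firstResponseMs = samples.map((s) => s.firstResponse - s.submitted);
  const completionMs = samples.map((s) => s.completed - s.submitted);
  return {
    ackP95: percentile(ackMs, 95),
    firstResponseP50: percentile(firstResponseMs, 50),
    firstResponseP95: percentile(firstResponseMs, 95),
    completionP95: percentile(completionMs, 95),
  };
}

// Two illustrative sessions; real data would come from your shadowing notes or logs.
console.log(
  latencyReport([
    { submitted: 0, acknowledged: 80, firstResponse: 900, completed: 4200 },
    { submitted: 0, acknowledged: 120, firstResponse: 1600, completed: 9800 },
  ])
); // { ackP95: 120, firstResponseP50: 900, firstResponseP95: 1600, completionP95: 9800 }
```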

Week 2 — Patch the obvious friction

  • Add immediate input feedback and a visible progress state. If typical answers exceed 10 s, provide an email‑me option.
  • Write standard wording for uncertainty and add 2–3 follow‑up pivots per task.

Week 3 — Evidence and trust

  • Turn on sources by default for policy and finance answers. Prioritise official PDFs over slides.
  • Introduce a simple reason‑coded feedback form feeding a weekly triage.

Week 4 — Decide and scale

  • Run a before/after review of completion rate, satisfaction and cost per transaction. Publish a short note to your users and stakeholders.
  • Agree a standing cadence: monthly release notes, weekly feedback review, quarterly UX audit.

If you need a structured way to monitor behaviour and latency, our 30‑day AI observability sprint pairs neatly with this checklist.

Procurement questions to ask your AI vendor this week

  • Latency controls: What’s the 50th/95th percentile for “first token” and “complete answer” for our typical prompts? What’s your plan if we cap at 10 s?
  • Transparency: Can we display sources and freshness automatically? How do you prioritise official documents?
  • Safety and recovery: Do you support uncertainty signalling and easy escalation to people?
  • Change management: How are model or prompt updates communicated to us? Can we pin behaviour during peak periods?
  • Analytics: Can we tag intents and measure completion rate, satisfaction, digital take‑up and cost per transaction easily?

For a broader vendor comparison framework, see our 2026 AI vendor scorecard.

Common pitfalls to avoid

  • Hiding loading states. People assume nothing is happening and retry, doubling cost. Acknowledge within ~100 ms, show progress by ~1 s, offer alternatives if near 10 s (web.dev).
  • Over‑promising at onboarding. State the jobs‑to‑be‑done and limits clearly; follow Microsoft’s “set expectations and adapt” principles (microsoft.com).
  • Collecting useless feedback. Thumbs‑down with no reason is noise; use reason codes you’ll act on weekly.

What “good” looks like by the end of January

  • Users rarely pause or backtrack; top 5 tasks show 10–15% higher completion and 10–20% fewer retries.
  • Average “first visible response” under 1 s; 95th percentile under 3 s for standard queries (web.dev).
  • Clear sources and freshness labels on policy answers; uncertainty and escalation patterns in place.
  • Monthly KPI pack published internally, aligned to GOV.UK’s four core measures for services (gov.uk).

Where to go next

If you’re at the stage of shaping product behaviours and trust signals, our deep‑dive on designing copilots people trust expands on tone, escalation and release notes.