Most AI pilots stall not because the model is “bad”, but because the experience around it is vague. Users don’t know what it can do, how well it will do it, or what happens when it gets things wrong. As a sponsor, you’re asked to sign off go‑live without a crisp definition of “behaves as intended”. This article gives non‑technical leaders a practical way to define, test and monitor behaviour so your copilot earns trust from day one.
We draw on well‑established human‑AI interaction guidance, the HEART UX metrics framework, and UK public‑sector standards for writing helpful error messages. These are stable, research‑backed foundations you can adopt with any vendor or model. microsoft.com
What “trust scaffolding” means (and why chat isn’t the point)
Trust scaffolding is the set of small, visible behaviours that help users predict, correct and recover. It makes an AI feature feel safe and useful, whether it’s a smart form, document assistant or triage tool. The scaffolding includes the behaviours below; a short sketch after the list shows what they can look like in practice:
- Set expectations upfront: say what the AI can and cannot do, and how well it is expected to do it. microsoft.com
- Show confidence and context: highlight why a suggestion appears (e.g. “based on paragraph 4 of your policy”). microsoft.com
- Offer control: easy ways to accept, edit, undo, or dismiss suggestions. microsoft.com
- Graceful failure: clear, actionable messages when the AI can’t help—plus a path to a human. design-system.service.gov.uk
- Improve over time: gather lightweight feedback and adapt to patterns. microsoft.com
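To make this concrete, here is a minimal Python sketch of the information a single suggestion could carry to support those behaviours. The field names (reason, source_ref, confidence, fallback) are illustrative assumptions, not a vendor schema or a recommendation of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Illustrative payload for one AI suggestion; field names are assumptions."""
    text: str           # what the AI proposes
    reason: str         # plain-English "why this?" shown to the user
    source_ref: str     # e.g. "paragraph 4 of your policy"
    confidence: float   # 0.0-1.0, used to decide whether to suggest or flag low confidence
    actions: tuple = ("accept", "edit", "undo", "dismiss")   # controls always on offer
    fallback: str = "Send this to a person instead"          # graceful-failure route


example = Suggestion(
    text="Route this enquiry to the Billing queue",
    reason="The message mentions an invoice number",
    source_ref="Triage policy, paragraph 4",
    confidence=0.72,
)
```

If a vendor cannot show you where each of these fields lives in their product, that is a gap in the scaffolding, not a detail.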
If you’re deciding between a chat window and a button, prefer the pattern that fits the task, not the hype. We covered that in Stop forcing chat when AI should be a button, not a bot.
Turn “behaves as intended” into acceptance criteria your board can sign
Translate fuzzy aspirations into a shortlist of measurable acceptance criteria. Start with these six dimensions; pick what matches the user’s job to be done and your organisation’s risk appetite.
| Dimension | Definition (plain English) | Example acceptance criteria | How to verify |
|---|---|---|---|
| Task success | For a defined task, the AI helps users complete it without rework. | ≥85% of triaged emails land in the correct queue; ≤5% require manual re‑routing. | Sample 200 cases; compare AI decision vs. final destination (see the sketch after this table). |
| Time‑to‑complete | It’s quicker than the current process. | Median handling time drops from 6m → ≤3m for standard enquiries. | Measure median duration before/after; slice by enquiry type. research.google |
| Clarity & transparency | People understand why the AI suggested something. | ≥80% of users agree “I can see why this suggestion was made”. | 1‑question pulse survey in‑product; qualitative comments. microsoft.com |
| Error handling | When it fails, it says what happened and how to fix it. | All error messages specify the problem and next step; no “Oops” messages. | Heuristic review against GOV.UK error guidance; spot checks in UAT. design-system.service.gov.uk |
| User control | Easy to edit, undo, or escalate to a human. | Undo visible on the same screen; handoff under 2 clicks. | Heuristic review using Human‑AI guidelines; usability test. microsoft.com |
| Satisfaction | Users would choose to use it again. | Happiness score ≥4/5; task success ≥80% (HEART). | HEART dashboard: Happiness, Engagement, Adoption, Retention, Task success. research.google |
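As a worked example of the “How to verify” column, here is a minimal sketch of scoring a sampled batch against the task success and time-to-complete criteria. The queue names, sample rows and targets are illustrative assumptions; in practice the sample would be exported from your case system.

```python
from statistics import median

# Hypothetical sample rows: (AI-suggested queue, final queue, handling time in minutes).
cases = [
    ("billing", "billing", 2.5),
    ("billing", "complaints", 6.0),
    ("general", "general", 3.0),
    # ... the rest of the 200 sampled cases
]

task_success = sum(ai == final for ai, final, _ in cases) / len(cases)
median_minutes = median(minutes for *_, minutes in cases)

print(f"Task success: {task_success:.0%} (acceptance: >= 85%)")
print(f"Median handling time: {median_minutes} min (acceptance: <= 3 min)")
```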
Design failure states first
With AI, failure isn’t a corner case; it’s part of the experience, and poor failure handling is the quickest way to lose trust. Borrow a page from UK public‑sector content design and write errors that help people recover; a sketch of this copy, keyed by failure case, follows the list:
- Say what went wrong and how to fix it, in plain English (“We couldn’t find a company by that name. Check spelling or try the registered number”). design-system.service.gov.uk
- Keep a consistent message next to the field and in the error summary. design-system.service.gov.uk
- Offer a next best action: retry, edit, or contact support—do not dead‑end. service-manual.nhs.uk
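One lightweight way to meet all three points is to keep this copy in one reviewable place, keyed by failure case. A minimal sketch follows; the failure keys and wording are placeholders for your content designers to replace.

```python
# Illustrative failure-case copy: what went wrong, in plain English, plus a next step.
ERROR_COPY = {
    "company_not_found": {
        "message": "We could not find a company by that name.",
        "next_step": "Check the spelling or try the registered company number.",
    },
    "low_confidence": {
        "message": "We are not sure which queue this enquiry belongs to.",
        "next_step": "Choose a queue yourself or send it to the triage team.",
    },
    "source_missing": {
        "message": "We could not find a policy paragraph to base this answer on.",
        "next_step": "Ask the policy team, or search the policy library directly.",
    },
}


def error_text(failure_case: str) -> str:
    """Return the same message wherever it appears: next to the field and in the summary."""
    copy = ERROR_COPY.get(failure_case, ERROR_COPY["low_confidence"])
    return f"{copy['message']} {copy['next_step']}"
```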
If your team needs a simple rubric, start with the 10 usability heuristics: visibility of status, match to the real world, control, consistency and error prevention are particularly relevant to AI suggestions and autofill. principles.design
From pilot to production: a three‑stage acceptance path
1) Shadow mode (no user impact)
Run the AI in the background and compare its suggested actions to what your team actually did. Set acceptance thresholds before exposing anything to users. For a step‑by‑step approach, see Shadow mode first.
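Under the hood, shadow mode only needs a log that pairs the AI’s suggestion with what the team actually did. Here is a minimal sketch, assuming a triage use case; the field names and the 85% threshold are illustrative assumptions you would agree before the shadow run starts.

```python
# One row per case handled during the shadow period; users never see the AI's suggestion.
shadow_log = [
    {"case_id": "C-101", "ai_queue": "billing", "human_queue": "billing"},
    {"case_id": "C-102", "ai_queue": "general", "human_queue": "complaints"},
    # ... every case from a representative sample
]

ACCEPTANCE_THRESHOLD = 0.85  # agreed with the sponsor before exposure to users

agreement = sum(r["ai_queue"] == r["human_queue"] for r in shadow_log) / len(shadow_log)
print(f"AI/human agreement: {agreement:.0%}")
print("Ready for limited release" if agreement >= ACCEPTANCE_THRESHOLD else "Stay in shadow mode")
```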
2) Limited release with feature flags
Turn the AI on for a small cohort (say, one region or 10% of users). Add an in‑product “Was this helpful?” prompt and a one‑click way to report bad outputs. Roll forward or back quickly using feature flags. See our rollout guide: Feature flags for AI.
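In practice the cohort check is a small function in front of the feature, usually provided by your feature-flag service rather than hand-written. The sketch below is only an illustration of the idea: a stable 10% bucket by user id, one pilot region, and a kill switch for instant rollback. The names and values are assumptions.

```python
import hashlib

ROLLOUT_PERCENT = 10            # share of users in the limited release
PILOT_REGIONS = {"north-west"}  # hypothetical pilot region


def ai_enabled(user_id: str, region: str, kill_switch: bool = False) -> bool:
    if kill_switch:              # one-click rollback: feature off for everyone
        return False
    if region in PILOT_REGIONS:  # the whole pilot region gets the feature
        return True
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT  # stable assignment, so users don't flip in and out
```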
3) Go‑live gate
Before a wider launch, run a production readiness check: clear ownership, monitoring, dashboards, and a way to pause the feature. Our one‑pager helps sponsors sign off confidently: The Go‑Live Gate.
UX patterns that earn trust
Use proven patterns like preview‑before‑apply, inline sources, and safe defaults. We collected 12 patterns you can ship this quarter: Designing copilot UX that earns trust.
Instrument what matters: a HEART‑based KPI set for AI features
Many teams jump to “accuracy” without agreeing how they will judge success day‑to‑day. Use the HEART framework to build a small, durable KPI set that aligns product, ops and execs; the sketch after the list below shows how the numbers roll up from simple event logs. research.google
- Happiness: 2‑question in‑product survey monthly (“Helpfulness”, “Clarity of suggestions”). Target ≥4/5.
- Engagement: % of tasks where AI was invoked when available; % edits vs. accepts.
- Adoption: % of eligible users trying the feature in a given period.
- Retention: Repeat use within 30 days; weekly active users of the AI feature.
- Task success: The core operational metric for your use case (e.g. correct routing, right template chosen).
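Here is a minimal sketch of how these roll up from an event log. The event fields, the eligible-user count and the proxy for task success (accepted without edits) are assumptions to replace with your own definitions; retention needs dated events over 30 days, so it is left as a comment.

```python
# Hypothetical event log: one record per task where the AI feature was available.
events = [
    {"user": "u1", "invoked": True,  "accepted": True,  "edited": False, "rating": 5},
    {"user": "u2", "invoked": True,  "accepted": True,  "edited": True,  "rating": 4},
    {"user": "u3", "invoked": False, "accepted": False, "edited": False, "rating": None},
]
eligible_users = 50  # users who could have used the feature this period

ratings = [e["rating"] for e in events if e["rating"] is not None]
invocations = sum(e["invoked"] for e in events)

happiness = sum(ratings) / len(ratings)                                        # target >= 4/5
engagement = invocations / len(events)                                         # share of tasks using AI
adoption = len({e["user"] for e in events if e["invoked"]}) / eligible_users
task_success = sum(e["accepted"] and not e["edited"] for e in events) / max(invocations, 1)
# retention: share of adopters who come back to the feature within 30 days (needs dated events)
```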
Pair HEART with a handful of “hygiene” alerts that protect the experience: spike in dismissals; increase in edits per suggestion; rise in “couldn’t help” messages. Tie each alert to a human owner and an action playbook. gov.uk
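As a sketch of what “a human owner and an action playbook” can look like as configuration: the metric names, thresholds, owners and playbooks below are placeholders to adapt to your own dashboard.

```python
# Illustrative hygiene alerts: each has a threshold, a named owner and a playbook.
HYGIENE_ALERTS = [
    {"metric": "dismissal_rate",       "threshold": 0.30, "owner": "product-lead",
     "playbook": "Review recent suggestions; check for a prompt or data regression"},
    {"metric": "edits_per_suggestion", "threshold": 1.50, "owner": "ops-manager",
     "playbook": "Sample edited outputs; tighten templates or policies"},
    {"metric": "couldnt_help_rate",    "threshold": 0.20, "owner": "service-owner",
     "playbook": "Check coverage gaps; update the 'can't help' routing"},
]


def fired_alerts(current_metrics: dict) -> list:
    """Return every alert whose metric is over its threshold this period."""
    return [a for a in HYGIENE_ALERTS if current_metrics.get(a["metric"], 0) > a["threshold"]]
```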
Write your acceptance checklist (copy/paste to your next steering pack)
- Scope: The AI supports these specific tasks and won’t attempt these others.
- Quality bar: Target metrics for success, time‑to‑complete, and satisfaction are defined and baselined.
- Transparency: Every suggestion shows a reason or source when possible.
- User control: Undo, edit, and human handoff are visible and work.
- Failure states: Errors and “no answer” are actionable and written in plain English per GOV.UK style. design-system.service.gov.uk
- Observability: HEART dashboard live; alerts routed to an owner.
- Rollout safety: Feature flags and rollback plan in place.
- Learning loop: Lightweight feedback and periodic review.
Procurement questions that surface UX quality (not just model names)
When shortlisting vendors, ask questions that reveal how they manage behaviour, not just benchmarks or brand names:
- How do you apply the Human‑AI interaction guidelines in your product (e.g., expectation‑setting, control, recovery)? Show screenshots. microsoft.com
- What are your default error messages and fallbacks when a model is unsure or sources are missing? Are these aligned with GOV.UK‑style guidance for clarity? design-system.service.gov.uk
- Which HEART‑style metrics do you track out of the box, and can we export them to our BI tool? research.google
- How do you support shadow mode and staged rollouts with feature flags? (If they can’t, expect a bumpier go‑live.)
- Show us one example where you reduced time‑to‑complete by ≥30% without a drop in task success. What was the sample size?
Design patterns: describe the behaviour, not the tech
Pattern 1: Preview before apply
Before any AI change is applied (email draft, document rewrite, CRM update), let users preview and edit. Keep the “Apply” action explicit and reversible. This supports user control and reduces error costs. microsoft.com
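A minimal sketch of that behaviour, assuming a simple record with a body field: the draft is shown without saving, applying keeps the previous value, and undo restores it. The function and field names are illustrative, not a prescribed API.

```python
def preview(record: dict, ai_draft: str) -> dict:
    """Show the AI draft alongside the record without saving anything."""
    return {**record, "draft": ai_draft}


def apply_change(record: dict, approved_text: str) -> dict:
    """Apply only what the user explicitly approved, keeping the old value for undo."""
    record["previous_body"] = record.get("body")
    record["body"] = approved_text
    return record


def undo(record: dict) -> dict:
    """Restore the value from before the last apply."""
    record["body"] = record.pop("previous_body", record["body"])
    return record
```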
Pattern 2: Why this?
Show a small “why” link that reveals the source, rule or matching snippet. Avoid dense explanations; users just need enough to judge. microsoft.com
Pattern 3: Safe defaults
Start with conservative behaviours: suggest rather than auto‑commit; summarise rather than rewrite critical facts; flag low confidence. Earn the right to automate once metrics look healthy over time. microsoft.com
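One way to express safe defaults is a single decision rule over the model’s confidence; the thresholds and the auto-apply switch below are placeholder assumptions, not recommended values.

```python
SUGGEST_THRESHOLD = 0.60     # below this, say the AI can't help confidently
AUTO_APPLY_ENABLED = False   # stays off until metrics have been healthy over time


def decide_action(confidence: float) -> str:
    if confidence < SUGGEST_THRESHOLD:
        return "flag_low_confidence"      # clear "can't help" state with a next step
    if AUTO_APPLY_ENABLED and confidence >= 0.95:
        return "auto_apply"               # earned later, never the starting point
    return "suggest_with_preview"         # the default: suggest, never auto-commit
```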
Pattern 4: Clear “can’t help” states
When the AI cannot assist, say so plainly and point to a next step. Replace “Oops” with guidance and a route to a person. design-system.service.gov.uk
Cost and risk view for sponsors
| Risk | What users experience | Cost impact | Mitigation you can require |
|---|---|---|---|
| Over‑automation | Unwanted changes are applied | Rework time; reputational risk | Start in suggest‑only mode; preview/undo; feature flags for rapid rollback. |
| Opaque suggestions | “Why did it do that?” | Low adoption; support tickets | Inline “why this” and sources; periodic quality reviews. microsoft.com |
| Poor failure copy | Confusing dead‑ends | Abandonment; manual escalations | GOV.UK‑style error messages; actionable next steps. design-system.service.gov.uk |
| No metrics | Can’t prove value | Budget scrutiny | HEART dashboard; task success baseline + targets. research.google |
Your 30‑60‑90 day plan to ship “AI that behaves”
Days 0–30 (Discovery and guardrails)
- Pick one or two concrete tasks to support; write a “won’t do” list.
- Draft acceptance criteria (success, time, clarity, control, failure states, satisfaction).
- Prototype the happy path and failure states; review with a small user group against Human‑AI guidelines. microsoft.com
- Baseline current performance (manual process) and define your HEART metrics. research.google
Days 31–60 (Shadow and limited release)
- Run shadow mode; record deltas vs. human outcomes for a representative sample.
- Enable for a small cohort via feature flags; add “Was this helpful?” and an obvious undo.
- Write GOV.UK‑style error messages for top 10 failure cases. design-system.service.gov.uk
Days 61–90 (Go‑live and learning loop)
- Run the go‑live gate; confirm owners for metrics and alerts.
- Publish a simple “What this copilot can do” help page and a practice sandbox.
- Review data fortnightly: if task success ≥ target and edits are trending down, expand coverage; if not, pause, adjust prompts/policies and try again (see the decision rule sketched below). gov.uk
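The fortnightly call is easier to run consistently if it is written down as a rule everyone applies the same way. A tiny sketch; the target and the edit-trend comparison are assumptions to tune.

```python
def review_decision(task_success: float, target: float, edits_per_suggestion: list) -> str:
    """Hypothetical fortnightly review rule for expanding or pausing the rollout."""
    edits_falling = len(edits_per_suggestion) >= 2 and edits_per_suggestion[-1] < edits_per_suggestion[0]
    if task_success >= target and edits_falling:
        return "expand coverage"
    return "pause, adjust prompts/policies, re-run the limited release"


# Example: 88% task success against an 85% target, edits trending down across three reviews.
print(review_decision(0.88, 0.85, [1.8, 1.4, 1.1]))
```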
Bottom line for leaders
Don’t buy “magic”. Buy behaviours you can describe, measure and improve. If you set acceptance criteria around task success, time‑to‑complete, clarity, control, failure handling and satisfaction—and you ship with trust scaffolding—your AI will feel competent from day one and get better with real use. The research base is there; the playbook above translates it into delivery moves any team can execute. microsoft.com