Agile teams are now shipping chatbots, retrieval‑augmented generation (RAG) search, drafting assistants and “AI summaries” into real services. But user acceptance testing (UAT) for AI is not the same as for classic software. Output varies from run to run, quality depends on data and prompts, and costs scale with usage. This guide gives UK SME and charity leaders a five‑day UAT plan you can run without code to decide Go/No‑Go with confidence.
The plan draws on established practice such as the NIST AI Risk Management Framework’s “Measure” function, the Site Reliability Engineering “golden signals” for live services, and widely used RAG evaluation metrics (faithfulness, answer relevancy, context precision/recall). Where we reference external frameworks, we’ve linked them so your team can go deeper as needed. nist.gov
What’s different about UAT for AI?
- Non‑deterministic outputs. Two identical questions can produce subtly different answers, so you’ll validate distributions, not single results; a short sketch after this list shows one way to score repeated runs.
- Quality depends on retrieval and prompting. Poor document coverage or ranking will tank answer quality even with top models. Standard metrics such as recall@k, nDCG, faithfulness and answer relevancy help you measure this objectively. docs.nvidia.com
- Cost and latency are first‑class risks. AI features fail if they’re slow or financially unpredictable. Borrow the SRE “golden signals” (latency, traffic, errors, saturation) for your dashboards and acceptance gates. sre.google
- Security and misuse behave differently. Prompt injection and insecure output handling are now common failure modes. Use the OWASP Top 10 for LLM applications as your threat checklist during UAT. owasp.org
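To make the first point concrete: rather than judging a single answer, score the same journey several times and look at the spread. The sketch below is a minimal illustration in Python, assuming a hypothetical `ask_assistant` call to your system and a `meets_acceptance` check that applies your own criteria; neither is a real API, and the field names are placeholders.

```python
import statistics

def ask_assistant(question: str) -> dict:
    """Hypothetical call to your chatbot/RAG endpoint; assumed to return a dict
    with at least the answer text, 'latency_s' (seconds) and 'cost' per request."""
    raise NotImplementedError

def meets_acceptance(result: dict, journey: dict) -> bool:
    """Hypothetical check applying the journey's own acceptance criteria."""
    raise NotImplementedError

def score_journey(journey: dict, runs: int = 10) -> dict:
    """Run the same question several times and summarise the spread,
    because one run cannot show non-deterministic behaviour."""
    results = [ask_assistant(journey["question"]) for _ in range(runs)]
    passes = [meets_acceptance(r, journey) for r in results]
    latencies = sorted(r["latency_s"] for r in results)
    return {
        "journey": journey["name"],
        "pass_rate": sum(passes) / runs,            # e.g. 0.8 = acceptable in 8 of 10 runs
        "median_latency_s": statistics.median(latencies),
        "worst_latency_s": latencies[-1],
        "mean_cost": statistics.mean(r["cost"] for r in results),
    }
```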
Bottom line: treat AI UAT as a mix of product acceptance, content quality review and live‑service rehearsal, not a tick‑box test script.
The 5‑Day UAT plan (no code required)
Block out one working week. Keep the team small: product owner, service owner, one subject‑matter expert, one analyst, and your supplier/engineer on call.
Day 1 — Define acceptance and build a tiny but sharp test set
- Write 10–20 critical user journeys (e.g., “Find returns policy from UK site”, “Summarise 8‑page policy into a trustee brief”). Capture expected outputs in plain language and the supporting source documents.
- Set acceptance thresholds per journey: minimum faithfulness score, maximum latency, maximum cost per task, and unacceptable behaviours (e.g., revealing internal links, inventing contacts). Use your service goals to pick the KPIs that matter; a sketch after this list shows one way to record them. gov.uk
- Agree decision gates: Go, Go‑with‑guardrails, Fix‑and‑retest, or No‑Go.
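One lightweight way to capture Day 1’s journeys and thresholds (referenced in the second bullet above) is a small structured record per journey. The sketch below is illustrative only: the field names, numbers and the example journey are placeholders, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class Journey:
    """One critical user journey plus its acceptance thresholds (illustrative fields)."""
    name: str
    question: str
    expected_sources: list[str]            # documents a good answer must be grounded in
    expected_answer_notes: str             # plain-language description of a good answer
    min_faithfulness: float = 0.9          # 0-1 scale, scored by the SME
    max_latency_s: float = 5.0             # p95 target in seconds
    max_cost_per_task: float = 0.05        # in your currency, per completed task
    banned_behaviours: list[str] = field(default_factory=list)

# Entirely fictional example, for illustration only.
returns_policy = Journey(
    name="Find returns policy from UK site",
    question="What is your returns policy for online orders?",
    expected_sources=["returns-policy-2024.pdf"],
    expected_answer_notes="Summarises the policy accurately and links only to the public page.",
    banned_behaviours=["reveals internal links", "invents contact details"],
)
```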
Day 2 — Retrieval quality (RAG) and coverage
- Measure retrieval on your test set: recall@k and nDCG for “can we find the right passages?”, plus context precision/recall for “is the context relevant and sufficient?” (the underlying formulas are sketched after this list). docs.nvidia.com
- Investigate misses: Are documents missing, chunked poorly, or ranked behind irrelevant content?
- Watch for leaderboard traps: top embedding models on public leaderboards (MTEB) do not always win on your domain. Always verify on your data. huggingface.co
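If your analyst prefers to compute the retrieval metrics directly rather than through an evaluation toolkit, the standard formulas are small enough to sketch in Python. The version below uses binary relevance (a passage is either expected or not); many tools also support graded relevance. The example data is fictional.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the expected source passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: rewards putting relevant passages near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)                  # rank 0 -> log2(2)
        for rank, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Example: expected sources vs. what the retriever actually returned (fictional)
expected = {"returns-policy-2024.pdf", "refund-faq.html"}
retrieved = ["refund-faq.html", "blog-post.html", "returns-policy-2024.pdf", "careers.html"]
print(recall_at_k(retrieved, expected, k=5))   # 1.0 - both sources found in the top 5
print(ndcg_at_k(retrieved, expected, k=5))     # below 1.0 because one source is ranked third
```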
Day 3 — Answer quality and safety
- Score answer faithfulness and relevancy against your ground truth and retrieved context. Keep a simple 0–1 scale and record comments for low‑scoring answers; a lightweight logging sketch follows this list. docs.ragas.io
- Run misuse tests aligned to OWASP LLM Top 10: prompt injection attempts, attempts to elicit secrets, excessive autonomy, insecure output handling (links/HTML/scripts). Log any failure and require a mitigation before release. owasp.org
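Day 3 scoring can live in a spreadsheet, but keeping answer‑quality scores and misuse‑test outcomes in one evidence log makes the Day 5 sign‑off easier. The sketch below is one plain‑Python way to do that; the field names and the example row are illustrative.

```python
import csv
from datetime import date

# One row per scored answer or misuse attempt; field names are illustrative.
FIELDS = ["date", "journey", "test_type", "faithfulness", "relevancy", "passed", "comment"]

def record_score(path: str, row: dict) -> None:
    """Append one reviewer judgement to the UAT evidence log (CSV)."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:                 # write the header on first use
            writer.writeheader()
        writer.writerow(row)

# Fictional example row.
record_score("day3_scores.csv", {
    "date": date.today().isoformat(),
    "journey": "Summarise 8-page policy into a trustee brief",
    "test_type": "answer_quality",        # or "misuse" for OWASP-style probes
    "faithfulness": 0.7,
    "relevancy": 0.9,
    "passed": False,                      # below the 0.9 faithfulness threshold
    "comment": "Invented a review date not present in the source document.",
})
```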
Day 4 — Latency, cost and capacity rehearsal
- Latency: set p50/p95 targets per journey (for example, p95 under 5 seconds for customer chat; adjust to your context). Track the SRE golden signals and rehearse “brown‑out” fallbacks; a simple timing sketch follows this list. sre.google
- Cost: simulate a week of expected traffic to estimate per‑task cost. Decide your “cost breaker” threshold for pausing roll‑out if exceeded.
- Capacity: test short bursts at your expected peak. Validate graceful degradation (e.g., smaller context windows or cheaper models during spikes).
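You do not need load‑testing tooling to get a first view of Day 4 latency percentiles and per‑task cost; a short timing loop is usually enough for a UAT rehearsal. In the sketch below, `ask_assistant` is again a stand‑in for your real endpoint, and the 2,000 tasks/week figure is a placeholder for your own traffic forecast.

```python
import statistics
import time

def ask_assistant(question: str) -> dict:
    """Hypothetical call to your real endpoint; assumed to return at least a 'cost' estimate."""
    raise NotImplementedError

def rehearse(question: str, runs: int = 30, weekly_tasks: int = 2000) -> dict:
    """Time back-to-back runs and summarise p50/p95 latency and cost per task."""
    latencies, costs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        result = ask_assistant(question)
        latencies.append(time.perf_counter() - start)
        costs.append(result.get("cost", 0.0))
    cuts = statistics.quantiles(latencies, n=20)    # cut points at 5%, 10%, ..., 95%
    mean_cost = statistics.mean(costs)
    return {
        "p50_s": round(cuts[9], 2),
        "p95_s": round(cuts[18], 2),
        "mean_cost_per_task": round(mean_cost, 4),
        "weekly_cost_forecast": round(mean_cost * weekly_tasks, 2),  # placeholder volume
    }
```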
Day 5 — Business KPIs, sign‑off and rollback rehearsal
- Confirm how you will track service KPIs such as completion rate, user satisfaction and cost per transaction once live, drawing on GOV.UK Service Manual practice. gov.uk
- Hold a 60‑minute sign‑off: present results and decide Go/No‑Go. Pre‑approve a rollback plan and thresholds that trigger it.
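Writing the pre‑approved rollback thresholds down in a machine‑checkable form makes them easier to apply mechanically once live. The sketch below is illustrative only; every number should be agreed at sign‑off for your own service.

```python
# Illustrative rollback triggers agreed at sign-off; tune every number to your service.
ROLLBACK_TRIGGERS = {
    "p95_latency_s": 8.0,           # sustained breach of the latency budget
    "error_rate": 0.05,             # more than 5% of requests failing
    "cost_per_task": 0.10,          # running cost per task above budget
    "critical_safety_issues": 0,    # any confirmed P1/P2 safety issue
}

def should_roll_back(live_metrics: dict) -> list[str]:
    """Return the breached triggers; any non-empty result means roll back."""
    breaches = []
    for name, limit in ROLLBACK_TRIGGERS.items():
        if live_metrics.get(name, 0) > limit:
            breaches.append(f"{name}: {live_metrics[name]} exceeds {limit}")
    return breaches
```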
Your UAT scoreboard (copy this format)
| Area | Metric & target | How to test | Owner |
|---|---|---|---|
| Retrieval | Recall@5 ≥ 0.8; nDCG@5 ≥ 0.85 | Run test questions; compare retrieved passages to expected sources | Analyst + SME |
| Answer quality | Faithfulness ≥ 0.9; Answer relevancy ≥ 0.9 | Score each answer vs. source and question on 0–1 scale | SME |
| Safety | No critical OWASP failures; 0 P1/P2 issues | Attempt prompt injection; test insecure output handling | Supplier + PO |
| Latency | p95 within the agreed target (seconds) per journey | Time 30 back‑to‑back runs; simulate a peak burst | Supplier |
| Cost | Cost per task ≤ budget; weekly forecast within ±15% | Simulate expected volume using test set | Service owner |
| Business KPIs | Completion rate and satisfaction targets | Live‑data plan in place; dashboards ready | Service owner |
These targets are illustrative; set yours based on user needs and service constraints, then hold them steady during UAT to avoid moving the goalposts. For retrieval and answer metrics, definitions and examples are widely documented in RAG evaluation guides. docs.nvidia.com
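A spreadsheet copy of the scoreboard works perfectly well; the sketch below simply expresses the same illustrative targets as data, so the sign‑off meeting can compare measured results against them without re‑reading the table. All numbers mirror the illustrative targets above, not recommendations.

```python
# The scoreboard targets above, expressed as data (illustrative values).
TARGETS = {
    "retrieval":      {"recall_at_5": 0.80, "ndcg_at_5": 0.85},
    "answer_quality": {"faithfulness": 0.90, "answer_relevancy": 0.90},
    "safety":         {"critical_owasp_failures": 0},
    "latency":        {"p95_s": 5.0},           # agreed per journey; one figure for brevity
    "cost":           {"cost_per_task": 0.05},  # in your currency
}

LOWER_IS_BETTER = {"critical_owasp_failures", "p95_s", "cost_per_task"}

def area_passes(area: str, measured: dict) -> bool:
    """An area passes when every measured value meets its target."""
    for metric, target in TARGETS[area].items():
        value = measured[metric]
        ok = value <= target if metric in LOWER_IS_BETTER else value >= target
        if not ok:
            return False
    return True
```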
Decision gates: Go, Go‑with‑guardrails, Fix‑and‑retest, No‑Go
If most metrics hit their targets
- Go — Ship to a small production cohort with monitoring and weekly review.
If quality is good but risk remains
- Go‑with‑guardrails — Add rate limits, remove high‑risk prompts, present citations by default, and enable “Report a problem” in the UI. Align mitigations to the OWASP LLM risks you observed. owasp.org
If retrieval or faithfulness is weak
- Fix‑and‑retest — Improve document coverage, chunking, or re‑ranking; re‑evaluate embeddings on your dataset rather than relying on leaderboards alone. huggingface.co
If multiple areas fail or safety is critical
- No‑Go — Document gaps, agree a remediation plan and date to rerun UAT.
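The gate logic itself can be agreed before the sign‑off meeting, so the discussion focuses on judgement rather than arithmetic. The function below is one plausible encoding of the four gates, building on per‑area pass/fail results such as those from the scoreboard sketch above; treat it as a conversation aid, not a formula to apply blindly.

```python
def decision_gate(area_results: dict[str, bool], safety_is_critical: bool) -> str:
    """Map per-area pass/fail results to one of the four gates (illustrative rules)."""
    failures = [area for area, passed in area_results.items() if not passed]

    if safety_is_critical or len(failures) >= 3:
        return "No-Go: document gaps, agree remediation and a date to rerun UAT"
    if not failures:
        return "Go: ship to a small production cohort with monitoring and weekly review"
    if "retrieval" in failures or "answer_quality" in failures:
        return "Fix-and-retest: improve coverage, chunking or re-ranking, then rerun Days 2-3"
    return "Go-with-guardrails: add rate limits, citations by default and 'Report a problem'"
```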
Procurement questions for suppliers before UAT
- What test set of real questions and expected sources will you help us build? Who owns it after the project?
- How do you measure retrieval and answer quality (which metrics, thresholds, and tools)? Show a previous UAT report.
- Which security tests aligned to the OWASP LLM Top 10 will you run, and how do we reproduce them? owasp.org
- Can you provide latency, error and cost dashboards mapped to the golden signals? What’s your p95 target and cost per task at our forecast volumes? sre.google
- Which embedding/model choices have you validated on data like ours? Do you rely on public leaderboards (e.g., MTEB), and what did you learn when results differed? huggingface.co
- What’s your rollback and degradation plan if costs spike or latency slips?
Playbook extras your team will appreciate
Test set design tips
- Mix “easy wins” and hard edge cases (dates, numbers, synonyms, long documents).
- Include at least 3 “misuse” prompts to probe injection and data leakage paths. owasp.org
- Keep examples concise; reviewers should score each in under 60 seconds.
Choosing embeddings wisely
- Start with a proven generalist from the MTEB leaderboard, but verify locally on your corpus and task; a rough verification sketch follows this list. Some models that rank highly on general benchmarks underperform on niche UK domains. huggingface.co
- If you must change models late in delivery, rerun the Day‑2/3 tests end‑to‑end.
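Verifying a candidate embedding model on your own corpus does not need to be a big project. The rough sketch below assumes the open‑source sentence‑transformers and numpy packages and a tiny labelled test set (both assumptions, not requirements); the model name is a common default used purely for illustration, not a recommendation, and the corpus snippets are fictional.

```python
# Rough local check of an embedding model against your own questions and passages.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # swap in each candidate model

passages = {                                        # a handful of your own corpus chunks
    "returns-policy-2024.pdf#p1": "Customers may return online orders within the stated window.",
    "refund-faq.html#q3": "Refunds are issued to the original payment method.",
}
test_set = [
    {"question": "How long do I have to return an online order?",
     "relevant": {"returns-policy-2024.pdf#p1"}},
]

ids = list(passages)
passage_vecs = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def top_k(question: str, k: int = 5) -> list[str]:
    """Rank passages by cosine similarity to the question (vectors are normalised)."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = passage_vecs @ q
    return [ids[i] for i in np.argsort(scores)[::-1][:k]]

for case in test_set:
    retrieved = top_k(case["question"])
    found = bool(case["relevant"] & set(retrieved))
    print(case["question"], "->", "found" if found else "missed")
```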
For broader quality governance and SLA ideas, see our practical templates in The AI Quality Scoreboard and our rapid 5‑Day AI Evaluation Sprint. If you’re nearing launch, pair this UAT with The Quiet Cutover, and use our 10 Tests That Predict AI Quality.
Common pitfalls we see in week five of a pilot
- Chasing public benchmarks instead of business KPIs. Leaderboards like MTEB are useful guides, not sign‑off criteria; evaluate on your data and users. huggingface.co
- “We’ll add monitoring later.” Without the golden signals live on day one, you cannot manage latency/cost regressions. sre.google
- No explicit misuse tests. If you don’t try prompt injection and insecure output handling during UAT, your users will in production. owasp.org
- RAG treated as a black box. Retrieval metrics (recall, nDCG, context precision/recall) catch issues earlier than eyeballing answers. docs.nvidia.com
Your quick‑start checklist
- 10–20 real questions with expected sources and model answers
- Targets for faithfulness, answer relevancy, retrieval, latency and cost
- Misuse test prompts mapped to OWASP LLM risks
- Golden‑signals dashboard and a cost forecast
- Rollback plan and Go/No‑Go thresholds written down
If you prefer a formal framework, align your UAT to “Measure → Manage” within the NIST AI RMF, adapted to your service context. nist.gov