Agile teams are now shipping chatbots, retrieval‑augmented generation (RAG) search, drafting assistants and “AI summaries” into real services. But user acceptance testing (UAT) for AI is not the same as for classic software. Output varies from run to run, quality depends on data and prompts, and costs scale with usage. This guide gives UK SME and charity leaders a five‑day UAT plan you can run without code to decide Go/No‑Go with confidence.
The plan draws on established practice such as the NIST AI Risk Management Framework’s “Measure” function, the Site Reliability Engineering “golden signals” for live services, and widely used RAG evaluation metrics (faithfulness, answer relevancy, context precision/recall). Where we reference external frameworks, we’ve linked them so your team can go deeper as needed. nist.gov
What’s different about UAT for AI?
- Non‑deterministic outputs. Two identical questions can produce subtly different answers, so you’ll validate distributions, not single results; a short sketch after this list shows one way to score repeated runs.
- Quality depends on retrieval and prompting. Poor document coverage or ranking will tank answer quality even with top models. Standard metrics such as recall@k, nDCG, faithfulness and answer relevancy help you measure this objectively. docs.nvidia.com
- Cost and latency are first‑class risks. AI features fail if they’re slow or financially unpredictable. Borrow the SRE “golden signals” (latency, traffic, errors, saturation) for your dashboards and acceptance gates. sre.google
- Security and misuse behave differently. Prompt injection and insecure output handling are now common failure modes. Use the OWASP Top 10 for LLM applications as your threat checklist during UAT. owasp.org
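To make the first point concrete: rather than judging a single answer, score the same journey several times and look at the spread. The sketch below is a minimal illustration in Python, assuming a hypothetical `ask_assistant` call to your system and a `meets_acceptance` check that applies your own criteria; neither is a real API, and the field names are placeholders.

```python
import statistics

def ask_assistant(question: str) -> dict:
    """Hypothetical call to your chatbot/RAG endpoint; assumed to return a dict
    with at least the answer text, 'latency_s' (seconds) and 'cost' per request."""
    raise NotImplementedError

def meets_acceptance(result: dict, journey: dict) -> bool:
    """Hypothetical check applying the journey's own acceptance criteria."""
    raise NotImplementedError

def score_journey(journey: dict, runs: int = 10) -> dict:
    """Run the same question several times and summarise the spread,
    because one run cannot show non-deterministic behaviour."""
    results = [ask_assistant(journey["question"]) for _ in range(runs)]
    passes = [meets_acceptance(r, journey) for r in results]
    latencies = sorted(r["latency_s"] for r in results)
    return {
        "journey": journey["name"],
        "pass_rate": sum(passes) / runs,            # e.g. 0.8 = acceptable in 8 of 10 runs
        "median_latency_s": statistics.median(latencies),
        "worst_latency_s": latencies[-1],
        "mean_cost": statistics.mean(r["cost"] for r in results),
    }
```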
Bottom line: treat AI UAT as a mix of product acceptance, content quality review and live‑service rehearsal, not a tick‑box test script.
The 5‑Day UAT plan (no code required)
Block out one working week. Keep the team small: product owner, service owner, one subject‑matter expert, one analyst, and your supplier/engineer on call.
Day 1 — Define acceptance and build a tiny but sharp test set
- Write 10–20 critical user journeys (e.g., “Find returns policy from UK site”, “Summarise 8‑page policy into a trustee brief”). Capture expected outputs in plain language and the supporting source documents.
- Set acceptance thresholds per journey: minimum faithfulness score, maximum latency, maximum cost per task, and unacceptable behaviours (e.g., revealing internal links, inventing contacts). Use your service goals to pick the KPIs that matter; a sketch after this list shows one way to record them. gov.uk
- Agree decision gates: Go, Go‑with‑guardrails, Fix‑and‑retest, or No‑Go.
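One lightweight way to capture Day 1’s journeys and thresholds (referenced in the second bullet above) is a small structured record per journey. The sketch below is illustrative only: the field names, numbers and the example journey are placeholders, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class Journey:
    """One critical user journey plus its acceptance thresholds (illustrative fields)."""
    name: str
    question: str
    expected_sources: list[str]            # documents a good answer must be grounded in
    expected_answer_notes: str             # plain-language description of a good answer
    min_faithfulness: float = 0.9          # 0-1 scale, scored by the SME
    max_latency_s: float = 5.0             # p95 target in seconds
    max_cost_per_task: float = 0.05        # in your currency, per completed task
    banned_behaviours: list[str] = field(default_factory=list)

# Entirely fictional example, for illustration only.
returns_policy = Journey(
    name="Find returns policy from UK site",
    question="What is your returns policy for online orders?",
    expected_sources=["returns-policy-2024.pdf"],
    expected_answer_notes="Summarises the policy accurately and links only to the public page.",
    banned_behaviours=["reveals internal links", "invents contact details"],
)
```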
Day 2 — Retrieval quality (RAG) and coverage
- Measure retrieval on your test set: recall@k and nDCG for “can we find the right passages?”, plus context precision/recall for “is the context relevant and sufficient?” (the underlying formulas are sketched after this list). docs.nvidia.com
- Investigate misses: Are documents missing, chunked poorly, or ranked behind irrelevant content?
- Watch for leaderboard traps: top embedding models on public leaderboards (MTEB) do not always win on your domain. Always verify on your data. huggingface.co
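If your analyst prefers to compute the retrieval metrics directly rather than through an evaluation toolkit, the standard formulas are small enough to sketch in Python. The version below uses binary relevance (a passage is either expected or not); many tools also support graded relevance. The example data is fictional.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the expected source passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: rewards putting relevant passages near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)                  # rank 0 -> log2(2)
        for rank, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Example: expected sources vs. what the retriever actually returned (fictional)
expected = {"returns-policy-2024.pdf", "refund-faq.html"}
retrieved = ["refund-faq.html", "blog-post.html", "returns-policy-2024.pdf", "careers.html"]
print(recall_at_k(retrieved, expected, k=5))   # 1.0 - both sources found in the top 5
print(ndcg_at_k(retrieved, expected, k=5))     # below 1.0 because one source is ranked third
```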
Day 3 — Answer quality and safety
- Score answer faithfulness and relevancy against your ground truth and retrieved context. Keep a simple 0–1 scale and record comments for low‑scoring answers; a lightweight logging sketch follows this list. docs.ragas.io
- Run misuse tests aligned to OWASP LLM Top 10: prompt injection attempts, attempts to elicit secrets, excessive autonomy, insecure output handling (links/HTML/scripts). Log any failure and require a mitigation before release. owasp.org
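Day 3 scoring can live in a spreadsheet, but keeping answer‑quality scores and misuse‑test outcomes in one evidence log makes the Day 5 sign‑off easier. The sketch below is one plain‑Python way to do that; the field names and the example row are illustrative.

```python
import csv
from datetime import date

# One row per scored answer or misuse attempt; field names are illustrative.
FIELDS = ["date", "journey", "test_type", "faithfulness", "relevancy", "passed", "comment"]

def record_score(path: str, row: dict) -> None:
    """Append one reviewer judgement to the UAT evidence log (CSV)."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:                 # write the header on first use
            writer.writeheader()
        writer.writerow(row)

# Fictional example row.
record_score("day3_scores.csv", {
    "date": date.today().isoformat(),
    "journey": "Summarise 8-page policy into a trustee brief",
    "test_type": "answer_quality",        # or "misuse" for OWASP-style probes
    "faithfulness": 0.7,
    "relevancy": 0.9,
    "passed": False,                      # below the 0.9 faithfulness threshold
    "comment": "Invented a review date not present in the source document.",
})
```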
Day 4 — Latency, cost and capacity rehearsal
- Latency: set p50/p95 targets per journey (for example, p95 under 5 seconds for customer chat; adjust to your context). Track the SRE golden signals and rehearse “brown‑out” fallbacks; a simple timing sketch follows this list. sre.google
- Cost: simulate a week of expected traffic to estimate per‑task cost. Decide your “cost breaker” threshold for pausing roll‑out if exceeded.
- Capacity: test short bursts at your expected peak. Validate graceful degradation (e.g., smaller context windows or cheaper models during spikes).
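You do not need load‑testing tooling to get a first view of Day 4 latency percentiles and per‑task cost; a short timing loop is usually enough for a UAT rehearsal. In the sketch below, `ask_assistant` is again a stand‑in for your real endpoint, and the 2,000 tasks/week figure is a placeholder for your own traffic forecast.

```python
import statistics
import time

def ask_assistant(question: str) -> dict:
    """Hypothetical call to your real endpoint; assumed to return at least a 'cost' estimate."""
    raise NotImplementedError

def rehearse(question: str, runs: int = 30, weekly_tasks: int = 2000) -> dict:
    """Time back-to-back runs and summarise p50/p95 latency and cost per task."""
    latencies, costs = [], []
    for _ in range(runs):
        start = time.perf_counter()
        result = ask_assistant(question)
        latencies.append(time.perf_counter() - start)
        costs.append(result.get("cost", 0.0))
    cuts = statistics.quantiles(latencies, n=20)    # cut points at 5%, 10%, ..., 95%
    mean_cost = statistics.mean(costs)
    return {
        "p50_s": round(cuts[9], 2),
        "p95_s": round(cuts[18], 2),
        "mean_cost_per_task": round(mean_cost, 4),
        "weekly_cost_forecast": round(mean_cost * weekly_tasks, 2),  # placeholder volume
    }
```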
Day 5 — Business KPIs, sign‑off and rollback rehearsal
- Confirm how you will track service KPIs such as completion rate, user satisfaction and cost per transaction once live, drawing on GOV.UK Service Manual practice. gov.uk
- Hold a 60‑minute sign‑off: present results and decide Go/No‑Go. Pre‑approve a rollback plan and thresholds that trigger it.
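Writing the pre‑approved rollback thresholds down in a machine‑checkable form makes them easier to apply mechanically once live. The sketch below is illustrative only; every number should be agreed at sign‑off for your own service.

```python
# Illustrative rollback triggers agreed at sign-off; tune every number to your service.
ROLLBACK_TRIGGERS = {
    "p95_latency_s": 8.0,           # sustained breach of the latency budget
    "error_rate": 0.05,             # more than 5% of requests failing
    "cost_per_task": 0.10,          # running cost per task above budget
    "critical_safety_issues": 0,    # any confirmed P1/P2 safety issue
}

def should_roll_back(live_metrics: dict) -> list[str]:
    """Return the breached triggers; any non-empty result means roll back."""
    breaches = []
    for name, limit in ROLLBACK_TRIGGERS.items():
        if live_metrics.get(name, 0) > limit:
            breaches.append(f"{name}: {live_metrics[name]} exceeds {limit}")
    return breaches
```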
Your UAT scoreboard (copy this format)
| Area | Metric & target | How to test | Owner |
|---|---|---|---|
| Retrieval | Recall@5 ≥ 0.8; nDCG@5 ≥ 0.85 | Run test questions; compare retrieved passages to expected sources | Analyst + SME |
| Answer quality | Faithfulness ≥ 0.9; Answer relevancy ≥ 0.9 | Score each answer vs. source and question on 0–1 scale | SME |
| Safety | No critical OWASP failures; 0 P1/P2 issues | Attempt prompt injection; test insecure output handling | Supplier + PO |
| Latency | p95 within the agreed target (seconds) per journey | Time 30 back‑to‑back runs; simulate a peak burst | Supplier |
| Cost | Cost per task ≤ budget; weekly forecast within ±15% | Simulate expected volume using test set | Service owner |
| Business KPIs | Completion rate and satisfaction targets | Live‑data plan in place; dashboards ready | Service owner |
These targets are illustrative; set yours based on user needs and service constraints, then hold them steady during UAT to avoid moving the goalposts. For retrieval and answer metrics, definitions and examples are widely documented in RAG evaluation guides. docs.nvidia.com
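A spreadsheet copy of the scoreboard works perfectly well; the sketch below simply expresses the same illustrative targets as data, so the sign‑off meeting can compare measured results against them without re‑reading the table. All numbers mirror the illustrative targets above, not recommendations.

```python
# The scoreboard targets above, expressed as data (illustrative values).
TARGETS = {
    "retrieval":      {"recall_at_5": 0.80, "ndcg_at_5": 0.85},
    "answer_quality": {"faithfulness": 0.90, "answer_relevancy": 0.90},
    "safety":         {"critical_owasp_failures": 0},
    "latency":        {"p95_s": 5.0},           # agreed per journey; one figure for brevity
    "cost":           {"cost_per_task": 0.05},  # in your currency
}

LOWER_IS_BETTER = {"critical_owasp_failures", "p95_s", "cost_per_task"}

def area_passes(area: str, measured: dict) -> bool:
    """An area passes when every measured value meets its target."""
    for metric, target in TARGETS[area].items():
        value = measured[metric]
        ok = value <= target if metric in LOWER_IS_BETTER else value >= target
        if not ok:
            return False
    return True
```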
Decision gates: Go, Go‑with‑guardrails, Fix‑and‑retest, No‑Go
If most metrics hit their targets
- Go — Ship to a small production cohort with monitoring and weekly review.
If quality is good but risk remains
- Go‑with‑guardrails — Add rate limits, remove high‑risk prompts, present citations by default, and enable “Report a problem” in the UI. Align mitigations to the OWASP LLM risks you observed. owasp.org
If retrieval or faithfulness is weak
- Fix‑and‑retest — Improve document coverage, chunking, or re‑ranking; re‑evaluate embeddings on your dataset rather than relying on leaderboards alone. huggingface.co
If multiple areas fail or safety is critical
- No‑Go — Document gaps, agree a remediation plan and date to rerun UAT.
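The gate logic itself can be agreed before the sign‑off meeting, so the discussion focuses on judgement rather than arithmetic. The function below is one plausible encoding of the four gates, building on per‑area pass/fail results such as those from the scoreboard sketch above; treat it as a conversation aid, not a formula to apply blindly.

```python
def decision_gate(area_results: dict[str, bool], safety_is_critical: bool) -> str:
    """Map per-area pass/fail results to one of the four gates (illustrative rules)."""
    failures = [area for area, passed in area_results.items() if not passed]

    if safety_is_critical or len(failures) >= 3:
        return "No-Go: document gaps, agree remediation and a date to rerun UAT"
    if not failures:
        return "Go: ship to a small production cohort with monitoring and weekly review"
    if "retrieval" in failures or "answer_quality" in failures:
        return "Fix-and-retest: improve coverage, chunking or re-ranking, then rerun Days 2-3"
    return "Go-with-guardrails: add rate limits, citations by default and 'Report a problem'"
```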
Procurement questions for suppliers before UAT
- What test set of real questions and expected sources will you help us build? Who owns it after the project?
- How do you measure retrieval and answer quality (which metrics, thresholds, and tools)? Show a previous UAT report.
- Which security tests aligned to the OWASP LLM Top 10 will you run, and how do we reproduce them? owasp.org
- Can you provide latency, error and cost dashboards mapped to the golden signals? What’s your p95 target and cost per task at our forecast volumes? sre.google
- Which embedding/model choices have you validated on data like ours? Do you rely on public leaderboards (e.g., MTEB), and what did you learn when results differed? huggingface.co
- What’s your rollback and degradation plan if costs spike or latency slips?
Playbook extras your team will appreciate
Test set design tips
- Mix “easy wins” and hard edge cases (dates, numbers, synonyms, long documents).
- Include at least 3 “misuse” prompts to probe injection and data leakage paths. owasp.org
- Keep examples concise; reviewers should score each in under 60 seconds.
Choosing embeddings wisely
- Start with a proven generalist from the MTEB leaderboard, but verify locally on your corpus and task; a rough verification sketch follows this list. Some models that rank highly on general benchmarks underperform on niche UK domains. huggingface.co
- If you must change models late in delivery, rerun the Day‑2/3 tests end‑to‑end.
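Verifying a candidate embedding model on your own corpus does not need to be a big project. The rough sketch below assumes the open‑source sentence‑transformers and numpy packages and a tiny labelled test set (both assumptions, not requirements); the model name is a common default used purely for illustration, not a recommendation, and the corpus snippets are fictional.

```python
# Rough local check of an embedding model against your own questions and passages.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # swap in each candidate model

passages = {                                        # a handful of your own corpus chunks
    "returns-policy-2024.pdf#p1": "Customers may return online orders within the stated window.",
    "refund-faq.html#q3": "Refunds are issued to the original payment method.",
}
test_set = [
    {"question": "How long do I have to return an online order?",
     "relevant": {"returns-policy-2024.pdf#p1"}},
]

ids = list(passages)
passage_vecs = model.encode([passages[i] for i in ids], normalize_embeddings=True)

def top_k(question: str, k: int = 5) -> list[str]:
    """Rank passages by cosine similarity to the question (vectors are normalised)."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = passage_vecs @ q
    return [ids[i] for i in np.argsort(scores)[::-1][:k]]

for case in test_set:
    retrieved = top_k(case["question"])
    found = bool(case["relevant"] & set(retrieved))
    print(case["question"], "->", "found" if found else "missed")
```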
For broader quality governance and SLA ideas, see our practical templates in The AI Quality Scoreboard and our rapid 5‑Day AI Evaluation Sprint. If you’re nearing launch, pair this UAT with The Quiet Cutover, and use our 10 Tests That Predict AI Quality.
Common pitfalls we see in week five of a pilot
- Chasing public benchmarks instead of business KPIs. Leaderboards like MTEB are useful guides, not sign‑off criteria; evaluate on your data and users. huggingface.co
- “We’ll add monitoring later.” Without the golden signals live on day one, you cannot manage latency/cost regressions. sre.google
- No explicit misuse tests. If you don’t try prompt injection and insecure output handling during UAT, your users will in production. owasp.org
- RAG treated as a black box. Retrieval metrics (recall, nDCG, context precision/recall) catch issues earlier than eyeballing answers. docs.nvidia.com
Your quick‑start checklist
- 10–20 real questions with expected sources and model answers
- Targets for faithfulness, answer relevancy, retrieval, latency and cost
- Misuse test prompts mapped to OWASP LLM risks
- Golden‑signals dashboard and a cost forecast
- Rollback plan and Go/No‑Go thresholds written down
If you prefer a formal framework, align your UAT to “Measure → Manage” within the NIST AI RMF, adapted to your service context. nist.gov