Most AI pilots fail in the awkward middle: the demo looks impressive, but nobody can say whether the output is accurate enough, safe enough, fast enough, or cheap enough to use with real customers and staff. That is the point where UK SMEs need a quality gate, not another round of vague experimentation.
This guide turns AI quality into ten practical tests you can run in two weeks. It is written for owner-managed businesses, charities and growing teams that are evaluating a support assistant, document search tool, summariser, internal copilot or content workflow and need a clear ship/no-ship decision.
The core discipline is simple: test the AI against your real work, not against generic benchmarks. Use recent tickets, policy questions, sales emails, tenders, helpdesk replies and documents. Then score the tool on groundedness, correctness, retrieval, safety, usability, speed, cost and regression risk before anyone calls it “production ready”.
Two recent developments support this approach. First, NIST’s Generative AI Profile emphasises scenario-based, pre-deployment testing as a core activity for managing generative AI risks. nist.gov Second, in the UK public sector the Evaluation Task Force added guidance on impact evaluation methods for AI tools — useful for measuring value in the wild, but separate from the technical and operational tests you run before launch. gov.uk
A practical example: why “sounds right” is not enough
Imagine a customer-support copilot for a small services firm. In a demo, it writes polished replies and finds policy snippets quickly. But when tested against 100 real historical queries, the picture is different: it answers confidently from an outdated PDF, misses an eligibility exception buried in the terms, and refuses several harmless questions because the prompt is too cautious.
That pilot is not a failure — it is exactly why evaluation exists. The right fix may be better document metadata, a clearer retrieval pattern, a narrower answer policy, or a revised refusal rule. Without a test set and pass/fail thresholds, the team only has opinions. With the tests below, they have evidence.
The 10 tests that predict “production‑ready” AI
Use these as your default gates for chat assistants, RAG search, content generation and internal copilots. For each, we suggest a practical target many SMEs can hit without heavy engineering.
| Test | What it measures | Target to pass | Notes |
|---|---|---|---|
| 1) Groundedness | How closely answers stick to your provided sources (RAG). | ≥ 0.85 average groundedness on a 100‑question set. | Defined in Microsoft’s evaluation guidance; use a rubric or LLM‑as‑judge. learn.microsoft.com |
| 2) Answer correctness | Accuracy versus a ground‑truth answer set. | ≥ 80% exact/near‑exact on curated tasks. | Common in RAG tools (e.g., Ragas “answer_correctness”). docs.ragas.io |
| 3) Context recall | Does retrieval fetch the information needed? | ≥ 0.80 context recall@k on your corpus. | Track recall@k plus RAG‑specific recall metrics. docs.nvidia.com |
| 4) Relevance & coherence | Are outputs on topic and readable? | ≥ 0.85 average relevance and coherence. | Available as built‑in evaluators in major platforms. learn.microsoft.com |
| 5) Safety screens | Rate of harmful or policy‑violating content. | ≤ 1% of answers flagged at “high severity”. | Evaluate for self‑harm, hate, violence, sexual content. learn.microsoft.com |
| 6) Refusal vs usefulness | Does the model refuse too often when it could help? | Refusal rate 2–10% on your benign test set. | OpenAI’s safety evaluations show trade‑offs between hallucination and refusal; calibrate for your use case. openai.com |
| 7) User task success | % of users completing a job with AI’s help. | ≥ 80% task success across 10 typical tasks. | Pair with System Usability Scale (SUS) for sentiment. Average SUS ≈ 68; aim ≥ 75–80. uxpajournal.org |
| 8) Time to completion | How long the user needs to get a correct outcome. | Median time at least 30% faster than non‑AI baseline. | Capture screen‑recording timings or timestamps. |
| 9) Cost per task | All‑in variable cost to produce one acceptable answer. | Target a ceiling (e.g., ≤ £0.20 for tier‑1 queries). | Include model calls, retrieval, safety checks, storage I/O. |
| 10) Stability & regressions | Does quality hold weekly as prompts/models change? | No more than 2% swing on key metrics across a week. | Nightly smoke tests on a frozen set of 100 queries. |
Tip: Put “acceptance gates” in your contract or internal go‑live checklist. This is normal practice in software quality; with AI, it’s even more important because models and prompts change underneath you. ISO/IEC 42001, the new AI management system standard, encourages Plan‑Do‑Check‑Act cycles—use these tests as your “Check”. iso.org
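The table above can be turned into a small, machine-checkable gate list so that a go-live review works from the same numbers every time. The sketch below is illustrative only: the metric keys (`groundedness`, `benign_refusal_rate` and so on) are hypothetical names, scores are assumed to be normalised to a 0-1 scale, and the £0.20 cost ceiling is the tier-1 example from the table rather than a recommendation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Gate:
    test: str                        # row in the table above
    metric: str                      # hypothetical key in your results dict
    target: str                      # human-readable target
    passes: Callable[[float], bool]  # pass rule for the measured value


# Assumption: rates and scores are on a 0-1 scale; cost is in GBP.
GATES = [
    Gate("1 Groundedness", "groundedness", ">= 0.85", lambda v: v >= 0.85),
    Gate("2 Answer correctness", "answer_correctness", ">= 0.80", lambda v: v >= 0.80),
    Gate("3 Context recall", "context_recall", ">= 0.80", lambda v: v >= 0.80),
    Gate("4 Relevance", "relevance", ">= 0.85", lambda v: v >= 0.85),
    Gate("4 Coherence", "coherence", ">= 0.85", lambda v: v >= 0.85),
    Gate("5 Safety", "high_severity_rate", "<= 0.01", lambda v: v <= 0.01),
    Gate("6 Refusals", "benign_refusal_rate", "0.02-0.10", lambda v: 0.02 <= v <= 0.10),
    Gate("7 Task success", "task_success_rate", ">= 0.80", lambda v: v >= 0.80),
    Gate("8 Time saving", "time_saving_vs_baseline", ">= 0.30", lambda v: v >= 0.30),
    Gate("9 Cost per task", "cost_per_task_gbp", "<= 0.20", lambda v: v <= 0.20),
    Gate("10 Stability", "max_weekly_swing", "<= 0.02", lambda v: v <= 0.02),
]


def check_gates(results: dict[str, float]) -> list[tuple[str, str, Optional[float], bool]]:
    """Return (test, target, measured value, passed) for every gate.

    A metric that is missing from `results` counts as a failure, not a pass.
    """
    report = []
    for gate in GATES:
        value = results.get(gate.metric)
        passed = value is not None and gate.passes(value)
        report.append((gate.test, gate.target, value, passed))
    return report


def all_gates_pass(results: dict[str, float]) -> bool:
    """True only if every gate in the list has a measured, passing value."""
    return all(passed for *_, passed in check_gates(results))
```

Treating a missing metric as a failure is a deliberate choice here: a gate nobody measured should not quietly count as passed.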
What to test with: building a lightweight evaluation set
Curate 100–300 real questions
- Pull from recent tickets, sales emails, policy FAQs and docs. Spread by category (e.g., pricing, warranty, charity eligibility, HR policy).
- Write a short, unambiguous ground truth for each where possible. If there is no single answer, record the specific source passages to support groundedness testing.
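One way to keep the evaluation set honest is to store each question as a small record and check the set before every run. This is a minimal sketch, not a prescribed schema: the `EvalItem` fields and the `summarise` helper are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    question: str
    category: str                       # e.g. "pricing", "warranty", "HR policy"
    ground_truth: str = ""              # short, unambiguous answer where one exists
    source_passages: list[str] = field(default_factory=list)  # used for groundedness

    def is_scorable(self) -> bool:
        """An item needs a ground truth or at least one source passage."""
        return bool(self.ground_truth.strip()) or bool(self.source_passages)


def summarise(items: list[EvalItem]) -> tuple[Counter, list[str]]:
    """Return the per-category counts and the questions that cannot be scored."""
    by_category = Counter(item.category for item in items)
    unscorable = [item.question for item in items if not item.is_scorable()]
    return by_category, unscorable
```

The category counts show whether the set is spread across the jobs you care about, and the unscorable list shows which questions still need a ground truth or a source passage.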
Map each test to a metric
- Groundedness, relevance, coherence and safety are available as built‑in evaluation metrics in mainstream platforms; they are defined in plain language and scored consistently. learn.microsoft.com
- For RAG systems, add answer correctness, context precision and context recall. These are standardised in open‑source RAG evaluation libraries. docs.ragas.io
- For usability, run a short task‑based test and send users the 10‑item System Usability Scale (SUS). Average products score around 68; top‑decile usability sits above ~80. uxpajournal.org
Why we don’t rely on generic leaderboards
Leaderboards are useful for research, but your risks are local. NIST’s 2025 work on repeatable, scenario‑based evaluations underlines the need for sector‑specific test sets that reflect real work, not trivia. nist.gov
A 2‑week plan you can run in‑house
Week 1 — Define, baseline, and fix the obvious
- Day 1–2: Agree the top 3 user jobs to improve (e.g., “answer product eligibility”, “summarise 30‑page tender”). Define your 10 acceptance gates and targets from the table above, and set a cost‑per‑task ceiling for each query tier.
- Day 3: Assemble 100–300 real questions with ground truths or source passages. Tag each by category and difficulty.
- Day 4: Run a baseline evaluation against your current pilot. Capture groundedness, correctness, recall@k, relevance, coherence, and safety flags.
- Day 5: Triage failures. Most quick wins come from better retrieval (document chunking/metadata), prompt instructions, or stronger source coverage.
Week 2 — Prove it with users, cost and stability
- Day 6–7: Re‑run the evals after fixes. Compare deltas. If groundedness and correctness rose but refusals spiked, rebalance instructions to avoid over‑refusal. OpenAI’s public safety evals illustrate this trade‑off clearly. openai.com
- Day 8: 10–12 users complete 8–10 tasks each. Record completion, time‑to‑complete, and collect SUS scores.
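- Day 9: Cost pass: measure the actual marginal cost per acceptable answer for tier‑1 and tier‑2 queries, and give Finance an early view of the expected run cost.
- Day 10: Stability pass: start the nightly run on the 100 frozen queries and keep it running after the pilot. If any key metric swings by more than 2% across a week, investigate what changed underneath it: the model version, the prompt or the retrieval setup.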
- Day 11–12: Document results and risks. If buying, put the acceptance gates into the contract and SLAs. If building, add them to your release checklist.
- Day 13–14: Exec review. Either “Go” with a 4‑week hypercare plan, or “No‑go” with a remediation path and a re‑test date.
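The Day 10 stability pass can be reduced to a simple comparison across the week's nightly runs. The sketch below assumes each run is a dict of metric name to score on a 0-1 scale, and it reads the "2% swing" as two percentage points between the best and worst night; both are assumptions, so adjust them to match how you define the gate.

```python
from __future__ import annotations


def weekly_swings(runs: list[dict[str, float]]) -> dict[str, float]:
    """Max minus min of each metric across the nightly runs.

    Only metrics present in every run are compared.
    """
    if not runs:
        raise ValueError("need at least one nightly run")
    shared = set.intersection(*(set(run) for run in runs))
    return {
        metric: round(max(r[metric] for r in runs) - min(r[metric] for r in runs), 4)
        for metric in shared
    }


def regressions(runs: list[dict[str, float]], max_swing: float = 0.02) -> dict[str, float]:
    """Metrics whose swing across the week exceeds the allowed limit."""
    return {m: s for m, s in weekly_swings(runs).items() if s > max_swing}
```

Rounding the swing to four decimal places avoids a floating-point artefact where a swing of exactly 0.02 is reported as fractionally over the limit.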
Making the metrics concrete
Groundedness
For any answer that should be based on your documents, require a groundedness score at or above your gate (≥ 0.85 in the table above) and require the answer to cite its sources. Microsoft defines groundedness as alignment of the answer with provided sources; most tools now offer a built‑in groundedness evaluator using a consistent rubric. learn.microsoft.com
Answer correctness vs ground truth
Where you can write a short, “single correct” answer (e.g., price, eligibility, warranty), test for correctness. Open‑source RAG evaluation libraries include “answer_correctness” and related metrics that combine factuality and semantic similarity, and other evaluation frameworks (e.g., NVIDIA's NeMo documentation) list a similar set. docs.ragas.io
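Before wiring up a full evaluation library, a rough first pass at "exact or near-exact" can be done with the standard library. The sketch below is a crude stand-in and does not reproduce Ragas's `answer_correctness`: it compares normalised strings, uses a similarity threshold of 0.9 that is an arbitrary assumption, and rejects any answer that drops a number found in the ground truth, since a wrong price or period can otherwise look "near-exact".

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def numbers(text: str) -> set[str]:
    """Numeric tokens such as '12' or '0.20'."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def is_near_exact(answer: str, truth: str, min_ratio: float = 0.9) -> bool:
    """True if the answer matches the ground truth closely enough.

    Every number in the ground truth must also appear in the answer.
    """
    if not numbers(truth) <= numbers(answer):
        return False
    a, t = normalise(answer), normalise(truth)
    return a == t or SequenceMatcher(None, a, t).ratio() >= min_ratio


def correctness_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of (answer, ground truth) pairs judged exact or near-exact."""
    if not pairs:
        raise ValueError("need at least one (answer, truth) pair")
    return sum(is_near_exact(a, t) for a, t in pairs) / len(pairs)
```

String matching of this kind only suits short, factual answers; longer or free-form answers need a rubric or a semantic metric, and borderline cases are worth a human look.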
Recall@k for retrieval
When teams complain about “hallucinations”, the culprit is often poor retrieval. Track recall@k and context recall. If recall@k is low, fix ingestion and metadata before tinkering with prompts. docs.nvidia.com
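Recall@k itself is simple to compute once each test question is labelled with the documents or chunks that contain the answer. A minimal sketch, assuming you can get an ordered list of retrieved chunk IDs per query; the function and argument names are illustrative.

```python
from __future__ import annotations


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant chunk ID")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(queries: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(recall_at_k(ret, rel, k) for ret, rel in queries) / len(queries)
```

For example, if a question needs chunks `b` and `d` and the retriever returns `a, b, c, d` in that order, recall@3 is 0.5 because only `b` made the top three.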
Usability and sentiment
Task success and SUS tell you if people can and will use the new tool. SUS averages about 68 across products; aim ≥ 75–80 before rollout to frontline teams. uxpajournal.org
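SUS scoring follows a fixed rule that is easy to get wrong by hand: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the total is multiplied by 2.5 to give a score out of 100. A small sketch, assuming responses are recorded as integers from 1 (strongly disagree) to 5 (strongly agree) in questionnaire order:

```python
from __future__ import annotations

from statistics import mean


def sus_score(responses: list[int]) -> float:
    """Score one completed 10-item SUS questionnaire on a 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for position, response in enumerate(responses, start=1):
        if position % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


def mean_sus(all_responses: list[list[int]]) -> float:
    """Average SUS across participants, to compare against the 75-80 target."""
    return mean(sus_score(r) for r in all_responses)
```

Score each participant separately and then average; averaging the raw responses first hides the spread between users.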
Decision tree: ship, pause or pivot
- Ship: If tests 1–5 meet targets and SUS ≥ 75, with cost per task within ceiling. Prepare a four‑week hypercare plan with weekly re‑evals.
- Pause: If groundedness or correctness miss by 5–10 points. Focus on retrieval, prompts and document coverage; re‑test in a week.
- Pivot: If safety flags exceed 1% or refusals exceed 15% on benign tasks; evaluate a different model class or pattern and re‑scope the user job.
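The three branches can be written down as a single function so the exec review argues about the numbers rather than the rule. This is a sketch under stated assumptions: the metric keys are hypothetical, scores are on a 0-1 scale, the pivot conditions are checked first, and any case the list above does not cover (for example, a miss of more than 10 points) falls through to "pause" for a human decision.

```python
from __future__ import annotations


def decide(m: dict[str, float], cost_ceiling_gbp: float = 0.20) -> str:
    """Return 'ship', 'pause' or 'pivot' from the measured results."""
    # Pivot: safety flags above 1% or benign refusals above 15%.
    if m["high_severity_rate"] > 0.01 or m["benign_refusal_rate"] > 0.15:
        return "pivot"

    # Ship: tests 1-5 meet their targets, SUS >= 75, cost within ceiling.
    tests_1_to_5_pass = (
        m["groundedness"] >= 0.85
        and m["answer_correctness"] >= 0.80
        and m["context_recall"] >= 0.80
        and m["relevance"] >= 0.85
        and m["coherence"] >= 0.85
        and m["high_severity_rate"] <= 0.01
    )
    if tests_1_to_5_pass and m["sus"] >= 75 and m["cost_per_task_gbp"] <= cost_ceiling_gbp:
        return "ship"

    # Assumption: everything else is a pause pending fixes and a re-test.
    return "pause"
```

The default cost ceiling reuses the tier-1 example of £0.20 from the table; pass your own ceiling for other tiers.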
Procurement questions that force quality
- Show your last three evaluation runs (CSV/JSON) with groundedness, correctness, context recall and safety metrics. Which deltas improved week‑to‑week? How do you detect regressions?
- What are your recommended acceptance gates for our use case, and what sample size do you need to demonstrate them with 95% confidence?
- How will you keep refusal rates useful but not excessive for our benign tasks? Provide evidence from your safety/hallucination evaluations. openai.com
- Which parts of your evaluation are automated, and how often will you re‑run them after a model, prompt or retrieval change?
- How will you hand over the evaluation datasets and dashboards so our team can keep testing independently after go‑live? NIST’s recent work on scenario libraries is a good pattern—can you align to it? nist.gov
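The sample-size question matters because a pass rate measured on a small set carries a wide margin of error. One common way to show this is the Wilson score interval for a proportion; the sketch below uses it as an illustration and is not the only valid method. It assumes each test question is an independent pass/fail trial.

```python
from __future__ import annotations

from math import sqrt
from statistics import NormalDist


def wilson_interval(passes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `n` questions."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

With 85 correct out of 100, the lower bound of the 95% interval is about 0.77, so the result does not yet demonstrate an 80% gate. With 255 correct out of 300, the same observed rate gives a lower bound just above 0.80, which is one reason to push the evaluation set towards the top of the 100-300 range.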
For contract language and SLAs, see our guide “The UK SME buyer’s playbook for AI contracts (2025)”. It includes model change controls and break‑glass clauses linked to evaluation gates.
Risks, costs and the “gotchas” to watch
- Benchmark chasing: A model that tops a public leaderboard may still underperform on your own tasks. Build and keep your own test set aligned to the jobs your staff actually do. nist.gov
- Refusal spikes: Tightening safety or groundedness can push refusal‑to‑answer rates up. Monitor refusals alongside hallucinations and calibrate to your risk appetite. openai.com
- Hidden operating cost: Track full marginal cost per acceptable answer, including retrieval, storage and safety passes—not just the generation call.
- Process drift: Lock a nightly regression run. Even minor prompt edits or a vendor updating a model can subtly erode quality if you don’t watch the numbers.
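The "hidden operating cost" point comes down to one piece of arithmetic: divide everything you spent, including attempts that produced unusable answers, by the number of answers that were actually acceptable. A minimal sketch, with hypothetical field names for the cost components listed in the table:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskCost:
    generation_gbp: float   # model calls
    retrieval_gbp: float    # search and embedding lookups
    safety_gbp: float       # moderation and safety passes
    storage_gbp: float      # storage I/O attributable to the task
    acceptable: bool        # did a reviewer accept the answer?

    def total(self) -> float:
        return self.generation_gbp + self.retrieval_gbp + self.safety_gbp + self.storage_gbp


def cost_per_acceptable_answer(tasks: list[TaskCost]) -> float:
    """Total spend across all attempts divided by the number of accepted answers."""
    accepted = sum(1 for task in tasks if task.acceptable)
    if accepted == 0:
        raise ValueError("no acceptable answers, so the cost per acceptable answer is undefined")
    return sum(task.total() for task in tasks) / accepted
```

If two attempts cost £0.10 each and only one answer is accepted, the figure to compare against your ceiling is £0.20, not £0.10.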
Where this fits in your broader roadmap
If you’re still evaluating models, pair this with our 5‑day AI evaluation sprint. For RAG‑heavy use cases, use the 6‑week RAG blueprint to improve retrieval and sources before you run these tests. And if you’re productising features, sanity‑check your UX plan with “Chat isn’t always the answer”. For legal/commercial terms, refer back to the buyer’s playbook.
Ready to put your pilot on trial?
We can help you assemble an evaluation set, run the tests, and turn the results into a simple ship/no‑ship decision your board will trust.