
The 10 Tests That Predict AI Quality: A 2‑Week Evaluation Plan for UK SMEs

If your AI pilot is nearly “good enough”, this article gives you the acceptance tests, pass/fail thresholds and a 2‑week plan to prove it. The goal is simple: don’t ship an assistant, summariser or RAG tool unless it can demonstrate accuracy, safety and usability against a small number of clear, business‑relevant metrics.

Two recent developments support this approach. First, NIST’s Generative AI Profile emphasises scenario‑based, pre‑deployment testing as a core activity for managing generative AI risks. nist.gov Second, in the UK public sector the Evaluation Task Force added guidance on impact evaluation methods (such as RCTs) for AI tools; this is useful for measuring value in the wild, but it is separate from the technical model evaluation you run before launch. gov.uk

The 10 tests that predict “production‑ready” AI

Use these as your default gates for chat assistants, RAG search, content generation and internal copilots. For each, we suggest a practical target many SMEs can hit without heavy engineering.

| Test | What it measures | Target to pass | Notes |
| --- | --- | --- | --- |
| 1) Groundedness | How closely answers stick to your provided sources (RAG). | ≥ 0.85 average groundedness on a 100‑question set. | Defined in Microsoft’s evaluation guidance; use a rubric or LLM‑as‑judge. learn.microsoft.com |
| 2) Answer correctness | Accuracy versus a ground‑truth answer set. | ≥ 80% exact/near‑exact on curated tasks. | Common in RAG tools (e.g., Ragas “answer_correctness”). docs.ragas.io |
| 3) Context recall | Does retrieval fetch the information needed? | ≥ 0.80 context recall@k on your corpus. | Track recall@k plus RAG‑specific recall metrics. docs.nvidia.com |
| 4) Relevance & coherence | Are outputs on topic and readable? | ≥ 0.85 average relevance and coherence. | Available as built‑in evaluators in major platforms. learn.microsoft.com |
| 5) Safety screens | Rate of harmful or policy‑violating content. | ≤ 1% of answers flagged at “high severity”. | Evaluate for self‑harm, hate, violence, sexual content. learn.microsoft.com |
| 6) Refusal vs usefulness | Does the model refuse too often when it could help? | Refusal rate 2–10% on your benign test set. | OpenAI’s safety evaluations show trade‑offs between hallucination and refusal; calibrate for your use case. openai.com |
| 7) User task success | % of users completing a job with AI’s help. | ≥ 80% task success across 10 typical tasks. | Pair with System Usability Scale (SUS) for sentiment. Average SUS ≈ 68; aim ≥ 75–80. uxpajournal.org |
| 8) Time to completion | How long the user needs to get a correct outcome. | Median time at least 30% faster than non‑AI baseline. | Capture screen‑recording timings or timestamps. |
| 9) Cost per task | All‑in variable cost to produce one acceptable answer. | Target a ceiling (e.g., ≤ £0.20 for tier‑1 queries). | Include model calls, retrieval, safety checks, storage I/O. |
| 10) Stability & regressions | Does quality hold weekly as prompts/models change? | No more than 2% swing on key metrics across a week. | Nightly smoke tests on a frozen set of 100 queries. |

Tip: Put “acceptance gates” in your contract or internal go‑live checklist. This is normal practice in software quality; with AI, it’s even more important because models and prompts change underneath you. ISO/IEC 42001, the new AI management system standard, encourages Plan‑Do‑Check‑Act cycles—use these tests as your “Check”. iso.org
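As a sketch of how those gates can live in a go‑live checklist, the snippet below encodes the table’s thresholds as data and checks a metrics report against them. The metric names, report format and check_gates() helper are illustrative assumptions, not any platform’s API; only the numbers come from the table above.

```python
# Illustrative acceptance gates mirroring the table above.
# Metric names and the report format are assumptions, not a platform API.
ACCEPTANCE_GATES = {
    "groundedness":        {"min": 0.85},
    "answer_correctness":  {"min": 0.80},
    "context_recall_at_k": {"min": 0.80},
    "relevance":           {"min": 0.85},
    "coherence":           {"min": 0.85},
    "high_severity_rate":  {"max": 0.01},
    "refusal_rate":        {"min": 0.02, "max": 0.10},
    "task_success":        {"min": 0.80},
    "cost_per_task_gbp":   {"max": 0.20},   # tier-1 ceiling from the table
}

def check_gates(report: dict) -> list[str]:
    """Return human-readable failures for a metrics report (empty list = pass)."""
    failures = []
    for metric, bounds in ACCEPTANCE_GATES.items():
        value = report.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
            continue
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}: {value:.2f} below {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}: {value:.2f} above {bounds['max']}")
    return failures
```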

What to test with: building a lightweight evaluation set

Curate 100–300 real questions

  • Pull from recent tickets, sales emails, policy FAQs and docs. Spread by category (e.g., pricing, warranty, charity eligibility, HR policy).
  • Write a short, unambiguous ground truth for each where possible. If there is no single answer, record the specific source passages to support groundedness testing.
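A minimal record format for that evaluation set might look like the sketch below. The field names are assumptions rather than a standard, but keeping the question, ground truth, source passages and tags together makes the later tests easy to automate.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalRecord:
    # Field names are illustrative; keep whatever schema your tooling expects.
    question: str
    ground_truth: str | None        # short, unambiguous answer where one exists
    source_passages: list[str]      # passages that support groundedness testing
    category: str                   # e.g. "pricing", "warranty", "HR policy"
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)

def save_jsonl(records: list[EvalRecord], path: str) -> None:
    """Write the evaluation set as JSON Lines, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")
```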

Map each test to a metric

  • Groundedness, relevance, coherence and safety are available as built‑in evaluation metrics in mainstream platforms; they are defined in plain language and scored consistently. learn.microsoft.com
  • For RAG systems, add answer correctness, context precision and context recall. These are standardised in open‑source RAG evaluation libraries; a short example follows this list. docs.ragas.io
  • For usability, run a short task‑based test and send users the 10‑item System Usability Scale (SUS). Average products score around 68; top‑decile usability sits above ~80. uxpajournal.org
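If you use Ragas for the RAG metrics, a run looks roughly like the sketch below. It assumes a Ragas 0.1‑style API and a configured judge LLM (an OpenAI key by default); column and metric names have changed between versions, so treat this as a shape to adapt rather than copy.

```python
# Sketch of a Ragas-style evaluation run. Assumes a 0.1-style API and a
# configured judge LLM; check your installed version's docs for exact names.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question":     ["Is the starter plan eligible for a charity discount?"],
    "answer":       ["Yes, registered charities get 20% off the starter plan."],
    "contexts":     [["Registered charities receive a 20% discount on all plans."]],
    "ground_truth": ["Yes, registered charities receive a 20% discount."],
})

scores = evaluate(eval_data, metrics=[answer_correctness, context_precision, context_recall])
print(scores)  # dict-like result, e.g. answer_correctness, context_precision, context_recall
```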

Why we don’t rely on generic leaderboards

Leaderboards are useful for research, but your risks are local. NIST’s 2025 work on repeatable, scenario‑based evaluations underlines the need for sector‑specific test sets that reflect real work, not trivia. nist.gov

A 2‑week plan you can run in‑house

Week 1 — Define, baseline, and fix the obvious

  1. Day 1–2: Agree the top 3 user jobs to improve (e.g., “answer product eligibility”, “summarise 30‑page tender”). Define your 10 acceptance gates and targets from the table above. Add cost per task ceilings per tier.
  2. Day 3: Assemble 100–300 real questions with ground truths or source passages. Tag each by category and difficulty.
  3. Day 4: Run a baseline evaluation against your current pilot. Capture groundedness, correctness, recall@k, relevance, coherence, and safety flags (a minimal runner is sketched after this list).
  4. Day 5: Triage failures. Most quick wins come from better retrieval (document chunking/metadata), prompt instructions, or stronger source coverage.
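The Day 4 baseline mostly needs a loop that sends every question to the pilot, scores it, and writes one row per question so the Week 2 re‑run is comparable. In the sketch below, ask_pilot() and score_answer() are hypothetical stand‑ins for your own client and evaluators.

```python
import csv
import time

def run_baseline(records, ask_pilot, score_answer, out_path="baseline.csv"):
    """Run every eval question through the pilot and log one row per question.

    ask_pilot(question) -> answer text (hypothetical client for your pilot).
    score_answer(record, answer) -> dict of metric name -> score (your evaluators).
    """
    fieldnames = ["question", "category", "latency_s",
                  "groundedness", "correctness", "relevance", "safety_flag"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for rec in records:
            start = time.perf_counter()
            answer = ask_pilot(rec["question"])
            row = {"question": rec["question"],
                   "category": rec.get("category", ""),
                   "latency_s": round(time.perf_counter() - start, 2)}
            row.update(score_answer(rec, answer))
            writer.writerow(row)
```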

Week 2 — Prove it with users, cost and stability

  1. Day 6–7: Re‑run the evals after fixes. Compare deltas. If groundedness and correctness rose but refusals spiked, rebalance instructions to avoid over‑refusal. OpenAI’s public safety evals illustrate this trade‑off clearly. openai.com
  2. Day 8: Have 10–12 users complete 8–10 tasks each. Record completion and time to complete, and collect SUS scores.
  3. Day 9: Cost pass: measure the actual marginal cost per acceptable answer for tier‑1 and tier‑2 queries, and give Finance a heads‑up on the budget.
  4. Day 10: Stability pass: nightly run on 100 frozen queries. If metrics swing by more than 2%, investigate dependency changes (a simple comparison sketch follows this list).
  5. Day 11–12: Document results and risks. If buying, put the acceptance gates into the contract and SLAs. If building, add them to your release checklist.
  6. Day 13–14: Exec review. Either “Go” with a 4‑week hypercare plan, or “No‑go” with a remediation path and a re‑test date.
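The Day 10 stability pass, and the nightly smoke test behind gate 10, boils down to comparing each metric against a frozen baseline and flagging swings above the threshold. A minimal check, assuming metrics are fractions between 0 and 1 and a 2‑percentage‑point tolerance, could look like this:

```python
def detect_regressions(baseline: dict, nightly: dict, tolerance: float = 0.02) -> dict:
    """Flag metrics that moved more than `tolerance` (absolute) vs the frozen baseline.

    Both inputs map metric name -> score in [0, 1]; a tolerance of 0.02 matches
    the "no more than 2% swing" gate in the table above.
    """
    flagged = {}
    for metric, base in baseline.items():
        current = nightly.get(metric)
        if current is None:
            flagged[metric] = "missing from nightly run"
        elif abs(current - base) > tolerance:
            flagged[metric] = f"moved {current - base:+.3f} (baseline {base:.3f})"
    return flagged

# Example: groundedness dropped from 0.88 to 0.84, so it gets flagged.
print(detect_regressions({"groundedness": 0.88}, {"groundedness": 0.84}))
```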

Making the metrics concrete

Groundedness

For any answer that should be based on your documents, require a high groundedness score and provide citations. Microsoft defines groundedness as alignment of the answer with provided sources; most tools now offer a built‑in groundedness evaluator using a consistent rubric. learn.microsoft.com
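If your platform does not provide a groundedness evaluator, an LLM‑as‑judge rubric is a common substitute. The prompt and 1–5 scale below illustrate the pattern; they are not Microsoft’s rubric, and call_judge_model is a hypothetical wrapper for whichever judge model you use.

```python
GROUNDEDNESS_RUBRIC = """You are grading whether an ANSWER is supported by the SOURCES.
Score 1-5:
5 = every claim in the answer is directly supported by the sources
3 = mostly supported, with minor unsupported detail
1 = the answer contradicts or is not found in the sources
Return only the number.

SOURCES:
{sources}

ANSWER:
{answer}
"""

def groundedness_score(answer: str, sources: list[str], call_judge_model) -> float:
    """Score one answer on a 1-5 rubric, normalised to 0-1 to match the 0.85 gate.

    call_judge_model(prompt) -> str is a hypothetical wrapper around your judge LLM.
    """
    prompt = GROUNDEDNESS_RUBRIC.format(sources="\n---\n".join(sources), answer=answer)
    raw = call_judge_model(prompt).strip()
    return (float(raw) - 1) / 4  # map 1-5 onto 0-1 so the table's targets still apply
```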

Answer correctness vs ground truth

Where you can write a short, “single correct” answer (e.g., price, eligibility, warranty), test for correctness. Open‑source RAG evaluation libraries include “answer_correctness” and related metrics that combine factuality and semantic similarity, and vendor‑neutral frameworks (e.g., NVIDIA NeMo docs) list the same set. docs.ragas.io
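For those “single correct answer” cases, a normalised exact/near‑exact match gets you a long way before you reach for LLM‑graded correctness. The normalisation rules and 0.9 similarity threshold below are illustrative choices, not part of any library’s definition.

```python
import re
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    """Lowercase, strip most punctuation and collapse whitespace for fair comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s£%.]", "", text.lower())).strip()

def correctness(answer: str, ground_truth: str, near_threshold: float = 0.9) -> str:
    """Classify an answer as exact, near-exact or incorrect vs the ground truth."""
    a, g = normalise(answer), normalise(ground_truth)
    if a == g:
        return "exact"
    if g in a or SequenceMatcher(None, a, g).ratio() >= near_threshold:
        return "near-exact"
    return "incorrect"

# The answer normalises to a superset of the ground truth, so it counts as near-exact.
print(correctness("Yes, the standard warranty is 24 months.",
                  "The standard warranty is 24 months."))
```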

Recall@k for retrieval

When teams complain about “hallucinations”, the culprit is often poor retrieval. Track recall@k and context recall. If recall@k is low, fix ingestion and metadata before tinkering with prompts. docs.nvidia.com
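Recall@k itself is simple to compute once your eval set records which document or passage IDs actually contain the answer. The sketch below assumes you have those relevant IDs and that the retriever returns ranked IDs.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the truly relevant passages that appear in the top-k results."""
    if not relevant_ids:
        return 0.0  # or skip the question; no ground-truth passages were recorded
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Average over the whole eval set to compare against the >= 0.80 gate in the table.
queries = [
    (["doc_12", "doc_03", "doc_44"], {"doc_03", "doc_19"}),  # finds 1 of 2 -> 0.5
    (["doc_07", "doc_19"],           {"doc_07"}),            # finds 1 of 1 -> 1.0
]
print(sum(recall_at_k(r, rel, k=3) for r, rel in queries) / len(queries))  # 0.75
```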

Usability and sentiment

Task success and SUS tell you if people can and will use the new tool. SUS averages about 68 across products; aim ≥ 75–80 before rollout to frontline teams. uxpajournal.org
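Scoring SUS is mechanical and worth automating so every pilot is reported the same way. The function below applies the standard scoring rule to the ten 1–5 responses, taken in questionnaire order.

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring: ten 1-5 responses -> a 0-100 score.

    Odd-numbered items are positively worded (score - 1); even-numbered items
    are negatively worded (5 - score); the summed contributions are scaled by 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    contributions = [(r - 1) if i % 2 == 0 else (5 - r)   # i=0 is item 1 (odd)
                     for i, r in enumerate(responses)]
    return sum(contributions) * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # 90.0
```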

Decision tree: ship, pause or pivot

  • Ship: tests 1–5 meet targets, SUS ≥ 75, and cost per task is within its ceiling. Prepare a four‑week hypercare plan with weekly re‑evals.
  • Pause: groundedness or correctness miss by 5–10 points. Focus on retrieval, prompts and document coverage; re‑test in a week.
  • Pivot: safety flags exceed 1% or refusals exceed 15% on benign tasks. Evaluate a different model class or pattern and re‑scope the user job.
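The tree is easy to encode so every exec review starts from the same answer. The thresholds below copy the bullets above; the metric keys are illustrative, and the function is a sketch to support judgement, not replace it.

```python
def ship_decision(m: dict) -> str:
    """Apply the ship/pause/pivot rules above to a dict of measured metrics."""
    if m["high_severity_rate"] > 0.01 or m["refusal_rate"] > 0.15:
        return "pivot"    # safety or refusal breach: try a different model class or pattern
    core_gates_met = (m["groundedness"] >= 0.85 and m["answer_correctness"] >= 0.80
                      and m["context_recall"] >= 0.80 and m["relevance"] >= 0.85
                      and m["coherence"] >= 0.85)
    if core_gates_met and m["sus"] >= 75 and m["cost_per_task_gbp"] <= m["cost_ceiling_gbp"]:
        return "ship"     # plus a four-week hypercare plan with weekly re-evals
    return "pause"        # near-miss: fix retrieval, prompts and coverage, then re-test
```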

Procurement questions that force quality

  1. Show your last three evaluation runs (CSV/JSON) with groundedness, correctness, context recall and safety metrics. Which deltas improved week‑to‑week? How do you detect regressions?
  2. What are your recommended acceptance gates for our use case, and what sample size do you need to demonstrate them with 95% confidence? (A quick sizing check follows this list.)
  3. How will you keep refusal rates useful but not excessive for our benign tasks? Provide evidence from your safety/hallucination evaluations. openai.com
  4. Which parts of your evaluation are automated, and how often will you re‑run them after a model, prompt or retrieval change?
  5. How will you hand over the evaluation datasets and dashboards so our team can keep testing independently after go‑live? NIST’s recent work on scenario libraries is a good pattern—can you align to it? nist.gov
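You can sanity‑check the sample size in question 2 yourself with the standard normal‑approximation formula for a proportion; the expected pass rate and margin below are illustrative inputs, and the result supports the 100–300 question range suggested earlier.

```python
import math

def sample_size(p: float = 0.8, margin: float = 0.05, z: float = 1.96) -> int:
    """Normal-approximation sample size for estimating a pass rate.

    p: expected pass rate, margin: half-width of the confidence interval,
    z: 1.96 for 95% confidence.
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.8, 0.05))  # ~246 questions for +/-5 points around an 80% pass rate
print(sample_size(0.8, 0.08))  # ~97 questions if +/-8 points is acceptable
```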

For contract language and SLAs, see our buyer’s playbook, The UK SME buyer’s playbook for AI contracts (2025). It includes model change controls and break‑glass clauses linked to evaluation gates.

Risks, costs and the “gotchas” to watch

  • Benchmark chasing: A model that tops a public leaderboard may still underperform on your own tasks. Build and keep your own test set aligned to the jobs your staff actually do. nist.gov
  • Refusal spikes: Tightening safety or groundedness can push refusal‑to‑answer rates up. Monitor refusals alongside hallucinations and calibrate to your risk appetite. openai.com
  • Hidden operating cost: Track full marginal cost per acceptable answer, including retrieval, storage and safety passes—not just the generation call.
  • Process drift: Lock a nightly regression run. Even minor prompt edits or a vendor updating a model can subtly erode quality if you don’t watch the numbers.

Where this fits in your broader roadmap

If you’re still evaluating models, pair this with our 5‑day AI evaluation sprint. For RAG‑heavy use cases, use the 6‑week RAG blueprint to improve retrieval and sources before you run these tests. And if you’re productising features, sanity‑check your UX plan with “Chat isn’t always the answer”. For legal/commercial terms, refer back to the buyer’s playbook.

Ready to put your pilot on trial?

We can help you assemble an evaluation set, run the tests, and turn the results into a simple ship/no‑ship decision your board will trust.