If you’re a director, operations lead, lawyer/DPO or trustee trying to choose an AI tool, you don’t want demos or vendor claims — you want proof. This article sets out a fast, repeatable 5‑day evaluation sprint you can run with your team to compare models and vendors on your real work, with objective measures for quality, safety and cost.
We’ll use a composite UK SME example: a 120‑person distributor wants AI to draft first‑pass replies to customer emails and speed up knowledge search from manuals. The sprint below is designed to produce an evidence pack that your board, procurement and DPO can sign off.
What “good” looks like: the five KPIs
- Task success rate: percentage of outputs that meet your acceptance criteria on real examples (e.g., “ready to send with minor tweaks”).
- Factuality: zero critical errors and a low number of minor edits per output. Use a small gold set with agreed “right answers”.
- Safety and data handling: a low rate of unsafe responses; no leaks of sensitive or customer data; vendor follows recognised secure‑by‑design guidance for AI. ([gov.uk](https://www.gov.uk/government/publications/ai-cyber-security-code-of-practice))
- Latency: 95th percentile response under your SLA (e.g., 4 seconds for chat; 20 seconds for long answers).
- Cost per task: all‑in unit cost you can cap (prompt + retrieval + output + moderation). Track the 10 highest‑cost cases (a worked sketch appears below).
These KPIs map cleanly to widely used assurance concepts: measure capability and safety, document the context, and make results understandable to non‑technical stakeholders. ([gov.uk](https://www.gov.uk/data-ethics-guidance/introduction-to-ai-assurance))
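To make the latency and cost KPIs concrete, here is a minimal Python sketch over a logged test run. The per‑token prices, moderation fee and field names are illustrative assumptions, not any vendor’s actual rates; swap in the figures from your provider’s current pricing page.

```python
# Minimal KPI sketch: p95 latency and all-in cost per task from a logged run.
# Prices below are illustrative assumptions; use your vendor's current rates.
import math
import statistics

PRICE_PER_1K_INPUT_TOKENS = 0.003   # assumed, GBP
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # assumed, GBP
MODERATION_COST_PER_TASK = 0.001    # assumed flat fee, GBP

runs = [  # one record per test item: tokens in/out and wall-clock latency
    {"task_id": "email-001", "input_tokens": 1200, "output_tokens": 350, "latency_s": 2.8},
    {"task_id": "email-002", "input_tokens": 900,  "output_tokens": 280, "latency_s": 3.4},
    {"task_id": "email-003", "input_tokens": 2100, "output_tokens": 500, "latency_s": 6.1},
]

def cost_per_task(run: dict) -> float:
    """All-in unit cost: prompt + output + moderation (add retrieval if you use it)."""
    return (run["input_tokens"] / 1000 * PRICE_PER_1K_INPUT_TOKENS
            + run["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
            + MODERATION_COST_PER_TASK)

costs = sorted((cost_per_task(r), r["task_id"]) for r in runs)
latencies = sorted(r["latency_s"] for r in runs)

# Nearest-rank p95: fine for a sprint-sized sample; use proper tooling at scale.
p95_latency = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]

print(f"p95 latency: {p95_latency:.1f}s")
print(f"mean cost per task: £{statistics.mean(c for c, _ in costs):.4f}")
print("costliest cases:", [task_id for _, task_id in costs[-10:]])
```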
The 5‑Day Evaluation Sprint
Day 0 (prep, 2–3 hours)
- Pick one priority use case (e.g., email replies) and gather 50–100 recent, representative examples. Remove personal data you don’t need for testing.
- Write acceptance criteria for “good enough”. Keep it in plain English, one page max.
- Assemble a small “gold set” of 15–20 items with agreed ideal answers. These will drive side‑by‑side scoring.
Day 1 — Baseline and test harness
- Score your current method (human only, search only, or existing bot). This is your baseline.
- Create a simple scoring sheet: pass/fail on task success, a 0–2 score for factuality edits, time to first draft, and estimated cost per task (a minimal structure is sketched after this list).
- Define red‑team prompts you don’t want the AI to answer (e.g., requests for customer data, “work around policy”).
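One possible structure for the scoring sheet, sketched in Python on the assumption that you keep it as a simple table (spreadsheet or CSV); the column names and example rows are illustrative, not a required format.

```python
# One possible scoring-sheet structure for the Day 1 baseline.
# Column names are suggestions only; keep whatever your reviewers find natural.
from dataclasses import dataclass

@dataclass
class ScoreRow:
    item_id: str
    task_success: bool        # pass/fail against your acceptance criteria
    factuality_edits: int     # 0 = none, 1 = minor, 2 = major/critical
    minutes_to_first_draft: float
    est_cost_per_task: float  # all-in estimate, GBP

baseline = [
    ScoreRow("email-001", True, 0, 9.0, 0.00),   # human-only baseline: cost is reviewer time, not tokens
    ScoreRow("email-002", False, 2, 14.0, 0.00),
    ScoreRow("email-003", True, 1, 7.5, 0.00),
]

def summarise(rows: list[ScoreRow]) -> dict:
    """Roll a scoring sheet up into the KPI figures for the evidence pack."""
    n = len(rows)
    return {
        "task_success_rate": sum(r.task_success for r in rows) / n,
        "critical_errors": sum(1 for r in rows if r.factuality_edits == 2),
        "avg_minutes_to_draft": sum(r.minutes_to_first_draft for r in rows) / n,
        "avg_cost_per_task": sum(r.est_cost_per_task for r in rows) / n,
    }

print(summarise(baseline))
```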
Day 2 — Model and vendor shortlist
- Test at least three models: a cost‑efficient “small”, a balanced “medium”, and a “flagship” for tough cases. Smaller models can often deliver acceptable quality at a fraction of the cost, so include one in your mix. ([reuters.com](https://www.reuters.com/business/media-telecom/us-tech-startup-anthropic-unveils-cheaper-model-widen-ais-appeal-2025-10-15/))
- Run the exact same tasks and prompts across models; no tweaking per model. This is crucial for fair comparison and mirrors holistic evaluation practice. ([crfm.stanford.edu](https://crfm.stanford.edu/2022/11/17/helm.html))
- Record latency and token usage (or message quotas) to inform cost per task. Note any vendor limits that might hit your volumes; a minimal harness sketch follows this list.
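A minimal harness can be a single loop that sends the identical prompt to every shortlisted model and logs latency and tokens to a CSV. In the sketch below, `call_model` is a hypothetical placeholder for whichever vendor SDKs you are trialling; it is not a real provider API.

```python
# Minimal test harness: identical tasks and prompts across every model,
# with latency and token usage logged for the cost-per-task calculation.
import csv
import time

MODELS = ["small-model", "medium-model", "flagship-model"]  # your shortlist
PROMPT_TEMPLATE = "Draft a reply to the customer email below.\n\n{email}"

def call_model(model: str, prompt: str) -> dict:
    """Placeholder: wrap each vendor's SDK here and return text plus token counts,
    e.g. {"text": ..., "input_tokens": ..., "output_tokens": ...}."""
    raise NotImplementedError

def run_harness(test_items: list[dict], out_path: str = "harness_log.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["item_id", "model", "latency_s", "input_tokens", "output_tokens", "output_text"])
        for item in test_items:
            prompt = PROMPT_TEMPLATE.format(email=item["email"])  # same prompt for every model
            for model in MODELS:
                start = time.perf_counter()
                result = call_model(model, prompt)
                latency = time.perf_counter() - start
                writer.writerow([item["item_id"], model, f"{latency:.2f}",
                                 result["input_tokens"], result["output_tokens"], result["text"]])
```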
Day 3 — Safety, security and documentation
- Execute your red‑team prompts and log the rate of blocked or safe responses. Require vendors to explain how they mitigate prompt injection, data poisoning and other AI‑specific risks, aligning with the UK AI Cyber Security Code of Practice and the joint NCSC/CISA secure AI guidance. ([gov.uk](https://www.gov.uk/government/publications/ai-cyber-security-code-of-practice))
- Capture a one‑page “model card for your use”: inputs, typical outputs, failure modes, and known limitations. If you’re in the public sector, draft an ATRS entry now; it’s mandatory in central government once you move into pilot or production. ([gov.uk](https://www.gov.uk/guidance/publish-a-record-of-algorithmic-tools-using-the-atrs))
- Note how each vendor supports logs, admin controls, data residency and retention.
Day 4 — Cost and performance sweep
- Stress test the top two models with your longest, messiest inputs. Capture p95 latency and cost per task.
- Ask vendors about cost‑saving levers: batching, caching, and smaller models for easy cases. For example, some providers offer up to 50% discounts for batch processing — this can halve unit costs for backlogs and nightly jobs. ([docs.anthropic.com](https://docs.anthropic.com/en/docs/about-claude/pricing))
- Decide where you can trade speed for price (e.g., overnight summarisation vs live chat).
Day 5 — Decision, guardrails and handover
- Create a 1‑page decision: the chosen model/vendor, expected quality, cost cap per task, and the conditions under which you switch model or throttle usage.
- Package the evidence: KPI charts, scoring sheet, sample outputs, and your red‑team results. This becomes your internal assurance record and procurement pack.
- Write a light‑touch runbook for go‑live: who reviews escalations, how to pause the system, and how to rotate prompts safely.
Minimal tooling for the sprint
You don’t need a data science stack to do this well. A shared spreadsheet for scoring, a folder of anonymised test items, and a consistent prompt template are enough for week one. If you later industrialise, you can align your scoring and documentation with established assurance and risk resources (e.g., NIST GenAI Profile and ARIA’s system‑in‑context approach). ([nist.gov](https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence))
The cost control kit: levers that actually move the needle
| Lever | What to try | Typical impact | Notes |
|---|---|---|---|
| Right‑sizing the model | Use a smaller model for routine tasks; reserve flagship models for edge cases. | 20–70% unit cost reduction | The industry trend is toward efficient “small” models for high‑volume workloads. ([reuters.com](https://www.reuters.com/business/media-telecom/us-tech-startup-anthropic-unveils-cheaper-model-widen-ais-appeal-2025-10-15/)) |
| Batching/off‑peak | Process queues in batches overnight. | Up to 50% discount | Some APIs offer explicit batch discounts. Check your vendor’s documentation. ([docs.anthropic.com](https://docs.anthropic.com/en/docs/about-claude/pricing)) |
| Prompt hygiene | Shorten system prompts; avoid oversized context windows. | Cuts tokens and latency | Track prompt length per task; long prompts silently inflate the bill. |
| Caching/partials | Reuse stable instructions or reference sections where supported. | 10–40% saving on repeat tasks | Ask vendors how they price caching and whether it applies to your workflow. ([docs.anthropic.com](https://docs.anthropic.com/en/docs/about-claude/pricing)) |
| Triage policy | Handle simple emails with “small”, escalate tricky ones. | Quality up, cost flat | Define clear escalation triggers to avoid false confidence; a routing sketch follows the table. |
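To show how the right‑sizing, batching and triage levers combine into a unit‑cost estimate, here is a rough sketch; the routing rules, per‑task prices and 50% discount figure are assumptions for illustration, to be replaced with your own acceptance criteria and your vendor’s published rates.

```python
# Rough sketch of a triage policy plus batch pricing.
# All prices, thresholds and the 50% batch discount are illustrative assumptions;
# check your vendor's current pricing page before relying on them.

COST_PER_TASK = {"small": 0.01, "flagship": 0.08}  # assumed all-in GBP per task
BATCH_DISCOUNT = 0.5  # some providers discount batch/off-peak processing

def route(task: dict) -> str:
    """Triage: routine emails go to the small model, tricky ones escalate."""
    tricky = task["word_count"] > 400 or task["topic"] in {"complaint", "legal", "refund"}
    return "flagship" if tricky else "small"

def unit_cost(task: dict) -> float:
    cost = COST_PER_TASK[route(task)]
    if task.get("overnight"):          # e.g. backlog summarisation, not live chat
        cost *= (1 - BATCH_DISCOUNT)
    return cost

queue = [
    {"word_count": 120, "topic": "order status", "overnight": False},
    {"word_count": 650, "topic": "complaint", "overnight": False},
    {"word_count": 300, "topic": "manual lookup", "overnight": True},
]

print(f"Estimated spend for this queue: £{sum(unit_cost(t) for t in queue):.3f}")
```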
For reference pricing when you finalise contracts, always use the provider’s current pricing page and keep a dated screenshot on file. That prevents surprises later.
Procurement and legal: the 12 questions that surface risk and cost
- What is our unit of billing and how is it measured? Ask for a worked example of your use case.
- Are batching or caching discounts available, and how do we access them? ([docs.anthropic.com](https://docs.anthropic.com/en/docs/about-claude/pricing))
- What rate‑limit and throughput guarantees apply at our expected volumes?
- Which data residency regions are available and how is data retained or deleted?
- How are unsafe prompts handled? Ask to see the default policy and admin controls. ([cisa.gov](https://www.cisa.gov/news-events/alerts/2023/11/26/cisa-and-uk-ncsc-unveil-joint-guidelines-secure-ai-system-development))
- What logs can we access for auditing (timestamps, inputs/outputs, moderation flags)?
- Which fallback models or versions are supported, and how are changes notified?
- What is the security posture for AI‑specific threats (prompt injection, data poisoning, model extraction)? Reference the UK AI Cyber Security Code of Practice. ([gov.uk](https://www.gov.uk/government/publications/ai-cyber-security-code-of-practice))
- What SLA applies for latency and uptime? What credits trigger at breach?
- How do we cap spend per day/week and per user? Can we block high‑cost calls?
- If in the public sector, how will the supplier support our ATRS record and ongoing transparency needs? ([gov.uk](https://www.gov.uk/guidance/publish-a-record-of-algorithmic-tools-using-the-atrs))
- Is there a termination‑for‑performance clause tied to the KPIs we’ve defined?
If you’re comparing full vendors rather than models alone, the Crown Commercial Service AI Dynamic Purchasing System (AI DPS) can be a practical route to market for public bodies; private sector buyers can still borrow its evaluation discipline. ([crowncommercial.gov.uk](https://www.crowncommercial.gov.uk/agreements/rm6200))
Decision tree: buy, build, or blend?
- Buy if an off‑the‑shelf tool hits ≥80% task success at your target cost per task and exposes admin controls and logs.
- Blend if you need your data for retrieval/search but the vendor provides a stable interface and strong controls.
- Build only if your acceptance criteria are unique and the economics still work after accounting for maintenance and assurance. A simple decision rule is sketched after this list.
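Written down as code, those thresholds become explicit and auditable. The function below is a sketch of one way to encode the criteria above; the boundary values mirror the bullets and should be adjusted to your own KPIs rather than treated as a standard.

```python
# Buy / blend / build as an explicit, auditable rule.
# The 80% threshold and inputs mirror the criteria above; adjust to your own KPIs.

def recommend(task_success_rate: float, cost_per_task: float, target_cost: float,
              has_controls_and_logs: bool, needs_own_data_retrieval: bool,
              unique_acceptance_criteria: bool) -> str:
    off_the_shelf_ok = (task_success_rate >= 0.80
                        and cost_per_task <= target_cost
                        and has_controls_and_logs)
    if off_the_shelf_ok and not needs_own_data_retrieval:
        return "buy"
    if off_the_shelf_ok and needs_own_data_retrieval:
        return "blend"
    if unique_acceptance_criteria:
        return "build (only if the economics still work after maintenance and assurance)"
    return "re-run the sprint with a different shortlist"

print(recommend(0.84, 0.06, 0.10, True, True, False))  # -> "blend"
```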
Whichever path you choose, document the decision, the metrics, and the risks. This aligns with good practice in AI assurance and helps future audits. ([gov.uk](https://www.gov.uk/data-ethics-guidance/introduction-to-ai-assurance))
Your first 30 days after the sprint: KPIs and rituals
- Weekly quality review: 20 random samples; log edit counts and any critical errors.
- Safety spot‑checks: replay your red‑team prompts after model updates; verify controls still hold. ([cisa.gov](https://www.cisa.gov/news-events/alerts/2023/11/26/cisa-and-uk-ncsc-unveil-joint-guidelines-secure-ai-system-development))
- Spend dashboard: show top users, top tasks, and the 10 costliest cases; enforce caps (a sketch follows this list).
- Prompt hygiene: keep a single prompt library with versioning and expiry dates.
- Change control: record any model or configuration change and its effect on KPIs.
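The weekly review sample and the spend dashboard can start life as a short script over your usage log. The sketch below assumes a CSV export with columns such as `item_id`, `output_text` and `cost_gbp`; those names are placeholders for whatever your vendor or gateway actually provides.

```python
# Weekly ritual sketch: draw the 20-sample quality review and list the
# 10 costliest cases from a usage log. Column names are assumptions;
# adapt them to whatever export your vendor or gateway provides.
import csv
import random

def weekly_review_sample(log_path: str, k: int = 20, seed: int = 42) -> list[dict]:
    """Pick k random outputs for the human quality review."""
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    random.seed(seed)  # fixed seed so the sample is reproducible for the audit trail
    return random.sample(rows, min(k, len(rows)))

def costliest_cases(log_path: str, top_n: int = 10) -> list[dict]:
    """Surface the highest-cost tasks for the spend dashboard and cap review."""
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return sorted(rows, key=lambda r: float(r["cost_gbp"]), reverse=True)[:top_n]

# Example usage (assumes a weekly export named usage_log.csv):
# for row in weekly_review_sample("usage_log.csv"):
#     print(row["item_id"], row["output_text"][:80])
```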
In parallel, consider a small retrieval upgrade to boost factual accuracy on your documents; our 6‑week RAG blueprint shows how to do this without spiralling costs.
Frequently asked questions from boards and DPOs
How do we know we’re testing “the system”, not just the model?
Include the end‑to‑end workflow in your tests: inputs, any retrieval, the model, and your review stage. Testing in context (people + process + tech) is increasingly emphasised in independent evaluation work. ([nist.gov](https://www.nist.gov/news-events/news/2024/05/nist-launches-aria-new-program-advance-sociotechnical-testing-and))
Do we need an external audit?
For most SMEs, start with this internal sprint and a clear paper trail. Consider external assurance when the system affects high‑stakes decisions or sensitive data, drawing on the UK’s developing AI assurance market and techniques. ([gov.uk](https://www.gov.uk/guidance/portfolio-of-ai-assurance-techniques))
Where do we find credible benchmarks for model selection?
Use your own tasks first, then consult transparent, multi‑metric resources such as HELM to understand general strengths and trade‑offs across models. ([crfm.stanford.edu](https://crfm.stanford.edu/2022/11/17/helm.html))
Putting it all together (and avoiding common traps)
- Don’t overfit to your gold set. Keep back a “blind” set for final checks.
- Beware test leakage. Don’t let the same person both write the gold answer and judge the model’s output.
- Expect model drift. Re‑test monthly; treat model upgrades as changes that need sign‑off.
- Write it down. One page that states the decision, KPIs, cost cap and safety guardrails is worth more than a thousand demo slides.
When you’re ready to go beyond the sprint, our guides on moving from pilot to production and a pragmatic AI policy pack will help you scale responsibly.
Need a hand running your sprint?
We can facilitate the week, bring a neutral shortlist of models and vendors, and leave you with a board‑ready evidence pack and a costed production plan. If retrieval is in scope, we’ll align it with the practical AI stack we recommend for UK organisations.