A neutral meeting room table with three laptops and scorecards laid out for an AI vendor bake‑off
Vendor selection & procurement

The Two‑Week AI Vendor Bake‑Off: a UK SME Procurement Sprint that Levels the Field and Controls Risk

When buyers say “we saw four demos and still can’t compare them”, it’s rarely about tech. It’s about process. This article gives UK SMEs and charities a practical, repeatable two‑week bake‑off that forces apples‑to‑apples comparison, keeps costs visible, and bakes in basic safety and exit rights. It borrows from public sector guidance, international good practice, and hard‑won lessons from pilots that dragged on for months without a decision.

We’ll cover a 10‑day plan, a simple scorecard with measurable KPIs, the artefacts you need (scripts, data room, price sheet), and the minimum commercial clauses to avoid lock‑in later. If you want deeper dives on adjacent topics, see our guides on avoiding AI lock‑in, building a vendor due‑diligence pack, using feature flags for safe rollouts, and our go‑live gate.

Why a bake‑off now?

  • Market is noisy: Vendors can all “demo” well. A standard script exposes differences in accuracy, cost, and fit.
  • Budgets are tight: Token, hosting, and support fees add up. Comparing unit costs on the same tasks prevents surprises.
  • Risk is real: Basic security hygiene and safe deployment matter. Cross‑check supplier claims against independent sources such as the NCSC secure AI development guidance, OWASP LLM Top 10, and the NIST AI Risk Management Framework.
  • Clarity beats breadth: One well‑designed scenario tells you more than 20 vague features. The UK’s AI procurement guidance encourages challenge‑focused requirements, not shopping lists.

The two‑week plan (10 working days)

Week 1 — Set the rules and the data

  1. Define the job‑to‑be‑done (JTBD): Write a one‑paragraph problem statement and a success metric. Example: “Triage inbound emails into 8 categories and draft replies within 60 seconds; target ≥85% correct category.”
  2. Script the demo: A single, shared walkthrough of 6–8 tasks representative of daily work. Include at least two “edge” cases that commonly fail.
  3. Create a small data room: 30–50 de‑identified examples for evaluation, plus 10 red‑team prompts to test safety and controls. Document what “good” looks like.
  4. Price sheet: Ask all vendors to complete the same one‑pager: licence fee, model usage (per 1,000 tokens or per action), storage, integration, support, and overage. Capture minimum term and any auto‑renew.
  5. Lightweight assurance pack: A 20‑question security and data handling checklist aligned to NCSC supply‑chain principles and common threats in the OWASP LLM Top 10. Ask for yes/no and a brief pointer to policy or evidence.
  6. Book the slots: 90 minutes per vendor next week. Everyone signs the same NDA. Vendors get the script and criteria in advance.

Week 2 — Run, score, decide

  1. Scripted demo (60 mins) + Q&A (30 mins): Your facilitator keeps time. Vendors run the exact tasks in order. No mystery decks, no detours.
  2. Hands‑off test: If possible, ask each vendor to run your 30–50 examples unattended and return outputs and logs within 24 hours for scoring.
  3. Score and compare: Use the scorecard below. Record numbers as they are — don’t debate feelings.
  4. Commercial tidy‑up: Clarify pricing tiers, storage locations, and data use. UK public bodies should note transparency obligations introduced by the Procurement Act 2023 and related notices; while SMEs aren’t directly in scope, this sets expectations when you sell into government.
  5. Decision and next step: Pick a preferred supplier for a 6–8 week pilot with a capped budget and pre‑agreed success metrics. Document a fallback supplier.

The vendor scorecard UK SMEs can actually use

Category Weight How to score (KPIs)
Problem fit & UX 20% Task completion rate on scripted demo; clicks to complete; time‑to‑first‑use for a new staff member.
Quality & safety 20% Accuracy vs ground truth; hallucination rate; red‑team pass rate against 10 prompts; refusal behaviour on unsafe requests.
Cost & scalability 20% Blended cost per task at 100/1,000/10,000 tasks; surge pricing rules; rate limits; ability to switch model tier.
Security & data handling 15% Data residency and encryption; data retention/deletion; training‑on‑your‑data controls; supplier’s secure‑by‑design evidence against NCSC guidance.
Integration & change 15% Connectors to your systems; admin controls and audit logs; feature flags; ability to A/B test.
Portability & vendor health 10% Export capability; API openness; model switch plan; financial stability and customer references.

If you need inspiration for the quality and safety checks, NIST’s cross‑sector AI Risk Management Framework and its Generative AI profile provide practical prompts for evaluation, while the UK AI procurement guidance advises engaging the market early and focusing requirements on outcomes, not specific tools.

The artefacts you’ll need (templates you can draft in an hour)

1) Scripted demo

  • Eight tasks in a fixed order, including two “known hard” cases.
  • For each task: input, expected output, acceptance criteria, and the KPI you’ll measure (e.g. time, accuracy, cost).

2) Evaluation set

  • 30–50 de‑identified examples representative of your real workload.
  • Include 10 red‑team prompts covering injection attempts, data exfiltration, and policy bypass, inspired by OWASP’s LLM Top 10.

3) Price sheet (one page)

  • Licence, usage, storage, support, integration, overage; minimum term and renewal; volume breaks.
  • Ask for a worked example: “What is our monthly bill at 5k, 20k, 100k tasks?”

4) Security & data handling checklist

  • Data residency, encryption, retention, deletion, access control, and logs.
  • Whether your data is used to train vendor models; opt‑out position and contract wording.
  • Evidence of secure design and deployment practices mapped to NCSC guidance and general good practice such as ISO/IEC 42001.

5) Decision record

  • A one‑page summary capturing scores, assumptions, pilot scope, budget cap, and exit conditions.

What to ask every vendor (12 essential questions)

  1. Data use: Will you use our prompts and outputs to train any models? If not, where is that prohibited in the contract?
  2. Data location: Where is data stored and processed, and can we choose UK or EU data centres?
  3. Export: Can we export all data and configurations in a usable format if we leave?
  4. Model choice: Which models and versions are supported today? How quickly can we switch tiers for cost/performance?
  5. Controls: Do you provide admin guardrails, audit logs, rate limits, and policy enforcement?
  6. Reliability: What uptime do you commit to, and what service credits apply for breaches?
  7. Support: What’s the response time for P1 incidents, and is UK business‑hours support included?
  8. Security evidence: Provide a short pack covering secure design and deployment practices aligned to NCSC AI security guidance.
  9. Safety: How do you mitigate the top LLM risks (prompt injection, data leakage, tool overreach)? Map responses to OWASP LLM Top 10 items.
  10. Assurance: What external frameworks do you reference (e.g., NIST AI RMF, ISO/IEC 42001) and how?
  11. Commercials: Provide a worked price at 5k, 20k and 100k tasks/month including overage, storage, and support.
  12. Pilot plan: If selected, what can you deliver in 6–8 weeks with clear acceptance criteria?

Measuring quality without a data science team

Keep it simple and repeatable. You don’t need fancy benchmarks to make a good decision.

  • Accuracy: For each of your 30–50 examples, mark pass/fail against an agreed definition of “good”. Aim for ≥85% to consider production, but adjust for risk level.
  • Safety: Count the number of red‑team prompts that the system handles correctly (refuses or sanitises). Investigate any unsafe completions.
  • Latency: Measure time‑to‑first‑token and time‑to‑complete for the same task over three runs.
  • Cost per task: Use the price sheet to compute a blended cost per task at different volumes. If the vendor bills per token, request a worked example.
  • Human effort saved: For two teams, measure minutes saved per task during the demo compared to your current process.

These metrics align well with the “measure and manage risk” principle in the NIST AI RMF and with UK guidance encouraging clarity on outcomes in procurement. Public bodies can also increase transparency by publishing a short record of the use case using the UK’s Algorithmic Transparency Recording Standard (ATRS), now widely promoted across government.

Risks to price and delivery (and how to defuse them early)

Risk What it looks like Mitigation you can buy
Hidden usage costs Cheap licence, expensive tokens or overage; costs spike at month 3. Fixed price for pilot; tiered unit costs in contract; right to switch model tier without penalty.
Data lock‑in Proprietary formats; export requires professional services. Contractual right to export all prompts, outputs, and configuration in common formats within 30 days of request.
Weak safety controls System accepts prompt injections or leaks sensitive info. Demonstrated controls against OWASP LLM Top 10; sandboxed integrations; content filters.
Unclear data use Vendor reserves right to train on your data. Explicit “no training on customer data” clause; deletion timelines; audit rights.
Supplier fragility Early‑stage vendor with limited runway. Step‑in and exit assistance; escrow or portability plan; month‑to‑month after pilot.
Public sector routes (if you sell to government) Transparency and notice requirements under new rules. Align with PPN 017 on AI transparency in procurement and the UK AI procurement guidance.

Contract guardrails you should insist on

  • Pilot cap and success criteria: A fixed price, a time limit (6–8 weeks), and explicit acceptance criteria tied to your KPIs.
  • Data rights: Your data is your data. No training or enrichment without written consent; define retention and deletion timelines.
  • Portability: Export of all content and configurations in common formats; reasonable assistance during exit at pre‑agreed day rates.
  • Security obligations: Encryption, access control, logging, and incident notification aligned to NCSC supply‑chain principles.
  • Service levels and credits: Uptime targets, response times, and credits meaningful enough to matter.
  • Change control: Right to approve material model/version changes used for your workload.
  • Fair renewal: No auto‑renewal without notice; price increase caps.

If you operate in or sell to the public sector, be aware of increasing transparency expectations, including ATRS records and notices under the Procurement Act 2023. Even if you’re not mandated, aligning with these norms signals maturity.

Who needs to be in the room?

  • Business owner: Owns the problem and the KPIs.
  • Operations lead: Knows the current process and exception paths.
  • Data/IT representative: Checks integration and security basics.
  • Legal/Commercial (light‑touch): Sanity‑checks terms and data rights.
  • Facilitator/timekeeper: Keeps demos on script and stops “magic tricks”.

A ready‑to‑use agenda for each vendor (90 minutes)

  1. Introductions and ground rules (5 mins)
  2. Scripted demo — tasks 1–8 (45 mins)
  3. Live quality and safety checks (10 mins)
  4. Costs walkthrough using your price sheet (10 mins)
  5. Q&A focused on integration, controls, and support (15 mins)
  6. Wrap‑up — confirm what they will submit within 24 hours (outputs, logs, filled‑in price sheet) (5 mins)

Putting it together: your decision in one page

After the final session, complete a one‑pager that lists 1) the ranked scorecard, 2) the recommended pilot scope and budget cap, 3) the contract guardrails you’ll require, and 4) the fallback option. Save it in your procurement folder with the evaluation set, scripts, and the suppliers’ evidence packs. If challenged later, you can show a transparent, outcome‑driven process that aligns with government‑endorsed buying principles and recognised assurance frameworks.

Where to learn more (trusted, practical sources)