Many UK SMEs and charities adopted AI in 2024–2025 through pilots and point solutions. By December 2025, the question has shifted from “Can we use AI?” to “Can we scale this without our cloud bill spiralling?” This practical guide shows directors, operations leads and trustees how to establish AI unit economics you can explain to your board, then reduce spend within 90 days—without harming service quality.
We’ll focus on the essentials: measuring the right things, selecting cheaper-but-good-enough options, and removing waste (unnecessary tokens, retries, duplicate calls). Where we reference external guidance, we link to vendor-neutral sources so you can sanity-check your approach, such as the FinOps Foundation and the cost optimisation pillars published by AWS, Microsoft Azure and Google Cloud.
The simple unit economics formula your board will remember
For each use case, define:
- Outcome: the thing you value (e.g. a first-contact resolution in customer service, or a compliant, publishable blog draft).
- Cost per outcome: all AI-related costs divided by the number of successful outcomes.
In plain English:
Cost per successful outcome = (AI usage costs + orchestration costs + storage/search costs + observability costs) ÷ number of successful outcomes.
You’ll keep the numerator honest by tracking token usage, vector store queries, embedding generation, function calls and any third-party API fees. The denominator relies on your acceptance criteria and QA process—see our piece on AI content quality tests for practical acceptance tests you can run weekly.
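To make the formula concrete, here is a minimal sketch in Python. Every figure is an illustrative placeholder; substitute the numbers from your own usage and billing exports.

```python
# Illustrative only: substitute figures from your own usage and billing exports.
ai_usage_cost = 420.00        # model/token charges for the month (£)
orchestration_cost = 35.00    # workflow and function-calling platform (£)
storage_search_cost = 60.00   # vector store, embeddings and search queries (£)
observability_cost = 25.00    # logging, tracing and dashboards (£)

successful_outcomes = 1240    # outcomes that passed your acceptance criteria

total_cost = ai_usage_cost + orchestration_cost + storage_search_cost + observability_cost
cost_per_outcome = total_cost / successful_outcomes
print(f"Cost per successful outcome: £{cost_per_outcome:.2f}")   # ≈ £0.44
```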
Your 30–60–90 day cost playbook
Days 0–30: Instrumentation and baselines
- Choose three KPIs per use case: cost per successful outcome (e.g. per resolved ticket), acceptance rate at first pass, and re-ask rate (how often a user must ask again).
- Set budgets and alerts in your cloud platform (Azure Cost Management, AWS Budgets, or GCP Billing) and in your AI platform’s usage dashboard. Use daily e-mail alerts for spend anomalies above a chosen threshold.
- Record model mix and volumes. Identify where you can switch 30–60% of requests to a cheaper model without harming quality (routing). Keep a one-pager that lists each model’s purpose and a link to its pricing page for transparency.
- Implement prompt hygiene: remove boilerplate system instructions, cut redundant citations, cap output length and avoid “just in case” data retrieval. Measure before/after token use (see the logging sketch after this list).
- Turn on caching where supported (at the prompt, completion or retrieval level). Cache FAQs, boilerplate legal summaries and policy explanations that change rarely.
- Baseline vector costs: log how many chunks are retrieved per answer and how often you re-embed documents. Prune and compress older versions where appropriate. For content-heavy teams, our guide on chunking and metadata will help you stop over-fetching.
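For the measurement steps above, a minimal request-logging sketch looks like this. It assumes your AI platform returns token counts with each response (most do, though field names vary); the `response` shape shown here is an assumption, not any particular vendor's API.

```python
import csv
from datetime import datetime, timezone

# Minimal request-logging sketch: append one row per request so you can
# baseline token use per use case and compare before/after any change.
# The `response` shape and field names below are assumptions; adapt them
# to whatever your AI platform's client actually returns.

LOG_FILE = "ai_request_log.csv"

def log_request(use_case: str, model: str, response: dict) -> None:
    usage = response.get("usage", {})
    with open(LOG_FILE, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            use_case,
            model,
            usage.get("input_tokens", 0),
            usage.get("output_tokens", 0),
        ])
```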
Days 31–60: Shift to cheaper paths by default
- Routing and fallbacks: send simple requests to smaller, cheaper models; use larger models only when confidence falls below a threshold or the task is complex. Keep version pinning and canaries to avoid surprises when vendors update models—see our piece on shipping AI changes safely. A minimal routing sketch follows this list.
- Batch non-urgent jobs: summaries of weekly board papers, policy refreshes, backfile embedding. Night-time batches reduce contention and make spend predictable.
- Guardrails to reduce waste: stop prompts that would exceed your output caps; run a quick retrieval check before long generations to ensure there’s useful context; block patterns that lead to hallucinations and rework.
- Right-size context windows: don’t always use the largest possible context. Retrieve fewer, better chunks with strict metadata filters.
- Vendor hygiene: switch off features you don’t use (e.g. unused model endpoints, idle vector indices) and remove stale API keys.
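To illustrate routing with a fallback, here is a minimal sketch. The model names, the complexity heuristic and the confidence signal are assumptions; `call_model` stands in for your own platform client, and your acceptance tests should set the thresholds.

```python
# Minimal routing sketch: default to a cheaper model, escalate to a larger one
# only when the task looks complex or the first answer's confidence is low.
# Model names, the heuristic and `call_model` are hypothetical placeholders.

SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-capable-model"
CONFIDENCE_THRESHOLD = 0.7

def looks_complex(request: str) -> bool:
    # Crude heuristic: long or multi-part requests go straight to the larger
    # model. Replace with your own classifier as you gather data.
    return len(request) > 1500 or request.count("?") > 2

def answer(request: str, call_model) -> str:
    if looks_complex(request):
        return call_model(LARGE_MODEL, request)["text"]

    first = call_model(SMALL_MODEL, request)
    if first.get("confidence", 1.0) >= CONFIDENCE_THRESHOLD:
        return first["text"]

    # Escalation: worth logging so you can track the escalation-rate KPI.
    return call_model(LARGE_MODEL, request)["text"]
```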
Days 61–90: Contract, architecture and culture
- Negotiate discounts: if you have predictable monthly volumes, ask about committed-use discounts, model-family bundles or reserved capacity. If you’re comparing vendors, run a structured bake‑off; we explain how in The two‑week AI vendor bake‑off.
- Separate workloads into “real-time” and “asynchronous” paths. Real-time gets low-latency models; async gets cheaper, throughput-first options.
- Create a cost review rhythm: 30-minute weekly triage (spend anomalies, quick wins) and a monthly steering pack with KPIs, savings delivered, and quality impact.
- Train your teams: create a 60-minute session on practical prompt economy and retrieval hygiene. Make “tokens are company money” part of induction.
The seven levers that actually move AI cost
| Lever | Typical saving band | Quality risk | Where it works best |
|---|---|---|---|
| Prompt hygiene (shorter, clearer system and user prompts; tighter output caps) | 10–30% | Low | Service replies, policy summaries, routine updates |
| Retrieval right-sizing (fewer, better chunks; better filters) | 15–40% | Low | RAG assistants, intranet search |
| Model routing (smaller models by default; escalate on confidence) | 20–60% | Medium | Email triage, FAQ bots, form filling |
| Caching (prompt, completion or retrieval) | 10–50% | Low | FAQs, policies, templated outputs |
| Batching and scheduling non-urgent work | 10–25% | Low | Embedding, classification, periodic summaries |
| Guardrails (block runaway outputs, prevent rework) | 5–15% | Low | Any production chatbot or generator |
| Vendor optimisation (discounts, region choice, storage tiering) | 10–35% | Medium | Steady-state workloads, multi-team usage |
Your actual mix will vary; the important thing is to test one lever at a time and keep a short “before/after” note for your steering group.
KPIs to track every week
- Cost per successful outcome (per use case).
- Acceptance rate at first pass (how often the result is good enough without edits).
- Re-ask rate (how often users ask follow-ups because the first answer missed the mark).
- Token efficiency = output tokens ÷ input tokens (too high may indicate waffle; too low may mean stunted answers).
- Cache hit rate for prompts/retrieval.
- Escalation rate from small to large models.
- Guardrail save rate (how often guardrails prevented wasteful runs).
Publish these KPIs in a one-page dashboard for directors. If any KPI improves at the expense of acceptance rate, roll back and reassess.
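If you log requests as in the earlier sketch, the weekly roll-up is a short script. This minimal sketch assumes a CSV export with the columns shown; the `accepted` and `cache_hit` flags would come from your QA process, and the column names are illustrative rather than a standard.

```python
import csv
from collections import defaultdict

# Minimal weekly KPI roll-up over a request log. The `accepted` and
# `cache_hit` flags are assumed to be added by your QA process; column
# names are illustrative rather than a standard.

def weekly_kpis(rows) -> dict:
    t = defaultdict(float)
    for r in rows:
        t["requests"] += 1
        t["input_tokens"] += float(r["input_tokens"])
        t["output_tokens"] += float(r["output_tokens"])
        t["accepted"] += 1 if r["accepted"] == "yes" else 0
        t["cache_hits"] += 1 if r["cache_hit"] == "yes" else 0
        t["cost"] += float(r["cost"])

    n = max(t["requests"], 1)
    return {
        "cost_per_successful_outcome": t["cost"] / max(t["accepted"], 1),
        "acceptance_rate": t["accepted"] / n,
        "token_efficiency": t["output_tokens"] / max(t["input_tokens"], 1),
        "cache_hit_rate": t["cache_hits"] / n,
    }

# with open("weekly_requests.csv") as f:
#     print(weekly_kpis(csv.DictReader(f)))
```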
Procurement questions that keep vendors honest
Ask these in writing during selection or renewal. If the answers are unclear, consider running a bake‑off.
- Pricing clarity: Do you bill by input tokens, output tokens, or both? Are tool/function calls billed? What happens with streamed responses?
- Discounts: Do you offer committed-use discounts, reserved capacity or model-family bundles for predictable workloads?
- Caching: Do you support server-side caching of prompts or completions? How are cache hits billed?
- Rate limits and throughput: What are the per‑minute and per‑day limits? Can you increase them contractually?
- Observability: Is there an API or export for per‑request token usage and latency so we can run cost KPIs?
- Vector store costs: How are storage, write (embedding) and read (search) operations billed? Are there cheaper archive tiers for old content?
- Data residency and egress: Which regions are available? Are there egress charges when our data or embeddings move between services?
- Change control: How do you notify us of model updates and deprecations? Can we pin versions and run canaries first?
To structure your comparison, reuse the approach in The two‑week AI vendor bake‑off.
Three worked examples
1) Service desk triage (5 agents, 2,000 tickets/month)
- Outcome: first-contact resolution (FCR).
- Baseline: £1.20 per ticket in AI costs; 62% FCR.
- Interventions: route simple tickets to a smaller model; cap outputs at 180 words; cache policy answers; retrieve no more than three chunks unless confidence is low.
- Result to aim for: £0.65–£0.85 per ticket; ≥60% FCR maintained or improved (the monthly arithmetic is sketched below).
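In monthly terms, and purely as illustrative arithmetic on the figures above:

```python
# Illustrative monthly arithmetic for the service desk example above.
tickets_per_month = 2000
baseline_cost_per_ticket = 1.20                              # £ per ticket before changes

baseline_monthly = tickets_per_month * baseline_cost_per_ticket   # £2,400
target_monthly_low = tickets_per_month * 0.65                     # £1,300
target_monthly_high = tickets_per_month * 0.85                    # £1,700

# Hitting the target band would save roughly £700–£1,100 per month.
monthly_saving_range = (baseline_monthly - target_monthly_high,
                        baseline_monthly - target_monthly_low)
```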
2) Board paper summarisation (weekly, 120 pages)
- Outcome: a 2‑page brief per meeting.
- Baseline: inconsistent costs due to uploading entire packs every week.
- Interventions: embed once and update only the deltas (sketched below); right-size chunks; schedule summarisation overnight; enforce strict length and style.
- Result to aim for: 30–50% cost reduction with more predictable weekly spend.
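One way to implement “embed once and update only the deltas” is to fingerprint each chunk and re-embed only what has changed. A minimal sketch, where `embed_and_store` is a hypothetical wrapper around your vector store's write API:

```python
import hashlib
import json
from pathlib import Path

# Minimal delta-embedding sketch: fingerprint each chunk and re-embed only
# those that are new or changed. `embed_and_store` is a hypothetical wrapper
# around your vector store's write API.

MANIFEST = Path("embedded_chunks.json")

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_deltas(chunks: dict[str, str], embed_and_store) -> int:
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    updated = 0
    for chunk_id, text in chunks.items():
        fp = fingerprint(text)
        if seen.get(chunk_id) != fp:        # new or changed content only
            embed_and_store(chunk_id, text)
            seen[chunk_id] = fp
            updated += 1
    MANIFEST.write_text(json.dumps(seen))
    return updated
```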
3) Marketing content engine (8 long-form pieces/month)
- Outcome: publishable drafts aligned to tone and facts.
- Baseline: excessive tokens due to verbose “brand voice” prompts and full‑context retrieval.
- Interventions: turn the brand voice into a short style card; clip retrieval to the top 2–3 chunks; cache boilerplate sections.
- Result to aim for: 25–45% reduction while keeping acceptance rate ≥80%. For repurposing tactics, see Zero‑waste AI content engine.
Decision guide: when to pay for the “bigger” model
Use your acceptance criteria to drive the decision. Pay for larger models only when all of the following are true:
- The task needs deeper reasoning or long‑range context you can’t reduce sensibly.
- Your smaller model fails acceptance tests consistently even after prompt and retrieval tuning.
- The value of a better answer outweighs the extra cost (e.g. legal or grant‑critical wording).
Otherwise, route to a smaller model and escalate on low confidence. Keep an eye on your rollback strategy when models change; our guidance on version pinning and canaries shows how to do this safely.
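One way to make the “value outweighs the extra cost” test explicit is to compare the expected gain in accepted outcomes with the extra spend per request. A minimal sketch with illustrative figures:

```python
# Minimal sketch: escalate to the larger model only when the expected value
# of the improved acceptance rate covers the extra cost per request.
# All figures are illustrative assumptions; use your own acceptance-test results.

small_acceptance = 0.72              # pass rate of the smaller model on this task
large_acceptance = 0.90              # pass rate of the larger model
value_per_accepted_outcome = 4.00    # £ saved per outcome that needs no rework
extra_cost_per_request = 0.40        # £ extra model cost per request

expected_gain = (large_acceptance - small_acceptance) * value_per_accepted_outcome
escalate = expected_gain > extra_cost_per_request   # 0.18 × £4.00 = £0.72 > £0.40
```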
Right-sizing retrieval: the quiet superpower
Most overspend hides in retrieval. Over‑chunked documents, weak metadata and “fetch five just in case” habits all bloat tokens. Focus on:
- Chunk once, use everywhere: keep chunk sizes consistent and include source and freshness metadata.
- Strict filters before similarity search (owner, date, document type). Avoid fishing expeditions.
- Answer‑first prompts: ask the model to attempt an answer from the top two chunks, then retrieve more only if it flags low confidence (see the sketch below).
If your team needs a refresher, start with our practical RAG guide.
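To show what answer-first retrieval can look like, here is a minimal sketch. `search`, `call_model` and the low-confidence signal are hypothetical stand-ins for your own retrieval and model clients.

```python
# Minimal answer-first retrieval sketch: try the top two chunks with strict
# metadata filters, widen the search only if the model flags low confidence.
# `search` and `call_model` are hypothetical wrappers around your own stack.

def answer_first(question: str, filters: dict, search, call_model) -> str:
    chunks = search(question, filters=filters, top_k=2)
    draft = call_model(question, context=chunks)

    if draft.get("low_confidence"):       # the model signals missing context
        more_chunks = search(question, filters=filters, top_k=5)
        draft = call_model(question, context=more_chunks)

    return draft["text"]

# Example call with strict filters applied before similarity search:
# answer_first("What is our volunteer expenses policy?",
#              {"doc_type": "policy", "owner": "finance"}, search, call_model)
```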
Cost controls you can switch on this week
In your AI platform
- Set per‑project and per‑user quotas.
- Enable daily or weekly budget alerts.
- Turn on request‑level logging of token usage and model version.
- Enable caching where offered and define clear TTLs (time to live).
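If your platform does not offer server-side caching, a simple application-side cache with a TTL covers the FAQ and policy cases. A minimal in-memory sketch (in production you would use a shared store such as Redis):

```python
import hashlib
import time

# Minimal application-side cache: reuse completions for identical prompts
# within a TTL. Suited to FAQs and policy text that change rarely.

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 24 * 60 * 60          # refresh cached answers daily

def cached_completion(prompt: str, call_model) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]               # cache hit: no tokens spent
    text = call_model(prompt)       # cache miss: pay once, reuse afterwards
    CACHE[key] = (time.time(), text)
    return text
```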
In your cloud console
- Create budgets and anomaly alerts in Azure Cost Management, AWS Budgets or GCP Billing.
- Tag or label resources by use case so you can allocate costs to business owners (a FinOps basic); a minimal allocation sketch follows this list.
- Use cheaper storage tiers for cold embeddings and archived indices.
- Review data egress paths if you mix vendors to avoid unpleasant surprises.
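Once resources are tagged by use case, allocating spend to business owners is a short script over your billing export. A minimal sketch, assuming a CSV export with `tag_use_case` and `cost` columns (real export schemas vary by provider):

```python
import csv
from collections import defaultdict

# Minimal cost-allocation sketch: roll up a cloud billing export by use-case
# tag so each business owner sees their share. Column names are assumptions;
# check your provider's cost export schema.

def cost_by_use_case(path: str) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            tag = row.get("tag_use_case") or "untagged"
            totals[tag] += float(row.get("cost") or 0)
    return dict(totals)

# print(cost_by_use_case("billing_export.csv"))
```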
For a deeper dive into governance mechanics, see our earlier 90‑day AI cost guardrail post and use this article to drive the “why now” and “what outcome” conversation with your board.
Who owns what
- Director/Trustee: agree the outcomes and the cost KPIs; sign off the monthly steering pack.
- Operations lead: run the weekly 30‑minute cost triage and own the 30–60–90 day plan.
- Data/AI champion: implement routing, caching and retrieval right‑sizing; maintain dashboards and acceptance tests.
- Finance partner: set budgets, check allocations and confirm savings in month-end numbers.
Checklist to finish this quarter in control
- We can state the outcome and cost per outcome for every AI use case.
- We have budgets, alerts and quotas switched on.
- We use routing + caching + retrieval right‑sizing by default.
- We have a version pinning and canary approach for model upgrades.
- We run a weekly cost triage and a monthly steering pack.
- We know who owns what and how savings show up in finance reports.