You’ve shipped a promising AI pilot. Users like it. Then the finance director asks the question: what will this cost at 10× scale? The honest answer is usually “it depends”. But “it depends” won’t fly at your next board meeting. This guide gives UK SME and charity leaders a practical, vendor-neutral playbook to set cost guardrails now—so you can scale without surprises.
We’ll cover the moving parts of AI spend, a 30‑day rollout of controls, the questions to ask vendors, and the handful of KPIs that keep your programme honest. Where relevant we link to authoritative pricing and limits so your numbers are grounded in reality.
Your unit cost, not your cloud bill, is the strategy
Stop treating AI as an undifferentiated “cloud cost”. Your north star is cost per successful task—the average all‑in cost to deliver a result that meets your acceptance criteria. Think of a “task” as a resolved customer email, a clean contract clause extraction, or a drafted case note approved by a human.
Set a ceiling today
- Define “successful” for 1–3 core tasks you plan to scale.
- Measure the all‑in cost per task over a week: model tokens, retrieval, storage, network, vendor mark‑ups, and human review minutes (a worked sketch follows this list).
- Set a red line: “We will not scale beyond £X per successful task.”
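To make the measurement concrete, here is a minimal sketch in Python. Every rate and field name below is an illustrative assumption, not a real tariff; substitute your provider's current pricing and your own log schema.

```python
# Unit-cost sketch. All rates are illustrative placeholders, not real
# tariffs; substitute your provider's current pricing.
RATE_INPUT_PER_1K = 0.0004    # £ per 1,000 input tokens (assumed)
RATE_OUTPUT_PER_1K = 0.0016   # £ per 1,000 output tokens (assumed)
REVIEWER_COST_PER_MIN = 0.40  # £ per minute of human review (assumed)

def cost_per_successful_task(calls: list[dict]) -> float:
    """calls: one dict per request, e.g.
    {"in_tokens": 1200, "out_tokens": 300, "review_mins": 1.5, "success": True}
    Failed calls still count in the numerator: waste is part of unit cost."""
    total = sum(
        c["in_tokens"] / 1000 * RATE_INPUT_PER_1K
        + c["out_tokens"] / 1000 * RATE_OUTPUT_PER_1K
        + c["review_mins"] * REVIEWER_COST_PER_MIN
        for c in calls
    )
    successes = sum(1 for c in calls if c["success"])
    return total / max(successes, 1)  # max() guards a week with zero successes

week = [
    {"in_tokens": 1200, "out_tokens": 300, "review_mins": 1.5, "success": True},
    {"in_tokens": 900, "out_tokens": 250, "review_mins": 0.0, "success": False},
]
print(f"£{cost_per_successful_task(week):.4f} per successful task")
```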
Everything else in this article is in service of keeping that unit cost within your red line while quality holds steady.
Where the money actually goes
| Cost bucket | What drives it | Watch‑outs | What to do now |
|---|---|---|---|
| Model tokens | Input/output token volume, model tier, context window | Larger models and long prompts inflate cost; some platforms offer cheaper “cached input” rates when prompts repeat. | Shorten prompts, cap max output, and exploit prompt caching where supported (sketched below the table). openai.com |
| Embeddings + vector search | Embedding tokens or document count; vector DB storage/throughput | Monthly minimums on managed vector stores can dominate small deployments. | Batch embeddings; compress text; consider plans with transparent minimums. pinecone.io |
| Network egress | Data out of cloud, cross‑cloud traffic, CDN offload | First 100 GB/month is free on major clouds, then per‑GB charges; special rules for migrations and certain multicloud cases in UK/EU. | Budget for egress; use a CDN for user traffic; check if your use case qualifies for free or reduced rates. aws.amazon.com |
| Orchestration/gateways | Per‑call fees, mark‑ups on upstream models | Routing or optimisation add‑ons can pay for themselves if they lower model mix cost. | Pilot intelligent routing to downshift simple queries to small models. aws.amazon.com |
| Observability | Logs, traces, eval runs, storage | Verbose logs = cost; keep only what you need and rotate aggressively. | Set 30–90‑day retention with redaction; sample heavy payloads. |
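One low-effort win from the table: platforms that discount repeated ("cached") input generally match on a stable prompt prefix, so keep the never-changing parts first. A sketch of the ordering, with placeholder names and strings:

```python
def build_prompt(stable_rules: str, stable_tool_specs: str, user_query: str) -> str:
    # Platforms that discount repeated input generally match on a stable
    # prefix, so the parts that never change go first and the per-request
    # query goes last. The argument names are placeholders.
    return "\n\n".join([stable_rules, stable_tool_specs, user_query])
```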
Two other constraints shape your design: throughput limits and concurrency caps. For example, Azure OpenAI publishes tokens‑per‑minute and requests‑per‑minute quotas by model and tier—easy to miss until your traffic spikes. Use these as inputs to a realistic scaling plan. learn.microsoft.com
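A client-side throttle keeps bursts inside those quotas rather than discovering them through throttling errors. A minimal sketch; the 60,000 TPM default is a placeholder, not any provider's actual quota:

```python
import threading
import time

class TokenBudget:
    """Sliding-window throttle that keeps bursts under a tokens-per-minute
    (TPM) quota. The default limit is a placeholder, not a real quota."""

    def __init__(self, tpm_limit: int = 60_000):
        self.tpm_limit = tpm_limit
        self.window: list[tuple[float, int]] = []  # (timestamp, tokens spent)
        self.lock = threading.Lock()

    def acquire(self, tokens: int) -> None:
        if tokens > self.tpm_limit:
            raise ValueError("single request exceeds the whole quota")
        while True:
            with self.lock:
                now = time.monotonic()
                # forget spend older than 60 seconds
                self.window = [(t, n) for t, n in self.window if now - t < 60]
                if sum(n for _, n in self.window) + tokens <= self.tpm_limit:
                    self.window.append((now, tokens))
                    return
            time.sleep(0.25)  # back off instead of being throttled server-side

budget = TokenBudget()
budget.acquire(1_500)  # call before each request with its estimated tokens
```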
The 30‑day cost guardrails rollout
Week 1 — Baseline and budget
- Define success: one sentence per task. Tie each to a unit‑cost ceiling.
- Turn on usage logging at the edge: capture prompt and completion token counts, latency, and result outcome tags (success/fail/needs‑human).
- Token diet: remove boilerplate in prompts, strip footers/disclaimers from user inputs, and set a sensible maximum output length.
- Cache where possible: repeated system prompts and tools often qualify for lower input rates via caching on some platforms. openai.com
- Right‑size retrieval: chunk documents sensibly, cap the number of retrieved passages, and deduplicate before sending to the model (see the sketch after this list).
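A minimal sketch of the retrieval cap and dedupe step; both limits are placeholders to tune against your quality scoreboard:

```python
def trim_context(chunks: list[str], max_chunks: int = 4,
                 max_chars: int = 1_200) -> list[str]:
    """Deduplicate and cap retrieved passages before they reach the model.
    Both limits are placeholders; tune them against your quality metrics."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:  # assumes chunks arrive ranked best-first
        key = chunk.strip().lower()
        if key in seen:
            continue  # drop verbatim duplicates
        seen.add(key)
        kept.append(chunk[:max_chars])  # hard cap on each passage
        if len(kept) == max_chunks:
            break
    return kept
```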
Week 2 — Mix your models
- Adopt a two‑tier model strategy: a small/fast model for routine queries and a larger model for complex ones (a routing sketch follows this list).
- Route intelligently: try managed prompt/model routing features; vendors cite material savings when simple queries are downshifted. aws.amazon.com
- Ring‑fence the premium model: require an explicit “escalation” signal to call it, and log a reason code each time.
- Measure quality deltas: if the small model meets your acceptance criteria for 80% of traffic, lock it in.
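A routing sketch under the assumptions above; the model names and reason codes are placeholders for whatever you actually deploy:

```python
ESCALATION_REASONS = {"multi-document", "legal-review", "low-confidence"}  # placeholders

def log_escalation(query: str, reason: str) -> None:
    # Swap for your real logging; the point is an auditable reason code.
    print(f"ESCALATION reason={reason} query_len={len(query)}")

def choose_model(query: str, escalation_reason: str | None = None) -> str:
    """Two-tier routing sketch. The model names are placeholders for
    whichever small and premium models you actually deploy."""
    if escalation_reason is None:
        return "small-fast-model"
    if escalation_reason not in ESCALATION_REASONS:
        raise ValueError(f"unknown reason code: {escalation_reason}")
    log_escalation(query, escalation_reason)
    return "premium-model"
```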
Week 3 — Tame your data transfer
- Put a CDN in front of all user‑facing content so you are not serving assets straight from your app or storage account.
- Budget egress explicitly: add a per‑GB line in your forecast after the first 100 GB/month (sketched after this list); check your provider’s current tariff. azure.microsoft.com
- Plan for portability: if you ever need to migrate or burst across clouds, check eligibility for free or reduced outbound fees and multicloud zero‑charge options in the UK/EU. aws.amazon.com
- Keep inference close to data: avoid unnecessary cross‑region or cross‑cloud hops for retrieval.
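A minimal egress forecast sketch; the free allowance and per-GB rate are placeholders, so look up your provider's current UK/EU tariff before relying on the number:

```python
def monthly_egress_cost(gb_out: float, free_gb: float = 100.0,
                        rate_per_gb: float = 0.07) -> float:
    """Egress line for the forecast. The free allowance and £/GB rate are
    placeholders; check your provider's current UK/EU tariff."""
    return max(0.0, gb_out - free_gb) * rate_per_gb

print(f"£{monthly_egress_cost(850):.2f}")  # e.g. 850 GB out in a month
```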
Week 4 — Prove you can scale
- Load test with realistic prompts and retrieval payloads; observe throughput caps and back‑off behaviour. Use published provider quotas to size concurrency. learn.microsoft.com
- Set hard safety rails: per‑minute cost cap, per‑request token cap, and a kill‑switch if a daily budget is exceeded (a sketch follows this list).
- Agree a go/no‑go: does cost per successful task stay under the red line when you push at 2× peak traffic?
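A sketch of the kill-switch; the cap values are placeholders for your own red lines:

```python
from datetime import date

class DailyBudgetGuard:
    """Kill-switch sketch. The cap values are placeholders for your red lines."""

    def __init__(self, daily_cap_gbp: float = 50.0, max_tokens: int = 4_000):
        self.daily_cap = daily_cap_gbp
        self.max_tokens = max_tokens
        self.spent_today = 0.0
        self.day = date.today()

    def check(self, estimated_cost_gbp: float, requested_tokens: int) -> None:
        """Call before each model request; raises instead of overspending."""
        if date.today() != self.day:  # new day, reset the meter
            self.day, self.spent_today = date.today(), 0.0
        if requested_tokens > self.max_tokens:
            raise RuntimeError("per-request token cap exceeded")
        if self.spent_today + estimated_cost_gbp > self.daily_cap:
            raise RuntimeError("daily budget exceeded: kill-switch engaged")
        self.spent_today += estimated_cost_gbp
```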
Procurement questions that surface hidden costs
When a vendor says “we’re affordable”, ask these specifics and insist on written answers:
- Pricing model: is it pure pass‑through of model and storage costs, or does the vendor add a percentage? Are there per‑seat or per‑workspace fees?
- Minimums: are there monthly platform or vector database minimums even at low volume? If yes, what are they? pinecone.io
- Token transparency: will we see raw input and output token counts per request in our logs and invoices?
- Caching: do you support cached prompts and do we benefit from the reduced cached‑input rate where available? openai.com
- Routing: can your system route to a cheaper model for simple queries, and what evidence do you have of savings at scale? aws.amazon.com
- Egress: what happens to our cost if we move data across clouds or leave your platform—do we benefit from any zero‑ or reduced‑egress schemes in UK/EU? aws.amazon.com
- Quotas: how do you manage provider rate limits during spikes? Do you queue or degrade gracefully? learn.microsoft.com
- Observability: can we export full usage logs to our data warehouse without extra fees?
- Change control: how are price rises communicated? Do you guarantee pass‑through of upstream reductions?
KPIs that keep spend honest
- Cost per successful task (by use case; computed in the sketch after this list)
- Tokens per task (input/output split)
- Retrieval payload size per task (KB or tokens)
- Model mix ratio (percentage handled by small vs large model)
- Cache hit rate (for prompts/tools)
- Escalation rate to premium model (and top 5 reasons)
- Network egress per 1,000 tasks
- Failure/retry rate and average retries per task
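If your usage logs land in a data warehouse, most of these KPIs fall out of a single pass over the records. A sketch, assuming a log shape like the one in the docstring; map the fields to whatever you actually capture:

```python
from statistics import mean

def kpi_snapshot(records: list[dict]) -> dict:
    """KPI pass over usage logs. The field names are assumptions. Example record:
    {"outcome": "success", "cost_gbp": 0.02, "in_tokens": 1200,
     "out_tokens": 300, "model_tier": "small", "cache_hit": True, "retries": 0}"""
    successes = sum(1 for r in records if r["outcome"] == "success")
    return {
        "cost_per_successful_task": sum(r["cost_gbp"] for r in records) / max(successes, 1),
        "tokens_per_task": mean(r["in_tokens"] + r["out_tokens"] for r in records),
        "premium_share": mean(r["model_tier"] == "premium" for r in records),
        "cache_hit_rate": mean(r["cache_hit"] for r in records),
        "avg_retries": mean(r["retries"] for r in records),
    }
```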
The “bill shock” risk register
| Risk | What to check | Signal it’s healthy | Quick fix if not |
|---|---|---|---|
| Runaway tokens | Distribution of output lengths; number of retrieved chunks per call | 95% of calls under your cap (checked in the sketch below the table) | Lower max tokens; reduce retrieval depth; remove boilerplate in context |
| Premium model overuse | Share of traffic on high‑tier model | ≤ 20% on premium tier | Add routing guard; require reason codes; audit weekly |
| Vector store minimums | Invoice line items vs usage | Minimums are ≤ 15% of total bill | Move to a tier with clear minimums; batch jobs; consider serverless options. pinecone.io |
| Egress surprises | GB out by region and by destination | Forecast includes post‑free‑tier per‑GB rates | Put traffic behind a CDN; co‑locate retrieval with inference; check UK/EU concessions. azure.microsoft.com |
| Quota throttling at peak | Provider TPM/RPM headroom | ≥ 30% headroom at forecast peak | Request higher limits; add queueing; pre‑compute where possible. learn.microsoft.com |
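The “runaway tokens” check is a few lines over your output-token logs; a sketch, using a nearest-rank percentile:

```python
def p95_under_cap(output_tokens: list[int], cap: int) -> bool:
    """True if 95% of calls stayed at or under the output-token cap."""
    ordered = sorted(output_tokens)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]  # nearest-rank percentile
    return p95 <= cap
```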
What’s changed lately that affects UK budgets
- Some AI platforms offer cheaper billing for repeated or “cached” prompt content, which can materially lower input costs when you reuse stable instructions and tools. openai.com
- Major clouds now include 100 GB/month free internet egress as a baseline, with special provisions for customers leaving a platform; in the UK/EU there are additional pathways to reduced or no‑cost cross‑cloud transfer in specific scenarios. aws.amazon.com
- Model routing features can automatically send simple requests to cheaper models, with vendors citing up to 30% spend reduction in some workloads. aws.amazon.com
Even if you last reviewed your assumptions as recently as six months ago, it’s worth a refresh before you commit to annual budgets.
A simple forecast template your board will accept
- Volume: for each task, forecast monthly volume for the next two quarters.
- Unit profile: estimate input and output tokens, retrieval payload size, and escalation rate to a premium model.
- Provider charges: apply current model rates, storage, vector, and egress assumptions; include any minimums and routing fees.
- Human‑in‑the‑loop: add minutes of reviewer time for tasks that need it.
- Contingency: add a 15–20% buffer while you stabilise.
- Sensitivity: show the effect of a 25% change in volume, output length, or premium model usage (sketched below).
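A sketch of the whole template as one function; every rate is an illustrative placeholder, so substitute current provider pricing before showing this to anyone:

```python
def forecast_monthly_cost(volume: int, in_tokens: int, out_tokens: int,
                          escalation_rate: float, review_mins: float,
                          contingency: float = 0.175) -> float:
    """Forecast sketch. Every rate is an illustrative placeholder;
    substitute current provider pricing before use."""
    SMALL_IN, SMALL_OUT = 0.0004, 0.0016    # £ per 1,000 tokens (assumed)
    PREMIUM_IN, PREMIUM_OUT = 0.004, 0.016  # £ per 1,000 tokens (assumed)
    REVIEW_PER_MIN = 0.40                   # £ per reviewer minute (assumed)

    def per_task(rate_in: float, rate_out: float) -> float:
        return in_tokens / 1000 * rate_in + out_tokens / 1000 * rate_out

    blended = ((1 - escalation_rate) * per_task(SMALL_IN, SMALL_OUT)
               + escalation_rate * per_task(PREMIUM_IN, PREMIUM_OUT)
               + review_mins * REVIEW_PER_MIN)
    return volume * blended * (1 + contingency)

base = forecast_monthly_cost(10_000, 1_200, 300, escalation_rate=0.15, review_mins=0.5)
bigger = forecast_monthly_cost(12_500, 1_200, 300, escalation_rate=0.15, review_mins=0.5)
print(f"base £{base:,.0f} | +25% volume £{bigger:,.0f}")
```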
Use this to decide whether to green‑light scale now, or to invest another sprint in prompt diet, retrieval tuning, and model mix.
Related playbooks
- If you haven’t formalised quality yet, set KPIs first: The AI Quality Scoreboard.
- Heading to production? See our low‑drama launch plan: The Quiet Cutover.
- Buying from vendors? Avoid the gotchas: Nine AI Procurement Traps.
Ready to lock in your guardrails?
We help UK SMEs and charities baseline costs, set the right KPIs, and implement routing, caching and retrieval patterns that scale sensibly—without sacrificing quality.