
The 90‑Day AI Cost Guardrail for UK SMEs — budgets, caching and model tiering you can ship on Monday

Most UK SMEs and charities start with small AI experiments, then get a shock when a spike in usage or a successful pilot doubles the bill overnight. The fix isn’t to slow down; it’s to install a simple set of guardrails that cap spend, steer workloads to the right model, and make repeated context cheaper, without hurting quality.

This article gives you a practical, non-technical plan you can complete in 90 days: budgets and alerts, prompt/context caching, model tiering, a handful of procurement questions, and a Monday‑morning rollout checklist. Where relevant, we link to vendor guidance so your team can verify details and set controls in the platforms you already use.

Why AI costs surprise good teams

  • Usage isn’t linear. A new form, a popular email feature or a sales campaign can multiply requests overnight.
  • Context is expensive. Long policies, knowledge bases and CRM extracts inflate input tokens. Re‑sending the same context for each query wastes money unless you cache it. Google’s Vertex AI and Anthropic’s Claude expose prompt/context caching that discounts cached tokens by around 90% on reads, with a TTL window to control how long caches live. docs.cloud.google.com
  • Model choice matters. Big models are brilliant, but you don’t need them everywhere. Providers such as AWS Bedrock highlight techniques like prompt caching and model distillation to cut costs dramatically when the heaviest model isn’t needed for every step. aws.amazon.com
  • Budget controls exist, but they’re often not turned on. Cloud platforms offer budgets, alerts and automation so you can be warned early or trigger actions when spend trends the wrong way. See Google Cloud’s gen‑AI billing controls and Azure Cost Management budgets. cloud.google.com

Unit economics in plain English

“Unit economics” means tracking cost per useful outcome, not just the monthly bill. For example: cost per resolved support ticket, cost per tender shortlisted, or cost per HR case progressed. The FinOps Foundation and Microsoft summarise this well: pair cloud/AI spend with a unit that leaders recognise, then manage that ratio over time. finops.org

If you want a quick way to socialise this in your board pack, use the one‑pager ideas in our internal guide on turning tokens into pounds.

Your 90‑day spend plan (three sprints)

Sprint 1 (Weeks 1–2): Baseline and budget

  • Baseline: capture 14 days of usage by use case (support, sales, content, back‑office). Note average prompt size, output size, and success rate.
  • Budgets and alerts: create monthly budgets per project/workspace; set 50/80/100% alerts. Turn on forecast‑based alerts where available. In Azure and Google Cloud you can trigger action groups or automation on threshold breaches. learn.microsoft.com
  • Define the unit metric for each use case (e.g., “cost per resolved email” for triage; “cost per qualified lead” for marketing); a minimal calculation sketch follows this list. finops.org
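
To make the baseline concrete, here is a minimal Python sketch that turns a usage export into spend and cost-per-outcome figures per use case. The per-token prices, log fields and numbers are illustrative assumptions; substitute your provider’s actual rates and your own export format.

```python
# Minimal baseline sketch: turn a usage export into spend and cost-per-outcome
# figures per use case. Prices and the log format below are illustrative
# assumptions; substitute your provider's actual rates and export fields.

from collections import defaultdict

PRICE_PER_M_INPUT = 2.00    # £ per 1M input tokens (illustrative)
PRICE_PER_M_OUTPUT = 8.00   # £ per 1M output tokens (illustrative)

usage_log = [  # one row per call, e.g. exported from your provider's dashboard
    {"use_case": "support", "input_tokens": 6200, "output_tokens": 400, "succeeded": True},
    {"use_case": "support", "input_tokens": 5800, "output_tokens": 350, "succeeded": False},
    {"use_case": "marketing", "input_tokens": 900, "output_tokens": 700, "succeeded": True},
]

spend, outcomes = defaultdict(float), defaultdict(int)
for row in usage_log:
    spend[row["use_case"]] += (row["input_tokens"] * PRICE_PER_M_INPUT
                               + row["output_tokens"] * PRICE_PER_M_OUTPUT) / 1_000_000
    if row["succeeded"]:
        outcomes[row["use_case"]] += 1

for use_case, total in spend.items():
    n = outcomes[use_case]
    per_unit = total / n if n else float("nan")
    print(f"{use_case}: £{total:.4f} spent, {n} outcomes, £{per_unit:.4f} per outcome")
```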

Sprint 2 (Weeks 3–6): Caching and model tiering

  • Context/prompt caching: enable caching for repeated policy text, product catalogues or long system instructions. On Vertex AI, cached reads are discounted ~90% and you can set a TTL; on Claude via Vertex, cache writes cost more than standard input tokens but reads are significantly cheaper (a minimal SDK sketch follows this list). docs.cloud.google.com
  • Model tiering: define a primary “everyday” model (fast/cheap) and an “escalation” model for hard cases. Use a confidence or score threshold to route to the bigger model when needed; providers like AWS Bedrock describe prompt routing and distillation options that can reduce cost while keeping quality. aws.amazon.com
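
For teams using Claude, here is a minimal sketch of prompt caching via the Anthropic Python SDK: the long, repeated system prefix is marked cacheable so later calls can read it at the discounted rate. The model ID and policy text are placeholders, and field names should be verified against the current SDK documentation.

```python
# Hedged sketch: Claude prompt caching via the Anthropic Python SDK.
# Marking the long system prefix with cache_control lets subsequent calls
# read it from cache at a discounted rate. Model ID and policy text are
# placeholders; verify field names against the current SDK docs.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_POLICY_TEXT = "..."  # your repeated policy / knowledge-base prefix

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use your chosen model
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_POLICY_TEXT,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Summarise our refund policy for a customer."}],
)

# Usage metadata shows whether the cache was written or read on this call
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```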

Sprint 3 (Weeks 7–12): Operationalise controls

  • Feature flags and kill‑switches for high‑risk features. If spend spikes, switch off optional enrichments first. See our rollout notes on feature flags for AI.
  • Showback to teams: send a simple monthly email covering usage, £ spent, cost per unit and the top saving opportunity.
  • Go‑live gate before scaling: confirm controls and dashboard KPIs are green. See our go‑live gate.

The five cost guardrails to set once, then maintain

1) Budgets, alerts and automation

Set budgets per project/use case with 50/80/100% tiers. Add a forecast alert so you get warned before month‑end. Where supported, connect alerts to an action (e.g., pause a non‑critical job, notify a Slack channel, or downgrade a model tier). Google Cloud documents budget alerts and automation; Microsoft details budgets and action groups in Cost Management. cloud.google.com
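
Where your platform can deliver alerts to a webhook or queue, a small handler turns a threshold breach into an action. The sketch below assumes a payload shaped like Google Cloud’s budget notifications (costAmount, budgetAmount, alertThresholdExceeded; verify the exact fields in their docs), and the helper functions are hypothetical stand-ins for your own automation.

```python
# Hedged sketch: act on a budget alert. The payload shape follows Google
# Cloud's budget notification format, but verify the exact fields against
# their docs. The three helpers are hypothetical stand-ins for your own
# automation.

def notify_team(message: str) -> None:
    print(message)  # swap for a Slack/Teams webhook call

def pause_enrichment() -> None:
    print("optional enrichment paused")  # e.g. flip a feature flag

def downgrade_model_tier() -> None:
    print("escalation model disabled until spend is reviewed")

def handle_budget_alert(payload: dict) -> None:
    threshold = payload.get("alertThresholdExceeded")  # e.g. 0.5, 0.8, 1.0
    if threshold is None:
        return  # informational notification, no threshold breached
    spend, budget = payload["costAmount"], payload["budgetAmount"]
    notify_team(f"Budget alert: £{spend:.2f} of £{budget:.2f} ({threshold:.0%})")
    if threshold >= 0.8:
        pause_enrichment()       # switch off optional features first
    if threshold >= 1.0:
        downgrade_model_tier()   # everyday model only until month end

handle_budget_alert({"costAmount": 412.0, "budgetAmount": 500.0,
                     "alertThresholdExceeded": 0.8})
```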

2) Prompt/context caching policies

  • What to cache: long, repeated prefixes such as policy statements, product specs, or instructions. Both Vertex AI (Gemini) context caching and Claude prompt caching discount cached tokens heavily on reads. docs.cloud.google.com
  • TTL: choose 5 minutes for highly dynamic content; extend when stable. Vertex AI supports extending TTL for explicit caches; costs and discounts vary by TTL. docs.cloud.google.com
  • Why it matters: vendors report read discounts of up to ~90% and latency improvements when cache hits occur; see the back-of-envelope sketch after this list. docs.cloud.google.com
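
A quick back-of-envelope calculation shows why this guardrail matters. The sketch assumes the ~90% read discount quoted above (check current pricing) and ignores the cache-write surcharge; every other number is illustrative.

```python
# Back-of-envelope caching savings. Assumes a ~90% read discount on cached
# tokens and ignores the write surcharge; all figures are illustrative.

prefix_tokens = 20_000        # shared policy/instruction prefix per call
calls_per_day = 1_000
price_per_m_input = 2.00      # £ per 1M input tokens (illustrative)
read_discount = 0.90

without_cache = prefix_tokens * calls_per_day * price_per_m_input / 1_000_000
with_cache = without_cache * (1 - read_discount)

print(f"Prefix cost per day, no cache:  £{without_cache:.2f}")   # £40.00
print(f"Prefix cost per day, cache hit: £{with_cache:.2f}")      # £4.00
```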

3) Model tiering and fallbacks

Define a cheaper default model for routine work (drafts, summaries, extraction) and escalate to a larger model only when a quality threshold or “uncertainty” signal is hit. AWS highlights prompt routing and distillation so smaller models can handle more traffic at lower cost, with the option to escalate. aws.amazon.com
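
In code, the routing rule can be as simple as the sketch below: default to the cheap model and escalate only when a confidence score misses target. call_model() is a hypothetical wrapper for your own stack, and the threshold is illustrative; AWS Bedrock’s intelligent prompt routing is a managed alternative to rolling your own.

```python
# Tiering sketch: cheap model by default, escalate on low confidence.
# call_model() is a hypothetical wrapper around your provider's SDK; the
# model IDs and threshold are placeholders, not recommendations.

EVERYDAY_MODEL = "small-fast-model"
ESCALATION_MODEL = "large-capable-model"
CONFIDENCE_TARGET = 0.75

def call_model(model_id: str, prompt: str) -> tuple[str, float]:
    """Hypothetical: returns (answer, confidence score in [0, 1])."""
    raise NotImplementedError("wire this to your provider's SDK")

def answer(prompt: str) -> str:
    draft, confidence = call_model(EVERYDAY_MODEL, prompt)
    if confidence >= CONFIDENCE_TARGET:
        return draft
    # Hard case: pay for the bigger model, and log it so the escalation
    # rate shows up on your monthly dashboard.
    print(f"escalated (confidence {confidence:.2f})")
    better, _ = call_model(ESCALATION_MODEL, prompt)
    return better
```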

4) Rate limits and quotas

Set per‑user and per‑integration request caps for optional features (e.g., auto‑enrichment). This prevents runaway spend from a single misbehaving script or user.
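
A per-user cap takes only a few lines. The in-memory version below is a sketch with an illustrative cap; in production you would back it with Redis or your API gateway’s quota feature.

```python
# Sketch of a per-user daily cap for an optional feature. The in-memory dict
# is illustrative only; back it with Redis or a gateway quota in production.

from collections import defaultdict
from datetime import date

DAILY_CAP = 50  # max auto-enrichment calls per user per day (illustrative)

_counts = defaultdict(int)  # (user_id, date) -> calls made today

def allow_request(user_id: str) -> bool:
    key = (user_id, date.today())
    if _counts[key] >= DAILY_CAP:
        return False  # refuse, queue for later, or fall back to a cheaper path
    _counts[key] += 1
    return True

if allow_request("user-42"):
    pass  # run the enrichment call
```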

5) Observability and showback

Track spend by tag (project, team, environment) and send a monthly showback note: “Team A spent £X, delivered Y outcomes, cost per outcome £Z.” This keeps conversations focused on value, not raw spend. The FinOps framework recommends making unit metrics visible and routine. finops.org

What to ask vendors this quarter (10 questions, 5 proofs)

Pricing and controls

  • How do you expose budgets and alerts per workspace or project? Can alerts trigger automation? cloud.google.com
  • Do you support prompt/context caching? What are the read discounts and TTL options? docs.cloud.google.com
  • What’s your guidance for model tiering or routing between small and large models? Any built‑in features or best‑practice patterns? aws.amazon.com

Observability and portability

  • Can we tag usage by team/customer? Do you provide per‑feature spend out of the box?
  • If we switch model/provider, how portable are our prompts, caches and embeddings?

Ask for proofs

  • A screenshot of budget settings and a live alert firing on a threshold.
  • A demo showing cache configuration, the cache hit rate, and a before/after cost line.
  • Evidence of prompt routing or distillation reducing cost with comparable quality. AWS publishes examples of these patterns. aws.amazon.com
  • A one‑page mapping of usage tags to your finance cost centres.
  • Confirmation on data handling for caching and logging. If using Vertex AI, review its notes on caching and data retention/zero‑retention options. docs.cloud.google.com

For broader supplier selection, see our guides on avoiding AI lock‑in and the 2025 Copilot Buyer’s Guide.

KPIs for your monthly dashboard

  • Spend trend vs budget per use case (month‑to‑date and forecast).
  • Cost per unit outcome (e.g., £/resolved ticket, £/qualified lead). finops.org
  • Prompt:completion ratio (high prompt tokens usually signal context bloat).
  • Cache hit rate and cached tokens as a % of total input. Aim for a rising cached share where content is stable; both this and the prompt:completion ratio are computed in the sketch after this list. docs.cloud.google.com
  • Escalation rate from small to large model (target stable or falling as prompts improve).
  • Human review rate on critical workflows (should trend down as confidence grows).
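
Two of these KPIs fall straight out of a usage export, as the sketch below shows. The field names are illustrative; map them to whatever your provider’s export actually calls them.

```python
# Two dashboard KPIs from a usage export: prompt:completion ratio and the
# cached share of input tokens. Field names are illustrative assumptions.

def dashboard_kpis(rows: list[dict]) -> dict:
    prompt_tokens = sum(r["input_tokens"] for r in rows)
    completion_tokens = sum(r["output_tokens"] for r in rows)
    cached_tokens = sum(r.get("cached_input_tokens", 0) for r in rows)
    return {
        "prompt_completion_ratio": prompt_tokens / completion_tokens,  # high => context bloat
        "cached_share_of_input": cached_tokens / prompt_tokens,        # aim for a rising share
    }

print(dashboard_kpis([
    {"input_tokens": 6000, "output_tokens": 300, "cached_input_tokens": 4500},
    {"input_tokens": 5500, "output_tokens": 250, "cached_input_tokens": 4500},
]))
# {'prompt_completion_ratio': ~20.9, 'cached_share_of_input': ~0.78}
```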

Cost risks and quick mitigations

  • Context bloat. What you’ll see: prompt tokens dwarf output; repeated policy text in every call. Quick fix: enable caching for static prefixes; move long background into cached context; review prompt templates. docs.cloud.google.com
  • One‑size‑fits‑all model. What you’ll see: high spend even on simple tasks. Quick fix: introduce tiering (small model by default, escalate on a threshold); consider distillation. aws.amazon.com
  • Spend surprises. What you’ll see: end‑of‑month bill shock. Quick fix: budgets at 50/80/100% plus forecast alerts; route alerts to an action group. cloud.google.com
  • Unknown business value. What you’ll see: the bill goes up and no one can say if it’s “worth it”. Quick fix: define unit metrics aligned to outcomes; share showback monthly. finops.org

A simple decision tree for model spend

  1. Is the task routine and structured? Try your “everyday” model first. If accuracy is below target, improve prompts or add a small knowledge snippet. Only then consider escalating.
  2. Does the task repeat over the same context? Turn on caching and measure the cache hit rate before buying more capacity. docs.cloud.google.com
  3. Is quality still below target? Escalate to a larger model for the hard cases only. Consider distilling that behaviour back into a smaller model later. aws.amazon.com

How to calculate “cost per resolved ticket” (example)

  1. Pick a sample period (last full month) and count resolved tickets where AI assisted.
  2. Sum AI spend tied to the support workspace/project (plus any vector/hosting costs).
  3. Divide spend by resolved tickets to get £ per resolution; track monthly.
  4. Improve the ratio by caching standard knowledge, trimming prompts, and keeping routine resolutions on your cheaper model; a worked example follows this list. docs.cloud.google.com
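
A worked version of those four steps, with illustrative figures:

```python
# Worked example of the steps above; all figures are illustrative.

ai_spend_gbp = 310.00        # support workspace AI spend, last full month
vector_hosting_gbp = 40.00   # share of vector DB / hosting costs
resolved_with_ai = 1_250     # resolved tickets where AI assisted

cost_per_resolution = (ai_spend_gbp + vector_hosting_gbp) / resolved_with_ai
print(f"£{cost_per_resolution:.3f} per resolved ticket")  # prints £0.280
```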

Once you’ve measured one unit metric, roll the approach out to other areas (sales emails, HR cases). For a compact template, see our guide to AI unit economics.

Data handling notes when you use caching

Caching can change how your data is processed and retained. For example, Vertex AI explains how implicit and explicit caching work, the discounts and TTLs, and where to disable caching if you require zero data retention. Ask your vendor to show you the relevant settings. docs.cloud.google.com

Where to start on Monday (2 hours)

  1. Create two budgets (Support and Marketing) with 50/80/100% alerts and a forecast alert. Hook the 80% alert to a notification channel. cloud.google.com
  2. Enable caching for the longest, most repeated context in Support. Check the cache hit metric after one week and aim for >50% of input tokens cached on repeated sessions. docs.cloud.google.com
  3. Define model tiers: “Everyday” vs “Escalation”. Cap the escalation path to 10–20% of requests initially; review weekly. aws.amazon.com
  4. Publish a one‑page unit metric to your leadership channel: “£ per resolved ticket — baseline and target”. finops.org
  5. Add a go‑live check to your team rituals before launching any new AI feature. Our go‑live gate fits on one page.

Want help installing these guardrails?

We’ve helped UK SMEs and charities set this up in under a month — budgets, caching policies, model tiering, and a unit‑economics dashboard that directors can read in two minutes. If you need a quick enablement path, our AI Office Hours sprint may be a fit.