Why this matters before January
AI usage climbs quietly, then your finance team sees it loudly. Most UK SMEs and charities now run at least one AI service in production—an answers widget on the website, triage for the shared inbox, or a volunteer hub assistant. In many orgs, inference (the act of getting answers from models) is the dominant line item; community research finds that for generative AI, inference often makes up the bulk of spend, while idling GPUs and over‑sized deployments waste budget. That means your cost‑of‑serve is a product KPI, not just an IT metric. finops.org
This one‑week playbook shows non‑technical leaders how to stand up an AI cost‑of‑serve dashboard, the 10 KPIs to track weekly, and the practical levers—caching, batch jobs, model tiering, routing, budgets, and reservations—that cut spend without cutting outcomes. Where we reference provider features (OpenAI, AWS, Google, Microsoft), we link to their official pages.
The 10 KPIs for an AI cost‑of‑serve dashboard
Keep these in a single panel your execs can read on a Monday morning. Define them per use case (e.g., “customer enquiries”, “grant FAQs”, “internal policy search”).
- Cost per resolution: total AI cost divided by successful answers resolved without human escalation. Track the trend weekly; a calculation sketch follows this list.
- Tokens per resolution: input and output tokens used for a resolved case. Falls when you trim prompts, chunk documents well, and control output length.
- Cache hit rate: percentage of tokens served from a prompt/context cache. Higher is cheaper and faster. Many platforms now discount cached tokens significantly. docs.cloud.google.com
- Model mix: share of requests on “mini/fast” vs “pro/large” models. A healthy estate serves simple queries on lighter models automatically.
- Routing acceptance rate: how often your prompt/model router chooses the cheaper model and is not overridden. Some cloud services now offer intelligent prompt routing with claimed double‑digit cost reductions. aws.amazon.com
- Batching share: percentage of low‑priority jobs (summaries, tags, backfills) shipped via discounted batch processing. Several providers offer lower rates for asynchronous batches. openai.com
- Latency (p95): end‑to‑end time to answer. Monitor alongside model mix; very slow answers drive drop‑offs and extra retries.
- Human handoff rate: percentage of AI‑assisted cases still requiring people to finish the job. Falling handoffs often correlate with lower cost‑per‑resolution.
- Content retrieval efficiency: average retrieved context size and duplicate rate. Bloated context means wasted tokens.
- Budget compliance: percentage of weeks within your exception budget and the number of alerts triggered. On Azure, use budgets and alerts because hard caps aren’t available for Azure OpenAI. learn.microsoft.com
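To make the first few KPIs concrete, here is a minimal calculation sketch in Python. It assumes you log one record per interaction with the fields described in Day 2; the field names and per‑1k‑token prices are illustrative assumptions, not any provider’s schema or price list.

```python
# Minimal weekly KPI roll-up from per-interaction log records.
# Field names and per-1k-token prices are illustrative assumptions, and the
# cost line ignores cached-token discounts to keep the example short.

def weekly_kpis(records, price_in_per_1k, price_out_per_1k):
    """records: dicts with tokens_in, tokens_out, cached_tokens,
    model_tier ('mini' or 'large') and resolved (True/False)."""
    n = max(len(records), 1)
    cost = sum(
        r["tokens_in"] / 1000 * price_in_per_1k
        + r["tokens_out"] / 1000 * price_out_per_1k
        for r in records
    )
    resolved = [r for r in records if r["resolved"]]
    n_resolved = max(len(resolved), 1)
    total_input = max(sum(r["tokens_in"] for r in records), 1)
    cached = sum(r["cached_tokens"] for r in records)
    return {
        "cost_per_resolution": cost / n_resolved,
        "tokens_per_resolution": sum(r["tokens_in"] + r["tokens_out"] for r in resolved) / n_resolved,
        "cache_hit_rate": cached / total_input,
        "model_mix_mini_share": sum(r["model_tier"] == "mini" for r in records) / n,
    }
```

In practice you would plug in the unit prices from your invoice and apply any cached‑token discount; the shape of the calculation stays the same.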
The one‑week build: step‑by‑step
Day 1 — Scope and success measures
- Pick one use case with material volume (e.g., 2,000+ interactions/month).
- Agree your success targets: cost‑per‑resolution, cache hit rate, routing acceptance, and p95 latency.
- Set an “exception budget”: the maximum weekly overspend before automatic safeguards kick in.
Day 2 — Instrument the basics
- Add a correlation ID to each user interaction so billing lines, logs, and outcomes can be tied to one case.
- Capture: model name, tokens in/out, latency, cache use, retrieval size, response length, resolution outcome.
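A minimal sketch of the record you might write per interaction, assuming a flat JSON log keyed by correlation ID; every field name here is an assumption to adapt to your own stack, not a vendor schema.

```python
# One illustrative record per interaction, keyed by correlation ID.
from dataclasses import dataclass, asdict
import json

@dataclass
class InteractionRecord:
    correlation_id: str    # ties billing lines, logs and the outcome to one case
    model: str             # which model served the request
    tokens_in: int
    tokens_out: int
    cached_tokens: int     # input tokens served from a prompt/context cache
    retrieval_tokens: int  # size of retrieved context added to the prompt
    latency_ms: int        # end-to-end time to answer
    resolved: bool         # answered without human escalation

record = InteractionRecord("case-0001", "mini-model", 850, 210, 600, 400, 1800, True)
print(json.dumps(asdict(record)))  # append one line per case to a daily log or BI table
```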
Day 3 — Connect billing and budgets
- Export billing data daily from your platform of choice into a simple sheet or BI view.
- Set budgets and anomaly alerts in your cloud or vendor console. On Azure, use Budgets and Action Groups; note that Azure OpenAI doesn’t support hard spending caps. learn.microsoft.com
Day 4 — Baseline unit economics
- Measure cost‑per‑resolution over a representative 200–500 case sample.
- Identify “heavy” prompts and long outputs. Trim boilerplate and enforce output length where appropriate.
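As a starting point for finding heavy cases, a small sketch that ranks logged interactions by total tokens; the 5% cut‑off is an arbitrary default, not a standard, and the records use the same illustrative field names as the earlier sketches.

```python
# Rank a baseline sample by total tokens to surface "heavy" prompts and outputs.
def heavy_cases(records, top_fraction=0.05):
    ranked = sorted(records, key=lambda r: r["tokens_in"] + r["tokens_out"], reverse=True)
    return ranked[: max(1, int(len(ranked) * top_fraction))]
```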
Day 5 — Ship two savings levers
- Enable caching for repeated context (system prompts, policy packs). Many platforms price cached reads at a steep discount and improve latency. docs.cloud.google.com
- Move low‑priority work to batch (overnight summaries, backfills) to use discounted batch rates. openai.com
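A hedged sketch of queuing overnight work through OpenAI’s Batch API. The general shape follows OpenAI’s published documentation, but the model name is a placeholder, so confirm parameters, file format, and current pricing against the docs before relying on it.

```python
# Queue non-urgent work through OpenAI's Batch API (placeholder model name).
from openai import OpenAI

client = OpenAI()

# batch_input.jsonl holds one request per line, for example:
# {"custom_id": "doc-42", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Summarise ..."}]}}
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batch jobs trade latency for the discounted rate
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```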
Day 6 — Introduce a router
- Route easy queries to lighter models; escalate complex ones to larger models only when needed. Some cloud services provide built‑in prompt routing with claimed cost benefits. aws.amazon.com
- Start with simple rules (input length, confidence, topic) before chasing complex heuristics.
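A rule‑based router can be a few lines of Python to begin with. The thresholds, topics, and model names below are illustrative assumptions to tune against your own traffic, not recommendations.

```python
# A rule-based starting point for routing; tune every constant to your traffic.
CHEAP_MODEL = "mini-model"    # placeholder identifiers, not specific products
LARGE_MODEL = "large-model"

COMPLEX_TOPICS = {"safeguarding", "legal", "complaints"}

def choose_model(prompt: str, topic: str) -> str:
    if topic in COMPLEX_TOPICS:
        return LARGE_MODEL    # sensitive or complex categories always escalate
    if len(prompt) > 2000:    # very long inputs tend to need the larger model
        return LARGE_MODEL
    return CHEAP_MODEL        # everything else defaults to the cheaper tier
```

Logging which branch fired per request also gives you the routing acceptance rate from the KPI list.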
Day 7 — Publish the dashboard & rhythm
- Share the dashboard with directors and ops weekly. Log decisions and savings estimates.
- Set guardrails: automatic fallback to lighter models if spend crosses the weekly exception budget; cap output length during spikes.
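A minimal sketch of that guardrail, assuming you already track weekly spend and pass a settings object to your client wrapper; the thresholds and setting names are illustrative.

```python
# Weekly exception-budget guardrail: fall back and cap output when spend spikes.
def apply_guardrails(week_spend: float, exception_budget: float, settings: dict) -> dict:
    if week_spend > exception_budget:
        settings["default_model"] = "mini-model"  # fall back to the lighter tier
        settings["max_output_tokens"] = 300       # cap verbosity during the spike
        settings["alert"] = "weekly exception budget exceeded"
    return settings
```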
Eight cost levers you can turn this month
- Prompt/context caching for reusable instructions and policies. Look for vendors that discount cached reads heavily and support explicit or implicit caches with sensible time‑to‑live. docs.cloud.google.com
- Batching for non‑urgent tasks. Many providers offer lower rates for asynchronous batch jobs; reserve real‑time capacity for user‑visible interactions. openai.com
- Model tiering: adopt a “mini → standard → large” ladder, and measure the share of traffic on the cheaper tiers weekly.
- Intelligent prompt routing: where available, enable provider routing that steers simpler requests to cheaper models; some providers claim up to 30% cost reduction without accuracy loss when routing well. aws.amazon.com
- Retrieval diet: reduce the number and size of documents you retrieve. Duplicate or oversized chunks inflate token costs.
- Output discipline: constrain verbosity and formats. Auto‑truncate or summarise lengthy drafts before sending them back through the model again; a sketch of retrieval and output caps follows this list.
- Autoscaling where you self‑host: if you run any models or vector stores yourself, use autoscaling to avoid paying for idle capacity. docs.cloud.google.com
- Reserved/provisioned capacity selectively: if your volume is predictable, reservations can secure capacity and discounts, but buy after you’ve proven usage because availability varies and terms are a commitment. learn.microsoft.com
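The retrieval diet and output discipline levers above can be expressed as two simple caps. The budgets below are illustrative assumptions; map the names onto whatever retrieval and generation settings your stack exposes.

```python
# Two illustrative caps: a retrieval budget per turn and an output-length cap.
MAX_CONTEXT_TOKENS = 1500  # retrieval diet: stop adding chunks past this budget
MAX_OUTPUT_TOKENS = 400    # output discipline: constrain verbosity

def trim_context(chunks, count_tokens, budget=MAX_CONTEXT_TOKENS):
    """chunks: retrieved passages ranked best-first; count_tokens(text) -> int."""
    kept, used = [], 0
    for chunk in chunks:
        size = count_tokens(chunk)
        if used + size > budget:
            break
        kept.append(chunk)
        used += size
    return kept
```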
The weekly operating rhythm (30 minutes)
- Review trends: cost‑per‑resolution, cache hit rate, model mix, p95 latency.
- Decide two actions: one quality action, one cost action (e.g., trim a long prompt; add a batch rule).
- Run a micro‑experiment: A/B test a lighter model on a narrow topic for three days.
- Update the exception log: if the exception budget triggered, note cause and automated response.
- Share a one‑line summary with the board sponsor: “This week cost‑per‑resolution fell 18% due to caching; no customer‑visible impact.”
Procurement questions that reveal hidden costs
Use these with platform vendors and integrators.
- What discounted routes do you support (batching, cached reads, model routing) and how are they reported on invoices? Provide a worked example.
- Can we set budgets and automated actions (pause, throttle, fallback) at workspace or use‑case level? If not, what’s the workaround? Note that some services (e.g., Azure OpenAI) rely on budgets and alerts rather than hard caps. learn.microsoft.com
- How do you expose cache hit rate and token breakdown per request so we can attribute savings?
- What is your reservation/provisioned capacity model? When do discounts apply and what happens if capacity moves region or is unavailable later? learn.microsoft.com
- Do you support per‑use case routing across “mini/standard/large” models, and can we set rules ourselves?
- How do you measure and report quality so cost cuts do not degrade outcomes?
- What is your data localisation and egress charge story if we keep embeddings or caches in your cloud?
- Show licence terms for seats vs usage. Where could we accidentally pay twice?
Decision guardrails: what to change when a KPI drifts
| Signal | Likely cause | Action to take this week |
|---|---|---|
| Cache hit rate falls below 25% | Prompt/context changed too frequently; TTL too short | Stabilise system prompt; move policy packs to a separate cached block with a longer TTL where supported; schedule cache refreshes. |
| Cost‑per‑resolution rises 15%+ | Routing skewed to expensive model; bloated retrieval; verbose outputs | Cap output length for low‑risk intents; tighten retrieval filters; increase router’s confidence threshold for “upgrade”. |
| p95 latency > 2× normal | Capacity limits or retries | Shift non‑urgent tasks to batch; reduce max output length; if persistent and predictable, consider reserved/provisioned capacity for the busy hour only. openai.com |
| Budget alerts fire two weeks running | Growth outpacing plan | Raise exception budget only with a matching savings plan; freeze new use‑cases until cache and routing targets recover. |
Worked example: a charity helpline FAQ assistant
A UK charity runs an FAQ assistant on its website. Baseline metrics over one week: 8,000 sessions, 5,600 resolved without escalation, cost‑per‑resolution £0.19, cache hit rate 12%, model mix 85% on a large model, p95 latency 3.6s.
Changes in week two:
- Split and cache a stable “policy pack” and system prompt; aim for 50–70% reuse across sessions. Many platforms price cached reads far lower than base tokens and reduce latency. docs.cloud.google.com
- Enable a simple router: small inputs and “common intents” go to a cheaper model; complex or sensitive categories escalate. Some cloud services advertise cost improvements here when routing is tuned. aws.amazon.com
- Move nightly re‑summaries and tagging into a batch job using discounted rates. openai.com
Result after seven days: cache hit rate rises to 41%, model mix shifts to 62% “mini/fast”, p95 latency drops to 2.1s, and cost‑per‑resolution falls to £0.11 with the same resolution rate. Directors keep funding growth because quality was unchanged and costs are now predictable.
Helpful vendor features (and what they actually mean)
- Context/prompt caching: stores repeated context so you don’t pay full price each time; providers document read discounts (e.g., cached reads at around 10% of standard input cost in some contexts). Always check TTL and storage charges; a sketch of reading per‑request cached‑token counts follows this list. docs.cloud.google.com
- Batch APIs: queue low‑urgency work for lower rates; expect longer completion windows. One major provider lists a 50% discount on batch tokens. openai.com
- Intelligent prompt routing: a managed router that tries cheaper models first and escalates on complexity, with claimed cost savings up to around 30%. Validate with your data. aws.amazon.com
- Autoscaling when self‑hosting: scale down idle capacity and reduce under‑utilisation. docs.cloud.google.com
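To attribute cache savings per request, read the cached‑token counter from each response’s usage object. The sketch below assumes an OpenAI‑style usage payload that reports prompt_tokens_details.cached_tokens; other providers expose similar counters under different names, so check the relevant docs.

```python
# Per-request cache attribution, assuming an OpenAI-style usage object.
def cache_stats(usage) -> dict:
    prompt_tokens = getattr(usage, "prompt_tokens", 0)
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    return {
        "cached_tokens": cached,
        "cache_hit_rate": cached / prompt_tokens if prompt_tokens else 0.0,
    }
```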
What to build next
If you want to go further, line this dashboard up with your delivery roadmap:
- Stress‑test capacity and costs with a time‑boxed “load test and scale plan” before major campaigns. See our 15‑day AI load‑test playbook.
- Adopt a vendor scorecard to compare cost features (caching, batch, routing, reservations) side‑by‑side. Try our 2026 AI vendor scorecard.
- Tune unit economics across a quarter with our 30‑60‑90 day unit‑economics plan.
- When you ship cost changes, use safe deployment patterns to avoid regressions. See version pinning, canaries and rollbacks.
Common pitfalls (and fixes)
- Changing the system prompt weekly kills cache reuse. Fix: freeze the core prompt; change only narrow, cached annexes (e.g., holiday policies).
- “Large by default” model choice inflates tokens for simple tasks. Fix: default to mini; escalate on signals.
- Retrieval grabs whole PDFs so you pay to re‑read the same content. Fix: better chunking and metadata; cap retrieved tokens per turn.
- No batch lane: your real‑time service is clogged by back‑office jobs. Fix: nightly batch with SLA and retry rules.
- Buying reservations too early: volume isn’t proven yet. Fix: deploy first, then buy capacity covering the steady‑state you actually use. learn.microsoft.com
Sources and further reading
- FinOps Foundation: Optimising GenAI usage and the changing FinOps Framework. finops.org
- Google Cloud: Prompt/context caching and AI/ML cost optimisation. docs.cloud.google.com
- OpenAI: API pricing and Batch API discount. openai.com
- AWS: Bedrock pricing page detailing intelligent prompt routing claims. aws.amazon.com
- Microsoft Learn: Managing Azure OpenAI costs and provisioned throughput reservations. learn.microsoft.com