[Image: Directors reviewing an AI cost dashboard during a board meeting]
Cost governance & scalability

The AI Unit Economics Board Pack: 90 minutes to turn tokens into pounds

AI spend should be as visible and governable as cloud hosting or payroll. Yet many UK SMEs and charities still see “AI” as an experimental line item—hard to predict, harder to control. This article gives directors, ops managers and trustees a 90‑minute monthly routine and a one‑page board pack to make confident, measurable decisions about AI cost and value.

We’ll define simple unit economics, show which levers actually move the bill this month (batching, caching, reranking, and reserved throughput), and give you procurement questions and KPIs that don’t require you to be technical.

Your monthly AI board pack (one page)

Bring a single page with these four blocks. If a number can’t be filled, it’s a signal to fix measurement before adding more AI features.

1) Unit economics

  • Cost per resolved support ticket
  • Cost per qualified lead (marketing/sales)
  • Cost per document processed (back‑office, compliance, research)

Express each as a £ cost, with the trend vs last month and vs target.

2) Cost drivers

  • Top models used and share of spend
  • Average input tokens per request, average output tokens per response
  • Cache hit rate (if available) and batch vs real‑time share (a sketch for pulling these from your usage export follows this list)
  • Retrieval context size (average characters or tokens)
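
If your provider's usage export includes per‑request rows, the driver metrics above take only a few lines to compute. A minimal sketch in Python, assuming a CSV export with illustrative column names (model, input_tokens, output_tokens, cached_tokens, is_batch, cost); your own export will name these differently:

```python
import pandas as pd

# Hypothetical usage export; real column names vary by provider and tool.
usage = pd.read_csv("usage_export.csv")  # model, input_tokens, output_tokens, cached_tokens, is_batch, cost

drivers = {
    "avg_input_tokens": usage["input_tokens"].mean(),
    "avg_output_tokens": usage["output_tokens"].mean(),
    # Cache hit rate: share of input tokens that were served from cache.
    "cache_hit_rate": usage["cached_tokens"].sum() / usage["input_tokens"].sum(),
    # Batch share: proportion of requests sent through the batch route.
    "batch_share": usage["is_batch"].mean(),
    # Model mix: share of spend by model, for the "top models" line in the pack.
    "model_mix": usage.groupby("model")["cost"].sum().div(usage["cost"].sum()).round(2).to_dict(),
}
print(drivers)
```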

3) Quality & risk guardrails

  • Accuracy/Resolution rate vs SLA
  • Escalation rate to humans
  • Incident summary (downtime, unexpected spend, harmful outputs)

If you need a template for quality SLAs, use the scoreboard approach we outlined previously in The AI Quality Scoreboard.

4) Actions for this month

  • Quick wins (e.g., switch nightly jobs to Batch API)
  • Medium bets (e.g., add a reranker to cut context by 50–80%)
  • Strategic (e.g., reserved throughput for predictable volumes)

Where the money actually goes

Most AI bills are driven by a few levers you can control in weeks, not quarters:

  • Tokens in, tokens out. Input tokens are your prompt and any retrieved context (documents, transcripts). Output tokens are the model’s reply. Many providers price these differently; some add discounts for cached tokens (see the worked example after this list). help.openai.com
  • Batch vs real‑time. Non‑urgent workloads (e.g., bulk summarisation) can be 50% cheaper using batch processing. help.openai.com
  • Caching. If your prompts reuse large chunks of context (product catalogues, policies), cache hits can be discounted by ~50% (OpenAI) or ~90% (Vertex AI Gemini). openai.com
  • Reserved throughput. If you have predictable volumes, reserving capacity can reduce unit costs and guarantee throughput during busy periods. openai.com
  • Retrieval depth. Passing fewer, more relevant documents to the model via reranking materially reduces token spend and latency. cohere.com
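
To see how these levers interact before touching anything, a back‑of‑the‑envelope estimator helps. The prices and discount rates below are placeholders, not quotes; substitute your provider's current price list:

```python
def monthly_cost(requests, in_tokens, out_tokens,
                 price_in_per_m=2.50, price_out_per_m=10.00,   # placeholder £ per 1M tokens
                 cached_share=0.0, cache_discount=0.5,         # e.g. ~50% off cached input tokens
                 batch_share=0.0, batch_discount=0.5):         # e.g. ~50% off batched traffic
    """Rough monthly estimate in £; every rate here is an assumption to replace."""
    input_cost = requests * in_tokens / 1e6 * price_in_per_m
    output_cost = requests * out_tokens / 1e6 * price_out_per_m
    # Cached input tokens are billed at a discount on cache hits.
    input_cost -= input_cost * cached_share * cache_discount
    total = input_cost + output_cost
    # Batched traffic is typically discounted across input and output.
    total -= total * batch_share * batch_discount
    return round(total, 2)

# Example: 50,000 requests/month, 3,000 tokens in, 400 out,
# 60% of input tokens cached, 30% of traffic batched.
print(monthly_cost(50_000, 3_000, 400, cached_share=0.6, batch_share=0.3))
```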

If you’re new to cost guardrails, start with “caps and alerts” and an allow‑list of approved models and configurations. Our earlier playbook, Beating AI Bill Shock, is a good primer.

Build your unit economics in 30 minutes

  1. Pick the outcome (ticket resolved, lead qualified, document processed).
  2. Collect counts for the last 30 days: number of outcomes, number of AI calls, average input/output tokens, cache hits, batch share.
  3. Get the costs from your bill or usage export. If your provider or tool supports the FinOps FOCUS billing format, pull that—it standardises columns like BilledCost and PricingQuantity across vendors and now covers SaaS/PaaS and token‑based charges. focus.finops.org
  4. Calculate Cost per outcome = Total AI cost ÷ Outcomes, and include a simple bridge: cost change explained by volume, model mix, tokens per request, cache hit rate and batch share (a sketch of this calculation follows).
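
Here is a minimal sketch of steps 3 and 4, assuming a FOCUS‑style export with BilledCost and ServiceName columns and an outcome count pulled from your helpdesk or CRM; the file name and figures are illustrative:

```python
import pandas as pd

# FOCUS-style billing export (column names per the FOCUS spec); outcomes come from your own systems.
costs = pd.read_csv("focus_export.csv")   # includes BilledCost, ServiceName, ChargePeriodStart
ai_cost = costs.loc[
    costs["ServiceName"].str.contains("AI|OpenAI|Vertex", case=False, na=False), "BilledCost"
].sum()

outcomes = 1_842                          # e.g. resolved tickets last month, from your helpdesk
cost_per_outcome = ai_cost / outcomes
print(f"Cost per resolved ticket: £{cost_per_outcome:.2f}")

# A simple bridge: split the month-on-month change into a volume effect and a unit-cost/mix effect.
prev_cost, prev_outcomes = 3_120.00, 1_700   # last month's figures (illustrative)
volume_effect = (outcomes - prev_outcomes) * (prev_cost / prev_outcomes)
rate_effect = (ai_cost - prev_cost) - volume_effect
print(f"Change from volume: £{volume_effect:.2f}, change from unit cost/mix: £{rate_effect:.2f}")
```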

Tip: use the same unit costs across teams so your board can compare like with like. The FOCUS spec exists so your finance, data and engineering teams stop arguing about column names and can query everything in one schema. focus.finops.org

The 90‑minute monthly routine

  1. 15 min – Prep. Export last month’s usage and costs (prefer FOCUS), refresh your board pack, and pre‑draft three proposed actions.
  2. 45 min – Review. Walk the four blocks. Probe any spikes in tokens per request, cache hit drops, or a model’s share jumping.
  3. 20 min – Decisions. Approve a small set of changes (e.g., move nightly jobs to batch; add reranking to the top workflow; enable caching on the FAQ bot; purchase a 3‑month reserved throughput block for December–February).
  4. 10 min – Assign owners and dates. Every action gets a named owner and a check‑in date before the next board.

If your next month is peak season, combine this with our scaling guide, Peak‑season AI that scales, to avoid rate‑limit surprises.

Cost levers you can pull this month

For each lever, here is what to change, why it saves money, and what to watch.

Batch non‑urgent jobs
  • What to change: Move offline summarisation, dataset labelling and bulk document Q&A to the Batch API (see the submission sketch after these levers).
  • Why it saves money: Typically a 50% discount vs real‑time, and often higher rate limits. help.openai.com
  • What to watch: Jobs return within hours, not seconds; set SLAs accordingly.

Enable caching
  • What to change: Make prompts reuse a common prefix (policies, product data) and turn on prompt/context caching where available.
  • Why it saves money: Cached tokens are discounted: OpenAI prompt caching ~50%; Gemini implicit/explicit caching ~90% off input tokens on cache hits. openai.com
  • What to watch: Caches are short‑lived (minutes to an hour); design prompts to maximise reuse. openai.com

Add a reranker
  • What to change: Rerank the top N retrieved chunks and pass only the best 3–5 into the model.
  • Why it saves money: Fewer tokens per request and faster responses; vendor docs emphasise the cost and latency reductions. cohere.com
  • What to watch: A small extra inference cost for reranking; tune N to balance quality and spend. docs.voyageai.com

Reserve throughput
  • What to change: Purchase provisioned throughput for steady traffic to cut unit cost and guarantee capacity.
  • Why it saves money: Commitment discounts and predictable spend; Vertex AI publishes clear GSU pricing. cloud.google.com
  • What to watch: Over‑buying wastes budget; review the weekly burn‑down and right‑size. docs.cloud.google.com

Right‑size the model
  • What to change: Route simple requests to small/fast models; escalate only when needed.
  • Why it saves money: Lower token pricing and latency for the majority of calls; formal model optimisers can balance cost vs quality. cloud.google.com
  • What to watch: Guardrails must enforce when to escalate to larger models.
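
As a concrete example of the batching lever, here is a sketch of moving a nightly summarisation job to OpenAI's Batch API using the official Python SDK; the model name and file paths are placeholders, and check the current Batch API documentation before relying on it:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# batch_requests.jsonl contains one JSON request per line, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Summarise: ..."}]}}
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # batch jobs complete within hours, not seconds
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```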

Not sure which lever to pull first? If you run a helpdesk or claims flow, start by caching the shared policy text and adding reranking to cut the context (a reranking sketch follows). For research backlogs or content generation, batch by default.
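
For the reranking step, here is a minimal sketch using Cohere's rerank endpoint as one example; any reranker will do, and the model name, chunk texts and top_n value are illustrative and should be tuned against your own quality metrics:

```python
import cohere

co = cohere.Client()  # reads the CO_API_KEY environment variable

query = "What is our refund policy for damaged goods?"
# In practice these would be the 20-50 chunks returned by your vector search.
retrieved_chunks = [
    "Refunds for damaged goods are issued within 14 days of the claim...",
    "Our head office opening hours are 9am to 5pm, Monday to Friday...",
    "Damaged items must be reported with a photo within 48 hours of delivery...",
    "Gift cards are non-refundable and valid for 24 months...",
]

# Keep only the most relevant chunks instead of sending everything to the generator.
reranked = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=retrieved_chunks, top_n=2)
context = [retrieved_chunks[r.index] for r in reranked.results]
# 'context' now holds 2 chunks rather than 4; in production you might keep 3-5 of 25.
```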

Procurement questions that smoke out hidden costs

  1. Do you support prompt/context caching? What is the discount and cache lifetime? How is a “cache hit” reported on the bill? openai.com
  2. Is there a Batch API and what is the discount vs real‑time? What’s the typical turnaround time? help.openai.com
  3. Can we reserve throughput? What commitment tiers and per‑unit prices apply, and can we split by region or model? cloud.google.com
  4. How can we export usage and costs in FOCUS format for cross‑vendor reporting? focus.finops.org
  5. Do you report cached tokens, input tokens and output tokens separately in usage metadata? (You can verify this yourself; see the sketch after this list.) openai.com
  6. Are reranking/embedding charges separate from generation? How are they counted? cohere.com
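
Question 5 is easy to verify yourself once you have API access. A sketch using the OpenAI Python SDK, whose chat responses report cached input tokens separately in the usage block (other providers use different field names):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model
    messages=[{"role": "user", "content": "Summarise our refund policy in one sentence."}],
)

usage = response.usage
print("input tokens:", usage.prompt_tokens)
print("output tokens:", usage.completion_tokens)
# Cached input tokens are reported separately when prompt caching applies.
print("cached input tokens:", usage.prompt_tokens_details.cached_tokens)
```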

KPIs to track every month

  • Unit cost per outcome (ticket, lead, document)
  • Tokens per request (avg/95th percentile) and tokens per outcome
  • Cache hit rate and share of batch jobs
  • Model mix (% of spend by model) and reranked context size (average documents passed to generation)
  • Quality KPIs tied to your SLA (accuracy/resolution rate, handover rate)
  • Incidents (production errors, unexpected spend, rate‑limit spikes)

Tie these to the decision cadence from your 5‑day evaluation sprint so you can prove quality and lock in costs before scaling.

Common pitfalls (and how to avoid them)

  • Unbounded prompts. Long context windows feel convenient but turn into cost multipliers. Use reranking to keep only the most relevant chunks. cohere.com
  • All real‑time, all the time. Many workloads don’t need instant responses. Move them to batch for automatic savings. help.openai.com
  • Ignoring cacheability. If your assistants repeatedly read the same policy pack or handbook, you’re paying twice. Design prompts to maximise common prefixes and enable caching. cloud.google.com
  • Single‑vendor lock‑in for reporting. Without a standard, finance can’t see the whole picture. Adopt FOCUS for a unified view across vendors. focus.finops.org

30/60/90‑day plan

Days 1–30 (stabilise)

  • Set monthly board pack and owners
  • Turn on batch for offline jobs; add alerts at spend thresholds (a minimal alert sketch follows this list)
  • Enable caching where supported (OpenAI prompt caching; Gemini context caching) openai.com
  • Add reranking to the highest‑volume retrieval workflow
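
For the alerting item, a scheduled script is enough to start. A minimal sketch, assuming month‑to‑date spend can be summed from a FOCUS‑style export and posted to a chat webhook; the URL, file name and threshold are placeholders:

```python
import pandas as pd
import requests

MONTHLY_BUDGET = 1_500.00                        # £, placeholder threshold agreed at the board
ALERT_WEBHOOK = "https://example.com/alerts"     # e.g. a Slack/Teams incoming webhook

# Month-to-date AI spend from a FOCUS-style billing export.
spend = pd.read_csv("focus_export.csv")["BilledCost"].sum()

if spend > 0.8 * MONTHLY_BUDGET:                 # warn at 80% of budget
    requests.post(ALERT_WEBHOOK, json={
        "text": f"AI spend is £{spend:,.2f}, {spend / MONTHLY_BUDGET:.0%} of this month's budget."
    })
```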

Days 31–60 (optimise)

  • Introduce model routing (small‑by‑default, escalate on need; see the sketch after this list)
  • Pilot reserved throughput for a steady flow (e.g., daily FAQs) cloud.google.com
  • Adopt FOCUS export in your cost tooling; baseline unit costs focus.finops.org
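
A sketch of the small‑by‑default routing pattern; the model names, the length heuristic and the escalation trigger are illustrative and should be replaced with rules tuned on your own evaluation set:

```python
from openai import OpenAI

client = OpenAI()
SMALL, LARGE = "gpt-4o-mini", "gpt-4o"   # placeholder model names

def answer(question: str, context: str, model: str | None = None) -> str:
    # Small-by-default: short, routine questions go to the cheaper model.
    if model is None:
        model = SMALL if len(question) < 400 else LARGE
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": context},
                  {"role": "user", "content": question}],
    )
    text = response.choices[0].message.content
    # Naive escalation trigger: retry on the larger model if the small one hedges.
    if model == SMALL and "not sure" in text.lower():
        return answer(question, context, model=LARGE)
    return text
```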

Days 61–90 (scale with control)

  • Negotiate supplier terms using real usage (batch share, cache rate, reserved throughput)
  • Publish monthly unit‑cost trends to the board alongside quality SLAs
  • Plan next quarter’s experiments with explicit savings targets

Need a hand?

We can assemble your first board pack, wire up a FOCUS‑based cost export, and deliver a “switch‑list” of 3–5 changes that cut cost without hurting quality—usually in two weeks. The AI Skills Matrix shows who you already have to run this.

Why this works

These levers are simple because they follow how providers bill. Token‑based pricing and discounts for cached tokens and batched work are now standardised in documentation, and reserved throughput options give predictable pricing for steady demand. Your job is to translate “per token” into “per outcome” and review it monthly—exactly like any other operating cost. help.openai.com