For many UK SMEs, charities and law practices, late November to early December is crunch time: Black Friday and Cyber Monday for retailers and membership organisations, then Giving Tuesday for fundraisers. In 2025 those dates fall on Friday 28 November, Monday 1 December and Tuesday 2 December respectively. If your website or app already uses AI—for search, advice, triage, content or customer support—now is the moment to make sure it stays fast, reliable and affordable when demand spikes. aboutamazon.co.uk
This guide gives non‑technical leaders a plain‑English, three‑week playbook to get ready. You’ll forecast demand, choose the right capacity model, set cost guardrails, and run a safe cutover—so your AI features scale without surprises.
First, a quick capacity model you can use in a meeting
You don’t need to know tokens or throughput in detail to ask the right questions. Use this simplified model to check your exposure:
- Peak concurrent users (PCU): The number of users active during your busiest 10 minutes; estimate it from last year’s web analytics or telephony logs, then apply a growth factor for your 2025 promotions.
- Requests per user per session (RPSu): How many AI calls a typical session makes (e.g. 1–3 for search, 3–6 for a chatty assistant).
- Tokens per request (TPR): Average input + output size. If you don’t track tokens yet, assume 1,000 for a rich answer and 200–400 for short responses.
- Model choice: “Mini”/“flash” models for simple tasks; larger models only where quality demonstrably pays back.
Back‑of‑envelope: Estimated tokens per minute ≈ PCU × (RPSu ÷ session minutes) × TPR. Compare that number with your provider’s tokens‑per‑minute (TPM) or requests‑per‑minute (RPM) quotas to see if you’ll throttle under load. Azure OpenAI exposes TPM/RPM per region and model; Google Vertex AI also publishes per‑project RPM, and AWS Bedrock offers Provisioned Throughput for guaranteed capacity. learn.microsoft.com
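If it helps to sanity‑check the arithmetic in a meeting, here is a minimal sketch of that calculation in Python. Every traffic figure is a placeholder to replace with your own analytics, and the 30,000 TPM quota is a hypothetical limit used for comparison, not any provider’s published number.

```python
# Back-of-envelope peak capacity check (illustrative numbers only).
peak_concurrent_users = 400   # PCU: busiest 10 minutes, grown for 2025 promotions
requests_per_session = 4      # RPSu: AI calls a typical session makes
session_minutes = 6           # average session length
tokens_per_request = 800      # TPR: average input + output tokens

requests_per_minute = peak_concurrent_users * requests_per_session / session_minutes
tokens_per_minute = requests_per_minute * tokens_per_request

quota_tpm = 30_000            # hypothetical TPM quota for your deployment
headroom = quota_tpm / tokens_per_minute

print(f"Estimated load: {requests_per_minute:.0f} RPM, {tokens_per_minute:,.0f} TPM")
print(f"Headroom vs quota: {headroom:.2f}x"
      + (" - at risk of throttling" if headroom < 1.2 else ""))
```

Re‑run it with a 1.2× stress multiplier on the user numbers to see how much headroom you really have before requesting quota increases.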
Your 3‑week peak‑season plan
Week 1 — Baseline and guardrails
- Decide where you need “premium” AI—and where you don’t. Map every AI call in your user journeys. Label each as standard (use fast, cheaper models) or premium (use higher‑quality models). Only 10–20% typically need premium.
- Set hard token budgets per feature. Cap max output and keep prompts lean. Many providers enforce RPM/TPM; you can distribute quota across deployments and regions to avoid a single bottleneck. Ask your team to show the TPM and RPM they’ve actually been assigned. learn.microsoft.com
- Turn on caching and rate limits at the edge. If you serve repeatable prompts or common FAQs, enable an AI‑aware gateway with response caching and sliding‑window rate limiting to cut cost and smooth spikes (a sliding‑window sketch follows this list). Cloudflare’s AI Gateway supports both and gives a live dashboard for requests, tokens, cache hit‑rate, errors and cost. developers.cloudflare.com
- Batch non‑urgent work. For offline tasks (e.g. pre‑tagging products, nightly summaries), queue them through a batch API. OpenAI’s pricing page notes up to ~50% savings when using its Batch API for asynchronous jobs. openai.com
- Put simple circuit‑breakers in front of AI. If latency crosses a threshold or the gateway returns too many 429s (rate‑limited), degrade gracefully: show a fallback answer, skip enrichment, or switch to a cheaper model tier (a minimal breaker sketch also follows below). Azure’s docs explain why you may see 429s under quota pressure. learn.microsoft.com
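For the edge rate limiting in the caching bullet above, this is a minimal sketch of the sliding‑window idea, assuming a single process and an in‑memory store. In production you would normally rely on your gateway’s built‑in policies (Cloudflare AI Gateway, for example) rather than hand‑rolled code.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window_seconds` period."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop requests that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False  # caller should return HTTP 429 or serve a cached answer

# Example: cap a low-priority chat feature at 120 requests per rolling minute.
chat_limiter = SlidingWindowLimiter(limit=120, window_seconds=60)
if not chat_limiter.allow():
    print("Throttled: serve cached answer or ask the user to retry")
```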
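And for the circuit‑breaker bullet, here is a minimal sketch of the degrade‑gracefully logic. The helpers call_premium_model, call_cheaper_model and fallback_answer are hypothetical stand‑ins you would map to your own stack.

```python
import time

# Stand-ins for your real integrations (all hypothetical):
def call_premium_model(question: str, timeout: float) -> str: return "premium answer"
def call_cheaper_model(question: str) -> str: return "cheaper-tier answer"
def fallback_answer(question: str) -> str: return "static fallback answer"

class CircuitBreaker:
    """Trip to a cheaper path after repeated failures or slow responses."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.cooldown:
            self.opened_at, self.failures = None, 0  # cooldown over: try premium again
            return False
        return True

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def answer(question: str) -> str:
    if breaker.is_open():
        return call_cheaper_model(question)            # degraded but still useful
    try:
        reply = call_premium_model(question, timeout=2.5)
        breaker.record(ok=True)
        return reply
    except Exception:                                  # timeouts or 429s from your provider SDK
        breaker.record(ok=False)
        return fallback_answer(question)

print(answer("How do I change my delivery address?"))
```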
Week 2 — Load test and choose your scaling path
- Run a 60‑minute traffic ramp (a simple harness sketch follows this list). Simulate 1.2× your forecast peak. Watch P95 latency, error rates and cost per 1,000 requests. If you hit rate limits, you have two levers:
- Increase shared quotas with your cloud provider; Vertex AI and Azure both support quota increase requests. cloud.google.com
- Buy guaranteed capacity via AWS Bedrock Provisioned Throughput, sold in model units that each provide a set tokens‑per‑minute throughput. docs.aws.amazon.com
- Tune your gateway policies. Set per‑feature rate limits and burst tolerance aligned to revenue or mission importance. For example, protect payment or donation flows first; throttle low‑value, anonymous chat later. Cloudflare AI Gateway offers fixed or sliding windows to shape bursts. developers.cloudflare.com
- Measure caching and batching ROI. Track cache hit‑rate, saved tokens and reduced provider calls in the analytics. If the hit‑rate is below 10%, focus on templating prompts and standardising answer formats to raise reuse. Gateways surface cost and token trends so you can prove savings. developers.cloudflare.com
- Right‑size model tiers. Where quality is comparable, switch to “mini/flash” variants for 70–90% lower unit costs; keep bigger models for edge cases (a worked cost comparison follows this list). OpenAI’s public pricing shows large differences between model families and discounts for cached input tokens, which you can exploit when prompts repeat. openai.com
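For the traffic ramp, the sketch below shows the shape of a simple ramp‑and‑measure harness. call_ai_endpoint is a hypothetical stand‑in (most teams would use a dedicated load‑testing tool such as k6 or Locust instead), and the numbers are placeholders.

```python
import random
import statistics

def call_ai_endpoint() -> tuple[float, bool]:
    """Hypothetical stand-in: returns (latency_seconds, success)."""
    latency = random.uniform(0.3, 3.0)
    return latency, latency < 2.5

def ramp_test(target_rpm: int, minutes: int = 60) -> None:
    """Ramp linearly from 10% to 120% of the forecast peak, then report P95 and errors."""
    latencies, errors, total = [], 0, 0
    for minute in range(minutes):
        rpm = int(target_rpm * (0.1 + 1.1 * minute / max(minutes - 1, 1)))
        for _ in range(rpm):
            latency, ok = call_ai_endpoint()
            latencies.append(latency)
            total += 1
            errors += 0 if ok else 1
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
    print(f"P95 latency: {p95:.2f}s, error rate: {errors / total:.1%}, requests: {total}")

ramp_test(target_rpm=300)
```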
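For the model‑tier decision, a worked cost‑per‑1,000‑requests comparison is often all a leadership meeting needs. The per‑token prices below are placeholders, not any provider’s published rates; substitute the figures from your own pricing page.

```python
# Cost per 1,000 requests for two hypothetical model tiers (prices are placeholders, in GBP).
AVG_INPUT_TOKENS = 600
AVG_OUTPUT_TOKENS = 300

tiers = {
    "premium": {"input_per_1k_tokens": 0.0050, "output_per_1k_tokens": 0.0150},
    "mini":    {"input_per_1k_tokens": 0.0004, "output_per_1k_tokens": 0.0016},
}

for name, price in tiers.items():
    per_request = (
        AVG_INPUT_TOKENS / 1000 * price["input_per_1k_tokens"]
        + AVG_OUTPUT_TOKENS / 1000 * price["output_per_1k_tokens"]
    )
    print(f"{name}: £{per_request * 1000:.2f} per 1,000 requests")
```

If the cheaper tier passes your quality checks (see the 5‑Day AI Evaluation Sprint), that gap compounds quickly across peak week.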
Week 3 — Cutover, observability and steady‑state ops
- Canary release with a kill‑switch. Start at 5–10% of traffic and promote in steps (a minimal routing sketch follows this list). Keep a visible toggle to revert to the pre‑peak configuration within minutes if costs spike or latency degrades.
- Live dashboards and budget alerts. Put three tiles on a screen: P95 latency, error/429 rate, and £ cost per 1,000 requests. Gateways and provider consoles give real‑time request, token and error metrics. developers.cloudflare.com
- Escalation paths and on‑call drill. Agree who can lower model tiers, increase quotas, or flip to static fallbacks. If you haven’t yet, practise a 20‑minute incident drill—for structure, reuse our AI Incident Drill.
- Post‑event wrap‑up (within 48 hours). Export usage and cost by feature. Lock in the improvements that saved the most (prompt trims, caching rules, model switches). Capture “what we’d buy up‑front next year”—reserved capacity or standing quota increases.
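For the canary, here is a minimal sketch of percentage‑based routing with a kill‑switch, assuming a hypothetical get_flag helper backed by your feature‑flag or config store; most teams would use their existing flag tooling rather than code like this.

```python
import hashlib

# Hypothetical flag store; in practice read these from your feature-flag service.
FLAGS = {"peak_config_enabled": True, "peak_config_percent": 10}

def get_flag(name: str):
    return FLAGS[name]

def use_peak_config(user_id: str) -> bool:
    """Send a stable percentage of users to the new peak-season configuration."""
    if not get_flag("peak_config_enabled"):  # the kill-switch: set False to revert in minutes
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < get_flag("peak_config_percent")

# Promote in steps by raising peak_config_percent: 10 -> 25 -> 50 -> 100.
print(use_peak_config("user-1234"))
```

Hashing the user ID keeps each visitor on the same side of the split as you promote, so the experience stays consistent while you watch the dashboards.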
Decision checklist for leaders
Capacity and reliability
- Do we know our expected and stress tokens‑per‑minute at peak? Which region quotas or deployments apply? (Azure and Google expose these explicitly; ask for screenshots.) learn.microsoft.com
- If throttled, what’s our first lever: quota increase, region split, or provisioned throughput? (Bedrock’s model units give guaranteed input and output tokens per minute at a fixed price.) docs.aws.amazon.com
Cost guardrails
- Where do we apply caching? Expected cache hit‑rate? Who reviews the analytics daily? developers.cloudflare.com
- Which jobs can move to batch for bulk discounts or off‑peak processing? (OpenAI’s Batch API: up to ~50% savings.) openai.com
- What’s the per‑feature monthly budget and automatic throttle if breached?
User experience
- What’s the acceptable P95 response time by journey? What’s the fallback if we exceed it?
- Do we degrade gracefully—shorter answers, hide optional summaries, switch to simpler models—before we drop traffic?
Target KPIs for peak week
| Area | Good | Alert |
|---|---|---|
| P95 model response time | ≤ 1.5 s | > 2.5 s for 5+ min |
| Error / 429 rate | ≤ 1% | ≥ 3% for 3+ min |
| Cost per 1,000 requests | On or under budget | > 10% over for 30 min |
| Cache hit‑rate (where enabled) | ≥ 20% | < 10% (fix prompts/templates) |
| Batch queue age (non‑urgent) | Clears overnight | Spills into business hours |
Tip: agree thresholds per feature (e.g. donations vs browsing) and wire these to alerts in your monitoring.
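If your monitoring tool accepts custom checks, a sketch like the one below turns the table into alerts. The thresholds mirror the table, and the metrics dictionary is a placeholder for whatever your gateway or provider console actually reports.

```python
# Per-feature KPI thresholds (example split: donations vs browsing).
THRESHOLDS = {
    "donations": {"p95_seconds": 1.5, "error_rate": 0.01, "cost_over_budget": 0.10},
    "browsing":  {"p95_seconds": 2.5, "error_rate": 0.03, "cost_over_budget": 0.10},
}

def check_feature(feature: str, metrics: dict) -> list[str]:
    """Return an alert message for each breached threshold."""
    t = THRESHOLDS[feature]
    alerts = []
    if metrics["p95_seconds"] > t["p95_seconds"]:
        alerts.append(f"{feature}: P95 {metrics['p95_seconds']:.1f}s over {t['p95_seconds']}s target")
    if metrics["error_rate"] > t["error_rate"]:
        alerts.append(f"{feature}: error/429 rate at {metrics['error_rate']:.1%}")
    if metrics["cost_over_budget"] > t["cost_over_budget"]:
        alerts.append(f"{feature}: cost {metrics['cost_over_budget']:.0%} over budget")
    return alerts

# Placeholder metrics you would pull from your gateway or provider console.
print(check_feature("donations",
                    {"p95_seconds": 2.8, "error_rate": 0.005, "cost_over_budget": 0.02}))
```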
Risk and cost controls you can choose from
| Risk at peak | Control | Cost impact | Notes |
|---|---|---|---|
| Rate‑limit errors (429s) | Request quota increase; split deployments across regions; edge rate‑limiting with bursts | Low–Medium | Azure/Vertex define RPM/TPM; gateways smooth bursts. learn.microsoft.com |
| Runaway token spend | Hard caps on output length; cheaper model tiers for standard flows; cached responses | Low | OpenAI shows large price deltas and cached‑input discounts. openai.com |
| Throughput shortfall | Provisioned Throughput (Bedrock) for guaranteed model units | Medium (but predictable) | Fixed, reserved capacity for peak week. docs.aws.amazon.com |
| Latency spike | Cache common prompts; pre‑compute answers; batch non‑urgent jobs | Low | Gateway analytics verify cache hits and savings. developers.cloudflare.com |
| Single‑provider outage | Gateway‑level fallback routing to a secondary model | Low–Medium | Many gateways support dynamic routing and unified keys. ai.cloudflare.com |
Procurement questions to ask this week
- What are our documented TPM/RPM limits per region and model? Can we see them in the console, and what’s the SLA for increases? learn.microsoft.com
- Do you offer reserved or provisioned capacity for peak season? How is throughput measured (tokens per minute, output per minute)? docs.aws.amazon.com
- Can we enable response caching and rate limiting in front of the provider? What analytics do we get on tokens, cache hits and cost? developers.cloudflare.com
- For offline jobs, do you support discounted batch processing? What’s the expected saving? openai.com
How this meshes with other playbooks
If you’re still setting up the basics, pair this plan with:
- Cost guardrails that scale for budgets, alerts and token discipline.
- The Quiet Cutover for safe go‑lives and canaries.
- 5‑Day AI Evaluation Sprint to validate cheaper model tiers before switching.
- AI Incident Drill if you need an escalation runbook.
Wrap‑up: What “ready” looks like by next Friday
- Peak traffic forecast documented with a simple TPM/RPM comparison per provider.
- Token budgets and model tiers agreed per feature; fallbacks tested.
- Gateway policies live: caching, rate limits, analytics and alerts configured.
- Batch pipelines running for non‑urgent tasks.
- Cutover checklist printed; one‑click rollback proven.
Appendix: Why these dates matter for load
Retail peaks and charity appeals often bunch together: Black Friday 2025 is Friday 28 November, Cyber Monday is Monday 1 December, and UK Giving Tuesday is Tuesday 2 December. Planning for a three‑day surge across both commercial checkouts and donation pages reduces the risk of surprise costs and timeouts. aboutamazon.co.uk