For many UK SMEs, charities and law practices, late November to early December is crunch time: Black Friday and Cyber Monday for retailers and membership organisations, then Giving Tuesday for fundraisers. In 2025 those dates fall on Friday 28 November, Monday 1 December and Tuesday 2 December respectively. If your website or app already uses AI—for search, advice, triage, content or customer support—now is the moment to make sure it stays fast, reliable and affordable when demand spikes. aboutamazon.co.uk
This guide gives non‑technical leaders a plain‑English, three‑week playbook to get ready. You’ll forecast demand, choose the right capacity model, set cost guardrails, and run a safe cutover—so your AI features scale without surprises.
First, a quick capacity model you can use in a meeting
You don’t need to know tokens or throughput in detail to ask the right questions. Use this simplified model to check your exposure:
- Peak concurrent users (PCU): The number of users active during your busiest 10 minutes; estimate it from last year’s web analytics or telephony logs, then apply a growth factor for your 2025 promotions.
- Requests per user per session (RPSu): How many AI calls a typical session makes (e.g. 1–3 for search, 3–6 for a chatty assistant).
- Tokens per request (TPR): Average input + output size. If you don’t track tokens yet, assume 1,000 for a rich answer and 200–400 for short responses.
- Model choice: “Mini”/“flash” models for simple tasks; larger models only where quality demonstrably pays back.
Back‑of‑envelope: Estimated tokens per minute ≈ PCU × (RPSu ÷ session minutes) × TPR. Compare that number with your provider’s tokens‑per‑minute (TPM) or requests‑per‑minute (RPM) quotas to see if you’ll throttle under load. Azure OpenAI exposes TPM/RPM per region and model; Google Vertex AI also publishes per‑project RPM, and AWS Bedrock offers Provisioned Throughput for guaranteed capacity. learn.microsoft.com
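If it helps to sanity‑check the arithmetic in a meeting, here is a minimal sketch of that calculation in Python. Every traffic figure is a placeholder to replace with your own analytics, and the 30,000 TPM quota is a hypothetical limit used for comparison, not any provider’s published number.

```python
# Back-of-envelope peak capacity check (illustrative numbers only).
peak_concurrent_users = 400   # PCU: busiest 10 minutes, grown for 2025 promotions
requests_per_session = 4      # RPSu: AI calls a typical session makes
session_minutes = 6           # average session length
tokens_per_request = 800      # TPR: average input + output tokens

requests_per_minute = peak_concurrent_users * requests_per_session / session_minutes
tokens_per_minute = requests_per_minute * tokens_per_request

quota_tpm = 30_000            # hypothetical TPM quota for your deployment
headroom = quota_tpm / tokens_per_minute

print(f"Estimated load: {requests_per_minute:.0f} RPM, {tokens_per_minute:,.0f} TPM")
print(f"Headroom vs quota: {headroom:.2f}x"
      + (" - at risk of throttling" if headroom < 1.2 else ""))
```

Re‑run it with a 1.2× stress multiplier on the user numbers to see how much headroom you really have before requesting quota increases.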
Your 3‑week peak‑season plan
Week 1 — Baseline and guardrails
- Decide where you need “premium” AI—and where you don’t. Map every AI call in your user journeys. Label each as standard (use fast, cheaper models) or premium (use higher‑quality models). Only 10–20% typically need premium.
- Set hard token budgets per feature. Cap max output and keep prompts lean. Many providers enforce RPM/TPM; you can distribute quota across deployments and regions to avoid a single bottleneck. Ask your team to show the TPM and RPM they’ve actually been assigned. learn.microsoft.com
- Turn on caching and rate limits at the edge. If you serve repeatable prompts or common FAQs, enable an AI‑aware gateway with response caching and sliding‑window rate limiting to cut cost and smooth spikes (a sliding‑window sketch follows this list). Cloudflare’s AI Gateway supports both and gives a live dashboard for requests, tokens, cache hit‑rate, errors and cost. developers.cloudflare.com
- Batch non‑urgent work. For offline tasks (e.g. pre‑tagging products, nightly summaries), queue them through a batch API. OpenAI’s pricing page notes up to ~50% savings when using its Batch API for asynchronous jobs. openai.com
- Put simple circuit‑breakers in front of AI. If latency crosses a threshold or the gateway returns too many 429s (rate‑limited), degrade gracefully: show a fallback answer, skip enrichment, or switch to a cheaper model tier (a minimal breaker sketch also follows below). Azure’s docs explain why you may see 429s under quota pressure. learn.microsoft.com
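For the edge rate limiting in the caching bullet above, this is a minimal sketch of the sliding‑window idea, assuming a single process and an in‑memory store. In production you would normally rely on your gateway’s built‑in policies (Cloudflare AI Gateway, for example) rather than hand‑rolled code.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window_seconds` period."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop requests that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False  # caller should return HTTP 429 or serve a cached answer

# Example: cap a low-priority chat feature at 120 requests per rolling minute.
chat_limiter = SlidingWindowLimiter(limit=120, window_seconds=60)
if not chat_limiter.allow():
    print("Throttled: serve cached answer or ask the user to retry")
```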
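And for the circuit‑breaker bullet, here is a minimal sketch of the degrade‑gracefully logic. The helpers call_premium_model, call_cheaper_model and fallback_answer are hypothetical stand‑ins you would map to your own stack.

```python
import time

# Stand-ins for your real integrations (all hypothetical):
def call_premium_model(question: str, timeout: float) -> str: return "premium answer"
def call_cheaper_model(question: str) -> str: return "cheaper-tier answer"
def fallback_answer(question: str) -> str: return "static fallback answer"

class CircuitBreaker:
    """Trip to a cheaper path after repeated failures or slow responses."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at > self.cooldown:
            self.opened_at, self.failures = None, 0  # cooldown over: try premium again
            return False
        return True

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def answer(question: str) -> str:
    if breaker.is_open():
        return call_cheaper_model(question)            # degraded but still useful
    try:
        reply = call_premium_model(question, timeout=2.5)
        breaker.record(ok=True)
        return reply
    except Exception:                                  # timeouts or 429s from your provider SDK
        breaker.record(ok=False)
        return fallback_answer(question)

print(answer("How do I change my delivery address?"))
```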
Week 2 — Load test and choose your scaling path
- Run a 60‑minute traffic ramp (a simple harness sketch follows this list). Simulate 1.2× your forecast peak. Watch P95 latency, error rates and cost per 1,000 requests. If you hit rate limits, you have two levers:
- Increase shared quotas with your cloud provider; Vertex AI and Azure both support quota increase requests. cloud.google.com
- Buy guaranteed capacity via AWS Bedrock Provisioned Throughput, sold in model units that each provide a set tokens‑per‑minute throughput. docs.aws.amazon.com
- Tune your gateway policies. Set per‑feature rate limits and burst tolerance aligned to revenue or mission importance. For example, protect payment or donation flows first; throttle low‑value, anonymous chat later. Cloudflare AI Gateway offers fixed or sliding windows to shape bursts. developers.cloudflare.com
- Measure caching and batching ROI. Track cache hit‑rate, saved tokens and reduced provider calls in the analytics. If the hit‑rate is below 10%, focus on templating prompts and standardising answer formats to raise reuse. Gateways surface cost and token trends so you can prove savings. developers.cloudflare.com
- Right‑size model tiers. Where quality is comparable, switch to “mini/flash” variants for 70–90% lower unit costs; keep bigger models for edge cases (a worked cost comparison follows this list). OpenAI’s public pricing shows large differences between model families and discounts for cached input tokens, which you can exploit when prompts repeat. openai.com
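For the traffic ramp, the sketch below shows the shape of a simple ramp‑and‑measure harness. call_ai_endpoint is a hypothetical stand‑in (most teams would use a dedicated load‑testing tool such as k6 or Locust instead), and the numbers are placeholders.

```python
import random
import statistics

def call_ai_endpoint() -> tuple[float, bool]:
    """Hypothetical stand-in: returns (latency_seconds, success)."""
    latency = random.uniform(0.3, 3.0)
    return latency, latency < 2.5

def ramp_test(target_rpm: int, minutes: int = 60) -> None:
    """Ramp linearly from 10% to 120% of the forecast peak, then report P95 and errors."""
    latencies, errors, total = [], 0, 0
    for minute in range(minutes):
        rpm = int(target_rpm * (0.1 + 1.1 * minute / max(minutes - 1, 1)))
        for _ in range(rpm):
            latency, ok = call_ai_endpoint()
            latencies.append(latency)
            total += 1
            errors += 0 if ok else 1
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile
    print(f"P95 latency: {p95:.2f}s, error rate: {errors / total:.1%}, requests: {total}")

ramp_test(target_rpm=300)
```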
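For the model‑tier decision, a worked cost‑per‑1,000‑requests comparison is often all a leadership meeting needs. The per‑token prices below are placeholders, not any provider’s published rates; substitute the figures from your own pricing page.

```python
# Cost per 1,000 requests for two hypothetical model tiers (prices are placeholders, in GBP).
AVG_INPUT_TOKENS = 600
AVG_OUTPUT_TOKENS = 300

tiers = {
    "premium": {"input_per_1k_tokens": 0.0050, "output_per_1k_tokens": 0.0150},
    "mini":    {"input_per_1k_tokens": 0.0004, "output_per_1k_tokens": 0.0016},
}

for name, price in tiers.items():
    per_request = (
        AVG_INPUT_TOKENS / 1000 * price["input_per_1k_tokens"]
        + AVG_OUTPUT_TOKENS / 1000 * price["output_per_1k_tokens"]
    )
    print(f"{name}: £{per_request * 1000:.2f} per 1,000 requests")
```

If the cheaper tier passes your quality checks (see the 5‑Day AI Evaluation Sprint), that gap compounds quickly across peak week.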
Week 3 — Cutover, observability and steady‑state ops
- Canary release with a kill‑switch. Start at 5–10% of traffic and promote in steps (a minimal routing sketch follows this list). Keep a visible toggle to revert to the pre‑peak configuration within minutes if costs spike or latency degrades.
- Live dashboards and budget alerts. Put three tiles on a screen: P95 latency, error/429 rate, and £ cost per 1,000 requests. Gateways and provider consoles give real‑time request, token and error metrics. developers.cloudflare.com
- Escalation paths and on‑call drill. Agree who can lower model tiers, increase quotas, or flip to static fallbacks. If you haven’t yet, practise a 20‑minute incident drill—for structure, reuse our AI Incident Drill.
- Post‑event wrap‑up (within 48 hours). Export usage and cost by feature. Lock in the improvements that saved the most (prompt trims, caching rules, model switches). Capture “what we’d buy up‑front next year”—reserved capacity or standing quota increases.
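For the canary, here is a minimal sketch of percentage‑based routing with a kill‑switch, assuming a hypothetical get_flag helper backed by your feature‑flag or config store; most teams would use their existing flag tooling rather than code like this.

```python
import hashlib

# Hypothetical flag store; in practice read these from your feature-flag service.
FLAGS = {"peak_config_enabled": True, "peak_config_percent": 10}

def get_flag(name: str):
    return FLAGS[name]

def use_peak_config(user_id: str) -> bool:
    """Send a stable percentage of users to the new peak-season configuration."""
    if not get_flag("peak_config_enabled"):  # the kill-switch: set False to revert in minutes
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < get_flag("peak_config_percent")

# Promote in steps by raising peak_config_percent: 10 -> 25 -> 50 -> 100.
print(use_peak_config("user-1234"))
```

Hashing the user ID keeps each visitor on the same side of the split as you promote, so the experience stays consistent while you watch the dashboards.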
Decision checklist for leaders
Capacity and reliability
- Do we know our expected and stress tokens‑per‑minute at peak? Which region quotas or deployments apply? (Azure and Google expose these explicitly; ask for screenshots.) learn.microsoft.com
- If throttled, what’s our first lever: quota increase, region split, or provisioned throughput? (Bedrock’s model units give guaranteed input and output tokens per minute at a fixed price.) docs.aws.amazon.com
Cost guardrails
- Where do we apply caching? Expected cache hit‑rate? Who reviews the analytics daily? developers.cloudflare.com
- Which jobs can move to batch for bulk discounts or off‑peak processing? (OpenAI’s Batch API: up to ~50% savings.) openai.com
- What’s the per‑feature monthly budget and automatic throttle if breached?
User experience
- What’s the acceptable P95 response time by journey? What’s the fallback if we exceed it?
- Do we degrade gracefully—shorter answers, hide optional summaries, switch to simpler models—before we drop traffic?
Target KPIs for peak week
| Area | Good | Alert |
|---|---|---|
| P95 model response time | ≤ 1.5 s | > 2.5 s for 5+ min |
| Error / 429 rate | ≤ 1% | ≥ 3% for 3+ min |
| Cost per 1,000 requests | On or under budget | > 10% over for 30 min |
| Cache hit‑rate (where enabled) | ≥ 20% | < 10% (fix prompts/templates) |
| Batch queue age (non‑urgent) | Clears overnight | Spills into business hours |
Tip: agree thresholds per feature (e.g. donations vs browsing) and wire these to alerts in your monitoring.
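If your monitoring tool accepts custom checks, a sketch like the one below turns the table into alerts. The thresholds mirror the table, and the metrics dictionary is a placeholder for whatever your gateway or provider console actually reports.

```python
# Per-feature KPI thresholds (example split: donations vs browsing).
THRESHOLDS = {
    "donations": {"p95_seconds": 1.5, "error_rate": 0.01, "cost_over_budget": 0.10},
    "browsing":  {"p95_seconds": 2.5, "error_rate": 0.03, "cost_over_budget": 0.10},
}

def check_feature(feature: str, metrics: dict) -> list[str]:
    """Return an alert message for each breached threshold."""
    t = THRESHOLDS[feature]
    alerts = []
    if metrics["p95_seconds"] > t["p95_seconds"]:
        alerts.append(f"{feature}: P95 {metrics['p95_seconds']:.1f}s over {t['p95_seconds']}s target")
    if metrics["error_rate"] > t["error_rate"]:
        alerts.append(f"{feature}: error/429 rate at {metrics['error_rate']:.1%}")
    if metrics["cost_over_budget"] > t["cost_over_budget"]:
        alerts.append(f"{feature}: cost {metrics['cost_over_budget']:.0%} over budget")
    return alerts

# Placeholder metrics you would pull from your gateway or provider console.
print(check_feature("donations",
                    {"p95_seconds": 2.8, "error_rate": 0.005, "cost_over_budget": 0.02}))
```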
Risk and cost controls you can choose from
| Risk at peak | Control | Cost impact | Notes |
|---|---|---|---|
| Rate‑limit errors (429s) | Request quota increase; split deployments across regions; edge rate‑limiting with bursts | Low–Medium | Azure/Vertex define RPM/TPM; gateways smooth bursts. learn.microsoft.com |
| Runaway token spend | Hard caps on output length; cheaper model tiers for standard flows; cached responses | Low | OpenAI shows large price deltas and cached‑input discounts. openai.com |
| Throughput shortfall | Provisioned Throughput (Bedrock) for guaranteed model units | Medium (but predictable) | Fixed, reserved capacity for peak week. docs.aws.amazon.com |
| Latency spike | Cache common prompts; pre‑compute answers; batch non‑urgent jobs | Low | Gateway analytics verify cache hits and savings. developers.cloudflare.com |
| Single‑provider outage | Gateway‑level fallback routing to a secondary model | Low–Medium | Many gateways support dynamic routing and unified keys. ai.cloudflare.com |
Procurement questions to ask this week
- What are our documented TPM/RPM limits per region and model? Can we see them in the console, and what’s the SLA for increases? learn.microsoft.com
- Do you offer reserved or provisioned capacity for peak season? How is throughput measured (tokens per minute, output per minute)? docs.aws.amazon.com
- Can we enable response caching and rate limiting in front of the provider? What analytics do we get on tokens, cache hits and cost? developers.cloudflare.com
- For offline jobs, do you support discounted batch processing? What’s the expected saving? openai.com
How this meshes with other playbooks
If you’re still setting up the basics, pair this plan with:
- Cost guardrails that scale for budgets, alerts and token discipline.
- The Quiet Cutover for safe go‑lives and canaries.
- 5‑Day AI Evaluation Sprint to validate cheaper model tiers before switching.
- AI Incident Drill if you need an escalation runbook.
Wrap‑up: What “ready” looks like by next Friday
- Peak traffic forecast documented with a simple TPM/RPM comparison per provider.
- Token budgets and model tiers agreed per feature; fallbacks tested.
- Gateway policies live: caching, rate limits, analytics and alerts configured.
- Batch pipelines running for non‑urgent tasks.
- Cutover checklist printed; one‑click rollback proven.
Appendix: Why these dates matter for load
Retail peaks and charity appeals often bunch together: Black Friday 2025 is Friday 28 November, Cyber Monday is Monday 1 December, and UK Giving Tuesday is Tuesday 2 December. Planning for a three‑day surge across both commercial checkouts and donation pages reduces the risk of surprise costs and timeouts. aboutamazon.co.uk