Two things tend to fail first when an AI feature meets real traffic: timeouts and the budget. Both are avoidable with a light‑touch load test and a simple capacity plan. This article gives UK SMEs and charities a 15‑day, no‑code playbook to prove your AI service will stay up and stay within budget during seasonal peaks and campaigns.
We lean on proven practices from the UK Government Service Manual (capacity planning, performance testing and monitoring) and adapt them to AI workloads, where third‑party model limits, token throughput and bursty traffic dominate. See GOV.UK guidance on performance testing and capacity planning, on uptime and fallback modes, and on monitoring. For what to watch on dashboards, use Google’s SRE “four golden signals”: latency, traffic, errors and saturation, explained in the official SRE Workbook.
Why AI services fail under load (in plain English)
- Hidden ceilings at your model vendor. Providers throttle requests and tokens per minute to keep platforms stable. Hitting those limits triggers HTTP 429s and spikes in slow responses. See vendor guidance on rate limits.
- Bursty user behaviour. Campaign emails, deadlines or media coverage compress demand into minutes, not days. GOV.UK advises testing both expected and excess loads and recording the break point so you can adapt design and costs accordingly (Service Manual).
- Token inflation. Long prompts, attachments and verbose answers quietly increase cost and time. A few more paragraphs can double token usage.
- Downstream effects. Search, RAG or databases saturate before the AI model does. Monitor the golden signals across each dependency (SRE Workbook).
- Single‑vendor fragility. When a provider blips, you need queueing, read‑only modes or “grounded only” answers to stay graceful. GOV.UK recommends designing services with partial modes and queues to avoid all‑or‑nothing outages (Uptime & availability).
The 15‑day plan
Days 1–2: Inventory entry points and limits
- List every way users hit the AI feature (web form, chat, email triage, mobile, back office).
- Map dependencies: model vendor(s), vector/RAG store, search, file storage, content filters.
- Collect published limits: requests per minute, tokens per minute and any “priority” or “burst” tiers (for example OpenAI’s Priority processing).
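If someone on your team wants the inventory in one machine‑readable place, a minimal sketch like the one below can feed the capacity sums later in the plan. The vendors, models and limit figures are placeholders, not real quotas; copy the numbers from your own account and tier.

```python
from dataclasses import dataclass

@dataclass
class ModelLimit:
    """Published ceilings for one model and tier, copied from the vendor's docs."""
    vendor: str
    model: str
    requests_per_minute: int
    tokens_per_minute: int
    has_burst_or_priority_tier: bool  # can you buy extra headroom for peak weeks?

# Placeholder figures only -- replace with the limits on your own account/tier.
limits = [
    ModelLimit(vendor="example-vendor", model="small-model",
               requests_per_minute=500, tokens_per_minute=200_000,
               has_burst_or_priority_tier=True),
    ModelLimit(vendor="example-vendor", model="large-model",
               requests_per_minute=300, tokens_per_minute=150_000,
               has_burst_or_priority_tier=False),
]

for lim in limits:
    print(f"{lim.vendor}/{lim.model}: {lim.requests_per_minute} RPM, "
          f"{lim.tokens_per_minute} TPM")
```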
Days 3–4: Define what ‘good’ looks like
- Set two or three Service Level Objectives (SLOs) users would actually feel. Example: 99.5% success rate without manual handoff; p95 response under 3 seconds for FAQs; p99 under 7 seconds for long answers.
- Pick the golden signals per dependency: latency (p95/p99), traffic (requests and tokens), errors (4xx/5xx/429), saturation (CPU, connection pools, queue depth). See SRE guidance.
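To make the percentile targets concrete, the short sketch below turns a batch of response times into p95 and p99 and compares them with the example SLOs above. The latency samples and thresholds are synthetic and purely illustrative.

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is between 0 and 100."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic response times in seconds, e.g. from a staging run.
latencies = [0.8, 1.2, 1.1, 2.9, 0.7, 3.4, 1.5, 2.2, 6.8, 1.0]

p95_target, p99_target = 3.0, 7.0   # the example SLOs above
p95, p99 = percentile(latencies, 95), percentile(latencies, 99)

print(f"p95 = {p95:.1f}s (target {p95_target}s, ok={p95 <= p95_target})")
print(f"p99 = {p99:.1f}s (target {p99_target}s, ok={p99 <= p99_target})")
```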
Days 5–6: Design test scenarios
- Normal load (baseline usage over an hour).
- Peak load (2–3× baseline for 15 minutes).
- Spike (10× baseline for 60–120 seconds) to simulate campaign clicks; a rough sketch of this scenario follows the list.
- Soak (baseline for 4 hours) to find slow leaks and memory issues.
- Dependency failure (RAG off, model throttled) to test fallbacks.
- GOV.UK recommends gradually increasing load until something breaks, then recording the failure mode and the threshold at which it happened (guidance).
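A dedicated load-testing tool (k6, Locust, JMeter or similar) is the right way to run these scenarios, but if it helps to see the shape of a spike test in code, a rough standard-library sketch against a staging endpoint might look like the following. The URL is a placeholder and the request volume is deliberately small.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

STAGING_URL = "https://staging.example.org/ai/answer"  # placeholder endpoint

def one_request(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(STAGING_URL, timeout=10) as resp:
            status = resp.status
    except Exception as exc:          # timeouts, HTTPError (including 429s), etc.
        status = getattr(exc, "code", "error")
    return status, time.perf_counter() - start

def spike(total_requests=200, concurrency=20):
    """A compressed version of the 'spike' scenario: many requests at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))
    failures = sum(1 for status, _ in results if status != 200)
    latencies = sorted(latency for _, latency in results)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"failures={failures}/{total_requests}  p95={p95:.2f}s")

if __name__ == "__main__":
    spike()
```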
Days 7–8: Safe dry runs
- Run in a staging environment with synthetic data. Do not use live PII.
- Check you can capture the right metrics and separate model time from RAG/search time.
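One lightweight way to separate model time from RAG/search time is to wrap each stage in a timer. The sketch below uses placeholder stages (simulated with short sleeps) standing in for your real retrieval and model calls.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Accumulate wall-clock time spent in one stage of handling a request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)

def answer_question(question):
    # Placeholder stages -- swap the sleeps for your real retrieval and model calls.
    with stage("retrieval"):
        time.sleep(0.05)    # vector store / search lookup
    with stage("model"):
        time.sleep(0.30)    # call to the model vendor
    with stage("post_processing"):
        time.sleep(0.01)    # formatting, content filters
    return "stub answer"

answer_question("What are your opening hours?")
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.0f} ms")
```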
Days 9–10: Shadow production
- Mirror 10% of live traffic to the test path with responses discarded. Watch for 429s and p95 latency drift.
- Agree an error‑budget policy: how much SLO burn is acceptable before you slow releases. See our post on runbooks and SLOs.
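To make “error‑budget burn” concrete, here is a small illustrative calculation. The SLO, request volumes and failure counts are made up; the point is the shape of the sum.

```python
# A 99.5% success SLO over a 30-day window.
slo_target = 0.995
window_requests = 120_000                            # forecast requests in the window
error_budget = (1 - slo_target) * window_requests    # allowed failures: 600

failures_so_far = 180                                # from your dashboard
days_elapsed = 6
burn_rate = (failures_so_far / error_budget) / (days_elapsed / 30)

print(f"Budget: {error_budget:.0f} failed requests over 30 days")
print(f"Consumed: {failures_so_far / error_budget:.0%} after {days_elapsed} days")
print(f"Burn rate: {burn_rate:.1f}x (above 1.0x means you are on course to miss the SLO)")
```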
Days 11–12: Add backpressure and graceful modes
- Queue and cap. When queues grow or you hit vendor ramp limits, slow intake and show a friendly “we’ll email you” option. GOV.UK explicitly recommends queueing and partial modes to keep services usable (guidance).
- Fallbacks. If the model is throttled, switch to a smaller model for FAQ‑like prompts, or to “search result + link” answers only.
- Retries with backoff. Each retry should wait longer than the last, ideally with some random jitter, so clients do not pile back in and make a spike worse (a standard resilience pattern referenced in cloud provider docs).
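If a developer asks what “backoff” means in practice, a minimal sketch of exponential backoff with jitter looks like this; the flaky call is a stand‑in for your vendor client.

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=20.0):
    """Retry a flaky call with exponential backoff and full jitter.

    `call` should raise on 429s, 5xx responses and timeouts; other client
    errors (bad requests) should not be retried at all.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            # Double the ceiling each attempt, then wait a random time below it
            # so a crowd of clients does not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, ceiling))

# Example: a stand-in call that fails twice, then succeeds.
state = {"attempts": 0}
def flaky():
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

print(call_with_backoff(flaky))   # prints "ok" after two simulated failures
```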
Days 13–14: Cost burn tests
- Run the spike test twice with verbose vs concise answers. Measure tokens and cost per helpful answer.
- Set budget alerts and a daily cap at the vendor level if available.
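Putting numbers on “cost per helpful answer” only needs a few lines; every price and token count below is a placeholder, not any vendor’s actual pricing.

```python
# Placeholder prices in GBP per 1,000 tokens -- use your vendor's real rates.
price_per_1k_input_tokens = 0.0005
price_per_1k_output_tokens = 0.0015

def cost_per_answer(input_tokens, output_tokens):
    return (input_tokens / 1000) * price_per_1k_input_tokens \
         + (output_tokens / 1000) * price_per_1k_output_tokens

verbose = cost_per_answer(input_tokens=1_800, output_tokens=900)
concise = cost_per_answer(input_tokens=1_200, output_tokens=250)

helpful_rate = 0.9   # from spot checks of answer quality
print(f"Verbose style: £{verbose / helpful_rate:.4f} per helpful answer")
print(f"Concise style: £{concise / helpful_rate:.4f} per helpful answer")
```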
Day 15: Go/No‑Go
- Document limits, SLOs, fallbacks, and the threshold where you will switch features off to preserve core functions.
- Agree a change window and release gate. For safety patterns, see our post on shipping AI changes safely.
What to measure: the minimum viable AI SLOs
| KPI | Good target | Why it matters |
|---|---|---|
| Helpful answer rate (spot‑checked) | ≥ 90% | Quality users can feel; avoids cheap but unhelpful answers. |
| Deflection rate from human inbox | 40–70% depending on case type | Proves operational value; pairs with QA sampling. |
| p95 response time | ≤ 3s for FAQs; ≤ 7s for long RAG answers | Tail latency is the user experience; average is misleading. See the SRE “golden signals”. |
| Error rate (4xx/5xx/429) | ≤ 1% sustained | 429s indicate vendor throttling; 5xx/timeout signal saturation upstream. |
| Token use per answer | Budgeted band per use case | Directly controls spend and capacity. |
| Cache hit rate (prompts/doc chunks) | 30–60%+ | Reduces cost and latency under repeated queries. |
| Queue depth / wait time | Under 2 seconds average; alert at 10 seconds | Early signal you are approaching vendor or infra limits. |
Keep dashboards aligned to the four golden signals — latency, traffic, errors, saturation — a standard SRE approach endorsed by Google’s workbook for production systems (reference).
A quick capacity calculator (no maths degree required)
- Estimate peak concurrent users. Look at last campaign peaks or website analytics. Take the highest 5‑minute window you can find.
- Estimate tokens per request. Sample 20 real prompts and note the average tokens per request: prompt + context + answer. Track this per use case.
- Check vendor limits. Look up your tokens‑per‑minute and requests‑per‑minute for the chosen model and tier (for example OpenAI and Azure OpenAI publish limits).
- Compute safe throughput. Divide the vendor’s tokens per minute (TPM) by your average tokens per request; that is your maximum steady requests per minute. Keep 30–50% headroom for bursts and retries. The sketch after this list works through the sum.
- Decide burst handling. Above the steady rate: queue for up to 10–20 seconds, switch to smaller models for short prompts, or degrade to “search‑only” answers.
- Set a firm budget cap. Combine price per token and forecast volume. Add alerts at 50%, 80% and 95% of daily budget. For choosing budgets and tiering, see our AI cost guardrail.
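Pulling those steps together, the arithmetic fits in a few lines. Every figure below is a placeholder to show the shape of the sum; substitute your own vendor limits, sampled token counts and traffic forecast.

```python
vendor_tpm = 200_000          # tokens per minute for your model and tier
vendor_rpm = 500              # requests per minute for the same tier
tokens_per_request = 1_600    # sampled average of prompt + context + answer
headroom = 0.4                # keep 30-50% back for bursts and retries

token_bound_rpm = vendor_tpm / tokens_per_request
safe_rpm = min(token_bound_rpm, vendor_rpm) * (1 - headroom)

peak_concurrent_users = 120
requests_per_user_per_minute = 1.5
peak_demand_rpm = peak_concurrent_users * requests_per_user_per_minute

print(f"Token-bound ceiling: {token_bound_rpm:.0f} requests/min")
print(f"Safe steady rate:    {safe_rpm:.0f} requests/min")
print(f"Forecast peak:       {peak_demand_rpm:.0f} requests/min")
if peak_demand_rpm > safe_rpm:
    print("Peak exceeds the safe rate: plan queueing, a smaller-model fallback "
          "or a burst/priority tier before the campaign.")
```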
Risk and cost: what to agree before you test
| Risk | Mitigation | Indicative cost/time |
|---|---|---|
| Vendor throttling (429s) | Client‑side rate caps, exponential backoff, queues, smaller‑model fallback, or a paid “priority” tier where justified. | Setup: 1–2 days. Ongoing cost only if using a premium tier. |
| Token bill spike | Strict max‑tokens per use case, concise answer policy, caching, daily cap and alerts. | Policy 1 day; caching depends on platform. |
| Slow RAG/search | Timeouts with graceful degradation to “source links only”, batch chunking, and pre‑warm caches for popular topics. | 2–4 days tuning and indexing. |
| Third‑party outage | Read‑only and queue‑then‑process modes as per GOV.UK guidance on keeping services online. | Design session 0.5 day; config 1–2 days. |
| Data sensitivity in tests | Use synthetic data. Follow secure‑by‑design advice from the NCSC/CISA joint AI development guidelines and the UK’s AI Cyber Security Code of Practice. | Free guidance; half‑day policy review. |
Vendor questions to settle in procurement (before load hits)
- What are our per‑minute limits for the chosen model and tier? Are there ramp‑up constraints? Where is this documented?
- Can we buy or request burst capacity or a priority service tier during peak weeks? How do we toggle it and how is it priced?
- Do you offer regional failover and what is the documented recovery time objective when models degrade?
- What metrics and headers do you expose so we can detect saturation (remaining tokens, reset times)? A short sketch of reading such headers follows this list.
- How do you recommend clients implement retries and backoff to avoid stampedes?
- Will you notify us of limit changes or maintenance windows? Where is your status page and incident history?
For concrete examples of rate‑limit concepts, see the OpenAI docs on requests and tokens per minute and Azure’s quota tables.
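Where a vendor does expose saturation signals, watching them is straightforward. The sketch below assumes OpenAI‑style x-ratelimit-* response headers with made‑up values; other vendors use different names, so check your own provider’s documentation.

```python
def saturation_warning(headers, threshold=0.2):
    """Warn when fewer than `threshold` of this minute's tokens remain."""
    limit = int(headers.get("x-ratelimit-limit-tokens", 0))
    remaining = int(headers.get("x-ratelimit-remaining-tokens", 0))
    if limit and remaining / limit < threshold:
        reset = headers.get("x-ratelimit-reset-tokens", "unknown")
        return f"Only {remaining}/{limit} tokens left this minute (resets in {reset})."
    return None

# Example response headers with made-up values.
headers = {
    "x-ratelimit-limit-tokens": "200000",
    "x-ratelimit-remaining-tokens": "18000",
    "x-ratelimit-reset-tokens": "12s",
}
print(saturation_warning(headers))
```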
A safe release plan for peak season
- Freeze window. Avoid non‑urgent AI changes in the 2 weeks before major campaigns.
- Two‑stage rollout. Canaries to 5–10% of traffic, with an automatic rollback if p95 latency or the error rate exceeds the SLO for 5 minutes; a sketch of that check follows this list. For more detail, see Ship AI changes safely.
- On‑call and runbook. Name a human, escalation path and pre‑agreed “degrade modes” you will switch on first. We covered this in From pilot to always‑on.
- Comms templates. One banner for delays, one email for queued responses, one internal Slack/Teams message explaining switches.
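The automatic rollback rule in the two‑stage rollout can be written as a simple check. The thresholds below mirror the example SLOs earlier in the article, and the samples stand in for readings from your monitoring tool.

```python
P95_SLO_SECONDS = 3.0
ERROR_RATE_SLO = 0.01
CONSECUTIVE_BAD_MINUTES = 5

def check_canary(minute_samples):
    """`minute_samples` is a list of (p95_seconds, error_rate), newest last."""
    recent = minute_samples[-CONSECUTIVE_BAD_MINUTES:]
    if len(recent) < CONSECUTIVE_BAD_MINUTES:
        return "keep-watching"
    if all(p95 > P95_SLO_SECONDS or err > ERROR_RATE_SLO for p95, err in recent):
        return "rollback"
    return "healthy"

# Example: five bad minutes in a row triggers a rollback.
samples = [(1.8, 0.002), (2.1, 0.004), (3.4, 0.02), (3.6, 0.03),
           (4.0, 0.05), (3.9, 0.04), (3.8, 0.02)]
print(check_canary(samples))   # -> "rollback"
```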
Governance: minimal paperwork that helps in a crisis
- One‑pager capacity plan. Peak users, tokens per request, vendor limits, steady‑state throughput, headroom, fallback order.
- SLO sheet. The 3 KPIs you’ll protect and the alert thresholds.
- Change checklist. Before going live: limits checked, budget alerts on, fallback tested, status page ready.
- Post‑event review. Keep a short log of breakpoints, vendor responses and cost per helpful answer — it will pay back next time. For unit‑economics framing, see AI unit economics.
Common pitfalls and how to avoid them
- Only testing average load. Tail latency and burst behaviour reveal the real user experience. Track p95/p99, not just averages (SRE).
- No degrade mode. GOV.UK recommends read‑only and queue‑then‑process modes so users can still do something useful while you recover (guidance).
- Unbounded retries. Retrying too quickly makes a spike worse. Use backoff with a cap on attempts; most cloud SDKs and provider docs describe this pattern.
- Verbose answers by default. Tokens are money and capacity. Set answer style per use case and cache common chunks.
- Assuming vendor limits are static. Limits vary by tier and can evolve. Check your account and keep a contact route for peak changes.
Security notes for non‑technical leaders
Load testing should be safe by default. Use synthetic or anonymised data and treat the test harness as you would production. The NCSC and CISA’s joint secure AI development guidelines and the UK AI Cyber Security Code of Practice are short, practical reads to align teams on “secure by design”.
What this looks like in practice next week
- Book a 60‑minute session to agree your 3 SLOs and fallback order.
- Run a 30‑minute spike test in staging and record the first break point.
- Switch on budget alerts and a daily cap at your model vendor.
- Add a dashboard panel for queue depth and 429s.
- Prepare two public messages: “answers may be delayed” and “we’ll email you”.
Further reading
- GOV.UK: Test your service’s performance and Monitoring the status of your service.
- Google SRE Workbook: Monitoring and the four golden signals.
- OpenAI: API rate limits and Priority processing tier.
- Azure OpenAI: Quotas and limits.
- Related posts on this site: Runbooks, SLOs and on‑call and Shipping AI changes safely.