Case study & playbook

30 days to a reliable AI helpdesk: a UK SME case study and playbook

What this is (and isn’t)

This is a practical, non‑technical walkthrough of how a 120‑person UK services SME stood up a retrieval‑augmented AI helpdesk in 30 calendar days. It’s a composite, anonymised case drawn from three recent engagements in 2025; your numbers will vary, but the steps, checks and costs are representative for UK SMEs and charities.

  • Scope: inbound email and web‑form queries; staff knowledge search; simple order/booking lookups via APIs.
  • Approach: retrieval‑augmented generation (RAG) with clear citations to source documents, supervised by humans during the pilot.
  • Outcomes in 30 days: email deflection on common “how do I…?” queries, faster agent answers with citations, and predictable costs.

If RAG is new to you: think of it as “AI that first looks up answers in your own documents, then writes a reply and shows its sources.” See McKinsey’s explainer and Microsoft’s overview for leaders. McKinsey RAG explainer; Microsoft Learn: RAG in Azure AI Search. ([mckinsey.com](https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-retrieval-augmented-generation-rag))

The 30‑day plan (week by week)

Week 0–1: Define value, pick the first journeys, assemble the knowledge

  • Pick 3–5 common intents (password resets, booking changes, pricing, delivery status, donations/gift aid FAQs). Aim for intents you can fully answer from existing documents or CRM lookups.
  • Create your “Gold Set”: 100–150 real customer questions with the correct answers and links to the source doc. This becomes your evaluation set (a minimal record format is sketched after this list).
  • Collect sources: help centre PDFs, policy docs, product sheets, macros, and 6–12 months of solved tickets; de‑duplicate and mark authoritative versions.
  • Agree KPIs: first‑contact resolution (FCR), average handle time (AHT), agent “time‑to‑answer”, % of answers with citations, cost per resolved contact, CSAT for AI‑assisted replies.
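
A Gold Set needs no special tooling: a spreadsheet or a small JSON file is enough. Here is a minimal, purely illustrative record format in Python; the field names are our assumptions, not a standard:

```python
# gold_set.py — a minimal, illustrative Gold Set record format.
# Field names are assumptions; a spreadsheet with the same columns works just as well.
import json

gold_set = [
    {
        "id": "GS-001",
        "intent": "password_reset",
        "question": "How do I reset my account password?",
        "expected_answer": "Use the 'Forgot password' link on the sign-in page; "
                           "reset emails arrive within five minutes.",
        "source_doc": "help-centre/account-access.pdf",
        "source_clause": "Section 2.1",
    },
    # ...100-150 entries drawn from real tickets
]

with open("gold_set.json", "w", encoding="utf-8") as f:
    json.dump(gold_set, f, indent=2, ensure_ascii=False)
```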

Week 1–2: Build the retrieval layer and a safe “shadow mode”

  • Index the knowledge in a search system designed for RAG (for Microsoft‑centric shops, Azure AI Search is a good fit). Azure AI Search product page. ([azure.microsoft.com](https://azure.microsoft.com/en-us/products/search/))
  • Shadow mode: the assistant drafts answers with citations, but agents approve/send. No customer sees an unsupervised AI message yet.
  • Quality gates: require at least one citation per claim; if no relevant source is retrieved, return “I don’t know” and route to a human (see the sketch after this list).
  • Security hygiene: treat it like any other SaaS dependency. Use SSO/MFA; record where data is stored; and follow GOV.UK/NCSC guidance for choosing and securing SaaS. Securing SaaS tools (GOV.UK); Using cloud tools securely. ([gov.uk](https://www.gov.uk/guidance/securing-saas-tools-for-your-organisation))
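
The two quality gates are simple enough to express in a few lines. A minimal sketch, assuming your tool hands back a draft plus its citations; the function and field names are placeholders, not any vendor’s API:

```python
# quality_gates.py — illustrative shadow-mode gates; all names are placeholders.

def apply_quality_gates(draft: str, citations: list[str], retrieved: list[dict]) -> dict:
    """Decide what happens to a drafted reply. Nothing auto-sends in shadow mode."""
    # Gate 1: null retrieval — no relevant source was found, so don't guess.
    if not retrieved:
        return {"action": "escalate_to_human",
                "reply": "I don't know - routing this to a colleague."}

    # Gate 2: no citation, no draft. Uncited claims auto-fail.
    if not citations:
        return {"action": "escalate_to_human", "reply": None}

    # Passed both gates: queue for an agent to approve or edit before sending.
    return {"action": "queue_for_agent_review", "reply": draft, "citations": citations}
```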

Week 2–3: Evaluate, tune, and open to a small cohort

  • Run daily evaluations on your Gold Set (a simple harness is sketched after this list). Score each answer for correctness, groundedness (backed by the cited text), tone, and completeness. NIST’s GenAI work provides helpful evaluation themes to adapt for business QA. NIST GenAI Profile; NIST pilot evaluation. ([nist.gov](https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence))
  • Open a small beta (e.g., 10% of volume or one brand/region). Keep the “human‑in‑the‑loop” review in place.
  • Measure cost per answer across model calls and retrieval; prune long replies and set a sensible maximum context length to control spend.
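
As a concrete example of the daily run, the harness below puts every Gold Set question through the assistant and writes one CSV per day for reviewers to score. It is a sketch: `ask_assistant()` stands in for whichever tool you are piloting.

```python
# daily_eval.py — illustrative harness; ask_assistant() stands in for your pilot tool.
import csv
import json
from datetime import date

def ask_assistant(question: str) -> dict:
    """Placeholder: call your pilot tool; return {'answer': str, 'citations': [str]}."""
    raise NotImplementedError

def run_daily_eval(gold_set_path: str = "gold_set.json") -> None:
    with open(gold_set_path, encoding="utf-8") as f:
        gold_set = json.load(f)

    with open(f"eval-{date.today()}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Reviewers fill in the four 0-2 scores; uncited answers are pre-failed.
        writer.writerow(["id", "answer", "citations",
                         "correctness", "groundedness", "tone", "completeness"])
        for item in gold_set:
            result = ask_assistant(item["question"])
            auto_fail = "0" if not result["citations"] else ""
            writer.writerow([item["id"], result["answer"],
                             "; ".join(result["citations"]),
                             auto_fail, auto_fail, "", ""])
```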

Week 3–4: Gradual rollout and operational handover

  • Graduated autonomy: allow the assistant to auto‑send on low‑risk intents when the confidence and evaluation scores exceed your thresholds; keep agent review for the rest (a thresholding sketch follows this list).
  • Playbooks for people: train agents in “AI‑first search,” citation checking, and escalation. Industry surveys show customers accept AI more when human help is clearly available on demand—design your flows accordingly. Gartner: customer attitudes to AI in service. ([gartner.com](https://www.gartner.com/en/newsroom/press-releases/2024-07-09-gartner-survey-finds-64-percent-of-customers-would-prefer-that-companies-didnt-use-ai-for-customer-service))
  • Handover to ops: document owners, weekly content refresh cadence, alerting, and cost dashboards.
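
The autonomy decision itself can stay deliberately boring. A sketch, with made‑up intent names and thresholds you would tune against your own evaluation data:

```python
# autonomy_router.py — illustrative thresholds; intents and numbers are assumptions.

# Only intents whose evaluation scores have been stable for 2-3 weeks earn a place here.
AUTO_SEND_INTENTS = {"password_reset", "opening_hours"}
GROUNDEDNESS_THRESHOLD = 0.95  # matches the human-oversight guardrail later in this piece

def route(intent: str, groundedness: float, has_citations: bool) -> str:
    """Auto-send only low-risk, well-grounded, cited answers; review everything else."""
    if (intent in AUTO_SEND_INTENTS
            and groundedness >= GROUNDEDNESS_THRESHOLD
            and has_citations):
        return "auto_send"
    return "agent_review"
```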

The results (composite, anonymised)

Across three UK organisations (B2B services, e‑commerce, and a national charity) that followed this 30‑day approach in 2025, we observed:

  • Answer assistance: 55–70% of agent replies drafted with citations (“shadow mode”).
  • Self‑service deflection: for the limited intents in scope, 18–35% of incoming tickets resolved via AI‑assisted email templates or an embedded help widget.
  • Speed: agent “time‑to‑answer” down 35–50% for in‑scope topics.
  • Cost control: average AI cost per assisted answer between £0.02 and £0.12 depending on model and context length.

These are indicative, not guarantees; they depend on knowledge quality, clean routing, and change management. Independent market data suggests both appetite and caution: many leaders plan AI pilots in customer service, but customers still want humans available. Gartner 2024/25 adoption survey; Gartner customer preferences; Zendesk CX Trends 2025. ([gartner.com](https://www.gartner.com/en/newsroom/press-releases/2024-12-09-gartner-survey-reveals-85-percent-of-customer-service-leaders-will-explore-or-pilot-customer-facing-conversational-genai-in-2025))

How the system works (plain English)

  1. Retrieve: for each question, the assistant searches your indexed policies, product docs and past tickets to fetch the most relevant passages.
  2. Generate: it drafts a reply using those passages, with citation links so staff (or customers) can verify the source.
  3. Guardrails: if nothing relevant is found, it says so and escalates; sensitive queries route to a human by design.
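
Stitched together, the whole loop fits in a dozen lines. This is a sketch, not a reference implementation: `search_index()` and `draft_reply()` are placeholders for your search service and model.

```python
# pipeline.py — the retrieve → generate → guardrail loop in miniature.

def search_index(question: str, top_k: int = 4) -> list[dict]:
    """Placeholder for your search service (e.g. a managed RAG index)."""
    raise NotImplementedError

def draft_reply(question: str, passages: list[dict]) -> dict:
    """Placeholder for the model call; returns {'text': str, 'sources': [str]}."""
    raise NotImplementedError

def answer(question: str) -> dict:
    # Retrieve: fetch the most relevant passages from indexed docs and past tickets.
    passages = search_index(question)

    # Guardrail: if nothing relevant is found, say so and escalate rather than guess.
    if not passages:
        return {"action": "escalate", "reply": "I don't know - passing you to a colleague."}

    # Generate: draft a reply grounded in those passages, with citation links attached.
    draft = draft_reply(question, passages)
    return {"action": "review", "reply": draft["text"], "citations": draft["sources"]}
```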

RAG and grounding reduce “made‑up” answers by anchoring responses in trusted sources; several major platforms now emphasise grounding for reliable enterprise outputs. Google: grounding for reliability. ([cloud.google.com](https://cloud.google.com/blog/products/ai-machine-learning/how-vertex-ai-grounding-helps-build-more-reliable-models))

Quality and safety: make it measurable

Your minimum evaluation set

  • 100–150 real questions with the correct answer and a link to the exact clause in your doc.
  • Score daily: correctness, groundedness, tone, completeness. Track trends.
  • Failure handling: answers without citations auto‑fail; null retrieval must escalate.

Security, data and supply chain

  • Use SSO/MFA; least‑privilege access to the index; log every query.
  • Confirm vendor claims against NCSC/CISA joint guidance on deploying AI securely; keep an asset inventory for prompts, logs and models. Deploying AI Systems Securely. ([cisa.gov](https://www.cisa.gov/news-events/alerts/2024/04/15/joint-guidance-deploying-ai-systems-securely))
  • Run an AI/data risk check. ICO’s AI and data protection toolkit is a useful starting point for UK controllers. ICO AI risk toolkit. ([ico.org.uk](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/ai-and-data-protection-risk-toolkit/?search=DPIA))

Costs and guardrails (UK‑centric)

Expect three cost buckets: initial setup (indexing and evaluation), ongoing AI usage (model + retrieval), and operations (content upkeep, monitoring). Use the following “north‑star” guardrails:

| Area | Guardrail | Why it matters |
| --- | --- | --- |
| Per‑answer spend | Cap average AI cost per assisted answer at £0.15 for email; £0.05 for internal search | Protects margin; forces concise prompts and right‑sizing models |
| Context length | Set hard limits; prefer 2–4 short, relevant passages over long dumps | More tokens ≠ better answers; keeps costs predictable |
| Model mix | Use “good‑enough” models for drafts; escalate to stronger models only when confidence is low | Tiers cost to complexity without hurting quality |
| Index refresh | Weekly for policies/FAQs; daily for price/stock/SLAs | Stops “stale knowledge” errors |
| Human oversight | Shadow mode for 2–3 weeks; auto‑send only on intents with ≥95% groundedness | Builds trust; avoids early misfires |
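
To sanity‑check the per‑answer spend cap in the first row, the arithmetic is straightforward. The token prices below are placeholders, not any provider’s real rate card; substitute current pricing:

```python
# cost_check.py — per-answer cost arithmetic; prices are placeholders in £.

PRICE_PER_1K_INPUT_TOKENS = 0.01   # placeholder; check your provider's rate card
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # placeholder
EMAIL_CAP = 0.15                   # the per-answer guardrail in the table above

def answer_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# Example: 3 short retrieved passages plus the prompt (~1,500 tokens in), 300-token reply.
cost = answer_cost(1_500, 300)
assert cost <= EMAIL_CAP, "Over the cap - trim context or downshift the model"
print(f"£{cost:.3f} per assisted answer")  # £0.024 with these placeholder prices
```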

If you’re Microsoft‑centric, Azure’s RAG guidance explains why indexing and chunking decisions affect both quality and cost. ([learn.microsoft.com](https://learn.microsoft.com/en-gb/azure/ai-studio/concepts/retrieval-augmented-generation))

Buy vs build: a quick decision tree

  • Use an off‑the‑shelf helpdesk AI if you mainly need better suggested replies for agents and have a well‑maintained knowledge base (many suites now include RAG‑style assistants). See Zendesk’s 2025 CX trends for context. ([cxtrends.zendesk.com](https://cxtrends.zendesk.com/gb/))
  • Configure a lightweight custom RAG if your knowledge lives across SharePoint/Drive/Confluence and you need citations and routing in your own workflows (e.g., via Azure AI Search). Azure AI Search. ([azure.microsoft.com](https://azure.microsoft.com/en-us/products/search/))
  • Avoid “agentic” hype for now unless you have a clear ROI case. Even large firms are pruning agentic projects that over‑promise. Reuters on Gartner’s outlook. ([reuters.com](https://www.reuters.com/business/over-40-agentic-ai-projects-will-be-scrapped-by-2027-gartner-says-2025-06-25/))

Procurement questions (copy/paste into your RFP)

  1. Where will our content, prompts and logs be stored? In which country/region? Can we opt out of training?
  2. Which security standard do you align to for AI specifically (e.g., NCSC/CISA guidance)? Provide evidence. Reference. ([cisa.gov](https://www.cisa.gov/news-events/alerts/2024/04/15/joint-guidance-deploying-ai-systems-securely))
  3. Can we enforce SSO, role‑based access and retention policies? How do we export all data on exit? GOV.UK SaaS guidance. ([gov.uk](https://www.gov.uk/guidance/securing-saas-tools-for-your-organisation))
  4. How do you evaluate answer quality and reduce hallucinations (grounding, citations, “I don’t know” behaviour)? Provide metrics and samples.
  5. What’s your cost model per 1,000 requests and typical token usage for our intents? Can we set caps and alerts?
  6. Do you provide a content governance workflow (owners, review dates, stale content flags)?

Operating model: who does what after go‑live

  • Knowledge Owners: named per policy/product area; weekly refresh; approve major changes.
  • Helpdesk Lead: monitors KPIs, approves new intents, signs off autonomy thresholds.
  • IT/Sec: SSO/MFA, log retention, vendor due diligence, incident process (treat prompts/logs as assets). UK guidance stresses asset inventories for AI components. AI cyber security code of practice (DSIT/NCSC). ([gov.uk](https://www.gov.uk/government/publications/ai-cyber-security-code-of-practice/code-of-practice-for-the-cyber-security-of-ai))
  • Training: bite‑size modules for agents on citation‑checking, tone, and escalation; refresher at 30/60/90 days.

KPIs that actually move the needle

  • Grounded accuracy: % of answers rated correct and fully supported by cited text (target ≥90% in scope).
  • First‑contact resolution for AI‑assisted intents (target +10–20% uplift vs baseline).
  • Agent time‑to‑answer for in‑scope topics (target −30–50%).
  • Cost per resolved contact including AI + human time (target −15–30%).
  • Customer satisfaction on AI‑assisted answers, with an obvious “talk to a human” path—important because a majority of customers still want easy escalation. Gartner. ([gartner.com](https://www.gartner.com/en/newsroom/press-releases/2024-07-09-gartner-survey-finds-64-percent-of-customers-would-prefer-that-companies-didnt-use-ai-for-customer-service))
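
Grounded accuracy is the headline number, and it falls straight out of the daily evaluation files sketched earlier. A toy roll‑up, assuming those CSV columns:

```python
# kpi_grounded_accuracy.py — toy KPI roll-up; assumes the eval CSVs sketched earlier.
import csv
import glob

passed = total = 0
for path in sorted(glob.glob("eval-*.csv")):
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            # "Correct and fully supported" = top reviewer score (2) on both axes.
            if row["correctness"] == "2" and row["groundedness"] == "2":
                passed += 1

if total:
    print(f"Grounded accuracy: {passed / total:.0%} (target: 90%+ in scope)")
```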

Common pitfalls and how to avoid them

  • Messy knowledge: conflicting PDFs lead to conflicting answers. Fix by appointing owners and archiving duplicates.
  • Too much autonomy, too soon: stay in shadow mode until your evaluation scores are stable for 2–3 weeks.
  • Unclear human escape hatch: always offer “talk to a person”. It increases acceptance of AI‑first flows. Insight summary. ([customerexperiencedive.com](https://www.customerexperiencedive.com/news/generative-ai-reassure-customers-human-agents-gartner/749997/))
  • Security as an afterthought: use the joint AI‑deployment guidance; keep an inventory of AI assets (models, prompts, logs). CISA/NSA/NCSC joint guidance. ([cisa.gov](https://www.cisa.gov/news-events/alerts/2024/04/15/joint-guidance-deploying-ai-systems-securely))

What to do next

  1. Choose 3–5 intents you can fully answer from your documents today.
  2. Assemble a 100‑question Gold Set and nominate content owners.
  3. Decide “buy vs build” for a small pilot and book a 30‑min scoping call with us.

Related reading on our site: A practical AI stack for UK teams (2025), AI readiness audit, AI policy pack templates.