[Image: Operations lead reviewing a RAG dashboard with citations and cost metrics]

The 6‑week RAG blueprint for UK SMEs: clean data, hybrid search, predictable costs

What this is (and why now)

RAG in plain English: your assistant looks up answers in your own documents first, then drafts a reply with citations. It’s the fastest route to useful AI without training a bespoke model. If you want a brisk primer, see OpenAI’s overview of RAG and semantic search and Microsoft’s Azure AI Search RAG guide.

But most pilots wobble on three basics: messy source data, weak retrieval (vector search alone), and no measurement. This six‑week blueprint fixes that with a small, auditable slice first: one business scenario, a gold‑standard evaluation set, and a retrieval stack that blends keyword and vector search with reranking. Microsoft and Elastic both show that hybrid search plus reranking raises quality and recall; Microsoft calls it “table stakes” for production RAG in their latest retrieval updates and documents hybrid behaviour and semantic ranking in depth here and here. Elastic’s product docs describe hybrid search and RRF rank fusion as the default pattern too for enterprise search.

The six‑week plan

Week 0: pick one value case and write your “Gold Set”

  • Choose a narrow journey with clear documents: staff policies, product FAQs, grant eligibility rules, or service SLAs.
  • Create a “Gold Set” of 50–100 real questions with the correct answer and a link to the exact clause/page. You’ll use this for daily evaluations (see Microsoft’s RAG evaluators); an example record follows this list.
  • Confirm sign‑offs: who owns the source docs? What is “correct” wording? Who approves live responses?
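
One record per question keeps the Gold Set easy to version and score. Here is a minimal sketch in Python, writing one JSON object per line (JSONL); the field names and the intranet URL are illustrative, not a schema required by any evaluation tool.

```python
# A single Gold Set record, stored as one JSON object per line (JSONL).
# Field names and the intranet URL are illustrative, not a required schema.
import json

gold_example = {
    "question": "How many days of annual leave do part-time staff get?",
    "expected_answer": "A pro-rated share of the 25-day full-time entitlement.",
    "source_id": "HR-POL-007",
    "source_title": "Annual Leave Policy",
    "section": "4.2 Part-time entitlement",
    "source_link": "https://intranet.example.com/policies/annual-leave#4-2",
    "last_verified": "2025-01-15",
}

with open("gold_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(gold_example, ensure_ascii=False) + "\n")
```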

Week 1: inventory and clean the source data

  • Gather only the authoritative documents for this journey. Archive duplicates and flag “superseded” versions.
  • Fix PDF sins: export to text, preserve headings, ensure tables are machine‑readable. Add a source ID, title, section, and last‑reviewed date to each record.
  • Decide chunking approach. Microsoft’s chunking guidance recommends semantically coherent chunks (paragraphs/sections) over raw fixed‑length splits for production quality with pros/cons by method. Anthropic’s research on contextual retrieval shows why adding chunk‑level context can materially improve retrieval quality at scale without losing meaning.
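
To make the chunking decision concrete, here is a minimal paragraph‑based chunker in Python. It is a sketch assuming plain‑text exports with blank lines between paragraphs; the heading heuristic, the max_chars cap, and the field names are placeholders to tune against your Gold Set, not recommendations from the guidance cited above.

```python
# Paragraph-based chunking with the metadata fields described above.
# Assumes plain-text exports with blank lines between paragraphs; the
# heading heuristic and max_chars are placeholders to tune.
from datetime import date


def chunk_document(text: str, source_id: str, title: str,
                   last_reviewed: date, max_chars: int = 1500) -> list[dict]:
    chunks: list[dict] = []
    buffer, section = "", "Introduction"
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if para.isupper() or para.endswith(":"):   # crude heading detector
            section = para.rstrip(":")
        if buffer and len(buffer) + len(para) > max_chars:
            chunks.append(_record(buffer, source_id, title, section, last_reviewed))
            buffer = ""
        buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        chunks.append(_record(buffer, source_id, title, section, last_reviewed))
    return chunks


def _record(text: str, source_id: str, title: str, section: str, last_reviewed: date) -> dict:
    return {"text": text, "source_id": source_id, "title": title,
            "section": section, "last_reviewed": last_reviewed.isoformat()}
```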

Week 2: build the retrieval layer (hybrid by default)

  • Start with hybrid search: run both keyword (BM25) and vector queries, then rerank results. Microsoft documents this pattern with semantic ranking and reciprocal rank fusion in Azure AI Search, and the product page summarises it well for leaders. A minimal rank‑fusion sketch follows this list.
  • Return short passages (2–4) with source metadata. If nothing relevant is found, say “I don’t know” and escalate — a groundedness check in Azure Content Safety can help detect ungrounded answers automatically.
  • Security hygiene: treat the index like a system of record. Use SSO/MFA, least‑privilege roles, and follow GOV.UK’s checklists when choosing and securing SaaS/search tools: Securing SaaS tools and Using cloud tools securely.
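
If your search service does not fuse keyword and vector results for you, the core of hybrid retrieval is small enough to sketch. The Python below shows reciprocal rank fusion (RRF) over two ranked ID lists, assuming you already run a BM25 query and a vector query separately; managed services such as Azure AI Search and Elasticsearch implement this natively, so treat it as an illustration rather than a replacement, and add a reranker on top of the fused results.

```python
# Reciprocal rank fusion (RRF) over two ranked lists of document IDs,
# as produced by whatever BM25 and vector searches you already run.
# k=60 is the commonly used constant; a reranker would then reorder
# the fused top results before anything reaches the model.

def rrf_fuse(keyword_hits: list[str], vector_hits: list[str],
             k: int = 60, top_n: int = 4) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Only these few fused passages (plus their metadata) are sent to the model.
fused = rrf_fuse(["doc-7", "doc-2", "doc-9"], ["doc-2", "doc-5", "doc-7"])
print(fused)  # doc-2 and doc-7 rise to the top because both retrievers agree
```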

Week 3: shadow mode, then a small cohort

  • Shadow mode means the assistant drafts and a human approves every reply. Open it to 10–20 internal users or a subset of queries.
  • Evaluate daily on your Gold Set: groundedness, relevance, correctness, and completeness — Microsoft’s evaluators define these clearly and show how to combine them for decisions across the pipeline and per metric. If you prefer open tooling, RAGAS provides practical metrics like faithfulness, answer accuracy, and context precision/recall with definitions. A bare‑bones scoring loop follows this list.
  • Tune retrieval before prompts. Hybrid + reranker typically beats vector‑only; Microsoft’s evaluation posts show the uplift from adding a ranker on top of hybrid search with example scores.
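
The richer groundedness and relevance scores should come from Azure AI evaluators or RAGAS, but the cheap, always‑on subset is easy to run yourself every day. Here is a bare‑bones sketch, assuming a Gold Set in the JSONL shape from Week 0 and an answer_fn placeholder standing in for your assistant.

```python
# Bare-bones daily scoring over the Gold Set. answer_fn stands in for your
# assistant and should return {"text": ..., "citations": [source_id, ...]}.
# Richer groundedness/relevance scores come from Azure AI evaluators or RAGAS;
# these two checks are the cheap subset you can run on every change.
import json


def score_gold_set(path: str, answer_fn) -> dict:
    total = cited = right_source = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            answer = answer_fn(record["question"])
            total += 1
            if answer["citations"]:
                cited += 1
            if record["source_id"] in answer["citations"]:
                right_source += 1
    return {"n": total,
            "citation_presence": cited / total,
            "correct_source_rate": right_source / total}
```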

Week 4: cost controls and caching

  • Set a per‑answer cost guardrail for the journey (e.g., internal search ≤ a few pence; external replies slightly higher). Track average tokens retrieved and generated per answer; a worked cost check follows this list.
  • Use prompt/prefix caching where supported to cut spend and latency. OpenAI describes prompt caching with discounts on cached tokens in their API; Azure OpenAI supports prompt caching and documents thresholds/behaviour in detail. At the edge, semantic caching via Azure API Management can return cached responses for similar prompts to reduce costs further.
  • Trim long answers, cap context length, and ensure you only inject the minimal passages needed.
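
A per‑answer cost check is only a few lines once you log token counts. The prices and cap below are placeholders in GBP per 1,000 tokens; substitute your provider’s current rates and the cap you agreed for the journey.

```python
# Per-answer cost estimate against an agreed cap. Prices are placeholders
# in GBP per 1,000 tokens: substitute your provider's current rates.
PRICE_IN_PER_1K = 0.002      # uncached prompt tokens
PRICE_CACHED_PER_1K = 0.001  # cached prompt tokens, if your provider discounts them
PRICE_OUT_PER_1K = 0.008     # completion tokens
COST_CAP_GBP = 0.04          # "a few pence" cap for the internal-search journey


def answer_cost(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float:
    uncached = max(prompt_tokens - cached_tokens, 0)
    return (uncached * PRICE_IN_PER_1K
            + cached_tokens * PRICE_CACHED_PER_1K
            + completion_tokens * PRICE_OUT_PER_1K) / 1000


cost = answer_cost(prompt_tokens=3200, cached_tokens=2000, completion_tokens=350)
if cost > COST_CAP_GBP:
    print(f"Over cap at £{cost:.4f}; alert the RAG lead and trim the injected context")
```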

Week 5: prepare for production

  • Document owners and a refresh cadence (weekly for policies/FAQs; daily for pricing/stock/SLAs).
  • Define incident types: stale content, prompt injection, missing citations, cost spikes. Decide who triages each (a sample runbook config follows this list).
  • Adopt the UK government’s AI cyber security code as a practical checklist for inventorying AI assets (models, prompts, logs) and everyday secure‑by‑design actions on GOV.UK.
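
The refresh cadence and incident routing stay honest when they live in version control next to the assistant’s configuration. An illustrative sketch; the owners, addresses, and response targets are placeholders.

```python
# Illustrative runbook data: refresh cadence per content category and who
# triages each incident type. Owners, addresses, and targets are placeholders.
RUNBOOK = {
    "refresh": {
        "policies_and_faqs":  {"owner": "hr-knowledge@company.example", "cadence": "weekly"},
        "pricing_stock_slas": {"owner": "ops-data@company.example",     "cadence": "daily"},
    },
    "incidents": {
        "stale_content":     {"triage": "knowledge owner", "target_response": "1 business day"},
        "prompt_injection":  {"triage": "IT/Sec",          "target_response": "4 hours"},
        "missing_citations": {"triage": "RAG lead",        "target_response": "1 business day"},
        "cost_spike":        {"triage": "RAG lead",        "target_response": "same day"},
    },
}
```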

Week 6: graduated rollout and sign‑off

  • Promote a small set of low‑risk intents to auto‑send when evaluation scores meet your thresholds; keep human review elsewhere.
  • Publish your runbook (who does what), KPIs, and a two‑page “How to use the assistant” for staff.

Data preparation that pays off

Chunking (how big is a “chunk”?)

  • Prefer paragraph/section‑based chunks with headings in metadata. Microsoft’s chunking guide lists sentence, layout‑based and semantic approaches with trade‑offs; start simple and evolve as quality demands.
  • When chunks lose context, retrieval suffers. Anthropic’s contextual retrieval suggests prepending short, chunk‑level context (e.g., “Policy X, Section 4: refunds”) to protect meaning during embedding and keyword indexing — worth testing.
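
A minimal version of that idea can be built from the metadata you already store: prepend a short title‑and‑section prefix before embedding and keyword indexing. Anthropic’s write‑up generates the context with a model instead, so treat this as the simplest variant worth testing.

```python
# Prepend a short, metadata-derived context string to each chunk before it is
# embedded and keyword-indexed. Anthropic's contextual retrieval generates the
# context with a model; this is the simplest metadata-only variant.
def contextualise(chunk: dict) -> str:
    return f'{chunk["title"]}, {chunk["section"]}: {chunk["text"]}'

# e.g. "Annual Leave Policy, 4.2 Part-time entitlement: Part-time staff receive..."
```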

Metadata that makes search smarter

  • Store: title, section, page/anchor, source URI, last reviewed, owner, document type, version, and security tag.
  • Use the metadata for filters (e.g., current policy only) and to show staff a clear citation link.
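
At answer time the same metadata drives both behaviours. A small sketch, assuming the field names above plus a “superseded” flag set during the Week 1 clean‑up.

```python
# Filtering on metadata and rendering a checkable citation. Field names follow
# the list above; "superseded" is an extra flag set during the Week 1 clean-up.
def filter_chunks(chunks: list[dict], doc_type: str) -> list[dict]:
    return [c for c in chunks
            if c["document_type"] == doc_type and not c.get("superseded", False)]


def citation(chunk: dict) -> str:
    return f'{chunk["title"]}, {chunk["section"]} ({chunk["source_uri"]}#{chunk["anchor"]})'
```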

Freshness and “one source of truth”

  • Agree the canonical source per policy/product. Your assistant should never retrieve from multiple conflicting PDFs.
  • Set automated stale‑content flags (e.g., anything older than 6 months in certain categories).
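
A nightly stale‑content check can be as simple as comparing last‑reviewed dates against a per‑category window. The windows below are illustrative.

```python
# Nightly stale-content flag: anything past its review window gets surfaced
# to the knowledge owner. Windows are illustrative.
from datetime import date, timedelta

REVIEW_WINDOWS = {"policy": timedelta(days=180), "pricing": timedelta(days=7)}


def is_stale(chunk: dict, today: date | None = None) -> bool:
    today = today or date.today()
    window = REVIEW_WINDOWS.get(chunk["document_type"], timedelta(days=180))
    return date.fromisoformat(chunk["last_reviewed"]) + window < today
```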

Retrieval choices that actually scale

Start here

  • Hybrid retrieval (BM25 + vector) with semantic reranking. Documented patterns: Microsoft Learn’s RAG information‑retrieval guide and product docs for Azure AI Search (guide, product page).
  • Return a few short, high‑quality passages; do not dump entire documents.

When to add sophistication

  • Query rewriting for ambiguous inputs — Microsoft reports quality improvements combining rewriting with a new semantic ranker in production. A small sketch follows this list.
  • Layout‑aware or semantic chunking when PDFs are highly structured; Azure’s layout/semantic chunking guidance explains options with Document Intelligence.
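
Query rewriting is simple to trial before committing to a managed feature. A sketch assuming a placeholder complete() call for whichever model API you use and a search() function wrapping the Week 2 hybrid retrieval; the prompt wording is only an example.

```python
# Query rewriting before retrieval. complete() is a placeholder for whichever
# model API you use; search() wraps the Week 2 hybrid retrieval and accepts a
# list of query strings to fuse. The prompt wording is only an example.
REWRITE_PROMPT = (
    "Rewrite the user's question as a self-contained search query for our "
    "policy documents. Keep product names and policy numbers verbatim.\n"
    "Question: {question}\nSearch query:"
)


def retrieve_with_rewrite(question: str, complete, search) -> list[dict]:
    rewritten = complete(REWRITE_PROMPT.format(question=question)).strip()
    return search([question, rewritten])  # search both forms, then fuse/rerank
```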

Make quality measurable (and boring)

Agree thresholds that trigger a release or rollback. Keep them simple and score daily on your Gold Set:

| Metric | Target for go‑live | Why it matters | How to score |
| --- | --- | --- | --- |
| Groundedness | ≥ 0.8 pass rate | Stops “made‑up” answers | Azure evaluators or an open metric like RAGAS “faithfulness” |
| Answer correctness | ≥ 80% vs Gold Set | Business correctness first | Azure “correctness” or RAGAS “answer accuracy” |
| Context precision/recall | ≥ 0.7 / 0.7 | Retrieval quality, not just generation | Azure retrieval evaluators or RAGAS context metrics |
| Citation presence | 100% of substantive claims | Auditability for staff and risk | Automatic check on outputs |
| Latency (p95) | ≤ 2.5 s internal; ≤ 4 s external | User acceptance | Log and monitor |
| Cost per successful answer | Defined cap per journey | Predictable spend | See cost governance below |

Microsoft documents how to combine groundedness with completeness and similarity to understand trade‑offs and guide tuning across phases. Keep a changelog of hyperparameters (chunk size, top‑K, reranker) and how they moved the needle.
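
Turning those thresholds into a repeatable go/no‑go check keeps releases boring, which is the goal. A small sketch mirroring the table above; the example scores are invented to show a rollback case.

```python
# A go/no-go check that runs after each daily evaluation. Thresholds mirror
# the table above; the example scores are invented to show a rollback case.
THRESHOLDS = {
    "groundedness": 0.80,
    "answer_correctness": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.70,
    "citation_presence": 1.00,
}


def release_gate(scores: dict) -> tuple[bool, list[str]]:
    failures = [metric for metric, floor in THRESHOLDS.items()
                if scores.get(metric, 0.0) < floor]
    return (not failures, failures)


ok, failures = release_gate({"groundedness": 0.84, "answer_correctness": 0.81,
                             "context_precision": 0.72, "context_recall": 0.69,
                             "citation_presence": 1.0})
# ok is False and failures == ["context_recall"]: keep human review, tune retrieval.
```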

Cost governance in practice

Where the money goes

  • Retrieval: indexing and queries on your search/vector service.
  • Generation: model calls (tokens in/out).
  • Ops: content upkeep, monitoring, evaluation runs, and incident handling.

Four dependable levers

  • Trim context. Inject fewer, better passages. Long prompts look clever and cost money.
  • Cache aggressively. Reused prefixes get discounted via prompt caching; OpenAI and Azure OpenAI document behaviour and thresholds (OpenAI, Azure). For high‑traffic patterns, consider semantic caching at the API gateway to reuse similar answers safely.
  • Tier models by task. Use a “good‑enough” model for drafts; escalate to a stronger one only on low‑confidence or complex queries (a routing sketch follows this list).
  • Cap per‑answer spend. Set alerts if the assistant exceeds your cap or drifts upwards week‑on‑week.
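
Model tiering is a routing decision, not an architecture change. A sketch with placeholder model names, assuming you have a confidence signal (for example a retrieval or evaluator score) that you have validated against the Gold Set.

```python
# Route by intent and confidence. Model names are placeholders; the confidence
# signal (a retrieval or evaluator score) should be validated on the Gold Set.
CHEAP_MODEL = "small-drafting-model"
STRONG_MODEL = "strong-reasoning-model"


def pick_model(intent: str, confidence: float) -> str:
    if intent in {"internal_search", "faq"} and confidence >= 0.75:
        return CHEAP_MODEL
    return STRONG_MODEL
```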

Review weekly with a simple dashboard: average tokens retrieved, average completion tokens, cache hit rate, and cost per successful answer.

Procurement questions UK buyers should actually ask

These map to GOV.UK guidance for choosing and securing SaaS and the UK’s AI cyber security code of practice.

  1. Data location: where are our content, embeddings, prompts and logs stored? Can we keep them in the UK or EEA? How do we export everything on exit? GOV.UK SaaS guidance
  2. Access control: SSO/MFA? Role‑based access for indexes and dashboards? Per‑query logging?
  3. Separation of data: do you ever train on our data? Can we technically enforce no training?
  4. Retrieval quality: do you support hybrid retrieval and reranking natively? Evidence of evaluation results on our sample data?
  5. Resilience and change: what’s your index refresh process? Do you flag stale content automatically?
  6. Security by design: align to the UK AI cyber security code (assets inventory, supply‑chain hygiene, attack surface for prompts/logs/models) see code.

Operating model: who does what after go‑live

  • Knowledge owners: one per policy/product area; weekly refresh; approve major changes.
  • RAG lead: monitors KPIs, tunes retrieval, signs off changes to prompts and thresholds.
  • IT/Sec: SSO/MFA, logs, vendor due diligence, incident process for prompts/logs as sensitive assets.
  • Support team: citation‑checking and escalation playbooks; tone and template reviews.

Common pitfalls (and quick fixes)

  • Vector‑only retrieval. Add BM25 + a reranker. Microsoft’s guidance and evaluations show the uplift clearly.
  • Over‑chunking. Use paragraph/section boundaries first; test contextual or layout‑aware chunking if accuracy still lags per Microsoft and Anthropic.
  • Stale content. Add owners and review dates; refuse to answer when documents are marked out‑of‑date.
  • No evaluation harness. You can start with 50–100 Q/A pairs and the built‑in Azure evaluators or RAGAS metrics to unblock decisions.
  • Security as an afterthought. Follow GOV.UK SaaS guidance and the UK AI cyber code to avoid nasty surprises: Using cloud tools securely, Securing SaaS, AI cyber security code.


What to do next

  1. Pick one journey you can fully answer from existing documents.
  2. Assemble a 50–100 question Gold Set with links to the exact clause/page.
  3. Stand up a hybrid retrieval index and run in shadow mode for two weeks.
  4. Track groundedness, correctness, context precision/recall, latency and cost per successful answer.