[Image: Operations lead reviewing a RAG dashboard with citations and cost metrics]

The 6‑week RAG blueprint for UK SMEs: clean data, hybrid search, predictable costs

What this is (and why now)

RAG in plain English: your assistant looks up answers in your own documents first, then drafts a reply with citations. It’s the fastest route to useful AI without training a bespoke model. If you want a brisk primer, see OpenAI’s overview of RAG and semantic search and Microsoft’s Azure AI Search RAG guide.

But most pilots wobble on three basics: messy source data, weak retrieval (vector search alone), and no measurement. This six‑week blueprint fixes that with a small, auditable slice first: one business scenario, a gold‑standard evaluation set, and a retrieval stack that blends keyword and vector search with reranking. Microsoft and Elastic both show that hybrid search plus reranking raises quality and recall; Microsoft calls it “table stakes” for production RAG in their latest retrieval updates and documents hybrid behaviour and semantic ranking in depth here and here. Elastic’s product docs describe hybrid search and RRF rank fusion as the default pattern too for enterprise search.

The six‑week plan

Week 0: pick one value case and write your “Gold Set”

  • Choose a narrow journey with clear documents: staff policies, product FAQs, grant eligibility rules, or service SLAs.
  • Create a “Gold Set” of 50–100 real questions with the correct answer and a link to the exact clause/page. You’ll use this for daily evaluations (see Microsoft’s RAG evaluators); an example record follows this list.
  • Confirm sign‑offs: who owns the source docs? What is “correct” wording? Who approves live responses?
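
One record per question keeps the Gold Set easy to version and score. Here is a minimal sketch in Python, writing one JSON object per line (JSONL); the field names and the intranet URL are illustrative, not a schema required by any evaluation tool.

```python
# A single Gold Set record, stored as one JSON object per line (JSONL).
# Field names and the intranet URL are illustrative, not a required schema.
import json

gold_example = {
    "question": "How many days of annual leave do part-time staff get?",
    "expected_answer": "A pro-rated share of the 25-day full-time entitlement.",
    "source_id": "HR-POL-007",
    "source_title": "Annual Leave Policy",
    "section": "4.2 Part-time entitlement",
    "source_link": "https://intranet.example.com/policies/annual-leave#4-2",
    "last_verified": "2025-01-15",
}

with open("gold_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(gold_example, ensure_ascii=False) + "\n")
```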

Week 1: inventory and clean the source data

  • Gather only the authoritative documents for this journey. Archive duplicates and flag “superseded” versions.
  • Fix PDF sins: export to text, preserve headings, ensure tables are machine‑readable. Add a source ID, title, section, and last‑reviewed date to each record.
  • Decide chunking approach. Microsoft’s chunking guidance recommends semantically coherent chunks (paragraphs/sections) over raw fixed‑length splits for production quality with pros/cons by method. Anthropic’s research on contextual retrieval shows why adding chunk‑level context can materially improve retrieval quality at scale without losing meaning.
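
To make the chunking decision concrete, here is a minimal paragraph‑based chunker in Python. It is a sketch assuming plain‑text exports with blank lines between paragraphs; the heading heuristic, the max_chars cap, and the field names are placeholders to tune against your Gold Set, not recommendations from the guidance cited above.

```python
# Paragraph-based chunking with the metadata fields described above.
# Assumes plain-text exports with blank lines between paragraphs; the
# heading heuristic and max_chars are placeholders to tune.
from datetime import date


def chunk_document(text: str, source_id: str, title: str,
                   last_reviewed: date, max_chars: int = 1500) -> list[dict]:
    chunks: list[dict] = []
    buffer, section = "", "Introduction"
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if para.isupper() or para.endswith(":"):   # crude heading detector
            section = para.rstrip(":")
        if buffer and len(buffer) + len(para) > max_chars:
            chunks.append(_record(buffer, source_id, title, section, last_reviewed))
            buffer = ""
        buffer = f"{buffer}\n\n{para}" if buffer else para
    if buffer:
        chunks.append(_record(buffer, source_id, title, section, last_reviewed))
    return chunks


def _record(text: str, source_id: str, title: str, section: str, last_reviewed: date) -> dict:
    return {"text": text, "source_id": source_id, "title": title,
            "section": section, "last_reviewed": last_reviewed.isoformat()}
```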

Week 2: build the retrieval layer (hybrid by default)

  • Start with hybrid search: run both keyword (BM25) and vector queries, then rerank results. Microsoft documents this pattern with semantic ranking and reciprocal rank fusion in Azure AI Search, and the product page summarises it well for leaders. A minimal rank‑fusion sketch follows this list.
  • Return short passages (2–4) with source metadata. If nothing relevant is found, say “I don’t know” and escalate — a groundedness check in Azure Content Safety can help detect ungrounded answers automatically.
  • Security hygiene: treat the index like a system of record. Use SSO/MFA, least‑privilege roles, and follow GOV.UK’s checklists when choosing and securing SaaS/search tools: Securing SaaS tools and Using cloud tools securely.
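
If your search service does not fuse keyword and vector results for you, the core of hybrid retrieval is small enough to sketch. The Python below shows reciprocal rank fusion (RRF) over two ranked ID lists, assuming you already run a BM25 query and a vector query separately; managed services such as Azure AI Search and Elasticsearch implement this natively, so treat it as an illustration rather than a replacement, and add a reranker on top of the fused results.

```python
# Reciprocal rank fusion (RRF) over two ranked lists of document IDs,
# as produced by whatever BM25 and vector searches you already run.
# k=60 is the commonly used constant; a reranker would then reorder
# the fused top results before anything reaches the model.

def rrf_fuse(keyword_hits: list[str], vector_hits: list[str],
             k: int = 60, top_n: int = 4) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Only these few fused passages (plus their metadata) are sent to the model.
fused = rrf_fuse(["doc-7", "doc-2", "doc-9"], ["doc-2", "doc-5", "doc-7"])
print(fused)  # doc-2 and doc-7 rise to the top because both retrievers agree
```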

Week 3: shadow mode, then a small cohort

  • Shadow mode means the assistant drafts and a human approves every reply. Open it to 10–20 internal users or a subset of queries.
  • Evaluate daily on your Gold Set: groundedness, relevance, correctness, and completeness — Microsoft’s evaluators define these clearly and show how to combine them for decisions across the pipeline and per metric. If you prefer open tooling, RAGAS provides practical metrics like faithfulness, answer accuracy, and context precision/recall with definitions. A bare‑bones scoring loop follows this list.
  • Tune retrieval before prompts. Hybrid + reranker typically beats vector‑only; Microsoft’s evaluation posts show the uplift from adding a ranker on top of hybrid search with example scores.
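
The richer groundedness and relevance scores should come from Azure AI evaluators or RAGAS, but the cheap, always‑on subset is easy to run yourself every day. Here is a bare‑bones sketch, assuming a Gold Set in the JSONL shape from Week 0 and an answer_fn placeholder standing in for your assistant.

```python
# Bare-bones daily scoring over the Gold Set. answer_fn stands in for your
# assistant and should return {"text": ..., "citations": [source_id, ...]}.
# Richer groundedness/relevance scores come from Azure AI evaluators or RAGAS;
# these two checks are the cheap subset you can run on every change.
import json


def score_gold_set(path: str, answer_fn) -> dict:
    total = cited = right_source = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            answer = answer_fn(record["question"])
            total += 1
            if answer["citations"]:
                cited += 1
            if record["source_id"] in answer["citations"]:
                right_source += 1
    return {"n": total,
            "citation_presence": cited / total,
            "correct_source_rate": right_source / total}
```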

Week 4: cost controls and caching

  • Set a per‑answer cost guardrail for the journey (e.g., internal search ≤ a few pence; external replies slightly higher). Track average tokens retrieved and generated per answer; a worked cost check follows this list.
  • Use prompt/prefix caching where supported to cut spend and latency. OpenAI describes prompt caching with discounts on cached tokens in their API; Azure OpenAI supports prompt caching and documents thresholds/behaviour in detail. At the edge, semantic caching via Azure API Management can return cached responses for similar prompts to reduce costs further.
  • Trim long answers, cap context length, and ensure you only inject the minimal passages needed.
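
A per‑answer cost check is only a few lines once you log token counts. The prices and cap below are placeholders in GBP per 1,000 tokens; substitute your provider’s current rates and the cap you agreed for the journey.

```python
# Per-answer cost estimate against an agreed cap. Prices are placeholders
# in GBP per 1,000 tokens: substitute your provider's current rates.
PRICE_IN_PER_1K = 0.002      # uncached prompt tokens
PRICE_CACHED_PER_1K = 0.001  # cached prompt tokens, if your provider discounts them
PRICE_OUT_PER_1K = 0.008     # completion tokens
COST_CAP_GBP = 0.04          # "a few pence" cap for the internal-search journey


def answer_cost(prompt_tokens: int, cached_tokens: int, completion_tokens: int) -> float:
    uncached = max(prompt_tokens - cached_tokens, 0)
    return (uncached * PRICE_IN_PER_1K
            + cached_tokens * PRICE_CACHED_PER_1K
            + completion_tokens * PRICE_OUT_PER_1K) / 1000


cost = answer_cost(prompt_tokens=3200, cached_tokens=2000, completion_tokens=350)
if cost > COST_CAP_GBP:
    print(f"Over cap at £{cost:.4f}; alert the RAG lead and trim the injected context")
```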

Week 5: prepare for production

  • Document owners and a refresh cadence (weekly for policies/FAQs; daily for pricing/stock/SLAs).
  • Define incident types: stale content, prompt injection, missing citations, cost spikes. Decide who triages each (a sample runbook config follows this list).
  • Adopt the UK government’s AI cyber security code as a practical checklist for inventorying AI assets (models, prompts, logs) and everyday secure‑by‑design actions on GOV.UK.
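
The refresh cadence and incident routing stay honest when they live in version control next to the assistant’s configuration. An illustrative sketch; the owners, addresses, and response targets are placeholders.

```python
# Illustrative runbook data: refresh cadence per content category and who
# triages each incident type. Owners, addresses, and targets are placeholders.
RUNBOOK = {
    "refresh": {
        "policies_and_faqs":  {"owner": "hr-knowledge@company.example", "cadence": "weekly"},
        "pricing_stock_slas": {"owner": "ops-data@company.example",     "cadence": "daily"},
    },
    "incidents": {
        "stale_content":     {"triage": "knowledge owner", "target_response": "1 business day"},
        "prompt_injection":  {"triage": "IT/Sec",          "target_response": "4 hours"},
        "missing_citations": {"triage": "RAG lead",        "target_response": "1 business day"},
        "cost_spike":        {"triage": "RAG lead",        "target_response": "same day"},
    },
}
```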

Week 6: graduated rollout and sign‑off

  • Promote a small set of low‑risk intents to auto‑send when evaluation scores meet your thresholds; keep human review elsewhere.
  • Publish your runbook (who does what), KPIs, and a two‑page “How to use the assistant” for staff.

Data preparation that pays off

Chunking (how big is a “chunk”?)

  • Prefer paragraph/section‑based chunks with headings in metadata. Microsoft’s chunking guide lists sentence, layout‑based and semantic approaches with trade‑offs; start simple and evolve as quality demands.
  • When chunks lose context, retrieval suffers. Anthropic’s contextual retrieval suggests prepending short, chunk‑level context (e.g., “Policy X, Section 4: refunds”) to protect meaning during embedding and keyword indexing — worth testing.
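
A minimal version of that idea can be built from the metadata you already store: prepend a short title‑and‑section prefix before embedding and keyword indexing. Anthropic’s write‑up generates the context with a model instead, so treat this as the simplest variant worth testing.

```python
# Prepend a short, metadata-derived context string to each chunk before it is
# embedded and keyword-indexed. Anthropic's contextual retrieval generates the
# context with a model; this is the simplest metadata-only variant.
def contextualise(chunk: dict) -> str:
    return f'{chunk["title"]}, {chunk["section"]}: {chunk["text"]}'

# e.g. "Annual Leave Policy, 4.2 Part-time entitlement: Part-time staff receive..."
```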

Metadata that makes search smarter

  • Store: title, section, page/anchor, source URI, last reviewed, owner, document type, version, and security tag.
  • Use the metadata for filters (e.g., current policy only) and to show staff a clear citation link.
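
At answer time the same metadata drives both behaviours. A small sketch, assuming the field names above plus a “superseded” flag set during the Week 1 clean‑up.

```python
# Filtering on metadata and rendering a checkable citation. Field names follow
# the list above; "superseded" is an extra flag set during the Week 1 clean-up.
def filter_chunks(chunks: list[dict], doc_type: str) -> list[dict]:
    return [c for c in chunks
            if c["document_type"] == doc_type and not c.get("superseded", False)]


def citation(chunk: dict) -> str:
    return f'{chunk["title"]}, {chunk["section"]} ({chunk["source_uri"]}#{chunk["anchor"]})'
```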

Freshness and “one source of truth”

  • Agree the canonical source per policy/product. Your assistant should never retrieve from multiple conflicting PDFs.
  • Set automated stale‑content flags (e.g., anything older than 6 months in certain categories).
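
A nightly stale‑content check can be as simple as comparing last‑reviewed dates against a per‑category window. The windows below are illustrative.

```python
# Nightly stale-content flag: anything past its review window gets surfaced
# to the knowledge owner. Windows are illustrative.
from datetime import date, timedelta

REVIEW_WINDOWS = {"policy": timedelta(days=180), "pricing": timedelta(days=7)}


def is_stale(chunk: dict, today: date | None = None) -> bool:
    today = today or date.today()
    window = REVIEW_WINDOWS.get(chunk["document_type"], timedelta(days=180))
    return date.fromisoformat(chunk["last_reviewed"]) + window < today
```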

Retrieval choices that actually scale

Start here

  • Hybrid retrieval (BM25 + vector) with semantic reranking. Documented patterns: Microsoft Learn’s RAG information‑retrieval guide and product docs for Azure AI Search (guide, product page).
  • Return a few short, high‑quality passages; do not dump entire documents.

When to add sophistication

  • Query rewriting for ambiguous inputs — Microsoft reports quality improvements combining rewriting with a new semantic ranker in production. A small sketch follows this list.
  • Layout‑aware or semantic chunking when PDFs are highly structured; Azure’s layout/semantic chunking guidance explains options with Document Intelligence.
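
Query rewriting is simple to trial before committing to a managed feature. A sketch assuming a placeholder complete() call for whichever model API you use and a search() function wrapping the Week 2 hybrid retrieval; the prompt wording is only an example.

```python
# Query rewriting before retrieval. complete() is a placeholder for whichever
# model API you use; search() wraps the Week 2 hybrid retrieval and accepts a
# list of query strings to fuse. The prompt wording is only an example.
REWRITE_PROMPT = (
    "Rewrite the user's question as a self-contained search query for our "
    "policy documents. Keep product names and policy numbers verbatim.\n"
    "Question: {question}\nSearch query:"
)


def retrieve_with_rewrite(question: str, complete, search) -> list[dict]:
    rewritten = complete(REWRITE_PROMPT.format(question=question)).strip()
    return search([question, rewritten])  # search both forms, then fuse/rerank
```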

Make quality measurable (and boring)

Agree thresholds that trigger a release or rollback. Keep them simple and score daily on your Gold Set:

| Metric | Target for go‑live | Why it matters | How to score |
| --- | --- | --- | --- |
| Groundedness | ≥ 0.8 pass rate | Stops “made‑up” answers | Azure evaluators or an open metric like RAGAS “faithfulness” |
| Answer correctness | ≥ 80% vs Gold Set | Business correctness first | Azure “correctness” or RAGAS “answer accuracy” |
| Context precision/recall | ≥ 0.7 / 0.7 | Retrieval quality, not just generation | Azure retrieval evaluators or RAGAS context metrics |
| Citation presence | 100% of substantive claims | Auditability for staff and risk | Automatic check on outputs |
| Latency (p95) | ≤ 2.5 s internal; ≤ 4 s external | User acceptance | Log and monitor |
| Cost per successful answer | Defined cap per journey | Predictable spend | See cost governance below |

Microsoft documents how to combine groundedness with completeness and similarity to understand trade‑offs and guide tuning across phases. Keep a changelog of hyperparameters (chunk size, top‑K, reranker) and how they moved the needle.
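
Turning those thresholds into a repeatable go/no‑go check keeps releases boring, which is the goal. A small sketch mirroring the table above; the example scores are invented to show a rollback case.

```python
# A go/no-go check that runs after each daily evaluation. Thresholds mirror
# the table above; the example scores are invented to show a rollback case.
THRESHOLDS = {
    "groundedness": 0.80,
    "answer_correctness": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.70,
    "citation_presence": 1.00,
}


def release_gate(scores: dict) -> tuple[bool, list[str]]:
    failures = [metric for metric, floor in THRESHOLDS.items()
                if scores.get(metric, 0.0) < floor]
    return (not failures, failures)


ok, failures = release_gate({"groundedness": 0.84, "answer_correctness": 0.81,
                             "context_precision": 0.72, "context_recall": 0.69,
                             "citation_presence": 1.0})
# ok is False and failures == ["context_recall"]: keep human review, tune retrieval.
```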

Cost governance in practice

Where the money goes

  • Retrieval: indexing and queries on your search/vector service.
  • Generation: model calls (tokens in/out).
  • Ops: content upkeep, monitoring, evaluation runs, and incident handling.

Four dependable levers

  • Trim context. Inject fewer, better passages. Long prompts look clever and cost money.
  • Cache aggressively. Reused prefixes get discounted via prompt caching; OpenAI and Azure OpenAI document behaviour and thresholds (OpenAI, Azure). For high‑traffic patterns, consider semantic caching at the API gateway to reuse similar answers safely.
  • Tier models by task. Use a “good‑enough” model for drafts; escalate to a stronger one only on low‑confidence or complex queries (a routing sketch follows this list).
  • Cap per‑answer spend. Set alerts if the assistant exceeds your cap or drifts upwards week‑on‑week.
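
Model tiering is a routing decision, not an architecture change. A sketch with placeholder model names, assuming you have a confidence signal (for example a retrieval or evaluator score) that you have validated against the Gold Set.

```python
# Route by intent and confidence. Model names are placeholders; the confidence
# signal (a retrieval or evaluator score) should be validated on the Gold Set.
CHEAP_MODEL = "small-drafting-model"
STRONG_MODEL = "strong-reasoning-model"


def pick_model(intent: str, confidence: float) -> str:
    if intent in {"internal_search", "faq"} and confidence >= 0.75:
        return CHEAP_MODEL
    return STRONG_MODEL
```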

Review weekly with a simple dashboard: average tokens retrieved, average completion tokens, cache hit rate, and cost per successful answer.

Procurement questions UK buyers should actually ask

These map to GOV.UK guidance for choosing and securing SaaS and the UK’s AI cyber security code of practice.

  1. Data location: where are our content, embeddings, prompts and logs stored? Can we keep them in the UK or EEA? How do we export everything on exit? GOV.UK SaaS guidance
  2. Access control: SSO/MFA? Role‑based access for indexes and dashboards? Per‑query logging?
  3. Separation of data: do you ever train on our data? Can we technically enforce no training?
  4. Retrieval quality: do you support hybrid retrieval and reranking natively? Evidence of evaluation results on our sample data?
  5. Resilience and change: what’s your index refresh process? Do you flag stale content automatically?
  6. Security by design: align to the UK AI cyber security code (assets inventory, supply‑chain hygiene, attack surface for prompts/logs/models) see code.

Operating model: who does what after go‑live

  • Knowledge owners: one per policy/product area; weekly refresh; approve major changes.
  • RAG lead: monitors KPIs, tunes retrieval, signs off changes to prompts and thresholds.
  • IT/Sec: SSO/MFA, logs, vendor due diligence, incident process for prompts/logs as sensitive assets.
  • Support team: citation‑checking and escalation playbooks; tone and template reviews.

Common pitfalls (and quick fixes)

  • Vector‑only retrieval. Add BM25 + a reranker. Microsoft’s guidance and evaluations show the uplift clearly.
  • Over‑chunking. Use paragraph/section boundaries first; test contextual or layout‑aware chunking if accuracy still lags per Microsoft and Anthropic.
  • Stale content. Add owners and review dates; refuse to answer when documents are marked out‑of‑date.
  • No evaluation harness. You can start with 50–100 Q/A pairs and the built‑in Azure evaluators or RAGAS metrics to unblock decisions.
  • Security as an afterthought. Follow GOV.UK SaaS guidance and the UK AI cyber code to avoid nasty surprises: Using cloud tools securely, Securing SaaS, AI cyber security code.


What to do next

  1. Pick one journey you can fully answer from existing documents.
  2. Assemble a 50–100 question Gold Set with links to the exact clause/page.
  3. Stand up a hybrid retrieval index and run in shadow mode for two weeks.
  4. Track groundedness, correctness, context precision/recall, latency and cost per successful answer.