[Image: folders and sticky notes arranged to show document chunks and metadata tags]
Data foundations & retrieval patterns

Chunk once, answer everywhere: the UK SME guide to document chunking, metadata and RAG that actually works

If your AI assistant is still saying “I can’t find that policy” while the PDF sits in SharePoint or Google Drive, the problem usually isn’t the model—it’s how the content is chopped up and tagged before search. The two levers you control are chunking (how you break documents into retrievable passages) and metadata (the labels that tell search and AI what each passage is). Together they decide what’s actually retrieved, what gets ignored, and what your people trust.

There’s also a quirk of long‑context models: answers are often best when the relevant passage is at the start or end of the prompt, and can degrade when key facts sit in the middle—one reason careless chunking or over‑long context windows quietly lower accuracy. This “lost in the middle” effect is documented in peer‑reviewed research, with follow‑on work showing ways to mitigate positional bias. arxiv.org

The 10‑day chunking and metadata plan

Day 1: lock your scope (30 golden questions)

  • Pick one team or service line (for example, HR policies, helpline scripts, or grant criteria).
  • Write 30 realistic questions people actually ask. Include a mix of quick facts, “how do I” steps, and judgement calls.
  • For each question, note the single most authoritative source and the section/page where the answer lives.

Day 2: audit the source content

  • List where the content lives today (SharePoint library, Google Drive folder, Confluence, website, PDF store).
  • Note the formats (DOCX, PDF, PPTX, HTML), the typical length, and whether headings are accurate.
  • Identify duplicates and near‑duplicates—decide which is the “gold” version.

Day 3: choose a sensible baseline chunk size and overlap

Start simple. A widely used baseline is about 512 tokens with ~20–25% overlap, then adjust based on your documents and evaluation results. Microsoft’s Azure AI Search guidance recommends this ballpark and explicitly suggests experimenting with chunk size and overlap. learn.microsoft.com
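
For teams that want to see the mechanics, here is a minimal sketch of fixed‑size chunking with roughly 25% overlap. It assumes the tiktoken tokeniser; any tokeniser that matches your embedding model will do, and the 512/128 figures are simply the baseline above, not hard rules.

```python
# Minimal sketch: fixed-size token chunking with overlap.
# Assumes the tiktoken library; swap in the tokeniser that matches your embedding model.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into ~chunk_size-token chunks, each overlapping the previous by ~25%."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Example: chunk a policy document already extracted as plain text.
# chunks = chunk_text(open("holiday_policy.txt").read())
```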

Day 4: design a minimal metadata schema (no new system required)

Create a small set of fields you can populate automatically during indexing. Aim for:

  • source_path (URL or drive path)
  • title and section_headings
  • content_type (policy, SOP, form, minutes, FAQ)
  • last_modified and version
  • owner and audience (internal, partners, public)
  • permissions_label (for example, “internal” vs “public”)

These fields help your search engine combine keyword and vector results, filter by audience, and show clear citations. They also support everyday information management good practice on GOV.UK (naming, ownership, and only keeping what you need). gov.uk
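
As an illustration, the schema could travel with each chunk as a simple record like the one below. The dataclass and the example values are illustrative only, not a required structure.

```python
# Minimal sketch: per-chunk metadata record populated automatically at indexing time.
# Field names mirror the schema above; the example values are placeholders.
from dataclasses import dataclass, field

@dataclass
class ChunkMetadata:
    source_path: str                        # URL or drive path
    title: str
    section_headings: list[str] = field(default_factory=list)
    content_type: str = "policy"            # policy, SOP, form, minutes, FAQ
    last_modified: str = ""                 # ISO date, e.g. "2024-06-01"
    version: str = ""
    owner: str = ""
    audience: str = "internal"              # internal, partners, public
    permissions_label: str = "internal"

# Example record for one chunk of an HR policy (hypothetical paths and values):
example = ChunkMetadata(
    source_path="https://example.sharepoint.com/HR/holiday-policy.docx",
    title="Holiday policy",
    section_headings=["4. Carrying over leave"],
    last_modified="2024-06-01",
    version="3.2",
    owner="HR",
)
```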

Day 5: pick the right chunking pattern per document type

  • Policies/SOPs (DOCX/HTML): split by headings and paragraphs so each chunk is a single rule or step (see the sketch after this list). Add section and policy IDs as metadata. Content‑aware chunking like this preserves meaning better than naive fixed windows for structured documents. docs.cohere.com
  • Scanned or formatted PDFs: start with page‑level chunks, then optionally combine short pages. Many teams see strong results with page‑level chunking in tests. developer.nvidia.com
  • Meeting notes/transcripts: chunk by speaker turn or agenda item; overlap lightly so names and actions stay intact. docs.cohere.com
  • FAQs: one Q–A per chunk; no overlap.
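
To make the first pattern concrete, here is a minimal sketch of heading‑aware chunking for a policy exported as plain text or Markdown. The heading regex is deliberately simple and is an assumption you would adapt to your own document templates.

```python
# Minimal sketch: heading-aware chunking for policies/SOPs.
# Assumes headings can be detected by a simple pattern; adapt the regex to your templates.
import re

HEADING = re.compile(r"^(#{1,4}\s+.+|\d+(\.\d+)*\s+.+)$")  # "# Title" or "4.2 Carrying over leave"

def chunk_by_heading(text: str) -> list[dict]:
    chunks, current_heading, buffer = [], "Introduction", []
    for line in text.splitlines():
        if HEADING.match(line.strip()):
            if buffer:
                chunks.append({"section": current_heading, "text": "\n".join(buffer).strip()})
            current_heading, buffer = line.strip(), []
        else:
            buffer.append(line)
    if buffer:
        chunks.append({"section": current_heading, "text": "\n".join(buffer).strip()})
    return chunks

# Each chunk carries its section heading as metadata, ready to show in citations.
```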

Day 6: wire this into your search index

If you use Azure AI Search, “integrated vectorisation” will chunk and embed during indexing and can combine vector and keyword fields for hybrid search, keeping your pipeline simple. Equivalent features exist in other platforms; the principle is the same. learn.microsoft.com
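
For illustration, a hybrid query on Azure AI Search can look like the sketch below. It assumes the azure-search-documents Python SDK (version 11.4 or later), an index named "policies" with a vector field called contentVector, and an embedding function of your own; all of those names are placeholders to adapt.

```python
# Minimal sketch: hybrid (keyword + vector) query against Azure AI Search.
# Assumes azure-search-documents >= 11.4 and an index with "content" and "contentVector" fields.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

def embed(text: str) -> list[float]:
    # Placeholder: call the same embedding model the index was built with.
    raise NotImplementedError("plug in your embedding call")

client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",   # placeholder
    index_name="policies",                                   # placeholder
    credential=AzureKeyCredential("<api-key>"),              # placeholder
)

query_text = "How many days of annual leave carry over?"
results = client.search(
    search_text=query_text,                                  # keyword side
    vector_queries=[VectorizedQuery(vector=embed(query_text),
                                    k_nearest_neighbors=5,
                                    fields="contentVector")],  # vector side
    select=["title", "section_headings", "source_path"],
    top=5,
)
for doc in results:
    print(doc["title"], doc["source_path"])
```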

Day 7: evaluate with your 30 questions

  • For each question, check: did the right chunk appear in the top‑k results; did the answer cite the correct source and section; was the text current?
  • Track three numbers: Answer correctness (exact/partial/miss), Citation coverage (did it show a link and section), and Time‑to‑first‑useful‑answer.
  • If you need a structured way to test, adapt the checks in our recent guides: The 9 AI content quality tests and the Hybrid retrieval, freshness and evaluation playbook. A minimal scoring sketch for the golden set follows this list.
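
Here is that sketch, assuming each golden question records its expected source and section, and that you have a retrieve(question, k) function returning chunks with the Day 4 metadata fields; every name here is illustrative.

```python
# Minimal sketch: score the golden set for "top-k contains correct chunk".
# Assumes retrieve(question, k) returns chunk dicts carrying source_path and section metadata.
def top_k_hit_rate(golden_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for item in golden_set:  # item: {"question": ..., "expected_source": ..., "expected_section": ...}
        results = retrieve(item["question"], k=k)
        if any(r["source_path"] == item["expected_source"]
               and item["expected_section"] in r.get("section", "")
               for r in results):
            hits += 1
    return hits / len(golden_set)

# Example usage against your 30 questions:
# print(f"Top-5 hit rate: {top_k_hit_rate(golden_questions, retrieve):.0%}")
```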

Day 8: tune recall, precision and cost

  • Hybrid beats either/or: combine vector similarity with keyword/semantic ranking; Microsoft’s benchmark guidance finds this consistently improves relevance. learn.microsoft.com
  • Find your “middle” chunk size: tests from industry labs show that very small chunks (for example, 128 tokens) and very large chunks (for example, 2,048 tokens) often underperform. The sweet spot varies by query type: fact questions favour smaller chunks, while analytical questions benefit from larger or page‑level chunks. developer.nvidia.com
  • Reduce index size: use vector quantisation where supported to compress embeddings (for example, 4× with int8; up to ~28× with binary), then validate any impact on relevance. learn.microsoft.com
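
On the compression point, here is a minimal sketch of scalar (int8) quantisation with NumPy. Real search engines apply this inside the index, so the code is only to show where the roughly 4× saving comes from and why you should re‑check relevance afterwards.

```python
# Minimal sketch: scalar (int8) quantisation of float32 embeddings, ~4x smaller on disk.
# Search services typically do this inside the index; this only illustrates the idea.
import numpy as np

def quantise_int8(embeddings: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map float32 vectors onto int8 with a single linear scale; returns (codes, lo, hi)."""
    lo, hi = float(embeddings.min()), float(embeddings.max())
    scaled = (embeddings - lo) / (hi - lo)               # -> [0, 1]
    codes = np.round(scaled * 255 - 128).astype(np.int8)  # -> [-128, 127]
    return codes, lo, hi

def dequantise(codes: np.ndarray, lo: float, hi: float) -> np.ndarray:
    return ((codes.astype(np.float32) + 128) / 255) * (hi - lo) + lo

vecs = np.random.rand(1000, 1536).astype(np.float32)   # e.g. 1,000 chunks x 1,536 dimensions
codes, lo, hi = quantise_int8(vecs)
print(vecs.nbytes / codes.nbytes)                       # ~4.0x smaller
# Always re-run your golden-question evaluation after enabling any compression.
```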

Day 9: build a small change policy

  • Nominate content owners; when a policy changes, they update the source, not the AI tool.
  • Set a “freshness SLO” per collection (for example, HR policies re‑indexed nightly; forms weekly).
  • Add a simple banner in answers that shows the source, section and last‑updated date.

Day 10: rollout and train

  • Show users how to click the citation to the exact section or page. Encourage feedback buttons (“Helpful?” “Out of date?”).
  • Publish a one‑pager on “What’s in scope” so colleagues know which sources are included.

Choosing the right chunking mode: a quick decision tree

  • Do your documents have clear headings or numbered steps? Use heading/paragraph chunks with minimal overlap.
  • Are they layout‑heavy PDFs (tables, forms, posters)? Start page‑level; test modest page merges for short pages. developer.nvidia.com
  • Are they transcripts or minutes? Split by speaker turn or agenda item with small overlap so names/actions stay together. docs.cohere.com
  • Mixed bag and you need speed now? Use fixed ~512‑token chunks with 20–25% overlap, then iterate using your evaluation KPIs. learn.microsoft.com

Minimal metadata that pays for itself

Keep it light so it can be populated automatically. At indexing time, extract:

  1. Title and section_headings (for better snippets and navigation).
  2. Source path and page/section anchor (so citations link to the precise spot).
  3. Last modified and version (to display freshness and resolve duplicates).
  4. Content type (policy, SOP, form, FAQ) to boost or filter.
  5. Audience/permissions label (public/internal/partners) so your assistant can safely show or hide results; a query‑time filtering sketch follows below. This also supports basic good practice for cloud tools use in the public sector. gov.uk

That’s it. Don’t over‑engineer it. You can always add more fields once you see where answers fail.
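
To show how the audience/permissions label earns its keep, here is a minimal sketch of filtering retrieved chunks before they reach the prompt. In a managed service you would push the same rule down as an index filter (Azure AI Search, for example, accepts OData filters) so restricted chunks are never retrieved at all; retrieved_chunks below is a hypothetical variable.

```python
# Minimal sketch: enforce the audience/permissions label at query time.
# In production, push this down as an index filter so restricted chunks are never retrieved.
def allowed(chunk: dict, user_audiences: set[str]) -> bool:
    return chunk.get("permissions_label", "internal") in user_audiences

def filter_results(results: list[dict], user_audiences: set[str]) -> list[dict]:
    return [chunk for chunk in results if allowed(chunk, user_audiences)]

# Example: a member of the public only sees chunks labelled "public".
public_user = {"public"}
staff_user = {"public", "internal"}
# safe_results = filter_results(retrieved_chunks, public_user)
```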

Evaluation KPIs and thresholds you can adopt tomorrow

KPI | Good (green) | Watch (amber) | Action (red)
--- | --- | --- | ---
Answer correctness on golden set | ≥ 85% | 70–84% | < 70%
Citation coverage (has a source + section) | ≥ 95% | 85–94% | < 85%
Top‑k contains correct chunk | ≥ 95% @ k=5 | 85–94% | < 85%
Average tokens per answer | Within budget trend | +10–20% vs last week | > 20% spike
Time‑to‑first‑useful‑answer | < 2.5s | 2.5–4s | > 4s

If scores drop as you increase context size, you may be hitting positional bias—consider reducing chunk size, increasing overlap slightly, or retrieving fewer but better‑ranked chunks. arxiv.org
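
One low‑effort mitigation, drawn from the positional‑bias literature rather than from any particular product, is to reorder the retrieved chunks so the strongest sit at the start and end of the prompt. A minimal sketch:

```python
# Minimal sketch: place the best-ranked chunks at the start and end of the context,
# pushing weaker ones towards the middle, where long-context models pay least attention.
def reorder_for_position(chunks_ranked_best_first: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_ranked_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ["A", "B", "C", "D", "E"] (best to worst) -> ["A", "C", "E", "D", "B"]
```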

Cost, risk and how to stay in control

Risk/cost | What it looks like | Mitigation
--- | --- | ---
Runaway index size | Embedding costs swell; storage limits hit | De‑duplicate; reduce overlap; compress vectors with quantisation where available; archive stale content. learn.microsoft.com
Missed answers | Right document, wrong passage retrieved | Switch from fixed to heading‑aware or page‑level chunking; test hybrid retrieval with semantic ranking. learn.microsoft.com
Overlong prompts hurt accuracy | Model includes too many chunks; facts buried mid‑prompt | Reduce k; shorten chunks; summarise top passages; beware “lost in the middle”. arxiv.org
Permission leakage | Assistant surfaces internal content | Index only sources the user can access; store a permissions label and filter at query time; align with your internal cloud‑use policy. gov.uk

Procurement questions to separate signal from noise

Ask vendors to answer in writing and show evidence against your own documents:

  1. Which chunking strategies do you support out of the box (fixed, heading‑aware, page‑level, transcript‑aware)? Can we set size and overlap per content type?
  2. How do you combine vector and keyword search? Do you support hybrid with semantic re‑ranking? Show evaluation results on our golden questions. learn.microsoft.com
  3. Can we store and query by metadata fields (title, section, version, audience/permissions)? How are citations constructed to precise sections?
  4. What controls exist to cap token usage and limit retrieved chunks? Can we set a maximum k, maximum prompt length and safe defaults per workflow?
  5. What compression or index‑size controls exist (for example, vector quantisation)? What is the measured impact on relevance? learn.microsoft.com
  6. How do you keep indexes fresh without re‑embedding everything? What’s your incremental update strategy for SharePoint/Google Drive?

For a structured way to compare vendors in two weeks, adapt our AI vendor bake‑off templates.

Playbook recap: your first 30 days

  • Week 1: choose scope and golden questions; audit content; pick baseline chunking + metadata.
  • Week 2: implement indexing with hybrid search; deploy to a small group; log evaluation metrics.
  • Week 3: tune chunk size/overlap and k; compress vectors if storage is tight; fix outlier documents. learn.microsoft.com
  • Week 4: train users; publish “What’s in scope”; move a second team onto the pipeline.

Need to prepare shared drives first? Use our 30‑day knowledge readiness plan, RAG‑ready in 30 days, to get organised before indexing.

Book a 30‑minute call or email: team@youraiconsultant.london

Further reading

  • Microsoft Learn: chunking and hybrid relevance tuning in Azure AI Search. learn.microsoft.com
  • NVIDIA: findings on chunk size and page‑level chunking across datasets. developer.nvidia.com
  • Cohere and Pinecone: practical chunking strategies and when to move beyond fixed windows. docs.cohere.com
  • Research on long‑context positional bias (“lost in the middle”) and mitigation ideas. arxiv.org

For evaluation ideas and answer QA checklists, see our content quality tests.