The Enterprise RAG Playbook
Updated: 10 July 2025 • ~10 min read
Retrieval-augmented generation (RAG) is usually explained with a single diagram: user question → retrieve docs → stuff into prompt → answer. The reality is messier. Over the last two years we have shipped RAG systems for due diligence, policy Q&A, customer support, and medical-device troubleshooting. This playbook condenses those projects into a repeatable blueprint you can copy, fork or ruthlessly criticise.
1 · Data source triage 🗂️
Great RAG starts long before embeddings. We run a FACT triage on every candidate source:
- Freshness – how often does it change, and do we need real-time updates?
- Authority – can we trust it legally & reputationally?
- Clarity – is the writing style and domain vocabulary consistent enough for the model to reason over?
- Topology – do documents naturally form a hierarchy we can exploit for chunking and navigation?
Score each dimension 1-5, then plot sources on a radar chart. Anything scoring < 3 on Authority or Clarity is either excluded or routed through a human curation step.
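To make the gate concrete, here is a minimal sketch of how the FACT scores and the curation rule could be encoded. The `SourceScore` dataclass, field names and example sources are illustrative, not part of any production tooling; only the "1-5 per dimension, below 3 on Authority or Clarity needs curation" rule comes from the triage above.

```python
from dataclasses import dataclass

# Hypothetical encoding of the FACT gate: each dimension is scored 1-5,
# and anything below 3 on Authority or Clarity is routed to human curation.
@dataclass
class SourceScore:
    name: str
    freshness: int
    authority: int
    clarity: int
    topology: int

    def needs_curation(self) -> bool:
        return self.authority < 3 or self.clarity < 3

sources = [
    SourceScore("policy-handbook", freshness=2, authority=5, clarity=4, topology=5),
    SourceScore("sales-wiki", freshness=4, authority=2, clarity=3, topology=2),
]

for s in sources:
    status = "route to human curation" if s.needs_curation() else "ingest"
    print(f"{s.name}: {status}")
```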
2 · Chunk for cognitive coherence ✂️
The “400-token” rule is a myth. Instead, measure the semantic completeness of a chunk: does it answer one atomic question? In regulatory handbooks we found 250-token chunks ideal, whereas troubleshooting guides performed better at ~650 tokens because diagrams are converted to long alt-text.
Practical tips:
- Split on `<h2>`/`<h3>` first, then apply a recursive, sentence-respecting splitter (`text-split-recursive`) to avoid dangling clauses (see the sketch after this list).
- Store the `parent_id` so you can fetch sibling context if needed for grounding.
- Generate an extractive summary (not abstractive) at ingestion time; it doubles as a search-optimised blurb and a preview for humans.
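As a rough, dependency-free illustration of that flow: heading-first splitting, sentence-aware packing toward a token budget (word count stands in as a crude token proxy), a `parent_id` per section, and a first-sentence extractive summary. Function and field names are ours for the example, not a prescribed API.

```python
import re
import uuid

def chunk_document(html: str, target_tokens: int = 250) -> list[dict]:
    """Sketch: split on <h2>/<h3>, then pack whole sentences into chunks."""
    chunks: list[dict] = []
    # 1. Coarse split on section headings (zero-width lookahead keeps the tag).
    sections = re.split(r"(?=<h[23][^>]*>)", html)
    for section in filter(str.strip, sections):
        parent_id = str(uuid.uuid4())
        text = re.sub(r"<[^>]+>", " ", section)           # strip tags
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        buf: list[str] = []
        for sentence in sentences:
            buf.append(sentence)
            # 2. Flush once the chunk reaches the (word-count proxy) budget.
            if sum(len(s.split()) for s in buf) >= target_tokens:
                chunks.append({
                    "parent_id": parent_id,
                    "text": " ".join(buf),
                    "summary": buf[0],   # extractive, not abstractive
                })
                buf = []
        if buf:  # flush the tail of the section
            chunks.append({"parent_id": parent_id, "text": " ".join(buf), "summary": buf[0]})
    return chunks
```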
3 · Retrieval cocktail 🍹
No single search technique is good enough. Our default cocktail:
- Dense vector search with cosine similarity (OpenAI `text-embedding-3-small`) for recall.
- BM25 keyword filter on the top 100 candidates for precision (excellent at numeric/abbreviation queries).
- Recency + authority boosters applied as a re-rank score: `score_final = sim * 0.7 + bm25 * 0.2 + freshness * 0.1` (a sketch of this re-rank step follows below).
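The re-rank step is small enough to show in full. The sketch below assumes each candidate already carries a normalised `sim`, `bm25` and `freshness` signal in [0, 1]; the weights are the ones from the formula above, while the candidate shape and the `rerank` helper are illustrative.

```python
def rerank(candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Combine the three signals into score_final and keep the best top_k."""
    for c in candidates:
        c["score_final"] = (
            0.7 * c["sim"]          # dense recall signal
            + 0.2 * c["bm25"]       # keyword precision signal
            + 0.1 * c["freshness"]  # recency/authority booster
        )
    return sorted(candidates, key=lambda c: c["score_final"], reverse=True)[:top_k]

candidates = [
    {"doc_id": "policy-001", "sim": 0.82, "bm25": 0.40, "freshness": 0.9},
    {"doc_id": "faq-114",    "sim": 0.78, "bm25": 0.95, "freshness": 0.3},
]
for c in rerank(candidates):
    print(c["doc_id"], round(c["score_final"], 3))
```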
We expose each stage’s results in the UI. Users gain trust when they can toggle “Why did I get this source?” and see the underlying signals.
4 · Prompt architecture 🧩
Avoid the mega-prompt. We separate concerns:
### System
You are an enterprise assistant…
### Context
{summaries list}
### Question
{user question}
Each summary is prefixed with a numbered identifier so the model can cite sources ([1], [2]). We tested JSON-schema outputs, but plain text with regex extraction proved more robust across model versions.
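For illustration, here is a minimal sketch of the prompt assembly and the regex-based citation extraction. The section markers mirror the template above; the helper names and the example answer are made up.

```python
import re

SYSTEM = "You are an enterprise assistant…"

def build_prompt(summaries: list[str], question: str) -> str:
    # Number each summary so the model can cite it as [1], [2], ...
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(summaries))
    return f"### System\n{SYSTEM}\n\n### Context\n{context}\n\n### Question\n{question}"

def extract_citations(answer: str) -> list[int]:
    # Pull every "[n]" marker the model emitted, deduplicated, in order of appearance.
    seen: list[int] = []
    for m in re.finditer(r"\[(\d+)\]", answer):
        n = int(m.group(1))
        if n not in seen:
            seen.append(n)
    return seen

answer = "Devices must be recalibrated annually [2], per the 2023 directive [1]."
print(extract_citations(answer))  # [2, 1]
```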
5 · Evaluation loop 🔄
Cold-start your evaluation with synthetic Q&A pairs generated from the documents themselves, then augment with real queries captured via telemetry. Key metrics:
| Signal | Why it matters | Target |
|---|---|---|
| Answer Relevance (Likert 1-5) | User usefulness | > 4.2 |
| Source Attribution (%) | Trust & traceability | > 90 % |
| Latency (P95) | UX performance | < 2 s |
Automate nightly eval runs with `langchain-bench` and push deltas to Slack. Teams notice regressions before users do.
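The nightly roll-up can be as small as the sketch below, which condenses raw eval records into the three signals from the table. The record shape and example values are hypothetical, and the Slack push is left to your own pipeline.

```python
import statistics

def summarise(runs: list[dict]) -> dict:
    """Roll per-query eval records up into the three tracked signals."""
    latencies = sorted(r["latency_s"] for r in runs)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]   # crude nearest-rank P95
    return {
        "answer_relevance": statistics.mean(r["relevance"] for r in runs),          # target > 4.2
        "source_attribution": sum(r["cited_correctly"] for r in runs) / len(runs),  # target > 0.90
        "latency_p95_s": p95,                                                       # target < 2 s
    }

runs = [
    {"relevance": 4.5, "cited_correctly": True,  "latency_s": 1.2},
    {"relevance": 3.8, "cited_correctly": True,  "latency_s": 1.9},
    {"relevance": 4.6, "cited_correctly": False, "latency_s": 2.4},
]
print(summarise(runs))  # push these deltas to Slack in your real pipeline
```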
6 · Governance & rollout 🛡️
Finally, RAG often touches regulated data. Build a “retrieval firewall”:
a query comes in → a policy engine (OPA or custom rules) decides which documents the caller is allowed to see, then passes an access-filtered vector query to the database. Without this step you are leaking information, even if it never leaves your VPC.
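In code, the firewall is simply an access filter applied *before* the vector query rather than after retrieval. The `policy_engine` and `vector_store` below are hypothetical stand-ins (OPA, or your own rules engine, would play the policy role, and the filter syntax varies by vector store); the point is that restricted chunks never reach the prompt at all.

```python
from typing import Callable

def firewalled_search(
    query_embedding: list[float],
    caller: dict,
    policy_engine: Callable[[dict], set[str]],  # returns the ACL groups the caller may read
    vector_store,                               # any client that accepts a metadata filter
    top_k: int = 20,
) -> list[dict]:
    allowed_groups = policy_engine(caller)
    if not allowed_groups:
        return []  # fail closed: no policy decision, no documents
    # The access filter is applied inside the vector query, not on the results,
    # so restricted chunks never enter the candidate set.
    return vector_store.search(
        embedding=query_embedding,
        filter={"acl_group": {"$in": sorted(allowed_groups)}},  # store-specific syntax
        limit=top_k,
    )
```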
Download the companion worksheet and scoring template here. Happy building!