The Enterprise RAG Playbook
Updated: 10 July 2025 • ~10 min read
Retrieval-augmented generation (RAG) is usually explained with a single diagram: user question → retrieve docs → stuff into prompt → answer. The reality is messier. Over the last two years we have shipped RAG systems for due diligence, policy Q&A, customer support, and medical-device troubleshooting. This playbook condenses those projects into a repeatable blueprint you can copy, fork or ruthlessly criticise.
1 · Data source triage 🗂️
Great RAG starts long before embeddings. We run a FACT triage on every candidate source:
- Freshness – how often does it change, and do we need real-time updates?
- Authority – can we trust it legally & reputationally?
- Clarity – is the writing style and domain vocabulary consistent enough for the model to reason over?
- Topology – do documents naturally form a hierarchy we can exploit for chunking and navigation?
Score each dimension 1-5, then plot sources on a radar chart. Anything scoring < 3 on Authority or Clarity is either excluded or routed through a human curation step.
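To make the gate concrete, here is a minimal sketch of how the FACT scores and the curation rule could be encoded. The `SourceScore` dataclass, field names and example sources are illustrative, not part of any production tooling; only the "1-5 per dimension, below 3 on Authority or Clarity needs curation" rule comes from the triage above.

```python
from dataclasses import dataclass

# Hypothetical encoding of the FACT gate: each dimension is scored 1-5,
# and anything below 3 on Authority or Clarity is routed to human curation.
@dataclass
class SourceScore:
    name: str
    freshness: int
    authority: int
    clarity: int
    topology: int

    def needs_curation(self) -> bool:
        return self.authority < 3 or self.clarity < 3

sources = [
    SourceScore("policy-handbook", freshness=2, authority=5, clarity=4, topology=5),
    SourceScore("sales-wiki", freshness=4, authority=2, clarity=3, topology=2),
]

for s in sources:
    status = "route to human curation" if s.needs_curation() else "ingest"
    print(f"{s.name}: {status}")
```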
2 · Chunk for cognitive coherence ✂️
The “400-token” rule is a myth. Instead, measure the semantic completeness of a chunk: does it answer one atomic question? In regulatory handbooks we found 250-token chunks ideal, whereas troubleshooting guides performed better at ~650 tokens because diagrams are converted to long alt-text.
Practical tips:
- Split on `<h2>`/`<h3>` first, then apply a recursive, sentence-respecting splitter (`text-split-recursive`) to avoid dangling clauses (see the sketch after this list).
- Store the `parent_id` so you can fetch sibling context if needed for grounding.
- Generate an extractive summary (not abstractive) at ingestion time; it doubles as a search-optimised blurb and a preview for humans.
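As a rough, dependency-free illustration of that flow: heading-first splitting, sentence-aware packing toward a token budget (word count stands in as a crude token proxy), a `parent_id` per section, and a first-sentence extractive summary. Function and field names are ours for the example, not a prescribed API.

```python
import re
import uuid

def chunk_document(html: str, target_tokens: int = 250) -> list[dict]:
    """Sketch: split on <h2>/<h3>, then pack whole sentences into chunks."""
    chunks: list[dict] = []
    # 1. Coarse split on section headings (zero-width lookahead keeps the tag).
    sections = re.split(r"(?=<h[23][^>]*>)", html)
    for section in filter(str.strip, sections):
        parent_id = str(uuid.uuid4())
        text = re.sub(r"<[^>]+>", " ", section)           # strip tags
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        buf: list[str] = []
        for sentence in sentences:
            buf.append(sentence)
            # 2. Flush once the chunk reaches the (word-count proxy) budget.
            if sum(len(s.split()) for s in buf) >= target_tokens:
                chunks.append({
                    "parent_id": parent_id,
                    "text": " ".join(buf),
                    "summary": buf[0],   # extractive, not abstractive
                })
                buf = []
        if buf:  # flush the tail of the section
            chunks.append({"parent_id": parent_id, "text": " ".join(buf), "summary": buf[0]})
    return chunks
```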
3 · Retrieval cocktail 🍹
No single search technique is good enough. Our default cocktail:
- Dense vector search with cosine similarity (OpenAI `text-embedding-3-small`) for recall.
- BM25 keyword filter on the top 100 candidates for precision (excellent at numeric/abbreviation queries).
- Recency + authority boosters applied as a re-rank score: `score_final = sim * 0.7 + bm25 * 0.2 + freshness * 0.1` (a sketch of this re-rank step follows below).
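The re-rank step is small enough to show in full. The sketch below assumes each candidate already carries a normalised `sim`, `bm25` and `freshness` signal in [0, 1]; the weights are the ones from the formula above, while the candidate shape and the `rerank` helper are illustrative.

```python
def rerank(candidates: list[dict], top_k: int = 10) -> list[dict]:
    """Combine the three signals into score_final and keep the best top_k."""
    for c in candidates:
        c["score_final"] = (
            0.7 * c["sim"]          # dense recall signal
            + 0.2 * c["bm25"]       # keyword precision signal
            + 0.1 * c["freshness"]  # recency/authority booster
        )
    return sorted(candidates, key=lambda c: c["score_final"], reverse=True)[:top_k]

candidates = [
    {"doc_id": "policy-001", "sim": 0.82, "bm25": 0.40, "freshness": 0.9},
    {"doc_id": "faq-114",    "sim": 0.78, "bm25": 0.95, "freshness": 0.3},
]
for c in rerank(candidates):
    print(c["doc_id"], round(c["score_final"], 3))
```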
We expose each stage’s results in the UI. Users gain trust when they can toggle “Why did I get this source?” and see the underlying signals.
4 · Prompt architecture 🧩
Avoid the mega-prompt. We separate concerns:
### System
You are an enterprise assistant…
### Context
{summaries list}
### Question
{user question}
Each summary is prefixed with a numbered identifier so the model can cite sources ([1], [2]). We tested JSON-schema outputs, but plain text with regex extraction proved more robust across model versions.
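For illustration, here is a minimal sketch of the prompt assembly and the regex-based citation extraction. The section markers mirror the template above; the helper names and the example answer are made up.

```python
import re

SYSTEM = "You are an enterprise assistant…"

def build_prompt(summaries: list[str], question: str) -> str:
    # Number each summary so the model can cite it as [1], [2], ...
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(summaries))
    return f"### System\n{SYSTEM}\n\n### Context\n{context}\n\n### Question\n{question}"

def extract_citations(answer: str) -> list[int]:
    # Pull every "[n]" marker the model emitted, deduplicated, in order of appearance.
    seen: list[int] = []
    for m in re.finditer(r"\[(\d+)\]", answer):
        n = int(m.group(1))
        if n not in seen:
            seen.append(n)
    return seen

answer = "Devices must be recalibrated annually [2], per the 2023 directive [1]."
print(extract_citations(answer))  # [2, 1]
```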
5 · Evaluation loop 🔄
Cold-start your evaluation with synthetic Q&A pairs generated from the documents themselves, then augment with real queries captured via telemetry. Key metrics:
| Signal | Why it matters | Target |
|---|---|---|
| Answer Relevance (Likert 1-5) | User usefulness | > 4.2 |
| Source Attribution (%) | Trust & traceability | > 90 % |
| Latency (P95) | UX performance | < 2 s |
Automate nightly eval runs with `langchain-bench` and push deltas to Slack. Teams notice regressions before users do.
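The nightly roll-up can be as small as the sketch below, which condenses raw eval records into the three signals from the table. The record shape and example values are hypothetical, and the Slack push is left to your own pipeline.

```python
import statistics

def summarise(runs: list[dict]) -> dict:
    """Roll per-query eval records up into the three tracked signals."""
    latencies = sorted(r["latency_s"] for r in runs)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]   # crude nearest-rank P95
    return {
        "answer_relevance": statistics.mean(r["relevance"] for r in runs),          # target > 4.2
        "source_attribution": sum(r["cited_correctly"] for r in runs) / len(runs),  # target > 0.90
        "latency_p95_s": p95,                                                       # target < 2 s
    }

runs = [
    {"relevance": 4.5, "cited_correctly": True,  "latency_s": 1.2},
    {"relevance": 3.8, "cited_correctly": True,  "latency_s": 1.9},
    {"relevance": 4.6, "cited_correctly": False, "latency_s": 2.4},
]
print(summarise(runs))  # push these deltas to Slack in your real pipeline
```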
6 · Governance & rollout 🛡️
Finally, RAG often touches regulated data. Build a “retrieval firewall”:
a query comes in → a policy engine (OPA or custom rules) decides which documents the caller is allowed to see, then passes an access-filtered vector query to the database. Without this step you are leaking information, even if it never leaves your VPC.
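In code, the firewall is simply an access filter applied *before* the vector query rather than after retrieval. The `policy_engine` and `vector_store` below are hypothetical stand-ins (OPA, or your own rules engine, would play the policy role, and the filter syntax varies by vector store); the point is that restricted chunks never reach the prompt at all.

```python
from typing import Callable

def firewalled_search(
    query_embedding: list[float],
    caller: dict,
    policy_engine: Callable[[dict], set[str]],  # returns the ACL groups the caller may read
    vector_store,                               # any client that accepts a metadata filter
    top_k: int = 20,
) -> list[dict]:
    allowed_groups = policy_engine(caller)
    if not allowed_groups:
        return []  # fail closed: no policy decision, no documents
    # The access filter is applied inside the vector query, not on the results,
    # so restricted chunks never enter the candidate set.
    return vector_store.search(
        embedding=query_embedding,
        filter={"acl_group": {"$in": sorted(allowed_groups)}},  # store-specific syntax
        limit=top_k,
    )
```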
Download the companion worksheet and scoring template here. Happy building!