AI Foundations for Bankers

Advanced RAG Patterns for Enterprise

Advanced · 12 min read · Tags: rag, advanced, hybrid-search, reranking, evaluation, enterprise

Beyond Basic RAG

The basic RAG pattern -- embed a question, find similar document chunks, pass them to the LLM -- works well for straightforward use cases. But banking demands more. When a compliance officer asks about anti-money laundering requirements for correspondent banking, the system needs to retrieve the right passages from the right documents with high precision. A wrong or incomplete retrieval does not just produce a bad answer -- it produces a confidently wrong answer that could inform a compliance decision.

This unit covers the advanced patterns that elevate RAG from a prototype to a production-grade system suitable for banking workloads.

Pattern 1: Hybrid Search

We introduced hybrid search in the Weaviate unit, but it deserves deeper treatment because it is one of the most impactful upgrades to basic RAG.

The Problem

Pure semantic search finds documents that are conceptually related to the query. But banking queries often include specific regulatory terms, statute references, or account numbers that must be matched exactly. Semantic search might understand the concept behind "12 CFR 1026.35" but miss a document that specifically references this regulation.

The Solution

Hybrid search runs both keyword-based search (BM25 or similar algorithms) and embedding-based semantic search simultaneously, then combines the results using a weighted scoring function. The weight balance can be tuned: for regulatory queries heavy in technical terms, weight keyword search higher; for natural-language questions, weight semantic search higher.
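The weighted combination above can be sketched in a few lines. This is a minimal illustration, not a production scorer: the document ids and raw scores are invented, and real systems would take the keyword scores from a BM25 engine and the semantic scores from a vector database.

```python
# Sketch of weighted hybrid scoring: normalize keyword (BM25-style) and
# semantic (cosine) scores, then blend them with a tunable alpha.
# All doc ids and scores below are illustrative.

def normalize(scores):
    """Min-max normalize a dict of doc_id -> raw score into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """alpha=1.0 -> pure semantic; alpha=0.0 -> pure keyword."""
    kw, sem = normalize(keyword_scores), normalize(semantic_scores)
    docs = set(kw) | set(sem)
    return {doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
            for doc in docs}

# For a query heavy in regulatory terms like "12 CFR 1026.35",
# weight keyword matching higher (low alpha).
scores = hybrid_scores(
    keyword_scores={"reg_z_appraisals": 9.1, "hpml_faq": 7.4, "crm_note": 1.2},
    semantic_scores={"hpml_faq": 0.82, "appraisal_policy": 0.79,
                     "reg_z_appraisals": 0.61},
    alpha=0.3,
)
best = max(scores, key=scores.get)
```

Note how the document that ranks well on both signals ends up on top, even though neither signal alone put it first by a wide margin; that is the core benefit of the hybrid approach.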

Implementation Approach

Most hybrid search implementations use Reciprocal Rank Fusion (RRF) to combine results from both search methods. RRF assigns scores based on the rank position in each result list and combines them, ensuring that documents appearing at the top of both lists score highest.
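RRF itself is simple enough to show in full. The sketch below uses the commonly cited smoothing constant k=60; the document ids are made up for illustration.

```python
# Minimal Reciprocal Rank Fusion (RRF). Each ranked list is a list of
# doc ids, best first. A document's fused score is the sum of
# 1 / (k + rank) over every list it appears in.

def rrf(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_cfr_1026", "doc_faq", "doc_memo"]
semantic_hits = ["doc_faq", "doc_cfr_1026", "doc_policy"]
fused = rrf([keyword_hits, semantic_hits])
# The two documents that appear near the top of both lists
# dominate the fused ranking.
```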

Pattern 2: Reranking

The Problem

Initial retrieval from a vector database uses fast but approximate similarity matching. The top 20 results are "probably relevant," but their ranking may not reflect true relevance to the specific question. Document chunk number 15 might actually be more relevant than chunk number 3.

The Solution

After initial retrieval, pass the results through a reranking model -- a model trained specifically to score how relevant a document is to a query. Reranking models are slower than vector search (they process each query-document pair individually), but they are far more accurate at judging true relevance.

The typical pattern:

  1. Retrieve the top 20-50 results from the vector database (fast, approximate)
  2. Pass all results through the reranker with the original query (slower, precise)
  3. Use the reranker's scores to select the top 3-5 most relevant results
  4. Send only these highly relevant results to the LLM for answer generation
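The four steps above can be sketched as follows. The reranking model is stubbed with a token-overlap score purely to keep the example self-contained; a real deployment would call a cross-encoder or a reranking API at that point.

```python
# Retrieve-then-rerank sketch. `stub_rerank_score` is a stand-in for a
# real reranking model; the candidate passages are invented examples.

def stub_rerank_score(query, passage):
    """Stand-in relevance score: fraction of query tokens found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

def rerank(query, candidates, top_n=3):
    """candidates: list of (doc_id, passage) from the first-stage retriever."""
    scored = [(stub_rerank_score(query, text), doc_id)
              for doc_id, text in candidates]
    scored.sort(reverse=True)  # highest relevance first
    return [doc_id for _, doc_id in scored[:top_n]]

query = "correspondent banking due diligence requirements"
candidates = [  # imagine these are the top hits from the vector database
    ("c1", "annual report summary of retail deposits"),
    ("c2", "due diligence requirements for correspondent banking relationships"),
    ("c3", "wire transfer recordkeeping rules"),
]
top = rerank(query, candidates, top_n=1)
```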

Banking Impact

Reranking reduces the risk of the LLM receiving marginally relevant or irrelevant context, which is a primary cause of hallucinated or incorrect answers. In banking, where answer accuracy directly affects compliance and risk decisions, the precision improvement from reranking justifies the additional latency.

Pattern 3: Query Decomposition

The Problem

Complex banking questions often contain multiple sub-questions: "Compare our current CRE concentration against the OCC's guidance and explain what actions we should take if we exceed the soft limit." This single question requires: (1) retrieving the bank's current CRE exposure data, (2) retrieving OCC guidance on CRE concentration, and (3) retrieving the bank's internal escalation procedures.

The Solution

Query decomposition automatically breaks complex questions into simpler sub-questions, retrieves for each independently, and then synthesizes the results. This ensures that each aspect of the question gets focused retrieval attention rather than hoping a single query will surface all relevant information.

Implementation

An LLM analyzes the original question and generates sub-questions. Each sub-question is routed to the appropriate index or data source. Results are collected and passed to the LLM along with the original question for synthesis.
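A minimal sketch of that flow, with the LLM call stubbed out: in production, `decompose` would prompt an LLM to emit sub-questions, and `route` might itself be an LLM classifier. The canned sub-questions and index names below are illustrative.

```python
# Query decomposition sketch with stubbed components.

def decompose(question):
    """Stand-in for an LLM call that splits a complex question into
    sub-questions; returns a canned plan for the CRE example."""
    return [
        "What is our current CRE concentration?",
        "What does OCC guidance say about CRE concentration limits?",
        "What are our internal escalation procedures for limit breaches?",
    ]

def route(sub_question):
    """Keyword heuristic standing in for a real router."""
    if "OCC" in sub_question:
        return "regulatory_guidance"
    if "escalation" in sub_question:
        return "internal_policies"
    return "risk_data"

def answer(question, retrieve):
    plan = [(sq, route(sq)) for sq in decompose(question)]
    contexts = [retrieve(index, sq) for sq, index in plan]
    # A final LLM call would synthesize `contexts` into one answer here.
    return plan, contexts

plan, _ = answer("Compare our CRE concentration against OCC guidance...",
                 retrieve=lambda index, sq: f"[top chunks from {index}]")
```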

BANKING ANALOGY

Query decomposition works like a senior research analyst managing a complex request from the CEO. When the CEO asks, "How does our digital banking adoption compare to peer institutions, and what investment would we need to close the gap?" the analyst does not try to answer from a single source. They decompose the request: first, pull internal digital banking metrics; second, research peer institution adoption rates from industry reports; third, estimate the technology investment required based on vendor proposals and internal capacity assessments. Each sub-question gets its own research track, and the analyst synthesizes everything into a comprehensive briefing. Query decomposition automates this exact analytical approach.

Pattern 4: Multi-Index Strategy

The Problem

Different document collections have different characteristics and serve different retrieval needs. Your compliance policy manual (structured, versioned, authoritative) is fundamentally different from your customer complaint logs (unstructured, high-volume, time-sensitive). Using a single index and retrieval strategy for both produces suboptimal results.

The Solution

Maintain separate indexes for different document collections, each optimized for its specific characteristics:

  • Compliance policies: Tree index with version metadata filtering
  • Regulatory guidance: Hybrid search index (keyword + semantic)
  • Credit memos: Vector index with department and risk-rating metadata
  • Customer communications: Vector index with date-range filtering
  • Market research: Vector index with source and topic metadata

A router determines which index or indexes to query based on the nature of the question. Some questions may query multiple indexes simultaneously.
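A simple registry-plus-router sketch makes the idea concrete. The index names mirror the list above; the keyword routing rules are a deliberately crude stand-in for a real router, which might be an LLM classifier.

```python
# Multi-index registry and router sketch. Index configurations and
# routing keywords are illustrative only.

INDEXES = {
    "compliance_policies": {"search": "tree", "filters": ["version"]},
    "regulatory_guidance": {"search": "hybrid", "filters": []},
    "credit_memos": {"search": "vector", "filters": ["department", "risk_rating"]},
    "customer_comms": {"search": "vector", "filters": ["date_range"]},
}

ROUTING_RULES = {
    "compliance_policies": ("policy", "procedure", "aml", "bsa"),
    "regulatory_guidance": ("occ", "fincen", "cfr", "guidance"),
    "credit_memos": ("borrower", "credit", "memo", "exposure"),
    "customer_comms": ("complaint", "customer", "email"),
}

def route(question):
    """Return every index whose keywords appear in the question
    (a question may legitimately hit several indexes)."""
    q = question.lower()
    hits = [idx for idx, words in ROUTING_RULES.items()
            if any(w in q for w in words)]
    return hits or ["regulatory_guidance"]  # illustrative fallback

targets = route("Compare our BSA/AML procedures against the latest FinCEN guidance")
```

A BSA/AML comparison question correctly fans out to both the policy index and the regulatory-guidance index, which is exactly the multi-index behavior described above.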

Pattern 5: RAG Evaluation

Why Evaluation Matters for Banking

You cannot improve what you cannot measure. And in banking, you cannot deploy what you cannot demonstrate is accurate. RAG evaluation is not optional -- it is the mechanism that gives compliance officers, risk managers, and regulators confidence that the system produces reliable outputs.

Key Metrics

Retrieval metrics:

  • Recall: Does the system retrieve all relevant documents? (Missing a critical regulation is unacceptable)
  • Precision: Are the retrieved documents actually relevant? (Noise degrades answer quality)
  • Mean Reciprocal Rank: Are the most relevant documents ranked highest?
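These three retrieval metrics are straightforward to compute for a single query, given the retrieved ranking and a ground-truth set of relevant doc ids (Mean Reciprocal Rank then averages the per-query reciprocal rank over an evaluation set). The doc ids below are invented.

```python
# Per-query retrieval metrics: recall, precision, and reciprocal rank.

def recall(retrieved, relevant):
    """Fraction of relevant docs that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant."""
    return len(set(retrieved) & relevant) / len(retrieved)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; 0 if none retrieved.
    MRR is this value averaged over a set of evaluation queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d9", "d4"]
relevant = {"d1", "d2"}
r, p, rr = (recall(retrieved, relevant),
            precision(retrieved, relevant),
            reciprocal_rank(retrieved, relevant))
```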

Generation metrics:

  • Faithfulness: Is the generated answer supported by the retrieved documents? (Does the LLM hallucinate beyond what the context provides?)
  • Relevance: Does the answer actually address the question asked?
  • Completeness: Does the answer cover all aspects of the question?
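Faithfulness can be approximated crudely by checking whether each answer sentence is lexically supported by the retrieved context. This token-overlap heuristic only illustrates the idea; real evaluators (LLM-as-judge or NLI-based models) are far more robust, and the stopword list and threshold here are arbitrary.

```python
# Crude faithfulness proxy: flag answer sentences whose content words
# are mostly absent from the retrieved context.

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for"}

def content_words(text):
    return {w.strip(".,").lower() for w in text.split()} - STOPWORDS

def unsupported_sentences(answer, context, threshold=0.6):
    """Return sentences where fewer than `threshold` of the content
    words appear anywhere in the context."""
    ctx = content_words(context)
    flagged = []
    for sentence in filter(None, (s.strip() for s in answer.split("."))):
        words = content_words(sentence)
        if words and len(words & ctx) / len(words) < threshold:
            flagged.append(sentence)
    return flagged

context = "HPML appraisal rules require a full interior inspection appraisal."
answer = ("HPML rules require a full interior inspection appraisal. "
          "The fee cap is $500.")
flags = unsupported_sentences(answer, context)
# The second sentence introduces a claim the context never made.
```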

Building an Evaluation Framework

For banking RAG deployments, establish:

  1. Ground truth datasets: Subject matter experts create question-answer pairs with identified source documents
  2. Automated evaluation: Run new questions through the pipeline and compare results against ground truth
  3. Human evaluation: Compliance and business experts regularly review system outputs for accuracy
  4. Regression testing: When you change the pipeline (new chunking strategy, different embedding model, updated documents), verify that previously correct answers remain correct
  5. Production monitoring: Track retrieval and generation metrics on live queries to detect degradation
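Item 4, regression testing, can be as simple as replaying the ground-truth set after every pipeline change. In this sketch the pipeline is stubbed with canned retrievals, and a case passes if the expected source document appears among the retrieved ids; all questions and doc ids are invented.

```python
# Regression-test sketch over a ground-truth dataset.

GROUND_TRUTH = [
    {"question": "What is the HPML appraisal threshold?",
     "source": "reg_z_appraisals"},
    {"question": "When must a SAR be filed?",
     "source": "bsa_sar_policy"},
]

def pipeline(question):
    """Stand-in for the real RAG pipeline: returns retrieved doc ids."""
    canned = {
        "What is the HPML appraisal threshold?": ["reg_z_appraisals", "hpml_faq"],
        "When must a SAR be filed?": ["bsa_sar_policy", "aml_manual"],
    }
    return canned[question]

def run_regression(cases, run):
    """Return the questions whose expected source was not retrieved."""
    return [c["question"] for c in cases if c["source"] not in run(c["question"])]

failures = run_regression(GROUND_TRUTH, pipeline)
# An empty failure list means the change did not break previously
# correct retrievals.
```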

Tip

Build your RAG evaluation framework before you optimize your pipeline. Without evaluation, you are making changes blindly -- you might improve performance on one type of query while degrading it on another. Start with 50-100 question-answer pairs created by your compliance or business subject matter experts. This ground truth dataset becomes the foundation for every pipeline improvement you make.

Putting It All Together

A production banking RAG system typically combines multiple advanced patterns:

  1. User asks a question
  2. Query decomposition breaks complex questions into sub-questions
  3. Router determines which indexes to query for each sub-question
  4. Hybrid search retrieves from the appropriate indexes using both keyword and semantic matching
  5. Reranker re-scores results for true relevance
  6. LLM generates an answer from the highest-ranked results with citations
  7. Evaluation pipeline monitors retrieval and generation quality continuously

Each pattern addresses a specific failure mode of basic RAG. Together, they produce a system that banking professionals can trust.

Quick Recap

  • Basic RAG is insufficient for banking -- advanced patterns address precision, recall, and reliability requirements
  • Hybrid search combines keyword and semantic matching for regulatory terminology accuracy
  • Reranking improves precision by re-scoring initial retrieval results with a specialized model
  • Query decomposition handles complex multi-part questions by breaking them into focused sub-questions
  • Multi-index strategies optimize retrieval for different document types and use cases
  • RAG evaluation is mandatory for banking -- build the evaluation framework before optimizing the pipeline

KNOWLEDGE CHECK

Why is reranking particularly important for banking RAG applications?

A compliance officer asks: Compare our BSA/AML procedures against the latest FinCEN guidance and identify gaps. Which advanced RAG pattern is most critical for handling this question effectively?

Why should a banking institution build a RAG evaluation framework before optimizing the retrieval pipeline?

What is the primary advantage of a multi-index RAG strategy over using a single index for all banking documents?