AI Foundations for Bankers

RAG: Retrieval-Augmented Generation

Intermediate · 12 min read · Tags: rag, retrieval, generation, hallucination, knowledge-base

Connecting LLMs to Your Bank's Knowledge

A Large Language Model trained on public data knows a great deal about the world -- but it knows nothing about your bank. It has never read your credit policy manual, your internal audit findings, your compliance procedures, or your board-approved risk appetite statement. And for most high-value banking applications, those proprietary documents are exactly what the AI needs to answer questions accurately.

Retrieval-Augmented Generation (RAG) solves this problem. It is a pattern that combines document retrieval with LLM generation, giving the model access to your institution's knowledge base before it generates a response. Instead of relying solely on what the model learned during training, RAG fetches relevant documents from your own repositories and uses them as context for each answer.

For banking executives, RAG is arguably the most important AI architecture pattern to understand. It is the difference between a general-purpose AI assistant and one that can answer questions grounded in your institution's actual policies, procedures, and data.

BANKING ANALOGY

Think of RAG like giving a new hire access to your policy manual before they answer any questions. Without the manual, even a brilliant analyst can only draw on their general training and education -- they might give a plausible answer, but it may not reflect your institution's specific policies. With the manual in hand, they look up the relevant section first, then craft their response based on what your bank has actually decided. RAG gives an LLM that same "look it up first" capability.

The RAG Pipeline: Four Steps

The RAG pipeline follows a clear sequence that transforms a user's question into a grounded, accurate response:

Step 1: Chunk

Before any retrieval can happen, your documents must be prepared. Large documents -- a 300-page regulatory filing, a 50-page credit policy manual -- are split into smaller segments called chunks. Each chunk is typically 200 to 1,000 words, with overlapping boundaries to preserve context.

KEY TERM

Chunking: The process of splitting large documents into smaller, overlapping segments for embedding and retrieval. Chunk size and overlap strategy directly affect the quality of search results and the accuracy of generated answers.
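The chunking step can be sketched in a few lines of Python. This is a minimal word-window sketch with illustrative parameters; production systems more often chunk by tokens or by document structure (sections, headings) rather than raw word counts:

```python
def chunk_words(text, chunk_size=250, overlap=50):
    """Split text into overlapping word-window chunks.

    Each chunk shares `overlap` words with its neighbor so that
    context spanning a chunk boundary is not lost.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 600-word document with 250-word chunks and 50-word overlap
# yields three chunks starting at words 0, 200, and 400.
policy = " ".join(f"word{i}" for i in range(600))
chunks = chunk_words(policy, chunk_size=250, overlap=50)
```

The overlap means the last 50 words of one chunk reappear as the first 50 words of the next, so a sentence that straddles a boundary is still retrievable as a whole.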

Step 2: Embed

Each chunk is converted into an embedding -- a numerical vector that captures its semantic meaning. These vectors are stored in a vector database alongside the original text and metadata (source document, date, classification, department).
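A toy illustration of the embed-and-store step. The `toy_embed` function below is a hashed bag-of-words stand-in for a real embedding model, and the in-memory `index` list (with its hypothetical metadata fields) stands in for vector database records:

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Toy deterministic embedding: hash each word into a slot of a
    fixed-size vector, then L2-normalize. A real system would call a
    trained embedding model here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        slot = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[slot] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Each record stores the vector alongside the original text and
# metadata, mirroring what a vector database row would hold.
index = [
    {"vector": toy_embed(text), "text": text,
     "meta": {"source": "Commercial Lending Policy", "section": sec}}
    for sec, text in [("4.1", "loan approval authorities"),
                      ("4.2", "CRE concentration limits")]
]
```

The key point is the pairing: the vector enables semantic search, while the stored text and metadata are what actually get handed to the LLM and cited back to the user.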

Step 3: Retrieve

When a user asks a question, the system embeds the question using the same model, then searches the vector database for the most semantically similar chunks. Typically, the top 3 to 10 most relevant chunks are retrieved.
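Under the hood, retrieval is a nearest-neighbor search. Here is a minimal sketch using cosine similarity over a tiny hand-made index; real vectors would come from the embedding model, and a real deployment would use an indexed vector database rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=3):
    """Return the k records most similar to the query vector."""
    ranked = sorted(index, key=lambda rec: cosine(query_vec, rec["vector"]),
                    reverse=True)
    return ranked[:k]

# Tiny hand-made 3-dimensional index for illustration only.
index = [
    {"vector": [1.0, 0.0, 0.1], "text": "CRE concentration limits..."},
    {"vector": [0.0, 1.0, 0.0], "text": "BSA/AML monitoring..."},
    {"vector": [0.9, 0.1, 0.2], "text": "Tier 1 capital definitions..."},
]
top = retrieve([1.0, 0.0, 0.0], index, k=2)
# The two CRE/capital records rank above the unrelated BSA/AML one.
```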

Step 4: Generate

The retrieved chunks are combined with the user's original question and sent to the LLM as context. The model generates its response based on both the question and the retrieved information, producing an answer grounded in your actual documents.

The prompt to the LLM typically looks something like: "Based on the following internal policy excerpts, answer the user's question. If the excerpts do not contain sufficient information, say so explicitly."
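The prompt assembly might be sketched as below. The numbered-excerpt format and exact instruction wording are illustrative assumptions; real deployments tune both:

```python
def build_prompt(question, chunks):
    """Assemble retrieved chunks and the user's question into a
    grounded prompt for the LLM."""
    excerpts = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Based on the following internal policy excerpts, answer the "
        "user's question. Cite excerpts by number. If the excerpts do "
        "not contain sufficient information, say so explicitly.\n\n"
        f"Excerpts:\n{excerpts}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What are our CRE concentration limits?",
    [{"source": "Commercial Lending Policy, Sec. 4.2",
      "text": "CRE concentration limits are set at 300% of "
              "Tier 1 capital."}],
)
```

Numbering the excerpts is what makes the attribution described below possible: the model can reference "[1]" in its answer, and the system can map that back to the source document.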

Why RAG Reduces Hallucinations

Hallucination -- when an LLM generates plausible but incorrect information -- is one of the most serious risks in banking AI deployments. RAG significantly reduces this risk through two mechanisms:

Grounding in source material. Instead of generating answers purely from learned patterns, the model is constrained to information in the retrieved documents. When an LLM has the actual policy text in front of it, it is far less likely to fabricate an answer.

Attributable answers. RAG systems can cite their sources. When the model states "Per Section 4.2 of the Commercial Lending Policy, concentration limits for CRE are set at 300% of Tier 1 capital," your users can verify this claim against the original document. This attribution transforms LLM outputs from opaque assertions into verifiable statements.

Warning

RAG reduces hallucination risk but does not eliminate it. An LLM can still misinterpret retrieved context, combine information from multiple chunks incorrectly, or generate confident-sounding responses that subtly deviate from the source material. For regulatory and compliance applications, human review of RAG outputs remains essential. RAG makes the AI more trustworthy, not infallible.

RAG Architecture Decisions for Banking

Deploying RAG in a banking environment involves several architecture decisions with significant implications:

Embedding Model Selection

Not all embedding models perform equally on financial text. Models trained with exposure to regulatory, legal, and financial language produce better embeddings for banking documents. Evaluate embedding models specifically on your document types -- a model that excels on general text may underperform on dense regulatory filings.

Chunk Strategy

Chunk size affects the precision-recall tradeoff:

  • Smaller chunks (200-300 words): More precise retrieval, but individual chunks may lack sufficient context
  • Larger chunks (800-1,000 words): More context per chunk, but less precise matching and higher token consumption in the generation step
  • Hierarchical chunking: Documents are chunked at multiple levels (section, paragraph, sentence), with the system retrieving at the appropriate granularity

Retrieval Configuration

How many chunks to retrieve, how to rank them, and whether to re-rank with a secondary model are all tunable parameters. More retrieved chunks provide more context but consume more of the LLM's context window and increase cost.
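One common configuration, over-retrieving and then re-ranking with a secondary scorer, can be sketched as follows. The `word_overlap` scorer is a deliberately crude stand-in for a cross-encoder or other relevance model:

```python
def rerank(question, candidates, score_fn, k=3):
    """Re-rank over-retrieved candidates with a secondary scorer
    and keep the top k."""
    return sorted(candidates,
                  key=lambda c: score_fn(question, c),
                  reverse=True)[:k]

def word_overlap(question, chunk):
    """Toy relevance score: count of shared words between the
    question and the chunk text."""
    q_words = set(question.lower().split())
    return len(q_words & set(chunk["text"].lower().split()))

candidates = [
    {"text": "deposit account fee schedule"},
    {"text": "commercial real estate concentration limits"},
    {"text": "real estate appraisal requirements"},
]
top = rerank("What are our commercial real estate limits?",
             candidates, word_overlap, k=2)
```

The pattern is the tunable part: retrieve generously and cheaply from the vector index, then spend a more expensive model only on the shortlist before it reaches the LLM's context window.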

Tip

Start your RAG implementation with a focused, high-value document set rather than your entire knowledge base. A RAG system built on your compliance policy library or your credit policy manual delivers immediate value and gives your team practical experience before scaling to broader use cases. Quality of the document corpus matters more than quantity.

Banking Use Cases for RAG

RAG is particularly valuable in banking for use cases where accuracy and institutional specificity matter:

  • Compliance Q&A: Analysts ask natural-language questions about regulatory requirements, and the system responds with answers grounded in your institution's actual compliance policies and relevant regulatory text
  • Credit policy guidance: Lending officers query the credit policy manual using natural language, getting specific guidance on underwriting standards, exception criteria, and approval authorities
  • Audit preparation: Internal auditors search across previous audit reports, findings, and remediation plans to identify precedents and track recurring themes
  • Customer inquiry resolution: Contact center agents query internal knowledge bases to find accurate answers to complex customer questions, reducing hold times and escalations
  • Regulatory change impact analysis: New regulatory guidance is compared against your existing policy library to identify gaps and areas requiring updates

RAG vs. Fine-Tuning: When to Use Each

A common question is whether to use RAG or fine-tuning to adapt an LLM for banking. The answer depends on the use case:

| Approach | Best For | Limitations |
| --- | --- | --- |
| RAG | Answering questions about specific documents; information that changes frequently; when citation is required | Requires vector database infrastructure; limited by retrieval quality |
| Fine-tuning | Teaching the model banking-specific language and style; tasks requiring domain expertise without document lookup | Expensive; requires retraining when information changes; no built-in citation |
| Both | Complex enterprise deployments where the model needs both domain knowledge and document grounding | Higher complexity and cost; requires ML engineering expertise |

For most banking institutions starting their AI journey, RAG is the higher-value starting point. It delivers immediate, measurable benefits without the cost and complexity of model fine-tuning.

Quick Recap

  • RAG connects LLMs to your proprietary documents, enabling answers grounded in your bank's actual policies and data
  • The RAG pipeline follows four steps: chunk documents, embed them as vectors, retrieve relevant chunks for each query, and generate a grounded response
  • RAG reduces hallucination by grounding outputs in source material and enabling citation of specific documents
  • Key architecture decisions include embedding model selection, chunk strategy, and retrieval configuration
  • RAG is generally the best starting point for banking AI, delivering immediate value for compliance, credit policy, and knowledge management use cases

KNOWLEDGE CHECK

A bank deploys a RAG system for compliance policy queries. An analyst asks about commercial real estate concentration limits. What happens in the RAG pipeline?

Why does RAG reduce hallucination risk but not eliminate it entirely?

A bank is deciding between RAG and fine-tuning for a compliance Q&A tool. Which approach is more appropriate, and why?