Internal Policy Q&A Assistant
Overview
Every bank maintains hundreds -- often thousands -- of internal policy documents: compliance manuals, credit policies, operational procedures, HR handbooks, and board-approved risk limits. When an employee needs to find a specific policy, they face a familiar frustration: searching through SharePoint folders, sifting through outdated documents, and hoping the search terms they chose match the language the policy author used.
A Retrieval-Augmented Generation (RAG) powered Q&A assistant transforms this experience. Instead of keyword-based search, employees ask natural language questions -- "What is our current commercial real estate concentration limit?" or "What are the approval requirements for a new third-party vendor?" -- and receive accurate, sourced answers with citations back to the authoritative document.
This architecture is the most common first AI deployment in banking because it delivers immediate, visible value with manageable risk. The system does not make decisions -- it retrieves and summarizes existing policies. Human judgment remains the final authority.
BANKING ANALOGY
Think of a policy Q&A assistant the way you think about your best compliance analyst -- the one who has read every policy document and can tell you exactly where to find the answer to any regulatory question. They do not invent answers; they retrieve the relevant section, quote it, and tell you which document and page it came from. A RAG system works the same way: it finds the most relevant passages from your actual policy documents and generates an answer grounded in those sources, with citations you can verify.
Architecture Components
Document Ingestion Pipeline
The ingestion pipeline converts raw policy documents -- PDFs, Word files, SharePoint pages, Confluence articles -- into formats the AI system can search. This involves document parsing (extracting text from various formats), metadata extraction (document title, version, effective date, owner), and format normalization.
Tools to evaluate: Apache Tika for multi-format parsing, LlamaIndex data connectors for SharePoint and Confluence integration, or managed solutions like Amazon Textract for OCR on scanned documents.
Chunking Engine
Raw documents must be split into smaller segments before embedding generation. Chunking strategy directly affects retrieval quality -- chunks that are too large dilute relevance; chunks that are too small lose context.
For banking policy documents, semantic chunking (splitting on section boundaries and natural paragraph breaks) outperforms fixed-size chunking. Policy documents have clear hierarchical structures -- chapters, sections, subsections -- and preserving these boundaries produces more coherent retrieval results.
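A minimal sketch of section-boundary chunking, assuming policy sections open with numbered headings like "3.2 Concentration Limits" (adjust the pattern for your documents). Small adjacent sections are merged so each chunk stays near a rough size budget:

```python
import re

def semantic_chunks(text, max_words=400):
    """Split a policy document on numbered section headings, merging
    small sections so chunks stay near a rough word budget. A single
    oversized section still becomes its own chunk."""
    # Split wherever a line starts with a section number like "3.2 Title"
    sections = re.split(r"\n(?=\d+(?:\.\d+)*\s+[A-Z])", text)
    chunks, current = [], ""
    for section in sections:
        candidate = (current + "\n" + section).strip() if current else section.strip()
        if len(candidate.split()) <= max_words:
            current = candidate          # still fits: keep accumulating
        else:
            if current:
                chunks.append(current)   # emit the filled chunk
            current = section.strip()    # start fresh at the section boundary
    if current:
        chunks.append(current)
    return chunks
```

In practice you would measure by tokens rather than words and add overlap between chunks, but the key idea -- never split mid-section -- is the same.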
Vector Store
The vector store holds the embedded chunks and enables fast similarity search. For banking deployments, the key evaluation criteria are: data residency (where vectors are stored), access control (can you restrict which users see which document vectors?), and scale (how many documents and how fast?).
Options range from managed cloud services (Pinecone, Weaviate Cloud) to self-hosted solutions (pgvector on your existing PostgreSQL, Milvus) to platform-integrated options (Snowflake Cortex Search, Bedrock Knowledge Bases).
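Whichever product you choose, the core operation is the same: score every stored chunk against the query embedding and return the closest matches. A pure-Python sketch of that top-k cosine-similarity search (real stores do this with approximate indexes over millions of vectors):

```python
import math

def top_k(query_vec, index, k=5):
    """Return the k chunks whose embeddings are most similar to the
    query embedding, by cosine similarity -- the core operation a
    vector store performs at scale. `index` is a list of
    (chunk_text, embedding) pairs."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```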
LLM Generation Layer
The LLM receives the user's question along with retrieved document chunks and generates a synthesized answer. The system prompt instructs the model to base answers strictly on the provided context, cite specific documents, and clearly state when insufficient information is available to answer.
Model selection balances accuracy, latency, and cost. For internal policy Q&A -- where accuracy is paramount but latency tolerance is moderate (2-5 seconds is acceptable) -- larger models like GPT-4 or Claude typically outperform smaller models on complex policy questions.
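A sketch of the grounding instructions and message assembly, assuming a generic chat-style API; the field names and citation format here are illustrative, not a specific vendor's schema:

```python
SYSTEM_PROMPT = (
    "You are an internal policy assistant. Answer ONLY from the context "
    "provided. Cite the document, section, and version for every claim. "
    "If the context does not contain the answer, say so explicitly -- "
    "do not speculate."
)

def build_messages(question, chunks):
    """Assemble the chat messages sent to the LLM. Each retrieved chunk
    carries its source metadata so the model can cite it."""
    context = "\n\n".join(
        f"[{c['doc']} §{c['section']} v{c['version']}]\n{c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```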
Citation and Source Tracking
Every answer must include citations to the specific document, section, and version that supports it. This is non-negotiable in banking -- an unsourced policy answer is worse than no answer at all, because it creates false confidence.
The citation system tracks which retrieved chunks contributed to the answer and maps them back to source documents with direct links to the original content.
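One common way to implement this mapping is to register each chunk's source metadata at ingestion time, then resolve citation markers in the generated answer back to links. A sketch, assuming the model is instructed to emit `[chunk_id]` markers (the marker format and example URL are assumptions for illustration):

```python
import re

def extract_citations(answer, chunk_registry):
    """Find [chunk_id] citation markers in a generated answer and
    resolve them to source documents via the registry built at
    ingestion time."""
    cited_ids = re.findall(r"\[(\w+)\]", answer)
    return [chunk_registry[c] for c in cited_ids if c in chunk_registry]
```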
Data Flow
1. Document ingestion: Policy documents are loaded from source systems (SharePoint, network drives, compliance databases), parsed into text, and tagged with metadata (document ID, version, effective date, classification level)
2. Chunking and embedding: Documents are split into semantically meaningful chunks (500-1000 tokens each with 100-token overlap), and each chunk is converted into a vector embedding using a model like OpenAI text-embedding-3 or Cohere Embed v3
3. Vector indexing: Embeddings are stored in the vector database alongside the original text chunks and metadata, creating a searchable index of the entire policy corpus
4. User query: An employee submits a natural language question through the Q&A interface -- for example, "What are the documentation requirements for a commercial loan over $5M?"
5. Retrieval: The query is embedded using the same embedding model, and the vector store returns the top-k most semantically similar chunks (typically 5-10 chunks)
6. Context assembly: Retrieved chunks are assembled into a context window, ordered by relevance, and combined with the system prompt instructing the model to answer based only on provided context
7. Answer generation: The LLM generates a synthesized answer, citing specific documents and sections, with confidence indicators for each claim
8. Source verification: The system presents the answer alongside clickable links to the source documents, enabling the user to verify every claim against the original policy text
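The query-time steps above reduce to a thin orchestration layer. A sketch with stand-in callables for the embedding model, vector store, and LLM client (none of these are a specific vendor SDK):

```python
def answer_question(question, embed, search, generate, k=5):
    """Orchestrate the query-time flow: embed the query, retrieve the
    top-k chunks, assemble context, generate an answer, and surface
    the sources for verification."""
    query_vec = embed(question)                       # embed the query
    chunks = search(query_vec, k)                     # similarity search
    context = "\n\n".join(c["text"] for c in chunks)  # context assembly
    answer = generate(question, context)              # grounded generation
    sources = [c["doc"] for c in chunks]              # sources for verification
    return {"answer": answer, "sources": sources}
```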
Banking Use Case
Scenario: A compliance officer at a regional bank receives a question from the commercial lending team: "What is our current policy on commercial real estate concentration limits, and has it changed since the last board review?"
Without the Q&A assistant: The compliance officer opens SharePoint, searches for "CRE concentration" across multiple policy folders, finds three versions of the concentration risk policy (two outdated), cross-references with the most recent board minutes, and drafts a response. This takes 45-90 minutes.
With the Q&A assistant: The compliance officer types the question. The system retrieves the current Concentration Risk Policy (v4.2, effective January 2025), the relevant board resolution from the December 2024 meeting, and the related OCC guidance. Within seconds, it presents: "The current CRE concentration limit is 300% of total risk-based capital, as established in Concentration Risk Policy v4.2 (Section 3.2). This represents a reduction from the previous 350% limit, approved by the Board in Resolution 2024-47 on December 15, 2024." Each citation links directly to the source document.
Tip
When implementing a policy Q&A assistant, start with a single high-value document collection -- such as credit policies or compliance manuals -- rather than ingesting everything at once. This allows you to tune chunking strategy, embedding models, and retrieval parameters for your specific content before scaling to the full policy corpus. Measure retrieval accuracy rigorously: randomly sample 50 questions, have subject matter experts evaluate the answers, and iterate until accuracy exceeds 90% before expanding the document scope.
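The evaluation loop described above is simple to operationalize: draw a reproducible random sample for SME review, then compute the accuracy rate from their grades. A sketch (the grading itself stays manual):

```python
import random

def sample_for_review(question_log, n=50, seed=7):
    """Draw a reproducible random sample of logged questions for
    subject-matter-expert review."""
    rng = random.Random(seed)
    return rng.sample(question_log, min(n, len(question_log)))

def accuracy(grades):
    """grades: list of booleans from SME review (True = answer correct)."""
    return sum(grades) / len(grades)
```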
Key Architectural Decisions
| Decision | Options | Recommendation | Why |
|---|---|---|---|
| Hallucination mitigation | Prompt engineering only; RAG + citation enforcement; RAG + citation + confidence scoring | RAG + citation + confidence scoring | Banking policy answers must be verifiable. Confidence scoring flags low-confidence answers for human review rather than presenting uncertain information as fact |
| Vector database deployment | Managed cloud (Pinecone); self-hosted (pgvector); platform-integrated (Bedrock KB) | Platform-integrated if available; otherwise pgvector | Minimizes data movement and leverages existing infrastructure. pgvector runs on PostgreSQL your team already manages |
| Embedding model | OpenAI text-embedding-3; Cohere Embed v3; domain-specific fine-tuned | Start with OpenAI or Cohere; evaluate fine-tuned if accuracy plateaus | General-purpose embeddings perform well for policy language. Fine-tuning is expensive and only justified if retrieval accuracy is consistently below 85% |
| Access control | Document-level filtering; user-role-based retrieval; no filtering | User-role-based retrieval | Different employees should only access policies relevant to their role. A teller should not see board-level risk appetite documents through the Q&A system |
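The role-based retrieval recommendation can be enforced with metadata filtering. In production the filter belongs in the vector-store query itself, so restricted chunks are never retrieved at all; this post-retrieval sketch illustrates the rule (role names are assumptions):

```python
def filter_by_role(chunks, user_roles):
    """Drop retrieved chunks the user's roles do not entitle them to
    see. Each chunk carries an allowed_roles set from its source
    document's classification metadata."""
    return [c for c in chunks if c["allowed_roles"] & user_roles]
```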
Quick Recap
- A RAG-powered policy Q&A assistant transforms keyword search into natural language question-answering with sourced, cited answers
- The architecture consists of five layers: document ingestion, chunking, vector storage, LLM generation, and citation tracking
- Banking-specific requirements include data residency, role-based access control, citation enforcement, and confidence scoring
- Start with a single document collection and measure retrieval accuracy before scaling
- This is the most common first AI deployment in banking because it delivers value without making autonomous decisions
KNOWLEDGE CHECK
Why is citation tracking considered non-negotiable in a banking policy Q&A system?
What chunking strategy is recommended for banking policy documents and why?
A bank is deploying its first policy Q&A assistant. What approach does the architecture recommend?