Model Risk Management (MRM) for LLMs

advanced20 min readgovernancerisk-managementcompliancesr-11-7model-risk

Why LLMs Demand Special MRM Attention

Banking institutions have decades of experience managing model riskModel Risk ManagementThe regulatory framework (OCC SR 11-7) governing how banks validate, monitor, and control AI models. Ensures models perform as expected and risks are identified and mitigated.See glossary. Credit scoring models, market risk models, anti-money laundering models -- these are mature disciplines with well-established validation methodologies, governance structures, and regulatory expectations. Your institution almost certainly has a Model Risk Management framework built around the principles of SR 11-7, the Federal Reserve's foundational guidance on model risk.

Large Language ModelsLarge Language Model (LLM)A neural network trained on vast amounts of text data that can understand and generate human language. LLMs power chatbots, document analysis, code generation, and many enterprise AI applications.See glossary break nearly every assumption that traditional MRM frameworks are built on. And yet, regulators expect the same rigor -- if not more -- when banks deploy these systems.

KEY TERM

Model Risk Management (MRM): The discipline of identifying, measuring, monitoring, and controlling the risk that arises from the use of models in business decisions. In banking, MRM is governed by regulatory guidance -- most notably SR 11-7 (Board of Governors, 2011) in the United States and similar frameworks internationally. MRM encompasses model development, validation, ongoing monitoring, and governance.

SR 11-7 and AI Models

SR 11-7 defines a model as "a quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories, techniques, and assumptions to process input data into quantitative estimates." The guidance establishes three core requirements:

Model development and implementation must follow sound practices with appropriate documentation
Model validation must be performed by qualified, independent parties
Model governance must establish clear roles, responsibilities, and controls

These principles apply to LLMs -- but the practical application is dramatically more complex than for traditional models. Here is why.

The Opacity Problem

Traditional credit models produce a score, and you can trace exactly how each input variable contributed to that score. You can explain to a regulator why a specific applicant received a specific risk rating. LLMs are fundamentally opaque. With hundreds of billions of parameters, there is no practical way to trace why the model generated a specific output. The "explainability" requirement of SR 11-7 becomes extraordinarily difficult to satisfy.

The Non-Determinism Problem

Run the same input through a traditional credit model ten times, and you get the same output ten times. Run the same prompt through an LLM ten times, and you may get ten different responses -- all potentially valid, but none identical. This non-deterministic behavior challenges every testing and validation methodology built for traditional models.

The Scope Problem

A traditional model has a clearly defined scope: it scores credit risk, or it detects fraud, or it calculates market risk. An LLM deployed as a general-purpose assistant could be used for tasks no one anticipated. An employee might use it for customer communications one moment and regulatory analysis the next. Defining the "intended use" of an LLM -- a core SR 11-7 requirement -- is far more complex.

The Training Data Problem

For traditional models, you control the training data. You know exactly what data went in, you can audit it for bias, and you can assess its relevance. LLMs are trained on internet-scale data that you did not curate, cannot fully audit, and may contain biases you cannot identify. If your institution fine-tunesFine-TuningThe process of further training a pre-trained model on a specific dataset to specialize its behavior for a particular domain or task, such as banking compliance language.See glossary a model on proprietary data, you add another layer of complexity.

Warning

Regulators have made clear that the adoption of AI does not diminish existing model risk management obligations. The OCC, Federal Reserve, and FDIC issued joint guidance in 2023 emphasizing that banks deploying AI -- including LLMs -- must apply risk management principles commensurate with the risk posed. Institutions that deploy LLMs without adapting their MRM frameworks face material regulatory, legal, and reputational risk. Consent orders, MRAs (Matters Requiring Attention), and MRIAs (Matters Requiring Immediate Attention) related to AI governance are already being issued. Do not treat LLM deployment as a technology project that can proceed outside your risk management framework.

The Three Lines of Defense for LLMs

Banking institutions typically organize risk management around three lines of defense. Here is how each line must adapt for LLMs:

First Line: Business Units and Model Owners

The first line -- the teams using LLMs in their daily operations -- bears responsibility for:

Use case documentation: Clearly defining how the LLM is being used, what decisions it informs, and what customer-facing outputs it produces
Input quality controls: Establishing guardrailsGuardrailsSafety mechanisms that constrain AI model outputs to prevent harmful, off-topic, or non-compliant responses. Critical in banking for regulatory adherence and brand safety.See glossary on what data can be sent to the LLM and what prompts are permitted
Output review processes: Implementing human review workflows for high-risk outputs (anything customer-facing, anything regulatory, anything involving lending decisions)
Incident reporting: Flagging hallucinationsHallucinationWhen an AI model generates plausible-sounding but factually incorrect information. A critical risk in banking where inaccurate outputs could lead to regulatory violations or financial losses.See glossary, inappropriate outputs, or unexpected behavior to the second line

Second Line: Risk Management and Compliance

The second line must develop LLM-specific expertise:

LLM model inventory: Maintaining a comprehensive inventory of all LLM deployments, including shadow IT usage (employees using consumer AI tools for work purposes)
Risk tiering: Classifying LLM use cases by risk level -- a customer service chatbot has different risk implications than an LLM-assisted credit decisioning tool
Validation methodology: Developing validation approaches appropriate for LLMs (more on this below)
Policy development: Establishing acceptable use policies, data handling requirements, and governance standards

Third Line: Internal Audit

Internal audit must develop the capability to:

Assess MRM framework effectiveness: Evaluate whether the institution's LLM governance is adequate for the risk posed
Test controls independently: Verify that first and second line controls are operating as designed
Evaluate regulatory compliance: Confirm that LLM deployments satisfy applicable regulatory requirements
Audit trail review: Verify that model inputs, outputs, and decisions are logged and retrievable

BANKING ANALOGY

Extending MRM to LLMs is like what happened when banks moved from simple balance sheet lending to complex structured products. The fundamental risk management principles did not change -- you still needed to understand the risk, validate the modelsFoundation ModelA large AI model trained on broad data that can be adapted to many tasks. Examples include GPT-4, Claude, and Gemini. Banks evaluate these for capabilities, safety, and regulatory fit.See glossary, and maintain governance. But the practical application became dramatically more complex. The institutions that adapted their frameworks thrived. The ones that tried to force-fit existing approaches -- or worse, ignored the new risks entirely -- faced consequences. We are at that same inflection point with AI.

Validation Challenges for LLMs

Traditional model validation relies on techniques that do not translate cleanly to LLMs. Here is what makes LLM validation uniquely challenging and how the industry is adapting:

Benchmark Testing

Instead of validating against a single quantitative outcome (does the model correctly predict default?), LLM validation requires testing across a broad range of scenarios with qualitative assessment of output quality. Leading institutions are building "test suites" -- curated sets of prompts with expert-assessed reference answers -- specific to each use case.

Bias and Fairness Testing

Fair lending requirements demand that models do not discriminate on prohibited bases. For traditional models, you can conduct disparate impact analysis on quantitative outputs. For LLMs, you need to test whether the model produces systematically different language, tone, or recommendations when the input varies only by protected class characteristics. This is an emerging discipline with limited precedent.

Ongoing Monitoring

Traditional models drift slowly as the underlying data distribution changes. LLMs can behave unpredictably when:

The model provider updates the underlying model (often without advance notice)
Users discover novel prompting techniques that circumvent guardrails
The model encounters edge cases not covered by initial testing

Effective monitoring requires continuous evaluation, not periodic validation cycles. Many institutions are implementing automated monitoring that samples LLM outputs and flags anomalies for human review.

Red Team Testing

A practice borrowed from cybersecurity, red team testing involves deliberately attempting to make the LLM produce harmful, inaccurate, or non-compliant outputs. For banking applications, red teams should attempt to:

Elicit outputs that violate fair lending requirements
Generate plausible but inaccurate financial information
Circumvent data handling restrictions
Produce outputs that could constitute unauthorized investment advice

Building an LLM MRM Framework

For institutions beginning to formalize their LLM MRM approach, here is a practical framework:

Risk Tiering

Classify every LLM use case into risk tiers:

Tier 1 (Critical): LLM outputs directly influence lending decisions, customer pricing, regulatory submissions, or financial reporting. Full validation required. Human review of every output mandatory.
Tier 2 (Significant): LLM outputs are customer-facing or inform material business decisions. Validation required. Sampling-based human review.
Tier 3 (Standard): LLM used for internal productivity (drafting, summarization, research). Acceptable use policy compliance. User training required.

Governance Structure

Establish clear ownership:

AI/Model Risk Committee: Cross-functional body with authority over LLM deployment decisions
Model owners: Named individuals accountable for each LLM use case
Validation team: Independent team (or external firm) with LLM-specific expertise
Executive sponsorship: Senior leader accountable to the board for AI risk management

Documentation Requirements

At minimum, maintain documentation for each LLM deployment covering:

Use case definition and intended scope
Model selection rationale (why this model for this use case?)
Data handling practices (what goes in, where does it go?)
Prompt engineering approach and guardrails
Testing and validation results
Ongoing monitoring plan
Incident response procedures

Tip

Do not wait for a perfect framework before deploying LLMs. Start with a pragmatic risk tiering approach: deploy Tier 3 use cases (internal productivity) under your existing acceptable use policies while you build the full governance framework for Tier 1 and Tier 2 use cases. This lets your institution gain experience with the technology while managing risk appropriately. Many banks that pursued a "no AI until the framework is perfect" strategy found themselves 18 months behind competitors with nothing to show for their caution.

The Regulatory Trajectory

The regulatory landscape for AI in banking is evolving rapidly. Key developments to monitor:

NIST AI Risk Management Framework: Provides a voluntary framework that many regulators reference
EU AI Act: Classifies AI systems by risk level with specific requirements for high-risk applications (relevant for banks with European operations)
OCC/Fed/FDIC joint guidance: Continues to evolve, with increasing specificity about AI governance expectations
State-level regulations: Several states are developing AI-specific regulations, particularly around automated decision-making in lending

The consistent message across all regulatory developments: banks are expected to understand, govern, and manage the risks of AI systems with the same rigor they apply to any other material risk. The institutions that build robust MRM frameworks for LLMs now will be well-positioned regardless of how the regulatory landscape evolves.

Moving Forward

Model Risk Management for LLMs is not a solved problem -- it is an evolving discipline. But the banks that engage with it proactively, building on their existing MRM expertise rather than treating AI as an entirely separate domain, will navigate this transition most effectively. The foundational principles of SR 11-7 -- sound development practices, independent validation, effective governance -- remain as relevant as ever. The implementation just requires new tools, new expertise, and a willingness to adapt established frameworks to a genuinely new category of technology.

KNOWLEDGE CHECK

Why do LLMs present a fundamentally different challenge for model validation compared to traditional credit scoring models?

Under the three lines of defense model, which responsibility belongs to the SECOND line (Risk Management and Compliance) when governing LLM deployments?

A bank wants to deploy an LLM-powered tool that assists credit analysts with loan recommendations. Under the risk tiering framework described, how should this use case be classified?