
Next-Generation Credit Risk Models for a Top 20 Retail Bank: $180M Loss Reduction with Full SR 11-7 Compliance

Client Type: Top 20 US Retail Bank
Engagement Duration: 11 Weeks
Loan Portfolio: $84B Retail + SME
Models Deployed: 9 Production Models
Services: AI Strategy, AI Implementation, AI Governance
34% Better Default Prediction
$180M Credit Loss Reduction
9 Models in Production
11 Weeks from Strategy to Production
Situation

$84B Loan Portfolio, Logistic Regression Models From 2016, and a Rising Charge-Off Rate

The bank's consumer and SME lending portfolio had grown from $54 billion to $84 billion over the preceding five years through a combination of organic growth and two acquisitions. The credit risk models underpinning origination decisions, early warning signals, and provisioning had not kept pace with this growth. The primary origination scorecard was a logistic regression model built in 2016, validated against pre-pandemic credit data that no longer reflected the behavioral patterns of the current customer base. Two supplementary models used for SME lending had been built by the acquired institutions and had never been re-validated under the acquiring bank's Model Risk Management framework.

The consequences were visible in the numbers. The bank's charge-off rate had risen from 1.8% to 2.7% over 18 months, while peer institutions with more current models had held steady at 1.9% to 2.1%. The Chief Risk Officer estimated that 40 basis points of the 90-basis-point deterioration was attributable to model performance degradation rather than macroeconomic factors, representing approximately $340 million in excess annual credit losses. Fixing the models was a material financial priority, not a technology initiative.
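The loss-attribution arithmetic above can be checked directly; all inputs are figures quoted in this case study.

```python
# Sanity check of the loss attribution quoted above.
portfolio = 84e9              # $84B loan portfolio
model_driven_bp = 40          # of the 90bp charge-off deterioration

excess_annual_losses = portfolio * model_driven_bp / 10_000
print(f"${excess_annual_losses / 1e6:.0f}M")  # → $336M, quoted as ~$340M
```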

The constraint was not an absence of data. The bank held rich transaction, behavioral, and relationship data across 11 million customer accounts. What was absent was an organized approach to feature engineering, model architecture selection, and governance that would produce ML models capable of passing the bank's stringent SR 11-7 Model Risk Management validation requirements within a commercially useful timeframe.

Challenge

SR 11-7 Compliance Is Not Optional and Cannot Be Retrofitted

The central challenge was not algorithmic. Gradient boosted ensembles and neural networks are well-established as outperforming logistic regression on default prediction tasks given sufficient data. The challenge was building ML models that would satisfy SR 11-7 model risk governance requirements without sacrificing the performance advantage that ML offers over traditional approaches. Financial services model risk teams have legitimate concerns about black-box ML models, and those concerns must be addressed architecturally, not just documented.

Specifically, the bank's MRM team required five things that purely predictive ML models typically do not provide by default:

  • Conceptual soundness documentation: Every model needed a documented theoretical basis explaining why the chosen features were expected to predict default, referencing established credit theory. Prior ML efforts at the bank had produced feature importance rankings but not conceptual soundness rationale for each feature's inclusion.
  • Individual-level explainability for adverse action: Regulatory requirements under ECOA and FCRA require that applicants who are declined or receive materially different pricing can receive a specific, accurate adverse action reason. SHAP-based global feature importance does not satisfy this requirement. Individual-level SHAP with deterministic reason codes did.
  • Stability monitoring specification: Before deployment, the MRM team required a documented performance monitoring framework specifying champion/challenger conditions, PSI (Population Stability Index) thresholds, performance metric triggers, and escalation procedures. Previous ML initiatives had defined monitoring frameworks after deployment, which MRM treated as inadequate.
  • Proxy discrimination testing: The bank required evidence that models did not use protected class proxies. This required disparate impact testing across 7 protected categories during model development, not just at deployment.
  • Challenger model framework: Production models needed a defined challenger model running in parallel with smaller traffic allocation to continuously validate performance and provide statistical power for champion/challenger decisions.

We treated these requirements as design inputs, not compliance checkboxes. Every model was architected to satisfy all five requirements by design rather than retrofitted after algorithm selection.
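The second requirement — individual-level explainability for adverse action — can be sketched as a deterministic mapping from per-applicant SHAP contributions to reason codes. This is an illustrative sketch, not the bank's production code; the feature names, SHAP values, and reason-code table are hypothetical.

```python
# Hypothetical reason-code table (illustrative, not the bank's).
REASON_CODES = {
    "utilization_ratio":  "R01: Proportion of balances to credit limits too high",
    "recent_delinquency": "R02: Delinquency reported within the last 90 days",
    "payment_velocity":   "R03: Declining payments relative to balances",
    "relationship_depth": "R04: Limited credit relationship history",
}

def adverse_action_reasons(shap_values: dict, top_n: int = 2) -> list:
    """Return the top-N reason codes for a declined applicant.

    Only features that pushed the score toward decline (positive SHAP
    contribution to default risk here) are eligible; ties break
    alphabetically so the mapping is deterministic."""
    adverse = [(v, k) for k, v in shap_values.items() if v > 0]
    adverse.sort(key=lambda t: (-t[0], t[1]))
    return [REASON_CODES[k] for _, k in adverse[:top_n]]

# Per-applicant SHAP contributions to default risk (illustrative values).
applicant = {
    "utilization_ratio": 0.42,
    "recent_delinquency": 0.31,
    "payment_velocity": -0.05,   # this feature helped the applicant
    "relationship_depth": 0.12,
}
print(adverse_action_reasons(applicant))  # the two strongest adverse contributors: R01, then R02
```

The determinism matters: the same applicant and the same model must always yield the same adverse action notice, which global feature importance cannot guarantee.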

Solution

Nine Governance-First ML Models Deployed Across the Full Lending Portfolio

We structured the program around three principles that guided every technical decision: SR 11-7 compliance by design, feature engineering depth over algorithmic complexity, and production-readiness from the first sprint.

Governance-First Architecture. Before writing a single line of model code, we produced a Model Development Plan for each of the nine target models documenting: purpose and scope, data sources and feature rationale, algorithm selection justification, performance metrics and acceptance thresholds, disparate impact testing methodology, and monitoring framework specification. The MRM team reviewed and approved each Model Development Plan before development commenced. This upfront governance investment eliminated the back-and-forth validation cycles that typically extend financial services AI programs by 6 to 18 months.

Feature Engineering as the Primary Value Driver. The most important technical work in this program was feature engineering, not algorithm selection. We built a feature library of 340 behavioral, transactional, and relationship features from the bank's existing data assets, including 90-day trailing payment behavior patterns, product utilization velocity signals, relationship depth indicators, and delinquency vintage curves that the legacy models had not incorporated. A feature selection process combining mutual information analysis, VIF screening for multicollinearity, and business logic review produced the final feature sets for each model.
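Two of the quantitative screens named above — mutual-information ranking against the default label and VIF screening for multicollinearity — can be sketched as follows. The thresholds are the conventional rules of thumb, not necessarily the bank's; the business-logic review step is omitted.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the other columns."""
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(len(y))])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / max(1 - r2, 1e-12)
    return out

def select_features(X, y, names, mi_min=0.001, vif_max=5.0):
    """Keep features with non-trivial mutual information against the
    label, then drop those with excessive multicollinearity."""
    mi = mutual_info_classif(X, y, random_state=0)
    keep = [i for i in range(X.shape[1]) if mi[i] >= mi_min]
    v = vif(X[:, keep])
    keep = [i for i, vj in zip(keep, v) if vj <= vif_max]
    return [names[i] for i in keep]
```

A production pipeline would typically drop high-VIF features iteratively rather than in one pass, so that one member of a correlated pair survives.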

XGBoost with SHAP as the Core Architecture. After evaluating five candidate architectures against the MRM requirements and performance targets, we selected gradient-boosted decision trees with XGBoost as the primary architecture for six of the nine models, with a monotone constraint framework applied to ensure directionally sensible feature relationships that MRM could validate. Individual-level SHAP values were computed in real time for each origination decision, producing deterministic adverse action reason codes. Two models covering thin-file applicants used a neural architecture with attention mechanisms that provided interpretability through attention weight visualization. One model retained a constrained logistic regression architecture where a regulatory reporting requirement necessitated simple coefficient-based documentation.

Champion/Challenger Deployment Infrastructure. We built a model serving infrastructure supporting champion/challenger splits from day one, with 90% of traffic routed to new champion models and 10% to retained challengers (either updated legacy models or alternative ML architectures). Performance monitoring dashboards tracked PSI, Gini coefficient, KS statistic, and disparate impact indices daily for each model, with automated alerts at 80% of defined escalation thresholds.
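A minimal PSI computation of the kind tracked by those dashboards can be sketched as follows. The 0.10/0.25 alert levels are the conventional industry rule of thumb, not necessarily the bank's thresholds, and the score distributions are synthetic.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((a% - e%) * ln(a% / e%)) over score deciles,
    with decile edges fixed on the expected (development) population."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range scores
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)          # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
dev = rng.normal(600, 50, 50_000)       # development-population scores
stable = rng.normal(600, 50, 50_000)    # similar incoming population
shifted = rng.normal(565, 50, 50_000)   # drifted incoming population

print(psi(dev, stable))   # well under 0.10: population stable
print(psi(dev, shifted))  # above 0.25: escalation-worthy drift
```

Alerting at 80% of the escalation threshold, as described above, gives the model owner lead time to investigate before a formal MRM trigger fires.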

Deployment Timeline

11 Weeks from Strategy Approval to 9 Models in Production

Wk 1-2

Portfolio Analysis, Model Inventory, and Development Plan Drafting

Current model performance analysis across all 9 target models. Credit loss attribution analysis identifying model-driven versus macro-driven losses. Data asset inventory and quality profiling across 14 source systems. Model Development Plans drafted for all 9 models. MRM team review commenced. Output: approved Model Development Plans, feature data dictionary, governance framework.

Wk 2-6

Feature Engineering, Model Training, and Disparate Impact Testing

340-feature library construction from production data. Feature selection process producing per-model final feature sets (ranging from 28 to 67 features per model). XGBoost and neural architecture training on 4-year historical dataset. Disparate impact testing across 7 protected categories for all models. SHAP adverse action reason code framework implemented and tested. Champion/challenger serving infrastructure built and load-tested. All 9 models passed MRM conceptual soundness review.
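One common form of the disparate impact testing described above is the adverse impact ratio, screened against the conventional four-fifths threshold. The group labels and counts below are illustrative, and this sketch is one screen among several, not the bank's full methodology.

```python
FOUR_FIFTHS = 0.8  # conventional four-fifths rule threshold

def adverse_impact_ratio(approvals: dict, applicants: dict, reference: str) -> dict:
    """Approval rate of each group divided by the reference group's rate."""
    ref_rate = approvals[reference] / applicants[reference]
    return {g: (approvals[g] / applicants[g]) / ref_rate for g in applicants}

# Illustrative counts only.
applicants = {"group_a": 10_000, "group_b": 4_000}
approvals  = {"group_a": 6_200,  "group_b": 2_100}

air = adverse_impact_ratio(approvals, applicants, reference="group_a")
flags = {g: ratio < FOUR_FIFTHS for g, ratio in air.items()}
```

Running this screen during model development, per protected category and per model, is what lets proxy discrimination be caught before validation rather than at deployment.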

Wk 6-9

Parallel Shadow Mode Operation and MRM Validation

All 9 models ran in shadow mode for 3 weeks alongside production legacy models. New model decisions compared to legacy decisions: 23% of decisions differed, with the new model predictions subsequently validated as more accurate at the 3-month outcome point. MRM validation documentation submitted and reviewed. Monitoring dashboard live and reviewed by MRM team. All 9 models received MRM approval for production deployment.

Wk 9-11

Staged Production Deployment with Champion/Challenger Live

Mortgage origination model deployed week 9. Personal loan and auto loan models deployed week 10. SME lending models (3 models) deployed week 10. Credit limit management models (2 models) deployed week 11. Early warning model deployed week 11. All models live with 90/10 champion/challenger splits. Daily monitoring active. Initial performance at deployment: 34% higher Gini coefficient versus legacy models on concurrent comparison cohort.

Outcomes

Measured Results at 9 Months Post-Deployment

Default Prediction Improvement 34%
Gini coefficient improvement versus legacy models, measured on a matched cohort at the 12-month outcome point, confirming the new models' superior discriminatory power across all product types.
Annualized Credit Loss Reduction $180M
Finance-validated annualized reduction in net charge-offs attributable to improved origination decisions. The charge-off rate fell from 2.7% back to 2.1%, in line with peer institutions.
Approval Rate Impact +6%
Better risk discrimination enabled a 6% increase in approval rates for applicants previously misclassified as high-risk by legacy models, growing the performing loan book by $3.2B.
MRM Validation Cycles 1
All 9 models passed MRM validation in a single review cycle with zero findings requiring model rework. Industry average for ML model validation is 2.8 cycles. Governance-first design was the difference.
Key Takeaways

What Banks Get Wrong When Deploying ML Credit Models

01
Governance is a design discipline, not a compliance exercise. Banks that treat SR 11-7 compliance as a post-development checklist routinely spend 6 to 18 months in validation cycles reworking models that were not designed with governance in mind. Designing for governance from the first sprint eliminates this waste entirely.
02
Feature engineering creates more value than algorithm selection. The step change in predictive performance in this program came from 340 carefully engineered behavioral features, not from choosing XGBoost over logistic regression. In credit risk, a well-featured logistic regression outperforms a poorly featured neural network every time.
03
Adverse action explainability must be individual-level, not global. Many banks deploy ML models with global SHAP importance and believe they have solved the explainability problem. They have not. ECOA adverse action requirements are individual-level, deterministic, and legally binding. Individual SHAP with reason code mapping is the minimum viable standard.
04
Better risk discrimination grows both sides of the ledger. The $180M loss reduction was the headline metric, but the 6% increase in approval rates for previously misclassified applicants was equally significant. Improved credit models do not just reduce losses. They also identify creditworthy customers who legacy models turned away.

Credit risk model programs in financial services fail in two predictable ways: they either produce technically strong models that cannot pass MRM validation, or they produce MRM-compliant models with marginal performance improvement over what they replace. This engagement delivered neither failure mode. Nine models, one validation cycle, and a financial impact that was visible in our charge-off rate within a quarter of deployment. That is not a typical outcome. The governance-first design methodology is the reason it happened.

Chief Risk Officer
Top 20 US Retail Bank
Get Your Assessment

Are Your Credit Risk Models Costing You Basis Points?

Our senior advisors have built and validated credit risk models for retail banks, card issuers, consumer finance companies, and SME lenders in 18 countries. We can assess your current model performance gap and quantify the financial impact of a next-generation ML program for your specific portfolio.

  • Free preliminary assessment with 48-hour turnaround
  • Senior advisors with direct SR 11-7 and model risk experience
  • No vendor relationships influencing our architecture recommendations
  • Benchmarks from 30+ financial services AI model programs
  • Specific to your portfolio characteristics and regulatory context

Request Your Free AI Assessment

Senior advisor response within 24 hours.
