You walk into a bank and apply for a loan. Within 30 seconds, a model has scored you, ranked you against millions of other borrowers, and decided whether you're worth the risk. The bank doesn't read your essay about why you need the money. It reads a three-digit number. That number — your credit score — determines your interest rate, your credit limit, and sometimes whether you get approved at all. A 2025 systematic review of over 330 research papers found that the methods behind this number have evolved dramatically: from logistic regression scorecards to gradient-boosted ensembles that achieve AUCs above 0.85. But the fundamental question remains the same — can you predict who will default?
What Goes Into a Credit Score
The FICO score is the industry standard. Over 90% of top US lenders use some version of it. The score ranges from 300 to 850, and it's built from five components — each weighted by how predictive it is of default.
Payment History (35%) dominates because past behavior is the strongest predictor of future behavior. If you've missed payments before, you'll probably miss them again. Credit Utilization (30%) measures how much of your available credit you're using — maxing out your cards signals financial stress. Length of Credit History (15%) rewards longevity; longer track records mean more data. New Credit (10%) penalizes a flurry of recent applications. Credit Mix (10%) gives a small bonus for having both revolving (credit cards) and installment (mortgages, car loans) accounts.
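The five weights can be read as a weighted average of per-component sub-scores. Here is a minimal sketch in Python — the category weights are FICO's published figures, but the 0–100 sub-scores and the linear mapping onto the 300–850 range are invented for illustration; the real point mapping is proprietary.

```python
# Hypothetical component sub-scores on a 0-100 scale. The weights are the
# published FICO category weights; how a real credit file maps to these
# sub-scores is proprietary and invented here.
weights = {
    "payment_history": 0.35,
    "utilization":     0.30,
    "history_length":  0.15,
    "new_credit":      0.10,
    "credit_mix":      0.10,
}
subscores = {
    "payment_history": 95,  # no missed payments
    "utilization":     60,  # cards about half-full
    "history_length":  40,  # young credit file
    "new_credit":      80,
    "credit_mix":      70,
}

composite = sum(weights[k] * subscores[k] for k in weights)  # 0-100
score = 300 + composite / 100 * 550  # stretch onto the 300-850 range
print(round(score))
```

Note how a perfect payment history can't rescue a short credit file: the 15% weight on history length caps how much that component can move the total.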
The FICO score was invented in 1989 by Fair, Isaac and Company. Over 35 years later, it remains the dominant scoring system in the US. The model has been updated — FICO 8, FICO 9, FICO 10 — but the core five-component structure has been remarkably stable. FICO 10T, the latest version, introduced 24-month trended data instead of point-in-time snapshots.
Score Anatomy: Build a Credit Profile
The problem with component-based scoring is that it's a black box to the consumer. You know the five categories exist, but you don't know exactly how your specific payment history maps to points. And the weights themselves are averages — the actual impact of each factor varies by individual profile.[1]
VantageScore, the main competitor to FICO, uses a similar 300-850 range but weights factors differently. It's more forgiving of medical debt and treats recent behavior more heavily. Both models agree on the basic principle: payment history matters most.
Turning Raw Data Into Signals
Raw features like age, income, and debt ratio are continuous numbers. A 45-year-old isn't categorically different from a 46-year-old, but a 25-year-old is very different from a 65-year-old. To build a scorecard, you need to transform these continuous features into discrete risk buckets. That's where Weight of Evidence (WoE) comes in.
WoE bins a continuous variable and measures how much each bin separates good borrowers from bad ones. The formula is simple: WoE = ln(% of Good in bin / % of Bad in bin). A positive WoE means the bin has more good borrowers than expected. A negative WoE means more bad borrowers. The larger the absolute value, the stronger the signal.
Information Value (IV) summarizes the total predictive power of a variable across all its bins: IV = ∑(%Good − %Bad) × WoE. It's the go-to metric for feature screening in credit scoring. IV below 0.02 means the variable is useless. Between 0.02 and 0.10, it's weak. Between 0.10 and 0.30, medium. Above 0.30, you have a strong predictor.
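Both calculations fit in a few lines of pure Python. The bin edges and good/bad counts below are invented for illustration — in practice they come from binning a real portfolio.

```python
import math

# Hypothetical binned variable: (good count, bad count) per age bin.
# These counts are illustrative, not from a real portfolio.
bins = {
    "18-25": (400, 60),
    "26-35": (900, 90),
    "36-50": (1500, 90),
    "51+":   (1200, 60),
}

total_good = sum(g for g, b in bins.values())
total_bad = sum(b for g, b in bins.values())

iv = 0.0
for label, (good, bad) in bins.items():
    pct_good = good / total_good        # this bin's share of all goods
    pct_bad = bad / total_bad           # this bin's share of all bads
    woe = math.log(pct_good / pct_bad)  # WoE = ln(%Good / %Bad)
    iv += (pct_good - pct_bad) * woe    # IV accumulates across bins
    print(f"{label}: WoE = {woe:+.3f}")

print(f"IV = {iv:.3f}")
```

On these toy counts the youngest bin gets a strongly negative WoE (over-represented among bads) and the IV lands around 0.15 — a "medium" predictor by the thresholds above.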
Weight of Evidence Explorer
Why does WoE matter? It transforms non-linear relationships into something logistic regression can handle. A raw age variable has a complex, non-monotonic relationship with default risk. After WoE transformation, each bin gets a single number that captures its relative riskiness. This is why traditional scorecards still work despite being "simple" — the feature engineering does the heavy lifting.
The Scorecard
A credit scorecard is a logistic regression model wearing a disguise. Under the hood, it's estimating the log-odds of default. But instead of showing you coefficients, it translates everything into points. Each attribute bin gets a fixed number of points. You sum them up, and the total maps to a probability of default.
The math is straightforward: Score = Offset + Factor × log-odds. The Offset and Factor are chosen so that a convenient reference point maps to a round score — for example, odds of 50:1 against default might equal 600 points, with every 20 additional points doubling the odds in your favor.
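The Offset and Factor fall directly out of those two calibration choices. A minimal sketch, assuming the reference point from the example above (600 points at 50:1 odds, 20 points to double the odds):

```python
import math

# Calibration targets (the assumptions from the example above):
# a score of 600 at odds of 50:1, and 20 points to double the odds (PDO).
target_score = 600
target_odds = 50
pdo = 20

factor = pdo / math.log(2)  # each unit of ln(odds) is worth ~28.85 points
offset = target_score - factor * math.log(target_odds)

def score_from_odds(odds):
    """Map good:bad odds to a scorecard score: Score = Offset + Factor * ln(odds)."""
    return offset + factor * math.log(odds)

print(round(score_from_odds(50)))   # 600 by construction
print(round(score_from_odds(100)))  # 620: doubling the odds adds PDO points
```

In a full scorecard the same Factor and Offset are then spread across the attribute bins, so each bin's points are its WoE times its coefficient times Factor, plus a share of the Offset.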
Banks love scorecards because they're interpretable, auditable, and compliant with SR 11-7[2]. A loan officer can literally look at the points table and explain exactly why an applicant scored the way they did. No black box, no "the algorithm decided."
SR 11-7 is the Federal Reserve's model risk management guidance, issued in April 2011. It governs how banks develop, validate, and monitor models used for credit decisions. It's the regulatory bible for model governance in US banking.
Scorecard Simulator
Enter Machine Learning
The logistic regression scorecard achieves an AUC of around 0.72 on credit data. For decades, this was "good enough." Then gradient-boosted trees arrived.
XGBoost and LightGBM achieve AUC of 0.84–0.86 on the same data. That 12–14 point jump in AUC translates to catching significantly more defaults at the same false alarm rate. At a 5% false positive rate, logistic regression catches about 18% of defaults. LightGBM catches 38%. That's more than double the detection rate.
LightGBM is also 20x faster to train than XGBoost on large datasets, thanks to histogram-based binning and leaf-wise tree growth. When you're processing millions of loan applications, that speed matters. A 2025 study in Scientific Reports showed that a hybrid LightGBM framework captures non-linear interactions that logistic regression fundamentally cannot model.
The tradeoff is interpretability. A scorecard is transparent. A 1,000-tree gradient-boosted ensemble is not. Regulators require adverse action reasons — specific explanations for why an applicant was denied. This tension between accuracy and explainability is the central challenge in modern credit scoring.
ROC Curve Explorer: Model Comparison
The Gini coefficient is the most widely used metric in credit scoring. It relates to AUC by a simple formula: Gini = 2 × AUC − 1. A Gini above 0.60 is considered strong. LightGBM's Gini of 0.72 means it ranks defaulters ahead of non-defaulters 86% of the time. For logistic regression (Gini 0.44), that figure drops to 72%.
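The "ranks defaulters ahead of non-defaulters X% of the time" reading is exactly the pairwise definition of AUC, which makes it easy to verify by brute force. A minimal sketch on invented risk scores (not from a real model):

```python
from itertools import product

def pairwise_auc(bad_scores, good_scores):
    """AUC as the probability that a randomly chosen defaulter gets a
    higher risk score than a randomly chosen non-defaulter (ties count half)."""
    wins = 0.0
    for b, g in product(bad_scores, good_scores):
        if b > g:
            wins += 1
        elif b == g:
            wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))

# Toy risk scores, illustrative only
bad = [0.9, 0.8, 0.6, 0.4]        # borrowers who defaulted
good = [0.7, 0.5, 0.3, 0.2, 0.1]  # borrowers who repaid

auc = pairwise_auc(bad, good)
gini = 2 * auc - 1  # Gini = 2 * AUC - 1
print(auc, gini)
```

Here the model wins 17 of the 20 possible defaulter-vs-repayer comparisons, so AUC = 0.85 and Gini = 0.70.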
Why Was I Denied?
When a bank denies your loan application, US law (the Equal Credit Opportunity Act) requires them to tell you why. "The model said no" is not a valid answer. They need specific reasons: "your credit utilization was too high" or "you've had recent late payments."
SHAP (SHapley Additive exPlanations) solves this problem using game theory. It is built on the Shapley value — Nobel laureate Lloyd Shapley's solution concept for fairly dividing a cooperative game's payoff — and assigns each feature an exact contribution to the model's prediction. The idea: for every possible subset of the other features, measure how much adding this feature changes the prediction, then take a weighted average of those marginal contributions. The result is a consistent, mathematically rigorous explanation.
The base value is the average prediction across all borrowers — in our dataset, about 6.8% (the overall default rate). Each feature then pushes the prediction up (toward default) or down (toward safe). Green bars push toward approval. Red bars push toward denial. The final prediction is the sum of the base value and all feature contributions.
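That additivity property — base value plus all contributions equals the prediction — can be checked directly by computing Shapley values by brute force. The sketch below uses a tiny hypothetical linear risk model (invented coefficients, not a trained model) and replaces "removed" features with portfolio-average values, a common independence assumption; production systems use the shap library's fast tree algorithms instead of this exponential enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.
    Features outside a coalition are replaced by baseline values."""
    n = len(x)

    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Standard Shapley weight for a coalition of this size
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical 3-feature model: utilization, late payments, years of history
predict = lambda z: 0.078 + 0.10 * z[0] + 0.04 * z[1] - 0.005 * z[2]
x = [0.9, 2, 3]          # this applicant
baseline = [0.3, 0, 8]   # portfolio-average borrower

phi = shapley_values(predict, x, baseline)
base = predict(baseline)  # the "base value": prediction for the average borrower
print(base, phi, base + sum(phi))  # base + contributions == predict(x)
```

High utilization and two late payments push this applicant's predicted default risk up from the 6.8% base; the short history adds a little more, and the parts sum exactly to the model's output.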
SHAP Explainer: Why This Decision?
ECOA compliance requires specific adverse action reasons. A 2024 study indexed in PubMed Central showed that SHAP explanations applied to credit scorecards improve both model trust and regulatory auditability. The top SHAP contributors map directly to the adverse action codes that lenders must provide: "excessive revolving utilization" (code 061), "delinquency on accounts" (code 007), "insufficient length of credit history" (code 009).
When Models Go Stale
A credit model is trained on historical data. But borrower populations change. Recessions hit. Lending policies shift. New customer segments emerge. The model that performed beautifully at launch slowly loses its edge. This is called population drift, and it's the reason you can't just deploy a model and walk away.
The Population Stability Index (PSI) measures how much the current borrower distribution has shifted from the development population. The formula sums across bins: PSI = ∑(Actual% − Expected%) × ln(Actual% / Expected%). Three thresholds guide action: PSI below 0.10 means stable — no action needed. Between 0.10 and 0.25, investigate the drift. Above 0.25, it's time to retrain.
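The formula is a one-liner once you have the two distributions. A minimal sketch, using invented score-band proportions for the development population and the current quarter:

```python
import math

def psi(expected, actual):
    """Population Stability Index across matching bins.
    Inputs are bin proportions; each list should sum to 1."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual))

# Hypothetical score-band mix: development sample vs. current quarter
expected = [0.10, 0.20, 0.40, 0.20, 0.10]
actual   = [0.06, 0.14, 0.38, 0.26, 0.16]

value = psi(expected, actual)
print(f"PSI = {value:.3f}")
```

These toy distributions give a PSI just under 0.10 — technically "stable", but close enough to the first threshold that a monitoring team would keep an eye on the next quarter. Note the formula blows up if any bin's proportion hits zero; in practice bins are chosen (or smoothed) to avoid empty cells.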
SR 11-7 mandates ongoing monitoring programs. In practice, most credit models show measurable drift within 6–12 months of deployment. Banks run champion-challenger tests: the production model (champion) is continuously compared against a retrained model (challenger). When PSI triggers, the challenger gets promoted if it outperforms.
PSI Drift Monitor
◆
Written by Danish Mohd.
AI product builder. Previously VP Engineering at Pixis AI. Built credit risk models for a small finance bank.