You walk into a bank and apply for a loan. Within 30 seconds, a model has scored you, ranked you against millions of other borrowers, and decided whether you're worth the risk. The bank doesn't read your essay about why you need the money. It reads a three-digit number. That number — your credit score — determines your interest rate, your credit limit, and sometimes whether you get approved at all. A 2025 systematic review of over 330 research papers found that the methods behind this number have evolved dramatically: from logistic regression scorecards to gradient-boosted ensembles that achieve AUCs above 0.85. But the fundamental question remains the same — can you predict who will default?
What Goes Into a Credit Score
The FICO score is the industry standard. Over 90% of top US lenders use some version of it. The score ranges from 300 to 850, and it's built from five components — each weighted by how predictive it is of default.
Payment History (35%) dominates because past behavior is the strongest predictor of future behavior. If you've missed payments before, you'll probably miss them again. Credit Utilization (30%) measures how much of your available credit you're using — maxing out your cards signals financial stress. Length of Credit History (15%) rewards longevity; longer track records mean more data. New Credit (10%) penalizes a flurry of recent applications. Credit Mix (10%) gives a small bonus for having both revolving (credit cards) and installment (mortgages, car loans) accounts.
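The five weights can be read as a weighted average of per-component sub-scores. Here is a minimal sketch in Python — the category weights are FICO's published figures, but the 0–100 sub-scores and the linear mapping onto the 300–850 range are invented for illustration; the real point mapping is proprietary.

```python
# Hypothetical component sub-scores on a 0-100 scale. The weights are the
# published FICO category weights; how a real credit file maps to these
# sub-scores is proprietary and invented here.
weights = {
    "payment_history": 0.35,
    "utilization":     0.30,
    "history_length":  0.15,
    "new_credit":      0.10,
    "credit_mix":      0.10,
}
subscores = {
    "payment_history": 95,  # no missed payments
    "utilization":     60,  # cards about half-full
    "history_length":  40,  # young credit file
    "new_credit":      80,
    "credit_mix":      70,
}

composite = sum(weights[k] * subscores[k] for k in weights)  # 0-100
score = 300 + composite / 100 * 550  # stretch onto the 300-850 range
print(round(score))
```

Note how a perfect payment history can't rescue a short credit file: the 15% weight on history length caps how much that component can move the total.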
The FICO score was invented in 1989 by Fair, Isaac and Company. Over 35 years later, it remains the dominant scoring system in the US. The model has been updated — FICO 8, FICO 9, FICO 10 — but the core five-component structure has been remarkably stable. FICO 10T, the latest version, introduced 24-month trended data instead of point-in-time snapshots.
Score Anatomy: Build a Credit Profile
The problem with component-based scoring is that it's a black box to the consumer. You know the five categories exist, but you don't know exactly how your specific payment history maps to points. And the weights themselves are averages — the actual impact of each factor varies by individual profile.[1]
VantageScore, the main competitor to FICO, uses a similar 300-850 range but weights factors differently. It's more forgiving of medical debt and treats recent behavior more heavily. Both models agree on the basic principle: payment history matters most.
Turning Raw Data Into Signals
Raw features like age, income, and debt ratio are continuous numbers. A 45-year-old isn't categorically different from a 46-year-old, but a 25-year-old is very different from a 65-year-old. To build a scorecard, you need to transform these continuous features into discrete risk buckets. That's where Weight of Evidence (WoE) comes in.
WoE bins a continuous variable and measures how much each bin separates good borrowers from bad ones. The formula is simple: WoE = ln(% of Good in bin / % of Bad in bin). A positive WoE means the bin has more good borrowers than expected. A negative WoE means more bad borrowers. The larger the absolute value, the stronger the signal.
Information Value (IV) summarizes the total predictive power of a variable across all its bins: IV = ∑(%Good − %Bad) × WoE. It's the go-to metric for feature screening in credit scoring. IV below 0.02 means the variable is useless. Between 0.02 and 0.10, it's weak. Between 0.10 and 0.30, medium. Above 0.30, you have a strong predictor.
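Both calculations fit in a few lines of pure Python. The bin edges and good/bad counts below are invented for illustration — in practice they come from binning a real portfolio.

```python
import math

# Hypothetical binned variable: (good count, bad count) per age bin.
# These counts are illustrative, not from a real portfolio.
bins = {
    "18-25": (400, 60),
    "26-35": (900, 90),
    "36-50": (1500, 90),
    "51+":   (1200, 60),
}

total_good = sum(g for g, b in bins.values())
total_bad = sum(b for g, b in bins.values())

iv = 0.0
for label, (good, bad) in bins.items():
    pct_good = good / total_good        # this bin's share of all goods
    pct_bad = bad / total_bad           # this bin's share of all bads
    woe = math.log(pct_good / pct_bad)  # WoE = ln(%Good / %Bad)
    iv += (pct_good - pct_bad) * woe    # IV accumulates across bins
    print(f"{label}: WoE = {woe:+.3f}")

print(f"IV = {iv:.3f}")
```

On these toy counts the youngest bin gets a strongly negative WoE (over-represented among bads) and the IV lands around 0.15 — a "medium" predictor by the thresholds above.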
Weight of Evidence Explorer
Why does WoE matter? It transforms non-linear relationships into something logistic regression can handle. A raw age variable has a complex, non-monotonic relationship with default risk. After WoE transformation, each bin gets a single number that captures its relative riskiness. This is why traditional scorecards still work despite being "simple" — the feature engineering does the heavy lifting.
The Scorecard
A credit scorecard is a logistic regression model wearing a disguise. Under the hood, it's estimating the log-odds of default. But instead of showing you coefficients, it translates everything into points. Each attribute bin gets a fixed number of points. You sum them up, and the total maps to a probability of default.
The math is straightforward: Score = Offset + Factor × log-odds. The Offset and Factor are chosen so that a convenient reference point maps to a round score — for example, odds of 50:1 against default might equal 600 points, with every 20 additional points doubling the odds in your favor.
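The Offset and Factor fall directly out of those two calibration choices. A minimal sketch, assuming the reference point from the example above (600 points at 50:1 odds, 20 points to double the odds):

```python
import math

# Calibration targets (the assumptions from the example above):
# a score of 600 at odds of 50:1, and 20 points to double the odds (PDO).
target_score = 600
target_odds = 50
pdo = 20

factor = pdo / math.log(2)  # each unit of ln(odds) is worth ~28.85 points
offset = target_score - factor * math.log(target_odds)

def score_from_odds(odds):
    """Map good:bad odds to a scorecard score: Score = Offset + Factor * ln(odds)."""
    return offset + factor * math.log(odds)

print(round(score_from_odds(50)))   # 600 by construction
print(round(score_from_odds(100)))  # 620: doubling the odds adds PDO points
```

In a full scorecard the same Factor and Offset are then spread across the attribute bins, so each bin's points are its WoE times its coefficient times Factor, plus a share of the Offset.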
Banks love scorecards because they're interpretable, auditable, and compliant with SR 11-7[2]. A loan officer can literally look at the points table and explain exactly why an applicant scored the way they did. No black box, no "the algorithm decided."
SR 11-7 is the Federal Reserve's model risk management guidance, issued in April 2011. It governs how banks develop, validate, and monitor models used for credit decisions. It's the regulatory bible for model governance in US banking.
Scorecard Simulator
Enter Machine Learning
The logistic regression scorecard achieves an AUC of around 0.72 on credit data. For decades, this was "good enough." Then gradient-boosted trees arrived.
XGBoost and LightGBM achieve AUC of 0.84–0.86 on the same data. That 12–14 point jump in AUC translates to catching significantly more defaults at the same false alarm rate. At a 5% false positive rate, logistic regression catches about 18% of defaults. LightGBM catches 38%. That's more than double the detection rate.
LightGBM is also 20x faster to train than XGBoost on large datasets, thanks to histogram-based binning and leaf-wise tree growth. When you're processing millions of loan applications, that speed matters. A 2025 study in Scientific Reports showed that a hybrid LightGBM framework captures non-linear interactions that logistic regression fundamentally cannot model.
The tradeoff is interpretability. A scorecard is transparent. A 1,000-tree gradient-boosted ensemble is not. Regulators require adverse action reasons — specific explanations for why an applicant was denied. This tension between accuracy and explainability is the central challenge in modern credit scoring.
ROC Curve Explorer: Model Comparison
The Gini coefficient is the most widely used metric in credit scoring. It relates to AUC by a simple formula: Gini = 2 × AUC − 1. A Gini above 0.60 is considered strong. LightGBM's Gini of 0.72 means it ranks defaulters ahead of non-defaulters 86% of the time. For logistic regression (Gini 0.44), that figure drops to 72%.
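The "ranks defaulters ahead of non-defaulters X% of the time" reading is exactly the pairwise definition of AUC, which makes it easy to verify by brute force. A minimal sketch on invented risk scores (not from a real model):

```python
from itertools import product

def pairwise_auc(bad_scores, good_scores):
    """AUC as the probability that a randomly chosen defaulter gets a
    higher risk score than a randomly chosen non-defaulter (ties count half)."""
    wins = 0.0
    for b, g in product(bad_scores, good_scores):
        if b > g:
            wins += 1
        elif b == g:
            wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))

# Toy risk scores, illustrative only
bad = [0.9, 0.8, 0.6, 0.4]        # borrowers who defaulted
good = [0.7, 0.5, 0.3, 0.2, 0.1]  # borrowers who repaid

auc = pairwise_auc(bad, good)
gini = 2 * auc - 1  # Gini = 2 * AUC - 1
print(auc, gini)
```

Here the model wins 17 of the 20 possible defaulter-vs-repayer comparisons, so AUC = 0.85 and Gini = 0.70.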
Why Was I Denied?
When a bank denies your loan application, US law (the Equal Credit Opportunity Act) requires them to tell you why. "The model said no" is not a valid answer. They need specific reasons: "your credit utilization was too high" or "you've had recent late payments."
SHAP (SHapley Additive exPlanations) solves this problem using game theory. It is built on the Shapley value — Nobel laureate Lloyd Shapley's solution concept for fairly dividing a cooperative game's payoff — and assigns each feature an exact contribution to the model's prediction. The idea: for every possible subset of the other features, measure how much adding this feature changes the prediction, then take a weighted average of those marginal contributions. The result is a consistent, mathematically rigorous explanation.
The base value is the average prediction across all borrowers — in our dataset, about 6.8% (the overall default rate). Each feature then pushes the prediction up (toward default) or down (toward safe). Green bars push toward approval. Red bars push toward denial. The final prediction is the sum of the base value and all feature contributions.
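That additivity property — base value plus all contributions equals the prediction — can be checked directly by computing Shapley values by brute force. The sketch below uses a tiny hypothetical linear risk model (invented coefficients, not a trained model) and replaces "removed" features with portfolio-average values, a common independence assumption; production systems use the shap library's fast tree algorithms instead of this exponential enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature coalition.
    Features outside a coalition are replaced by baseline values."""
    n = len(x)

    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Standard Shapley weight for a coalition of this size
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical 3-feature model: utilization, late payments, years of history
predict = lambda z: 0.078 + 0.10 * z[0] + 0.04 * z[1] - 0.005 * z[2]
x = [0.9, 2, 3]          # this applicant
baseline = [0.3, 0, 8]   # portfolio-average borrower

phi = shapley_values(predict, x, baseline)
base = predict(baseline)  # the "base value": prediction for the average borrower
print(base, phi, base + sum(phi))  # base + contributions == predict(x)
```

High utilization and two late payments push this applicant's predicted default risk up from the 6.8% base; the short history adds a little more, and the parts sum exactly to the model's output.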
SHAP Explainer: Why This Decision?
ECOA compliance requires specific adverse action reasons. A 2024 study indexed in PubMed Central showed that SHAP explanations applied to credit scorecards improve both model trust and regulatory auditability. The top SHAP contributors map directly to the adverse action codes that lenders must provide: "excessive revolving utilization" (code 061), "delinquency on accounts" (code 007), "insufficient length of credit history" (code 009).
When Models Go Stale
A credit model is trained on historical data. But borrower populations change. Recessions hit. Lending policies shift. New customer segments emerge. The model that performed beautifully at launch slowly loses its edge. This is called population drift, and it's the reason you can't just deploy a model and walk away.
The Population Stability Index (PSI) measures how much the current borrower distribution has shifted from the development population. The formula sums across bins: PSI = ∑(Actual% − Expected%) × ln(Actual% / Expected%). Three thresholds guide action: PSI below 0.10 means stable — no action needed. Between 0.10 and 0.25, investigate the drift. Above 0.25, it's time to retrain.
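The formula is a one-liner once you have the two distributions. A minimal sketch, using invented score-band proportions for the development population and the current quarter:

```python
import math

def psi(expected, actual):
    """Population Stability Index across matching bins.
    Inputs are bin proportions; each list should sum to 1."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual))

# Hypothetical score-band mix: development sample vs. current quarter
expected = [0.10, 0.20, 0.40, 0.20, 0.10]
actual   = [0.06, 0.14, 0.38, 0.26, 0.16]

value = psi(expected, actual)
print(f"PSI = {value:.3f}")
```

These toy distributions give a PSI just under 0.10 — technically "stable", but close enough to the first threshold that a monitoring team would keep an eye on the next quarter. Note the formula blows up if any bin's proportion hits zero; in practice bins are chosen (or smoothed) to avoid empty cells.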
SR 11-7 mandates ongoing monitoring programs. In practice, most credit models show measurable drift within 6–12 months of deployment. Banks run champion-challenger tests: the production model (champion) is continuously compared against a retrained model (challenger). When PSI triggers, the challenger gets promoted if it outperforms.
PSI Drift Monitor
◆
Written by Danish Mohd.
AI product builder. Previously VP Engineering at Pixis AI. Built credit risk models for a small finance bank.