Datasets for Credit Risk Modeling

Deepanshu Bhalla
This tutorial lists several free, publicly available datasets that can be used for credit risk modeling. In banking, credit risk is a critical business function that makes sure the bank holds sufficient capital to protect depositors from credit, market and operational risks, while keeping the bank in compliance with central bank regulations.

Important Credit Risk Modeling Projects

  1. Probability of Default (PD) is the likelihood that a borrower will default on a debt (loan or credit card). In simple words, it is the expected probability that a customer fails to repay the loan.
  2. Loss Given Default (LGD) is the proportion of the total exposure that is lost when a borrower defaults. It is calculated as (1 - Recovery Rate). For example, someone takes a $200,000 loan from a bank to purchase a flat. The borrower pays some installments and then stops paying. At the time of default, the loan has an outstanding balance of $100,000. The bank takes possession of the flat and sells it for $90,000. The net loss to the bank is $10,000 ($100,000 - $90,000), so the LGD is 10%, i.e. $10,000 / $100,000.
  3. Exposure at Default (EAD) is the amount that the borrower owes the bank at the time of default. In the LGD example above, the outstanding balance of $100,000 is the EAD.
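The arithmetic in the LGD example can be checked with a few lines of Python. This is only a minimal sketch: the function name `loss_given_default` is made up for illustration, and it ignores recovery costs and discounting, which real LGD models usually account for.

```python
def loss_given_default(ead: float, recovered: float) -> float:
    """Return LGD as a fraction of exposure at default (1 - recovery rate)."""
    if ead <= 0:
        raise ValueError("EAD must be positive")
    recovery_rate = recovered / ead
    return 1.0 - recovery_rate


# Figures from the flat example above.
ead = 100_000        # outstanding balance at the time of default (EAD)
recovered = 90_000   # amount the bank got from selling the flat

lgd = loss_given_default(ead, recovered)
net_loss = ead - recovered

print(f"Recovery rate: {recovered / ead:.0%}")
print(f"LGD: {lgd:.0%}, net loss: ${net_loss:,}")
```

Note that the original loan amount of $200,000 does not enter the calculation; only the balance outstanding at default matters.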

Datasets for Credit Risk Modeling Projects

We have gathered data from several sources, listed below. The following sources own the copyright on these data and authorize their reproduction.
  1. Kaggle
  2. UCI Machine Learning Repository
  3. Econometric Analysis Book by William H. Greene
  4. Credit scoring and its applications Book by Lyn C. Thomas
  5. Credit Risk Analytics Book by Bart Baesens, Daniel Rösch and Harald Scheule
  6. Lending Club
  7. PAKDD 2009 Data Mining Competition, organized by NeuroTech Ltd. and Center for Informatics of the Federal University of Pernambuco
Kaggle : Home Credit Default Risk
It includes variables from several sources that are needed to build a robust and accurate probability of default model.
  • Credit bureau variables, which contain details about the borrower's previous credits provided by other banks
  • Previous loans that the applicant had with Home Credit
  • Previous point-of-sale and cash loans that the applicant had with Home Credit
  • Previous credit cards that the applicant had with Home Credit
Download data and data dictionary
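Because the borrower history sits in separate tables, a common first step is to summarise previous credits per applicant and attach the summary to the application record. The sketch below uses tiny made-up rows instead of the real files. The column names `SK_ID_CURR`, `TARGET` and `AMT_CREDIT_SUM` are assumed to match the Kaggle data dictionary, and the helper `add_bureau_features` is hypothetical.

```python
from collections import defaultdict

# Made-up rows standing in for the application table and the credit bureau table.
# Column names are assumptions; check them against the data dictionary.
applications = [
    {"SK_ID_CURR": 1, "TARGET": 0},
    {"SK_ID_CURR": 2, "TARGET": 1},
    {"SK_ID_CURR": 3, "TARGET": 0},
]
bureau = [
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 5000.0},
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 12000.0},
    {"SK_ID_CURR": 2, "AMT_CREDIT_SUM": 3000.0},
]


def add_bureau_features(applications, bureau):
    """Attach the count and total amount of previous bureau credits to each applicant."""
    count = defaultdict(int)
    total = defaultdict(float)
    for row in bureau:
        count[row["SK_ID_CURR"]] += 1
        total[row["SK_ID_CURR"]] += row["AMT_CREDIT_SUM"]

    enriched = []
    for app in applications:
        key = app["SK_ID_CURR"]
        enriched.append({
            **app,
            # Applicants with no bureau history get zeros rather than being dropped.
            "bureau_credit_count": count.get(key, 0),
            "bureau_credit_total": total.get(key, 0.0),
        })
    return enriched


features = add_bureau_features(applications, bureau)
for row in features:
    print(row)
```

The same pattern (group by applicant ID, aggregate, join back) would apply to the previous loans and credit card tables.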
Kaggle : Give Me Some Credit
A few years ago Kaggle hosted a competition whose problem statement was to build a probability of default model that predicts which borrowers will default in the next two years. Download the data by visiting the website. See the data dictionary below:
Variable Name: Description
SeriousDlqin2yrs: Person experienced 90 days past due delinquency or worse
RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits
age: Age of borrower in years
NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years
DebtRatio: Monthly debt payments, alimony and living costs divided by monthly gross income
MonthlyIncome: Monthly income
NumberOfOpenCreditLinesAndLoans: Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards)
NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due
NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity lines of credit
NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years
NumberOfDependents: Number of dependents in family excluding themselves (spouse, children etc.)
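To show how the data dictionary maps onto a first look at the data, here is a small sketch that reads a CSV with some of these columns, computes the observed default rate and fills missing income values. The rows are made up, the `NA` marker for missing values is an assumption about the file format, and median imputation is just one simple choice among many.

```python
import csv
import io
from statistics import median

# Made-up sample with a subset of the columns from the data dictionary.
SAMPLE = """SeriousDlqin2yrs,age,DebtRatio,MonthlyIncome,NumberOfDependents
1,45,0.80,9120,2
0,40,0.12,2600,1
0,38,0.08,NA,0
0,30,0.03,3300,0
"""


def load_rows(text, missing="NA"):
    """Parse CSV text into dicts of floats, using None for missing values."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({k: (None if v == missing else float(v)) for k, v in rec.items()})
    return rows


rows = load_rows(SAMPLE)

# Share of borrowers flagged as seriously delinquent (the target variable).
default_rate = sum(r["SeriousDlqin2yrs"] for r in rows) / len(rows)

# Replace missing MonthlyIncome with the median of the known values.
known_income = [r["MonthlyIncome"] for r in rows if r["MonthlyIncome"] is not None]
income_median = median(known_income)
for r in rows:
    if r["MonthlyIncome"] is None:
        r["MonthlyIncome"] = income_median

print(f"Default rate: {default_rate:.1%}")
print(f"Median income used for imputation: {income_median}")
```

On the real file you would swap `io.StringIO(SAMPLE)` for an open file handle; the rest stays the same.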
Econometric Analysis Book by William H. Greene
This book includes credit card data comprising a binary target variable (1 if the application for a credit card was accepted, 0 if not) and a few independent variables covering the demographics and credit history of credit card holders.

You can download data and its description from this link

UCI Machine Learning Repository
This repository contains sample credit application data from several countries.
The dataset on credit card defaults in Taiwan contains several attributes or characteristics that can be used to test various machine learning algorithms for building a credit scorecard.
Note: The Poland dataset contains information about attributes of companies rather than retail customers.
PAKDD 2009 Data Mining Competition
It is credit card application data for Brazilian customers. It has a labeled data set covering a one-year period for training a credit scoring model. You can then score the leaderboard data set, which comes from one year later. To download the data, click on this link Download Data and then click on the Download button.
Credit Risk Analytics Book

To download the datasets below, visit the link and fill in the required details in the form. Once the form is submitted, you can download the datasets.

1. Data Set HMEQ

The data set HMEQ reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral.

2. Data Set Mortgage

The data set mortgage is in panel form and reports origination and performance observations for 50,000 residential U.S. mortgage borrowers over 60 periods. The periods have been deidentified. As in the real world, loans may originate before the start of the observation period (this is an issue where loans are transferred between banks and investors as in securitization). The loan observations may thus be censored as the loans mature or borrowers refinance. The data set is a randomized selection of mortgage-loan-level data collected from the portfolios underlying U.S. residential mortgage-backed securities (RMBS) securitization portfolios and provided by International Financial Research (

3. Data Set LGD

The data set has been kindly provided by a European bank and has been slightly modified and anonymized. It includes 2,545 observations on loans and LGDs.

4. Data Set Ratings

The ratings data set is an anonymized data set with corporate ratings where the ratings have been numerically encoded (1 = AAA, etc.).

Lending Club
It contains peer-to-peer lending data for issued loans, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Check out this link - Download Data
Credit scoring and its applications (Lyn C. Thomas, David B. Edelman, Jonathan N. Crook)
Download Data
Data Description is shown below -
Bad: Good/bad indicator
  1 = Bad
  0 = Good

yob: Year of birth (if unknown, the year will be 99)
nkid: Number of children
dep: Number of other dependents
phon: Is there a home phone (1 = yes, 0 = no)
sinc: Spouse's income

aes: Applicant's employment status
  V = Government
  W = Housewife
  M = Military
  P = Private sector
  B = Public sector
  R = Retired
  E = Self employed
  T = Student
  U = Unemployed
  N = Others
  Z = No response

dainc: Applicant's income

res: Residential status
  O = Owner
  F = Tenant furnished
  U = Tenant unfurnished
  P = With parents
  N = Other
  Z = No response

dhval: Value of home
  0 = no response or not owner
  000001 = zero value
  blank = no response

dmort: Mortgage balance outstanding
  0 = no response or not owner
  000001 = zero balance
  blank = no response

doutm: Outgoings on mortgage or rent
doutl: Outgoings on loans
douthp: Outgoings on hire purchase
doutcc: Outgoings on credit cards
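Since several of these fields are stored as codes, it helps to decode them before exploring the data. The sketch below turns the employment status, residential status, year of birth and good/bad codes into readable values using the mappings from the description above. The sample record is made up and the helper `decode` is hypothetical; how the codes are actually typed in the file (strings versus numbers) is an assumption.

```python
# Code tables copied from the data description above.
AES = {
    "V": "Government", "W": "Housewife", "M": "Military", "P": "Private sector",
    "B": "Public sector", "R": "Retired", "E": "Self employed", "T": "Student",
    "U": "Unemployed", "N": "Others", "Z": "No response",
}
RES = {
    "O": "Owner", "F": "Tenant furnished", "U": "Tenant unfurnished",
    "P": "With parents", "N": "Other", "Z": "No response",
}


def decode(record):
    """Return a copy of the record with coded fields replaced by readable values."""
    out = dict(record)
    out["aes"] = AES.get(record["aes"], "Unknown code")
    out["res"] = RES.get(record["res"], "Unknown code")
    # A year of birth of 99 means the value is unknown.
    out["yob"] = None if record["yob"] == 99 else record["yob"]
    out["Bad"] = "Bad" if record["Bad"] == 1 else "Good"
    return out


# Made-up record for illustration only.
sample = {"Bad": 0, "yob": 99, "nkid": 2, "aes": "P", "res": "O"}
decoded = decode(sample)
print(decoded)
```

Fields without a code table, such as `nkid`, pass through unchanged.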
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
