Datasets for Credit Risk Modeling

This tutorial lists several free, publicly available datasets that can be used for credit risk modeling. In banking, credit risk is a critical function that makes sure the bank holds sufficient capital to protect depositors from credit, market and operational risks. It also keeps the bank in compliance with central bank regulations.

Important Credit Risk Modeling Projects

  1. Probability of Default (PD) is the likelihood that a borrower will default on a debt (loan or credit card). In simple words, it is the estimated probability that a customer fails to repay the loan.
  2. Loss Given Default (LGD) is the proportion of the total exposure that is lost when a borrower defaults. It is calculated as (1 - Recovery Rate). For example, someone takes a $200,000 loan from a bank to purchase a flat and pays some installments before stopping. At the time of default, the loan has an outstanding balance of $100,000. The bank takes possession of the flat and sells it for $90,000. The net loss to the bank is $10,000 ($100,000 - $90,000), and the LGD is 10%, i.e. $10,000 / $100,000.
  3. Exposure at Default (EAD) is the amount the borrower owes the bank at the time of default. In the LGD example above, the outstanding balance of $100,000 is the EAD.
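
The arithmetic in the flat example can be checked with a few lines of Python. This is only a minimal sketch: the function name loss_given_default is hypothetical, and it ignores recovery costs and discounting, which a real LGD model would normally account for.

```python
def loss_given_default(exposure_at_default, recovered_amount):
    """Return LGD as a fraction of the exposure at default (EAD)."""
    if exposure_at_default <= 0:
        raise ValueError("exposure_at_default must be positive")
    recovery_rate = recovered_amount / exposure_at_default
    return 1 - recovery_rate


# Figures from the flat example: $100,000 outstanding, flat sold for $90,000.
ead = 100_000
recovered = 90_000

lgd = loss_given_default(ead, recovered)
net_loss = ead - recovered

print(f"Recovery rate: {recovered / ead:.0%}")
print(f"LGD: {lgd:.0%}")
print(f"Net loss: ${net_loss:,}")
```

Note that the original loan amount of $200,000 plays no role here; only the balance outstanding at the time of default matters.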

Datasets for Credit Risk Modeling Projects

We have gathered data from several sources, listed below. These sources own the copyright on the data and authorize their reproduction.
  1. Kaggle
  2. UCI Machine Learning Repository
  3. Econometric Analysis Book by William H. Greene
  4. Credit scoring and its applications Book by Lyn C. Thomas
  5. Credit Risk Analytics Book by Bart Baesens, Daniel Roesch and Harald Scheule
  6. Lending Club
  7. PAKDD 2009 Data Mining Competition, organized by NeuroTech Ltd. and Center for Informatics of the Federal University of Pernambuco
Kaggle : Home Credit Default Risk
It includes variables from different sources that are needed to build a robust and accurate probability of default model.
  • Credit bureau variables, which contain details about the borrower's previous credits provided by other banks
  • Previous loans that the applicant had with Home Credit
  • Previous point of sale and cash loans that the applicant had with Home Credit
  • Previous credit cards that the applicant had with Home Credit
Download data and data dictionary
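
Because the information is spread over several sources with many rows per applicant, a common first step is to summarise each source per applicant and join the summary onto the application records. The sketch below shows the idea with the standard library only. The column names SK_ID_CURR, TARGET and AMT_CREDIT_SUM are assumptions about the downloaded files, so check them against the data dictionary. The rows are made up for illustration.

```python
from collections import defaultdict

# Illustrative rows only; they are NOT taken from the real files.
# Column names are assumptions to be checked against the data dictionary.
applications = [
    {"SK_ID_CURR": 1, "TARGET": 0},
    {"SK_ID_CURR": 2, "TARGET": 1},
    {"SK_ID_CURR": 3, "TARGET": 0},
]
bureau = [
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 5000.0},
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 12000.0},
    {"SK_ID_CURR": 2, "AMT_CREDIT_SUM": 3000.0},
]


def aggregate_bureau(rows):
    """Summarise previous credits per applicant: count and total amount."""
    summary = defaultdict(lambda: {"n_prev_credits": 0, "total_prev_credit": 0.0})
    for row in rows:
        agg = summary[row["SK_ID_CURR"]]
        agg["n_prev_credits"] += 1
        agg["total_prev_credit"] += row["AMT_CREDIT_SUM"]
    return summary


def build_model_table(applications, bureau):
    """Left-join the per-applicant summary onto the application rows."""
    summary = aggregate_bureau(bureau)
    empty = {"n_prev_credits": 0, "total_prev_credit": 0.0}
    return [{**app, **summary.get(app["SK_ID_CURR"], empty)} for app in applications]


model_table = build_model_table(applications, bureau)
for row in model_table:
    print(row)
```

Applicants with no bureau history (applicant 3 here) keep their row and receive zero counts, which is usually what a PD model needs. On the full files a dataframe library would be the more practical choice, but the join logic is the same.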
Kaggle : Give Me Some Credit
Kaggle organised a competition a few years ago with the problem statement: build a probability of default model that predicts defaulters in the next two years. Download the data by visiting the website. See the data dictionary below:
Variable Name: Description
SeriousDlqin2yrs: Person experienced 90 days past due delinquency or worse
RevolvingUtilizationOfUnsecuredLines: Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits
age: Age of borrower in years
NumberOfTime30-59DaysPastDueNotWorse: Number of times borrower has been 30-59 days past due but no worse in the last 2 years
DebtRatio: Monthly debt payments, alimony and living costs divided by monthly gross income
MonthlyIncome: Monthly income
NumberOfOpenCreditLinesAndLoans: Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards)
NumberOfTimes90DaysLate: Number of times borrower has been 90 days or more past due
NumberRealEstateLoansOrLines: Number of mortgage and real estate loans, including home equity lines of credit
NumberOfTime60-89DaysPastDueNotWorse: Number of times borrower has been 60-89 days past due but no worse in the last 2 years
NumberOfDependents: Number of dependents in family, excluding the borrower (spouse, children etc.)
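
Before modeling, it helps to check the bad rate (the share of records with SeriousDlqin2yrs equal to 1) and how many values are missing in each column. The sketch below does this with the standard library on a small synthetic sample. The values are made up, only a few of the columns are shown, and the use of "NA" as the missing-value marker is an assumption to verify against the downloaded file.

```python
import csv
import io

# Synthetic sample for illustration; the values are made up.
# "NA" as the missing-value marker is an assumption about the real file.
SAMPLE_CSV = """SeriousDlqin2yrs,age,MonthlyIncome,NumberOfDependents
1,45,9120,2
0,40,2600,1
0,38,NA,0
0,30,3300,NA
"""


def summarise(csv_text, target="SeriousDlqin2yrs", missing_token="NA"):
    """Return the bad rate and the per-column count of missing values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    bad_rate = sum(int(r[target]) for r in rows) / len(rows)
    missing = {
        col: sum(1 for r in rows if r[col] == missing_token)
        for col in rows[0]
    }
    return bad_rate, missing


bad_rate, missing = summarise(SAMPLE_CSV)
print(f"Bad rate: {bad_rate:.1%}")
print("Missing values:", missing)
```

To run it on the real data, read the downloaded file into a string (or adapt the function to take an open file) and pass it in place of SAMPLE_CSV.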
Econometric Analysis Book by William H. Greene
This book includes credit card data with a binary target variable (1 if the application for a credit card was accepted, 0 if not) and a few independent variables on the demographics and credit history of the applicants.

You can download the data and its description from this link.

UCI Machine Learning Repository
This repository contains sample credit application data from several countries.
The dataset on credit card defaults in Taiwan contains several attributes that can be used to test various machine learning algorithms for building a credit scorecard.
Note: The Poland dataset contains attributes of companies rather than retail customers.
PAKDD 2009 Data Mining Competition
It is credit card application data of Brazilian customers. It has a labeled dataset covering a one-year period for training a credit scoring model. You can then score the leaderboard dataset, which is from one year later. To download the data, click on this link, Download Data, and then click on the Download button.
Credit Risk Analytics Book
Lending Club
It contains peer-to-peer lending data for issued loans, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Check out this link - Download Data
Credit scoring and its applications (Lyn C. Thomas, David B. Edelman, Jonathan N. Crook)
Download Data
The data description is shown below:
Bad Good/bad indicator 
  1 = Bad
  0 = Good

yob Year of birth (If unknown the year will be 99)
nkid Number of children
dep Number of other dependents
phon Is there a home phone (1 = yes, 0 = no)
sinc Spouse's income

aes Applicant's employment status 
  V = Government
  W = housewife
  M = military 
  P = private sector
  B = public sector
  R = retired
  E = self employed
  T = student
  U = unemployed
  N = others
  Z = no response

dainc Applicant's income 
res Residential status 
  O = Owner
  F = tenant furnished
  U = Tenant Unfurnished
  P = With parents
  N = Other
  Z = No response

dhval Value of Home  
  0 = no response or not owner
  000001 = zero value
  blank = no response

dmort Mortgage balance outstanding
  0 = no response or not owner
  000001 = zero balance
  blank = no response

doutm Outgoings on mortgage or rent 
doutl Outgoings on Loans 
douthp Outgoings on Hire Purchase 
doutcc Outgoings on credit cards 
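
The coded fields in this description can be turned into readable values with simple lookup tables. The sketch below is one way to do it, under stated assumptions: the record is a dictionary of strings keyed by lower-case variable names (so "Bad" becomes "bad"), the function name decode_record is hypothetical, and the example record is made up rather than taken from the dataset.

```python
# Code tables copied from the data description above.
EMPLOYMENT_STATUS = {
    "V": "Government", "W": "Housewife", "M": "Military",
    "P": "Private sector", "B": "Public sector", "R": "Retired",
    "E": "Self employed", "T": "Student", "U": "Unemployed",
    "N": "Others", "Z": "No response",
}
RESIDENTIAL_STATUS = {
    "O": "Owner", "F": "Tenant furnished", "U": "Tenant unfurnished",
    "P": "With parents", "N": "Other", "Z": "No response",
}


def decode_record(record):
    """Translate the coded fields of one applicant record into readable values."""
    yob = int(record["yob"])
    return {
        "bad": int(record["bad"]) == 1,               # 1 = Bad, 0 = Good
        "year_of_birth": None if yob == 99 else yob,  # 99 means unknown
        "employment": EMPLOYMENT_STATUS.get(record["aes"].strip(), "Unknown code"),
        "residence": RESIDENTIAL_STATUS.get(record["res"].strip(), "Unknown code"),
    }


# Hypothetical record, for illustration only.
example = {"bad": "0", "yob": "99", "aes": "P", "res": "O"}
decoded = decode_record(example)
print(decoded)
```

Note that "U" means unemployed in aes but tenant unfurnished in res, which is why each field needs its own table. The special values of dhval and dmort (0, 000001 and blank) are not handled here and would need similar treatment.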
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Telecom and Human Resource.
