Analysis and Assessment of Credit rating model in P2P lending
An instrument to solve information asymmetry between lenders and borrowers
By
Yang Yang
B.Sc. Management of Science and Project, University of Science and Technology of China, 2007
SUBMITTED TO THE MIT SLOAN SCHOOL OF MANAGEMENT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN MANAGEMENT STUDIES
AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
JUNE 2015
© 2015 Yang Yang. All rights reserved.
The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Sloan School of Management
May 8, 2015
Certified by: Christian Catalini, Assistant Professor of Technological Innovation, Entrepreneurship, and Strategic Management, Thesis Supervisor
Accepted by: Michael A. Cusumano, SMR Distinguished Professor of Management, Program Director, M.S. in Management Studies Program, MIT Sloan School of Management
Analysis and Assessment of Credit rating model in P2P lending
An instrument to solve information asymmetry between lenders and borrowers
By
Yang Yang
Submitted to MIT Sloan School of Management on May 8, 2015 in Partial Fulfillment of the Requirements for the Degree of Master of Science in
Management Studies.
ABSTRACT
Since the establishment of the first P2P lending platform in 2005, the P2P lending industry has been nibbling at the market share of traditional consumer credit. In 2014, Lending Club and Prosper originated over 7 billion USD in personal loans. By comparison, Citi, one of the biggest traditional banks in the U.S., issued 25.2 billion USD in 2014. Given the advantages of P2P lending over traditional banks, the market for P2P lending is expected to grow rapidly as the internal systems of P2P lending platforms improve, external regulation matures, and more borrowers and lenders participate. Although most P2P lending platforms in China first imitated the business models of U.S. or European platforms, they have progressively evolved to incorporate different business models for legislative, economic or behavioral reasons.
Several findings emerge from analyzing the data from Lending Club and Prosper. First, although both platforms have progressively improved the default rate each year, both currently offer negative returns for investors. Second, when only finished/matured loans are considered, a higher credit score does not lead to lower default risk. Third, on average, a defaulted loan costs investors more than twice as much as the interest return offered to them. Taking this cost matrix into consideration, the optimal data model is not necessarily the one with the highest accuracy but the one with the maximum return. Fourth, the ex post return offered by the platforms is not enough to cover the potential risk facing investors.
Thesis Supervisor: Christian Catalini
Title: Assistant Professor of Technological Innovation, Entrepreneurship, and Strategic Management
PURPOSES OF THIS PAPER
It has been almost 10 years since the first P2P lending platform was founded in the UK. While P2P lending has grown rapidly over these 10 years, it is still in its infancy compared to the traditional banking industry. Over 70 academic papers about P2P lending were published between 2008 and 2015, covering different perspectives: determinants of a loan being successfully funded by investors, regulation, credit risk, determinants of credit quality and default probability, business models of P2P lending across countries, internal information systems and literature reviews.
Even though a handful of papers studied credit risk using data mining methodologies, most of them focused on explaining the determinants of a loan being successfully funded. Few studies considered a cost matrix in the model or compared results from Prosper and Lending Club. P2P lending is a two-sided market. To further boost market growth, P2P lending platforms also need to enhance the ability of investors to assess credit risk. By doing so, platforms can offer higher returns and thus attract more investors to participate in lending.
The main purpose of this paper is to identify the key determinants of a loan's default probability and their respective coefficients, and then to build the optimal model to predict the loan's status. This model will serve as a way to mitigate information asymmetry in P2P lending and the gaming behavior of borrowers. In addition, this paper reviews the current development of P2P lending, building on previous literature.
Another motivation for this paper is that the Chinese government has just allowed non-state-owned companies to participate in the personal credit rating business. The public believes this move will be a game changer for the internet finance industry, especially the P2P lending segment. This paper will examine whether a 3rd party credit rating will help investors avoid adverse selection.
Table of Contents
1. INTRODUCTION
1.1 DEFINITION OF P2P LENDING
1.2 HOW P2P LENDING WORKS (LENDING CLUB, PROSPER)
2. MARKET REVIEW OF P2P LENDING
2.1 MARKET SIZE
2.2 KEY PLAYERS AND RESPECTIVE MARKETPLACE
2.3 MARKET OUTLOOK OF P2P LENDING
2.4 BUSINESS MODELS OF P2P LENDING
3. DATA ANALYSIS AND MODELING
3.1 INTRODUCTION
3.2 KEY VARIABLES
3.2.1 Prosper
3.2.2 Lending Club
3.3 DISTRIBUTION OF DATASET
3.3.1 Prosper
3.3.2 Lending Club
3.4 MODEL BUILDING AND INTERPRETATION - LENDING CLUB
3.4.1 Data Preparation
3.4.2 Model Building
3.4.3 Model Interpretation
3.4.4 Robustness Check
3.5 MODEL BUILDING AND INTERPRETATION - PROSPER
3.5.1 Data Preparation
3.5.2 Model Building
3.5.3 Model Interpretation
3.5.4 Robustness Check
3.6 COMPARISON OF FINDINGS IN MODEL BUILDING FOR LENDING CLUB AND PROSPER
3.6.1 Similarities
3.6.2 Differences
3.6.3 Lessons for China's P2P Lending
4. CONCLUSION
4.1 CONCLUSION OF THIS PAPER
4.2 FURTHER RESEARCH PROPOSED
5. REFERENCES
1. Introduction
Freedman and Jin (2007) wrote the first academic paper to look into the business of P2P lending. They asked whether P2P lending would reshape the future of the financial industry or prove to be a fad that would wane over time. Although more than 6 years have passed since that paper, it is still too early to answer that question. What we do see in the market is the emergence of more P2P lending platforms globally and the IPO of Lending Club in December 2014. In addition, the attitude of traditional banks toward this infant industry is evolving. For instance, in early 2014, an employee of Wells Fargo told the media that an internal email had been sent asking all Wells Fargo employees not to engage in any P2P lending business. By contrast, many hedge funds and regional banks are purchasing personal loan products from P2P lending platforms because of their stable and attractive returns. More traditional financial institutions have also opened their own P2P platforms to catch up with the trend.
1.1 Definition of P2P Lending
P2P stands for Peer-to-Peer or Person-to-Person. In P2P lending, platforms act as intermediaries that match lenders with borrowers and handle the money transfers. P2P lending was first introduced by Zopa in the UK in 2005. At the time of writing, Zopa has originated 713 million GBP in loans and is one of the biggest platforms in the world. The emergence of P2P lending is also a result of applying Web 2.0 to the financial industry. By avoiding the overhead cost and infrastructure of traditional banks, P2P lending platforms can offer lower interest rates to borrowers and accumulate huge traffic within a short period (Dhand et al., 2008).
1.2 How P2P Lending Works (Lending Club, Prosper)
Borrowers apply for personal loans for various reasons. The main purpose of personal loans on Lending Club and Prosper is credit consolidation. A borrower applies for a loan by providing private information such as loan amount, term, credit rating score, debt-to-income ratio, monthly income, occupation and loan purpose. The platform then assesses the information and sets a fixed interest rate for the loan. After the borrower agrees to the interest rate, the loan is listed on the platform, where investors can browse the loan information and decide whether to invest and how much.
Among the 73 papers on P2P lending between 2008 and 2015, 20 discussed how to increase the likelihood of loans being successfully funded and what the key determinants are. Compared with unverified variables, verified variables play a much more significant role in the decision to invest in a loan (Gregor et al., 2010). Also, borrowers who are willing to disclose more information normally pay lower interest rates (Böhme et al., 2010). Social ties increase the chances of having the loan fully funded (Sergio, 2009; Greiner & Wang, 2009; Herrero-Lopez, 2009; Hildebrand & Rocholl, 2010; Lin, 2009), reduce the ex post interest charged on the loan, and decrease the default risk associated with the loan (Lin et al., 2009; Zhensheng, 2014). Furthermore, some research focuses on how borrowers' demographic information, such as appearance and gender, contributes to loan funding. Research shows that appearance does influence lenders' decisions to fund a loan (Jefferson et al., 2012). Female borrowers are less likely than male borrowers to get loans funded.
Based on all the information provided by the borrower, investors then need to determine
whether to lend and how much to lend. The objective of lending money on P2P platforms is
to gain high return and mitigate default risk. Investors on P2P lending platforms are inclined
to invest in loans with higher ex post return, which also carry higher default risk. Assessing
Eight papers built models to investigate the key determinants of default risk, so that investors can use them as a guideline to avoid adverse selection. Loans with lower credit grades and longer terms result in higher default risk (Riza et al., 2015). This finding is opposite to the result in this paper because, rather than using either completed loans or matured/finished loans, I used a combination of both. There are discrepancies between the risk premiums charged and the real default risk associated with loans on P2P lending platforms (Kumar, 2007). This conclusion is supported by evidence that the premium charged by P2P platforms is not enough to cover the potential loss of investors (Riza et al., 2015). It has also been recommended that P2P lending platforms set up a social reputation system as another way to mitigate the default risk of loans (Everett, 2010; Lin, 2009).
Platforms charge borrowers a loan origination fee once the loan is successfully funded. Investors are also charged a service fee for the management of installment payments from borrowers. A handful of papers focused on building the internal information systems of P2P platforms. For instance, Collier (2010) informed practice and theory on developing a community reputation that can reduce information asymmetry on Prosper and mitigate adverse selection. Also, as intermediaries in the financial market, platforms are regulated by both the SEC and the CFPB. Four papers examined the current regulations on P2P lending and drew implications for the further development of regulation specific to P2P lending. A multi-agency regulatory approach to P2P lending should be implemented that imitates the approach applied to regulate traditional lending (Eric et al., 2012).
Borrowers need to make monthly installment payments until the loans reach maturity. If desired, they can also choose to repay all principal ahead of the loan's maturity by paying a service fee. Platforms also provide a trading system for investors who want to sell the loans they hold at a certain discount. This trading system, like an open market, helps platforms provide more flexibility to investors.
However, some loans default in the early stages of installment payments. This causes a huge loss for investors as a whole. Investors are inclined not to hire an agency to collect the net principal loss because of the small amounts invested (Freedman & Jin, 2008). Further research into after-default management in P2P lending is urgently needed because it can help mitigate the net principal loss of investors and improve the risk-adjusted return of platforms as a whole.
2. Market Review of P2P Lending
2.1 Market Size
The potential market size of P2P lending can be measured in both micro and macro ways. In the micro view, the market for P2P lending is mainly unsecured loans, including unsecured personal loans and lines of credit. The total amount of consumer credit in the U.S. as of October 2014 was 3.283 trillion USD, according to the Federal Reserve G.19 release. Per the Federal Reserve E.2 release, the total amount of outstanding business loans ranging from $10,000 to $99,000 is 3.4 billion USD. Summing these two components gives a potential market size for P2P lending of 3.286 trillion USD in the U.S. market alone. Currently, Prosper contributes 2 billion USD in loans and Lending Club contributes 6 billion USD.
In the macro view, we can expand the market to mid-size business loans, since Lending Club also provides business loans of up to 300K USD. The total amount of business loans ranging from 100K to 999K USD is 12 billion USD (Donghon, 2014). Conservatively, we can add another 2.4 billion USD to the potential P2P lending market. This results in a market with a total size of about 3.289 trillion USD. Investors on P2P lending platforms are poised to take between 25 percent and 30 percent of the business that traditional banks are doing. The overall market of P2P lending would then grow to about $1 trillion by 2025 (Cromwell, 2015).
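The market-size arithmetic in this section can be reproduced with a short sketch. The three inputs are the figures quoted in the text (G.19 consumer credit, E.2 small business loans, and the conservative 2.4 billion USD share of mid-size business loans); the variable names are illustrative only and are not taken from the thesis.

```python
# Potential U.S. market size for P2P lending, using the figures quoted in the text.
TRILLION = 1e12
BILLION = 1e9

consumer_credit = 3.283 * TRILLION       # Federal Reserve G.19, Oct 2014
small_business_loans = 3.4 * BILLION     # business loans of $10,000 to $99,000
mid_size_business_share = 2.4 * BILLION  # conservative share of the 12 billion USD in 100K-999K loans

micro_market = consumer_credit + small_business_loans
macro_market = micro_market + mid_size_business_share

print(f"micro view: {micro_market / TRILLION:.4f} trillion USD")
print(f"macro view: {macro_market / TRILLION:.4f} trillion USD")
```

Because consumer credit dominates the sum, adding the business-loan components changes the total only in the third decimal place of the trillion-dollar figure.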
2.2 Key Players and Respective Marketplace
Rank  Lending Site   Year Founded  Country  Loan Volume ($ billion)
1     Lending Club   2007          USA      6
2     CreditEase     2006          China    3.2
3     Upstart        2012          USA      3
4     Prosper        2006          USA      2
5     Zopa           2005          UK       0.8
Lending Club. Lending Club, which was founded in 2007, has paid investors $590 million in interest returns. Per the statistics on Lending Club's website, as of September 30, 2014, 83.17% of Lending Club borrowers reported that they use loans from Lending Club to refinance existing loans or pay off their credit cards. The breakdown of the main purposes of Lending Club loans is shown below.
[Figure: breakdown of the main purposes of Lending Club loans]
Prosper. Prosper, founded by Chris Larsen and John Witchel on February 5, 2006, was the first P2P lending platform in the U.S. It remains unlisted and is financially backed by several big names in venture capital. To date, Prosper has more than 2 million members and has originated over 2 billion USD in loans.
Upstart. Upstart was founded by ex-Googlers in 2012 in the U.S. and has originated more than $3 billion in loans with an annual growth rate of 265%. The major difference between Upstart and other platforms is that, when assessing the credit quality of borrowers, Upstart starts with the same information but further includes academic variables to arrive at a more statistical risk assessment.
CreditEase. As reported by Peter Renton in 2014, CreditEase is the largest P2P lending platform in China and has originated more than $3.2 billion USD in loans to over 500,000 borrowers. The company was founded in 2006 and now operates in over 150 cities in China.
Zopa. Zopa is the oldest peer-to-peer lending company in the world. The company was founded in 2005 in the UK. It has lent $1 billion USD and has helped both borrowers and investors get better rates.
2.3 Market Outlook of P2P Lending
The growth of P2P lending has exceeded the public's expectations in recent years. Gartner (2010) predicted that P2P lending would increase by 66% to a total size of 5 billion USD by the end of 2013. Looking at the statistics of the biggest platforms, I found that Lending Club sustained an annual growth rate of over 150% through 2014. Prosper.com has also grown exponentially since its establishment: by the end of 2013 it had originated over 300 million USD in loans, and by the end of 2014 this number had risen to over 1.5 billion USD.
Although it is extremely difficult to estimate the exact growth rate of P2P lending, several determinants indicate its future trend from a macro perspective.
1) Geographic expansion. P2P lending is not yet fully authorized in all states of the U.S. because of the complexity of state autonomy. Even in China, the acceptance of P2P lending varies among regions. Further geographic expansion is expected in the next few years.
2) More comprehensive legislation. The main reason that certain public authorities or groups are still skeptical about P2P lending is that it is still in its infancy and is less regulated than traditional banks. Regulations specific to P2P lending are urgently needed in the market.
3) Challenges from traditional banking. Although P2P lending has a large cost advantage over traditional banks, with the recovery of the U.S. economy the government is considering loosening the requirements for loan borrowers. This will help traditional banks regain borrowers who were previously not eligible for a loan. In China, many financial institutions have also introduced their own P2P platforms to gain a piece of the pie.
4) Information asymmetry. Information asymmetry might lead investors to adverse selection (Akerlof, 1970) and moral hazard (Stiglitz and Weiss, 1981). Platforms are making various efforts to mitigate information asymmetry.
5) Bottom line of the economy and employment. The performance of both the economy and employment will affect the further development of P2P lending. According to the statistics from Prosper and Lending Club, most borrowers' purpose is credit consolidation. A stronger economy and improved wages and employment rates mean that people's financial condition will be better and the need for credit consolidation will decline accordingly.
6) Institutional investors. P2P lending can provide a higher ROI than many other investments in the financial market. Some institutional investors purchase loan packages from platforms to gain stable cash flow and return.
A simple comparison among different financial investments is listed below. In 2013, P2P lending generated a much lower return than the NYSE and the Dow Jones Industry Composite, but it outperformed both in 2014. However, for the P2P lending platforms I am using the official investment return rate, and the true risk-adjusted investment return might differ from this data. Another point worth noting is that the superior return of the stock market in 2013 was due to the recovery from an economic and financial downturn. An ROI of around 10% is already very impressive in the financial investment sector. As reported by Bloomberg, the average return of hedge funds was 7.4% in 2013.
Year   Lending Club   Prosper   3-yr Treasury   NYSE     Dow Jones
2014   10.50%         9.79%     1.10%           4.22%    7.52%
2013   8.75%          9.86%     0.78%           23.18%   26.50%
Till the end of 2014, the total amount of loans originated through P2P lending in China has
reached $40 billion with a default rate of 17.46%. 1.16 million borrowers got their loans
with numbers of 2013 respectively. There are 1575 P2P lending platforms in China, and 275
went bankrupt in 2014, implying that roughly one out of six platforms was not sound. The average loan amount and the average amount funded per individual investor are $35,000 USD and $64,000 USD, respectively. These statistics come from Wangdaizhijia.com in China.
2.4 Business Models of P2P Lending
This section will introduce the business models used by major P2P lending platforms in the
U.S and China and address the major differences between the two markets.
In the U.S. market, the business models of P2P lending platforms are quite similar to each other. Borrowers post their loans on platforms, and investors browse and choose loans to invest in. The P2P lending platform acts as an intermediary and is responsible for risk rating, determining the interest rate, document verification and interest payment management. However, Prosper and Lending Club still differ in several ways, as listed below.
1) Loan type. Prosper only originates personal loans ($2,000-$35,000 USD), while Lending Club originates personal loans ranging from $1,000 to $35,000 USD as well as business loans up to $300,000 USD. Prosper and Lending Club also offer different maturities: both provide 3-year and 5-year loans, and Lending Club provides a 1-year loan as well.
2) Interest rate. P2P platforms determine the interest rate by considering information that reflects borrowers' credit quality. Both Prosper and Lending Club stipulate cap and floor interest rates for loans falling into each credit rating/grade. However, the interest rate in the same credit category differs between Prosper and Lending Club because of their different credit rating logic.
3) Credit scoring. Prosper and Lending Club each provide a proprietary credit score as a major indicator of loan risk. Both offer 7 rating categories: Prosper from HR (worst) to AA (best) and Lending Club from G (worst) to A (best).
4) Origination fee. Platforms earn money by charging fees to borrowers. The cap and floor fee rates charged by Prosper and Lending Club are the same, whereas different rates are charged to borrowers in different risk categories. A simple comparison is listed below, including credit rating, respective interest rate and origination fee.
Prosper                                     Lending Club
Rating  Interest Rate    Origination Fee    Rating  Interest Rate    Origination Fee
AA      6.05%-7.96%      1%-2%              A       5.49%-8.19%      ?-3%
A       8.19%-11.33%     4%                 B       8.67%-11.99%     4%-5%
B       11.56%-14.06%    5%                 C       12.39%-14.99%    5%
C       14.59%-18.27%    5%                 D       15.59%-17.86%    5%
D       19%-22.68%       5%                 E       18.54%-21.99%    5%
E       23.44%-27.04%    5%                 F       22.99%-25.57%    5%
HR      27.75%-31.25%    5%                 G       25.8%-26.06%     5%
5) Affiliate & Referral Programs. Prosper introduced an affiliate program to attract more borrowers and lenders through referrers; it pays $100-150 USD for borrower leads and $50 for lender leads. Lending Club also introduced an affiliate and referral program, but detailed bonuses are not provided on its website.
6) Notes trading. Both Prosper and Lending Club provide a notes trading platform, where investors can trade the notes they hold with each other. Folio is a broker-dealer platform which only charges sellers 1%.
7) Early repayment. Borrowers can choose to pay the remaining repayment without paying
any penalty, in order to refrain from paying monthly interest in the future.
based on the information provided by the borrowers. However, in its early years, Prosper ran an interest rate auction in which investors could bid the lowest interest rate they would accept in order to compete to fund the most popular loans. This is why some loans were originated with a lower interest rate. Prosper stopped the interest rate auction in 2011 and implemented fixed interest rates like Lending Club.
In China's market, P2P lending platforms basically follow the same model as those in the U.S., acting as an intermediary between borrowers and lenders. However, because of differences in the economic and legal environment, as well as in customer behavior, P2P lending in China has evolved unique features. We use Hongling Capital and Creditease as representatives, since they are two of the earliest P2P platforms that originated in China.
1) Loan type. Hongling Capital offers personal and business loans with amounts between $500 and $1,600,000 USD and maturities between 3 months and 12 months. Creditease offers personal loans with amounts between $1,600 USD and $1,000,000 USD and maturities between 1 year and 4 years. P2P lending platforms in China's market are clearly more aggressive and also bear higher default risk.
2) Interest rate. Rather than determining the interest rate based on credit score, maturity and amount as P2P platforms in the U.S. do, China's platforms determine interest rates simply based on loan type or maturity, because there is no credit agency that can provide a comprehensive credit report for individuals (China's PBOC only authorized certificates for credit agencies in January 2015). Hongling Capital sets interest rates between 8% and 18%, and Creditease between 10% and 12.5%.
3) Credit scoring. The only credit report that a borrower can submit is the one provided by the PBOC, which includes the history of credit card usage and loan repayment. Unlike U.S. platforms, platforms in China don't rate borrowers into different credit categories. It is common practice for platforms to award credits to borrowers and investors when they successfully make a monthly payment or an investment. For instance, Hongling Capital sorts customers into 5 categories from V1 (lowest) to V5 (highest). Investors on Hongling Capital can refer to these categories as a risk indicator.
4) Origination fee. Creditease charges investors 10% of interest earnings and borrowers 10% as a service fee. Rates and fees on Hongling Capital are more complex. Hongling Capital charges investors from 0% to 10% as fees, depending on their category, which ranges from V1 to V5. For instance, investors in V1 need to pay 10% of interest earnings as a service fee, and those in V5 don't need to pay any service fee. For borrowers, Hongling Capital also charges various percentages of the loan as a service fee, depending on the loan type. The overall range is from 3% to 14.6%.
5) Affiliate & Referral Programs. Creditease doesn't pay a referral bonus, while Hongling Capital pays $6 USD if the referred customer registers as a normal member and $12 USD if the customer registers as a VIP.
6) Notes trading. Platforms in China also provide notes trading services to investors.
7) Early repayment. On Creditease, if borrowers want to pay the remaining loan earlier,
besides the interest for the current month, remaining loan and service fee, they need to
pay a 0.5% of the remaining loan as a penalty to the platform. Similarly, borrowers on
the remaining loan earlier.
8) Principal guarantee. The biggest difference between the U.S. and China in P2P lending is that many platforms in China bring in a 3rd party company to guarantee the safety of investors' money in case any fraudulent funding happens. This is a remedy for the lack of credit scores available from borrowers and platforms, and it improves the confidence of investors. However, a 3rd party guarantee is not a panacea for P2P lending in China. A certificate for a guarantee company only costs $1 million USD, and there have been cases where owners disappeared with the money, leaving investors to lose everything.
3. Data Analysis and Modeling
3.1 Introduction
This section addresses the following questions. 1) What is the distribution of PV, rate of bad loans and interest across credit categories? 2) Does the risk-return profile improve from year to year, especially when platforms change their policies? 3) Are there behavioral differences among borrowers and investors between Prosper and Lending Club? 4) How much do the determinant variables contribute to the performance of loans? 5) How well can models built with different data mining methodologies determine the probability of default? 6) As found by Riza, Yanbin, Benjamas and Min in 2014, the higher interest rate set by Prosper and Lending Club for riskier loans is not enough to compensate investors for the potential loss they are exposed to. This section will use an FCFF methodology to test this conclusion, taking into account the time value of future cash flows.
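To make the idea of discounting a loan's future cash flows concrete, the following is a minimal sketch of a generic net present value calculation for a single loan, with and without an early default. It is not the FCFF procedure used later in the thesis, whose details are not given in this section. The function names, the monthly discounting convention, the example installment of 332.14 and the 5% discount rate are all assumptions made for illustration.

```python
def npv(cash_flows, annual_discount_rate):
    """Present value of monthly cash flows; cash_flows[0] occurs at month 0."""
    monthly_rate = annual_discount_rate / 12.0
    return sum(cf / (1.0 + monthly_rate) ** t for t, cf in enumerate(cash_flows))


def loan_cash_flows(principal, installment, term_months, default_month=None, recovery=0.0):
    """Investor cash flows: outlay at month 0, then installments until maturity or default.

    If default_month is given, installments stop before that month and only
    the (assumed) recovery amount is received in the default month.
    """
    flows = [-principal]
    for month in range(1, term_months + 1):
        if default_month is not None and month >= default_month:
            flows.append(recovery)
            break
        flows.append(installment)
    return flows


# Hypothetical 36-month loan of 10,000 with a monthly installment of 332.14.
repaid = loan_cash_flows(10_000, 332.14, 36)
defaulted = loan_cash_flows(10_000, 332.14, 36, default_month=13)

print(round(npv(repaid, 0.05), 2))     # fully repaid loan
print(round(npv(defaulted, 0.05), 2))  # loan defaulting after 12 payments
```

Comparing the two values shows how a single early default can wipe out far more than the interest earned on a performing loan, which is why the timing of cash flows matters when judging whether the interest rate compensates for risk.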
3.2 Key Variables
3.2.1 Prosper
Variable name                 Type     Definition
Credit Rating                 Numeric  Proprietary credit rating by P2P lending platforms
Loan Status                   Dummy    Whether the loan is active, completed or defaulted
Borrower Rate                 Numeric  Interest rate the borrower is willing to pay
Borrower APR                  Numeric  Actual rate the borrower needs to pay, considering service cost
Lender Yield                  Numeric  Actual rate lenders receive, considering service cost
Listing Category              Dummy    The purpose of the loan
Employment Duration           Numeric  The length of employment up to the creation of the listing
Is Borrower home owner        Numeric  Whether the borrower owns real estate
Current Credit Line           Numeric  The number of credit lines the borrower owns
OpenRevolvingMonthlyPayment   Numeric  The monthly payment on revolving accounts
RevolvingCreditBalance        Numeric  The current credit balance of revolving accounts
BankcardUtilization           Numeric  The percentage utilization of revolving credit balance
AvailableBankcardCredit       Numeric  The total amount of bank card credit available at the creation of the loan
TradesNeverDelinquent         Numeric  The percentage of trades that have never been delinquent
DebtToIncomeRatio             Numeric  The ratio of debt to income, in percent
StatedMonthlyIncome           Numeric  Monthly income stated by the borrower
LoanOriginalAmount            Numeric  The original amount of the loan
Investors                     Numeric  The number of investors who funded the loan
Terms                         Numeric  The term length of the loan
Both Prosper and Lending Club define "bad loans" as loans that are 60+ days past due within
the first twelve months from the date of loan origination.
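The "bad loan" definition above can be expressed as a small predicate. This is a sketch under stated assumptions: the function names are hypothetical, the input is assumed to be the date on which a loan first became 60 or more days past due (or None if it never did), and the twelve-month window is treated as inclusive of its end date. Neither platform's exact boundary convention is given in the text.

```python
from datetime import date


def twelve_months_after(d):
    """Same calendar day one year later; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def is_bad_loan(origination, first_60dpd):
    """True if the loan first became 60+ days past due within twelve months of origination.

    first_60dpd is None when the loan never reached 60 days past due.
    """
    if first_60dpd is None:
        return False
    return origination <= first_60dpd <= twelve_months_after(origination)


# Hypothetical examples, not drawn from the thesis dataset.
print(is_bad_loan(date(2013, 1, 15), date(2013, 9, 1)))
print(is_bad_loan(date(2013, 1, 15), date(2014, 3, 1)))
```

A loan that becomes delinquent after the first twelve months is not counted as "bad" under this definition, even if it later defaults.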
3.2.2 Lending Club
Variables        Type     Definition
grade            Dummy    The proprietary credit rating of Lending Club
loan_status      Dummy    The current status of the loan
int_rate         Numeric  The interest rate the borrower needs to pay
purpose          Dummy    The purpose of the loan
emp_length       Numeric  The length of employment of the borrower
home_ownership   Dummy    Whether the borrower owns or rents a home
open_acc         Numeric  The number of open credit lines of the borrower
revol_util       Numeric  The current ratio of credit balance utilization
dti              Numeric  The debt-to-income ratio
annual_inc       Numeric  The amount of annual income
loan_amnt        Numeric  The amount of the loan
installment      Numeric  The amount of the monthly payment
term             Numeric  The term length of the loan
3.3 Distribution of Dataset
3.3.1 Prosper
When depicting the distribution of loan characteristics, we exclude current listings and
cancelled listings that were never funded and completed. Records with the proprietary credit
rating "NC" are also excluded because their information is incomplete; those loans were
originated in early 2006 and 2007, when Prosper was in its infancy. There are 113 records that
are missing a proprietary credit rating. We assume that these records do not affect the
validity of our analysis because there are so few of them.
Credit     Successful  Number of  Amount of Loans                              Default
Category   Rate        Loans      Total          Average   Median   STDEV      Rate
AA         30%         6,487      61,402,940     9,466     12,000   6,664      11%
A          23%         10,479     101,490,254    9,685     11,000   6,664      16%
B          25%         12,023     117,411,802    9,764     12,000   8,345      22%
C          29%         14,892     125,436,437    8,423     10,000   7,044      28%
D          47%         15,259     96,539,254     6,326     7,500    5,853      31%
E          49%         10,286     43,717,649     4,250     4,000    2,629      37%
HR         76%         8,846      27,031,067     3,056     3,500    1,323      46%

Credit Category   AA      A       B       C       D       E       HR
Term: 1 year      3%      3%      3%      2%      2%      3%      0%
Term: 3 years     93%     86%     79%     76%     83%     87%     100%
Term: 5 years     4%      12%     19%     22%     15%     10%     0%
Interest rate     8.9%    11.4%   15.4%   18.9%   23.6%   28.3%   29.3%
$/investor        53      73      87      104     91      103     89
Credit Score      791     738     712     682     667     640     621
There are several notable features of the Prosper dataset distribution. 1) Surprisingly, the
success rate of a listing being funded as a loan increases as credit worsens. This
might be caused by the higher interest rates paid on loans with worse credit ratings. 2) The
majority of loans are rated C and D, consistent with our expectation that most loans on Prosper
(and on most P2P lending platforms) come from borrowers with poor credit records. 3) From the
best credit rating to the worst, the average and median loan amounts tend to decline,
mainly because of the limits placed by P2P platforms. 4) The default rate
climbs as credit gets worse: the default rate of AA loans is 11%, versus 46% for HR loans.
5) As we expected, the interest rate increases as credit quality declines. An assessment will be
done in the following section to test whether the interest rate advised by Prosper is enough to
cover the potential loss. 6) For loans with poor credit ratings, investors tend to
place more money on each investment.
[Figure: Number of Loans and Average Amount by Prosper Rating (AA to HR)]
[Figure: Borrower Rate vs. Prosper Rating]
Percentage of Total Loans by amount
Year   AA      A       B       C       D       E       HR      Default Rate
2006   7.5%    7.7%    9.3%    11.2%   9.8%    8.9%    45.6%   39.2%
2007   15.4%   16.8%   19.9%   21.3%   15.3%   6.2%    5.2%    39.5%
2008   23.3%   19.5%   23.2%   17.4%   11.2%   3.0%    2.5%    33.0%
2009   21.6%   24.9%   6.9%    17.9%   13.6%   5.2%    9.8%    15.2%
2010   16.1%   20.9%   14.7%   9.0%    19.5%   8.3%    11.5%   16.7%
2011   7.3%    17.5%   16.9%   9.1%    27.1%   16.7%   5.5%    22.6%
2012   7.3%    17.7%   18.1%   22.8%   18.9%   5.4%    9.9%    31.2%
2013   4.6%    16.5%   24.5%   31.2%   14.9%   6.8%    1.5%    23.6%
2014   6.8%    18.3%   24.0%   29.4%   1.4%    6.5%    13.6%   24.5%
7) Year by year, more investors switch from A or AA loans to riskier loans, especially to
loans rated B and C. This trend might be caused by investors seeking higher interest rates as
well as by the improved default rate within each credit category. 8) Both the overall default
rate and the default rate for each credit category have declined over time. However, investors are
becoming more risk-averse. This improvement can be explained by Prosper's better
risk screening and verification.
(When calculating the default rate, loans originated after Q2 2014 are excluded from the
dataset, because those loans could not yet be more than 60 days past due, which is the point
at which they are considered in default.)
Default rate YoY
Year   AA      A       B       C       D       E       HR      Overall
2006   8.8%    16.7%   24.7%   36.2%   35.8%   48.8%   64.8%   39.2%
2007   14.3%   25.8%   33.3%   41.1%   42.8%   53.2%   62.2%   39.5%
2008   18.3%   25.6%   32.9%   33.4%   37.4%   43.6%   52.5%   33.0%
2009   6.0%    9.3%    16.8%   15.4%   22.4%   22.3%   23.7%   15.2%
2010   3.9%    9.8%    11.2%   15.3%   21.4%   24.9%   25.4%   16.7%
2011   2.9%    9.4%    15.5%   14.9%   24.8%   32.1%   31.0%   22.6%
2012   8.1%    9.3%    14.1%   20.1%   23.9%   25.9%   28.5%   31.2%
2013   4.1%    2.8%    4.6%    7.5%    10.8%   13.1%   13.6%   23.6%
2014   8.7%    0.4%    0.7%    1.2%    1.6%    2.5%    1.7%    24.5%
3.3.2 Lending Club
Credit     Successful  Number of  Amount of Loans                     Default
Category   Rate        Loans      Total          Average   STDEV      Rate
A          32.6%       20,076     213,245,525    10,622    6,586      8.5%
B          28.8%       33,882     402,115,200    11,868    6,861      17.2%
C          26.7%       27,641     352,094,900    12,738    7,769      24.2%
D          28.3%       17,980     246,222,500    13,694    8,426      30.8%
E          29.1%       8,484      148,964,150    17,558    9,505      36.4%
F          33.6%       3,772      73,021,450     19,359    9,225      43.5%
G          33.6%       916        20,171,950     22,022    8,417      43.2%
1) There is no significant difference in the success rate of listings being funded across
credit categories on Lending Club. 2) Loans are concentrated in the better credit grades,
A to D, in terms of both number of loans and total amount. 3) Unlike on Prosper,
lower-credit loans on LC tend to be larger than higher-credit loans. This
indicates that LC considers the loan amount as a factor when
rating loans. 4) There is no significant shift in investors' risk aversion from year to year on
Lending Club. 5) The default rate of LC is much lower than that of Prosper in each year and in
each category, but this doesn't mean that the overall risk return that Lending Club
offers is better, which is assessed in the following sections. 6) Interest rates for loans of
the same credit rank on LC and Prosper are similar. 7) The default rate improved from 2007 to
2010. I do not take years after 2011 into consideration, since most of those loans are still in
the regular payment process, whereas most loans originated in earlier years are
either fully paid or in default.
Percentage of Loans by credit grade-LC
Year   A       B       C       D       E       F       G
2007   22.7%   24.3%   29.9%   14.7%   5.6%    2.8%    0.0%
2008   18.9%   32.5%   28.0%   14.2%   4.8%    1.3%    0.3%
2009   25.0%   28.9%   25.3%   13.9%   5.0%    1.4%    0.5%
2010   24.3%   30.7%   21.4%   14.0%   6.9%    2.1%    0.8%
2011   26.5%   30.2%   18.1%   12.9%   8.0%    3.3%    0.9%
2012   20.4%   34.7%   22.3%   13.7%   6.0%    2.5%    0.5%
2013   13.1%   32.7%   28.3%   15.3%   6.7%    3.3%    0.6%
2014   14.2%   26.6%   28.1%   18.9%   8.7%    2.8%    0.8%
Default Rate YoY-LC
Year   A       B       C       D       E       F       G       Overall
2007   1.8%    13.1%   18.7%   40.5%   35.7%   28.6%   0.0%    17.9%
2008   5.8%    14.6%   17.8%   24.3%   16.0%   47.6%   50.0%   15.8%
2009   6.7%    11.4%   14.8%   17.4%   21.6%   17.2%   34.8%   12.6%
2010   4.7%    11.1%   14.5%   18.6%   22.5%   30.0%   28.4%   12.6%
2011   6.6%    11.5%   16.8%   20.9%   23.8%   28.1%   31.5%   14.1%
2012   6.3%    11.0%   15.1%   19.1%   23.4%   25.6%   30.7%   13.2%
2013   1.7%    4.4%    7.4%    10.8%   12.8%   17.0%   16.6%   6.9%
2014   0.5%    1.1%    1.8%    2.8%    3.8%    5.8%    5.8%    1.9%
[Figure: Number of Loans by Risk Category (number of loans and average amount, grades A to G)]
[Figure: Interest Rate Range by Risk Category (interest rate vs. credit grade)]
3.4 Model Building and Interpretation-Lending Club
This section contains five steps. First, prune the datasets of Lending Club and Prosper for
model building. Second, select variables and build the logistic model to predict the default
probability. Third, interpret the significance of each variable and compare the estimates
with expectations. Fourth, choose alternative data models to predict the loan status as
well as the net profit/loss, and compare the results with the conclusions drawn from the
logistic regression. Last, as a robustness check, I test the linearity assumption between the
predicting variables and the target prediction, and explore the nonlinear relationship between
the target prediction and each individual predicting variable.
3.4.1 Data Preparation
In the data preparation, I tried to incorporate only parameters that can be at least partly verified. Some variables, such as loan purpose, can be fabricated by borrowers. Even though we can build a well-performing model using those subjective parameters, the reliability of such a model is questionable.
1) Homeownership. The original options for this variable include "rent", "own", "Mortgage",
"None" and "Other". We create a dummy variable that is 1 for "own" or "Mortgage" and
0 for the rest.
2) There are over 300,000 rows of data; all current listings are excluded from the dataset since we're aiming to detect any indicators of risks from an investor's perspective.
3) Loan status is the target to predict. A loan status of "0" represents loans
that have already finished all payments or that are still in the payment process. "1" represents
defaulted loans, including loans that are charged off, in default, or delinquent for 31 days or
more (there are only two categories for delinquent loans: up to 30 days, or 31 days and more).
Initially, there are 87880 "completed" loans listed on Lending Club, while my
interest is in loans that have either finished all payments or already declared default.
Keeping that in mind, I further split completed loans into two categories: paid and
in-process. Within completed loans, only 5509 loans have already finished all
payments. The remaining 82371 completed loans are still in the payment process. However,
as shown in the below graph, 50% of bad loans declared default before Ih month. Or
75% of bad loans declared default before the 17th month. This implies that among those
82371 loans that have not finished all payments, there is a good chance that they will
eventually pay off all installments. Therefore, in order to provide a reliable data model
and mitigate bias toward completed loans, I treat completed loans that have paid at least
17 installments as finished loans, and assume that they will not default in the future. By
doing this, I get 38555 good loans (finished all payments) and 24871 bad loans (defaulted or
charged off).
[Figure: No. of Months Paid vs. loan_status]
4) Income verified. "0" means that the income is not verified, while "1" means that the income
is verified.
5) Independent variables involved in the regression: Loan amount, term, employment length,
homeownership, annual income, if the income is verified, debt to income ratio, FICO
credit score, open account, revolving credit balance, the utilization ratio of revolving
credit balance, total account. I excluded the variable "purpose" from the model due to the
low reliability of the value that borrowers put when they applied for the loan.
6) The whole dataset is randomly partitioned into 43426 training rows and 20000 validation
rows.
7) Profit/Cost matrix. I need a cutoff value in order to classify the predictions into 0 or 1. To
do that, I first need to compute the profit/cost matrix for Lending Club. There are 63426
loans in the dataset, including 38555 good loans and 24871 bad loans. Good loans
generate $108,339,408 on a total original amount of $450,364,975, representing an ROI
of 24.1%. Bad loans cost investors a total loss of $219,172,141 on a total original
amount of $350,771,625, representing a negative ROI of 62.5%. Finished loans as a whole
cause a loss of $110,832,732 on a total amount of $801,136,600, representing a negative
ROI of 13.8%. You might be surprised that the real ROI that Lending Club offers to
investors is actually much lower than the one it advertises on the website. The profit/cost
matrix should be as below.
Profit Matrix
                      Predicted
Actual Loan Status    0       1
0                     1       -1
1                     -2.6    0
3.4.2 Model Building
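Before turning to variable selection, the labeling rules and the ROI arithmetic of Section 3.4.1 can be summarized in a minimal Python sketch. The dollar totals, the 17-installment threshold and the homeownership rule are taken from the text; the status labels ("paid", "charged_off" and so on), the function names and the idea that the -2.6 entry is the ratio of the two ROIs are assumptions made for illustration.

```python
from typing import Optional

# Totals quoted in item 7) of the data preparation, in USD.
GOOD_PROFIT = 108_339_408      # net profit earned on good loans
GOOD_PRINCIPAL = 450_364_975   # original amount of good loans
BAD_LOSS = 219_172_141         # net loss suffered on bad loans
BAD_PRINCIPAL = 350_771_625    # original amount of bad loans

# Hypothetical status labels; the raw Lending Club strings may differ.
BAD_STATUSES = {"charged_off", "default", "late_31_plus"}
MIN_INSTALLMENTS = 17          # threshold chosen in item 3)


def home_ownership_dummy(value: str) -> int:
    """Item 1): 1 for "own" or "mortgage", 0 for everything else."""
    return 1 if value.strip().lower() in {"own", "mortgage"} else 0


def label_loan(status: str, installments_paid: int) -> Optional[int]:
    """Item 3): 1 = bad loan, 0 = good loan, None = left out of the dataset."""
    if status in BAD_STATUSES:
        return 1
    if status == "paid" or installments_paid >= MIN_INSTALLMENTS:
        return 0
    return None  # still in process with too short a payment history


def roi(net_result: float, principal: float) -> float:
    """Return on investment as a fraction of the original amount."""
    return net_result / principal


roi_good = roi(GOOD_PROFIT, GOOD_PRINCIPAL)
roi_bad = roi(-BAD_LOSS, BAD_PRINCIPAL)
roi_all = roi(GOOD_PROFIT - BAD_LOSS, GOOD_PRINCIPAL + BAD_PRINCIPAL)

# Assumption: the matrix is scaled so that funding a good loan earns 1 unit,
# which makes the cost of funding a bad loan roi_bad / roi_good.
bad_loan_cost = round(roi_bad / roi_good, 1)

# Keys are (actual loan status, predicted loan status).
PROFIT_MATRIX = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): bad_loan_cost, (1, 1): 0.0}

print(f"ROI good loans: {roi_good:.1%}")
print(f"ROI bad loans:  {roi_bad:.1%}")
print(f"ROI all loans:  {roi_all:.1%}")
print(f"Cost of funding a bad loan: {bad_loan_cost}")
```

Scaling by the ROI of good loans keeps the matrix free of units, so the same matrix can be applied to any count-based confusion matrix.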
Before building the model in each step, I selected variables based on the R-Square, AIC and BIC
rules. Then I compared the performance of models using different variable combinations. 1)
R-Square-oriented stepwise selection removes open_acc from the model. 2) The
minimum AIC rule recommends further removing home_ownership from the model. 3)
The minimum BIC rule gives the same result, excluding open_acc and
home_ownership from the model. Detailed results are listed below.
Maximize Rsquare
Entered   Parameter        Sig Prob
[X]       Intercept[1]     1
[X]       loan_amnt        8.30E-70
[X]       term             3.00E-233
[X]       emp_length       5.00E-15
[X]       home_ownership   0.51441
[X]       annual_inc       1.30E-41
[X]       is_inc_v         6.81E-09
[X]       dti              3.20E-84
[X]       FICOScore        0
[ ]       open_acc         0.88003
[X]       revol_bal        3.76E-09
[X]       revol_util       4.57E-06

Minimum AIC
Entered   Parameter        Sig Prob
[X]       Intercept[1]     1
[X]       loan_amnt        8.30E-70
[X]       term             3.00E-233
[X]       emp_length       5.00E-15
[ ]       home_ownership   0.51441
[X]       annual_inc       1.30E-41
[X]       is_inc_v         6.81E-09
[X]       dti              3.20E-84
[X]       FICOScore        0
[ ]       open_acc         0.88003
[X]       revol_bal        3.76E-09
[X]       revol_util       4.57E-06

Minimum BIC
Entered   Parameter        Sig Prob
[X]       Intercept[1]     1
[X]       loan_amnt        8.30E-70
[X]       term             3.00E-233
[X]       emp_length       5.00E-15
[ ]       home_ownership   0.51441
[X]       annual_inc       1.30E-41
[X]       is_inc_v         6.81E-09
[X]       dti              3.20E-84
[X]       FICOScore        0
[ ]       open_acc         0.88003
[X]       revol_bal        3.76E-09
[X]       revol_util       4.57E-06
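For reference, the AIC and BIC criteria named above are conventionally defined as follows, where k is the number of estimated parameters, n the number of observations and \hat{L} the maximized likelihood of the model. This is the textbook form only; the text does not state which software or variant of the criteria was used, so the symbols here are assumptions for illustration.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both criteria reward fit and penalize the number of parameters; with n above roughly eight observations, the BIC penalty per parameter (ln n) exceeds the AIC penalty (2), so BIC never selects a larger model than AIC from the same nested candidates.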
Based on the result of the variable selection, I ran the logistic regression. Estimates of the
parameters under the two slightly different variable combinations are listed below. There is no
significant difference in value or sign between the two results. Besides, the RSquare-oriented
variable combination offers an RSquare of 0.2135, while the AIC/BIC-selected variable
combination gives only a slightly lower RSquare of 0.2134.
Estimate
Term             Maximize Rsquare   Minimum AIC/BIC
Intercept        -10.66162          -10.67306
loan_amnt        -0.00003           -0.00003
term             -0.03942           -0.03937
emp_length       -0.02573           -0.02533
home_ownership   0.01513            N/A
annual_inc       0.00001            0.00001
is_inc_v         -0.13985           -0.13967
dti              -0.03298           -0.03296
FICOScore        0.01900            0.01902
revol_bal        0.00001            0.00001
revol_util       0.21735            0.21590
Since the model using parameters selected by RSquare stepwise offers a slightly better result, I
computed the formula as below accordingly.
$$P(\text{Default}) = \frac{1}{1 + e^{-\left(-10.66162 + \sum_i \beta_i X_i\right)}}$$
$\beta_i$: Coefficient of parameter $i$
$X_i$: Parameters
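The formula can be evaluated directly. The sketch below is a minimal Python illustration that plugs in the RSquare-selected estimates and the 0.44 cutoff discussed in this section. The dictionary keys, the example borrower and the units of each input (term in months, revol_util as a fraction, dummies as 0/1) are assumptions for illustration and are not confirmed by the text.

```python
import math

# Estimates of the RSquare-selected model, as listed in the estimates table.
INTERCEPT = -10.66162
COEFFICIENTS = {
    "loan_amnt": -0.00003,
    "term": -0.03942,
    "emp_length": -0.02573,
    "home_ownership": 0.01513,
    "annual_inc": 0.00001,
    "is_inc_v": -0.13985,
    "dti": -0.03298,
    "FICOScore": 0.01900,
    "revol_bal": 0.00001,
    "revol_util": 0.21735,
}
CUTOFF = 0.44  # cutoff value reported for the best performance


def default_probability(loan: dict) -> float:
    """Apply the logistic formula: 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = INTERCEPT + sum(beta * loan[name] for name, beta in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(loan: dict) -> int:
    """Return 1 (predicted default) if the probability reaches the cutoff, else 0."""
    return 1 if default_probability(loan) >= CUTOFF else 0


# Hypothetical borrower; the units of each field are assumed, not documented.
example_loan = {
    "loan_amnt": 10_000, "term": 36, "emp_length": 5, "home_ownership": 1,
    "annual_inc": 60_000, "is_inc_v": 1, "dti": 15, "FICOScore": 700,
    "revol_bal": 10_000, "revol_util": 0.5,
}

p = default_probability(example_loan)
print(f"P(default) = {p:.3f}, predicted loan_status = {classify(example_loan)}")
```

Keeping the coefficients in a dictionary makes it easy to swap in the AIC/BIC estimates, which differ only by dropping home_ownership.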
The confusion matrices generated from the two combinations are listed below. Both models achieve
their best performance at a cutoff value of 0.44, meaning that if the default probability
is equal to or greater than 0.44, the loan is classified as a default, and otherwise as a good
loan. The overall accuracy is 69.1% for the RSquare combination and 68.8%
for the AIC/BIC combination. The former does a better job of identifying good loans, while the
latter is more accurate in identifying bad ones. Both combinations improve the overall ROI of
Lending Club, to negative 1.2% with the AIC/BIC combination and to negative 1.7% with the RSquare
combination. Even though the risk return after enhancement is still negative, progress has
been made by reducing the loss by roughly 12 percentage points. Not surprisingly, there is a
price to pay for improving the overall risk-adjusted return to investors. Applying this model
means that the overall volume of loan origination will decline by 37.8%, although the
improvement in risk-adjusted return can help build the creditworthiness of P2P platforms and
attract more investors, and thus borrowers,
in the long run.
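The accuracy and volume figures quoted above can be recomputed from the validation counts of the two confusion matrices in this section. The following is a minimal Python sketch; the function and field names are hypothetical, and the counts are copied from the RSquare and AIC/BIC matrices.

```python
def summarize(good_kept: int, good_rejected: int,
              bad_kept: int, bad_rejected: int) -> dict:
    """Summarize a 2x2 confusion matrix where 0 = good loan and 1 = default.

    good_kept:     actual 0, predicted 0
    good_rejected: actual 0, predicted 1
    bad_kept:      actual 1, predicted 0
    bad_rejected:  actual 1, predicted 1
    """
    total = good_kept + good_rejected + bad_kept + bad_rejected
    return {
        "accuracy": (good_kept + bad_rejected) / total,
        "good_loan_accuracy": good_kept / (good_kept + good_rejected),
        "bad_loan_accuracy": bad_rejected / (bad_kept + bad_rejected),
        # Share of loans predicted as default, i.e. no longer originated.
        "volume_decline": (good_rejected + bad_rejected) / total,
    }


rsquare = summarize(9180, 2923, 3256, 4621)
aic_bic = summarize(8959, 3144, 3099, 4778)

for name, result in (("RSquare", rsquare), ("AIC/BIC", aic_bic)):
    print(name, {key: f"{value:.1%}" for key, value in result.items()})
```

The volume decline is simply the share of validation loans that the model predicts as default, which is why better screening necessarily reduces origination volume.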
Confusion Matrix-RSquare
                      Predicted
Actual Loan Status    0       1
0                     9180    2923
1                     3256    4621

Confusion Matrix-AIC/BIC
                      Predicted
Actual Loan Status    0       1
0                     8959    3144
1                     3099    4778
3.4.3 Model interpretation
In this section, I will analyze the parameter estimates obtained in model building and compare
them with expectations. When a parameter is said to have a positive impact on the default rate,
it means that the higher the value of the parameter, the higher the default probability of the
loan, and vice versa.
Several papers have also tried to interpret the impact of parameters. FICOScore has a negative
impact on the default rate, while the debt-to-income ratio and credit line utilization have a
positive impact (Riza, Yanbin, Benjamas and Min, 2015). However, when looking at the results
from the model that only includes finished loans, some of the variable estimates are not
intuitive. This section starts by interpreting the variables that run counter to our
expectation, and then goes through those that match it. 1) "loan_amnt" has a negative impact on
the default probability. Normally, a higher loan amount gives people the impression of higher
risk, but this turns out not to be the case. 2) The same holds for "term". Two term lengths are
allowed on Lending Club: 36 and 60 months. Generally speaking, holding all other features
constant, a 60-month loan does not carry a higher default risk than a 36-month loan. This might
be explained by Lending Club approving a longer-term loan only if the borrower is more
qualified. 3) "home_ownership". Owning real estate doesn't necessarily mean that a borrower is
more creditworthy. It's actually the opposite. 4) "annual_inc". A higher income stated by the
borrower when applying for a loan does not guarantee a better outcome. The impact of this
variable should be considered together with "is_inc_v", which has a negative impact on the
default rate. 5) "dti" (debt-to-income ratio). This ratio also has a negative impact on the
default rate. This could be explained by some of the borrowers' income information being
fictitious. Further research in the paper will only include loans with verified income to
detect any different result. 6) The most surprising finding is that "FICOScore" has a positive
impact on the default rate. People might think that borrowers with a higher FICOScore normally
have better credit quality, since a credit score backed by a third-party agency is normally very
reliable. However, on Lending Club (and, as discussed later, in Prosper's model), FICOScore is
not a good indicator of credit quality. Lenders can't simply make the decision based on this
score, which is actually what many investors are doing. 7) "revol_util" and "revol_bal" have a
positive impact on the default rate, which is consistent with expectation. Because the majority
of borrowers on Lending Club are applying for loans to consolidate personal credit lines, a
higher balance and utilization ratio indicate greater financial pressure in paying back the
balance.
3.4.4 Robustness Check
Besides building the model to predict the nominal target parameter, I also considered using the
same predicting variables to predict a numeric parameter, the net profit/loss, to check whether
the numeric regression outperforms the logistic regression. As in the previous section, I prune
the predicting variable combination using the RSquare, AIC and BIC rules and list the result
below. The three ways of ruling out variables give us the same result: keep all variables in the
linear regression model.
Entered   Parameter    Estimate
[X]       Intercept    -13687.535
[X]       loan_amnt    -0.1754355
[X]       term         -106.27022
[X]       emp_length   -44.95143
[X]       annual_inc   0.00523282
[X]       is_inc_v     -239.96839
[X]       dti          -96.287356
[X]       FICOScore    27.2358601
[X]       revol_bal    0.01584572
[X]       revol_util   1126.27626
The estimates of the variables in the linear regression make more intuitive sense than the
results from the logistic regression. For instance, "loan_amnt", "term" and "dti" have negative
coefficients with respect to net profit, in the sense that the higher the value of these
variables, the lower the profit or the higher the loss that the loan will cause investors. By
contrast, FICOScore and annual_inc contribute positively to the loan's net profit/loss. The
model generates an RSquare of 0.1072, which is significantly lower than the value for the
logistic model. To further test which model is superior, I also build the confusion matrix for
the linear regression model by setting a profit/loss value as the cutoff between good and bad
loans. At a cutoff of a net profit/loss of negative $2,100, the model achieves its highest
accuracy of 67%, which breaks down into 74% accuracy in identifying good loans and 55%
accuracy in identifying bad loans. However, the performance of this model is still worse than
that of the logistic model.
Confusion Matrix-RSquare
                      Predicted
Actual Loan Status    0       1
0                     9152    3146
1                     3422    4258
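The linear model can be applied in the same way as the logistic one: predict the net profit/loss and compare it with the cutoff of negative $2,100. The sketch below is a minimal Python illustration using the estimates listed in this section. The key names, the example borrower, the units of the inputs and the treatment of a prediction exactly at the cutoff as a good loan are assumptions.

```python
# Linear regression estimates for net profit/loss, as listed in this section.
INTERCEPT = -13687.535
COEFFICIENTS = {
    "loan_amnt": -0.1754355,
    "term": -106.27022,
    "emp_length": -44.95143,
    "annual_inc": 0.00523282,
    "is_inc_v": -239.96839,
    "dti": -96.287356,
    "FICOScore": 27.2358601,
    "revol_bal": 0.01584572,
    "revol_util": 1126.27626,
}
CUTOFF = -2100.0  # net profit/loss cutoff in USD reported for the best accuracy


def predicted_net_pl(loan: dict) -> float:
    """Predicted net profit/loss in USD: intercept + sum(beta_i * x_i)."""
    return INTERCEPT + sum(beta * loan[name] for name, beta in COEFFICIENTS.items())


def classify(loan: dict) -> int:
    """Return 0 (good loan) if the prediction is at or above the cutoff, else 1.

    Treating a prediction exactly equal to the cutoff as good is an assumption.
    """
    return 0 if predicted_net_pl(loan) >= CUTOFF else 1


# Hypothetical borrower; the units of each field are assumed, not documented.
example_loan = {
    "loan_amnt": 10_000, "term": 36, "emp_length": 5, "annual_inc": 60_000,
    "is_inc_v": 1, "dti": 15, "FICOScore": 700, "revol_bal": 10_000,
    "revol_util": 0.5,
}

net_pl = predicted_net_pl(example_loan)
print(f"Predicted net P/L = {net_pl:,.0f} USD, predicted loan_status = {classify(example_loan)}")
```

Because the target is a dollar amount, the cutoff is expressed in dollars rather than as a probability, which is what makes the two models comparable only through their confusion matrices.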
The different coefficients of the same parameter with respect to default probability and net
profit can be understood in two ways. First, the amount of net loss significantly outweighs that
of net profit, so the positive impact of FICOScore or annual_inc cannot bring enough profit to
push the net P/L to positive numbers. Second, it is nevertheless true that a higher FICOScore and
annual_inc can reduce the net loss if loans default, and can also increase the positive return
if loans prove to be good.
I also used discriminant analysis and a neural network to classify good and bad loans and
obtained the confusion matrices listed below. Taken at face value, both models outperform the
logistic model in overall accuracy and net profit when the cost matrix is applied to the results
below. The overall accuracy of the discriminant model is 68%, with a further breakdown of 70%
accuracy for good loans and 65% for bad loans. Using the neural network, the accuracy turns out
to be 69%, with 76% accuracy for good loans and 59% for bad ones. However, there are two key
disadvantages of the discriminant and neural network models. One is that the structure of the
model is non-transparent and the user can't interpret the importance of each parameter, so
investors can't apply the model easily when making investment decisions. Another disadvantage is
that both models need