In our last post, we discussed the importance of understanding both what is and what is not included in the data for regression analysis. In this post, we further emphasize this point with an illustration relevant to a common CECL methodology – the probability of default/loss given default method (PD/LGD).
If you are unfamiliar with CECL or the PD/LGD methodology, we have discussed both issues in detail in previous posts. In this post, we assume that readers have some familiarity with the methodology and do not belabor the details of CECL’s requirements.
Calculating the Probability of Default Using Regression Analysis
Suppose that in order to comply with CECL, a hypothetical bank wishes to predict the probability of default for each loan using borrower characteristics and information about the credit environment. In other words, the bank wishes to discover whether credit score, debt-to-income ratio (DTI) and other borrower characteristics are related to whether borrowers default on their loans.
Can the bank use regression analysis to answer this question? Yes, regression analysis can explain the probability of an event occurring. For example, if we want to determine whether individuals with lower credit scores have a higher probability of default, we need data on 1) each borrower’s credit score and 2) whether or not the borrower defaulted on the loan (see Table 1 below). If the borrower defaulted on the loan, we put a 1 in the second column, meaning “true” for default. If the borrower made all payments on time, we put a 0 in the second column, meaning “false” for default.
Table 1

Credit Score    Default? (1 = Yes)
651             0
802             0
656             1
668             0
626             1
702             0
687             0
723             0
636             0
When we select the second column as the dependent (outcome) variable and the first column as the explanatory variable in a regression model, the resulting coefficient describes how credit score is related to the probability that an individual defaults on a loan.
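As a minimal sketch of that setup, the snippet below fits an ordinary least squares line to the nine rows of Table 1, with the default flag as the dependent variable and credit score as the explanatory variable. This is a linear probability model, chosen here only because it needs no libraries; the post does not say which regression form the bank uses, and a logistic regression is a common alternative. The names `table_1` and `fit_line` are hypothetical.

```python
# Table 1 from the post: (credit score, default flag) for approved borrowers.
table_1 = [
    (651, 0), (802, 0), (656, 1), (668, 0), (626, 1),
    (702, 0), (687, 0), (723, 0), (636, 0),
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


scores = [score for score, _ in table_1]      # explanatory variable
defaults = [flag for _, flag in table_1]      # dependent variable (1 = default)

intercept, slope = fit_line(scores, defaults)

# The slope is the estimated change in default probability per credit score point.
print(f"intercept = {intercept:.4f}, slope = {slope:.6f}")
```

On these nine rows the slope comes out negative, at roughly -0.0036 per credit score point. With so few observations, that number is only an illustration of the mechanics, not an estimate to rely on.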
The point that we want to emphasize is that, given the data available to banks in this situation, it would be unsurprising for the bank to find that credit score and the probability of default are unrelated. Or, at the very least, it would be unsurprising for a bank to find a weaker relation between credit score and the probability of default than one would expect. Whether this happens hinges on a number of factors, including the size of the sample, but it is certainly possible. Why? The answer may lie in understanding what is not in the data.
What is Missing from the Data?
Consider what is missing from the bank’s data (see the last 3 rows in Table 2 below). Sticking with the credit score relation as an example, the bank does not have data for those individuals who are rejected for a loan because their credit score is too low.
While the bank can collect data on each applicant’s credit score, it cannot collect data on whether or not each applicant defaulted on the loan because some applicants did not receive the loan. An applicant cannot default on a loan that does not exist. So, the repayment behavior of these rejected applicants is, by definition, unobservable.
Table 2

Credit Score      Default? (1 = Yes)
651               0
802               0
656               1
668               0
626               1
702               0
687               0
723               0
636               0
615 (rejected)    ?
595 (rejected)    ?
602 (rejected)    ?
How Does This Explain the Counterintuitive Regression Results?
How does this explain the weakness or lack of relation between credit score and the probability of default?
Because we may not have data for individuals with low credit scores, the regression model only calculates the relation between credit score and the probability of default for individuals whose credit scores are high enough for the loan to be approved.
Therefore, while the relation between credit score and the probability of default may be negative for all applicants, the relation between credit score and the probability of default may be zero for approved applicants.
In the left-hand hypothetical graph, the observed relation may be flat (meaning no relation) for approved applicants. This may be the case because every individual with a credit score high enough to be approved is also very unlikely to default. In other words, the model may struggle to distinguish the repayment behavior of an individual with a 700 credit score from that of an individual with a 775 credit score.
One interpretation of this result is that there is no relation between credit score and the probability of default. But that may not be a true statement, as the results may be biased by missing data.
The true relation between credit score and the probability of default may look more like the graph on the right, where credit score and the probability of default are negatively related. But because applicants with low credit scores, where the probability of default is highest, are missing from the bank’s data, the bank’s regression analysis will not reveal this true relation.
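The gap between the two relations can be illustrated with a small simulation. Everything in the sketch below is an assumption made for illustration: the logistic shape of `true_pd`, its midpoint and scale, the uniform spread of scores, the approval cutoff of 620, and the sample size. Unlike a real bank, the simulation can observe what rejected applicants would have done, which is exactly what lets it compare the slope for all applicants with the slope for approved applicants only.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def true_pd(score):
    """Hypothetical true default probability: high at low scores, near zero at high scores."""
    return 1.0 / (1.0 + math.exp((score - 580.0) / 25.0))


rng = random.Random(0)   # fixed seed so the run is repeatable
CUTOFF = 620             # assumed minimum credit score for approval

# Simulate every applicant, including those the bank would reject.
applicants = []
for _ in range(20000):
    score = rng.uniform(550, 850)
    defaulted = 1 if rng.random() < true_pd(score) else 0
    applicants.append((score, defaulted))

# The bank only observes outcomes for applicants at or above the cutoff.
approved = [(s, d) for s, d in applicants if s >= CUTOFF]

_, slope_all = fit_line(*zip(*applicants))
_, slope_approved = fit_line(*zip(*approved))

print(f"slope, all applicants:      {slope_all:.6f}")
print(f"slope, approved applicants: {slope_approved:.6f}")
```

Under these assumptions the slope estimated from approved applicants is much closer to zero than the slope estimated from all applicants, even though both samples come from the same underlying relation. A different cutoff or default curve would change the size of the gap.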
For the purposes of CECL, the bank is not concerned about the probability of default for loans that don’t exist. The relation on the left (approved applicants) is the only relation about which the bank is concerned. However, if an institution were instead assessing its underwriting standards and wanted to understand whether those standards were too strict or too lenient, these data would be insufficient, as demonstrated above. If the data were used in that manner, a biased result would be likely.
Regression analysis is like a calculator. It calculates an answer when you push the button, but it cannot tell you if you pushed the correct button. Its proper application requires that (1) the right question is being asked and (2) the appropriate data are being used to answer that question.