In our previous post, we addressed a few frequently encountered issues when using regression methods to conduct a fair lending analysis.

The issues discussed previously concerned the parameter estimation routines commonly used for underwriting analysis, along with some precautions for interpreting and reporting the results. We made the point that different software applications may produce computationally different results, so to understand the results it is important to know how the individual calculations are being done.

In this post, we address two additional common issues that can complicate the use of regression analysis for underwriting. The first is sample size. We addressed the meaning of statistical significance in a previous series of posts, and it is clear that one should not draw statistical conclusions from a very small number of observations.

However, in fair lending regression analysis where the outcome is binary and a number of explanatory factors are used in the model, even a seemingly large sample can become problematic. Here is why:

When using regression for underwriting analysis, we are estimating probabilities of approval or denial based on a set of factors. The most common regression routines used in fair lending analysis (Logit or Probit) use Maximum Likelihood Estimation (MLE) to estimate the parameters.

Maximum likelihood estimation requires us to assume a specific distribution (logistic for logit, normal for probit), and it also requires a larger number of observations than ordinary least squares (OLS) does. Logit and Probit estimation generally work well in large samples but can be severely biased in small samples.

Once we account for underwriting criteria, such as credit score, we are in effect estimating conditional probabilities. When the estimated probabilities are conditioned on multiple factors and then split further into separate target and control groups, even large samples can become small very quickly.

This issue is exacerbated if all or many of the explanatory factors are categorical. In this case, each combination of variable values (e.g., in the control group and with a missing credit score) becomes a subsample, or cell, that requires a large number of observations by itself.
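To make the arithmetic concrete, here is a minimal sketch of how quickly cells multiply. The factor names, the number of levels, and the 5,000-application sample are all hypothetical and chosen only for illustration; real cells will also be far less evenly filled than the average suggests.

```python
from itertools import product

# Hypothetical categorical factors and their levels (illustrative only).
factors = {
    "group": ["target", "control"],
    "credit_score_band": ["missing", "<650", "650-699", "700+"],
    "dti_band": ["low", "medium", "high"],
    "ltv_band": ["low", "medium", "high"],
}

def average_cell_size(n_applications, factors):
    """Return (number of cells, average observations per cell)."""
    # Every combination of factor levels is one cell.
    cells = list(product(*factors.values()))
    return len(cells), n_applications / len(cells)

n_cells, avg = average_cell_size(5000, factors)
print(f"{n_cells} cells, about {avg:.0f} applications per cell on average")
# prints "72 cells, about 69 applications per cell on average"
```

A sample of 5,000 sounds large, but with only four modest categorical factors it averages fewer than 70 applications per cell, and the sparsest cells will hold far fewer.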

There are a number of methods to deal with small sample bias, and the choice of method should depend on the situation. (For interested readers, there is a vast literature available, and one can simply search “Logit Rare Events” or something similar to find articles and information.)

The second issue that is common in underwriting analysis is one that exists in most quantitative work and is known as Omitted Variable Bias (OVB). This condition occurs when a relevant explanatory variable is *left out* of a model and its effect is instead attributed to a correlated, but not truly relevant, variable that is *in* the model.

Although this is a possibility in many disciplines where regression analysis is used, it is common in underwriting analysis for fair lending specifically because it is often difficult to quantify and obtain data on every factor that could influence a credit decision.
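For readers who want to see the mechanics, the standard textbook result for the linear (OLS) case is sketched below. The symbols are introduced here for illustration and are not from the earlier posts; the expression is exact only for linear regression, although the same intuition carries over to Logit and Probit models.

$$
\begin{aligned}
\text{True model:}\quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \\
\text{Relationship between the factors:}\quad & x_2 = \delta_0 + \delta_1 x_1 + \nu \\
\text{Estimated model (omitting } x_2\text{):}\quad & y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + u \\
\text{Result:}\quad & E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1
\end{aligned}
$$

The bias term $\beta_2 \delta_1$ is zero only if the omitted variable has no effect on the outcome ($\beta_2 = 0$) or is unrelated to the included variable ($\delta_1 = 0$).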

Let’s consider an example that will demonstrate this in practice. For simplicity, suppose Bank A makes its credit decisions for a particular product based solely on the credit score of the applicant. It is the Bank’s policy to approve loans to any applicant with a score of 650 or higher and to deny loans to applicants with scores below 650.

We review a sample of data and find that 10% of males were denied but 20% of females were denied. So there is a correlation between being denied and being female. In reviewing the data, however, we find that females on average have much lower credit scores than males.

This suggests that the disparity in the denial rates is most likely a function of credit score rather than the applicant’s gender. *BUT*, if we estimate a model with only gender and leave out score, the effect of credit score will be attributed to gender, because gender is correlated with the outcome and credit score is not accounted for.
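The Bank A example can be reproduced with a small simulation. This is a minimal sketch under stated assumptions: the normal score distributions, their means (714 and 692), and the standard deviation of 50 are invented so that the denial rates land near 10% and 20%. It uses a simple stratified comparison rather than a fitted regression, because a decision rule this clean would perfectly separate the outcomes in a Logit model.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

CUTOFF = 650  # Bank A's policy: approve at or above this score

def simulate(n_per_group=10_000):
    """Simulate applicants; the decision depends on score only, never on gender."""
    rows = []
    # Assumed score distributions: the female group is centered lower.
    for gender, mean_score in (("male", 714), ("female", 692)):
        for _ in range(n_per_group):
            score = random.gauss(mean_score, 50)
            denied = score < CUTOFF
            rows.append((gender, score, denied))
    return rows

def denial_rate(rows):
    """Share of applicants in rows that were denied."""
    return sum(denied for _, _, denied in rows) / len(rows) if rows else float("nan")

rows = simulate()

# Score omitted: raw denial rates differ by gender.
for g in ("male", "female"):
    rate = denial_rate([r for r in rows if r[0] == g])
    print(f"{g:6s} overall denial rate: {rate:.1%}")

# Score accounted for: on each side of the cutoff, gender makes no difference.
bands = (("score >= 650", lambda s: s >= CUTOFF), ("score < 650", lambda s: s < CUTOFF))
for label, in_band in bands:
    for g in ("male", "female"):
        subset = [r for r in rows if r[0] == g and in_band(r[1])]
        print(f"{label:13s} {g:6s} denial rate: {denial_rate(subset):.1%}")
```

The overall rates show a gender disparity even though gender never enters the decision rule. Once applicants are compared within the same score band, the disparity disappears: everyone at or above the cutoff is approved and everyone below it is denied, regardless of gender.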

In summary, there are a number of things to consider when drawing conclusions from regression analysis generally, and this is especially true with regard to underwriting analyses. The key point to remember is that regression is a powerful and effective tool for evaluating fair lending compliance, but as with any tool, it must be understood and applied correctly.