Most financial institutions have rate sheets that provide pricing guidance for various loan products. Loan pricing is often in the form of a matrix in which the appropriate rate is determined by a combination of two or more values.

The graphic below provides an example. As shown, the rate is 6.0 for a loan under $5,000 to a customer with a credit score of 700–749, and 7.0 for a loan under $5,000 to a customer with a credit score of 650–699.
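As a concrete illustration, the lookup logic behind such a matrix can be sketched in a few lines of Python. The two filled-in rates mirror the example above; the band boundaries and remaining cells are hypothetical placeholders, not values from any actual rate sheet.

```python
# Hypothetical rate-sheet matrix: (amount band, score band) -> rate.
# The two "<5000" cells match the example in the text; the rest are invented.
RATE_SHEET = {
    ("<5000", "650-699"): 7.0,
    ("<5000", "700-749"): 6.0,
    ("5000-9999", "650-699"): 6.5,
    ("5000-9999", "700-749"): 5.5,
}

def amount_band(amount):
    # Assumed cutoffs for illustration only.
    return "<5000" if amount < 5000 else "5000-9999"

def score_band(score):
    # Assumed cutoffs for illustration only.
    return "650-699" if score < 700 else "700-749"

def price_loan(amount, score):
    """Look up the rate for a loan by its cell in the pricing matrix."""
    return RATE_SHEET[(amount_band(amount), score_band(score))]
```

For example, `price_loan(4000, 720)` returns 6.0, matching the cell described in the text.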

Since the rate in this case varies by credit score and loan amount, these factors must be taken into account in order to conduct a regression analysis of loan pricing. A question often arises as to how these factors should be operationalized in the model.

Credit score and loan amount clearly determine the rate that is charged. We further see, however, that the size of the rate adjustment for score depends on the loan amount, and vice versa. The actual rate is determined by the combination of these two factors; their effects are not independent.
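In regression terms, this kind of dependence is captured with an interaction term. A minimal sketch of the dummy coding, using hypothetical reference levels (loans of $5,000 or more and scores of 700–749 as the baselines):

```python
def design_row(amount_band, score_band):
    """Build one row of a design matrix with an interaction term.

    Dummy coding with assumed reference levels: amount ">=5000"
    and score "700-749" serve as the baselines.
    """
    d_amt = 1 if amount_band == "<5000" else 0
    d_score = 1 if score_band == "650-699" else 0
    # The product term lets the score adjustment differ by amount band,
    # which is exactly the pattern a pricing matrix encodes.
    return [1, d_amt, d_score, d_amt * d_score]
```

Without the product term, the model would force the score adjustment to be the same at every loan amount, contradicting the rate sheet.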

One thing that is commonly misunderstood about regression analysis in general, and fair lending analysis specifically, is that models must be fit to the data at hand. The answer to the operationalization question, then, is one that most don't like to hear: it depends. Let's, however, examine some of the considerations.

First, we need to consider the total sample size. A large sample generally gives us more latitude to use a larger number of variables; a small sample gives us less. Second, what is the distribution of the target and control groups? Even if the total sample size is large, small numbers within either group impose further restrictions on what can be done. Third, we need to consider the distribution of the data itself. If some of the cells on the rate sheet contain very few observations, for example, that needs to be considered as well.
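A simple diagnostic along these lines is to tabulate observations by rate-sheet cell and group before fitting anything. A sketch, with hypothetical field names and an arbitrary minimum-count threshold:

```python
from collections import Counter

def cell_counts(loans):
    """Count observations in each rate-sheet cell, split by group.

    Assumes each loan record carries (hypothetical) keys
    "amount_band", "score_band", and "group".
    """
    counts = Counter()
    for loan in loans:
        counts[(loan["amount_band"], loan["score_band"], loan["group"])] += 1
    return counts

def sparse_cells(counts, min_n=30):
    # Flag cells too thin to support reliable estimates; the
    # threshold of 30 is an illustrative choice, not a standard.
    return [cell for cell, n in counts.items() if n < min_n]
```

Running this before modeling surfaces thin cells, and thin target or control groups within cells, while there is still time to collapse bands or rethink the specification.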

There may be very few observations in the lower or upper score categories, for instance, even though the overall sample being examined is adequate. These issues are critical because they affect the model's output as well as the conclusions drawn from the parameter estimates. One consequence is straightforward and relates to the conclusions drawn from the model results: as we add a large number of categorical variables to the model, the sample sizes for the target and control groups can be depleted very quickly.

This means that any conclusions drawn may not be valid, which, in the fair lending space, is of course absolutely critical.
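The depletion described above can be gauged with a back-of-the-envelope calculation: cross-classifying categorical predictors multiplies the number of cells, so the average number of observations per cell shrinks accordingly. A sketch:

```python
from math import prod

def expected_cell_size(n_total, level_counts):
    """Average observations per cell when categorical predictors
    with the given numbers of levels are fully cross-classified."""
    # e.g. 4 amount bands x 5 score bands x 3 loan terms = 60 cells.
    return n_total / prod(level_counts)
```

With 1,000 loans and predictors of 4, 5, and 3 levels, the average cell holds fewer than 17 observations, and the target or control group within a cell will hold fewer still.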

A second consequence should be considered more generally, because it extends to other data issues as well as to violations of the regression model's assumptions: how individual software packages handle data problems. Collinearity, for example, is handled differently by different packages; it can drastically affect the regression output and, in some instances, yield inaccurate results. (We will address a specific example in a future post.)
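To see why collinearity forces a package to make a choice at all, consider two perfectly collinear predictors: the determinant of the X'X matrix is zero, the normal equations have no unique solution, and each package must decide for itself whether to drop a column, raise an error, or return unstable estimates. A minimal sketch for the two-predictor case:

```python
def gram_det2(x1, x2):
    """Determinant of the 2x2 X'X (Gram) matrix for two predictors.

    A determinant of zero means the predictors are perfectly
    collinear, so ordinary least squares has no unique solution.
    """
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    return s11 * s22 - s12 * s12

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * a for a in x1]  # an exact linear copy of x1
```

Here `gram_det2(x1, x2)` is exactly zero; in practice, near-zero determinants (near-collinearity) are the more common and more insidious case.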