Your fair-lending regression results indicate a statistically significant disparity… now what? In our last blog post, we discussed the importance of a common-sense approach to statistical analysis. One common error in statistical analysis is to assume that a result is practically meaningful just because it is statistically different from zero. This is not always the case: a statistically significant result may or may not be meaningful.
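To make the distinction concrete, here is a minimal sketch of how sample size alone can turn a trivially small gap into a "significant" one. The 0.02 percentage-point rate gap, the 0.50 standard deviation, and the sample sizes are all hypothetical, and the test assumes two equal-sized groups with a common, known standard deviation.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_for_mean_difference(diff: float, sd: float, n_per_group: int) -> float:
    """z statistic for a difference in means between two equal-sized groups
    that are assumed to share the same standard deviation."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return diff / standard_error


# Hypothetical inputs: a 0.02 percentage-point gap in average interest rate,
# with a standard deviation of 0.50 percentage points in each group.
DIFF = 0.02
SD = 0.50

results = {}
for n in (500, 5_000, 50_000, 500_000):
    z = z_for_mean_difference(DIFF, SD, n)
    results[n] = (z, two_sided_p_from_z(z))
    print(f"n per group = {n:>7,}: z = {z:6.2f}, p = {results[n][1]:.3g}")
```

The gap never changes, only the number of loans does. Whether two hundredths of a percentage point matters in practice is a question the p-value cannot answer.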

As fair lending analysis becomes increasingly technical, industry practitioners have had to familiarize themselves with the terminology of statistical analysis. Statistical significance is one of the most common and foundational concepts for successfully navigating these new waters. Moreover, it is a concept that, when misunderstood, may result in serious error.

In our last post, we discussed the importance of understanding both *what is* and *what is not* included in the data for regression analysis. In this post, we further emphasize this point with an illustration relevant to a common CECL methodology – the probability of default/loss given default method (PD/LGD).
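For readers new to the method, the arithmetic at the core of PD/LGD can be sketched as follows. This is a simplified, single-period illustration rather than a full CECL calculation: expected loss is taken as exposure times probability of default times loss given default, and the segment names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One loan segment with hypothetical PD/LGD inputs."""
    name: str
    exposure_at_default: float     # dollars outstanding if default occurs
    probability_of_default: float  # between 0 and 1
    loss_given_default: float      # share of exposure lost, between 0 and 1


def expected_loss(segment: Segment) -> float:
    """Expected loss = exposure x PD x LGD for a single segment."""
    return (segment.exposure_at_default
            * segment.probability_of_default
            * segment.loss_given_default)


# Hypothetical portfolio, for illustration only.
portfolio = [
    Segment("auto", 1_000_000, 0.02, 0.40),
    Segment("commercial real estate", 5_000_000, 0.01, 0.25),
]

total_expected_loss = sum(expected_loss(s) for s in portfolio)
for s in portfolio:
    print(f"{s.name}: {expected_loss(s):,.0f}")
print(f"total: {total_expected_loss:,.0f}")
```

Because PD and LGD are each estimated from historical data, what is and is not included in that data feeds directly into the result.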

One of the first questions to answer before beginning any type of statistical analysis is what data are included and how the sample or samples should be formulated and segmented. In previous posts, we have addressed various nuances of regression modeling and how the inappropriate application of regression and modeling techniques to real-world issues (such as credit quality or fair lending) can have serious consequences.

Philosopher of science Karl Popper referred to science as the “art of systematic over-simplification.” Indeed, if science is discovery and knowledge creation, that certainly cannot take place through “systematic over-complication.” Knowledge can only be created by that which is understood, and often the pathway there is through simplification.

For the last decade the regulatory and enforcement agencies have been increasingly using statistical methods such as regression to evaluate fair lending compliance.

With the passage of Dodd-Frank and the new emphasis on modeling and quantification, there has been a fervor to apply econometric techniques to a wide array of issues in the financial industry. This includes not only fair lending but also stress-testing for large institutions and the new requirements that will soon be implemented under CECL (Current Expected Credit Losses).

In our previous post, we began introducing the concept of statistical significance. We started by making two important points.

First, statistical methods are applied in order to estimate or measure an unknown. A sample of data is analyzed and then used to draw conclusions about a larger population. This is known as statistical inference. We are only interested in the results from the sample because of what they tell us about the larger population. If that were not the case, there would be no need for statistical methods.

The second point was that through the use of statistics, scientific methods are employed. Since the standard of proof in science is so high, absolutes are rarely, if ever, attainable. Instead, probabilities are relied on in order to draw conclusions.

When statistical methods are applied to evaluate fair lending compliance, one of the metrics of interest is the statistical significance of measured differences in the treatment of applicants. Such differences may be measured by, for example, the interest rates charged on loans or the rates of denial for one group versus another (such as males versus females).
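As a rough illustration of what testing such a difference involves, the sketch below runs a pooled two-proportion z-test on denial rates. The counts are hypothetical, and the test is a raw comparison only: it does not control for credit characteristics, which is one reason regression is used in practice.

```python
import math


def two_proportion_z_test(denied_a: int, total_a: int,
                          denied_b: int, total_b: int):
    """Pooled two-proportion z-test on denial rates.

    Returns (difference in denial rates, z statistic, two-sided p-value).
    """
    rate_a = denied_a / total_a
    rate_b = denied_b / total_b
    pooled = (denied_a + denied_b) / (total_a + total_b)
    standard_error = math.sqrt(
        pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    )
    z = (rate_a - rate_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a - rate_b, z, p_value


# Hypothetical counts: group A has 60 denials out of 400 applications,
# group B has 90 denials out of 1,000 applications.
diff, z, p = two_proportion_z_test(60, 400, 90, 1000)
print(f"difference in denial rates = {diff:.3f}, z = {z:.2f}, p = {p:.4f}")
```

A small p-value here says only that a gap this large would be unlikely if the two groups truly shared one denial rate. It says nothing about why the gap exists.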

Regression analyses for fair lending underwriting reviews generally use what are known as “discrete choice” models. These functional forms are used when the measurement (dependent) variable is categorical or otherwise limited in the values it can take. In an underwriting evaluation for fair lending analysis, for example, what is measured is either approval or denial.
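To show the mechanics, here is a minimal sketch of a logit model with a denial indicator as the outcome and a single group indicator as the only predictor, fitted by Newton-Raphson. The data are hypothetical, and a real underwriting model would include credit-related controls and would normally be fitted with a statistical package rather than by hand.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def fit_logit_one_predictor(x, y, iterations=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson.

    Returns (b0, b1, standard error of b1).
    """
    b0 = b1 = 0.0
    for step in range(iterations + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if step == iterations:
            break  # information matrix is now evaluated at the final estimates
        # Newton step: beta <- beta + inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Hypothetical data: x = 1 for group A, 0 for group B; y = 1 if denied.
x = [1] * 400 + [0] * 1000
y = [1] * 60 + [0] * 340 + [1] * 90 + [0] * 910

b0, b1, se_b1 = fit_logit_one_predictor(x, y)
print(f"coefficient on group = {b1:.4f}, odds ratio = {math.exp(b1):.3f}, "
      f"std. error = {se_b1:.4f}")
```

With one binary predictor, the fitted coefficient is simply the log of the odds ratio between the two groups, which makes this toy case easy to check by hand.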

A common question in discrete choice modeling (e.g., logistic regression or probit) is “why do some software packages, such as Stata, report z statistics while other packages, including SAS, report t-statistics?” Is there really a difference in the results? We will answer these questions by comparing the output from each package using identical data.
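As a preview of the intuition, the sketch below assumes that both labels refer to the same ratio, the coefficient divided by its standard error, and that only the reference distribution used for the p-value differs. That assumption, the statistic of 2.0, and the degrees of freedom are illustrative and are not taken from either package's output. The t-distribution p-value is obtained by numerically integrating its density.

```python
import math


def normal_two_sided_p(stat: float) -> float:
    """Two-sided p-value using the standard normal (a z statistic)."""
    return math.erfc(abs(stat) / math.sqrt(2))


def t_pdf(value: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(value * value / df))


def t_two_sided_p(stat: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value using Student's t, via Simpson's rule on [0, |stat|].

    steps must be even.
    """
    upper = abs(stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    return 1.0 - 2.0 * area


stat = 2.0  # hypothetical coefficient / standard error
print(f"normal reference:        p = {normal_two_sided_p(stat):.4f}")
for df in (10, 100, 100_000):
    print(f"t reference, df = {df:>7,}: p = {t_two_sided_p(stat, df):.4f}")
```

With the sample sizes typical of loan-level data, the two reference distributions are nearly indistinguishable, so the choice of label rarely changes the conclusion.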

In studying your bank's loan data, how can you determine the relationships among various factors in your lending policies, customer base, pricing, and more? Through the use of regression modeling, an important tool in statistical analysis.