# What the Data is and is Not

The first and most important step of any quantitative analysis is understanding what the data consists of. Unfortunately, this step is often ignored in econometric analyses.

## On Assumptions and Foundations

It is often assumed that if the approach is multivariate, the modeling process will “winnow out” the effects of things that we want to control for. This, however, is not always the case. Even the most sophisticated empirical techniques cannot overcome the problems created by an unrecognized, biased sample. The sample can be thought of as the foundation of the analysis. As with building a house, if the foundation is not strong enough to support the house, then no matter how well the house is constructed, it will not stand.

Let’s consider a very simple example. Suppose you wanted to poll Americans regarding their preferred candidate in the 2020 U.S. presidential election. Traditionally, a survey of this type would take place exclusively by landline phone. But this is 2020, and only 50 percent of Americans still have landline telephone service.

The problem with surveying by landline phone is not simply that fewer Americans have a landline. The problem is that Americans who do not have a landline may differ from Americans who do. If certain groups disproportionately have only cell phones, the sample omits a portion of the voting population whose preferences may differ from those of the individuals in your sample. The sample would therefore be biased and not representative of the population it is intended to measure.

Note that in this example, the issue is what is NOT in the data as opposed to what IS in the data. It is easy to notice when a sample includes more information than needed, but it is less obvious when a sample omits key information. Bias can arise equally in both cases.
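The landline example can be sketched as a small simulation. This is a minimal illustration under invented assumptions, not survey data: the age split, the landline rates (80 percent of older voters, 20 percent of younger voters), and the preference rates are all hypothetical, chosen only so that overall landline coverage comes out near 50 percent.

```python
import random


def simulate_poll(n=100_000, seed=0):
    """Compare true support for candidate A with a landline-only estimate.

    All rates below are hypothetical and exist only to illustrate the bias.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        # Assumption: half of voters are "older".
        older = rng.random() < 0.5
        # Assumption: older voters are far more likely to keep a landline.
        has_landline = rng.random() < (0.8 if older else 0.2)
        # Assumption: older voters lean toward candidate A, younger voters away.
        prefers_a = rng.random() < (0.6 if older else 0.4)
        population.append((has_landline, prefers_a))

    # Share of the whole population that prefers candidate A.
    true_share = sum(prefers for _, prefers in population) / n

    # Share among only those reachable by landline.
    reachable = [prefers for has_landline, prefers in population if has_landline]
    landline_share = sum(reachable) / len(reachable)

    coverage = len(reachable) / n
    return true_share, landline_share, coverage


true_share, landline_share, coverage = simulate_poll()
print(f"landline coverage:          {coverage:.3f}")
print(f"true support for A:         {true_share:.3f}")
print(f"landline-only estimate:     {landline_share:.3f}")
```

Under these assumptions, true support for candidate A is about 50 percent, while the landline-only sample puts it near 56 percent. Increasing the sample size does not close the gap, because the error comes from who is missing, not from how many people were asked.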

## An Illustration

The following scenario illustrates how this issue is sometimes far from obvious.

Suppose a financial institution uses credit scores to help predict the probability of default for each individual who applies for a loan. Now, after years of collecting data, the institution wants to know whether credit scores are, in fact, a good predictor of a borrower’s repayment behavior.

Although counterintuitive, an analysis of the data could very well suggest that there is little or no relationship between credit score and default. This could be the case if, similar to the example above, the data does not truly reflect the universe of loans: for example, if the sample contained only “A Grade” credits or the like. With so little variation in credit quality, the relationship would be hard to detect. The finding would be incorrect but could nevertheless be gleaned from the data by someone without an understanding of the data itself.
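A short simulation makes the point concrete. This is a sketch under stated assumptions rather than a model of any real portfolio: the score range, the logistic default rule, and the cutoff of 760 standing in for “A Grade” credits are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_loans(n=50_000, seed=0):
    """Generate (score, defaulted) pairs where default truly depends on score."""
    rng = random.Random(seed)
    loans = []
    for _ in range(n):
        # Assumption: scores are spread evenly across 300-850.
        score = rng.uniform(300, 850)
        # Assumption: default probability falls smoothly as the score rises.
        p_default = 1 / (1 + math.exp((score - 600) / 50))
        defaulted = 1 if rng.random() < p_default else 0
        loans.append((score, defaulted))
    return loans


loans = simulate_loans()

# Hypothetical "A Grade" sample: only loans with a score of 760 or higher.
a_grade = [(score, defaulted) for score, defaulted in loans if score >= 760]

full_corr = pearson([s for s, _ in loans], [d for _, d in loans])
a_grade_corr = pearson([s for s, _ in a_grade], [d for _, d in a_grade])

print(f"all loans:     n={len(loans):>6}  corr(score, default)={full_corr:+.2f}")
print(f"A Grade only:  n={len(a_grade):>6}  corr(score, default)={a_grade_corr:+.2f}")
```

With these invented parameters, the correlation across all loans is strongly negative, while in the restricted sample it is close to zero, even though the same rule generated both. An analyst who saw only the restricted sample could wrongly conclude that credit score does not matter.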

This issue arises in fair lending analysis more often than is realized. Samples often contain mixes of loans that can, at best, distort results and, at worst, bias them.

We plan to explore this critical issue in more depth in future posts and a forthcoming white paper. We will provide real-world examples to help both practitioners and those who rely on their analyses avoid these types of pitfalls.