One of the first questions to answer before beginning any type of statistical analysis is what data to include and how the sample or samples should be formulated and segmented. In previous posts, we have addressed various nuances of regression modeling and how the inappropriate application of regression and modeling techniques to real-world issues (such as credit quality or fair lending) can have serious consequences.
While data issues and oversights are omnipresent in any type of applied work, it is important to always understand the assumptions relied upon in any analysis.
In yet another post, we discussed some common misconceptions or myths with regard to regression analysis, generally, and fair lending, specifically. One of these myths was that a regression model is only affected by what is included in the model.
Omitted Variable Bias
As we stated in that post, this could not be more wrong. On the contrary, the results from a regression model can be severely biased by what is left out. In applied work, this is referred to as Omitted Variable Bias (OVB). Understanding this is critical because it affects how results should be interpreted as well as how a model should be structured. It should also influence decisions about sample composition and stratification.
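To make the idea concrete, here is a minimal simulation sketch (not from the original post; the variable names and coefficients are illustrative assumptions). The outcome depends on two correlated predictors; when one is omitted, the coefficient on the remaining predictor absorbs part of the omitted variable's effect and is biased away from its true value.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Two correlated predictors (think of two related credit attributes)
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)

# True data-generating process: y depends on BOTH variables
y = 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Correctly specified model: regress y on [intercept, x1, x2]
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Misspecified model: x2 is omitted
X_omit = np.column_stack([np.ones(n), x1])
beta_omit = np.linalg.lstsq(X_omit, y, rcond=None)[0]

print(beta_full[1])  # close to the true coefficient, 2.0
print(beta_omit[1])  # inflated toward ~4.4: 2.0 plus 3.0 * 0.8 of bias
```

The bias here follows the textbook OVB formula: the omitted coefficient (3.0) times the slope of the omitted variable on the included one (0.8), so the estimate lands near 4.4 rather than 2.0.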
Although it is commonly understood that correlation does not equal causation, this fact is often ignored. Indeed, whether an analysis is for credit quality or fair lending, there is an implicit causal claim which, of course, is the purpose of the analysis. The classic illustration of this is from Statistics 101, in which a study is presented showing that the concentration of storks in certain areas is highly correlated with higher birth rates.
As it turns out, storks are more prevalent in rural areas; and at the time the study was conducted, higher birth rates also occurred in rural areas. Although this is a humorous example, the principle has much more serious implications.
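The stork example can be reproduced with a short simulation (a sketch, with made-up magnitudes): a rural indicator drives both stork density and birth rates, so the two are strongly correlated in the raw data even though neither causes the other. Conditioning on the confounder, the correlation vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Confounder: whether the area is rural (drives BOTH variables)
rural = rng.integers(0, 2, size=n)

# Storks and births each depend on rurality, not on each other
storks = 5.0 * rural + rng.normal(size=n)
births = 4.0 * rural + rng.normal(size=n)

# Raw correlation looks strong...
r_raw = np.corrcoef(storks, births)[0, 1]

# ...but within rural areas alone, it disappears
mask = rural == 1
r_rural = np.corrcoef(storks[mask], births[mask])[0, 1]

print(round(r_raw, 2))    # strongly positive
print(round(r_rural, 2))  # near zero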
How Can We Mitigate OVB?
One way to reduce the effect of omitted variables in a regression equation is through how the samples are assembled and segmented. It should be obvious that data with different distributions, or that are generated by different processes, typically should not be combined into one sample. The reason is that each dataset carries the "fingerprints" or influences unique to that particular data or process. When these data are combined into a single regression, those differences and influences may go unaccounted for. The implications are further magnified when the model itself does not account for all relevant variables.
Let's consider a simple example. Assume the model we are formulating is to predict underwriting decisions. Our model will include credit score, DTI, LTV, income, and length of employment. Our loan data, however, are generated by two different processes. One includes all applications submitted, and the other includes only applications that have been through a pre-approval process. In the first case, all applications are present; in the second, only those that passed the initial screen. It is easy to see that in the second case, many of the weaker applications in terms of credit quality will already have been "weeded out."
A consequence of this is that most of the pre-screened applications will likely be similar on the variables measured in our model. Accordingly, the credit decision will likely depend largely on factors outside the model. The other application pool, on the other hand, will contain a great deal of variation with respect to the factors in the model. Combining these samples into one regression can create a number of issues, but it should be clear regardless that doing so would not be the best approach.
Although there may be techniques that could be employed to model these data together, a simple solution is to run separate regressions. In doing so, the observations in effect become their own controls, as you are limiting the variation that would go unaccounted for in the model by creating homogeneous samples.
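A simple simulation illustrates the point (the numbers and the linear-propensity setup are illustrative assumptions, not the post's actual data). Two pools share the same within-pool relationship between credit score and the outcome, but the pre-screened pool's decisions also reflect factors outside the model, shifting its outcomes. Separate regressions recover the true slope in each pool; pooling the data without accounting for the process difference inflates it.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Pool A: all applications; wide range of credit scores
score_a = rng.normal(650, 60, size=n)
# Pool B: pre-screened; weaker applications already weeded out
score_b = rng.normal(740, 20, size=n)

# Same true slope in both pools, but pool B's outcome also
# reflects unmodeled factors (the +1.5 shift)
y_a = 0.004 * score_a + rng.normal(scale=0.2, size=n)
y_b = 0.004 * score_b + 1.5 + rng.normal(scale=0.2, size=n)

def slope(x, y):
    """OLS slope from a simple regression of y on x with intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Separate regressions recover the true slope in each pool
slope_a = slope(score_a, y_a)
slope_b = slope(score_b, y_b)

# Pooling without accounting for the process difference inflates it,
# because the between-pool shift loads onto the score coefficient
slope_pooled = slope(np.concatenate([score_a, score_b]),
                     np.concatenate([y_a, y_b]))

print(slope_a)       # ~0.004
print(slope_b)       # ~0.004
print(slope_pooled)  # well above 0.004
```

The pooled slope is distorted because the difference between the pools (the unmodeled screening effect) is correlated with credit score, which is exactly the omitted-variable mechanism described above.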
In truth, there is seldom any empirical reason to combine data when separate regressions are feasible. There are instances when a lack of data makes combining necessary, or when there is so little variation between the samples that separate regressions are unnecessary and combining would yield greater precision. Most of the time, however, assuming the correct answer is being sought, running separate regressions on different samples is the most robust approach.
How to cite this blog post (APA Style):
Premier Insights. (2018, May 10). Omitted Variable Bias (OVB) In Regression Model Analysis [Blog post]. Retrieved from https://www.premierinsights.com/omitted-variable-bias-ovb-in-regression-model-analysis.