Fair lending regression analyses of underwriting generally use what are known as “discrete choice” models. These models are used when the dependent (measured) variable is categorical or otherwise limited in the values it can take. In a fair lending underwriting evaluation, for example, the outcome measured is either approval or denial.
A common question in discrete choice modeling (e.g., logistic or probit regression) is: “Why do some software packages, such as Stata, report z-statistics while others, including SAS, report t-statistics?” Is there really a difference in the results? We answer these questions by comparing the output from each package using identical data.
A Brief Explanation of Z and T Statistics
A z-statistic is a test statistic that follows a standard normal distribution, that is, a normal distribution with a mean of 0 and a variance of 1. Given a z-statistic, probabilities can be computed easily using statistical tables or software such as Excel. For example, Pr(Z < 1) = 0.8413.
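The Pr(Z < 1) figure can be reproduced without tables. The following is a minimal sketch using Python's standard library rather than Excel; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, variance 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probability Pr(Z < 1).
p_below_one = std_normal.cdf(1.0)
print(f"Pr(Z < 1) = {p_below_one:.4f}")
```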
A t-distribution closely resembles a standard normal, but with a higher probability of extreme values and a lower probability of values toward the center of the distribution. In addition, the t-distribution depends on a ‘degrees of freedom’ value, which in regression models is equal to N – K, where N is the number of observations and K is the number of parameters (including the intercept).
As the degrees of freedom increase, the t-distribution converges to the standard normal (Z) distribution. For example, with 20 degrees of freedom Pr(t < 1) = 0.8354 (0.59 percentage points less than the standard normal), but with 200 degrees of freedom Pr(t < 1) = 0.8407 (0.06 percentage points less than the standard normal).
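The convergence can be checked numerically. The sketch below is one way to do so using only Python's standard library: it integrates the t density with Simpson's rule instead of calling a statistics package. The helper names `t_pdf` and `t_cdf` are hypothetical, and the third case (2,000 degrees of freedom) is an added illustration that does not come from the text.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """Pr(t < x): Simpson's rule on [0, |x|], then symmetry about zero.

    `steps` must be even for Simpson's rule.
    """
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


z_prob = NormalDist().cdf(1.0)
for df in (20, 200, 2000):  # 2000 is an extra illustrative case
    p = t_cdf(1.0, df)
    print(f"df={df:>4}: Pr(t < 1) = {p:.4f}, gap to normal = {z_prob - p:.4f}")
```

The gap to the normal probability shrinks as the degrees of freedom grow, which is the convergence described above.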
Putting Them Into Practice
In linear regression (OLS), both Stata and SAS report t-statistics and compute p-values using the t-distribution with the appropriate degrees of freedom. With logit (or probit), however, Stata reports z-statistics while SAS reports t-statistics.
This can be seen in the output below, where we estimate a logistic regression of whether an individual buys Coke or Pepsi. The dependent variable COKE equals 1 if an individual buys Coke and 0 if they buy Pepsi. The explanatory variables are the prices of a 2-liter bottle of each product, PR_COKE and PR_PEPSI.
Using 1,140 observations, the Stata output is:
Note that ‘z’ is the coefficient divided by the standard error.
The SAS output, using PROC QLIM, is:
The parameter estimates and standard errors are identical between the packages. The t value computed by SAS is also equal to the coefficient divided by its standard error. Thus, there is no difference in the values, only in the label each package uses.
With 1,140 observations, there is practically no difference between the probabilities computed using a z or a t distribution. Therefore, we next look at another example with a much smaller number of observations.
In this example, the dependent variable is ADJUST, which equals 1 if a buyer chooses an adjustable-rate mortgage and 0 if the buyer chooses a fixed-rate mortgage. The explanatory variables are MARGIN, which is the variable rate minus the fixed rate, and NETWORTH, which is the net worth of the buyer. There are 28 observations. The Stata output is:
The SAS output is:
Once again, we observe that the parameter estimates and standard errors are equivalent between the packages. In addition, the z and its p-value from Stata are equal to the t and its p-value from SAS. For example, the NETWORTH coefficient divided by its standard error is equal to 1.320639/0.575234 = 2.295829.
Using Excel, we find that the two-sided probability P(|Z| > 2.295829) = 0.0217, which is the p-value reported by both Stata and SAS. By contrast, with 25 degrees of freedom (N = 28 minus K = 3), P(|t| > 2.295829) = 0.0303. Because neither package reports 0.0303, neither is using the t-distribution; the only difference between the two outputs is labeling.
Although SAS labels the p-value as Pr > |t|, the value is in fact calculated using the Z distribution (standard normal) so the output matches exactly with Stata. That is, both packages actually compute Z scores and associated p-values.
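The NETWORTH comparison can be reproduced from the coefficient and standard error quoted above. The following is a rough sketch in standard-library Python rather than Excel, Stata, or SAS. The function names are hypothetical, and the t tail probability is obtained by numerically integrating the density, which is an implementation choice made here and not how either package computes it.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_p(stat: float, df: float, steps: int = 2000) -> float:
    """P(|t| > |stat|) using Simpson's rule on [0, |stat|] (steps must be even)."""
    a = abs(stat)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * (0.5 - total * h / 3)


def z_two_sided_p(stat: float) -> float:
    """P(|Z| > |stat|) under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(stat)))


# NETWORTH coefficient and standard error as quoted in the text.
coef, std_err = 1.320639, 0.575234
ratio = coef / std_err       # the value both packages report as z or t
n_obs, n_params = 28, 3      # 28 observations; intercept, MARGIN, NETWORTH

p_z = z_two_sided_p(ratio)
p_t = t_two_sided_p(ratio, n_obs - n_params)
print(f"ratio = {ratio:.6f}")
print(f"two-sided p, standard normal: {p_z:.4f}")
print(f"two-sided p, t with {n_obs - n_params} df: {p_t:.4f}")
```

Comparing the two printed p-values with the one shown in the software output identifies which distribution the package actually used.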
As a final point, one may ask why software packages compute t-statistics using the degrees of freedom correction for OLS, but not for logit/probit models.
The reason is that OLS is an unbiased estimator, which means that the average of the parameter estimates over many samples equals the true values, regardless of the sample size. Thus, a small sample (e.g., fewer than 50 or so observations) is acceptable. Under the classical assumptions, including normally distributed errors, the coefficient divided by its standard error follows a t-distribution exactly; this differs from the Z distribution when the sample is small, and so the software computes p-values using the t-distribution.
In contrast, we do not know the small-sample properties of logit/probit estimators. These estimators may or may not be unbiased in small samples. All we know is that the estimates converge to the true parameter values as the sample size, N, becomes infinitely large.
In practice, the sample size necessary for logit/probit is not known exactly but should be fairly large; it should generally be large enough that p-values based on the Z and t distributions are practically identical. Therefore, software packages are programmed to compute the p-values based on Z. SAS likely labels these as t-statistics to avoid confusion among users who are already familiar with t-statistics from linear regression models.
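The unbiasedness claim for OLS can be illustrated with a small Monte Carlo experiment. The sketch below uses entirely hypothetical settings (20 observations per sample, a true intercept of 1.0, a true slope of 0.5, standard normal errors, 2,000 replications); none of these come from the examples in this article.

```python
import random
from statistics import mean


def ols_slope(x: list[float], y: list[float]) -> float:
    """Slope of a simple OLS regression of y on x (with an intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_mean_slope(n_obs: int = 20, n_samples: int = 2000,
                        true_slope: float = 0.5, seed: int = 12345) -> float:
    """Average OLS slope estimate across many small simulated samples."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n_obs)]  # fixed regressor values
    slopes = []
    for _ in range(n_samples):
        # Hypothetical data-generating process: y = 1.0 + slope * x + N(0, 1) error.
        y = [1.0 + true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
        slopes.append(ols_slope(x, y))
    return mean(slopes)


avg_slope = simulate_mean_slope()
print(f"average slope over 2000 samples of 20 observations: {avg_slope:.4f}")
```

Even though each sample has only 20 observations, the average of the estimates should land very close to the true slope of 0.5, which is what unbiasedness means.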