Issues in Name Matching for BISG and Other Proxy Methods for Fair Lending Analysis


One challenge in analyzing consumer loan data for fair lending is that the data usually lack information on protected-class status, such as race, ethnicity and gender. This information can be approximated, but the approximation methods bring challenges of their own.

For real estate applications reportable under the Home Mortgage Disclosure Act (HMDA), this information is known as Government Monitoring Information (GMI); it is reported with the HMDA data and is available for analysis. While lenders are required to collect GMI for HMDA-reportable loan applications, they are prohibited from collecting it for non-HMDA credits.

A lender that wishes to conduct a fair lending analysis for consumer loans must therefore employ some form of “proxy” method to estimate or infer whether each applicant belongs to a protected class.

Methods and Challenges for Inferring Class Status

With respect to race or ethnicity, this is typically done one of two ways:

  1. Using the surname and obtaining the probability of race or ethnicity from Census data
  2. Using the surname together with the racial and ethnic composition of the area in which the person lives to compute a joint probability, a method known as Bayesian Improved Surname Geocoding (BISG)
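
To make the "joint probability" in the second method concrete, here is a minimal Python sketch of the Bayes' rule update commonly used for BISG. It assumes that surname and location are independent once race or ethnicity is known. The function name, the group labels and all of the numbers are hypothetical and for illustration only; they are not Census figures.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    p_race_given_surname: dict of group -> P(group | surname)
    p_geo_given_race:     dict of group -> P(lives in this area | group)
    Returns a dict of group -> P(group | surname, area).
    """
    # Multiply the two pieces of evidence for each group.
    unnormalized = {
        group: p_race_given_surname[group] * p_geo_given_race[group]
        for group in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no group has a nonzero probability")
    # Rescale so the probabilities sum to one.
    return {group: value / total for group, value in unnormalized.items()}


# Made-up inputs for illustration only.
surname_probs = {"group_a": 0.60, "group_b": 0.30, "group_c": 0.10}
geo_probs = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}

posterior = bisg_posterior(surname_probs, geo_probs)
```

In this made-up case the surname alone points to group_a, but the area information shifts most of the probability to group_b.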

Both methods require matching the names in the loan files to the Census name files. The first challenge is the form in which names are provided in the lender data and how clean and consistent those data are. These factors affect the match rate and, therefore, both the accuracy and the feasibility of the analysis.

The source data often store the first and last name together in one field. The surname must then be separated from the first name.

If the format is “surname, firstname” this is a simple task. Other formats require a little more logic to figure out what part is what. Once we have removed the first name from the data field, whatever remains must be the surname; and we should have no problem matching that to the Census name files, right?

Not quite.

The source name data likely contain suffixes, like “Jr.” or “Sr.” or perhaps “III.” They may also contain professional or academic designations such as “M.D.” or “Ph.D.” (or even “PhD”). The code written to perform name matching must, therefore, also remove any occurrence of these from the source data.
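
The separation and clean-up steps described above might look something like the following Python sketch. It is not the utility we built; the function names and the suffix list are assumptions, and the list is deliberately short. It handles only the “surname, firstname” and “firstname surname” layouts, and in the second layout it keeps just the last word, so multi-part surnames need separate treatment.

```python
# Assumed, non-exhaustive set of trailing tokens to drop (periods already removed).
SUFFIXES = {"JR", "SR", "II", "III", "IV", "MD", "PHD"}


def strip_suffixes(tokens):
    """Remove trailing suffix tokens such as JR or PHD from a list of words."""
    while tokens and tokens[-1] in SUFFIXES:
        tokens = tokens[:-1]
    return tokens


def extract_surname(full_name):
    """Return an upper-case surname from a combined name field, or "" if none."""
    # Drop periods so that "Ph.D." and "PhD" are treated the same.
    cleaned = full_name.upper().replace(".", "")
    # Split on commas, tokenize each piece, and strip trailing suffixes.
    parts = [strip_suffixes(piece.split()) for piece in cleaned.split(",")]
    # Discard pieces that held nothing but suffixes, e.g. ", Jr."
    parts = [piece for piece in parts if piece]
    if not parts:
        return ""
    if len(parts) >= 2:
        # "surname, firstname": the surname is everything before the first comma.
        return " ".join(parts[0])
    # "firstname surname": take the last remaining word.
    return parts[0][-1]
```

With these assumptions, `extract_surname("Jane Doe, Ph.D.")` gives `"DOE"` and `extract_surname("De La Cruz, Maria")` gives `"DE LA CRUZ"`. A real surname that happens to equal one of the suffix tokens would be stripped by mistake, so the list should be reviewed against the actual data.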

There are other challenges, of course, especially given the wide variety of naming conventions across myriad cultural backgrounds. For example, surnames consisting of multiple parts are not uncommon in Hispanic populations. The source data may contain a record with a surname of “De La Cruz.” Almost all of the surnames in the Census name files consist of a single part. If a match cannot be made using the surname as-is, we can try combining a multi-part surname into a single name and trying to find a match on that. The same approach also finds matches for “De Los Santos” and “De La O” to name a few.
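
The fallback for multi-part surnames can be sketched in Python as follows. The function name is hypothetical, and the small set of names is only a stand-in for the Census name files, which would normally be loaded from disk.

```python
def match_surname(surname, census_names):
    """Return the matching entry from census_names, or None if there is none.

    Tries the surname as-is first, then with its parts joined into one word.
    """
    candidate = surname.strip().upper()
    if candidate in census_names:
        return candidate
    # Join the parts of a multi-part surname: "DE LA CRUZ" -> "DELACRUZ".
    combined = "".join(candidate.split())
    if combined != candidate and combined in census_names:
        return combined
    return None


# Hypothetical stand-in for the Census name files.
census_names = {"SMITH", "DELACRUZ", "DELOSSANTOS"}

single = match_surname("Smith", census_names)
multi = match_surname("De La Cruz", census_names)
missing = match_surname("Unlisted Name", census_names)
```

Records for which the function returns `None` are the ones left over for additional steps or manual review.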

Working Around Complex Data Challenges

There are still often records that do not match, but massaging the source name data in this manner significantly increases the number of automatic matches that can be made on a large source dataset. Additional steps can then be employed to match the few remaining unmatched names, including manual intervention.

These processes can be implemented fairly quickly in a programming language such as Python, Java, or C#. In our case, we used Microsoft Visual Studio and C# to create a small desktop utility that reads in the source data files and produces new files containing the matched name information.

Approximating protected-class status is not without its trials and tribulations, but it can and should be undertaken to effect a complete fair lending analysis.

