
A Powerful Tool for Election Analysis

Last spring, the spread of COVID-19 forced lockdowns in many states in the United States, and disrupted election administration at the tail end of the Democratic presidential primaries. As we’ve discussed previously, the pandemic led to poll closures in many areas that disproportionately impacted underserved communities of color, and in fact led to decreased turnout in those areas.

Measuring racial disparities is of paramount importance given the pandemic-induced inequalities in electoral access, representation via redistricting, and changes in election law. However, it remains a challenge to identify where disparities by race might exist, because most voter registration rolls do not record race. Instead, more often than not, researchers must rely on ecological inference methods that carry wide uncertainty, or purchase costly proprietary data from other organizations.

The Issue at Hand

To improve election administration, and to keep these impacts from limiting access to the ballot box during the 2020 general election, researchers at the MIT Election Data + Science Lab, as part of the Stanford-MIT Healthy Elections Project, got to work identifying trends in voting methods and turnout. While most of this research was fairly straightforward, the team had difficulty analyzing the disparate impacts of different election administration policies by race, given the lack of racial identification on voter files.

Because of this, we were initially unable to track potential disparities in turnout, provisional ballots cast, mail ballots unreturned, and other problems that have historically afflicted the American voting experience. Of the states we analyzed as part of the Healthy Elections Project, only North Carolina, Florida, and Georgia reported race on the voter file, which left out important battlegrounds such as Wisconsin, Arizona, Pennsylvania, and Michigan.

Hark! A solution appears

We needed a way to identify racial disparities in order to conduct this timely research. The answer we landed upon was a method commonly used in public health called Bayesian Improved Surname Geocoding, otherwise known as “BISG.” BISG works on the general principle that surnames are correlated with different races depending on geography. For example, a Stacy Brown in Atlanta, Georgia is likely to be a different race than a Stacy Brown in Des Moines, Iowa. This tendency allows researchers to generate rough estimates of racial identity based on name and address by joining their data (such as voter files) to the U.S. Census surname list and to Census data on the racial makeup of each geographic area. This simple yet powerful breakthrough has enabled several key studies in race and politics, public health, and other fields that would not otherwise have been possible.
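To make the "Bayesian" part concrete, here is a minimal sketch of the update at the heart of BISG: the probability of each race given a surname is multiplied by the share of that group living in the area, and the result is renormalized. This assumes surname and geography are independent once race is known. The function name `bisg_posterior` and every number below are hypothetical illustrations, not Census figures and not the wru package's code.

```python
from typing import Dict


def bisg_posterior(p_race_given_surname: Dict[str, float],
                   p_geo_given_race: Dict[str, float]) -> Dict[str, float]:
    """Combine a surname-based prior with a geography likelihood via Bayes' rule."""
    # Unnormalized posterior: P(race | surname) * P(area | race)
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no race has nonzero probability in both tables")
    # Renormalize so the probabilities sum to 1
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical numbers for illustration only (two categories for brevity).
surname_prior = {"white": 0.6, "black": 0.4}
# Share of each group's total population that lives in the given area.
area_a = {"white": 0.01, "black": 0.04}
area_b = {"white": 0.02, "black": 0.002}

posterior_a = bisg_posterior(surname_prior, area_a)
posterior_b = bisg_posterior(surname_prior, area_b)
```

With these made-up inputs, the same surname leans one way in the first area and the other way in the second, which is the Stacy Brown intuition in numeric form.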

Possible pitfalls

While BISG is a powerful tool for researchers, it is not without its drawbacks. In its current implementation, BISG is slow and expensive. It requires researchers to geocode address data, that is, to place each address on a map to capture its physical location. This is very slow and painstaking work; in Georgia alone, running this process took over 24 hours. Geocoding is also very expensive, as it requires access to a proprietary geocoder such as the ESRI geocoder (which costs universities several thousand dollars per year) or an online geocoder such as Google ($5 per 1,000 matches). This makes BISG infeasible for many researchers and litigants in voting rights lawsuits. Similarly, using Census blocks (the current gold standard in geography for BISG) adds several hours to the process because of how many blocks there are in any given state.
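The per-match pricing adds up quickly at voter-file scale. As a rough sketch, using the $5 per 1,000 matches figure quoted above and a hypothetical file size (the one million addresses below is an assumption for illustration, not the size of any particular state's file):

```python
def geocoding_cost(n_addresses: int, usd_per_thousand: float = 5.0) -> float:
    """Estimated cost in USD of geocoding n_addresses at a per-1,000 rate."""
    return n_addresses / 1000 * usd_per_thousand


# Hypothetical file of one million addresses, for illustration only.
estimated_cost = geocoding_cost(1_000_000)
```

At that rate, a million addresses would come to $5,000 before accounting for any re-runs.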

So, what’s to be done?

Though BISG is a powerful tool, the geocoding-induced bottlenecks meant it simply wasn’t feasible on the scale that we needed. To overcome these limitations, MEDSL Research Scientist John Curiel, UNC graduate student Tyler Steelman, and I found a new way to implement BISG.

We noted that voter files already come with a piece of information that locates the voter fairly precisely: the ZIP code. Perhaps using the ZIP code, which comes with the file, instead of the Census block, which requires time-consuming and expensive geocoding of the entire address, would do the trick. Using this information, we wrote an addition to the existing wru R package currently used to conduct BISG, developed by Kosuke Imai and Kabir Khanna, that uses ZIP codes instead.
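The practical appeal of the ZIP code is that the join needs no geocoder at all, only some light cleanup of the ZIP field. Below is a minimal sketch of that idea, not the package's actual code: `normalize_zip`, `attach_zip_shares`, the record fields, and the table values are all hypothetical. One caveat worth keeping in mind is that the Census publishes data for ZIP Code Tabulation Areas (ZCTAs), which approximate ZIP codes rather than matching them exactly.

```python
from typing import Dict, List, Optional


def normalize_zip(raw: str) -> Optional[str]:
    """Reduce values like '30303-1234' or ' 2139' to a 5-digit ZIP, else None."""
    head = raw.strip().split("-")[0]  # drop any ZIP+4 suffix
    if not (head.isascii() and head.isdigit()) or len(head) > 5:
        return None
    return head.zfill(5)  # restore leading zeros lost in spreadsheets


def attach_zip_shares(voters: List[dict],
                      zip_table: Dict[str, Dict[str, float]]) -> List[dict]:
    """Keep voters whose cleaned ZIP appears in the ZIP-level table."""
    matched = []
    for voter in voters:
        zip5 = normalize_zip(voter["zip"])
        if zip5 is not None and zip5 in zip_table:
            matched.append({**voter, "zip5": zip5, "geo_shares": zip_table[zip5]})
    return matched


# Hypothetical ZIP-level table and voter records, for illustration only.
zip_table = {"30303": {"white": 0.01, "black": 0.04}}
voters = [
    {"surname": "Brown", "zip": "30303-1234"},
    {"surname": "Smith", "zip": "unknown"},
]
matched = attach_zip_shares(voters, zip_table)
```

Records with an unusable ZIP simply drop out here; a real analysis would want to count and report them rather than discard them silently.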

We tested both approaches to BISG on the Georgia voter file, which already contains racial data. The result? While blocks are still slightly more accurate than ZIP codes, our approach is on par with, or more accurate than, BISG implementations that use county or Census tract files. Most importantly, this new approach is incredibly fast. While it took us several days and two different types of software to run BISG on the Georgia voter file using the Imai and Khanna implementation, our new process took only 5 minutes using ZIP codes, and used a mere two lines of code!
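A file that already records race makes this kind of check possible, because predictions can be scored against what voters actually reported. The sketch below shows one simple way such a comparison could be scored: the share of records where the most probable predicted category matches the recorded one. This is an assumption for illustration; the post does not say which accuracy measure was used, and the function name and data are hypothetical.

```python
from typing import Dict, List


def top_category_accuracy(posteriors: List[Dict[str, float]],
                          recorded: List[str]) -> float:
    """Share of records whose most probable predicted race matches the record."""
    if len(posteriors) != len(recorded) or not recorded:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for probs, truth in zip(posteriors, recorded)
        if max(probs, key=probs.get) == truth
    )
    return hits / len(recorded)


# Hypothetical predictions and recorded values, for illustration only.
posteriors = [
    {"white": 0.27, "black": 0.73},
    {"white": 0.94, "black": 0.06},
    {"white": 0.55, "black": 0.45},
]
recorded = ["black", "white", "black"]
accuracy = top_category_accuracy(posteriors, recorded)
```

Running the same scoring on predictions built from blocks, tracts, counties, and ZIP codes would give a like-for-like comparison across geographies.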

To make BISG and Imai and Khanna’s groundbreaking R package more accessible to researchers and voting rights litigants, we’ve created a new package, zipWRUext, that implements BISG at the ZIP code level. The package is available on GitHub, and we hope to have it on the CRAN repository in the coming months. A more formal write-up of this technique is forthcoming in Political Analysis. If you’re interested in learning more about our approach, be on the lookout for that!

Jesse Clark graduated with a Ph.D. in political science from the Massachusetts Institute of Technology, where he was also a researcher for the MIT Election Lab. He is currently a postdoctoral research associate at Princeton University as part of the Princeton Electoral Innovation Lab.
