McCourt School of Public Policy
Massive Data Institute
Linkage Seminar

Fast Bayesian Record Linkage for Streaming Data – Andee Kaplan

On July 21, 2022, Andee Kaplan presented work on record linkage for streaming data. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as each new data file arrives. In this talk, we will approach the problem from a Bayesian perspective and present methods for updating link estimates after the arrival of a new file that are faster than refitting a joint model with each new data file. To accomplish this aim, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and present two methods to perform streaming updates. We examine the effect of the prior distribution on the resulting linkage accuracy, as well as the computational trade-offs between the methods when compared to a Gibbs sampler, using simulated and real-world survey panel data. With these streaming updates, we are able to achieve near-equivalent posterior inference at a small fraction of the compute time.

Andee Kaplan: Andee Kaplan is an assistant professor in the Department of Statistics at Colorado State University. Her research interests lie at the intersection of Bayesian methodology and statistical computing, particularly as applied to large social science and ecological problems with complex dependence and messy data structures. Prior to joining Colorado State University, Andee spent two years as a Postdoctoral Associate at Duke University after completing her Ph.D. in Statistics at Iowa State University. In her free time, Andee enjoys riding bikes and rock climbing.