Spatial, Graph, and Dependent Data Methodology

d-blink: Distributed End-to-End Bayesian Entity Resolution

Pages 406-421 | Received 02 Jan 2019, Accepted 11 Sep 2020, Published online: 19 Feb 2021
 

Abstract

Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naïve approach may induce significant error in the posterior. In this article, we propose a principled model for scalable Bayesian ER, called “distributed Bayesian linkage” or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six datasets—including a case study on the 2010 Decennial Census—demonstrate the scalability and effectiveness of our approach. Supplementary materials for this article are available online.
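To make the blocking idea concrete, the following is a minimal sketch of how median splits in a k-d tree yield balanced blocks, and of how much blocking reduces the number of candidate record pairs. It is a generic illustration, not the d-blink construction: the abstract does not describe the article's actual partitioning, and the numeric points, the function names `kd_blocks` and `n_pairs`, and the `depth` parameter are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kd_blocks(points: Sequence[Point], depth: int, axis: int = 0) -> List[List[Point]]:
    """Split points at the median of one coordinate, cycling through the axes.

    Returns up to 2**depth leaf blocks. Each split sends half of the points
    (rounded down) to the left child, so sibling block sizes differ by at most one.
    Hypothetical helper for illustration only.
    """
    if depth == 0 or len(points) <= 1:
        return [list(points)]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = (axis + 1) % len(ordered[0])
    left = kd_blocks(ordered[:mid], depth - 1, next_axis)
    right = kd_blocks(ordered[mid:], depth - 1, next_axis)
    return left + right


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n items; grows quadratically in n."""
    return n * (n - 1) // 2


# Toy data: 16 two-dimensional points (made up for this example).
points = [(float(i), float((7 * i) % 16)) for i in range(16)]
blocks = kd_blocks(points, depth=2)

all_pairs = n_pairs(len(points))
within_block_pairs = sum(n_pairs(len(b)) for b in blocks)
print([len(b) for b in blocks], all_pairs, within_block_pairs)
```

With 16 points and two levels of splitting, the sketch produces four blocks of four points each, so only 24 within-block pairs remain out of 120 possible pairs. Restricting comparisons to blocks in this way is what risks posterior error under naïve blocking, which is the problem the article addresses by making the partition part of the model.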

Supplementary Materials

Appendices: Includes proofs, further details about the experimental setup, and additional results. (PDF file)

Code: An implementation of d-blink in Apache Spark and a corresponding R interface. (Zip file)

Data: An archive containing datasets that we have permission to redistribute. (Zip file)

Acknowledgments

The authors would like to thank the anonymous reviewers, associate editor, and editor for their valuable comments and helpful suggestions.

Additional information

Funding

N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship and the AMSIIntern program hosted by the Australian Bureau of Statistics. R. C. Steorts and A. Kaplan acknowledge the support of NSF SES-1534412 and CAREER-1652431. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710. N. Marchant and B. Rubinstein also acknowledge support of Australian Bureau of Statistics project ABS2018.363.
