Spatial, Graph, and Dependent Data Methodology

d-blink: Distributed End-to-End Bayesian Entity Resolution

Pages 406-421 | Received 02 Jan 2019, Accepted 11 Sep 2020, Published online: 19 Feb 2021

