Theory and Methods

Multifile Partitioning for Record Linkage and Duplicate Detection

Pages 1786-1795 | Received 20 Sep 2020, Accepted 28 Nov 2021, Published online: 28 Jan 2022

References

  • Ball, P., and Price, M. (2019), “Using Statistics to Assess Lethal Violence in Civil and Inter-State War,” Annual Review of Statistics and Its Application, 6, 63–84. DOI: 10.1146/annurev-statistics-030718-105222.
  • Betancourt, B., Sosa, J., and Rodríguez, A. (2020), “A Prior for Record Linkage Based on Allelic Partitions,” arXiv:2008.10118.
  • Betancourt, B., Zanella, G., and Steorts, R. C. (2020), “Random Partition Models for Microclustering Tasks,” Journal of the American Statistical Association, 1–13. DOI: 10.1080/01621459.2020.1841647.
  • Bilenko, M., Mooney, R. J., Cohen, W. W., Ravikumar, P., and Fienberg, S. E. (2003), “Adaptive Name Matching in Information Integration,” IEEE Intelligent Systems, 18, 16–23. DOI: 10.1109/MIS.2003.1234765.
  • Binder, D. A. (1978), “Bayesian Cluster Analysis,” Biometrika, 65, 31–38. DOI: 10.1093/biomet/65.1.31.
  • Binette, O., and Steorts, R. C. (2020), “(Almost) All of Entity Resolution,” arXiv:2008.04443.
  • Bird, S. M., and King, R. (2018), “Multiple Systems Estimation (or Capture–Recapture Estimation) to Inform Public Policy,” Annual Review of Statistics and Its Application, 5, 95–118. DOI: 10.1146/annurev-statistics-031017-100641.
  • Enamorado, T., Fifield, B., and Imai, K. (2019), “Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records,” American Political Science Review, 113, 353–371. DOI: 10.1017/S0003055418000783.
  • Enamorado, T., and Steorts, R. C. (2020), “Probabilistic Blocking and Distributed Bayesian Entity Resolution,” in International Conference on Privacy in Statistical Databases, Cham: Springer, pp. 224–239.
  • Fellegi, I. P., and Sunter, A. B. (1969), “A Theory for Record Linkage,” Journal of the American Statistical Association, 64, 1183–1210. DOI: 10.1080/01621459.1969.10501049.
  • Fortini, M., Liseo, B., Nuccitelli, A., and Scanu, M. (2001), “On Bayesian Record Linkage,” Research in Official Statistics, 4, 185–198.
  • Herbei, R., and Wegkamp, M. H. (2006), “Classification With Reject Option,” Canadian Journal of Statistics, 34, 709–721. DOI: 10.1002/cjs.5550340410.
  • Hof, M. H., Ravelli, A. C., and Zwinderman, A. H. (2017), “A Probabilistic Record Linkage Model for Survival Data,” Journal of the American Statistical Association, 112, 1504–1515. DOI: 10.1080/01621459.2017.1311262.
  • Jaro, M. A. (1989), “Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida,” Journal of the American Statistical Association, 84, 414–420. DOI: 10.1080/01621459.1989.10478785.
  • Klami, A., and Jitta, A. (2016), “Probabilistic Size-Constrained Microclustering,” in Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, Arlington, VA: AUAI Press, pp. 329–338.
  • Larsen, M. D. (2005), “Advances in Record Linkage Theory: Hierarchical Bayesian Record Linkage Theory,” in Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 3277–3284.
  • Larsen, M. D., and Rubin, D. B. (2001), “Iterative Automated Record Linkage Using Mixture Models,” Journal of the American Statistical Association, 96, 32–41. DOI: 10.1198/016214501750332956.
  • Liseo, B., and Tancredi, A. (2011), “Bayesian Estimation of Population Size via Linkage of Multivariate Normal Data Sets,” Journal of Official Statistics, 27, 491–505.
  • Marchant, N. G., Kaplan, A., Elazar, D. N., Rubinstein, B. I., and Steorts, R. C. (2021), “d-Blink: Distributed End-to-End Bayesian Entity Resolution,” Journal of Computational and Graphical Statistics, 30, 406–421. DOI: 10.1080/10618600.2020.1825451.
  • Matsakis, N. E. (2010), “Active Duplicate Detection with Bayesian Nonparametric Models,” PhD thesis, Massachusetts Institute of Technology.
  • Meilă, M. (2007), “Comparing Clusterings – An Information Based Distance,” Journal of Multivariate Analysis, 98, 873–895. DOI: 10.1016/j.jmva.2006.11.013.
  • Miller, J., Betancourt, B., Zaidi, A., Wallach, H., and Steorts, R. C. (2015), “Microclustering: When the Cluster Sizes Grow Sublinearly With the Size of the Data Set,” arXiv:1512.00792.
  • Sadinle, M. (2014), “Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach,” The Annals of Applied Statistics, 8, 2404–2434. DOI: 10.1214/14-AOAS779.
  • Sadinle, M. (2017), “Bayesian Estimation of Bipartite Matchings for Record Linkage,” Journal of the American Statistical Association, 112, 600–612.
  • Sadinle, M., and Fienberg, S. E. (2013), “A Generalized Fellegi–Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems,” Journal of the American Statistical Association, 108, 385–397. DOI: 10.1080/01621459.2012.757231.
  • Steorts, R. C. (2015), “Entity Resolution With Empirically Motivated Priors,” Bayesian Analysis, 10, 849–875. DOI: 10.1214/15-BA965SI.
  • Steorts, R. C., Hall, R., and Fienberg, S. E. (2016), “A Bayesian Approach to Graphical Record Linkage and Deduplication,” Journal of the American Statistical Association, 111, 1660–1672. DOI: 10.1080/01621459.2015.1105807.
  • Tancredi, A., and Liseo, B. (2011), “A Hierarchical Bayesian Approach to Record Linkage and Size Population Problems,” Annals of Applied Statistics, 5, 1553–1585.
  • Tancredi, A., Steorts, R., and Liseo, B. (2020), “A Unified Framework for De-Duplication and Population Size Estimation” (with discussion), Bayesian Analysis, 15, 633–682. DOI: 10.1214/19-BA1146.
  • Tran, K.-N., Vatsalan, D., and Christen, P. (2013), “GeCo: An Online Personal Data Generator and Corruptor,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2473–2476.
  • Wade, S., and Ghahramani, Z. (2018), “Bayesian Cluster Analysis: Point Estimation and Credible Balls” (with discussion), Bayesian Analysis, 13, 559–626. DOI: 10.1214/17-BA1073.
  • Winkler, W. E. (1990), “String Comparator Metrics and Enhanced Decision Rules in the Fellegi–Sunter Model of Record Linkage,” in Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354–359.
  • Winkler, W. E. (1994), “Advanced Methods for Record Linkage,” in Proceedings of the Section on Survey Research Methods, Alexandria, VA: American Statistical Association, pp. 467–472.
  • Zanella, G., Betancourt, B., Miller, J. W., Wallach, H., Zaidi, A., and Steorts, R. C. (2016), “Flexible Models for Microclustering With Application to Entity Resolution,” in Advances in Neural Information Processing Systems, NY, USA: Curran Associates Inc., pp. 1417–1425.
