A Bayesian approach for de-duplication in the presence of relational data

Pages 197-215 | Received 15 Nov 2021, Accepted 18 Aug 2022, Published online: 08 Sep 2022

References

  • S. Aleshin-Guendel and M. Sadinle, Multifile partitioning for record linkage and duplicate detection, J. Am. Stat. Assoc. (2022), pp. 1–10. Available at https://doi.org/10.1080/01621459.2021.2013242.
  • M.J. Beal, Variational Algorithms for Approximate Bayesian Inference, University of London, University College London (United Kingdom), 2003.
  • B. Betancourt, J. Sosa, and A. Rodríguez, A prior for record linkage based on allelic partitions, Comput. Stat. Data. Anal. 172 (2022), pp. 107474.
  • B. Betancourt, G. Zanella, J.W. Miller, H. Wallach, A. Zaidi, and R.C. Steorts, Flexible models for microclustering with application to entity resolution, Adv. Neural. Inf. Process. Syst. 29 (2016), pp. 1417–1425.
  • B. Betancourt, G. Zanella, and R.C. Steorts, Random partition models for microclustering tasks, J. Am. Stat. Assoc. (2020), pp. 1–13. Available at https://doi.org/10.1080/01621459.2020.1841647.
  • A. Borg and M. Sariyar, RecordLinkage: Record Linkage in R, R package version 0.4-10, 2016.
  • T. Broderick and R.C. Steorts, Variational Bayes for merging noisy databases, preprint (2014). Available at arXiv:1410.4792.
  • G. Casella, E. Moreno, and F.J. Girón, Cluster analysis, model selection, and prior distributions on models, Bayesian Analysis 9 (2014), pp. 613–658.
  • T. Chen, E. Fox, and C. Guestrin, Stochastic gradient Hamiltonian Monte Carlo, International Conference on Machine Learning, PMLR, 2014, pp. 1683–1691.
  • P. Christen and A. Pudjijono, Accurate synthetic generation of realistic personal information, Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009, pp. 507–514.
  • P. Christen and D. Vatsalan, Flexible and extensible generation and corruption of personal data, Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ACM, 2013, pp. 1165–1168.
  • H. Crane, The ubiquitous Ewens sampling formula, Stat. Sci. 31 (2016), pp. 1–19.
  • P. Domingos, Multi-relational record linkage, Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, Citeseer, 2004.
  • T. Enamorado and R.C. Steorts, Probabilistic blocking and distributed Bayesian entity resolution, International Conference on Privacy in Statistical Databases, Springer, 2020, pp. 224–239.
  • T. Ferguson, A Bayesian analysis of some nonparametric problems, Ann. Stat. 1 (1973), pp. 209–230.
  • D. Gamerman and H.F. Lopes, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, CRC Press, Boca Raton, FL, 2006.
  • M.S. Handcock, A.E. Raftery, and J.M. Tantrum, Model-based clustering for social networks, J. R. Stat. Soc.: Ser. A (Stat. Soc.) 170 (2007), pp. 301–354.
  • P.D. Hoff, A.E. Raftery, and M.S. Handcock, Latent space approaches to social network analysis, J. Am. Stat. Assoc. 97 (2002), pp. 1090–1098.
  • S. Jain and R.M. Neal, A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model, J. Comput. Graph. Stat. 13 (2004), pp. 158–182.
  • M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul, An introduction to variational methods for graphical models, Mach. Learn. 37 (1999), pp. 183–233.
  • P.N. Krivitsky and M.S. Handcock, Fitting latent cluster models for networks with latentnet, J. Stat. Softw. 24 (2008), pp. 1–23.
  • J.W. Lau and P.J. Green, Bayesian model-based clustering procedures, J. Comput. Graph. Stat. 16 (2007), pp. 526–558.
  • N.G. Marchant, A. Kaplan, D.N. Elazar, B.I. Rubinstein, and R.C. Steorts, d-blink: Distributed end-to-end Bayesian entity resolution, J. Comput. Graph. Stat. 30 (2021), pp. 406–421.
  • P. McCullagh and J. Yang, Stochastic classification models, International Congress of Mathematicians, Vol. 3, Citeseer, 2006, pp. 72–145.
  • J. Miller, B. Betancourt, A. Zaidi, H. Wallach, and R.C. Steorts, Microclustering: When the cluster sizes grow sublinearly with the size of the data set, preprint (2015). Available at arXiv:1512.00792.
  • J.W. Miller and M.T. Harrison, Mixture models with a prior on the number of components, J. Am. Stat. Assoc. 113 (2018), pp. 340–356.
  • P. Müller and A. Rodríguez, Nonparametric Bayesian Inference, Institute of Mathematical Statistics, 2013. Available at https://imstat.org/overview/.
  • A. Narayanan and V. Shmatikov, De-anonymizing social networks, 2009 30th IEEE Symposium on Security and Privacy, IEEE, 2009, pp. 173–187.
  • R.M. Neal, Markov chain sampling methods for Dirichlet process mixture models, J. Comput. Graph. Stat. 9 (2000), pp. 249–265.
  • R.M. Neal, MCMC using Hamiltonian dynamics, Handbook of Markov Chain Monte Carlo, Vol. 2, 2011, pp. 114–162.
  • J.S. Rosenthal, Optimal proposal distributions and adaptive MCMC, Handbook of Markov Chain Monte Carlo, 2011.
  • M. Sadinle, Detecting duplicates in a homicide registry using a Bayesian partitioning approach, Ann. Appl. Stat. 8 (2014), pp. 2404–2434.
  • M. Sadinle and S.E. Fienberg, A generalized Fellegi–Sunter framework for multiple record linkage with application to homicide record systems, J. Am. Stat. Assoc. 108 (2013), pp. 385–397.
  • L.K. Saul, T. Jaakkola, and M.I. Jordan, Mean field theory for sigmoid belief networks, J. Artif. Intell. Res. 4 (1996), pp. 61–76.
  • J. Sethuraman, A constructive definition of Dirichlet priors, Stat. Sin. 4 (1994), pp. 639–650.
  • A.L. Smith, D.M. Asta, and C.A. Calder, The geometry of continuous latent space models for network data, Stat. Sci. 34 (2019), pp. 428–453.
  • J. Sosa and L. Buitrago, A review of latent space models for social networks, Revista Colombiana De Estadística 44 (2021), pp. 171–200.
  • J. Sosa and A. Rodríguez, A record linkage model incorporating relational data, preprint (2018). Available at arXiv:1808.04511.
  • R.C. Steorts, Entity resolution with empirically motivated priors, Bayesian Analysis 10 (2015), pp. 849–875.
  • R.C. Steorts, R. Hall, and S.E. Fienberg, A Bayesian approach to graphical record linkage and deduplication, J. Am. Stat. Assoc. 111 (2016), pp. 1660–1672.
  • R.C. Steorts, S.L. Ventura, M. Sadinle, and S.E. Fienberg, A comparison of blocking methods for record linkage, International Conference on Privacy in Statistical Databases, Springer, 2014, pp. 253–268.
  • A. Tancredi, R. Steorts, and B. Liseo, A unified framework for de-duplication and population size estimation (with discussion), Bayesian Analysis 15 (2020), pp. 633–682.
  • H. Wallach, S. Jensen, L. Dicker, and K. Heller, An alternative prior process for nonparametric Bayesian clustering, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 892–899.