General

Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org

Pages 370-380 | Received 03 Oct 2022, Accepted 04 Mar 2023, Published online: 13 Apr 2023

References

  • Bagga, A., and Baldwin, B. (1998), “Algorithms for Scoring Coreference Chains,” in The First International Conference on Language Resources and Evaluation (Vol. 1), pp. 563–566.
  • Bailey, M. J., Cole, C., Henderson, M. A., and Massey, C. G. (2017), How Well Do Automated Methods Perform in Historical Samples?: Evidence From New Ground Truth, Cambridge, MA: National Bureau of Economic Research.
  • Balsmeier, B., Assaf, M., Chesebro, T., Fierro, G., Johnson, K., Johnson, S., Li, G.-C., Lück, S., O’Reagan, D., Yeh, B., et al. (2018), “Machine Learning and Natural Language Processing on the Patent Corpus: Data, Tools, and New Measures,” Journal of Economics & Management Strategy, 27, 535–553. DOI: 10.1111/jems.12259.
  • Barnes, M. (2015), “A Practitioner’s Guide to Evaluating Entity Resolution Results,” pp. 1–6. arXiv e-prints, arXiv:1509.04238. https://arxiv.org/pdf/1509.04238.pdf
  • Belin, T. R., and Rubin, D. B. (1995), “A Method for Calibrating False-Match Rates in Record Linkage,” Journal of the American Statistical Association, 90, 694–707. DOI: 10.1080/01621459.1995.10476563.
  • Bilenko, M., and Mooney, R. (2003), “On Evaluation and Training-Set Construction for Duplicate Detection,” Proceedings of the KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7–12.
  • Binette, O., Madhavan, S., Butler, J., Card, B. A., Melluso, E., and Jones, C. (2023), “PatentsView-Evaluation: Evaluation Datasets and Tools to Advance Research on Inventor Name Disambiguation,” arXiv:2301.03591.
  • Binette, O., and Steorts, R. C. (2022), “(Almost) all of Entity Resolution,” Science Advances, 8, eabi8021. DOI: 10.1126/sciadv.abi8021.
  • Choudhury, P., and Kim, D. Y. (2019), “The Ethnic Migrant Inventor Effect: Codification and Recombination of Knowledge across Borders,” Strategic Management Journal, 40, 203–229. DOI: 10.1002/smj.2977.
  • Christen, P. (2012), Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Berlin, Heidelberg: Springer-Verlag.
  • Christen, P. (2019), “Data Linkage: The Big Picture,” Harvard Data Science Review, 2. DOI: 10.1162/99608f92.84deb5c4.
  • Christen, P., and Goiser, K. (2007), “Quality and Complexity Measures for Data Linkage and Deduplication,” Studies in Computational Intelligence, 43, 127–151.
  • Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., and Stefanidis, K. (2021), “An Overview of End-to-End Entity Resolution for Big Data,” ACM Computing Surveys, 53, 1–2. DOI: 10.1145/3418896.
  • Cochran, W. G. (1977), Sampling Techniques, New York: Wiley.
  • Doherr, T. (2021), “Disambiguation by Namesake Risk Assessment,” ZEW-Centre for European Economic Research Discussion Paper (21-021).
  • Dong, X. L., and Srivastava, D. (2015), Big Data Integration, San Rafael, CA: Morgan and Claypool Publishers.
  • Draisbach, U., and Naumann, F. (2013), “On Choosing Thresholds for Duplicate Detection,” Proceedings of the 18th International Conference on Information Quality, ICIQ 2013.
  • Enamorado, T., Fifield, B., and Imai, K. (2019), “Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records,” American Political Science Review, 113, 353–371. DOI: 10.1017/S0003055418000783.
  • Fellegi, I. P., and Sunter, A. B. (1969), “A Theory for Record Linkage,” Journal of the American Statistical Association, 64, 1183–1210. DOI: 10.1080/01621459.1969.10501049.
  • Ferreira, A. A., Gonçalves, M. A., and Laender, A. H. (2012), “A Brief Survey of Automatic Methods for Author Name Disambiguation,” ACM Sigmod Record, 41, 15–26. DOI: 10.1145/2350036.2350040.
  • Frisoli, K., and Nugent, R. (2018), “Exploring the Effect of Household Structure in Historical Record Linkage of Early 1900s Ireland Census Records,” in Proceedings of the 2018 IEEE International Conference on Data Mining Workshops, pp. 502–509. IEEE.
  • Fuller, W. A. (2011), Sampling Statistics, Hoboken, NJ: Wiley.
  • Gu, G., Lee, S., and Kim, J. (2008), “Matching Accuracy of the Lee-Kim-Marschke Computer Matching Program,” SUNY Albany Working Paper.
  • Han, H., Yu, Y., Wang, L., Zhai, X., Ran, Y., and Han, J. (2019), “Disambiguating USPTO Inventor Names with Semantic Fingerprinting and DBSCAN Clustering,” The Electronic Library, 37, 225–239. DOI: 10.1108/EL-12-2018-0232.
  • Hand, D., and Christen, P. (2018), “A Note on Using the F-Measure for Evaluating Record Linkage Algorithms,” Statistics and Computing, 28, 539–547. DOI: 10.1007/s11222-017-9746-6.
  • Herzog, T., Scheuren, F., and Winkler, W. (2007), Data Quality and Record Linkage Techniques, New York: Springer.
  • Horvitz, D. G., and Thompson, D. J. (1952), “A Generalization of Sampling Without Replacement from a Finite Universe,” Journal of the American Statistical Association, 47, 663–685. DOI: 10.1080/01621459.1952.10483446.
  • Ilyas, I. F., and Chu, X. (2019), Data Cleaning, New York: Association for Computing Machinery.
  • Kim, K., Khabsa, M., and Giles, C. L. (2016), “Random Forest DBSCAN for USPTO Inventor Name Disambiguation,” arXiv:1602.01792.
  • Li, G. C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., Yu, A. Z., and Fleming, L. (2014), “Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database (1975–2010),” Research Policy, 43, 941–955.
  • Maidasani, H., Namata, G., Huang, B., and Getoor, L. (2012), “Entity Resolution Evaluation Measures,” Technical Report, University of Maryland.
  • Marchant, N. G., and Rubinstein, B. I. (2017), “In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling,” Proceedings of the VLDB Endowment, 10, 1322–1333. DOI: 10.14778/3137628.3137642.
  • McVeigh, B. S., Spahn, B. T., and Murray, J. S. (2019), “Scaling Bayesian Probabilistic Record Linkage with Post-hoc Blocking: An Application to the California Great Registers,” arXiv:1905.05337.
  • Menestrina, D., Whang, S. E., and Garcia-Molina, H. (2010), “Evaluating Entity Resolution Results,” Proceedings of the VLDB Endowment, 3, 208–219. DOI: 10.14778/1920841.1920871.
  • Michelson, M., and Macskassy, S. A. (2009), “Record Linkage Measures in An Entity Centric World,” Proceedings of the 4th Workshop on Evaluation Methods for Machine Learning.
  • Monath, N., Jones, C., and Madhavan, S. (2021), “PatentsView: Disambiguating Inventors, Assignees, and Locations,” Technical report, American Institutes for Research, Arlington, VA.
  • Monath, N., Kobren, A., Krishnamurthy, A., Glass, M. R., and McCallum, A. (2019), “Scalable Hierarchical Clustering with Tree Grafting,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1438–1448. DOI: 10.1145/3292500.3330929.
  • Morrison, G., Riccaboni, M., and Pammolli, F. (2017), “Disambiguation of Patent Inventors and Assignees Using High-Resolution Geolocation Data,” Scientific Data, 4, 1–21. DOI: 10.1038/sdata.2017.64.
  • Müller, M.-C. (2017), “Semantic Author Name Disambiguation with Word Embeddings,” in Proceedings of the 21st International Conference on Theory and Practice of Digital Libraries, pp. 300–311, Springer.
  • Newcombe, H. B., Kennedy, J. M., Axford, S. J., and James, A. P. (1959), “Automatic Linkage of Vital Records,” Science, 130, 954–959. DOI: 10.1126/science.130.3381.954.
  • Papadakis, G., Ioannou, E., Thanos, E., and Palpanas, T. (2021), The Four Generations of Entity Resolution, San Rafael, CA: Morgan & Claypool Publishers.
  • Sariyar, M., and Borg, A. (2022), RecordLinkage: Record Linkage Functions for Linking and Deduplicating Data Sets. R package version 0.4-12.3.
  • Särndal, C.-E., Swensson, B., and Wretman, J. (2003), Model Assisted Survey Sampling, New York: Springer.
  • Tam, D., Monath, N., Kobren, A., Traylor, A., Das, R., and McCallum, A. (2019), “Optimal Transport-based Alignment of Learned Character Representations for String Similarity,” arXiv:1907.10165.
  • Toole, A. A., Jones, C., and Madhavan, S. (2021), “PatentsView: An Open Data Platform to Advance Science and Technology Policy,” USPTO Economic Working Paper No. 2021-1. Available at SSRN: https://ssrn.com/abstract=3874213, DOI: 10.2139/ssrn.3874213.
  • Trajtenberg, M., and Shiff, G. (2008), Identification and Mobility of Israeli Patenting Inventors, Pinhas Sapir.
  • Traylor, A., Monath, N., Das, R., and McCallum, A. (2017), “Learning String Alignments for Entity Aliases,” in Proceedings of the 31st Conference on Neural Information Processing Systems.
  • Ventura, S. L., Nugent, R., and Fuchs, E. R. (2013), “Methods Matter: Rethinking Inventor Disambiguation with Classification & Labeled Inventor Records,” in Academy of Management Proceedings (Vol. 2013), Academy of Management Briarcliff Manor, NY 10510.
  • Ventura, S. L., Nugent, R., and Fuchs, E. R. (2015), “Seeing the Non-stars: (Some) Sources of Bias in Past Disambiguation Approaches and a New Public Tool Leveraging Labeled Records,” Research Policy, 44, 1672–1701. DOI: 10.1016/j.respol.2014.12.010.
  • Wang, T., Lin, H., Fu, C., Han, X., Sun, L., Xiong, F., Chen, H., Lu, M., and Zhu, X. (2022), “Bridging the Gap between Reality and Ideality of Entity Matching: A Revisiting and Benchmark Re-construction,” arXiv:2205.05889.
  • Yang, G.-C., Liang, C., Jing, Z., Wang, D.-R., and Zhang, H.-C. (2017), “A Mixture Record Linkage Approach for US Patent Inventor Disambiguation,” in Advanced Multimedia and Ubiquitous Engineering, pp. 331–338, Springer.
