Original Articles

Playing with matches: An assessment of accuracy in linked historical data

 

ABSTRACT

This article evaluates linkage quality achieved by various record linkage techniques used in historical demography. The author creates benchmark, or truth, data by linking the 2005 Current Population Survey Annual Social and Economic Supplement to the Social Security Administration's numeric identification system by social security number. By comparing simulated linkages to the benchmark data, she examines the value added (in terms of number and quality of links) from incorporating text-string comparators, adjusting age, and using a probabilistic matching algorithm. She finds that text-string comparators and probabilistic approaches are useful for increasing the linkage rate, but use of text-string comparators may decrease accuracy in some cases. Overall, probabilistic matching offers the best balance between linkage rates and accuracy.

Acknowledgements

The author thanks the U.S. Census Bureau's Center for Administrative Records Research and Applications for enabling this research. For their comments and feedback, she also thanks Adela Luque, Amy O'Hara, Ann Carlos, Joelle Abramowitz, J. Trent Alexander, Maggie Jones, and Sonya Porter. This article is released to inform interested parties of research and to encourage discussion. The views expressed are those of the author and not necessarily those of the U.S. Census Bureau.

Notes

1. PII refers to any information that can identify an individual, such as name, date of birth, and birthplace.

2. Wisselgren and colleagues employed name standardization techniques similar to those used by the MPC (see Vick and Huynh 2011). They linked records across censuses using standardized names, parish of birth, year of birth, and residence. After editing names and using household information in the match, the percentage of confirmed records increased to 98.3%.

3. Goeken and colleagues (2011) compared the households of married males in the 1870 census to their households in 1880 and determined that only 8 of 3,609 males were linked to different households in 1880 (Goeken et al. 2011, 12). They also looked at brothers in the 1870 census who were young enough to have been enumerated with their parents in 1880. They found that only 2.0% of brothers were linked to the wrong household (Goeken et al. 2011, 12).

4. Like historical data, much of the Numident was first collected on paper and later transcribed. From 1936 to 1972, the SSA used paper copies of all SS-5 applications. In 1972, the SSA created the Numident file to record SSNs electronically. Legacy SS-5 forms were digitized between 1973 and 1979 (Puckett 2009). The Numident and CPS ASEC are still arguably of higher quality than historical data because they likely suffered from fewer problems with legibility and image or scanning quality, and they are self-reported at a time when literacy rates are high. Furthermore, individuals can correct their information with the SSA throughout their lifetimes, so there is likely less error in the Numident than in historical data. Title 13, Section 6, Titles 5, 12, and 42 of the U.S. Code give SSA authority to share the Numident with the Census Bureau.

5. The 2005 CPS ASEC is the most recent CPS file that collected SSNs. Beginning in 2006, the CPS stopped collection of SSNs.

6. Verification does not require exact agreement of names or date of birth. I use the original names and age provided in the CPS in the record linkage results that follow. For a discussion of the SSN verification process of the Census Bureau, see Deborah Wagner and Mary Layne (2014).

7. SSN was provided by 56,945 (27.0%) of respondents in the 2005 CPS ASEC. Of these SSNs, 52,634 (92.4%) were verified with the SSA Numident data. The number of verified SSNs may be low because the respondent may not know the SSN of each member of their household. Household heads and spouses make up the majority of men in the verified sample, which is consistent with the hypothesis that respondents may know only their own or their spouse's SSN.

8. There was only one respondent missing first name, 12 missing last name, and zero missing age.

9. This results in the removal of 23 observations.

10. Coding names phonetically is a technique largely necessitated by computational limitations in earlier record linkage attempts. In the analysis that follows, I do not find that NYSIIS codes introduce a significant amount of error. If anything, their use results in slightly more accurate linkages.

11. I remove all individuals who are not unique on phonetically coded name, age, and birthplace following Abramitzky and colleagues (2012). Ferrie (1996) allows no more than ten identical name combinations, regardless of age or birthplace.
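
As a rough illustration of the uniqueness rule in this note, the sketch below keeps only records whose combination of phonetic name code, age, and birthplace occurs exactly once. The function name `keep_unique`, the field names, and the placeholder codes are assumptions made for this example; they are not taken from the article's data or software.

```python
from collections import Counter

def keep_unique(records, key_fields=("first_code", "last_code", "age", "birthplace")):
    """Return only the records whose key-field combination occurs exactly once."""
    def key(rec):
        return tuple(rec[f] for f in key_fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Placeholder phonetic codes ("CODE_A", "CODE_B"), not real NYSIIS output.
people = [
    {"id": 1, "first_code": "CODE_A", "last_code": "CODE_B", "age": 50, "birthplace": "TX"},
    {"id": 2, "first_code": "CODE_A", "last_code": "CODE_B", "age": 50, "birthplace": "TX"},
    {"id": 3, "first_code": "CODE_A", "last_code": "CODE_B", "age": 50, "birthplace": "AL"},
]
unique_people = keep_unique(people)
```

Here records 1 and 2 share every key field and are both dropped, while record 3 differs on birthplace and is retained.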

12. Self-reported year of birth was collected in the 1900 and 1910 U.S. decennial censuses. Age at last birthday is available when year of birth is not.

13. Ferrie (1996) drops all potential matches with differences in age of less than 5 or greater than 15 years. More recent papers, such as Abramitzky and colleagues (2012), use one- to five-year bands.
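
An age band of this kind can be sketched as a simple check on how far the observed change in age departs from the time elapsed between the two sources. The function `within_age_band` and its parameters are hypothetical, and the ten-year gap used in the example is an assumption chosen so that a tolerance of five reproduces the 5-to-15 range described above.

```python
def within_age_band(age_first, age_second, years_between, tolerance):
    """True if the observed change in age is within `tolerance` years of the elapsed time."""
    observed_change = age_second - age_first
    return abs(observed_change - years_between) <= tolerance

# Assumed ten-year gap between sources; tolerance of 5 keeps age differences of 5 to 15.
wide_band_ok = within_age_band(20, 25, years_between=10, tolerance=5)
wide_band_reject = within_age_band(20, 36, years_between=10, tolerance=5)
# A one-year band is much stricter.
narrow_band_reject = within_age_band(20, 32, years_between=10, tolerance=1)
```

A wider tolerance admits more candidate pairs at the cost of more ambiguous or incorrect ones, which is the trade-off the narrower bands in later work address.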

14. I find little difference between accuracy and match rates produced when using truncated first name in addition to phonetically coded first name versus using phonetically coded first name alone.

15. The discrepancies between the CPS NYSIIS codes and the NYSIIS codes of the true match in the Numident may have resulted from misreporting or keying errors, as well as the name standardizer process.

16. The MPC achieved a match rate of 3% for foreign-born males between the 1870 and 1880 censuses (MPC 2010). Avery Guest (1987) achieved a match rate of 39.4% across the 1880 and 1900 censuses. Thomas Maloney (2001) achieved a 58% match rate between white men living in Cincinnati in the 1920 Census and WWI selective service registration records.

17. Alaska, Colorado, Maine, Maryland, Massachusetts, Mississippi, New York, North Carolina, Pennsylvania, Rhode Island, Tennessee, Virginia, West Virginia, and Wyoming do not report date of death in the Numident.

18. See Brian A'Hearn, Jörg Baten, and Dorothee Crayen (2009); John Budd and Timothy Guinnane (1991); Douglas Ewbank (1981); and Edward Stockwell and Jerry Wicks (1974) for more information on the effects of misreported age and age heaping on demographic analyses. Age misreporting is particularly high for African Americans (Coale and Rives 1973; Elo and Preston 1994).

19. Consider two fictitious people, John Smith and Jon Smith (both born in 1955 in Texas). If you observe John Smith in the CPS ASEC and both John Smith and Jon Smith in the Numident, then no match would be found for John Smith in the CPS ASEC using phonetic codes because it is impossible to distinguish between the two potential links. If using string comparators, then John Smith in the CPS ASEC would be linked to John Smith in the Numident. However, if John Smith was misspelled as “Jon Smith” in the CPS ASEC, then this keying error would lead to an erroneous link to Jon Smith in the Numident.
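
The John Smith and Jon Smith scenario can be sketched in a few lines. Two simplifications are assumptions of this example: the phonetic codes are hard-coded placeholders rather than the output of a real NYSIIS routine, and the string comparator is Python's `difflib.SequenceMatcher` ratio standing in for a comparator such as Jaro-Winkler. Birth year and state are omitted because they agree for both candidates.

```python
from difflib import SequenceMatcher

# Placeholder phonetic codes for illustration only (not computed by NYSIIS).
PHONETIC = {"JOHN": "J_N", "JON": "J_N", "SMITH": "SM_TH"}

def phonetic_key(first, last):
    return (PHONETIC[first], PHONETIC[last])

def link_by_phonetic_code(query, candidates):
    """Return the only candidate sharing the query's phonetic key, else None."""
    hits = [c for c in candidates if phonetic_key(*c) == phonetic_key(*query)]
    return hits[0] if len(hits) == 1 else None

def similarity(a, b):
    """Stand-in string comparator (difflib ratio), not Jaro-Winkler."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()

def link_by_string_comparator(query, candidates):
    """Return the most similar candidate, or None if the top score is tied."""
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1] if scored else None

numident = [("JOHN", "SMITH"), ("JON", "SMITH")]

# Phonetic codes cannot separate the two candidates, so no link is made.
phonetic_link = link_by_phonetic_code(("JOHN", "SMITH"), numident)
# The string comparator prefers the exact spelling.
comparator_link = link_by_string_comparator(("JOHN", "SMITH"), numident)
# If John was keyed as "JON" in the survey, the comparator links to Jon instead.
miskeyed_link = link_by_string_comparator(("JON", "SMITH"), numident)
```

The last call shows the risk described in the note: the comparator confidently returns a link, but it is the wrong person whenever the spelling difference comes from a keying error rather than a true difference in names.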

20. To get a sense of the impact of this choice, I used the 1850–80 IPUMS linked representative sample of men in 1850 and ran the Jaro-Winkler match to the full-count 1880 Census, first by dropping any record with more than one potential match based on JW scores for first and last name, and then by using age as a tiebreaker. I found that the matches using age as a tiebreaker increased the percentage of cases that did not agree with the IPUMS match by 4.05 percentage points.
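
The two ways of handling multiple potential matches compared in this note might be sketched as follows. The function `resolve`, the tuple layout, and the scores are hypothetical; the name scores are assumed to have been computed already by some comparator, and the exact tie-handling rules of the article's procedure are not reproduced here.

```python
def resolve(candidates, expected_age, use_age_tiebreak):
    """Pick one candidate or return None.

    candidates: list of (record_id, name_score, age) tuples.
    Without the tiebreak, any tie on the best name score yields no link.
    With it, the tied candidate closest to expected_age wins, if unique.
    """
    if not candidates:
        return None
    best = max(score for _, score, _ in candidates)
    tied = [c for c in candidates if c[1] == best]
    if len(tied) == 1:
        return tied[0][0]
    if not use_age_tiebreak:
        return None
    closest = min(abs(c[2] - expected_age) for c in tied)
    nearest = [c for c in tied if abs(c[2] - expected_age) == closest]
    return nearest[0][0] if len(nearest) == 1 else None

# Two candidates with identical (made-up) name scores but different ages.
candidates = [("a", 0.95, 40), ("b", 0.95, 43)]
dropped = resolve(candidates, expected_age=40, use_age_tiebreak=False)
tiebroken = resolve(candidates, expected_age=40, use_age_tiebreak=True)
```

The tiebreak produces a link where the stricter rule produces none, which raises the match rate but also admits links that rest on age alone.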

21. There are 25 men in the 1880 Census with “WALAN” as their NYSIIS first name and “ATRY” as their NYSIIS last name. Only two of these were born in Alabama.

22. I believe a 2% error rate relative to Goeken and colleagues' sample is reasonable, given that I started with a sample of men who were already unique enough to classify as a link under their matching algorithm. I believe the false match error rate would be significantly higher for a match beginning with the full population of men in 1850 to 1880.

23. Race was not collected for anyone enumerated by the SSA at birth beginning in 1987 unless they applied for changes to their SSN later in life (e.g., name changes).

24. To make the CPS ASEC detailed race codes match those of the Numident, I linked the CPS ASEC to the Numident by SSN and compared the detailed race codes to the race codes in the Numident. I recoded each detailed race in the CPS ASEC to match the race most often associated with that detailed race code in the Numident (by looking at a cross-tabulation of detailed race in the CPS ASEC and race in the Numident).
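
The recoding described in this note amounts to taking, for each detailed code, the most frequent code it is paired with in the linked data. A minimal sketch follows, with the function name `build_recode_map` and the codes ("D1", "A", and so on) invented for illustration rather than drawn from either file.

```python
from collections import Counter, defaultdict

def build_recode_map(pairs):
    """Map each detailed code to the target code it co-occurs with most often.

    pairs: iterable of (detailed_code, target_code) from SSN-linked records.
    """
    crosstab = defaultdict(Counter)
    for detailed, target in pairs:
        crosstab[detailed][target] += 1
    return {d: counts.most_common(1)[0][0] for d, counts in crosstab.items()}

# Hypothetical linked pairs: (detailed survey code, administrative code).
linked_pairs = [("D1", "A"), ("D1", "A"), ("D1", "B"), ("D2", "B")]
recode_map = build_recode_map(linked_pairs)
```

In this sketch a tie between target codes is resolved in favor of the one seen first; the article does not say how ties, if any, were handled.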

25. I follow Jason Long and Joseph Ferrie's (2013) approach to construct the weights.
