Abstract
New large-scale linked data are revolutionizing quantitative history and demography. This paper proposes two complementary strategies for improving inference with linked historical data: the use of validation variables to identify higher quality links and a simple, regression-based weighting procedure to increase the representativeness of custom research samples. We demonstrate the potential value of these strategies using the 1850–1930 Integrated Public Use Microdata Series Linked Representative Samples (IPUMS-LRS)—a high quality, publicly available linked historical dataset. We show that, while incorrect linking rates appear low in the IPUMS-LRS, researchers can reduce error rates further using validation variables. We also show how researchers can reweight linked samples to balance observed characteristics in the linked sample with those in a reference population using a simple regression-based procedure.
Acknowledgements
We are grateful to George Alter, Trent Alexander, Katie Genadek, Alfia Karimova, Maggie Levenstein, Evan Roberts, and Steve Ruggles for their helpful suggestions and comments. We are also grateful to Sarah Anderson, Garrett Anstreicher, Ali Doxey, Meizi Li, and Mike Ricks for their many contributions to the LIFE-M project and assistance with this analysis.
Notes
1 See, for instance, early-life public health initiatives (Alsan and Goldin 2019; Cutler and Miller 2005), exposures to environmental pollutants (Clay, Lewis, and Severnini 2016) and animal diseases (Rhode and Olmstead 2015), and access to medicines (Bleakley 2007). Other examples include the long-run effects of exposure to human capital initiatives through Rosenwald schools (Mazumder and Aaronson 2011).
2 Ongoing and proposed projects are linking national surveys, administrative data, and research samples to recently digitized historical records, such as the full-count 1880 (Ruggles 2002; Ruggles et al. 2015) and 1940 U.S. Censuses (the latter being the first U.S. census to ask about education and wage income) and newly available administrative sources. The Census Bureau plans to link the 1940 Census to current administrative and census data (Census Longitudinal Infrastructure Project, CLIP), and the Minnesota Population Center plans to link it to other historical censuses. The Panel Study of Income Dynamics (PSID) and the Health and Retirement Study (HRS) are linking their respondents to the 1940 Census. The Longitudinal, Intergenerational Family Electronic Micro-Database Project (LIFE-M) is linking vital records to the 1940 Census (Bailey et al. 2019). Supplementing these public infrastructure projects, entrepreneurial researchers have also combined large datasets. See, for example, Abramitzky, Platt Boustan, and Eriksson (2012, 2013, 2014), Boustan, Kahn, and Rhode (2012), Hornbeck and Naidu (2014), Mill (2013), Mill and Stein (2016), Aizer et al. (2016), Bleakley and Ferrie (2013, 2014, 2016), Nix and Qian (2015), Collins and Wanamaker (2016), and Eli, Salisbury, and Shertzer (2016).
3 FEBRL is record linking software developed by the ANU Data Mining Group and the Centre for Epidemiology and Research in the New South Wales Department of Health. See Christen and Churches (2005) for more information.
4 We perform this restriction on the data ex post because we only have access to the finished IPUMS-LRS matches. However, Abramitzky, Platt Boustan, and Eriksson (2012, 2014), as described in Bailey et al. (forthcoming), perform this restriction before running their matching algorithm.
5 The MPC did use parental birthplace when linking the 1900, 1910, 1920, and 1930 Census samples to the 1880 full count Census.
6 Data quality issues prior to 1880 are the reason that the MPC did not use this variable in the matching process for 1850–1870. For these years, parent birthplaces can only be inferred from individuals living at home with their parents. Furthermore, relationships within a household in those years are not listed by Census takers, and need to be inferred from the order in which individuals are listed in the Census and the ages of individuals. In Appendix I, we demonstrate that, although parent birthplace is clearly measured with error, patterns of parental birthplace disagreement between individuals living at home with their parents and those not living at home are similar in the years after 1880. Therefore, assuming that the imputed household relationships are accurate in the years prior to 1880, this evidence suggests that parent birthplace disagreement patterns for children living at home might be similar to parent birthplace disagreements for people who are not living at home with their parents.
7 Appendix I provides more indirect evidence to demonstrate the relevance of parent birthplace disagreement as a validation variable without using hand-linked data.
8 It is worth noting that hand-linked data are not “true” matches. Human error in matching may also produce false matches or fail to capture all “true” matches. Given the dearth of longitudinal historical data, we have no direct test of the effectiveness of matching by hand.
9 For completeness, we also considered other age bands, including a one-year and three-year age band in addition to the two-year age band in Table 1. The larger the band, the more observations tend to be dropped from consideration, but the Type I error rate also falls.
10 Researchers use name cleaning algorithms to adjust exact names for errors in transcription, recording, and changes in phonetic spelling. For more background on these algorithms, see Bailey et al. (forthcoming).
11 In 1850 and 1860, African-American slaves were enumerated separately under a slave schedule.
12 These findings also hold in more traditional t-tests. Notably, we reject the null hypothesis of equality of means among the variables not included by the MPC roughly 63 percent of the time across all samples. See Appendix III for the full set of results. Note also that if the weights addressed all issues with the representativeness of the data, such differences in other variables should not arise.
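As an illustration of the regression-based reweighting mentioned in the abstract and in note 12, the following is a minimal sketch rather than the paper's procedure, whose exact specification is not given in this section. It assumes inverse propensity weighting: a logistic regression of link status on observed characteristics, with each linked record weighted by the inverse of its predicted link probability. The data are simulated, and all names (foreign_born, link_weights, fit_logit) are hypothetical.

```python
import numpy as np


def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton's method (X includes a constant column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def link_weights(X, linked):
    """Return inverse-probability-of-linking weights for the linked records.

    Assumption: the reference population is the full set of records
    (linked and unlinked), so each linked record gets weight 1 / p_link.
    """
    beta = fit_logit(X, linked)
    p_link = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return 1.0 / p_link[linked == 1]


# Simulated reference population (hypothetical): one binary characteristic
# that lowers the chance of being linked, so the linked sample
# under-represents it.
rng = np.random.default_rng(0)
n = 5000
foreign_born = (rng.random(n) < 0.3).astype(float)
link_prob = np.where(foreign_born == 1, 0.2, 0.5)
linked = (rng.random(n) < link_prob).astype(float)

# Regress link status on a constant and the observed characteristic.
X = np.column_stack([np.ones(n), foreign_born])
weights = link_weights(X, linked)

pop_mean = foreign_born.mean()
unweighted_mean = foreign_born[linked == 1].mean()
weighted_mean = np.average(foreign_born[linked == 1], weights=weights)

print(f"population share:      {pop_mean:.3f}")
print(f"linked, unweighted:    {unweighted_mean:.3f}")
print(f"linked, reweighted:    {weighted_mean:.3f}")
```

With a single binary covariate the model is saturated, so the reweighted share matches the population share up to numerical precision. With richer covariates, balance holds only approximately and only for characteristics included in the regression, which is the caveat raised in note 12.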