Simple strategies for improving inference with linked data: a case study of the 1850–1930 IPUMS linked representative historical samples

Pages 80-93 | Published online: 31 Oct 2019
 

Abstract

New large-scale linked data are revolutionizing quantitative history and demography. This paper proposes two complementary strategies for improving inference with linked historical data: the use of validation variables to identify higher-quality links and a simple, regression-based weighting procedure to increase the representativeness of custom research samples. We demonstrate the potential value of these strategies using the 1850–1930 Integrated Public Use Microdata Series Linked Representative Samples (IPUMS-LRS), a high-quality, publicly available linked historical dataset. We show that, while incorrect linking rates appear low in the IPUMS-LRS, researchers can reduce error rates further using validation variables. We also show how researchers can reweight linked samples to balance observed characteristics in the linked sample with those in a reference population using a simple regression-based procedure.
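As a rough illustration of the reweighting idea in the abstract, the sketch below fits a regression for the probability that a record in a reference population is linked, then weights each linked record by the inverse of its fitted probability. Everything here is an assumption made for illustration: the single covariate `foreign_born`, the invented counts, the logistic form fitted by plain gradient ascent, and the 1/p weight. The abstract does not state the exact specification, so this should not be read as the authors' procedure.

```python
import math

# Hypothetical reference population of (foreign_born, linked) pairs.
# 60 native-born records (18 linked) and 40 foreign-born records (4 linked).
population = (
    [(0, 1)] * 18 + [(0, 0)] * 42 +
    [(1, 1)] * 4 + [(1, 0)] * 36
)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(rows, steps=3000, rate=1.0):
    """Fit P(linked | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood (an illustrative estimator choice)."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, linked in rows:
            resid = linked - sigmoid(b0 + b1 * x)
            g0 += resid
            g1 += resid * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


b0, b1 = fit_logit(population)

# Keep only linked records and weight each by 1 / fitted link probability.
linked_x = [x for x, linked in population if linked == 1]
weights = [1.0 / sigmoid(b0 + b1 * x) for x in linked_x]

population_share = sum(x for x, _ in population) / len(population)
unweighted_share = sum(linked_x) / len(linked_x)
weighted_share = sum(w * x for w, x in zip(weights, linked_x)) / sum(weights)

print(f"population share foreign-born: {population_share:.3f}")
print(f"linked sample, unweighted:     {unweighted_share:.3f}")
print(f"linked sample, reweighted:     {weighted_share:.3f}")
```

In this toy example the foreign-born are linked less often, so they are underrepresented in the unweighted linked sample; the weights restore their share to that of the reference population. Balance is only guaranteed on covariates included in the regression.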

Acknowledgements

We are grateful to George Alter, Trent Alexander, Katie Genadek, Alfia Karimova, Maggie Levenstein, Evan Roberts, and Steve Ruggles for their helpful suggestions and comments. We are also grateful to Sarah Anderson, Garrett Anstreicher, Ali Doxey, Meizi Li, and Mike Ricks for their many contributions to the LIFE-M project and assistance with this analysis.

Notes

1 See, for instance, early-life public health initiatives (Alsan and Goldin 2019; Cutler and Miller 2005), exposures to environmental pollutants (Clay, Lewis, and Severnini 2016) and animal diseases (Rhode and Olmstead 2015), and access to medicines (Bleakley 2007). Other examples include the long-run effects of exposure to human capital initiatives through Rosenwald schools (Mazumder and Aaronson 2011).

2 Ongoing and proposed projects are linking national surveys, administrative data, and research samples to recently digitized historical records, such as the full-count 1880 (Ruggles 2002; Ruggles et al. 2015) and 1940 U.S. Censuses (the first U.S. census to ask about education and wage income) and newly available administrative sources. The Census Bureau plans to link the 1940 Census to current administrative and census data (Census Longitudinal Infrastructure Project, CLIP), and the Minnesota Population Center (MPC) plans to link it to other historical censuses. The Panel Study of Income Dynamics (PSID) and the Health and Retirement Study (HRS) are linking their respondents to the 1940 Census. The Longitudinal, Intergenerational Family Electronic Micro-Database Project (LIFE-M) is linking vital records to the 1940 Census (Bailey et al. 2019). Supplementing these public infrastructure projects, entrepreneurial researchers have also combined large datasets. See, for example, Abramitzky, Platt Boustan, and Eriksson (2012, 2013, 2014), Boustan, Kahn, and Rhode (2012), Hornbeck and Naidu (2014), Mill (2013), Mill and Stein (2016), Aizer et al. (2016), Bleakley and Ferrie (2013, 2014, 2016), Nix and Qian (2015), Collins and Wanamaker (2016), and Eli, Salisbury, and Shertzer (2016).

3 FEBRL is record-linkage software developed by the ANU Data Mining Group and the Centre for Epidemiology and Research in the New South Wales Department of Health. See Christen and Churches (2005) for more information.

4 We apply this restriction to the data ex post because we only have access to the finished IPUMS-LRS matches. By contrast, Abramitzky, Platt Boustan, and Eriksson (2012, 2014), as described in Bailey et al. (forthcoming), apply this restriction before running their matching algorithm.

5 The MPC did use parental birthplace when linking the 1900, 1910, 1920, and 1930 Census samples to the 1880 full count Census.

6 Data quality issues prior to 1880 are the reason that the MPC did not use this variable in the matching process for 1850–1870. For these years, parent birthplaces can only be inferred from individuals living at home with their parents. Furthermore, relationships within a household in those years are not listed by Census takers, and need to be inferred from the order in which individuals are listed in the Census and the ages of individuals. In Appendix I, we demonstrate that, although parent birthplace is clearly measured with error, patterns of parental birthplace disagreement between individuals living at home with their parents and those not living at home are similar in the years after 1880. Therefore, assuming that the imputed household relationships are accurate in the years prior to 1880, this evidence suggests that parent birthplace disagreement patterns for children living at home might be similar to parent birthplace disagreements for people who are not living at home with their parents.

7 Appendix I provides more indirect evidence to demonstrate the relevance of parent birthplace disagreement as a validation variable without using hand-linked data.

8 It is worth noting that hand-linked data are not “true” matches. Human error in matching may also produce false matches or fail to capture all “true” matches. Given the dearth of longitudinal historical data, we have no direct test of the effectiveness of matching by hand.

9 For completeness, we also considered other age bands, including a one-year and three-year age band in addition to the two-year age band in Table 1. The larger the band, the more observations tend to be dropped from consideration, but the Type I error rate also falls.

10 Researchers use name-cleaning algorithms to adjust exact names for errors in transcription, recording, and changes in phonetic spelling. For more background on these algorithms, see Bailey et al. (forthcoming).

11 In 1850 and 1860, African-American slaves were enumerated separately under a slave schedule.

12 These findings also hold in more traditional t-tests. Notably, we reject the null hypothesis of equality of means among the variables not included by the MPC roughly 63 percent of the time across all samples. See Appendix III for the full set of results. Note also that if the weights addressed all issues with the representativeness of the data, these differences in other variables should not arise.
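Notes 5 to 7 treat parent birthplace disagreement as a validation variable: a field not used in matching that should agree across the two records of a correct link. A minimal sketch of that check follows. The record layout, the field names `father_bpl_first` and `father_bpl_second`, and the four example links are all hypothetical and invented for illustration; they are not the IPUMS-LRS variable names or data.

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    """One hypothetical linked pair with father's birthplace from each record."""
    link_id: int
    father_bpl_first: Optional[str]   # as reported in the earlier record
    father_bpl_second: Optional[str]  # as reported in the later record


def agreement(link: Link) -> Optional[bool]:
    """True/False when both values are observed, None when either is missing."""
    if link.father_bpl_first is None or link.father_bpl_second is None:
        return None
    return link.father_bpl_first == link.father_bpl_second


links: List[Link] = [
    Link(1, "Ohio", "Ohio"),
    Link(2, "Ireland", "Ohio"),
    Link(3, None, "Germany"),
    Link(4, "Germany", "Germany"),
]

# Only links with the validation variable observed on both sides can be checked.
checked = [l for l in links if agreement(l) is not None]
disagreement_rate = sum(1 for l in checked if not agreement(l)) / len(checked)

# A stricter research sample keeps only links the validation variable confirms.
validated_ids = [l.link_id for l in checked if agreement(l)]

print(f"checked {len(checked)} of {len(links)} links")
print(f"disagreement rate among checked links: {disagreement_rate:.2f}")
print(f"validated link ids: {validated_ids}")
```

Because parent birthplace is itself measured with error (note 6), a disagreement is evidence of a questionable link rather than proof of a false one, and restricting to validated links shrinks the sample.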

Additional information

Funding

This project was generously supported by the National Science Foundation under grant SMA 1539228, the National Institute on Aging under grants R21 AG05691201 and R01 AG057704, the University of Michigan Population Studies Center Small Grants under grant R24 HD041028, the Michigan Center for the Demography of Aging under grant P30 AG012846-21, the University of Michigan Associate Professor Fund, and the Michigan Institute on Research and Teaching in Economics (MITRE). We gratefully acknowledge the use of the services and facilities of the Population Studies Center at the University of Michigan under grant R24 HD041028. During work on this project, Cole was supported by the NICHD under grant T32 HD0007339 as a UM Population Studies Center Trainee.
