Abstract
This paper describes the creation of the Longitudinal, Intergenerational Family Electronic Micro-Database (LIFE-M), a new data resource linking vital records and decennial censuses for millions of individuals and families living in the late 19th and 20th centuries in the United States. This combination of records provides a life-course and intergenerational perspective on the evolution of health and economic outcomes. Vital records also enable the linkage of women, because they contain a crosswalk between women’s birth (i.e., “maiden”) and married names. We describe (1) the data sources, coverage, and linking sequence; (2) the process and supervised machine-learning methods used to link records longitudinally and across generations; and (3) the resulting linked samples, including linking rates, representativeness, and weights.
Notes
1 Historical linking to create longitudinal data has been done extensively outside the United States. For the United Kingdom (U.K.), the Cambridge Group for the History of Population and Social Structure hosts several different datasets, primarily representing different areas and time periods for England. The most widely cited database is the family reconstitution data for 26 English parishes (Wrigley et al. 2018), which has been used to conduct individual-level studies of fertility, mortality and nuptiality. The Victorian Panel Study links vital and census records from 1851 to 1901 in Great Britain (Schürer 2007). Data from Sweden are linked longitudinally from 1650 to 1950 across full-count censuses, emigration records and death records (Wisselgren et al. 2014; Berger et al. 2023). Available through the Swedpop project, these links have been used in multiple studies and include women as well as men (Dribe, Eriksson, and Scalone 2019). The Scanian Economic Demographic Database covers the entire population of five rural and semi-urban parishes and an industrializing town in southern Sweden between 1815 and 2017 (Dribe and Quaranta 2020; Bengtsson and Dribe 2021). For Canada, the BALSAC database covers the population from an area in Quebec from 1621 to 1965 (Vézina and Bournival 2020). The Canadian Peoples contains over 40 million linked census records, representing three generations from the mid-19th century to early 20th century (Foxcroft, Inwood, and Antonie 2022). Researchers have also linked records for the U.K. (Long and Ferrie 2013) and Norway (Modalsli 2017, 2023) as well as between Norway and the U.S. (Abramitzky, Boustan, and Eriksson 2013; Biavaschi and Elsner 2013). We focus our discussion on the U.S. databases that are most closely related to LIFE-M.
2 This large, linked sample follows two earlier linking projects. Guest (1987) created a national sample of men in the 1880 Census linked to the 1900 Census (N = 4,014, linkage rate 39.4%). Ferrie (1996) linked a nationally representative sample of men in the 1850 Census to the 1860 Census (N = 4,938, linkage rate 19.3%).
3 Mohammed and Mohnen (2023) also use a subset of the linked dataset used in Bailey, Mohammed, and Mohnen (2022) to study the impact of Rosenwald schools on labor market outcomes for both men and women.
4 The NLS has subsequently tracked supplemental samples. One covers people ages 14–22 in 1979 (N = 12,686), along with the children of the women in this survey, and another covers people ages 12–16 in 1996 (N = 9,000).
5 A variety of independent administrative and restricted data sources offer a third type of longitudinal, intergenerational data. The National Longitudinal Mortality Study (NLMS) links the Current Population Surveys and other records to death certificates to examine the relationship of demographic and socio-economic characteristics with mortality rates. These large microdata samples (N > 340,000 deaths) generally link individuals ages 50 and older to demographic and socio-economic information in the CPS from about age 40. Researchers have also conducted labor-intensive hand-linkages across censuses (Ferrie 1996; Guest 1987; Long and Ferrie 2013; Collins and Wanamaker 2014, 2015, 2022; Bleakley and Ferrie 2013, 2014, 2016). Many of these linked samples are the property of the researchers who collected or linked them and are not available for public use. Lack of access to these data and substantial barriers to creating such samples limit replication, new research using these data, and analyses of data quality.
6 LIFE-M links more than 170,000 Black Americans and more than 368,000 foreign-born people.
7 Age misreporting is common in the census (e.g., there are many more 50- and 60-year-olds than 51- and 63-year-olds) and on marriage certificates, where ages are misstated to circumvent minimum age requirements (Blank, Charles, and Sallee 2009). Age misreporting is more common for Black Americans (Elo and Preston 1994; Logan and Parman 2011).
8 Multiple matches have been so problematic that past work has eliminated common names entirely from samples to be linked (Ferrie 1996; Ruggles 2002).
9 We use the terms “birth family” and “marriage family” to distinguish between when someone is a child (birth family) and when they are married or a parent (marriage family).
10 Completed education is first available in the 1940 Census; literacy is available in censuses prior to 1940.
11 These refer to the full-count censuses for the entire United States.
12 The project also tracked and provided trainers with feedback on their speed, which was determined using the metadata collected from time-stamped uploads and downloads of each batch from the distribution system. Tracking trainer speed helped minimize training costs due to inattention. Increasing accuracy also minimized training costs by reducing the number of records sent for discrepancy review.
13 This is due to name misspellings, incomplete names (e.g., nicknames, initials), transposed first and middle names, and other idiosyncrasies in historical records. The recording of age in the census tends to reflect “age heaping,” the common practice of rounding ages to the nearest multiple of five (A’Hearn, Baten, and Crayen 2009; Hacker 2013).
14 “Linkability” is determined by the completeness of name and birth year and is described in the notes of Table 4.
15 Linking with 97% precision means the error rate is only 3%. For the 1940 Census and death records, we can also link with higher error rates of 5% and 10%. The advantage of a higher error rate is more links and thus larger samples. However, the samples increase in size by, at most, a few hundred thousand.
16 LIFE-M links more than 170,000 Black Americans and more than 368,000 foreign-born people.
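The trade-off in note 15 between the error rate and the number of links can be illustrated with a minimal sketch. It assumes a hypothetical hand-reviewed sample of 50 candidate links ranked from highest to lowest match score; the function name, the sample, and the ranks of the incorrect links are illustrative assumptions and are not taken from LIFE-M or its linking method.

```python
def links_at_max_error(labels_by_rank, max_error_pct):
    """Largest number of top-ranked links whose error rate is at most max_error_pct.

    labels_by_rank: review verdicts (True = correct link), ordered from the
    highest to the lowest match score. Precision is 100 minus the error rate.
    """
    best = 0
    errors = 0
    for n, is_correct in enumerate(labels_by_rank, start=1):
        if not is_correct:
            errors += 1
        # Integer form of errors / n <= max_error_pct / 100.
        if errors * 100 <= max_error_pct * n:
            best = n
    return best


# Hypothetical review sample: 50 candidate links, with incorrect links
# at these 1-based ranks (invented for illustration only).
WRONG_RANKS = {34, 38, 41, 44, 46, 48, 50}
labels = [rank not in WRONG_RANKS for rank in range(1, 51)]

for max_error in (3, 5, 10):
    n_links = links_at_max_error(labels, max_error)
    print(f"max error {max_error}% (precision {100 - max_error}%): {n_links} links")
```

In this toy sample, accepting a higher error rate admits more links (37, 40, and 45 links at 3%, 5%, and 10%), but each additional increment of error buys only a modest gain in sample size, which mirrors the pattern described in note 15.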
Wrigley, E. A., R. S. Davies, J. E. Oeppen, and R. S. Schofield. 2018. 26 English parish family reconstitutions. Colchester, Essex: UK Data Archive.
Schürer, K. 2007. Focus: Creating a nationally representative individual and household sample for Great Britain, 1851 to 1901—The Victorian Panel Study (VPS). Historical Social Research/Historische Sozialforschung 32 (2):211–331.
Wisselgren, M. J., S. Edvinsson, M. Berggren, and M. Larsson. 2014. Testing methods of record linkage on Swedish censuses. Historical Methods: A Journal of Quantitative and Interdisciplinary History 47 (3):138–51. doi: 10.1080/01615440.2014.913967.
Berger, T., P. Engzell, B. Eriksson, and J. Molinder. 2023. Social Mobility in Sweden before the Welfare State. The Journal of Economic History 83 (2):431–463. doi: 10.1017/S0022050723000098.
Dribe, M., B. Eriksson, and F. Scalone. 2019. Migration, marriage and social mobility: Women in Sweden 1880–1900. Explorations in Economic History 71:93–111. doi: 10.1016/j.eeh.2018.09.003.
Dribe, M., and L. Quaranta. 2020. The Scanian Economic-Demographic Database (SEDD). Historical Life Course Studies 9:158–72. doi: 10.51964/hlcs9302.
Bengtsson, T., and M. Dribe. 2021. The Long Road to Health and Prosperity, Southern Sweden, 1765–2015. Research Contributions From the Scanian Economic-Demographic Database (SEDD). Historical Life Course Studies 11:74–96. doi: 10.51964/hlcs10941.
Vézina, H., and J.-S. Bournival. 2020. An overview of the BALSAC population database: Past developments, current state and future prospects. Historical Life Course Studies 9:114–29. doi: 10.51964/hlcs9299.
Foxcroft, J., K. Inwood, and L. Antonie. 2022. Linking eight decades of Canadian census collections. International Journal of Population Data Science 7 (3):2076. doi: 10.23889/ijpds.v7i3.2076.
Long, J., and J. Ferrie. 2013. Intergenerational occupational mobility in Great Britain and the United States since 1850. American Economic Review 103 (4):1109–37. doi: 10.1257/aer.103.4.1109.
Modalsli, J. 2017. Intergenerational mobility in Norway, 1865–2011. The Scandinavian Journal of Economics 119 (1):34–71. doi: 10.1111/sjoe.12196.
Modalsli, J. 2023. Multigenerational persistence: Evidence from 146 years of administrative data. Journal of Human Resources 58 (3):929–961. doi: 10.3368/jhr.59.1.1018-9825R2.
Abramitzky, R., L. P. Boustan, and K. Eriksson. 2013. Have the poor always been less likely to migrate? Evidence from inheritance practices during the age of mass migration. Journal of Development Economics 102:2–14. doi: 10.1016/j.jdeveco.2012.08.004.
Biavaschi, C., and B. Elsner. 2013. Let’s be selective about migrant self-selection. IZA Discussion Paper 7865.
Guest, A. M. 1987. Notes from the National Panel Study: Linkage and migration in the late nineteenth century. Historical Methods: A Journal of Quantitative and Interdisciplinary History 20 (2):63–77. doi: 10.1080/01615440.1987.9955260.
Ferrie, J. P. 1996. A new sample of males linked from the 1850 public use micro sample of the federal census of population to the 1860 federal census manuscript schedules. Historical Methods: A Journal of Quantitative and Interdisciplinary History 29 (4):141–56. doi: 10.1080/01615440.1996.10112735.
Mohammed, A. R. S., and P. Mohnen. 2023. Black economic progress in the Jim Crow South: Evidence from Rosenwald schools. Working paper.
Bailey, M. J., A. R. S. Mohammed, and P. Mohnen. 2022. U.S. educational mobility in the early twentieth century. Working paper.
Collins, W. J., and M. H. Wanamaker. 2014. Selection and economic gains in the great migration of African Americans: New evidence from linked census data. American Economic Journal: Applied Economics 6 (1):220–52. doi: 10.1257/app.6.1.220.
Collins, W. J., and M. H. Wanamaker. 2015. The great migration in black and white: New evidence on the selection and sorting of southern migrants. The Journal of Economic History 75 (4):947–92. doi: 10.1017/S0022050715001527.
Collins, W. J., and M. H. Wanamaker. 2022. African American intergenerational economic mobility since 1880. American Economic Journal: Applied Economics 14 (3):84–117. doi: 10.1257/app.20170656.
Bleakley, H., and J. Ferrie. 2013. Up from poverty? The 1832 Cherokee land lottery and the long-run distribution of wealth. NBER Working Paper 19175.
Bleakley, H., and J. Ferrie. 2014. Land openings on the Georgia frontier and the Coase theorem in the short- and long-run. Working paper.
Bleakley, H., and J. Ferrie. 2016. Shocking behavior: Random wealth in antebellum Georgia and human capital across generations. The Quarterly Journal of Economics 131 (3):1455–95. doi: 10.1093/qje/qjw014.
Blank, R. M., K. K. Charles, and J. M. Sallee. 2009. A cautionary tale about the use of administrative data: Evidence from age of marriage laws. American Economic Journal: Applied Economics 1 (2):128–49. doi: 10.1257/app.1.2.128.
Elo, I. T., and S. H. Preston. 1994. Estimating African-American mortality from inaccurate data. Demography 31 (3):427–58. doi: 10.2307/2061751.
Logan, T., and J. M. Parman. 2011. Race, socioeconomic status, and mortality in the 20th century: Evidence from the Carolinas. University of Michigan Population Studies Center Working Paper SC Research Report No. 11–739.
Ruggles, S. 2002. Linking historical censuses: A new approach. History and Computing 14 (1–2):213–24. doi: 10.3366/hac.2002.14.1-2.213.
A’Hearn, B., J. Baten, and D. Crayen. 2009. Quantifying quantitative literacy: Age heaping and the history of human capital. The Journal of Economic History 69 (3):783–808. doi: 10.1017/S0022050709001120.
Hacker, J. D. 2013. New estimates of census coverage in the United States, 1850–1930. Social Science History 37 (1):71–101.
Additional information
Funding
This project was generously supported by the National Science Foundation (SMA1539228), the National Institute on Aging (R21AG05691201), the University of Michigan Population Studies Center Small Grants (R24HD041028), the Michigan Center for the Demography of Aging (MiCDA, P30 AG012846-21), the University of Michigan Associate Professor Fund, and the Michigan Institute on Research and Teaching in Economics (MITRE). We gratefully acknowledge the use of the Population Studies Center’s services and facilities at the University of Michigan (R24HD041028). The study team gratefully acknowledges the use of the services and facilities of the Population Studies Center at the UM (P2CHD041028) and the California Center for Population Research at the UCLA (P2CHD041022). We are grateful to Dora Costa, Shari Eli, Adriana Lleras-Muney, Joseph Price, and the board members of the LIFE-M project, including Eytan Adar, George Alter, Hoyt Bleakley, Matias Cattaneo, William Collins, Katie Genadek, Maggie Levenstein, Bhash Mazumder, Evan Roberts, and Steven Ruggles for their helpful suggestions. We are also grateful to Garrett Anstreicher, Sarah Anderson, Meizi Li, Morgan Henderson, Alfia Karimova, Catherine Massey, and Annie Wentz for their excellent contributions to the LIFE-M project and assistance with this project.