Linking individuals across historical sources: A fully automated approach*. Historical Methods: A Journal of Quantitative and Interdisciplinary History, Vol. 53, No. 2

Abstract

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we propose a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I errors (false positives) and type II errors (false negatives). The first step guides researchers in choosing which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that any two records correspond to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those obtained using IPUMS's method, which relies on hand linking to create a training sample. We provide R code and a Stata command that implement this method.

Notes

1 Recent examples include Abramitzky, Boustan, and Eriksson (2012, 2013, 2014, 2016), Aizer et al. (2016), Bleakley and Ferrie (2016, 2013), Collins and Wanamaker (2014, 2015, 2017), Eli, Salisbury, and Shertzer (2018), Eriksson (2018), Feigenbaum (2016b, 2017), Ferrie (1997), Fouka (2016), Long (2006), Long and Ferrie (2013), Hornbeck and Naidu (2014), Mill, Stein, and Race (2016), Kosack and Ward (2014), Modalsli (2017), Parman (2015), Pérez (2017), and Salisbury (2014).

2 Another variable that could potentially be used in linking is race. However, using this variable could be problematic if individuals selectively report a different race in different historical sources, a pattern documented in Mill, Stein, and Race (2016) and Nix and Qian (2015).

3 A related decision is how to map numerical distances (for instance, age differences) into a distance metric. We usually use the absolute difference in reported age, but we note that other distance metrics are also possible (for instance, an indicator that takes a value of one if both ages agree and is zero otherwise).
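The two age metrics mentioned in note 3 can be written down in a few lines. This is a minimal sketch; the function names and the example ages are illustrative assumptions, not part of the authors' R or Stata code.

```python
def abs_age_distance(age_a: int, age_b: int) -> int:
    """Absolute difference between two reported ages, in years."""
    return abs(age_a - age_b)


def exact_age_agreement(age_a: int, age_b: int) -> int:
    """Indicator metric: 1 if both ages agree exactly, 0 otherwise."""
    return 1 if age_a == age_b else 0


# Hypothetical pair of records whose implied ages are 34 and 36.
print(abs_age_distance(34, 36))     # 2
print(exact_age_agreement(34, 36))  # 0
```

The absolute difference keeps the information that a two-year gap is closer than a ten-year gap, whereas the indicator treats both as plain disagreement.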

4 Recent economic history papers use the NYSIIS algorithm. Other examples of phonetic algorithms include Soundex (Odell and Russell 1918) and Metaphone (Philips 1990). Some phonetic algorithms are better suited for dealing with languages other than English. For example, the Spanish Metaphone algorithm is designed to match Spanish names (Mosquera, Lloret, and Moreda 2012).

5 For instance, “James Tennes” and “James Thomas” have the same NYSIIS code, but the Jaro-Winkler distance between “Tennes” and “Thomas” is 0.4.
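The Jaro-Winkler figure in note 5 can be reproduced with a short, self-contained implementation. The sketch below assumes the common textbook definition: a prefix scaling factor of 0.1, a prefix of at most four characters, and no minimum-similarity threshold before the prefix bonus is applied. Whether the authors' software uses exactly these settings is an assumption; the function names are illustrative.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_distance(s1: str, s2: str,
                          prefix_scale: float = 0.1,
                          max_prefix: int = 4) -> float:
    """1 minus the Jaro-Winkler similarity; 0 means identical strings."""
    s1, s2 = s1.lower(), s2.lower()
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # Winkler adjustment: reward agreement on the first few characters.
    similarity = jaro + prefix * prefix_scale * (1.0 - jaro)
    return 1.0 - similarity


print(round(jaro_winkler_distance("Tennes", "Thomas"), 3))  # 0.4
```

Under these assumptions only "T" and "S" match, so the Jaro similarity is about 0.556, the shared first letter raises it to 0.6, and the distance is 0.4, as in the note.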

6 The general EM algorithm was described in Dempster, Laird, and Rubin (1977). The specific use of the EM algorithm for record linkage problems was developed by Winkler (1989). For a Bayesian approach to record linkage problems see Larsen (2005).
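To make the role of the EM algorithm in record linkage concrete, the following is a minimal sketch of EM for a two-class mixture (matches and non-matches). It assumes binary agree/disagree indicators for each field and conditional independence across fields given match status. These assumptions, the starting values, and the synthetic counts are illustrative choices and not necessarily the specification used in the paper.

```python
from itertools import product
from math import prod


def pattern_likelihood(pattern, agree_probs):
    """P(pattern | class), assuming fields are independent within a class."""
    return prod(q if g else 1.0 - q for g, q in zip(pattern, agree_probs))


def em_linkage(counts, n_fields, n_iter=200):
    """counts maps an agreement pattern (tuple of 0/1) to its number of pairs.

    Returns the match share p, agreement probabilities among matches (m) and
    non-matches (u), and the posterior match probability of each pattern.
    """
    p, m, u = 0.05, [0.8] * n_fields, [0.2] * n_fields  # assumed start values
    total = sum(counts.values())
    posterior = {}
    for _ in range(n_iter):
        # E-step: probability that a pair with pattern g is a true match.
        for g in counts:
            match = p * pattern_likelihood(g, m)
            nonmatch = (1.0 - p) * pattern_likelihood(g, u)
            posterior[g] = match / (match + nonmatch)
        # M-step: re-estimate parameters as posterior-weighted shares.
        w_match = sum(n * posterior[g] for g, n in counts.items())
        w_non = total - w_match
        p = w_match / total
        m = [sum(n * posterior[g] * g[k] for g, n in counts.items()) / w_match
             for k in range(n_fields)]
        u = [sum(n * (1.0 - posterior[g]) * g[k] for g, n in counts.items()) / w_non
             for k in range(n_fields)]
    return p, m, u, posterior


# Synthetic expected counts from assumed "true" parameters (not real census data).
true_p, true_m, true_u = 0.1, [0.9, 0.9, 0.9], [0.1, 0.1, 0.1]
counts = {
    g: 10_000 * (true_p * pattern_likelihood(g, true_m)
                 + (1 - true_p) * pattern_likelihood(g, true_u))
    for g in product((0, 1), repeat=3)
}
p_hat, m_hat, u_hat, posterior = em_linkage(counts, n_fields=3)
print(round(posterior[(1, 1, 1)], 3), round(posterior[(0, 0, 0)], 3))
```

The sketch works on counts of agreement patterns rather than on individual pairs, which keeps the loop small even when the number of candidate pairs is large. It does not guard against degenerate cases in which an estimated probability reaches exactly 0 or 1.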

7 This reasoning is analogous to the one that indicates that to estimate the probability of heads for a coin using N tosses, one just needs to have information on the number of tosses that resulted in heads. So, instead of storing N numbers, it is enough to know just one.
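The coin analogy in note 7 can be checked directly: the estimate computed from the full sequence of tosses equals the estimate computed from the number of heads alone. The toss sequence below is made up for illustration.

```python
def heads_probability_from_tosses(tosses):
    """Estimate P(heads) from the full list of outcomes (1 = heads, 0 = tails)."""
    return sum(tosses) / len(tosses)


def heads_probability_from_count(n_heads, n_tosses):
    """Same estimate using only the sufficient statistic: the number of heads."""
    return n_heads / n_tosses


tosses = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # hypothetical sequence with N = 10
print(heads_probability_from_tosses(tosses))             # 0.6
print(heads_probability_from_count(sum(tosses), len(tosses)))  # 0.6
```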

8 We impose this symmetry condition because linking historical censuses is an example of one-to-one linking. Imposing this condition prevents situations in which a record b in B is the best candidate for a record a in A, but the best candidate for b in B is a different record a’ in A.
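The symmetry condition in note 8 amounts to keeping a pair only when each record is the other's best candidate. A minimal sketch follows, assuming the estimated probabilities are stored in a dictionary keyed by (a, b); the function name, the data layout, and the example values are hypothetical.

```python
def mutual_best_links(match_prob):
    """Return pairs (a, b) in which b is a's best candidate and a is b's best.

    match_prob maps (a, b) to the estimated probability that the two records
    refer to the same individual. Ties are broken by first occurrence.
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), prob in match_prob.items():
        if a not in best_for_a or prob > match_prob[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or prob > match_prob[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)


# Hypothetical probabilities: b1 is the best candidate for a1,
# but b1's own best candidate is a2.
probs = {("a1", "b1"): 0.90, ("a2", "b1"): 0.95, ("a2", "b2"): 0.40}
print(mutual_best_links(probs))  # [('a2', 'b1')]
```

In the example, a1 is left unlinked rather than being attached to a record that is better explained by a2, which is the situation the condition is meant to rule out.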

9 The sample is restricted to whites because slaves, who constituted the majority of the US black population at the time, were not individually listed in the 1850 population census.

10 Place of birth corresponded to states in the case of the US, to municipalities in the case of Norway, and to country of birth for the foreign born in both countries.

11 This method is described in detail in Goeken et al. (2011).

12 Unlike our method, the method used by IPUMS does not block on the first letter of first and last names, but rather just restricts the comparisons to individuals with a given race and birthplace. This coarser blocking dramatically increases the number of calculations that need to be made. Nevertheless, in the IPUMS samples, about 98% of the individuals in the linked US data and about 92% in the Norwegian data agree on the first letter of both the first and last names. Hence, although the method does not explicitly block on these characteristics, in practice there are only a few individuals in the resulting samples for which these characteristics do not agree. This is expected because the Jaro-Winkler similarity score, which is used as an input in the construction of the linked samples, has a larger penalty for mistakes that take place in the first letter of a word. Hence, names with such mistakes are unlikely to have a high estimated probability of being a true link.
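Blocking of the kind discussed in note 12 can be sketched as grouping records by a key and comparing only records that share it. The key used here (birthplace plus the first letters of the first and last names), the field names, and the example records are assumptions for illustration; the sketch also assumes names are non-empty.

```python
from collections import defaultdict


def block_key(record):
    """Assumed blocking key: birthplace and first letters of both names."""
    return (record["birthplace"],
            record["first"][0].upper(),
            record["last"][0].upper())


def candidate_pairs(records_a, records_b):
    """Return (id_a, id_b) pairs restricted to records that share a block."""
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[block_key(rec)].append(rec)
    return [(a["id"], b["id"])
            for a in records_a
            for b in blocks.get(block_key(a), [])]


# Hypothetical records from two sources.
records_a = [
    {"id": "a1", "first": "James", "last": "Thomas", "birthplace": "Ohio"},
    {"id": "a2", "first": "Ole", "last": "Hansen", "birthplace": "Bergen"},
]
records_b = [
    {"id": "b1", "first": "Jas", "last": "Tomas", "birthplace": "Ohio"},
    {"id": "b2", "first": "James", "last": "Thomas", "birthplace": "Iowa"},
    {"id": "b3", "first": "Ola", "last": "Hanssen", "birthplace": "Bergen"},
]
print(candidate_pairs(records_a, records_b))  # [('a1', 'b1'), ('a2', 'b3')]
```

Without blocking, the example would require 2 × 3 = 6 comparisons; with this key only 2 remain. The cost is that a true link whose first letter was mistranscribed can never be found, which is the trade-off the note describes.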

13 The procedure used to create the training sample is described in the following way: “For our project, we selected a random sample of potential links, and had a group of MPC data entry operators code each potential link as a “yes” or “no” based on a visual examination of names and ages of potential links (with yes indicating that it was in their opinion a true link). If a majority had the potential link as a “yes,” then it was coded as a “yes” in the training data (with the remainder coded as “no”).”

14 As described in Goeken et al. (2011), “The SVM classifier analyzes the training data, plots them in a multidimensional space, and then constructs a boundary between the two classes of records that maximizes the distance from the hyperplane and the nearest data points in both of the classes (i.e., between the true and false links).”

15 There are about 45,000 males aged 16 or less in the 1850 US census 1% sample, and about 3,500 in the 1850–80 US linked sample. There are about 340,000 males aged 16 or less in the 1865 Norwegian census, and about 51,000 in the Norway 1865–1900 linked sample.

16 The linked samples (both ours and the one built by IPUMS) might differ from the cross-section for reasons unrelated to the linking procedure. For instance, if there is differential mortality or outmigration by father’s occupational category, the occupational distribution in the initial census year will differ from the cross-section even if the method linked everyone who was still in the US in 1880 or in Norway in 1900.
