How to model the weather-migration link: a machine-learning approach to variable selection in the Mexico-U.S. context. Journal of Ethnic and Migration Studies, Vol 49, No 2

ABSTRACT

A growing body of research investigates how changes in weather shape individual choices about migration, yet highly variable results continue to challenge our understanding of the weather-migration nexus. We use a data-driven approach to identify which weather variables best predicted migration decisions of 54,986 individuals originating in Mexico between 1989 and 2016. Using supervised machine learning, we fit random forests to model migration choices based on individual, household, and community attributes in training data (three-fourths of the sample) from the Mexican Migration Project. We aggregated 36 annual weather variables at the community level and applied k-fold cross-validation to evaluate which models best predicted migration decisions. The top-performing models were then applied to the test data (one-fourth of our sample). Three weather variables consistently outperformed others across models: minimum temperature during the day, maximum temperature at night, and ‘growing degree days’ – the number of days with optimal growth temperatures for corn (the major crop for most communities). Our results demonstrate that weather is related to individual choices about migration and illustrate the utility of principled variable selection, which revealed that both customized (growing degree days for a particular crop) and generic (max-min temperatures) metrics can be predictive of migration behaviors.

Data availability statement

Access to the restricted MMP data requires agreeing to a confidentiality form. The data can be obtained at https://mmp.opr.princeton.edu. The Daymet data are public and can be obtained at https://daymet.ornl.gov. All our code and instructions for downloading the Daymet data can be found at https://github.com/mariomolinam/climate_change_immigration. [NOTE: the repository is now private but will be made public when the paper is unconditionally accepted].

Acknowledgements

The authors thank the editors of this issue and reviewers for insightful comments that helped us improve our manuscript. We also thank Ariel Ortiz-Bobea and Julia Zhu for helpful feedback on earlier versions. This work was supported by the Cornell Atkinson Center Academic Venture Fund and Cornell Migrations: A Global Grand Challenge Grant.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 Our work focuses on one potential source for mixed results: different weather measures. There are other sources such as heterogeneity across populations or contexts; different mediating factors (e.g., economic or political); different environmental stressors (e.g., sea-level rise versus weather fluctuations); different migration responses (e.g., short- versus long-run, internal versus international) with different return rates (for example, see Entwisle, Verdery, and Williams 2020).

2 Importantly, we set aside the test data early on, before we started data analyses, and only used these data to present final results.

3 By contrast, LASSO (Least Absolute Shrinkage and Selection Operator) – a popular alternative – is linear. See Bucca and Urbina (2019) for an application of LASSO to variable selection in sociological research. Also, see Molina and Garip (2019) and Athey (2019) for discussion of alternative ML methods and their use in the social sciences.

4 ‘Climate’ refers to the distribution of outcomes over a longer time span. ‘Weather’ can be thought of as a particular empirical realisation from that distribution.

5 The original sample had 129,968 respondents. Our sample was reduced to 100,572 respondents after two minimal restrictions: we keep respondents who migrated to the U.S. from 1980 on and had at least 5 years of observations. We introduce additional restrictions to our data later (see below).

6 We identified 58 respondents who reported a year of migration but were coded as non-migrants in the variable for first-time migration. In all these cases, we noticed that the year of participation in the survey was one year earlier than the year reported for migration. We corrected these cases and assigned them the year of participation in the survey as their year of migration.

7 Visa accessibility is computed by the MMP team as the share of total Mexican immigrants legally admitted to total Mexican immigrants entering the United States (including the estimated illegal entries).

8 Trade is measured as the total of U.S. imports to and U.S.-bound exports from Mexico, normalised to 2010 US$. The value of trade spikes after NAFTA, which is in itself a driver of change in the number and types of migrants (see Chau, Garip, and Ortiz-Bobea 2021).

9 This information comes from the year of first migration reported retrospectively for each respondent. For each year, the MMP team have added the number of individuals whose year of first U.S. trip precedes the current year, and divided it by the total number of residents in the community at the time.

10 Our sample is restricted to individuals aged 15 or older. An individual might have fewer than 5 years in the data if they were younger than 15 in the included time period.

11 We consider 1980–1984 as the ‘normal’ period in order to use the fine-grained weather data that is available starting in 1980. These data allow us to measure how much a community deviates from its own average temperature and precipitation in the normal period. If we were to use an alternative ‘normal’ period (say, 1960–1979), we would only be able to measure how much a community deviates from the average temperature and precipitation of its state in that period. By using the more recent period as the ‘normal’, we opt for greater specificity in measurement. We also take into account potential adaptation that might have already occurred prior to 1980. See Dell, Jones, and Olken (2014) for a discussion of issues around ‘normal’ period selection.
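
As a rough illustration of the measurement described in this note, the sketch below computes how far one community's value in a given year lies from its own 1980–1984 average. It assumes a simple difference from the baseline mean; the paper's exact definition of the deviation (for example, whether it is standardised) is not stated here, and the function name and temperature values are hypothetical.

```python
from statistics import mean

def deviation_from_normal(series, normal_years, year):
    """Return the value in `year` minus the mean over `normal_years`.

    `series` maps year -> annual weather value for ONE community, so the
    baseline is the community's own average (an assumption of this sketch).
    """
    baseline = mean(series[y] for y in normal_years)
    return series[year] - baseline

# Hypothetical annual mean temperatures (degrees C) for one community.
temps = {1980: 20.0, 1981: 21.0, 1982: 19.0, 1983: 20.5, 1984: 19.5, 1995: 21.5}
normal_years = range(1980, 1985)  # the 1980-1984 'normal' period

anomaly_1995 = deviation_from_normal(temps, normal_years, 1995)
```

With a state-level baseline (the alternative the note mentions), `series` would instead hold state averages for the baseline years, which is why community-specific deviations need the post-1980 data.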

12 Entropy measures uncertainty, disorder, or ‘impurity’ in a random variable and it takes values between 0 and 1, where 1 indicates maximum uncertainty. The optimisation problem when creating a tree is to maximise the information gain using entropy conditional on a given split using X_j and c (Murphy 2012, ch. 16). A split is decided at every step of the process when X_j and c return the maximum gain.
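
The split rule in this note can be sketched for a single variable X_j and a binary outcome as follows. This is a minimal illustration under our own assumptions (base-2 logarithm, candidate thresholds c taken from the observed values, hypothetical function names), not the implementation used in the paper.

```python
from math import log2

def entropy(labels):
    """Binary entropy (base 2) of a list of 0/1 labels; ranges from 0 to 1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0  # a pure node has no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(x, y, c):
    """Entropy of y minus the size-weighted entropy after splitting on x <= c."""
    left = [yi for xi, yi in zip(x, y) if xi <= c]
    right = [yi for xi, yi in zip(x, y) if xi > c]
    n = len(y)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - after

def best_threshold(x, y):
    """Return the observed value c that maximises information gain, or None."""
    candidates = sorted(set(x))[:-1]  # the largest value would leave one side empty
    if not candidates:
        return None
    return max(candidates, key=lambda c: information_gain(x, y, c))

# Toy data: the outcome is perfectly separated at x <= 2.
x = [1, 2, 3, 4]
y = [0, 0, 1, 1]
c_star = best_threshold(x, y)
gain = information_gain(x, y, c_star)
```

A full tree repeats this search over every candidate variable at each node and keeps the (X_j, c) pair with the largest gain.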

13 If the variance of a random variable X is σ², the variance of the average of multiple random variables X_1, … , X_n, all with variance σ², is: $\operatorname{var}\left(\frac{1}{n} \sum_{i} X_{i}\right) = \frac{1}{n^{2}} \sum_{i} \operatorname{var}(X_{i}) = \frac{1}{n^{2}} \sum_{i} \sigma^{2} = \frac{\sigma^{2}}{n}$, meaning that the variance can be reduced by a factor of $\frac{1}{n}$. This holds when the random variables X_i are independent, as the first equality above ($\operatorname{var}\left(\frac{1}{n} \sum_{i} X_{i}\right) = \frac{1}{n^{2}} \sum_{i} \operatorname{var}(X_{i})$) assumes that the covariance between the X_i is zero – which does not happen exactly across trees in the random forests. Random forests then trade variance for some bias (Hastie, Tibshirani, and Friedman 2009).
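
For reference, the standard extension of this result to correlated variables makes the caveat concrete. Assuming (as a simplification) that every pair X_i, X_j with i ≠ j has the same correlation ρ, so that cov(X_i, X_j) = ρσ², the variance of the average becomes:

$$
\operatorname{var}\left(\frac{1}{n} \sum_{i} X_{i}\right)
= \frac{1}{n^{2}} \left[ \sum_{i} \operatorname{var}(X_{i}) + \sum_{i \neq j} \operatorname{cov}(X_{i}, X_{j}) \right]
= \frac{1}{n^{2}} \left[ n \sigma^{2} + n (n - 1) \rho \sigma^{2} \right]
= \rho \sigma^{2} + \frac{(1 - \rho) \sigma^{2}}{n}
$$

Setting ρ = 0 recovers σ²/n. For ρ > 0, the first term does not shrink as n grows, which is why lowering the correlation between trees matters in addition to averaging many of them.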

14 Classification model predictions can be represented by a confusion matrix: $\begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}$, where predictions are in the rows and actual values are in the columns. The outcome variable is said to have two classes: positives and negatives. TP corresponds to the actual positives (e.g., migrants) correctly predicted as positives by the model (true positives); TN is the actual negatives (i.e., non-migrants) correctly predicted as negatives by the model (true negatives). FP and FN are misclassifications when the model respectively classifies an actual negative as positive (a non-migrant classified as migrant) and an actual positive as negative (a migrant classified as non-migrant).
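
The four cells can be tallied directly from paired actual and predicted labels, as in the sketch below. It uses the layout of this note (predictions in rows, actual values in columns) and codes migrants as 1. Precision and recall are included only as familiar examples of scores derived from these cells; they are not necessarily the scores reported in the paper, and the labels are made up.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels: 1 = migrant, 0 = non-migrant.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

# Rows are predictions, columns are actual values, as in the note.
matrix = [[tp, fp],
          [fn, tn]]

# Two common scores built from the cells (illustrative only).
precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted migrants who migrated
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual migrants the model found
```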

15 The validation scores are averages across the 5 folds of the cross-validation used during the training process. Once we chose the best model based on the performance scores reported in Table 3, we retrained the classifier using all the training data (i.e., with no cross-validation).
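
The procedure in this note (average a score over 5 folds, then refit on all training data) can be sketched in plain Python. The majority-class 'model' and accuracy score below are stand-ins chosen to keep the example self-contained; the paper fits random forests, and all names and data here are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]

def cross_validate(fit, score, X, y, k=5):
    """Average the validation score of `fit` across k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)

# Stand-in model: always predict the majority class seen in training.
def fit_majority(X, y):
    return 1 if 2 * sum(y) > len(y) else 0

def accuracy(model, X, y):
    return sum(1 for yi in y if yi == model) / len(y)

# Hypothetical training data: 10 individuals, 3 of them migrants.
X_train = list(range(10))
y_train = [0] * 7 + [1] * 3

mean_val_score = cross_validate(fit_majority, accuracy, X_train, y_train, k=5)

# After model selection, refit on ALL training data (no cross-validation).
final_model = fit_majority(X_train, y_train)
```

The held-out test data never enter `cross_validate`; they are scored once, with `final_model`, at the very end.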

16 The importance of each variable is computed by measuring how much a feature contributes to the splitting process of a tree in terms of the information gain using entropy. This value is weighted by how deep in the process the variable was used for splitting, given that variables used earlier are more consequential (i.e., they impact more observations) than variables used at further steps in the process. For random forests, the importance is computed for every tree and then averaged across trees.
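
One way to read this computation is sketched below. Here the 'depth' weighting is implemented as the share of observations reaching the node where the split occurs, so earlier splits (which affect more observations) count more; this choice, the per-tree normalisation, and the split records are assumptions made for illustration rather than details confirmed by the paper.

```python
from collections import defaultdict

def tree_importance(splits, n_total):
    """Importance per feature for one tree.

    `splits` is a list of (feature, n_samples_at_node, information_gain).
    Each gain is weighted by the share of observations reaching the node,
    and the result is normalised to sum to 1 within the tree.
    """
    raw = defaultdict(float)
    for feature, n_node, gain in splits:
        raw[feature] += (n_node / n_total) * gain
    total = sum(raw.values())
    if total == 0:
        return dict(raw)
    return {f: v / total for f, v in raw.items()}

def forest_importance(trees, n_total):
    """Average the per-tree importances across all trees in the forest."""
    summed = defaultdict(float)
    for splits in trees:
        for feature, value in tree_importance(splits, n_total).items():
            summed[feature] += value
    return {f: v / len(trees) for f, v in summed.items()}

# Two hypothetical trees grown on 100 observations each.
tree_1 = [("tmax", 100, 0.4), ("age", 50, 0.2)]   # tmax splits at the root
tree_2 = [("age", 100, 0.3), ("tmax", 40, 0.25)]  # age splits at the root

importance = forest_importance([tree_1, tree_2], n_total=100)
```

In this toy forest, "tmax" ends up slightly ahead of "age" because its root split in the first tree carries a larger weighted gain.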

17 Because this analysis does not include community fixed effects, it is possible that unobserved community attributes (such as presence of irrigation) might account for the weather impact on migration decisions. Such hypotheses could be scrutinised with follow-up analyses aimed at causal identification.

18 For instance, AUC scores range between 0.97 and 0.99, meaning that the random forests were able to learn the underlying structure of the training data remarkably well.
