Abstract
Many alternative approaches for selecting mortality models and forecasting mortality have been proposed. The usual practice is to base forecasts on a single mortality model selected using in-sample goodness-of-fit measures. However, cross-validation measures are increasingly being used in model selection, and model combination methods are becoming a common alternative to using a single mortality model. We propose and assess a stacked regression ensemble that optimally combines different mortality models to reduce out-of-sample mean squared errors and mitigate model selection risk. Stacked regression uses a meta-learner to approximate horizon-specific weights by minimizing a cross-validation criterion for each forecasting horizon. The horizon-specific weights determine a mortality model combination customized to each horizon. We use 44 populations from the Human Mortality Database to compare the stacked regression ensemble with alternative methods. We show that, using one-year-ahead to 15-year-ahead out-of-sample mean squared errors, the stacked regression ensemble improves mortality forecast accuracy by 13% - 49% for males and 19% - 90% for females over individual mortality models. The stacked regression ensembles also have better predictive accuracy than other model combination methods, including Simple Model Averaging, Bayesian Model Averaging, and Model Confidence Set. We provide an R package, CoMoMo, that combines forecasts for Generalized-Age-Period-Cohort models.
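The weight estimation described in the abstract can be sketched as follows. This is a minimal illustration, not the CoMoMo implementation: the toy series and the two hypothetical component models are invented for the example, and non-negative least squares stands in for the paper's cross-validation-minimizing meta-learner at a single horizon.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# toy target: a log-mortality series observed over 40 years
y = np.linspace(-4.0, -3.0, 40) + rng.normal(0, 0.02, 40)

# cross-validated forecasts from two hypothetical mortality models
preds = np.column_stack([y + rng.normal(0, 0.05, 40),   # model 1
                         y + rng.normal(0, 0.03, 40)])  # model 2

# meta-learner: regress held-out outcomes on the models'
# cross-validated forecasts under non-negativity constraints
w, _ = nnls(preds, y)
w /= w.sum()           # normalize to a convex combination
combined = preds @ w   # stacked (horizon-specific) forecast
```

In the paper the weights are re-estimated for each forecasting horizon, so the combination can shift toward the models that forecast that horizon best.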
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 Mortality rates on the log scale reduce the impact of differences at older ages on the model comparison, since the higher mortality variability at older ages can distort the comparison. The log transform treats small differences between small observed and predicted mortality rates approximately the same as large differences between large observed and predicted mortality rates. As a result, the log transform is appropriate for the relative ranking of the models by MSE; when mortality rates are used directly, the models may be separated by only a small difference in MSE.
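The effect of the log transform can be seen with two made-up rates: a 10% relative error contributes vastly more squared error at an older age on the raw scale, but identical squared error on the log scale.

```python
import numpy as np

obs  = np.array([0.0005, 0.05])   # young-age and old-age mortality rates
pred = obs * 1.10                 # both rates over-predicted by 10%

# raw-scale squared errors are dominated by the older age
raw_se = (obs - pred) ** 2

# log-scale squared errors are identical, because a 10% relative
# error is the same size everywhere on the log scale
log_se = (np.log(obs) - np.log(pred)) ** 2
```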
2 An alternative imputation technique is the Kalman smoother. This also fits a random walk with drift, but imputes missing values by forecasting from the left and from the right and then applying a smoothing algorithm (Moritz 2018). Given the simplicity of the random walk, this effectively interpolates linearly across the missing region. While valid, we believe this technique is inappropriate in our context because it provides an artificially good fit.
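The note's point, that smoothing a random walk with drift across a gap collapses to linear interpolation, can be demonstrated directly; the helper below is an illustrative stand-in, not the imputation code used in the paper.

```python
import numpy as np

def interpolate_gap(x):
    """Fill interior NaNs linearly, mimicking the net effect of the
    Kalman-smoothed random walk with drift described in the note."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(x.size)
    obs = ~np.isnan(x)
    return np.interp(idx, idx[obs], x[obs])

# a two-year gap is filled on the straight line between its endpoints
filled = interpolate_gap([10.0, np.nan, np.nan, 16.0])
```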
3 A slight difference in combined mortality rates may arise when we combine the mortality rates directly instead of using Equation (10).
4 Equal predictive ability implies that two models are equally good under a given loss function.
5 We assume that the statistical properties of the loss differential d_t do not change over time. This implies that the first moment of d_t is constant, that is, E[d_t] = μ for all t, and that the second moment of d_t is finite, that is, E[d_t²] &lt; ∞. Therefore, d_t being weakly stationary makes it possible to determine the best mortality model(s) from the initial collection of models (Hansen et al. 2011).
6 In this study, the short-term horizon corresponds to a period of one to five years, the medium-term horizon to a period of six to 10 years, and the long-term horizon to a period of 11 to 15 years.
7 For the combination methods that require a validation set (to estimate the weights or, in the case of MCSV, to select the superior models), we use data from 1960 to 1977 for training and data from 1978 to 1990 for estimating the weights or selecting the superior models.
8 We view cross-validation as a particular case of the stacked regression ensemble in which a single model is selected. A stacked regression ensemble relaxes the assumption that one model must be chosen and used to predict mortality rates: we choose multiple mortality models and optimally combine them to maximize out-of-sample accuracy in one step by minimizing the cross-validation criterion (Sridhar et al. 1996). In other model combinations, such as standard AIC-based model averaging, there is no optimization criterion that selects the models and optimally assigns their weights in one step. Instead, we must fit each model to the data, measure its Akaike Information Criterion, and then use it to calculate the weight for each model independently. Thus, selecting the models and estimating the weights are not done in one step.
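The two-step nature of AIC-based averaging can be made concrete: each model's weight follows from its own AIC alone, with no joint optimization over the combination. The AIC values below are hypothetical.

```python
import numpy as np

# hypothetical AIC values for three fitted mortality models
aic = np.array([1000.0, 1002.0, 1010.0])

# standard Akaike weights: each weight depends only on that model's
# AIC difference from the best model, computed independently of how
# the models' forecast errors relate to each other
delta = aic - aic.min()
w_aic = np.exp(-0.5 * delta)
w_aic /= w_aic.sum()
```

Contrast this with the stacked ensemble, where the weights are the solution of a single cross-validation minimization over all models jointly.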
9 BMA methods assign weights to each model independently, without accounting for how the models differ from each other; they do not incorporate diversity among the models in the weights. Conversely, stacked regression ensembles estimate the weights for all models jointly using a meta-learner such as a lasso regression. As a result, models that produce similar mortality rate forecasts, or that have the least forecasting accuracy at a given forecasting horizon, receive small or zero weights in the presence of alternative, more accurate, and diverse models.
10 The MCS-based methods select a single model as the only superior model for males; hence, it is not combined with other models, and their combined forecasts coincide with that model's forecasts. For females, a single superior model is likewise selected.
11 Uncertainty is the difference between the highest and lowest mortality rate forecasts at any given forecasting horizon (Graefe et al. 2014).
12 The confidence interval for each mortality model is r̄ ± CD/2, where r̄ is the mean rank of the model and CD is the critical difference. We provide precise details in Section A.3 of Kessy et al. (2021).
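A short sketch of the critical difference used in such rank comparisons, following the standard Nemenyi construction; the critical value 2.569 (four models, 5% level) and the choice of k = 4 are illustrative assumptions, not values taken from the paper.

```python
import math

def critical_difference(k, n, q_alpha=2.569):
    """Nemenyi critical difference for k models compared on n
    datasets; q_alpha is the Studentized-range-based critical
    value (2.569 corresponds to k = 4 at the 5% level)."""
    return q_alpha * math.sqrt(k * (k + 1) / (12.0 * n))

# e.g. four models compared across the 44 HMD populations
cd = critical_difference(k=4, n=44)
```

Two models whose mean ranks differ by less than CD (equivalently, whose r̄ ± CD/2 intervals overlap) are not significantly different.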