
Ensemble distributional forecasting for insurance loss reserving

Received 18 Jan 2023, Accepted 03 Jun 2024, Published online: 03 Jul 2024

ABSTRACT

Loss reserving generally focuses on identifying a single model that can generate superior predictive performance. However, different loss reserving models specialise in capturing different aspects of loss data. This is recognised in practice in the sense that results from different models are often considered, and sometimes combined. For instance, actuaries may take a weighted average of the prediction outcomes from various loss reserving models, often based on subjective assessments. In this paper, we propose a systematic framework to objectively combine (i.e. ensemble) multiple stochastic loss reserving models such that the strengths offered by different models can be utilised effectively. Our framework contains two main innovations compared to existing literature and practice. Firstly, our criterion for model combination considers the full distributional properties of the ensemble and not just the central estimate – which is of particular importance in the reserving context. Secondly, our framework is tailored for the features inherent to reserving data. These include, for instance, accident, development, calendar, and claim maturity effects. Crucially, the relative importance and scarcity of data across accident periods renders the problem distinct from traditional ensembling techniques in statistical learning. Our framework is illustrated with a complex synthetic dataset. In the results, the optimised ensemble outperforms both (i) traditional model selection strategies, and (ii) an equally weighted ensemble. In particular, the improvement occurs not only with central estimates but also with relevant quantiles, such as the 75th percentile of reserves (typically of interest to both insurers and regulators). The framework developed in this paper can be implemented thanks to an R package, ADLP, which is available from CRAN.


1. Introduction

1.1. Background

The prediction of outstanding claims is a crucial process for insurers, in order to establish sufficient reserves for future liabilities, and to meet regulatory requirements. Due to the stochastic nature of future claims payments, insurers not only need to provide an accurate central estimate of the reserve, but they also need to forecast the distribution of the reserve accurately. Furthermore, in many countries, provision of the predictive distribution of outstanding claims is required by regulatory authorities. For instance, the Australian Prudential Regulation Authority (APRA) requires general insurers to establish sufficient reserves to cover outstanding claims with at least 75% probability (APRA, Citation2019). Therefore, insurers need to ensure that their models achieve satisfactory predictive performance beyond the central estimate (expectation) of outstanding claims. This is sometimes referred to as ‘stochastic loss reserving’ (Wüthrich & Merz, Citation2008).

Early stochastic loss reserving approaches include the distribution-free stochastic Chain-Ladder model proposed by Mack (Citation1993). As computational power increased, Generalised Linear Models (GLMs) became the classical stochastic modelling approach (Taylor & McGuire, Citation2016). They have been widely applied in both practice and the loss reserving literature due to their stochastic nature and high flexibility in modelling various effects on claims payments, such as accident period effects, development period effects, calendar period effects, and effects of claims notification and closure. Common distribution assumptions for GLMs include the log-normal distribution (Verrall, Citation1991), over-dispersed Poisson (ODP) distribution (Renshaw & Verrall, Citation1998), and gamma distribution (Mack, Citation1991). Parametric curves to describe claim development patterns, such as the Hoerl curve and the Wright curve (Wright, Citation1990), can also be incorporated in the GLM context.

In recent years, machine learning models have become popular in loss reserving due to their potential in prediction accuracy as well as the potential for saving human labour costs. Notable examples include the smoothing spline (England & Verrall, Citation2001), Generalised Additive Model for Location, Scale and Shape (GAMLSS) (Spedicato et al., Citation2014), and Neural Networks (Al-Mudafer et al., Citation2022; Kuo, Citation2019; Wüthrich, Citation2018).

Despite the sophisticated development of loss reserving models, the main focus of the current loss reserving literature is to identify a single model that can generate the best performance according to some measure of prediction quality. This approach is usually referred to as the ‘model selection strategy’. However, due to differences in model structures and assumptions, different loss reserving models are usually specialised at capturing specific claims effects, which differ from one model to the other, while multiple claims effects are typically present in reserving data. Furthermore, these effects are likely to appear inhomogeneously across accident and/or development years, as noted in Friedland (Citation2010) and Taylor (Citation2000, Chapter 5, pp. 151–165). This makes it challenging for a single model to capture all of them. Using a single model to fit claims payments occurring in all accident periods may not be the most effective way to utilise the strengths offered by all the different available models.

Those concerns highlight the potential to develop a model combination (as opposed to selection) strategy.

In practice, ensembles of different deterministic loss reserving models have typically been constructed based on ad-hoc rules. The weights allocated to component models are commonly selected subjectively based on the model properties (see, e.g. Friedland, Citation2010; Taylor, Citation2000). However, due to the subjective nature of this combination strategy, experience based on model properties may not generalise well to new claims data, and the weights allocated to component models are not optimised (Friedland, Citation2010). Therefore, the information at hand may not be utilised most effectively. The subjective model combination strategy can further cause complications in cases where one wants information on the random distribution of claims. To address this, a more rigorous framework for combining stochastic loss reserving models was developed by Taylor (Citation2000, Chapter 12), who proposed to select model weights by minimising the variance of total estimated outstanding claims. While a great step forward, this approach only considers the first and the second moments of the claims distribution. This may not be sufficient to derive the required quantiles from the distribution of reserves accurately.

Beyond the loss reserving literature, stacked regression has recently been applied to combining mortality forecasts (Kessy et al., Citation2022; Li, Citation2022). Under the stacked regression approach, the combination weights are determined by minimising the Mean Squared Error (MSE) achieved by the ensemble. Despite its success in combining point forecasts, this approach does not take into account the ensemble's performance over the entire distribution, which is critical for loss reserving purposes.

The so-called ‘linear pool’ (Gneiting et al., Citation2005), which has been widely applied in combining distributional forecasts outside the field of loss reserving, may offer an alternative model combination approach that could overcome the aforementioned limitations of the model selection strategy, and of traditional combination strategies for loss reserving models. The linear pool uses a proper scoring rule to determine the weights allocated to component distributional forecasting models. A proper scoring rule assigns a numeric score to a distributional forecasting model when an event materialises. This presents a number of advantages when assessing the quality of distributional forecasts (Gneiting & Raftery, Citation2007). Literature in the areas of weather forecasting (Baars & Mass, Citation2005; Gneiting & Raftery, Citation2007; Gneiting et al., Citation2005; Raftery et al., Citation2005; Vogel et al., Citation2018), stock return forecasting, and GDP growth rate forecasting (Hall & Mitchell, Citation2004, Citation2007; McDonald & Thorsrud, Citation2011; Opschoor et al., Citation2017) all suggest that linear pool ensembles generate more accurate distributional forecasts than any single model in their respective contexts. A common choice for the proper score is the Log Score. Furthermore, linear pools are akin to finite mixture models in producing a weighted average of predictive distributions, yet they differ in weight determination. Finite mixture models simultaneously train model parameters and weights using the same dataset (Baran & Lerch, Citation2018). Conversely, the linear pooling method first calibrates model parameters on a training set and then optimises combination weights on a separate validation set. This is likely to improve the ensemble's overall predictive accuracy compared to standard finite mixture models. In this paper, we start by developing a linear pool approach to loss reserving, and extend the methodology to best fit the reserving context, as detailed in the following section.

1.2. Statement of contributions

To our knowledge, linear pools have not appeared in the loss reserving literature. Unfortunately, existing linear pool implementations cannot be applied directly in this context, for a number of reasons outlined below. This paper aims to fill this gap by developing an adjusted linear pool ensemble forecasting approach that is particularly suitable for the loss reserving context. Although actuaries may combine different loss reserving models in practice, we argued earlier that such model combination is generally performed subjectively and may not optimise the ensemble's performance. Compared with the subjective model combination approach, our proposed scheme has two advantages. Firstly, data-driven weights reduce the amount of manual adjustment required by the actuary, particularly in cases where new experience has emerged. Secondly, by using a proper score as the objective function, the distributional forecast accuracy of the ensemble can be optimised. As Shapland (Citation2022) briefly suggests in his monograph, the future development of model combination strategies for loss reserving models should consider the relative predictive power of component models, and the weighted results should ‘reflect the actuary's judgements about the entire distribution, not just a central estimate’. Since a proper score formally assesses the quality of a distributional forecast, linear pools satisfy the two conditions proposed by Shapland (Citation2022) when a proper score is used as the optimisation criterion.

In this paper, we introduce a distributional forecast combination framework that is tailored to the loss reserving context. Although the standard ensemble modelling approach has been extensively studied in the literature, it cannot be applied directly without modification, due to the special data structures and characteristics of loss reserving data. Specifically, we introduce ‘Accident and Development period adjusted Linear Pools (ADLP)’, which have the following attributes:

Triangular shape of aggregate loss reserving data: To project out-of-sample incremental claims, the training data must have at least one observation from each accident period and development period in the upper triangle. Our proposed ensemble ensures that this constraint is satisfied when partitioning the training and validation sets.

Time-series characteristics of incremental claims: Under the standard ensemble approach, the in-sample data is randomly split into a training set for training component models and a validation set for estimating combination weights (Breiman, Citation1996). However, since the main purpose of loss reserving is to predict future incremental claims, a random split of the data cannot adequately validate the extrapolating capability of loss reserving models. In light of this concern, we allocate incremental claims from the most recent calendar periods to the validation set (rather than a randomly selected sample), as they are more relevant to future incremental claims. This is a common partition strategy in the loss reserving literature (e.g. Taylor & McGuire, Citation2016); a code sketch of this split is given after this list. The combination weights estimated from the validation set therefore more accurately reflect the component models' predictive performance on the most recent out-of-sample data.

Accident Period effects: The standard ensemble strategy usually derives its combination weights by using the whole validation set, which implies each component model will receive a single weight for a given data set.

However, it is well known that the performance of loss reserving models tends to differ by accident period (Chapter 5 of Taylor, Citation2000, pp. 151–165). Hence, assigning the same weights (for each model) across all accident years might not be optimal, and it may be worthwhile allowing them to change over time. To address this, we allow the set of weights to change at arbitrary ‘split points’. For instance, the set might be calibrated differently for the 10 most recent (immature) accident years. More specifically, we further partition the validation set based on accident periods to train potentially different combination weights, such that the weights can best reflect a model's performance in the different accident period sets. Our paper shows that the ensemble's out-of-sample performance can be substantially improved – based on the whole range of evaluation metrics considered here – by adopting the proposed Accident-Period-dependent weighting scheme. Additionally, we also empirically discuss the choice of the number and location of accident year split points.

Development Period effects: The accident year split described in the previous item requires some care. One cannot strictly isolate and focus on a given set of accident years (think of a horizontal band) and use the validation data of that band only as this will lack data from later development years which are required for the projection of full reserves. Hence, we start with the oldest accident year band (or split set), and when moving to the next band, we (i) change the training set according to the accident year effect argument discussed above, but (ii) add the new layer of validation points to the previously used set of validation data when selecting weights for that band.

Note that the data kept from one split to the next is the validation data only, which by virtue of the second item above is always the most recent (last calendar year payments). Hence, this tweak does not contradict the probabilistic forecasting literature's recommendation of using only the data in the most recent periods (see, e.g. Opschoor et al., Citation2017) when selecting weights.

Data Scarcity: Data is scarce for aggregate loss reserving, particularly for immature accident periods (i.e. the recent accident periods), where only a few observed incremental claims are available. A large number of component models relative to the number of observations may make the predictions or weight estimates less statistically reliable (Aiolfi & Timmermann, Citation2006). Our proposed ensemble scheme mitigates this data scarcity issue by incorporating incremental claims from both mature and immature accident periods to train the model weights in immature accident periods.
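To make the calendar-period partition mentioned in the second item above concrete, the following R sketch (with illustrative variable and parameter names, not the paper's exact code) splits an upper triangle into training and validation sets, holding out the latest diagonals for weight estimation; the number of held-out diagonals, `n_val`, is an arbitrary choice for illustration:

```r
# Illustrative sketch: split the upper triangle by calendar period
# t = i + j - 1, holding out the latest `n_val` calendar periods
# (diagonals) as the validation set. `claims` is assumed to be a data
# frame with columns i, j and y covering the upper triangle.
split_triangle <- function(claims, I = 40, n_val = 2) {
  t <- claims$i + claims$j - 1
  list(
    train = claims[t <= I - n_val, ],  # for fitting component models
    valid = claims[t >  I - n_val, ]   # for estimating combination weights
  )
}
```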

The out-of-sample performance of the ADLP is evaluated and compared with single models (model selection strategy), as well as other common model combination methods, such as the equally weighted ensemble, by using synthetic data sets that possess complicated features inspired by real world data.

Overall, we show that our reserving-oriented ensemble modelling framework ADLP yields more accurate distributional forecasts than either (i) a model selection strategy, or (ii) the equally weighted ensemble. The out-performance of the proposed ensembling strategy can be credited to its ability to effectively combine the strengths offered by different loss reserving models and to capture the key characteristics of loss reserving data as discussed above. In fact, the mere nature of the challenges described above justifies and explains why an ensembling approach is favourable. Finally, we have also developed an R package, ADLP, which has been published on CRAN, to promote the usability of the proposed framework.

1.3. Outline of the paper

In Section 2, we discuss the design and properties of the component loss reserving models that constitute the building blocks for the model combination approaches discussed in this paper. Section 3 discusses how to best combine those component models, and describes the modelling frameworks of the standard linear pool ensemble and of the Accident and Development period adjusted Linear Pools (ADLP) that are tailored to general insurance loss reserving data. The implementation of our framework is not straightforward, and is detailed in Section 4. Section 5 discusses the evaluation metrics and statistical tests used to assess and compare the performance of different models. The empirical results are illustrated and discussed in Section 6. Section 7 concludes.

2. Choice of component models

Given the extensive literature in loss reserving, which offers a plethora of modelling options, we start by defining criteria for choosing models to include in the ensemble – the ‘component models’. We then describe the structure of each of the component models we selected, as well as the characteristics of loss data that they aim to capture.

2.1. Notation

We first define the basic notation that will be used in this paper:

  • $i$: the index for accident periods, where $i = 1, \ldots, I$

  • $j$: the index for development periods, where $j = 1, \ldots, J$ and $J = I$

  • $t$: the calendar period of incremental claims; $t = i + j - 1$, where $i$ and $j$ denote the accident period and the development period, respectively

  • $D_{\text{out}}$: the out-of-sample loss data (i.e. the lower part of an aggregate loss triangle)

  • $D_{\text{Train}}$: the loss data used for training component models

  • $D_{\text{val}}$: the validation loss data for selecting component models and optimising the combination weights; $D_{\text{Train}}$ and $D_{\text{val}}$ constitute the in-sample loss data (i.e. the upper part of an aggregate loss triangle)

  • $Y_{ij}$: the observed incremental claims paid in accident period $i$ and development period $j$

  • $\hat{Y}_{ij}$: the estimated incremental claims in accident period $i$ and development period $j$

  • $R^{\text{True}}$: the true reserve, defined as $R^{\text{True}} = \sum_{(i,j) \in D_{\text{out}}} Y_{ij}$

  • $\hat{R}$: the estimated reserve, defined as $\hat{R} = \sum_{(i,j) \in D_{\text{out}}} \hat{Y}_{ij}$

  • $N_{ij}$: the observed count of claims reported in accident period $i$ and development period $j$

  • $F_{ij}$: the observed number of finalised claims in accident period $i$ and development period $j$

  • $\hat{\mu}_{ij}$: the predicted mean for the distribution of $Y_{ij}$ (or $\ln(Y_{ij})$ for the Log-Normal distribution)

  • $\hat{\sigma}^2_{ij}$: the predicted variance for the distribution of $Y_{ij}$ (or $\ln(Y_{ij})$ for the Log-Normal distribution)

  • $\phi$: the dispersion parameter for the distribution of $Y_{ij}$; for distributions from the Exponential Dispersion Family (EDF), $\operatorname{Var}(Y_{ij})$ is a function of $\phi$ and $\mu_{ij}$

  • $\eta_{i,j}$: the linear predictor for incremental claims paid in accident period $i$ and development period $j$, which incorporates information from explanatory variables into the model; $\eta_{i,j} = h^{-1}(\hat{\mu}_{ij})$, where $h^{-1}$ is a link function

  • $\hat{f}_m(y_{ij})$: the predicted density from model $m$ evaluated at the observation $y_{ij}$

  • $m$: the index for component models, where $m = 1, \ldots, M$

2.2. Criteria for choosing component models

There is a wide range of loss reserving models available in the literature. With an aim to achieve a reasonable balance between diversity and computational cost, we propose the following criteria to select which component models to include in our ensembles:

  1. The component models should be able to be fitted mechanically, without requiring substantial user adjustment, in order to avoid potential subjective bias in model fitting and to save time spent on manually adjusting each component model.

    This can be of particular concern when there is a large number of component models contained in the ensemble. This criterion is consistent with the current literature on linear pool ensembles, where parametric time-series models requiring minimal manual adjustment are used as component models (Gneiting & Ranjan, Citation2013; Opschoor et al., Citation2017).

  2. The component models should have different strengths and limitations so as to complement each other in the ensemble.

    This criterion ensures that different patterns present in data can be potentially captured by the diverse range of component models in the ensemble. As Breiman (Citation1996) suggests, having diverse component models could also reduce the potential model correlation, thus improving the reliability and accuracy of the prediction results.

  3. The component models should be easily interpretable and identifiable, and hence are restricted to traditional stochastic loss reserving models and statistical learning models with relatively simple structures.

    Although advanced machine learning models, such as Neural Networks, have demonstrated outstanding potential in forecast accuracy in the literature (Kuo, Citation2019), they might pose a substantial challenge to the interpretability of the ensemble, and increase its data size requirements, if they were included. Therefore, their implementation into the proposed ensemble has not been pursued in this paper.

In summary, the ensemble should be automatically generated (1), rich (2) and constituted of known and identifiable components (3).

Note, however, that our framework works for any number and type of models in principle, provided the optimisation described in Section 3 can be achieved.

2.3. Summary of selected component models

Based on the criteria for choosing component models introduced in the previous section, we have selected eighteen reserving models with varying distribution assumptions and model structures from the literature. These include GLMs with various effects and various dispersions, zero-adjusted models, smoothing splines, as well as GAMLSS models with varying dispersions. While Appendix 1 provides a detailed description of those models, Table 1 summarises the common patterns in claim payments data that actuaries may encounter when modelling outstanding claims, and the corresponding component models that can capture those patterns.

Table 1. Summary of component models.

Remark 2.1

Another group of models that are easily interpretable and can be fitted mechanically is the age-period-cohort type of models proposed in Harnau and Nielsen (Citation2018) and Kuang et al. (Citation2008). These models have a close connection to the model with calendar period effects (i.e. GLMCal) and the cross-classified model (i.e. GLMCC), which are discussed in more detail in Appendix A.2.

3. Model combination strategies for stochastic ensemble loss reserving

In this section, we first define which criterion we will use for determining the relative strength of one model when compared to another. We then review the two most basic ways of using models within an ensemble: (i) the Best Model in the Validation set (BMV), and (ii) the Equally Weighted ensemble (EW). Both are deficient, in that (i) considers the optimal model, but in doing so discards all other models, and (ii) considers all models, but ignores their relative strength. To improve these and combine the benefits of both approaches, both the Standard Linear Pool (SLP) and Accident and Development period adjusted Linear Pools (ADLP) are next developed and tailored to the reserving context. The section concludes with details of the optimisation routine we used, as well as how predictions were obtained.

3.1. Model combination criterion

3.1.1. Strictly proper scoring rules and probabilistic forecasting

We consider ‘strictly proper’ scoring rules, which have wide application in the probabilistic forecasting literature (Gneiting & Katzfuss, Citation2014), to measure the accuracy of distributional forecasts. Scoring rules assign a numeric score to a distributional forecasting model when an event materialises. Define $S(P,x)$ as the score received by the forecasting model $P$ when the event $x$ is observed, and $S(Q,x)$ as the score assigned to the true model $Q$. A scoring rule is defined to be ‘strictly proper’ if $S(Q,x) \geq S(P,x)$ always holds, with equality if and only if $P = Q$. This definition ensures that the true distribution is always the optimal choice, thus encouraging forecasters to use the true distribution (Gneiting & Katzfuss, Citation2014; Gneiting & Raftery, Citation2007). This property is crucial, as improper scoring rules can give misguided information about the predictive performance of forecasting models (Gneiting & Katzfuss, Citation2014; Gneiting & Raftery, Citation2007).

Furthermore, the performance of a distributional forecasting model can be evaluated by its calibration and sharpness. Calibration measures the statistical consistency between the predicted distribution and the observed distribution, while sharpness concerns the concentration of the predicted distribution. A general goal of distributional forecasting is to maximise the sharpness of the predictive distributions, subject to calibration. By using a proper scoring rule, the calibration and sharpness of the predictive distribution can be measured at the same time. For an introduction to scoring rules and the advantages of using strictly proper scoring rules, we refer to Gneiting and Katzfuss (Citation2014) and Gneiting and Raftery (Citation2007).

Actuaries commonly employ goodness-of-fit criteria, such as the Kolmogorov–Smirnov or Anderson–Darling statistics, to assess the level of agreement between theoretical and empirical distributions. However, it is crucial to recognise the fundamental distinction between evaluating probabilistic forecasts and assessing the goodness-of-fit of a theoretical distribution. In probabilistic forecasting, our objective is to evaluate the predictive distribution based on a single realised value, as the true distribution of a particular instance is typically unobservable in practical scenarios. Given this consideration, proper scoring rules are better suited to our purposes.

3.1.2. Log score

In this paper, we use the Log Score, a strictly proper scoring rule proposed by Good (Citation1992), to assess our models' distributional forecast accuracy. The Log Score is also a local scoring rule, i.e. a scoring rule that depends on the density function only through its value at the materialised event (Parry et al., Citation2012). Under certain regularity conditions, it can be shown that every proper local scoring rule is equivalent to the Log Score (Gneiting & Raftery, Citation2007). For a full discussion of the advantages of the Log Score, we refer to Gneiting and Raftery (Citation2007) and Roulston and Smith (Citation2002).

For a given dataset $D$, the average Log Score attained by a predictive distribution can be expressed as
$$\text{LogS} = \frac{1}{|D|} \sum_{(i,j) \in D} \ln\big(\hat{f}(y_{i,j})\big), \tag{1}$$
where $\hat{f}(y_{i,j})$ is the predicted density at the observation $y_{i,j}$, and $|D|$ is the number of observations in the dataset $D$.
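As an illustration, (1) translates directly into a one-line R function; `dens` and `y` below are placeholder names for a vectorised predictive density and the observed claims:

```r
# Average Log Score of a predictive density over a set of observations,
# as in (1). `dens` is a vectorised density function and `y` a numeric
# vector of observed incremental claims (illustrative names).
avg_log_score <- function(dens, y) mean(log(dens(y)))

# Example with a fitted log-normal predictive density, where mu_hat and
# sigma_hat are hypothetical fitted parameters:
# avg_log_score(function(y) dlnorm(y, meanlog = mu_hat, sdlog = sigma_hat), y_val)
```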

Remark 3.1

Here, the Log Score is chosen to calibrate the component model weights in the validation sets (the latest diagonals). Strictly proper scoring rules are also used to assess the quality of the distributional forecasts produced by the ensembles developed in this paper, but this time applied in the test sets (in the lower triangle). This is further described in Section 5.

3.2. Best model in the validation set (BMV)

Traditionally, a single model, usually the one with the best performance in the validation set, is chosen to generate out-of-sample predictions. In this paper, the Log Score is used as the selection criterion, and the selected model is referred to as the ‘best model in validation set’ – abbreviated ‘BMV’. The model selection strategy can be regarded as a special ensemble where a weight of 100% is allocated to a single model (i.e. the BMV). Although model selection is based on the relative predictive performance of various models, it is usually hard for a single model to capture all the complex patterns that could be present in loss data (Friedland, Citation2010). In our alternative ensembles, we aim to incorporate the strengths of different models by combining the prediction results from various models.

3.3. Equally weighted ensemble (EW)

The simplest model combination strategy is to assign equal weights to all component models, which is commonly used as a benchmark strategy in the literature (Hall & Mitchell, Citation2007; McDonald & Thorsrud, Citation2011; Ranjan & Gneiting, Citation2010). For this so-called ‘equally weighted ensemble’ – abbreviated ‘EW’ – the combined predictive density can be specified as
$$f(y_{ij}) = \sum_{m=1}^{M} \frac{1}{M} \hat{f}_m(y_{ij}), \tag{2}$$
where $M$ is the total number of component models, and $\hat{f}_m(y_{ij})$ is the predicted density from component model $m$ evaluated at the observation $y_{ij}$. As the equally weighted ensemble can sometimes outperform more complex model combination schemes (Claeskens et al., Citation2016; Pinar et al., Citation2012; Smith & Wallis, Citation2009), we also consider it as a benchmark model combination strategy in this paper.

3.4. Standard linear pool ensembles (SLP)

A more sophisticated, while still intuitive, approach is to let the combination weights be driven by the data so that the ensemble's performance can be optimised. A prominent example of such combination strategies in the probabilistic forecasting literature is the linear pool approach (Gneiting & Ranjan, Citation2013), which aggregates individual predictive distributions through a linear combination. The combined density function of such a ‘standard linear pool’ – abbreviated ‘SLP’ – can then be specified as
$$f(y_{ij}) = \sum_{m=1}^{M} w_m \hat{f}_m(y_{ij}) \tag{3}$$
at observation $y_{ij}$, where $w_m$ is the weight allocated to predictive distribution $m$ (Gneiting & Ranjan, Citation2013). The output of linear pools, as defined in (3), closely resembles that of finite mixture models, with both being a weighted average of an ensemble of predictive distributions. However, unlike finite mixture models, the calibration of a linear pool ensemble usually involves two stages. In the first stage, the component models are fitted on the training data, and the fitted models generate predictions for the validation set, which constitute the Level 1 predictions. In the second stage, the optimal combination weights are learned from the Level 1 predictions by maximising (or minimising) a proper score. Since the weights allocated to component models should reflect their out-of-sample predictive performance, it is important to determine the combination weights using the models' predictions for the validation set instead of the training set.
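In R, evaluating the combined density in (3) reduces to a matrix-vector product; the sketch below assumes an $n \times M$ matrix of component densities evaluated on $n$ observations (the names are ours, not the ADLP package's API):

```r
# Combined SLP density in (3): `dens_mat` is an n-by-M matrix whose
# (r, m) entry is the m-th component's predicted density at the r-th
# observation; `w` is a vector of M non-negative weights summing to one.
slp_density <- function(dens_mat, w) as.vector(dens_mat %*% w)

# The ensemble's average Log Score then follows from (1):
# mean(log(slp_density(dens_mat, w)))
```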

Figure 1 illustrates the modelling framework for constructing a standard linear pool (SLP) ensemble, with the following steps:

  1. Partition the in-sample data (i.e. the upper triangle $D_{\text{in}}$) into a training set $D_{\text{Train}}$ (shaded in blue) and a validation set $D_{\text{val}}$ (shaded in orange). We allocate the latest calendar periods to the validation set to better validate the projection accuracy of models, which is also common practice in the actuarial literature.

  2. Fit the $M$ component models to the training data (i.e. $D_{\text{Train}}$), that is, find the model parameters that maximise the Log Score within $D_{\text{Train}}$ (equivalent to Maximum Likelihood estimation).

  3. Use the fitted models to generate predicted densities for incremental claims in the validation set, which form the Level 1 predictions; denote the Level 1 predictions as $\{\hat{f}_1(y_{ij}), \ldots, \hat{f}_M(y_{ij}) : (i,j) \in D_{\text{val}}\}$.

  4. Estimate the optimal combination weights from the Level 1 predictions by maximising the mean Log Score achieved by the ensemble, subject to the constraint that the combination weights must sum to one (i.e. $\sum_{m=1}^{M} w_m = 1$) and that each weight is non-negative (i.e. $w_m \geq 0$). We then have
$$\hat{w} = \underset{w}{\operatorname{argmax}} \, \frac{1}{|D_{\text{val}}|} \sum_{y_{ij} \in D_{\text{val}}} \ln\left(\sum_{m=1}^{M} w_m \hat{f}_m(y_{i,j})\right), \tag{4}$$
where $\hat{w} = [\hat{w}_1, \ldots, \hat{w}_M]$.

    The sum-to-unity and non-negativity constraints on the model weights are important for two main reasons. Firstly, these constraints ensure that the ensemble does not perform worse than the worst component model on out-of-sample data (proof in Appendix C.2). Additionally, the non-negativity constraint can mitigate the potential issue of model correlation by shrinking the weights of some of the highly correlated models towards zero (Breiman, Citation1996; Coqueret & Guida, Citation2020).

  5. Fit the component models on the entire upper triangle $D_{\text{in}}$, and then use the fitted models to generate predictions for the out-of-sample data $D_{\text{out}}$. The predictions are combined using the optimal combination weights determined in step 4.

Figure 1. Linear pool ensemble framework.

3.5. Accident and development period adjusted linear pools (ADLP) driven by general insurance characteristics

The estimation of the SLP combination weights specified in (4) implies that each model will receive the same weight in all accident periods. However, as previously argued, loss reserving models tend to have different predictive performance in different accident periods, making it potentially ideal for combination weights to vary by accident period. In industry practice, it is common to divide the aggregate claims triangle by claims maturity and fit different loss reserving models accordingly (Friedland, Citation2010).

For instance, the PPCF model tends to yield better predictive performance in earlier accident periods (i.e. the more mature accident periods) by taking into account the small number of outstanding claims. In contrast, the PPCI model and the Chain-Ladder tend to have better performance in less mature accident periods as they are insensitive to claims closure (Chapter 5 of Taylor, Citation2000, pp. 151–165). Therefore, in practice, actuaries may want to assign a higher weight to the PPCF model in earlier accident periods and larger weights to the PPCI or Chain-Ladder model in more recent accident periods (Chapter 5 of Taylor, Citation2000, pp. 151–165).

The argument above can be applied to other loss reserving models. To take into account the accident-period-dependent characteristics of loss reserving models, we partition the loss reserving data by claims maturity, with the first subset containing incremental claims from earlier accident periods (i.e. the mature data) and the second subset containing incremental claims from more recent accident periods (i.e. the immature data). The data is thus split an arbitrary number of times, $K$. The weight allocated to component model $m$ for projecting future incremental claims in the $k$th subset (i.e. $D_{\text{test}}^k$) is then optimised using the $k$th validation subset (denoted $D_{\text{val}}^k$):
$$\hat{w}^k = \underset{w^k}{\operatorname{argmax}} \, \frac{1}{|D_{\text{val}}^k|} \sum_{y_{ij} \in D_{\text{val}}^k} \ln\left(\sum_{m=1}^{M} w_m^k \hat{f}_m(y_{i,j})\right), \quad k = 1, \ldots, K, \tag{5}$$
subject to $\sum_{m=1}^{M} w_m^k = 1$ and $w_m^k \geq 0$. The choice of split points, as well as of $D_{\text{val}}^k$, requires care, as discussed below. We will mainly focus on the case $K=2$, but this can be easily extended to $K>2$; see Remark 3.4 and Appendix D.1, which consider $K=3$ and briefly discuss how $K$ could be chosen.

Table 2 summarises the multiple ensembles constructed by using different split points between the two subsets, together with the corresponding proportion of upper-triangle data points falling in subset 1. We do not test split points beyond accident period 33 due to the scarcity of data beyond this point. Results for all those options will be compared in Section 6, with the objective of choosing the optimal split point.

Table 2. Details of two-subsets data partition strategies.

Remark 3.2

The ‘empiricalness’ of the choice of split point might be seen as a limitation. It is hard to avoid, as there might not be a closed formula for the best split point. While this does require extra work each period, the additional computation time is not significant. Also, reserving analyses are typically repeated period after period, suggesting minimal shifts in the optimal partition from one period's worth of extra experience. This would suggest that a relatively stable choice of split point can be made early in the analysis and maintained over future periods, assuming consistent historical experience.

We now turn our attention to the definition of the validation sets to be used in (5). This pure ‘band’ approach leads to the definitions depicted in Figure 2. Unfortunately, this may ignore very important development period effects, because the second validation set (in orange) does not include any development periods beyond 12.

Figure 2. Data Partition Diagram: An illustrative example for ensembles that only consider impacts from accident periods.

More specifically, the training scheme for combination weights specified by (5) implies that the weights in Subset 2 (i.e. $\hat{w}^2$) derive only from the Subset 2 validation data (i.e. $D_{\text{val}}^2$). However, $D_{\text{val}}^2$ does not capture the late development period effects in the out-of-sample data: as shown in Figure 2, $D_{\text{val}}^2$ (shaded in orange) contains incremental claims from Development Period 2 to 12, whereas $D_{\text{test}}^2$ (shaded in light orange) comprises incremental claims from Development Period 2 to 40. Depending on the line of business, the impact of late development periods on claims can be quite significant. For example, for long-tailed lines of business, such as Bodily Injury liability claims, it usually takes a considerably long time for claims to be fully settled, and large payments can sometimes occur close to the finalisation of claims (Taylor, Citation2000, Chapter 4). In such circumstances, adequately capturing the late development period effects is critically important, as missing those effects tends to substantially increase the risk of not having sufficient reserves to cover the large payments that can occur in the late development periods.

This situation can be exacerbated by the data scarcity issue in $D_{\text{val}}^2$ for ensembles with late split points. For instance, partition strategy 17 only uses claims from Accident Period 34 to 40 in $D_{\text{val}}^2$, which corresponds to only 20 observations. However, those 20 observations are used to train the 18 model weights. When the data size is small relative to the number of model weights that need to be trained, the estimation of the combination weights tends to be less statistically reliable due to the relatively large sampling error (Aiolfi & Timmermann, Citation2006; Bellman, Citation1966).

In light of the above concerns, we modify the training scheme such that the mature claims data in $D_{\text{val}}^1$, which contains valuable information about late development period effects, can be utilised to train the weights in less mature accident periods. Additionally, since both $D_{\text{val}}^1$ and $D_{\text{val}}^2$ are used to train the weights in the second subset, the data scarcity issue in $D_{\text{val}}^2$ can be mitigated. Under the modified training scheme, (5) is revised as
$$\hat{w}^k = \underset{w^k}{\operatorname{argmax}} \, \frac{1}{|D_{\text{val}}^1 \cup D_{\text{val}}^k|} \sum_{y_{ij} \in D_{\text{val}}^1 \cup D_{\text{val}}^k} \ln\left(\sum_{m=1}^{M} w_m^k \hat{f}_m(y_{i,j})\right), \quad k = 1, \ldots, K, \tag{6}$$
subject to $\sum_{m=1}^{M} w_m^k = 1$ and $w_m^k \geq 0$. Since the combination weights estimated using (6) take into account impacts and features related to different Accident and Development period combinations, we denote the resulting ensembles as the Accident and Development period adjusted Linear Pools (ADLP).
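A minimal sketch of how the validation sets in (6) can be assembled for $K = 2$ is given below (illustrative names only; the ADLP package on CRAN implements the full procedure):

```r
# For K = 2: subset-1 weights are trained on the mature validation data
# only, while subset-2 weights are trained on the union of both validation
# subsets, as in (6). `valid` is the validation data with an
# accident-period column `i`; `split_ap` is the chosen split point.
adlp_valid_sets <- function(valid, split_ap) {
  val1 <- valid[valid$i <= split_ap, ]  # mature accident periods
  val2 <- valid[valid$i >  split_ap, ]  # immature accident periods
  list(k1 = val1, k2 = rbind(val1, val2))
}
```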

An example of the data partition strategy under ADLP ensembles with $K=2$ is given in Figures 3 and 4, where the new second validation set overlaps the previous first and second validation sets. Since the bottom set uses validation data from all accident periods, its results will hence correspond to the SLP results.

Figure 3. ADLP Validation Subset 1.

Figure 4. ADLP Validation Subset 2.

Based on the different partition strategies in Table 2, we have 17 ADLP ensembles in total. We use ADLP$_i$ to denote the ADLP ensemble constructed from the $i$th partition strategy. The fitting algorithm for ADLP is similar to the algorithm for fitting SLP ensembles described in Section 3.4, except for Step 4, where the combination weights are estimated using (6) instead of (5).

Remark 3.3

The partition of training and validation data is not new in the aggregate loss reserving context. In Lindholm et al. (Citation2020) and Gabrielli et al. (Citation2020), the data partition is performed at the individual claims level. The loss triangle for model training is constructed by aggregating half of the individual claims, and the validation triangle is built by grouping the other half. This is to accommodate the fact that only aggregate reserving triangles of size 12 × 12 are available in the original study (Gabrielli et al., Citation2020), which makes partitioning at the aggregate level difficult due to data scarcity. In such a scenario, this approach is advantageous as it efficiently uses the limited loss data by using the whole triangle constructed on the validation set to assess the model. Since a relatively large reserving triangle (i.e. 40 × 40) is available in this study, and since we are more interested in the component models' performance in the latest calendar periods so as to allocate combination weights based on projection accuracy, we decided to partition the loss triangle directly at the aggregate level.

Another approach for partitioning the aggregate loss reserving data is the rolling origin method (Al-Mudafer et al., Citation2022; Balona & Richman, Citation2020), which does not rely on the availability of individual claims information. However, this approach is mainly designed for selecting candidate models based on their projection accuracy rather than model combination.

The data partition strategy proposed in this paper is novel in three ways. Firstly, it is dedicated to the combination of loss reserving models, as it explicitly considers the difference in component models' performance across accident periods by partitioning the data by claims maturity. This additional partition step is generally unnecessary for model selection and, to our knowledge, has not been undertaken in the aforementioned loss reserving literature. However, as discussed in Section 3.5 and further illustrated in Section 6.2.2, this step is critical for combining loss reserving models, as it allows the ensemble to effectively utilise the strengths of component models in different accident periods. Secondly, the proposed partition strategy takes into account the combination of effects from accident and development periods in the triangular data format, by bringing forward the validation data in older accident periods to train the combination weights in subsequent accident periods. Thirdly, this paper is also the first to investigate the impact of various ways of splitting the validation data on the ensemble's performance in the aggregate loss reserving context. We find that the ensemble's performance can be sensitive to the choice of split point, which has not been explicitly addressed in the current literature. Motivated by these findings, an optimal point at which to split the validation data (see detailed illustrations in Section 6.2.2) is proposed in this paper, which might provide useful guidance for implementation.

Remark 3.4

This paper focuses on the case $K=2$ (i.e. partitioning the data into two subsets). However, one can divide the data into more subsets and train the corresponding combination weights by simply using $K>2$ in (6). To illustrate this idea, we provide an example with $K=3$. For instance, for ADLP$_2$, another split point, say accident period 15, can be introduced after the first split at accident period 5. We denote the resulting ensemble ADLP$_2^+$. Similarly, a third subset can be introduced to the other ADLP ensembles. Table 3 shows some examples of three-subset partition strategies based on the current ADLP ensembles.

Table 3. Details of three-subsets data partition strategies.

The details of the performance analysis of the ADLP$^+$ ensembles listed in Table 3 are given in Appendix D.1. Overall, there is no significant difference between the out-of-sample performance of the ADLP ensembles and the corresponding ADLP$^+$ ensembles. At least with the data set used in this study, we do not find enough evidence to support an improvement over the current ADLP ensembles from introducing more subsets for training the combination weights. This might be explained by the small difference in the performance of component models after accident period 20, which limits the potential benefit the ensemble can gain from diversity among the component models (Breiman, Citation1996). Therefore, adding additional subsets in the later accident periods seems redundant, at least in this example.

3.6. Relationship to the stacked regression approach

Another popular ensemble scheme is the stacked regression approach proposed by Breiman (Citation1996), which uses the Mean Squared Error (MSE) as the combination criterion. The idea of stacked regression has been applied in the mortality forecasting literature; for detailed implementations, one may refer to Kessy et al. (Citation2022) and Li (Citation2022). In Lindholm and Palmborg (Citation2022), the authors propose averaging the predictions from an ensemble of neural networks with random initialisation and training data manipulation to obtain the final mortality forecast, which can be thought of as a special case of the stacked regression approach where an equal weight is applied to each component model.

In summary, under the stacked regression approach, the weights allocated to component models are determined by minimising the MSE achieved by the combined forecast:
$$\hat{w} = \underset{w}{\operatorname{argmin}} \, \frac{1}{|D_{\text{val}}|} \sum_{y_{ij} \in D_{\text{val}}} \left(\sum_{m=1}^{M} w_m \hat{\mu}_{ij}^m - y_{ij}\right)^2, \tag{7}$$
where $\hat{\mu}_{ij}^m$ is the predicted mean from component model $m$.
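For comparison, the stacked-regression weights in (7) can be obtained with generic constrained optimisation; the sketch below reparameterises the simplex with a softmax transform (a device of ours for illustration, not necessarily the cited authors' implementation):

```r
# Stacked-regression weights as in (7): minimise the validation MSE of the
# combined mean forecast over non-negative weights summing to one.
# `mu_mat` is an n-by-M matrix of component mean predictions and `y` the
# observed values (illustrative names).
stack_mse_weights <- function(mu_mat, y) {
  M <- ncol(mu_mat)
  obj <- function(theta) {
    w <- exp(theta) / sum(exp(theta))  # softmax keeps w on the simplex
    mean((as.vector(mu_mat %*% w) - y)^2)
  }
  theta_hat <- optim(rep(0, M), obj, method = "BFGS")$par
  exp(theta_hat) / sum(exp(theta_hat))
}
```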

As Yao et al. (Citation2018) suggest, both stacked regression and linear pools correspond to the idea of ‘stacking’. When MSE is used as optimisation criterion (i.e. the stacked regression approach proposed by Breiman (Citation1996)), the resulting ensemble is a stacking of means. When using the Log Score, the corresponding ensemble is a stacking of predictive distributions.

Although both the stacked regression and linear pools are stacking methods, their objectives are different: the stacked regression seeks to optimise the ensemble's central forecast accuracy, whereas linear pools strive to optimise the calibration and sharpness of distributional forecast. Despite showing success in combining point forecasts, stacked regression is not recommended to combine distributional forecasts (Knüppel & Krüger, Citation2022; Yao et al., Citation2018). The reasons are twofold as explained below.

Firstly, the combination criterion under the stacked regression approach, which is the Mean Squared Error, is not a strictly proper scoring rule (Gneiting & Raftery, Citation2007; Knüppel & Krüger, Citation2022). As discussed in Section 3.1, a strictly proper scoring rule guarantees that the score assigned to the predictive distribution is equal to the score given to the true distribution if and only if the predictive distribution is the same as the true distribution. However, since the MSE only concerns the first moment, a predictive distribution with misspecified higher moments could potentially receive the same score as the true distribution. Therefore, using MSE as the evaluation criterion does not give forecasters the incentives to correctly model higher moments. For an illustrative example, one might refer to Appendix C.3. In contrast, the Log Score is a strictly proper scoring rule, which evaluates the model's performance over the entire distribution, and encourages the forecaster to model the whole distribution correctly (Gneiting & Raftery, Citation2007). This property is particularly desirable for loss reserving, where the model's performance over the higher quantile of the loss distribution is also of interest to insurers and regulators.

The second advantage of using the linear pooling scheme over the stacked regression approach in combining distributional forecasts comes from the potential trade-off between the ensemble's performance on the mean and variance forecasts. In short, minimising the MSE will always improve the mean forecasts, but might harm the ensemble's variance forecasts when the individual variances are unbiased or over-estimated (Knüppel & Krüger, Citation2022). An illustrative proof of this idea can be found in Appendix C.3. To balance the trade-off between mean and variance forecasts, the combination criterion should take into account the whole distribution (or at least the first two moments). This can be achieved by using a strictly proper scoring rule (e.g. the Log Score).

4. Implementation of the SLP and ADLP

This section is dedicated to the implementation of the SLP and ADLP frameworks. It provides theoretical details for this implementation, and complements the code available at https://github.com/agi-lab/reserving-ensemble.

4.1. Minorisation–Maximisation strategy for log score optimisation

To solve the optimisation problems specified in (5) and (6), we implement the Minorisation–Maximisation strategy proposed by Conflitti et al. (Citation2015), which is a popular and convenient strategy for maximising the Log Score. Instead of directly maximising the Log Score, a surrogate objective function is maximised. Define
$$\phi_\lambda(w_m^k, a) = \frac{1}{|\tilde{D}_{\text{val}}^k|} \sum_{y_{ij} \in \tilde{D}_{\text{val}}^k} \sum_{m=1}^{M} \frac{\hat{f}_m(y_{i,j}) \, a_m}{\sum_{l=1}^{M} \hat{f}_l(y_{i,j}) \, a_l} \ln\left(\frac{w_m^k}{a_m} \sum_{l=1}^{M} \hat{f}_l(y_{i,j}) \, a_l\right) - \lambda\left(\sum_{m=1}^{M} w_m^k\right), \tag{8}$$
where $a_m$ is an arbitrary weight, $\lambda$ is a Lagrange multiplier, and where $\tilde{D}_{\text{val}}^k = D_{\text{val}}^1 \cup D_{\text{val}}^k$. Note that for the SLP, we have $w_m^1 = \cdots = w_m^K \equiv w_m$. The weights are then updated iteratively by maximising (8):
$$(w_m^k)^{i+1} = \underset{w}{\operatorname{argmax}} \, \phi_\lambda\big(w_m^k, (w_m^k)^i\big). \tag{9}$$
By setting
$$\frac{\partial \phi_\lambda\big(w_m^k, (w_m^k)^i\big)}{\partial w_m^k} = 0,$$
we have
$$w_m^k = \frac{1}{\lambda} \frac{1}{|\tilde{D}_{\text{val}}^k|} \sum_{y_{ij} \in \tilde{D}_{\text{val}}^k} \frac{\hat{f}_m(y_{i,j}) \, (w_m^k)^i}{\sum_{l=1}^{M} \hat{f}_l(y_{i,j}) \, (w_l^k)^i}.$$
Now, using the constraint $\sum_{m=1}^{M} w_m^k = 1$ yields
$$\sum_{m=1}^{M} w_m^k = \frac{1}{\lambda} \frac{1}{|\tilde{D}_{\text{val}}^k|} \sum_{y_{ij} \in \tilde{D}_{\text{val}}^k} \sum_{m=1}^{M} \frac{\hat{f}_m(y_{i,j}) \, (w_m^k)^i}{\sum_{l=1}^{M} \hat{f}_l(y_{i,j}) \, (w_l^k)^i} = \frac{1}{\lambda} \frac{1}{|\tilde{D}_{\text{val}}^k|} \sum_{y_{ij} \in \tilde{D}_{\text{val}}^k} 1 = \frac{1}{\lambda} \frac{|\tilde{D}_{\text{val}}^k|}{|\tilde{D}_{\text{val}}^k|} = \frac{1}{\lambda} = 1.$$
Therefore $\lambda = 1$, and the updated weights in each iteration become
$$(w_m^k)^{i+1} = \frac{(w_m^k)^i}{|\tilde{D}_{\text{val}}^k|} \sum_{y_{ij} \in \tilde{D}_{\text{val}}^k} \frac{\hat{f}_m(y_{i,j})}{\sum_{l=1}^{M} \hat{f}_l(y_{i,j}) \, (w_l^k)^i}, \tag{10}$$
where the weights are initialised as $(w_m^k)^0 = 1/M$ such that the constraints $\sum_{m=1}^{M} w_m^k = 1$ and $w_m^k \geq 0$ are automatically satisfied in each iteration. The iterations are expected to converge due to the monotonically increasing property of the surrogate function (Conflitti et al., Citation2015). To promote computational efficiency, the algorithm is usually terminated when the difference between the Log Scores of two successive iterates is less than a tolerable error $\epsilon$ (Conflitti et al., Citation2015). We set $\epsilon = 10^{-16}$, which is a common tolerable error used in the literature. An outline of the Minorisation–Maximisation strategy is given in Algorithm 1.

Note that this process can also be used to optimise (4), as it is a special case of the ADLP with $K=1$.
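The update in (10) is straightforward to implement; the following R sketch (our transcription, with illustrative names) iterates it until the Log Score converges:

```r
# Minorisation-Maximisation weight estimation, iterating update (10).
# `dens_mat` is an n-by-M matrix of component densities evaluated on the
# (combined) validation set; returns the estimated weight vector.
mm_weights <- function(dens_mat, eps = 1e-16, max_iter = 1000L) {
  M <- ncol(dens_mat)
  w <- rep(1 / M, M)                        # initialise at equal weights
  log_score <- function(w) mean(log(as.vector(dens_mat %*% w)))
  ls_old <- log_score(w)
  for (it in seq_len(max_iter)) {
    mix <- as.vector(dens_mat %*% w)        # ensemble density at each point
    w <- w * colMeans(dens_mat / mix)       # update (10); w stays on the simplex
    ls_new <- log_score(w)
    if (abs(ls_new - ls_old) < eps) break   # stop when Log Score has converged
    ls_old <- ls_new
  }
  w
}
```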

4.2. Prediction from ensemble

Based on (3), the central estimate of incremental claims by the ensemble can be calculated as follows:
$$\mu_{ij} = \sum_{m=1}^{M} w_m \hat{\mu}_{ij}^m. \tag{11}$$
With a slight abuse of notation, $w_m$ in this section denotes either the weight under the SLP or the weight under an ADLP. For ADLP ensembles, $w_m$ is set to $w_m^k$ in the calculation below if $y_{ij} \in D^k$.

To provide a central estimate of the reserve, we simply aggregate all the predicted means in the out-of-sample data: $\hat{R} = \sum_{(i,j) \in D_{\text{out}}} \mu_{ij}$. Since insurers and regulators are also interested in the estimation of the 75th quantile of the reserve, simulation is required. The process for this simulation is outlined in Algorithm 2.
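A hedged sketch of such a simulation is given below: for each simulated reserve, every out-of-sample cell draws a component model with probability equal to its weight and then samples from that model's predictive distribution; `sim_cell` is a hypothetical user-supplied sampler, and the names are ours rather than the package's:

```r
# Simulate the reserve distribution from the ensemble and estimate its
# 75th percentile. `cells` is a data frame of out-of-sample (i, j) cells,
# `w` the ensemble weights, and `sim_cell(m, i, j)` draws one value from
# component m's predictive distribution for cell (i, j) (hypothetical helper).
simulate_reserve_q75 <- function(cells, w, sim_cell, n_sim = 10000L) {
  M <- length(w)
  reserves <- replicate(n_sim, {
    draws <- vapply(seq_len(nrow(cells)), function(r) {
      m <- sample.int(M, 1L, prob = w)     # pick a component model
      sim_cell(m, cells$i[r], cells$j[r])  # sample from its predictive dist
    }, numeric(1))
    sum(draws)                             # one simulated total reserve
  })
  quantile(reserves, 0.75)
}
```

For ADLP ensembles, the weight vector passed in would vary with the accident-period subset of each cell, in line with (11).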

5. Comparison of the predictive performance of the ensembles

5.1. Measuring distributional forecast accuracy: log score

To assess and compare the out-of-sample distributional forecast performance of different models, the Log Score is averaged across all the cells in the lower triangle:
$$\text{LogS}^{\text{out}} = \frac{1}{|D_{\text{out}}|} \sum_{(i,j) \in D_{\text{out}}} \ln\big(\hat{f}(y_{i,j})\big). \tag{12}$$
Since loss reserving models tend to perform differently in different accident periods, we also summarise the Log Score by accident period:
$$\text{LogS}^{\text{out}}_{AP_i} = \frac{1}{|D^{\text{out}}_{AP_i}|} \sum_{j \in D^{\text{out}}_{AP_i}} \ln\big(\hat{f}(y_{i,j})\big), \tag{13}$$
where $D^{\text{out}}_{AP_i}$ denotes the set of out-of-sample incremental claims in accident period $i$.

Remark 5.1

Besides the Log Score, another popular proper score in the literature is the Continuous Ranked Probability Score (CRPS) (Gneiting & Ranjan, Citation2011; Opschoor et al., Citation2017). This paper (and the associated R package) also incorporates the CRPS as an alternative evaluation metric for the methods proposed. Further details and illustrative results are provided in Appendix D.3. Both scores generally lead to consistent conclusions.

5.2. Statistical tests

To determine whether the difference between the performance of two competing forecasting models is statistically significant, statistical tests are necessary. We consider the Diebold–Mariano test in this paper, due to its high flexibility and wide application in the probabilistic forecasting literature (Diebold & Mariano, Citation2002; Gneiting & Katzfuss, Citation2014). The Diebold–Mariano test is one of the few tests that requires neither the loss function to be quadratic nor the forecast errors to be normally distributed, which is what we need in our framework. Furthermore, the Diebold–Mariano test is specifically designed to compare forecasts and inherently emphasises out-of-sample performance (Diebold, Citation2015). This feature aligns well with our objectives, as actuaries conducting reserving exercises are primarily concerned with the accuracy of projections. The performance of the Diebold–Mariano test has been discussed in Costantini and Kunst (Citation2011), who discovered a tendency for the test to favour simpler benchmark models. As a result, the Diebold–Mariano test serves as a conservative criterion for evaluating the performance of our proposed ensemble strategies.

Consider the null hypothesis stating that model $G$ performs as well as model $F$. The Diebold–Mariano test statistic can then be specified as (Diebold & Mariano, Citation2002)
$$t_n = \sqrt{n} \, \frac{\bar{S}_n^F - \bar{S}_n^G}{\hat{\sigma}_n}, \tag{14}$$
where $\bar{S}_n^F$ and $\bar{S}_n^G$ are the average scores (here, Log Scores) received by models $F$ and $G$ on out-of-sample data, and where $\hat{\sigma}_n$ is the estimated standard deviation of the score differential, that is,
$$\hat{\sigma}_n = \sqrt{\frac{1}{|D_{\text{out}}|} \sum_{(i,j) \in D_{\text{out}}} \Big(\ln\big(\hat{f}^F(y_{i,j})\big) - \ln\big(\hat{f}^G(y_{i,j})\big)\Big)^2}. \tag{15}$$
Assuming the test statistic asymptotically follows a standard Normal distribution, the null hypothesis is rejected at significance level $\alpha$ if the test statistic is greater than the $(1-\alpha)$ quantile of the standard Normal distribution.
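In R, the statistic in (14)–(15) reduces to a few lines; `sF` and `sG` below are placeholder vectors of out-of-sample log predictive densities for the two competing models:

```r
# Diebold-Mariano statistic for the Log Score, as in (14)-(15).
# `sF`, `sG`: vectors of log predictive densities of models F and G on
# the out-of-sample cells (illustrative names).
dm_statistic <- function(sF, sG) {
  n <- length(sF)
  d <- sF - sG                          # score differentials
  sqrt(n) * mean(d) / sqrt(mean(d^2))   # t_n in (14), sigma_hat from (15)
}

# Reject equal performance at level alpha when the statistic exceeds
# qnorm(1 - alpha), e.g. qnorm(0.95) for alpha = 0.05.
```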

5.3. Measuring performance on the aggregate reserve level

At the aggregate level, an indication of the performance of the central estimate of the reserve by model $m$ can be provided by the relative reserve bias:
$$R^{\text{bias}} = \frac{\hat{R}^m - R^{\text{True}}}{R^{\text{True}}}. \tag{16}$$
Since insurers and regulators are also interested in the estimation of the 75th quantile of the reserve, a model's performance at this quantile can be indicated by its relative reserve bias at the 75th quantile:
$$R_{75}^{\text{bias}} = \frac{\hat{R}_{75}^m - R_{75}^{\text{True}}}{R_{75}^{\text{True}}}, \tag{17}$$
where $R_{75}^{\text{True}}$ is the 75th quantile of the true outstanding claims, and $\hat{R}_{75}^m$ is the estimated 75th quantile by model $m$. The calculation details for $\hat{R}_{75}^m$ can be found in Appendix C.1.

Remark 5.2

Of course, $R^{\text{True}}$ and $R_{75}^{\text{True}}$ are not observable in practice. Thanks to our use of simulated data (see Section 6.1), however, we are able to calculate them; here, they have been calculated using the simulated data in the bottom triangles. Furthermore, as these metrics are not proper scoring rules, care should be taken when using these results to rank models.

6. Results and discussion

After presenting the ensemble modelling framework, this section provides numerical examples to illustrate the out-of-sample performance of the proposed linear pool ensembles SLP and ADLP, as well as of the BMV and EW. Following the introduction of the synthetic loss reserving data in Section 6.1, Sections 6.2.1 and 6.2.2 analyse the distributional forecast performance of the SLP and the ADLP ensembles at the incremental claims level, respectively. Section 6.2.3 analyses the differences in the models' performance using statistical test results. The performance of the proposed ensembles is also analysed at the aggregate reserve level in Section 6.3. Finally, we share some insights on the properties of component models by analysing the optimal combination weights yielded by our framework.

6.1. Example data

Our models are tested using the SynthETIC simulator (Avanzi et al., Citation2021), from which triangles for aggregate paid loss amounts, reported claim counts, and finalised claim counts are generated and aggregated into 40×40 triangles. To better understand the models' predictive performance and test its consistency, 100 loss data sets are simulated, and the models are fitted to each of the 100 data sets.

Simulated data is preferred in this study as it allows the models' performance to be assessed over the full distribution of outstanding claims, the quality of the distributional forecast being the main focus of this paper. This paper uses the default version of SynthETIC, with the following key characteristics (Avanzi et al., Citation2021):

  • Claim payments are long-tailed: in general, the longer the settlement time, the more challenging it is to provide an accurate estimate of the reserve due to the greater uncertainties involved (Jamal et al., Citation2018);

  • High volatility of incremental claims;

  • Payment size depends on claim closure: the future paid loss amount generally varies in line with the finalised claims count (i.e. the number of claims being settled), particularly in earlier accident periods.

The key assumptions above are inspired by features observed in real data and aim to simulate the complicated issues that actuaries can encounter in practice when modelling outstanding claims (Avanzi et al., Citation2021). With such complex data patterns, it is challenging for a single model to capture all aspects of the data. Therefore, relying on a single model might increase the risk of missing important patterns present in the data.

6.2. Predictive performance at the incremental claims level

In terms of distributional forecast accuracy, the linear pool ensembles SLP and ADLP outperform both the EW ensemble and the BMV at the incremental claims level, as measured by the average Log Score over 100 simulations. Furthermore, when the validation data is appropriately partitioned into two subsets, the ADLP outperforms the SLP.

6.2.1. SLP: the standard linear pool ensemble

The distribution of the Log Score received by the SLP, the BMV and the EW (i.e. the equally weighted ensemble) is illustrated in Figure 5. Even without any partition of the validation set for training combination weights, the linear pool ensemble achieves a higher average Log Score than its competing strategies (and is therefore preferred), with similar variability in performance.

Figure 5. Distribution of Log Score over 100 simulations (higher is better): comparison among SLP, BMV and EW.

The performance of different methods in each accident period is illustrated in Figure 6, which shows the Log Score by accident period, calculated using (Equation13) and averaged over 100 simulations. The SLP strictly outperforms the BMV in all accident periods. Although the BMV, which is the GAMLSS model with the Log-Normal distributional assumption (i.e. GAMLSSLN) in this example, has the highest average Log Score among the component models, it does not yield the best performance in every accident period. For instance, the BMV under-performs the PPCF model in accident period 3, and it is beaten by the ZAGA GLM in earlier accident periods. Therefore, the common model selection strategy, which relies on the BMV to generate predictions for all accident periods, might not be the optimal solution.

Figure 6. Mean Log Score by accident periods (higher is better).

Additionally, the SLP strictly outperforms the equally weighted ensemble after accident period 20. The under-performance of the equally weighted ensemble might be explained by the relatively poor performance of several models in late accident periods, which drags down the Log Score attained by the equally weighted ensemble. This relatively poor performance is due to the fact that some models are more sensitive to data scarcity in more immature (later) accident years. For instance, the cross-classified models assume a separate parameter for each accident period, and later accident periods have little data available for training. Therefore, despite the simplicity of the equally weighted ensemble, blindly assigning equal weight to all component models ignores the differences in the component models' strengths and might increase the risk of letting a few inferior models distort the ensemble's performance.

However, the SLP brings no substantial improvement over the equally weighted ensemble in the earlier half of the accident periods. The unsatisfactory predictive performance of the SLP at earlier (i.e. mature) accident periods might be explained by the fact that a component model receives the same weight across all accident periods under the SLP, which is what motivated the development of the ADLP in Section 3.5. More specifically, in this example, although the PPCF and the ZAGA models have relatively low overall Log Scores due to their poor performance in immature accident periods, they have the best predictive performance among the component models at mature accident periods. However, since the weight allocated to each model reflects its overall predictive performance across all accident periods, the PPCF and the ZAGA GLM are assigned small weights, as shown in Section 6.4, despite their outstanding predictive performance in earlier accident periods. The ADLP offers a solution to this problem, as illustrated in the following section.

6.2.2. ADLP1 to ADLP18: linear pool ensembles with different data partition strategies

Figure 7 plots the out-of-sample Log Score attained by ADLP ensembles with the different split points specified in Table , averaged over 100 simulations. The grey point represents the mean Log Score attained by the SLP, which is a special case of the ADLP ensemble with a split point at accident period 0 (i.e. no partition is made). All the ADLP ensembles outperform the SLP based on Log Score, supporting the advantages of data partitioning as discussed in previous sections.

Figure 7. Mean Log Score Plot: comparison among different partition strategies (higher is better).

As per Figure 7, the mean Log Score forms an inverted U shape, peaking at a split point of accident period 18. This shape can be explained as follows. If the split point is too early (e.g. at accident period 3), ADLP brings little improvement over the SLP due to the small number of data points in the first subset. This can also be seen in Figure 8, which dissects the ensembles' performance by accident period. Although ADLP1, represented by the orange line in Figure 8, has the best performance in the second and third accident periods, its Log Score falls to the SLP's level afterwards. Therefore, the overall difference between ADLP1 and the SLP is relatively small. As more accident periods are incorporated into the first subset, the overall difference between the ADLP ensembles and the SLP grows, which explains the increasing trend of the mean Log Score before accident period 18.

Figure 8. Log score plot by accident periods (higher is better): comparison among different partition strategies.

However, for late split points (e.g. a split point at accident period 33), the performance of the ADLP ensembles also tends towards that of the SLP, as shown in Figure 7. This is because the first subset of those ADLP ensembles with late split points covers most of the accident periods, and the SLP can be interpreted as a special case of the ADLP ensembles with a single subset that includes all accident periods. Additionally, late split points might make it hard for the ensemble to focus on, and thus take advantage of, the high-performing models in earlier accident periods.

Based on the above reasoning, an ideal partition strategy should balance the ensemble's performance in both earlier and later accident periods. In this study, the optimal split point is attained at accident period 18. As per Figure 8, ADLP12, which has its split point at accident period 18, substantially outperforms its competitors in the first fifteen accident periods by effectively capturing the high Log Score attained by a few outstanding models.

However, we wish to acknowledge that this example is for illustration purposes. The recommendation made above may be more relevant to loss data sets that share similar characteristics with the synthetic data used in this paper. For practical applications, if there is no substantial difference in the performance of the component models in earlier accident periods, one may consider simply allocating the first half of the accident periods to the first subset.

6.2.3. Analysing the difference in distributional forecast performance using statistical tests

The Log Scores of the ADLP ensembles in Figure 7 are clearly higher than that of the SLP, but we have not yet established whether this difference is statistically significant. To analyse the statistical significance of the improvement over the traditional approaches brought by the ADLP ensembles, we run the Diebold-Mariano test on each simulated dataset (see (Equation14)), with the null hypothesis stating that the two strategies have equal performance. In particular, we test ADLP12, which has the best predictive performance as measured by Log Score, against the EW and the BMV, respectively. If the null hypothesis is rejected, ADLP12 should be favoured.

Figure 9 illustrates the distribution of test statistics under the Diebold-Mariano test, with the red dashed line marking the critical value at the customary 5% significance level. Test statistics above the red dashed line indicate those datasets where the null hypothesis has been rejected, and the number of rejections under each test is summarised in Table 4. Since the null hypothesis is rejected for most simulations, and almost all the test statistics are above zero (as marked by the blue dashed line), the DM test implies a decision in favour of ADLP12.

Figure 9. Test statistics: ADLP12 vs. EW and BMV.

Table 4. The number of datasets (out of 100) for which H0 is rejected at the 5% significance level.

We now investigate where the difference in performance between ADLP12 and the BMV or the EW comes from. Since we expect the improvement over the traditional approaches to come partially from the ability of linear pools to optimise the combination of component models based on Log Score, we first test the Standard Linear Pool (SLP) against the traditional approaches for each simulated data set. As per Figure 10 and Table 4, the null hypothesis is rejected for most simulated data sets and most of the test statistics are positive; hence the SLP is favoured over the BMV and the EW in most cases.

Figure 10. Test statistics: SLP vs. EW and BMV.

Figure 11. Test statistics: ADLP12 vs. SLP.

As per the discussion in Section 3.5, we expect ADLP12 to improve on the SLP by better taking into account the typical reserving characteristics found in the simulated data sets. Based on the results shown in Table 4 and Figure 11, ADLP12 is favoured by the statistical tests in most simulated data sets.

Therefore, the outstanding performance of ADLP12 can be decomposed into two parts. Firstly, it improves on the model selection strategy and the equally weighted ensemble, which can be credited to the advantages of linear pools in model combination. Secondly, as a linear pool ensemble specially tailored to general insurance loss reserving, ADLP12 further improves on the SLP by considering the impact of accident and development periods on the ensemble's performance.

6.3. Predictive performance on aggregate reserve level

Since the central estimate of the reserve and its corresponding risk margin at the 75th quantile are required to be reported by insurers under Australian prudential regulatory standards (APRA, Citation2019), we take this as an example of actuarial application and calculate those two quantities following the methodology specified in Section 5.3, in order to gain some insights into the performance of the three different models (ADLP, EW, and BMV). The box plots in Figures 12 and 13 illustrate the distributions of relative reserve bias for the central and quantile estimates, respectively, across 100 simulations. A blue dashed line indicates the point of 0% bias in each plot.

Figure 12. Distribution of central relative reserve bias for ADLP12, BMV and EW.

Figure 13. Distribution of relative reserve bias at the 75th quantile for ADLP12, BMV and EW.

The equally weighted ensemble displays a notable tendency towards overestimation, evident for both the central and the quantile estimates. Although the BMV demonstrates relatively good performance in central reserve estimation, it frequently under-predicts at the 75th quantile. In contrast, the ADLP ensemble experiences fewer instances of underestimation or overestimation under both the central and the quantile reserve estimates.

Remark 6.1

Note that, as discussed in Section 5.3, the reserve bias metrics are not proper scoring rules, and so care needs to be taken in interpreting the above results. Nevertheless, we consider the above useful in terms of developing intuition regarding the performance of each model.

6.4. Analysis of optimal model weights

Analysing the combination weights can potentially help modellers gain insights into the component models' properties; examples can be found in Taylor (Citation2000, Chapter 5, pp. 151–165) and Kessy et al. (Citation2022). Therefore, we extract the combination weights from the ADLP ensembles, with Figure 14 plotting the model weights in subset 1 for different split points, and Figure 15 illustrating the model weights in subset 2. Since the model weights in the second subset of the ADLP ensembles are trained using the whole validation data set – recall (Equation6) – they are identical to the combination weights under the SLP.

Figure 14. Combination weights in Subset 1: ADLP ensembles.

Figure 15. Combination weights in SLP and Subset 2 of ADLP.

Several models are worth highlighting. The weight allocated to GAMLSSLN, which is the BMV (identified as the model with the highest Log Score averaged over different accident periods) in this example, increases with the split point, as shown in Figure 14. Since the combination weights reflect the relative validation performance of the component models in the ensemble, GAMLSSLN appears to perform better in more recent accident periods, as its weight increases when data from later accident periods are added to the validation set. Indeed, as per Figure 15, GAMLSSLN is allocated the largest weight in the second subset. The weight allocated to GLMPPCF, however, demonstrates the opposite pattern: a substantial weight is given to GLMPPCF at the earliest split point, whereas the weight decreases to a small value as the split point increases. This phenomenon might be explained by the ability of GLMPPCF to capture claims finalisation information. Since claims in earlier accident periods are close to being finalised, GLMPPCF tends to yield better predictive performance in earlier accident periods by taking this factor into account. This property of PPCF models has also been noted by Taylor (Citation2000, Chapter 5, pp. 151–165).

Besides analysing individual models, we also analyse the weights' pattern at the group level by clustering the component models based on their structures and distributional assumptions. In terms of model structure, all models with standard cross-classified structures are assigned small weights, as shown in Figures 14 and 15. Due to the relatively complex characteristics of the simulated data, the standard cross-classified models tend to perform poorly, as they only consider the accident and development period effects. We have also found that, among models with similar structures, only one is allocated a large weight. In fact, the models with the four largest weights in both Figures 14 and 15 all have different structures. This finding highlights an advantage of using linear pools: the weights of potentially correlated models can be shrunk towards zero by imposing the non-negativity constraint on the weights.

From the angle of distributional assumptions, all models with the ODP distributional assumption are assigned relatively small weights, whereas models with the Log-Normal distributional assumption are allocated larger weights. Due to the relatively heavy tail of the incremental claims' distribution, the Log-Normal tends to provide a better distributional fit, as it has a heavier tail than the ODP distribution. Additionally, we have found that the Zero-Adjusted Gamma (ZAGA) model is assigned a larger weight when the split occurs earlier, as shown in Figure 14. This weight allocation might be explained by the tendency for a greater proportion of zero incremental claims in earlier accident periods, as claims from earlier accident periods are usually closer to their maturity: these are the ones most likely to have zero claims in the very late development periods (the later accident years have not reached that stage in the training set). Since the design of the ZAGA model allows the probability mass of zero incremental claims to be modelled as a function of development periods (i.e. the closeness of claims to their maturity), ZAGA tends to yield better performance when only the most mature accident periods are used. Interestingly, the Zero-Adjusted Log-Normal (ZALN) model is assigned small weights for earlier split points, which might be explained by the fact that it provides information overlapping with that of the ZAGA model.

6.5. Evaluating performance across varying data sizes

To assess the robustness of our results to different data sizes, we conducted additional tests using a 20×20 reserving triangle, which represents the aggregation of claims at the semi-annual level. To ensure consistency, we employed the same sets of simulation parameters from the SynthETIC generator to simulate the claims payments in both cases.

Using a similar partition strategy to the 40×40 case, the 20×20 triangle is split into a training set, a validation set and an out-of-sample set to build the ensembles and evaluate the models' performance. This split is illustrated in Figure A2 in Appendix D.2.

Following the same modelling framework as in Section 3, the results for the 20×20 triangle are obtained and listed in Appendix D.2. In summary, the ADLP ensembles continue to outperform the SLP, the equally weighted ensemble and the BMV in most simulations based on the Log Score, even at the smaller triangle size. Based on the statistical test results, the majority of p-values fall below the conventional 5% level when assessing the score differential between the ADLP and the BMV (indicating statistical significance). Although the count of statistically significant outcomes experiences a slight reduction when assessing the performance difference between the ADLP and the equally weighted ensemble, it remains substantial, constituting over half of the total simulations. However, the difference between the SLP and the ADLP becomes less significant as the triangle size decreases. This suggests that (unsurprisingly) simpler data partition strategies may be suitable when dealing with relatively small triangles.

As for the change in the composition of the ensembles, we have found (again unsurprisingly) that more parsimonious models receive larger weights on the 20×20 triangle. Since there are no instances of zero incremental claims at higher aggregation levels, models utilising zero-adjusted distributions (specifically, ZALN and ZAGA) are excluded from the ensembles before fitting.

The results above suggest that, while it may be possible to apply our ensembling framework to even smaller triangles, the added value of doing so might not be as significant. In practice, and nowadays, we expect that most companies would be in a position to analyse data at a sufficient level of granularity to reap the potential benefits of the approach developed in this paper.

7. Conclusions

In this paper, we have designed, constructed, analysed and tested an objective, data-driven method to optimally combine different models for reserving. Using the Log Score as the combination criterion, the proposed ensemble modelling framework has demonstrated its potential in yielding high distributional forecast accuracy.

Although the standard ensembling approach (e.g. the standard linear pool) has demonstrated success in non-actuarial fields, it does not yield optimal performance in the loss reserving context considered in this paper. To address the common general insurance loss data characteristics listed in Section 1.2, we introduced the Accident and Development period adjusted Linear Pools (ADLP) modelling framework. This includes a novel validation strategy that takes into account the triangular loss reserving data structure, as well as impacts and features related to different accident and development period combinations.

In particular, the partitioning strategy not only effectively utilises variations in component models' performance in different accident periods, but also captures the impact from late development periods on incremental claims, which can be particularly important for long-tailed lines of insurance business. Based on the illustrative examples which use simulated aggregate loss reserving data, the ADLP ensembles outperform traditional approaches (i.e. the model selection strategy BMV and the equally weighted ensemble EW) and the standard linear pool ensemble, when the Log Score, CRPS, or relative reserve bias at the 75th quantile are used for the assessment.

Furthermore, we also investigated the impact of the split point on the ensemble's performance. Given the composition of models included in the ensemble, we have found that splitting the data into two subsets close to the middle point (i.e. between accident periods 15 and 20 for a 40×40 loss triangle) generally yields satisfactory performance, by taking advantage of several high-performing models in earlier accident periods while maintaining a sufficient data size for training the combination weights in both subsets.

Acknowledgments

Authors are very grateful for comments from two anonymous referees, which led to significant improvements of the paper. Earlier versions of this paper were presented at the Actuaries Institute 2022 All Actuaries Summit, the 25th International Congress on Insurance: Mathematics and Economics (Online), the 2022 Virtual ASTIN/AFIR Colloquium, the One World Actuarial Research Seminars, as well as at the 2023 CAS Annual Meeting in Los Angeles. The authors are grateful for constructive comments received from colleagues who attended those events.

Additionally, the authors are grateful for the research assistance of Yun Wai (William) Ho, who contributed materially to the review of the code related to the examples presented. The authors are also grateful to the Casualty Actuarial Society (CAS) for awarding the 2023 Hachemeister Prize to this paper, and for supporting its presentation at the 2023 CAS Annual Meeting in Los Angeles.

R package ADLP

An R package entitled ADLP has been developed based on the Accident and Development period adjusted Linear Pools (ADLP) method, and is available from CRAN (link: https://cran.r-project.org/web/packages/ADLP/index.html). It features a suite of functions for calibrating ADLP ensembles, calculating evaluation metrics, and generating simulations from fitted ADLP objects. Additionally, the package grants users ample flexibility to choose or create component models for the ensemble, and to employ data partitioning for calibrating either the component models or the combination weights, in line with their own experience.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data and code

All data sets were generated using the CRAN package SynthETIC as explained above. Results presented in this paper can be replicated using the R code available at https://github.com/agi-lab/reserving-ensemble.

Additional information

Funding

This research was supported under Australian Research Council's Discovery Project funding scheme (DP200101859). The views expressed herein are those of the authors and are not necessarily those of the supporting organisations.

References

  • Aiolfia, M., & Timmermann, A. (2006). Persistence in forecasting performance and conditional combination strategies. Journal of Econometrics, 135(1-2), 31–53. https://doi.org/10.1016/j.jeconom.2005.07.015
  • Al-Mudafer, M. T., Avanzi, B., Taylor, G., & Wong, B. (2022). Stochastic loss reserving with mixture density neural networks. Insurance: Mathematics and Economics, 105, 144–174.
  • APRA (2019). Prudential standard gps 340 insurance liability valuation. https://www.legislation.gov.au/Details/F2018L00738.
  • Avanzi, B., Taylor, G., Wang, M., & Wong, B. (2021). SynthETIC: An individual insurance claim simulator with feature control. Insurance: Mathematics and Economics, 100, 296–308.
  • Baars, J. A., & Mass, C. F. (2005). Performance of national weather service forecasts compared to operational, consensus, and weighted model output statistics. Weather and Forecasting, 20(6), 1034–1047. https://doi.org/10.1175/WAF896.1
  • Balona, C., & Richman, R. (2020). The actuary and IBNR techniques: A machine learning approach. Available at SSRN 3697256.
  • Baran, S., & Lerch, S. (2018). Combining predictive distributions for the statistical post-processing of ensemble forecasts. International Journal of Forecasting, 34(3), 477–496. https://doi.org/10.1016/j.ijforecast.2018.01.005
  • Bellman, R. (1966). Dynamic programming. Science (New York, N.Y.), 153(3731), 34–37. https://doi.org/10.1126/science.153.3731.34
  • Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1), 49–64.
  • Claeskens, G., Magnus, J. R., Vasnev, A. L., & Wang, W. (2016). The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting, 32(3), 754–762. https://doi.org/10.1016/j.ijforecast.2015.12.005
  • Conflitti, C., De Mol, C., & Giannone, D. (2015). Optimal combination of survey forecasts. International Journal of Forecasting, 31(4), 1096–1103. https://doi.org/10.1016/j.ijforecast.2015.03.009
  • Coqueret, G., & Guida, T. (2020). Machine learning for factor investing: R version. CRC Press.
  • Costantini, M., & Kunst, R. M. (2011). On the usefulness of the Diebold-Mariano test in the selection of prediction models. Statistical Journal, 22, 153–161.
  • Diebold, F. X. (2015). Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold–Mariano tests. Journal of Business & Economic Statistics, 33(1), 1–1. https://doi.org/10.1080/07350015.2014.983236
  • Diebold, F. X., & Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1), 134–144. https://doi.org/10.1198/073500102753410444
  • England, P. D., & Verrall, R. J. (2001). A flexible framework for stochastic claims reserving. In Proceedings of the Casualty Actuarial Society (Vol. 88, No. 1, pp. 1-38).
  • Friedland, J. (2010). Estimating unpaid claims using basic techniques. Casualty Actuarial Society.
  • Gabrielli, A. (2020). A neural network boosted double overdispersed Poisson claims reserving model. ASTIN Bulletin, 50(1), 25–60. https://doi.org/10.1017/asb.2019.33
  • Gabrielli, A., Richman, R., & Wüthrich, M. V. (2020). Neural network embedding of the over-dispersed Poisson reserving model. Scandinavian Actuarial Journal, 2020(1), 1–29. https://doi.org/10.1080/03461238.2019.1633394
  • Gneiting, T., & Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1, 125–151. https://doi.org/10.1146/statistics.2013.1.issue-1
  • Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.1198/016214506000001437
  • Gneiting, T., Raftery, A. E., Westveld III, A. H., & Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum crps estimation. Monthly Weather Review, 133(5), 1098–1118. https://doi.org/10.1175/MWR2904.1
  • Gneiting, T., & Ranjan, R. (2011). Comparing density forecasts using threshold-and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3), 411–422. https://doi.org/10.1198/jbes.2010.08110
  • Gneiting, T., & Ranjan, R. (2013). Combining predictive distributions. Electronic Journal of Statistics, 7, 1747–1782.
  • Good, I. J. (1992). Rational decisions. In Breakthroughs in statistics (pp. 365–377). Springer.
  • Green, P. J., & Silverman, B. W. (1993). Nonparametric regression and generalized linear models: A roughness penalty approach. Crc Press.
  • Gu, C., & Xiang, D. (2001). Cross-validating non-Gaussian data: Generalized approximate cross-validation revisited. Journal of Computational and Graphical Statistics, 10(3), 581–591. https://doi.org/10.1198/106186001317114992
  • Hall, S. G., & Mitchell, J. (2004). Density forecast combination. Tanaka Business School.
  • Hall, S. G., & Mitchell, J. (2007). Combining density forecasts. International Journal of Forecasting, 23(1), 1–13. https://doi.org/10.1016/j.ijforecast.2006.08.001
  • Harnau, J., & Nielsen, B. (2018). Over-dispersed age-period-cohort models. Journal of the American Statistical Association, 113(524), 1722–1732. https://doi.org/10.1080/01621459.2017.1366908
  • Hersbach, H. (2000). Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5), 559–570. https://doi.org/10.1175/1520-0434(2000)015<0559:DOTCRP>2.0.CO;2
  • Jamal, S., Canto, S., Fernwood, R., Giancaterino, C., Hiabu, M., Invernizzi, L., Korzhynska, T., Martin, Z., & Shen, H. (2018). Machine learning & traditional methods synergy in non-life reserving. Report of the ASTIN Working Party of the International Actuarial Association. Available online: https://www.actuaries.org/IAA/Documents/ASTIN/ASTIN_MLTMS%20Report_SJAMAL.pdf (accessed on 19 July 2019).
  • Kessy, S. R., Sherris, M., Villegas, A. M., & Ziveyi, J. (2022). Mortality forecasting using stacked regression ensembles. Scandinavian Actuarial Journal, 2022(7), 591–626. https://doi.org/10.1080/03461238.2021.1999316
  • Knüppel, M., & Krüger, F. (2022). Forecast uncertainty, disagreement, and the linear pool. Journal of Applied Econometrics, 37(1), 23–41. https://doi.org/10.1002/jae.v37.1
  • Kuang, D., Nielsen, B., & Nielsen, J. P. (2008). Identification of the age-period-cohort model and the extended chain-ladder model. Biometrika, 95(4), 979–986. https://doi.org/10.1093/biomet/asn026
  • Kuo, K. (2019). Deeptriangle: A deep learning approach to loss reserving. Risks, 7(3), 97. https://doi.org/10.3390/risks7030097
  • Li, J. (2022). A model stacking approach for forecasting mortality. North American Actuarial Journal 27, 530–545.
  • Lindholm, M., & Palmborg, L. (2022). Efficient use of data for lstm mortality forecasting. European Actuarial Journal 12, 749–778.
  • Lindholm, M., Verrall, R., Wahl, F., & Zakrisson, H. (2020). Machine learning, regression models, and prediction of claims reserves. In Casualty Actuarial Society E-Forum, Summer 2020.
  • Mack, T. (1991). A simple parametric model for rating automobile insurance or estimating ibnr claims reserves. ASTIN Bulletin: The Journal of the IAA, 21(1), 93–109. https://doi.org/10.2143/AST.21.1.2005403
  • Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder reserve estimates. ASTIN Bulletin, 23(2), 213–225. https://doi.org/10.2143/AST.23.2.2005092
  • Martínez Miranda, M. D., Nielsen, B., & Nielsen, J. P. (2015). Inference and forecasting in the age–period–cohort model with unknown exposure with an application to mesothelioma mortality. Journal of the Royal Statistical Society Series A: Statistics in Society, 178(1), 29–55. https://doi.org/10.1111/rssa.12051
  • McDonald, C., & Thorsrud, L. A. (2011). Evaluating density forecasts: Model combination strategies versus the rbnz. Tech. rep., Reserve Bank of New Zealand.
  • Opschoor, A., Van Dijk, D., & Van der Wel, M. (2017). Combining density forecasts using focused scoring rules. Journal of Applied Econometrics, 32(7), 1298–1313. https://doi.org/10.1002/jae.v32.7
  • Parry, M., Dawid, A. P., & Lauritzen, S. (2012). Proper local scoring rules. The Annals of Statistics, 40(1), 561–592. https://doi.org/10.1214/12-AOS971
  • Pinar, M., Stengos, T., & Yazgan, M. E. (2012). Is there an optimal forecast combination?: A stochastic dominance approach to the forecasting combination puzzle. Department of Economics and Finance, College of Management and Economics.
  • Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5), 1155–1174. https://doi.org/10.1175/MWR2906.1
  • Ranjan, R., & Gneiting, T. (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 71–91. https://doi.org/10.1111/j.1467-9868.2009.00726.x
  • Renshaw, A. E., & Verrall, R. J. (1998). A stochastic model underlying the chain-ladder technique. British Actuarial Journal, 4(4), 903–923. https://doi.org/10.1017/S1357321700000222
  • Resti, Y., Ismail, N., & Jamaan, S. H. (2013). Estimation of claim cost data using zero adjusted gamma and inverse gaussian regression models. Journal of Mathematics and Statistics, 9(3), 186–192. https://doi.org/10.3844/jmssp.2013.186.192
  • Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 507–554.
  • Roulston, M. S., & Smith, L. A. (2002). Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6), 1653–1660. https://doi.org/10.1175/1520-0493(2002)130<1653:EPFUIT>2.0.CO;2
  • Shapland, M. R. (2022). Using the Hayne MLE models: A practitioner's guide. CAS Monograph 4, 4, 39–40.
  • Smith, J., & Wallis, K. F. (2009). A simple explanation of the forecast combination puzzle. Oxford Bulletin of Economics and Statistics, 71(3), 331–355. https://doi.org/10.1111/obes.2009.71.issue-3
  • Spedicato, G. A., Clemente, A. G. P., & Schewe, F. (2014). The use of gamlss in assessing the distribution of unpaid claims reserves. In Casualty Actuarial Society E-Forum, Summer (Vol. 2).
  • Taylor, G. (2000). Loss reserving: An actuarial perspective. Huebner International Series on Risk, Insurance and Economic Security. Kluwer Academic Publishers.
  • Taylor, G. C., & McGuire, G. (2016). Stochastic loss reserving using generalized linear models. Casualty Actuarial Society.
  • Taylor, G. C., & Xu, J. (2016). An empirical investigation of the value of finalisation count information to loss reserving. Variance, 10, 75–120.
  • Verrall, R. J. (1991). On the estimation of reserves from loglinear models. Insurance: Mathematics and Economics, 10(1), 75–80.
  • Vogel, P., Knippertz, P., Fink, A. H., Schlueter, A., & Gneiting, T. (2018). Skill of global raw and postprocessed ensemble predictions of rainfall over northern tropical africa. Weather and Forecasting, 33(2), 369–388. https://doi.org/10.1175/WAF-D-17-0127.1
  • Wright, T. S. (1990). A stochastic method for claims reserving in general insurance. Journal of the Institute of Actuaries, 117(3), 677–731. https://doi.org/10.1017/S0020268100043262
  • Wüthrich, M. V. (2018). Neural networks applied to chain–ladder reserving. European Actuarial Journal, 8(2), 407–436. https://doi.org/10.1007/s13385-018-0184-4
  • Wüthrich, M. V., & Merz, M. (2008). Stochastic claims reserving methods in insurance. John Wiley & Sons.
  • Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13(3), 917–1007. https://doi.org/10.1214/17-BA1091

Appendices

Appendix 1.

Descriptions of component models

A.1. GLMCC: the basic form

The first model included in our ensemble is the GLM with the cross-classified structure that has been used to model outstanding claims since Renshaw and Verrall (Citation1998). The standard cross-classified models can be fitted mechanically, and usually no manual adjustment is needed for tuning the parameters. Using the notations from Taylor and McGuire (Citation2016, Chapter 2, pp. 8–12), the linear predictor can be specified as
(A1) $\eta_{i,j} = \ln(\alpha_i) + \ln(\beta_j),$
where the accident period effect and the development period effect are captured by the parameters $\ln(\alpha_i)$ and $\ln(\beta_j)$, respectively. Using the log-link function (i.e. $\mu_{ij} = \exp(\eta_{i,j})$), the predicted incremental claim is modelled as
(A2) $\hat{\mu}_{ij} = \hat{\alpha}_i \hat{\beta}_j.$
Therefore, the cross-classified models have the nice interpretation that the predicted incremental claim is the product of the estimated total amount of claims paid in accident period i (i.e. $\hat{\alpha}_i$) and the proportion of that total paid in development period j (i.e. $\hat{\beta}_j$). GLMCC is fitted under the Over-Dispersed Poisson (ODP), Gamma, and Log-Normal distributions, which are three common distributional assumptions in the loss reserving literature. These assumptions are summarised below:

  • ODP: $E[Y_{i,j}] = \exp(\eta_{i,j})$ and $Var(Y_{i,j}) = \phi\,\mu_{i,j}$;

  • Log-Normal: $E[\ln(Y_{i,j})] = \eta_{i,j}$ and $Var(\ln(Y_{i,j})) = \sigma^2$;

  • Gamma: $E[Y_{i,j}] = \mu_{i,j} = \exp(\eta_{i,j})$ and $Var(Y_{i,j}) = \phi\,\mu_{i,j}^2$.

Note that GLMCC with the ODP distributional assumption yields central estimates of outstanding claims identical to those of the traditional Chain-Ladder algorithm (Renshaw & Verrall, Citation1998).
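As an illustration (not the paper's exact code), a cross-classified ODP GLM of the form (A1) can be fitted in base R on a long-format triangle; the data frame triangle and its columns acc, dev and Y are assumptions of ours:

# Sketch: cross-classified GLM under the ODP assumption, as in (A1)-(A2).
# `triangle` is assumed to be a long-format data frame with columns acc
# (accident period), dev (development period) and Y (incremental claims).
fit_cc_odp <- glm(Y ~ factor(acc) + factor(dev),
                  family = quasipoisson(link = "log"),
                  data   = triangle)
# On the response scale the fitted cells are hat(mu)_ij = hat(alpha)_i * hat(beta)_j;
# under the ODP assumption these reproduce the Chain-Ladder central estimates.
mu_hat <- predict(fit_cc_odp, newdata = triangle, type = "response")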

A.2. GLMCal: incorporating calendar period effect

As CitationTaylor and McGuire (Chapter 7 of Citation2016, pp. 76–94) commented, calendar period effects can be present due to economic inflation or changes in claims management. Therefore, the second model seeks to capture this effect by introducing the parameter γ in the linear predictor (Taylor & Xu, Citation2016): (A3) ηi,j=ln(βj)+tln(γ).(A3) The calendar period effect is assumed to be constant such that the model can be applied mechanically without much users' adjustment based on Criterion 1. GLMCal is fitted using the same distribution assumptions as GLMCC.

The model with the calendar period effect has a close connection with the age-period-cohort model proposed by Kuang et al. (Citation2008) and Harnau and Nielsen (Citation2018). If we assume a separate effect for each calendar period and add the accident period effects, the model defined in (A3) becomes
(A4) $\eta_{i,j} = \ln(\alpha_i) + \ln(\beta_j) + \ln(\gamma_t),$
which has the same structural form as the age-period-cohort type of models in Kuang et al. (Citation2008) and Harnau and Nielsen (Citation2018). Furthermore, removing the period term in (A4) transforms the model into an age-cohort model (Martínez Miranda et al., Citation2015), which also corresponds to the cross-classified model defined in (A1) (i.e. GLMCC). However, it is also important to recognise that the forecasting methodologies for the models specified in (A3) and (A4) differ. Specifically, for forecasting purposes, the period parameter (denoted $\ln(\gamma_t)$) in (A4) requires extrapolation to produce forecasts for the lower triangle. In contrast, the GLMCal model, as defined in (A3), treats the calendar parameter as a numeric variable, which does not require this additional extrapolation step.

A.3. GLMHC: incorporating the Hoerl curve

The third GLM seeks to capture the shape of the development pattern of incremental claims by incorporating a Hoerl curve (Wright, Citation1990) in the linear predictor, which can be specified as
(A5) $\eta_{i,j} = a_i + b\,\ln(j) + c\,j.$
Compared with GLMCC, the number of parameters in GLMHC for a 40×40 loss triangle is significantly reduced, from 80 to 42. By adding the Hoerl curve to the ensemble, the potential over-fitting risk of the cross-classified GLM can be mitigated. Similarly, GLMHC is fitted under the ODP, Gamma, and Log-Normal distributional assumptions.
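A corresponding sketch of the Hoerl-curve GLM (A5), reusing the hypothetical long-format triangle from the previous sketch:

# Sketch: GLM with a Hoerl curve over development periods, as in (A5).
# factor(acc) gives a_i, log(dev) gives b*ln(j), and dev gives c*j.
fit_hoerl <- glm(Y ~ factor(acc) + log(dev) + dev,
                 family = quasipoisson(link = "log"),
                 data   = triangle)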

A.4. GLMPPCI: incorporating reported claims count information

Incorporating reported (or notified) claims counts might lead to better data representation, as claim payments can be subject to the impact of claims notification (Taylor & McGuire, Citation2016, Chapter 7, pp. 76–94), particularly for lines of business with volatile payments but stable average payments per claim (Taylor & Xu, Citation2016). Following the notations from Taylor and Xu (Citation2016), the linear predictor can be specified as
(A6) $\eta_{i,j} = \ln(\hat{N}_i) + \ln(\beta_j),$
where $\hat{N}_i$ denotes the estimated total reported claims count for accident period i. $\hat{N}_i$ can be specified as $\hat{N}_i = \sum_{j=1}^{J-i+1} N_{ij} + \sum_{j=J-i+2}^{J} \hat{N}_{ij}$, where $\hat{N}_{ij}$ denotes the reported claims counts that cannot be observed and therefore need to be estimated. Following the methodology in Taylor and Xu (Citation2016), $\hat{N}_{ij}$ is estimated from the reported claims count triangle by GLMCC with the Over-Dispersed Poisson distributional assumption. GLMPPCI is fitted under the ODP distributional assumption.

A.5. GLMPPCF: incorporating finalised claims count information

Incremental claims can also be impacted by the claims closure rate, which can be captured by the PPCF model (Taylor & Xu, Citation2016). Following the methodology from Taylor and Xu (Citation2016), the payment per finalised claim, defined as $Y_{ij}/\hat{F}_{ij}$, is modelled by an Over-Dispersed Poisson GLM:
(A7) $\hat{\mu}_{ij}^{\text{PPCF}} = \exp\big(\beta_0 + \beta_1\,\hat{t}_i(j)\big),$
where $\hat{t}_i(j)$ is the (estimated) proportion of claims incurred in accident period i that are closed up to development period j, and where $\hat{F}_{ij}$ is estimated by a binomial GLM (Taylor & Xu, Citation2016). Finally, the estimated incremental claim can be derived as $\hat{Y}_{ij} = \hat{\mu}_{ij}^{\text{PPCF}}\,\hat{F}_{ij}$.

A.6. GLMZALN and GLMZAGA: handling incremental claims with zero values

In addition to the Over-Dispersed Poisson, Log-Normal, and Gamma distributional assumptions, GLMZALN and GLMZAGA are fitted using the Zero-Adjusted Log-Normal (ZALN) and Zero-Adjusted Gamma (ZAGA) distributions, which specialise in handling zero incremental claims by assigning a point mass at zero (Resti et al., Citation2013). Denote the probability mass at zero as $\nu_j$. We assume the point mass to depend on the development period, which reflects the fact that zero incremental claims are more likely to be observed in later development periods, when the claims are close to being settled. $\nu_j$ can then be modelled by fitting a logistic regression:
(A8) $\hat{\nu}_j = \dfrac{\exp(\beta_0 + \beta_1 j)}{1 + \exp(\beta_0 + \beta_1 j)}.$
Denote $\hat{\mu}_{ij}^{\text{GA}}$ and $\hat{\mu}_{ij}^{\text{LN}}$ as the predicted means under the Gamma and Log-Normal distributions, respectively. The predicted incremental claims under the ZAGA and ZALN distributions can then be specified as $\hat{\mu}_{ij}^{\text{ZAGA}} = (1-\hat{\nu}_j)\,\hat{\mu}_{ij}^{\text{GA}}$ and $\hat{\mu}_{ij}^{\text{ZALN}} = (1-\hat{\nu}_j)\,\hat{\mu}_{ij}^{\text{LN}}$, respectively.
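The two-part structure can be sketched in base R as follows; this mirrors the ZAGA form only (the paper's fits use zero-adjusted distributions directly), and triangle is the same hypothetical long-format data frame as above.

# Sketch of a two-part (zero-adjusted) fit: a logistic model for the zero
# mass as a function of development period, as in (A8), and a Gamma GLM on
# the strictly positive incremental claims.
triangle$is_zero <- as.numeric(triangle$Y == 0)
fit_nu  <- glm(is_zero ~ dev, family = binomial, data = triangle)        # (A8)
fit_pos <- glm(Y ~ factor(acc) + factor(dev),
               family = Gamma(link = "log"),
               data   = subset(triangle, Y > 0))
nu_hat  <- predict(fit_nu,  newdata = triangle, type = "response")
mu_ga   <- predict(fit_pos, newdata = triangle, type = "response")
mu_zaga <- (1 - nu_hat) * mu_ga   # predicted mean: (1 - nu_j) * mu_ij^GA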

A.7. Smoothing spline

Smoothing splines provide a flexible approach to automatically balancing data adherence and smoothness. Compared to fixed parametric curves, smoothing splines have a more flexible shape to accommodate different patterns in claims development (England & Verrall, Citation2001). Following the work in England and Verrall (Citation2001), the linear predictor can be specified as
(A9) $\eta_{ij} = s_{\theta_i}(i) + s_{\theta_j}(j),$
where $s_{\theta_i}$ and $s_{\theta_j}$ are smoothing functions over the accident period i and the development period j, respectively. The distributional assumptions of the smoothing splines implemented in this paper are Normal, Gamma and Log-Normal. The Normal distribution is the original distributional assumption for the smoothing spline in England and Verrall (Citation2001), while the latter two are commonly used in the loss reserving literature. For each distributional assumption, the smoothing functions for the accident periods and the development periods are found by maximising their respective penalised log-likelihoods (Green & Silverman, Citation1993):
(A10) $\ell(s_{\theta_i}) = \sum_{i,j}\left(\dfrac{y_{ij}\,s_{\theta_i}(i) - \exp(s_{\theta_i}(i))}{\phi}\right) - \theta_i \int \big(s_{\theta_i}''(t)\big)^2\,dt,$
(A11) $\ell(s_{\theta_j}) = \sum_{i,j}\left(\dfrac{y_{ij}\,s_{\theta_j}(j) - \exp(s_{\theta_j}(j))}{\phi}\right) - \theta_j \int \big(s_{\theta_j}''(t)\big)^2\,dt,$
where $\theta_i$ and $\theta_j$ are hyper-parameters that control the degree of smoothness of the smoothing splines. The parameters $\theta_i$ and $\theta_j$ are estimated by minimising the Generalised Approximate Cross-Validation (GACV) score (Gu & Xiang, Citation2001).
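In the spirit of (A9), a penalised-spline fit can be sketched with the mgcv package; this is an approximation under our assumptions (column names as before), not the exact implementation used in the paper.

library(mgcv)
# Sketch in the spirit of (A9): penalised smooths over accident and
# development periods, with smoothness chosen by a GCV-type criterion.
fit_spline <- gam(Y ~ s(acc) + s(dev),
                  family = Gamma(link = "log"),
                  data   = subset(triangle, Y > 0),
                  method = "GCV.Cp")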

A.8. GAMLSS: generalised additive models for location scale and shape

Under the standard GLM setting, the dispersion parameter is assumed to be constant across all accident and development periods. However, in practice dispersion levels typically change across development (and sometimes accident) periods (Taylor & McGuire, Citation2016, Chapter 6, pp. 56–72). By modelling the variance as a function of predictors, GAMLSS, first proposed by Rigby and Stasinopoulos (Citation2005), can tackle the non-constant dispersion issue automatically (Spedicato et al., Citation2014). Therefore, it is included in our ensembles.

Following the work of Spedicato et al. (Citation2014), the linear predictor for the distribution mean is in the standard cross-classified form. Although the distribution variance could also be specified in the cross-classified form, to mitigate the potential over-parametrisation risk, a smoothing spline is applied to the development period when modelling the variance term. We also assume the distribution variance to vary only by development period, which matches the common assumption in the loss reserving literature (Taylor & McGuire, Citation2016, Chapter 6, pp. 56–72). Under the above assumptions, the mean and the variance of the distribution of incremental claims can be modelled as
(A12) $E[Y_{ij}] = g^{-1}(\eta_{1,i,j}) = \exp\big(\ln(\alpha_i) + \ln(\beta_j)\big),$
(A13) $Var[Y_{ij}] = g^{-1}(\eta_{2,i,j}) = \exp\big(s_{\theta_j}(j)\big),$
where $s_{\theta_j}$ is a smoothing function applied here to the development period j only. In this paper, the GAMLSS model is fitted under the Gamma and Log-Normal distributional assumptions, which are common in the loss reserving literature.
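A minimal gamlss sketch along the lines of (A12)–(A13), under the Log-Normal assumption and with the same hypothetical long-format data as above; pb() fits a penalised spline for the (log-)scale parameter over development periods.

library(gamlss)
# Sketch along the lines of (A12)-(A13): cross-classified mean, with the
# scale parameter smoothed over development periods only.
fit_gamlss_ln <- gamlss(Y ~ factor(acc) + factor(dev),
                        sigma.formula = ~ pb(dev),
                        family = LOGNO,
                        data   = subset(triangle, Y > 0))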

Appendix 2.

Distributional forecasting concepts

A.9. Log score and Kullback–Leibler divergence distance

The Log Score has a close connection to the Kullback–Leibler divergence (KLIC) between the true density $f_m$ and the predicted density $\hat{f}_m$ (Hall & Mitchell, Citation2007), which is defined as
(A14) $\text{KLIC} = \ln\big(f_m(y_{ij})\big) - \ln\big(\hat{f}_m(y_{ij})\big).$
The smaller the KLIC distance, the smaller the deviation of the predictive density from the true density. If the predictive density equals the true density, its KLIC distance is zero. However, direct minimisation of the difference between the true density and the predictive density might not be feasible, as the true density is often unknown in practice. One solution is to maximise $\ln(\hat{f}_m(y_{ij}))$, which is equivalent to minimising the KLIC distance (Hall & Mitchell, Citation2007). As $\ln(\hat{f}_m(y_{ij}))$ corresponds to the Log Score of the predictive density, the logarithmic scoring rule has the nice interpretation of assigning a higher score to a predictive density that is closer to the true density (Hall & Mitchell, Citation2007).

Appendix 3.

Technical details

A.10. Details of calculating reserve quantiles

To simulate the 75th reserve quantile for component model m, we implement the following procedure:

  1. Simulate N random variables for each cell in the out-of-sample data from its corresponding distribution. Denote $\tilde{Y}_{ij}^m = \big(\tilde{Y}_{ij,(1)}^m, \tilde{Y}_{ij,(2)}^m, \ldots, \tilde{Y}_{ij,(N)}^m\big)$ as the vector of all simulated random variables for cell (i,j) under component model m;

  2. Repeat Step 1 for each cell (i,j)

  3. Calculate the simulated reserves from the simulated random variables: $\tilde{R}^m = \big(\tilde{R}_{(1)}^m, \tilde{R}_{(2)}^m, \ldots, \tilde{R}_{(N)}^m\big) = \Big(\sum_{i,j \in \mathcal{D}_{\text{out}}} \tilde{Y}_{ij,(1)}^m,\ \sum_{i,j \in \mathcal{D}_{\text{out}}} \tilde{Y}_{ij,(2)}^m,\ \ldots,\ \sum_{i,j \in \mathcal{D}_{\text{out}}} \tilde{Y}_{ij,(N)}^m\Big);$

  4. Calculate the empirical 75th quantile of the N simulated reserves for component model m: $R_{75}^m$.

Here we take N = 1000, which is computationally feasible while remaining reasonably accurate at the quantiles of interest. One should also note that the above simulation procedure implicitly assumes independence across individual cells, which is a common assumption when simulating reserve quantiles in the literature (Gabrielli, Citation2020). The procedure is similar to the parametric bootstrapping method of simulating reserves (Taylor & McGuire, Citation2016, Chapter 5, pp. 42–53). Since this paper focuses on assessing predictive distributions, we do not pursue non-parametric or semi-parametric bootstrapping further here.
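The four steps above can be sketched as follows; rcell() is a hypothetical helper drawing from model m's fitted distribution for a given out-of-sample cell.

# Sketch of the simulation procedure above. `cells` indexes the out-of-sample
# cells (i,j); rcell(cell, n) draws n values from model m's fitted
# distribution for that cell (cells are drawn independently, as noted above).
simulate_R75 <- function(cells, rcell, N = 1000) {
  sims  <- sapply(cells, function(cell) rcell(cell, N))  # N x |D_out| matrix
  R_sim <- rowSums(sims)                                 # N simulated reserves
  quantile(R_sim, 0.75)                                  # empirical 75th quantile
}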

A.11. Constraints on combination weights: acting as downside insurance

As Breiman (Citation1996) suggests, imposing the sum-to-unity and non-negativity constraints acts as downside insurance for the out-of-sample predictive performance of the ensemble. This proposition is supported by the following simple proof.

Denote $\hat{f}(y_{ij}) = \sum_{m=1}^M w_m \hat{f}_m(y_{ij})$ as the predictive density of the ensemble at the out-of-sample point $y_{ij}$, and $\hat{f}_{\text{worst}}(y_{ij}) = \min\{\hat{f}_m(y_{ij}),\ m = 1, \ldots, M\}$ as the lowest predictive density at the same data point among the component models (i.e. the predictive density from the worst component model). Under the sum-to-unity and non-negativity constraints, $\sum_{m=1}^M w_m \hat{f}_m(y_{ij}) \geq \sum_{m=1}^M w_m \hat{f}_{\text{worst}}(y_{ij}) = \hat{f}_{\text{worst}}(y_{ij})$, so by the monotonicity of the logarithm, $\ln\big(\sum_{m=1}^M w_m \hat{f}_m(y_{ij})\big) \geq \ln\big(\hat{f}_{\text{worst}}(y_{ij})\big)$ is guaranteed: the ensemble can never score below its worst component.

A.12. Comparison between stacked regression approach and linear pools: further illustrations

Section 3.6 states that the MSE does not give forecasters an incentive to model higher moments correctly. We supplement the discussion in Section 3.6 with the following illustrative example. Suppose the true distribution for cell (i,j) is N(2,2), and consider the predictive distribution N(2,1). Since it has the same mean as the true distribution, it receives the same MSE as the true distribution, even though it severely underestimates the true dispersion. Consequently, the estimated 75th quantile from the forecast distribution, calculated as $\hat{\mu} + \hat{\sigma} z_{0.75}$, will also be lower than the true 75th quantile. This potential under-estimation of higher quantiles could be of particular concern to general insurers and regulators.

As mentioned in Section 3.6, optimising the model's performance on central estimation might harm the variance forecast. To illustrate this idea, we follow the decomposition of the Mean-Squared Error and the prediction variance given in Knüppel and Krüger (Citation2022). Firstly, the weighted average Mean-Squared Error attained by the component models can be decomposed as
(A15) $\sum_{m=1}^M w_m E[(Y-\mu_m)^2] = \sum_{m=1}^M w_m E[(Y-\mu+\mu-\mu_m)^2] = \Big(\sum_{m=1}^M w_m\Big) E[(Y-\mu)^2] + \sum_{m=1}^M w_m E[(\mu_m-\mu)^2] = E[(Y-\mu)^2] + D,$
where $D = \sum_{m=1}^M w_m E[(\mu_m - \mu)^2]$ is the disagreement term, and $\sum_{m=1}^M w_m = 1$. Assume the individual variance forecasts are unbiased (i.e. $E[\sigma_m^2] = E[(Y-\mu_m)^2]$). Then the Mean-Squared Error attained by the ensemble can be specified as
(A16) $E[(Y-\mu)^2] = \sum_{m=1}^M w_m E[(Y-\mu_m)^2] - D = \sum_{m=1}^M w_m E[\sigma_m^2] - D,$
and the (expected value of the) predicted variance under the combined distributional forecast is
(A17) $E[\tilde{\sigma}^2] = \sum_{m=1}^M w_m E[\sigma_m^2] + \sum_{m=1}^M w_m E[(\mu_m-\mu)^2] = \sum_{m=1}^M w_m E[\sigma_m^2] + D.$
Equations (A16) and (A17) have several important implications. To obtain the optimal central estimate, one should minimise the Mean-Squared Error attained by the ensemble. As per (A16), the MSE of the combined forecast decreases as the disagreement term increases. Therefore, if the goal is to achieve optimal central estimates, the optimal weights should be derived such that the disagreement among component models is maximised, which is usually the case under the stacked regression approach. However, as per (A17), the predicted variance increases with the disagreement term. Therefore, if the individual forecast variances are unbiased or over-estimated, the optimal weights derived by minimising the MSE will result in a further over-statement of the ensemble's predicted variance (Knüppel & Krüger, Citation2022).
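The identity in (A17) is the familiar mixture-variance decomposition; the following short R check verifies it numerically on a two-component Normal example with illustrative numbers of our own choosing.

# Numerical check of the decomposition underlying (A17):
# Var(mixture) = sum_m w_m * sigma_m^2 + sum_m w_m * (mu_m - mu)^2.
set.seed(1)
w    <- c(0.4, 0.6); mu_m <- c(1, 3); sd_m <- c(1, 2)
mu   <- sum(w * mu_m)                        # pooled mean
D    <- sum(w * (mu_m - mu)^2)               # disagreement term
var_formula <- sum(w * sd_m^2) + D
z <- sample(1:2, 1e6, replace = TRUE, prob = w)
y <- rnorm(1e6, mean = mu_m[z], sd = sd_m[z])
c(formula = var_formula, simulated = var(y))  # the two values should agree closely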

Appendix 4.

Additional fitting results

A.13. ADLP with three subsets

The average out-of-sample Log Scores attained by the four ADLP+ ensembles, in comparison with the corresponding ADLP ensembles, are summarised in Table E1. The 'Split Points' column shows the accident period(s) where the split of subsets occurs for both the ADLP ensembles (outside the brackets) and the ADLP+ ensembles (inside the brackets). As per Table E1, the average Log Score differentials between the ADLP and the ADLP+ ensembles tend to be small, except for ADLP9 and ADLP9+.

Table E1. Comparison of Average Log Score: ADLP vs. ADLP+.

Additionally, based on the box-plots of the Log Score in Figure A1, the distributions of the Log Score for the ADLP and ADLP+ ensembles over the 100 simulations are also similar, though ADLP9 under-performs ADLP9+ in most simulated data sets by a small margin.

Figure A1. Box-plot of Log Score: ADLP vs. ADLP+.

As per Table E1, ADLP9+ has the highest Log Score among all ADLP ensembles with three subsets, and ADLP11 yields the best performance among the two-subset ADLP ensembles. We also test whether the Log Score differential between those two partition strategies is statistically significant. At the 5% significance level, the rejection rate is considered low. Since ADLP9+ is the best ADLP+ ensemble based on Log Score, the test results also imply that no other three-subset partition strategy can statistically significantly improve on ADLP11. Additionally, we feel that parsimony would suggest restricting ourselves to a single split point.

A.14. Fitting results for a reserving triangle of dimension 20×20

A.14.1. Data partition under the 20×20 triangle case

A.14.2. Assessing predictive performance: log score

To evaluate the impact of partition strategies on the predictive performance of ADLP ensembles within a 20×20 reserving triangle, we present the mean Log Score achieved for various split points below (Figure A3):

The predictive performance of the optimal ADLP identified in Figure A3 is then compared against the SLP and the benchmark strategies:

Figure A2. Data partition under the 20×20 triangle case.

Figure A3. Mean Log Score Plot: comparison among different partition strategies (20×20 triangle).

Figure A4. Distribution of Log Score over 100 simulations (20×20 triangle): comparison among ADLP, SLP, BMV and EW.

Figure A5. Mean Log Score by accident periods (20×20 triangle).

Figure A6. Distribution of p-values: ADLP vs. BMV, ADLP vs. EW, and ADLP vs. SLP; green marks the 10% level; red marks the 5% level.

Figure A7. Combination Weights in Subset 1 under optimal ADLP (20×20 triangle).

Figure A8. Combination Weights in Subset 2 under optimal ADLP (20×20 triangle).

Figure A9. Distribution of weighted CRPS (right-tail focus) over 100 simulations (lower is better): comparison among ADLP12, EW and BMV.

For better presentation, the mean Log Scores for the four competing strategies are summarised below:

Table E2. Comparison of Log Score: BMV, EW, SLP, and Optimal ADLP.

A.14.3. Statistical test results

To assess the difference in the predictive performance among various competing strategies, we performed the Diebold-Mariano test, with the distribution of p-values shown below:

Table E3. The number of datasets (out of 100) when H0 is rejected under the 5% and 10% significance levels (20×20 triangle), based on the Diebold–Mariano test.

A.14.4. Combination weights

A.15. Results for CRPS calculation

The CRPS can be interpreted as a measurement of the difference between the predicted and observed cumulative distributions (Hersbach, Citation2000). Formally, the generalised version of the CRPS for a predictive distribution with cumulative distribution function F, observed at $y_k$, is defined as (Gneiting & Ranjan, Citation2011)
(A18) $\text{CRPS}(F, y_k) = \int_{-\infty}^{\infty} \big(F(z) - \mathbb{I}_{\{z \geq y_k\}}\big)^2\, u(z)\, dz,$
where $u(z)$ denotes a weight function that allows users to adjust the focus on different regions of the distribution. For implementation, we use the discretised version of (A18) (Gneiting & Ranjan, Citation2011):
(A19) $\text{CRPS}(F, y_k) = \dfrac{z_u - z_l}{I - 1} \sum_{i=1}^{I} \big(F(z_i) - \mathbb{I}_{\{z_i \geq y_k\}}\big)^2\, u(z_i),$
where I is the number of discretisation points, and $z_l$ and $z_u$ are the lower and upper bounds of integration, respectively, with $z_1 = z_l$ and $z_I = z_u$.

The standard version of the CRPS corresponds to the special case $u(z) = 1$. Particularly relevant to the reserving context, there is often an interest in models' performance in the right tail of the distribution. Thus, a weight function emphasising the right tail can be selected. To serve this purpose, a choice of weight function recommended in Gneiting and Ranjan (Citation2011) is $u(z) = \Phi_{a,b}(z)$, the cumulative distribution function of a Normal distribution with mean a and standard deviation b. Given that $\Phi_{a,b}$ is monotonically increasing in z, deviations from the ideal distribution function (i.e. $\mathbb{I}_{\{z \geq y_k\}}$) are more heavily penalised as z increases towards the upper end of the distribution. Inspired by the choice of the mean and standard deviation parameters for the weight function in Gneiting and Ranjan (Citation2011), we select a as the average of the claim payments within the lower triangle, and b as the standard deviation of the claim payments in the same region. However, this choice is only one among several possibilities. In practice, actuaries might choose different weight parameters to adjust the emphasis placed on the right tail of the distribution, guided by their expert knowledge or professional judgement.
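A sketch of the discretised, right-tail-weighted CRPS in (A19); the predictive CDF cdf, the grid bounds, and the weight parameters a and b are all assumptions of ours.

# Sketch of the discretised weighted CRPS in (A19), with the right-tail
# weight u(z) = Phi_{a,b}(z) from Gneiting and Ranjan (2011).
wcrps <- function(cdf, y_k, z_l, z_u, I = 1000, a, b) {
  z <- seq(z_l, z_u, length.out = I)
  u <- pnorm(z, mean = a, sd = b)                     # weight function
  sum((cdf(z) - as.numeric(z >= y_k))^2 * u) * (z_u - z_l) / (I - 1)
}
# Example: predictive distribution N(2, 1), observation y_k = 2.5
# wcrps(function(z) pnorm(z, 2, 1), y_k = 2.5, z_l = -8, z_u = 12, a = 2, b = 1)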

The outcomes of the weighted CRPS calculations are presented in Figure A9. Here, ADLP12 exhibits a noticeably lower average CRPS than both the BMV and the EW. These results, alongside the lower reserve bias observed at the 75th quantile, suggest that the ADLP's superiority over the BMV and the EW is most substantial in the right tail of the distribution.