Predictive verification for the design of partially exchangeable multi-model ensembles

Pages 1-12 | Received 01 Feb 2019, Accepted 30 Sep 2019, Published online: 18 Dec 2019

Abstract

The performance of an ensemble forecast, as measured by scoring rules, depends on its number of members. Under the assumption of ensemble member exchangeability, ensemble-adjusted scores provide unbiased estimates of the ensemble-size effect. In this study, the concept of ensemble-adjusted scores is revisited and exploited in the general context of multi-model ensemble forecasting. In particular, an ensemble-size adjustment is proposed for the continuous ranked probability score in a multi-model ensemble setting. The method requires that the ensemble forecasts satisfy generalised multi-model exchangeability conditions. These conditions do not require the models themselves to be exchangeable. The adjusted scores are tested here on a dual-resolution ensemble, an ensemble which combines members drawn from the same numerical model but run at two different grid resolutions. It is shown that the performance of different ensemble combinations can be robustly estimated based on a small subset of members from each model. At no additional cost, the ensemble-size effect is investigated not only considering the pooling of potential extra members but also the impact of optimal weighting strategies. With simple and efficient tools, the proposed methodology paves the way for predictive verification of multi-model ensemble forecasts; the derived statistics can provide guidance for the design of future operational ensemble configurations without having to run additional ensemble forecast experiments for all the potential configurations.

1. Introduction

Ensemble systems provide a framework for probabilistic forecasting in numerical weather prediction. A collection of forecasts with the same target serves as a basis for the generation of probabilistic products. In this framework, it is well established that the ensemble size, that is, the number of forecasts available at the product-generation stage, has an impact on the quality of the ensemble probabilistic products. This is for example the case when we consider a cumulative distribution function (CDF) generated from $m$ ensemble members and the quality of the CDF forecasts estimated with the continuous ranked probability score (CRPS). When the ensemble is reliable, the ratio between the expected score of the $m$-member-based forecasts and the expected score if the ensemble were of infinite size is $1 + 1/m$ (Richardson, 2001).

More generally, ensemble-adjusted scores provide a means to estimate the ensemble-size effect on forecast performance, assuming ensemble member exchangeability and stationarity of the error statistics. The concept of score adjustment allows one to derive an unbiased estimate of a score $S$ for an ensemble of size $M$ when, say, $m < M$ members are available for the score computation. Denoted $S_{m \to M}$, adjusted scores can be applied, for example, to compare ensemble forecasting systems with different ensemble sizes, disentangling the ensemble-size effect and the impact of ensemble/model configuration on forecast performance. Furthermore, adjusted scores provide estimates of the expected benefit of an ensemble-size upgrade without the need to run extra members. Practically, in numerical experimentation, expected scores of an $M$-member ensemble are inferred from a small number of members (Leutbecher, 2018). This way, unused computational resources are made available for other experimental tests. Adjusted versions of the CRPS, Brier score, and ranked probability score are available in the literature, as well as an adjusted version of the ignorance score for forecasts issued as Normal distributions (Ferro et al., 2008; Siegert et al., 2018).

The first objective of this paper is to revisit the concept of ensemble-adjusted scores and its applicability in the general context of multi-model ensembles. The multi-model ensemble approach refers here to the combination of forecasts from $k$ ($k > 1$) ensembles with different statistical characteristics. Let $m_1, \dots, m_k$ denote the ensemble sizes of the $k$ ensembles that are going to be combined. An ensemble-adjusted score $S_{(m_1,\dots,m_k)\to(M_1,\dots,M_k)}$ provides a forecast performance estimate of an $(M_1,\dots,M_k)$ combined ensemble forecast based on verification statistics from ensembles of size $m_1,\dots,m_k$. In the following, we discuss how the multi-model ensemble size affects forecast performance, in particular in terms of the CRPS. As an application, the ensemble-size effect is investigated for a dual-resolution ensemble, which combines forecasts from the same model but run at two different resolutions (Leutbecher and Ben Bouallègue, 2019).

The benefit of the multi-model ensemble approach and the rationale explaining its success were investigated successively in Hagedorn et al. (2005); Weigel et al. (2008); Weigel and Bowler (2009); Leutbecher and Ben Bouallègue (2019). Multi-model ensembles by definition gather ensemble forecasts with different error characteristics. Forecast improvement occurs as a result of a noise reduction associated with the increase of the ensemble size or by addition of new predictable signals (DelSole et al., 2014). When the size of the combined ensemble is fixed, forecast improvement can arise from an appropriate weighting of the different ensemble members. Instead of a simple pooling of the forecasts, post-processing methods can be applied in order to attribute more weight to a set of forecasts when justified by previous forecast performance (see Doblas-Reyes et al., 2005; Casanova and Ahrens, 2009; DelSole et al., 2013; Baran et al., 2019, among others).

The second objective of this paper is to propose a new approach for ensemble-weighting optimisation. We show that optimal weights can be derived directly from the kernel representation of the CRPS. As a result, the ensemble-size effect and the weighting strategy can be analysed simultaneously. This is illustrated here in the particular case of a two-ensemble combination. An exhaustive analysis of weighted and unweighted ensemble combinations is performed without the need to run large ensemble experiments or complex post-processing methods. This novel approach to forecast verification is coined predictive verification, an umbrella term for the assessment of potential ensemble configurations based on a reduced set of members. As a fundamental application, predictive verification of ensemble forecasts aims to provide guidance for the design of future ensemble systems.

This paper is organised as follows: the concept of ensemble-adjusted scores in a multi-model setting is described and tested in Section 2, applications to ensemble-size optimisation as well as model weighting are discussed in Section 3 before concluding in Section 4.

2. Score adjustment

2.1. Unbiased estimators

Consider forecasting a continuous outcome $y$. Consider first a single ensemble system of size $m$. The ensemble members are denoted $z_m = (z_1, \dots, z_m)$. For the derivation of unbiased score estimators, it is assumed that the ensemble members (from any one model) are exchangeable. We focus in this manuscript on the CRPS, but unbiased estimators for the Brier score are also provided in Appendix B.

Following Gneiting and Raftery (2007), the kernel representation of the CRPS ($C$) for the empirical distribution function (EDF) of the ensemble reads:
\[ C(z_m, y) = \frac{1}{m}\sum_{j=1}^{m}|z_j - y| - \frac{1}{2m^2}\sum_{g=1}^{m}\sum_{h=1}^{m}|z_g - z_h|, \tag{1} \]
where the first term is the mean absolute error of the ensemble members and the second term is a measure of the ensemble spread. An unbiased estimator of the score for an ensemble with $M$ members takes the following form:
\[ C(z_m, y) - \frac{1}{2}\,\frac{M - m}{M(m-1)\,m^2}\sum_{g=1}^{m}\sum_{h=1}^{m}|z_g - z_h|, \tag{2} \]
as discussed in Ferro et al. (2008). The adjusted CRPS is denoted $C_{m \to M}$, and it is interpreted as the expected CRPS if we had $M$ ensemble members, estimated from our statistical knowledge of the ensemble characteristics based on $m$ ensemble members.
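
To make Expressions (1) and (2) concrete, the following Python sketch (not part of the original study; array and function names are our own) computes the kernel CRPS of an $m$-member ensemble and its ensemble-size-adjusted counterpart $C_{m\to M}$, and checks on a reliable Gaussian toy ensemble that the adjusted score estimated from a few members agrees on average with the plain score of a larger ensemble.

```python
import numpy as np

def crps_kernel(members, y):
    """Kernel form of the CRPS of the ensemble EDF, Expression (1):
    mean absolute error minus half the mean absolute pairwise spread."""
    z = np.asarray(members, dtype=float)
    mae = np.mean(np.abs(z - y))
    spread = np.mean(np.abs(z[:, None] - z[None, :]))
    return mae - 0.5 * spread

def crps_adjusted(members, y, M):
    """Unbiased estimate of the CRPS at target ensemble size M, Expression (2)."""
    z = np.asarray(members, dtype=float)
    m = z.size
    pair_sum = np.sum(np.abs(z[:, None] - z[None, :]))
    return crps_kernel(z, y) - 0.5 * (M - m) / (M * (m - 1) * m**2) * pair_sum

# Toy check with a reliable Gaussian ensemble: the adjusted score from m = 8 members
# should agree, on average, with the plain score of an M = 40 member ensemble.
rng = np.random.default_rng(1)
plain_40, adjusted_8 = [], []
for _ in range(5000):
    y = rng.standard_normal()
    plain_40.append(crps_kernel(rng.standard_normal(40), y))
    adjusted_8.append(crps_adjusted(rng.standard_normal(8), y, M=40))
print(np.mean(plain_40), np.mean(adjusted_8))  # close within sampling noise
```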

Now consider that we are in a multi-model setting. The multi-model ensemble comprises $k$ models with $m_i$ members from the $i$-th model. The multi-model ensemble forecast is denoted
\[ z_{\mathbf m} = (z_{11}, \dots, z_{ig}, \dots, z_{k m_k}), \tag{3} \]
where $z_{ig}$ is the $g$-th member in the $i$-th model. The multi-index $\mathbf m = (m_1, m_2, \dots, m_k)$ contains the ensemble sizes. We will refer to it as the (multi-model) ensemble size. The total ensemble size is $|\mathbf m| = \sum_i m_i$.

We can form the EDF for each of the $k$ models and then combine these to form a probability forecast from the multi-model ensemble. We define the forecast distribution function to be the mixture
\[ \sum_{i=1}^{k}\lambda_i F_i(z), \tag{4} \]
where $\lambda_i > 0$ is the weight assigned to the EDF, $F_i$, of model $i$ and $\sum_{i=1}^{k}\lambda_i = 1$. The CRPS for this forecast distribution is
\[ C(z_{\mathbf m}, \lambda, y) = \sum_{i=1}^{k}\frac{\lambda_i}{m_i}\sum_{j=1}^{m_i}|z_{ij} - y| - \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{k}\frac{\lambda_i \lambda_j}{m_i m_j}\sum_{g=1}^{m_i}\sum_{h=1}^{m_j}|z_{ig} - z_{jh}|. \tag{5} \]

If we choose $\lambda_i = 1/k$ then each model receives equal weight, and if we choose $\lambda_i = m_i / \sum_j m_j$ then each member receives equal weight. We refer to this latter choice as ensemble pooling and we compare it to estimating optimal weights in Section 3.
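
As an illustration of Expression (5), a minimal Python sketch of the mixture CRPS follows (illustrative code, not from the paper; the synthetic numbers are arbitrary). It can be called with equal model weights or with pooling weights.

```python
import numpy as np

def crps_multimodel(model_members, weights, y):
    """Mixture CRPS of Expression (5).

    model_members: list of k 1-D arrays (the members of each model)
    weights:       k mixture weights lambda_i, summing to one
    """
    lam = np.asarray(weights, dtype=float)
    zs = [np.asarray(z, dtype=float) for z in model_members]
    # weighted mean absolute error of each model (first term of Expression (5))
    err = sum(l * np.mean(np.abs(z - y)) for l, z in zip(lam, zs))
    # weighted mean absolute difference between all pairs of models (second term)
    spread = sum(lam[i] * lam[j] * np.mean(np.abs(zs[i][:, None] - zs[j][None, :]))
                 for i in range(len(zs)) for j in range(len(zs)))
    return err - 0.5 * spread

# Two hypothetical models: equal model weights versus ensemble pooling.
rng = np.random.default_rng(2)
z1, z2, obs = rng.normal(0.0, 1.0, 20), rng.normal(0.1, 1.2, 5), 0.3
print(crps_multimodel([z1, z2], [0.5, 0.5], obs))         # lambda_i = 1/k
print(crps_multimodel([z1, z2], [20 / 25, 5 / 25], obs))  # lambda_i = m_i / sum_j m_j
```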

Similarly to the single ensemble case, we would like to measure the expected ensemble-size effect on forecast performance in a multi-model ensemble setting. Not only is exchangeability of the ensemble members from any one model required, but also multi-model exchangeability, which is a form of partial exchangeability (Bernardo and Smith, 2000), as well as multi-model ensemble-size invariance. These so-called generalised multi-model exchangeability conditions do not require exchangeability between models. Formal definitions of the multi-model exchangeability and multi-model ensemble-size invariance conditions are provided in Appendix A, Section A2, along with the corresponding mathematical developments. In addition, we provide more practical conditions that can be checked easily for any multi-model ensemble (see Equations (A7) and (A8)). In plain words, these conditions demand (i) that the expected mean absolute error of an ensemble member only depends on the model, i.e. it is the same for all members generated with that model, and (ii) that the expected mean absolute difference between a pair of distinct members only depends on which models generated them, i.e. it is the same for all pairs of distinct members provided they originate from the same pair of models. Finally, the impact of the violation of the requirements is illustrated based on a concrete example in Section 2.3.

When the generalised multi-model exchangeability conditions are satisfied, an unbiased estimator of the CRPS for a multi-model ensemble with $\mathbf M = (M_1, \dots, M_k)$ members is:
\[ C(z_{\mathbf m}, \lambda, y) - \frac{1}{2}\sum_{i=1}^{k}\frac{\lambda_i^2\,(M_i - m_i)}{M_i (m_i - 1)\, m_i^2}\sum_{g=1}^{m_i}\sum_{h=1}^{m_i}|z_{ig} - z_{ih}|. \tag{6} \]

The adjusted CRPS in a multi-model setting is denoted $C_{\mathbf m \to \mathbf M}$. How the ensemble-adjusted CRPS can help with the design of a multi-model ensemble system is discussed later in Section 3. But first, in the remainder of this section, the robustness of Expression (6) as a score estimator is tested on a particular multi-model ensemble.
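
A hedged sketch of Expression (6) is given below (again illustrative, not the authors' code; the member arrays and the (8,8)-to-(40,40) numbers are only an example of how the estimator might be called).

```python
import numpy as np

def crps_multimodel_adjusted(model_members, weights, y, target_sizes):
    """Unbiased multi-model CRPS estimate at target sizes M, Expression (6)."""
    lam = np.asarray(weights, dtype=float)
    zs = [np.asarray(z, dtype=float) for z in model_members]
    # mixture CRPS of the available members, Expression (5)
    err = sum(l * np.mean(np.abs(z - y)) for l, z in zip(lam, zs))
    spread = sum(lam[i] * lam[j] * np.mean(np.abs(zs[i][:, None] - zs[j][None, :]))
                 for i in range(len(zs)) for j in range(len(zs)))
    crps = err - 0.5 * spread
    # per-model ensemble-size adjustment (second term of Expression (6))
    adjustment = 0.0
    for l, z, M in zip(lam, zs, target_sizes):
        m = z.size
        pair_sum = np.sum(np.abs(z[:, None] - z[None, :]))
        adjustment += l**2 * (M - m) / (M * (m - 1) * m**2) * pair_sum
    return crps - 0.5 * adjustment

# e.g. estimate the score of a (40, 40) combination from an (8, 8) subset of members,
# here with equal weights (which are also the pooling weights for a (40, 40) target):
rng = np.random.default_rng(3)
lres, hres, obs = rng.normal(0, 1.2, 8), rng.normal(0, 1.0, 8), 0.2
print(crps_multimodel_adjusted([lres, hres], [0.5, 0.5], obs, target_sizes=(40, 40)))
```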

2.2. Multi-model ensemble setting

The concepts developed in Section 2.1 (and later in Section 3) are tested on a dual-resolution ensemble experiment. A dual-resolution ensemble is a particular case of a multi-model ensemble because the different contributing ensembles share the same underlying model. However, this specificity is neither required for the application of predictive verification as developed here, nor does it impact the interpretation of the results. The choice of a multi-resolution ensemble to illustrate our method derives from a recent interest at ECMWF in this type of configuration, but the method will also work with more traditional multi-model ensembles (as long as they satisfy the generalised multi-model exchangeability conditions).

In our test example, the forecast data-set comprises forecasts from the same numerical model (the ECMWF Integrated Forecasting System) but run at two different resolutions: TCo639 (∼18 km) and TCo399 (∼29 km). They are referred to as the higher resolution (Hres) and the lower resolution (Lres) members, respectively. A dual-resolution ensemble combines p Lres and q Hres members into a so-called (p,q) ensemble. The combination parameters, p and q, can be varied, for example, so as to keep the total computational cost of an ensemble run constant. The cost ratio between forecasts at TCo639 and TCo399 is approximately 4:1. Given a fixed computational cost defined by limited computer resources, one would like to know the combination (p,q) which optimises the predictive performance of the ensemble.

In order to answer this question, several forecast combinations with the same computational cost are assessed and their performances compared. An ensemble with an operational-like setup, that is a (0,50) ensemble which comprises 50 Hres members only, is used as the reference ensemble forecast. Other ensemble combinations are compared with this reference. These baseline combinations correspond to the (40,40), (120,20), (160,10), and (200,0) ensembles. Results of this type of analysis are documented in Leutbecher and Ben Bouallègue (Citation2019). They show that the (40,40) ensemble performs significantly better than the other tested combinations when focussing on 2 m temperature short- and medium-range forecasts over the northern hemisphere.

Here, the question is whether the same results and conclusion can be obtained using ensemble-adjusted scores. The ensemble-adjusted approach potentially allows one to reduce drastically the cost of an ensemble experiment. In the case of a positive answer to the above question, score adjustment would provide a framework for analysing the performance of all potential ensemble combinations at a lower experiment computational cost (see Section 3).

Using Expression (6), the performance analysis of multi-model combinations is based on verification statistics computed from small subsets of Hres and Lres members. Subsets of the type (2,2), (4,4), and (8,8) are tested, where each forecast in the subset is selected so as to be exchangeable with the other members. Leutbecher (2018) provides more details about member selection and exchangeability of the ECMWF ensemble.

Dual-resolution ensemble performance is assessed for several surface weather variables, but only results for 2 m temperature forecasts are shown here. Results for 10 m wind speed and 24 h accumulated total precipitation were analysed as well but are only briefly discussed here because they are qualitatively similar. The chosen verification period covers the boreal summer (JJA) 2016, and the forecasts are compared with SYNOP measurements distributed over the northern hemisphere.

2.3. Illustration

Performance in terms of CRPS is computed for each baseline combination, (40,40), (120,20), (160,10), and (200,0), and compared with the performance of the reference forecast (0,50). The CRPS difference is normalised by the mean CRPS of the reference forecast over the verification period and is simply denoted ΔCRPS. A negative difference indicates that the baseline combination is better than the reference forecast. In terms of experiment computational cost, this analysis requires running, in our example, 50 Hres members and 200 Lres members over a 92-day verification period.

Now, the CRPS for each of the baseline/reference combinations is estimated based on an (8,8) subset of members. In other words, we compute $C_{(8,8)\to(40,40)}$, $C_{(8,8)\to(120,20)}$, and so on using Expression (6). The score differences are then normalised by the mean score of the reference combination ($C_{(0,8)\to(0,50)}$) and denoted ΔCRPSadj because they are based on adjusted scores. This time, the ensemble performance analysis requires only 8 Hres members and 8 Lres members. This corresponds approximately to 10% of the experiment computational cost of the original analysis. Adjusted scores based on smaller subsets, (4,4) and (2,2), are also tested. They correspond to a reduction of the computational cost by a factor of 20 and 40, respectively.

Fig. 1 (left panel) shows the correspondence between ΔCRPS and ΔCRPSadj based on an (8,8) subset of members. Performance for 2 m temperature forecasts is shown for the four baseline combinations at each forecast lead-time (ranging between day 1 and day 15) separately (15 × 4 points in total). Very good agreement appears between normalised score differences computed from the actual dual-resolution ensembles and estimated from the subset of members. The corresponding correlation coefficient reaches 0.99 and the rank correlation coefficient is 0.88, as reported in Table 1. When the size of the ensemble subset on which the estimation is based is smaller than (8,8), the accuracy of the score estimates tends to decrease. In all cases, the linear correlation between normalised score differences ΔCRPS and ΔCRPSadj is higher than 0.9, even when based on the computationally very cheap (2,2) subset of ensembles. Performing the same analysis for 24 h precipitation and 10 m wind speed, we notice that the correlation coefficients for these variables are in general smaller than for 2 m temperature (not shown). However, the linear correlation coefficient reaches 0.87 for the former and 0.97 for the latter when based on an (8,8) ensemble. In principle, the method proposed here can be applied similarly to any other weather variable.

Fig. 1. Left: comparison of scores computed from actual (p, q) ensembles (ΔCRPS) and adjusted scores based on an (8,8) subset of members (ΔCRPSadj). The scores plotted here are CRPS normalised differences between the (0,50) reference ensemble and the baseline multi-model combinations (40,40), (120,20), (160,10), and (200,0), as represented by squares, circles, triangles, and crosses, respectively. Each symbol corresponds to the result for one lead-time ranging between 1 and 15 days. Right: amplitude of the confidence intervals (CI) associated with the CRPS normalised differences (grey line) and the adjusted CRPS normalised differences (red lines) based on (2,2), (4,4), and (8,8) subsets of members (dotted, dashed, and full lines, respectively), as a function of the forecast lead-time. CI are estimated by block-bootstrapping with blocks of three days. Results are shown for the comparison between the (0,50) and the (40,40) ensembles only. CI at day 3 based on an (8,8) subset of members are reported on the plot on the left. Note that the vertical axes have the same scale in both plots.


Table 1. Gain in experiment computational time and accuracy of the score estimates based on adjusted scores for different sizes of ensemble subset.

Computational resources for experimental testing are generally limited. In order to decrease the computational cost of numerical experimentation, one can think of reducing the length of the testing period. This alternative is also considered here. Scores computed from the actual-size dual-resolution ensembles but over reduced sets of randomly selected verification days are compared with scores averaged over the full 92-day verification period (JJA 2016). Following the same procedure as for the ensemble-adjusted scores, correlations between normalised score differences and their estimates are computed for a range of verification window lengths. The results are reported in Table 2. It appears that reducing the number of verification days can provide results substantially different from those obtained with the original verification sample. For example, a reduction in computational cost to ∼10% of the original cost leads to a rank correlation coefficient below 0.8. Comparing results in Tables 1 and 2, we see that estimates from ensemble-adjusted scores are more robust than estimates based on a reduced sample of observations with comparable experiment computational time. This is the case not only for 2 m temperature forecasts but also for 24 h precipitation and for 10 m wind speed forecasts (not shown).

Table 2. Same as Table 1 but for score estimates based on different verification sample sizes and using the actual complete ensemble with p Lres and q Hres members.

Besides the accuracy of the ensemble-adjusted scores, the level of confidence associated with score differences is also important when verification results serve as a basis for decision-making regarding future ensemble configurations. In Fig. 1 (right panel), we show, as a function of the forecast lead-time, the uncertainty associated with the score differences ΔCRPS (grey lines) and ΔCRPSadj based on an (8,8) ensemble (solid red lines), on a (4,4) ensemble (dashed red lines), and on a (2,2) ensemble (dotted red lines). Score uncertainty is represented by the 2.5%–97.5% percentile range of the block-bootstrapped score distribution, using a block length of three days. In general, the score uncertainty tends to increase with forecast lead-time. The score uncertainty increases more rapidly when the performance is estimated with ensemble-adjusted scores. The uncertainty of the adjusted scores increases with decreasing size of the ensemble used for the score estimations, as illustrated in Fig. 1 (right panel) for adjusted scores estimated from (4,4) and (2,2) ensembles. The level of uncertainty (right panel) with respect to the score differences (left panel) appears however reasonable in this example. The CRPS difference ΔCRPS is significant at the 5% level in 85% of the cases (for all lead times and combinations). When score estimates are based on an (8,8) ensemble, the ratio of significant ΔCRPSadj values is still 83%. For example, in Fig. 1 (left panel), the benefit of the (40,40) combination at day 3 is significant when computing both ΔCRPS and ΔCRPSadj, as depicted by the grey and red confidence bars, respectively. However, when the score estimations are based on (4,4) and (2,2) ensembles, the percentage of significant differences falls to 78% and 57%, respectively.
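
For reference, a generic version of the block-bootstrap step used for these confidence intervals might look as follows (a sketch under the assumption that daily score differences are available as an array; none of the names below come from the paper).

```python
import numpy as np

def block_bootstrap_ci(daily_diff, block_len=3, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of a daily score difference,
    resampling blocks of consecutive days to respect temporal correlation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(daily_diff, dtype=float)
    n_days = x.size
    n_blocks = int(np.ceil(n_days / block_len))
    starts = np.arange(n_days - block_len + 1)
    means = []
    for _ in range(n_boot):
        picks = rng.choice(starts, size=n_blocks, replace=True)
        sample = np.concatenate([x[s:s + block_len] for s in picks])[:n_days]
        means.append(sample.mean())
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Synthetic example: 92 daily CRPS differences between two ensemble configurations.
rng = np.random.default_rng(4)
diffs = rng.normal(-0.02, 0.05, 92)
lo, hi = block_bootstrap_ci(diffs, block_len=3)
print(f"mean = {diffs.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# the difference is taken as significant at the 5% level if the CI excludes zero
```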

Finally in this section, we would like to highlight the importance of the generalised multi-model exchangeability conditions. These conditions are required in order to have a valid unbiased score estimator, as discussed in Section 2.1. For illustration purposes, we consider a configuration where the generalised exchangeability conditions are violated although the individual single ensembles are still exchangeable: in the following example, members from the two models share the same initial perturbations. Every Hres member has the same initial condition as one corresponding Lres member. This implies that the difference between such a pair of Hres and Lres members sharing the initial condition is on average smaller than the difference between any other pair of Hres and Lres members. Fig. 2 shows the adjusted score estimation for this example that violates exchangeability (bottom panel) as well as for a configuration that satisfies the generalised multi-model exchangeability conditions (top panel). In the latter case, the score estimate is unbiased, while in the former case, the CRPS estimate exhibits a clear bias up to day 6. For any actual ensemble generation methodology, it can be checked whether the exchangeability conditions (A7) and (A8) expressed in the Appendix hold within sampling uncertainty. Multi-model ensembles based on ensembles from very different models might of course fulfil these conditions, but it is worth noting that if any one of the models contributing to the multi-model ensemble fails to respect the criterion of member exchangeability, the approach proposed here is not applicable. Similarly, as in Leutbecher (2018), we can expect that it will be more difficult to infer scores from a small subset when a multiphysics-based ensemble contributes to the assessed multi-model ensemble, because different members will have different biases. These biases are likely to render the expected mean absolute error dependent on the physics choices of the member, and the expected mean difference of a pair of members dependent on the physics choices of that pair.
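
In practice, conditions (A7) and (A8) can be examined directly from archived forecasts by comparing mean absolute errors per member and mean absolute differences per member pair. A possible diagnostic sketch (assuming forecasts stored as arrays of shape (cases, members); illustrative only, not the authors' tool) is:

```python
import numpy as np

def exchangeability_diagnostics(model_members, obs):
    """Empirical checks of conditions (A7) and (A8) for one multi-model ensemble.

    model_members: list of k arrays of shape (n_cases, m_i)
    obs:           array of shape (n_cases,)
    Returns per-member mean absolute errors (they should only depend on the model)
    and per-pair mean absolute differences (they should only depend on the model pair).
    """
    obs = np.asarray(obs, dtype=float)
    mae = [np.mean(np.abs(np.asarray(z, dtype=float) - obs[:, None]), axis=0)
           for z in model_members]
    pair_diff = {}
    for i, zi in enumerate(model_members):
        for j, zj in enumerate(model_members):
            if j < i:
                continue
            d = np.mean(np.abs(np.asarray(zi)[:, :, None] - np.asarray(zj)[:, None, :]),
                        axis=0)
            if i == j:
                d = d[~np.eye(d.shape[0], dtype=bool)]   # drop the zero diagonal
            pair_diff[(i, j)] = d
    return mae, pair_diff

# Synthetic two-model example: within each returned array, the entries should agree
# within sampling noise; systematic structure (e.g. pairs of members sharing initial
# perturbations) would show up as outlying entries in pair_diff[(0, 1)].
rng = np.random.default_rng(5)
obs = rng.standard_normal(200)
lres = obs[:, None] + rng.normal(0.0, 1.2, (200, 4))
hres = obs[:, None] + rng.normal(0.0, 1.0, (200, 4))
mae, pair_diff = exchangeability_diagnostics([lres, hres], obs)
print(np.round(mae[0], 2), np.round(pair_diff[(0, 1)], 2))
```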

Fig. 2. Relative difference between CRPS and adjusted CRPS for a (40,40) ensemble as a function of the forecast lead time. The score adjustments are based on a (2,2) ensemble subset. Generalised multi-model exchangeability conditions are respected in (a) and violated in (b). Mean difference over the verification period (black curve) and variability as measured by block-bootstrap 90% confidence intervals (grey plume).


3. Multi-model ensemble design

In this section, the concept of ensemble-adjusted scores is exploited to efficiently assess the ensemble-size effect on multi-model ensemble performance. This is illustrated in the context of the design of a dual-resolution ensemble as described in Section 2.2. Here, again, the target is the minimisation of the ensemble expected CRPS. First, optimal mixtures of higher and lower resolution ensemble forecasts for a given computational cost are investigated in Section 3.1. Second, optimal weighting strategies for fixed combinations of members are discussed in Section 3.2. Both applications are complementary. The illustrations are based on the case where k = 2, that is when forecasts from two different models are combined, but the methodology can be generalised to the combination of any number of models.

3.1. Ensemble pooling

We first consider flexibility in terms of ensemble-size in the design of a multi-model ensemble. This is for instance the case in the dual-resolution ensemble example: A decision can be made about the number of Lres and Hres forecasts to be combined with computer power limited by current or foreseeable resources. Using ensemble-adjusted scores, forecast performance of any combination is estimated from a small set of Hres and Lres forecasts. These estimates of the scores for many different configurations are available virtually for free once the required verification statistics from one representative small ensemble have been computed. There is no additional cost in terms of numerical experimentation or in terms of repeated computations of verification statistics.

As suggested by the results in Section 2.3, we illustrate the adjusted score applications by deriving statistics from an (8,8) ensemble. Indeed, in that case, ensemble-adjusted CRPS are expected to provide robust performance estimates for larger ensemble combinations. Adjusted CRPS are estimated here for a range of model combinations (p,q) with $p \in [0, 200]$ and $q \in [0, 50]$.
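
A sketch of how such a scan over combinations could be implemented is given below (illustrative Python, not the authors' code, with synthetic data): the statistics $E_i$ and $D_{ij}$ of Appendix A are aggregated once from the (8,8) subset, and the adjusted CRPS of Expression (6), in its compact form, is then evaluated for every pooled $(p,q)$ combination on the iso-cost line $p + 4q = 200$.

```python
import numpy as np

def aggregate_statistics(lres, hres, obs):
    """Mean absolute errors E_i and L1-spread matrix D_ij (Appendix A, Eqs. A1-A2),
    averaged over the verification cases, from small Lres/Hres member subsets.

    lres, hres: arrays of shape (n_cases, m_lres) and (n_cases, m_hres)
    obs:        array of shape (n_cases,)
    """
    obs = np.asarray(obs, dtype=float)
    zs = [np.asarray(lres, dtype=float), np.asarray(hres, dtype=float)]
    E = np.array([np.mean(np.abs(z - obs[:, None])) for z in zs])
    D = np.array([[0.5 * np.mean(np.abs(zi[:, :, None] - zj[:, None, :])) for zj in zs]
                  for zi in zs])
    return E, D, np.array([z.shape[1] for z in zs])

def adjusted_crps(E, D, m, M, lam):
    """Mean adjusted CRPS C_{m->M}(lambda) in the compact form (A4) of Appendix A."""
    gamma = np.where(M > 0, (M - m) / (np.maximum(M, 1) * (m - 1)), 0.0)
    D_hat = D + np.diag(gamma * np.diag(D))
    return lam @ E - lam @ D_hat @ lam

# Scan all pooled (p, q) combinations with the cost of 50 Hres members (p + 4q = 200),
# using statistics from a synthetic (8, 8) subset; everything below is illustrative.
rng = np.random.default_rng(6)
obs = rng.standard_normal(92)
lres = obs[:, None] + rng.normal(0.0, 1.2, (92, 8))
hres = obs[:, None] + rng.normal(0.0, 1.0, (92, 8))
E, D, m = aggregate_statistics(lres, hres, obs)
scores = {}
for q in range(51):
    p = 200 - 4 * q
    lam = np.array([p, q], dtype=float) / (p + q)     # ensemble pooling weights
    scores[(p, q)] = adjusted_crps(E, D, m, np.array([p, q]), lam)
print("best pooled combination:", min(scores, key=scores.get))
```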

In Fig. 3, results are shown for 2 m temperature forecasts at day 5. In the left panel, a simple pooling of all combined members is considered. To ease readability, the CRPS is normalised by the minimum CRPS over all tested combinations: this shows the percentage of deterioration with respect to this (local) optimal score, indicated with an asterisk on the graph. Among all configurations considered here, the optimal one is not the (200,50) but the (25,50): adding more lower resolution members to the ensemble degrades the ensemble performance in this example. Configurations with equal predictive performance are highlighted with contour lines (solid black lines). Configurations with computational cost equivalent to the current one are indicated with a dashed descending diagonal line.

Fig. 3. Left: ensemble-adjusted CRPS of a dual-resolution ensemble as a function of the number of Lres (p) and Hres (q) members combined with equal weighting. The CRPS values are normalised in order to indicate the percentage degradation with respect to the optimal solution among the tested combinations (as indicated by a *). Black lines indicate ensemble combinations with equal performance. The diagonal dashed line indicates ensemble combinations with computational cost equivalent to the (0,50) reference forecast. Dotted lines indicate results for ensemble combinations with half or double the reference computer resources. Right: ensemble-adjusted CRPS (CRPSadj) after normalisation as a function of the number of Hres (Lres) members q (p), considering a fixed computational cost equivalent to running 50 Hres forecasts. The plot shows performance for ensembles with equal weighting (°) and optimal weighting (+) for each combination. Weights are estimated based on an (8,8) ensemble. Grey shading indicates 90%-confidence intervals (see text). Optima are highlighted in bold. Results for the baseline/reference combinations of the original analysis are indicated in red. Results are valid for 2 m temperature forecasts at day 5.


Lines parallel to the descending diagonal indicate results for combinations with equal computational costs. For example, results for combinations that require twice and half the current level of computer resources are indicated with dotted lines. Focussing on the current computational cost (dashed diagonal), one can consider running a (200,0) ensemble (top left corner), or a (0,50) ensemble (bottom right corner), or any intermediate combination. The ensemble performances of all these ensemble combinations with equal computational cost are reported in the right panel of Fig. 3.

In Fig. 3 (right panel), the ensemble-adjusted CRPS provides estimates of the dual-resolution ensemble performance for the full range of possible combinations between Lres and Hres members given the current computer resources (grey circles). The original analysis focussed only on four baseline plus one reference combinations (red circles), pointing to a (40,40) ensemble as the optimal combination. Applying score adjustments, a finer analysis shows that the (20,45) ensemble is the optimal dual-resolution combination for this particular weather variable and forecast lead time (2 m temperature at day 5). Using the adjusted CRPS in Expression (6), this type of analysis can be repeated easily, and at no additional cost, for any other computational resource constraint.

3.2. Ensemble weighting

We consider now the case where the number of forecasts to be combined is fixed. The question is whether the ensemble performance can be improved by applying appropriate weighting to each combined model. We propose here an analytical expression of the optimal weights which is directly derived from the kernel representation of the CRPS. In contrast to post-processing methods generally in use, no numerical optimisation procedure is required in the proposed approach. Nevertheless, prior knowledge of forecast performance based on historical data is needed.

The weighting strategy proposed here is useful for showing the relative merits of the different model ensembles and their expected impact on multi-model ensemble performance. Potentially, further improvement could be achieved by additionally considering bias correction of the ensemble members. Such corrections are left for future studies.

In the following, we discuss the case where forecasts from two models are directly combined (k = 2). We use the simplified notations defined in Appendix A, Section A1. Mathematical developments can be found in Appendix A, Section A4, where a general solution for optimal weighting is presented for an arbitrary number $k \ge 2$ of models.

Considering the constraint $\lambda_1 + \lambda_2 = 1$, the minimisation problem reads:
\[ \operatorname*{arg\,min}_{\lambda_1}\left( C_{(m_1, m_2)}(\lambda_1) \right). \tag{7} \]

The optimal weighting depends directly on the number of members combined from model 1 and model 2, that is $m_1$ and $m_2$, respectively. Optimal weights can also be estimated accounting for the ensemble-size effect. In order to derive optimal weights for an $(M_1, M_2)$ ensemble based on a subset $(m_1, m_2)$ of forecasts, one must solve
\[ \frac{\mathrm d\, C_{(m_1, m_2) \to (M_1, M_2)}(\lambda_1)}{\mathrm d \lambda_1} = 0. \tag{8} \]

Using the kernel representation of the adjusted CRPS in a multi-model context, an analytical solution of Equation (8) can be found applying linear algebra. The optimal weight estimate follows:
\[ \hat\lambda_1^{\circ} = \frac{\hat C_2 - \hat C_1 + \hat R_{12}}{2 \hat R_{12}}, \tag{9} \]
where the first two terms in the numerator correspond to the adjusted CRPS of model 2 and model 1, respectively, and with $\hat R_{12}$ defined as
\[ \hat R_{12} = 2\hat D_{12} - \hat D_{11} - \hat D_{22}, \tag{10} \]
where $\hat D_{ij}$ is an estimate of the expected absolute difference between members of model $i$ and members of model $j$. So $\hat R_{12}$ is proportional to the difference between 'inter-model spread' and mean 'intra-model' spread. This term is positive as soon as there is less similarity between members originating from two different models than between members originating from the same model.
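
A minimal implementation of Equations (9) and (10) for two models might look as follows (a sketch with synthetic data; the function name, the toy models with opposite biases, and the (120,20) target are illustrative assumptions, not part of the paper).

```python
import numpy as np

def optimal_weight_model1(z1, z2, obs, M1, M2):
    """Optimal weight for model 1 from Equations (9)-(10), estimated from member
    subsets z1, z2 of shape (n_cases, m_i) and accounting for target sizes M1, M2."""
    obs = np.asarray(obs, dtype=float)
    z1, z2 = np.asarray(z1, dtype=float), np.asarray(z2, dtype=float)
    m1, m2 = z1.shape[1], z2.shape[1]
    E1 = np.mean(np.abs(z1 - obs[:, None]))
    E2 = np.mean(np.abs(z2 - obs[:, None]))
    D11 = 0.5 * np.mean(np.abs(z1[:, :, None] - z1[:, None, :]))
    D22 = 0.5 * np.mean(np.abs(z2[:, :, None] - z2[:, None, :]))
    D12 = 0.5 * np.mean(np.abs(z1[:, :, None] - z2[:, None, :]))
    # adjusted intra-model spreads and single-model adjusted CRPS (Eqs. A5 and A23)
    D11_hat = D11 * (1.0 + (M1 - m1) / (M1 * (m1 - 1)))
    D22_hat = D22 * (1.0 + (M2 - m2) / (M2 * (m2 - 1)))
    C1, C2 = E1 - D11_hat, E2 - D22_hat
    R12 = 2.0 * D12 - D11_hat - D22_hat
    return (C2 - C1 + R12) / (2.0 * R12)

# Illustrative use with two toy models that have opposite biases but equal spread;
# the weight of model 1 in a hypothetical (120, 20) combination is estimated from an
# (8, 8) subset of members over 92 synthetic cases.
rng = np.random.default_rng(8)
obs = rng.standard_normal(92)
z1 = obs[:, None] - 0.4 + rng.normal(0.0, 1.0, (92, 8))
z2 = obs[:, None] + 0.4 + rng.normal(0.0, 1.0, (92, 8))
print(optimal_weight_model1(z1, z2, obs, M1=120, M2=20))
```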

From Equation (9) and the developments in Appendix A, Section A4, we can make the following remarks regarding the values that $\hat\lambda_1^{\circ}$ can take:

  • $\hat\lambda_1^{\circ} = 0.5$ if $\hat C_1 = \hat C_2$, that is, model 1 and model 2 have the same performance;

  • $\hat\lambda_1^{\circ} = 1$ if $\hat C_1 = 0$ and $\hat C_2 \neq 0$, that is, model 1 is perfect and model 2 is not;

  • $\hat\lambda_1^{\circ} = 0$ if $\hat C_2 = 0$ and $\hat C_1 \neq 0$, that is, model 2 is perfect and model 1 is not.

Exploiting these results, the performance of dual-resolution ensembles can now be examined considering optimal weighting. For each (p, q) ensemble, optimal weights are estimated using Equation (9) and applied to the score estimation in Expression (6). For illustration purposes, we apply an out-of-sample approach for the computation of the weights and their application to the multi-ensemble forecast before verification. We proceed as follows: the 92-day verification sample is divided into a training period and a testing period. Elements of the training group are randomly selected as 15 blocks of three consecutive days with replacement, that is, blocks that can overlap. Days not selected for training are used for testing. We average the CRPS over all pairs in the training period and then compute the optimal weights for the mean CRPS. Optimal weights estimated over the training period are applied to the forecasts over the testing period. Testing and training periods are then swapped. Verification of the weighted ensemble forecasts is finally computed over the whole verification window. The process is repeated 1000 times with a random selection of training and testing days in order to obtain a score distribution from which statistics are derived.
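
The resampling procedure just described can be sketched generically as follows (illustrative only; `fit_weights` and `score` stand for user-supplied routines such as Equation (9) and the weighted score of Expression (6), and all names are ours, not the authors').

```python
import numpy as np

def cross_validated_scores(n_days, fit_weights, score, block_len=3, n_blocks=15,
                           n_repeats=1000, seed=0):
    """Distribution of out-of-sample verification scores when weights are trained on
    randomly drawn blocks of days and applied to the remaining days, and vice versa.

    fit_weights(day_indices) -> weights ; score(day_indices, weights) -> mean CRPS
    are user-supplied routines (e.g. Equation (9) and the weighted Expression (6))."""
    rng = np.random.default_rng(seed)
    all_days = np.arange(n_days)
    results = []
    for _ in range(n_repeats):
        starts = rng.choice(n_days - block_len + 1, size=n_blocks, replace=True)
        train = np.unique(np.concatenate([np.arange(s, s + block_len) for s in starts]))
        test = np.setdiff1d(all_days, train)
        # train on one part, verify on the other, then swap; pool over the whole window
        s_test = score(test, fit_weights(train))
        s_train = score(train, fit_weights(test))
        results.append((test.size * s_test + train.size * s_train) / n_days)
    return np.array(results)

# e.g. with trivial placeholder routines, just to illustrate the calling convention:
dist = cross_validated_scores(92, fit_weights=lambda days: 0.5,
                              score=lambda days, w: 0.8 + 0.001 * abs(w - 0.5))
print(np.percentile(dist, [5, 95]))
```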

Fig. 4 shows the optimal weights as estimated from a (200,50) ensemble (black line) and from an (8,8) subset of members (red line). Optimal weighting is compared with ensemble pooling, where all combined members have the same weight (grey line). When a coloured line is above the grey line, it means that the weight optimisation increases the contribution of the Hres members with respect to ensemble pooling. This is the case when the number of combined Hres members is smaller than the one associated with the optimal (p, q) combination, as indicated by the vertical line. We also see that the differences between optimal weight estimates based on a (200,50) ensemble or an (8,8) ensemble are small. Moreover, the resulting CRPS values when applying one or the other set of optimal weights exhibit differences no larger than 0.1%, which are not significant (not shown).

Fig. 4. Weight associated with the Hres model as a function of the number of Hres (Lres) members q (p), considering a fixed computational cost equivalent to running 50 Hres forecasts: weights when ensemble pooling is applied (grey line), optimal weights estimated from a (200,50) ensemble (black line), and optimal weights estimated from an (8,8) ensemble (red line). The vertical dotted line indicates the optimal (p,q) combination as seen in Fig. 3 (right panel).


In Fig. 3 (right panel), ensemble forecast performances with optimal weighting are plotted for each ensemble combination (p,q). The CRPS of the forecast with optimal weights (crosses) can be compared with the CRPS of pooled forecasts (circles) for a given combination of Hres and Lres forecasts. Grey shading shows confidence intervals as measured by the 5%–95% percentiles of the score distribution. They represent the performance uncertainty associated with the weighting procedure.

From the results presented in the right panel of Fig. 3, we conclude that (i) the application of optimal weights can substantially improve the performance of a multi-ensemble forecast, (ii) a large range of ensemble combinations have near-optimal scores when optimal weighting is applied, and (iii) the optimal combination with pooled forecasts (the (20,45) ensemble) is not improved further by model weighting.

More generally, the multi-model ensemble performances under simple pooling and under optimal weighting provide complementary information. For the design of an ensemble system, the assessment of raw performance and of the potential performance after optimal weighting are both relevant. Both can be performed efficiently based on the ensemble-adjusted CRPS.

4. Summary

Ensemble-adjusted scores allow one to account for the ensemble-size effect on ensemble forecast performance. This paper revisits the ensemble-adjusted score concept in the context of multi-model ensemble forecasting. An unbiased estimator of the continuous ranked probability score as a function of the ensemble size is proposed and its robustness tested on dual-resolution ensemble forecasts. It is shown that adjusted scores $S_{(m_1,\dots,m_k)\to(M_1,\dots,M_k)}$ based on a small subset of exchangeable members from each model (typically $m_i = 8$ for each combined model $i$) provide good performance estimates of any $(M_1,\dots,M_k)$ ensemble configuration. The validity of the approach depends on generalised multi-model exchangeability conditions. A simple tool is provided to determine whether the multi-model ensemble at hand satisfies the conditions within sampling uncertainty. Score adjustment of the Brier score in a multi-model ensemble context is also provided in the appendix of this article. Further research is welcome in order to develop ensemble-size adjustments of other scores for a more comprehensive analysis of ensemble performance based on small subsets of ensemble members.

From a research testing perspective, the use of ensemble-size adjusted scores can represent a substantial saving in terms of the computational cost of numerical experimentation. In our illustrative example, a reduction of the experiment cost by up to a factor of 10 (by running fewer members) does not considerably deteriorate the quality of the analysis: the unbiased score estimates are highly correlated with scores computed from the actual-size ensemble. It is shown that this strategy is more efficient than drastically reducing the verification sample in terms of the number of forecast start dates. The latter can be detrimental to a robust assessment of ensemble performance.

Ensemble-adjusted scores find applications in the design of multi-model ensemble systems. This is also illustrated here with a dual-resolution ensemble where an optimal combination of higher- and lower-resolution forecasts is targeted. Not only simple pooling of forecasts but also optimal weighting of the contributing models can be investigated, accounting for the ensemble-size effect. Based on linear algebra, optimal weights are directly derived from the CRPS kernel representation. Applying optimal weighting strategies helps to demonstrate the potential performance of optimally combined ensemble forecasts. The derivation of optimal weights, in a non-iterative fashion, can be applied without restriction to any combination of ensemble members.

At low experiment computational cost and with limited verification effort, it is possible to draw a full picture of the expected performance in terms of CRPS as a function of the number of members from each contributing model. The optimal ensemble configuration can be easily identified for a given computational cost, with and without weighting members. This new type of ensemble skill investigation is coined predictive verification and aims to provide a framework for making informed decisions on future multi-model ensemble configurations.

Acknowledgement

The authors are grateful to Francisco J. Doblas-Reyes for drawing their attention to the main topic of this manuscript during the Annual Seminar held at ECMWF in September 2017.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • Baran, S., Leutbecher, M., Szabó, M. and Ben Bouallègue, Z. 2019. Statistical post-processing of dual-resolution ensemble forecasts. Q. J. R. Meteorol. Soc. 145, 1705–1720.
  • Bernardo, J. M. and Smith, A. 2000. Bayesian Theory. Wiley, Chichester.
  • Casanova, S. and Ahrens, B. 2009. On the weighting of multimodel ensembles in seasonal and short-range weather forecasting. Mon. Wea. Rev. 137, 3811–3822. doi:10.1175/2009MWR2893.1
  • DelSole, T., Nattala, J. and Tippett, M. 2014. Skill improvement from increased ensemble size and model diversity. Geophys. Res. Lett. 41, 7331–7342. doi:10.1002/2014GL060133
  • DelSole, T., Yang, X. and Tippett, M. 2013. Is unequal weighting significantly better than equal weighting for multimodel forecasting? Q. J. R. Meteorol. Soc. 139, 176–183. doi:10.1002/qj.1961
  • Doblas-Reyes, F., Hagedorn, R. and Palmer, T. 2005. The rationale behind the success of multimodel ensembles in seasonal forecasting. II. Calibration and combination. Tellus A 57, 234–252.
  • Ferro, C. A. T., Richardson, D. S. and Weigel, A. P. 2008. On the effect of ensemble size on the discrete and continuous ranked probability scores. Met. Apps. 15, 19–24. doi:10.1002/met.45
  • Gneiting, T. and Raftery, A. E. 2007. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 102, 359–378. doi:10.1198/016214506000001437
  • Hagedorn, R., Doblas-Reyes, F. and Palmer, T. 2005. The rationale behind the success of multimodel ensembles in seasonal forecasting. I. Basic concept. Tellus A 57, 219–233.
  • Leutbecher, M. 2019. Ensemble size: How suboptimal is less than infinity? Q. J. R. Meteorol. Soc. 145(Suppl. 1), 107–128.
  • Leutbecher, M. and Ben Bouallègue, Z. 2019. On the probabilistic skill of dual-resolution ensemble forecasts. Q. J. R. Meteorol. Soc. doi:10.1002/qj.3704
  • Richardson, D. S. 2001. Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Q. J. R. Meteorol. Soc. 127, 2473–2489. doi:10.1002/qj.49712757715
  • Siegert, S., Ferro, C. A. T., Stephenson, D. B. and Leutbecher, M. 2019. The ensemble-adjusted Ignorance score for forecasts issued as Normal distributions. Q. J. R. Meteorol. Soc. 145(Suppl. 1), 129–139.
  • Weigel, A. and Bowler, N. 2009. Comment on 'Can multimodel combination really enhance the prediction skill of probabilistic ensemble forecasts?'. Q. J. R. Meteorol. Soc. 135, 535–539. doi:10.1002/qj.381
  • Weigel, A., Liniger, M. and Appenzeller, C. 2008. Can multi-model combination really enhance the prediction skill of ensemble forecasts? Q. J. R. Meteorol. Soc. 134, 241–260. doi:10.1002/qj.210

APPENDIX A:

Mathematical details for CRPS adjustment

In the following, the mathematical details that were omitted from the main text are provided. The required exchangeability conditions and the proof that the score estimator given these conditions is unbiased are provided first. Then, the general solution for the optimal weights is derived.

A1. Kernel representation in compact notation

It is convenient to introduce a compact notation for the derivations that follow. The verification statistics that need to be aggregated in order to apply the score adjustment in the kernel representation of the CRPS are the mean absolute error $E_i$ and the $L_1$-spread matrix $D_{ij}$. When obtained from numerical experimentation with ensemble size $\mathbf m$, these verification statistics are
\[ E_i(\mathbf m) = \frac{1}{m_i}\sum_{g=1}^{m_i}|z_{ig} - y|, \tag{A1} \]
\[ D_{ij}(\mathbf m) = \frac{1}{2 m_i m_j}\sum_{g=1}^{m_i}\sum_{h=1}^{m_j}|z_{ig} - z_{jh}|. \tag{A2} \]

The matrix D is symmetric. Its off-diagonal terms describe the diversity between models, or ‘inter-model spread’, while the diagonal terms describe the spread within the individual models, or ‘intra-model’ spread.

The CRPS kernel representation for the multi-model ensemble with weights $\lambda$, as given earlier in Equation (5), can be expressed in terms of $E_i$ and $D_{ij}$ as
\[ C_{\mathbf m}(\lambda) = \sum_{i=1}^{k}\lambda_i E_i(\mathbf m) - \sum_{i=1}^{k}\sum_{j=1}^{k}\lambda_i \lambda_j D_{ij}(\mathbf m). \tag{A3} \]

Similarly, the adjusted score according to (6) can be written as
\[ C_{\mathbf m \to \mathbf M}(\lambda) = C_{\mathbf m}(\lambda) - \sum_{i=1}^{k}\lambda_i^2 \gamma_i D_{ii}(\mathbf m) = \sum_{i=1}^{k}\lambda_i E_i(\mathbf m) - \sum_{i=1}^{k}\sum_{j=1}^{k}\lambda_i \lambda_j \hat D_{ij}(\mathbf m, \mathbf M), \tag{A4} \]
with the adjusted spread matrix given by
\[ \hat D_{ij} = D_{ij} + \delta_{ij}\,\gamma_i D_{ii} \tag{A5} \]
and ensemble-size adjustment factors
\[ \gamma_i = \frac{M_i - m_i}{M_i (m_i - 1)}, \tag{A6} \]
where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. The $k + k(k+1)/2$ numbers $E_i, D_{ij}$ with $i \le j$ characterise the joint distribution of $y$ and $z_{\mathbf m}$ completely in terms of the CRPS. Storing the $E_i$ and $D_{ij}$ permits the computation of the CRPS estimator for any set of weights $\lambda$ and any target ensemble size $\mathbf M$ from numerical experimentation with ensemble size $\mathbf m$.

A2. Generalised multi-model exchangeability conditions

Now, we focus on the conditions required to render the adjusted score given by Expressions (6) and (A4) an unbiased estimate for the target ensemble size $\mathbf M$. The multi-model exchangeability and the ensemble-size invariance conditions described in Sections A2.1 and A2.2 are sufficient but impractical to validate. Subsequently, in Section A2.3, we provide less general conditions in terms of expected absolute differences between members and between members and the verification data. The latter conditions are more practical as they can be checked easily for any multi-model ensemble.

A2.1. Multi-model exchangeability

We extend the notion of exchangeability to the multi-model setting as follows. For each of the k models consider an arbitrary permutation of its members. This describes a permutation for the entire ensemble that respects the order of the models in the vector zm. We can represent this permutation via its block-diagonal permutation matrix P.

An ensemble composed of members from k different models is said to be multi-model exchangeable if for any such permutation P that consists of arbitrary permutations of the single-model sub-ensembles, the joint distribution of Pz is identical to the joint distribution of z.

A2.2. Multi-model ensemble-size invariance

Consider an ensemble generation method that can generate multi-model exchangeable ensembles of different sizes, say $\mathbf m = (m_1, \dots, m_k)$ and $\tilde{\mathbf m} = (\tilde m_1, \dots, \tilde m_k)$. Let $z_{\mathbf m}$ and $\tilde z_{\tilde{\mathbf m}}$ denote the two ensembles corresponding to the ensemble sizes $\mathbf m$ and $\tilde{\mathbf m}$. One can compare marginal distributions constructed from the two ensembles that contain the same number $l_j \le \min(m_j, \tilde m_j)$ of distinct members from each model $j$. Let $H$ and $\tilde H$ denote operators that extract for each model $j$ a subset of $l_j$ members from the ensembles $z_{\mathbf m}$ and $\tilde z_{\tilde{\mathbf m}}$, respectively. We define the multi-model ensemble generation method to be ensemble-size invariant if, for any ensemble sizes $\mathbf m$ and $\tilde{\mathbf m}$ and for any subensemble extractions $H$ and $\tilde H$ that yield the same number of distinct members from each model $j$, the joint distribution of $H z_{\mathbf m}$ is identical to the joint distribution of $\tilde H \tilde z_{\tilde{\mathbf m}}$.

A2.3. Conditions in terms of expected distances

For an ensemble that satisfies the generalised multi-model exchangeability conditions given in A2.1 and A2.2, the expected values of the distance between members and the distance between members and verification depend only on the model indices $i$, $j$ and neither on the ensemble size $\mathbf m$ nor on the member numbers $g$, $h$ within a model:
\[ \mathrm{E}\,|z_{ig} - z_{jh}| = (1 - \delta_{ij}\delta_{gh})\,\Delta_{ij}, \tag{A7} \]
\[ \mathrm{E}\,|z_{ig} - y| = \Theta_i. \tag{A8} \]

These conditions are the key ingredient for the following proof.

A3. Proof

With the conditions expressed in Equations (A7) and (A8), the expected $E$ and $D$ are given by
\[ \mathrm{E}\, E_i(\mathbf m) = \Theta_i, \tag{A9} \]
\[ \mathrm{E}\, D_{ij}(\mathbf m) = \tfrac{1}{2}\Delta_{ij} \quad \text{for } i \neq j, \tag{A10} \]
\[ \mathrm{E}\, D_{ii}(\mathbf m) = \frac{m_i - 1}{2 m_i}\Delta_{ii}. \tag{A11} \]

This implies that the expected adjusted spread matrix satisfies
\[ \mathrm{E}\, \hat D_{ii}(\mathbf m, \mathbf M) = \frac{M_i - 1}{2 M_i}\Delta_{ii}. \tag{A12} \]

Therefore, $\mathrm{E}\, \hat D_{ii}(\mathbf m, \mathbf M) = \mathrm{E}\, D_{ii}(\mathbf M)$, which yields the desired result that
\[ \mathrm{E}\, C_{\mathbf m \to \mathbf M}(\lambda) = \mathrm{E}\, C_{\mathbf M}(\lambda). \tag{A13} \]

So $C_{\mathbf m \to \mathbf M}(\lambda)$ is an unbiased estimator for the expected CRPS at the target ensemble size, $\mathbf M$.
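
The unbiasedness result (A13) can also be checked numerically with a toy exchangeable two-model ensemble, for example as follows (illustrative Monte Carlo sketch, not part of the paper; the toy distributions are arbitrary).

```python
import numpy as np

def mixture_crps_stats(zs, lam, gammas=None, y=0.0):
    """Mixture CRPS (A3); with gammas supplied it becomes the adjusted score (A4)."""
    E = np.array([np.mean(np.abs(z - y)) for z in zs])
    D = np.array([[0.5 * np.mean(np.abs(zi[:, None] - zj[None, :])) for zj in zs]
                  for zi in zs])
    if gammas is not None:
        D = D + np.diag(gammas * np.diag(D))
    lam = np.asarray(lam, dtype=float)
    return lam @ E - lam @ D @ lam

# The adjusted score from m = (4, 4) members should match, on average, the plain
# mixture CRPS of M = (20, 20) members drawn from the same two toy models.
rng = np.random.default_rng(7)
lam, m, M = np.array([0.5, 0.5]), np.array([4, 4]), np.array([20, 20])
gammas = (M - m) / (M * (m - 1))
full, adjusted = [], []
for _ in range(20000):
    y = rng.standard_normal()
    big = [rng.normal(0.0, 1.0, M[0]), rng.normal(0.2, 1.3, M[1])]
    small = [z[:mi] for z, mi in zip(big, m)]
    full.append(mixture_crps_stats(big, lam, y=y))
    adjusted.append(mixture_crps_stats(small, lam, gammas=gammas, y=y))
print(np.mean(full), np.mean(adjusted))   # should agree within sampling noise
```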

A4. Optimal weighting

Now, we describe how to obtain the optimal weights $\lambda$ for a target ensemble size $\mathbf M$. Let
\[ e = (E_1, \dots, E_k)^{T} \tag{A14} \]
denote the column vector with the mean absolute errors. Then, the adjusted CRPS can be written in matrix notation as
\[ C_{\mathbf m \to \mathbf M}(\lambda) = \lambda^{T} e - \lambda^{T} \hat D \lambda. \tag{A15} \]

Optimal weights are sought subject to the constraint $\sum_j \lambda_j = 1$. This can be achieved via a Lagrange multiplier. Define
\[ L(\lambda, \phi) = C_{\mathbf m \to \mathbf M}(\lambda) - \phi\,(u^{T}\lambda - 1) \tag{A16} \]
with the vector $u = (1, \dots, 1)^{T}$. The optimum weights are then the solution of
\[ \nabla_{\lambda, \phi}\, L = 0. \tag{A17} \]

This yields linear equations for the weights and the Lagrange multiplier $\phi$:
\[ 2 \hat D \lambda = e - \phi u, \tag{A18} \]
\[ u^{T}\lambda = 1. \tag{A19} \]

Solving (A18) for the weights and inserting in (A19) yields the Lagrange multiplier
\[ \phi = \frac{u^{T}\hat D^{-1} e - 2}{u^{T}\hat D^{-1} u}. \tag{A20} \]

Now, the optimum weights can be computed as
\[ \hat\lambda^{\circ} = \frac{1}{2}\hat D^{-1}\left(e - \frac{u^{T}\hat D^{-1} e - 2}{u^{T}\hat D^{-1} u}\, u\right). \tag{A21} \]
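
Equations (A20) and (A21) translate directly into a few lines of linear algebra, for example (illustrative sketch; the toy numbers for three hypothetical models are arbitrary):

```python
import numpy as np

def optimal_weights(E, D_hat):
    """Optimal mixture weights from Eqs. (A20)-(A21) for k models, given the vector
    of mean absolute errors E and the adjusted spread matrix D_hat."""
    e = np.asarray(E, dtype=float)
    D_hat = np.asarray(D_hat, dtype=float)
    u = np.ones_like(e)
    Dinv_e = np.linalg.solve(D_hat, e)
    Dinv_u = np.linalg.solve(D_hat, u)
    phi = (u @ Dinv_e - 2.0) / (u @ Dinv_u)
    return 0.5 * (Dinv_e - phi * Dinv_u)

# Toy example with k = 3 hypothetical models (numbers are arbitrary):
E = np.array([0.60, 0.55, 0.70])
D_hat = np.array([[0.25, 0.20, 0.22],
                  [0.20, 0.24, 0.21],
                  [0.22, 0.21, 0.30]])
lam = optimal_weights(E, D_hat)
print(lam, lam.sum())   # the weights sum to one by construction
```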

If we consider the combination of only two models ($k = 2$), the optimal weight associated with model 1 can be written as
\[ \hat\lambda_1^{\circ} = \frac{\hat C_2 - \hat C_1 + \hat R_{12}}{2 \hat R_{12}}, \tag{A22} \]
with $\hat C_i$ the expected value of the adjusted CRPS of model $i$ in the single ensemble case,
\[ \hat C_i = E_i - \hat D_{ii}, \tag{A23} \]
and with
\[ \hat R_{12} = 2\hat D_{12} - \hat D_{11} - \hat D_{22}. \tag{A24} \]

From Equation (A22), it is straightforward to see that $\hat\lambda_1^{\circ} = 0.5$ when $\hat C_2 = \hat C_1$. We can also note that $\hat\lambda_1^{\circ} = 1$ when ensemble 1 is perfect (and ensemble 2 is not), that is $\hat C_1 = 0$ (and $\hat C_2 \neq 0$). Indeed, in that case, $z_{1g} = y$ for $g = 1, \dots, m_1$, which implies that $\hat D_{11} = 0$ and $\hat D_{12} = \tfrac{1}{2}E_2$, so $\hat R_{12} = \hat C_2$. Similarly, we can show that $\hat\lambda_1^{\circ} = 0$ when ensemble 2 is perfect and ensemble 1 is not.

APPENDIX B.

Adjusted Brier score for multi-model ensembles

Consider forecasting a binary outcome, $y$. Suppose that we have $\mathbf m = (m_1, m_2, \dots, m_k)$ ensemble members from $k$ models, and let $N_i$ denote the number of the $m_i$ members that forecast the event $\{y = 1\}$. Suppose that we form the probability forecast
\[ P(\mathbf m, \lambda) = \sum_{i=1}^{k}\lambda_i \frac{N_i}{m_i}, \tag{B1} \]
a linear combination of the proportions, $N_i/m_i$, from each model with weights $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_k)$ associated with the $k$ models. The Brier score for this probability forecast is
\[ B(\mathbf m, \lambda) = \{P(\mathbf m, \lambda) - y\}^{2}. \tag{B2} \]

Suppose that we want to estimate $B(\mathbf M, \lambda)$, the Brier score that we would obtain if we had $\mathbf M = (M_1, M_2, \dots, M_k)$ ensemble members from the $k$ models. An unbiased estimate for the expected value of $B(\mathbf M, \lambda)$ is
\[ B(\mathbf m, \lambda) - \sum_{i=1}^{k}\frac{\lambda_i^{2}\,(M_i - m_i)}{M_i (m_i - 1)}\,\frac{N_i}{m_i}\left(1 - \frac{N_i}{m_i}\right). \tag{B3} \]

This result holds under the same generalised multi-model exchangeability conditions as the CRPS result (see Appendix A, Section A2).
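
For completeness, a minimal sketch of Expressions (B1)–(B3) follows (illustrative code; the event counts, target sizes, and weights in the example are arbitrary, not from the paper).

```python
import numpy as np

def brier_multimodel_adjusted(counts, sizes, target_sizes, weights, y):
    """Adjusted multi-model Brier score, Expressions (B1)-(B3).

    counts:       N_i, members of each model forecasting the event
    sizes:        m_i, available members per model
    target_sizes: M_i, target members per model
    weights:      lambda_i mixture weights (summing to one)
    y:            observed binary outcome (0 or 1)
    """
    N = np.asarray(counts, dtype=float)
    m = np.asarray(sizes, dtype=float)
    M = np.asarray(target_sizes, dtype=float)
    lam = np.asarray(weights, dtype=float)
    p = lam @ (N / m)                                    # combined probability, (B1)
    brier = (p - y) ** 2                                 # Brier score, (B2)
    correction = np.sum(lam**2 * (M - m) / (M * (m - 1)) * (N / m) * (1.0 - N / m))
    return brier - correction                            # unbiased estimate, (B3)

# e.g. estimate the Brier score of a hypothetical (40, 40) combination from an (8, 8)
# subset in which 3 Lres and 5 Hres members forecast the event, with pooling weights:
print(brier_multimodel_adjusted(counts=(3, 5), sizes=(8, 8),
                                target_sizes=(40, 40), weights=(0.5, 0.5), y=1))
```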