
Isotonic recalibration under a low signal-to-noise ratio

ABSTRACT

Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to restore auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to an explainable pricing system because the resulting isotonically recalibrated regression function has a low complexity.

1. Introduction

There are two seemingly unrelated problems in insurance pricing that we are going to tackle in this paper. First, an insurance pricing system should not have any systematic cross-financing between different price cohorts. Systematic cross-financing implicitly means that some parts of the portfolio are under-priced, and this is compensated by other parts of the portfolio that are over-priced. We can prevent systematic cross-financing between price cohorts by ensuring that the pricing system is auto-calibrated. We propose to apply isotonic recalibration, which turns any regression function into an auto-calibrated pricing system.

The second problem that we tackle is the explainability of complex algorithmic models for insurance pricing. In a first step, one may use any complex regression model to design an insurance pricing system, such as, e.g., a deep neural network. Such complex regression models typically lack explainability and rather act as black boxes. For this reason, several post-hoc tools are deployed to explain such complex solutions; we mention, for instance, SHAP by Lundberg & Lee (2017), see also Mayer et al. (2023). Since algorithmic solutions do not generally fulfill the aforementioned auto-calibration property, we propose to apply isotonic recalibration to the algorithmic solution. If the signal-to-noise ratio in the data is low, then the isotonic recalibration step leads to a coarse partition of the covariate space and, as a consequence, to an explainable version of the algorithmic model used in the first place. Thus, explainability is a nice side result of applying isotonic recalibration in a low signal-to-noise ratio situation, which is typically the case in insurance pricing settings.

There are other methods for obtaining auto-calibration through a recalibration step; we mention the tree binning approach of Lindholm et al. (2023) and the local regression approach of Loader (1999) considered in Denuit et al. (2021). These other methods often require tuning of hyperparameters, e.g. using cross-validation. Isotonic recalibration does not involve any hyperparameters as it solves a constrained regression problem (ensuring monotonicity). As such, isotonic recalibration is universal because it also does not depend on the specific choice of the loss function within the family of Bregman losses, and it is scale invariant w.r.t. the first regression as it only uses the ranks of the first regression.

We formalize our proposal. Throughout, we assume that all considered random variables have finite means. Consider a response variable $Y$ that is equipped with covariate information $X \in \mathcal{X} \subseteq \mathbb{R}^q$. The general goal is to determine the (true) regression function $x \mapsto \mathbb{E}[Y \mid X = x]$ that describes the conditional mean of $Y$, given $X$. Typically, this true regression function is unknown, and it needs to be determined from i.i.d. data $(y_i, x_i)_{i=1}^n$, that is, a sample from $(Y, X)$. For this purpose, we try to select a regression function $x \mapsto \mu(x)$ from a (pre-chosen) function class on $\mathcal{X}$ that approximates the conditional mean $\mathbb{E}[Y \mid X = \cdot\,]$ as well as possible. Often, it is not possible to capture all features of the regression function from data. In financial applications, a minimal important requirement for a well-selected regression function $\mu(\cdot)$ is that it fulfills the auto-calibration property.

Definition 1.1

The regression function $\mu$ is auto-calibrated for $(Y, X)$ if $\mu(X) = \mathbb{E}[Y \mid \mu(X)]$, $P$-a.s.

Auto-calibration is an important property in actuarial and financial applications because it implies that, on average, the (price) cohorts $\mu(X)$ are self-financing for the corresponding claims $Y$, i.e. there is no systematic cross-financing within the portfolio, if the structure of this portfolio is described by the covariates $X$ and the price cohorts $\mu(X)$, respectively. In a Bernoulli context, an early version of auto-calibration (called well-calibrated) has been introduced by Schervish (1989), and recently, it has been considered in detail by Gneiting & Resin (2022). In an actuarial and financial context, the importance of auto-calibration has been emphasized in Krüger & Ziegel (2021), Denuit et al. (2021), Lindholm et al. (2023) and Wüthrich (2023).

Many regression models do not satisfy the auto-calibration property. However, there is a simple and powerful method, which we call isotonic recalibration, to obtain an (in-sample) auto-calibrated regression function starting from any candidate function $\pi: \mathcal{X} \to \mathbb{R}$. We apply isotonic recalibration to the pseudo-sample $(y_i, \pi(x_i))_{i=1}^n$ to obtain an isotonic regression function $\hat{\mu}$. Then,
$$\hat{\mu}(X) = \mathbb{E}[Y \mid \hat{\mu}(X)], \quad P_n\text{-a.s.}, \qquad (1.1)$$
where $(Y, X)$ is distributed according to the empirical distribution $P_n$ of $(y_i, x_i)_{i=1}^n$; see Section 2.1 for details. Isotonic regression determines an adaptive partition of the covariate space $\mathcal{X}$, and $\hat{\mu}$ is determined by averaging $y$-values over the partition elements. Clearly, other binning approaches can also be used on the pseudo-sample $(y_i, \pi(x_i))_{i=1}^n$ to enforce (1.1), but we argue that isotonic regression is preferable since it avoids subjective choices of tuning parameters and leads to sensible regression functions under reasonable and verifiable assumptions. The only assumption for isotonic recalibration to be informative is that the function $\pi$ gets the rankings of the conditional means right, that is, whenever $\mathbb{E}[Y \mid X = x_i] \le \mathbb{E}[Y \mid X = x_j]$, we would like to have $\pi(x_i) \le \pi(x_j)$; we will justify this in more detail in the first paragraph of Section 3.1, below.
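To make this two-step recalibration concrete, the following minimal R sketch illustrates it on simulated data; it uses an unweighted isotonic regression via base R's isoreg, and all data-generating choices and object names are illustrative rather than taken from the paper.

# minimal sketch of isotonic recalibration on simulated data (all choices illustrative)
set.seed(1)
n  <- 5000
x  <- runif(n, 18, 90)                      # a single covariate, e.g. driver age
mu <- 0.15 - 0.001 * x                      # true (unknown) regression function
y  <- rpois(n, lambda = mu)                 # low-frequency count responses

# step 1: fit any first regression model to obtain the candidate function pi
first_fit <- glm(y ~ x, family = poisson())
pi_hat    <- fitted(first_fit)

# step 2: isotonic recalibration on the pseudo-sample (y_i, pi(x_i))
ord     <- order(pi_hat)
iso_fit <- isoreg(x = pi_hat[ord], y = y[ord])   # isotonic regression (unweighted)
mu_hat  <- iso_fit$yf                            # recalibrated values, ordered by pi_hat

# empirical auto-calibration (1.1): block-wise averages of y coincide with mu_hat
max(abs(ave(y[ord], mu_hat, FUN = mean) - mu_hat))   # numerically zero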

Using isotonic regression for recalibration is not new in the literature. In the case of binary outcomes, it has already been proposed by Zadrozny & Elkan (2002), Menon et al. (2012) and recently by Tasche (2021, Section 5.3). The monotone single index models of Balabdaoui et al. (2019) follow the same strategy as described above, but the focus of their work is different from ours. They specifically consider a linear regression model for the candidate function $\pi$, which is called the index. In the case of distributional regression, that is, when interest is in determining the whole conditional distribution of $Y$ given covariate information $X$, Henzi et al. (2023) have suggested to first estimate an index function $\pi$ that determines the ordering of the conditional distributions w.r.t. first order stochastic dominance, and then estimate conditional distributions using isotonic distributional regression; see Henzi et al. (2021).

As a new contribution, we show that the size of the partition of the isotonic recalibration may give insight into the information content of the recalibrated regression function $\hat{\mu}$. Furthermore, the partition of the isotonic recalibration allows one to explain connections between covariates and outcomes, in particular, when the signal-to-noise ratio is small, which typically is the case for insurance claims data.

In order to come up with a candidate function $\pi: \mathcal{X} \to \mathbb{R}$, one may consider any regression model such as, e.g., a generalized linear model, a regression tree, a tree boosting regression model or a deep neural network regression model. The aim is that $\pi(\cdot)$ provides us with the correct rankings of the conditional means $\mathbb{E}[Y \mid X = x_i]$, $i = 1, \dots, n$. Moreover, this first candidate function $\pi: \mathcal{X} \to \mathbb{R}$ can be estimated on the original response scale or on any monotonically transformed scale, e.g. logged insurance losses, as long as we get the rankings correct. The latter may be an attractive option to limit the influence of outliers on the estimation of the candidate function. The details are discussed in Section 3.

Organization. In Section 2, we formally introduce isotonic regression, which is a constrained optimization problem. This constrained optimization problem is usually solved with the pool adjacent violators (PAV) algorithm, which is described in Appendix A.1. Our main result is stated in Section 2.2. It relates the complexity of the isotonic recalibration solution to the signal-to-noise ratio in the data. Section 3 gives practical guidance on the use of isotonic recalibration, and in Section 4 we exemplify our results on a frequently used insurance dataset. In this section, we also present graphic tools for interpreting the regression function. Section 5 concludes.

2. Isotonic regression

2.1. Definition and basic properties

For simplicity, we assume that the candidate function $\pi: \mathcal{X} \to \mathbb{R}$ does not lead to any ties in the values $\pi(x_1), \dots, \pi(x_n)$, and that the indices $i = 1, \dots, n$ are chosen such that they are aligned with the ranking, that is, $\pi(x_1) < \dots < \pi(x_n)$. Remark 2.1 explains how to handle ties. The isotonic regression of $z = (y_i, \pi(x_i))_{i=1}^n$ with positive case weights $(w_i)_{i=1}^n$ is the solution $\hat{\mu} \in \mathbb{R}^n$ to the restricted minimization problem
$$\hat{\mu} = \underset{\mu = (\mu_1, \dots, \mu_n)^\top}{\operatorname{arg\,min}}\; \sum_{i=1}^n w_i (y_i - \mu_i)^2, \quad \text{subject to } \mu_1 \le \dots \le \mu_n. \qquad (2.1)$$
This problem has a unique solution; see Barlow & Brunk (1972, Theorem 2.1). We can rewrite the side constraints as $A\mu \ge 0$ (component-wise), where $A = (a_{i,j})_{i,j} \in \mathbb{R}^{(n-1) \times n}$ is the matrix with the elements $a_{i,j} = \mathbb{1}_{\{i = j-1\}} - \mathbb{1}_{\{i = j\}}$, so that $(A\mu)_i = \mu_{i+1} - \mu_i$. We define $y = (y_1, \dots, y_n)^\top \in \mathbb{R}^n$ and the (diagonal) case weight matrix $W = \operatorname{diag}(w_1, \dots, w_n)$. The above optimization problem then reads as
$$\hat{\mu} = \hat{\mu}(z) = \underset{\mu:\, A\mu \ge 0}{\operatorname{arg\,min}}\; (y - \mu)^\top W (y - \mu). \qquad (2.2)$$
This shows that the isotonic regression is solved by a convex minimization with linear side constraints. It remains to verify that the auto-calibration property claimed in (1.1) holds.

Remark 2.1

If there are ties in the values $\pi(x_1), \dots, \pi(x_n)$, for example, $\pi(x_i) = \pi(x_j)$ for some $i \ne j$, we replace $y_i$ and $y_j$ with their weighted average $(w_i y_i + w_j y_j)/(w_i + w_j)$ and assign them weights $(w_i + w_j)/2$. The procedure is analogous for more than two tied values. This corresponds to the second option of dealing with ties in de Leeuw et al. (2009, Section 2.1).

Remark 2.2

Barlow et al. (1972, Theorem 1.10) show that the square loss part $L(y, \mu) = (y - \mu)^2$ in (2.1) can be replaced by any Bregman loss, $L_\phi(y, \mu) = \phi(y) - \phi(\mu) - \partial\phi(\mu)(y - \mu)$, without changing the optimal solution $\hat{\mu}$. Here, $\phi$ is a strictly convex function with subgradient $\partial\phi$. Bregman losses are the only strictly consistent loss functions for mean estimation; see Savage (1971) and Gneiting (2011, Theorem 7). For claims count modeling, a Bregman loss of relevance for this paper is the Poisson deviance loss, which arises by choosing $\phi(x) = x\log(x)$, $x \in [0, \infty)$.
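As a worked example of the last statement, take $\phi(x) = x\log(x)$ with derivative $\partial\phi(x) = \log(x) + 1$. Then, for $y > 0$,
$$L_\phi(y, \mu) = y\log(y) - \mu\log(\mu) - (\log(\mu) + 1)(y - \mu) = y\log(y/\mu) - y + \mu,$$
which is half the unit Poisson deviance $2\big(y\log(y/\mu) - y + \mu\big)$; minimizing this Bregman loss under the monotonicity constraint therefore yields the same solution $\hat{\mu}$ as the weighted square loss in (2.1).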

The solution to the minimization problem (2.1) can be given explicitly as a min-max formula
$$\hat{\mu}_i = \min_{\ell = i, \dots, n}\; \max_{k = 1, \dots, i}\; \frac{1}{\sum_{j=k}^{\ell} w_j} \sum_{j=k}^{\ell} w_j y_j.$$
While the min-max formula is theoretically appealing and useful, the related minimum lower sets (MLS) algorithm of Brunk et al. (1957) is not an efficient way to compute the solution. The pool adjacent violators (PAV) algorithm, which is due to Ayer et al. (1955), Miles (1959) and Kruskal (1964), allows for fast computation of the isotonic regression and provides us with the desired insights about the solution. In Appendix A.1, we describe the PAV algorithm in detail. The solution is obtained by suitably partitioning the index set $I = \{1, \dots, n\}$ into (discrete) intervals
$$I_k = I_k(z) = \{i_{k-1} + 1, \dots, i_k\} \quad \text{for } k = 1, \dots, K(z), \qquad (2.3)$$
with $z$-dependent slicing points $0 = i_0 < i_1 < \dots < i_K = n$, and with $K(z) \in \{1, \dots, n\}$ denoting the number of discrete intervals $I_k$. The number $K(z)$ of intervals and the slicing points $i_k = i_k(z)$, $k = 1, \dots, K(z)$, for the partition of $I$ depend on the observations $z$. On each discrete interval $I_k$ we then obtain the isotonic regression parameter estimate, i.e. for an instance $i \in I_k$ we have
$$\hat{\mu}_i = \hat{\mu}_{i_k} = \frac{1}{\sum_{j \in I_k} w_j} \sum_{j \in I_k} w_j y_j, \qquad (2.4)$$
see also (A5). Thus, on each block $I_k$ we have a constant estimate $\hat{\mu}_{i_k}$, and the isotonic property tells us that these estimates are strictly increasing over the block indices $k = 1, \dots, K(z)$, because these blocks have been chosen to be maximal. We call $K(z)$ the complexity number of the resulting isotonic regression.
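As a small illustration of this block structure (unweighted case with unit case weights, using base R's isoreg as PAV solver; the simulated data are purely illustrative), the blocks $I_k$, the slicing points $i_k$ and the complexity number $K(z)$ can be read off the fitted values by run-length encoding:

# read off the block partition and the complexity number K(z) from an isotonic fit
set.seed(2)
n <- 20
y <- 0.05 * (1:n) + rnorm(n)          # responses, already ordered by the ranks pi(x_i)
fit    <- isoreg(y)                   # unweighted isotonic regression (PAV solution)
mu_hat <- fit$yf                      # fitted values, constant on each block I_k

blocks <- rle(mu_hat)                 # maximal runs of constant fitted values
K      <- length(blocks$values)       # complexity number K(z)
i_k    <- cumsum(blocks$lengths)      # slicing points i_1 < ... < i_K = n, cf. (2.3)
K; i_k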

Figure 1 gives an example for $n = 20$ and rankings $\pi(x_i) = i$ for $i = 1, \dots, n$. The resulting (non-parametric) isotonic regression function $\hat{\mu} = \hat{\mu}(z)$, which is only uniquely determined at the observations $(\pi(x_i))_{i=1}^n$, can be interpolated by a step function. In Figure 1, this results in a step function having $K(z) - 1 = 9$ steps, that is, we have $K(z) = 10$ blocks, and the estimated regression function $\hat{\mu}$ takes only $K(z) = 10$ different values. This motivates to call $K(z)$ the complexity number of the resulting step function.

Figure 1. Example of an isotonic regression with $K(z) = 10$ blocks; the x-axis shows the ranks $(\pi(x_i))_{i=1}^n$ and the y-axis the resulting isotonic regression estimates $\hat{\mu}$ connected by a step function (in red color).


The partition of the indices $I$ into the isotonic blocks $I_k$ is obtained naturally by requiring monotonicity. This is different from the regression tree approach considered in Lindholm et al. (2023). In fact, this latter reference does not require monotonicity but aims at minimizing the 'plain' square loss using cross-validation for determining the optimal number of partitions. In our context, the complexity number $K(z)$ is fully determined through requiring monotonicity and, in general, the results will differ.

In insurance applications, the blocks $I_k \subseteq I$ provide us with the (empirical) price cohorts $\hat{\mu}_i = \hat{\mu}_{i_k}$, for $i \in I_k$, and (2.4) leads to the (in-sample) auto-calibration property for $Y$
$$\mathbb{E}\big[Y \,\big|\, \hat{\mu}(X) = \hat{\mu}_{i_k}\big] = \frac{1}{\sum_{i \in I_k} w_i} \sum_{i \in I_k} w_i y_i = \hat{\mu}_{i_k}, \qquad (2.5)$$
where $(Y, X)$ is distributed according to the weighted empirical distribution of $(y_i, x_i)_{i=1}^n$ with weights $(w_i)_{i=1}^n$, see also (1.1). Moreover, summing over the entire portfolio we have the (global) balance property
$$\sum_{i=1}^n w_i \hat{\mu}_i = \sum_{k=1}^{K(z)} \sum_{i \in I_k} w_i \hat{\mu}_i = \sum_{k=1}^{K(z)} \hat{\mu}_{i_k} \sum_{i \in I_k} w_i = \sum_{k=1}^{K(z)} \sum_{i \in I_k} w_i y_i = \sum_{i=1}^n w_i y_i, \qquad (2.6)$$
that is, on average the overall (price) level is correctly specified if we price the insurance policies with covariates $x_i$ by $w_i \hat{\mu}_i$, where the weights $w_i > 0$ now receive the interpretation of exposures.

2.2. Monotonicity of the expected complexity number

In this section, we prove that the expected complexity number $\mathbb{E}[K(z)]$ is an increasing function of the signal-to-noise ratio. For this, we assume a location-scale model for the responses $Y_i$, that is, we assume that
$$Y_i = \mu_i + \sigma \epsilon_i, \quad i = 1, \dots, n, \qquad (2.7)$$
with noise terms $\epsilon_i$, location parameters $\mu_i \in \mathbb{R}$ with $\mu_1 \le \dots \le \mu_n$, and scale parameter $\sigma > 0$. Here, $\mu_i$ takes the role of $\pi(x_i)$ in the previous section. The parameters $\mu_1, \dots, \mu_n$ are unknown but it is known that they are labeled in increasing order. The signal-to-noise ratio is then described by the scale parameter $\sigma$, i.e. we have a low signal-to-noise ratio for high $\sigma$ and vice-versa. The explicit location-scale structure (2.7) allows us to analyze
$$y = Y^\sigma(\omega) = \mu + \sigma \epsilon(\omega) = (\mu_1, \dots, \mu_n)^\top + \sigma (\epsilon_1, \dots, \epsilon_n)^\top(\omega), \qquad (2.8)$$
point-wise in the sample points $\omega \in \Omega$ of the probability space $(\Omega, \mathcal{F}, P)$ as a function of $\sigma > 0$; this is similar to the re-parametrization trick of Kingma & Welling (2019) that is frequently used to explore variational auto-encoders.

In this section, we write $K(y) = K(z)$, because the ranking of the outcomes $y$ is clear from the context (labeling), and we do not go via a ranking function $\pi(\cdot)$.

Theorem 2.3

Assume that the responses $Y_i$, $i = 1, \dots, n$, follow the location-scale model (2.7) with (unknown) ordered location parameters $\mu_1 < \dots < \mu_n$, and scale parameter $\sigma > 0$. Then, the expected complexity number $\mathbb{E}[K(Y)]$ of the isotonic regression of $Y$ is a decreasing function in $\sigma > 0$. If the distribution of the noise vector $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$ has full support on $\mathbb{R}^n$, then $\mathbb{E}[K(Y)]$ is strictly decreasing in $\sigma$.

Theorem 2.3 proves that, under a specific but highly relevant model, the complexity number $K(Y)$ of the isotonic regression is decreasing on average with a decreasing signal-to-noise ratio. Implicitly, this means that more noisy data, which has a lower information content, leads to a less granular regression function. Consequently, if the partition of the isotonic regression is used to obtain a partition of the covariate space $\mathcal{X}$ via the candidate function $\pi$, this partition will be less granular the more of the noise in $Y$ cannot be explained by $\pi(X)$. Note that 'explain' has different meanings in machine learning and in statistics. In statistics, a regression function explains part of the uncertainty in the response variable, also called resolution or discrimination in Murphy's score decomposition (Murphy 1973); see Gneiting & Resin (2022, Theorem 2.23). In our case, a regression function that poorly explains the variability of the response leads to explainability of the regression function itself through a low complexity partitioning of the covariate space.

To the best of our knowledge, our result is a new contribution to the literature on isotonic regression. While we focus on the finite sample case, a related result studies the complexity number of the isotonic regression function as a function of the sample size $n$; see Dimitriadis et al. (2022, Lemma 3.2). The complexity number is also related to the effective dimension of the model, and it has been studied under the assumption of independent Gaussian errors in (2.7) in Meyer & Woodroofe (2000) and Tibshirani et al. (2011).

We are assuming strictly ordered location parameters $\mu_i$ in the formulation of Theorem 2.3. This assumption simplifies the proof in the case where we show that the expected complexity number $\mathbb{E}[K(Y)]$ is strictly decreasing in $\sigma$. With some additional notation, the theorem could be generalized to allow for ties between some (but not all) $\mu_i$.

Figure 2 gives an example of a location-scale model (2.7) with i.i.d. standard Gaussian noise and scale parameters $\sigma = 2$ (lhs) and $\sigma = 20$ (rhs), and both figures consider the same sample point $\omega \in \Omega$ in the noise term $\epsilon(\omega)$, see (2.8). On the left-hand side of Figure 2, we have complexity number $K(y) = 46$, and on the right-hand side $K(y) = 13$; the chosen sample size is $n = 100$.
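The monotone behavior of Theorem 2.3 is easy to reproduce numerically. The following R sketch (Gaussian noise, unit case weights, base R's isoreg as PAV solver; all settings are illustrative) estimates the expected complexity number by Monte Carlo over a grid of scale parameters $\sigma$:

# Monte Carlo illustration of Theorem 2.3: the expected complexity number decreases in sigma
set.seed(3)
n      <- 100
mu     <- sort(runif(n))                       # ordered location parameters mu_1 < ... < mu_n
sigmas <- c(0.1, 0.5, 1, 2, 5, 20)
n_sim  <- 200

complexity <- function(y) length(rle(isoreg(y)$yf)$values)   # K(y) of the isotonic regression

EK <- sapply(sigmas, function(s) mean(replicate(n_sim, complexity(mu + s * rnorm(n)))))
round(rbind(sigma = sigmas, expected_K = EK), 2)             # expected_K decreases in sigma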

Figure 2. Example of an isotonic regression of location-scale type with varying signal-to-noise ratio for the identical sample point $\omega \in \Omega$: (lhs) $\sigma = 2$ with $K(y) = 46$ and (rhs) $\sigma = 20$ with $K(y) = 13$.


3. Isotonic recalibration for prediction and interpretation

3.1. Estimation and prediction

To determine an auto-calibrated estimate of the true regression function $x \mapsto \mathbb{E}[Y \mid X = x]$ from i.i.d. data $(y_i, x_i)_{i=1}^n$, we are suggesting a two-step estimation procedure. First, we use the data $(y_i, x_i)_{i=1}^n$ to obtain an estimate $\hat{\pi}$ of a candidate function $\pi$ that should satisfy
$$\pi(x) < \pi(x') \iff \mathbb{E}[Y \mid X = x] < \mathbb{E}[Y \mid X = x'], \qquad (3.1)$$
for all $x, x' \in \mathcal{X}$. This first step is motivated by the fact that the true regression function $\mu$ is auto-calibrated and satisfies, $P$-a.s.,
$$\mathbb{E}[Y \mid X] =: \mu(X) = \mathbb{E}[Y \mid \mu(X)].$$
As a consequence, requirement (3.1) gives strict comonotonicity between $\pi(X)$ and $\mu(X)$, that is, $P$-a.s., $\mu(X) = T(\pi(X))$ for a strictly increasing function $T$.

In a second step, we aim at finding this monotone transformation $T$. We therefore apply isotonic regression (rank-preserving) to the pseudo-sample $(y_i, \hat{\pi}(x_i))_{i=1}^n$ to obtain the in-sample auto-calibrated regression function $\hat{\mu}$ defined on $\{\hat{\pi}(x_i) : i = 1, \dots, n\}$. We call this second step isotonic recalibration, and this isotonically recalibrated version $\hat{\mu}(X)$ of $\hat{\pi}(X)$ not only satisfies auto-calibration (1.1), but it is also an approximation to the true regression function $\mu(X)$, i.e. $\hat{\mu}(X) \approx \mu(X)$; for a related statement in a binary classification context we refer to Tasche (2021, Proposition 3.6). In our case study in Section 4, below, a deep neural network model is chosen for estimating the ranks $\pi$.

In order to obtain a prediction for a new covariate value $x \in \mathcal{X}$, we compute $\hat{\pi}(x)$, find $i$ such that $\hat{\pi}(x_i) < \hat{\pi}(x) \le \hat{\pi}(x_{i+1})$, and interpolate by setting $\hat{\mu}(x) = (\hat{\mu}(x_i) + \hat{\mu}(x_{i+1}))/2$. This interpolation may be advantageous for prediction. For interpretation and analysis, however, we prefer a step function interpolation as this leads to a partition of the covariate space, see Section 3.3, below, and Figure 3.
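A minimal sketch of this prediction rule is given below; pi_sorted and mu_sorted denote the sorted training ranks and the corresponding recalibrated values, all names are illustrative, and the boundary conventions are simplified relative to the text.

# prediction for new ranks pi_new by interpolating the isotonically recalibrated values;
# pi_sorted: sorted training ranks, mu_sorted: recalibrated values on these ranks
predict_recalibrated <- function(pi_new, pi_sorted, mu_sorted, step = FALSE) {
  i <- findInterval(pi_new, pi_sorted)        # largest i with pi_sorted[i] <= pi_new
  i <- pmin(pmax(i, 1L), length(pi_sorted))   # clamp at the two boundaries
  if (step) return(mu_sorted[i])              # step function version, used for interpretation
  j <- pmin(i + 1L, length(mu_sorted))
  (mu_sorted[i] + mu_sorted[j]) / 2           # midpoint interpolation, used for prediction
}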

Summarizing, the proposed two-step estimation procedure first aims at getting the ranks $\pi(X)$ right, see (3.1). Based on a correct ranking, isotonic recalibration lifts these ranks to the right level, using that the true regression function $\mu(X)$ is a monotone transformation of the correct ranks $\pi(X)$, and, thus, the isotonically recalibrated ranks $\hat{\mu}(X)$ provide an approximation to the true regression function $\mu(X)$. A way to verify the rank property (3.1) is to maximize the Gini score; we refer to Wüthrich (2023). A first regression function $\pi(X)$ may not be auto-calibrated, but since the Gini score is invariant under strictly increasing transformations, maximizing the Gini score is still a valid procedure if one isotonically recalibrates this first regression function $\pi(X)$. As a consequence, isotonic recalibration gives a Gini score preserving transformation that provides the auto-calibrated counterpart of the first regression function. The uniqueness part of Theorem 4.3 of Wüthrich (2023) then implies that this auto-calibrated version $\hat{\mu}(X)$ is a valid approximation to the true regression function. This justifies the isotonic recalibration procedure under (3.1), and it is the general regression version of the binary case presented in Menon et al. (2012).

This two-step estimation approach can also be interpreted as a generalization of the monotone single index models considered by Balabdaoui et al. (2019). They assume that the true regression function is of the form $\mathbb{E}[Y \mid X = x] = \psi(\alpha^\top x)$, with an increasing function $\psi$. In contrast to our proposal, the regression model $\pi$ is fixed to be a linear model $\alpha^\top x$ in their approach. They consider global least squares estimation jointly for $(\psi, \alpha)$, but find it computationally intensive. As an alternative, they suggest a two-step estimation procedure similar to our approach with a split of the data such that $\alpha$ and the isotonic regression are estimated on independent samples. They find that if the rate of convergence of the estimator for $\alpha$ is sufficiently fast, then the resulting estimator of the true regression function is consistent with a convergence rate of order $n^{1/3}$.

In a distributional regression framework, Henzi et al. (2023) considered the described two-step estimation procedure with an isotonic distributional regression, instead of a classical least squares isotonic regression as described in Section 2.1. They show that in both cases, with and without sample splitting, the procedure leads to consistent estimation of the conditional distribution of $Y$, given $X$, as long as the index $\pi$ can be estimated at a parametric rate. The two options, with and without sample splitting, do not result in relevant differences in predictive performance in the applications considered by Henzi et al. (2023).

Assumption (3.1) can also be checked by diagnostic plots using binning, similar to the plots in Henzi et al. (2023) in the distributional regression case; such plots are also called lift plots in actuarial practice. Predictive performance should be assessed on a test set of data disjoint from $(y_i, x_i)_{i=1}^n$, that is, on data that has not been used in the estimation procedure at all. Isotonic recalibration ensures auto-calibration in-sample, and under an i.i.d. assumption, auto-calibration will also hold approximately out-of-sample. Out-of-sample auto-calibration can be diagnosed with CORP (consistent, optimally binned, reproducible and PAV) mean reliability diagrams as suggested by Gneiting & Resin (2022), and comparison of predictive performance can be done with the usual squared error loss function or with deviance loss functions.

3.2. Over-fitting at the boundary

There is a small issue with the isotonic recalibration, namely, it may tend to over-fit at the lower and upper boundaries of the ranks $\hat{\pi}(x_1) < \dots < \hat{\pi}(x_n)$. For instance, if $y_n$ is the largest observation in the portfolio (which is not unlikely since the ranking $\hat{\pi}$ is chosen in a response data-driven way), then we estimate $\hat{\mu}_{i_K} = y_n$, where $K = K((y_i, \hat{\pi}(x_i))_{i=1}^n)$. Often, this over-fits to the (smallest and largest) observations, as such extreme values/estimates cannot be verified on out-of-sample data. For this reason, we visually analyze the largest and smallest values in the estimates $\hat{\mu}$, and we manually merge, say, the smallest block $I_1$ with the second smallest one $I_2$ (with the resulting estimate (2.4) on the merged block). More rigorously, this pooling could be cross-validated on out-of-sample data, but we refrain from doing so. In our examples in Section 4.1, below, we merge the two blocks with the smallest/biggest estimates.
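A possible implementation of this manual pooling is sketched below (mu_hat is the isotonic fit ordered by the ranks, y and w the correspondingly ordered responses and case weights; all names are illustrative, and merging exactly the two outermost blocks is just one choice):

# merge the two blocks with the smallest and the two blocks with the largest estimates
merge_boundary_blocks <- function(mu_hat, y, w) {
  blk <- rle(mu_hat)$lengths                 # block sizes of the isotonic fit
  id  <- rep(seq_along(blk), blk)            # block index per observation
  K   <- length(blk)
  if (K >= 4) {                              # only merge if enough blocks are available
    id[id == 2] <- 1                         # pool the two blocks with the smallest estimates
    id[id == K] <- K - 1                     # pool the two blocks with the largest estimates
  }
  # recompute the weighted block means as in (2.4) on the merged partition
  ave(w * y, id, FUN = sum) / ave(w, id, FUN = sum)
}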

3.3. Interpretation

In (2.3) we have introduced the complexity number $K((y_i, \hat{\pi}(x_i))_{i=1}^n)$ that counts the number of different values in $\hat{\mu}$, obtained by the isotonic regression (2.2) in the isotonic recalibration step. This complexity number $K((y_i, \hat{\pi}(x_i))_{i=1}^n)$ allows one to assess the information content of the model, or in other words, how much signal is explainable from the data. Theorem 2.3 shows that the lower the signal-to-noise ratio, the lower the complexity number of the isotonic regression that we can expect. Clearly, in Theorem 2.3 we assume that the ranking of the observations is correct, which will only be approximately satisfied since $\pi$ has to be estimated. In general, having large samples and flexible regression models for modeling $\pi$, it is reasonable to assume that the statement remains qualitatively valid. However, in complex (algorithmic) regression models, we need to ensure that we prevent in-sample over-fitting; this is typically controlled by either using (independent) validation data or by performing a cross-validation analysis.

Typical claims data in non-life insurance have a low signal-to-noise ratio. Regarding claims frequencies, this low signal-to-noise ratio is caused by the fact that claims are not very frequent events, e.g. in car insurance annual claims frequencies range from 5% to 10%, that is, only one out of 10 (or 20) drivers suffers a claim within a calendar year. A low signal-to-noise ratio also applies to claim amounts, which are usually strongly driven by randomness, while the explanatory part from policyholder information is comparably limited. Therefore, we typically expect a low complexity number $K((y_i, \hat{\pi}(x_i))_{i=1}^n)$ both for claims frequency and claim amounts modeling. This low signal-to-noise ratio in actuarial modeling is also a reason for exploiting new sources of information such as telematics data, see, e.g. Peiris et al. (2023), or images, see, e.g. Blier-Wong et al. (2022).

In case of a small to moderate complexity number $K = K((y_i, \hat{\pi}(x_i))_{i=1}^n)$, the regression function $\hat{\mu}$ becomes interpretable through the isotonic recalibration step. For this, we extend the auto-calibrated regression function $\hat{\mu}$ from the set $\{\hat{\pi}(x_1), \dots, \hat{\pi}(x_n)\}$ to the entire covariate space $\mathcal{X}$ by defining a step function
$$\hat{\mu}(x) = \hat{\mu}_{i_k}, \quad \text{if } \hat{\pi}(x_{i_{k-1}+1}) \le \hat{\pi}(x) < \hat{\pi}(x_{i_k + 1}),$$
for all $x \in \mathcal{X}$ (with the obvious conventions at the two boundary cohorts), where $0 = i_0 < i_1 < \dots < i_K = n$ are the slicing points of the isotonic regression as defined in (2.3). Figure 3 illustrates this step function interpolation. We define a partition $\mathcal{X}_1, \dots, \mathcal{X}_K$ of the original covariate space $\mathcal{X}$ by
$$\mathcal{X}_k = \{x \in \mathcal{X} : \hat{\mu}(x) = \hat{\mu}_{i_k}\}, \quad k = 1, \dots, K. \qquad (3.2)$$
Figure 4, below, illustrates how this partition of $\mathcal{X}$ provides insights on the covariate-response relationships in the model. This procedure has some analogy to regression trees and boosting trees that rely on partitions of the covariate space $\mathcal{X}$. In the case study in Section 4, we illustrate two further possibilities to use the partition defined at (3.2) for understanding covariate-response relationships. First, in Figure 6, the influence of individual covariates on the price cohorts is analyzed, and second, Figure 7 gives a summary view of the whole covariate space for a chosen price cohort.
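As a hedged sketch (all object names illustrative), the cohort assignment behind the partition (3.2) can be implemented by locating the rank of a new covariate value among the first ranks of the blocks:

# assign new covariate values to the price cohorts X_1, ..., X_K of (3.2) via their ranks;
# pi_sorted: sorted training ranks, mu_sorted: recalibrated values on these ranks
covariate_partition <- function(pi_new, pi_sorted, mu_sorted) {
  blk         <- rle(mu_sorted)                                    # block structure
  block_value <- blk$values                                        # the K estimates mu_hat_{i_k}
  first_rank  <- pi_sorted[cumsum(blk$lengths) - blk$lengths + 1]  # rank of each block's first member
  k <- findInterval(pi_new, first_rank)                            # cohort index, 0 below the first block
  k <- pmax(k, 1L)                                                 # extend the first cohort downwards
  data.frame(cohort = k, mu_hat = block_value[k])
}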

Figure 3. Isotonic recalibration in the French MTPL example only using DrivAge and BonusMalus as covariates, resulting in the complexity number $K((y_i, \hat{\pi}(x_i))_{i=1}^n) = 49$; the x-axis shows the estimated ranks $(\hat{\pi}(x_i))_{i=1}^n$ and the y-axis the resulting isotonic regression estimates $\hat{\mu}$ using a step function (in red color) on $\mathcal{T}$.


Figure 4. Estimated expected claims frequencies $\hat{\mu}$: (lhs) isotonically recalibrated FFNN regression and (rhs) binary regression tree, both only using DrivAge and BonusMalus as covariates; the color scale shows the expected frequencies, and this color scale is the same in both plots.


Figure 5. (top-left) Isotonic recalibration, (top-right) regression tree recalibration with 17 leaves, (bottom) local regression recalibration with nearest neighbor fractions $\alpha = 5\%, 1\%$ (lhs and rhs); these graphs show how the different recalibration methods change the first FFNN predictions.


Figure 6. Marginal view of the isotonically recalibrated FFNN predictions (line (2) of Table 2) of the considered covariate components (top) DrivAge, Area, VehPower, (middle) VehAge, BonusMalus, VehGas, (bottom) VehBrand, Region; the $K = 77$ colors show the distribution of the estimates $\hat{\mu}$ on all levels of the considered covariates.


Figure 7. Partition $(\mathcal{X}_k)_{k=1,\dots,K}$ of the covariate space $\mathcal{X}$ w.r.t. the isotonic recalibration; the complexity number $K = 77$ induces 77 different expected frequencies illustrated by the colors at the right-hand side of the plots. The lines and dots illustrate the covariate combinations that lead to these expected frequencies for three selected values of $k = 2, 45, 77$ (top-middle-bottom).


4. Motor third party liability claims frequency example

We consider the French motor third party liability (MTPL) claims frequency data FreMTPL2freq, available through the R package CASdatasets (Dutang & Charpentier 2018). This is a frequently studied dataset that has been considered, e.g., in Lindholm et al. (2023) and Denuit et al. (2021). For an exploratory analysis of this dataset we refer to Wüthrich & Merz (2023, Chapter 13). For our analysis we use the same data cleaning as described in Wüthrich & Merz (2023, Listing 13.1); an excerpt of the cleaned data is given in Listing 1.

We have $n = 678{,}007$ insurance policies which have been exposed within the same accounting year and for which we have claims records and covariate information, see Listing 1. Our goal is to build a neural network regression model for pricing and prediction, and we use isotonic recalibration for auto-calibration and interpretation as described in Section 3.3. To make our analysis directly comparable to the one in Wüthrich & Merz (2023), we partition the available sample into a learning sample $\mathcal{L}$ and a test sample $\mathcal{T}$ precisely as described in Wüthrich & Merz (2023, Listing 5.2). Summary statistics of the resulting learning and test samples are given in Wüthrich & Merz (2023, Table 5.2). For model fitting and the isotonic recalibration step we only use the learning sample $\mathcal{L}$. All models are then evaluated out-of-sample on the test sample $\mathcal{T}$, and these out-of-sample results are directly comparable, e.g., to Wüthrich & Merz (2023, Table 7.10), because we use the identical test sample $\mathcal{T}$.

We remark that in the present study we consider the time exposures as known information for claims frequency modeling. Lindholm et al. (2023) raise the problem of having random time exposures, e.g. caused by early terminations of insurance policies. In some sense, early terminations contaminate claims data, which needs a proper statistical treatment for predictive modeling. However, to be able to focus our discussion on isotonic recalibration, we do not discuss and model uncertainties coming from random time exposures here, but we refer to Lindholm et al. (2023) for an in-depth treatment.

4.1. Isotonic recalibration and binary regression trees

We start by considering the two covariate components DrivAge and BonusMalus only. Since the resulting covariate space $\mathcal{X} = \{(\text{DrivAge}, \text{BonusMalus})\} \subseteq \mathbb{R}^2$ is two-dimensional, we can graphically illustrate the differences between the isotonic recalibration approach and a binary regression tree (as a competing model for interpretation). In Section 4.2, below, we consider all available covariates for a competitive predictive model.

We fit a deep feed-forward neural network (FFNN) regression model to the learning sample $\mathcal{L}$. We choose a network architecture of depth 3 with $(20, 15, 10)$ neurons in the three hidden layers, the hyperbolic tangent activation function in the hidden layers, and the log-link for the output layer. The input has dimension 2, which results in a FFNN architecture with a network parameter of dimension 546; for a detailed discussion of FFNNs we refer to Wüthrich & Merz (2023, Chapter 7), in particular, to Listings 7.1–7.3 of that reference. We fit this model using the Poisson deviance loss, see Remark 2.2 and Wüthrich & Merz (2023, Section 5.2.4), and we use the nadam version of stochastic gradient descent. We exercise early stopping on a validation set being 10% of the learning sample $\mathcal{L}$, and we ensemble over ten different network calibrations to reduce the randomness of individual stochastic gradient descent network fits, see Wüthrich & Merz (2023, Section 7.4.4). Line (1) of Table 1, called 'FFNN ensemble', shows the performance of the fitted FFNN regression model ensemble. This is compared to the null model (empirical mean) on line (0) that does not consider any covariates. We observe a decrease in Poisson deviance loss, which justifies the use of a regression model. The last column of Table 1, called 'average frequency', compares the average frequency estimate of the FFNN ensemble on $\mathcal{L}$ to its empirical mean, and we observe a negative bias in the FFNN prediction, i.e. $7.22\% < 7.36\%$; thus, these network estimates under-price the portfolio.

Table 1. Loss figures in the French MTPL example only considering DrivAge and BonusMalus; Poisson deviance losses in $10^{-2}$.

Table 2. Loss figures in the French MTPL example using all covariates; Poisson deviance losses in $10^{-2}$.

In the next step, we use the FFNN estimates as ranks $\hat{\pi}(x_i)$ for ordering the claims $y_i$ and the covariates $x_i$, respectively. Note that these ranks $\hat{\pi}(x_i)$ correspond to expected frequencies which still need to be multiplied by the time exposures $w_i > 0$ to receive the expected number of claims; these time exposures $w_i > 0$ are given on line 3 of Listing 1. Therefore, we perform isotonic recalibration on the observed frequencies $y_i$. These observed frequencies $y_i$ are obtained by dividing the claim counts, line 13 of Listing 1, by the time exposures, line 3 of Listing 1. Then, the isotonic recalibration step is performed by solving the constrained weighted square loss minimization problem (2.1) and (2.2), respectively. We use the R package monotone (Busing 2022), which provides a fast implementation of the PAV algorithm; in fact, this optimization step takes much less than one second on an ordinary laptop. The numerical results are presented on line (2) of Table 1. We observe that this isotonic recalibration leads to a complexity number $K = K((y_i, \hat{\pi}(x_i))_{i=1}^n) = 49$, i.e. we receive a partition into 49 different expected frequencies to represent the isotonically recalibrated FFNN predictor. The out-of-sample loss on $\mathcal{T}$ is slightly increased by this isotonic recalibration step, which may indicate that the ranks of the FFNN prediction could be improved. On the other hand, we now have the auto-calibration property (1.1) and the global balance property on $\mathcal{L}$, as can be seen from the last column of Table 1.
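The following hedged R sketch outlines this recalibration step on the learning sample; the data frame dat with columns ClaimNb, Exposure and the FFNN ranks pi_hat is illustrative, and we assume that the monotone package exposes a weighted PAV fit via monotone(y, w) (a self-contained weighted PAV implementation is sketched in Appendix A.1).

# isotonic recalibration of the FFNN frequency ranks with time exposures as case weights
# (illustrative names; ties in pi_hat would additionally require the pooling of Remark 2.1)
library(monotone)

y   <- dat$ClaimNb / dat$Exposure            # observed frequencies
w   <- dat$Exposure                          # time exposures as case weights
ord <- order(dat$pi_hat)                     # sort by the FFNN ranks pi_hat(x_i)

mu_hat      <- numeric(length(y))
mu_hat[ord] <- monotone(y[ord], w = w[ord])  # weighted PAV fit, cf. (2.1)-(2.2); signature assumed

K <- length(rle(mu_hat[ord])$values)                 # complexity number K
c(K = K, balance = sum(w * mu_hat) - sum(w * y))     # global balance (2.6): ~ 0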

Figure 3 provides the resulting step function from the isotonic recalibration (in red color) of the ranking $(\hat{\pi}(x_i))_{i=1}^n$ given by the FFNN ensemble. The isotonic recalibration has been performed on $\mathcal{L}$, and Figure 3 shows the results out-of-sample on $\mathcal{T}$. From this figure it is not directly observable why the global balance property does not hold in the first place, as the isotonic recalibration is fairly concentrated around the diagonal orange line.

The isotonic recalibration on the ranks $\hat{\pi}(x) = \hat{\pi}(\text{DrivAge}, \text{BonusMalus})$ of the FFNN leads to a partition $\mathcal{X}_1, \dots, \mathcal{X}_{49}$ of the covariate space as defined in (3.2). We compare this partition to the one that results from a binary split regression tree approach (with Poisson deviance loss). We use 10-fold cross-validation to determine the optimal tree size. In this example, the optimal tree has 60 splits and, hence, 61 leaves. The resulting losses of this optimal tree are shown on line (3) of Table 1. We observe that this tree has a worse out-of-sample performance on $\mathcal{T}$ than the isotonically recalibrated version, line (2). The binary regression tree uses a finer partition of the covariate space for this result, 61 leaves vs. complexity number $K = 49$. Figure 4 shows the resulting partitions (the color scale illustrates the estimated expected claims frequencies). We observe that binary splitting along the covariate axes may not be sufficiently flexible (the binary tree only has vertical and horizontal splits, see Figure 4, rhs), whereas the isotonically recalibrated version indicates a more complex topology with diagonal splits, see Figure 4 (lhs).
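A hedged sketch of the competing regression tree fit is given below; it uses rpart's Poisson method with the response cbind(Exposure, ClaimNb), while the data frame names, the complexity parameter grid and the minimal leaf size are illustrative choices, not the exact configuration of the paper.

# binary Poisson regression tree on (DrivAge, BonusMalus), pruned by 10-fold cross-validation
library(rpart)

tree0 <- rpart(cbind(Exposure, ClaimNb) ~ DrivAge + BonusMalus,
               data    = learn,                                 # learning sample L (illustrative name)
               method  = "poisson",
               control = rpart.control(cp = 1e-5, xval = 10, minbucket = 1000))

cp_tab  <- tree0$cptable                                        # cross-validation table
cp_best <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]          # cp with the smallest CV error
tree    <- prune(tree0, cp = cp_best)                           # pruned tree (61 leaves in the paper)

tree_pred <- predict(tree, newdata = test)                      # expected frequency per policy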

Finally, we remark that the binary tree does not (exactly) fulfill the global balance property, $7.34\% < 7.36\%$, see the last column of Table 1. This comes from the fact that the Poisson binary tree implementation in the R package rpart uses a Bayesian version for frequency estimation to avoid leaves with an expected frequency estimate of zero; see Therneau & Atkinson (2022). In our isotonic recalibration approach, this is avoided by merging the two blocks with the smallest estimates, see Section 3.2.

4.2. Consideration of all covariates

In this section, we consider all available covariate components, see lines 4–12 of Listing 1. We first fit an ensemble of ten FFNNs to the learning sample $\mathcal{L}$. This is done exactly as in the previous section, with the only difference that the input dimension changes when we consider all available information. We use the same covariate pre-processing as in Wüthrich & Merz (2023, Example 7.10). That is, we transform the (ordered) Area code to real values, and we use embedding layers of dimension two for the two categorical covariates VehBrand and Region. The FFNN receives a network parameter of dimension 792. The FFNNs are fitted with stochastic gradient descent and early stopped based on a validation set of 10% of the learning sample $\mathcal{L}$. The results are presented on line (1) of Table 2.

We compare the FFNN prediction to the null model (empirical mean), the FFNN prediction with reduced covariates given in Table 1, and the GLM and network predictors given in Wüthrich & Merz (2023, Tables 5.5 and 7.9). The out-of-sample Poisson deviance losses are directly comparable between all these examples because we use the identical test sample $\mathcal{T}$. In view of these results, we conclude that the FFNN ensemble on line (1) of Table 2 (with an out-of-sample loss of 23.784) is highly competitive because it has roughly the same out-of-sample loss as the best predictor in the aforementioned examples/tables.

Our main concern is the auto-calibration property (1.1) and the global balance property (2.6). We study this on the learning sample $\mathcal{L}$, and in view of the last column of Table 2, we conclude that the FFNN predictor on line (1) severely under-estimates the average frequency, $7.23\% < 7.36\%$, and auto-calibration fails to hold here.

In the next step, we use the FFNN prediction as ranks $\hat{\pi}(x_i)$ for ordering the responses and covariates, and we label the claims $y_i$ such that $\hat{\pi}(x_1) < \dots < \hat{\pi}(x_n)$; ties are aggregated as described in Remark 2.1. We then isotonically recalibrate $(\hat{\pi}(x_i))_{i=1}^n$ on the learning sample $\mathcal{L}$. Figure 5 (top-left) shows how the isotonic recalibration step changes the original FFNN predictions (x-axis, on log scale) to their isotonically recalibrated version (y-axis, on log scale). The isotonic recalibration compresses the 581,272 different values of the FFNN predictions to a complexity number of $K = K((y_i, \hat{\pi}(x_i))_{i=1}^n) = 77$. Thus, the entire regression problem is encoded (binned) into 77 different estimates $\hat{\mu}_{i_k}$, $k = 1, \dots, K$. In view of Figure 5 (top-left), the isotonic recalibration step provides heavier lower and upper tails, and this could also be a sign of over-fitting. This isotonically recalibrated regression is empirically auto-calibrated (2.5) on $\mathcal{L}$ and, hence, it fulfills the global balance property, see the last column of line (2) in Table 2. The resulting out-of-sample loss on $\mathcal{T}$ of 23.800 is slightly worse than the one on line (1), but it is the best of all recalibration methods considered.

We compare the isotonic recalibration results to other recalibration methods, namely, the regression tree binning proposal of Lindholm et al. (2023), and the local regression proposal of Denuit et al. (2021). Both of these methods need hyperparameter tuning, which is in contrast to isotonic recalibration. We start with the regression tree recalibration proposed by Lindholm et al. (2023), which aims at finding an optimal binning of the ranks $\hat{\pi}(x_1) < \dots < \hat{\pi}(x_n)$ using a regression tree predictor. If one did not aim at binning the ranks optimally, but just considered binning, e.g. using empirical deciles, this would result in a $\chi^2$-test situation similar to Hosmer & Lemeshow (1980). For an optimal regression tree binning, we use 10-fold cross-validation on $\mathcal{L}$ with the Poisson deviance loss to receive the optimal number of leaves (bins). A further hyperparameter involved is the minimal leaf size, i.e. the required minimal number of instances in each leaf of the regression tree. A comparably small minimal leaf size seems advantageous, and we choose 100. Intuitively, this is motivated by the fact that we have ranks $\hat{\pi}(x_1) < \dots < \hat{\pi}(x_n)$ that should respect monotonicity in the true expected frequencies. That is, although we do not rely on a restricted (isotonic) regression in this tree binning approach, we expect a similar behavior as in the isotonic recalibration because the binary tree splits use this ordering for exploring the best possible splits. If we run 10-fold cross-validation on this configuration, we select an optimal tree size having 17 leaves; there are small variations in this choice because 10-fold cross-validation involves some randomness from the random choice of the folds. The results are presented on line (3) of Table 2 and Figure 5 (top-right).

This tree binning recalibration approach results in a comparably small model, only having 17 leaves, and, in view of Table 2, we have a slightly bigger out-of-sample loss. Interestingly, the regression tree predictor illustrated in Figure 5 (top-right) is not (completely) monotone in the ranks, but there seem to be two outlier bins. There are different explanations for this observation, namely, these outlier bins might be driven by the fact that we have chosen a minimal leaf size that is too small, which implies that the irreducible risk (random noise) has a bigger impact on the binary tree splitting, but it could also indicate that the initial ranking of the FFNN predictor is not fully correct in this part of the covariate space. Finally, from the last column of Table 2 we observe that the tree recalibration approach has the global balance property, see line (3); subject to the explanation at the end of Section 4.1 on how rpart treats zeros in predictions; see also Therneau & Atkinson (2022).

The recalibration proposal of Denuit et al. (2021) is based on a local regression; see Loader (1999). This local regression can be interpreted as a local kernel smoothing. Since we use the same data as Denuit et al. (2021), we choose the same configuration for this local regression recalibration as in Denuit et al. (2021, Section 6.1), i.e. based on the R package locfit (Loader 1999), we perform a local regression of degree 0 with a nearest neighbor fraction of $\alpha = 5\%$. The results are presented in Figure 5 (bottom-left) and line (4a) of Table 2. This local regression recalibration seems to provide biased estimates in the tails. Selecting a smaller nearest neighbor fraction of $\alpha = 1\%$ gives a smaller bias, but a more wiggly curve in the main body of the data, see Figure 5 (bottom-right); therefore, an adaptive smoothing window size may work better. Focusing on the last column of Table 2, we see that this local regression recalibration tries to accurately estimate the expected frequency locally, but this does not give us any guarantee that the global balance property holds. Alternatively, one could explore a generalized additive model (GAM) which is based on natural cubic splines, and which will have the global balance property if we work with the canonical link.
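For completeness, a hedged sketch of this local regression recalibration follows; we assume locfit's formula interface with lp() for the degree and the nearest neighbor fraction, the object names are illustrative, and the call only approximates the configuration used by Denuit et al. (2021).

# local regression recalibration of the FFNN ranks (degree 0, nearest neighbor fraction 5%)
library(locfit)

fit_loc <- locfit(freq ~ lp(pi_hat, deg = 0, nn = 0.05),        # freq = ClaimNb / Exposure (illustrative)
                  data = dat, weights = dat$Exposure, family = "poisson")
mu_loc  <- predict(fit_loc, newdata = data.frame(pi_hat = dat$pi_hat))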

Next, we exploit Theorem 2.3. The expected claims frequency on the considered data is roughly 7.36%, i.e. we expect that on average only 7 out of 100 car drivers suffer a claim within one calendar year. This is a typical situation of a low signal-to-noise ratio. This can also be verified from Table 2. The total uncertainty of the null model of 25.445 (out-of-sample Poisson deviance loss) is reduced to 23.800 by the regression structure using all available information $x$. Thus, the systematic part reduces the total uncertainty by roughly 6.5%, and the remainder can be interpreted as irreducible risk. This is a typical situation of a low signal-to-noise ratio which, in our case, provides a complexity number of $K((y_i, \hat{\pi}(x_i))_{i=1}^n) = 77$, see Figure 5 (top-left); the first FFNN predictor has 581,272 different values. Thus, the resulting complexity of the partition $(\mathcal{X}_k)_{k=1,\dots,K}$ of the covariate space $\mathcal{X}$ is substantially reduced by the isotonic recalibration. This leads to explainability of insurance prices, illustrated in Figures 6 and 7, which we are going to discuss next.

In Figure 6, we illustrate the resulting marginal plots if we project the estimated values $\hat{\mu}$ of the isotonic recalibration to the corresponding covariate levels, i.e. this is the marginal distribution of the resulting covariate space partition (3.2) on each level of the considered covariates. For our complexity number of $K((y_i, \hat{\pi}(x_i))_{i=1}^n) = 77$, this results in 77 different colors, and the resulting picture can be interpreted nicely, see Figure 6. From this plot we conclude that BonusMalus is the most important covariate component, young drivers (DrivAge) have significantly higher expected frequencies, Area code A (country side) has a lower expected frequency, etc.

The previous figure gives a marginal view because the estimates (and colors) are projected to the corresponding covariate levels, similar to partial dependence plots (PDPs). Since it is also important to understand interactions of covariate components, we analyze a second plot in Figure 7; this plot uses the same coloring for the 77 different expected frequencies as Figure 6. The lines connect all the covariate combinations in $x$ that are observed within the data $(x_i)_{1 \le i \le n}$ for a given value $\hat{\mu}_{i_k}$, and the size of the black dots illustrates how often a certain covariate level is represented in this data. E.g., the figure on the top belongs to the second smallest expected claims frequency estimate $\hat{\mu}_{i_2} = 1.86\%$. These car drivers are aged above 30, they mostly live in Area A, have lower power and old cars, and they are on the lowest bonus-malus level. The bottom plot shows the drivers that belong to the biggest expected frequency of $\hat{\mu}_{i_{77}} = 66.74\%$. Not surprisingly, these are mostly young drivers (and some very old drivers) with a higher bonus-malus level. Similar conclusions can be drawn for the other parts $\mathcal{X}_k$ of the covariate space $\mathcal{X}$; thus, having a low complexity number $K((y_i, \hat{\pi}(x_i))_{i=1}^n)$ enables one to explain the regression model.

Finally, we perform a robustness check of the isotonic recalibration by perturbing the ranks $(\hat{\pi}(x_i))_{i=1}^n$. As a benchmark, we consider the local regression recalibration to which we apply the same perturbations. In a first experiment, we choose a different FFNN ensemble. Since network fitting involves several elements of randomness, see Wüthrich & Merz (2023, Remarks 7.9), every fitted network will be slightly different. On lines (2a)–(2c) of Table 3, we consider a second FFNN ensemble to which we apply precisely the same isotonic recalibration and local regression recalibration (using the same hyperparameters) as to the first one on lines (1a)–(1c). The results are rather similar, which supports a certain robustness across different FFNN calibrations. The second experiment that we consider is to randomly locally perturb the ranks $(\hat{\pi}(x_i))_{i=1}^n$ of the first FFNN ensemble by multiplying the estimated FFNN frequencies by $1 + 20\%\,(2U_i - 1)$ for i.i.d. standard uniformly distributed random variables $U_i$. This increases or decreases the original frequencies by at most 20%. Through this additional random noise we generally expect that the complexity number $K$ will decrease. The results are presented on lines (3a)–(3c). We observe that this perturbation does not harm the results too much, which supports that these methods have a certain robustness towards small perturbations of the ranks.
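A hedged sketch of the second perturbation experiment follows; pi_hat denotes the first FFNN ensemble frequencies, while y, w and the weighted PAV call are as in the recalibration sketch of Section 4.2, and all names are illustrative.

# perturb the FFNN ranks by at most +/- 20% and redo the isotonic recalibration
set.seed(4)
U       <- runif(length(dat$pi_hat))                   # i.i.d. standard uniform random variables
pi_pert <- dat$pi_hat * (1 + 0.20 * (2 * U - 1))       # multiplicative perturbation in [0.8, 1.2]

ord          <- order(pi_pert)
mu_pert      <- numeric(length(pi_pert))
mu_pert[ord] <- monotone(y[ord], w = w[ord])           # weighted PAV on the perturbed ranks (signature assumed)
length(rle(mu_pert[ord])$values)                       # complexity number under the perturbation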

Table 3. Robustness check in the French MTPL example using all covariates; Poisson deviance losses in $10^{-2}$.

5. Conclusions

We have tackled two problems. First, we have enforced that the regression model fulfills the auto-calibration property by applying an isotonic recalibration to the ranks of a fitted (first) regression model. This isotonic recalibration does not involve any hyperparameters, but it solely assumes that the ranks from the first regression model are (approximately) correct. Isotonic regression has the property that the complexity of the resulting (non-parametric) regression function is small in low signal-to-noise ratio problems. Benefiting from this property, we have shown that this leads to explainable regression functions because a low complexity is equivalent to a coarse partition of the covariate space. In insurance pricing problems, this is particularly useful as we typically face a low signal-to-noise ratio in insurance claims data. We can then fit a complex (algorithmic) model to that data in a first step, and in a subsequent step we propose to auto-calibrate the first regression function using isotonic recalibration, which also leads to a substantial simplification of the regression function.

Acknowledgements

We would like to thank the editor and the referees for giving us very useful remarks that have helped us to improve this manuscript. The second author would like to thank Alexander Henzi for pointing out related results on the complexity number of isotonic regression.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 26, 641–647.
  • Balabdaoui, F., Durot, C. & Jankowski, H. (2019). Least squares estimation in the monotone single index model. Bernoulli 25, 3276–3310.
  • Barlow, R. E., Bartholomew, D. J., Brenner, J. M. & Brunk, H. D. (1972). Statistical inference under order restrictions. Wiley.
  • Barlow, R. E. & Brunk, H. D. (1972). The isotonic regression problem and its dual. Journal of the American Statistical Association 67(337), 140–147.
  • Blier-Wong, C., Lamontagne, L. & Marceau, E. (2022). A representation-learning approach for insurance pricing with images. Insurance Data Science Conference, 15–17 June 2022, Milan.
  • Brunk, H. D., Ewing, G. M. & Utz, W. R. (1957). Minimizing integrals in certain classes of monotone functions. Pacific Journal of Mathematics 7, 833–847.
  • Busing, F. M. T. A. (2022). Monotone regression: a simple and fast O(n) PAVA implementation. Journal of Statistical Software 102, 1–25.
  • de Leeuw, J., Hornik, K. & Mair, P. (2009). Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software 32(5), 1–24.
  • Denuit, M., Charpentier, A. & Trufin, J. (2021). Autocalibration and Tweedie-dominance for insurance pricing with machine learning. Insurance: Mathematics & Economics 101, 485–497.
  • Dimitriadis, T., Dümbgen, L., Henzi, A., Puke, M. & Ziegel, J. (2022). Honest calibration assessment for binary outcome predictions. arXiv:2203.04065.
  • Dutang, C. & Charpentier, A. (2018). CASdatasets R Package Vignette. Reference Manual. Version 1.0-8, packaged 2018-05-20.
  • Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association 106(494), 746–762.
  • Gneiting, T. & Resin, J. (2022). Regression diagnostics meets forecast evaluation: conditional calibration, reliability diagrams and coefficient of determination. arXiv:2108.03210v3.
  • Henzi, A., Kleger, G.-R. & Ziegel, J. F. (2023). Distributional (single) index models. Journal of the American Statistical Association 118(541), 489–503.
  • Henzi, A., Ziegel, J. F. & Gneiting, T. (2021). Isotonic distributional regression. Journal of the Royal Statistical Society Series B: Statistical Methodology 83, 963–993.
  • Hosmer, D. W. & Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods 9, 1043–1069.
  • Karush, W. (1939). Minima of functions of several variables with inequalities as side constraints [MSc thesis]. Department of Mathematics, University of Chicago.
  • Kingma, D. P. & Welling, M. (2019). An introduction to variational autoencoders. Foundations and Trends in Machine Learning 12(4), 307–392.
  • Krüger, F. & Ziegel, J. F. (2021). Generic conditions for forecast dominance. Journal of Business & Economics Statistics 39(4), 972–983.
  • Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika 29, 115–129.
  • Kuhn, H. W. & Tucker, A. W. (1951). Nonlinear programming. Proceedings of 2nd Berkeley Symposium. University of California Press. P. 481–492.
  • Lindholm, M., Lindskog, F. & Palmquist, J. (2023). Local bias adjustment, duration-weighted probabilities, and automatic construction of tariff cells. Scandinavian Actuarial Journal, published online: 14 Feb 2023.
  • Loader, C. (1999). Local regression and likelihood. Springer.
  • Lundberg, S. M. & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30. Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.). Curran Associates. P. 4765–4774.
  • Mayer, M., Meier, D. & Wüthrich, M. V. (2023). SHAP for actuaries: explain any model. SSRN Manuscript ID 4389797.
  • Menon, A. K., Jiang, X., Vembu, S., Elkan, C. & Ohno-Machado, L. (2012). Predicting accurate probabilities with ranking loss. ICML'12: Proceedings of the 29th International Conference on Machine Learning. Omnipress. P. 659–666.
  • Meyer, M. & Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Annals of Statistics 28(4), 1083–1104.
  • Miles, R. E. (1959). The complete amalgamation into blocks, by weighted means, of a finite set of real numbers. Biometrika 46, 317–327.
  • Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology 12(4), 595–600.
  • Peiris, H., Jeong, H. & Kim, J.-K. (2023). Integration of traditional and telematics data for efficient insurance claims prediction. SSRN Manuscript ID 4344952.
  • Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association 66(336), 783–801.
  • Schervish, M. J. (1989). A general method of comparing probability assessors. The Annals of Statistics 17(4), 1856–1879.
  • Tasche, D. (2021). Calibrating sufficiently. Statistics: A Journal of Theoretical and Applied Statistics 55(6), 1356–1386.
  • Therneau, T. M. & Atkinson, E. J. (2022). An introduction to recursive partitioning using the RPART routines. R Vignettes, version of October 21, 2022. Rochester: Mayo Foundation.
  • Tibshirani, R. J., Hoefling, H. & Tibshirani, R. (2011). Nearly-isotonic regression. Technometrics 53(1), 54–61.
  • Wüthrich, M. V. (2023). Model selection with Gini indices under auto-calibration. European Actuarial Journal 13(1), 469–477.
  • Wüthrich, M. V. & Merz, M. (2023). Statistical foundations of actuarial learning and its applications. Springer Actuarial.
  • Zadrozny, B. & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. P. 694–699.

Appendix

A.1. Pool adjacent violators algorithm

Minimization problem (2.2) is a quadratic optimization problem with linear side constraints, and it can be solved with the method of Karush–Kuhn–Tucker (KKT) (Karush 1939, Kuhn & Tucker 1951). We therefore consider the Lagrangian
$$\mathcal{L}(\mu,\eta) = (y-\mu)^\top W (y-\mu) - \eta^\top A \mu,$$
with Lagrange multiplier $\eta \in \mathbb{R}^{n-1}$. The KKT conditions are given by
$$0 = \nabla_\mu \mathcal{L}(\mu,\eta) = -2W(y-\mu) - A^\top \eta, \tag{A1}$$
$$0 \ge \nabla_\eta \mathcal{L}(\mu,\eta) = -A\mu, \tag{A2}$$
$$0 \le \eta, \tag{A3}$$
$$0 = \big(\eta_1(\mu_1-\mu_2), \ldots, \eta_{n-1}(\mu_{n-1}-\mu_n)\big)^\top. \tag{A4}$$
The solution to these KKT conditions (A1)–(A4) provides the isotonic estimate $\widehat{\mu}$, and it can be found by the PAV algorithm. The main idea is to compare the raw estimates $(\widetilde{\mu}_i)_i$: any adjacent pair with $\widetilde{\mu}_i > \widetilde{\mu}_{i+1}$ violates the monotonicity constraint. Such pairs are recursively merged (pooled) into blocks that receive an identical estimate, and iterating this pooling of adjacent pairs and blocks that violate the monotonicity constraint yields the PAV algorithm.
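For readers who want to verify these optimality conditions numerically, the following minimal Python sketch (our illustration, not part of the derivation above) fits a weighted isotonic regression with scikit-learn's IsotonicRegression and recovers the Lagrange multiplier $\eta$ implied by (A1) as a cumulative sum of weighted residuals; the toy data, the assumed sign convention for the difference matrix $A$, and the tolerance are illustrative choices.

```python
# Minimal numerical check of the KKT conditions (A1)-(A4); illustrative sketch only.
# We assume A is the (n-1) x n difference matrix with (A mu)_k = mu_{k+1} - mu_k,
# so that (A1) yields eta_i = 2 * sum_{j <= i} w_j (y_j - mu_j).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 20
y = rng.normal(size=n)                      # toy responses (assumption)
w = rng.uniform(0.5, 2.0, size=n)           # positive case weights (assumption)

mu = IsotonicRegression().fit_transform(np.arange(n), y, sample_weight=w)
eta = 2.0 * np.cumsum(w * (y - mu))[:-1]    # Lagrange multiplier implied by (A1)

tol = 1e-8
print(np.all(np.diff(mu) >= -tol))                            # (A2): monotonicity
print(np.all(eta >= -tol))                                    # (A3): dual feasibility
print(np.allclose(eta * (mu[:-1] - mu[1:]), 0.0, atol=tol))   # (A4): complementary slackness
print(abs(np.sum(w * (y - mu))) < tol)                        # (A1): last component requires eta_n = 0
```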

Pool Adjacent Violators (PAV) Algorithm

(0) Initialize the algorithm with $\widehat{\mu}^{(0)} = y$ and define the blocks $I_k^{(0)} = \{k\}$ for $k = 1, \ldots, K^{(0)} = n$.

(1) Iterate for $t \ge 0$:

  (1a) If $\widehat{\mu}^{(t)}$ fulfills the KKT condition (A2), go to step (2); otherwise go to the next step (1b).

  (1b) Select an index $i \in \{1, \ldots, n-1\}$ with $\widehat{\mu}^{(t)}_i > \widehat{\mu}^{(t)}_{i+1}$, merge the two adjacent blocks with $i \in I_k^{(t)}$ and $i+1 \in I_{k+1}^{(t)}$, and leave all other blocks unchanged. This provides the new blocks $I_k^{(t+1)}$ with $k = 1, \ldots, K^{(t+1)} = K^{(t)} - 1$.

  (1c) Set on each block $k = 1, \ldots, K^{(t+1)}$ and for all indices $i \in I_k^{(t+1)}$ the new estimates
  $$\widehat{\mu}^{(t+1)}_i = \frac{1}{\sum_{j \in I_k^{(t+1)}} w_j} \sum_{j \in I_k^{(t+1)}} w_j \, y_j. \tag{A5}$$

  (1d) Increase $t \mapsto t+1$ and return to step (1a).

(2) Set the isotonic regression estimate $\widehat{\mu} = \widehat{\mu}^{(t)}$ and merge adjacent blocks $I_k^{(t)}$ and $I_{k+1}^{(t)}$ if we have the same estimates $\widehat{\mu}_i$ on these blocks. Return the resulting partition of $\mathcal{I}$, denoted by $(I_k)_{k=1,\ldots,K}$, and $\widehat{\mu}$.
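The following Python sketch mirrors steps (0)–(2) above; the function name pav and the naive left-to-right scan are our illustrative choices rather than a reference implementation (an O(n) stack-based variant, as in Busing 2022, would be preferable for large n).

```python
import numpy as np

def pav(y, w=None):
    """Weighted isotonic (non-decreasing) least-squares fit by pooling adjacent
    violators; returns the fitted values and the final partition into maximal blocks."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)

    # Step (0): every index forms its own block, initialized at the raw responses.
    blocks = [[i] for i in range(len(y))]
    means = list(y)

    # Step (1): pool adjacent blocks as long as the monotonicity (A2) is violated.
    k = 0
    while k < len(blocks) - 1:
        if means[k] > means[k + 1]:                               # violating pair, step (1b)
            blocks[k] = blocks[k] + blocks.pop(k + 1)
            means.pop(k + 1)
            idx = blocks[k]
            means[k] = np.sum(w[idx] * y[idx]) / np.sum(w[idx])   # weighted mean (A5), step (1c)
            k = max(k - 1, 0)              # a merge may create a new violation to the left
        else:
            k += 1

    # Step (2): merge adjacent blocks with identical estimates to obtain maximal blocks.
    final_blocks, final_means = [blocks[0]], [means[0]]
    for b, m in zip(blocks[1:], means[1:]):
        if np.isclose(m, final_means[-1]):
            final_blocks[-1] = final_blocks[-1] + b
        else:
            final_blocks.append(b)
            final_means.append(m)

    mu_hat = np.empty_like(y)
    for b, m in zip(final_blocks, final_means):
        mu_hat[b] = m
    return mu_hat, final_blocks

# Small example: y = (1, 3, 2, 0) with unit weights is fitted by (1, 5/3, 5/3, 5/3)
# with the two maximal blocks {1} and {2, 3, 4} (0-based: [[0], [1, 2, 3]]).
print(pav([1.0, 3.0, 2.0, 0.0]))
```

The backward step k = max(k-1, 0) ensures that a merge which creates a new violation with the block to its left is pooled immediately, so that at termination the monotonicity condition (A2) holds across all adjacent blocks.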

Remark A.1 (PAV algorithm interpretation)

  1. We initialize with the unconstrained optimal solution $\widehat{\mu}^{(0)} = y$, and setting $\eta^{(0)} = 0$ ensures that the KKT conditions (A1), (A3) and (A4) are fulfilled; thus, only the monotonicity (A2) is not necessarily fulfilled.

    (a) In step (1b), we identify a pair $\widehat{\mu}^{(t)}_i > \widehat{\mu}^{(t)}_{i+1}$ that violates the monotonicity constraint (A2). This pair needs to belong to two adjacent blocks $I_k^{(t)}$ and $I_{k+1}^{(t)}$ because within blocks we have constant estimates (2.4). We merge these two adjacent blocks to $I_k^{(t+1)} = I_k^{(t)} \cup I_{k+1}^{(t)}$, which reduces the number of blocks $K^{(t)}$ by 1.

    (b) In step (1c), we set on each block the constant estimate (A5), which satisfies the monotonicity constraint (A2) within blocks, and (A4) is naturally fulfilled on this block. Conditions (A1) and (A3) are achieved by changing the Lagrange parameter $\eta^{(t)} \mapsto \eta^{(t+1)} \ge 0$ correspondingly, to account for the change in the mean estimates (A5) in (A1).

    (c) On a sample of size $n$, this algorithm can be iterated at most $n-1$ times; thus, the algorithm terminates.

  2. Since, upon termination, we have $\widehat{\mu}^{(t)}_i \le \widehat{\mu}^{(t)}_{i+1}$ for $i \in I_k^{(t)}$ and $i+1 \in I_{k+1}^{(t)}$, the last step (2) ensures that the resulting blocks are maximal by merging adjacent blocks on which we do not have a strict inequality in the corresponding estimates; see the short example after this remark.
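To illustrate item 2, consider the toy vector $y = (1, 2, 0)$ with unit weights: step (1) only pools the second and third observations (both receive the weighted mean 1), which leaves two adjacent blocks with the identical estimate 1, and only the final step (2) merges them into one maximal block. A short check with the pav sketch of Appendix A.1 (an illustrative helper, assumed to be in scope, not the authors' code) reads as follows.

```python
# Step (1) pools the second and third observations of y = (1, 2, 0) into the weighted
# mean 1; the first block then carries the identical estimate 1, and only step (2)
# merges everything into one maximal block.
mu_hat, blocks = pav([1.0, 2.0, 0.0])
print(mu_hat)   # [1. 1. 1.]
print(blocks)   # [[0, 1, 2]]
```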

A.2. Proof of Theorem 2.3

Proof.

For given responses $y = Y_\sigma(\omega)$, the solution to (2.2) gives the partition (2.3) of the index set $\mathcal{I}$ with empirical weighted averages (2.4) on the blocks $I_k$. These empirical weighted averages satisfy $\widehat{\mu}_{i_k} < \widehat{\mu}_{i_{k+1}}$, for representatives $i_k \in I_k$ and $i_{k+1} \in I_{k+1}$, for all $k = 1, \ldots, K(Y_\sigma) - 1$, because the blocks $I_k$ have been chosen maximal. We now consider how these blocks are constructed in the PAV algorithm. Suppose that we are in iteration $t \ge 0$, and in this iteration of the PAV algorithm, we merge the two adjacent blocks $I_k^{(t)}$ and $I_{k+1}^{(t)}$ because $\widehat{\mu}^{(t)}_i > \widehat{\mu}^{(t)}_{i+1}$ for $i \in I_k^{(t)}$ and $i+1 \in I_{k+1}^{(t)}$. We analyze this inequality
$$\frac{1}{\sum_{j \in I_k^{(t)}} w_j} \sum_{j \in I_k^{(t)}} w_j y_j \;=\; \widehat{\mu}^{(t)}_i \;>\; \widehat{\mu}^{(t)}_{i+1} \;=\; \frac{1}{\sum_{l \in I_{k+1}^{(t)}} w_l} \sum_{l \in I_{k+1}^{(t)}} w_l y_l.$$
We use the location-scale structure (2.4), which gives us the equivalent condition
$$\frac{1}{\sum_{j \in I_k^{(t)}} w_j} \sum_{j \in I_k^{(t)}} w_j \big(\mu_j + \sigma \epsilon_j(\omega)\big) \;>\; \frac{1}{\sum_{l \in I_{k+1}^{(t)}} w_l} \sum_{l \in I_{k+1}^{(t)}} w_l \big(\mu_l + \sigma \epsilon_l(\omega)\big).$$
Since for any indices $j \in I_k^{(t)}$ and $l \in I_{k+1}^{(t)}$ we have $\mu_j \le \mu_l$, it follows that the previous condition for merging the two adjacent blocks $I_k^{(t)}$ and $I_{k+1}^{(t)}$ in iteration $t$ of the PAV algorithm reads as
$$\sigma \left[ \frac{\sum_{j \in I_k^{(t)}} w_j \epsilon_j(\omega)}{\sum_{j \in I_k^{(t)}} w_j} - \frac{\sum_{l \in I_{k+1}^{(t)}} w_l \epsilon_l(\omega)}{\sum_{l \in I_{k+1}^{(t)}} w_l} \right] \;>\; \frac{\sum_{l \in I_{k+1}^{(t)}} w_l \mu_l}{\sum_{l \in I_{k+1}^{(t)}} w_l} - \frac{\sum_{j \in I_k^{(t)}} w_j \mu_j}{\sum_{j \in I_k^{(t)}} w_j} \;\ge\; 0. \tag{A6}$$
The important observation is that if this condition is fulfilled for the scale parameter $\sigma > 0$, then it is also fulfilled for any bigger scale parameter $\sigma' > \sigma > 0$ (pointwise in $\omega \in \Omega$). Thus, any pooling that happens for $\sigma$ also happens for $\sigma' > \sigma > 0$. Since this holds pointwise on the underlying probability space $(\Omega, \mathcal{F}, P)$, it shows that $E[K(Y_\sigma)]$ is decreasing in $\sigma > 0$.

Suppose now that the distribution of $\epsilon$ has full support on $\mathbb{R}^n$. Then, the event $A_\sigma = \{K(Y_\sigma) = n\}$ occurs with positive probability, i.e.
$$0 < P[A_\sigma] = P[K(Y_\sigma) = n] = P[Y_1 < Y_2 < \cdots < Y_n] = P[\mu_1 + \sigma\epsilon_1 < \mu_2 + \sigma\epsilon_2 < \cdots < \mu_n + \sigma\epsilon_n].$$
Consider
$$A_\sigma = \{\mu_1 + \sigma\epsilon_1 < \mu_2 + \sigma\epsilon_2 < \cdots < \mu_n + \sigma\epsilon_n\} = \bigcap_{k=2}^{n} \{\mu_{k-1} + \sigma\epsilon_{k-1} < \mu_k + \sigma\epsilon_k\} = \{\sigma(\epsilon_1 - \epsilon_2) < \mu_2 - \mu_1\} \cap \bigcap_{k=3}^{n} \{\mu_{k-1} + \sigma\epsilon_{k-1} < \mu_k + \sigma\epsilon_k\}.$$
We focus on the first event on the right-hand side. Note that $\mu_2 - \mu_1 > 0$, hence
$$\widetilde{A}_\sigma = \left\{\varepsilon \in \mathbb{R}^n:\; \varepsilon_1 - \varepsilon_2 < \frac{\mu_2 - \mu_1}{\sigma}\right\}$$
describes an open half-space in $\mathbb{R}^n$ containing the origin, whose bounding hyperplane moves further away from the origin when $\sigma$ decreases. Overall, the set $\widetilde{A}_\sigma \subset \mathbb{R}^n$ of all values of $\varepsilon$ for which $A_\sigma$ occurs is a non-empty open polyhedron containing the origin that scales with $\sigma$, that is, $\widetilde{A}_\sigma = (1/\sigma)\widetilde{A}_1$. Therefore, since the distribution of $\epsilon$ has full support, the probability $P[A_\sigma]$ is strictly decreasing in $\sigma$.
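The monotonicity statements of Theorem 2.3 can be visualized with a small Monte Carlo experiment; the following Python sketch (our illustration, with an arbitrary increasing mean vector, unit weights and Gaussian noise as assumptions) estimates $E[K(Y_\sigma)]$ and $P[K(Y_\sigma) = n]$ over a grid of scale parameters $\sigma$, and both estimates are expected to decrease as $\sigma$ grows.

```python
# Monte Carlo illustration of Theorem 2.3: both E[K(Y_sigma)] and P[K(Y_sigma) = n]
# decrease in the noise scale sigma (toy setup with unit weights and Gaussian noise).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2023)
n, n_sim = 20, 2000
mu = np.sort(rng.uniform(0.0, 1.0, size=n))    # increasing true means (assumption)
x = np.arange(n)

def block_statistics(sigma):
    """Monte Carlo estimates of E[K(Y_sigma)] and P[K(Y_sigma) = n]."""
    ks = np.empty(n_sim)
    for s in range(n_sim):
        y = mu + sigma * rng.normal(size=n)    # location-scale responses
        fit = IsotonicRegression().fit_transform(x, y)
        ks[s] = len(np.unique(fit))            # number of maximal blocks K
    return ks.mean(), np.mean(ks == n)

for sigma in (0.01, 0.1, 0.5, 1.0, 5.0):
    mean_k, prob_full = block_statistics(sigma)
    print(f"sigma = {sigma:>5}: E[K] ~ {mean_k:5.2f}, P[K = n] ~ {prob_full:.3f}")
```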