The balance property in neural network modelling


Abstract

In estimation and prediction theory, considerable attention is paid to the question of having unbiased estimators on a global population level. Recent developments in neural network modelling have mainly focused on accuracy on a granular sample level, and the question of unbiasedness on the population level has been almost completely neglected by that community. We discuss this question within neural network regression models, and we provide methods for obtaining unbiased estimators for these models on the global population level.


1. Introduction

In recent years, neural networks have become state-of-the-art in all kinds of classification and regression problems. Snapshots of their history and their success are given in LeCun et al. (2015) and Schmidhuber (2015). Their popularity rests largely on two facts: they offer much more modelling flexibility than classical statistical regression models (such as generalised linear models), and increasing computational power combined with effective training methods has become available, see Rumelhart et al. (1986). Neural networks outperform many classical statistical approaches in terms of predictive performance on an individual sample level, they allow unstructured data such as texts to be included in the regression models, see Lee et al. (2020) for a word embedding example, and they allow rather unconventional regression problems to be solved, see Cheng et al. (2020) and Gabrielli (2020) for examples. Therefore, our community has gradually been shifting from a data modelling culture to an algorithmic modelling culture; we refer the reader to Breiman (2001) and Shmueli (2010).

A question that is often neglected in neural network modelling is their average predictive performance on the global population level, in particular, their unbiasedness on the global population level. In insurance, this latter property is implied by the so-called balance property, see Theorem 4.5 in Bühlmann and Gisler (2005). The balance property is highly relevant in financial applications. Think of an insurance portfolio consisting of individual insurance policies. Granular regression models may provide excellent predictions on an individual policy level (sample level); however, the global price level may be completely misspecified, since adding up numerous small errors may still result in a big error on the global portfolio level (population level). Unfortunately, many models suffer from this deficiency if one does not pay sufficient attention to the balance property during model training. The purpose of this essay is to explore and improve this point. For illustrative purposes, we restrict ourselves to a binary classification problem and to a feedforward neural network (FNN). However, the results can easily be extended and adapted to other regression problems and models, i.e. they hold in much greater generality.

The rest of this paper is structured as follows. In the next section, we review classical logistic regression modelling, which forms the core of our understanding of the balance property. In Section 3, we review FNNs, with special emphasis on parameter regularisation via early stopping of gradient-descent algorithms, because this is the crucial issue that causes the problems in FNN model fitting. In Section 4, we discuss two different approaches that help us to resolve the bias problem. The first one is based on the classical logistic regression approach discussed in Section 2; the second one uses regularisation in combination with shrinkage. The latter approach also motivates regularising neural networks with classification and regression tree models. In Section 5, we give an example that shows the relevance of these considerations. Section 6 concludes.

2. Logistic regression

To discuss the issue of the balance property and to provide possible solutions for this issue, we start from classical logistic regression, see Cox (1958). Assume we have data $D = \{(x_{i}, Y_{i})\}_{i = 1}^{n}$, where $n \in N$ is the sample size of the data, and where $(x_{i}, Y_{i})$ are the individual samples with $x_{i} \in R^{q}$ describing the covariates and $Y_{i} \in \{0, 1\}$ describing the label of sample $i$. In logistic regression, we assume that the labels of these samples have been drawn independently from Bernoulli distributions having logistic success probabilities

$x \in R^{q} \mapsto p (x) = σ \left( w_{0} + \sum_{j = 1}^{q} w_{j} x_{j} \right) \in (0, 1),$ (1)

with logistic function $σ (t) = (1 + e^{- t})^{- 1}$, weights $w_{j} \in R$, $j = 1, \dots, q$, and intercept $w_{0} \in R$. In machine learning, the logistic function is called the sigmoid function.

Under these assumptions, we can fit the weights $w = (w_{0}, \dots, w_{q})^{'} \in R^{q + 1}$ to the given data $D$ using maximum likelihood estimation (MLE); we refer the reader to McCullagh and Nelder (1983). The corresponding log-likelihood function is given by

$w \mapsto ℓ (w; D) = \sum_{i = 1}^{n} \left[ Y_{i} \log p (x_{i}) + (1 - Y_{i}) \log (1 - p (x_{i})) \right] .$ (2)

This log-likelihood function is concave in $w$ and, therefore, we find a unique MLE $\hat{w}^{\mathrm{MLE}} \in R^{q + 1}$ for $w$ (under the additional assumption that the corresponding design matrix $X = \left( (1, x_{1}^{'})^{'}, \dots, (1, x_{n}^{'})^{'} \right)^{'} \in R^{n \times (q + 1)}$ has full rank $q + 1 \leq n$). This MLE $\hat{w}^{\mathrm{MLE}}$ is a critical point of the log-likelihood function $w \mapsto ℓ (w; D)$, that is,

$\left. \frac{\partial}{\partial w} ℓ (w; D) \right|_{w = \hat{w}^{\mathrm{MLE}}} = 0 .$ (3)

This implies the following identity (by considering (3) with respect to the intercept component $w_{0}$)

$\frac{1}{n} \sum_{i = 1}^{n} Y_{i} = \frac{1}{n} \sum_{i = 1}^{n} \hat{p}^{\mathrm{MLE}} (x_{i}) = \frac{1}{n} \sum_{i = 1}^{n} σ \left( \hat{w}_{0}^{\mathrm{MLE}} + \sum_{j = 1}^{q} \hat{w}_{j}^{\mathrm{MLE}} x_{i, j} \right),$ (4)

if we use the estimates $\hat{w}^{\mathrm{MLE}}$ for $w$ in the logistic success probabilities $p (x)$ in (1). Identity (4) is the aforementioned balance property. Namely, setting the estimate $\hat{w}_{0}^{\mathrm{MLE}}$ for the intercept $w_{0}$ correctly in (4) provides us with unbiasedness on the population level

$\frac{1}{n} E \left[ \sum_{i = 1}^{n} \hat{p}^{\mathrm{MLE}} (x_{i}) \right] = \frac{1}{n} E \left[ \sum_{i = 1}^{n} Y_{i} \right] = \frac{1}{n} \sum_{i = 1}^{n} p (x_{i}),$ (5)

where we assume that the labels $Y_{i}$ are independent and Bernoulli distributed with success probabilities $p (x_{i})$, for $i = 1, \dots, n$. This is the crucial global unbiasedness property. It tells us, for instance in financial applications, that the global price level has been set accurately (on average). We can even quantify the estimation uncertainty on the global level: since (4) identifies the average of the estimated probabilities with the average of the independent labels, we have, similarly to (5),

$Var \left( \frac{1}{n} \sum_{i = 1}^{n} \hat{p}^{\mathrm{MLE}} (x_{i}) \right) = Var \left( \frac{1}{n} \sum_{i = 1}^{n} Y_{i} \right) = \frac{1}{n^{2}} \sum_{i = 1}^{n} p (x_{i}) (1 - p (x_{i})) .$
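
The step from the critical point condition (3) to the balance property (4) can be written out explicitly. The following short derivation uses only the definitions above together with the identity $σ^{'} (t) = σ (t) (1 - σ (t))$, which gives $\partial p (x_{i}) / \partial w_{0} = p (x_{i}) (1 - p (x_{i}))$:

```latex
\begin{aligned}
\frac{\partial}{\partial w_{0}} \ell (w; D)
&= \sum_{i = 1}^{n} \left[ \frac{Y_{i}}{p (x_{i})} - \frac{1 - Y_{i}}{1 - p (x_{i})} \right]
   \frac{\partial p (x_{i})}{\partial w_{0}} \\
&= \sum_{i = 1}^{n} \left[ Y_{i} (1 - p (x_{i})) - (1 - Y_{i}) p (x_{i}) \right] \\
&= \sum_{i = 1}^{n} \left( Y_{i} - p (x_{i}) \right) .
\end{aligned}
```

Setting this expression to zero at $w = \hat{w}^{\mathrm{MLE}}$ and dividing by $n$ gives exactly identity (4).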

Remark 2.1

•	The balance property (4) is satisfied for any generalised linear model (GLM) within the exponential dispersion family (EDF) under the choice of the canonical link function, see Section 2.2 in Nelder and Wedderburn (1972).
•	Notably, the balance property (4) does not assume that the model has been correctly specified. That is, even if we work with a completely wrong GLM for the responses $Y_{i}$, identity (4) holds and we have unbiasedness of the estimated GLM on the portfolio level, see (5).
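
The second remark can be checked numerically. The sketch below is only an illustration under stated assumptions: the toy data, the one-covariate model and the helper `fit_logistic_1d` are hypothetical and not part of the paper. Labels are simulated from a deliberately non-logistic probability, a logistic regression is fitted by Newton's method, and the two averages in (4) are compared.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic_1d(xs, ys, n_iter=50):
    """Newton's method for p(x) = sigmoid(w0 + w1 * x); returns (w0, w1)."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w0 + w1 * x)
            v = p * (1.0 - p)
            g0 += y - p            # score w.r.t. the intercept w0
            g1 += (y - p) * x      # score w.r.t. the weight w1
            h00 += v               # entries of the negative Hessian
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton update w <- w + H^{-1} g with the explicit 2x2 inverse
        w0 += (h11 * g0 - h01 * g1) / det
        w1 += (h00 * g1 - h01 * g0) / det
    return w0, w1


# Hypothetical toy data from a NON-logistic model: P[Y = 1 | x] = 0.02 + 0.10 * x^2
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [1 if rng.random() < 0.02 + 0.10 * x * x else 0 for x in xs]

w0, w1 = fit_logistic_1d(xs, ys)
mean_y = sum(ys) / len(ys)
mean_p = sum(sigmoid(w0 + w1 * x) for x in xs) / len(xs)
print(mean_y, mean_p)  # the two averages agree up to rounding error
```

The fitted model is wrong as a description of the individual probabilities, yet the empirical average and the model average coincide because the intercept score is zero at the MLE.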

3. Neural network regressions and early stopping

An FNN provides a generalisation of the logistic regression probabilities introduced in (1). Denote the FNN map that maps the covariates $x \in R^{q}$ (non-linearly) to the last hidden layer of the FNN by

$\begin{aligned} f^{θ} : R^{q} & \to R^{d}, \\ x & \mapsto f^{θ} (x) = (f_{1}^{θ} (x), \dots, f_{d}^{θ} (x))^{'}, \end{aligned}$

where $d \in N$ is the dimension of the last hidden layer of the FNN. The choice of this map $f^{θ}$ involves the choices of the network architecture, the nonlinear activation function, etc.; for details we refer the reader to Goodfellow et al. (2016) and to Section 5.1.1 in Wüthrich and Buser (2016). In particular, the FNN map $f^{θ}$ corresponds to formula (5.5) in Wüthrich and Buser (2016). FNNs are universal approximators, which means that the family of FNNs is dense in the class of compactly supported continuous functions (if we choose a discriminatory activation function), see Cybenko (1989) and Hornik et al. (1989) for precise statements and corresponding proofs. This explains why FNNs provide much greater modelling flexibility than GLMs; in fact, a GLM can be embedded into an FNN as highlighted in Wüthrich and Merz (2019).

Each FNN map $f^{θ}$ involves a corresponding network parameter $θ$ (collecting all weights and intercepts in the hidden layers) and, for simplicity, we assume that $f^{θ}$ is differentiable with respect to $θ$. This motivates the definition of the FNN regression probabilities (compare with (1))

$x \in R^{q} \mapsto p (x) = σ \left( w_{0} + \sum_{j = 1}^{d} w_{j} f_{j}^{θ} (x) \right) \in (0, 1),$ (6)

for output intercept and weights $w = (w_{0}, \dots, w_{d})^{'} \in R^{d + 1}$. We observe that (1) and (6) have the same structural form, but the original covariates $x \in R^{q}$ are replaced by (new) features $f^{θ} (x) \in R^{d}$. This can be interpreted as the original covariates having been pre-processed by the FNN, or as the FNN performing representation learning.
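
The structural similarity of (1) and (6) can be made concrete in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes hyperbolic tangent activations in all hidden layers, and the function names and toy parameter values are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dense_tanh(z, weights, biases):
    """One hidden layer: output_k = tanh(b_k + sum_j W[k][j] * z_j)."""
    return [math.tanh(b + sum(w * zj for w, zj in zip(row, z)))
            for row, b in zip(weights, biases)]


def fnn_features(x, theta):
    """Map x in R^q to the last hidden layer f^theta(x) in R^d.

    theta is a list of (weights, biases) pairs, one pair per hidden layer.
    """
    z = list(x)
    for weights, biases in theta:
        z = dense_tanh(z, weights, biases)
    return z


def fnn_probability(x, w, theta):
    """Formula (6): p(x) = sigmoid(w_0 + sum_j w_j * f_j^theta(x))."""
    f = fnn_features(x, theta)
    return sigmoid(w[0] + sum(wj * fj for wj, fj in zip(w[1:], f)))


# Hypothetical toy network: q = 2 covariates, hidden layers with 3 and d = 2 neurons
theta = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.2], [0.7, 0.1, -0.5]], [0.05, -0.05]),
]
w = [-3.0, 1.2, -0.7]
p = fnn_probability([1.0, 2.0], w, theta)
print(p)
```

With an empty list of hidden layers the feature map is the identity, and `fnn_probability` reduces to the logistic regression probability (1).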

Recently, a lot of effort has been put into the development of efficient fitting algorithms for these FNNs. Most calibration methods use variants of the stochastic gradient descent (SGD) algorithm in combination with back-propagation for gradient calculations, see Rumelhart et al. (1986) and Goodfellow et al. (2016). The plain-vanilla SGD algorithm locally improves the parameter $(w, θ)$ step by step with respect to the chosen loss function (objective function) by considering the corresponding gradients, see Chapters 6 and 8 of Goodfellow et al. (2016). In our considerations, the canonical loss function is given by the deviance loss, which corresponds in the Bernoulli model to twice the average negative log-likelihood function, see also (2),

$(w, θ) \mapsto L (w, θ; D) = - \frac{2}{n} ℓ (w, θ; D) .$ (7)

For deviance losses, we refer to Section 2.3 in McCullagh and Nelder (1983).
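
As a small illustration of (7), the Bernoulli deviance loss can be computed directly from labels and predicted probabilities. This is a sketch with a hypothetical function name; it assumes all probabilities lie strictly between 0 and 1.

```python
import math


def deviance_loss(ys, ps):
    """Bernoulli deviance loss (7): minus twice the average log-likelihood (2)."""
    n = len(ys)
    loglik = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                 for y, p in zip(ys, ps))
    return -2.0 * loglik / n


# Two samples, each predicted with probability 0.5: the loss equals 2 * log(2)
example_loss = deviance_loss([0, 1], [0.5, 0.5])
print(example_loss)
```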

The SGD algorithm calibrates the parameter $(w, θ)$ adaptively by locally decreasing loss (7) step by step. In order to prevent the model from in-sample over-fitting, typically, an early stopping rule is exercised, see C. Wang et al. (1994). This early stopping rule is seen as a regularisation method, see Section 7.8 in Goodfellow et al. (2016).

It is exactly this early stopping rule that causes the failure of the balance property (4). Early stopping implies that we are not in a critical point of the loss function $w \mapsto L (w, θ; D)$, see also (3). Therefore, an identity similar to (4) fails to hold.
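
This effect can be reproduced on a toy problem. The sketch below is an illustration under simplifying assumptions that are ours, not the paper's: it uses a plain logistic regression instead of an FNN, deterministic full-batch gradient descent instead of SGD, and hypothetical data with an empirical average of 5%. It reports the balance gap, i.e. the model average minus the empirical average, after few and after many gradient steps.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def balance_gap_after(n_steps, xs, ys, lr=1.0):
    """Run full-batch gradient descent on the deviance loss of
    p(x) = sigmoid(w0 + w1 * x), starting in (0, 0), and return
    mean(p_hat) - mean(Y) after n_steps."""
    n = len(ys)
    w0, w1 = 0.0, 0.0
    for _ in range(n_steps):
        ps = [sigmoid(w0 + w1 * x) for x in xs]
        # gradient of L = -(2/n) * log-likelihood
        g0 = 2.0 / n * sum(p - y for p, y in zip(ps, ys))
        g1 = 2.0 / n * sum((p - y) * x for p, y, x in zip(ps, ys, xs))
        w0 -= lr * g0
        w1 -= lr * g1
    ps = [sigmoid(w0 + w1 * x) for x in xs]
    return sum(ps) / n - sum(ys) / n


# Hypothetical toy data: 100 samples, 5 of them with label 1
xs = [i / 99.0 - 0.5 for i in range(100)]
ys = [1 if i % 20 == 0 else 0 for i in range(100)]

gap_early = balance_gap_after(5, xs, ys)     # stopped early
gap_late = balance_gap_after(5000, xs, ys)   # run (almost) to convergence
print(gap_early, gap_late)
```

The gap is proportional to the intercept component of the gradient, so it only vanishes once the algorithm has reached a critical point.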

4. Global bias regularisation

4.1. Logistic regression regularisation

A simple way to achieve the balance property (4) is to add an additional logistic regression step to the early stopped SGD calibration. Denote the early stopped SGD calibration by $(\hat{w}^{\mathrm{SGD}}, \hat{θ}^{\mathrm{SGD}})$. This provides us with estimated success probabilities

$x \mapsto \hat{p}^{\mathrm{SGD}} (x) = σ \left( \hat{w}_{0}^{\mathrm{SGD}} + \sum_{j = 1}^{d} \hat{w}_{j}^{\mathrm{SGD}} f_{j}^{\hat{θ}^{\mathrm{SGD}}} (x) \right),$ (8)

and with neuron activations $f^{\hat{θ}^{\mathrm{SGD}}} (x) \in R^{d}$ in the last hidden layer of the FNN, respectively. We freeze these neuron activations and use them as new covariates (inputs) in an additional logistic regression step. Therefore, we replace the original data $D$ by the working data $D^{\mathrm{SGD}} = \{(f^{\hat{θ}^{\mathrm{SGD}}} (x_{i}), Y_{i})\}_{i = 1}^{n}$, and we assume that the resulting design matrix has full rank $d + 1 \leq n$. An additional logistic regression MLE step on the working data $D^{\mathrm{SGD}}$ provides us with the (unique) MLE $\hat{w}^{\mathrm{MLE}(d)} \in R^{d + 1}$ for $w$; this step is similar to Section 2, but with dimension $q$ replaced by $d$, see also (3). Note that this MLE $\hat{w}^{\mathrm{MLE}(d)}$ improves (in-sample) the early stopped SGD estimate $\hat{w}^{\mathrm{SGD}}$ with respect to the given loss function $L (w, θ; D)$, and we obtain the new FNN parameter estimate $(\hat{w}^{\mathrm{MLE}(d)}, \hat{θ}^{\mathrm{SGD}})$.

This provides us with the GLM improved estimated success probabilities

$x \mapsto \hat{p}^{\mathrm{SGD}+} (x) = σ \left( \hat{w}_{0}^{\mathrm{MLE}(d)} + \sum_{j = 1}^{d} \hat{w}_{j}^{\mathrm{MLE}(d)} f_{j}^{\hat{θ}^{\mathrm{SGD}}} (x) \right) .$ (9)

These estimated success probabilities satisfy the balance property

$\frac{1}{n} \sum_{i = 1}^{n} Y_{i} = \frac{1}{n} \sum_{i = 1}^{n} \hat{p}^{\mathrm{SGD}+} (x_{i}),$

which is the analogue of (4) and, hence, we obtain the balance property and global unbiasedness (5), respectively.
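
The additional logistic regression step can be sketched as follows. This is a minimal illustration, not the authors' code: the frozen features, the labels and the early stopped weights `w_sgd` are hypothetical toy values, and the helpers `solve` and `refit_output_layer` are our own names. The output weights are refitted by Newton iterations (an IRLS-type scheme) while the features stay fixed.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, m))
        z[r] = (aug[r][m] - s) / aug[r][r]
    return z


def probabilities(w, features):
    return [sigmoid(w[0] + sum(wj * fj for wj, fj in zip(w[1:], f)))
            for f in features]


def refit_output_layer(features, ys, w_start, n_iter=25):
    """Newton iterations for the output weights w with frozen features."""
    w = list(w_start)
    rows = [[1.0] + list(f) for f in features]  # design matrix with intercept
    k = len(w)
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, y, p in zip(rows, ys, probabilities(w, features)):
            v = p * (1.0 - p)
            for a in range(k):
                grad[a] += (y - p) * row[a]
                for b in range(k):
                    hess[a][b] += v * row[a] * row[b]
        w = [wj + sj for wj, sj in zip(w, solve(hess, grad))]
    return w


# Hypothetical frozen last-layer activations (d = 2) and labels
features = [[math.tanh(0.1 * i - 3.0), math.tanh(math.cos(i))] for i in range(60)]
ys = [1 if i % 7 == 0 else 0 for i in range(60)]
w_sgd = [-1.0, 0.3, -0.2]  # hypothetical early stopped output weights

w_plus = refit_output_layer(features, ys, w_sgd)
mean_y = sum(ys) / len(ys)
gap_sgd = sum(probabilities(w_sgd, features)) / len(ys) - mean_y
gap_plus = sum(probabilities(w_plus, features)) / len(ys) - mean_y
print(gap_sgd, gap_plus)
```

After the refit the balance gap is zero up to rounding error, whereas the early stopped weights leave a visible gap.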

We give some remarks.

•	The neuron activations $f^{{\hat{θ}}^{S G D}} (x_{i})$ , $i = 1, \dots, n$ , in the last hidden layer of the FNN can be understood as pre-processed covariates, and we perform a logistic regression on the resulting pre-processed (working) data $D^{S G D}$ . In machine learning the transformation from $x_{i}$ to $f^{{\hat{θ}}^{S G D}} (x_{i})$ is also called representation learning.
•	The additional logistic regression step is a convex optimisation problem that can efficiently be solved by Fisher's scoring method or by the iteratively reweighted least squares (IRLS) algorithm, see Nelder and Wedderburn (1972), Green (1984) and the references therein. Alternatively, we could continue to iterate the gradient-descent algorithm restricted to the output parameter $w \in R^{d + 1}$, while keeping all other network parameters $\hat{θ}^{\mathrm{SGD}}$ frozen.
•	The additional logistic regression step is optimal with respect to the chosen objective function for the given learned representations $f^{\hat{θ}^{\mathrm{SGD}}} (x_{i})$. A simpler, though suboptimal, way of correcting the bias is to adjust only the intercept parameter estimate $\hat{w}_{0}^{\mathrm{SGD}}$.
•	The additional logistic regression step may lead to over-fitting. If this is the case, we could either exercise an earlier stopping rule or choose an FNN architecture with a low dimensional last hidden layer, i.e. with a small $d$. The latter also has a positive effect on the run-time of the additional logistic regression step. Alternatively, we could apply classical regularisation techniques such as ridge or LASSO regression to this last optimisation step. Importantly, the intercept $w_{0}$ should be excluded from regularisation/penalisation, otherwise the resulting model may not have the balance property.
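
The simpler intercept-only correction mentioned in the remarks can be sketched as a one-dimensional root search. The choice of bisection, the function names and the toy numbers are our assumptions; the text does not prescribe a particular method. We look for a shift $δ$ of the intercept such that the average of the shifted probabilities equals the empirical average.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def intercept_shift(ps, ys, lo=-20.0, hi=20.0, tol=1e-12):
    """Find delta with mean(sigmoid(logit(p_i) + delta)) = mean(y_i).

    Assumes 0 < mean(y) < 1 and that the root lies in [lo, hi]; the model
    average is increasing in delta, so bisection applies.
    """
    target = sum(ys) / len(ys)
    scores = [logit(p) for p in ps]

    def gap(delta):
        return sum(sigmoid(s + delta) for s in scores) / len(scores) - target

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical early stopped predictions that are too low on average
ps = [0.04, 0.06, 0.10, 0.03]
ys = [0, 0, 1, 0]
delta = intercept_shift(ps, ys)
ps_corrected = [sigmoid(logit(p) + delta) for p in ps]
print(delta, sum(ps_corrected) / len(ps_corrected))
```

The corrected probabilities keep the ranking of the original predictions; only the global level is moved.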

4.2. Penalty term and shrinkage regularisation

As a second regularisation approach we introduce a penalty term. Choose a tuning parameter $η > 0$ and define the penalised loss function

$L^{(η)} (w, θ; D) = L (w, θ; D) + η R (w, θ; \bar{p}, I),$ (10)

where we set $I = \{1, \dots, n\}$ for the sample indexes, and for the penalty term $R$ we choose the Kullback–Leibler (KL) divergence

$R (w, θ; \bar{p}, I) = \bar{p} \log \left( \frac{\bar{p}}{p_{I}} \right) + (1 - \bar{p}) \log \left( \frac{1 - \bar{p}}{1 - p_{I}} \right) \geq 0,$

with empirical average $\bar{p}$ and model average $p_{I}$ on $I$ given by, respectively,

$\bar{p} = \frac{1}{n} \sum_{i = 1}^{n} Y_{i} \quad \text{and} \quad p_{I} = p_{I} (w, θ) = \frac{1}{| I |} \sum_{i \in I} p (x_{i}) .$ (11)

Note that the penalty term vanishes if and only if the two averages are equal, i.e. iff $\bar{p} = p_{I}$. This implies that the penalised version (10) favours gradient-descent steps that move the model average towards the empirical average $\bar{p}$ and, hence, tend to be less biased on the population level compared to the unpenalised version.
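
The penalised loss (10) is straightforward to evaluate for given predictions. The sketch below uses hypothetical function names and assumes that the empirical average and the model average both lie strictly between 0 and 1.

```python
import math


def kl_penalty(p_bar, p_model):
    """KL divergence of Bernoulli(p_model) from Bernoulli(p_bar), as in R."""
    return (p_bar * math.log(p_bar / p_model)
            + (1.0 - p_bar) * math.log((1.0 - p_bar) / (1.0 - p_model)))


def penalised_loss(ys, ps, eta):
    """Penalised deviance loss (10) evaluated on the full sample I."""
    n = len(ys)
    deviance = -2.0 / n * sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                              for y, p in zip(ys, ps))
    p_bar = sum(ys) / n   # empirical average in (11)
    p_i = sum(ps) / n     # model average in (11)
    return deviance + eta * kl_penalty(p_bar, p_i)


# Hypothetical labels and predictions whose average (0.15) is below p_bar (0.25)
ys = [0, 0, 1, 0]
ps = [0.10, 0.10, 0.30, 0.10]
unpenalised = penalised_loss(ys, ps, eta=0.0)
penalised = penalised_loss(ys, ps, eta=5.0)
print(unpenalised, penalised)
```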

There is one issue that has not been mentioned in (10). Typically, we use SGD methods that act on randomly selected mini-batches, see Goodfellow et al. (2016). For this reason, classical SGD software cannot evaluate (10) itself, but only its counterpart on the selected mini-batch. Thus, for a mini-batch $B \subset I = \{1, \dots, n\}$ we have to replace (11) in the penalty term by

$\bar{p}_{B} = \frac{1}{| B |} \sum_{i \in B} Y_{i} \quad \text{and} \quad p_{B} = p_{B} (w, θ) = \frac{1}{| B |} \sum_{i \in B} p (x_{i}) .$

Technically, this poses no difficulty; in practical applications, however, it has turned out not to be sufficiently robust, and the penalty term did not provide the anticipated regularisation effect.

A more effective way borrows from shrinkage regularisation towards the global population level; this approach is similar to empirical Bayesian considerations. We therefore modify the penalty term for mini-batch $B \subset I$ to

$R (w, θ; \bar{p}, B) = \bar{p} \log \left( \frac{\bar{p}}{p_{B}} \right) + (1 - \bar{p}) \log \left( \frac{1 - \bar{p}}{1 - p_{B}} \right) .$ (12)

This penalty term shrinks the model averages $p_{B}$ on the selected mini-batch $B$ towards the global empirical average $\bar{p}$ and, hence, favours SGD steps that tend to be unbiased on the global population level.
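
The shrinkage effect of (12) can be read off from its derivative with respect to the mini-batch model average $p_{B}$. The following short calculation is added for illustration and uses only the definition of the penalty term:

```latex
\begin{aligned}
\frac{\partial}{\partial p_{B}} R (w, \theta; \bar{p}, B)
&= - \frac{\bar{p}}{p_{B}} + \frac{1 - \bar{p}}{1 - p_{B}} \\
&= \frac{- \bar{p} (1 - p_{B}) + (1 - \bar{p}) p_{B}}{p_{B} (1 - p_{B})} \\
&= \frac{p_{B} - \bar{p}}{p_{B} (1 - p_{B})} .
\end{aligned}
```

The derivative is positive if $p_{B} > \bar{p}$ and negative if $p_{B} < \bar{p}$, so a gradient-descent step on the penalty always moves $p_{B}$ in the direction of the global empirical average $\bar{p}$.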

4.3. Classification tree regularisation

The previous idea of shrinkage regularisation can be carried forward to classification tree regularisation; we refer to Breiman et al. (1984) for classification and regression trees. Classification trees partition the covariate space $X \subset R^{q}$ into a family $\{X_{s}\}_{s = 1}^{S}$ of homogeneous subsets, where homogeneity is quantified with a dissimilarity measure. Denote the sample indexes of the data $D$ that have covariates $x_{i} \in X_{s}$ by $T_{s} \subset I$. The MLE on each subset $X_{s}$ is given by the individual empirical average

$\bar{p}_{T_{s}} = \frac{1}{| T_{s} |} \sum_{i \in T_{s}} Y_{i} .$

The family $\{\bar{p}_{T_{s}}\}_{s = 1}^{S}$ of probabilities describes the regression tree estimator on the partition $\{X_{s}\}_{s = 1}^{S}$ of $X$.

We may now replace the homogeneous regularisation problem (10) by the regularisation implied by the regression tree. We choose tuning constants $η_{s} > 0$ and set for the penalty term

$\sum_{s = 1}^{S} η_{s} \left[ \bar{p}_{T_{s}} \log \left( \frac{\bar{p}_{T_{s}}}{p_{T_{s} \cap B}} \right) + (1 - \bar{p}_{T_{s}}) \log \left( \frac{1 - \bar{p}_{T_{s}}}{1 - p_{T_{s} \cap B}} \right) \right],$ (13)

where

$p_{T_{s} \cap B} = p_{T_{s} \cap B} (w, θ) = \frac{1}{| T_{s} \cap B |} \sum_{i \in T_{s} \cap B} p (x_{i}) .$

Regularisation term (13) can work in two ways. For very large tuning parameters $η_{s}$ we obtain a model that is regression tree-like, and we use the FNN to discriminate the samples within the leaves of the regression tree; this is more in the spirit of Quinlan (1992) and Y. Wang and Witten (1997). For smaller tuning parameters $η_{s}$ (and smaller regression trees), we use the regression tree to stabilise model averages on the tree partition $\{X_{s}\}_{s = 1}^{S}$ of the covariate space $X$, and because the regression tree has the balance property, this also helps us to get the right global level of the success probabilities.
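
Penalty term (13) can be evaluated for a mini-batch once each sample has been assigned to a leaf. The sketch below is illustrative only: the leaf assignment is assumed to be given (no tree is fitted here), the names are hypothetical, leaves without samples in the mini-batch are skipped (our assumption, since $p_{T_{s} \cap B}$ is not defined for them), and the usual convention $0 \log 0 = 0$ is used for pure leaves.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence of Bernoulli(b) from Bernoulli(a), with 0 * log(0) = 0."""
    out = 0.0
    if a > 0.0:
        out += a * math.log(a / b)
    if a < 1.0:
        out += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return out


def tree_penalty(ps, ys, leaf, batch, etas):
    """Penalty (13): ps are model probabilities, ys labels, leaf[i] the leaf
    index of sample i, batch the mini-batch indexes, etas the tuning constants."""
    total = 0.0
    for s, eta in enumerate(etas):
        t_s = [i for i in range(len(ys)) if leaf[i] == s]
        p_bar_s = sum(ys[i] for i in t_s) / len(t_s)  # leaf average on all data
        in_batch = [i for i in batch if leaf[i] == s]
        if not in_batch:
            continue  # assumption: skip leaves that are absent from the batch
        p_model = sum(ps[i] for i in in_batch) / len(in_batch)
        total += eta * kl_bernoulli(p_bar_s, p_model)
    return total


# Hypothetical toy example with S = 2 leaves
ys = [0, 1, 0, 0, 1, 1]
leaf = [0, 0, 0, 1, 1, 1]
batch = [0, 1, 3, 4]
etas = [1.0, 1.0]
ps_leaf = [1 / 3, 1 / 3, 1 / 3, 2 / 3, 2 / 3, 2 / 3]  # matches the leaf averages
ps_flat = [0.5] * 6                                   # ignores the leaves
print(tree_penalty(ps_leaf, ys, leaf, batch, etas),
      tree_penalty(ps_flat, ys, leaf, batch, etas))
```

With a single leaf containing all samples and a single tuning constant, the function returns $η$ times the mini-batch penalty (12).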

Note that the regularisation approach of Section 4.2 can also be seen as a special case of classification tree regularisation if we use tree stumps in the latter.

5. Example

5.1. Motor third party liability insurance data

For illustration, we choose the French motor third party liability (MTPL) insurance data set called freMTPL2freq. This data is included in the R package CASdatasets, see Charpentier (2015). An excerpt of the data is illustrated in Listing 1, and an extensive descriptive analysis of this MTPL insurance data is provided in Section 1 of Noll et al. (2018). We pre-process this data as described in Noll et al. (2018), which includes a small data cleaning part; the choices of the learning data set $D$ and the test data set $T$ are made as in Listing 2 of Noll et al. (2018). The learning data set $D$ is used for model calibration (in-sample), and the test data set $T$ is used for an out-of-sample test analysis (generalisation analysis). The only difference to Noll et al. (2018) is that we replace the integer-valued claims counts ClaimNb, see line 4 of Listing 1, by an indicator variable $Y \in \{0, 1\}$ which shows whether at least one claim has occurred for a given policy; this turns our prediction problem into a binary classification exercise.

5.2. Logistic regression

For the logistic regression model of Section 2 we use the same covariate pre-processing as described in Listing 3 of Noll et al. (2018), and we include the exposure in the covariate vector. Maximising log-likelihood (2) on the learning data $D$ provides the MLE $\hat{w}^{\mathrm{MLE}}$; this is done by using the R command glm under model choice family=binomial(). For the resulting MLE $\hat{w}^{\mathrm{MLE}}$ we calculate the in-sample and the out-of-sample deviance losses on learning data $D$ and test data $T$, respectively, given by

$L (\hat{w}^{\mathrm{MLE}}, θ; D) = - \frac{2}{n} ℓ (\hat{w}^{\mathrm{MLE}}, θ; D) \quad \text{and} \quad L (\hat{w}^{\mathrm{MLE}}, θ; T) = - \frac{2}{| T |} ℓ (\hat{w}^{\mathrm{MLE}}, θ; T),$

where $n = | D | = 610{,}212$ is the number of learning samples and $| T | = 67{,}801$ is the number of test samples, see Section 2.2 of Noll et al. (2018). Note that the MLE $\hat{w}^{\mathrm{MLE}}$ is solely based on the learning data $D$.

The results are presented in Table 1. Line (1) presents the homogeneous model (null model) where we do not use any covariate information. In the homogeneous model, the overall default probability (estimated on the learning data $D$) is given by $\bar{p} = \sum_{i} Y_{i} / n = 5.007276 \%$, see (11). This empirical overall default probability (given in the last column of Table 1) also provides the balance property, see (4). Line (2) presents the logistic regression approach. We see a decrease in both the in-sample and the out-of-sample losses; this shows that there are systematic effects (heterogeneity) in the MTPL portfolio which can (partially) be detected by the logistic regression approach (1). The last column of Table 1 confirms that the logistic regression approach fulfils the balance property (4).

5.3. Early stopping feedforward neural network

In this section, we consider an FNN regression model with early stopping for model calibration. We choose an FNN with 3 hidden layers having $(20, 15, 10)$ hidden neurons. We choose the hyperbolic tangent activation function, the nadam SGD optimiser, and a mini-batch size of 1,000 policies; these terms are described in detail in Chapter 5 of Wüthrich and Buser (2016). The corresponding R code (using the R interface to Keras) is a Bernoulli version of the R code provided in Listing 5.5 of Wüthrich and Buser (2016); in particular, we replace the exponential output function (log-link) of the Poisson regression model in that listing by the sigmoid/logistic output function (logit-link) of the Bernoulli regression model.

Table 1. Comparison of the homogeneous model (null model), the logistic regression model, the early stopping FNN and the GLM improved/regularised FNN.


In a preliminary analysis, we explore how many SGD steps we need to perform until the FNN starts to over-fit to the learning data. For this preliminary analysis, we split the learning data $D$ at random into a training sample $D_{0}$ and a validation sample $V$. As is common practice, we choose 80% of the learning data $D$ as training samples $D_{0}$ and the remaining 20% as validation samples $V$. We then train the network with SGD on the training data $D_{0}$ and we track over-fitting on the validation data $V$; note that this step does not use the test data $T$, which is only used later on for out-of-sample testing (a generalisation analysis) of the final model.
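
A random 80/20 split of this kind can be sketched as follows. The function name, the seed and the rounding rule are our assumptions; the paper does not state how the split was generated.

```python
import random


def train_validation_split(n, train_share=0.8, seed=0):
    """Split the indexes 0, ..., n-1 at random into training and validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_share * n))
    return indices[:cut], indices[cut:]


# Small example with n = 1000 learning samples
train_idx, val_idx = train_validation_split(1000)
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible, so that different network calibrations can be compared on the same validation sample.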

In Figure 1, we illustrate this preliminary analysis, which shows that after roughly 40 epochs the model starts to over-fit to the training data $D_{0}$ (because the validation loss on $V$ starts to increase). For this reason, we fix the early stopping rule at 40 epochs for all further network calibrations.
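
Reading the stopping epoch off a validation curve can also be done programmatically. This is a minimal sketch with a hypothetical function name; the loss values are made up for illustration and are not taken from Figure 1.

```python
def best_epoch(validation_losses):
    """Return the 1-based epoch with the smallest validation loss."""
    best = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best + 1


# Hypothetical validation losses per epoch: decreasing first, then increasing
val_losses = [0.400, 0.360, 0.340, 0.335, 0.337, 0.342]
stop_at = best_epoch(val_losses)
print(stop_at)  # -> 4
```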

Figure 1. Preliminary analysis exploring the early stopping rule: model fitting on the training data $D_{0}$ (in upper graph) and tracking over-fitting on the validation data $V$ (in lower graph); note that this is a standard output in Keras which (unfortunately) drops the factor 2 from the loss function (7), thus, the y-axis needs to be scaled with 2.

We then fit the FNN regression model over 40 epochs, which provides us with an early stopped SGD calibration $(\hat{w}^{\mathrm{SGD}}, \hat{θ}^{\mathrm{SGD}})$. Every SGD calibration needs an initial value from which the SGD algorithm is started. This initial value is usually chosen at random; in Keras the default is the Glorot uniform initialiser, see Glorot and Bengio (2010). This initialiser needs a seed for random number generation and, therefore, the early stopped SGD calibration $(\hat{w}^{\mathrm{SGD}}, \hat{θ}^{\mathrm{SGD}})$ will depend on this initial seed. On lines (3a)–(3c) of Table 1 we provide three such early stopped SGD calibrations having different seeds 1, 2 and 3. We note that all three calibrations provide lower in-sample and out-of-sample losses on $D$ and $T$, respectively, compared to the logistic regression model. This illustrates that the logistic regression model misses important model structure that is captured by the FNN. More worrying is that the balance property (4) fails to hold and the deviations are substantial, see the last column of Table 1.

To gain better intuition about the potential failure of the balance property, we run this SGD calibration over 50 different seeds (starting values of the SGD algorithm). On the left-hand side of Figure 2, the box plot illustrates the different values we receive for ${\hat{p}}^{SGD}$. They fluctuate between 4.5% and 5.6% (the level required by the balance property is 5.007276%, orange horizontal line). We conclude that the balance property may fail substantially under early stopping of the SGD calibration, which may lead to a huge bias and a severe misspecification of the global population (price) level. In Figure 2 (middle and right-hand side) we illustrate the resulting in-sample losses and out-of-sample losses, respectively, over 50 different seeds of early stopped SGD calibrations. We note that the solutions provided on lines (3a)–(3c) of Table are part of these plots, i.e., they correspond to the first 3 seeds with corresponding values in Figure 2.
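To put the reported range into perspective, the deviation from the balance property can be expressed relative to the empirical mean. The short sketch below uses only the three percentages quoted above; the function name is illustrative.

```python
def relative_bias(average_prediction, empirical_mean):
    """Relative deviation of the average prediction from the empirical mean."""
    return (average_prediction - empirical_mean) / empirical_mean

empirical_mean = 0.05007276   # level required by the balance property
lowest, highest = 0.045, 0.056  # range observed over the 50 seeds

bias_low = relative_bias(lowest, empirical_mean)
bias_high = relative_bias(highest, empirical_mean)

print(f"lowest seed:  {bias_low:+.1%}")
print(f"highest seed: {bias_high:+.1%}")
```

Depending on the seed, the overall level is therefore under- or over-estimated by roughly 10% to 12%, which carries over directly to the overall price level of the portfolio.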

Figure 2. (lhs) Balance property (4) of the early stopping FNN over 50 different seeds (starting points); the orange horizontal line shows the balance property level of 5.007276%; (middle) in-sample learning losses on $D$ and (rhs) out-of-sample test losses on $T$ of the early stopping FNN (left box plots in graphs) and the GLM improved/regularised FNN (right box plots in graphs) over the 50 different seeds (starting values of the SGD algorithm).

5.4. Logistic regression bias regularisation

The failure of the balance property as illustrated in Figure 2 (lhs) motivates us to apply the additional logistic regression step to the early stopping FNN calibration. This provides us with the GLM improved calibrations ${\hat{p}}^{SGD+}$, see (9), $x \mapsto {\hat{p}}^{SGD+}(x) = σ({\hat{w}}_{0}^{MLE(d)} + \sum_{j=1}^{d} {\hat{w}}_{j}^{MLE(d)} f_{j}^{{\hat{θ}}^{SGD}}(x))$. Table , lines (4a)–(4c), provides the corresponding figures (they use exactly the same seeds as the ones on lines (3a)–(3c)). We note that in-sample losses on $D$ decrease (which needs to be the case), that out-of-sample losses on $T$ decrease (which shows that the early stopping FNN does not yet over-fit), and that the balance property is fulfilled. The decreases in in-sample and out-of-sample losses are also illustrated in the red coloured box plots of Figure 2. From this example, we conclude that the GLM improved/regularised FNN calibration (9) provides a substantially improved FNN compared to (8), and that this additional logistic regression step on the working data $D^{SGD}$ should be explored for a suitable predictive model.
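The mechanics of this additional logistic regression step can be sketched on simulated data. In the sketch below, the learned neurons $f_{j}^{{\hat{θ}}^{SGD}}(x)$ are replaced by randomly generated stand-in features, the "early stopped" output weights are simply perturbed true weights, and the MLE is computed by a plain Newton-Raphson iteration; all of these are assumptions made for illustration and not the set-up of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deviance_loss(y, p):
    """Bernoulli deviance loss, -(2/n) times the log-likelihood."""
    return -2.0 * np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_mle(features, y, n_iter=50, tol=1e-10):
    """Newton-Raphson MLE of a logistic regression with intercept."""
    n = features.shape[0]
    X = np.column_stack([np.ones(n), features])  # intercept column first
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        score = X.T @ (y - p)                          # gradient of log-likelihood
        information = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(information, score)
        w += step
        if np.max(np.abs(step)) < tol:
            break
    return w

rng = np.random.default_rng(0)
n, d = 5000, 3
# Stand-in for the last hidden layer activations of an early stopped FNN.
features = np.tanh(rng.normal(size=(n, d)))
true_w = np.array([-3.0, 0.8, -0.5, 0.3])
y = rng.binomial(1, sigmoid(true_w[0] + features @ true_w[1:])).astype(float)

# Hypothetical early stopped output weights: not a critical point of the loss.
w_sgd = true_w + np.array([0.15, 0.10, -0.10, 0.05])
p_sgd = sigmoid(w_sgd[0] + features @ w_sgd[1:])

# Additional logistic regression step on the working data (features, y).
w_mle = fit_logistic_mle(features, y)
p_plus = sigmoid(w_mle[0] + features @ w_mle[1:])

print("empirical mean:       ", y.mean())
print("average of p_sgd:     ", p_sgd.mean())
print("average of p_plus:    ", p_plus.mean())
print("in-sample loss p_sgd: ", deviance_loss(y, p_sgd))
print("in-sample loss p_plus:", deviance_loss(y, p_plus))
```

Because the intercept component of the score equation reads $\sum_{i} (Y_{i} - p(x_{i})) = 0$ at the MLE, the refitted probabilities reproduce the empirical mean up to numerical precision, and the in-sample loss cannot be larger than that of the early stopped weights.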

5.5. Shrinkage regularisation

Our final analysis explores the shrinkage regularisation approach of Section 4.2 by applying penalty term (12), $R(w, θ; \bar{p}, B) = \bar{p} \log(\frac{\bar{p}}{p_{B}}) + (1 - \bar{p}) \log(\frac{1 - \bar{p}}{1 - p_{B}})$. This method is more complicated to use since it requires more work and fine-tuning. Firstly, we need to implement a custom-made loss function in Keras that adds a KL divergence penalty term to the Bernoulli deviance loss. Secondly, we need to fine-tune the hyper-parameters: these are the batch size, the tuning parameter $η > 0$ and the number of epochs. We have performed a grid search to find suitable parameters. We keep the batch size of 1000 samples and 40 epochs as in the previous calibrations. The tuning parameter is chosen as $η \in \{10, 100, 250, 1000\}$.
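The structure of such a penalised loss can be sketched outside Keras in plain Python. The sketch assumes that $p_{B}$ is the average predicted probability on the mini batch $B$ and that the penalty enters additively as deviance loss plus $η$ times (12); both the function names and the toy batch are illustrative, and an actual Keras implementation would need tensor operations instead of Python lists.

```python
import math

def kl_penalty(p_bar, p_batch):
    """Penalty term (12): KL divergence of Bernoulli(p_bar) from Bernoulli(p_batch)."""
    return (p_bar * math.log(p_bar / p_batch)
            + (1 - p_bar) * math.log((1 - p_bar) / (1 - p_batch)))

def penalised_batch_loss(y, p, p_bar, eta):
    """Bernoulli deviance loss on a batch plus eta times the KL penalty.
    Assumption: p_batch is the average prediction on the batch."""
    n = len(y)
    log_lik = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                  for yi, pi in zip(y, p))
    deviance = -2.0 * log_lik / n
    p_batch = sum(p) / n
    return deviance + eta * kl_penalty(p_bar, p_batch)

# Toy batch (made up): predictions that are on average below the target level.
p_bar = 0.05007276
y = [0] * 19 + [1]
p = [0.04] * 19 + [0.10]

for eta in (10, 100, 250, 1000):
    print(eta, penalised_batch_loss(y, p, p_bar, eta))
```

The penalty vanishes only if the batch average of the predictions equals $\bar{p}$, and its weight in the total loss grows linearly in $η$.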

In Table 2, we present the results. The general observation is that the shrinkage regularised versions are not fully competitive. Bias regularisation requires a comparably large tuning parameter η and, with a small batch size of 1000, this large tuning parameter η negatively impacts the accuracy of the FNN regression model. We conclude that the shrinkage regularisation approach is not fully compatible with SGD fitting because the balance property is a global property whereas SGD acts on (local) mini batches.

Table 2. Comparison of the homogeneous model (null model), the logistic regression model, the early stopping FNN, the GLM improved/regularised FNN and the shrinkage regularised versions for different tuning parameters $η \in \{10, 100, 250, 1000\}$.

6. Conclusions

We have discussed the important problem of considering statistical models that provide unbiased mean estimates on a global population level (balance property). Classical statistical regression models like generalised linear models naturally have this balance property under the canonical link choice because the maximum likelihood estimator is a critical point of the corresponding optimisation problem. In general, early stopped gradient descent calibrated neural networks fail to have the balance property, because early stopping prevents these models from taking parameters in critical points of the (deviance) loss function. In many applications, this does not reflect a favourable model calibration because it may lead to substantial price misspecification on a global population level. Therefore, we have proposed improvements that lead to globally unbiased solutions. These solutions include an additional generalised linear model optimisation step or shrinkage regularisation to empirical averages. Based on the numerical example, we prefer the additional generalised linear model optimisation step.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Mario V. Wüthrich

Mario V. Wüthrich is Professor in the Department of Mathematics at ETH Zurich, Honorary Visiting Professor at City, University of London (2011-2022), Honorary Professor at University College London (2013-2019), and Adjunct Professor at University of Bologna (2014-2016). He holds a Ph.D. in Mathematics from ETH Zurich (1999). From 2000 to 2005, he held an actuarial position at Winterthur Insurance, Switzerland. He is Actuary SAA (2004), served on the board of the Swiss Association of Actuaries (2006-2018), and is Editor-in-Chief of ASTIN Bulletin (since 2018).

Notes

1 CASdatasets website http://cas.uqam.ca; see also page 55 of the reference manual CASdatasets Package Vignette (2018); we use version 1.0-8 which has been packaged on 2018-05-20.

References

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth Statistics/Probability Series.
Bühlmann, H., & Gisler, A. (2005). A course in credibility theory and its applications. Springer.
CASdatasets Package Vignette (2018). Version 1.0-8, May 20, 2018. http://cas.uqam.ca
Charpentier, A. (2015). Computational actuarial science with R. CRC Press.
Cheng, X., Jin, Z., & Yang, H. (2020). Optimal insurance strategies: A hybrid deep learning Markov chain approximation approach. ASTIN Bulletin, 50(2), 449–477. https://doi.org/10.1017/asb.2020.9
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215–232. https://doi.org/10.1111/j.2517-6161.1958.tb00292.x
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
Gabrielli, A. (2020). A neural network boosted double overdispersed Poisson claims reserving model. ASTIN Bulletin, 50(1), 25–60. https://doi.org/10.1017/asb.2019.33
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. Proceedings of Machine Learning Research, 9, 249–256.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society: Series B (Methodological), 46(2), 149–170. https://doi.org/10.1111/j.2517-6161.1984.tb01288.x
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. https://doi.org/10.1016/0893-6080(89)90020-8
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Lee, G. Y., Manski, S., & Maiti, T. (2020). Actuarial applications of word embedding models. ASTIN Bulletin, 50(1), 1–24. https://doi.org/10.1017/asb.2019.28
McCullagh, P., & Nelder, J. A. (1983). Generalized linear models. Chapman & Hall.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3), 370–384. https://doi.org/10.2307/2344614
Noll, A., Salzmann, R., & Wüthrich, M. V. (2018). Case study: French motor third-party liability claims. SSRN Manuscript ID 3164764. Version March 4, 2020.
Quinlan, J. R. (1992). Learning with continuous classes. In Proceedings of the 5th Australian joint conference on artificial intelligence (pp. 343–348). Singapore: World Scientific.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
Wang, C., Venkatesh, S. S., & Judd, J. S. (1994). Optimal stopping and effective machine complexity in learning. In Advances in neural information processing systems (NIPS'6) (pp. 303–310).
Wang, Y., & Witten, I. H. (1997). Inducing model trees for continuous classes. In Proceedings of the ninth European conference on machine learning (pp. 128–137).
Wüthrich, M. V., & Buser, C. (2016). Data analytics for non-life insurance pricing. SSRN Manuscript ID 2870308, Version of September 10, 2020.
Wüthrich, M. V., & Merz, M. (2019). Editorial: Yes we CANN! ASTIN Bulletin, 49(1), 1–3. https://doi.org/10.1017/asb.2018.42
