
Loss modeling with many-parameter distributions

Received 22 Nov 2022, Accepted 22 Jan 2024, Published online: 05 Feb 2024

Abstract

It is argued that many-parameter families of loss distributions may work even with limited amounts of historical data. A restriction to unimodality works as a stabilizer, which makes fitted distributions much more stable than their parameters. We propose Box-Cox transformed Gamma and Burr variables. These are three- and four-parameter models with many of the traditional two-parameter families as special cases, and there are well-defined distributions at the boundaries of the parameter space, which is important for stability. The approach is evaluated with model error defined through the theory of misspecification in statistics. It is shown that such error is drastically reduced when a third or fourth parameter is added, without increasing the random error more than a little. It is pointed out that the approach may be a suitable starting point for completely automatic procedures.

1. Introduction

Modeling insurance losses from historical observations might have been carried out by the empirical distribution or by non-parametric density estimation (for example Scott 1992) if it had not been for future, huge losses not in the data. Indeed, the importance of this extreme right tail has promoted special methods to deal with it; Embrechts et al. (1997) is the standard reference. Yet, in practical insurance, two-parameter models are without doubt still among the most common approaches (Omari et al. 2018). The drawback is that distributions are now tied to a specific shape for which there may be little theoretical justification. Families of distributions with more than two parameters, such as the Burr family which goes back to Burr (1942) or the transformed Beta and Gamma distributions in Venter (1985), also treated in Kleiber & Kotz (2003) and Cummins et al. (1990), would be more flexible. It is the aim of the present paper to argue for the use of such multi-parameter families, among which we construct a new one with attractive properties. Another tradition in actuarial science is to use slicing or composite distributions with different parametric families imposed on each side of a threshold. The approach has some support in Pickands' theorem, which reveals that tail distributions always become Pareto or exponentially distributed when the threshold tends to infinity. The problem here is that it varies enormously from case to case how large the threshold must be; see Bølviken (2014). This motivates other composite models, which have in recent years been proposed in large numbers, see for example Cooray & Ananda (2005), Scollnik (2007), Scollnik & Sun (2012), Bakar et al. (2015), Reynkens et al. (2017) and Chan et al. (2018). Composites can be seen as special cases of mixture distributions, which have also been used for loss modeling. Some references are Lee & Lin (2010), Verbelen et al. (2015), Miljkovic & Grün (2016) and Gui et al. (2021).

The key issue is how to cope with a lack of knowledge of the shape of the underlying distribution when the experience to go on is limited. One approach, used in Miljkovic & Grün (2016), is to try several families of distributions and select the best-fitting one; another is the so-called Bayesian averaging in Hoeting et al. (1999). Still another, and in our view simpler, way is to start out with multi-parameter models containing the standard two-parameter ones as special cases. There are several possibilities (see above), but our preferred vehicle is the Box-Cox-transformed Gamma and Burr variables constructed in the next section. These models are always unimodal, a property to which we attach great importance. Is such a delimitation too narrow in scope?

Loss data from different populations are common, but their origin would often be known, and we are then in a regression situation outside the realm of this paper. Mixture distributions arise when the underlying populations are unobservable or latent, so that it is not known which populations individual losses come from. Whether this violates unimodality depends on how far apart the populations are, but even if multimodality would have been spotted with lots of data, it might not be for smaller amounts. Further, there is a price to be paid for all this flexibility; it is bound to make modeling more vulnerable to random error. This is demonstrated by example in Section 4, and that is where our idea kicks off. By imposing unimodality, fitted distributions might be stable even when there are many parameters and not overwhelmingly much data. The rationale is as follows. Many-parameter, unimodal families will inevitably contain distributions resembling each other, which can only mean uncertain parameter estimates. However, those are not really the target, and the fitted distributions they lead to are much less vulnerable to random error. The transformation families proposed in Section 2 work not only because their distributions are unimodal, but also because they tend to well-defined distributions at the boundary of the parameter space, which is important for their behavior when fitted. This makes it possible to program automatic methods where the computer on its own picks a distribution from the three- or four-parameter family, without human intervention, thus avoiding the whole model selection step, which is particularly useful when data are not abundant.

Then the question is how different methods should be evaluated and compared. What really matters is not the loss model itself, but rather its impact on what it is used for, for example on the projected reserve of a portfolio, which is the chief criterion in the remainder of the paper. There are both model and sampling errors to deal with, and one way to make the distinction between them precise is to draw on the asymptotic theory of likelihood estimates in misspecified models; White (1982) is a much cited reference. Suppose a family of distributions parameterized by $\theta$ has been imposed, with $\hat\theta$ its likelihood estimate. As the number of historical losses becomes infinite, $\hat\theta$ will nearly always converge to some $\tilde\theta$ even if the true loss distribution is outside the family. This $\tilde\theta$-distribution minimizes the so-called Kullback-Leibler distance to the true one (Section 2.4), and provides a reasonable definition of model error. When we in Section 2.5 compare the projected reserve under these two distributions over a broad range of possibilities, it will emerge that a third parameter often lowers model error drastically and that a fourth one makes it very small indeed.

Model error under our transformation models is thus fairly small, but then there is random error, which we study both by asymptotics and by Monte Carlo. One side of that is over-parameterization: how much extra random error is brought in when extra parameters are added, for instance to a two-parameter family? Special weight is placed on log-normal models, under which error studies can be carried out by particularly simple mathematics. One thing that must be kept in mind is that errors with scarce data and heavy-tailed distributions might be large in any case, and the real question is how much larger they become when additional parameters are added. Bayesian assumptions, if warranted, might mitigate error, but that is outside the scope of this paper.

The outline of this paper is as follows. Multi-parameter modeling and model misspecification are introduced in Section 2. Section 3 presents asymptotic results concerning the error that is made using a multi-parameter model when the true model is a simpler one, whereas this is explored in a simulation study in Section 4. An illustrative example involving Norwegian automobile insurance losses is given in Section 5. Finally, Section 6 contains some concluding remarks.

2. A multi-parameter approach

2.1. The Burr model

One way of defining families of loss distributions beyond the two-parameter ones is through ratios of Gamma variables (Venter 1985, Kleiber & Kotz 2003). Let
\[
X = \frac{G_\theta}{G_\alpha} \tag{1}
\]
with $G_\alpha$ and $G_\theta$ two independent Gamma variables with mean 1 and shapes $\alpha$ and $\theta$. This is the Burr model, also called Beta Prime, Pearson type 6 and extended Pareto, which goes back to Burr (1942). It can also be defined through its density function
\[
f(x) = \frac{\Gamma(\alpha+\theta)}{\Gamma(\alpha)\Gamma(\theta)}\,\frac{(\alpha/\theta)^{\alpha}\, x^{\theta-1}}{(\alpha/\theta + x)^{\alpha+\theta}}, \qquad x > 0. \tag{2}
\]
Adding a parameter of scale $\beta > 0$ leads to a model $Z = \beta X$ for a claim $Z$. Its moments are
\[
E(Z^k) = \left(\frac{\alpha}{\theta}\right)^{k}\left(\prod_{j=1}^{k}\frac{\theta+j-1}{\alpha-j}\right)\beta^{k}, \qquad k = 1, 2, \ldots \tag{3}
\]
which are finite when $k < \alpha$.

One of the attractive sides of this framework is that it captures under the same roof both heavy-tailed Pareto distributions, which appear when $\theta = 1$, and more light-tailed Gamma ones, which appear when $\alpha \to \infty$. The latter is an immediate consequence of (1) since $G_\alpha \to_p 1$ as $\alpha \to \infty$ by virtue of $E(G_\alpha) = 1$ and $\mathrm{sd}(G_\alpha) = 1/\sqrt{\alpha}$, and a similar argument shows that the inverse Gamma $1/G_\alpha$ emerges when $\theta \to \infty$.
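As a small illustration of (1) and (3), the following R sketch draws Burr losses as a ratio of mean-one Gamma variables and compares the simulated mean with the first moment from (3); the function name and parameter values are illustrative only, not part of the paper's code.

```r
# Burr losses Z = beta * G_theta / G_alpha with both Gammas scaled to mean 1
rburr <- function(n, alpha, theta, beta) {
  Gtheta <- rgamma(n, shape = theta, rate = theta)   # mean 1, shape theta
  Galpha <- rgamma(n, shape = alpha, rate = alpha)   # mean 1, shape alpha
  beta * Gtheta / Galpha
}
set.seed(1)
z <- rburr(1e5, alpha = 3, theta = 2, beta = 1)
c(simulated = mean(z), exact = 3 / (3 - 1))          # E(Z) = beta * alpha/(alpha - 1) by (3)
```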

2.2. Transformation modeling

One way to extend the Burr family is to use parameterized transformations, say
\[
Z = \beta h(X;\eta) \tag{4}
\]
with $\eta > 0$ an additional parameter and $h(x;\eta)$ some increasing function to be specified. An example of such a construction is the generalized beta distribution of order II (GB2), which uses $h(x;\eta) = x^{1/\eta}$; it is mentioned in Kleiber & Kotz (2003, p. 190) and used for modeling insurance losses in Cummins et al. (1990).

A natural restriction is $h(x;1) = x$, making the Burr model a special case. Not all functions $h(x;\eta)$ satisfying these conditions lead to families of unimodal distributions, but one possibility is the Box-Cox transformation $h(x;\eta) = (1+x)^{\eta} - 1$, which goes back to Box & Cox (1964). This choice sets up what we shall refer to as the PowerGamma and PowerBurr models
\[
Z = \beta\{(1+G_\theta)^{\eta} - 1\} \quad\text{and}\quad Z = \beta\{(1+X)^{\eta} - 1\} \tag{5}
\]
with parameters $\theta$, $\eta$ and $\beta$ for PowerGamma on the left and $\alpha$, $\theta$, $\eta$ and $\beta$ for PowerBurr on the right. The latter collapses to PowerGamma when $\alpha \to \infty$. It will emerge in Section 3 that the log-normal is a special case of both, and most other standard loss models are included approximately. The PowerGamma and PowerBurr families are unimodal everywhere; i.e. their density functions are monotonely decreasing when $\theta \le 1$ and have a single maximum otherwise, which follows easily from their mathematical expressions in Appendices C and D.
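A minimal R sketch of sampling from the two models in (5); the function names and parameter values are illustrative and not taken from the paper's code.

```r
# PowerGamma: Box-Cox transformed mean-one Gamma; PowerBurr: transformed Burr ratio
rpowergamma <- function(n, theta, eta, beta) {
  beta * ((1 + rgamma(n, shape = theta, rate = theta))^eta - 1)
}
rpowerburr <- function(n, alpha, theta, eta, beta) {
  X <- rgamma(n, shape = theta, rate = theta) / rgamma(n, shape = alpha, rate = alpha)
  beta * ((1 + X)^eta - 1)
}
set.seed(1)
summary(rpowerburr(1e4, alpha = 5, theta = 2, eta = 1.5, beta = 1))
```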

One of the attractive features of the Box-Cox transformation is its behavior as $\eta \to 0$. It follows from (5) that $Z = \beta\eta\log(1+G_\theta) + o_p(\eta)$ on the left and $Z = \beta\eta\log(1+X) + o_p(\eta)$ on the right, in both cases in the limit well-defined models with scale parameter $\beta\eta$. We believe this property to have a stabilizing influence when models are fitted. Indeed, it is worth recording what happens with the perhaps superficially equally attractive $h(x;\eta) = x^{\eta}$. The models for $Z = \beta G_\theta^{\eta}$ and $Z = \beta X^{\eta}$ are again unimodal, but now (for example) $\beta X^{\eta} = \beta + \beta\eta\log(X) + o_p(\eta)$ as $\eta \to 0$, and a very small $\eta$ can only be compensated by extreme variation in $X$. Such solutions simply are not plausible, yet likelihood fitting for moderate or small amounts of data often came up with them when the transformation was tried out. Does a restriction such as $\eta \ge 1$ get rid of them? The problem is that these freak solutions now turn into $\eta$-estimates close to 1, and such fitting, when examined, implied considerable estimation bias, in the sense that the estimated distribution did not fit the data well.

2.3. Illustration

A small example of how such modeling may work is shown in Table 1, where the Gamma, Pareto, Burr, PowerGamma and PowerBurr distributions have been fitted to the n = 60 Belgian fire losses from Beirlant et al. (1996), here scaled so their average is one (R-code for fitting the Burr, PowerGamma and PowerBurr is available at https://github.uio.no/ingrihaf/PowerBurr). The parameter values from the five fitted distributions are displayed along with the corresponding means, standard deviations and skewnesses. We have also computed the 99% reserve of a portfolio, i.e. the 99% percentile of the distribution of the total portfolio loss. We assumed a Poisson distribution for the number of claims with expectation λ=10, and each of the five distributions as the claim size distribution. The resulting reserves are shown in the rightmost column of the table. Not all modeling based on 60 losses can be expected to be this stable, with standard deviations, skewnesses and reserves that hardly vary between the different loss models. The very large α for Burr and PowerBurr (22,000 was the maximum value allowed) signifies that $G_\alpha \approx 1$ in (1), so that the Burr and the PowerBurr models obtained are very nearly Gamma and PowerGamma. Standard deviations match the empirical one in the data, which is 1.02, whereas the skewness is substantially higher than the empirical skewness coefficient of 1.46, which, however, is known to under-estimate the true skewness; see for instance Bølviken (2014).

Table 1. Parameters, mean, sd, skewness and 99% reserve for the Belgian fire data.
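A sketch of how such a 99% reserve can be computed by simulation for a Poisson portfolio with λ=10; the claim-size sampler and its parameters below are placeholders, not the fitted values of Table 1.

```r
# 99% reserve = 99% percentile of the total portfolio loss S = Z_1 + ... + Z_N, N ~ Poisson(lambda)
reserve <- function(lambda, rclaim, eps = 0.99, nsim = 1e5) {
  totals <- replicate(nsim, {
    n <- rpois(1, lambda)
    if (n == 0) 0 else sum(rclaim(n))
  })
  quantile(totals, eps)
}
set.seed(1)
reserve(10, function(n) rgamma(n, shape = 1.5, rate = 1.5))   # illustrative Gamma claims
```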

2.4. Misspecification

Let $f(z;p)$ be the density function of Z under one of the models above, depending on the parameter vector p, and let $\hat p_n$ be the maximum likelihood estimate of p based on n historical losses. Maximum likelihood is not only the standard method, but also has an appealing error theory. Suppose the true density function $g(z)$ differs from $f(z;p)$ for all p. Such model error is highly relevant in insurance since there is typically little hard justification for the parametric families imposed. However, even when they do not contain the true distribution, there usually exists some $\tilde p$ so that $\hat p_n \to_p \tilde p$ as $n \to \infty$. Indeed, it was shown in Huber (1967) and later in White (1982, 1994) under very general conditions that $\tilde p$ minimizes the Kullback-Leibler divergence of the family $f(z;p)$ with respect to the true $g(z)$, i.e.
\[
E(p; g) = \int_0^{\infty}\log\!\left(\frac{g(z)}{f(z;p)}\right) g(z)\, dz. \tag{6}
\]
If $g(z) = f(z;p)$ for some p, then $\tilde p = p$, but otherwise $f(z;\tilde p)$, minimizing (6), deviates from the true $g(z)$, resulting in model error.

Consider a quantity of interest ψ, for example the reserve of a portfolio in general insurance, i.e. a percentile of the total loss distribution, and write $\psi(p)$ for its value when $f(z;p)$ is the loss distribution. The value we seek is $\psi(g)$ under the true density function $g(z)$, whereas our estimate is $\psi(\hat p_n)$, and its error can be decomposed as
\[
\psi(\hat p_n) - \psi(g) = \psi(\hat p_n) - E\{\psi(\hat p_n)\} + E\{\psi(\hat p_n)\} - \psi(g), \tag{7}
\]
where the first component on the right reflects the standard error of the estimate. The second component is bias, which is the sum of two contributions; i.e.
\[
E\{\psi(\hat p_n)\} - \psi(\tilde p) + \psi(\tilde p) - \psi(g) \tag{8}
\]
with model error on the right, which is what remains when $n \to \infty$. There is no such error when the PowerGamma or the PowerBurr models from Section 2.2 are fitted to Gamma, Pareto or log-normal data.

2.5. Model error

To examine the model error numerically, consider the computation of the reserve in a small Poisson portfolio in general insurance with λ=10 expected events, meaning a percentile of the total portfolio loss distribution. The true loss models are varied among four different families, notably the log-normal $Z = \beta e^{\theta Y}$ with $Y \sim N(0,1)$, the log-Gamma $Z = \beta(e^{G_\theta} - 1)$ where $G_\theta$ is a Gamma variable with mean 1 and shape θ, the Weibull model $Z = \beta G_1^{1/\theta}$ and the logistic distribution defined by $\Pr(Z \le z) = 1 - 1/(1 - \theta + \theta e^{z/\beta})$ with $0 < \theta \le 1$. The reserve was computed assuming each of the loss models Pareto, Burr, PowerGamma and PowerBurr in addition to the true one. In the cases where the assumed distribution $f(z;p)$ did not include the true distribution $g(z)$ as a special case, the best-fitting parameter vector $\tilde p$ was found by minimizing the Kullback-Leibler divergence (6) numerically. Finally, Monte Carlo with ten million simulations was used to calculate the 99% reserve under the true $g(z)$ and the approximating one $f(z;\tilde p)$.
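A sketch of this pseudo-true parameter computation: minimizing (6) over p is equivalent to maximizing $E_g\{\log f(Z;p)\}$, which can be done with numerical integration and a general-purpose optimizer. The true density, the fitted family and the parameter values below are chosen only for illustration.

```r
# Pseudo-true parameter: maximize E_g[log f(Z;p)] for a Gamma family f when the true
# density g is Weibull; log-parameters keep shape and rate positive during the search.
g <- function(z) dweibull(z, shape = 2, scale = 1)
neg_Elogf <- function(p) {
  integrate(function(z) -dgamma(z, exp(p[1]), exp(p[2]), log = TRUE) * g(z), 0, Inf)$value
}
fit <- optim(c(0, 0), neg_Elogf)
exp(fit$par)    # best-fitting (pseudo-true) Gamma shape and rate
```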

Table 2 reports how much the reserve changes (in per cent) when $g(z)$ is replaced by the best-fitting $f(z;\tilde p)$. In mathematical terms the criterion is $(\psi(\tilde p)/\psi(g) - 1)\times 100$, $\psi(\cdot)$ being the reserve, which does not depend on β so that only the shape parameter θ has to be varied. The values of θ in Table 2 have been chosen with some care to reflect a reasonable range of skewness. The log-Gamma distribution is so heavy-tailed at θ=1 or lower that not even the expectation is finite, and both the log-normal with θ=1 and the Weibull distribution with θ=0.5 allow very large claims. The results show how well higher-parameter distributions approximate many standard models. Whereas two-parameter Pareto fitting (included for comparison) over-shoots the target enormously, the four-parameter PowerBurr has inconsequential model bias everywhere, and so has the three-parameter PowerGamma, except for the exceptionally heavy-tailed log-Gamma case with θ=1. Unlike the other fitted distributions, the PowerGamma does not contain the Pareto family as a special case, but when model bias was calculated as described above, it was again largely minor; i.e. 1.1% when the Pareto parameter α=10, 3.5% when α=5 and 27.2% when α=2, the latter so heavy-tailed that the variance is infinite.

Table 2. Model bias in the estimated 99% reserve in % of the true values under Pareto, Burr, PowerGamma and PowerBurr fitting.

3. Over-parameterization

3.1. Introduction

Modeling beyond two parameters introduces extra parameter uncertainty if one of the usual two-parameter sub-families is the true one. The size of the degradation can be explored analytically through asymptotics as $n \to \infty$. Let $p = (p_1,\ldots,p_d)$ be the parameter vector and consider moments $\mu_k(p) = E(Z^k; p)$ and percentiles $q_\epsilon(p)$ defined as $\Pr(Z \le q_\epsilon(p); p) = \epsilon$. Their estimates are $\hat\mu_{kn} = \mu_k(\hat p_n)$ and $\hat q_{\epsilon n} = q_\epsilon(\hat p_n)$ where $\hat p_n$ is the maximum likelihood estimate of p based on n observations of Z.

Under the assumption that the model $f(z;p)$ is the true one, theoretical statistics offers an approximation of $\mathrm{sd}(\hat\mu_{kn})$ and $\mathrm{sd}(\hat q_{\epsilon n})$ through the information matrix $I = (I_{ij})$ and the gradient vectors $v_k = (v_{ki})$ for $\mu_k(p)$ and $v_\epsilon = (v_{\epsilon i})$ for $q_\epsilon(p)$, where
\[
I_{ij} = E\!\left(\frac{\partial\log f(Z;p)}{\partial p_i}\,\frac{\partial\log f(Z;p)}{\partial p_j}\right), \qquad v_{ki} = \frac{\partial\mu_k(p)}{\partial p_i} \quad\text{and}\quad v_{\epsilon i} = \frac{\partial q_\epsilon(p)}{\partial p_i}, \qquad i,j = 1,\ldots,d. \tag{9}
\]
For moment estimation the result is
\[
\mathrm{sd}(\hat\mu_{kn}) = \frac{s_k(p)}{\sqrt{n}} + o\!\left(\frac{1}{\sqrt{n}}\right), \qquad s_k(p) = \sqrt{v_k^T I^{-1} v_k}. \tag{10}
\]
For percentile estimation the result is the same except for $s_k(p)$ being replaced by $s_\epsilon(p) = \sqrt{v_\epsilon^T I^{-1} v_\epsilon}$.
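A small numerical illustration of (9)–(10), assuming a two-parameter Gamma model where the information matrix has a known closed form; the function name and parameter values are illustrative.

```r
# Asymptotic sd of the k-th moment estimate under a Gamma(shape, rate) model, via (10)
asymp_sd_moment <- function(k, shape, rate, n) {
  I <- matrix(c(trigamma(shape), -1 / rate,
                -1 / rate,        shape / rate^2), 2, 2)     # Fisher information, one observation
  mu_k <- exp(lgamma(shape + k) - lgamma(shape)) / rate^k    # E(Z^k)
  v <- c(mu_k * (digamma(shape + k) - digamma(shape)),       # d mu_k / d shape
         -k * mu_k / rate)                                   # d mu_k / d rate
  sqrt(drop(t(v) %*% solve(I) %*% v)) / sqrt(n)
}
asymp_sd_moment(k = 1, shape = 2, rate = 2, n = 100)         # equals sqrt(shape)/(rate*sqrt(n)) here
```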

A more general form, replacing the information matrix with a sandwich form, covers the situation when the model is misspecified; consult White (1982). Such an extension is relevant for the problems under discussion here, but we shall make no use of it. Our aim in this section is to evaluate $s_k(p)$ under a default, two-parameter model and examine how much it increases when a wider three- or four-parameter one is being used.

How large n must be for the asymptotic results to be reasonably accurate approximations in the finite sample case was checked by Monte Carlo and varied with both the skewness of the distribution and the quantity estimated. In Tables 3 and 4 below, n = 100 observations or even fewer (n = 50) are often enough to get rather accurate results with the asymptotic formulas. The convergence is faster for the percentiles than for the higher order moments and becomes slower when skewness is high.

Table 3. Relative increase (in %) in the asymptotic standard error of moment and percentile estimates when Burr has been fitted to Pareto and Gamma data.

Table 4. Relative increase (in %) in the asymptotic standard error of moment and percentile estimates when PowerGamma has been fitted to log-normal data.

3.2. Burr against Pareto and Gamma

The first example examines how much Burr fitting inflates uncertainty when the data come from a Pareto or a Gamma distribution. Write the coefficients $s_k(p)$ and $s_\epsilon(p)$ in (10) as $s_k^{\mathrm{bu}}(\alpha,\theta,\beta)$ and $s_\epsilon^{\mathrm{bu}}(\alpha,\theta,\beta)$ under the Burr model and $s_k^{\mathrm{pa}}(\alpha,\beta)$ and $s_\epsilon^{\mathrm{pa}}(\alpha,\beta)$ under Pareto. The latter is the special case θ=1, and the ratios $s_k^{\mathrm{bu}}(\alpha,1,\beta)/s_k^{\mathrm{pa}}(\alpha,\beta)$ and $s_\epsilon^{\mathrm{bu}}(\alpha,1,\beta)/s_\epsilon^{\mathrm{pa}}(\alpha,\beta)$ indicate extra uncertainty when the Burr family is fitted to Pareto data. We have chosen to report
\[
D_k^{\mathrm{bu}|\mathrm{pa}}(\alpha,\beta) = \frac{s_k^{\mathrm{bu}}(\alpha,1,\beta)}{s_k^{\mathrm{pa}}(\alpha,\beta)} - 1 \quad\text{and}\quad D_\epsilon^{\mathrm{bu}|\mathrm{pa}}(\alpha,\beta) = \frac{s_\epsilon^{\mathrm{bu}}(\alpha,1,\beta)}{s_\epsilon^{\mathrm{pa}}(\alpha,\beta)} - 1, \tag{11}
\]
the degree of degradation under Burr. Mathematical expressions for the information matrix I and the gradient vector $v_k$ under the Burr and the Pareto are given in Appendix A.

When the Gamma distribution replaces the Pareto as the true, data-generating model, there is the slight complication that the true model now is on the boundary of the parameter space, since Burr tends to Gamma as $\alpha \to \infty$. The degree of degradation thus becomes
\[
D_k^{\mathrm{bu}|\mathrm{ga}}(\theta,\beta) = \lim_{\alpha\to\infty}\frac{s_k^{\mathrm{bu}}(\alpha,\theta,\beta)}{s_k^{\mathrm{ga}}(\theta,\beta)} - 1 \quad\text{and}\quad D_\epsilon^{\mathrm{bu}|\mathrm{ga}}(\theta,\beta) = \lim_{\alpha\to\infty}\frac{s_\epsilon^{\mathrm{bu}}(\alpha,\theta,\beta)}{s_\epsilon^{\mathrm{ga}}(\theta,\beta)} - 1 \tag{12}
\]
with $s_k^{\mathrm{ga}}(\theta,\beta)$ and $s_\epsilon^{\mathrm{ga}}(\theta,\beta)$ the coefficients $s_k(p)$ and $s_\epsilon(p)$ in (10), respectively, under the Gamma model. For how these quantities are calculated, consult Appendix B.2.

Neither of the criteria (11) and (12) depends on β, and only the other parameter (α in the Pareto and θ in the Gamma) needs to be varied. The general picture in the upper half of Table 3, where the Burr/Pareto and Burr/Gamma comparisons are recorded, is that the extra uncertainty from over-specification is limited, especially for the Pareto family in the upper panel. The degradation tends to be higher in models with heavier tails and when estimating moments of higher order or percentiles far out in the right tail, which appears to be a general phenomenon; more on that below.

3.3. PowerGamma against the log-normal

Let $Z = \xi e^{\sigma Y}$ with Y standard normal and $\xi, \sigma > 0$ parameters. This widely applied log-normal model (Marlin 1984, Zuanetti & Diniz 2006) is outside the Burr family, but is a special case of the PowerGamma and PowerBurr, where it is located on the boundary of their parameter spaces. Indeed, in (5), if we take $X = G_\theta$ and
\[
\eta = 2\sigma\sqrt{\theta} \quad\text{and}\quad \beta = \xi e^{-2\log(2)\,\sigma\sqrt{\theta}}, \tag{13}
\]
the log-normal appears in the limit as $\theta \to \infty$. This follows from (5) by elementary linearization manipulations, similar to those in Section 3.4, but is also a consequence of (15) below. The crucial point is that $\sqrt{\theta}(G_\theta - 1) \to_d N(0,1)$ as $\theta \to \infty$.
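A quick numerical check of this limit, under the reparameterization (13) and with illustrative values of ξ, σ and θ: for large θ the PowerGamma quantiles should be close to the corresponding log-normal ones.

```r
# PowerGamma with eta and beta from (13) versus the log-normal it approximates
set.seed(1)
theta <- 1e4; sigma <- 0.5; xi <- 1
eta  <- 2 * sigma * sqrt(theta)
beta <- xi * exp(-2 * log(2) * sigma * sqrt(theta))
z    <- beta * ((1 + rgamma(1e5, shape = theta, rate = theta))^eta - 1)
rbind(powergamma = quantile(z, c(0.5, 0.9, 0.99)),
      lognormal  = qlnorm(c(0.5, 0.9, 0.99), meanlog = log(xi), sdlog = sigma))
```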

There is a useful tilting relationship between the density function $f^{\mathrm{pg}}(z;\theta)$ of the PowerGamma and the log-normal one
\[
f^{\mathrm{ln}}(z;\xi,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, z^{-1} e^{-\log^2(z/\xi)/(2\sigma^2)} \tag{14}
\]
which is derived in Appendix C.1 and reads
\[
f^{\mathrm{pg}}(z;\theta,\xi,\sigma) = f^{\mathrm{ln}}(z;\xi,\sigma)\, e^{a(z;\xi,\sigma)\theta^{-1/2} + o(\theta^{-1/2})}, \qquad a(z;\xi,\sigma) = \frac{\log^3(z/\xi)}{12\sigma^3} - \frac{\log(z/\xi)}{2\sigma}. \tag{15}
\]
Note that the PowerGamma has now been reparameterized in terms of the parameters $p = (\theta, \xi, \sigma)$, which are related to the original parameters $p = (\theta, \eta, \beta)$ as described in (13).

Let $s_k^{\mathrm{pg}}(\theta,\xi,\sigma)$ and $s_\epsilon^{\mathrm{pg}}(\theta,\xi,\sigma)$ be the coefficients defining the asymptotic standard deviation of the estimators of $E(Z^k)$ and $q_\epsilon$ under PowerGamma and $s_k^{\mathrm{ln}}(\xi,\sigma)$ and $s_\epsilon^{\mathrm{ln}}(\xi,\sigma)$ the analogies under the log-normal. The degradations of the PowerGamma with respect to the log-normal are
\[
D_k^{\mathrm{pg}|\mathrm{ln}}(\xi,\sigma) = \lim_{\theta\to\infty}\frac{s_k^{\mathrm{pg}}(\theta,\xi,\sigma)}{s_k^{\mathrm{ln}}(\xi,\sigma)} - 1 \quad\text{and}\quad D_\epsilon^{\mathrm{pg}|\mathrm{ln}}(\xi,\sigma) = \lim_{\theta\to\infty}\frac{s_\epsilon^{\mathrm{pg}}(\theta,\xi,\sigma)}{s_\epsilon^{\mathrm{ln}}(\xi,\sigma)} - 1. \tag{16}
\]
This is similar to (12), but now a simple expression is available through the tilting relationship. It is proved in Appendix C that
\[
D_k^{\mathrm{pg}|\mathrm{ln}}(\xi,\sigma) = \left(1 + \frac{k^4\sigma^4}{6 + 3k^2\sigma^2}\right)^{1/2} - 1 \quad\text{and}\quad D_\epsilon^{\mathrm{pg}|\mathrm{ln}}(\xi,\sigma) = \left(1 + \frac{(\phi_\epsilon^2 - 1)^2}{3\phi_\epsilon^2 + 6}\right)^{1/2} - 1. \tag{17}
\]
How these quantities vary with σ is shown in Table 4. The picture is the same as in Table 3, with extra uncertainty for more volatile loss models, for moments of higher order and for percentiles far out into the right tail.

For percentiles the degradation is asymptotically the same for all log-normal distributions, in accordance with the right part of (17).
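The closed-form degradations in (17) are easy to evaluate directly; the following lines compute the quantities tabulated in Table 4 for illustrative choices of k, σ and ε.

```r
# Degradation (in per cent) from (17): moments on the left, percentiles on the right
D_moment     <- function(k, sigma) sqrt(1 + k^4 * sigma^4 / (6 + 3 * k^2 * sigma^2)) - 1
D_percentile <- function(eps) {
  phi <- qnorm(eps)
  sqrt(1 + (phi^2 - 1)^2 / (3 * phi^2 + 6)) - 1
}
round(100 * c(moment_k1 = D_moment(1, sigma = 1), percentile_99 = D_percentile(0.99)), 1)
```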

3.4. PowerBurr against the log-normal

The PowerBurr family for log-normal data has the same asymptotic uncertainty as the PowerGamma, as explained below. Such stability is also manifest in simulations with smaller n in Tables 5 and 6. One way to address the issue is to reparametrize once more to
\[
\eta = \frac{2\sigma}{\sqrt{1+\gamma}}\,\sqrt{\theta} \quad\text{with}\quad \gamma = \frac{\theta}{\alpha} \tag{18}
\]
and, as in Section 3.3,
\[
\beta = \xi e^{-\log(2)\,\eta}, \tag{19}
\]
so that γ is the fourth parameter, with θ, ξ and σ as before. Now $X = G_\theta/G_\alpha$ with $\sqrt{\theta}(G_\theta - 1) \to_d N(0,1)$ as $\theta \to \infty$ and $\sqrt{\alpha}(G_\alpha - 1) \to_d N(0,1)$ as $\alpha \to \infty$, and by linearization
\[
\sqrt{\theta}(X - 1) = \sqrt{\theta}(G_\theta - 1) - \sqrt{\gamma}\,\sqrt{\alpha}(G_\alpha - 1) + o_p(1) \to_d \sqrt{1+\gamma}\, Y \quad\text{with } Y \sim N(0,1).
\]
Hence, when inserting for β and η in $Z = \beta\{(1+X)^{\eta} - 1\}$, another round of linearization yields as $\theta \to \infty$
\[
Z = \xi e^{\eta\log(1+(X-1)/2)} - \xi e^{-\log(2)\eta} = \xi e^{\sigma\sqrt{\theta/(1+\gamma)}\,(X-1) + o_p(1)} - \xi e^{-\log(2)\eta} \to_d \xi e^{\sigma Y},
\]
and the new parametrization has led to Z becoming log-normal in the limit.

Table 5. Error (in %) relative to the true 99% portfolio reserves, shown in the heading, for the five different fitting regimes and five different loss distributions.

Table 6. Error (in %) relative to the true reinsurance premia, shown in the heading, for the five different fitting regimes and five different loss distributions.

A more refined version of this argument along the lines of Appendix B would establish a tilting relationship between the PowerBurr density function $f^{\mathrm{pb}}(z;\theta,\xi,\sigma,\gamma)$ and the log-normal one. It resembles that in (15) and is now of the form
\[
f^{\mathrm{pb}}(z;\theta,\xi,\sigma,\gamma) = f^{\mathrm{ln}}(z;\xi,\sigma)\, e^{b(z;\xi,\sigma,\gamma)\theta^{-1/2} + o(\theta^{-1/2})}. \tag{20}
\]
It can be established that the function $b(z;\xi,\sigma,\gamma)$ is an odd polynomial of degree 3 in $\log(z/\xi)$, as in (15), right, with coefficients depending on γ. However, the point is that the extra parameter γ in PowerBurr becomes irrelevant as $\theta \to \infty$, i.e. for log-normal data, and the uncertainty under PowerBurr and PowerGamma fitting then must be the same.

4. Simulation study

The asymptotic results become invalid, or at least inaccurate, when the sample size n is small. In order to explore the performance of the many-parameter distributions in cases with little available data, we have supplemented the asymptotic results with a simulation study. This also enables us to study more complex measures, such as portfolio reserves and reinsurance premia. The test case below assumes a small portfolio with λ=10 expected incidents, so that the shape of the loss distribution beyond mean and variance is of importance. The claim numbers were assumed to follow a Poisson distribution, and the issue under study is how much many-parameter loss modeling inflates error in two actuarial evaluations, namely the portfolio reserve and a reinsurance premium. When there are only n = 50 historical losses, as in the upper half of Tables 5 and 6, it is often difficult to determine which family of distributions the data came from.

Claim sizes were drawn from five different loss models, notably a Pareto, a Gamma, a Weibull, a log-Gamma and a log-Normal distribution. Their expectation was always one. Shape parameters were α=3 for Pareto, θ=2 (Gamma and Weibull), θ=5 (log-Gamma) and σ=1 (log-Normal), all corresponding to highly skewed distributions. Standard deviations varied from 0.52 (Weibull) up to 1.77 (Pareto); for precise definitions of these distributions, see Section 2.5.

We drew n = 50 or n = 5000 observations from each of the five loss distributions listed above. The former corresponds to a situation with little available data, whereas the latter is close to the asymptotic case. Then, we fitted the true distribution family, as well as the Burr, the PowerGamma and the PowerBurr distributions, to each of the simulated datasets. As a reference, we also included an alternative many-parameter distribution, namely the log-Normal-Pareto mixture distribution proposed by Scollnik (2007), with the restrictions that the density and its first derivative should be continuous at the threshold between the log-Normal and the Pareto. This results in a distribution with three free parameters: μ and σ from the log-Normal and the threshold parameter θ. All distributions were fitted by maximum likelihood, maximizing the log-likelihood numerically with the R function optim. For the Burr, PowerGamma and PowerBurr, we supplied analytically computed gradients of the log-likelihood in order to speed up the convergence. All of this was repeated m = 1000 times, which gave m different actuarial projections from each of the five fitted distributions for each sample size n. Based on these, we computed the bias (i.e. the average discrepancy from the true value) and the standard deviation.
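A stripped-down sketch of this kind of likelihood fitting for the Burr family, using the density (2) with a scale parameter and optim on log-parameters; unlike the fitting described above, no analytic gradients are supplied, and the function names and starting values are illustrative.

```r
# Burr log-density with scale beta, from (2) with x = z/beta
dburr_log <- function(z, alpha, theta, beta) {
  x <- z / beta
  lgamma(alpha + theta) - lgamma(alpha) - lgamma(theta) +
    alpha * log(alpha / theta) + (theta - 1) * log(x) -
    (alpha + theta) * log(alpha / theta + x) - log(beta)
}
fit_burr <- function(z, start = c(0, 0, 0)) {
  nll <- function(p) -sum(dburr_log(z, exp(p[1]), exp(p[2]), exp(p[3])))
  setNames(exp(optim(start, nll)$par), c("alpha", "theta", "beta"))
}
set.seed(1)
z <- 2 * rgamma(5000, 2, 2) / rgamma(5000, 3, 3)   # Burr data: theta = 2, alpha = 3, beta = 2
fit_burr(z)
```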

Table 5 shows the results for the 99% reserve of the Poisson portfolio with λ=10 expected incidents, presented as percentages of the true assessments. The table should be read columnwise, since the point of the exercise is to compare errors when the true family of distributions has been fitted to the data with what we get under the many-parameter fitting regimes. Note that n = 50 historical losses yield huge inaccuracies in any case, but still the biases and standard deviations do not increase by much when many-parameter families have been fitted. Indeed, from these assessments there is no particular reason to prefer the projections from the correct two-parameter family, as the bias and standard deviation when using the Burr, the PowerGamma or the PowerBurr distributions are either comparable to the ones from using the true distribution family, or even lower. The results in the Gamma column are peculiar. Errors in per cent are large, and there is a huge downward bias even when n = 5000 losses were available; this bias is actually substantially smaller with the many-parameter families than with the Gamma family itself. There is no mistake; the results have been checked thoroughly. Note also that the PowerGamma overall gave results that were better than the Burr distribution, and the PowerBurr again slightly better. As for the log-Normal-Pareto mixture, it gives a very low bias and standard deviation when the amount of data is large (n = 5000) and the true claim size model is log-Normal, as one would expect. Indeed, the fitted distributions in this case are almost the same as the log-Normal, with a very high threshold parameter. It also performs well when the true distribution is log-Gamma, which is quite similar to the log-Normal. In all other cases, and in particular when the amount of data is smaller (n = 50), the log-Normal-Pareto mixture gives rather poor results compared to the Burr, PowerGamma and PowerBurr.

A second experiment, recorded in Table 6, confirms the picture. The portfolio was the same as before, but now the target was the pure premium when an a×b layer of the total risk was ceded to a reinsurance company. The limits a and b were the 75% and the 90% quantiles of the distribution of the total portfolio loss, respectively. The pure premium was then computed as the expectation of the ceded loss. Again, the errors when using the many-parameter distributions to estimate the reinsurance premium are comparable to the ones from using the true distribution family. In fact, the biases tend to be lower for the many-parameter distributions, whereas the standard deviations are more or less the same. One should keep in mind that in practice the true distribution family is unknown, and so one must expect model error in addition to the random one, which would increase the advantage of using a many-parameter distribution further. Moreover, the performance of the PowerGamma was again overall better than that of the Burr, and the PowerGamma gave all in all the best results of the five loss distributions. The results using the log-Normal-Pareto mixture are similar to the ones for the reserve in Table 5.
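A sketch of this premium calculation by simulation, under one common reading of the a×b layer (the part of the total loss S between a and b is ceded); the claim sampler and λ below are illustrative stand-ins for the setup in the text.

```r
# Pure reinsurance premium: E[min(max(S - a, 0), b - a)] with a, b the 75% and 90% quantiles of S
ceded_premium <- function(lambda, rclaim, lower = 0.75, upper = 0.90, nsim = 1e5) {
  S <- replicate(nsim, sum(rclaim(rpois(1, lambda))))
  a <- quantile(S, lower)
  b <- quantile(S, upper)
  mean(pmin(pmax(S - a, 0), b - a))
}
set.seed(1)
ceded_premium(10, function(n) rgamma(n, shape = 2, rate = 2))
```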

We also ran simulations for the 99.5% reserve, as well as with λ=100 expected claims. The results from these simulations, which are shown in the supplementary material, follow the same patterns as the ones presented in the paper. Further, we have also run simulations with the negative binomial as claim frequency distribution, which in practice is a more realistic model. Again, we let the expected number of claims be λ=10 and the number of observations be n = 50, but the standard deviation of the claim number was now 3.5. The corresponding results for the 99% reserve and the reinsurance premium, shown in the supplementary material, are very similar to the ones for Poisson distributed claim numbers.

5. Example: automobile insurance

The example is taken from Bølviken (2014) and consists of n = 6446 claims due to personal injuries in motor insurance, with the deductible subtracted. This means that the claims are left-truncated at the deductible, but since the deductible is subtracted, the distribution still starts at 0, and we may use the distributions from Section 4 without modifications. The claims are heavy-tailed with skewness 5.6 and kurtosis 71.2. The data are around 25 years old with mean and standard deviation 23.9 and 28.9 (in Norwegian kroner, NOK, about ten to one EUR), which would have been higher today. Among the five two-parameter distributions considered in this paper, the log-Normal was the only one providing a reasonable fit to the data, as judged from a Q-Q-plot, though it does not quite capture the largest claims. Its parameters are shown in the right panel of Figure 1, along with the mean, the standard deviation and the skewness. We also estimated the 99% reserve, assuming a Poisson distributed claim number with λ=10 expected incidents, based on ten million simulations. This is shown in the same table, and so are corresponding results for the Burr, the PowerGamma and the PowerBurr families. As in the simulations of Section 4, we included the log-Normal-Pareto mixture as a reference. The results from the different distributions are quite similar, except for the skewness. The discrepancy between the smallest and largest projection of the reserve is no more than 2.5%.

Figure 1. Five families of distributions fitted to the automobile data with Q-Q-plots on the left and parameter estimates and statistical summary on the right.


Q-Q-plots of the five fitted distributions are shown in the left panel of Figure 1. They suggest that all five distributions fit the automobile insurance data reasonably well, except for the PowerGamma, and also the log-Normal, far out in the tail. The downward curvature corresponds to an underestimation of the risk. This is reflected in a lower skewness than for the other families of distributions, and for the PowerGamma also a lower standard deviation. In this case, a Pareto element in the multi-parameter modeling, capturing the heavy tail of the data, seems required for a good fit. This is provided by both the Burr and the PowerBurr distributions, as they both have the Pareto distribution as a special case. For the same reason, the log-Normal-Pareto also gives a better fit to the data than the log-Normal in the tail.

As the Norwegian car insurance dataset is quite large, we also wanted to try the fit to a much smaller one. Therefore, we drew a random sub-sample of 64 claims from the automobile insurance data, corresponding to 1/100 of its size. The resulting sample had a mean of 23.8, a standard deviation of 19.7 and a skewness of 1.3. We fitted the same five distributions as to the original data, and the corresponding results are shown in Figure 2. The Burr, PowerGamma and PowerBurr all provide a rather good fit to the data, also in the tail, as seen from the Q-Q-plot to the left. The fitted log-Normal is now much too heavy-tailed, which is also reflected in the corresponding standard deviation. In this case, the fitted log-Normal-Pareto mixture is essentially the same as the log-Normal, with the same parameters for the log-Normal part and a very high threshold parameter, and therefore does not fit the data particularly well.

Figure 2. Five families of distributions fitted to a random sample of 64 observations from the automobile data with Q-Q-plots on the left and parameter estimates and statistical summary on the right.


6. Concluding remarks

The Box-Cox transformed Gamma and Burr variables lead to models with many of the usual two-parameter families included as special cases, exactly or approximately. If the true distribution is on the boundary of the parameter space, as in the case of the log-normal, the best fit may involve very large (or small) parameters, but there is nothing wrong in using the computer solution as it is, even if it could have been interpreted as some simpler sub-model. This is the convenient way of programming a completely automatized procedure with the computer converting loss data to actuarial evaluations on its own.

In unimodal families fitted distributions are much more stable than parameter estimates, and over-parameterization did not raise errors in actuarial projections substantially unless the parent distribution was very heavy-tailed. This was demonstrated by Monte Carlo and also by asymptotics as the number of historical losses $n \to \infty$. Much of the latter argument used the log-normal as a vehicle, and its special properties allowed us to derive simple mathematical expressions for asymptotic errors, and in particular to prove that they are the same under PowerGamma and PowerBurr.

Supplemental material


Acknowledgments

The authors wish to thank the two Reviewers and the Associate Editor for their helpful and constructive comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Bakar S., Hamzah N., Maghsoudi M. & Nadarajah S. (2015). Modeling loss data using composite models. Insurance: Mathematics and Economics 61, 146–154.
  • Beirlant J., Teugels J. & Vynckier P. (1996). Practical analysis of extreme values. Leuven University Press.
  • Bølviken E. (2014). Computation and modelling in insurance and finance. Cambridge University Press.
  • Box G. & Cox D. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society Series B 26, 231–252.
  • Burr I. W. (1942). Cumulative frequency functions. Annals of Mathematical Statistics 13, 215–232.
  • Chan J. S. K., Choy S. T. B., Makov U. R. & Landsman Z. (2018). Modelling insurance losses using contaminated, generalized beta type II distributions. ASTIN Bulletin 48(02), 871–904.
  • Cooray K. & Ananda M. M. A. (2005). Modeling actuarial data with composite lognormal-Pareto model. Scandinavian Actuarial Journal 2005, 642–660.
  • Cummins J. D., Dionne G., MacDonald J. B. & Pritchett B. M. (1990). Applications of the GB2 family of distributions in modeling insurance loss processes. Insurance: Mathematics and Economics 9(4), 257–272.
  • Embrechts P., Klüppelberg C. & Mikosch T. (1997). Modelling extremal events. Stochastic Modelling and Applied Probability. Springer Verlag.
  • Gui W., Huang R. & Lin X. S. (2021). Fitting multivariate Erlang mixtures to data: a roughness penalty approach. Journal of Computational and Applied Mathematics 386, 113216.
  • Hoeting J. A., Madigan D., Raftery A. E. & Volinsky C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science 14(4), 382–401.
  • Huber P. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability. University of California Press.
  • Kleiber C. & Kotz S. (2003). Statistical size distributions in economics and actuarial sciences. Wiley.
  • Lee S. & Lin X. (2010). Modeling and evaluating insurance losses via mixtures of Erlang distributions. North American Actuarial Journal 14(1), 107–130.
  • Marlin P. (1984). Fitting the log-normal distribution to loss data subject to multiple deductibles. The Journal of Risk and Insurance 51(4), 687–701.
  • Miljkovic T. & Grün B. (2016). Modeling loss data using mixtures of distributions. Insurance: Mathematics and Economics 70, 387–396.
  • Omari C. O., Nyambura S. G. & Mwangi J. M. W. (2018). Modeling the frequency and severity of auto insurance claims using statistical distributions. Journal of Mathematical Finance 8(1), 137–160.
  • Reynkens T., Verbelen R., Beirlant J. & Antonio K. (2017). Modelling censored losses using splicing: a global fit strategy with mixed Erlang and extreme value distributions. Insurance Mathematics & Economics 77, 65–77.
  • Scollnik D. P. M. (2007). On composite lognormal-Pareto distributions. Scandinavian Actuarial Journal 2007(1), 20–33.
  • Scollnik D. P. M. & Sun C. (2012). Modelling with Weibull-Pareto models. North American Actuarial Journal 16(2), 260–272.
  • Scott D. W. (1992). Multivariate density estimation: theory, practice and visualization. John Wiley & Sons.
  • Venter G. (1985). Transformed Beta and Gamma distributions and aggregate losses. Proceedings of the Casualty Actuarial Society 70, 156–193.
  • Verbelen R., Gong L., Antonio K., Badescu A. & Lin S. (2015). Fitting extremes of Erlangs to censored and truncated data using the EM algorithm. ASTIN Bulletin 45, 729–758.
  • White H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50(1), 1–25.
  • White H. (1994). Estimation, inference and specification analysis. Cambridge: Cambridge University Press.
  • Zuanetti D. & Diniz C. (2006). A lognormal model for insurance claims data. REVSTAT – Statistical Journal 4, 131–142.

Appendices

Appendix A.

Burr uncertainty for Pareto data

Section 3.2 required the information matrix for the Burr family and the gradient vectors for its moments and percentiles. If $I^{\mathrm{bu}}$ is the information matrix with elements $I^{\mathrm{bu}}_{\alpha\alpha}$, $I^{\mathrm{bu}}_{\alpha\theta}$ and so on, then
\[
\begin{aligned}
I^{\mathrm{bu}}_{\alpha\alpha} &= \psi_1(\alpha) - \psi_1(\theta+\alpha) - \frac{\theta}{\alpha}h_{\alpha\theta}, &
I^{\mathrm{bu}}_{\alpha\theta} &= -\psi_1(\theta+\alpha) + h_{\alpha\theta}, &
I^{\mathrm{bu}}_{\alpha\beta} &= -\frac{\theta}{(\theta+\alpha)(\theta+\alpha+1)}\,\beta^{-1},\\
I^{\mathrm{bu}}_{\theta\theta} &= \psi_1(\theta) - \psi_1(\theta+\alpha) - \frac{\alpha}{\theta}h_{\alpha\theta}, &
I^{\mathrm{bu}}_{\theta\beta} &= \frac{\alpha}{(\theta+\alpha)(\theta+\alpha+1)}\,\beta^{-1}, &
I^{\mathrm{bu}}_{\beta\beta} &= \frac{\theta\alpha}{\theta+\alpha+1}\,\beta^{-2}
\end{aligned}
\]
where
\[
\psi_1(z) = \frac{d^2\log\Gamma(z)}{dz^2} \quad\text{and}\quad h_{\alpha\theta} = \frac{\theta+\alpha+2}{(\theta+\alpha)(\theta+\alpha+1)}.
\]
The derivation of $I^{\mathrm{bu}}$ from the density function (2) is straightforward (albeit lengthy) and is omitted.

The gradient vectors for moments and percentiles are most conveniently calculated by differentiating their expressions numerically. The k'th order moment was given in (3), whereas for the percentiles there is a simple relationship between Burr and Beta variables that can be utilized. Let $b_{\epsilon\alpha\theta}$ be the ϵ-percentile of the Beta distribution with shape parameters α and θ, available in standard software. Then $\alpha(b_{\epsilon\alpha\theta}^{-1} - 1)\beta/\theta$ is the corresponding Burr percentile.

Appendix B.

Asymptotics in tilting models

B.1. General

Several of the uncertainty evaluations in Section 3 apply to models with density functions of the form
\[
f(z;\nu,p) = f(z;p)\, e^{a(z;p)\nu + o(\nu)} \tag{A1}
\]
with ν a tilting parameter adding to the parameter vector p and with $a(z;p)$ a known function. Note that ν is a transformation of the parameters in Section 3, but when using likelihoods this does not affect uncertainty. The idea is to study how much the asymptotic standard deviation of moment and percentile estimates goes up when we fit $f(z;\nu,p)$ to data generated by its limit $f(z;p)$ as $\nu \to 0$. This requires the information matrix $I^{\nu}$ under $f(z;\nu,p)$ and the gradient vector $v^{\nu}$ of the quantity $\psi = \psi(\nu,p)$ of interest, again in the limit as $\nu \to 0$. Notation with superscript ν, for example $E^{\nu}$ and $q^{\nu}_\epsilon$ for expectation and ϵ-percentile, applies to $f(z;\nu,p)$ whereas E and $q_\epsilon$ without ν refer to $f(z;p)$.

The information matrix $I^{\nu}$ and gradient vector $v^{\nu}$ under $f(z;\nu,p)$ are of the form
\[
I^{\nu} = \begin{pmatrix} I^{\nu}_{\nu\nu} & I^{\nu}_{\nu p} \\ I^{\nu}_{p\nu} & I^{\nu}_{pp} \end{pmatrix} \quad\text{and}\quad v^{\nu} = (v^{\nu}_\nu, v^{\nu}_p) \tag{A2}
\]
with $I^{\nu}_{\nu p}$ and $v^{\nu}_p$ row vectors. A recipe for calculating the limits of these quantities is
\[
\lim_{\nu\to 0} I^{\nu}_{pp} = I_{pp} \quad\text{and}\quad \lim_{\nu\to 0} v^{\nu}_p = v_p, \tag{A3}
\]
\[
\lim_{\nu\to 0} I^{\nu}_{\nu\nu} = E\{a(Z;p)^2\} \quad\text{and}\quad \lim_{\nu\to 0} I^{\nu}_{\nu p} = E\!\left(a(Z;p)\,\frac{\partial\log f(Z;p)}{\partial p}\right), \tag{A4}
\]
\[
\lim_{\nu\to 0} v^{\nu}_\nu = E\{a(Z;p)Z^k\} \quad\text{if}\quad \psi = E^{\nu}(Z^k), \tag{A5}
\]
\[
\lim_{\nu\to 0} v^{\nu}_\nu = -\frac{\int_0^{q_\epsilon} a(z;p)f(z;p)\,dz}{f(q_\epsilon;p)} \quad\text{if}\quad \psi = q^{\nu}_\epsilon. \tag{A6}
\]
To verify this, start by recalling that $I_{pp}$ and $v_p$ in (A3) are the information matrix and gradient vector under $f(z;p)$, and become the limits as stated since $f(z;\nu,p) \to f(z;p)$ when $\nu \to 0$. Next note that
\[
I^{\nu}_{\nu\nu} = E^{\nu}\!\left(\frac{\partial\log f(z;\nu,p)}{\partial\nu}\right)^2 \quad\text{and}\quad I^{\nu}_{\nu p} = E^{\nu}\!\left(\frac{\partial\log f(z;\nu,p)}{\partial\nu}\,\frac{\partial\log f(z;\nu,p)}{\partial p}\right),
\]
and inserting (A1) implies (A4). The limits for the gradients remain. The k'th order moment is
\[
E^{\nu}(Z^k) = \int_0^{\infty} z^k f(z;\nu,p)\,dz,
\]
and replacing the density function by (A1) and then differentiating leads to (A5). Finally, the ϵ-percentile $q^{\nu}_\epsilon$ is from (A1) the solution of
\[
\int_0^{q^{\nu}_\epsilon} f(z;p)\, e^{a(z;p)\nu + o(\nu)}\,dz = \epsilon
\]
which when differentiated with respect to ν yields
\[
f(q^{\nu}_\epsilon;p)\,\frac{\partial q^{\nu}_\epsilon}{\partial\nu} + \int_0^{q^{\nu}_\epsilon} f(z;p)a(z;p)\,dz + o(1) = 0
\]
and (A6) follows since $q^{\nu}_\epsilon \to q_\epsilon$.

B.2. Burr uncertainty for Gamma data

The asymptotics when the Burr model is fitted to Gamma data are covered by Section B.1. Start with the expression for the Burr density function in (2). It applies when β=1, but when it is rewritten for general β and with ν=1/α replacing α, it becomes
\[
f^{\mathrm{bu}}(z;\nu,\theta,\beta) = \frac{\Gamma(\nu^{-1}+\theta)}{\Gamma(\nu^{-1})\Gamma(\theta)}\,\frac{(\nu\theta)^{\theta}}{\beta}\,\frac{(z/\beta)^{\theta-1}}{(1+\nu\theta z/\beta)^{\nu^{-1}+\theta}}.
\]
Its limit when $\alpha \to \infty$, or equivalently $\nu \to 0$, is the Gamma density function
\[
f^{\mathrm{ga}}(z;\theta,\beta) = \frac{\theta^{\theta}}{\Gamma(\theta)}\,\frac{1}{\beta}\,(z/\beta)^{\theta-1} e^{-\theta z/\beta},
\]
and it is shown below that $f^{\mathrm{bu}}(z;\nu,\theta,\beta)$ and $f^{\mathrm{ga}}(z;\theta,\beta)$ are connected by the tilting relationship (A1) with
\[
a(z;\theta,\beta) = \frac{1}{2}\theta^2\left(\frac{z}{\beta}\right)^2 - \theta^2\,\frac{z}{\beta} + \frac{1}{2}(\theta^2 - \theta). \tag{A7}
\]
Table 3 was computed by inserting this function into (A3)–(A6). Calculations were numerical since simple mathematical expressions are not available.

To verify (A7), note that the logarithm of the ratio between $f^{\mathrm{bu}}(z;\nu,\theta,\beta)$ and $f^{\mathrm{ga}}(z;\theta,\beta)$ is
\[
\log\!\left(\frac{f^{\mathrm{bu}}(z;\nu,\theta,\beta)}{f^{\mathrm{ga}}(z;\theta,\beta)}\right) = \log\Gamma(\nu^{-1}+\theta) - \log\Gamma(\nu^{-1}) + \theta\log(\nu) - (\nu^{-1}+\theta)\log(1+\nu\theta z/\beta) + \theta z/\beta.
\]
The two log-Gamma functions can be simplified through the tail approximation
\[
\log\Gamma(z) = (z - 1/2)\log(z) - z + \log(\sqrt{2\pi}) + o(z^{-1/2}) \tag{A8}
\]
whereas the Taylor series for $\log(1+z)$ yields
\[
\log(1+\nu\theta z/\beta) = \nu\theta z/\beta - \tfrac{1}{2}(\nu\theta z/\beta)^2 + o(\nu^2),
\]
and after some simple calculations
\[
\log\!\left(\frac{f^{\mathrm{bu}}(z;\nu,\theta,\beta)}{f^{\mathrm{ga}}(z;\theta,\beta)}\right) = \nu\, a(z;\theta,\beta) + o(\nu)
\]
with $a(z;\theta,\beta)$ as in (A7).

Appendix C.

PowerGamma asymptotics

C.1. Tilting against the log-normal

The PowerGamma variable $Z = \beta\{(1+G_\theta)^{\eta} - 1\}$ has the density function
\[
f^{\mathrm{pg}}(z;\theta,\eta,\beta) = \frac{\theta^{\theta}}{\Gamma(\theta)}\,\frac{1}{\beta\eta}\, x^{\theta-1} e^{-\theta x}(1+x)^{1-\eta} \quad\text{where}\quad x = (1+z/\beta)^{1/\eta} - 1. \tag{A9}
\]
It was claimed in Section 3.3 that it can, under a suitable reparameterization, be tilted against a log-normal density function as $\theta \to \infty$. Introduce new parameters
\[
\nu = 1/\sqrt{\theta}, \qquad \beta = \xi e^{-\log(4)\sigma/\nu}, \qquad \eta = 2\sigma/\nu
\]
so that limits are in terms of $\nu \to 0$. With the new parameters the density function in (A9) is on log-form
\[
\log f^{\mathrm{pg}}(z;\nu,\xi,\sigma) = c_\nu + (\nu^{-2} - 1)\log(x) - \nu^{-2}x + (1 - 2\sigma/\nu)\log(1+x) \tag{A10}
\]
where
\[
c_\nu = -2\nu^{-2}\log(\nu) - \log\Gamma(\nu^{-2}) + \log(4)\sigma\nu^{-1} + \log(\nu) - \log(2\sigma\xi).
\]
By invoking the tail series (A8) for the log-Gamma function, $c_\nu$ reduces to
\[
c_\nu = \nu^{-2} + \log(4)\sigma\nu^{-1} - \log(\sqrt{8\pi}\,\sigma\xi) + o(\nu). \tag{A11}
\]
To simplify the other terms in (A10), start by noting that (A9), right, yields
\[
\log(1+x) = \log(1+z/\beta)/\eta = \log(z/\beta)/\eta + \log(1+\beta/z)/\eta
\]
where the second term on the right, through β, tends to 0 at exponential rate as $\nu \to 0$. Hence, after inserting for η and β,
\[
\log(1+x) = \log(2) + \frac{y}{2}\nu + o(\nu^3) \quad\text{where}\quad y = \log(z/\xi)/\sigma. \tag{A12}
\]
This will be the vehicle for much of the calculations to follow. Utilize that
\[
x = e^{\log(1+x)} - 1 = 2e^{(y/2)\nu + o(\nu^3)} - 1
\]
which from the Taylor series of the exponential function becomes
\[
x = 1 + y\nu + \frac{y^2}{4}\nu^2 + \frac{y^3}{24}\nu^3 + o(\nu^3). \tag{A13}
\]
Take logarithms on both sides and use the Taylor series of the logarithmic function. Some straightforward calculations and rearrangements then lead to
\[
\log(x) = y\nu - \frac{y^2}{4}\nu^2 + \frac{y^3}{8}\nu^3 + o(\nu^3). \tag{A14}
\]
Finally, when inserting (A11)–(A14) into (A10), a lot of terms cancel, and it follows after some straightforward manipulations that
\[
\log f^{\mathrm{pg}}(z;\nu,\xi,\sigma) = -\log(\sqrt{2\pi}\,\sigma\xi) - \sigma y - \frac{1}{2}y^2 + \left(\frac{y^3}{12} - \frac{y}{2}\right)\nu + o(\nu)
\]
or
\[
\log f^{\mathrm{pg}}(z;\nu,\xi,\sigma) = \log f^{\mathrm{ln}}(z;\xi,\sigma) + \left(\frac{\log^3(z/\xi)}{12\sigma^3} - \frac{\log(z/\xi)}{2\sigma}\right)\nu + o(\nu)
\]
after inserting $y = \log(z/\xi)/\sigma$. This is the tilting relation (15).

C.2. The information matrix

The information matrix when the PowerGamma family is fitted to log-normal data is
\[
\lim_{\nu\to 0} I^{\mathrm{pg}} = \begin{pmatrix} 5/48 & -1/(4\xi\sigma) & 0 \\ -1/(4\xi\sigma) & 1/(\xi^2\sigma^2) & 0 \\ 0 & 0 & 2/\sigma^2 \end{pmatrix} \tag{A15}
\]
To verify this, start by noting that when Z is log-normal, then $Y = \log(Z/\xi)/\sigma$ is $N(0,1)$ with second, fourth and sixth order moments 1, 3 and 15. Hence (A4), left, and (15), right, yield
\[
\lim_{\nu\to 0} I^{\mathrm{pg}}_{\nu\nu} = E\left\{\left(\frac{Y^3}{12} - \frac{Y}{2}\right)^2\right\} = E\left(\frac{Y^2}{4} - \frac{Y^4}{12} + \frac{Y^6}{144}\right) = \frac{5}{48}.
\]
Next, from (14),
\[
\frac{\partial\log f^{\mathrm{ln}}(Z;\xi,\sigma)}{\partial\xi} = \frac{\log(Z/\xi)}{\sigma}\cdot\frac{1}{\xi\sigma} = \frac{Y}{\xi\sigma}, \qquad
\frac{\partial\log f^{\mathrm{ln}}(Z;\xi,\sigma)}{\partial\sigma} = \left(\frac{\log^2(Z/\xi)}{\sigma^2} - 1\right)\frac{1}{\sigma} = \frac{Y^2 - 1}{\sigma}
\]
so that by (A4), right,
\[
\lim_{\nu\to 0} I^{\mathrm{pg}}_{\nu\xi} = E\left\{\left(\frac{Y^3}{12} - \frac{Y}{2}\right)\frac{Y}{\xi\sigma}\right\} = -\frac{1}{4\xi\sigma}, \qquad
\lim_{\nu\to 0} I^{\mathrm{pg}}_{\nu\sigma} = E\left\{\left(\frac{Y^3}{12} - \frac{Y}{2}\right)\frac{Y^2 - 1}{\sigma}\right\} = 0
\]
where the right hand side vanishes in expectation since all powers of Y are odd when multiplied out. This establishes the first row and column in (A15), and the rest are elementary results for the log-normal.

C.3. Moment estimation

The moment of order k under the log-normal is
\[
\mu_k = \xi^k e^{k^2\sigma^2/2}, \tag{A16}
\]
and we seek the asymptotic standard deviation of its estimate when the PowerGamma model has been fitted. We need the inverse of the matrix (A15), which is
\[
\lim_{\nu\to 0}(I^{\mathrm{pg}})^{-1} = \begin{pmatrix} 24 & 6\xi\sigma & 0 \\ 6\xi\sigma & 5\sigma^2\xi^2/2 & 0 \\ 0 & 0 & \sigma^2/2 \end{pmatrix} \tag{A17}
\]
and also the limit of the gradient $v^{\mathrm{pg}} = (v^{\mathrm{pg}}_\nu, v^{\mathrm{pg}}_\xi, v^{\mathrm{pg}}_\sigma)^T$. The latter is
\[
\lim_{\nu\to 0} v^{\mathrm{pg}}_\nu = \left(\frac{k^3\sigma^3}{12} - \frac{k\sigma}{4}\right)\mu_k, \qquad \lim_{\nu\to 0} v^{\mathrm{pg}}_\xi = \frac{k}{\xi}\mu_k, \qquad \lim_{\nu\to 0} v^{\mathrm{pg}}_\sigma = k^2\sigma\mu_k. \tag{A18}
\]
The limits for $v^{\mathrm{pg}}_\xi$ and $v^{\mathrm{pg}}_\sigma$ follow by differentiating (A16), whereas the one for $v^{\mathrm{pg}}_\nu$ is determined from (A5) and (15), right, which yield
\[
\lim_{\nu\to 0} v^{\mathrm{pg}}_\nu = E\{a(Z;\xi,\sigma)Z^k\} = \xi^k E\left\{\left(\frac{Y^3}{12} - \frac{Y}{2}\right)e^{k\sigma Y}\right\},
\]
and an elementary calculation utilizing that $Y \sim N(0,1)$ closes the argument.

Let $s^{\mathrm{pg}}_k(\xi,\sigma)/\sqrt{n}$ be the asymptotic standard deviation for the moment estimate when fitting the PowerGamma model to log-normal data. Then $s^{\mathrm{pg}}_k(\xi,\sigma)$ is the limit of the square root of $v^{\mathrm{pg}}(I^{\mathrm{pg}})^{-1}(v^{\mathrm{pg}})^T$ and becomes
\[
s^{\mathrm{pg}}_k(\xi,\sigma) = \left(k^2\sigma^2 + \frac{k^4\sigma^4}{2} + \frac{k^6\sigma^6}{6}\right)^{1/2}\mu_k
\]
which is to be compared with its counterpart $s^{\mathrm{ln}}_k(\xi,\sigma)$ under log-normal fitting. Now the information matrix is the $2\times 2$ block in the lower, right corner of (A15) and the gradient the last two elements in (A18), so that
\[
s^{\mathrm{ln}}_k(\xi,\sigma) = \left(k^2\sigma^2 + \frac{k^4\sigma^4}{2}\right)^{1/2}\mu_k.
\]
Hence
\[
\left(\frac{s^{\mathrm{pg}}_k(\xi,\sigma)}{s^{\mathrm{ln}}_k(\xi,\sigma)}\right)^2 = 1 + \frac{k^4\sigma^4}{6 + 3k^2\sigma^2},
\]
leading to (17), left.

C.4. Percentile estimation

The quantity we are now trying to estimate is
\[
q_\epsilon = \xi e^{\sigma\phi_\epsilon} \tag{A19}
\]
with $\phi_\epsilon$ the ϵ-percentile of the standard normal. The study of asymptotic uncertainty is the same as above except for the gradient, which has become
\[
\lim_{\nu\to 0} v^{\mathrm{pg}}_\nu = -\frac{1}{3}\left(1 - \frac{\phi_\epsilon^2}{4}\right)\sigma q_\epsilon, \qquad \lim_{\nu\to 0} v^{\mathrm{pg}}_\xi = \frac{1}{\xi}q_\epsilon, \qquad \lim_{\nu\to 0} v^{\mathrm{pg}}_\sigma = \phi_\epsilon q_\epsilon. \tag{A20}
\]
Only the expression for $v^{\mathrm{pg}}_\nu$ requires a calculation since the rest are obvious consequences of differentiating (A19). By (A6)
\[
\lim_{\nu\to 0} v^{\mathrm{pg}}_\nu = -\frac{\int_0^{q_\epsilon} f^{\mathrm{ln}}(z;\xi,\sigma)a(z;\xi,\sigma)\,dz}{f^{\mathrm{ln}}(q_\epsilon;\xi,\sigma)}
\]
where $f^{\mathrm{ln}}(z;\xi,\sigma)$ is the log-normal density function above and $a(z;\xi,\sigma)$ is the function in (15), right. The integral is evaluated by substituting $y = \log(z/\xi)/\sigma$. With $\varphi(y) = (2\pi)^{-1/2}e^{-y^2/2}$ this yields
\[
\int_0^{q_\epsilon} f^{\mathrm{ln}}(z;\xi,\sigma)a(z;\xi,\sigma)\,dz = \int_{-\infty}^{\phi_\epsilon}\left(\frac{y^3}{12} - \frac{y}{2}\right)\varphi(y)\,dy = \frac{1}{3}\left(1 - \frac{\phi_\epsilon^2}{4}\right)\varphi(\phi_\epsilon) \tag{A21}
\]
whereas
\[
f^{\mathrm{ln}}(q_\epsilon;\xi,\sigma) = \frac{\varphi(\phi_\epsilon)}{\sigma q_\epsilon},
\]
and (A20), left, follows.

The quadratic form when combining the vector (A20) with the matrix (A17) yields the coefficient $s^{\mathrm{pg}}_\epsilon(\xi,\sigma)$ as
\[
s^{\mathrm{pg}}_\epsilon(\xi,\sigma) = \left(\frac{24}{9}\left(1 - \frac{\phi_\epsilon^2}{4}\right)^2 - 4\left(1 - \frac{\phi_\epsilon^2}{4}\right) + \frac{5}{2} + \frac{1}{2}\phi_\epsilon^2\right)^{1/2}\sigma q_\epsilon
\]
whereas for the log-normal the similar quantity is
\[
s^{\mathrm{ln}}_\epsilon(\xi,\sigma) = \left(1 + \frac{\phi_\epsilon^2}{2}\right)^{1/2}\sigma q_\epsilon
\]
after using the last two elements in (A20) as the gradient vector and the lower right $2\times 2$ block in (A15) as information matrix. It follows that
\[
\{s^{\mathrm{pg}}_\epsilon(\xi,\sigma)\}^2 - \{s^{\mathrm{ln}}_\epsilon(\xi,\sigma)\}^2 = \frac{1}{6}(\phi_\epsilon^2 - 1)^2\sigma^2 q_\epsilon^2
\]
so that
\[
\frac{s^{\mathrm{pg}}_\epsilon(\xi,\sigma)}{s^{\mathrm{ln}}_\epsilon(\xi,\sigma)} = \left(1 + \frac{(\phi_\epsilon^2 - 1)^2}{3\phi_\epsilon^2 + 6}\right)^{1/2}
\]
as claimed in (17), right.

Appendix D.

PowerBurr density

The PowerBurr model defined in (5), right, has density function
\[
f^{\mathrm{pb}}(z;\alpha,\theta,\eta,\beta) = \frac{\Gamma(\theta+\alpha)}{\Gamma(\alpha)\Gamma(\theta)}\left(\frac{\theta}{\alpha}\right)^{\theta}\frac{1}{\eta\beta}\,\frac{x^{\theta-1}(1+x)^{1-\eta}}{(1+\theta x/\alpha)^{\alpha+\theta}}, \qquad x = (1+z/\beta)^{1/\eta} - 1. \tag{A22}
\]