
Maximum leave-one-out likelihood method for the location parameter of variance gamma distribution with unbounded density

Pages 2642-2671 | Received 24 Jun 2022, Accepted 04 Apr 2023, Published online: 18 Jul 2023

Abstract

Maximum likelihood estimation of a location parameter fails when the density has an unbounded mode. An alternative approach is considered by leaving out a data point to avoid the unbounded density in the full likelihood. This modification gives rise to the leave-one-out (LOO) likelihood. We propose an expectation-conditional maximisation (ECM) algorithm to estimate the parameters of the variance gamma (VG) distribution with unbounded density by maximising the LOO likelihood. The VG distribution has a normal mean-variance mixture representation that facilitates the ECM algorithm. K. Podgórski and J. Wallin [Maximizing leave-one-out likelihood for the location parameter of unbounded densities. Ann Inst Stat Math. 2015;67(1):19-38] showed that the location parameter estimate which maximises the LOO likelihood is consistent and super-efficient. In the case of repeated data, the LOO likelihood is still unbounded, and we extend it to the weighted LOO (WLOO) likelihood. We perform simulations to investigate the accuracy of the ECM method using the WLOO likelihood and compare it with other likelihood methods, including the alternating ECM with fixed and adaptive caps on the unbounded density, the ghyp package using multi-cycle ECM, and the ECM using the LOO likelihood. Lastly, we explore the asymptotic properties of the location parameter estimate of the VG distribution, such as the optimal convergence rate and asymptotic distribution.

1. Introduction

Asymptotic properties of maximum likelihood estimation for location parameters have been studied for the case when the likelihood is non-differentiable but bounded [1]. However, this method breaks down when the likelihood is unbounded at certain points. Alternative methods for the unbounded density case have been considered in Refs. [2,3] and in Ref. [4], where consistency of the location estimate was proved using a Bayesian approach.

Under the likelihood approach, however, modifications to the full likelihood are necessary since the unbounded points occur at the data points. A possible solution is to leave out the data point closest to the location parameter in the full likelihood. This modification leads to a concept known as the leave-one-out (LOO) likelihood proposed by Podgórski and Wallin [5]. They proved consistency and super-efficiency of the location estimate that maximises the LOO likelihood when the density is unbounded. More precisely, under certain regularity conditions, they found an upper bound for the index of the convergence rate of the location parameter estimate (see Section 3.2 for the definition of the convergence rate and its index). Currently, there are no theoretical results regarding the optimal convergence rate and the asymptotic distribution of the maximum LOO likelihood estimate for the location parameter when the density is unbounded or cusped. Since the theoretical derivation of these two asymptotic properties is very challenging, we propose to perform numerical simulations to investigate the properties and obtain some insights for the theoretical development.

To demonstrate the idea, we consider multivariate skewed variance gamma (VG) distributions with unbounded densities when the shape parameter $\nu\leq d/2$, where $d$ is the dimension of the multivariate data. To estimate the parameters of VG distributions with unbounded densities, Nitithumbundit and Chan [6] proposed an expectation-conditional maximisation (ECM) algorithm [7] in which they capped the density within a small interval of the location parameter. Depending on whether the conditional density or the actual density is used to estimate the shape parameter, the ECM algorithm can be further classified into multi-cycle ECM (MCECM) and ECM either (ECME) [8,9]. In this paper, the ECME algorithm is still called the ECM algorithm as the methodological idea of ECME is similar to ECM. However, a drawback of capping the unbounded density is that the size of the interval over which the density is capped is somewhat arbitrary, and different intervals can give different results. Podgórski and Wallin [5] established a theoretical framework for using the LOO likelihood in an expectation-maximisation (EM) algorithm to estimate the location parameter of the univariate symmetric generalised Laplace distribution (which is equivalent to the VG distribution proposed by Madan and Seneta [10]) with fixed shape parameter and unbounded density.

However, when there is data multiplicity (repeated data), the LOO likelihood will still be unbounded even if we leave out one of the data points. If we leave out multiple data points to obtain the leave-multiple-out (LMO) likelihood, this LMO likelihood has varied data size and contribution across the parameter space depending on the data multiplicities. This inconsistent data contribution leads to a discontinuous LMO likelihood. It is necessary to address this data multiplicity issue in the LOO likelihood as the issue is likely to occur for three reasons. Firstly, data rounding is common in practice, simply for convenience or for reducing data storage. Secondly, as big data have become more prevalent in recent years, the chance of data multiplicity increases with sample size. Lastly, as measurements are made more instantaneously for high-frequency data, the changes between measurements will be minimal, and hence measures using these changes, such as returns, will be extremely small or even zero, again giving rise to data multiplicity. Although the chance of data multiplicity is lower when the means, say for a time series model, change over time and when multi-dimensional models are used, one can never be sure in practice whether data multiplicity exists. Hence, it is always advisable to fix the LOO likelihood issues to provide numerically stable estimates.

Following this idea, Nitithumbundit and Chan [11] proposed the weighted LOO (WLOO) likelihood to deal with the discontinuous LMO likelihood when there are data multiplicities. They applied the WLOO likelihood to multivariate data with unbounded densities and derived an ECM algorithm [8] to obtain the maximum WLOO estimate of the location parameters of the vector autoregressive moving average (VARMA) model with multivariate skewed VG distribution when densities are cusped or unbounded with respect to the location parameters. The proposed algorithm is an extension of the EM algorithm, and they derived explicit formulae to calculate all parameters in the expectation step (E-step) and conditional maximisation steps (CM-steps). When there are data multiplicities, they applied the WLOO likelihood to obtain a continuous likelihood function. However, that paper focuses on the application and did not study the properties of this new method using the ECM algorithm with the WLOO likelihood.

The objective of this paper is to fill this research gap and study the properties of our proposed ECM algorithm with the WLOO likelihood. Our first objective assesses the accuracy of the ECM algorithm using the LOO likelihood in the case of no data multiplicity, and compares the ECM algorithm using the LOO and WLOO likelihoods with other full likelihood methods in the case of data multiplicity due to repetition and rounding. To focus on the properties of the method, we make two modifications to Ref. [11]: we consider a constant mean instead of the ARMA mean, and we drop the alternating ECM (AECM) algorithm, which aims only to improve the convergence rate of the ECM algorithm. Our second and main objective analyses the asymptotic behaviour, including the optimal convergence rate and asymptotic distribution, of the maximum LOO likelihood estimate of the location parameter through simulation studies using data simulated from the VG distribution with different sample sizes and shape parameters. Specifically, we investigate how the index of the optimal convergence rate in Ref. [5] for the location parameter estimates and its asymptotic distribution change across the shape parameter of the VG distribution with unbounded or cusp density. From these studies, we aim to approximate the asymptotic distribution with some distributions to provide a standard error for the location parameter estimates.

We choose the VG distribution to demonstrate the ability of our proposed methods to deal with unbounded densities for three important reasons. Firstly, the VG distribution is an important special case of a more general class, the generalised hyperbolic (GH) distribution [12]. When the GH distribution approaches the VG distribution, one of its shape parameters approaches the boundary of the parameter space, potentially causing the density to become unbounded. Although Protassov [13] proposed an EM algorithm for the GH distribution, the algorithm considers only regular shape parameters inside the parameter space, and hence it does not truly capture the case of the VG distribution with unbounded density when one of the shape parameters approaches the boundary of the parameter space. Our proposed ECM algorithm deals with such a case when the density is unbounded. Secondly, the VG distribution has a normal mean-variance mixture (NMVM) representation that facilitates the implementation of the ECM algorithm. Lastly, the VG distribution has applications in many areas such as finance, signal processing and quality control; see Ref. [14] for other applications and further details on the generalised Laplace distribution. We remark that the idea of the proposed ECM algorithm can be applied to other distributions with an NMVM representation, including Student's t and GH distributions, with some suitable adjustments in the E-step.

The remaining paper is organised as follows. Section 2 summarises some important properties of the multivariate skewed VG distribution. Section 3 formulates the maximum LOO likelihood framework for estimating the location parameter of distributions with unbounded densities, reports some properties of the estimate and introduces the WLOO likelihood with two examples to explain the weight settings. Section 4 introduces the ECM algorithm using the LOO and WLOO likelihoods to estimate parameters of the VG distribution. Although the methodologies in Sections 3 and 4 were described in Ref. [11], we present them in a slightly different way, starting by defining the different LOO likelihoods, which are our focus, and with a different emphasis, giving more technical details. Moreover, some equations are modified for the different mean structures. Section 5 presents three simulation studies. The first study tests the accuracy of our ECM method using the WLOO likelihood, while the second study compares the accuracy of the ECM method with other full and LOO likelihood methods in three scenarios, namely no data multiplicity, data multiplicity due to repetition, and data multiplicity due to rounding. The last study analyses the asymptotic behaviour of the maximum LOO likelihood method for estimating the location parameter of the VG distribution. Section 6 concludes the paper with further remarks.

2. Variance gamma distribution

The probability density function (pdf) of a d-dimensional VG distribution is given by
$$f_{VG}(y)=\frac{2^{1-\frac{d}{2}}\,\nu^{\nu}}{|\Sigma|^{\frac{1}{2}}\,\pi^{\frac{d}{2}}\,\Gamma(\nu)}\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\;z\right)\exp\!\left((y-\mu)^{\top}\Sigma^{-1}\gamma\right)\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{-\frac{1}{4}(2\nu-d)}z^{-\frac{1}{2}(d-2\nu)} \tag{1}$$
where $z^2=(y-\mu)^{\top}\Sigma^{-1}(y-\mu)$, $\mu\in\mathbb{R}^d$ is the location parameter, $\Sigma$ is a $d\times d$ positive definite symmetric scale matrix, $\gamma\in\mathbb{R}^d$ is the skewness parameter, $\nu>0$ is the shape parameter, $\Gamma(\cdot)$ is the gamma function and $K_{\lambda}(\cdot)$ is the modified Bessel function of the second kind with index $\lambda$ [15].

The VG distribution has a normal mean-variance mixture representation [12] given by
$$y\,|\,u \sim N_d(\mu+\gamma u,\; u\Sigma), \qquad u \sim G(\nu,\nu) \tag{2}$$
where $G(a,b)$ is a gamma distribution with shape parameter $a>0$, rate parameter $b>0$ and pdf
$$f_G(u)=\frac{b^{a}}{\Gamma(a)}\,u^{a-1}\exp(-bu), \quad \text{for } u>0.$$
The mean and covariance matrix of a VG random vector $Y$ are given by
$$E(Y)=\mu+\gamma \quad\text{and}\quad \mathrm{cov}(Y)=\Sigma+\frac{1}{\nu}\gamma\gamma^{\top},$$
respectively. As $y\to\mu$, the pdf in (1) behaves as
$$f_{VG}(y)\sim\begin{cases}2^{\nu-d}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{\Gamma(\nu-\frac{d}{2})}{\Gamma(\nu)}\,\nu^{\nu}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{-\frac{2\nu-d}{2}} & \text{if } \nu>\frac{d}{2},\\[6pt] -\,2^{1-d}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{d^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\,\log(z) & \text{if } \nu=\frac{d}{2},\\[6pt] 2^{-\nu}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{\Gamma(\frac{d}{2}-\nu)}{\Gamma(\nu)}\,\nu^{\nu}\,z^{2\nu-d} & \text{if } \nu<\frac{d}{2}.\end{cases} \tag{3}$$
Hence the density becomes unbounded at $\mu$ when $\nu\leq\frac{d}{2}$. This poses some technical difficulty when working with the VG distribution as it is unclear whether the shape parameter will fall into the unbounded range. It is also worth noting that the density is bounded with a cusp when $\frac{d}{2}<\nu\leq\frac{d+1}{2}$.
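To make the representation concrete, the sketch below (ours, not the authors' code) evaluates the reconstructed density (1) with base R's besselK and simulates VG vectors through the mixture (2); the function names dvg and rvg are our own.

```r
# Minimal sketch assuming the density (1) as reconstructed above; not the authors' code.
dvg <- function(y, mu, Sigma, gam, nu) {
  d    <- length(mu)
  Sinv <- solve(Sigma)
  c2   <- 2 * nu + drop(t(gam) %*% Sinv %*% gam)   # 2*nu + gamma' Sigma^-1 gamma
  z    <- sqrt(drop(t(y - mu) %*% Sinv %*% (y - mu)))
  lam  <- nu - d / 2
  2^(1 - d / 2) * nu^nu / (sqrt(det(Sigma)) * pi^(d / 2) * gamma(nu)) *
    besselK(sqrt(c2) * z, lam) * exp(drop(t(y - mu) %*% Sinv %*% gam)) *
    c2^(-(2 * nu - d) / 4) * z^(-(d - 2 * nu) / 2)
}

rvg <- function(n, mu, Sigma, gam, nu) {
  d <- length(mu)
  u <- rgamma(n, shape = nu, rate = nu)            # mixing variable in (2)
  Z <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)  # rows ~ N(0, Sigma)
  sweep(Z * sqrt(u) + outer(u, gam), 2, mu, "+")   # mu + gamma*u + sqrt(u)*Z
}
```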

This problem of unbounded density is illustrated in a figure of Ref. [11] with the full log-likelihood and LOO log-likelihood across the location parameter for a data set of 10 observations simulated from a standard VG distribution ($\mu=0$, $\sigma=1$, $\gamma=0$) with shape parameter $\nu=0.2$. Leaving the data point out essentially smooths out the unbounded log-likelihood at the data points so that the maximum is well-defined. Additionally, if we zoom in around $\mu=0$, we observe that cusps tend to occur between data points.

Figure 1. Plots of full (solid red), LOO (striped blue), LMO (dotted green), and WLOO (dot & striped magenta) log-likelihoods of univariate symmetric VG distribution with ν=0.4 for data set {−1,0,1,0} represented by light grey vertical strips. (a) Full and LOO log-likelihoods and (b) LMO and WLOO log-likelihoods.

3. Maximum leave-one-out likelihood

Let $y=(y_1,\dots,y_n)$ be observed data from the VG distribution with corresponding missing variables $u=(u_1,\dots,u_n)$, and $\theta=(\mu,\Sigma,\gamma,\nu)$ be the parameters of the VG distribution in the parameter space $\Theta$. The density of the VG distribution is unbounded at $\mu$ when $\nu\leq\frac{d}{2}$. Consequently, the maximum likelihood estimate is not well-defined since the likelihood function is unbounded at each data point. Kawai [16] showed that for the univariate case, the Fisher information with respect to $\mu$ is also not well-defined when $\nu<1/2$. That is, if $\nu<1/2$, then
$$E_{\theta}\!\left[\left(\frac{\partial}{\partial\mu}\log f(Y;\theta)\right)^{2}\right]=\infty$$
where $Y$ is a univariate VG random variable with density function $f$, and the expectation is taken with respect to $Y$, which depends on $\theta$.

We aim to provide a methodology to estimate parameters from VG distribution that can also deal with unbounded density. We first redefine the likelihood so that the maximum is well-defined even with the unbounded density.

3.1. Leave-one-out likelihood

Podgórski and Wallin [5] proposed the observed LOO likelihood defined as
$$L_{LOO}(\theta;y)=\prod_{t\neq\kappa(\mu)}f(y_t;\theta)$$
where the LOO index is defined as
$$\kappa(\mu)=\arg\min_{t\in\{1,\dots,n\}}\;(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu). \tag{4}$$
When more than one index attains the minimum, we choose the smallest index. Note that we slightly modify the convention in Ref. [5] ('if there are two indices we take the one for which the corresponding $y_{\kappa(\mu)}$ is on the right side of $\mu$') as it only deals with the univariate case and cannot be easily extended to the multivariate setting. Let the observed LOO log-likelihood be defined as
$$\ell_{LOO}(\theta;y)=\log L_{LOO}(\theta;y). \tag{5}$$
We remark that when considering asymmetric distributions, the LOO likelihood function is discontinuous. For the VG distribution with skewness, the discontinuity is not an issue since the density is asymptotically symmetric as $y\to\mu$ from (3), and so the effect of the discontinuities is minimised for larger sample sizes. On the other hand, when using other distributions with different skewness behaviour, the LOO index can alternatively be defined as
$$\kappa(\mu)=\arg\max_{t\in\{1,\dots,n\}}f(y_t;\theta). \tag{6}$$
In this paper, we adopt the LOO index in (6) and define the maximum LOO likelihood estimate, denoted $\hat{\theta}_n^{MLLE}$ (or simply $\hat{\theta}_n$), to be the estimate that maximises the LOO likelihood with respect to $\theta$. This new and more general index definition makes the LOO likelihood continuous, and in the symmetric case it coincides with (4).
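As a small illustration, the sketch below (again ours, reusing the hypothetical dvg from Section 2) computes the LOO log-likelihood (5) under the index (6); which.max returns the smallest index among ties, matching the convention above.

```r
# Sketch of (5) with LOO index (6); y is an n x d matrix, dvg as sketched earlier.
loo_loglik <- function(y, mu, Sigma, gam, nu) {
  dens  <- apply(y, 1, dvg, mu = mu, Sigma = Sigma, gam = gam, nu = nu)
  kappa <- which.max(dens)   # LOO index (6); ties broken at the smallest index
  sum(log(dens[-kappa]))
}
```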

3.2. Properties of the maximum leave-one-out likelihood method

For the one-dimensional case, some properties of the location parameter estimate $\hat{\mu}_n$ such as consistency and a super-efficient rate of convergence are proved by Podgórski and Wallin [5]. We state the assumptions and the theorem relating to these main properties:

(A1)

The pdf $f(y)=p(y)|y|^{\alpha}$, $\alpha\in(-1,0)$, where $p$ has a bounded derivative on $\mathbb{R}\setminus\{0\}$ and, for some $\epsilon>0$, $f$ is non-zero and continuous either on $[-\epsilon,0]$ or on $[0,\epsilon]$.

(A2)

There exists $\rho>0$ such that $f(y)=O(|y|^{-\rho-1})$ as $|y|\to\infty$.

(A3)

For all $\epsilon>0$, the incomplete Fisher information is finite. That is,
$$I_{\epsilon}(\theta):=E_{\theta}\!\left[\left(\frac{\partial}{\partial\theta}\log f(Y;\theta)\right)^{2}\,\middle|\,|Y|>\epsilon\right]<\infty.$$

Theorem 3.1

Let $f$ satisfy assumptions (A1) to (A3) and let $\hat{\mu}_n$ be the maximiser of $L_{LOO}(\mu;y)$. Then $\hat{\mu}_n$ is a consistent estimate of $\mu$ and, for any $\beta<1/(1+\alpha)$,
$$n^{\beta}(\hat{\mu}_n-\mu)\xrightarrow{p}0$$
where $\alpha$ is defined in (A1) and $\xrightarrow{p}$ denotes convergence in probability.

This theorem gives a lower bound $n^{\beta}$ on the convergence rate of the location parameter estimate using the maximum LOO likelihood. For the univariate VG distribution, $\alpha=2\nu-1$. Hence setting $\beta=1/(1+\alpha)=1/(2\nu)$ possibly gives the index of the optimal convergence rate (the proposed optimal rate) for $\nu<1/2$. Additionally, $n^{\beta}(\hat{\mu}_n-\mu)$ may converge to some asymptotic distribution for a suitable choice of $\beta$. We investigate these asymptotic properties in Section 5 using simulations from the VG distribution.

3.3. Extension to weighted LOO likelihood for data multiplicity

Data multiplicity is common when the measurements have a limited level of accuracy. In this case, the LOO likelihood is first extended to the leave-multiple-out (LMO) likelihood defined as
$$L_{LMO}(\theta;y)=\prod_{t\notin K(\mu)}f(y_t;\theta)$$
where
$$K(\mu)=\left\{t\in\{1,\dots,n\}\,|\,y_t=y_{\kappa(\mu)}\right\} \tag{7}$$
represents the LMO indices which correspond to the data points identical to $y_{\kappa(\mu)}$, and $\kappa(\mu)$ represents the LOO index defined in (4) in Section 3.1. When there are no data multiplicities, the LMO likelihood reduces to the LOO likelihood with a consistent $(n-1)$ data contribution. However, when there are varied data multiplicities, the number of data points to leave out is not fixed and so the LMO likelihood has varied data contribution across the parameter space. This inconsistent data contribution leads to a discontinuous LMO likelihood. To remedy this, we propose to adjust the LMO likelihood to the weighted LOO (WLOO) likelihood defined as
$$L_{WLOO}(\theta;y)=\prod_{t=1}^{n}f(y_t;\theta)^{w_t} \tag{8}$$
where we choose the weights to be
$$w_t=\begin{cases}0, & \text{if } t\in K(\mu)\\[4pt] \dfrac{|K(\mu)|+|J(\mu)|-1}{|J(\mu)|}, & \text{if } t\in J(\mu)\\[4pt] 1, & \text{otherwise,}\end{cases} \tag{9}$$
$K(\mu)$ is defined in (7) such that $|K(\mu)|$ represents the cardinality of the set $K(\mu)$,
$$J(\mu)=\left\{t\in\{1,\dots,n\}\setminus K(\mu): y_t=y_{j(\mu)}\right\} \tag{10}$$
and
$$j(\mu)=\arg\min_{t\notin K(\mu)}\;(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu)$$
represents the secondary LOO index for the VG distribution in NMVM representation. Similarly, the WLOO log-likelihood is defined as
$$\ell_{WLOO}(\theta;y)=\sum_{t=1}^{n}w_t\log f(y_t;\theta). \tag{11}$$
Clearly, the LOO likelihood is the special case of the WLOO likelihood when there is no data multiplicity, so that the weights reduce to $w_t=1$ for all $t\neq\kappa(\mu)$ and $w_{\kappa(\mu)}=0$. We use two examples with data multiplicity at one and two locations respectively to demonstrate how the weights in (9) are derived to ensure a bounded and continuous likelihood as well as a consistent weighted data size, that is $\sum_{t=1}^{n}w_t=n-1$, when data with different multiplicities are removed. We call the first example symmetric since there is an equal number of data points on each side of the data with multiplicity. The second example is asymmetric as there are different numbers of data points on each side of the data with multiplicity. In these two examples, the shape parameter is $\nu=0.4$, which corresponds to an unbounded likelihood. When the density function is skewed, both LOO and WLOO likelihoods are not continuous between data points. Nevertheless, the pdf in (3) is approximately symmetric as $\mu$ approaches any data point. So by having more data points, the effect of discontinuities due to skewness is negligible.
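The weight rule (9) is straightforward to implement; the sketch below (our illustration for univariate data, with distances as in (4)) returns the weights at a given $\mu$ and is checked against Example 2 below.

```r
# Sketch of the WLOO weights (9) for univariate data y at a given mu.
wloo_weights <- function(y, mu) {
  d2 <- (y - mu)^2
  K  <- which(y == y[which.min(d2)])               # K(mu): all copies of the closest value
  J  <- which(y == y[setdiff(order(d2), K)[1]])    # J(mu): copies of the next closest value
  w  <- rep(1, length(y))
  w[K] <- 0
  w[J] <- (length(K) + length(J) - 1) / length(J)  # boosted weight in (9)
  w
}

# Example 2 data at mu = 0.5 + eps: weights (1, 4/3, 4/3, 4/3, 0, 0), summing to n - 1 = 5
wloo_weights(c(-1, 0, 0, 0, 1, 1), 0.5 + 1e-6)
```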

Example 1: data multiplicity at a single location

The first data set $\{-1,0,1,0\}$ has data multiplicity at a single location 0. Figure 1 plots the log-likelihoods across $\mu$. In Figure 1(a), the LOO log-likelihood is unbounded at 0 even after leaving out one data point 0.

Figure 2. Plot of the full (solid red), LOO (striped blue), LMO (dotted green), and WLOO (dot & striped magenta) log-likelihoods of symmetric VG distribution with ν=0.4 for data set {−1,0,1,0,0,1} represented by light grey vertical strips. (a) Full and LOO log-likelihoods and (b) LMO and WLOO log-likelihoods.

The LMO log-likelihood in Figure 1(b) is bounded after leaving out multiple data points at 0, but it also produces discontinuities at the midpoints $-0.5$ and $0.5$. We aim to choose the weights in the WLOO log-likelihood
$$\ell_{WLOO}(\mu)=w_1\log f(-1)+2w_2\log f(0)+w_3\log f(1) \qquad (w_2=w_4)$$
to ensure consistent data contribution, that is, $\sum_{t=1}^{n}w_t=n-1$. Since the pdf is assumed to be symmetric, we define
$$g_t(|y_t-\mu|)=f(y_t;\mu)$$
and analyse the WLOO log-likelihood in a small neighbourhood $\epsilon>0$ of the midpoints located at $-0.5$ and $0.5$. It is sufficient to look at one of the midpoints.

Midpoint of 0 and 1: At $\mu=0.5+\epsilon$, the data point $y_3=1$ is closest to $\mu$. So we leave out $y_3=1$ in the WLOO log-likelihood by setting $w_3=0$. This gives us
$$\ell_{WLOO}(0.5+\epsilon)=w_1\log g_1(1.5+\epsilon)+2w_2\log g_2(0.5+\epsilon)+0\cdot\log g_3(0.5-\epsilon).$$
Since the data point $y_1=-1$ has a single contribution to the log-likelihood, we set $w_1=1$, leaving the other weights $w_2=w_4=1$ so that $\sum_{t=1}^{4}w_t=3$. At $\mu=0.5-\epsilon$, the data points $y_2=y_4=0$ are closest to $\mu$, so we set $w_2=0$, which gives us
$$\ell_{WLOO}(0.5-\epsilon)=w_1\log g_1(1.5-\epsilon)+0\cdot\log g_2(0.5-\epsilon)+w_3\log g_3(0.5+\epsilon).$$
Again, we set $w_1=1$, leaving $w_3=2$ so that $\sum_{t=1}^{4}w_t=3$. Choosing these weights gives us a continuous WLOO log-likelihood at $\mu=0.5$ given by
$$\ell_{WLOO}(0.5)=\log g_1(1.5)+2\log g_2(0.5)$$
where $g_2(|0-0.5|)=g_3(|1-0.5|)$. This shows that after leaving out multiple data points, extra weight is added to the neighbouring data points to compensate for the missing weights.

Example 2: data multiplicities at two locations

The second data set $\{-1,0,1,0,0,1\}$ has different data multiplicities at 0 and 1. As before, we define the WLOO log-likelihood as
$$\ell_{WLOO}(\mu)=w_1\log f(-1)+3w_2\log f(0)+2w_3\log f(1).$$

Midpoint of $-1$ and 0: At $\mu=-0.5-\epsilon$, we set $w_1=0$, giving $w_2=w_3=1$, whereas at $\mu=-0.5+\epsilon$, we set $w_2=0$, giving $w_1=3$ and $w_3=1$. Hence the WLOO log-likelihood is
$$\ell_{WLOO}(-0.5)=3\log g_1(0.5)+2\log g_3(1.5).$$
Midpoint of 0 and 1: Unlike $\mu=-0.5$, $\mu=0.5$ lies between 0 and 1, which have different data multiplicities. At $\mu=0.5+\epsilon$, we set $w_3=0$, giving the WLOO log-likelihood
$$\ell_{WLOO}(0.5+\epsilon)=w_1\log g_1(1.5+\epsilon)+3w_2\log g_2(0.5+\epsilon)+0\cdot\log g_3(0.5-\epsilon).$$
With the single contribution of $y_1=-1$, we set $w_1=1$ and so $w_2=4/3$. At $\mu=0.5-\epsilon$, we set $w_2=0$, giving the WLOO log-likelihood
$$\ell_{WLOO}(0.5-\epsilon)=w_1\log g_1(1.5-\epsilon)+0\cdot\log g_2(0.5-\epsilon)+2w_3\log g_3(0.5+\epsilon).$$
Again, we set $w_1=1$ and so $w_3=4/2=2$. This gives us the WLOO log-likelihood
$$\ell_{WLOO}(0.5)=\log g_1(1.5)+4\log g_2(0.5)$$
where $g_2(0.5)=g_3(0.5)$.

Hence for a general data set in which $y_t$ and $y_{t'}$ have data multiplicities $m_t$ and $m_{t'}$, data points not in the neighbourhood of the midpoint of $y_t$ and $y_{t'}$ have the same contribution to the WLOO likelihood, so they always have a weight of 1. For the neighbourhood around the midpoint on the side closer to $y_{t'}$, the $m_{t'}$ data points with value $y_{t'}$ are left out by setting their weights to 0. Then the weight for each of the $m_t$ points with value $y_t$ is set to
$$w_t=\frac{m_t+m_{t'}-1}{m_t}$$
so that $\sum_{t=1}^{n}w_t=n-1$. This gives us the formula for the weights in (9). We note that the ECM algorithm using the LOO likelihood in the next section can be extended to the more general WLOO likelihood.

4. ECM algorithm

Finding the maximum LOO likelihood estimate $\hat{\theta}$ for the VG distribution directly can be difficult as the LOO likelihood has many cusps when $\nu\leq d/2$, and the LOO index $\kappa(\mu)$ makes derivatives tedious to work with since the summation and the differential with respect to $\mu$ cannot simply be interchanged due to the dependence of the summation index on $\mu$. Alternatively, we can implement the ECM algorithm to maximise the complete data LOO likelihood conditional on the mixing variables $u$.

Given the complete data $(y,u)$, we use the normal mean-variance mixture representation in (2) to represent the complete data LOO log-likelihood as
$$\ell_{LOO}(\theta;y,u)=\ell_{N}^{LOO}(\mu,\Sigma,\gamma;y,u)+\ell_{G}^{LOO}(\nu;u) \tag{12}$$
where the LOO log-likelihood of the conditional normal distribution, ignoring constants, is given by
$$\ell_{N}^{LOO}(\mu,\Sigma,\gamma;y,u)=-\frac{n-1}{2}\log|\Sigma|-\frac{1}{2}\sum_{t\neq\kappa(\mu)}\frac{1}{u_t}(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma) \tag{13}$$
and the LOO log-likelihood of the conditional gamma distribution is given by
$$\ell_{G}^{LOO}(\nu;u)=(n-1)\left(\nu\log\nu-\log\Gamma(\nu)\right)+(\nu-1)\sum_{t\neq\kappa(\mu)}\log u_t-\nu\sum_{t\neq\kappa(\mu)}u_t. \tag{14}$$
The outline of the ECM algorithm for the VG distribution using the full likelihood is similar to the ECM algorithm in Ref. [9] for the Student-t distribution and in Ref. [6] for the VG distribution. However, modifications to the algorithm are necessary when using the LOO likelihood. We will discuss the modifications needed in order to search for the maximum of the LOO likelihood.

4.1. E-step

By analysing the conditional posterior distribution of $u_t$ given $y_t$ with density
$$f(u_t|y_t,\theta)\propto u_t^{\nu-\frac{d}{2}-1}\exp\!\left[-\frac{1}{2u_t}(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu)-\frac{u_t}{2}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)\right]$$
which corresponds to the pdf of a generalised inverse Gaussian distribution [17], we can calculate the following conditional expectations:
$$\hat{u}_t=E(u_t|y,\theta)=\frac{z_t\,K_{\nu-\frac{d}{2}+1}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}, \tag{15}$$
$$\widehat{1/u_t}=E\!\left(\frac{1}{u_t}\,\Big|\,y,\theta\right)=\frac{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,K_{\nu-\frac{d}{2}-1}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{z_t\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}, \tag{16}$$
$$\widehat{\log u_t}=E(\log u_t|y,\theta)=\log\!\left(\frac{z_t}{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}}\right)+\frac{K_{\nu-\frac{d}{2}}^{(1,0)}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)} \tag{17}$$
where $K_{\lambda}^{(1,0)}(z)=\frac{\partial}{\partial\alpha}K_{\alpha}(z)\big|_{\alpha=\lambda}$ is approximated using the second-order central difference
$$K_{\lambda}^{(1,0)}(z)\approx\frac{K_{\lambda+h}(z)-K_{\lambda-h}(z)}{2h}, \qquad z_t^{2}=(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu), \tag{18}$$
and we let $h=10^{-5}$. Numerical problems may occur when $\nu\leq\frac{d}{2}$ since the pdf $f_{VG}(y)$ in (1) at $\mu$ is unbounded. Using the asymptotic properties of the modified Bessel function of the second kind,
$$K_{\lambda}(z)\sim\Gamma(\lambda)2^{\lambda-1}z^{-\lambda}\;(\lambda>0), \qquad K_{0}(z)\sim-\log(z) \quad\text{as } z\to0,$$
we can show that as $y_t\to\mu$,
$$E(u_t|y,\theta)\sim\begin{cases}\dfrac{2\nu-d}{2\nu+\gamma^{\top}\Sigma^{-1}\gamma} & \text{if } \nu>\frac{d}{2},\\[4pt] -\dfrac{1}{(2\nu+\gamma^{\top}\Sigma^{-1}\gamma)\log(z_t)} & \text{if } \nu=\frac{d}{2},\\[4pt] \dfrac{\Gamma(\nu-\frac{d}{2}+1)}{\Gamma(\frac{d}{2}-\nu)}\,2^{2\nu-d+1}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{\frac{d}{2}-\nu-1}z_t^{d-2\nu} & \text{if } \nu\in\left(\frac{d}{2}-1,\frac{d}{2}\right),\\[4pt] -\log(z_t)\,z_t^{2} & \text{if } \nu=\frac{d}{2}-1,\\[4pt] \dfrac{z_t^{2}}{d-2(\nu+1)} & \text{if } \nu<\frac{d}{2}-1,\end{cases} \tag{19}$$
$$E\!\left(\frac{1}{u_t}\,\Big|\,y,\theta\right)\sim\begin{cases}\dfrac{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}{2\nu-d-2} & \text{if } \nu>\frac{d}{2}+1,\\[4pt] -\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)\log(z_t) & \text{if } \nu=\frac{d}{2}+1,\\[4pt] \dfrac{\Gamma(1-\nu+\frac{d}{2})}{\Gamma(\nu-\frac{d}{2})}\,2^{1-2\nu+d}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{\nu-\frac{d}{2}}z_t^{2\nu-d-2} & \text{if } \nu\in\left(\frac{d}{2},\frac{d}{2}+1\right),\\[4pt] -\dfrac{1}{\log(z_t)\,z_t^{2}} & \text{if } \nu=\frac{d}{2},\\[4pt] \dfrac{d-2\nu}{z_t^{2}} & \text{if } \nu<\frac{d}{2},\end{cases} \tag{20}$$
$$E(\log u_t|y,\theta)\sim\begin{cases}\psi\!\left(\nu-\frac{d}{2}\right)-\log\!\left(\dfrac{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}{2}\right) & \text{if } \nu>\frac{d}{2},\\[4pt] \log z_t & \text{if } \nu=\frac{d}{2},\\[4pt] 2\log z_t & \text{if } \nu<\frac{d}{2}.\end{cases} \tag{21}$$
Thus for the case when $\nu\leq\frac{d}{2}$, the main source of numerical problems is the calculation of $E(1/u_t|y,\theta)$ as it diverges to infinity at a hyperbolic rate, since the maximum of the likelihood function occurs at the data points. This leads to an overflow problem when calculating the estimates, which is explained further in Section 4.2.3. The LOO likelihood avoids the problem by preventing the location parameter estimate from converging towards a data point, since the maximum of the LOO likelihood tends to lie roughly between data points.
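The E-step reduces to a few Bessel-function ratios; the following sketch (ours, using base R's besselK) computes (15)-(17) for one observation with the central difference (18), where zt and c2 denote $z_t$ and $2\nu+\gamma^{\top}\Sigma^{-1}\gamma$.

```r
# Sketch of the E-step quantities (15)-(17) with the central difference (18).
estep_vg <- function(zt, c2, nu, d, h = 1e-5) {
  lam <- nu - d / 2
  a   <- sqrt(c2) * zt
  Kl  <- besselK(a, lam)
  u_hat <- zt / sqrt(c2) * besselK(a, lam + 1) / Kl               # (15)
  inv_u <- sqrt(c2) / zt * besselK(a, lam - 1) / Kl               # (16)
  dK    <- (besselK(a, lam + h) - besselK(a, lam - h)) / (2 * h)  # (18)
  log_u <- log(zt / sqrt(c2)) + dK / Kl                           # (17)
  list(u_hat = u_hat, inv_u = inv_u, log_u = log_u)
}
```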

4.2. CM-step

Derivatives of $\ell_{N}^{LOO}$ in (13) with respect to $(\Sigma,\gamma)$ are straightforward to calculate using matrix differentiation. However, we encounter two types of difficulties in calculating the derivative with respect to $\mu$ for the CM-step.

Firstly, even when the LOO likelihood removes the unbounded points from the full likelihood, there still exist cusps in the LOO likelihood. Consequently we cannot completely rely on derivative-based methods to find the maximum of the LOO likelihood with respect to the location parameter μ.

Secondly, the first-order derivative of the complete-data LOO log-likelihood in (13) with respect to $\mu$ is
$$\frac{\partial}{\partial\mu}\ell_{N}^{LOO}=-\frac{1}{2}\left(\frac{\partial}{\partial\mu}\sum_{t\neq\kappa(\mu)}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma)\right). \tag{22}$$
Since the summation index depends on $\mu$, the differential and the summation cannot simply be interchanged. Thus the CM-step for $\mu$ does not have a closed-form solution. To solve these problems, we propose the local mid-point search and local point search algorithms for the first problem, and the approximate CM-step for the second problem.

4.2.1. Local mid-point search (for one-dimensional case)

As seen in Figure 1, the maximum of the LOO log-likelihood tends to occur at the cusp points which are located between data points for the one-dimensional case. So ideally we would want to search along these mid-points to maximise the LOO log-likelihood with respect to $\mu$. This leads to the local mid-point search. The idea is to search for mid-points around the current iterate $\mu^{(k)}$ and choose the one that maximises the LOO log-likelihood.

Local mid-point search algorithm:

Let $(\mu^{(k)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$ be our current estimates, and $y_{(t)}$ be the ordered data:

  1. Calculate $|x_t-\mu^{(k)}|$ for $t=1,\dots,n-1$, where $x_t:=(y_{(t)}+y_{(t+1)})/2$, and choose the $m$ smallest Euclidean distances with corresponding mid-points $x_{t_1},\dots,x_{t_m}$, where we choose $m=\max\{20,n/100\}$ for our simulation study given that $n\geq20$. Additionally, let $x_{t_0}=\hat{\mu}^{(k)}$.

  2. Update the location estimate by choosing the $\mu$ out of $\{x_{t_0},\dots,x_{t_m}\}$ that maximises the LOO log-likelihood in (5):
$$\mu^{(k+1)}=\arg\max_{\mu\in\{x_{t_0},\dots,x_{t_m}\}}\ell_{LOO}(\mu,\Sigma^{(k)},\gamma^{(k)},\nu^{(k)};y).$$

  3. Repeat steps 1 and 2 until the location parameter estimate converges.
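A one-pass sketch of this search (our code; loo_loglik is the sketch from Section 3.1, and the choice of m follows step 1) is given below; iterating it until $\mu$ no longer changes realises step 3.

```r
# Sketch of one iteration of the local mid-point search (univariate data y).
midpoint_search <- function(y, mu_k, sigma2, gam, nu) {
  ys   <- sort(y)
  mids <- (ys[-length(ys)] + ys[-1]) / 2                  # mid-points x_t
  m    <- min(max(20, length(y) %/% 100), length(mids))   # m = max{20, n/100}, capped
  cand <- c(mu_k, mids[order(abs(mids - mu_k))[1:m]])     # current iterate + m nearest mid-points
  ll   <- sapply(cand, function(mu)
    loo_loglik(matrix(y), mu, matrix(sigma2), gam, nu))   # LOO log-likelihood (5)
  cand[which.max(ll)]                                     # step 2: the maximising candidate
}
```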

4.2.2. Local point search (for higher dimensional case)

In general, finding the maximum along the cusps in higher dimensions is more computationally demanding. For 2-dimensional data, the cusps occur along cusp lines in $\mathbb{R}^2$, as demonstrated in a figure of Ref. [11]. For d-dimensional data, the cusps occur on $(d-1)$-dimensional cusp manifolds in $\mathbb{R}^d$. So for simplicity, we propose to search over data points around the current iterate $\hat{\mu}^{(k)}$ and choose the one that increases the LOO likelihood.

Local point search (LPS) algorithm:

The algorithm is similar to the local mid-point search algorithm in Section 4.2.1 except that we search over the data points instead of the mid-points, and replace the Euclidean distance with the Mahalanobis distance between $y_t$ and $\mu^{(k)}$,
$$(y_t-\mu^{(k)})^{\top}(\Sigma^{(k)})^{-1}(y_t-\mu^{(k)}),$$
for $t=1,\dots,n$.

4.2.3. Approximation in CM-step

To evaluate the first-order derivative in (22), we propose to approximate the derivative by simply considering the LOO index in (4) to be fixed at the current estimate $\mu^{(k)}$ at the $k$-th iteration, so that we leave out the data point closest to $\mu^{(k)}$ instead of $\mu$. This gives us an approximation to the derivative
$$\frac{\partial}{\partial\mu}\ell_{N}^{LOO}\approx-\frac{1}{2}\left(\frac{\partial}{\partial\mu}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma)\right)=\Sigma^{-1}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma). \tag{23}$$
Similarly, applying the approximate derivative to $\ell_{N}^{LOO}$ and $\ell_{G}^{LOO}$ with respect to the other parameters and solving the approximate derivatives at zero gives us the following CM-steps.

CM-step for μ,Σ,γ:

Suppose that the current iterate is $\theta^{(k)}$ and $u$ is given. After equating each component of the approximate partial derivatives of $\ell_{N}^{LOO}(\mu,\Sigma,\gamma|y,u)$ to zero, we obtain the following estimates:
$$\mu^{(k+1)}=\frac{S_{y/u}S_u-(n-1)S_y}{S_{1/u}S_u-(n-1)^{2}}, \tag{24}$$
$$\gamma^{(k+1)}=\frac{S_y-(n-1)\mu^{(k+1)}}{S_u}, \tag{25}$$
$$\Sigma^{(k+1)}=\frac{1}{n-1}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu^{(k+1)})(y_t-\mu^{(k+1)})^{\top}-\frac{1}{n-1}\gamma^{(k+1)}(\gamma^{(k+1)})^{\top}S_u, \tag{26}$$
where the sufficient statistics are:
$$S_y=\sum_{t\neq\kappa(\mu^{(k)})}y_t,\quad S_{y/u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,y_t,\quad S_u=\sum_{t\neq\kappa(\mu^{(k)})}\hat{u}_t,\quad S_{1/u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}. \tag{27}$$
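In code, the CM-step is a handful of weighted sums; the sketch below (ours, not the authors' implementation) takes the $n-1$ retained rows of y together with the E-step outputs and returns the updates (24)-(26).

```r
# Sketch of the CM-step (24)-(26); y holds the n - 1 rows kept by the LOO index,
# u_hat and inv_u are the E-step vectors (15) and (16) for those rows.
cm_step <- function(y, u_hat, inv_u) {
  n1  <- nrow(y)                       # n - 1
  Sy  <- colSums(y)                    # S_y in (27)
  Syu <- colSums(y * inv_u)            # S_{y/u}
  Su  <- sum(u_hat)                    # S_u
  S1u <- sum(inv_u)                    # S_{1/u}
  mu  <- (Syu * Su - n1 * Sy) / (S1u * Su - n1^2)   # (24)
  gam <- (Sy - n1 * mu) / Su                        # (25)
  yc  <- sweep(y, 2, mu)                            # y_t - mu
  Sig <- crossprod(yc * sqrt(inv_u)) / n1 - tcrossprod(gam) * Su / n1   # (26)
  list(mu = mu, gamma = gam, Sigma = Sig)
}
```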

CM-step for $\nu$:

Given the mixing parameters $u$, the estimate $\nu^{(k+1)}$ can be obtained by numerically maximising $\ell_{G}^{LOO}(\nu|u)$ in (14) with respect to $\nu$ using the Newton-Raphson (NR) algorithm, where the approximate derivatives are given by:
$$\frac{\partial}{\partial\nu}\ell_{G}^{LOO}=(n-1)\left(1+\log\nu-\psi(\nu)\right)+S_{\log u}-S_u, \qquad \frac{\partial^{2}}{\partial\nu^{2}}\ell_{G}^{LOO}=(n-1)\left(\frac{1}{\nu}-\psi'(\nu)\right)$$
where $\psi(x)=\frac{d}{dx}\log\Gamma(x)$ is the digamma function and
$$S_{\log u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{\log u_t}. \tag{28}$$
This algorithm using the conditional log-likelihood $\ell_{G}^{LOO}$ is called MCECM. Alternatively, we can speed up the convergence rate by maximising directly the LOO log-likelihood $\ell_{VG}^{LOO}=\sum_{t\neq\kappa(\mu^{(k)})}\ln f_{VG}(y_t)$ with respect to $\nu$ given the other parameters. This algorithm is called ECM. Unless otherwise stated, we consider the ECM algorithm in the subsequent analyses.
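A sketch of the NR iteration (our code; digamma and trigamma are base R functions) under the derivatives above:

```r
# Sketch: Newton-Raphson iteration for nu in the MCECM variant, maximising (14).
update_nu <- function(nu, n1, S_logu, S_u, tol = 1e-8, maxit = 50) {
  for (i in seq_len(maxit)) {
    g <- n1 * (1 + log(nu) - digamma(nu)) + S_logu - S_u  # first derivative
    h <- n1 * (1 / nu - trigamma(nu))                     # second derivative
    nu_new <- nu - g / h
    if (nu_new <= 0) nu_new <- nu / 2                     # stay inside nu > 0
    if (abs(nu_new - nu) < tol) return(nu_new)
    nu <- nu_new
  }
  nu
}
```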

4.2.4. Line search

The estimates in (24) to (26) using approximate derivatives do not guarantee that the LOO likelihood increases. In this regard, we propose to apply a line search to ensure the LOO likelihood increases after each CM-step. This line search is part of a class of adaptive overrelaxed methods which can improve the efficiency of EM [18]. Denoting the current estimate by $\theta^{(k)}$ and the updated estimate by $\theta^{(k+1)}$ after the CM-step in Section 4.2.3, we propose to construct a direct line search by defining
$$\theta^{*}=\theta^{(k)}+\varphi\left(\theta^{(k+1)}-\theta^{(k)}\right)$$
where $\varphi\in I\subset\mathbb{R}$ and the interval $I$ is chosen so that $\theta^{*}$ remains in the parameter space. Using the optimise function in R, $\varphi$ is estimated as $\varphi^{*}$ such that it maximises the LOO log-likelihood:
$$\varphi^{*}=\arg\max_{\varphi\in I}\ell_{LOO}(\theta^{*}).$$
Since finding the maximum of a non-smooth likelihood function is difficult, we can alternatively search over $\varphi$ to choose $\theta^{*}$ such that it improves the LOO likelihood over the previous estimate: $\ell_{LOO}(\theta^{*};y)\geq\ell_{LOO}(\theta^{(k)};y)$.
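A sketch of this step (ours; it assumes the parameters are packed into a numeric vector theta and that ll returns the LOO log-likelihood of such a vector — both assumptions of this illustration, not part of the paper):

```r
# Sketch of the direct line search using optimise, falling back to the
# CM-step proposal theta_new when the search does not improve on it.
line_search <- function(theta_k, theta_new, ll, interval = c(0, 2)) {
  f   <- function(phi) ll(theta_k + phi * (theta_new - theta_k))
  opt <- optimise(f, interval, maximum = TRUE)
  if (opt$objective >= f(1)) theta_k + opt$maximum * (theta_new - theta_k)
  else theta_new
}
```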

4.3. ECM algorithm

Combining these steps gives us the ECM algorithm for VG distribution using the LOO likelihood:

Initialisation step: Choose suitable starting values $(\mu_0,\Sigma_0,\gamma_0,\nu_0)$. It is recommended to choose the starting values $(\bar{y},\mathrm{cov}(y),\mathbf{0},4/d)$ where $\bar{y}$ and $\mathrm{cov}(y)$ denote the sample mean and sample variance-covariance matrix of $y$ respectively. For more leptokurtic data, it is recommended to use more robust measures of location and scale.

ECM algorithm for VG: At the $k$-th iteration with current estimates $(\mu^{(k)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$:

Local Point Search: Update the estimate to $\mu^{(k+1/2)}$ using the local point search in Section 4.2.2.

E-step 1: Calculate $\hat{u}_t^{(k+1)}$ and $\widehat{1/u_t}^{(k+1)}$ for $t=1,\dots,n$ in (15) and (16) respectively using $(\mu^{(k+1/2)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$. Calculate also the sufficient statistics $S_y^{(k+1)}$, $S_{y/u}^{(k+1)}$, $S_u^{(k+1)}$ and $S_{1/u}^{(k+1)}$ in (27).

CM-step 1: Update the estimates to $(\mu^{(k+1)},\gamma^{(k+1)})$ in (24) and (25) respectively using the sufficient statistics in E-step 1, along with the line search in Section 4.2.4.

CM-step 2: Update the estimate to $\Sigma^{(k+1)}$ in (26) using the sufficient statistics in E-step 1, along with the line search.

CM-step 3: Update the estimate to $\nu^{(k+1)}$ by maximising the actual LOO log-likelihood with respect to $\nu$ while keeping the other parameters fixed.

Stopping rule: Repeat the procedures until the relative increment of the LOO log-likelihood function is smaller than the tolerance level $10^{-8}$.

We remark that the local mid-point search and local point search ensure the location parameter estimates jump closer towards the maximum whereas the line search in Section 4.2.4 is applied after each CM-step to ensure monotonic convergence of the ECM algorithm. We will numerically verify the accuracy of this algorithm in Section 5.1 using Monte Carlo simulations.

4.4. Convergence of ECM algorithm using LOO likelihood

Let the approximate LOO log-likelihood be defined as
$$\tilde{\ell}_{LOO}(\theta;y)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(y_t;\theta) \tag{29}$$
with the LOO index fixed at $\kappa(\mu^{(k)})$. To show the convergence of the ECM algorithm using the approximate LOO likelihood, we first prove the convergence of the ECM algorithm with one CM-step for the approximate LOO log-likelihood, then extend the proof to multiple CM-steps. For the case with one CM-step, we apply the idea in Ref. [19] to the LOO likelihood and state two fundamental results below:
$$\tilde{\ell}_{LOO}(\theta;y)=\tilde{Q}_{LOO}(\theta;\theta^{(k)})-\tilde{H}_{LOO}(\theta;\theta^{(k)})$$
and
$$\tilde{H}_{LOO}(\theta;\theta^{(k)})\leq\tilde{H}_{LOO}(\theta^{(k)};\theta^{(k)})$$
where we let
$$\tilde{Q}_{LOO}(\theta;\theta^{(k)})=\int\tilde{\ell}_{LOO}(\theta;y,u)\,f(u|y;\theta^{(k)})\,du \tag{30}$$
with $f(u|y;\theta^{(k)})=\prod_{t=1}^{n}f(u_t|y_t;\theta^{(k)})$, $\tilde{\ell}_{LOO}(\theta;y,u)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(y_t,u_t;\theta)$, and
$$\tilde{H}_{LOO}(\theta;\theta^{(k)})=\int\tilde{\ell}_{LOO}(\theta;u|y)\,f(u|y;\theta^{(k)})\,du$$
with $\tilde{\ell}_{LOO}(\theta;u|y)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(u_t|y_t;\theta)$. These fundamental results are stated by Podgórski and Wallin [5], and the ideas of the proof are exactly the same as in Ref. [20], replacing the full likelihood with the LOO likelihood. However, choosing $\theta^{(k+1)}$ such that
$$\tilde{Q}_{LOO}(\theta^{(k+1)};\theta^{(k)})\geq\tilde{Q}_{LOO}(\theta^{(k)};\theta^{(k)})$$
only guarantees that $\tilde{\ell}_{LOO}(\theta^{(k+1)};y)\geq\tilde{\ell}_{LOO}(\theta^{(k)};y)$ but not $\ell_{LOO}(\theta^{(k+1)};y)\geq\ell_{LOO}(\theta^{(k)};y)$. For this reason, we need to perform the line search in Section 4.2.4 so that the LOO log-likelihood improves, which guarantees the monotonic convergence of the algorithm for the LOO log-likelihood.

For the case with multiple CM-steps, applying the line search after a CM-step increases the actual LOO log-likelihood rather than the Q-function in (30). Liu and Rubin [8] and Meng and van Dyk [21] proved the monotonic convergence of the ECM algorithm only if all the CM-steps applied to Q-functions are performed before the CM-step applied to the actual LOO log-likelihood. Thus for our case, if we apply the line search to the initial CM-step, the line search also has to be applied to the subsequent CM-steps to ensure that the actual LOO log-likelihood increases after each CM-step. This guarantees the monotonic convergence of the algorithm in Section 4.3.

5. Simulation studies

5.1. Accuracy of estimates for VG distribution

To demonstrate the accuracy of the proposed ECM algorithm, we simulate $n=1000$ bivariate skewed VG samples with parameters
$$\mu=\begin{pmatrix}0\\0\end{pmatrix},\quad \Sigma=\begin{pmatrix}1&0.7\\0.7&1\end{pmatrix},\quad \gamma=\begin{pmatrix}0.8\\1\end{pmatrix},\quad\text{and}\quad \nu=0.15 \tag{31}$$
and estimate the parameters using the algorithm specified in Section 4.3. We replicate this experiment $r=1000$ times and present the results in Figure 3, which shows violin plots, implemented using the caroline package [22] in R, representing density estimates of the parameter estimates using a normal kernel. The medians of the estimates are very close to the true parameters of the model, implying that the algorithm gives us consistent estimates for these parameters even when $\nu<\frac{d}{2}$, which leads to an unbounded density. We also see that the distribution of each component of the estimates of $\Sigma$, $\gamma$, and $\nu$ appears to be roughly normal. On the other hand, the distribution of $\hat{\mu}$ has high density around 0 and extremely heavy tails.

Figure 3. Vioplots to show accuracy of parameter estimates for VG distribution in the first simulation study. The median is displayed as a grey box which is connected by a crimson line. True parameter values represented by the blue lines are drawn for comparison. (a) Vioplot of μ and (b) Vioplot of Σ, γ, ν.

To illustrate the idea of the local point search and line search in Sections 4.2.2 and 4.2.4 graphically, readers may refer to a figure of Ref. [11]. The figure shows that a global maximum lies roughly between data points along the cusp lines. These cusp lines make computation more demanding as we cannot purely rely on derivative-based methods (see Equation (23)). The local point search and then the line search serve as an efficient iterative method to obtain estimates which converge approximately towards the global maximum. To explain the idea, the local point search takes the estimate $\hat{\mu}$ along the black solid line to one of the three data points (red circles) closer to the global maximum point (yellow triangle), while the CM-step and line search improve $\hat{\mu}$ along the black dashed line to the point in the blue square, so that they converge closer towards the global maximum point which lies on the cusp line between two of the data points.

5.2. Comparison of WLOO, LOO and full likelihood methods

We assess the performance of the WLOO and LOO likelihoods relative to other full likelihood methods, including the R package ghyp, in a simulation study. Nitithumbundit and Chan [6] proposed an ECM algorithm [7] to estimate parameters for the VG distribution with unbounded density by capping the density within a small interval of the location parameter. Specifically, they bounded the conditional expectations (19) to (21) around $\mu$ such that if $z_t<\Delta$, where $\Delta$ is some small fixed positive capping level and $z_t$ is defined in (18), then the conditional expectations are calculated by replacing $z_t$ with $\max(z_t,\Delta)$. To improve the rate of convergence, they further considered a data augmentation algorithm called the alternating ECM (AECM) by setting $u_t^{(k)}=w_t/a(\theta^{(k)})$ where the positive function $a(\theta)=\xi$, and updated $\xi$ by choosing
$$\hat{\xi}=\arg\max_{\xi>0}\ell_{VG}(\hat{\mu},\hat{\Sigma}\xi,\gamma\xi,\hat{\nu})$$
using numerical optimisation techniques given the current estimates $(\hat{\mu},\hat{\Sigma},\gamma,\hat{\nu})$. Then they updated the parameters as $\hat{\gamma}=\hat{\gamma}\hat{\xi}$ and $\hat{\Sigma}=\hat{\Sigma}\hat{\xi}$. Lastly, they showed that the optimal $\Delta$ decreases as $\nu$ decreases. Hence they proposed an adaptive $\Delta$ method updating $\Delta^{(k+1)}=\hat{g}(\nu^{(k+1)},d)$ where $g(\nu,d)$ is a cubic spline fitted to the median (over repeats) of the optimal $\Delta$ across $\nu$ and $d$ in a simulation study. The following likelihood methods are considered in this study:

  1. Full likelihood ghyp package: ECM algorithm with $\Delta$ set to .Machine$double.eps^0.25 ≈ 1.2e-4, which is a tolerance level used by many R functions.

  2. Full likelihood fixed $\Delta$: AECM algorithm with the smallest $\Delta$ = sqrt(.Machine$double.xmin) ≈ 1.5e-154 to provide numerical stability, where double.xmin represents the smallest non-zero normalised floating-point number in R.

  3. Full likelihood adaptive Δ: AECM algorithm with adaptive Δ.

  4. LOO likelihood: ECME/AECM algorithm with LPS (Section 4.2.2), line search (Section 4.2.4) and the smallest $\Delta$, which is relevant only when there is data multiplicity.

  5. WLOO likelihood: ECM algorithm with LPS and line search.

Details of these five likelihood methods are summarised in Table 1.

Table 1. ghyp, fixed Δ, adaptive Δ, LOO and WLOO likelihood methods.

To conduct the simulation study, we create data multiplicity in two ways: replicate each data point R times or round each data point to D decimal places. The procedure for the simulation study is summarised below:

Step 1:

Choose $\nu=0.05,0.1,\dots,1.45,1.5$, and $R=1,\dots,5$ or $D=\infty,8,6,5,4$ (D = ∞ denoting no rounding) for Section 5.2.2. The density is unbounded if $\nu<1$ and cusped if $1<\nu\leq1.5$ for dimension $d=2$.

Step 2:

Simulate $n=1000$ data points from the bivariate VG distribution with true parameters $\mu$, $\Sigma$ and $\gamma$ given in (31) and the chosen $\nu$. Then repeat each data point R times, or round each data point to D decimal places.

Step 3:

Apply the five likelihood methods in Table 1 to the data sets to obtain five sets of estimates.

Step 4:

Repeat these steps until we have r = 500 replicates for each method, level of ν, and level of R or D.

5.2.1. No data multiplicity

The medians of the accuracy measures for the four parameters are reported in Table 2. Results show that the full likelihood methods using ghyp and fixed $\Delta$ mostly give the largest biases, whereas the adaptive $\Delta$ full likelihood, LOO likelihood and WLOO likelihood methods perform similarly with lower biases, particularly for $\mu$ when the density is unbounded. It is reasonable to see that the LOO and WLOO likelihood methods give very similar results for data without multiplicity. As $\nu$ decreases, the bias seems to increase for $|\hat{\Sigma}|$ and decrease for $\hat{\mu}$, but there is no clear trend for $\hat{\gamma}$.

Table 2. Median of 500 accuracy measures of parameter estimates across five likelihood methods with no data multiplicity (R = 1). ‘*’ indicates the methods with lowest absolute error.

5.2.2. Data multiplicity due to repetition and rounding

To visualise the differences in performance better, we transform the accuracy measures by $-\log[\log(-\log|\hat{\mu}-\mu|)]$, $\log\big|\log|\hat{\Sigma}|-\log|\Sigma|\big|$, $\log|\hat{\gamma}-\gamma|$ and $\log|\log\hat{\nu}-\log\nu|$ respectively, and graph the transformed accuracy measures in Figure 4 for replicated data with R = 1, 3, 5. Since the plot for rounded data is very similar, it is omitted; see Ref. [23], page 95 for this plot. We remark that the transformation $f(x)=-\log[\log(-\log x)]$ for $|\hat{\mu}-\mu|$ is monotonically increasing on $(0,1/e)$.

Figure 4. Plot of transformed accuracy measures for the full ghyp package (red solid line), full fixed Δ (light green striped line), full adaptive Δ (dotted green line), LOO (blue dot & striped line), and WLOO (dashed magenta line) likelihoods. The columns from left to right represent the median parameter accuracy measures for $(\hat{\mu},\hat{\Sigma},\hat{\gamma},\hat{\nu})$ respectively. The rows from top to bottom represent R = 1, 3, 5 respectively, where R is the number of times each data point is repeated.

With data repetition, the WLOO likelihood (magenta line) generally performs better than the other likelihood methods for all levels of $\nu$ in Figure 4. The full likelihood ghyp with $\Delta_0$ = 1.2e-4 (red line) and adaptive $\Delta$ (green line) methods provide reasonable accuracy when $\nu>0.15$, but the adaptive $\Delta$ method is better and similar to WLOO for $\hat{\nu}$. This result is in agreement with the optimal $\Delta$ choice study in Section 4.2 of Ref. [6]. They considered $\Delta_1$ = 1.49e-154 and $\Delta_2$ = 1.49e-8 and plotted, in a figure of Ref. [6], the log of the sum of mean absolute errors (MAE) over all the parameters for a $d=2$ dimensional VG model with $\mu=\mathbf{0}$, $\Sigma=\begin{pmatrix}1&0.4\\0.4&1\end{pmatrix}$ and $\gamma=(1.2,0.2)^{\top}$ using the AECM algorithm. Results show that $\Delta_1$ performs better for much smaller shape parameters at $\nu<0.15$ whereas $\Delta_2$ performs better for approximately $0.15<\nu<1.15$. Since $\Delta_0$ is similar to $\Delta_2$ relative to $\Delta_1$, this result explains why the full and LOO likelihoods, both with $\Delta_1$, perform worst in Figure 4 for approximately $0.15<\nu<1.5$.

When $\nu$ is small, the full likelihood ghyp method performs poorly as expected. The fixed $\Delta$ full and LOO likelihood methods (light green and blue lines) have the worst performance, particularly for $\hat{\Sigma}$ and $\hat{\nu}$, since they both adopt a fixed capping level $\Delta$. The only difference is that the LOO likelihood leaves out a data point while the full likelihood does not. Nevertheless, as the data multiplicity increases, leaving a data point out becomes insignificant and so the accuracies of the full fixed $\Delta$ and LOO likelihoods converge. As $\nu$ approaches 1.5, the accuracy of $\hat{\Sigma}$ and $\hat{\nu}$ for all likelihood methods converges.

With data rounding, the results are very similar. The WLOO likelihood is still better for most cases, such as μˆ and γˆ when ν>0.3 approximately. For νˆ, the full likelihood ghyp package performs well around 0.2<ν<0.5, the full adaptive Δ likelihood performs well in the mid-range at around 0.6<ν<1.2 and the full fixed Δ likelihood performs well at around ν=1.4. Again, the ghyp package and adaptive Δ full likelihood perform similarly. For Σˆ and νˆ, the full fixed Δ likelihood still performs worse whereas the LOO likelihood is now much better and is similar to WLOO when ν>0.3 but they perform similarly for smaller ν. We remark that for larger ν, the data rounding is insignificant because of the lower peak and hence less data multiplicity around μˆ.

In summary, for both data repetition and rounding cases, we have demonstrated that the full ghyp and fixed Δ likelihood methods using fixed capping levels perform worse and can be improved by providing an adaptive capping level. Moreover, the LOO likelihood method can be extended to the WLOO likelihood to deal with the unbounded likelihood due to data multiplicity. Overall, the WLOO likelihood gives the most stable and accurate results. As the range of ν is uncertain in practice, it is better to use the WLOO likelihood which provides the best overall performance.

5.3. Asymptotic properties for the location parameter of VG distribution

Podgórski and Wallin [5] proved the consistency and super-efficiency of the location parameter estimate using the maximum LOO likelihood, as stated in Theorem 3.1 in Section 3.2. They also stated the upper bound for the index of the rate of convergence, $\beta<1/(1+\alpha)=1/(2\nu)$ where $\alpha=2\nu-1$, for the unbounded univariate VG distribution. The aim of this section is to determine these optimal rates through simulation studies and to analyse the asymptotic distribution of the location parameter estimate for unbounded and cusp densities. In this study, we consider the LOO instead of the WLOO log-likelihood to investigate the optimal convergence rate of Ref. [5].

We consider the univariate case ($d=1$) and set up the simulation as below for $\nu\in(0,1]$:

  1. Set the sample size $n$ to be one of the 41 sample sizes $\{500,1000,\dots,19{,}500,20{,}000\}\cup\{100{,}000\}$.

  2. For each sample size, set the true shape parameter $\nu$ to be one of the 50 shape parameters $\{0.02,0.04,\dots,0.98,1\}$.

  3. For each pair $(n,\nu)$, generate $r=20{,}000$ different sets of samples, each set from the standard univariate symmetric VG distribution with shape parameter $\nu$ and sample size $n$.

  4. For the $i$-th set of the 20,000 samples, estimate $\hat{\mu}_{n,\nu,i}$ by searching through the mid-points between adjacent data points and choosing the one that maximises the LOO log-likelihood, where the other parameters ($\sigma^2=1$, $\gamma=0$, $\nu$) are held fixed.

This gives us, for each pair $(n,\nu)$, the sample $\hat{\mu}_{n,\nu,1:20{,}000}$ of $r=20{,}000$ estimates, which is used to study the optimal convergence rate and asymptotic distribution across sample size $n$ and shape parameter $\nu$.

5.3.1. Optimal convergence rate

Since the scale of the asymptotic distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$ shrinks with $n$ at the convergence rate $n^{-\beta_\nu}$, we can fit a power curve to estimate $\beta_\nu$ in the optimal convergence rate. We choose a robust measure of spread called the median absolute deviation (MAD) from 0 defined by
$$\mathrm{MAD}_{n,\nu}=\mathrm{median}\left(|\hat{\mu}_{n,\nu,1}|,\dots,|\hat{\mu}_{n,\nu,20{,}000}|\right).$$
For each $\nu$, we fit a power curve to the $\mathrm{MAD}_{n,\nu}$ against $n$. In other words, we find parameters $a_\nu$ and $b_\nu$ such that $\mathrm{MAD}_{n,\nu}=a_\nu n^{-b_\nu}$. This is equivalent to fitting the simple linear regression model $\log\mathrm{MAD}_{n,\nu}=\log a_\nu-b_\nu\log n$ to obtain the estimates $(\log\hat{a}_\nu,\hat{b}_\nu)$. Then an estimate of the optimal rate index for a given $\nu$ is obtained by setting $\hat{\beta}_\nu=\hat{b}_\nu$. We repeat this process for the other $\nu$'s.
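The fit itself is a one-line regression; the sketch below (ours; mu_hat is assumed to be an r × length(ns) matrix of replicated estimates, one column per sample size — a hypothetical layout for illustration) recovers $\hat{\beta}_\nu$ for one value of $\nu$.

```r
# Sketch: estimate beta_nu from MAD_{n,nu} = a_nu * n^(-b_nu) on the log scale.
fit_rate <- function(mu_hat, ns) {
  mad0 <- apply(abs(mu_hat), 2, median)  # MAD from 0 for each sample size n
  fit  <- lm(log(mad0) ~ log(ns))        # log MAD = log a - b log n
  unname(-coef(fit)[2])                  # beta_hat = b_hat
}
```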

Figure 5(a) plots the relative error of $\hat{\beta}_\nu$ against $\nu$ along with its confidence intervals. From the figure, $\hat{\beta}_\nu$ appears to follow the proposed optimal rate of $1/(2\nu)$ when $0<\nu\leq0.4$. However, when $0.4<\nu<1$ (the density is bounded with a cusp when $0.5<\nu<1$), $\hat{\beta}_\nu$ appears to be slightly different and the relative error follows a sinusoidal pattern. In fact, for $0.4<\nu<0.76$, $\hat{\beta}_\nu$ appears to be greater than the proposed optimal convergence rate index, while for $0.76<\nu<1$, $\hat{\beta}_\nu$ appears to be less than the proposed optimal convergence rate index. As $\nu$ approaches 1, $\hat{\beta}_\nu$ approaches the convergence rate index for asymptotic normality. Overall the estimated optimal rate is consistent with Theorem 3.1 in Section 3.2 in the range $\nu<0.5$ for unbounded density. As for $0.5\leq\nu\leq1$, more theoretical study is needed to understand the behaviour of the location parameter estimates $\hat{\mu}_{n,\nu,1:20{,}000}$. To investigate this unusual behaviour, we further analyse the asymptotic distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$ across $\nu$ when $n=100{,}000$ is large.

Figure 5. (a) Relative error $(\hat{\beta}_\nu-\beta_\nu)/\beta_\nu$ (solid black) against ν. The horizontal solid grey line indicates agreement of $\hat{\beta}_\nu$ with the proposed optimal rate $\beta_\nu=\frac{1}{2\nu}$. The vertical grey dotted lines represent grid lines for $\nu=\{0,0.2,0.4,0.6,0.8,1\}$. We also include the 95% confidence interval for the relative error (dashed grey). (b) Density of estimates $\hat{\mu}_{n,\nu,1:20{,}000}$ with scale standardised using MAD for each ν, where n = 100,000. We use a rainbow colour scheme ranging from red (ν=0.02) to magenta (ν=1).

5.3.2. Asymptotic distribution

This section explores the distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$, or $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ on the more appropriate scale, and approximates the distribution to allow SE calculation for $\hat{\mu}$.

Analysis 1: preview the distribution of location parameter estimates using normal kernel

We begin by plotting the normal kernel density estimate in Figure 5(b) for $\hat{\mu}_{n,\nu,1:20{,}000}$ with its scale standardised using MAD, taking $n=100{,}000$ and representing the density for each $\nu$ in rainbow colours. We note that the estimated density exhibits heavier tails and sharper peaks at the expense of intermediate tails as $\nu$ decreases (from the magenta line to the red line), and this feature resembles the VG distribution.

Analysis 2: explore the distribution on the more appropriate log scale

To explore the distributions on a more appropriate scale, we transform $\hat{\mu}_{n,\nu,1:20{,}000}$ to $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ and plot the kernel density estimates in Figure 6 for $n=1000, 5000, 20{,}000, 100{,}000$ by $\nu=0.02, 0.2, 0.4, 0.6, 0.8, 1$. We observe that the location of the distribution generally shifts to the right for each $\nu$ as the sample size $n$ increases, but the scale and the shape of the distribution remain relatively similar. An exception is $\nu=0.8$, where the scale gets slightly larger and the shape becomes less left-skewed. We remark that these densities resemble the densities of the generalised Gumbel (GG) distribution. Fitting the GG distribution to $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ corresponds to fitting a double generalised gamma (DGG) distribution to $\hat{\mu}_{n,\nu,1:20{,}000}$. See Appendices A and B for details of the two distributions, respectively.

Figure 6. Kernel density estimates of $\log|\hat{\mu}_n|$'s for ν = 0.02, 0.2, 0.4, 0.6, 0.8, 1 and n = 1000 (solid black), 5000 (dashed dark grey), 20,000 (dotted grey), 100,000 (dash-dotted light grey), with each n combined into a single plot for comparison. (a) density plots for ν=0.02 (b) density plots for ν=0.2 (c) density plots for ν=0.4 (d) density plots for ν=0.6 (e) density plots for ν=0.8 and (f) density plots for ν=1.

Analysis 3: approximate the distribution on the log scale

This analysis aims to approximate the kernel densities for $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ in Figure 6 with some distributions, which should include the log-normal as the asymptotic distribution when $\nu$ approaches or exceeds 1. In the search for these distributions, we start with VG based on Figure 5(b) and check the goodness-of-fit using P-P plots by plotting the cumulative distribution function (cdf) $F(\log|\hat{\mu}_{n,\nu,1:20{,}000}|;\theta_\nu)$ against the ordered sequence $(1{:}20{,}000)/20{,}001$, giving one P-P plot for each $\nu=0.02,0.04,\dots,1$ and taking $n=100{,}000$. We also consider the stable distribution which, like the normal, can model properly normed sums of independent and identically distributed random variables. If each variable has a finite variance, the classical central limit theorem states that the sum will tend toward a normal distribution as the number of variables increases, but without the finite variance assumption, the limit may be a stable distribution. These provide some theoretical grounds for the trial although it still does not address the unbounded density problem. However, the P-P plots for both log-VG and log-stable distributions argue against their adoption. Finally, we try the DGG distribution with the flexible generalised gamma distribution on each side; the cdf of the GG distribution for fitting $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ is given by Equation (A2) evaluated at the fitted parameters $\hat{\theta}_\nu=(\hat{\mu}_{GG},\hat{\sigma}_{GG},\hat{m}_{GG})$. As the set of P-P plots for the GG distribution shows near-perfect straight lines except for some very minimal deviations around the 25th and 75th percentiles for $0.22\leq\nu\leq0.34$, we confirm that the GG distribution fits $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ the best amongst the log-VG and log-stable distributions. See Ref. [23], pages 77-78 for these P-P plots. Hence we choose the DGG distribution to approximate the distribution of the location parameter estimate $\hat{\mu}$ using maximum LOO likelihood. We acknowledge the limitation of this ad hoc search, but based on the P-P plots, we are confident that the DGG distribution can provide a reasonably accurate standard error (SE) estimate for $\hat{\mu}$. We will demonstrate the calculation and confirm the accuracy.

Analysis 4: assess the convergence of DGG distribution to asymptotic distribution

We assess the convergence of the DGG distribution to the asymptotic distribution as $n$ increases by plotting in Figure 7 the parameter estimates $(\hat{\mu}_{GG},\hat{\sigma}_{GG},\hat{m}_{GG})$ against $\nu$ when fitting $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ for each $(n,\nu)$ to the GG distribution, with different $n$ represented by lines in rainbow colours from red for the lowest $n=500$. We also plot the transformed parameter estimates to view the behaviour across $\nu$ more clearly. As $\nu$ decreases, $\hat{\mu}_{GG}$ in Figure 7(a,d) appears to drop roughly at a hyperbolic rate with some minor curvature for larger values of $\nu$. This agrees with the observation in Figure 6 that the location of the density shifts to the left as $\nu$ decreases. However, $\hat{\sigma}_{GG}$ in Figure 7(b,e) increases at a hyperbolic rate with two bumps as $\nu$ decreases. This increase in the spread can also be observed in the densities of Figure 6. The two bumps occur around $\nu=0.2$ and $\nu=0.7$, with the left one showing more fluctuation and no clear distinction between each $n$, possibly due to sampling error, whereas the right one has a clear distinction between each $n$, especially for $n=100{,}000$. In Figure 7(c,f), $\hat{m}_{GG}$ also has two bumps similar to $\hat{\sigma}_{GG}$, but unlike $\hat{\mu}_{GG}$ and $\hat{\sigma}_{GG}$, $\hat{m}_{GG}$ tends to some constant value as $\nu$ approaches 0. The right bumps for both $\hat{\sigma}_{GG}$ and $\hat{m}_{GG}$ suggest that the DGG distribution for $\hat{\mu}$ has yet to converge to the asymptotic distribution, and it is unclear how large $n$ should be for convergence when $0.5\leq\nu\leq1$, the range with cusp density. Together with the inconsistency of the index of the optimal convergence rate displayed in Figure 5(a), the asymptotic properties for this range of $\nu$ warrant further analysis.


Figure 7. Plot of estimates of the generalised Gumbel distribution fitted to the distribution of log|μˆn| against ν. Row 1 plots (μˆGG,σˆGG,mˆGG) against ν, respectively, while row 2 plots the transformations (−1/μˆGG,−1/σˆGG,log mˆGG) against ν, respectively, to enlarge certain portions of the plots. A rainbow colour scheme ranging from red (n = 500) to magenta (n = 20,000) denotes the sample size, and the black line represents n = 100,000. (a) μˆGG vs ν (b) σˆGG vs ν (c) mˆGG vs ν (d) −1/μˆGG vs ν (e) −1/σˆGG vs ν and (f) log mˆGG vs ν.

Demonstration of the SE calculation for the location parameter estimate using the DGG distribution

Lastly, we demonstrate the SE calculation for μˆ (d = 2) using the DGG distribution, taking the true parameters in Section 5.1 as the sample estimates. For each component μˆj, we apply the GG distribution to estimate the variability of log|μˆj|. Specifically, we use νˆ=0.15 with sample size n = 1000 to extrapolate the values −1/μˆGG = 0.0466, −1/σˆGG = −0.1291 and log mˆGG = 1.8733 by applying the spline function in R to Figure 7(d–f). This gives the GG parameter estimates μˆGG = −21.4431, σˆGG = 7.7431 and mˆGG = 6.5098. Applying these estimates to Equation (A5) with Σˆ11 = 1 and Σˆ22 = 1 gives the MADs \[ \widehat{\mathrm{MAD}}_{n,\nu}(\hat{\boldsymbol{\mu}}) = \begin{pmatrix} \mathrm{MAD}(\hat\mu_1) \\ \mathrm{MAD}(\hat\mu_2) \end{pmatrix} \approx \begin{pmatrix} 3.25\times 10^{-10} \\ 3.25\times 10^{-10} \end{pmatrix} \] using the DGG distribution, which are close to the 'observed' values \[ \mathrm{MAD}_{n,\nu}(\hat{\boldsymbol{\mu}}) = \begin{pmatrix} 3.16\times 10^{-10} \\ 2.85\times 10^{-10} \end{pmatrix} \] obtained by applying MAD_{n,ν}(x) = median(|x − median(x)|) componentwise to the simulated samples μˆj,n,ν,1:r with r = 1000 replicates. Hence the MˆAD_{n,ν} estimated from the DGG distribution provides reasonably accurate estimates of MAD_{n,ν}, which is unobserved in practice, for the SE of μˆ using maximum LOO likelihood when the shape parameter ν ∈ (0, d/2) gives an unbounded density or ν ∈ (d/2, (d+1)/2) gives a cusp density. Using MˆAD_{n,ν} as the SE estimate, we can also construct confidence intervals for μˆ.
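The MAD above can be reproduced in a few lines of R. This is our own rendering of Equation (A5) with the signs in the exponent restored, not the authors' code, but it recovers the reported value:

# Reproduce MAD(mu_hat_j) from the fitted GG parameters via Equation (A5)
mu_GG <- -21.4431; sigma_GG <- 7.7431; m_GG <- 6.5098
Sigma_jj <- 1  # diagonal scale entries, Sigma_11 = Sigma_22 = 1
mad_hat <- sqrt(Sigma_jj) *
  qgamma(0.5, shape = m_GG, rate = m_GG * exp(-mu_GG / sigma_GG))^sigma_GG
mad_hat  # approximately 3.25e-10, matching the text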

6. Conclusion

Optimising a non-convex likelihood is always challenging, particularly for noisy data with a cusped or unbounded likelihood. An obvious example is the estimation of μ for the VG distribution when ν ≤ d/2. We propose an ECM algorithm using a hierarchical model via the NMVM representation of the VG distribution to avoid direct maximisation of the full likelihood. Conditional on the mixing variable ut, yt follows a conditional normal distribution in (2). When yt → μ and ν ≤ d/2, the density in (3) becomes unbounded, and so does the conditional expectation E(1/ut | y, θ) in (17). Nitithumbundit and Chan [Citation6] proposed to cap the density, whereas we propose to use the LOO likelihood. In the case of data multiplicity, we extend the LOO likelihood to the WLOO likelihood.

To improve the estimation efficiency, we propose the local mid-point search (one-dimensional case in Section 4.2.1) and the local point search (higher-dimensional case in Section 4.2.2) for μˆ, as well as the line search (Section 4.2.4). The local point searches allow the location parameter estimates to jump closer towards the maximum. Since the derivatives of the LOO likelihood in (23) are also approximations, the line search ensures monotonicity across successive updates. The first simulation study confirms the accuracy of the location parameter estimate for the VG distribution, even though these search procedures do not necessarily guarantee that the path leads to a global maximum.
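For concreteness, the core LOO idea can be sketched in a few lines of R. This is a simplified illustration rather than the full ECM search; dens is any density function with signature dens(y, mu, ...), for instance a VG density obtained by wrapping ghyp::dghyp:

# Leave-one-out (LOO) log-likelihood: drop the observation closest to the
# candidate location mu, then sum the log-density over the remaining points.
loo_loglik <- function(mu, y, dens, ...) {
  i <- which.min(abs(y - mu))       # the data point nearest mu drives the unboundedness
  sum(log(dens(y[-i], mu, ...)))    # full likelihood without that point
}
# The LOO estimator maximises loo_loglik over candidate values of mu,
# e.g. over the data points themselves.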

In the second simulation study, our proposed ECM algorithm using the WLOO likelihood is compared to three full likelihood methods, namely MCECM from the ghyp package, AECM with fixed Δ and AECM with adaptive Δ, as well as one LOO likelihood method. The comparison is conducted under three scenarios: no data multiplicity, data multiplicity due to repetition, and data multiplicity due to rounding. Results show that the ECM algorithm using the WLOO likelihood performs well in all three scenarios.

Lastly, our third simulation study explores empirically the optimal convergence rate and asymptotic distribution of our proposed location parameter estimate using the maximum LOO likelihood method for symmetric univariate VG data. Results show that the index of the optimal convergence rate follows 1/(2ν) when 0 < ν ≤ 0.4. However, when 0.4 < ν < 1, the index appears to deviate slightly, with a sinusoidal pattern in the relative error. As ν approaches 1, the optimal rate approaches the convergence rate for asymptotic normality. Following this result, we propose to approximate the asymptotic distribution by the DGG distribution when ν ∈ (0,1], so that we can calculate the SE and construct a confidence interval for each component μˆj. In summary, the idea of the proposed ECM algorithm can be applied to other distributions with an NMVM representation, including the Student's t and GH distributions, with suitable adjustments in the E-step.

Our method using the hierarchical model in the NMVM representation is similar in spirit to Bayesian methods, which are sometimes treated as frequentist when flat priors are adopted so that the solution can be regarded as 'equivalent' to the solution under the classical likelihood method. This proxy likelihood approach can avoid the optimisation of complicated likelihoods. Further, the empirical Bayes (EB) method offers procedures for statistical inference in which the prior π(θ) is estimated from the data, i.e. π(θ|y), without relying on external input (similar to using flat priors). Hossain, Kozubowski and Podgórski [Citation24] proposed the weighted likelihood estimator of a parameter θ as (32) \[ \hat{\theta}_{EB} = \sum_{i=1}^{n} w_i \hat{\theta}_i = \frac{\sum_{i=1}^{n} \hat{\theta}_i L(\hat{\theta}_i \mid \mathbf{y})}{\sum_{i=1}^{n} L(\hat{\theta}_i \mid \mathbf{y})} \overset{LM}{=} \frac{\sum_{i=1}^{n} y_i \prod_{j=1}^{n} h(y_j - y_i)}{\sum_{i=1}^{n} \prod_{j=1}^{n} h(y_j - y_i)}, \] where θˆi=υ(yi) is an estimator of θ based on the ML principle and observation yi, and the last expression in (32) holds for the location model (LM) f(yi|θ)=h(yi−θ), yi ∈ ℝ, with h(⋅) a symmetric and unimodal density. This estimator has an EB flavour since θˆEB can be viewed as the posterior mean of the θˆi with weights wi, where the data-driven prior distribution is discrete and supported on the values θˆi with equal probability 1/n. They showed that the asymptotic normality and efficiency of the Bayes estimator under an ordinary prior π(θ) carry over to the EB estimator under the empirical sample prior π(θ|y). Based on the symmetric gamma (SG) distribution with pdf \[ f_{SG}(y \mid \theta) = \frac{|y-\theta|^{\alpha-1} \exp(-|y-\theta|)}{2\Gamma(\alpha)}, \quad y \in \mathbb{R}, \] which is unbounded when 0<α<1 (α=1 corresponds to the symmetric Laplace (SL) distribution), they showed that the location parameter θ of the LM has the closed-form expression \[ \hat{\theta}_{EB} = \frac{\sum_{i=1}^{n} y_i \exp\left(-\sum_{j\neq i}|y_j-y_i|\right) \prod_{j\neq i}|y_j-y_i|^{\alpha-1}}{\sum_{i=1}^{n} \exp\left(-\sum_{j\neq i}|y_j-y_i|\right) \prod_{j\neq i}|y_j-y_i|^{\alpha-1}} \] from (32) when the estimator θˆi=υ(yi) is a LOO estimator, that is, θˆi is calculated from the sample without yi. In a simulation study (Tables 5 and 6 of Ref. [Citation24]), θˆEB performed slightly better than \[ \hat{\theta}_{LOO} = \arg\max\{\ell_{LOO}(y_i \mid \mathbf{y}),\ i=1,\dots,n\} \] for the SG and SL distributions. This EB method, which provides a closed-form solution, is attractive. Future studies may derive the EB estimates for the VG distribution and compare the performance of the EB method with our proposed ECM method using the WLOO likelihood under several scenarios.
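As an indication of how cheap this closed-form EB estimator is to compute, the R sketch below implements the SG location estimator above on the log scale for numerical stability. It is our own rendering of the formula, not code from Ref. [Citation24]:

# Empirical Bayes location estimate for the symmetric gamma location model:
# weight w_i proportional to exp(-sum_{j!=i}|y_j-y_i|) * prod_{j!=i}|y_j-y_i|^(alpha-1)
theta_eb_sg <- function(y, alpha) {
  logw <- vapply(seq_along(y), function(i) {
    d <- abs(y[-i] - y[i])
    (alpha - 1) * sum(log(d)) - sum(d)  # log of the unnormalised weight
  }, numeric(1))
  w <- exp(logw - max(logw))  # normalise on the log scale to avoid underflow
  sum(w * y) / sum(w)
}
set.seed(1)
theta_eb_sg(rnorm(100), alpha = 0.5)  # toy data; the estimate is near 0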

Moreover, our third simulation study of the optimal convergence rate and asymptotic distribution of the location parameter estimate considers only the univariate symmetric case using the LOO likelihood. For future work, we will consider the multivariate skewed case using the WLOO likelihood. In-depth study should also be directed to the case 0.5 ≤ ν ≤ 1, when the density has a cusp, as results show inconsistency of the index of the optimal convergence rate and convergence problems for the asymptotic distribution. Furthermore, the dependence between the location and the other parameters of the VG distribution should be considered in the simulation studies. Lastly, it is worth exploring better approximations to the asymptotic distribution from both numerical and theoretical perspectives, even though this can be challenging.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Rao B. Estimation of the location of the cusp of a continuous density. Ann Math Stat. 1968;39:76–87.
  • Ibragimov IA, Khasminskii RZ. Asymptotic behavior of statistical estimates of the location parameter for samples with unbounded density. J Sov Math. 1981a;16(2):1035–1041.
  • Ibragimov IA, Khasminskii RZ. Statistical estimation: asymptotic theory. Applications of Mathematics, Vol. 16. Translated from the Russian by Samuel Kotz. New York-Berlin: Springer-Verlag; 1981b.
  • Rao B. Asymptotic distributions in some non-regular statistical problems. PhD dissertation, Michigan State University; 1966.
  • Podgórski K, Wallin J. Maximizing leave-one-out likelihood for the location parameter of unbounded densities. Ann Inst Stat Math. 2015;67(1):19–38.
  • Nitithumbundit T, Chan JSK. ECM algorithm for auto-regressive multivariate skewed variance gamma model with unbounded density. Methodol Comput Appl Probab. 2020;22(3):1169–1191.
  • Meng X-L, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80(2):267–278.
  • Liu C, Rubin DB. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81(4):633–648.
  • Liu C, Rubin DB. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Stat Sin. 1995;5(1):19–39.
  • Madan DB, Seneta E. The variance gamma (V.G.) model for share market returns. J Bus. 1990;63(4):511–524.
  • Nitithumbundit T, Chan JSK. ECM algorithm for estimating vector ARMA model with variance gamma distribution and possible unbounded density. Aust N Z J Stat. 2021;63(3):485–516.
  • Barndorff-Nielsen O, Kent J, Sørensen M. Normal variance-mean mixtures and z distributions. Int Stat Rev. 1982;50(2):145–159.
  • Protassov RS. EM-based maximum likelihood parameter estimation for multivariate generalized hyperbolic distributions with fixed λ. Stat Comput. 2004;14(1):67–77.
  • Kotz S, Kozubowski TJ, Podgórski K. The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance. Boston: Birkhäuser; 2001.
  • Gradshteyn IS, Ryzhik IM. Table of integrals, series, and products. 7th ed., Amsterdam: Elsevier/Academic Press; 2007.
  • Kawai R. On the likelihood function of small time variance gamma Lévy processes. Statistics. 2015;49(1):63–83.
  • Embrechts P. A property of the generalized inverse Gaussian distribution with some applications. J Appl Probab. 1983;20(3):537–544.
  • Salakhutdinov R, Roweis S. Adaptive overrelaxed bound optimization methods. In: Proceedings of the International Conference on Machine Learning; 2003; Vol. 20. p. 664–671.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc Ser B. 1977;39(1):1–38. With discussion.
  • Wu C-FJ. On the convergence properties of the EM algorithm. Ann Stat. 1983;11(1):95–103.
  • Meng X-L, van Dyk D. The EM algorithm – an old folk-song sung to a fast new tune. J Roy Statist Soc Ser B. 1997;59(3):511–567. With discussion and a reply by the authors.
  • Schruth D. caroline: A Collection of Database, Data Structure, Visualization, and Utility Functions for R. R package version 0.7.6; 2013.
  • Nitithumbundit T. EM algorithms for multivariate skewed variance gamma distribution with unbounded densities and applications. PhD thesis, School of Mathematics and Statistics, University of Sydney; 2017. https://ses.library.usyd.edu.au/handle/2123/18155.
  • Hossain MM, Kozubowski TJ, Podgórski K. A novel weighted likelihood estimation with empirical bayes flavor. Commun Stat Simul Comput. 2018;47(2):392–412.
  • Adeyemi S. On a generalization of the Gumbel distribution. Inter-Stat (London). 2002;11(4):1–7.
  • Lin GD, Huang JS. The cube of a logistic distribution is indeterminate. Austral J Statist. 1997;39(3):247–252.

Appendix

A Generalised Gumbel (GG) distribution

The pdf of a GG distribution is given by (A1) \[ f_{GG}(x) = \frac{m^m}{\sigma\Gamma(m)} \exp\left( \frac{m(x-\mu)}{\sigma} - m \exp\left(\frac{x-\mu}{\sigma}\right) \right), \quad x \in \mathbb{R}, \] where μ ∈ ℝ is the location parameter, σ>0 is the scale parameter, and m>0 is the shape parameter. Note that we consider the negative version of the GG distribution given in Ref. [Citation25].

We can easily generate GG random variables from gamma random variables, denoted by G(a,b) with shape a and rate b, using the following theorem.

Theorem A.1

If X ∼ G(m,m), then Y = μ + σ log X follows the GG distribution with pdf in Equation (A1).

Proof.

The idea of the proof is similar to the proof in Ref. [Citation25].

Using this transformation, we can compute the cdf as (A2) \[ F_{GG}(x) = F_{\text{gamma}}\left( \exp\left(\frac{x-\mu}{\sigma}\right); m, m \right), \quad x \in \mathbb{R}, \] and the quantile function as \[ Q_{GG}(p) = \mu + \sigma \log Q_{\text{gamma}}(p; m, m), \quad p \in [0,1], \] where F_gamma(x;a,b) and Q_gamma(p;a,b) are the cdf and quantile function of G(a,b). In other words, we only need the cdf and quantiles of the gamma distribution. The mean and variance of a GG random variable X are given by (A3) \[ E(X) = \mu + \sigma(\psi(m) - \log m) \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2 \psi'(m), \] where ψ(⋅) and ψ′(⋅) denote the digamma and trigamma functions.
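Theorem A.1 makes simulation and probability calculations for the GG distribution one-liners in R. The helper names rgg, pgg and qgg below are our own (a minimal sketch, assuming the rate parameterisation of G(a,b)):

# GG random variates, cdf (A2) and quantile function via the gamma representation
rgg <- function(n, mu, sigma, m) mu + sigma * log(rgamma(n, shape = m, rate = m))
pgg <- function(x, mu, sigma, m) pgamma(exp((x - mu) / sigma), shape = m, rate = m)
qgg <- function(p, mu, sigma, m) mu + sigma * log(qgamma(p, shape = m, rate = m))
# Numerical check of the moments in (A3)
x <- rgg(1e5, mu = -2, sigma = 0.5, m = 3)
c(mean(x), -2 + 0.5 * (digamma(3) - log(3)))  # E(X) = mu + sigma*(psi(m) - log m)
c(var(x), 0.25 * trigamma(3))                 # Var(X) = sigma^2 * psi'(m)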

B Double generalised gamma (DGG) distribution

After fitting log|μˆn,ν,1:r| to the GG distribution, we can deduce that the distribution of μˆn,ν,1:r follows a DGG distribution [Citation26] using the following theorem.

Theorem A.2

If X follows a symmetric distribution such that log|X| follows the GG distribution with pdf in Equation (A1), then X follows a double generalised gamma distribution with pdf (A4) \[ f_{DGG}(x) = \frac{\gamma \beta^{\alpha}}{2\Gamma(\alpha)} |x|^{\gamma\alpha - 1} \exp\left( -\beta |x|^{\gamma} \right), \quad x \in \mathbb{R}, \] where α>0, β>0, and γ>0.

Proof.

If Y = log|X| follows the GG distribution, then Y has pdf proportional to \[ f_Y(y) \propto \exp\left( \frac{my}{\sigma} - m e^{-\mu/\sigma} e^{y/\sigma} \right), \] where it suffices to consider the functional form. Applying the transformation Y = log W, where W = |X|, gives the pdf of W as \[ f_W(w) \propto \exp\left( \frac{m}{\sigma}\log w - m e^{-\mu/\sigma} w^{1/\sigma} \right) \frac{1}{w} = w^{m/\sigma - 1} \exp\left( -m e^{-\mu/\sigma} w^{1/\sigma} \right), \] which has the functional form of the generalised gamma distribution. Reversing the transformation W = |X| by reflecting the pdf at 0 gives the result. Setting α = 1/2, β = 1/2 and γ = 2 in (A4) gives the standard normal density as a special case. As the simulation results in Section 5.3.2 show that the GG distribution fits log|μˆn,ν,1:r| reasonably well, we can model μˆn,ν,1:r using the DGG distribution. Suppose that Σˆ is the scale parameter estimate of the VG distribution with diagonal entries Σˆii for i=1,…,d. Then the standard error of μˆi can be approximated by \[ \mathrm{SE}(\hat\mu_i) \approx \sqrt{ \hat\Sigma_{ii}\, \beta^{-2/\gamma}\, \frac{\Gamma(\alpha + 2/\gamma)}{\Gamma(\alpha)} }, \] where α = m_GG, β = m_GG exp(−μ_GG/σ_GG), γ = 1/σ_GG, and (μ_GG, σ_GG, m_GG) are the GG parameter estimates extrapolated from Figure 7. Since the standard error is sensitive to outliers, we recommend the MAD given by (A5) \[ \mathrm{MAD}(\hat\mu_i) \approx Q_{\text{gengamma}}(0.5) = \sqrt{\hat\Sigma_{ii}} \left[ Q_{\text{gamma}}\left( 0.5;\, m_{GG},\, m_{GG} e^{-\mu_{GG}/\sigma_{GG}} \right) \right]^{\sigma_{GG}} \] as a robust measure of spread for μˆi, where Q_gengamma(⋅) is the quantile function of the generalised gamma distribution whose pdf has the functional form in (A4) with support x>0.
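A quick simulation check of Theorem A.2 can be run in R. This is our own sketch with illustrative parameter values, using the DGG parameterisation α = m, β = m exp(−μ/σ), γ = 1/σ derived in the proof:

# Simulate X with log|X| ~ GG(mu, sigma, m) and a random sign, then compare
# the empirical second moment with the DGG moment formula used for SE(mu_hat_i)
set.seed(1)
mu <- -1; sigma <- 0.5; m <- 2
x <- sample(c(-1, 1), 1e5, replace = TRUE) *
  exp(mu + sigma * log(rgamma(1e5, shape = m, rate = m)))
alpha <- m; beta <- m * exp(-mu / sigma); gam <- 1 / sigma
c(empirical = mean(x^2),
  theory = beta^(-2 / gam) * gamma(alpha + 2 / gam) / gamma(alpha))  # both ~ 0.135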