
Maximum leave-one-out likelihood method for the location parameter of variance gamma distribution with unbounded density

Pages 2642-2671 | Received 24 Jun 2022, Accepted 04 Apr 2023, Published online: 18 Jul 2023

Abstract

Maximum likelihood estimation of a location parameter fails when the density has an unbounded mode. An alternative approach is considered by leaving out a data point to avoid the unbounded density in the full likelihood. This modification gives rise to the leave-one-out (LOO) likelihood. We propose an expectation-conditional maximisation (ECM) algorithm to estimate the parameters of the variance gamma (VG) distribution with unbounded density by maximising the LOO likelihood. The VG distribution has a normal mean-variance mixture representation that facilitates the ECM algorithm. K. Podgórski and J. Wallin [Maximizing leave-one-out likelihood for the location parameter of unbounded densities. Ann Inst Stat Math. 2015;67(1):19-38] showed that the location parameter estimate which maximises the LOO likelihood is consistent and super-efficient. In the case of repeated data, the LOO likelihood is still unbounded, and we extend it to the weighted LOO (WLOO) likelihood. We perform simulations to investigate the accuracy of the ECM method using the WLOO likelihood and compare it with other likelihood methods, including the alternating ECM with fixed and adaptive caps on the unbounded density, the ghyp package using multi-cycle ECM, and the ECM using the LOO likelihood. Lastly, we explore the asymptotic properties of the location parameter estimate of the VG distribution, such as the optimal convergence rate and asymptotic distribution.

1. Introduction

Asymptotic properties of maximum likelihood estimation for location parameters have been studied for the case when the likelihood is non-differentiable but bounded [1]. However, this method breaks down when the likelihood is unbounded at certain points. Alternative methods for the unbounded density case have been considered in Refs. [2,3] and in Ref. [4], where consistency of the location estimate was proved using a Bayesian approach.

Under the likelihood approach, however, modifications to the full likelihood are necessary since the unbounded points occur at the data points. A possible solution is to leave out the data point closest to the location parameter in the full likelihood. This modification leads to a concept known as the leave-one-out (LOO) likelihood proposed by Podgórski and Wallin [5]. They proved consistency and super-efficiency of the location estimate that maximises the LOO likelihood when the density is unbounded. More precisely, under certain regularity conditions, they found an upper bound for the index of the convergence rate of the location parameter estimate (see Section 3.2 for the definition of the convergence rate and its index). Currently, there are no theoretical results regarding the optimal convergence rate and the asymptotic distribution of the maximum LOO likelihood estimate for the location parameter when the density is unbounded or cusped. Since the theoretical derivation of these two asymptotic properties is very challenging, we propose to perform numerical simulations to investigate the properties and obtain some insights for the theoretical development.

To demonstrate the idea, we consider multivariate skewed variance gamma (VG) distributions with unbounded densities when the shape parameter $\nu\leq d/2$, where $d$ is the dimension of the multivariate data. To estimate the parameters of VG distributions with unbounded densities, Nitithumbundit and Chan [6] proposed an expectation-conditional maximisation (ECM) algorithm [7] in which they capped the density within a small interval of the location parameter. Depending on whether the conditional density or the actual density is used to estimate the shape parameter, the ECM algorithm can be further classified into multi-cycle ECM (MCECM) and ECM either (ECME) [8,9]. In this paper, the ECME algorithm is still called the ECM algorithm as the methodological idea of ECME is similar to ECM. However, a drawback of capping the unbounded density is that the size of the interval over which the density is capped is somewhat arbitrary, and different intervals can give different results. Podgórski and Wallin [5] established a theoretical framework for using the LOO likelihood in an expectation-maximisation (EM) algorithm to estimate the location parameter of the univariate symmetric generalised Laplace distribution (which is equivalent to the VG distribution proposed by Madan and Seneta [10]) with fixed shape parameter and unbounded density.

However, when there is data multiplicity (repeated data), the LOO likelihood will still be unbounded even if we leave out one of the data points. If we leave out multiple data points to obtain the leave-multiple-out (LMO) likelihood, this LMO likelihood has varied data size and contribution across the parameter space depending on the data multiplicities. This inconsistent data contribution leads to a discontinuous LMO likelihood. It is necessary to address this data multiplicity issue in the LOO likelihood as the issue is likely to occur for three reasons. Firstly, data rounding is common in practice, simply for convenience or for reducing data storage. Secondly, as big data have become more prevalent in recent years, the chance of data multiplicity increases with sample size. Lastly, as measurements are made more instantaneously for high-frequency data, the changes between measurements will be minimal, and hence measures using these changes, such as returns, will be extremely small or even zero, again giving rise to data multiplicity. Although the chance of data multiplicity is lower when the means, say for a time series model, change over time and when multi-dimensional models are used, one can never be sure in practice whether data multiplicity exists. Hence, it is always advisable to fix the LOO likelihood issues to provide numerically stable estimates.

Following this idea, Nitithumbundit and Chan [11] proposed the weighted LOO (WLOO) likelihood to deal with the discontinuous LMO likelihood when there are data multiplicities. They applied the WLOO likelihood to multivariate data with unbounded densities and derived an ECM algorithm [8] to obtain the maximum WLOO estimate of the location parameters of the vector autoregressive moving average (VARMA) model with multivariate skewed VG distribution when densities are cusped or unbounded with respect to the location parameters. The proposed algorithm is an extension of the EM algorithm, and they derived explicit formulae to calculate all parameters in the expectation step (E-step) and conditional maximisation steps (CM-steps). When there are data multiplicities, they applied the WLOO likelihood to obtain a continuous likelihood function. However, that paper focuses on the application and did not study the properties of this new method using the ECM algorithm with the WLOO likelihood.

The objective of this paper is to fill this research gap and study the properties of our proposed ECM algorithm with the WLOO likelihood. Our first objective assesses the accuracy of the ECM algorithm using the LOO likelihood in the case of no data multiplicity, and compares the ECM algorithm using the LOO and WLOO likelihoods with other full likelihood methods in the case of data multiplicity due to repetition and rounding. To focus on the properties of the method, we make two modifications to Ref. [11]: we consider a constant mean instead of the ARMA mean, and we drop the alternating ECM (AECM) algorithm, which aims only to improve the convergence rate of the ECM algorithm. Our second and main objective analyses the asymptotic behaviour, including the optimal convergence rate and asymptotic distribution, of the maximum LOO likelihood estimate of the location parameter through simulation studies using data simulated from the VG distribution with different sample sizes and shape parameters. Specifically, we investigate how the index of the optimal convergence rate in Ref. [5] for the location parameter estimates and its asymptotic distribution change across the shape parameter of the VG distribution with unbounded or cusp density. From these studies, we aim to approximate the asymptotic distribution with some distributions to provide a standard error for the location parameter estimates.

We choose the VG distribution to demonstrate the ability of our proposed methods to deal with unbounded densities for three important reasons. Firstly, the VG distribution is an important special case of a more general class, the generalised hyperbolic (GH) distribution [12]. When the GH distribution approaches the VG distribution, one of its shape parameters approaches the boundary of the parameter space, potentially causing the density to become unbounded. Although Protassov [13] proposed an EM algorithm for the GH distribution, the algorithm considers only regular shape parameters inside the parameter space, and hence it does not truly capture the case of the VG distribution with unbounded density when one of the shape parameters approaches the boundary of the parameter space. Our proposed ECM algorithm deals with such a case when the density is unbounded. Secondly, the VG distribution has a normal mean-variance mixture (NMVM) representation that facilitates the implementation of the ECM algorithm. Lastly, the VG distribution has applications in many areas such as finance, signal processing and quality control; see Ref. [14] for other applications and further details on the generalised Laplace distribution. We remark that the idea of the proposed ECM algorithm can be applied to other distributions with an NMVM representation, including Student's t and GH distributions, with some suitable adjustments in the E-step.

The remaining paper is organised as follows. Section 2 summarises some important properties of the multivariate skewed VG distribution. Section 3 formulates the maximum LOO likelihood framework for estimating the location parameter of distributions with unbounded densities, reports some properties of the estimate and introduces the WLOO likelihood with two examples to explain the weight settings. Section 4 introduces the ECM algorithm using the LOO and WLOO likelihoods to estimate parameters of the VG distribution. Although the methodologies in Sections 3 and 4 were described in Ref. [11], we present them in a slightly different way, starting by defining the different LOO likelihoods, which are our focus, and with a different emphasis, giving more technical details. Moreover, some equations are modified for the different mean structures. Section 5 presents three simulation studies. The first study tests the accuracy of our ECM method using the WLOO likelihood, while the second study compares the accuracy of the ECM method with other full and LOO likelihood methods in three scenarios, namely no data multiplicity, data multiplicity due to repetition, and data multiplicity due to rounding. The last study analyses the asymptotic behaviour of the maximum LOO likelihood method for estimating the location parameter of the VG distribution. Section 6 concludes the paper with further remarks.

2. Variance gamma distribution

The probability density function (pdf) of a d-dimensional VG distribution is given by
$$f_{VG}(y)=\frac{2^{1-\frac{d}{2}}\,\nu^{\nu}}{|\Sigma|^{\frac{1}{2}}\,\pi^{\frac{d}{2}}\,\Gamma(\nu)}\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\;z\right)\exp\!\left((y-\mu)^{\top}\Sigma^{-1}\gamma\right)\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{-\frac{1}{4}(2\nu-d)}z^{-\frac{1}{2}(d-2\nu)} \tag{1}$$
where $z^2=(y-\mu)^{\top}\Sigma^{-1}(y-\mu)$, $\mu\in\mathbb{R}^d$ is the location parameter, $\Sigma$ is a $d\times d$ positive definite symmetric scale matrix, $\gamma\in\mathbb{R}^d$ is the skewness parameter, $\nu>0$ is the shape parameter, $\Gamma(\cdot)$ is the gamma function and $K_{\lambda}(\cdot)$ is the modified Bessel function of the second kind with index $\lambda$ [15].

The VG distribution has a normal mean-variance mixture representation [12] given by
$$y\,|\,u \sim N_d(\mu+\gamma u,\; u\Sigma), \qquad u \sim G(\nu,\nu) \tag{2}$$
where $G(a,b)$ is a gamma distribution with shape parameter $a>0$, rate parameter $b>0$ and pdf
$$f_G(u)=\frac{b^{a}}{\Gamma(a)}\,u^{a-1}\exp(-bu), \quad \text{for } u>0.$$
The mean and covariance matrix of a VG random vector $Y$ are given by
$$E(Y)=\mu+\gamma \quad\text{and}\quad \mathrm{cov}(Y)=\Sigma+\frac{1}{\nu}\gamma\gamma^{\top},$$
respectively. As $y\to\mu$, the pdf in (1) behaves as
$$f_{VG}(y)\sim\begin{cases}2^{\nu-d}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{\Gamma(\nu-\frac{d}{2})}{\Gamma(\nu)}\,\nu^{\nu}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{-\frac{2\nu-d}{2}} & \text{if } \nu>\frac{d}{2},\\[6pt] -\,2^{1-d}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{d^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\,\log(z) & \text{if } \nu=\frac{d}{2},\\[6pt] 2^{-\nu}\,\pi^{-\frac{d}{2}}\,|\Sigma|^{-\frac{1}{2}}\,\dfrac{\Gamma(\frac{d}{2}-\nu)}{\Gamma(\nu)}\,\nu^{\nu}\,z^{2\nu-d} & \text{if } \nu<\frac{d}{2}.\end{cases} \tag{3}$$
Hence the density becomes unbounded at $\mu$ when $\nu\leq\frac{d}{2}$. This poses some technical difficulty when working with the VG distribution as it is unclear whether the shape parameter will fall into the unbounded range. It is also worth noting that the density is bounded with a cusp when $\frac{d}{2}<\nu\leq\frac{d+1}{2}$.
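To make the representation concrete, the sketch below (ours, not the authors' code) evaluates the reconstructed density (1) with base R's besselK and simulates VG vectors through the mixture (2); the function names dvg and rvg are our own.

```r
# Minimal sketch assuming the density (1) as reconstructed above; not the authors' code.
dvg <- function(y, mu, Sigma, gam, nu) {
  d    <- length(mu)
  Sinv <- solve(Sigma)
  c2   <- 2 * nu + drop(t(gam) %*% Sinv %*% gam)   # 2*nu + gamma' Sigma^-1 gamma
  z    <- sqrt(drop(t(y - mu) %*% Sinv %*% (y - mu)))
  lam  <- nu - d / 2
  2^(1 - d / 2) * nu^nu / (sqrt(det(Sigma)) * pi^(d / 2) * gamma(nu)) *
    besselK(sqrt(c2) * z, lam) * exp(drop(t(y - mu) %*% Sinv %*% gam)) *
    c2^(-(2 * nu - d) / 4) * z^(-(d - 2 * nu) / 2)
}

rvg <- function(n, mu, Sigma, gam, nu) {
  d <- length(mu)
  u <- rgamma(n, shape = nu, rate = nu)            # mixing variable in (2)
  Z <- matrix(rnorm(n * d), n, d) %*% chol(Sigma)  # rows ~ N(0, Sigma)
  sweep(Z * sqrt(u) + outer(u, gam), 2, mu, "+")   # mu + gamma*u + sqrt(u)*Z
}
```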

This problem of unbounded density is illustrated in a figure of Ref. [11] with the full log-likelihood and LOO log-likelihood across the location parameter for a data set of 10 observations simulated from a standard VG distribution ($\mu=0$, $\sigma=1$, $\gamma=0$) with shape parameter $\nu=0.2$. Leaving the data point out essentially smooths out the unbounded log-likelihood at the data points so that the maximum is well-defined. Additionally, if we zoom in around $\mu=0$, we observe that cusps tend to occur between data points.

Figure 1. Plots of full (solid red), LOO (striped blue), LMO (dotted green), and WLOO (dot & striped magenta) log-likelihoods of univariate symmetric VG distribution with ν=0.4 for data set {−1,0,1,0} represented by light grey vertical strips. (a) Full and LOO log-likelihoods and (b) LMO and WLOO log-likelihoods.

3. Maximum leave-one-out likelihood

Let $y=(y_1,\dots,y_n)$ be observed data from the VG distribution with corresponding missing variables $u=(u_1,\dots,u_n)$, and $\theta=(\mu,\Sigma,\gamma,\nu)$ be the parameters of the VG distribution in the parameter space $\Theta$. The density of the VG distribution is unbounded at $\mu$ when $\nu\leq\frac{d}{2}$. Consequently, the maximum likelihood estimate is not well-defined since the likelihood function is unbounded at each data point. Kawai [16] showed that for the univariate case, the Fisher information with respect to $\mu$ is also not well-defined when $\nu<1/2$. That is, if $\nu<1/2$, then
$$E_{\theta}\!\left[\left(\frac{\partial}{\partial\mu}\log f(Y;\theta)\right)^{2}\right]=\infty$$
where $Y$ is a univariate VG random variable with density function $f$, and the expectation is taken with respect to $Y$, which depends on $\theta$.

We aim to provide a methodology to estimate parameters from VG distribution that can also deal with unbounded density. We first redefine the likelihood so that the maximum is well-defined even with the unbounded density.

3.1. Leave-one-out likelihood

Podgórski and Wallin [5] proposed the observed LOO likelihood defined as
$$L_{LOO}(\theta;y)=\prod_{t\neq\kappa(\mu)}f(y_t;\theta)$$
where the LOO index is defined as
$$\kappa(\mu)=\arg\min_{t\in\{1,\dots,n\}}\;(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu). \tag{4}$$
When more than one index attains the minimum, we choose the smallest index. Note that we slightly modify the convention in Ref. [5] ('if there are two indices we take the one for which the corresponding $y_{\kappa(\mu)}$ is on the right side of $\mu$') as it only deals with the univariate case and cannot be easily extended to the multivariate setting. Let the observed LOO log-likelihood be defined as
$$\ell_{LOO}(\theta;y)=\log L_{LOO}(\theta;y). \tag{5}$$
We remark that when considering asymmetric distributions, the LOO likelihood function is discontinuous. For the VG distribution with skewness, the discontinuity is not an issue since the density is asymptotically symmetric as $y\to\mu$ from (3), and so the effect of the discontinuities is minimised for larger sample sizes. On the other hand, when using other distributions with different skewness behaviour, the LOO index can alternatively be defined as
$$\kappa(\mu)=\arg\max_{t\in\{1,\dots,n\}}f(y_t;\theta). \tag{6}$$
In this paper, we adopt the LOO index in (6) and define the maximum LOO likelihood estimate, denoted $\hat{\theta}_n^{MLLE}$ (or simply $\hat{\theta}_n$), to be the estimate that maximises the LOO likelihood with respect to $\theta$. This new and more general index definition makes the LOO likelihood continuous, and in the symmetric case it coincides with (4).
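As a small illustration, the sketch below (again ours, reusing the hypothetical dvg from Section 2) computes the LOO log-likelihood (5) under the index (6); which.max returns the smallest index among ties, matching the convention above.

```r
# Sketch of (5) with LOO index (6); y is an n x d matrix, dvg as sketched earlier.
loo_loglik <- function(y, mu, Sigma, gam, nu) {
  dens  <- apply(y, 1, dvg, mu = mu, Sigma = Sigma, gam = gam, nu = nu)
  kappa <- which.max(dens)   # LOO index (6); ties broken at the smallest index
  sum(log(dens[-kappa]))
}
```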

3.2. Properties of the maximum leave-one-out likelihood method

For the one-dimensional case, some properties of the location parameter estimate $\hat{\mu}_n$ such as consistency and a super-efficient rate of convergence are proved by Podgórski and Wallin [5]. We state the assumptions and the theorem relating to these main properties:

(A1)

The pdf $f(y)=p(y)|y|^{\alpha}$, $\alpha\in(-1,0)$, where $p$ has a bounded derivative on $\mathbb{R}\setminus\{0\}$ and, for some $\epsilon>0$, $f$ is non-zero and continuous either on $[-\epsilon,0]$ or on $[0,\epsilon]$.

(A2)

There exists $\rho>0$ such that $f(y)=O(|y|^{-\rho-1})$ as $|y|\to\infty$.

(A3)

For all $\epsilon>0$, the incomplete Fisher information is finite. That is,
$$I_{\epsilon}(\theta):=E_{\theta}\!\left[\left(\frac{\partial}{\partial\theta}\log f(Y;\theta)\right)^{2}\,\middle|\,|Y|>\epsilon\right]<\infty.$$

Theorem 3.1

Let $f$ satisfy assumptions (A1) to (A3) and let $\hat{\mu}_n$ be the maximiser of $L_{LOO}(\mu;y)$. Then $\hat{\mu}_n$ is a consistent estimate of $\mu$ and, for any $\beta<1/(1+\alpha)$,
$$n^{\beta}(\hat{\mu}_n-\mu)\xrightarrow{p}0$$
where $\alpha$ is defined in (A1) and $\xrightarrow{p}$ denotes convergence in probability.

This theorem gives a lower bound $n^{\beta}$ on the convergence rate of the location parameter estimate using the maximum LOO likelihood. For the univariate VG distribution, $\alpha=2\nu-1$. Hence setting $\beta=1/(1+\alpha)=1/(2\nu)$ possibly gives the index of the optimal convergence rate (the proposed optimal rate) for $\nu<1/2$. Additionally, $n^{\beta}(\hat{\mu}_n-\mu)$ may converge to some asymptotic distribution for a suitable choice of $\beta$. We investigate these asymptotic properties in Section 5 using simulations from the VG distribution.

3.3. Extension to weighted LOO likelihood for data multiplicity

Data multiplicity is common when the measurements have a limited level of accuracy. In this case, the LOO likelihood is first extended to the leave-multiple-out (LMO) likelihood defined as
$$L_{LMO}(\theta;y)=\prod_{t\notin K(\mu)}f(y_t;\theta)$$
where
$$K(\mu)=\left\{t\in\{1,\dots,n\}\,|\,y_t=y_{\kappa(\mu)}\right\} \tag{7}$$
represents the LMO indices which correspond to the data points identical to $y_{\kappa(\mu)}$, and $\kappa(\mu)$ represents the LOO index defined in (4) in Section 3.1. When there are no data multiplicities, the LMO likelihood reduces to the LOO likelihood with a consistent $(n-1)$ data contribution. However, when there are varied data multiplicities, the number of data points to leave out is not fixed and so the LMO likelihood has varied data contribution across the parameter space. This inconsistent data contribution leads to a discontinuous LMO likelihood. To remedy this, we propose to adjust the LMO likelihood to the weighted LOO (WLOO) likelihood defined as
$$L_{WLOO}(\theta;y)=\prod_{t=1}^{n}f(y_t;\theta)^{w_t} \tag{8}$$
where we choose the weights to be
$$w_t=\begin{cases}0, & \text{if } t\in K(\mu)\\[4pt] \dfrac{|K(\mu)|+|J(\mu)|-1}{|J(\mu)|}, & \text{if } t\in J(\mu)\\[4pt] 1, & \text{otherwise,}\end{cases} \tag{9}$$
$K(\mu)$ is defined in (7) such that $|K(\mu)|$ represents the cardinality of the set $K(\mu)$,
$$J(\mu)=\left\{t\in\{1,\dots,n\}\setminus K(\mu): y_t=y_{j(\mu)}\right\} \tag{10}$$
and
$$j(\mu)=\arg\min_{t\notin K(\mu)}\;(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu)$$
represents the secondary LOO index for the VG distribution in NMVM representation. Similarly, the WLOO log-likelihood is defined as
$$\ell_{WLOO}(\theta;y)=\sum_{t=1}^{n}w_t\log f(y_t;\theta). \tag{11}$$
Clearly, the LOO likelihood is the special case of the WLOO likelihood when there is no data multiplicity, so that the weights reduce to $w_t=1$ for all $t\neq\kappa(\mu)$ and $w_{\kappa(\mu)}=0$. We use two examples with data multiplicity at one and two locations respectively to demonstrate how the weights in (9) are derived to ensure a bounded and continuous likelihood as well as a consistent weighted data size, that is $\sum_{t=1}^{n}w_t=n-1$, when data with different multiplicities are removed. We call the first example symmetric since there is an equal number of data points on each side of the data with multiplicity. The second example is asymmetric as there are different numbers of data points on each side of the data with multiplicity. In these two examples, the shape parameter is $\nu=0.4$, which corresponds to an unbounded likelihood. When the density function is skewed, both LOO and WLOO likelihoods are not continuous between data points. Nevertheless, the pdf in (3) is approximately symmetric as $\mu$ approaches any data point. So by having more data points, the effect of discontinuities due to skewness is negligible.
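The weight rule (9) is straightforward to implement; the sketch below (our illustration for univariate data, with distances as in (4)) returns the weights at a given $\mu$ and is checked against Example 2 below.

```r
# Sketch of the WLOO weights (9) for univariate data y at a given mu.
wloo_weights <- function(y, mu) {
  d2 <- (y - mu)^2
  K  <- which(y == y[which.min(d2)])               # K(mu): all copies of the closest value
  J  <- which(y == y[setdiff(order(d2), K)[1]])    # J(mu): copies of the next closest value
  w  <- rep(1, length(y))
  w[K] <- 0
  w[J] <- (length(K) + length(J) - 1) / length(J)  # boosted weight in (9)
  w
}

# Example 2 data at mu = 0.5 + eps: weights (1, 4/3, 4/3, 4/3, 0, 0), summing to n - 1 = 5
wloo_weights(c(-1, 0, 0, 0, 1, 1), 0.5 + 1e-6)
```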

Example 1: data multiplicity at a single location

The first data set $\{-1,0,1,0\}$ has data multiplicity at a single location 0. Figure 1 plots the log-likelihoods across $\mu$. In Figure 1(a), the LOO log-likelihood is unbounded at 0 even after leaving out one data point 0.

Figure 2. Plot of the full (solid red), LOO (striped blue), LMO (dotted green), and WLOO (dot & striped magenta) log-likelihoods of symmetric VG distribution with ν=0.4 for data set {−1,0,1,0,0,1} represented by light grey vertical strips. (a) Full and LOO log-likelihoods and (b) LMO and WLOO log-likelihoods.

The LMO log-likelihood in Figure 1(b) is bounded after leaving out multiple data points at 0, but it also produces discontinuities at the midpoints $-0.5$ and $0.5$. We aim to choose the weights in the WLOO log-likelihood
$$\ell_{WLOO}(\mu)=w_1\log f(-1)+2w_2\log f(0)+w_3\log f(1) \qquad (w_2=w_4)$$
to ensure consistent data contribution, that is, $\sum_{t=1}^{n}w_t=n-1$. Since the pdf is assumed to be symmetric, we define
$$g_t(|y_t-\mu|)=f(y_t;\mu)$$
and analyse the WLOO log-likelihood in a small neighbourhood $\epsilon>0$ of the midpoints located at $-0.5$ and $0.5$. It is sufficient to look at one of the midpoints.

Midpoint of 0 and 1: At $\mu=0.5+\epsilon$, the data point $y_3=1$ is closest to $\mu$. So we leave out $y_3=1$ in the WLOO log-likelihood by setting $w_3=0$. This gives us
$$\ell_{WLOO}(0.5+\epsilon)=w_1\log g_1(1.5+\epsilon)+2w_2\log g_2(0.5+\epsilon)+0\cdot\log g_3(0.5-\epsilon).$$
Since the data point $y_1=-1$ has a single contribution to the log-likelihood, we set $w_1=1$, leaving the other weights $w_2=w_4=1$ so that $\sum_{t=1}^{4}w_t=3$. At $\mu=0.5-\epsilon$, the data points $y_2=y_4=0$ are closest to $\mu$, so we set $w_2=0$, which gives us
$$\ell_{WLOO}(0.5-\epsilon)=w_1\log g_1(1.5-\epsilon)+0\cdot\log g_2(0.5-\epsilon)+w_3\log g_3(0.5+\epsilon).$$
Again, we set $w_1=1$, leaving $w_3=2$ so that $\sum_{t=1}^{4}w_t=3$. Choosing these weights gives us a continuous WLOO log-likelihood at $\mu=0.5$ given by
$$\ell_{WLOO}(0.5)=\log g_1(1.5)+2\log g_2(0.5)$$
where $g_2(|0-0.5|)=g_3(|1-0.5|)$. This shows that after leaving out multiple data points, extra weight is added to the neighbouring data points to compensate for the missing weights.

Example 2: data multiplicities at two locations

The second data set $\{-1,0,1,0,0,1\}$ has different data multiplicities at 0 and 1. As before, we define the WLOO log-likelihood as
$$\ell_{WLOO}(\mu)=w_1\log f(-1)+3w_2\log f(0)+2w_3\log f(1).$$

Midpoint of $-1$ and 0: At $\mu=-0.5-\epsilon$, we set $w_1=0$, giving $w_2=w_3=1$, whereas at $\mu=-0.5+\epsilon$, we set $w_2=0$, giving $w_1=3$ and $w_3=1$. Hence the WLOO log-likelihood is
$$\ell_{WLOO}(-0.5)=3\log g_1(0.5)+2\log g_3(1.5).$$
Midpoint of 0 and 1: Unlike $\mu=-0.5$, $\mu=0.5$ lies between 0 and 1, which have different data multiplicities. At $\mu=0.5+\epsilon$, we set $w_3=0$, giving the WLOO log-likelihood
$$\ell_{WLOO}(0.5+\epsilon)=w_1\log g_1(1.5+\epsilon)+3w_2\log g_2(0.5+\epsilon)+0\cdot\log g_3(0.5-\epsilon).$$
With the single contribution of $y_1=-1$, we set $w_1=1$ and so $w_2=4/3$. At $\mu=0.5-\epsilon$, we set $w_2=0$, giving the WLOO log-likelihood
$$\ell_{WLOO}(0.5-\epsilon)=w_1\log g_1(1.5-\epsilon)+0\cdot\log g_2(0.5-\epsilon)+2w_3\log g_3(0.5+\epsilon).$$
Again, we set $w_1=1$ and so $w_3=4/2=2$. This gives us the WLOO log-likelihood
$$\ell_{WLOO}(0.5)=\log g_1(1.5)+4\log g_2(0.5)$$
where $g_2(0.5)=g_3(0.5)$.

Hence for a general data set in which $y_t$ and $y_{t'}$ have data multiplicities $m_t$ and $m_{t'}$, data points not in the neighbourhood of the midpoint of $y_t$ and $y_{t'}$ have the same contribution to the WLOO likelihood, so they always have a weight of 1. For the neighbourhood around the midpoint on the side closer to $y_{t'}$, the $m_{t'}$ data points with value $y_{t'}$ are left out by setting their weights to 0. Then the weight for each of the $m_t$ points with value $y_t$ is set to
$$w_t=\frac{m_t+m_{t'}-1}{m_t}$$
so that $\sum_{t=1}^{n}w_t=n-1$. This gives us the formula for the weights in (9). We note that the ECM algorithm using the LOO likelihood in the next section can be extended to the more general WLOO likelihood.

4. ECM algorithm

Finding the maximum LOO likelihood estimate $\hat{\theta}$ for the VG distribution directly can be difficult as the LOO likelihood has many cusps when $\nu\leq d/2$, and the LOO index $\kappa(\mu)$ makes derivatives tedious to work with since the summation and the differential with respect to $\mu$ cannot simply be interchanged due to the dependence of the summation index on $\mu$. Alternatively, we can implement the ECM algorithm to maximise the complete data LOO likelihood conditional on the mixing variables $u$.

Given the complete data $(y,u)$, we use the normal mean-variance mixture representation in (2) to represent the complete data LOO log-likelihood as
$$\ell_{LOO}(\theta;y,u)=\ell_{N}^{LOO}(\mu,\Sigma,\gamma;y,u)+\ell_{G}^{LOO}(\nu;u) \tag{12}$$
where the LOO log-likelihood of the conditional normal distribution, ignoring constants, is given by
$$\ell_{N}^{LOO}(\mu,\Sigma,\gamma;y,u)=-\frac{n-1}{2}\log|\Sigma|-\frac{1}{2}\sum_{t\neq\kappa(\mu)}\frac{1}{u_t}(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma) \tag{13}$$
and the LOO log-likelihood of the conditional gamma distribution is given by
$$\ell_{G}^{LOO}(\nu;u)=(n-1)\left(\nu\log\nu-\log\Gamma(\nu)\right)+(\nu-1)\sum_{t\neq\kappa(\mu)}\log u_t-\nu\sum_{t\neq\kappa(\mu)}u_t. \tag{14}$$
The outline of the ECM algorithm for the VG distribution using the full likelihood is similar to the ECM algorithm in Ref. [9] for the Student-t distribution and in Ref. [6] for the VG distribution. However, modifications to the algorithm are necessary when using the LOO likelihood. We will discuss the modifications needed in order to search for the maximum of the LOO likelihood.

4.1. E-step

By analysing the conditional posterior distribution of $u_t$ given $y_t$ with density
$$f(u_t|y_t,\theta)\propto u_t^{\nu-\frac{d}{2}-1}\exp\!\left[-\frac{1}{2u_t}(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu)-\frac{u_t}{2}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)\right]$$
which corresponds to the pdf of a generalised inverse Gaussian distribution [17], we can calculate the following conditional expectations:
$$\hat{u}_t=E(u_t|y,\theta)=\frac{z_t\,K_{\nu-\frac{d}{2}+1}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}, \tag{15}$$
$$\widehat{1/u_t}=E\!\left(\frac{1}{u_t}\,\Big|\,y,\theta\right)=\frac{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,K_{\nu-\frac{d}{2}-1}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{z_t\,K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}, \tag{16}$$
$$\widehat{\log u_t}=E(\log u_t|y,\theta)=\log\!\left(\frac{z_t}{\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}}\right)+\frac{K_{\nu-\frac{d}{2}}^{(1,0)}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)}{K_{\nu-\frac{d}{2}}\!\left(\sqrt{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}\,z_t\right)} \tag{17}$$
where $K_{\lambda}^{(1,0)}(z)=\frac{\partial}{\partial\alpha}K_{\alpha}(z)\big|_{\alpha=\lambda}$ is approximated using the second-order central difference
$$K_{\lambda}^{(1,0)}(z)\approx\frac{K_{\lambda+h}(z)-K_{\lambda-h}(z)}{2h}, \qquad z_t^{2}=(y_t-\mu)^{\top}\Sigma^{-1}(y_t-\mu), \tag{18}$$
and we let $h=10^{-5}$. Numerical problems may occur when $\nu\leq\frac{d}{2}$ since the pdf $f_{VG}(y)$ in (1) at $\mu$ is unbounded. Using the asymptotic properties of the modified Bessel function of the second kind,
$$K_{\lambda}(z)\sim\Gamma(\lambda)2^{\lambda-1}z^{-\lambda}\;(\lambda>0), \qquad K_{0}(z)\sim-\log(z) \quad\text{as } z\to0,$$
we can show that as $y_t\to\mu$,
$$E(u_t|y,\theta)\sim\begin{cases}\dfrac{2\nu-d}{2\nu+\gamma^{\top}\Sigma^{-1}\gamma} & \text{if } \nu>\frac{d}{2},\\[4pt] -\dfrac{1}{(2\nu+\gamma^{\top}\Sigma^{-1}\gamma)\log(z_t)} & \text{if } \nu=\frac{d}{2},\\[4pt] \dfrac{\Gamma(\nu-\frac{d}{2}+1)}{\Gamma(\frac{d}{2}-\nu)}\,2^{2\nu-d+1}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{\frac{d}{2}-\nu-1}z_t^{d-2\nu} & \text{if } \nu\in\left(\frac{d}{2}-1,\frac{d}{2}\right),\\[4pt] -\log(z_t)\,z_t^{2} & \text{if } \nu=\frac{d}{2}-1,\\[4pt] \dfrac{z_t^{2}}{d-2(\nu+1)} & \text{if } \nu<\frac{d}{2}-1,\end{cases} \tag{19}$$
$$E\!\left(\frac{1}{u_t}\,\Big|\,y,\theta\right)\sim\begin{cases}\dfrac{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}{2\nu-d-2} & \text{if } \nu>\frac{d}{2}+1,\\[4pt] -\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)\log(z_t) & \text{if } \nu=\frac{d}{2}+1,\\[4pt] \dfrac{\Gamma(1-\nu+\frac{d}{2})}{\Gamma(\nu-\frac{d}{2})}\,2^{1-2\nu+d}\left(2\nu+\gamma^{\top}\Sigma^{-1}\gamma\right)^{\nu-\frac{d}{2}}z_t^{2\nu-d-2} & \text{if } \nu\in\left(\frac{d}{2},\frac{d}{2}+1\right),\\[4pt] -\dfrac{1}{\log(z_t)\,z_t^{2}} & \text{if } \nu=\frac{d}{2},\\[4pt] \dfrac{d-2\nu}{z_t^{2}} & \text{if } \nu<\frac{d}{2},\end{cases} \tag{20}$$
$$E(\log u_t|y,\theta)\sim\begin{cases}\psi\!\left(\nu-\frac{d}{2}\right)-\log\!\left(\dfrac{2\nu+\gamma^{\top}\Sigma^{-1}\gamma}{2}\right) & \text{if } \nu>\frac{d}{2},\\[4pt] \log z_t & \text{if } \nu=\frac{d}{2},\\[4pt] 2\log z_t & \text{if } \nu<\frac{d}{2}.\end{cases} \tag{21}$$
Thus for the case when $\nu\leq\frac{d}{2}$, the main source of numerical problems is the calculation of $E(1/u_t|y,\theta)$ as it diverges to infinity at a hyperbolic rate, since the maximum of the likelihood function occurs at the data points. This leads to an overflow problem when calculating the estimates, which is explained further in Section 4.2.3. The LOO likelihood avoids the problem by preventing the location parameter estimate from converging towards a data point, since the maximum of the LOO likelihood tends to lie roughly between data points.
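The E-step reduces to a few Bessel-function ratios; the following sketch (ours, using base R's besselK) computes (15)-(17) for one observation with the central difference (18), where zt and c2 denote $z_t$ and $2\nu+\gamma^{\top}\Sigma^{-1}\gamma$.

```r
# Sketch of the E-step quantities (15)-(17) with the central difference (18).
estep_vg <- function(zt, c2, nu, d, h = 1e-5) {
  lam <- nu - d / 2
  a   <- sqrt(c2) * zt
  Kl  <- besselK(a, lam)
  u_hat <- zt / sqrt(c2) * besselK(a, lam + 1) / Kl               # (15)
  inv_u <- sqrt(c2) / zt * besselK(a, lam - 1) / Kl               # (16)
  dK    <- (besselK(a, lam + h) - besselK(a, lam - h)) / (2 * h)  # (18)
  log_u <- log(zt / sqrt(c2)) + dK / Kl                           # (17)
  list(u_hat = u_hat, inv_u = inv_u, log_u = log_u)
}
```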

4.2. CM-step

Derivatives of $\ell_{N}^{LOO}$ in (13) with respect to $(\Sigma,\gamma)$ are straightforward to calculate using matrix differentiation. However, we encounter two types of difficulties in calculating the derivative with respect to $\mu$ for the CM-step.

Firstly, even when the LOO likelihood removes the unbounded points from the full likelihood, there still exist cusps in the LOO likelihood. Consequently we cannot completely rely on derivative-based methods to find the maximum of the LOO likelihood with respect to the location parameter μ.

Secondly, the first-order derivative of the complete-data LOO log-likelihood in (13) with respect to $\mu$ is
$$\frac{\partial}{\partial\mu}\ell_{N}^{LOO}=-\frac{1}{2}\left(\frac{\partial}{\partial\mu}\sum_{t\neq\kappa(\mu)}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma)\right). \tag{22}$$
Since the summation index depends on $\mu$, the differential and the summation cannot simply be interchanged. Thus the CM-step for $\mu$ does not have a closed-form solution. To solve these problems, we propose the local mid-point search and local point search algorithms for the first problem, and the approximate CM-step for the second problem.

4.2.1. Local mid-point search (for one-dimensional case)

As seen in Figure 1, the maximum of the LOO log-likelihood tends to occur at the cusp points which are located between data points for the one-dimensional case. So ideally we would want to search along these mid-points to maximise the LOO log-likelihood with respect to $\mu$. This leads to the local mid-point search. The idea is to search for mid-points around the current iterate $\mu^{(k)}$ and choose the one that maximises the LOO log-likelihood.

Local mid-point search algorithm:

Let $(\mu^{(k)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$ be our current estimates, and $y_{(t)}$ be the ordered data:

  1. Calculate $|x_t-\mu^{(k)}|$ for $t=1,\dots,n-1$, where $x_t:=(y_{(t)}+y_{(t+1)})/2$, and choose the $m$ smallest Euclidean distances with corresponding mid-points $x_{t_1},\dots,x_{t_m}$, where we choose $m=\max\{20,n/100\}$ for our simulation study given that $n\geq20$. Additionally, let $x_{t_0}=\hat{\mu}^{(k)}$.

  2. Update the location estimate by choosing the $\mu$ out of $\{x_{t_0},\dots,x_{t_m}\}$ that maximises the LOO log-likelihood in (5):
$$\mu^{(k+1)}=\arg\max_{\mu\in\{x_{t_0},\dots,x_{t_m}\}}\ell_{LOO}(\mu,\Sigma^{(k)},\gamma^{(k)},\nu^{(k)};y).$$

  3. Repeat steps 1 and 2 until the location parameter estimate converges.
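A one-pass sketch of this search (our code; loo_loglik is the sketch from Section 3.1, and the choice of m follows step 1) is given below; iterating it until $\mu$ no longer changes realises step 3.

```r
# Sketch of one iteration of the local mid-point search (univariate data y).
midpoint_search <- function(y, mu_k, sigma2, gam, nu) {
  ys   <- sort(y)
  mids <- (ys[-length(ys)] + ys[-1]) / 2                  # mid-points x_t
  m    <- min(max(20, length(y) %/% 100), length(mids))   # m = max{20, n/100}, capped
  cand <- c(mu_k, mids[order(abs(mids - mu_k))[1:m]])     # current iterate + m nearest mid-points
  ll   <- sapply(cand, function(mu)
    loo_loglik(matrix(y), mu, matrix(sigma2), gam, nu))   # LOO log-likelihood (5)
  cand[which.max(ll)]                                     # step 2: the maximising candidate
}
```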

4.2.2. Local point search (for higher dimensional case)

In general, finding the maximum along the cusps in higher dimensions is more computationally demanding. For 2-dimensional data, the cusps occur along cusp lines in $\mathbb{R}^2$, as demonstrated in a figure of Ref. [11]. For d-dimensional data, the cusps occur on $(d-1)$-dimensional cusp manifolds in $\mathbb{R}^d$. So for simplicity, we propose to search over data points around the current iterate $\hat{\mu}^{(k)}$ and choose the one that increases the LOO likelihood.

Local point search (LPS) algorithm:

The algorithm is similar to the local mid-point search algorithm in Section 4.2.1 except that we search over the data points instead of the mid-points, and replace the Euclidean distance with the Mahalanobis distance between $y_t$ and $\mu^{(k)}$,
$$(y_t-\mu^{(k)})^{\top}(\Sigma^{(k)})^{-1}(y_t-\mu^{(k)}),$$
for $t=1,\dots,n$.

4.2.3. Approximation in CM-step

To evaluate the first-order derivative in (22), we propose to approximate the derivative by simply considering the LOO index in (4) to be fixed at the current estimate $\mu^{(k)}$ at the $k$-th iteration, so that we leave out the data point closest to $\mu^{(k)}$ instead of $\mu$. This gives us an approximation to the derivative
$$\frac{\partial}{\partial\mu}\ell_{N}^{LOO}\approx-\frac{1}{2}\left(\frac{\partial}{\partial\mu}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma)^{\top}\Sigma^{-1}(y_t-\mu-u_t\gamma)\right)=\Sigma^{-1}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu-u_t\gamma). \tag{23}$$
Similarly, applying the approximate derivative to $\ell_{N}^{LOO}$ and $\ell_{G}^{LOO}$ with respect to the other parameters and solving the approximate derivatives at zero gives us the following CM-steps.

CM-step for μ,Σ,γ:

Suppose that the current iterate is $\theta^{(k)}$ and $u$ is given. After equating each component of the approximate partial derivatives of $\ell_{N}^{LOO}(\mu,\Sigma,\gamma|y,u)$ to zero, we obtain the following estimates:
$$\mu^{(k+1)}=\frac{S_{y/u}S_u-(n-1)S_y}{S_{1/u}S_u-(n-1)^{2}}, \tag{24}$$
$$\gamma^{(k+1)}=\frac{S_y-(n-1)\mu^{(k+1)}}{S_u}, \tag{25}$$
$$\Sigma^{(k+1)}=\frac{1}{n-1}\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,(y_t-\mu^{(k+1)})(y_t-\mu^{(k+1)})^{\top}-\frac{1}{n-1}\gamma^{(k+1)}(\gamma^{(k+1)})^{\top}S_u, \tag{26}$$
where the sufficient statistics are:
$$S_y=\sum_{t\neq\kappa(\mu^{(k)})}y_t,\quad S_{y/u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}\,y_t,\quad S_u=\sum_{t\neq\kappa(\mu^{(k)})}\hat{u}_t,\quad S_{1/u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{1/u_t}. \tag{27}$$
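In code, the CM-step is a handful of weighted sums; the sketch below (ours, not the authors' implementation) takes the $n-1$ retained rows of y together with the E-step outputs and returns the updates (24)-(26).

```r
# Sketch of the CM-step (24)-(26); y holds the n - 1 rows kept by the LOO index,
# u_hat and inv_u are the E-step vectors (15) and (16) for those rows.
cm_step <- function(y, u_hat, inv_u) {
  n1  <- nrow(y)                       # n - 1
  Sy  <- colSums(y)                    # S_y in (27)
  Syu <- colSums(y * inv_u)            # S_{y/u}
  Su  <- sum(u_hat)                    # S_u
  S1u <- sum(inv_u)                    # S_{1/u}
  mu  <- (Syu * Su - n1 * Sy) / (S1u * Su - n1^2)   # (24)
  gam <- (Sy - n1 * mu) / Su                        # (25)
  yc  <- sweep(y, 2, mu)                            # y_t - mu
  Sig <- crossprod(yc * sqrt(inv_u)) / n1 - tcrossprod(gam) * Su / n1   # (26)
  list(mu = mu, gamma = gam, Sigma = Sig)
}
```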

CM-step for $\nu$:

Given the mixing parameters $u$, the estimate $\nu^{(k+1)}$ can be obtained by numerically maximising $\ell_{G}^{LOO}(\nu|u)$ in (14) with respect to $\nu$ using the Newton-Raphson (NR) algorithm, where the approximate derivatives are given by:
$$\frac{\partial}{\partial\nu}\ell_{G}^{LOO}=(n-1)\left(1+\log\nu-\psi(\nu)\right)+S_{\log u}-S_u, \qquad \frac{\partial^{2}}{\partial\nu^{2}}\ell_{G}^{LOO}=(n-1)\left(\frac{1}{\nu}-\psi'(\nu)\right)$$
where $\psi(x)=\frac{d}{dx}\log\Gamma(x)$ is the digamma function and
$$S_{\log u}=\sum_{t\neq\kappa(\mu^{(k)})}\widehat{\log u_t}. \tag{28}$$
This algorithm using the conditional log-likelihood $\ell_{G}^{LOO}$ is called MCECM. Alternatively, we can speed up the convergence rate by maximising directly the LOO log-likelihood $\ell_{VG}^{LOO}=\sum_{t\neq\kappa(\mu^{(k)})}\ln f_{VG}(y_t)$ with respect to $\nu$ given the other parameters. This algorithm is called ECM. Unless otherwise stated, we consider the ECM algorithm in the subsequent analyses.
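A sketch of the NR iteration (our code; digamma and trigamma are base R functions) under the derivatives above:

```r
# Sketch: Newton-Raphson iteration for nu in the MCECM variant, maximising (14).
update_nu <- function(nu, n1, S_logu, S_u, tol = 1e-8, maxit = 50) {
  for (i in seq_len(maxit)) {
    g <- n1 * (1 + log(nu) - digamma(nu)) + S_logu - S_u  # first derivative
    h <- n1 * (1 / nu - trigamma(nu))                     # second derivative
    nu_new <- nu - g / h
    if (nu_new <= 0) nu_new <- nu / 2                     # stay inside nu > 0
    if (abs(nu_new - nu) < tol) return(nu_new)
    nu <- nu_new
  }
  nu
}
```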

4.2.4. Line search

The estimates in (24) to (26) using approximate derivatives do not guarantee that the LOO likelihood increases. In this regard, we propose to apply a line search to ensure the LOO likelihood increases after each CM-step. This line search is part of a class of adaptive overrelaxed methods which can improve the efficiency of EM [18]. Denoting the current estimate by $\theta^{(k)}$ and the updated estimate by $\theta^{(k+1)}$ after the CM-step in Section 4.2.3, we propose to construct a direct line search by defining
$$\theta^{*}=\theta^{(k)}+\varphi\left(\theta^{(k+1)}-\theta^{(k)}\right)$$
where $\varphi\in I\subset\mathbb{R}$ and the interval $I$ is chosen so that $\theta^{*}$ remains in the parameter space. Using the optimise function in R, $\varphi$ is estimated as $\varphi^{*}$ such that it maximises the LOO log-likelihood:
$$\varphi^{*}=\arg\max_{\varphi\in I}\ell_{LOO}(\theta^{*}).$$
Since finding the maximum of a non-smooth likelihood function is difficult, we can alternatively search over $\varphi$ to choose $\theta^{*}$ such that it improves the LOO likelihood over the previous estimate: $\ell_{LOO}(\theta^{*};y)\geq\ell_{LOO}(\theta^{(k)};y)$.
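A sketch of this step (ours; it assumes the parameters are packed into a numeric vector theta and that ll returns the LOO log-likelihood of such a vector — both assumptions of this illustration, not part of the paper):

```r
# Sketch of the direct line search using optimise, falling back to the
# CM-step proposal theta_new when the search does not improve on it.
line_search <- function(theta_k, theta_new, ll, interval = c(0, 2)) {
  f   <- function(phi) ll(theta_k + phi * (theta_new - theta_k))
  opt <- optimise(f, interval, maximum = TRUE)
  if (opt$objective >= f(1)) theta_k + opt$maximum * (theta_new - theta_k)
  else theta_new
}
```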

4.3. ECM algorithm

Combining these steps gives us the ECM algorithm for VG distribution using the LOO likelihood:

Initialisation step: Choose suitable starting values $(\mu_0,\Sigma_0,\gamma_0,\nu_0)$. It is recommended to choose the starting values $(\bar{y},\mathrm{cov}(y),\mathbf{0},4/d)$ where $\bar{y}$ and $\mathrm{cov}(y)$ denote the sample mean and sample variance-covariance matrix of $y$ respectively. For more leptokurtic data, it is recommended to use more robust measures of location and scale.

ECM algorithm for VG: At the $k$-th iteration with current estimates $(\mu^{(k)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$:

Local Point Search: Update the estimate to $\mu^{(k+1/2)}$ using the local point search in Section 4.2.2.

E-step 1: Calculate $\hat{u}_t^{(k+1)}$ and $\widehat{1/u_t}^{(k+1)}$ for $t=1,\dots,n$ in (15) and (16) respectively using $(\mu^{(k+1/2)},\Sigma^{(k)},\gamma^{(k)},\nu^{(k)})$. Calculate also the sufficient statistics $S_y^{(k+1)}$, $S_{y/u}^{(k+1)}$, $S_u^{(k+1)}$ and $S_{1/u}^{(k+1)}$ in (27).

CM-step 1: Update the estimates to $(\mu^{(k+1)},\gamma^{(k+1)})$ in (24) and (25) respectively using the sufficient statistics in E-step 1, along with the line search in Section 4.2.4.

CM-step 2: Update the estimate to $\Sigma^{(k+1)}$ in (26) using the sufficient statistics in E-step 1, along with the line search.

CM-step 3: Update the estimate to $\nu^{(k+1)}$ by maximising the actual LOO log-likelihood with respect to $\nu$ while keeping the other parameters fixed.

Stopping rule: Repeat the procedures until the relative increment of the LOO log-likelihood function is smaller than the tolerance level $10^{-8}$.

We remark that the local mid-point search and local point search ensure the location parameter estimates jump closer towards the maximum whereas the line search in Section 4.2.4 is applied after each CM-step to ensure monotonic convergence of the ECM algorithm. We will numerically verify the accuracy of this algorithm in Section 5.1 using Monte Carlo simulations.

4.4. Convergence of ECM algorithm using LOO likelihood

Let the approximate LOO log-likelihood be defined as
$$\tilde{\ell}_{LOO}(\theta;y)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(y_t;\theta) \tag{29}$$
with the LOO index fixed at $\kappa(\mu^{(k)})$. To show the convergence of the ECM algorithm using the approximate LOO likelihood, we first prove the convergence of the ECM algorithm with one CM-step for the approximate LOO log-likelihood, then extend the proof to multiple CM-steps. For the case with one CM-step, we apply the idea in Ref. [19] to the LOO likelihood and state two fundamental results below:
$$\tilde{\ell}_{LOO}(\theta;y)=\tilde{Q}_{LOO}(\theta;\theta^{(k)})-\tilde{H}_{LOO}(\theta;\theta^{(k)})$$
and
$$\tilde{H}_{LOO}(\theta;\theta^{(k)})\leq\tilde{H}_{LOO}(\theta^{(k)};\theta^{(k)})$$
where we let
$$\tilde{Q}_{LOO}(\theta;\theta^{(k)})=\int\tilde{\ell}_{LOO}(\theta;y,u)\,f(u|y;\theta^{(k)})\,du \tag{30}$$
with $f(u|y;\theta^{(k)})=\prod_{t=1}^{n}f(u_t|y_t;\theta^{(k)})$, $\tilde{\ell}_{LOO}(\theta;y,u)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(y_t,u_t;\theta)$, and
$$\tilde{H}_{LOO}(\theta;\theta^{(k)})=\int\tilde{\ell}_{LOO}(\theta;u|y)\,f(u|y;\theta^{(k)})\,du$$
with $\tilde{\ell}_{LOO}(\theta;u|y)=\sum_{t\neq\kappa(\mu^{(k)})}\log f(u_t|y_t;\theta)$. These fundamental results are stated by Podgórski and Wallin [5], and the ideas of the proof are exactly the same as in Ref. [20], replacing the full likelihood with the LOO likelihood. However, choosing $\theta^{(k+1)}$ such that
$$\tilde{Q}_{LOO}(\theta^{(k+1)};\theta^{(k)})\geq\tilde{Q}_{LOO}(\theta^{(k)};\theta^{(k)})$$
only guarantees that $\tilde{\ell}_{LOO}(\theta^{(k+1)};y)\geq\tilde{\ell}_{LOO}(\theta^{(k)};y)$ but not $\ell_{LOO}(\theta^{(k+1)};y)\geq\ell_{LOO}(\theta^{(k)};y)$. For this reason, we need to perform the line search in Section 4.2.4 so that the LOO log-likelihood improves, which guarantees the monotonic convergence of the algorithm for the LOO log-likelihood.

For the case with multiple CM-steps, applying the line search after a CM-step increases the actual LOO log-likelihood rather than the Q-function in (30). Liu and Rubin [8] and Meng and van Dyk [21] proved the monotonic convergence of the ECM algorithm only if all the CM-steps applied to Q-functions are performed before the CM-step applied to the actual LOO log-likelihood. Thus for our case, if we apply the line search to the initial CM-step, the line search also has to be applied to the subsequent CM-steps to ensure that the actual LOO log-likelihood increases after each CM-step. This guarantees the monotonic convergence of the algorithm in Section 4.3.

5. Simulation studies

5.1. Accuracy of estimates for VG distribution

To demonstrate the accuracy of the proposed ECM algorithm, we simulate $n=1000$ bivariate skewed VG samples with parameters
$$\mu=\begin{pmatrix}0\\0\end{pmatrix},\quad \Sigma=\begin{pmatrix}1&0.7\\0.7&1\end{pmatrix},\quad \gamma=\begin{pmatrix}0.8\\1\end{pmatrix},\quad\text{and}\quad \nu=0.15 \tag{31}$$
and estimate the parameters using the algorithm specified in Section 4.3. We replicate this experiment $r=1000$ times and present the results in Figure 3, which shows violin plots, implemented using the caroline package [22] in R, representing density estimates of the parameter estimates using a normal kernel. The medians of the estimates are very close to the true parameters of the model, implying that the algorithm gives us consistent estimates for these parameters even when $\nu<\frac{d}{2}$, which leads to an unbounded density. We also see that the distribution of each component of the estimates of $\Sigma$, $\gamma$, and $\nu$ appears to be roughly normal. On the other hand, the distribution of $\hat{\mu}$ has high density around 0 and extremely heavy tails.

Figure 3. Vioplots to show accuracy of parameter estimates for VG distribution in the first simulation study. The median is displayed as a grey box which is connected by a crimson line. True parameter values represented by the blue lines are drawn for comparison. (a) Vioplot of μ and (b) Vioplot of Σ, γ, ν.

To illustrate the idea of the local point search and line search in Sections 4.2.2 and 4.2.4 graphically, readers may refer to a figure of Ref. [11]. The figure shows that a global maximum lies roughly between data points along the cusp lines. These cusp lines make computation more demanding as we cannot purely rely on derivative-based methods (see Equation (23)). The local point search and then the line search serve as an efficient iterative method to obtain estimates which converge approximately towards the global maximum. To explain the idea, the local point search takes the estimate $\hat{\mu}$ along the black solid line to one of the three data points (red circles) closer to the global maximum point (yellow triangle), while the CM-step and line search improve $\hat{\mu}$ along the black dashed line to the point in the blue square, so that they converge closer towards the global maximum point which lies on the cusp line between two of the data points.

5.2. Comparison of WLOO, LOO and full likelihood methods

We assess the performance of the WLOO and LOO likelihoods relative to other full likelihood methods, including the R package ghyp, in a simulation study. Nitithumbundit and Chan [6] proposed an ECM algorithm [7] to estimate parameters for the VG distribution with unbounded density by capping the density within a small interval of the location parameter. Specifically, they bounded the conditional expectations (19) to (21) around $\mu$ such that if $z_t<\Delta$, where $\Delta$ is some small fixed positive capping level and $z_t$ is defined in (18), then the conditional expectations are calculated by replacing $z_t$ with $\max(z_t,\Delta)$. To improve the rate of convergence, they further considered a data augmentation algorithm called the alternating ECM (AECM) by setting $u_t^{(k)}=w_t/a(\theta^{(k)})$ where the positive function $a(\theta)=\xi$, and updated $\xi$ by choosing
$$\hat{\xi}=\arg\max_{\xi>0}\ell_{VG}(\hat{\mu},\hat{\Sigma}\xi,\gamma\xi,\hat{\nu})$$
using numerical optimisation techniques given the current estimates $(\hat{\mu},\hat{\Sigma},\gamma,\hat{\nu})$. Then they updated the parameters as $\hat{\gamma}=\hat{\gamma}\hat{\xi}$ and $\hat{\Sigma}=\hat{\Sigma}\hat{\xi}$. Lastly, they showed that the optimal $\Delta$ decreases as $\nu$ decreases. Hence they proposed an adaptive $\Delta$ method updating $\Delta^{(k+1)}=\hat{g}(\nu^{(k+1)},d)$ where $g(\nu,d)$ is a cubic spline fitted to the median (over repeats) of the optimal $\Delta$ across $\nu$ and $d$ in a simulation study. The following likelihood methods are considered in this study:

  1. Full likelihood ghyp package: ECM algorithm with $\Delta$ set to .Machine$double.eps^0.25 ≈ 1.2e-4, which is a tolerance level used by many R functions.

  2. Full likelihood fixed $\Delta$: AECM algorithm with the smallest $\Delta$ = sqrt(.Machine$double.xmin) ≈ 1.5e-154 to provide numerical stability, where double.xmin represents the smallest non-zero normalised floating-point number in R.

  3. Full likelihood adaptive Δ: AECM algorithm with adaptive Δ.

  4. LOO likelihood: ECME/AECM algorithm with LPS (Section 4.2.2), line search (Section 4.2.4) and the smallest $\Delta$, which is relevant only when there is data multiplicity.

  5. WLOO likelihood: ECM algorithm with LPS and line search.

Details of these five likelihood methods are summarised in Table 1.

Table 1. ghyp, fixed Δ, adaptive Δ, LOO and WLOO likelihood methods.

To conduct the simulation study, we create data multiplicity in two ways: replicate each data point R times or round each data point to D decimal places. The procedure for the simulation study is summarised below:

Step 1:

Choose $\nu=0.05,0.1,\dots,1.45,1.5$, and $R=1,\dots,5$ or $D=\infty,8,6,5,4$ (D = ∞ denoting no rounding) for Section 5.2.2. The density is unbounded if $\nu<1$ and cusped if $1<\nu\leq1.5$ for dimension $d=2$.

Step 2:

Simulate $n=1000$ data points from the bivariate VG distribution with true parameters $\mu$, $\Sigma$ and $\gamma$ given in (31) and the chosen $\nu$. Then repeat each data point R times, or round each data point to D decimal places.

Step 3:

Apply the five likelihood methods in Table 1 to the data sets to obtain five sets of estimates.

Step 4:

Repeat these steps until we have r = 500 replicates for each method, level of ν, and level of R or D.

5.2.1. No data multiplicity

The medians of the accuracy measures for the four parameters are reported in Table 2. Results show that the full likelihood methods using ghyp and fixed $\Delta$ mostly give the largest biases, whereas the adaptive $\Delta$ full likelihood, LOO likelihood and WLOO likelihood methods perform similarly with lower biases, particularly for $\mu$ when the density is unbounded. It is reasonable to see that the LOO and WLOO likelihood methods give very similar results for data without multiplicity. As $\nu$ decreases, the bias seems to increase for $|\hat{\Sigma}|$ and decrease for $\hat{\mu}$, but there is no clear trend for $\hat{\gamma}$.

Table 2. Median of 500 accuracy measures of parameter estimates across five likelihood methods with no data multiplicity (R = 1). ‘*’ indicates the methods with lowest absolute error.

5.2.2. Data multiplicity due to repetition and rounding

To visualise the differences in performance better, we transform the accuracy measures by $-\log[\log(-\log|\hat{\mu}-\mu|)]$, $\log\big|\log|\hat{\Sigma}|-\log|\Sigma|\big|$, $\log|\hat{\gamma}-\gamma|$ and $\log|\log\hat{\nu}-\log\nu|$ respectively, and graph the transformed accuracy measures in Figure 4 for replicated data with R = 1, 3, 5. Since the plot for rounded data is very similar, it is omitted; see Ref. [23], page 95 for this plot. We remark that the transformation $f(x)=-\log[\log(-\log x)]$ for $|\hat{\mu}-\mu|$ is monotonically increasing on $(0,1/e)$.

Figure 4. Plot of transformed accuracy measures for the full ghyp package (red solid line), full fixed Δ (light green striped line), full adaptive Δ (dotted green line), LOO (blue dot & striped line), and WLOO (dashed magenta line) likelihoods. The columns from left to right represent the median parameter accuracy measures for $(\hat{\mu},\hat{\Sigma},\hat{\gamma},\hat{\nu})$ respectively. The rows from top to bottom represent R = 1, 3, 5 respectively, where R is the number of times each data point is repeated.

With data repetition, the WLOO likelihood (magenta line) generally performs better than the other likelihood methods for all levels of $\nu$ in Figure 4. The full likelihood ghyp with $\Delta_0$ = 1.2e-4 (red line) and adaptive $\Delta$ (green line) methods provide reasonable accuracy when $\nu>0.15$, but the adaptive $\Delta$ method is better and similar to WLOO for $\hat{\nu}$. This result is in agreement with the optimal $\Delta$ choice study in Section 4.2 of Ref. [6]. They considered $\Delta_1$ = 1.49e-154 and $\Delta_2$ = 1.49e-8 and plotted, in a figure of Ref. [6], the log of the sum of mean absolute errors (MAE) over all the parameters for a $d=2$ dimensional VG model with $\mu=\mathbf{0}$, $\Sigma=\begin{pmatrix}1&0.4\\0.4&1\end{pmatrix}$ and $\gamma=(1.2,0.2)^{\top}$ using the AECM algorithm. Results show that $\Delta_1$ performs better for much smaller shape parameters at $\nu<0.15$ whereas $\Delta_2$ performs better for approximately $0.15<\nu<1.15$. Since $\Delta_0$ is similar to $\Delta_2$ relative to $\Delta_1$, this result explains why the full and LOO likelihoods, both with $\Delta_1$, perform worst in Figure 4 for approximately $0.15<\nu<1.5$.

When $\nu$ is small, the full likelihood ghyp method performs poorly as expected. The fixed $\Delta$ full and LOO likelihood methods (light green and blue lines) have the worst performance, particularly for $\hat{\Sigma}$ and $\hat{\nu}$, since they both adopt a fixed capping level $\Delta$. The only difference is that the LOO likelihood leaves out a data point while the full likelihood does not. Nevertheless, as the data multiplicity increases, leaving a data point out becomes insignificant and so the accuracies of the full fixed $\Delta$ and LOO likelihoods converge. As $\nu$ approaches 1.5, the accuracy of $\hat{\Sigma}$ and $\hat{\nu}$ for all likelihood methods converges.

With data rounding, the results are very similar. The WLOO likelihood is still better for most cases, such as μˆ and γˆ when ν>0.3 approximately. For νˆ, the full likelihood ghyp package performs well around 0.2<ν<0.5, the full adaptive Δ likelihood performs well in the mid-range at around 0.6<ν<1.2 and the full fixed Δ likelihood performs well at around ν=1.4. Again, the ghyp package and adaptive Δ full likelihood perform similarly. For Σˆ and νˆ, the full fixed Δ likelihood still performs worse whereas the LOO likelihood is now much better and is similar to WLOO when ν>0.3 but they perform similarly for smaller ν. We remark that for larger ν, the data rounding is insignificant because of the lower peak and hence less data multiplicity around μˆ.

In summary, for both data repetition and rounding cases, we have demonstrated that the full ghyp and fixed Δ likelihood methods using fixed capping levels perform worse and can be improved by providing an adaptive capping level. Moreover, the LOO likelihood method can be extended to the WLOO likelihood to deal with the unbounded likelihood due to data multiplicity. Overall, the WLOO likelihood gives the most stable and accurate results. As the range of ν is uncertain in practice, it is better to use the WLOO likelihood which provides the best overall performance.

5.3. Asymptotic properties for the location parameter of VG distribution

Podgórski and Wallin [5] proved the consistency and super-efficiency of the location parameter estimate using the maximum LOO likelihood, as stated in Theorem 3.1 in Section 3.2. They also stated the upper bound for the index of the rate of convergence, $\beta<1/(1+\alpha)=1/(2\nu)$ where $\alpha=2\nu-1$, for the unbounded univariate VG distribution. The aim of this section is to determine these optimal rates through simulation studies and to analyse the asymptotic distribution of the location parameter estimate for unbounded and cusp densities. In this study, we consider the LOO instead of the WLOO log-likelihood to investigate the optimal convergence rate of Ref. [5].

We consider the univariate case ($d=1$) and set up the simulation as below for $\nu\in(0,1]$:

  1. Set the sample size $n$ to be one of the 41 sample sizes $\{500,1000,\dots,19{,}500,20{,}000\}\cup\{100{,}000\}$.

  2. For each sample size, set the true shape parameter $\nu$ to be one of the 50 shape parameters $\{0.02,0.04,\dots,0.98,1\}$.

  3. For each pair $(n,\nu)$, generate $r=20{,}000$ different sets of samples, each set from the standard univariate symmetric VG distribution with shape parameter $\nu$ and sample size $n$.

  4. For the $i$-th set of the 20,000 samples, estimate $\hat{\mu}_{n,\nu,i}$ by searching through the mid-points between adjacent data points and choosing the one that maximises the LOO log-likelihood, where the other parameters ($\sigma^2=1$, $\gamma=0$, $\nu$) are held fixed.

This gives us, for each pair $(n,\nu)$, the sample $\hat{\mu}_{n,\nu,1:20{,}000}$ of $r=20{,}000$ estimates, which is used to study the optimal convergence rate and asymptotic distribution across sample size $n$ and shape parameter $\nu$.

5.3.1. Optimal convergence rate

Since the scale of the asymptotic distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$ shrinks with $n$ at the convergence rate $n^{-\beta_\nu}$, we can fit a power curve to estimate $\beta_\nu$ in the optimal convergence rate. We choose a robust measure of spread called the median absolute deviation (MAD) from 0 defined by
$$\mathrm{MAD}_{n,\nu}=\mathrm{median}\left(|\hat{\mu}_{n,\nu,1}|,\dots,|\hat{\mu}_{n,\nu,20{,}000}|\right).$$
For each $\nu$, we fit a power curve to the $\mathrm{MAD}_{n,\nu}$ against $n$. In other words, we find parameters $a_\nu$ and $b_\nu$ such that $\mathrm{MAD}_{n,\nu}=a_\nu n^{-b_\nu}$. This is equivalent to fitting the simple linear regression model $\log\mathrm{MAD}_{n,\nu}=\log a_\nu-b_\nu\log n$ to obtain the estimates $(\log\hat{a}_\nu,\hat{b}_\nu)$. Then an estimate of the optimal rate index for a given $\nu$ is obtained by setting $\hat{\beta}_\nu=\hat{b}_\nu$. We repeat this process for the other $\nu$'s.
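The fit itself is a one-line regression; the sketch below (ours; mu_hat is assumed to be an r × length(ns) matrix of replicated estimates, one column per sample size — a hypothetical layout for illustration) recovers $\hat{\beta}_\nu$ for one value of $\nu$.

```r
# Sketch: estimate beta_nu from MAD_{n,nu} = a_nu * n^(-b_nu) on the log scale.
fit_rate <- function(mu_hat, ns) {
  mad0 <- apply(abs(mu_hat), 2, median)  # MAD from 0 for each sample size n
  fit  <- lm(log(mad0) ~ log(ns))        # log MAD = log a - b log n
  unname(-coef(fit)[2])                  # beta_hat = b_hat
}
```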

Figure 5(a) plots the relative error of $\hat{\beta}_\nu$ against $\nu$ along with its confidence intervals. From the figure, $\hat{\beta}_\nu$ appears to follow the proposed optimal rate of $1/(2\nu)$ when $0<\nu\leq0.4$. However, when $0.4<\nu<1$ (the density is bounded with a cusp when $0.5<\nu<1$), $\hat{\beta}_\nu$ appears to be slightly different and the relative error follows a sinusoidal pattern. In fact, for $0.4<\nu<0.76$, $\hat{\beta}_\nu$ appears to be greater than the proposed optimal convergence rate index, while for $0.76<\nu<1$, $\hat{\beta}_\nu$ appears to be less than the proposed optimal convergence rate index. As $\nu$ approaches 1, $\hat{\beta}_\nu$ approaches the convergence rate index for asymptotic normality. Overall the estimated optimal rate is consistent with Theorem 3.1 in Section 3.2 in the range $\nu<0.5$ for unbounded density. As for $0.5\leq\nu\leq1$, more theoretical study is needed to understand the behaviour of the location parameter estimates $\hat{\mu}_{n,\nu,1:20{,}000}$. To investigate this unusual behaviour, we further analyse the asymptotic distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$ across $\nu$ when $n=100{,}000$ is large.

Figure 5. (a) Relative error $(\hat{\beta}_\nu-\beta_\nu)/\beta_\nu$ (solid black) against ν. The horizontal solid grey line indicates agreement of $\hat{\beta}_\nu$ with the proposed optimal rate $\beta_\nu=\frac{1}{2\nu}$. The vertical grey dotted lines represent grid lines for $\nu=\{0,0.2,0.4,0.6,0.8,1\}$. We also include the 95% confidence interval for the relative error (dashed grey). (b) Density of estimates $\hat{\mu}_{n,\nu,1:20{,}000}$ with scale standardised using MAD for each ν, where n = 100,000. We use a rainbow colour scheme ranging from red (ν=0.02) to magenta (ν=1).

5.3.2. Asymptotic distribution

This section explores the distribution of $\hat{\mu}_{n,\nu,1:20{,}000}$, or $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ on the more appropriate scale, and approximates the distribution to allow SE calculation for $\hat{\mu}$.

Analysis 1: preview the distribution of location parameter estimates using normal kernel

We begin by plotting the normal kernel density estimate in Figure 5(b) for $\hat{\mu}_{n,\nu,1:20{,}000}$ with its scale standardised using MAD, taking $n=100{,}000$ and representing the density for each $\nu$ in rainbow colours. We note that the estimated density exhibits heavier tails and sharper peaks at the expense of intermediate tails as $\nu$ decreases (from the magenta line to the red line), and this feature resembles the VG distribution.

Analysis 2: explore the distribution on the more appropriate log scale

To explore the distributions on a more appropriate scale, we transform $\hat{\mu}_{n,\nu,1:20{,}000}$ to $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ and plot the kernel density estimates in Figure 6 for $n=1000, 5000, 20{,}000, 100{,}000$ by $\nu=0.02, 0.2, 0.4, 0.6, 0.8, 1$. We observe that the location of the distribution generally shifts to the right for each $\nu$ as the sample size $n$ increases, but the scale and the shape of the distribution remain relatively similar. An exception is $\nu=0.8$, where the scale gets slightly larger and the shape becomes less left-skewed. We remark that these densities resemble the densities of the generalised Gumbel (GG) distribution. Fitting the GG distribution to $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ corresponds to fitting a double generalised gamma (DGG) distribution to $\hat{\mu}_{n,\nu,1:20{,}000}$. See Appendices A and B for details of the two distributions, respectively.

Figure 6. Kernel density estimates of $\log|\hat{\mu}_n|$'s for ν = 0.02, 0.2, 0.4, 0.6, 0.8, 1 and n = 1000 (solid black), 5000 (dashed dark grey), 20,000 (dotted grey), 100,000 (dash-dotted light grey), with each n combined into a single plot for comparison. (a) density plots for ν=0.02 (b) density plots for ν=0.2 (c) density plots for ν=0.4 (d) density plots for ν=0.6 (e) density plots for ν=0.8 and (f) density plots for ν=1.

Analysis 3: approximate the distribution on the log scale

This analysis aims to approximate the kernel densities for $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ in Figure 6 with some distributions, which should include the log-normal as the asymptotic distribution when $\nu$ approaches or exceeds 1. In the search for these distributions, we start with VG based on Figure 5(b) and check the goodness-of-fit using P-P plots by plotting the cumulative distribution function (cdf) $F(\log|\hat{\mu}_{n,\nu,1:20{,}000}|;\theta_\nu)$ against the ordered sequence $(1{:}20{,}000)/20{,}001$, giving one P-P plot for each $\nu=0.02,0.04,\dots,1$ and taking $n=100{,}000$. We also consider the stable distribution which, like the normal, can model properly normed sums of independent and identically distributed random variables. If each variable has a finite variance, the classical central limit theorem states that the sum will tend toward a normal distribution as the number of variables increases, but without the finite variance assumption, the limit may be a stable distribution. These provide some theoretical grounds for the trial although it still does not address the unbounded density problem. However, the P-P plots for both log-VG and log-stable distributions argue against their adoption. Finally, we try the DGG distribution with the flexible generalised gamma distribution on each side; the cdf of the GG distribution for fitting $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ is given by Equation (A2) evaluated at the fitted parameters $\hat{\theta}_\nu=(\hat{\mu}_{GG},\hat{\sigma}_{GG},\hat{m}_{GG})$. As the set of P-P plots for the GG distribution shows near-perfect straight lines except for some very minimal deviations around the 25th and 75th percentiles for $0.22\leq\nu\leq0.34$, we confirm that the GG distribution fits $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ the best amongst the log-VG and log-stable distributions. See Ref. [23], pages 77-78 for these P-P plots. Hence we choose the DGG distribution to approximate the distribution of the location parameter estimate $\hat{\mu}$ using maximum LOO likelihood. We acknowledge the limitation of this ad hoc search, but based on the P-P plots, we are confident that the DGG distribution can provide a reasonably accurate standard error (SE) estimate for $\hat{\mu}$. We will demonstrate the calculation and confirm the accuracy.

Analysis 4: assess the convergence of DGG distribution to asymptotic distribution

We assess the convergence of the DGG distribution to the asymptotic distribution as $n$ increases by plotting in Figure 7 the parameter estimates $(\hat{\mu}_{GG},\hat{\sigma}_{GG},\hat{m}_{GG})$ against $\nu$ when fitting $\log|\hat{\mu}_{n,\nu,1:20{,}000}|$ for each $(n,\nu)$ to the GG distribution, with different $n$ represented by lines in rainbow colours from red for the lowest $n=500$. We also plot the transformed parameter estimates to view the behaviour across $\nu$ more clearly. As $\nu$ decreases, $\hat{\mu}_{GG}$ in Figure 7(a,d) appears to drop roughly at a hyperbolic rate with some minor curvature for larger values of $\nu$. This agrees with the observation in Figure 6 that the location of the density shifts to the left as $\nu$ decreases. However, $\hat{\sigma}_{GG}$ in Figure 7(b,e) increases at a hyperbolic rate with two bumps as $\nu$ decreases. This increase in the spread can also be observed in the densities of Figure 6. The two bumps occur around $\nu=0.2$ and $\nu=0.7$, with the left one showing more fluctuation and no clear distinction between each $n$, possibly due to sampling error, whereas the right one has a clear distinction between each $n$, especially for $n=100{,}000$. In Figure 7(c,f), $\hat{m}_{GG}$ also has two bumps similar to $\hat{\sigma}_{GG}$, but unlike $\hat{\mu}_{GG}$ and $\hat{\sigma}_{GG}$, $\hat{m}_{GG}$ tends to some constant value as $\nu$ approaches 0. The right bumps for both $\hat{\sigma}_{GG}$ and $\hat{m}_{GG}$ suggest that the DGG distribution for $\hat{\mu}$ has yet to converge to the asymptotic distribution, and it is unclear how large $n$ should be for convergence when $0.5\leq\nu\leq1$, the range with cusp density. Together with the inconsistency of the index of the optimal convergence rate displayed in Figure 5(a), the asymptotic properties for this range of $\nu$ warrant further analysis.


Figure 7. Plot of estimates of the generalised Gumbel distribution fitted to the distribution of log|μˆn| against ν. Row 1 plots (μˆGG,σˆGG,mˆGG) against ν, respectively, while row 2 plots the transformations (−1/μˆGG,−1/σˆGG,log mˆGG) against ν, respectively, to enlarge certain portions of the plots. A rainbow colour scheme ranging from red (n = 500) to magenta (n = 20,000) denotes the sample size, and the black line represents n = 100,000. (a) μˆGG vs ν (b) σˆGG vs ν (c) mˆGG vs ν (d) −1/μˆGG vs ν (e) −1/σˆGG vs ν and (f) log mˆGG vs ν.

Demonstration of the SE calculation for the location parameter estimate using the DGG distribution

Lastly, we demonstrate the SE calculation for μˆ (d = 2) using the DGG distribution, taking the true parameters in Section 5.1 as the sample estimates. For each component μˆj, we apply the GG distribution to estimate the variability of log|μˆj|. Specifically, we use νˆ=0.15 with sample size n = 1000 to extrapolate the values −1/μˆGG = 0.0466, −1/σˆGG = −0.1291 and log mˆGG = 1.8733 by applying the spline function in R to Figure 7(d–f). This gives the GG parameter estimates μˆGG = −21.4431, σˆGG = 7.7431 and mˆGG = 6.5098. Applying these estimates to Equation (A5) with Σˆ11 = 1 and Σˆ22 = 1 gives the MADs \[ \widehat{\mathrm{MAD}}_{n,\nu}(\hat{\boldsymbol{\mu}}) = \begin{pmatrix} \mathrm{MAD}(\hat\mu_1) \\ \mathrm{MAD}(\hat\mu_2) \end{pmatrix} \approx \begin{pmatrix} 3.25\times 10^{-10} \\ 3.25\times 10^{-10} \end{pmatrix} \] using the DGG distribution, which are close to the 'observed' values \[ \mathrm{MAD}_{n,\nu}(\hat{\boldsymbol{\mu}}) = \begin{pmatrix} 3.16\times 10^{-10} \\ 2.85\times 10^{-10} \end{pmatrix} \] obtained by applying MAD_{n,ν}(x) = median(|x − median(x)|) componentwise to the simulated samples μˆj,n,ν,1:r with r = 1000 replicates. Hence the MˆAD_{n,ν} estimated from the DGG distribution provides reasonably accurate estimates of MAD_{n,ν}, which is unobserved in practice, for the SE of μˆ using maximum LOO likelihood when the shape parameter ν ∈ (0, d/2) gives an unbounded density or ν ∈ (d/2, (d+1)/2) gives a cusp density. Using MˆAD_{n,ν} as the SE estimate, we can also construct confidence intervals for μˆ.
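The MAD above can be reproduced in a few lines of R. This is our own rendering of Equation (A5) with the signs in the exponent restored, not the authors' code, but it recovers the reported value:

# Reproduce MAD(mu_hat_j) from the fitted GG parameters via Equation (A5)
mu_GG <- -21.4431; sigma_GG <- 7.7431; m_GG <- 6.5098
Sigma_jj <- 1  # diagonal scale entries, Sigma_11 = Sigma_22 = 1
mad_hat <- sqrt(Sigma_jj) *
  qgamma(0.5, shape = m_GG, rate = m_GG * exp(-mu_GG / sigma_GG))^sigma_GG
mad_hat  # approximately 3.25e-10, matching the text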

6. Conclusion

Optimising a non-convex likelihood is always challenging, particularly for noisy data with a cusped or unbounded likelihood. An obvious example is the estimation of μ for the VG distribution when ν ≤ d/2. We propose an ECM algorithm using a hierarchical model via the NMVM representation of the VG distribution to avoid direct maximisation of the full likelihood. Conditional on the mixing variable ut, yt follows a conditional normal distribution in (2). When yt → μ and ν ≤ d/2, the density in (3) becomes unbounded, and so does the conditional expectation E(1/ut | y, θ) in (17). Nitithumbundit and Chan [Citation6] proposed to cap the density, whereas we propose to use the LOO likelihood. In the case of data multiplicity, we extend the LOO likelihood to the WLOO likelihood.

To improve the estimation efficiency, we propose the local mid-point search (one-dimensional case in Section 4.2.1) and the local point search (higher-dimensional case in Section 4.2.2) for μˆ, as well as the line search (Section 4.2.4). The local point searches allow the location parameter estimates to jump closer towards the maximum. Since the derivatives of the LOO likelihood in (23) are also approximations, the line search ensures monotonicity across successive updates. The first simulation study confirms the accuracy of the location parameter estimate for the VG distribution, even though these search procedures do not necessarily guarantee that the path leads to a global maximum.
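For concreteness, the core LOO idea can be sketched in a few lines of R. This is a simplified illustration rather than the full ECM search; dens is any density function with signature dens(y, mu, ...), for instance a VG density obtained by wrapping ghyp::dghyp:

# Leave-one-out (LOO) log-likelihood: drop the observation closest to the
# candidate location mu, then sum the log-density over the remaining points.
loo_loglik <- function(mu, y, dens, ...) {
  i <- which.min(abs(y - mu))       # the data point nearest mu drives the unboundedness
  sum(log(dens(y[-i], mu, ...)))    # full likelihood without that point
}
# The LOO estimator maximises loo_loglik over candidate values of mu,
# e.g. over the data points themselves.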

In the second simulation study, our proposed ECM algorithm using the WLOO likelihood is compared to three full likelihood methods, namely MCECM from the ghyp package, AECM with fixed Δ and AECM with adaptive Δ, as well as one LOO likelihood method. The comparison is conducted under three scenarios: no data multiplicity, data multiplicity due to repetition, and data multiplicity due to rounding. Results show that the ECM algorithm using the WLOO likelihood performs well in all three scenarios.

Lastly, our third simulation study explores empirically the optimal convergence rate and asymptotic distribution of our proposed location parameter estimate using the maximum LOO likelihood method for symmetric univariate VG data. Results show that the index of the optimal convergence rate follows 1/(2ν) when 0 < ν ≤ 0.4. However, when 0.4 < ν < 1, the index appears to deviate slightly, with a sinusoidal pattern in the relative error. As ν approaches 1, the optimal rate approaches the convergence rate for asymptotic normality. Following this result, we propose to approximate the asymptotic distribution by the DGG distribution when ν ∈ (0,1], so that we can calculate the SE and construct a confidence interval for each component μˆj. In summary, the idea of the proposed ECM algorithm can be applied to other distributions with an NMVM representation, including the Student's t and GH distributions, with suitable adjustments in the E-step.

Our method using the hierarchical model in the NMVM representation is similar in spirit to Bayesian methods, which are sometimes treated as frequentist when flat priors are adopted so that the solution can be regarded as 'equivalent' to the solution under the classical likelihood method. This proxy likelihood approach can avoid the optimisation of complicated likelihoods. Further, the empirical Bayes (EB) method offers procedures for statistical inference in which the prior π(θ) is estimated from the data, i.e. π(θ|y), without relying on external input (similar to using flat priors). Hossain, Kozubowski and Podgórski [Citation24] proposed the weighted likelihood estimator of a parameter θ as (32) \[ \hat{\theta}_{EB} = \sum_{i=1}^{n} w_i \hat{\theta}_i = \frac{\sum_{i=1}^{n} \hat{\theta}_i L(\hat{\theta}_i \mid \mathbf{y})}{\sum_{i=1}^{n} L(\hat{\theta}_i \mid \mathbf{y})} \overset{LM}{=} \frac{\sum_{i=1}^{n} y_i \prod_{j=1}^{n} h(y_j - y_i)}{\sum_{i=1}^{n} \prod_{j=1}^{n} h(y_j - y_i)}, \] where θˆi=υ(yi) is an estimator of θ based on the ML principle and observation yi, and the last expression in (32) holds for the location model (LM) f(yi|θ)=h(yi−θ), yi ∈ ℝ, with h(⋅) a symmetric and unimodal density. This estimator has an EB flavour since θˆEB can be viewed as the posterior mean of the θˆi with weights wi, where the data-driven prior distribution is discrete and supported on the values θˆi with equal probability 1/n. They showed that the asymptotic normality and efficiency of the Bayes estimator under an ordinary prior π(θ) carry over to the EB estimator under the empirical sample prior π(θ|y). Based on the symmetric gamma (SG) distribution with pdf \[ f_{SG}(y \mid \theta) = \frac{|y-\theta|^{\alpha-1} \exp(-|y-\theta|)}{2\Gamma(\alpha)}, \quad y \in \mathbb{R}, \] which is unbounded when 0<α<1 (α=1 corresponds to the symmetric Laplace (SL) distribution), they showed that the location parameter θ of the LM has the closed-form expression \[ \hat{\theta}_{EB} = \frac{\sum_{i=1}^{n} y_i \exp\left(-\sum_{j\neq i}|y_j-y_i|\right) \prod_{j\neq i}|y_j-y_i|^{\alpha-1}}{\sum_{i=1}^{n} \exp\left(-\sum_{j\neq i}|y_j-y_i|\right) \prod_{j\neq i}|y_j-y_i|^{\alpha-1}} \] from (32) when the estimator θˆi=υ(yi) is a LOO estimator, that is, θˆi is calculated from the sample without yi. In a simulation study (Tables 5 and 6 of Ref. [Citation24]), θˆEB performed slightly better than \[ \hat{\theta}_{LOO} = \arg\max\{\ell_{LOO}(y_i \mid \mathbf{y}),\ i=1,\dots,n\} \] for the SG and SL distributions. This EB method, which provides a closed-form solution, is attractive. Future studies may derive the EB estimates for the VG distribution and compare the performance of the EB method with our proposed ECM method using the WLOO likelihood under several scenarios.
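As an indication of how cheap this closed-form EB estimator is to compute, the R sketch below implements the SG location estimator above on the log scale for numerical stability. It is our own rendering of the formula, not code from Ref. [Citation24]:

# Empirical Bayes location estimate for the symmetric gamma location model:
# weight w_i proportional to exp(-sum_{j!=i}|y_j-y_i|) * prod_{j!=i}|y_j-y_i|^(alpha-1)
theta_eb_sg <- function(y, alpha) {
  logw <- vapply(seq_along(y), function(i) {
    d <- abs(y[-i] - y[i])
    (alpha - 1) * sum(log(d)) - sum(d)  # log of the unnormalised weight
  }, numeric(1))
  w <- exp(logw - max(logw))  # normalise on the log scale to avoid underflow
  sum(w * y) / sum(w)
}
set.seed(1)
theta_eb_sg(rnorm(100), alpha = 0.5)  # toy data; the estimate is near 0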

Moreover, our third simulation study of the optimal convergence rate and asymptotic distribution of the location parameter estimate considers only the univariate symmetric case using the LOO likelihood. For future work, we will consider the multivariate skewed case using the WLOO likelihood. In-depth study should also be directed to the case 0.5 ≤ ν ≤ 1, when the density has a cusp, as results show inconsistency of the index of the optimal convergence rate and convergence problems for the asymptotic distribution. Furthermore, the dependence between the location and the other parameters of the VG distribution should be considered in the simulation studies. Lastly, it is worth exploring better approximations to the asymptotic distribution from both numerical and theoretical perspectives, even though this can be challenging.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Rao B. Estimation of the location of the cusp of a continuous density. Ann Math Stat. 1968;39:76–87.
  • Ibragimov IA, Khasminskii RZ. Asymptotic behavior of statistical estimates of the location parameter for samples with unbounded density. J Sov Math. 1981a;16(2):1035–1041.
  • Ibragimov IA, Khasminskii RZ. Statistical estimation: asymptotic theory. Applications of Mathematics, Vol. 16. Translated from the Russian by Samuel Kotz. New York-Berlin: Springer-Verlag; 1981b.
  • Rao B. Asymptotic distributions in some non-regular statistical problems. PhD dissertation, Michigan State University; 1966.
  • Podgórski K, Wallin J. Maximizing leave-one-out likelihood for the location parameter of unbounded densities. Ann Inst Stat Math. 2015;67(1):19–38.
  • Nitithumbundit T, Chan JSK. ECM algorithm for auto-regressive multivariate skewed variance gamma model with unbounded density. Methodol Comput Appl Probab. 2020;22(3):1169–1191.
  • Meng X-L, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80(2):267–278.
  • Liu C, Rubin DB. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81(4):633–648.
  • Liu C, Rubin DB. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Stat Sin. 1995;5(1):19–39.
  • Madan DB, Seneta E. The variance gamma (V.G.) model for share market returns. J Bus. 1990;63(4):511–524.
  • Nitithumbundit T, Chan JSK. ECM algorithm for estimating vector ARMA model with variance gamma distribution and possible unbounded density. Aust N Z J Stat. 2021;63(3):485–516.
  • Barndorff-Nielsen O, Kent J, Sørensen M. Normal variance-mean mixtures and z distributions. Int Stat Rev. 1982;50(2):145–159.
  • Protassov RS. EM-based maximum likelihood parameter estimation for multivariate generalized hyperbolic distributions with fixed λ. Stat Comput. 2004;14(1):67–77.
  • Kotz S, Kozubowski TJ, Podgórski K. The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance. Boston: Birkhäuser; 2001.
  • Gradshteyn IS, Ryzhik IM. Table of integrals, series, and products. 7th ed., Amsterdam: Elsevier/Academic Press; 2007.
  • Kawai R. On the likelihood function of small time variance gamma Lévy processes. Statistics. 2015;49(1):63–83.
  • Embrechts P. A property of the generalized inverse Gaussian distribution with some applications. J Appl Probab. 1983;20(3):537–544.
  • Salakhutdinov R, Roweis S. Adaptive overrelaxed bound optimization methods. In: Proceedings of the International Conference on Machine Learning; 2003; Vol. 20. p. 664–671.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc Ser B. 1977;39(1):1–38. With discussion.
  • Wu C-FJ. On the convergence properties of the EM algorithm. Ann Stat. 1983;11(1):95–103.
  • Meng X-L, van Dyk D. The EM algorithm – an old folk-song sung to a fast new tune. J Roy Statist Soc Ser B. 1997;59(3):511–567. With discussion and a reply by the authors.
  • Schruth D. caroline: A Collection of Database, Data Structure, Visualization, and Utility Functions for R. R package version 0.7.6; 2013.
  • Nitithumbundit T. EM algorithms for multivariate skewed variance gamma distribution with unbounded densities and applications. PhD thesis, School of Mathematics and Statistics, University of Sydney; 2017. https://ses.library.usyd.edu.au/handle/2123/18155.
  • Hossain MM, Kozubowski TJ, Podgórski K. A novel weighted likelihood estimation with empirical bayes flavor. Commun Stat Simul Comput. 2018;47(2):392–412.
  • Adeyemi S. On a generalization of the Gumbel distribution. Inter-Stat (London). 2002;11(4):1–7.
  • Lin GD, Huang JS. The cube of a logistic distribution is indeterminate. Austral J Statist. 1997;39(3):247–252.

Appendix

A Generalised Gumbel (GG) distribution

The pdf of a GG distribution is given by (A1) \[ f_{GG}(x) = \frac{m^m}{\sigma\Gamma(m)} \exp\left( \frac{m(x-\mu)}{\sigma} - m \exp\left(\frac{x-\mu}{\sigma}\right) \right), \quad x \in \mathbb{R}, \] where μ ∈ ℝ is the location parameter, σ>0 is the scale parameter, and m>0 is the shape parameter. Note that we consider the negative version of the GG distribution given in Ref. [Citation25].

We can easily generate GG random variables from gamma random variables, denoted by G(a,b) with shape a and rate b, using the following theorem.

Theorem A.1

If X ∼ G(m,m), then Y = μ + σ log X follows the GG distribution with pdf in Equation (A1).

Proof.

The idea of the proof is similar to the proof in Ref. [Citation25].

Using this transformation, we can compute the cdf as (A2) \[ F_{GG}(x) = F_{\text{gamma}}\left( \exp\left(\frac{x-\mu}{\sigma}\right); m, m \right), \quad x \in \mathbb{R}, \] and the quantile function as \[ Q_{GG}(p) = \mu + \sigma \log Q_{\text{gamma}}(p; m, m), \quad p \in [0,1], \] where F_gamma(x;a,b) and Q_gamma(p;a,b) are the cdf and quantile function of G(a,b). In other words, we only need the cdf and quantiles of the gamma distribution. The mean and variance of a GG random variable X are given by (A3) \[ E(X) = \mu + \sigma(\psi(m) - \log m) \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2 \psi'(m), \] where ψ(⋅) and ψ′(⋅) denote the digamma and trigamma functions.
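Theorem A.1 makes simulation and probability calculations for the GG distribution one-liners in R. The helper names rgg, pgg and qgg below are our own (a minimal sketch, assuming the rate parameterisation of G(a,b)):

# GG random variates, cdf (A2) and quantile function via the gamma representation
rgg <- function(n, mu, sigma, m) mu + sigma * log(rgamma(n, shape = m, rate = m))
pgg <- function(x, mu, sigma, m) pgamma(exp((x - mu) / sigma), shape = m, rate = m)
qgg <- function(p, mu, sigma, m) mu + sigma * log(qgamma(p, shape = m, rate = m))
# Numerical check of the moments in (A3)
x <- rgg(1e5, mu = -2, sigma = 0.5, m = 3)
c(mean(x), -2 + 0.5 * (digamma(3) - log(3)))  # E(X) = mu + sigma*(psi(m) - log m)
c(var(x), 0.25 * trigamma(3))                 # Var(X) = sigma^2 * psi'(m)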

B Double generalised gamma (DGG) distribution

After fitting log|μˆn,ν,1:r| to the GG distribution, we can deduce that the distribution of μˆn,ν,1:r follows a DGG distribution [Citation26] using the following theorem.

Theorem A.2

If X follows a symmetric distribution such that log|X| follows the GG distribution with pdf in Equation (A1), then X follows a double generalised gamma distribution with pdf (A4) \[ f_{DGG}(x) = \frac{\gamma \beta^{\alpha}}{2\Gamma(\alpha)} |x|^{\gamma\alpha - 1} \exp\left( -\beta |x|^{\gamma} \right), \quad x \in \mathbb{R}, \] where α>0, β>0, and γ>0.

Proof.

If Y = log|X| follows the GG distribution, then Y has pdf proportional to \[ f_Y(y) \propto \exp\left( \frac{my}{\sigma} - m e^{-\mu/\sigma} e^{y/\sigma} \right), \] where it suffices to consider the functional form. Applying the transformation Y = log W, where W = |X|, gives the pdf of W as \[ f_W(w) \propto \exp\left( \frac{m}{\sigma}\log w - m e^{-\mu/\sigma} w^{1/\sigma} \right) \frac{1}{w} = w^{m/\sigma - 1} \exp\left( -m e^{-\mu/\sigma} w^{1/\sigma} \right), \] which has the functional form of the generalised gamma distribution. Reversing the transformation W = |X| by reflecting the pdf at 0 gives the result. Setting α = 1/2, β = 1/2 and γ = 2 in (A4) gives the standard normal density as a special case. As the simulation results in Section 5.3.2 show that the GG distribution fits log|μˆn,ν,1:r| reasonably well, we can model μˆn,ν,1:r using the DGG distribution. Suppose that Σˆ is the scale parameter estimate of the VG distribution with diagonal entries Σˆii for i=1,…,d. Then the standard error of μˆi can be approximated by \[ \mathrm{SE}(\hat\mu_i) \approx \sqrt{ \hat\Sigma_{ii}\, \beta^{-2/\gamma}\, \frac{\Gamma(\alpha + 2/\gamma)}{\Gamma(\alpha)} }, \] where α = m_GG, β = m_GG exp(−μ_GG/σ_GG), γ = 1/σ_GG, and (μ_GG, σ_GG, m_GG) are the GG parameter estimates extrapolated from Figure 7. Since the standard error is sensitive to outliers, we recommend the MAD given by (A5) \[ \mathrm{MAD}(\hat\mu_i) \approx Q_{\text{gengamma}}(0.5) = \sqrt{\hat\Sigma_{ii}} \left[ Q_{\text{gamma}}\left( 0.5;\, m_{GG},\, m_{GG} e^{-\mu_{GG}/\sigma_{GG}} \right) \right]^{\sigma_{GG}} \] as a robust measure of spread for μˆi, where Q_gengamma(⋅) is the quantile function of the generalised gamma distribution whose pdf has the functional form in (A4) with support x>0.
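A quick simulation check of Theorem A.2 can be run in R. This is our own sketch with illustrative parameter values, using the DGG parameterisation α = m, β = m exp(−μ/σ), γ = 1/σ derived in the proof:

# Simulate X with log|X| ~ GG(mu, sigma, m) and a random sign, then compare
# the empirical second moment with the DGG moment formula used for SE(mu_hat_i)
set.seed(1)
mu <- -1; sigma <- 0.5; m <- 2
x <- sample(c(-1, 1), 1e5, replace = TRUE) *
  exp(mu + sigma * log(rgamma(1e5, shape = m, rate = m)))
alpha <- m; beta <- m * exp(-mu / sigma); gam <- 1 / sigma
c(empirical = mean(x^2),
  theory = beta^(-2 / gam) * gamma(alpha + 2 / gam) / gamma(alpha))  # both ~ 0.135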