Original Articles

A gradient search maximization algorithm for the asymmetric Laplace likelihood

Pages 1919-1925 | Received 12 Feb 2014, Accepted 23 Mar 2014, Published online: 25 Apr 2014

Abstract

The asymmetric Laplace likelihood naturally arises in the estimation of conditional quantiles of a response variable given covariates. The estimation of its parameters entails unconstrained maximization of a concave and non-differentiable function over the real space. In this note, we describe a maximization algorithm based on the gradient of the log-likelihood that generates a finite sequence of parameter values along which the likelihood increases. The algorithm can be applied to the estimation of mixed-effects quantile regression, Laplace regression with censored data, and other models based on Laplace likelihood. In a simulation study and in a number of real-data applications, the proposed algorithm has shown notable computational speed.

1. Introduction

An increasing number of recently proposed methods for the estimation of conditional quantiles of a continuous outcome variable given covariates make use of the asymmetric Laplace distribution.[Citation1–6] We describe a maximization algorithm for the asymmetric Laplace likelihood and compare its large-sample performance with that of the Frisch–Newton (FN) interior-point method for the estimation of quantile regression. Simple extensions of the presented algorithm have been applied to the estimation of Laplace regression with censored data [Citation7] and mixed-effects quantile regression as an alternative to the Gibbs sampler.[Citation8,Citation9]

Let $y_i$, $i = 1, \ldots, n$, be a sample of continuous variables and $x_i$ be $v$-dimensional vectors of corresponding covariates. We consider the regression model

$$y_i = x_i'\beta_p + u_i, \qquad (1)$$

where the residual $u_i$ follows an unspecified distribution with $P(u_i \le 0) = p$ for a given probability $p \in (0, 1)$. If model (1) is correct, then $x_i'\beta_p$ is the $p$-quantile of the conditional distribution of $y_i$ given $x_i$. The unknown regression coefficient $\beta_p \in \mathbb{R}^v$ depends on $p$ and can be estimated by minimizing a weighted sum of absolute residuals with computationally efficient methods.[Citation10–12]

It is known [Citation13] that this minimization problem can be seen as the maximization of a likelihood function by assuming that the regression residual $u_i$ follows the asymmetric Laplace density function $f(u_i) = p(1-p)\exp\{u_i(I_{u_i \le 0} - p)\}$. Here and throughout, $I_A$ denotes the indicator function of the set $A$.

The log-likelihood function of model (1),

$$l(\beta_p) = n\log\{p(1-p)\} + \sum_{i=1}^{n}(y_i - x_i'\beta_p)\,(I_{y_i \le x_i'\beta_p} - p), \qquad (2)$$

is continuous at all $\beta_p \in \mathbb{R}^v$. Its first derivative, $\partial l(\beta_p)/\partial\beta_p$, is continuous everywhere except at the points $\beta_p$ such that $y_i = x_i'\beta_p$ for at least one observation $i \in \{1, 2, \ldots, n\}$. We define the gradient

$$s(\beta_p) = -\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p). \qquad (3)$$

The function $s(\beta_p)\colon \mathbb{R}^v \to \mathbb{R}^v$ equals the first derivative $\partial l(\beta_p)/\partial\beta_p$ at all points of the parameter space where the latter is defined.
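For concreteness, the log-likelihood (2) and the gradient (3) translate directly into a few lines of R, the language used for the implementation in Section 4. The sketch below is ours and purely illustrative, not part of any package; it assumes the responses are stored in a numeric vector y and the covariates in a design matrix x with one row per observation.

```r
# Log-likelihood (2) of the asymmetric Laplace model at parameter beta
loglik_al <- function(beta, y, x, p) {
  r <- y - x %*% beta                           # residuals y_i - x_i'beta
  length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
}

# Gradient (3): minus the sum of x_i * (I(y_i <= x_i'beta) - p)
grad_al <- function(beta, y, x, p) {
  r <- y - x %*% beta
  -colSums(x * as.vector((r <= 0) - p))
}

# small check on simulated data
set.seed(1)
x <- cbind(1, rnorm(50)); y <- 2 + x[, 2] + rnorm(50)
loglik_al(c(0, 0), y, x, p = 0.5)
grad_al(c(0, 0), y, x, p = 0.5)
```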

2. Gradient-search algorithm

In this section, we describe an algorithm for the unconstrained maximization of the log-likelihood function $l(\beta_p)$ with respect to the parameter $\beta_p \in \mathbb{R}^v$. Briefly, from the current parameter value, the algorithm searches the positive semi-line in the direction of the gradient $s(\beta_p)$ for a new parameter value at which the likelihood is larger. The algorithm stops when the change in the log-likelihood is less than a specified tolerance. Convergence is guaranteed by the continuity and concavity of the log-likelihood, as shown in Section 3.

The algorithm can be summarized by the following steps:

(1) Set $k = 0$ and choose the initial values $\beta_p^0 \in \mathbb{R}^v$ and $\delta_0 > 0$.

(2) If $l(\beta_p^k) > l\{\beta_p^k + \delta_k s(\beta_p^k)\}$,

  (a) then set $\beta_p^{k+1} = \beta_p^k$ and $\delta_{k+1} = a\delta_k$;

  (b) else if $l\{\beta_p^k + \delta_k s(\beta_p^k)\} - l(\beta_p^k) > \epsilon$,

    (i) then set $\beta_p^{k+1} = \beta_p^k + \delta_k s(\beta_p^k)$ and $\delta_{k+1} = b\delta_k$;

    (ii) else return $\beta_p^k + \delta_k s(\beta_p^k)$ and stop.

(3) Set $k = k + 1$ and go to step (2).

The algorithm requires setting the following initial values: the starting value of the parameter, $\beta_p^0 \in \mathbb{R}^v$; the initial step-length, $\delta_0 > 0$; the factor for shortening the step-length, $a \in (0, 1)$; the factor for expanding the step-length, $b \ge 1$; and the tolerance for the change in the log-likelihood, $\epsilon > 0$. In addition, the constant $p \in (0, 1)$ and the data $y_i$ and $x_i$ are required to evaluate the functions $l(\beta_p)$ and $s(\beta_p)$.
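The steps above map almost one-to-one onto R code. The following is a minimal sketch of the GS iteration under our own naming, not the authors' implementation; the default arguments mirror the settings listed in the previous paragraph, and the log-likelihood and gradient are the functions (2) and (3) written out in Section 1.

```r
# Minimal sketch of the gradient-search (GS) algorithm of Section 2 (illustrative only)
gs_laplace <- function(y, x, p, beta0 = NULL, delta0 = sd(y),
                       a = 0.5, b = 1.25, eps = 1e-10, maxit = 1e5) {
  loglik <- function(beta) {                            # log-likelihood (2)
    r <- y - x %*% beta
    length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
  }
  grad <- function(beta) {                              # gradient (3)
    r <- y - x %*% beta
    -colSums(x * as.vector((r <= 0) - p))
  }
  if (is.null(beta0)) beta0 <- qr.coef(qr(x), y)        # least-squares start
  beta <- beta0; delta <- delta0; l_cur <- loglik(beta)
  for (k in seq_len(maxit)) {
    cand  <- beta + delta * grad(beta)                  # candidate along the gradient
    l_new <- loglik(cand)
    if (l_cur > l_new) {
      delta <- a * delta                                # step 2.a: shorten the step
    } else if (l_new - l_cur > eps) {
      beta <- cand; l_cur <- l_new; delta <- b * delta  # step 2.b.i: accept and expand
    } else {
      return(list(coefficients = cand, logLik = l_new, iterations = k))  # step 2.b.ii
    }
  }
  warning("maximum number of iterations reached")
  list(coefficients = beta, logLik = l_cur, iterations = maxit)
}

# example: 0.75-quantile regression on simulated data
set.seed(2)
x <- cbind(1, rnorm(200)); y <- 1 + 2 * x[, 2] + rt(200, df = 3)
gs_laplace(y, x, p = 0.75)$coefficients
```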

3. Convergence of the algorithm

This section discusses the convergence of the gradient search (GS) algorithm defined in Section 2 and states some properties of the log-likelihood function $l(\beta_p)$ defined in Equation (2).

Definition 1

Let $\Omega = \{\beta_p^* \in \mathbb{R}^v : l(\beta_p^*) \ge l(\beta_p), \ \forall\, \beta_p \in \mathbb{R}^v\}$ be the set of maximizers of the log-likelihood function.

The log-likelihood function $l(\beta_p)$ is a sum of continuous, concave functions and is therefore itself continuous and concave. Its continuity and concavity guarantee that a unique maximum exists and that the set of maximizers, $\Omega \subset \mathbb{R}^v$, is compact. The well-known result stated in Lemma 1 is instrumental in establishing the convergence of the GS algorithm in Theorem 1.

Lemma 1

For any sequence of parameters $\{\beta_p^k\}$ such that $l(\beta_p^k) < l(\beta_p^{k+1})$ for all $k \in \mathbb{N}$, $\lim_{k\to\infty} \beta_p^k = \beta_p^* \in \Omega$.

Proof The statement in Lemma 1 follows from the continuity and concavity of the log-likelihood function $l(\beta_p)$ defined in Equation (2).

The following Theorem 1 shows that the sequence of parameter values $\beta_p^{k+1} = \beta_p^k + \delta_k s(\beta_p^k)$ iteratively produced by Step 2.b.i of the GS algorithm satisfies the condition in Lemma 1.

Theorem 1

For any given point $\beta_p \in \mathbb{R}^v$ such that $y_i \ne x_i'\beta_p$ for all $i \in \{1, \ldots, n\}$, there exists a value $\delta_0 > 0$ such that $l(\beta_p) \le l(\beta_p + \delta s(\beta_p))$ for every $0 < \delta < \delta_0$.

Proof The directional derivative of $l(\beta_p)$ in the direction $v \ne 0$ evaluated at $\beta_p$ is

$$\frac{\partial}{\partial\delta}\, l(\beta_p + \delta v)\Big|_{\delta=0} = \frac{\partial}{\partial\delta} \sum_{i=1}^{n}(y_i - x_i'\beta_p - \delta x_i'v)\,(I_{y_i \le x_i'\beta_p + \delta x_i'v} - p)\Big|_{\delta=0} = -\sum_{i=1}^{n}\left[ x_i'v\,(I_{y_i \le x_i'\beta_p} - p)\, I_{y_i \ne x_i'\beta_p} + x_i'v\,(I_{x_i'v \ge 0} - p)\, I_{y_i = x_i'\beta_p}\right].$$

Assuming that $y_i \ne x_i'\beta_p$ for all $i \in \{1, \ldots, n\}$, the directional derivative in the direction of the gradient $v = s(\beta_p)$ evaluated at $\beta_p$,

$$\left\{-\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p)\right\}'\left\{-\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p)\right\} = \|s(\beta_p)\|^2,$$

is non-negative and greater than that in any other direction $v \in \mathbb{R}^v$ such that $\|v\| = \|s(\beta_p)\|$.

If $\|s(\beta_p^k)\|^2 > 0$, then the directional derivative at $\beta_p^k$ in the direction of the gradient $s(\beta_p^k)$ is positive, and there exists a $\delta_0 > 0$ such that the log-likelihood function increases in a $\delta_0$-neighbourhood of $\beta_p^k$ in the direction of $s(\beta_p^k)$; that is,

$$\exists\, \delta_0 > 0 : \quad l(\beta_p^k) < l(\beta_p^k + \delta s(\beta_p^k)) \quad \forall\, \delta \in (0, \delta_0).$$

Step 2.a of the GS algorithm shortens the step-length $\delta_k$ by a factor $a \in (0, 1)$ until $\delta_k < \delta_0$. If $\|s(\beta_p^k)\|^2 = 0$, then $s(\beta_p^k) = 0$. The first-order condition $s(\beta_p^k) = 0$, along with the continuity and concavity of $l(\beta_p)$ over the entire space $\mathbb{R}^v$, implies that $\beta_p^k$ is a maximizer of the log-likelihood, i.e. $\beta_p^k \in \Omega$.

Theorem 1 assumes that $y_i \ne x_i'\beta_p^k$ for all observations $i \in \{1, \ldots, n\}$. With the GS algorithm defined in Section 2, the event that $y_i$ is numerically equal to $x_i'\beta_p^k$ for some observation $i$ occurs with probability zero. If it did occur, the GS algorithm would move away from that point in the direction $s(\beta_p^k)$ defined in Equation (3).

While meeting the condition of Lemma 1 guarantees that a parameter sequence $\{\beta_p^k\}$ converges to a maximizer, it does not guarantee that it converges efficiently. Indeed, in an infinite number of iterations, any algorithm that generates a sequence of increasing likelihood values would converge to the maximum. For example, a trivial compass search along the orthogonal directions of $\mathbb{R}^v$ would be a convergent, albeit very inefficient, alternative to the GS. The efficiency of the GS algorithm ensues from its moving from the current parameter value in the direction of maximum local increase.
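As a quick numerical illustration of Theorem 1, one can verify on simulated data that a sufficiently small step from an arbitrary parameter value along the gradient does not decrease the log-likelihood. The snippet below is our own check, reusing the loglik_al and grad_al sketches from Section 1 (reproduced so that it runs on its own); the point beta and the step-length delta are arbitrary choices.

```r
set.seed(3)
x <- cbind(1, rnorm(100)); y <- 1 + x[, 2] + rnorm(100); p <- 0.75

loglik_al <- function(beta) {                 # log-likelihood (2)
  r <- y - x %*% beta
  length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
}
grad_al <- function(beta) {                   # gradient (3)
  r <- y - x %*% beta
  -colSums(x * as.vector((r <= 0) - p))
}

beta  <- c(0.3, -0.2)        # arbitrary point with y_i != x_i'beta for all i
delta <- 1e-6                # a small step-length
loglik_al(beta + delta * grad_al(beta)) >= loglik_al(beta)   # should print TRUE
```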

4. A simulation study

We performed a simulation study and compared the proposed GS method with the FN interior-point method in large samples, where the latter has been shown to outperform the Barrodale–Roberts method.[Citation11]

With the statistical program R, we generated 7200 samples, which consisted of 100 samples under each of 72 scenarios obtained from the combination of 3 quantiles, 2 sample sizes, 4 regression models, and 3 distributions for the regression residual. The three quantiles were $p \in \{0.50, 0.75, 0.90\}$. The two sample sizes were $n \in \{10^4, 10^5\}$. The four regression models were (1) $y_i = i$, with $i \in \{1, \ldots, n\}$; (2) $y_i = 1 + x_i^{(1)} + x_i^{(2)} + u_i$; (3) $y_i = 1 + x_i^{(3)} + \cdots + x_i^{(10)} + u_i$; and (4) $y_i = 30 + 5000\, x_i^{(11)} + 2\, x_i^{(12)} + 0.5\, x_i^{(13)} + (0.5 + x_i^{(11)} - 0.5\, x_i^{(12)} + 0.5\, x_i^{(13)})\, u_i$; where $x_i^{(j)} \sim \mathrm{Normal}(0,1)$ for $j \in \{1, \ldots, 10\}$, $x_i^{(11)} \sim \mathrm{Uniform}(0,1)$, $x_i^{(12)} \sim \mathrm{Bernoulli}(0.5)$, and $x_i^{(13)} \sim T(3)$. The three distributions of the regression residual were $u_i \sim \mathrm{Normal}(0,1)$, $u_i \sim T(3)$, and $u_i \sim \mathrm{Lognormal}(0,1)$.
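As an illustration of the data-generating process, the following sketch (our own code, written to match the description above) draws one sample of size $n = 10^4$ under regression model (4) with $T(3)$ residuals.

```r
set.seed(4)
n   <- 1e4
x11 <- runif(n)                     # x^(11) ~ Uniform(0,1)
x12 <- rbinom(n, 1, 0.5)            # x^(12) ~ Bernoulli(0.5)
x13 <- rt(n, df = 3)                # x^(13) ~ T(3)
u   <- rt(n, df = 3)                # residual distribution T(3)

# regression model (4), heteroscedastic in x^(11), x^(12), x^(13)
y <- 30 + 5000 * x11 + 2 * x12 + 0.5 * x13 +
     (0.5 + x11 - 0.5 * x12 + 0.5 * x13) * u
x <- cbind(1, x11, x12, x13)        # design matrix with intercept
```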

The GS algorithm was implemented as an R function with the following starting values: $\beta_p^0$ equal to the least-squares estimate, $\delta_0$ equal to the sample standard deviation of $y$, $a = 0.5$, $b = 1.25$, and $\epsilon = 10^{-10}$. For the FN method, we used the R function rq.fit.fnb, which called compiled Fortran code (quantreg package, version 5.05).
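For reference, the FN comparator is also available through quantreg's documented interface rq() with method = "fn", which dispatches to rq.fit.fnb. The timing sketch below is ours and only indicates the kind of comparison summarized in Table 1; the sample is a small stand-in generated under regression model (2).

```r
library(quantreg)                   # provides rq(); method = "fn" uses rq.fit.fnb()
set.seed(5)
n  <- 1e4; p <- 0.75
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x2 + x3 + rnorm(n)        # regression model (2) with Normal(0,1) residuals

# Frisch-Newton interior-point fit, timed
system.time(fit_fn <- rq(y ~ x2 + x3, tau = p, method = "fn"))
coef(fit_fn)
```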

In each generated sample, we estimated the regression coefficients with the GS and the FN algorithm. We compared the methods on the following measures:

(1) ‘FN-to-GS time ratio’: the ratio between the mean CPU time of FN and the mean CPU time of GS. In all models, the timing of GS included the least-squares estimation of the initial values $\beta_p^0$. For models (2)–(4), it also included the standardization of all covariates and the subsequent rescaling of the parameter estimates.

(2) ‘Condition on residuals met’: the proportion of samples in which the number of negative ($R^-$) and the number of positive ($R^+$) residuals satisfied the condition $R^- \le np \le n - R^+$.[Citation14, Theorem 3.4] (See the sketch after this list.)

(3) ‘Max $\log_{10}|l(\beta_p^{GS}) - l(\beta_p^{FN})|$’: the maximum of the logarithm to base 10 of the absolute difference between the maximum log-likelihood values achieved by the two methods in each sample.

(4) ‘Max $|\beta_p^{GS}/\beta_p^{FN} - 1|$’: the maximum of the absolute relative difference between the parameter estimates in each sample.

(5) ‘Observed bias’: the difference between the estimated and the true parameter values, averaged over the simulated samples.
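As referenced in item (2), the residual-sign condition, together with measures (3) and (4), is straightforward to compute for any pair of fitted coefficient vectors. The sketch below is our own illustration; for lack of a packaged GS implementation, the FN and Barrodale–Roberts fits from quantreg stand in for the two estimates being compared.

```r
library(quantreg)
set.seed(6)
n  <- 1000; p <- 0.9
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x2 + x3 + rlnorm(n)
X  <- cbind(1, x2, x3)

fit_a <- coef(rq(y ~ x2 + x3, tau = p, method = "fn"))   # stand-in for the GS fit
fit_b <- coef(rq(y ~ x2 + x3, tau = p, method = "br"))   # stand-in for the FN fit

# (2) residual-sign condition R- <= np <= n - R+ (Koenker and Bassett, Theorem 3.4)
r <- y - X %*% fit_a
c(sum(r < 0) <= n * p, n * p <= n - sum(r > 0))

# (3) log10 absolute difference between the achieved log-likelihoods (2)
loglik_al <- function(beta) {
  u <- y - X %*% beta
  n * log(p * (1 - p)) + sum(u * ((u <= 0) - p))
}
log10(abs(loglik_al(fit_a) - loglik_al(fit_b)))

# (4) maximum absolute relative difference between the coefficient estimates
max(abs(fit_a / fit_b - 1))
```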

The results of the simulation are given in Table 1. The observed bias was virtually zero at all quantiles in all the considered scenarios and is therefore not reported. The entries in Table 1 are averages over 300 samples, comprising 100 replicates for each of the 3 distributions of the regression residual. No notable differences were observed across distributions. The performance of the two methods was comparable with respect to the maximum likelihood achieved and the proportion of samples in which the condition on residuals was met. The GS algorithm was faster than FN, although its advantage shrank as the number of covariates increased or the sample size decreased. Optimizing the R programming code and compiling it into machine language might further improve its computational speed.

Table 1.  Performance of the GS and FN algorithm for quantile regression estimation.

Following an insightful comment from an anonymous reviewer, we checked the sensitivity of the algorithm with respect to its initial parameter values $\beta_p^0$. We re-ran the entire simulation study twice: the first time we set all elements of $\beta_p^0$ equal to 0; the second time we set them equal to 100. While GS took longer to converge with these unreasonable starting values than with the least-squares estimates, it was still generally faster than FN. The final estimates of the parameter $\beta_p$ and the achieved maximum likelihood seemed largely unaffected by the choice of the initial values and remained comparable to those reported in Table 1.

5. Remarks

The GS algorithm presented in this note is similar to the Newton–Raphson algorithm, in that it moves from the current parameter value in the direction of the gradient, and to direct search methods, in that it iteratively adjusts the step-length multiplier. Unlike methods based on vertex search in a simplex, such as Barrodale and Roberts' method,[Citation10] the algorithm proposed in this paper searches the parameter space without constraints. Like Newton–Raphson and other optimization algorithms, GS only approximates the optimal solution within a given tolerance.

Perhaps one of the most useful features of the GS algorithm is its wide applicability. GS may be regarded as a general algorithm for the unconstrained optimization of any objective function that is continuous, concave, and first-order differentiable everywhere except on a set of points with measure zero. Unlike Newton–Raphson, GS does not require a second derivative, which is generally unavailable for Laplace-based likelihoods. Thanks to its flexibility, the algorithm can be applied to the estimation of Laplace regression with censored data, mixed-effects quantile regression, and other models based on the Laplace likelihood, or to the optimization of objective functions with similar features.

We recommend setting the initial parameter, $\beta_p^0$, to reasonable values. While GS appears to be rather insensitive to the choice of the initial parameter value, reasonable starting values generally improve both convergence speed and accuracy. Least-squares estimates have been satisfactory in our experience. The choice of the convergence criteria and tolerance levels may substantially impact convergence. Stricter criteria may improve maximization accuracy but increase computing time. Generally, we recommend setting $\epsilon$ as small, $a$ as close to 1, and $b$ as large as computational time and the user's patience allow. The values used in the simulation study ($\epsilon = 10^{-10}$, $a = 0.5$, and $b = 1.25$) gave a performance comparable with that of the FN method. These values have also proved adequate in a number of recent applications of the GS algorithm to the estimation of the coefficients of Laplace regression with censored data.[Citation15–17, among others]

The steps of the GS algorithm listed in Section 2 can be modified in several ways. The following are some examples. The convergence criterion in step 2.b can include checking the change from the previous iteration in the parameter estimates or in the signs of the residuals. In step 2.b.i, the step-length can be reset to its initial value, $\delta_{k+1} = \delta_0$. The step-length multipliers, $a$ and $b$, may depend on the sample size. When searching for a new parameter value along the direction of the gradient, the algorithm could find the root of the directional derivative by bisection or by considering the signs of the residuals. This would reduce the number of iterations but increase the computational burden at each iteration.

In the simulation reported in Table 1 and in a number of real-data applications, the GS algorithm has shown remarkable computational speed, which may be a desirable feature when analysing large samples. In small samples (e.g. $n = 100$), the GS, FN, and Barrodale–Roberts algorithms converge so quickly as to make any slight differences in computational time inconsequential from most practical standpoints.

REFERENCES

  • Bottai M, Zhang J. Laplace regression with censored data. Biom J. 2010;52(4):487–503. doi: 10.1002/bimj.200900310
  • Farcomeni A. Quantile regression for longitudinal data based on latent Markov subject-specific parameters. Stat Comput. 2012;22:141–152. doi: 10.1007/s11222-010-9213-0
  • Geraci M, Bottai M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics. 2007;8:140–154. doi: 10.1093/biostatistics/kxj039
  • Lee D, Neocleous T. Bayesian quantile regression for count data with application to environmental epidemiology. J R Stat Soc: Ser C Appl Statist. 2010;59:905–920. doi: 10.1111/j.1467-9876.2010.00725.x
  • Liu Y, Bottai M. Mixed-effects models for conditional quantiles with longitudinal data. Int J Biostat. 2009;5. Article 28.
  • Yuan Y, Yin G. Bayesian quantile regression for longitudinal studies with nonignorable missing data. Biometrics. 2010;66:105–114. doi: 10.1111/j.1541-0420.2009.01269.x
  • Bottai M, Orsini N. A command for Laplace regression. Stata J. 2013;13(2):302–314.
  • Geraci M, Bottai M. Linear quantile mixed models. Stat Comput. 2014;24:461–479. doi: 10.1007/s11222-013-9381-9
  • Geraci M. Linear quantile mixed models: the lqmm package for Laplace quantile regression. J Statist Soft. 2014 (in press).
  • Koenker R, d'Orey V. Computing regression quantiles. Appl Stat. 1987;36:383–393.
  • Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat Sci. 1997;12:279–296. doi: 10.1214/ss/1030037960
  • Hunter D, Lange K. Quantile regression via an MM algorithm. J Comput Graph Statist. 2000;9:60–77.
  • Koenker R, Machado JAF. Goodness of fit and related inference processes for quantile regression. J Amer Statist Assoc. 1999;94:1296–1310. doi: 10.1080/01621459.1999.10473882
  • Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46:33–50. doi: 10.2307/1913643
  • Orsini N, Wolk A, Bottai M. Evaluating percentiles of survival. Epidemiology. 2012;23:770–771. doi: 10.1097/EDE.0b013e3182625eff
  • Bellavia A, Akerstedt T, Bottai M, Wolk A, Orsini N. Sleep duration and survival percentiles across categories of physical activity. Amer J Epidemiol. 2014;179(4):484–491. doi: 10.1093/aje/kwt280
  • Johannessen A, Skorge TD, Bottai M, Grydeland TB, Nilsen RM, Coxson H, Dirksen A, Omenaas E, Gulsvik A, Bakke P. Mortality by level of emphysema and airway wall thickness. Amer J Respir Crit Care Med. 2013;187(6): 602–608. doi: 10.1164/rccm.201209-1722OC