Original Articles

A gradient search maximization algorithm for the asymmetric Laplace likelihood

Pages 1919-1925 | Received 12 Feb 2014, Accepted 23 Mar 2014, Published online: 25 Apr 2014

Abstract

The asymmetric Laplace likelihood naturally arises in the estimation of conditional quantiles of a response variable given covariates. The estimation of its parameters entails unconstrained maximization of a concave and non-differentiable function over the real space. In this note, we describe a maximization algorithm based on the gradient of the log-likelihood that generates a finite sequence of parameter values along which the likelihood increases. The algorithm can be applied to the estimation of mixed-effects quantile regression, Laplace regression with censored data, and other models based on Laplace likelihood. In a simulation study and in a number of real-data applications, the proposed algorithm has shown notable computational speed.

1. Introduction

An increasing number of recently proposed methods for the estimation of conditional quantiles of a continuous outcome variable given covariates make use of the asymmetric Laplace distribution.[Citation1–6] We describe a maximization algorithm for the asymmetric Laplace likelihood and compare its large-sample performance with that of the Frisch–Newton (FN) interior-point method for the estimation of quantile regression. Simple extensions of the presented algorithm have been applied to the estimation of Laplace regression with censored data [Citation7] and mixed-effects quantile regression as an alternative to the Gibbs sampler.[Citation8,Citation9]

Let $y_i$, $i = 1, \ldots, n$, be a sample of continuous variables and $x_i$ be $v$-dimensional vectors of corresponding covariates. We consider the regression model

$$y_i = x_i'\beta_p + u_i, \qquad (1)$$

where the residual $u_i$ follows an unspecified distribution with $P(u_i \le 0) = p$ for a given probability $p \in (0, 1)$. If model (1) is correct, then $x_i'\beta_p$ is the $p$-quantile of the conditional distribution of $y_i$ given $x_i$. The unknown regression coefficient $\beta_p \in \mathbb{R}^v$ depends on $p$ and can be estimated by minimizing a weighted sum of absolute residuals with computationally efficient methods.[Citation10–12]

It is known [Citation13] that this minimization problem can be seen as the maximization of a likelihood function by assuming that the regression residual $u_i$ follows the asymmetric Laplace density function $f(u_i) = p(1-p)\exp\{u_i(I_{u_i \le 0} - p)\}$. Here and throughout, $I_A$ denotes the indicator function of the set $A$.

The log-likelihood function of model (1),

$$l(\beta_p) = n\log\{p(1-p)\} + \sum_{i=1}^{n}(y_i - x_i'\beta_p)\,(I_{y_i \le x_i'\beta_p} - p), \qquad (2)$$

is continuous at all $\beta_p \in \mathbb{R}^v$. Its first derivative, $\partial l(\beta_p)/\partial\beta_p$, is continuous everywhere except at the points $\beta_p$ such that $y_i = x_i'\beta_p$ for at least one observation $i \in \{1, 2, \ldots, n\}$. We define the gradient

$$s(\beta_p) = -\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p). \qquad (3)$$

The function $s(\beta_p)\colon \mathbb{R}^v \to \mathbb{R}^v$ equals the first derivative $\partial l(\beta_p)/\partial\beta_p$ at all points of the parameter space where the latter is defined.
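For concreteness, the log-likelihood (2) and the gradient (3) translate directly into a few lines of R, the language used for the implementation in Section 4. The sketch below is ours and purely illustrative, not part of any package; it assumes the responses are stored in a numeric vector y and the covariates in a design matrix x with one row per observation.

```r
# Log-likelihood (2) of the asymmetric Laplace model at parameter beta
loglik_al <- function(beta, y, x, p) {
  r <- y - x %*% beta                           # residuals y_i - x_i'beta
  length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
}

# Gradient (3): minus the sum of x_i * (I(y_i <= x_i'beta) - p)
grad_al <- function(beta, y, x, p) {
  r <- y - x %*% beta
  -colSums(x * as.vector((r <= 0) - p))
}

# small check on simulated data
set.seed(1)
x <- cbind(1, rnorm(50)); y <- 2 + x[, 2] + rnorm(50)
loglik_al(c(0, 0), y, x, p = 0.5)
grad_al(c(0, 0), y, x, p = 0.5)
```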

2. Gradient-search algorithm

In this section, we describe an algorithm for the unconstrained maximization of the log-likelihood function $l(\beta_p)$ with respect to the parameter $\beta_p \in \mathbb{R}^v$. Briefly, from the current parameter value, the algorithm searches the positive semi-line in the direction of the gradient $s(\beta_p)$ for a new parameter value at which the likelihood is larger. The algorithm stops when the change in the log-likelihood is less than a specified tolerance. Convergence is guaranteed by the continuity and concavity of the log-likelihood, as shown in Section 3.

The algorithm can be summarized by the following steps:

(1) Set $k = 0$ and choose the initial values $\beta_p^0 \in \mathbb{R}^v$ and $\delta_0 > 0$.

(2) If $l(\beta_p^k) > l\{\beta_p^k + \delta_k s(\beta_p^k)\}$,

  (a) then set $\beta_p^{k+1} = \beta_p^k$ and $\delta_{k+1} = a\delta_k$;

  (b) else if $l\{\beta_p^k + \delta_k s(\beta_p^k)\} - l(\beta_p^k) > \epsilon$,

    (i) then set $\beta_p^{k+1} = \beta_p^k + \delta_k s(\beta_p^k)$ and $\delta_{k+1} = b\delta_k$;

    (ii) else return $\beta_p^k + \delta_k s(\beta_p^k)$ and stop.

(3) Set $k = k + 1$ and go to step (2).

The algorithm requires setting the following initial values: the starting value of the parameter, $\beta_p^0 \in \mathbb{R}^v$; the initial step-length, $\delta_0 > 0$; the factor for shortening the step-length, $a \in (0, 1)$; the factor for expanding the step-length, $b \ge 1$; and the tolerance for the change in the log-likelihood, $\epsilon > 0$. In addition, the constant $p \in (0, 1)$ and the data $y_i$ and $x_i$ are required to evaluate the functions $l(\beta_p)$ and $s(\beta_p)$.
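The steps above map almost one-to-one onto R code. The following is a minimal sketch of the GS iteration under our own naming, not the authors' implementation; the default arguments mirror the settings listed in the previous paragraph, and the log-likelihood and gradient are the functions (2) and (3) written out in Section 1.

```r
# Minimal sketch of the gradient-search (GS) algorithm of Section 2 (illustrative only)
gs_laplace <- function(y, x, p, beta0 = NULL, delta0 = sd(y),
                       a = 0.5, b = 1.25, eps = 1e-10, maxit = 1e5) {
  loglik <- function(beta) {                            # log-likelihood (2)
    r <- y - x %*% beta
    length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
  }
  grad <- function(beta) {                              # gradient (3)
    r <- y - x %*% beta
    -colSums(x * as.vector((r <= 0) - p))
  }
  if (is.null(beta0)) beta0 <- qr.coef(qr(x), y)        # least-squares start
  beta <- beta0; delta <- delta0; l_cur <- loglik(beta)
  for (k in seq_len(maxit)) {
    cand  <- beta + delta * grad(beta)                  # candidate along the gradient
    l_new <- loglik(cand)
    if (l_cur > l_new) {
      delta <- a * delta                                # step 2.a: shorten the step
    } else if (l_new - l_cur > eps) {
      beta <- cand; l_cur <- l_new; delta <- b * delta  # step 2.b.i: accept and expand
    } else {
      return(list(coefficients = cand, logLik = l_new, iterations = k))  # step 2.b.ii
    }
  }
  warning("maximum number of iterations reached")
  list(coefficients = beta, logLik = l_cur, iterations = maxit)
}

# example: 0.75-quantile regression on simulated data
set.seed(2)
x <- cbind(1, rnorm(200)); y <- 1 + 2 * x[, 2] + rt(200, df = 3)
gs_laplace(y, x, p = 0.75)$coefficients
```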

3. Convergence of the algorithm

This section discusses the convergence of the gradient search (GS) algorithm defined in Section 2 and states some properties of the log-likelihood function $l(\beta_p)$ defined in Equation (2).

Definition 1

Let $\Omega = \{\beta_p^* \in \mathbb{R}^v : l(\beta_p^*) \ge l(\beta_p), \ \forall\, \beta_p \in \mathbb{R}^v\}$ be the set of maximizers of the log-likelihood function.

The log-likelihood function $l(\beta_p)$ is a sum of continuous, concave functions and is therefore itself continuous and concave. Its continuity and concavity guarantee that a unique maximum exists and that the set of maximizers, $\Omega \subset \mathbb{R}^v$, is compact. The well-known result stated in Lemma 1 is instrumental in establishing the convergence of the GS algorithm in Theorem 1.

Lemma 1

For any sequence of parameters $\{\beta_p^k\}$ such that $l(\beta_p^k) < l(\beta_p^{k+1})$ for all $k \in \mathbb{N}$, $\lim_{k\to\infty} \beta_p^k = \beta_p^* \in \Omega$.

Proof The statement in Lemma 1 follows from the continuity and concavity of the log-likelihood function $l(\beta_p)$ defined in Equation (2).

The following Theorem 1 shows that the sequence of parameter values $\beta_p^{k+1} = \beta_p^k + \delta_k s(\beta_p^k)$ iteratively produced by Step 2.b.i of the GS algorithm satisfies the condition in Lemma 1.

Theorem 1

For any given point $\beta_p \in \mathbb{R}^v$ such that $y_i \ne x_i'\beta_p$ for all $i \in \{1, \ldots, n\}$, there exists a value $\delta_0 > 0$ such that $l(\beta_p) \le l(\beta_p + \delta s(\beta_p))$ for every $0 < \delta < \delta_0$.

Proof The directional derivative of $l(\beta_p)$ in the direction $v \ne 0$ evaluated at $\beta_p$ is

$$\frac{\partial}{\partial\delta}\, l(\beta_p + \delta v)\Big|_{\delta=0} = \frac{\partial}{\partial\delta} \sum_{i=1}^{n}(y_i - x_i'\beta_p - \delta x_i'v)\,(I_{y_i \le x_i'\beta_p + \delta x_i'v} - p)\Big|_{\delta=0} = -\sum_{i=1}^{n}\left[ x_i'v\,(I_{y_i \le x_i'\beta_p} - p)\, I_{y_i \ne x_i'\beta_p} + x_i'v\,(I_{x_i'v \ge 0} - p)\, I_{y_i = x_i'\beta_p}\right].$$

Assuming that $y_i \ne x_i'\beta_p$ for all $i \in \{1, \ldots, n\}$, the directional derivative in the direction of the gradient $v = s(\beta_p)$ evaluated at $\beta_p$,

$$\left\{-\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p)\right\}'\left\{-\sum_{i=1}^{n} x_i\,(I_{y_i \le x_i'\beta_p} - p)\right\} = \|s(\beta_p)\|^2,$$

is non-negative and greater than that in any other direction $v \in \mathbb{R}^v$ such that $\|v\| = \|s(\beta_p)\|$.

If $\|s(\beta_p^k)\|^2 > 0$, then the directional derivative at $\beta_p^k$ in the direction of the gradient $s(\beta_p^k)$ is positive, and there exists a $\delta_0 > 0$ such that the log-likelihood function increases in a $\delta_0$-neighbourhood of $\beta_p^k$ in the direction of $s(\beta_p^k)$; that is,

$$\exists\, \delta_0 > 0 : \quad l(\beta_p^k) < l(\beta_p^k + \delta s(\beta_p^k)) \quad \forall\, \delta \in (0, \delta_0).$$

Step 2.a of the GS algorithm shortens the step-length $\delta_k$ by a factor $a \in (0, 1)$ until $\delta_k < \delta_0$. If $\|s(\beta_p^k)\|^2 = 0$, then $s(\beta_p^k) = 0$. The first-order condition $s(\beta_p^k) = 0$, along with the continuity and concavity of $l(\beta_p)$ over the entire space $\mathbb{R}^v$, implies that $\beta_p^k$ is a maximizer of the log-likelihood, i.e. $\beta_p^k \in \Omega$.

Theorem 1 assumes that $y_i \ne x_i'\beta_p^k$ for all observations $i \in \{1, \ldots, n\}$. With the GS algorithm defined in Section 2, the event that $y_i$ is numerically equal to $x_i'\beta_p^k$ for some observation $i$ occurs with probability zero. If it did occur, the GS algorithm would move away from that point in the direction $s(\beta_p^k)$ defined in Equation (3).

While meeting the condition of Lemma 1 guarantees that a parameter sequence $\{\beta_p^k\}$ converges to a maximizer, it does not guarantee that it converges efficiently. Indeed, in an infinite number of iterations, any algorithm that generates a sequence of increasing likelihood values would converge to the maximum. For example, a trivial compass search along the orthogonal directions of $\mathbb{R}^v$ would be a convergent, albeit very inefficient, alternative to the GS. The efficiency of the GS algorithm ensues from its moving from the current parameter value in the direction of maximum local increase.
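As a quick numerical illustration of Theorem 1, one can verify on simulated data that a sufficiently small step from an arbitrary parameter value along the gradient does not decrease the log-likelihood. The snippet below is our own check, reusing the loglik_al and grad_al sketches from Section 1 (reproduced so that it runs on its own); the point beta and the step-length delta are arbitrary choices.

```r
set.seed(3)
x <- cbind(1, rnorm(100)); y <- 1 + x[, 2] + rnorm(100); p <- 0.75

loglik_al <- function(beta) {                 # log-likelihood (2)
  r <- y - x %*% beta
  length(y) * log(p * (1 - p)) + sum(r * ((r <= 0) - p))
}
grad_al <- function(beta) {                   # gradient (3)
  r <- y - x %*% beta
  -colSums(x * as.vector((r <= 0) - p))
}

beta  <- c(0.3, -0.2)        # arbitrary point with y_i != x_i'beta for all i
delta <- 1e-6                # a small step-length
loglik_al(beta + delta * grad_al(beta)) >= loglik_al(beta)   # should print TRUE
```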

4. A simulation study

We performed a simulation study and compared the proposed GS method with the FN interior-point method in large samples, where the latter has been shown to outperform the Barrodale–Roberts method.[Citation11]

With the statistical program R, we generated 7200 samples, which consisted of 100 samples under each of 72 scenarios obtained from the combination of 3 quantiles, 2 sample sizes, 4 regression models, and 3 distributions for the regression residual. The three quantiles were $p \in \{0.50, 0.75, 0.90\}$. The two sample sizes were $n \in \{10^4, 10^5\}$. The four regression models were (1) $y_i = i$, with $i \in \{1, \ldots, n\}$; (2) $y_i = 1 + x_i^{(1)} + x_i^{(2)} + u_i$; (3) $y_i = 1 + x_i^{(3)} + \cdots + x_i^{(10)} + u_i$; and (4) $y_i = 30 + 5000\, x_i^{(11)} + 2\, x_i^{(12)} + 0.5\, x_i^{(13)} + (0.5 + x_i^{(11)} - 0.5\, x_i^{(12)} + 0.5\, x_i^{(13)})\, u_i$; where $x_i^{(j)} \sim \mathrm{Normal}(0,1)$ for $j \in \{1, \ldots, 10\}$, $x_i^{(11)} \sim \mathrm{Uniform}(0,1)$, $x_i^{(12)} \sim \mathrm{Bernoulli}(0.5)$, and $x_i^{(13)} \sim T(3)$. The three distributions of the regression residual were $u_i \sim \mathrm{Normal}(0,1)$, $u_i \sim T(3)$, and $u_i \sim \mathrm{Lognormal}(0,1)$.
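As an illustration of the data-generating process, the following sketch (our own code, written to match the description above) draws one sample of size $n = 10^4$ under regression model (4) with $T(3)$ residuals.

```r
set.seed(4)
n   <- 1e4
x11 <- runif(n)                     # x^(11) ~ Uniform(0,1)
x12 <- rbinom(n, 1, 0.5)            # x^(12) ~ Bernoulli(0.5)
x13 <- rt(n, df = 3)                # x^(13) ~ T(3)
u   <- rt(n, df = 3)                # residual distribution T(3)

# regression model (4), heteroscedastic in x^(11), x^(12), x^(13)
y <- 30 + 5000 * x11 + 2 * x12 + 0.5 * x13 +
     (0.5 + x11 - 0.5 * x12 + 0.5 * x13) * u
x <- cbind(1, x11, x12, x13)        # design matrix with intercept
```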

The GS algorithm was implemented as an R function with the following starting values: $\beta_p^0$ equal to the least-squares estimate, $\delta_0$ equal to the sample standard deviation of $y$, $a = 0.5$, $b = 1.25$, and $\epsilon = 10^{-10}$. For the FN method, we used the R function rq.fit.fnb, which called compiled Fortran code (quantreg package, version 5.05).
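For reference, the FN comparator is also available through quantreg's documented interface rq() with method = "fn", which dispatches to rq.fit.fnb. The timing sketch below is ours and only indicates the kind of comparison summarized in Table 1; the sample is a small stand-in generated under regression model (2).

```r
library(quantreg)                   # provides rq(); method = "fn" uses rq.fit.fnb()
set.seed(5)
n  <- 1e4; p <- 0.75
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x2 + x3 + rnorm(n)        # regression model (2) with Normal(0,1) residuals

# Frisch-Newton interior-point fit, timed
system.time(fit_fn <- rq(y ~ x2 + x3, tau = p, method = "fn"))
coef(fit_fn)
```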

In each generated sample, we estimated the regression coefficients with the GS and the FN algorithm. We compared the methods on the following measures:

(1) ‘FN-to-GS time ratio’: the ratio between the mean CPU time of FN and the mean CPU time of GS. In all models, the timing of GS included the least-squares estimation of the initial values $\beta_p^0$. For models (2)–(4), it also included the standardization of all covariates and the subsequent rescaling of the parameter estimates.

(2) ‘Condition on residuals met’: the proportion of samples in which the number of negative ($R^-$) and the number of positive ($R^+$) residuals satisfied the condition $R^- \le np \le n - R^+$.[Citation14, Theorem 3.4] (See the sketch after this list.)

(3) ‘Max $\log_{10}|l(\beta_p^{GS}) - l(\beta_p^{FN})|$’: the maximum of the logarithm to base 10 of the absolute difference between the maximum log-likelihood values achieved by the two methods in each sample.

(4) ‘Max $|\beta_p^{GS}/\beta_p^{FN} - 1|$’: the maximum of the absolute relative difference between the parameter estimates in each sample.

(5) ‘Observed bias’: the difference between the estimated and the true parameter values, averaged over the simulated samples.
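As referenced in item (2), the residual-sign condition, together with measures (3) and (4), is straightforward to compute for any pair of fitted coefficient vectors. The sketch below is our own illustration; for lack of a packaged GS implementation, the FN and Barrodale–Roberts fits from quantreg stand in for the two estimates being compared.

```r
library(quantreg)
set.seed(6)
n  <- 1000; p <- 0.9
x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x2 + x3 + rlnorm(n)
X  <- cbind(1, x2, x3)

fit_a <- coef(rq(y ~ x2 + x3, tau = p, method = "fn"))   # stand-in for the GS fit
fit_b <- coef(rq(y ~ x2 + x3, tau = p, method = "br"))   # stand-in for the FN fit

# (2) residual-sign condition R- <= np <= n - R+ (Koenker and Bassett, Theorem 3.4)
r <- y - X %*% fit_a
c(sum(r < 0) <= n * p, n * p <= n - sum(r > 0))

# (3) log10 absolute difference between the achieved log-likelihoods (2)
loglik_al <- function(beta) {
  u <- y - X %*% beta
  n * log(p * (1 - p)) + sum(u * ((u <= 0) - p))
}
log10(abs(loglik_al(fit_a) - loglik_al(fit_b)))

# (4) maximum absolute relative difference between the coefficient estimates
max(abs(fit_a / fit_b - 1))
```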

The results of the simulation are given in Table 1. The observed bias was virtually zero at all quantiles in all the considered scenarios and is therefore not reported. The entries in Table 1 are averages over 300 samples, comprising 100 replicates for each of the 3 distributions of the regression residual. No notable differences were observed across distributions. The performance of the two methods was comparable with respect to the maximum likelihood achieved and the proportion of samples in which the condition on residuals was met. The GS algorithm was faster than FN, although its advantage shrank as the number of covariates increased or the sample size decreased. Optimizing the R programming code and compiling it into machine language might further improve its computational speed.

Table 1.  Performance of the GS and FN algorithm for quantile regression estimation.

Following an insightful comment from an anonymous reviewer, we checked the sensitivity of the algorithm with respect to its initial parameter values $\beta_p^0$. We re-ran the entire simulation study twice: the first time we set all elements of $\beta_p^0$ equal to 0; the second time we set them equal to 100. While GS took longer to converge with these unreasonable starting values than with the least-squares estimates, it was still generally faster than FN. The final estimates of the parameter $\beta_p$ and the achieved maximum likelihood seemed largely unaffected by the choice of the initial values and remained comparable to those reported in Table 1.

5. Remarks

The GS algorithm presented in this note is similar to the Newton–Raphson algorithm, in that it moves from the current parameter value in the direction of the gradient, and to direct search methods, in that it iteratively adjusts the step-length multiplier. Unlike methods based on vertex search in a simplex, such as Barrodale and Roberts' method,[Citation10] the algorithm proposed in this paper searches the parameter space without constraints. Like Newton–Raphson and other optimization algorithms, GS only approximates the optimal solution within a given tolerance.

Perhaps one of the most useful features of the GS algorithm is its wide applicability. GS may be regarded as a general algorithm for the unconstrained optimization of any objective function that is continuous, concave, and first-order differentiable everywhere except on a set of points with measure zero. Unlike Newton–Raphson, GS does not require a second derivative, which is generally unavailable for Laplace-based likelihoods. Thanks to its flexibility, the algorithm can be applied to the estimation of Laplace regression with censored data, mixed-effects quantile regression, and other models based on the Laplace likelihood, or to the optimization of objective functions with similar features.

We recommend setting the initial parameter, $\beta_p^0$, to reasonable values. While GS appears to be rather insensitive to the choice of the initial parameter value, reasonable starting values generally improve both convergence speed and accuracy. Least-squares estimates have been satisfactory in our experience. The choice of the convergence criteria and tolerance levels may substantially impact convergence. Stricter criteria may improve maximization accuracy but increase computing time. Generally, we recommend setting $\epsilon$ as small, $a$ as close to 1, and $b$ as large as computational time and the user's patience allow. The values used in the simulation study ($\epsilon = 10^{-10}$, $a = 0.5$, and $b = 1.25$) gave a performance comparable with that of the FN method. These values have also proved adequate in a number of recent applications of the GS algorithm to the estimation of the coefficients of Laplace regression with censored data.[Citation15–17, among others]

The steps of the GS algorithm listed in Section 2 can be modified in several ways. The following are some examples. The convergence criterion in step 2.b can include checking the change from the previous iteration in the parameter estimates or in the signs of the residuals. In step 2.b.i, the step-length can be reset to its initial value, $\delta_{k+1} = \delta_0$. The step-length multipliers, $a$ and $b$, may depend on the sample size. When searching for a new parameter value along the direction of the gradient, the algorithm could find the root of the directional derivative by bisection or by considering the signs of the residuals. This would reduce the number of iterations but increase the computational burden at each iteration.

In the simulation reported in Table 1 and in a number of real-data applications, the GS algorithm has shown remarkable computational speed, which may be a desirable feature when analysing large samples. In small samples (e.g. $n = 100$), the GS, FN, and Barrodale–Roberts algorithms converge so quickly as to make any slight differences in computational time inconsequential from most practical standpoints.

REFERENCES

  • Bottai M, Zhang J. Laplace regression with censored data. Biom J. 2010;52(4):487–503. doi: 10.1002/bimj.200900310
  • Farcomeni A. Quantile regression for longitudinal data based on latent Markov subject-specific parameters. Stat Comput. 2012;22:141–152. doi: 10.1007/s11222-010-9213-0
  • Geraci M, Bottai M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics. 2007;8:140–154. doi: 10.1093/biostatistics/kxj039
  • Lee D, Neocleous T. Bayesian quantile regression for count data with application to environmental epidemiology. J R Stat Soc: Ser C Appl Statist. 2010;59:905–920. doi: 10.1111/j.1467-9876.2010.00725.x
  • Liu Y, Bottai M. Mixed-effects models for conditional quantiles with longitudinal data. Int J Biostat. 2009;5. Article 28.
  • Yuan Y, Yin G. Bayesian quantile regression for longitudinal studies with nonignorable missing data. Biometrics. 2010;66:105–114. doi: 10.1111/j.1541-0420.2009.01269.x
  • Bottai M, Orsini N. A command for Laplace regression. Stata J. 2013;13(2):302–314.
  • Geraci M, Bottai M. Linear quantile mixed models. Stat Comput. 2014;24:461–479. doi: 10.1007/s11222-013-9381-9
  • Geraci M. Linear quantile mixed models: the lqmm package for Laplace quantile regression. J Statist Soft. 2014 (in press).
  • Koenker R, d'Orey V. Computing regression quantiles. Appl Stat. 1987;36:383–393.
  • Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat Sci. 1997;12:279–296. doi: 10.1214/ss/1030037960
  • Hunter D, Lange K. Quantile regression via an MM algorithm. J Comput Graph Statist. 2000;9:60–77.
  • Koenker R, Machado JAF. Goodness of fit and related inference processes for quantile regression. J Amer Statist Assoc. 1999;94:1296–1310. doi: 10.1080/01621459.1999.10473882
  • Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46:33–50. doi: 10.2307/1913643
  • Orsini N, Wolk A, Bottai M. Evaluating percentiles of survival. Epidemiology. 2012;23:770–771. doi: 10.1097/EDE.0b013e3182625eff
  • Bellavia A, Akerstedt T, Bottai M, Wolk A, Orsini N. Sleep duration and survival percentiles across categories of physical activity. Amer J Epidemiol. 2014;179(4):484–491. doi: 10.1093/aje/kwt280
  • Johannessen A, Skorge TD, Bottai M, Grydeland TB, Nilsen RM, Coxson H, Dirksen A, Omenaas E, Gulsvik A, Bakke P. Mortality by level of emphysema and airway wall thickness. Amer J Respir Crit Care Med. 2013;187(6): 602–608. doi: 10.1164/rccm.201209-1722OC