Abstract
The asymmetric Laplace likelihood naturally arises in the estimation of conditional quantiles of a response variable given covariates. The estimation of its parameters entails unconstrained maximization of a concave and non-differentiable function over the real space. In this note, we describe a maximization algorithm based on the gradient of the log-likelihood that generates a finite sequence of parameter values along which the likelihood increases. The algorithm can be applied to the estimation of mixed-effects quantile regression, Laplace regression with censored data, and other models based on Laplace likelihood. In a simulation study and in a number of real-data applications, the proposed algorithm has shown notable computational speed.
1. Introduction
An increasing number of recently proposed methods for the estimation of conditional quantiles of a continuous outcome variable given covariates make use of the asymmetric Laplace distribution.[Citation1–6] We describe a maximization algorithm for the asymmetric Laplace likelihood and compare its large-sample performance with that of the Frisch–Newton (FN) interior-point method for the estimation of quantile regression. Simple extensions of the presented algorithm have been applied to the estimation of Laplace regression with censored data [Citation7] and mixed-effects quantile regression as an alternative to the Gibbs sampler.[Citation8,Citation9]
Let yi, i=1, …, n, be a sample of continuous outcomes and xi be k-dimensional vectors of corresponding covariates. We consider the regression model
yi = xi′βp + ui,    (1)
where βp is a k-dimensional parameter and the p-th quantile of the residual ui given xi is zero. The parameter βp can be estimated by minimizing the objective function ∑i (yi − xi′βp)(p − I(−∞,0)(yi − xi′βp)).[Citation14] It is known [Citation13] that this minimization problem can be seen as the maximization of a likelihood function by assuming that the regression residual ui follows the asymmetric Laplace density function f(u) = p(1 − p) exp{−u(p − I(−∞,0)(u))}. Here and throughout, IA denotes the indicator function of the set A.
The log-likelihood function of model (1) is
l(βp) = n log{p(1 − p)} − ∑i (yi − xi′βp)(p − I(−∞,0)(yi − xi′βp)),    (2)
and its gradient, at any point where yi ≠ xi′βp for all i, is
s(βp) = ∑i xi(p − I(−∞,0)(yi − xi′βp)).    (3)
2. Gradient-search algorithm
In this section, we describe an algorithm for the unconstrained maximization of the log-likelihood function, l(βp), with respect to the parameter βp ∈ ℝk. Briefly, from the current parameter value, the algorithm searches the positive half-line in the direction of the gradient s(βp) for a new parameter value at which the likelihood is larger. The algorithm stops when the change in the log-likelihood is less than a specified tolerance. Convergence is guaranteed by the continuity and concavity of the log-likelihood, as shown in Section 3.
The algorithm can be summarized by the following steps:
(1) Set k=0 and the initial values βp(0) and δ0>0
(2) If l(βp(k) + δk s(βp(k))) ≤ l(βp(k))
(a) then set βp(k+1) = βp(k) and δk+1 = aδk
(b) else if l(βp(k) + δk s(βp(k))) − l(βp(k)) > ε
(i) then set βp(k+1) = βp(k) + δk s(βp(k)) and δk+1 = bδk;
(ii) else return βp(k) + δk s(βp(k)); stop
(3) Set k=k+1; go to step 2.
The algorithm requires setting the following initial values: the starting value of the parameter, βp(0); the initial step-length, δ0>0; the factor for shortening the step-length, a∈(0, 1); the factor for expanding the step-length, b≥1; and the tolerance for the change in the log-likelihood, ε>0. In addition, the constant p∈(0, 1) and the data yi and xi are required to evaluate the functions l(βp) and s(βp).
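The steps above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R implementation; the function names are ours, and the starting values mirror those used later in the simulation study.

```python
import numpy as np

def loglik(beta, y, X, p):
    # Asymmetric Laplace log-likelihood (2), up to the constant n*log(p*(1-p))
    u = y - X @ beta
    return -np.sum(u * (p - (u < 0)))

def score(beta, y, X, p):
    # Gradient (3), valid wherever no residual is exactly zero
    u = y - X @ beta
    return X.T @ (p - (u < 0))

def gs_fit(y, X, p=0.5, a=0.5, b=1.25, eps=1e-8, max_iter=10_000):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares starting value
    delta = np.std(y)                             # initial step-length
    ll = loglik(beta, y, X, p)
    for _ in range(max_iter):
        cand = beta + delta * score(beta, y, X, p)
        ll_cand = loglik(cand, y, X, p)
        if ll_cand <= ll:
            delta *= a                 # step 2a: shorten the step-length
        elif ll_cand - ll > eps:
            beta, ll = cand, ll_cand   # step 2b.i: accept the new value
            delta *= b                 #           and expand the step-length
        else:
            return cand                # step 2b.ii: change below tolerance; stop
    return beta
```

Note that the loop mirrors step 3 of the algorithm: the step-length shrinks until a likelihood increase is found and expands after each accepted move.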
3. Convergence of the algorithm
This section discusses the convergence of the gradient search (GS) algorithm defined in Section 2 and states some properties of the log-likelihood function l(βp) defined in Equation (2).
Definition 1
Let Bp denote the set of maximizers of the log-likelihood function l(βp).
The log-likelihood function l(βp) is a sum of continuous, concave functions and is therefore itself continuous and concave. Its continuity and concavity guarantee that a maximum exists and that the set of maximizers is compact. The well-known result stated in Lemma 1 is instrumental in establishing the convergence of the GS algorithm in Theorem 1.
Lemma 1
For any sequence of parameters βp(k), k = 0, 1, 2, …, such that the sequence of log-likelihood values l(βp(k)) is non-decreasing and converges to the maximum of l, every limit point of the sequence βp(k) is a maximizer of the log-likelihood.
Proof The statement in Lemma 1 follows by the continuity and concavity of the log-likelihood function l(βp) defined in Equation (2).
The following Theorem 1 shows that the sequence of parameter values iteratively produced by Step 2.b.i of the GS algorithm satisfies the condition in Lemma 1.
Theorem 1
For any given point βp such that yi ≠ xi′βp for all i = 1, …, n and s(βp) ≠ 0, there exists a value δ0>0 such that l(βp + δ s(βp)) > l(βp) for every δ ∈ (0, δ0).
Proof The directional derivative of l(βp) in the direction v ≠ 0 evaluated at βp is
d(βp; v) = lim t↓0 [l(βp + tv) − l(βp)]/t.
Assuming that yi ≠ xi′βp for all i = 1, …, n, the log-likelihood is differentiable at βp and d(βp; v) = s(βp)′v. By the Cauchy–Schwarz inequality, the directional derivative in the direction of the gradient, v = s(βp), evaluated at βp is non-negative and greater than that in any other direction v such that ‖v‖ = ‖s(βp)‖.
If s(βp) ≠ 0, then the directional derivative at βp in the direction of the gradient s(βp) is positive, d(βp; s(βp)) = ‖s(βp)‖² > 0, and there exists a δ0>0 such that the log-likelihood function increases in a δ0-neighbourhood of βp in the direction of s(βp); that is, l(βp + δ s(βp)) > l(βp) for every δ ∈ (0, δ0). Step 2a of the GS algorithm shortens the step-length δk by a factor a∈(0, 1) until l(βp + δk s(βp)) > l(βp). If s(βp) = 0, then d(βp; v) ≤ 0 for every direction v. The first-order condition s(βp) = 0, along with the continuity and concavity of l(βp) over the entire space ℝk, implies that βp is a maximizer of the log-likelihood.
Theorem 1 assumes that yi ≠ xi′βp for all observations i = 1, …, n. With the GS algorithm defined in Section 2, the event that yi is numerically equal to xi′βp(k) for some observation i occurs with probability zero. If it did occur, the GS algorithm would move off of that point in the direction s(βp) defined in Equation (3).
While meeting the condition of Lemma 1 guarantees that a parameter sequence converges to a maximizer, it does not guarantee that it converges efficiently. Indeed, in an infinite number of iterations, any algorithm that generates a sequence of increasing likelihood values would converge to the maximum. For example, a trivial compass search along the orthogonal directions of ℝk would be a convergent, albeit very inefficient, alternative to the GS. The efficiency of the GS algorithm ensues from its moving from the current parameter value in the direction of maximum local increase.
4. A simulation study
We performed a simulation study comparing the proposed GS method with the FN interior-point method in large samples, where the latter has been shown to outperform the Barrodale–Roberts method.[Citation11]
With the statistical program R, we generated 7200 samples, consisting of 100 samples under each of 72 scenarios obtained from the combination of 3 quantiles, 2 sample sizes, 4 regression models, and 3 distributions for the regression residual. The four regression models, labelled (1)–(4), contained an increasing number of covariates, drawn from the Normal(0,1), Uniform(0,1), Bernoulli(0.5), and T(3) distributions. The three distributions of the regression residual were ui ∼ Normal(0,1), ui ∼ T(3), and ui ∼ Lognormal(0,1).
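For concreteness, a data-generating mechanism of the kind used in the simulation can be sketched as follows. The covariate and residual distributions match those listed above, but the intercept and slope values here are hypothetical illustrations, not the exact model specifications of the study.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 1000  # illustrative sample size

# Covariates drawn from the distributions used in the simulation study
x1 = rng.normal(0.0, 1.0, n)      # Normal(0,1)
x2 = rng.uniform(0.0, 1.0, n)     # Uniform(0,1)
x3 = rng.binomial(1, 0.5, n)      # Bernoulli(0.5)
x4 = rng.standard_t(3, n)         # T(3)

# Regression residual: one of Normal(0,1), T(3), or Lognormal(0,1)
u = rng.standard_t(3, n)

# Hypothetical linear model with all four covariates (coefficients are ours)
y = 1.0 + 0.5 * x1 + 0.5 * x2 + 0.5 * x3 + 0.5 * x4 + u
```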
The GS algorithm was implemented as an R function with the following starting values: βp(0) equal to the least-squares estimate, δ0 equal to the sample standard deviation of y, a=0.5, b=1.25, and a small convergence tolerance ε. For the FN method, we used the R function rq.fit.fnb, which calls compiled Fortran code (quantreg package, version 5.05).
In each generated sample, we estimated the regression coefficients with the GS and the FN algorithm. We compared the methods on the following measures:
(1) ‘FN-to-GS time ratio’: the ratio between the mean CPU time of FN and the mean CPU time of GS. In all models, the timing of GS included least-squares estimation for the initial values βp(0). For models (2)–(4), it also included the standardization of all covariates and the subsequent rescaling of the parameter estimates.
(2) ‘Condition on residuals met’: the proportion of samples in which the number of negative (R−) and the number of positive (R+) residuals satisfied the condition R− ≤ np ≤ n − R+.[Citation14, Theorem 3.4]
(3) ‘Max log10|Δl|’: the maximum, over samples, of the logarithm to base 10 of the absolute difference between the maximum log-likelihood values achieved by the two methods.
(4) ‘Max |Δβ/β|’: the maximum, over samples, of the absolute relative difference between the parameter estimates of the two methods.
(5) ‘Observed bias’: the difference between the estimated and the true parameter values averaged over the simulated samples.
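Measure (2) can be checked directly from the fitted residuals. The helper below is a sketch (the function name is ours), stating the residual-count condition R− ≤ np ≤ n − R+ as our reading of [Citation14, Theorem 3.4]:

```python
import numpy as np

def residual_condition_met(y, X, beta, p):
    # Counts of strictly negative and strictly positive residuals.
    # At a quantile-regression solution the counts should satisfy
    # R- <= n*p <= n - R+ (our reading of Koenker and Bassett, Theorem 3.4).
    u = y - X @ beta
    r_neg = int(np.sum(u < 0))
    r_pos = int(np.sum(u > 0))
    n = len(y)
    return r_neg <= n * p <= n - r_pos
```

For example, with y = (1, 2, 3), an intercept-only design, and p = 0.5, the sample median beta = 2 satisfies the condition, while beta = 5 does not.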
The results of the simulation are given in Table 1. The observed bias was virtually zero at all quantiles in all the considered scenarios and is not reported. The entries in Table 1 are averages over 300 samples, comprising 100 replicates for each of the 3 distributions of the regression residual. No notable differences were observed across distributions. The performance of the two methods was comparable with respect to the maximum likelihood achieved and the proportion of samples in which the condition on the residuals was met. The GS algorithm was faster than FN, although its advantage shrank as the number of covariates increased or the sample size decreased. Optimizing the R programming code and compiling it into machine language might further improve its computational speed.
Table 1. Performance of the GS and FN algorithm for quantile regression estimation.
Following an insightful comment from an anonymous reviewer, we checked the sensitivity of the algorithm with respect to its initial parameter values βp(0). We re-ran the entire simulation study twice: the first time we set all elements of βp(0) equal to 0; the second time we set them equal to 100. While GS took longer to converge with these unreasonable starting values than with the least-squares estimates, it was still generally faster than FN. The final estimates of the parameter βp and the achieved maximum likelihood were largely unaffected by the choice of the initial values and remained comparable to those reported in Table 1.
5. Remarks
The GS algorithm presented in this note is similar to the Newton–Raphson algorithm, in that it moves from the current parameter value in the direction of the gradient, and to direct-search methods, in that it iteratively adjusts the step-length multiplier. Unlike methods based on a vertex search in a simplex, such as Barrodale and Roberts' method,[Citation10] the algorithm proposed in this paper searches the parameter space without constraints. Like Newton–Raphson and other optimization algorithms, GS only approximates the optimal solution within a given tolerance.
Perhaps one of the most useful features of the GS algorithm is its wide applicability. GS may be regarded as a general algorithm for the unconstrained optimization of any objective function that is continuous, concave, and first-order differentiable everywhere except on a set of points with measure zero. Unlike Newton–Raphson, GS does not require second derivatives, which are generally unavailable in Laplace-based likelihoods. Thanks to its flexibility, the algorithm can be applied to the estimation of Laplace regression with censored data, mixed-effects quantile regression, and other models based on the Laplace likelihood, or to the optimization of objective functions with similar features.
We recommend setting the initial parameter, βp(0), to reasonable values. While GS appears to be rather insensitive to the choice of the initial parameter value, reasonable starting values generally improve both convergence speed and accuracy. Least-squares estimates have been satisfactory in our experience. The choice of the convergence criteria and tolerance levels may substantially affect convergence. Stricter criteria may improve maximization accuracy but increase computing time. Generally, we recommend setting ε as small, a as close to 1, and b as large, as computational time and the user's patience allow. The values used in the simulation study (a=0.5 and b=1.25) gave performance comparable with that of the FN method. These values have also proved adequate in a number of recent applications of the GS algorithm to the estimation of the coefficients of Laplace regression with censored data.[Citation15–17, among others]
The steps of the GS algorithm listed in Section 2 can be modified in several ways. The following are some examples. The convergence criteria in step 2.b can include checking the change from the previous iteration in the parameter estimates or in the signs of the residuals. In step 2.b.i, the step-length can be reset to its initial value, δ0. The step-length multipliers, a and b, may depend on the sample size. When searching for a new parameter value along the direction of the gradient, the algorithm could find the root of the directional derivative by bisection or by considering the signs of the residuals. This would reduce the number of iterations but increase the computational burden of each iteration.
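As an example of the last modification, the root of the directional derivative along the gradient direction could be found by bisection. The sketch below (function names are ours) exploits the fact that, along any fixed direction, the directional derivative of the concave log-likelihood is a non-increasing step function of the step-length t:

```python
import numpy as np

def dir_deriv(t, u, Xv, p):
    # Derivative of the log-likelihood along direction v at beta + t*v,
    # where u = y - X @ beta and Xv = X @ v; non-increasing in t by concavity.
    return np.sum(Xv * (p - (u - t * Xv < 0)))

def bisect_step(u, Xv, p, t_hi=1.0, tol=1e-10, max_expand=60):
    # Expand the bracket until the derivative changes sign, then bisect.
    for _ in range(max_expand):
        if dir_deriv(t_hi, u, Xv, p) <= 0:
            break
        t_hi *= 2.0
    t_lo = 0.0
    while t_hi - t_lo > tol:
        t_mid = 0.5 * (t_lo + t_hi)
        if dir_deriv(t_mid, u, Xv, p) > 0:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return 0.5 * (t_lo + t_hi)
```

For a median regression with an intercept only, the returned step moves the current estimate exactly to the sample median, at the cost of several derivative evaluations per iteration.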
In the simulation reported in Table 1 and in a number of real-data applications, the GS algorithm has shown remarkable computational speed, which may be a desirable feature when analysing large samples. In small samples (e.g. n=100), the GS, FN, and Barrodale–Roberts algorithms all converge so quickly as to make any slight differences in computational time inconsequential from most practical standpoints.
REFERENCES
- Bottai M, Zhang J. Laplace regression with censored data. Biom J. 2010;52(4):487–503. doi: 10.1002/bimj.200900310
- Farcomeni A. Quantile regression for longitudinal data based on latent Markov subject-specific parameters. Stat Comput. 2012;22:141–152. doi: 10.1007/s11222-010-9213-0
- Geraci M, Bottai M. Quantile regression for longitudinal data using the asymmetric Laplace distribution. Biostatistics. 2007;8:140–154. doi: 10.1093/biostatistics/kxj039
- Lee D, Neocleous T. Bayesian quantile regression for count data with application to environmental epidemiology. J R Stat Soc: Ser C Appl Statist. 2010;59:905–920. doi: 10.1111/j.1467-9876.2010.00725.x
- Liu Y, Bottai M. Mixed-effects models for conditional quantiles with longitudinal data. Int J Biostat. 2009;5. Article 28.
- Yuan Y, Yin G. Bayesian quantile regression for longitudinal studies with nonignorable missing data. Biometrics. 2010;66:105–114. doi: 10.1111/j.1541-0420.2009.01269.x
- Bottai M, Orsini N. A command for Laplace regression. Stata J. 2013;13(2):302–314.
- Geraci M, Bottai M. Linear quantile mixed models. Stat Comput. 2014;24:461–479. doi: 10.1007/s11222-013-9381-9
- Geraci M. Linear quantile mixed models: the lqmm package for Laplace quantile regression. J Statist Soft. 2014 (in press).
- Koenker R, d'Orey V. Algorithm AS 229: computing regression quantiles. J R Stat Soc: Ser C Appl Statist. 1987;36:383–393.
- Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat Sci. 1997;12:279–296. doi: 10.1214/ss/1030037960
- Hunter D, Lange K. Quantile regression via an MM algorithm. J Comput Graph Statist. 2000;9:60–77.
- Koenker R, Machado JAF. Goodness of fit and related inference processes for quantile regression. J Amer Statist Assoc. 1999;94:1296–1310. doi: 10.1080/01621459.1999.10473882
- Koenker R, Bassett G. Regression quantiles. Econometrica. 1978;46:33–50. doi: 10.2307/1913643
- Orsini N, Wolk A, Bottai M. Evaluating percentiles of survival. Epidemiology. 2012;23:770–771. doi: 10.1097/EDE.0b013e3182625eff
- Bellavia A, Akerstedt T, Bottai M, Wolk A, Orsini N. Sleep duration and survival percentiles across categories of physical activity. Amer J Epidemiol. 2014;179(4):484–491. doi: 10.1093/aje/kwt280
- Johannessen A, Skorge TD, Bottai M, Grydeland TB, Nilsen RM, Coxson H, Dirksen A, Omenaas E, Gulsvik A, Bakke P. Mortality by level of emphysema and airway wall thickness. Amer J Respir Crit Care Med. 2013;187(6): 602–608. doi: 10.1164/rccm.201209-1722OC