
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models

Pages 116-134 | Received 08 Apr 2019, Accepted 17 Jul 2019, Published online: 12 Aug 2019

Abstract

For high-dimensional linear regression models, we review and compare several estimators of variances τ2 and σ2 of the random slopes and errors, respectively. These variances relate directly to ridge regression penalty λ and heritability index h2, often used in genetics. Several estimators of these, either based on cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of the high-dimensional covariate matrix such as multi-collinear covariates and data-derived ones. Moreover, we study robustness against model misspecifications such as sparse instead of dense effects and non-Gaussian errors. An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional linear mixed effects model, with REML as an alternative to MML. Second, to the conjugate Bayesian setting, shown to be a good alternative. Third, and most prominently, to generalized linear models for which we derive a computationally efficient MML estimator by re-writing the marginal likelihood as an n-dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of λ as compared to CV. Software is provided to enable reproduction of all results.

1. Introduction

Estimation of hyper-parameters is an essential part of fitting high-dimensional Gaussian random effects regression models, also known as ridge regression. These models are widely applied in genomics and genetics applications, where often the number of variables p is much larger than the number of samples n, i.e. p ≫ n.

We initially focus on the linear model. The goal is to estimate the error variance σ2 and the random effects variance τ2, or functions thereof, in particular the ridge penalty parameter λ = σ2/τ2, or the heritability index h2 = pτ2/(pτ2 + σ2). Here, the ridge penalty is used in classical ridge regression to shrink the regression coefficients towards zero (Hoerl and Kennard 1970), whereas heritability measures the fraction of variation between individuals within a population that is due to their genotypes (Visscher, Hill, and Wray 2008). The estimators of σ2 and τ2 can be used to estimate λ or h2, or for statistical testing (Kang et al. 2008). We review several estimators, based on maximum marginal likelihood (MML), moment equations, (generalized) cross-validation, dimension reduction, and degrees-of-freedom adjustment. Some of these estimators are classical, while others have recently been introduced.

We systematically review and compare the estimators in a broad variety of high-dimensional settings. For estimation of λ in low-dimensional settings, we refer to Muniz and Kibria (2009), Månsson and Shukur (2011), and Kibria and Banik (2016). We address the effect of multi-collinearity and robustness against model misspecifications, such as sparsity and non-Gaussian errors. The comparisons are extended to the linear mixed effects model, with q ≪ n fixed effects added to the model, and to Bayesian linear regression. The linear model part is concluded by a genomics data application to weight gain prediction after kidney transplantation.

The observed good performance of MML estimation in the linear model setting was a stimulus to consider MML for high-dimensional generalized linear models (GLM). MML is more involved here than in the linear model, because of the non-conjugacy of the likelihood and prior. Therefore, approximations are required, such as Laplace approximations. While these have been addressed by others (Heisterkam, van Houwelingen, and Downs 1999; Wood 2011), we derive an estimator which is computationally efficient for p ≫ n settings. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of MML estimation of λ as compared to cross-validation.

Our software enables reproduction of all results. In addition, it allows comparisons for one’s own high-dimensional data matrix by simulating the response conditional on this matrix, as we do for two cancer genomics examples. Computational shortcuts and considerations are discussed throughout the paper, and detailed at the end, including computing times.

1.1. The model

We initially focus on high-dimensional linear regression with random effects. Variables are denoted by j = 1, …, p and samples by i = 1, …, n. Then:

\[ y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \qquad \beta_{p\times 1} \sim N(0,\tau^2 I_p), \qquad \epsilon_{n\times 1} \sim N(0,\sigma^2 I_n). \tag{1} \]

Here, y = (y1, …, yn) is the vector of responses, β = (β1, …, βp)T corresponds to the random effects and ϵ = (ϵ1, …, ϵn)T is a vector of Gaussian errors. Furthermore, X is a fixed n × p matrix, (X1 ⋯ Xn)T, with Xi = (xi1, …, xip)T.

1.2. Estimation methods

We distinguish three categories of estimation methods:

  1. Estimation of functions of (σ2, τ2), in particular: i) λ = σ2/τ2 (Golub, Heath, and Wahba 1979), used in ridge regression to minimize ||y − Xβ||₂² + λ||β||₂²; and ii) heritability h2 = pτ2/(pτ2 + σ2) (Bonnet, Gassiat, and Lévy-Leduc 2015).

  2. Separate estimation of σ2 (Cule, Vineis, and De Iorio 2011; Cule and De Iorio 2012), possibly followed by plug-in estimation of τ2.

  3. Joint estimation of σ2 and τ2.

Below, we discuss several methods for each of these categories. They have several matrices and matrix computations in common, which we therefore introduce first.

1.3. Notation and matrix computations

Throughout the paper, we will use the following notation:

\[ \hat\beta = \hat\beta_\lambda = C_\lambda y = (X^T X + \lambda I_{p\times p})^{-1} X^T y, \quad \text{i.e. the linear ridge estimator;} \qquad H = H_\lambda = X C_\lambda = X (X^T X + \lambda I_{p\times p})^{-1} X^T, \quad \text{i.e. the hat matrix.} \tag{2} \]

Many of the estimators below require calculations on potentially very large matrices. The following two well-known equalities can highly alleviate the computational burden.

First, C = Cλ, and hence also β̂ and H, can be efficiently computed by using the singular value decomposition (SVD). Decompose X = U_{n×n} D_{n×n} (V_{p×n})^T by SVD, and denote Λ_q = λI_q. Then,

\[ C = (X^T X + \Lambda_p)^{-1} X^T = V (D^2 + \Lambda_n)^{-1} D\, U^T. \tag{3} \]

The latter requires inversion of an n × n matrix only. Second, the following efficient trace computation for matrix products applies to tr(H) = tr(XCλ):

\[ \mathrm{tr}(A_{p\times n} B_{n\times p}) = \sum_{i=1}^{n}\sum_{j=1}^{p} [A \circ B^T]_{ji}, \tag{4} \]

i.e. the sum of all elements of the element-wise (Hadamard) product of A and B^T.
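To illustrate (3) and (4), a minimal R sketch is given below. The function names (ridge_via_svd, tr_prod) are ours and purely illustrative; X, y and lambda denote a generic n × p design matrix, response vector and penalty.

    # Ridge estimator via the SVD shortcut (3): only min(n, p) singular values are involved
    ridge_via_svd <- function(X, y, lambda) {
      sv <- svd(X)                                  # X = U D V^T
      d  <- sv$d
      drop(sv$v %*% ((d / (d^2 + lambda)) * crossprod(sv$u, y)))   # beta-hat = C_lambda y
    }

    # Trace shortcut (4): tr(A B) as the sum of the element-wise product of A and t(B),
    # avoiding the explicit matrix product
    tr_prod <- function(A, B) sum(A * t(B))
    # e.g. tr(H_lambda) = tr_prod(X, C) with C = (X^T X + lambda I)^{-1} X^T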

2. Methods

2.1. Estimating functions of σ2 and τ2

2.1.1. Estimating λ by K-fold CV

A benchmark method that is used extensively to estimate λ = σ2/τ2 is cross-validation. Here, we use K-fold CV, as implemented in the popular R-package glmnet (Friedman, Hastie, and Tibshirani 2010). Let f(i) denote the set of samples left out for testing at the same fold as sample i. Then, CV-based estimation of λ pertains to minimizing the cross-validated prediction error:

\[ \lambda_{\mathrm{cv}} = \arg\min_\lambda \Big\{ \sum_{i=1}^{n} \big(y_i - X_i \hat\beta_\lambda^{-f(i)}\big)^2 \Big\}, \tag{5} \]

where β̂λ^{−f(i)} denotes the estimate of β based on training samples {1, …, n} ∖ f(i) and penalty λ. Note that for leave-one-out cross-validation (n-fold CV) the analytical solution of (5) is the PRESS statistic (Allen 1974).
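As an illustration, K-fold CV for ridge can be run with glmnet as sketched below. Note that glmnet parameterizes its penalty on the scale of RSS/(2n), so its lambda.min relates to λ in (5) only up to a factor involving n; the conversion shown is an assumption on our part, not taken from the paper.

    library(glmnet)
    # 10-fold CV for ridge regression (alpha = 0); X, y as in model (1)
    cvfit <- cv.glmnet(X, y, alpha = 0, nfolds = 10,
                       standardize = FALSE, intercept = FALSE)
    lambda_glmnet <- cvfit$lambda.min
    # glmnet minimizes RSS/(2n) + lambda_glmnet * ||beta||_2^2 / 2, so on the scale of (5)
    # the ridge penalty is approximately n * lambda.min (assumed conversion)
    lambda_cv <- nrow(X) * lambda_glmnet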

2.1.2. Estimating λ by generalized cross validation

Generalized Cross Validation (GCV) is a rotation-invariant form of the PRESS statistic. It is more robust than the latter to (near-diagonal) hat matrices Hλ (Golub, Heath, and Wahba 1979). For the linear model, the criterion is (Hastie, Tibshirani, and Friedman 2008):

\[ \mathrm{GCV}(\lambda) = \sum_{i=1}^{n} \left( \frac{y_i - X_i^T \hat\beta_\lambda}{n - \mathrm{tr}(H_\lambda)} \right)^{2}, \tag{6} \]

where the trace of Hλ can be computed efficiently by (4). Then, λgcv = arg minλ GCV(λ).
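A minimal sketch of (6), reusing the SVD so that each λ on a grid costs O(n) operations; the function name gcv_ridge and the grid of λ values are our illustrative choices, not taken from the paper.

    gcv_ridge <- function(X, y, lambdas = 10^seq(-4, 6, length.out = 200)) {
      sv  <- svd(X); d2 <- sv$d^2
      Uty <- crossprod(sv$u, y)                       # U^T y
      gcv <- sapply(lambdas, function(lam) {
        fit <- sv$u %*% ((d2 / (d2 + lam)) * Uty)     # fitted values H_lambda y
        trH <- sum(d2 / (d2 + lam))                   # tr(H_lambda)
        sum(((y - fit) / (nrow(X) - trH))^2)          # criterion (6)
      })
      lambdas[which.min(gcv)]
    }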

2.1.3. Estimating heritability by HiLMM

Heritability is defined by h2 = pτ2/(pτ2 + σ2). A recent method which estimates heritability directly using maximum likelihood is proposed by Bonnet, Gassiat, and Lévy-Leduc (2015). Analogously to Eq. (12), it is based on writing:

\[ y \sim N\big(0,\; h^2 \sigma_*^2 R + (1 - h^2)\,\sigma_*^2 I_n\big), \tag{7} \]

where σ*2 = pτ2 + σ2 and R = XX^T/p. Now, apply an eigen-decomposition to R: R = QLQ^T. Then, heritability is estimated by Bonnet, Gassiat, and Lévy-Leduc (2015) as:

\[ \hat h^2 = \arg\max_{h^2} \left( -\log\Big( \frac{1}{n}\sum_{i=1}^{n} \frac{\tilde y_i^2}{h^2(\ell_i - 1) + 1} \Big) - \frac{1}{n}\sum_{i=1}^{n} \log\big(h^2(\ell_i - 1) + 1\big) \right), \tag{8} \]

with ℓi and ỹi the ith diagonal element of L and the ith element of ỹ = Q^T y, respectively. The authors provide rigorous consistency results for their estimator, as well as theoretical confidence bounds, also for mixed models and sparse settings.
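The estimator (8) can be implemented directly in a few lines of R; the sketch below is our own illustration (not the HiLMM package code) and profiles the likelihood over h2 with optimize.

    h2_direct <- function(X, y) {
      R   <- tcrossprod(X) / ncol(X)                 # R = X X^T / p
      eig <- eigen(R, symmetric = TRUE)
      l   <- eig$values                              # eigenvalues, the diagonal of L
      yt  <- drop(crossprod(eig$vectors, y))         # y-tilde = Q^T y
      obj <- function(h2) {                          # profiled log-likelihood in (8)
        w <- h2 * (l - 1) + 1
        -log(mean(yt^2 / w)) - mean(log(w))
      }
      optimize(obj, interval = c(0, 0.999), maximum = TRUE)$maximum
    }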

2.2. Estimation of σ2

The two methods below rely on an estimate β̂=β̂λ, where λ=σ2/τ2 is estimated by (G)CV. Then σ2 is estimated conditional on β̂. If desired, τ2 may then be estimated by τ̂2=σ̂2/λ̂.

2.2.1. Basic estimate

A basic estimate of σ2, often used in practice, is given by (Hastie and Tibshirani 1990):

\[ \hat\sigma^2 = \frac{(y - X\hat\beta)^T (y - X\hat\beta)}{\nu}, \tag{9} \]

which is the residual mean square error. Here, the residual effective degrees of freedom (Hastie and Tibshirani 1990) equals ν = n − tr(2H − HH^T), with H as in (2). We also considered (9) with ν = n − tr(H), as in Hellton and Hjort (2018), which rendered similar, slightly inferior results.

2.2.2. PCR-based estimate

The estimator for σ2 may also be based on Principal Component Regression (PCR). PCR is based on the eigen-decomposition X^TX = Q̃D²Q̃^T. Denoting Z = XQ̃ and α = Q̃^Tβ, we have y = Zα + ϵ. Then, Z is reduced from p columns to r < min(n, p) principal components, a crucial step (Cule and De Iorio 2012). Using the reduced model, σ2 is estimated by the residual mean square error (Cule and De Iorio 2012):

\[ \hat\sigma_r^2 = \frac{(y - Z_r \hat\alpha_r)^T (y - Z_r \hat\alpha_r)}{n - r}. \tag{10} \]
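A minimal sketch of (10); the function name is ours and the choice of r is left to the user here (the ridge package of Cule and De Iorio provides an automated choice).

    sigma2_pcr <- function(X, y, r) {
      sv <- svd(X)                                       # X = U D V^T; PCs are X V = U D
      Zr <- sv$u[, seq_len(r), drop = FALSE] %*% diag(sv$d[seq_len(r)], nrow = r)
      fit <- lm.fit(x = Zr, y = y)                       # regress y on the first r PCs
      sum(fit$residuals^2) / (length(y) - r)             # residual mean square (10)
    }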

2.3. Joint estimation of σ2 and τ2

2.3.1. MML

An Empirical Bayes estimate of σ2 and τ2 is obtained by maximizing the marginal likelihood (MML), also referred to as model evidence in machine learning (Murphy 2012). This corresponds to:

\[ \arg\max_{\sigma^2,\tau^2} P(y) = \arg\max_{\sigma^2,\tau^2} \int_{\beta} L(y;\beta,\sigma^2)\,\pi(\beta;\tau^2)\,d\beta. \tag{11} \]

Since y = Xβ + ϵ, P(y) is simply derived from the convolution of Gaussian random variables, implying E[y] = E[Xβ] + E[ϵ] = 0 and V[y] = V[Xβ] + V[ϵ] = XX^Tτ2 + σ2In, so

\[ P(y) = N(y;\, \mu = 0,\, \Sigma = XX^T\tau^2 + \sigma^2 I_n). \tag{12} \]

This is easily maximized over σ2 and τ2. Note that, after computing XX^T, (12) requires operations on n × n matrices only.
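A minimal sketch of MML for the linear model: after one eigendecomposition of XX^T, each evaluation of the log of (12) costs O(n). The function name and the starting values are our illustrative choices, not the paper's wrapper script.

    mml_linear <- function(X, y) {
      eig <- eigen(tcrossprod(X), symmetric = TRUE)    # X X^T = Q diag(d) Q^T
      d   <- pmax(eig$values, 0)
      yt  <- drop(crossprod(eig$vectors, y))           # Q^T y
      negloglik <- function(par) {                     # par = (log tau^2, log sigma^2)
        v <- exp(par[1]) * d + exp(par[2])             # eigenvalues of Sigma in (12)
        0.5 * sum(log(v) + yt^2 / v)                   # -log N(y; 0, Sigma), up to a constant
      }
      opt <- optim(c(log(0.01), log(1)), negloglik)
      c(tau2 = exp(opt$par[1]), sigma2 = exp(opt$par[2]))
    }
    # est <- mml_linear(X, y); lambda_hat <- est["sigma2"] / est["tau2"]
    # h2_hat <- ncol(X) * est["tau2"] / (ncol(X) * est["tau2"] + est["sigma2"])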

2.3.2. Method of moments (MoM)

An alternative to MML is to match the empirical second moments of y to their theoretical counterparts. From (12) we observe that the covariances depend on τ2 only. Hence, we obtain an estimator of τ2 by equating the sum of the yiyk, i ≠ k, to the sum of the theoretical covariances, Σik = E[yiyk], with Σ as in (12). Then, with Σ^X = XX^T, an estimator for σ2 is obtained by substituting τ̂2 and equating the sum of the yi2 to the sum of the theoretical variances, Σii = E[yi2]:

\[ \hat\tau^2 = \frac{\sum_{i \neq k}^{n,n} y_i y_k}{\sum_{i \neq k}^{n,n} \Sigma^X_{ik}}, \qquad \hat\sigma^2 = n^{-1}\sum_{i=1}^{n}\big(y_i^2 - \hat\tau^2 \Sigma^X_{ii}\big). \tag{13} \]

These equations also hold for non-Gaussian error terms, which could be an advantage over MML. Moreover, no optimization over σ2 and τ2 is required, so MoM is computationally very attractive.
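The closed-form estimators (13) take only a few lines; a sketch (function name ours, X and y as in model (1)):

    mom_linear <- function(X, y) {
      SX   <- tcrossprod(X)                           # Sigma^X = X X^T
      off  <- row(SX) != col(SX)                      # off-diagonal entries (i != k)
      tau2 <- sum((y %o% y)[off]) / sum(SX[off])      # first equation in (13)
      sigma2 <- mean(y^2 - tau2 * diag(SX))           # second equation in (13)
      c(tau2 = tau2, sigma2 = sigma2)
    }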

3. Comparisons

For the linear random effects model (ridge regression) we study the following settings:

  • β and ϵ generated from model (1), independent X

  • β or ϵ generated from non-Gaussian distributions, independent X

  • β and ϵ from model (1), multicollinear X

  • β and ϵ from model (1), data-based X.

As is common for real data, the variables, i.e. the columns of X, were always standardized for the L2-penalty to have the same effect on all variables. All the results are based on 100 simulated data sets. Cross-validation is applied with 10 folds. Results from n-fold CV (leave-one-out) were generally fairly similar. We focus on the high-dimensional setting with n = 100, p = 1000, with excursions to larger data sets and the dimensions of real data. In all visualizations below the red dotted lines indicate true values. Moreover, values larger than 20 times the true value were truncated and slightly jittered. Discussion of all results is postponed to Sec. 3.4.

3.1. Independent X

In correspondence to model (1) we sample:

\[ y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \qquad \epsilon_i \overset{iid}{\sim} N(0,\sigma^2), \quad x_{ij} \overset{iid}{\sim} N(0,1), \quad \beta_j \overset{iid}{\sim} N(0,\tau^2). \tag{14} \]

Figure 1 displays the results for n = 100, p = 1000, τ2 = 0.01, σ2 = 10 and for a large data setting n = 1000, p = 15,000, τ2 = 0.01, σ2 = 150 (which both imply h2 = 0.5).
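For concreteness, the small-data setting can be simulated as follows (a direct transcription of (14); the seed is arbitrary):

    set.seed(1)
    n <- 100; p <- 1000; tau2 <- 0.01; sigma2 <- 10    # implies lambda = 1000, h2 = 0.5
    X    <- matrix(rnorm(n * p), n, p)
    beta <- rnorm(p, sd = sqrt(tau2))
    y    <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))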

Figure 1. Results for independent X.

3.2. Departures from a normal effect size distribution

We study the robustness of the methods against (sparse) non-Gaussian effect size distributions or error distributions. In sparse settings, many variables do not have an effect. To mimic this, we simulated the β's from a mixture distribution with a 'spike' and a Gaussian 'slab':

\[ \beta_j \overset{iid}{\sim} p_0\,\delta_0 + (1 - p_0)\,N(0, \tau_0^2). \tag{15} \]

Here, we set p0 = 0.9, τ02 = 0.1, which implies τ2 = V(βj) = E(βj2) − E(βj)2 = (1 − p0)τ02 = 0.01, as in the Gaussian βj setting. Moreover, we also considered

βj ∼iid Laplace(μ = 0, b = 0.0707)  and  βj ∼iid Uniform(a = −0.17, b = 0.17),

where again the parameters are chosen such that E(βj) = 0 and τ2 = V(βj) = 0.01. Apart from β, all other quantities are simulated as in (14). Results for the Laplace (= lasso) effect size distribution are displayed for σ2 = 10, τ2 = 0.01, n = 100, p = 1000; results for the spike-and-slab and uniform effect size distributions are shown in Supplementary Figure 3.

Moreover, we considered heavy-tailed errors by sampling

ϵ*i ∼iid t4,  ϵi = (10/2)^{1/2} ϵ*i,

where the scalar is chosen such that σ2 = V(ϵi) = 10, as in the Gaussian error setting. Apart from ϵ, all other quantities are simulated as in (14). Results are displayed in Supplementary Figure 3c.
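The alternative effect-size and error distributions of this subsection can be generated as sketched below (the Laplace draw uses the sign-times-exponential representation; n, p and the object names are illustrative):

    n <- 100; p <- 1000                                                 # as in (14)
    p0 <- 0.9; tau02 <- 0.1
    beta_ss  <- ifelse(runif(p) < p0, 0, rnorm(p, sd = sqrt(tau02)))    # spike-and-slab (15)
    beta_lap <- sample(c(-1, 1), p, replace = TRUE) * rexp(p, rate = 1 / 0.0707)  # Laplace, b = 0.0707
    beta_unf <- runif(p, min = -0.17, max = 0.17)                       # uniform on (-0.17, 0.17)
    eps_t4   <- sqrt(10 / 2) * rt(n, df = 4)                            # t4 errors scaled to variance 10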

3.3. Multicollinear X

3.3.1. Simulated X

Next, the design matrix X is sampled using block-wise correlation. We replace the sampling of X in simulation model (14) by:

\[ X_{n\times p} \sim N(0, \Xi), \tag{16} \]

where Ξ is a unit-variance covariance matrix with blocks of size p* × p* with correlations ρ on the off-diagonal. Figure 2 shows the results for ρ = 0.5, p* = 10, n = 100, p = 1000.
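A sketch of (16) using mvtnorm; the block-diagonal covariance is built with kronecker (n, p as before, names illustrative):

    library(mvtnorm)
    n <- 100; p <- 1000
    rho <- 0.5; pstar <- 10
    block <- matrix(rho, pstar, pstar); diag(block) <- 1     # p* x p* block with correlation rho
    Xi    <- kronecker(diag(p / pstar), block)               # block-diagonal, unit variances
    X     <- rmvnorm(n, sigma = Xi)                          # rows of X ~ N(0, Xi)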

Figure 2. Results for multi-collinear and real X.

3.3.2. Real data X

Finally, we consider the estimation of τ2 and σ2 in a high- and a medium-dimensional setting where X consists of real data, with likely collinear columns. The first data set (TCGA KIRC) concerns gene expression data of p = 18,391 genes for n = 71 kidney tumors. The second data set (TCPA OV) holds expression data of p = 224 proteins for n = 408 ovarian tumor samples. Details on both data sets are supplied in the Supplementary Information. To generate the response y we use model (14) with X given by the data. Here, τ2 = 0.01 and σ2 is set such that h2 = 0.5. Figure 2 shows the results.

3.4. Discussion of results

3.4.1. MML vs MoM, basic and PCR

Figures 1 and 2 and Supplementary Figure 3 clearly show the superior performance of MML compared to MoM: both the bias and the variability are much smaller for MML. Generally, MML also outperforms the Basic and PCR estimators of σ2. The PCR estimator approaches the performance of MML for the KIRC and TCPA data (Figure 2), and the Basic estimator performs reasonably well for the latter (p < n) data set. For other settings, the Basic estimator performs as poorly as MoM. The results highlight the importance of joint estimation of σ2 and τ2 in high-dimensional settings, because of their delicate interplay.

3.4.2. MML vs GCV and CV

For the estimation of λ, MML seems slightly superior to GCV and CV. GCV shows more estimates that deviate towards too small values of λ in the large p settings, whereas CV tends to render somewhat more skewed results, either to the right or to the left. For the spike-and-slab and uniform effect sizes and the t4 errors the right-skewness of the CV results is more pronounced (Supplementary Figure 3), indicating that minimization of the cross-validated prediction error (5) is more vulnerable to non-Gaussian y than MML and GCV. Note that the Laplace setting relates directly to the lasso prior with scale parameter 1/λ1 (Tibshirani 1996). The results indicate that MML with a Gaussian prior could be useful to find the lasso penalty, or serve as a fast initial estimate by simply setting the lasso penalty λ1 = √2/τ̂, which follows from the variance of the lasso prior.

3.4.3. MML vs HiLMM

For the estimation of heritability h2, Figures 1 and 2 and Supplementary Figure 3 show very comparable performance of MML and HiLMM. This similar performance is not surprising given that both methods are likelihood-based. Hence, while reparametrizing the likelihood (7) is certainly useful to study it as a function of h2 (Bonnet, Gassiat, and Lévy-Leduc 2015), the reparametrization seems not beneficial for the purpose of estimating h2. In addition, unlike HiLMM, MML also returns estimates of τ2 and σ2. Finally, comparing the small and large data settings, we observe that both MML and HiLMM clearly benefit from the larger n and p.

4. Data example

We re-analyse the weight gain data recently discussed in Hellton and Hjort (2018). Details on the data are presented there; we provide a summary. The data consist of expression profiles of n = 26 individuals with kidney transplants, where the profiles consist of 28,869 genes as measured by Affymetrix Human Gene 1.0 ST arrays. The data are available in the EMBL-EBI ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEOD-33070. It is known that kidney transplantation may lead to weight gain, and the study by Cashion et al. (2013) investigates whether gene expression can be used to predict this. Such a prediction can be used to decide upon additional measures to prevent excessive weight gain. We reproduced the analysis by Hellton and Hjort (2018) as closely as possible, including their prior selection of 1000 genes. Details on minor discrepancies, and an alternative analysis that accounts for the gene selection, are discussed in the Supplementary Material. These did not affect the comparison qualitatively.

In Hellton and Hjort (2018), the authors illustrate their focused ridge (fridge) method and compare it with conventional ridge. In short, fridge estimates sample-specific ridge penalties, based on minimizing a per-sample mean squared error (MSE) criterion on the level of the linear predictor Xiβ. Since β is not known, it is replaced by an initial ridge estimate, β̂λ. Their sample-specific penalty then depends on Xi, and also on both λ̂ and σ̂2. The authors use GCV (6) to obtain λ, and a slight variation of (9) to estimate σ2. They show that fridge improves upon GCV-based ridge estimation. We wish to investigate i) whether MML estimation of λ = σ2/τ2 also improves the performance of GCV-based ridge regression; and ii) whether MML estimation further boosts the performance of the fridge estimator. Here, predictive performance is measured by the mean squared prediction error (MSPE) using leave-one-out cross-validation (loocv).

The estimates of MML differ markedly from those of GCV: (λ̂MML, σ̂MML2) = (0.77, 0.59), while (λ̂GCV, σ̂GCV2) = (20.92, 8.08). Using λ̂MML instead of λ̂GCV for the estimation of β substantially reduced the mean squared prediction error: MSPEMML = 14.40, while MSPEGCV = 16.38, a relative decrease of 12.1%. Using λ̂GCV, as in Hellton and Hjort (2018), fridge also reduced the MSPE, but to a lesser extent: MSPEfridge = 15.80, a relative decrease of 3.5% with respect to MSPEGCV. Application of fridge using λ̂MML did not further decrease MSPEMML, nor did it increase it. Possibly, the already fairly small value of λ̂MML left little room for improvement. Figure 3 displays the absolute prediction errors per sample and illustrates the improved prediction by ridge using λMML (and, to a lesser extent, by fridge) with respect to ridge using λGCV.

Figure 3. Absolute prediction errors (obtained by loocv; y-axis) for ridge using λGCV, for fridge and for ridge using λMML. Sample indices (x-axis) are sorted by GCV results.

5. Extensions

5.1. Extension 1: mixed effects model

A natural extension of the high-dimensional random effects model (1) is the mixed effects model:

\[ y = X_f\alpha + X_r\beta + \epsilon, \tag{17} \]

where we assume that the n × m design matrix for the fixed effects, Xf, is of low rank, so m ≪ n, as opposed to the random effects design matrix Xr. Restricted maximum likelihood (REML) deals with the fixed effects by contrasting them out. For the error contrast vector y − Xfα̂OLS = A^Ty, with A = In − Xf(Xf^TXf)^{-1}Xf^T, the marginal likelihood for the variance components equals (see e.g. Zhang 2015):

\[ P(A^T y) = N(A^T y;\, \mu = 0,\, \Sigma = A^T \Sigma_r A), \tag{18} \]

with Σr = XrXr^Tτ2 + σ2In. In addition to maximizing (18) as a function of (σ2, τ2), we attempted solving the set of two estimation equations suggested by Jiang (2007), but this rendered unstable results, inferior to maximizing (18) directly.

Alternatively, MML may be used, but it has to be adjusted to also estimate the fixed effects in the model. This implies replacing the zero mean in the Gaussian marginal likelihood (12) by Xfα, and optimizing (11) with respect to 2 + m parameters, where m is the number of fixed parameters. The mixed model simulation setting is as follows:

\[ y_{n\times 1} = X_{f,n\times m}\,\alpha_{m\times 1} + X_{r,n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \quad \epsilon_i \overset{iid}{\sim} N(0,\sigma^2), \quad x_{f,ik} \overset{iid}{\sim} N(0,1), \quad x_{r,ij} \overset{iid}{\sim} N(0,1), \quad \alpha_k \overset{iid}{\sim} p_{0,f}\,\delta_0 + (1-p_{0,f})\,N(0,\tau_{0,f}^2), \quad \beta_j \overset{iid}{\sim} p_0\,\delta_0 + (1-p_0)\,N(0,\tau_0^2), \tag{19} \]

where n = 100, p = 1000, m = 10, p0 = 0.9, τ02 = 0.1 (implying variance τ2 = (1 − p0)τ02 = 0.01 for generating random effects) and p0,f = 0.5, τ0,f2 = 0.20 (implying variance τf2 = 0.1 for generating fixed effects). Note that we focused on a fairly sparse setting for the random effects and a larger prior variance for the fixed effects than for the random effects, which enables a stronger impact of the small number of fixed effects. Figure 4 shows the results of REML, MML and CV (by glmnet, using penalty factor 0 for the fixed effects) for the estimation of τ2, σ2, λ and h2.
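A minimal sketch of the adjusted MML: the zero mean in (12) is replaced by Xf α and optim runs over the 2 + m parameters (log τ2, log σ2, α). This is our own illustration of the idea, not the paper's wrapper script.

    mml_mixed <- function(Xf, Xr, y) {
      eig <- eigen(tcrossprod(Xr), symmetric = TRUE)      # X_r X_r^T = Q diag(d) Q^T
      d   <- pmax(eig$values, 0); Q <- eig$vectors
      m   <- ncol(Xf)
      negloglik <- function(par) {                        # par = (log tau^2, log sigma^2, alpha)
        v  <- exp(par[1]) * d + exp(par[2])               # eigenvalues of Sigma_r
        rt <- drop(crossprod(Q, y - Xf %*% par[-(1:2)]))  # Q^T (y - Xf alpha)
        0.5 * sum(log(v) + rt^2 / v)
      }
      opt <- optim(c(log(0.01), log(1), rep(0, m)), negloglik, method = "BFGS")
      list(tau2 = exp(opt$par[1]), sigma2 = exp(opt$par[2]), alpha = opt$par[-(1:2)])
    }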

Figure 4. Estimates for mixed effects model, τ2=0.01,σ2=10,n=100,m=10,p=1000.

From Figure 4 we observe that REML indeed improves upon MML in terms of bias, although at the cost of increased variability. For the estimation of λ, CV is fairly competitive with REML and MML, although it renders markedly more over-penalization.

5.2. Extension 2: Bayesian linear regression

So far, we focused on classical methods. Bayesian methods may be a good alternative. We applied the standard Bayesian linear regression model, i.e. the conjugate model with i.i.d. priors π(βj) = N(0, σ2τ2), with τ2 fixed and σ2 endowed with a vague inverse-gamma prior (see the Supplementary Material for details). For this model the maximum marginal likelihood estimator of τ2 is still analytical (Karabatsos 2018), and so is the posterior mode estimate of σ2. Figure 5 shows the results in comparison to MML, i.e. maximization of (12), for the random effects case with multi-collinear X, as in Sec. 3.3.1. Results for other settings were in essence very similar.

Figure 5. Bayes and MML (12) estimates for multi-collinear X, with τ2=0.01,σ2=10,n=100,p=1000.

From the results we conclude that the conjugate Bayes estimates are very close to those of MML. This is in line with the fact that both estimators maximize a marginal likelihood and the conjugate model with prior variance τ2=σ2/λ is known to render posterior mean estimates of β that equal the λ-penalized ridge regression estimates.

The conjugate Bayesian model is scale-invariant, because the β prior contains the error variance σ2. Recently, it was criticized for its non-robustness against misspecification of the fixed τ2 when estimating σ2 (Moran, Rockova, and George 2018). However, in practice one needs to estimate τ2, by either empirical Bayes (e.g. maximum marginal likelihood) or full Bayes. We repeated the simulation by Moran, Rockova, and George (2018) (see the Supplementary Material). The results show that the estimates of σ2 are much better when estimating τ2 by empirical Bayes instead of fixing it, and are in fact very competitive with the alternatives proposed by Moran, Rockova, and George (2018).

5.3. Extension 3: generalized linear models

5.3.1. Setting

Motivated by the good results for MML in the linear setting, we wish to extend MML estimation to the high-dimensional generalized linear model (GLM) setting, where the likelihood depends on the regression parameter β only via the linear predictor, Xβ. Hence, the likelihood L(Y; β, X) is defined by a density fμ(Y) (e.g. Poisson), where Xβ is mapped to μ by a link function (e.g. log). As before, we a priori assume i.i.d. βj ∼ N(0, τ2), here equivalent to an L2 penalty λ = 1/τ2 when estimating β by penalized likelihood. In Heisterkam, van Houwelingen, and Downs (1999) an iterative algorithm to estimate λ is derived, which alternates estimation of β with maximization w.r.t. λ, requiring the computation of the trace of a p × p Hessian matrix. Here, the estimation of β itself is much slower than in the linear case, because it is not analytical and requires an iterative weighted least squares approximation. Below we show how to substantially alleviate the computational burden in the p ≫ n setting by re-parameterizing the marginal likelihood, implying computations in R^n instead of R^p.

5.3.2. Method

We have for the marginal likelihood:

\[ \mathrm{ML}(\lambda) = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\pi_\lambda(\beta)\,d\beta = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\phi(\beta_1;0,1/\lambda)\cdots\phi(\beta_p;0,1/\lambda)\,d\beta, \tag{20} \]

where ϕ(β; μ, τ2) denotes the normal density with mean μ and variance τ2. Now a crucial observation is that for GLM:

\[ \mathrm{ML}(\lambda) = E_{\pi_\lambda(\beta)}[L(Y;\beta,X)] = E_{\pi_\lambda(\beta)}[L(Y;X\beta)] = E_{\pi_\lambda(X\beta)}[L(Y;X\beta)], \tag{21} \]

because the likelihood depends on β only via the linear predictor Xβ. Here, πλ(Xβ) is the implied n-dimensional prior distribution of Xβ. This is a multivariate normal: ϕ(βX; μ = 0, Σλ = XX^T/λ), where βX denotes Xβ. Therefore, we have:

\[ \mathrm{ML}(\lambda) = \int_{\beta \in \mathbb{R}^p} g_{Y,\lambda}(\beta)\,d\beta = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\phi(\beta_1;0,1/\lambda)\cdots\phi(\beta_p;0,1/\lambda)\,d\beta = \int_{\beta_X \in \mathbb{R}^n} h_{Y,\lambda}(\beta_X)\,d\beta_X = \int_{\beta_X \in \mathbb{R}^n} L(Y;\beta_X,I_n)\,\phi(\beta_X;0,\Sigma_\lambda)\,d\beta_X. \tag{22} \]

Hence, the p-dimensional integral may be replaced by an n-dimensional one, with obvious computational advantages when p ≫ n. Moreover, the use of (22) allows applying implemented Laplace approximations, which tend to be more accurate in lower dimensions. The Laplace approximation requires β̂X = arg maxβX {hY,λ(βX)}. We emphasize that this generally does not equal Xβ̂, where β̂ = arg maxβ {gY,λ(β)} is the maximizer of the commonly used L2-penalized (log-)likelihood. However, β̂X can be computed by noting that

\[ \log h_{Y,\lambda}(\beta_X) \propto \ell(Y;\beta_X,I_n) - (\beta_X)^T \Sigma_\lambda^{-1} \beta_X. \tag{23} \]

In other words, this is the penalized log-likelihood obtained when regressing Y on the identity design matrix In using the L2 smoothing penalty (βX)^TΣλ^{-1}βX = λ(βX)^T(XX^T)^{-1}βX. The latter fits conveniently into the set-up of Wood (2011), as implemented in the R-package mgcv. This also facilitates MML estimation of λ by maximizing ML(λ), with hY,λ(βX) as in (23). If the columns of X are standardized (common in high-dimensional studies), XX^T has rank n − 1 instead of n, implying that (XX^T)^{-1} does not exist and should be replaced by a pseudo-inverse (XX^T)^+, such as the Moore-Penrose inverse.
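A sketch of how the re-parametrization (22)-(23) can be passed to mgcv for Poisson ridge regression: Y is regressed on the identity design with a parametric penalty proportional to (XX^T)^+. This is our illustration of the construction; the paper's wrapper scripts may differ in details (e.g. intercept handling), and the returned smoothing parameter may differ from λ = 1/τ2 by a constant scaling factor.

    library(mgcv)
    library(MASS)                                   # for ginv: Moore-Penrose pseudo-inverse
    In   <- diag(length(y))                         # identity design matrix I_n
    Pen  <- ginv(tcrossprod(X))                     # (X X^T)^+, the penalty matrix in (23)
    fit  <- gam(y ~ In - 1, family = poisson(),
                paraPen = list(In = list(Pen)), method = "ML")   # Laplace-approximate MML
    lambda_mml <- fit$sp                            # estimated ridge penalty (up to scaling)
    beta_X_hat <- coef(fit)                         # hat beta_X = arg max h_{Y,lambda}(beta_X)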

In a full Bayesian linear model setting, dimension reduction is also discussed by Bernardo et al. (2003), where Xβ is substituted by an n-dimensional factor-analytic representation, which requires an SVD of X. There, however, it is not used for hyper-parameter estimation by marginal likelihood, but instead for specifying (hierarchical) priors for the factors.

5.3.3. Results

R packages like glmnet (Friedman, Hastie, and Tibshirani 2010) and penalized (Goeman 2010) estimate λ by cross-validation, and mgcv also allows, next to MML estimation, (generalized) CV estimation (Wood 2011). Figure 6 shows the results for Poisson ridge regression, with Yi ∼ Pois(λi), λi = exp(Xiβ), β generated as in (14), and X generated as in (14) and (16), which denote the independent X and multi-collinear X settings, respectively.

Figure 6. λ estimates for Poisson ridge regression, λ=1/τ2=100,n=100,p=1000.

Figure 6 clearly shows the superior performance of MML based on (22) over CV. In particular, glmnet and penalized render strongly upward-biased values. The mgcv GCV values are still inferior to the MML-based ones, but much better than the latter two, which may be due to the different regression estimators used (Laplace approximation versus iterative weighted least squares). We should stress that CV does not target the estimation of λ as such, but merely aims to minimize prediction error. Nevertheless, the difference is remarkably larger than in the corresponding linear case (see Figures 1 and 2).

The Supplementary Material shows the results for Binomial ridge regression. While the differences in performance are less dramatic than for the Poisson setting, MML still renders much better estimates of λ than CV-based approaches.

6. Computational aspects and software

All methods and simulations presented here are implemented in a few wrapper R scripts: one for the linear random effects model (which includes the conjugate Bayes estimator), one for the linear mixed effects model, and one for Poisson and Binomial ridge regression. Parallel computations are supported. The scripts allow exact reproduction of the results in this manuscript as well as comparisons for other simulation or user-specific real data X cases. In addition, a script is supplied to produce the box-plots as in this manuscript.

HiLMM, PCR and CV implementations are provided by the R-packages HiLMM, v1.1 (Bonnet, Gassiat, and Lévy-Leduc 2015), ridge, v1.8-16 (Cule and De Iorio 2012) (code slightly adapted for computational efficiency) and glmnet, v2.0-16 (Friedman, Hastie, and Tibshirani 2010). The methods MML, REML, Bayes, MoM, Basic and GCV were implemented by us for the linear random and mixed effects models. For Poisson and Binomial ridge regression we applied mgcv, v1.8-16 (Wood 2011) after our re-parametrization (22) to obtain MML and GCV results, while for CV glmnet and penalized, v0.9-50 (Goeman 2010) were applied. For all methods that required optimization the R routine optim was used, with default settings. CV was based on 10 folds.

Computing times of the various methods largely depend on n and p, and much less so on the exact simulation setting. These are displayed in Table 1 for n = 100, 500 and p = 10³, 10⁴, 10⁵, based on computations with one CPU of an Intel® Xeon® CPU E5-2660 v3 @ 2.60 GHz server. For Poisson ridge regression, we only report the computing times of MML and GCV because, as reported in Figure 6, the performance of the CV-based methods was very inferior.

Table 1. Computing times for hyper-parameter estimation for linear and Poisson ridge regression.

From Table 1 we conclude that MML is also computationally very attractive. Its efficiency is explained by the fact that, unlike many of the other methods, it does not require an SVD or other matrix decomposition of X. Moreover, the only computation that involves dimension p is the product XX^T.

7. Discussion

We compared several estimators in a large variety of high-dimensional settings. The results showed that plain maximum marginal likelihood works well in many settings. MML is generally superior to methods that aim to separately estimate σ2, such as (9) and (10). Apparently, the estimates of σ2 and τ2 are so intrinsically linked in the high-dimensional setting that separate estimation is sub-optimal. The moment estimator (MoM) is generally not competitive with MML. It may, however, be useful in large systems with multiple hyper-parameters to estimate relative penalties, which are less sensitive to scaling issues than the global penalty parameter (Van de Wiel et al. 2016). MoM may also be a useful initial estimator for more complex estimators that are based on optimization, such as MML.

Possibly somewhat surprising is the good performance of MML for estimating λ and h2, as these are functions of σ2 and τ2. For the estimation of λ it is generally better than, or competitive with, (generalized) CV, an observation also made for the low-dimensional setting (Wood 2011). The inferior performance of the basic estimator of σ2 (9) implies that alternative estimators of λ that use σ̂2 as a plug-in are unlikely to perform well in high-dimensional settings. Such estimators, including the original one by Hoerl and Kennard (1970), are compared by Muniz and Kibria (2009) and Kibria and Banik (2016), who show that some do perform well in the low-dimensional setting. For Poisson ridge regression, similar estimators of λ are available (Månsson and Shukur 2011), but these rely on an initial maximum likelihood estimator of β, and hence do not apply to the high-dimensional setting. For estimating h2 it should be noted that HiLMM (Bonnet, Gassiat, and Lévy-Leduc 2015) aims to compute a confidence interval for h2 as well. For that purpose their direct estimator (8) is likely more useful than MML applied to the pair (τ2, σ2). We also used Esther (Bonnet et al. 2018), which precedes HiLMM by sure independence screening. It did not improve HiLMM in our (semi-)sparse settings, and requires manual steps. However, it likely improves HiLMM results in very sparse settings (Bonnet et al. 2018).

For mixed effects models with a small number of fixed effects, MML compares fairly well to REML, with a larger bias but a smaller variance. Probably the potential advantage of contrasting out the fixed effects is small when the number of random effects is large. REML may have a larger advantage in very sparse settings (Jiang et al. 2016) or when the number of fixed effects is large with respect to n. Estimates from the conjugate Bayes model are very similar to those by MML. We show that estimating τ2 along with σ2 highly improves the σ2 estimates presented by Moran, Rockova, and George (2018), where a fixed value of τ2 is used. In the case of many variance components or multiple similar regression equations, Bayesian extensions that shrink the estimates by a common prior are appealing, in particular in combination with efficient posterior approximations such as variational Bayes (Leday et al. 2017).

Our model (1) implies a dense setting, but we have demonstrated that the MML and REML estimators of τ2 and σ2 are fairly robust against moderate sparsity, which corroborates the results by Jiang et al. (2016). Nevertheless, truly sparse models may be preferable when variable selection is desired, which depends on accurate estimation of β. On the other hand, post-hoc selection procedures can be rather competitive (Bondell and Reich 2012). Moreover, the sparsity assumption is questionable for several applications. For example, in genetics it has been suggested that many complex traits (such as height or cholesterol levels) are not merely polygenic, but instead 'omnigenic' (Boyle, Li, and Pritchard 2017).

The extension of MML to high-dimensional GLM settings (22) is promising, given its computational efficiency and its performance for Poisson and Binomial regression. A special case of the latter, logistic regression, requires further research, because the Laplace approximations of the marginal likelihood are less accurate here (Wood 2011). Extension to survival is a promising avenue, because Cox regression is directly linked to Poisson regression (Cai and Betensky 2003). Alternatively, parametric survival models may be pursued. To what extent the estimates of hyper-parameters impact predictions depends on the sensitivity of the likelihood to these parameters. For the linear setting, a re-analysis of the weight-gain data showed that predictions based on λ̂MML improved upon those based on λ̂CV. Karabatsos (2018) shows that MML estimation also performs well compared to GCV for linear power ridge regression, which extends ridge regression by multiplying λ by (X^TX)^δ.

The MML estimator can be extended to the estimation of multiple variance components or penalty parameters, which was addressed by iterative likelihood minorization (Zhou et al. 2015) and by parameter-based moment estimation (Van de Wiel et al. 2016). The latter extends to non-Gaussian responses such as survival or binary ones. Further comparison of these methods with multi-parameter MML, both in terms of performance and computational efficiency, is left for future research. Finally, in particular in genetics applications, extensions of the estimation of variance components by MML to non-independent individuals can be implemented by use of a well-structured between-individual covariance matrix Σ (Kang et al. 2008).

Although our simulations cover a fairly broad spectrum of settings, many other variations could be of interest. We therefore supply fully annotated R scripts (https://github.com/markvdwiel/Hyperpar) that allow i) comparison of all algorithms discussed here, also for one's own real covariate set X; and ii) reproduction of all results presented here.


Acknowledgment

Gwenaël Leday was supported by the Medical Research Council, grant number MR/M004421. We thank Jiming Jiang and Can Yang for their correspondence, input and software for the MM algorithm. In addition, we thank Kristoffer Hellton for providing the fridge software and data. Finally, Iuliana Ciocǎnea-Teodorescu is acknowledged for preparing the TCGA KIRC data.

Disclosure statement

No potential conflict of interest was reported by the authors.

References