
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models

Pages 116-134 | Received 08 Apr 2019, Accepted 17 Jul 2019, Published online: 12 Aug 2019

Abstract

For high-dimensional linear regression models, we review and compare several estimators of variances τ2 and σ2 of the random slopes and errors, respectively. These variances relate directly to ridge regression penalty λ and heritability index h2, often used in genetics. Several estimators of these, either based on cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of the high-dimensional covariate matrix such as multi-collinear covariates and data-derived ones. Moreover, we study robustness against model misspecifications such as sparse instead of dense effects and non-Gaussian errors. An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional linear mixed effects model, with REML as an alternative to MML. Second, to the conjugate Bayesian setting, shown to be a good alternative. Third, and most prominently, to generalized linear models for which we derive a computationally efficient MML estimator by re-writing the marginal likelihood as an n-dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of λ as compared to CV. Software is provided to enable reproduction of all results.

1. Introduction

Estimation of hyper-parameters is an essential part of fitting high-dimensional Gaussian random effects regression models, also known as ridge regression. These models are widely applied in genomics and genetics applications, where often the number of variables p is much larger than the number of samples n, i.e. p ≫ n.

We initially focus on the linear model. The goal is to estimate the error variance σ2 and the random effects variance τ2, or functions thereof, in particular the ridge penalty parameter λ = σ2/τ2, or the heritability index h2 = pτ2/(pτ2 + σ2). Here, the ridge penalty is used in classical ridge regression to shrink the regression coefficients towards zero (Hoerl and Kennard 1970), whereas heritability measures the fraction of variation between individuals within a population that is due to their genotypes (Visscher, Hill, and Wray 2008). The estimators of σ2 and τ2 can be used to estimate λ or h2, or for statistical testing (Kang et al. 2008). We review several estimators, based on maximum marginal likelihood (MML), moment equations, (generalized) cross-validation, dimension reduction, and degrees-of-freedom adjustment. Some of these estimators are classical, while others have recently been introduced.

We systematically review and compare the estimators in a broad variety of high-dimensional settings. For estimation of λ in low-dimensional settings, we refer to Muniz and Kibria (2009), Månsson and Shukur (2011), and Kibria and Banik (2016). We address the effect of multi-collinearity and robustness against model misspecifications, such as sparsity and non-Gaussian errors. The comparisons are extended to the linear mixed effects model, with q ≪ n fixed effects added to the model, and to Bayesian linear regression. The linear model part is concluded by a genomics data application to weight gain prediction after kidney transplantation.

The observed good performance of MML estimation in the linear model setting was a stimulus to consider MML for high-dimensional generalized linear models (GLM). MML is more involved here than in the linear model, because of the non-conjugacy of the likelihood and prior. Therefore, approximations are required, such as Laplace approximations. While these have been addressed by others (Heisterkam, van Houwelingen, and Downs 1999; Wood 2011), we derive an estimator which is computationally efficient for p ≫ n settings. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of MML estimation of λ as compared to cross-validation.

Our software enables reproduction of all results. In addition, it allows comparisons for one’s own high-dimensional data matrix by simulating the response conditional on this matrix, as we do for two cancer genomics examples. Computational shortcuts and considerations are discussed throughout the paper, and detailed at the end, including computing times.

1.1. The model

We initially focus on high-dimensional linear regression with random effects. Variables are denoted by j = 1, …, p and samples by i = 1, …, n. Then:

\[ y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \qquad \beta_{p\times 1} \sim N(0,\tau^2 I_p), \qquad \epsilon_{n\times 1} \sim N(0,\sigma^2 I_n). \tag{1} \]

Here, y = (y1, …, yn) is the vector of responses, β = (β1, …, βp)T corresponds to the random effects and ϵ = (ϵ1, …, ϵn)T is a vector of Gaussian errors. Furthermore, X is a fixed n × p matrix, (X1 ⋯ Xn)T, with Xi = (xi1, …, xip)T.

1.2. Estimation methods

We distinguish three categories of estimation methods:

  1. Estimation of functions of (σ2, τ2), in particular: i) λ = σ2/τ2 (Golub, Heath, and Wahba 1979), used in ridge regression to minimize ||y − Xβ||₂² + λ||β||₂²; and ii) heritability h2 = pτ2/(pτ2 + σ2) (Bonnet, Gassiat, and Lévy-Leduc 2015).

  2. Separate estimation of σ2 (Cule, Vineis, and De Iorio 2011; Cule and De Iorio 2012), possibly followed by plug-in estimation of τ2.

  3. Joint estimation of σ2 and τ2.

Below, we discuss several methods for each of these categories. They have several matrices and matrix computations in common, which we therefore introduce first.

1.3. Notation and matrix computations

Throughout the paper, we will use the following notation:

\[ \hat\beta = \hat\beta_\lambda = C_\lambda y = (X^T X + \lambda I_{p\times p})^{-1} X^T y, \quad \text{i.e. the linear ridge estimator;} \qquad H = H_\lambda = X C_\lambda = X (X^T X + \lambda I_{p\times p})^{-1} X^T, \quad \text{i.e. the hat matrix.} \tag{2} \]

Many of the estimators below require calculations on potentially very large matrices. The following two well-known equalities can highly alleviate the computational burden.

First, C = Cλ, and hence also β̂ and H, can be efficiently computed by using the singular value decomposition (SVD). Decompose X = U_{n×n} D_{n×n} (V_{p×n})^T by SVD, and denote Λ_q = λI_q. Then,

\[ C = (X^T X + \Lambda_p)^{-1} X^T = V (D^2 + \Lambda_n)^{-1} D\, U^T. \tag{3} \]

The latter requires inversion of an n × n matrix only. Second, the following efficient trace computation for matrix products applies to tr(H) = tr(XCλ):

\[ \mathrm{tr}(A_{p\times n} B_{n\times p}) = \sum_{i=1}^{n}\sum_{j=1}^{p} [A \circ B^T]_{ji}, \tag{4} \]

i.e. the sum of all elements of the element-wise (Hadamard) product of A and B^T.
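To illustrate (3) and (4), a minimal R sketch is given below. The function names (ridge_via_svd, tr_prod) are ours and purely illustrative; X, y and lambda denote a generic n × p design matrix, response vector and penalty.

    # Ridge estimator via the SVD shortcut (3): only min(n, p) singular values are involved
    ridge_via_svd <- function(X, y, lambda) {
      sv <- svd(X)                                  # X = U D V^T
      d  <- sv$d
      drop(sv$v %*% ((d / (d^2 + lambda)) * crossprod(sv$u, y)))   # beta-hat = C_lambda y
    }

    # Trace shortcut (4): tr(A B) as the sum of the element-wise product of A and t(B),
    # avoiding the explicit matrix product
    tr_prod <- function(A, B) sum(A * t(B))
    # e.g. tr(H_lambda) = tr_prod(X, C) with C = (X^T X + lambda I)^{-1} X^T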

2. Methods

2.1. Estimating functions of σ2 and τ2

2.1.1. Estimating λ by K-fold CV

A benchmark method that is used extensively to estimate λ = σ2/τ2 is cross-validation. Here, we use K-fold CV, as implemented in the popular R-package glmnet (Friedman, Hastie, and Tibshirani 2010). Let f(i) denote the set of samples left out for testing at the same fold as sample i. Then, CV-based estimation of λ pertains to minimizing the cross-validated prediction error:

\[ \lambda_{\mathrm{cv}} = \arg\min_\lambda \Big\{ \sum_{i=1}^{n} \big(y_i - X_i \hat\beta_\lambda^{-f(i)}\big)^2 \Big\}, \tag{5} \]

where β̂λ^{−f(i)} denotes the estimate of β based on training samples {1, …, n} ∖ f(i) and penalty λ. Note that for leave-one-out cross-validation (n-fold CV) the analytical solution of (5) is the PRESS statistic (Allen 1974).
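As an illustration, K-fold CV for ridge can be run with glmnet as sketched below. Note that glmnet parameterizes its penalty on the scale of RSS/(2n), so its lambda.min relates to λ in (5) only up to a factor involving n; the conversion shown is an assumption on our part, not taken from the paper.

    library(glmnet)
    # 10-fold CV for ridge regression (alpha = 0); X, y as in model (1)
    cvfit <- cv.glmnet(X, y, alpha = 0, nfolds = 10,
                       standardize = FALSE, intercept = FALSE)
    lambda_glmnet <- cvfit$lambda.min
    # glmnet minimizes RSS/(2n) + lambda_glmnet * ||beta||_2^2 / 2, so on the scale of (5)
    # the ridge penalty is approximately n * lambda.min (assumed conversion)
    lambda_cv <- nrow(X) * lambda_glmnet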

2.1.2. Estimating λ by generalized cross validation

Generalized Cross Validation (GCV) is a rotation-invariant form of the PRESS statistic. It is more robust than the latter to (near-diagonal) hat matrices Hλ (Golub, Heath, and Wahba 1979). For the linear model, the criterion is (Hastie, Tibshirani, and Friedman 2008):

\[ \mathrm{GCV}(\lambda) = \sum_{i=1}^{n} \left( \frac{y_i - X_i^T \hat\beta_\lambda}{n - \mathrm{tr}(H_\lambda)} \right)^{2}, \tag{6} \]

where the trace of Hλ can be computed efficiently by (4). Then, λgcv = arg minλ GCV(λ).
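A minimal sketch of (6), reusing the SVD so that each λ on a grid costs O(n) operations; the function name gcv_ridge and the grid of λ values are our illustrative choices, not taken from the paper.

    gcv_ridge <- function(X, y, lambdas = 10^seq(-4, 6, length.out = 200)) {
      sv  <- svd(X); d2 <- sv$d^2
      Uty <- crossprod(sv$u, y)                       # U^T y
      gcv <- sapply(lambdas, function(lam) {
        fit <- sv$u %*% ((d2 / (d2 + lam)) * Uty)     # fitted values H_lambda y
        trH <- sum(d2 / (d2 + lam))                   # tr(H_lambda)
        sum(((y - fit) / (nrow(X) - trH))^2)          # criterion (6)
      })
      lambdas[which.min(gcv)]
    }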

2.1.3. Estimating heritability by HiLMM

Heritability is defined by h2 = pτ2/(pτ2 + σ2). A recent method which estimates heritability directly using maximum likelihood is proposed by Bonnet, Gassiat, and Lévy-Leduc (2015). Analogously to Eq. (12), it is based on writing:

\[ y \sim N\big(0,\; h^2 \sigma_*^2 R + (1 - h^2)\,\sigma_*^2 I_n\big), \tag{7} \]

where σ*2 = pτ2 + σ2 and R = XX^T/p. Now, apply an eigen-decomposition to R: R = QLQ^T. Then, heritability is estimated by Bonnet, Gassiat, and Lévy-Leduc (2015) as:

\[ \hat h^2 = \arg\max_{h^2} \left( -\log\Big( \frac{1}{n}\sum_{i=1}^{n} \frac{\tilde y_i^2}{h^2(\ell_i - 1) + 1} \Big) - \frac{1}{n}\sum_{i=1}^{n} \log\big(h^2(\ell_i - 1) + 1\big) \right), \tag{8} \]

with ℓi and ỹi the ith diagonal element of L and the ith element of ỹ = Q^T y, respectively. The authors provide rigorous consistency results for their estimator, as well as theoretical confidence bounds, also for mixed models and sparse settings.
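The estimator (8) can be implemented directly in a few lines of R; the sketch below is our own illustration (not the HiLMM package code) and profiles the likelihood over h2 with optimize.

    h2_direct <- function(X, y) {
      R   <- tcrossprod(X) / ncol(X)                 # R = X X^T / p
      eig <- eigen(R, symmetric = TRUE)
      l   <- eig$values                              # eigenvalues, the diagonal of L
      yt  <- drop(crossprod(eig$vectors, y))         # y-tilde = Q^T y
      obj <- function(h2) {                          # profiled log-likelihood in (8)
        w <- h2 * (l - 1) + 1
        -log(mean(yt^2 / w)) - mean(log(w))
      }
      optimize(obj, interval = c(0, 0.999), maximum = TRUE)$maximum
    }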

2.2. Estimation of σ2

The two methods below rely on an estimate β̂=β̂λ, where λ=σ2/τ2 is estimated by (G)CV. Then σ2 is estimated conditional on β̂. If desired, τ2 may then be estimated by τ̂2=σ̂2/λ̂.

2.2.1. Basic estimate

A basic estimate of σ2, often used in practice, is given by (Hastie and Tibshirani 1990):

\[ \hat\sigma^2 = \frac{(y - X\hat\beta)^T (y - X\hat\beta)}{\nu}, \tag{9} \]

which is the residual mean square error. Here, the residual effective degrees of freedom (Hastie and Tibshirani 1990) equals ν = n − tr(2H − HH^T), with H as in (2). We also considered (9) with ν = n − tr(H), as in Hellton and Hjort (2018), which rendered similar, slightly inferior results.

2.2.2. PCR-based estimate

The estimator for σ2 may also be based on Principal Component Regression (PCR). PCR is based on the eigen-decomposition X^TX = Q̃D²Q̃^T. Denoting Z = XQ̃ and α = Q̃^Tβ, we have y = Zα + ϵ. Then, Z is reduced from p columns to r < min(n, p) principal components, a crucial step (Cule and De Iorio 2012). Using the reduced model, σ2 is estimated by the residual mean square error (Cule and De Iorio 2012):

\[ \hat\sigma_r^2 = \frac{(y - Z_r \hat\alpha_r)^T (y - Z_r \hat\alpha_r)}{n - r}. \tag{10} \]
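A minimal sketch of (10); the function name is ours and the choice of r is left to the user here (the ridge package of Cule and De Iorio provides an automated choice).

    sigma2_pcr <- function(X, y, r) {
      sv <- svd(X)                                       # X = U D V^T; PCs are X V = U D
      Zr <- sv$u[, seq_len(r), drop = FALSE] %*% diag(sv$d[seq_len(r)], nrow = r)
      fit <- lm.fit(x = Zr, y = y)                       # regress y on the first r PCs
      sum(fit$residuals^2) / (length(y) - r)             # residual mean square (10)
    }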

2.3. Joint estimation of σ2 and τ2

2.3.1. MML

An Empirical Bayes estimate of σ2 and τ2 is obtained by maximizing the marginal likelihood (MML), also referred to as model evidence in machine learning (Murphy 2012). This corresponds to:

\[ \arg\max_{\sigma^2,\tau^2} P(y) = \arg\max_{\sigma^2,\tau^2} \int_{\beta} L(y;\beta,\sigma^2)\,\pi(\beta;\tau^2)\,d\beta. \tag{11} \]

Since y = Xβ + ϵ, P(y) is simply derived from the convolution of Gaussian random variables, implying E[y] = E[Xβ] + E[ϵ] = 0 and V[y] = V[Xβ] + V[ϵ] = XX^Tτ2 + σ2In, so

\[ P(y) = N(y;\, \mu = 0,\, \Sigma = XX^T\tau^2 + \sigma^2 I_n). \tag{12} \]

This is easily maximized over σ2 and τ2. Note that, after computing XX^T, (12) requires operations on n × n matrices only.
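A minimal sketch of MML for the linear model: after one eigendecomposition of XX^T, each evaluation of the log of (12) costs O(n). The function name and the starting values are our illustrative choices, not the paper's wrapper script.

    mml_linear <- function(X, y) {
      eig <- eigen(tcrossprod(X), symmetric = TRUE)    # X X^T = Q diag(d) Q^T
      d   <- pmax(eig$values, 0)
      yt  <- drop(crossprod(eig$vectors, y))           # Q^T y
      negloglik <- function(par) {                     # par = (log tau^2, log sigma^2)
        v <- exp(par[1]) * d + exp(par[2])             # eigenvalues of Sigma in (12)
        0.5 * sum(log(v) + yt^2 / v)                   # -log N(y; 0, Sigma), up to a constant
      }
      opt <- optim(c(log(0.01), log(1)), negloglik)
      c(tau2 = exp(opt$par[1]), sigma2 = exp(opt$par[2]))
    }
    # est <- mml_linear(X, y); lambda_hat <- est["sigma2"] / est["tau2"]
    # h2_hat <- ncol(X) * est["tau2"] / (ncol(X) * est["tau2"] + est["sigma2"])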

2.3.2. Method of moments (MoM)

An alternative to MML is to match the empirical second moments of y to their theoretical counterparts. From (12) we observe that the covariances depend on τ2 only. Hence, we obtain an estimator of τ2 by equating the sum of the yiyk, i ≠ k, to the sum of the theoretical covariances, Σik = E[yiyk], with Σ as in (12). Then, with Σ^X = XX^T, an estimator for σ2 is obtained by substituting τ̂2 and equating the sum of the yi2 to the sum of the theoretical variances, Σii = E[yi2]:

\[ \hat\tau^2 = \frac{\sum_{i \neq k}^{n,n} y_i y_k}{\sum_{i \neq k}^{n,n} \Sigma^X_{ik}}, \qquad \hat\sigma^2 = n^{-1}\sum_{i=1}^{n}\big(y_i^2 - \hat\tau^2 \Sigma^X_{ii}\big). \tag{13} \]

These equations also hold for non-Gaussian error terms, which could be an advantage over MML. Moreover, no optimization over σ2 and τ2 is required, so MoM is computationally very attractive.
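The closed-form estimators (13) take only a few lines; a sketch (function name ours, X and y as in model (1)):

    mom_linear <- function(X, y) {
      SX   <- tcrossprod(X)                           # Sigma^X = X X^T
      off  <- row(SX) != col(SX)                      # off-diagonal entries (i != k)
      tau2 <- sum((y %o% y)[off]) / sum(SX[off])      # first equation in (13)
      sigma2 <- mean(y^2 - tau2 * diag(SX))           # second equation in (13)
      c(tau2 = tau2, sigma2 = sigma2)
    }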

3. Comparisons

For the linear random effects model (ridge regression) we study the following settings:

  • β and ϵ generated from model (1), independent X

  • β or ϵ generated from non-Gaussian distributions, independent X

  • β and ϵ from model (1), multicollinear X

  • β and ϵ from model (1), data-based X.

As is common for real data, the variables, i.e. the columns of X, were always standardized for the L2-penalty to have the same effect on all variables. All the results are based on 100 simulated data sets. Cross-validation is applied with 10 folds. Results from n-fold CV (leave-one-out) were generally fairly similar. We focus on the high-dimensional setting with n = 100, p = 1000, with excursions to larger data sets and the dimensions of real data. In all visualizations below the red dotted lines indicate true values. Moreover, values larger than 20 times the true value were truncated and slightly jittered. Discussion of all results is postponed to Sec. 3.4.

3.1. Independent X

In correspondence to model (1) we sample:

\[ y_{n\times 1} = X_{n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \qquad \epsilon_i \overset{iid}{\sim} N(0,\sigma^2), \quad x_{ij} \overset{iid}{\sim} N(0,1), \quad \beta_j \overset{iid}{\sim} N(0,\tau^2). \tag{14} \]

Figure 1 displays the results for n = 100, p = 1000, τ2 = 0.01, σ2 = 10 and for a large data setting n = 1000, p = 15,000, τ2 = 0.01, σ2 = 150 (which both imply h2 = 0.5).
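For concreteness, the small-data setting can be simulated as follows (a direct transcription of (14); the seed is arbitrary):

    set.seed(1)
    n <- 100; p <- 1000; tau2 <- 0.01; sigma2 <- 10    # implies lambda = 1000, h2 = 0.5
    X    <- matrix(rnorm(n * p), n, p)
    beta <- rnorm(p, sd = sqrt(tau2))
    y    <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))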

Figure 1. Results for independent X.

3.2. Departures from a normal effect size distribution

We study the robustness of the methods against (sparse) non-Gaussian effect size distributions or error distributions. In sparse settings, many variables do not have an effect. To mimic this, we simulated the β's from a mixture distribution with a 'spike' and a Gaussian 'slab':

\[ \beta_j \overset{iid}{\sim} p_0\,\delta_0 + (1 - p_0)\,N(0, \tau_0^2). \tag{15} \]

Here, we set p0 = 0.9, τ02 = 0.1, which implies τ2 = V(βj) = E(βj2) − E(βj)2 = (1 − p0)τ02 = 0.01, as in the Gaussian βj setting. Moreover, we also considered

βj ∼iid Laplace(μ = 0, b = 0.0707)  and  βj ∼iid Uniform(a = −0.17, b = 0.17),

where again the parameters are chosen such that E(βj) = 0 and τ2 = V(βj) = 0.01. Apart from β, all other quantities are simulated as in (14). Results for the Laplace (= lasso) effect size distribution are displayed for σ2 = 10, τ2 = 0.01, n = 100, p = 1000; results for the spike-and-slab and uniform effect size distributions are shown in Supplementary Figure 3.

Moreover, we considered heavy-tailed errors by sampling

ϵ*i ∼iid t4,  ϵi = (10/2)^{1/2} ϵ*i,

where the scalar is chosen such that σ2 = V(ϵi) = 10, as in the Gaussian error setting. Apart from ϵ, all other quantities are simulated as in (14). Results are displayed in Supplementary Figure 3c.
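The alternative effect-size and error distributions of this subsection can be generated as sketched below (the Laplace draw uses the sign-times-exponential representation; n, p and the object names are illustrative):

    n <- 100; p <- 1000                                                 # as in (14)
    p0 <- 0.9; tau02 <- 0.1
    beta_ss  <- ifelse(runif(p) < p0, 0, rnorm(p, sd = sqrt(tau02)))    # spike-and-slab (15)
    beta_lap <- sample(c(-1, 1), p, replace = TRUE) * rexp(p, rate = 1 / 0.0707)  # Laplace, b = 0.0707
    beta_unf <- runif(p, min = -0.17, max = 0.17)                       # uniform on (-0.17, 0.17)
    eps_t4   <- sqrt(10 / 2) * rt(n, df = 4)                            # t4 errors scaled to variance 10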

3.3. Multicollinear X

3.3.1. Simulated X

Next, the design matrix X is sampled using block-wise correlation. We replace the sampling of X in simulation model (14) by:

\[ X_{n\times p} \sim N(0, \Xi), \tag{16} \]

where Ξ is a unit-variance covariance matrix with blocks of size p* × p* with correlations ρ on the off-diagonal. Figure 2 shows the results for ρ = 0.5, p* = 10, n = 100, p = 1000.
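A sketch of (16) using mvtnorm; the block-diagonal covariance is built with kronecker (n, p as before, names illustrative):

    library(mvtnorm)
    n <- 100; p <- 1000
    rho <- 0.5; pstar <- 10
    block <- matrix(rho, pstar, pstar); diag(block) <- 1     # p* x p* block with correlation rho
    Xi    <- kronecker(diag(p / pstar), block)               # block-diagonal, unit variances
    X     <- rmvnorm(n, sigma = Xi)                          # rows of X ~ N(0, Xi)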

Figure 2. Results for multi-collinear and real X.

3.3.2. Real data X

Finally, we consider the estimation of τ2 and σ2 in a high- and a medium-dimensional setting where X consists of real data, with likely collinear columns. The first data set (TCGA KIRC) concerns gene expression data of p = 18,391 genes for n = 71 kidney tumors. The second data set (TCPA OV) holds expression data of p = 224 proteins for n = 408 ovarian tumor samples. Details on both data sets are supplied in the Supplementary Information. To generate the response y we use model (14) with X given by the data. Here, τ2 = 0.01 and σ2 is set such that h2 = 0.5. Figure 2 shows the results.

3.4. Discussion of results

3.4.1. MML vs MoM, basic and PCR

Figures 1 and 2 and Supplementary Figure 3 clearly show the superior performance of MML compared to MoM: both the bias and the variability are much smaller for MML. Generally, MML also outperforms the Basic and PCR estimators of σ2. The PCR estimator approaches the performance of MML for the KIRC and TCPA data (Figure 2), and the Basic estimator performs reasonably well for the latter (p < n) data set. For other settings, the Basic estimator performs as poorly as MoM. The results highlight the importance of joint estimation of σ2 and τ2 in high-dimensional settings, because of their delicate interplay.

3.4.2. MML vs GCV and CV

For the estimation of λ, MML seems slightly superior to GCV and CV. GCV shows more estimates that deviate towards too small values of λ in the large p settings, whereas CV tends to render somewhat more skewed results, either to the right or to the left. For the spike-and-slab and uniform effect sizes and the t4 errors the right-skewness of the CV results is more pronounced (Supplementary Figure 3), indicating that minimization of the cross-validated prediction error (5) is more vulnerable to non-Gaussian y than MML and GCV. Note that the Laplace setting relates directly to the lasso prior with scale parameter 1/λ1 (Tibshirani 1996). The results indicate that MML with a Gaussian prior could be useful to find the lasso penalty, or serve as a fast initial estimate by simply setting the lasso penalty λ1 = √2/τ̂, which follows from the variance of the lasso prior.

3.4.3. MML vs HiLMM

For the estimation of heritability h2, Figures 1 and 2 and Supplementary Figure 3 show very comparable performance of MML and HiLMM. This similar performance is not surprising given that both methods are likelihood-based. Hence, while reparametrizing the likelihood (7) is certainly useful to study it as a function of h2 (Bonnet, Gassiat, and Lévy-Leduc 2015), the reparametrization seems not beneficial for the purpose of estimating h2. In addition, unlike HiLMM, MML also returns estimates of τ2 and σ2. Finally, comparing the small and large data settings, we observe that both MML and HiLMM clearly benefit from the larger n and p.

4. Data example

We re-analyse the weight gain data recently discussed in Hellton and Hjort (2018). Details on the data are presented there; we provide a summary. The data consist of expression profiles of n = 26 individuals with kidney transplants, where the profiles consist of 28,869 genes as measured by Affymetrix Human Gene 1.0 ST arrays. The data are available in the EMBL-EBI ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEOD-33070. It is known that kidney transplantation may lead to weight gain, and the study by Cashion et al. (2013) investigates whether gene expression can be used to predict this. Such a prediction can be used to decide upon additional measures to prevent excessive weight gain. We reproduced the analysis by Hellton and Hjort (2018) as closely as possible, including their prior selection of 1000 genes. Details on minor discrepancies, and an alternative analysis that accounts for the gene selection, are discussed in the Supplementary Material. These did not affect the comparison qualitatively.

In Hellton and Hjort (2018), the authors illustrate their focused ridge (fridge) method and compare it with conventional ridge. In short, fridge estimates sample-specific ridge penalties, based on minimizing a per-sample mean squared error (MSE) criterion on the level of the linear predictor Xiβ. Since β is not known, it is replaced by an initial ridge estimate, β̂λ. Their sample-specific penalty then depends on Xi, and also on both λ̂ and σ̂2. The authors use GCV (6) to obtain λ, and a slight variation of (9) to estimate σ2. They show that fridge improves upon GCV-based ridge estimation. We wish to investigate i) whether MML estimation of λ = σ2/τ2 also improves the performance of GCV-based ridge regression; and ii) whether MML estimation further boosts the performance of the fridge estimator. Here, predictive performance is measured by the mean squared prediction error (MSPE) using leave-one-out cross-validation (loocv).

The estimates of MML differ markedly from those of GCV: (λ̂MML, σ̂MML2) = (0.77, 0.59), while (λ̂GCV, σ̂GCV2) = (20.92, 8.08). Using λ̂MML instead of λ̂GCV for the estimation of β substantially reduced the mean squared prediction error: MSPEMML = 14.40, while MSPEGCV = 16.38, a relative decrease of 12.1%. Using λ̂GCV, as in Hellton and Hjort (2018), fridge also reduced the MSPE, but to a lesser extent: MSPEfridge = 15.80, a relative decrease of 3.5% with respect to MSPEGCV. Application of fridge using λ̂MML did not further decrease MSPEMML, nor did it increase it. Possibly, the already fairly small value of λ̂MML left little room for improvement. Figure 3 displays the absolute prediction errors per sample and illustrates the improved prediction by ridge using λMML (and, to a lesser extent, by fridge) with respect to ridge using λGCV.

Figure 3. Absolute prediction errors (obtained by loocv; y-axis) for ridge using λGCV, for fridge and for ridge using λMML. Sample indices (x-axis) are sorted by GCV results.

5. Extensions

5.1. Extension 1: mixed effects model

A natural extension of the high-dimensional random effects model (1) is the mixed effects model:

\[ y = X_f\alpha + X_r\beta + \epsilon, \tag{17} \]

where we assume that the n × m design matrix for the fixed effects, Xf, is of low rank, so m ≪ n, as opposed to the random effects design matrix Xr. Restricted maximum likelihood (REML) deals with the fixed effects by contrasting them out. For the error contrast vector y − Xfα̂OLS = A^Ty, with A = In − Xf(Xf^TXf)^{-1}Xf^T, the marginal likelihood for the variance components equals (see e.g. Zhang 2015):

\[ P(A^T y) = N(A^T y;\, \mu = 0,\, \Sigma = A^T \Sigma_r A), \tag{18} \]

with Σr = XrXr^Tτ2 + σ2In. In addition to maximizing (18) as a function of (σ2, τ2), we attempted solving the set of two estimation equations suggested by Jiang (2007), but this rendered unstable results, inferior to maximizing (18) directly.

Alternatively, MML may be used, but it has to be adjusted to also estimate the fixed effects in the model. This implies replacing the zero mean in the Gaussian marginal likelihood (12) by Xfα, and optimizing (11) with respect to 2 + m parameters, where m is the number of fixed parameters. The mixed model simulation setting is as follows:

\[ y_{n\times 1} = X_{f,n\times m}\,\alpha_{m\times 1} + X_{r,n\times p}\,\beta_{p\times 1} + \epsilon_{n\times 1}, \quad \epsilon_i \overset{iid}{\sim} N(0,\sigma^2), \quad x_{f,ik} \overset{iid}{\sim} N(0,1), \quad x_{r,ij} \overset{iid}{\sim} N(0,1), \quad \alpha_k \overset{iid}{\sim} p_{0,f}\,\delta_0 + (1-p_{0,f})\,N(0,\tau_{0,f}^2), \quad \beta_j \overset{iid}{\sim} p_0\,\delta_0 + (1-p_0)\,N(0,\tau_0^2), \tag{19} \]

where n = 100, p = 1000, m = 10, p0 = 0.9, τ02 = 0.1 (implying variance τ2 = (1 − p0)τ02 = 0.01 for generating random effects) and p0,f = 0.5, τ0,f2 = 0.20 (implying variance τf2 = 0.1 for generating fixed effects). Note that we focused on a fairly sparse setting for the random effects and a larger prior variance for the fixed effects than for the random effects, which enables a stronger impact of the small number of fixed effects. Figure 4 shows the results of REML, MML and CV (by glmnet, using penalty factor 0 for the fixed effects) for the estimation of τ2, σ2, λ and h2.
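A minimal sketch of the adjusted MML: the zero mean in (12) is replaced by Xf α and optim runs over the 2 + m parameters (log τ2, log σ2, α). This is our own illustration of the idea, not the paper's wrapper script.

    mml_mixed <- function(Xf, Xr, y) {
      eig <- eigen(tcrossprod(Xr), symmetric = TRUE)      # X_r X_r^T = Q diag(d) Q^T
      d   <- pmax(eig$values, 0); Q <- eig$vectors
      m   <- ncol(Xf)
      negloglik <- function(par) {                        # par = (log tau^2, log sigma^2, alpha)
        v  <- exp(par[1]) * d + exp(par[2])               # eigenvalues of Sigma_r
        rt <- drop(crossprod(Q, y - Xf %*% par[-(1:2)]))  # Q^T (y - Xf alpha)
        0.5 * sum(log(v) + rt^2 / v)
      }
      opt <- optim(c(log(0.01), log(1), rep(0, m)), negloglik, method = "BFGS")
      list(tau2 = exp(opt$par[1]), sigma2 = exp(opt$par[2]), alpha = opt$par[-(1:2)])
    }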

Figure 4. Estimates for mixed effects model, τ2=0.01,σ2=10,n=100,m=10,p=1000.

From Figure 4 we observe that REML indeed improves upon MML in terms of bias, although at the cost of increased variability. For the estimation of λ, CV is fairly competitive with REML and MML, although it renders markedly more over-penalization.

5.2. Extension 2: Bayesian linear regression

So far, we focused on classical methods. Bayesian methods may be a good alternative. We applied the standard Bayesian linear regression model, i.e. the conjugate model with i.i.d. priors π(βj) = N(0, σ2τ2), with τ2 fixed and σ2 endowed with a vague inverse-gamma prior (see the Supplementary Material for details). For this model the maximum marginal likelihood estimator of τ2 is still analytical (Karabatsos 2018), and so is the posterior mode estimate of σ2. Figure 5 shows the results in comparison to MML, i.e. maximization of (12), for the random effects case with multi-collinear X, as in Sec. 3.3.1. Results for other settings were in essence very similar.

Figure 5. Bayes and MML (12) estimates for multi-collinear X, with τ2=0.01,σ2=10,n=100,p=1000.

From the results we conclude that the conjugate Bayes estimates are very close to those of MML. This is in line with the fact that both estimators maximize a marginal likelihood and the conjugate model with prior variance τ2=σ2/λ is known to render posterior mean estimates of β that equal the λ-penalized ridge regression estimates.

The conjugate Bayesian model is scale-invariant, because the β prior contains the error variance σ2. Recently, it was criticized for its non-robustness against misspecification of the fixed τ2 when estimating σ2 (Moran, Rockova, and George 2018). However, in practice one needs to estimate τ2, by either empirical Bayes (e.g. maximum marginal likelihood) or full Bayes. We repeated the simulation by Moran, Rockova, and George (2018) (see the Supplementary Material). The results show that the estimates of σ2 are much better when estimating τ2 by empirical Bayes instead of fixing it, and are in fact very competitive with the alternatives proposed by Moran, Rockova, and George (2018).

5.3. Extension 3: generalized linear models

5.3.1. Setting

Motivated by the good results for MML in the linear setting, we wish to extend MML estimation to the high-dimensional generalized linear model (GLM) setting, where the likelihood depends on the regression parameter β only via the linear predictor, Xβ. Hence, the likelihood L(Y; β, X) is defined by a density fμ(Y) (e.g. Poisson), where Xβ is mapped to μ by a link function (e.g. log). As before, we a priori assume i.i.d. βj ∼ N(0, τ2), here equivalent to an L2 penalty λ = 1/τ2 when estimating β by penalized likelihood. In Heisterkam, van Houwelingen, and Downs (1999) an iterative algorithm to estimate λ is derived, which alternates estimation of β with maximization w.r.t. λ, requiring the computation of the trace of a p × p Hessian matrix. Here, the estimation of β itself is much slower than in the linear case, because it is not analytical and requires an iterative weighted least squares approximation. Below we show how to substantially alleviate the computational burden in the p ≫ n setting by re-parameterizing the marginal likelihood, implying computations in R^n instead of R^p.

5.3.2. Method

We have for the marginal likelihood:

\[ \mathrm{ML}(\lambda) = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\pi_\lambda(\beta)\,d\beta = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\phi(\beta_1;0,1/\lambda)\cdots\phi(\beta_p;0,1/\lambda)\,d\beta, \tag{20} \]

where ϕ(β; μ, τ2) denotes the normal density with mean μ and variance τ2. Now a crucial observation is that for GLM:

\[ \mathrm{ML}(\lambda) = E_{\pi_\lambda(\beta)}[L(Y;\beta,X)] = E_{\pi_\lambda(\beta)}[L(Y;X\beta)] = E_{\pi_\lambda(X\beta)}[L(Y;X\beta)], \tag{21} \]

because the likelihood depends on β only via the linear predictor Xβ. Here, πλ(Xβ) is the implied n-dimensional prior distribution of Xβ. This is a multivariate normal: ϕ(βX; μ = 0, Σλ = XX^T/λ), where βX denotes Xβ. Therefore, we have:

\[ \mathrm{ML}(\lambda) = \int_{\beta \in \mathbb{R}^p} g_{Y,\lambda}(\beta)\,d\beta = \int_{\beta \in \mathbb{R}^p} L(Y;\beta,X)\,\phi(\beta_1;0,1/\lambda)\cdots\phi(\beta_p;0,1/\lambda)\,d\beta = \int_{\beta_X \in \mathbb{R}^n} h_{Y,\lambda}(\beta_X)\,d\beta_X = \int_{\beta_X \in \mathbb{R}^n} L(Y;\beta_X,I_n)\,\phi(\beta_X;0,\Sigma_\lambda)\,d\beta_X. \tag{22} \]

Hence, the p-dimensional integral may be replaced by an n-dimensional one, with obvious computational advantages when p ≫ n. Moreover, the use of (22) allows applying implemented Laplace approximations, which tend to be more accurate in lower dimensions. The Laplace approximation requires β̂X = arg maxβX {hY,λ(βX)}. We emphasize that this generally does not equal Xβ̂, where β̂ = arg maxβ {gY,λ(β)} is the maximizer of the commonly used L2-penalized (log-)likelihood. However, β̂X can be computed by noting that

\[ \log h_{Y,\lambda}(\beta_X) \propto \ell(Y;\beta_X,I_n) - (\beta_X)^T \Sigma_\lambda^{-1} \beta_X. \tag{23} \]

In other words, this is the penalized log-likelihood obtained when regressing Y on the identity design matrix In using the L2 smoothing penalty (βX)^TΣλ^{-1}βX = λ(βX)^T(XX^T)^{-1}βX. The latter fits conveniently into the set-up of Wood (2011), as implemented in the R-package mgcv. This also facilitates MML estimation of λ by maximizing ML(λ), with hY,λ(βX) as in (23). If the columns of X are standardized (common in high-dimensional studies), XX^T has rank n − 1 instead of n, implying that (XX^T)^{-1} does not exist and should be replaced by a pseudo-inverse (XX^T)^+, such as the Moore-Penrose inverse.
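A sketch of how the re-parametrization (22)-(23) can be passed to mgcv for Poisson ridge regression: Y is regressed on the identity design with a parametric penalty proportional to (XX^T)^+. This is our illustration of the construction; the paper's wrapper scripts may differ in details (e.g. intercept handling), and the returned smoothing parameter may differ from λ = 1/τ2 by a constant scaling factor.

    library(mgcv)
    library(MASS)                                   # for ginv: Moore-Penrose pseudo-inverse
    In   <- diag(length(y))                         # identity design matrix I_n
    Pen  <- ginv(tcrossprod(X))                     # (X X^T)^+, the penalty matrix in (23)
    fit  <- gam(y ~ In - 1, family = poisson(),
                paraPen = list(In = list(Pen)), method = "ML")   # Laplace-approximate MML
    lambda_mml <- fit$sp                            # estimated ridge penalty (up to scaling)
    beta_X_hat <- coef(fit)                         # hat beta_X = arg max h_{Y,lambda}(beta_X)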

In a full Bayesian linear model setting, dimension reduction is also discussed by Bernardo et al. (2003), where Xβ is substituted by an n-dimensional factor-analytic representation, which requires an SVD of X. There, however, it is not used for hyper-parameter estimation by marginal likelihood, but instead for specifying (hierarchical) priors for the factors.

5.3.3. Results

R packages like glmnet (Friedman, Hastie, and Tibshirani 2010) and penalized (Goeman 2010) estimate λ by cross-validation, and mgcv also allows, next to MML estimation, (generalized) CV estimation (Wood 2011). Figure 6 shows the results for Poisson ridge regression, with Yi ∼ Pois(λi), λi = exp(Xiβ), β generated as in (14), and X generated as in (14) and (16), which denote the independent X and multi-collinear X settings, respectively.

Figure 6. λ estimates for Poisson ridge regression, λ=1/τ2=100,n=100,p=1000.

Figure 6 clearly shows the superior performance of MML based on (22) over CV. In particular, glmnet and penalized render strongly upward-biased values. The mgcv GCV values are still inferior to the MML-based ones, but much better than the latter two, which may be due to the different regression estimators used (Laplace approximation versus iterative weighted least squares). We should stress that CV does not target the estimation of λ as such, but merely aims to minimize prediction error. Nevertheless, the difference is remarkably larger than in the corresponding linear case (see Figures 1 and 2).

The Supplementary Material shows the results for Binomial ridge regression. While the differences in performance are less dramatic than for the Poisson setting, MML still renders much better estimates of λ than CV-based approaches.

6. Computational aspects and software

All methods and simulations presented here are implemented in a few wrapper R scripts: one for the linear random effects model (which includes the conjugate Bayes estimator), one for the linear mixed effects model, and one for Poisson and Binomial ridge regression. Parallel computations are supported. The scripts allow exact reproduction of the results in this manuscript as well as comparisons for other simulation or user-specific real data X cases. In addition, a script is supplied to produce the box-plots as in this manuscript.

HiLMM, PCR and CV implementations are provided by the R-packages HiLMM, v1.1 (Bonnet, Gassiat, and Lévy-Leduc 2015), ridge, v1.8-16 (Cule and De Iorio 2012) (code slightly adapted for computational efficiency) and glmnet, v2.0-16 (Friedman, Hastie, and Tibshirani 2010). The methods MML, REML, Bayes, MoM, Basic and GCV were implemented by us for the linear random and mixed effects models. For Poisson and Binomial ridge regression we applied mgcv, v1.8-16 (Wood 2011) after our re-parametrization (22) to obtain MML and GCV results, while for CV glmnet and penalized, v0.9-50 (Goeman 2010) were applied. For all methods that required optimization the R routine optim was used, with default settings. CV was based on 10 folds.

Computing times of the various methods largely depend on n and p, and much less so on the exact simulation setting. These are displayed in Table 1 for n = 100, 500 and p = 10³, 10⁴, 10⁵, based on computations with one CPU of an Intel® Xeon® CPU E5-2660 v3 @ 2.60 GHz server. For Poisson ridge regression, we only report the computing times of MML and GCV because, as reported in Figure 6, the performance of the CV-based methods was very inferior.

Table 1. Computing times for hyper-parameter estimation for linear and Poisson ridge regression.

From Table 1 we conclude that MML is also computationally very attractive. Its efficiency is explained by the fact that, unlike many of the other methods, it does not require an SVD or other matrix decomposition of X. Moreover, the only computation that involves dimension p is the product XX^T.

7. Discussion

We compared several estimators in a large variety of high-dimensional settings. The results showed that plain maximum marginal likelihood works well in many settings. MML is generally superior to methods that aim to separately estimate σ2, such as (9) and (10). Apparently, the estimates of σ2 and τ2 are so intrinsically linked in the high-dimensional setting that separate estimation is sub-optimal. The moment estimator (MoM) is generally not competitive with MML. It may, however, be useful in large systems with multiple hyper-parameters to estimate relative penalties, which are less sensitive to scaling issues than the global penalty parameter (Van de Wiel et al. 2016). MoM may also be a useful initial estimator for more complex estimators that are based on optimization, such as MML.

Possibly somewhat surprising is the good performance of MML for estimating λ and h2, as these are functions of σ2 and τ2. For the estimation of λ it is generally better than, or competitive with, (generalized) CV, an observation also made for the low-dimensional setting (Wood 2011). The inferior performance of the basic estimator of σ2 (9) implies that alternative estimators of λ that use σ̂2 as a plug-in are unlikely to perform well in high-dimensional settings. Such estimators, including the original one by Hoerl and Kennard (1970), are compared by Muniz and Kibria (2009) and Kibria and Banik (2016), who show that some do perform well in the low-dimensional setting. For Poisson ridge regression, similar estimators of λ are available (Månsson and Shukur 2011), but these rely on an initial maximum likelihood estimator of β, and hence do not apply to the high-dimensional setting. For estimating h2 it should be noted that HiLMM (Bonnet, Gassiat, and Lévy-Leduc 2015) aims to compute a confidence interval for h2 as well. For that purpose their direct estimator (8) is likely more useful than MML applied to the pair (τ2, σ2). We also used Esther (Bonnet et al. 2018), which precedes HiLMM by sure independence screening. It did not improve HiLMM in our (semi-)sparse settings, and requires manual steps. However, it likely improves HiLMM results in very sparse settings (Bonnet et al. 2018).

For mixed effects models with a small number of fixed effects, MML compares fairly well to REML, with a larger bias but a smaller variance. Probably the potential advantage of contrasting out the fixed effects is small when the number of random effects is large. REML may have a larger advantage in very sparse settings (Jiang et al. 2016) or when the number of fixed effects is large with respect to n. Estimates from the conjugate Bayes model are very similar to those by MML. We show that estimating τ2 along with σ2 highly improves the σ2 estimates presented by Moran, Rockova, and George (2018), where a fixed value of τ2 is used. In the case of many variance components or multiple similar regression equations, Bayesian extensions that shrink the estimates by a common prior are appealing, in particular in combination with efficient posterior approximations such as variational Bayes (Leday et al. 2017).

Our model (1) implies a dense setting, but we have demonstrated that the MML and REML estimators of τ2 and σ2 are fairly robust against moderate sparsity, which corroborates the results by Jiang et al. (2016). Nevertheless, truly sparse models may be preferable when variable selection is desired, which depends on accurate estimation of β. On the other hand, post-hoc selection procedures can be rather competitive (Bondell and Reich 2012). Moreover, the sparsity assumption is questionable for several applications. For example, in genetics it has been suggested that many complex traits (such as height or cholesterol levels) are not merely polygenic, but instead 'omnigenic' (Boyle, Li, and Pritchard 2017).

The extension of MML to high-dimensional GLM settings (22) is promising, given its computational efficiency and its performance for Poisson and Binomial regression. A special case of the latter, logistic regression, requires further research, because the Laplace approximations of the marginal likelihood are less accurate here (Wood 2011). Extension to survival is a promising avenue, because Cox regression is directly linked to Poisson regression (Cai and Betensky 2003). Alternatively, parametric survival models may be pursued. To what extent the estimates of hyper-parameters impact predictions depends on the sensitivity of the likelihood to these parameters. For the linear setting, a re-analysis of the weight-gain data showed that predictions based on λ̂MML improved upon those based on λ̂CV. Karabatsos (2018) shows that MML estimation also performs well compared to GCV for linear power ridge regression, which extends ridge regression by multiplying λ by (X^TX)^δ.

The MML estimator can be extended to the estimation of multiple variance components or penalty parameters, which was addressed by iterative likelihood minorization (Zhou et al. 2015) and by parameter-based moment estimation (Van de Wiel et al. 2016). The latter extends to non-Gaussian responses such as survival or binary ones. Further comparison of these methods with multi-parameter MML, both in terms of performance and computational efficiency, is left for future research. Finally, in particular in genetics applications, extensions of the estimation of variance components by MML to non-independent individuals can be implemented by use of a well-structured between-individual covariance matrix Σ (Kang et al. 2008).

Although our simulations cover a fairly broad spectrum of settings, many other variations could be of interest. We therefore supply fully annotated R scripts (https://github.com/markvdwiel/Hyperpar) that allow i) comparison of all algorithms discussed here, also for one's own real covariate set X; and ii) reproduction of all results presented here.


Acknowledgment

Gwenaël Leday was supported by the Medical Research Council, grant number MR/M004421. We thank Jiming Jiang and Can Yang for their correspondence, input and software for the MM algorithm. In addition, we thank Kristoffer Hellton for providing the fridge software and data. Finally, Iuliana Ciocǎnea-Teodorescu is acknowledged for preparing the TCGA KIRC data.

Disclosure statement

No potential conflict of interest was reported by the authors.

References