
Achieving the oracle property of OEM with nonconvex penalties

Pages 28-36 | Received 07 Mar 2017, Accepted 30 Apr 2017, Published online: 19 May 2017

ABSTRACT

The penalised least squares estimator with non-convex penalties such as the smoothly clipped absolute deviation (SCAD) penalty and the minimax concave penalty (MCP) is highly nonlinear and has many local optima. Finding a local solution that achieves the so-called oracle property is a challenging problem. We show that the orthogonalising EM (OEM) algorithm can indeed find such a local solution with the oracle property under some regularity conditions for a moderate but diverging number of variables.

1. Introduction

Consider a linear model
(1) y = Xβ + ε,
where X = (x_{ij}) is the n × p regression matrix, y is the response vector, β is the vector of regression coefficients, and ε is the vector of random errors. Throughout, let ‖ · ‖ denote the Euclidean norm. A regularised least squares estimator of β with the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) is given by solving
(2) β̂ = argmin_β {(2n)^{-1}‖y − Xβ‖^2 + Σ_{j=1}^p P(|βj|; a, b)},
where the penalty P is defined, for θ > 0, through
(3) P′(θ; a, b) = b{I(θ ⩽ b) + (ab − θ)_+/((a − 1)b) I(θ > b)}, with P(0; a, b) = 0,
a > 2 and b > 0 are the tuning parameters, (·)_+ denotes the positive part, and I is the indicator function. In order to apply the penalty P equally on all the variables, X can be standardised so that
(4) Σ_{i=1}^n x_{ij}^2 = n, j = 1, …, p.
Fan and Li (2001) used the SCAD penalty in (3) to achieve simultaneous estimation and variable selection. Zhang (2010) studied a class of non-convex penalties and introduced the minimax concave penalty (MCP) method, which replaces P in (2) with P_MCP satisfying
(5) P′_MCP(θ; a, b) = (b − θ/a)_+, with P_MCP(0; a, b) = 0,
where a > 1, b > 0, and θ > 0. Non-convex penalties are now commonly used in statistics. Existing algorithms for solving the SCAD or MCP problem include local quadratic approximation (Fan & Li, 2001; Hunter & Li, 2005), local linear approximation (Zou & Li, 2008), the ConCave Convex procedure (CCCP) (Kim, Choi, & Oh, 2008), the minimisation by iterative soft thresholding algorithm (Schifano, Strawderman, & Wells, 2010), and the coordinate descent algorithm (Breheny & Huang, 2011; Mazumder, Friedman, & Hastie, 2011; Tseng, 2001; Tseng & Yun, 2009), among others.
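For reference, integrating the derivatives in (3) and (5) gives the usual closed-form penalties. The following is a small Python sketch of these two functions; the function names are ours and are for illustration only.

import numpy as np

def scad_penalty(theta, a, b):
    """SCAD penalty P(theta; a, b) obtained by integrating (3); a > 2, b > 0."""
    t = np.abs(theta)
    return np.where(
        t <= b,
        b * t,
        np.where(
            t <= a * b,
            (2 * a * b * t - t ** 2 - b ** 2) / (2 * (a - 1)),
            (a + 1) * b ** 2 / 2,
        ),
    )

def mcp_penalty(theta, a, b):
    """MCP penalty P_MCP(theta; a, b) obtained by integrating (5); a > 1, b > 0."""
    t = np.abs(theta)
    return np.where(t <= a * b, b * t - t ** 2 / (2 * a), a * b ** 2 / 2)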

The non-convex nature of the SCAD and MCP penalties creates an interesting interface between computation and statistical properties. Fan and Li (2001) proposed the SCAD penalty together with an important property, called the oracle property. An estimator of β having this property can not only select the correct submodel asymptotically, but also estimate the nonzero coefficients as efficiently as if the correct submodel were known in advance. Fan and Li (2001) proved that there exists a local solution of the SCAD problem in (2) enjoying this property for fixed p. The corresponding results with a diverging p were presented in Fan and Peng (2004) and Fan and Lv (2011). However, it is challenging to single out the solution with the oracle property since non-convex penalised problems like SCAD or MCP can have many local optima (Huo & Chen, 2010; Huo & Ni, 2007). On the algorithmic side, different initial points in a specific algorithm can yield different solutions, whether n ⩾ p or n < p. For example, let all rows of X in (1) be independently generated from a zero-mean multivariate normal distribution with pairwise correlation 0.5, and let the components of β be independently generated from N(0, 3²). We use the coordinate descent algorithm to solve the SCAD problem with a = 3.7 and b = 1/(n log (n)). For one simulation, we use 100 initial points, independently generated from a multivariate normal distribution N(0, Ip), where Ip denotes the p × p identity matrix. The numbers of distinct solutions for different pairs of (n, p) over 100 simulations are reported in Table 1 (a code sketch of this experiment is given after Table 1).

Table 1. Averages and maxima (in parentheses) of the solution numbers of SCAD.
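As an illustration of the experiment just described, the following Python sketch runs a simple coordinate descent solver for the SCAD problem from many random initial points and counts the numerically distinct solutions. The coordinate update below is the standard univariate SCAD thresholding for columns scaled so that x_j′x_j/n = 1; it is a sketch under these assumptions, not the exact implementation used to produce Table 1.

import numpy as np

def scad_cd(X, y, lam, a=3.7, beta0=None, n_iter=200):
    """Coordinate descent for SCAD-penalised least squares, assuming the
    columns of X are scaled so that x_j'x_j/n = 1; the univariate update is
    the standard SCAD thresholding rule."""
    n, p = X.shape
    b = lam / n
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    r = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            z = X[:, j] @ r / n + beta[j]      # partial-residual correlation
            az = abs(z)
            if az <= b:
                new = 0.0
            elif az <= 2 * b:
                new = np.sign(z) * (az - b)
            elif az <= a * b:
                new = np.sign(z) * ((a - 1) * az - a * b) / (a - 2)
            else:
                new = z
            r += X[:, j] * (beta[j] - new)     # update the residual
            beta[j] = new
    return beta

def count_distinct_solutions(X, y, lam, n_starts=100, tol=1e-4, seed=0):
    """Run coordinate descent from many random initial points and count how
    many numerically distinct solutions appear."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    sols = []
    for _ in range(n_starts):
        beta = scad_cd(X, y, lam, beta0=rng.standard_normal(p))
        if not any(np.linalg.norm(beta - s) < tol for s in sols):
            sols.append(beta)
    return len(sols)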

To get a good estimate from the multiple solutions, one needs to select the algorithm and the initial point carefully. Fan and Li (2001) suggested using the ordinary least squares (OLS) estimator as the initial point in their local quadratic approximation algorithm for n > p. This method performs well in simulations but lacks theoretical justification. In recent years, related theoretical work has been carried out. Zhang (2010) devised a novel penalised linear unbiased selection algorithm and proved that it can achieve selection consistency of local solutions to MCP. Loh and Wainwright (2013) discussed the convergence rate of local solutions given by certain algorithms for a general class of non-convex penalised problems. As far as the oracle property is concerned, Zou and Li (2008) showed that the one-step solution of the local linear approximation algorithm, which stops the algorithm after one iteration, has the oracle property with a good initial estimator for fixed p. Fan, Xue, and Zou (2014) extended this result to a general penalised estimation problem. Wang, Kim, and Li (2013) proposed a two-step CCCP with different tuning parameters in the two steps, called calibrated CCCP, and showed that it produces the oracle estimator with probability approaching one. These theoretical results are useful for high-dimensional statistics. However, since a local solution is reached only as the number of iterations goes to infinity, an estimator obtained after one or a few iterations is not guaranteed to be a local minimum of the non-convex problem in finite samples. To find a local solution with the oracle property, we need to study the oracle property of a k-step estimator as k goes to infinity. As Meng (2008) pointed out, if we believe that the SCAD objective function is a valid criterion for sparse estimation, then the estimator from a monotonic algorithm is expected to improve as the iteration number increases. Therefore, a k-step estimator with the oracle property as k goes to infinity matches Fan and Li's original purpose of introducing the SCAD penalty. To the best of our knowledge, no such estimator has been presented in the literature, even for the case of n > p.

In this paper, we show a rather surprising result. In earlier work summarised in an unpublished report (Xiong, Dai, & Qian, 2011), we introduced the orthogonalising EM (OEM) algorithm for general least squares problems. For the SCAD and MCP problems, each OEM iteration has a simple closed form. This feature makes it possible to study the theoretical properties of a k-step estimator as k goes to infinity, and thus motivates us to ask whether the local solution of SCAD or MCP given by OEM has the oracle property. Since OEM is more suitable for big tall data with n > p (Xiong, Dai, Huling, & Qian, 2016), our study here focuses mainly on such cases. We prove that an OEM sequence for SCAD or MCP can indeed achieve the oracle property as the iteration number goes to infinity, provided the initial estimator has a certain consistency property. We allow the dimensionality p to depend on n of the order p = O(n^q) with q ∈ [0, 3/2). For p exceeding this order, our result can be applied to the submodel obtained after a screening stage (Fan & Lv, 2008), which reduces the initial p to the order p = o(n). We therefore present a detailed discussion of the applications of our result to the case of n > p. For this case, with the OLS estimator as the initial estimator, our result holds for a broad class of deterministic or random X in (1), which can even be nearly degenerate.

It should be pointed out that the technical report (Xiong et al., 2011) summarised some earlier results we obtained on OEM. We have divided the report into two papers for publication. One paper (Xiong et al., 2016) focuses on the details of the algorithm; the current paper deals with the oracle property. The two papers do not overlap.

Section 2 reviews the OEM algorithm. Section 3 presents the main result that the local solution of SCAD given by OEM has the oracle property. Section 4 discusses our result in the case of n > p. Section 5 presents simulation results. Section 6 concludes with some discussion. All proofs are given in the Appendix.

2. The OEM algorithm

Consider the SCAD problem in (2) with the regression matrix X standardised as in (4). For a matrix, denote by λmax(·) and λmin(·) its largest and smallest eigenvalues, respectively, and write γ1 = λmax(X′X/n). The first step of OEM is active orthogonalisation, which computes a matrix Δ such that
(6) X′X + Δ′Δ = ndnIp
with dn ⩾ γ1, where Ip denotes the p × p identity matrix. Therefore, Xc = (X′, Δ′)′ is column orthogonal. Consider the linear model
(7) yc = Xcβ + εc,
where yc = (y′, y′miss)′ is the complete response vector including a missing part ymiss. Based on the complete model (7), we can solve (2) by iteratively imputing ymiss. Let β^(0) be an initial point. For k = 0, 1,…, impute ymiss as y^(k)_miss = Δβ^(k), let y^(k)_c = (y′, (y^(k)_miss)′)′, and solve
β^(k+1) = argmin_β {(2n)^{-1}‖y^(k)_c − Xcβ‖^2 + Σ_{j=1}^p P(|βj|; a, b)}.
Since Xc is column orthogonal, the above problem becomes
(8) β^(k+1)_j = argmin_{βj} {dnβj^2/2 − u^(k)_jβj + P(|βj|; a, b)}, j = 1, …, p,
where u^(k)_j is the jth component of
(9) u^(k) = {X′y + (ndnIp − X′X)β^(k)}/n.
For u = (u1, …, up)′, define s(u; a, b) = [s(u1; a, b), …, s(up; a, b)]′ and
(10) s(uj; a, b) = 0 if |uj| ⩽ b; sign(uj)(|uj| − b)/dn if b < |uj| ⩽ (dn + 1)b; sign(uj){(a − 1)|uj| − ab}/{(a − 1)dn − 1} if (dn + 1)b < |uj| ⩽ adnb; uj/dn if |uj| > adnb.
We fix a in (3) and thus omit it in s hereinafter. Let b = λn/n, which depends on n. The problem in (8) has the closed-form solution given in (10), which leads to the OEM iteration formula
(11) β^(k+1) = s(u^(k); λn/n).
For the regularised least squares problem with the MCP penalty in (5), the OEM iteration formula is β^(k+1) = sMCP(u^(k); a, λn/n) with sMCP(u; a, b) = [sMCP(u1; a, b), …, sMCP(up; a, b)]′ and
(12) sMCP(uj; a, b) = 0 if |uj| ⩽ b; sign(uj)a(|uj| − b)/(adn − 1) if b < |uj| ⩽ adnb; uj/dn if |uj| > adnb.
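To make the iteration concrete, here is a minimal Python sketch of the OEM update for SCAD following (9)–(11) as reconstructed above; the function names, the fixed number of iterations, and the choice dn = γ1 are ours, for illustration only.

import numpy as np

def scad_threshold(u, d, a, b):
    """Closed-form minimiser of (d/2)t^2 - u t + P(|t|; a, b), as in (10).
    Under the standardisation (4), d = gamma_1 >= 1, so (a - 1) * d > 1."""
    au = np.abs(u)
    out = np.where(
        au <= b,
        0.0,
        np.where(
            au <= (d + 1.0) * b,
            (au - b) / d,
            np.where(
                au <= a * d * b,
                ((a - 1.0) * au - a * b) / ((a - 1.0) * d - 1.0),
                au / d,
            ),
        ),
    )
    return np.sign(u) * out

def oem_scad(X, y, lam, a=3.7, n_iter=500, beta0=None):
    """OEM iterations for SCAD: u^(k) = {X'y + (n*d*I - X'X)beta^(k)}/n and
    beta^(k+1) = s(u^(k); lam/n), cf. (9) and (11)."""
    n, p = X.shape
    G = X.T @ X
    d = np.linalg.eigvalsh(G / n).max()    # d_n = gamma_1 = lambda_max(X'X/n)
    Xty = X.T @ y
    b = lam / n
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        u = (Xty + (n * d * np.eye(p) - G) @ beta) / n
        beta = scad_threshold(u, d, a, b)
    return beta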

A simple choice of dn in (6) is dn = γ1, which can be computed efficiently by the Lanczos algorithm (Lanczos, 1950). With this choice, it suffices to compute γ1 for obtaining β^(k+1) in (11), instead of computing the whole matrix Δ.
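For instance, with SciPy one can obtain γ1 by a Lanczos-type routine using only matrix–vector products with X, so that neither Δ nor the full eigendecomposition of X′X is ever formed; the helper below is ours, for illustration.

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def gamma1(X):
    """Largest eigenvalue of X'X/n via the Lanczos-based eigsh, using only
    products v -> X'(Xv)/n."""
    n, p = X.shape
    op = LinearOperator((p, p), matvec=lambda v: X.T @ (X @ v) / n)
    return float(eigsh(op, k=1, which='LA', return_eigenvectors=False)[0])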

The above OEM procedure can easily be used in other regularised least squares problems, including ridge regression (Hoerl & Kennard, 1970), the nonnegative garrote (Breiman, 1995), and the lasso (Tibshirani, 1996). Xiong et al. (2011) showed that this is an EM algorithm, and derived its monotonicity and convergence properties for general regularised least squares problems.

3. Main results

Suppose that the number of nonzero coefficients of β in (1) is p1 and partition β as
(13) β = (β1′, β2′)′,
where β2 = 0 and no component of β1 is zero. Divide the columns of the regression matrix X in (1) as X = (X1, X2) with X1 corresponding to β1. We assume that X1′X1 is nonsingular. Let β̂^o = ((β̂^o_1)′, 0′)′ denote the oracle estimator with β̂^o_1 = (X1′X1)^{-1}X1′y and β̂^o_2 = 0. A regularised least squares estimator β̂ = (β̂1′, β̂2′)′ of β in (1) has the oracle property if it can not only select the correct submodel asymptotically, but also estimate the nonzero coefficients β1 in (13) as efficiently as if the correct submodel were known in advance. Specifically, an estimator has this property if P(β̂2 = 0) → 1 and β̂1 has the same asymptotic distribution as β̂^o_1.

By virtue of the simplicity of the OEM iteration in (11), we can study the oracle property of the local solution of SCAD given by OEM. First, we prove that, under certain conditions, a fixed point of the OEM iteration for SCAD is the oracle estimator with probability tending to one. Some notation, definitions, and assumptions are needed. Hereinafter, p, p1, and β can depend on n. Let βmin denote the component of β1 that has the smallest absolute value. The matrix X is standardised as in (4). Recall that dn ⩾ γ1 in (8).

Definition 3.1:

For a sequence of numbers cn → ∞ and a positive constant κ, an estimator β̂ of β is said to be cn-concentratively consistent of order κ if there exists a constant C > 0 such that, for sufficiently large t, P(cn‖β̂ − β‖ > t) ⩽ C/t^κ.

Remark 3.1:

By the Markov inequality, β̂ is cn-concentratively consistent of order κ if E{(cn‖β̂ − β‖)^κ} = O(1).
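To spell out the Markov step behind Remark 3.1 (using the tail-bound form assumed in Definition 3.1 above): for sufficiently large t,
P(cn‖β̂ − β‖ > t) = P{(cn‖β̂ − β‖)^κ > t^κ} ⩽ E{(cn‖β̂ − β‖)^κ}/t^κ ⩽ C/t^κ,
so a uniformly bounded κth moment of cn‖β̂ − β‖ yields cn-concentrative consistency of order κ.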

Assumption 3.1:

The random errors ϵ1,… , ϵn are i.i.d. random variables with E ϵ1 = 0 and E|ϵ1|r < ∞ for some r ⩾ 2.

Assumption 3.2:

As n → ∞, p1/(n^{1/2}dn|βmin|)^r → 0, p1/(cn|βmin|)^κ → 0, λn/(n|βmin|) → 0, λn/(n^{1/2}p^{1/r}) → ∞, and cnλn/(ndnp^{1/κ}) → ∞.

Remark 3.2:

Consider the case of p ⩾ n. We can assume that X is standardised as in (4) (Bühlmann & van de Geer, 2011). Since X′X/n has at most n non-zero eigenvalues and its trace equals p, we have γ1 ⩾ p/n. The equality holds for some X, and thus dn can be assumed to have the same order as p/n. For cn ∼ n^{1/2}, dn ∼ p/n, and sufficiently large r and κ, if we fix p1 and βmin and set p ∼ n^q for some q ⩾ 1, then λn can be chosen as λn ∼ n^α, where 1 > α > max{1/2 + q/r, q − 1/2 + q/κ}. It is clear that q should be smaller than 3/2. In other words, our results in this paper can handle dimensionality of order p = O(n^q) for q ∈ [1, 3/2).
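As a quick numerical check of the window for α implied by the bounds above (the specific values of q, r, and κ are only illustrative): with q = 1.2 and r = κ = 20,
max{1/2 + q/r, q − 1/2 + q/κ} = max{0.56, 0.76} = 0.76,
so any α ∈ (0.76, 1) is admissible. As q approaches 3/2 (even with r and κ large), the lower bound q − 1/2 + q/κ approaches 1 and the window closes, which is why q < 3/2 is required.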

Theorem 3.1:

Let β̂ be a fixed point of the OEM iteration (11) for SCAD. Suppose that β̂ is a cn-concentratively consistent estimator of β of order κ. Under Assumptions 3.1 and 3.2, as n → ∞, P(β̂ = β̂^o) → 1.

Theorem 3.1 indicates that a fixed point of OEM that is consistent for the true parameter has the oracle property even when p grows faster than n.

Sometimes it is difficult to know whether a fixed point is consistent. We next show that, with an initial point concentratively consistent to , an OEM sequence can converge to that fixed point and possess the oracle property. In addition, its limit point is indeed the oracle estimator with probability tending to one.

Let {β^(k)} be the OEM sequence from (11). Denote ζ1 = λmax(X1′X1/n) and ζp1 = λmin(X1′X1/n).

Assumption 3.3:

As n → ∞, p1^{1+r/2}/{(nζp1)^{r/2}|βmin|^r} → 0 and c*nλn/(ndn) → ∞.

Theorem 3.2:

If the initial estimator β^(0) is cn-concentratively consistent of order κ, then, under Assumptions 3.1–3.3, we have (i) as n → ∞,
(14) P(lim_{k→∞} β^(k) = β̂^o) → 1;
(ii) with probability tending to one, β^(k)_2 = 0 for all k = 1, 2,…, and lim_{k→∞} β^(k)_1 = β̂^o_1.

By (ii) of Theorem 3.2, the OEM sequence β^(k) can possess the oracle property when k is sufficiently large. In particular, β^(k)_1 then has the same asymptotic normality as β̂^o_1. We state this result as a corollary.

Assumption 3.4:

As n → ∞, k = kn satisfies (nζ1)^{1/2}exp{−k log(1 − δ)^{−1}}/c*n → 0 for δ > 0 and for δ = 0.

Corollary 3.1:

Suppose that the initial estimator β^(0) is cn-concentratively consistent of order κ. If
(15) max_{1⩽i⩽n} x1i′(X1′X1)^{−1}x1i → 0,
where x1i′ denotes the ith row of X1, then under Assumptions 3.1–3.4, as n → ∞,

(i)

P(β^(k)_2 = 0) → 1;

(ii)

for any non-zero p1 × 1 vector α, {α′(X1′X1)^{−1}α}^{−1/2}α′(β^(k)_1 − β1) → N(0, σ²) in distribution, where σ² = Eϵ1².

Remark 3.3:

A sufficient condition for (15) is p1/(nζp1) → 0 and max_{1⩽i⩽n} p1^{−1}Σ_{j=1}^{p1} x_{ij}² = O(1).

Remark 3.4:

The proofs of Theorems 3.1 and 3.2 only use the convergence rates of P(β^(k+1)_j = 0) = P(|u^(k)_j| < λn/n) and P(β^(k+1)_j = u^(k)_j/dn) = P(|u^(k)_j| > adnλn/n). Since an OEM sequence for MCP has the same structure from (12), all results in this section hold for MCP with almost the same proofs.

From Corollary 3.1, the OEM sequence can have the oracle property for sufficiently large k. For example, let p1 and β1 be fixed and X1′X1/n → Σ1, where Σ1 is a positive-definite matrix. For this case, if dn → ∞ and k = kn increases sufficiently fast as n → ∞, then Assumption 3.4 holds and n^{1/2}(β^(k)_1 − β1) → N(0, σ²Σ1^{−1}) in distribution.

Huo and Chen (2010) showed that, for the SCAD penalty, finding the global minimum of the SCAD problem is NP-hard. Theorem 3.2 indicates that, as far as the oracle property is concerned, the local solution given by OEM suffices.

Theorems 3.1 and 3.2 can handle dimensionality of order p = O(n^q) for q < 3/2. For p exceeding this order, penalised regression methods can perform poorly. A practical approach is the two-stage procedure of Fan and Lv (2008). The first stage uses an efficient screening method such as sure independence screening (SIS) (Fan & Lv, 2008) to reduce the dimensionality. OEM can be used in the second stage to obtain a SCAD estimator with the oracle property. In fact, OEM can also be used to screen variables even with p increasing at an exponential rate, since the one-step OEM estimator with a proper λn and the initial point β^(0) = 0 is equivalent to the SIS method, as sketched below.
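The equivalence just mentioned is easy to see from (9) and (11): with β^(0) = 0, u^(0) = X′y/n, so the coordinates surviving the threshold λn/n are those with the largest marginal correlations |x_j′y|. A minimal Python sketch (the function name and interface are ours, for illustration only):

import numpy as np

def one_step_oem_screen(X, y, lam):
    """One OEM step from beta^(0) = 0: u^(0) = X'y/n, so the coordinates
    kept by the threshold lam/n are those with large |x_j'y|, i.e., an
    SIS-type marginal screening."""
    n = X.shape[0]
    u = X.T @ y / n
    return np.flatnonzero(np.abs(u) >= lam / n)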

4. Discussion on the n > p case

The OEM algorithm was originally motivated by applications with big data, i.e., large n (Xiong et al., 2016). In this section, we discuss the case of p = o(n), and show that Theorems 3.1 and 3.2 hold under fairly weak conditions on X with the OLS estimator being the initial point of OEM. As in the previous sections, the matrix X in (1) is standardised as in (4). As in Section 3, the results in this section allow γp → 0 and/or ζ1 → ∞ as n → ∞; that is, the regression matrices X and X1 can be nearly degenerate. Let γ1 ⩾ ⋅⋅⋅ ⩾ γp denote the eigenvalues of X′X/n. By Lemma A.5 in the Appendix, we immediately obtain the following theorem.

Theorem 4.1:

Suppose that γp > 0 and p = o(nγp). Then, the OLS estimator (X′X)^{−1}X′y is (nγp/p)^{1/2}-concentratively consistent of order r under Assumption 3.1.

For p = o(nγp), we use the OLS estimator as the initial estimator in Theorem 3.2 and take dn = γ1. Since the OLS estimator is (nγp/p)^{1/2}-concentratively consistent of order r (so that we may take cn = (nγp/p)^{1/2} and κ = r) and pγ1² ⩾ γp, under Assumption 3.1, Assumptions 3.2 and 3.3 become

Assumption 4.1:

As n → ∞, p1^{1+r/2}/{(nζp1)^{r/2}|βmin|^r} → 0, λn/(n|βmin|) → 0, and γp^{1/2}λn/(n^{1/2}p^{1/2+1/r}γ1) → ∞.

Assumption 3.4 becomes

Assumption 4.2:

As n → ∞, k = kn satisfies (pζ1/γp)^{1/2}exp{−k log(1 − δ)^{−1}} → 0 for δ > 0 and for δ = 0.

By Theorem 3.2 and Corollary 3.1, under Assumption 3.1, Assumptions 4.1 and 4.2 imply the oracle property of the OEM sequence for SCAD with the OLS estimator being the initial estimator.

Suppose further that there exist two positive constants c and C such that c ⩽ γp ⩽ γ1 ⩽ C. Then Assumption 4.1 reduces to requiring that, as n → ∞, p1^{1+r/2}/(n^{r/2}|βmin|^r) → 0, λn/(n|βmin|) → 0, and λn/(n^{1/2}p^{1/2+1/r}) → ∞. Assumption 4.2 reduces to requiring that, as n → ∞, p^{1/2}exp{−k log(1 − δ)^{−1}} → 0.

We next give an example where Assumption 4.1 holds with γp → 0 and ζ1 → ∞. Let p ∼ n^q with q ∈ [0, 1), ζp1 ∼ 1/log(n), ζ1 ∼ log(n), γp ∼ 1/log(n)², γ1 ∼ log(n)², λn ∼ n^α, and let |βmin| be a constant. Then Assumption 4.1 holds if p1 = o((n/log(n))^{r/(r+2)}) and 1 > α > 1/2 + q/2 + q/r. Such an α exists for any q ∈ [0, 1) if r is sufficiently large.

Remark 4.1:

In many papers on high-dimensional asymptotics, the random errors are assumed to follow sub-Gaussian distributions. Under the sub-Gaussian assumption, we can show that the OLS estimator has an exponential tail probability bound using the results in Hsu, Kakade, and Zhang (2012). Therefore, when it is used as the initial point, Assumption 4.1 can be relaxed to allow more choices of λn.

Assumption 4.1 requires that the rates at which γ1, 1/γp, ζ1, and 1/ζp1 diverge to infinity be relatively slow, which holds for commonly encountered regression matrices X. The following theorem derives bounds on these eigenvalues under random designs from a broad class of correlated multivariate distributions.

Theorem 4.2:

Let z_{ij}, i = 1,…, n, j = 1,…, p, be i.i.d. random variables with Ez11 = 0, Ez11² = 1, and E|z11|⁴ < ∞. Suppose that Σ is a p × p positive-definite matrix and Σ1 is the p1 × p1 sub-matrix formed by its first p1 rows and columns. Denote zi = (z_{i1}, …, z_{ip})′ and xi = Σ^{1/2}zi for i = 1,…, n. Let X = (x1, …, xn)′. Then, for p = o(n),

(16) limsup_{n→∞} γ1 ⩽ λmax(Σ), liminf_{n→∞} γp ⩾ λmin(Σ), limsup_{n→∞} ζ1 ⩽ λmax(Σ1), and liminf_{n→∞} ζp1 ⩾ λmin(Σ1), with probability one.

Theorem 4.2 indicates that the rates at which γ1, 1/γp, ζ1, and 1/ζp1 diverge to infinity will be relatively slow if the eigenvalues of Σ and Σ1 are bounded away from zero and infinity. It is clear that a nearly degenerate Σ can also yield γ1, γp, ζ1, and ζp1 satisfying Assumption 4.1 asymptotically.

5. Simulations

This section presents simulation results for the SCAD solution given by OEM to support our theoretical findings. Our main purpose is to show that our method is at least comparable with other estimators having the oracle property.

We focus on the p < n case, and compare the SCAD solution given by OEM (SCADOEM) with four other methods: the lasso (Tibshirani, 1996), the adaptive lasso (AdaLasso) (Zou, 2006), Zou and Li (2008)'s one-step LLA estimator, and the SCAD solution given by the coordinate descent algorithm (SCADCD). The regression matrix X in (1) is constructed as X = (x1, …, xn)′, where x1, …, xn are independently generated from N(0, Σ), and the (i, j) entry of Σ is ρ^{|i − j|}. The random errors ϵ1,…, ϵn ∼ N(0, 1) and p = 8. The sample size n is chosen as 40, 60, and 80. We first use the OEM algorithm to compute the SCAD solution with the initial point being the OLS estimator. The tuning parameter a in (3) is set to 3.7, as recommended in Fan and Li (2001). The other parameter b = λn/n is selected by BIC (Wang, Li, & Tsai, 2007). With the same b, we compute the one-step estimator, and compare the variable selection errors (VSEs) and the model errors (MEs) of the two estimators. Here, the VSE and ME of an estimator β̂ are defined as VSE = |{j: I(β̂j ≠ 0) ≠ I(βj ≠ 0)}| and ME = (β̂ − β)′Σ(β̂ − β), respectively, where | · | denotes cardinality.
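Under the definitions of VSE and ME as written above (which we have reconstructed; the exact displayed formulas in the original are not recoverable), the two criteria can be computed as follows:

import numpy as np

def vse(beta_hat, beta_true):
    """Variable selection error: number of coordinates whose zero/non-zero
    status is misclassified."""
    return int(np.sum((np.asarray(beta_hat) != 0) != (np.asarray(beta_true) != 0)))

def me(beta_hat, beta_true, sigma):
    """Model error (beta_hat - beta)' Sigma (beta_hat - beta), with Sigma the
    covariance matrix of the predictors (rho^|i-j| in this simulation)."""
    d = np.asarray(beta_hat) - np.asarray(beta_true)
    return float(d @ sigma @ d)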

The average VSE and ME values of the two estimators over 1000 replications are shown in Table 2. We can see that SCADOEM outperforms the one-step estimator, especially when ρ is large. Besides, SCADOEM is comparable in terms of ME to AdaLasso, which also possesses the oracle property (Zou, 2006). SCADCD gives almost the same results as SCADOEM, but its oracle property has not been established in the literature.

Table 2. Comparisons of VSEs and MEs.

6. Concluding remarks

Since Fan and Li (2001) pointed out that there exists a local solution of SCAD having the oracle property, finding such a local solution has been an interesting open problem. We have proved that the OEM algorithm can indeed provide this local solution with the oracle property, even with a diverging p. Compared with estimators that possess this property after one or several iterations, our results provide a new way to compute the local solution required in Fan and Li (2001) and Zhang (2010), and indicate a new interface between optimisation and statistics for non-convex penalties.

Although our main results require p = O(n^q) with q ∈ [0, 3/2), the condition on the random errors is correspondingly weak. In particular, for p = o(n) and under such a condition on the errors, our results readily hold for commonly encountered regression matrices X, even nearly degenerate ones, with the OLS estimator being the initial point. This also matches the spirit in which Fan and Lv (2008) proposed the two-stage method for an ultra-high p: regularised least squares methods like SCAD are more suitable for a moderate p.

This paper does not discuss the selection of the tuning parameter λn in the SCAD penalty. This issue has been intensively studied in the literature, and we refer the reader to Wang et al. (2007), Wang, Li, and Leng (2009), Wang et al. (2013), and Fan and Tang (2013), among others.

Acknowledgments

We thank the editor, the associate editor, and two referees for their helpful and constructive comments.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

Xiong's research was supported by the National Natural Science Foundation of China [grant number 11471172], [grant number 11671386].

References

  • Bai, Z. D., & Yin, Y. Q. (1993). Limit of smallest eigenvalue of a large dimensional sample covariance matrix. Annals of Probability, 21, 1275–1294.
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373–384.
  • Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5, 232–253.
  • Bühlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Berlin: Springer.
  • Eicker, F. R. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. The Annals of Mathematical Statistics, 34, 447–456.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70, 849–911.
  • Fan, J., & Lv, J. (2011). Properties of non-concave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57, 5467–5484.
  • Fan, J., & Peng, H. (2004). Non-concave penalized likelihood with diverging number of parameters. Annals of Statistics, 32, 928–961.
  • Fan, J., Xue, L., & Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42, 819–849.
  • Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society, Series B, 75, 531–552.
  • Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
  • Hsu, D., Kakade, S. M., & Zhang, T. (2012). A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Journal of Probability, 17, 1–6.
  • Hunter, D. R., & Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33, 1617–1642.
  • Huo, X., & Chen, J. (2010). Complexity of penalized likelihood estimation. Journal of Statistical Computation and Simulation, 80, 747–759.
  • Huo, X., & Ni, X. L. (2007). When do stepwise algorithms meet subset selection criteria? Annals of Statistics, 35, 870–887.
  • Kim, Y., Choi, H., & Oh, H.-S. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of American Statistical Association, 103, 1665–1673.
  • Lai, T. L., & Wei, C. Z. (1984). Moment inequalities with applications to regression and time series models. Inequalities in Statistics and Probability, IMS Lecture Notes-Monograph Series, 5, 165–172.
  • Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45, 255–282.
  • Loh, P. L., & Wainwright, M. J. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Advances in Neural Information Processing Systems, 26, 476–484.
  • Mazumder, R., Friedman, J., & Hastie, T. (2011). SparseNet: Coordinate descent with non-convex penalties. Journal of American Statistical Association, 106, 1125–1138.
  • Meng, X. L. (2008). Discussion on one-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1542–1552.
  • Schifano, E. D., Strawderman, R., & Wells, M. T. (2010). Majorization–minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4, 1258–1299.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109, 475–494.
  • Tseng, P., & Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117, 387–423.
  • Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B, 71, 671–683.
  • Wang, H., Li, R., & Tsai, C-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
  • Wang, L., Kim, Y., & Li, R. (2013). Calibrating nonconvex penalized regression in ultra-high dimension. Annals of Statistics, 41, 2505–2536.
  • Wang, S., & Jia, Z. (1994). Inequalities in matrix theory ( In Chinese). Hefei: Anhui Education Press.
  • Xiong, S., Dai, B., Huling, J., & Qian, P. (2016). Orthogonalizing EM: A design-based least squares algorithm. Technometrics, 58, 285–293.
  • Xiong, S., Dai, B., & Qian, P. (2011). OEM algorithm for least squares problems (Unpublished report). Retrieved from http://arxiv.org/abs/1108.0185
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
  • Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.

Appendix: Proofs

Lemma A.1:

Under Assumption 3.1, for an n × 1 vector a with ‖a‖ = 1 and t > 0, P(|a′ε| > t) ⩽ C/t^r, where C > 0 is a constant that does not rely on n or a.

Proof:

By the Markov inequality, P(|a′ε| > t) ⩽ E|a′ε|^r/t^r. By the Lai–Wei inequality in Example 3 of Lai and Wei (1984) and Assumption 3.1, E|a′ε|^r is not greater than a positive constant that does not rely on n or a. This completes the proof.

Proof of Theorem 3.1:

Under Assumption 3.1, σ² = Eϵ1² < ∞. Without loss of generality, assume σ² = 1 in all the proofs. Let x1, …, xp denote the columns of X.

Since β̂ is a fixed point, β̂ = s(u; λn/n) with u = {X′y + (ndnIp − X′X)β̂}/n, where u and s are defined in (9) and (10), respectively. Therefore, (A1) We have (A2) Note that and that β̂ is concentratively consistent. For sufficiently large n, by (A1) and Assumption 3.2, (A3) For the other part in (A2), (A4) Plugging the above inequalities into (A2), we have

Note that when and , which implies that This completes the proof.

To prove Theorem 3.2, we need several lemmas. For φ > 0, define and , where the vector-valued function u = (u1, …, up)′ is defined in (Equation9).

Lemma A.2:

If φ < λn/(ndn) and for j = 1,… , p1, then E(φ)⊂F.

Proof:

For , we have Then, for z = (z1, 0′)′ ∈ E(φ) and j = 1,… , p1, ; for j = p1 + 1,… , p, since , we have .

Lemma A.3:

Let {β^(k)} be the OEM sequence from (11). If , then under the conditions of Lemma A.2, as k → ∞, and for all k = 0, 1,… .

Proof:

Since , by Lemma A.2, , which implies and . Recall that . We have . Consequently, . By recursion, we know that, for all k = 1, 2,… , , and . Letting k → ∞, we complete the proof.

Lemma A.4:

Under Assumption 3.1, .

Proof:

Let λ1 ⩾ ⋅⋅⋅ ⩾ λp be all eigenvalues of X′X. Therefore, X′X can be written as Σ_{j=1}^p λjaja′j with ‖aj‖ = 1 and a′iaj = 0 for i ≠ j. We have . The lemma follows from the Markov inequality and the Lai–Wei inequality.

Lemma A.5:

If γp > 0 and p = o(nγp), then under Assumption 3.1, the OLS estimator (X′X)^{−1}X′y is (nγp/p)^{1/2}-concentratively consistent of order r.

Proof:

Let be the same as in the proof of Lemma A.4. Since all non-zero eigenvalues of matrix are , we have , where with ‖bj‖ = 1 and bibj = 0 for ij. By the Cr inequality, we have By the Lai-Wei inequality, the above expression is not greater than a constant. By Remark 3.1, this completes the proof.

Lemma A.6:

Let {β^(k)} be the OEM sequence from (11). If β^(0) is cn-concentratively consistent of order κ, then under Assumptions 3.1 and 3.2, as n → ∞.

Proof:

By (9), (A5) Similar to (A3) and (A4), (A6) and (A7) By Assumption 3.2, we complete the proof.

Lemma A.7:

Under Assumptions 3.1–3.3, as n → ∞, where φn = {λn/(ndnc*n)}^{1/2}.

Proof:

By Lemma A.6, . It suffices to consider . If , then by (A5) and Lemma A.4, , where c**n = min{cn, n^{1/2}dn/(p1ζ1)^{1/2}}. Since by Lemma A.5, . By Assumption 3.3, φnc*n → ∞ as n → ∞. This implies and completes the proof.

Proof of Theorem 3.2:

(i) By Assumption 3.3, φn < λn/(ndn) for sufficiently large n, where φn is defined in Lemma A.7. By Lemma A.3, . By Lemma A.7, it suffices to show (A8) Since λn/(n|βmin|) → 0 in Assumption 3.2, for sufficiently large n, we have where C > 0 is a constant. The last inequality is from Lemma A.5. By Assumption 3.3, (A8) holds and this completes the proof of (i).

(ii) By Lemma A.3 and its proof, when , , and for some , and for all k = 0, 1,… . The proof of (ii) is completed by noting .

Proof of Corollary 3.1:

We only need to prove (ii). With minor modifications from Eicker (1963), under Assumption 3.1 and (15), we have in distribution. Then, it suffices to show . By (ii) of Theorem 3.2 and Assumption 3.4, the left side .

Proof of Theorem 4.2:

Note that , where . By the results in Bai and Yin (1993), and with probability one. Then, the right part of (16) can be proved by noting that , where the last inequality is from Corollary 4.6.3 in Wang and Jia (1994). The proof of the left part is similar.
