
Achieving the oracle property of OEM with nonconvex penalties

Pages 28-36 | Received 07 Mar 2017, Accepted 30 Apr 2017, Published online: 19 May 2017

ABSTRACT

The penalised least squares estimator with non-convex penalties such as the smoothly clipped absolute deviation (SCAD) penalty and the minimax concave penalty (MCP) is highly nonlinear and has many local optima. Finding a local solution that achieves the so-called oracle property is a challenging problem. We show that the orthogonalising EM (OEM) algorithm can indeed find such a local solution with the oracle property under some regularity conditions for a moderate but diverging number of variables.

1. Introduction

Consider a linear model
(1) y = Xβ + ε,
where X = (x_{ij}) is the n × p regression matrix, y is the response vector, β is the vector of regression coefficients, and ε is the vector of random errors. Throughout, let ‖ · ‖ denote the Euclidean norm. A regularised least squares estimator of β with the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) is given by solving
(2) β̂ = argmin_β {(2n)^{-1}‖y − Xβ‖^2 + Σ_{j=1}^p P(|βj|; a, b)},
where the penalty P is defined, for θ > 0, through
(3) P′(θ; a, b) = b{I(θ ⩽ b) + (ab − θ)_+/((a − 1)b) I(θ > b)}, with P(0; a, b) = 0,
a > 2 and b > 0 are the tuning parameters, (·)_+ denotes the positive part, and I is the indicator function. In order to apply the penalty P equally on all the variables, X can be standardised so that
(4) Σ_{i=1}^n x_{ij}^2 = n, j = 1, …, p.
Fan and Li (2001) used the SCAD penalty in (3) to achieve simultaneous estimation and variable selection. Zhang (2010) studied a class of non-convex penalties and introduced the minimax concave penalty (MCP) method, which replaces P in (2) with P_MCP satisfying
(5) P′_MCP(θ; a, b) = (b − θ/a)_+, with P_MCP(0; a, b) = 0,
where a > 1, b > 0, and θ > 0. Non-convex penalties are now commonly used in statistics. Existing algorithms for solving the SCAD or MCP problem include local quadratic approximation (Fan & Li, 2001; Hunter & Li, 2005), local linear approximation (Zou & Li, 2008), the ConCave Convex procedure (CCCP) (Kim, Choi, & Oh, 2008), the minimisation by iterative soft thresholding algorithm (Schifano, Strawderman, & Wells, 2010), and the coordinate descent algorithm (Breheny & Huang, 2011; Mazumder, Friedman, & Hastie, 2011; Tseng, 2001; Tseng & Yun, 2009), among others.
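For reference, integrating the derivatives in (3) and (5) gives the usual closed-form penalties. The following is a small Python sketch of these two functions; the function names are ours and are for illustration only.

import numpy as np

def scad_penalty(theta, a, b):
    """SCAD penalty P(theta; a, b) obtained by integrating (3); a > 2, b > 0."""
    t = np.abs(theta)
    return np.where(
        t <= b,
        b * t,
        np.where(
            t <= a * b,
            (2 * a * b * t - t ** 2 - b ** 2) / (2 * (a - 1)),
            (a + 1) * b ** 2 / 2,
        ),
    )

def mcp_penalty(theta, a, b):
    """MCP penalty P_MCP(theta; a, b) obtained by integrating (5); a > 1, b > 0."""
    t = np.abs(theta)
    return np.where(t <= a * b, b * t - t ** 2 / (2 * a), a * b ** 2 / 2)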

The non-convex nature of the SCAD and MCP penalties creates an interesting interface between computation and statistical properties. Fan and Li (2001) proposed the SCAD penalty together with an important property, called the oracle property. An estimator of β having this property can not only select the correct submodel asymptotically, but also estimate the nonzero coefficients as efficiently as if the correct submodel were known in advance. Fan and Li (2001) proved that there exists a local solution of the SCAD problem in (2) enjoying this property for fixed p. The corresponding results with a diverging p were presented in Fan and Peng (2004) and Fan and Lv (2011). However, it is challenging to single out the solution with the oracle property since non-convex penalised problems like SCAD or MCP can have many local optima (Huo & Chen, 2010; Huo & Ni, 2007). On the algorithmic side, different initial points in a specific algorithm can yield different solutions, whether n ⩾ p or n < p. For example, let all rows of X in (1) be independently generated from a zero-mean multivariate normal distribution with pairwise correlation 0.5, and let the components of β be independently generated from N(0, 3²). We use the coordinate descent algorithm to solve the SCAD problem with a = 3.7 and b = 1/(n log (n)). For one simulation, we use 100 initial points, independently generated from a multivariate normal distribution N(0, Ip), where Ip denotes the p × p identity matrix. The numbers of distinct solutions for different pairs of (n, p) over 100 simulations are reported in Table 1 (a code sketch of this experiment is given after Table 1).

Table 1. Averages and maxima (in parentheses) of the solution numbers of SCAD.
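As an illustration of the experiment just described, the following Python sketch runs a simple coordinate descent solver for the SCAD problem from many random initial points and counts the numerically distinct solutions. The coordinate update below is the standard univariate SCAD thresholding for columns scaled so that x_j′x_j/n = 1; it is a sketch under these assumptions, not the exact implementation used to produce Table 1.

import numpy as np

def scad_cd(X, y, lam, a=3.7, beta0=None, n_iter=200):
    """Coordinate descent for SCAD-penalised least squares, assuming the
    columns of X are scaled so that x_j'x_j/n = 1; the univariate update is
    the standard SCAD thresholding rule."""
    n, p = X.shape
    b = lam / n
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    r = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            z = X[:, j] @ r / n + beta[j]      # partial-residual correlation
            az = abs(z)
            if az <= b:
                new = 0.0
            elif az <= 2 * b:
                new = np.sign(z) * (az - b)
            elif az <= a * b:
                new = np.sign(z) * ((a - 1) * az - a * b) / (a - 2)
            else:
                new = z
            r += X[:, j] * (beta[j] - new)     # update the residual
            beta[j] = new
    return beta

def count_distinct_solutions(X, y, lam, n_starts=100, tol=1e-4, seed=0):
    """Run coordinate descent from many random initial points and count how
    many numerically distinct solutions appear."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    sols = []
    for _ in range(n_starts):
        beta = scad_cd(X, y, lam, beta0=rng.standard_normal(p))
        if not any(np.linalg.norm(beta - s) < tol for s in sols):
            sols.append(beta)
    return len(sols)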

To get a good estimate from the multiple solutions, one needs to select the algorithm and the initial point carefully. Fan and Li (2001) suggested using the ordinary least squares (OLS) estimator as the initial point in their local quadratic approximation algorithm for n > p. This method performs well in simulations but lacks theoretical justification. In recent years, related theoretical work has been carried out. Zhang (2010) devised a novel penalised linear unbiased selection algorithm and proved that it can achieve selection consistency of local solutions to MCP. Loh and Wainwright (2013) discussed the convergence rate of local solutions given by certain algorithms for a general class of non-convex penalised problems. As far as the oracle property is concerned, Zou and Li (2008) showed that the one-step solution of the local linear approximation algorithm, which stops the algorithm after one iteration, has the oracle property with a good initial estimator for fixed p. Fan, Xue, and Zou (2014) extended this result to a general penalised estimation problem. Wang, Kim, and Li (2013) proposed a two-step CCCP with different tuning parameters in the two steps, called calibrated CCCP, and showed that it produces the oracle estimator with probability approaching one. These theoretical results are useful for high-dimensional statistics. However, since a local solution is reached only as the number of iterations goes to infinity, an estimator obtained after one or a few iterations is not guaranteed to be a local minimum of the non-convex problem in finite samples. To find a local solution with the oracle property, we need to study the oracle property of a k-step estimator as k goes to infinity. As Meng (2008) pointed out, if we believe that the SCAD objective function is a valid criterion for sparse estimation, then the estimator from a monotonic algorithm is expected to improve as the iteration number increases. Therefore, a k-step estimator with the oracle property as k goes to infinity matches Fan and Li's original purpose of introducing the SCAD penalty. To the best of our knowledge, no such estimator has been presented in the literature, even for the case of n > p.

In this paper, we show a rather surprising result. In earlier work summarised in an unpublished report (Xiong, Dai, & Qian, 2011), we introduced the orthogonalising EM (OEM) algorithm for general least squares problems. For the SCAD and MCP problems, each OEM iteration has a simple closed form. This feature makes it possible to study the theoretical properties of a k-step estimator as k goes to infinity, and thus motivates us to ask whether the local solution of SCAD or MCP given by OEM has the oracle property. Since OEM is more suitable for big tall data with n > p (Xiong, Dai, Huling, & Qian, 2016), our study here focuses mainly on such cases. We prove that an OEM sequence for SCAD or MCP can indeed achieve the oracle property as the iteration number goes to infinity, provided the initial estimator has a certain consistency property. We allow the dimensionality p to depend on n of the order p = O(n^q) with q ∈ [0, 3/2). For p exceeding this order, our result can be applied to the submodel obtained after a screening stage (Fan & Lv, 2008), which reduces the initial p to the order p = o(n). We therefore present a detailed discussion of the applications of our result to the case of n > p. For this case, with the OLS estimator as the initial estimator, our result holds for a broad class of deterministic or random X in (1), which can even be nearly degenerate.

It should be pointed out that the technical report (Xiong et al., 2011) summarised some earlier results we obtained on OEM. We have divided the report into two papers for publication. One paper (Xiong et al., 2016) focuses on the details of the algorithm; the current paper deals with the oracle property. The two papers do not overlap.

Section 2 reviews the OEM algorithm. Section 3 presents the main result that the local solution of SCAD given by OEM has the oracle property. Section 4 discusses our result in the case of n > p. Section 5 presents simulation results. Section 6 concludes with some discussion. All proofs are given in the Appendix.

2. The OEM algorithm

Consider the SCAD problem in (2) with the regression matrix X standardised as in (4). For a matrix, denote by λmax(·) and λmin(·) its largest and smallest eigenvalues, respectively, and write γ1 = λmax(X′X/n). The first step of OEM is active orthogonalisation, which computes a matrix Δ such that
(6) X′X + Δ′Δ = ndnIp
with dn ⩾ γ1, where Ip denotes the p × p identity matrix. Therefore, Xc = (X′, Δ′)′ is column orthogonal. Consider the linear model
(7) yc = Xcβ + εc,
where yc = (y′, y′miss)′ is the complete response vector including a missing part ymiss. Based on the complete model (7), we can solve (2) by iteratively imputing ymiss. Let β^(0) be an initial point. For k = 0, 1,…, impute ymiss as y^(k)_miss = Δβ^(k), let y^(k)_c = (y′, (y^(k)_miss)′)′, and solve
β^(k+1) = argmin_β {(2n)^{-1}‖y^(k)_c − Xcβ‖^2 + Σ_{j=1}^p P(|βj|; a, b)}.
Since Xc is column orthogonal, the above problem becomes
(8) β^(k+1)_j = argmin_{βj} {dnβj^2/2 − u^(k)_jβj + P(|βj|; a, b)}, j = 1, …, p,
where u^(k)_j is the jth component of
(9) u^(k) = {X′y + (ndnIp − X′X)β^(k)}/n.
For u = (u1, …, up)′, define s(u; a, b) = [s(u1; a, b), …, s(up; a, b)]′ and
(10) s(uj; a, b) = 0 if |uj| ⩽ b; sign(uj)(|uj| − b)/dn if b < |uj| ⩽ (dn + 1)b; sign(uj){(a − 1)|uj| − ab}/{(a − 1)dn − 1} if (dn + 1)b < |uj| ⩽ adnb; uj/dn if |uj| > adnb.
We fix a in (3) and thus omit it in s hereinafter. Let b = λn/n, which depends on n. The problem in (8) has the closed-form solution given in (10), which leads to the OEM iteration formula
(11) β^(k+1) = s(u^(k); λn/n).
For the regularised least squares problem with the MCP penalty in (5), the OEM iteration formula is β^(k+1) = sMCP(u^(k); a, λn/n) with sMCP(u; a, b) = [sMCP(u1; a, b), …, sMCP(up; a, b)]′ and
(12) sMCP(uj; a, b) = 0 if |uj| ⩽ b; sign(uj)a(|uj| − b)/(adn − 1) if b < |uj| ⩽ adnb; uj/dn if |uj| > adnb.
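To make the iteration concrete, here is a minimal Python sketch of the OEM update for SCAD following (9)–(11) as reconstructed above; the function names, the fixed number of iterations, and the choice dn = γ1 are ours, for illustration only.

import numpy as np

def scad_threshold(u, d, a, b):
    """Closed-form minimiser of (d/2)t^2 - u t + P(|t|; a, b), as in (10).
    Under the standardisation (4), d = gamma_1 >= 1, so (a - 1) * d > 1."""
    au = np.abs(u)
    out = np.where(
        au <= b,
        0.0,
        np.where(
            au <= (d + 1.0) * b,
            (au - b) / d,
            np.where(
                au <= a * d * b,
                ((a - 1.0) * au - a * b) / ((a - 1.0) * d - 1.0),
                au / d,
            ),
        ),
    )
    return np.sign(u) * out

def oem_scad(X, y, lam, a=3.7, n_iter=500, beta0=None):
    """OEM iterations for SCAD: u^(k) = {X'y + (n*d*I - X'X)beta^(k)}/n and
    beta^(k+1) = s(u^(k); lam/n), cf. (9) and (11)."""
    n, p = X.shape
    G = X.T @ X
    d = np.linalg.eigvalsh(G / n).max()    # d_n = gamma_1 = lambda_max(X'X/n)
    Xty = X.T @ y
    b = lam / n
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        u = (Xty + (n * d * np.eye(p) - G) @ beta) / n
        beta = scad_threshold(u, d, a, b)
    return beta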

A simple choice of dn in (6) is dn = γ1, which can be computed efficiently by the Lanczos algorithm (Lanczos, 1950). With this choice, it suffices to compute γ1 for obtaining β^(k+1) in (11), instead of computing the whole matrix Δ.
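For instance, with SciPy one can obtain γ1 by a Lanczos-type routine using only matrix–vector products with X, so that neither Δ nor the full eigendecomposition of X′X is ever formed; the helper below is ours, for illustration.

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def gamma1(X):
    """Largest eigenvalue of X'X/n via the Lanczos-based eigsh, using only
    products v -> X'(Xv)/n."""
    n, p = X.shape
    op = LinearOperator((p, p), matvec=lambda v: X.T @ (X @ v) / n)
    return float(eigsh(op, k=1, which='LA', return_eigenvectors=False)[0])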

The above OEM procedure can easily be used in other regularised least squares problems, including ridge regression (Hoerl & Kennard, 1970), the nonnegative garrote (Breiman, 1995), and the lasso (Tibshirani, 1996). Xiong et al. (2011) showed that this is an EM algorithm, and derived its monotonicity and convergence properties for general regularised least squares problems.

3. Main results

Suppose that the number of nonzero coefficients of β in (1) is p1 and partition β as
(13) β = (β1′, β2′)′,
where β2 = 0 and no component of β1 is zero. Divide the columns of the regression matrix X in (1) as X = (X1, X2) with X1 corresponding to β1. We assume that X1′X1 is nonsingular. Let β̂^o = ((β̂^o_1)′, 0′)′ denote the oracle estimator with β̂^o_1 = (X1′X1)^{-1}X1′y and β̂^o_2 = 0. A regularised least squares estimator β̂ = (β̂1′, β̂2′)′ of β in (1) has the oracle property if it can not only select the correct submodel asymptotically, but also estimate the nonzero coefficients β1 in (13) as efficiently as if the correct submodel were known in advance. Specifically, an estimator has this property if P(β̂2 = 0) → 1 and β̂1 has the same asymptotic distribution as β̂^o_1.

By virtue of the simplicity of the OEM iteration in (11), we can study the oracle property of the local solution of SCAD given by OEM. First, we prove that, under certain conditions, a fixed point of the OEM iteration for SCAD is the oracle estimator with probability tending to one. Some notation, definitions, and assumptions are needed. Hereinafter, p, p1, and β can depend on n. Let βmin denote the component of β1 that has the smallest absolute value. The matrix X is standardised as in (4). Recall that dn ⩾ γ1 in (8).

Definition 3.1:

For a sequence of numbers cn → ∞ and a positive constant κ, an estimator β̂ of β is said to be cn-concentratively consistent of order κ if there exists a constant C > 0 such that, for sufficiently large t, P(cn‖β̂ − β‖ > t) ⩽ C/t^κ.

Remark 3.1:

By the Markov inequality, β̂ is cn-concentratively consistent of order κ if E{(cn‖β̂ − β‖)^κ} = O(1).
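To spell out the Markov step behind Remark 3.1 (using the tail-bound form assumed in Definition 3.1 above): for sufficiently large t,
P(cn‖β̂ − β‖ > t) = P{(cn‖β̂ − β‖)^κ > t^κ} ⩽ E{(cn‖β̂ − β‖)^κ}/t^κ ⩽ C/t^κ,
so a uniformly bounded κth moment of cn‖β̂ − β‖ yields cn-concentrative consistency of order κ.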

Assumption 3.1:

The random errors ϵ1,… , ϵn are i.i.d. random variables with E ϵ1 = 0 and E|ϵ1|r < ∞ for some r ⩾ 2.

Assumption 3.2:

As n → ∞, p1/(n^{1/2}dn|βmin|)^r → 0, p1/(cn|βmin|)^κ → 0, λn/(n|βmin|) → 0, λn/(n^{1/2}p^{1/r}) → ∞, and cnλn/(ndnp^{1/κ}) → ∞.

Remark 3.2:

Consider the case of p ⩾ n. We can assume that X is standardised as in (4) (Bühlmann & van de Geer, 2011). Since X′X/n has at most n non-zero eigenvalues and its trace equals p, we have γ1 ⩾ p/n. The equality holds for some X, and thus dn can be assumed to have the same order as p/n. For cn ∼ n^{1/2}, dn ∼ p/n, and sufficiently large r and κ, if we fix p1 and βmin and set p ∼ n^q for some q ⩾ 1, then λn can be chosen as λn ∼ n^α, where 1 > α > max{1/2 + q/r, q − 1/2 + q/κ}. It is clear that q should be smaller than 3/2. In other words, our results in this paper can handle dimensionality of order p = O(n^q) for q ∈ [1, 3/2).
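As a quick numerical check of the window for α implied by the bounds above (the specific values of q, r, and κ are only illustrative): with q = 1.2 and r = κ = 20,
max{1/2 + q/r, q − 1/2 + q/κ} = max{0.56, 0.76} = 0.76,
so any α ∈ (0.76, 1) is admissible. As q approaches 3/2 (even with r and κ large), the lower bound q − 1/2 + q/κ approaches 1 and the window closes, which is why q < 3/2 is required.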

Theorem 3.1:

Let β̂ be a fixed point of the OEM iteration (11) for SCAD. Suppose that β̂ is a cn-concentratively consistent estimator of β of order κ. Under Assumptions 3.1 and 3.2, as n → ∞, P(β̂ = β̂^o) → 1.

Theorem 3.1 indicates that a fixed point of OEM that is consistent for the true parameter has the oracle property even when p grows faster than n.

Sometimes it is difficult to know whether a fixed point is consistent. We next show that, with an initial point concentratively consistent to , an OEM sequence can converge to that fixed point and possess the oracle property. In addition, its limit point is indeed the oracle estimator with probability tending to one.

Let {β^(k)} be the OEM sequence from (11). Denote ζ1 = λmax(X1′X1/n) and ζp1 = λmin(X1′X1/n).

Assumption 3.3:

As n → ∞, p1^{1+r/2}/{(nζp1)^{r/2}|βmin|^r} → 0 and c*nλn/(ndn) → ∞.

Theorem 3.2:

If the initial estimator β^(0) is cn-concentratively consistent of order κ, then, under Assumptions 3.1–3.3, we have (i) as n → ∞,
(14) P(lim_{k→∞} β^(k) = β̂^o) → 1;
(ii) with probability tending to one, β^(k)_2 = 0 for all k = 1, 2,…, and lim_{k→∞} β^(k)_1 = β̂^o_1.

By (ii) of Theorem 3.2, the OEM sequence β^(k) can possess the oracle property when k is sufficiently large. In particular, β^(k)_1 then has the same asymptotic normality as β̂^o_1. We state this result as a corollary.

Assumption 3.4:

As n → ∞, k = kn satisfies (nζ1)^{1/2}exp{−k log(1 − δ)^{−1}}/c*n → 0 for δ > 0 and for δ = 0.

Corollary 3.1:

Suppose that the initial estimator β^(0) is cn-concentratively consistent of order κ. If
(15) max_{1⩽i⩽n} x1i′(X1′X1)^{−1}x1i → 0,
where x1i′ denotes the ith row of X1, then under Assumptions 3.1–3.4, as n → ∞,

(i)

P(β^(k)_2 = 0) → 1;

(ii)

for any non-zero p1 × 1 vector α, {α′(X1′X1)^{−1}α}^{−1/2}α′(β^(k)_1 − β1) → N(0, σ²) in distribution, where σ² = Eϵ1².

Remark 3.3:

A sufficient condition for (15) is p1/(nζp1) → 0 and max_{1⩽i⩽n} p1^{−1}Σ_{j=1}^{p1} x_{ij}² = O(1).

Remark 3.4:

The proofs of Theorems 3.1 and 3.2 only use the convergence rates of P(β^(k+1)_j = 0) = P(|u^(k)_j| < λn/n) and P(β^(k+1)_j = u^(k)_j/dn) = P(|u^(k)_j| > adnλn/n). Since an OEM sequence for MCP has the same structure from (12), all results in this section hold for MCP with almost the same proofs.

From Corollary 3.1, the OEM sequence can have the oracle property for sufficiently large k. For example, let p1 and β1 be fixed and X1′X1/n → Σ1, where Σ1 is a positive-definite matrix. For this case, if dn → ∞ and k = kn increases sufficiently fast as n → ∞, then Assumption 3.4 holds and n^{1/2}(β^(k)_1 − β1) → N(0, σ²Σ1^{−1}) in distribution.

Huo and Chen (2010) showed that, for the SCAD penalty, finding the global minimum of the SCAD problem is NP-hard. Theorem 3.2 indicates that, as far as the oracle property is concerned, the local solution given by OEM suffices.

Theorems 3.1 and 3.2 can handle dimensionality of order p = O(n^q) for q < 3/2. For p exceeding this order, penalised regression methods can perform poorly. A practical approach is the two-stage procedure of Fan and Lv (2008). The first stage uses an efficient screening method such as sure independence screening (SIS) (Fan & Lv, 2008) to reduce the dimensionality. OEM can be used in the second stage to obtain a SCAD estimator with the oracle property. In fact, OEM can also be used to screen variables even with p increasing at an exponential rate, since the one-step OEM estimator with a proper λn and the initial point β^(0) = 0 is equivalent to the SIS method, as sketched below.
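The equivalence just mentioned is easy to see from (9) and (11): with β^(0) = 0, u^(0) = X′y/n, so the coordinates surviving the threshold λn/n are those with the largest marginal correlations |x_j′y|. A minimal Python sketch (the function name and interface are ours, for illustration only):

import numpy as np

def one_step_oem_screen(X, y, lam):
    """One OEM step from beta^(0) = 0: u^(0) = X'y/n, so the coordinates
    kept by the threshold lam/n are those with large |x_j'y|, i.e., an
    SIS-type marginal screening."""
    n = X.shape[0]
    u = X.T @ y / n
    return np.flatnonzero(np.abs(u) >= lam / n)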

4. Discussion on the n > p case

The OEM algorithm was originally motivated by applications with big data, i.e., large n (Xiong et al., 2016). In this section, we discuss the case of p = o(n), and show that Theorems 3.1 and 3.2 hold under fairly weak conditions on X with the OLS estimator being the initial point of OEM. As in the previous sections, the matrix X in (1) is standardised as in (4). As in Section 3, the results in this section allow γp → 0 and/or ζ1 → ∞ as n → ∞; that is, the regression matrices X and X1 can be nearly degenerate. Let γ1 ⩾ ⋅⋅⋅ ⩾ γp denote the eigenvalues of X′X/n. By Lemma A.5 in the Appendix, we immediately obtain the following theorem.

Theorem 4.1:

Suppose that γp > 0 and p = o(nγp). Then, the OLS estimator (X′X)^{−1}X′y is (nγp/p)^{1/2}-concentratively consistent of order r under Assumption 3.1.

For p = o(nγp), we use the OLS estimator as the initial estimator in Theorem 3.2 and take dn = γ1. Since the OLS estimator is (nγp/p)^{1/2}-concentratively consistent of order r (so that we may take cn = (nγp/p)^{1/2} and κ = r) and pγ1² ⩾ γp, under Assumption 3.1, Assumptions 3.2 and 3.3 become

Assumption 4.1:

As n → ∞, p1^{1+r/2}/{(nζp1)^{r/2}|βmin|^r} → 0, λn/(n|βmin|) → 0, and γp^{1/2}λn/(n^{1/2}p^{1/2+1/r}γ1) → ∞.

Assumption 3.4 becomes

Assumption 4.2:

As n → ∞, k = kn satisfies (pζ1/γp)^{1/2}exp{−k log(1 − δ)^{−1}} → 0 for δ > 0 and for δ = 0.

By Theorem 3.2 and Corollary 3.1, under Assumption 3.1, Assumptions 4.1 and 4.2 imply the oracle property of the OEM sequence for SCAD with the OLS estimator being the initial estimator.

Suppose further that there exist two positive constants c and C such that c ⩽ γp ⩽ γ1 ⩽ C. Then Assumption 4.1 reduces to requiring that, as n → ∞, p1^{1+r/2}/(n^{r/2}|βmin|^r) → 0, λn/(n|βmin|) → 0, and λn/(n^{1/2}p^{1/2+1/r}) → ∞. Assumption 4.2 reduces to requiring that, as n → ∞, p^{1/2}exp{−k log(1 − δ)^{−1}} → 0.

We next give an example where Assumption 4.1 holds with γp → 0 and ζ1 → ∞. Let p ∼ n^q with q ∈ [0, 1), ζp1 ∼ 1/log(n), ζ1 ∼ log(n), γp ∼ 1/log(n)², γ1 ∼ log(n)², λn ∼ n^α, and let |βmin| be a constant. Then Assumption 4.1 holds if p1 = o((n/log(n))^{r/(r+2)}) and 1 > α > 1/2 + q/2 + q/r. Such an α exists for any q ∈ [0, 1) if r is sufficiently large.

Remark 4.1:

In many papers on high-dimensional asymptotics, the random errors are assumed to follow sub-Gaussian distributions. Under the sub-Gaussian assumption, we can show that the OLS estimator has an exponential tail probability bound using the results in Hsu, Kakade, and Zhang (2012). Therefore, when it is used as the initial point, Assumption 4.1 can be relaxed to allow more choices of λn.

Assumption 4.1 requires that the rates at which γ1, 1/γp, ζ1, and 1/ζp1 diverge to infinity be relatively slow, which holds for commonly encountered regression matrices X. The following theorem derives bounds on these eigenvalues under random designs from a broad class of correlated multivariate distributions.

Theorem 4.2:

Let z_{ij}, i = 1,…, n, j = 1,…, p, be i.i.d. random variables with Ez11 = 0, Ez11² = 1, and E|z11|⁴ < ∞. Suppose that Σ is a p × p positive-definite matrix and Σ1 is the p1 × p1 sub-matrix formed by its first p1 rows and columns. Denote zi = (z_{i1}, …, z_{ip})′ and xi = Σ^{1/2}zi for i = 1,…, n. Let X = (x1, …, xn)′. Then, for p = o(n),

(16) limsup_{n→∞} γ1 ⩽ λmax(Σ), liminf_{n→∞} γp ⩾ λmin(Σ), limsup_{n→∞} ζ1 ⩽ λmax(Σ1), and liminf_{n→∞} ζp1 ⩾ λmin(Σ1), with probability one.

Theorem 4.2 indicates that the rates at which γ1, 1/γp, ζ1, and 1/ζp1 diverge to infinity will be relatively slow if the eigenvalues of Σ and Σ1 are bounded away from zero and infinity. It is clear that a nearly degenerate Σ can also yield γ1, γp, ζ1, and ζp1 satisfying Assumption 4.1 asymptotically.

5. Simulations

This section presents simulation results for the SCAD solution given by OEM to support our theoretical findings. Our main purpose is to show that our method is at least comparable with other estimators having the oracle property.

We focus on the p < n case, and compare the SCAD solution given by OEM (SCADOEM) with four other methods: the lasso (Tibshirani, 1996), the adaptive lasso (AdaLasso) (Zou, 2006), Zou and Li (2008)'s one-step LLA estimator, and the SCAD solution given by the coordinate descent algorithm (SCADCD). The regression matrix X in (1) is constructed as X = (x1, …, xn)′, where x1, …, xn are independently generated from N(0, Σ), and the (i, j) entry of Σ is ρ^{|i − j|}. The random errors ϵ1,…, ϵn ∼ N(0, 1) and p = 8. The sample size n is chosen as 40, 60, and 80. We first use the OEM algorithm to compute the SCAD solution with the initial point being the OLS estimator. The tuning parameter a in (3) is set to 3.7, as recommended in Fan and Li (2001). The other parameter b = λn/n is selected by BIC (Wang, Li, & Tsai, 2007). With the same b, we compute the one-step estimator, and compare the variable selection errors (VSEs) and the model errors (MEs) of the two estimators. Here, the VSE and ME of an estimator β̂ are defined as VSE = |{j: I(β̂j ≠ 0) ≠ I(βj ≠ 0)}| and ME = (β̂ − β)′Σ(β̂ − β), respectively, where | · | denotes cardinality.
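Under the definitions of VSE and ME as written above (which we have reconstructed; the exact displayed formulas in the original are not recoverable), the two criteria can be computed as follows:

import numpy as np

def vse(beta_hat, beta_true):
    """Variable selection error: number of coordinates whose zero/non-zero
    status is misclassified."""
    return int(np.sum((np.asarray(beta_hat) != 0) != (np.asarray(beta_true) != 0)))

def me(beta_hat, beta_true, sigma):
    """Model error (beta_hat - beta)' Sigma (beta_hat - beta), with Sigma the
    covariance matrix of the predictors (rho^|i-j| in this simulation)."""
    d = np.asarray(beta_hat) - np.asarray(beta_true)
    return float(d @ sigma @ d)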

The average VSE and ME values of the two estimators over 1000 replications are shown in Table 2. We can see that SCADOEM outperforms the one-step estimator, especially when ρ is large. Besides, SCADOEM is comparable in terms of ME to AdaLasso, which also possesses the oracle property (Zou, 2006). SCADCD gives almost the same results as SCADOEM, but its oracle property has not been established in the literature.

Table 2. Comparisons of VSEs and MEs.

6. Concluding remarks

Since Fan and Li (2001) pointed out that there exists a local solution of SCAD having the oracle property, finding such a local solution has been an interesting open problem. We have proved that the OEM algorithm can indeed provide this local solution with the oracle property, even with a diverging p. Compared with estimators that possess this property after one or several iterations, our results provide a new way to compute the local solution required in Fan and Li (2001) and Zhang (2010), and indicate a new interface between optimisation and statistics for non-convex penalties.

Although our main results require p = O(n^q) with q ∈ [0, 3/2), the condition on the random errors is correspondingly weak. In particular, for p = o(n) and under such a condition on the errors, our results readily hold for commonly encountered regression matrices X, even nearly degenerate ones, with the OLS estimator being the initial point. This also matches the spirit in which Fan and Lv (2008) proposed the two-stage method for an ultra-high p: regularised least squares methods like SCAD are more suitable for a moderate p.

This paper does not discuss the selection of the tuning parameter λn in the SCAD penalty. This issue has been intensively studied in the literature, and we refer the reader to Wang et al. (2007), Wang, Li, and Leng (2009), Wang et al. (2013), and Fan and Tang (2013), among others.

Acknowledgments

We thank the editor, the associate editor, and two referees for their helpful and constructive comments.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

Xiong's research was supported by the National Natural Science Foundation of China [grant number 11471172], [grant number 11671386].

References

  • Bai, Z. D., & Yin, Y. Q. (1993). Limit of smallest eigenvalue of a large dimensional sample covariance matrix. Annals of Probability, 21, 1275–1294.
  • Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37, 373–384.
  • Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5, 232–253.
  • Bühlmann, P., & van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Berlin: Springer.
  • Eicker, F. R. (1963). Asymptotic normality and consistency of the least squares estimators for families of linear regressions. The Annals of Mathematical Statistics, 34, 447–456.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
  • Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B, 70, 849–911.
  • Fan, J., & Lv, J. (2011). Properties of non-concave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57, 5467–5484.
  • Fan, J., & Peng, H. (2004). Non-concave penalized likelihood with diverging number of parameters. Annals of Statistics, 32, 928–961.
  • Fan, J., Xue, L., & Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics, 42, 819–849.
  • Fan, Y., & Tang, C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society, Series B, 75, 531–552.
  • Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
  • Hsu, D., Kakade, S. M., & Zhang, T. (2012). A tail inequality for quadratic forms of sub-Gaussian random vectors. Electronic Journal of Probability, 17, 1–6.
  • Hunter, D. R., & Li, R. (2005). Variable selection using MM algorithms. Annals of Statistics, 33, 1617–1642.
  • Huo, X., & Chen, J. (2010). Complexity of penalized likelihood estimation. Journal of Statistical Computation and Simulation, 80, 747–759.
  • Huo, X., & Ni, X. L. (2007). When do stepwise algorithms meet subset selection criteria? Annals of Statistics, 35, 870–887.
  • Kim, Y., Choi, H., & Oh, H.-S. (2008). Smoothly clipped absolute deviation on high dimensions. Journal of American Statistical Association, 103, 1665–1673.
  • Lai, T. L., & Wei, C. Z. (1984). Moment inequalities with applications to regression and time series models. Inequalities in Statistics and Probability, IMS Lecture Notes-Monograph Series, 5, 165–172.
  • Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45, 255–282.
  • Loh, P. L., & Wainwright, M. J. (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Advances in Neural Information Processing Systems, 26, 476–484.
  • Mazumder, R., Friedman, J., & Hastie, T. (2011). SparseNet: Coordinate descent with non-convex penalties. Journal of American Statistical Association, 106, 1125–1138.
  • Meng, X. L. (2008). Discussion on one-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1542–1552.
  • Schifano, E. D., Strawderman, R., & Wells, M. T. (2010). Majorization–minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4, 1258–1299.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288.
  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109, 475–494.
  • Tseng, P., & Yun, S. (2009). A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117, 387–423.
  • Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B, 71, 671–683.
  • Wang, H., Li, R., & Tsai, C-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–568.
  • Wang, L., Kim, Y., & Li, R. (2013). Calibrating nonconvex penalized regression in ultra-high dimension. Annals of Statistics, 41, 2505–2536.
  • Wang, S., & Jia, Z. (1994). Inequalities in matrix theory ( In Chinese). Hefei: Anhui Education Press.
  • Xiong, S., Dai, B., Huling, J., & Qian, P. (2016). Orthogonalizing EM: A design-based least squares algorithm. Technometrics, 58, 285–293.
  • Xiong, S., Dai, B., & Qian, P. (2011). OEM algorithm for least squares problems (Unpublished report). Retrieved from http://arxiv.org/abs/1108.0185
  • Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
  • Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36, 1509–1533.

Appendix: Proofs

Lemma A.1:

Under Assumption 3.1, for an n × 1 vector a with ‖a‖ = 1 and t > 0, P(|a′ε| > t) ⩽ C/t^r, where C > 0 is a constant that does not rely on n or a.

Proof:

By the Markov inequality, P(|a′ε| > t) ⩽ E|a′ε|^r/t^r. By the Lai–Wei inequality in Example 3 of Lai and Wei (1984) and Assumption 3.1, E|a′ε|^r is not greater than a positive constant that does not rely on n or a. This completes the proof.

Proof of Theorem 3.1:

Under Assumption 3.1, σ² = Eϵ1² < ∞. Without loss of generality, assume σ² = 1 in all the proofs. Let x1, …, xp denote the columns of X.

Since β̂ is a fixed point, β̂ = s(u; λn/n) with u = {X′y + (ndnIp − X′X)β̂}/n, where u and s are defined in (9) and (10), respectively. Therefore, (A1) We have (A2) Note that and that β̂ is concentratively consistent. For sufficiently large n, by (A1) and Assumption 3.2, (A3) For the other part in (A2), (A4) Plugging the above inequalities into (A2), we have

Note that when and , which implies that This completes the proof.

To prove Theorem 3.2, we need several lemmas. For φ > 0, define and , where the vector-valued function u = (u1, …, up)′ is defined in (Equation9).

Lemma A.2:

If φ < λn/(ndn) and for j = 1,… , p1, then E(φ)⊂F.

Proof:

For , we have Then, for z = (z1, 0′)′ ∈ E(φ) and j = 1,… , p1, ; for j = p1 + 1,… , p, since , we have .

Lemma A.3:

Let {β^(k)} be the OEM sequence from (11). If , then under the conditions of Lemma A.2, as k → ∞, and for all k = 0, 1,… .

Proof:

Since , by Lemma A.2, , which implies and . Recall that . We have . Consequently, . By recursion, we know that, for all k = 1, 2,… , , and . Letting k → ∞, we complete the proof.

Lemma A.4:

Under Assumption 3.1, .

Proof:

Let λ1 ⩾ ⋅⋅⋅ ⩾ λp be all eigenvalues of X′X. Therefore, X′X can be written as Σ_{j=1}^p λjaja′j with ‖aj‖ = 1 and a′iaj = 0 for i ≠ j. We have . The lemma follows from the Markov inequality and the Lai–Wei inequality.

Lemma A.5:

If γp > 0 and p = o(nγp), then under Assumption 3.1, the OLS estimator (X′X)^{−1}X′y is (nγp/p)^{1/2}-concentratively consistent of order r.

Proof:

Let be the same as in the proof of Lemma A.4. Since all non-zero eigenvalues of matrix are , we have , where with ‖bj‖ = 1 and bibj = 0 for ij. By the Cr inequality, we have By the Lai-Wei inequality, the above expression is not greater than a constant. By Remark 3.1, this completes the proof.

Lemma A.6:

Let {β^(k)} be the OEM sequence from (11). If β^(0) is cn-concentratively consistent of order κ, then under Assumptions 3.1 and 3.2, as n → ∞.

Proof:

By (9), (A5) Similar to (A3) and (A4), (A6) and (A7) By Assumption 3.2, we complete the proof.

Lemma A.7:

Under Assumptions 3.1–3.3, as n → ∞, where φn = {λn/(ndnc*n)}^{1/2}.

Proof:

By Lemma A.6, . It suffices to consider . If , then by (A5) and Lemma A.4, , where c**n = min{cn, n^{1/2}dn/(p1ζ1)^{1/2}}. Since by Lemma A.5, . By Assumption 3.3, φnc*n → ∞ as n → ∞. This implies and completes the proof.

Proof of Theorem 3.2:

(i) By Assumption 3.3, φn < λn/(ndn) for sufficiently large n, where φn is defined in Lemma A.7. By Lemma A.3, . By Lemma A.7, it suffices to show (A8) Since λn/(n|βmin|) → 0 in Assumption 3.2, for sufficiently large n, we have where C > 0 is a constant. The last inequality is from Lemma A.5. By Assumption 3.3, (A8) holds and this completes the proof of (i).

(ii) By Lemma A.3 and its proof, when , , and for some , and for all k = 0, 1,… . The proof of (ii) is completed by noting .

Proof of Corollary 3.1:

We only need to prove (ii). With minor modifications from Eicker (1963), under Assumption 3.1 and (15), we have in distribution. Then, it suffices to show . By (ii) of Theorem 3.2 and Assumption 3.4, the left side .

Proof of Theorem 4.2:

Note that , where . By the results in Bai and Yin (1993), and with probability one. Then, the right part of (16) can be proved by noting that , where the last inequality is from Corollary 4.6.3 in Wang and Jia (1994). The proof of the left part is similar.
