Theory and Methods

Estimation of Linear Functionals in High-Dimensional Linear Models: From Sparsity to Nonsparsity

Pages 1579-1591 | Received 08 Jul 2021, Accepted 18 Apr 2023, Published online: 22 Jun 2023

ABSTRACT

High-dimensional linear models are commonly used in practice. In many applications, one is interested in linear transformations $\beta^\top x$ of the regression coefficients $\beta\in\mathbb{R}^p$, where x is a specific point and is not required to be identically distributed as the training data. One common approach is the plug-in technique, which first estimates β and then plugs the estimator into the linear transformation for prediction. Despite its popularity, estimation of β can be difficult for high-dimensional problems. Commonly used assumptions in the literature include that the coefficient vector β is sparse and that the predictors are weakly correlated. These assumptions, however, may not be easily verified and can be violated in practice. When β is non-sparse or the predictors are strongly correlated, estimation of β can be very difficult. In this article, we propose a novel pointwise estimator for linear transformations of β. This new estimator greatly relaxes the common assumptions for high-dimensional problems and is adaptive to the degree of sparsity of β and the strength of correlations among the predictors. In particular, β can be sparse or nonsparse, and the predictors can be strongly or weakly correlated. The proposed method is simple to implement. Numerical and theoretical results demonstrate the competitive advantages of the proposed method for a wide range of problems. Supplementary materials for this article are available online.

1 Introduction

With the advance of technology, high-dimensional data are prevalent in many scientific disciplines such as biology, genetics, and finance. Linear regression models are commonly used for the analysis of high-dimensional data, typically with two important goals: prediction and interpretability. Variable selection can help to provide useful insights on the relationship between predictors and the response, and thus improve the interpretability of the resulting model. During the past several decades, many sparse penalized techniques have been proposed for simultaneous variable selection and prediction, including convex penalized methods (Tibshirani Citation1996; Zou and Hastie Citation2005), as well as nonconvex ones (Fan and Li Citation2001; Zhang Citation2010).

In this article, we are interested in estimating linear transformations $\beta^\top x$ of the regression coefficients $\beta=(\beta_1,\ldots,\beta_p)^\top\in\mathbb{R}^p$ in high-dimensional linear models, where $x\in\mathbb{R}^p$ is a specific point that is not required to be from the same distribution as the training data. The problem relates to both coefficient estimation and prediction. For instance, one may be interested in estimating $\beta_1$ and $\beta_1-\beta_2$, both of which can be expressed as $\beta^\top x$ by taking x to be $(1,0,\ldots,0)^\top$ and $(1,-1,0,\ldots,0)^\top$, respectively. On the other hand, for a typical prediction problem, x follows the same distribution as the training data.

To estimate $\beta^\top x$, a natural and commonly used solution is to first estimate β by $\hat\beta$ and then construct the estimator $\hat\beta^\top x$, which can be viewed as a plug-in estimator. The efficiency of the plug-in estimator depends on that of $\hat\beta$. Despite its simplicity, obtaining a good estimate of β may not be easy in high-dimensional problems. If β is sparse (i.e., the support of β, $\mathrm{supp}(\beta)$, is small), sparse regularized techniques such as the LASSO can be used to obtain a consistent estimator of β. Desirable theoretical and numerical results have been established for various sparse penalized methods in the literature (see, e.g., Bickel, Ritov, and Tsybakov Citation2009; Raskutti, Wainwright, and Yu Citation2011; Bühlmann and Van De Geer Citation2011; Negahban et al. Citation2012). These regularized methods assume that β is a sparse vector, which is difficult to verify in practice, and they may fail when $|\mathrm{supp}(\beta)|$ is comparable to the sample size n or larger than n. The problem becomes more difficult when the predictors are strongly correlated, since most sparse regularized methods work well only for weakly dependent predictors.

We use a small simulation to illustrate the adverse effect of the sparsity level of β on the plug-in estimator of $\beta^\top x$ in the linear regression model (2.1), where $X_i$ follows the normal distribution $N(0,\Sigma)$, $\beta=\delta_0(\mathbf{1}_{p_0}^\top,\mathbf{0}_{p-p_0}^\top)^\top$ with $p_0=r_0p$, and $\Sigma=(\sigma_{ij})$ with $\sigma_{ij}=0.5^{|i-j|/\eta}$, where η controls the correlation strength among the predictors. A larger value of $r_0$ implies a denser β. The setup of $\delta_0$ and the other parameters is given in Setting 1 of Section 5.1. The average testing errors of the plug-in estimators and our proposed PointWise Estimator (PWE) are shown in Figure 1. The errors of the plug-in estimators deteriorate quickly as $r_0$ increases. In contrast, our proposed estimator is much less sensitive to the degree of non-sparsity.

Fig. 1 The effect of nonsparsity of β on plug-in estimators in terms of prediction error, where p = 1000. “A-lasso” and “lasso” denote the results of plug-in estimators with β̂ being adaptive LASSO and LASSO, respectively, and PWE denotes the proposed method.


For typical prediction problems, a number of papers have studied the convergence of prediction for various estimators, including the LASSO, ridge, partial least squares, overparameterized estimators, and many others under different settings (Dalalyan, Hebiri, and Lederer Citation2017; Zhang, Wainwright, and Jordan Citation2017; Dobriban and Wager Citation2018; Bartlett et al. Citation2020, etc.). It has been observed that the LASSO and related methods are less affected by the correlation strength among predictors in prediction than in estimation problems (Hebiri and Lederer Citation2013; Dalalyan, Hebiri, and Lederer Citation2017). However, for a sparse x such as $x=(1,0,\ldots,0)^\top$, estimating $\beta^\top x$ amounts to estimating the first coefficient, and these methods are then more affected by the correlation strength than in prediction (Zou and Hastie Citation2005). All the methods mentioned above consider the plug-in estimator and the average prediction error.

Different from these existing methods, we focus on $\beta^\top x$ for a specific fixed x ($x\ne 0$) rather than on estimating β or the average prediction error, where x is not required to have the same distribution as the training data. The line of work most closely related to ours concerns hypothesis testing and confidence intervals for $\beta^\top x$ in high-dimensional linear models (van de Geer et al. Citation2014; Zhang and Zhang Citation2014; Javanmard and Montanari Citation2014; Lu et al. Citation2017; Cai and Guo Citation2017; Zhu and Bradic Citation2018, etc.). Most of these papers considered β being ultra-sparse with $|\mathrm{supp}(\beta)|\ll\sqrt{n}/\log p$. Cai and Guo (Citation2017) considered the broader range where $|\mathrm{supp}(\beta)|$ has order no more than $n/\log p$. Zhu and Bradic (Citation2018) considered hypothesis testing and confidence intervals for $\beta^\top x$ where β is allowed to be non-sparse, by introducing a sparse auxiliary model, which can be restrictive. For example, if the predictor vector follows the normal distribution $N(0,\Sigma)$ and the sparse auxiliary model holds for every $x\in\mathbb{R}^p$ simultaneously, then Σ must equal $I_p$. Moreover, the estimator of $\beta^\top x$ obtained from the confidence interval of Zhu and Bradic (Citation2018), despite allowing β to be dense, works only when $p/n\to 0$ in prediction problems. Although these results enjoy optimality in the minimax sense (Cai and Guo Citation2017; Zhu and Bradic Citation2018), they can be conservative and are in fact determined by the most difficult case. In this article, we introduce the sparsity of eigenvalues (or approximate low-rankness) of certain matrices, which is shown to be complementary to the sparsity of β. The most difficult case is the situation where both types of sparsity fail.

Our key observation is that we can directly target $\gamma_x:=\beta^\top x$, treating it as an unknown parameter to be estimated. We refer to the resulting estimate as the pointwise estimator. To this end, we propose a unified framework to leverage multiple sources of information. In many cases, the eigenvalues of the covariance matrix decrease dramatically due to correlations among the predictors, which we refer to as sparsity of eigenvalues in what follows. This type of sparsity is generally viewed as an adverse factor, making the estimation of β more difficult. Contrary to this popular view, we show that the sparsity of eigenvalues is beneficial in our framework and serves as a good complement to the sparsity of β. In practice, two kinds of test points x are of particular interest: (a) x is a given sparse vector, and (b) x is a random vector having the same distribution as the training data (i.e., the prediction problem). We give detailed results for these two special cases and compare our estimator with several other methods. The main contribution of this article is a transformed model under a new basis, which provides a unified way to use different sources of information.

First, to use the sparsity of eigenvalues, we propose an estimator based on a basis consisting of eigenvectors of a specific matrix constructed from the training data. When the eigenvalues decrease at a certain rate, this estimator performs well for both kinds of test points x, regardless of the sparsity of β. On the other hand, if the eigenvalues decrease slowly (or the covariance matrix is close to $I_p$), this estimator is less efficient and consequently inferior to the LASSO when β is indeed sparse. In fact, the pointwise estimator using the sparsity of eigenvalues is complementary to the LASSO.

Second, to leverage information about β (such as β being sparse) and the sparsity of eigenvalues jointly, we construct another basis based on an initial estimator $\hat\beta$. It is shown that the two types of information help each other: a faster decay rate of the eigenvalues allows $\hat\beta$ to converge at a slower rate, and vice versa. When the test point x is a sparse vector, we show that the pointwise estimator performs well. The case of random x, as in prediction problems, is more complicated in the sense that the sparsity level of β should be taken into account. Hence, we consider a subset $S_1$ of $\{1,\ldots,p\}$ satisfying $S_1\supseteq\mathrm{supp}(\beta)$, where $S_1$ can be estimated from the data. Specifically, we consider two cases. (a) Let $S_1=\{1,\ldots,p\}$, which allows β to be sparse or dense. When the sparsity of eigenvalues holds, our pointwise estimator performs well. When the eigenvalues are less sparse (or the covariance matrix is close to $I_p$), our pointwise estimator performs similarly to existing results for dense β in the literature. (b) When β is sparse, a smaller $|S_1|$ leads to a better estimator. If a good initial estimator $\hat\beta$ and a good $S_1$ are available, our estimator performs similarly to the LASSO.

The rest of this article is organized as follows. In Section 2, we propose our pointwise estimator for the linear transformation $\beta^\top x$ in high-dimensional linear models. Theoretical properties are established in Sections 3 and 4. Simulated examples and a real data analysis are presented in Section 5, followed by a discussion in Section 6. Proofs of the theoretical results are provided in the supplementary materials.

Notation. We first introduce some notation used throughout the article. For any symmetric positive semidefinite matrix $A\in\mathbb{R}^{m\times m}$, denote the eigenvalues of A in decreasing order as $\lambda_1(A)\ge\cdots\ge\lambda_m(A)$, and the smallest nonzero eigenvalue as $\lambda_{\min}^{+}(A)$. For any matrix $A\in\mathbb{R}^{m_1\times m_2}$, $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the maximum and minimum singular values of A, respectively. For any vector $v=(v_1,\ldots,v_m)^\top\in\mathbb{R}^m$, $\|v\|$ and $\|v\|_1$ denote the $\ell_2$ and $\ell_1$ norms of v, respectively, and $\|v\|_\infty=\max_{1\le j\le m}|v_j|$; the support set of v is denoted as $\mathrm{supp}(v)$. In addition, define $\|v\|_A=\sqrt{v^\top Av}$ for any positive semidefinite matrix $A\in\mathbb{R}^{m\times m}$. For two sequences $\{a_n\}$ and $\{b_n\}$, both $a_n\lesssim b_n$ and $a_n=O(b_n)$ mean that $\lim_{n\to\infty}a_n/b_n\le c$ for some constant $c<\infty$; both $a_n\gtrsim b_n$ and $a_n=\Omega(b_n)$ mean that $\lim_{n\to\infty}a_n/b_n\ge c$ for some constant $c>0$; $a_n\asymp b_n$ means that $a_n$ has exactly the same order as $b_n$. For any integer i, let $e_i$ denote the vector of zeros except the ith element, which equals 1.

2 A Unified Framework for Pointwise Estimation

Suppose $(X_i,Y_i)$, $1\le i\le n$, are iid from the following linear regression model $Y_i=X_i^\top\beta+\epsilon_i$, $1\le i\le n$, (2.1) where $\epsilon_i\in\mathbb{R}$ satisfies $E(\epsilon_i)=0$ and $\mathrm{var}(\epsilon_i)=\sigma^2<\infty$, $X_i\in\mathbb{R}^p$ is independent of $\epsilon_i$ with $E(X_i)=0$, and $\mathrm{cov}(X_i)=\Sigma=(\sigma_{ij})$. Without loss of generality, we assume that $\sigma_{ii}=1$, $i=1,\ldots,p$, and that $\mathrm{var}(Y_i)<\infty$. Denote $X=(X_1,\ldots,X_n)^\top\in\mathbb{R}^{n\times p}$, $Y=(Y_1,\ldots,Y_n)^\top\in\mathbb{R}^n$, and $\varepsilon=(\epsilon_1,\ldots,\epsilon_n)^\top\in\mathbb{R}^n$. Then the model can be written as $Y=X\beta+\varepsilon$. Here the dimension p can be much larger than the sample size n. Let $x\in\mathbb{R}^p$ be a given point at which we intend to estimate $\beta^\top x$. We assume that $X_i^\top x\ne 0$ for some $1\le i\le n$, which can be checked numerically. Let $S_0=\mathrm{supp}(\beta)$ with cardinality $s_0=|S_0|$. Since a nonsparse β and the case where Σ might not be of full rank will be considered, we impose the following identifiability condition and discuss some useful facts.

  • When Σ is not of full rank, we assume that β lies in the column space of Σ for identifiability, for the following reason. Write $X_i=\Sigma^{1/2}\tilde X_i$, $1\le i\le n$, where $\tilde X_i$ satisfies $E(\tilde X_i)=0$ and $\mathrm{cov}(\tilde X_i)=I_p$. Let $P_\Sigma$ be the projection matrix onto the column space of Σ and $Q_\Sigma=I_p-P_\Sigma$. Then $X_i^\top\beta=\tilde X_i^\top\Sigma^{1/2}\beta=\tilde X_i^\top\Sigma^{1/2}(P_\Sigma+Q_\Sigma)\beta=X_i^\top P_\Sigma\beta$. Thus, the parameter can be taken as $P_\Sigma\beta$, which lies in the column space of Σ.

  • The magnitude of $\beta_j$, $j\in S_0$, depends on the sparsity level $s_0$. Note that $\lambda_{\min}^{+}(\Sigma)\|\beta\|^2\le\beta^\top\Sigma\beta<\mathrm{var}(Y_i)<\infty$, and consequently $\|\beta\|^2\le\mathrm{var}(Y_i)/\lambda_{\min}^{+}(\Sigma)$. Assume that the $\beta_j$'s with $j\in S_0$ are of the same magnitude. Then it follows that $|\beta_j|\le[\mathrm{var}(Y_i)/(s_0\lambda_{\min}^{+}(\Sigma))]^{1/2}$, $j\in S_0$. In particular, if $\lambda_{\min}^{+}(\Sigma)\asymp 1$, the $|\beta_j|$'s are at most of order $s_0^{-1/2}$, which can be small when $s_0$ is large.

In Section 2.1 we first introduce the transformed model, based on a set of basis vectors, to leverage multiple sources of information. The construction of the basis is discussed in Section 2.2. A penalized estimator and the pointwise estimator are proposed in Section 2.3.

2.1 The Transformed Model

For any fixed $x\in\mathbb{R}^p$, let $P_x=xx^\top/\|x\|^2$ be the projection matrix onto the space spanned by x and $Q_x=I_p-P_x$ the projection matrix onto the complementary space. Recall that $\gamma_x=\beta^\top x$ and denote $\beta_{Q_x}=Q_x\beta$. Then one can write $X\beta=XP_x\beta+XQ_x\beta=Xx\cdot\frac{x^\top\beta}{\|x\|^2}+\sqrt{n}\,\zeta_\beta:=\sqrt{n}\,z_x\cdot\alpha_x+\sqrt{n}\,\zeta_\beta$, (2.2) where $\alpha_x=\gamma_x\cdot\|x\|^{-2}\|Xx\|\,n^{-1/2}\in\mathbb{R}$, $z_x=Xx/\|Xx\|\in\mathbb{R}^n$, and $\zeta_\beta=n^{-1/2}X\beta_{Q_x}\in\mathbb{R}^n$. Here $\alpha_x$ is a scaled version of $\gamma_x$ such that the $\ell_2$ norm of the predictor $\sqrt{n}z_x$ equals $\sqrt{n}$. Estimating $\gamma_x$ is equivalent to estimating $\alpha_x$, since given $\alpha_x$, one can compute $\gamma_x$ directly from $(X,x)$. Then we get $Y=X\beta+\varepsilon=\sqrt{n}z_x\alpha_x+\sqrt{n}\zeta_\beta+\varepsilon$, where $\zeta_\beta$ is a nuisance parameter vector. Note that $\zeta_\beta$ is in general a nonsparse vector, particularly when the $X_i$'s are iid; thus, we have n + 1 nonsparse parameters with sample size n. To handle this difficulty, we introduce a basis matrix $\Gamma\in\mathbb{R}^{n\times n}$, built from different sources of information, such that $\zeta_\beta$ can be expressed sparsely in this basis.
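The decomposition (2.2) is purely algebraic and can be checked directly. The following minimal sketch (Python with NumPy; the dimensions and the data-generating choices are illustrative assumptions, not the paper's simulation settings) verifies it numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))            # training predictors
beta = rng.standard_normal(p)              # any beta, sparse or dense
x = np.zeros(p); x[0] = 1.0                # a test point, e.g. x = e_1

# Projections onto span{x} and its orthogonal complement
Px = np.outer(x, x) / (x @ x)
Qx = np.eye(p) - Px

gamma_x = beta @ x                                   # target linear functional
z_x = X @ x / np.linalg.norm(X @ x)                  # unit-norm direction of Xx
alpha_x = gamma_x * np.linalg.norm(X @ x) / (np.sqrt(n) * (x @ x))
zeta_beta = X @ (Qx @ beta) / np.sqrt(n)             # nuisance vector

# Identity (2.2): X beta = sqrt(n) * z_x * alpha_x + sqrt(n) * zeta_beta
lhs = X @ beta
rhs = np.sqrt(n) * z_x * alpha_x + np.sqrt(n) * zeta_beta
print(np.allclose(lhs, rhs))                         # True
```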

The construction of Γ depends on the information at hand and will be elaborated in Section 2.2. For an invertible matrix $\Gamma\in\mathbb{R}^{n\times n}$ whose columns have unit length (i.e., $\Gamma_{\cdot j}^\top\Gamma_{\cdot j}=1$, $1\le j\le n$), we write $\sqrt{n}\zeta_\beta=(\sqrt{n}\Gamma)(\Gamma^{-1}\zeta_\beta)=\sqrt{n}\Gamma\theta$, where $\theta=\Gamma^{-1}\zeta_\beta\in\mathbb{R}^n$. Although Γ can be any invertible matrix, as shown later, the case we are interested in is Γ being (approximately) orthogonal. We hope that θ is (approximately) sparse when Γ is chosen properly. Combining these together, we have the transformed linear model $Y=\sqrt{n}z_x\cdot\alpha_x+\sqrt{n}\Gamma\theta+\varepsilon=Z\alpha+\varepsilon$, (2.3) where $Z=\sqrt{n}(z_x,\Gamma)\in\mathbb{R}^{n\times(n+1)}$ and $\alpha=(\alpha_x,\theta^\top)^\top\in\mathbb{R}^{n+1}$. The parameter θ is treated as an n-dimensional nuisance parameter vector. As shown later in Section 2.2, Γ plays a critical role in this model, providing a flexible way to leverage different sources of information. A naive choice is $\Gamma=I_n$, which uses no additional information and is discussed further in Section F of the supplementary materials. In contrast to the p parameters of the original linear model, the transformed model (2.3) has only n + 1 unknown parameters, which are (approximately) sparse. Without loss of generality, we assume that $|\alpha_x|$ and consequently $\|\zeta_\beta\|$ are bounded, given x and X. This assumption is mild and holds in probability when x and the $X_i$'s are iid from $N(0,\Sigma)$. Detailed discussions are presented in Section A of the supplementary materials.

Denote by $\hat\alpha_x$ a generic estimator of $\alpha_x$. Then $\gamma_x$ can be estimated by $\hat\gamma_x=\hat\alpha_x\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$. Given x, noting that $n^{-1}\|Xx\|^2\to_p\|x\|_\Sigma^2$, where $\|x\|_\Sigma=\sqrt{x^\top\Sigma x}$, we see that the quantity $\|x\|^2/\|x\|_\Sigma$ affects the convergence rate of the estimator $\hat\gamma_x$. We investigate the magnitude of $\|x\|^2/\|x\|_\Sigma$ in two typical settings for clarity. Recall that x is a given vector, which may or may not have the same distribution as $X_i$.

Example 1.

Let x be a sparse vector with support set $S_x=\mathrm{supp}(x)$ and cardinality $|S_x|:=s_x$. Examples of such x include $x=e_i$ or $x=e_i-e_j$. Denote $X_{iS_x}=(X_{ij},\,j\in S_x)^\top$. If the eigenvalues of $\Sigma_{S_xS_x}=\mathrm{cov}(X_{iS_x})$ are bounded from above and below, and $\|x\|_\infty=O(1)$, then it follows that $\|x\|^2/\|x\|_\Sigma\asymp\|x\|\asymp s_x^{1/2}$.

Example 2.

Let x be a random vector as in prediction problems. For simplicity, we assume x and the $X_i$'s are iid from $N(0,\Sigma)$. As shown in Section 4.2, it follows that $\|x\|^2/\|x\|_\Sigma\asymp[\mathrm{tr}^2(\Sigma)/\mathrm{tr}(\Sigma^2)]^{1/2}:=M_\Sigma$, (2.4) in probability. Moreover, it holds that $1\lesssim M_\Sigma\le p^{1/2}$, and $M_\Sigma^2$ can be viewed as the effective rank of Σ. In particular, if some eigenvalues of Σ are much larger than the rest (e.g., the largest eigenvalue has order p), then $M_\Sigma\asymp 1$; if Σ is close to $I_p$, then $M_\Sigma$ is close to $p^{1/2}$ and $\gamma_x$ is close to $p^{1/2}\alpha_x$.
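As a quick illustration of the two extremes of $M_\Sigma$, consider the following sketch (the two covariance matrices below are hypothetical choices made only to show the range of values):

```python
import numpy as np

def M_sigma(Sigma):
    # M_Sigma = [tr(Sigma)^2 / tr(Sigma^2)]^{1/2}, which lies between 1 and sqrt(p)
    return np.sqrt(np.trace(Sigma) ** 2 / np.trace(Sigma @ Sigma))

p = 1000
print(M_sigma(np.eye(p)))          # Sigma = I_p: M_Sigma = sqrt(p), about 31.6

eigs = np.ones(p)
eigs[0] = p                        # one spiked eigenvalue of order p
print(M_sigma(np.diag(eigs)))      # M_Sigma of order 1 (about 2 here)
```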

Clearly, the magnitude of $\|x\|^2/\|x\|_\Sigma$ for a sparse x as in Example 1 can be smaller than that for the dense x in Example 2. If x is not sparse, the sparsity of β can help as well. This observation motivates us to consider estimators that use information on the sparsity level of β. For any subset $S_1\subseteq\{1,\ldots,p\}$ such that $S_0\subseteq S_1$, we observe that $\gamma_x=\beta^\top x=\beta_{S_0}^\top x_{S_0}=\beta_{S_1}^\top x_{S_1}=\beta^\top\tilde x_{S_1}=\gamma_{\tilde x_{S_1}}$, where $x_{S_0}$ and $\beta_{S_0}$ are the subvectors of x and β, respectively, and $\tilde x_{S_1}$ is the p-dimensional vector obtained from x by setting $x_{S_1^c}=0$. Thus, instead of estimating $\gamma_x$, one can equivalently consider prediction at the point $\tilde x_{S_1}$, which is a sparse vector when $|S_1|$ is small. Clearly, one can set $S_1=\{1,\ldots,p\}$, and then $\gamma_{\tilde x_{S_1}}$ becomes $\gamma_x$. Estimating $\gamma_{\tilde x_{S_1}}$ provides a way to take the sparsity level of β into account. In practice, one can choose different $S_1$'s and select the best one by CV, as shown later in Section 2.3. In Section 4 we compare our method with several existing methods in detail for Examples 1 and 2.
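The identity $\gamma_x=\gamma_{\tilde x_{S_1}}$ only requires $S_1\supseteq S_0$ and is easy to check numerically; a minimal sketch (the specific β, x, and $S_1$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200
beta = np.zeros(p); beta[:5] = 1.0        # sparse beta with S0 = {0,...,4}
x = rng.standard_normal(p)                # a dense test point

S1 = np.arange(10)                        # any superset of supp(beta)
x_tilde = np.zeros(p)
x_tilde[S1] = x[S1]                       # set x to zero outside S1

print(np.isclose(beta @ x, beta @ x_tilde))   # True: gamma_x = gamma_{x_tilde_S1}
```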

Remark 1.

In Example 2 above, a smaller set S1 is preferred. However, the true support set S0 is unknown. When s0 is small, choosing a set S1 that covers S0 is feasible. For example, S1 can be taken as the support set of the LASSO estimator, or constructed by some screening methods (Fan and Lv Citation2008).

2.2 Construct Basis Utilizing Multiple Sources of Information

Next we discuss the construction of Γ. For a positive semidefinite matrix $A\in\mathbb{R}^{p\times p}$, denote by $\lambda(A)=(\lambda_1(A),\ldots,\lambda_p(A))^\top$ the vector of eigenvalues of A in decreasing order. We say that $\lambda(A)$ is approximately sparse when only a few eigenvalues are much larger than the average $p^{-1}\sum_{i=1}^p\lambda_i(A)$; the detailed requirements on the decay rate of the eigenvalues will be elaborated later. In practice, different sources of information may be available. For clarity of presentation, we focus on two sources of information.

Source I. We use the information of β through an initial estimator β̂. Note that a good estimator is available in some cases. In particular, if β is sparse, then β̂ can be obtained from that of LASSO, or other sparse regression methods. However, an estimator β̂ may not be good enough in many cases especially when β is less sparse and p is larger than n.

Source II. We use the dependence among the predictors. Note that $\lambda(n^{-1}XQ_xX^\top)=\lambda(n^{-1}XQ_xQ_xX^\top)$, which equals the first n elements of $\lambda(n^{-1}Q_xX^\top XQ_x)$, whose population version is $\lambda(Q_x\Sigma Q_x)$. When the predictors are correlated such that $\lambda(n^{-1}XQ_xX^\top)$ is (approximately) sparse, a key observation is that, by choosing a suitable Γ, the parameter θ can be (approximately) sparse regardless of whether β is sparse (details are given in Section 3.2). Thus, it is feasible to estimate the parameters well in the transformed model (2.3).

Denote $C_1=\{\beta\text{ is sparse}\}$ and $C_2=\{\lambda(n^{-1}XQ_xX^\top)\text{ is sparse}\}$, and let $\tilde C=C_1^c\cap C_2^c$. The sparsity in $C_2$ is complementary to that in $C_1$. Both $C_1$ and $C_2$ are ideal cases with good estimators available (we introduce estimators for case $C_2$ later), and the case $\tilde C$ can be the least favorable one. There are intermediate cases between $C_1$ and $\tilde C$ as the sparsity level of β decreases gradually, and a similar argument applies between $C_2$ and $\tilde C$. To handle these complicated cases, it is natural to use the two sources of information jointly. In fact, our approach works well under the complementary condition in Section 3.3, where stronger requirements on one type of sparsity weaken those on the other. It is possible that both $C_1$ and $C_2$ hold simultaneously, in which case our estimators leverage both sources of information.

Denote the spectral decomposition of $n^{-1}XQ_xX^\top$ as $\Gamma_{eg}\Psi\Gamma_{eg}^\top$, where $\Gamma_{eg}=(u_{eg,1},\ldots,u_{eg,n})\in\mathbb{R}^{n\times n}$ collects the eigenvectors and $\Psi=\mathrm{diag}(\psi_1,\ldots,\psi_n)$ is the diagonal matrix of the associated eigenvalues in decreasing order. Denote $\bar\zeta_\beta=\zeta_\beta/\|\zeta_\beta\|=X\beta_{Q_x}/\|X\beta_{Q_x}\|$. When an initial estimator $\hat\beta$ is available, $\bar\zeta_\beta$ can be estimated by $\bar\zeta_{\hat\beta}$. To use the two sources of information jointly, that is, to use both $\bar\zeta_{\hat\beta}$ and the columns of $\Gamma_{eg}$, we construct Γ by replacing one column (say the ith) of $\Gamma_{eg}$ by $\bar\zeta_{\hat\beta}$, that is, $\Gamma=\Gamma(\hat\beta)$, defined as $\Gamma(\hat\beta)=(\bar\zeta_{\hat\beta},\,u_{eg,j},\,j\ne i)$, (2.5) which is the empirical version of $\Gamma(\beta)=(\bar\zeta_\beta,\,u_{eg,j},\,j\ne i)$. More discussion on this construction is provided in Proposition 1 and Remark 2.

Recall that $\alpha_x$, associated with the predictor $\sqrt{n}z_x$, is the parameter of interest in the transformed model, which has predictors $\sqrt{n}(z_x,\Gamma)$. It is desirable to avoid collinearity between $z_x$ and the other predictors in the transformed model. Hence, we assume that $\hat\beta$ satisfies $\bar\zeta_{\hat\beta}\ne z_x$, which can be checked from the data, and require the matrix $(z_x,\,u_{eg,j},\,j\ne i)$ to be invertible. Note that it is also required that $\Gamma(\hat\beta)$, that is, $(\bar\zeta_{\hat\beta},\,u_{eg,j},\,j\ne i)$, be invertible.

Proposition 1.

Suppose that $\hat\beta$ satisfies $|z_x^\top\bar\zeta_{\hat\beta}|>0$ and $\bar\zeta_{\hat\beta}\ne z_x$. Then for at least one $i\in\{1,\ldots,n\}$, both matrices $(\bar\zeta_{\hat\beta},\,u_{eg,j},\,j\ne i)$ and $(z_x,\,u_{eg,j},\,j\ne i)$ are invertible, or equivalently, $\min\{|u_{eg,i}^\top\bar\zeta_{\hat\beta}|,\,|u_{eg,i}^\top z_x|\}>0$.

In principle, we can replace any $u_{eg,i}$ by $\bar\zeta_{\hat\beta}$ as long as $\min\{|u_{eg,i}^\top\bar\zeta_{\hat\beta}|,\,|u_{eg,i}^\top z_x|\}>0$. In our numerical studies, the strategy in Remark 2 below is used to further reduce collinearity.

Remark 2.

Reducing the collinearity between $z_x$ and the other predictors in the transformed model makes the estimator of $\alpha_x$ more stable. In our simulation studies, we replace $u_{eg,i_0}$ by $\bar\zeta_{\hat\beta}$ with $i_0=\arg\max_{1\le i\le n}|u_{eg,i}^\top z_x|$, and the resulting $\Gamma(\hat\beta)$ is observed to be nonsingular numerically.
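The following sketch illustrates the construction of $\Gamma_{eg}$ and $\Gamma(\hat\beta)$ using the replacement rule of Remark 2 (a minimal sketch in Python; the data generation, the LASSO tuning value, and the choice of initial estimator are assumptions made only for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0
Y = X @ beta + rng.standard_normal(n)
x = np.zeros(p); x[0] = 1.0                    # test point

Qx = np.eye(p) - np.outer(x, x) / (x @ x)
z_x = X @ x / np.linalg.norm(X @ x)

# Gamma_eg: eigenvectors of the n x n matrix n^{-1} X Qx X^T, in decreasing order
M = X @ Qx @ X.T / n
eigvals, vecs = np.linalg.eigh(M)              # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
Gamma_eg = vecs[:, order]

# Gamma(beta_hat): replace the column most collinear with z_x (Remark 2)
beta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y).coef_   # initial estimator
zeta_hat = X @ (Qx @ beta_hat) / np.sqrt(n)
zeta_bar = zeta_hat / np.linalg.norm(zeta_hat)                     # unit-length column
i0 = np.argmax(np.abs(Gamma_eg.T @ z_x))
Gamma = Gamma_eg.copy()
Gamma[:, i0] = zeta_bar
```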

Naturally, one can use only the information in Source II by setting $\Gamma=\Gamma_{eg}$. A good property of this choice is that it depends only on X, requires no initial estimator $\hat\beta$, and is free of any assumption on the sparsity of β. However, this choice may not be ideal if a reasonably good initial estimator $\hat\beta$ can be obtained. We summarize the constructions of Γ used in this article in Table 1.

Table 1 Candidates of Γ for the pointwise estimator.

In Section 3, we show that when $\lambda(n^{-1}XQ_xX^\top)$ is sparse enough, $\theta=\Gamma_{eg}^{-1}\zeta_\beta$ is approximately sparse without any sparsity assumption on β. An extreme case is Σ being a low-rank matrix, in which θ is exactly sparse with at most $\mathrm{rank}(\Sigma)+1$ nonzero elements. It is worth pointing out that which elements of θ are large is generally unknown, since β is involved. When $\lambda(n^{-1}XQ_xX^\top)$ is less sparse, as discussed below, $\hat\beta$ provides additional information and θ can still be approximately sparse.

When $\Gamma=\Gamma(\hat\beta)$, the sparsity of θ also depends on the accuracy of $\hat\beta$. For the ideal case $\hat\beta=\beta$, it can be shown that $\theta=\Gamma(\beta)^{-1}\zeta_\beta\propto e_1=(1,0,\ldots,0)^\top$, and consequently θ is a sparse vector. This argument still holds if we replace $(u_{eg,j},\,j\ne i_0)$ by any other vectors such that $\Gamma(\beta)$ is invertible, implying that if β were known, there would be no need for additional information. Consequently, if $\hat\beta$ is good, $(u_{eg,j},\,j\ne i_0)$ do not help much. When $\hat\beta$ is not good enough (e.g., when β is less sparse) but $\lambda(n^{-1}XQ_xX^\top)$ is sufficiently sparse, using the $u_{eg,i}$'s compensates for the low accuracy of $\hat\beta$. Thus, the two types of information help each other in our framework, and the estimator becomes more robust to the underlying assumptions. Details are provided in Section 3.

2.3 Penalized Estimator and Pointwise Estimator

As discussed in Section 2.2, the parameters in the transformed model (2.3) can be approximately sparse if Γ is constructed properly. Thus, we consider minimizing the following objective function $L_{\lambda,\Gamma}(\alpha)=n^{-1}\|Y-Z\alpha\|^2+\mathrm{Pen}_\lambda(\alpha)$, (2.6) where $\mathrm{Pen}_\lambda(\alpha)=\lambda\|\alpha\|_1$ is the $\ell_1$ penalty used by the LASSO, and λ and Γ are tuning parameters. The parameter λ plays the same role as in the usual regularized estimator and can be selected by cross-validation (CV). The selection of Γ is elaborated below. Since the $\ell_1$ penalty usually induces bias for coefficients of large absolute value, nonconvex penalty functions such as the SCAD (Fan and Li Citation2001) or MCP (Zhang Citation2010) can be used instead. Denote the minimizer as $\hat\alpha_{\lambda,\Gamma}=(\hat\alpha_x,\hat\theta^\top)^\top=\arg\min_{\alpha\in\mathbb{R}^{n+1}}L_{\lambda,\Gamma}(\alpha)$. (2.7)

Then we have $\hat\gamma_x=\hat\alpha_x\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$.
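Since (2.6) is an ordinary lasso problem on the transformed design Z, it can be solved with an off-the-shelf solver. A minimal sketch using scikit-learn's Lasso (whose penalty parameterization differs from (2.6) by a factor of 2; the objects z_x and Gamma from the earlier sketches are assumed to be available):

```python
import numpy as np
from sklearn.linear_model import Lasso

def pointwise_estimate(Y, X, x, z_x, Gamma, lam):
    """Solve (2.6) for a given Gamma and lambda, and return gamma_hat_x."""
    n = X.shape[0]
    Z = np.sqrt(n) * np.column_stack([z_x, Gamma])        # n x (n+1) design
    # sklearn's Lasso minimizes (2n)^{-1}||Y - Z a||^2 + alpha ||a||_1,
    # so alpha = lam / 2 matches n^{-1}||Y - Z a||^2 + lam ||a||_1 in (2.6)
    fit = Lasso(alpha=lam / 2, fit_intercept=False, max_iter=50_000).fit(Z, Y)
    alpha_x_hat, theta_hat = fit.coef_[0], fit.coef_[1:]
    gamma_x_hat = alpha_x_hat * (x @ x) * np.sqrt(n) / np.linalg.norm(X @ x)
    return gamma_x_hat, alpha_x_hat, theta_hat
```

In practice λ (and Γ) are chosen by CV, as described in Algorithms 1 and 2 below.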

Remark 3.

Our approach involves the eigenvalue decomposition of the matrix $n^{-1}XQ_xX^\top\in\mathbb{R}^{n\times n}$, whose computational complexity is of order $O(n^3)$ and can be a burden when n is large. To reduce the complexity, the Divide-and-Conquer (DC) approach for handling big data can be used (Zhang, Duchi, and Wainwright Citation2015). Simulation results based on DC are presented in Section G.2 of the supplementary materials.

Next we briefly discuss the estimator in (2.7). First, the predictor Z in Model (2.3) involves the transformation matrix Γ, while in the classical model (2.1), the predictor is X, of which each row represents a realization of the predictors. Second, unlike the classical methods such as the LASSO, where the parameters are unknown constants, the parameter α here involves (x,X,Γ), which makes the theoretical analysis challenging.

Remark 4.

In the above arguments, we consider a single point x that may or may not be from the same distribution as the predictor vector. However, if we consider a large number of test points $\{x_i,\,i=1,\ldots,n_{te}\}$ that are iid observations of $X_i$, the bias should be taken into account. Because $E(\gamma_{x_i})=E(\beta^\top x_i)=0$ and $\mathrm{var}(\gamma_{x_i})=\beta^\top\Sigma\beta<\mathrm{var}(Y_1)$, many of the $\gamma_{x_i}$'s are close to 0, and the corresponding estimates are shrunk to 0 by the penalization, resulting in large biases in terms of the average estimation error. There are many bias-correction methods for the LASSO and related penalized estimators in the literature (Belloni and Chernozhukov Citation2013; Zhang and Zhang Citation2014, etc.). In our simulation results for the prediction problem, the method of Belloni and Chernozhukov (Citation2013) is applied.

For clarity, we briefly summarize the estimation procedure for a given Γ as follows.

Algorithm 1

Estimator of $\gamma_x$ (or $\gamma_{\tilde x_{S_1}}$) with a given Γ

1. (Estimate $\alpha_x$) Solve the optimization problem (2.6), where the optimal λ is chosen by CV; denote the resulting estimate by $(\hat\alpha_x,\hat\theta^\top)^\top$.

2. (Bias correction) This step is optional. Denote $\hat S=\mathrm{supp}(\hat\theta)$ and let $\Gamma_{\cdot\hat S}$ be the columns of Γ with indices in $\hat S$. Apply OLS with response Y and predictors $\sqrt{n}(z_x,\Gamma_{\cdot\hat S})$ to obtain the updated coefficient $\hat\alpha_x$ of $\sqrt{n}z_x$, which is a bias-corrected estimator of $\alpha_x$ (a code sketch of this refit is given after the algorithm).

3. (Estimate $\gamma_x$) $\gamma_x$ is estimated by $\hat\gamma_x=\hat\alpha_x\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$.

4. (Alternative Steps 1–3) Replacing x by $\tilde x_{S_1}$ in Steps 1–3, we get the estimator $\hat\gamma_{\tilde x_{S_1}}$.
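A minimal sketch of the optional bias correction in Step 2 (the inputs are assumed to come from the fitting sketch above; this is an illustration, not the authors' code):

```python
import numpy as np

def bias_correct(Y, X, x, z_x, Gamma, theta_hat):
    """Refit OLS on sqrt(n)*(z_x, Gamma_{.S_hat}) and return the corrected gamma_hat_x."""
    n = X.shape[0]
    S_hat = np.flatnonzero(theta_hat)                          # support of theta_hat
    Z_sel = np.sqrt(n) * np.column_stack([z_x, Gamma[:, S_hat]])
    coef, *_ = np.linalg.lstsq(Z_sel, Y, rcond=None)           # OLS refit
    alpha_x_corr = coef[0]                                     # bias-corrected alpha_x
    return alpha_x_corr * (x @ x) * np.sqrt(n) / np.linalg.norm(X @ x)
```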

Step 2 is mainly applied in the setting of Remark 4, with a large number of test points that are iid observations from the same distribution as the training data $X_i$'s. For clarity, the pointwise estimator obtained by Algorithm 1 is named after the specific Γ used. Many estimators of β are available for different settings in the literature and can be used as initial estimators. Denote by $\hat\beta_{lasso}$ and $\hat\beta_{ridge}$ the LASSO and ridge regression estimators, respectively, and by $\hat\beta_{rdl}$ the overparameterized ridgeless OLS estimator based on the Moore-Penrose generalized inverse (Bartlett et al. Citation2020; Azriel and Schwartzman Citation2020; Hastie et al. Citation2022). We consider the estimators $\hat\gamma_x$ from Algorithm 1 with Γ being $\Gamma_{eg}$ and $\Gamma(\hat\beta)$ for $\hat\beta\in\{\hat\beta_{lasso},\hat\beta_{ridge},\hat\beta_{rdl}\}$, and denote the resulting estimators by Peg, Plasso, Pridge, and Prdl, respectively. We now propose a procedure to select Γ adaptively. Denote by $\mathcal{M}$ a set of estimators, for example, $\mathcal{M}=\{$Plasso, Pridge, Peg, Prdl$\}$ in Setting 1 of our numerical study. The best estimator selected from $\mathcal{M}$ by the following CV procedure is denoted as PWE.

Algorithm 2

Select Γ by CV

1. Compute the CV error for the estimators in $\mathcal{M}$. Split the whole dataset D of size n at random into K parts, denoted $D_1,\ldots,D_K$. For each method $A\in\mathcal{M}$, compute the CV error $n^{-1}\sum_{k=1}^K\sum_{(x_i,y_i)\in D_k}(\hat\gamma^A_{x_i}-y_i)^2$, where $\hat\gamma^A_{x_i}$ is estimated by A using the data $D\setminus D_k$ (a sketch of this CV loop is given after the algorithm).

2. The method A0 in M with the minimum CV error is chosen to be the best one.

3. The final estimator of γx is γ̂xA0.
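A minimal sketch of this CV selection (the candidate constructors and the fold count are assumptions; each candidate is any routine that fits on training data and returns a pointwise predictor):

```python
import numpy as np

def select_by_cv(estimators, X, Y, K=5, seed=0):
    """Algorithm 2 sketch: return the candidate in M with the smallest K-fold CV error.

    `estimators` maps a name (e.g. "Plasso", "Peg") to a function
    fit(X_train, Y_train) -> predict, where predict(x) returns gamma_hat_x.
    """
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv_err = {}
    for name, make_predictor in estimators.items():
        err = 0.0
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            predict = make_predictor(X[train], Y[train])
            err += sum((predict(X[i]) - Y[i]) ** 2 for i in test)
        cv_err[name] = err / n
    best = min(cv_err, key=cv_err.get)
    return best, cv_err
```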

For Example 1 in Section 2.1, where x is a sparse vector, we apply Algorithm 2 to get the pointwise estimator. For the prediction problems of Example 2, since the sparsity level of β is unknown, we have two choices for the subset $S_1$: (a) simply take $S_1=\{1,\ldots,p\}$ (i.e., estimate $\gamma_x$ directly); (b) estimate $S_1$ from the data using the LASSO or screening methods. Given Γ, among the candidates for $S_1$ we can select the best one by CV, similarly to Steps 1 and 2 of Algorithm 2. Note that $x_i$ should be replaced by $\tilde x_{iS_1}$, defined analogously to $\tilde x_{S_1}$, when computing the CV error in Step 1 of Algorithm 2. Moreover, one can select both Γ and $S_1$ simultaneously by CV.

3 Properties of the Penalized Estimator α̂x

Throughout our theoretical analysis, it is assumed that ϵi’s are iid from N(0,σ2), and (x,X) can be fixed or random and will be specified later. In this section, we present theoretical properties of the regularized estimator α̂x in (2.7), which lays a foundation for properties of the estimator γ̂x in Section 4. To this end, we first establish results for a generic invertible matrix Γ with fixed (x,X) in Section 3.1, and then apply it to Γeg and Γ(β̂) in Sections 3.2 and 3.3, respectively.

3.1 A General Result on α̂x with a Generic Γ and Fixed (x,X)

Recall that $Z=\sqrt{n}(z_x,\Gamma)$ in model (2.3). Let $v_x=(1,-b_x^\top)^\top/(1+b_x^\top b_x)^{1/2}$ with $b_x=\Gamma^{-1}z_x$. It can be shown that $v_x$ is the eigenvector associated with the zero eigenvalue of $n^{-1}Z^\top Z$ (see the proof of Theorem E.1 in the supplementary materials). For any $v=(v_1,\ldots,v_{n+1})^\top\in\mathbb{R}^{n+1}$, denote $\|v\|_{(q)}=\sum_{i=1}^{n+1}|v_i|^q$, $q\in[0,1]$. It is worth noting that $\|v\|_{(q)}$ is not the $\ell_q$ norm of v, which is defined as $\|v\|_q=\|v\|_{(q)}^{1/q}$; for q = 1 they coincide. The notation $\|v\|_{(q)}$ is convenient for our purposes, particularly for the case q = 0 or $q\to 0$.

For $\alpha=(\alpha_x,\theta^\top)^\top$ and $q\in[0,1]$, denote $R_q=\|\alpha\|_{(q)}=|\alpha_x|^q+\|\theta\|_{(q)}$, which measures the sparsity of the parameter α. In particular, when q = 0, $R_q$ is the number of nonzero elements of α. Let the tuning parameter $\lambda=\lambda_n:=a_0\sigma\sqrt{(\log n)/n}$ with $a_0\ge 2$. We have the following result for a generic invertible matrix Γ.

Theorem 1.

Assume that $(x,X)$ is fixed. Let Γ be a generic invertible matrix that may depend on $(x,X)$. Assume the following α-sparsity condition: $R_q\le C\lambda_n^{q}\|v_x\|_1^{2}$ for some $q\in[0,1]$ and some constant C > 0. Then with probability $1-C_1n^{-3}$, we have $|\hat\alpha_x-\alpha_x|\le\tfrac{5}{4}\lambda_n+C_\kappa\lambda_n^{1-q}R_q\cdot\min\{\lambda_n^{q/2}R_q^{-1/2}\|\Gamma^{-1}z_x\|,\ \|\Gamma^\top z_x\|\}$, where $C_1$ and $C_\kappa$ are positive constants. Furthermore, when Γ is a generic orthogonal matrix, the α-sparsity condition simplifies to $R_q\le C'\lambda_n^{q}\|\Gamma^\top z_x\|_1^{2}$ for some constant $C'>0$.

We briefly discuss the above result. First, note the two quantities $\|\Gamma^{-1}z_x\|$ and $\|\Gamma^\top z_x\|$ in the bound; the latter equals 1 when Γ is an orthogonal matrix. Further discussion is given in Section A of the supplementary materials. Second, since $\lambda_n$ is of order $\sqrt{\log n/n}$, the bound in Theorem 1 does not explicitly depend on p, which is reasonable since there are only n + 1 parameters in the transformed model (2.3). Third, the bound involves the sparsity of the parameter vector α in the transformed model rather than that of β, allowing β to be sparse or non-sparse. As an application of Theorem 1, we present results for Γ being $\Gamma_{eg}$ and $\Gamma(\hat\beta)$ below.

Proposition 2.

Suppose that $(x,X)$ is fixed. Let $\Gamma=\Gamma_{eg}$. Assume that the α-sparsity condition $R_q\le C\lambda_n^{q}\|\Gamma^\top z_x\|_1^{2}$ holds for some constant C > 0. Then $|\hat\alpha_x-\alpha_x|\le C'\lambda_n^{1-q/2}R_q^{1/2}$ with probability $1-C_1n^{-3}$, where $C_1$ and $C'$ are positive constants.

Similar to Theorem 1, Proposition 2 allows β to be sparse or non-sparse. Since $\alpha_x$ is bounded, we have $R_q\lesssim 1+\|\theta\|_{(q)}$, which can be bounded as shown in Section 3.2. When the $X_i$'s are random, a lower bound on $\|\Gamma_{eg}^\top z_x\|_1$ is given in Section E of the supplementary materials.

3.2 Sparsity of θ and Properties of α̂x when Γ=Γeg

We first show that for $\Gamma=\Gamma_{eg}$, if the eigenvalues of $n^{-1}XQ_xX^\top$ decrease at a certain rate, then θ is approximately sparse; specifically, $\|\theta\|_{(q)}$ is bounded for some $q\in[0,1]$. We then give asymptotic results on $\hat\alpha_x$. Recall that $\psi_1\ge\psi_2\ge\cdots\ge\psi_n\ge 0$ are the eigenvalues of $n^{-1}XQ_xX^\top$; they are also the first n eigenvalues of $n^{-1}Q_xX^\top XQ_x$. For $q\in[0,1]$ and $k=0,1,\ldots,n-1$, let $\bar\psi_{q,k}=(n-k)^{-1}\sum_{i=k+1}^n\psi_i^{q/(2-q)}$, and let $\phi_k^{1/2}(\beta)$ be the norm of the projection of β onto the subspace spanned by the eigenvectors of $n^{-1}Q_xX^\top XQ_x$ associated with the eigenvalues $\psi_{k+1},\ldots,\psi_n$; set $\phi_n(\beta)=\bar\psi_{q,n}=0$. Here $\bar\psi_{q,k}$ measures the average magnitude of the smallest $n-k$ eigenvalues. For any $x_1,x_2\ge 0$, denote $H_{q,k}(x_1,x_2)=k^{1-q/2}+[(n-k)x_1]^{1-q/2}x_2^{q/2}$. The following Lemma 1 is a deterministic result on $\|\theta\|_{(q)}$.

Lemma 1.

For $q\in[0,1]$, it holds that $\|\theta\|_{(q)}\le\min_{0\le k\le n}\{\|\zeta_\beta\|^qk^{1-q/2}+[(n-k)\bar\psi_{q,k}]^{1-q/2}\phi_k^{q/2}(\beta)\}$. Moreover, since $\|\zeta_\beta\|=O(1)$ in model (2.3), it holds that $\|\theta\|_{(q)}=O(\min_{0\le k\le n}H_{q,k}(\bar\psi_{q,k},\phi_k(\beta)))$.

Generally, it is difficult to obtain a sharp upper bound from Lemma 1 without any information on the eigenvalues and β. Simply setting k = n yields the trivial bound $\|\theta\|_{(q)}\le H_{q,n}(\bar\psi_{q,n},\phi_n(\beta))=n^{1-q/2}$, (3.1) which, however, is unbounded and undesirable. To obtain better bounds on $\|\theta\|_{(q)}$, we take into account the properties of the eigenvalues and of β, and consider the following three cases:

  • Case (a): $n^{-1}X^\top X$ has low rank, say $\mathrm{rank}(n^{-1}X^\top X)=r_X=O(1)$;

  • Case (b): β is sparse with $\|\beta\|_0=s_0$ and $\|\beta\|\le C<\infty$;

  • Case (c): β is dense, following a normal distribution $N(0,\Sigma_\beta)$.

Results are presented in Corollary 1 and Proposition 3, respectively.

Corollary 1.

The following conclusions are deterministic.

1. For Case (a), it holds that $\bar\psi_{q,k}=0$ for $k\ge r_X+1$, and $\|\theta\|_{(q)}=O(r_X^{1-q/2})=O(1)$.

2. For Case (b), assume that n < p. Then $\|\theta\|_{(q)}\lesssim\min_{0\le k\le n}H_{q,k}(\bar\psi_{q,k},s_0)$. Thus, if $\bar\psi_{q,k_0}=O(s_0^{-q/(2-q)}n^{-1})$ for some fixed $k_0$, then $\|\theta\|_{(q)}=O(k_0^{1-q/2})=O(1)$.

In Corollary 1, conclusion (1) holds regardless of the sparsity of β, while conclusion (2) relaxes the conditions on the eigenvalues by taking advantage of the sparsity of β. For Case (c), we assume that the column space of $\Sigma_\beta$ is the same as that of Σ, which accommodates the identifiability condition $\beta\in\mathrm{span}(\Sigma)$. We have the following results.

Proposition 3.

Suppose that $(X,x)$ is fixed. For Case (c) with $\mathrm{span}(\Sigma_\beta)=\mathrm{span}(\Sigma)$, assume the following conditions: (i) $\|\zeta_\beta\|=O_p(1)$; (ii) $(np)^{-1}\sum_{i=1}^nX_i^\top Q_xX_i\asymp 1$; (iii) n < p. Then it holds that $\phi_k(\beta)=O_p(d_\beta n/p)$ for all k and that $\|\theta\|_{(q)}=O_p(\min_{0\le k\le n}H_{q,k}(\bar\psi_{q,k},d_\beta n/p))$, where $d_\beta=\lambda_{\max}(\Sigma_\beta)/\lambda_{\min}^{+}(\Sigma_\beta)$ can be viewed as the condition number of $\Sigma_\beta$. For clarity, we assume that $d_\beta=O(1)$ and consider the following two specific examples of eigenvalue behavior:

  1. Suppose that $\psi_1\ge\cdots\ge\psi_{k_0}=\Omega(p)$ and $\max_{j\ge k_0+1}\psi_j=O(pn^{-2/q})$ for some fixed $k_0$. Then $\|\theta\|_{(q)}=O_p(k_0^{1-q/2})=O_p(1)$.

  2. Denote by $f_\psi(i)=\psi_i/\sum_{i=1}^n\psi_i$ the scaled version of $\psi_i$, $i=1,\ldots,n$. Assume that $f_\psi(i)$ decreases exponentially, that is, $f_\psi(w)=a\exp(-a(w-1))$ for $w\ge 1$, where $a=a_n>0$ may depend on n. Then $\|\theta\|_{(q)}=O_p(1)$ if $a_n=\Omega(q^{-1}\log n)$.

Condition (i) is a natural extension of the condition $\|\zeta_\beta\|=O(1)$ for fixed β in Section 2. Condition (ii) is mild, ruling out the extreme case in which Σ has eigenvalues $(p,0,\ldots,0)$. Details are given in the proof in the supplementary materials. The two examples in Proposition 3 imply that θ can be approximately sparse when the eigenvalues decrease fast enough. Based on the results on $\|\theta\|_{(q)}$ in Corollary 1 and Proposition 3, and applying Proposition 2, we are ready to give the asymptotic results for $\hat\alpha_x$.

Theorem 2.

Suppose that $(x,X)$ is fixed. Taking $\Gamma=\Gamma_{eg}$, we have the following conclusions for Cases (a)–(c):

  1. For Case (a), $|\hat\alpha_x-\alpha_x|=O_p(\lambda_nr_X^{1/2})$. For Case (b), suppose that $\lambda_n^{q}\|\Gamma^\top z_x\|_1^{2}=\Omega(1)$ and that n < p; if $\bar\psi_{q,k_0}=O(s_0^{-q/(2-q)}n^{-1})$ for some fixed $k_0$, then $|\hat\alpha_x-\alpha_x|=O_p(\lambda_n^{1-q/2}k_0^{1/2-q/4})$.

  2. For Case (c) of a dense β, assume that the conditions of Proposition 3 hold and that $\lambda_n^{q}\|\Gamma^\top z_x\|_1^{2}=\Omega(1)$. If the $\psi_i$'s are as in Proposition 3 (1), then $|\hat\alpha_x-\alpha_x|=O_p(\lambda_n^{1-q/2}k_0^{1/2-q/4})$; if the $\psi_i$'s are as in Proposition 3 (2), then $|\hat\alpha_x-\alpha_x|=O_p(\lambda_n^{1-q/2})$.

The condition $\lambda_n^{q}\|\Gamma^\top z_x\|_1^{2}=\Omega(1)$ above can be checked from the data. Here $k_0$ and $r_X$ reflect the sparsity of the parameters in the transformed model. Faster decay of the eigenvalues results in smaller values of $k_0$ and $r_X$, and consequently better convergence rates for $|\hat\alpha_x-\alpha_x|$. Moreover, a smaller value of q results in a smaller value of $\lambda_n^{1-q/2}$, but requires a faster decay of the eigenvalues. Note that the above bounds depend only on n and the degree of sparsity of the eigenvalues, and do not explicitly depend on p. Moreover, for Cases (a) and (c), where β is allowed to be dense, the sparsity of eigenvalues is complementary to the sparsity of β.

3.3 Properties of the Regularized Estimator α̂x with an Initial β̂

We study the properties of our estimator with $\Gamma(\hat\beta)$, giving the convergence rate of $\hat\alpha_x$ for fixed x and $X_i$'s iid from $N(0,\Sigma)$. The normality assumption on the $X_i$'s simplifies the proofs and can be relaxed to more general distributions such as sub-Gaussian distributions. We first give a result on $\|\bar\zeta_{\hat\beta}-\bar\zeta_\beta\|$. Recall that $\zeta_{\hat\beta}=n^{-1/2}X\hat\beta_{Q_x}$.

Proposition 4.

Assume that x is fixed, the $X_i$'s are iid from $N(0,\Sigma)$, and $\mathrm{cov}(X_i^\top\beta_{Q_x})\asymp 1$. For any estimator $\hat\beta$ in model (2.1), it holds that $\|\bar\zeta_{\hat\beta}-\bar\zeta_\beta\|=O_p(\min\{2,\|\zeta_{\hat\beta}-\zeta_\beta\|\})$. Assuming further that $\|x\|_\infty\|x\|_\Sigma/\|x\|^2=O(1)$, we have $\|\zeta_{\hat\beta}-\zeta_\beta\|=O_p(\|\hat\beta-\beta\|_1(\log p)^{1/2})$.

When $\hat\beta$ is the LASSO estimator, we have $\|\hat\beta-\beta\|_1=O_p(\sigma s_0\sqrt{\log p/n})$ with $s_0=|\mathrm{supp}(\beta)|$ (Bickel, Ritov, and Tsybakov Citation2009), and consequently $\|\bar\zeta_{\hat\beta}-\bar\zeta_\beta\|=O_p(\min\{2,\sigma s_0\sqrt{(\log p)^2/n}\})$. Let $H_{1,k}(\bar\psi_{1,k},x)$ be the function obtained by setting q = 1 in $H_{q,k}(\bar\psi_{q,k},x)$ defined in Section 3.2.

Theorem 3.

Let $\Gamma=\Gamma(\hat\beta)$. Assume the following conditions: (i) the $X_i$'s are iid from $N(0,\Sigma)$, x is fixed, and n < p; (ii) β satisfies $\mathrm{cov}(X_i^\top\beta_{Q_x})\asymp 1$, and $\hat\beta$ is obtained from additional data independent of $(X,Y)$ and satisfies $\mathrm{cov}(X_i^\top\hat\beta_{Q_x}\mid\hat\beta)\asymp 1$. Then it holds that $|\hat\alpha_x-\alpha_x|=O_p(\lambda_n[1+H_{\min}\|\zeta_{\hat\beta}-\zeta_\beta\|])$, where $H_{\min}=\min_{0\le k\le n}H_{1,k}(\bar\psi_{1,k},n/[p\lambda_{\min}^{+}(\Sigma)])$. Assuming further the complementary condition $H_{\min}\|\zeta_{\hat\beta}-\zeta_\beta\|=O_p(1)$, it follows that $|\hat\alpha_x-\alpha_x|=O_p(\lambda_n)$.

Based on the proof, one can see that $\zeta_{\hat\beta}$ and $\zeta_\beta$ in Theorem 3 can be replaced by $\bar\zeta_{\hat\beta}$ and $\bar\zeta_\beta$, respectively. The assumption on β in (ii) is mild, as $\mathrm{cov}(X_i^\top\beta_{Q_x})=E(\|\zeta_\beta\|^2)$ and $\|\zeta_\beta\|$ is bounded in probability. We give some examples of the magnitude of $H_{\min}$.

Corollary 2.

(a) If Σ has low rank, say $\mathrm{rank}(\Sigma)=r_\Sigma=O(1)$, then $H_{\min}=O(r_\Sigma^{1/2})$. (b) Suppose that $\max_{j\ge k_0+1}\psi_j=O_p(pn^{-2})$ for some fixed $k_0$ and that $\lambda_{\min}^{+}(\Sigma)\asymp 1$. Then $H_{\min}=O_p(k_0^{1/2})$. (c) Suppose that $\lambda_{\min}^{+}(\Sigma)\asymp 1$ and denote $f_\psi(i)=\psi_i/\sum_{i=1}^n\psi_i$, $i=1,\ldots,n$. Assume that $f_\psi(i)$ decreases exponentially in probability, that is, $f_\psi(w)=a\exp(-a(w-1))$ for $w\ge 1$, where $a=a_n\gtrsim\log n$. Then $H_{\min}=O_p(1)$.

The proof of Corollary 2 is similar to those of Corollary 1 and Proposition 3 and is omitted. We point out two basic facts: $1\lesssim H_{\min}\lesssim n^{1/2}$ by (3.1), and $\|\zeta_{\hat\beta}-\zeta_\beta\|=O_p(1)$ by Condition (ii) and the law of large numbers. Moreover, the error rate of an estimator is generally believed to have order no smaller than $n^{-1/2}$; thus, without loss of generality, we take $\|\zeta_{\hat\beta}-\zeta_\beta\|=\Omega_p(n^{-1/2})$ throughout this article. Hence, Theorem 3 gives an error rate of order $\lambda_nH_{\min}\|\zeta_{\hat\beta}-\zeta_\beta\|\lesssim\min\{\lambda_n\sqrt{n}\|\zeta_{\hat\beta}-\zeta_\beta\|,\ \lambda_nH_{\min}\}$ in probability, without any restriction on the decay rate of the eigenvalues.

The complementary condition involves two terms, $H_{\min}$ and $\|\zeta_{\hat\beta}-\zeta_\beta\|$: the former is controlled by the decay rate of the eigenvalues, and the latter depends on the accuracy of $\hat\beta$. As argued above, as long as the eigenvalues decrease fast, $H_{\min}$ is bounded even if $\|\zeta_{\hat\beta}-\zeta_\beta\|\not\to 0$. Therefore, if $\hat\beta$ is not accurate but the eigenvalues decay fast, we can still obtain good results. The same argument applies when $\hat\beta$ is accurate but the eigenvalues decay slowly. In particular, if $\hat\beta$ is good enough that $\|\zeta_{\hat\beta}-\zeta_\beta\|=O_p(n^{-1/2})$, then the requirement on the decay rate of the eigenvalues can be removed completely. Thus, the information in $\hat\beta$ and in the eigenvalues is complementary, requiring only that the product of $H_{\min}$ and $\|\zeta_{\hat\beta}-\zeta_\beta\|$ be bounded. Theorem 3 assumes $\hat\beta$ to be independent of the data $(X,Y)$. When $\hat\beta$ depends on $(X,Y)$, we obtain a similar result on $\hat\alpha_x$ with some modifications of Condition (i), which is presented in Section F of the supplementary materials.

4 Results of γ̂x for Two Types of Test Points

In Section 3, we established theoretical properties of $\hat\alpha_x$. With $\hat\gamma_x=\hat\alpha_x\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$, it is natural to obtain the corresponding results for $\hat\gamma_x$ for a generic x. As mentioned in Examples 1 and 2 of Section 2.1, we are particularly interested in two typical settings: (a) x is a sparse vector with $S_x=\mathrm{supp}(x)$ and $s_x=|S_x|$, and (b) x is a random vector that is an iid copy of $X_i$ (i.e., the prediction problem). Next we give theoretical results on $\hat\gamma_x$ for these two examples, based on the simple fact that $|\hat\gamma_x-\gamma_x|\le|\hat\alpha_x-\alpha_x|\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$.

4.1 Properties of γ̂x for a Sparse x

Proposition 5.

Assume the following conditions: (i) x is a fixed sparse vector satisfying $\|x\|_\infty=O(1)$; (ii) $\lambda_{\min}(n^{-1}X_{S_x}^\top X_{S_x})>c>0$, where $X_{S_x}$ is formed by the columns of X with indices in $S_x$. Then $\|x\|^2\|Xx\|^{-1}n^{1/2}=O(\|x\|)=O(s_x^{1/2})$. Moreover, the following results hold.

  1. Take $\Gamma=\Gamma_{eg}$ and assume that the conditions of Theorem 2 hold. For Case (a) in Section 3.2, $|\hat\gamma_x-\gamma_x|=O_p(\lambda_n\|x\|r_X^{1/2})=O_p(\lambda_n(s_xr_X)^{1/2})$; for Cases (b) and (c) in Section 3.2, $|\hat\gamma_x-\gamma_x|=O_p(\lambda_n^{1-q/2}\|x\|)=O_p(s_x^{1/2}\lambda_n^{1-q/2})$, where the $k_0$ appearing in Theorem 2 is omitted since $k_0=O(1)$.

  2. Let $\Gamma=\Gamma(\hat\beta)$ and assume further that the conditions of Theorem 3 hold. Then $|\hat\gamma_x-\gamma_x|=O_p(s_x^{1/2}\lambda_nH_{\min}\|\zeta_{\hat\beta}-\zeta_\beta\|)$; if in addition the complementary condition holds, then $|\hat\gamma_x-\gamma_x|=O_p(\|x\|\sqrt{n^{-1}\log n})=O_p(\sqrt{n^{-1}s_x\log n})$.

The condition $\lambda_{\min}(n^{-1}X_{S_x}^\top X_{S_x})>c>0$ is a type of restricted eigenvalue condition (Bickel, Ritov, and Tsybakov Citation2009). If the $X_i$'s are iid, $n^{-1}X_{S_x}^\top X_{S_x}\to_p\mathrm{cov}(X_{iS_x})=\Sigma_{S_xS_x}$. Recall that $\mathrm{rank}(n^{-1}X^\top X)=r_X$ in Case (a) of Theorem 2; then the condition $\lambda_{\min}(n^{-1}X_{S_x}^\top X_{S_x})>c>0$ implies that $s_x\le r_X$ there.

Remark 5.

We briefly discuss the case of Σ equal or close to $I_p$ for a sparse vector x. For $\Gamma=\Gamma_{eg}$, using the trivial bound (3.1) on $R_q$ and taking q = 1 in Proposition 2, the error $|\hat\alpha_x-\alpha_x|$ is of order $O_p((\log n)^{1/4})$, and consequently $|\hat\gamma_x-\gamma_x|=O_p(\|x\|(\log n)^{1/4})$.

We briefly compare our method with the plug-in estimator based on the LASSO estimator $\hat\beta$ (referred to simply as LASSO). For LASSO, the error varies with the direction of x, whereas the results for our estimator depend only on the sparsity level $s_x$ of x. For simplicity of comparison, we consider a bound for LASSO that depends only on $s_x$. Specifically, since $\|x\|_\infty=O(1)$, we have $|x^\top\hat\beta_{lasso}-x^\top\beta|=O(\|x\|\|\hat\beta_{lasso}-\beta\|)=O_p(\sqrt{s_xs_0\log p/n})$ (Bickel, Ritov, and Tsybakov Citation2009); the latter is used as the error rate of LASSO. Recall that Peg denotes our estimator with $\Gamma=\Gamma_{eg}$. For Case (a), it follows from Proposition 5 that Peg is better than LASSO if and only if $r_X(\log n)(\log p)^{-1}=o(s_0)$. Cases (b) and (c) can be analyzed similarly. Recall that Plasso is our estimator with $\Gamma=\Gamma(\hat\beta_{lasso})$.

Corollary 3.

Denote by $T_{n,fix}$ the ratio of the error rate of Plasso to that of LASSO for a fixed sparse x. Suppose that conditions (i) and (ii) of Proposition 5 and the conditions of Theorem 3 hold. Then Plasso has an error rate of order $\lambda_nH_{\min}\sqrt{s_x(s_0\log p)^2/n}$ and consequently $T_{n,fix}=O_p(\lambda_nH_{\min}\sqrt{s_0\log p})$. If $H_{\min}=o_p(n^{1/2}[s_0(\log n)(\log p)]^{-1/2})$, then $T_{n,fix}=o_p(1)$, implying that Plasso is superior to LASSO; otherwise Plasso is inferior or comparable to LASSO.

The proof of Corollary 3 is a simple combination of Proposition 4 and part (2) of Proposition 5 and is omitted here. Since we always have $H_{\min}\lesssim n^{1/2}$, the requirement $H_{\min}=o_p(n^{1/2}[s_0(\log n)(\log p)]^{-1/2})$ is mild.

4.2 Properties of γ̂x for Prediction Problems

In a prediction problem, x and the $X_i$'s are iid. Recall that $M_\Sigma=[\mathrm{tr}^2(\Sigma)/\mathrm{tr}(\Sigma^2)]^{1/2}$.

Proposition 6.

Suppose that x and Xi’s are iid from N(0,Σ). Then x2⏧Xx1n1/2=Op(x2/xΣ)=Op(MΣ). In addition, it holds that 1MΣp1/2. Particularly, MΣ1 when Σ is of low rank, and MΣ=p1/2 when Σ=Ip.

Next we first derive properties of γ̂x with Γ=Γeg. Then we consider the estimator γ̂x˜S1 with Γ=Γ(β̂), given both an initial estimator β̂ and a subset S1.

4.2.1 Properties of γ̂x for x in Prediction with Γ=Γeg

Recall that $|\hat\gamma_x-\gamma_x|\le|\hat\alpha_x-\alpha_x|\cdot\|x\|^2\|Xx\|^{-1}n^{1/2}$. Combining Proposition 6 with the result on $|\hat\alpha_x-\alpha_x|$ for fixed $(x,X)$ in Proposition 2 with $\Gamma=\Gamma_{eg}$, we obtain $|\hat\gamma_x-\gamma_x|=O_p(\lambda_n^{1-q/2}R_q^{1/2}M_\Sigma)$. Clearly, a faster decay of the eigenvalues $\lambda(\Sigma)$ leads to a smaller value of $M_\Sigma$ and a faster decay of the $\psi_i$'s, or equivalently a smaller value of $R_q$, and consequently a better rate. Different from the fixed $(x,X)$ considered in Theorem 2, here $(x,X)$ are random. Thus, $R_q$ lacks an explicit rate, owing to the randomness of the empirical eigenvalues $\psi_i$. The magnitudes of the $\psi_i$'s, though checkable from the data, are in general hard to characterize theoretically, according to random matrix theory. To the best of our knowledge, there is no solution for the general case. To obtain an explicit result, we consider two extreme cases: (a) Σ is (approximately) of low rank; (b) $\Sigma=I_p$, the least favorable case.

Proposition 7.

Suppose that x and Xi’s are independent from N(0,Σ). Assume that n < p. Taking Γ=Γeg, we have the following conclusions:

  1. If Σ has low rank with $\mathrm{rank}(\Sigma)=r_\Sigma$, then $|\hat\gamma_x-\gamma_x|=O_p(r_\Sigma\sqrt{n^{-1}\log n})$. An extension to Σ being approximately of low rank is presented in Proposition D.1 of the supplementary material.

  2. For the least favorable case $\Sigma=I_p$, it holds that $|\hat\gamma_x-\gamma_x|=O_p(p^{1/2}(\log n)^{1/4})$.

For the case of Σ having low rank, $|\hat\gamma_x-\gamma_x|$ has an order similar to that of $|\hat\alpha_x-\alpha_x|$. But when $\Sigma=I_p$, the error bound diverges, which is not surprising, since θ is nonsparse in the transformed model in this setting and the rate is determined by the most difficult case. In particular, the rate for $\Sigma=I_p$ combines the facts that $|\hat\alpha_x-\alpha_x|=O_p((\log n)^{1/4})$ and $M_\Sigma=p^{1/2}$. Next we briefly compare LASSO with the proposed method with $\Gamma=\Gamma_{eg}$. LASSO performs well in prediction when β is (approximately) sparse, and is less sensitive to the sparsity of the eigenvalues. In contrast, the proposed method with $\Gamma=\Gamma_{eg}$ performs well when the eigenvalues of Σ decrease fast, and β can be sparse or less sparse. Thus, our method with $\Gamma=\Gamma_{eg}$ and LASSO are complementary to each other.

4.2.2 Error of γ̂x˜S1 in Prediction with Γ=Γ(β̂)

Recall from Example 2 in Section 2.1 that $\gamma_x=\gamma_{\tilde x_{S_1}}$ for any $S_1\supseteq S_0$ with $S_0=\mathrm{supp}(\beta)$, implying that one can make the prediction at the point $\tilde x_{S_1}$. Trivially, one can take $S_1=\{1,\ldots,p\}$ so that $\tilde x_{S_1}=x$. The subset $S_1$ takes the sparsity level of β into account. Given an initial estimator $\hat\beta$ and a subset $S_1$ with $S_1\supseteq S_0$, applying our approach with $\Gamma=\Gamma(\hat\beta)$ constructed with x replaced by $\tilde x_{S_1}$, we obtain the estimator of $\gamma_{\tilde x_{S_1}}$, denoted $\hat\gamma_{\tilde x_{S_1}}$.

Denote $d(\hat\beta,\beta)=[\max\{\mathrm{var}(X_i^\top(\hat\beta-\beta)),\ \mathrm{var}(X_{iS_1}^\top(\hat\beta_{S_1}-\beta_{S_1}))\}]^{1/2}$, which represents the prediction error of the initial estimator $\hat\beta$. Without loss of generality, we assume $d(\hat\beta,\beta)$ has magnitude of order no smaller than $n^{-1/2}$. Let $\Sigma_{S_1S_1}=\mathrm{cov}(X_{iS_1})$ and $M_{S_1}=[\mathrm{tr}^2(\Sigma_{S_1S_1})/\mathrm{tr}(\Sigma_{S_1S_1}^2)]^{1/2}$, so that $M_{S_1}^2$ is the effective rank of $\Sigma_{S_1S_1}$. Let $\tilde H_{\min}$ be the quantity defined analogously to $H_{\min}$ but with the eigenvalues of $n^{-1}XX^\top$, satisfying $1\lesssim\tilde H_{\min}\lesssim n^{1/2}$ (the detailed expression is given in the supplementary materials). We have the following conclusions from Theorem 3.

Theorem 4.

Let $\Gamma=\Gamma(\hat\beta)$. Assume that (i) x and the $X_i$'s are iid from $N(0,\Sigma)$ and n < p; (ii) both $S_1$ and $\hat\beta$ are independent of $(X,Y)$, with $\mathrm{cov}(X_i^\top\hat\beta)\lesssim 1$. Then it holds that $|\hat\gamma_{\tilde x_{S_1}}-\gamma_x|=O_p(\lambda_nM_{S_1}\tilde H_{\min}d(\hat\beta,\beta))$. If we further assume the complementary condition $d(\hat\beta,\beta)\tilde H_{\min}=O_p(1)$, then $|\hat\gamma_{\tilde x_{S_1}}-\gamma_x|=O_p(\lambda_nM_{S_1})$. These conclusions remain valid for $S_1=\{1,\ldots,p\}$.

Theorem 4 shows that the rate depends on $M_{S_1}$, $\tilde H_{\min}$, and $d(\hat\beta,\beta)$. The first two depend on the decay rate of the eigenvalues. Moreover, $d(\hat\beta,\beta)=O_p(1)$ by the assumptions that both $\mathrm{cov}(X_i^\top\beta)$ and $\mathrm{cov}(X_i^\top\hat\beta)$ are bounded, which imposes restrictions on $\hat\beta$. Hence, by Theorem 4, even without the complementary condition we obtain the error rate $\min\{\lambda_nM_{S_1}\sqrt{n}\,d(\hat\beta,\beta),\ \lambda_nM_{S_1}\tilde H_{\min}\}$.

We compare Plasso with the plug-in method based on the LASSO estimator $\hat\beta$ (again referred to as LASSO). By the typical rate of the LASSO, $d(\hat\beta_{lasso},\beta)=O_p(\min\{1,\sqrt{s_0\log p/n}\})$. To simplify the comparison, we assume that the LASSO estimator is consistent, that is, $s_0\log p/n=o(1)$. Then the condition $d(\hat\beta_{lasso},\beta)\tilde H_{\min}=O_p(1)$ becomes $\tilde H_{\min}=O_p(n^{1/2}(s_0\log p)^{-1/2})$, which is mild, since it always holds that $\tilde H_{\min}\lesssim n^{1/2}$.

Denote by $T_{n,rad}$ the ratio of the error rate of Plasso to $d(\hat\beta_{lasso},\beta)$ for random test points. Then $T_{n,rad}=o_p(1)$ would imply that our method is better than LASSO. By Theorem 4, $T_{n,rad}\lesssim\lambda_nM_{S_1}\tilde H_{\min}$ in probability. To investigate the magnitude of the latter, we consider the following two cases:

  • Suppose that β is sparse with support set $S_0$ of cardinality $s_0=|S_0|$, and that a good $S_1$ is available with $S_1\supseteq S_0$ and $|S_1|\asymp s_0$, that is, the support set is known sufficiently well. Then $M_{S_1}\lesssim s_0^{1/2}$. If $\tilde H_{\min}=o_p(n^{1/2}(s_0\log n)^{-1/2})$ (i.e., $\lambda_ns_0^{1/2}\tilde H_{\min}=o_p(1)$), which is mild as argued above, then $T_{n,rad}=o_p(1)$; otherwise $T_{n,rad}=\Omega_p(1)$. Moreover, Plasso has error $|\hat\gamma_{\tilde x_{S_1}}-\gamma_x|=O_p(\sqrt{s_0(\log n)/n})$ under the mild condition $\tilde H_{\min}d(\hat\beta_{lasso},\beta)=O_p(1)$. If the support set $S_0$ were known in advance, the OLS estimator using the oracle predictors $X_{iS_0}$ would have rate $\sqrt{s_0/n}$, which is similar to the rate of our estimator up to a $\sqrt{\log n}$ factor. However, the performance of our estimator depends on the quality of $S_1$ and the decay of the eigenvalues.

  • If we simply set $S_1=\{1,\ldots,p\}$, ignoring the sparsity information of β, then $T_{n,rad}=o_p(1)$ if and only if $M_{S_1}\tilde H_{\min}=o_p(\lambda_n^{-1})$; otherwise $T_{n,rad}=\Omega_p(1)$. The condition $M_{S_1}\tilde H_{\min}=o_p(\lambda_n^{-1})$ holds when the eigenvalues decay fast. Moreover, under settings similar to those of Corollary 2, we have $\tilde H_{\min}=O_p(1)$. For instance, if $\mathrm{rank}(\Sigma)=o(\lambda_n^{-1})$, then $M_{S_1}\tilde H_{\min}\lesssim\mathrm{rank}(\Sigma)=o_p(\lambda_n^{-1})$. However, Plasso is worse than LASSO when Σ is close to $I_p$ and β is indeed sparse; specifically, under the complementary condition, Plasso has error rate $\lambda_nM_{S_1}=O_p((n^{-1}p\log n)^{1/2})$, which is worse than that of LASSO. On the other hand, taking $S_1=\{1,\ldots,p\}$ accommodates both sparse and non-sparse β. The classical OLS estimator for a non-sparse β has rate of order $\sqrt{p/n}$, close to that of Plasso. In practice, the sparsity level of β is unknown, and the CV approach in Section 2.3 can be used to select between $\hat\gamma_{\tilde x_{S_1}}$ and $\hat\gamma_x$ in an automatic, data-adaptive manner.

5 Numerical Studies

We use simulation studies in Section 5.1 and real data analysis in Section 5.2 to further illustrate the numerical performance of our method.

5.1 Simulations

We consider simulation studies with samples generated iid from the linear model (2.1), with p = 1000-dimensional $X_i\sim N(0,\Sigma)$ and $\epsilon_i\sim N(0,1)$. Set $\Sigma=(\sigma_{ij})$ with $\sigma_{ij}=0.5^{|i-j|/\eta}$, where η controls the strength of dependence among the predictors; larger values of η imply stronger correlations. For convenience of discussion, a plug-in estimator is named after the method used to estimate β. For example, "LASSO," "ridge," and "ridgeless" denote the plug-in estimators $\hat\beta^\top x$ with $\hat\beta$ being the LASSO, ridge, and ridgeless estimators, respectively.

Setting 1 (Prediction). We set $\beta=\delta_0(\mathbf{1}_{p_0}^\top,0,\ldots,0)^\top\in\mathbb{R}^p$, where $p_0=r_0p$, $\mathbf{1}_{p_0}$ is the $p_0$-dimensional vector of ones, and $\delta_0=10/\sqrt{p_0}$ so that $\|\beta\|=10$. Clearly, β is denser for larger values of $r_0$, and we set $r_0\in\{0.1,0.2,\ldots,0.9\}$. For prediction, we set $\mathcal{M}=\{$Plasso, Pridge, Peg, Prdl$\}$ for Algorithm 2 in Section 2.3.
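A data-generating sketch for Setting 1 (Python; the random seed is an assumption, and the scaling of $\delta_0$ is taken so that $\|\beta\|=10$ as stated above):

```python
import numpy as np

def generate_setting1(n, p=1000, r0=0.5, eta=5, seed=0):
    """X_i ~ N(0, Sigma) with sigma_ij = 0.5^{|i-j|/eta}; beta = delta_0*(1_{p0}, 0,...,0)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** (np.abs(idx[:, None] - idx[None, :]) / eta)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    p0 = int(r0 * p)
    beta = np.zeros(p)
    beta[:p0] = 10 / np.sqrt(p0)       # delta_0 chosen so that ||beta||_2 = 10
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta, Sigma
```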

We compare the prediction performance of different methods. The data are split into a training sample of size $n_{tr}=200$ and a testing sample of size $n_{te}=500$ used to compute the test error. The PWE estimators are obtained by Algorithm 2 with the $\mathcal{M}$ given above and $S_1=\{1,\ldots,p\}$. For the implementation of Algorithm 1, the bias-correction step is adopted. We repeat the procedure 100 times and report the average test error for each method. We compare LASSO, ridge, ridgeless, Peg, and the PWE estimators. For clarity, the simulation results for Setting 1 are summarized as follows:

  1. Comparison of LASSO, ridge, and ridgeless with PWE and Peg on prediction. The simulation results are presented in Figure 2. When β is sparse, such as $r_0=0.1$, PWE performs similarly to LASSO (for small η) or better than LASSO (for large η), and much better than the other methods, including ridgeless. When $r_0$ is large, PWE is similar to or slightly better than the ridgeless estimator, and much better than the other methods. By taking advantage of different initial estimators, PWE performs well for both sparse and dense β. Moreover, Peg is insensitive to $r_0$, which supports our theoretical findings, and is better than LASSO when $r_0$ is large; its advantage over LASSO is clearer when η is large.

  2. Comparison of plug-in estimators with their pointwise counterparts. Owing to limited space, the results are presented in Section G.1 of the supplementary material. The performance of Plasso is similar to LASSO for small η and better than LASSO for large η; in all cases, Pridge is better than ridge. These numerical results match our theoretical findings in the sense that sparsity in the eigenvalues can be helpful. In addition, Prdl is close to ridgeless, especially when η is small. More comparisons are presented in Setting 4 and Section G.1 of the supplementary materials.

Fig. 2 Simulation prediction results of PWE, LASSO, ridge, Peg, and ridgeless (RDL).


Setting 2 (Sparse linear transformation). We consider sparse vectors x in $\gamma_x=\beta^\top x$. Set $\beta=(3,3,3,1,\delta_1\mathbf{1}_{p_0-4}^\top,\mathbf{0}_{p-p_0}^\top)^\top$, where $\delta_1=5/p_0$. Consider $\gamma_x$ being one of the four quantities $\beta_1$, $\beta_{p_0}$, $\beta_p$, and $\beta_1-\beta_3$, corresponding to taking x to be $e_1$, $e_{p_0}$, $e_p$, and $e_1-e_3$, respectively. Note that $\beta_1-\beta_3=0$, indicating no difference between the effects of the first and third predictors; $\beta_1=3$ and $\beta_{p_0}=\delta_1$ represent strong and weak signals, respectively. Moreover, $\beta_p=0$ indicates that the pth predictor is insignificant. Let $p_0=300$ and $\Sigma=(\sigma_{ij})$ with $\sigma_{ij}=0.5^{|i-j|/150}$, so that the predictors are highly correlated. We set the training data size $n_{tr}=150$ and compute the average error $|\hat\gamma_x-\gamma_x|$ over 100 replications. We compare the regularized estimators Peg, Plasso, and Pridge with the plug-in ones. The estimation results are presented in Table 2.

Table 2 The average values of $|\hat\gamma_x-\gamma_x|$.

For the strong signal $\beta_1$, Table 2 shows that the regularized pointwise estimators Pridge and Peg are better than LASSO, adaptive LASSO, and ridge regression. For the weak signal $\beta_{p_0}$, the ridge estimator is better than the others; the main reason is that the other methods sometimes shrink the estimates to zero, resulting in large biases. For $\beta_p=0$, all methods except ridge regression give zero estimates. Finally, for $\beta_1-\beta_3=0$, all the regularized pointwise estimators yield exactly zero estimates, while the plug-in estimators LASSO and adaptive LASSO lead to large biases.

Setting 3 (Comparison of different subsets $S_1$). As pointed out in Sections 2.1 and 2.3, one can consider prediction at the point $\tilde x_{S_1}$ instead of x in prediction problems. Under the setup of Setting 1, we take $\Gamma=\Gamma(\hat\beta_{lasso})$ and compare the following four candidates for $S_1$: (1) $S_1$ being $S_0=\mathrm{supp}(\beta)$, the ideal case; (2) $S_1$ being $S_{full}=\{1,\ldots,p\}$, that is, $\gamma_{\tilde x_{S_1}}=\gamma_x$; (3) $S_1$ obtained by the SIS of Fan and Lv (Citation2008), denoted $S_{SIS}$; (4) $S_1$ being $S_{lasso}=\mathrm{supp}(\hat\beta_{lasso})$. For each candidate, we repeat the procedure 100 times and report the average prediction error, the true positive rate (TP), and the average relative length (LEN), defined as $TP=|S_1\cap S_0|/|S_0|$ and $LEN=|S_1|/p$, respectively.

Owing to limited space, we present the simulation results in Section G.2 of the supplementary materials. It is seen that $S_0$ always gives the best prediction errors in all cases. When $r_0=0.01$, where β is very sparse, both $S_{SIS}$ and $S_{lasso}$ have higher values of TP and smaller values of LEN, leading to smaller prediction errors than those of $S_{full}$. As $r_0$ increases, the signal of the $\beta_j$'s becomes weak due to the constraint $\|\beta\|=10$, and the TP values of $S_{SIS}$ are very small, the smallest among all subsets, which leads to the worst prediction errors. On the other hand, the TP values, and consequently the errors, of $S_{lasso}$ are much better than those of $S_{SIS}$, because the LASSO accounts for correlations among the predictors when selecting significant variables, whereas SIS uses only marginal correlations. In addition, $S_{lasso}$ performs similarly to $S_{full}$, which is partially due to the following reason: during the construction of $\Gamma(\hat\beta_{lasso})$ with a given $S_1$, we need to compute $\tilde x_{S_1}^\top\hat\beta_{lasso}$, which equals $x^\top\hat\beta_{lasso}$ for $S_1$ being either $S_{lasso}$ or $S_{full}$.

Setting 4 (Further comparison for heterogeneous test points). In Setting 1, where the test points are iid copies from the training distribution, Plasso is nearly the same as LASSO when η is small, such as η = 5 (results shown in the supplementary material). We compare them further for the case where the $x_i$'s follow a distribution different from that of the $X_i$'s, which is known as covariate shift in the literature of transfer learning (Weiss, Khoshgoftaar, and Wang 2016). We generate training data of size 100 as in Setting 1 with η = 5 and $\beta\propto(\mathbf{1}_{p_0}^\top,0,\ldots,0)^\top\in\mathbb{R}^p$ with $\|\beta\|=5$. The test points $x_i$ are iid from $N(0,\Sigma_{te})$. The eigenvector matrix $U_{te}$ of $\Sigma_{te}$ is uniformly distributed on the set of all orthogonal matrices in $\mathbb{R}^{p\times p}$. The eigenvalues of $\Sigma_{te}$, denoted as $\varrho_{te,1},\ldots,\varrho_{te,p}$, satisfy $\varrho_{te,i}=2(p-i+1)/(p+1)$, so that $\mathrm{tr}(\Sigma_{te})=p$. We first generate a $\Sigma_{te}$ and then 200 test points with the given $\Sigma_{te}$, and repeat this procedure 100 times to compute average prediction errors. The results in Table 3 show that Plasso is much better than LASSO in the case of covariate shift even for small η, possibly due to the flexible pointwise prediction of our proposed method. Beyond these examples, additional results demonstrate that Prdl can also substantially improve the ridgeless estimator when covariate shift exists in the test data (Section G.1 of the supplementary materials).
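The covariate-shift test covariance can be generated as in the following sketch, which is our own illustration: the Haar-distributed eigenvector matrix is obtained from a QR decomposition of a Gaussian matrix, the eigenvalues follow the formula above, and the value p = 1000 is an assumption not stated in this paragraph.

import numpy as np

def make_sigma_te(p, rng):
    # U_te uniform on the orthogonal group: QR of a Gaussian matrix with a sign fix.
    G = rng.standard_normal((p, p))
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))
    # Eigenvalues 2(p - i + 1)/(p + 1), i = 1, ..., p; they sum to p, so tr(Sigma_te) = p.
    eigvals = 2 * (p - np.arange(1, p + 1) + 1) / (p + 1)
    return Q @ np.diag(eigvals) @ Q.T

rng = np.random.default_rng(0)
p = 1000                                  # assumed dimension
Sigma_te = make_sigma_te(p, rng)
x_test = rng.multivariate_normal(np.zeros(p), Sigma_te, size=200)   # 200 test points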

Table 3 Test errors of LASSO and Plasso with η = 5 for test points from $N(0,\Sigma_{te})$.

5.2 Real Data Analysis

We apply our method to a dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (https://adni.loni.usc.edu/). Alzheimer's Disease (AD) is a form of dementia characterized by progressive cognitive and memory deficits. The Mini Mental State Examination (MMSE) is a widely used score in practice for the diagnosis of AD. Generally, any score greater than or equal to 27 points (out of 30) indicates normal cognition. Below this, the MMSE score can indicate severe (≤9 points), moderate (10–18 points), or mild (19–24 points) cognitive impairment (Mungas 1991). Currently, structural magnetic resonance imaging (MRI) is one of the most popular and powerful techniques for the diagnosis of AD. One can use MRI data to predict the MMSE score and identify important diagnostic and prognostic biomarkers. The dataset we use contains the MRI data and MMSE scores of 51 AD patients and 52 normal controls. After the image preprocessing steps for the MRI data, we obtain subject-labeled images based on a template with 93 manually labeled regions of interest (ROIs) (Zhang and Shen 2012). For each of the 93 ROIs in the labeled MRI, the volume of gray matter tissue is used as a feature. Therefore, the final dataset has 103 subjects, each with one MMSE score and 93 MRI features. We treat the MMSE score as the response variable and the MRI features as predictors.

We split the data at random with 80% as the training set, denoted as $S_{tr}$, and 20% as the testing set, denoted as $S_{te}$, and then compute the average test error $\sum_{Y_i\in S_{te}}|\hat{Y}_i-Y_i|/|S_{te}|$. We repeat the procedure 100 times and report the average test errors. The boxplots for the different methods are presented in Figure 3, which shows that the pointwise estimators are much better than the plug-in estimators LASSO, adaptive LASSO, and ridge.
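A rough sketch of this evaluation loop is given below; it is our own illustration, where X and y hold the 93 ROI features and the MMSE scores as numpy arrays, and fit_and_predict stands in for any of the compared estimators (the function name, seed, and interface are ours, not the paper's).

import numpy as np

def average_test_error(X, y, fit_and_predict, n_rep=100, test_frac=0.2, seed=0):
    # Repeat a random 80/20 split and average the mean absolute test error.
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        n_te = int(round(test_frac * n))
        te, tr = perm[:n_te], perm[n_te:]
        y_hat = fit_and_predict(X[tr], y[tr], X[te])
        errors.append(np.mean(np.abs(y_hat - y[te])))
    return float(np.mean(errors))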

Fig. 3 Comparison of the test errors of different methods for the AD data analysis. Here Plasso, Pridge, Prdl, and Peg represent our regularized estimators without bias-correction in Algorithm 1. The PWE is automatically selected from Plasso, Pridge, Prdl, and Peg.

6 Discussion

In this article, we estimate the linear transformation $\beta^\top x$ of the parameter vector β in high-dimensional linear models. We propose a pointwise estimator that works well whether β is sparse or nonsparse and whether the predictors are strongly or weakly correlated. The theoretical analysis reveals the significant difference between estimating the linear transformation $\beta^\top x$ and estimating β itself. When β is nonsparse or the predictors are strongly correlated, estimating β is difficult, but we can still obtain a good estimate of $\beta^\top x$ using the proposed pointwise estimators.

Supplementary Materials

The supplementary materials contain two parts. The first part provides the proofs of the theoretical results in the main paper. The second part includes some simulation results for Settings 2 and 3 of the main paper, together with additional simulation results.


Acknowledgments

The authors would like to thank the Editor, the Associate Editor, and reviewers, whose helpful comments and suggestions led to a much improved presentation.

Additional information

Funding

Junlong Zhao’s research was supported in part by National Natural Science Foundation of China grants No. 11871104 and 12131006. Yang Zhou’s research was supported in part by China Postdoctoral Science Foundation grants No. 2020M680226 and 2020TQ0014, and Youth Fund, No.310422112. Yufeng Liu’s research was supported in part by US NIH grant R01GM126550 and NSF grants DMS2100729 and SES2217440.

References

  • Azriel, D., and Schwartzman, A. (2020), “Estimation of Linear Projections of Non-Sparse Coefficients in High-Dimensional Regression,” Electronic Journal of Statistics, 14, 174–206. DOI: 10.1214/19-EJS1656.
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign Overfitting in Linear Regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070. DOI: 10.1073/pnas.1907378117.
  • Belloni, A., and Chernozhukov, V. (2013), “Least Squares after Model Selection in High-Dimensional Sparse Models,” Bernoulli, 19, 521–547. DOI: 10.3150/11-BEJ410.
  • Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009), “Simultaneous Analysis of Lasso and Dantzig Selector,” The Annals of Statistics, 37, 1705–1732. DOI: 10.1214/08-AOS620.
  • Bühlmann, P., and Van De Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Berlin: Springer.
  • Cai, T. T., and Guo, Z. (2017), “Confidence Intervals for High-Dimensional Linear Regression: Minimax Rates and Adaptivity,” The Annals of Statistics, 45, 615–646.
  • Dalalyan, A. S., Hebiri, M., and Lederer, J. (2017), “On the Prediction Performance of the Lasso,” Bernoulli, 23, 552–581. DOI: 10.3150/15-BEJ756.
  • Dobriban, E., and Wager, S. (2018), “High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification,” The Annals of Statistics, 46, 247–279. DOI: 10.1214/17-AOS1549.
  • Fan, J., and Li, R. (2001), “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360. DOI: 10.1198/016214501753382273.
  • Fan, J., and Lv, J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, Series B, 70, 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-AOS2133.
  • Hebiri, M., and Lederer, J. (2013), “How Correlations Influence Lasso Prediction,” IEEE Transactions on Information Theory, 59, 1846–1854. DOI: 10.1109/TIT.2012.2227680.
  • Javanmard, A., and Montanari, A. (2014), “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression,” The Journal of Machine Learning Research, 15, 2869–2909.
  • Lu, S., Liu, Y., Yin, L., and Zhang, K. (2017), “Confidence Intervals and Regions for the Lasso by Using Stochastic Variational Inequality Techniques in Optimization,” Journal of the Royal Statistical Society, Series B, 79, 589–611. DOI: 10.1111/rssb.12184.
  • Mungas, D. (1991), “In-Office Mental Status Testing: A Practical Guide,” Geriatrics, 46, 54–67.
  • Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012), “A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers,” Statistical Science, 27, 538–557. DOI: 10.1214/12-STS400.
  • Raskutti, G., Wainwright, M. J., and Yu, B. (2011), “Minimax Rates of Estimation for High-Dimensional Linear Regression over ℓq-Balls,” IEEE Transactions on Information Theory, 57, 6976–6994.
  • Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x.
  • van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models,” Annals of Statistics, 42, 1166–1202.
  • Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016), “A Survey of Transfer Learning,” Journal of Big Data, 3, 1–40. DOI: 10.1186/s40537-016-0043-6.
  • Zhang, C. H. (2010), “Nearly Unbiased Variable Selection Under Minimax Concave Penalty,” Annals of Statistics, 38, 894–942.
  • Zhang, C. H., and Zhang, S. S. (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242. DOI: 10.1111/rssb.12026.
  • Zhang, D., and Shen, D. (2012), “Multi-Modal Multi-Task Learning for Joint Prediction of Multiple Regression and Classification Variables in Alzheimer’s Disease,” Neuroimage, 59, 895–907. DOI: 10.1016/j.neuroimage.2011.09.069.
  • Zhang, Y., Duchi, J., and Wainwright, M. (2015), “Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates,” The Journal of Machine Learning Research, 16, 3299–3340.
  • Zhang, Y., Wainwright, M. J., and Jordan, M. I. (2017), “Optimal Prediction for Sparse Linear Models? Lower Bounds for Coordinate-Separable M-Estimators,” Electronic Journal of Statistics, 11, 752–799. DOI: 10.1214/17-EJS1233.
  • Zhu, Y., and Bradic, J. (2018), “Linear Hypothesis Testing in Dense High-Dimensional Linear Models,” Journal of the American Statistical Association, 113, 1583–1600. DOI: 10.1080/01621459.2017.1356319.
  • Zou, H., and Hastie, T. (2005), “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society, Series B, 67, 301–320. DOI: 10.1111/j.1467-9868.2005.00503.x.