Estimation of Linear Functionals in High-Dimensional Linear Models: From Sparsity to Nonsparsity


ABSTRACT

High-dimensional linear models are commonly used in practice. In many applications, one is interested in linear transformations $β^{⊤} x$ of regression coefficients $β \in R^{p}$, where x is a specific point and is not required to be identically distributed as the training data. One common approach is the plug-in technique, which first estimates $β$ and then plugs the estimator into the linear transformation for prediction. Despite its popularity, estimation of $β$ can be difficult for high-dimensional problems. Commonly used assumptions in the literature include that the signal of coefficients $β$ is sparse and that the predictors are weakly correlated. These assumptions, however, may not be easily verified, and can be violated in practice. When $β$ is non-sparse or predictors are strongly correlated, estimation of $β$ can be very difficult. In this article, we propose a novel pointwise estimator for linear transformations of $β$. This new estimator greatly relaxes the common assumptions for high-dimensional problems, and is adaptive to the degree of sparsity of $β$ and the strength of correlations among the predictors. In particular, $β$ can be sparse or nonsparse and predictors can be strongly or weakly correlated. The proposed method is simple to implement. Numerical and theoretical results demonstrate the competitive advantages of the proposed method for a wide range of problems. Supplementary materials for this article are available online.


1 Introduction

With the advance of technology, high-dimensional data are prevalent in many scientific disciplines such as biology, genetics, and finance. Linear regression models are commonly used for the analysis of high-dimensional data, typically with two important goals: prediction and interpretability. Variable selection can help to provide useful insights on the relationship between predictors and the response, and thus improve the interpretability of the resulting model. During the past several decades, many sparse penalized techniques have been proposed for simultaneous variable selection and prediction, including convex penalized methods (Tibshirani 1996; Zou and Hastie 2005), as well as nonconvex ones (Fan and Li 2001; Zhang 2010).

In this article, we are interested in estimating linear transformations $β^{⊤} x$ of regression coefficients $β = {(β_{1}, \dots, β_{p})}^{⊤} \in R^{p}$ for high-dimensional linear models, where $x \in R^{p}$ is a specific point and is not required to be from the same distribution as the training data. It relates to both coefficient estimation and prediction. For instance, sometimes we are interested in estimating β₁ and $β_{1} - β_{2}$ , where both of them can be expressed as $β^{⊤} x$ by taking x as ${(1, 0, \dots, 0)}^{⊤}$ and ${(1, - 1, 0, \dots, 0)}^{⊤}$ , respectively. On the other hand, for a typical prediction problem, x follows the same distribution as the training data.
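As a small illustration of how a single coefficient or a contrast is written as $β^{⊤} x$, the following sketch uses a made-up coefficient vector (the values of `beta` are hypothetical and chosen only for the example; indices are 0-based in the code).

```python
import numpy as np

# Hypothetical coefficient vector with p = 4 (illustrative values only).
beta = np.array([2.0, 0.5, -1.0, 0.0])

# x = (1, 0, ..., 0)^T picks out beta_1.
e1 = np.array([1.0, 0.0, 0.0, 0.0])
# x = (1, -1, 0, ..., 0)^T picks out beta_1 - beta_2.
contrast = np.array([1.0, -1.0, 0.0, 0.0])

gamma_e1 = float(beta @ e1)
gamma_contrast = float(beta @ contrast)
print(gamma_e1, gamma_contrast)
```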

To estimate $β^{⊤} x$, a natural and commonly used solution is to estimate $β$ first by $\hat{β}$ and construct the plug-in estimator ${\hat{β}}^{⊤} x$. The efficiency of the plug-in estimator depends on that of $\hat{β}$. Despite the simplicity of this approach, obtaining a good estimate of $β$ may not be easy in high-dimensional problems. If $β$ is sparse (i.e., the support of $β$, $supp (β)$, is small), sparse regularized techniques such as the LASSO can be used to obtain a consistent estimator of $β$. Desirable theoretical and numerical results have been established for various sparse penalized methods in the literature (see, e.g., Bickel, Ritov, and Tsybakov 2009; Raskutti, Wainwright, and Yu 2011; Bühlmann and Van De Geer 2011; Negahban et al. 2012). These regularized methods assume that $β$ is a sparse vector, which is difficult to verify in practice, and they may fail when the size of $supp (β)$ is comparable to the sample size n or larger than n. The problem becomes more difficult when the predictors are strongly correlated, since most sparse regularized methods work well on weakly dependent predictors.

We use a small simulation to illustrate the adverse effects of the sparsity degree of $β$ on the plug-in estimator of $β^{⊤} x$ in the linear regression model (2.1), where X_i follows the normal distribution $N (0, Σ)$, $β = δ_{0} {(1_{p_{0}}^{⊤}, 0_{p - p_{0}}^{⊤})}^{⊤}$ with $p_{0} = r_{0} p$, and $Σ = (σ_{i j})$ with $σ_{i j} = {0.5}^{| i - j | / η}$, where η controls the correlation strength among the predictors. A larger value of r₀ implies a denser $β$. The settings of δ₀ and the other parameters are presented in Setting 1 of Section 5.1. The average testing errors of the plug-in estimators and our proposed PointWise Estimator (PWE) are shown in Fig. 1. We can see that the errors of the plug-in estimators deteriorate quickly as r₀ increases. In contrast, our proposed estimator is much less sensitive to changes in the degree of non-sparsity.

Fig. 1 The effect of nonsparsity of $β$ on plug-in estimators in terms of prediction error, where p = 1000. “A-lasso” and “lasso” denote the results of plug-in estimators with $\hat{β}$ being adaptive LASSO and LASSO, respectively, and PWE denotes the proposed method.
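The simulation design described above can be sketched as follows. This is only an illustrative generator: the function name `make_design` and all numerical values passed to it (including `delta0`) are placeholders, since the actual settings are those of Setting 1 in Section 5.1 and are not reproduced here.

```python
import numpy as np

def make_design(n, p, r0, eta, delta0, rng):
    """Draw X with rows N(0, Sigma), sigma_ij = 0.5^(|i-j|/eta), and a
    coefficient vector whose first p0 = r0 * p entries equal delta0."""
    idx = np.arange(p)
    Sigma = 0.5 ** (np.abs(idx[:, None] - idx[None, :]) / eta)
    p0 = int(round(r0 * p))
    beta = np.concatenate([delta0 * np.ones(p0), np.zeros(p - p0)])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X, beta, Sigma

# Placeholder values, chosen only so the example runs quickly.
X, beta, Sigma = make_design(n=30, p=50, r0=0.2, eta=1.0, delta0=0.5,
                             rng=np.random.default_rng(0))
```

A larger `eta` makes the off-diagonal entries of `Sigma` decay more slowly, that is, the predictors become more strongly correlated; a larger `r0` makes `beta` denser.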

In typical prediction problems, a number of papers studied the convergence of prediction for various estimators, including LASSO, ridge, partial least squares, overparameterized estimators, and many others under different settings (Dalalyan, Hebiri, and Lederer 2017; Zhang, Wainwright, and Jordan 2017; Dobriban and Wager 2018; Bartlett et al. 2020, etc.). It has been observed that LASSO and related methods are less affected by the correlation strength among predictors in prediction than in estimation problems (Hebiri and Lederer 2013; Dalalyan, Hebiri, and Lederer 2017). However, for some sparse vectors x such as $x = {(1, 0, \dots, 0)}^{⊤}$, the estimation of $β^{⊤} x$ becomes that of the first coefficient, and these methods are then more affected by the correlation strength than in prediction (Zou and Hastie 2005). All the above-mentioned methods consider the plug-in estimator and the average prediction error.

Different from these existing methods, we focus on $β^{⊤} x$ for a specific fixed x ($x \neq 0$) rather than on estimating $β$ and the average prediction error, where x is not required to have the same distribution as the training data. The line of work most closely related to ours concerns hypothesis testing and confidence intervals for $β^{⊤} x$ in high-dimensional linear models (van de Geer et al. 2014; Zhang and Zhang 2014; Javanmard and Montanari 2014; Lu et al. 2017; Cai and Guo 2017; Zhu and Bradic 2018, etc.). Most of these papers considered the case of $β$ being ultra-sparse with $| supp (β) | ≪ \sqrt{n} / \log p$. Cai and Guo (2017) considered the broader range where $| supp (β) |$ has an order no more than $n / \log p$. Zhu and Bradic (2018) considered hypothesis testing and confidence intervals for $β^{⊤} x$ where $β$ is allowed to be non-sparse by introducing a sparse auxiliary model, which can be restrictive. For example, if the predictor vector follows the normal distribution $N (0, Σ)$ and the sparse auxiliary model holds for any $x \in R^{p}$ simultaneously, then Σ must be equal to I_p. Moreover, the estimator of $β^{⊤} x$ obtained from the confidence interval of Zhu and Bradic (2018), despite allowing $β$ to be dense, works only when $p / n \to 0$ in prediction problems. Although these results have optimality in the minimax sense (Cai and Guo 2017; Zhu and Bradic 2018), they can be conservative and are actually determined by the most difficult case. In this article, we introduce the sparsity of eigenvalues (or approximately low rank) of some matrices, which is shown to be complementary to the sparsity of $β$. The most difficult case is actually the situation where both types of sparsity fail.

Our key observation is that we can directly target $γ_{x} := β^{⊤} x$, treating it as an unknown parameter for estimation. We refer to the resulting estimate as the pointwise estimator. To this end, we propose a unified framework to leverage multiple sources of information. In many cases, the eigenvalues of the covariance matrix decrease dramatically due to correlations among the predictors, which will be referred to as sparsity of eigenvalues in the following descriptions. This type of sparsity is generally viewed as an adverse factor, making the estimation of $β$ more difficult. Contrary to this popular view, we show that the sparsity of eigenvalues is beneficial in our framework and serves as a good complement to the sparsity of $β$. In practice, two different kinds of test points x are of particular interest: (a) x is a given sparse vector, and (b) x is a random vector having the same distribution as the training data (i.e., the prediction problem). We give detailed results on these two special cases and compare our estimator with several other methods. The main contribution of this article is that we propose a transformed model under a new basis, which provides a unified way to use different sources of information.

First, to use the sparsity of eigenvalues, we propose an estimator based on a basis consisting of eigenvectors of a specific matrix constructed from the training data. When the eigenvalues decrease at a certain rate, our estimator performs well for both kinds of test points x regardless of the sparsity of $β$ . On the other hand, if eigenvalues decrease slowly (or covariance matrix close to I_p), this estimator is less efficient; and consequently is inferior to LASSO when $β$ is indeed sparse. In fact, the pointwise estimator using the sparsity of eigenvalues is complementary to LASSO.

Second, to leverage the information of $β$, such as $β$ being sparse, and the sparsity of eigenvalues jointly, we construct another basis based on an initial estimator $\hat{β}$. It is shown that the two types of information help each other: a faster decreasing rate of eigenvalues allows $\hat{β}$ to converge at a slower rate, and vice versa. When the test point x is a sparse vector, we show that the pointwise estimator performs well. The case of x being random as in prediction problems is more complicated in the sense that the sparsity degree of $β$ should be taken into account. Hence, we consider a subset S₁ of $\{1, \dots, p\}$ satisfying $S_{1} \supseteq supp (β)$, where S₁ can be estimated from data. Specifically, we consider two cases: (a) Let $S_{1} = \{1, \dots, p\}$, which allows $β$ to be sparse or dense. When sparsity of eigenvalues holds, our pointwise estimator performs well. When the eigenvalues are less sparse (or the covariance matrix is close to I_p), our pointwise estimator performs similarly to the existing results on dense $β$ in the literature. (b) When $β$ is sparse, a smaller $| S_{1} |$ leads to a better estimator. If a good initial estimator $\hat{β}$ and a good S₁ are available, our estimator's performance is similar to that of LASSO.

The rest of this article is organized as follows. In Section 2, we propose our pointwise estimator for the linear transformation $β^{⊤} x$ in high-dimensional linear models. Theoretical properties are established in Sections 3 and 4. Some simulated examples and real data analysis are presented in Section 5, followed by some discussions in Section 6. Proofs of the theoretical results are provided in the Supplementary Materials.

Notation. We first introduce some notation used throughout the article. For any symmetric positive semidefinite matrix $A \in R^{m \times m}$, denote the eigenvalues of A in decreasing order as $λ_{1} (A) \geq \dots \geq λ_{m} (A)$, and the smallest nonzero eigenvalue as $λ_{\min}^{+} (A)$. For any matrix $A \in R^{m_{1} \times m_{2}}$, $λ_{\max} (A)$ and $λ_{\min} (A)$ are the maximum and minimum singular values of A, respectively. For any vector $v = {(v_{1}, \dots, v_{m})}^{⊤} \in R^{m}$, $\| v \|$ and $\| v \|_{1}$ denote the $l_{2}$ and $l_{1}$ norms of v, respectively, and $\| v \|_{\infty} = \max_{1 \leq j \leq m} | v_{j} |$; the support set of v is denoted as $supp (v)$. In addition, define $\| v \|_{A} = \sqrt{v^{⊤} A v}$ for any positive semidefinite matrix $A \in R^{m \times m}$. For two sequences $\{a_{n}\}$ and $\{b_{n}\}$, both $a_{n} ≲ b_{n}$ and $a_{n} = O (b_{n})$ imply $\lim_{n} a_{n} / b_{n} \leq c$ for some constant $c < \infty$; both $a_{n} ≳ b_{n}$ and $a_{n} = Ω (b_{n})$ indicate that $\lim_{n} a_{n} / b_{n} \geq c$; $a_{n} ≍ b_{n}$ means that a_n has exactly the same order as b_n. For any integer i, let e_i denote the vector of zeros except the ith element being 1.

2 A Unified Framework for Pointwise Estimation

Suppose $(X_{i}, Y_{i}); 1 \leq i \leq n$, are iid from the following linear regression model $Y_{i} = X_{i}^{⊤} β + ϵ_{i}; 1 \leq i \leq n,$ (2.1) where $ϵ_{i} \in R$ satisfies $E (ϵ_{i}) = 0$ and $var (ϵ_{i}) = σ^{2} < \infty$, and $X_{i} \in R^{p}$ is independent of ϵ_i and satisfies $E (X_{i}) = 0$ and $cov (X_{i}) = Σ = (σ_{i j})$. Without loss of generality, we assume that $σ_{i i} = 1; i = 1, \dots, p$, and that $var (Y_{i}) < \infty$. Denote $X = {(X_{1}, \dots, X_{n})}^{⊤} \in R^{n \times p}$, $Y = {(Y_{1}, \dots, Y_{n})}^{⊤} \in R^{n}$, and $ε = {(ϵ_{1}, \dots, ϵ_{n})}^{⊤} \in R^{n}$. Then the model can be written as $Y = X β + ε$. Here the dimension p can be much larger than the sample size n. Let $x \in R^{p}$ be a given point at which we intend to estimate $β^{⊤} x$. We assume that $X_{i}^{⊤} x \neq 0$ for some $1 \leq i \leq n$, which can be checked numerically. Let $S_{0} = supp (β)$ with cardinality $s_{0} = | S_{0} |$. Since a nonsparse $β$ and the case where Σ might not be of full rank will be considered, we impose the following identifiability condition and discuss some useful facts.

When Σ is not of full rank, we assume that $β$ falls into the column space of Σ for identifiability, due to the following reasons. Denote $X_{i} = Σ^{1 / 2} {\tilde{X}}_{i}; 1 \leq i \leq n$ , where ${\tilde{X}}_{i}$ satisfies $E ({\tilde{X}}_{i}) = 0$ and $cov ({\tilde{X}}_{i}) = I_{p}$ . Let $P_{Σ}$ be the projection matrix on the column space of Σ and $Q_{Σ} = I_{p} - P_{Σ}$ . Then $X_{i}^{⊤} β = {\tilde{X}}_{i}^{⊤} Σ^{1 / 2} β = {\tilde{X}}_{i}^{⊤} Σ^{1 / 2} (P_{Σ} + Q_{Σ}) β = X_{i}^{⊤} P_{Σ} β$ . Thus, the parameter can be set as $P_{Σ} β$ , which falls into the column space of Σ.
The magnitude of β_j; $j \in S_{0}$, depends on the sparsity degree s₀. Note that $λ_{\min}^{+} (Σ) \| β \|^{2} \leq β^{⊤} Σ β < var (Y_{i}) < \infty$, and consequently that $\| β \|^{2} \leq var (Y_{i}) / λ_{\min}^{+} (Σ)$. Assume that β_j's with $j \in S_{0}$ are of the same magnitude. Then it follows that $| β_{j} | ≲ {[s_{0} λ_{\min}^{+} (Σ) / var (Y_{i})]}^{- 1 / 2}, j \in S_{0}$. In particular, if $λ_{\min}^{+} (Σ) ≍ 1$, the $| β_{j} |$'s are of order $s_{0}^{- 1 / 2}$, which can be small when s₀ is large.

In Section 2.1, we first introduce the transformed model, which is based on a set of basis vectors and leverages multiple sources of information. The construction of the basis is discussed in Section 2.2. A penalized estimator and a pointwise estimator are proposed in Section 2.3.

2.1 The Transformed Model

For any fixed $x \in R^{p}$, let $P_{x} = x x^{⊤} / \| x \|^{2}$ be the projection matrix on the space spanned by x and $Q_{x} = I_{p} - P_{x}$ be the projection matrix on the complementary space. Recall that $γ_{x} = β^{⊤} x$ and denote $β_{Q_{x}} = Q_{x} β$. Then one can write $X β = X P_{x} β + X Q_{x} β = X x \cdot x^{⊤} β \| x \|^{- 2} + \sqrt{n} ζ_{β} := \sqrt{n} z_{x} \cdot α_{x} + \sqrt{n} ζ_{β},$ (2.2) where $α_{x} = γ_{x} \cdot \| x \|^{- 2} \| X x \| n^{- 1 / 2} \in R$, $z_{x} = X x / \| X x \| \in R^{n}$, and $ζ_{β} = n^{- 1 / 2} X β_{Q_{x}} \in R^{n}$. Here α_x is a scaled version of γ_x such that the $l_{2}$ norm of the predictor $\sqrt{n} z_{x}$ equals $\sqrt{n}$. Estimating γ_x is equivalent to estimating α_x, since given α_x, one can compute γ_x directly from the data $(X, x)$. Then we get $Y = X β + ε = \sqrt{n} z_{x} α_{x} + \sqrt{n} ζ_{β} + ε,$ where $ζ_{β}$ is a nuisance parameter vector. Note that $ζ_{β}$ is a nonsparse vector in general, particularly when the X_i's are iid variables; thus, we have n + 1 nonsparse parameters with sample size n. To handle this difficulty, we introduce a basis matrix $Γ \in R^{n \times n}$, built from different sources of information, such that $ζ_{β}$ can be expressed sparsely in this basis.
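The decomposition (2.2) and the rescaling between γ_x and α_x can be checked numerically. The sketch below uses arbitrary simulated inputs (the dimensions, the seed, and the Gaussian draws are assumptions made only for illustration) and verifies that both sides of (2.2) agree and that γ_x is recovered from α_x.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                       # illustrative sizes only
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
x = rng.standard_normal(p)

gamma_x = beta @ x                  # parameter of interest
Xx_norm = np.linalg.norm(X @ x)
z_x = X @ x / Xx_norm               # unit-length predictor direction
alpha_x = gamma_x * Xx_norm / (np.linalg.norm(x) ** 2 * np.sqrt(n))

Q_x = np.eye(p) - np.outer(x, x) / (x @ x)   # projection orthogonal to x
zeta_beta = X @ (Q_x @ beta) / np.sqrt(n)    # nuisance vector

lhs = X @ beta
rhs = np.sqrt(n) * z_x * alpha_x + np.sqrt(n) * zeta_beta

# Map alpha_x back to gamma_x using only (X, x).
gamma_back = alpha_x * np.linalg.norm(x) ** 2 * np.sqrt(n) / Xx_norm
print(np.allclose(lhs, rhs), np.isclose(gamma_back, gamma_x))
```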

The construction of Γ depends on the information at hand and will be elaborated in Section 2.2. For an invertible matrix $Γ \in R^{n \times n}$ whose columns are of unit length (i.e., $Γ_{\cdot j}^{⊤} Γ_{\cdot j} = 1, 1 \leq j \leq n$), we denote $\sqrt{n} ζ_{β} = (\sqrt{n} Γ) (Γ^{- 1} ζ_{β}) = \sqrt{n} Γ θ,$ where $θ = Γ^{- 1} ζ_{β} \in R^{n}$. Although Γ here can be any invertible matrix, as shown later, the case we are interested in is Γ being (approximately) orthogonal. We hope that $θ$ is (approximately) sparse when Γ is chosen properly. Combining these together, we have the transformed linear model $Y = \sqrt{n} z_{x} \cdot α_{x} + \sqrt{n} Γ θ + ε = Z α + ε,$ (2.3) where $Z = \sqrt{n} (z_{x}, Γ) \in R^{n \times (n + 1)}$ and $α = {(α_{x}, θ^{⊤})}^{⊤} \in R^{n + 1}$. The parameter $θ$ is treated as an n-dimensional nuisance parameter vector. As shown later in Section 2.2, Γ plays a critical role in this model, providing a flexible way to leverage different sources of information. A naive choice is $Γ = I_{n}$ without using additional information, which will be discussed further in Section F of the supplementary materials. In contrast to the p parameters of the original linear model, the transformed model (2.3) has only n + 1 unknown parameters that are (approximately) sparse. Without loss of generality, we assume that $| α_{x} |$ and, consequently, $\| ζ_{β} \|$ are bounded, given x and X. This assumption is mild and holds in probability when x and the X_i's are iid from $N (0, Σ)$. Detailed discussions are presented in Section A of the supplementary materials.

Denote by ${\hat{α}}_{x}$ a generic estimator of α_x. Then γ_x can be estimated by ${\hat{γ}}_{x} = {\hat{α}}_{x} \cdot \| x \|^{2} \| X x \|^{- 1} n^{1 / 2}$. Given x, noting that $n^{- 1} \| X x \|^{2} \to_{p} \| x \|_{Σ}^{2}$, where $\| x \|_{Σ} = \sqrt{x^{⊤} Σ x}$, we see that the quantity $\| x \|^{2} / \| x \|_{Σ}$ affects the convergence rate of the estimator ${\hat{γ}}_{x}$. We investigate the magnitude of $\| x \|^{2} / \| x \|_{Σ}$, and consider two typical settings for clarity. Recall that x is a given vector, which may or may not have the same distribution as X_i.

Example 1.

Let x be a sparse vector with the support set $S_{x} = supp (x)$ and the cardinality $| S_{x} | := s_{x}$. Examples of such x include x = e_i or $x = e_{i} - e_{j}$. Denote $X_{i S_{x}} = (X_{i j}, j \in S_{x})$. If the eigenvalues of $Σ_{S_{x} S_{x}} = cov (X_{i S_{x}})$ are both upper and lower bounded, and $\| x \|_{\infty} = O (1)$, then it follows that $\| x \|^{2} / \| x \|_{Σ} ≍ \| x \| ≍ s_{x}^{1 / 2}$.

Example 2.

Let x be a random vector as in prediction problems. For simplicity, we assume x and the X_i's are iid variables from $N (0, Σ)$. As shown in Section 4.2, it follows that $\| x \|^{2} / \| x \|_{Σ} ≍ \sqrt{{[tr (Σ)]}^{2} / tr (Σ^{2})} := M_{Σ},$ (2.4) in probability. Moreover, it holds that $1 \leq M_{Σ} \leq p^{1 / 2}$, and $M_{Σ}^{2}$ can be viewed as the effective rank of Σ. In particular, if some eigenvalues of Σ are much larger than the rest (e.g., the largest eigenvalue has the order p), then $M_{Σ} ≍ 1$; if Σ is close to I_p, then $M_{Σ}$ is close to $p^{1 / 2}$ and γ_x is close to $p^{1 / 2} α_{x}$.
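The two extremes of $M_{Σ}$ mentioned in Example 2 can be reproduced directly from the definition in (2.4). The helper name `effective_rank_root` is hypothetical, and the two matrices are toy choices: the identity, and the rank-one matrix of all ones (which has unit diagonal and largest eigenvalue p).

```python
import numpy as np

def effective_rank_root(Sigma):
    """M_Sigma = sqrt(tr(Sigma)^2 / tr(Sigma^2)) for a symmetric matrix."""
    tr = np.trace(Sigma)
    tr_sq = np.sum(Sigma * Sigma)   # equals tr(Sigma^2) when Sigma is symmetric
    return tr / np.sqrt(tr_sq)

p = 100
M_identity = effective_rank_root(np.eye(p))        # upper extreme: sqrt(p)
M_rank_one = effective_rank_root(np.ones((p, p)))  # lower extreme: 1
print(M_identity, M_rank_one)
```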

Clearly, the magnitude of $\| x \|^{2} / \| x \|_{Σ}$ with a sparse x in Example 1 can be smaller than that of the dense x in Example 2. If x is not sparse, the sparsity of $β$ can help as well. This observation motivates us to consider estimators using the information of the sparsity degree of $β$. For any subset $S_{1} \subseteq \{1, \dots, p\}$ such that $S_{0} \subseteq S_{1}$, we observe that $γ_{x} = β^{⊤} x = β_{S_{0}}^{⊤} x_{S_{0}} = β_{S_{1}}^{⊤} x_{S_{1}} = β^{⊤} {\tilde{x}}_{S_{1}} = γ_{{\tilde{x}}_{S_{1}}},$ where $x_{S_{0}}$ and $β_{S_{0}}$ are the subvectors of x and $β$, respectively, and ${\tilde{x}}_{S_{1}}$ is a p-dimensional vector obtained by setting $x_{S_{1}^{c}} = 0$ in x. Thus, instead of estimating γ_x, one can equivalently consider prediction at the point ${\tilde{x}}_{S_{1}}$, which is a sparse vector when $| S_{1} |$ is small. Clearly, one can set $S_{1} = \{1, \dots, p\}$, and then $γ_{{\tilde{x}}_{S_{1}}}$ becomes γ_x. Estimating $γ_{{\tilde{x}}_{S_{1}}}$ provides a way to take the sparsity degree of $β$ into account. In practice, one can choose different S₁'s and select the best one by CV, as shown later in Section 2.3. In Section 4 we will compare our method with several existing methods in detail for Examples 1 and 2.
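The identity $γ_{x} = γ_{{\tilde{x}}_{S_{1}}}$ for any S₁ containing the support of $β$ can be illustrated with a toy example. The vectors and the set `S1` below are made up for illustration, the helper name `restrict_to_set` is hypothetical, and indices are 0-based in the code.

```python
import numpy as np

def restrict_to_set(x, S1):
    """Return x-tilde: a copy of x with all entries outside S1 set to zero."""
    x_tilde = np.zeros_like(x)
    x_tilde[S1] = x[S1]
    return x_tilde

# Toy coefficient vector with support {0, 2} (illustrative values only).
beta = np.array([1.5, 0.0, -2.0, 0.0, 0.0, 0.0])
x = np.array([0.3, 1.0, -0.7, 2.0, 0.5, -1.2])

S1 = [0, 1, 2]                       # any superset of supp(beta) works
x_tilde = restrict_to_set(x, S1)

gamma_x = beta @ x
gamma_x_tilde = beta @ x_tilde
print(gamma_x, gamma_x_tilde)
```

Here `x_tilde` has only three nonzero entries although `x` is dense, which is the sense in which a small S₁ turns a dense test point into a sparse one.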

Remark 1.

In Example 2 above, a smaller set S₁ is preferred. However, the true support set S₀ is unknown. When s₀ is small, choosing a set S₁ that covers S₀ is feasible. For example, S₁ can be taken as the support set of the LASSO estimator, or constructed by some screening methods (Fan and Lv 2008).

2.2 Construct Basis Utilizing Multiple Sources of Information

Next we discuss the construction of Γ. For a positive semidefinite matrix $A \in R^{p \times p}$, denote by $λ (A) = (λ_{1} (A), \dots, λ_{p} (A))$ the vector of eigenvalues of A in decreasing order. We say that $λ (A)$ is approximately sparse when only a few eigenvalues are much larger than the average $p^{- 1} \sum_{i = 1}^{p} λ_{i} (A)$; the detailed requirements on the decreasing rate of the eigenvalues will be elaborated later. In practice, different sources of information may be available. For clarity of presentation, we focus on two different sources of information.

Source I. We use the information of $β$ through an initial estimator $\hat{β}$. Note that a good estimator is available in some cases. In particular, if $β$ is sparse, then $\hat{β}$ can be obtained from LASSO or other sparse regression methods. However, an estimator $\hat{β}$ may not be good enough in many cases, especially when $β$ is less sparse and p is larger than n.

Source II. We use the dependence among predictors. Note that $λ (n^{- 1} X Q_{x} X^{⊤}) = λ (n^{- 1} X Q_{x} Q_{x} X^{⊤})$, which equals the first n elements of $λ (n^{- 1} Q_{x} X^{⊤} X Q_{x})$, of which the population version is $λ (Q_{x} Σ Q_{x})$. When predictors are correlated such that $λ (n^{- 1} X Q_{x} X^{⊤})$ is (approximately) sparse, a key observation is that by choosing a suitable Γ, the parameter $θ$ can be (approximately) sparse regardless of whether $β$ is sparse (see Section 3.2 for details). Thus, it is feasible to estimate the parameters well in the transformed model (2.3).

Denote $C_{1} = \{β \text{ is sparse}\}$, $C_{2} = \{λ (n^{- 1} X Q_{x} X^{⊤}) \text{ is sparse}\}$, and let $\tilde{C} = C_{1}^{c} \cap C_{2}^{c}$. The sparsity in $C_{2}$ is complementary to that in $C_{1}$. Both $C_{1}$ and $C_{2}$ are ideal cases with good estimators available (we will introduce estimators for Case $C_{2}$ later), and the case $\tilde{C}$ can be the least favorable case. There are intermediate cases between $C_{1}$ and $\tilde{C}$, when the degree of sparsity of $β$ increases gradually. A similar argument applies between $C_{2}$ and $\tilde{C}$. To handle these complicated cases, it is natural to use these two different sources of information jointly. In fact, our approach works well under the complementary condition in Section 3.3 where stronger requirements on one type of sparsity weaken those of the other. It is possible that both $C_{1}$ and $C_{2}$ hold simultaneously, where our estimators leverage both sources of information.

Denote the spectral decomposition of $n^{- 1} X Q_{x} X^{⊤}$ as $Γ_{eg} Ψ Γ_{eg}^{⊤},$ where the columns of $Γ_{eg} = (u_{eg, 1}, \dots, u_{eg, n}) \in R^{n \times n}$ are eigenvectors and $Ψ = diag (ψ_{1}, \dots, ψ_{n})$ is the diagonal matrix of the associated eigenvalues in decreasing order. Denote ${\bar{ζ}}_{β} = ζ_{β} / \| ζ_{β} \| = X β_{Q_{x}} / \| X β_{Q_{x}} \|$. When an initial estimator $\hat{β}$ is available, ${\bar{ζ}}_{β}$ can be estimated by ${\bar{ζ}}_{\hat{β}}$. To use the two different sources of information jointly, or in other words to use both ${\bar{ζ}}_{\hat{β}}$ and the columns of $Γ_{eg}$, we construct Γ by replacing one of the columns (say, the ith column) of $Γ_{eg}$ by ${\bar{ζ}}_{\hat{β}}$, that is, $Γ = Γ (\hat{β})$, defined as $Γ (\hat{β}) = ({\bar{ζ}}_{\hat{β}}, u_{eg, j}, j \neq i),$ (2.5) which is the empirical version of $Γ (β) = ({\bar{ζ}}_{β}, u_{eg, j}, j \neq i)$. More discussion on this process is provided in Proposition 1 and Remark 2.

Recall that α_x, associated with the predictor $\sqrt{n} z_{x}$, is the parameter of interest in the transformed model that has predictors $\sqrt{n} (z_{x}, Γ)$. It is desirable to avoid collinearity between z_x and the other predictors in the transformed model. Hence, we assume that $\hat{β}$ satisfies ${\bar{ζ}}_{\hat{β}} \neq z_{x}$, which can be checked from data, and require the matrix $(z_{x}, u_{eg, j}, j \neq i)$ to be invertible. Note that $Γ (\hat{β})$, that is $({\bar{ζ}}_{\hat{β}}, u_{eg, j}, j \neq i)$, is also required to be invertible.

Proposition 1.

Suppose that $\hat{β}$ satisfies $| z_{x}^{⊤} {\bar{ζ}}_{\hat{β}} | > 0$ and ${\bar{ζ}}_{\hat{β}} \neq z_{x}$. Then for at least one $i \in \{1, \dots, n\}$, it holds that both matrices $({\bar{ζ}}_{\hat{β}}, u_{eg, j}, j \neq i)$ and $(z_{x}, u_{eg, j}, j \neq i)$ are invertible, or equivalently that $\min \{| u_{eg, i}^{⊤} {\bar{ζ}}_{\hat{β}} |, | u_{eg, i}^{⊤} z_{x} |\} > 0$.

In principle, we can replace any $u_{eg, i}$ by ${\bar{ζ}}_{\hat{β}}$ as long as $\min \{| u_{eg, i}^{⊤} {\bar{ζ}}_{\hat{β}} |, | u_{eg, i}^{⊤} z_{x} |\} > 0$. In our numerical studies, the strategy in Remark 2 below is used to further reduce collinearity.

Remark 2.

Reducing the collinearity between z_x and the other predictors in the transformed model makes the estimator of α_x more stable. In our simulation studies, we replace $u_{eg, i_{0}}$ by ${\bar{ζ}}_{\hat{β}}$ with $i_{0} = arg \max_{1 \leq i \leq n} | u_{eg, i}^{⊤} z_{x} |$, and the resulting $Γ (\hat{β})$ is observed to be numerically nonsingular.
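The construction of $Γ_{eg}$ and of $Γ (\hat{β})$ with the column choice of Remark 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_bases`, the simulated inputs, and the random stand-in for the initial estimator `beta_hat` are assumptions, and the code presumes that $X Q_{x} \hat{β} \neq 0$ so that the normalization is well defined.

```python
import numpy as np

def build_bases(X, x, beta_hat):
    """Return eigenvalues psi (decreasing), Gamma_eg, Gamma(beta_hat), and i0."""
    n = X.shape[0]
    XQ = X - np.outer(X @ x, x) / (x @ x)    # X Q_x without forming Q_x
    G = XQ @ XQ.T / n                        # n^{-1} X Q_x X^T (Q_x is idempotent)
    eigvals, eigvecs = np.linalg.eigh(G)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    psi, Gamma_eg = eigvals[order], eigvecs[:, order]

    z_x = X @ x / np.linalg.norm(X @ x)
    zeta_hat = XQ @ beta_hat                 # assumed nonzero
    zeta_bar = zeta_hat / np.linalg.norm(zeta_hat)

    # Remark 2: replace the eigenvector most aligned with z_x.
    i0 = int(np.argmax(np.abs(Gamma_eg.T @ z_x)))
    Gamma_hat = np.column_stack([zeta_bar, np.delete(Gamma_eg, i0, axis=1)])
    return psi, Gamma_eg, Gamma_hat, i0

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 40))            # illustrative sizes only
x = rng.standard_normal(40)
beta_hat = rng.standard_normal(40)           # stand-in for an initial estimator
psi, Gamma_eg, Gamma_hat, i0 = build_bases(X, x, beta_hat)
```

Invertibility of the resulting matrix is not guaranteed by the code itself; in line with Proposition 1, it can be checked numerically, for example with `np.linalg.matrix_rank(Gamma_hat)`.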

Naturally, one can just use the information in Source II by setting $Γ = Γ_{eg}$. A good property of this choice is that it depends only on X, without requiring an initial estimator $\hat{β}$, and is free of any assumption on the sparsity of $β$. However, this choice may not be ideal if a reasonably good initial estimator $\hat{β}$ can be obtained. We summarize the constructions of Γ used in this article in Table 1.

Table 1 Candidates of Γ for the pointwise estimator.


In Section 3, we show that when $λ (n^{- 1} X Q_{x} X^{⊤})$ is sparse enough, $θ = Γ_{eg}^{- 1} ζ_{β}$ is approximately sparse without any sparsity assumption on $β$. An extreme case is Σ being a low rank matrix, where $θ$ is exactly sparse with at most $rank (Σ) + 1$ nonzero elements. It is worth pointing out that which elements of $θ$ are large is generally unknown, since $β$ is involved. When $λ (n^{- 1} X Q_{x} X^{⊤})$ is less sparse, as discussed below, $\hat{β}$ will provide additional information and $θ$ can still be approximately sparse.

When $Γ = Γ (\hat{β})$, the sparsity of $θ$ also depends on the accuracy of $\hat{β}$. For the ideal case that $\hat{β} = β$, it can be shown that $θ = Γ {(β)}^{- 1} ζ_{β} \propto e_{1} = {(1, 0, \dots, 0)}^{⊤}$, and consequently $θ$ is a sparse vector. This argument still holds if we replace $(u_{eg, j}, j \neq i_{0})$ by any other vectors such that $Γ (β)$ is invertible, implying that if we know $β$, there is no need for additional information. Consequently, if $\hat{β}$ is good, $(u_{eg, j}, j \neq i_{0})$ do not help much. When $\hat{β}$ is not good enough (e.g., when $β$ is less sparse) but $λ (n^{- 1} X Q_{x} X^{⊤})$ is sufficiently sparse, using the $u_{eg, i}$'s will compensate for the low accuracy of $\hat{β}$. Thus, both types of information can help each other in our framework and the estimator becomes more robust to the underlying assumptions. Details are provided in Section 3.
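The ideal case $\hat{β} = β$ can be verified numerically: since the first column of $Γ (β)$ is ${\bar{ζ}}_{β}$, solving $Γ (β) θ = ζ_{β}$ gives a multiple of e₁. The sketch below uses simulated inputs; for numerical convenience it drops the eigenvector most aligned with ${\bar{ζ}}_{β}$, which is an illustrative choice (not the rule of Remark 2) that keeps the matrix well conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 30                                  # illustrative sizes only
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
x = rng.standard_normal(p)

XQ = X - np.outer(X @ x, x) / (x @ x)          # X Q_x
zeta_beta = XQ @ beta / np.sqrt(n)
zeta_bar = zeta_beta / np.linalg.norm(zeta_beta)

U = np.linalg.eigh(XQ @ XQ.T / n)[1]           # eigenvectors of n^{-1} X Q_x X^T
i = int(np.argmax(np.abs(U.T @ zeta_bar)))     # illustrative column to replace
Gamma = np.column_stack([zeta_bar, np.delete(U, i, axis=1)])

theta = np.linalg.solve(Gamma, zeta_beta)      # theta = Gamma^{-1} zeta_beta
print(np.round(theta, 8))
```

The first entry of `theta` equals the norm of `zeta_beta` and the remaining entries vanish up to rounding error, whichever admissible column is replaced.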

2.3 Penalized Estimator and Pointwise Estimator

As discussed in Section 2.2, the parameters in the transformed model (2.3) can be approximately sparse if Γ is constructed properly. Thus, we consider the minimization of the following objective function $L_{λ, Γ} (α) = n^{- 1} \| Y - Z α \|^{2} + P_{en, λ} (α),$ (2.6) where $P_{en, λ} (α) = λ \| α \|_{1}$ is the $l_{1}$ penalty function used by LASSO, and λ and Γ are the tuning parameters. The parameter λ plays the same role as that for the usual regularized estimator and can be selected by cross-validation (CV). The selection of Γ will be elaborated below. Since the $l_{1}$ penalty usually induces biases for coefficients of large absolute values, other nonconvex penalty functions, such as the SCAD (Fan and Li 2001) or MCP (Zhang 2010), can be used instead. Denote the minimizer as ${\hat{α}}_{λ, Γ} = {({\hat{α}}_{x}, {\hat{θ}}^{⊤})}^{⊤} = arg \min_{α \in R^{n + 1}} L_{λ, Γ} (α) .$ (2.7)

Then we have ${\hat{γ}}_{x} = {\hat{α}}_{x} \cdot \| x \|^{2} \| X x \|^{- 1} n^{1 / 2}$.

Remark 3.

Our approach involves the eigenvalue decomposition of the matrix $n^{- 1} X Q_{x} X^{⊤} \in R^{n \times n}$, with computational complexity of the order $O (n^{3})$, which can be a burden when n is large. To reduce the complexity, the Divide-and-Conquer (DC) approach for handling big data can be used (Zhang, Duchi, and Wainwright 2015). Simulation results based on DC are presented in Section G.2 of the supplementary materials.

Next we briefly discuss the estimator in (2.7). First, the predictor Z in Model (2.3) involves the transformation matrix Γ, while in the classical model (2.1), the predictor is X, of which each row represents a realization of the predictors. Second, unlike the classical methods such as the LASSO, where the parameters are unknown constants, the parameter $α$ here involves $(x, X, Γ)$ , which makes the theoretical analysis challenging.

Remark 4.

In the above arguments, we consider a single point x that may or may not be from the same distribution as the predictor vector. However, if we are going to consider a large number of test points $\{x_{i}, i = 1, \dots, n_{t e}\}$ that are iid observations of X_i, the bias should be taken into account. Because $E (γ_{x_{i}}) = E (β^{⊤} x_{i}) = 0$ and $var (γ_{x_{i}}) = β^{⊤} Σ β < var (Y_{1})$, there will be many $γ_{x_{i}}$'s that are close to 0, and the corresponding estimators are shrunken to 0 by the penalization, resulting in large biases in terms of the average estimation error. There are many bias correction methods for LASSO and related penalized estimators in the literature (Belloni and Chernozhukov 2013; Zhang and Zhang 2014, etc.). In our simulation results for the prediction problem, the method of Belloni and Chernozhukov (2013) is applied.

For clarity, we briefly summarize the estimation procedure for a given Γ as follows.

Algorithm 1

Estimator of γ_x (or $γ_{{\tilde{x}}_{S_{1}}}$ ) with given Γ

1. (Estimate α_x) Solve the optimization problem (2.6), where the optimal λ is chosen by CV; the corresponding parameter obtained is denoted as ${({\hat{α}}_{x}, {\hat{θ}}^{⊤})}^{⊤}$ .

2. (Bias-correction) This is an optional step. Denote $\hat{S} = supp (\hat{θ})$ and $Γ_{\cdot \hat{S}}$ the columns of Γ with index $\hat{S}$ . Apply the OLS with responses Y and predictors $Z = \sqrt{n} (z_{x}, Γ_{\cdot \hat{S}})$ to get the updated coefficient ${\hat{α}}_{x}$ of $\sqrt{n} z_{x}$ , which is a bias-corrected estimator of α_x.

3. (Estimate γ_x) γ_x is estimated by ${\hat{γ}}_{x} = {\hat{α}}_{x} \cdot ⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2} .$

4. (Alternative Steps 1–3) Replacing x by ${\tilde{x}}_{S_{1}}$ in Steps 1–3, we get the estimator ${\hat{γ}}_{{\tilde{x}}_{S_{1}}}$ .
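Steps 2 and 3 of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `bias_corrected_alpha` and `gamma_from_alpha` are hypothetical, and the inputs `z_x`, `Gamma`, and the penalized estimate `theta_hat` from Step 1 are assumed to be supplied by the caller.

```python
import numpy as np

def bias_corrected_alpha(Y, z_x, Gamma, theta_hat):
    """Step 2 (optional): OLS of Y on sqrt(n) * (z_x, Gamma[:, S_hat]).

    Returns the coefficient of sqrt(n) * z_x, where S_hat = supp(theta_hat).
    """
    n = Y.shape[0]
    S_hat = np.flatnonzero(theta_hat)          # support of the penalized estimate
    Z = np.sqrt(n) * np.column_stack([z_x, Gamma[:, S_hat]])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return float(coef[0])

def gamma_from_alpha(alpha_hat, x, X):
    """Step 3: gamma_hat = alpha_hat * ||x||^2 * ||X x||^{-1} * n^{1/2}."""
    n = X.shape[0]
    return float(alpha_hat * np.dot(x, x) * np.sqrt(n) / np.linalg.norm(X @ x))
```

The least-squares fit in Step 2 is well posed only when the selected support is small enough that the columns of Z are linearly independent, so in practice one may want to check the rank of Z before refitting.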

Step 2 is mainly intended for the setting in Remark 4, with a large number of test points that are iid observations from the same distribution as the training data X_i’s. For clarity, the pointwise estimator obtained by Algorithm 1 is named after the specific Γ used. Many estimators of $β$ are available in the literature for different settings, and any of them can be used as an initial estimator. Denote by ${\hat{β}}_{lasso}$ and ${\hat{β}}_{ridge}$ the estimators of LASSO and ridge regression, respectively, and by ${\hat{β}}_{rdl}$ the overparameterized ridgeless OLS estimator using the Moore-Penrose generalized inverse (Bartlett et al. Citation2020; Azriel and Schwartzman Citation2020; Hastie et al. Citation2022). We consider the estimators ${\hat{γ}}_{x}$ in Algorithm 1 with Γ being $Γ_{eg}$ and $Γ (\hat{β})$ for $\hat{β} \in \{ {\hat{β}}_{lasso}, {\hat{β}}_{ridge}, {\hat{β}}_{rdl} \}$ , and the resulting estimators are denoted as $P_{eg}, P_{lasso}, P_{ridge}, P_{rdl}$ , respectively. Now we propose a procedure to select Γ adaptively. Denote by $M$ a set of estimators, for example, $M = \{ P_{lasso}, P_{ridge}, P_{eg}, P_{rdl} \}$ in Setting 1 of our numerical study. The best estimator selected from $M$ by the following CV procedure will be denoted as PWE.

Algorithm 2

Select Γ by CV

1. Compute the CV error for estimators in $M$ . Split randomly the whole data $D$ of size n into K parts, denoted as $D_{1}, \dots, D_{K}$ . For each method $A \in M$ , compute the CV error, $n^{- 1} \sum_{k = 1}^{K} \sum_{(x_{i}, y_{i}) \in D_{k}} {({\hat{γ}}_{x_{i}}^{A} - y_{i})}^{2}$ , where ${\hat{γ}}_{x_{i}}^{A}$ is estimated by A with data $D ∖ D_{k}$ .

2. The method A₀ in $M$ with the minimum CV error is chosen to be the best one.

3. The final estimator of γ_x is ${\hat{γ}}_{x}^{A_{0}}$ .
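The selection rule of Algorithm 2 can be sketched as a generic K-fold loop. The sketch below is an illustration under stated assumptions: `select_by_cv` is a hypothetical name, and each candidate in `methods` is assumed to be a callable `fit_predict(X_train, y_train, x)` returning a scalar estimate of γ_x, standing in for estimators such as $P_{lasso}$ or $P_{eg}$.

```python
import numpy as np

def select_by_cv(X, y, methods, K=5, seed=0):
    """Return (best_name, cv_errors) using the K-fold CV error of Algorithm 2.

    methods: dict mapping a name to fit_predict(X_train, y_train, x) -> float
             (a hypothetical interface for the candidate estimators in M).
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)      # D_1, ..., D_K
    cv_errors = {}
    for name, fit_predict in methods.items():
        sq_err = 0.0
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)   # indices of D \ D_k
            for i in held_out:
                pred = fit_predict(X[train], y[train], X[i])
                sq_err += (pred - y[i]) ** 2
        cv_errors[name] = sq_err / n                   # n^{-1} * sum of squared errors
    best = min(cv_errors, key=cv_errors.get)           # the method A_0
    return best, cv_errors
```

Because the estimators here are pointwise, each held-out point triggers a separate fit, so the cost grows with n times the cost of one fit; the loop is written for clarity rather than speed.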

For Example 1 in Section 2.1 where x is a sparse vector, we apply Algorithm 2 to get the pointwise estimator. For the prediction problems in Example 2, since the sparsity degree of $β$ is unknown, we have two choices on the subset S₁: (a) simply taking $S_{1} = {1, \dots, p}$ (i.e., estimate γ_x directly); (b) estimating S₁ from data using LASSO or screening methods. Given Γ, among the candidates of S₁, we can select the best one by CV, similar to Steps 1 and 2 of Algorithm 2. Note that x_i should be replaced by ${\tilde{x}}_{i S_{1}}$ , which is defined in the way similar to ${\tilde{x}}_{S_{1}}$ , when one computes the CV error in Step 1 of Algorithm 2. Moreover, one can select both Γ and S₁ simultaneously by CV.

3 Properties of the Penalized Estimator ${\hat{α}}_{x}$

Throughout our theoretical analysis, it is assumed that ϵ_i’s are iid from $N (0, σ^{2})$ , and $(x, X)$ can be fixed or random and will be specified later. In this section, we present theoretical properties of the regularized estimator ${\hat{α}}_{x}$ in (2.7), which lays a foundation for properties of the estimator ${\hat{γ}}_{x}$ in Section 4. To this end, we first establish results for a generic invertible matrix Γ with fixed $(x, X)$ in Section 3.1, and then apply it to $Γ_{eg}$ and $Γ (\hat{β})$ in Sections 3.2 and 3.3, respectively.

3.1 A General Result on ${\hat{α}}_{x}$ with a Generic Γ and Fixed $(x, X)$

Recall that $Z = \sqrt{n} (z_{x}, Γ)$ in Model (2.3). Let $v_{x} = {(1, - b_{x}^{⊤})}^{⊤} / {(1 + b_{x}^{⊤} b_{x})}^{1 / 2}$ with $b_{x} = Γ^{- 1} z_{x} .$ It can be shown that v_x is the eigenvector associated with the zero eigenvalue of $n^{- 1} Z^{⊤} Z$ (see proof of Theorem E.1 in supplementary materials). For any $v = {(v_{1}, \dots, v_{n + 1})}^{⊤} \in R^{n + 1}$ , denote $⏧ v ⏧_{(q)} = \sum_{i = 1}^{n + 1} | v_{i} |^{q}; q \in [0, 1] .$ It is worth noting that $⏧ v ⏧_{(q)}$ here is not the $l_{q}$ norm of v that is defined as $⏧ v ⏧_{q} = ⏧ v ⏧_{(q)}^{1 / q}$ ; and for q = 1, they are the same. This notation $⏧ v ⏧_{(q)}$ is convenient for our purpose, particularly for the discussions of the case with q = 0 or $q \to 0$ .
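The claim that v_x spans the null direction of $n^{- 1} Z^{⊤} Z$ can be checked numerically, since $Z v_{x}$ is proportional to $z_{x} - Γ Γ^{- 1} z_{x} = 0$. The following sketch uses an arbitrary invertible Γ and a unit vector z_x, both generated at random purely for illustration; the helper name `null_direction` is hypothetical.

```python
import numpy as np

def null_direction(z_x, Gamma):
    """v_x = (1, -b_x^T)^T / sqrt(1 + b_x^T b_x), with b_x = Gamma^{-1} z_x."""
    b_x = np.linalg.solve(Gamma, z_x)
    v = np.concatenate(([1.0], -b_x))
    return v / np.sqrt(1.0 + b_x @ b_x)

rng = np.random.default_rng(0)
n = 5
z_x = rng.normal(size=n)
z_x /= np.linalg.norm(z_x)                        # unit vector (assumed normalization)
Gamma = rng.normal(size=(n, n)) + n * np.eye(n)   # an arbitrary invertible matrix
Z = np.sqrt(n) * np.column_stack([z_x, Gamma])    # n x (n + 1) design of Model (2.3)
v_x = null_direction(z_x, Gamma)
residual = np.linalg.norm((Z.T @ Z / n) @ v_x)    # zero up to rounding error
```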

For $α = {(α_{x}, θ^{⊤})}^{⊤}$ and $q \in [0, 1]$ , denote $R_{q} = ⏧ α ⏧_{(q)} = | α_{x} |^{q} + ⏧ θ ⏧_{(q)},$ which measures the sparsity of the parameter $α$ . Particularly, when q = 0, R_q is the number of nonzero elements in $α$ . Let the tuning parameter $λ = λ_{n} : = a_{0} σ \sqrt{(\log n) / n}$ with $a_{0} \geq 2$ . We have the following result on a generic invertible matrix Γ.
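As a small illustration of this notation, the sketch below computes $⏧ v ⏧_{(q)}$, the sparsity measure R_q, and the tuning parameter λ_n. The function names are hypothetical; the case q = 0 is handled separately so that zero entries are not counted.

```python
import numpy as np

def q_sum(v, q):
    """||v||_(q) = sum_i |v_i|^q for q in [0, 1]; q = 0 counts nonzero entries."""
    v = np.abs(np.asarray(v, dtype=float))
    if q == 0:
        return float(np.count_nonzero(v))
    return float(np.sum(v ** q))

def R_q(alpha_x, theta, q):
    """R_q = |alpha_x|^q + ||theta||_(q)."""
    return q_sum([alpha_x], q) + q_sum(theta, q)

def lambda_n(n, sigma, a0=2.0):
    """lambda_n = a0 * sigma * sqrt(log(n) / n), with a0 >= 2."""
    return float(a0 * sigma * np.sqrt(np.log(n) / n))
```

For example, with θ = (3, 0, −4) one gets $⏧ θ ⏧_{(1)} = 7$ and $⏧ θ ⏧_{(0)} = 2$, while the $l_{q}$ norm would be `q_sum(theta, q) ** (1 / q)` for q > 0.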

Theorem 1.

Assume that $(x, X)$ are fixed. Let Γ be a generic invertible matrix that may depend on $(x, X)$ . Assume the following $α$ -sparsity condition: $R_{q} \leq C λ_{n}^{q} ⏧ v_{x} ⏧_{1}^{2}$ for some $q \in [0, 1]$ and some constant C > 0. Then with probability $1 - C_{1} n^{- 3}$ , we have $| {\hat{α}}_{x} - α_{x} | \leq \frac{5}{4} λ_{n} + C_{κ} λ_{n}^{1 - q} R_{q} \cdot \min {λ_{n}^{q / 2} R_{q}^{- 1 / 2} ⏧ Γ^{⊤} z_{x} ⏧, ⏧ Γ^{⊤} z_{x} ⏧_{\infty}},$ where C₁, $C_{κ}$ are positive constants. Furthermore, when Γ is a generic orthogonal matrix, the $α$ -sparsity condition can be simplified as $R_{q} \leq C' λ_{n}^{q} ⏧ Γ^{⊤} z_{x} ⏧_{1}^{2}$ for some constant $C' > 0$ .

We briefly discuss the above result. First, note that $⏧ Γ^{⊤} z_{x} ⏧_{\infty} \leq ⏧ Γ^{⊤} z_{x} ⏧$ , where the latter equals 1 when Γ is an orthogonal matrix. Further discussion is given in Section A of the supplementary materials. Second, since λ_n has the order $\sqrt{\log n / n}$ , the bound of Theorem 1 does not explicitly depend on p, which is reasonable, since we have only n + 1 parameters in the transformed model (2.3). Third, the bound given above involves the sparsity of the parameter vector $α$ in the transformed model instead of that of $β$ , allowing $β$ to be sparse or non-sparse. As applications of Theorem 1, we present below the results for Γ being $Γ_{eg}$ and $Γ (\hat{β})$ , respectively.

Proposition 2.

Suppose that $(x, X)$ are fixed. Let $Γ = Γ_{eg}$ . Assume that the $α$ -sparsity condition $R_{q} \leq C λ_{n}^{q} ⏧ Γ^{⊤} z_{x} ⏧_{1}^{2}$ holds for some constant C > 0. Then it holds that $| {\hat{α}}_{x} - α_{x} | \leq C' λ_{n}^{1 - q / 2} R_{q}^{1 / 2}$ with probability $1 - C_{1} n^{- 3}$ , where $C_{1}, C'$ are positive constants.

Similar to Theorem 1, Proposition 2 allows $β$ to be sparse or non-sparse. Due to α_x being bounded, we have $R_{q} ≍ ⏧ θ ⏧_{(q)}$ , which can be bounded as shown in Section 3.2. When X_i’s are random, a lower bound on $⏧ Γ_{eg}^{⊤} z_{x} ⏧_{1}$ is given in Section E of the supplementary materials.

3.2 Sparsity of $θ$ and Properties of ${\hat{α}}_{x}$ when $Γ = Γ_{eg}$

We first show that for $Γ = Γ_{eg}$ , if the eigenvalues of $n^{- 1} X Q_{x} X^{⊤}$ decrease at a certain rate, $θ$ will be approximately sparse; specifically, $⏧ θ ⏧_{(q)}$ is bounded for some $q \in [0, 1]$ . Then we give the asymptotic results on ${\hat{α}}_{x}$ . Recall that $ψ_{1} \geq ψ_{2} \geq \dots \geq ψ_{n} \geq 0$ are the eigenvalues of $n^{- 1} X Q_{x} X^{⊤}$ ; they are also the first n eigenvalues of $n^{- 1} Q_{x} X^{⊤} X Q_{x}$ . For $q \in [0, 1]$ and $k = 0, 1, \dots, n - 1$ , let ${\bar{ψ}}_{q, k} = {(n - k)}^{- 1} \sum_{i = k + 1}^{n} ψ_{i}^{q / (2 - q)}$ , and let $ϕ_{k}^{1 / 2} (β)$ be the norm of the projection of $β$ onto the subspace spanned by the eigenvectors of $n^{- 1} Q_{x} X^{⊤} X Q_{x}$ associated with the eigenvalues $ψ_{k + 1}, \dots, ψ_{n}$ . For k = n, set $ϕ_{n} (β) = {\bar{ψ}}_{q, n} = 0$ . Here ${\bar{ψ}}_{q, k}$ measures the average magnitude of the smallest n – k eigenvalues. For any $x_{1}, x_{2} \geq 0$ , denote $H_{q, k} (x_{1}, x_{2}) = k^{1 - q / 2} + {[(n - k) x_{1}]}^{1 - q / 2} x_{2}^{q / 2} .$ The following Lemma 1 is a deterministic result on $⏧ θ ⏧_{(q)}$ .

Lemma 1.

For $q \in [0, 1]$ , it holds that $⏧ θ ⏧_{(q)} \leq \min_{0 \leq k \leq n} {⏧ ζ_{β} ⏧^{q} k^{1 - q / 2} + {[(n - k) {\bar{ψ}}_{q, k}]}^{1 - q / 2} ϕ_{k}^{q / 2} (β)} .$ Moreover, since $⏧ ζ_{β} ⏧ = O (1)$ in Model (2.3), it holds that $⏧ θ ⏧_{(q)} = O (\min_{0 \leq k \leq n} H_{q, k} ({\bar{ψ}}_{q, k}, ϕ_{k} (β)))$ .
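The quantity $\min_{0 \leq k \leq n} H_{q, k} ({\bar{ψ}}_{q, k}, ϕ_{k} (β))$ in Lemma 1 is straightforward to evaluate once the eigenvalues ψ_i and the values ϕ_k(β) are available. The sketch below is a minimal illustration, assuming q in (0, 1] and assuming the caller supplies an array `phi` of length n + 1 with `phi[k]` equal to ϕ_k(β) and `phi[n] = 0`; all function names are hypothetical.

```python
import numpy as np

def psi_bar(psi, q, k):
    """Average of psi_i^{q/(2-q)} over i = k+1, ..., n; defined as 0 when k = n."""
    n = len(psi)
    if k == n:
        return 0.0
    return float(np.mean(psi[k:] ** (q / (2.0 - q))))

def H(q, k, n, x1, x2):
    """H_{q,k}(x1, x2) = k^{1-q/2} + [(n-k) x1]^{1-q/2} * x2^{q/2}."""
    return k ** (1 - q / 2) + ((n - k) * x1) ** (1 - q / 2) * x2 ** (q / 2)

def min_H(psi, q, phi):
    """Minimum over k = 0, ..., n of H_{q,k}(psi_bar_{q,k}, phi[k])."""
    if not 0 < q <= 1:
        raise ValueError("this sketch assumes q in (0, 1]")
    # Sort as psi_1 >= ... >= psi_n and clip tiny negative values from rounding.
    psi = np.clip(np.sort(np.asarray(psi, dtype=float))[::-1], 0.0, None)
    n = len(psi)
    return min(H(q, k, n, psi_bar(psi, q, k), phi[k]) for k in range(n + 1))
```

When only the first few eigenvalues are nonzero, the minimum is attained at a small k, which is the mechanism behind the bounded cases discussed after the lemma.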

Generally, it is difficult to obtain a sharp upper bound in Lemma 1 without any information on the eigenvalues and $β$ . Simply setting k = n, we have the trivial bound $⏧ θ ⏧_{(q)} ≲ H_{q, n} ({\bar{ψ}}_{q, n}, ϕ_{n} (β)) = n^{1 - q / 2},$ (3.1) which, however, is unbounded and undesirable. To get better bounds on $⏧ θ ⏧_{(q)}$ , we take into account the properties of the eigenvalues and $β$ , and consider the following three cases:

Case (a): $n^{- 1} X X^{⊤}$ has low rank, say $rank (n^{- 1} X X^{⊤}) = r_{X} = O (1)$ ;
Case (b): $β$ is sparse with $⏧ β ⏧_{0} = s_{0}$ and that $⏧ β ⏧_{\infty} < C < \infty$ ;
Case (c): $β$ is dense, following a normal distribution $N (0, Σ_{β})$ .

Results are presented in Corollary 1 and Proposition 3, respectively.

Corollary 1.

The following conclusions are deterministic.

1. For Case (a), it holds that ${\bar{ψ}}_{q, k} = 0$ for $k \geq r_{X} + 1$ , and $⏧ θ ⏧_{(q)} = O (r_{X}^{1 - q / 2}) = O (1)$ .

2. For Case (b), assume that n < p, then $⏧ θ ⏧_{(q)} ≲ \min_{0 \leq k \leq n} H_{q, k} ({\bar{ψ}}_{q, k}, s_{0}) .$ Thus, if ${\bar{ψ}}_{q, k_{0}} = O (s_{0}^{- q / (2 - q)} n^{- 1})$ for some fixed k₀, then $⏧ θ ⏧_{(q)} = O (k_{0}^{1 - q / 2}) = O (1)$ .

In Corollary 1, conclusion (1) holds regardless of the sparsity of $β$ , while conclusion (2) relaxes the conditions on eigenvalues by taking advantage of the sparsity of $β$ . For Case (c), we assume that the column space of $Σ_{β}$ is the same as that of Σ, accommodating the identifiability condition that $β \in span (Σ)$ . We have the following results.

Proposition 3.

Suppose that $(X, x)$ is fixed. For Case (c) with span( $Σ_{β}$ )=span(Σ), assume the following conditions: (i) $⏧ ζ_{β} ⏧ = O_{p} (1)$ ; (ii) ${(n p)}^{- 1} \sum_{i = 1}^{n} X_{i}^{⊤} Q_{x} X_{i} ≍ 1$ ; (iii) n < p. Then it holds that $ϕ_{k} (β) = O_{p} (d_{β} n / p)$ for all k and that $⏧ θ ⏧_{(q)} = O_{p} (\min_{0 \leq k \leq n} H_{q, k} ({\bar{ψ}}_{q, k}, d_{β} n / p)),$ where $d_{β} = λ_{\max} (Σ_{β}) / λ_{\min}^{+} (Σ_{β})$ can be viewed as the condition number of $Σ_{β}$ . For clarity, we assume that $d_{β} = O (1)$ and consider the following two specific examples of eigenvalues:

(1) Suppose that $ψ_{1} \geq \dots \geq ψ_{k_{0}} = Ω (p)$ and $\max_{j \geq k_{0} + 1} ψ_{j} = O (p n^{- 2 / q})$ for some fixed k₀. Then it holds that $⏧ θ ⏧_{(q)} = O_{p} (k_{0}^{1 - q / 2}) = O_{p} (1)$ .
(2) Denote $f_{ψ} (i) = ψ_{i} / \sum_{i = 1}^{n} ψ_{i}$ , the scaled version of ψ_i, $i = 1, \dots, n$ . Assume that $f_{ψ} (i)$ decreases exponentially, that is, $f_{ψ} (w) = a \exp (- a (w - 1))$ for $w \geq 1$ , where $a = a_{n} > 0$ may depend on n. Then $⏧ θ ⏧_{(q)} = O_{p} (1)$ if $a_{n} = Ω (q^{- 1} \log n)$ .

Condition (i) is a natural extension of the condition $⏧ ζ_{β} ⏧ = O (1)$ for fixed $β$ in Section 2. Condition (ii) is mild, ruling out the extreme case that Σ has eigenvalues $(p, 0, \dots, 0)$ . Details are given in the proof in the supplementary materials. The two examples in Proposition 3 imply that $θ$ can be approximately sparse when the eigenvalues decrease fast enough. Based on the results on $⏧ θ ⏧_{(q)}$ in Corollary 1 and Proposition 3, and applying the conclusion of Proposition 2, we are ready to give the asymptotic results on ${\hat{α}}_{x}$ .

Theorem 2.

Suppose that $(x, X)$ are fixed. Taking $Γ = Γ_{eg}$ , we have the following conclusions for Cases (a)–(c):

For Case (a), $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n} r_{X}^{1 / 2})$ .
For Case (b), suppose that $λ_{n}^{q} ⏧ Γ^{⊤} z_{x} ⏧_{1}^{2} = Ω (1)$ and that n < p; if ${\bar{ψ}}_{q, k_{0}} = O (s_{0}^{- q / (2 - q)} n^{- 1})$ for some fixed k₀, then $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n}^{1 - q / 2} k_{0}^{1 / 2 - q / 4}) .$
For Case (c) of a dense $β$ , assume that the conditions of Proposition 3 hold and that $λ_{n}^{q} ⏧ Γ^{⊤} z_{x} ⏧_{1}^{2} = Ω (1)$ . If ψ_i’s are from Proposition 3 (1), then $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n}^{1 - q / 2} k_{0}^{1 / 2 - q / 4})$ ; if ψ_i’s are from Proposition 3 (2), then $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n}^{1 - q / 2})$ .

The condition $λ_{n}^{q} ⏧ Γ^{⊤} z_{x} ⏧_{1}^{2} = Ω (1)$ above can be checked from data. Here k₀ and $r_{X}$ reflect the sparsity of the parameters in the transformed model. Faster decay rates of eigenvalues result in smaller values of k₀ and $r_{X}$ , and consequently better convergence rates of $| {\hat{α}}_{x} - α_{x} |$ . Moreover, a smaller value of q results in a smaller value of $λ_{n}^{1 - q / 2}$ , but requires a faster decreasing rate of eigenvalues. Note that the above bounds only depend on n and the degree of sparsity in eigenvalues, and do not explicitly depend on p. Moreover, for Cases (a) and (c) where $β$ is allowed to be dense, the sparsity of eigenvalues is complementary to the sparsity of $β$ .

3.3 Properties of the Regularized Estimator ${\hat{α}}_{x}$ with an Initial $\hat{β}$

We study properties of our estimator with $Γ (\hat{β})$ , giving the convergence rate of ${\hat{α}}_{x}$ , for fixed x and X_i’s being iid from $N (0, Σ)$ . The normality assumption of X_i’s simplifies the proofs and can be relaxed to general distributions such as sub-Gaussian distributions. We first give a result on $⏧ {\bar{ζ}}_{\hat{β}} - {\bar{ζ}}_{β} ⏧$ . Recall that $ζ_{\hat{β}} = n^{- 1 / 2} X {\hat{β}}_{Q_{x}}$ .

Proposition 4.

Assume that x is fixed, X_i’s are iid from $N (0, Σ)$ , and $cov (X_{i}^{⊤} β_{Q_{x}}) ≍ 1$ . For any estimator $\hat{β}$ in Model (2.1), it holds that $⏧ {\bar{ζ}}_{\hat{β}} - {\bar{ζ}}_{β} ⏧ = O_{p} (\min {2, ⏧ ζ_{\hat{β}} - ζ_{β} ⏧})$ . Assuming further that $⏧ x ⏧_{\infty} ⏧ x ⏧_{Σ} / ⏧ x ⏧^{2} = O (1)$ , then $⏧ ζ_{\hat{β}} - ζ_{β} ⏧ = O_{p} (⏧ \hat{β} - β ⏧_{1} {(\log p)}^{1 / 2})$ .

When $\hat{β}$ is the LASSO estimator, we have $⏧ \hat{β} - β ⏧_{1} = O_{p} (σ s_{0} \sqrt{\log p / n})$ with $s_{0} = | supp (β) |$ (Bickel, Ritov, and Tsybakov Citation2009), and consequently $⏧ {\bar{ζ}}_{\hat{β}} - {\bar{ζ}}_{β} ⏧ = O_{p} (\min \{ 2, σ s_{0} \sqrt{{(\log p)}^{2} / n} \})$ . Let $H_{1, k} ({\bar{ψ}}_{1, k}, x)$ be the function obtained by setting q = 1 in $H_{q, k} ({\bar{ψ}}_{q, k}, x)$ defined in Section 3.2.

Theorem 3.

Let $Γ = Γ (\hat{β})$ . Assume the following conditions: (i) X_i’s are iid from $N (0, Σ)$ , x is fixed, and n < p; (ii) $β$ satisfies $cov (X_{i}^{⊤} β_{Q_{x}}) ≍ 1$ and $\hat{β}$ is obtained from additional data independent of $(X, Y)$ , satisfying $cov (X_{i}^{⊤} {\hat{β}}_{Q_{x}} | \hat{β}) ≍ 1$ . Then it holds that $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n} [1 + H_{\min} ⏧ ζ_{\hat{β}} - ζ_{β} ⏧])$ , where $H_{\min} = \min_{0 \leq k \leq n} H_{1, k} ({\bar{ψ}}_{1, k}, n / [p λ_{\min^{+}} (Σ)])$ . Assuming further the Complementary condition: $H_{\min} ⏧ ζ_{\hat{β}} - ζ_{β} ⏧ = O_{p} (1)$ , it follows that $| {\hat{α}}_{x} - α_{x} | = O_{p} (λ_{n})$ .

Based on the proof, one can see that $ζ_{\hat{β}}$ and $ζ_{β}$ in Theorem 3 can be replaced by ${\bar{ζ}}_{\hat{β}}$ and ${\bar{ζ}}_{β}$ respectively. The assumption on $β$ in (ii) is mild as $cov (X_{i}^{⊤} β_{Q_{x}}) = E (⏧ ζ_{β} ⏧^{2})$ and $⏧ ζ_{β} ⏧$ is bounded in probability. We give some examples on the magnitude of $H_{\min}$ .

Corollary 2.

(a) If Σ is of low rank, say $rank (Σ) = r_{Σ} = O (1)$ , then $H_{\min} = O (r_{Σ}^{1 / 2})$ . (b) Suppose that $\max_{j \geq k_{0} + 1} ψ_{j} = O_{p} (p n^{- 2})$ for some fixed k₀ and that $λ_{\min^{+}} (Σ) ≳ 1$ . Then $H_{\min} = O_{p} (k_{0}^{1 / 2})$ . (c) Suppose that $λ_{\min^{+}} (Σ) ≳ 1$ and denote $f_{ψ} (i) = ψ_{i} / \sum_{i = 1}^{n} ψ_{i}, i = 1, \dots, n$ . Assuming that $f_{ψ} (i)$ decreases exponentially in probability, that is, $f_{ψ} (w) = a \exp (- a (w - 1))$ for $w \geq 1$ , where $a = a_{n} ≳ \log n$ , then $H_{\min} = O_{p} (1)$ .

The proof of Corollary 2 is similar to those of Corollary 1 and Proposition 3 and is omitted. We point out two basic facts: $1 ≲ H_{\min} ≲ \sqrt{n}$ from (3.1), and $⏧ ζ_{\hat{β}} - ζ_{β} ⏧ = O_{p} (1)$ by Condition (ii) and the law of large numbers. Moreover, the error rate of an estimator is generally believed to have an order no less than $n^{- 1 / 2}$ ; thus, without loss of generality we have $⏧ ζ_{\hat{β}} - ζ_{β} ⏧ = Ω_{p} (n^{- 1 / 2})$ throughout this article. Hence, Theorem 3 leads to the error rate of order $λ_{n} H_{\min} ⏧ ζ_{\hat{β}} - ζ_{β} ⏧ ≲ \min \{ λ_{n} \sqrt{n} ⏧ ζ_{\hat{β}} - ζ_{β} ⏧, λ_{n} H_{\min} \}$ in probability without any restriction on the decay rate of eigenvalues.

The complementary condition involves two terms, $H_{\min}$ and $⏧ ζ_{\hat{β}} - ζ_{β} ⏧$ , where the former is controlled by the decay rate of the eigenvalues, and the latter depends on the accuracy of $\hat{β}$ . As argued before, as long as the eigenvalues decrease fast, $H_{\min}$ will be bounded, even if $ζ_{\hat{β}} - ζ_{β} ↛ 0$ . Therefore, if $\hat{β}$ is not accurate but the eigenvalues decay fast, we can still get good results. The same argument applies when $\hat{β}$ is accurate but the eigenvalues decay slowly. In particular, if $\hat{β}$ is good enough that $⏧ ζ_{\hat{β}} - ζ_{β} ⏧ = O_{p} (n^{- 1 / 2})$ , then the requirement on the decay rate of the eigenvalues can be removed completely, since $H_{\min} ≲ \sqrt{n}$ always holds. Thus, the information in $\hat{β}$ and that in the eigenvalues are complementary to each other, requiring only that the product of $H_{\min}$ and $⏧ ζ_{\hat{β}} - ζ_{β} ⏧$ be bounded. Theorem 3 assumes that $\hat{β}$ is independent of the data $(X, Y)$ . When $\hat{β}$ depends on $(X, Y)$ , we get a similar result on ${\hat{α}}_{x}$ with some modifications to Condition (i), which is presented in Section F of the supplementary materials.

4 Results of ${\hat{γ}}_{x}$ for Two Types of Test Points

In Section 3, we establish theoretical properties of ${\hat{α}}_{x}$ . With ${\hat{γ}}_{x} = {\hat{α}}_{x} \cdot ⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2}$ , it is natural to get the corresponding results on ${\hat{γ}}_{x}$ for a generic x. As mentioned in Examples 1 and 2 in Section 2.1, we are interested in two typical settings in particular: (a) x is a sparse vector with $S_{x} = supp (x)$ and $s_{x} = | S_{x} |$ , and (b) x is a random vector as an iid copy of X_i (i.e., the prediction problem). Next we give the theoretical results on ${\hat{γ}}_{x}$ for these two examples, based on the simple fact that $| {\hat{γ}}_{x} - γ_{x} | \leq | {\hat{α}}_{x} - α_{x} | \cdot ⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2}$ .

4.1 Properties of ${\hat{γ}}_{x}$ for a Sparse x

Proposition 5.

Assume the following conditions: (i) x is a fixed sparse vector satisfying $⏧ x ⏧_{\infty} = O (1)$ ; (ii) $λ_{\min} (n^{- 1} X_{S_{x}}^{⊤} X_{S_{x}}) > c > 0$ , where $X_{S_{x}}$ is formed by the columns of X with index S_x. Then we have $⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2} = O (⏧ x ⏧) = O (s_{x}^{1 / 2})$ . Moreover, the following results hold.

(1) Take $Γ = Γ_{eg}$ and assume that the conditions of Theorem 2 hold. For Case (a) in Section 3.2, it holds that $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (λ_{n} ⏧ x ⏧ r_{X}^{1 / 2}) = O_{p} (λ_{n} {(s_{x} r_{X})}^{1 / 2})$ ; for Cases (b) and (c) in Section 3.2, $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (λ_{n}^{1 - q / 2} ⏧ x ⏧) = O_{p} (s_{x}^{1 / 2} λ_{n}^{1 - q / 2})$ , where the factor involving k₀ from Theorem 2 is omitted because $k_{0} = O (1)$ .
(2) Let $Γ = Γ (\hat{β})$ . Assume further that the conditions of Theorem 3 hold. Then it follows that $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (s_{x}^{1 / 2} λ_{n} H_{\min} ⏧ ζ_{\hat{β}} - ζ_{β} ⏧)$ ; if the Complementary condition also holds, then $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (⏧ x ⏧ \sqrt{n^{- 1} \log n}) = O_{p} (\sqrt{n^{- 1} s_{x} \log n})$ .

The condition $λ_{\min} (n^{- 1} X_{S_{x}}^{⊤} X_{S_{x}}) > c > 0$ is a type of restricted eigenvalue condition (Bickel, Ritov, and Tsybakov Citation2009). If X_i’s are iid variables, $n^{- 1} X_{S_{x}}^{⊤} X_{S_{x}} \to_{p} cov (X_{S_{x}}) = Σ_{S_{x} S_{x}}$ . Recall that $rank (n^{- 1} X^{⊤} X) = r_{X}$ in Case (a) of Theorem 2; then the condition $λ_{\min} (n^{- 1} X_{S_{x}}^{⊤} X_{S_{x}}) > c > 0$ implies that $s_{x} \leq r_{X}$ there.

Remark 5.

We briefly discuss the case of $Σ = I_{p}$ or close to I_p for a sparse vector x. For $Γ = Γ_{eg}$ , using the trivial bound in (3.1) on R_q and taking q = 1 in Proposition 2, we can see that the error of $| {\hat{α}}_{x} - α_{x} |$ has the order $O_{p} ({(\log n)}^{1 / 4})$ , and consequently $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (⏧ x ⏧ {(\log n)}^{1 / 4})$ .
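The rate in Remark 5 follows from a short calculation, sketched here under the assumptions that the $α$ -sparsity condition of Proposition 2 holds with q = 1 and that $R_{1} ≍ ⏧ θ ⏧_{(1)} ≲ n^{1 / 2}$ by the trivial bound (3.1):

```latex
\begin{aligned}
|\hat{\alpha}_x - \alpha_x|
  &\lesssim \lambda_n^{1 - q/2} R_q^{1/2} \Big|_{q = 1}
   = \lambda_n^{1/2} R_1^{1/2} \\
  &\lesssim \Big( a_0 \sigma \sqrt{(\log n)/n} \Big)^{1/2} \big( n^{1/2} \big)^{1/2}
   = (a_0 \sigma)^{1/2} (\log n)^{1/4}.
\end{aligned}
```

Multiplying by the factor $⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2} = O (⏧ x ⏧)$ from Proposition 5 then gives the stated order for $| {\hat{γ}}_{x} - γ_{x} |$ .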

We briefly compare our method with the plug-in estimator using the LASSO estimator $\hat{β}$ (referred to simply as LASSO). For LASSO, the error varies depending on the direction of x, while the results of our estimator depend only on the sparsity degree s_x of x. For simplicity of comparison, we consider a bound of LASSO depending only on s_x. Specifically, as $⏧ x ⏧_{\infty} = O (1)$ , we have $| x^{⊤} {\hat{β}}_{lasso} - x^{⊤} β | = O (⏧x⏧⏧ {\hat{β}}_{lasso} - β ⏧) = O_{p} (\sqrt{s_{x} s_{0} \log p / n})$ (Bickel, Ritov, and Tsybakov Citation2009); the latter will be used as the error rate of LASSO. Recall that $P_{eg}$ denotes our estimator with $Γ = Γ_{eg}$ . For Case (a), it follows from Proposition 5 that $P_{eg}$ is better than LASSO if and only if $r_{X} (\log n) {(\log p)}^{- 1} = o (s_{0})$ . Cases (b) and (c) can be analyzed similarly. Recall that $P_{lasso}$ is our estimator with $Γ = Γ ({\hat{β}}_{lasso})$ .

Corollary 3.

Denote by $T_{n, f i x}$ the ratio of the error rate of $P_{lasso}$ over that of LASSO for a fixed sparse x. Suppose that the conditions (i) and (ii) in Proposition 5 and the conditions of Theorem 3 hold. Then $P_{lasso}$ has the error rate of order $λ_{n} H_{\min} \sqrt{s_{x} {(s_{0} \log p)}^{2} / n}$ and consequently $T_{n, f i x} = O_{p} (λ_{n} H_{\min} \sqrt{s_{0} \log p})$ . If $H_{\min} = o_{p} (n^{1 / 2} {[s_{0} (\log n) (\log p)]}^{- 1 / 2})$ , then $T_{n, f i x} = o_{p} (1)$ , implying that $P_{lasso}$ is superior to LASSO; otherwise $P_{lasso}$ is inferior or similar to LASSO.

The proof of Corollary 3 is a simple combination of Proposition 4 and (2) of Proposition 5 and is omitted here. Since we always have $H_{\min} ≲ \sqrt{n}$ , the requirement $H_{\min} = o_{p} (n^{1 / 2} {[s_{0} (\log n) (\log p)]}^{- 1 / 2})$ is mild.

4.2 Properties of ${\hat{γ}}_{x}$ for Prediction Problems

In a prediction problem, x and X_i’s are iid variables. Recall that $M_{Σ} = \sqrt{t r^{2} (Σ) / t r (Σ^{2})}$ .

Proposition 6.

Suppose that x and X_i’s are iid from $N (0, Σ)$ . Then $⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2} = O_{p} (⏧ x ⏧^{2} / ⏧ x ⏧_{Σ}) = O_{p} (M_{Σ}) .$ In addition, it holds that $1 \leq M_{Σ} \leq p^{1 / 2}$ . Particularly, $M_{Σ} ≍ 1$ when Σ is of low rank, and $M_{Σ} = p^{1 / 2}$ when $Σ = I_{p}$ .
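The quantity $M_{Σ}$ and its two extreme values are easy to verify numerically. The sketch below is illustrative only; `M_sigma` is a hypothetical helper name, and Σ is assumed to be symmetric so that $t r (Σ^{2})$ equals the sum of squared entries.

```python
import numpy as np

def M_sigma(Sigma):
    """M_Sigma = sqrt(tr(Sigma)^2 / tr(Sigma^2)) for a symmetric matrix Sigma."""
    tr = np.trace(Sigma)
    tr_sq = np.sum(Sigma * Sigma)      # equals tr(Sigma^2) when Sigma is symmetric
    return float(np.sqrt(tr ** 2 / tr_sq))

p = 100
m_identity = M_sigma(np.eye(p))        # equals sqrt(p) for Sigma = I_p
u = np.arange(1.0, 5.0)
m_rank_one = M_sigma(np.outer(u, u))   # equals 1 for a rank-one Sigma
```

More generally, if Σ has r equal nonzero eigenvalues, then $M_{Σ} = r^{1 / 2}$ , which is why $M_{Σ}^{2}$ can be read as an effective rank.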

Next we first derive properties of ${\hat{γ}}_{x}$ with $Γ = Γ_{eg}$ . Then we consider the estimator ${\hat{γ}}_{{\tilde{x}}_{S_{1}}}$ with $Γ = Γ (\hat{β})$ , given both an initial estimator $\hat{β}$ and a subset S₁.

4.2.1 Properties of ${\hat{γ}}_{x}$ for x in Prediction with $Γ = Γ_{eg}$

Recall that $| {\hat{γ}}_{x} - γ_{x} | \leq | {\hat{α}}_{x} - α_{x} | \cdot ⏧ x ⏧^{2} ⏧Xx ⏧^{- 1} n^{1 / 2}$ . Combining Proposition 6 and the result on $| {\hat{α}}_{x} - α_{x} |$ with fixed $(x, X)$ given in Proposition 2 for $Γ = Γ_{eg}$ , it can be inferred that $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (λ_{n}^{1 - q / 2} R_{q}^{1 / 2} M_{Σ})$ . Clearly, a faster decay rate of the eigenvalues $λ (Σ)$ leads to a smaller value of $M_{Σ}$ , while a faster decay rate of the ψ_i’s corresponds to a smaller value of R_q; both give a better rate. Unlike the fixed $(x, X)$ considered in Theorem 2, $(x, X)$ are random variables in this section. Thus, R_q lacks an explicit rate, owing to the randomness of the empirical eigenvalues ψ_i’s. The magnitudes of the ψ_i’s, although they can be checked from data, are generally hard to characterize in theory, according to random matrix theory. To the best of our knowledge, there is no solution for the general case. To obtain explicit results, we consider two extreme cases: (a) Σ is (approximately) low rank; (b) $Σ = I_{p}$ , the least favorable case.

Proposition 7.

Suppose that x and X_i’s are independent from $N (0, Σ)$ . Assume that n < p. Taking $Γ = Γ_{eg}$ , we have the following conclusions:

If Σ is of low rank with $rank (Σ) = r_{Σ}$ , we have $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (r_{Σ} \sqrt{n^{- 1} \log n})$ . An extension to Σ being approximately low rank is presented in Proposition D.1 of supplementary material.
For the least favorable case of $Σ = I_{p}$ , it holds that $| {\hat{γ}}_{x} - γ_{x} | = O_{p} (p^{1 / 2} {(\log n)}^{1 / 4})$ .

For the case of Σ having a low rank, $| {\hat{γ}}_{x} - γ_{x} |$ has the order similar to that of $| {\hat{α}}_{x} - α_{x} |$ . But when $Σ = I_{p}$ , the error diverges, which is not surprising since $θ$ is nonsparse in the transformed model in this setting and the rate is determined by the most difficult case. Particularly, the rate for $Σ = I_{p}$ is the combination of the facts that $| {\hat{α}}_{x} - α_{x} | = O_{p} ({(\log n)}^{1 / 4})$ and $M_{Σ} = p^{1 / 2}$ . Next we briefly compare LASSO with the proposed method with $Γ = Γ_{eg}$ . LASSO performs well in prediction when $β$ is (approximately) sparse, and is less sensitive to the sparsity of eigenvalues. In contrast, the proposed method with $Γ = Γ_{eg}$ has good performance when the eigenvalues of Σ decrease fast, and $β$ can be sparse or less sparse. Thus, our method with $Γ = Γ_{eg}$ and LASSO are complementary to each other.

4.2.2 Error of ${\hat{γ}}_{{\tilde{x}}_{S_{1}}}$ in Prediction with $Γ = Γ (\hat{β})$

Recall that in Example 2 in Section 2.1, $γ_{x} = γ_{{\tilde{x}}_{S_{1}}}$ for any $S_{1} \supseteq S_{0}$ with $S_{0} = supp (β)$ , implying that one can make prediction at the point ${\tilde{x}}_{S_{1}}$ . Trivially, one can take $S_{1} = {1, \dots, p}$ such that ${\tilde{x}}_{S_{1}} = x$ . The subset S₁ takes the sparsity degree of $β$ into account. Given an initial estimator $\hat{β}$ and a subset S₁ such that $S_{1} \supseteq S_{0}$ , by applying our approach with $Γ = Γ (\hat{β})$ that is constructed with x replaced by ${\tilde{x}}_{S_{1}}$ , we obtain the estimator of $γ_{{\tilde{x}}_{S_{1}}}$ denoted as ${\hat{γ}}_{{\tilde{x}}_{S_{1}}}$ .

Denote $d (\hat{β}, β) = {[\max {var (X_{i}^{⊤} (\hat{β} - β)), var (X_{i S_{1}}^{⊤} ({\hat{β}}_{S_{1}} - β_{S_{1}}))}]}^{1 / 2}$ , which stands for the prediction error of an initial estimator $\hat{β}$ . Without loss of generality, we assume $d (\hat{β}, β)$ has magnitude of order no less than $n^{- 1 / 2}$ . Let $Σ_{S_{1} S_{1}} = cov (X_{i S_{1}})$ and $M_{S_{1}} = {[t r^{2} (Σ_{S_{1} S_{1}}) / t r (Σ_{S_{1} S_{1}}^{2})]}^{1 / 2}$ with $M_{S_{1}}^{2}$ standing for the effective rank of matrix $Σ_{S_{1} S_{1}}$ . Let ${\tilde{H}}_{\min}$ be the quantity defined similar to $H_{\min}$ but with the eigenvalues of $n^{- 1} X X^{⊤}$ , satisfying $1 ≲ {\tilde{H}}_{\min} ≲ \sqrt{n}$ (the detailed expression is given in the supplementary materials). We have the following conclusions from Theorem 3.

Theorem 4.

Let $Γ = Γ (\hat{β})$ . Assume that (i) x and X_i’s are iid variables from $N (0, Σ)$ and n < p; (ii) both S₁ and $\hat{β}$ are independent of $(X, Y)$ , with $\hat{β}$ satisfying $cov (X_{i}^{⊤} \hat{β}) ≍ 1$ . Then it holds that $| {\hat{γ}}_{{\tilde{x}}_{S_{1}}} - γ_{x} | = O_{p} (λ_{n} M_{S_{1}} {\tilde{H}}_{\min} d (\hat{β}, β))$ . If we further assume the complementary condition: $d (\hat{β}, β) {\tilde{H}}_{\min} = O_{p} (1)$ , it holds that $| {\hat{γ}}_{{\tilde{x}}_{S_{1}}} - γ_{x} | = O_{p} (λ_{n} M_{S_{1}})$ . These conclusions are still valid for $S_{1} = \{ 1, \dots, p \}$ .

Theorem 4 shows that the rate depends on $M_{S_{1}}, {\tilde{H}}_{\min}$ , and $d (\hat{β}, β)$ . The first two depend on the decay rate of eigenvalues. Moreover, it holds that $d (\hat{β}, β) = O_{p} (1)$ by the assumptions that both $cov (X_{i}^{⊤} β)$ and $cov (X_{i}^{⊤} \hat{β})$ are bounded, which imposes restrictions on $\hat{β}$ . Hence, by Theorem 4, without the complementary condition, we have the error rate $\min {λ_{n} M_{S_{1}} \sqrt{n} d (\hat{β}, β), λ_{n} M_{S_{1}} {\tilde{H}}_{\min}}$ .

We compare $P_{lasso}$ with the plug-in method using LASSO estimator $\hat{β}$ (briefly named LASSO). By the typical rate of LASSO, we have $d ({\hat{β}}_{lasso}, β) = O_{p} (\min {1, \sqrt{s_{0} \log p / n}})$ . To simplify the comparison, we assume that the LASSO estimator is consistent, that is, $\sqrt{s_{0} \log p / n} = o (1)$ . Then one can see that the condition $d ({\hat{β}}_{lasso}, β) {\tilde{H}}_{\min} = O_{p} (1)$ becomes ${\tilde{H}}_{\min} = O_{p} (n^{1 / 2} {(s_{0} \log p)}^{- 1 / 2})$ , which is mild, since it always holds that ${\tilde{H}}_{\min} ≲ n^{1 / 2}$ .

Denote by $T_{n, r a d}$ the ratio of the error rate of $P_{lasso}$ over $d ({\hat{β}}_{lasso}, β)$ for random test points. Then $T_{n, r a d} = o_{p} (1)$ would imply that our method is better than LASSO. By Theorem 4, we see that $T_{n, r a d} ≍ λ_{n} M_{S_{1}} {\tilde{H}}_{\min}$ in probability. To investigate the magnitude of the latter, we consider the following two cases:

Suppose that $β$ is sparse with support set S₀ of cardinality $s_{0} = | S_{0} |$ , and that a good S₁ is available such that $S_{1} \supseteq S_{0}$ and $| S_{1} | ≍ s_{0}$ ; that is, the support set is known sufficiently well. Then we have $M_{S_{1}} ≍ s_{0}^{1 / 2}$ . If ${\tilde{H}}_{\min} = o_{p} (n^{1 / 2} {(s_{0} \log n)}^{- 1 / 2})$ (i.e., $λ_{n} s_{0}^{1 / 2} {\tilde{H}}_{\min} = o_{p} (1)$ ), which is mild as argued above, we have $T_{n, r a d} = o_{p} (1)$ ; otherwise $T_{n, r a d} = Ω_{p} (1)$ . Moreover, $P_{lasso}$ has error $| {\hat{γ}}_{{\tilde{x}}_{S_{1}}} - γ_{x} | = O_{p} (\sqrt{s_{0} (\log n) / n})$ under the mild condition ${\tilde{H}}_{\min} d ({\hat{β}}_{lasso}, β) = O_{p} (1)$ . If one knows the support set S₀ in advance, the OLS estimator using the oracle predictors $X_{i S_{0}}$ ’s has the rate $\sqrt{s_{0} / n}$ , which matches the rate of our estimator up to a logarithmic factor. However, the performance of our estimator depends on the quality of S₁ and the decay of the eigenvalues.
If we simply set $S_{1} = \{ 1, \dots, p \}$ , ignoring the sparsity information of $β$ , then $T_{n, r a d} = o_{p} (1)$ if and only if $M_{S_{1}} {\tilde{H}}_{\min} = o_{p} (λ_{n}^{- 1})$ ; otherwise $T_{n, r a d} = Ω_{p} (1)$ . This condition $M_{S_{1}} {\tilde{H}}_{\min} = o_{p} (λ_{n}^{- 1})$ holds when the eigenvalues decay fast. Moreover, under settings similar to those of Corollary 2, we have ${\tilde{H}}_{\min} = O_{p} (1)$ . For instance, if $rank (Σ) = o (λ_{n}^{- 1})$ , then $M_{S_{1}} {\tilde{H}}_{\min} ≲ rank (Σ) = o_{p} (λ_{n}^{- 1})$ . However, $P_{lasso}$ is worse than LASSO when Σ is close to I_p and $β$ is indeed sparse; specifically, under the complementary condition, $P_{lasso}$ has error rate $λ_{n} M_{S_{1}} = O_{p} ({(n^{- 1} p \log n)}^{1 / 2})$ , which is worse than that of LASSO. Nevertheless, taking $S_{1} = \{ 1, \dots, p \}$ accommodates both sparse and non-sparse $β$ . The classical OLS estimator for a non-sparse $β$ has the order $\sqrt{p / n}$ , close to that of $P_{lasso}$ . In practice, we do not know the sparsity degree of $β$ . The CV approach in Section 2.3 can be used to select between ${\hat{γ}}_{{\tilde{x}}_{S_{1}}}$ and ${\hat{γ}}_{x}$ in an automatic, data-adaptive manner.

5 Numerical Studies

We use simulation studies in Section 5.1 and real data analysis in Section 5.2 to further illustrate the numerical performance of our method.

5.1 Simulations

We consider simulation studies with samples generated iid from the linear model (2.1) with a p = 1000 dimensional predictor vector $X_{i} \sim N (0, Σ)$ and $ϵ_{i} \sim N (0, 1)$. Set $Σ = (σ_{i j})$ with $σ_{i j} = {0.5}^{| i - j | / η}$, where η controls the strength of dependence among the predictors, with larger values of η implying stronger correlations. For convenience, each plug-in estimator is named after the method used to estimate $β$. For example, “LASSO,” “ridge,” and “ridgeless” denote the plug-in estimators ${\hat{β}}^{⊤} x$ with $\hat{β}$ being the LASSO, ridge, and ridgeless estimators, respectively.
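As a minimal sketch of this design (not the authors' code), the covariance matrix and the Gaussian predictors can be generated as below. The function names `make_sigma` and `draw_predictors`, the seed, the sample size, and the example values of η are illustrative assumptions.

```python
import numpy as np


def make_sigma(p: int, eta: float) -> np.ndarray:
    """Covariance matrix with entries sigma_ij = 0.5 ** (|i - j| / eta)."""
    idx = np.arange(p)
    return 0.5 ** (np.abs(idx[:, None] - idx[None, :]) / eta)


def draw_predictors(n: int, sigma: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Draw n iid rows from N(0, sigma) using a Cholesky factor of sigma."""
    chol = np.linalg.cholesky(sigma)  # sigma = chol @ chol.T
    z = rng.standard_normal((n, sigma.shape[0]))
    return z @ chol.T


rng = np.random.default_rng(0)
for eta in (5, 50, 150):  # hypothetical values chosen for illustration
    sigma = make_sigma(1000, eta)
    X = draw_predictors(200, sigma, rng)
    # The correlation between neighbouring predictors, 0.5 ** (1 / eta), grows with eta.
    print(f"eta={eta}: corr(X_1, X_2) = {sigma[0, 1]:.4f}, X shape = {X.shape}")
```

Since $σ_{i j}$ depends only on $| i - j |$, the correlation between adjacent predictors is ${0.5}^{1 / η}$, which approaches 1 as η grows.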

Setting 1 (Prediction). We set $β = δ_{0} {(1_{p_{0}}^{⊤}, 0, \dots, 0)}^{⊤} \in R^{p}$, where $p_{0} = r_{0} p$, $1_{p_{0}}$ is the p₀-dimensional vector of ones, and $δ_{0} = 10 / \sqrt{p_{0}}$ such that $\| β \| = 10$. Clearly, $β$ is denser for larger values of r₀, and we set $r_{0} \in \{0.1, 0.2, \dots, 0.9\}$. For prediction, we set $M = \{P_{lasso}, P_{ridge}, P_{eg}, P_{rdl}\}$ for Algorithm 2 in Section 2.3.
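The construction of $β$ in Setting 1 can be sketched as follows. The helper name `make_beta` and the three values of r₀ shown are illustrative choices; the rounding of $r_{0} p$ to an integer is an assumption that is harmless for the grid of r₀ used here.

```python
import numpy as np


def make_beta(p: int, r0: float, norm: float = 10.0) -> np.ndarray:
    """First p0 = r0 * p entries equal norm / sqrt(p0); the remaining entries are zero."""
    p0 = int(round(r0 * p))
    beta = np.zeros(p)
    beta[:p0] = norm / np.sqrt(p0)
    return beta


for r0 in (0.1, 0.5, 0.9):
    beta = make_beta(1000, r0)
    # The number of nonzero entries grows with r0 while the Euclidean norm stays
    # at 10, so each individual coefficient shrinks as beta becomes denser.
    print(r0, np.count_nonzero(beta), beta[0], np.linalg.norm(beta))
```

Fixing the norm while varying r₀ means that denser vectors carry weaker individual signals, which is the regime where support recovery becomes hard.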

We compare the prediction performance of different methods. We split the data into a training sample of size n_tr = 200 and a testing sample of size n_te = 500, the latter being used to compute the test error. PWE estimators are obtained by Algorithm 2 with the $M$ given above and $S_{1} = \{1, \dots, p\}$. For the implementation of Algorithm 1, the bias correction step is adopted. We repeat the procedure 100 times and calculate the average test error for each method. We compare LASSO, ridge, ridgeless, $P_{eg}$, and the PWE estimators. For clarity, simulation results for Setting 1 are summarized as follows:

Comparison of LASSO, ridge, ridgeless with PWE and $P_{eg}$ on prediction. The simulation results are presented in Fig. 2. When $β$ is sparse, such as r₀ = 0.1, PWE performs similarly to LASSO (with small η) or better than LASSO (with large η), and much better than the other methods, including ridgeless. When r₀ is large, PWE is similar to or slightly better than the ridgeless estimator, and is much better than the other methods. By taking advantage of different initial estimators, PWE performs well for both sparse and dense $β$. Moreover, $P_{eg}$ is insensitive to r₀, which supports our theoretical findings, and is better than LASSO when r₀ is large. Its advantage over LASSO is clearer when η is large.
Comparison of plug-in estimators with our proposed pointwise counterparts. Due to limited space, the results are presented in Section G.1 of the supplementary material. The performance of $P_{lasso}$ is similar to that of LASSO for small η and better for large η; in all cases, $P_{ridge}$ is better than ridge. The numerical results match our theoretical findings in the sense that sparsity in the eigenvalues can be helpful. In addition, $P_{rdl}$ is close to ridgeless, especially when η is small. More comparisons are presented in Setting 4 and in Section G.1 of the supplementary materials.

Fig. 2 Simulation prediction results of PWE, LASSO, ridge, $P_{eg}$ , and ridgeless (RDL).

Setting 2 (Sparse linear transformation). We consider x being a sparse vector in $γ_{x} = β^{⊤} x$. Set $β = {(3, - 3, 3, 1, δ_{1} 1_{p_{0} - 4}^{⊤}, 0_{p - p_{0}}^{⊤})}^{⊤}$, where $δ_{1} = 5 / \sqrt{p_{0}}$. We take γ_x to be one of the four quantities β₁, $β_{p_{0}}$, β_p, and $β_{1} - β_{3}$, corresponding to x being $e_{1}$, $e_{p_{0}}$, $e_{p}$, and $e_{1} - e_{3}$, respectively. Note that $β_{1} - β_{3} = 0$, indicating that there is no difference between the effects of the first and third predictors; $β_{1} = 3$ and $β_{p_{0}} = δ_{1}$ stand for strong and weak signals, respectively. Moreover, $β_{p} = 0$ indicates that the pth predictor is insignificant. Let $p_{0} = 300$ and $Σ = (σ_{i j})$ with $σ_{i j} = {0.5}^{| i - j | / 150}$, so that the predictors are highly correlated. We set the training sample size to n_tr = 150 and compute the average errors $| {\hat{γ}}_{x} - γ_{x} |$ over 100 replications. We compare the regularized estimators $P_{eg}$, $P_{lasso}$, and $P_{ridge}$ with the plug-in ones. Estimation comparison results are presented in Table 2.
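The four targets of Setting 2 can be written out explicitly. The sketch below assumes p = 1000 carries over from the general simulation setup; the helper `unit_vector` and the dictionary keys are names introduced here for illustration only.

```python
import numpy as np

p, p0 = 1000, 300  # p = 1000 is assumed to carry over from the general setup
delta1 = 5 / np.sqrt(p0)
beta = np.concatenate(([3.0, -3.0, 3.0, 1.0], np.full(p0 - 4, delta1), np.zeros(p - p0)))


def unit_vector(j: int, dim: int = p) -> np.ndarray:
    """Standard basis vector e_j, with j counted from 1 as in the text."""
    e = np.zeros(dim)
    e[j - 1] = 1.0
    return e


targets = {
    "beta_1": unit_vector(1),                          # strong signal
    "beta_p0": unit_vector(p0),                        # weak signal
    "beta_p": unit_vector(p),                          # null coefficient
    "beta_1 - beta_3": unit_vector(1) - unit_vector(3),  # contrast equal to zero
}
gamma = {name: float(beta @ x) for name, x in targets.items()}
print(gamma)
```

Each target is a linear functional $β^{⊤} x$ with a very sparse x, so the quantity of interest is a single coefficient or a contrast rather than a prediction at a dense point.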

Table 2 The average values of $| {\hat{γ}}_{x} - γ_{x} |$ .

For the strong signal β₁, it can be inferred from Table 2 that the regularized pointwise estimators $P_{ridge}$ and $P_{eg}$ are better than LASSO, adaptive LASSO, and ridge regression. For the weak signal $β_{p_{0}}$, the ridge estimator is better than the others. The main reason is that the other methods sometimes shrink the estimates to zero, resulting in large biases. For $β_{p} = 0$, all methods except ridge regression give zero estimates. Finally, for $β_{1} - β_{3} = 0$, all the regularized pointwise estimators result in exactly zero estimates, while the plug-in estimators LASSO and adaptive LASSO lead to large biases.

Setting 3 (Comparison on different subsets S₁). As pointed out in Sections 2.1 and 2.3, one can consider prediction at the point ${\tilde{x}}_{S_{1}}$ instead of x in prediction problems. Under the setup of Setting 1, we take $Γ = Γ ({\hat{β}}_{lasso})$ and compare the following four candidates for S₁: (1) $S_{1} = S_{0} = supp (β)$, which is the ideal case; (2) $S_{1} = S_{full} = \{1, \dots, p\}$, that is, $γ_{{\tilde{x}}_{S_{1}}} = γ_{x}$; (3) S₁ obtained by the SIS of Fan and Lv (2008), denoted as $S_{SIS}$; (4) $S_{1} = S_{lasso} = supp ({\hat{β}}_{lasso})$. For each candidate S₁, we repeat the procedure 100 times and report the average values of the prediction error, the true positive rate (TP), and the average length (LEN), where TP $= | S_{1} \cap S_{0} | / | S_{0} |$ and LEN $= | S_{1} | / p$.
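The two summary measures are simple set computations. A minimal sketch, with a made-up support and a made-up sparse estimate standing in for a fitted LASSO vector (both are hypothetical placeholders):

```python
import numpy as np


def support(beta_hat: np.ndarray) -> set:
    """Indices of the nonzero entries, e.g. supp(beta_hat) for a LASSO fit."""
    return set(np.flatnonzero(beta_hat).tolist())


def tp_and_len(s1: set, s0: set, p: int) -> tuple:
    """TP = |S1 ∩ S0| / |S0| and LEN = |S1| / p."""
    return len(s1 & s0) / len(s0), len(s1) / p


p = 1000
s0 = set(range(10))                      # hypothetical true support
beta_hat = np.zeros(p)
beta_hat[[0, 1, 2, 3, 4, 20, 21]] = 1.0  # hypothetical sparse estimate
tp, length = tp_and_len(support(beta_hat), s0, p)
print(tp, length)  # 0.5 0.007
```

A good S₁ has TP close to 1 with LEN as small as possible; $S_{full}$ attains TP = 1 only by setting LEN = 1.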

Due to limited space, we present the simulation results in Section G.2 of the supplementary materials. It is seen that S₀ always leads to the best prediction errors. When $r_{0} = 0.01$, where $β$ is very sparse, both $S_{SIS}$ and $S_{lasso}$ have high values of TP and small values of LEN, leading to smaller prediction errors than those of $S_{full}$. As r₀ increases, the signal of the β_j’s becomes weak due to the constraint $\| β \| = 10$, and the values of TP for $S_{SIS}$ become very small, the smallest among all subsets, which leads to the worst prediction errors. On the other hand, the TP and consequently the errors of $S_{lasso}$ are much better than those of $S_{SIS}$, because LASSO takes into account correlations among the predictors when selecting the significant variables, while SIS uses only the marginal correlations. In addition, $S_{lasso}$ performs similarly to $S_{full}$, which is partially due to the following reason. During the construction of $Γ ({\hat{β}}_{lasso})$ with a given S₁, we need to compute ${\tilde{x}}_{S_{1}}^{⊤} {\hat{β}}_{lasso}$, which equals $x^{⊤} {\hat{β}}_{lasso}$ for S₁ being either $S_{lasso}$ or $S_{full}$.
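The last observation can be checked numerically. The sketch below assumes that ${\tilde{x}}_{S_{1}}$ denotes x with the coordinates outside S₁ set to zero (the definition is given earlier in the paper and is only paraphrased here); the vectors are random placeholders, not fitted quantities.

```python
import numpy as np


def restrict(x: np.ndarray, s1: np.ndarray) -> np.ndarray:
    """Keep the coordinates of x indexed by s1 and zero out the rest (assumed form of x-tilde)."""
    x_tilde = np.zeros_like(x)
    x_tilde[s1] = x[s1]
    return x_tilde


rng = np.random.default_rng(0)
p = 1000
x = rng.standard_normal(p)

beta_hat = np.zeros(p)  # placeholder for a sparse LASSO estimate
beta_hat[[3, 17, 256]] = [1.5, -0.7, 2.0]

s_lasso = np.flatnonzero(beta_hat)
s_full = np.arange(p)

# Coordinates outside supp(beta_hat) are multiplied by zero, so both choices
# of S1 give the same inner product as x itself.
print(restrict(x, s_lasso) @ beta_hat, restrict(x, s_full) @ beta_hat, x @ beta_hat)
```

The same identity holds for any S₁ that contains $supp ({\hat{β}}_{lasso})$, which is why differences between such choices can only enter through the other steps of the procedure.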

Setting 4 (Further comparison for heterogeneous test points). In Setting 1, where test points are iid copies from the training distribution, $P_{lasso}$ is nearly the same as LASSO when η is small, such as η = 5 (results shown in the supplementary material). We compare them further for the case where the x_i’s follow a distribution different from that of the X_i’s, which is known as covariate shift in the transfer learning literature (Weiss, Khoshgoftaar, and Wang 2016). We generate training data of size 100 as in Setting 1 with η = 5 and $β \propto {(1_{p_{0}}^{⊤}, 0, \dots, 0)}^{⊤} \in R^{p}$ with $\| β \| = 5$. The test points x_i are iid from $N (0, Σ_{te})$. The eigenvector matrix $U_{te}$ of $Σ_{te}$ is uniformly distributed on the set of all orthogonal matrices in $R^{p \times p}$. The eigenvalues of $Σ_{te}$, denoted as $ϱ_{te, 1}, \dots, ϱ_{te, p}$, satisfy $ϱ_{te, i} = 2 (p - i + 1) / (p + 1)$, so that $tr (Σ_{te}) = p$. We first generate a $Σ_{te}$ and then 200 test points from the given $Σ_{te}$, and repeat this procedure 100 times to compute the average prediction errors. The results in Table 3 show that $P_{lasso}$ is much better than LASSO under covariate shift even for small η, possibly due to the flexible pointwise prediction of our proposed method. Besides these examples, additional results demonstrate that $P_{rdl}$ can also substantially improve on the ridgeless estimator when covariate shift exists in the testing data (Section G.1 of the supplementary materials).
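One way to generate such shifted test points is sketched below. It draws the eigenvector matrix from the Haar (uniform) distribution on orthogonal matrices using the QR decomposition of a Gaussian matrix with a sign correction, which is a standard construction and not necessarily the one used by the authors; all function names and the seed are assumptions.

```python
import numpy as np


def haar_orthogonal(p: int, rng: np.random.Generator) -> np.ndarray:
    """Orthogonal matrix drawn uniformly (Haar measure) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((p, p)))
    # Flip column signs so that the diagonal of R is positive; this makes q exactly Haar.
    return q * np.sign(np.diag(r))


def shifted_eigenvalues(p: int) -> np.ndarray:
    """Eigenvalues 2 (p - i + 1) / (p + 1) for i = 1, ..., p; they sum to p."""
    i = np.arange(1, p + 1)
    return 2.0 * (p - i + 1) / (p + 1)


def draw_test_points(m: int, u: np.ndarray, eigvals: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Draw m iid rows from N(0, u diag(eigvals) u^T)."""
    z = rng.standard_normal((m, len(eigvals)))
    return (z * np.sqrt(eigvals)) @ u.T


rng = np.random.default_rng(0)
p = 1000
eigvals = shifted_eigenvalues(p)
u = haar_orthogonal(p, rng)
x_test = draw_test_points(200, u, eigvals, rng)
print(eigvals.sum(), x_test.shape)  # the trace of Sigma_te equals p up to rounding
```

The trace identity follows from $\sum_{i = 1}^{p} (p - i + 1) = p (p + 1) / 2$, so the test covariance has the same total variance as a correlation matrix but an unrelated eigenbasis.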

Table 3 Test errors of LASSO and $P_{lasso}$ with η = 5 for testing points from $N (0, Σ_{te})$ .

5.2 Real Data Analysis

We apply our method to a dataset from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (https://adni.loni.usc.edu/). Alzheimer’s Disease (AD) is a form of dementia characterized by progressive cognitive and memory deficits. The Mini Mental State Examination (MMSE) is a very useful score in practice for the diagnosis of AD. Generally, any score greater than or equal to 27 points (out of 30) indicates normal cognition. Below this, the MMSE score can indicate severe ($\leq 9$ points), moderate (10–18 points), or mild (19–24 points) cognitive impairment (Mungas 1991). Currently, structural magnetic resonance imaging (MRI) is one of the most popular and powerful techniques for the diagnosis of AD. One can use MRI data to predict the MMSE score and identify the important diagnostic and prognostic biomarkers. The dataset we used contains the MRI data and MMSE scores of 51 AD patients and 52 normal controls. After the image preprocessing steps for the MRI data, we obtain the subject-labeled image based on a template with 93 manually labeled regions of interest (ROI) (Zhang and Shen 2012). For each of the 93 ROIs in the labeled MRI, the volume of gray matter tissue is used as a feature. Therefore, the final dataset has 103 subjects, each with one MMSE score and 93 MRI features. We treat the MMSE score as the response variable and the MRI features as predictors.

We split the data at random with 80% as the training set, denoted as $S_{tr}$, and 20% as the testing set, denoted as $S_{te}$, then compute the average test error $\sum_{Y_{i} \in S_{te}} | {\hat{Y}}_{i} - Y_{i} | / | S_{te} |$. We repeat the procedure 100 times and report the average test errors. The boxplots of the different methods are presented in Fig. 3. They show that the pointwise estimators are much better than the corresponding plug-in estimators of LASSO, adaptive LASSO, and ridge.
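The evaluation protocol can be sketched as follows. The predictor `ridge_fit_predict` is a simple stand-in used only to make the example runnable and is not one of the estimators compared in the paper; the data are random placeholders with the same shape as the AD dataset, and all names and the seed are assumptions.

```python
import numpy as np


def ridge_fit_predict(x_tr, y_tr, x_te, lam=1.0):
    """Stand-in predictor: ridge regression with an unpenalized intercept."""
    x_mean, y_mean = x_tr.mean(axis=0), y_tr.mean()
    xc = x_tr - x_mean
    coef = np.linalg.solve(xc.T @ xc + lam * np.eye(xc.shape[1]), xc.T @ (y_tr - y_mean))
    return y_mean + (x_te - x_mean) @ coef


def repeated_split_errors(x, y, fit_predict, n_rep=100, train_frac=0.8, seed=0):
    """Mean absolute test error for each of n_rep random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_tr = int(round(train_frac * n))
    errors = np.empty(n_rep)
    for rep in range(n_rep):
        perm = rng.permutation(n)
        tr, te = perm[:n_tr], perm[n_tr:]
        y_hat = fit_predict(x[tr], y[tr], x[te])
        errors[rep] = np.mean(np.abs(y_hat - y[te]))
    return errors


# Synthetic placeholder with the same shape as the AD data (103 subjects, 93 features);
# these numbers are random and carry no information about the real dataset.
rng = np.random.default_rng(1)
x = rng.standard_normal((103, 93))
y = x[:, :5].sum(axis=1) + rng.standard_normal(103)
errors = repeated_split_errors(x, y, ridge_fit_predict)
print(errors.shape, errors.mean())
```

With 103 subjects, an 80% split gives 82 training and 21 testing observations per repetition; the vector of 100 per-split errors is what a boxplot would summarize for each method.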

Fig. 3 Comparison of the test errors of different methods for the AD data analysis. Here $P_{lasso}, P_{ridge}, P_{rdl}$ , and $P_{eg}$ represent our regularized estimators without bias-correction in Algorithm 1. The PWE is automatically selected from $P_{lasso}, P_{ridge}, P_{rdl}$ , and $P_{eg}$ .

6 Discussion

In this article, we estimate the linear transformation $β^{⊤} x$ of the parameter vector $β$ in high-dimensional linear models. We propose a pointwise estimator, which works well whether $β$ is sparse or non-sparse and whether predictors are highly or weakly correlated. The theoretical analysis reveals a significant difference between estimating the linear transformation $β^{⊤} x$ and estimating $β$ itself. When $β$ is non-sparse or predictors are highly correlated, estimating $β$ is difficult, but we can still obtain a good estimate of $β^{⊤} x$ using our proposed pointwise estimators.

Supplementary Materials

The supplementary materials contain two parts. The first part provides the proofs of the theoretical results in the main paper. The second part includes some simulation results for the settings 2 and 3 of the main paper, together with additional simulation results.

Acknowledgments

The authors would like to thank the Editor, the Associate Editor, and reviewers, whose helpful comments and suggestions led to a much improved presentation.

Funding

Junlong Zhao’s research was supported in part by National Natural Science Foundation of China grants No. 11871104 and 12131006. Yang Zhou’s research was supported in part by China Postdoctoral Science Foundation grants No. 2020M680226 and 2020TQ0014, and Youth Fund, No.310422112. Yufeng Liu’s research was supported in part by US NIH grant R01GM126550 and NSF grants DMS2100729 and SES2217440.

References

Azriel, D., and Schwartzman, A. (2020), “Estimation of Linear Projections of Non-Sparse coefficients in High-Dimensional Regression,” Electronic Journal of Statistics, 14, 174–206. DOI: 10.1214/19-EJS1656.
Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020), “Benign Overfitting in Linear Regression,” Proceedings of the National Academy of Sciences, 117, 30063–30070. DOI: 10.1073/pnas.1907378117.
Belloni, A., and Chernozhukov, V. (2013), “Least Squares after Model Selection in High-Dimensional Sparse Models,” Bernoulli, 19, 521–547. DOI: 10.3150/11-BEJ410.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009), “Simultaneous Analysis of Lasso and Dantzig Selector,” The Annals of Statistics, 37, 1705–1732. DOI: 10.1214/08-AOS620.
Bühlmann, P., and Van De Geer, S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Berlin: Springer.
Cai, T. T., and Guo, Z. (2017), “Confidence Intervals for High-Dimensional Linear Regression: Minimax Rates and Adaptivity,” The Annals of Statistics, 45, 615–646.
Dalalyan, A. S., Hebiri, M., and Lederer, J. (2017), “On the Prediction Performance of the Lasso,” Bernoulli, 23, 552–581. DOI: 10.3150/15-BEJ756.
Dobriban, E., and Wager, S. (2018), “High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification,” The Annals of Statistics, 46, 247–279. DOI: 10.1214/17-AOS1549.
Fan, J., and Li, R. (2001), “Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties,” Journal of the American Statistical Association, 96, 1348–1360. DOI: 10.1198/016214501753382273.
Fan, J., and Lv, J. (2008), “Sure Independence Screening for Ultrahigh Dimensional Feature Space,” Journal of the Royal Statistical Society, Series B, 70, 849–911. DOI: 10.1111/j.1467-9868.2008.00674.x.
Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022), “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” The Annals of Statistics, 50, 949–986. DOI: 10.1214/21-aos2133.
Hebiri, M., and Lederer, J. (2013), “How Correlations Influence Lasso Prediction,” IEEE Transactions on Information Theory, 59, 1846–1854. DOI: 10.1109/TIT.2012.2227680.
Javanmard, A., and Montanari, A. (2014), “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression,” The Journal of Machine Learning Research, 15, 2869–2909.
Lu, S., Liu, Y., Yin, L., and Zhang, K. (2017), “Confidence Intervals and Regions for the Lasso by Using Stochastic Variational Inequality Techniques in Optimization,” Journal of the Royal Statistical Society, Series B, 79, 589–611. DOI: 10.1111/rssb.12184.
Mungas, D. (1991), “In-Office Mental Status Testing: A Practical Guide,” Geriatrics, 46, 54–67.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012), “A Unified Framework for High-Dimensional Analysis of m-estimators with Decomposable Regularizers,” Statistical Science, 27, 538–557. DOI: 10.1214/12-STS400.
Raskutti, G., Wainwright, M. J., and Yu, B. (2011), “Minimax Rates of Estimation for High-Dimensional Linear Regression over $\ell_{q}$-balls,” IEEE Transactions on Information Theory, 57, 6976–6994.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267–288. DOI: 10.1111/j.2517-6161.1996.tb02080.x.
van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models,” Annals of Statistics, 42, 1166–1202.
Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016), “A Survey of Transfer Learning,” Journal of Big Data, 3, 1–40. DOI: 10.1186/s40537-016-0043-6.
Zhang, C. H. (2010), “Nearly Unbiased Variable Selection Under Minimax Concave Penalty,” Annals of Statistics, 38, 894–942.
Zhang, C. H., and Zhang, S. S. (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242. DOI: 10.1111/rssb.12026.
Zhang, D., and Shen, D. (2012), “Multi-Modal Multi-Task Learning for Joint Prediction of Multiple Regression and Classification Variables in Alzheimer’s Disease,” Neuroimage, 59, 895–907. DOI: 10.1016/j.neuroimage.2011.09.069.
Zhang, Y., Duchi, J., and Wainwright, M. (2015), “Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates,” The Journal of Machine Learning Research, 16, 3299–3340.
Zhang, Y., Wainwright, M. J., and Jordan, M. I. (2017), “Optimal Prediction for Sparse Linear Models? Lower Bounds for Coordinate-Separable m-estimators,” Electronic Journal of Statistics, 11, 752–799. DOI: 10.1214/17-EJS1233.
Zhu, Y., and Bradic, J. (2018), “Linear Hypothesis Testing in Dense High-Dimensional Linear Models,” Journal of the American Statistical Association, 113, 1583–1600. DOI: 10.1080/01621459.2017.1356319.
Zou, H., and Hastie, T. (2005), “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society, Series B, 67, 301–320. DOI: 10.1111/j.1467-9868.2005.00503.x.
