Full article: Higher-Order Least Squares: Assessing Partial Goodness of Fit of Linear Causal Models

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

We introduce a simple diagnostic test for assessing the overall or partial goodness of fit of a linear causal model with errors being independent of the covariates. In particular, we consider situations where hidden confounding is potentially present. We develop a method and discuss its capability to distinguish between covariates that are confounded with the response by latent variables and those that are not. Thus, we provide a test and methodology for partial goodness of fit. The test is based on comparing a novel higher-order least squares principle with ordinary least squares. In spite of its simplicity, the proposed method is extremely general and is also proven to be valid for high-dimensional settings. Supplementary materials for this article are available online.

Keywords:

1 Introduction

Linear models are the most commonly used statistical tools to study the relationship between a response and a set of covariates. The regression coefficient corresponding to a particular covariate is usually interpreted as its net effect on the response variable when all else is held fixed. Such an interpretation is essential in many applications and yet could be rather misleading when the linear model assumptions are in question, in particular, when there are hidden confounders.

In this work, we develop a simple but powerful approach to goodness of fit tests for potentially high-dimensional linear causal models, including also tests for partial goodness of fit of single predictor variables. While hidden confounding is the primary alternative in mind, different nonlinear deviations from the linear model assumption are also in scope. Tests for goodness of fit are essential to statistical modeling (e.g., Lehmann, Romano, and Casella Citation2005) and the concept is also very popular in econometrics where it is referred to as specification tests. For an overview of such methods; see, for example, Godfrey (Citation1991) or Maddala and Lahiri (Citation2009).

Another set of related works is Buja et al. (Citation2019a, Citation2019b), which elaborately discusses deviations from the (linear) model and how distributional robustness, that is, robustenss against shifts in the covariates’ distribution, links to correctly specified models. For this, they introduce the definition of “well-specified” statistical functionals. Distributional robustness, implied by well-specification, is also related to the causal interpretation of a linear model as discussed in Peters, Bühlmann, and Meinshausen (Citation2016).

We consider here the question when and which causal effects can be inferred from the ordinary least squares estimator or a debiased Lasso procedure for the high-dimensional setting, even when there is hidden confounding. We address this by partial goodness of fit testing: if the data speaks against a linear causal model, we are able to specify which components of the least squares estimator should be rejected to be linear causal effects and which not. In the case of a joint Gaussian distribution, one cannot detect anything: this corresponds to a well-known unidentifiability result in causality (Hyvärinen and Oja Citation2000; Peters et al. Citation2014). But, in certain models, we are able to identify some causal relations. Of particular importance are non-Gaussian linear structural equation models, as used in Shimizu et al. (Citation2006) or Wang and Drton (Citation2020) among others. The latter constructs the causal graph from observational data in a stepwise procedure using a test statistic similar to the one we suggest.

Our strategy has a very different focus than other approaches which do not rely on the least squares principle any longer to deal with the issue of hidden confounding. Most prominent, particularly in econometrics, is the framework of instrumental variables regression: assuming valid instruments, one can identify all causal effects, see, for example, Angrist, Imbens, and Rubin (Citation1996) or the books by Bowden and Turkington (Citation1990) and Imbens and Rubin (Citation2015). The popular Durbin-Wu-Hausman test (Hausman Citation1978) for validity of instruments bears a relation to our methodology, namely that we are also looking at the difference of two estimators to test goodness of fit.

Our automated partial goodness of fit methodology is easy to be used as it is based on ordinary (or high-dimensional adaptions of) least squares and its novel higher-order version: we believe that this simplicity is attractive for statistical practice.

1.1 Our Contribution

We propose a novel method with a corresponding test, called higher-order least squares (HOLS). The test statistic is based on the residuals from an ordinary least squares or Lasso fit. In that regard, it is related to Shah and Bühlmann (Citation2018) who use “residual prediction” to test for deviation from the linear model. However, our approach does neither assume Gaussian errors nor does it rely on sample splitting, and our novel test statistic has a $\sqrt{n}$ convergence rate (with n denoting the sample size).

In addition to presenting a “global” goodness of fit test for the entire model, we also develop a local interpretation that allows detecting which among the covariates are giving evidence for hidden confounding or nonlinear relations. Thus, we strongly increase the amount of extracted information compared to a global goodness-of-fit test. In particular, in the case of localized (partial) confounding in linear structural equation models, we are able to recover the unconfounded regression parameters for a subset of predictors. This is a setting where techniques assuming dense (essentially global) confounding, as in Ćevid, Bühlmann, and Meinshausen (Citation2020) or Guo, Ćevid, and Bühlmann (Citation2022), fail.

The work by Buja et al. (Citation2019a, Citation2019b) especially the second paper, shows how to detect deviations in a linear model using reweighting of the data. Our HOLS technique can be seen as a special way of reweighting. In contrast to their work, we provide a simple test statistic that tests for well-specification without requiring any resampling. Furthermore, we provide guarantees for a local interpretation under suitable modeling assumptions while as their per-covariate view remains rather exploratory.

1.2 Outline

The remainder of this article will be structured as follows. We conclude this section with the necessary notation. In Section 2, we present the main idea of HOLS and the according global null hypothesis. For illustrative purposes, we first discuss univariate regression. Then, we consider multivariate regression and extend our theory to high-dimensional problems incorporating the (debiased) Lasso. In Section 3, we present the local interpretation when the global null does not hold true alongside with theoretical guarantees. Models for which this local interpretation is most suitable are discussed in Section 4. Section 5 contains a real data analysis. We conclude with a summarizing discussion in Section 6.

1.3 Notation

We present some notation that is used throughout this work. Vectors and matrices are written in boldface, while scalars have the usual lettering. This holds for both random and fixed quantities. We use upper case letters to denote a random variable, for example, X or Y. We use lower case letters to denote iid copies of a random variable, for example, x. If $X \in R^{p}$ , then $x \in R^{n \times p}$ . With a slight abuse of notation, x can either denote the copies or realizations thereof. We write $x_{j}$ to denote the jth column of matrix x and $x_{- j}$ to denote all but the jth column. We write $\overset{H_{0}}{=}$ to state that equality holds under H₀. With ←, we emphasize that an equality between random variables is induced by a causal mechanism. We use $⊙$ to denote elementwise multiplication of two vectors, for example, $x ⊙ y$ . Similarly, potencies of vectors are also to be understood in an elementwise fashion, for example, $x^{2} = x ⊙ x$ . $I_{n}$ is the n-dimensional identity matrix. $P_{- j}$ denotes the orthogonal projection onto $x_{- j}$ and $P_{- j}^{⊥} = I_{n} - P_{- j}$ denotes the orthogonal projection onto its complement. For some random vector X, we have the moment matrix $Σ^{X} : = E [X X^{⊤}]$ . Note that this equals the covariance matrix for centered X. We denote statistical independence by $⊥$ . We write e to denote a vector for which every entry is 1 and $e_{j}$ to denote the unit vector in the direction of the jth coordinate axis.

2 Higher-Order Least Squares (HOLS)

We develop here the main idea of HOLS estimation.

2.1 Univariate Regression as a Motivating Case

It is instructive to begin with the case of simple linear regression where we have a pair of random variables X and Y. We consider the causal linear model (1) $\begin{matrix} Y \leftarrow X β + E, where X ⊥ E, \\ E [E] = 0 and E [E^{2}] = σ^{2} < \infty . \end{matrix}$ (1)

We formulate a null hypothesis that the model in (1) is correct and we denote such a hypothesis by H₀. This model is of interest as β describes the effect of a unit change if we were to intervene on covariate X without intervening on the independent $E$ . Such model, or its multivariate extension, is often assumed in causal discovery, see, for example, Shimizu et al. (Citation2006) or Hoyer et al. (Citation2008). Therefore, we aim to provide a test for its well-specification.

Estimation of the regression parameter is typically done by the least squares principle $β^{OLS} : = \underset{b \in R}{argmin} E [{(Y - X b)}^{2}] = \frac{E [X Y]}{E [X^{2}]} \overset{H_{0}}{=} β,$ where we use the superscript OLS to denote ordinary least squares. Alternatively, we can pre-multiply the linear model (1) with X: the parameter minimizing the expected squared error loss is then $β^{HOLS} : = \underset{b \in R}{argmin} E [{(X Y - X^{2} b)}^{2}] = \frac{E [X^{3} Y]}{E [X^{4}]} \overset{H_{0}}{=} \frac{E [X^{4} β]}{E [X^{4}]} = β .$

More generally, $β^{HOLS} = β^{OLS} = β$ , if $E [Y | X] = X β$ . Using the definition from Buja et al. (Citation2019b), this means that the OLS parameter is well-specified. The estimation principle is called higher-order least squares, or HOLS for short, as it involves higher-order moments of X. One could also multiply the linear model with a factor other than X, which may have implications on the power to detect deviations from (1). We shall focus here on the specific choice to fix ideas.

The motivation to look at HOLS is when H₀ is violated, in terms of a hidden confounding variable: let H be a hidden confounder leading to a model $X \leftarrow E_{X} + H ρ, Y \leftarrow X β + H α + E,$ where $E_{X}$ , H, and $E$ are all independent and α and ρ define additional model parameters. In particular, we can compute under such a confounding model that(2) $\begin{matrix} β^{HOLS} - β^{OLS} = ρ α (\frac{3 E [E_{X}^{2}] E [H^{2}] + ρ^{2} E [H^{4}]}{E [E_{X}^{4}] + 6 ρ^{2} E [E_{X}^{2}] E [H^{2}] + ρ^{4} E [H^{4}]} \\ - \frac{E [H^{2}]}{E [E_{X}^{2}] + ρ^{2} E [H^{2}]}) . \end{matrix}$ (2)

For simplicity, we assumed here $E [E_{X}] = E [H] = E [E] = 0$ . In practice, one can get rid of this assumption by including an intercept in the model. If either α or ρ equals to 0, we see that the difference in (2) is 0. This is not surprising as there is no confounding effect when either X or Y is unaffected. However, this is not the only possibility how the difference can be 0. Namely, $\begin{matrix} E [H^{2}] (E [E_{X}^{4}] - 3 E {[E_{X}^{2}]}^{2}) = ρ^{2} E [E_{X}^{2}] (E [H^{4}] - 3 E {[H^{2}]}^{2}) \\ \Rightarrow β^{HOLS} - β^{OLS} = 0. \end{matrix}$

Especially, if neither $E_{X}$ nor H have excess kurtosis, the difference is 0 for any ρ. This can be intuitively explained as it corresponds to Gaussian data (up to the moments we consider). For Gaussian $E_{X}$ and H, one can always write $Y = X β^{OLS} + \tilde{E} where X ⊥ \tilde{E},$ which cannot be distinguished from the null model (1). Or in other words $E [Y | X] = X β^{OLS}$ , that is, the OLS parameter is well-specified although it is not the parameter β that we would like to recover. For other data-generating distributions, one should be able to distinguish H₀ from certain deviations when hidden confounding is present. We discuss the implications of this in the general multivariate setting in Section 3.2. Similar behavior occurs for a violation of H₀ in terms of a nonlinear model $Y = f (X, ϵ)$ which then (typically) leads to $β^{HOLS} - β^{OLS} \neq 0$ .

One can construct a test based on the sample estimates of $β^{HOLS}$ and $β^{OLS}$ . We consider the centered data $\tilde{x} = x - \bar{x} e, \tilde{y} = y - \bar{y} e and \tilde{ϵ} = ϵ - \bar{ϵ} e = (I_{n} - \frac{1}{n} e e^{⊤}) ϵ,$

where we use the upper bar to denote sample means. We can derive $\tilde{y} = y - \bar{y} e = x β - \bar{x} e β + ϵ - \bar{ϵ} e = \tilde{x} β + \tilde{ϵ} .$

We now obtain ${\hat{β}}^{OLS}$ from regression through the origin of $\tilde{y}$ versus $\tilde{x}$ with an error term of $\tilde{ϵ}$ and ${\hat{β}}^{HOLS}$ from regression through the origin of $\tilde{x} ⊙ \tilde{y}$ versus $\tilde{x} ⊙ \tilde{y}$ with an error term of $\tilde{x} ⊙ \tilde{ϵ}$ . More precisely, we define ${\hat{β}}^{OLS} : = \frac{{\tilde{x}}^{⊤} \tilde{y}}{{\tilde{x}}^{⊤} \tilde{x}} and {\hat{β}}^{HOLS} : = \frac{{({\tilde{x}}^{2})}^{⊤} (\tilde{x} ⊙ \tilde{y})}{{({\tilde{x}}^{2})}^{⊤} ({\tilde{x}}^{2})} = \frac{{({\tilde{x}}^{3})}^{⊤} (\tilde{y})}{{({\tilde{x}}^{2})}^{⊤} ({\tilde{x}}^{2})} .$

Under H₀, one can see that $({\hat{β}}^{HOLS} - {\hat{β}}^{OLS})$ given x is some known linear combination of $ϵ$ . Assuming further Gaussianity of $ϵ$ , it is conditionally Gaussian. We find(3) $\begin{matrix} ({\hat{β}}^{HOLS} - {\hat{β}}^{OLS}) | x \overset{H_{0}}{\sim} \\ N (0, σ^{2} (\frac{{({\tilde{x}}^{3})}^{⊤} (I_{n} - \frac{1}{n} e e^{⊤}) ({\tilde{x}}^{3})}{{({({\tilde{x}}^{2})}^{⊤} ({\tilde{x}}^{2}))}^{2}} - \frac{1}{({\tilde{x}}^{⊤} \tilde{x})})) . \end{matrix}$ (3)

We can calculate this variance except for $σ^{2}$ . Further, we can consistently estimate $σ^{2}$ , for example, with the standard formula ${\hat{σ}}^{2} = \frac{{‖ \tilde{y} - \tilde{x} {\hat{β}}^{OLS} ‖}_{2}^{2}}{n - 2} .$

Thus, we receive asymptotically valid z-tests for the null-hypothesis H₀ that the model (1) holds. We treat the extension to non-Gaussian $ϵ$ in Section 2.2 (for the multivariate case directly). As discussed above, in the presence of confounding, we can have that $β^{HOLS} \neq β^{OLS}$ . In such situations, a test assuming (3) will have asymptotic power equal to 1 for correctly rejecting H₀ under some conditions. These asymptotic results are discussed in Sections 3.1 and 3.2.

2.2 Multivariate Regression

We typically want to examine the goodness of fit of a linear model with p > 1 covariates. We consider p to be fixed in this section and discuss the case where p is allowed to diverge with n in Section 2.3.

We consider the causal model(4) $\begin{matrix} Y \leftarrow X^{⊤} β + E, where X ⊥ E, \\ E [E] = 0 and E [E^{2}] = σ^{2} < \infty \end{matrix}$ (4) with $X \in R^{p}$ and $β \in R^{p}$ . Note that $E [E] = 0$ can always be enforced by including an intercept in the set of predictors. We assume the according moment matrix $Σ^{X}$ to be invertible. Then, the principal submatrices $Σ_{- j, - j}^{X} : = E [X_{- j} X_{- j}^{⊤}]$ are also invertible. We formulate a global null hypothesis that the model in (4) is correct and we denote it by H₀. To make use of the test described for the univariate case, we consider every component $j \in {1, \dots, p}$ separately and work with partial regression, see, for example, Belsley, Kuh, and Welsch (Citation2005). For the population version, we define(5) $\begin{matrix} Z_{j} : = X_{j} - X_{- j}^{⊤} γ_{j}, where \\ γ_{j} : = \underset{b \in R^{p - 1}}{argmin} E [{(X_{j} - X_{- j}^{⊤} b)}^{2}] = {(Σ_{- j, - j}^{X})}^{- 1} E [X_{- j} X_{j}] \\ W_{j} : = Y - X_{- j}^{⊤} ζ_{j}, where \\ ζ_{j} : = \underset{b \in R^{p - 1}}{argmin} E [{(Y - X_{- j}^{⊤} b)}^{2}] = {(Σ_{- j, - j}^{X})}^{- 1} E [X_{- j} Y] . \end{matrix}$ (5)

Under H₀, it holds that $W_{j} = Z_{j} β_{j} + E$ .

For $β^{OLS} : = {(Σ^{X})}^{- 1} E [X Y]$ , we find $β_{j}^{OLS} = \frac{E [Z_{j} W_{j}]}{E [Z_{j}^{2}]} \overset{H_{0}}{=} β_{j} .$

The first equality is a well-known application of the Frish-Waugh theorem, see, for example, Greene (Citation2003). We define the according HOLS parameter by partial regression for every component j separately, namely $β_{j}^{HOLS} : = \frac{E [Z_{j}^{3} W_{j}]}{E [Z_{j}^{4}]} \overset{H_{0}}{=} β_{j} .$

We define a local, that is, per-covariate null hypothesis $H_{0, j} : β_{j}^{OLS} = β_{j}^{HOLS}$ . The difference $β_{j}^{OLS} - β_{j}^{HOLS}$ can detect certain local alternatives from the null hypothesis H₀. Here, local refers to the covariate X_j whose effect on Y is potentially confounded or involves a nonlinearity. Under model (4), $H_{0, j}$ holds true for every j. We discuss in Sections 3 and 4 some concrete examples, where it is insightful to consider tests for individual $H_{0, j}$ .

We turn to sample estimates of the parameters. The residuals are estimated by $\begin{matrix} {\hat{z}}_{j} = x_{j} - P_{- j} x_{j} = P_{- j}^{⊥} x_{j} and \\ {\hat{w}}_{j} = y - P_{- j} y = P_{- j}^{⊥} y \overset{H_{0}}{=} P_{- j}^{⊥} (x β + ϵ) = {\hat{z}}_{j} β_{j} + P_{- j}^{⊥} ϵ . \end{matrix}$

With ordinary least squares, we receive ${\hat{β}}_{j}^{OLS}$ from regression of ${\hat{w}}_{j}$ versus ${\hat{z}}_{j}$ , where the error term is $P_{- j}^{⊥} ϵ$ . Accordingly, we calculate ${\hat{β}}_{j}^{HOLS}$ from regression of ${\hat{z}}_{j} ⊙ {\hat{w}}_{j}$ versus ${\hat{z}}_{j}^{2}$ with an error term ${\hat{z}}_{j} ⊙ P_{- j}^{⊥} ϵ$ . Thus, we define ${\hat{β}}_{j}^{OLS} : = \frac{{\hat{z}}_{j}^{⊤} {\hat{w}}_{j}}{{\hat{z}}_{j}^{⊤} {\hat{z}}_{j}} and {\hat{β}}_{j}^{HOLS} : = \frac{{({\hat{z}}_{j}^{2})}^{⊤} ({\hat{z}}_{j} ⊙ {\hat{w}}_{j})}{{({\hat{z}}_{j}^{2})}^{⊤} ({\hat{z}}_{j}^{2})} = \frac{{({\hat{z}}_{j}^{3})}^{⊤} {\hat{w}}_{j}}{{({\hat{z}}_{j}^{3})}^{⊤} {\hat{z}}_{j}} .$

This is analogous to the univariate case, where we have $\tilde{y}$ instead of ${\hat{w}}_{j}, \tilde{x}$ instead of ${\hat{z}}_{j}$ and $(I_{n} - \frac{1}{n} e e^{⊤})$ instead of $P_{- j}^{⊥}$ , and $(I_{n} - \frac{1}{n} e e^{⊤})$ can be thought of as orthogonal projection onto e’s complement, which completes the analogy. Again, we see that given x, $({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})$ is some known linear combination of $ϵ$ , thus, it is conditionally Gaussian for Gaussian $ϵ$ . The same holds for $({\hat{β}}^{HOLS} - {\hat{β}}^{OLS})$ .

Naturally, Gaussian $E$ is an overly strong assumption. Therefore, we consider additional assumptions such that the central limit theorem can be invoked.

(A1) The moment matrix $Σ^{X}$ has positive smallest eigenvalue.
(A2) $E [X_{j}^{6}] < \infty$ and $E [Z_{j}^{6}] < \infty \forall j .$

Further, let(6) $\begin{matrix} {\tilde{Z}}_{j}^{3} : = Z_{j}^{3} - X_{- j}^{⊤} {\tilde{γ}}_{j}, where \\ {\tilde{γ}}_{j} : = \underset{b \in R^{p - 1}}{argmin} E [{(Z_{j}^{3} - X_{- j}^{⊤} b)}^{2}] = {(Σ_{- j, - j}^{X})}^{- 1} E [X_{- j} Z_{j}^{3}] . \end{matrix}$ (6)

Note that $E [{({\tilde{Z}}_{j}^{3})}^{2}] \leq E [Z_{j}^{6}] < \infty$ .

Theorem 1.

Assume that the data follows the model (4) and that (A1)–(A2) hold. Let p be fixed and $n \to \infty$ . Then, $\begin{matrix} \sqrt{n} ({\hat{β}}^{HOLS} - {\hat{β}}^{OLS}) \overset{D}{\to} N (0, σ^{2} E [V V^{⊤}]) \\ \frac{1}{n} {\hat{v}}^{⊤} \hat{v} \overset{P}{\to} E [V V^{⊤}], where \\ {\hat{v}}_{j} = \frac{P_{- j}^{⊥} ({\hat{z}}_{j}^{3})}{\frac{1}{n} {({\hat{z}}_{j}^{2})}^{⊤} ({\hat{z}}_{j}^{2})} - \frac{{\hat{z}}_{j}}{\frac{1}{n} {\hat{z}}_{j}^{⊤} {\hat{z}}_{j}} and \\ V_{j} = \frac{{\tilde{Z}}_{j}^{3}}{E [Z_{j}^{4}]} - \frac{Z_{j}}{E [Z_{j}^{2}]} . \end{matrix}$

Note that $\begin{matrix} ({\hat{β}}^{HOLS} - {\hat{β}}^{OLS}) \overset{H_{0}}{=} \frac{1}{n} {\hat{v}}^{⊤} ϵ, and, in analogy to (3), \\ \frac{1}{n^{2}} {\hat{v}}_{j}^{⊤} {\hat{v}}_{j} = \frac{{({\hat{z}}_{j}^{3})}^{⊤} P_{- j}^{⊥} ({\hat{z}}_{j}^{3})}{{({({\hat{z}}_{j}^{2})}^{⊤} ({\hat{z}}_{j}^{2}))}^{2}} - \frac{1}{{\hat{z}}_{j}^{⊤} {\hat{z}}_{j}} . \end{matrix}$

Following Theorem 1, we can test the null hypothesis H₀ with a consistent estimate for $σ^{2}$ . Such an estimate can be obtained, for example, using the standard formula ${\hat{σ}}^{2} = \frac{‖ y - x {\hat{β}}^{OLS} ‖_{2}^{2}}{n - p} .$

We define for later reference(7) $\hat{var} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS}) : = {\hat{σ}}^{2} \frac{1}{n^{2}} {\hat{v}}_{j}^{⊤} {\hat{v}}_{j} .$ (7)

To test $H_{0, j}$ , we can compare $({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})$ to the quantiles of the univariate normal distribution with the according variance. The joint distribution leads to a global test that controls the Type I error. Namely, one can look at the maximum test statistic $T = \max_{k} | {\hat{β}}_{k}^{HOLS} - {\hat{β}}_{k}^{OLS} | \overset{H_{0}}{\sim} \max_{k} | S_{k} |$ , where $S \sim N (0, {\hat{σ}}^{2} {\hat{v}}^{⊤} \hat{v} / n^{2})$ can be easily simulated. Further, one receives multiplicity corrected individual p-values for $H_{0, j}$ by comparing each $| {\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS} |$ to the distribution of $\max_{k} | S_{k} |$ . This is in analogy to the multiplicity correction suggested by Bühlmann (Citation2013). Naturally, other multiplicity correction techniques such as Bonferroni-Holm are valid as well.

Algorithm 1 summarizes how to find both raw and multiplicity corrected p-values for each component j corresponding to the jth covariate, p_j and P_j, respectively. Then, one would reject the global null hypothesis H₀ that the model (4) holds if $\min_{j} P_{j} \leq α$ , and such a decision procedure provides control of the Type I error at level α. Note that this means that we have strong control of the FWER for testing all $H_{0, j}$ .

Algorithm 1

HOLS check

1: for j = 1 to p do

2: $P_{- j}^{⊥} = I_{n} - x_{- j} (x_{- j}^{⊤} x_{- j}) x_{- j}^{⊤}$

3: Regress $x_{j}$ versus $x_{- j}$ via least squares, denote the residual by ${\hat{z}}_{j} = P_{- j}^{⊥} x_{j}$

4: Regress y versus $x_{- j}$ via least squares, denote the residual by ${\hat{w}}_{j} = P_{- j}^{⊥} y$

5: ${\hat{β}}_{j}^{OLS} = \frac{{\hat{z}}_{j}^{⊤} {\hat{w}}_{j}}{{\hat{z}}_{j}^{⊤} {\hat{z}}_{j}}, {\hat{β}}_{j}^{HOLS} = \frac{{({\hat{z}}_{j}^{3})}^{⊤} {\hat{w}}_{j}}{{\hat{z}}_{j}^{⊤} {\hat{z}}_{j}}$ and ${\hat{v}}_{j} = \frac{P_{- j}^{⊥} ({\hat{z}}_{j}^{3})}{\frac{1}{n} {({\hat{z}}_{j}^{2})}^{⊤} ({\hat{z}}_{j}^{2})} - \frac{{\hat{z}}_{j}}{\frac{1}{n} {\hat{z}}_{j}^{⊤} {\hat{z}}_{j}}$

6: ${\hat{σ}}^{2} = \frac{‖ y - x {\hat{β}}^{OLS} ‖_{2}^{2}}{n - p}$

7: Create n_sim i.i.d copies of $S \sim N (0, {\hat{σ}}^{2} {\hat{v}}^{⊤} \hat{v} / n^{2})$ , say, $s^{1}$ to $s^{n_{sim}}$

8: for j = 1 to p do

9: $p_{j} = 2 (1 - Φ (\frac{| {\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS} |}{\hat{σ} \frac{1}{n} {‖ {\hat{v}}_{j} ‖}_{2}}))$

10: $P_{j} = \frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} 1 (| {\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS} | > {‖ s^{i} ‖}_{\infty})$

Corollary 1.

Assume the conditions of Theorem 1. Consider the decision rule to reject H₀ iff $\min_{j} P_{j} \leq α$ , where P_j is as in Step 10 of Algorithm 1. Then, the Type I error is asymptotically controlled at α. Furthermore, the FWER is asymptotically controlled at level α for testing all local hypotheses ${H_{0, j}; j = 1, \dots, p}$ with the decision rule to reject $H_{0, j}$ iff $P_{j} \leq α$ .

We provide simulation results supporting this theory in Section A of the supplemental materials.

2.3 High-Dimensional Data

We now turn to high-dimensional situations. We assume the global null hypothesis (4) but allow for p to diverge with and even exceed n such that ordinary least squares regression is not applicable. Instead, we apply the debiased Lasso introduced in Zhang and Zhang (Citation2014) and further discussed in van de Geer et al. (Citation2014). We denote the estimator again by ${\hat{β}}^{OLS}$ since it converges under certain conditions to the population parameter $β^{OLS}$ .

From the debiased Lasso, we receive ${\hat{z}}_{j} = x_{j} - x_{- j} {\hat{γ}}_{j}$ , where ${\hat{γ}}_{j}$ is obtained by regressing $x_{j}$ versus $x_{- j}$ using the Lasso, and ${\hat{w}}_{j} = y - x_{- j} {\hat{β}}_{- j}$ with $\hat{β}$ coming from the Lasso fit of y versus x. Since ${\hat{β}}_{j}^{OLS} = {\hat{z}}_{j}^{⊤} {\hat{w}}_{j} / {\hat{z}}_{j}^{⊤} x_{j}$ , one might want to use ${({\hat{z}}_{j}^{3})}^{⊤} {\hat{w}}_{j} / {({\hat{z}}_{j}^{3})}^{⊤} x_{j}$ for HOLS. However, this leads in general to an uncontrollable approximation error since $E [Z_{j}^{3} X_{- j}] \neq 0$ . As a remedy, we suggest a second level of orthogonalization based on ${\tilde{Z}}_{j}^{3}$ and ${\tilde{γ}}_{j}$ as defined in (6). Naturally, we have ${\tilde{Z}}_{j}^{3} = Z_{j}^{3}$ iff $E [Z_{j}^{3} X_{- j}] = 0$ and always $E [{\tilde{Z}}_{j}^{3} X_{- j}] = 0$ . To approximate ${\tilde{z}}_{j}^{3}$ we use the Lasso for the regression ${\hat{z}}_{j}^{3}$ versus $x_{- j}$ leading to ${\hat{\tilde{z}}}_{j}^{3} = {\hat{z}}_{j}^{3} - x_{- j} {\hat{\tilde{γ}}}_{j}$ . We define ${\hat{β}}^{HOLS}$ as $\begin{matrix} {\hat{β}}_{j}^{HOLS} : = \frac{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} {\hat{w}}_{j}}{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}} = \frac{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} (y - x_{- j} {\hat{β}}_{- j})}{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}} \\ \overset{H_{0}}{=} β_{j} + \frac{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{- j} (β_{- j} - {\hat{β}}_{- j}) / n}{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j} / n} + \frac{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} ϵ}{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}} . \end{matrix}$

Finally, we are interested in the difference between ${\hat{β}}_{j}^{HOLS}$ and ${\hat{β}}_{j}^{OLS}$ . Under suitable assumptions for the sparsity, the moment matrix, and the tail behavior of X and $E$ , we can derive the limiting Gaussian distribution of this difference allowing for asymptotically valid tests. W apply Algorithm 2 where we make use of the (asymptotic) normality of the nonvanishing term in this difference. For non-Gaussian $E$ , a multiplicity correction method that does not rely on exact Gaussianity of this remainder might be preferred since the CLT does not apply for dimensions growing too fast.

Algorithm 2

HOLS check for p > n

1: Regress y versus x via Lasso with a penalty parameter λ, denote the estimated regression coefficients by $\hat{β}$

2: for j = 1 to p do

3: Regress $x_{j}$ versus $x_{- j}$ via Lasso with a penalty parameter λ_j, denote the residual by ${\hat{z}}_{j}$

4: Regress ${\hat{z}}_{j}^{3}$ versus $x_{- j}$ via Lasso with a penalty parameter ${\tilde{λ}}_{j}$ , denote the residual by ${\hat{\tilde{z}}}_{j}^{3}$

5: ${\hat{w}}_{j} = y - x_{- j} {\hat{β}}_{- j}$

6: ${\hat{β}}_{j}^{OLS} = \frac{{\hat{z}}_{j}^{⊤} {\hat{w}}_{j}}{{\hat{z}}_{j}^{⊤} x_{j}}, {\hat{β}}_{j}^{HOLS} = \frac{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} {\hat{w}}_{j}}{{({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}}$ and ${\hat{v}}_{j} = \frac{({\hat{\tilde{z}}}_{j}^{3})}{\frac{1}{n} {({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}} - \frac{{\hat{z}}_{j}}{\frac{1}{n} {\hat{z}}_{j}^{⊤} x_{j}}$

7: ${\hat{σ}}^{2} = \frac{‖ y - x \hat{β} ‖_{2}^{2}}{n - {| \hat{β} |}_{0}}$ (or any other reasonable variance estimator)

8: Create n_sim i.i.d copies of $S \sim N (0, {\hat{σ}}^{2} \frac{1}{n^{2}} {\hat{v}}^{⊤} \hat{v})$ , say, $s^{1}$ to $s^{n_{sim}}$

9: for j = 1 to p do

10: $p_{j} = 2 (1 - Φ (\frac{| {\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS} |}{\hat{σ} \frac{1}{n} {‖ {\hat{v}}_{j} ‖}_{2}}))$

11: $P_{j} = \frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} 1 (| {\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS} | > {‖ s^{i} ‖}_{\infty})$

We provide here the main result to justify Algorithm 2 invoking additional assumptions on the dimensionality and sparsity of the problem. We use the definitions $s : = {‖ β ‖}_{0}, s_{j} : = {‖ γ_{j} ‖}_{0}$ and ${\tilde{s}}_{j} : = {‖ {\tilde{γ}}_{j} ‖}_{0}$ to denote the different levels of sparsity.

(C1) The design matrix x has iid sub-Gaussian rows. The moment matrix $Σ^{X}$ has strictly positive smallest eigenvalue $Λ_{\min}^{2}$ satisfying $1 / Λ_{\min}^{2} = O (1) .$ Also, $\max_{j} Σ_{j, j}^{X} = O (1) .$
(C2) $s = O (\frac{n^{1 / 2}}{\log {(p)}^{3}})$ .
(C3) $s s_{j}^{2} = O (\frac{n^{3 / 2}}{\log {(p)}^{3}})$ , $s s_{j} = O (\frac{n}{\log {(p)}^{5 / 2}})$ and $s s_{j}^{1 / 2} = O (\frac{n^{1 / 2}}{\log {(p)}^{3 / 2}})$ .
(C4) $s_{j} = O (\frac{n^{3 / 5}}{\log (p)})$ .
(C5) $\sqrt{n} s λ {\tilde{λ}}_{j} = O (1)$ .
(C6) ${\tilde{s}}_{j} {\tilde{λ}}_{j}^{2} = O (1)$ .

Theorem 2.

Assume that the data follows the model (4) with sub-Gaussian $E$ and that (C1)–(C6) hold ( $\forall j$ ). Let $\hat{β}$ come from Lasso regression with $λ ≍ \sqrt{\log (p) / n}, {\hat{z}}_{j}$ from nodewise Lasso regression using $λ_{j} ≍ \sqrt{\log (p) / n}$ , and ${\hat{\tilde{z}}}_{j}^{3}$ from nodewise Lasso regression of ${\hat{z}}_{j}^{3}$ versus $x_{- j}$ using ${\tilde{λ}}_{j} ≍ \max {\log {(p)}^{5 / 2} n^{- 1 / 2},$ $s_{j}^{2} \log {(p)}^{5 / 2} n^{- 3 / 2}, s_{j} \log {(p)}^{2} n^{- 1}, \sqrt{s_{j}} \log (p) n^{- 1 / 2}}$ . Let $\hat{σ}$ be any consistent estimator for σ. Then, $\begin{matrix} \frac{\sqrt{n} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})}{\sqrt{{\hat{σ}}^{2} \frac{1}{n} {\hat{v}}_{j}^{⊤} {\hat{v}}_{j}}} \overset{D}{\to} N (0, 1) \\ where {\hat{v}}_{j} = \frac{({\hat{\tilde{z}}}_{j}^{3})}{\frac{1}{n} {({\hat{\tilde{z}}}_{j}^{3})}^{⊤} x_{j}} - \frac{{\hat{z}}_{j}}{\frac{1}{n} {\hat{z}}_{j}^{⊤} x_{j}} . \end{matrix}$

We defer the technical details to Section C of the supplemental materials. Simulation results concerning high-dimensional data can be found in Section A of the supplemental materials.

3 The Confounded Case and Local Null Hypotheses

In this section and the following, we mainly exploit confounding in linear models as the alternative hypothesis since these are the models where our tests for the local null hypotheses $H_{0, j}$ are most informative. For a discussion of which interpretations might carry over to more general data generating distributions, we refer to Section 4.3.

Note that everything that is discussed in Sections 3 and 4 implicitly applies to high-dimensional data as well under suitable assumptions. We refrain from going into detail for the sake of brevity. Thus, Theorems 3–6 which contain our main asymptotic results for the local intepretation are designed explicitly for the fixed p case.

We look at the causal model(8) $\begin{matrix} X & \leftarrow ρ H + E_{X} \\ Y & \leftarrow X^{⊤} β + H^{⊤} α + E, \end{matrix}$ (8) where $H \in R^{d}, E_{X} \in R^{p}$ and $E \in R$ are independent and centered random variables, and $α \in R^{d}$ and $ρ \in R^{p \times d}$ are fixed model parameters. Thus, there exists some hidden confounder H. For the inner product matrices, it holds that $Σ^{X} = Σ^{E_{X}} + ρ Σ^{H} ρ^{⊤} .$

Furthermore, we have(9) $\begin{matrix} β^{OLS} = {(Σ^{X})}^{- 1} E [X Y] = {(Σ^{X})}^{- 1} (Σ^{X} β + ρ Σ^{H} α) \\ = β + {(Σ^{X})}^{- 1} ρ Σ^{H} α . \end{matrix}$ (9)

We will generally refer to $β_{j}^{OLS} \neq β_{j}$ , where β_j is according to model (8), as confounding bias on $β_{j}^{OLS}$ . Further, when writing directly confounded, we mean covariate indices j for which $X_{j} \neq E_{X_{j}}$ .

Note that we can always decompose Y both globally and locally as follows(10) $\begin{matrix} Y = X^{⊤} β^{OLS} + \tilde{E}, with E [X \tilde{E}] = 0, E [\tilde{E}] = 0 \\ but (potentially) X ⊥ \tilde{E} \end{matrix}$ (10) (11) $\begin{matrix} W_{j} = Z_{j} β_{j}^{OLS} + \tilde{E}, with E [Z_{j} \tilde{E}] = 0, E [\tilde{E}] = 0 \\ but (potentially) Z_{j} ⊥ \tilde{E} \end{matrix}$ (11) using the definitions from (5). We now want to see how $β^{OLS}$ relates to $β$ in certain models. Especially, we are interested in whether there is some potential local interpretation in the sense of distinguishing between “confounded” and “unconfounded” variables. From (9), we see that this is linked to the structure of the covariance matrices as well as $ρ$ and $α$ . We define the sets(12) $\begin{matrix} V = {j : β_{j}^{OLS} = β_{j}} and \\ U = {j : β_{j}^{OLS} = β_{j}^{HOLS}} = {j : H_{0, j} is true} . \end{matrix}$ (12)

Using the Woodbury matrix identity, we find a sufficient condition(13) $\begin{matrix} j \in V if ρ^{⊤} {(Σ^{E_{X}})}_{j}^{- 1} = 0 which is implied by \\ {k \in {1, \dots, p} : {(Σ^{E_{X}})}_{j k}^{- 1} \neq 0} \cap \\ {l \in {1, \dots, p} : ‖ e_{l}^{⊤} ρ ‖ > 0} = \emptyset . \end{matrix}$ (13)

Thus, if the intersection between covariates that have linear predictive power for X_j and covariates that are directly confounded is empty, it must hold that $β_{j}^{OLS} = β_{j}$ . Therefore, we can indeed say that for these variables we estimate the true causal effect using ordinary least squares.

To correctly detect V, we would like $β_{j}^{HOLS} = β_{j}^{OLS} = β_{j}$ . As $β_{j}^{HOLS}$ involves higher-order moments, knowledge of the covariance structure is not sufficient to check this. From (11), we see that $E [Z_{j}^{3} \tilde{E}] = 0$ is necessary and sufficient to ensure $j \in U$ . In Section 3.2, we discuss the two cases where detection fails, that is, $U ∖ V \neq \emptyset$ and $V ∖ U \neq \emptyset$ . We present models for which we can characterize a set of variables which are in $U \cap V = {j : β_{j}^{HOLS} = β_{j}^{OLS} = β_{j}}$ in Section 4.

3.1 Sample Estimates

For a confounded model, the hope is that the global test $\min_{j} P_{j} \leq α$ , where P_j is the adjusted p-value according to Step 10 in Algorithm 1, leads to a rejection of H₀, that is, the modeling assumption (4). One could further examine the local structure and, based on the corrected p-values P_j, distinguish the predictors for which we have evidence that $β_{j}^{HOLS} \neq β_{j}^{OLS}$ . We consider in the following this local interpretation, showing that we asymptotically control the Type-I error and receive power approaching 1. Implicitly, we assume that U is a useful proxy for V.

For all asymptotic results in this section, we assume p to be fixed and $n \to \infty$ as in Theorem 1.

Theorem 3.

Assume that the data follows the model (10) and that (A1)–(A2) hold. Assume further $σ_{\tilde{E}}^{2} = E [{\tilde{E}}^{2}] < \infty$ . Then, $\begin{matrix} {\hat{β}}_{j}^{OLS} = β_{j}^{OLS} + O_{p} (1), {\hat{β}}_{j}^{HOLS} = β_{j}^{HOLS} + O_{p} (1) \\ and \hat{var} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS}) = O_{p} (\frac{1}{n}), \end{matrix}$ where $\hat{var} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})$ is according to (7).

Thus, for some fixed alternative $| β_{j}^{HOLS} - β_{j}^{OLS} | > 0$ , the absolute z-statistics increases as $\sqrt{n}$ .

In order to get some local interpretation, the behavior for variables $j \in U$ is of importance. If $| β_{j}^{HOLS} - β_{j}^{OLS} | = 0$ , Theorem 3 is not sufficient to understand the asymptotic behavior. We refine the results using additional assumptions.

(A3) $E [{(X_{j} \tilde{E})}^{2}] < \infty \forall j$
(B1) $E [Z_{j}^{2} X_{k} \tilde{E}] = 0 \forall k \neq j$
(B2) $Z_{j} ⊥ \tilde{E}$
(B3) ${\tilde{Z}}_{j}^{3} ⊥ \tilde{E}$

Note that we use different letters for the assumptions to distinguish between those that are essentially some (mild) moment conditions and those that truly make nodes unconfounded. Obviously, (B2) is not necessary for $β_{j}^{OLS} = β_{j}^{HOLS}$ , but we will focus on these variables as these are the ones that are truly unconfounded in the sense that the projected single variable model (11) has an independent error term, while as for other variables it can be rather considered an unwanted artifact of our method. Furthermore, the derived asymptotic variance results only hold true when assuming (B2) and (B3) as well. Assumption (A3) implies a further moment condition. Especially, when considering nonlinearities, there exist cases for which (A3) is not implied by (A2). We discuss Assumptions (B1)–(B3) for certain models in Section 4. Condition (B1) seems to be a bit artificial but is invoked in the proofs. We argue in Section 4 that it is naturally linked to the models in scope.

Theorem 4.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Let j be some covariate with $β_{j}^{OLS} = β_{j}^{HOLS}$ for which (B1)–(B3) hold. Then, $\frac{\sqrt{n} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})}{\sqrt{\hat{var} (\sqrt{n} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS}))}} \overset{D}{\to} N (0, 1) .$

Thus, for these predictors we receive asymptotically valid tests.

Multiplicity correction

In order not to falsely reject the local null hypothesis $H_{0, j}$ for any covariate with $j \in U$ (with probability at least $1 - α$ ), we need to invoke some multiplicity correction. Analogously to Section 2.2, one can see that ${\hat{β}}^{HOLS} - {\hat{β}}^{OLS} = {\hat{v}}^{⊤} \tilde{ϵ} / n$ , which enables the multiplicity correction as in Algorithm 1.

Theorem 5.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Let $U'$ be the set of variables j for which $j \in U$ and (B1)–(B3) hold. Then, $\begin{matrix} \sqrt{n} ({\hat{β}}_{U'}^{HOLS} - {\hat{β}}_{U'}^{OLS}) \overset{D}{\to} N (0, σ_{\tilde{E}}^{2} E [V_{U'} V_{U'}^{⊤}]) \\ \frac{1}{n} {\hat{v}}_{U'}^{⊤} {\hat{v}}_{U'} \overset{P}{\to} E [V_{U'} V_{U'}^{⊤}] \end{matrix}$

Corollary 2.

Assume the conditions of Theorem 5. Consider the decision rule to reject $H_{0, j}$ iff $P_{j} \leq α$ , where P_j is as in Step 10 of Algorithm 1. Then, the familywise error rate amongst the set $U'$ is asymptotically controlled at α.

3.2 Inferring V from U

Recall the definitions in (12). U is the set that we try to infer with our HOLS check. Naturally, one would rather be interested in the set V, which consists of the variables for which we can consistently estimate the true linear causal effect according to (8) through linear regression. We discuss here when using U as proxy for V might fail and especially analyze how variables could belong to the difference between the sets. For this, recall our formulation of the model when the global null hypothesis does not hold true in (10) and (11). Note that $j \in U$ is equivalent to $E [Z_{j}^{3} \tilde{E}] = 0$ .

For any variable $j \in U ∖ V$ , certain modeling assumptions, that we discuss in the sequel, cannot be fulfilled but they are not necessary for $E [Z_{j}^{3} \tilde{E}] = 0$ . Especially, the last equality always holds if both $E_{X}$ and H jointly have Gaussian kurtosis. If they are even jointly Gaussian, then it is clear that $X ⊥ \tilde{E}$ such that the model $(10)$ has independent Gaussian error. Thus, when using only observational data, it behaves exactly like a model under the global null hypothesis and, naturally, we cannot infer the confounding effect. Apart from Gaussian kurtosis, $j \in U ∖ V$ would be mainly due to special constellations implying cancelation of terms that one does not expect to encounter in normal circumstances.

For $j \in V ∖ U$ , Z_j and $\tilde{E}$ must not be independent. As $Z_{j} ⊥ \tilde{E}$ , the single-covariate model (11) is not a linear causal model with independent error term as given in (1). Therefore, from a causal inference perspective, one can argue that rejecting the local null hypothesis $H_{0, j}$ is the right thing to do in this case. Furthermore, having variables $j \in V$ is usually related to certain model assumptions except for very specific data setups that lead to cancelation of terms. Under these assumptions, $Z_{j} ⊥ \tilde{E}$ is then usually implied. An example where $β^{OLS} = β$ , but (potentially) $X ⊥ \tilde{E}$ is data for which $ρ Σ^{H} α = 0$ using the definitions from model (8).

Recovery of U

Based on our asymptotic results when the global null does not hold true, we would like to construct a method that perfectly detects the unconfounded variables as $n \to \infty$ . Define(14) $\hat{U} = {j : H_{0, j} not rejected}$ (14)

The question is how and when can we ensure that $\lim_{n \to \infty} P [\hat{U} = U] = 1.$

Suppose that we conduct our local z-tests at level α_n, which varies with the sample size such that $α_{n} \to 0$ as $n \to \infty$ . It will be more convenient to interpret this as a threshold on the (scaled) absolute z-statistics, say, τ_n that grows with n, where the z-statistics is defined as $t_{j} = \frac{\sqrt{n} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS})}{\sqrt{\hat{var} (\sqrt{n} ({\hat{β}}_{j}^{HOLS} - {\hat{β}}_{j}^{OLS}))}} .$

We refrain from calling it z_j to avoid confusion. We use an additional assumption which is a relaxed version of (B3).

(A4) $E [{({\tilde{Z}}_{j}^{3} \tilde{E})}^{2}] < \infty$

Theorem 6.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Assume that (B1) and (A4) hold $\forall j \in U$ . Let τ_n be the threshold on the absolute z-statistics to reject the according null hypothesis with $τ_{n} = O (\sqrt{n})$ and $1 / τ_{n} = O (1)$ . Then, $\lim_{n \to \infty} P [\hat{U} = U] = 1.$

In other words, we can choose τ_n to grow at any rate slower than $\sqrt{n}$ .

4 Specific Models

In this section, we discuss two types of models where the local interpretation applies. In these settings, there are variables for which $β_{j} = β_{j}^{OLS} = β_{j}^{HOLS}$ and Assumptions (B1)–(B3) hold even though the overall data follows the model (8). We note here first that the model of jointly Gaussian $E_{X}$ , for which the method is suited, is a special case of the model in Section 4.2.

4.1 Block Independence of $E_{X}$

Assume that the errors $E_{X}$ can be grouped into two or more independent and disjoint blocks. Denote the block that includes j by $B (j)$ . Then, it is clear that ${(Σ^{E_{X}})}_{j k}^{- 1} = 0$ if $B (j) \neq B (k)$ . If $X_{B (j)} = E_{X_{B (j)}}$ , that is, the confounder has no effect onto $X_{B (j)}$ , (13) holds for all covariates in $B (j)$ . Then, no variable in $X_{B (j)}$ contributes to the best linear predictor for $H^{⊤} α$ . Due to the block independence, this yields $X_{B (j)} ⊥ \tilde{E}$ and $Z_{j} ⊥ \tilde{E}$ , that is, (B2) is fulfilled. This also ensures $E [Z_{j}^{3} \tilde{E}] = 0$ . We consider the remaining assumptions: Naturally, the regression $Z_{j}^{3}$ versus $X_{- j}$ only involves $X_{B (j) ∖ j}$ and (B3) holds as well. For (B1), separately consider the case $k \in B (j)$ and $k \notin B (j)$ . In the first case, $E [Z_{j}^{2} X_{k} \tilde{E}] = E [Z_{j}^{2} X_{k}] E [\tilde{E}] = 0$ . In the second case, $E [Z_{j}^{2} X_{k} \tilde{E}] = E [Z_{j}^{2}] E [X_{k} \tilde{E}] = 0$ .

Theorem 7.

Assume the data follows the model (8) with errors $E_{X}$ that can be grouped into independent blocks. Then, $\begin{matrix} β_{j}^{HOLS} = β_{j}^{OLS} = β_{j} \forall j for which X_{B (j)} = E_{X_{B (j)}} . Further, \\ (B 1) - (B 3) hold \forall j for which X_{B (j)} = E_{X_{B (j)}} . \end{matrix}$

In some cases, block independence may be a restrictive assumption. Testing this assumption is not an easy problem, and will remain out of the scope of this article. However, the HOLS check still provides an indirect check of such an assumption since HOLS would likely reject the local null-hypotheses for all covariates, at least for large datasets, if there is no block that is unaffected by the confounding.

4.2 Linear Structural Equation Model

From the previous sections, we know that locally unconfounded structures, in the sense that $β_{j}^{OLS} = β_{j}$ , are strongly related to zeroes in the precision matrix. Thus, the question arises for what type of models having zeroes in the precision matrix is a usual thing. Besides block independence, which we have discussed in Section 4.1, this will mainly be the case if the data follows a linear structural equation model (SEM). Thus, we will focus on these linear SEMs for the interpretation of local, that is, by parameter, null hypotheses.

To start, assume that there are no hidden variables. So, let X be given by the following linear SEM(15) $X_{j} \leftarrow Ψ_{j} + \sum_{k \in PA (j)} θ_{j, k} X_{k} j = 1, \dots, p,$ (15) where the $Ψ_{1}, \dots, Ψ_{p}$ are independent and centered random variables. We use the notation $PA (j), CH (j)$ and $AN (j)$ for j’s parents, children and ancestors. Further, assume that there exists a directed acyclic graph (DAG) representing this structure. For this type of model, we know that a variable’s Markov boundary consists of its parents, its children, and its children’s other parents. For every other variable k outside of j’s Markov boundary, we have ${(Σ^{X})}_{j k}^{- 1} = 0$ . Thus, these 0 partial correlations are very usual. In the following, we will analyze how our local tests are especially applicable to this structure.

In the context of linear SEMs, hidden linear confounders can be thought of as unmeasured variables. Therefore, we split X which contains all possible predictors into two parts. Let $X_{M}$ be the measured variables and $X_{N}$ the hidden confounder variables. Let $Ψ = {(\begin{matrix} Ψ_{1}, \dots, Ψ_{p} \end{matrix})}^{⊤}$ with the according subsets $Ψ_{M}$ and $Ψ_{N}$ . Then, we can write $\begin{matrix} X = ω Ψ, X_{M} = ω_{M, M} Ψ_{M} + ω_{M, N} Ψ_{N}, and \\ X_{N} = ω_{N, M} Ψ_{M} + ω_{N, N} Ψ_{N} \end{matrix}$ for some suitable $ω \in R^{p \times p}$ , where $ω_{k, l} = 0$ for $k \neq l$ if $l \notin AN (k)$ and $ω_{k, k} = 1$ . Note that $ω_{M, M}$ is always invertible since it can be written as a triangular matrix with ones on the diagonal if permuted properly. Under model (8), Y can be thought of as a sink node in (15). To avoid confusion, we call the parameter if all predictors were observed $β^{*}$ . This leads to the definitions(16) $\begin{matrix} E_{X} : = ω_{M, M} Ψ_{M}, ρ : = ω_{M, N} and H : = Ψ_{N} such that \\ X_{M} = E_{X} + ρ H with E_{X} ⊥ H \\ Y - E = X^{⊤} β^{*} = X_{M}^{⊤} β_{M}^{*} + X_{N}^{⊤} β_{N}^{*} \\ = X_{M}^{⊤} (β_{M}^{*} + {(ω_{N, M} ω_{M, M}^{- 1})}^{⊤} β_{N}^{*}) + \\ H^{⊤} {(ω_{N, N} - ω_{N, M} ω_{M, M}^{- 1} ω_{M, N})}^{⊤} β_{N}^{*} \\ : = X_{M}^{⊤} β + H^{⊤} α . \end{matrix}$ (16)

When only the given subset is observed we are interested in the parameter $β$ as before. We have $β_{j} = β_{j}^{*}$ iff ${({(ω_{N, M} ω_{M, M}^{- 1})}^{⊤} β_{N}^{*})}_{j} = 0$ .

Theorem 8.

Assume that the data follows the model (15) and (16). Let $X_{M}$ and $X_{N}$ be the observed and hidden variables. Denote by ${PA}^{M} (k)$ the closest ancestors of k that are in M. Consider some $j \in M$ . $If \exists k \in N : (j \in {PA}^{M} (k) and β_{k} \neq 0), then β_{j} = β_{j}^{*} .$

In other words, the causal parameter can only change for variables that have at least one direct descendant in the hidden set which is a parent of Y itself. By direct descendant, we mean that there is a path from j to k that does not pass any other observed variable. We analyze for which variables we can reconstruct this causal parameter using ordinary least squares regression.

Theorem 9.

Assume that the data follows the model (15) and (16). Let $X_{M}$ and $X_{N}$ be the observed and hidden variables. Then, $\begin{matrix} β_{j}^{HOLS} = β_{j}^{OLS} = β_{j} \forall j \in M that are not in the Markov \\ boundary of any hidden variable . \\ (B 1) - (B 3) hold \forall j \in M that are not in the Markov \\ boundary of any hidden variable . \end{matrix}$

Thus, for those variables, we can (a) correctly retrieve the causal parameter using ordinary least squares regression and (b) detect that this is the true parameter by comparing it to $β_{j}^{HOLS}$ .

Simulation example

We assess the performance of our HOLS method in a linear SEM using a simple example. In , we show the DAG that represents the setup.

Fig. 1 DAG of the linear SEM. X₃ is assumed to be hidden which is depicted by the dashed circle. We use the following specifications: $Ψ_{1} \overset{D}{=} Ψ_{3} \overset{D}{=} Ψ_{5} \sim t_{7} / \sqrt{7 / 5}, Ψ_{2} \overset{D}{=} Ψ_{6} \overset{D}{=} Ψ_{7} \sim N (0, 1 / 2), Ψ_{4} \sim Unif [- \sqrt{3 / 2}, \sqrt{3 / 2}]$ and $E \sim N (0, 1)$ . $θ_{2, 1} = θ_{7, 6} = \sqrt{1 / 2}, θ_{4, 2} = θ_{4, 3} = θ_{6, 4} = θ_{6, 5} = 0.5$ and $β_{3}^{*} = \sqrt{5 / 2}$ .

Fig. 1 DAG of the linear SEM. X3 is assumed to be hidden which is depicted by the dashed circle. We use the following specifications: Ψ1=DΨ3=DΨ5∼t7/7/5, Ψ2=DΨ6=DΨ7∼N(0,1/2), Ψ4∼Unif[−3/2,3/2] and E∼N(0,1). θ2,1=θ7,6=1/2, θ4,2=θ4,3=θ6,4=θ6,5=0.5 and β3*=5/2.

For simplicity, the parameters are set such that X₁ to X₇ all have unit variance. X₃ is the only parent of Y and we apply the HOLS method using all but X₃ as predictors, that is, X₃ is treated as hidden variable. Following Theorem 9, we know that for variables X₁ and X₅ to X₇ the causal effect on Y is consistently estimated with OLS, while we chose the detailed setup such that there is a detectable confounding bias on $β_{2}^{OLS}$ and $β_{4}^{OLS}$ . Thus, ideally, our local tests reject the null hypothesis for those two covariates but not for the rest.

For numerical results, we let the sample size grow from 10² to 10⁶. For each sample size, we do 200 simulation runs. On the left-hand side of , we show the average absolute z-statistics per predictor for the different sample sizes. For X₂ and X₄ we see the expected $\sqrt{n}$ -growth. For the other variables, the empirical averages are close to the theoretical mean, which equals $\sqrt{2 / π} \approx 0.8$ , with a minimum of 0.70 and a maximum of 0.88. Further, we see that the confounding bias on the OLS parameter for X₄, which is a child of the hidden variable, is easier to detect than the bias onto the parameter for X₂, which is a child’s other parent. The multiplicity corrected p-value for X₄ is rejected at level $α = 0.05$ in 91.5% of the cases for $n = 10^{3}$ , while as the null hypothesis for X₂ is only rejected with a empirical probability of 3%. For X₂, it takes $n = 10^{5}$ samples to reject the local null hypothesis in 89% of the simulation runs.

Fig. 2 Simulation in a linear SEM corresponding to . The results are based on 200 simulation runs. On the left: Average absolute z-statistics per covariate for different sample sizes. The dotted lines grow as $\sqrt{n}$ and are fit to match perfectly at $n = 10^{5}$ . On the right: Empirical probability of perfectly recovering U (see, EquationEquation (12)(12) $\begin{matrix} V = {j : β_{j}^{OLS} = β_{j}} and \\ U = {j : β_{j}^{OLS} = β_{j}^{HOLS}} = {j : H_{0, j} is true} . \end{matrix}$ (12) ) for different sample sizes.

Fig. 2 Simulation in a linear SEM corresponding to Figure 1. The results are based on 200 simulation runs. On the left: Average absolute z-statistics per covariate for different sample sizes. The dotted lines grow as n and are fit to match perfectly at n=105. On the right: Empirical probability of perfectly recovering U (see, EquationEquation (12)(12) V={j:βjOLS=βj} andU={j:βjOLS=βjHOLS}={j:H0,j is true}.(12) ) for different sample sizes.

Following Section 3.2, we should be able to perfectly recover the set U (see, EquationEquation (12)(12) $\begin{matrix} V = {j : β_{j}^{OLS} = β_{j}} and \\ U = {j : β_{j}^{OLS} = β_{j}^{HOLS}} = {j : H_{0, j} is true} . \end{matrix}$ (12) ) as $n \to \infty$ if we let the threshold on the absolute z-statistics grow at the right rate. Therefore, we plot on the right-hand side of the empirical probability of perfectly recovering U over a range of possible thresholds for the different sample sizes. For $n = 10^{6}$ , we could achieve an empirical probability of 1. For $n = 10^{5}$ the optimum probability is 87%, while as for $n = 10^{4}$ it is only 19%.

Naturally, perfectly recovering U is a very ambitious goal for smaller sample sizes, and one might want to consider different objectives. In , we plot two different performance metrics. On the left-hand side, we plot the empirical probability of not falsely including any variable in $\hat{U}$ against the average intersection size $| \hat{U} \cap U |$ . The curve is parameterized implicitly by the threshold on the absolute z-statistics in order to reject the local null hypothesis for some variable. Thus, the graphic considers the question of how many variables in U can be recovered while keeping the probability of not falsely including any low. For a sample size of 10⁵, we have an average intersection size of 3.97 allowing for a 10% probability of false inclusions. For 10⁴, it is still 0.995. Thus, we can find (almost) one of the four variables in U on average. As we see in , the bias on $β_{4}^{OLS}$ is much easier to detect than the bias on $β_{2}^{OLS}$ . Thus, keeping the probability of including X₂ in $\hat{U}$ low is still an ambitious task. Therefore, we analyze on the right-hand side of how many variables in U we can find while removing a certain amount of confounding signal. We define the remaining fraction as $\frac{{‖ β_{\hat{U}}^{OLS} - β_{\hat{U}} ‖}_{1}}{{‖ β^{OLS} - β ‖}_{1}},$ that is, how much of the difference $β^{OLS} - β$ persists in terms of $l_{1}$ norm.

Fig. 3 Simulation in a linear SEM corresponding to . The results are based on 200 simulation runs. On the left: Probability of not falsely including a variable in $\hat{U}$ versus average intersection size $| \hat{U} \cap U |$ (see, EquationEquation (14)(14) $\hat{U} = {j : H_{0, j} not rejected}$ (14) ). On the right: average remaining fraction of confounding signal versus average intersection size $| \hat{U} \cap U |$ . It holds that $| U | = 4$ . Both curves use the threshold on the absolute z-statistics as implicit curve parameter. Note that the legend applies to either plot.

$Fig. 3 Simulation in a linear SEM corresponding to Figure 1. The results are based on 200 simulation runs. On the left: Probability of not falsely including a variable in Û versus average intersection size |Û∩U| (see, EquationEquation (14)(14) Û={j:H0,j not rejected}(14) ). On the right: average remaining fraction of confounding signal versus average intersection size |Û∩U|. It holds that |U|=4. Both curves use the threshold on the absolute z-statistics as implicit curve parameter. Note that the legend applies to either plot.$

In this SEM, $β_{4}^{OLS}$ caries 2/3 of the confounding signal, $β_{2}^{OLS}$ only 1/3. Accepting 1/3 of remaining confounding signal, we receive an average intersection size of 3.885 for a sample size of 10³. For 10⁴, the average is 4. Thus, if we allow for false inclusion of X₂ we can almost perfectly retrieve all of U for sample size 10³ already.

What if X includes descendants of Y? So far, we have only considered the case where X causally affects Y, but potentially, some of Y’s parents are missing leading to a confounding effect. However, another possibility for $β^{OLS}$ to not denote a causal effect is that there are descendants of Y amongst the predictors. The two different situations are depicted in . The case with descendants in the set of predictors fits our theory from before if interpreted properly. If the model (4) for Y holds true using only the parents as predictors, Y can be naturally included in the assumed linear SEM for X in (15). Then, one can also think of $E$ as an unobserved confounder. The Markov boundary of $E$ with respect to the observed predictors is the same as the Markov boundary of Y. Of course, it holds $β = β^{*}$ , that is, $β_{j} = 0 \forall j \notin PA (Y)$ . Using Theorem 9, we find $\begin{matrix} β_{j}^{HOLS} = β_{j}^{OLS} = β_{j} = 0 \forall j \in M that are not in the \\ Markov boundary of Y . \end{matrix}$

Fig. 4 Left: SEM with a hidden confounder. Right: SEM with a descendant of Y.

Thus, for all variables outside Y’s Markov boundary, one can correctly detect that they have no causal effect onto Y ceteris paribus. The coefficients for the variables in the boundary, which includes all parents, are up to term cancelations all subject to confounding bias. This can be detected under some conditions, as discussed in Section 3.2.

4.3 Beyond Linearity

We have mainly focused on linear models, that is, the data is either generated by model (4) or model (8). Naturally, this assumption might be questionable in practice. Therefore, we provide some intuition about how HOLS might be applied in a more general setup. As we only detect misspecification of the OLS coefficient without identifying the type of misspecification, one should not try to over-interpret the effect of the regressors in ${\hat{U}}^{c} = {1, \dots, p} ∖ \hat{U}$ (see, Section 3.2). However, the linear effect of the variables in $\hat{U}$ can always be interpreted to be well-specified, meaning that $E [W_{j} | Z_{j}] = Z_{j} β_{j}^{OLS}$ or at least “sufficiently” well-specified such that no misspecification is detected in the data. Generally, we can write $\begin{matrix} Y = X^{⊤} β^{OLS} + f_{nonlinear} (X) + E, where \\ f_{nonlinear} (X) = E [Y | X] - X^{⊤} β^{OLS}, E [E | X] = 0. \\ W_{j} = Z_{j} β_{j}^{OLS} + f_{nonlinear} (X) + E . \end{matrix}$

Thus, well-specification of $β_{j}^{OLS}$ implies $E [f_{nonlinear} (X) | Z_{j}] = 0$ , that is, after linearly adjusting for $X_{- j}$ , X_j does not have any predictive power for $f_{nonlinear} (X)$ . If it does not have any predictive power after linear adjustment, it would be a natural conclusion that it does not have predictive power after optimally adjusting for $X_{- j}$ implying that $f_{nonlinear} (X)$ can be written as function of $X_{- j}$ only. This would then imply $\begin{matrix} E [Y | X_{j} = x_{j} + 1, X_{- j} = x_{- j}] - E [Y | X_{j} = x_{j}, X_{- j} = x_{- j}] \\ = β_{j}^{OLS} \forall x_{j}, x_{- j} . \end{matrix}$

Except for Gaussian data, such a linear relationship must be either causal or due to very pathological data setups. Excluding such unusual cancelations, the conclusion is that for $j \in U$ there must be a true linear causal effect from X_j to Y keeping the other predictors fixed, which can be consistently estimated using OLS. Of course, if there are no locally linear structures, it might well be that $U = \emptyset$ such that the local tests are not more informative than the global test. However, there is also nothing to be lost by exploiting this local view.

Note that the asymptotic results presented in Sections 3.1 and 3.2 hold for nonlinear data as well since they only assume model (10)–(11), which is the most general formulation.

5 Real Data Example

We analyze the flow cytometry dataset presented by Sachs et al. (Citation2005). It contains cytometry measurements of 11 phosphorylated proteins and phospholipids. There is a “ground truth” on how these quantities affect each other, the so-called consensus network (Sachs et al. Citation2005). Data is available from various experimental conditions, some of which are interventional environments. The dataset has been further analyzed in various projects, see, for example, Mooij and Heskes (Citation2013), Meinshausen et al. (Citation2016), and Taeb et al. (Citation2021). Following these works, we consider data from eight different environments, seven of which are interventional. The sample size per environment ranges from 707 to 913.

In our analysis, we focus on the consensus network from Sachs et al. (Citation2005). For each node, we go through all environments, fit a linear model using all its claimed parents as predictors, and assess the goodness of fit of the model using our HOLS check. In the consensus network, there is one bidirected edge between the variables PIP2 and PIP3. We include it as a parent for either direction. For each suggested edge, we also collect the p-values from the linear model fit in all environments, keeping only those where the edge passes the local HOLS check at level $α = 5 %$ without multiplicity correction. We omit the multiplicity correction here to lower the tendency to falsely claim causal detection. In , we report the minimum p-value from OLS, over the environments where the HOLS check is passed, sorted by increasing p-values. Additionally, we show the number of environments in which the check is passed and out of these the number where the edge is significant at level $α = 5 %$ in the respective linear model fit (with Bonferroni correction over all 8 environments and 17 edges, that is, we require a p-value of at most $0.05 / 136$ ). Note that there is one p-value 0 reported corresponding to a t-value of 174, which exceeds the precision that can be obtained with the standard R-function lm.

Table 1 The working model is taken from the consensus network.

Download CSV Display Table

We see that the edge RAF → MEK is the most significant. Further, every edge of the consensus network passes the HOLS check in at least one environment. Frequently, we see that edges pass the HOLS check in certain environments without being significant in the linear model. Considering our discussion around linear SEMs, this could easily happen if the alleged predictor node is not actually in the Markov boundary of the response. In fact, there are seven edges that pass the HOLS check in every environment which are not significant based on the linear model fits. This is in agreement with Taeb et al. (Citation2021), where none of them is reported.

As we cannot guarantee that the data follows a linear SEM as in EquationEquation (15)(15) $X_{j} \leftarrow Ψ_{j} + \sum_{k \in PA (j)} θ_{j, k} X_{k} j = 1, \dots, p,$ (15) , we shall not interpret the edges that do not pass the HOLS check to be subject to hidden confounding. However, the fact that we still find a decent number of suggested edges that pass the HOLS check, at least in some environments, leads to evidence that the assumption of some local unconfounded linear structures is not unrealistic, see also the discussion in Section 4.3.

We can also analyze our results in the light of invariant causal prediction, see, for example, Peters, Bühlmann, and Meinshausen (Citation2016), where one typically assumes that interventions do not change the underlying graph except for edges that point toward the node that is intervened on. This assumption is highly questionable in practice, and our findings, which vary a lot over different environments, indicate that the assumption is likely not fulfilled in the given setup.

6 Discussion

We have introduced the so-called HOLS check to assess the goodness of fit of linear causal models. It is based on the dependence between residuals and predictors in misspecified models, leading to nonvanishing higher moments. Besides checking whether the overall model might hold true, the method allows to detect a set of variable for which linear regression consistently estimates a true (unconfounded) causal effect for certain model classes.

We extend the HOLS method to high-dimensional datasets based on the idea of the debiased Lasso (Zhang and Zhang Citation2014; van de Geer et al. Citation2014). This extension comes very naturally as our HOLS check involves nodewise regression just as the debiased Lasso.

Of particular interest are linear structural equation models, for which our method allows for very precise characterizations regarding which least squares parameters are causal effects. The result requires some non-Gaussianity. We complement our theory with a simulation study as well as a real data example.

A drawback of our method is that it does not distinguish whether a model is misspecified due to confounding or due to nonlinearities in the model. Therefore, an interesting follow-up direction would be to extend our methodology and theory from linear to nonlinear SEM using more flexible regression methods. This could allow to detect local causal structures in nonlinear settings as well.

Supplemental material

Supplemental Material

Download PDF (124.6 KB)

Supplemental Material

Download PDF (1.2 MB)

Supplementary Materials

Simulation results as well as proofs and extended theory can be found in the supplemental material.

Data Availability Statement

Code scripts to reproduce the results presented in this article are available here https://github.com/cschultheiss/HOLS.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

The project leading to this application has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 786461). M. Yuan was supported in part by NSF Grants DMS-2015285 and DMS-2052955. Part of the work was done while M. Yuan was visiting the Institute for Theoretical Studies at ETH Zürich, Switzerland, and he wishes to thank the institute for their support and hospitality.

References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of Causal Effects using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–455. DOI: 10.1080/01621459.1996.10476902.
Web of Science ®Google Scholar
Belsley, D. A., Kuh, E., and Welsch, R. E. (2005), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Hoboken, NJ: Wiley.
Google Scholar
Bowden, R. J., and Turkington, D. A. (1990), Instrumental Variables, Cambridge: Cambridge University Press.
Google Scholar
Bühlmann, P. (2013), “Statistical Significance in High-Dimensional Linear Models,” Bernoulli, 19, 1212–1242. DOI: 10.3150/12-BEJSP11.
Web of Science ®Google Scholar
Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L. (2019a), “Models as Approximations I: Consequences Illustrated with Linear Regression,” Statistical Science, 34, 523–544. DOI: 10.1214/18-STS693.
Web of Science ®Google Scholar
Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. (2019b), “Models as Approximations II: A Model-Free Theory of Parametric Regression,” Statistical Science, 34, 545–565. DOI: 10.1214/18-STS694.
Web of Science ®Google Scholar
Ćevid, D., Bühlmann, P., and Meinshausen, N. (2020), “Spectral Deconfounding via Perturbed Sparse Linear Models,” Journal of Machine Learning Research, 21, 9442–9482.
Web of Science ®Google Scholar
Godfrey, L. G. (1991), Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches, Cambridge: Cambridge University Press.
Google Scholar
Greene, W. H. (2003), Econometric Analysis, Noida: Pearson Education India.
Google Scholar
Guo, Z., Ćevid, D., and Bühlmann, P. (2022), “Doubly Debiased Lasso: High-Dimensional Inference Under Hidden Confounding,” Annals of Statistics, 50, 1320–1347. DOI: 10.1214/21-aos2152.
PubMed Web of Science ®Google Scholar
Hausman, J. A. (1978), “Specification Tests in Econometrics,” Econometrica: Journal of the Econometric Society, 46, 1251–1271. DOI: 10.2307/1913827.
Web of Science ®Google Scholar
Hoyer, P. O., Hyvärinen, A., Scheines, R., Spirtes, P., Ramsey, J., Lacerda, G., and Shimizu, S. (2008), “Causal Discovery of Linear Acyclic Models with Arbitrary Distributions,” in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pp. 282–289.
Google Scholar
Hyvärinen, A., and Oja, E. (2000), “Independent Component Analysis: Algorithms and Applications,” Neural Networks, 13, 411–430. DOI: 10.1016/s0893-6080(00)00026-5.
PubMed Web of Science ®Google Scholar
Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, New York: Cambridge University Press.
Google Scholar
Lehmann, E. L., Romano, J. P., and Casella, G. (2005), Testing Statistical Hypotheses (Vol. 3), Cham: Springer.
Google Scholar
Maddala, G., and Lahiri, K. (2009), Introduction to Econometrics (4th ed), Chichester: Wiley.
Google Scholar
Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., and Bühlmann, P. (2016), “Methods for Causal Inference from Gene Perturbation Experiments and Validation,” Proceedings of the National Academy of Sciences, 113, 7361–7368. DOI: 10.1073/pnas.1510493113.
PubMed Web of Science ®Google Scholar
Mooij, J. M., and Heskes, T. (2013), “Cyclic Causal Discovery from Continuous Equilibrium Data,” in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 431–439.
Google Scholar
Peters, J., Bühlmann, P., and Meinshausen, N. (2016), “Causal Inference by using Invariant Prediction: Identification and Confidence Intervals,” Journal of the Royal Statistical Society, Series B, 78, 947–1012. DOI: 10.1111/rssb.12167.
Google Scholar
Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. (2014), “Causal Discovery with Continuous Additive Noise Models,” Journal of Machine Learning Research, 15, 2009–2053.
Web of Science ®Google Scholar
Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005), “Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data,” Science, 308, 523–529. DOI: 10.1126/science.1105809.
PubMed Web of Science ®Google Scholar
Shah, R. D., and Bühlmann, P. (2018), “Goodness-of-Fit Tests for High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 80, 113–135. DOI: 10.1111/rssb.12234.
Google Scholar
Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. (2006), “A Linear Non-Gaussian Acyclic Model for Causal Discovery,” Journal of Machine Learning Research, 7, 2003–2030.
Web of Science ®Google Scholar
Taeb, A., Gamella, J. L., Heinze-Deml, C., and Bühlmann, P. (2021), “Perturbations and Causality in Gaussian Latent Variable Models,” arXiv preprint arXiv:2101.06950.
Google Scholar
van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models,” The Annals of Statistics, 42, 1166–1202. DOI: 10.1214/14-AOS1221.
Web of Science ®Google Scholar
Wang, Y. S., and Drton, M. (2020), “High-Dimensional Causal Discovery under Non-Gaussianity,” Biometrika, 107, 41–59. DOI: 10.1093/biomet/asz055.
Web of Science ®Google Scholar
Zhang, C.-H., and Zhang, S. S. (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242. DOI: 10.1111/rssb.12026.
Web of Science ®Google Scholar

Higher-Order Least Squares: Assessing Partial Goodness of Fit of Linear Causal Models

Abstract