Theory and Methods

Higher-Order Least Squares: Assessing Partial Goodness of Fit of Linear Causal Models

Pages 1019-1031 | Received 24 Nov 2021, Accepted 29 Nov 2022, Published online: 24 Feb 2023

Abstract

We introduce a simple diagnostic test for assessing the overall or partial goodness of fit of a linear causal model with errors being independent of the covariates. In particular, we consider situations where hidden confounding is potentially present. We develop a method and discuss its capability to distinguish between covariates that are confounded with the response by latent variables and those that are not. Thus, we provide a test and methodology for partial goodness of fit. The test is based on comparing a novel higher-order least squares principle with ordinary least squares. In spite of its simplicity, the proposed method is extremely general and is also proven to be valid for high-dimensional settings. Supplementary materials for this article are available online.

1 Introduction

Linear models are the most commonly used statistical tools to study the relationship between a response and a set of covariates. The regression coefficient corresponding to a particular covariate is usually interpreted as its net effect on the response variable when all else is held fixed. Such an interpretation is essential in many applications and yet could be rather misleading when the linear model assumptions are in question, in particular, when there are hidden confounders.

In this work, we develop a simple but powerful approach to goodness of fit tests for potentially high-dimensional linear causal models, including also tests for partial goodness of fit of single predictor variables. While hidden confounding is the primary alternative in mind, different nonlinear deviations from the linear model assumption are also in scope. Tests for goodness of fit are essential to statistical modeling (e.g., Lehmann, Romano, and Casella 2005), and the concept is also very popular in econometrics, where it is referred to as specification testing. For an overview of such methods, see, for example, Godfrey (1991) or Maddala and Lahiri (2009).

Another set of related works is Buja et al. (2019a, 2019b), which discusses in detail deviations from the (linear) model and how distributional robustness, that is, robustness against shifts in the covariates’ distribution, links to correctly specified models. For this, they introduce the definition of “well-specified” statistical functionals. Distributional robustness, implied by well-specification, is also related to the causal interpretation of a linear model as discussed in Peters, Bühlmann, and Meinshausen (2016).

We consider here the question of when and which causal effects can be inferred from the ordinary least squares estimator, or from a debiased Lasso procedure in the high-dimensional setting, even when there is hidden confounding. We address this by partial goodness of fit testing: if the data speaks against a linear causal model, we are able to specify which components of the least squares estimator should be rejected as linear causal effects and which should not. In the case of a joint Gaussian distribution, one cannot detect anything: this corresponds to a well-known unidentifiability result in causality (Hyvärinen and Oja 2000; Peters et al. 2014). But, in certain models, we are able to identify some causal relations. Of particular importance are non-Gaussian linear structural equation models, as used in Shimizu et al. (2006) or Wang and Drton (2020) among others. The latter constructs the causal graph from observational data in a stepwise procedure using a test statistic similar to the one we suggest.

Our strategy has a very different focus than other approaches which no longer rely on the least squares principle to deal with the issue of hidden confounding. Most prominent, particularly in econometrics, is the framework of instrumental variables regression: assuming valid instruments, one can identify all causal effects; see, for example, Angrist, Imbens, and Rubin (1996) or the books by Bowden and Turkington (1990) and Imbens and Rubin (2015). The popular Durbin-Wu-Hausman test (Hausman 1978) for validity of instruments bears a relation to our methodology in that we also look at the difference of two estimators to test goodness of fit.

Our automated partial goodness of fit methodology is easy to use as it is based on ordinary least squares (or high-dimensional adaptations thereof) and its novel higher-order version: we believe that this simplicity is attractive for statistical practice.

1.1 Our Contribution

We propose a novel method with a corresponding test, called higher-order least squares (HOLS). The test statistic is based on the residuals from an ordinary least squares or Lasso fit. In that regard, it is related to Shah and Bühlmann (2018) who use “residual prediction” to test for deviation from the linear model. However, our approach neither assumes Gaussian errors nor relies on sample splitting, and our novel test statistic has a $\sqrt{n}$ convergence rate (with $n$ denoting the sample size).

In addition to presenting a “global” goodness of fit test for the entire model, we also develop a local interpretation that allows us to detect which covariates give evidence of hidden confounding or nonlinear relations. Thus, we strongly increase the amount of extracted information compared to a global goodness-of-fit test. In particular, in the case of localized (partial) confounding in linear structural equation models, we are able to recover the unconfounded regression parameters for a subset of predictors. This is a setting where techniques assuming dense (essentially global) confounding, as in Ćevid, Bühlmann, and Meinshausen (2020) or Guo, Ćevid, and Bühlmann (2022), fail.

The work by Buja et al. (2019a, 2019b), especially the second paper, shows how to detect deviations in a linear model using reweighting of the data. Our HOLS technique can be seen as a special way of reweighting. In contrast to their work, we provide a simple test statistic that tests for well-specification without requiring any resampling. Furthermore, we provide guarantees for a local interpretation under suitable modeling assumptions, whereas their per-covariate view remains rather exploratory.

1.2 Outline

The remainder of this article is structured as follows. We conclude this section with the necessary notation. In Section 2, we present the main idea of HOLS and the corresponding global null hypothesis. For illustrative purposes, we first discuss univariate regression. Then, we consider multivariate regression and extend our theory to high-dimensional problems incorporating the (debiased) Lasso. In Section 3, we present the local interpretation when the global null does not hold true, along with theoretical guarantees. Models for which this local interpretation is most suitable are discussed in Section 4. Section 5 contains a real data analysis. We conclude with a summarizing discussion in Section 6.

1.3 Notation

We present some notation that is used throughout this work. Vectors and matrices are written in boldface, while scalars have the usual lettering. This holds for both random and fixed quantities. We use upper case letters to denote a random variable, for example, $X$ or $Y$. We use lower case letters to denote iid copies of a random variable, for example, $x$. If $X \in \mathbb{R}^p$, then $x \in \mathbb{R}^{n \times p}$. With a slight abuse of notation, $x$ can either denote the copies or realizations thereof. We write $x_j$ to denote the $j$th column of the matrix $x$ and $x_{-j}$ to denote all but the $j$th column. We write $\overset{H_0}{=}$ to state that an equality holds under $H_0$. With $\leftarrow$, we emphasize that an equality between random variables is induced by a causal mechanism. We use $\circ$ to denote elementwise multiplication of two vectors, for example, $x \circ y$. Similarly, powers of vectors are also to be understood in an elementwise fashion, for example, $x^2 = x \circ x$. $I_n$ is the $n$-dimensional identity matrix. $P_j$ denotes the orthogonal projection onto the column space of $x_{-j}$ and $P_j^{\perp} = I_n - P_j$ denotes the orthogonal projection onto its orthogonal complement. For some random vector $X$, we have the moment matrix $\Sigma^X := E[XX^\top]$. Note that this equals the covariance matrix for centered $X$. We denote statistical independence by $\perp\!\!\!\perp$. We write $e$ to denote a vector for which every entry is 1 and $e_j$ to denote the unit vector in the direction of the $j$th coordinate axis.

2 Higher-Order Least Squares (HOLS)

We develop here the main idea of HOLS estimation.

2.1 Univariate Regression as a Motivating Case

It is instructive to begin with the case of simple linear regression where we have a pair of random variables $X$ and $Y$. We consider the causal linear model
$$Y \leftarrow X\beta + E, \quad \text{where } X \perp\!\!\!\perp E,\ E[E] = 0 \text{ and } E[E^2] = \sigma^2 < \infty. \tag{1}$$

We formulate a null hypothesis that the model in (1) is correct and we denote such a hypothesis by $H_0$. This model is of interest as $\beta$ describes the effect of a unit change if we were to intervene on the covariate $X$ without intervening on the independent $E$. Such a model, or its multivariate extension, is often assumed in causal discovery; see, for example, Shimizu et al. (2006) or Hoyer et al. (2008). Therefore, we aim to provide a test for its well-specification.

Estimation of the regression parameter is typically done by the least squares principle
$$\beta^{\mathrm{OLS}} := \operatorname*{arg\,min}_{b \in \mathbb{R}} E[(Y - Xb)^2] = \frac{E[XY]}{E[X^2]} \overset{H_0}{=} \beta,$$
where we use the superscript OLS to denote ordinary least squares. Alternatively, we can pre-multiply the linear model (1) with $X$: the parameter minimizing the expected squared error loss is then
$$\beta^{\mathrm{HOLS}} := \operatorname*{arg\,min}_{b \in \mathbb{R}} E[(XY - X^2 b)^2] = \frac{E[X^3 Y]}{E[X^4]} \overset{H_0}{=} \frac{E[X^4 \beta]}{E[X^4]} = \beta.$$

More generally, $\beta^{\mathrm{HOLS}} = \beta^{\mathrm{OLS}} = \beta$ if $E[Y|X] = X\beta$. Using the definition from Buja et al. (2019b), this means that the OLS parameter is well-specified. The estimation principle is called higher-order least squares, or HOLS for short, as it involves higher-order moments of $X$. One could also multiply the linear model with a factor other than $X$, which may have implications on the power to detect deviations from (1). We shall focus here on this specific choice to fix ideas.

The motivation to look at HOLS is when $H_0$ is violated in terms of a hidden confounding variable: let $H$ be a hidden confounder leading to a model
$$X \leftarrow E_X + H\rho, \qquad Y \leftarrow X\beta + H\alpha + E,$$
where $E_X$, $H$, and $E$ are all independent and $\alpha$ and $\rho$ define additional model parameters. In particular, we can compute under such a confounding model that
$$\beta^{\mathrm{HOLS}} - \beta^{\mathrm{OLS}} = \rho\alpha\left(\frac{3E[E_X^2]E[H^2] + \rho^2 E[H^4]}{E[E_X^4] + 6\rho^2 E[E_X^2]E[H^2] + \rho^4 E[H^4]} - \frac{E[H^2]}{E[E_X^2] + \rho^2 E[H^2]}\right). \tag{2}$$

For simplicity, we assumed here $E[E_X] = E[H] = E[E] = 0$. In practice, one can get rid of this assumption by including an intercept in the model. If either $\alpha$ or $\rho$ equals 0, we see that the difference in (2) is 0. This is not surprising as there is no confounding effect when either $X$ or $Y$ is unaffected. However, this is not the only way in which the difference can be 0. Namely,
$$E[H^2]\left(E[E_X^4] - 3E[E_X^2]^2\right) = \rho^2 E[E_X^2]\left(E[H^4] - 3E[H^2]^2\right) \;\Longrightarrow\; \beta^{\mathrm{HOLS}} - \beta^{\mathrm{OLS}} = 0.$$

In particular, if neither $E_X$ nor $H$ has excess kurtosis, the difference is 0 for any $\rho$. This can be intuitively explained as it corresponds to Gaussian data (up to the moments we consider). For Gaussian $E_X$ and $H$, one can always write
$$Y = X\beta^{\mathrm{OLS}} + \tilde{E} \quad\text{where}\quad X \perp\!\!\!\perp \tilde{E},$$
which cannot be distinguished from the null model (1). In other words, $E[Y|X] = X\beta^{\mathrm{OLS}}$, that is, the OLS parameter is well-specified although it is not the parameter $\beta$ that we would like to recover. For other data-generating distributions, one should be able to distinguish $H_0$ from certain deviations when hidden confounding is present. We discuss the implications of this in the general multivariate setting in Section 3.2. Similar behavior occurs for a violation of $H_0$ in terms of a nonlinear model $Y = f(X, \epsilon)$, which then (typically) leads to $\beta^{\mathrm{HOLS}} - \beta^{\mathrm{OLS}} \neq 0$.

One can construct a test based on the sample estimates of $\beta^{\mathrm{HOLS}}$ and $\beta^{\mathrm{OLS}}$. We consider the centered data
$$\tilde{x} = x - \bar{x}e, \quad \tilde{y} = y - \bar{y}e \quad\text{and}\quad \tilde{\epsilon} = \epsilon - \bar{\epsilon}e = \left(I_n - \tfrac{1}{n}ee^\top\right)\epsilon,$$

where we use the upper bar to denote sample means. We can derive
$$\tilde{y} = y - \bar{y}e = x\beta - \bar{x}e\beta + \epsilon - \bar{\epsilon}e = \tilde{x}\beta + \tilde{\epsilon}.$$

We now obtain $\hat{\beta}^{\mathrm{OLS}}$ from regression through the origin of $\tilde{y}$ versus $\tilde{x}$ with an error term of $\tilde{\epsilon}$, and $\hat{\beta}^{\mathrm{HOLS}}$ from regression through the origin of $\tilde{x} \circ \tilde{y}$ versus $\tilde{x}^2$ with an error term of $\tilde{x} \circ \tilde{\epsilon}$. More precisely, we define
$$\hat{\beta}^{\mathrm{OLS}} := \frac{\tilde{x}^\top\tilde{y}}{\tilde{x}^\top\tilde{x}} \quad\text{and}\quad \hat{\beta}^{\mathrm{HOLS}} := \frac{(\tilde{x}^2)^\top(\tilde{x}\circ\tilde{y})}{(\tilde{x}^2)^\top(\tilde{x}^2)} = \frac{(\tilde{x}^3)^\top\tilde{y}}{(\tilde{x}^2)^\top(\tilde{x}^2)}.$$

Under $H_0$, one can see that $(\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}})$ given $x$ is some known linear combination of $\epsilon$. Assuming further Gaussianity of $\epsilon$, it is conditionally Gaussian. We find
$$(\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}})\,\big|\,x \;\overset{H_0}{\sim}\; \mathcal{N}\!\left(0,\ \sigma^2\left(\frac{(\tilde{x}^3)^\top\left(I_n - \tfrac{1}{n}ee^\top\right)(\tilde{x}^3)}{\left((\tilde{x}^2)^\top(\tilde{x}^2)\right)^2} - \frac{1}{\tilde{x}^\top\tilde{x}}\right)\right). \tag{3}$$

We can calculate this variance except for $\sigma^2$. Further, we can consistently estimate $\sigma^2$, for example, with the standard formula
$$\hat{\sigma}^2 = \frac{\|\tilde{y} - \tilde{x}\hat{\beta}^{\mathrm{OLS}}\|_2^2}{n-2}.$$

Thus, we receive asymptotically valid z-tests for the null hypothesis $H_0$ that the model (1) holds. We treat the extension to non-Gaussian $\epsilon$ in Section 2.2 (for the multivariate case directly). As discussed above, in the presence of confounding, we can have that $\beta^{\mathrm{HOLS}} \neq \beta^{\mathrm{OLS}}$. In such situations, a test assuming (3) will have asymptotic power equal to 1 for correctly rejecting $H_0$ under some conditions. These asymptotic results are discussed in Sections 3.1 and 3.2.
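To fix ideas, the following R lines sketch this univariate check on simulated data; the confounded data-generating mechanism (uniform $E_X$ and $H$, $\rho = 0.8$, $\alpha = \beta = 1$) is our own toy choice for illustration and is not taken from the paper.

set.seed(1)
n  <- 10^5
h  <- runif(n, -1, 1)                 # hidden confounder; uniform, hence nonzero excess kurtosis
ex <- runif(n, -1, 1)
x  <- ex + 0.8 * h                    # X <- E_X + H * rho with rho = 0.8
y  <- x + h + rnorm(n)                # Y <- X * beta + H * alpha + E with beta = alpha = 1
xt <- x - mean(x); yt <- y - mean(y)  # centered data
b.ols  <- sum(xt * yt) / sum(xt^2)
b.hols <- sum(xt^3 * yt) / sum(xt^4)
sigma2.hat <- sum((yt - xt * b.ols)^2) / (n - 2)
u <- xt^3                             # conditional variance of the difference as in (3)
var.diff <- sigma2.hat * (sum((u - mean(u))^2) / sum(xt^4)^2 - 1 / sum(xt^2))
z <- (b.hols - b.ols) / sqrt(var.diff)
2 * pnorm(-abs(z))                    # typically a tiny p-value here: H_0 of model (1) is (correctly) rejected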

2.2 Multivariate Regression

We typically want to examine the goodness of fit of a linear model with p > 1 covariates. We consider p to be fixed in this section and discuss the case where p is allowed to diverge with n in Section 2.3.

We consider the causal model
$$Y \leftarrow X\beta + E, \quad\text{where } X \perp\!\!\!\perp E,\ E[E] = 0 \text{ and } E[E^2] = \sigma^2 < \infty, \tag{4}$$
with $X \in \mathbb{R}^p$ and $\beta \in \mathbb{R}^p$. Note that $E[E] = 0$ can always be enforced by including an intercept in the set of predictors. We assume the corresponding moment matrix $\Sigma^X$ to be invertible. Then, the principal submatrices $\Sigma^X_{-j,-j} := E[X_{-j}X_{-j}^\top]$ are also invertible. We formulate a global null hypothesis that the model in (4) is correct and we denote it by $H_0$. To make use of the test described for the univariate case, we consider every component $j \in \{1,\dots,p\}$ separately and work with partial regression; see, for example, Belsley, Kuh, and Welsch (2005). For the population version, we define
$$Z_j := X_j - X_{-j}\gamma_j, \quad\text{where } \gamma_j := \operatorname*{arg\,min}_{b \in \mathbb{R}^{p-1}} E[(X_j - X_{-j}b)^2] = (\Sigma^X_{-j,-j})^{-1}E[X_{-j}X_j],$$
$$W_j := Y - X_{-j}\zeta_j, \quad\text{where } \zeta_j := \operatorname*{arg\,min}_{b \in \mathbb{R}^{p-1}} E[(Y - X_{-j}b)^2] = (\Sigma^X_{-j,-j})^{-1}E[X_{-j}Y]. \tag{5}$$

Under $H_0$, it holds that $W_j = Z_j\beta_j + E$.

For $\beta^{\mathrm{OLS}} := (\Sigma^X)^{-1}E[XY]$, we find
$$\beta_j^{\mathrm{OLS}} = \frac{E[Z_j W_j]}{E[Z_j^2]} \overset{H_0}{=} \beta_j.$$

The first equality is a well-known application of the Frisch-Waugh theorem; see, for example, Greene (2003). We define the corresponding HOLS parameter by partial regression for every component $j$ separately, namely
$$\beta_j^{\mathrm{HOLS}} := \frac{E[Z_j^3 W_j]}{E[Z_j^4]} \overset{H_0}{=} \beta_j.$$

We define a local, that is, per-covariate null hypothesis $H_{0,j}: \beta_j^{\mathrm{OLS}} = \beta_j^{\mathrm{HOLS}}$. The difference $\beta_j^{\mathrm{OLS}} - \beta_j^{\mathrm{HOLS}}$ can detect certain local alternatives from the null hypothesis $H_0$. Here, local refers to the covariate $X_j$ whose effect on $Y$ is potentially confounded or involves a nonlinearity. Under model (4), $H_{0,j}$ holds true for every $j$. We discuss in Sections 3 and 4 some concrete examples, where it is insightful to consider tests for individual $H_{0,j}$.

We turn to sample estimates of the parameters. The residuals are estimated by
$$\hat{z}_j = x_j - P_j x_j = P_j^{\perp} x_j \quad\text{and}\quad \hat{w}_j = y - P_j y = P_j^{\perp} y \overset{H_0}{=} P_j^{\perp}(x\beta + \epsilon) = \hat{z}_j\beta_j + P_j^{\perp}\epsilon.$$

With ordinary least squares, we receive $\hat{\beta}_j^{\mathrm{OLS}}$ from the regression of $\hat{w}_j$ versus $\hat{z}_j$, where the error term is $P_j^{\perp}\epsilon$. Accordingly, we calculate $\hat{\beta}_j^{\mathrm{HOLS}}$ from the regression of $\hat{z}_j \circ \hat{w}_j$ versus $\hat{z}_j^2$ with an error term $\hat{z}_j \circ P_j^{\perp}\epsilon$. Thus, we define
$$\hat{\beta}_j^{\mathrm{OLS}} := \frac{\hat{z}_j^\top\hat{w}_j}{\hat{z}_j^\top\hat{z}_j} \quad\text{and}\quad \hat{\beta}_j^{\mathrm{HOLS}} := \frac{(\hat{z}_j^2)^\top(\hat{z}_j \circ \hat{w}_j)}{(\hat{z}_j^2)^\top(\hat{z}_j^2)} = \frac{(\hat{z}_j^3)^\top\hat{w}_j}{(\hat{z}_j^3)^\top\hat{z}_j}.$$

This is analogous to the univariate case, where we have $\tilde{y}$ instead of $\hat{w}_j$, $\tilde{x}$ instead of $\hat{z}_j$, and $(I_n - \tfrac{1}{n}ee^\top)$ instead of $P_j^{\perp}$, and $(I_n - \tfrac{1}{n}ee^\top)$ can be thought of as the orthogonal projection onto $e$'s complement, which completes the analogy. Again, we see that given $x$, $(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})$ is some known linear combination of $\epsilon$; thus, it is conditionally Gaussian for Gaussian $\epsilon$. The same holds for the vector $(\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}})$.

Naturally, Gaussian E is an overly strong assumption. Therefore, we consider additional assumptions such that the central limit theorem can be invoked.

  • (A1) The moment matrix $\Sigma^X$ has positive smallest eigenvalue.

  • (A2) $E[X_j^6] < \infty$ and $E[Z_j^6] < \infty\ \forall j$.

Further, let
$$\tilde{Z}_j^3 := Z_j^3 - X_{-j}\tilde{\gamma}_j, \quad\text{where } \tilde{\gamma}_j := \operatorname*{arg\,min}_{b \in \mathbb{R}^{p-1}} E[(Z_j^3 - X_{-j}b)^2] = (\Sigma^X_{-j,-j})^{-1}E[X_{-j}Z_j^3]. \tag{6}$$

Note that $E[(\tilde{Z}_j^3)^2] \le E[Z_j^6] < \infty$.

Theorem 1.

Assume that the data follows the model (4) and that (A1)–(A2) hold. Let $p$ be fixed and $n \to \infty$. Then,
$$\sqrt{n}\,(\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}}) \overset{D}{\to} \mathcal{N}\!\left(0,\ \sigma^2 E[VV^\top]\right), \qquad \frac{1}{n}\hat{v}^\top\hat{v} \overset{P}{\to} E[VV^\top], \quad\text{where}$$
$$\hat{v}_j = \frac{P_j^{\perp}(\hat{z}_j^3)}{\frac{1}{n}(\hat{z}_j^2)^\top(\hat{z}_j^2)} - \frac{\hat{z}_j}{\frac{1}{n}\hat{z}_j^\top\hat{z}_j} \quad\text{and}\quad V_j = \frac{\tilde{Z}_j^3}{E[Z_j^4]} - \frac{Z_j}{E[Z_j^2]}.$$

Note that
$$(\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}}) \overset{H_0}{=} \frac{1}{n}\hat{v}^\top\epsilon,$$
and, in analogy to (3),
$$\frac{1}{n^2}\hat{v}_j^\top\hat{v}_j = \frac{(\hat{z}_j^3)^\top P_j^{\perp}(\hat{z}_j^3)}{\left((\hat{z}_j^2)^\top(\hat{z}_j^2)\right)^2} - \frac{1}{\hat{z}_j^\top\hat{z}_j}.$$

Following Theorem 1, we can test the null hypothesis $H_0$ with a consistent estimate for $\sigma^2$. Such an estimate can be obtained, for example, using the standard formula
$$\hat{\sigma}^2 = \frac{\|y - x\hat{\beta}^{\mathrm{OLS}}\|_2^2}{n-p}.$$

We define for later reference
$$\widehat{\mathrm{var}}(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}) := \hat{\sigma}^2\,\frac{1}{n^2}\hat{v}_j^\top\hat{v}_j. \tag{7}$$

To test $H_{0,j}$, we can compare $(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})$ to the quantiles of the univariate normal distribution with the corresponding variance. The joint distribution leads to a global test that controls the Type I error. Namely, one can look at the maximum test statistic $T = \max_k |\hat{\beta}_k^{\mathrm{HOLS}} - \hat{\beta}_k^{\mathrm{OLS}}|$, whose null distribution is approximated by that of $\max_k |S_k|$, where $S \sim \mathcal{N}(0, \hat{\sigma}^2\hat{v}^\top\hat{v}/n^2)$ can be easily simulated. Further, one receives multiplicity corrected individual p-values for $H_{0,j}$ by comparing each $|\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}|$ to the distribution of $\max_k |S_k|$. This is in analogy to the multiplicity correction suggested by Bühlmann (2013). Naturally, other multiplicity correction techniques such as Bonferroni-Holm are valid as well.

Algorithm 1 summarizes how to find both raw and multiplicity corrected p-values for each component $j$ corresponding to the $j$th covariate, $p_j$ and $P_j$, respectively. Then, one would reject the global null hypothesis $H_0$ that the model (4) holds if $\min_j P_j \le \alpha$, and such a decision procedure provides control of the Type I error at level $\alpha$. Note that this means that we have strong control of the FWER for testing all $H_{0,j}$.

Algorithm 1

HOLS check

1: for $j = 1$ to $p$ do
2: $P_j^{\perp} = I_n - x_{-j}(x_{-j}^\top x_{-j})^{-1}x_{-j}^\top$
3: Regress $x_j$ versus $x_{-j}$ via least squares, denote the residual by $\hat{z}_j = P_j^{\perp}x_j$
4: Regress $y$ versus $x_{-j}$ via least squares, denote the residual by $\hat{w}_j = P_j^{\perp}y$
5: $\hat{\beta}_j^{\mathrm{OLS}} = \frac{\hat{z}_j^\top\hat{w}_j}{\hat{z}_j^\top\hat{z}_j}$, $\hat{\beta}_j^{\mathrm{HOLS}} = \frac{(\hat{z}_j^3)^\top\hat{w}_j}{(\hat{z}_j^3)^\top\hat{z}_j}$ and $\hat{v}_j = \frac{P_j^{\perp}(\hat{z}_j^3)}{\frac{1}{n}(\hat{z}_j^2)^\top(\hat{z}_j^2)} - \frac{\hat{z}_j}{\frac{1}{n}\hat{z}_j^\top\hat{z}_j}$
6: $\hat{\sigma}^2 = \frac{\|y - x\hat{\beta}^{\mathrm{OLS}}\|_2^2}{n-p}$
7: Create $n_{\mathrm{sim}}$ i.i.d. copies of $S \sim \mathcal{N}(0, \hat{\sigma}^2\hat{v}^\top\hat{v}/n^2)$, say, $s^1$ to $s^{n_{\mathrm{sim}}}$
8: for $j = 1$ to $p$ do
9: $p_j = 2\left(1 - \Phi\!\left(\frac{|\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}|}{\hat{\sigma}\frac{1}{n}\|\hat{v}_j\|_2}\right)\right)$
10: $P_j = \frac{1}{n_{\mathrm{sim}}}\sum_{i=1}^{n_{\mathrm{sim}}}\mathbb{1}\left(\max_k |s_k^i| > |\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}|\right)$
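For illustration, the following R sketch implements the steps of Algorithm 1 using base R least squares (lm.fit) and the simulation-based multiplicity correction. It is our own minimal reading of the pseudocode, not the authors' implementation (which is available at https://github.com/cschultheiss/HOLS), and it assumes that an intercept column is appended by the user if needed.

hols_check <- function(x, y, nsim = 1000) {
  n <- nrow(x); p <- ncol(x)
  b.ols <- b.hols <- numeric(p)
  v <- matrix(0, n, p)
  for (j in 1:p) {
    xmj <- x[, -j, drop = FALSE]
    zj <- lm.fit(xmj, x[, j])$residuals        # hat z_j = P_j^perp x_j
    wj <- lm.fit(xmj, y)$residuals             # hat w_j = P_j^perp y
    zj3.res <- lm.fit(xmj, zj^3)$residuals     # P_j^perp (hat z_j^3)
    b.ols[j]  <- sum(zj * wj) / sum(zj^2)
    b.hols[j] <- sum(zj^3 * wj) / sum(zj^3 * zj)
    v[, j] <- zj3.res / mean(zj^4) - zj / mean(zj^2)
  }
  sigma2 <- sum(lm.fit(x, y)$residuals^2) / (n - p)
  sd.diff <- sqrt(sigma2) * sqrt(colSums(v^2)) / n   # estimated sd of each difference, as in (7)
  diff <- abs(b.hols - b.ols)
  p.raw <- 2 * pnorm(-diff / sd.diff)
  # multiplicity correction: simulate max_k |S_k| with S ~ N(0, sigma2 * v'v / n^2)
  max.sim <- replicate(nsim, max(abs(crossprod(v, rnorm(n))) * sqrt(sigma2) / n))
  p.adj <- sapply(diff, function(d) mean(max.sim > d))
  list(beta.OLS = b.ols, beta.HOLS = b.hols, p.raw = p.raw, p.adj = p.adj)
}

Under model (4), the raw p-values are approximately uniform; under confounding or nonlinearity, small adjusted p-values flag the affected coordinates.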

Corollary 1.

Assume the conditions of Theorem 1. Consider the decision rule to reject $H_0$ iff $\min_j P_j \le \alpha$, where $P_j$ is as in Step 10 of Algorithm 1. Then, the Type I error is asymptotically controlled at $\alpha$. Furthermore, the FWER is asymptotically controlled at level $\alpha$ for testing all local hypotheses $\{H_{0,j};\ j = 1,\dots,p\}$ with the decision rule to reject $H_{0,j}$ iff $P_j \le \alpha$.

We provide simulation results supporting this theory in Section A of the supplemental materials.

2.3 High-Dimensional Data

We now turn to high-dimensional situations. We assume the global null hypothesis (4) but allow for $p$ to diverge with and even exceed $n$ such that ordinary least squares regression is not applicable. Instead, we apply the debiased Lasso introduced in Zhang and Zhang (2014) and further discussed in van de Geer et al. (2014). We denote the estimator again by $\hat{\beta}^{\mathrm{OLS}}$ since it converges under certain conditions to the population parameter $\beta^{\mathrm{OLS}}$.

From the debiased Lasso, we receive $\hat{z}_j = x_j - x_{-j}\hat{\gamma}_j$, where $\hat{\gamma}_j$ is obtained by regressing $x_j$ versus $x_{-j}$ using the Lasso, and $\hat{w}_j = y - x_{-j}\hat{\beta}_{-j}$ with $\hat{\beta}$ coming from the Lasso fit of $y$ versus $x$. Since $\hat{\beta}_j^{\mathrm{OLS}} = \hat{z}_j^\top\hat{w}_j / \hat{z}_j^\top x_j$, one might want to use $(\hat{z}_j^3)^\top\hat{w}_j / (\hat{z}_j^3)^\top x_j$ for HOLS. However, this leads in general to an uncontrollable approximation error since $E[Z_j^3 X_{-j}] \neq 0$. As a remedy, we suggest a second level of orthogonalization based on $\tilde{Z}_j^3$ and $\tilde{\gamma}_j$ as defined in (6). Naturally, we have $\tilde{Z}_j^3 = Z_j^3$ iff $E[Z_j^3 X_{-j}] = 0$, and always $E[\tilde{Z}_j^3 X_{-j}] = 0$. To approximate $\tilde{z}_j^3$, we use the Lasso for the regression of $\hat{z}_j^3$ versus $x_{-j}$, leading to $\hat{\tilde{z}}_j^3 = \hat{z}_j^3 - x_{-j}\hat{\tilde{\gamma}}_j$. We define $\hat{\beta}^{\mathrm{HOLS}}$ as
$$\hat{\beta}_j^{\mathrm{HOLS}} := \frac{(\hat{\tilde{z}}_j^3)^\top\hat{w}_j}{(\hat{\tilde{z}}_j^3)^\top x_j} = \frac{(\hat{\tilde{z}}_j^3)^\top(y - x_{-j}\hat{\beta}_{-j})}{(\hat{\tilde{z}}_j^3)^\top x_j} \overset{H_0}{=} \beta_j + \frac{(\hat{\tilde{z}}_j^3)^\top x_{-j}(\beta_{-j} - \hat{\beta}_{-j})/n}{(\hat{\tilde{z}}_j^3)^\top x_j/n} + \frac{(\hat{\tilde{z}}_j^3)^\top\epsilon}{(\hat{\tilde{z}}_j^3)^\top x_j}.$$

Finally, we are interested in the difference between $\hat{\beta}_j^{\mathrm{HOLS}}$ and $\hat{\beta}_j^{\mathrm{OLS}}$. Under suitable assumptions for the sparsity, the moment matrix, and the tail behavior of $X$ and $E$, we can derive the limiting Gaussian distribution of this difference, allowing for asymptotically valid tests. We apply Algorithm 2, where we make use of the (asymptotic) normality of the nonvanishing term in this difference. For non-Gaussian $E$, a multiplicity correction method that does not rely on exact Gaussianity of this remainder might be preferred since the CLT does not apply for dimensions growing too fast.

Algorithm 2

HOLS check for p > n

1: Regress $y$ versus $x$ via the Lasso with a penalty parameter $\lambda$, denote the estimated regression coefficients by $\hat{\beta}$
2: for $j = 1$ to $p$ do
3: Regress $x_j$ versus $x_{-j}$ via the Lasso with a penalty parameter $\lambda_j$, denote the residual by $\hat{z}_j$
4: Regress $\hat{z}_j^3$ versus $x_{-j}$ via the Lasso with a penalty parameter $\tilde{\lambda}_j$, denote the residual by $\hat{\tilde{z}}_j^3$
5: $\hat{w}_j = y - x_{-j}\hat{\beta}_{-j}$
6: $\hat{\beta}_j^{\mathrm{OLS}} = \frac{\hat{z}_j^\top\hat{w}_j}{\hat{z}_j^\top x_j}$, $\hat{\beta}_j^{\mathrm{HOLS}} = \frac{(\hat{\tilde{z}}_j^3)^\top\hat{w}_j}{(\hat{\tilde{z}}_j^3)^\top x_j}$ and $\hat{v}_j = \frac{\hat{\tilde{z}}_j^3}{\frac{1}{n}(\hat{\tilde{z}}_j^3)^\top x_j} - \frac{\hat{z}_j}{\frac{1}{n}\hat{z}_j^\top x_j}$
7: $\hat{\sigma}^2 = \frac{\|y - x\hat{\beta}\|_2^2}{n - \|\hat{\beta}\|_0}$ (or any other reasonable variance estimator)
8: Create $n_{\mathrm{sim}}$ i.i.d. copies of $S \sim \mathcal{N}(0, \hat{\sigma}^2\frac{1}{n^2}\hat{v}^\top\hat{v})$, say, $s^1$ to $s^{n_{\mathrm{sim}}}$
9: for $j = 1$ to $p$ do
10: $p_j = 2\left(1 - \Phi\!\left(\frac{|\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}|}{\hat{\sigma}\frac{1}{n}\|\hat{v}_j\|_2}\right)\right)$
11: $P_j = \frac{1}{n_{\mathrm{sim}}}\sum_{i=1}^{n_{\mathrm{sim}}}\mathbb{1}\left(\max_k |s_k^i| > |\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}|\right)$
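A rough R sketch of Algorithm 2 for a single coordinate $j$ is given below, using cv.glmnet for all three Lasso fits. The cross-validated penalties are a pragmatic stand-in for the theoretical rates required by Theorem 2, so this illustrates the construction only and is not a faithful reproduction of the authors' procedure; centered data (or explicit intercept handling) is assumed.

library(glmnet)
hols_hd_j <- function(x, y, j) {
  n <- nrow(x)
  beta.hat <- as.numeric(coef(cv.glmnet(x, y), s = "lambda.min"))[-1]   # Lasso fit of y vs x (intercept dropped)
  zj  <- drop(x[, j] - predict(cv.glmnet(x[, -j], x[, j]), newx = x[, -j], s = "lambda.min"))
  ztj <- drop(zj^3 - predict(cv.glmnet(x[, -j], zj^3), newx = x[, -j], s = "lambda.min"))
  wj  <- drop(y - x[, -j] %*% beta.hat[-j])
  b.ols  <- sum(zj * wj) / sum(zj * x[, j])
  b.hols <- sum(ztj * wj) / sum(ztj * x[, j])
  vj <- ztj / mean(ztj * x[, j]) - zj / mean(zj * x[, j])
  sigma2 <- sum((y - x %*% beta.hat)^2) / (n - sum(beta.hat != 0))
  z.stat <- (b.hols - b.ols) / (sqrt(sigma2) * sqrt(sum(vj^2)) / n)
  2 * pnorm(-abs(z.stat))   # raw p-value for H_{0,j}; multiplicity correction as in Algorithm 1
}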

We provide here the main result to justify Algorithm 2, invoking additional assumptions on the dimensionality and sparsity of the problem. We use the definitions $s := \|\beta\|_0$, $s_j := \|\gamma_j\|_0$ and $\tilde{s}_j := \|\tilde{\gamma}_j\|_0$ to denote the different levels of sparsity.

  • (C1) The design matrix $x$ has iid sub-Gaussian rows. The moment matrix $\Sigma^X$ has strictly positive smallest eigenvalue $\Lambda_{\min}^2$ satisfying $1/\Lambda_{\min}^2 = O(1)$. Also, $\max_j \Sigma^X_{j,j} = O(1)$.

  • (C2) $s = O\!\left(\frac{n^{1/2}}{\log(p)^3}\right)$.

  • (C3) $s s_j^2 = O\!\left(\frac{n^{3/2}}{\log(p)^3}\right)$, $s s_j = O\!\left(\frac{n}{\log(p)^{5/2}}\right)$ and $s s_j^{1/2} = O\!\left(\frac{n^{1/2}}{\log(p)^{3/2}}\right)$.

  • (C4) $s_j = O\!\left(\frac{n^{3/5}}{\log(p)}\right)$.

  • (C5) $\sqrt{n}\, s\, \lambda\, \tilde{\lambda}_j = O(1)$.

  • (C6) $\tilde{s}_j \tilde{\lambda}_j^2 = O(1)$.

Theorem 2.

Assume that the data follows the model (4) with sub-Gaussian $E$ and that (C1)–(C6) hold $(\forall j)$. Let $\hat{\beta}$ come from Lasso regression with $\lambda \asymp \sqrt{\log(p)/n}$, $\hat{z}_j$ from nodewise Lasso regression using $\lambda_j \asymp \sqrt{\log(p)/n}$, and $\hat{\tilde{z}}_j^3$ from nodewise Lasso regression of $\hat{z}_j^3$ versus $x_{-j}$ using $\tilde{\lambda}_j \asymp \max\{\log(p)^{5/2}n^{-1/2},\ s_j^2\log(p)^{5/2}n^{-3/2},\ s_j\log(p)^2 n^{-1},\ s_j\log(p)n^{-1/2}\}$. Let $\hat{\sigma}$ be any consistent estimator for $\sigma$. Then,
$$\frac{\sqrt{n}\,(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})}{\sqrt{\hat{\sigma}^2\frac{1}{n}\hat{v}_j^\top\hat{v}_j}} \overset{D}{\to} \mathcal{N}(0,1) \quad\text{where}\quad \hat{v}_j = \frac{\hat{\tilde{z}}_j^3}{\frac{1}{n}(\hat{\tilde{z}}_j^3)^\top x_j} - \frac{\hat{z}_j}{\frac{1}{n}\hat{z}_j^\top x_j}.$$

We defer the technical details to Section C of the supplemental materials. Simulation results concerning high-dimensional data can be found in Section A of the supplemental materials.

3 The Confounded Case and Local Null Hypotheses

In this section and the following, we mainly exploit confounding in linear models as the alternative hypothesis since these are the models where our tests for the local null hypotheses H0,j are most informative. For a discussion of which interpretations might carry over to more general data generating distributions, we refer to Section 4.3.

Note that everything that is discussed in Sections 3 and 4 implicitly applies to high-dimensional data as well under suitable assumptions. We refrain from going into detail for the sake of brevity. Thus, Theorems 3–6, which contain our main asymptotic results for the local interpretation, are designed explicitly for the fixed $p$ case.

We look at the causal model
$$X \leftarrow \rho H + E_X, \qquad Y \leftarrow X\beta + H\alpha + E, \tag{8}$$
where $H \in \mathbb{R}^d$, $E_X \in \mathbb{R}^p$ and $E \in \mathbb{R}$ are independent and centered random variables, and $\alpha \in \mathbb{R}^d$ and $\rho \in \mathbb{R}^{p \times d}$ are fixed model parameters. Thus, there exists some hidden confounder $H$. For the inner product matrices, it holds that
$$\Sigma^X = \Sigma^{E_X} + \rho\Sigma^H\rho^\top.$$

Furthermore, we have
$$\beta^{\mathrm{OLS}} = (\Sigma^X)^{-1}E[XY] = (\Sigma^X)^{-1}(\Sigma^X\beta + \rho\Sigma^H\alpha) = \beta + (\Sigma^X)^{-1}\rho\Sigma^H\alpha. \tag{9}$$

We will generally refer to $\beta_j^{\mathrm{OLS}} - \beta_j$, where $\beta_j$ is according to model (8), as the confounding bias on $\beta_j^{\mathrm{OLS}}$. Further, when we say that a covariate is directly confounded, we mean indices $j$ for which $X_j \neq E_{X_j}$.

Note that we can always decompose $Y$ both globally and locally as follows,
$$Y = X\beta^{\mathrm{OLS}} + \tilde{E}, \quad\text{with } E[X\tilde{E}] = 0,\ E[\tilde{E}] = 0\ \text{ but (potentially) } X \not\perp\!\!\!\perp \tilde{E}, \tag{10}$$
$$W_j = Z_j\beta_j^{\mathrm{OLS}} + \tilde{E}, \quad\text{with } E[Z_j\tilde{E}] = 0,\ E[\tilde{E}] = 0\ \text{ but (potentially) } Z_j \not\perp\!\!\!\perp \tilde{E}, \tag{11}$$
using the definitions from (5). We now want to see how $\beta^{\mathrm{OLS}}$ relates to $\beta$ in certain models. Especially, we are interested in whether there is some potential local interpretation in the sense of distinguishing between “confounded” and “unconfounded” variables. From (9), we see that this is linked to the structure of the covariance matrices as well as $\rho$ and $\alpha$. We define the sets
$$V = \{j: \beta_j^{\mathrm{OLS}} = \beta_j\} \quad\text{and}\quad U = \{j: \beta_j^{\mathrm{OLS}} = \beta_j^{\mathrm{HOLS}}\} = \{j: H_{0,j} \text{ is true}\}. \tag{12}$$

Using the Woodbury matrix identity, we find a sufficient condition
$$j \in V \quad\text{if}\quad \rho^\top\big((\Sigma^{E_X})^{-1}\big)_j = 0, \quad\text{which is implied by}\quad \{k \in \{1,\dots,p\}: (\Sigma^{E_X})^{-1}_{jk} \neq 0\} \cap \{l \in \{1,\dots,p\}: \|e_l^\top\rho\| > 0\} = \emptyset. \tag{13}$$

Thus, if the intersection between covariates that have linear predictive power for $X_j$ and covariates that are directly confounded is empty, it must hold that $\beta_j^{\mathrm{OLS}} = \beta_j$. Therefore, we can indeed say that for these variables we estimate the true causal effect using ordinary least squares.

To correctly detect $V$, we would like $\beta_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{OLS}} = \beta_j$. As $\beta_j^{\mathrm{HOLS}}$ involves higher-order moments, knowledge of the covariance structure is not sufficient to check this. From (11), we see that $E[Z_j^3\tilde{E}] = 0$ is necessary and sufficient to ensure $j \in U$. In Section 3.2, we discuss the two cases where detection fails, that is, $U \setminus V$ and $V \setminus U$. We present models for which we can characterize a set of variables which are in $U \cap V = \{j: \beta_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{OLS}} = \beta_j\}$ in Section 4.

3.1 Sample Estimates

For a confounded model, the hope is that the global test $\min_j P_j \le \alpha$, where $P_j$ is the adjusted p-value according to Step 10 in Algorithm 1, leads to a rejection of $H_0$, that is, the modeling assumption (4). One could further examine the local structure and, based on the corrected p-values $P_j$, distinguish the predictors for which we have evidence that $\beta_j^{\mathrm{HOLS}} \neq \beta_j^{\mathrm{OLS}}$. We consider in the following this local interpretation, showing that we asymptotically control the Type I error and receive power approaching 1. Implicitly, we assume that $U$ is a useful proxy for $V$.

For all asymptotic results in this section, we assume $p$ to be fixed and $n \to \infty$ as in Theorem 1.

Theorem 3.

Assume that the data follows the model (10) and that (A1)–(A2) hold. Assume further $\sigma_{\tilde{E}}^2 = E[\tilde{E}^2] < \infty$. Then,
$$\hat{\beta}_j^{\mathrm{OLS}} = \beta_j^{\mathrm{OLS}} + o_P(1), \quad \hat{\beta}_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{HOLS}} + o_P(1) \quad\text{and}\quad \widehat{\mathrm{var}}(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}}) = \mathcal{O}_P\!\left(\tfrac{1}{n}\right),$$
where $\widehat{\mathrm{var}}(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})$ is according to (7).

Thus, for some fixed alternative $|\beta_j^{\mathrm{HOLS}} - \beta_j^{\mathrm{OLS}}| > 0$, the absolute z-statistic increases as $n \to \infty$.

In order to get some local interpretation, the behavior for variables $j \in U$ is of importance. If $|\beta_j^{\mathrm{HOLS}} - \beta_j^{\mathrm{OLS}}| = 0$, Theorem 3 is not sufficient to understand the asymptotic behavior. We refine the results using additional assumptions.

  • (A3) $E[(X_j\tilde{E})^2] < \infty\ \forall j$

  • (B1) $E[Z_j^2 X_k\tilde{E}] = 0\ \forall k \neq j$

  • (B2) $Z_j \perp\!\!\!\perp \tilde{E}$

  • (B3) $\tilde{Z}_j^3 \perp\!\!\!\perp \tilde{E}$

Note that we use different letters for the assumptions to distinguish between those that are essentially some (mild) moment conditions and those that truly make nodes unconfounded. Obviously, (B2) is not necessary for $\beta_j^{\mathrm{OLS}} = \beta_j^{\mathrm{HOLS}}$, but we will focus on these variables as they are the ones that are truly unconfounded in the sense that the projected single variable model (11) has an independent error term, whereas for other variables it can rather be considered an unwanted artifact of our method. Furthermore, the derived asymptotic variance results only hold true when assuming (B2) and (B3) as well. Assumption (A3) implies a further moment condition. Especially, when considering nonlinearities, there exist cases for which (A3) is not implied by (A2). We discuss Assumptions (B1)–(B3) for certain models in Section 4. Condition (B1) seems to be a bit artificial but is invoked in the proofs. We argue in Section 4 that it is naturally linked to the models in scope.

Theorem 4.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Let $j$ be some covariate with $\beta_j^{\mathrm{OLS}} = \beta_j^{\mathrm{HOLS}}$ for which (B1)–(B3) hold. Then,
$$\frac{\sqrt{n}\,(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})}{\sqrt{\widehat{\mathrm{var}}\big(\sqrt{n}\,(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})\big)}} \overset{D}{\to} \mathcal{N}(0,1).$$

Thus, for these predictors we receive asymptotically valid tests.

Multiplicity correction

In order not to falsely reject the local null hypothesis $H_{0,j}$ for any covariate with $j \in U$ (with probability at least $1-\alpha$), we need to invoke some multiplicity correction. Analogously to Section 2.2, one can see that $\hat{\beta}^{\mathrm{HOLS}} - \hat{\beta}^{\mathrm{OLS}} = \hat{v}^\top\tilde{\epsilon}/n$, which enables the multiplicity correction as in Algorithm 1.

Theorem 5.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Let $U'$ be the set of variables $j$ for which $j \in U$ and (B1)–(B3) hold. Then,
$$\sqrt{n}\,(\hat{\beta}_{U'}^{\mathrm{HOLS}} - \hat{\beta}_{U'}^{\mathrm{OLS}}) \overset{D}{\to} \mathcal{N}\big(0,\ \sigma_{\tilde{E}}^2 E[V_{U'}V_{U'}^\top]\big), \qquad \frac{1}{n}\hat{v}_{U'}^\top\hat{v}_{U'} \overset{P}{\to} E[V_{U'}V_{U'}^\top].$$

Corollary 2.

Assume the conditions of Theorem 5. Consider the decision rule to reject $H_{0,j}$ iff $P_j \le \alpha$, where $P_j$ is as in Step 10 of Algorithm 1. Then, the familywise error rate amongst the set $U'$ is asymptotically controlled at $\alpha$.

3.2 Inferring V from U

Recall the definitions in (12). $U$ is the set that we try to infer with our HOLS check. Naturally, one would rather be interested in the set $V$, which consists of the variables for which we can consistently estimate the true linear causal effect according to (8) through linear regression. We discuss here when using $U$ as a proxy for $V$ might fail and especially analyze how variables could belong to the difference between the sets. For this, recall our formulation of the model when the global null hypothesis does not hold true in (10) and (11). Note that $j \in U$ is equivalent to $E[Z_j^3\tilde{E}] = 0$.

For any variable $j \in U \setminus V$, certain modeling assumptions, which we discuss in the sequel, cannot be fulfilled, but these assumptions are not necessary for $E[Z_j^3\tilde{E}] = 0$. Especially, the last equality always holds if both $E_X$ and $H$ jointly have Gaussian kurtosis. If they are even jointly Gaussian, then it is clear that $X \perp\!\!\!\perp \tilde{E}$ such that the model (10) has independent Gaussian error. Thus, when using only observational data, it behaves exactly like a model under the global null hypothesis and, naturally, we cannot infer the confounding effect. Apart from Gaussian kurtosis, $j \in U \setminus V$ would be mainly due to special constellations implying cancelation of terms that one does not expect to encounter in normal circumstances.

For $j \in V \setminus U$, $Z_j$ and $\tilde{E}$ cannot be independent. As $Z_j \not\perp\!\!\!\perp \tilde{E}$, the single-covariate model (11) is not a linear causal model with independent error term as given in (1). Therefore, from a causal inference perspective, one can argue that rejecting the local null hypothesis $H_{0,j}$ is the right thing to do in this case. Furthermore, having variables $j \in V$ is usually related to certain model assumptions, except for very specific data setups that lead to cancelation of terms. Under these assumptions, $Z_j \perp\!\!\!\perp \tilde{E}$ is then usually implied. An example where $\beta^{\mathrm{OLS}} = \beta$, but (potentially) $X \not\perp\!\!\!\perp \tilde{E}$, is data for which $\rho\Sigma^H\alpha = 0$, using the definitions from model (8).

Recovery of U

Based on our asymptotic results when the global null does not hold true, we would like to construct a method that perfectly detects the unconfounded variables as $n \to \infty$. Define
$$\hat{U} = \{j: H_{0,j} \text{ not rejected}\}. \tag{14}$$

The question is how and when we can ensure that
$$\lim_{n\to\infty} P[\hat{U} = U] = 1.$$

Suppose that we conduct our local z-tests at level $\alpha_n$, which varies with the sample size such that $\alpha_n \to 0$ as $n \to \infty$. It will be more convenient to interpret this as a threshold on the (scaled) absolute z-statistics, say, $\tau_n$, that grows with $n$, where the z-statistic is defined as
$$t_j = \frac{\sqrt{n}\,(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})}{\sqrt{\widehat{\mathrm{var}}\big(\sqrt{n}\,(\hat{\beta}_j^{\mathrm{HOLS}} - \hat{\beta}_j^{\mathrm{OLS}})\big)}}.$$

We refrain from calling it zj to avoid confusion. We use an additional assumption which is a relaxed version of (B3).

  • (A4) $E[(\tilde{Z}_j^3\tilde{E})^2] < \infty$

Theorem 6.

Assume that the data follows the model (10) and that (A1)–(A3) hold. Assume that (B1) and (A4) hold $\forall j \in U$. Let $\tau_n$ be the threshold on the absolute z-statistics to reject the corresponding null hypothesis, with $\tau_n = o(\sqrt{n})$ and $1/\tau_n = o(1)$. Then,
$$\lim_{n\to\infty} P[\hat{U} = U] = 1.$$

In other words, we can choose $\tau_n$ to grow at any rate slower than $\sqrt{n}$.
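As a concrete illustration, such a thresholding rule could look as follows in R; the particular rate $n^{1/3}$ is just one admissible choice between a constant and $\sqrt{n}$, not a recommendation from the paper.

recover_U <- function(t.stat, n) {
  tau.n <- n^(1/3)              # grows with n, but slower than sqrt(n)
  which(abs(t.stat) <= tau.n)   # estimated set of coordinates with H_{0,j} not rejected
}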

4 Specific Models

In this section, we discuss two types of models where the local interpretation applies. In these settings, there are variables for which $\beta_j = \beta_j^{\mathrm{OLS}} = \beta_j^{\mathrm{HOLS}}$ and Assumptions (B1)–(B3) hold even though the overall data follows the model (8). We note here first that the model of jointly Gaussian $E_X$, for which the method is suited, is a special case of the model in Section 4.2.

4.1 Block Independence of $E_X$

Assume that the errors $E_X$ can be grouped into two or more independent and disjoint blocks. Denote the block that includes $j$ by $B(j)$. Then, it is clear that $(\Sigma^{E_X})^{-1}_{jk} = 0$ if $B(j) \neq B(k)$. If $X_{B(j)} = E_{X_{B(j)}}$, that is, the confounder has no effect on $X_{B(j)}$, (13) holds for all covariates in $B(j)$. Then, no variable in $X_{B(j)}$ contributes to the best linear predictor for $H\alpha$. Due to the block independence, this yields $X_{B(j)} \perp\!\!\!\perp \tilde{E}$ and $Z_j \perp\!\!\!\perp \tilde{E}$, that is, (B2) is fulfilled. This also ensures $E[Z_j^3\tilde{E}] = 0$. We consider the remaining assumptions: naturally, the regression of $Z_j^3$ versus $X_{-j}$ only involves $X_{B(j)\setminus j}$, and (B3) holds as well. For (B1), separately consider the cases $k \in B(j)$ and $k \notin B(j)$. In the first case, $E[Z_j^2 X_k\tilde{E}] = E[Z_j^2 X_k]E[\tilde{E}] = 0$. In the second case, $E[Z_j^2 X_k\tilde{E}] = E[Z_j^2]E[X_k\tilde{E}] = 0$.

Theorem 7.

Assume the data follows the model (8) with errors $E_X$ that can be grouped into independent blocks. Then,
$$\beta_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{OLS}} = \beta_j \quad \forall j \text{ for which } X_{B(j)} = E_{X_{B(j)}}.$$
Further, (B1)–(B3) hold $\forall j$ for which $X_{B(j)} = E_{X_{B(j)}}$.

In some cases, block independence may be a restrictive assumption. Testing this assumption is not an easy problem and remains outside the scope of this article. However, the HOLS check still provides an indirect check of such an assumption since HOLS would likely reject the local null hypotheses for all covariates, at least for large datasets, if there is no block that is unaffected by the confounding.

4.2 Linear Structural Equation Model

From the previous sections, we know that locally unconfounded structures, in the sense that $\beta_j^{\mathrm{OLS}} = \beta_j$, are strongly related to zeros in the precision matrix. Thus, the question arises for which types of models zeros in the precision matrix are common. Besides block independence, which we have discussed in Section 4.1, this will mainly be the case if the data follows a linear structural equation model (SEM). Thus, we will focus on these linear SEMs for the interpretation of local, that is, per-parameter, null hypotheses.

To start, assume that there are no hidden variables. So, let $X$ be given by the following linear SEM
$$X_j \leftarrow \Psi_j + \sum_{k \in \mathrm{PA}(j)} \theta_{j,k} X_k, \quad j = 1,\dots,p, \tag{15}$$
where the $\Psi_1,\dots,\Psi_p$ are independent and centered random variables. We use the notation $\mathrm{PA}(j)$, $\mathrm{CH}(j)$ and $\mathrm{AN}(j)$ for $j$'s parents, children and ancestors. Further, assume that there exists a directed acyclic graph (DAG) representing this structure. For this type of model, we know that a variable's Markov boundary consists of its parents, its children, and its children's other parents. For every other variable $k$ outside of $j$'s Markov boundary, we have $(\Sigma^X)^{-1}_{jk} = 0$. Thus, such zero partial correlations are very common. In the following, we will analyze how our local tests are especially applicable to this structure.

In the context of linear SEMs, hidden linear confounders can be thought of as unmeasured variables. Therefore, we split $X$, which contains all possible predictors, into two parts. Let $X_M$ be the measured variables and $X_N$ the hidden confounder variables. Let $\Psi = (\Psi_1,\dots,\Psi_p)$ with the corresponding subsets $\Psi_M$ and $\Psi_N$. Then, we can write
$$X = \omega\Psi, \quad X_M = \omega_{M,M}\Psi_M + \omega_{M,N}\Psi_N, \quad\text{and}\quad X_N = \omega_{N,M}\Psi_M + \omega_{N,N}\Psi_N$$
for some suitable $\omega \in \mathbb{R}^{p \times p}$, where $\omega_{k,l} = 0$ for $k \neq l$ if $l \notin \mathrm{AN}(k)$ and $\omega_{k,k} = 1$. Note that $\omega_{M,M}$ is always invertible since it can be written as a triangular matrix with ones on the diagonal if permuted properly. Under model (8), $Y$ can be thought of as a sink node in (15). To avoid confusion, we call the parameter if all predictors were observed $\beta^*$. This leads to the definitions
$$E_X := \omega_{M,M}\Psi_M, \quad \rho := \omega_{M,N} \quad\text{and}\quad H := \Psi_N \quad\text{such that}\quad X_M = E_X + \rho H \text{ with } E_X \perp\!\!\!\perp H,$$
$$Y - E = X\beta^* = X_M\beta_M^* + X_N\beta_N^* = X_M\big(\beta_M^* + (\omega_{N,M}\omega_{M,M}^{-1})^\top\beta_N^*\big) + H\big(\omega_{N,N} - \omega_{N,M}\omega_{M,M}^{-1}\omega_{M,N}\big)^\top\beta_N^* =: X_M\beta + H\alpha. \tag{16}$$

When only the given subset is observed, we are interested in the parameter $\beta$ as before. We have $\beta_j = \beta_j^*$ iff $\big((\omega_{N,M}\omega_{M,M}^{-1})^\top\beta_N^*\big)_j = 0$.

Theorem 8.

Assume that the data follows the model (15) and (16). Let $X_M$ and $X_N$ be the observed and hidden variables. Denote by $\mathrm{PA}_M(k)$ the closest ancestors of $k$ that are in $M$. Consider some $j \in M$. If
$$\nexists\, k \in N: \big(j \in \mathrm{PA}_M(k) \text{ and } \beta_k^* \neq 0\big), \quad\text{then}\quad \beta_j = \beta_j^*.$$

In other words, the causal parameter can only change for variables that have at least one direct descendant in the hidden set which is a parent of $Y$ itself. By direct descendant, we mean that there is a path from $j$ to $k$ that does not pass through any other observed variable. We analyze for which variables we can reconstruct this causal parameter using ordinary least squares regression.

Theorem 9.

Assume that the data follows the model (15) and (16). Let $X_M$ and $X_N$ be the observed and hidden variables. Then,
$$\beta_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{OLS}} = \beta_j \quad \forall j \in M \text{ that are not in the Markov boundary of any hidden variable.}$$
Further, (B1)–(B3) hold $\forall j \in M$ that are not in the Markov boundary of any hidden variable.

Thus, for those variables, we can (a) correctly retrieve the causal parameter using ordinary least squares regression and (b) detect that this is the true parameter by comparing it to $\beta_j^{\mathrm{HOLS}}$.

Simulation example

We assess the performance of our HOLS method in a linear SEM using a simple example. In Figure 1, we show the DAG that represents the setup.

Fig. 1 DAG of the linear SEM. $X_3$ is assumed to be hidden, which is depicted by the dashed circle. We use the following specifications: $\Psi_1 \overset{D}{=} \Psi_3 \overset{D}{=} \Psi_5 \sim t_7/\sqrt{7/5}$, $\Psi_2 \overset{D}{=} \Psi_6 \overset{D}{=} \Psi_7 \sim \mathcal{N}(0, 1/2)$, $\Psi_4 \sim \mathrm{Unif}[-\sqrt{3/2}, \sqrt{3/2}]$ and $E \sim \mathcal{N}(0,1)$. $\theta_{2,1} = \theta_{7,6} = 1/\sqrt{2}$, $\theta_{4,2} = \theta_{4,3} = \theta_{6,4} = \theta_{6,5} = 0.5$ and $\beta_3^* = 5/2$.

For simplicity, the parameters are set such that $X_1$ to $X_7$ all have unit variance. $X_3$ is the only parent of $Y$ and we apply the HOLS method using all but $X_3$ as predictors, that is, $X_3$ is treated as a hidden variable. Following Theorem 9, we know that for the variables $X_1$ and $X_5$ to $X_7$ the causal effect on $Y$ is consistently estimated with OLS, while we chose the detailed setup such that there is a detectable confounding bias on $\beta_2^{\mathrm{OLS}}$ and $\beta_4^{\mathrm{OLS}}$. Thus, ideally, our local tests reject the null hypothesis for those two covariates but not for the rest.
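The following R lines sketch this data-generating process as we read it from the caption of Figure 1 (the $\theta$ values and innovation scales are reconstructed so that the unit-variance statement above holds) and feed it to the hols_check sketch from Section 2.2; it is meant as an illustration only.

gen_sem <- function(n, beta3.star = 5/2) {
  rpsi.t <- function(n) rt(n, df = 7) / sqrt(7/5)     # unit-variance t_7 innovations
  psi1 <- rpsi.t(n); psi3 <- rpsi.t(n); psi5 <- rpsi.t(n)
  psi2 <- rnorm(n, sd = sqrt(1/2)); psi6 <- rnorm(n, sd = sqrt(1/2)); psi7 <- rnorm(n, sd = sqrt(1/2))
  psi4 <- runif(n, -sqrt(3/2), sqrt(3/2))
  x1 <- psi1; x2 <- psi2 + x1 / sqrt(2); x3 <- psi3
  x4 <- psi4 + 0.5 * x2 + 0.5 * x3
  x5 <- psi5; x6 <- psi6 + 0.5 * x4 + 0.5 * x5; x7 <- psi7 + x6 / sqrt(2)
  y <- beta3.star * x3 + rnorm(n)
  list(x = cbind(x1, x2, x4, x5, x6, x7), y = y)      # X3 is dropped, i.e., hidden
}
dat <- gen_sem(10^5)
hols_check(dat$x, dat$y)$p.adj   # small values expected for the X2 and X4 columns at this sample size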

For numerical results, we let the sample size grow from $10^2$ to $10^6$. For each sample size, we do 200 simulation runs. On the left-hand side of Figure 2, we show the average absolute z-statistics per predictor for the different sample sizes. For $X_2$ and $X_4$ we see the expected $\sqrt{n}$-growth. For the other variables, the empirical averages are close to the theoretical mean, which equals $\sqrt{2/\pi} \approx 0.8$, with a minimum of 0.70 and a maximum of 0.88. Further, we see that the confounding bias on the OLS parameter for $X_4$, which is a child of the hidden variable, is easier to detect than the bias on the parameter for $X_2$, which is a child's other parent. The multiplicity corrected p-value for $X_4$ is rejected at level $\alpha = 0.05$ in 91.5% of the cases for $n = 10^3$, whereas the null hypothesis for $X_2$ is only rejected with an empirical probability of 3%. For $X_2$, it takes $n = 10^5$ samples to reject the local null hypothesis in 89% of the simulation runs.

Fig. 2 Simulation in a linear SEM corresponding to Figure 1. The results are based on 200 simulation runs. On the left: average absolute z-statistics per covariate for different sample sizes. The dotted lines grow as $\sqrt{n}$ and are fit to match perfectly at $n = 10^5$. On the right: empirical probability of perfectly recovering $U$ (see Equation (12)) for different sample sizes.

Following Section 3.2, we should be able to perfectly recover the set $U$ (see Equation (12)) as $n \to \infty$ if we let the threshold on the absolute z-statistics grow at the right rate. Therefore, we plot on the right-hand side of Figure 2 the empirical probability of perfectly recovering $U$ over a range of possible thresholds for the different sample sizes. For $n = 10^6$, we could achieve an empirical probability of 1. For $n = 10^5$ the optimum probability is 87%, whereas for $n = 10^4$ it is only 19%.

Naturally, perfectly recovering $U$ is a very ambitious goal for smaller sample sizes, and one might want to consider different objectives. In Figure 3, we plot two different performance metrics. On the left-hand side, we plot the empirical probability of not falsely including any variable in $\hat{U}$ against the average intersection size $|\hat{U} \cap U|$. The curve is parameterized implicitly by the threshold on the absolute z-statistics in order to reject the local null hypothesis for some variable. Thus, the graphic considers the question of how many variables in $U$ can be recovered while keeping the probability of falsely including any variable low. For a sample size of $10^5$, we have an average intersection size of 3.97 allowing for a 10% probability of false inclusions. For $10^4$, it is still 0.995. Thus, we can find (almost) one of the four variables in $U$ on average. As we see in Figure 2, the bias on $\beta_4^{\mathrm{OLS}}$ is much easier to detect than the bias on $\beta_2^{\mathrm{OLS}}$. Thus, keeping the probability of including $X_2$ in $\hat{U}$ low is still an ambitious task. Therefore, we analyze on the right-hand side of Figure 3 how many variables in $U$ we can find while removing a certain amount of confounding signal. We define the remaining fraction as
$$\frac{\|\beta_{\hat{U}}^{\mathrm{OLS}} - \beta_{\hat{U}}\|_1}{\|\beta^{\mathrm{OLS}} - \beta\|_1},$$
that is, how much of the difference $\beta^{\mathrm{OLS}} - \beta$ persists in terms of the $\ell_1$ norm.

Fig. 3 Simulation in a linear SEM corresponding to Figure 1. The results are based on 200 simulation runs. On the left: probability of not falsely including a variable in $\hat{U}$ versus average intersection size $|\hat{U} \cap U|$ (see Equation (14)). On the right: average remaining fraction of confounding signal versus average intersection size $|\hat{U} \cap U|$. It holds that $|U| = 4$. Both curves use the threshold on the absolute z-statistics as implicit curve parameter. Note that the legend applies to either plot.

In this SEM, $\beta_4^{\mathrm{OLS}}$ carries 2/3 of the confounding signal, $\beta_2^{\mathrm{OLS}}$ only 1/3. Accepting 1/3 of remaining confounding signal, we receive an average intersection size of 3.885 for a sample size of $10^3$. For $10^4$, the average is 4. Thus, if we allow for false inclusion of $X_2$, we can almost perfectly retrieve all of $U$ for sample size $10^3$ already.

What if $X$ includes descendants of $Y$? So far, we have only considered the case where $X$ causally affects $Y$, but potentially, some of $Y$'s parents are missing, leading to a confounding effect. However, another possibility for $\beta^{\mathrm{OLS}}$ to not denote a causal effect is that there are descendants of $Y$ amongst the predictors. The two different situations are depicted in Figure 4. The case with descendants in the set of predictors fits our theory from before if interpreted properly. If the model (4) for $Y$ holds true using only the parents as predictors, $Y$ can be naturally included in the assumed linear SEM for $X$ in (15). Then, one can also think of $E$ as an unobserved confounder. The Markov boundary of $E$ with respect to the observed predictors is the same as the Markov boundary of $Y$. Of course, it holds that $\beta = \beta^*$, that is, $\beta_j = 0\ \forall j \notin \mathrm{PA}(Y)$. Using Theorem 9, we find
$$\beta_j^{\mathrm{HOLS}} = \beta_j^{\mathrm{OLS}} = \beta_j = 0 \quad \forall j \in M \text{ that are not in the Markov boundary of } Y.$$

Fig. 4 Left: SEM with a hidden confounder. Right: SEM with a descendant of Y.


Thus, for all variables outside $Y$'s Markov boundary, one can correctly detect that they have no causal effect on $Y$ ceteris paribus. The coefficients for the variables in the boundary, which includes all parents, are, up to term cancelations, all subject to confounding bias. This can be detected under some conditions, as discussed in Section 3.2.

4.3 Beyond Linearity

We have mainly focused on linear models, that is, the data is either generated by model (4) or model (8). Naturally, this assumption might be questionable in practice. Therefore, we provide some intuition about how HOLS might be applied in a more general setup. As we only detect misspecification of the OLS coefficient without identifying the type of misspecification, one should not try to over-interpret the effect of the regressors in $\hat{U}^c = \{1,\dots,p\} \setminus \hat{U}$ (see Section 3.2). However, the linear effect of the variables in $\hat{U}$ can always be interpreted to be well-specified, meaning that $E[W_j|Z_j] = Z_j\beta_j^{\mathrm{OLS}}$, or at least “sufficiently” well-specified such that no misspecification is detected in the data. Generally, we can write
$$Y = X\beta^{\mathrm{OLS}} + f_{\mathrm{nonlinear}}(X) + E, \quad\text{where } f_{\mathrm{nonlinear}}(X) = E[Y|X] - X\beta^{\mathrm{OLS}},\ E[E|X] = 0,$$
$$W_j = Z_j\beta_j^{\mathrm{OLS}} + f_{\mathrm{nonlinear}}(X) + E.$$

Thus, well-specification of $\beta_j^{\mathrm{OLS}}$ implies $E[f_{\mathrm{nonlinear}}(X)|Z_j] = 0$, that is, after linearly adjusting for $X_{-j}$, $X_j$ does not have any predictive power for $f_{\mathrm{nonlinear}}(X)$. If it does not have any predictive power after linear adjustment, it would be a natural conclusion that it does not have predictive power after optimally adjusting for $X_{-j}$, implying that $f_{\mathrm{nonlinear}}(X)$ can be written as a function of $X_{-j}$ only. This would then imply
$$E[Y|X_j = x_j + 1, X_{-j} = x_{-j}] - E[Y|X_j = x_j, X_{-j} = x_{-j}] = \beta_j^{\mathrm{OLS}} \quad \forall x_j, x_{-j}.$$

Except for Gaussian data, such a linear relationship must be either causal or due to very pathological data setups. Excluding such unusual cancelations, the conclusion is that for $j \in U$ there must be a true linear causal effect from $X_j$ to $Y$ keeping the other predictors fixed, which can be consistently estimated using OLS. Of course, if there are no locally linear structures, it might well be that $U = \emptyset$ such that the local tests are not more informative than the global test. However, there is also nothing to be lost by exploiting this local view.

Note that the asymptotic results presented in Sections 3.1 and 3.2 hold for nonlinear data as well since they only assume model (10)–(11), which is the most general formulation.

5 Real Data Example

We analyze the flow cytometry dataset presented by Sachs et al. (2005). It contains cytometry measurements of 11 phosphorylated proteins and phospholipids. There is a “ground truth” on how these quantities affect each other, the so-called consensus network (Sachs et al. 2005). Data is available from various experimental conditions, some of which are interventional environments. The dataset has been further analyzed in various projects, see, for example, Mooij and Heskes (2013), Meinshausen et al. (2016), and Taeb et al. (2021). Following these works, we consider data from eight different environments, seven of which are interventional. The sample size per environment ranges from 707 to 913.

In our analysis, we focus on the consensus network from Sachs et al. (2005). For each node, we go through all environments, fit a linear model using all its claimed parents as predictors, and assess the goodness of fit of the model using our HOLS check. In the consensus network, there is one bidirected edge between the variables PIP2 and PIP3. We include it as a parent for either direction. For each suggested edge, we also collect the p-values from the linear model fit in all environments, keeping only those where the edge passes the local HOLS check at level $\alpha = 5\%$ without multiplicity correction. We omit the multiplicity correction here to lower the tendency to falsely claim causal detection. In Table 1, we report the minimum p-value from OLS, over the environments where the HOLS check is passed, sorted by increasing p-values. Additionally, we show the number of environments in which the check is passed and, out of these, the number where the edge is significant at level $\alpha = 5\%$ in the respective linear model fit (with Bonferroni correction over all 8 environments and 17 edges, that is, we require a p-value of at most 0.05/136). Note that there is one p-value of 0 reported, corresponding to a t-value of 174, which exceeds the precision that can be obtained with the standard R function lm.
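Schematically, the per-node, per-environment loop just described could be organized as in the R sketch below, reusing hols_check from Section 2.2 with an explicit intercept column. The objects envs (a list of data frames, one per environment) and parents (a named list mapping each node of the consensus network to its claimed parents) are placeholders for the Sachs et al. data, not objects from the authors' code.

check_edges <- function(envs, parents, node) {
  out <- lapply(seq_along(envs), function(e) {
    d <- envs[[e]]
    x <- cbind(intercept = 1, as.matrix(d[, parents[[node]], drop = FALSE]))
    hols <- hols_check(x, d[[node]])
    lm.p <- summary(lm(d[[node]] ~ x - 1))$coefficients[-1, 4]   # per-parent p-values from the linear fit
    data.frame(parent = parents[[node]], env = e,
               hols.pass = hols$p.raw[-1] > 0.05,   # local HOLS check at 5%, uncorrected
               lm.p = lm.p)
  })
  do.call(rbind, out)
}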

Table 1 The working model is taken from the consensus network.

We see that the edge RAF → MEK is the most significant. Further, every edge of the consensus network passes the HOLS check in at least one environment. Frequently, we see that edges pass the HOLS check in certain environments without being significant in the linear model. Considering our discussion around linear SEMs, this could easily happen if the alleged predictor node is not actually in the Markov boundary of the response. In fact, there are seven edges that pass the HOLS check in every environment which are not significant based on the linear model fits. This is in agreement with Taeb et al. (2021), where none of them is reported.

As we cannot guarantee that the data follows a linear SEM as in Equation (15), we shall not interpret the edges that do not pass the HOLS check to be subject to hidden confounding. However, the fact that we still find a decent number of suggested edges that pass the HOLS check, at least in some environments, provides evidence that the assumption of some locally unconfounded linear structures is not unrealistic; see also the discussion in Section 4.3.

We can also analyze our results in the light of invariant causal prediction, see, for example, Peters, Bühlmann, and Meinshausen (2016), where one typically assumes that interventions do not change the underlying graph except for edges that point toward the node that is intervened on. This assumption is highly questionable in practice, and our findings, which vary a lot over different environments, indicate that the assumption is likely not fulfilled in the given setup.

6 Discussion

We have introduced the so-called HOLS check to assess the goodness of fit of linear causal models. It is based on the dependence between residuals and predictors in misspecified models, leading to nonvanishing higher-order moments. Besides checking whether the overall model might hold true, the method allows us to detect, for certain model classes, a set of variables for which linear regression consistently estimates a true (unconfounded) causal effect.

We extend the HOLS method to high-dimensional datasets based on the idea of the debiased Lasso (Zhang and Zhang 2014; van de Geer et al. 2014). This extension comes very naturally as our HOLS check involves nodewise regression just as the debiased Lasso.

Of particular interest are linear structural equation models, for which our method allows for very precise characterizations regarding which least squares parameters are causal effects. The result requires some non-Gaussianity. We complement our theory with a simulation study as well as a real data example.

A drawback of our method is that it does not distinguish whether a model is misspecified due to confounding or due to nonlinearities in the model. Therefore, an interesting follow-up direction would be to extend our methodology and theory from linear to nonlinear SEMs using more flexible regression methods. This could allow the detection of local causal structures in nonlinear settings as well.


Supplementary Materials

Simulation results as well as proofs and extended theory can be found in the supplemental material.

Data Availability Statement

Code scripts to reproduce the results presented in this article are available here https://github.com/cschultheiss/HOLS.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

The project leading to this application has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 786461). M. Yuan was supported in part by NSF Grants DMS-2015285 and DMS-2052955. Part of the work was done while M. Yuan was visiting the Institute for Theoretical Studies at ETH Zürich, Switzerland, and he wishes to thank the institute for their support and hospitality.

References

  • Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of Causal Effects using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–455. DOI: 10.1080/01621459.1996.10476902.
  • Belsley, D. A., Kuh, E., and Welsch, R. E. (2005), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Hoboken, NJ: Wiley.
  • Bowden, R. J., and Turkington, D. A. (1990), Instrumental Variables, Cambridge: Cambridge University Press.
  • Bühlmann, P. (2013), “Statistical Significance in High-Dimensional Linear Models,” Bernoulli, 19, 1212–1242. DOI: 10.3150/12-BEJSP11.
  • Buja, A., Brown, L., Berk, R., George, E., Pitkin, E., Traskin, M., Zhang, K., and Zhao, L. (2019a), “Models as Approximations I: Consequences Illustrated with Linear Regression,” Statistical Science, 34, 523–544. DOI: 10.1214/18-STS693.
  • Buja, A., Brown, L., Kuchibhotla, A. K., Berk, R., George, E., and Zhao, L. (2019b), “Models as Approximations II: A Model-Free Theory of Parametric Regression,” Statistical Science, 34, 545–565. DOI: 10.1214/18-STS694.
  • Ćevid, D., Bühlmann, P., and Meinshausen, N. (2020), “Spectral Deconfounding via Perturbed Sparse Linear Models,” Journal of Machine Learning Research, 21, 9442–9482.
  • Godfrey, L. G. (1991), Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches, Cambridge: Cambridge University Press.
  • Greene, W. H. (2003), Econometric Analysis, Noida: Pearson Education India.
  • Guo, Z., Ćevid, D., and Bühlmann, P. (2022), “Doubly Debiased Lasso: High-Dimensional Inference Under Hidden Confounding,” Annals of Statistics, 50, 1320–1347. DOI: 10.1214/21-aos2152.
  • Hausman, J. A. (1978), “Specification Tests in Econometrics,” Econometrica: Journal of the Econometric Society, 46, 1251–1271. DOI: 10.2307/1913827.
  • Hoyer, P. O., Hyvärinen, A., Scheines, R., Spirtes, P., Ramsey, J., Lacerda, G., and Shimizu, S. (2008), “Causal Discovery of Linear Acyclic Models with Arbitrary Distributions,” in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pp. 282–289.
  • Hyvärinen, A., and Oja, E. (2000), “Independent Component Analysis: Algorithms and Applications,” Neural Networks, 13, 411–430. DOI: 10.1016/s0893-6080(00)00026-5.
  • Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social, and Biomedical Sciences, New York: Cambridge University Press.
  • Lehmann, E. L., Romano, J. P., and Casella, G. (2005), Testing Statistical Hypotheses (Vol. 3), Cham: Springer.
  • Maddala, G., and Lahiri, K. (2009), Introduction to Econometrics (4th ed), Chichester: Wiley.
  • Meinshausen, N., Hauser, A., Mooij, J. M., Peters, J., Versteeg, P., and Bühlmann, P. (2016), “Methods for Causal Inference from Gene Perturbation Experiments and Validation,” Proceedings of the National Academy of Sciences, 113, 7361–7368. DOI: 10.1073/pnas.1510493113.
  • Mooij, J. M., and Heskes, T. (2013), “Cyclic Causal Discovery from Continuous Equilibrium Data,” in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 431–439.
  • Peters, J., Bühlmann, P., and Meinshausen, N. (2016), “Causal Inference by using Invariant Prediction: Identification and Confidence Intervals,” Journal of the Royal Statistical Society, Series B, 78, 947–1012. DOI: 10.1111/rssb.12167.
  • Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. (2014), “Causal Discovery with Continuous Additive Noise Models,” Journal of Machine Learning Research, 15, 2009–2053.
  • Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. (2005), “Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data,” Science, 308, 523–529. DOI: 10.1126/science.1105809.
  • Shah, R. D., and Bühlmann, P. (2018), “Goodness-of-Fit Tests for High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 80, 113–135. DOI: 10.1111/rssb.12234.
  • Shimizu, S., Hoyer, P. O., Hyvärinen, A., Kerminen, A., and Jordan, M. (2006), “A Linear Non-Gaussian Acyclic Model for Causal Discovery,” Journal of Machine Learning Research, 7, 2003–2030.
  • Taeb, A., Gamella, J. L., Heinze-Deml, C., and Bühlmann, P. (2021), “Perturbations and Causality in Gaussian Latent Variable Models,” arXiv preprint arXiv:2101.06950.
  • van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On Asymptotically Optimal Confidence Regions and Tests for High-Dimensional Models,” The Annals of Statistics, 42, 1166–1202. DOI: 10.1214/14-AOS1221.
  • Wang, Y. S., and Drton, M. (2020), “High-Dimensional Causal Discovery under Non-Gaussianity,” Biometrika, 107, 41–59. DOI: 10.1093/biomet/asz055.
  • Zhang, C.-H., and Zhang, S. S. (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242. DOI: 10.1111/rssb.12026.