Theory and Methods

Heteroscedasticity-Robust Inference in Linear Regression Models With Many Covariates

Pages 887-896 | Received 17 Sep 2018, Accepted 27 Sep 2020, Published online: 19 Nov 2020

Abstract

We consider inference in linear regression models that is robust to heteroscedasticity and the presence of many control variables. When the number of control variables increases at the same rate as the sample size, the usual heteroscedasticity-robust estimators of the covariance matrix are inconsistent. Hence, tests based on these estimators are size distorted even in large samples. An alternative covariance-matrix estimator for such a setting is presented that complements recent work by Cattaneo, Jansson, and Newey. We provide high-level conditions for our approach to deliver (asymptotically) size-correct inference as well as more primitive conditions for three special cases. Simulation results and an empirical illustration to inference on the union premium are also provided. Supplementary materials for this article are available online.

1 Introduction

When performing inference in linear regression models it is common practice to safeguard against (conditional) heteroscedasticity of unknown form. The estimator of the covariance matrix of the least-squares estimator proposed by Eicker (1963, 1967) and White (1980) is known to be biased. The bias can be severe if the regressor design contains observations with high leverage (Chesher and Jewitt 1987; see note 1). A necessary condition for the least-squares estimator to be asymptotically normal is that maximal leverage vanishes as the sample size grows large (Huber 1973). This condition, then, also implies consistency of the robust covariance-matrix estimator (under regularity conditions).

The requirement that maximal leverage vanishes is problematic when the regressors include a large set of control variables. Under asymptotics where their number, $q_n$, grows with the sample size, $n$, the robust covariance-matrix estimator will be inconsistent unless $q_n/n \to 0$, as demonstrated by Cattaneo, Jansson, and Newey (2018). They obtained the same result for the other members of the so-called HC-class of covariance-matrix estimators (see, e.g., Long and Ervin 2000; MacKinnon 2012 for reviews) and showed that the jackknife variance estimator of MacKinnon and White (1985), although inconsistent, can be used to perform asymptotically conservative inference under asymptotics where $q_n/n \not\to 0$. On the other hand, a bias-corrected covariance-matrix estimator in the spirit of Hartley, Rao, and Kiefer (1969) and Bera, Suprayitno, and Premaratne (2002) is shown to be consistent under conditions that bound maximal leverage below $\tfrac{1}{2}$. One implication of these conditions is that $\limsup_{n\to\infty} q_n/n < \tfrac{1}{2}$. Although not guaranteed to be positive semidefinite (see Bera, Suprayitno, and Premaratne 2002 for a discussion), this estimator is attractive as it is asymptotically equivalent to a minimum-norm unbiased estimator in the sense of Rao (1970).

In this article, we discuss an alternative estimator of the covariance matrix that can deal with designs where maximal leverage is bounded away from 1. As such it remains consistent when $\limsup_{n\to\infty} q_n/n < 1$. To show this, we need to impose additional conditions relative to Cattaneo, Jansson, and Newey (2018). A consistency result is first provided under high-level conditions. Primitive conditions are then given for three special cases: the partially linear regression model, the one-way model for short panel data, and the generic linear model with increasing dimension. Again, our covariance-matrix estimator need not be positive semidefinite in small samples. It is further not invariant to changes in the scale of the regression slopes. Achieving such invariance under asymptotics where $\limsup_{n\to\infty} q_n/n > \tfrac{1}{2}$ appears difficult, as we discuss below.

The idea underlying our variance estimator can be traced back to work by Kline, Saggio, and Sølvsten (2019, Remark 4 and Lemma 5); see below. The chief difference lies in the conditions under which the consistency result is obtained. They considered settings where the observations are independent and the regressors are fixed, and entertained models that are correctly specified with regression functions that are uniformly bounded. This is reasonable in analysis-of-variance problems, which are the main focus of application in their work. Here, we maintain the framework of Cattaneo, Jansson, and Newey (2018). Some dependence between observations is allowed, regressors are stochastic, and the model can feature (vanishing) misspecification bias. Our consistency result can be understood to be an extension of Kline, Saggio, and Sølvsten (2019) to such settings. It allows for regressors to have unbounded support and for observations to depend on a growing number of parameters. Our conditions do rule out dynamic models, for example. Verdier (2020) provides estimation and inference results for two-way models estimated from short panel data that allow for dynamics over time.

In Section 2, we introduce the framework, present our covariance-matrix estimator, and provide a consistency result under a set of high-level conditions. In Section 3, we connect our work to the literature and, notably, to Cattaneo, Jansson, and Newey (2018) and Kline, Saggio, and Sølvsten (2019). In Section 4, we provide primitive conditions for three special cases of our setup. In Section 5, we present and discuss the results from Monte Carlo experiments and apply our variance estimator to perform inference on the union wage premium. A short conclusion ends the article. The supplemental appendix contains technical details and additional simulation results.

2 Inference With Many Regressors

2.1 Framework

Consider the linear model
\[
y_{i,n} = x_{i,n}'\beta + w_{i,n}'\gamma_n + u_{i,n}, \qquad i = 1, \ldots, n, \qquad (1)
\]
where $y_{i,n}$ is a scalar outcome, $x_{i,n}$ is a vector of regressors of fixed dimension $r$, $w_{i,n}$ is a vector of covariates whose dimension, $q_n$, may grow with $n$, and $u_{i,n}$ is an unobserved error term. Our aim is to perform asymptotically valid inference on $\beta$ that is robust to heteroscedasticity when $\gamma_n$ is high-dimensional, in the sense that $q_n$ is not a vanishing fraction of the sample size. In such a case, the nuisance parameter $\gamma_n$ is not consistently estimable, in general.

The (ordinary) least-squares estimator of $\beta$ is
\[
\hat\beta_n := \Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\Big)^{-1}\Big(\sum_{i=1}^n \hat v_{i,n}\, y_{i,n}\Big), \qquad \hat v_{i,n} := \sum_{j=1}^n (M_n)_{i,j}\, x_{j,n},
\]
where
\[
(M_n)_{i,j} := \{i=j\} - w_{i,n}'\Big(\sum_{k=1}^n w_{k,n} w_{k,n}'\Big)^{-1} w_{j,n},
\]
and $\{\cdot\}$ denotes the indicator function. We will provide an inference result based on the limit distribution of $\hat\beta_n$. We begin by stating a set of high-level conditions that guarantee this distribution to be Gaussian and that cover the case where $q_n/n \not\to 0$ as $n \to \infty$. Our point of departure for this is Theorem 1 of Cattaneo, Jansson, and Newey (2018), and we closely follow their notation.
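As a concrete illustration, the sketch below computes $M_n$, the partialled-out regressors $\hat v_{i,n}$, and $\hat\beta_n$ exactly as in the displays above. It is a minimal numpy rendering written for this article; the function name `partial_out_ols` and the array layout (y of length n, X of shape n x r, W of shape n x q_n) are our own choices, not part of the paper.

```python
# Minimal sketch of the partialled-out least-squares estimator; names assumed.
import numpy as np

def partial_out_ols(y, X, W):
    """Return (beta_hat, v_hat, M) where M annihilates the columns of W."""
    n = len(y)
    # M_n = I - W (W'W)^{-1} W'; pinv guards against (near-)collinear controls
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v_hat = M @ X                             # residualized regressors of interest
    beta_hat = np.linalg.solve(v_hat.T @ v_hat, v_hat.T @ y)
    return beta_hat, v_hat, M
```

With these quantities in hand, the variance estimators discussed below differ only in how they weight the outer products $\hat v_{i,n}\hat v_{i,n}'$.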

Let $X_n := (x_{1,n}, \ldots, x_{n,n})$ and let $W_n$ denote a collection of random variables such that $E(w_{i,n}|W_n) = w_{i,n}$. We introduce
\[
\varepsilon_{i,n} := u_{i,n} - e_{i,n}, \qquad V_{i,n} := x_{i,n} - E(x_{i,n}|W_n),
\]
where $e_{i,n} := E(u_{i,n}|X_n, W_n)$, to state our first assumption. We use $|\cdot|$ to denote the cardinality of a set.

Assumption 1

(Sampling). The errors $\varepsilon_{i,n}$ are uncorrelated across $i$ conditional on $X_n$ and $W_n$, and the collections $\{\varepsilon_{i,n}, V_{i,n} : i \in N_g\}$ are independent across $g$ conditional on $W_n$, where $\{N_1, \ldots, N_{G_n}\}$ represents a partition of $\{1, \ldots, n\}$ into $G_n$ sets such that $\max_g |N_g| = O(1)$.

This assumption covers standard randomly sampled data but also, for example, repeated-measurement data (such as short panel data) in which strata are independent while dependence between observations within a stratum is allowed.

The second assumption contains regularity conditions. We let
\[
\sigma_{i,n}^2 := E(\varepsilon_{i,n}^2|X_n, W_n), \qquad \tilde V_{i,n} := \sum_{j=1}^n (M_n)_{i,j}\, V_{j,n},
\]
denote the Euclidean and Frobenius norms by $\|\cdot\|$, and write $\lambda_{\min}(\cdot)$ for the minimum eigenvalue of its argument.

Assumption 2

(Design). With probability approaching one, $\sum_{i=1}^n w_{i,n} w_{i,n}'$ has full rank,
\[
\max_i \Big( E(\varepsilon_{i,n}^4|X_n, W_n) + \frac{1}{\sigma_{i,n}^2} + E(\|V_{i,n}\|^4 \,|\, W_n) \Big) + \frac{1}{\lambda_{\min}\!\Big( \sum_{i=1}^n E(\tilde V_{i,n} \tilde V_{i,n}'|W_n)/n \Big)} = O_p(1),
\]
and $\limsup_{n\to\infty} q_n/n < 1$.

The rank condition on the design matrix $\sum_{i=1}^n w_{i,n} w_{i,n}'$ is standard. Furthermore, given that the slope coefficients on $w_{i,n}$ are not of direct interest to us, dropping any covariates that are (perfectly) collinear is not an issue. The second condition contains conventional moment conditions. The third condition, finally, allows for $q_n$ to grow at the same rate as the sample size.

Our setting covers situations where the regression in (1) is a linear-in-parameters mean-square approximation to the conditional expectation $\mu_{i,n} := E(y_{i,n}|X_n, W_n)$, in the sense that we allow that $e_{i,n} \neq 0$. The third assumption contains conditions on how fast such an approximation should improve. They are expressed in terms of the two constants
\[
\varrho_n := \frac{\sum_{i=1}^n E(e_{i,n}^2)}{n}, \qquad \rho_n := \frac{\sum_{i=1}^n E\big(E(e_{i,n}|W_n)^2\big)}{n}.
\]

The assumption also contains a similar restriction on how well $V_{i,n}$ can be approximated by
\[
v_{i,n} := x_{i,n} - \Big(\sum_{j=1}^n E(x_{j,n} w_{j,n}')\Big)\Big(\sum_{j=1}^n E(w_{j,n} w_{j,n}')\Big)^{-1} w_{i,n},
\]
the deviation of $x_{i,n}$ from its population linear projection. This is expressed using the constant
\[
\chi_n := \frac{\sum_{i=1}^n E(\|Q_{i,n}\|^2)}{n}, \qquad \text{where } Q_{i,n} := E(v_{i,n}|W_n).
\]

Assumption 3

(Approximations). $\chi_n = O(1)$, $\varrho_n + n(\varrho_n - \rho_n) + n\chi_n\varrho_n = o(1)$, and $\max_i \|\hat v_{i,n}\|/\sqrt{n} = o_p(1)$.

The last part of this assumption is a high-level negligibility condition on the residuals from the auxiliary regression of $x_{i,n}$ on $w_{i,n}$. Given that
\[
\max_i \Big( \hat v_{i,n}'\Big(\sum_{j=1}^n \hat v_{j,n}\hat v_{j,n}'\Big)^{-1} \hat v_{i,n} \Big) \le \frac{\max_i \|\hat v_{i,n}\|^2}{n}\, \Big\| \Big(\sum_{j=1}^n \frac{\hat v_{j,n}\hat v_{j,n}'}{n}\Big)^{-1} \Big\|
\]
and that the left-hand side should vanish in large samples for $\hat\beta_n$ to be asymptotically normal (Huber 1973), this requirement appears close to minimal.

Assumptions 1–3 coincide with Assumptions 1–3 in Cattaneo, Jansson, and Newey (2018). Consequently, by their Theorem 1,
\[
\Omega_n^{-1/2}(\hat\beta_n - \beta) \overset{d}{\to} N(0, I_r) \qquad (2)
\]
as $n \to \infty$, where
\[
\Omega_n := \Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\Big)^{-1}\Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\, \sigma_{i,n}^2\Big)\Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\Big)^{-1},
\]
and $I_r$ denotes the $r \times r$ identity matrix.

2.2 Variance Estimation

Constructing confidence intervals and test statistics based on (2) requires an estimator of $\Omega_n$, and thus of
\[
\Sigma_n := \sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\, \sigma_{i,n}^2.
\]

When $q_n/n \not\to 0$ and the errors are permitted to be heteroscedastic, the construction of a consistent estimator is nontrivial. To appreciate the problem, consider the estimator of Eicker (1963, 1967) and White (1980), which uses
\[
\hat\Sigma_n := \sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\, \hat u_{i,n}^2, \qquad \hat u_{i,n} := \sum_{j=1}^n (M_n)_{i,j}\,\big(y_{j,n} - x_{j,n}'\hat\beta_n\big),
\]
where the $\hat u_{i,n}$ are the least-squares residuals. This estimator is well known to be (conditionally) biased. The bias arises from the sampling noise in the least-squares estimator and can be severe (Chesher and Jewitt 1987). Unless $q_n/n \to 0$ as $n \to \infty$, some observations will remain influential, in the sense that maximal leverage does not vanish. This causes the bias in $\hat\Sigma_n$ to persist in large samples, implying that it is inconsistent.

The alternative to $\hat\Sigma_n$ that we consider in this article is
\[
\acute\Sigma_n := \sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\big(y_{i,n}\,\acute u_{i,n}\big), \qquad \acute u_{i,n} := \frac{\hat u_{i,n}}{(M_n)_{i,i}}.
\]

As stated, this estimator is well defined provided that $\min_i (M_n)_{i,i} > 0$.

Notice that $(M_n)_{i,i} = 0$ means that the model reserves a parameter for this observation. This implies that the auxiliary regression of the regressors of interest on the other covariates yields a perfect prediction, in the sense that $\hat v_{i,n} = 0$. Consequently, such an observation does not carry information on $\beta$ and can be dropped. It does not affect the least-squares estimator $\hat\beta_n$ and does not contribute to its covariance matrix $\Omega_n$. This is important as perfect prediction of this form arises frequently in empirical work when many dummy variables are included.
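The following toy check illustrates the point just made: when a control dummy equals one for a single observation, that observation has $(M_n)_{i,i} = 0$ and $\hat v_{i,n} = 0$, and dropping it (together with its dummy) leaves $\hat\beta_n$ unchanged. The data and helper names below are invented for illustration only.

```python
# Toy check: a dummy that is one for a single observation gives that
# observation zero "residual leverage" and no influence on beta_hat.
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.standard_normal((n, 1))
W = np.column_stack([np.ones(n), np.zeros(n)])
W[0, 1] = 1.0                               # dummy equal to one for observation 0 only
y = x[:, 0] + rng.standard_normal(n)

def fit(y, x, W):
    M = np.eye(len(y)) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v = M @ x
    return M, v, np.linalg.solve(v.T @ v, v.T @ y)

M, v, beta_full = fit(y, x, W)
print(M[0, 0], v[0])                        # both are (numerically) zero
_, _, beta_drop = fit(y[1:], x[1:], W[1:, :1])   # drop obs 0 and its dummy
assert np.allclose(beta_full, beta_drop)
```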

Additional conditions are needed to show that $\acute\Sigma_n$ is consistent. We let
\[
\tilde Q_{i,n} := \sum_{j=1}^n (M_n)_{i,j}\, Q_{j,n}
\]
and collect one such set of conditions in the following assumption.

Assumption 4

(Variance estimation). $n\varrho_n = O(1)$,
\[
\Pr\Big(\min_i (M_n)_{i,i} > 0\Big) \to 1, \qquad \frac{1}{\min_i (M_n)_{i,i}} = O_p(1), \qquad \frac{\sum_{i=1}^n \|\tilde Q_{i,n}\|^4}{n} = O_p(1),
\]
and $\max_i \|\mu_{i,n}\|/\sqrt{n} = o_p(1)$.

The first part of Assumption 4 is a small-bias condition; it is relevant only when (1) is misspecified, in the sense that $e_{i,n} \neq 0$. In that case it is a strengthening of Assumption 3 only when $\chi_n = o(1)$. The conditions on the diagonal entries of the projection matrix are very weak. Providing primitive conditions for them in great generality appears to be difficult. However, when $w_{i,n}$ is multivariate normal they follow under $\limsup_{n\to\infty} q_n/n < 1$ as stated in Assumption 2, in the same way as in Cattaneo, Jansson, and Newey (2018). In the one-way panel model they hold automatically, while Verdier (2020) gives sufficient conditions for them to be satisfied in the two-way model. To understand why the last part of Assumption 4 is needed, note that the (conditional) variance of $y_{i,n}\,\acute u_{i,n}$ depends on $\mu_{i,n}^2$. The requirement that $\max_i \mu_{i,n}^2 = o_p(n)$ allows us to control the variance of $\acute\Sigma_n$. Weak moment requirements typically suffice for this condition to be satisfied. The condition on $\tilde Q_{i,n}$ is used in concordance with the condition on $\mu_{i,n}$. One simple sufficient condition for it is that $n\chi_n = O(1)$, but it can also be satisfied when $\chi_n$ is only $O(1)$. Primitive conditions for Assumption 4 in three special cases are given below.

We can now state our consistency result.

Theorem 1

(Inference). Let Assumptions 1–4 hold. Then
\[
\Sigma_n^{-1}\acute\Sigma_n \overset{p}{\to} I_r
\]
as $n \to \infty$.

Theorem 1, combined with the limit result in (2), implies that
\[
\acute\Omega_n^{-1/2}(\hat\beta_n - \beta) \overset{d}{\to} N(0, I_r)
\]
as $n \to \infty$, where
\[
\acute\Omega_n := \Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\Big)^{-1}\Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\big(y_{i,n}\,\acute u_{i,n}\big)\Big)\Big(\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\Big)^{-1}.
\]

This result permits the construction of test statistics that (in large samples) will have correct size and of confidence regions that will exhibit correct coverage.
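Putting the pieces together, here is a hedged end-to-end sketch of the resulting inference procedure: it forms $\acute\Sigma_n$ with the weights $y_{i,n}\,\acute u_{i,n}$, sandwiches it into $\acute\Omega_n$, and reports a normal-based 95% confidence interval for one slope. The function name `hca_inference` is hypothetical, and the code assumes that observations with $(M_n)_{i,i} = 0$ have already been dropped.

```python
# Sketch of the proposed inference procedure (names and layout assumed).
import numpy as np

def hca_inference(y, X, W, coef=0):
    n = len(y)
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v = M @ X                                   # partialled-out regressors v_hat
    bread = np.linalg.inv(v.T @ v)
    beta = bread @ (v.T @ y)
    u_hat = M @ (y - X @ beta)                  # least-squares residuals
    m = np.diag(M)
    if m.min() <= 1e-10:
        raise ValueError("drop observations with (M_n)_{i,i} = 0 first")
    wt = y * (u_hat / m)                        # y_i * u'_i with u'_i = u_hat_i / (M_n)_{i,i}
    Omega = bread @ (v.T @ (v * wt[:, None])) @ bread
    se = np.sqrt(Omega[coef, coef])
    return beta[coef], se, (beta[coef] - 1.96 * se, beta[coef] + 1.96 * se)
```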

3 Connections to the Literature

3.1 HC-Class Estimators

The bias in the Eicker (1963, 1967) and White (1980) estimator has led to a variety of proposed modifications that, following MacKinnon and White (1985), are often referred to as the HC-class of covariance-matrix estimators. These estimators are reviewed in Long and Ervin (2000) and MacKinnon (2012). Unfortunately, as shown by Cattaneo, Jansson, and Newey (2018, Theorem 3), none of these alternatives is consistent, in general, when $q_n/n \not\to 0$. We briefly review their main findings on these estimators here.

The first variance estimator, HC1, differs from the conventional estimator, HC0, in that it performs a degrees-of-freedom correction (Hinkley 1977, eq. (2.11)). This estimator will be consistent in the special case where errors are homoscedastic and the covariate design is balanced, that is, when $(M_n)_{1,1} = \cdots = (M_n)_{n,n}$.

The second variance estimator, HC2, uses
\[
\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\big(\hat u_{i,n}\,\acute u_{i,n}\big)
\]
as an estimator of $\Sigma_n$ (Horn, Horn, and Duncan 1975; see note 2). This estimator will be consistent under homoscedasticity. Because
\[
y_{i,n}\,\acute u_{i,n} = \hat u_{i,n}\,\acute u_{i,n} + \hat y_{i,n}\,\acute u_{i,n}, \qquad \hat y_{i,n} := y_{i,n} - \hat u_{i,n},
\]
where the $\hat y_{i,n}$ are fitted values, $\acute\Sigma_n$ can be interpreted as a bias-corrected version of the HC2 estimator.

The third variance estimator, HC3, is constructed with
\[
\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\big(\acute u_{i,n}\,\acute u_{i,n}\big)
\]

(MacKinnon and White 1985). While this estimator is inconsistent, its probability limit exceeds $\Sigma_n$ (in the matrix sense). It follows that (under Assumptions 1–3) test procedures based on HC3 will be asymptotically conservative when $\limsup_{n\to\infty} q_n/n \in (0,1)$, both under homoscedasticity and heteroscedasticity.
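For comparison, the sketch below computes the HC0–HC3 weightings described in this subsection, with the projection built from the controls only (as in note 2). The HC1 degrees-of-freedom factor $n/(n - q_n - r)$ is one common form of the correction, and the function and variable names are ours; this is an illustrative rendering, not the authors' code.

```python
# Sketch of the HC-class estimators of Sigma_n discussed above (names assumed).
import numpy as np

def hc_sigmas(y, X, W):
    """Return HC0-HC3 estimates of Sigma_n as a dict of r x r arrays."""
    n = len(y)
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v = M @ X
    beta = np.linalg.solve(v.T @ v, v.T @ y)
    u = M @ (y - X @ beta)
    m = np.diag(M)                              # (M_n)_{i,i}
    q, r = W.shape[1], X.shape[1]
    weights = {
        "HC0": u**2,
        "HC1": u**2 * n / (n - q - r),          # degrees-of-freedom correction
        "HC2": u**2 / m,                        # u_hat_i * u'_i
        "HC3": (u / m) ** 2,                    # u'_i * u'_i
    }
    return {name: v.T @ (v * wt[:, None]) for name, wt in weights.items()}
```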

3.2 Bias-Corrected Estimation

Cattaneo, Jansson, and Newey (2018) also considered the estimator
\[
\grave\Sigma_n := \sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\Big(\sum_{j=1}^n \big((M_n * M_n)^{-1}\big)_{i,j}\,\hat u_{j,n}^2\Big),
\]
where $M_n * M_n$ denotes the elementwise product of the matrix $M_n$ with itself. This estimator has its origins in work by Hartley, Rao, and Kiefer (1969) and Rao (1970) and can be motivated through an (asymptotic) bias calculation; see also Bera, Suprayitno, and Premaratne (2002), and Anatolyev (2018) for a refinement under homoscedasticity. For $\grave\Sigma_n$ to be well defined, $M_n * M_n$ needs to be nonsingular. Necessary and sufficient conditions for this to be the case are stated in Mallela (1972) but these are neither simple nor intuitive (Horn, Horn, and Duncan 1975). As noted by Horn and Horn (1975), a simple sufficient condition is that
\[
\min_i (M_n)_{i,i} > \tfrac{1}{2}.
\]

Depending on the problem at hand it may also be necessary; an example is the one-way panel model.

Cattaneo, Jansson, and Newey (2018, Theorem 4) show that $\grave\Sigma_n$ is consistent for $\Sigma_n$ if
\[
\Pr\Big(\min_i (M_n)_{i,i} > \tfrac{1}{2}\Big) \to 1, \qquad \frac{1}{\min_i (M_n)_{i,i} - \tfrac{1}{2}} = O_p(1), \qquad (3)
\]
are added to Assumptions 1–3. Because $\sum_{i=1}^n (M_n)_{i,i} = n - q_n$, we have $\min_i (M_n)_{i,i} \le 1 - q_n/n$, and so $\limsup_{n\to\infty} q_n/n < \tfrac{1}{2}$ is required for (3) to be satisfied. This, in turn, is a strengthening of the condition that $\limsup_{n\to\infty} q_n/n < 1$ in Assumption 2.
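In code, the bias-corrected estimator $\grave\Sigma_n$ amounts to solving a linear system in the Hadamard product $M_n * M_n$, which is what the nonsingularity requirement refers to. The sketch below is our own illustrative rendering (the function name `hck_sigma` is hypothetical), not the authors' implementation.

```python
# Sketch of the bias-corrected (HCK-type) estimator of Sigma_n (names assumed).
import numpy as np

def hck_sigma(y, X, W):
    """Requires the elementwise product M_n * M_n to be nonsingular."""
    n = len(y)
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v = M @ X
    beta = np.linalg.solve(v.T @ v, v.T @ y)
    u = M @ (y - X @ beta)
    MM = M * M                                  # elementwise (Hadamard) product
    # min_i (M_n)_{i,i} > 1/2 is a simple sufficient condition for invertibility
    sigma2 = np.linalg.solve(MM, u**2)          # joint estimates of sigma^2_{1,n}, ..., sigma^2_{n,n}
    return v.T @ (v * sigma2[:, None])
```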

Stock and Watson (2008) proposed a covariance-matrix estimator for linear fixed-effect models that is applicable to short panel data. It is based on an explicit calculation of the probability limit of $\hat\Sigma_n - \Sigma_n$, adjusting the inconsistent $\hat\Sigma_n$ by subtracting from it a plug-in estimator of this limit quantity. As discussed in Cattaneo, Jansson, and Newey (2018), $\grave\Sigma_n$ can be understood to be a generalization of this approach to the generic setting where $q_n/n \not\to 0$.

Like $\acute\Sigma_n$, the estimator $\grave\Sigma_n$ need not be positive semidefinite in small samples. Being based on estimators of the individual $\sigma_{i,n}^2$ that are linear combinations of $\hat u_{1,n}^2, \ldots, \hat u_{n,n}^2$, it does, however, retain the invariance to changes in $\mu_{i,n}$ that is inherent in the HC-class estimators. This is a desirable feature and explains why a restriction on the magnitude of $\mu_{i,n}$ is not needed for this estimator to be consistent. Furthermore, jointly estimating $\sigma_{1,n}^2, \ldots, \sigma_{n,n}^2$ in this way is attractive from an efficiency point of view, as it is asymptotically equivalent to a minimum-norm unbiased estimator (see Hartley, Rao, and Kiefer 1969; Rao 1970 for details).

3.3 Inference on Variance Components

The variance estimator $\acute\Sigma_n$ is closely related to the work of Kline, Saggio, and Sølvsten (2019). In our context, their proposal is to estimate each $\sigma_{i,n}^2$ by the cross-fit (Newey and Robins 2018) estimator
\[
y_{i,n}\,\check u_{i,n},
\]
where $\check u_{i,n}$ is the residual for observation $i$ when the slope coefficients are estimated from the sample from which the $i$th observation has been omitted. When the regression model is correctly specified—that is, when $e_{i,n} = 0$—then $y_{i,n}\,\check u_{i,n}$ is (conditionally) unbiased, provided that the leave-own-out regression slopes are well defined. This will be the case if maximal leverage is bounded away from one (see note 3). In this case, from the Sherman–Morrison formula (as in, e.g., Miller 1974),
\[
y_{i,n}\,\check u_{i,n} = y_{i,n}\,\acute u_{i,n} + o_p(1)
\]
under our assumptions, connecting the cross-fit estimator of Kline, Saggio, and Sølvsten (2019) to our approach.
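The Sherman–Morrison connection can be checked numerically. With fixed regressors, the leave-own-out residual equals the full-sample residual divided by the corresponding diagonal entry of the annihilator matrix; the toy below verifies this exact identity using the full regressor matrix, as in Kline, Saggio, and Sølvsten (2019) (our $\acute u_{i,n}$ projects out only $w_{i,n}$, so the link to it is only approximate, as stated above). Data and names are illustrative.

```python
# Numerical check: leave-one-out residual = full-sample residual / (1 - leverage).
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5
Z = rng.standard_normal((n, k))            # all regressors (x_i and w_i stacked)
y = Z @ rng.standard_normal(k) + rng.standard_normal(n)

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T       # hat matrix; the annihilator is I - H
u_hat = y - H @ y
u_loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.solve(Z[mask].T @ Z[mask], Z[mask].T @ y[mask])
    u_loo[i] = y[i] - Z[i] @ b_i           # residual from the leave-own-out fit
assert np.allclose(u_loo, u_hat / (1 - np.diag(H)))
```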

Kline, Saggio, and Sølvsten (2019) used this device to construct unbiased estimators of quadratic forms and presented several applications. One of these (Kline, Saggio, and Sølvsten 2019, Remark 4 and Lemma 5) is a consistency result for the implied covariance-matrix estimator
\[
\sum_{i=1}^n \hat v_{i,n}\hat v_{i,n}'\,\big(y_{i,n}\,\check u_{i,n}\big)
\]
in designs with fixed regressors where errors are independent and maximal leverage is bounded away from unity. This result is established under the assumption that $e_{i,n} = 0$ and that $\max_i \|\mu_{i,n}\|^2 = O(1)$, together with standard regularity conditions such as those in Assumption 2. It follows that our variance estimator can be seen as a modification of theirs that is targeted to a setting with many control variables. Theorem 1 may then be understood to be an extension of their Lemma 5 to settings with stochastic regressors and (vanishing) misspecification bias, where the regressors can have unbounded support and observations may depend on a growing number of parameters.

The implied interpretation of $\acute\Sigma_n$ as an (approximate) cross-fit estimator is also useful to highlight an apparent tension between invariance of a covariance-matrix estimator to changes in $\mu_{i,n}$ and the possibility for it to be consistent when $\limsup_{n\to\infty} q_n/n > \tfrac{1}{2}$. A cross-fit estimator of $\sigma_{i,n}^2$ that is invariant is necessarily of the form $\dot u_{i,n}\,\ddot u_{i,n}$, where $\dot u_{i,n}$ and $\ddot u_{i,n}$ are two least-squares residuals, each one obtained from a (conditionally) independent subsample that excludes the $i$th observation. Because these auxiliary regressions can be based on at most $\lfloor (n-1)/2 \rfloor$ and $\lceil (n-1)/2 \rceil$ observations, such an approach cannot accommodate situations where $q_n/n > \tfrac{1}{2}$. The alternative estimators $y_{i,n}\,\check u_{i,n}$ and $y_{i,n}\,\acute u_{i,n}$ circumvent the need for two independent estimators of $u_{i,n}$ by using the level of the outcome variable, $y_{i,n}$, as a proxy for $u_{i,n}$. This allows us to deal with cases where $\limsup_{n\to\infty} q_n/n > \tfrac{1}{2}$ but makes the variance estimator sensitive to $\mu_{i,n}$.

4 Examples

We next provide more primitive conditions in three special cases that fit our general setup. We focus on sufficient conditions for Assumption 4. Cattaneo, Jansson, and Newey (2018) already gave such conditions for the other assumptions—and, notably, for Assumption 3—to hold.

4.1 Partially Linear Model

Suppose that observations on $(y_i, x_i, z_i)$ are independent and identically distributed. The partially linear regression model states that
\[
y_i = x_i'\beta + \varphi(z_i) + \varepsilon_i, \qquad E(\varepsilon_i|x_i, z_i) = 0,
\]
for an unknown function $\varphi$. A series approximation of order $\kappa_n$ of $\varphi(z_i)$ takes the form $w_{i,n}'\gamma_n$ for $w_{i,n} = (p_1(z_i), \ldots, p_{\kappa_n}(z_i))'$ and $\{p_1, \ldots, p_{\kappa_n}\}$ a collection of basis functions such as orthogonal polynomials. Our estimator $\hat\beta_n$, then, is the least-squares estimator of $\beta$ in
\[
y_{i,n} = x_i'\beta + w_{i,n}'\gamma_n + u_{i,n}, \qquad u_{i,n} = \varepsilon_i + \varphi(z_i) - w_{i,n}'\gamma_n.
\]

Note that $E(u_{i,n}|x_i, z_i) \neq 0$, in general. Consistency of $\hat\beta_n$ requires that $\kappa_n \to \infty$ (and, thus, that $q_n \to \infty$) as $n \to \infty$. Here,
\[
\varrho_n = \min_\gamma E\big(\|\varphi(z_i) - w_{i,n}'\gamma\|^2\big), \qquad \chi_n = \min_\delta E\big(\|E(x_{i,n}|w_{i,n}) - \delta w_{i,n}\|^2\big).
\]

In this example, the number of covariates can be large when the dimension of $z_i$ is large, so that many terms are included in the approximation even for small $\kappa_n$, or when the underlying functions are not (assumed to be) very smooth, so that a large $\kappa_n$ needs to be used to control bias.

Standard smoothness conditions on $\varphi$ imply that $n\varrho_n = O(1)$ (Newey 1997), yielding the first condition of Assumption 4. The fourth condition can be validated in the same way, by imposing sufficient smoothness on $E(x_{i,n}|w_{i,n})$ to get $n\chi_n = O(1)$, which implies that $\sum_{i=1}^n \|\tilde Q_{i,n}\|^4 = O_p(n)$. This is different from (and stronger than) the first set of primitive conditions discussed by Cattaneo, Jansson, and Newey (2018) to validate Assumption 3, where more smoothness in $E(x_{i,n}|w_{i,n})$ can be used to compensate for less smoothness in $\varphi$. Alternatively, if $\chi_n \le E(\|E(x_{i,n}|w_{i,n})\|^2) = O(1)$ and a partitioning estimator (Cattaneo and Farrell 2013) is used to approximate $\varphi$, then $M_n$ is a band matrix. This is also sufficient to reach the desired result. Moreover, in this case, the first and fourth condition of Assumption 4 are implied by the rate requirements in Assumption 3, and by the second set of restrictions for this example given in Cattaneo, Jansson, and Newey (2018). Next, simple sufficient conditions for the requirement that $\max_i \|\mu_{i,n}\|/\sqrt{n} = o_p(1)$ are the moment conditions $E(\|x_i\|^{2+\theta}) = O(1)$ and $E(\|\varphi(z_i)\|^{2+\theta}) = O(1)$ for some $\theta > 0$. While these two conditions are not imposed in Cattaneo, Jansson, and Newey (2018), they would not appear to be overly strong.
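A small sketch of the series approach in this example: build a polynomial basis in $z_i$ of order $\kappa_n$, then apply the partialled-out estimator and the proposed variance estimator. The basis choice (raw powers of a scalar $z_i$) and the function name `series_plm` are illustrative assumptions; in practice orthogonal polynomials or splines may be preferable numerically.

```python
# Series sketch for the partially linear model (basis and names assumed).
import numpy as np

def series_plm(y, x, z, order):
    """Estimate beta in y_i = x_i*beta + phi(z_i) + eps_i with a power series in z_i."""
    n = len(y)
    X = x.reshape(n, -1)
    W = np.column_stack([z**p for p in range(order + 1)])   # 1, z, z^2, ..., z^order
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    v = M @ X
    bread = np.linalg.inv(v.T @ v)
    beta = bread @ (v.T @ y)
    u_acc = (M @ (y - X @ beta)) / np.diag(M)               # u'_{i,n}
    Sigma = v.T @ (v * (y * u_acc)[:, None])
    return beta, bread @ Sigma @ bread                      # beta_hat and Omega'_n
```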

4.2 One-Way Model for Panel Data

For double-indexed data $(y_{(g,m)}, x_{(g,m)})$, the fixed-effect model is
\[
y_{(g,m)} = x_{(g,m)}'\beta + \alpha_g + \varepsilon_{(g,m)}, \qquad g = 1, \ldots, G_n, \quad m = 1, \ldots, M,
\]
where $\alpha_g$ is a group-specific intercept and we assume that $E(\varepsilon_{(g,m)}|x_{(g,1)}, \ldots, x_{(g,M)}) = 0$. The regressors $x_{(g,m)}$ are assumed independent between groups but may be dependent within each group. The errors $\varepsilon_{(g,m)}$ are independent between groups and (conditionally) uncorrelated within groups. The usual asymptotic embedding here has $G_n \to \infty$ with $M = O(1)$. The number of fixed effects grows at the same rate as the sample size; we have $n = G_n \times M$ and $q_n = G_n$, so that $q_n/n = 1/M$, which does not vanish. These conditions fit Assumption 1.

The fixed-effect estimator is the least-squares estimator of $y_{(g,m)}$ on $x_{(g,m)}$ and $G_n$ dummy variables that indicate group membership. The estimated coefficients on these dummies are computed from $M$ observations and are not consistent under our asymptotic approximation. In this example $M_n = I_{G_n} \otimes T_M$, where $(T_M)_{m,m'} := \{m = m'\} - M^{-1}$ is the $M \times M$ matrix that transforms observations into deviations from their within-group mean. Consequently,
\[
\acute\Sigma_n = \frac{M}{M-1} \sum_{g=1}^{G_n} \sum_{m=1}^{M} \tilde x_{(g,m)}\, \tilde x_{(g,m)}'\, y_{(g,m)}\, \big(\tilde y_{(g,m)} - \tilde x_{(g,m)}'\hat\beta_n\big),
\]
where $\tilde y_{(g,m)} := \sum_{m'=1}^{M} (T_M)_{m,m'}\, y_{(g,m')}$, and $\tilde x_{(g,m)}$ and $\tilde\varepsilon_{(g,m)}$ are defined in the same way. Further, using that $\hat\beta_n = \beta + o_p(1)$,
\[
\acute\Sigma_n = \frac{M}{M-1} \sum_{g=1}^{G_n} \sum_{m=1}^{M} \tilde x_{(g,m)}\, \tilde x_{(g,m)}' \Big(\varepsilon_{(g,m)}\tilde\varepsilon_{(g,m)} + \big(x_{(g,m)}'\beta + \alpha_g\big)\,\tilde\varepsilon_{(g,m)}\Big) + o_p(G_n).
\]

Because $E(\varepsilon_{(g,m)}\tilde\varepsilon_{(g,m)}|x_{(g,1)}, \ldots, x_{(g,M)}) = \sigma_{(g,m)}^2\,(M-1)/M$, the first term constitutes an unbiased estimator of $\Sigma_n$. The second term on the right-hand side is mean zero because the errors are mean-independent of the regressors. Its variance, however, depends on $\alpha_1, \ldots, \alpha_{G_n}$.

In a two-wave panel,
\[
\acute\Sigma_n = \frac{1}{4} \sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\, \big(\Delta y_g - \Delta x_g'\hat\beta_n\big)\, \Delta y_g,
\]
where $\Delta$ denotes the first-difference operator; so, for example, $\Delta y_g := y_{(g,2)} - y_{(g,1)}$. In this case, the variance estimator does not depend on the fixed effects. The factor $\tfrac{1}{4}$ appears because of the de-meaning and is inconsequential. The implied estimator of the covariance matrix $\Omega_n$ is
\[
\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\Big)^{-1}\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\, \big(\Delta y_g - \Delta x_g'\hat\beta_n\big)\, \Delta y_g\Big)\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\Big)^{-1}.
\]

The same estimator would be obtained if our procedure were applied directly to the first-differenced model $\Delta y_g = \Delta x_g'\beta + \Delta\varepsilon_g$. The standard estimator of Eicker (1963, 1967) and White (1980) applied to this model is
\[
\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\Big)^{-1}\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\, \big(\Delta y_g - \Delta x_g'\hat\beta_n\big)^2\Big)\Big(\sum_{g=1}^{G_n} \Delta x_g\, \Delta x_g'\Big)^{-1},
\]
and is known to be consistent here.
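In code, the two-wave case reduces to the simple expressions above. The sketch below computes $\hat\beta_n$ and the implied covariance-matrix estimator directly from the first-differenced data; the array layout (one row per group, one column per wave) and the function name are our own conventions.

```python
# Two-wave panel sketch: HCA estimator applied to first-differenced data.
import numpy as np

def two_wave_hca(y, X):
    """y: (G, 2) outcomes; X: (G, 2, r) regressors. Returns (beta_hat, Omega_hat)."""
    dy = y[:, 1] - y[:, 0]                  # Delta y_g
    dX = X[:, 1, :] - X[:, 0, :]            # Delta x_g
    bread = np.linalg.inv(dX.T @ dX)
    beta = bread @ (dX.T @ dy)
    wt = (dy - dX @ beta) * dy              # (Delta y_g - Delta x_g' beta_hat) * Delta y_g
    Sigma = dX.T @ (dX * wt[:, None])
    return beta, bread @ Sigma @ bread
```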

In the one-way panel model our Assumption 4 holds provided that
\[
\frac{\sum_{g=1}^{G_n} \|\alpha_g\|^{2+\theta}}{G_n} = O(1)
\]
and $\max_g \max_m E(\|x_{(g,m)}\|^{2+\theta}) = O(1)$ for some $\theta > 0$. Given that Cattaneo, Jansson, and Newey (2018) imposed that $\max_g \max_m E(\|x_{(g,m)}\|^4) = O(1)$ to validate Assumption 2, the condition on the fixed effects is the only additional requirement needed for our variance estimator to be consistent in this model.

4.3 Linear Model With Increasing Dimension

Finally, consider the regression model that takes (1) as the data generating process for independent and identically distributed observations $(y_{i,n}, x_{i,n}, w_{i,n})$, that is,
\[
y_{i,n} = x_{i,n}'\beta + w_{i,n}'\gamma_n + u_{i,n},
\]
with $n\, E(\|E(u_{i,n}|x_{i,n}, w_{i,n})\|^2) = n\varrho_n = o(1)$, as in Cattaneo, Jansson, and Newey (2018). The generic nature of this model makes it difficult to specify a single set of simple primitive conditions for our Assumption 4 to hold. First consider the requirement that $\sum_{i=1}^n \|\tilde Q_{i,n}\|^4 = O_p(n)$. A sufficient condition here is that
\[
n\, \chi_n = n\, \min_\delta E\big(\|E(x_{i,n}|w_{i,n}) - \delta w_{i,n}\|^2\big) = O(1).
\]

This will be the case, for example, when $w_{i,n}$ is discrete and a saturated regression model is used. It will also hold under smoothness conditions on the function $E(x_{i,n}|w_{i,n})$ when the $w_{i,n}$ are approximating functions, as discussed above. Alternatively, if $\chi_n = O(1)$, a sparsity condition on the projection matrix $M_n$ can be used. One such condition is $\max_i n_{i,n} = O_p(1)$, together with $\sum_{i=1}^n E(\|Q_{i,n}\|^4) = O(n)$, where we have introduced $n_{i,n} := \sum_{j=1}^n \{(M_n)_{i,j} \neq 0\}$. Cattaneo, Jansson, and Newey (2018) showed that the weaker condition $\max_i n_{i,n} = o_p(n^{1/3})$ can be used to support Assumption 3 under the same moment condition on the $Q_{i,n}$. Such a rate would appear difficult to obtain here. Finally, for $\max_i \|\mu_{i,n}\|/\sqrt{n} = o_p(1)$ to hold we again impose that $E(\|x_{i,n}\|^{2+\theta}) = O(1)$ for some $\theta > 0$. When the $w_{i,n}$ are approximating functions, a similar moment condition on the function that is being approximated will again suffice. In the case where the $w_{i,n}$ are just many control variables included in the regression we can again use a sparsity condition. Let $\kappa_{i,n}$ denote the number of nuisance parameters on which $y_{i,n}$ depends. Then one alternative sufficient condition is that
\[
\max_i \kappa_{i,n} = O\!\big(n^{\frac{1}{2}\frac{\theta}{2+\theta}}\big)
\]
for some $\theta > 0$, together with the assumption that the entries of $w_{i,n}$ have $2+\theta$ moments. A condition on $\kappa_{i,n}$ is different from a condition on $n_{i,n}$, as it only pertains to the regression of $y_{i,n}$ on $x_{i,n}$ and $w_{i,n}$ and does not restrict the auxiliary regression of $x_{i,n}$ on $w_{i,n}$. When the covariates are normally distributed or have bounded support, for example, we can allow for $\max_i \kappa_{i,n} = O(n)$.

5 Numerical Illustrations

5.1 Simulations

We present numerical results for a setup taken from Cattaneo, Jansson, and Newey (2018). Data are generated as
\[
y_{i,n} = x_{i,n}\beta + w_{i,n}'\gamma_n + \varepsilon_{i,n},
\]
where $x_{i,n} \overset{iid}{\sim} N(0,1)$, $w_{i,n}$ contains a constant term and a collection of $q_n - 1$ zero/one dummy variables, and $\varepsilon_{i,n} \overset{iid}{\sim} N(0,1)$. The dummy variables are drawn independently with success probability $\pi$ and $\gamma_n = 0$. The sample size was fixed to $n = 700$ throughout and we considered $q_n \in \{1, 71, 141, 211, 281, 351, 421, 491, 561, 631\}$. All statistics reported below were computed over 10,000 Monte Carlo replications and all the variables were redrawn in each replication.

We consider three designs that vary in $\beta$ and $\pi$. Design A is the design of Cattaneo, Jansson, and Newey (2018). It has $\beta = 1$ and $\pi = 0.02$ (see note 4). Each dummy variable takes on the value one for about 14 out of 700 observations, on average. Design B is a more sparse design where $\beta = 1$ is maintained and $\pi$ is reduced to 0.01, leading to each dummy switching on for only 7 observations in each replication, on average. Design C, in turn, sets $\beta = 2$ and maintains $\pi = 0.02$. This implies that the conditional variance of $y_{i,n}\,\acute u_{i,n}$ increases by a factor of four.
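For concreteness, one replication of such a design can be generated as in the sketch below. This is a reconstruction from the description above, not the authors' replication code; the function name `draw_design` and its defaults ($q_n = 281$, Design A parameter values) are our own choices.

```python
# Sketch of one replication of the simulation design described above.
import numpy as np

def draw_design(n=700, q=281, beta=1.0, pi=0.02, seed=None):
    """Return y, x (n x 1), and W (n x q: constant plus q-1 Bernoulli(pi) dummies)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    W = np.column_stack([np.ones(n), rng.binomial(1, pi, size=(n, q - 1))])
    y = beta * x + rng.standard_normal(n)       # gamma_n = 0, homoscedastic N(0,1) errors
    return y, x.reshape(-1, 1), W
```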

The results for the three design variations are presented in Tables 1–3. Empirical size of the two-sided t-test of the null that $\beta$ equals its true value and the average width of the corresponding confidence interval are given for all variance estimators discussed: HC0, its modifications HC1, HC2, and HC3, the bias-corrected estimator of Cattaneo, Jansson, and Newey (2018) (HCK), and the estimator presented here (HCA).

Table 1 Design A (β = 1 and π=0.02).

Table 2 Design B (β = 1 and π=0.01).

Table 3 Design C (β = 2 and π=0.02).

As the setup features homoscedastic errors, inference based on HC0 will be liberal for large $n$ when $q_n/n$ is not small (Chesher and Jewitt 1987; Cattaneo, Jansson, and Newey 2018). This is apparent from inspection of the tables. The degrees-of-freedom correction performed by HC1 alleviates most of this concern here. HC2, which is consistent under homoscedasticity, performs quite similarly to HC1. Both corrections do come with (on average) wider confidence intervals. The simulations also illustrate that HC3 yields conservative inference. The rejection frequency (under the null) approaches zero as $q_n/n$ increases. The confidence intervals are also wide. This implies that tests based on HC3 will suffer from low power. The relative inefficiency compared to HC1 and HC2 grows as $q_n/n$ increases.

In Design A, HCK gives close to correct size and confidence intervals of a length comparable to HC1 and HC2 for most of the values of $q_n/n$. As $q_n/n$ increases above $\tfrac{1}{2}$, this variance estimator does not always exist in each replication. When this happens in our Monte Carlo, HCK defaults to HC0. Consequently, for the larger values of $q_n/n$ in Table 1, the size of the HCK method increases above the nominal level and the average length of the confidence interval correspondingly shrinks relative to HC1 and HC2. In the sparser Design B, the nonexistence of the HCK estimator is more frequent and also arises for smaller values of $q_n/n$. This explains the large overrejection rates observed in Table 2. The results for Design C in Table 3 are similar to those in Table 1, as the performance of HCK is invariant to the scale of the regression slopes.

The simulation results also confirm the theory behind our variance estimator. It yields close to correct inference in all three designs and for all values of $q_n/n$ considered, although some more overrejection is observed for the top-end values of this ratio. The average length of its confidence interval is comparable to those obtained for HC1, HC2, and HCK and is substantially smaller than those for HC3 for large values of $q_n/n$. These results show that this estimator is useful in problems with many covariates when HCK is not available, for example, due to sparseness of the regressor design or because of the presence of high-leverage observations more generally.

The supplemental appendix contains additional simulation results for a partially linear regression model and for a one-way panel model.

5.2 Empirical Example

We next use the different variance estimators available to infer the union membership premium. The data are a balanced panel on 545 working individuals and span 8 years (1980–1987), giving a total of 4360 observations. They are taken from Vella and Verbeek (1998) and are available from the data archive of the Journal of Applied Econometrics (http://qed.econ.queensu.ca/jae/1998-v13.2/vella-verbeek/).

We estimate the union premium as the coefficient from a least-squares regression of log wages on a dummy for union membership, after partialling out a set of control variables. This set contains average hours worked per week, dummies for marital status and for poor health, a quadratic term in years of experience, and a large collection of dummies, as follows. First, as the data are a panel, both individual fixed effects and year fixed effects are included. Second, as the data contain information on the type of occupation (out of 9 categories) and the industry (out of a total of 12) in which the job is located, we also control for these by including occupation and sector dummies. Third, we allow for interaction effects between these categorical variables by including occupation-by-year, sector-by-year, and occupation-by-sector dummies as well as occupation-by-sector-by-year dummies. Baseline categories for the year, sector, and occupation dummies are chosen and their corresponding dummies are dropped so as to avoid a dummy-variable trap. Certain interactions of the occupation and sector dummies never take on the value one and so are likewise removed from the analysis. This gives a total of 1086 control variables relative to 4360 observations. Our point estimate of the union premium is 7.43%. This is in line with the literature (see, e.g., Jakubson 1991).

The standard errors on this point estimate, obtained by the various methods discussed, together with the implied 95% confidence intervals for the union premium, are collected in Table 4. A total of 330 observations have leverage that exceeds the one-half threshold. Thirty-six of these observations can be perfectly explained by the control variables because of the inclusion of the occupation-by-sector and the occupation-by-sector-by-year dummies and so contain no information on the union premium. The HCK variance estimator could not be computed here. (We note that the invertibility condition also fails after dropping the 36 non-informative observations.) This explains why an entry for this estimator is not available. HC0 yields the smallest standard error on our point estimate (1.72%). HC3 yields the largest (2.32%). The standard error of HCA (1.93%) is roughly in the middle of these two bounds. HC1 and HC2 give similar, slightly larger, standard errors than HCA (1.99% and 1.98%).

Table 4 Inference on the union premium.

6 Conclusion

This article has presented a heteroscedasticity-robust covariance-matrix estimator for linear regression models that is consistent under an asymptotic scheme where the number of control variables, $q_n$, grows at the same rate as the sample size, $n$. The estimator is similar to the proposal of Kline, Saggio, and Sølvsten (2019), but our consistency result covers more general settings. The estimator complements work by Cattaneo, Jansson, and Newey (2018), who derived inconsistency results for members of the HC-class of variance estimators, proved asymptotic conservativeness of the HC3 estimator, and presented an alternative variance estimator (based on Hartley, Rao, and Kiefer 1969) that remains consistent when $\limsup_{n\to\infty} q_n/n < \tfrac{1}{2}$. Under a set of additional high-level conditions, our estimator allows this restriction to be weakened to $\limsup_{n\to\infty} q_n/n < 1$. Primitive conditions for these were given for partially linear models, fixed-effect panel data models, and generic regression models with increasing dimension. Simulations verify the theoretical properties. An empirical application to estimation of the union premium from panel data was also presented.

The idea underlying our variance estimator can be useful as a device to correct for bias more generally. Kline, Saggio, and Sølvsten (2019) utilized it to bias-correct quadratic forms in fixed-effect estimators. Chernozhukov et al. (2018) and Newey and Robins (2018) utilized related cross-fitting techniques to reduce bias in high-dimensional estimation problems that feature machine-learning estimators. Cattaneo, Jansson, and Ma (2019) characterized the bias in (nonlinear) two-step estimators when the first step features a high-dimensional linear regression.


Acknowledgments

An associate editor and three referees provided very constructive feedback on earlier versions of this article and pointed out a non sequitur in the derivation of a primitive condition. Matias Cattaneo generously shared and discussed his replication material. I am most grateful to each of them for their help.

Supplementary Materials

The supplementary material contains proofs, technical details, and additional simulation results.

Additional information

Funding

Financial support from the European Research Council through grant no. 715787 (MiMo) is gratefully acknowledged.

Notes

1 The leverage of an observation i is defined as the ith diagonal element of the hat matrix, that is, the matrix that transforms the observed outcomes into fitted values. It is bounded between zero and one and measures the influence of the observation on its own fitted value; a larger value reflects a higher influence (see, e.g., Hoaglin and Welsch 1978).

2 We follow Cattaneo, Jansson, and Newey (2018) and construct the HC-class variance estimators using the projection matrix $M_n$. The original proposals were made in a context where $q_n$ is treated as fixed and did not differentiate between $x_{i,n}$ and $w_{i,n}$; they used the annihilator matrix that projects out both sets of variables. The difference is asymptotically negligible under our assumptions.

3 This is a slight strengthening of our requirement in Assumption 4, as we project out only $w_{i,n}$ to obtain $\acute u_{i,n}$, while Kline, Saggio, and Sølvsten (2019) projected out both the regressors $x_{i,n}$ and the covariates $w_{i,n}$ to obtain $\check u_{i,n}$.

4 The description of the simulation design in Cattaneo, Jansson, and Newey (2018, p. 1358) contains a typo that would imply that $\pi = 0.0062$.

References