Research Article

Testing for Equivalence of Pre-Trends in Difference-in-Differences Estimation


Abstract

The plausibility of the “parallel trends assumption” in Difference-in-Differences estimation is usually assessed by a test of the null hypothesis that the difference between the average outcomes of both groups is constant over time before the treatment. However, failure to reject the null hypothesis does not imply the absence of differences in time trends between both groups. We provide equivalence tests that allow researchers to find evidence in favor of the parallel trends assumption and thus increase the credibility of their treatment effect estimates. While we motivate our tests in the standard two-way fixed effects model, we discuss simple extensions to settings in which treatment adoption is staggered over time.

1 Introduction

In the classic case, the Difference-in-Differences (DiD) framework consists of two groups observed over two periods of time, where the "treatment group" is untreated in the initial period and has received a treatment in the second period, whereas the "control group" is untreated in both periods. The key condition under which the DiD estimator yields sensible point estimates of the average treatment effect on the treated is known as the "parallel paths" or "parallel trends assumption," henceforth referred to as PTA, which states that in the absence of treatment both groups would have experienced the same temporal trends in the outcome variable on average. If pre-treatment observations are available for both groups, the plausibility of this assumption is typically assessed by plots accompanied by a formal testing procedure showing that there is no evidence of differences in trends over time between the treatment and the control group. However, traditional pre-tests can suffer from low power to detect violations of the PTA (Kahn-Lang and Lang 2020). Thus, finding no evidence of differences in trends in finite samples does not imply that there are no differences in trends in the population. More concerningly, Roth (2022) points out that if differences in trends exist, then conditional on not detecting violations of parallel trends at the pre-testing stage, the bias of DiD estimators may be greatly amplified.

Given the severe consequences of falsely accepting the PTA, we propose that instead of testing the null hypothesis of "no differences in trends" between the treatment and the control group in the pre-treatment periods, one should apply a test for statistical equivalence. We provide three distinct types of equivalence tests that impose bounds on the maximum, the average, and the root mean square change over time in the group mean difference between treatment and control in the pre-treatment periods. Given a threshold below which deviations from the PTA can be considered negligible, these tests allow the researcher to provide statistical evidence in favor of the PTA, thus increasing its credibility. If no sensible equivalence threshold can be determined before analyzing the data, we propose to report the smallest equivalence threshold for which the null hypothesis of "non-negligible trend differences" can still be rejected at a given level of significance. Conceptually, this idea is similar to the "equivalence confidence interval" in Hartman and Hidalgo (2018) applied to a DiD setting. Our procedure reverses the burden of proof, since the data has to provide evidence in favor of similar trends in the treatment and the control group, which is arguably more appropriate for an assumption as crucial to the DiD framework as the PTA. Furthermore, the power to reject the null hypothesis of "non-negligible differences" is increasing with the sample size (also see Hartman and Hidalgo 2018). This improves upon the current practice of testing the null hypothesis of "no difference," since large samples increase the chances of rejecting this null hypothesis (and thus seemingly making the DiD framework inapplicable), even if the true difference between treatment and control may be negligible in the given context. Finally, our equivalence test statistics can easily be implemented in practice.
While we motivate our tests in the standard two-way fixed effects model with panel data, we discuss how our tests can be applied in situations where treatment timing differs across groups (i.e., “staggered treatment assignment”) or where average treatment effects depend on some observable characteristics.

As we use equivalence tests, our article is closely related to Bilinski and Hatfield (2020), who provide a discussion on the benefits of using equivalence (or "non-inferiority") tests when testing for violations of modeling assumptions. Their "one-step-up" approach is based on a non-inferiority test of treatment effect estimates obtained from a standard DiD model and from a model augmented with a particular violation of the parallel trends assumption (e.g., a linear trend). While both approaches stress the potential benefits of equivalence testing in DiD setups, a distinctive feature is that we do not necessarily focus on a particular violation of the PTA. As pointed out in Kahn-Lang and Lang (2020), including for instance group-specific linear time trends can lead to a loss in degrees of freedom and thus to a substantial loss in power. In contrast, our approach focuses on testing for any non-negligible differences between treatment and control in the pre-treatment periods. Our article is also related to other approaches that allow for certain deviations from exactly parallel trends. Rambachan and Roth (2023) partially identify the ATT under restrictions that impose that the post-treatment violation of parallel trends is not too large relative to the pre-treatment violation. By contrast, we propose tests for the null hypothesis that the pre-treatment violation of parallel trends is large. If the null is rejected, so that the pre-treatment violation is determined to be small, then the researcher may decide that violations of parallel trends can safely be ignored with minimal bias. Alternatively, the researcher could use the upper bound on the pre-treatment violation given by our approach as a way of determining reasonable bounds on the post-treatment violations of parallel trends, which could then be used as input to the partial identification frameworks of Rambachan and Roth (2023) or Manski and Pepper (2018).

The rest of the article is organized as follows. Section 2 introduces the main two-way fixed effects model and discusses the widely used practice of testing for violations of the PTA. Equivalence tests and our hypotheses are discussed in Section 3. Section 4 introduces our assumptions and presents the test statistics for our hypotheses. In Section 5, we present extensions of the main model that allow for heterogeneous treatment effects due to differences in treatment timing or observable characteristics. Simulation evidence on the performance of our test procedures is provided in Section 6, while Section 7 contains an empirical illustration of our approach. Section 8 concludes. Finally, mathematical details and tables are collected in the Appendix.

2 Pre-Testing in the Canonical DiD Model

We initially contrast our test procedures with the usual pre-test in the canonical DiD case with only two groups and common treatment timing. The researcher observes a balanced panel of $n$ individuals recorded over $T+2$ periods of time. We assume that in the first $T+1$ periods none of the individuals is treated, whereas in period $T+2$ a subset has received treatment. Individual $i$ is a member of the treatment group if $G_i=1$ and a member of the control group if $G_i=0$. Following Kahn-Lang and Lang (2020), we refer to period $T+1$ as the "base period," as treatment effects are usually assessed by comparing post-treatment outcomes with the outcomes in the last pre-treatment period. The potential outcomes of unit $i$ in period $t$ when treated and in the absence of treatment are denoted by $Y_{i,t}(1)$ and $Y_{i,t}(0)$, respectively. The object of interest is the average treatment effect on the treated (ATT), given as $\pi_{ATT}:=E[Y_{i,T+2}(1)-Y_{i,T+2}(0)\mid G_i=1]$.

For identification of the ATT, we need to make several assumptions. First, we require that $\Pr(G_i=1)=p\in(0,1)$, which is an "overlap" condition that ensures that both treatment and control are non-empty (Sant'Anna and Zhao 2020). Next, we assume "no-anticipation," which rules out treatment effects before the actual treatment date. Adapting Borusyak, Jaravel, and Spiess (2023), we assume

$$E[Y_{i,t}\mid G_i]=E[Y_{i,t}(0)\mid G_i]+\pi_{ATT}\,G_i D_{T+2}(t),\quad t=1,\ldots,T+2, \tag{2.1}$$

with $D_l(t)=\mathbf{1}\{l=t\}$. This implies that the expected observed outcome coincides with the expected potential outcome corresponding to the actual treatment status, both for the control units and the eventually treated units. Next, let

$$E[Y_{i,t}(0)\mid G_i]=\alpha_i+\lambda_t+G_i\gamma_t, \tag{2.2}$$

where $\alpha_i,\lambda_t$, and $\gamma_t$ are some non-stochastic constants. Combining (2.1) with (2.2), we can write

$$Y_{i,t}=\alpha_i+\lambda_t+G_i\gamma_t+\pi_{ATT}\,G_i D_{T+2}(t)+u_{i,t}, \tag{2.3}$$

with $u_{i,t}=Y_{i,t}-E[Y_{i,t}\mid G_i]$. It is clear from (2.3) that $\pi_{ATT}$ is not identified without further restrictions on $\gamma_t$. The fundamental assumption that leads to the DiD estimator is the (augmented) PTA

$$E[Y_{i,t}(0)-Y_{i,T+1}(0)\mid G_i=1]-E[Y_{i,t}(0)-Y_{i,T+1}(0)\mid G_i=0]=0,\quad t=1,\ldots,T+2, \tag{2.4}$$

which implies that $\gamma_t-\gamma_{T+1}=0$ for all $t=1,\ldots,T+2$. Notice that this assumption over-identifies $\pi_{ATT}$, as identification only requires parallel trends between the post-treatment and the base period, that is, $\gamma_{T+2}-\gamma_{T+1}=0$. However, as it is often difficult to imagine circumstances under which the latter condition is satisfied whereas (2.4) is not, the augmented PTA can be useful for a pre-testing procedure that allows researchers to assess the plausibility of the PTA post-treatment. To do so, one uses the data from periods $1,\ldots,T+1$ to estimate the two-way fixed effects (TWFE) model

$$Y_{i,t}=\alpha_i+\lambda_t+\sum_{l=1}^{T}\beta_l\,G_i D_l(t)+u_{i,t}, \tag{2.5}$$

where $\beta_l=\gamma_l-\gamma_{T+1}$. Here, $\beta=(\beta_1,\ldots,\beta_T)'$ collects all "placebo" treatment effects, so that under (2.1) and (2.2), $\beta=0$ if and only if the augmented PTA holds.
As shown, for instance, in Baltagi (2021) or Wooldridge (2021), $\beta$ can be estimated by pooled OLS on "double-demeaned" data. Let $W_{i,t,l}=G_i D_l(t)$ and $W_{i,t}=(W_{i,t,1},\ldots,W_{i,t,T})'$. Double-demeaning (2.5) then yields

$$\ddot Y_{i,t}=\ddot W_{i,t}'\beta+\ddot u_{i,t}, \tag{2.6}$$

for $i=1,\ldots,n$ and $t=1,\ldots,T+1$, where $\ddot Y_{i,t}=Y_{i,t}-\bar Y_{i,\cdot}-\bar Y_{\cdot,t}+\bar Y_{\cdot\cdot}$, with $\bar Y_{i,\cdot}=\frac{1}{T+1}\sum_{t=1}^{T+1}Y_{i,t}$, $\bar Y_{\cdot,t}=\frac{1}{n}\sum_{i=1}^{n}Y_{i,t}$, $\bar Y_{\cdot\cdot}=\frac{1}{n(T+1)}\sum_{i=1}^{n}\sum_{t=1}^{T+1}Y_{i,t}$, and $\ddot W_{i,t}$ and $\ddot u_{i,t}$ defined analogously. Since $E[u_{i,t}\mid G_i]=0$ by construction, consistency and asymptotic normality of the pooled OLS estimator in (2.6), denoted by $\hat\beta$, follow under mild conditions. To find evidence against the plausibility of parallel trends, it is therefore common in applied economic research to test for individual significance (see Roth 2022), that is, for every $l\in\{1,\ldots,T\}$ we test

$$H_0:\beta_l=0\quad\text{versus}\quad H_1:\beta_l\neq 0. \tag{2.7}$$
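The estimation of the placebo effects via (2.6) can be sketched in a few lines; the following is a minimal numpy illustration on a simulated, noise-free balanced panel (all names and numbers are illustrative, not taken from the article):

```python
import numpy as np

def double_demean(A):
    """Subtract unit means and period means, then add back the grand mean."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

def placebo_estimates(Y, G):
    """Pooled OLS of double-demeaned Y on double-demeaned placebo dummies.

    Y : (n, T+1) outcomes for the pre-treatment periods 1, ..., T+1
    G : (n,) group indicators
    Returns beta_hat = (beta_1, ..., beta_T), the placebo effects
    relative to the base period T+1.
    """
    n, Tp1 = Y.shape
    T = Tp1 - 1
    # W[i, t, l] = G_i * 1{t == l} for pre-treatment periods l = 1, ..., T
    W = np.zeros((n, Tp1, T))
    for l in range(T):
        W[:, l, l] = G
    Ydd = double_demean(Y)
    Wdd = np.stack([double_demean(W[:, :, l]) for l in range(T)], axis=2)
    X = Wdd.reshape(n * Tp1, T)
    y = Ydd.reshape(n * Tp1)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Noise-free example: Y_{i,t} = alpha_i + lambda_t + G_i * gamma_t,
# so beta_l = gamma_l - gamma_{T+1} is recovered exactly.
gamma = np.array([0.5, -0.2, 0.3, 0.1])   # last entry corresponds to the base period
alpha = np.array([1.0, 2.0, -1.0, 0.5])
lam = np.array([0.0, 0.3, 0.6, 0.9])
G = np.array([1.0, 1.0, 0.0, 0.0])
Y = alpha[:, None] + lam[None, :] + G[:, None] * gamma[None, :]
beta_hat = placebo_estimates(Y, G)        # equals (0.4, -0.3, 0.2)
```

Because the unit and period effects are annihilated by the double-demeaning, the placebo effects are recovered exactly in the noise-free case.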

If the null hypothesis is rejected in a pre-treatment period, the PTA is deemed unreasonable, and consequently the DiD framework is often regarded as unsuitable in the corresponding context. This procedure has several shortcomings. For instance, a problematic common practice is to treat failure to reject the null hypothesis in (2.7) as evidence in favor of $H_0$, that is, one proceeds as if the null hypothesis were true and as if the PTA held. From a statistical point of view, this practice is incorrect as it neglects the type II error. In some cases, there may be differences in trends between both groups in the population that cannot be detected with traditional tests of (2.7) due to a lack of statistical power. Roth (2022) points out that ignoring these differences can amplify the bias and thus raise concerns of a "publication bias," since articles using a DiD identification argument are more likely to be deemed publishable when a test of (2.7) could not detect evidence against the PTA. Moreover, the DiD framework is sometimes used even when $H_0$ in (2.7) is rejected, as some statistically significant differences are deemed negligible in a given context. However, a potential threshold $U>0$ that quantifies what constitutes a negligible difference is often not adequately discussed. On the other hand, if the DiD framework is not applied when $H_0$ in (2.7) is rejected in at least one pre-treatment period, useful information may be lost if $U$ can be interpreted as a plausible "upper bound" for trend differences. For these reasons, the plausibility of the PTA as the fundamental modeling assumption of the DiD framework can be more convincingly assessed using statistical equivalence tests, as these tests address all of the above shortcomings of the current standard testing procedure.
For instance, to rewrite (2.7) in terms of statistical equivalence, for some $l\in\{1,\ldots,T\}$ one would define the equivalence threshold $U>0$ and test

$$H_0:|\beta_l|\ge U\quad\text{versus}\quad H_1:|\beta_l|<U, \tag{2.8}$$

so that rejecting $H_0$ yields evidence in favor of negligible trend differences between periods $l$ and $T+1$. While the hypothesis (2.8) can easily be tested with the "two one-sided tests" (TOST) procedure of Schuirmann (1987), we do not recommend this approach, as testing each pre-treatment period separately would lead to an accumulation of Type I error due to multiple testing. In the following, we elaborate on the benefits of equivalence tests and provide ways of summarizing the statistical evidence in favor of the PTA in the pre-treatment periods by formulating joint hypotheses.
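For illustration, the TOST idea behind (2.8) for a single period can be sketched with a normal approximation; the function and its numeric inputs below are hypothetical, using only the standard library:

```python
from statistics import NormalDist

def tost_equivalence(beta_hat, se, U, alpha=0.05):
    """Two one-sided tests (Schuirmann 1987) for H0: |beta_l| >= U.

    Rejects H0 (i.e., concludes |beta_l| < U) iff both one-sided
    tests reject at level alpha, using a normal approximation.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    # H0a: beta_l <= -U is rejected if (beta_hat + U) / se > z
    # H0b: beta_l >= +U is rejected if (beta_hat - U) / se < -z
    return (beta_hat + U) / se > z and (beta_hat - U) / se < -z

# A placebo estimate close to zero with a tight standard error
# supports equivalence at threshold U = 0.2 ...
r1 = tost_equivalence(0.01, se=0.05, U=0.2)   # True
# ... while a larger, noisier estimate does not.
r2 = tost_equivalence(0.15, se=0.10, U=0.2)   # False
```

Applying such a test period by period is exactly the practice the text advises against; the joint hypotheses of Section 3 avoid the multiple-testing problem.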

3 Testing for Equivalence

Equivalence testing is well known in biostatistics (see Berger and Hsu 1996 or Wellek 2010). While it has recently been considered in the statistical literature for the analysis of structural breaks (e.g., Dette and Wied 2014; Dette and Wu 2019; Dette, Kokot, and Aue 2020; or Dette, Kokot, and Volgushev 2020), it is less frequently used in econometrics. Instead of assuming that treatment and control are perfectly comparable unless there is strong evidence against this assumption, we suggest several testing procedures that explicitly require finding evidence in favor of the comparability of both groups. Each of the tests is based on an upper bound $U>0$ for changes in the group mean differences in the pre-treatment periods relative to the base period. There are two ways in which one can make use of the upper bound $U$. First, as in the "classic" use of equivalence tests, one can specify a threshold $U$ below which changes in the group mean differences over time are deemed negligible. Rejecting the null hypothesis that trend differences are larger than $U$ at level of significance $\alpha$ then implies that deviations from parallel trends in the pre-treatment periods are negligible at confidence level $1-\alpha$. The researcher may then interpret this as support for negligible differences in trends post-treatment and hence assume that the PTA holds. This procedure improves upon the current pre-test, as it requires an explicit rationalization of the threshold $U$ and sufficient data to support the assumption of negligible violations of the PTA. The choice of the threshold $U$ should thus reflect the specific scientific background of the application. In biostatistics, the popularity of equivalence tests has led to a consensus on sensible choices for $U$, and regulators frequently specify the equivalence thresholds that should be employed (see Wellek 2010 for a recent review).
We expect that with a more frequent adoption of equivalence testing in applied economics a similar consensus will be reached. However, in some applications, it may still be difficult to objectively argue that a certain extent of violations of the PTA can be ignored in practice. It may then be sensible to report $U^*$, the smallest value at which $H_0$ can be rejected at a given level of significance (i.e., for which "equivalence of pre-trends" can be concluded). Small values of $U^*$ relative to the estimated treatment effect may then be regarded as reassuring, as it is unlikely that the treatment effect is merely an artifact of differences in trends. Conversely, if $U^*$ is relatively large, the credibility of the estimated effect is in serious doubt. A similar idea has been proposed in Hartman and Hidalgo (2018). It can further be related to the "breakdown frontier" proposed by Masten and Poirier (2020), as treatment effects can be considered non-robust to violations of the PTA when $U^*$ exceeds the estimated treatment effect. Finally, in cases where the choice of the threshold is difficult, the methodology presented here can also be used to provide (asymptotic) confidence intervals for violations of the PTA.
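Under a TOST-style normal approximation for a single placebo coefficient, the smallest rejectable threshold described above reduces to a one-sided upper confidence bound for $|\beta_l|$; a minimal sketch with made-up numbers (not from the article):

```python
from statistics import NormalDist

def minimal_threshold(beta_hat, se, alpha=0.05):
    """Smallest equivalence threshold U* at which H0: |beta_l| >= U can
    still be rejected at level alpha under a normal/TOST approximation:
    the one-sided (1 - alpha) upper confidence bound for |beta_l|."""
    z = NormalDist().inv_cdf(1 - alpha)
    return abs(beta_hat) + z * se

# A small U* relative to the estimated treatment effect is reassuring.
u_star = minimal_threshold(0.01, se=0.05)   # roughly 0.092
```

In practice one would compare `u_star` with the estimated treatment effect, in the spirit of the breakdown-frontier interpretation above.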

3.1 Formulating Equivalence Hypotheses

We assume that there exists a vector $\beta\in\mathbb{R}^T$ that can be used to assess the plausibility of the augmented PTA. For instance, in (2.5), each $\beta_l$ corresponds to a "placebo" treatment effect in period $l\in\{1,\ldots,T\}$. To keep the notation tractable, the dimension of $\beta$ corresponds to the number of available pre-treatment periods $T$. In practice, it is of course possible that the dimension of $\beta$ is smaller than $T$ (e.g., when only a subset of all pre-treatment periods is used for a pre-test) or exceeds $T$ (e.g., when a conditional PTA is tested; see Section 4). Our tests can then be applied with minor adjustments.

Overall, we consider three distinct hypotheses to test for equivalence of pre-trends in treatment and control. We start with a discussion of the maximum placebo treatment effect. For $\beta\in\mathbb{R}^T$, level of significance $\alpha$, and equivalence threshold $\delta>0$, we test

$$H_0:\|\beta\|_\infty\ge\delta\quad\text{versus}\quad H_1:\|\beta\|_\infty<\delta, \tag{3.1}$$

where $\|\beta\|_\infty:=\max_{l\in\{1,\ldots,T\}}|\beta_l|$. Since the test controls the Type I error, rejecting $H_0$ implies that, with confidence level of at least $1-\alpha$, $\delta$ is an upper bound for the maximum placebo treatment effect.

In many applications, pre- and post-treatment periods are pooled, for instance to increase statistical power. Similarly, it may be sensible in some applications to consider a pooled or average measure of the pre-treatment deviations from parallel trends. Thus, defining $\bar\beta:=\frac{1}{T}\sum_{l=1}^{T}\beta_l$, one can find bounds on the average placebo effect by testing

$$H_0:|\bar\beta|\ge\tau\quad\text{versus}\quad H_1:|\bar\beta|<\tau. \tag{3.2}$$

One disadvantage of (3.2) is that there may be cancellation effects in situations where the components of $\beta$ are large in absolute terms but have opposing signs. Therefore, (3.2) should be used when differences in pre-trends can safely be assumed to be of the same sign. As pointed out in Rambachan and Roth (2023), monotone violations of the PTA are frequently discussed in the applied literature. For instance, treatment effect estimates are often considered robust if potential violations of the PTA are of the opposing sign and can thus be ruled out as an explanation for the estimated effects. As an alternative to (3.2) that does not suffer from potential cancellation effects, we further consider the root mean square (RMS) of $\beta$, that is,

$$\beta_{RMS}:=\|\beta\|_2/\sqrt{T}=\sqrt{\frac{1}{T}\sum_{l=1}^{T}\beta_l^2},$$

where $\|\cdot\|_2$ denotes the Euclidean norm on $\mathbb{R}^T$. The RMS of $\beta$ can thus be interpreted as the Euclidean distance between treatment and control in the pre-treatment periods relative to the distance in the base period, scaled by the number of pre-treatment periods. The scaling ensures that this distance between treatment and control does not increase with the number of pre-treatment periods available. The hypothesis is then formulated as

$$H_0:\beta_{RMS}\ge\zeta\quad\text{versus}\quad H_1:\beta_{RMS}<\zeta, \tag{3.3}$$

which can equivalently be written as

$$H_0:\beta_{RMS}^2\ge\zeta^2\quad\text{versus}\quad H_1:\beta_{RMS}^2<\zeta^2. \tag{3.4}$$

In Section 4.2 we develop a test statistic for (3.4) and recover $\zeta$ as $\sqrt{\zeta^2}$.
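The three summary measures targeted by hypotheses (3.1), (3.2), and (3.3) are easily computed from a vector of placebo estimates; a small sketch (the example vector is made up):

```python
import numpy as np

def pretrend_summaries(beta):
    """Max, average, and RMS summaries of the placebo effects beta."""
    beta = np.asarray(beta, dtype=float)
    return {
        "max": np.max(np.abs(beta)),         # target of hypothesis (3.1)
        "avg": abs(np.mean(beta)),           # target of hypothesis (3.2)
        "rms": np.sqrt(np.mean(beta ** 2)),  # target of hypothesis (3.3)
    }

# Opposing signs shrink the average but not the max or the RMS,
# illustrating the cancellation effect discussed for (3.2):
s = pretrend_summaries([0.4, -0.3, 0.2])
```

Here the average is 0.1 although individual placebo effects are up to four times larger, which is why the RMS measure is offered as an alternative.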

4 Theory and Implementation

We begin this section by introducing the assumptions underlying our tests. We then proceed by discussing the implementation and the statistical properties of our tests.

4.1 Assumptions

For our statistical theory, we mainly need that a “sequential” version of asymptotic normality holds:

Assumption 4.1.

Fix $\varepsilon>0$. For $\lambda\in[\varepsilon,1]$, let $\hat\beta(\lambda)$ be an estimator of $\beta$ based on $n\lambda$ individuals. Then

$$\{\sqrt{n}(\hat\beta(\lambda)-\beta)\}_{\lambda\in[\varepsilon,1]}\rightsquigarrow\Big\{\tfrac{1}{\lambda}DB(\lambda)\Big\}_{\lambda\in[\varepsilon,1]}, \tag{4.1}$$

where $B$ is a $T$-dimensional vector of independent Brownian motions, $D$ is a $T\times T$-dimensional matrix of full rank, and "$\rightsquigarrow$" denotes weak convergence in the space $(\ell^\infty[\varepsilon,1])^T$ of all $T$-dimensional bounded functions on the interval $[\varepsilon,1]$.

Notice that (4.1) comprises "conventional" asymptotic normality of $\hat\beta$ when $\lambda=1$, that is,

$$\sqrt{n}(\hat\beta-\beta)\stackrel{d}{\to}N(0,\Sigma)\quad\text{as }n\to\infty, \tag{4.2}$$

where $\Sigma$ denotes a $T\times T$-dimensional covariance matrix. A further assumption required in some of our results is that this matrix can be consistently estimated:

Assumption 4.2.

An estimator $\hat\Sigma$ of $\Sigma$ is given such that $\hat\Sigma\stackrel{p}{\to}\Sigma$ as $n\to\infty$.

The proofs of the validity of the first test for the hypothesis (3.1) and of the test for the hypothesis (3.2) do not require Assumption 4.1 but only the weaker condition (4.2) together with Assumption 4.2 (see the first parts of Sections 4.2.1 and 4.2.2 for the exact definition of these two tests). A second, more powerful test for the hypothesis (3.1) is introduced in the second part of Section 4.2.1, and its validity will be established in the TWFE model (2.5), under conditions that imply (4.2). In contrast, our test for the hypothesis (3.3) requires (4.1), but Assumption 4.2 is not needed, as this test is based on "self-normalization." We emphasize that without any additional assumptions the asymptotic normality in (4.2) does not imply the process convergence in (4.1). In fact, this is a very delicate probabilistic question, which has, to the best of our knowledge, only been solved for sums of random variables. For example, we refer to Kuelbs (1973) for the independent case and to Samur (1987) for the dependent case (note that these authors consider Banach space valued random variables, which covers the case of finite dimensional vectors).

Sequential asymptotic normality as in Assumption 4.1 can be shown to hold for stationary processes under various forms of dependence such as mixing or physical dependence (see Merlevède, Peligrad, and Utev 2006 and the references therein). Moreover, our tests can be flexibly adapted to different types of data (e.g., panel data or repeated cross-sections) and other features of the design such as clustering or staggered treatment assignment. As an illustration, in Appendix A, we verify that Assumption 4.1 holds under mild conditions in the canonical DiD model of Section 2.

Notice that our approach is a test of the PTA conditional on additional parametric restrictions (for instance, (2.1) and (2.2), or parametric restrictions on the effect of observed covariates). While arguably the vast majority of empirical studies are willing to impose similar assumptions, one might want to test the PTA without these restrictions. In this case, as an alternative to our method, one could consider tests based on conditional moment inequalities (see, among others, Andrews and Soares 2010; Romano, Shaikh, and Wolf 2014; or Canay and Shaikh 2017, sec. 4).

4.2 Implementing Equivalence Tests

Based on Assumptions 4.1 and 4.2, we now derive test statistics for our equivalence hypotheses. We further analyze the statistical properties of the resulting test procedures.

4.2.1 Tests for the Hypothesis (3.1)

To describe the first test for the hypothesis (3.1), we initially consider the case $T=1$, so that our objective is to test whether a single parameter $\beta_1$ exceeds a certain threshold. As $\hat\beta_1$ is approximately distributed as $N(\beta_1,\Sigma_{11}/n)$, the test statistic $|\hat\beta_1|$ approximately follows a folded normal distribution. We therefore propose to reject the null hypothesis in (3.1) whenever

$$|\hat\beta_1|<Q_{NF(\delta,\hat\Sigma_{11}/n)}(\alpha), \tag{4.3}$$

where $Q_{NF(\delta,\sigma^2)}(\alpha)$ denotes the $\alpha$-quantile of the folded normal distribution with mean $\delta$ and variance $\sigma^2$. It is shown in Appendix B that this test is consistent, has asymptotic level $\alpha$, and is (asymptotically) uniformly most powerful for testing the hypothesis (3.1) in the case $T=1$. In particular, this test is more powerful than the two one-sided tests (TOST) procedure, which could be developed following the arguments in Hartman and Hidalgo (2018). For $T>1$, we apply the idea of intersection-union (IU) tests outlined in Berger and Hsu (1996) and reject the null hypothesis in (3.1) whenever

$$|\hat\beta_t|<Q_{NF(\delta,\hat\Sigma_{tt}/n)}(\alpha)\quad\text{for all }t\in\{1,\ldots,T\}. \tag{4.4}$$

While this test is computationally attractive, a well-known disadvantage of testing procedures based on the IU principle is that they tend to be rather conservative (see Berger and Hsu 1996, among others), which is confirmed by our simulation study in Section 6.
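A minimal implementation of the folded-normal critical values in (4.3) and the IU decision rule (4.4), using only the standard library; the numeric inputs are illustrative:

```python
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def folded_normal_quantile(alpha, delta, sigma):
    """alpha-quantile of |X| with X ~ N(delta, sigma^2), found by bisection.
    Folded normal CDF: F(x) = Phi((x - delta)/sigma) + Phi((x + delta)/sigma) - 1."""
    lo, hi = 0.0, abs(delta) + 10.0 * sigma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        F = norm_cdf((mid - delta) / sigma) + norm_cdf((mid + delta) / sigma) - 1.0
        if F < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def iu_max_test(beta_hat, ses, delta, alpha=0.05):
    """Intersection-union rule (4.4): reject H0: ||beta||_inf >= delta iff
    every |beta_hat_t| falls below its folded-normal critical value."""
    return all(abs(b) < folded_normal_quantile(alpha, delta, s)
               for b, s in zip(beta_hat, ses))

# Small placebo estimates with tight standard errors support equivalence:
ok = iu_max_test([0.02, -0.01, 0.03], ses=[0.02, 0.02, 0.02], delta=0.2)
```

For $\delta$ large relative to $\sigma$, the critical value is approximately $\delta - z_{1-\alpha}\sigma$, which makes the connection to a one-sided test transparent.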

Specifically for the TWFE model (2.5), it is possible to obtain a more powerful test for the hypothesis (3.1) as follows. In the first step, estimate $\beta$ in (2.5) to obtain the unconstrained TWFE estimator $\hat\beta_u$. In the second step, re-estimate (2.5) by minimizing the sum of squared residuals under the constraint $\|\beta\|_\infty=\max_{l=1,\ldots,T}|\beta_l|=\delta$ to obtain a constrained estimator, say $\hat\beta_c$. We then define new estimators of the parameters as

$$\hat{\hat\beta}_c=\begin{cases}\hat\beta_u&\text{if }\|\hat\beta_u\|_\infty\ge\delta\\ \hat\beta_c&\text{if }\|\hat\beta_u\|_\infty<\delta\end{cases} \tag{4.5}$$

and

$$\hat{\hat\sigma}_c^2=\frac{1}{(n-1)T}\sum_{i=1}^{n}\sum_{t=1}^{T+1}\big(\ddot Y_{i,t}-\ddot W_{i,t}'\hat{\hat\beta}_c\big)^2. \tag{4.6}$$

Note that the vector $\hat{\hat\beta}_c$ satisfies the null hypothesis in (3.1). In the third step, for $b=1,\ldots,B\in\mathbb{N}$, we generate bootstrap samples with $\ddot u_{i,t}^{(b)}\overset{iid}{\sim}N(0,\hat{\hat\sigma}_c^2)$ and $\ddot Y_{i,t}^{(b)}=\ddot W_{i,t}'\hat{\hat\beta}_c+\ddot u_{i,t}^{(b)}$ for $i=1,\ldots,n$ and $t=1,\ldots,T+1$. For each bootstrap sample, estimate $\hat\beta^{(b)}$ and compute $Q_\alpha^*$ as the empirical $\alpha$-quantile of the bootstrap sample $\{\max_{l=1,\ldots,T}|\hat\beta_l^{(b)}|:b=1,\ldots,B\}$. Finally, reject the null hypothesis $H_0$ in (3.1) if

$$\|\hat\beta\|_\infty<Q_\alpha^*. \tag{4.7}$$

Notice that the condition $\|\beta\|_\infty=\delta$ in the calculation of the constrained estimator $\hat\beta_c$ is imposed for technical reasons in the proof of this statement. Numerical results show that the test has a very similar behavior if $\hat\beta_c$ is calculated under the condition that $\|\beta\|_\infty\le\delta$. The following result shows that this test is consistent and has asymptotic level $\alpha$.
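The three-step procedure can be sketched as follows. One simplification is made for brevity: instead of solving the constrained least-squares problem, the constrained estimator is approximated by rescaling the unconstrained estimate to max-norm $\delta$, so this illustrates the structure of the bootstrap test rather than the exact constrained fit (all names and the simulated panel are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def double_demean(A):
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

def design(G, Tp1):
    """Stacked double-demeaned placebo regressors, shape (n*Tp1, T)."""
    n, T = G.shape[0], Tp1 - 1
    W = np.zeros((n, Tp1, T))
    for l in range(T):
        W[:, l, l] = G
    Wdd = np.stack([double_demean(W[:, :, l]) for l in range(T)], axis=2)
    return Wdd.reshape(n * Tp1, T)

def max_test_bootstrap(Y, G, delta, alpha=0.05, B=500):
    n, Tp1 = Y.shape
    X = design(G, Tp1)
    y = double_demean(Y).reshape(-1)
    beta_u, *_ = np.linalg.lstsq(X, y, rcond=None)       # step 1
    norm_u = np.max(np.abs(beta_u))
    # Step 2 (simplified): rescale beta_u to max-norm delta instead of
    # solving the constrained least-squares problem.
    beta_c = beta_u if norm_u >= delta else beta_u * delta / norm_u
    resid = y - X @ beta_c
    sigma2 = np.sum(resid ** 2) / ((n - 1) * (Tp1 - 1))  # cf. (4.6)
    # Step 3: parametric bootstrap under the constrained fit.
    boot_max = np.empty(B)
    for b in range(B):
        yb = X @ beta_c + rng.normal(0.0, np.sqrt(sigma2), size=y.shape)
        bb, *_ = np.linalg.lstsq(X, yb, rcond=None)
        boot_max[b] = np.max(np.abs(bb))
    q = np.quantile(boot_max, alpha)
    return norm_u < q, q, norm_u                         # reject per (4.7)

# Simulated panel with no trend differences: for a generous threshold
# delta the test should tend to conclude equivalence.
n, Tp1 = 500, 4
G = (np.arange(n) < n // 2).astype(float)
Y = rng.normal(0.0, 0.2, size=(n, Tp1))
reject, q, norm_u = max_test_bootstrap(Y, G, delta=0.5)
```

With no true trend differences, $\|\hat\beta\|_\infty$ is small while the bootstrap critical value sits just below $\delta$, so the test rejects the null of non-negligible differences.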

Theorem 4.1.

If the errors $u_i=(u_{i,1},\ldots,u_{i,T+1})'$ in model (2.5) satisfy $E[u_iu_i'\mid G_i]=\sigma_u^2 I_{(T+1)\times(T+1)}$, then the test defined by (4.7) is consistent and has asymptotic level $\alpha$ for the hypothesis (3.1). More precisely,

  1. if the null hypothesis in (3.1) is satisfied, then we have for any $\alpha\in(0,0.5)$

    $$\limsup_{n\to\infty}P_\beta\big(\|\hat\beta\|_\infty<Q_\alpha^*\big)\le\alpha. \tag{4.8}$$

  2. if the null hypothesis in (3.1) is satisfied and the set

    $$E=\{l=1,\ldots,T:\,|\beta_l|=\|\beta\|_\infty\} \tag{4.9}$$

    consists of one point, then we have for any $\alpha\in(0,0.5)$

    $$\lim_{n\to\infty}P_\beta\big(\|\hat\beta\|_\infty<Q_\alpha^*\big)=\begin{cases}0&\text{if }\|\beta\|_\infty>\delta\\ \alpha&\text{if }\|\beta\|_\infty=\delta.\end{cases}$$

  3. if the alternative in (3.1) is satisfied, then we have for any $\alpha\in(0,0.5)$

    $$\lim_{n\to\infty}P_\beta\big(\|\hat\beta\|_\infty<Q_\alpha^*\big)=1.$$

Remark 4.1.

  1. It is well known that the bootstrap fails when the target parameter is non-differentiable, and so alternative approaches are needed (see, for example, Fang and Santos 2019; Hong and Li 2020). To explain why these observations are consistent with the statements in Theorem 4.1, we briefly sketch the main arguments of its proof here. Note that Theorem 4.1 distinguishes several cases under the null hypothesis. First we consider case (2), where the set $E=\{l=1,\ldots,T:\,|\beta_l|=\|\beta\|_\infty\}$ consists of one point. In this case, as $\|\beta\|_\infty\ge\delta>0$, the mapping $\beta\mapsto\|\beta\|_\infty$ is in fact Hadamard-differentiable, and one could alternatively use Theorem 3.1 in Fang and Santos (2019) to prove that

    $$\lim_{n\to\infty}P_\beta\big(\|\hat\beta\|_\infty<Q_\alpha^*\big)=\alpha, \tag{4.10}$$

    if $\|\beta\|_\infty=\delta>0$, and that the limit vanishes if $\|\beta\|_\infty>\delta$. Next, we consider case (1) of Theorem 4.1 and the situation where the set $E$ in (4.9) contains more than one element. In this case the mapping $\beta\mapsto\|\beta\|_\infty$ is not Hadamard-differentiable at this point. Consequently, by Theorem 3.1 in Fang and Santos (2019), a statement of the form (4.10) cannot hold in general. However, in (4.8) we claim neither the existence of the limit nor equality. In fact, we can show by similar arguments as given for the derivation of eq. (A.32) in Dette et al. (2018) that

    $$\sqrt{n}\big(\|\hat\beta^*\|_\infty-\|\hat\beta\|_\infty\big)=\tilde Z_n^*+o_P(1),$$

    where the statistic $\tilde Z_n^*$ has, conditionally on the data, the same asymptotic distribution as the original statistic $\sqrt{n}(\|\hat\beta\|_\infty-\|\beta\|_\infty)$. It turns out that this argument is sufficient to prove (4.8), and the same argument can also be used to show assertion (3) of Theorem 4.1.

  2. We emphasize that Theorem 4.1 provides a pointwise result. We expect that uniform results can be obtained as well. However, as indicated in part (1), even proving pointwise results requires some nonstandard techniques. This is even more true if one were to aim for uniform results, and we leave this problem for future research.

  3. The assumption of spherical model errors, that is, $E[u_iu_i'\mid G_i]=\sigma_u^2 I_{(T+1)\times(T+1)}$, is crucial for the validity of Theorem 4.1. In fact, our simulation results in Section 6 suggest that the test may not keep its nominal level under high levels of serial dependence in the error terms. As a further alternative, we therefore suggest replacing the bootstrap procedure described before (4.7) with the wild cluster bootstrap (Cameron and Miller 2015). To implement this, first compute the residuals $\hat{\ddot u}_{i,c}=\ddot Y_i-\ddot W_i\hat{\hat\beta}_c$, where $\ddot Y_i=(\ddot Y_{i,1},\ldots,\ddot Y_{i,T+1})'$ and $\ddot W_i=(\ddot W_{i,1},\ldots,\ddot W_{i,T+1})'$. Next, let $R_i$ denote a Rademacher random variable that takes the values $-1$ and $1$ with probability 0.5 each. For $b\in\{1,\ldots,B\}$, generate bootstrap samples $\ddot Y_i^{(b)}=\ddot W_i\hat{\hat\beta}_c+\hat{\ddot u}_i^*$, where $\hat{\ddot u}_i^*=R_i\times\hat{\ddot u}_{i,c}$. As before, compute $\hat\beta^{(b)}$ in each bootstrap sample, compute $Q_\alpha^*$, and reject $H_0$ in (3.1) if (4.7) holds. We demonstrate by means of a simulation study in Section 6 that this bootstrap test yields reasonable results for nonspherical errors, and we conjecture that the wild bootstrap provides a valid solution in general. A formal proof of this statement is beyond the scope of the present article.
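The wild cluster bootstrap variant can be sketched as follows, with one Rademacher weight per unit flipping the sign of that unit's entire residual vector (the design and inputs below are synthetic; `unit_design` is an illustrative helper, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def double_demean(A):
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

def unit_design(G, Tp1):
    """Double-demeaned placebo regressors per unit, shape (n, Tp1, T)."""
    n, T = G.shape[0], Tp1 - 1
    W = np.zeros((n, Tp1, T))
    for l in range(T):
        W[:, l, l] = G
    return np.stack([double_demean(W[:, :, l]) for l in range(T)], axis=2)

def wild_cluster_quantile(Xu, beta_c, resid_units, alpha=0.05, B=999):
    """Empirical alpha-quantile of max_l |beta_l^(b)| under the wild
    cluster bootstrap: residuals are held fixed, only their signs are
    flipped unit by unit with Rademacher weights."""
    n, Tp1, T = Xu.shape
    X = Xu.reshape(n * Tp1, T)
    fitted = Xu @ beta_c                         # (n, Tp1)
    boot_max = np.empty(B)
    for b in range(B):
        R = rng.choice([-1.0, 1.0], size=(n, 1))
        yb = (fitted + R * resid_units).reshape(-1)
        bb, *_ = np.linalg.lstsq(X, yb, rcond=None)
        boot_max[b] = np.max(np.abs(bb))
    return np.quantile(boot_max, alpha)

n, Tp1 = 50, 4
G = (np.arange(n) < 25).astype(float)
Xu = unit_design(G, Tp1)
beta_c = np.array([0.3, -0.3, 0.1])           # constrained estimate under H0
resid = rng.normal(0.0, 0.1, size=(n, Tp1))   # stand-in for constrained-fit residuals
q = wild_cluster_quantile(Xu, beta_c, resid)
# With zero residuals every bootstrap estimate equals beta_c, so the
# critical value collapses to ||beta_c||_inf:
q0 = wild_cluster_quantile(Xu, beta_c, np.zeros((n, Tp1)))
```

Because whole residual vectors are flipped jointly, within-unit serial dependence is preserved in the bootstrap samples, which is the point of clustering at the unit level.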

Remark 4.2.

Theorem 4.1 shows the consistency of a specific estimator in model (2.5). As pointed out by a referee, it is of interest to investigate if a similar approach is applicable for other estimators and models. We conjecture that this is possible, but proving the consistency of such an approach would be challenging, as the result would depend sensitively on the given model and the estimator used.

To be precise, let $\hat\beta$ denote an estimator of a parameter $\beta$ such that $\sqrt{n}(\hat\beta-\beta)\stackrel{d}{\to}N(0,\Sigma)$ as $n\to\infty$, where $\Sigma$ denotes the covariance matrix, and define the constrained estimator $\hat{\hat\beta}_c$ via (4.5). Further, let $\hat\Sigma$ denote a consistent estimator of the matrix $\Sigma$. We propose to generate $\hat\beta^{(b)}$ from a $N(\hat{\hat\beta}_c,\hat\Sigma/n)$ distribution and compute $Q_\alpha^{**}$ as the empirical $\alpha$-quantile of the bootstrap sample $\{\max_{l=1,\ldots,T}|\hat\beta_l^{(b)}|:b=1,\ldots,B\}$. The null hypothesis $H_0$ in (3.1) is rejected if $\|\hat\beta\|_\infty<Q_\alpha^{**}$. Note that the implementation of this procedure requires neither a specific model nor a specific estimator. In particular, we have implemented this alternative approach in model (2.5) with $\hat\Sigma=\hat{\hat\sigma}_c^2(\ddot W'\ddot W)^{-1}$ for spherical errors and with $\hat\Sigma=\frac{n}{n-T}(\ddot W'\ddot W)^{-1}\big(\sum_{i=1}^{n}\ddot W_i'\hat{\ddot u}_{i,c}\hat{\ddot u}_{i,c}'\ddot W_i\big)(\ddot W'\ddot W)^{-1}$ for errors that are clustered on the individual level, where $\ddot W=(\ddot W_1',\ldots,\ddot W_n')'$, and $\hat{\hat\sigma}_c^2$ and $\hat{\ddot u}_{i,c}$ are defined in (4.6) and Remark 4.1, respectively. In a simulation study, we obtained very similar results as for the bootstrap tests considered in Theorem 4.1 for the spherical case and Remark 4.1 (3) for the clustered case (these results are not displayed in Section 6 for the sake of brevity).

4.2.2 A Test for the Hypothesis (3.2)

For some fixed $\tau>0$, a test can be constructed by first computing the statistic $\bar{\hat\beta}:=\frac{1}{T}\sum_{t=1}^{T}\hat\beta_t=\mathbf{1}'\hat\beta/T$, where $\mathbf{1}=(1,\ldots,1)'\in\mathbb{R}^T$. Note that it follows from Assumptions 4.1 and 4.2 that $\sqrt{n}\,\mathbf{1}'(\hat\beta-\beta)\stackrel{d}{\to}N(0,\mathbf{1}'\Sigma\mathbf{1})$.

Consequently, based on the discussion in the first part of Section 4.2.1, we propose to reject the null hypothesis in (3.2) whenever

$$|\bar{\hat\beta}|<Q_{NF(\tau,\hat\sigma^2)}(\alpha), \tag{4.11}$$

where $\hat\sigma^2=\mathbf{1}'\hat\Sigma\mathbf{1}/n$, and $\hat\Sigma$ is the consistent estimator of $\Sigma$ given by Assumption 4.2.

4.2.3 A Pivotal Test for the Hypothesis (3.4)

For $\lambda\in[\varepsilon,1]$, let $\hat\beta_{RMS}(\lambda)$ denote the RMS of $\hat\beta$ based on a random subsample of size $\lambda n$ of the original sample of $n$ individuals. In order to construct a pivotal test for the hypothesis (3.4), define

$$\hat M_n:=\frac{\hat\beta_{RMS}^2(1)-\beta_{RMS}^2}{\hat V_n}, \tag{4.12}$$

where

$$\hat V_n=\Big(\int_\varepsilon^1\big(\hat\beta_{RMS}^2(\lambda)-\hat\beta_{RMS}^2(1)\big)^2\,\nu(d\lambda)\Big)^{1/2} \tag{4.13}$$

and $\nu$ denotes a measure on the interval $[\varepsilon,1]$. The following result is proved in the Appendix.

Theorem 4.2.

If Assumption 4.1 is satisfied and β ≠ 0, then the statistic M̂n defined in (4.12) converges weakly to a nondegenerate limit distribution, that is, M̂n →d W := B(1)/(∫_ε^1 (B(λ)/λ − B(1))² ν(dλ))^{1/2}, (4.14)

where {B(λ)}_{λ∈[ε,1]} is a Brownian motion on the interval [ε,1]. Moreover, if β = 0, then M̂n →d Z²(1)/(∫_ε^1 (Z²(λ) − Z²(1))² ν(dλ))^{1/2}, (4.15) where Z²(λ) = (1/λ²) B(λ)⊤D⊤D B(λ), B is a T-dimensional vector of independent Brownian motions, and D is the matrix in (4.1).

It follows from the proof of Theorem 4.2 that the statistic β̂RMS² is a consistent estimator of βRMS². Therefore, we propose to reject the null hypothesis H0 in (3.4) (and consequently H0 in (3.3)) whenever β̂RMS² < ζ² + QW(α)V̂n, (4.16) where QW(α) is the α-quantile of the limiting distribution of the random variable W on the right-hand side of (4.14). Note that these quantiles can easily be obtained by simulation because the distribution of W is completely known. For instance, QW(0.05) ≈ −2.1. The following result shows that this decision rule defines a valid test for the hypothesis (3.4).
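Because the distribution of W is pivotal, its quantiles can be approximated by straightforward Monte Carlo simulation of the Brownian motion at the support points of ν. A minimal Python sketch (ours), using the discrete uniform measure on {1/5, 2/5, 3/5, 4/5} discussed in Remark 4.3:

```python
import numpy as np

def simulate_QW(alpha, M=200_000, seed=0):
    """Monte Carlo alpha-quantile of W = B(1) / (int (B(l)/l - B(1))^2 nu(dl))^(1/2),
    with nu the uniform distribution on {1/5, 2/5, 3/5, 4/5}."""
    rng = np.random.default_rng(seed)
    lams = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
    # Brownian motion at the grid points via cumulated independent increments
    incs = rng.standard_normal((M, 5)) * np.sqrt(np.diff(lams, prepend=0.0))
    B = incs.cumsum(axis=1)
    denom = np.sqrt(np.mean((B[:, :4] / lams[:4] - B[:, [4]]) ** 2, axis=1))
    return np.quantile(B[:, 4] / denom, alpha)
```

Since W is symmetric about zero, the simulated median should be close to zero, and lower-tail quantiles are negative, consistent with the rejection rule in (4.16).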

Theorem 4.3.

If Assumption 4.1 is satisfied, then the test defined by (4.16) is a consistent asymptotic level-α test for the hypothesis (3.4), that is, lim_{n→∞} Pβ(β̂RMS² < ζ² + QW(α)V̂n) = 0 if βRMS² > ζ², = α if βRMS² = ζ², and = 1 if βRMS² < ζ².

Remark 4.3.

  1. Notice that in practice one chooses ν as a discrete distribution, which makes the evaluation of the integrals in (4.13) and in the denominator of the random variable W very easy. For example, if ν denotes the uniform distribution on {1/5, 2/5, 3/5, 4/5}, then the statistic V̂n in (4.13) simplifies to

    ((1/4) Σ_{k=1}^{4} (β̂RMS²(k/5) − β̂RMS²(1))²)^{1/2}.

    This measure is also used in the simulation study in Section 6, where we analyze the finite sample properties of the different procedures. In practice, it is thus not necessary to explicitly choose ε.

  2. It follows from the proof of Theorem 4.2 that an asymptotic (1−α)-confidence interval for the parameter βRMS² > 0 is given by

    [β̂RMS² + QW(α/2)V̂n, β̂RMS² + QW(1−α/2)V̂n].

  3. Theorem 4.3 can be extended to obtain uniform results. More precisely, define for a small positive constant c the sets

    H = {β | Δ(β) > c, βRMS² ≥ ζ²},  A(x) = {β | Δ(β) > c, βRMS² < ζ² − x/√n},

    corresponding to the null hypothesis and the alternative, respectively, where Δ(β) is defined in (B.6) in the Appendix. Then lim sup_{n→∞} sup_{β∈H} Pβ(β̂RMS² < ζ² + QW(α)V̂n) = α.

Furthermore, there exists a nondecreasing function f: ℝ>0 → ℝ>0, with f(x) > α for all x > 0 and lim_{x→∞} f(x) = 1, such that lim inf_{n→∞} inf_{β∈A(x)} Pβ(β̂RMS² < ζ² + QW(α)V̂n) = f(x).

The details are omitted for the sake of brevity.
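Given the subsample estimates β̂(λ) at λ ∈ {1/5,…,4/5, 1}, the simplified statistic V̂n from part 1 of Remark 4.3 and the decision rule (4.16) take only a few lines. The Python sketch below is ours; it assumes the subsample estimates have already been computed and takes a simulated quantile QW(α) as input:

```python
import numpy as np

def rms_equivalence_test(beta_by_lam, zeta, q_w_alpha):
    """Test (4.16) with nu uniform on {1/5, 2/5, 3/5, 4/5}.

    beta_by_lam : dict mapping lambda in {0.2, 0.4, 0.6, 0.8, 1.0} to the
                  estimate beta_hat(lambda) from a subsample of size lambda*n.
    Rejects H0 in (3.4) when rms2(1) < zeta^2 + Q_W(alpha) * V_n.
    """
    rms2 = {lam: np.mean(np.asarray(b) ** 2) for lam, b in beta_by_lam.items()}
    V_n = np.sqrt(np.mean([(rms2[lam] - rms2[1.0]) ** 2
                           for lam in (0.2, 0.4, 0.6, 0.8)]))
    return rms2[1.0] < zeta ** 2 + q_w_alpha * V_n, V_n

# toy example: pre-trend estimates close to zero, equivalence threshold zeta = 1
est = {0.2: [0.10, -0.08], 0.4: [0.06, 0.05], 0.6: [-0.04, 0.05],
      0.8: [0.03, -0.04], 1.0: [0.02, -0.03]}
reject, V_n = rms_equivalence_test(est, zeta=1.0, q_w_alpha=-2.1)
```

The self-normalization by V̂n is what makes the test pivotal: no estimate of the covariance matrix Σ is needed.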

5 Equivalence Testing with Heterogeneous Treatment Effects and Staggered Adoption

The use of the simple TWFE model has recently drawn substantial criticism in settings with multiple groups, heterogeneous treatment effects, and differences in treatment timing. In this situation, the TWFE estimator often does not correspond to a reasonable estimate of the ATT, and alternative estimators have been proposed by several authors (see, for instance, Goodman-Bacon 2021; Callaway and Sant'Anna 2021; Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2023; or de Chaisemartin and D'Haultfœuille 2020; excellent reviews of this fast-growing literature are provided by Roth et al. 2023 and de Chaisemartin and D'Haultfœuille 2022). Specifically, Wooldridge (2021) shows that the deficiency of the TWFE estimator can be regarded as a model misspecification problem. He then proposes model adjustments that allow for treatment effect heterogeneity due to differences in treatment timing and observed characteristics (which are assumed to be unaffected by the treatment). While we conjecture that our equivalence tests can be used with most (if not all) of the mentioned estimators, we focus on the regression-based approach of Wooldridge (2021) in the following, as it is straightforward to show that our assumptions hold with minor adjustments to the arguments in Appendix A. We now consider the case of staggered adoption (i.e., the initial treatment period varies across groups) of an absorbing treatment (i.e., the treatment status does not change after the initial treatment) in the presence of a never-treated group. Following Wooldridge (2021, sec. 6), we assume that the time since initial treatment adoption produces different levels of exposure to the treatment, resulting in treatment effect heterogeneity across time. As in Section 2, we consider a balanced panel of n individuals that are observed in T + 1 pre-treatment periods.
In periods T+2,…,T̄, a subset of individuals adopts treatment, leading to "treatment cohorts." As before, period T + 1 is used as the base period. To define a treatment cohort dummy, let Gir = 1 if individual i has first adopted treatment in period r ∈ R := {T+2,…,T̄, ∞} and zero otherwise, where Gi∞ is a dummy indicating that individual i is a member of the never-treated group. The potential outcome of unit i in treatment cohort r ∈ R observed in time period t ∈ {1,…,T̄} is denoted by Yi,t(r), where the "baseline" potential outcome in period t if unit i is never treated is given by Yi,t(∞). The objects of interest are the cohort- and time-specific treatment effects that may depend on a vector of observed and time-invariant covariates Xi, that is, πr,t(Xi) := E[Yi,t(r) − Yi,t(∞) | Gir = 1, Xi]. (5.1)

In many applications, it is assumed that the treatment effect is a linear function of the observed covariates (which may contain polynomials of Xi), so that πr,t(Xi) = πr,t + ρr,t⊤Ẋi, where Ẋi = Xi − E[Xi | Gir = 1] is the covariate centered around the cohort mean, so that πr,t is the treatment effect averaged across the distribution of Xi conditional on Gir = 1. Adapting (2.1), we assume E[Yi,t | Gi, Xi] = E[Yi,t(∞) | Gi, Xi] + Σ_{r=T+2}^{T̄} Σ_{s=r}^{T̄} πr,s(Xi) Gir Ds(t), where Gi = (Gi,T+2,…,Gi,T̄)⊤, so that deviations from the designated treatment path (e.g., through anticipation) are ruled out. Further extending (2.2), E[Yi,t(∞) | Gi, Xi] = αi + λt + κ⊤Xi + Σ_{r=T+2}^{T̄} ζr⊤ Xi Gir + Σ_{s=1, s≠T+1}^{T̄} ιs⊤ Xi Ds(t) + Σ_{r=1, r≠T+1}^{T̄} Σ_{s=1, s≠T+1}^{T̄} γr,s Gir Ds(t), where γr,s is a nonstochastic constant that differs across cohorts and time. Notice that the covariates are again assumed to enter the baseline outcome linearly. As in Section 2, the cohort- and time-specific ATTs cannot be identified without further restrictions. In order to ensure that γr,s = 0, we adapt (2.4) to impose a conditional staggered parallel trends assumption (CSPTA), using the never-treated group as the control: E[Yi,t(∞) − Yi,T+1(∞) | Gir = 1, Xi] = E[Yi,t(∞) − Yi,T+1(∞) | Gi∞ = 1, Xi] for all t = 1,…,T̄ and all r ∈ R.

Combining the above assumptions, we obtain E[Yi,t | Gi, Xi] = αi + λt + κ⊤Xi + Σ_{r=T+2}^{T̄} ζr⊤ Xi Gir + Σ_{s=1, s≠T+1}^{T̄} ιs⊤ Xi Ds(t) + Σ_{r=T+2}^{T̄} Σ_{s=r}^{T̄} (πr,s Gir Ds(t) + ρr,s⊤ Ẋi Gir Ds(t)) + Σ_{m=T+2}^{T̄} Σ_{k=1, k≠T+1}^{m−1} (π̃m,k Gim Dk(t) + ρ̃m,k⊤ Ẋi Gim Dk(t)). (5.2)

Simple algebra shows that the placebo conditional cohort- and time-specific treatment effects π̃m,k + ρ̃m,k⊤Ẋi are identified by taking differences-in-differences, that is, by comparing the evolution of the average outcome between period k and the base period T + 1 between treatment cohort m and the never-treated group. Given (5.1)–(5.2), the CSPTA implies that the placebo treatment effects are zero. We thereby avoid any "contamination" of the placebo estimates by actual treatment effects, which, as noted by Sun and Abraham (2021), can lead to a rejection of the CSPTA in the pre-treatment periods even in cases where it actually holds. Since the assumptions in Section 4.1 can easily be shown to hold under mild conditions by adapting the arguments in Appendix A, we can directly apply our equivalence tests to the placebo treatment effects. Moreover, the model in (5.2) can be flexibly adjusted to the problem at hand. For instance, one may be willing to exclude a subset of the placebo treatment effects from the model in order to allow for some pooling across cohorts and time. As noted by Wooldridge (2021), in this case the pooled OLS estimator of πr,s is an averaged "rolling DiD" where, on top of the never-treated group and the base period, any cohort and period that corresponds to an omitted placebo treatment effect is used as a control. In this case, the CSPTA needs to be adjusted accordingly (e.g., as in Roth et al. 2023), as parallel trends need to be plausible between multiple groups and periods. Finally, notice that in practice E[Xi | Gir = 1] needs to be replaced by the sample average of Xi in cohort r. As suggested in Wooldridge (2021), one should adjust the standard errors to account for the additional sampling variation.

As a further note of caution, practitioners should be aware that our tests only consider an implication of the CSPTA, as we rely on assumptions that impose a particular form of treatment effect heterogeneity (e.g., linear dependence on observed covariates). In effect, we are thus conducting a joint test of a parametric restriction on treatment effect heterogeneity and conditional parallel trends. While parametric restrictions are popular in the applied literature, one might still be concerned about their validity when testing for parallel trends. As an alternative, one may then consider tests based on conditional moment inequalities (see, for instance, Andrews and Shi 2013).

6 Simulations

In order to investigate the small sample properties of our tests, we conduct a simulation study in R. For that, we create a panel dataset with T ∈ {1,4,8,12} and n ∈ {100,1000}. For i = 1,…,n and t = 1,…,T+2, we generate the data from Yi,t = αi + λt + Σ_{l=1}^{T} βl Gi Dl(t) + πATT Gi DT+2(t) + ui,t, (6.1) with αi and λt standard normal, Pr(Gi = 1) = 1/2, and πATT = 0.

In an initial step, we investigate the level of the proposed tests. To do so, we set the level of significance to α = 5% and the equivalence threshold for all hypotheses to 1. We then choose the parameters βl in the pre-treatment periods such that we are on the "boundary" of the hypotheses, that is, β1 = 1 and βl = 0 for l > 1, or βl = 1 for all l = 1,…,T. Moreover, we also investigate the power of the test procedures by choosing βl ∈ {0.8, 0.9} for all l = 1,…,T. The bootstrap based tests for (3.1) are computed using 1000 bootstrap draws. Finally, the generated error terms are both serially correlated and heteroscedastic, as they are drawn from a stationary AR(3) process with autoregressive parameters (ϕ1,ϕ2,ϕ3) = (0.5,0.3,0.1) and with standard deviation 1 + Gi. Consequently, the tests that require an estimate of Σ are based on standard errors that are clustered on the individual level. The results for all tests, based on 20,000 simulations, are presented in Tables C.1–C.3.
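For concreteness, the DGP in (6.1) with serially correlated, heteroscedastic errors can be sketched as follows (our Python translation of the R design; the innovation scale 1 + Gi makes the error standard deviation proportional to 1 + Gi, and the within-estimation by double demeaning is equivalent to the TWFE dummy regression in a balanced panel):

```python
import numpy as np

def simulate_did_panel(n=2000, T=4, beta=None, pi_att=0.0,
                       phi=(0.5, 0.3, 0.1), seed=0):
    """Simulate the DGP in (6.1) and estimate the group-by-period coefficients
    by two-way within (double-demeaned) OLS. A sketch; function names are ours."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(T) if beta is None else np.asarray(beta, float)
    P = T + 2                                   # periods 0,...,T+1 (0-indexed); base = T
    G = rng.integers(0, 2, size=n)              # Pr(G_i = 1) = 1/2
    alpha, lam = rng.standard_normal(n), rng.standard_normal(P)
    # AR(3) errors with innovation scale (1 + G_i) and a 50-period burn-in
    u, prev = np.zeros((n, P)), np.zeros((n, 3))
    for t in range(-50, P):
        new = prev @ np.array(phi)[::-1] + (1 + G) * rng.standard_normal(n)
        prev = np.column_stack([prev[:, 1:], new])
        if t >= 0:
            u[:, t] = new
    effects = np.concatenate([beta, [0.0], [pi_att]])   # per-period effect, 0 at base
    Y = alpha[:, None] + lam[None, :] + G[:, None] * effects[None, :] + u

    def ddm(A):  # double demeaning removes individual and time fixed effects
        return (A - A.mean(axis=0, keepdims=True)
                  - A.mean(axis=1, keepdims=True) + A.mean())
    treat = [p for p in range(P) if p != T]             # all periods except the base
    Wd = np.stack([ddm(np.outer(G, np.arange(P) == p)) for p in treat], axis=2)
    coef = np.linalg.lstsq(Wd.reshape(n * P, -1), ddm(Y).ravel(), rcond=None)[0]
    return coef[:T], coef[T]                            # (beta_hat, pi_att_hat)
```

With all βl = 0 the estimated pre-trend coefficients are close to zero, and the coefficient on the post-treatment interaction recovers πATT.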

Table 1 Smallest equivalence thresholds such that the null hypotheses in (3.1), (3.2), and (3.3) can be rejected for varying numbers of pre-treatment periods.

In the following scenarios, we choose the level of significance α = 5% and compute δIU*, δBoot*, and δc.Boot* as the smallest equivalence thresholds for the IU, the bootstrap, and the cluster bootstrap tests such that the null hypothesis in (3.1) can still be rejected (i.e., for which equivalence of pre-trends can be concluded). Similarly, we compute the smallest equivalence thresholds τ* and ζ* for the corresponding null hypotheses in (3.2) and (3.3) using the tests in (4.11) and (4.16), respectively. The reported numbers correspond to averages over M = 2500 simulations and can be used to assess at what value of the equivalence threshold a particular test can be expected to reject the null hypothesis. Finally, we report the usual 95% confidence interval for π̂ATT and the number of simulations in which each βl, l = 1,…,T, was found to be statistically insignificant. We then investigate how violations of the PTA affect the chance of falsely detecting a treatment effect, and how these violations affect the smallest equivalence thresholds for which equivalence can be concluded. For simplicity, the model errors are drawn independently from a standard normal distribution.
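The smallest threshold τ* can be obtained by inverting the test in (4.11): since the folded-normal quantile is strictly increasing in τ, τ* solves QNF(τ*,σ̂²)(α) = |β̄̂|, which is a one-dimensional root-finding problem. A Python sketch (ours; function names are hypothetical):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def folded_quantile(alpha, tau, sigma):
    """alpha-quantile of |N(tau, sigma^2)|."""
    f = lambda x: norm.cdf((x - tau) / sigma) - norm.cdf((-x - tau) / sigma) - alpha
    return brentq(f, 0.0, abs(tau) + 10 * sigma)

def smallest_tau(beta_bar, sigma, alpha=0.05, upper=100.0):
    """Smallest tau for which the test in (4.11) rejects, i.e. the root of
    Q_NF(tau, sigma^2)(alpha) = |beta_bar|.  Assumes |beta_bar| exceeds the
    tau -> 0 limit of the quantile, so that a root exists."""
    return brentq(lambda tau: folded_quantile(alpha, tau, sigma) - abs(beta_bar),
                  1e-10, upper)
```

For |β̄̂| well away from zero, τ* is approximately |β̄̂| + σ̂·z_{1−α}, since the folding correction is then negligible. Analogous inversions give ζ* for (4.16), and a grid or bisection search over δ gives δIU*, δBoot*, and δc.Boot*.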

Table C.5 shows the results under the PTA. We further simulate scenarios in which the PTA is violated due to the presence of unobserved effects that affect the treatment group but not the control group. First, we simulate a pre-program shock, also known as "Ashenfelter's dip," by replacing ui,t in (6.1) by ũi,t = ui,t + Gi DT+1(t) Vi, where Vi ∼ N(μ,1) with mean μ ∈ {1/4, 1/2}. This implies that E[π̂ATT] = E[β̂l] = −μ for l = 1,…,T. The results are presented in Tables C.6 and C.7. In our second scenario, ũi,t = ui,t + ψ·t·Gi with ψ = 0.025, modeling an unobserved linear difference in time trends starting in t = 1. Consequently, E[β̂l] = ψ(l − T − 1) and E[π̂ATT] = ψ. The results are presented in Table C.8.

6.1 Simulation Results—Discussion

The simulation results show that the test in (4.11) approximately keeps the desired level for every T, even in small samples. The test in (4.16) appears to be slightly over-rejecting when n = 100 but keeps its nominal level in larger samples. Notice that the tests in (4.11) and (4.16) rightfully reject the null hypothesis in an increasing number of cases as n and T increase, as β̄ and βRMS move further away from the boundary of the null, resulting in an increase in statistical power. Regarding the tests for (3.1), the results in Tables C.1–C.4 illustrate that the IU and cluster bootstrap based tests maintain their nominal level for T = 2. When only one parameter is at the boundary of the null hypothesis, both tests also perform well in the sense that the empirical rejection frequency is close to the nominal level for sufficiently large n. In comparison, the test in (4.7) over-rejects even for large n, showing that it is not robust to high levels of serial correlation. If βl = 1 for all pre-treatment periods, all three tests become conservative for larger values of T. This phenomenon is much more pronounced for the test based on the IU principle, for which it is well documented (Berger and Hsu 1996). For instance, the empirical level of the IU based test is more than seven times smaller than the corresponding level of the bootstrap based tests for T = 12. This has important consequences for the power of both tests, as the cluster bootstrap based test procedure outperforms the IU based test for T > 2. On the other hand, the IU based test may still be attractive for practical applications, as it is numerically much less demanding. Compared to the tests for (3.1), the power of our test in (4.16) is substantially larger, only surpassed by the power of the test in (4.11). All tests have in common that their power decreases with T. This is true even for the test in (4.11), which is the (asymptotically) uniformly most powerful test for (3.2) for any T. Thus, concluding equivalence of pre-trends becomes more demanding as the number of pre-treatment periods increases. This makes intuitive sense in the DiD setup, where equivalence of pre-trends over a larger number of periods is often regarded as stronger evidence for the plausibility of the PTA.

For T = 1, we find that ζ* > δBoot* ≈ δc.Boot* > δIU* = τ*. This is not surprising, since the IU based test is asymptotically uniformly most powerful for the hypothesis (3.1), and (3.1) coincides with (3.2) and (3.3) when a single pre-treatment parameter is tested. For T > 1, we roughly observe that τ* < ζ* ≤ δBoot* ≈ δc.Boot* < δIU*. This relationship between the smallest equivalence thresholds of the maximum, average, and RMS tests is expected, as |β̄| ≤ βRMS ≤ ‖β‖∞. The better performance of the bootstrap based tests for (3.1) compared to the IU based test may be attributed to their higher power, as evidenced by the rejection frequencies in Appendix C. We also observe that δBoot* performs only slightly better than δc.Boot* under the PTA when errors are spherical. Under violations of the PTA, δBoot* < δc.Boot* for T > 1. However, as ũi,t becomes nonspherical, only the cluster bootstrap based test maintains its nominal level. We therefore recommend using the cluster-robust version of the bootstrap based test in practice. Further notice that even when the PTA holds, the practice of rejecting the DiD framework when β̂l is statistically significant for at least one l ∈ {1,…,T} is clearly inefficient, as an increase in available pre-treatment periods increases the chance of incorrectly rejecting the DiD framework under the PTA. A similar observation can be made in the presence of a linear time trend (see Table C.8). Here, even when the empirical coverage rate of the usual confidence interval is only slightly lower than the nominal level, the DiD framework is rejected in a large number of cases.

When the PTA is violated due to a small temporary shock (μ = 1/4), the usual practice of adopting the PTA when no significant differences in pre-trends can be found may lead to a false discovery of a nonzero treatment effect in a substantial number of cases, in particular when the sample size is small. If the temporary shock is larger (μ = 1/2), a nonexistent treatment effect will be found to be significantly different from zero in almost all cases. All our test procedures then require an unrealistically large equivalence threshold in order to conclude equivalence of pre-trends. In particular, any equivalence threshold for which equivalence could be concluded would have to be larger than the estimated treatment effect, casting serious doubt on the validity of the latter. Similarly, when the PTA is violated due to a linear difference in trends (see Table C.8), the equivalence thresholds would have to be chosen larger than the estimated treatment effect, suggesting that the estimated ATT may be biased due to insufficient support for the PTA. Moreover, our methodology can be useful in detecting the presence of a linear time trend, as τ* and ζ* tend to remain stable in T under the PTA or when the violation of the PTA is only temporary, whereas in the presence of a linear trend they increase with T.

7 Empirical Illustration

We illustrate our approach by reconsidering the influential Difference-in-Differences analysis in Di Tella and Schargrodsky (2004). They use a shock to the allocation of police forces following a terrorist attack on a Jewish institution as a natural experiment to study the effect of police on crime. The original authors conduct the usual pre-test in (2.7) and find no evidence for violations of the PTA. However, Donohue, Ho, and Leahy (2013) point out several shortcomings of the original paper (e.g., spillover effects from the treated to the untreated group). In particular, they find that the PTA is not plausible if the pre-treatment data are inspected at a more granular level, casting doubt on the validity of the estimated treatment effects. While the traditional test failed to detect evidence against the PTA, we apply our test procedures to analyze how much evidence in favor of the PTA can be extracted from the original specification in Di Tella and Schargrodsky (2004).

The data consist of monthly averages of the number of car thefts between April and December 1994 in each of 876 Buenos Aires city blocks, 37 of which received additional protection after the attack. The main specification in Di Tella and Schargrodsky (2004) is given by Yit = αi + λt + βDit, where Yit denotes the number of car thefts in block i and month t and Dit is a dummy variable taking the value 1 if block i is treated in period t. Finally, αi and λt are block- and time-specific fixed effects. In this specification, the pre- and post-treatment periods are pooled, so that the estimated treatment effect compares the post-treatment difference in car thefts between treated and non-treated blocks to the corresponding pre-treatment difference. To analyze group mean differences in the pre-treatment periods, we fit (2.5) to subsets of the data that include one, two, or three pre-treatment periods, corresponding to June, May and June, and April–June. As in the original paper, we include block- and time-specific effects and cluster on the block level. We find that using heteroscedasticity-robust standard errors instead of clustering has no substantial effect. Finally, we compute δIU*, δBoot*, δc.Boot*, τ*, and ζ*. The results are summarized below. Notice that for all tests, the smallest equivalence thresholds that still allow us to conclude equivalence of pre-trends are largest when only the pre-treatment period June is used. This hints toward a temporary shock to treatment or control in June, which may bias the pooled estimates of Di Tella and Schargrodsky (2004); the latter are significant and range between −0.058 and −0.081. One important outcome of our equivalence-test-based analysis is that, even without the granular data inspection of Donohue, Ho, and Leahy (2013), the equivalence thresholds have to be chosen unrealistically large in order to conclude equivalence of pre-trends. In fact, the smallest equivalence thresholds for which the null hypotheses can be rejected are larger than the estimated effect size of police on crime. Therefore, it is questionable whether there is any effect at all, since the estimated effect may merely be an artifact of the violated PTA.

8 Conclusion

We have derived several tests that allow researchers to provide statistical evidence in support of the parallel trends assumption in difference-in-differences estimation by testing for negligible differences in pre-trends between treatment and control. The tests can easily be implemented in popular models such as the two-way fixed effects model, and they can be flexibly adjusted to accommodate more complex settings. Our simulation analysis supports our theoretical results, as our tests maintain their nominal level and exhibit high statistical power in sufficiently large samples. Finally, we apply our methodology to the data provided by Di Tella and Schargrodsky (2004). Even without a granular inspection of the data as in Donohue, Ho, and Leahy (2013), our methodology casts doubt on the estimated effects, as they may simply be the result of trend differences that could not be detected using traditional pre-tests.


Acknowledgments

We thank the editor Ivan Canay, an associate editor, and two anonymous referees for comments that greatly improved this article. We further thank participants of the International Panel Data Conference 2023, the European Summer Meeting of the Econometric Society 2023 and the European Winter Meeting of the Econometric Society 2023.

Supplementary Materials

The supplementary material consists of two R files that replicate the results in Section 7 and the simulation results in Appendix C.

Disclosure Statement

The authors report there are no competing interests to declare.

References

  • Andrews, D. W. K., and Shi, X. (2013), “Inference based on Conditional Moment Inequalities,” Econometrica, 81, 609–666.
  • Andrews, D. W. K., and Soares, G. (2010), “Inference for Parameters Defined by Moment Inequalities Using Generalized Moment Selection,” Econometrica, 78, 119–157.
  • Baltagi, B. H. (2021), The Two-Way Error Component Regression Model, pp. 47–74, Cham: Springer.
  • Berger, R. L., and Hsu, J. C. (1996), “Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets,” Statistical Science, 11, 283–319. DOI: 10.1214/ss/1032280304.
  • Bilinski, A., and Hatfield, L. A. (2020), “Nothing to See Here? Non-inferiority Approaches to Parallel Trends and Other Model Assumptions,” arXiv preprint arXiv:1805.03273.
  • Borusyak, K., Jaravel, X., and Spiess, J. (2023), “Revisiting Event Study Designs: Robust and Efficient Estimation,” arXiv preprint arXiv:2108.12419.
  • Callaway, B., and Sant’Anna, P. H. (2021), “Difference-in-Differences with Multiple Time Periods,” Journal of Econometrics, 225, 200–230. DOI: 10.1016/j.jeconom.2020.12.001.
  • Cameron, A. C., and Miller, D. L. (2015), “A Practitioner’s Guide to Cluster-Robust Inference,” Journal of Human Resources, 50, 317–372. DOI: 10.3368/jhr.50.2.317.
  • Canay, I. A., and Shaikh, A. M. (2017), Practical and Theoretical Advances in Inference for Partially Identified Models, Vol. 2 of Econometric Society Monographs, pp. 271–306, Cambridge: Cambridge University Press.
  • de Chaisemartin, C., and D’Haultfœuille, X. (2020), “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,” American Economic Review, 110, 2964–96. DOI: 10.1257/aer.20181169.
  • ——— (2022), “Two-Way Fixed Effects and Differences-in-Differences with Heterogeneous Treatment Effects: A Survey,” The Econometrics Journal, 26, C1–C30.
  • Dette, H., Kokot, K., and Aue, A. (2020), “Functional Data Analysis in the Banach Space of Continuous Functions,” Annals of Statistics, 48, 1168–1192.
  • Dette, H., Kokot, K., and Volgushev, S. (2020), “Testing Relevant Hypotheses in Functional Time Series via Self-Normalization,” Journal of the Royal Statistical Society, Series B, 82, 629–660. DOI: 10.1111/rssb.12370.
  • Dette, H., Möllenhoff, K., Volgushev, S., and Bretz, F. (2018), “Equivalence of Regression Curves,” Journal of the American Statistical Association, 113, 711–729. DOI: 10.1080/01621459.2017.1281813.
  • Dette, H., and Wied, D. (2014), “Detecting Relevant Changes in Time Series Models,” Journal of the Royal Statistical Society, Series B, 78, 371–394. DOI: 10.1111/rssb.12121.
  • Dette, H., and Wu, W. (2019), “Detecting Relevant Changes in the Mean of Nonstationary Processes–A Mass Excess Approach,” Annals of Statistics, 47, 3578–3608.
  • Di Tella, R., and Schargrodsky, E. (2004), “Do Police Reduce Crime? Estimates Using the Allocation of Police Forces after a Terrorist Attack,” American Economic Review, 94, 115–133. DOI: 10.1257/000282804322970733.
  • Donohue, J. J., Ho, D., and Leahy, P. (2013), “Do Police Reduce Crime? A Reexamination of a Natural Experiment,” in Empirical Legal Analysis: Assessing the Performance of Legal Institutions, ed. Y.-c. Chang, pp. 125–143, Abingdon: Routledge.
  • Fang, Z., and Santos, A. (2019), “Inference on Directionally Differentiable Functions,” The Review of Economic Studies, 86, 377–412. DOI: 10.1093/restud/rdy049.
  • Goodman-Bacon, A. (2021), “Difference-in-Differences with Variation in Treatment Timing,” Journal of Econometrics, 225, 254–277. DOI: 10.1016/j.jeconom.2021.03.014.
  • Hartman, E., and Hidalgo, F. D. (2018), “An Equivalence Approach to Balance and Placebo Tests,” American Journal of Political Science, 62, 1000–1013. DOI: 10.1111/ajps.12387.
  • Hong, H., and Li, J. (2020), “The Numerical Bootstrap,” The Annals of Statistics, 48, 397–412. DOI: 10.1214/19-AOS1812.
  • Kahn-Lang, A., and Lang, K. (2020), “The Promise and Pitfalls of Differences-in-Differences: Reflections on 16 and Pregnant and other Applications,” Journal of Business & Economic Statistics, 38, 613–620. DOI: 10.1080/07350015.2018.1546591.
  • Kuelbs, J. (1973), “The Invariance Principle for Banach Space Valued Random Variables,” Journal of Multivariate Analysis, 3, 161–172. DOI: 10.1016/0047-259X(73)90020-1.
  • Manski, C. F., and Pepper, J. V. (2018), “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” Review of Economics and Statistics, 100, 232–244. DOI: 10.1162/REST_a_00689.
  • Masten, M. A., and Poirier, A. (2020), “Inference on Breakdown Frontiers,” Quantitative Economics, 11, 41–111. DOI: 10.3982/QE1288.
  • Merlevède, F., Peligrad, M., and Utev, S. (2006), “Recent Advances in Invariance Principles for Stationary Sequences,” Probability Surveys, 3, 1–36. DOI: 10.1214/154957806100000202.
  • Rambachan, A., and Roth, J. (2023), “A More Credible Approach to Parallel Trends,” The Review of Economic Studies, 90, 2555–2591. DOI: 10.1093/restud/rdad018.
  • Romano, J. P. (2005), “Optimal Testing of Equivalence Hypotheses,” The Annals of Statistics, 33, 1036–1047. DOI: 10.1214/009053605000000048.
  • Romano, J. P., Shaikh, A. M., and Wolf, M. (2014), “A Practical Two-Step Method for Testing Moment Inequalities,” Econometrica, 82, 1979–2002.
  • Roth, J. (2022), “Pretest with Caution: Event-Study Estimates After Testing for Parallel Trends,” American Economic Review: Insights, 4, 305–322. DOI: 10.1257/aeri.20210236.
  • Roth, J., Sant’Anna, P. H., Bilinski, A., and Poe, J. (2023), “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature,” Journal of Econometrics, 235, 2218–2244. DOI: 10.1016/j.jeconom.2023.03.008.
  • Samur, J. D. (1987), “On the Invariance Principle for Stationary ϕ-Mixing Triangular Arrays with Infinitely Divisible Limits,” Probability Theory and Related Fields, 75, 245–259.
  • Sant’Anna, P. H., and Zhao, J. (2020), “Doubly Robust Difference-in-Differences Estimators,” Journal of Econometrics, 219, 101–122. DOI: 10.1016/j.jeconom.2020.06.003.
  • Schuirmann, D. J. (1987), “A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability,” Journal of Pharmacokinetics and Biopharmaceutics, 15, 657–680. DOI: 10.1007/BF01068419.
  • Sun, L., and Abraham, S. (2021), “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects,” Journal of Econometrics, 225, 175–199. DOI: 10.1016/j.jeconom.2020.09.006.
  • van der Vaart, A., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes: With Applications to Statistics, New York: Springer-Verlag.
  • Wellek, S. (2010), Testing Statistical Hypotheses of Equivalence and Noninferiority (2nd ed.), Boca Raton, FL: CRC Press.
  • Wooldridge, J. M. (2021), “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators,” Available at SSRN 3906345.

Appendix A:

Assumption 4.1 in the Canonical DiD Model of Section 2

For λ ∈ [ε,1], compute the double-demeaned variables based on a random subsample Sn,λ of nλ := ⌊λn⌋ individuals, each observed in periods 1,…,T+1. For instance, we define the double-demeaned data as Ÿi,t(λ) = Yi,t − Ȳi,· − Ȳ·,t(λ) + Ȳ··(λ), where Ȳ·,t(λ) = (1/nλ) Σ_{j∈Sn,λ} Yj,t and Ȳ··(λ) = (1/(nλ(T+1))) Σ_{j∈Sn,λ} Σ_{s=1}^{T+1} Yj,s, and other variables such as Ẅi,t(λ) and üi,t(λ) are defined analogously. Collecting the individual time series, Ÿi(λ) = Ẅi(λ)β + üi(λ), (A.1) where Ÿi(λ) ∈ ℝ^{T+1}, Ẅi(λ) ∈ ℝ^{(T+1)×T}, and üi(λ) ∈ ℝ^{T+1}. Then, let β̂(λ) denote the OLS estimator of β in model (A.1) from the subsample of nλ individuals, that is, β̂(λ) = (Γ̂(λ))⁻¹ (1/nλ) Σ_{i∈Sn,λ} Ẅi(λ)⊤Ÿi(λ) = β + (Γ̂(λ))⁻¹ (1/nλ) Σ_{i∈Sn,λ} Ẅi(λ)⊤üi(λ), where the matrix Γ̂(λ) is defined by Γ̂(λ) = (1/nλ) Σ_{i∈Sn,λ} Ẅi(λ)⊤Ẅi(λ). Now, assume that Γ = plim_{n→∞} (1/n) Σ_{i=1}^{n} Ẅi⊤Ẅi is nonsingular, where Ẅi = Ẅi(1) and other variables are defined analogously. Under random sampling across individuals (see, e.g., Callaway and Sant'Anna 2021) and assuming that sufficient moments exist, it then follows that sup_{λ∈[ε,1]} ‖Γ̂(λ) − Γ‖ = oP(1) as n → ∞,

and √n(β̂(λ) − β) = Γ⁻¹ (√n/nλ) Σ_{i∈Sn,λ} Ẅi(λ)⊤üi(λ) + oP(1) uniformly with respect to λ ∈ [ε,1]. Consequently, we obtain from the Cramér-Wold device and Theorem 2.12.1 in van der Vaart and Wellner (1996) that {√n(β̂(λ) − β)}_{λ∈[ε,1]} ⇝ {V^{1/2} B(λ)/λ}_{λ∈[ε,1]},

where B is a T-dimensional vector of independent Brownian motions, V = Γ⁻¹WΓ⁻¹ with W = plim_{n→∞} (1/n) Σ_{i=1}^{n} Ẅi⊤üiüi⊤Ẅi, and the symbol ⇝ means weak convergence in the space (ℓ∞[ε,1])^T of all T-dimensional bounded functions on the interval [ε,1].

Appendix B:

Mathematical Proofs

B.1. Properties of the Test (4.3)

For sufficiently large sample sizes, the quantile fα := QNF(δ,Σ̂11/n)(α) satisfies α = P(|N(δ, Σ̂11/n)| ≤ fα) = Φ(√n(fα − δ)/√Σ11) − Φ(−√n(fα + δ)/√Σ11) + O(1/√n), (B.1)

where Φ is the cdf of the standard normal distribution. Consequently, we obtain for the probability of rejection Pβ1(|β̂1| ≤ fα) ≈ Φ(√n(fα − β1)/√Σ11) − Φ(−√n(fα + β1)/√Σ11). (B.2)

It is well known that the right-hand side of (B.2) (with the quantile fα defined by (B.1)) is the power function of the uniformly most powerful unbiased test (see Example 1.1 in Romano 2005).

B.2. Proof of Theorem 4.1

The proof follows essentially the same arguments as given in Dette et al. (2018), and, for the sake of brevity, we only explain why this is the case (see also the discussion in Section B.3, where a sequential version of the result is derived). By Assumption 4.1, √n(β̂ − β) has an asymptotic T-dimensional centered normal distribution. We denote the corresponding asymptotic covariance matrix by Σ = (σij)_{i,j=1,…,T}. Now we interpret all vectors as stochastic processes on the set X = {1,…,T} and rewrite the weak convergence of the vector β̂ = (β̂1,…,β̂T)⊤ as {√n(β̂x − βx)}_{x∈X} ⇝ {G(x)}_{x∈X}, (B.3) where {G(x)}_{x∈X} is a centered Gaussian process on X = {1,…,T} with covariance structure cov(G(x),G(y)) = σxy (x,y ∈ X). Note that (B.3) is the analog of eq. (A.7) in Dette et al. (2018), and it follows by exactly the same arguments as stated in that article that √n(‖β̂‖∞ − ‖β‖∞) ⇝ max{max_{x∈E+} G(x), max_{x∈E−} (−G(x))}, (B.4) provided that ‖β‖∞ > 0, where the sets E+ and E− are defined by E+ = {l = 1,…,T : βl = ‖β‖∞} and E− = {l = 1,…,T : βl = −‖β‖∞}, respectively. Note that E− ∪ E+ = E, where E is defined in (4.9), and that (B.4) is the analog of Theorem 3 in Dette et al. (2018). Moreover, if β̂* = (β̂1*,…,β̂T*)⊤ denotes the estimate from the bootstrap sample, we obtain an analog of the weak convergence in (B.3), that is, {√n(β̂x* − β̂c,x)}_{x∈X} ⇝ {G(x)}_{x∈X} (B.5) conditional on the sample (Ẅ1,Ÿ1),…,(Ẅn,Ÿn). Note that this statement corresponds to statement (A.25) in Dette et al. (2018). Now, the statements (A.7) and (A.25) and their Theorem 3 are the main ingredients for the proof of Theorem 5 in Dette et al. (2018). In the present context these statements can be replaced by (B.3), (B.5), and (B.4), respectively, and a careful inspection of the arguments given in Dette et al. (2018) shows that Theorem 4.1 holds (the arguments even simplify substantially, as in our case the index set X of the processes is finite).

B.3. Proof of Theorem 4.2

Let Assumption 4.1 hold. In the case $\beta=0$ the result in Theorem 4.2 follows directly from the continuous mapping theorem. On the other hand, if $\beta\neq 0$, it follows that
$$H_n(\lambda)=\sqrt{n}\,\big(\|\hat\beta(\lambda)\|_2^2-\|\beta\|_2^2\big)=\sqrt{n}\,\big\{\|\hat\beta(\lambda)-\beta\|_2^2+2(\hat\beta(\lambda)-\beta)^\top\beta\big\}=2\sqrt{n}\,(\hat\beta(\lambda)-\beta)^\top\beta+o_P(1)$$
uniformly with respect to $\lambda\in[\varepsilon,1]$ (the first term is negligible, since $\|\hat\beta(\lambda)-\beta\|_2^2=O_P(1/n)$ uniformly in $\lambda$), and a further application of the continuous mapping theorem yields
$$\{H_n(\lambda)\}_{\lambda\in[\varepsilon,1]} \rightsquigarrow \Big\{\frac{2\beta^\top D B(\lambda)}{\lambda}\Big\}_{\lambda\in[\varepsilon,1]}$$
in $\ell^\infty([\varepsilon,1])$. It is easy to see that for $\beta\neq 0$ the process on the right-hand side equals in distribution
$$\Big\{\Delta(\beta)\,\frac{B_1(\lambda)}{\lambda}\Big\}_{\lambda\in[\varepsilon,1]},$$

where $B_1$ is a one-dimensional Brownian motion and
$$\Delta(\beta)=\big(4\beta^\top D D^\top \beta\big)^{1/2} \tag{B.6}$$
is a positive constant. Recalling the definition of the statistic $\hat M_n$ in (4.12), a further application of the continuous mapping theorem shows that
$$\hat M_n=\frac{\hat\beta_{\mathrm{RMS}}^2(1)-\beta_{\mathrm{RMS}}^2}{\hat V_n}
=\frac{\|\hat\beta(1)\|_2^2-\|\beta\|_2^2}{\Big(\int_\varepsilon^1\big(\|\hat\beta(\lambda)\|_2^2-\|\hat\beta(1)\|_2^2\big)^2\,\nu(d\lambda)\Big)^{1/2}}
=\frac{H_n(1)}{\Big(\int_\varepsilon^1\big(H_n(\lambda)-H_n(1)\big)^2\,\nu(d\lambda)\Big)^{1/2}}
\xrightarrow{\ d\ } W=\frac{B_1(1)}{\Big(\int_\varepsilon^1\big(B_1(\lambda)/\lambda-B_1(1)\big)^2\,\nu(d\lambda)\Big)^{1/2}},$$
which proves the assertion (note that the constant $\Delta(\beta)$ cancels in the ratio).
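Because the limit $W$ is pivotal, its quantiles $Q_W(\alpha)$ can be obtained by simulation. The sketch below does so under two illustrative assumptions that are not fixed by the excerpt above: the measure $\nu$ is taken to be the uniform distribution on $[\varepsilon,1]$ with $\varepsilon=0.1$, and the Brownian motion $B_1$ is approximated on an equidistant grid.

```python
import numpy as np

# Sketch: simulate W = B1(1) / (int (B1(l)/l - B1(1))^2 nu(dl))^{1/2},
# assuming (hypothetically) nu = uniform on [eps, 1] with eps = 0.1.
rng = np.random.default_rng(1)

eps, m, reps = 0.1, 500, 10_000
t = np.linspace(0.0, 1.0, m + 1)[1:]           # grid on (0, 1]
dB = rng.standard_normal((reps, m)) / np.sqrt(m)
B = np.cumsum(dB, axis=1)                      # Brownian paths B1(t)

keep = t >= eps
lam = t[keep]
# For uniform nu, the integral is the average of the integrand over [eps, 1].
integrand = (B[:, keep] / lam - B[:, -1:]) ** 2
W = B[:, -1] / np.sqrt(integrand.mean(axis=1))

q5 = np.quantile(W, 0.05)                      # simulated Q_W(alpha), alpha = 5%
print(round(float(np.median(W)), 3), round(float(q5), 3))
```

Since $B_1\mapsto -B_1$ flips the sign of the numerator but leaves the denominator unchanged, $W$ is symmetric about zero, so $Q_W(0.05)$ is negative.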

B.4. Proof of Theorem 4.3

Observing the definition of $\hat M_n$ in (4.12), we obtain
$$P_\beta\Big(\hat\beta_{\mathrm{RMS}}^2<\zeta^2+Q_W(\alpha)\,\hat V_n\Big)=P_\beta\Big(\hat M_n<\frac{\zeta^2-\beta_{\mathrm{RMS}}^2}{\hat V_n}+Q_W(\alpha)\Big).$$

It follows from the proof of Theorem 4.2 that $\hat V_n=O_P(1/\sqrt{n})$. Consequently, if $\beta_{\mathrm{RMS}}^2>0$, the assertion follows by a simple calculation considering the three cases $\beta_{\mathrm{RMS}}^2<\zeta^2$, $\beta_{\mathrm{RMS}}^2=\zeta^2$, and $\beta_{\mathrm{RMS}}^2>\zeta^2$ separately. On the other hand, if $\beta_{\mathrm{RMS}}=0$, the proof of Theorem 4.2 also shows that $\|\hat\beta(1)\|_2^2=O_P(1/n)$, and the assertion follows from the weak convergence (4.15) in Theorem 4.2.
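The rejection rule analyzed above reduces to a one-line check. The function below is an illustrative sketch, not the authors' implementation; all argument names are hypothetical, and `q_w_alpha` stands for the (negative, at $\alpha=5\%$) quantile $Q_W(\alpha)$ of the pivotal limit from Theorem 4.2.

```python
# Hedged sketch of the decision rule in Theorem 4.3: conclude equivalence
# of pre-trends (reject the null of a non-negligible pre-trend) whenever
#     beta_hat_rms^2 < zeta^2 + Q_W(alpha) * V_hat_n.
def equivalence_reject(beta_hat_rms_sq: float, v_hat_n: float,
                       zeta: float, q_w_alpha: float) -> bool:
    """Return True if equivalence is concluded at threshold zeta."""
    return beta_hat_rms_sq < zeta ** 2 + q_w_alpha * v_hat_n

# Illustrative numbers only: a small estimated pre-trend, threshold zeta = 1.
print(equivalence_reject(0.04, v_hat_n=0.02, zeta=1.0, q_w_alpha=-1.6))  # True
print(equivalence_reject(1.20, v_hat_n=0.02, zeta=1.0, q_w_alpha=-1.6))  # False
```

The critical value is shifted below $\zeta^2$ by $|Q_W(\alpha)|\,\hat V_n$, which is what makes this an equivalence test rather than a conventional significance test.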

Appendix C: Simulation Results

Table C.1 Rejection frequencies for β_t = 1, t = 1,…,T, with equivalence threshold 1 at nominal level of significance α = 5%.

Table C.2 Rejection frequencies for β_1 = 1 and β_l = 0, l = 2,…,T, with equivalence threshold 1 at nominal level of significance α = 5%.

Table C.3 Rejection frequencies for n = 1000 with equivalence threshold 1 at nominal level of significance α=5%.

Table C.4 Rejection frequencies for n = 1000 with equivalence threshold 1 at nominal level of significance α=5%.

Table C.5 Treatment effect estimation (π_ATT = 0) and smallest equivalence thresholds under the PTA at nominal level of significance α = 5%.

Table C.6 Treatment effect estimation (π_ATT = 0) and smallest equivalence thresholds under a violation of the PTA due to a temporary group-specific shock with mean 0.25 at nominal level of significance α = 5%.

Table C.7 Treatment effect estimation (π_ATT = 0) and smallest equivalence thresholds under a violation of the PTA due to a temporary group-specific shock with mean 0.5 at nominal level of significance α = 5%.

Table C.8 Treatment effect estimation (π_ATT = 0) and smallest equivalence thresholds under a violation of the PTA due to a time trend with slope 0.025 at nominal level of significance α = 5%.