Evidential variety as a source of credibility for causal inference: beyond sharp designs and structural models

Pages 233-253 | Published online: 22 Sep 2011

Abstract

There is an ongoing debate in economics between the design-based approach and the structural approach. The main locus of contention regards how best to pursue the quest for credible causal inference. Each approach emphasizes one element – sharp study designs versus structural models – but these elements have well-known limitations. This paper investigates where a researcher might look for credibility when, for the causal question under study, these limitations are binding. It argues that seeking variety of evidence – understood specifically as using multiple means of determination to robustly estimate the same causal effect – constitutes such an alternative and that applied economists actually take advantage of it. Evidential variety is especially relevant for a class of macro-level causal questions for which the design-based and the structural approaches appear to have limited reach. The use of evidential variety is illustrated by drawing on the literature on the institutional determinants of the aggregate unemployment rate.

Acknowledgements

I thank Kevin Hoover, Julian Reiss and participants at the INEM Conference in Alabama (November 2010) and the CiTS Meeting in Brussels (January 2011) for their generous comments. This research was supported by the Social Sciences and Humanities Research Council.

Notes

 1. See the Symposium ‘Con out of Economics’ in the Journal of Economic Perspectives (Vol. 24, No. 2) and the ‘Forum on the Estimation of Treatment Effects’ in the Journal of Economic Literature (Vol. 48, No. 2).

 2. There are certainly other elements in the toolbox which will not be discussed here. To start with, we must recognize that approaches to causal inference are not exhausted by the design-based and the structural approaches even though the recent exchanges in JEL and JEP might give this impression. For a more inclusive typology, see Hoover (2008).

 3. ‘[I]n sensitivity analysis a single fixed body of data D is employed and then varying assumptions are considered which are inconsistent with each other to see what follows about some result of interest under each of the assumptions’ (Woodward 2006, pp. 234–235). In contrast, measurement robustness comes from varying the measurement procedures – for instance, using Brownian motion, alpha radiation and helium production in the case of the measurement of Avogadro's number discussed later. Each measurement procedure draws on its own observations and produces its own body of data. Furthermore, the assumptions used to derive results from these measurement procedures are typically consistent with each other, at least we hope so. In Woodward's typology, measurement robustness is also distinguished from causal robustness and derivational robustness. The latter has been used recently to analyse theoretical models in economics (Kuorikoski, Lehtinen and Marchionni 2010; it is related to a wider literature including Levins 1966; Orzack and Sober 1993; Weisberg 2006).

 4. The ‘long-run’ qualifier is added to distinguish between fluctuations of aggregate unemployment with the business cycle and the general level of aggregate unemployment through the cycle. It is an established fact that unemployment increases in economic downturns, but some countries have systematically lower unemployment rates than others whenever one makes the comparison. Another label for long-run unemployment is structural unemployment, which is contrasted with cyclical unemployment.

 5. A subset of the employed can also be labelled as underemployed – workers having an involuntary part-time job or being overeducated for their current job. Similarly, a subset of individuals officially out of the labour force are closer to potential workers (e.g. they are willing to work but not currently searching) than the rest of the inactive. The muddy boundaries of the concept of unemployment are a reminder that it is an analytical construction which is – like other core concepts in economics (e.g. inflation; Reiss 2007, chaps 2–4) – associated with a set of measurement criteria. These criteria inevitably affect our perception of a social situation. While, in the body of the text, I stick to the main division between the employed and the unemployed, one might think that a more comprehensive typology including the underemployed and the discouraged job-seekers is preferable (for well-informed policy decisions, for instance). I do not think that my methodological point hinges on this choice.

 6. Advocates of sharp designs endorse the potential outcome framework (Holland 1986; also called the Rubin Causal Model) and Heckman opts for a semantics of manipulation of external inputs to a causal structure (Heckman and Vytlacil 2007, Section 4). The contemporary discussion over alternative causal semantics in economics is a minefield; sorting out the real locus of disagreement would require a whole paper – and the inclusion of other proposals like Hoover (2001; 2011) and Pearl (2009).

 7. The reader should note that the set of inferences one can draw from average claims is considerably smaller than the set one can draw from homogeneous ones – e.g. you are not entitled to assert that an intervention in a specific country will have this effect. This point is important to avoid overstating the degree of consensus among specialists. Even if many qualitative, average causal effects are consensual, there might well be a lot of disagreement over policy claims for a given country. Note also that, if we take the averaging seriously, the truth of the causal claim does not even ensure that we will, most of the time, correctly capture the direction of the effect of intervening in a country picked at random. Knowing the median causal effect would ensure that, but the average effect will not if we cannot rule out that the distribution is skewed.
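A hypothetical illustration (the numbers are mine, chosen only to make the arithmetic transparent): suppose nine countries each have a causal effect of −1 and one country has an effect of +15. The average effect is (9 × (−1) + 15)/10 = +0.6, yet an intervention in a randomly picked country lowers the outcome nine times out of ten. The median effect, −1, captures the typical direction of intervening; the average, pulled up by the skewed tail, does not.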

 8. See note 1.

 9. Other labels are available but the distinction between design-based and structural approaches has the advantages of (i) capturing the main divide and (ii) not insinuating that one of the two is better than the other. Other distinctions which do not have these advantages include Heckman's (2005) statistical versus scientific, Heckman's (2008) statistical versus econometric, and Imbens' (2010) causal versus structural.

10. To be precise, proponents of the design-based approach make disproportionate use of the term credibility. Criticizing the structural approach for its allegedly ‘incredible identification’ restrictions is far from a new trend (Sims 1980). Advocates of the design-based approach draw again on this criticism by announcing a ‘credibility revolution’ (Angrist and Pischke 2010). Defenders of the structural approach will not capitulate easily and their reply, even if expressed in different ways, can be reconstructed in terms of credibility: while the design-based approach does credibly identify specific causal effects, credible policy analysis will typically require an explicit reliance on economic theory (see, for example, Heckman 2010; Keane 2010; Nevo and Whinston 2010).

11. See Morgan and Winship (2007, chap. 3) for an introduction to this tool applied to the social sciences; the two standard references on causal graphs are Spirtes, Glymour and Scheines (2000) and Pearl (2009).

12. These techniques are not solely associated with the design-based approach. Instrumental variables, for instance, are also extensively used in the estimation of structural models.

13. LATE is ‘local’ in the sense that it identifies the average treatment effect for a specific subpopulation, i.e. the units induced by the instrument to change their treatment status. One can equate LATE to the population average treatment effect only under the rather restrictive assumption that this subpopulation is a representative sample of the general population. Note that, on top of the usual exclusion restrictions for instrumental variables, LATE requires the monotonicity assumption (Imbens and Angrist 1994, p. 469): the instrument must not have a positive effect on the probability of being treated for one portion of the population and a negative effect on another portion. It must either affect all units positively (i.e. no defiers) or affect them all negatively (i.e. no compliers).
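To make the distinction concrete, here is a minimal simulation sketch (my own illustration, not from the paper; the population shares, effect sizes and variable names are assumptions) showing that the Wald/IV estimand recovers the compliers' average effect rather than the population average treatment effect when effects are heterogeneous:

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Principal strata: compliers take the treatment iff encouraged; never-takers never do (no defiers).
complier = rng.random(n) < 0.4
z = rng.integers(0, 2, n)                  # binary instrument (encouragement)
d = np.where(complier, z, 0)               # realized treatment status

# Heterogeneous treatment effects: +2 for compliers, +5 for never-takers.
effect = np.where(complier, 2.0, 5.0)
y = 1.0 + effect * d + rng.normal(0.0, 1.0, n)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print('Wald/IV estimate:', round(wald, 2))                  # close to 2.0, the compliers' effect (LATE)
print('Population ATE:  ', round(float(effect.mean()), 2))  # close to 3.8, a different quantity

Under the representativeness assumption mentioned in the note – compliers having the same average effect as everyone else – the two numbers would coincide.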

14. Heckman (2010, p. 358fn) writes: ‘For brevity, in this paper my emphasis is on microeconometric approaches. There are parallel developments and dichotomies in the macro time series and policy evaluation literatures. See Heckman (2000) for a discussion of that literature.’ At play here is the assumption that the lessons learnt from microeconometrics can easily be transferred to macroeconomics. The approaches discussed in Heckman (2000) are vector autoregression, structural estimation using DSGE models, calibration, sensitivity analysis and natural experiments.

15. Another example in this class is the famous question of the institutional causes of long-run growth (for a methodological discussion of the economic literature on this question, see Kincaid 2009). Other examples abound.

16. Macro-level RCTs should not be conflated with ‘Thatcher's experiment’ and the like. The latter do not involve the design of treatment versus control groups and the causal claim that they can support is, strictly speaking, country specific. They might still be one source of information among many when one relies on evidential variety.

17. Stock (2010, pp. 89–91) makes a distinction between three types of macro questions. The two types which matter for us here are shocks (e.g. technology shocks) and institutional change. Stock claims that the design-based studies hold more promise for the first sort than the second. The causal question under study here is of the second sort. Note that Stock focuses implicitly on country-specific causal questions.

18. The intuitive appeal of evidential variety is such that it has received a wide range of labels in the methodological and philosophical literature. The term ‘evidential variety’ is found in the Bayesian literature in the philosophy of science, where the ‘variety-of-evidence thesis’ is debated (e.g. Earman 1992, pp. 77–79; Wayne 1995; Bovens and Hartmann 2003, chap. 4; Novack 2007). The term ‘robustness’ stems from Wimsatt ([1981] 2007) and has been used by other scholars like Culp (1994; 1995) and Stegenga (2009). While Wimsatt used the term broadly, Culp and Stegenga understand it in the more restrictive sense that I use. To avoid confusion, I thus use ‘measurement robustness’ (Woodward 2006) to refer to this restrictive meaning (see note 3 for the other concepts of robustness in Woodward 2006). Another term close to ‘evidential variety’ is ‘independent determinations’ (e.g. Weber 2005, pp. 281–287). The problem with this term is that it gives too much importance to the problematic concept of ‘independence’ (see note 21 and the accompanying text). Yet another term is ‘consilience of evidence’, which is used exactly in line with my analysis by Oreskes (2007, pp. 89–91) in her discussion of the results of climate science. However, Oreskes' only reference to the contemporary literature using the term is Wilson's (1998) book Consilience: The Unity of Knowledge. Wilson uses ‘consilience of evidence’ interchangeably with ‘unification’ and he is specifically arguing that the humanities should be integrated with the sciences. This project is quite far from mine. Finally, let me note that Cartwright's (2007, p. 25) discussion of methods ‘that merely vouch for the conclusion’ and especially her presentation of ‘mixed indirect support’ (Cartwright 2007, pp. 36–37) have much in common with my own project. However, I note that what Cartwright sees as methods that ‘clinch’ their results will usually only vouch for them because of uncertainty regarding the underlying assumptions (Hoover 2009, p. 494); in this case, evidential variety will still be worth seeking.

19. In the writings of Campbell and collaborators (e.g. Campbell and Fiske 1959; Webb, Campbell, Schwartz and Sechrest 1966), the term triangulation was used in a manner compatible with my own analysis. The term then broadened under the influence of Denzin (1978, chap. 10). Denzin maintained that a study uses triangulation if it varies data, investigators, theories or methods – only method triangulation appears in Campbell's writings. Note that the term triangulation is in widespread use in some fields – e.g. in nursing science (Thurmond 2001; Tobin and Begley 2004) – and Downward and Mearman (2007) attempt to bring it into economics. I, however, refrain from using the term for two reasons. First, as said in the main text, some authors use it to mean approaches quite unlike what I have in mind. Second, the metaphor of ‘triangulation’ evokes an image which does not fit my use of evidential variety. Triangulation refers to the determination of the location of one point by using information on the location of two other points and the angles of the triangle formed by these three points. The problem with the analogy is that it suggests that using at least two means of determination is necessary to estimate the property of interest. But it is not. What is understood here by evidential variety is that each means of determination gives by itself an estimate of the property of interest. The combination is worthwhile only because one has doubts about the reliability of any single estimate.

20. Note that my discussion of first-order versus second-order evidence departs somewhat from the discussion of Staley (2004, p. 469): ‘If some fact E constitutes first-order evidence with respect to a hypothesis H, then it provides some reason to believe (or indicates) that H is the case.’ In my view, this definition fails to capture the essential distinction because second-order evidence also ‘provides some reason to believe that H is the case’, given the proviso that the first-order evidence to which it relates is known.

21. Error independence might, in fact, not be a necessary condition for measurement robustness to lead to higher credibility of inference. In a Bayesian framework, Bovens and Hartmann (2003, chap. 4) show that evidential sources giving concordant reports about a hypothesis H increase the posterior belief in H more and more as the number of sources increases (converging to subjective certainty), even when the reliability of each source is totally dependent on the reliability of the other sources. The main modelling assumption driving this result is that unreliable sources output ‘H is true’ randomly (with probability strictly between 0 and 1) and independently of the realized reports from the other sources. Given this assumption, an infinite sequence of concordant reports can be produced only if the sources are reliable. A long sequence of concordant reports thus leads the Bayesian updater to judge the answer as coming from reliable sources. I doubt that the assumption of independent random reports is adequate for most scientific cases of inference. It is certainly not adequate for my regression example.
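A back-of-the-envelope sketch of the mechanism (my own toy version, not Bovens and Hartmann's exact model; the priors, the function name posterior_h and its parameters are all assumptions): the n sources share a single reliability status, a reliable source reports ‘H is true’ exactly when H is true, and an unreliable source reports ‘H is true’ with probability a regardless of H.

def posterior_h(n, prior_h=0.5, prior_rel=0.5, a=0.5):
    # Probability of n concordant positive reports given H and given not-H,
    # when the sources are jointly reliable or jointly unreliable.
    p_reports_if_h = prior_rel * 1.0 + (1 - prior_rel) * a ** n
    p_reports_if_not_h = prior_rel * 0.0 + (1 - prior_rel) * a ** n
    joint_h = prior_h * p_reports_if_h
    return joint_h / (joint_h + (1 - prior_h) * p_reports_if_not_h)

for n in (1, 2, 5, 10):
    print(n, round(posterior_h(n), 3))   # 0.75, 0.833, 0.971, 0.999

The posterior climbs towards 1 as concordant reports accumulate, even though the sources' reliability is perfectly correlated, which is the qualitative point of the note.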

22. Those familiar with the work and terminology of Pearl (2009) will recognize that the conditioning strategy is an application of the back-door criterion while the mechanistic strategy uses the front-door criterion.
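For readers who want the formulas, the two criteria correspond to the standard identification results in Pearl (2009) (the notation here is mine: treatment X, outcome Y, a set Z of observed covariates blocking the back-door paths, and a mediator M intercepting the causal path):

\[
P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\, P(z) \qquad \text{(back-door adjustment)}
\]
\[
P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid x', m)\, P(x') \qquad \text{(front-door adjustment)}
\]

The conditioning strategy implements the first formula by adjusting for observed confounders; the mechanistic strategy implements the second by tracing the effect through a mediating mechanism.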

23. According to the models, there would also be two more intricate effects of employment protection on wages, one pushing them up and the other pushing them down. Higher wages in turn translate, in the model, into higher unemployment.

24. In economics, these effects are often labelled general equilibrium effects. Outside economics, they are typically referred to as failures of the Stable Unit Treatment Value Assumption (SUTVA; see Morgan and Winship 2007, pp. 37–40; Imbens and Wooldridge 2009, pp. 13–14). Note also that, to get from the micro-level causal effect to the average macro-level effect, the extrapolation can be broken into two steps: first, the micro claim must hold at the country level; second, the claim must hold for the population of countries. The failure of SUTVA applies to the first step; the second step is problematic because of causal heterogeneity across countries. The discussion in the main text focuses on the first problem, which is enough to establish the main point: it is likely that the inferential base provided by the mechanistic strategy alone is not credible enough.

25. Strictness of employment protection is believed to have significant effects on other relevant variables. For instance, an increase in strictness is thought to increase average unemployment duration and to decrease flows into and out of unemployment. These two results are tightly linked with the main result of a zero net effect on aggregate unemployment: if a policy decreases the flows into and out of unemployment in the same proportion, whatever the current unemployment rate, the unemployment rate will stay put but the average duration of a spell will increase.
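The flow algebra behind this statement is a standard steady-state approximation (the notation is mine, not the paper's): with a separation rate s into unemployment and a job-finding rate f out of it, the steady-state unemployment rate and the expected duration of a spell are

\[
u^{*} = \frac{s}{s+f}, \qquad \text{expected duration} \approx \frac{1}{f}.
\]

Scaling both flows by the same factor \lambda leaves u^{*} = \lambda s / (\lambda s + \lambda f) = s/(s+f) unchanged, while the expected duration rises from 1/f to 1/(\lambda f) whenever \lambda < 1.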

26. Note that the individuation of means of determination is arbitrary to some degree. For instance, the mechanistic strategy could have been divided into model evidence and micro-data evidence. Individuation must certainly be guided by the error-independence criterion – in this respect, the existence of a P_x is a potential source of error for both the model and micro-data evidential elements – but this guidance still leaves us some leeway in our individuation choices.

27. We are left with epistemic contexts for which we have poor theories, fuzzy designs and, on top of that, discordant evidence. There might not be another strategy to employ in such contexts beyond the simple recommendation to search harder for good theories, sharp designs and reasons to rule out some evidential elements.

28. Imbens (2010, pp. 418–419), a proponent of the design-based approach, has a nice discussion at the end of his recent methodological article asking what one should do when in possession of evidential elements from an RCT and from an observational study using a structural model. His proposal is directly interpretable in terms of evidential variety. He even goes beyond measurement robustness to assess what one should do in the case of discordant evidential elements.
