
Which standards from which disciplines? A test of systematic review for designing interdisciplinary evaluations

Pages 82-100 | Received 10 Apr 2015, Accepted 28 Feb 2016, Published online: 21 Mar 2016

ABSTRACT

Evidence-based development suggests empirical choice of evaluation methods. Systematic review (SR) is increasingly used in development but, to our knowledge, has not informed methods selection. This article tests SR for methods selection for evaluation in health and conflict studies. The review comprised a reproducible literature search, inclusion protocols, quality assessment, data extraction and qualitative aggregation. The study finds that adopting even some aspects of SR for methods selection is useful and an improvement. The usefulness of SR is constrained by the paucity of empirically grounded methodological recommendations, inconsistent citation and reporting practices, and difficulties surrounding multidisciplinary quality assessments.

1. Introduction

In early 2012 an international non-governmental organisation (NGO) asked our Research Methodology group for methodological advice. They were the sole provider of primary health care in an area of endemic conflict and continued funding of their activities depended on demonstrating a beneficial relationship between service delivery and peace and conflict dynamics in the region. They wanted advice on how to examine the interaction, if any, between how they organised the delivery of primary health care services and peace and conflict dynamics in their area of operations. In our initial consultation with the NGO and a subject matter expert we were told that primary health care in areas of endemic conflict is easily politicised. Critically, we were informed that opinions differ on whether health care investments should be prioritised in areas recovering from or currently experiencing violent conflict, and further that the beliefs of experts designing research on the topic strongly influence the results. While we had neither evidence to support this allegation nor time to test its validity, we decided that it would be prudent to avoid providing advice based only on expert judgement. As an alternative we attempted to use the principles and, where possible, the practices of systematic review (SR) to give an evidence-based response. This article reports on our pilot use of SR to provide methodological advice.

In addition to solving our practical dilemma, we believe this exploration is of relevance to wider concerns for development evaluation. Funders’ interest in verifying ‘what works’ has encouraged the partial emergence of an evidence-based paradigm in development, evidenced, for example, in the interest in the evaluation of development projects and useful methods to do so (Berg Harpviken et al. Citation2003; Gauster and Isakson Citation2007; McMullin Citation2011; Percival and Sondorp Citation2010; Roberts et al. Citation2010). One trend in the methodological debate highlights the establishment of the evidence-based paradigm in the medical and health sciences, based on a thorough testing of interventions through randomised controlled trials (RCTs) and subsequent meta-analysis of RCTs through SR (Dixon-Woods et al. Citation2006a; Evans and Benefield Citation2001; Magarey Citation2001; Major and Savin-Baden Citation2012). However, there are significant differences between health science interventions and development interventions which limit the extent to which the lessons learned in the former can be applied to the latter. For instance, development evaluators most often study social phenomena in their natural environment, limiting the extent to which researchers can control for external contextual factors (Roberts et al. Citation2010). Further, the instability of the non-laboratory conditions in which interventions are observed may complicate the generation of adequate data (Barakat, Chard, and Jones Citation2005; Barakat et al. Citation2002; Fenn Citation2012). The use of ‘placebo’ development interventions as control groups raises ethical and practical issues (Gutlove and Thompson Citation2006), and the structural tendency of evaluations to focus on an intervention on the terms specified by its funder encourages insensitivity to aspects of the context that were not anticipated to be relevant at the outset of a project and leads evaluations to miss relevant dynamics whose timeframe exceeds that of the intervention(s) studied (Barakat, Chard, and Jones Citation2005; Levermore Citation2011; McMullin Citation2011).

Addressing such methodological issues is not straightforward, however, due in part to the fact that development evaluation is necessarily interdisciplinary and there is no interdisciplinary consensus on how such methodological issues should be addressed. Evaluators come from different academic disciplines, each of which has its own, at times contested, set of theories, methods, measures and standards (Barakat et al. Citation2002; Berg Harpviken et al. Citation2003; Buhmann et al. Citation2010; Levermore Citation2011; McMullin Citation2011; Roberts et al. Citation2010), thus complicating agreement on a single set of standards (Buhmann et al. Citation2010; Weaver and Roberts Citation2010). As such, evaluation, rather than providing clarity over contested development strategies, is itself a contested topic (Barakat, Chard, and Jones Citation2005; Brown Citation2009; Levermore Citation2011).

It was faced with this dilemma that we decided to explore the relevance of the SR methodology to our task. The proposed impact evaluation study would fall between the health sciences and the multidisciplinary field of conflict studies. Although a tempting solution would have been to draw upon the best practices in one of the relevant disciplines, and advocate that as the standard for research in this emerging field, we foresaw two principal problems with this strategy. First, prioritising the methods of one discipline would accentuate the biases of that discipline and limit the insights from other approaches. Second, we thought that the very selection of one discipline by expert consultants (i.e. ourselves) would constitute a regressive shift away from evidence in favour of expert opinion which would risk amplifying the politicisation of an already sensitive topic.Footnote1 If a solution was to be found to this dilemma, we thought it might lie in identifying methods accepted in all relevant disciplines. We therefore decided to use a systematic empirical approach rather than expert selection to identify those methods and turned to SR.

The results of that review are not reported in this article. Instead we report on, and discuss the benefits and challenges of, our application of the principles of SR. The remainder of the article is structured as follows. In the next section we discuss the SR as a methodology, outline our rationale for testing it in this case, and then formulate the research questions that guided this test. We then present the method we used in Section 3. Finally, we report the successes and failures of our application before concluding with a comparison of the method we used with that reported in the literature we reviewed.

2. Systematic review and its possible relevance to identifying methods

Study of the interaction between health care and conflict sits at the intersection of the health sciences and the multidisciplinary field of conflict studies. It lacks an adequate evidence base as there are few empirical studies and those few that have been undertaken employ a range of conceptual frameworks and methodological approaches which makes meaningful comparison of results difficult (Bornemisza et al. Citation2010; Buhmann Citation2005; Gutlove and Thompson Citation2006; MacQueen and Santa-Barbara Citation2000). The heterogeneity of these studies made it impossible for us to credibly identify one method as a model to be adopted as a standard. For reasons already discussed, we rejected the alternative option of choosing a standard method from either of the two contributing fields. From this, we concluded that a viable option might be to try and empirically identify methods accepted in all disciplines that are invoked by researchers who study the interaction of health care and peace and conflict dynamics, and further to do this in such a way that minimised the effect of our own biases as researchers.

SR originated in the health sciences as a means to compile and review all existing research of RCTs of a given intervention (Dixon-Woods et al. Citation2006a; Evans and Benefield Citation2001; Magarey Citation2001; Major and Savin-Baden. Citation2012). In the social sciences, SRs are now also used to make sense of and manage the ‘information explosion’, separate wheat from the chaff, identify gaps in an evidence base, confirm, refute, develop or modify bodies of theory, and increase the standard of research in the field (Bondas and Hall Citation2007; Campbell et al. Citation2003; Langer and Stewart Citation2014; Major and Savin-Baden. Citation2012; Noblit and Hare Citation1988; Wallace et al. Citation2004; White and Waddington Citation2012). SRs are now used in fields such as education and training (Evans and Benefield Citation2001; Price Citation2005; Secomb Citation2008), social policy (Wallace et al. Citation2004) and experiential research in health care (Arman and Rehnsfeldt Citation2003; Campbell et al. Citation2003; Dixon-Woods et al. Citation2007, Citation2006b; Hughes, José Closs, and Clark Citation2009). The expanding range of domains in which SR is applied and its suitability for working from a broad literature commended it to our use. However, while SRs are increasingly used in the field of development to aggregate knowledge of interventions (Guerrero et al. Citation2013; Langer and Stewart Citation2014; Leroy, Gadsden, and Guijarro Citation2012; Mallett et al. Citation2012), to our knowledge, the relevance of SR to the generation of methodological prescriptions for interdisciplinary research is a novel and as yet untested area of application and we did expect challenges. Even in domains where SR is common, it is still usually used to review evidence generated from studies with commensurable theory and methods. By contrast, our study would involve reviewing knowledge claims rather than data or evidence. Further, such claims would be made on both evidentiary and non-evidentiary bases, and would respond to article-specific knowledge gaps which we suspected (correctly) would be non-equivalent. As such, SR required adaptation and explicit assessment for this new application.

In order to adapt a method to a novel context yet still defensibly claim to work with the same methodology, it is necessary to identify core elements of a SR. SRs are traditionally composed of the following steps: transparent and reproducible search strategy; selection of studies to be included in the review; data extraction; secondary analysis of extracted data (Magarey Citation2001). Each of these steps is pursued through a composite method that embodies the principles of transparency, reliability and comprehensiveness. As SR has been adapted to cover topics in which the goals of research, types of evidence, field conditions and epistemological foundations of the health and medical sciences no longer hold, each of these steps has required adaptation, although in such a way as to remain committed to the underlying principles of transparency, reliability and comprehensiveness. In adapting the methodology to our proposed application, we adhered to these principles. These principles also provide us with a framework within which to assess the suitability of the methodology for this application, as we do in this article, specifically through examining which principles can or cannot be successfully transferred, and whether using these principles constitutes an improvement over standard practices in the area of application.

In responding to the NGO who approached us for assistance, we undertook a SR to answer the following question: ‘What methodological guidelines for research into the interaction between the delivery of primary health care and peace and conflict dynamics in areas of endemic conflict can be extrapolated from methodological prescriptive social science literature?’ The results of that applied study were directly reported to the NGO and they are not mentioned in this essay other than to provide a case which helps to ground discussion. This essay uses that case to assess the suitability of SR to the identification of methodological prescriptions from a heterogeneous literature for interdisciplinary assessment of development interventions. The research questions on which this essay is based are:

Given the task of generating prescriptions for an interdisciplinary study of the interaction between the organisation of primary health care provision and peace and conflict dynamics in an area of endemic conflict from a heterogeneous literature:

  1. What aspects of SR transfer well?

  2. What aspects of SR do not transfer well?

  3. Are the aspects of SR tested an improvement on practices found in subject literature?

In the above questions we take ‘transfer’ to refer to taking the principles and, where appropriate, practices of SR as they have been developed for evidence reviews in the health sciences and applying them to review prescriptions in a cross-disciplinary literature. We operationalised ‘transfer well’ as ‘use of the aspect of SR under immediate consideration was not visibly frustrated by characteristics of the literature’. Finally, ‘improvement’ was operationalised as a comparison of standards of transparency, reliability and comprehensiveness as practised in our review and in the articles under review.

3. Methodology

The review we undertook followed the typical structure of a SR as we have found them in the social sciences. Specifically, we:

  1. Conducted a recorded search for articles related to the topic

  2. Narrowed our population of articles down to those that were written in English

  3. Screened the titles and abstracts of articles according to standardised protocols for relevance

  4. Screened the full text of retrieved articles for relevance, according to standardised protocols

  5. Assessed the quality of the articles found relevant

  6. Identified and extracted analytically relevant data within these articles

  7. Performed an aggregation of analytically relevant data

SR requires clear specification of each of these steps prior to starting a study. We did not pre-specify all steps as this was a test application of SR to a new task. As such, for each stage we identified challenges, picked a path and then documented our progress. Each of these stages is discussed in summary form below. A full description of the methodology employed and details of subject articles were documented in a technical report which is included as supplementary material with this article and is available from the library of the corresponding author. Filtering of articles through each stage is represented in Figure 1.

Figure 1. Selection of reviewed articles
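
Because the figure itself is not reproduced in this text, the short sketch below reconstructs the selection funnel from the counts reported in the sections that follow; the stage labels are paraphrases rather than the figure's original wording.

```python
# Reconstruction of the selection funnel (Figure 1) from counts reported in the text.
# Stage labels are paraphrases, not the figure's original wording.

selection_funnel = [
    ("Search results after de-duplication (WOS and Scopus, 24 January 2012)", 312),
    ("English-language articles (32 non-English excluded)", 280),
    ("Relevant on title/abstract screening", 168),
    ("Full text obtainable for screening (6 inaccessible)", 162),
    ("Relevant on full-text screening", 64),
    ("Purposive subset used for the second synthesis", 18),
]

for stage, count in selection_funnel:
    print(f"{count:>4}  {stage}")
```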

3.1 Search

SR requires identification and review of material in obscure sources and, as such, it is enormously labour intensive. In our project, as seems common in SRs in the social sciences, we accepted publication in an indexed refereed journal as an initial screen for quality and limited our search to iterative creation of a complex search term that we executed in indexes of the Scopus and Web of Science (WOS) databases on 24 January 2012.

This search term was recorded (Tables 1 and 2). After duplicates were removed, this search yielded 312 articles, which we used as our population in the analysis that followed.

Table 1. Details of search executed in WOS on 24 January 2012

Table 2. Details of search term executed in Scopus on 24 January 2012
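
The article does not describe the mechanics of merging the two database exports and removing duplicates. The sketch below, in Python, illustrates one straightforward way this could be done; the CSV file names and the 'Title' column are assumptions for illustration only, not the authors' actual workflow.

```python
# Illustrative only: merge two bibliographic exports and drop records whose titles
# match after normalisation. File names and the 'Title' column are assumed formats.
import csv

def normalise(title: str) -> str:
    """Lower-case a title and strip everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def merge_and_deduplicate(paths):
    """Read CSV exports and keep the first record seen for each normalised title."""
    seen, merged = set(), []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle):
                key = normalise(record.get("Title", ""))
                if key and key not in seen:
                    seen.add(key)
                    merged.append(record)
    return merged

# e.g. population = merge_and_deduplicate(["wos_export.csv", "scopus_export.csv"])
# In the review, the merged and de-duplicated population contained 312 articles.
```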

3.2 Screen for language and relevance

Due to resource constraints we limited our review to English language articles, while acknowledging the limitations of this language bias. Screening out non-English articles excluded 32 articles, roughly 10 per cent of the initial population.

Once we had downloaded the titles and abstracts and eliminated duplicates, we created and applied a standardised protocol (Figure 2) to perform an initial screen for relevance based on article titles and abstracts. This stage of screening resulted in 168 preliminarily relevant, English language articles. Articles were identified as relevant if and only if the abstract stated that the article advocates or problematises some particular methodological standards relevant to any epistemological orientation or any stage of research. To illustrate, the article by Kelle, Combining qualitative and quantitative methods in research practice: Purposes and advantages (Citation2006), was rejected based on abstract screening, with the coder giving the following justification: ‘Discusses application of mixed methods. Focusses on purposes of combining methods, not on methodological standards’. Similarly, the article by Choudhury and Zaman, Self-referencing as socio-scientific methodology in contrasting paradigms (Citation2009), was rejected: even though it contained ‘methodology’ in the title, abstract screening revealed that ‘methodology’ here referred to the practice of self-citation, and not to any type of research methodology.

Figure 2. Relevance-screening protocol for article title and abstract
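
A screening protocol of this kind can be thought of as a standardised record of each inclusion decision and its justification. The sketch below is a minimal illustration of such a record, not the authors' actual instrument; the example entry reuses the Kelle (2006) justification quoted above, and the record structure is an assumption.

```python
# Minimal sketch of a title/abstract screening record; the structure is assumed, not
# the authors' protocol. The inclusion rule mirrors the one stated in the text.
from dataclasses import dataclass

@dataclass
class ScreeningDecision:
    article: str        # short reference to the article screened
    include: bool       # True only if the abstract advocates or problematises
                        # particular methodological standards
    justification: str  # coder's recorded reason, kept for auditability

kelle_2006 = ScreeningDecision(
    article="Kelle (2006), 'Combining qualitative and quantitative methods in research practice'",
    include=False,
    justification=("Discusses application of mixed methods. Focusses on purposes of "
                   "combining methods, not on methodological standards."),
)
```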

This set of 168 articles was then carried forward to the second stage of screening, which was based on an appraisal of the full text and aimed to select only articles that made prescriptions that would be relevant for research investigating the relation between delivery of health care and peace and conflict dynamics in a conflict zone, in accordance with the practical objectives of the review. We created a standardised protocol (Table 3) to screen the full text of articles for relevance. We used this to screen 162 of the 168 preliminarily relevant articles. Six articles could not be screened because we could not access their full text, either through our library or through direct correspondence with the authors. This stage of screening resulted in 64 relevant articles.

Table 3. Relevance-screening protocols for full article

3.3 Quality assessment

Quality assessment is usually done through the critical review of the methods used in subject articles, which are appraised against accepted standards for that form of research. Our retrieved articles were a mix of empirical studies and non-empirical arguments that were informed by a diversity of theoretical perspectives. As such, a single external quality standard could never be fair. Therefore, we used a cross-disciplinary standard of ‘internal coherence’, which we operationalised through five criteria (Table 4). In trying to apply this instrument, we found low levels of inter-rater reliability among the team, and so we abandoned efforts to assess quality.Footnote2 This issue will be discussed further in Sections 4.2 and 4.3.

Table 4. Protocols for assessing article quality (‘internal coherence’)
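
The article does not state how inter-rater agreement was measured. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic, for two raters applying a single binary criterion; the ratings shown are invented for the example.

```python
# Illustration of checking inter-rater reliability with Cohen's kappa. The article does
# not specify its agreement measure; the ratings below are invented for the example.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(rater_a) | set(rater_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical judgements of 'internal coherence' on one criterion for ten articles:
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
rater_b = [1, 0, 0, 1, 1, 1, 0, 0, 0, 1]
print(cohens_kappa(rater_a, rater_b))  # values near zero indicate unreliable coding
```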

Having failed to apply quality standards to the review, we carried all 64 relevant articles forward to the analysis stage. Our first attempt at synthesis produced a subjectively useful but scientifically unsatisfactory set of prescriptions. A second synthesis strategy was then designed and applied to a proof-of-concept subset of 18 articles. Both attempts are discussed below.

3.4 First analysis/synthesis

Our first analysis/synthesis strategy was designed based on a direct application of the principles of SR to an aggregative synthesis framework. The distinction between ‘interpretive’ and ‘integrative’ syntheses was first made by Noblit and Hare (Citation1988); since then, the term ‘aggregative’ has replaced ‘integrative’. This heuristic distinction refers to two different approaches to review. An ‘aggregative’ synthesis involves summarising and pooling findings in a largely deductive way. By contrast, interpretive syntheses are usually concerned ‘with the development of concepts’ as outputs; they follow an inductive approach and are done within an interpretivist epistemology where reliability is often given less priority in the process of theory building (Dixon-Woods et al. Citation2006b, 2). The purpose of our review was to provide practical advice for the real-world conduct of research through a method that minimised our disciplinary biases in surveying an interdisciplinary literature. For these reasons we designed an aggregative synthesis strategy with the goal of retaining all underlying data and processing these data in a transparent and reproducible manner. The first step comprised a content analysis, guided by the question ‘what guidelines for research are prescribed in the knowledge claims of this article, and for what stage, methodology, method, or field of research does it apply?’. This content analysis was applied across the 64 articles and generated a set of roughly 420 recommendations.Footnote3 Inspection of the recommendations revealed numerous duplications, with recommendations claiming application across 92 thematic categories (based on categorisation according to stage, methodology, method and field). When we attempted to aggregate recommendations for synthesis we found it impossible to retain all relevant underlying data and were then unable to systematically structure the decontextualised extracts that remained.Footnote4 We concluded that the strategy fell far short of SR standards because it was based on unreliable assumptions about inference and conflict, because the system of categorisation was ambiguous, and because the knowledge-generating bases of prescriptions were discarded. We used this set to subjectively select recommendations, which we delivered to our funders,Footnote5 and then redid our study in hopes of identifying an improved synthesis strategy.
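
Under assumed data structures, the sketch below illustrates the pooling logic of this first strategy: grouping extracted recommendations by their claimed category and collapsing identical wording. As the article notes, such pooling discards the supporting context, which is where the strategy failed.

```python
# Sketch of the first aggregation strategy under assumed data structures: each extracted
# recommendation carries a text and a claimed category (stage/methodology/method/field).
from collections import defaultdict

def group_recommendations(recommendations):
    """Group recommendation texts by their claimed thematic category."""
    groups = defaultdict(list)
    for rec in recommendations:
        groups[rec["category"]].append(rec["text"])
    return groups

def collapse_duplicates(groups):
    """Within each category, keep one copy of identically worded recommendations."""
    return {category: sorted(set(texts)) for category, texts in groups.items()}

# In the review this style of pooling spread roughly 420 recommendations over 92
# categories, but the link between each recommendation and its supporting rationale
# was lost, which is why the strategy was judged to fall short of SR standards.
```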

3.5 Second synthesis

Our second attempt at aggregation began with a return to the qualitative SR literature in order to identify an appropriate alternative. There has been significant work on qualitative synthesis methodologies (for some examples see Bondas and Hall Citation2007; Campbell et al. Citation2003; Dixon-Woods et al. Citation2007; Dixon-Woods et al. Citation2006b; Higginbottom et al. Citation2012; Major and Savin-Baden Citation2012). We required an analysis method that was reproducible, one that supported conclusions about the real-world practices described in the texts reviewed, and one that brought forward contextual information, as this would be relevant to determining when a recommended method would no longer be appropriate. Further, the source literature displayed characteristics that required attention: the articles were of a non-standardised format and produced recommendations for multiple normative goals (i.e. varied conceptions of quality research), whereas SRs traditionally extract data from a uniform literature, all of which relates to a single common desired outcome. Among the qualitative synthesis methods we examined, many are based on interpretivist analysis. We concluded that such methods would be inappropriate for our review for two reasons. First, interpretivist analysis usually involves the subjective judgement of the analyst rather than the systematic execution of a reproducible analytic method. Second, interpretive syntheses frequently limit their claims to the texts reviewed. That is, rather than trying to reach from these texts to make inferences about the real world those texts purport to describe, the claims made in the review are limited to representations or constructions internal to the texts themselves. We found examples of non-replicability and radical constructivism in critical interpretive synthesis, grounded theory, meta-ethnography, narrative summary, qualitative research synthesis, and thematic analysis. On the other hand, one of the hallmarks of qualitative research is ‘thick description’ (Geertz Citation1973): representations of the research objects that retain discussion of context adequate to support readers’ own interpretation. By contrast, we found that qualitative synthesis methods such as Bayesian meta-analysis, case survey, content analysis and qualitative comparative analysis all tend to discard context.

While each of these methods was inappropriate in its entirety, we were able to identify elements in several that were relevant for our review. We drew on these relevant elements to synthesise a method that, we hoped, would be reliable, support inferences from text to real-world practice, retain context, and be applicable to a heterogeneous and multinormative literature. The method we developed for aggregation drew on cross-case techniques (Miles and Huberman Citation1994), narrative summary (Hubbard, Kidd, and Donaghy Citation2008; Secomb Citation2008) and meta-study (Paterson et al. Citation2001). Narrative summary was chosen because the source articles were heterogeneous and lacked a common structure. Our hope was that narrative summaries of articles would reproduce source material in a standard format that retained the connection between knowledge claims and their supporting justifications and scope of applicability. Cross-case techniques were selected as a means to integrate claims from across sources in such a way that the process of analysis could be fully documented. Finally, we chose the meta-study as a framework because we were dealing with a set of literature that differed not only in knowledge claims but also in theoretical starting points. That is, each article began from a conception of what the normative goal of quality research should be (and these conceptions differed widely), and on this basis made recommendations for how research should meet that goal. Our synthesis would therefore need to review theory as well as results.

We tried this aggregation procedure on a subset of 18 articles. These articles were selected purposively by the research team as their content was most immediately relevant to the NGO, their number was adequate to test our synthesis method and we did not have the resources required to redo the entire analysis.

The process by which we identified prescriptions in articles is summarised below, while readers interested in a fuller description are referred to the technical report:

  1. Operationalise the research questionFootnote6 in terms of sub-questions

  2. Code all articles top-downFootnote7 using those sub-questions

  3. Code all articles bottom-up for missed analytically relevant themesFootnote8 and for variables that may be useful for categorising prescriptions

  4. Assess bottom-up codes and recode all articles top-down based on that assessment

  5. For each article, write a narrative summary accounting for all coded text, with accompanying log of analytically relevant decisions made in preparing each narrative summary

  6. Code narrative summaries for analytically relevant themes

  7. Aggregate coded text into thematic clusters by codes

  8. Within each cluster collapse identical statements

  9. Within each cluster explain incompatible statements by referencing original articles

  10. Convert remaining statements into appropriately qualified methodological prescriptions referencing original articles, log notes and explanations of divergence

In contrast to our first aggregation attempt, here we identified all candidate categorising variables through bottom-up coding. All candidate categorising variables were then applied through top-down recoding of all articles. These candidates were assessed with respect to their analytic relevance, prevalence and capacity to cleanly segment the reviewed literature. At the outset we were concerned that theoretical or normative differences between articles might point to mutually incompatible but individually appropriate prescriptions. The variable proposed at the outset to stratify the literature was the stated purpose of research, or, worded another way, the normative conception of what constitutes quality research in a particular article. However, during the initial top-down coding procedure we noticed that authors at times declared multiple purposes for a given prescription, with the consequence that the variable ‘purpose’ would not adequately segment the literature. Instead, one candidate variable identified through bottom-up coding, ‘reach’, proved to be the most analytically useful and was considered the most replicable in terms of application to the articles.

The variable ‘reach’ that we adopted was informed by ongoing discussion of the epistemological status of claims generated through research that queries humans about the real world. This line of discussion considers, to cite a few highlights, the extent to which human experience of the world is mediated by imperfect interpretation, how accounts of experience reported to researchers are imperfect, how researchers’ perceptions of those accounts are imperfect, and how researchers’ own resulting practices of analysis and representation are, similarly, imperfect. In this essay the construct ‘reach’ was used to describe a variable we defined as ‘the proximity of the representation produced by a given set of methodological prescriptions to the real world’. These might range from the naïvely positivist (e.g. valid understanding of an objective world) to the extremely constructivist (e.g. researchers’ subjective impressions of data). As our purpose was to provide material advice for real-world problems, we worked with the following values for ‘reach’: real world, respondents’ interpretations of the real world, respondents’ contextually shaped constructions of the real world and respondents’ contextually shaped representations of contextually shaped constructions of the real world.

Articles were then re-coded top-down according to ‘reach’ by querying the conclusions of articles as follows: ‘about what is this article making claims?’ The remaining prescribed steps were then completed and a final product was created, although it was constrained by both the partiality of, and mutual inconsistencies between, source texts, something not anticipated by the aggregation procedures used. The final product is illustrated in summary form in Table 5.

Table 5. Methodological prescriptions resulting from review
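
To make the coding and clustering steps listed above concrete, the sketch below shows, under assumed data structures, how statements coded from the narrative summaries might be pooled by theme and ‘reach’, with identical statements collapsed while their source articles are retained. It is an illustration of the procedure's logic, not the authors' implementation.

```python
# Sketch of steps 6-10 under assumed data structures: pool coded statements into
# thematic clusters, keyed also by 'reach', and collapse identical statements while
# keeping track of the articles that support them.
from collections import defaultdict

REACH_VALUES = (
    "real world",
    "respondents' interpretations of the real world",
    "respondents' contextually shaped constructions of the real world",
    "respondents' contextually shaped representations of contextually shaped "
    "constructions of the real world",
)

def cluster_statements(coded_statements):
    """coded_statements: iterable of (theme, reach, statement, source_article) tuples."""
    clusters = defaultdict(lambda: defaultdict(set))
    for theme, reach, statement, source in coded_statements:
        # Identical statements collapse automatically; each keeps its supporting sources.
        # Divergent statements within a cluster must still be explained against the
        # original articles, as the procedure requires.
        clusters[(theme, reach)][statement].add(source)
    return clusters
```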

4. Strengths and weaknesses of application

This section reports on how successful we were with each element of our review methodology. We first outline the aspects of SR that we found useful, then outline those we found problematic followed by a discussion of the potential value of these aspects.

4.1 Aspects that transferred well

We were able to execute a search and to screen returned articles for relevance to an interdisciplinary study on the organisation of health care in a comprehensive and transparent manner. While this was to be expected, as abstract and text appraisals are standard practice for determining relevance in many SRs (Campbell et al. Citation2003; Dixon-Woods et al. Citation2007; Higginbottom et al. Citation2012; Hubbard, Kidd, and Donaghy Citation2008), such a transparent search for and selection of articles has been noted as wanting in reviews of argument-based literature (McCullough, Coverdale, and Chervenak Citation2007). We were also able to identify, classify, analyse and process prescriptions from a literature spanning the medical and social sciences in a rigorous, comprehensive and transparent manner.

4.2 Aspects that did not transfer well

Our review contained an English language bias and a publication bias. Thirty-two articles (10 per cent) of the initial 312 returned by our search were excluded because they were written in a language other than English. As language screening was the first refinement step, no further investigation was made as to the relevance of these 32 articles. The language bias might therefore be considered quantitatively small, but may reinforce divergences in standards between linguistically segregated academic communities. Our language screen was imposed due to resource constraints and, while not uncommon in SRs (Magarey Citation2001), we consider this an area for improvement. It is important to note, however, that these 32 non-English language articles span 13 languages, so while this improvement is methodologically straightforward, it will be resource intensive.

We confined our search to databases of peer-reviewed academic articles. Grey literature was therefore not included in our review. While this may exclude some important methodological innovations made, for example, in the practitioner literature (Fenn Citation2012; Mallett et al. Citation2012), we don’t yet see a way to overcome this publication bias as it serves two important functions: quantity and quality management. While statistical SRs are strengthened with larger numbers of studies, with qualitative SRs, due to the resources required for analysing large bodies of qualitative data, it is not uncommon for a review to cover a small but focussed sample.Footnote9 An earlier version of our search on WOS returned 13,036 articles, an unworkably large number. To refine, we focussed the search at a higher level of abstraction. It is likely that extending the search to grey literature would return a similarly unworkable number of articles.

We were not able to appraise the quality of reviewed articles. Although there exist quality assessment criteria for qualitative research articles, for quantitative research and for argument-based articles, we chose not to use them as this would imply differentiated treatment of primary articles, compromising our principle of comprehensiveness.Footnote10 We instead used ‘indexed in Scopus or WOS’ as a preliminary control, which externalises the problem of quality appraisal from us as reviewers to the known imperfections of inclusion in indexes of peer-reviewed sources. Further, we attempted but were unable to reliably screen according to our protocol of internal coherence.

In no case did the prescriptions made in a single article discuss everything methodologically required for a given research effort. The prescriptions in a given article always had external dependencies. These dependencies would, ideally, be comprehensively identifiable through citations. We had no evidence, however, that the authors whose articles we reviewed were systematic about their citations. Authors’ lack of transparency in citation made it impossible for us to falsify the hypothesis that they just cherry picked citations that made their prescriptions look good. We were therefore unable to properly delimit individually adequate and mutually discrete methodological prescriptions.

The articles we studied relied variously on evidence and argument to justify their prescriptions. In aggregating prescriptions arising from these very different foundations we assumed equivalence and so we were unable to compare the significance of prescriptive knowledge claims. In our view this results from both the absence of quality controls on our subject literature, and from underdeveloped methods for integrating knowledge claims from diverse sources of knowledge. The limitations with applying quality controls are discussed elsewhere in this article, but a promising strategy may be in the use of quality criteria to accord significance rather than exclusion, which might at least allow us to distinguish prescriptions based on strong foundations from those based on weak or questionable foundations.Footnote11 As for the heterogeneous literature, a priority area for development of synthesis methods is in the integration of evidence-based and argument-based knowledge, and within evidence-based knowledge, that based on quantitative, qualitative and mixed methods research. While some work in theorising the value of different types of knowledge to a review framework appears promising (Heyvaert, Maes, and Onghena Citation2011; McCullough, Coverdale, and Chervenak Citation2007; Sandelowski et al. Citation2012), our experience suggests that the challenges are formidable. Although our search specifically targeted articles with a conscious epistemological awareness, the heterogeneity of returned articles does not give us confidence that it will be easy to find an overarching framework within which each article can be neatly placed. In the absence of a forthcoming framework, we would argue in line with the principles of SR and evaluation that deference be given to prescriptions supported by evidence. Hence, our unweighted aggregation of epistemologically distinct rationales was unfortunate but improving upon it will remain a challenge until such time as methodological prescriptions are only publishable when supported by evidence.

4.3 Pilot study and subject literature in comparison

In order to contextualise this discussion of strengths and weaknesses, the contributions of a systematic approach to the methodological literature must be compared to current practices of generating prescriptions. Our pool of 64 articles included 14 review, 37 argument-based and 13 empirical articles. The value of each aspect of the SR method is now assessed in relation to the review and argument-based articles we encountered.

The successful applications of SR methodology – recordable search and inclusion protocols, and a transparent identification of prescriptions – are clear improvements on what we encountered in the source articles. Few of the source articles discussed how they found or selected their references, nor did they specify whether they did anything other than ignore publications that were divergent from their conclusions. Similarly, we don’t know if articles simply ignored inconvenient details in otherwise useful sources. Our transparent inclusion and extraction protocols mean that we cannot ignore deviant prescriptions, even if we were not always able to do something elegant with them. This echoes the discussion of McCullough, Coverdale, and Chervenak, who, in their SR of the clinical ethics literature, observe that ‘none of the publications [under review] provided a formal search strategy or literature review based on such a strategy, making it difficult for the reader to reach a judgement about whether the literature cited in the publications omits other publications that might be relevant […]’ (Citation2007, 72).

For four of those aspects of SR that did not transfer well to our review, our use was no worse than what was encountered in the source articles. Our biggest weakness, going by the standards of SR, was our failure to screen for quality. However, none of the review and argument-based articles we dealt with offered any discussion of the quality of the articles they cite. There is debate over the value of external quality criteria in assessing qualitative studies. While SRs traditionally excluded outright poor quality studies, the view has been expressed that poor quality studies might still provide valuable insight in qualitative reviews, leading some to apply quality screening to reviews but to nonetheless include those articles judged to be of poor quality – for a discussion on the value of quality criteria, see Campbell et al. (Citation2003), Dixon-Woods et al. (Citation2006a) and Edwards, Russell, and Stott (Citation1998) and for reviews which include negatively assessed studies, see Dixon-Woods et al. (Citation2006b), Lemmer, Grellier, and Steven (Citation1999) and McPherson and Armstrong (Citation2012). We approached this review with the realist intention of generating prescriptions for research that would ‘hold water’ in the real world. Indiscriminate inclusion can only produce a review that answers the relativist question ‘what do reviewed authors consider to be good research’. Producing methodological prescriptions that cross the phenomenological gap requires a method to assess the strength of authors’ arguments, a method whose validity is anchored outside of the individual approaches of the authors under review.

We were unable to adequately identify the limits of the applicability or the external dependencies of prescriptions. This problem stemmed largely from the reporting styles of the source literature. Authors tended to emphasise the value of their contributions and were relatively silent when it came to stating when their recommendations no longer hold. As such, in the case of our review and of the subject articles, readers must subjectively decide, in the absence of adequate information, if the methodological prescriptions being made are relevant to their situation.

Third, while we were unable to find or design a research synthesis method to integrate a heterogeneous literature, nowhere did we encounter any discussion of the implications of integrating knowledge arising from evidence and from argument. As reviewed authors did draw conclusions based on diverse sources, as they were silent on the question of synthesis across these sources, and as we know there are significant challenges in producing a valid synthesis, we can’t have much confidence that the authors we reviewed made scientifically defensible use of their references.

For the purposes of reliability and as a preliminary quality control, we constrained our review to peer-reviewed publications. While we strongly support the goals of reliability and quality control in reviews, the means we chose had the unfortunate consequence of excluding unpublished or practitioner reports, irrespective of their potential relevance. However, our choice was typical: most articles under review cited only published works, thus, like us, reinforcing the publication bias.Footnote12

Finally, two aspects of our study were inferior to what we encountered in the articles under review. Our language bias was a marginal disimprovement. We lost just over 10 per cent of the articles returned by our search through excluding non-English language articles. Although we have no idea how this compares proportionally with the subject articles since most authors are not transparent about how they identify relevant literature, the simple presence of multilingual bibliographies is superior at face value.Footnote13 However, the language bias is more significant in how it reproduces linguistic segregation in academia more generally. A systematic treatment of literature should try to act as a leading model in bridging methodological movements that happen in discrete linguistic academic communities.

Second, in using a systematic approach, we set ourselves high standards with respect to assigning significance to different prescriptions. Ultimately we failed to live up to these standards. What we are left with is an exhaustive set of prescriptions extracted from all relevant subject literature in our review, but without solid recommendations as to which prescriptions carry the most scientific authority. As with identifying boundaries of relevance, this is left to the subjective judgement of our readers. Our subject articles established the strength of their claims either through particular empirical studies, or through methods of argumentation. Standards exist for specific traditions of research, and for argumentation, but structured methods for aggregating claim strengths across these studies have yet to be developed. As such, an unintended consequence is that we may be polluting the specific standards within fields by indiscriminately importing poor quality knowledge claims from other fields.

5. Conclusion

The study of health care in conflict zones, and of development more generally, is necessarily interdisciplinary. As a result, evaluating ‘what works’ is complicated by the absence of an interdisciplinary consensus on quality evaluation research designs. This study reports on a pilot use of the SR methodology for reviewing methodologically prescriptive literature, which we suspected could generate interdisciplinary standards for research independent of reviewers’ disciplinary biases. We were partially successful at achieving this aim. We found that many of the standard procedures within SR are relevant to the generation of methodological prescriptions from a heterogeneous literature. In our case, the additional effort required for a transparent and reproducible review was merited. Based on this experience, we recommend, wherever possible, the use of a systematic approach for the design of non-trivial evaluations. At minimum, using systematic methods to design research will increase transparency and support better informed discussion with regard to the quality of interdisciplinary evaluation. In particular, our success in using a recordable literature search strategy, clear inclusion protocols and a systematic and comprehensive aggregative framework sets a standard for methodologists far above that observed in the articles we reviewed. We recommend the use of these three components as a minimum when designing non-trivial evaluations. We expect such an approach to yield designs that are transparent about their sensitivity to the methodological requirements arising from each of the contributing disciplines, to methodological innovations and debates within each of these disciplines and to the inconvenient attributes of the methodologies considered.

We did encounter four frustrations that we expect will trouble those who follow our example: the articles identified by our search were sufficiently heterogeneous to disrupt comparison, many of the articles we reviewed were non-empirical and relied on argument rather than evidence, we could not find cross-disciplinary standards by which we could assess the quality of articles and authors were not transparent in their own citation of supporting work.

Two of the frustrations we encountered are being worked on. First, scholars are extending SR to heterogeneous literature. In this regard, the critical interpretive synthesis of Dixon-Woods, Cavers, et al. (2006) and the classification of 18 mixed methods research synthesis frameworks proposed by Heyvaert, Maes, and Onghena (Citation2011) are exemplary. This effort invites further development, although its proximity to fundamental debates and assumptions in epistemological categorisation cautions us against expectations of any quick-fix solution. Second, in our review we were confronted by prescriptions that were justified by reference to argument. While there is some work on improving the review of argument-based literature (Mahieu and Gastmans Citation2012; McCullough, Coverdale, and Chervenak Citation2007; Sofaer and Strech Citation2012, Citation2011), our hope is that in the future it will become impossible to publish methodological prescriptions without an evidentiary basis.

This research was motivated partly by the difficulty of designing interdisciplinary research given the plurality of quality standards across relevant disciplines. Rather than overcoming this issue, we were instead confronted with the same problem at a higher level of abstraction: it was impossible to screen the methodologically prescriptive articles we reviewed on quality grounds because we could not identify acceptable cross-disciplinary standards by which to assess articles. While the evidence regarding peer review suggests that it is a poor proxy for quality appraisals, and while an academic publication bias may exclude some relevant innovations arising from ‘grey’ evaluation studies, peer-reviewed publication seems to be the only reasonable minimum stop-gap standard. We hope a systematic approach to reviews will create pressure for the identification of cross-disciplinary quality criteria. In this respect, standards of reporting might be a useful starting point, and variations of this approach are already being experimented with (Atkins et al. Citation2012; Carroll, Booth, and Lloyd-Jones Citation2012; Da Silva Citation2014; Dixon-Woods et al. Citation2006b; Edwards, Russell, and Stott Citation1998; Hughes, José Closs, and Clark Citation2009).

However, while justifiable from a perspective of cross-disciplinary inclusion, immediate prospects for this standard being adopted would appear limited, as the final problem we encountered was that the authors we read generally did not discuss how they picked their source articles. Standards of reporting can only be pushed to a certain extent by reviewers alone. We therefore appeal to methodologists to share responsibility and to produce full and transparent reports of the methods they use in selecting supporting articles when publishing. Although a limited proxy for quality, such transparency would help reviewers identify both the dependencies of prescriptions within their supporting literature and the bounds of their relevance and, as such, would seem reasonable for editors to require of their authors.

As a consequence of these four frustrations (heterogeneity, non-empirical argument, the absence of cross-disciplinary standards, and the absence of transparency in citations), we were ultimately unable either to weight our prescriptions or to deal with conflicting standards. We welcome efforts by methodologists to overcome these four obstacles, and we encourage further use of systematic approaches to review in order to create pressure for such methodological developments. We believe improvements in SRs along these four lines will do much to raise standards for evaluations, both in health-peace studies and in international development more generally, which will hopefully contribute to the prioritisation of better development projects and strategies in the future.

Supplemental material

1160419_supp_standards_from_disciplines_Technical_Report.docx


Acknowledgements

An earlier version of this paper was presented at the EADI annual conference in Bonn in June 2014. The authors wish to thank the Dutch NGO CORDAID, which funded the research on which this paper is based. They would also like to thank those authors who obliged with timely responses when contacted for information relevant to this review.

Supplemental data

Supplemental data for this article can be accessed here.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1. For a discussion on research in politicised environments, see Barakat, Chard, and Jones (Citation2005), Brown (Citation2009) and McMullin (Citation2011).

2. Further details of failed operationalisation of the quality instrument can be found on pages 9–11 of the technical report accompanying the online version of this article as supplemental material.

3. A more detailed description of the design and application of the content analysis can be found on page 12 of the technical report. It is reported on lightly here as this aspect is not of great relevance to the analysis of our review method.

4. Details of attempts to systematically order these prescriptions through the ‘Items strategy’ can be found on pages 13 and 14 of the technical report.

5. Details on how a subset of recommendations was subjectively selected can be found on page 14 of the technical report.

6. The ‘research question’ here refers to the research question that the SR sought to address, as introduced in Section 2: ‘What methodological guidelines for research into the interaction between the delivery of primary health care and peace and conflict dynamics in areas of endemic conflict can be extrapolated from methodological prescriptive social science literature?’ This is not to be confused with the research question of this article which is concerned with investigating the usefulness of the SR methodology.

7. We use the terms ‘top-down coding’ and ‘bottom-up coding’ to refer to coding through an a priori framework and through an emergent framework, respectively. While top-down coding is preferable from the perspective of reliability, it was suspected (correctly) that initial candidates for explanatory variables might not suffice. Hence an allowance was made for iterative identification of themes, with such identification documented.

8. ‘Analytically relevant themes’ refers to potential independent variables that could explain why a given prescription is made by one article and not another. Themes can be conceptual, as in ‘conception of research quality’, or methodological, as in ‘for qualitative research’, and have the function of denoting a (preferably clearly bounded) scope of applicability of a prescription.

9. As an example, a meta-ethnography by Campbell et al. (Citation2003) on the experiences of diabetics included only 10 studies in their final review.

10. For an example of assessment for qualitative research, see CASP (Citation2010a). For quantitative research, see CASP (Citation2010b). For assessment criteria for argument-based literature, see McCullough, Coverdale, and Chervenak (Citation2007).

11. For an example of a strategy of using quality to accord significance, see NHS CRD (Citation2001; cited in Wallace et al. Citation2004).

12. Notable exceptions include Clegg (Citation2005) who cites communications from research centres, quality standards bodies and a medical institute. Klein (Citation2008) cites a report by the OECD and a paper from a University’s internal working paper series. Robertshaw (Citation2007) cites a report by a public research institute, a website of a computer-assisted data analysis software provider and a news publication.

13. The majority of articles under review cited almost exclusively from English-language sources. Two multilingual exceptions are illustrative: Stige, Malterud, and Midtgarden (Citation2009) cite one publication in Danish and one in Norwegian, along with a German-language original publication by Habermas, a Norwegian translation of Bourdieu and a Danish translation of Paulo Freire. Stoczkowski (Citation2008) cites numerous works in French and one self-citation in Polish. Such citation practices would suggest that although our review was constrained to English-language sources, this is only marginally worse than practices in general.

References