Using large text news archives for the analysis of climate change discourse: some methodological observations

Pages 395-406 | Received 12 Aug 2020, Accepted 23 Jan 2021, Published online: 18 Mar 2021

Abstract

This paper explores the contribution of software-based tools that are increasingly used for the semi-automated analysis of large volumes of text, especially Topic Modelling and Corpus Linguistics. These tools hold out the promise of new and interesting insights gained quickly, but at a cost. Linguistic aspects need to be considered carefully if computer-assisted technologies are to provide valid and reliable results. The main features of these tools are presented, some general problems and limitations are discussed, and the relation between technical tools and theoretical frameworks is examined. The main empirical reference is the case of climate change.

1. Introduction

The aim of this paper is to explore the contribution of software-based tools that are increasingly used for the semi-automated analysis of large volumes of text.Footnote1 I will provide an overview of two such technologies, Topic Modelling (TM) and Corpus Linguistics (CL). My main empirical reference is the case of climate change. The literature review is not systematic, but it tries to present some of the most relevant contributions. I also include a more personal, eclectic element, which makes this something of an auto-ethnographic case study of a sociologist who became interested in linguistics and computer-assisted tools. I will show that there is much to be gained by using such tools, but that they need to be aligned with theoretical perspectives and research questions in order to advance research in this field.

In this paper I will address two issues: one is methodological and deals with aspects of corpus construction, data analysis and the significance of findings. The other focuses on two variables which could be considered crucial for research in this field, agency and frames. I will first summarize the central elements of TM and CL, then sketch some common problems in large text data analysis, before turning to the variables of interest.

2. Topic modelling

2.1. Scope and tools

Topic modelling is a fast-growing approach not only in computer science, but across the social sciences. Scholars use software packages to analyse large text databases in this way. Blei gives the following definition:

‘[M]achine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time.’ (Blei Citation2012, 77–78).

Maier et al. (Citation2018, 93) summarize a major technique used in topic modeling, latent Dirichlet allocation (LDA). This ‘is a computational content-analysis technique that can be used to investigate the “hidden” thematic structure of a given collection of texts. The data-driven and computational nature of LDA makes it attractive for communication research because it allows for quickly and efficiently deriving the thematic structure of large amounts of text documents. It combines an inductive approach with quantitative measurements, making it particularly suitable for exploratory and descriptive analyses.’

The tool is thus best described as ‘semi-automated, unsupervised method for the analysis of textual content’, in which ‘the model attempts to “mimic” the writing process of a given corpus of documents. … “Topics” are statistical entities, sets of frequency distributions of words, based on the linguistic assumption of co-occurrence, that is, that words that are being used more frequently in the same documents also associate thematically. In topic models, every word in the corpus has a probability of appearing in each of the topics, and every document is composed as a mixture of all topics […]. Topic modeling is a “bag-of-words” approach, which means that narrative, location in the text, and syntax are not taken into consideration.’ (Walter and Ophir Citation2019, 254)
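
To make these mechanics concrete, the following is a minimal sketch of how such a model might be fitted with the Python library gensim. It is not the pipeline of any of the studies cited here; the example texts, the number of topics and all parameters are illustrative.

```python
# A minimal LDA sketch with gensim; texts and parameters are illustrative.
from gensim import corpora, models
from gensim.utils import simple_preprocess

articles = [
    "Ministers unveiled a new climate change strategy for cutting emissions.",
    "Scientists warn that global warming is accelerating faster than expected.",
]

# Bag-of-words representation: word order and syntax are discarded.
tokenised = [simple_preprocess(text) for text in articles]
dictionary = corpora.Dictionary(tokenised)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised]

# Fit the model; the number of topics is a researcher's choice.
lda = models.LdaModel(bow_corpus, num_topics=5, id2word=dictionary,
                      random_state=42, passes=5)

# Each topic is a frequency/probability distribution over words ...
for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
    print(topic_id, [word for word, prob in words])

# ... and each document is modelled as a mixture of all topics.
print(lda.get_document_topics(bow_corpus[0]))
```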

Several studies have applied LDA to the analysis of news reporting on climate change. Bohr (Citation2020) created and analyzed a corpus of over 78,000 articles covering two decades from 52 US newspapers in order to investigate outlet bias with regard to geography, partisan orientation, or scale of circulation. Keller et al. (Citation2020) used LDA on over 18,000 climate change articles published between 1997 and 2016 in two Indian newspapers. They categorized the news items into 28 different topics related to overarching themes such as impacts, science, politics, and society. They found that topics related to ‘Climate Change and Society’ and ‘Climate Politics’ appear more frequently than ‘Climate Change Impacts’ or ‘Climate Science’. This raises the question of what the differences between these headings are, and what they actually mean. The authors are aware of some limitations of LDA: the assignment of one prevalent topic to an article has to be considered with caution, as an article will rarely consist of only one topic. They also note that ‘some topics are less robust, occur only in either one of the newspapers and are harder to validate by external reviewers which challenges their significance for this study’ (Keller et al. Citation2020, 231). There are other studies available, such as Boussalis and Coan (Citation2016), who present an analysis of US climate-sceptical think tanks. Boussalis, Coan, and Poberezhskaya (Citation2016) analyse climate change coverage by the Russian press. Vu, Liu, and Tran (Citation2019) provide a frame analysis of climate change reporting in 45 countries, based on LDA.

2.2. Problems and limitations

Topic modeling has received even more critical scrutiny from Brookes and McEnery (Citation2019), who emphasize three main weaknesses: a lack of linguistic sensitivity, a failure to define topics, and a lack of replicability. ‘In terms of linguistic sensitivity, the first feature worth considering is that topic modelling uses a very naïve model of a text. A text is viewed as being composed of a collection of words, a simple unordered set’ (Brookes and McEnery Citation2019, 5). In a similar vein, Nerlich, Forsyth, and Clarke (Citation2012, 48) observe the ‘obvious fact that human language is inescapably a sequential phenomenon. Phonemes, words, phrases and other linguistic phenomena are generated and interpreted in a temporal sequence. A ‘‘blind Venetian,’’ for example, is not the same as a ‘‘Venetian blind.’’’ A further problematic step is that LDA researchers often remove grammatical and function words, and use word lemmata, disregarding the different morphological expressions of a word. In doing so, important information can be lost. McEnery et al. (2015) show how the words muslim and muslims ‘consistently index different discourses about Muslims. Collapsing the frequency counts of both words together clearly risks at the very least blurring an important distinction and, at worst, losing or mischaracterising it.’ In addition, some researchers remove words that are very frequent and very infrequent (see Maier et al. Citation2018).
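
To illustrate what these pre-processing decisions look like in practice, here is an assumed sketch (not reconstructed from any cited study) of a typical LDA preparation step in Python, using spaCy and gensim. Note how lemmatisation collapses ‘Muslim’ and ‘Muslims’ into a single form, and how frequency thresholds silently discard words.

```python
# Assumed pre-processing pipeline of the kind criticised above: function words
# are removed, word forms are collapsed into lemmata, and very frequent and
# very infrequent words are filtered out.
import spacy
from gensim import corpora

nlp = spacy.load("en_core_web_sm")
texts = [
    "Muslims responded to the report.",
    "A Muslim leader welcomed the findings.",
]

lemmatised = []
for doc in nlp.pipe(texts):
    # 'Muslims' and 'Muslim' both become the lemma 'muslim', erasing a
    # distinction that corpus linguists have shown to carry meaning.
    lemmatised.append([tok.lemma_.lower() for tok in doc
                       if tok.is_alpha and not tok.is_stop])

dictionary = corpora.Dictionary(lemmatised)
# With a realistic corpus this would drop words occurring in fewer than
# 5 documents or in more than half of all documents.
dictionary.filter_extremes(no_below=5, no_above=0.5)
```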

Brookes and McEnery note the lack of any theoretical account of what constitutes a ‘topic’ within topic modelling: ‘concepts like ‘theme’ and even ‘coherence’ are vague and remain ill-defined within this body of research. What constitutes a theme and what it is claimed makes that theme coherent are unclear’ (Brookes and McEnery Citation2019, 6). And: ‘in many existing topic model studies, topics are inferred by researchers without actually inspecting the texts assigned to that topic.’ Murakami et al. (Citation2017) also think that the term ‘topic’ is a misnomer, pointing out that ‘[t]hese groups of co-occurring words characterize “topics”, and researchers may choose to refer to them using topic-like titles, but these are only convenient abstractions from lists of words’ (Murakami et al., Citation2017, 244; see also Walter and Ophir Citation2019; Nicholls and Culpepper Citation2020).

The final criticism is that ‘the replication – or more specifically, the repetition – of a topic model study is not necessarily possible. Consequently, a user may modify and re-run the topic modelling many times, evaluating outputs until finding an analysis that they deem to be credible and usable. This introduces the possibility of a high degree of subjectivity into the analysis.’ (Brookes and McEnery Citation2019, 8). The problem of replicability has been addressed by others as well (see Roberts, Stewart, and Tingley Citation2016; Wilkerson and Casas Citation2017).

Their overall verdict is harsh: ‘a major concern with topic modelling methods is the present lack of an adequate theoretical underpinning of what a topic actually is. This absence has given rise to ill-defined (and likely inconsistent) procedures of topic discovery that lack linguistic sensitivity and are, it seems, liable to make the false assumption that unordered and de-contextualised words can be mapped neatly onto propositional topics. Future research employing topic modelling methods should therefore endeavour to engage more deeply with linguistic theory when inferring the presence of topics in their textual data’ (Brookes and McEnery Citation2019, 18).

DiMaggio, Nag, and Blei (Citation2013) have applied this method claiming that it can help with frame analysis, and that it is well suited to analyzing polysemy and heteroglossia (polysemy refers to variations in meaning of a single term, heteroglossia refers to ambiguity at the level of the text). As the authors put it, a ‘virtue of topic modeling is its deep affinity to the central insight in the sociology of culture that texts do not necessarily reflect a singular perspective but are often characterized by heteroglossia, the copresence of competing ‘‘voices’’—perspectives or styles of expression—within a single text’ (DiMaggio, Nag, and Blei Citation2013, 582). In his seminal paper, one of the co-authors, Blei, had written that the fundamental ‘intuition behind LDA is that documents exhibit multiple topics’ (Blei Citation2012, 78). This indicates that some interesting work is going on in some areas of topic modeling. The reason may lie in the fact that the group of co-authors brings together a variety of expertise: ‘Like any clustering technique, the method should be employed as a heuristic tool in combination with additional information by a research team that includes subject-area experts’ (DiMaggio, Nag, and Blei Citation2013, 582). One might say that adding a linguist to the team might benefit the development of their methodology.

However, the techniques used for analysis need justification, as does every methodology. The authors state that ‘the sociology of culture has long been theory-rich and methods poor. Sociologists who study culture have generated numerous theoretical insights and developed concepts that promise a deep understanding of cultural change. Yet they have often lacked the means to make such concepts operational’; they suggest that LDA could provide the tools for the job (DiMaggio, Nag, and Blei Citation2013, 571). However, it is a large step from identifying a lack of operationalization of theoretical concepts that could be used for empirical research to the assertion that one specific software application would provide the answer.

What is more, the empirical case used in their paper, the news coverage of public funding of the arts in the US, is underwhelming as regards the results. The main finding is that the news coverage focused on controversy, ‘producing a cloud of negative representations’, especially after the election of George H.W. Bush, which unleashed a ‘culture war’. The question is how unique this insight is, and whether it goes beyond what an informed news reader would already know.

The authors recognize this problem and state in concluding that ‘the model is just the beginning. For cultural analysis, the purpose of modeling is to apprehend the structure of the data and render it tractable by producing meaningful topics … that can be used to answer more focused questions… Topic modeling will not be a panacea for sociologists of culture. But it is a powerful tool for helping us understand and explore large archives of texts’ (DiMaggio, Nag, and Blei Citation2013, 602–3). What they say about cultural sociology applies to other social science fields which are theory rich but methods poor. Topic modelling is unlikely to provide a ready-made tool to fill this gap. Operationalization of key theoretical concepts remains a major task.

3. Corpus linguistics

3.1. Scope and tools

There is no agreed definition of what CL is or does (see Gries Citation2009). I use the definition of McEnery and Hardie (Citation2012, 1–2), who state:

‘We could reasonably define corpus linguistics as dealing with some set of machine-readable texts which is deemed an appropriate basis on which to study a specific set of research questions. The set of texts or corpus dealt with is usually of a size which defies analysis by hand and eye alone within any reasonable timeframe. It is the large scale of the data used that explains the use of machine-readable text… corpora are invariably exploited using tools which allow users to search through them rapidly and reliably. Some of these tools, namely concordancers, allow users to look at words in context. Most such tools also allow the production of frequency data of some description.’

Word frequencies, word collocations and clusters, and key words are prominent metrics in CL research. Word frequencies indicate the importance of specific words within a given text. Collocations indicate how a specific word is accompanied by other words (its collocates) within close proximity in a text. Word clusters are (short) strings of words occurring in a text, also called n-grams. Key word analysis establishes the relative salience of specific words from one selected text compared to a reference corpus.
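
As a rough illustration, the following Python sketch shows simplified (assumed) versions of these metrics computed from a toy token list; dedicated CL tools such as concordancers implement them with far more sophistication and with proper statistical measures.

```python
# Simplified (assumed) versions of basic CL metrics on a toy token list.
from collections import Counter

tokens = ("the government announced new climate change targets "
          "while critics called the climate plan inadequate").split()

# Word frequencies: how often each word occurs in the text.
frequencies = Counter(tokens)

# Collocates: words appearing within +/- `window` tokens of a node word.
def collocates(tokens, node, window=3):
    found = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            found.update(tokens[max(0, i - window):i])
            found.update(tokens[i + 1:i + 1 + window])
    return found

# Word clusters (n-grams), here bigrams.
bigrams = Counter(zip(tokens, tokens[1:]))

print(frequencies.most_common(3))
print(collocates(tokens, "climate").most_common(3))
print(bigrams.most_common(3))
```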

There is a limited number of studies that have used CL methods for the analysis of climate change discourse. Brigitte Nerlich and a team of co-authors have investigated English-language news coverage, looking at linguistic properties, metaphors, and readers’ comments (Collins and Nerlich Citation2015; Nerlich and Koteyko Citation2009; Koteyko, Jaspal, and Nerlich Citation2013; Jaspal, Nerlich, and Koteyko Citation2013). Dayrell (Citation2019) built a corpus of Brazilian news reports on climate change consisting of 19,686 newspaper texts (11.4 million words) published by 12 Brazilian broadsheet papers. She used keyword and collocation analysis for the investigation. Salient keywords relate to greenhouse gas emissions, UN conferences, fossil fuels, renewable energy and deforestation.

3.2. Problems and limitations

Published CL work uses corpora built from downloads of news archives, typically from newspapers or aggregator sites like Nexis. The availability of full-text material allows the analysis of large amounts of material without the need for sampling. The downside is that contextual information, such as pictures, text boxes, or graphical presentation, is lost or not immediately available. Another problem is the ‘noise’ contained in the downloads, and the question of how much noise should be eliminated in the final (‘cleaned’) corpus. The answer will depend on how one answers two other questions: Should the corpus only contain relevant stories? And: What is relevant?

Most research is not concerned with cleaning and vetting the downloaded items, apart from eliminating duplicates. Nerlich, Forsyth, and Clarke (Citation2012, 47) state: ‘We decided not to clean our corpus by filtering out texts containing […] apparently irrelevant phrases […], to avoid biasing our findings by what we expected to find. In this respect our philosophy is, as far as possible, to ‘‘let the data speak for themselves’’ even if it means accepting a certain level of noise in the corpus.’ I will return to this issue in the next section.

Another problem is the identification of meaning and voices within a corpus. Research on CC discourse frequently identifies a ‘pro’ or ‘anti’ stance in news items. However, it is not easy to extract this kind of information from an article. Often several messages are conveyed, and it can be difficult to ascertain what the journalist’s voice and position is.

As Dahl and Fløttum (Citation2014) argue, in any given text there is linguistic polyphony. ‘The issue is approached from different perspectives by social actors with different backgrounds, world views, interests, values, and beliefs (Hulme 2009), a situation which implies position taking. Thus, both voices and positions become important objects of study in order to understand the complexity of the debate’ (Dahl and Fløttum 2014, 402).

They explain linguistic polyphony as the idea ‘that in one single utterance there may be several voices or points of view present, in addition to the one of the speaker/writer’ (Dahl and Fløttum 2014, 402).

There are other linguistic properties which pose a difficulty for textual analysis. One of them is the use of irony, satire or humour. Taken together, this means that words cannot be taken at face value and their meaning can only be grasped in context (Partington Citation2007).

Nevertheless, researchers have tried to overcome such doubts and develop methods for linguistic analysis of news texts, partly with the help of software systems. Semantic tagging is a method which categorizes words into a classificatory scheme. It has been used by researchers, following seminal work at Lancaster University (http://ucrel.lancs.ac.uk/usas/).

Collins and Nerlich (Citation2015) analysed reader comments on climate change in the Guardian, using the semantic tagging software Wmatrix. This tool compares words with key categories derived from the British National Corpus. The authors praise this process as being ‘systematic and automatic, organizing thousands of words of data into semantic categories in a matter of seconds’ (Collins and Nerlich Citation2015, 195). However, this approach potentially underestimates the challenge of ambiguity of meaning.

4. Common issues and problems

In this section I want to draw on my own experience with one of the methods (CL), identify two methodological issues, and suggest some steps for further research. Some parts of this section are more relevant to CL, but there are links to TM as well. The next section (5) will then look at the problem of agency and visibility, and the section thereafter (6) at the issue of identifying frames.

4.1. Corpus construction

In previous work we wanted to establish sound methodological rules for downloading and preparing corpora for analysis (Grundmann and Scott Citation2014; Grundmann, Kreischer, and Scott Citation2016). Our aim was to construct a reliable dataset of news items that excludes bias as far as possible, including as many items as possible on a given issue, and only on the issue. Our dataset is based on news archives of written text (sometimes only one type of newspaper, such as the prestige press, as is the case with our work on austerity), and thus excludes other sources which might merit analysis, such as broadcasts, visuals, or social media comments. In this sense our data is biased towards one type of source. Nevertheless, print and prestige press stories are important because political elites tend to read them, and journalists report about political elites and provide reports and quotes from institutional actors.

However, the construction of the dataset follows a procedure which does not rely on subjective choice. It also tries to include as many sources as possible, and not restrict the volume through random or purposive sampling, through lemmatization, or through exclusion of frequent or infrequent words (see also Denny and Spirling Citation2018).

After gaining experience with the method and after the publication of our first paper (Grundmann and Scott Citation2014) we posed ourselves the question: How should a textual corpus be constructed? How much noise should be tolerated? Our approach was guided by the principle of including only texts which are really news stories about the subject. We developed a parser which performs this task automatically. We applied this method to the construction of a corpus on the topic of austerity in the British press and are repeating this procedure for CC. This is a time-consuming process, but we think it is necessary. How can we be sure we are analysing CC discourse if a large proportion of the downloaded data consists of irrelevant news items? Our estimate is that the proportion of real climate stories is much less than 60% of a standard Nexis download based on a keyword search including ‘climate change’ or ‘global warming’.Footnote2

The above-cited work by Bohr (Citation2020) and Keller et al. (Citation2020) is cognizant of this problem. Bohr describes how he first excluded duplicates from an initial download based on a Boolean search for ‘climate change’ or ‘global warming’. After manual inspection it became clear that ‘many articles did not focus on climate change and simply mention one of the keyword phrases a single time in passing’. An additional filtering procedure required an article to mention the terms ‘climate change’, ‘global warming’, or ‘greenhouse gas’ at least twice. This reduced the corpus to less than 50% of the original download.
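
A filter of the kind Bohr describes can be sketched as follows; the function name and threshold are illustrative, and `articles` is assumed to hold the texts of a raw download.

```python
# Sketch of a relevance filter of the kind Bohr describes: keep an article
# only if the key phrases are mentioned at least twice in total.
import re

PHRASES = ["climate change", "global warming", "greenhouse gas"]

def is_relevant(text, min_mentions=2):
    mentions = sum(len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))
                   for phrase in PHRASES)
    return mentions >= min_mentions

# `articles` is assumed to hold the items of a Nexis-style keyword download.
articles = ["...", "..."]
cleaned_corpus = [text for text in articles if is_relevant(text)]
```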

Keller et al. (Citation2020, 231) state about their work that articles only mentioning climate change in passing could have been included in their sample. As a remedy they suggest ‘using sharper identification criteria, for example more specific search terms’ for future studies.

4.2. Key word analysis

The use of an appropriate reference corpus is important for the calculation of key words. Many researchers use a standard reference corpus, such as the British National Corpus. In the case of climate change, climate relevant words will be salient, such as climate, climate change, global warming, IPCC, temperatures, conferences, Kyoto, and so on. Such terms are to be expected and the analysis merely confirms what every news reader will already know.

In order to identify more specific key words we decided to use the whole climate corpus (our downloaded and cleaned dataset) as the reference corpus. The resulting keyword analysis provides results that go beyond the obvious, as it shows salient words from within a subset of the database. Using this procedure, obvious words and word combinations like ‘climate change’ or ‘global warming’ are no longer salient; instead, words such as ‘developing’, ‘trading’, ‘subsidies’, or ‘transport’ stand out.
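
The keyness calculation behind such comparisons can be illustrated with the log-likelihood measure that is common in CL tools. The sketch below compares a word's frequency in a sub-corpus (for example, one year of coverage) against the whole cleaned climate corpus as reference; the counts are made up for illustration.

```python
# Log-likelihood keyness of a word in a study sub-corpus against the whole
# cleaned climate corpus used as the reference corpus (made-up counts).
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# 'subsidies' in a one-year sub-corpus vs. the whole climate corpus.
print(round(log_likelihood(420, 1_000_000, 1_500, 20_000_000), 1))
```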

5. Actors and claims makers

I got involved with CL scholarship after publishing an article in Environmental Politics (Grundmann Citation2007). In this piece I was interested in the visibility of specific actors in the CC discourse in Germany and the USA, namely advocates, sceptics and the IPCC. This research interest was informed by a pre-theoretical, implicit assumption that the public visibility of sceptical voices about climate change would hamper the development of climate policies. Such a belief was, and still is, widespread among social science researchers, climate scientists, and climate activists. The comparison between the US and Germany seemed to show that the different levels of publicly visible scepticism correlated with the ambition of climate policies in these two countries. An influential article on US news coverage of climate change had shown that the visibility of sceptics was artificially inflated through journalistic practices of providing false balance in news reporting (Boykoff and Boykoff Citation2004). As it turns out, the higher visibility of sceptics in the US reflects its political constitution, where robust and open (some might say: adversarial) debates about the scientific evidence for policy are common practice. This means that the causal relationship between visibility and efficacy needs to be examined more carefully than I had assumed. The second problem with this approach is empirical: even in the US it turned out that the postulated journalistic norm of ‘false balance’ was not in evidence after 2005 (Boykoff Citation2007).

This raises the question of agency and visibility in a given corpus. In order to identify relevant and potentially dominant actors we initially used lists of specific claims makers which we drew up in advance (Grundmann and Scott Citation2014). This introduces a potentially large dose of arbitrariness which ideally should be avoided. In order to do so, we decided to use Named Entity Recognition (NER) software to extract names of claims makers from our dataset. We applied this in research about the discourse on Austerity in the British Press (Grundmann, Kreischer, and Scott 2016) and are using it in an ongoing project on CC.

It is worth mentioning that this new procedure is inductive and does not rely on imputing topics or meanings to the items found in our corpus. Relevant social actors (individuals and organizations) are identifiable directly.
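
As an illustration of this step, the following sketch uses the spaCy library to extract and rank persons and organisations; it stands in for, rather than reproduces, the NER software used in our own studies, and the example sentences are invented.

```python
# Extracting candidate claims makers (persons and organisations) with spaCy,
# standing in for the NER software used in the studies cited above.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
articles = [
    "Chancellor George Osborne unveiled the spending review on Wednesday.",
    "The IPCC published its Fifth Assessment Report in 2014.",
]

claims_makers = Counter()
for doc in nlp.pipe(articles):
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "ORG"):
            claims_makers[(ent.text, ent.label_)] += 1

print(claims_makers.most_common(10))
```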

In our research on the discourse on austerity in the British press we also analyzed verb collocations of central actors (Grundmann, Kreischer, and Scott 2016). We found, for example, that the Chancellor of the Exchequer, George Osborne, played a dominant role. He had by far the most mentions in the news corpus: he was quoted ten times more frequently than opposition speakers such as Ed Miliband or Gordon Brown. Not only does the word frequency testify to this role; it is also indicated by the word collocates ‘announced’, ‘unveiled’, ‘delivers’, ‘prepares’, ‘introduces’, and ‘urged’. In contrast, collocates for Gordon Brown were ‘used’, ‘introduced’, ‘was’, and ‘tried’. This shows a difference in agency and ambition, as seen through the press coverage of their statements in public.
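
A simplified (assumed) version of such a verb-collocation analysis can be sketched as follows, counting verbs that occur within a small window around mentions of an actor's name; the text and window size are illustrative.

```python
# Counting verbs within a small window around mentions of an actor's name,
# as a rough proxy for the verb collocates discussed above.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("George Osborne unveiled the budget. Osborne announced further cuts.")

verb_collocates = Counter()
window = 5
for i, token in enumerate(doc):
    if token.text == "Osborne":
        for neighbour in doc[max(0, i - window): i + window + 1]:
            if neighbour.pos_ == "VERB":
                verb_collocates[neighbour.lemma_] += 1

print(verb_collocates.most_common())
```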

Our work on the austerity discourse also shows the need to be cautious with regard to the visibility and agency of claims makers. As the example above suggests, visibility does not equate to agency, or more precisely, to political influence. The most visible claims makers in our dataset on austerity were the Chancellor of the Exchequer and the Labour party. While the former had agency in defining situations, coining metaphors, phrases, and narratives, the latter did not. As our data shows, it was absorbed by its internal leadership struggles and did not come up with a narrative to challenge the Chancellor.

The high visibility of actors does not mean they are effective. They may be visible in public discourse but such visibility need not translate into political influence, or power. A contextual analysis of their public appearance is necessary. This can be done, to some extent, through collocation analysis. There might also be institutional or administrative power holders behind the scenes who do not manifest themselves in public discourse. In addition, these methods do not allow an analysis of layers of meaning and positioning, as shown by Fløttum and Dahl. Visibility of climate sceptics, to use an overused example, does not mean that sceptical voices are celebrated. They could be mentioned because they are strongly criticised, thus giving them salience. The upshot of this is that we need more empirically informed analysis in order to address theoretical issues about agency, visibility, and power (see the three dimensions of power identified by Lukes Citation2004). Theories of power and of the policy process could benefit greatly from such work, especially if it is comparative in nature.

Our research shows different salient words across countries (Grundmann and Scott Citation2014). We also observe different claims makers, even different types of claims makers dominating public discourse in the countries we analysed. In Germany and the UK environmental NGOs are dominant, in the US it is politicians and celebrities (Al Gore, Arnold Schwarzenegger). In France international organizations, such as the IPCC, get a lot of mentions. These findings can be interpreted on the basis of existing theoretical frameworks, such as civic epistemologies. The term indicates the ‘institutionalized practices’ through which actors, or claims makers ‘test and deploy knowledge claims used as a basis for making collective choices’ (Jasanoff Citation2005, 255). Those institutionalized practices differ from country to country.

Visible claims makers are often limited to their domestic audience, with few exceptions. In our four-country comparison the USA is the only country whose claims makers are mentioned in the other three countries. The reverse is not true. This shows the central role the US plays in global climate science and politics. These results could be brought into productive dialogue with the concept of ‘domestication’ of climate politics (Olausson Citation2014).

During some time periods we observe that claims makers and/or topics are converging; we call this the synchronization of transnational discourse (Grundmann, Smith, and Wright Citation2000). We found for example, that in 2005, 2008, and 2009 there was a convergence of key words across the four countries in our sample. The same is not true for other periods. Here, the discourse is shaped by other events, and visible through different key words in different countries.

This indicates that climate change is embedded in national discourses in specific ways. In our data climate change is ‘riding’ on topics unique to each country. A reasonable hypothesis is that the discourse is shaped differently in different countries in ‘normal times’, and only comes together in exceptional periods. Hence, we should see discursive divergence over several years. Our previous analysis points in this direction, although the time period is relatively short (2005–2010). These observations show how the discourses across nations become synchronized only at critical junctures. In most years the climate discourses are nationally specific. This observation should be tested in future work, over longer periods of time, using shorter intervals (monthly data).
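
One way such a test could be operationalised, offered here as an assumption rather than a description of our published procedure, is to measure the overlap between the monthly keyword sets of national sub-corpora; the data below are invented.

```python
# Overlap between the keyword sets of two national sub-corpora in the same
# period, measured with the Jaccard coefficient (illustrative data).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# keywords[country][period] -> salient keywords for that country and month
keywords = {
    "UK": {"2009-12": ["copenhagen", "summit", "deal"]},
    "US": {"2009-12": ["copenhagen", "obama", "deal"]},
}

overlap = jaccard(keywords["UK"]["2009-12"], keywords["US"]["2009-12"])
print(round(overlap, 2))  # higher values suggest a synchronized discourse
```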

6. Frames

Frame analysis is a central element of much work in the field of discourse analysis. However, frames are typically constructed manually, and no software exists which could extract frames from texts (Nicholls and Culpepper 2020). Although DiMaggio et al. have tried to derive frames from topics, commentators are sceptical about their claims (see Walter and Ophir Citation2019; Maier et al. Citation2018 for a critical evaluation).

The challenge is illustrated by a publication that performs manual frame identification. In their analysis of the Fifth Assessment Report published by the IPCC in 2014, O’Neill et al. (Citation2015) identify, based on previous studies, ten frames: Settled Science (SS); Political or Ideological Struggle (PIS); Role of Science (ROS); Uncertain Science (US); Disaster (D); Security (S); Morality and Ethics (ME); Opportunity (O); Economic (E); and Health (H). These frames were operationalized so that coders could identify them, using specific criteria. However, as is often the case with coding schemes, these are not mutually exclusive. For example, the coding item ‘Unprecedented rate of change compared to palaeo records’ appears under SS, and ‘Unprecedented rise in global average surface temperature’ appears under D. The demarcation between the two could be non-trivial. Likewise, ‘questioning the motives or funding of opponents’ (coded as PIS) could be similar to ‘Urges trust in climate scientists and dismisses sceptic voices’ (coded as SS).

In theory, coding procedures could be refined to avoid ambiguity, transferred into a software algorithm, and via machine learning processes be perfected to a stage where results are (re-) producible in unsupervised ways. As mentioned above, some researchers using LDA maintain that these technical developments may become reality soon, but do not exist at present.

We suggest a different approach. Following the literature on the construction of social problems (Spector and Kitsuse Citation2001; Trumbo Citation1996) we are interested in those claims makers who are prominent in public discourse. Ideally, we would like to know who the claims makers are, and what claims they make. While appropriate software such as NER allows us to identify persons and organizations, the identification of claims from the texts is not possible with software, at least not in a direct way. We can get close by inspecting word collocations but claims can only be inferred from this information. We are facing the problem of inferring meaning from word patterns, establishing rules and algorithms where possible, by searching the database.

Matthes and Kohring (Citation2008) present an approach which manually identifies frame elements, the component parts of a larger frame. These are based on Entman’s definition of frames as a compound of ‘problem definition, causal interpretation, moral evaluation, and/or treatment recommendation for the item described’ (Entman Citation1993, 52). They then apply cluster analysis to the frame elements and construct frames in this way. They convincingly argue that identifying frame elements leads to more reliable results compared to a holistic frame analysis. They state: ‘The crucial difference to the common assessment of frames is that frames are empirically determined and not subjectively defined’ (Matthes and Kohring Citation2008, 265). They identify topics and actors, and operationalize the four frame elements through a coding procedure.
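
The automated second step of such an analysis can be sketched as follows. The binary coding matrix is invented, k-means is used here purely for illustration (Matthes and Kohring's own clustering choices may differ), and the manual coding of frame elements remains the labour-intensive part.

```python
# Clustering hand-coded frame elements into frames; data and cluster count
# are invented, and k-means is used only for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Rows = articles, columns = binary frame-element codes, e.g.
# [problem: risk, cause: emissions, evaluation: negative, treatment: mitigation]
coded_articles = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
])

frame_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coded_articles)
print(frame_labels)  # one label per article; each cluster is read as a frame
```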

However, their procedure is labour intensive and limited to a sample that can be researched in a given time by human researchers. It is time consuming and costly to identify topics and actors, and to operationalize the four elements of Entman’s definition. They also use the article as unit of analysis which can be problematic as many news stories have a range of actors and/or topics. The critical comments above (see Dahl and Fløttum 2014) apply. Some researchers using LDA recognize the problem, too (DiMaggio, Nag, and Blei Citation2013).

Consider the following statement: ‘Climate change is human-made and getting worse, we need urgent action.’ This statement, or variations of it, can be found in many news stories, sometimes attributed to one actor, sometimes reflecting the gist of the article. The statement combines causal attribution and moral evaluation. Statements like these pose coding problems as the message to be extracted falls under two frame elements. What is more, the time dimension is missing from the frame elements. After all, many actors would agree that climate change is a real, and serious issue, but could disagree about how urgent action is, or what costs would be justified.

Another typical statement, ‘Fossil fuels are to blame for the climate crisis, we need to ramp up our efforts and expand renewable energy’ combines problem definition, causal interpretation, moral evaluation and treatment recommendation. In this version the distinction between frame elements becomes blurred, leading to ambiguity in coding.

The upshot of this discussion is that Entman’s conceptual distinction of frame elements has been useful for informing empirical frame analysis, but it also has clear limitations. It cannot be regarded as a universal tool as it neglects the time dimension and is under-complex with regard to treatment options. In the case of climate change there is political and scientific argument about the merit of specific solutions, with no consensus emerging. This is not surprising if one considers climate change as a ‘wicked problem’ which cannot be solved, but only managed better or worse (Newman and Head Citation2017; Hoppe, Wesselink, and Cairns Citation2013; Rayner Citation2006; Grundmann Citation2016).

This would call for a more refined definition of frame elements, paying special attention to the treatment aspect. Here several policy options have emerged; I have identified a dozen: 1) rolling out nuclear power plants across the globe; 2) switching all energy supply to solar, wind, or biofuels; 3) taxing carbon (or energy) with a) low or b) high rates; 4) implementing emission-trading systems; 5) developing carbon-capture and storage technologies; 6) developing new zero-carbon energy systems; 7) taking adaptation more seriously; 8) developing geo-engineering projects; 9) adopting vegetarian or vegan diets and lifestyles; 10) restricting population growth; 11) abolishing capitalism; 12) abolishing democracy (Grundmann Citation2018).

Coding news items into this, or any other conceptual scheme, is a deductive approach which crucially rests on coding procedures that require robust agreement between coders about the meaning of an item. CL is not able to fulfil this task. But it can offer a different methodological option, based on its unique features: being inductive, either corpus-driven or corpus based (Tognini-Bonelli Citation2001).

This is facilitated by the availability of a list of claims makers, obtained through NER parsing. This can then be complemented by collocation analysis, showing collocations for the highly visible claims makers. While this approach does not allow frames to be identified in unsupervised settings, it offers options for identifying frame elements in a more detailed way.

7. Conclusion

If we look back at the points made so far, it appears that there are some common problems, some problems unique to each approach, and some tentative suggestions about a way forward.

There are several advantages to using computer-assisted methods. They allow researchers to map a discourse over time, providing a ‘helicopter’ perspective. Major features of texts can be detected through software algorithms, without relying (too much) on subjective decisions of researchers. The analysis aims to discover hidden structures, either through topic analysis or through patterns of words which are not visible to the reader. As John Sinclair put it memorably, ‘Language users cannot accurately report language usage, even their own’ (Sinclair 1994, cited in Willis Citation2017). Results can be presented in summary form and in timelines, and comparisons between countries or types of media outlets are within easy reach of researchers.

A disadvantage of using computational methods is that one does not necessarily get beyond the surface, or that trivial findings are obtained. If one wants to avoid these pitfalls, several methodological decisions have to be made. These have to do with the construction of the corpus: which texts should be included, from which sources, over what time period? Statistical measures need to be identified and justified. The reduction of an enormous amount of information to a manageable size, that is, a size that can be presented in a meaningful way and interpreted, is a challenge that requires judgement. And the interpretation of results is still necessary; no text exists without context. In many cases a comprehensive database will provide contextual information, as single articles can be inspected for closer analysis. This is possible if a full-text database is available throughout the research process. A combination of quantitative and qualitative work therefore seems appropriate.

The lure of computer-assisted analysis of vast amounts of text has led some researchers to explore software applications which are often black boxes for the user. The appeal to objectivity via automation is limited by theoretical assumptions such as the ‘bag-of-words’ approach (which is also present in CL keyword methodology, see van Meter Citation2018), or the exclusion of word variations and function words, or of very frequent and infrequent words as ‘noise’. The use of linguistic software without linguistic expertise is problematic, yet it seems appealing as it allows researchers to produce quick results.

Maybe the numerical packages, from Excel and SPSS to R, can teach a lesson. They provide different levels of data analysis and visualization, and users are still able to make the best, or worst, of them. A quick and dirty statistical analysis will not withstand the scrutiny of a large and competent expert community. Suggesting causation by calculating correlations is a commonly exposed flaw. Language-based software is no different. It provides tools for analysis and presentation which vary in quality. To check the quality of this research we need a critical mass of competent practitioners. We need in-depth, interdisciplinary efforts between linguists, programmers, and social scientists in order to enhance the research potential of these new and exciting technical opportunities.

Practitioners need to ask themselves what kind of questions they want to examine. As the example of topic modelling indicates, there is a danger that the availability of technical applications leads to research questions that are somehow new but produce underwhelming results. Discourse researchers have been keen to investigate the narratives and meanings of texts, how issues are framed by actors, how types of communication are used in efforts of persuasion, and what these results actually mean for social reality.

Computer-assisted technologies for textual analysis should fulfil two tasks: they should help us do a better job with research questions we have always been interested in, and they should provide us with new possibilities and unexpected insights. Topic modelling, as the name suggests, is interested in the structural properties of texts, not specifically in the efforts of persuasion by specific social actors. CL can offer a powerful tool to analyse large amounts of text. It is very good at providing quick overviews and pointers which one may not have considered. However, it needs to be complemented by careful study: datasets need to be cleaned, lists of named entities need to be vetted and amalgamated, methods for the choice of keywords need to be established, and meaning needs to be constructed on the basis of close inspection. All methods have to be applied and guided by justified decisions for a particular purpose. This means that these decisions have to be replicable, accessible and clearly expressed. Research teams need to be built which have the requisite variety of expertise and are able to address the problems and limitations identified in this paper.

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

Notes

1 I would like to thank Mike Scott, an anonymous reviewer, and the editor for helpful suggestions.

2 Readers who are interested in exploring the tools we have used should look here https://lexically.net/wordsmith/  for access to the program. For the processing:

parsing Nexis & Factiva: https://lexically.net/DownloadParser/index.htm 

finding duplicates: https://lexically.net/downloads/version8/HTML/different_contents_dup_finder.html

checking content: https://lexically.net/downloads/version8/HTML/relevance_check.html

building monthly or yearly sub-corpora: https://lexically.net/downloads/version8/HTML/build_sub_corpora.html

key words and time-lines: https://lexically.net/downloads/version8/HTML/kwdb_database_timelines.html

References