
Discourse analysis after the computational turn: a mixed bag

Pages 3-15 | Received 02 Feb 2023, Accepted 10 Mar 2023, Published online: 28 Mar 2023

ABSTRACT

This paper seeks to clarify a methodological agenda for combining discourse analysis with corpus analysis. It details four concerns. Firstly, it argues that corpus-assisted discourse analysis can quite drastically narrow the view on discourse if used on its own and without accompanying theoretical tools for exploring social practice. Secondly, corpora are of more value in helping researchers identify the symbolic resources that people have available to them than in understanding how they use those resources. Thirdly, corpora must be approached through a renewed appreciation of communication as a human accomplishment and must, therefore, be reconnected to the producers of that discourse. And fourthly, corpora are of greater value when extended beyond lexical analysis. Underpinning these points is a commitment to discourse analysis as a tool to understand in close detail how people use language to do things in their lives.

Introduction

The papers in this special issue were all presented at the eighth New Zealand Discourse Conference in 2021. They, and the thriving conference they arise from, are evidence of the continued strength of discourse analysis as a method in the social sciences. However, as Gail Fairhurst notes in her paper, the claims of research using discourse analytic methods need to be defended these days, and the approach is sometimes marginalised or subsumed into other approaches. That is partly to do with academic trends and the pursuit of particular questions, such as a concern with materiality or political economy in particular fields. One set of concerns is associated with the ‘computational turn’ (Berry, 2011) in both quantitative and qualitative research traditions. This introductory paper to the special issue draws on some of these debates to try to clarify a methodological agenda at the interface between the study of discourse and large data sets, to allow communication researchers to get closer to the practices they study and understand them more deeply.

There have been complaints since its inception about discourse analysis’s emphasis on small numbers of texts. Stubbs (1997), for example, cited concerns that discourse analysts would sometimes cherry-pick a few cases to examine or simply choose texts arbitrarily or without robust enough sampling, so that results were less likely to sustain claims to be representative or reliable. Some forms of critical discourse analysis, where analysts begin with a critical social agenda, have been particularly queried (Widdowson, 2000; see also Nartey & Mwinlaaru, 2019). In contrast, computer-assisted analytical frameworks, in particular those drawing on corpus linguistics and more recently machine learning, have become popular in the past decade because of their apparently strong claim to rigour on all these fronts. We can now study the discourse of thousands if not millions of texts using these tools. A strand of critical discourse analysis has emerged, sometimes labelled corpus-assisted discourse studies (CADS) (Ancarno, 2020), that combines the methods of corpus linguistics and discourse analysis. Nartey and Mwinlaaru (2019) found 121 such studies in major journals between 1995 and 2016.

I have followed this approach myself, but here I want to foreground the goal of getting closer to communicative practice and people’s meaning-making as a way of rebalancing some of these debates. The great strength of computer-assisted analysis is that, as digital humanities scholar Berry (2011) notes, it is subtractive: it turns ‘the continuous flow of our everyday reality’ into data points that can be manipulated using algorithms. ‘These subtractive methods of understanding reality (episteme) produce new knowledges and methods for the control of reality (techne)’, he writes (p. 2). Yet that subtractive approach brings such huge loss of context that, used on its own, it is often disappointing in its analytical purchase. At the same time, digital tools arise historically out of a set of ideologies that have shaped western thought since the 1970s, including what Barbrook and Cameron (1996) call the ‘Californian ideology’, where the combination of free markets and the liberatory power of technology is disruptive of established power. To respond critically to the computational turn in social science means to be clear about what each approach brings. To that end, I want to make four arguments. Firstly, taking a Latourian approach, I worry about how narrowing the collaboration with corpus analytic tools can often be. Secondly, I argue that corpora are more useful for understanding the discursive resources that people use and less useful for understanding how they use them. Thirdly, I connect these concerns to a central ethical commitment within the discourse analysis tradition to attend closely to people’s talk in their own terms. And, lastly, I trace some of the attempts to extend corpus-assisted discourse analysis beyond words into the broader semiotic space.

Before doing so, however, I need to acknowledge Prof Colleen Mills, who co-organised the eighth New Zealand Discourse Conference and initiated this special issue before her death in mid-2022. We have a lot to thank Colleen for and she will be badly missed in many parts of the scholarly community.

On the tools of discourse analysis

As Bruno Latour reminds us, when we study the world using a set of tools or techniques, the relationship with those tools is much more than a use relation. In his slightly provocative statement, he noted that Louis Pasteur did not so much discover microbes as collaborate with them (quoted in Knight & Chrisafis, 2022). That is, Pasteur’s achievement was not just a matter of reason, astute observation or methodological advance but contingent on so much that surrounded him, ‘interactively constructed’ in a wider social practice (Knoblauch, 2021). Without falling into arguments about the technological determination of society or about the agency of microbes, we can apply this kind of reflexive or empirical study of scholarship to better understand what is at stake when discourse analysis draws on computational tools.

Jones (2011) notes that sociolinguistics arose as tape recorders enabled new ways of recording and listening. Before these tools became available, transcription of speech was possible but inevitably missed so much that was mentally processed away by the analyst that it lacked definition as a form of data. Analysis of discourse as including ums and ahs, over-talking, hedging, utterances of many kinds rather than complete sentences, the order of talk and similar features began to happen in earnest as it became audible on tape. More importantly, Jones points out, this new tool gave discourse analysis an aura of science and an authority arising from the disciplinary knowledge of how to transcribe and from the solid fact of the transcript. He writes that, with the new authority granted to discourse analysis by the invention of the portable tape recorder, there also came new responsibilities. For one thing, analysts found themselves embedded in a complex new set of ethical and legal relationships with the subjects of their analysis (p. 16).

In a similar way to the tape recorder, corpus analysis sometimes gives radically new insights into language use. It also raises similar questions. Beginning in lexicography in the 1970s, tools were developed that traced usage patterns in large sets of words. These included concordancers, which allowed lists of the words around a word to be produced. A key finding of these tools was that word use is often more a matter of the repurposing of set phrases or certain grammatical patterns – patterns of language use that were not apparent using other approaches. As elsewhere in the digitisation of formerly qualitative research traditions discussed by Berry (2011), there is a power to decontextualising data and other ‘subtractive’ moves in analysis. Some scholars have observed in themselves a new openness to language data that can lessen the impact of the researcher’s theories, taxonomies and expectations on understanding the phenomena of language use. Data is defamiliarised in useful ways that can lead to better understanding of discourse and language communities. This distinctiveness can be overstated – all analysis proceeds through analytical schemata or taxonomies that bring certain aspects into focus – but a corpus certainly allows this to be done in different ways. One key advance is that, as Koller and Mautner (2004, p. 225) put it, ‘concordancing effectively helps to break down the quantitative/qualitative distinction, providing as it does the basis for quantitative analysis without “deverbalizing” the data’. Exploring certain kinds of patterns is also much easier and faster. It took Raymond Williams 20 years to collect and study his list of 60 keywords in the early years of British cultural studies (Williams, 1977). That kind of discovery of new connections between domains of knowledge that Williams pursued is now so much easier.
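The concordancer’s basic move is simple enough to sketch. The following is a minimal key-word-in-context (KWIC) routine in Python; the function name and the toy sentence are invented for illustration, not drawn from any tool discussed here:

```python
# Minimal key-word-in-context (KWIC) concordancer: for each occurrence of a
# node word, collect the window of words on either side of it.
def kwic(tokens, node, window=4):
    """Return (left context, node, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Invented toy text; real tools run this over millions of words.
text = "Migrants flock to the city as tourists flock to the coast".split()
for left, node, right in kwic(text, "flock"):
    print(f"{left:>30} | {node} | {right}")
```

Aligning the node word down the centre of the page in this way is what lets patterns of collocation become visible to the eye, which is the defamiliarising effect described above.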

As boyd and Crawford (2012) note about big data research in general, this kind of work ‘stakes out new terrains of objects, methods of knowing, and definitions of social life’. A key set of questions then arises about what kind of scholarship is privileged when we collaborate with corpora. Most communication scholars would set aside the claims that the data ‘speaks for itself’ in ‘corpus-driven’ approaches (as argued, for example, by Biber, 2009). McEnery, Xiao, and Tono (2006) argue that there is no corpus-driven research, as the researcher always brings assumptions about how language works in how they gather and analyse their material, even if they arrive without hypotheses or theories to test. More strongly, Drucker (2011, ¶7) warns against giving up the legacy of humanities scholarship for a ‘reductive empiricism’ (see also King, 2015). But other issues also arise. It is becoming clear that we need to be mindful of at least three phenomena.

Firstly, corpus-based tools encourage discourse analysis of certain kinds of texts over others. Texts that are already digitised and available, such as social media posts or comments and political discourse such as speeches, are, Nartey and Mwinlaaru (2019) find, much more likely to be analysed in CADS projects. In one project in this wave of work, Lee and Sumiya (2010) gathered 21 million geo-tagged tweets in Japan over six weeks, the kind of research that would have been unimaginable before but is now common. However, other kinds of language use, particularly spoken language or less publicly available texts, have been studied much less using this approach.

Secondly, like the empirical authority the tape recorder brought to the study of talk, large data sets and their tools have brought authority to particular moves in the study of language use. These include the study of words in the context of surrounding words, keywords, visualisations such as word clouds, semantic sets and n-grams (recurring sets of contiguous words). These are all useful moves in studying discourse, but they lead scholars using the approach to lean heavily on the study of words. The heavy focus on keyword analysis in corpus-assisted discourse analysis, for example, puts a huge emphasis on the significance of particular words in signalling the communicative practice rather than on other analytical levels (pragmatics, narrative, interaction, genre and so on). Assumptions also frequently creep in from quantitative tools and the way datasets are gathered, some of which are not immediately apparent. Driscoll and Walker (2014), for example, have shown that the Twitter data interface or API privileged more popular tweets, neglecting activity by people who had not reached some threshold of visibility. A Twitter dataset gathered through the company’s service will therefore elide some data, without that being apparent. The corpus-assisted study of discourse, while promising to release us from small samples, can also itself become narrowed.

Thirdly, creating a database does not just record meaningful material in a form that can later be queried; it creates a different kind of knowledge to that in which the people who generated that data were involved. On one level, that is the whole strength of databases – they are searchable, manipulable and comparable in ways that everyday talk is not. But on another level, as Manovich (1999) pointed out, the database form is a logic of our age that risks colonising the imagination, particularly when these forms are commercially motivated. A database, he notes, tends towards seeing knowledge in terms of what Deleuze and Guattari (1988) call rhizomatic logics: any point can be connected with any point, there is no logical centre or hierarchy, and rupture in one connection allows new lines to emerge elsewhere. Rather than a tree of knowledge or a narrative, we have a rhizome, just like a clump of irises or ginger. Corpus analysis, then, can be characterised as wanting to find discrete, reconfigurable units of meaning – and it indeed finds them.

It is clear that the position of the author needs to be constantly written back into a corpus-assisted study, given these tendencies to narrow and focus the gaze. Scott (2010, p. 50) cautions that there is always a large human dimension needed in using a corpus – the analyst’s ‘discernment and discretion’ about what is going on in the words. Many critical researchers therefore recommend doing corpus analysis and other forms of discourse analysis in parallel, either one after the other or jumping between them. In one move, we remove the context to see certain aspects; then in another, we reintroduce it. Koller and Mautner (2004) move from reading a set of texts to studying concordances, collocations, frequencies and statistical patterns and back. They value concordances in particular because they preserve a sense of the text as unfolding meaning. O’Halloran (2014) goes further in his use of corpora, in a move that is not just analytical but also makes an ethical claim. He argues that the distancing that a corpus provides allows the analyst to take on a ‘nomadic ethical subjectivity’, to interrupt the analytical self and align her or himself with the voices in the text, and thereby re-enter the argument non-arbitrarily, ‘or, at least, with the chances of this re-entry being arbitrary significantly reduced’ (p. 810). In these kinds of formulations, corpora are not being used as proxies for people but as part of a mixed toolkit to try to get closer to them. McDermott (2013), in his study of blogging in authoritarian Singapore, goes as far as to say that we risk co-option by corporations and national states if we allow focus to remain too much on tools whose provenance we have limited knowledge about. He sets up a dichotomy between the ‘social ignorance’ of automated data collection and analysis and the ‘space for agency’ opened up through studying people’s social and textual practices in a wider range of ways.

Corpora and social resources

A significant challenge here is how to contextualise data gathered together into a corpus. Using multiple approaches, such as Koller and Mautner’s (2004) cycling between close reading and use of a concordancing tool, is one way of seeking to do that, but the main answer from corpus analysis has been to compare the text with another dataset of representative text – what Dourish (2004) calls providing a ‘representative context’ for the words. For any research focused on social action this provides a thin kind of context – as Seaver (2015, p. 1105) says, it fails to deal with the ‘interactional context’, or the ‘localised achievement [of interactants to a social activity], irreducible to a collection of sensor data’. (He argues this is why data-driven recommendation systems often fail to give us the book or music we want, since they regard a person as a collection of tastes rather than someone using a book or piece of music as part of their daily life.) For me, a better answer is to read corpora as providing the analyst with insight into the language resources that people have available to them, while being much more cautious about using the data to try to read off how they use those resources or what those resources mean. It helps us, as it were, to understand their library, but not their reading; their vocabulary, but not so much their speech.

On the one hand, this means that corpus tools are good at helping with the analysis of the use of specific words or phrases, particularly through comparing that usage with other texts. Tognini Bonelli (2004) formulated a rule that a corpus should always be compared with another general purpose corpus using probabilistic tests (such as log likelihood), and never treated as representative of anything in itself. Most concordancing tools used by discourse analysts come with functionality to compare the chosen texts with these huge corpora of everyday usage, such as the 100 million-word British National Corpus or its sub-corpora of spoken and written English and their sub-sub-corpora of different kinds of spoken and written English. Reference corpora have great strengths as an alternative to making educated judgements about the value of a word in a text using the critic’s own language competence. If a text tends to use positive or negative language, or certain semantic content, words from those domains will become evident through the comparison with a large reference corpus. More sophisticated analysis is also possible. For example, critical scholars may notice in a sample of news texts (e.g. van Dijk, 1991) a rhetoric of waves of migrants and read this as having negative connotations. O’Halloran and Coffin (2004) used corpora to test out their intuition on a similar set of phrases about migrants flocking. They traced what Louw (1993) calls the ‘semantic prosody’ of instances of flock/flocked in their sample and compared that usage to a sample of more ordinary English usage, to see if uses of the word form flock were generally negative and whether they were particularly associated with sheep. This testing out allowed them to counter criticism of critical textual analysis that it finds what it expects to find – the presence of certain structures of power – or that its analysis depends too much on analyst intuition.
They found that flock occurs primarily metaphorically – actually not so often with sheep – making critical analysis of its use in news discourse more complicated, but lending some support to the concern that in some contexts this is a pejorative term. Their findings in particular led them to confirm van Dijk’s (1991) point that textual features do not have ideological meanings in themselves, but are used, in certain discursive moves, in ways that can activate ideologies.
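The probabilistic comparison Tognini Bonelli recommends can be sketched with Dunning’s log-likelihood statistic, the test most concordancing tools offer for this purpose. The counts below are invented for illustration and are not O’Halloran and Coffin’s data:

```python
import math

def log_likelihood(a, b, corpus_size_a, corpus_size_b):
    """Dunning's log-likelihood for a word occurring a times in a study
    corpus and b times in a reference corpus of the given sizes."""
    # Expected frequencies if the word were evenly spread across both corpora.
    e1 = corpus_size_a * (a + b) / (corpus_size_a + corpus_size_b)
    e2 = corpus_size_b * (a + b) / (corpus_size_a + corpus_size_b)
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented counts: 'flock' 40 times in a 100,000-word news sample versus
# 120 times in a 1,000,000-word reference corpus.
score = log_likelihood(40, 120, 100_000, 1_000_000)
print(round(score, 2))  # values above 3.84 are significant at p < 0.05
```

A high score tells the analyst only that the word is unusually frequent in the study corpus relative to ordinary usage; deciding what that overuse means remains, as argued above, interpretive work.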

Here, the corpus is telling us something quite specific – and therefore useful – but also limited. It gives evidence of what resources people are drawing on and what some of the meaning potential of that language is. In this respect, a key finding of corpus linguistics that is useful for discourse scholars is that, in English as well as in other languages, certain words tend to occur in quite specific and relatively fixed grammatical structures. Some otherwise obsolete language remains in use in fixed terms (such as hue in hue and cry); and some words have quite different meanings according to the other words they occur with or even the grammatical structure they occur in. For example, Stubbs (2010) shows from a study of corpora that the phrase I don’t feel like it conventionally signals disagreement, even though the words themselves are not otherwise restricted to mean that when added together. He argues that these patterns in language use, which are important aspects of cultural knowledge, were largely invisible before corpus-analytical tools. Language use, then, can be described partly as a matter of making use of these well-defined structures (see, for example, Hunston and Francis (1999), who show that the categories of grammar and words in fact begin to blur into each other). Studying corpora allows us to find, often quickly and with ease, a limited set of resources for communication available in a particular language event (see Matheson, 2018 for an application to disaster communication on Twitter).

Baker (2006) argues that corpora therefore provide insight into social norms. That is, we can study the regularities in the use of phrases such as these to explore such things as the evaluative meanings, stereotypes and normativities shared by a group. Scholars have used these insights to trace the way language carries with it what Stubbs (2010, p. 29) calls ‘little schemas of cultural knowledge’. Some key phrases express, for example, strong evaluative meanings that invite other participants in the discourse to share a way of perceiving things. There are ‘preferred and conventional ways of expressing evaluative pragmatic meanings’ (ibid.). For him, these patterns are the building blocks of social institutions, which in turn regulate the communicative acts that can be made within them, delineating what should happen when certain people say certain things (p. 34).

I worry that we lose some of the great insights of the phenomenological tradition and risk reducing the world as experienced to sign systems, an argument that has a contemporary parallel in the public debate around chatbots. Bots ‘learn’ to talk by training on huge corpora of text and as a result are capable of mimicking interaction, by activating cues that people recognise as part of human interaction. But – and here is the limitation in Stubbs’ reading of textual patterns as social practice – there always remains a gap between a clever chatbot’s capacity to reproduce interactive talk and actual interaction – and indeed the better it gets, the more it is a deception (Rapp, Curti, & Boldi, 2021, p. 52). In the same way, I would argue that the traces of social institutions and social norms that Stubbs and Baker find must be treated as only that: they are the signs that discourse practice leaves behind, not the practice itself. They are, moreover, signs that have been decontextualised and must therefore be reconnected to other forms of analysis and theories of social life if we are to understand their meaning. Some scholars using corpus analysis do this well. For example, Richardson and Kennedy (2012) use Laclau’s theorisation of particular keywords as ‘empty signifiers’, which are constantly being made to mean without that meaning ever being quite adequate to the ideological work it needs to do, to study the way words such as ‘democracy’ or ‘drugs’ or ‘gang’ are deployed. Their analysis depends as much on criminological work on the political contest over the links between crime and gangs as on the corpus itself to connect the textual data back to people’s realities. For me, better corpus-assisted research does not privilege either the text or the social or power structures, but analyses the orientation towards broader patterns of discourse in the linguistic resources of a language practice.
As a result, it is able to study meaning and people’s lifeworlds as always in flux because meaning arises in the use of socially shared knowledge resources, and not just in the resources themselves.

People’s talk in their terms

The problem here is partly a naivety, part of what Van Dijck (2014) terms the ‘datafication paradigm’ shared by both the tech industry and academic researchers, in which ‘Data and metadata culled from Google, Facebook, and Twitter are generally considered imprints or symptoms of people’s actual behaviour or moods, while the platforms themselves are presented merely as neutral facilitators’ (p. 199). Thus when Google designed a tool to monitor people’s search data on words such as flu, fever and cough, it found it was two weeks ahead of the US Centers for Disease Control and Prevention (which used physician reports and virus testing) in spotting the rise of particular flu outbreaks in the western world. But it also found its tool risked overpredicting outbreaks, and in 2013 it did just that; the Google Flu Trends tool has since been retired. According to Butler (2013), it was spotting an increase in the discussion of flu, at a time when there was high news coverage. To a discourse scholar, the problem is clear. Google Flu Trends was tracing cultural knowledge about the flu and people’s personal and collective fears, not the disease itself. The proxy was useful up to a point, but discursive phenomena are complex. What Google’s team forgot here was that searches reflect public discussion, the prevailing culture around risk and other hard-to-quantify factors – in other words, the social mediation of disease.

How do we ensure that we always return to the human dimension, to people’s social realities, to overcome, in McDermott’s (2013) term, the ‘social ignorance’ that a reliance on these kinds of large-scale pattern-finding tools can drift into? Classic formulations of discourse analysis direct us to keep one foot firmly in social reality when the other is placed in such analysis. Blommaert (2005, p. 2) defined discourse analysis in terms of an orientation to ‘language in action’. Research, in this view, aims to take the analyst closer to language as it is used by people to do things in their lives, and seeks to understand what work language does in social life. In Widdowson’s (2000, pp. 4–5) formulation, the object of study is the ‘reality of language as people actually experience it: as communication, as the expression of identity, as the means for the exercise of social control’. These are ethical as well as epistemological statements – that, at its best, discourse analysis can be true to those realities, that is, to how people construct and contest shared worlds. That commitment provides a useful ethical centre point when reflecting on the value of tools which gather and analyse very large datasets of communication, but I wonder if something more significant is needed. A key problem is that aggregate language data is at such a distance from the instance of use and from the user of the language that we lose touch with people’s lifeworlds. Analysis of Twitter talk, for example, tends to aggregate the tweets of people who are not co-present to each other, or at best are in some unstable ‘sense of shared conversational context’ (boyd, Golder, & Lotan, 2010).

That raises issues of privacy, consent and confidentiality. As the famous 2006–2008 Harvard ‘Tastes, Ties and Time’ study showed, the ease with which researchers can gather together large databases of personal material – in that case the profiles of all freshers at Harvard – is matched by the difficulty of later squaring that with people’s expectations of anonymity and of control over their communication (Zimmer, 2010). These remain difficult problems: it is impractical to ask social media users for their permission for the kinds of data sets needed in corpus analysis, and sometimes impossible when using real-time data and in many other contexts. As Zimmer points out, however, our first responsibility is to know the implications for longstanding research ethics conventions and to work them through. Clearer guidance now exists on some of these issues, such as the Association of Internet Researchers ethical guidelines (Franzke, Bechmann, Zimmer, Ess, & Association of Internet Researchers, 2020).

But it also raises a further responsibility to take a position on the politics of data, or what Bates (2018, p. 413) defines as consideration of ‘how the circulation of different types of data contribute to the constitution of unfolding social relations’. Without that, we risk contributing to the surveillance and disempowerment of people in contemporary digital societies; we also risk contributing to the appropriation of people’s knowledge and traces of their social life. And we set up further disempowering ‘data relations’ (Kennedy, 2016, cited in Bates, 2018) between people when data from or about them is shared further with other scholars or made publicly available in databases.

So what would it look like for corpus-based discourse analysts to take a position in debates around digital sovereignty? Digital sovereignty can be defined in a range of ways, but Floridi (2020, p. 372) focuses on sovereignty as ‘a form of legitimate, controlling power’. Digital sovereignty, then, concerns the demand for the legitimisation of the power to gather and analyse others’ data: through making those wielding that power accountable to those others, by returning data to them, by respecting the claims of indigenous groups to control over cultural treasures such as language and the practices around it (Smith, 2016), and similar moves. The sovereignty at stake is not a rivalrous resource, where holding data is to deprive others of it, but ‘is more like a relation (control), in which one may engage more or less intensely and successfully’ (Floridi, 2020, p. 376). As scholars and civil society figures argue, in Van Dijck’s (2021) words, for a ‘new imaginary’ of platforms and data use, the principles established to govern the gathering, storing and use of private information (e.g. Privacy Commissioner, 2020) are being extended to think about cultural resources such as discourse.

I have searched for examples of corpus-informed discourse scholars engaging in these debates or devising ways to study discourse consistent with a stated position on data politics, and found little. I am not aware of ‘citizen science’ discourse tools that would allow members of communities to gather and analyse their own talk in collaboration with researchers, for example. Indigenous language research provides the few examples I have come across, such as the Welsh Twitter Corpus, which is motivated by goals of improving predictive text in Welsh, finding new words in Welsh and producing data for coding clubs for Welsh-speaking children (Jones, Robertson, & Taborda, 2015). There is considerable work to be done in the field, but it seems particularly acutely needed in critical discourse analysis, which begins from a concern about discourse as power.

Pushing beyond the study of word sets

Another way to connect large sets of discourse data back to their contexts of use is to link data in more sophisticated ways to other data, so that more of the social context is available when analysis is done. One way is tagging corpora. This is already often done for grammatical words and for organising text into semantic categories, using word lists and probabilistic techniques to parse the text automatically. It is much harder to do that above the sentence, however, and it is not frequently done. In the analysis of news texts, for example, a corpus will often consist of undifferentiated text, rather than being separated out into headlines, ‘intros’, quotation paragraphs, background and other elements, although the producers and users of the news would not consume it in that undifferentiated way. Baker (2005), for example, studied how the words gay and homosexual frequently occur in a specific kind of language use, British tabloid news, in collocations suggesting shame or secrecy. To my mind, the word gay does not have a stable prosody or emotional tone in tabloid discourse, because its use is intertextual, invoking a range of networks of social discourse in different ways in a text. By attending more closely to how a headline or ‘intro’ or a quotation paragraph uses the word gay, we would get closer to the work the news text does with those intertextual meanings; and to do that we would need to treat those elements differently. In the headline or ‘intro’, gay is likely to be used in a way that gives the word some explanatory force for the actions of the person or group being identified by their sexuality. Later on, particularly in quotations, such words are less key to overall meaning and may be accompanied by competing accounts of a person. Thus, a next step in analysis would be to explore how the elements in a text and references to external texts such as quotations are connected together.
We could follow how sets of words are reconfigured as they are move from quotation to headline, or how they move between politics, news, entertainment, social media and other forms of text.
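The kind of element-aware analysis proposed here can be sketched in a few lines of code. The following is a minimal illustration only: the two-item corpus is invented, and the `collocates` helper is a hypothetical stand-in for what a concordancer would do over a corpus tagged by structural element rather than stored as undifferentiated text.

```python
from collections import Counter

# Invented mini-corpus for illustration: each news item is tagged by
# structural element (headline, intro, quote) rather than stored as one
# undifferentiated block of text.
articles = [
    {"headline": "gay couple win housing case",
     "intro": "a gay couple have won a landmark housing case",
     "quote": "we are delighted with the ruling, the couple said"},
    {"headline": "minister criticised over remarks",
     "intro": "remarks about gay teachers drew criticism",
     "quote": "the gay community deserves an apology, campaigners said"},
]

def collocates(articles, target, element, window=3):
    """Count words occurring within `window` tokens of `target`,
    restricted to one structural element of each article."""
    counts = Counter()
    for article in articles:
        tokens = article[element].split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo)
                              if j != i)
    return counts

# Compare where the word does its discursive work: headline vs quotation.
print(collocates(articles, "gay", "headline").most_common(3))
print(collocates(articles, "gay", "quote").most_common(3))
```

Even this toy example shows how the same target word can attract different collocates in headlines than in quotation paragraphs, which is precisely the differentiation lost when a corpus is treated as a single bag of words.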

Some studies are combining visual and textual analysis, in what Bednarek and Caple (Citation2017) term corpus-assisted multimodal discourse analysis (CAMDA). Most studies of this kind are small scale, such as Zhang and Cheung’s (Citation2022) study of the words and images on 300 Time magazine covers, and manually tagged, so that they end up looking similar in many respects to traditional content analyses. But Bednarek (Citation2015) points to the development of automatic corpus-based visual analysis tools which could allow much larger and more complexly marked-up sets of multimedia texts. She acknowledges such tools may be some way off.

Another way to include some of the context of use of text is to study the flow of words within a corpus. Zappavigna (Citation2011) uses a Twitter StreamGraph to suggest some of the ebb and flow of the association of the word happy and the word Obama in tweets in the half hour after his 2008 US presidential win was announced. The patterning in the use of the two words and the network of other language around happy show the public discourse that was shared in the US on Twitter, allowing us to see the active intermingling of language items as people responded to the event and to each other. While concerns discussed above about the constructedness of the intermingling remain – these users were only very vaguely co-present to each other – the treatment of the text as dynamic is valuable.
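The underlying operation of a StreamGraph of this kind – counting co-occurrences of two words in successive time bins – can be sketched as follows. The timestamps and tweet texts here are invented for illustration, and `cooccurrence_by_bin` is a hypothetical helper, not Zappavigna's actual method; it produces the counts a visualisation layer would then render as flowing bands.

```python
from collections import Counter
from datetime import datetime

# Invented timestamped tweets for illustration only.
tweets = [
    ("2008-11-05 04:01", "so happy obama won"),
    ("2008-11-05 04:03", "happy days, history made"),
    ("2008-11-05 04:12", "obama wins, what a night"),
    ("2008-11-05 04:14", "happy and proud, obama!"),
]

def cooccurrence_by_bin(tweets, word_a, word_b, bin_minutes=10):
    """Count tweets containing both words, grouped into fixed time bins."""
    counts = Counter()
    for stamp, text in tweets:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # Round the timestamp down to the start of its bin.
        bucket = t.replace(minute=t.minute - t.minute % bin_minutes)
        tokens = text.lower().replace(",", " ").replace("!", " ").split()
        if word_a in tokens and word_b in tokens:
            counts[bucket] += 1
    return counts

series = cooccurrence_by_bin(tweets, "happy", "obama")
for bucket in sorted(series):
    print(bucket.strftime("%H:%M"), series[bucket])
```

Plotting such a series over time is what lets the ebb and flow of an association become visible, rather than a single aggregate collocation figure for the whole corpus.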

Final thoughts

Large data sets and computer-assisted analysis of them provide us with new ways of digging into data and should be valued for that. My concern in the four points I have made above is that discourse analysis needs to do that in ways that help it get closer to how people make sense and how they interact. I have argued that corpus-assisted discourse analysis can quite drastically narrow the view on discourse, if used on its own and without accompanying theoretical tools for exploring social practice. It is a far from neutral tool, despite some claims made for it. Corpora are better at helping researchers identify the symbolic resources that people have available to them than at understanding how they use those resources. These tools must be approached through a renewed appreciation of communication as a human accomplishment and corpora must, therefore, be reconnected to the producers of that discourse. To do otherwise is to forget the extent to which the many are being disempowered by the few through power over data. And the use of corpora must push beyond the lexical analysis for which the original dictionary researchers developed concordancers.

This is all an invitation to continue to adapt the tools of discourse analysis and in particular to develop bags of tools in which computational techniques are supportive rather than directive of other approaches. For me, exploring these questions is a reminder of the imperative to take texts back to people. I am exploring how to provide journalists with the tools to understand the labelling patterns in their news practice. But in the current issue of this journal we have more than hypothetical examples. Angela Moewaka-Barnes’ excellent paper takes television texts representing Māori back to Māori audiences for interpretation. This is also, lastly, a reminder of the enormous strengths of the qualitative project that stays close to the human production of meaning. François Cooren, Boris Brummans and Lise Higham show how studying micro-interactions, in which people take on others’ voices, takes us to the heart of the way people manage interaction in talk and helps us evaluate and also foster respectful communication and good listening. Lucy Elkins, Maria Stubbe and Susan Pullon raise deep questions around the ethics of discursive manipulation in health communication through critical discourse analysis of three short texts, pointing to the clash between the imperatives of public health and the rights of individuals to make independent informed decisions. Kate Power is able to link analysis of the representation of violence against women in Pacific news media back to journalistic practice because she stays close to discrete news texts in her analysis, connecting those to the framing of that violence and evaluating the language used against the reality of rape. The papers here together illustrate the strength of discourse analysis that attends to the detail of interaction and representation to understand and critique communication practice.

Disclosure statement

No potential conflict of interest was reported by the author.

Notes

1. An earlier version of this paper was presented at the 5th New Zealand Discourse Conference, Auckland University of Technology, 7–9 December, 2015.

References

  • Ancarno, C. (2020). Corpus-assisted discourse studies. In A. de Fina & A. Georgakopoulou (Eds.), The Cambridge handbook of discourse studies (pp. 165–185). Cambridge University Press. doi:10.1017/9781108348195.009
  • Baker, P. (2005). Public discourses of gay men. London: Routledge.
  • Baker, P. (2006). Using corpora in discourse analysis. London: Continuum.
  • Barbrook, R., & Cameron, A. (1996). The Californian ideology. Science as Culture, 6(1), 44–72. doi:10.1080/09505439609526455
  • Bates, J. (2018). The politics of data friction. Journal of Documentation, 74(2), 412–429. doi:10.1108/JD-05-2017-0080
  • Bednarek, M. (2015). Corpus-assisted multimodal discourse analysis of television and film narratives. In P. Baker & T. McEnery (Eds.), Corpora and discourse studies (pp. 63–87). London: Palgrave Macmillan.
  • Bednarek, M., & Caple, H. (2017). The discourse of news values: How news organizations create newsworthiness. Oxford: Oxford University Press.
  • Berry, D. (2011). The computational turn: Thinking about the digital humanities. Culture Machine, 12, 1–22.
  • Biber, D. (2009). A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics, 14(3), 275–311. doi:10.1075/ijcl.14.3.08bib
  • Blommaert, J. (2005). Discourse: A critical introduction. Cambridge: Cambridge University Press.
  • boyd, D., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679. doi:10.1080/1369118X.2012.678878
  • boyd, D., Golder, S., & Lotan, G. (2010, January). Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In 2010 43rd Hawaii International Conference on System Sciences, Honolulu (pp. 1–10). IEEE.
  • Butler, D. (2013, February). When Google got flu wrong: US outbreak foxes a leading web-based method for tracking seasonal flu. Nature, 494(7436), 155–156. http://www.nature.com/news/when-google-got-flu-wrong-1.12413
  • Deleuze, G., & Guattari, F. (1988). A thousand plateaus: Capitalism and schizophrenia. London: Bloomsbury Publishing.
  • Dourish, P. (2004). What we talk about when we talk about context. Personal and Ubiquitous Computing, 8(1), 19–30. doi:10.1007/s00779-003-0253-8
  • Driscoll, K., & Walker, S. (2014). Working within a black box: Transparency in the collection and production of big Twitter data. International Journal of Communication, 8, 1745–1764.
  • Drucker, J. (2011). Humanities approaches to graphical display. Digital Humanities Quarterly, 5(1). http://www.digitalhumanities.org/dhq/vol/5/1/index.html
  • Floridi, L. (2020). The fight for digital sovereignty: What it is, and why it matters, especially for the EU. Philosophy and Technology, 33, 369–378. doi:10.1007/s13347-020-00423-6
  • Franzke, A. S., Bechmann, A., Zimmer, M., Ess, C., & Association of Internet Researchers. (2020). Internet research: Ethical guidelines 3.0. https://aoir.org/reports/ethics3.pdf
  • Hunston, S., & Francis, G. (1999). Pattern grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins Publishing.
  • Jones, D. B., Robertson, P., & Taborda, A. (2015). Corpus of Welsh language tweets. Welsh National Language Technologies Portal. Retrieved from http://techiaith.org/corpora/twitter/?lang=en
  • Jones, R. H. (2011). Data collection and transcription in discourse analysis. In K. Hyland & B. Paltridge (Eds.), The Bloomsbury companion to discourse analysis (pp. 9–21). London: Bloomsbury.
  • Kennedy, H. (2016). Post, mine, repeat: Social media data mining becomes ordinary. London: Palgrave Macmillan UK.
  • King, B. W. (2015). Investigating digital sex talk practices: A reflection on corpus-assisted discourse analysis. In R. H. Jones, A. Chik, & C. A. Hafner (Eds.), Discourse and digital practices: Doing discourse analysis in the digital age (pp. 130–143). London: Routledge.
  • Knight, L., & Chrisafis, A. (2022, October 9). Bruno Latour, French philosopher and anthropologist, dies aged 75. The Guardian. Retrieved from https://www.theguardian.com/world/2022/oct/09/bruno-latour-french-philosopher-anthropologist-dies
  • Knoblauch, H. (2021). Reflexive methodology and the empirical theory of science. Historical Social Research/Historische Sozialforschung, 46(2), 59–79. doi:10.12759/hsr.46.2021.2.59-79
  • Koller, V., & Mautner, G. (2004). Computer applications in critical discourse analysis. In C. Coffin, A. Hewings, & K. O’Halloran (Eds.), Applying English grammar: Corpus and functional approaches (pp. 216–228). London: Arnold.
  • Lee, R., & Sumiya, K. (2010). Measuring geographical regularities of crowd behaviors for Twitter-based geo-social event detection. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Location Based Social Networks (pp. 1–10). New York: ACM.
  • Louw, B. (1993). Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In M. Baker, G. Francis, & E. T. Bonelli (Eds.), Text and technology: In Honour of John Sinclair (pp. 157–176). Amsterdam: John Benjamins.
  • Manovich, L. (1999). Database as symbolic form. Convergence, 5(2), 80–99. doi:10.1177/135485659900500206
  • Matheson, D. (2018). The performance of publicness in social media: Tracing patterns in tweets after a disaster. Media, Culture & Society, 40(4), 584–599. doi:10.1177/0163443717741356
  • McDermott, S. (2013). Countering the social ignorance of ‘social’ network analysis of the media ecology mining for network data and content analysis: A case study of the Singapore blogosphere [Unpublished PhD thesis]. University of Leeds.
  • McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Taylor and Francis.
  • Nartey, M., & Mwinlaaru, I. N. (2019). Towards a decade of synergising corpus linguistics and critical discourse analysis: A meta-analysis. Corpora, 14(2), 203–235. doi:10.3366/cor.2019.0169
  • O’Halloran, K. (2014). Counter-discourse corpora, ethical subjectivity and critique of argument: An alternative critical discourse analysis pedagogy. Journal of Language and Politics, 13(4), 781–813. doi:10.1075/jlp.13.1.09oha
  • O’Halloran, K., & Coffin, C. (2004). Checking overinterpretation and underinterpretation: Help from corpora in critical linguistics. In C. Coffin, A. Hewings, & K. O’Halloran (Eds.), Applying English grammar: Corpus and functional approaches (pp. 275–297). London: Arnold.
  • Privacy Commissioner. (2020). Privacy act 2020 and the privacy principles. Retrieved from https://www.privacy.org.nz/privacy-act-2020/privacy-principles/
  • Rapp, A., Curti, L., & Boldi, A. (2021). The human side of human-chatbot interaction: A systematic literature review of ten years of research on text-based chatbots. International Journal of Human-Computer Studies, 151, 102630. doi:10.1016/j.ijhcs.2021.102630
  • Richardson, C., & Kennedy, L. (2012). ‘Gang’ as empty signifier in contemporary Canadian newspapers. Canadian Journal of Criminology and Criminal Justice, 54(4), 443–479. doi:10.3138/cjccj.2011.E.32
  • Scott, M. (2010). Problems in investigating keyness, or clearing the undergrowth and marking out trails. In M. Bondi & M. Scott (Eds.), Keyness in texts (pp. 43–58). Amsterdam: John Benjamins.
  • Seaver, N. (2015). The nice thing about context is that everyone has it. Media, Culture & Society, 37(7), 1101–1109. doi:10.1177/0163443715594102
  • Smith, D. E. (2016). Governing data and data for governance: The everyday practice of Indigenous sovereignty. In T. Kukutai & J. Taylor (Eds.), Indigenous data sovereignty: Toward an agenda (pp. 117–138). ANU Press. doi:10.22459/CAEPR38.11.2016
  • Stubbs, M. (1997). Whorf’s children: Critical comments on critical discourse analysis. In A. Ryan & A. Ray (Eds.), Evolving models of language (pp. 110–116). Clevedon: Multilingual Matters.
  • Stubbs, M. (2010). Three concepts of keywords. In M. Bondi & M. Scott (Eds.), Keyness in texts (pp. 21–42). Amsterdam: John Benjamins.
  • Tognini Bonelli, E. (2004). Working with corpora: Issues and insights. In C. Coffin, A. Hewings, & K. O’Halloran (Eds.), Applying English grammar: Corpus and functional approaches (pp. 11–24). London: Arnold.
  • Van Dijck, J. (2014). Datafication, dataism and dataveillance: Big Data between scientific paradigm and ideology. Surveillance & Society, 12(2), 197–208. doi:10.24908/ss.v12i2.4776
  • Van Dijck, J. (2021). Seeing the forest for the trees: Visualizing platformization and its governance. New Media & Society, 23(9), 2801–2819. doi:10.1177/1461444820940293
  • van Dijk, T. (1991). Racism and the press. London: Routledge.
  • Widdowson, H. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1), 3–25. doi:10.1093/applin/21.1.3
  • Williams, R. (1977). Keywords: A vocabulary of culture and society. Glasgow: William Collins.
  • Zappavigna, M. (2011). Ambient affiliation: A linguistic perspective on Twitter. New Media & Society, 13(5), 788–806. doi:10.1177/1461444810385097
  • Zhang, W., & Cheung, Y. L. (2022). The co-construction of news values on news magazine covers: A corpus-assisted multimodal discourse analysis (CAMDA). Journalism Studies, 1–25. doi:10.1080/1461670X.2022.2123845
  • Zimmer, M. (2010). “But the data is already public”: On the ethics of research in Facebook. Ethics of Information Technology, 12(4), 313–325. doi:10.1007/s10676-010-9227-5