Research Article

Critical computation: mixed-methods approaches to big language data analysis

Pages 62-78 | Received 01 Apr 2022, Accepted 28 Aug 2022, Published online: 03 Apr 2023

ABSTRACT

In this theoretical piece, we discuss the limitations of using purely computational techniques to study big language data produced by people online. Instead, we advocate for mixed-method approaches that are able to more critically evaluate and consider the individual and social impact of this data. We propose one approach that combines qualitative, traditional quantitative, and computational methods for the study of language and text. Such approaches leverage the speed and expediency of computational tools while also highlighting the value of qualitative methods in critically assessing the outcome of computational results. In addition to this, we highlight two considerations for communication scholars utilizing big data: (1) the need to consider more language variations and (2) the importance of self-reflexivity when conducting big language data research. We conclude with additional recommendations for researchers seeking to adopt this framework in the context of their own research.

In a 2014 Special Section Introduction of the International Journal of Communication titled, “Big Data, Big Questions,” Kate Crawford, Kate Miltner, and Mary L. Gray pondered, from an ethical and epistemological perspective, why big data research seemed to suddenly have garnered broad attention in the communication field. These authors cautioned against technologically deterministic answers to the question of “why now?”, calling them unsatisfying, and advocated for “new critical approaches to big data”.Footnote1 Five years later, Emese Domahidi and colleagues identified a “computational turn” in the social sciences—and, specifically, the communication discipline—noting that the larger computational social sciences had reached the stage of a mature discipline.Footnote2 Yet, despite advances in explainable, or interpretable, artificial intelligence, big data research largely continues to rely on so-called “black box” techniques and, thus, remains widely power agnostic.Footnote3 As Klaus Krippendorff has noted, computers (and, by extension, computational methods) “process character strings, not meanings,” so relying solely on these outputs—rather than the knowledge of subject matter experts—can “run the risk of trivializing the meanings of texts.”Footnote4

Power and marginalization have always been, and continue to be, intrinsic to communication. While there is a long history of critical/cultural scholarly attention to this fact, more communication scholars are heeding normative commitments to equity, democracy, and justice, and approaching a wide variety of topics with an increasingly critical eye. We argue that the “computational turn” in communication research will require the adoption of computational techniques (which have been traditionally conceptualized as quantitative approaches); however, this work must be supplemented with qualitative methods. This is because, from a critical perspective, the advantages of quantitative methods are also their primary weakness.Footnote5

To be clear, no one approach (quantitative or qualitative) is inherently superior to the other, including and especially when analyzing big data. Instead, each has its strengths and its weaknesses pertaining to the ontology of the case, our epistemological assumptions, and the specific material logistics of the study we are undertaking.Footnote6 Quantitative methods, including computational approaches to big data, reduce complex and contextually rich phenomena into easily understood descriptive and inferential statistics. While this is an advantage in terms of repeatability, generalizability, and the identification of underlying causal and/or predictive patterns,Footnote7 social reality—including issues of power and marginalization—cannot be reduced to just a few variables that ignore the complexities of people or society. This is not to suggest that adding the variables “power” and “marginalization”—the add “x” and stir approach—is either possible or the solution to this problem. It is simply to say that positivist, quantitative approaches like those often practiced in big language data fail to account for our complex social world. On the other hand, while they are not generally repeatable or generalizable and are less adept at identifying large-scale patterns, qualitative and humanistic approaches allow researchers to closely study people's lived experiences and broader social realities to uncover the rich detail and fabric of their lives. The combination of both is necessary to consider that technology and especially computers and software (including algorithms) are only as good as the people who make and use them.Footnote8

In this essay, we think through what big language data in communication should look like and how they can best inform theory, and recommend practical big data analytic techniques and methods that can best address the urgent need for a more widespread critical computational turn. In what follows, we (a) contextualize the history and meaning of big language data, (b) theorize an iterative mixed-methods approach to big language data analysis in communication, (c) provide a mixed-methods toolkit for language communication analysis of big data, and (d) discuss the future of big language data analysis.

Transparency statement

This work is built on the combined experience of two communication scholars who began their research trajectories using qualitative approaches but transitioned to computational and qualitative mixed methods to better accomplish their goals. As a result of this evolution, both researchers also regularly advocate for the importance of a mixed-methods approach to big language data analysis. This essay is the culmination of this endorsement.

One author's research is rooted in feminist media studies, and her methodological background drew primarily from qualitative approaches, including interviews, critical discourse analysis, and ethnographic observation. During her doctoral studies, she transitioned to an interdisciplinary, methodologically agnostic approach that combines both computational and qualitative techniques to examine and critique the large volumes of content produced by the digital cultures she studies. As a white, cishet woman living in a diverse metropolitan area in the South and studying alternative belief systems and supremacism, this scholar is particularly inspired to promote equitable and ethical analysis of big language data.

The other author began as a critical discourse analysis scholar but turned to quantitative and computational approaches to meet the needs of analyzing social media and news data during contentious political moments. Her work is grounded in postcolonial scholarship and democratic theory and is inspired by her background as a first-generation researcher and Asian American woman. As a scholar who specializes in language and linguistic analyses using multiple disciplinary approaches, she is methodologically motivated to build frameworks for studying big language data efficiently and ethically.

We hope these brief statements highlight how realistic it is to undertake mixed-method work, even if you have been firmly grounded in one methodological approach for a long time. Whether you are a qualitative researcher nervous about using quantitative or computational approaches to big data, a quantitative computational researcher concerned about losing the “objectivity” in your work, or simply someone seeking ways to apply a more ethical or critical perspective in your ongoing mixed-method projects, we hope you will find inspiration in the pages that follow.

Big language data

After more than two decades of big data research, there remains a lack of academic consensus around a formal definition of the term “big data.”Footnote9 This definitional lacuna is likely due, in part, to the interdisciplinary nature of the field,Footnote10 as well as lingering confusion across fields around differences between big data, computational social science, data science, and other such terms.Footnote11 Nevertheless, Ossi Ylijoki and Jari Porras tie early attempts to define big data to a 2001 META Group report by Doug Laney, which conceptualized “big data” as being large in size (volume), rapidly produced (velocity), and varied in terms of modality and information (variety); this perspective persists in current research,Footnote12 though it has been expanded upon with additional dimensions (e.g., variability, volatility, validity).Footnote13

Since then, numerous researchers have provided meta-analyses of various big data definitions.Footnote14 Reviewing definitions of “big data” across 62 academic papers, for example, Ylijoki and Porras find 17 definitions that are “logically inconsistent” with one another.Footnote15 A corpus analysis by Andrea De Mauro, Marco Greco, and Michele Grimaldi found four themes common to big data topics: the proliferation of new digitized information; technology; typical methods used in big data analysis; and impact.Footnote16 More recently, Yaseen and Obaid classified big data definitions into similar groups, arguing various scholars have used the phrase to describe “a social phenomenon, analytical technique, a process or a data set.”Footnote17

While useful, this definitional work has typically focused on the collection and processing of the data,Footnote18 including data acquisition, analysis, curation, storage, and usage,Footnote19 with a specific focus on the size, complexity, and technological requirements of big data.Footnote20 However, big data also shape the research that communication scholars conduct, highlighting a need to self-reflexively consider how big data fit in the field's broader research agenda. We assert that big data are useful to communication scholars, in particular, in their ability to study language at scale and, potentially, identify patterns that have impacts on communicative phenomena across a variety of groups and platforms.

In this paper, we focus specifically on big language data, which we define as big data about written or spoken language. While recognizing the importance of many different communication modalities (including visual communication), analyses of language communication continue to dominate our field. The availability of a massive amount of language data creates opportunities for researchers to study and understand language communication at a more granular level and through a mediated process.

We intentionally use the term language and not text, discourse, or rhetoric, to speak broadly about various forms of written and spoken communication. While fully acknowledging the importance of these subfields, a consideration of just one would exclude communication scholars using big data to study, for example, interpersonal and organizational communication. Additionally, it runs the risk of further siloing so-called area studies away from more mainstream disciplinary conversations. Furthermore, a focus on text would overlook spoken language, despite the growing body of social science research using audio data.Footnote21

While the analysis of language at the size of big data is relatively new, communication and media scholars have long employed both qualitative and quantitative approaches to the study of text originating in legacy media, across social media platforms, in political speeches and written documents (including political adverts), and much more. Quantitative approaches to language analysis have helped develop theories like agenda-setting and framing theory, while qualitative approaches afford the ability to consider how power is embedded in language communication and to take a deeper dive into the complexities of social, economic, and political life.

When operationalizing big language data, it is essential that scholars consider the relationship between the text and the meaning of the text, particularly if a word, phrase, or sentence can evoke different meanings and interpretations for communication receivers.Footnote22 These connotative meanings vary across place, time, and context and, thus, require not only subject-matter expertise but also the human ability to discern styles of language (e.g., between satire, parody, sarcasm, etc.). It is therefore essential for researchers to consider carefully whether approaches to analyzing big language data may be semantically biased as a result of keyword selection. For example, computational approaches using traditional dictionary methods (i.e., a set of keywords) require the use of existing dictionaries, which may not fit the case or context under investigation, or the creation of an a priori dictionary, both of which suffer from validity concerns.Footnote23 Critical and cultural approaches in communication and other disciplines, in particular, have highlighted the importance of considering multiple interpretations of a text based on situated knowledges or positionalities, which is not possible using keyword-selection methods.
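
To make this concern concrete, the short sketch below illustrates a bare-bones dictionary approach; the keyword list and example posts are invented for illustration, and any real dictionary would need to be validated against the specific case under study. Because matching operates on character strings alone, idiomatic or sarcastic uses of a keyword count the same as literal ones, which is precisely the validity problem described above.

```python
# A hypothetical sketch of a dictionary (keyword-matching) approach.
# The keyword list and example posts are illustrative only; real dictionaries
# must be validated against the case and context under investigation.
import re

anger_dictionary = {"furious", "outraged", "hate", "disgusted"}  # a priori keywords

posts = [
    "I absolutely hate waiting in line, but the staff were lovely.",
    "That performance left me speechless.",          # strong emotion, but no keyword match
    "I hate to admit it, but the policy worked.",    # idiomatic use still counted as anger
]

def dictionary_hits(text, dictionary):
    """Count keyword matches, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in dictionary for token in tokens)

for post in posts:
    print(dictionary_hits(post, anger_dictionary), "|", post)
```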

Big language data analysis: the value of an iterative, mixed-methods approach

Given the most recent critical turn in the communication discipline, combined with the stated strengths and weaknesses of both quantitative and qualitative methods, we suggest big data communication scholars should rely on iterative, mixed-method approaches. In what follows, we do not imply that calls for quantitative and qualitative scholars to work together or use mixed methods are new. They date back at least to Lazarsfeld and Mills’ attempted collaboration in the middle of the 20th century.Footnote24 However, scholars have noted the rising popularity of mixed-methods approaches,Footnote25 despite an unfortunate “narrative of difference” that “often overshadows important existing similarities” between quantitative, computational, and qualitative approaches.Footnote26 As a result, we take seriously Ophir, Walter, and Marchant's warning to attend to the strengths of a blended methodological approach rather than to each perspective's perceived weaknesses.Footnote27 To this end, we outline the theoretical foundations of this recommendation, drawing specifically on the relationship between quantitative, computational, and qualitative approaches to the study of big language data analysis.

First, computational and qualitative methodological approaches are symbiotic from both an ontological and epistemological perspective. Ontologically, computational methods show us what is within a given corpus via tools such as web scraping, topic modeling, and network analysis, while qualitative methods such as textual analysis and ethnography provide a social constructivist perspective on existence, reality, and the social world.Footnote28 Epistemologically, computational methods provide a more “objective” and “empirical” approach to text analysis, while qualitative methods offer a more subjective, interpretive analysis into the ideologies and culture underlying these discourses.Footnote29 Taken together, mixed quantitative, computational, and qualitative methods provide a richer and more meaningful picture of people, content, and sociocultural meaning than only one of these methods in isolation.

Second, computational methods can be used in tandem with qualitative approaches because, “fundamentally, qualitative researchers seek to preserve and analyze the situated form, content, and experience of social action, rather than subject it to mathematical or other formal transformations. Actual talk, gesture, and so on are the raw materials of analysis.”Footnote30 In this interpretive tradition, researchers are less concerned with counting stuff and more concerned with meaning. Computational methods, then, provide tools for collecting, analyzing, interpreting, and visualizing these raw materials that can help scholars make analytical sense of large, complex, and often fast-moving data sets.

In the field of communication, these mixed-method approaches can aid researchers working with online platforms who want to examine, for example, the communicative practices deployed by social media (e.g., Twitter, Facebook, and YouTube), web forum (e.g., Reddit), or image board (e.g., 4Chan) users in large corpora containing tens of thousands or even millions of posts. They could also be used to inductively identify common frames used by one or multiple media outlets across decades of reportage, or to analyze the scripts of long-running television programs for common themes and representative erasures, as well as in many other use cases. While there are many ways to “mix your methods,” and we lay out one such process below, a communication researcher has many opportunities to combine qualitative and quantitative approaches. For example, a researcher with a dataset of posts from a subreddit can conduct inductive, exploratory analysis using a computational approach such as topic modeling and supplement it with an in-depth textual analysis of the most statistically representative items in a dataset as identified by the topic model.
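
As an illustration of that last example, the sketch below (assuming scikit-learn and a placeholder corpus of invented posts) fits a small topic model and then surfaces the posts most strongly associated with each topic, which a researcher could then read closely.

```python
# A hypothetical sketch: fit a topic model to a corpus of posts, then surface
# the most representative posts per topic for close qualitative reading.
# The corpus and all parameters here are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "the moderators removed my post about vaccine safety",
    "great thread on vaccine side effects and safety data",
    "anyone else watching the election debate tonight",
    "the debate coverage on cable news was exhausting",
    "new study on vaccine hesitancy in rural communities",
    "polling numbers shifted after the final debate",
]  # placeholder corpus; replace with the collected subreddit posts

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(doc_term)      # document-by-topic probabilities
terms = vectorizer.get_feature_names_out()

for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-6:][::-1]]
    top_docs = doc_topic[:, k].argsort()[-3:][::-1]   # most representative posts
    print(f"Topic {k}: {', '.join(top_terms)}")
    for i in top_docs:
        print("   ", posts[i])
```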

A case in point is the communication-adjacent field of digital humanities, where mixed-methods approaches are widely deployed and an array of relevant research points to the capacity of computational methods to address qualitative concerns understood methodologically and tactically.Footnote31 The field of communication studies often straddles the line between humanities and social sciences; however, when practiced from a critical computational perspective that maintains normative commitments, communication scholarship can engage in transformative critique using emerging methods, tools, and practices,Footnote32 including mixed-methods approaches. Digital humanities scholars using computational methods have long attended to issues concerning critical and feminist scholars and drawn on these theories for inductive analysis and theory building.Footnote33 This iterative, mixed-method approach has only sporadically been applied in communication, though we suggest it has several meaningful advantages for critical big language data analysis in the communication and media studies fields.

In their important work on feminist science and technology studies, particularly on feminist data visualization, Catherine D’Ignazio and Lauren Klein outline six principles drawn from feminist theory, which provide an example of how critical, iterative, mixed-methods approaches can be applied to big language data analysis.Footnote34 First, these authors recommend rethinking binaries in our “data collection and classification” and accounting “for a range of multiple and fluid categories.”Footnote35 Second, they encourage us to embrace pluralism by considering how our own positionality affects our research and to get comfortable with “multiple truths” rather than chasing objectivity.Footnote36 Third, they highlight the importance of examining power and aspiring to empower by foregrounding the coconstruction of knowledge between researchers and subjects and “ensuring that the outcomes of our design research connect back to the communities that first made them possible.”Footnote37 Fourth, they propose we consider the context of the knowledge produced insofar as it is situated in “particular social, cultural, and material” circumstances rather than atemporal and ahistorical.Footnote38 Fifth, they endorse viewing knowledge as embodied and affective, and point out that this perspective is as legitimate as other ways of knowing—even in seemingly “efficiency-oriented and task-driven” research designs.Footnote39 Jessica Enoch and Jean Bessette echo this call, arguing that new opportunities for meaningful engagement with digitally archived materials create space for strategic thinking around material and embodied responses to them.Footnote40 Finally, D’Ignazio and Klein ask that we make our labor visible by “working backwards [to data provenance] to surface the [human] actors … that have labored to generate a particular dataset.”Footnote41

While these six principles arise out of feminist science and technology studies, they translate into the fields of communication and media studies in several important ways. Here, we provide some concrete examples. First, rethinking binaries has long been a paramount concern in areas such as feminist media studies and computer-mediated communication, where gender has been conceptualized as performative rather than biological.Footnote42 Second, embracing pluralism is a common consideration in the field of political communication because it recognizes a diversity of opinions in the political process.Footnote43 Third, examining power and aspiring to empower are foundational principles of both the Activism and Social Justice Division of the National Communication Association and the Activism, Communication and Social Justice (ACSJ) Interest Group of the International Communication Association.Footnote44 Fourth, communication scholars have written on the importance of considering the context for decades because it is a fundamental part of decoding messages.Footnote45 Fifth and finally, viewing knowledge as embodied and affective is an important component of understanding, negotiating, and disseminating knowledge production in scholarly output within the discipline itself and within the broader social world.Footnote46 We hope these examples offer some contextualization around how interdisciplinary scholarship connects to the communication discipline (itself an historically interdisciplinary field).

In summary, the culture of sharing and the open-source nature of big language data and computational methods make them ideal for feminist and other critical projects in the communication discipline and beyond it, and this is particularly true for groups that have traditionally been difficult to study.Footnote47 In other words, the very nature of big data, computational analysis, and new media have made possible not only new, “epistemic communities, knowledge networks, or communities of practice” but also the study of the same.Footnote48 These newer ecosystems are particularly well suited to analysis by communication and media studies scholars, who have long attended to knowledge production, information flows, and group formation. Although some qualitative scholars see big language data, digital humanities, and computational methods as lacking in theoretical grounding and too managerial and technology-centric, others suggest these methods are actually improved by the qualitative and critical scholars who engage with them.Footnote49 It is in this spirit that we propose a mixed-methods process, incorporating elements of D’Ignazio and Klein's principles, to big language data analysis for scholars of communication and media.

A proposed process to big language data analysis

Communication scholars are in a unique position to leverage both traditional and computational methods. Our field has long analyzed text and language and is notable for its use of methods such as discourse analysis and content analysisFootnote50 to study a wide range of human communication, including interpersonal interactions,Footnote51 speeches,Footnote52 news,Footnote53 creative works,Footnote54 and advertisements.Footnote55 At the same time, these traditional approaches can be difficult to employ, given the overwhelming amount of language data that societies now produce; practically, it is not possible to manually label that quantity of data.

Thankfully, though, communication scholars have historically recognized the need to expand the methods they use to study language communication.Footnote56 In the digital era, this includes the use of programming and computer tools; this is as true for quantitative and computational researchers using R and Python as it is for qualitative scholars, many of whom rely on programs such as MaxQDA and NVivoFootnote57 to transcribe and manually code different discursive features.Footnote58 The availability of these tools has also made it easier to integrate quantitative and qualitative language analysis approaches. For example, qualitatively derived codes stored in the qualitative text analysis tool “QDA Miner” can be migrated to its “sister program,” the quantitative program Wordstat.Footnote59 The popularity of these tools has also generated an interest in free and open-source software, making computer-assisted language analysis more accessible.Footnote60

One advantage of these computational approaches is the ability to “scale up” traditional content analyses by using human-labeled data to train classification algorithmsFootnote61 that “learn” to detect the presence of a language feature. A common application of this is sentiment analysis: the use of algorithmic classifiers to identify messages as having a positive or negative valence.Footnote62 Unsupervised approaches like topic modeling also make it easier to explore the data by, amongst other things, highlighting the most statistically representative information in a given dataset for analysis.Footnote63

It is worth noting that computational approaches are inherently reductive: in treating text and language as sources of data to be mined for specific language features, computational approaches such as supervised machine learning are not meant to understand language—they are meant to identify specific and repeated “signals” embedded in language communication. For this reason, the reduction of language is an essential preprocessing step in these analyses.Footnote64 This is problematic for complex language phenomena such as the detection of sarcasm and irony,Footnote65 but it can also help identify meaningful and explicit language features. To paraphrase George Box: all language models are wrong, but some are useful.
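
The sketch below illustrates what this reduction typically looks like in practice; it assumes the NLTK library (with its English stopword list downloaded), and the example sentence is invented. Each step trades contextual information for tractability, which is exactly why sarcasm and irony tend to fall out of the analysis.

```python
# A minimal sketch of preprocessing as "reduction," assuming NLTK is installed
# and nltk.download("stopwords") has been run. Each step discards information
# (case, punctuation, function words, inflection) in exchange for tractability.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def reduce_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # strip case and punctuation
    tokens = [t for t in tokens if t not in stops]    # drop function words
    return [stemmer.stem(t) for t in tokens]          # collapse inflected forms

print(reduce_text("Oh sure, that went REALLY well, didn't it?"))
# The sarcasm above survives only as a handful of content stems; the contextual
# cues that signaled irony are exactly what the reduction discards.
```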

One way in which scholars can increase the “usefulness” of these approaches is to involve human researchers throughout the analytical process. In computer science, these are called “human in the loop” (HITL) approaches; we argue that humans are essential for validating any computational results.Footnote66 Correspondingly, computational communication research studying big language data is cybortic—consisting of both computational and human labor. In keeping with D’Ignazio and Klein, we also recommend making the human labor visible and transparent throughout the process.Footnote67

To maximize the value of both traditional and computational language analysis approaches, we propose a mixed-methods analysis pipeline to analyze big language data. This approach is both iterative and systematic, with inductive and deductive portions that inform one another. We describe this mixed-method language analysis approach in five steps.

The first step focuses on exploring the text data. This step is often descriptive and inductive, and is an essential process for the scholar to understand simply what is in the language data. Traditionally, communication scholars have carried out this step using qualitative approaches such as grounded theory, qualitative coding, and discourse analysis.Footnote68 Recent scholarship, however, has also highlighted the value of using exploratory computational approaches such as unsupervised machine learning and network analysis to focus on the most statistically representative text in a given corpus.Footnote69 This step also necessarily involves, as D’Ignazio and Klein suggest, the inclusion of human subject-matter experts to situate our work within the broader context in which it is conducted.Footnote70
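
A first descriptive pass can be as simple as the frequency counts sketched below (the corpus variable is a placeholder for the researcher's collected documents); computational exploration of this kind is meant to orient, not replace, the subject-matter expertise discussed above.

```python
# A minimal sketch of a descriptive first pass over a corpus: how many documents,
# how many tokens, and which terms appear most often. The corpus is a placeholder.
import re
from collections import Counter

corpus = ["..."]  # replace with the collected documents

token_counts = Counter()
for document in corpus:
    token_counts.update(re.findall(r"[a-z']+", document.lower()))

print("Documents:", len(corpus))
print("Tokens:", sum(token_counts.values()))
print("Most frequent terms:", token_counts.most_common(25))
```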

Once we have explored the text data, the second step is for researchers to identify meaningful patterns of interest. This approach should be driven by a research question and is the first step to translating a conceptual definition into its operationalization. During this stage, researchers would benefit greatly from memo writing, a process that is often associated with qualitative grounded theory coding.Footnote71 However, memo writing also benefits scholars employing quantitative and computational approaches, as memos are a record of the methodological approaches taken and the researchers’ impressions at various stages of the analysis. Writing memos can also enhance our ability to rethink binaries that may be baked into data collection and classification of more quantitative methods, as well as embrace pluralism as we grapple with our own positionality and the possible appearance of seemingly contradictory data points or “multiple truths.”Footnote72

The third step is then to construct a codebook for manual human coding. This step most closely aligns with traditional content analysis: identifying meaningful labels to code, establishing intercoder reliability among coders, and then manually coding the content.Footnote73 However, a key difference between this step and traditional content analysis is how these manual labels are utilized: In traditional content analysis, these labels are the data, whereas in our mixed-methods approach, these labels are a sample that a computer will then learn from.
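
As a minimal sketch of the reliability check in this step, the example below assumes scikit-learn and two coders who have applied an invented codebook to the same sample of posts; Cohen's kappa is shown here, and Krippendorff's alpha is a common alternative when there are more than two coders or missing codes.

```python
# A hypothetical sketch: two coders apply an invented codebook ("attack",
# "policy", "other") to the same sample of posts, and agreement is checked
# before coding proceeds. Labels are illustrative only.
from sklearn.metrics import cohen_kappa_score

coder_a = ["attack", "policy", "attack", "other", "policy", "policy", "attack"]
coder_b = ["attack", "policy", "attack", "policy", "policy", "policy", "attack"]

print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")
# Proceed to coding the full sample only once agreement meets the field's threshold.
```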

The fourth step is to use the human labels produced in the content analysis to build a supervised machine learning text classifier. This computational modeling approach is useful for expanding upon the content analysis done in the previous step and is often necessary to analyze big language data “at scale,” as it is unlikely that the content analysis can be conducted on the entirety of a big language dataset. Increasingly, supervised machine learning has become a popular approach to studying social media language because of the quantity of messages produced.Footnote74 However, researchers have also applied supervised machine learning to news data.Footnote75
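
The sketch below illustrates this step under the assumption that scikit-learn is available; the hand-coded sample, the civil/uncivil codes, and the unlabeled posts are all invented for illustration, and a real project would train on a far larger coded sample and evaluate the model before scaling up (see the next step).

```python
# A hypothetical sketch of step four: the manually coded sample becomes training
# data for a text classifier, which is then applied to posts no human will read.
# Texts, labels, and codes are invented for illustration.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

coded_texts = [
    "you people are a disgrace to this country",
    "thanks for sharing the report, very informative",
    "get out of this thread, nobody wants you here",
    "i appreciate the thoughtful breakdown of the bill",
    "what an idiotic take, delete your account",
    "interesting point, i had not considered the budget angle",
]
coded_labels = ["uncivil", "civil", "uncivil", "civil", "uncivil", "civil"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(coded_texts, coded_labels)

# "At scale": label the rest of the big language dataset automatically.
uncoded_texts = [
    "this community deserves better than your garbage opinions",
    "great summary, looking forward to the follow-up",
]
for text, label in zip(uncoded_texts, classifier.predict(uncoded_texts)):
    print(label, "|", text)
```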

While this computational modeling is often the focal point of an academic paper, this is not the final step. The fifth essential step is to assess the quality of the classifier. This is ideally done using a combination of quantitative and qualitative approaches. For example, Claire Lauer, Eva Brumberger, and Aaron Beveridge compared the labeling of text data by human researchers versus machine learning classifiers, finding that machine learning classifications were “less useful and less reliable” at identifying complex forms of language.Footnote76
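
One way to carry out this assessment, sketched below assuming scikit-learn, a held-out human-labeled test set (test_texts and test_labels, placeholders here), and the fitted classifier from the previous step, is to pair standard quantitative metrics with a close qualitative reading of the cases where the classifier and the human coders disagree.

```python
# A minimal sketch of quality assessment: quantitative metrics plus a qualitative
# reading of disagreements. test_texts and test_labels are placeholders for a
# held-out, human-labeled sample; classifier is the model fitted in step four.
from sklearn.metrics import classification_report

predicted = classifier.predict(test_texts)
print(classification_report(test_labels, predicted))  # precision, recall, F1 per code

# Qualitative follow-up: read the disagreements closely, not just the scores.
disagreements = [
    (text, human, machine)
    for text, human, machine in zip(test_texts, test_labels, predicted)
    if human != machine
]
for text, human, machine in disagreements[:20]:
    print(f"human={human} | classifier={machine} | {text}")
```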

Qualitative approaches are useful across these steps, but they may be especially helpful for computational communication researchers as they validate their results and recontextualize their findings within the broader communication system. As we note, computational approaches are inherently reductive: excellent at identifying specific language features, but less useful for understanding how that feature is situated within the wider communicative situation or phenomenon. A return to qualitative analyses at the end can enhance a researcher's understanding of those language features.

One way to do so is to apply Fairclough's discourse approach.Footnote77 This can be done by treating the classifier as the micro, text-level analysis, but contextualizing these findings to meso-level considerations and societal (macro-level) factors. While these elements are relevant throughout the research process, this final step also has the benefit of providing a space for researchers to more fully examine power, identify possible ways to empower the community under study, and consider whether and how they may address embodied and affective knowledge unearthed throughout.Footnote78

The future of big language data analysis: two considerations

For communication researchers, the availability of big language data and the multitude of approaches that can be used open up new avenues of research and theory building. Leveraging the field's traditional qualitative and quantitative work on content while adopting newer computational approaches (as we propose in our mixed-method approach above) will help researchers not only to analyze the sheer quantity of big language data but also to contextualize their findings, leading to richer theory building across many subdisciplines of the communication field.

However, as communication scholars continue to use these more advanced techniques, it is necessary to consider the normative commitments and best practices of the field. Below, we highlight two considerations for communication and media scholars conducting big language data analysis: (1) language variation and (2) self-reflexivity.

Language variation

Importantly, there is substantial variability in how people use language: from the literal language itself to the situational contexts of their communication and the medium they are relying on to produce that communication. However, communication scholars have focused narrowly on a subset of this. For example, analyses of social media have relied heavily on Twitter data,Footnote79 making it difficult to generalize to other social media platforms. Additionally, communication theories are often Western-centric,Footnote80 likely because Western languages and contexts are over-represented in the scholarship. This problem is not limited to our field: There are many computational methods for conducting big language data analysis in English or Spanish, but substantially fewer tools to study “low-resource languages” such as Urdu or Tagalog.Footnote81

Finally, despite the richness of language as reflecting a person's psychological, social, and cultural context, studies of big language data have focused overwhelmingly on sentiment. Yet, time and again, sentiment analysis fails to recognize the nuance and context of language in big data sets. In our own research, for example, we have noted the term “trump” categorized as positively valenced when context demonstrates it is, in fact, the name of a former U.S. president. Sentiment analysis also fails to recognize complex language structures such as irony and satire. More problematically, we have seen language used by supremacists and extremists about marginalized people and communities categorized as positive sentiment when, in context, it should be classified as negative (for example, the term “antiracist” registers as having a positive sentiment, but many supremacist communities use this term to describe white nationalists as the only nonracists). These limitations of sentiment analysis further highlight the necessity of integrating much needed contextualizing human labor into computational analysis of big language data.
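
The sketch below illustrates why lexicon-based sentiment scoring is blind to these contexts; it assumes the vaderSentiment package, the example sentences are invented, and the exact scores will depend on the lexicon version installed.

```python
# A minimal sketch of context-blind, lexicon-based sentiment scoring, assuming
# the vaderSentiment package. Example sentences are invented; exact scores
# depend on the installed lexicon.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

examples = [
    "trump announced a new rally in the state",            # a proper noun, not an evaluation
    "oh great, another outage right before the deadline",  # sarcasm read literally
]
for sentence in examples:
    print(analyzer.polarity_scores(sentence), "|", sentence)

# Scores come from a fixed word-to-valence dictionary, so a term carries the same
# valence regardless of who uses it, about whom, or in what context.
print("lexicon entry for 'trump':", analyzer.lexicon.get("trump"))
```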

These three examples illustrate a larger problem with big language data analysis: Despite the availability of more and more varied language data, our field remains focused on only a narrow subset of this. It therefore behooves communication scholars to consider and prioritize research on understudied languages, platforms, and modalities.

Self-reflexivity

We concluded the introduction to this essay with a transparency statement that outlined our research backgrounds and agendas, considering our own positionality in working with big data and mixed methods. This statement is one example of the application of self-reflexivity for future work on big language data analysis. However, there are other important considerations with which to reckon, including open science efforts and the ethics of big data analysis.

The Center for Open Science defines open science as a situation “in which the process, content, and outcomes of research are openly accessible by default.”Footnote82 Building on this, the League of European Research Universities uses eight open science pillars, including (amongst others) research integrity and transparency, as well as the use of findable, accessible, interoperable and reusable (“FAIR”) data.Footnote83 Open science practices benefit the big language data community by encouraging resource sharing, reflexivity, and transparency. However, these efforts must also be balanced with the ethics of sharing, in particular, big language data from social media.

Relatedly, conducting big language data research requires ethical decision-making at each stage of research design, and the process of ethical decision-making, as we have alluded to above, is iterative. At a minimum, scholars should draft a thoughtful ethics statement in the planning stages of every mixed-methods project drawing on big language data.Footnote84 While a full review of ethical considerations is beyond the scope of this essay, we do want to provide a few resources for thinking through the ethics of your big language data study. The scholars who created these resources have convincingly laid out the necessity of such endeavors.

The Association of Internet Researchers (AoIR) Ethics Working Committee has published three reports covering ethical guidelines for internet research, addressing ethical norms and approaches, and listing questions scholars should think about when conducting online research. The main takeaway is that this research is complex and dynamic, and often involves ethical gray areas related to what constitutes human subjects, private versus public spaces, and data versus persons.Footnote85 For this reason, AoIR recommends an inductive, ongoing, and context-specific approach to ethics throughout the research process.

Conclusion

The availability of big language data provides new opportunities for social scientists to study how people communicate. However, scholars must think carefully about their approach to studying big language data. While computational methods hold great promise, especially in their ability to analyze language data at scale, these highly reductive processes also limit what can be said about human communication. Furthermore, as we have argued, these methods rely on an ontological approach that prizes objectivity, despite the varied and highly subjective interpretations of natural language and discourse.

With this in mind, we proposed a mixed-method approach to studying big language data and highlighted two key considerations for the future of this work. It is our hope that these insights will encourage communication scholars using big language data to account for critical perspectives in their research. Most notably, we advocate for the use of multiple methods to study big language data and note how the benefits of qualitative methods can overcome limitations of the computational methods, and vice versa. These mixed-method approaches also complement the highly interdisciplinary nature of this area of communication scholarship.

We conclude with a few suggestions for how scholars can consider interdisciplinary perspectives in their own scholarship. First, and perhaps most importantly, we encourage scholars to conduct and support mixed-method research using big language data. Exemplars of the symbiotic nature of this approach include research using network approaches with ethnographies;Footnote86 network approaches with qualitative textual analysis;Footnote87 and discourse analysis with traditional content analysis.Footnote88 Of note, this work tends to be highly collaborative, with scholars from different subfields—and often, different disciplines—working together and learning from one another.

For scholars who have not done this before, such an approach may seem daunting and/or time-consuming. However, we argue that qualitative scholars who are familiar with computational approaches are better able to critique this scholarship and more critically consider the role of technology in society. Additionally, quantitative and computational scholars who understand qualitative methods are able to identify the limits of computational tools as “objective” and gain a deeper understanding of the societal and cultural implications of their research.

Our second recommendation considers that not all studies of big language data are mixed-method and that there may be researchers seeking to incorporate these principles in their ongoing work. For this, we recommend the practice of self-reflexivity, such as the transparency statement we provide above. These statements highlight ethical considerations of using big language dataFootnote89 and can acknowledge how the researcher's perspective may impact their interpretation of the data.Footnote90 In our proposed process for big language data analysis, we also highlighted other ways in which computational researchers can include critical, reflexive components.

While many fields study big language data, communication is situated in a unique position as a highly interdisciplinary social science field, making it an ideal research community to pioneer the integration of computational approaches with more traditionally qualitative and critical considerations. By highlighting the ways in which qualitative and quantitative methods can be combined to effectively study big language data, it is our hope that other communication scholars will be inspired to incorporate multiple methods into their scholarship when applicable.

Acknowledgement

This work was inspired by a recent conference submission done in collaboration with Northeastern University Communication Media and Marginalization Lab network scientist Ryan Gallagher.

Notes

1 Kate Crawford, Mary L. Gray, and Kate Miltner, “Big Data| Critiquing Big Data: Politics, Ethics, Epistemology| Special Section Introduction,” International Journal of Communication 8 (2014): 10.

2 Emese Domahidi, JungHwan Yang, Julia Niemann-Lenz, and Leonard Reinecke, “Computational Communication Science| Outlining the Way Ahead in Computational Communication Science: An Introduction to the IJoC Special Section on ‘Computational Methods for Communication Science: Toward a Strategic Roadmap’,” International Journal of Communication 13 (2019): 9.

3 Buomsoo Kim, Jinsoo Park, and Jihae Suh, “Transparency and Accountability in AI Decision Support: Explaining and Visualizing Convolutional Neural Networks for Text Information,” Decision Support Systems 134 (2020): 113302; Hiroshi Kuwajima, Masayuki Tanaka, and Masatoshi Okutomi, “Improving Transparency of Deep Neural Inference Process,” Progress in Artificial Intelligence 8, no. 2 (2019): 273–85.

4 Klaus Krippendorff, Content Analysis: An Introduction to its Methodology, 4th ed. (SAGE, 2019): 280.

5 While some scholars from the positivist tradition believe quantitative methods to be objective and empirical, we approach the “problem of objectivity” from the understanding that nothing is ever truly objective. See, for example, Christian Fuchs, “From Digital Positivism and Administrative Big Data Analytics Towards Critical Digital and Social Media Research,” European Journal of Communication 32, no. 1 (2017): 37–49. We further posit that critical and positivist approaches can mutually inform one another, rather than be held in contrast (see Srividya Ramasubramanian and Omotayo O. Banjo, “Critical Media Effects Framework: Bridging Critical Cultural Communication and Media Effects Through Power, Intersectionality, Context, and Agency,” Journal of Communication 70, no. 3 (2020): 379–400).

6 Emma Uprichard, “Sampling: Bridging Probability and Non-Probability Designs,” International Journal of Social Research Methodology 16, no. 1 (2013): 1–11. https://doi.org/10.1080/13645579.2011.633391

7 Royce Singleton, and Bruce C. Straits, Approaches to Social Research, 6th ed. (Oxford, U.K.: Oxford University Press, 2017) https://global.oup.com/ushe/product/approaches-to-social-research-9780190614249?cc=us&lang=en&

8 For book-length treatments of technological and algorithmic bias, see, for example: Safiya Umoja Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (New York University Press, 2018); Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Broadway Books, 2016); Caroline Criado Perez, Invisible Women: Data Bias in a World Designed for Men (Abrams, 2019).

9 Humam Khalid Yaseen, and Ahmed Mahdi Obaid, “Big Data: Definition, Architecture & Applications,” JOIV: International Journal on Informatics Visualization 4, no. 1 (2020): 45–51.

10 Jiming Hu, and Yin Zhang, “Discovering the Interdisciplinary Nature of Big Data Research Through Social Network Analysis and Visualization,” Scientometrics 112, no. 1 (2017): 91–109. https://doi.org/10.1007/s11192-017-2383-1; Daphne R. Raban, and Avishag Gordon, “The Evolution of Data Science and Big Data Research: A Bibliometric Analysis,” Scientometrics 122, no. 3 (2020): 1563–81. https://doi.org/10.1007/s11192-020-03371-2

11 Christian Fuchs, “From Digital Positivism and Administrative Big Data Analytics Towards Critical Digital and Social Media Research!,” European Journal of Communication 32, no. 1 (2017): 37–49.

12 Ossi Ylijoki, and Jari Porras, “Perspectives to Definition of Big Data: A Mapping Study and Discussion,” Journal of Innovation Management 4, no. 1 (2016): 69–91. https://doi.org/10.24840/2183-0606_004.001_0006

13 See, for example, Monerah Al-Mekhlal, and Amir Ali Khwaja, “A Synthesis of Big Data Definition and Characteristics,” In 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), pp. 314–22. (IEEE, 2019). https://doi.org/10.1109/CSE/EUC.2019.00067; José María Cavanillas, Edward Curry, and Wolfgang Wahlster, New Horizons for a Data-Driven Economy: A Roadmap For Usage and Exploitation of Big Data in Europe (Springer Nature, 2016).

14 See, for example, Al-Mekhlal and Khwaja, Synthesis; Wo L. Chang, and Nancy Grady, “NIST Big Data Interoperability Framework: Volume 1, Big Data Definitions,” (2015); Andrea De Mauro, Marco Greco, and Michele Grimaldi, “A Formal Definition of Big Data based on its Essential Features,” Library Review (2016). https://doi.org/10.1108/LR-06-2015-0061; Jonathan Stuart Ward, and Adam Barker, “Undefined by Data: A Survey of Big Data Definitions,” arXiv preprint arXiv:1309.5821 (2013). http://arxiv.org/abs/1309.5821; Yaseen and Obaid, Big Data.

15 Ylijoki and Porras, Perspectives, 79.

16 De Mauro, Greco, and Grimaldi, Formal Definition.

17 Yaseen and Obaid, Big Data, 46.

18 Maddalena Favaretto, Eva De Clercq, Christophe Olivier Schneble, and Bernice Simone Elger, “What is Your Definition of Big Data? Researchers’ Understanding of the Phenomenon of the Decade,” PloS One 15, no. 2 (2020): e0228987. https://doi.org/10.1371/journal.pone.0228987

19 Cavanillas, Curry, and Wahlster, New Horizons.

20 Ward and Barker, Undefined.

21 Dagmar M. Schuller, and Björn W. Schuller, “A Review on Five Recent and Near-Future Developments in Computational Processing of Emotion in the Human Voice,” Emotion Review 13, no. 1 (2021): 44–50. https://doi.org/10.1177/1754073919898526

22 James Paul Gee, An Introduction to Discourse Analysis: Theory and Method (Routledge, 2004). https://www.routledge.com/An-Introduction-to-Discourse-Analysis-Theory-and-Method/Gee/p/book/9780415725569

23 To paraphrase the validity concerns, existing dictionaries cannot be applied successfully to all cases, while building a dictionary necessarily relies on researcher assumptions without implementing the approaches for which we advocate in this article.

24 Michael X. Delli Carpini, “Breaking Boundaries: Can We Bridge the Quantitative Versus Qualitative Divide Through the Study of Entertainment and Politics?,” International Journal of Communication 7 (2013): 21.

25 Delli Carpini, Breaking Boundaries; Catherine d’Ignazio, and Lauren F. Klein, “Feminist Data Visualization,” Workshop on Visualization for the Digital Humanities (VIS4DH) (Baltimore: IEEE, 2016). https://www.semanticscholar.org/paper/Feminist-Data-Visualization-D%27Ignazio-Klein/2e3e2eb1bdc1cab5b0fab515266bb8849d416f33; Alexis Lothian, and Amanda Phillips, “Can Digital Humanities Mean Transformative Critique?,” Journal of E-Media Studies 3, no. 1 (2013): 1–25. https://doi.org/10.1349/PS1.1938-6060.A.425; Yotam Ophir, Dror Walter, and Eleanor R. Marchant, “A Collaborative Way of Knowing: Bridging Computational Communication Research and Grounded Theory Ethnography,” Journal of Communication 70, no. 3 (2020): 447–72. https://doi.org/10.1093/joc/jqaa013

26 Ophir, Walter, and Marchant, Collaborative Knowing, emphasis in original, 448

27 Ibid.

28 Web scraping is the use of automated tools to collect content from a website or forum. Topic modeling, sometimes called the “bag of words” approach, is a type of unsupervised machine learning that uncovers the thematic structure of texts in a dataset. Network analysis explores the structural relationships of knowledge through “shared meaning and symbols” (Marya L. Doerfel, and George A. Barnett, “A Semantic Network Analysis of the International Communication Association,” Human Communication Research 25, no. 4 (1999): 589–603, 589) by applying statistical probabilities and extracting the relationships between objects in a text (Wouter van Atteveldt, “Semantic Network Analysis,” Techniques for Extracting, Representing, and Querying Media Content (2008)).

29 Here, we do not imply that any research is ever truly “objective,” since we always bring our ways of being in the world to bear on research design decisions. Nor do we intend to imply that qualitative scholarship is not empirical.

30 Thomas R. Lindlof, and Bryan C. Taylor, Qualitative Communication Research Methods (Sage Publications, 2002), 18.

31 Matthew Kirschenbaum, “What is ‘Digital Humanities,’ and Why Are They Saying Such Terrible Things About It?,” Differences 25, no. 1 (2014): 46–63.

32 Lothian and Phillips, Transformative Critique; Roopika Risam, “Beyond the Margins: Intersectionality and the Digital Humanities,” Digital Humanities Quarterly 9, no. 2 (2015).

33 d’Ignazio and Klein, Feminist Data

34 Ibid.

35 Ibid., 2.

36 Ibid., 2.

37 Ibid., 3.

38 Ibid., 3.

39 Ibid., 3.

40 Jessica Enoch, and Jean Bessette, “Meaningful Engagements: Feminist Historiography and the Digital Humanities,” College Composition and Communication (2013): 634–60.

41 d’Ignazio and Klein, Feminist Data, 3.

42 Michelle Rodino, “Breaking Out of Binaries: Reconceptualizing Gender and Its Relationship to Language in Computer-Mediated Communication,” Journal of Computer-Mediated Communication 3, no. 3 (1997): JCMC333.

43 For an example written by scholars from adjacent fields (e.g., political science and political sociology) to political communication that inform the same, see Daniel J. Levine, and David M. McCourt, “Why Does Pluralism Matter When We Study Politics? A View From Contemporary International Relations,” Perspectives on Politics 16, no. 1 (2018): 92–109.

44 The mission statement of the ACSJ, in fact, embraces several of these principles, stating, “The Activism, Communication, and Social Justice (ACSJ) Interest Group promotes research and teaching in the intersections of three key aspects of contemporary life as captured in its name. It strives for diversity in the representation of its membership and embraces pluralism and boldness in theory and methodology. It pushes the boundaries between theory and practice and between scholarship and activism by encouraging and facilitating dialogues and engagements” (“Interest Groups: Activism, Communication and Social Justice,” International Communication Association, accessed August 13, 2022, https://www.icahdq.org/group/activism).

45 See, for example, Loyd S. Pettegrew, “The Importance of Context in Applied Communication Research,” Southern Speech Communication Journal 53, no. 4 (1988): 331–38.

46 This article has a breakdown of research approaches to knowledge production: Marton Demeter, and Manuel Goyanes, “A World-Systemic Analysis of Knowledge Production in International Communication and Media Studies: The Epistemic Hierarchy of Research Approaches,” The Journal of International Communication 27, no. 1 (2021): 38–58.

47 Philip N. Howard, “Network Ethnography and the Hypermedia Organization: New Media, New Organizations, New Methods,” New Media & Society 4, no. 4 (2002): 550–74. https://doi.org/10.1177/146144402321466813

48 Ibid., 550.

49 Kirschenbaum, Digital Humanities.

50 Klaus Krippendorff, “The Changing Landscape of Content Analysis: Reflections on Social Construction of Reality and Beyond,” Communication & Society 47 (2019): 1.

51 Joseph N. Cappella, “Vectors into the Future of Mass and Interpersonal Communication Research: Big Data, Social Media, and Computational Social Science,” Human Communication Research 43, no. 4 (2017): 545–58. https://doi.org/10.1111/hcre.12114

52 Marco Guerini, Carlo Strapparava, and Oliviero Stock, “Corps: A Corpus of Tagged Political Speeches For Persuasive Communication Processing,” Journal of Information Technology & Politics 5, no. 1 (2008): 19–32. https://doi.org/10.1080/19331680802149616

53 Fatemeh Torabi Asr, Mohammad Mazraeh, Alexandre Lopes, Vasundhara Gautam, Junette Gonzales, Prashanth Rao, and Maite Taboada, “The Gender Gap Tracker: Using Natural Language Processing to Measure Gender Bias in Media,” PloS One 16, no. 1 (2021): e0245533. https://doi.org/10.1371/journal.pone.0245533

54 Jennifer A. Manganello, Vani R. Henderson, Amy Jordan, Nicole Trentacoste, Suzanne Martin, Michael Hennessy, and Martin Fishbein, “Adolescent Judgment of Sexual Content on Television: Implications for Future Content Analysis Research,” Journal of Sex Research 47, no. 4 (2010): 364–73. https://doi.org/10.1080/00224490903015868

55 Edward C. Malthouse, and Hairong Li, “Opportunities for and Pitfalls of Using Big Data in Advertising Research,” Journal of Advertising 46, no. 2 (2017): 227–35. https://doi.org/10.1080/00913367.2017.1299653

56 Krippendorff, Changing Landscape.

57 R and Python are free, open-source programming languages commonly used in computational methods like machine learning. MaxQDA and Nvivo are paid data-analysis software packages commonly used by qualitative and mixed-method researchers.

58 Mirian Oliveira, Claudia Bitencourt, Eduardo Teixeira, and Ana Clarissa Santos, “Thematic Content Analysis: Is there a Difference Between the Support Provided by the MAXQDA® and NVivo® Software Packages,” Revista de Administração Da UFSM 9, no. 1 (2016): 72–82. https://doi.org/10.5902/1983465911213

59 QDA Miner is a paid tool used for qualitative data analysis. Wordstat, developed by the same company, is used for content analysis and text mining. It has both R and Python integrations.

60 See, for example, Lindlof & Taylor, Qualitative Communication Methods.

61 Damian Trilling, and Jeroen GF Jonkman, “Scaling up Content Analysis,” Communication Methods and Measures 12, no. 2–3 (2018): 158–74. https://doi.org/10.1080/19312458.2018.1447655

62 Andrea Ceron, Luigi Curini, and Stefano M. Iacus, “Using Sentiment Analysis to Monitor Electoral Campaigns: Method Matters—Evidence from the United States and Italy,” Social Science Computer Review 33, no. 1 (2015): 3–20. https://doi.org/10.1177/0894439314521983

63 Daniel Maier, Annie Waldherr, Peter Miltner, Gregor Wiedemann, Andreas Niekler, Alexa Keinert, Barbara Pfetsch et al, “Applying LDA Topic Modeling in Communication Research: Toward A Valid and Reliable Methodology,” Communication Methods and Measures 12, no. 2–3 (2018): 93–118. https://doi.org/10.1080/19312458.2018.1430754; Maria Y. Rodriguez, and Heather Storer, “A Computational Social Science Perspective on Qualitative Data Exploration: Using Topic Models for the Descriptive Analysis of Social Media Data,” Journal of Technology in Human Services 38, no. 1 (2020): 54–86. https://doi.org/10.1080/15228835.2019.1616350

64 Justin Grimmer, and Brandon M. Stewart, “Text As Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts,” Political Analysis 21, no. 3 (2013): 267–97. https://doi.org/10.1093/pan/mps028

65 Christian Baden, Christian Pipal, Martijn Schoonvelde, and Mariken AC G. van der Velden, “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda,” Communication Methods and Measures 16, no. 1 (2022): 1–18. https://doi.org/10.1080/19312458.2021.2015574

66 HITL approaches are generally models that require some type of human interaction. Here, we do not mean to imply researchers should coerce the data to fit a priori assumption but are, instead, using HITL as a metaphor for the cybortic approach we propose.

67 d’Ignazio and Klein, Feminist Data

68 David R. Thomas, “A General Inductive Approach for Qualitative Data Analysis,” The American Journal of Evaluation 27, no. 2 (2003).

69 Dror Walter, and Yotam Ophir, “News Frame Analysis: An Inductive Mixed-Method Computational Approach,” Communication Methods and Measures 13, no. 4 (2019): 248–66. https://doi.org/10.1080/19312458.2019.1639145

70 d’Ignazio and Klein, Feminist Data

71 Melanie Birks, Ysanne Chapman, and Karen Francis, “Memoing in Qualitative Research: Probing Data and Processes,” Journal of Research in Nursing 13, no. 1 (2008): 68–75. https://doi.org/10.1177/1744987107081254

72 d’Ignazio and Klein, Feminist Data, 2.

73 Kimberly A. Neuendorf, The Content Analysis Guidebook (Sage, 2017). https://doi.org/10.4135/9781071802878

74 Ward van Zoonen, and Toni G. L. A. van der Meer, “Social Media Research: The Application of Supervised Machine Learning in Organizational Communication Research,” Computers in Human Behavior 63 (2016): 132–41. https://doi.org/10.1016/j.chb.2016.05.028

75 Björn Burscher, Daan Odijk, Rens Vliegenthart, Maarten De Rijke, and Claes H. De Vreese, “Teaching the Computer to Code Frames in News: Comparing Two Supervised Machine Learning Approaches to Frame Analysis,” Communication Methods and Measures 8, no. 3 (2014): 190–206. https://doi.org/10.1080/19312458.2014.937527

76 Claire Lauer, Eva Brumberger, and Aaron Beveridge, “Hand Collecting and Coding Versus Data-Driven Methods in Technical and Professional Communication Research,” IEEE Transactions on Professional Communication 61, no. 4 (2018): 389–408. https://doi.org/10.1109/TPC.2018.2870632

77 Norman L. Fairclough, “Critical and Descriptive Goals in Discourse Analysis,” Journal of Pragmatics 9, no. 6 (1985): 739–63. https://doi.org/10.1016/0378-2166(85)90002-5

78 d’Ignazio and Klein, Feminist Data.

79 Ariadna Matamoros-Fernández, and Johan Farkas, “Racism, Hate Speech, and Social Media: A Systematic Review and Critique,” Television & New Media 22, no. 2 (2021): 205–24. https://doi.org/10.1177/1527476420982230

80 Martin Emmer, and Marlene Kunst, “‘Digital Citizenship’ Revisited: The Impact of ICTs on Citizens’ Political Communication Beyond the Western State,” International Journal of Communication 12 (2018): 21.

81 Alexandre Magueresse, Vincent Carles, and Evan Heetderks, “Low-Resource Languages: A Review of Past Work and Future Challenges,” arXiv preprint arXiv:2006.07264 (2020). http://arxiv.org/abs/2006.07264

82 “Our Mission,” Center for Open Science, accessed August 13, 2022, https://www.cos.io/about/mission

83 “Open science and its role in universities: A roadmap for cultural change,” League of European Research Universities, 2018, accessed August 13, 2022, https://www.leru.org/publications/open-science-and-its-role-in-universities-a-roadmap-for-cultural-change

84 Ethics statements answer difficult and often complex questions about collection, storage, and dissemination of data used in big language data research. The AoIR set of ethical guidelines we mention details such questions, drawing on work by Annette Markham, Aline Shakti Franzke, and others.

85 Annette Markham, and Elizabeth Buchanan, “Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0),” accessed August 13, 2022, https://aoir.org/reports/ethics2.pdf

86 Ophir, Walter, and Marchant, Collaborative Knowing.

87 Deen Freelon, and David Karpf, “Of Big Birds and Bayonets: Hybrid Twitter Interactivity in the 2012 Presidential Debates,” Information, Communication & Society 18, no. 4 (2015): 390–406. https://doi.org/10.1080/1369118X.2014.952659

88 Yiping Xia, Josephine Lukito, Yini Zhang, Chris Wells, Sang Jung Kim, and Chau Tong, “Disinformation, Performed: Self-Presentation of a Russian IRA Account On Twitter,” Information, Communication & Society 22, no. 11 (2019): 1646–64. https://doi.org/10.1080/1369118X.2019.1621921

89 See, e.g., Stevie Chancellor, Eric PS Baumer, and Munmun De Choudhury, “Who Is the ‘Human’ in Human-Centered Machine Learning: The Case of Predicting Mental Health from Social Media,” Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1–32. https://doi.org/10.1145/3359249

90 For more on the role of the researcher, see Elizabeth Halpern, and Ligia Costa Leite, “The Role of the Researcher when using the Socio-Anthropological Method to Understand the Phenomenon of Alcoholism,” Open Journal of Social Sciences 3, no. 05 (2015): 76. https://doi.org/10.4236/jss.2015.35011