ABSTRACT
Supervised machine learning (SML) provides us with tools to efficiently scrutinize large corpora of communication texts. Yet, setting up such a tool involves plenty of decisions, starting with the data needed for training, the selection of an algorithm, and the details of model training. We aim to establish a firm link between communication research tasks and the corresponding state of the art in natural language processing research by systematically comparing the performance of different automatic text analysis approaches. We do this for a challenging task – stance detection of opinions on policy measures to tackle the COVID-19 pandemic in Germany, voiced on Twitter. Our results add evidence that pre-trained language models such as BERT outperform feature-based and other neural network approaches. Yet, the gains one can achieve differ greatly depending on the specifics of pre-training (i.e., the use of different language models). Adding to the robustness of our conclusions, we run a generalizability check with a different use case in terms of language and topic. Additionally, we illustrate how the amount and quality of training data affect model performance, pointing to potential compensation effects. Based on our results, we derive practical recommendations for setting up such SML tools to study communication texts.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
The data supporting this research are not publicly available due to ethical restrictions: Twitter's terms of use do not allow publicly sharing original Tweets, nor did Tweet authors explicitly agree to provide their Tweets for research and to share them publicly outside of Twitter. In the corresponding paper by Beck et al. (Citation2021), we only release the set of identifiers (Tweet IDs) for the texts used in this research project. Thereby, we adhere to the Twitter Developer Policy and give users full control of their privacy and data, as they can delete or privatize Tweets so that they cannot be collected. The code for the model setup in the paper at hand is publicly shared in the GitHub repository: https://github.com/UKPLab/cmm2022-stance-covid19
Notes
1 We use the terms to annotate and to code synonymously, since both refer to the procedure of assigning certain labels to text.
2 Other examples of features are metadata such as text length or features based on sentence structure.
3 This context window can, for example, capture the four or five words surrounding the one of interest. The size of this window is set when training the word embeddings.
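As a minimal sketch of the idea described in this note, the following hypothetical function (the function name, window size, and example sentence are our illustrative assumptions, not part of any specific embedding implementation) collects the words within a fixed-size window around a target word:

```python
def context_window(tokens, index, size=2):
    """Return the words within `size` positions to the left and right
    of the token at `index` (the word of interest)."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

sentence = "the government announced new pandemic measures today".split()
# Context of "new" (index 3) with two words on each side:
print(context_window(sentence, 3))
# → ['government', 'announced', 'pandemic', 'measures']
```

In actual word-embedding training (e.g., word2vec), such window co-occurrences supply the word–context pairs the model learns from.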
5 To take both the left and the right context of a word into account, the model is trained from left to right and from right to left.
6 To overcome technical challenges associated with bidirectional training using the transformer architecture, Devlin et al. (Citation2019) instead propose to randomly mask a certain percentage of the input tokens and then predict those masked tokens.
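The masking step this note describes can be sketched as follows. This is an illustrative simplification under our own assumptions (function name, masking probability, and the `[MASK]` placeholder are chosen for illustration), not Devlin et al.'s actual implementation, which masks subword tokens and uses additional replacement strategies:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with a mask symbol.
    Returns the masked sequence and a mapping from masked positions
    to the original tokens the model must learn to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "we must tackle the pandemic together".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5)
print(masked, targets)
```

Because the model sees the full (partially masked) sequence at once, predicting each masked token can draw on context from both sides, which is the point of this training objective.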
7 The Hugging Face library provides plenty of pre-trained language models for very different areas of application, https://huggingface.co/.
8 We provide the full list of filter terms in Appendix A.
Additional information
Funding
Notes on contributors
Christina Viehmann
Dr. Christina Viehmann ([email protected]) is a postdoctoral researcher at the Department of Communication at the University of Mainz, Germany.
Tilman Beck
Tilman Beck ([email protected]) is a doctoral candidate at the Ubiquitous Knowledge Processing (UKP) Lab as part of the Computer Science Department at the Technical University of Darmstadt.
Marcus Maurer
Prof. Dr. Marcus Maurer ([email protected]) is a full professor at the Department of Communication at the University of Mainz.
Oliver Quiring
Prof. Dr. Oliver Quiring ([email protected]) is a full professor at the Department of Communication at the University of Mainz.
Iryna Gurevych
Prof. Dr. Iryna Gurevych ([email protected]) is a full professor at the Ubiquitous Knowledge Processing (UKP) Lab as part of the Computer Science Department at the Technical University of Darmstadt.