Digital Journalism, Volume 11, 2023, Issue 2: Analytical Advances Through Open Science: Employing a Reference Dataset to Foster Best-Practice Data Validation, Analysis, and Reporting

Open access

Noise Pollution: A Multi-Step Approach to Assessing the Consequences of (Not) Validating Search Terms on Automated Content Analyses

Daniela Mahl, Department of Communication and Media Research, University of Zurich, Zurich, Switzerland. Correspondence: [email protected]

https://orcid.org/0000-0002-5330-6885

Gerret von Nordheim, Amsterdam School of Communication Research, University of Amsterdam, Amsterdam, The Netherlands

https://orcid.org/0000-0001-7553-3838

Lars Guenther, Institute for Journalism and Communication Studies, University of Hamburg, Hamburg, Germany; Centre for Research on Evaluation, Science and Technology, Stellenbosch University, Stellenbosch, South Africa

https://orcid.org/0000-0001-7760-0416

Pages 298-320 | Published online: 23 Sep 2022

https://doi.org/10.1080/21670811.2022.2114920

References

Baden, Christian, Christian Pipal, Martijn Schoonvelde, and Mariken A. C. G. van der Velden. 2022. “Three Gaps in Computational Text Analysis Methods for Social Sciences: A Research Agenda.” Communication Methods and Measures 16 (1): 1–18. https://doi.org/10.1080/19312458.2021.2015574.
Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. 2021. “Automated Text Classification of News Articles: A Practical Guide.” Political Analysis 29 (1): 19–42. https://doi.org/10.1017/pan.2020.8.
Benoit, K. 2020. “Text as Data: An Overview.” In The SAGE Handbook of Research Methods in Political Science and International Relations, edited by Luigi Curini, Robert J. Franzese, and James F. Adams, 461–497. Los Angeles, London, New Delhi, Singapore, Washington DC, Melbourne: SAGE reference.
Benoit, K., Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.
Blatchford, Annie. 2020. “Searching for Online News Content: The Challenges and Decisions.” Communication Research and Practice 6 (2): 143–156. https://doi.org/10.1080/22041451.2019.1676864.
Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.
Bohr, Jeremiah. 2020. “Reporting on Climate Change: A Computational Analysis of U.S. Newspapers and Sources of Bias, 1997–2017.” Global Environmental Change 61: 102038. https://doi.org/10.1016/j.gloenvcha.2020.102038.
Boumans, Jelle W., and Damian Trilling. 2016. “Taking Stock of the Toolkit: An Overview of Relevant Automated Content Analysis Approaches and Techniques for Digital Journalism Scholars.” Digital Journalism 4 (1): 8–23. https://doi.org/10.1080/21670811.2015.1096598.
Boyd, Danah, and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15 (5): 662–679. https://doi.org/10.1080/1369118X.2012.678878.
Brüggemann, M., and R. Sadikni. 2020. “Online Media Monitor on Climate Change (OMM): Analysis of Global Tweets and Online Media Coverage.” https://icdc.cen.uni-hamburg.de/en/omm-mediaanalysis.html.
Campbell, R. Sherlock, and James W. Pennebaker. 2003. “The Secret Life of Pronouns: Flexibility in Writing Style and Physical Health.” Psychological Science 14 (1): 60–65. https://doi.org/10.1111/1467-9280.01419.
Damerau, Fred J. 1993. “Generating and Evaluating Domain-Oriented Multi-Word Terms from Texts.” Information Processing & Management 29 (4): 433–447. https://doi.org/10.1016/0306-4573(93)90039-G.
de Vries, Erik, Martijn Schoonvelde, and Gijs Schumacher. 2018. “No Longer Lost in Translation: Evidence That Google Translate Works for Comparative Bag-of-Words Text Applications.” Political Analysis 26 (4): 417–430. https://doi.org/10.1017/pan.2018.26.
Denny, Matthew J., and Arthur Spirling. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It.” Political Analysis 26 (2): 168–189. https://doi.org/10.1017/pan.2017.44.
DiMaggio, Paul. 2015. “Adapting Computational Text Analysis to Social Science (And Vice Versa).” Big Data & Society 2 (2): 2053951715602908. https://doi.org/10.1177/2053951715602908.
Findley, Michael G., Kyosuke Kikuta, and Michael Denly. 2021. “External Validity.” Annual Review of Political Science 24 (1): 365–393. https://doi.org/10.1146/annurev-polisci-041719-102556.
Gupta, Shivani, and Atul Gupta. 2019. “Dealing with Noise Problem in Machine Learning Data-Sets: A Systematic Review.” Procedia Computer Science 161: 466–474. https://doi.org/10.1016/j.procs.2019.11.146.
Hase, Valerie, Daniela Mahl, and Mike S. Schäfer. 2022. “Der „Computational Turn“: Ein „Interdisziplinärer Turn“? Ein Systematischer Überblick Zur Nutzung Der Automatisierten Inhaltsanalyse in Der Journalismusforschung.” Medien & Kommunikationswissenschaft 70 (1–2): 60–78. https://doi.org/10.5771/1615-634X-2022-1-2-60.
Hase, Valerie, Daniela Mahl, Mike S. Schäfer, and Tobias R. Keller. 2021. “Climate Change in News Media Across the Globe: An Automated Analysis of Issue Attention and Themes in Climate Change Coverage in 10 Countries (2006–2018).” Global Environmental Change 70: 102353. https://doi.org/10.1016/j.gloenvcha.2021.102353.
Jacobi, Carina, Katharina Kleinen-von Königslöw, and Nel Ruigrok. 2016. “Political News in Online and Print Newspapers.” Digital Journalism 4 (6): 723–742. https://doi.org/10.1080/21670811.2015.1087810.
Jang, S. Mo, and P. Sol Hart. 2015. “Polarized Frames on ‘Climate Change’ and ‘Global Warming’ Across Countries and States: Evidence from Twitter Big Data.” Global Environmental Change 32: 11–17. https://doi.org/10.1016/j.gloenvcha.2015.02.010.
Jünger, Jakob, and Chantal Gärtner. 2021. “Distilling Issue Cycles from Large Databases: A Time-Series Analysis of Terrorism and Media in Africa.” Social Science Computer Review 39 (6): 1272–1291. https://doi.org/10.1177/0894439320979675.
Jünger, Jakob, S. Geise, and M. Hänelt. 2022. “Unboxing Computational Social Media Research: From a Datahermeneutical Perspective: How Do Scholars Address the Tension Between Automation and Interpretation?” International Journal of Communication 16: 1482–1505.
Karlsson, Michael, and Helle Sjøvaag. 2016. “Introduction: Research Methods in an Age of Digital Journalism.” Digital Journalism 4 (1): 1–7. https://doi.org/10.1080/21670811.2015.1096595.
King, Gary, Patrick Lam, and Margaret E. Roberts. 2017. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” American Journal of Political Science 61 (4): 971–988. https://doi.org/10.1111/ajps.12291.
Kirilenko, Andrei P., and Svetlana O. Stepchenkova. 2014. “Public Microblogging on Climate Change: One Year of Twitter Worldwide.” Global Environmental Change 26: 171–182. https://doi.org/10.1016/j.gloenvcha.2014.02.008.
Koppers, Lars, Jonas Rieger, Karin Boczek, and Gerret von Nordheim. 2020. Tosca: Tools for Statistical Content Analysis. https://CRAN.R-project.org/package=tosca.
Lacy, Stephen, Brendan R. Watson, Daniel Riffe, and Jennette Lovejoy. 2015. “Issues and Best Practices in Content Analysis.” Journalism & Mass Communication Quarterly 92 (4): 791–811. https://doi.org/10.1177/1077699015607338.
Laugwitz, Laura. 2021. “Qualitätskriterien Für Die Automatische Inhaltsanalyse. Zur Integration Von Verfahren Des Maschinellen Lernens in Die Kommunikationswissenschaft.” https://osf.io/preprints/socarxiv/gt28f/.
Lovejoy, Jennette, Brendan R. Watson, Stephen Lacy, and Daniel Riffe. 2014. “Assessing the Reporting of Reliability in Published Content Analyses: 1985–2010.” Communication Methods and Measures 8 (3): 207–221. https://doi.org/10.1080/19312458.2014.937528.
Lucas, Christopher, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley. 2015. “Computer-Assisted Text Analysis for Comparative Politics.” Political Analysis 23 (2): 254–277. https://doi.org/10.1093/pan/mpu019.
Maier, Daniel, A. Niekler, Gregor Wiedemann, and Daniela Stoltenberg. 2020. “How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models.” Computational Communication Research 2 (2): 139–152. https://doi.org/10.5117/CCR2020.2.001.MAIE.
Maier, Daniel, A. Waldherr, P. Miltner, G. Wiedemann, A. Niekler, A. Keinert, B. Pfetsch, et al. 2018. “Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology.” Communication Methods and Measures 12 (2–3): 93–118. https://doi.org/10.1080/19312458.2018.1430754.
Manning, Christopher D., and Hinrich Schütze. 2002. Foundations of Statistical Natural Language Processing. 5th ed. Cambridge, MA: MIT Press.
Nerlich, Brigitte, and Nelya Koteyko. 2009. “Compounds, Creativity and Complexity in Climate Change Communication: The Case of ‘Carbon Indulgences.’” Global Environmental Change 19 (3): 345–353. https://doi.org/10.1016/j.gloenvcha.2009.03.001.
Niekler, A., and P. Jähnichen. 2012. “Matching Results of Latent Dirichlet Allocation for Text.” http://asv.informatik.uni-leipzig.de/publication/file/210/nieklerjaehnichenICCM2012.pdf.
Nikita, M. 2020. Tuning of the Latent Dirichlet Allocation Models Parameters: Comprehensive R Archive Network (CRAN). https://cran.r-project.org/web/packages/ldatuning/index.html.
Pipal, Christian, Hyunjin Song, and Hajo G. Boomgaarden. 2022. “If You Have Choices, Why Not Choose (And Share) All of Them? A Multiverse Approach to Understanding News Engagement on Social Media.” Digital Journalism: 1–21. https://doi.org/10.1080/21670811.2022.2036623.
Reber, U. 2019. “Overcoming Language Barriers: Assessing the Potential of Machine Translation and Topic Modeling for the Comparative Analysis of Multilingual Text Corpora.” Communication Methods and Measures 13 (2): 102–125. https://doi.org/10.1080/19312458.2018.1555798.
Rieger, Jonas, Jörg Rahnenführer, and Carsten Jentsch. 2020. “Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype.” In Natural Language Processing and Information Systems, Vol. 12089, edited by Elisabeth Métais, Farid Meziane, Helmut Horacek, and Philipp Cimiano, 118–125. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-51310-8_11.
Rieger, Jonas. 2020. “LdaPrototype: A Method in R to Get a Prototype of Multiple Latent Dirichlet Allocations.” Journal of Open Source Software 5 (51): 2181. https://doi.org/10.21105/joss.02181.
Rinke, Eike Mark, Timo Dobbrick, Charlotte Löb, Cäcilia Zirn, and Hartmut Wessler. 2022. “Expert-Informed Topic Models for Document Set Discovery.” Communication Methods and Measures 16 (1): 39. https://doi.org/10.1080/19312458.2021.1920008.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2016. “Navigating the Local Modes of Big Data: The Case of Topic Models.” In Computational Social Science: Discovery and Prediction, edited by R. M. Alvarez, 51–97. Analytical methods for social research. Cambridge: Cambridge University Press.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2019. “Stm: An R Package for Structural Topic Models.” Journal of Statistical Software 91 (2): 1–40. https://doi.org/10.18637/jss.v091.i02.
Scharkow, Michael. 2012. “Automatisierte Inhaltsanalyse Und Maschinelles Lernen.” Dissertation at Berlin University of the Arts. epubli. https://opus4.kobv.de/opus4-udk/frontdoor/deliver/index/docId/28/file/dissertation_scharkow_final_udk.pdf.
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.
Sobel, M., and Daniel Riffe. 2015. “U.S. Linkages in New York Times Coverage of Nigeria, Ethiopia and Botswana (2004-13): Economic and Strategic Bases for News.” International Communication Research Journal 50 (1): 3–22.
Song, Hyunjin, Petro Tolochko, Jakob-Moritz Eberl, Olga Eisele, Esther Greussing, Tobias Heidenreich, Fabienne Lind, Sebastian Galyga, and Hajo G. Boomgaarden. 2020. “In Validations We Trust? The Impact of Imperfect Human Annotations as a Gold Standard on the Quality of Validation of Automated Content Analysis.” Political Communication 37 (4): 550–572. https://doi.org/10.1080/10584609.2020.1723752.
Stray, Jonathan. 2019. “Making Artificial Intelligence Work for Investigative Journalism.” Digital Journalism 7 (8): 1076–1097. https://doi.org/10.1080/21670811.2019.1630289.
Stryker, Jo Ellen, Ricardo J. Wray, Robert C. Hornik, and Itzik Yanovitzky. 2006. “Validation of Database Search Terms for Content Analysis: The Case of Cancer News Coverage.” Journalism & Mass Communication Quarterly 83 (2): 413–430. https://doi.org/10.1177/107769900608300212.
Vogelgesang, Jens, and Michael Scharkow. 2012. “Reliabilitätstests in Inhaltsanalysen.” Publizistik 57 (3): 333–345. https://doi.org/10.1007/s11616-012-0154-9.
Walter, Stefanie. 2019. “Better Off Without You? How the British Media Portrayed EU Citizens in Brexit News.” The International Journal of Press/Politics 24 (2): 210–232. https://doi.org/10.1177/1940161218821509.
Wilkerson, John, and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annual Review of Political Science 20 (1): 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542.
Zhu, Xingquan, and Xindong Wu. 2004. “Class Noise Vs. Attribute Noise: A Quantitative Study.” Artificial Intelligence Review 22 (3): 177–210. https://doi.org/10.1007/s10462-004-0751-8.
Web of Science ®Google Scholar

Related research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.

People also read
Recommended articles
Cited by

To cite this article:

Reference style: APA Chicago Harvard

Citation copied to clipboard

Reference styles above use APA (6th edition), Chicago (16th edition) & Harvard (10th edition)

Download citation

Download a citation file in RIS format that can be imported by citation management software including EndNote, ProCite, RefWorks and Reference Manager.

Choose format: RIS BibTex RefWorks Direct Export

Choose options: Citation Citation & abstract Citation & references

Noise Pollution: A Multi-Step Approach to Assessing the Consequences of (Not) Validating Search Terms on Automated Content Analyses

References

Information for

Open access

Opportunities

Help and information

Your download is now in progress and you may close this window

Login or register to access this feature

Noise Pollution: A Multi-Step Approach to Assessing the Consequences of (Not) Validating Search Terms on Automated Content Analyses

References

Related research

To cite this article:

Download citation

Information for

Open access

Opportunities

Help and information

Keep up to date