Accountability in Research
Ethics, Integrity and Policy
Volume 28, 2021 - Issue 6

Text recycling in STEM: A text-analytic study of recently published research articles

Pages 349-371 | Published online: 24 Nov 2020
 

ABSTRACT

Text recycling, sometimes called “self-plagiarism,” is the reuse of material from one’s own existing documents in a newly created work. Over the past decade, text recycling has become an increasingly debated practice in research ethics, especially in science and technology fields. Little is known, however, about researchers’ actual text recycling practices. We report here on a computational analysis of text recycling in published research articles in STEM disciplines. Using a tool we created in R, we analyze a corpus of 400 published articles from 80 federally funded research projects across eight disciplinary clusters. According to our analysis, STEM research groups frequently recycle some material from their previously published articles. On average, papers in our corpus contained about three recycled sentences per article, though a minority of research teams (around 15%) recycled substantially more content. These findings were generally consistent across STEM disciplines. We also find evidence that researchers superficially alter recycled prose much more often than they recycle it verbatim. Our findings suggest that recycling some amount of material is normative in STEM research writing. Researchers and editors would therefore benefit from more appropriate and explicit guidance about what constitutes legitimate practice and how authors should report the presence of recycled material.

Acknowledgments

We thank our colleague Chris Anson for his valuable conversations about this work as it developed and undergraduate students Dennis Nguyen, Juliana Hoover, and Evelyn Scarrow for data mining and cleaning. A special thank you to Brooke Harmon, whose exceptional thoughtfulness and diligence in collecting, cleaning, and double-checking these data were essential to this study. We also thank audience members at these meetings for their feedback: the 8th International Conference on Writing Analytics, September 2019, Winterthur, Switzerland; the 7th International Conference on Writing Analytics, 2019, St. Petersburg, FL; and the Annual International Conference of The Association for Practical and Professional Ethics, 2020. This research is supported by NSF grant SES-1737093.

Data availability statement

Data are available through Duke University on an individual basis. Please contact Cary Moskovitz at [email protected] for details.

Disclosure statement

No potential conflict of interest was reported by the authors.

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

Notes

1. Discerning readers might wonder why such a specialized tool was necessary when proprietary text-matching software (such as Turnitin) is available. While we discuss these reasons in greater detail in the sections that follow, the most important reason for developing our own algorithm is that it allows us to examine various forms of text recycling (rather than the mere presence or absence of recycled content). While Turnitin’s core algorithm is likely based on methods similar to our own (string pattern matching and scoring), its source code is not publicly available. Our algorithm allows us to explore authorial practice in a more robust and fine-grained mode.

2. Funding for science and engineering research at the NSF is done through seven directorates: Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematical and Physical Sciences, Social, Behavioral and Economic Sciences, and Education and Human Resources. To keep the scope of our study manageable, we eliminated some directorates from consideration. We eliminated Education and Human Resources because we were focused on scientific writing. We also excluded fields that often produce papers with little prose. While text recycling practices in these fields are certainly of interest, the tools we needed to develop to investigate recycling of prose would likely not be appropriate for analyzing papers consisting largely of equations or code. We thus excluded the NSF directorate “Computer and Information Science and Engineering.” We also excluded Geosciences because of its interdisciplinarity.

3. Common English “stopwords” were excluded from this part of the analysis using the Snowball stopword dictionary. See https://cran.r-project.org/web/packages/stopwords/stopwords.pdf for more information.

4. Sentence parsing was performed using the sentenceTokenParse command in the lexRankr package for R.

5. We use 80% rather than 100% for two reasons: first, there are sometimes trivial editorial or layout edits that result in very minor alterations; second, a sentence may be recycled verbatim from the source but then have additional material added to it.

6. Pemberton et al. (2019) also report that these editors tended to have little understanding of copyright law. A recent legal analysis of text recycling in STEM research conducted by members of our research group (not yet published) suggests that typical recycling practices in STEM research articles are, in fact, legal – at least under U.S. law.

7. We thank an anonymous reviewer for this interesting idea.
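
The processing steps mentioned in Notes 3 through 5 (stopword removal, sentence parsing, and an 80% matching threshold) can be illustrated with a minimal sketch. This is not the authors' tool, which was written in R and is not reproduced here. The Python code below is an illustration under stated assumptions: the stopword list is a small hypothetical stand-in for the Snowball dictionary, the sentence splitter is a naive regular expression rather than lexRankr, and the overlap score (the share of a source sentence's content words that reappear in a candidate sentence) is one plausible reading of the 80% criterion, not the study's documented metric. All function names are hypothetical.

```python
import re

# Hypothetical, abbreviated stopword list; the study used the Snowball dictionary.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was",
             "were", "we", "for", "on", "with", "that", "this"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_tokens(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9']+", sentence.lower())
            if t not in STOPWORDS]


def overlap(source_sentence, target_sentence):
    """Share of the source sentence's content tokens found in the target sentence.

    Scoring against the source side means a sentence reused verbatim and then
    extended with new material still scores 1.0 (an assumption about the metric).
    """
    src = content_tokens(source_sentence)
    if not src:
        return 0.0
    tgt = set(content_tokens(target_sentence))
    return sum(t in tgt for t in src) / len(src)


def recycled_sentences(source_text, target_text, threshold=0.8):
    """Return (target sentence, best-matching source sentence, score) triples
    for target sentences whose best score meets the threshold."""
    sources = split_sentences(source_text)
    matches = []
    for tgt in split_sentences(target_text):
        scored = [(overlap(src, tgt), src) for src in sources]
        if not scored:
            continue
        score, best = max(scored)
        if score >= threshold:
            matches.append((tgt, best, score))
    return matches


earlier_paper = ("Samples were incubated at 37 degrees for two hours. "
                 "Results were analyzed using R.")
later_paper = ("Samples were incubated at 37 degrees for two hours before imaging. "
               "We report new findings here.")

for sentence, source, score in recycled_sentences(earlier_paper, later_paper):
    print(f"{score:.2f}  {sentence}")
```

A threshold below 100% lets the sketch tolerate the two cases Note 5 describes: trivial editorial changes to a sentence, and verbatim reuse followed by added material.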

Additional information

Funding

This work was supported by the National Science Foundation [SES-1737093].
