Teacher's Corner

Text Analysis in R


ABSTRACT

Computational text analysis has become an exciting research field with many applications in communication research. It can be a difficult method to apply, however, because it requires knowledge of various techniques, and the software required to perform most of these techniques is not readily available in common statistical software packages. In this teacher’s corner, we address these barriers by providing an overview of general steps and operations in a computational text analysis project, and demonstrate how each step can be performed using the R statistical software. As a popular open-source platform, R has an extensive user community that develops and maintains a wide range of text analysis packages. We show that these packages make it easy to perform advanced text analytics.

Declaration of interest

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the article.

Notes

1 The term “data science” is a popular buzzword related to “data-driven research” and “big data” (Provost & Fawcett, 2013).

2 Other programming environments have similar archives, such as PyPI for Python (the repository that pip installs from). However, CRAN stands out for how strictly it is maintained, with elaborate checks that packages must pass before they are accepted.

3 The London School of Economics and Political Science recently hosted a workshop (http://textworkshop17.ropensci.org/), forming the beginnings of an rOpenSci special interest group for text analysis.

4 For example, the tif (Text Interchange Formats) package (rOpenSci Text Workshop, 2017) describes and validates standards for common text data formats.

6 For a list that includes more packages, and that is also maintained over time, a good source is the CRAN Task View for Natural Language Processing (Wild, 2017). CRAN Task Views are expert-curated and maintained lists of R packages on the Comprehensive R Archive Network, and are available for various major methodological topics.

8 Notably, there are techniques for automatically expanding a dictionary based on the semantic space of a text corpus (see, e.g., Watanabe, 2017). This can be said to add an inductive layer to the approach, because the coding rules (i.e., the dictionary) are to some extent learned from the data.

9 The term n-grams can be used more broadly to refer to sequences, and is also often used for sequences of individual characters. In this teacher’s corner we strictly use n-grams to refer to sequences of words.
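The distinction between word n-grams and character n-grams can be illustrated with a minimal base R sketch. The helper functions `word_ngrams` and `char_ngrams` are hypothetical names introduced here for illustration only, and the simple tokenization rule (splitting on anything that is not a letter, digit, or apostrophe) is an assumption, not the approach of any particular package.

```r
# Hypothetical helper: n-grams as sequences of words (the sense used in this text).
word_ngrams <- function(text, n = 2) {
  # Assumed tokenization: lowercase, split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(text), "[^[:alnum:]']+")[[1]]
  tokens <- tokens[nzchar(tokens)]           # drop empty tokens
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)  # start position of each n-gram
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: n-grams as sequences of individual characters.
char_ngrams <- function(text, n = 3) {
  if (nchar(text) < n) return(character(0))
  starts <- seq_len(nchar(text) - n + 1)
  substring(text, starts, starts + n - 1)
}

bigrams  <- word_ngrams("Text analysis in R", n = 2)
trigrams <- char_ngrams("text", n = 3)
print(bigrams)   # "text analysis" "analysis in" "in r"
print(trigrams)  # "tex" "ext"
```

In practice, a dedicated text analysis package would usually handle tokenization and n-gram construction; the sketch only makes the two meanings of the term concrete.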

10 To view how to cite a package, the citation function can be used, e.g., citation("quanteda") for citing quanteda, or citation() for citing the R project. This returns either the citation details supplied by the package developer or auto-generated details.
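As a brief sketch of this usage, the calls below retrieve citation details in an R session. The `requireNamespace` guard is an addition for illustration, on the assumption that quanteda may not be installed on every machine.

```r
# Citation details for the R project itself (always available)
r_cite <- citation()
print(r_cite)

# Citation details for a package; guarded because the package
# may not be installed in the current library (an assumption here)
if (requireNamespace("quanteda", quietly = TRUE)) {
  pkg_cite <- citation("quanteda")
  print(pkg_cite)
  # The same details as a BibTeX entry, for use in a reference manager
  print(toBibtex(pkg_cite))
}
```

The returned object can also be converted with `toBibtex(citation())` when a BibTeX entry for R itself is needed.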
