Methodological Studies

Text as Data Methods for Education Research

Pages 707-727 | Received 01 Aug 2018, Accepted 02 Jun 2019, Published online: 06 Dec 2019
 

Abstract

Recent advances in computational linguistics and the social sciences have created new opportunities for the education research community to analyze relevant large-scale text data. However, the uptake of these advances in education research is still nascent. In this article, we review recent automated text methods relevant to educational processes and determinants. We discuss both lexical-based and supervised methods, which expand the scale of text that researchers can analyze, as well as unsupervised methods, which allow researchers to discover new themes in their data. To illustrate these methods, we analyze the text interactions from a field experiment in the discussion forums of online classes. Our application shows that respondents provide less assistance and discuss slightly different topics with the randomized female posters, but respond with similar levels of positive and negative sentiment. These results demonstrate that combining qualitative coding with machine learning techniques can provide a rich understanding of text-based interactions.

Acknowledgments

We thank June John for her assistance in data collection.

Notes

1 In keeping with the literature, we use the term “document” to refer to one observation of text (e.g., one essay, one text message, or one discussion board post).

2 Hand-coding is a traditional method frequently employed in qualitative research and content analysis in which a person reads and manually codes words or phrases with specific themes.

3 The harmonic mean is the reciprocal of the arithmetic mean of reciprocals and is more conservative than using the arithmetic mean (i.e., it produces a lower F1-measure).
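The F1-measure referenced here is the harmonic mean of precision and recall. A minimal sketch in Python (the `f1_score` helper is illustrative, not a specific package's API) shows why the harmonic mean is more conservative than the arithmetic mean:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean penalizes imbalance between the two components:
# with precision = 0.9 and recall = 0.5, the arithmetic mean is 0.7,
# but F1 is about 0.643.
print(f1_score(0.9, 0.5))
```

When one component is much lower than the other, F1 is pulled toward the lower value, which is why it is a stricter summary than the arithmetic mean.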

4 The R package “readme2” implements this; see https://github.com/iqss-research/readme-software.

5 With computer-assisted clusterings, the researcher can explore many possible computer-generated clusterings (Grimmer & King, Citation2011).

6 We should note that topic models are multimodal, meaning that topics can be sensitive to the starting values used. One way to account for this is using a spectral initialization, which is deterministic and globally consistent (Roberts, Stewart, & Tingley, Citation2016).

7 The FREX measure was developed by Roberts, Stewart, and Airoldi (Citation2016), and is the weighted harmonic mean of the word’s rank in terms of exclusivity and frequency.
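As a rough sketch of the idea behind FREX, the score below takes the weighted harmonic mean of each word's empirical-CDF rank in exclusivity and in frequency. This is an illustrative Python simplification, not the stm package's implementation; the `ecdf_rank` and `frex` helpers and the default weight of 0.7 toward exclusivity are assumptions for the sketch:

```python
import numpy as np

def ecdf_rank(scores: np.ndarray) -> np.ndarray:
    """Empirical CDF value of each score among its peers, in (0, 1]."""
    ranks = np.argsort(np.argsort(scores)) + 1
    return ranks / len(scores)

def frex(frequency: np.ndarray, exclusivity: np.ndarray, w: float = 0.7) -> np.ndarray:
    """Weighted harmonic mean of ECDF ranks in exclusivity and frequency."""
    f = ecdf_rank(frequency)
    e = ecdf_rank(exclusivity)
    return 1.0 / (w / e + (1 - w) / f)
```

Because the harmonic mean is dominated by the smaller component, a word scores high on FREX only if it ranks well on both frequency and exclusivity, rather than excelling on just one.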

8 Some methods also account for word order in a circumscribed manner through using “bigrams” or “trigrams” as opposed to “unigrams.” For example, the sentence “This class is hard” could produce unigrams (“this,” “class,” “is,” “hard”), bigrams (“this class,” “class is,” “is hard”), or trigrams (“this class is,” “class is hard”).
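The note's example can be reproduced with a short sketch (Python; simple whitespace tokenization and lowercasing are assumed, and the `ngrams` helper is illustrative):

```python
def ngrams(text: str, n: int) -> list:
    """Return the list of n-grams (joined as strings) from whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This class is hard"
print(ngrams(sentence, 1))  # ['this', 'class', 'is', 'hard']
print(ngrams(sentence, 2))  # ['this class', 'class is', 'is hard']
print(ngrams(sentence, 3))  # ['this class is', 'class is hard']
```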

9 Popular packages in R for analyzing text data will perform these preprocessing steps for you. The tidytext and text mining (“tm”) packages in R are particularly popular. Some analysis packages will also preprocess the data before conducting the analysis (e.g., the STM package in R; Roberts, Stewart, & Tingley, Citation2018).

10 These steps depend on the analysis being conducted. Researchers interested in linguistic style, for instance, may be primarily interested in function words instead of content words.
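A minimal preprocessing sketch along these lines (in Python; the tiny stopword list and the `preprocess` helper are illustrative only — the R packages named above implement far more complete versions). The `remove_stopwords` flag reflects the note's point that researchers studying linguistic style may want to keep function words rather than discard them:

```python
import re

# Tiny illustrative stopword list; real analyses use much longer ones.
STOPWORDS = {"this", "is", "a", "the", "and", "of"}

def preprocess(document: str, remove_stopwords: bool = True) -> list:
    """Lowercase, strip punctuation, tokenize, and optionally drop stopwords."""
    tokens = re.findall(r"[a-z']+", document.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("This class is hard!"))                          # ['class', 'hard']
print(preprocess("This class is hard!", remove_stopwords=False))  # ['this', 'class', 'is', 'hard']
```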

11 We received institutional review board approval for this study, and we worked in close consultation with the board to determine the number of comments placed in each course in order to minimize the costs placed on field participants.

12 We used names that were recently used in studies that have experimentally manipulated perceptions of race and gender (e.g., Bertrand & Mullainathan, Citation2004; Milkman, Akinola, & Chugh, Citation2015; Oreopoulos, Citation2011). We chose a set of four first names and four last names for each gender-race combination (128 names in total).

13 For comparison purposes, we convert the sentiment scores into positive/negative/neutral classifications. For LIWC, posts that have a higher percentage of positive terms than negative terms are classified as positive, posts that have a higher percentage of negative terms than positive terms are classified as negative, and posts that have an equal percentage of positive and negative terms are classified as neutral. For SEANCE, messages with a positive score above 0 and a negative score less than or equal to 0 are classified as positive, messages with a negative score above 0 and a positive score less than or equal to 0 are classified as negative, and all other posts are classified as neutral.
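The classification rules in this note can be written out directly. A Python sketch (the `classify_liwc` and `classify_seance` names are illustrative, not part of either tool, and each function takes the scores those tools output):

```python
def classify_liwc(pos_pct: float, neg_pct: float) -> str:
    """LIWC-style rule: compare the percentages of positive and negative terms."""
    if pos_pct > neg_pct:
        return "positive"
    if neg_pct > pos_pct:
        return "negative"
    return "neutral"

def classify_seance(pos_score: float, neg_score: float) -> str:
    """SEANCE-style rule: positive needs pos > 0 and neg <= 0, and vice versa."""
    if pos_score > 0 and neg_score <= 0:
        return "positive"
    if neg_score > 0 and pos_score <= 0:
        return "negative"
    return "neutral"
```

Note that the two rules can disagree: under the SEANCE-style rule, a post with both scores above zero is neutral, whereas the LIWC-style rule classifies it by whichever percentage is larger.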

14 See pages 4–5 for an introduction to the confusion matrix.

15 The performance for the negative codes is worse, likely due to how rare the negative codes were (3% of posts were negative).

16 Meyer and Mittag (Citation2017) also show how to estimate the degree of bias due to measurement error in binary dependent variables without having the true variable (i.e., the variable without measurement error).

17 Recall that there are 241 positive posts and only 26 negative posts in our corpus (out of 798 posts).

18 This scale is designed for student-to-student confirmation, and we apply it to both students and instructors. The instructor confirmation scale includes response to questions (which is very similar to the combination of assistance and acknowledgment and includes answering students’ questions fully and indicating that they appreciate student questions), demonstrating interest in the student (which is very similar to individual attention), teaching style (which does not apply in our online discussion forum context), and disconfirmation (which is included in the student-to-student confirmation scale; Ellis, Citation2000).

19 We subset our sample to Black and White posters because a substantial number of MOOC participants were from the United States and may not be able to discern gender in Asian names.

20 See page 6 for an introduction to k-fold cross-validation.

21 Disconfirmation may also be a more complex task to predict. Disconfirmation sometimes includes clear terms of disagreement, like “no” and “not really,” but often is more complex. For instance, one disconfirming post simply counters “highly subjective” when a fictitious poster complained about the lectures, and another states that it is the “calm before the storm” when a fictitious poster stated that they were feeling confident about the course. None of these terms (subjective, calm, storm) shows up in any of the other disconfirming posts.

22 As noted earlier, this is primarily an issue with binary variables, which do not exhibit classical measurement error.

23 See pages 7–8 for an introduction to topic models.

Additional information

Funding

This research was supported by a grant from the Institute of Education Sciences [Award No. R305B140009].
