ABSTRACT
The goal of this research is to make progress toward using supervised machine learning for automated content analysis involving complex interpretations of text. In Step 1, two humans coded a sub-sample of online forum posts for relational uncertainty. In Step 2, we evaluated reliability: we trained three different classifiers to learn from those subjective human interpretations. Reliability was established when two different metrics of inter-coder reliability could not distinguish whether a human or a machine had coded the text on a separate hold-out set. Finally, in Step 3 we assessed validity. To accomplish this, we administered a survey in which participants described their own relational uncertainty/certainty in text and completed a questionnaire. After classifying the text, the machine’s classifications of the participants’ responses positively correlated with the participants’ own self-reported relational uncertainty and relational satisfaction. We discuss our results in relation to computational communication science, content analysis, and interpersonal communication.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1. Given the popularity of trace data, especially in communication research (Choi, Citation2018), it is important to determine whether a website prohibits the use of crawling agents to collect data. The terms of service for both websites were carefully reviewed. Neither website made explicit statements regarding robots.txt or web-scraping policies. As such, we conclude that collecting data from these two sites did not violate their terms of service.
2. IDF for any term (t) is defined by IDF(t) = log(N / DF(t)), where N is the number of documents and DF(t) is the number of documents that contain the term (t). The transformation process is called TF-IDF weighting: w(t, d) = TF(t, d) × IDF(t), which assigns a higher weight to a term (t) in a document (d) when it occurs often, but only in a small number of documents. On the other hand, lower weights are assigned to terms that occur often, but in a high number of documents.
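As a minimal sketch of the weighting described above (using a natural-log IDF with toy documents invented for illustration; the authors' actual preprocessing pipeline is not specified here):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, given a corpus of tokenized documents."""
    tf = doc.count(term)                          # term frequency in this document
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / df)              # IDF(t) = log(N / DF(t))
    return tf * idf                               # w(t, d) = TF(t, d) * IDF(t)

# Hypothetical toy corpus: "uncertain" appears in 2 of 3 documents
docs = [
    ["uncertain", "about", "us"],
    ["certain", "about", "us"],
    ["uncertain", "again"],
]
weight = tf_idf("uncertain", docs[0], docs)  # 1 * log(3/2)
```

A term appearing once in every document would receive a weight of log(N/N) = 0, which is exactly the down-weighting of common terms the note describes.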
3. Precision is defined by Precision = TP / (TP + FP). Recall is defined by Recall = TP / (TP + FN). The F-Measure is defined by F = 2 × (Precision × Recall) / (Precision + Recall), where TP, FP, and FN are the counts of true positives, false positives, and false negatives, respectively.
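These three metrics can be computed directly from paired label lists; a small sketch (the label vectors below are invented for illustration, not the study's data):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-measure for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp)                      # TP / (TP + FP)
    recall = tp / (tp + fn)                         # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical human codes vs. machine classifications
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
```

Here the classifier catches 2 of 3 true positives and makes 1 false positive, so precision, recall, and F all equal 2/3.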