Disentangling User Samples: A Supervised Machine Learning Approach to Proxy-population Mismatch in Twitter Research


ABSTRACT

This study addresses sampling bias in communication research that relies on social media data. The authors demonstrate how supervised machine learning can reduce the Twitter sampling bias induced by “proxy-population mismatch”. Specifically, the study uses a Random Forest (RF) classifier to disentangle tweet samples representative of general publics’ activities from non-general, or institutional, activities. By applying RF classifier models to Twitter data sets relevant to four news events and to a randomly pooled data set, the study finds systematic differences in messaging patterns between general user samples and institutional user samples. This article calls for disentangling Twitter user samples when ordinary user behavior is the focus of research. It also builds on the development of machine learning modeling in the context of communication research.

Notes

1. Recall, Precision, and Accuracy are standard metrics for validating ML results. There are four possible classification outcomes: (1) True Positive (TP); (2) False Negative (FN); (3) False Positive (FP); and (4) True Negative (TN). Recall is the share of instances actually belonging to general publics that are correctly labeled as such, computed as TP/(TP + FN). Precision is the share of instances labeled as general publics that actually belong to general publics, computed as TP/(TP + FP). Accuracy is the share of all instances, general public and institutional users alike, that are labeled correctly, computed as (TP + TN)/(TP + FN + FP + TN). The F1 score is the harmonic mean of Precision and Recall, computed as 2 * (Precision * Recall)/(Precision + Recall).
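
The four formulas in this note can be checked with a short calculation. The following is a minimal sketch, not the authors' code: the function name `classification_metrics` and the example counts are hypothetical, and "positive" is assumed to mean the general public label.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute Recall, Precision, Accuracy, and F1 from confusion counts.

    Assumes the positive class is "general public"; counts are integers.
    """
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("at least one instance is required")

    # Guard each ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0

    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, Recall is 80/100, Precision is 80/90, and Accuracy is 170/200, which shows how the three metrics can diverge on the same set of labels.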

2. More formally, the GI for a classifier with J labels can be computed via the following:

$$GI = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^2$$

where $p_i$ is the probability that a randomly chosen object is classified with label $i$. For the source code and label data, please see the project repository at https://github.com/jpriniski/TwitterClassification
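
As an illustration of this note, the sketch below assumes that GI refers to the standard Gini impurity, one minus the sum of squared label probabilities, and estimates each probability from label frequencies in a sample. The function name `gini_impurity` and the sample labels are hypothetical and are not taken from the project repository.

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity of a non-empty sequence of class labels.

    Assumption: GI = 1 - sum(p_i ** 2), where p_i is the observed
    share of label i in the sequence.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical label samples for a two-label (J = 2) classifier.
pure = ["general"] * 4
mixed = ["general", "general", "institutional", "institutional"]

print(gini_impurity(pure))
print(gini_impurity(mixed))
```

A sample containing a single label has zero impurity, while an even split between two labels gives the two-label maximum of 0.5.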

Additional information

Funding

This work was supported by the 2017 Emerging Scholars Award from the Association for Education in Journalism and Mass Communication.
