This article refers to:
OkCupid Data for Introductory Statistics and Data Science Courses

Article title: OkCupid Data for Introductory Statistics and Data Science Courses

Authors: Albert Y. Kim & Adriana Escobedo-Land

Journal: Journal of Statistics and Data Science Education

Bibliometrics: Albert Y. Kim & Adriana Escobedo-Land (2015) OkCupid Data for Introductory Statistics and Data Science Courses, Journal of Statistics Education, 23:2

DOI: 10.1080/10691898.2015.11889737

At the authors' request, the original publication and dataset have been corrected to remove further potential identifiers of the OkCupid users whose data were used as examples, while retaining the dataset's potential for educational use. The following changes have been made:

  • In the Abstract, Introduction, Data, and Conclusion sections, the exact date of the data was removed and replaced with “from a period in the 2010s”.

  • In the Data section, the following additions and edits are included:

    • This line was added: “Note that random noise was added to the age variable for de-identification purposes”.

    • The data file is renamed to “profiles_revised.csv”.

    • This line was added: “However, the essay data has been randomized by rows to decouple them from the profiles data. In other words, the user represented in the first row of profiles_revised does not necessarily correspond to the user that wrote the responses in the first row of essays_revised_and_shuffled. We load this randomized essays data as follows:

essays_revised_and_shuffled <-


header = TRUE,

stringsAsFactors = FALSE)”.

  • In Section 3 Example Analyses, question 3 was reworded to “How accurately can we predict a user’s sex using their listed height?”

  • In Section 3.1.1, the profiles.subset file is renamed profiles_revised.subset.

  • Section 3.3 Text Analysis has been redacted, along with its figures. This subsection formerly contained a text analysis comparing word-use frequencies between the “male” and “female” groups of the sex variable. Because the essays data has been decoupled from the profiles data, this analysis is moot, and the section has therefore been redacted.

  • Section and figure numbers following the original Section 3.3 have been updated, since that section and its figures were removed.

  • In Section 3.3.1 Exercise, a subsection of Predictors of Sex, the numbers in the sample function have changed due to differences in R’s random number generator.

  • In Section 3.3.1 Exercise, a subsection of Predictors of Sex, the numbers in the table have changed slightly due to package updates.

  • In Section 3.3.2 Pedagogical Discussion, the misclassification error rate of 16.98% and the height of 67.11 inches were updated to reflect package updates.

  • In Section 4 Conclusions, the profiles.csv.zip file is renamed to profiles_revised.csv.zip.

  • The Acknowledgements section was rewritten to reflect the revision.
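
The de-identification steps listed for the Data section can be illustrated on toy data. The following is a minimal sketch, not the authors' actual procedure: the data frames profiles_toy and essays_toy, the noise range of plus or minus one year, the seed, and the temporary file are all assumptions made for illustration. The notice does not state how the noise was generated or how the rows were shuffled, and the load call it quotes is incomplete, so the use of read.csv() here is likewise an assumption (it does accept the header and stringsAsFactors arguments shown in the quoted fragment).

```r
# Toy stand-ins for the profiles and essays data (hypothetical values).
profiles_toy <- data.frame(
  age = c(22L, 35L, 41L, 29L, 53L),
  sex = c("m", "f", "f", "m", "f"),
  stringsAsFactors = FALSE
)
essays_toy <- data.frame(
  essay0 = c("text A", "text B", "text C", "text D", "text E"),
  stringsAsFactors = FALSE
)

set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Add small random integer noise to age (assumed range: -1, 0, or +1 year).
noise <- sample(c(-1L, 0L, 1L), size = nrow(profiles_toy), replace = TRUE)
profiles_toy$age_noisy <- profiles_toy$age + noise

# Shuffle the essay rows so that row i of the essays need not
# correspond to row i of the profiles.
shuffled_rows <- sample(nrow(essays_toy))
essays_toy_shuffled <- essays_toy[shuffled_rows, , drop = FALSE]
rownames(essays_toy_shuffled) <- NULL

# Write and re-load the shuffled essays. read.csv() is an assumed choice,
# and the temporary file stands in for the real data file.
tmp <- tempfile(fileext = ".csv")
write.csv(essays_toy_shuffled, tmp, row.names = FALSE)
essays_reloaded <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)
```

After shuffling, the essays contain the same responses as before, but joining them to the profiles by row position no longer recovers which user wrote which response.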

The article has been republished with these corrections.
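
The changed numbers in the sample function are consistent with a documented change in R itself: as of R 3.6.0, the default method used by sample() switched from "Rounding" to "Rejection", so the same seed can yield different draws. Whether this is the specific difference the authors encountered is an assumption; the notice does not say. A minimal sketch, assuming R 3.6.0 or later, with an arbitrary seed and vector and a hypothetical helper draw_with():

```r
# Draw a permutation of 1:10 under a given sample.kind setting, then
# restore the session's previous RNG settings on exit.
draw_with <- function(sample_kind, seed = 123) {
  old <- RNGkind()
  on.exit(suppressWarnings(
    RNGkind(kind = old[1], normal.kind = old[2], sample.kind = old[3])
  ))
  # Selecting "Rounding" emits a warning about the non-uniform sampler;
  # it is suppressed here because the old behavior is requested on purpose.
  suppressWarnings({
    set.seed(seed, sample.kind = sample_kind)
    sample(1:10)
  })
}

draws_rounding  <- draw_with("Rounding")   # sampler used before R 3.6.0
draws_rejection <- draw_with("Rejection")  # default sampler since R 3.6.0

# Same seed, different sampler: the two permutations can be compared directly.
identical(draws_rounding, draws_rejection)
```

Code that must reproduce results obtained under an older R version can request the old sampler explicitly in this way, at the cost of the warning noted in the comment.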