Abstract
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification using out-of-domain training data and correctly classifies more than 80% of the target data. It thus even outperforms common machine learning classifiers and is validated on multiple data sets.
Acknowledgments
We are grateful to Cornelius Weisser for the data labelling and to Jeanne Micallef and Maximilian Kornhass for helping with the dictionary used in the keyword search. For both tasks their medical expert knowledge was invaluable. We also thank two anonymous reviewers for their many insightful comments and suggestions on the original version of the paper, which substantially improved the manuscript.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 For medical data, privacy policies make it very difficult to obtain in-domain training data or data that is comparable to in-domain data, so scientific papers covering the broad topic of dentistry seem like a good representation of this topic. For the Reuters data set, one could argue for using newspaper articles on the subject of cotton instead. However, the difficulty of finding text labels that are accurate enough for classification, and the paywalls on the most popular newspaper websites, are problems for web scraping of newspaper articles that are not easy to overcome. For that reason, scraping scientific papers proved to be the best practicable solution.
2 Recall from Section 1.1 that there were 27 occurrences of the label dentistry in the medical transcriptions data set.