Application Note

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Pages 574-591 | Received 29 Sep 2020, Accepted 31 Mar 2021, Published online: 27 Apr 2021
 

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose integrating web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling into a multi-step classification rule that circumvents manual labelling. This achieves unsupervised one-class document classification with out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method outperforms common machine learning classifiers and is validated on multiple data sets.
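As a rough illustration of how the three components might fit together, the following minimal sketch uses scikit-learn. It is not the authors' implementation: the abstract does not spell out how the steps are chained, so the sketch assumes that a one-class SVM trained only on scraped, out-of-domain text first filters the target corpus, and that LDA is then fitted on the retained documents. The example texts, the function names `svm_labels` and `topic_weights`, and all parameter values (`nu`, kernel, number of topics) are hypothetical, and the web scraping step is replaced by a hard-coded list.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-ins: in practice these would be web-scraped,
# out-of-domain texts on the target topic and the unlabelled target corpus.
scraped_docs = [
    "dental implants improve tooth replacement outcomes",
    "periodontal disease affects gums and teeth",
    "orthodontic treatment aligns teeth with braces",
    "tooth decay and cavity prevention in dental care",
]
target_docs = [
    "patient presents with tooth pain and a dental cavity",
    "cardiac catheterization showed normal coronary arteries",
    "extraction of wisdom teeth under local anesthesia",
    "knee arthroscopy performed for a meniscal tear",
]


def svm_labels(train_docs, docs, nu=0.5):
    """Fit a one-class SVM on train_docs only; return +1/-1 for each doc."""
    vectorizer = TfidfVectorizer(stop_words="english")
    model = OneClassSVM(kernel="linear", nu=nu)
    model.fit(vectorizer.fit_transform(train_docs))
    return model.predict(vectorizer.transform(docs)).tolist()


def topic_weights(docs, n_topics=2, seed=0):
    """Fit LDA on docs; return one row of topic proportions per document."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(counts)


labels = svm_labels(scraped_docs, target_docs)
candidates = [doc for doc, label in zip(target_docs, labels) if label == 1]
# Assumed chaining: only SVM-accepted documents go on to topic modelling.
if candidates:
    weights = topic_weights(candidates)
```

On a toy corpus this small, the SVM labels are not meaningful. The sketch only shows the data flow: no labelled target documents are needed at any step, since the one-class SVM sees scraped text alone and LDA is unsupervised.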

Acknowledgments

We are grateful to Cornelius Weisser for the data labelling and to Jeanne Micallef and Maximilian Kornhass for helping with the dictionary used in the keyword search. For both tasks their medical expert knowledge was invaluable. We also thank two anonymous reviewers for their many insightful comments and suggestions on the original version of the paper, which substantially improved the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 For medical data, privacy policies make it very difficult to obtain in-domain training data or data comparable to it, so scientific papers covering the broad topic of dentistry seem like a good representation of this topic. For the Reuters data set, one could argue for using newspaper articles on the subject of cotton instead. However, the difficulty of finding text labels accurate enough for classification, and the paywalls on the most popular newspaper websites, are obstacles to scraping newspaper articles that are not easy to overcome. For that reason, scraping scientific papers proved to be the most practicable solution.

2 Recall from Section 1.1 that the medical transcriptions data set contains 27 occurrences labelled dentistry.
