Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Anton Thielmann (a) Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany. Correspondence: [email protected]

Christoph Weisser (a) Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; (b) Campus-Institut Data Science (CIDAS), Göttingen, Germany

Astrid Krenz (a) Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; (c) Digital Futures at Work Research Centre, University of Sussex, Brighton, UK

Benjamin Säfken (a) Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; (b) Campus-Institut Data Science (CIDAS), Göttingen, Germany

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification by integrating out-of-domain training data, and more than 80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
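
To illustrate how the three components named above might be chained, the following is a minimal sketch and not the implementation used in the paper. It assumes scikit-learn is available. The function name `classify`, the toy texts, the linear kernel and the values of `nu` and `n_topics` are all hypothetical choices made for illustration. A one-class SVM is fitted on texts standing in for web-scraped, out-of-domain documents, applied to unlabelled target documents, and LDA is then run on the documents the SVM retains.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM


def classify(scraped_docs, target_docs, nu=0.5, n_topics=2, seed=0):
    """Return (candidate mask over target_docs, dominant LDA topic per candidate)."""
    # Step 1: learn vocabulary and tf-idf weights from the scraped corpus only.
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(scraped_docs)
    x_target = tfidf.transform(target_docs)

    # Step 2: the one-class SVM only ever sees the scraped (positive-class) texts.
    svm = OneClassSVM(kernel="linear", nu=nu).fit(x_train)
    mask = svm.predict(x_target) == 1  # +1 means "resembles the scraped class"

    candidates = [doc for doc, keep in zip(target_docs, mask) if keep]
    if not candidates:
        return mask, np.array([], dtype=int)

    # Step 3: fit LDA on the retained documents and record each one's dominant topic.
    counts = CountVectorizer().fit_transform(candidates)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topics = lda.fit_transform(counts).argmax(axis=1)
    return mask, topics


# Hypothetical stand-ins for scraped paper abstracts on dentistry.
scraped_docs = [
    "dental implants and tooth extraction in oral surgery",
    "tooth decay treatment with dental fillings",
    "gum disease and oral hygiene in dental patients",
    "root canal treatment of an infected tooth",
]
# Hypothetical unlabelled target documents.
target_docs = [
    "patient presents for tooth extraction and dental implant placement",
    "chest pain with shortness of breath, cardiology consult requested",
    "oral examination shows tooth decay and gum disease",
    "knee arthroscopy performed under general anesthesia",
]

mask, topics = classify(scraped_docs, target_docs)
```

Here `mask` flags the target documents that the SVM treats as candidates, and `topics` holds one topic index per candidate. Deciding which topic corresponds to the class of interest is deliberately left open, since the abstract does not specify that step.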

Acknowledgments

We are grateful to Cornelius Weisser for the data labelling and to Jeanne Micallef and Maximilian Kornhass for helping with the dictionary used in the keyword search. For both tasks, their medical expert knowledge was invaluable. We also thank two anonymous reviewers for their many insightful comments and suggestions on the original version of the paper, which substantially improved the manuscript.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 For medical data, privacy policies make it very difficult to obtain in-domain training data or data that is comparable to in-domain data, so scientific papers covering the broad topic of dentistry seem like a good representation of this topic. For the Reuters data set, one could instead argue for using newspaper articles on the subject of cotton. However, the difficulty of finding text labels that are accurate enough for classification, together with the paywalls on the most popular newspaper websites, makes web scraping of newspaper articles hard to carry out. For that reason, using scientific papers for web scraping proves to be the most practicable solution.

2 Recall from Section 1.1 that 27 occurrences labelled dentistry were found in the medical transcriptions data set.
