Computers and Computing

A 2-Tier Bengali Dataset for Evaluation of Hard and Soft Classification Approaches

Debapratim Das Dawn, Abhinandan Khan, Soharab Hossain Shaikh & Rajat Kumar Pal
 

Abstract

Document classification, the task of assigning documents to one or more classes, is an open problem in library, information, and computer sciences. Interest among linguistic researchers in this domain has grown steadily because of applications such as language identification, readability assessment, sentiment analysis, and spam filtering. However, researchers focussing on natural language processing of resource-scarce languages have faced many hurdles due to the absence of benchmark datasets. Bengali is among the most widely spoken resource-scarce (low-resource) languages. Although Bengali NLP researchers have created their own datasets, these are useful only for evaluating the performance of their own proposed document classification techniques. Therefore, there is a gap in the literature regarding the availability of benchmark datasets. To overcome this barrier, this paper presents a benchmark dataset for Bengali document classification that is publicly accessible and freely available. The dataset has a two-tier architecture: the first tier is for hard classification techniques and the second tier is for soft classification techniques. Hard classification techniques follow supervised learning based models for the classification of documents, whereas soft classification techniques follow unsupervised learning based models for the clustering of documents. The proposed dataset has thirteen unique characteristics. This paper also introduces four new feature sets to evaluate the performance of the proposed dataset, namely: location revealing factor, part of speech tagging factor, relative frequency, and prominence factor.
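The distinction the abstract draws between the two tiers can be illustrated with a minimal sketch. It is not the authors' method. The abstract does not define the relative frequency feature, so the code assumes the common reading (a term's count divided by the document's total token count). The nearest-centroid classifier and the threshold-based leader clustering are simple stand-ins for "supervised" and "unsupervised" models, and the transliterated tokens and labels are hypothetical toy data, not drawn from the dataset.

```python
from collections import Counter


def relative_frequency(tokens):
    """Assumed definition: term count divided by total tokens in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sum(w * w for w in a.values()) ** 0.5
    norm_b = sum(w * w for w in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {term: weight / len(vectors) for term, weight in total.items()}


def train_centroids(labelled_docs):
    """Hard tier (supervised): one centroid per known class label."""
    by_label = {}
    for tokens, label in labelled_docs:
        by_label.setdefault(label, []).append(relative_frequency(tokens))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(tokens, centroids):
    """Assign the label whose centroid is most similar to the document."""
    vec = relative_frequency(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def cluster(docs, threshold=0.3):
    """Soft tier (unsupervised): group documents without using any labels.

    Each document joins the first group whose leader is similar enough,
    otherwise it starts a new group. The threshold is an arbitrary choice.
    """
    leaders, groups = [], []
    for tokens in docs:
        vec = relative_frequency(tokens)
        for index, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                groups[index].append(tokens)
                break
        else:
            leaders.append(vec)
            groups.append([tokens])
    return groups


# Hypothetical toy corpus (transliterated tokens, made-up labels).
LABELLED = [
    (["khela", "dol", "goal", "khela"], "sports"),
    (["dol", "goal", "match"], "sports"),
    (["bhot", "sarkar", "mantri"], "politics"),
    (["sarkar", "bhot", "bhot"], "politics"),
]

if __name__ == "__main__":
    model = train_centroids(LABELLED)
    print(classify(["goal", "khela", "match"], model))
    print(len(cluster([tokens for tokens, _ in LABELLED])))
```

In the supervised path the labels drive training and accuracy can be measured against them; in the unsupervised path the labels are discarded and only the grouping is produced, which is why the two tiers call for different evaluation.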

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Availability of Data and Materials

The dataset is freely accessible at https://www.kaggle.com/datasets/debapratimdasdawn/bengali-dataset-for-hard-and-soft-classification (DOI: https://www.kaggle.com/dsv/4761177).

Additional information

Notes on contributors

Debapratim Das Dawn

Debapratim Das Dawn received the BTech degree in computer science and engineering from the West Bengal University of Technology, India in 2010, and the MTech degree in computer science and applications from the University of Calcutta, India in 2014. He is currently pursuing a PhD at the Department of Computer Science and Engineering, University of Calcutta, India. Since 2021, he has been a lecturer at the Department of Computer Science & Technology, Bundwan Polytechnic, India. His research interests include computational linguistics, natural language processing, machine learning, computer vision, and image processing. He serves as a reviewer for reputed international journals. Corresponding author. Email: [email protected]

Abhinandan Khan

Abhinandan Khan received the BTech degree in electronics and communication engineering from the West Bengal University of Technology, India in 2011, the ME degree in electronics and telecommunication engineering from Jadavpur University, India in 2013, and the PhD degree from the University of Calcutta, India in 2020. He received the University Gold Medal for securing the highest marks among all post-graduate engineering courses at Jadavpur University. He is also a recipient of the Junior and Senior Research Fellowships (NET) from the Council of Scientific & Industrial Research, Government of India. His research interests include computational biology and bioinformatics, computational intelligence, and natural language processing. Khan has published over 25 research articles. Email: [email protected]

Soharab Hossain Shaikh

Soharab Hossain Shaikh is an associate professor at the Department of Computer Science and Engineering, Brij Mohan Lal Munjal University, Gurgaon, Haryana, India. He has over a decade and a half of teaching experience in academia. Along with teaching at the university level, he is also involved in educating professionals from industry and academia on artificial intelligence, machine learning, and deep learning for computer vision and natural language processing. He has designed and developed professional degree programmes on AI and deep learning for industry professionals at a leading edtech company in India. He received his PhD degree from the University of Calcutta. His research interests include solving problems involving computer vision and natural language processing. He has published many research papers and two books with Springer-Verlag on various aspects of image processing, holds a published US patent in his domain of expertise, and has contributed chapters on behavioural biometrics published by CRC Press. He has served as an editor for the Springer journal Transactions on Computational Science, acted as a reviewer for many international journals and conferences, and delivered lectures and invited talks at many conferences. He has supervised two doctoral students to their PhD degrees and continues research activities with his current scholars. Email: [email protected]

Rajat Kumar Pal

Rajat Kumar Pal received the BE degree in electrical engineering from the Bengal Engineering College, Shibpur under the University of Calcutta, India, and the MTech degree in computer science and engineering from the University of Calcutta, India in 1985 and 1988, respectively. He received the PhD degree from the Indian Institute of Technology, Kharagpur, India in 1996. Since 1994, he has been a faculty member with the Department of Computer Science and Engineering, University of Calcutta. He has also held the position of head of the Department twice, from 2005 to 2007 and from 2016 to 2018. He went on lien to become a professor at the Department of Information Technology, Assam University, India from 2010 to 2012, where he also held the position of Dean of the Triguna Sen School of Technology. Presently, he is working as a professor with the Department of Computer Science and Engineering, University of Calcutta, India. His major research interests include VLSI design, graph theory and its applications, perfect graphs, logic synthesis, design and analysis of algorithms, computational geometry, and parallel computation and algorithms. He has published over 250 research articles and authored and co-authored two books. He also holds several international patents. Email: [email protected]
