Computers and Computing

A 2-Tier Bengali Dataset for Evaluation of Hard and Soft Classification Approaches


