ABSTRACT
In this paper, we propose a set of novel regression models for readability scoring in the Bengali language, which can also be applied to Hindi. The models use several lexical, surface-level, syntactic, and semantic features. We perform 5-fold and leave-one-out cross-validation on a human-annotated gold-standard dataset of 30 passages written by four eminent Bengali litterateurs. On this dataset, our best model achieves a mean squared error (MSE) of 57%, improving on the state-of-the-art result (73% MSE). We further perform feature analysis to identify features that are potentially useful for learning a regression model of Bengali readability. Ablation studies indicate the importance of compound characters (Juktakkhors) in readability assessment.
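The evaluation protocol named above (5-fold and leave-one-out cross-validation, scored by MSE) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the single feature is a placeholder, and the one-variable least-squares fit stands in for the regression models proposed in the paper. The helper names `fit_ols` and `cv_mse` are hypothetical.

```python
import random


def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def cv_mse(xs, ys, k):
    """Cross-validated MSE pooled over all held-out predictions.

    Setting k == len(xs) gives leave-one-out cross-validation.
    """
    n = len(xs)
    indices = list(range(n))
    random.Random(0).shuffle(indices)          # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_ols([xs[i] for i in train],
                                   [ys[i] for i in train])
        squared_errors.extend(
            (ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out
        )
    return sum(squared_errors) / n


# Synthetic stand-in for 30 annotated passages (NOT the paper's data):
# one placeholder feature per passage and a noisy linear readability score.
rng = random.Random(42)
feature = [rng.uniform(0, 10) for _ in range(30)]
score = [2.0 + 0.5 * x + rng.gauss(0, 0.3) for x in feature]

mse_5fold = cv_mse(feature, score, k=5)
mse_loo = cv_mse(feature, score, k=len(feature))
print(f"5-fold MSE: {mse_5fold:.4f}")
print(f"Leave-one-out MSE: {mse_loo:.4f}")
```

With only 30 passages, leave-one-out trains each model on 29 of them, while 5-fold holds out 6 passages at a time; reporting both gives a sense of how sensitive the error estimate is to the split.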
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1. Throughout the article, ‘adult literature’ refers to literature for mature readers.