Research Article

End-to-end handwritten Ge’ez multiple numerals recognition using deep learning

Pages 122-134 | Received 13 Jul 2023, Accepted 26 Mar 2024, Published online: 23 Apr 2024

ABSTRACT

Ge'ez has been used in Ethiopian churches for centuries to read and interpret the Bible. As part of the liturgy and religious ceremonies, it is also used in prayer and chanting. Because the language is ancient, a large body of handwritten physical documents exists. Recognizing handwritten Ge'ez numerals poses significant challenges due to variations in handwriting styles, incomplete or overlapping strokes, noise, and distortion. This study introduces an end-to-end approach for handwritten Ge'ez multiple numeral recognition employing deep neural networks. The proposed method streamlines recognition by eliminating manual feature extraction stages. To enable end-to-end training without explicit alignment, the model uses attention mechanisms and a connectionist temporal classification-based loss function. The proposed model is evaluated on a synthetically generated dataset of 120,000 handwritten Ge'ez multiple numeral images. The developed recognition model achieved a character error rate (CER) of 2.81% and a word error rate (WER) of 18.13%. These results highlight the effectiveness of the approach in accurately and reliably recognizing handwritten Ge'ez multiple numerals.

Introduction

The Ge'ez language has a significant role in Ethiopian culture, functioning as an ancient Semitic language primarily used within Ethiopian churches [Citation1]. It holds a sacred status and has been historically employed for writing Ethiopian historical accounts and religious texts, resulting in a substantial collection of Ge'ez documents found in Ethiopia. For centuries, Ge'ez has been used in Ethiopian churches for reading and interpreting the Bible. It is an integral part of liturgical practices, religious ceremonies, prayers, and chants. Due to its ancient nature, a large number of handwritten physical documents are available in the Ge'ez language. Ge'ez utilizes the Ethiopic script, which is its native alphabet [Citation2]. The script follows a left-to-right reading and writing system similar to the Latin script.

Derived from the ancient South Arabian alphabet [Citation1], the Ethiopic script has been used to write Ge'ez for over two thousand years and remains one of the few writing systems still in use today. Notably, Ge'ez incorporates a unique numbering system known as the Ge'ez numbering system, specifically designed for representing numerical data. Comprising a total of 20 distinct symbols, as shown in Table 1, the Ge'ez numbering system employs complex shapes for each symbol, accompanied by two horizontal lines positioned above and below it. It is also important to note that the Ge'ez numeral system does not include a symbol that represents zero.

Table 1. Ge’ez numbering and English equivalent numeric symbols.

The Ge'ez numbering system consists of two main categories: digits and numbers. The digits category includes symbols representing the numbers from one to nine: “፩,” “፪,” “፫,” “፬,” “፭,” “፮,” “፯,” “፰,” and “፱,” where each symbol corresponds to its respective numerical value. The numbers category encompasses symbols representing multiples of ten and larger numbers. These symbols include ፲ (representing 10), ፳ (20), ፴ (30), ፵ (40), ፶ (50), ፷ (60), ፸ (70), ፹ (80), ፺ (90), ፻ (100), and ፼ (10,000). In this category, each symbol represents a specific value, and the English equivalent number appears inside the brackets. The Ge'ez numbering system can represent a wide range of numerical values by combining the 20 primary symbols. For example, the symbol ፳ combined with the nine digit symbols creates numbers such as ፳፩ (representing 21), ፳፪ (22), ፳፫ (23), and so on, up to ፳፱ (29).

Handwritten numeral or text recognition is a complex process that involves automatically identifying various handwritten numerals or text information [Citation3]. While computer algorithms are employed to analyze and interpret numeral features, recognizing handwritten digits presents unique challenges in optical character recognition (OCR) and pattern recognition.

One of the main difficulties lies in the significant variations in handwritten digits exhibited in shape, size, and style [Citation4]. These variations make it particularly challenging to develop an algorithm that can accurately recognize all possible variations. Furthermore, distinguishing between different digits can be problematic due to overlapping or incomplete strokes.

Handwritten digits are also susceptible to distortions and noise caused by factors like poor handwriting, smudges, or low-quality image documents [Citation5–7]. The need to generalize to unseen data makes handwritten numeral recognition more demanding than recognizing printed numerals. A variety of techniques have been explored to overcome these challenges, including pre-processing, feature extraction, and deep-learning approaches [Citation8]. While considerable research has been conducted on pattern recognition for widely used scripts like Latin, Arabic, Urdu, Persian, and Mandarin, Ge'ez numeral recognition, which pertains to an ancient script, has not received the same level of attention in OCR research [Citation9]. Previous works on Ge'ez script recognition have primarily been based on printed texts [Citation10–13], and the recognition of handwritten Ge'ez scripts has remained largely unexplored due to the scarcity of datasets [Citation3]. Although deep-learning algorithms typically require a large amount of labelled data for training [Citation6], publicly available Ge'ez digit datasets are scarce, with the dataset prepared by Nur et al. being an exception [Citation3]. Consequently, recognizing Ge'ez numerals poses a unique challenge in OCR.

Deep learning (DL) approaches have revolutionized pattern recognition, surpassing traditional machine-learning methods in terms of effectiveness [Citation14]. In particular, convolutional neural networks (CNNs) have played a crucial role in image processing and pattern recognition. Building upon previous studies on Ge'ez digit recognition and handwritten text recognition [Citation3,Citation15], this study focused on the underexplored area of handwritten Ge'ez numeral recognition. To the best of our knowledge, this study will be the first of its kind in handwritten Ge’ez multiple-digit recognition. Our study makes the following significant contributions:

  • Preparation of the dataset: To enhance the training process, we prepare synthetically generated image datasets, augmenting the available data to facilitate robust learning and recognition.

  • Development of comprehensive end-to-end (E2E) learning model: The proposed model integrates automatic feature extraction, attention mechanism, bidirectional LSTM (BLSTM) based sequencing, and CTC loss functions. This comprehensive framework enables the model to automatically extract relevant features from Ge'ez numeral images, selectively focus on important regions of the input sequence, and gain a better understanding of temporal information. Additionally, it enables E2E training without the need for explicit alignment between images and labels.

  • Experimental Evaluation: Leveraging the prepared dataset, we conduct different experiments, starting with hyperparameter selection, to evaluate our model's performance. The preliminary results demonstrate promising outcomes in Ge'ez handwritten numeral recognition.

Through these contributions, our study aims to advance Ge'ez handwritten numeral recognition by leveraging a comprehensive DL framework that combines CNNs, BLSTM, attention mechanisms, and CTC. This approach facilitates effective training and recognition, paving the way for improved performance in Ge'ez numeral recognition tasks.

Literature review

Numeral recognition has been extensively studied since the 1980s, employing various approaches ranging from traditional machine-learning techniques to deep-learning approaches [Citation16]. Numerous published studies have explored different strategies for digit recognition, focusing on pattern recognition and feature extraction techniques. In this section, we provide an overview of the research conducted in the field of digit recognition. Some studies use traditional machine-learning approaches, while recent studies apply deep-learning and ensemble approaches.

Abdulrazzaq and Saeed compared three machine-learning recognition algorithms: Naive Bayes (NB), multilayer perceptron (MLP), and the K* algorithm [Citation17]. They used a handwritten dataset from NIST, consisting of 48,060 samples with 128 × 128-pixel images. Pre-processing was performed, resulting in 16 × 16 feature images. The authors employed the “correlated feature selection” method and obtained 37 features for training. Their main objective was to identify the best classifier among the three algorithms, achieving high accuracy with minimal features. The K* algorithm performed better than NB and MLP, with a recognition rate of 82.36%. However, image resizing and random sample selection negatively affected recognition accuracy, leading to more misrecognized digits. The NIST dataset presented challenges due to numerous confusing images, making it difficult to find the optimal classifier through random sampling. In a study by Boukharouba and Bennia [Citation18], a novel feature extraction technique was proposed for recognizing Persian handwritten digits. They utilized a Freeman chain code that incorporated vertical and horizontal information. The advantage of their feature extraction approach is that it eliminates the need for normalization. They employed a support vector machine (SVM) as the classification algorithm and evaluated their model using 80,000 Persian numerals, achieving an accuracy of 98.55%.

Liu et al. [Citation19] compared the recognition time and rate of neural networks with backpropagation (BP), support vector machines (SVM), k-nearest neighbour (KNN), and CNN. They used the MNIST handwritten dataset, which consists of 70,000 28 × 28 images for training and testing. Randomly selecting 5000 samples for training and 1000 samples for testing, the authors followed specific procedures for training the classifier, including input, pre-processing, feature selection, classifier design, and classification decision steps. For the testing phase, they repeated the input, pre-processing, feature selection, and classification decision steps to obtain recognition results. The advantages and disadvantages of the four algorithms were analyzed and reported. The simulation test results revealed that the CNN algorithm achieved the highest recognition accuracy. Abu Ghosh and Maghari [Citation20] compared three popular neural network approaches: CNN, DBN, and DNN. Their system consisted of six steps: pre-processing, segmentation, feature extraction, classification, training, and recognition. They used the MNIST handwritten dataset and a self-created random dataset to verify their results. The random dataset contained 85 different handwritten digits from various sources. The authors evaluated the three algorithms based on accuracy and performance, considering execution time as an additional criterion. DNN outperformed the other algorithms, achieving an accuracy of 98.08%.

Ahlawat et al. [Citation21] proposed an improved digit recognition system using the MNIST dataset and a CNN algorithm. They developed two CNN architectures: CNN_3L with three layers and CNN_4L with four layers. Their main objective was to explore various parameters of the CNN architecture using the MNIST handwritten dataset. They employed optimization algorithms, including stochastic gradient descent with momentum, Adam, Adagrad, and Adadelta, to enhance the performance and speed of the neural networks. Among these optimization parameters, the Adam optimizer achieved a detection rate of 99.89% for the CNN model with three layers, surpassing previously reported results in this field.

The authors [Citation22] proposed recognizing Arabic handwritten characters using ensemble convolutional neural networks. They utilized adaptive gradient descent in the initial epoch for convergence and regular stochastic descent in the final epoch. The study employed K-fold and Monte Carlo cross-validation methods. With the MADbase dataset, they achieved a test accuracy of 99.47% and a classification accuracy of 99.74%. To enhance Arabic handwritten numeral recognition, Ahamed et al. [Citation8] proposed a convolutional neural network architecture. Their CNN architecture comprised two convolutional networks with 32 filters and 5 × 5 kernels, two max-pooling networks with a 2 × 2 kernel, two dropout layers, and two fully connected layers. They conducted two activities in their study: improving a previous research model by Ashiquzzaman et al. [Citation23] and introducing a newly proposed method. Using augmentation techniques, they expanded the initial dataset from 3000 to 72,000 images. The improved model achieved 98.95% accuracy, while the newly proposed model achieved 99.96% accuracy.

Ahlawat and Choudhary [Citation24] proposed an automatic recognition model for handwritten digits utilizing CNN for feature extraction and SVM for output prediction. By combining CNN and SVM, they achieved a classification accuracy of 99.28% on the MNIST dataset of handwritten digits. Ali et al. [Citation25] developed an improved method for recognizing handwritten digits using a CNN with a DeepLearning4J framework. The CNN model, consisting of three convolutional layers, one max-pooling layer, and two fully connected layers, is trained and tested on the MNIST handwritten dataset. The model achieved an overall accuracy of 99.21% when tested on a dataset of 5,130 images, with only 42 misidentifications. The computation time for training and testing was also improved. In their work [Citation26], Yu et al. presented an improved LeNet5 digit recognition algorithm. They replaced the last two layers of LeNet5 with an SVM, utilizing SVM as the classifier and LeNet5 as the feature extractor. Yao et al. [Citation27] proposed a CapsNet-based recognition and separation method for fully overlapping digits. They introduced three main modifications to the original capsule network [Citation28]: the use of a small convolution kernel, the expansion of capsule dimensions to express extracted features, and the application of dual dynamic routing. Their proposed model, called “FOD_DCNet,” achieved an accuracy of 93.53%, surpassing the performance of CapsNet by 5.43%.

End-to-end learning

E2E learning is a DL or machine-learning approach that aims to simplify the process of solving a task by combining all the stages into a single model [Citation29]. Traditionally, machine-learning pipelines involve multiple steps such as pre-processing, feature extraction, and classification. E2E learning eliminates the need for handcrafted features or intermediate representations and allows the model to learn directly from raw input data to produce the desired output.

This approach has gained popularity with the advancement of deep-learning and neural networks [Citation30]. DL models, such as CNN and recurrent neural networks (RNN), have demonstrated exceptional performance in various domains like computer vision, natural language processing, speech recognition, and pattern recognition.

The main advantage of E2E learning is its simplicity. By removing the need for complex feature engineering, the overall system design becomes more straightforward. Additionally, end-to-end learning models are flexible and can adapt to different domains and tasks. They automatically learn relevant features and representations from the raw data, capturing intricate patterns and dependencies [Citation4].

However, there are some limitations to consider. E2E learning models typically require large amounts of labelled training data to generalize well, which can be time-consuming and costly to acquire. The complex representations learned by these models may also lack interpretability, making it challenging to understand the reasons behind their decisions. Furthermore, the integrated nature of E2E models can make it difficult to reuse or modify specific parts for different tasks [Citation31].

So, we can generalize E2E learning as a powerful approach that allows a model to learn directly from raw input data to produce the desired output. It simplifies the machine-learning process, and achieves state-of-the-art results in various domains, but also requires substantial amounts of labelled data and may lack interpretability and modularity.

Studies done on handwritten Ge’ez numerals

In the realm of Ge'ez numeral recognition, limited prior studies have been conducted. Agegnehu et al. [Citation32] applied deep learning to recognize Amharic punctuation marks and offline handwritten digits using a CNN architecture. Their dataset consisted of 5800 images collected from 100 different handwriting samples, achieving a testing accuracy of 70.04%. Furthermore, studies have also explored Ge'ez digits in contexts beyond printed and handwritten forms. Tamiru et al. [Citation33] proposed a study on recognizing Amharic alphabetic sign language, benefiting individuals with hearing loss. Their model consisted of pre-processing, regimentation, feature extraction, and recognition stages, utilizing digital image processing and machine-learning algorithms such as SVM (98.06%) and ANN (80.86%) for classification. Abeje et al. [Citation34] focused on Ethiopian sign language recognition using a deep convolutional neural network. Their system encompassed pre-processing, feature extraction, and classification stages, employing digital image processing techniques. The dataset used in their study was collected from hearing-impaired students.

While the majority of researchers have predominantly focused on the recognition of English numerals, the absence of publicly available Ge'ez digit datasets has deterred extensive research in this area. However, Nur et al. [Citation3] advanced the field of Ge'ez digit recognition by developing a recognition model with improved accuracy and an experimental dataset comprising 51,952 digit images written by 524 individuals. Their proposed CNN architecture achieved a recognition accuracy of 96.21%. The authors recommended exploring multi-digit recognition using diverse deep-learning approaches for future research; that recommendation motivated the present study.

Methodology

In this section, we present the details of our E2E handwritten Ge’ez multiple-number recognition system. The purpose of this study is to develop a recognition model that can accurately recognize handwritten multiple digits in Ge’ez numerals. The architecture of the model and the dataset used to train and evaluate the model are explained.

The dataset preparation

To address the issue of limited datasets in deep-learning studies, we recognize the significance of having a sufficiently large and diverse dataset for model training and testing. In this research, we made concerted efforts to overcome this challenge by developing an extensive dataset specifically tailored to our DL problem.

The dataset used in this study is derived from the initial handwritten dataset introduced by Nur et al. [Citation3]. The original dataset comprises 51,952 preprocessed greyscale sample images of single digits with a dimension of 32 by 32 pixels (see Figure 1). To extend the scope of the dataset and accommodate multiple digits sequentially, we synthetically generated a new dataset of multi-digit images from the original dataset.

Figure 1. Sample data from the initial handwritten Ge’ez digit dataset.


As the dataset is synthetic, we utilize an automated multi-digit image creator algorithm to create sequential data. To achieve this, we developed a program using the OpenCV image toolkit, specifically its concatenation method; the digits are selected randomly. The algorithm is depicted in Table 2.

Table 2. Synthetic sequential multi-digit creator algorithm.

The newly generated dataset consists of 120,000 image samples with a dimension of 32 by 128 pixels (see Figure 2). Each image in this dataset contains a variable number of sequential digits. This expansion allows us to explore and evaluate the performance of the OCR model in scenarios involving multiple numerals within a single image.

Figure 2. Sample images from synthetically generated 32 by 128 multiple numerals dataset.


By leveraging the initial dataset provided by Nur et al. [Citation3], we were able to build upon their work and extend it to address the specific requirements of our study. The synthetic generation of the dataset provides a more diverse and challenging set of images, enabling a comprehensive evaluation of the OCR model’s capabilities in handling sequential multiple digits.

Generating the sequential synthetic data follows the algorithmic steps below; a flowchart of the steps is depicted in Figure 3.

  1. Original dataset from [Citation3]: randomly select images from the original dataset, each containing a single digit.

  2. Data augmentation: add noise, erosion, and dilation.

  3. Pre-processing: resize, transpose, and normalize the image.

  4. Output: a 32 by 128-pixel image consisting of multiple handwritten Ge’ez numerals.
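The steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function name `make_multi_digit_image` and the Gaussian-noise augmentation parameter are hypothetical, and numpy's `hstack` stands in for OpenCV's concatenation.

```python
import numpy as np

RNG = np.random.default_rng(0)

def make_multi_digit_image(digit_images, n_digits=4, noise_std=10.0):
    """Concatenate randomly chosen 32x32 digit images into one 32x128
    multi-digit image, add light Gaussian noise, and normalize to [0, 1]."""
    idx = RNG.choice(len(digit_images), size=n_digits)
    row = np.hstack([digit_images[i] for i in idx])          # 32 x (32 * n_digits)
    row = row.astype(np.float32) + RNG.normal(0.0, noise_std, row.shape)
    row = np.clip(row, 0, 255)
    return row / 255.0, list(idx)                            # image + label sequence
```

With four digits per image, the output matches the 32 by 128 dimension of the generated dataset.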

Figure 3. Experimental synthetic dataset preparation steps.


The proposed network architecture

The proposed recognition model consists of two main steps: feature extraction and the recognition process. These steps work together to achieve improved recognition accuracy.

Feature extraction process

In the feature extraction part, we employ seven CNN layer blocks, which automatically extract relevant features from the input image, and a SoftMax-based attention mechanism. Each CNN layer block consists of convolutional layers, max-pooling layers, and batch normalization. These components play specific roles in the feature extraction process.

Convolutional layers perform convolution operations on the input image. The convolution between an input image $I$ and a filter/kernel $K$ is computed as in Equation (1):

(1) $(I * K)[i,j] = \sum_{x} \sum_{y} I[x,y]\, K[i-x,\, j-y]$

In Equation (1), $(I * K)[i,j]$ represents the value at position $[i,j]$ in the resulting feature map. The double summation computes the element-wise multiplication and summation over the spatial dimensions, where $I[x,y]$ and $K[i-x, j-y]$ represent the corresponding values in the input image and the filter/kernel, respectively.

The convolutional layers are configured with parameters such as feature maps, kernel size, stride, and padding, which determine how the convolutions are applied to the input image. By applying these layers, the model can capture important patterns and structures within the image.
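Equation (1) can be illustrated with a minimal "valid" 2D convolution written directly from the definition; this is a didactic sketch (frameworks such as Keras use vectorized, strided, padded implementations), and `conv2d` is a hypothetical helper name.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as in Equation (1): flip the kernel, then
    slide it over the image with an element-wise multiply-and-sum."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]                 # K[i-x, j-y] implies a 180-degree flip
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * flipped)
    return out
```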

The max-pooling operation reduces the dimensionality of the feature maps by selecting the maximum value within each pooling region. Given a feature map $F$, the max-pooling operation can be expressed as in Equation (2):

(2) $\mathrm{MaxP}(F)[i,j] = \max_{(x,y) \in R_{i,j}} F[x,y]$

Here, $\mathrm{MaxP}(F)[i,j]$ represents the value at position $[i,j]$ in the down-sampled feature map, and $R_{i,j}$ is the pooling region covered by that position. This pooling operation helps to retain the most salient features while discarding unnecessary information, leading to more efficient processing and improved generalization.
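Non-overlapping max pooling over square regions can be sketched as follows; `max_pool` is an illustrative name, and the reshape trick assumes the spatial dimensions are truncated to multiples of the pool size.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the maximum of each
    size x size region, halving each spatial dimension for size=2."""
    h, w = feature_map.shape
    out = feature_map[:h - h % size, :w - w % size]
    out = out.reshape(out.shape[0] // size, size, out.shape[1] // size, size)
    return out.max(axis=(1, 3))
```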

Batch normalization helps normalize the feature maps by subtracting the mean and dividing by the standard deviation. Given a feature map $F$, the batch normalization operation can be represented as in Equation (3):

(3) $\mathrm{BN}(F)[i,j] = \dfrac{F[i,j] - \mathrm{mean}(F)}{\sqrt{\mathrm{variance}(F) + \epsilon}}$

where $\mathrm{BN}(F)[i,j]$ represents the normalized value at position $(i,j)$ in the feature map, $\mathrm{mean}(F)$ represents the mean of the feature map, $\mathrm{variance}(F)$ denotes the variance, and $\epsilon$ is a small constant to avoid division by zero. The subtraction and division operations ensure that the mean is zero and the standard deviation is one. This normalization step ensures that the recognition process is not affected by outliers or irregularities in the data. By maintaining a consistent scale and distribution of the features, the model becomes more robust and reliable.
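The normalization step reads directly as code; this minimal sketch (`batch_norm` is an illustrative name) omits the learnable scale and shift parameters that trained batch-normalization layers also carry.

```python
import numpy as np

def batch_norm(feature_map, eps=1e-5):
    """Normalize a feature map to zero mean and (approximately) unit
    variance; eps guards against division by zero."""
    mean = feature_map.mean()
    var = feature_map.var()
    return (feature_map - mean) / np.sqrt(var + eps)
```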

Through the combination of these CNN layer blocks, the proposed model can effectively extract meaningful features from the input image, which serves as a crucial foundation for accurate recognition. The feature extraction steps and the attribute values of each are outlined in Table 3. The attention mechanism selectively focuses on relevant parts of the feature maps. Given a set of feature maps $F$ and their corresponding attention weights $W$, the attention mechanism can be computed using Equation (4):

(4) $A(F, W) = \sum_{t} \sum_{i} F[t,i]\, W[t,i]$

where $A(F, W)$ represents the attended feature vector. The double summation computes the element-wise multiplication and summation over the temporal ($t$) and spatial ($i$) dimensions of the feature maps. The attention weights $W$ assign importance or relevance to each element in the feature maps, allowing the model to concentrate on the most informative regions. By assigning different weights to different regions, the model can allocate its attention to the most informative areas, improving recognition accuracy. This mechanism helps the model handle variations in scale, orientation, and appearance within the input image.

Table 3. The proposed model convolutional layer blocks along with the parameter values.
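A SoftMax-based attention step of this kind can be sketched as follows: raw relevance scores are turned into weights that sum to one, and the attended vector is the weighted sum of the per-step features. The function names are illustrative, not the paper's.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(features, scores):
    """Weight each time step of `features` (T x D) by softmax(scores)
    and return the attended feature vector plus the weights."""
    weights = softmax(scores)                     # one weight per time step
    attended = (features * weights[:, None]).sum(axis=0)
    return attended, weights
```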

Recognition process

In the recognition process, two main techniques are utilized in addition to the SoftMax dense layer to further enhance the model’s performance. These techniques include bidirectional RNNs and CTC.

Bidirectional RNNs are pivotal for capturing dependencies in both the forward and backward directions within a given input sequence $X$. This bidirectional processing occurs in two passes, and the outputs of the forward and backward RNNs at each time step are concatenated to form the final hidden state representation, as expressed in Equation (5):

(5) $h[t] = \mathrm{Concat}(h_f[t],\, h_b[t])$

where $h[t]$ represents the hidden state at time step $t$, and $h_f[t]$ and $h_b[t]$ represent the outputs of the forward and backward RNNs, respectively.

The Concatenate operation fuses these two representations, enabling the model to capture contextual information from both past and future time steps. This bidirectional approach ensures a comprehensive understanding of the entire input sequence, enhancing the model’s ability to discern sequential patterns and structures.

Moreover, recurrent neural networks (RNNs) have become standard in processing sequential data. In our case, RNNs maintain previous inputs within the internal network state, allowing the model to leverage past context for improved recognition accuracy.

Long Short-Term Memory (LSTM) networks, a specific type of RNN, are employed to address the challenge of learning long-term dependencies. Hochreiter and Schmidhuber [Citation35] proposed the LSTM cell to overcome the vanishing gradient problem associated with traditional RNNs. Comprising three sigmoid activation function gates – the Input Gate, Forget Gate, and Output Gate – LSTM networks facilitate effective learning of intricate temporal dependencies. Equations (6) to (11) give the mathematical expressions for the LSTM gates:

(6) $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

(7) $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$

(8) $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

(9) $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

(10) $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$

(11) $h_t = o_t \odot \tanh(C_t)$
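A single LSTM time step following Equations (6) to (11) can be written out directly; this is a didactic sketch, and the weight/bias containers `W` and `b` are illustrative names rather than the paper's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_{t-1}, x_t] to each
    gate's pre-activation; b holds the corresponding biases."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])              # forget gate, Eq. (6)
    i = sigmoid(W["i"] @ z + b["i"])              # input gate, Eq. (7)
    c_tilde = np.tanh(W["c"] @ z + b["c"])        # candidate state, Eq. (8)
    c = f * c_prev + i * c_tilde                  # cell state, Eq. (9)
    o = sigmoid(W["o"] @ z + b["o"])              # output gate, Eq. (10)
    h = o * np.tanh(c)                            # hidden state, Eq. (11)
    return h, c
```

With all weights zero, every gate outputs 0.5, so the cell state simply halves at each step, which makes the recursion easy to check by hand.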

Figure 4 provides a visual representation of the LSTM gates’ architecture at a specific timestamp, enhancing the conceptual understanding of the LSTM’s inner workings.

CTC loss is utilized as a loss function during training to align the predicted sequence (PS) with the ground truth (GT). Given a predicted sequence ($Y_{pred}$) and the GT sequence ($Y_{true}$), the CTC loss is calculated as in Equation (12), where $p(Y_{true} \mid Y_{pred})$ denotes the probability of obtaining the GT sequence given the PS:

(12) $\mathrm{CTCLoss}(Y_{pred}, Y_{true}) = -\log p(Y_{true} \mid Y_{pred})$

The CTC loss is typically computed using dynamic programming algorithms, such as the forward–backward algorithm, to sum over all possible alignments. The negative logarithm is applied to penalize the deviation between the predicted and GT sequences. CTC enables the model to handle input sequences of variable lengths and perform E2E training. It automatically aligns the PS with the GT, accounting for possible repetitions, deletions, and insertions. This approach eliminates the need for explicit alignment information and improves the model’s ability to recognize sequences accurately.
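The dynamic-programming core of CTC can be illustrated with the forward (alpha) recursion over the blank-interleaved target. This is a minimal didactic sketch, not the implementation used in training frameworks, which work in log space with forward-backward passes for numerical stability; `ctc_loss` and `BLANK` are illustrative names.

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol

def ctc_loss(probs, target):
    """Negative log-likelihood of `target` given per-frame label
    probabilities `probs` (T x C), summed over all valid alignments."""
    ext = [BLANK]
    for lab in target:                 # interleave blanks: b, l1, b, l2, b, ...
        ext += [lab, BLANK]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, BLANK]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                       # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]              # advance by one
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]              # skip a blank between labels
            alpha[t, s] = a * probs[t, ext[s]]
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(p)
```

For two frames of uniform probabilities over {blank, 1} and target "1", the valid paths are "b1", "1b", and "11", so the total probability is 3/4.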

Figure 4. The LSTM gates architectures at timestamp ∼ t [Citation12].


By incorporating these techniques into the feature extraction process above, the proposed model achieves superior performance in recognizing complex patterns and sequences, leading to enhanced recognition results overall. In light of the aforementioned concepts, we propose an E2E learning recognition model, as illustrated in Figure 5, specifically designed for recognizing handwritten multiple digits in Ge’ez numerals. The model architecture includes convolutional layer blocks with the parameter values listed in Table 3 to extract features from the input. Additionally, as depicted in Table 4, two layers of bidirectional LSTM cells, each with 128 hidden units, are utilized to capture long-term dependencies in sequential digit datasets. Dropout with a rate of 25% is applied to regularize the model and prevent overfitting.

Figure 5. The proposed E2E handwritten Ge’ez multiple numerals recognition network architecture.


Table 4. The proposed model recurrent neural networks along with the parameter values.

Therefore, this model directly learns the mapping from input to output by optimizing the parameters of the entire model. During training, the model implicitly learns to extract relevant features or representations from the input, eliminating the need for manual feature engineering or intermediate stages.

Our proposed model’s performance is evaluated using two metrics: CER and WER. CER is calculated by Equation (13):

(13) $\mathrm{CER} = \dfrac{I + D + S}{N} \times 100$

Here, $I$, $D$, and $S$ represent the number of character insertions, deletions, and substitutions respectively, and $N$ is the total number of characters in the GT text.

Similarly, WER is calculated by considering word-level errors. The WER formula can be expressed as in Equation (14):

(14) $\mathrm{WER} = \dfrac{I_w + D_w + S_w}{N_w} \times 100$

where $I_w$, $D_w$, and $S_w$ represent the number of word insertions, deletions, and substitutions respectively, and $N_w$ is the total number of words in the ground truth.

Both CER and WER provide valuable insights into the error rates of the recognition system. By evaluating these metrics, we can measure the effectiveness of our proposed model in accurately transcribing characters and words, and identify areas where improvements may be required.
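In practice the insertion, deletion, and substitution counts come from an edit-distance (Levenshtein) alignment between the ground truth and the prediction; CER runs it over characters and WER over whitespace-delimited words. A minimal sketch (the helper names are illustrative):

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `hyp` into `ref` (Levenshtein distance)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

def cer(reference, hypothesis):
    """Character error rate in percent."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference) * 100

def wer(reference, hypothesis):
    """Word error rate in percent over whitespace-delimited words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words) * 100
```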

In addition to our primary evaluation metrics, CER and WER, we employed secondary evaluation metrics to provide a comprehensive analysis of the model’s performance. These metrics, namely precision, recall, and F1-score, focus on the classification aspect of the model’s predictions, specifically in terms of correctly predicted characters and their positions within the sequence.

To better interpret these metrics, let’s define the parameters used in generating the classification report. TP (true positive) signifies the number of positive instances correctly predicted, whereas FP (false positive) denotes the number of positive instances predicted incorrectly. On the other hand, TN (true negative) indicates the number of negative instances correctly predicted, and FN (false negative) represents the number of negative instances predicted incorrectly. Equations (15), (16), and (17) are used to compute the precision (P), recall (R), and F1-score respectively:

(15) $P = \dfrac{TP}{TP + FP}$

(16) $R = \dfrac{TP}{TP + FN}$

(17) $F_1 = \dfrac{2PR}{P + R}$

Precision is also known as the positive predictive value (PPV) and it reveals the proportion of correctly identified positive instances out of all instances predicted as positive. It provides insights into the accuracy of the positive predictions made by the model. The recall represents the proportion of correctly identified positive instances out of all actual positive instances. It measures the model’s ability to capture positive instances correctly within the dataset. F1-score combines precision and recall into a single metric. It offers a balanced measure of the model’s overall performance by considering both precision and recall. The F1 score allows us to assess the model’s ability to achieve accurate positive predictions while capturing a sufficient number of positive instances.
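Equations (9)–(11) reduce to a few lines of code once the raw counts are available. A minimal sketch, assuming TP/FP/FN are already tallied per class (the function name `precision_recall_f1` is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts
    (Equations 9-11), guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```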

By analysing these secondary evaluation metrics, including precision, recall, and F1-score, we gain a deeper understanding of the model’s classification performance. These metrics provide valuable insights into the model’s ability to accurately predict characters within the sequence and make informed assessments of its performance.

Results and discussion

The proposed model was implemented in Python using the Keras framework with TensorFlow as the backend. To expedite training, powerful GPUs were accessed through Google Colaboratory, eliminating the need for expensive hardware. Colaboratory, a cloud-based service, enables remote code execution from anywhere with an internet connection. Its user-friendly interface and support for popular data science libraries make it an excellent choice for deep-learning projects.

A total of 120,000 images were used for the experiment. The dataset was first divided into three subsets: training, validation, and testing. The training set consists of 96,000 (80%) images, the validation set of 12,000 (10%) images, and the test set of 12,000 (10%) images.
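The 80/10/10 split described above can be sketched as follows; the seed value and the function name `split_dataset` are illustrative assumptions:

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and partition samples into 80% training,
    10% validation, and 10% test subsets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```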

During the experimentation phase, we carefully considered and fine-tuned various network parameters to ensure optimal performance. The results reported in this study were obtained using an Adam optimizer in conjunction with a CNN that progressively increased the feature map size from 32 to 512. Additionally, a BLSTM network with two hidden layers, each consisting of 128 units, was utilized. The initial learning rate chosen for training the model was set to 0.001.

Through systematic parameter selection and experimentation, this configuration yielded favourable outcomes, which are discussed in this study. The combination of the Adam optimizer, CNN with increasing feature maps and BLSTM with two hidden layers proved effective in capturing relevant patterns within the data. Moreover, the learning rate of 0.001 facilitated a suitable balance between convergence speed and optimization accuracy.
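The reported configuration can be sketched in Keras as follows. This is a hypothetical reconstruction rather than the authors' code: the input resolution (256 × 64), the pooling schedule, the number of output classes (20 Ge'ez numeral symbols plus a CTC blank), and the column-to-timestep reshaping are all illustrative assumptions; only the feature maps growing from 32 to 512, the two 128-unit BLSTM layers, and the Adam optimizer with a 0.001 learning rate come from the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(img_width=256, img_height=64, num_classes=20):
    """Hypothetical CNN-BLSTM recognizer mirroring the reported
    hyperparameters; the CTC loss would be attached during training."""
    inputs = layers.Input(shape=(img_width, img_height, 1))
    x = inputs
    # Convolutional stack: feature maps grow 32 -> 512,
    # each block halving both spatial dimensions.
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    # Collapse the remaining height/channel axes so each
    # horizontal position becomes one timestep for the BLSTM.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # +1 output unit for the CTC blank symbol.
    outputs = layers.Dense(num_classes + 1, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
    return model
```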

Results

Discussions

In this experiment, the evaluation of the proposed recognition model was based on two key metrics (CER and WER), as described in section three of this study. The results demonstrate the model’s strong performance, with an average CER of 2.81% and a WER of 18.13% at an empirically selected epoch count of 30 and a batch size of 128. Moreover, the model accurately predicted 97.18% of individual digits and recognized 81.87% of the full-digit sequences in their entirety. These outcomes are highly promising, especially considering that this study represents the first attempt at multiple Ge'ez digit recognition.

Errors are a common occurrence in deep-learning models, and our recognition model is no exception. In Figure 6, we showcase some examples of recognition errors. In the first image sequence, the digit “” is mistakenly classified as “,” while in the second sequence, the image of “” is incorrectly identified as “.” The digit enclosed within the red box is the misrecognized digit. In the label portions, blue indicates the ground-truth label for the corresponding character image, and red highlights the misrecognized images based on the predicted label. These errors can be attributed to the close resemblance between the shapes of these digits, particularly their curved nature; the model may struggle to distinguish between them due to their visual similarities.

Figure 6. Sample misrecognized Ge’ez digits (a) The error that happened between the beginning and the end of the sequence. (b) The error that happens at the end of the sequence.

The training and validation loss of the proposed end-to-end model for multiple handwritten Ge’ez numerals is depicted in Figure 7. The training and validation curves converge, suggesting that the model’s performance on unseen data (the validation set) is consistent with its performance on the training data. This is a positive sign that the model is learning effectively and is likely to perform well on new, unseen data.

Figure 7. Training and validation loss versus epoch of the proposed multiple handwritten Ge’ez numerals model.

The performance evaluation of the recognition model in recognizing Ge'ez numerals reveals high accuracy and effectiveness. Precision, recall, and F1 scores were used to assess the model’s performance for each Ge'ez numeral (see Table 5). The precision values ranged from 0.94 to 0.99, indicating the model’s ability to accurately classify and identify the Ge'ez characters; most digits achieved precision scores above 0.95. Similarly, the recall values were consistently high, ranging from 0.95 to 0.99, indicating the model’s ability to retrieve a significant portion of the true positive instances for each numeral and reflecting its proficiency in capturing the relevant patterns and features of the Ge'ez characters. The F1 scores, which consider both precision and recall, ranged from 0.95 to 0.99 across the different Ge'ez numerals, demonstrating the model’s strong capability to balance accurate classification with comprehensive retrieval. The support column reports the number of instances of each Ge'ez numeral in the evaluation dataset, which was well balanced, with a similar number of instances for most characters, ranging from 5075 to 5390. Overall, these results indicate that the recognition model performed exceptionally well in recognizing the Ge'ez numerals, exhibiting high precision, recall, and F1 scores, and showcase the model’s potential for various applications involving Ge'ez numeral recognition.

Table 5. Precision, recall, f1-score, and support results of the proposed recognition model.

In terms of individual digit performance, as depicted in Figure 8, the Ge'ez digit achieved the highest precision, recall, and F1-score, demonstrating exceptional accuracy and completeness in recognition. The digits and also exhibited strong performance, closely following the top-performing digit with precision, recall, and F1 scores of 0.98. Overall, the Ge'ez digits , , and performed exceptionally well, exhibiting the highest precision, recall, and F1 scores. These digits consistently achieved high accuracy and retrieval rates, making them the most reliably recognized digits by the model.

Figure 8. Precision, recall, f1-score, and support results of the proposed recognition model.

Limitations. The proposed multiple handwritten Ge’ez numeral recognition method was not tested with real-life datasets. It is important to note that the datasets used in this experiment consisted solely of synthetic handwritten samples. As a result, the model's performance may differ when faced with real-life handwritten data, which often exhibits greater variability and complexity. Additionally, the recognition system faced challenges due to the visual similarities of the numerals. These similarities can lead to confusion and recognition errors, especially when the model is trained on a limited number of samples for certain Ge’ez numerals. The scarcity of training data for specific characters can hinder the system’s ability to accurately recognize them. To address these limitations and enhance the recognition performance of the system, it is crucial to expand the training dataset with a more diverse and representative collection of real-life handwritten samples. Increasing the number of training samples for each numeral would enable the model to learn and generalize better, improving its ability to handle the intricacies and variations present in real-life data.

Conclusion

In conclusion, this study introduces an end-to-end learning model specifically designed for recognizing handwritten Ge'ez multiple numerals. By leveraging deep-learning techniques, our model automatically extracts relevant features from input images and effectively captures the sequential nature of the numerals. The integration of attention mechanisms and a CTC-based loss function allows the model to focus on crucial regions of the input, enabling end-to-end training without the need for explicit alignment between images and labels. The experimental results demonstrate a CER of 2.81% and a WER of 18.13% for the handwritten Ge'ez numeral recognition task, validating the effectiveness of our proposed approach.

Our model not only contributes to the underexplored area of OCR research for the Ge'ez script but also opens up new possibilities for Ge'ez numeral recognition. Future work can be improved by expanding the dataset to include other characters in the Ge’ez language, exploring alternate deep-learning architectures, and investigating the applicability of the approach to other languages and scripts. Continuous research and development efforts hold the potential to further improve the recognition of handwritten Ge'ez multiple numerals, benefiting various applications in pattern recognition and document processing.

To enhance the performance and reliability of our proposed method in real-life scenarios, future iterations of the recognition system should incorporate a larger and more diverse dataset. This step will address the challenges posed by real-life handwritten data and help mitigate recognition errors, ultimately improving the practical applicability of the recognition system. By continually refining and advancing our approach, we can achieve more accurate and robust recognition of handwritten Ge'ez multiple numerals, paving the way for broader adoption and utilization in various fields.

Acknowledgment

The authors would like to express their sincere gratitude to the faculty members, staff, and students of Delhi Technological University for their invaluable guidance and support throughout the completion of this manuscript. Their continuous assistance and expertise have played a crucial role in the success of this research project. The authors are also grateful for their unwavering commitment to fostering academic excellence and their contributions to our growth as researchers.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Notes on contributors

Ruchika Malhotra

Ruchika Malhotra received a master’s and Ph.D. in software engineering from the University School of Information Technology, Guru Gobind Singh Indraprastha University, Delhi, India. She is a Professor at the Department of Software Engineering, Delhi Technological University, Delhi. She has published more than 200 research papers in international journals and conferences. Her research interests include software testing, improving software quality, statistical and adaptive prediction models, software metrics, neural net modeling, and the definition and validation of software metrics.

Maru Tesfaye Addis

Maru Tesfaye Addis received a bachelor’s degree in computer science and information technology from Arba Minch University, Ethiopia, in 2010, and a master’s degree in computer science from Bahir Dar University, Ethiopia, in 2017. He is currently pursuing a Ph.D. degree in computer science and engineering at Delhi Technological University, India. He is also a Lecturer with the Department of Computer Science, at Debre Tabor University. His research interests include pattern recognition, deep learning, artificial intelligence, and image processing. He is also dedicated to advancing knowledge and contributing to the development of innovative solutions in these areas.

References

  • Meyer R. The Ethiopic script: linguistic features and socio-cultural connotations. Multiling Ethiopia: Ling Challen Capacity Build Efforts. 2016;8(1):137–172.
  • Tadesse DA, Liu CM, Ta VD. Unconstrained bilingual scene text reading using octave as a feature extractor. Appl Sci. 2020;10(4474):1–14.
  • Nur MA, Abebe M, Rajendran RS. Handwritten Geez digit recognition using deep learning [open access]. Appl Comput Intell Soft Comput. 2022;2022:1–12.
  • LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86:2278–2324. doi:10.1109/5.726791
  • Fateh A, Fateh M, Abolghasemi V. Multilingual handwritten numeral recognition using a robust deep network joint with transfer learning. Inform Sci. 2021;581:479–494. doi:10.1016/j.ins.2021.09.051
  • Goel P, Garantra A. Handwritten Gujarati numerals classification based on deep convolution neural networks using transfer learning scenarios. IEEE Access. 2023;11:20202–20215. doi:10.1109/ACCESS.2023.3249787
  • Qiao J, Wang G, Li W, et al. An adaptive deep Q-learning strategy for handwritten digit recognition. Neural Netw. 2018;107:61–71. doi:10.1016/j.neunet.2018.02.010
  • Ahamed P, Kundu S, Khan T, et al. Handwritten Arabic numerals recognition using convolutional neural network. J Amb Intel Hum Comp. 2020;11:5445–5457. doi:10.1007/s12652-020-01901-7
  • Obsie EY, Qu H, Huang Q, editors. Amharic character recognition based on features extracted by CNN and auto-encoder models. The 13th International Conference on Computer Modeling and Simulation; Melbourne VIC, Australia: ACM; 2021 June 25–27.
  • Meshesha M, Jawahar CV. Recognition of printed Amharic documents. Eighth International Conference on Document Analysis and Recognition (ICDAR'05); 31 Aug.-1 Sept.; Seoul, Korea (South): IEEE; 2005. p. 784–788.
  • Belay BH, Habtegebrial TA, Stricker D. Amharic character image recognition. 18th IEEE International Conference on Communication Technology; 08-11 October Chongqing, China: IEEE; 2018. p. 1179–1182.
  • Malhotra R, Addis MT. Ethiopic base characters image recognition using LSTM. 2021 2nd International Conference on Computational Methods in Science & Technology (ICCMST); Mohali, India: IEEE; 2021. p. 94–98.
  • Addis D, Liu CD, Ta VD. Printed Ethiopic script recognition by using LSTM networks. 2018 International Conference on System Science and Engineering (ICSSE); 28-30 June; New Taipei, Taiwan: IEEE; 2018. p. 1–6.
  • Aly S, Mohamed A. Unknown-length handwritten numeral string recognition using cascade of PCA-SVMNet classifiers. IEEE Access. 2019;7:52024–52034. doi:10.1109/ACCESS.2019.2911851
  • Belay B, Habtegebrial T, Meshesha M, et al. Amharic OCR: an end-to-end learning [open access]. Appl Sci-Basel. 2020 Feb;10(1117):1–13.
  • Azawi N. Handwritten digits recognition using transfer learning. Comput Electr Eng. 2023;106:1–10. doi:10.1016/j.compeleceng.2023.108604
  • Abdulrazzaq MB, Saeed JN. A comparison of three classification algorithms for handwritten digit recognition. 2019 International Conference on Advanced Science and Engineering (ICOASE); 02-04 April; Zakho - Duhok, Iraq: IEEE; 2019. p. 58–63.
  • Boukharouba A, Bennia A. Novel feature extraction technique for the recognition of handwritten digits. Appl Comput Inform. 2017;13(1):19–26. doi:10.1016/j.aci.2015.05.001
  • Liu W, Wei J, Meng Q. Comparisons on KNN, SVM, BP and the CNN for handwritten digit recognition. 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA); Nanjing, China: IEEE; 2020. p. 587–590.
  • Abu Ghosh MM, Maghari AY. A comparative study on handwriting digit recognition using neural networks. 2017 International Conference on Promising Electronic Technologies; Palestine: IEEE; 2017. p. 77–81.
  • Ahlawat S, Choudhary A, Nayyar A, et al. Improved handwritten digit recognition using convolutional neural networks (CNN). Sensors (Basel). 2020 Jun 12;20(12):1–8. doi:10.3390/s20123344
  • de Sousa IP. Convolutional ensembles for Arabic handwritten character and digit recognition. Peerj Comput Sci. 2018;167:1–13.
  • Ashiquzzaman A, Tushar AK, Rahman A, et al. An efficient recognition method for handwritten Arabic numerals using CNN with data augmentation and dropout. Data Management, Analytics and Innovation. Singapore: Springer Singapore; 2019. p. 299–309.
  • Ahlawat S, Choudhary A. Hybrid CNN-SVM classifier for handwritten digit recognition. International Conference on Computational Intelligence and Data Science (ICCIDS 2019); New Delhi, India: Elsevier B.V.; 2019. p. 2554–2560.
  • Ali S, Shaukat Z, Azeem M, et al. An efficient and improved scheme for handwritten digit recognition based on convolutional neural network. Sn Appl Sci. 2019 Sep;1:1–9.
  • Yu N, Jiao P, Zheng Y. Handwritten digits recognition base on improved LeNet5. The 27th Chinese Control and Decision Conference (2015 CCDC); Qingdao, China: IEEE; 2015. p. 4871–4875.
  • Yao H, Tan T, Xu C, et al. Deep capsule network for recognition and separation of fully overlapping handwritten digits. Comput Electr Eng. 2021;91(2021):1–12.
  • Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems: ACM Digital Library; 2017. p. 3859–3869.
  • LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi:10.1038/nature14539
  • Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, Massachusetts: MIT Press; 2016.
  • Sermanet P, Eigen D, Zhang X, et al. Overfeat: integrated recognition, localization and detection using convolutional networks. 2nd International Conference on Learning Representations, ICLR 2014; 14-16 April Banff, Canada: NYU; 2014. p. 1–16.
  • Agegnehu M, Tigistu G, Samuel M. Offline handwritten amharic digit and punctuation mark script recognition using deep learning. 2nd Deep Learning Indaba-X Ethiopia Conference 2021; January 27–29; Adama, Ethiopia: Adama Science and Technology University; 2022. p. 53–61.
  • Tamiru NK, Tekeba M, Salau AO. Recognition of amharic sign language with amharic alphabet signs using ANN and SVM. Vis Comput. 2022;38:1703–1718. doi:10.1007/s00371-021-02099-1
  • Abeje BT, Salau AO, Mengistu AD, et al. Ethiopian sign language recognition using deep convolutional neural network. Multimed Tools Appl. 2021;81:29027–29043.
  • Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997 Nov 15;9(8):1735–1780. doi:10.1162/neco.1997.9.8.1735