
Recognition of expiry dates on food packages based on improved DBNet

Pages 1-16 | Received 29 Sep 2022, Accepted 06 Apr 2023, Published online: 27 Apr 2023

Abstract

To prevent products with missing character information from reaching the market, manufacturers need an automatic character recognition method. One of the key challenges for such a method is recognising text printed over complex package patterns. In addition, some products use dot matrix characters to reduce printing costs, which makes text extraction more difficult. We propose a character detection algorithm that uses DBNet as the base network, combined with the Convolutional Block Attention Module (CBAM), to improve feature extraction of characters against complex backgrounds. After the detection algorithm has located the character area, the region is cropped and fed into a fully convolutional character recognition network to recognise the printed characters. We use ResNet as the backbone network and CTC loss for training, and the CBAM module is added to the backbone to enhance recognition of dot matrix characters. The algorithm was finally deployed on the Jetson Nano. The experimental results show that character detection accuracy reaches 97.9%, an improvement of 1.9% over the original network. For the character recognition algorithm, inference on the Nano platform is twice as fast as the CRNN network, with an accuracy of 97.8%.

1 Introduction

During the production of food and beverages, information such as the expiry date is often printed on the surface of the packaging to monitor the quality of the product and to provide consumers with important quality information. In the food packaging industry, some types of inkjet coders use dot matrix character printing because it is cost-effective and conveys information efficiently. Dot matrix characters consist of dots printed in a specific order. Although this is more economical, the text quality is inferior to other printing methods and the character features are less pronounced, so dot matrix characters are more difficult to recognise. At the same time, with the rapid growth of the food industry, companies have designed many attractive food packages to entice consumers, and packaging patterns are becoming increasingly diverse, which makes character recognition more difficult. During character printing, abnormalities in the printing equipment or environmental interference can lead to missed or misprinted characters. Failure to detect a product with a defective print before it reaches the market can damage a company's reputation and severely reduce consumers' willingness to buy the product. The number of products verified daily in these industries can be very large. There is therefore a need for a robust and accurate optical character recognition method that can recognise characters against complex background patterns.

The rise of deep learning techniques in recent years has made them a leading research method in several fields, especially computer vision. Deep learning is very powerful and can automatically learn the most effective features from data to perform various tasks. Many methods based on convolutional neural networks enable end-to-end code character recognition and better detection of printed codes in complex backgrounds. The OCR pipeline we use has two phases: text detection and text recognition. The text detection network uses a modified DBNet (Liao et al., Citation2020) to locate the characters on the package and enclose them in rectangular boxes. The character recognition network uses ResNet-18 (He et al., Citation2016) as the backbone and is a fully convolutional network trained with CTC loss to recognise the characters within those boxes. In addition, to handle skewed characters, an affine transformation is applied before the region is fed into the recognition network. Experiments show that the method used in this paper has high accuracy and good robustness. We apply it to the industrial inspection of goods' expiry dates. Combined with embedded devices, the OCR system is portable, can be quickly applied in production, and reduces costs.

The main contributions of our research are:

  1. We propose improved backbone networks for the detection and recognition models, which address the problems posed by complicated food package images and effectively improve model performance.

  2. We adopt an affine transformation to correct the skewed text lines that appear during character printing.

  3. We deploy the improved models to embedded devices, which reduces hardware costs and device size and improves portability.

The rest of this paper is organised as follows: Section 2 reviews related work, Section 3 describes the detection and recognition models used in this paper, Section 4 presents the experimental results and analysis of the detection and recognition models, Section 5 describes the deployment of the models and their performance on embedded devices, and Section 6 concludes the paper.

2 Related work

In the field of optical character recognition, there are many different solutions for extracting characters from images. Traditional character recognition methods (Inunganbi et al., Citation2020; Wang et al., Citation2020) mainly consist of the following steps: (1) pre-processing the image; (2) locating the character area and extracting the ROI; (3) segmenting and extracting individual characters in the ROI; (4) recognising the individual characters; (5) finally comparing them with the correct characters and judging whether the print is acceptable. Although traditional character recognition achieves high accuracy and speed in some simple or task-specific scenarios, it requires the design of high-precision templates, so its usage scenarios are limited and its robustness is low. With the development of deep learning techniques, some of these shortcomings have been effectively addressed, and deep learning is now widely used in the major classical vision tasks (Chen, Citation2019): image classification, object detection, text recognition, etc.

The first CNN-based text recognition method was proposed by Jaderberg et al. (Citation2014; Jaderberg et al., Citation2016), which classifies words into a predefined dictionary. Before the recognition phase, a combination of Edge Boxes (Zitnick & Dollár, Citation2014) and an aggregated channel feature (ACF) detector framework (Dollár et al., Citation2014) is used to detect candidate word locations. Following Jaderberg's work, many new approaches have emerged: for text detection, EAST (Zhou et al., Citation2017), TextSnake (Long et al., Citation2018), PSENet (Li et al., Citation2020), etc., and for text recognition, CRNN (Shi et al., Citation2017), ASTER (Shi et al., Citation2018), NRTR (Sheng et al., Citation2019), etc. However, these methods are rarely used in practical applications.

With the increase in retail merchandise, an efficient and accurate OCR system has become an urgent need. Deep learning methods are very effective in these tasks (Gong et al., Citation2018; Maitrichit & Hnoohom, Citation2020; Muresan et al., Citation2019; Wei et al., Citation2018). These methods combine deep learning with traditional machine vision, and as a result they are limited in their usage scenarios. A better solution is to adopt a deep learning approach throughout. In (Ai & Tao, Citation2020), a deep learning approach was used to detect and recognise characters on steel surfaces: the backbone of the SSD object detection model was changed to a lightweight MobileNet to improve detection speed, and a CRNN model was used for character recognition. In (Li et al., Citation2022), the inkjet-code character detection algorithm uses YOLOv5 as the base network, combined with the ECA attention mechanism to improve detection accuracy, the character recognition model uses the CRNN network, and the algorithm is finally deployed to the NVIDIA TX2 embedded device. Similar methods have been used extensively in industrial production (Florea, Citation2020; Gong et al., Citation2021; Shanthini et al., Citation2021). These methods use general object detection algorithms for text detection, but some texts vary greatly in length, and general object detectors are less effective on text with large differences in length.

As the demand for OCR has increased, many open-source OCR methods have emerged, such as Tesseract OCR, Keras OCR and Easy OCR. A Tesseract OCR-based application is presented in (Hosozawa et al., Citation2018), and its performance is compared with two other open-source OCR engines, NHocr and OCRopus (Breuel, Citation2008). Tesseract supports the most character types and has the best recognition accuracy, but images must be pre-processed to remove background influence before it performs well on complex backgrounds. An open-source OCR method is used in (Kamisetty et al., Citation2022) to recognise invoice content; the paper first performs image preprocessing, after which three OCR engines are tested: Keras OCR, Easy OCR and Tesseract OCR, of which Tesseract OCR gives the best recognition accuracy. However, these open-source OCR engines are mainly intended for character recognition against simple backgrounds, such as electronic documents. Images with background patterns require complex pre-processing, and the recognition results are not satisfactory.

To train such models, it is vital to obtain high-quality training data. Very few retail merchandise datasets are currently available. The only dataset that supports training OCR models on retail products is Unitail-OCR (Chen et al., Citation2022), but its amount of data is still too small to train a highly robust model, which increases the risk of overfitting. The work in (Gupta et al., Citation2016; Liu et al., Citation2018) provides excellent solutions for preparing artificially generated training datasets. We use an artificially generated dataset for pre-training, followed by fine-tuning on Unitail-OCR so that the model learns real-world features. Unitail-OCR is primarily intended for retail product category identification, not for expiry dates. We therefore collected 2,161 images of expiry dates from 162 food items and built them into a dataset, which allows the model to be better applied to expiry date recognition.

3 Text extraction models

Based on the analysis and study of existing algorithms, OCR is performed in two steps: detection and recognition. In the first step, we detect regions of the image that may contain text. In the second step, text recognition is performed: for each detected region, the words in that region are recognised and transcribed using a CNN. The detailed process is shown in Figure 1.

Figure 1. General architecture of character recognition.


3.1 Text detection model

The current deep learning methods in the field of scene text detection mainly fall into two categories: proposal-based text detection and segmentation-based text detection. The DBNet used in this paper is a segmentation-based method; segmentation-based methods can describe scene text of arbitrary shapes more accurately. The basic framework of DBNet is shown in Figure 2.

Figure 2. Architecture of Text detection model.


The DBNet network first maps the input image to a high-dimensional convolutional feature map through a feature pyramid backbone, usually based on the ResNet architecture (He et al., Citation2016); in this paper, CBAM is added to each residual module to enhance the network's extraction of character features in complex contexts. The feature map is then used to predict a probability map and a threshold map, from which a binarization map is calculated to finally obtain the text positions. The post-processing binarization step is crucial for segmentation-based detection: it converts the probability map produced by the segmentation network into a binarized map of the text region, from which the text locations are obtained. DBNet uses a module called Differentiable Binarization (DB), which performs the binarization inside the segmentation network. The process applies adaptive binarization to each pixel, with the binarization threshold learned by the network, so the binarization step is fully incorporated into training. In this way, the final output is very robust to the threshold, which simplifies post-processing and improves text detection.

To enhance the representation of foreground text and suppress background noise, a Convolutional Block Attention Module (CBAM) (Woo et al., Citation2018) is added to the network. The attention mechanism is a resource allocation mechanism whose idea is to find intrinsic correlations in the raw data, highlighting strongly correlated features and ignoring weakly correlated ones. In this paper, we want the network to focus on the character features, so the attention module is added to the network to improve the accuracy of the character detection algorithm.

The basic framework of CBAM is shown in Figure 3. CBAM consists of two parts: channel attention and spatial attention. Channel attention uses average pooling and maximum pooling to aggregate the spatial information of the feature map, generating the average-pooled feature $F^{c}_{avg}$ and the max-pooled feature $F^{c}_{max}$. These two features are then fed into a shared network consisting of an MLP with a single hidden layer, and the two outputs are summed to generate the channel attention $M_c$:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big) \tag{1}$$
After the feature map has been weighted by channel attention, spatial attention applies average pooling and maximum pooling along the channel axis, aggregating the channel information into two 2D feature maps, which are concatenated; a standard convolutional layer then generates the spatial attention $M_s$:
$$M_s(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times 7}([F^{s}_{avg}; F^{s}_{max}])\big) \tag{2}$$
The CBAM is added to each residual block in the backbone, as shown in Figure 4, to form a basic module. Because convolutional operations extract informative features by mixing channel and spatial information, the CBAM module applies the channel and spatial attention modules in turn to enhance meaningful features along these two main dimensions, improving the feature extraction capability of the convolutional network. CBAM is also a lightweight module, so its parameter and computation overhead has little impact on network speed.

Figure 3. CBAM module.


Figure 4. Basicblock.

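As a concrete illustration of Equations (1) and (2), the following is a minimal PyTorch sketch of a CBAM block (the reduction ratio of 16 and the 7 × 7 spatial kernel follow the common defaults; class and variable names are ours, not from the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared single-hidden-layer MLP (implemented with 1x1 convolutions)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # MLP(AvgPool(F))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)                # Mc, Equation (1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)      # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)     # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms, Equation (2)

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)   # re-weight channels
        x = x * self.sa(x)   # re-weight spatial positions
        return x
```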

Table 1 gives the parameters of each layer of the text detection network. The backbone is a feature pyramid structure based on ResNet-18, which fuses feature layers of different scales through an FPN. The pyramid features are upsampled to the same scale and concatenated to produce the feature F, which is then used to predict the probability map (P) and threshold map (T). The approximate binary map (B) is then calculated from P and T by differentiable binarization:
$$B_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}} \tag{3}$$
where B is the approximate binary map, T is the adaptive threshold map learned by the network, and k is an amplification factor, set empirically to 50. This approximate binarization function is differentiable and can therefore be optimised together with the segmentation network during training.

Table 1. Parameters of each layer of text detection model.
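Equation (3) is simply a sigmoid with amplification factor k applied to the difference between the probability and threshold maps. A minimal sketch, assuming the two maps are PyTorch tensors of the same shape:

```python
import torch

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Differentiable binarization (Equation 3): B = 1 / (1 + exp(-k * (P - T)))."""
    return torch.sigmoid(k * (prob_map - thresh_map))

# Example with random maps of the same spatial size.
P = torch.rand(1, 1, 160, 160)
T = torch.rand(1, 1, 160, 160)
B = approximate_binary_map(P, T)   # values are pushed close to 0 or 1
```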

3.2 Text recognition model

The character recognition model is a fully convolutional model whose architecture is shown in Figure 5. The convolutional body is based on the ResNet-18 architecture, with the CBAM module added in the first residual block. The input image passes through the convolutional body, and the last convolution then predicts the most likely character at each horizontal position of the image. We do not use recurrent neural networks (such as LSTM or GRU) because recurrent networks parallelise poorly: due to their chained structure, the sequence modelling cost grows with the length of the input sequence, making recurrent computation more time consuming (Gao et al., Citation2017). A CNN can capture the features of each character in the image, model the interactions between characters well (Borisyuk et al., Citation2018), and is more computationally efficient.

Figure 5. Architecture of Text Recognition model.

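The following is a rough PyTorch sketch of such a fully convolutional recogniser: a ResNet-18 body whose pooled classifier head is replaced by a 1 × 1 convolution that outputs a character distribution per horizontal position (the class count of 11 and the omission of the CBAM insertion are simplifications of ours, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18   # use pretrained=False on older torchvision

class FullyConvRecognizer(nn.Module):
    def __init__(self, num_classes=11):   # e.g. 10 digits + 1 CTC blank
        super().__init__()
        body = resnet18(weights=None)
        # Keep the convolutional stages, drop the average pooling and FC layers.
        self.features = nn.Sequential(*list(body.children())[:-2])
        self.head = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):                  # x: (N, 3, 32, 128)
        f = self.features(x)               # (N, 512, 1, W'), one column per text position
        logits = self.head(f)              # (N, num_classes, 1, W')
        return logits.squeeze(2).permute(2, 0, 1)   # (W', N, num_classes) for nn.CTCLoss

model = FullyConvRecognizer()
out = model(torch.randn(2, 3, 32, 128))    # torch.Size([4, 2, 11])
```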

The model is trained with the CTC loss, which computes the conditional probability of a label given the prediction by marginalising over the set of all possible alignment paths; this can be computed efficiently with dynamic programming. As shown in Figure 5, every column of the feature map corresponds to a probability distribution over all characters in the alphabet at that position of the image. These per-column predictions may contain duplicate characters or blank characters (-). For example, for an input image with the characters 20211203, the model may generate the character sequence "2-00-22-1-122-00-3-", which includes blanks and repetitions. At inference time, the character with the highest probability is taken in each column of the feature map, and blanks and duplicates are then removed to obtain the final prediction.
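The collapsing rule used at inference can be sketched as a short greedy decoder (assuming class 0 is the blank and the remaining classes map onto a digit alphabet):

```python
import numpy as np

def ctc_greedy_decode(logits, alphabet="0123456789", blank=0):
    """Greedy CTC decoding: take the most likely class in every column, then
    collapse repeats and drop blanks, e.g. '2-00-22-1-122-00-3-' -> '20211203'."""
    best = np.argmax(logits, axis=-1)   # (T,) most likely class per column
    decoded, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            decoded.append(alphabet[idx - 1])   # classes 1..N map onto the alphabet
        prev = idx
    return "".join(decoded)
```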

Equipment problems such as printhead vibration and unstable conveyor belt movement can cause some detected characters to appear skewed. To address this, an affine transformation is applied to the localised character region, correcting the skewed characters before they are fed into the character recognition network.

The affine transformation is a linear mapping from two-dimensional coordinates to two-dimensional coordinates that preserves the "straightness" and "parallelism" of the figure; it allows the image to be translated and rotated. The transformation matrix is computed from the predicted corner points of the text box on the original image and the corresponding preset target coordinates, and the original image is then mapped to a new image by this matrix. The affine transformation can be expressed as
$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & t_x \\ a_3 & a_4 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{4}$$
where $(t_x, t_y)$ is the translation and the parameters $a_i$ describe rotation, scaling, etc.

After detecting the character area, we examine the coordinates of the text box; when the tilt angle of the box exceeds 3°, the characters are considered excessively tilted, and an affine transformation is used to map the tilted character region back to the correct position. Figure 6 shows the result of the affine transformation.

Figure 6. Affine transformation to corrected characters.

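A possible OpenCV sketch of this correction step, assuming the detector returns the four corner points of the text box (the angle convention of cv2.minAreaRect differs between OpenCV versions, so this is illustrative rather than the paper's exact procedure):

```python
import cv2
import numpy as np

def correct_skew(image, box_points, angle_threshold=3.0):
    """Rotate a detected text region back to horizontal when its tilt exceeds the threshold."""
    rect = cv2.minAreaRect(np.asarray(box_points, dtype=np.float32))
    (cx, cy), (w, h), angle = rect
    if w < h:                       # normalise the angle reported by minAreaRect
        angle -= 90.0
    if abs(angle) <= angle_threshold:
        return image                # tilt is small enough, leave the region unchanged
    M = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)   # 2x3 affine matrix
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]),
                          borderMode=cv2.BORDER_REPLICATE)
```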

In most text recognition methods, the located text area is cropped out and resized to 32 × 100 pixels without preserving the aspect ratio. As a result, identical characters can have different features depending on the text length in the image, which requires a larger model capacity to represent accurately. Therefore, at training time we resize the image to 32 × 128 pixels as follows. The image is first scaled proportionally so that its height is 32 pixels. If the resulting width is greater than 128, the aspect ratio is not preserved and the image is scaled directly to 32 × 128 pixels; otherwise, the right side is padded with the image edge value up to a width of 128 pixels. In this way, the original character features are retained while images can still be batched for training. As Table 6 in Section 4 shows, this method achieves higher accuracy than resizing without preserving the aspect ratio.
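A minimal sketch of this aspect-ratio-preserving resize, with the target size of 32 × 128 and right-side edge padding as described above:

```python
import cv2
import numpy as np

def resize_keep_ratio(img, target_h=32, target_w=128):
    """Scale to height 32 keeping the aspect ratio; pad the right side with the
    edge column up to width 128, or squeeze directly to 32x128 if too wide."""
    h, w = img.shape[:2]
    new_w = int(round(w * target_h / h))
    if new_w >= target_w:
        return cv2.resize(img, (target_w, target_h))       # aspect ratio not preserved
    resized = cv2.resize(img, (new_w, target_h))
    pad = np.repeat(resized[:, -1:], target_w - new_w, axis=1)   # replicate the edge column
    return np.concatenate([resized, pad], axis=1)
```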

4 Experiments

4.1 Experimental evaluation indexes

Precision and Recall are used as evaluation metrics for the text detection model and are calculated as follows:
$$P = \frac{TP}{TP + FP} \tag{5}$$
$$R = \frac{TP}{TP + FN} \tag{6}$$

where TP, TN, FP and FN denote True Positive, True Negative, False Positive and False Negative, respectively.

For the recognition model, accuracy and normalised edit distance were used as performance measures. Accuracy is calculated as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
For the accuracy calculation, a prediction is counted as correct only when the predicted word matches the real word exactly. The accuracy metric reports the percentage of words identified correctly, but it gives no information about the degree of error in incorrect predictions: for a given input image, misrecognising one character or misrecognising five characters has the same effect, since the prediction simply counts as wrong. To capture the degree of misrecognition, the normalised edit distance is used, which is based on the minimum number of editing operations (substitution, insertion, deletion) needed to convert the predicted string into the ground-truth string and can be regarded as the degree of difference between the two strings. This paper uses the normalised edit distance to evaluate the similarity between the prediction and the ground truth.
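The two recognition metrics can be sketched as follows (a plain dynamic-programming Levenshtein distance normalised by the longer string length; the exact normalisation used in the paper is not specified, so this is one common choice):

```python
def normalized_edit_distance(pred, gt):
    """Levenshtein distance normalised by the longer string length (0 = identical)."""
    m, n = len(pred), len(gt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n] / max(m, n, 1)

def word_accuracy(preds, gts):
    """A prediction counts as correct only if it matches the label exactly."""
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)
```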

4.2 Model training and datasets

The character detection network and the character recognition network are trained and tested separately. Table 2 shows the parameters used for training. The character detection network is optimised with the Adam optimiser; the batch size is 16, the number of epochs is 120, the initial learning rate is 1e-3, and warm-up is used for the learning rate schedule. The character recognition network is also optimised with Adam; the batch size is 32, the number of epochs is 100, the initial learning rate is 1e-4, and the learning rate decreases by 20% every two epochs. The training process of the character recognition model is shown in Figure 7.

Figure 7. Curve of parameters during training.


Table 2. Model training parameters settings.
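The optimiser and learning-rate schedules in Table 2 can be expressed in PyTorch roughly as follows (the warm-up length for the detection network is our assumption, as it is not stated here):

```python
from torch import nn, optim

# Stand-in modules; the real networks are the improved DBNet and the fully
# convolutional recogniser described above (names here are placeholders).
det_model = nn.Conv2d(3, 1, 3)
rec_model = nn.Conv2d(3, 11, 3)

# Detection: Adam, initial lr 1e-3, linear warm-up over the first epochs.
det_optim = optim.Adam(det_model.parameters(), lr=1e-3)
det_sched = optim.lr_scheduler.LambdaLR(det_optim, lr_lambda=lambda e: min(1.0, (e + 1) / 5))

# Recognition: Adam, initial lr 1e-4, decayed by 20% every two epochs (Table 2).
rec_optim = optim.Adam(rec_model.parameters(), lr=1e-4)
rec_sched = optim.lr_scheduler.StepLR(rec_optim, step_size=2, gamma=0.8)

# After each training epoch, call det_sched.step() / rec_sched.step().
```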

Having high-quality training and testing datasets is important for building robust supervised machine learning models. We first pre-trained on the artificial dataset and then fine-tuned on the Unitail-OCR dataset so that the model learns real-world features. Unitail-OCR is an OCR dataset for retail product identification containing 1454 product categories, and it has been benchmarked against a variety of state-of-the-art models. However, Unitail-OCR does not match the data distribution of dates printed on packages. We therefore manually annotated a dataset of thousands of package images with printed dates and used it to fine-tune the model, which improved the results considerably.

4.3 Results and analysis

4.3.1 Text detection model

Table 3 shows the experimental results of the text detection model on the open-source Unitail-OCR dataset and our manually collected dataset. The → denotes fine-tuning, i.e. A → B means trained on A and then fine-tuned on B. The method used in this paper improves the feature extraction ability of the original model's backbone and improves both precision and recall.

Table 3. Comparison of text detection models on different datasets.

The experimental results of different detection algorithms are compared in Table 4; the test set is a manually collected set of food packaging images. CTPN (Tian et al., Citation2016) cuts the text into small fixed-width proposals along the text direction and merges the small boxes belonging to the same text instance into a complete text box. However, CTPN does not consider the text angle and is only applicable to horizontal text, so it is not effective at detecting tilted text. Both EAST (Zhou et al., Citation2017) and TextSnake (Long et al., Citation2018) take the text angle into account. EAST uses a two-stage pipeline of a fully convolutional network (FCN) followed by non-maximum suppression (NMS), and the detected shape can be either a general quadrilateral or a rotated rectangle. TextSnake produces 7 feature maps from the basic feature extraction network, of which 4 channels predict TR (Text Region) and TCL (Text Center Line) and the other 3 predict r (radius), cos θ and sin θ; the text region is then reconstructed from these 7 maps. However, these methods have high model complexity and slow detection speed, and are not suitable for deployment on low-compute devices. The DBNet model used in this paper has low complexity, and high accuracy can be obtained with a simpler feature extraction network. The CBAM module is added to the feature extraction network to improve localisation of characters in complex backgrounds. As Table 4 shows, the final method achieves the best performance among the compared models.

Table 4. Comparison of different detection algorithms.

4.3.2 Text recognition model

In production, the common method for text recognition is CRNN. However, the recurrent neural network (RNN) structure in CRNN incurs a large time overhead on embedded devices. According to the studies in (Borisyuk et al., Citation2018; Gao et al., Citation2017), horizontal text can be recognised well by convolution alone, without an RNN structure. Therefore, this paper uses a fully convolutional recognition network and further adds CBAM to improve the recognition effect. As Table 5 shows, the improved algorithm has higher recognition accuracy and a lower error rate.

Table 5. Comparison of the effect of the improved recognition algorithm.

The input image of a character recognition algorithm is usually scaled directly to 32 × 128, which changes the features of some characters, as shown in Figure 8. In particular, directly resizing dot matrix characters stretches the originally circular dots into ellipses, so identical characters appear with different features, degrading the network's recognition performance. In this paper, we maintain the aspect ratio of the original image while converting it to 32 × 128 pixels so that the character features remain unchanged, which effectively improves recognition accuracy, as shown in Table 6.

Figure 8. Resize.


Table 6. Comparison of resize results.

5 Model deployment to embedded system

To save hardware costs and facilitate quick integration into factory production lines, we deploy the improved algorithms to the NVIDIA Jetson Nano embedded platform and accelerate inference with TensorRT 8.0. The configuration of the Jetson Nano is shown in Table 7.

Table 7. Jetson nano configuration.
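A typical deployment path, sketched under the assumption that the models are exported from PyTorch: convert to ONNX on the development machine, then build a TensorRT engine on the Jetson Nano (file names and the stand-in model are hypothetical):

```python
import torch
from torch import nn

# Stand-in for the trained recognition network; the export call is the same for the real model.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 32, 128)          # a 32 x 128 recognition input

torch.onnx.export(model, dummy, "rec_model.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])

# On the Jetson Nano, a TensorRT engine can then be built from the ONNX file,
# e.g. with: trtexec --onnx=rec_model.onnx --saveEngine=rec_model.trt --fp16
```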

Table 8 shows the parameters and computation of several models, together with their inference times on the embedded platform. The parameters and computation of a standard convolutional layer are calculated as follows.

Table 8. Comparison of model complexity and inference time.

Parameters:
$$C_{out} \times C_{in} \times K \times K \tag{8}$$
Amount of computation:
$$H \times W \times C_{out} \times (2 \times C_{in} \times K \times K - 1) \tag{9}$$
where H and W are the height and width of the feature map, $C_{in}$ and $C_{out}$ are the numbers of input and output channels, and K is the size of the convolution kernel; both equations ignore the bias term.
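Equations (8) and (9) translate directly into a short helper for estimating a layer's size and cost (the example channel counts and map size are ours, chosen only for illustration):

```python
def conv_params(c_in, c_out, k):
    """Weights of a k x k convolution, bias ignored (Equation 8)."""
    return c_out * c_in * k * k

def conv_flops(h, w, c_in, c_out, k):
    """Operations for one forward pass over an h x w feature map (Equation 9)."""
    return h * w * c_out * (2 * c_in * k * k - 1)

# Example: a 3 x 3 convolution with 64 input and 64 output channels on a 160 x 160 map.
print(conv_params(64, 64, 3))            # 36864 parameters
print(conv_flops(160, 160, 64, 64, 3))   # 1885798400, roughly 1.9 GFLOPs
```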

The parameter count mainly describes the size of the model; it is analogous to the space complexity of an algorithm and relates to the hardware memory budget. The amount of computation is analogous to time complexity and relates to the floating point operations per second (FLOPS) that the hardware can deliver. The inference time of a model is mainly determined by its amount of computation. From rows 2 and 4 of Table 8, we can see that the two models have similar parameter counts but differ greatly in the amount of computation, and therefore in inference time.

In addition, as the text detection part of the table shows, the CBAM module is lightweight: it has little impact on the parameters and computation of the model yet effectively improves detection accuracy. From the text recognition part of the table, it can be seen that although our model has more parameters than CRNN, its computation is smaller. Moreover, the CRNN network uses a recurrent (RNN) structure, which the GPU accelerates poorly, whereas our fully convolutional model benefits more from GPU acceleration, so it is twice as fast as CRNN.

We used a Hikvision industrial camera to capture 640 × 480 images as input, and TensorRT on the Jetson Nano platform to run GPU-accelerated inference. As shown in Figures 9 and 10, the improved algorithm can recognise fonts commonly used for printing expiry dates, including dot matrix characters. In addition, deploying the model on an embedded device offers large advantages in power consumption and cost compared with using an expensive industrial control machine.

Figure 9. Dot matrix character recognition effect.


Figure 10. Other print character recognition effect.


6 Conclusion

Building on deep learning approaches to natural scene character recognition, we have combined DBNet with the CBAM module, which effectively improves the accuracy of localising character regions in complex backgrounds. The cropped character region images are processed to the input size required by the character recognition model while preserving the character aspect ratio, which improves accuracy by 3.5% compared with the unprocessed method. Our character recognition method does not use the commonly adopted RNN structure; instead it uses a fully convolutional recognition algorithm, which improves deployment and recognition speed on embedded devices. The CBAM module is also added to the recognition algorithm to improve the network's feature extraction of dot matrix characters, which improves accuracy by 1.2% to 97.9%. Finally, the improved algorithm is deployed on the Jetson Nano to implement embedded edge detection, which makes the OCR system more portable and convenient for quick application on industrial production lines, and reduces the labour cost of character verification. Future work will continue to optimise the algorithm to improve character recognition accuracy and train on more character datasets, covering more products and font types, to make the algorithm more universal.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Special Fund for Education and Research of Fujian Provincial Department of Finance, China [grant number GY-Z21001].

References

  • Ai, M., & Tao, Q. (2020). Algorithm for character detection and recognition on steel surface based on MobileNet model. Modern Computer, 3, 73–78. https://doi.org/10.3969/j.issn.1007-1423.2020.03.014
  • Borisyuk, F., Gordo, A., & Sivakumar, V. (2018, August 19–23). Rosetta: Large scale system for text detection and recognition in images. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://doi.org/10.1145/3219819.3219861
  • Breuel, T. M. (2008). The OCRopus open source OCR system. Document recognition and retrieval XV. SPIE. https://doi.org/10.1117/12.783598
  • Chen, F., Zhang, H., Li, Z., Dou, J., Mo, S., Chen, H., Zhang, Y., Ahmed, U., Zhu, C., & Savvides, M. (2022, October 23–27). Unitail: Detecting, Reading, And matching in retail scene. Computer vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, Part VII, Springer. https://doi.org/10.1007/978-3-031-20071-7_41
  • Chen, R.-C. (2019). Automatic license plate recognition via sliding-window darknet-YOLO deep learning. Image and Vision Computing, 87, 47–56. https://doi.org/10.1016/j.imavis.2019.04.007
  • Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545. https://doi.org/10.1109/TPAMI.2014.2300479
  • Florea, V., & Rebedea, T. (2020). Expiry date recognition using deep neural networks. International Journal of User-System Interaction, 13(1), 1–17. https://doi.org/10.37789/ijusi.2020.13.1.1
  • Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint.
  • Gong, L., Thota, M., Yu, M., Duan, W., Swainson, M., Ye, X., & Kollias, S. (2021). A novel unified deep neural networks methodology for use by date recognition in retail food package image. Signal, Image and Video Processing, 15, 449–457. https://doi.org/10.1007/s11760-020-01764-7
  • Gong, L., Yu, M., Duan, W., Ye, X., Gudmundsson, K., & Swainson, M. (2018, May 25–27). A novel camera based approach for automatic expiry date detection and recognition on food packages. Artificial Intelligence Applications and Innovations: 14th IFIP WG 12.5 International Conference, AIAI 2018, Rhodes, Greece. https://doi.org/10.1007/978-3-319-92007-8_12
  • Gupta, A., Vedaldi, A., & Zisserman, A. (2016, June 26–30). Synthetic data for text localisation in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern recognition. https://doi.org/10.1109/cvpr.2016.254
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016, June 26–30). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2016.90
  • Hosozawa, K., Wijaya, R. H., Linh, T. D., Seya, H., Arai, M., Maekawa, T., & Mizutani, K. (2018). Recognition of expiration dates written on food packages with open source OCR. International Journal of Computer Theory and Engineering, 10(5), 170–174. https://doi.org/10.7763/ijcte.2018.v10.1220
  • Inunganbi, S., Choudhary, P., & Singh, K. M. (2020). Local texture descriptors and projection histogram based handwritten Meitei Mayek character recognition. Multimedia Tools and Applications, 79(3), 2813–2836. https://doi.org/10.1007/s11042-019-08482-4
  • Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint.
  • Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 1–20. https://doi.org/10.1007/s11263-015-0823-z
  • Kamisetty, V. N. S. R., Chidvilas, B. S., Revathy, S., Jeyanthi, P., Anu, V. M., & Gladence, L. M. (2022, March 29–31). Digitization of data from invoice using OCR. 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), IEEE. https://doi.org/10.1109/iccmc53470.2022.9754117
  • Li, F., Hu, W., Liu, B., & Liu, Y. (2022). Printing character detection algorithm based on NVIDIA TX2. Computer Engineering and Application, 58(13), 210–216. https://doi.org/10.3778/j.issn.1002-8331.2107-0317
  • Li, Y., Wu, Z., Zhao, S., Wu, X., Kuang, Y., Yan, Y., Ge, S., Wang, K., Fan, W., & Chen, X. (2020, April 4). PSENet: Psoriasis severity evaluation network. Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v34i01.5424
  • Liao, M., Wan, Z., Yao, C., Chen, K., & Bai, X. (2020, February 7–12). Real-time scene text detection with differentiable binarization. Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v34i07.6812
  • Liu, Z., Li, Y., Ren, F., Goh, W. L., & Yu, H. (2018, February 2–7). Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network. Proceedings of the AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v32i1.12252
  • Long, S., Ruan, J., Zhang, W., He, X., Wu, W., & Yao, C. (2018, September 8–14). Textsnake: A flexible representation for detecting text of arbitrary shapes. Proceedings of the European Conference on Computer Vision (ECCV). https://doi.org/10.1007/978-3-030-01216-8_2
  • Maitrichit, N., & Hnoohom, N. (2020, November 18–20). Intelligent medicine identification system using a combination of image recognition and optical character recognition. 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), IEEE. https://doi.org/10.1109/isai-nlp51646.2020.9376816
  • Muresan, M. P., Szabo, P. A., & Nedevschi, S. (2019, September 5–7). Dot matrix OCR for bottle validity inspection. 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), IEEE. https://doi.org/10.1109/iccp48234.2019.8959762
  • Shanthini, K. M., Chitra, P., Abirami, S., Aninthitha, G., & Abarna, P. (2021, July 06–08). Recommendation of product value by extracting expiry date using deep neural network. 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), IEEE. https://doi.org/10.1109/icccnt51525.2021.9579675
  • Sheng, F., Chen, Z., & Xu, B. (2019, September 20–25). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE. https://doi.org/10.1109/icdar.2019.00130
  • Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304. https://doi.org/10.1109/TPAMI.2016.2646371
  • Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2035–2048. https://doi.org/10.1109/tpami.2018.2848939
  • Tian, Z., Huang, W., He, T., He, P., & Qiao, Y. (2016, October 10–16). Detecting text in natural image with connectionist text proposal network. European Conference on Computer Vision, Springer. https://doi.org/10.1007/978-3-319-46484-8_4.
  • Wang, B., Song, S. X., & Wang, Y. Y. (2020). Design and implementation of embedded printing character inspection system based On Qt and Arm NN. Computing Technology and Automation, 39(1), 54–60. https://doi.org/10.16339/j.cnki.jsjsyzdh.202001011
  • Wei, T. C., Sheikh, U., & Ab Rahman, A. A.-H. (2018, March 9–10). Improved optical character recognition with deep neural network. 2018 IEEE 14th international colloquium on signal processing & Its applications (CSPA), IEEE. https://doi.org/10.1109/cspa.2018.8368720
  • Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018, September 8–14). Cbam: Convolutional block attention module. Proceedings of the European conference on computer vision (ECCV). https://doi.org/10.1117/12.2636811
  • Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J. (2017, July 21–26). East: An efficient and accurate scene text detector. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2017.283
  • Zitnick, C. L., & Dollár, P. (2014, September 5–12). Edge boxes: Locating object proposals from edges. European Conference on Computer Vision, Springer. https://doi.org/10.1007/978-3-319-10602-1_26