Research Article

Computer vision classification detection of chicken parts based on optimized Swin-Transformer

Article: 2347480 | Received 26 Jan 2024, Accepted 19 Apr 2024, Published online: 08 May 2024

ABSTRACT

In order to achieve real-time classification and detection of various chicken parts, this study introduces an optimized Swin-Transformer method for the classification and detection of multiple chicken parts. The method leverages the Transformer’s self-attention structure to capture more comprehensive high-level visual semantic information from chicken part images. Image enhancement techniques were applied in the preprocessing stage to strengthen the feature information of the images, and transfer learning was used to train and optimize the Swin-Transformer model on the enhanced chicken parts dataset for classification and detection. Furthermore, this model was compared with four models commonly used in object detection tasks: YOLOV3-Darknet53, YOLOV3-MobileNetv3, SSD-MobileNetv3, and SSD-VGG16. The results indicated that the Swin-Transformer model outperforms these models, with mAP values higher by 1.62%, 2.13%, 5.26%, and 4.48%, and detection times shorter by 16.18 ms, 5.08 ms, 9.38 ms, and 23.48 ms, respectively. The proposed method fulfills production line requirements while exhibiting superior performance and greater robustness compared with existing conventional methods.

GRAPHICAL ABSTRACT

1. Introduction

China is the world’s largest poultry breeding country, and its poultry stock and poultry product output have long been among the highest in the world (C. Tang et al., 2016). However, when processing whole chickens, processing plants typically cut the chicken into parts first and then sort the parts manually. This manual sorting approach is inefficient, error-prone, and provides no guarantee of product safety (Jiang, 2022; J. Li et al., 2021). As living standards improve and food consumption grows, there is an increasing demand for chicken products that cater to a variety of segmented taste preferences. At the same time, categorizing the different parts of the chicken helps processing companies purchase raw materials and organize production according to demand, which not only reduces the waste of resources but also lowers production costs.

Therefore, it is necessary to study the automatic classification and detection of chicken parts. Limited research has been conducted on poultry detection, particularly in the area of chicken meat detection and classification. At present, few domestic applications use deep learning to classify different chicken parts; most rely on traditional machine learning methods to identify them. The method employed in this study also enables the identification of chicken parts that fail to meet quality standards, such as those affected by disease or contamination, thereby helping to ensure product safety. Computer vision technology has the advantages of high efficiency, low cost, good adaptability, and stability, making it widely used in livestock breeding, meat grading, and other applications (Li & Yu, 2020; Sun et al., 2021).

In recent years, most widely used methods in chicken carcass research have been based on image processing and conventional machine vision. Images of chicken carcasses are captured with industrial cameras, and image processing technology is then employed to extract feature quantities that are closely correlated with carcass quality (Wu, 2016). This process involves establishing a mathematical model that relates the image feature quantities to the quality assessment of the carcasses. While these methods showed high accuracy, they also have limitations (Li, 2017). Manual feature extraction is laborious and time-consuming, and manually extracted features often represent the target inadequately and fail to express its natural characteristics. Additionally, detection precision is often insufficient in scenarios where chicken parts overlap and occlude one another in complex environments. Advances in deep learning have produced numerous target detection algorithms that perform well on these problems.

Deep learning-based computer vision object detection methods have strong capabilities for independently extracting, learning, and reasoning over the deep and shallow features of sample images, and can better solve the above problems. Zhao et al. (2021) used a generative adversarial network to generate sheep skeleton images and conducted an ICNet-based semantic segmentation study on key sheep parts in different scenes; their average segmentation accuracy exceeded 90%. M. Tian et al. (2019) employed a CNN-based deep learning model for pig counting, and the experiments demonstrated that even under occlusion the mean absolute error (MAE) of a single image could reach 1.67. Jin et al. (2023) introduced a strawberry target detection approach based on YOLOV4 and incorporated an enhanced K-means clustering algorithm for prior-box size calculation; the method achieved an average accuracy of 97.05% in testing, with an average detection time of 74 ms per image. Pan et al. (2016) used a BP (back propagation) neural network and an SVM (support vector machine) to construct pork freshness grade prediction models; the results showed that the average prediction accuracies of SVM and BP were 91.11% and 84.44%, respectively, and that the SVM and BP prediction models combining feature parameters reached the highest accuracies of 88.89% and 95.56%. Zhao and Yang (2023) proposed an aero-engine fault diagnosis method based on a fusion convolutional Transformer, using the self-attention mechanism and introducing a maximum pooling layer into the Transformer model to reduce memory consumption and the number of parameters; compared with convolutional neural networks and back propagation neural networks, accuracy improved by 28.117% and 6.552%, respectively. Y. Li et al. (2023) combined CNN and Transformer to create a novel model: they refined the feature map, incorporated multi-scale input for comprehensive global feature modeling, employed a Transformer to extract deep global features, and progressively upsampled to generate a detailed segmentation map. Experiments on a substation dataset achieved a Dice coefficient of 89.31% and a recall rate of 90.52%.

CNN, as a foundational network, has contributed significantly to computer vision tasks and has played a pivotal role in the field’s rapid development owing to its formidable feature extraction capabilities. However, CNN primarily concentrates on local features within a limited range and sometimes overlooks global features (Yu et al., 2023; Zhang et al., 2023a). The advent of the Swin-Transformer effectively addresses these limitations by adeptly capturing global contextual information (M. Liu et al., 2023). In 2017, Vaswani et al. introduced the Transformer network, which demonstrated remarkable performance in natural language processing. Subsequent research has produced several enhancements to the Transformer architecture, giving rise to networks such as DETR (Detection Transformer), ViT (Vision Transformer), and SETR-MLP (Segmentation Transformer). These networks have found applications in diverse areas, including object detection, semantic segmentation, image classification, and image generation (Liu & Huang, 2022; Zhang et al., 2023b). In this study, the state-of-the-art Swin-Transformer network was applied in the following three aspects: (1) Data preprocessing and data augmentation were first carried out on the training and test sets of chicken part images, and the Mosaic data enhancement technique was incorporated into the backbone network during model training to improve the robustness of the model; (2) To enhance the extraction of global information from chicken part images and acquire high-level visual semantic information, the Swin-Transformer model was adopted as the backbone network. The network employs a hierarchical construction method akin to convolutional neural networks, downsampling the images at various rates. The network was evaluated on the validation set at each epoch during pre-training, the model parameters were adjusted to optimize the model, and the model was saved when it achieved the highest classification accuracy on the validation set. Concurrently, the parameters of the pre-trained Swin-Transformer model were transferred to the chicken part detection and recognition model through transfer learning; (3) Other existing mainstream detection algorithms were introduced and compared with the optimized Swin-Transformer model to evaluate the advantages, disadvantages, and real-time performance of the model. The classification and detection method proposed in this paper can offer fresh techniques and insights for the automated and intelligent processing of various chicken meat products, and it can also serve as a technological reference for the automatic classification and sorting of other livestock and poultry meats.

2. Materials and methods

2.1. Test materials and image acquisition

We used three types of chicken parts purchased from supermarkets, namely chicken wing center, chicken thigh, and chicken wing. Sample images were acquired at a chicken parts sorting site on a conveyor belt. During image acquisition, the chicken parts were randomly dispersed on a 0.8-meter-wide conveyor belt, and an industrial camera (MV-CA050-20 GM) was positioned 0.6 meters above the belt. The camera possesses a high dynamic range suitable for natural lighting conditions. It was installed at a height of 1.4 meters above the ground to capture samples without a specific background or light source. A schematic diagram of the acquisition setup is depicted in Figure 1, and the chicken parts are shown in Figure 2.

Figure 1. Image acquisition system (1. Camera; 2. Chicken part; 3. Conveyor belt).

Figure 2. (a) Chicken leg, (b) chicken wings, (c) chicken breast.

2.2. Sample pre-processing and data enrichment

The accuracy of deep learning-based computer vision models is closely linked to the size of the dataset: larger datasets enhance the deep network’s ability to extract and learn object features. However, in this study the number of chicken part samples was far lower than what the deep learning model required, so the chicken part data needed to be augmented; data augmentation techniques play a crucial role in generating additional training data from existing image samples. For this purpose, three common single-sample data augmentation operations, obtained as publicly available code, were applied to the acquired sample images of chicken parts: clockwise rotation by 45°, translation by 100 pixels along the X-axis, and a 0.4 exposure adjustment (W. Tang et al., 2023). A comparison of the enhanced images with the original image is shown in Figure 3.
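As a concrete illustration of these three single-sample operations, the sketch below applies them with Pillow. The file name, the interpretation of the “0.4 exposure adjustment” as a brightness factor, and the library choice are assumptions made for illustration; the paper only names the operations.

```python
# A minimal sketch of the three single-sample augmentations listed above
# (45° clockwise rotation, 100-pixel X-axis translation, 0.4 exposure adjustment).
from PIL import Image, ImageEnhance

def augment_sample(path):
    img = Image.open(path).convert("RGB")

    # Clockwise rotation by 45° (PIL rotates counter-clockwise, hence the negative angle)
    rotated = img.rotate(-45, expand=True)

    # Translate the content 100 pixels along the X-axis via an affine transform
    translated = img.transform(img.size, Image.AFFINE, (1, 0, -100, 0, 1, 0))

    # Exposure adjustment interpreted here as a brightness factor of 0.4 (assumption)
    exposed = ImageEnhance.Brightness(img).enhance(0.4)

    return rotated, translated, exposed

# Usage: save each augmented view alongside the (hypothetical) original file
for i, out in enumerate(augment_sample("chicken_part_001.jpg")):
    out.save(f"chicken_part_001_aug{i}.jpg")
```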

Figure 3. Example diagram comparing the enhanced image with the original image.

After applying data augmentation to the chicken part images, a total of 2000 images were obtained. To streamline model training, the images were resized to 512 × 512 pixels to construct the chicken parts dataset, which was then divided into training, test, and validation sets at a ratio of 6:2:2. The image samples were annotated with the LabelImg annotation tool, following the VOC (Visual Object Classes) data format, and labels were assigned for the three types of chicken parts: chicken breasts, chicken legs, and chicken wings. The distribution of the number of parts within each subset of the dataset was computed according to the manually preset ratios, as shown in Table 1.
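A minimal sketch of how the resizing and the 6:2:2 split could be scripted is shown below. The directory layout, file extension, and random seed are illustrative assumptions, and annotation is done separately in LabelImg.

```python
# A minimal sketch of building the 512 x 512 dataset and splitting it 6:2:2.
import random
from pathlib import Path
from PIL import Image

def build_dataset(src_dir, dst_dir, size=(512, 512), seed=42):
    paths = sorted(Path(src_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)

    n = len(paths)
    n_train, n_test = int(0.6 * n), int(0.2 * n)
    splits = {
        "train": paths[:n_train],
        "test": paths[n_train:n_train + n_test],
        "val": paths[n_train + n_test:],
    }

    for split, files in splits.items():
        out = Path(dst_dir) / split
        out.mkdir(parents=True, exist_ok=True)
        for p in files:
            # Resize every image to 512 x 512 before saving into its split folder
            Image.open(p).convert("RGB").resize(size).save(out / p.name)

build_dataset("chicken_parts_raw", "chicken_parts_dataset")
```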

Table 1. Categories and quantity of chicken parts data sets.

2.3. Principles of the Swin-Transformer algorithm

2.3.1. Swin-Transformer network structure

The network is built on the principles of shifted-window operations, attention mechanisms, and hierarchical layering. Its core components include the multilayer perceptron (MLP), the window multi-head self-attention mechanism (W-MSA), the shifted-window multi-head self-attention mechanism (SW-MSA), and layer normalization (LN). This architecture offers distinct advantages such as robust feature extraction, high prediction accuracy, rapid inference, and modest computational demands. The structure of the Swin-Transformer network is depicted in Figure 4 (Chun et al., 2022).

Figure 4. Structure of the Swin-Transformer network (Note: H is the height of the input image; W is the width of the input image; C is the dimension of the feature map; LN is layer normalization; W-MSA is the window multi-head self-attention structure; SW-MSA is the shifted-window multi-head self-attention structure; MLP is the multilayer perceptron).

Initially, an RGB three-channel chicken part image of 512 × 512 pixels is fed into the patch partition module. The patch partition divides the image into neighboring 4 × 4 pixel areas and flattens them along the channel direction, producing a chicken part feature map of 128 × 128 × 48, which is then input into Stage 1. The first stage comprises a linear embedding layer and Swin-Transformer blocks. The linear embedding layer projects the original features of each image patch into C = 128 dimensions, resulting in a feature map of 128 × 128 × 128, which is then passed to the Swin-Transformer block. The Swin-Transformer block incorporates residual connections and performs W-MSA and SW-MSA attention computations to improve feature extraction efficiency while reducing network computation. The MLP consists of a two-layer perceptron with Gaussian Error Linear Unit (GELU) activation functions, and layer normalization is applied before the W-MSA and SW-MSA operations. After feature extraction and computation in Stage 1, the feature map is passed through Stage 2 to Stage 4, each consisting of a patch merging layer and a varying number of Swin-Transformer blocks. Patch merging reduces the resolution of the feature map by a factor of two while merging along the depth direction to increase the number of channels, yielding a hierarchical design. As a result, as the feature map passes from Stage 2 onward, its resolution decreases from 128 × 128 to 64 × 64 and 32 × 32, while the number of channels increases from 128 to 256 and 512, respectively. Finally, Stage 4 outputs a feature map of size 32 × 32 × 512, and the results for each chicken part category are obtained by passing it through the LN layer, the global pooling layer, and the fully connected layer (Qamhan et al., 2023; Samal et al., 2023).
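The short script below simply traces the feature-map shapes described in this paragraph (patch partition, linear embedding to C = 128, and patch merging); it is a dimension check only, not an implementation of the network.

```python
# A quick arithmetic check of the feature-map shapes quoted above for a
# 512 x 512 x 3 input, patch size 4, and embedding dimension C = 128.
H, W, C_in, patch, C = 512, 512, 3, 4, 128

# Patch partition: each 4 x 4 neighbourhood is flattened into the channel axis
h, w, c = H // patch, W // patch, C_in * patch * patch
print("patch partition:", (h, w, c))      # (128, 128, 48)

# Stage 1: linear embedding projects 48 channels to C = 128
c = C
print("stage 1:", (h, w, c))              # (128, 128, 128)

# Patch merging halves the resolution and doubles the channels; two such steps
# reproduce the 64 x 64 x 256 and 32 x 32 x 512 maps quoted in the text,
# the latter being the Stage 4 output size.
for _ in range(2):
    h, w, c = h // 2, w // 2, c * 2
    print("after patch merging:", (h, w, c))
```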

2.3.2. Swin-Transformer attention mechanisms

While incorporating the MSA mechanism into the network architecture can enhance the extraction of object features and boost detection accuracy, computing MSA over every pixel of the feature map significantly escalates the network’s computational complexity, which hinders convergence. To address this issue, the Swin-Transformer network employs two types of MSA: W-MSA and SW-MSA. W-MSA segments the feature map into fixed-size windows of M × M (where M = 4) and calculates self-attention within each window. Nevertheless, information exchange between non-overlapping windows is limited, potentially leading to errors when extracting features of objects distributed across different windows. As a solution, SW-MSA employs shifted window partitions to broaden the receptive field and address the lack of information exchange between non-overlapping windows; this approach expands the four windows of W-MSA into nine windows. The Swin-Transformer, based on W-MSA, uses a window of size M × M as the calculation unit for processing the image area (He et al., 2023; Hui et al., 2023), which significantly reduces the computational complexity of the network compared with global MSA, which computes self-attention over all patches of the image. Assuming that the image size is h × w and C is the feature map dimension, the computational complexities of MSA and W-MSA are given by Equations (1) and (2), respectively.

(1) Ω(MSA) = 4hwC² + 2(hw)²C
(2) Ω(W-MSA) = 4hwC² + 2M²hwC

From Equations (1) and (2), it is evident that the computational complexity of W-MSA scales linearly with the image size, whereas the complexity of MSA exhibits a quadratic relationship. For images of the same size, the number of windows is considerably smaller than the number of patches, leading to a substantial reduction in computational complexity.
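For a sense of scale, the snippet below evaluates Equations (1) and (2) for one plausible configuration (a 128 × 128 feature map with C = 128 and window size M = 4, as in the network described above); the numbers are illustrative only.

```python
# Numerical comparison of Equations (1) and (2) for one configuration.
h, w, C, M = 128, 128, 128, 4

omega_msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C      # Eq. (1)
omega_wmsa = 4 * h * w * C**2 + 2 * M**2 * h * w * C     # Eq. (2)

print(f"MSA:   {omega_msa:.3e}")
print(f"W-MSA: {omega_wmsa:.3e}")
print(f"ratio: {omega_msa / omega_wmsa:.1f}x fewer operations with W-MSA")
```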

As illustrated in Figure 5, the conventional window splitting size is 2 × 2, with each window comprising 4 × 4 patches. However, this approach results in disjointed windows that lack connectivity. Therefore, the shifted window design on the right side is employed to rectify this issue. The shifted window design allows originally non-overlapping windows to be integrated through the combination of a W-MSA Swin-Transformer block and an SW-MSA Swin-Transformer block. The forward processes of W-MSA and SW-MSA are shown below (Yuru et al., 2022).

Figure 5. Hierarchical feature maps constructed by Swin Transformer and SW-MSA in the architecture.

(3) ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1)
(4) z^l = MLP(LN(ẑ^l)) + ẑ^l
(5) ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
(6) z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
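The snippet below sketches the structure of Equations (3)-(6): two consecutive blocks with pre-norm residual connections, the first using W-MSA and the second SW-MSA. The attention and MLP sub-modules are replaced by identity stand-ins, so this is a structural sketch rather than the authors’ implementation.

```python
# Structural sketch of a W-MSA + SW-MSA Swin block pair (Equations (3)-(6)).
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Identity stand-ins for the real windowed attention and two-layer GELU MLP
w_msa = lambda x: x
sw_msa = lambda x: x
mlp = lambda x: x

def swin_block_pair(z_prev):
    z_hat = w_msa(layer_norm(z_prev)) + z_prev          # Eq. (3)
    z = mlp(layer_norm(z_hat)) + z_hat                  # Eq. (4)
    z_hat_next = sw_msa(layer_norm(z)) + z              # Eq. (5)
    z_next = mlp(layer_norm(z_hat_next)) + z_hat_next   # Eq. (6)
    return z_next

tokens = np.random.randn(128 * 128, 128)  # (num_patches, C) after Stage 1
print(swin_block_pair(tokens).shape)
```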

2.4. Evaluation indicators

mAP (mean average precision) is commonly used in target detection tasks and serves as the evaluation index of model detection accuracy in this study. mAP is related to AP (average precision), precision P, and recall R; precision and recall can be expressed by Equations (7) and (8):

(7) P = TP / (TP + FP)
(8) R = TP / (TP + FN)

where TP (true positive) denotes the number of samples correctly judged as positive, FP (false positive) the number of samples incorrectly judged as positive, and FN (false negative) the number of samples incorrectly judged as negative (Du et al., 2023). AP represents the detection precision of a single category and is obtained by integrating the precision-recall curve. mAP is the mean of the detection precision AP over all categories and can be expressed by Equation (9):

(9) mAP = (1/M) ∑_{k=1}^{M} AP(k)

In the equation, M is the number of classes, and AP(k) represents the detection precision of the k-th class of objects. In addition to detection precision, the detection speed of the model is equally important; therefore, the average time taken by the model to process a single image is chosen as the second index for judging model performance in this study.
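As a small worked example of Equations (7)-(9), the snippet below computes precision, recall, and mAP from hypothetical counts and per-class AP values; none of these numbers are results from this study.

```python
# Minimal sketch of Equations (7)-(9) with placeholder numbers.
def precision(tp, fp):
    return tp / (tp + fp)                         # Eq. (7)

def recall(tp, fn):
    return tp / (tp + fn)                         # Eq. (8)

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (9)

# Hypothetical per-class AP values for the three chicken-part classes
ap = {"chicken breast": 0.97, "chicken leg": 0.96, "chicken wing": 0.95}
print(f"P = {precision(90, 5):.3f}, R = {recall(90, 10):.3f}")
print(f"mAP = {mean_average_precision(list(ap.values())):.3f}")
```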

2.5. Data enhancement - Mosaic

In this paper, the objective is to align the augmented training data as closely as possible with the actual data distribution to improve detection accuracy. During model training, the Mosaic data augmentation method was incorporated. Mosaic augmentation combines four images by applying random scaling, random cropping, and random arrangement to create a composite image, which helps the network extract features of small targets and improves its robustness. It also reduces GPU memory usage, since data from four images are processed at once. The primary Mosaic operation involves the following steps: first, each image is randomly cropped to obtain A; the goal of cropping is to select a portion of the original image rather than the whole image for subsequent splicing. Next, A is resized to the output image size to produce B; this standardizes the coordinate system, making the subsequent splicing more convenient, with the bounding boxes scaled accordingly. A specified region C is then randomly cropped from B. Finally, C is pasted into the corresponding position of the output image, at which point only a coordinate translation is required to adjust the bounding boxes. An example illustrating the process of composing images into a mosaic is shown in Figure 6 below (Wang et al., 2023).
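A simplified sketch of these Mosaic steps is given below using Pillow; it follows the textual description only (bounding-box adjustment is omitted), and the crop sizes and splicing-centre range are assumptions rather than the authors’ settings.

```python
# Simplified Mosaic composition: crop, resize, crop again, and paste each of
# four images into one quadrant around a random splicing centre (xc, yc).
import random
from PIL import Image

def mosaic(paths, out_size=512):
    canvas = Image.new("RGB", (out_size, out_size))
    xc = random.randint(out_size // 4, 3 * out_size // 4)  # splicing centre x
    yc = random.randint(out_size // 4, 3 * out_size // 4)  # splicing centre y
    # (left, upper, right, lower) boxes of the four quadrants around (xc, yc)
    quads = [(0, 0, xc, yc), (xc, 0, out_size, yc),
             (0, yc, xc, out_size), (xc, yc, out_size, out_size)]

    for path, (l, u, r, b) in zip(paths, quads):
        img = Image.open(path).convert("RGB")
        w, h = img.size
        # Steps 1-2: random crop A, then resize it to the output size to get B
        cw, ch = random.randint(w // 2, w), random.randint(h // 2, h)
        x0, y0 = random.randint(0, w - cw), random.randint(0, h - ch)
        B = img.crop((x0, y0, x0 + cw, y0 + ch)).resize((out_size, out_size))
        # Steps 3-4: crop region C matching the quadrant and paste it in place
        C = B.crop((l, u, r, b))
        canvas.paste(C, (l, u))
    return canvas

mosaic(["a.jpg", "b.jpg", "c.jpg", "d.jpg"]).save("mosaic.jpg")
```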

Figure 6. Example map of chicken part image filling (where (xc,yc) are randomly selected coordinates of the splicing center).

2.6. Transfer learning and test environment

Another crucial technique for optimizing the Swin-Transformer, as proposed in this study, is using transfer learning to train the Swin-Transformer model on the augmented chicken part dataset. Transfer learning gives the model higher initial performance before training begins, accelerates performance improvements during training, and ultimately yields a model with better generalization and robustness. In this study, experiments were conducted using the TensorFlow deep learning framework on a Windows 11 Professional operating system. The Swin-Transformer model was developed and trained using the Python language and the CUDA 11.8 parallel computing framework. The hardware was a Dell G15 image processing workstation equipped with an 8-core processor, 64 GB of RAM, and an RTX3060 graphics card, as shown in Table 2 below. The YOLOV3-Darknet53, YOLOV3-MobileNetv3, SSD-MobileNetv3, and SSD-VGG16 models were also built within this framework.
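A hedged sketch of the transfer-learning step in Keras-style TensorFlow is shown below; the weight file name, the wrapper function, and the pooling head are assumptions, since the paper only states that ImageNet pre-trained weights and TensorFlow were used.

```python
import tensorflow as tf

def prepare_for_finetuning(backbone: tf.keras.Model,
                           num_classes: int = 3,
                           weights_path: str = "swin_imagenet_pretrained.h5"):
    """Attach a 3-class chicken-part head to an ImageNet pre-trained backbone.

    The weight file name is hypothetical; by_name/skip_mismatch tolerate the
    replaced classification head when loading the pre-trained weights.
    """
    backbone.load_weights(weights_path, by_name=True, skip_mismatch=True)

    # Optionally freeze the earliest layers so only the later stages adapt first
    for layer in backbone.layers[: len(backbone.layers) // 2]:
        layer.trainable = False

    # Assumes the backbone outputs a (num_tokens, channels) sequence, as a
    # Swin-style backbone would; pool it and classify into the three parts
    x = tf.keras.layers.GlobalAveragePooling1D()(backbone.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(backbone.input, outputs)
```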

Table 2. Swin Transformer model building hardware and software configuration table.

3. Experiment and results

3.1. Detection process and model training

This paper leverages the Swin-Transformer for chicken part detection, comprising three primary steps: 1) construction and division of the chicken parts dataset; 2) establishment of the Swin-Transformer model and application of transfer learning for model training and optimization; 3) introduction of four additional models (YOLOV3-Darknet53, YOLOV3-MobileNetv3, SSD-MobileNetv3, and SSD-VGG16) for comparative experiments alongside the Swin-Transformer model. The flowchart of chicken part detection is shown in Figure 7.

Figure 7. Flowchart of Swin-Transformer chicken part detection.

Incorporating transfer learning into the training process not only expedites the model’s convergence but also enhances its training effectiveness; accordingly, all models developed in this paper were initialized with pre-trained weights from the ImageNet dataset. During training, the models use an SGD optimizer. The initial learning rate is a relatively important hyperparameter that affects the speed and effectiveness of model convergence and is generally set between 0.01 and 0.001; setting the learning rate decay coefficient too small causes the gradient to descend too slowly, while setting it too large makes it difficult for the model to converge. The weight decay coefficient reduces model overfitting to a certain extent and is generally set between 0.0001 and 0.001. In this paper, based on the performance of the dataset in pre-training, the initial learning rate was set to 0.01, the learning rate decay coefficient to 0.1, the weight decay coefficient to 0.0005, the momentum factor to 0.937, the batch size to 8, and the confidence threshold to 0.5, with an upper limit of 20,000 iterations. Furthermore, an automatic strategy for saving the optimal model during training was implemented. As depicted in Figure 8, the loss function value declined rapidly in the initial stages of training, the rate of decrease gradually diminished after 80 iterations, and the loss converged around the 100-iteration mark. The mAP curve during training is shown in Figure 9.
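The hyperparameter values listed above can be collected into a TensorFlow training configuration as sketched below; the step interval of the decay schedule and the way weight decay is applied are assumptions, as the paper reports the values but not the exact API calls.

```python
# Hedged training configuration using the hyperparameter values quoted above.
import tensorflow as tf

initial_lr = 0.01
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=5000,      # assumed step interval; not stated in the paper
    decay_rate=0.1,        # learning-rate decay coefficient
    staircase=True)

optimizer = tf.keras.optimizers.SGD(
    learning_rate=lr_schedule,
    momentum=0.937)        # momentum factor from the paper

weight_decay = 0.0005      # applied as L2 regularisation on the weights (assumption)
batch_size = 8
max_iterations = 20000
confidence_threshold = 0.5
```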

Figure 8. Variation of loss function curve during training of Swin-Transformer.

Figure 9. Plot of mAP change during the Swin-Transformer training process (0.5 is the confidence threshold; predictions above 0.5 are considered correct).

3.2. Generalization testing

The images of chicken parts were captured under controlled light source illumination, with an effort to simulate the various lighting conditions that may be encountered on the production floor by continuously adjusting the light source along the conveyor belt. Although the proposed model demonstrates high detection accuracy, its performance under diverse lighting conditions remains uncertain. To address this, 100 chicken part images were randomly selected and their brightness was manipulated to simulate different lighting scenarios, creating a generalization ability test set of 200 images. The following section presents some of the test results evaluating the model’s ability to generalize on this test set.
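A possible way to script the construction of this 200-image generalization set is sketched below; the brightness factors and directory names are illustrative assumptions, as the paper does not specify them.

```python
# Sketch: each of 100 sampled images gets a brightened and a darkened copy.
import random
from pathlib import Path
from PIL import Image, ImageEnhance

def build_generalisation_set(src_dir, dst_dir, n=100, seed=0):
    paths = random.Random(seed).sample(sorted(Path(src_dir).glob("*.jpg")), n)
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for p in paths:
        img = Image.open(p).convert("RGB")
        # Brightness factors 1.5 and 0.6 are assumptions, not the paper's values
        ImageEnhance.Brightness(img).enhance(1.5).save(out / f"bright_{p.name}")
        ImageEnhance.Brightness(img).enhance(0.6).save(out / f"dark_{p.name}")

build_generalisation_set("chicken_parts_dataset/test", "generalisation_set")
```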

As can be seen from the detection results for the brightened and darkened chicken part images in Figure 10, the proposed model accurately detects the chicken parts with high confidence, and there is no obvious deviation between the real contours of the chicken parts and the labeled bounding boxes. Combined with the radar plots of recall and precision for each chicken part in Figure 11, the recall and precision values for the three chicken parts are distributed between 0.9 and 1.0, with all precision values greater than 0.95. The results show that the model can be used for the detection and classification of chicken parts under different brightness conditions and has very good generalization ability. However, the posture of the chicken parts may change with the shooting angle, which can make the extraction and recognition of target features more difficult for the proposed method; in industrial scenarios in particular, the posture of chicken parts on the conveyor belt may vary greatly, which poses a challenge to the robustness of the algorithm.

Figure 10. Original figure, image after increasing brightness, image after reducing brightness.

Figure 11. Radar chart of accuracy and recall of three chicken parts.

3.3. Anti-occlusion testing

The disjointed chicken parts were haphazardly strewn across the conveyor belt, resulting in occasional stacking, occlusion, and the absence of target features. The loss of features in chicken parts, as well as the aforementioned situations, poses challenges for accurate classification and detection. Nevertheless, the unobstructed portions within the chicken part images still retained distinct features that remained independent of one another. If the model can adeptly recognize and differentiate these features, it can still successfully achieve accurate classification and detection of chicken parts. Acknowledging the issue of occlusion in chicken parts, a test dataset comprising 100 stacked and occluded chicken part images was curated to assess the model’s resilience against occlusion.

In Figure 12, most of the chicken wings were partially covered by other chicken parts, and only a small portion of their target features was visible, which made these parts prone to being misidentified as other categories; parts that were not fully exposed were also detected incompletely or inaccurately. However, in these cases the chicken wings still retained fairly obvious features, so the proposed model was still able to detect the category and location of the corresponding chicken parts in the image. In addition, the window multi-head self-attention mechanism and the shifted-window multi-head self-attention mechanism in the Swin-Transformer structure improve the model’s ability to extract and learn the contextual semantic information of the image, making it easier for the model to distinguish edge contours when chicken parts adhere to one another and have similar features, thereby reducing the probability of false or incomplete detection. The results on the anti-occlusion dataset are illustrated in Figure 12: the model remains accurate when detecting images of chicken parts that occlude one another, with detection accuracies of 0.93 for chicken legs, 0.92 for chicken wings, and 0.78 for the third part category. These outcomes underscore the model’s robust anti-occlusion capability, confirming its proficiency in accurately identifying the category and spatial location of each chicken part.

Figure 12. (a) Stacked original image, (b) Dataset test results graph.

3.4. Comparison of Swin-Transformer performance with other models

Based on the trained optimal Swin-Transformer model, tests were carried out on the chicken part dataset to obtain the detection accuracy (mAP) and the average detection time for a single image, which were used to judge the detection performance of the model. The test results and some examples are shown in Table 3 and Figure 13 below.

Figure 13. Swin-Transformer detection of the actual effect diagram.

Table 3. Swin-Transformer model detection time and mAP.

As shown in Figure 13, for the validation set in this study the Swin-Transformer model exhibits no missed or erroneous detections, and the bounding boxes for the mid-wing, wing, and drumstick are thoroughly and accurately annotated at their respective locations. Each chicken part category is detected with high confidence, and the experimental results indicate that the Swin-Transformer model achieves an mAP of 97.21% with a single-image detection time of 19.02 ms. In this paper, the Swin-Transformer model was trained using a high-performance GPU, which processes and computes the data faster than a CPU and therefore further reduces detection time. The window multi-head self-attention structure of the Swin-Transformer computes information within each window without transferring it across windows, which greatly reduces the amount of computation, especially in the shallow layers of the network (low downsampling rates), so the Swin-Transformer model in this paper performs well in terms of computation speed and test time. The self-attention mechanism in the Swin-Transformer also enables parallel computation at each step, offering a speed advantage over CNNs, which require serial convolution operations at each step; this means CNNs are less computationally efficient when processing longer sequences. Furthermore, the self-attention mechanism allows the model to gather information from any position in the sequence, enhancing its capability to handle long-distance dependencies, whereas CNNs address such dependencies by increasing the number of convolutional layers (Dou et al., 2023).

Currently, there are various deep learning-based target detection methods; however, their detection performance varies significantly across different detection tasks. To verify the advantages and disadvantages of the algorithm in this paper for the chicken part detection problem, its performance must be compared with several other algorithms. The evaluation criterion used in the experiments of this study was that the model should achieve an accuracy of more than 80% for each target feature and an average accuracy of more than 85% across the three target features per sample. In this paper, SSD-VGG16, SSD-MobileNetv3, YOLOV3-Darknet53, and YOLOV3-MobileNetv3 were introduced for comparison experiments, and the contrast was increased by replacing the feature extraction networks of SSD and YOLOV3 with MobileNetv3. On the dataset of this paper, the trends of the four models’ loss function values with the number of iterations are very close to one another; the loss function curves are shown in Figures 14 and 15. The mAP and average detection time of the four models are given in Table 4 below, and the actual detection results are shown in Figures 16-19.

Figure 14. YOLOV3 loss function curve change.

Figure 15. Variation of SSD loss function curve.

Figure 16. SSD-VGG16 actual effect test picture.

Figure 17. SSD-MobileNetv3 actual effect test chart.

Figure 18. YOLOV3-Darknet53 actual effect test chart.

Figure 19. YOLOV3-MobileNetv3 actual effect test chart.

Table 4. Swin-Transformer model comparison of mAP and average detection time with other models.

The actual detection results demonstrated that all four models, namely SSD-VGG16, SSD-MobileNetv3, YOLOV3-MobileNetV3, and YOLOV3-Darknet-53, accurately located and categorized the chicken parts without misses or misdetections. Each model also exhibited strong generalization ability for chicken parts with multi-scale features, although detection precision varied, and each was capable of accurately recognizing not only separated chicken parts but also parts that were obscured, overlapped, or stacked together. Based on Table 4, SSD-VGG16 exhibited higher detection precision than SSD-MobileNetv3. Replacing the backbone feature extraction network of the SSD model with MobileNetv3 significantly reduces the model’s computational volume and increases speed (Ren et al., 2019), but MobileNetv3 performed relatively poorly in the chicken part detection task compared with VGG16. This can be attributed to VGG16 having a larger number of parameters than MobileNetv3, which allows it to extract deeper and more abstract features of chicken parts and thus improve detection precision (Liu & Li, 2022). VGG16 uses the same size of convolutional kernel (3 × 3) and maximum pooling (2 × 2) throughout the network; using multiple small (3 × 3) convolutional layers is more effective than a single large (5 × 5) convolutional layer for progressively deepening the network structure and improving performance. The convolutional and pooling layers of the SSD-VGG16 model use the same kernel parameters, so each tensor maintains the same width and height as the previous one. The model stacks a number of convolutional and pooling layers to form a deeper network and detects targets jointly across multiple feature layers to avoid the loss of information in intermediate stages. However, because a large number of small-scale windows are used in the detection process, some objects cannot be handled effectively. The SSD-MobileNetv3 model significantly reduces model size and computation while maintaining high performance and occupies less memory when embedded in devices, but the depthwise separable convolution it uses limits the model’s information transfer capability to some extent. The performance gap between YOLOV3-MobileNetV3 and YOLOV3-Darknet-53 is similar to the case discussed above; comparison graphs of the experimental results are shown in Figures 18 and 19. Darknet-53 introduces residual modules, each consisting of two convolutional layers and a shortcut connection, to address the problem of gradient vanishing in deep neural networks, while YOLOV3 enhances small target detection through multi-scale downsampling feature fusion (Ma et al., 2016). These improvements give the YOLOV3-Darknet-53 model higher detection precision than YOLOV3-MobileNetV3. The YOLOV3-Darknet53 model essentially solves the problem of small target detection, and by introducing a multi-scale detection mechanism it handles targets at different scales better and provides more comprehensive detection results, achieving a good balance between speed and accuracy. However, its network structure is more complex, so it does not achieve the desired results when deployed on resource-constrained devices, and the anchor boxes deviate considerably from the actual detection boxes. The YOLOV3-MobileNetv3 model has a speed advantage over the YOLOV3-Darknet53 model but encounters the same problems as the SSD-MobileNetv3 model in real-world detection tasks. The SSD model utilizes a feature pyramid structure to enhance its multi-scale target detection capability, but it showed lower detection precision than YOLOV3 on the dataset used in this paper. Although all four models achieved accurate detection of chicken part categories and locations, differences in their feature extraction network structures led to variations in detection performance.

3.5. Impact of different feature extraction networks on the effectiveness of model classification detection

The classification and detection performance of a model is often affected by the choice of feature extraction network. Feature extraction networks can be categorized into lightweight and heavyweight networks: heavyweight networks perform well on complex classification problems, while lightweight networks with fewer parameters are better suited to mobile devices (Ya et al., 2023). MobileNetV3 is a lightweight deep learning network that uses depthwise separable convolution instead of traditional convolution, leading to a substantial reduction in network parameters and training computation. Based on Table 4, both YOLOV3-MobileNetV3 and SSD-MobileNetv3 reduced their detection times by 11.1 ms and 14.1 ms, respectively, compared with their original networks. This suggests that the lightweight MobileNetv3 network can effectively improve the average detection speed of the model while still achieving good detection precision (Chen et al., 2023).

The detection precision and average detection time for each model were obtained by comparing the target detection performance of the five deep learning-based models on the chicken parts dataset. As depicted in Figure 20, the Swin-Transformer model achieved mAP improvements of 2.13%, 5.26%, 4.48%, and 1.62% over YOLOV3-MobileNetV3, SSD-MobileNetv3, SSD-VGG16, and YOLOV3-Darknet-53, respectively, and reduced the average detection time by 5.08 ms, 9.38 ms, 23.48 ms, and 16.18 ms compared with these models. Therefore, the overall detection performance of the Swin-Transformer model has a significant advantage over the other four models.

Figure 20. Comparison of mAP and average detection time for the five models.

Figure 21. Comparison of the accuracy and recall of the five models.

4. Discussion

In comparison with related studies on chicken carcasses, the model presented in this paper covers the greatest variety of chicken part categories and achieves the highest classification and detection accuracy. Moreover, the proposed detection and classification method offers several notable advantages; for instance, it does not depend on manually selected chicken carcass features for extracting chicken parts, even when the parts are obscured or adhere to one another. Nevertheless, its primary drawback lies in its dependence on the structural characteristics of chicken carcasses and the processing technology, necessitating manual data expansion. In addition, poultry processors use different processing techniques and produce different types of chicken parts, and the methodology of this paper has not been validated for chicken parts other than the three main parts in this study. The proposed multi-part classification and detection method effectively addresses the challenge of classifying and detecting chicken parts in various scenarios on the conveyor belt. The approach outlined in this paper has currently been validated only on static images and does not capture spatial distance information of chicken parts. In the future, the image acquisition equipment could be replaced with an RGB-D camera to acquire depth characteristics of the chicken parts; this advancement could be applied to real-time video processing on production lines to provide depth-distance guidance for automatic sorting equipment. Automated chicken part sorting equipment will probably be integrated with other automation equipment to build a complete automated production line, realizing full automation from chicken slaughter to packaging and sorting, and potential applications in other livestock meat processing can also be explored. Table 5 shows the different methods used in chicken carcass related studies and their accuracy rates.

Table 5. List of research parameters related to chicken carcasses.

Wu et al. used a combination of SVM (support vector machine) and BP neural network algorithms for the rapid detection of chicken wings. The combination of the two can deal with high-dimensional data and approximate arbitrarily complex nonlinear mapping functions; SVM minimizes the generalization error while ensuring that the training error is minimized, and the BP neural network algorithm has very good robustness. However, SVM and BP neural network algorithms lose information about the target features by transforming them into numerical computations. Qi et al. used random forests to measure chicken carcass length and contour length. Random forests run effectively on large datasets, and the randomness they introduce makes them less prone to overfitting during training; however, when the random forest contains a large number of decision trees, the space and time required for training become large, which slows down the model. Li et al. used edge detection and threshold processing to measure chicken carcass length and contour length. Edge detection smooths noise, can effectively eliminate its influence, and offers good localization, but the high and low thresholds need to be set manually and the adaptive ability is poor. Faster RCNN, a two-stage network with an RPN, achieves higher object detection accuracy than one-stage networks, especially for multi-scale and small objects; however, its feature map is a single layer with small resolution, and the two rounding operations in the original RoI pooling cause a loss of accuracy.

5. Conclusion

The deep learning-based target detection model Swin-Transformer excels at precisely detecting chicken part categories and their locations, even when parts are partially obscured. On the chicken parts validation set, the Swin-Transformer model attains a detection accuracy of 97.21% and an average detection time of 19.02 ms. This represents a substantial improvement over the YOLOV3 and SSD models, enhancing detection accuracy by 1.62% and 4.48% while reducing processing time by 16.18 ms and 23.48 ms, respectively.

By using a lightweight neural network as the feature extraction component, the model’s detection speed can be significantly increased, although at the cost of a minor reduction in chicken part detection accuracy. The Swin-Transformer-based method proposed in this paper delivers the most comprehensive detection performance, enabling quick and accurate identification of chicken part categories and their locations on the production line, and excels in detection accuracy, real-time processing, and multi-scale detection capability. Compared with the methods adopted by Wu et al. and Li et al., which first apply image processing techniques such as greyscaling and denoising and then establish detection and weight prediction models based on the target’s shape characteristics, the detection method in this paper saves time, omits many cumbersome intermediate steps, and holds advantages in both detection speed and accuracy. The proposed method has the potential to improve production efficiency and processing speed in the poultry processing industry: by accurately identifying and classifying chicken parts, the sorting and processing workflow can be automated, reducing the need for manual intervention. This research serves as a valuable reference for the study of automatic sorting technology and equipment for livestock meat and lays the groundwork for the development of intelligent meat processing techniques. Applying computer vision to the field of poultry processing is not only a meaningful expansion and extension of computer vision but also a practical test of its capabilities. The poultry processing field puts forward new challenges and requirements for computer vision technology and deep learning-based detection systems, pushing them toward more efficient methods for target detection, image segmentation, and related tasks, while the needs of the field will further expand the application scenarios of computer vision and deep learning in target detection and image recognition. In the future, computer vision technology can be used to achieve real-time monitoring and optimization of the poultry processing process, including identification, detection, and localization tasks, improving processing efficiency and product quality through real-time feedback and adjustment. The research method employed in this paper is based on recognizing and detecting invariant biometric features of the samples; consequently, it exhibits cross-domain applicability for extracting and identifying the fundamental features of objects in other domains. To further verify the adaptability and performance of the model for different environments and datasets, this paper validated the applicability and stability of the method by adjusting the lighting conditions and by simulating chicken parts stacked together as in a real processing workshop. Ultimately, the experimental results demonstrate that the proposed model retains good adaptability in different scenarios.

Acknowledgments

The authors extend their gratitude to Shida Zhao and Shucai Wang for their technical assistance with this study.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data used to support the findings of this study are available from the corresponding author upon request.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [51905387], the Scientific Research Project of the Department of Education of Hubei Province [D20211601], and the Research Project of Wuhan Polytechnic University [2021Y26].

References

  • Chen, H., Shao, H., & Zhang, J. (2023). Research on improving YOLOv3's ship detection algorithm. Modern Electronics Technology, 46(2), 101–14.
  • Chun, B., Nan, H., Yi, Z., Zhang, S., Xu, S., & Yu, H. (2022). Development of deep learning methodology for maize seed variety recognition based on improved Swin Transformer. Agronomy, 12(8), 1843. https://doi.org/10.3390/agronomy12081843
  • Dou, Z., Hu, G., & Liang, Y. (2023). Lightweight target detection algorithm based on improved Yolov4-tiny. Computer Science, 50(S1), 484–490.
  • Du, Z., Ding, X., Xu, Y., & Li, Y. (2023). UniDL4BioPep: A universal deep learning architecture for binary classification in peptide bioactivity. Briefings in Bioinformatics, 24(3), bbad135. https://doi.org/10.1093/bib/bbad135
  • He, P., Ma, J., & Li, C. (2023). Research on malaria cell image recognition based on swin transformer. Chinese Journal of Medical Physics, 40(8), 996–1001.
  • Hui, L., Jia, L., Lian, C., & Wu, M. (2023). Strans-YOLOX: Fusing Swin Transformer and YOLOX for automatic pavement crack detection. Applied Sciences, 13(3), 1999. https://doi.org/10.3390/app13031999
  • Jiang, F. (2022). Review of broiler market situation in China in 2021 and outlook for 2022. China Poultry Guide, 39(4), 35–38.
  • Jin, L., Liu, K., Zou, J., Lu, R., & Cui, M. (2023). Research on strawberry target detection technology based on YOLOv4. Modern Agricultural Technology, 12(1), 119–122.
  • Li, H. (2017). Online grading system for chicken carcass quality based on machine vision. Nanjing Agricultural University.
  • Li, J., Qu, W., Garden, F., Zhao, M., Liu, F., & Zhang, K. (2021). Effect of slaughtering process and equipment on automatic chicken carcass segmentation. Meat Industry, 10(4), 36–41.
  • Liu, J., & Huang, J. (2022). Centre point target detection algorithm based on improved Swin Transformer. Computer Science, 1–14.
  • Liu, T., & Li, D. (2022). Detection method for sweet cherry fruits based on YOLOv4 in the natural environment. Asian Agricultural Research: English Edition, 14(1), 66–76.
  • Liu, M., Liu, L., Shi, T., Hao, H., & Qiang, X. (2023). An optimised Swin Transformer tomato leaf disease identification method. Journal of China Agricultural University, 28(4), 80–90.
  • Li, N., & Yu, B. (2020). Computer vision-based identification of rice weedy plants. Agricultural Mechanization Research, 42(12), 228–231.
  • Li, Y., Zhu, C., & Zhang, J. (2023). Segmentation of power equipment in complex scene of substation based on improved transformer. Journal of Taiyuan University of Technology, 1–9.
  • Ma, S., Zhang, Y., Wang, L., Zhang, X., Jin, W., Wang, L., & Chang, W. (2016). Preliminary study on the identification of tuna stock characteristics based on YOLOv3 model. Fisheries Modernization, 48(5), 79–84.
  • Pan, J., Qian, J., & Liu, S. (2016). Optimal selection of colour features for computer vision for pork freshness detection. Food and Fermentation Industry, 42(6), 153–158.
  • Qamhan, M., Alotaibi, Y., & Selouani, S. (2023). Source microphone identification using Swin Transformer. Applied Sciences, 13(12), 7112. https://doi.org/10.3390/app13127112
  • Qi, C., Xu, J., & Liu, C. (2019). Automatic chicken carcass quality grading method based on machine vision and machine learning technology. Journal of Nanjing Agricultural University, 42(3), 551–558.
  • Ren, Y., Yang, J., Liu, F., & Zhang, Q. (2019). Research on target detection method based on SSD and MobileNet network. Computer Science and Exploration, 13(11), 1881–1893.
  • Samal, S., Zhang, Y., Gadekallu, T., & Balabantaray, B. K. (2023). ASYv3 attention-enabled pooling embedded Swin Transformer-based YOLOv3 for obscenity detection. Expert Systems, 40(8). https://doi.org/10.1111/exsy.13337
  • Sun, Y., Li, D., Lin, X., & Chen, Y. (2021). A review on the application of computer vision technology in poultry breeding and rooster selection. Journal of Agricultural Machinery, 52(S1), 219–228+283.
  • Tang, C., Bu, Z., & Fu, S. (2016). The current situation and development prospect of poultry farming in China. Agricultural Development and Equipment, (7), 40.
  • Tang, W., Feng, H., & Xu, H. (2023). A double attention feature fusion algorithm for small target detection. Journal of Wuhan University of Technology, 1–8.
  • Tian, M., Guo, H., Chen, H., Wang, Q., Long, C., & Ma, Y. (2019). Automated pig counting using deep learning. Computers and Electronics in Agriculture, 163, 104840. https://doi.org/10.1016/j.compag.2019.05.049
  • Wang, L., Wang, B., & Li, D. (2023). Research on target detection and classification of flat mushroom in mushroom room based on improved YOLOv5. Journal of Agricultural Engineering, 1–9.
  • Wu, H. (2016). Research on quality inspection of chicken wings based on machine vision. Shandong Agricultural University.
  • Wu, J., Wang, H., & Xu, X. (2022). Rapid detection of broken wings in chicken carcasses based on machine vision. Journal of Agricultural Engineering, 38(22), 253–261.
  • Xiao, Z. (2022). Research on chicken part detection and recognition technology based on deep learning. Wuhan Polytechnic University.
  • Ya, L., Xiao, X., Wen, X., Jia, H., & Hui, H. (2023). MobileNetV3-CenterNet: A target recognition method for avoiding missed detection effectively based on a lightweight network. Journal of Beijing Institute of Technology, 32(1), 82–94.
  • Yu, X., Ke, Z., Wen, M., Chen, B., Chen, H.-M., & Zhang, J.-Y. (2023). Transmission line insulator defect detection based on Swin Transformer and context. Machine Intelligence Research, 20(5), 729–740. https://doi.org/10.1007/s11633-022-1355-y
  • Yuru, C., Jing, F., Juan, L., Pang, B., Cao, D., & Li, C. (2022). Detection and classification of lung cancer cells using Swin Transformer. Journal of Cancer Therapy, 13(7), 464–475. https://doi.org/10.4236/jct.2022.137041
  • Zhang, Y., Xie, Y., & Xu, X. (2023a). Cucumber leaf area measurement model based on improved Mask R-CNN. Journal of Agricultural Engineering, 1–8.
  • Zhang, L., Jia, D., & Pan, T. (2023b). Distributed ultra-wideband radar human motion recognition based on CNN-Swin Transformer. Telecommunications Technology, 1–10.
  • Zhao, D., Wang, C., & Bai, Y. (2021). Real-time semantic segmentation of sheep skeleton images based on generative adversarial network and ICNet. Journal of Agricultural Machinery, 52(2), 329–339+380.
  • Zhao, H., & Yang, J. (2023). Aero-engine fault diagnosis based on fusion convolution transformer. Journal of Beijing University of Aeronautics and Astronautics, 1–14.