Research Article

NAS-YOLOX: a SAR ship detection using neural architecture search and multi-scale attention

Pages 1-32 | Received 30 Apr 2023, Accepted 06 Sep 2023, Published online: 04 Oct 2023

Abstract

Due to the advantages of all-weather capability and high resolution, synthetic aperture radar (SAR) image ship detection has been widely applied in the military, civilian, and other domains. However, SAR-based ship detection suffers from limitations such as strong scattering of targets, multiple scales, and background interference, leading to low detection accuracy. To address these limitations, this paper presents a novel SAR ship detection method, NAS-YOLOX, which leverages the efficient feature fusion of the neural architecture search feature pyramid network (NAS-FPN) and the effective feature extraction of a multi-scale attention mechanism. Specifically, NAS-FPN replaces the PAFPN in the baseline YOLOX, greatly enhancing the fusion of multi-scale feature information, and a dilated convolution feature enhancement module (DFEM) is designed and integrated into the backbone network to enlarge the receptive field and improve target information extraction. Furthermore, a multi-scale channel-spatial attention (MCSA) mechanism is devised to sharpen the focus on target regions, improve small-scale target detection, and adapt to multi-scale targets. Extensive experiments conducted on the benchmark datasets HRSID and SSDD demonstrate that NAS-YOLOX achieves comparable or superior performance relative to other state-of-the-art ship detection models, reaching best AP0.5 accuracies of 91.1% and 97.2%, respectively.

1. Introduction

Synthetic Aperture Radar (SAR) serves as a proactive microwave sensor, enabling Earth observations regardless of lighting or weather constraints. Relative to optical remote sensing, SAR offers paramount significance in various applications (Q. Tian et al., Citation2020). Its deployment spans across domains like military early warning, maritime planning, and sea traffic management (X. Zhou & Li, Citation2023). The efficacy of SAR in these spheres largely hinges on its target detection and recognition capabilities, with the principal objective being the pinpointing of areas and targets of interest via adept algorithms, followed by an accurate categorisation of these entities (Xia et al., Citation2022).

In earlier years, ship detection largely depended on manually designed features, such as the constant false alarm rate (CFAR) (L. Zhang et al., Citation2022) and saliency (Ma et al., Citation2022). The CFAR algorithm, by evaluating background clutter’s statistical properties, modifies the detection threshold and ascertains targets (D. Li, Han, Zheng, et al., Citation2022). Conversely, the saliency approach involves feature extraction, pixel saliency value calculation, and thresholding to discern salient from non-salient areas (D. Li, Han, Weng, et al., Citation2022). However, these methods, due to their dependence on clutter statistics, often struggle with complex coastlines and sea clutter, resulting in compromised accuracy and adaptability (D. Han et al., Citation2022).

In recent times, the advent of deep learning (DL) and convolutional neural networks (CNNs) has heralded a new era in SAR ship detection, providing solutions to previous limitations (H. Li, Han, and Tang, Citation2022). Compared to traditional techniques, CNN-based approaches boast superior benefits such as automated feature extraction, impressive efficiency, simplicity, and heightened accuracy. These networks, by processing vast datasets, have the capacity to discern and learn sophisticated features, thus enhancing target recognition capabilities. Consequently, the SAR ship detection field has seen a marked inclination towards adopting and researching CNN-based methodologies (D. Hou et al., Citation2021).

For example, Ke et al. (Citation2021) utilised the Faster R-CNN framework (Girshick, Citation2015) to introduce a SAR ship detection technique with a variable convolution kernel, allowing it to mimic the geometric alterations of ships with changing shapes. Drawing inspiration from RetinaNet (Lin, Goyal, et al., Citation2017), Miao et al. (Citation2022) developed a streamlined SAR image ship detection system employing the ghost module alongside spatial and channel attention mechanisms. Focusing on the FCOS structure (Z. Tian et al., Citation2019), M. Zhu et al. (Citation2022) sought to enhance the feature depiction of small and dimly visible ships through a revamped feature extraction process. Building upon YOLOX (Ge et al., Citation2021), S. Li, Fu, et al. (Citation2022) introduced a SAR ship feature enhancement technique grounded in high-frequency sub-band channel fusion, emphasising the use of contour details.

The Feature Pyramid Network (FPN) has undeniably carved its niche in the domain of multi-output SAR ship detection ever since its debut (Lin, Dollár, et al., Citation2017). A pivotal factor behind this reception is the diverse size range and rich semantic content exhibited by SAR ships. FPN’s prowess lies in its ability to proficiently merge features drawn from the foundational network, capitalising on its remarkable multi-scale feature representation capacity. This capability has prompted a plethora of subsequent research to infuse FPN into detection frameworks to discern targets spanning multiple scales in SAR imagery. A few notable instances include: T. Zhang et al. (Citation2021) unveiled the Quad Feature Pyramid Network (Quad-FPN), which employs a quartet of feature pyramids geared towards precise SAR ship target identification. K. Zhou et al. (Citation2022) presented the Feature Pyramid Network augmented with fusion coefficients (FC-FPN), a strategy that zeroes in on adaptive multi-scale amalgamation of feature maps to refine SAR ship detection accuracy. Zhao et al. (Citation2020) sculpted an attention-centric receptive pyramid framework, meticulously tailored to discern ships of variable dimensions against intricate backdrops. Such explorations underscore FPN’s adaptability and effectiveness in confronting the multifaceted challenges inherent in SAR ship detection, facilitating proficient detection across diverse scales and scenarios.

Despite the notable advancements in multi-scale SAR object detection achieved by the above-mentioned CNN-based algorithms, certain challenges persist.

It is important to highlight the limitations of existing FPNs. Firstly, they suffer from the loss of important details. FPN utilises upsampling and downsampling operations to combine features from different levels, but this process can result in inaccurate feature fusion, particularly at deeper layers. Consequently, the fused features may contain noise or lose crucial details, ultimately reducing the model’s accuracy. Secondly, FPN structures exhibit heavy computational workloads and poor robustness. These structures are typically manually designed with fixed upsampling and downsampling operations. Such design constraints can hinder the full integration of extracted features in the FPN, limiting its ability to adapt to different tasks or datasets. As a result, the model’s adaptability and generalisation capacity are diminished. Furthermore, CNN-based algorithms also face limitations in detecting small target ships in SAR images. In practical scenarios, when the ship size is small, it is represented as a mere bright spot in the SAR image. Due to the scarcity of feature information available for small ships, distinguishing them from the background clutter during detection becomes challenging. Consequently, this leads to low detection rates and a higher likelihood of false alarms (H. Li et al., Citation2020).

To address the aforementioned challenges, this study introduces a cutting-edge SAR ship detection framework named NAS-YOLOX. Drawing inspiration from YOLOX, it seamlessly integrates the capabilities of Neural Architecture Search Feature Pyramid (NAS-FPN) (Ghiasi et al., Citation2019). Unlike the traditional FPN, often criticised for overlooking vital details and its reduced resilience, NAS-FPN harnesses the full potential of features from the backbone network. It autonomously determines the best fusion technique, thereby enhancing the precision of the detection process. Departing from conventional experience-led methods, NAS-FPN’s data-centric approach minimises manual tweaks, thus bolstering the overall efficiency and robustness. Furthermore, we present a multi-scale channel-spatial attention mechanism (MCSA) specifically to hone in on small object detection. By fusing channel and spatial attention strategies, the MCSA pinpoints and accentuates feature-rich spatial regions in SAR imagery, ensuring the accurate identification of minuscule objects. The network’s prowess in discerning small entities is further augmented by the introduction of a dilated convolutional feature enhancement module (DFEM) in the backbone. By broadening the receptive field and refining its semantic information retrieval capability concerning ship entities, the system is primed for recognising small targets. In essence, NAS-YOLOX rectifies the pitfalls of prevailing models by synergizing NAS-FPN, MCSA, and DFEM, promising heightened precision and speed, particularly for multi-scale SAR ship detection.

The main contributions of this paper can be summarised as follows.

  1. We introduce a ship detection framework tailored for SAR imagery. Leveraging neural architecture search, this framework autonomously identifies the optimal feature pyramid for SAR ship detection, proficiently detecting ships across multiple scales.

  2. We design and integrate a Dilated Convolution Feature Enhancement Module (DFEM) within the core network. This module augments the model’s receptive field and accentuates its capability to distill semantic nuances of ship entities. Consequently, the framework exhibits enhanced prowess in ship feature extraction and achieves superior detection accuracy.

  3. We introduce the Multi-Scale Channel-Spatial Attention (MCSA) module, designed to extract channel and spatial attention across varying scales. This boosts the model's ship detection proficiency by emphasising global contextual nuances.

  4. A comprehensive suite of experiments is conducted, underscoring the superior performance of our proposed approach in juxtaposition with established methods in the domain.

The structure of this paper unfolds as follows: Section II delves into a concise review of relevant studies in SAR object detection. Our research methodology is elucidated in Section III. Section IV showcases the efficacy of our proposed technique and contrasts it with leading object detection methodologies. Section V offers a discussion on the merits and constraints of our model. We wrap up with Section VI, where we encapsulate the paper’s essence and contemplate prospective research avenues.

2. Related works

In this segment, we touch upon both conventional techniques and SAR ship detection methods rooted in deep learning. The contributions from these studies have played a pivotal role in shaping our methodology.

2.1. Traditional SAR ship detection algorithm

Traditional methods of SAR target detection for ships often employed detectors using constant false alarm rate (CFAR) and saliency as primary tools. Such methods serve as foundational steps for isolating targets within SAR imagery, laying groundwork for subsequent target identification.

To enrich target data and counteract the interference from coherent speckle noise, T. Li et al. (Citation2018) developed a two-tiered CFAR detection approach specialised for super-pixel object identification, demonstrating superior detection outcomes for ships in uncomplicated scenarios. For larger-scale targets, Zhai et al. (Citation2016) introduced a ship detection algorithm integrating saliency and contextual data processing, adeptly highlighting large ships and background targets with significant features. Farah et al. (Citation2022) showcased a swift CFAR detection technique via the GΓD model in SAR imagery, targeting multiple ships. Leng et al. (Citation2015) combined the intensity and spatial spread of pixels to conceive a bilateral CFAR algorithm, leveraging kernel density estimation (KDE) for spatial feature extraction. By unifying spatial and intensity distributions, it achieved heightened accuracy in target detection. L. Han et al. (Citation2022) offered a ship detection strategy founded on multiscale superpixel saliency analysis, accentuating the ship pixels’ saliency through local contrast and ship geometry.

Another classical SAR image detection technique focused on the characterisation of ships via basic scattering mechanisms. These mechanisms yield insights into the physical, electrical, and geometric traits of scatterers within each SAR pixel. A noteworthy example is the innovative dual-scattering model algorithm by Karachristos et al. (Karachristos & Anastassopoulos, Citation2023), which employs primary and secondary scattering mechanisms for each polarised SAR pixel. Its primary objective is to distinguish ship targets from the background clutter, rendering a binary distinction. While the concept is pioneering and offers high detection precision, it struggles with efficiency in scenarios featuring multi-scale and multi-scene ship detection.

Traditional techniques, despite their merits, are fraught with challenges such as complexity, limited versatility, labour-intensive designs, and bounded suitability across diverse scenarios (C. Chen et al., Citation2023). Moreover, their analytical framework, constrained by specific ship imagery, often falls short in accurately discerning features of ships across varied sizes and settings. This limitation undermines their efficacy in multi-scale and multi-scene detection tasks. As a consequence, deep learning-based SAR ship detection is emerging as a dominant approach, steadily eclipsing these conventional methods (C. Chen et al., Citation2022).

2.2. SAR ship detection algorithms based on deep learning

Deep learning (DL) and convolutional neural networks (CNN) based target detection models can be broadly divided into one-stage and two-stage algorithms, each presenting a distinct balance between speed and precision (H. Li, Han, & Tang, Citation2022). While one-stage algorithms deliver swift results, they may falter when detecting minuscule or closely clustered objects. Conversely, two-stage algorithms excel in accuracy but lag in processing speed. However, technological evolution has bridged the accuracy gap, allowing one-stage algorithms to match or even outperform their two-stage counterparts without compromising speed. Presently, the majority of SAR ship detection methods that leverage deep learning lean on one-stage algorithms (D. Li, Han, Weng, et al., Citation2022).

Central to one-stage algorithms is the feature fusion module, predominantly represented by the Feature Pyramid Network (FPN). Due to the notable differences in ship dimensions within SAR images, FPN adeptly consolidates features sourced from the primary network, resulting in a more harmonised amalgamation. Given that FPN’s efficacy greatly influences detection outcomes, its structural research has garnered substantial interest in the academic community (J. Li et al., Citation2023).

2.2.1. Two-stage SAR ship detection algorithms

The two-stage detection strategy involves the incorporation of a densely connected segmentation subnet following the primary feature network. This method recalibrates the former classification and regression tasks into three separate functions: classification, regression, and segmentation. When tailored to SAR target detection, such a stratified approach notably bolsters the precision of ship identification. Pioneering two-stage algorithms pivotal to SAR ship detection are Fast R-CNN (Girshick, Citation2015), Faster R-CNN, Cascade R-CNN (Cai & Vasconcelos, Citation2018), and Libra R-CNN (Pang et al., Citation2019). A multitude of scholars have extrapolated from these seminal algorithms, crafting ship detection techniques specifically fine-tuned for the nuances of SAR imagery.

For example, a method introduced by M. Jiang et al. (Citation2023) employs Faster R-CNN for joint ship contour extraction. It delves deep into SAR images to procure intricate ship details, pinpointing ship locations in intricate scenarios and deriving contours from relevant slices. Nevertheless, the method grapples with intricate model structures and lags in processing speed. Y. Li et al. (Citation2020) presented a nimble backbone network fortified with feature relay amplification and skip connections across multiple scales. While their approach zeroes in on capturing object features across different scales in SAR imagery, it falters when pinpointing minute or closely grouped objects. Wang et al. (Citation2019) unveiled a ship detection technique that melds both feature-centric and pixel-centric strategies to enhance both the precision and sturdiness of detection, with a specific emphasis on inshore ships. Yet, the computational demands of this model are significant, and its efficacy wanes when tasked with spotting ships in proximity to the coast. Expanding on the foundations of Libra R-CNN, Guo et al. (Citation2020) unveiled a rotational variant of Libra R-CNN. By harnessing balanced learning and emphasising rotational region detection, they sought to amplify both precision and resilience. Nonetheless, this model tends to register an elevated number of false positives, and smaller ships often elude its detection. Collectively, these research endeavors underscore the continuous push to refine the precision and effectiveness of two-stage algorithms in SAR ship detection, confronting diverse challenges and constraints.

2.2.2. One-stage SAR ship detection algorithms

One-stage detection techniques predominantly identify designated targets by leveraging dense anchor points. They use multi-scale features to anticipate the presence of an object. A common strategy is to implement dense sampling uniformly across various image locations, considering different scales and aspect ratios. Following this, these techniques utilise CNNs to glean relevant features before diving straight into classification and regression operations. Notable models in this domain comprise the YOLO series (P. Jiang et al., Citation2022), FCOS, SSD (W. Liu et al., Citation2016), RetinaNet, and CenterNet (Duan et al., Citation2019). The YOLO series significantly enhances detection velocities. Elaborating on the YOLO framework, the research by (T. Zhang & Zhang, Citation2019) introduces an innovative technique for rapid ship detection in SAR images, utilising a grid-based CNN approach (G-CNN). Their methodology amplifies the detection rate by segmenting the input image and incorporating depth-wise separable convolution. Nonetheless, there’s a trade-off concerning detection precision. Similarly, S. Chen et al. (Citation2021) crafted a streamlined ship detector based on YOLOv3, termed Tiny YOLO-Lite. Through the use of network pruning and knowledge distillation, they reduced the model’s size, albeit at the expense of its versatility. Extending the capabilities of FCOS, a study by Yang and team (Yang et al., Citation2022) introduced an enhanced fully convolutional single-stage detector. This model integrates a multi-tiered feature attention system to sift relevant features and assimilate global contextual data, thus elevating detection fidelity. However, its intricate design decelerates detection speeds. Meanwhile, Sun and associates (Sun et al., Citation2021) leveraged category position modules to refine the position regression and classification attributes in the FCOS framework. Their enhancements facilitated the creation of guiding vectors, optimising target placement in intricate environments. Evaluating from the dual lens of targets and pixelation, Hu and his team (Q. Hu et al., Citation2022) tabled a novel SAR image ship detection blueprint rooted in a feature interaction network. This model seamlessly merges object-focused detection with pixel-focused detection, optimising both through a feature guidance module. This approach magnifies the identification efficiency and locational precision of ships in SAR imagery.

2.2.3. SAR ship detection algorithms based on improved FPN

FPN merges feature layers from different spatial resolutions and semantic meanings using top-down and horizontal pathways. Thus, FPN collects comprehensive semantic details from a singular-scale image across all levels. Leveraging FPN’s robust multi-scale feature capabilities, later studies have incorporated it into detection pyramids to recognise varying scale targets in SAR images. As an instance, Joseph et al. (Redmon & Farhadi, Citation2018) introduced PAFPN, an FPN variant that integrates top-down pathways and cross-scale feature fusion in target detection. Tan et al. (Citation2020) presented BiFPN, which emphasises improved bi-directional feature fusion. Xu et al. (Citation2022) introduced an AW-FPN. This design incorporates more positional information into multi-scale features via cross-scale linkages, enhancing both ship identification and pinpointing. T. Zhang et al. (Citation2020) rolled out a B-FPN to amplify detection precision. Differing from the traditional FPN, B-FPN synergizes balanced semantic attributes at equivalent depth tiers, amplifying the feature pyramid’s multi-tiered facets. P. Chen et al. (Citation2023) advanced an enhanced FPN mechanism founded on deformable convolution, facilitating convolutional processes to acclimate to specific sampling points, thereby bolstering ship target feature extraction and elevating detection rates, particularly in intricate backdrops.

In our study, we use YOLOX as a foundation, refining it for enhanced SAR ship detection. We've optimised YOLOX's FPN aspect for optimal feature information use, boosting detection precision. We've also fortified the backbone with extra feature extraction units, addressing YOLOX's SAR image analysis constraints. Furthermore, we've devised a multi-scale attention schema to better pinpoint smaller targets. For reader clarity, we've tabulated the pros and cons of the aforementioned methods in Table 1.

Table 1. Advantages and disadvantages of representative models.

3. Methodology

In this section, we commence by presenting the baseline YOLOX framework. Subsequently, we delve into the NAS-FPN, DFEM, and MCSA modules. We conclude with an introduction to NAS-YOLOX, a comprehensive integration of the aforementioned three modules.

3.1. YOLOX baseline

The structure of YOLOX comprises five main components: the input, Backbone, Neck, YOLOX Head, and the output. A comprehensive illustration of the entire network architecture is depicted in Figure 1. Central to YOLOX is the CBS module, which is formed by a convolution layer, complemented by a batch normalisation (BN) layer (Ioffe & Szegedy, Citation2015) and a subsequent SiLU activation layer.

Figure 1. Structure of YOLOX-m network.


The backbone of YOLOX is CSPDarkNet53, which can be divided into two components: Focus and dark2-5. The Focus module performs slicing operations on the input image to achieve downsampling without information loss. The dark2-4 components share the same basic structure, differing only in the number of bottleneck modules within the cross-stage partial (CSP) module. There are two types of bottleneck modules in CSPDarknet: the bottleneck in the main stem, consisting of two CBS modules and one residual block, and the bottleneck in PAFPN, consisting of only two CBS modules. Accordingly, there are two types of CSP modules in the backbone, CSP1_x3 and CSP1_x9, which consist of three and nine bottleneck modules respectively, while the CSP module in PAFPN is composed of only three bottleneck modules (i.e. CSP2_x3).
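
As a concrete illustration of the slicing idea, the PyTorch sketch below shows how a Focus-style module halves the spatial resolution by interleaved slicing while quadrupling the channels, so no pixel information is lost; the channel widths and kernel size are illustrative rather than the exact YOLOX-m settings.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice-based downsampling followed by a CBS block (illustrative sketch)."""

    def __init__(self, in_ch: int = 3, out_ch: int = 64, k: int = 3):
        super().__init__()
        self.cbs = nn.Sequential(            # CBS = Conv + BN + SiLU
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Four phase-shifted sub-images stacked on the channel axis:
        # (B, C, H, W) -> (B, 4C, H/2, W/2), so no pixel is discarded.
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.cbs(patches)

# A 640x640 SAR chip becomes a 64-channel 320x320 feature map:
# Focus()(torch.randn(1, 3, 640, 640)).shape  ->  torch.Size([1, 64, 320, 320])
```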

PAFPN is employed to merge features from various scaled feature maps. In YOLOX, the output features from dark3, dark4, and dark5 in the Backbone serve as the PAFPN input. Conventionally, high-level feature maps rich in semantic information are used to detect large targets, whereas small targets are identified using low-level feature maps that retain object specifics. The features, once merged, are channelled to the YOLOX Head’s decoupled detection head via PAFPN’s multi-scale feature fusion. Unlike other YOLO versions where the final regression frames and confidence levels are acquired through a 1 × 1 convolution, YOLOX accomplishes this via a decoupled technique. YOLOX’s other significant advancement is its adoption of dynamic sample matching, moving away from the widely-used anchor-box-based approach.
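
As a rough illustration of the decoupled head just described, the following PyTorch sketch separates the classification branch from the regression/objectness branch after a shared stem; the branch widths and depths are assumptions for illustration rather than the exact YOLOX-m settings.

```python
import torch
import torch.nn as nn

def cbs(c_in: int, c_out: int, k: int = 3, s: int = 1) -> nn.Sequential:
    """Conv + BatchNorm + SiLU, the basic CBS block used throughout YOLOX."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    """One per-scale head: a shared stem, then separate cls and reg/obj branches."""

    def __init__(self, in_ch: int, num_classes: int = 1, width: int = 256):
        super().__init__()
        self.stem = cbs(in_ch, width, k=1)
        self.cls_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.reg_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness / confidence

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```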

3.2. NAS-FPN

In the foundational setup, YOLOX integrates the PAFPN strategy within its neck layer, bridging the gap between the deep and shallow layers for efficient feature combination. However, this manually crafted feature fusion mechanism proves to be suboptimal, lacking in thorough feature integration. To address this, we introduce the neural architecture search-feature pyramid network (NAS-FPN). The network structure obtained through automatic network search in NAS-FPN demonstrates superior performance and efficiency, particularly in the SAR ship detection. It exhibits the capability to adaptively learn multi-scale feature representations, enabling the effective capture of ship features across different scales. By leveraging the rich details and contextual information present in the extracted backbone features, NAS-FPN maximises their utilisation. Additionally, NAS-FPN’s network architecture is obtained through automated search on the training set, making it data-driven rather than relying solely on experiential knowledge. This data-driven approach reduces the need for manual intervention and enhances the network’s robustness. In contrast, PAFPN is manually designed with a fixed network structure, limiting its adaptability to specific detection tasks, and resulting in underutilisation of the extracted feature information, ultimately leading to lower detection efficiency. Therefore, we innovatively integrated NAS-FPN into YOLOX, significantly boosting the network’s detection performance.

NAS-FPN primarily consists of two essential elements: a controller and an evaluator. Implemented as a recurrent neural network (RNN) (Zaremba et al., Citation2014), the controller designs new network configurations based on the evaluator's feedback; the overall structure is shown in Figure 2. Utilising reinforcement learning, the controller pinpoints the best model blueprint within every designated search space. The accuracy of each mini-model within this realm acts as a reward signal, guiding the controller to adjust its parameters accordingly. Through cycles of evaluation and feedback, the controller refines its capacity to conceive better structures. Further details about NAS-FPN's search domain, its controllers, evaluators, and their respective inputs and outputs are explored in the following sections.

Figure 2. Neural network search structure.


3.2.1. Search space

The search space delineates the variety of neural networks accessible for exploration by the NAS algorithm. It further prescribes the methodology to depict the structure of the neural network.

NAS-FPN uses an RNN controller to sample network architectures. Central to this sampling are the feature maps slated for fusion and their subsequent versions post-fusion. Consequently, its search space remains concise. This space includes the sizes of both input feature maps, the resultant maps, and the fusion methodology applied to the two input maps, all of which are elaborated upon in Table 2.

Table 2. NAS-FPN search space.

3.2.2. Controllers and evaluators

Within NAS-FPN, both the controller and the evaluator stand as pivotal elements, with each overseeing the search of the neural network structure and its subsequent performance assessment. The controller employs recurrent neural networks to generate these structures and collaborates with the evaluator to cherry-pick the most efficient structure. The sampling routine of the controller can be dissected into four key phases:

  • Step 1: Choose an input feature map, termed Pi, from the list of candidates.

  • Step 2: Choose another input feature map, labelled Pj, from the same list.

  • Step 3: Decide upon the resolution for the resulting feature map.

  • Step 4: Determine the technique to amalgamate Pi and Pj.

By following these four steps, a set of parameters is sampled, and a new feature map is created by fusing Pi and Pj. NAS-FPN adopts two fusion strategies: Sum and Global Attention Pooling (GAP). Its sampling process is illustrated in Figure 3. On the left is the list of candidate feature maps; newly generated feature maps are re-added to the candidate list through scaling and fusion operations on the sampled output feature maps. The rationale behind GAP is that higher-level features tend to contain more semantic information, which can serve as weights for feature maps carrying shallow-level information. First, global pooling and a 1 × 1 convolution are applied to the high-level features to obtain attention weights for the low-level features, which are then used to weight the low-level features. Finally, the two feature maps are summed to obtain the ultimate output.
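
As a rough sketch of the two fusion strategies just described, the snippet below implements a merge cell with a Sum option and a GAP option; the nearest-neighbour resizing and the sigmoid normalisation of the pooled weights are assumptions for illustration, not details taken from the NAS-FPN paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeCell(nn.Module):
    """Illustrative sketch of the two NAS-FPN merge operations described above."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # 1x1 convolution applied to the globally pooled high-level feature.
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    @staticmethod
    def resize(x: torch.Tensor, out_size) -> torch.Tensor:
        # Bring a candidate map to the resolution chosen in sampling step 3.
        return F.interpolate(x, size=out_size, mode="nearest")

    def fuse_sum(self, p_low, p_high, out_size):
        return self.resize(p_low, out_size) + self.resize(p_high, out_size)

    def fuse_gap(self, p_low, p_high, out_size):
        # Global pooling + 1x1 conv on the semantically richer high-level map
        # yields channel weights that re-weight the detail-rich low-level map,
        # and the two maps are then summed.
        w = torch.sigmoid(self.fc(F.adaptive_avg_pool2d(p_high, 1)))
        return self.resize(p_low, out_size) * w + self.resize(p_high, out_size)
```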

Figure 3. The sampling process of NAS-FPN.


The RNN can retain its state, enabling the controller to consider previous choices when generating new structures. Moreover, thanks to the evaluator, the controller can assess the performance of different structures based on actual detection accuracy (AP), realising an adaptive neural network structure search. The workflow of the controller is presented in Figure 4.

Figure 4. NAS-FPN’s controller structure.


The evaluator’s process can be outlined in the subsequent steps:

  1. Reception of Input: The controller-produced network structure is fed into the evaluator, which then restructures it into the genuine network framework.

  2. Training Phase: Using the training dataset, the evaluator initiates the network’s training and determines its average precision (AP).

  3. Feedback: After computations, the evaluator dispatches the deduced AP values back to the controller.

Given the evaluator’s ability to directly assess the network’s efficiency, it enables the controller to liaise with it. This interaction facilitates the selection of the most efficient network configuration, allowing for an adaptive search of the neural network structure.
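
The controller-evaluator interaction can be summarised by the conceptual loop below. Every function in it is a stand-in (random sampling replaces the RNN controller and a dummy score replaces actual training), so the snippet only illustrates the feedback cycle, not the authors' search procedure.

```python
import random

# Conceptual stand-in for the controller-evaluator loop: an "architecture" is a
# list of merge cells, each recording the two inputs, the output resolution, and
# the fusion operation chosen in Steps 1-4 of the sampling routine.
CANDIDATES = ["p3", "p4", "p5", "p6", "p7"]
STRIDES = [8, 16, 32, 64, 128]          # output resolutions, expressed as strides
FUSIONS = ["sum", "global_attention_pooling"]

def sample_architecture(num_cells: int = 5):
    """Stand-in for the RNN controller: sample one merge cell per step."""
    return [
        (random.choice(CANDIDATES), random.choice(CANDIDATES),
         random.choice(STRIDES), random.choice(FUSIONS))
        for _ in range(num_cells)
    ]

def evaluate(architecture) -> float:
    """Stand-in for the evaluator: build the pyramid, train it, return its AP."""
    return random.random()              # dummy reward signal

best_arch, best_ap = None, -1.0
for _ in range(100):                    # search budget
    arch = sample_architecture()
    ap = evaluate(arch)                 # AP is fed back to the controller as reward
    if ap > best_ap:
        best_arch, best_ap = arch, ap
    # A real controller would update its RNN parameters with this reward
    # (policy gradient); here we simply keep the best architecture sampled so far.

print(f"best AP (dummy): {best_ap:.3f}")
```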

As depicted in Figure 5, green circles mark the inputs of the pyramid network, while red circles highlight the outputs. Figure 5(a) illustrates the PAFPN fusion technique, which comprises only two fusion pathways, from the bottom to the top and vice versa, leading to a limited inclusion of multi-scale information. Conversely, Figure 5(b-f) demonstrates the feature fusion pyramid designs unearthed by the RNN controller as the evaluator enhances the precision of detection. It is noteworthy that in its initial phases, the RNN controller rapidly discerns vital cross-scale linkages, especially those between high-resolution and output feature layers, pivotal for creating high-definition features to pinpoint small entities. As the precision in detection escalates, the controller begins to showcase architectures connected from the top-down and the bottom-up. In contrast to the standard PAFPN, numerous cross-scale linkages are present, and such reuse of features proves to be pivotal. Additionally, the controller adapts to form linkages on freshly crafted layers to reutilise previously processed feature representations, as opposed to arbitrarily picking two input layers from the candidate assortment, culminating in a pronounced enhancement in the accuracy of ship detection.

Figure 5. The structure of the pyramid network.


In summary, NAS-FPN performs adaptive neural network structure search by enabling interaction between the controller and the evaluator to optimise target detection ability. The approach has the advantage of optimising the neural network structure based on actual AP, rather than just determining the network structure through rough heuristic search. It facilitates efficient and accurate network structure search in different tasks, reducing manual involvement while ensuring network performance.

3.2.3. NAS-FPN inputs and outputs

NAS-FPN processes multi-scale feature layers, accepting them as inputs and producing feature layers that maintain the same scale. Five scales are utilised, specifically {p3, p4, p5, p6, p7}, to serve as the pyramid’s input. Within this set, p3, p4, and p5 are the resultant outputs from the dark3, dark4, and dark5 modules found within CSPDarknet. Meanwhile, p6 and p7 are derived by pooling p5 directly, with step sizes of 2 and 4 respectively. Subsequently, the feature content from these 5 scales is channelled into the pyramid network. As the sampling unfolds, this pyramid network persistently crafts new feature maps, reincorporating them into the candidate pool, with the end goal of outputting the refined multi-scale features, which are {C3, C4, C5, C6, C7}.

Given that the pyramid network’s input and output share the same scale, the structure of the FPN is capable of stacking repeatedly. The output from a prior pyramid network acts as the input for the subsequent one, aiming to achieve greater accuracy. Following several layers of stacking, the ultimate output features emerge as C3’, C4’, C5’, C6’, and C7’. However, considering that the YOLOX detection head requires only three input feature layers, the C6’ and C7’ features, which possess limited information, are omitted. Conversely, the more information-rich features, C3’, C4’, and C5’, are directed into the detection head, due to their enhanced details about the ship.
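
A minimal sketch of how the pyramid inputs could be assembled is given below; the pooling type and the channel count are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def build_pyramid_inputs(p3, p4, p5):
    """Assemble the five NAS-FPN inputs; p6 and p7 are pooled directly from p5."""
    p6 = F.max_pool2d(p5, kernel_size=2, stride=2)   # stride-2 pooling of p5
    p7 = F.max_pool2d(p5, kernel_size=4, stride=4)   # stride-4 pooling of p5
    return [p3, p4, p5, p6, p7]

# Example shapes for a 640x640 input (channel count is illustrative):
p3 = torch.randn(1, 256, 80, 80)    # from dark3
p4 = torch.randn(1, 256, 40, 40)    # from dark4
p5 = torch.randn(1, 256, 20, 20)    # from dark5
pyramid = build_pyramid_inputs(p3, p4, p5)
# -> spatial sizes 80, 40, 20, 10, 5; after stacking, only C3', C4', C5' feed the head.
```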

3.3. Dilated convolution feature enhancement module

While the CSP module possesses robust feature extraction capabilities, there’s a need for enhanced management of spatial contextual information, especially given the diverse scale sizes of ships in SAR images. The variations in ship dimensions and their placements within marine environments markedly impact the model’s detection precision. Moreover, the fidelity and depth of features garnered by the backbone play a pivotal role in subsequent operations.

To address these challenges, we introduce the DFEM (dilated convolution feature enhancement module) to effectively extract ship features across multiple scales. The design of this module draws inspiration from the work of L.-C. Chen et al. (Citation2019). With DFEM in place, the model is better equipped to handle ships of various sizes, resulting in improved detection accuracy.

Dilated convolution stands out as an effective method for tasks that demand the handling of multi-scale objects or the extraction of diverse and rich contextual data. In contrast to standard convolution operations, dilated convolution enlarges the model's receptive field, facilitating the extraction of richer context. Consequently, we introduce an innovative blend of the dilated convolution module with selected activation functions and residual structures. This combination expands the receptive field over the spatial image, allowing for efficient encoding of ship context information across different scales. The efficacy of the DFEM is further demonstrated in the ablation study section.

The DFEM is architecturally composed of four dilated convolution blocks that run in parallel, enhanced by a residual connection. This design is depicted in Figure 6. Each of these dilated convolution blocks is configured with a distinct dilation rate to cater to ships of varying scales. In terms of its placement within the network, DFEM is integrated after the Focus module and after the dark2-4 stages of the Backbone. The DFEM configuration is defined as follows:

(1) DFEM(X) = SiLU(B(Conv(Concat(B(Dilate(X))))) + X)

(2) SiLU(X) = X \times \sigma(X)

(3) \sigma(X) = \frac{1}{1 + e^{-X}}

(4) ReLU(X) = \max(X, 0) = \begin{cases} 0, & X < 0 \\ X, & X \ge 0 \end{cases}

Figure 6. Structure of the DFEM network.


Within the DFEM framework, H × W is the current spatial resolution and C indicates the number of feature channels; Dilate(·) stands for a 3 × 3 dilated convolution, B(·) denotes batch normalisation, and X′ denotes the concatenated outcome of the four dilated convolutions with distinct dilation rates.

The DFEM’s functional flow initiates as features navigate through the quartet of dilated convolutions, each distinguished by distinct dilation rates. Post each convolutional stride, a batch normalisation layer, paired with a ReLU activation, steps in to expedite convergence. Outputs from these four distinct pathways are subsequently channelled together, and a 1 × 1 convolution is employed to streamline channel dimensions. To accentuate the extraction of nuanced spatial characteristics, a residual link is integrated, thereby guaranteeing a comprehensive receptive field. This process culminates with the refined features being relayed to the next module, courtesy of the SiLU activation function. In essence, the DFEM module amplifies these features, extending the image’s receptive field and enabling the encapsulation of multi-scale data. Experimental assessments revealed that the dilation rates of 1, 3, 5, and 7 proved most efficacious for the convolutional layers within this framework.
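
A possible PyTorch realisation of Equation (1) with the dilation rates 1, 3, 5, and 7 is sketched below; the channel handling and bias settings are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DFEM(nn.Module):
    """Sketch of the dilated convolution feature enhancement module of Eq. (1)."""

    def __init__(self, channels: int, rates=(1, 3, 5, 7)):
        super().__init__()
        # Four parallel 3x3 dilated convolutions, each followed by BN + ReLU.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # 1x1 convolution compresses the concatenated branches back to `channels`.
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels * len(rates), channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.squeeze(y) + x)     # residual link keeps fine detail

# Example: enhance the dark3 output without changing its shape.
# DFEM(256)(torch.randn(1, 256, 80, 80)).shape  ->  torch.Size([1, 256, 80, 80])
```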

3.4. Multi-scale channel-spatial attention

Attention mechanisms primarily function to distill essential information from specific regions of interest, while diminishing or disregarding irrelevant aspects. Within SAR imagery, ships typically occupy a minuscule fraction of the pixel space due to their sparse distribution. Consequently, only certain pixel areas hold significance. To address this challenge, the introduction of an attention module is proposed to guide the model towards prioritising pivotal ship regions.

Several prevalent attention modules exist, such as CBAM (Woo et al., Citation2018), SE (J. Hu et al., Citation2018), CA (Hou et al., Citation2021), and MS-CAM (Dai et al., Citation2021). Of these, MS-CAM stands out as a recent innovation in channel domain attention, executing multi-scale extractions of both global and local features from input feature maps. However, it falls short in leveraging spatial domain feature information.

To bridge this gap, we propose a Multi-scale Channel-Spatial Attention (MCSA) mechanism. This design fuses a spatial attention mechanism with the foundational MS-CAM structure. While channel attention zeros in on vital aspects of ship detection, emphasising its minutiae, spatial attention concentrates on the ship’s placement within the entire image. This entails the selective amalgamation of spatial features through weight-assigned spatial feature aggregation, aiming to bolster the attention module’s detection precision. The incorporation of spatial attention into MCSA heralds a significant progression, enhancing the detection efficacy by selectively accumulating spatial features and accentuating ship specifics. This enhancement supplements the inherent channel attention trait of MS-CAM, further elevating ship detection precision.

The MCSA is distinctly partitioned into three segments: the global attention module, the local attention module, and the spatial attention module, as depicted in Figure 7. Both the global attention module and the local attention module employ point-wise convolution, acting as local channel context aggregators, so that only the channels at each individual spatial position interact. The local attention module is described as:

(5) L(X) = BN(PWConv_2(\sigma(BN(PWConv_1(X)))))

Here, H and W signify the height and width of the feature map, and C stands for the feature map's channel count. The terms PWConv_1 and PWConv_2 refer to pointwise convolutions; PWConv_1 has a convolution kernel size of \frac{C}{r} \times C \times 1 \times 1, while PWConv_2 possesses a kernel size of C \times \frac{C}{r} \times 1 \times 1. In this context, r represents the channel reduction ratio, and \sigma symbolises the Sigmoid activation function. Notably, L(X) retains the same shape as the input features, enabling the preservation and emphasis of intricate details found in low-level features.

Figure 7. Network structure of MCSA.


The global attention module differs from the local attention module in that a GAP operation is first performed on the input X. The GAP procedure is non-parametric, effectively mitigating overfitting at that specific layer. By aggregating global spatial details, it enhances the model's resilience to spatial translations of the input image. This process can be articulated as:

(6) G(X) = L(GAP(X))

The spatial attention module first compresses the input feature map to a size of 1 × H × W using a 1 × 1 convolution. Weight coefficients are then derived using the Sigmoid activation function and multiplied with the input feature map. This process can be formally represented as:

(7) S(X) = X \times \sigma(B(Conv(X)))

To encapsulate, the features augmented by MCSA can be expressed as:

(8) OUT(X) = X \times \sigma(L(X) + G(X)) + S(X)
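
Equations (5)-(8) could be realised roughly as follows; the reduction ratio r = 4 and the exact placement of the normalisation layers follow common MS-CAM-style implementations and should be read as assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MCSA(nn.Module):
    """Sketch of the multi-scale channel-spatial attention of Eqs. (5)-(8)."""

    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = max(channels // r, 1)
        # Local channel branch, Eq. (5): point-wise convs at every spatial position.
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.Sigmoid(),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        # Global channel branch, Eq. (6): GAP on the input first, then the same stack.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.Sigmoid(),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        # Spatial branch, Eq. (7): compress channels to 1 x H x W, then gate the input.
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        l = self.local_att(x)                       # same shape as x
        g = self.global_att(x)                      # (B, C, 1, 1), broadcast over H x W
        s = x * torch.sigmoid(self.spatial(x))      # Eq. (7)
        return x * torch.sigmoid(l + g) + s         # Eq. (8)

# MCSA(256).eval()(torch.randn(1, 256, 40, 40)).shape  ->  torch.Size([1, 256, 40, 40])
```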

3.5. NAS-YOLOX

Oceanic environments with elements like port infrastructure, sea noise, varying sea states, large variation in ship scales, and weak features of small ships can cause missed or inaccurate ship detections. Conventionally designed feature fusion methods often suffer from insufficient use of features. This paper therefore introduces an advanced algorithm called NAS-YOLOX, an enhancement over YOLOX-m. The NAS-YOLOX configuration is depicted in Figure 8. It primarily integrates an upgraded CSPDarknet backbone, NAS-FPN, a decoupled detection head, and the MCSA.

Figure 8. Structure of the NAS-YOLOX network.


The NAS-YOLOX offers advancements over the standard YOLOX design. The backbone network, CSPDarknet, has been augmented with a DFEM post the output phases of Focus, dark2, dark3, and dark4 to bolster feature extraction. Then, the primary PAFPN network is substituted with NAS-FPN, optimising the data retrieved from the backbone. An MCSA module is then integrated into the Neck section, amplifying the network’s concentration on ship-related details.

3.6. Loss function

Similar to other versions of the YOLO model, the loss function of NAS-YOLOX consists of three components: classification loss (L_{cls}), regression loss (L_{reg}), and confidence loss (L_{obj}). L_{cls} and L_{obj} are calculated using the Binary Cross-Entropy (BCE) loss function, while L_{reg} is calculated using the CIOU (Zheng et al., Citation2020) function. The total loss is calculated as follows:

(9) LOSS = \frac{L_{cls} + \lambda L_{reg} + L_{obj}}{N_{pos}}

where \lambda represents the equilibrium coefficient of L_{reg}, which is set to 0.5, and N_{pos} represents the number of anchor points classified as positive samples.

(10) L_{cls} = -\frac{1}{N_{pos}} \sum_{i \in pos} \sum_{j \in cls} \left( O_{ij} \ln(\hat{C}_{ij}) + (1 - O_{ij}) \ln(1 - \hat{C}_{ij}) \right)

(11) \hat{C}_{ij} = \mathrm{Sigmoid}(C_{ij})

where O_{ij} \in [0, 1] signifies the presence of a target of class j within predicted bounding box i, C_{ij} is the predicted value, and \hat{C}_{ij} is the target probability derived from C_{ij} via the Sigmoid function.

(12) L_{obj}(O, C) = -\frac{1}{N} \sum_{i} \left( O_i \ln(\hat{C}_i) + (1 - O_i) \ln(1 - \hat{C}_i) \right)

(13) \hat{C}_i = \mathrm{Sigmoid}(C_i)

where O_i \in [0, 1] denotes the IOU between the predicted target bounding box and the ground-truth bounding box, C_i is the predicted value, \hat{C}_i is the prediction confidence obtained through the Sigmoid function, and N is the number of positive and negative samples.

The CIOU metric takes into account three distinct aspects: the overlap area, the distance between centre points, and the aspect-ratio difference. Its formula can be expressed as:

(14) L_{reg} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v

where \rho denotes the Euclidean distance between the centre points of the predicted bounding box and the ground-truth box, c stands for the diagonal length of the smallest rectangle capable of enclosing both the predicted and ground-truth boxes, b signifies the centre of the predicted box, and b^{gt} pertains to the centre of the ground-truth box. IOU, or Intersection Over Union, describes the overlap rate between the predicted bounding box and the ground-truth box and can be determined using the following formula:

(15) IOU = \frac{|A \cap B|}{|A \cup B|}

v is the discrepancy in aspect ratio between the predicted bounding box and the ground-truth box, and it can be computed using the following equation:

(16) v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2

where w^{gt} and h^{gt} symbolise the width and height of the ground-truth bounding box, and w and h stand for the width and height of the predicted bounding box, respectively.

\alpha is a trade-off weighting factor given by:

(17) \alpha = \frac{v}{(1 - IOU) + v}
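
A self-contained sketch of the CIoU regression term of Equations (14)-(17) is given below; the corner-format box representation and the eps stabiliser are assumptions made for numerical convenience.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU regression loss of Eqs. (14)-(17); boxes are (x1, y1, x2, y2) rows."""
    # Intersection and union -> IoU, Eq. (15).
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared centre distance over the squared diagonal of the enclosing box.
    cp = (pred[:, :2] + pred[:, 2:]) / 2
    ct = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(dim=1)
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # Aspect-ratio term v, Eq. (16), and its weight alpha, Eq. (17).
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v          # Eq. (14), one value per box pair
```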

4. Experimental analysis

In the ensuing section, we begin by presenting the two datasets employed in this study. We then detail the experimental parameter configurations and the metrics used for evaluation, and subsequently conduct an extensive series of ablation and comparison tests, presenting the results in visual form. Through these assessments, our objective is to validate the effectiveness of the proposed model.

4.1. Dataset introduction

This study assesses the NAS-YOLOX network utilising two renowned datasets: the HRSID dataset (Wei et al., Citation2020) and the SSDD dataset (T. Zhang et al., Citation2021). Introduced in 2020, the HRSID dataset is a comprehensive SAR ship collection, encompassing images from varied scenarios, radar systems, and polarisation techniques. It comprises 5,604 images containing 16,951 ship instances, so an image typically shows about three ships. In its categorisation, small, medium, and large vessels make up 54.5%, 43.5%, and 2% of the collection, respectively. For training and validation purposes, we partitioned the dataset into 3,642 training images and 1,962 testing images. The SSDD dataset, in contrast, captures a spectrum of maritime conditions and shoreline complexities. The smallest vessel in the collection spans 4 pixels in width and 7 pixels in height, amounting to just 28 pixels, whereas the largest ship occupies 62,878 pixels. Both datasets have been adopted extensively by SAR ship detection researchers. For a comprehensive breakdown of the datasets' specifics, refer to Table 3; Figure 9 offers a glimpse of sample images from both datasets.

Figure 9. Example of HRSID dataset and SSDD dataset.


Table 3. Parameters of the HRSID dataset.

4.2. Experimental setup

The experimental setup operated on a system with a 12-vCPU Intel(R) Xeon(R) Platinum 8255C CPU and an RTX 3090 graphics card with 24 GB of video memory, running the Windows 11 operating system. For our deep learning tasks, we employed the PyTorch 1.9.0 framework, with the codebase scripted in Python 3.8; computation was accelerated using CUDA 11.3 and cuDNN 11.1. We opted for MMDetection for both training and evaluation to maintain consistency of hyperparameters with the benchmarked models. The model was trained for 200 epochs using the AdamW optimiser. The learning parameters were set as follows: an initial learning rate of 0.0002, a weight decay of 0.0001, and an IOU threshold of 0.65.
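
For concreteness, a hypothetical MMDetection-style configuration fragment mirroring these hyperparameters might look as follows; the learning-rate schedule and score threshold are not stated in the text and are assumptions, and the 0.0001 decay is applied here as AdamW weight decay.

```python
# Hypothetical MMDetection-style fragment mirroring the hyperparameters stated
# above; keys follow MMDetection 2.x conventions, and anything not given in the
# text (LR schedule, score threshold) is an assumption.
optimizer = dict(type='AdamW', lr=0.0002, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[150, 180])        # assumed schedule
runner = dict(type='EpochBasedRunner', max_epochs=200)  # 200 training epochs
model = dict(
    test_cfg=dict(score_thr=0.01,                       # assumed score threshold
                  nms=dict(type='nms', iou_threshold=0.65)))
```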

4.3. Model evaluation metrics

The average precision (AP) series of metrics, together with precision (P) and recall (R), is used to compare the models.

P represents the precision of the predictions, indicating how many of the predicted positive samples are actually positive:

(18) P = \frac{TP}{TP + FP}

where True Positive (TP) refers to the count of accurately predicted positive samples, and False Positive (FP) denotes the count of negative samples mistakenly identified as positive.

R is quantified as the ratio of accurately identified positive samples to the total number of actual positive samples, that is,

(19) R = \frac{TP}{TP + FN}

where False Negative (FN) is the number of positive samples that were wrongly predicted as negative.

Plotting R on the x-axis against P on the y-axis yields the P-R curve. By computing the area enclosed between this curve and the coordinate axes, we derive the Average Precision (AP) for ship targets:

(20) AP = \int_{0}^{1} P(R)\,dR
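
A small sketch of how AP can be computed from a sampled P-R curve using all-point interpolation is shown below; the exact interpolation used by the evaluation toolkit may differ.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """All-point interpolated AP: the area under the P-R curve (Eq. (20)).

    `precision` and `recall` are per-threshold values ordered by descending
    detection score, so recall is non-decreasing.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Monotonically decreasing precision envelope, as used in PASCAL VOC / COCO.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate precision over the recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy example: three detections with P = [1.0, 0.8, 0.6] and R = [0.2, 0.5, 0.8]
# give an AP of 0.62 under this interpolation.
```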

4.4. Ablation experiments

To systematically ascertain the contribution of each module to the overall performance, we conducted a series of ablation studies on the HRSID dataset. Each test was designed to isolate the effects of the DFEM, NAS-FPN, and MCSA modules. By evaluating the model’s accuracy with and without these modules, we aimed to understand the role of each component.

4.4.1. Impact of DFEM on experimental performance metrics

During our experimentation, we integrated the DFEM into the backbone of the network while ensuring that the Neck and Head components remained consistent with the baseline. We then examined the impact of the dilation rate of each dilated convolution module within the DFEM on the model's precision. The results are presented in Table 4. A noticeable improvement in accuracy was observed in the model supplemented with DFEM compared to the baseline. Specifically, with dilation rates of 1, 3, 5, and 7 for DFEM, we achieved optimal performance, with an increase in AP0.5 of 1.8%. This demonstrates that DFEM effectively contributes to the network's enhancement.

Table 4. Experimental results of DFEM.

4.4.2. Impact of NAS-FPN on experimental performance metrics

During our experimentation, we substituted the Baseline's PAFPN module with NAS-FPN, ensuring that the Backbone and Head sections remained in line with the Baseline. To delve deeper into the influence of the NAS-FPN stack number on the experimental outcomes, we performed tests with varying stack configurations. The findings are illustrated in Table 5. Clearly, with NAS-FPN stacked seven times, the model showcases an optimal feature fusion capability, enhancing AP0.5 by 4.9%.

Table 5. Experimental results of NAS-FPN.

To accentuate the prowess of NAS-FPN's integration, we plotted Precision-Recall (P-R) curves for IOU values of 0.5 and 0.75, as showcased in Figure 10. Typically, a P-R curve that gravitates towards the top-right quadrant or possesses a more substantial area beneath it is indicative of a model's superior efficacy. The visual representation reveals that the YOLOX equipped with NAS-FPN markedly eclipses the standard YOLOX in terms of performance. This further solidifies the benefits of embedding NAS-FPN into the model.

Figure 10. Displays the PR curve.


4.4.3. Impact of MCSA on experimental performance metrics

During the experiment, we incorporated the MCSA module into the Neck component of the Baseline model while keeping the other parts unchanged. We then compared these results with experiments utilising various other mainstream attention mechanisms. The outcomes of these experiments are presented in Table 6. It is evident that the inclusion of the MCSA module leads to a 3% increase in overall model accuracy. Moreover, APS, APM, and APL all experience nearly a 10% improvement. These results clearly demonstrate that the MCSA module positively contributes to enhancing the performance of the model.

Table 6. Experimental results of MCSA.

4.4.4. Impact of comprehensive enhancements on experimental performance metrics

To evaluate the impact of module combinations on the model's accuracy, we conducted three sets of experiments, as presented in Table 7. Firstly, we examined the effect of replacing the Neck of the Baseline with NAS-FPN while keeping the other parts consistent. This modification resulted in a 4.9% increase in AP0.5. Subsequently, we embedded the DFEM module into the Backbone based on the first set of experiments, leading to a 5.3% improvement in AP0.5 compared to the Baseline and a 0.4% improvement compared to the first set of experiments. Lastly, in the third set of experiments, we integrated the MCSA module into the Neck based on the second set of experiments, resulting in a 6.3% increase in AP0.5 compared to the Baseline and a 1% increase compared to the second set of experiments. Through these tests, it is unmistakably clear that each module plays a pivotal role in bolstering the model's efficacy and precision.

Table 7. Comprehensive improvement.

4.5. Comparative experiments

To ascertain the robustness and capability of our proposed algorithm, we undertook rigorous comparative tests. First, we compare our model with prevailing state-of-the-art methods across the HRSID and SSDD datasets. Then, for a more nuanced understanding of our algorithm’s performance in complex settings, we focused on the HRSID dataset, analysing both offshore and inshore scenes. The outcomes of these assessments strongly reaffirm the prowess and reliability of our model.

4.5.1. Experimental comparison with current methods on HRSID

To demonstrate the efficacy of our proposed method, we benchmarked it against the following leading algorithms on the HRSID dataset: YOLOv3 (Redmon & Farhadi, Citation2018), FCOS, TOOD (Feng et al., Citation2021), Deformable DETR (X. Zhu et al., Citation2020), RetinaNet, YOLOF (Q. Chen et al., Citation2021), Cascade R-CNN, Libra R-CNN, Free-anchor (X. Zhang et al., Citation2022), ATSS (S. Zhang et al., Citation2020), and FINet. The test results obtained through these comparative methods are presented in Table 8. These methods are among the most advanced in natural-scene object detection. It is noteworthy that a majority of these techniques lean on traditional FPN strategies, tapping into multi-scale feature data to tackle variations in scale. Deformable DETR, however, adopts a Transformer structure and introduces a multi-scale deformable attention module for fusing multi-scale feature maps, deviating from the traditional FPN architecture. Among them, FINet is an excellent detection model proposed specifically for SAR ship detection. For a unified comparison, we use the results of its object-level detection network in the comparative experiments of this paper.

The APS, APM, and APL metrics represent the detection performance for small, medium, and large objects, reflecting the multi-scale object detection capability. As shown in Table 8, the proposed model achieves significant improvements on the HRSID dataset. The accuracy reached 91.1% for AP0.5, 65.2% for APS, 68.6% for APM, and 34.1% for APL. Compared to the Baseline, these results represent respective increases of 6.3%, 11.9%, 15.3%, and 30.8%. Clearly, with these enhancements, the model excels at merging features from diverse scales, showcasing heightened proficiency in detecting ships of different dimensions. In particular, the proposed method achieves 65.2% for the detection of small ships, a significant advantage over most other advanced detection methods and only about 4% lower than FINet, indicating that the feature interaction module in FINet can effectively learn the characteristics of small ships. However, on the AP0.5 metric, our model outperforms FINet by 0.6%. It is worth mentioning that all detection models performed poorly in detecting large ships. The observed discrepancy might arise from the dataset's ship-size imbalance, where larger ships are substantially outnumbered by their smaller and medium-sized counterparts. This disproportion complicates the model's task of learning large-ship features. Notwithstanding these challenges, our proposed technique marked a substantial 30.8% uptick in large-ship detection accuracy, thereby bolstering its reliability for detecting larger ships.

Table 8. Comparison of quantitative evaluation indexes on HRSID.

4.5.2. Experimental comparison with current methods on SSDD

To provide a deeper verification of our method's potency, we juxtaposed its efficacy against other advanced detection frameworks using the SSDD dataset. The comparative outcomes are detailed in Table 9. As gleaned from the table, NAS-YOLOX registers an AP0.5 of 97.2% when tested on the SSDD dataset, marking a 1.6% enhancement over the Baseline. It exhibits varying degrees of improvement across all six metrics, with most metrics outperforming other methods. From the table, it is apparent that NAS-YOLOX's metrics are nearly optimal. However, FINet still leads on some metrics, most notably its AP0.5 of 99.8%. Beyond differences in model capability, this gap may also stem from the SSDD dataset containing relatively little training data and being less representative. The numerical outcomes nonetheless clearly highlight the outstanding performance of our proposed approach, affirming its efficacy.

Table 9. Comparison of quantitative evaluation indexes on SSDD.

4.5.3. Detection of ships in inshore and offshore environments

To assess the algorithm’s performance in complex settings, we tested it on the offshore and inshore scenes of the HRSID dataset; the results for both scenarios are reported in Table 10. Because inshore scenes involve strong target scattering, unclear edge contours, and background interference, most advanced detectors, including our proposed model, exhibit lower accuracy inshore than offshore. As Table 10 shows, our approach nonetheless surpasses the other existing methods.

Table 10. Ship detection in HRSID’s offshore and inshore contexts.

4.6. Analysis and interpretation of visual results

To provide a visual sense of our approach’s efficacy, we also conducted a qualitative evaluation, illustrating detection results on both the SSDD and HRSID datasets. Visual comparisons between our method and other leading techniques are shown in Figures 11 and 12. These visualisations make clear that our strategy outperforms its counterparts.

Figure 11. SAR ship detection outcomes on the HRSID dataset: Missed detections are marked in blue, while incorrect detections are highlighted in orange.

Figure 12. Visualisation of SAR ship detection on the SSDD dataset: Blue boxes represent missed detections, while orange boxes denote false positives.

We inspected a random selection of offshore and inshore scenes. The visual results suggest that leveraging multi-scale fusion feature maps substantially improves SAR detection across diverse scenes. The proposed method extracts semantically rich features even against the intricate backgrounds typical of near-shore areas. It also suppresses interference effectively, correctly resolving regions where even the human eye struggles to distinguish noise from genuine ship features.

Figures 11 and 12 underscore a common challenge in inshore areas: because near-shore structures resemble ship targets, several detectors, even advanced ones, produce false detections. In regions with substantial background interference and a high concentration of small vessels, these detectors often yield false positives or miss actual ships. While other methods show pronounced miss and error rates, our improved NAS-YOLOX only occasionally registers false positives, underscoring the role of the enhancement modules in accurately identifying clusters of small ships. The visualisations confirm that, despite the strong resemblance of SAR ship targets to background disturbances such as coastal infrastructure, NAS-YOLOX remains precise and aligns closely with the ground truth. NAS-YOLOX therefore exhibits superior SAR ship detection capability across a range of conditions.

5. Discussion

The four images in Figure 13 illustrate typical errors observed in the visualisation results. Images (a) and (b) show non-dense near-shore ships, where shore buildings strongly affect detection, while images (c) and (d) show dense near-shore ships, where many small vessels cluster closely together and thus test the model’s detection performance. In image (a) a ship is falsely detected, and in image (b) several ships are clearly missed. Image (c) contains one missed ship and two false positives, whereas in image (d) a substantial number of ships are incorrectly detected. To mitigate this, the confidence threshold used with non-maximum suppression can be raised to reduce spurious detections among densely clustered ships (a post-processing sketch is given below), or data augmentation can be applied during training to improve robustness on dense ship formations.

Figure 13. Visualisation results of misdetected samples.
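To make the mitigation described above concrete, the following is a minimal post-processing sketch, assuming a detector that returns per-image box tensors in (x1, y1, x2, y2) format together with confidence scores; raising score_thr suppresses the spurious clustered detections seen in images (c) and (d), at the cost of more missed ships. The function name and threshold values are illustrative, not part of the NAS-YOLOX implementation.

```python
# Minimal post-processing sketch: confidence filtering followed by NMS.
# Assumes `boxes` is an (N, 4) tensor in (x1, y1, x2, y2) format with matching `scores`.
import torch
from torchvision.ops import nms

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor,
                      score_thr: float = 0.5, iou_thr: float = 0.5):
    """Drop low-confidence boxes, then suppress overlapping duplicates."""
    keep = scores >= score_thr               # a higher score_thr prunes spurious dense detections
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)       # a lower iou_thr merges more overlapping boxes
    return boxes[kept], scores[kept]
```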

Furthermore, model complexity deserves consideration. Compared with YOLOX, our model adds 19.1M parameters, which lengthens training time. Despite this slightly higher complexity, the improvements yield significant accuracy gains of 6.3% on the HRSID dataset and 1.6% on the SSDD dataset. In terms of inference time, YOLOX remains faster than our model. Compared with FCOS, however, our approach adds 12.38M parameters yet improves accuracy by 5% and 2.9% on the HRSID and SSDD datasets, respectively, underscoring the efficacy of the approach presented in this study.
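As a reference for how such comparisons can be reproduced, the sketch below measures parameter count and average inference latency in PyTorch; the model handle, input resolution, and number of timing runs are placeholders rather than the exact settings used in our experiments.

```python
# Sketch for measuring model size and average inference latency in PyTorch.
# `model` stands for any detector under comparison (e.g. the baseline or NAS-YOLOX).
import time
import torch

def complexity_report(model: torch.nn.Module, input_size=(1, 3, 640, 640), runs: int = 50):
    device = next(model.parameters()).device
    n_params = sum(p.numel() for p in model.parameters())   # total learnable parameters
    dummy = torch.randn(*input_size, device=device)

    model.eval()
    with torch.no_grad():
        for _ in range(5):                                   # warm-up iterations
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()                         # make GPU timing meaningful
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1000

    return {"params_M": n_params / 1e6, "latency_ms": latency_ms}
```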

However, due to budget and time constraints, our experiments used datasets of limited size, which could affect the model’s ability to generalise and its overall accuracy. In particular, large ships are underrepresented relative to small ships. This imbalance may leave the model insufficiently sensitive to large ships: because their characteristics differ from those of small ships, the model tends to favour the more common small ships, lowering the detection rate for large ones. To address these limitations, future studies should use more extensive datasets with a greater number of large-ship samples, which would strengthen the model’s ability to detect larger vessels. Exploring feature extraction techniques and classification strategies tailored to large ships could further improve their detection precision.
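One simple way future work could act on this imbalance is to oversample training images that contain large ships. The sketch below assumes a dataset object exposing a per-image has_large_ship list, a hypothetical attribute introduced here for illustration and not a field of HRSID or SSDD.

```python
# Hypothetical oversampling sketch for the large-ship imbalance discussed above.
# `dataset.has_large_ship` is an assumed per-image boolean list, not a real HRSID/SSDD field.
from torch.utils.data import DataLoader, WeightedRandomSampler

def build_balanced_loader(dataset, batch_size: int = 16, large_weight: float = 4.0):
    # Give images containing large ships a higher sampling weight.
    weights = [large_weight if flag else 1.0 for flag in dataset.has_large_ship]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```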

6. Conclusion and future outlook

This paper introduces NAS-YOLOX, a solution tailored for ship detection in SAR imagery that addresses challenges inherent to SAR targets, such as strong scattering, sparseness, multiple scales, unclear contours, and complex interference. Its key components are NAS-FPN, DFEM, and MCSA. The DFEM module extracts multi-scale features from images, enhancing feature information capture; the MCSA module prioritises global contextual details to refine detection precision; and NAS-FPN leverages neural architecture search to sidestep the limitations of manually designed feature fusion pyramids and make better use of feature information. Ablation studies attest to the contribution of each component, and comparative results show that NAS-YOLOX delivers superior accuracy in SAR image ship detection, outperforming ten other mainstream detection techniques and achieving 91.1% and 97.2% AP0.5 on the HRSID and SSDD datasets, respectively.

In conclusion, the method presented in this paper offers promising applications not only in shipping and port management but also in maritime rescue and disaster mitigation. In the realm of port management, real-time knowledge of ship location, count, and condition can significantly bolster the efficacy and safety of maritime operations. For maritime rescue and disaster response, our method facilitates swift identification and tracking of vessels in distress, delivering real-time insights crucial for effective rescue operations. As we move forward, our research emphasis will pivot towards enhancing detection of larger vessels. We plan to integrate robust data augmentation techniques to optimise the model’s performance in this regard. Moreover, alongside pruning methods, our future endeavours will explore diversified model compression techniques and prioritise network lightweighting for easy deployment.

Additional information

Funding

This research was funded by the agreement for the 2022 Graduate Top Innovative Talents Training Program at Shanghai Maritime University [grant number: 2022YBR005] and the Top-notch Innovative Talent Training Program for Graduate students of Shanghai Maritime University [grant number: 2021YBR008].

References

  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Chen, C., Han, D., & Chang, C.-C. (2022). CAAN: Context-aware attention network for visual question answering. Pattern Recognition, 132, 108980. https://doi.org/10.1016/j.patcog.2022.108980
  • Chen, C., Han, D., & Shen, X. (2023). CLVIN: Complete language-vision interaction network for visual question answering. Knowledge-Based Systems, 275, 110706. https://doi.org/10.1016/j.knosys.2023.110706
  • Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2019). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • Chen, P., Zhou, H., Li, Y., Liu, P., & Liu, B. (2023). A novel deep learning network with deformable convolution and attention mechanisms for complex scenes ship detection in SAR images. Remote Sensing, 15(10), 2589. https://doi.org/10.3390/rs15102589
  • Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., & Sun, J. (2021). You only look one-level feature. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Chen, S., Zhan, R., Wang, W., & Zhang, J. (2021). Learning slimming SAR ship object detector through network pruning and knowledge distillation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 1267–1282. https://doi.org/10.1109/JSTARS.2020.3041783
  • Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., & Barnard, K. (2021). Attentional feature fusion. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
  • Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Keypoint triplets for object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Farah, F., Laroussi, T., & Madjidi, H. (2022). A fast ship detection algorithm based on automatic censoring for multiple target situations in SAR images. 2022 7th International Conference on Image and Signal Processing and their Applications (ISPA).
  • Feng, C., Zhong, Y., Gao, Y., Scott, M. R., & Huang, W. (2021). TOOD: Task-aligned one-stage object detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430.
  • Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2019). NAS-FPN: Learning scalable feature pyramid architecture for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision.
  • Guo, H., Yang, X., Wang, N., Song, B., & Gao, X. (2020). A rotational libra R-CNN method for ship detection. IEEE Transactions on Geoscience and Remote Sensing, 58(8), 5772–5781. https://doi.org/10.1109/TGRS.2020.2969979
  • Han, D., Pan, N., & Li, K.-C. (2022). A traceable and revocable ciphertext-policy attribute-based encryption scheme based on privacy protection. IEEE Transactions on Dependable and Secure Computing, 19(1), 316–327. https://doi.org/10.1109/TDSC.2020.2977646
  • Han, L., Liu, D., & Guan, D. (2022). Ship detection in SAR images by saliency analysis of multiscale superpixels. Remote Sensing Letters, 13(7), 708–715. https://doi.org/10.1080/2150704X.2022.2068988
  • Hou, Q., Zhou, D., & Feng, J. (2021). Coordinate attention for efficient mobile network design. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Hu, Q., Hu, S., Liu, S., Xu, S., & Zhang, Y.-D. (2022). FINet: A feature interaction network for SAR ship object-level and pixel-level detection. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15.
  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning.
  • Jiang, M., Gu, L., Li, X., Gao, F., & Jiang, T. (2023). Ship contour extraction from SAR images based on Faster R-CNN and Chan–Vese model. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–14.
  • Jiang, P., Ergu, D., Liu, F., Cai, Y., & Ma, B. (2022). A review of yolo algorithm developments. Procedia Computer Science, 199, 1066–1073. https://doi.org/10.1016/j.procs.2022.01.135
  • Karachristos, K., & Anastassopoulos, V. (2023). Automatic ship detection using PolSAR imagery and the double scatterer model. Geomatics, 3(1), 174–187. https://doi.org/10.3390/geomatics3010009
  • Ke, X., Zhang, X., Zhang, T., Shi, J., & Wei, S. (2021). SAR ship detection based on an improved faster R-CNN using deformable convolution. 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS.
  • Leng, X., Ji, K., Yang, K., & Zou, H. (2015). A bilateral CFAR algorithm for ship detection in SAR images. IEEE Geoscience and Remote Sensing Letters, 12(7), 1536–1540. https://doi.org/10.1109/LGRS.2015.2412174
  • Li, D., Han, D., Weng, T.-H., Zheng, Z., Li, H., Liu, H., Castiglione, A., & Li, K.-C. (2022). Blockchain for federated learning toward secure distributed machine learning systems: A systemic survey. Soft Computing, 26(9), 4423–4440. https://doi.org/10.1007/s00500-021-06496-5
  • Li, D., Han, D., Zheng, Z., Weng, T.-H., Li, H., Liu, H., Castiglione, A., & Li, K.-C. (2022). MOOCschain: A blockchain-based secure storage and sharing scheme for MOOCs learning. Computer Standards & Interfaces, 81, 103597. https://doi.org/10.1016/j.csi.2021.103597
  • Li, H., Han, D., & Tang, M. (2022). A privacy-preserving storage scheme for logistics data with assistance of blockchain. IEEE Internet of Things Journal, 9(6), 4704–4720. https://doi.org/10.1109/JIOT.2021.3107846
  • Li, J., Han, D., Wu, Z., Wang, J., Li, K.-C., & Castiglione, A. (2023). A novel system for medical equipment supply chain traceability based on alliance chain and attribute and role access control. Future Generation Computer Systems, 142, 195–211. https://doi.org/10.1016/j.future.2022.12.037
  • Li, S., Fu, X., & Dong, J. (2022). Improved ship detection algorithm based on YOLOX for SAR outline enhancement image. Remote Sensing, 14(16), 4070. https://doi.org/10.3390/rs14164070
  • Li, T., Liu, Z., Xie, R., & Ran, L. (2018). An improved superpixel-level CFAR detection method for ship targets in high-resolution SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(1), 184–194. https://doi.org/10.1109/JSTARS.2017.2764506
  • Li, Y., Zhang, S., & Wang, W.-Q. (2020). A lightweight faster R-CNN for ship detection in SAR images. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.
  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision.
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.
  • Ma, F., Sun, X., Zhang, F., Zhou, Y., & Li, H.-C. (2022). What catch your attention in SAR images: Saliency detection based on soft-superpixel lacunarity cue. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–17.
  • Miao, T., Zeng, H., Yang, W., Chu, B., Zou, F., Ren, W., & Chen, J. (2022). An improved lightweight RetinaNet for ship detection in SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15, 4667–4679. https://doi.org/10.1109/JSTARS.2022.3180159
  • Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., & Lin, D. (2019). Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
  • Sun, Z., Dai, M., Leng, X., Lei, Y., Xiong, B., Ji, K., & Kuang, G. (2021). An anchor-free detection method for ship targets in high-resolution SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14, 7799–7816. https://doi.org/10.1109/JSTARS.2021.3099483
  • Tan, M., Pang, R., & Le, Q. V. (2020). EfficientDet: Scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Tian, Q., Han, D., Li, K.-C., Liu, X., Duan, L., & Castiglione, A. (2020). An intrusion detection approach based on improved deep belief network. Applied Intelligence, 50(10), 3162–3178. https://doi.org/10.1007/s10489-020-01694-4
  • Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully convolutional one-stage object detection. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Wang, R., Xu, F., Pei, J., Wang, C., Huang, Y., Yang, J., & Wu, J. (2019). An improved faster R-CNN based on MSER decision criterion for SAR image ship detection in harbor. IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium.
  • Wei, S., Zeng, X., Qu, Q., Wang, M., Su, H., & Shi, J. (2020). HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access, 8, 120234–120254. https://doi.org/10.1109/ACCESS.2020.3005861
  • Woo, S., Park, J., Lee, J.-Y., & Kweon, I. S. (2018). CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV).
  • Xia, R., Chen, J., Huang, Z., Wan, H., Wu, B., Sun, L., Yao, B., Xiang, H., & Xing, M. (2022). CRTranssar: A visual transformer based on contextual joint representation learning for SAR ship detection. Remote Sensing, 14(6), 1488. https://doi.org/10.3390/rs14061488
  • Xu, Z., Gao, R., Huang, K., & Xu, Q. (2022). Triangle distance IoU loss, attention-weighted feature pyramid network, and rotated-SARShip dataset for arbitrary-oriented SAR ship detection. Remote Sensing, 14(18), 4676. https://doi.org/10.3390/rs14184676
  • Yang, S., An, W., Li, S., Wei, G., & Zou, B. (2022). An improved FCOS method for ship detection in SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15, 8910–8927. https://doi.org/10.1109/JSTARS.2022.3213583
  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
  • Zhai, L., Li, Y., & Su, Y. (2016). Inshore ship detection via saliency and context information in high-resolution SAR images. IEEE Geoscience and Remote Sensing Letters, 13(12), 1870–1874. https://doi.org/10.1109/LGRS.2016.2616187
  • Zhang, L., Zhang, Z., Lu, S., Xiang, D., & Su, Y. (2022). Fast superpixel-based non-window CFAR ship detector for SAR imagery. Remote Sensing, 14(9), 2092. https://doi.org/10.3390/rs14092092
  • Zhang, S., Chi, C., Yao, Y., Lei, Z., & Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhang, T., & Zhang, X. (2019). High-speed ship detection in SAR images based on a grid convolutional neural network. Remote Sensing, 11(10), 1206. https://doi.org/10.3390/rs11101206
  • Zhang, T., Zhang, X., & Ke, X. (2021). Quad-FPN: A novel quad feature pyramid network for SAR ship detection. Remote Sensing, 13(14), 2771. https://doi.org/10.3390/rs13142771
  • Zhang, T., Zhang, X., Li, J., Xu, X., Wang, B., Zhan, X., Xu, Y., Ke, X., Zeng, T., & Su, H. (2021). SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sensing, 13(18), 3690. https://doi.org/10.3390/rs13183690
  • Zhang, T., Zhang, X., Shi, J., Wei, S., Wang, J., & Li, J. (2020). Balanced feature pyramid network for ship detection in synthetic aperture radar images. 2020 IEEE Radar Conference (RadarConf20).
  • Zhang, X., Wan, F., Liu, C., Ji, X., & Ye, Q. (2022). Learning to match anchors for visual object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 3096–3109. https://doi.org/10.1109/TPAMI.2021.3050494
  • Zhao, Y., Zhao, L., Xiong, B., & Kuang, G. (2020). Attention receptive pyramid network for ship detection in SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 2738–2756. https://doi.org/10.1109/JSTARS.2020.2997081
  • Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-IoU loss: Faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence.
  • Zhou, K., Zhang, M., Wang, H., & Tan, J. (2022). Ship detection in SAR images based on multi-scale feature extraction and adaptive feature fusion. Remote Sensing, 14(3), 755. https://doi.org/10.3390/rs14030755
  • Zhou, X., & Li, T. (2023). Ship detection in PolSAR images based on a modified polarimetric notch filter. Electronics, 12(12), 2683. https://doi.org/10.3390/electronics12122683
  • Zhu, M., Hu, G., Zhou, H., Wang, S., Feng, Z., & Yue, S. (2022). A ship detection method via redesigned FCOS in large-scale SAR images. Remote Sensing, 14(5), 1153. https://doi.org/10.3390/rs14051153
  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.