
Image super-resolution reconstruction based on multi-scale dual-attention

Article: 2182487 | Received 14 Nov 2022, Accepted 15 Feb 2023, Published online: 01 Mar 2023

Abstract

Image super-resolution reconstruction is one of the methods for improving resolution by learning the inherent features and attributes of images. However, existing super-resolution models suffer from problems such as missing details, distorted natural textures and blurred or over-smoothed results after reconstruction. To solve these problems, this paper proposes a Multi-scale Dual-Attention based Residual Dense Generative Adversarial Network (MARDGAN), which uses multi-branch paths to extract image features and obtain multi-scale feature information. This paper also designs the channel and spatial attention block (CSAB), which is combined with the enhanced residual dense block (ERDB) to extract multi-level depth feature information and enhance feature reuse. In addition, the multi-scale feature information extracted along the three branch paths is fused with global features, and sub-pixel convolution is used to restore the high-resolution image. The experimental results show that MARDGAN achieves higher objective evaluation indexes than other methods on multiple benchmark datasets, along with better subjective visual quality. The model can effectively use the original image information to restore super-resolution images with clearer details and stronger authenticity.

1. Introduction

In the field of image processing, super-resolution (SR) reconstruction is a technology that uses image processing algorithms (Haris et al., Citation2020) to convert low-resolution images into high-resolution images. In recent years, single image super-resolution (SISR) has been widely used in practical applications such as improving the clarity of pictures in multimedia (Alshammri et al., Citation2022), improving the accuracy of medical images in diagnosis (Li et al., Citation2021, Citation2023), and improving the resolution of satellite remote sensing images (Huang & Jing, Citation2020). Video super-resolution (VSR) requires the alignment and fusion of complementary inter-frame information into keyframes to improve keyframe reconstruction, and many network models (Chan et al., Citation2021; Choi et al., Citation2021; Jo et al., Citation2018) have been proposed to address problems such as video blurring in VSR. However, VSR is more complex than SISR, and there are still difficulties in implementing and extending existing methods. Therefore, the focus of this paper is SISR.

The majority of currently available image super-resolution reconstruction algorithms are based on interpolation (Zhang & Wu, Citation2006), reconstruction (Zhang et al., Citation2012) or deep learning (Meng et al., Citation2020; Wang et al., Citation2022; Yang et al., Citation2021; Zhou et al., Citation2020). Although interpolation-based methods are simple, their reconstruction quality is poor. Reconstruction-based methods are slow and require a lot of prior knowledge. With the rapid development of artificial intelligence, image super-resolution reconstruction based on deep learning has become a major research field in recent years. The mapping relationship between low-resolution (LR) and high-resolution (HR) images is used to train the super-resolution reconstruction model, which then recovers the texture details and other features of the image. Deep learning combined with big data (Gai et al., Citation2016; Qiu et al., Citation2021) and cloud computing (Li et al., Citation2016; Qiu et al., Citation2020) can further optimise image processing work.

The super-resolution convolutional neural network (SRCNN) (Khan et al., Citation2020) was the first to apply deep learning to super-resolution reconstruction and obtained better results than interpolation- and reconstruction-based methods. However, SRCNN is not only slow to train but also requires image pre-processing operations. Shi et al. (Citation2016) proposed the pixel-rearrangement-based ESPCN network, which uses efficient sub-pixel convolution layers to achieve different magnification factors and obtains better reconstruction results while reducing the network parameters. Kim et al. proposed the VDSR network based on residual learning (Wang et al., Citation2021), adopting residual connections between layers so that information flows better and the vanishing-gradient problem is avoided. However, in the process of super-resolution reconstruction, the hierarchical features of the low-resolution image are often not fully utilised, which leads to blurred reconstructed images.

To solve the above problems, Zhang et al. (Citation2018) proposed the residual dense network (RDN). This network establishes a continuous memory mechanism by learning the low-frequency information in LR images. By retaining the feature information of previous network layers, the feature information of the LR image is fully learned. However, the rich features also contain a significant amount of irrelevant data, which can impact image reconstruction quality. Based on RDN, Zhang et al. (Citation2018) introduced the channel attention mechanism into super-resolution and built the RCAN network model. Adaptive learning of the features between individual channels enables the network to disregard irrelevant data and concentrate more on the informative characteristics. Although these models achieve good results on objective image quality criteria, they may still produce artefacts, over-smoothing and unrecovered texture details. Moreover, they tend to construct deeper and more complex network structures, which makes training more difficult; very deep networks also bring problems such as gradient instability and network degradation. More efficient SR networks can be built by reducing network depth and computation and by designing more efficient modules.

In this paper, we propose a multi-scale dual-attention-based residual dense generative adversarial network (MARDGAN) built on deep learning. CSAB and ERDB are combined to form the deep residual dense attention module (DRDAM), which serves as the fundamental building block of the generative network. First, DRDAMs of different scales are constructed in different paths for multi-scale feature extraction and feature propagation. Second, the bottleneck layer fuses the feature information extracted by the DRDAMs. Global residual learning (GRL) combines the shallow features and the fused features to produce a more effective feature representation. Finally, sub-pixel convolution is used to reconstruct the high-resolution image. The discriminative network judges whether the generated images are real or fake; the generative and discriminative networks then compete with each other, which pushes the generative network to produce more realistic and clearer images. At the same time, spectral normalisation (SN) is added to prevent gradient explosion in the network.

To summarise, the following are our primary contributions:

  1. The channel and spatial attention block (CSAB) calculates channel and spatial weights locally and establishes the interdependence between channel and spatial features to recover the detailed features of the image.

  2. The enhanced residual dense block (ERDB) fully extracts the hierarchical features of the original LR image and enhances feature propagation to restore high-quality images. Spectral normalisation (SN) is added to improve the stability of the network.

  3. We propose MARDGAN for accurate SISR reconstruction, which extracts and shares feature information in a multi-scale parallel manner, ensures maximum information flow between modules to restore image texture features, and restores high-resolution images better than several existing network models.

The structure of this paper is as follows. Section 1 introduces the research background and significance of image super resolution reconstruction. Section 2 introduces the related work of super resolution reconstruction. Section 3 introduces the network structure and loss function in detail. Section 4 introduces the experimental process and comparison in detail. Section 5 is the conclusion of this paper.

2. Related work

Convolutional neural networks can extract richer and more abstract semantic features by increasing the depth and width of the network, but this also brings problems such as high computational cost and difficult model training. Therefore, how to efficiently extract target features (Hu et al., Citation2019) and use effective features for super-resolution reconstruction (Wang et al., Citation2020) has become a hot research topic, and an important concept in feature extraction is the receptive field. Shallow layers have small receptive fields and represent local information well, but they lose global information. As the network deepens, the receptive field of higher layers becomes larger and larger; in this case, the semantic representation of the image is strong, but the resolution of the feature maps decreases. In addition, images contain objects of different sizes (Chen et al., Citation2021), and different objects have different semantic features (Zhang et al., Citation2022). Therefore, it is very important to extract and fuse semantic information at all levels efficiently.

Qin et al. (Citation2020) proposed a novel multi-scale feature fusion residual network (MSFFRN), which fully utilises image features at different levels for SISR. MSFFRN has multiple interleaved paths, and convolution kernels of different sizes are designed to adaptively extract and fuse image features at different scales. This helps to fully mine the local features of the image but increases the computational complexity. Qiu et al. (Citation2021) proposed a multiple improved residual network (MIRN), which combines the output features of two adjacent residual blocks with the overall input features and makes full use of the adjacent residual information of the convolution-layer features in the residual block. Connecting residual blocks with multi-level skip connections alleviates the lack of correlation between features, so that feature information can be shared and reused.

However, the above-mentioned methods tend to produce smooth and blurred texture details in the reconstructed SR images. Compared with other networks, clearer (Daihong et al., Citation2022) and more realistic samples can be obtained by using generative adversarial networks (GAN). Ledig et al. (Citation2017) proposed SRGAN, the first model to combine generative adversarial networks with super-resolution reconstruction. A GAN-based super-resolution reconstruction model can learn how images degrade and then make the generated images look more realistic. However, GANs are unable to make full use of the shallow features of images and occasionally suffer from training instability, which can significantly reduce the expressiveness of the generative network and thus its performance.

3. Method

Super-resolution image reconstruction using deep neural networks can achieve good results, but problems such as excessive image smoothing, gradient vanishing and gradient explosion remain. Therefore, MARDGAN is proposed to solve the super-resolution problem in the real world, with the main purpose of improving the overall perceptual quality of SR. Following the idea of Inception networks (Szegedy et al., Citation2016), convolution kernels of different scales are used to convolve the input image. Compared with previous super-resolution methods that use a single convolution kernel size, the proposed algorithm aims to extract more details from the low-resolution image, which helps to restore a high-quality image.

3.1. Network structure

At present, most super-resolution reconstruction networks still use a single-scale convolution kernel to extract the underlying feature information of images, which ignores many details of the low-resolution image, whereas multi-scale feature extraction can effectively extract feature information at different levels (Huang et al., Citation2022; Meng et al., Citation2022). Figure 1 shows the proposed generative network. For shallow feature extraction, the low-resolution image is convolved with 3×3, 5×5 and 7×7 kernels along three separate paths to create distinct feature maps. Feature reuse is then enhanced by using 2 CSABs and 16 ERDBs in the DRDAM to extract the image's deeper texture and structure information. In particular, CSAB improves high-frequency feature information. Through ERDB, feature information at different depths in the network is exchanged to enhance feature reuse.
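Below is a minimal PyTorch sketch of the three-branch shallow feature extraction described above; the module name, the channel width of 64 and the example input size are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleShallowExtractor(nn.Module):
    """Three parallel convolution branches (3x3, 5x5, 7x7) over the LR input."""
    def __init__(self, in_channels=3, width=64):
        super().__init__()
        # Padding keeps all three branches at the same spatial size.
        self.branch3 = nn.Conv2d(in_channels, width, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, width, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(in_channels, width, kernel_size=7, padding=3)

    def forward(self, x):
        # Each branch produces its own shallow feature map for a separate path.
        return self.branch3(x), self.branch5(x), self.branch7(x)

# Example: a 24x24 LR patch (a 96x96 HR crop down-sampled by a factor of 4).
lr = torch.randn(1, 3, 24, 24)
f3, f5, f7 = MultiScaleShallowExtractor()(lr)
print(f3.shape, f5.shape, f7.shape)  # each: torch.Size([1, 64, 24, 24])
```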

Figure 1. Generative network structure.

Common feature fusion operations include feature mean fusion, feature weighted fusion and feature concatenation. In this paper, in the global feature fusion step, a concatenation operation is first performed to fuse the features from the different paths into one output feature. The bottleneck layer is then used to reduce the dimension of the combined features, thereby lowering the number of network parameters. Before image reconstruction, GRL integrates the input image's features from shallow feature extraction with the global features using a long skip connection, which is particularly helpful for network training. To increase training efficiency and add more details, only the feature map produced by the 3×3 convolution is added to the feature map after global feature fusion.
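A hedged sketch of this fusion step is shown below; it assumes the branch features share one channel width, that the bottleneck is a 1×1 convolution, and that GRL is an element-wise addition of the 3×3 shallow features. Those layer-level details are assumptions beyond the text above.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    """Concatenate branch features, reduce dimension, then add the 3x3 shallow features (GRL)."""
    def __init__(self, width=64, num_branches=3):
        super().__init__()
        # 1x1 bottleneck brings the concatenated features back to `width` channels.
        self.bottleneck = nn.Conv2d(width * num_branches, width, kernel_size=1)

    def forward(self, branch_feats, shallow_3x3):
        fused = torch.cat(branch_feats, dim=1)   # feature concatenation
        fused = self.bottleneck(fused)           # dimensionality reduction
        return fused + shallow_3x3               # long skip connection (global residual learning)
```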

Image reconstruction first requires enlarging the input feature map to the desired size, that is, up-sampling, and then reconstructing the image. In the past, deconvolution was the most commonly used up-sampling method (Zhang et al., Citation2019), but deconvolution often introduces artificial artefacts. In this work, we use sub-pixel convolution (Zhihong et al., Citation2019) for up-sampling: the feature map is first convolved to expand the number of channels, and the convolved feature map is then rearranged into a larger map, realising the image magnification process (He et al., Citation2019).
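The following sketch illustrates this sub-pixel (pixel-shuffle) up-sampling; the module name, channel width and output channel count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Expand channels by r^2 with a convolution, then rearrange pixels with PixelShuffle."""
    def __init__(self, width=64, scale=4, out_channels=3):
        super().__init__()
        self.expand = nn.Conv2d(width, out_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (C*r^2, H, W) -> (C, H*r, W*r)

    def forward(self, x):
        return self.shuffle(self.expand(x))

x = torch.randn(1, 64, 24, 24)
print(SubPixelUpsampler()(x).shape)  # torch.Size([1, 3, 96, 96])
```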

In the discriminator network, SN is used to stabilise training. The discriminative network is essentially a classifier that judges whether the received image is a real image or an image generated by the generator. Figure 2 shows the structure of the discriminative network. Eight 3×3 convolution kernels are used to extract features, and the number of channels doubles as the depth of the convolution layers increases. To avoid dead neurons, the LeakyReLU activation function is used, and SN is added in layers 2–8 to stabilise fluctuations of the network parameters and avoid gradient explosion. Finally, a Sigmoid function produces a one-dimensional tensor that reflects how real the image is.
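A minimal sketch of such a discriminator is given below, assuming a standard SRGAN-style stride pattern and a pooled classification head; these layer-level details are assumptions not stated in the text.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def disc_block(in_ch, out_ch, stride, use_sn=True):
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
    if use_sn:
        conv = spectral_norm(conv)  # constrains the layer's spectral norm for stable training
    return nn.Sequential(conv, nn.LeakyReLU(0.2, inplace=True))

class Discriminator(nn.Module):
    """Eight 3x3 convolutions; channels double with depth; SN on layers 2-8; Sigmoid output."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            disc_block(3, 64, 1, use_sn=False),            # layer 1: no spectral normalisation
            disc_block(64, 64, 2), disc_block(64, 128, 1), disc_block(128, 128, 2),
            disc_block(128, 256, 1), disc_block(256, 256, 2),
            disc_block(256, 512, 1), disc_block(512, 512, 2),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))
```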

Figure 2. Discriminative network structure.

3.2. Channel and spatial attention block (CSAB)

The CSAB designed in this paper differs from the traditional CBAM (Woo et al., Citation2018). As shown in Figure 3, the input enters the channel and spatial attention modules along two parallel branches at the same time to obtain attention weight coefficients and extract feature information at different levels, so that resources and features are allocated to each convolutional channel simultaneously for different tasks. Overall, the implementation is straightforward but efficient. The feature extraction map is then created by merging the feature maps of the two modules. Figure 4(a) depicts channel attention. The channel feature map is generated by spatial pyramid pooling (SPP), and the channel attention weight coefficients are generated by the fully connected layer, activation function layer and Sigmoid function. A new scaled feature is created by multiplying this weight by the input feature. Spatial attention is shown in Figure 4(b). Through average pooling and maximum pooling, two different feature maps along the channel axis are obtained and then concatenated; the other operations are the same as those of channel attention.
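The sketch below illustrates this dual-branch attention. The pooled channel descriptor (a plain global average pool instead of the SPP descriptor used in the paper), the reduction ratio, the 7×7 spatial convolution and the additive merge of the two branches are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Pooled channel descriptor -> fully connected layers -> Sigmoid channel weights."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # stand-in for the SPP descriptor in the paper
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # rescale each channel by its learned weight

class SpatialAttention(nn.Module):
    """Channel-wise average and max maps -> concat -> conv -> Sigmoid spatial weights."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w  # rescale each spatial position

class CSAB(nn.Module):
    """Dual-branch attention: both branches see the same input; their outputs are merged."""
    def __init__(self, channels=64):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.ca(x) + self.sa(x)  # merge by addition (an assumption; the paper only says "merge")
```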

Figure 3. Structure of the channel and spatial attention block.

Figure 4. Channel attention and spatial attention: (a) channel attention and (b) spatial attention.

SPP is commonly used in image restoration and image segmentation, and is essentially a set of average pooling layers. In super-resolution reconstruction, the receptive field required by deep networks is very large, which makes training slow. In contrast, by adjusting the size and stride of the sliding window, SPP requires only one feature computation and can convert a feature map of any size into a fixed-size feature vector. As shown in Figure 5, a convolution operation is performed on an image of arbitrary size to obtain the corresponding feature map. This feature map is pooled by SPP at three distinct scales: 4×4, 2×2 and 1×1. The label 16×64-d in the figure indicates that the feature map is divided into 16 blocks, each consisting of 64 channels. The partitioned feature maps are then fused by taking the average value of each block, yielding a 21×64 matrix, which can be flattened into a one-dimensional vector when fed into the fully connected layer.
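A small sketch of this pooling, assuming PyTorch adaptive average pooling for each pyramid level (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(4, 2, 1)):
    """Average-pool a feature map to 4x4, 2x2 and 1x1 grids and collect a fixed-length descriptor."""
    b, c, _, _ = x.shape
    pooled = [F.adaptive_avg_pool2d(x, level).view(b, c, -1) for level in levels]
    # 16 + 4 + 1 = 21 bins per channel, regardless of the input spatial size.
    return torch.cat(pooled, dim=2)  # shape (b, c, 21)

feat = torch.randn(2, 64, 24, 24)
print(spatial_pyramid_pool(feat).shape)  # torch.Size([2, 64, 21])
```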

Figure 5. Structure of SPP.

3.3. Enhanced residual dense block (ERDB)

In this work, we designed the ERDB by drawing on the idea of densely connected convolutional networks (Huang et al., Citation2017). In ERDB, a continuous memory mechanism is created by allowing information to be exchanged between network layers at different depths through residual and dense connections. The residual dense block in this paper has four layers with a total of ten connections. By connecting all feature maps of the same size, each layer receives input from all previous layers and then passes its feature maps to all subsequent layers, alleviating the vanishing-gradient problem and boosting feature propagation and reuse. Thanks to this continuous storage function, the feature states of each layer are continuously transmitted throughout the network. As shown in Figure 6, the ERDB consists of convolutional layers, spectral normalisation layers and activation layers, iterated several times with the PReLU activation function and 3×3 convolution kernels. In training, data are often normalised to speed up model convergence, and normalisation is sometimes also added in the hidden layers; in this paper, convergence is accelerated through PReLU activation functions and spectral normalisation. The stability of the GAN is further enhanced by multiplying the main path by a constant between 0 and 1 before adding the residual connection.
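A hedged PyTorch sketch of such a block follows; the growth rate of 32, the 1×1 local fusion layer and the residual scaling value of 0.2 are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ERDB(nn.Module):
    """Four densely connected conv layers (SN + PReLU), local fusion, scaled residual connection."""
    def __init__(self, channels=64, growth=32, res_scale=0.2):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(4):
            self.layers.append(nn.Sequential(
                spectral_norm(nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)),
                nn.PReLU()))
        # 1x1 local feature fusion back to the block's channel width.
        self.fuse = nn.Conv2d(channels + 4 * growth, channels, kernel_size=1)
        self.res_scale = res_scale  # constant in (0, 1) that stabilises adversarial training

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # each layer sees all previous outputs
        return x + self.res_scale * self.fuse(torch.cat(feats, dim=1))
```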

Figure 6. Enhanced residual dense block.

Numerous experiments (Huang et al., Citation2017) have demonstrated that removing the BN layer can improve training performance and reduce computational complexity in a variety of super-resolution reconstruction tasks. During GAN training, the discriminative network may reach its ideal state too early and always distinguish real from fake images. As a result, it cannot provide effective gradient information to the generative network, which leads to gradient explosion and non-convergence. To solve these problems, SN is applied to the parameter matrices of the discriminative network so that they satisfy a Lipschitz constraint. The Lipschitz condition limits how drastically the gradient of the function can change. Consequently, during neural network optimisation, the function is smoother, parameter changes are more stable, and gradient explosion is less likely.

3.4. Loss function

The loss function is an optimisation target for the reconstruction accuracy of each pixel as well as for the overall composition. Therefore, the generative network's total loss function is defined as follows in this paper:
$$L_G = \omega L_{percep} + \lambda L_{gen} + \mu L_{pixel}. \tag{1}$$
The first term in Equation (1) represents the perceptual loss, which constrains the generated image against the original image. The VGG-19 network is used to calculate the degree of difference between the HR and SR feature maps, preventing the generated image from being significantly different from the actual image. The perceptual loss is defined as
$$L_{percep} = \frac{1}{W_{i,j} H_{i,j} C_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \sum_{z=1}^{C_{i,j}} \left( \varphi_{i,j}(I^{HR})_{x,y,z} - \varphi_{i,j}(G(I^{LR}))_{x,y,z} \right)^2, \tag{2}$$
where $W_{i,j}$, $H_{i,j}$ and $C_{i,j}$ are the width, height and number of channels of the VGG feature map, $\varphi_{i,j}$ is the VGG feature map before the $i$-th pooling layer and after the $j$-th convolution, and $G(I^{LR})$ is the HR image reconstructed by the generator from the low-resolution image.

The second term in Equation (1) is the generation loss: the discriminator should judge the generated image to be real, so the distribution of the generated images tends to become more similar to that of the real images. The generation loss is defined as
$$L_{gen} = \mathbb{E}_{z \sim p_z(z)} \{ \log [1 - D(G(z))] \}, \tag{3}$$
where $\mathbb{E}_{z \sim p_z(z)}$ is the expectation with respect to the random noise $z$, and $G(z)$ is the generated sample.

The third term in Equation (1) represents the pixel loss, which is the sum of the absolute values of the pixel differences between the predicted image and the real image; it also prevents distortion of the reconstructed image due to excessive smoothing. The pixel loss is defined as
$$L_{pixel} = \mathbb{E}_{x,y}\left( \lVert y - G(x) \rVert_1 \right), \tag{4}$$
where $\mathbb{E}$ is the mathematical expectation, and $G(x)$ is the high-resolution image that the generator produces from the low-resolution image $x$.
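A hedged PyTorch sketch of these three terms is given below; the specific VGG-19 layer used for φ and the torchvision feature-slicing index are assumptions, since the paper does not state them.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """MSE between VGG-19 feature maps of the generated SR image and the HR image (Eq. 2)."""
    def __init__(self, layer_index=35):
        super().__init__()
        # Frozen VGG-19 feature extractor up to an assumed late convolutional layer.
        self.features = vgg19(weights="DEFAULT").features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, sr, hr):
        return nn.functional.mse_loss(self.features(sr), self.features(hr))

def generation_loss(d_fake):
    # Eq. (3): E[log(1 - D(G(z)))], with d_fake the discriminator's output on generated images.
    return torch.log(1.0 - d_fake + 1e-8).mean()

def pixel_loss(sr, hr):
    # Eq. (4): E[||y - G(x)||_1]
    return nn.functional.l1_loss(sr, hr)

def total_generator_loss(sr, hr, d_fake, percep, w=1.0, lam=1e-3, mu=1.0):
    # Eq. (1) with the loss weights reported in Section 4.2.
    return w * percep(sr, hr) + lam * generation_loss(d_fake) + mu * pixel_loss(sr, hr)
```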

4. Experiment and analysis

4.1. Experimental setup

The public data set COCO2014 (Song et al., Citation2021), which is a sizable and rich object recognition, segmentation and caption data set, primarily collected from complicated daily scenarios, serves as the source for the experimental training set. The test set uses Set5 (Wang et al., Citation2018), Set14 (Xu et al., Citation2020) and BSD100 (Liu & Chen, Citation2021). These three datasets are currently popular single image super-resolution datasets, from natural images to specific objects, including 5, 14 and 100 images with different scenes, respectively. Peak signal-to-noise ratio (PSNR) (Wang et al., Citation2018), structural similarity (SSIM) (Liu et al., Citation2021) and learned perceptual image patch similarity (LPIPS) (Zhang et al., Citation2018) are used as the evaluation indexes on the Y channel.
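As a reference, the sketch below computes PSNR on the Y channel of an RGB image pair, assuming the common ITU-R BT.601 luma conversion; border cropping and other evaluation details are not specified in the paper and are omitted here.

```python
import numpy as np

def rgb_to_y(img):
    """Convert an RGB image with values in [0, 255] to the BT.601 luma (Y) channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio computed on the Y channel."""
    mse = np.mean((rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```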

4.2. Experimental process

Before training, the images in the dataset are randomly cropped to 96×96, and bicubic interpolation is then used to down-sample them with a scaling factor of r = 4, generating the LR images required for training. Finally, the LR images and the original HR images are fed into the model, and the training data are augmented by random horizontal flips. In the generative network, 2 CSABs and 16 ERDBs are used, and the number of channels is uniformly set to 64. In the discriminative network, the number of channels is set to 64, 128, 256 and 512 from shallow to deep. For hyperparameter selection, a range is first set for each hyperparameter, values are randomly sampled from that range, the effect of different orders of magnitude on the error is examined through comparison experiments, and fine-tuning finally determines the hyperparameter values. For better generative adversarial training, we first pre-train a multi-scale attention residual dense network (MARDN) model as the generative network, and then train this generative network together with the discriminative network. Both the pre-training network and the generative adversarial network are trained with the Adam optimiser with β1=0.9, β2=0.999 and a learning rate of 1×10⁻⁴. The batch size and number of epochs for the pre-training network are 32 and 110, respectively, and 32 and 50 for the generative adversarial network. The weights of the loss function are set to ω=1, λ=1×10⁻³ and μ=1.
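The data preparation described above can be sketched as follows; the exact cropping and flipping implementation is an assumption, and the commented optimiser line only restates the stated settings.

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr_image, crop=96, scale=4):
    """Random 96x96 HR crop, random horizontal flip, then bicubic 4x down-sampling for the LR input."""
    _, h, w = hr_image.shape                       # hr_image: (C, H, W) tensor in [0, 1]
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    hr = hr_image[:, top:top + crop, left:left + crop]
    if torch.rand(1).item() < 0.5:                 # random horizontal flip for data augmentation
        hr = torch.flip(hr, dims=[2])
    lr = F.interpolate(hr.unsqueeze(0), scale_factor=1 / scale,
                       mode="bicubic", align_corners=False).squeeze(0)
    return lr, hr

# Optimiser settings used for both the pre-training and adversarial stages:
# optimiser = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.9, 0.999))
```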

4.3. Loss analysis

To ensure that the subsequent training moves in the right direction, the loss function measures the difference between the network output and the ground truth at each iteration and is back-propagated through the network. Figure 7 shows how the loss functions of MARDGAN change on the COCO2014 dataset. The figure shows that the training process of MARDGAN is very unstable: the generation loss and discrimination loss curves oscillate strongly, which shows that the struggle between the discriminator and the generator is fierce and pushes the generator to produce more realistic images. In contrast, the perceptual loss gradually converges, indicating that the texture details of the image are well recovered. Generally speaking, compared with traditional networks, generative adversarial training is more difficult but can obtain better results.

Figure 7. Change curve of MARDGAN loss function: (a) generator loss, (b) discriminator loss and (c) perceptual loss.

4.4. Quantitative evaluations

We compare the proposed model with state-of-the-art SISR methods, namely Bicubic, RCAN (Zhang et al., Citation2018), RFANet (Liu et al., Citation2020), DRSR (Yang et al., Citation2021), ESRGAN (Wang et al., Citation2018), WDSRGAN (Sun et al., Citation2020) and FBSRGAN (Wang et al., Citation2022), to confirm the proposed network's reconstruction capability. From Table 1, the average PSNR and SSIM of the proposed MARDGAN on the three test sets are 2.139 dB and 0.049 higher than ESRGAN, 0.396 dB and 0.013 higher than WDSRGAN, and 0.319 dB and 0.012 higher than FBSRGAN. The average PSNR and SSIM of the proposed MARDN model on the three test sets are 1.375 dB and 0.048 higher than RCAN, 0.826 dB and 0.032 higher than RFANet, and 0.69 dB and 0.026 higher than DRSR. This is because the proposed network model obtains the deep feature information of the image through three different paths with different receptive fields. The CSAB focuses on the more important features in the image so as to ignore useless information, and spectral normalisation is used to prevent artefacts in the generated image. In contrast, traditional channel and spatial attention mechanisms struggle to learn the high-frequency information of HR images and thus cannot efficiently restore high-quality SR images. Traditional residual and dense networks may also carry a lot of redundant information when concatenating all generated feature maps, resulting in excessive model complexity. This paper combines the residual network and the dense network more closely and proposes the new ERDB structure. It not only fully extracts the local features of the image but also provides a continuous storage function for the network. The features of each layer are propagated backward, and the more effective parts of previous and current features are learned adaptively through local feature fusion; the network can thus be deepened while preserving earlier gradients. Therefore, the objective evaluation indexes obtained by the model are higher than those of the state-of-the-art SISR methods.

Table 1. PSNR(dB)/SSIM evaluation results of different methods on different datasets.

The PSNR and SSIM of the MARDGAN model are lower than those of the pre-trained MARDN model because the generative adversarial training does not further improve PSNR and SSIM. Instead, the generative and discriminative networks keep competing until a Nash equilibrium is reached, making the generated image more similar to the original image. Excessively increasing PSNR and SSIM values tends to over-smooth the generated image and degrade its perceived quality, whereas GAN can produce high-quality images that are more in line with human perception. Table 2 shows that the LPIPS of the proposed MARDGAN is the lowest on the three test sets, indicating that the generated images are visually closest to the real images.

Table 2. LPIPS evaluation results of different methods on different datasets.

Figure 8 shows the performance and complexity of each model. As shown in Figure 8(a), FBSRGAN has more parameters than the other models despite being a considerable advance over earlier techniques. DRSR can achieve good performance with few parameters, but it is still difficult to train because of its complex network architecture. In contrast, our MARDN and MARDGAN extract image features better by combining CSAB and ERDB: with fewer parameters, their performance exceeds that of several existing methods, and the reconstructed image quality is better. For a more intuitive comparison with other methods, Figure 8(b) shows the trade-off between runtime and performance. It is clear that, with a short runtime, our approach can produce better PSNR values.

Figure 8. Performance and model complexity comparison: (a) PSNR vs parameters and (b) PSNR vs runtimes.

4.5. Qualitative evaluations

To compare the effects more easily, we take one image each from the Set5, Set14 and BSD100 datasets and reconstruct them with the different SR algorithm models. Figure 9 shows the reconstruction results of each SR algorithm for the image named “Head” from the Set5 dataset, magnified by a factor of four. The boy's eyes and nose are partially magnified in the test image. As depicted in the figure, the image reconstructed by bicubic interpolation is very hazy. The images reconstructed by RCAN and RFANet are very smooth with blurred texture, and DRSR improves on them only slightly. The images reconstructed by ESRGAN and WDSRGAN still have defects in the nose region, and FBSRGAN does not recover the eyelashes satisfactorily. The MARDN reconstruction achieves good results on the objective evaluation indicators but is still imperfect in texture detail recovery. In contrast, the image reconstructed by MARDGAN is more accurate in texture reconstruction; it reconstructs the texture details of the spots on the bridge of the boy's nose well and achieves a good reconstruction effect both overall and in the details. In Figure 10, MARDGAN reconstructs the texture details of the baboon's whiskers well without any artefacts. In Figure 11, the glass part of the window shows clearer and finer details, and the reconstruction quality is better than that of the other models.

Figure 9. “Head” image reconstruction in Set5 dataset by different SR algorithms: (a) origin PSNR(dB)/SSIM, (b) bicubic 27.581(dB)/0.679, (c) RCAN 30.872(dB)/0.759, (d) RFANet 31.836(dB)/0.772, (e) DRSR 32.073(dB)/0.780, (f) ESRGAN 29.521(dB)/0.740, (g) WDSRGAN 31.527(dB)/0.766, (h) FBSRGAN 31.957(dB)/0.776, (i) ours(MARDN) 33.010(dB)/0.818 and (j) ours(MARDGAN) 32.246(dB)/0.788.

Figure 10. “Baboon” image reconstruction in Set14 dataset by different SR algorithms: (a) origin PSNR(dB)/SSIM, (b) bicubic 20.178(dB)/0.419, (c) RCAN 22.033(dB)/0.524, (d) RFANet 22.483(dB)/0.542, (e) DRSR 22.612(dB)/0.548, (f) ESRGAN 21.937(dB)/0.508, (g) WDSRGAN 22.237(dB)/0.531, (h) FBSRGAN 22.506(dB)/0.545, (i) ours(MARDN) 23.020(dB)/0.573 and (j) ours(MARDGAN) 22.707(dB)/0.557.

Figure 11. “86000” image reconstruction in BSD100 dataset by different SR algorithms: (a) origin PSNR(dB)/SSIM, (b) bicubic 23.107(dB)/0.686, (c) RCAN 25.486(dB)/0.765, (d) RFANet 25.908(dB)/0.797, (e) DRSR 26.175(dB)/0.811, (f) ESRGAN 24.746(dB)/0.744, (g) WDSRGAN 25.762(dB)/0.783, (h) FBSRGAN 26.047(dB)/0.802, (i) ours(MARDN) 26.717(dB)/0.832 and (j) ours(MARDGAN) 26.253(dB)/0.815.

4.6. Ablation experiment

We modify the base model one component at a time and compare the results to see how each part of the proposed MARDGAN affects the reconstruction results. Figure 12 depicts the overall visual comparison. The original image is in the first column, and each subsequent column represents a different model configuration. A detailed discussion follows.

Figure 12. Overall visual effect comparison: (a) origin, (b) BN+RB, (c) SN+ERDB, (d) SN+ERDB +CSAB and (e) SN+ERDB +CSAB+GAN.

The image in column (c) of Figure 12 is less over-smoothed than that in column (b), and no artefacts are produced. This is because ERDB can propagate features more effectively than the residual block (RB) in SRResNet (Blau et al., Citation2018) and utilise all of the original image's hierarchical features, while replacing the BN layer with the SN layer eliminates artefacts, improves the quality of the restored image and improves the stability of the model. In Figure 12, column (d) shows higher-quality image reconstruction and better recovery of details than column (c). This is because CSAB has a strong representation capability for capturing semantic information and can make better use of the high-frequency features of low-resolution images while suppressing unnecessary features. The images in column (e) of Figure 12 are clearer and richer in texture than those in column (d). This is because of the generative adversarial training: it gives the image a more lifelike appearance, assists in recovering more intricate texture features and produces images of higher quality.

The rows in Table 3 correspond to the columns in Figure 12. The contribution of each module is analysed with objective evaluation metrics. Table 3 shows that the average PSNR and SSIM of column c improved by 1.267 dB and 0.034, respectively, and LPIPS decreased by 0.017 compared with column b. The average PSNR and SSIM of column d improved by 1.060 dB and 0.044, respectively, and LPIPS decreased by 0.015 compared with column c. In contrast, the average PSNR and SSIM of column e are 0.441 dB and 0.024 lower than those of column d, while LPIPS is 0.006 lower. In summary, although SN, ERDB and CSAB produce improved PSNR and SSIM values, the quality of the generated image still has defects; the additional use of GAN not only yields generated images with higher quality, better texture and an appearance more in line with human vision, but also obtains a better LPIPS index.

Table 3. Results of evaluation index used by different modules on different datasets.

5. Conclusion

In this paper, a Multi-scale Dual-Attention based Residual Dense Generative Adversarial Network (MARDGAN) is proposed to address the issues of unrecoverable image details and insufficient use of feature information in low-resolution images. In particular, the enhanced residual dense block (ERDB) is designed to transmit the state of each layer continuously so that the main network can concentrate on learning high-frequency information. In addition, the proposed use of spectral normalisation significantly enhances the stability of model training. The channel and spatial attention block (CSAB), which adaptively assigns weights to feature maps by considering the interdependence between channel features and spatial features, was also developed to improve the utilisation of inherent image features. According to the experimental findings, this approach can extract more in-depth feature information from low-resolution images to produce high-resolution images with improved texture clarity and quality. In terms of PSNR, SSIM and LPIPS, this method outperforms previous methods and has superior visual quality. Through specific training, our method has great application potential in areas requiring high reliability of SR images, such as enhancing mural clarity, satellite remote sensing imaging and medical imaging. However, the number of parameters in this method is still large, which greatly increases the training time and may even lead to over-fitting. In future research, we will continue to optimise the current network architecture and prefer a lighter network to reduce training time and improve the quality of super-resolution reconstructed images.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

The project was supported in part by the Natural Science Basis Research Plan in Shaanxi Province of China under Grant [2023-JC-YB-517] and the High-level Talent Introduction Project of Shaanxi Technical College of Finance and Economics [2022KY01].

References

  • Alshammri, G. H., Samha, A. K., El-Shafai, W., Elsheikh, E. A., Hamid, E. A., Abdo, M. I., Amoon, M., & El-Samie, F. E. A. (2022). Three-dimensional video super-resolution reconstruction scheme based on histogram matching and recursive Bayesian algorithms. IEEE Access, 10, 41935–41951. https://doi.org/10.1109/ACCESS.2022.3153409
  • Blau, Y., Mechrez, R., Timofte, R., Michaeli, T., & Zelnik-Manor, L. (2018). The 2018 pirm challenge on perceptual image super-resolution. Proceedings of the European conference on computer vision (ECCV) workshops (pp. 1–22).
  • Chan, K., Wang, X., Yu, K., Dong, C., & Loy, C. C. (2021). BasicVSR: The search for essential components in video super-resolution and beyond. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4947–4956).
  • Chen, Y., Liu, L., Phonevilay, V., Gu, K., Xia, R., Xie, J., Zhang, Q., & Yang, K. (2021). Image super-resolution reconstruction based on feature map attention mechanism. Applied Intelligence, 51(7), 4367–4380. https://doi.org/10.1007/s10489-020-02116-1
  • Choi, Y. J., Lee, Y. W., & Kim, B. G. (2021). Wavelet attention embedding networks for video super-resolution. 2020 25th International conference on pattern recognition (ICPR) (pp. 7314–7320).
  • Daihong, J., Sai, Z., Lei, D., & Yueming, D. (2022). Multi-scale generative adversarial network for image super-resolution. Soft Computing, 26(8), 3631–3641. https://doi.org/10.1007/s00500-022-06822-5
  • Gai, K., Qiu, M., & Elnagdy, S. A. (2016). A novel secure big data cyber incident analytics framework for cloud-based cybersecurity insurance. IEEE BigDataSecurity (pp. 171–176).
  • Haris, M., Shakhnarovich, G., & Ukita, N. (2020). Deep back-projection networks for single image super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12), 4323–4337. https://doi.org/10.1109/TPAMI.2020.3002836
  • He, Z., Cao, Y., Du, L., Xu, B., Yang, J., Cao, Y., Tang, S., & Zhuang, Y. (2019). Mrfn: multi-receptive-field network for fast and accurate single image super-resolution. IEEE Transactions on Multimedia, 22(4), 1042–1054. https://doi.org/10.1109/TMM.6046
  • Hu, Y., Li, J., Huang, Y., & Gao, X. (2019). Channel-wise and spatial feature modulation network for single image super-resolution. IEEE Transactions on Circuits and Systems for Video Technology, 30(11), 3911–3927. https://doi.org/10.1109/TCSVT.76
  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).
  • Huang, W., Liao, X., Zhu, L., Wei, M., & Wang, Q. (2022). Single-Image super-resolution neural network via hybrid multi-scale features. Mathematics, 10(4), 4–21. https://doi.org/10.3390/math10040653
  • Huang, Z.-X., & Jing, C.-W. (2020). Super-resolution reconstruction method of remote sensing image based on multi-feature fusion. IEEE Access, 8, 18764–18771. https://doi.org/10.1109/Access.6287639
  • Jo, Y., Oh, S.W., Kang, J., & Kim, S.J. (2018). Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3224–3232).
  • Khan, A., Sohail, A., Zahoora, U., & Qureshi, A. S. (2020). A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 53(8), 5455–5516. https://doi.org/10.1007/s10462-020-09825-6
  • Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., & Shi, W. (2017). Photo-realistic single image super-resolution using a generative adversarial network. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4681–4690).
  • Li, H.A., Zhang, M., Chen, D., Zhang, J., Yang, M., & Li, Z. (2023). Image color rendering based on hinge-cross-entropy GAN in internet of medical things. CMES-Computer Modeling in Engineering & Sciences, 135(1), 779–794. https://doi.org/10.32604/cmes.2022.022369
  • Li, Y., Gai, K., Ming, Z., Zhao, H., & Qiu, M. (2016). Intercrossed access controls for secure financial services on multimedia big data in cloud systems. Transactions on Multimedia Computing, Communications, and Applications (TOMM), 12(4s), 1–18. https://doi.org/10.1145/2978575
  • Li, Y., Sixou, B., & Peyrin, F. (2021). A review of the deep learning methods for medical images super resolution problems. IRBM, 42(2), 120–133. https://doi.org/10.1016/j.irbm.2020.08.004
  • Liu, B., & Chen, J. (2021). A super resolution algorithm based on attention mechanism and srgan network. IEEE Access, 9, 139138–139145. https://doi.org/10.1109/ACCESS.2021.3100069
  • Liu, J., Zhang, W., Tang, Y., Tang, J., & Wu, G. (2020). Residual feature aggregation network for image super-resolution. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2359–2368).
  • Liu, Q.-M., Jia, R.-S., Liu, Y.-B., Sun, H.-B., Yu, J.-Z., & Sun, H.-M. (2021). Infrared image super-resolution reconstruction by using generative adversarial network with an attention mechanism. Applied Intelligence, 51(4), 2018–2030. https://doi.org/10.1007/s10489-020-01987-8
  • Meng, Q., Wang, W., Zhou, T., Shen, J., Van Gool, L., & Dai, D. (2020). Weakly supervised 3d object detection from lidar point cloud. European Conference on computer vision (pp. 515–531). Springer.
  • Meng, Z., Zhang, J., Li, X., & Zhang, L. (2022). Lightweight image super-resolution based on local interaction of multi-scale features and global fusion. Mathematics, 10(7), 1–17. https://doi.org/10.3390/math10071096
  • Qin, J., Huang, Y., & Wen, W. (2020). Multi-scale feature fusion residual network for single image super-resolution. Neurocomputing, 379, 334–342. https://doi.org/10.1016/j.neucom.2019.10.076
  • Qiu, D., Zheng, L., Zhu, J., & Huang, D. (2021). Multiple improved residual networks for medical image super-resolution. Future Generation Computer Systems, 116, 200–208. https://doi.org/10.1016/j.future.2020.11.001
  • Qiu, H., Zeng, Y., Guo, S., Zhang, T., Qiu, M., & Thuraisingham, B. (2021). Deepsweep: An evaluation framework for mitigating dnn backdoor attacks using data augmentation. Proceedings of the 2021 ACM Asia conference on computer and communications security (pp. 363–377).
  • Qiu, H., Zheng, Q., Msahli, M., Memmi, G., Qiu, M., & Lu, J. (2020). Topological graph convolutional network-based urban traffic flow and density prediction. IEEE Transactions on Intelligent Transportation Systems, 22(7), 4560–4569. https://doi.org/10.1109/TITS.2020.3032882
  • Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., & Wang, Z. (2016). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1874–1883).
  • Song, H., Xu, W., Liu, D., Liu, B., Liu, Q., & Metaxas, D. N. (2021). Multi-stage feature fusion network for video super-resolution. IEEE Transactions on Image Processing, 30, 2923–2934. https://doi.org/10.1109/TIP.2021.3056868
  • Sun, X., Zhao, Z., Zhang, S., Liu, J., Yang, X., & Zhou, C. (2020). Image super-resolution reconstruction using generative adversarial networks based on wide-channel activation. IEEE Access, 8, 33838–33854. https://doi.org/10.1109/Access.6287639
  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).
  • Wang, H., Wei, M., Cheng, R., Yu, Y., & Zhang, X. (2022). Residual deep attention mechanism and adaptive reconstruction network for single image super-resolution. Applied Intelligence, 52(5), 5197–5211. https://doi.org/10.1007/s10489-021-02568-z
  • Wang, X., Wu, Y., Ming, Y., & Lv, H. (2020). Remote sensing imagery super resolution based on adaptive multi-scale feature fusion network. Sensors, 20(4), 1–15. https://doi.org/10.1109/JSEN.7361
  • Wang, X., Yu, K., Dong, C., & Loy, C.C. (2018). Recovering realistic texture in image super-resolution by deep spatial feature transform. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 606–615).
  • Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., & Change Loy, C. (2018). Esrgan: Enhanced super-resolution generative adversarial networks. Proceedings of the European conference on computer vision (ECCV) workshops (pp. 1–16).
  • Wang, Y., Li, X., Nan, F., Liu, F., Li, H., Wang, H., & Qian, Y. (2022). Image super-resolution reconstruction based on generative adversarial network model with feedback and attention mechanisms. Multimedia Tools and Applications, 81(5), 6633–6652. https://doi.org/10.1007/s11042-021-11679-1
  • Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., & Schroers, C. (2018). A fully progressive approach to single-image super-resolution. Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 864–873).
  • Wang, Z., Lu, Y., Li, W., Wang, S., Wang, X., & Chen, X. (2021). Single image super-resolution with attention-based densely connected module. Neurocomputing, 453, 876–884. https://doi.org/10.1016/j.neucom.2020.08.070
  • Woo, S., Park, J., Lee, J. Y., & Kweon, I.S. (2018). Cbam: Convolutional block attention module. Proceedings of the European conference on computer vision (ECCV) (pp. 3–19).
  • Xu, X., Ye, Y., & Li, X. (2020). Joint demosaicing and super-resolution (JDSR): network design and perceptual optimization. IEEE Transactions on Computational Imaging, 6, 968–980. https://doi.org/10.1109/TCI.6745852
  • Yang, X., Xie, T., Liu, L., & Zhou, D. (2021). Image super-resolution reconstruction based on improved Dirac residual network. Multidimensional Systems and Signal Processing, 32(4), 1065–1082. https://doi.org/10.1007/s11045-021-00773-0
  • Zhang, J., Feng, W., Yuan, T., Wang, J., & Sangaiah, A. K. (2022). SCSTCF: spatial-channel selection and temporal regularized correlation filters for visual tracking. Applied Soft Computing, 118, 108485. https://doi.org/10.1016/j.asoc.2022.108485
  • Zhang, K., Gao, X., Tao, D., & Li, X. (2012). Single image super-resolution with non-local means and steering kernel regression. IEEE Transactions on Image Processing, 21(11), 4544–4556. https://doi.org/10.1109/TIP.2012.2208977
  • Zhang, K., Zuo, W., & Zhang, L. (2019). Deep plug-and-play super-resolution for arbitrary blur kernels. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1671–1681).
  • Zhang, L., & Wu, X. (2006). An edge-guided image interpolation algorithm via directional filtering and data fusion. IEEE Transactions on Image Processing, 15(8), 2226–2238. https://doi.org/10.1109/TIP.2006.877407
  • Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 586–595).
  • Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., & Fu, Y. (2018). Image super-resolution using very deep residual channel attention networks. Proceedings of the European conference on computer vision (ECCV) (pp. 286–301).
  • Zhang, Y., Tian, Y., Kong, Y., Zhong, B., & Fu, Y. (2018). Residual dense network for image super-resolution. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2472–2481).
  • Zhihong, X., Caiyan, H., & Kunpeng, Y. (2019). Super-resolution reconstruction of accelerated image based on deep residual network. Acta Optica Sinica, 39(2), 1–10. https://doi.org/10.3788/AOS
  • Zhou, T., Wang, W., Qi, S., Ling, H., & Shen, J. (2020). Cascaded human-object interaction recognition. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4263–4272).