Research Article

Dual conditional GAN based on external attention for semantic image synthesis

Article: 2259120 | Received 03 Jun 2023, Accepted 10 Sep 2023, Published online: 04 Oct 2023

Abstract

Although existing semantic image synthesis methods based on generative adversarial networks (GANs) have achieved great success, the quality of the generated images is still not satisfactory. This is mainly caused by two factors. One is that the information in the semantic layout is sparse. The other is that a single constraint cannot effectively control the positional relationship between objects in the generated image. To address these problems, we propose a dual-conditional GAN based on external attention for semantic image synthesis (DCSIS). In DCSIS, the adaptive normalization method uses the one-hot encoded semantic layout to generate the first latent space, and the external attention uses the RGB encoded semantic layout to generate the second latent space. The two latent spaces control the shape of objects and the positional relationship between objects in the generated image. A graph attention (GAT) module is added to the generator to strengthen the relationship between different categories in the generated image. A graph convolutional segmentation network (GSeg) is designed to learn information for each category. Experiments on several challenging datasets demonstrate the advantages of our method over existing approaches in terms of both visual quality and representative evaluation criteria.

1. Introduction

Conditional image synthesis mainly uses text, Gaussian noise or a semantic layout to generate constrained images. Conditional generative adversarial networks (GANs) (Mirza & Osindero, Citation2014) are a common approach to conditional image synthesis. Within conditional image synthesis, semantic image synthesis aims to generate photorealistic images from semantic layouts. Since the information contained in a semantic layout is relatively sparse, semantic image synthesis poses a major challenge to image synthesis methods.

Semantic image synthesis is widely used; past work includes specified content creation (Mirza & Osindero, Citation2014; Ntavelis et al., Citation2020) and drawing editing (Park et al., Citation2019; Tang, Xu, et al., Citation2020; Zhu et al., Citation2020), among other related work. Its industrial applications are also broad, including virtual reality and AIGC-related applications.

Currently, semantic image synthesis methods based on GANs generally use noise as the input, and the semantic layout controls the image synthesis process through adaptive normalisation. SPADE (Park et al., Citation2019) is a representative semantic image synthesis method that effectively solves the problem of blurred boundaries between categories in the generated image. CC-FPSE (Liu et al., Citation2019) and SCGAN (Y. Wang et al., Citation2021) build on SPADE (Park et al., Citation2019) and have achieved good results. However, since these methods only use a single constraint to control the synthesis process, the quality of the generated images still cannot meet the needs of users.

In addition, the discriminator also affects the quality of the generated images. In GANs, the discriminator mainly consists of a convolutional network, and PatchGAN (Isola et al., Citation2017) is a commonly used discriminator. Recently, some new discriminators have been proposed. OASIS (Schonfeld et al., Citation2021) proposed a novel discriminator based on a segmentation network, which effectively encourages the generator to produce object shapes that conform to the semantic layout. However, to a certain extent it ignores the positional relationship between different categories.

To solve the above problems, we propose a dual-conditional GAN based on external attention for semantic image synthesis (DCSIS). In DCSIS, the adaptive normalisation module uses the one-hot encoded semantic layouts to generate the first constraint, and the external attention uses the RGB semantic layouts to generate the second constraint. The two modules apply their constraints to the input in sequence, forming a dual conditional attention (DCA). Compared with a single constraint, DCA makes better use of the category and boundary information in the semantic layouts to synthesise finer details. Attention mechanisms have been widely used in image synthesis and effectively improve the quality of synthesised images (Tang et al., Citation2019; Q. Wang et al., Citation2020). A novel graph attention (GAT) is introduced into the generator to strengthen the relative positional relationships between objects of different categories.

DCSIS has two discriminators: the traditional discriminator SESAME (Ntavelis et al., Citation2020) and the proposed segmentation network based on a graph convolutional network. The latter can not only align semantic information but also better establish relationships between objects of different categories. An overview of the proposed DCSIS model is shown in Figure 1. We conduct experiments on three challenging datasets.

Figure 1. Architecture of DCSIS.


In general, the main contributions of this paper are as follows:

  1. We propose two constraint methods to control the synthesis process. The RGB-format semantic layout and the one-hot encoded semantic layout are used to generate two constraints, forming dual conditional control.

  2. We design a segmentation network discriminator based on a graph convolutional network, which can better align semantic information.

  3. We design a novel graph attention to enhance the relational information between objects of different categories.

2. Related work

Generative adversarial networks have achieved remarkable success on unconditional image synthesis tasks (Brock et al., Citation2019; Tero Karras et al., Citation2019; T. Karras et al., Citation2020). Since the result of unconditional image synthesis is uncontrollable, conditional image synthesis, which uses external control information to control the synthesised result, was proposed. The semantic layout is commonly used control information in conditional image synthesis and mainly serves as the input of the generator.

Pix2pix (Isola et al., Citation2017) and pix2pixHD (T.-C. Wang et al., Citation2018) are classical conditional image synthesis methods that take semantic layout as the input of the generator. EdgeGAN (Tang, Qi, et al., Citation2020) used edge details to optimise detailed structural information for image synthesis.

Due to the sparsity of the information contained in the semantic layout, directly using the semantic layout as the input places a heavy burden on network learning and makes it difficult to effectively improve the quality of the generated images. Therefore, current mainstream conditional image synthesis methods generally use noise as the input of the generator and the semantic layout as a constraint to control the image synthesis process.

At present, adaptive normalisation methods that take the semantic layout as the input to constrain the image synthesis process have gradually become the mainstream form of constraint. AdaLIN (Kim et al., Citation2019), SPADE (Park et al., Citation2019), SEAN (Zhu et al., Citation2020), class-adaptive normalisation (D. Chen et al., Citation2020) and SAFM (Lv et al., Citation2022) are well-known adaptive normalisation methods. These methods take the semantic layout as the input and use it to constrain the features of the noise during normalisation: the semantic layout generates the parameters that control the normalisation results. Since the semantic layout is only used to generate normalisation parameters, this effectively mitigates the problem that the semantic layout contains sparse information. A minimal sketch of this kind of layout-conditioned normalisation is given below.
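The following is a minimal, illustrative PyTorch sketch of a SPADE-style adaptive normalisation layer, not the exact implementation of any of the cited methods; the channel sizes and hidden width are assumptions for demonstration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADELikeNorm(nn.Module):
    """SPADE-style adaptive normalisation: the semantic layout only predicts
    the per-pixel scale (gamma) and shift (beta) applied after a
    parameter-free normalisation of the noise-derived features."""
    def __init__(self, feat_channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the (one-hot) layout to the feature resolution.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        ctx = self.shared(segmap)
        return self.norm(x) * (1 + self.to_gamma(ctx)) + self.to_beta(ctx)

# Example: a 35-class layout modulating a 64-channel feature map.
norm = SPADELikeNorm(feat_channels=64, label_channels=35)
y = norm(torch.randn(1, 64, 32, 64), torch.randn(1, 35, 128, 256))
```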

Besides the generator, the discriminator also affects the quality of the generated images, and several new discriminators have been proposed.

CC-FPSE (Liu et al., Citation2019) proposed a pyramid discriminator, which jointly feeds the generated image and semantic labels into the discriminator, and then discriminates true and false on multiple resolutions. OASIS (Schonfeld et al., Citation2021) proposed a segmentation network discriminator supervised with semantic labels.

LGGAN (Tang, Xu, et al., Citation2020) uses local class-specific and global image-level generation modules to learn the appearance distribution and generation of different object categories. SCGAN (Y. Wang et al., Citation2021) learns to generate normalisation parameters by convolving semantic vectors.

In addition to these GAN-based methods, there are also some non-GAN methods, such as CRN (Q. Chen & Koltun, Citation2017), which uses refined cascaded networks for semantic image synthesis. More recently, the diffusion-based SDM (W. Wang et al., Citation2022) has been applied to this task; it combines the SPADE normalisation module with a diffusion model backbone to control image generation.

For conditional image synthesis, fully exploiting the information of the semantic layout is crucial to the quality of the generated image. However, most approaches only use the semantic layout for a single constraint control. As a comparison, DCSIS uses the semantic layout of the RGB format and the semantic layout of the one-hot encoding format to form two different constraint controls. This approach further improves the utilisation of the semantic layout information.

3. Method

3.1. Dual conditional attention

Image synthesis methods that only use adaptive normalisation to constrain the generated images can no longer meet the needs of existing tasks. In this paper, the proposed dual conditional attention (DCA) module is used to constrain the generated images. DCA contains two constraints: the RGB encoded semantic layouts and the one-hot encoded semantic layouts. Structurally, DCA consists of SPADE and the proposed attention network. DCA and the architecture of the generator network are shown in Figure 2.

Figure 2. Architecture of the generator network.


Inspired by the pre-trained vision-language models CLIP (Radford et al., Citation2021) and GLIDE (Nichol et al., Citation2021), we propose an RGB semantic encoder to extract the information of the RGB encoded semantic layouts.

The RGB semantic encoder consists of a CNN encoder and a transformer module, as shown in Figure 3. The semantic latent space generated by the RGB semantic encoder is fed into the proposed attention network in DCA. Different from ordinary attention modules, the proposed attention network is a conditional attention module named Seg_Attention. The module is formulated as follows:
(1) $\mathrm{Seg\_Attention}(Q,K,V,K_c,V_c)=\mathrm{softmax}\left(\frac{Q\,\mathrm{concat}(K,K_c)^{T}}{\sqrt{d_k}}\right)\mathrm{concat}(V,V_c),$
where $Q$, $K$ and $V$ come from the feature maps, and $K_c$ and $V_c$ come from the backbone network.
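As a sketch of Eq. (1), the snippet below implements the conditional attention in PyTorch, assuming the feature map and the RGB semantic latent space have already been flattened into token sequences; all tensor sizes are hypothetical.

```python
import math
import torch

def seg_attention(q, k, v, k_c, v_c):
    """Eq. (1): keys/values from the feature map (k, v) are concatenated with
    keys/values derived from the RGB semantic latent space (k_c, v_c), so the
    attention map is conditioned on the semantic layout.
    q, k, k_c: (B, N, d_k); v, v_c: (B, N, d_v); token counts may differ."""
    d_k = q.size(-1)
    k_cat = torch.cat([k, k_c], dim=1)                     # (B, N_k + N_c, d_k)
    v_cat = torch.cat([v, v_c], dim=1)                     # (B, N_k + N_c, d_v)
    attn = torch.softmax(q @ k_cat.transpose(1, 2) / math.sqrt(d_k), dim=-1)
    return attn @ v_cat                                    # (B, N_q, d_v)

# Hypothetical sizes for illustration only.
B, N, Nc, d = 2, 256, 64, 128
out = seg_attention(torch.randn(B, N, d), torch.randn(B, N, d),
                    torch.randn(B, N, d), torch.randn(B, Nc, d),
                    torch.randn(B, Nc, d))
print(out.shape)  # torch.Size([2, 256, 128])
```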

Figure 3. Architecture of the RGB semantic encoder.


Since SPADE only uses the parameters generated from the semantic layouts to control the input information, much of the object-category information in the semantic layout is lost, leaving the details of the generated images without further guidance. Hence, the RGB semantic latent space is used to enhance the semantic constraint information in DCA. In this way, the image synthesis process is controlled by two different forms of constraints, which improves the utilisation of the semantic layout information.

3.2. Graph attention

To strengthen the connections between categories in the generated images, two attention mechanisms are employed at the end of the generator: the spatial attention of (Tang, Bai, et al., Citation2020) and the proposed graph attention. The purpose of the proposed graph attention is to learn the correlation between regions in the image, which can effectively improve the quality of each region in the synthesised image.

The proposed graph attention is shown in Figure 4, where FC means fully connected layer. As shown in Figure 4, the proposed graph attention is composed of a patch embed module, three fully connected layers and a softmax layer. The patch embed module divides the image into N blocks as the input. The first fully connected layer extracts the features of the input, and the other two fully connected layers together with the softmax layer compute the attention parameter matrix. The proposed graph attention adopts a residual connection to obtain the final feature. The proposed graph attention can be expressed as:
(2) $\beta=\mathrm{Softmax}\left(\mathrm{FC}_1(\mathrm{FC}(\alpha))\,\mathrm{FC}_2(\mathrm{FC}(\alpha))^{T}\right)\mathrm{FC}(\alpha),$
(3) $F_{out}=(\beta\otimes F_{l})\oplus F_{l-1},$
where $\alpha$ represents the features processed by the patch embed module, $\oplus$ represents element-wise addition, $T$ represents matrix transposition, and $\otimes$ represents element-wise multiplication. $F_{l-1}$ represents the feature output by the $(l-1)$th Seg_Attention block and $F_{l}$ represents the feature output by the $l$th Seg_Attention block.
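A rough PyTorch sketch of Eqs. (2)-(3) follows, under the assumption that the patches have already been flattened into N tokens of dimension d; the module names and sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GraphAttentionSketch(nn.Module):
    """Eqs. (2)-(3): one FC extracts features, two FCs build the N x N
    attention matrix, and the attended features modulate F_l before a
    residual connection with F_{l-1}."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)     # FC in Eq. (2)
        self.fc1 = nn.Linear(dim, dim)    # FC1
        self.fc2 = nn.Linear(dim, dim)    # FC2

    def forward(self, alpha, f_l, f_prev):
        # alpha, f_l, f_prev: (B, N, dim) patch-embedded features
        feat = self.fc(alpha)
        attn = torch.softmax(self.fc1(feat) @ self.fc2(feat).transpose(1, 2), dim=-1)
        beta = attn @ feat                         # Eq. (2)
        return beta * f_l + f_prev                 # Eq. (3): element-wise mul, then add

gat = GraphAttentionSketch(dim=128)
out = gat(torch.randn(2, 64, 128), torch.randn(2, 64, 128), torch.randn(2, 64, 128))
```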

Figure 4. Graph attention module.


3.3. Graph convolution segmentation network

For semantic image synthesis tasks, the role of the discriminator is to distinguish between the synthesised images and the ground-truth images. Classification-based discriminators are commonly used in image synthesis. However, classification-based discriminators ignore the relationships between the objects in the image, and insufficient learning of these relationships can easily lead to blurred object boundaries in the synthesised images.

Therefore, this paper proposes a segmentation network based on graph convolution as a new discriminator, called GSeg. The role of GSeg is to align the category information of the synthesised images with the semantic layouts. SESAME is also used as a discriminator, so our method includes two different discriminators: GSeg and SESAME. The architecture of GSeg, which uses an encoder-decoder structure, is shown in Figure 5.

Figure 5. Graph convolution segmentation network.


As shown in Figure 5, the encoder consists of graph convolutional modules from Vision GNN (Han et al., Citation2022), and the decoder consists of convolutional modules. Compared with ordinary convolution modules, graph convolution modules can better learn the relationships between objects.
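For illustration, here is a simplified PyTorch sketch of such an encoder-decoder segmentation discriminator. The graph-convolution block below is a crude stand-in for the Vision GNN grapher (a k-nearest-neighbour graph with max-relative aggregation); the channel sizes, patch size and layer counts are assumptions, not the paper's GSeg.

```python
import torch
import torch.nn as nn

class SimpleGrapher(nn.Module):
    """Simplified graph-convolution block: each patch token is connected to
    its k nearest neighbours in feature space and updated with a
    max-relative aggregation."""
    def __init__(self, dim, k=9):
        super().__init__()
        self.k = k
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        dist = torch.cdist(x, x)                            # pairwise distances
        idx = dist.topk(self.k, largest=False).indices      # (B, N, k)
        nbrs = torch.gather(x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
                            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))
        rel = (nbrs - x.unsqueeze(2)).amax(dim=2)           # max-relative feature
        return x + self.update(torch.cat([x, rel], dim=-1))

class GSegSketch(nn.Module):
    """Encoder-decoder segmentation discriminator sketch: patch embedding,
    graph-convolution encoder, convolutional decoder producing per-pixel
    class logits (hypothetical layer sizes)."""
    def __init__(self, num_classes, dim=96, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.encoder = nn.Sequential(SimpleGrapher(dim), SimpleGrapher(dim))
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=patch, mode='bilinear', align_corners=False),
            nn.Conv2d(dim, num_classes, 3, padding=1))

    def forward(self, img):
        f = self.patch_embed(img)                           # (B, dim, H/p, W/p)
        B, C, H, W = f.shape
        tokens = self.encoder(f.flatten(2).transpose(1, 2)) # (B, N, dim)
        f = tokens.transpose(1, 2).reshape(B, C, H, W)
        return self.decoder(f)                              # (B, classes, H_img, W_img)

logits = GSegSketch(num_classes=35)(torch.randn(1, 3, 128, 256))
```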

3.4. Optimisation objective

In our method, the adversarial loss $L_{adv}$, the feature matching loss $L_{feat}$, the perceptual loss $L_{perc}$ and the semantic alignment loss $L_{seg}$ are used to control the quality of the synthesised images.

Adversarial loss: In GANs, the adversarial loss is very effective for image fidelity and has achieved good results in many image synthesis works. The adversarial loss can be defined as:
(4) $L_{adv}^{D}=-\mathbb{E}_{(I_R,S_{hot})}\left[\min(0,-1+D(I_R,S_{hot}))\right]-\mathbb{E}_{(z,S_{hot},S_{RGB})}\left[\min(0,-1-D(G(z,S_{hot},E(S_{RGB})),S_{hot}))\right],$
(5) $L_{adv}^{G}=-\mathbb{E}_{(z,S_{hot},S_{RGB})}\left[D(G(z,S_{hot},E(S_{RGB})),S_{hot})\right],$
where $I_R$ denotes the real image, $z$ denotes the noise, $S_{hot}$ denotes the one-hot encoded semantic layout, $S_{RGB}$ denotes the RGB semantic layout, $G$ represents the generator, $D$ represents the discriminator, and $E$ represents the RGB semantic encoder.
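Assuming Eqs. (4)-(5) follow the standard hinge formulation, a minimal PyTorch sketch is:

```python
import torch
import torch.nn.functional as F

def d_adv_loss(d_real, d_fake):
    """Hinge discriminator loss of Eq. (4); d_real = D(I_R, S_hot) and
    d_fake = D(G(z, S_hot, E(S_RGB)), S_hot) are raw (un-sigmoided) scores."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_adv_loss(d_fake):
    """Generator adversarial loss of Eq. (5)."""
    return -d_fake.mean()

loss_d = d_adv_loss(torch.randn(4, 1, 16, 16), torch.randn(4, 1, 16, 16))
```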

Feature matching loss: Following (T.-C. Wang et al., Citation2018), the discriminator outputs multiple sets of feature maps, and during training the L1 loss is used to constrain the feature maps at different scales. Its calculation is shown in formula (6):
(6) $L_{feat}=\sum_{i=1}^{n}\frac{1}{N_i}\left\|D_i(I_R,S_{hot})-D_i(G(z,S_{hot},E(S_{RGB})),S_{hot})\right\|_1,$
where $N_i$ denotes the number of features in $D_i(I_R,S_{hot})$.
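A sketch of Eq. (6), assuming the discriminator returns its intermediate feature maps as a list:

```python
import torch.nn.functional as F

def feature_matching_loss(feats_real, feats_fake):
    """Eq. (6): per-layer mean L1 distance between discriminator features of
    the real and generated image, summed over layers. Both arguments are
    lists of tensors D_i(.) with matching shapes."""
    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        loss = loss + F.l1_loss(ff, fr.detach())   # detach: real-branch features are targets
    return loss
```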

Perceptual loss: In this paper, a pre-trained VGG (Qi et al., Citation2018) is used to extract the features of the real images and the synthesised images. The perceptual loss in the multi-scale feature space is shown in formula (7):
(7) $L_{perc}=\sum_{k=1}^{K}\left\|\Phi_k(I_F)-\Phi_k(I_R)\right\|_1,$
where $I_F$ represents the generated image, $\Phi$ represents the VGG model, and $\Phi_k$ represents the feature map of the $k$th layer in the VGG model.
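A sketch of Eq. (7) using torchvision's pre-trained VGG19; the specific layers compared are an assumption, since the paper does not list them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """Eq. (7): L1 distance between VGG features of the generated and real
    images, summed over a few layers (layer choice assumed here)."""
    def __init__(self, layer_ids=(1, 6, 11, 20, 29)):   # roughly relu1_1..relu5_1
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, fake, real):
        loss, x_f, x_r = 0.0, fake, real
        for i, layer in enumerate(self.vgg):
            x_f, x_r = layer(x_f), layer(x_r)
            if i in self.layer_ids:
                loss = loss + F.l1_loss(x_f, x_r)
        return loss
```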

Semantic alignment loss: To constrain the semantic alignment between the synthesised images and the corresponding semantic layouts, our method employs the semantic alignment loss. It can be expressed as:
(8) $L_{seg}=-\sum_{i=1}^{C}w_i\sum_{j=1}^{H}\sum_{k=1}^{W}S_{i,j,k}\left[\log \mathrm{Seg}(I_R)_{i,j,k}+\log \mathrm{Seg}(I_F)_{i,j,k}\right],$
(9) $w_i=\frac{H\times W}{\sum_{j=1}^{H}\sum_{k=1}^{W}S_{i,j,k}},$
where the ground-truth label image $S$ has three dimensions, the last two of which denote spatial locations, namely $(j,k)\in H\times W$, $w_i$ is the balance weight of each class in the one-hot semantic layout, and $\mathrm{Seg}$ represents the graph convolutional segmentation network.
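A sketch of Eqs. (8)-(9), assuming Seg outputs per-pixel class probabilities via a softmax and that the loss is averaged over the batch (a normalisation choice the paper does not state):

```python
import torch

def semantic_alignment_loss(seg_logits_real, seg_logits_fake, s_onehot, eps=1e-6):
    """Class-balanced alignment loss of Eqs. (8)-(9).
    seg_logits_*: (B, C, H, W) raw GSeg outputs; s_onehot: (B, C, H, W) layout."""
    B, C, H, W = s_onehot.shape
    # Eq. (9): w_i = H*W / sum_{j,k} S_{i,j,k}, here pooled over the batch.
    w = (B * H * W) / (s_onehot.sum(dim=(0, 2, 3)) + eps)                # (C,)
    log_real = torch.log_softmax(seg_logits_real, dim=1)
    log_fake = torch.log_softmax(seg_logits_fake, dim=1)
    per_class = -(s_onehot * (log_real + log_fake)).sum(dim=(0, 2, 3))   # (C,)
    return (w * per_class).sum() / B                                     # batch mean (assumption)
```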

The weighted sum of these loss functions is shown in formula (10):
(10) $L=\lambda_{adv}(L_{adv}^{G}+L_{adv}^{D})+\lambda_{feat}L_{feat}+\lambda_{perc}L_{perc}+\lambda_{seg}L_{seg},$
where $\lambda_{adv}$, $\lambda_{feat}$, $\lambda_{perc}$ and $\lambda_{seg}$ are the corresponding weight parameters.
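Combining the sketches above, the overall objective of Eq. (10) is simply a weighted sum (the default weights shown are those reported in Section 4.1):

```python
def total_loss(l_adv_g, l_adv_d, l_feat, l_perc, l_seg,
               lam_adv=1.0, lam_feat=10.0, lam_perc=10.0, lam_seg=1.0):
    """Eq. (10) with the weights reported in Section 4.1."""
    return (lam_adv * (l_adv_g + l_adv_d) + lam_feat * l_feat
            + lam_perc * l_perc + lam_seg * l_seg)
```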

4. Experiment

Experiments are conducted to evaluate the performance of the proposed approach for image synthesis on various benchmark datasets. We compare qualitative and quantitative results with several competing methods, including CRN (Q. Chen & Koltun, Citation2017), Pix2PixHD (T.-C. Wang et al., Citation2018), SPADE (Park et al., Citation2019), CC-FPSE (Liu et al., Citation2019), SCGAN (Y. Wang et al., Citation2021), SDM (W. Wang et al., Citation2022) and OASIS (Schonfeld et al., Citation2021). In addition, multiple sets of ablation experiments are used to verify the benefit of each module.

4.1. Dataset and experiment details

Dataset: This paper conducts experiments on three challenging datasets, namely Cityscapes (Cordts et al., Citation2016), ADE20K and CelebAMask-HQ (Lee et al., Citation2020). The Cityscapes dataset contains a variety of urban street scenes with 35 semantic classes; 3000 images are used for training and 500 images for validation, and the image resolution is set to 256 × 128. The CelebAMask-HQ dataset is a high-resolution face dataset with fine-grained mask annotations, containing 19 semantic classes; the image resolution is set to 128 × 128. The ADE20K dataset is a large, densely annotated dataset containing 150 semantic classes, with 20,210 images for training and 2000 images for validation; the image resolution is set to 128 × 128.

Experimental details: Our method uses the ADAM optimiser; the learning rate of the generator is set to $1\times 10^{-4}$, the learning rate of the discriminator is set to $4\times 10^{-4}$, and the RGB semantic encoder is trained together with the generator. $\lambda_{adv}$, $\lambda_{feat}$, $\lambda_{perc}$ and $\lambda_{seg}$ are set to 1, 10, 10 and 1, respectively. In the first half of the optimisation process, only the real images are fed into GSeg; when the epoch count reaches half of the maximum value, both the real images and the synthesised images are fed into GSeg. We perform 150 epochs of training on the Cityscapes and ADE20K datasets and 150 epochs on the CelebAMask-HQ dataset. All experiments were performed on an NVIDIA 2080 Ti GPU.
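As a sketch of this two-time-scale setup (assuming PyTorch; the modules below are placeholders, and the Adam betas are left at the library defaults because the paper does not report them):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual generator, RGB semantic
# encoder and the two discriminators (SESAME + GSeg).
generator, rgb_encoder = nn.Linear(8, 8), nn.Linear(8, 8)
discriminators = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8)])

# The generator and the RGB semantic encoder share one optimiser (they are
# trained together); the discriminator learning rate is four times larger.
opt_g = torch.optim.Adam(
    list(generator.parameters()) + list(rgb_encoder.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(discriminators.parameters(), lr=4e-4)
```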

Evaluation metrics: This paper uses two metrics to evaluate network performance: Fréchet Inception Distance (FID) (Heusel et al., Citation2017) and mean Intersection-over-Union (mIoU). FID measures the distance between the distribution of the synthesised results and the distribution of the real images, while mIoU evaluates the semantic segmentation accuracy of the synthesised images. Higher mIoU and lower FID indicate a better method. Following previous methods (Liu et al., Citation2019; Park et al., Citation2019), we use the semantic segmentation models DRN-D-105 (Yu et al., Citation2017), UperNet101 (Xiao et al., Citation2018) and Unet (Lee et al., Citation2020; Ronneberger et al., Citation2015) for the semantic segmentation evaluation of Cityscapes, ADE20K and CelebAMask-HQ, respectively.
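For reference, a plain NumPy sketch of how mIoU is computed from predicted and ground-truth label maps (FID is normally computed with an off-the-shelf implementation):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU from a confusion matrix: per-class IoU = TP / (TP + FP + FN),
    averaged over the classes that actually occur. pred and gt are integer
    label maps of equal shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    mask = (gt >= 0) & (gt < num_classes)
    np.add.at(conf, (gt[mask], pred[mask]), 1)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    valid = union > 0
    return (inter[valid] / union[valid]).mean()

print(mean_iou(np.array([[0, 1], [1, 1]]), np.array([[0, 1], [0, 1]]), num_classes=2))
```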

4.2. Comparison with previous methods

In this section, we compare our method with several state-of-the-art semantic image synthesis methods on three datasets.

Quantitative results: The quantitative comparison of the synthesis models on the Cityscapes, ADE20K and CelebAMask-HQ datasets is shown in Table 1. As shown in Table 1, the mIoU of DCSIS is superior to all baseline methods on all datasets. For mIoU, DCSIS achieves 68.8, 47.4 and 77.8 on Cityscapes, ADE20K and CelebAMask-HQ, respectively, giving relative improvements of 1.7, 2.1 and 2.5 over SIMS on the Cityscapes dataset, OASIS on the ADE20K dataset and SIMS on the CelebAMask-HQ dataset. Compared with the segmentation-network discriminator used by OASIS, GSeg has better segmentation performance, which makes the mIoU of DCSIS significantly higher than that of OASIS. For FID, DCSIS achieves 48.4, 31.8 and 17.1 on Cityscapes, ADE20K and CelebAMask-HQ. On the CelebAMask-HQ dataset, DCSIS outperforms all baseline methods on FID; on the other two datasets, the FID of DCSIS is slightly behind that of OASIS, but our method has better segmentation performance.

Table 1. Quantitative comparison with competing methods on different datasets.

The parameter size of DCSIS is 108M. Although DCSIS has 8M more parameters than SPADE, DCSIS gives the relative improvements of (10.2, 6.9), (0.8, 10.2), (0.4, 0.1) and (6.0, 5.7) compared to SPADE on three datasets for FID and mIoU. CRN has 21M fewer parameters than DCSIS. However, FID and mIoU of CRN on each dataset are much lower than DCSIS. DCSIS has 14M more parameters than OASIS. But DCSIS gives the relative improvements of 2.3, 2.1 and 2.7 compared to OASIS on three datasets for mIoU. Overall, DCSIS achieves good image synthesis results with a moderate parameter size.

Qualitative results: The qualitative comparisons of different methods on the CelebAMask-HQ, Cityscapes and ADE20K datasets are given in Figures 6-8, respectively. For all three datasets, the images synthesised by DCSIS not only have much better visual quality but are also closer to the ground-truth images in overall colour distribution. Compared with all baseline methods, our method produces realistic images while respecting the spatial semantic layout, and can generate diverse scenes with high image fidelity. The reason is that DCSIS uses double constraints for finer control over the synthesised images, and GSeg also effectively improves the clarity of object boundaries. The experiments also show that it is difficult to effectively control the details of objects with only a single adaptive normalisation method.

Figure 6. Visual comparison of CelebAMask-HQ dataset.


Figure 7. Visual comparison of Cityscapes dataset.


Figure 8. Visual comparison of ADE20K dataset.


4.3. Ablation experiment

In this section, a set of experiments is conducted to investigate the effect of each component on the performance of DCSIS. We conduct ablation experiments on the Cityscapes and CelebAMask-HQ datasets. In the ablation experiments, the SPADE module is used as the base module of the baseline to verify the effect of each component. The compared variants are: (1) SPADE + SESAME discriminator; (2) SPADE + SESAME + Unet segmentation network; (3) SPADE + SESAME + GSeg; (4) SPADE + SESAME + DCA + GSeg; (5) SPADE + SESAME + DCA + GSeg + GAT (DCSIS). The quantitative and qualitative results are presented in Table 2 and Figure 9, respectively.

Figure 9. Visual comparison of different variants.


Table 2. Quantitative comparison of ablation experiments on Cityscapes and CelebAMask-HQ datasets.

As shown in Table 2, DCA, GSeg and GAT each have a significant impact on the performance of DCSIS. Among all the architectures mentioned above, DCSIS obtains the best results. FID and mIoU both improve when a segmentation network is used as the discriminator. Compared with Unet, GSeg brings relative improvements of (0.3, 1.3) on the Cityscapes dataset and (1.3, 1.9) on the CelebAMask-HQ dataset in FID and mIoU. This shows that the segmentation performance of GSeg is better than that of Unet.

When DCA is introduced, it brings relative improvements of (2.0, 2.3) on the Cityscapes dataset and (0.4, 2.1) on the CelebAMask-HQ dataset in FID and mIoU. Compared with a single constraint, using two different forms of constraints to control the synthesis process is beneficial to the quality of the synthesised images.

GAT brings further improvements in FID and mIoU: relative improvements of (0.3, 0.8) on the Cityscapes dataset and (0.4, 0.4) on the CelebAMask-HQ dataset, respectively.

Overall, DCA has a greater impact on the quality of the synthesised images than GSeg and GAT. These components effectively improve the performance of DCSIS.

From Figure 9, it can be seen that the visual quality of the images synthesised by DCSIS is better than that of the other variants. For the CelebAMask-HQ dataset, human skin tones in the images synthesised by DCSIS appear more realistic than in the other methods. For the Cityscapes dataset, DCA makes the boundary information between categories clearer. Overall, our method makes texture details in the images appear natural and more realistic. For all datasets, the results suggest that DCA, GSeg and GAT effectively improve the visual quality of the synthesised images, showing that all components are useful for the final results and further improve the performance of DCSIS.

5. Conclusions

This paper presents a novel image synthesis approach, DCSIS, in which DCA, GSeg and GAT are used to enhance the information of the semantic layouts and improve the results of image synthesis. In DCSIS, DCA uses double constraints to control image synthesis: in addition to the adaptive normalisation, which serves as the first constraint, we propose a semantic encoder that provides the second constraint. The proposed semantic encoder uses the RGB encoded semantic layouts as the input. DCA effectively improves the quality of the synthesised images. The generator learns the relationships between different categories using GAT to further improve the quality of the synthesised images. The proposed GSeg is used as a discriminator of DCSIS to align semantic information and establish relationships between objects.

Experiments are conducted on three benchmark datasets to evaluate the performance of the presented approach. The experimental results indicate that DCSIS can generate higher-quality photorealistic images and obtain better quantitative results. Comparisons with state-of-the-art baseline methods demonstrate that the new method is more effective and efficient in terms of both qualitative and quantitative results in most cases.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work described in this article is supported by the Hubei Province University Student Innovation and Entrepreneurship Training, Hubei University of Technology Graduate Research Innovation Project [grant number 4306.22019]. The work described in this paper was also supported by the National Natural Science Foundation of China [grant number 61300127].

References

  • Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019. 7th International Conference on Learning Representations, ICLR 2019, May 6, 2019 to May 9, 2019, New Orleans, LA, USA.
  • Chen, D., Hua, G., Liao, J., Yuan, L., Chai, M., He, M., Yu, N., Chu, Q., & Tan, Z. (2020). Efficient semantic image synthesis via class-adaptive normalization.
  • Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision. 16th IEEE International Conference on Computer Vision, ICCV 2017, October 22, 2017 to October 29, 2017, Venice, Italy.
  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, June 26, 2016 to July 1, 2016, Las Vegas, NV, USA.
  • Han, K., Wang, Y., Guo, J., Tang, Y., & Wu, E. (2022). Vision GNN: An image is worth graph of nodes. In Advances in neural information processing systems. 36th Conference on Neural Information Processing Systems, NeurIPS 2022, November 28, 2022 to December 9, 2022, New Orleans, LA, USA.
  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems. 31st Annual Conference on Neural Information Processing Systems, NIPS 2017, December 4, 2017 to December 9, 2017, Long Beach, CA, USA.
  • Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings – 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, July 21, 2017 to July 26, 2017, Honolulu, HI, USA.
  • Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, June 16, 2019 to June 20, 2019, Long Beach, CA, USA.
  • Karras, T., Laine, S., Aittala, M., Hellsten, J., & Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Kim, J., Kim, M., Kang, H., & Lee, K. (2019). U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation.
  • Lee, C.-H., Liu, Z., Wu, L., & Luo, P. (2020). MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, June 14, 2020 to June 19, 2020, Virtual, Online, USA.
  • Liu, X., Shao, J., Yin, G., Wang, X., & Li, H. (2019). Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Advances in neural information processing systems. 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019, December 8, 2019 to December 14, 2019, Vancouver, BC, Canada.
  • Lv, Z., Li, X., Niu, Z., Cao, B., & Zuo, W. (2022). Semantic-shape adaptive feature modulation for semantic image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2021). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv.
  • Ntavelis, E., Romero, A., Kastanis, I., Van Gool, L., & Timofte, R. (2020). SESAME: Semantic editing of scenes by adding, manipulating or erasing objects. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 16th European Conference on Computer Vision, ECCV 2020, August 23, 2020 to August 28, 2020, Glasgow, UK.
  • Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019, June 16–20). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA.
  • Qi, X., Chen, Q., Jia, J., & Koltun, V. (2018). Semi-parametric image synthesis. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, June 18, 2018 to June 22, 2018, Salt Lake City, UT, USA.
  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv.
  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI 2015, October 5, 2015 to October 9, 2015, Munich, Germany.
  • Schonfeld, E., Sushko, V., Zhang, D., Gall, J., Schiele, B., & Khoreva, A. (2021). You only need adversarial supervision for semantic image synthesis. In ICLR 2021 – 9th International Conference on Learning Representations. 9th International Conference on Learning Representations, ICLR 2021, May 3, 2021 to May 7, 2021, Virtual, Online.
  • Tang, H., Bai, S., & Sebe, N. (2020). Dual attention GANs for semantic image synthesis. In Proceedings of the 28th ACM International Conference on Multimedia.
  • Tang, H., Qi, X., Sun, G., Xu, D., Sebe, N., Timofte, R., & Van Gool, L. (2020). Edge guided GANs with contrastive learning for semantic image synthesis. arXiv. https://doi.org/10.48550/arXiv.2003.13898
  • Tang, H., Xu, D., Sebe, N., Wang, Y., Corso, J. J., & Yan, Y. (2019). Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Tang, H., Xu, D., Yan, Y., Torr, P. H. S., & Sebe, N. (2020). Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, June 14, 2020 to June 19, 2020, Virtual, Online, USA.
  • Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., & Hu, Q. (2020). ECA-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, June 18, 2018 to June 22, 2018, Salt Lake City, UT, USA.
  • Wang, W., Bao, J., Zhou, W., Chen, D., Chen, D., Yuan, L., & Li, H. (2022). Semantic image synthesis via diffusion models. arXiv. https://doi.org/10.48550/arXiv.2207.00050
  • Wang, Y., Qi, L., Chen, Y.-C., Zhang, X., & Jia, J. (2021). Image synthesis via semantic composition. In Proceedings of the IEEE International Conference on Computer Vision. 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021, October 11, 2021 to October 17, 2021, Virtual, Online, Canada.
  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). 15th European Conference on Computer Vision, ECCV 2018, September 8, 2018 to September 14, 2018, Munich, Germany.
  • Yu, F., Koltun, V., & Funkhouser, T. (2017). Dilated residual networks. In Proceedings – 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, July 21, 2017 to July 26, 2017, Honolulu, HI, USA.
  • Zhu, P., Abdal, R., Qin, Y., & Wonka, P. (2020). SEAN: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, June 14, 2020 to June 19, 2020, Virtual, Online, USA.