Figures & data
Table 1. Limitations of previous works.
Table 2. Hyperparameter values.
Table 3. Ablation tests conducted on the MSCOCO Karpathy test split.
Table 4. Results of our context-aware co-attention based image captioning model and compared models on MSCOCO Karpathy test split with cross-entropy loss.
Table 5. Results of our context-aware co-attention based image captioning model and compared models on MSCOCO Karpathy test split with CIDEr score optimization.
Table 6. Results of our context-aware co-attention based image captioning model and compared models on the Flickr30k dataset.
Table 7. Comparison with the existing models in terms of the number of parameters in million (M), training time on GPUs, Flops, layers, width and MLP on MSCOCO.