
Context-aware and co-attention network based image captioning model

Pages 244–256 | Received 25 Mar 2022, Accepted 08 Feb 2023, Published online: 22 Feb 2023

References

  • Wang T, Zhang R, Lu Z, et al. End-to-end dense video captioning with parallel decoding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. p. 6847–6857.
  • Deng C, Chen S, Chen D, et al. Sketch, ground, and refine: top-down dense video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021. p. 234–243.
  • Sharma H, Srivastava S. Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism. Imaging Sci J. 2022:1–13.
  • Sharma H, Jalal AS. Visual question answering model based on graph neural network and contextual attention. Image Vis Comput. 2021;110:104165.
  • Sharma H, Jalal AS. Image captioning improved visual question answering. Multimed Tools Appl. 2022;81(24):34775–34796.
  • Zhao R, Shi Z, Zou Z. High-resolution remote sensing image captioning based on structured attention. IEEE Trans Geosci Remote Sens. 2021;60:1–14.
  • Ye X, Wang S, Gu Y, et al. A joint-training two-stage method for remote sensing image captioning. IEEE Trans Geosci Remote Sens. 2022;60:1–16.
  • Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–1780.
  • Cho K, van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Doha, Qatar: Association for Computational Linguistics; 2014. p. 1724–1734.
  • Hendrycks D, Gimpel K. Gaussian error linear units. arXiv preprint arXiv:1606.08415; 2016.
  • Sharma H, Srivastava S. Multilevel attention and relation network-based image captioning model. Multimed Tools Appl. 2022:1–23.
  • Sharma H, Jalal AS. Incorporating external knowledge for image captioning using CNN and LSTM. Mod Phys Lett B. 2020;34(28):2050315.
  • Sharma H, Srivastava S. Graph neural network-based visual relationship and multilevel attention for image captioning. J Electron Imaging. 2022;31(5):053022.
  • Rennie SJ, Marcheret E, Mroueh Y, et al. Self-critical sequence training for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 7008–7024.
  • Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. p. 10685–10694.
  • Hoxha G, Melgani F. A novel SVM-based decoder for remote sensing image captioning. IEEE Trans Geosci Remote Sens. 2021;60:1–14.
  • Al-Malla MA, Jafar A, Ghneim N. Image captioning model using attention and object features to mimic human image understanding. J Big Data. 2022;9(1):1–16.
  • Yao T, Pan Y, Li Y, et al. Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 684–699.
  • Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 6077–6086.
  • Wang C, Gu X. Learning joint relationship attention network for image captioning. Expert Syst Appl. 2023;211:118474.
  • Xiao F, Xue W, Shen Y, et al. A new attention-based LSTM for image captioning. Neural Process Lett. 2022;54(4):3157–3171.
  • Wang C, Gu X. Dynamic-balanced double-attention fusion for image captioning. Eng Appl Artif Intell. 2022;114:105194.
  • Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020. p. 10578–10587.
  • Huang L, Wang W, Chen J, et al. Attention on attention for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. p. 4634–4643.
  • Zhou D, Yang J. Relation network and causal reasoning for image captioning. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management; 2021 Oct. p. 2718–2727.
  • Wang J, Wang W, Wang L, et al. Learning visual relationship and context-aware attention for image captioning. Pattern Recognit. 2020;98:107075.
  • Guo D, Wang Y, Song P, et al. Recurrent relational memory network for unsupervised image captioning. arXiv preprint arXiv:2006.13611; 2020.
  • Yang M, Liu J, Shen Y, et al. An ensemble of generation- and retrieval-based image captioning with dual generator generative adversarial network. IEEE Trans Image Process. 2020;29:9627–9640.
  • Liu J, et al. Interactive dual generative adversarial networks for image captioning. In: Proc. AAAI; 2020. p. 11588–11595.
  • Kim DJ, Oh TH, Choi J, et al. Dense relational image captioning via multi-task triple-stream networks. IEEE Trans Pattern Anal Mach Intell. 2021;44(11):7348–7362.
  • Hu N, Ming Y, Fan C, et al. TSFNet: triple-stream image captioning. IEEE Trans Multimed. 2022:1–14.
  • Wang Y, Xu N, Liu AA, et al. High-order interaction learning for image captioning. IEEE Trans Circuits Syst Video Technol. 2021;32(7):4417–4430.
  • Liu AA, Zhai Y, Xu N, et al. Region-aware image captioning via interaction learning. IEEE Trans Circuits Syst Video Technol. 2021;32(6):3685–3696.
  • Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–1149.
  • He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proc. CVPR; 2016 Jun. p. 770–778.
  • Veličković P, et al. Graph attention networks. In: Proc. Int. Conf. Learn. Representations; 2018. p. 120–132.
  • Kim J-H, et al. Hadamard product for low-rank bilinear pooling. In: Proc. 5th Int. Conf. Learn. Representations; 2017. p. 66–78.
  • Kanai S, Fujiwara Y, Yamanaka Y, et al. Sigsoftmax: reanalysis of the softmax bottleneck. In: Proc. Adv. Neural Inf. Process. Syst.; 2018. p. 284–294.
  • Lin TY, et al. Microsoft COCO: common objects in context. Lect Notes Comput Sci. 2014;8693:740–755.
  • Young P, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguistics. 2014;2:67–78.
  • Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR); 2015 Jun. p. 3128–3137.
  • Papineni K, et al. Bleu: a method for automatic evaluation of machine translation. In: Proc. 40th Annu. Meet. Assoc. for Comput. Linguistics; 2002. p. 311–318.
  • Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proc. ACL Workshop on Intrinsic and Extrinsic Eval. Meas. for Mach. Transl. and/or Summarization; 2005. p. 65–72.
  • Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out. Barcelona, Spain: Association for Computational Linguistics; 2004. p. 74–81.
  • Anderson P, Fernando B, Johnson M, Gould S. Spice: semantic propositional image caption evaluation. In: European conference on computer vision. Cham: Springer; 2016. p. 382–398.
  • Vedantam R, Zitnick CL, Parikh D. CIDEr: consensus-based image description evaluation. In: Proc. IEEE Conf. Comput. Vis. and Pattern Recognit.; 2015. p. 4566–4575.
  • Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980; 2014.
  • Pan Y, Yao T, Li Y, et al. X-linear attention networks for image captioning. In: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.; 2020. p. 10971–10980.
  • Zhang Z, Wu Q, Wang Y, et al. Exploring pairwise relationships adaptively from linguistic context in image captioning. IEEE Trans Multimed. 2021;24:3101–3113.
  • Liu X, Li H, Shao J, et al. Show, tell and discriminate: image captioning by self-retrieval with partially labeled data. In: Proc. Eur. Conf. Comput. Vis. (ECCV); 2018. p. 338–354.
  • Xu K, et al. Show, attend and tell: neural image caption generation with visual attention. In: Proc. 32nd Int. Conf. Mach. Learn. (ICML); 2015. p. 2048–2057.
  • You Q, Jin H, Wang Z, et al. Image captioning with semantic attention. In: Proc. CVPR; 2016 Jun. p. 4651–4659.
  • Chen L, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: Proc. CVPR; 2017 Jul. p. 6298–6306.
  • Lu J, Xiong C, Parikh D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proc. CVPR; 2017 Jul. p. 3242–3250.
  • Ye S, Han J, Liu N. Attentive linear transformation for image captioning. IEEE Trans Image Process. 2018;27(11):5514–5524.
  • Barraco M, Cornia M, Cascianelli S, et al. The unreasonable effectiveness of CLIP features for image captioning: an experimental analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 4662–4670.
  • Li Y, Pan Y, Yao T, et al. Comprehending and ordering semantics for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022. p. 17990–17999.
