
Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Pages 177-189 | Received 25 Mar 2022, Accepted 26 Nov 2022, Published online: 06 Dec 2022

