
Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Pages 177-189 | Received 25 Mar 2022, Accepted 26 Nov 2022, Published online: 06 Dec 2022

