Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Himanshu Sharma & Swati Srivastava
Pages 177-189 | Received 25 Mar 2022, Accepted 26 Nov 2022, Published online: 06 Dec 2022
 

ABSTRACT

Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image to answer a question about that image. Existing Scene Text VQA models predict an answer by selecting a word either from a fixed vocabulary or from the text tokens extracted from the image. In this paper, we strengthen the representation of the text tokens by combining FastText embeddings with appearance, bounding-box, and PHOC features. Our model employs two-way co-attention, using self-attention and guided-attention mechanisms to obtain discriminative image features. We compute the text token positions and combine this information with the predicted answer embedding for final answer generation. Our model achieves accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, it achieves ANLS scores of 0.698 on the validation set and 0.686 on the test set.
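
For readers unfamiliar with the ANLS (Average Normalized Levenshtein Similarity) metric reported for ST-VQA, the following is a minimal sketch of how it is commonly computed. It is not the authors' evaluation code. The function names, the 0.5 threshold default, and the lowercase/strip normalization are assumptions made for illustration; official evaluation scripts may normalize answers differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Mean over questions of the best similarity against any reference answer.

    predictions:   list of predicted answer strings, one per question
    ground_truths: list of lists of acceptable answers, one list per question
    threshold:     normalized distances at or above this value score 0
                   (0.5 is the commonly used value; assumed here)
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must align")
    if not predictions:
        return 0.0

    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition, an exact match contributes 1, a near miss such as a one-character recognition error contributes partial credit, and an answer that differs from every reference by half or more of its characters contributes 0.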

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Himanshu Sharma

Himanshu Sharma is an Associate Professor in the Department of Computer Engineering and Applications, GLA University, Mathura, India. He received his Ph.D. from GLA University, Mathura, and his M.Tech. from NSIT, New Delhi, India. His research areas are computer vision, image processing, and natural language processing. He has published many research papers in reputed journals and conferences.

Swati Srivastava

Swati Srivastava is an Assistant Professor in the Department of Computer Engineering and Applications, GLA University, Mathura, India. She received her Ph.D. in Computational Intelligence from HBTU Kanpur, India, and her M.Tech. from NIT Allahabad, India. Her research areas include high-dimensional neurocomputing, computational intelligence, machine learning, and computer vision focused on biometrics. She has published many research papers in reputed journals and conferences.
