Context-aware and co-attention network based image captioning model

Pages 244-256 | Received 25 Mar 2022, Accepted 08 Feb 2023, Published online: 22 Feb 2023

Figures & data

Figure 1. Intuition of our context-aware co-attention-based image captioning model. It consists of relation-aware attention, region-aware attention, a transformer-based co-attention module, and GRU-GELU-based language decoders. The relation-aware attention module generates interactive words based on the relationships between objects. The region-aware attention module focuses on related image regions to generate entity words. The transformer-based co-attention module captures the intra-modal and inter-modal interactions between image regions and objects.

Table 1. Limitations of previous works.

Figure 2. An overview of our context-aware co-attention network-based image captioning model which consists of region-aware attention, relation-aware attention, GRU-GELU decoder pair, and transformer-based co-attention network modules.
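To make the pipeline in Figure 2 concrete, the following is a minimal PyTorch sketch of one decoding step: the decoder attends to region features (region-aware attention, supplying evidence for entity words) and to relationship features (relation-aware attention, supplying evidence for interactive words) before predicting the next word, with a GELU in the output MLP. All module names (AdditiveAttention, GRUGELUDecoder), dimensions, and the way the two context vectors are fused are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of one decoding step implied by Figure 2.
# Dimensions, names, and the fusion of the two attention streams are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Generic additive attention over a set of feature vectors."""
    def __init__(self, feat_dim, hid_dim, att_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, N, feat_dim), hidden: (B, hid_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)               # (B, N, 1) attention weights
        return (alpha * feats).sum(dim=1), alpha  # context vector and weights


class GRUGELUDecoder(nn.Module):
    """One decoding step: attend to regions and relations, then predict a word."""
    def __init__(self, vocab_size, feat_dim=2048, rel_dim=2048, emb_dim=512, hid_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.region_att = AdditiveAttention(feat_dim, hid_dim)     # region-aware attention
        self.relation_att = AdditiveAttention(rel_dim, hid_dim)    # relation-aware attention
        self.gru = nn.GRUCell(emb_dim + feat_dim + rel_dim, hid_dim)
        self.fc = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.GELU(),
                                nn.Linear(hid_dim, vocab_size))

    def step(self, word_ids, region_feats, relation_feats, h):
        region_ctx, _ = self.region_att(region_feats, h)       # entity-word evidence
        relation_ctx, _ = self.relation_att(relation_feats, h)  # interaction-word evidence
        x = torch.cat([self.embed(word_ids), region_ctx, relation_ctx], dim=-1)
        h = self.gru(x, h)
        return self.fc(h), h                                    # vocabulary logits, new state
```

In this sketch the GELU sits in the decoder's output MLP; the exact placement of the GELU within the paper's GRU-GELU decoder pair may differ.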

Figure 3. Transformer model-based co-attention network with multi-head attention for relationship features and object features to generate context-aware global features.
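The block below is a minimal sketch of a transformer-style co-attention module in the spirit of Figure 3: multi-head self-attention within each modality (intra-modal), cross-attention between object and relationship features (inter-modal), and pooling into a context-aware global feature. The layer sizes, number of heads, and the mean-pooling fusion are assumptions for illustration, not the paper's implementation.

```python
# Sketch of a co-attention block: intra-modal self-attention plus
# inter-modal cross-attention between object and relationship features.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_obj = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_rel = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_obj = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_rel = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_obj = nn.LayerNorm(d_model)
        self.norm_rel = nn.LayerNorm(d_model)

    def forward(self, obj, rel):
        # obj: (B, N_o, d_model) object/region features
        # rel: (B, N_r, d_model) relationship features
        obj = self.norm_obj(obj + self.self_obj(obj, obj, obj)[0])   # intra-modal
        rel = self.norm_rel(rel + self.self_rel(rel, rel, rel)[0])   # intra-modal
        obj2 = obj + self.cross_obj(obj, rel, rel)[0]                # objects attend to relations
        rel2 = rel + self.cross_rel(rel, obj, obj)[0]                # relations attend to objects
        # Pool both streams into a single context-aware global feature.
        return torch.cat([obj2.mean(dim=1), rel2.mean(dim=1)], dim=-1)
```

Stacking several such blocks, as in a standard transformer encoder, would let the object and relationship streams refine each other over multiple rounds.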

Table 2. Hyperparameter values.

Table 3. Ablation tests conducted on the MSCOCO Karpathy test split.

Table 4. Results of our context-aware co-attention-based image captioning model and compared models on the MSCOCO Karpathy test split with cross-entropy loss.

Table 5. Results of our context-aware co-attention-based image captioning model and compared models on the MSCOCO Karpathy test split with CIDEr score optimization.

Table 6. Results of our context-aware co-attention-based image captioning model and compared models on the Flickr30k dataset.

Table 7. Comparison with existing models on MSCOCO in terms of the number of parameters in millions (M), training time on GPUs, FLOPs, number of layers, width, and MLP size.

Figure 4. Qualitative results of our model and other models on MSCOCO. "Ours" indicates the captions generated by our model. EnsCaption [28] and M2-Transformer [23] are strong comparative models. For each image, we show one interaction word and two entity words. The highest attention weights are shown in red.

Figure 5. Qualitative results of our model and other models on Flickr30k. "Ours" indicates the captions generated by our model. EnsCaption [28] and Mul_Att [11] are strong comparative models. For each image, we show one interaction word and two entity words. The highest attention weights are shown in red.

Figure 6. Visualization of the regions/objects attended by our context-aware co-attention module during the decoding phase. Higher attention weights appear as brighter regions. (Best viewed in colour at 200% zoom.)
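An overlay such as the one in Figure 6 can be produced by scaling image brightness with per-region attention weights. The function below is a hypothetical sketch: the bounding boxes and weights are placeholders and would in practice come from the co-attention module at each decoding step.

```python
# Sketch: render attended regions as brighter areas by scaling brightness
# with per-box attention weights (boxes and weights are placeholders).
import numpy as np
import matplotlib.pyplot as plt


def overlay_attention(image, boxes, weights):
    """image: HxWx3 float array in [0, 1]; boxes: list of (x1, y1, x2, y2); weights: per-box attention."""
    heat = np.zeros(image.shape[:2], dtype=np.float32)
    for (x1, y1, x2, y2), w in zip(boxes, weights):
        heat[y1:y2, x1:x2] = np.maximum(heat[y1:y2, x1:x2], w)
    heat = heat / (heat.max() + 1e-8)
    # Dim low-attention areas so strongly attended regions appear brighter.
    vis = np.clip(image * (0.3 + 0.7 * heat[..., None]), 0, 1)
    plt.imshow(vis)
    plt.axis("off")
    plt.show()
```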

Figure 7. The proposed model generates the same caption for the left and middle images, as the two images are visually similar. For the rightmost image, our model fails to identify the "onion" and "orange" and thus generates an incorrect caption.
