Exploring latent weight factors and global information for food-oriented cross-modal retrieval

Wenyu Zhao, School of Computer Science and Technology, Hunan University of Science and Technology, Xiangtan, People’s Republic of China

Dong Zhou, School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, People’s Republic of China (corresponding author)

Buqing Cao, School of Computer Science and Technology, Hunan University of Science and Technology, Xiangtan, People’s Republic of China (corresponding author)

Wei Liang, School of Computer Science and Technology, Hunan University of Science and Technology, Xiangtan, People’s Republic of China

Nitin Sukhija, Department of Computer Science, Slippery Rock University of Pennsylvania, Slippery Rock, PA, USA

Abstract

Food-oriented cross-modal retrieval aims to retrieve relevant recipes given food images, or vice versa. The main challenge is the semantic gap between the two modalities: recipes (text) and food images. Although several studies have sought to bridge this gap, they still suffer from two major limitations: 1) simple embedding concatenation captures only simple interactions between the recipe components, not complex ones; 2) image feature extraction based on convolutional neural networks considers only the local features of an image, ignoring both its global features and the interactions between the extracted features. This paper proposes a novel method based on Latent Component Weight Factors and Global Information (LCWF-GI) to learn robust recipe and image representations for food-oriented cross-modal retrieval. The proposed method uses latent component-specific weight factors to integrate the textual embeddings of the different recipe components into a single compact recipe embedding. A transformer encoder is utilised to capture the intra-modality interactions among the extracted image features and their relative importance, yielding enhanced image representations. Finally, a bi-directional triplet loss is used for retrieval learning. Experimental results on the Recipe1M dataset show that the LCWF-GI method achieves competitive improvements.
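
As a rough illustration of the bi-directional triplet loss mentioned in the abstract, the following minimal sketch penalises a matching image-recipe pair whenever a non-matching item scores within a margin of it, in both the image-to-recipe and recipe-to-image directions. The abstract does not give the exact formulation, so the cosine similarity, the margin value of 0.3, the use of all other items in the batch as negatives, the averaging, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def bidirectional_triplet_loss(image_embs, recipe_embs, margin=0.3):
    """Mean hinge loss over both retrieval directions.

    image_embs[i] and recipe_embs[i] are assumed to be a matching pair;
    every other index in the batch is treated as a negative (an assumption).
    """
    n = len(image_embs)
    total, count = 0.0, 0
    for i in range(n):
        pos = cosine(image_embs[i], recipe_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image-to-recipe direction: anchor image i, negative recipe j.
            neg_recipe = cosine(image_embs[i], recipe_embs[j])
            total += max(0.0, margin - pos + neg_recipe)
            # Recipe-to-image direction: anchor recipe i, negative image j.
            neg_image = cosine(recipe_embs[i], image_embs[j])
            total += max(0.0, margin - pos + neg_image)
            count += 2
    return total / count if count else 0.0


# Toy batch: two perfectly aligned pairs versus the same pairs swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = bidirectional_triplet_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = bidirectional_triplet_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs incur no loss, because every positive similarity exceeds every negative one by more than the margin, while the swapped pairs are penalised in both directions.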

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This work was supported in part by the Hunan Provincial Natural Science Foundation of China [grant no 2022JJ30020], the Guangdong Basic and Applied Basic Research Foundation of China [grant no 2023A1515012718], the Philosophy and Social Sciences 14th Five-Year Plan Project of Guangdong Province [grant no GD23CTS03], the Scientific Research Fund of Hunan Provincial Education Department [grant no 21A0319], and the Hunan Provincial Innovation Foundation for Postgraduate [grant no CX20210986].
