ABSTRACT
With the increase in multimodal data, such as remote sensing images, text, and video, cross-modal retrieval is widely applied in many fields. Because data from different modalities are heterogeneous, the main aim of cross-modal retrieval is to bridge their ‘semantic gap’. However, few cross-modal retrieval studies have focused on remote sensing images and text, and existing methods do not fully exploit the fine-grained information in remote sensing images and text. To improve the retrieval accuracy for remote sensing images and text, we propose a self-attention-based unsupervised deep cross-modal retrieval method that learns a common feature space. First, a self-attention mechanism is applied to capture the fine-grained relationships between word fragments of the text and remote sensing images. Then, a deep network trained with a triplet loss over remote sensing image–text pairs extracts the features of remote sensing images and text and generates a semantically consistent common feature space, in which the feature representations of remote sensing images and text follow a uniform distribution. Experimental results on three benchmark cross-modal datasets, RSICD, UCM_captions, and Sydney_captions, show that the proposed method outperforms state-of-the-art methods. For example, compared with the best baseline (VSE++) on RSICD, the proposed model achieves 2.1% and 5.6% higher precision on the text-to-image and image-to-text retrieval tasks, respectively, and 3.8% higher mean average precision. However, like the other models tested, the proposed model performs worse when retrieving text from an image query than when retrieving images from a text query.
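To make the training objective concrete, the sketch below shows a bidirectional triplet (hinge) ranking loss over image and text embeddings in a shared space, written in PyTorch. It is only an illustrative approximation under assumed conventions: the names `image_emb` and `text_emb`, the margin value, and the sum reduction are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def triplet_ranking_loss(image_emb, text_emb, margin=0.2):
    """Bidirectional triplet ranking loss in a common feature space.

    image_emb, text_emb: (batch, dim) tensors in which row i of each tensor
    corresponds to a matched image-text pair. Names and the margin value are
    illustrative assumptions, not the paper's exact implementation.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    scores = image_emb @ text_emb.t()         # (batch, batch) similarity matrix
    positives = scores.diag().view(-1, 1)     # similarities of matched pairs

    # Hinge costs for mismatched pairs in both retrieval directions.
    cost_img2txt = (margin + scores - positives).clamp(min=0)      # image query, wrong text
    cost_txt2img = (margin + scores - positives.t()).clamp(min=0)  # text query, wrong image

    # Exclude the positive pairs themselves (the diagonal).
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img2txt = cost_img2txt.masked_fill(mask, 0)
    cost_txt2img = cost_txt2img.masked_fill(mask, 0)

    return cost_img2txt.sum() + cost_txt2img.sum()
```

Minimising such a loss pulls matched image and text embeddings together and pushes mismatched ones apart by at least the margin, which is how a semantically consistent common feature space is typically obtained for retrieval.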
Disclosure statement
No potential conflict of interest was reported by the authors.
Data availability statement
Data are available at https://github.com/201528014227051/RSICD_optimal.