ABSTRACT
With the increase in multimodal data, such as remote sensing images, text, and video, cross-modal retrieval is widely applied in many fields. Because data from different modalities are heterogeneous, the main aim of cross-modal retrieval is to bridge their ‘semantic gap’. However, few cross-modal retrieval studies have focused on remote sensing images and text, and existing methods do not fully exploit the fine-grained information in remote sensing images and text. To improve the retrieval accuracy for remote sensing images and text, we propose a self-attention-based unsupervised deep cross-modal retrieval method that learns a common feature space. First, a self-attention mechanism is applied to capture the fine-grained relationships between word fragments of the text and remote sensing images. Then, a deep network trained with a triplet loss over remote sensing image–text pairs extracts the features of remote sensing images and text and generates a semantically consistent common feature space, in which the feature representations of remote sensing images and text follow a uniform distribution. Experimental results on three benchmark cross-modal datasets, RSICD, UCM_captions, and Sydney_captions, show that the proposed method outperforms state-of-the-art methods. For example, compared with the best baseline (VSE++) on RSICD, the proposed model achieves 2.1% and 5.6% higher precision on the text-to-image and image-to-text retrieval tasks, respectively, and 3.8% higher mean average precision. However, like the other models tested, the proposed model performs worse when retrieving text from an image query than when retrieving images from a text query.
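To make the training objective concrete, the sketch below shows a bidirectional triplet (hinge) ranking loss over image and text embeddings in a shared space, written in PyTorch. It is only an illustrative approximation under assumed conventions: the names `image_emb` and `text_emb`, the margin value, and the sum reduction are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def triplet_ranking_loss(image_emb, text_emb, margin=0.2):
    """Bidirectional triplet ranking loss in a common feature space.

    image_emb, text_emb: (batch, dim) tensors in which row i of each tensor
    corresponds to a matched image-text pair. Names and the margin value are
    illustrative assumptions, not the paper's exact implementation.
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)

    scores = image_emb @ text_emb.t()         # (batch, batch) similarity matrix
    positives = scores.diag().view(-1, 1)     # similarities of matched pairs

    # Hinge costs for mismatched pairs in both retrieval directions.
    cost_img2txt = (margin + scores - positives).clamp(min=0)      # image query, wrong text
    cost_txt2img = (margin + scores - positives.t()).clamp(min=0)  # text query, wrong image

    # Exclude the positive pairs themselves (the diagonal).
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img2txt = cost_img2txt.masked_fill(mask, 0)
    cost_txt2img = cost_txt2img.masked_fill(mask, 0)

    return cost_img2txt.sum() + cost_txt2img.sum()
```

Minimising such a loss pulls matched image and text embeddings together and pushes mismatched ones apart by at least the margin, which is how a semantically consistent common feature space is typically obtained for retrieval.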
Disclosure statement
No potential conflict of interest was reported by the authors.
Data availability statement
Data are available at https://github.com/201528014227051/RSICD_optimal.