Research Article

Multi-granularity vision transformer via semantic token for hyperspectral image classification


ABSTRACT

Convolutional neural networks (CNNs) model local context well, and CNN-based methods have greatly improved hyperspectral image (HSI) classification. However, most of these methods suffer from a restricted receptive field and poor performance in the continuous data domain. To address these issues, we propose a multi-granularity vision transformer via semantic token (MSTViT) for HSI classification, which differs from existing transformer approaches by modelling the HSI classification task as a word embedding problem. Specifically, the MSTViT model extracts multi-level semantic features with a ladder feature extractor and applies a multi-granularity patch embedding module to embed these features simultaneously as tokens of different scales. The tokens of different granularity are then fed to the vision transformer to capture long-distance dependencies among them. A depth-wise separable convolution multi-layer perceptron assists the attention mechanism in further mining the deep information of the HSI. Finally, the coarse- and fine-granularity representations are fused to generate stronger features, which improves classification performance. Experimental results on four standard datasets verify the marked improvement of MSTViT over state-of-the-art CNN and transformer structures. The code for this work is available at https://github.com/zhaolin6/MSTViT for reproducibility.
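
As a rough illustration of what embedding the same features "as different-scale tokens" can mean, the following minimal Python sketch splits a small feature cube into non-overlapping patches at two patch sizes. It is not the authors' implementation: the helper names `patch_tokens` and `multi_granularity_tokens`, the patch sizes 1 and 2, and the toy 4 x 4 x 3 cube are all assumptions made for this example, and the learned projection that a real patch embedding module would apply is omitted.

```python
from typing import Dict, List, Sequence

# A feature cube indexed as cube[row][col][band].
Cube = List[List[List[float]]]


def patch_tokens(cube: Cube, patch: int) -> List[List[float]]:
    """Split an H x W x C cube into non-overlapping patch x patch tokens.

    Each token is the flattened patch, so its length is patch * patch * C.
    Hypothetical helper: a real model would follow this with a learned
    linear projection, which is left out here.
    """
    height, width = len(cube), len(cube[0])
    if height % patch or width % patch:
        raise ValueError("patch size must divide both spatial dimensions")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token: List[float] = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    token.extend(cube[r][c])
            tokens.append(token)
    return tokens


def multi_granularity_tokens(
    cube: Cube, patch_sizes: Sequence[int] = (1, 2)
) -> Dict[int, List[List[float]]]:
    """Tokenise the same cube once per patch size (assumed granularities)."""
    return {p: patch_tokens(cube, p) for p in patch_sizes}


# Toy 4 x 4 spatial neighbourhood with 3 bands per pixel.
cube = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
tokens = multi_granularity_tokens(cube, patch_sizes=(1, 2))

for size, toks in tokens.items():
    print(f"patch {size}: {len(toks)} tokens of length {len(toks[0])}")
```

In this toy setting the fine granularity yields many short tokens and the coarse granularity yields fewer, longer ones. Both sequences describe the same region, which is what makes a later fusion of coarse and fine representations possible.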

Acknowledgment

We would like to take this opportunity to thank the editor and the anonymous reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of the article. We would also like to thank Dr. John Olaghere of Hunan Institute of Science and Technology and Prof. Xin-Hua Hu of East Carolina University for their help in reviewing this article.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

The data are available at https://github.com/zhaolin6/MSTViT.

Additional information

Funding

This work was supported in part by the Natural Science Foundation of Hunan Province of China under Grant 2020JJ4343; in part by the Scientific Research Project of the Hunan Provincial Education Department under Grant 19A201, Grant 19A200, Grant 20A214, and Grant 20A223; in part by the Graduate Research and Innovation Project of Hunan Province under CX20211186; and in part by the Graduate Research and Innovation Project of Hunan Institute of Science and Technology under YCX2021A09.
