Research Article

High-resolution satellite video single object tracking based on ThickSiam framework

Article: 2163063 | Received 04 Jul 2022, Accepted 21 Dec 2022, Published online: 03 Jan 2023

Figures & data

Figure 1. In high-resolution satellite videos, ground targets exhibit attributes such as SS, PO, PoV, PTBD, SD, and PGFI. These attributes are the main challenges in the current satellite video SOT task and also make natural-scene-based trackers inapplicable to satellite videos.

Figure 2. The overall tracking workflow of ThickSiam. It comprises a TRBS-Net that extracts robust semantic features to obtain the initial tracking results and an RKF module that simultaneously corrects the trajectory and size of the targets. The results of the TRBS-Net and RKF modules are combined by an N-frame-convergence mechanism to produce the final tracking results.
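To make the workflow concrete, the sketch below restates Figure 2 in code. It is a minimal interpretation, not the authors' implementation: `trbs_net` stands for the Siamese TRBS-Net inference step, `rkf` for the RKF module, and the N-frame-convergence mechanism is read here as letting the raw network output drive the first N frames while the filter converges, which is an assumption.

```python
import numpy as np

def thicksiam_track(frames, init_box, trbs_net, rkf, n=30):
    """Hypothetical sketch of the ThickSiam workflow in Figure 2.

    trbs_net(frame, prev_box) -> raw box predicted from semantic features
    rkf.step(raw_box)         -> box with trajectory and size corrected
    The N-frame-convergence mechanism is interpreted as using the raw
    TRBS-Net output for the first n frames while the filter warms up,
    then switching to the corrected output.
    """
    box = np.asarray(init_box, dtype=float)   # (cx, cy, w, h)
    results = [box.copy()]
    for t, frame in enumerate(frames[1:], start=1):
        raw = trbs_net(frame, box)            # initial tracking result
        corrected = rkf.step(raw)             # trajectory/size correction
        box = raw if t < n else corrected     # N-frame-convergence switch
        results.append(np.asarray(box, dtype=float))
    return results
```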

Figure 3. Structure comparison of the original residual block and the proposed TRB. Relative to the original residual block, the modifications of the TRB include doubling the number of channels in the bottleneck and cropping out the outermost elements of the feature map.
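A minimal PyTorch sketch of such a block is given below. The doubled bottleneck width and the border crop are taken from the caption; the bottleneck layout, activations, and the choice to realize the crop via an unpadded 3x3 convolution (with a matching crop on the shortcut) are conventional assumptions, not the paper's exact design.

```python
import torch.nn as nn

class TRB(nn.Module):
    """Sketch of a TRB (Figure 3), assuming a standard bottleneck layout."""

    def __init__(self, channels, expand=2):
        super().__init__()
        mid = channels * expand  # "doubled" width; the reference point is assumed
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=0, bias=False),  # no padding:
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),     # drops 1-px border
            nn.Conv2d(mid, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        identity = x[:, :, 1:-1, 1:-1]  # crop shortcut to match the body output
        return self.relu(out + identity)
```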

Figure 4. Structure comparison of the original down-sampling residual block and the proposed TMRB. Relative to the original down-sampling residual block, the modifications of the TMRB include doubling the number of channels in the bottleneck, cropping out the outermost features, setting the stride of the two convolutional modules above to 1, and adding a max-pooling layer to down-sample the feature map.
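Under the same caveats as the TRB sketch, the down-sampling counterpart can be sketched as follows: stride-1 convolutions plus a max-pooling layer for down-sampling, as the caption describes; the projection shortcut and layer ordering are assumptions.

```python
import torch.nn as nn

class TMRB(nn.Module):
    """Sketch of a TMRB (Figure 4): stride-1 convs, max-pool down-sampling."""

    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = out_ch * expand
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, stride=1, bias=False),  # stride set to 1
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=1, padding=0, bias=False),  # crops border
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.MaxPool2d(2),                                 # down-sampling here
        )
        self.shortcut = nn.Sequential(                       # assumed projection
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.MaxPool2d(2),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        identity = self.shortcut(x[:, :, 1:-1, 1:-1])  # crop to match body output
        return self.relu(out + identity)
```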

Table 1. Detailed information of the feature map at each stage of TRBS-Net. s is the abbreviation of stride. conv1 and bn1 denote the convolutional layer and batch normalization in stage 1, respectively.

Figure 5. The constructed exemplar-search training pairs. An annotation selected from the DIOR (Li et al. 2020) object detection dataset is expanded outward by 1/2 of the sum of its width and height, then scaled to the sizes of the exemplar image and the search image of TRBS-Net.
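The cropping rule in this caption can be sketched as below. The outward expansion by 1/2 of (width + height) is from the caption; the square crop, the 127/255 exemplar/search sizes (the usual SiamFC-family choices), and the helper names are assumptions made for illustration.

```python
import numpy as np
import cv2

def crop_and_resize(image, cx, cy, side, out_size):
    # Square crop of side `side` centered at (cx, cy), resized to out_size;
    # warpAffine handles sub-pixel centers and out-of-frame borders.
    scale = out_size / side
    mat = np.array([[scale, 0, out_size / 2 - cx * scale],
                    [0, scale, out_size / 2 - cy * scale]], dtype=np.float32)
    return cv2.warpAffine(image, mat, (out_size, out_size),
                          borderMode=cv2.BORDER_REPLICATE)

def make_pair(image, box, exemplar_size=127, search_size=255):
    """Sketch of the exemplar-search pair construction in Figure 5.
    `box` is (cx, cy, w, h) from a DIOR annotation."""
    cx, cy, w, h = box
    pad = 0.5 * (w + h)                  # outward expansion from the caption
    exemplar_side = max(w, h) + pad      # assumption: square context crop
    search_side = exemplar_side * search_size / exemplar_size
    z = crop_and_resize(image, cx, cy, exemplar_side, exemplar_size)
    x = crop_and_resize(image, cx, cy, search_side, search_size)
    return z, x
```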

Figure 6. The constructed testing dataset used in the experiments. It contains twelve objects in eight satellite scenarios; the targets consist of airplanes, ships, trains, and vehicles.

Table 2. Detailed descriptions of the constructed testing dataset. px is the abbreviation of pixel. Attributes refer to the difficulties in tracking each object.

Table 3. Experimental results of the ThickSiam framework with different training mechanisms. The baseline method stacks the original residual blocks and down-sampling residual blocks according to the structure of TRBS-Net.

Table 4. Comparative experimental results for different values of N in the N-frame-convergence mechanism. N takes the multiples of 10 from 10 to 100.

Table 5. Comparisons with the state-of-the-art trackers on our constructed testing dataset.

Figure 7. Visualized tracking results of the ThickSiam, SiamFC++ (AlexNet), and SiamFC (AlexNet) trackers with the corresponding GT. The yellow # at the top right of each image indicates the video frame number. The yellow, red, blue, and green bounding boxes represent the results of GT, ThickSiam (ours, TRBS-Net + RKF), SiamFC++ (AlexNet), and SiamFC (AlexNet), respectively.

Figure 8. Tracking results of the ThickSiam tracker in six typical scenarios covering all attributes, including SS, PO, PoV, PTBD, SD, and PGFI.

Figure 9. The field of view of a "Jilin-1" satellite video. The area marked by the red box on the left is enlarged in the upper right corner, and cars on the bridge are further magnified for display. These targets usually occupy only a few to dozens of pixels; this ultra-small size makes their apparent features weak and leaves their boundaries with the background barely visible.

Data availability statement

The testing data in this study are available at https://github.com/CVEO/ThickSiam.