ABSTRACT
This study presents an asynchronous parallel strategy that coordinates the central processing unit (CPU) and graphics processing unit (GPU) to accelerate neighborhood operations (NOs). Specifically, we propose a data partitioning method called multi-anchor task queuing and a task scheduling method called bi-direction task scheduling, which enable the CPU and GPU to rapidly locate the data blocks they are responsible for and to handle their tasks concurrently via a bi-direction merge. Moreover, we optimize the organization of threads distributed among the CPU and GPU. Experimental results show that when a 1.7 GB raster dataset is processed, the speedup ratio achieved by the proposed parallel algorithm reaches 29.63, which is 19% and 18% higher than those of the GPU and standard asynchronous parallel algorithms, respectively. Additionally, the load balance index is below 0.085, which is significantly better than the value achieved by a conventional algorithm. Thus, the strategy achieves a higher speedup ratio and a more adaptable load balance, thereby accelerating NOs more efficiently. Further, the impacts of the data volume, computational intensity, organization mode of the GPU threads, and granularity of the GPU stream on the parallel efficiency are evaluated and discussed. We also test the efficiency of four other common NOs with our strategy.
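The idea of two processors consuming one block queue from opposite ends can be illustrated with a minimal C++ sketch. It is not the authors' implementation: the abstract does not detail multi-anchor task queuing or the bi-direction merge, so the sketch assumes a single queue with one pair of ends, and the names `BiDirectionQueue`, `Split`, and `simulate` are hypothetical. The two workers are simulated in one thread with fixed per-block costs standing in for CPU and GPU processing times.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical two-ended block queue (illustrative only).
// One worker claims block indices from the front, the other from the back,
// and the queue is exhausted when the two ends meet.
class BiDirectionQueue {
public:
    explicit BiDirectionQueue(std::size_t n_blocks) : front_(0), back_(n_blocks) {}

    // Claim the next block from the front; returns false when no blocks remain.
    bool claim_front(std::size_t& idx) {
        std::lock_guard<std::mutex> lock(m_);
        if (front_ >= back_) return false;
        idx = front_++;
        return true;
    }

    // Claim the next block from the back; returns false when no blocks remain.
    bool claim_back(std::size_t& idx) {
        std::lock_guard<std::mutex> lock(m_);
        if (front_ >= back_) return false;
        idx = --back_;
        return true;
    }

private:
    std::mutex m_;        // guards both ends so the class is safe for real threads
    std::size_t front_;   // next unclaimed index at the front
    std::size_t back_;    // one past the last unclaimed index at the back
};

struct Split {
    std::size_t front_blocks;  // blocks taken by the front worker (assumed slower, e.g. CPU)
    std::size_t back_blocks;   // blocks taken by the back worker (assumed faster, e.g. GPU)
};

// Single-threaded simulation: whichever worker has the smaller accumulated
// time claims the next block from its own end (ties go to the front worker).
// cost_front and cost_back are assumed, fixed per-block processing times.
Split simulate(std::size_t n_blocks, unsigned cost_front, unsigned cost_back) {
    BiDirectionQueue queue(n_blocks);
    Split result{0, 0};
    unsigned long t_front = 0, t_back = 0;
    std::size_t idx = 0;
    for (;;) {
        if (t_front <= t_back) {
            if (!queue.claim_front(idx)) break;  // ends have met
            t_front += cost_front;
            ++result.front_blocks;
        } else {
            if (!queue.claim_back(idx)) break;   // ends have met
            t_back += cost_back;
            ++result.back_blocks;
        }
    }
    return result;
}
```

Under these assumptions, `simulate(10, 4, 1)` assigns 2 blocks to the front worker and 8 to the back worker: the point where the two ends meet shifts toward the slower worker without any fixed, precomputed split of the data.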
Acknowledgments
The authors sincerely thank the anonymous reviewers and editors for their valuable feedback and constructive comments, which greatly contributed to improving this paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).
CRediT authorship contribution statement
Zhixin Yu: Conceptualization, Methodology, Software, Visualization, Writing – original draft.
Chen Zhou: Conceptualization, Data Curation, Supervision, Validation, Writing – review & editing.
Manchun Li: Supervision, Writing – review & editing.
Data availability statement
The computer code and sample dataset that support the findings of this study are available at https://www.doi.org/10.17605/OSF.IO/AG3QC. The code was developed in C++. A multi-core CPU and a CUDA-enabled GPU are required. We recommend running the code with OpenMP 2.0, CUDA 11.2, and GDAL 3.2.0 or later.