ABSTRACT
Apache Sedona (formerly GeoSpark) is a new in-memory cluster computing system for processing large-scale spatial data. It extends the core of Apache Spark to support spatial datatypes, partitioning techniques, spatial indexes, and spatial operations (e.g. spatial range, nearest neighbor, and spatial join queries). However, it does not support Distance-based Join Queries (DJQs), such as the k nearest neighbor join query (kNNJQ) or the k closest pairs query (kCPQ). Therefore, in this paper, we investigate how to design and implement efficient distributed DJQ algorithms in Apache Sedona, using the most appropriate spatial partitioning and other optimization techniques. We present the results of an extensive set of experiments with real-world datasets, which demonstrate that the proposed kNNJQ and kCPQ distributed algorithms are efficient, scalable, and robust in Apache Sedona. Finally, Sedona is compared with other similar cluster computing systems; it shows the best performance for kCPQ and competitive results for kNNJQ.
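To make the two query types concrete, the following is a minimal single-machine sketch of their semantics, assuming 2D points and Euclidean distance. It is a brute-force illustration only, not the distributed algorithms proposed in the paper and not part of the Sedona API; the names `dist`, `knn_join`, and `k_closest_pairs` are hypothetical and chosen for this example.

```python
import heapq
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(p: Point, q: Point) -> float:
    """Euclidean distance between two 2D points (assumed metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def knn_join(P: List[Point], Q: List[Point], k: int) -> List[Tuple[Point, List[Point]]]:
    """kNNJQ semantics: for every point p in P, the k points of Q nearest to p."""
    return [(p, heapq.nsmallest(k, Q, key=lambda q: dist(p, q))) for p in P]


def k_closest_pairs(P: List[Point], Q: List[Point], k: int) -> List[Tuple[float, Point, Point]]:
    """kCPQ semantics: the k pairs (p, q) from P x Q with the smallest distances."""
    pairs = ((dist(p, q), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs, key=lambda t: t[0])


# Hypothetical toy data, used only to illustrate the two definitions.
P = [(0.0, 0.0), (10.0, 10.0)]
Q = [(1.0, 0.0), (0.0, 3.0), (10.0, 12.0), (20.0, 20.0)]

knn_result = knn_join(P, Q, 2)
kcp_result = k_closest_pairs(P, Q, 2)
```

Both functions examine all |P| x |Q| pairs, which is impractical at scale; this cost is what spatial partitioning and pruning in a distributed setting aim to reduce. Note the asymmetry: kNNJQ returns k neighbors for each point of P, whereas kCPQ returns k pairs in total.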
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 Available at https://spark.apache.org/
2 Available at https://sedona.apache.org/download/
3 Available at https://github.com/purduedb/LocationSpark
4 Available at http://www.cs.utah.edu/~dongx/simba/
5 Available at https://github.com/acgtic211/LocationSpark/tree/DJQ
8 Available at https://github.com/acgtic211/incubator-sedona/tree/KNNJ
9 Available at https://github.com/acgtic211/incubator-sedona/tree/KCP
10 Available at http://spatialhadoop.cs.umn.edu/datasets.html
11 Available at https://github.com/apache/incubator-sedona
12 Available at https://github.com/purduedb/LocationSpark
13 Available at https://github.com/locationtech/jts