Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Aiming at the problem of spatial query processing in distributed computing systems, the design and implementation of new distributed spatial query algorithms is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users...

Bibliographic Details
Main Authors:	Moutafis, Panagiotis; Mavrommatis, George; Vassilakopoulos, Michael; Corral Liria, Antonio Leopoldo
Format:	info:eu-repo/semantics/article
Language:	English
Published:	MDPI 2021
Subjects:	big spatial data; spatial query processing; group nearest-neighbor query; Apache Spark; spatial query evaluation
Online Access:	http://hdl.handle.net/10835/13072

collection	DSpace
description	Designing and implementing new distributed algorithms for spatial query processing in distributed computing systems is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, and several performance-improving techniques and pruning heuristics have been proposed; a distributed algorithm in Apache Hadoop was also recently proposed by our team. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, as well as performance-improving techniques that are applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.
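The GKNN query defined in the description can be illustrated with a minimal, centralized brute-force sketch. This is not the distributed Spark algorithm of the paper; it only shows what the query returns. The function name `gknn`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def gknn(query, training, k):
    """Return the k Training points with the smallest sum of
    distances to every Query point (brute force, single machine)."""

    def sum_of_distances(point):
        # Euclidean distance is an assumption; the record only says "distances".
        return sum(math.dist(point, q) for q in query)

    # nsmallest keeps the k best candidates, ordered by ascending sum of distances.
    return heapq.nsmallest(k, training, key=sum_of_distances)


# Hypothetical sample data, not taken from the paper's experiments.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 5.0), (10.0, 10.0), (0.0, 1.0)]
result = gknn(query, training, 2)
print(result)
```

The sketch evaluates every Training point against every Query point, which is the cost that pruning heuristics and distributed processing aim to reduce.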
format	info:eu-repo/semantics/article
id	oai:repositorio.ual.es:10835-13072
institution	Universidad de Cuenca
spelling	oai:repositorio.ual.es:10835-13072; 2023-04-12T19:25:48Z; 2021-11-25T13:41:39Z; 2021-11-25T13:41:39Z; 2021-11-11; info:eu-repo/semantics/article; ISSN 2220-9964; http://hdl.handle.net/10835/13072; DOI 10.3390/ijgi10110763; en; https://www.mdpi.com/2220-9964/10/11/763; Attribution-NonCommercial-NoDerivatives 4.0 International; http://creativecommons.org/licenses/by-nc-nd/4.0/; info:eu-repo/semantics/openAccess; MDPI

Similar Items