Lowering the Barriers for Accessing Distributed Geospatial Big Data to Advance Spatial Data Science: The PolarHub Solution

Pages 773-793 | Received 01 Jul 2016, Accepted 01 Jun 2017, Published online: 14 Nov 2017
 

Abstract

Data is the crux of science. The widespread availability of big data today is of particular importance for fostering new forms of geospatial innovation. This article reports a state-of-the-art solution that addresses a key cyberinfrastructure research problem: providing ready access to big, distributed geospatial data resources on the Web. I first formulate this data access problem and introduce its indispensable elements, including identifying the cyberlocation, space and time coverage, theme, and quality of the data set. I then propose strategies to tackle each data access issue and make the data more discoverable and usable for geospatial data users and decision makers. Among these strategies is large-scale Web crawling as a key technique to support automatic collection of online geospatial data that are highly distributed, intrinsically heterogeneous, and known to be dynamic. To better understand the content and scientific meanings of the data, methods including space–time filtering, ontology-based thematic classification, and service quality evaluation are incorporated. To serve a broad scientific user community, these techniques are integrated into an operational data crawling system, PolarHub, which is also an important cyberinfrastructure building block to support effective data discovery. A series of experiments was conducted to demonstrate the outstanding performance of the PolarHub system. This work may contribute significantly to building the theoretical and methodological foundation for data-driven geography and the emerging spatial data science.

Acknowledgments

Assistance received from Sizhe Wang on data processing is greatly appreciated. The author would also like to thank the editor and anonymous reviewers for their valuable and constructive comments.

Funding

This article draws on work supported in part by the following awards: PLR-1349259; BCS-1455349; and PLR-1504432 from the National Science Foundation and another award from the Open Geospatial Consortium.

Additional information

Notes on contributors

Wenwen Li

WENWEN LI is an Associate Professor in the School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ 85287–5302. Her research interests include methodology development in cyberinfrastructure, spatial data science, geospatial semantics, deep learning, and their applications in polar environmental change, terrain analysis, sustainability, and urban heat island research.
