ABSTRACT
Data-driven GIScience shows a growing interest in extracting spatial information from large text data. In this paper, we quantify and thus evaluate the relation between text frequency and properties of the extra-textual geographic setting by comparing the text frequencies of mountain names with the geomorphometric characteristics of the respective mountains. We focus on some 2000 unique mountain names that appear some 50,000 times in a large compilation of texts on Swiss alpine history. The results for the full data set suggest only a weak relation: only 5–10% of the variation in text frequency is explained by the respective geomorphometric characteristics. However, an analysis at multiple scales allows us to identify a Simpson’s Paradox. What appears to be ‘noise’ in the analysis of all mountains in the whole of Switzerland contains significant local signals. Small spatial extents, found all over Switzerland, can show considerably stronger correlations between text frequency and spatial prominence, with up to 90% of the total variation explained. We argue that our findings have practical implications for data-driven GIScience. Retrieving meaningful spatial information from text might only be possible if the spatial scale of analysis reflects the spatial scale described in the input text documents.
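The scale effect described above can be illustrated with a minimal sketch on synthetic data. The values below are invented for illustration and are not taken from the corpus. The region names, the helper `r_squared`, and the use of the squared Pearson correlation as the share of explained variation are assumptions of this sketch, not the method of the paper.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. the share of variation in ys
    explained by a simple linear regression on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical regions: (prominence values, toponym frequencies).
# Within each region, frequency rises linearly with prominence, but the
# regions sit at unrelated frequency levels.
regions = {
    "region_a": ([0, 1, 2, 3, 4], [10, 11, 12, 13, 14]),
    "region_b": ([10, 11, 12, 13, 14], [0, 1, 2, 3, 4]),
    "region_c": ([20, 21, 22, 23, 24], [12, 13, 14, 15, 16]),
}

# Local analysis: one R^2 per small spatial extent.
local_r2 = {name: r_squared(xs, ys) for name, (xs, ys) in regions.items()}

# Global analysis: pool all mountains and compute a single R^2.
pooled_x = [x for xs, _ in regions.values() for x in xs]
pooled_y = [y for _, ys in regions.values() for y in ys]
pooled_r2 = r_squared(pooled_x, pooled_y)

print("local R^2:", local_r2)
print("pooled R^2:", round(pooled_r2, 3))
```

In this toy data every local relation is perfectly linear, yet the pooled value stays below 0.1, because the differences between regions swamp the relation within each region. Real data would of course be noisier at both scales.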
Acknowledgments
We would like to express our special thanks to Ross S. Purves for his helpful comments on an early version of the manuscript.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1. The details of the named entity recognition in the current version of the corpus are described in the corpus documentation (in German), available at http://www.textberg.ch/ReleaseNotes/README_Release_151_v01.htm.
2. We refer to these two measures as frequency (i.e. mountain toponym frequency) and prominence (i.e. spatial mountain prominence) in the rest of this discussion.