Volunteered geographic information production as a spatial process

Pages 1191-1212 | Received 27 May 2010, Accepted 23 Sep 2011, Published online: 31 Jan 2012
 

Abstract

Wikipedia is a free encyclopedia that anyone can edit and a popular example of user-generated content that includes volunteered geographic information (VGI). In this article, we present three main contributions: (1) a spatial data model and collection methods to study VGI in systems that may not explicitly support geographic data; (2) quantitative methods for measuring distance between online authors and articles; and (3) empirically calibrated results from a gravity model of the role of distance in VGI production. To model spatial processes of VGI contributors, we use an invariant exponential gravity model based on article and author proximity. We define a proximity metric called a ‘signature distance’ as a weighted average distance between an article and each of its authors, and we estimate the location of 2.8 million anonymous authors through IP geolocation. Our study collects empirical data directly from 21 language-specific Wikipedia databases, spanning 7 years of contributions (2001–2008) to nearly 1 million geotagged articles. We find empirical evidence that the spatial processes of anonymous contributors fit an exponential distance decay model. Our results are consistent with the prior results on information diffusion as a spatial process, but run counter to theories that a globalized Internet neutralizes distance as a determinant of social behaviors.
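The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes great-circle (haversine) distance, assumes that each author's weight is an edit count, and replaces the paper's invariant exponential gravity model with a plain log-linear least-squares fit of `count = k * exp(-beta * d)`. All function names and parameters here are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def signature_distance(article, authors):
    """Weighted average distance between an article and its authors.

    article: (lat, lon) of the geotagged article.
    authors: iterable of (lat, lon, weight); the weight is assumed to be an
             edit count, which is an illustrative choice rather than the
             paper's stated weighting.
    """
    authors = list(authors)
    total_weight = sum(w for _, _, w in authors)
    if total_weight <= 0:
        raise ValueError("authors must carry a positive total weight")
    weighted_sum = sum(
        w * haversine_km(article[0], article[1], lat, lon) for lat, lon, w in authors
    )
    return weighted_sum / total_weight


def fit_exponential_decay(distances, counts):
    """Fit count = k * exp(-beta * d) by least squares on log(count).

    Returns (k, beta). This is a simplified stand-in for a calibrated
    gravity model; it requires strictly positive counts and at least two
    distinct distances.
    """
    xs = list(distances)
    ys = [math.log(c) for c in counts]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct distances")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope
```

In this sketch, a larger fitted `beta` means contributions fall off faster with distance, and a `beta` near zero would correspond to the claim that distance no longer matters.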

Acknowledgments

This research was supported in part by the National Science Foundation (Awards #BCS-0849625 ‘Collaborative Research: A GIScience Approach for Assessing the Quality, Potential Applications, and Impact of Volunteered Geographic Information’ and #IIS-0431166 ‘Collaborative Research: Integrating Digital Libraries and Earth Science Data Systems’) and the US Army Research Office (Award #W911NF0910302). We thank Wikimedia Deutschland, e.V. in Berlin, Germany, for providing the helpful Toolserver service (http://toolserver.org). They provided database access, web hosting, and computational resources for our study. Thanks to Tim Alder and Stefan Kühn for comments on geotagging methods in Wikipedia, and for sharing their data-mining software and results. Finally, we also thank Sarah Elwood, Danica Schaffer-Smith, Daniel Sui, and especially the anonymous reviewers for their comments.

Flickr, Google Earth, Google Maps, and YouTube are trademarks ™, and GeoIP, GeoLite, MaxMind, and Wikipedia are registered trademarks ®.

Notes

1. Priedhorsky et al. (2010) used the term geographic volunteer work rather than VGI ‘to emphasize the active role of end users.’

2. O'Reilly (2005) attributed the term ‘long tail’ to Chris Anderson, who was describing the ‘collective power of the small sites that make up the bulk of the web's content.’ Barabási and Albert (1999) described the underlying phenomena behind this web topology.

3. Source: Retrieved 23 February 2010, from http://en.wikipedia.org/wiki/Wikipedia:GEO.

4. A workflow of their data-mining processes is available in German (Source: http://de.wikipedia.org/wiki/Datei:Wikipedia_Geodata_Workflow.svg [Accessed 25 February 2010]). We used their 22 June 2008 results data, which are available from any Toolserver account in the u_kolossos_geo_p database.

5. These are as follows (with their ISO 639-1 codes used by Wikipedia): Catalan (ca), Chinese (zh), Czech (cs), Danish (da), Dutch (nl), English (en), Esperanto (eo), Finnish (fi), French (fr), German (de), Icelandic (is), Italian (it), Japanese (ja), Norwegian (no), Polish (pl), Portuguese (pt), Russian (ru), Slovak (sk), Spanish (es), Swedish (sv), and Turkish (tr).

6. Reportedly, Wikipedia logs IP addresses for all contributions – from anonymous and registered contributors alike – but they restrict access to those data to authorized administrators.

7. MaxMind's GeoLite City database is a freely available version of its commercial GeoIP City database. MaxMind described their methods as follows: ‘We employ user-entered location data from sites that ask web visitors to provide their geographic location. We then run millions of these datasets through a series of algorithms that identify, extract, and extrapolate location points for IP addresses’ (http://www.maxmind.com/app/ip-locate [Accessed 17 February 2010]). ‘GeoIP and GeoLite draw from different seed data sources to generate the IP location data. GeoLite draws primarily from publicly available data and is less accurate, especially at the city level. GeoIP draws primarily from internally collected sources and is more accurate’ (Source: http://forum.maxmind.com [Accessed 21 November 2007]).

8. MaxMind provides an ‘Accuracy Radius Database’ that includes an estimated average error – which they define as the ‘average distance between the actual location of the end user using the IP address and the location returned by the GeoLite City database’ – for given IP address blocks (Source: http://www.maxmind.com/app/geolite_city_accuracy [Accessed 17 February 2010]). We suspect that their evaluation methods are both nonscientific and limited in scope. For example, we cannot account for large variations between industrialized countries with significant Internet deployments.

9. We define work in simple terms as an edit count for our study, but we recognize that the literature has many different definitions for work, including edit counts (Kittur et al. 2007), edit deltas (Zeng et al. 2006), edit similarity (i.e., information distance) (Voss 2005), edit longevity (i.e., age or survival or persistence) (Adler and de Alfaro 2007, Wöhner and Peters 2009), and edit visibility (Priedhorsky et al. 2007).

10. Wilson (2010, p. 367) serendipitously noted that this exponential nature ‘even now appears [as an article] in Wikipedia’.

11. These services include GEOnet Names Server (GNS) and Geographic Names Information System (GNIS) (Source: http://en.wikipedia.org/wiki/User:The_Anomebot2 [Accessed 24 February 2010]). Using gazetteers as data sources is common for these automated processes, but there are other data sources in use. Another Wikibot (Rambot), for example, uses its own database of 3,141 counties and 33,832 cities to create geographic articles (Source: http://en.wikipedia.org/wiki/User_talk:Rambot [Accessed 24 February 2010]).

12. The syntax for geographic markup is varied, ranging from rich markup with the Geography Markup Language (GML) to simple HTML-based markup with Dublin Core metadata (Kunze 1999), GEO metadata (Daviel and Kaegi 2007), and geo microformat (Çelik 2005).

13. This paradox, in conjunction with privacy and other concerns, has led to what is known as ‘sock puppetry’ where a single author will use multiple accounts to protect their privacy or otherwise obfuscate their actions.

14. The article they reference was deleted on 8 January 2008, but the article ‘Wikipedians with articles’ does list full names (Source: http://en.wikipedia.org/wiki/Wikipedia:Wikipedians_with_articles [Accessed 23 September 2010]).

15. Kisilevich et al. (2010) reviewed the spatial data-mining methods for tracking trajectories of individuals or groups with such trace data.

16. Source: http://en.wikipedia.org/wiki/NPOV [Accessed 24 September 2010].

17. From these articles, they create a geographic container for the area bound by these articles. They estimate author locations by clustering contribution histories based on a convex hull of the article locations.
