ABSTRACT
While exciting, Big Data (particularly geotagged social media data) has proven difficult for many urbanists and social science researchers to use. As a partial solution, we propose a strategy that enables the fast extraction of only relevant data from large sets of geosocial data. This runs contrary to many Big Data approaches, in which analysis is done on the entire dataset, but much productive social science work can use smaller datasets (around the same size as census or survey data) within standard methodological frameworks. The approach we outline in this paper, including the example of a fully operating system, offers a solution for urban researchers interested in these types of data but reluctant to personally build data science skills.
Notes on Contributors
Ate Poorthuis is an assistant professor in the humanities, arts and social sciences at Singapore University of Technology and Design.
Matthew Zook is a professor of information and economic geography at the University of Kentucky, Lexington.
Notes
1 Particularly relevant for urbanists and geographers is that, for the subset of geotagged tweets, the difference between the sample and the firehose is negligible.
2 The DOLLY project received an academic white listing in May 2009 for a different project (Dugundji, Poorthuis, and van Meeteren, 2011; van Meeteren, Poorthuis, and Dugundji, 2009). This original white listing allowed DOLLY to access the elevated garden hose (10%) stream in 2011 without going through a third-party commercial vendor.
3 The fluctuations seen in and are not a result of the DOLLY methodology or system; they are tied to changes in (1) actual Twitter usage and/or (2) Twitter’s public API. However, we have been unable to clarify with Twitter the exact cause of these changes.
4 Although the University’s data center was outfitted with its own power generator and UPS, the system was affected by power outages. Likewise, some configuration errors and software bugs resulted in gaps (generally of a few minutes or hours) while updates were performed. While we discussed filling these gaps, the combination of a short time horizon for action (before Tweets were no longer available), the weight of other demands on our time, and a lack of human resources made filling them a lower priority than other tasks crucial to keeping the system going. While DOLLY was up approximately 99.99 percent of the time, these gaps bring home the difficulty of maintaining 100 percent uptime in long-term data collection.
5 The open-source RabbitMQ message broker, which implements the open AMQP standard, is used here (Vinoski, 2006).
6 Using Natural Earth Data (“Natural Earth,” n.d.) and the PostGIS spatial database (Ramsey, 2005).