ABSTRACT
The content of a government’s website is an important source of information about policy priorities, procedures, and services. Existing research on government websites has relied on manual methods of collecting and processing website content, which impose cost limitations on the scale of data collection. In this research note, we propose that automated collection of website content from large samples of government websites can relieve the costs of manual collection and enable contributions through large-scale comparative analyses. We also provide software to ease the use of this data collection method. In an illustrative application, we collect textual content from the websites of over two hundred municipal governments in the United States and study how website content is associated with mayoral partisanship. Using statistical topic modeling, we find that the partisanship of the mayor predicts differences in the content of city websites that align with differences between the platforms of Democrats and Republicans. The application illustrates the utility of website content data extracted via our methodological pipeline.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1. A related literature concerns the websites of politicians and their parties. By and large, researchers in this field also rely on hand-coding (Druckman, Hennessy, Kifer, & Parkin, Citation2010; Druckman, Kifer, & Parkin, Citation2009; Esterling, Lazer, & Neblo, Citation2011; Norris, Citation2003), albeit with some exceptions that use targeted scrapers (Cryer, Citation2019; Therriault, Citation2010).
2. Urban (Citation2002) relies on a web crawler to measure how many pages each city website comprises, which is also the first step that wget performs. In this way, his research is a precursor to our own, albeit without the actual analysis of each page.
3. In the online appendix we show that without using boilerpipe, some of the most partisan ‘topics’ are simply website boilerplate text.
4. readtext determines a document’s type solely from its file extension, so the conversion described above is necessary.
5. We retain laws, nationalities or religious or political groups, and works of art.
6. Lemmatization is similar to stemming, but it takes grammar and surrounding words into account in order to map each word to its dictionary form (lemma), rather than simply stripping suffixes.
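To make the distinction in note 6 concrete, the following is a toy sketch, not the actual preprocessing pipeline used in the paper: a crude suffix-stripping stemmer is contrasted with a small dictionary-based lemmatizer whose part-of-speech argument stands in for the grammatical context a real lemmatizer (e.g. in spaCy or NLTK) would infer. The stemming rules and the lemma lookup table are hypothetical simplifications for illustration only.

```python
# Toy illustration of stemming vs. lemmatization.
# Both the suffix rules and the lemma table below are made up for this
# example; real pipelines use full NLP libraries instead.

def crude_stem(word):
    """Strip common suffixes without any grammatical knowledge."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Dictionary-form lookup keyed on (word, part of speech): the POS tag
# plays the role of the surrounding grammatical context.
LEMMAS = {
    ("meeting", "NOUN"): "meeting",  # "the meeting" keeps its form
    ("meeting", "VERB"): "meet",     # "was meeting" reduces to "meet"
    ("cities", "NOUN"): "city",
}

def lemmatize(word, pos):
    """Return the dictionary form for a (word, POS) pair if known."""
    return LEMMAS.get((word, pos), word)

print(crude_stem("meeting"))         # the stemmer always strips: "meet"
print(lemmatize("meeting", "NOUN"))  # the lemmatizer keeps the noun: "meeting"
print(lemmatize("meeting", "VERB"))  # but reduces the verb: "meet"
print(crude_stem("cities"))          # stemming garbles: "cit"
print(lemmatize("cities", "NOUN"))   # lemmatization recovers: "city"
```

The point of the contrast: because the stemmer ignores context, it treats the noun and verb uses of "meeting" identically, whereas the lemmatizer can distinguish them.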
Additional information
Funding
Notes on contributors
Markus Neumann
Markus Neumann is a Postdoctoral Research Fellow at the Wesleyan Media Project. His research agenda revolves around the application of machine learning methods to social science data, particularly text, audio and images.
Fridolin Linder
Fridolin Linder received his PhD in Political Science from Pennsylvania State University. He was a postdoctoral researcher at the Social Media and Political Participation Lab at NYU when working on this study. He now works as a Data Scientist for Yunex Traffic in Munich.
Bruce Desmarais
Bruce Desmarais is the DeGrandis-McCourtney Early Career Professor in Political Science, Associate Director of the Center for Social Data Analytics, and an Affiliate of the Institute for Cyber Science at Pennsylvania State University. His research is focused on methodological development and applications that further our understanding of the complex interdependence that underlies politics, policymaking, and public administration.