Research Article

Government websites as data: a methodological pipeline with application to the websites of municipalities in the United States

 

ABSTRACT

The content of a government’s website is an important source of information about policy priorities, procedures, and services. Existing research on government websites has relied on manual methods of website content collection and processing, which imposes cost limitations on the scale of website data collection. In this research note, we propose that the automated collection of website content from large samples of government websites can offer relief from the costs of manual collection, and enable contributions through large-scale comparative analyses. We also provide software to ease the use of this data collection method. In an illustrative application, we collect textual content from the websites of over two hundred municipal governments in the United States, and study how website content is associated with mayoral partisanship. Using statistical topic modeling, we find that the partisanship of the mayor predicts differences in the contents of city websites that align with differences in the platforms of Democrats and Republicans. The application illustrates the utility of website content data extracted via our methodological pipeline.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1. A related literature concerns the websites of politicians and their parties. By and large, researchers in this field also rely on hand-coding (Druckman, Hennessy, Kifer, & Parkin, 2010; Druckman, Kifer, & Parkin, 2009; Esterling, Lazer, & Neblo, 2011; Norris, 2003), although a few studies use targeted scrapers (Cryer, 2019; Therriault, 2010).

2. Urban (2002) relies on a webcrawler to measure how many pages each city website comprises, which is also the first step of wget. In this way, his research is a precursor to our own, albeit without the actual analysis of each page.

3. In the online appendix we show that without using boilerpipe, some of the most partisan ‘topics’ are simply website boilerplate text.

4. readtext determines a document’s type solely from its file extension, so the conversion described above is necessary.

5. We retain laws, nationalities or religious or political groups, and works of art.

6. Lemmatization is similar to stemming, but it takes grammar and surrounding words into account to identify the dictionary form of a word.

Additional information

Funding

This work was supported by the National Science Foundation [1839282, 1637089, 2028675].

Notes on contributors

Markus Neumann

Markus Neumann is a Postdoctoral Research Fellow at the Wesleyan Media Project. His research agenda revolves around the application of machine learning methods to social science data, particularly text, audio and images.

Fridolin Linder

Fridolin Linder received his PhD in Political Science from Pennsylvania State University. He was a postdoctoral researcher at the Social Media and Political Participation Lab at NYU when working on this study. He now works as a Data Scientist for Yunex Traffic in Munich.

Bruce Desmarais

Bruce Desmarais is the DeGrandis-McCourtney Early Career Professor in Political Science, Associate Director of the Center for Social Data Analytics, and an Affiliate of the Institute for Cyber Science at Pennsylvania State University. His research is focused on methodological development and applications that further our understanding of the complex interdependence that underlies politics, policymaking, and public administration.
