Methodological Studies

Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing

Pages 685-706 | Received 01 Aug 2018, Accepted 02 Aug 2019, Published online: 06 Dec 2019
 

Abstract

Education researchers have traditionally faced severe data limitations in studying local policy variation; administrative data sets capture only a fraction of districts’ policy decisions, and collecting more nuanced implementation data from teachers and leaders can be expensive. Natural language processing and web-scraping techniques can help address these challenges by assisting researchers in locating and processing policy documents posted online. School district policies and practices are commonly documented in student and staff manuals, school improvement plans, and meeting minutes that are posted for the public. This article introduces an end-to-end framework for collecting these policy documents and extracting structured policy data: the researcher gathers all potentially relevant documents from district websites, narrows the text corpus to spans of interest using a text classifier, and then extracts specific policy data using additional natural language processing techniques. Through this framework, a researcher can describe variation in policy implementation at the local level, aggregated across state- or nationwide populations, even as policies evolve over time.

Acknowledgments

The opinions expressed are those of the author and do not represent views of the Institute or the U.S. Department of Education.

Notes

1 Readers may turn to Jurafsky and Martin (2018) for more comprehensive coverage of NLP techniques and to Grimmer and Stewart (2013) and Gentzkow et al. (2017) for broader reviews of text-as-data methods in political science and economics.

2 For more information on regular expressions, I recommend Mastering Regular Expressions (Friedl, 2002).

3 For example, Python’s Natural Language Toolkit maintains a list of 179 stop words (Bird et al., 2018).
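
As a concrete illustration of note 3, the sketch below loads the NLTK stop-word list and uses it to filter a token list; the sample tokens and variable names are illustrative only, not drawn from the article’s corpus.

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # one-time download of the stop-word corpus
    stop_words = set(stopwords.words("english"))

    # Illustrative token list; in practice these would come from a scraped document.
    tokens = ["the", "district", "requires", "a", "school", "improvement", "plan"]
    content_tokens = [t for t in tokens if t not in stop_words]
    print(content_tokens)  # ['district', 'requires', 'school', 'improvement', 'plan']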

4 A number of software libraries simplify the process of building a web crawler; modules for making HTTP requests and parsing HTML are particularly prevalent. To submit HTTP requests, I used the Requests library (Reitz, 2018); to parse HTML, I used BeautifulSoup (Richardson, 2017).
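
A minimal sketch of the request-and-parse step described in note 4, assuming a placeholder district URL; the actual crawler involves additional logic (link queues, politeness delays, error handling) not shown here.

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.example-district.org/board/policies"  # placeholder URL, not a real district site
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    # Collect every hyperlink on the page, then keep those pointing at PDF documents.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    pdf_links = [href for href in links if href.lower().endswith(".pdf")]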

5 Of the 3,995 documents scraped from district websites, Apache Tika was able to extract text from 3,818.
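
The article does not specify how Apache Tika was invoked; one common option is the tika Python package, sketched below with a placeholder file name.

    from tika import parser  # requires the tika package and a Java runtime

    parsed = parser.from_file("district_handbook.pdf")  # placeholder file name
    text = parsed.get("content") or ""                  # content is None when extraction fails
    if text.strip():
        print(text[:500])  # first 500 characters of the extracted text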

6 I used the open-source Python library spaCy (Honnibal, 2017), which includes pretrained word embeddings and pretrained convolutional filters.
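
A brief sketch of loading a pretrained spaCy pipeline and inspecting its word vectors; the model name and example sentence are assumptions for illustration, not details from the article.

    import spacy

    nlp = spacy.load("en_core_web_md")  # medium English model, ships with word vectors
    doc = nlp("Students must complete 22 credits, including four in mathematics.")

    for token in doc:
        # has_vector indicates whether a pretrained embedding exists for the token.
        print(token.text, token.has_vector, token.vector_norm)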

7 I used the following regular expression in my Python code: \d{2,3}.\d{2,}. Researchers should note that although regular expressions share a common core syntax, the features supported and their exact behavior can depend on the programming language and/or software implementation.
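
For readers unfamiliar with applying such a pattern in Python, the sketch below compiles the expression from note 7 and runs it against an invented sentence.

    import re

    # Expression from note 7; the unescaped "." matches any single character,
    # not only a literal period.
    pattern = re.compile(r"\d{2,3}.\d{2,}")

    text = "See Board Policy 504.12 and the 120.25-credit requirement."
    print(pattern.findall(text))  # ['504.12', '120.25']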

Additional information

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant #R305B140026 to the Rectors and Visitors of the University of Virginia.
