Abstract
Education researchers have traditionally faced severe data limitations in studying local policy variation; administrative data sets capture only a fraction of districts’ policy decisions, and it can be expensive to collect more nuanced implementation data from teachers and leaders. Natural language processing and web-scraping techniques can help address these challenges by assisting researchers in locating and processing policy documents posted online. School district policies and practices are commonly documented in student and staff manuals, school improvement plans, and meeting minutes that are posted for the public. This article introduces an end-to-end framework for collecting these sorts of policy documents and extracting structured policy data: The researcher gathers all potentially relevant documents from district websites, narrows the text corpus to spans of interest using a text classifier, and then extracts specific policy data using additional natural language processing techniques. Through this framework, a researcher can describe variation in policy implementation at the local level, aggregated across state- or nationwide populations, even as policies evolve over time.
Acknowledgments
The opinions expressed are those of the authors and do not represent views of the institute or the U.S. Department of Education.
Notes
1 Readers may turn to Jurafsky and Martin (2018) for more comprehensive coverage of NLP techniques, and to Grimmer and Stewart (2013) and Gentzkow et al. (2017) for reviews of text-as-data methods in political science and economics.
2 For more information on regular expressions, I recommend Mastering Regular Expressions (Friedl, 2002).
3 For example, Python’s Natural Language Toolkit maintains a list of 179 stop words (Bird et al., 2018).
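As a small illustration of stop-word filtering, the sketch below uses a short hard-coded list as a stand-in for a full list such as NLTK’s (available via nltk.corpus.stopwords.words("english")); the sample sentence is hypothetical.

```python
# Illustrative stop-word filtering. The short list below is a stand-in
# for a full stop-word list such as NLTK's 179-word English list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def remove_stop_words(text):
    """Return the whitespace tokens of `text` with stop words removed."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The policy of the district is to suspend students"))
# → ['policy', 'district', 'suspend', 'students']
```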
4 A number of software libraries are available to simplify the process of building a web crawler; modules for making HTTP requests and parsing HTML are particularly prevalent. To submit HTTP requests, I used the Requests (Reitz, 2018) library; to parse HTML, I used BeautifulSoup (Richardson, 2017).
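A minimal sketch of the link-extraction step of such a crawler is shown below. For self-containment it uses only the standard library’s html.parser (the article itself used Requests and BeautifulSoup) and parses a hypothetical page rather than fetching one over HTTP.

```python
# Sketch of the HTML-parsing step of a crawler, using only the standard
# library; a real crawler would first fetch each page over HTTP.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical district web page containing one document link.
page = '<html><body><a href="/board/minutes.pdf">Minutes</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # → ['/board/minutes.pdf']
```

The collected hrefs would then be queued for crawling or, when they point to documents such as PDFs, downloaded for text extraction.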
5 Of the 3,995 documents scraped from district websites, Apache Tika was able to extract text from 3,818.
6 I used the open-source Python library spaCy (Honnibal, 2017), which includes pretrained word embeddings and pretrained convolutional filters.
7 I used the following regular expression in my Python code: \d{2,3}.\d{2,}. Researchers should note that regular expression syntax can vary across programming languages and software implementations.
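As an illustration, the pattern can be applied with Python’s re module; the line of policy text below is hypothetical. Note that the unescaped period matches any single character, not only a literal period (escaping it as \d{2,3}\.\d{2,} would restrict the match to a period).

```python
# Applying the note's regular expression to a hypothetical policy line.
# \d{2,3} matches 2-3 digits, "." matches any character (unescaped),
# and \d{2,} matches 2 or more digits.
import re

pattern = re.compile(r"\d{2,3}.\d{2,}")
sample = "See policy 204.10 for details."
print(pattern.findall(sample))  # → ['204.10']
```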