Abstract
Education researchers have traditionally faced severe data limitations in studying local policy variation; administrative data sets capture only a fraction of districts’ policy decisions, and it can be expensive to collect more nuanced implementation data from teachers and leaders. Natural language processing and web-scraping techniques can help address these challenges by assisting researchers in locating and processing policy documents posted online. School district policies and practices are commonly documented in student and staff manuals, school improvement plans, and meeting minutes that are posted for the public. This article introduces an end-to-end framework for collecting these sorts of policy documents and extracting structured policy data: The researcher gathers all potentially relevant documents from district websites, narrows the text corpus to spans of interest using a text classifier, and then extracts specific policy data using additional natural language processing techniques. Through this framework, a researcher can describe variation in policy implementation at the local level, aggregated across state- or nationwide populations, even as policies evolve over time.
Acknowledgments
The opinions expressed are those of the authors and do not represent views of the institute or the U.S. Department of Education.
Notes
1 Readers may turn to Jurafsky and Martin (2018) for more comprehensive coverage of NLP techniques, and to Grimmer and Stewart (2013) and Gentzkow et al. (2017) for reviews of text-as-data methods in political science and economics.
2 For more information on regular expressions, I recommend Mastering Regular Expressions (Friedl, 2002).
3 For example, Python’s Natural Language Toolkit maintains a list of 179 stop words (Bird et al., 2018).
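As a small illustration of stop-word filtering, the sketch below uses a short hard-coded list as a stand-in for a full list such as NLTK’s (available via nltk.corpus.stopwords.words("english")); the sample sentence is hypothetical.

```python
# Illustrative stop-word filtering. The short list below is a stand-in
# for a full stop-word list such as NLTK's 179-word English list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def remove_stop_words(text):
    """Return the whitespace tokens of `text` with stop words removed."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The policy of the district is to suspend students"))
# → ['policy', 'district', 'suspend', 'students']
```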
4 A number of software libraries are available to simplify the process of building a web crawler; modules for making HTTP requests and parsing HTML are particularly prevalent. To submit HTTP requests, I used the Requests (Reitz, 2018) library; to parse HTML, I used BeautifulSoup (Richardson, 2017).
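A minimal sketch of the link-extraction step of such a crawler is shown below. For self-containment it uses only the standard library’s html.parser (the article itself used Requests and BeautifulSoup) and parses a hypothetical page rather than fetching one over HTTP.

```python
# Sketch of the HTML-parsing step of a crawler, using only the standard
# library; a real crawler would first fetch each page over HTTP.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical district web page containing one document link.
page = '<html><body><a href="/board/minutes.pdf">Minutes</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # → ['/board/minutes.pdf']
```

The collected hrefs would then be queued for crawling or, when they point to documents such as PDFs, downloaded for text extraction.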
5 Of the 3,995 documents scraped from district websites, Apache Tika was able to extract text from 3,818.
6 I used the open-source Python library spaCy (Honnibal, 2017), which includes pretrained word embeddings and pretrained convolutional filters.
7 I used the following regular expression in my Python code: \d{2,3}.\d{2,}. Researchers should note that regular expression syntax can vary across programming languages and software implementations.
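As an illustration, the pattern can be applied with Python’s re module; the line of policy text below is hypothetical. Note that the unescaped period matches any single character, not only a literal period (escaping it as \d{2,3}\.\d{2,} would restrict the match to a period).

```python
# Applying the note's regular expression to a hypothetical policy line.
# \d{2,3} matches 2-3 digits, "." matches any character (unescaped),
# and \d{2,} matches 2 or more digits.
import re

pattern = re.compile(r"\d{2,3}.\d{2,}")
sample = "See policy 204.10 for details."
print(pattern.findall(sample))  # → ['204.10']
```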