Intervention, Evaluation, and Policy Studies

The Effects of Higher-Stakes Teacher Evaluation on Office Disciplinary Referrals

Pages 475-509 | Received 15 Jul 2020, Accepted 15 Oct 2021, Published online: 26 Jan 2022

Abstract

The effects of imposing accountability pressures on public school teachers are empirically indeterminate. In this paper, we study the effects of accountability in the context of teacher responses to student behavioral infractions in the aftermath of teacher evaluation reforms. We leverage cross-state variation in the timing of state policy implementation to estimate whether teachers change the rate at which they remove students from their classrooms. We find that higher-stakes teacher evaluation had no causal effect on the rates of disciplinary referrals, and we find no evidence of heterogeneous effects for grades subject to greater accountability pressures or in schools facing differing levels of disciplinary infractions. Our results are precisely estimated and robust to a battery of assumption and specification checks.

Acknowledgment

We are grateful to the Education and Community Supports (ECS) research unit at the University of Oregon for access to the confidential School-Wide Information System data. We thank Kent McIntosh and Angus Kittelman for answering various data-related questions and providing substantive feedback. We thank Kaitlin Anderson, Chris Curran, Glen Waddell, Anwesha Guha, several anonymous referees, participants at the Association of Public Policy and Management Fall Conference, the University of Oregon Applied Micro-Econometrics seminar, and the Education Policy Collaborative Annual Meeting for their feedback. All errors are our own.

Data availability statement

The data can be obtained by filing a request directly with ECS: https://ecs.uoregon.edu/research-projects/. Replication materials are available at https://doi.org/10.17605/OSF.IO/9X8PU.

Open Scholarship

This article has earned the Center for Open Science badges for Open Materials and Preregistered through Open Practices Disclosure. The materials are openly accessible at OSF.IO/9X8PU. The preregistrations can be found by searching for "Liebowitz" at https://sreereg.icpsr.umich.edu/sreereg/search/search, or downloaded directly via https://sreereg.icpsr.umich.edu/sreereg/subEntry/2506/pdf?section=all&action=download and https://sreereg.icpsr.umich.edu/sreereg/subEntry/2507/pdf?section=all&action=download. To obtain the author's disclosure form, please contact the Editor.

Notes

1 Lazear’s (Citation2001) seminal work on the production of education lays out a theoretical relationship between instructional effectiveness and classroom behavior, but this phenomenon is dramatically understudied empirically.

2 The typical mechanism by which teachers respond to student behavior that they have determined cannot be addressed in the classroom is to send students to a school administrator (e.g., principal, assistant principal, dean of students) in the school’s office. Other approaches include having the student wait in the hallway to speak with an administrator. For the purposes of this paper, we describe all such events as Office Disciplinary Referrals (ODRs).

3 See, among others, Brehm et al. (Citation2017), Chakrabarti (Citation2014), Chiang (Citation2009), Deming et al. (Citation2016), Eren (Citation2019), Macartney (Citation2016), Ozek (Citation2012), and Reback et al. (Citation2014). Deming and Figlio (Citation2016) synthesize the literature on educational accountability.

4 See, among others, Cullen et al. (Citation2019), Kraft et al. (Citation2020), Macartney et al. (Citation2019), Pope (Citation2019), Rothstein (Citation2015), Steinberg and Sartain (Citation2015), Stecher et al. (Citation2018), Strunk et al. (Citation2017) and Taylor and Tyler (Citation2012). Liebowitz (Citation2020) summarizes this nuanced literature.

5 Alaska, Maine, Mississippi, New Jersey, North Dakota and Pennsylvania passed new teacher evaluation laws in 2012; Kentucky, South Carolina and Texas did so in 2013.

6 We categorize behaviors as subjective or objective following Greflund et al. (Citation2014): “Subjective behaviors were defined as behaviors that require not simply observing a discrete, objective event (e.g., a student smoking), but a significant value judgment regarding whether the intensity or quality of the behavior warrants an ODR (e.g., a student using inappropriate language). (…) The following behaviors were categorized as subjective: abusive language/inappropriate language/profanity, defiance/disrespect/insubordination/non-compliance, harassment/bullying, disruption, dress code violation, and inappropriate display of affection. The following behaviors were categorized as less subjective: physical aggression/fighting, tardy, skipping, truancy, property damage/vandalism, forgery/theft, inappropriate location/out of bounds, use/possession of tobacco, alcohol, drugs, combustibles, weapons, bomb threat/false alarm, and arson. Three problem behaviors did not meet the inter-rater reliability criterion and were also not classified as subjective: lying/cheating, technology violation, and gang affiliation display” (pp. 220–221).
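
For concreteness, the recoding step might look like the following minimal Stata sketch; the variable name problem_behavior and the category strings are hypothetical stand-ins for the actual SWIS codes, and only a subset of the objective categories is shown.

    * Hypothetical SWIS-style string variable problem_behavior; labels are placeholders
    gen byte subjective = inlist(problem_behavior, ///
        "Inappropriate language", "Defiance/Disrespect", "Harassment/Bullying", ///
        "Disruption", "Dress code violation", "Inappropriate display of affection")
    gen byte objective = inlist(problem_behavior, ///
        "Physical aggression/Fighting", "Tardy", "Skipping", "Truancy", ///
        "Property damage/Vandalism", "Forgery/Theft", "Out of bounds")
    * Remaining objective categories (tobacco, alcohol, drugs, weapons, etc.) omitted for brevity;
    * lying/cheating, technology violations, and gang affiliation display fall in neither category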

7 To be classified as successfully implementing PBIS, schools had to meet one of the following thresholds: School-wide Evaluation Tool (SET): greater than or equal to 80 percent of expectations taught and overall implementation; Tiered Fidelity Inventory (TFI): Tier 1 ratio greater than or equal to 70 percent; Benchmark of Quality (BOQ): Total Ratio greater than or equal to 70 percent; Self-Assessment Survey (SAS): Implementation Average greater than or equal to 80 percent; and Team Implementation Checklist (TIC): Implementation Average greater than or equal to 80 percent. We refer readers to McIntosh et al. (Citation2013) and Mercer, McIntosh and Hoselton (Citation2017) for details on the validation of these instruments. We do not use the continuous implementation scores as they represent substantially different scales and are not linked across instruments (Greflund et al., Citation2014; Mercer et al., Citation2017).
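
As an illustration, a minimal Stata sketch of this classification rule follows; the fidelity-score variable names are hypothetical, and each score is assumed to be missing when the corresponding instrument was not administered.

    * Hypothetical fidelity-score variables, one per instrument, each on its own scale
    gen byte pbis_implemented = 0
    replace pbis_implemented = 1 if set_taught >= 80 & set_overall >= 80 & !missing(set_taught, set_overall)  // SET
    replace pbis_implemented = 1 if tfi_tier1 >= 70 & !missing(tfi_tier1)   // TFI Tier 1
    replace pbis_implemented = 1 if boq_total >= 70 & !missing(boq_total)   // BOQ
    replace pbis_implemented = 1 if sas_avg   >= 80 & !missing(sas_avg)     // SAS
    replace pbis_implemented = 1 if tic_avg   >= 80 & !missing(tic_avg)     // TIC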

8 We present additional information on the CRDC sample of schools in Supplementary Appendix Table B1. In the sample of 343,015 school-year observations in the CRDC, the average school suspends 6 percent of its students per year (SD = 0.09). In auxiliary regressions, we find that districtwide patterns in ODRs are positively correlated with suspensions in the CRDC data, but imperfectly so. This aligns with our proposed use of the suspension data as a placebo test that measures a related, but distinct, construct from classroom referrals.
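
A minimal Stata sketch of one such auxiliary regression appears below; the district-by-year variable names and the fixed-effects structure are assumptions rather than the exact specification used.

    * Hypothetical district-by-year aggregates: odr_rate from SWIS, susp_rate from the CRDC
    * reghdfe is a user-written command (ssc install reghdfe)
    reghdfe susp_rate odr_rate, absorb(district year) vce(cluster state)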

9 While this number is a rough approximation given the lack of precise estimates, we believe it is a conservative one. Students are typically required to complete a reflection form and conference with a school administrator before returning to class. Students unprepared to return to class remain with the administrator for longer periods. If one administrator were responsible for all referrals in a 500-student school (a reality in many contexts), this would mean that 22.5 percent of her eight-hour workdays over 180 school days would be devoted to these referrals.

10 In fact, we regress the seven school demographic characteristics on our evaluation indicator and reject the null in only one instance. Evaluation implementation predicts a small decrease in the FRPL composition of a school (Beta: −1.25 p.p., SE: 0.52). Given the multiple hypotheses we test and the small magnitude of the coefficient, we take these results as consistent with our claim that school demographic characteristics are exogenous to policy implementation, though we present estimates without these adjustments in all cases to address this concern.
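
For illustration, one such balance regression might be specified as in the following Stata sketch; the variable names and fixed-effects structure are hypothetical, and the regression would be repeated for each of the seven demographic characteristics.

    * Hypothetical school-by-year FRPL share regressed on the evaluation indicator
    reghdfe frpl_pct eval, absorb(school year) vce(cluster state)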

11 In our event-study results, we estimate coefficients for all available data (including binned categories for 6 or more years pre-evaluation and 3 or more years post-evaluation, as well as 2 years post-evaluation) but only interpret years −5 through +1 to ensure that we only compare units that are observable for all treatment timing years. Including estimates outside the −5 to +1 bandwidth mixes relative-time effects with compositional shifts across schools that we are and are not able to observe for these years. In our main difference-in-differences estimates, however, we pool pre- and post-treatment periods to take advantage of the full range of data, which allows us to include schools that implement new evaluation policies for up to seven years.
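
A minimal Stata sketch of the event-time binning, using hypothetical variable names (year for the school year and eval_year for the state's adoption year, missing for never-adopters):

    * Relative event time; never-adopters retain missing values and serve as comparisons
    gen rel_year = year - eval_year
    * Bin the endpoints of event time
    replace rel_year = -6 if rel_year < -6
    replace rel_year =  3 if rel_year >  3 & rel_year < .
    * Coefficients are estimated for all categories, but only rel_year = -5 through +1 are interpreted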

12 While this approach, as well as the event-study models, is similar in spirit to the Comparative Interrupted Time Series (CITS) design, there are several important distinctions between the CITS and the difference-in-differences with multiple time points approaches. The CITS approach models different pre-trends for treatment and counterfactual groups. The DD approach assumes (and then tests for) parallel trends between treatment and counterfactual groups. This stricter assumption about pre-trends means that the DD approach need not rely on linear (or higher-order) extrapolations from pre-trends to estimate intercept- and slope-shifts. Given that our setting meets the stricter parallel-trends assumption of the difference-in-differences approach, we fit all models using this approach.

13 We may be concerned that the estimates from Equations (2) and (3) will be biased as a result of unobserved state-level factors that, contemporaneous with the introduction of high-stakes teacher evaluation, also affect ODRs. Triple difference (DDD) estimates that leverage alternative, potentially unaffected, outcomes help us address these sources of bias. We model these as follows:

$$ODR_{gjst} = \beta_1 (EVAL_{st} \times AFFECT_{st}) + \beta_2 EVAL_{st} + \beta_3 AFFECT_{st} + (AFFECT_{st} \cdot \Gamma_j)\phi + (AFFECT_{st} \cdot \Pi_t)\delta + (X_{jt})\theta + \Delta_g + \Gamma_j + \Pi_t + \upsilon_{gjst}.$$

$AFFECT_{st}$ is an indicator variable that takes the value of one if the observation is one in which we would anticipate the introduction of high-stakes evaluation policies will affect the rate of ODRs or affect the rate more intensively. We contrast locations in which ODRs occur, specifically comparing classroom-originating ODRs, which we anticipate would be influenced by changes in the teacher evaluation policies, and non-classroom-originating ODRs, which we anticipate would not be affected by the policy changes. Alternatively, we contrast the type of infraction (subjective or objective) resulting in the ODR. β1 represents the effect of the introduction of the high-stakes evaluation policy on anticipated affected outcomes, compared to unaffected outcomes in states that had not yet or never adopted the evaluation policy. We adjust for unexplained within-school and within-year heterogeneity in affected outcomes by interacting our AFFECT indicator with year- and school-indicators. As we show below, we find null effects for all of our double difference models, and so we do not feature our triple difference framework prominently. We do present these results in Tables A13 and A14 and, as expected, they also return precise zeros.
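
A minimal Stata sketch of this triple-difference specification follows; the variable names are hypothetical, and reghdfe (a user-written command) stands in for whatever estimation routine was actually used.

    * Hypothetical variables: odr, eval, affect, grade, school, year, state
    * The AFFECT main effect is collinear with the absorbed school#affect interactions and is dropped automatically
    reghdfe odr c.eval#c.affect eval affect, ///
        absorb(grade school year school#affect year#affect) vce(cluster state)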

14 We do not present grades K-2 and 12 as a counterfactual of no accountability increases, but we would anticipate that any effects in these grades would be less intense than in tested grades. We recognize that not all teachers in grades 3-11 teach a tested subject, but in the presence of an effect of greater test-score-based accountability pressures in these grades, we would nevertheless expect to see an average treatment effect in these grade bands. While all schools in states in our sample require high-stakes assessments in grades 3-8, high-school assessment requirements vary. All states require students to test at some point in grades 9-11. Some states require testing in only one of these grades, other states require testing across multiple years, and still others allow students discretion over the grade in which they take the tests. Our estimates are even closer to zero when we restrict our definition of higher-accountability grades to 3-8 (class: 0.026 (0.067); subjective-class: 0.022 (0.067)).

15 Kraft et al. (Citation2020) seek to rule out threats to their identification strategy from contemporaneous teacher and accountability policy reforms, such as the implementation of Common Core Standards or licensure tests. These are less relevant to our identification strategy as we are ultimately interested in whether and how increased accountability shifts teachers’ classroom practices. Our results are robust to the inclusion of policy indicators for the reform of tenure laws and weakening of collective bargaining (see Supplementary Appendix Tables A15 and A16). To the extent that our estimates of teacher evaluation reforms are influenced by other accountability-related policy reforms, this would imply that our results are evidence of overall accountability pressures on teacher practice, rather than specific to teacher evaluation.

16 This represents a departure from our pre-registered preferred estimates, which included demographic covariates in our model specification to improve precision and adjust for any remaining observable bias in our models. The new methodological insights post-date our pre-registration. As we present the pre-registered results (Models II and V) alongside these preferred results, and the coefficients differ by only 0.005 and 0.001 referrals per 500 students per day, we believe adopting these as our preferred results is fully consistent with our pre-registration plan.

17 We scale the precision of these null effects to the standard deviation of our outcomes across the full analytic sample. When we scale our outcome to the within-school standard deviation of our outcomes, our 95 percent confidence intervals are −0.12 to +0.02 SD and −0.11 to +0.04 SD units for the main effects of evaluation on classroom and subjective referrals, respectively.
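
Assuming the within-school standard deviation is computed from the school-demeaned outcome, a minimal Stata sketch of the rescaling is:

    * Within-school SD of the outcome (hypothetical names: odr, school)
    bysort school: egen odr_schmean = mean(odr)
    gen odr_demeaned = odr - odr_schmean
    quietly summarize odr_demeaned
    display "Within-school SD: " %6.3f r(sd)
    * Dividing point estimates and CI bounds by r(sd) expresses them in within-school SD units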

18 When we scale our outcome to the within-school standard deviation, the 95 percent confidence intervals on the moderating effects of PBIS are −0.09 to +0.13 and −0.08 to +0.11 SD units for classroom and subjective referrals, respectively.

19 We similarly find no effects on quadratic terms for pre-policy referral rates (class: 0.000 (0.004); subjective: 0.000 (0.006)) and in models where we average referral rates from the two years prior to policy implementation and then leave these two years out (class: 0.011 (0.044); subjective: 0.019 (0.038)).

20 We implement de Chaisemartin and D’Haultfoeuille’s (Citation2018) Wald-TC estimator using the May 2019 version of the did_multipleGT Stata package. Current versions of did_multipleGT implement the DIDM estimator (de Chaisemartin & D’Haultfoeuille, Citation2020); these return essentially identical results. Replications of the Wald-TC results will return slightly different values due to the use of bootstrapping for obtaining standard errors (we use 50 replications). To reduce computing resource demands, we estimate these results using our school-year sample, though in practice this does not meaningfully affect our standard errors as we cluster them at the state level.
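
For reference, a call of roughly the following form would produce this kind of estimate; the variable names are hypothetical and exact option names may differ across versions of the package.

    * de Chaisemartin & D'Haultfoeuille estimator (ssc install did_multiplegt)
    * Hypothetical variables: odr (outcome), school (group), year (time), eval (treatment)
    did_multiplegt odr school year eval, cluster(state) breps(50)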
