Intervention, Evaluation, and Policy Studies

Teacher Performance Ratings and Professional Improvement

Pages 90-115 | Received 06 Oct 2017, Accepted 04 Jun 2018, Published online: 29 Nov 2018

Abstract

Like other public workers, teachers typically receive high and compressed ratings that do little to differentiate them based on performance. Motivated by empirical evidence of substantial variation in effectiveness among teachers, there has been a recent push to develop more informative evaluation systems with greater ratings dispersion. We study one of the first of these new systems, implemented in Tennessee, in order to understand how teachers respond to the provision of new, more differentiated performance ratings. We focus on whether summative ratings influence teachers’ self-reported, self-directed professional improvement activities as measured by four items on a statewide teacher survey. Using a regression discontinuity design we find no evidence that teachers alter their time investments in professional improvement, or adjust their professional improvement activities based on evaluation feedback, in response to their ratings.

Acknowledgments

Koedel is in the Department of Economics and Truman School of Public Affairs, and Li and Tan are in the Department of Economics, at the University of Missouri. Springer is in the Peabody College of Education and Human Development at Vanderbilt University. This study was supported by the Tennessee Education Research Alliance at Vanderbilt University’s Peabody College, a unique research-practice partnership between Vanderbilt University and the State of Tennessee. We appreciate helpful comments and suggestions from Dale Ballou, Jason Grissom, Colleen Heflin, Peter Mueser, Michael Podgursky and Nate Schwartz. We would also like to acknowledge the many individuals at the Tennessee Alliance and Tennessee Department of Education for providing data and expert insight to conduct our analysis, in particular, Susan Burns, Erin O’Hara, and Matthew Pepper. The usual disclaimers apply.

Notes

1 A follow-up study by Kraft and Gilmour (Citation2017) shows some movement toward more differentiated teacher evaluations nationally, but generally confirms the basic conditions found by Weisberg et al. (Citation2009). Grissom and Loeb (Citation2017) document that principals are more likely to assign lower ratings to teachers in low-stakes settings than in high-stakes settings.

2 Rating compression in the public sector is particularly acute, but the phenomenon is not unique to public workers (e.g., see Murphy & Cleveland, Citation1991; Prendergast, Citation1999).

4 Kraft and Gilmour (Citation2017) show that the new system in Tennessee is among the most aggressive in the country in terms of identifying teachers as below proficient.

5 In addition to the “system maturity” issue mentioned in the text, another limitation of the analysis of ratings in the first year is that the first follow-up survey in Tennessee asked fewer questions about professional improvement, so we cannot perform as thorough an analysis. These issues make our null findings from the first year of the evaluation less compelling and prompted us to examine the second year of data for this work. That said, the similarity of the results in the first and second years further corroborates our null findings.

6 Approximately 45 percent of teachers in Tennessee have an individual growth measure.

7 Teachers could receive the rating in two ways. First, they could log into the management information system that maintains the teacher evaluation data to see their discrete performance rating from the prior school year. Second, prior to any observations being conducted for the current school year, teachers had a one-on-one meeting with a certified observer to review the prior year’s evaluation.

8 The largest reduction in the sample due to the bandwidth restriction is the loss of level-5 teachers with scores far above the 4/5 threshold, which is of limited consequence given the nature of identification in our RD specifications (see below). We additionally exclude teachers evaluated under the COACH model, because it produces a lumpy distribution of underlying scores that is not compatible with the RD models, as well as teachers missing basic covariates.

9 Although most of the differences between samples in Table 1 are statistically significant, this reflects the very large sample sizes. For example, differences between the samples in columns 1 and 4 that are clearly not substantively meaningful, such as differences across teacher education levels, are nonetheless statistically significant.

10 On average across the four questions, 9.3 percent of survey respondents did not answer by choice, and 12.7 percent were directed to skip the professional improvement questions (along with other questions) due to a position change. Individuals who changed positions were directed to a different set of questions based on their new positions. Response patterns are similar if we include individuals who do not answer the question as “nonpositive” in the denominator of each ratio—see Supplemental Appendix B.

11 For example, three of the survey questions we study give four response options: (a) strongly disagree, (b) disagree, (c) agree, and (d) strongly agree (see Supplemental Appendix A). If we assign the option “agree” a value of 3 and the option “strongly disagree” a value of 1, it implies the value of “agree” is equal to three times the value of “strongly disagree.”
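
To make the coding concrete, a minimal illustration of the numeric mapping (hypothetical responses; the labels follow Supplemental Appendix A, and the variable names are ours for illustration):

```python
import pandas as pd

# Illustrative numeric coding of the four-option responses. Assigning equally
# spaced integers imposes a cardinal scale: "agree" (3) is treated as worth
# three times "strongly disagree" (1), which is the assumption discussed here.
likert_map = {
    "strongly disagree": 1,
    "disagree": 2,
    "agree": 3,
    "strongly agree": 4,
}

responses = pd.Series(["agree", "strongly agree", "disagree"])  # hypothetical data
print(responses.map(likert_map).tolist())  # [3, 4, 2]
```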

12 With one common factor, the average communality among the questions is 0.32, which is low for a typical factor analysis. That said, in results omitted for brevity we verified the robustness of our findings to a model that uses an “index of professional improvement intensity and responsiveness” as the outcome of interest. The index is a weighted average of answers to the four questions, where the weights are determined by factor analysis, as in Koedel et al. (Citation2017).
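
For readers interested in the mechanics, a minimal sketch of constructing such a factor-weighted index (the data, question names, and weighting normalization are illustrative assumptions, not the study's survey file or exact procedure):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Hypothetical numerically coded answers to four professional improvement
# questions (q1-q4), one row per teacher; a common latent factor is built in
# so the one-factor model has something to recover.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
df = pd.DataFrame(
    {f"q{i}": latent + rng.normal(scale=1.5, size=500) for i in range(1, 5)}
)

# One-factor model; the loadings serve as weights for the index.
fa = FactorAnalysis(n_components=1, random_state=0)
fa.fit(df)
loadings = fa.components_.ravel()

# Weighted average of standardized answers, analogous in spirit to an
# "index of professional improvement intensity and responsiveness".
z = (df - df.mean()) / df.std()
pi_index = z.values @ (loadings / np.abs(loadings).sum())
print(pd.Series(pi_index).describe())
```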

13 The running variable is not perfectly continuous due to some discreteness in teachers’ scores on the subcomponents. The end result is that the values of the running variable cluster around 0.5-unit intervals throughout the range of possible scores. Although the discreteness in the running variable is not egregious in our application by any means, in results omitted for brevity we investigate its implications by estimating variants of our models where we used two-dimensional clustering at the school level and the 0.5-unit-interval level, as suggested by Lee and Card (Citation2008). The standard errors from the alternatively clustered models are very similar to what we report below, and do not alter inference from our analysis in any way.
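
One way to implement this kind of two-dimensional clustering is the standard add-and-subtract construction (school-level covariance plus interval-level covariance minus the covariance clustered on their intersection). A rough sketch on simulated placeholder data; the variable names, threshold, and specification are illustrative, not the paper's exact models:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def twoway_cluster_cov(y, X, g1, g2):
    """Two-way clustered covariance via the add-and-subtract rule:
    V = V(g1) + V(g2) - V(g1 intersected with g2)."""
    def clustered_cov(groups):
        codes = pd.factorize(pd.Series(groups).astype(str))[0]
        res = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": codes})
        return res.cov_params()
    inter = pd.Series(g1).astype(str) + "_" + pd.Series(g2).astype(str)
    return clustered_cov(g1) + clustered_cov(g2) - clustered_cov(inter)

# Hypothetical RD-style data: school IDs, running variable, treatment, outcome.
rng = np.random.default_rng(0)
n = 2000
school = rng.integers(0, 200, n)
score = rng.uniform(2.5, 4.5, n)
interval = np.round(score * 2) / 2          # 0.5-unit clusters of the running variable
treat = (score >= 3.5).astype(float)        # above an illustrative rating threshold
y = 0.1 * score + rng.normal(size=n)

X = sm.add_constant(np.column_stack([treat, score]))
V = twoway_cluster_cov(y, X, school, interval)
print(np.sqrt(np.diag(V)))                  # two-way clustered standard errors
```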

14 We use the Akaike information criterion (AIC) test to determine the polynomial order for our primary specification. Adding higher order polynomial terms (up to quartic) of the running variable to the models does not influence our results qualitatively.
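
For concreteness, a minimal sketch of an AIC comparison across polynomial orders of the running variable (simulated placeholder data; allowing the polynomial to differ on each side of the threshold is one common choice assumed here for illustration, not taken from the paper):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 3000
score = rng.uniform(-1.0, 1.0, n)        # running variable, centered at the threshold
treat = (score >= 0).astype(float)
y = 0.2 * score + rng.normal(size=n)     # hypothetical outcome

aic_by_order = {}
for order in range(1, 5):                # linear through quartic
    poly = np.column_stack([score ** k for k in range(1, order + 1)])
    X = sm.add_constant(np.column_stack([treat, poly, treat[:, None] * poly]))
    aic_by_order[order] = sm.OLS(y, X).fit().aic

best = min(aic_by_order, key=aic_by_order.get)
print(aic_by_order, "-> preferred order:", best)
```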

15 We drop records with a missing response to each survey question when that question is used as the dependent variable. In results omitted for brevity, we confirm that our findings are not qualitatively affected if we instead directly model missing responses (i.e., if we treat them as separate outcomes in multinomial models). Our findings are also very similar if we include school or district fixed effects.
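
As a rough illustration of treating missing responses as a separate outcome category, a multinomial logit sketch (the category coding and data are hypothetical, not the exact robustness specification used in the paper):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 3000
score = rng.uniform(-1.0, 1.0, n)        # running variable, centered at the threshold
treat = (score >= 0).astype(float)

# Outcome categories: 0 = missing, 1 = nonpositive response, 2 = positive response.
outcome = rng.choice([0, 1, 2], size=n, p=[0.1, 0.3, 0.6])

X = sm.add_constant(np.column_stack([treat, score]))
mnl = sm.MNLogit(outcome, X).fit(disp=0)
print(mnl.summary())
```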

16 Although researchers can address the direct threat by controlling in the regression for the covariates that are discontinuous at the threshold, the emergence of discontinuities in observables raises the concern that there are other, unobserved discontinuities as well.
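
A minimal sketch of this kind of covariate-smoothness check, regressing an observable on the treatment indicator and the running variable at the threshold (the covariate and data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 3000
score = rng.uniform(-1.0, 1.0, n)                    # running variable, centered at threshold
treat = (score >= 0).astype(float)
experience = 8 + 0.5 * score + rng.normal(size=n)    # hypothetical observable covariate

# A significant coefficient on the treatment indicator would signal a
# discontinuity in this observable at the rating threshold.
X = sm.add_constant(np.column_stack([treat, score]))
res = sm.OLS(experience, X).fit(cov_type="HC1")
print(res.params[1], res.bse[1])  # discontinuity estimate and standard error
```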

17 Koedel et al. (Citation2017) found some evidence of score rounding in the Tennessee system in the first year, but there is no evidence of the type of bunching one would expect from score rounding in our data from the second year.

18 This approach follows that of McCrary and Royer (Citation2011), who encounter a related problem in their investigation of the effects of female education on fertility and infant health. We exclude from these regressions teachers outside of the bandwidth, teachers assessed under COACH, and teachers with missing basic covariates.

19 The question-by-question sample sizes in Table 6 are slightly smaller than the sample sizes reported in columns 4 and 5 of Table 1 because not all teachers who answered at least one survey question answered every question.

20 We reach a qualitatively similar conclusion with the numerically coded models, where our point estimates and standard errors rule out effect sizes larger than 0.25 for question 1, and 0.05–0.07 for questions 2–4. The sample average value of the numerically coded answers to question 1 is 3.6, and for questions 2–4 (with fewer choices, per Supplemental Appendix A) the sample averages range from 2.2–2.8.

21 In results omitted for brevity, we further verify that our findings are robust to reweighting the observations so that teachers with scores closer to the discontinuity thresholds receive higher weights than those farther away within the main bandwidth.
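
One way to implement such a reweighting is with kernel weights that decline linearly in distance from the threshold; a sketch using triangular weights within an illustrative bandwidth (the kernel choice, bandwidth, and data are assumptions for illustration, not the paper's exact scheme):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 3000
score = rng.uniform(-1.0, 1.0, n)        # running variable, centered at the threshold
treat = (score >= 0).astype(float)
y = 0.2 * score + rng.normal(size=n)     # hypothetical outcome

bandwidth = 1.0
weights = np.clip(1 - np.abs(score) / bandwidth, 0, None)  # triangular kernel weights

X = sm.add_constant(np.column_stack([treat, score]))
res = sm.WLS(y, X, weights=weights).fit(cov_type="HC1")
print(res.params[1], res.bse[1])         # weighted RD estimate at the threshold
```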

22 This is a simplification in the following sense: seeing the rating is necessary for treatment if the treatment is viewed purely as informational to the individual teacher, but we cannot rule out that discrete ratings affect behaviors even if teachers never see the ratings. For example, school principals may respond to the ratings, which they also observe, and this could affect how they interact with teachers, and in turn affect teachers’ behaviors related to professional improvement.

23 Results for TEAM teachers only are similar and omitted for brevity.

24 A smaller effect would be possible because the aggregated growth measures for these teachers are contemporaneous and thus may include the teacher’s own effect along with the effects of other teachers in the aggregation.

25 Still, it would be of interest to study whether timing delays are a mechanism that contributes to our findings. We lack variation within our sample to investigate this hypothesis empirically but such variation may exist in other locales or over time as the systems in Tennessee and elsewhere mature. As noted above, timing issues may also play a role in our findings regarding teacher attrition and how they differ from Dee and Wyckoff (Citation2015).

26 Moreover, it suggests that the gains in teacher improvement observed by Taylor and Tyler (Citation2012) in Cincinnati are unlikely to be the result of the revelation of summative performance information.

27 In results omitted for brevity, we have verified that the qualitative feedback that teachers receive from their observers, as measured by word counts in reinforcement (positive) and refinement (negative) areas, is not discontinuous at the rating thresholds used in our regression-discontinuity models. This is as expected under the RD assumptions. The analysis of qualitative feedback data was performed for TEAM teachers only because we do not have these data for teachers evaluated under the other evaluation models.

28 See Supplemental Appendix E for a brief, descriptive analysis of teachers’ evaluations on the classroom observation component of the rating specifically.

29 Unless it is offset by some form of positive reporting bias among higher-rated teachers at the thresholds. To be clear, such a bias would not reflect greater improvement activity among higher-rated teachers, which would be part of a real effect, but rather more positive reporting relative to actual activity. Although we have no means to formally rule out this possibility, it seems unlikely.

30 There is also the technical concern that measurement error in the dependent variable will bias the estimates from our nonlinear models. However, the substantive importance of this concern is ruled out by the similarity of results from the linear RD models shown in Supplemental Appendix C, in which dependent-variable measurement error does not cause bias.
