Value Added Metrics in Education

Does the Model Matter? Exploring the Relationship Between Different Student Achievement-Based Teacher Assessments

Pages 28-39 | Received 01 Aug 2012, Published online: 04 Nov 2014
 

Abstract

Policymakers have demonstrated an interest in using measures of student achievement to inform high-stakes teacher personnel decisions, but using student outcomes as a measure of teacher performance is complex to implement for a variety of reasons, not least of which is that there is no universally agreed-upon statistical methodology for translating student achievement into a measure of teacher performance. In this article, we use statewide data from North Carolina to evaluate different methodologies for translating student test achievement into measures of teacher performance at the elementary level. In particular, we focus on the extent to which teacher effect estimates generated from different modeling approaches differ, and the extent to which classroom-level characteristics predict these differences. We find that estimates from models that include only lagged achievement scores and student background characteristics are highly correlated with estimates from specifications that also include classroom characteristics, while value-added models (VAMs) estimated with school fixed effects show lower correlations with both. Teacher effectiveness estimates based on median student growth percentiles are highly correlated with estimates from VAMs that include student background characteristics, despite the fact that the two methods for estimating teacher effectiveness are, at least conceptually, quite different. But even where the correlations between performance estimates generated by different models are quite high for the workforce as a whole, there are still sizable differences in teacher rankings across models, and these differences are associated with the composition of students in a teacher's classroom.

Notes

The 2010 application for the TIF states “an applicant must demonstrate, in its application, that it will develop and implement a [performance-based compensation system] that rewards, at differentiated levels, teachers and principals who demonstrate their effectiveness by improving student achievement…” Furthermore, the application states that preference is given to applicants planning to use value-added measures of student growth “as a significant factor in calculating differentiated levels of compensation…” Similarly, the RttT selection criteria outlined in the 2010 application include the design and implementation of evaluation systems that “differentiate effectiveness using multiple rating categories that take into account data on student growth…as a significant factor…”

For reviews, see Goldhaber (2010) and Toch and Rothman (2008).

See, for instance, Aaronson, Barrow, and Sander (2007), Goldhaber, Brewer, and Anderson (1999), and Rivkin, Hanushek, and Kain (2005).

Several large firms—e.g., SAS, the Value Added Research Center at the University of Wisconsin, Mathematica, and Battelle for Kids—offer competing, though not necessarily fundamentally different, services for this translation process.

We use the terms teacher job performance and teacher effectiveness interchangeably.

Several states, including Tennessee, Ohio, Pennsylvania, and North Carolina, have contracted with SAS to use the Education Value-Added Assessment System (EVAAS), a particular type of value-added model. EVAAS uses several different types of models, depending on the objective (estimating school, district, or teacher effectiveness) and available data. None of the models include student background characteristics as control variables, but some models do include prior scores as covariates. The EVAAS layered teacher multivariate response model is a longitudinal, linear mixed model, where multiple years and subjects of student test scores are estimated simultaneously. In the univariate response model, scores are estimated separately by year, grade, and subject, with prior scores included as covariates in the model. For more detail on EVAAS, see Wright et al. (2010).
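
In stylized form (our simplification, not EVAAS's exact specification), the layered model treats a student's score in year $t$ as a grade-by-year mean plus the accumulated effects of all current and prior teachers:

$y_{it} = \mu_{gt} + \sum_{s \le t} \theta_{j(i,s)} + \varepsilon_{it},$

where $\theta_{j(i,s)}$ is the (random) effect of the teacher to whom student $i$ was assigned in year $s$. Wright et al. (2010) give the full multivariate specification.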

However, in practice this is exactly how SGPs have been used. For example, Denver Public Schools uses teacher-level median student growth percentiles to award teachers bonuses (Goldhaber and Walch 2012). Barlevy and Neal (2012) recommended a pay-for-performance scheme that uses a similar system.

However, other common estimation approaches include two-stage random effects and hierarchical linear modeling. For more information, see Kane and Staiger (2008).

For more on the theoretical assumptions underlying typical VAMs, see Harris, Sass, and Semykina (2010), Rothstein (2010), and Todd and Wolpin (2003).
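
For concreteness, a typical covariate-adjusted VAM of the kind referenced throughout these notes can be written (in our notation) as

$A_{it} = \lambda A_{i,t-1} + X_{it}\beta + \tau_{j(i,t)} + \varepsilon_{it},$

where $A_{it}$ is student $i$'s achievement in year $t$, $X_{it}$ collects student (and, in some specifications, classroom) characteristics, and the fixed effect $\tau_{j(i,t)}$ of student $i$'s year-$t$ teacher is the object interpreted as teacher effectiveness.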

We are unaware of any research that assesses the validity of SGP-based measures of individual teacher effects.

For an example of VAM comparisons at the school level, see Atteberry (2011).

Additional variables at the student level included race, gender, grade (4th or 5th), prior-year test score, and the prior-year test score in the alternate subject; teacher-level variables included years of experience, education, and credential status. Specifics about the methodology used to estimate the value-added measures reported by the Los Angeles Times are described in Buddin (2010).

The intertemporal stability of these estimates is similar to that found in other high-skill occupations thought to be comparable to teaching (Goldhaber and Hansen 2013).

The Simple Panel Growth Model is described as a simple longitudinal mixed effects model, where the school effects are defined as the deviation of a school's trajectory from the average trajectory. The covariate-adjusted VAMs with random school effects predict student achievement using multiple lagged math and reading scores and student background variables as covariates, along with school random effects.
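
A stylized rendering of the covariate-adjusted VAM with random school effects (our notation, not the authors' exact specification) is

$A_{it} = \sum_{k \ge 1} (\lambda_k^{M} M_{i,t-k} + \lambda_k^{R} R_{i,t-k}) + X_{it}\beta + u_{s(i,t)} + \varepsilon_{it}, \quad u_s \sim N(0, \sigma^2_u),$

where $M_{i,t-k}$ and $R_{i,t-k}$ are lagged math and reading scores and $u_{s(i,t)}$ is the random effect for student $i$'s school.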

This dataset has been used in many published studies (Clotfelter, Ladd, and Vigdor 2006; Goldhaber 2007; Goldhaber and Anthony 2007; Jackson and Bruegmann 2009).

B-spline cubic basis functions, described in Wei and He (2006), are used in the parameterization of the conditional percentile function to improve goodness-of-fit, and the calculations are performed using the SGP package in R (R Development Core Team 2008).
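
A minimal Python analogue of this estimation may help fix ideas (a sketch only: the article uses the SGP package in R, and the single prior score, knot placement, percentile grid, and variable names here are illustrative assumptions):

import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"prior": rng.normal(size=2000)})  # prior-year score
df["current"] = 0.7 * df["prior"] + rng.normal(scale=0.5, size=len(df))

# Cubic B-spline basis in the prior score, with interior knots at sample
# quantiles so an equal number of observations lies in each interval.
knots = tuple(df["prior"].quantile([0.2, 0.4, 0.6, 0.8]))
X = patsy.dmatrix("bs(prior, knots=knots, degree=3)", df, return_type="dataframe")

# Estimate the conditional percentile functions by quantile regression.
taus = np.arange(0.01, 1.00, 0.01)
fitted = np.column_stack(
    [sm.QuantReg(df["current"], X).fit(q=t).predict(X) for t in taus]
)

# A student's growth percentile is the share of fitted conditional
# percentiles below the observed current score (robust to quantile crossing).
df["sgp"] = (fitted < df["current"].to_numpy()[:, None]).sum(axis=1)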

Knots are the internal breakpoints that define a spline function. Splines must have at least two knots (one at each endpoint of the spline). The greater the number of knots, the more flexibly the function can fit the data. In this application, knots are placed at the quantiles stated above, meaning that an equal number of sample observations lie in each interval. Boundaries are the points at which the B-spline basis is anchored (by default, the range of the data). For more information on knots and boundaries, see Racine (2011). LOSS: lowest obtainable scale score; HOSS: highest obtainable scale score.

In particular, the conditional distribution used to compute a student's SGP is limited to the sample of students having at least an equal number of prior test scores.

We also generate teacher effectiveness estimates based on the mean of the student growth percentiles for each teacher, but find the estimates to be highly correlated with the MGPs (r = 0.96 in math; r = 0.93 in reading); we therefore report results only for MGPs.
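
Continuing the sketch above with a hypothetical teacher_id column, the two teacher-level aggregations are one-line groupwise summaries:

df["teacher_id"] = rng.integers(0, 100, size=len(df))  # hypothetical assignment
mgp = df.groupby("teacher_id")["sgp"].median()    # median growth percentile
mean_gp = df.groupby("teacher_id")["sgp"].mean()  # mean growth percentile
print(mgp.corr(mean_gp))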

Lockwood and McCaffrey (2012) argued that measurement error in prior-year test scores results in biased teacher effectiveness estimates and recommended using a latent regression that includes multiple years of prior scores to mitigate the bias.

Teacher fixed effects for the Student Background and Classroom Characteristics VAMs are generated with the user-written Stata program fese (Nichols 2008); the School FE VAM estimates, which require two levels of fixed effects, are generated with the user-written Stata program felsdvreg (Cornelissen 2008).
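
For readers without Stata, the following Python sketch shows what fese does conceptually: recovering the teacher effects and their standard errors by OLS on teacher indicators. The data and column names are synthetic stand-ins, and replicating felsdvreg's two-level fixed effects would require additional machinery not shown here.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "lag_score": rng.normal(size=n),
    "frl": rng.integers(0, 2, size=n),       # free/reduced lunch indicator
    "teacher_id": rng.integers(0, 50, size=n),
})
df["score"] = 0.7 * df["lag_score"] - 0.1 * df["frl"] + rng.normal(scale=0.5, size=n)

# Dropping the intercept makes each teacher dummy that teacher's effect
# (identified only up to a normalization, as in any fixed-effects model).
fit = smf.ols("score ~ lag_score + frl + C(teacher_id) - 1", data=df).fit()
teacher_fe = fit.params.filter(like="C(teacher_id)")
teacher_se = fit.bse.filter(like="C(teacher_id)")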

Note that some student background variables are not available in all years. VAMs include separate indicators for learning disability in math, reading, and writing.

Because we estimate VAMs with teacher fixed effects, the effects of the classroom characteristics are identified off of variation within teachers across years. Classroom-level variables are continuous, and the effects are assumed to be linear. If teachers have similar classroom characteristics across time periods (for example, classrooms with 70% FRL and 72% FRL in consecutive years), the effect of percent FRL will be based on small within-teacher differences and may be attenuated (Ehlert et al. 2012). Ashenfelter and Krueger (1994) discussed a similar problem with using pairs of twins to investigate the economic returns to schooling. They found small within-pair differences in schooling, leading to a high degree of measurement error.
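
The attenuation here is the standard errors-in-variables result; in our notation, if the within-teacher variation in a classroom characteristic has true variance $\sigma^2_{x^*}$ and classical measurement error with variance $\sigma^2_u$, then

$\text{plim}\,\hat{\beta} = \beta \cdot \frac{\sigma^2_{x^*}}{\sigma^2_{x^*} + \sigma^2_u}.$

Within-teacher demeaning shrinks the signal variance $\sigma^2_{x^*}$ without reducing $\sigma^2_u$, pulling the estimated coefficient toward zero.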

See Goldhaber, Walch, and Gabele (2012) for comparisons of these omitted specifications (available online at http://www.cedr.us/papers/working/CEDR%20WP%202012-6_Does%20the%20Model%20Matter.pdf). We also estimate two-stage random effects VAMs similar to the school-level VAMs described in Ehlert et al. (2012). We find these estimates to be highly correlated (r = 0.99) with estimates from the single-stage Student Background VAM and Classroom Characteristics VAM specifications described above.

The correlations between the adjusted and unadjusted effects are greater than 0.97 in math and greater than 0.94 in reading for each of the VAMs.

Adjusted effect sizes are calculated using the following equation: $\sigma_{\text{adj}} = \sqrt{\hat{\sigma}^2_{\hat{\tau}} - \sum_j k_j \, SE(\hat{\tau}_j)^2 / \sum_j k_j}$, where $\hat{\sigma}^2_{\hat{\tau}}$ represents the variance of the estimated teacher effects, $k_j$ represents the number of student observations contributing to each estimate, and $SE(\hat{\tau}_j)$ represents the standard errors of the estimated teacher effects.
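
A minimal numerical sketch of this adjustment (the student-weighted averaging of squared standard errors is our reading of the role of $k_j$):

import numpy as np

def adjusted_effect_sd(tau_hat, se, k):
    """Shrink the raw SD of estimated teacher effects for sampling error.
    tau_hat: estimated effects; se: their standard errors;
    k: student observations per teacher (equal-length arrays)."""
    tau_hat, se, k = map(np.asarray, (tau_hat, se, k))
    raw_var = tau_hat.var()
    noise_var = np.average(se**2, weights=k)  # k_j-weighted mean squared SE
    return np.sqrt(max(raw_var - noise_var, 0.0))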

Research that estimates within-school teacher effects tends to find smaller effect sizes, in the neighborhood of 0.10 (Hanushek and Rivkin 2010).

For instance, the student fixed effects specifications (not reported here), which are favored by some academics in estimating teacher effectiveness, were very imprecisely estimated.

Recall that the table in the text includes descriptions of the various model specifications and methods of estimating teacher effectiveness. We do not focus on the extent to which effectiveness in one subject corresponds to effectiveness in another subject. For more information on cross-subject correlations, see Goldhaber, Cowan, and Walch (2012).

Spearman rank correlations are similar to the Pearson correlation coefficients reported in the text. Additionally, the correlations between the EB-adjusted VAM estimates are very similar to the reported correlations using the unadjusted estimates; in nearly all cases, the differences in correlation coefficients are 0.01 or less.
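
Both correlation types are one-liners in Python; the arrays below are synthetic stand-ins for two models' teacher estimates:

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
vam_a = rng.normal(size=500)                     # estimates from one model
vam_b = vam_a + rng.normal(scale=0.3, size=500)  # a correlated alternative
print(pearsonr(vam_a, vam_b)[0], spearmanr(vam_a, vam_b)[0])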

For example, in math at the elementary level, Burke and Sass (2008) found that a one-point increase in mean peer achievement results in a statistically significant gain-score increase of 0.04 points—equivalent to an increase of 0.0015 standard deviations of achievement gains.

One of the classroom-level variables we include is class size, and one might hypothesize that the high correlation between models with and without classroom-level controls has to do with the fact that having a more challenging class (e.g., more high-poverty students) tends to be offset by having a smaller class size. To test this, we re-estimate the Classroom Characteristics VAM excluding class size, but find that this has little effect on the correlations. This finding is consistent with Harris and Sass (2006).

These correlations are higher than those reported in a similar comparison using school effectiveness measures (Ehlert et al. 2012). One explanation for this is that random factors (e.g., students having the flu on testing day) that influence student achievement, and hence teacher effectiveness measures, are relatively more important when the level of aggregation is the classroom rather than the school. This is indeed what one would expect if the random shocks primarily occur at the classroom level but are not strongly correlated across classrooms in a school.

The means of the absolute value of the difference between Student Background and Classroom Characteristics VAM estimates are consistently close to 0.02 in math and 0.013 in reading at the 10th, 25th, 50th, 75th, and 90th percentiles of the effectiveness distribution.

We focus exclusively on prior achievement, free/reduced price lunch, and minority students because there is relatively little variation in other student characteristics at the classroom level in the North Carolina sample. Since we estimate two-year teacher effects, we calculate classroom-level characteristics based on two years of data. To allow for nonlinear relationships, we also include squared and cubed terms for each classroom characteristic.
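
A minimal sketch of such a regression (hypothetical column names, with synthetic data standing in for the North Carolina sample):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1000
cls = pd.DataFrame({
    "prior_ach": rng.normal(size=n),       # classroom mean prior achievement
    "frl": rng.uniform(0, 1, size=n),      # share free/reduced price lunch
    "minority": rng.uniform(0, 1, size=n), # share minority students
})
cls["diff"] = 0.02 * cls["frl"] + rng.normal(scale=0.05, size=n)

# Squared and cubed terms allow the between-model difference in a teacher's
# estimate to vary nonlinearly with classroom composition.
fit = smf.ols(
    "diff ~ prior_ach + I(prior_ach**2) + I(prior_ach**3)"
    " + frl + I(frl**2) + I(frl**3)"
    " + minority + I(minority**2) + I(minority**3)",
    data=cls,
).fit()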

The analytic sample includes 7,672 “advantaged” classrooms, 3,820 “average” classrooms, and 8,002 “disadvantaged” classrooms according to these definitions. Since the student characteristics are aggregated over two years, these are not true classrooms but rather averages across two classrooms.

When we exclude class size from the Classroom Characteristics VAM, we find a more (but not completely) equitable distribution of teacher effectiveness across stylized classrooms.

Ehlert et al. (2012) discussed a similar case in which school-level control variables are included in VAMs with school fixed effects. The school-level control variables are identified using variation within schools over time, and because the changes in student composition within schools over time are often small, this variation is unlikely to capture systematic differences in school context. Therefore, the ratio of variation due to measurement error to total identifying variation will be greater when school fixed effects are included in the model, attenuating the coefficients on the school-level covariates. The same issue arises with the estimation of teacher fixed effects.

While not reported here, we also estimated teacher effects with VAMs that included student fixed effects and found the year-to-year correlations to be much lower than for the other VAMs and the MGPs due to the imprecision of the estimates. This is consistent with McCaffrey et al. (2009).

Ehlert et al. (2012) made the case for using a proportional, two-stage value-added approach that assumes the correlation between growth measures of effectiveness and student background covariates is zero, forcing comparisons to be between similar teachers and schools.
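
In the spirit of that approach, a two-stage sketch (our simplification, not Ehlert et al.'s exact procedure; names are hypothetical) first residualizes student growth on background covariates, mechanically imposing a zero correlation between the covariates and the resulting effectiveness measure, and then aggregates residuals to the teacher level:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
df = pd.DataFrame({
    "lag_score": rng.normal(size=n),
    "frl": rng.integers(0, 2, size=n),
    "teacher_id": rng.integers(0, 200, size=n),
})
df["score"] = 0.7 * df["lag_score"] - 0.1 * df["frl"] + rng.normal(scale=0.5, size=n)

# Stage 1: student-level regression on background covariates only.
df["resid"] = smf.ols("score ~ lag_score + frl", data=df).fit().resid

# Stage 2: a teacher's effect is the mean residual of his or her students.
teacher_effect = df.groupby("teacher_id")["resid"].mean()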