ABSTRACT
To meet recent accountability mandates, school districts are implementing assessment frameworks to document teachers’ effectiveness. Observational assessments play a key role in this process, albeit without compelling evidence of their psychometric rigor. Using a sample of kindergarten teachers, we employed Generalizability theory to investigate (across teachers, raters, and lessons) the stability of scores obtained with two different observation measures: the CLASS K-3 and the FFT. We conducted a series of Decision studies to document (for both measures’ constituent domains) the number of lessons per teacher and raters per lesson that would justify the use of observation scores for high-stakes decisions. Acceptably stable scores for individual-level decisions about teachers may generally require more raters and lessons than are typically used in practice (1–2 raters and fewer than 3 lessons). The considerable variability of observation-based scores raises concerns about either measure’s appropriateness for making individual or group decisions about teachers’ effectiveness.