
Measuring Classroom Assessment Practice Using Instructional Artifacts: A Validation Study of the QAS Notebook

Pages 107-131 | Published online: 20 Sep 2012
 

Abstract

We report the results of a pilot validation study of the Quality Assessment in Science Notebook, a portfolio-like instrument for measuring teacher assessment practices in middle school science classrooms. A statewide sample of 42 teachers collected 2 notebooks during the school year, corresponding to science topics taught in the fall and spring. Each notebook was scored on 9 dimensions of assessment practice by 3 trained raters. Our analysis investigated the reliability and validity of notebook ratings, with particular emphasis on identifying key sources of error in the ratings. The results suggest that variation in teacher practice across notebooks (i.e., over time) was more important than idiosyncratic rater inconsistencies as a source of error in the scores. The validity results point to a dominant factor underlying the ratings and some predictive power of notebook ratings on student achievement. We discuss implications of the results for measuring assessment practice through artifacts, drawing conceptual and methodological lessons about our model of assessment practice, the consistency of raters, and the estimation of variance over time with classroom-based measures of instruction.

Notes

1The 1996 Standards are in the process of being replaced with new standards based on the Framework for K–12 Science Education (NRC, 2011). It is important to note that the two frameworks share many key features as they relate to instruction and particularly to assessment practice (NRC, 2001).

2The group included two university-based science education experts and two experienced science teachers. To operationalize the model of classroom assessment, the research team and expert advisors engaged in an iterative process of drafting, reviewing, discussing, and refining the dimensions and the accompanying rubrics over a period of 5 months. The two teachers then piloted a draft version of the notebook instrument and were asked to provide in-depth feedback on it and on the draft dimensions at that point (in writing and during a debriefing interview).

3Because teachers were dispersed across the state, we could not meet in person to offer training for completing the notebook. In addition to the detailed instructions offered in the notebook, we developed a training video for teachers, available for viewing over the Internet at each teacher's convenience, with a detailed account of notebook contents and the steps needed to complete the notebook.

4Participant teachers received $400 for collecting two notebooks. Students who returned signed consent forms were entered into a raffle for an iPod nano. Raters received an honorarium of $1,000 for attending training sessions and rating 28 notebooks over 1 week's time.

5The eight topics in the eighth-grade California Science Standards are motion, forces, structure of matter, solar system, chemical reactions, organic chemistry, periodic table, and density and buoyancy.

6Preliminary analyses of sets of notebooks covering the same science topics showed no consistent pattern suggesting that scores were systematically higher or lower for some topics.

7A true fully crossed rating design with all raters (n = 11) scoring all notebooks (n = 84) was infeasible due to time and resource constraints (each rater would have had to rate notebooks for three weeks). We used the method proposed by Chiu and Wolfe (2002) to subdivide the design into three independent fully crossed segments (i.e., three groups of raters scored both notebooks for a different sample of 14 teachers), combining the results to obtain a single parameter estimate for the whole sample. Two remaining raters were assigned to teachers across blocks using a modified balanced incomplete block design to compare the results from a sparse data matrix with those of Chiu and Wolfe's subdivision method. The results of this comparison are not presented here.
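The subdivision described above can be sketched as a simple partition: raters and teachers are split into independent blocks, each internally fully crossed, and the blocks are analyzed separately before averaging. The group sizes and labels below are illustrative assumptions, not the study's actual assignments.

```python
# Sketch of subdividing a rating design into fully crossed segments
# (Chiu & Wolfe, 2002). Labels and sizes are illustrative.

def make_segments(raters, teachers, n_segments=3):
    """Split raters and teachers into n_segments fully crossed blocks.

    Within each segment, every rater scores every teacher's notebooks;
    segments are analyzed separately and the variance-component
    estimates are averaged across segments.
    """
    r_per = len(raters) // n_segments
    t_per = len(teachers) // n_segments
    return [
        (raters[i * r_per:(i + 1) * r_per],
         teachers[i * t_per:(i + 1) * t_per])
        for i in range(n_segments)
    ]

raters = [f"R{i}" for i in range(9)]     # 9 raters in 3 groups of 3
teachers = [f"T{i}" for i in range(42)]  # 42 teachers, 2 notebooks each
segments = make_segments(raters, teachers)
# Each segment: 3 raters fully crossed with 14 teachers (28 notebooks).
```

Each segment is small enough to rate within the available time, while remaining fully crossed internally so that standard variance-component estimation applies within it.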

8The formula for relative reliability (ρ) resembles (1) and (2) but omits the rater and notebook main-effect terms (σ²_r, σ²_n) from the denominator, because absolute error (e.g., variation in rater stringency) shifts all teachers equally and would not affect rank orders in a crossed design.

9Descriptive analyses of the scores revealed systematic scoring patterns for some raters. One rater in particular (H) showed erratic rating behavior, often scoring much higher or lower than the other raters for a given notebook; this rater was removed from the data set for all subsequent analyses.

aVariance components are averages computed over three fully crossed segments (Chiu & Wolfe, 2002).

bVariance components accounting for less than 5% of the variance are not shown, for ease of interpretation.


*Correlation is statistically significant at p < .10.


a n = 42.

b n = 310.

aVariance components shown are averages computed over eight topic segments (Chiu & Wolfe, 2002); a separate design was estimated for each topic and the results averaged (n = 40).

