
The Consequences of Using One Assessment System to Pursue Two Objectives

Pages 339-352 | Published online: 27 Sep 2013

Abstract

Education officials often use one assessment system both to create measures of student achievement and to create performance metrics for educators. However, modern standardized testing systems are not designed to produce performance metrics for teachers or principals; they are designed to produce reliable measures of individual student achievement in a low-stakes testing environment. The design features that promote reliable measurement also provide opportunities for teachers to profitably coach students on test-taking skills, and educators typically exploit these opportunities whenever modern assessments are used in high-stakes settings as vehicles for gathering information about their performance. Because these coaching responses often contaminate measures of both student achievement and educator performance, it is likely possible to acquire more accurate measures of both by developing separate assessment systems designed specifically for each measurement task.


Acknowledgments

The author thanks the Searle Freedom Trust for research support and also thanks Lindy and Michael Keiser for research support through a gift to the University of Chicago's Committee on Education. He further thanks Michael Greenstone, Diane Whitmore Schanzenbach, and Robert S. Gibbons for useful comments, Robert D. Gibbons and David Thissen for their insights on psychometrics and assessment development, and Ian Fillmore, Sarah A. G. Komisarow, and Richard Olson for excellent research assistance.

Notes

1. The SMARTER Balanced Assessment Consortium (SBAC) and the Partnership for Assessment of Readiness for College and Careers (PARCC) are the two groups developing assessment systems for the Common Core State Standards using funds awarded as part of the Obama Administration's Race to the Top initiative.

2. The pattern described here is not definitive proof that student test score gains on high-stakes assessments fail to reflect real gains in subject mastery, because the parallel low-stakes assessments are never identical in subject content; see Koretz (2002) for more on this point. However, many of the studies in this literature present results that are difficult to explain credibly without some account of how high-stakes assessment scores can rise quickly without commensurate improvements in true levels of math or reading achievement.

3. Glewwe, Ilias, and Kremer (2010), Jacob (2005), Klein et al. (2000), Koretz and Barron (1998), Koretz (2002), and Vigdor (2009) all present results that show divergence between student assessment results on two assessments of the same subject matter in settings where one assessment became a high-stakes assessment for educators and the other assessment continued to involve relatively low stakes. Neal (2012) provides a detailed discussion of this literature.

4. See Campbell (1979), Kerr (1995), and Rothstein (2009) for many examples.

5. The effort distortions induced by assessment-based accountability are not one-time costs. Any system that induces teachers to adopt teaching methods that raise test scores but degrade the true quality of instruction imposes an ongoing cost on students, and students bear these costs throughout all grades and classes where their teachers are subject to assessment-based accountability.

6. See Hambleton, Swaminathan, and Rogers (1991, 135).

7. See Kolen and Brennan (2010, 19).

8. For more on IRT models, see Hambleton, Swaminathan, and Rogers (1991).
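As a concrete point of reference (a standard textbook specification, not a formula taken from this article), the three-parameter logistic (3PL) IRT model gives the probability that a student with ability θ answers item i correctly as

    P_i(θ) = c_i + (1 - c_i) / (1 + exp[-a_i(θ - b_i)])

where a_i governs the item's discrimination, b_i its difficulty, and c_i the chance of answering correctly by guessing.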

9. See Bay-Borelli et al. (2010, 25).

10. Kolen and Brennan (2010) asserted that proper ex post equating of the results from different exam forms is not possible without an ex ante commitment to systematic procedures that govern item and form development, and they gave several examples of cases where equating procedures did not work well ex post because different assessment forms in a series were not developed and administered in a consistent manner (see Chapter 8).
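For illustration, one of the simplest methods in this literature is linear equating, which places a score x from form X on the form-Y scale by matching the means and standard deviations of the two score distributions:

    l_Y(x) = (σ_Y / σ_X)(x - μ_X) + μ_Y

Even this simple transformation presumes that the two forms were built to common content and statistical specifications; when forms are not developed consistently, no ex post adjustment of this kind can render their scores comparable, which is the failure Kolen and Brennan document.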

12. The new exam was not given in the first quarter of 2004, and pass rates historically vary by quarter, with first-quarter pass rates falling below the corresponding year-wide averages. The pass rate over the final three quarters of 2004 was almost identical to the 2005 annual pass rate, and the 2004 annual rate might well have fallen slightly below it had the exam been given in all four quarters of 2004.

13. The pass rates for the other three components of the exam follow a similar pattern, but those patterns are more difficult to interpret because both the format and the item content of the other three exams changed substantially to reflect new international standards for accounting. The 2011 drop in annual pass rates is less than one percentage point for BEC and roughly five percentage points for AUD and FAR. For all three exams, the decline in pass rates is more pronounced when one compares pass rates from the first two quarters of 2011 with those from the first two quarters of 2010. The changes in content specifications for all exams were announced more than a year before the 2011 exams were administered.

14. See Lazear and Rosen (1981) as well as Chapters 10 and 11 in Lazear and Gibbs (2008).

15. The performance metric we propose is called the Percentile Performance Index (PPI). It is similar in construction to the Student Growth Percentile (SGP) measures already used in some states as accountability measures (see Betebenner 2009). Free PPI software is available at http://sites.google.com/site/dereknealresearch/home/pay-for-percentile-software.
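The cited software defines the actual procedure; purely as an illustration of the general idea, the following Python sketch (hypothetical function and field names, not the authors' implementation) ranks each student's end-of-year score within a comparison group of students with similar baseline scores and averages those percentile ranks by teacher.

    # Illustrative sketch only -- not the authors' PPI software.
    from collections import defaultdict

    def percentile_index(students, n_bins=10):
        """Average within-group percentile rank of each teacher's students.

        students: list of dicts with keys 'teacher', 'baseline', 'score'.
        """
        # Sort by baseline score and split into comparison groups
        # (deciles by default), so students are ranked against similar peers.
        ordered = sorted(students, key=lambda s: s['baseline'])
        groups = defaultdict(list)
        for rank, s in enumerate(ordered):
            groups[rank * n_bins // len(ordered)].append(s)

        # Within each comparison group, convert end-of-year scores to
        # percentile ranks (fraction of peers scoring strictly lower).
        by_teacher = defaultdict(list)
        for group in groups.values():
            scores = [x['score'] for x in group]
            for s in group:
                below = sum(sc < s['score'] for sc in scores)
                by_teacher[s['teacher']].append(below / max(len(group) - 1, 1))

        # A teacher's index is the mean percentile rank of his or her students.
        return {t: sum(p) / len(p) for t, p in by_teacher.items()}

Ranking within baseline groups is what makes an index of this kind a measure of relative growth rather than a measure of the incoming achievement of a teacher's students.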

16. Standard results on optimal incentive contracts show that if educators are risk-neutral, a reduction in reliability does not hamper efficient incentive provision. On the other hand, if educators are risk-averse, they will demand to be compensated for assuming the extra risk created by any drop in reliability. However, as the number of students that any educator or group of educators teaches grows large, this effect may well become a second-order concern.
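One way to see the second-order claim (a standard linear-contract calculation, not spelled out in the note): a risk-averse educator with CARA risk aversion r who is paid a bonus β per unit of a performance measure with noise variance σ² demands a risk premium of roughly (r/2)β²σ². If the measure averages over n students with independent noise, its variance is σ²/n, so the required premium shrinks in proportion to 1/n as the number of students grows.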

17. See Prendergast (1999) and Neal (2012).
