Evidence-Centered Assessment Design and the Advanced Placement Program®: A Psychometrician's Perspective

Pages 392-400 | Published online: 11 Oct 2010

Abstract

This paper provides an overview of evidence-centered assessment design (ECD) and some general information about the Advanced Placement (AP®) Program. The papers in this special issue are then discussed as they relate to the use of ECD in the revision of various AP tests. The paper concludes with some observations about the need to validate various claims that are often made about ECD.

The principal focus of the articles in this special issue is the ongoing application of evidence-centered assessment design (ECD) in the revision of various tests that are part of the Advanced Placement (AP®) Program. I begin my discussion by providing an overview of ECD as it is currently conceptualized. Then I consider the current status of the AP Program as a backdrop for a subsequent consideration of the contributions made by the articles in this issue. I conclude with some observations about ECD, in general, and in the context of the AP Program, in particular.

In short, this article is not simply a review of the articles in this special issue. My discussion of the revised AP exams is embedded within a consideration of ECD, in general, and my perspectives on ECD are considered in the broader context of validation.

EVIDENCE-CENTERED ASSESSMENT DESIGN

One of the most accessible descriptions of ECD is Mislevy and Haertel (2006). In that paper the authors argue that ECD

views an assessment as an evidentiary argument: an argument from what we observe students say, do, or make in a few particular circumstances, to inferences about what they know, can do, or have accomplished more generally. (p. 7)

ECD is a “principled” and structured approach to assessment development that proponents believe should enhance valid inferences about test scores, as well as the creation of forms that are more comparable than they otherwise would be. A central focus of ECD is a transparent evidentiary argument that warrants the inferences made from student test performance. The evidentiary argument requires that each assessment goal be expressed as a claim that is supported by observable evidence.

According to Mislevy and Haertel (2006), ECD is organized in terms of five layers or components: (a) Domain Analysis, which involves gathering substantive information about the domain to be assessed; (b) Domain Modeling, which involves expressing the assessment argument in narrative form based on the Domain Analysis; (c) Conceptual Assessment Framework, which expresses the assessment argument in structures and specifications for tasks and tests, evaluation procedures, and measurement models; (d) Assessment Implementation, which implements the assessment including presentation-ready tasks and calibrated measurement models; and (e) Assessment Delivery, which coordinates the interaction of students and tasks, task- and test-level scoring, and reporting (Mislevy & Haertel, 2006, p. 8).

The Conceptual Assessment Framework consists of: (i) a Student Model that expresses what the assessment designer is trying to measure in terms of variables that reflect aspects of students' performance; (ii) a Task Model that describes the environment in which students say, do, or make something to provide evidence; and (iii) an Evidence Model that bridges the Student Model and Task Model and has two components—an Evaluation Model that provides algorithms or rubrics for scoring, and a Measurement Model that involves piecing together various item response theory (IRT) or other models (Mislevy & Haertel, 2006, pp. 10, 13, and 14).
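
To make the structure of the Conceptual Assessment Framework more concrete, the following minimal sketch represents its three models as simple data structures. The class and field names are my own illustrative choices, not part of the ECD specification or the AP Program's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class StudentModel:
    # Variables reflecting the aspects of proficiency the designer intends to measure
    proficiency_variables: List[str]

@dataclass
class TaskModel:
    # The environment in which students say, do, or make something that yields evidence
    stimulus_features: List[str]
    work_products: List[str]

@dataclass
class EvidenceModel:
    # Bridges the Task Model and the Student Model
    evaluation_rules: Callable[[Dict], Dict]  # rubrics/algorithms: work product -> observable scores
    measurement_model: str                    # e.g., an IRT or other statistical model (named informally here)

@dataclass
class ConceptualAssessmentFramework:
    student_model: StudentModel
    task_models: List[TaskModel] = field(default_factory=list)
    evidence_models: List[EvidenceModel] = field(default_factory=list)

# Illustrative use only; the content strings are invented.
caf = ConceptualAssessmentFramework(
    student_model=StudentModel(["scientific_inquiry"]),
    task_models=[TaskModel(["prompt", "data table"], ["written explanation"])],
    evidence_models=[EvidenceModel(lambda wp: {"rubric_score": 0}, "2-parameter IRT (illustrative)")],
)
```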

The above paragraphs might be viewed as describing a “full” implementation of ECD as it is currently conceived. Specific applications of ECD, however, may not involve a full implementation.

In my opinion, the Evidence Model and its two components, the Evaluation Model and the Measurement Model, are central to a full implementation of ECD, but incorporating a defensible Evaluation Model and Measurement Model is challenging. Historically, defensible algorithms and rules for scoring complex assessments have been the nemesis of many innovative assessment methods, and complex measurement models (see, for example, Mislevy, 2006, pp. 285–296) are challenging to understand, explain, and implement. These matters are not arguments against ECD, but they are reasons for expecting conceptual and resource challenges, and the need for rather extensive pilot studies that supply the necessary data.

The terminology and concepts used in ECD are not entirely novel, but they are not mainstream either. Also, Huff, Steinberg, and Matts (2010/this issue) note that there is a tendency for ECD terminology to obscure communication. So, those who seek to understand and/or apply ECD need to learn a new assessment language, and the process of doing so is likely to take considerable practice and iteration.

As noted by Mislevy and Haertel (2006) and by the various authors of the AP articles under discussion here, ECD is closely aligned philosophically with the framework for validation proposed by Kane (2006). Kane (2006), however, does not grapple with nuts-and-bolts terminology and processes nearly as much as ECD does. At the risk of oversimplification, Kane's interpretative argument is more-or-less aligned with claims, and his validity argument is more-or-less aligned with evidence.

AP PROGRAM AND ITS CONTEXT

In my judgment, to evaluate ECD in the context of the AP® Program, it is necessary first to have some understanding of the breadth, scope, and salient characteristics of the program, as discussed next. Note that this discussion should not be interpreted as an evaluation for or against the AP Program or its current characteristics.

The AP Program, which is owned by the College Board, is one of the largest examination programs in the world. According to the College Board website (www.collegeboard.com), in 2008 the AP Program included 37 examinations in 22 subject areas administered to over 1.5 million students. All told, over 2.7 million AP exams were administered in 2008, and approximately 20% of all 11th and 12th graders in the United States took at least one AP exam. Currently, all examinations are administered during May, with about 50% of them taken by 12th graders. The primary testing window is early in May; a second window later in May accommodates students who wish to take multiple exams scheduled for the same time period in the first window.

Not only is the AP Program large, it is also qualitatively different from most other testing programs in that each examination has associated instructional support material. AP courses and examinations are delivered to almost 16,000 schools by over 120,000 teachers who use a broad array of curricula and textbooks. By just about any benchmark, the AP Program is a widely known national (and even international) component of the U.S. educational system.

An important characteristic of the AP Program is that almost all higher education institutions grant college-level credit or placement for students who achieve a college-specified acceptable score, which is typically 3, 4, or 5 (on a scale of 1, 2, 3, 4, 5) depending on the college/university and examination. Over all examinations administered in 2008, roughly 57% of AP scores were 3 or higher.

Each examination is administered in a time period that is usually three hours. Examinations generally consist of both a multiple-choice (MC) section and a free-response (FR) section. Currently, MC sections are scored using a correction-for-guessing formula. The number of MC items and FR items/tasks varies by examination, and the weights of the MC and FR sections vary as well. Roughly speaking, the FR section for an exam contributes 40–60% of the score points on the composite scale for the exam.
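
As a rough illustration of how such a composite might be assembled, the sketch below applies a generic correction-for-guessing formula to the MC section (number right minus W/(k − 1) for k-option items) and then weights the MC and FR sections. The penalty, weights, function names, and example numbers are illustrative assumptions, not the AP Program's operational scoring rules.

```python
def formula_score(num_right: int, num_wrong: int, num_options: int = 5) -> float:
    """Generic correction-for-guessing formula: R - W/(k - 1)."""
    return num_right - num_wrong / (num_options - 1)

def composite_score(mc_formula: float, fr_raw: float,
                    mc_weight: float, fr_weight: float) -> float:
    """Weighted composite of the MC and FR sections (weights are illustrative)."""
    return mc_weight * mc_formula + fr_weight * fr_raw

# Example: 45 right, 20 wrong, 5 omitted on a 70-item MC section; 27 raw points on the
# FR section; weights chosen so the FR section supplies roughly half of the composite points.
mc = formula_score(45, 20)                                        # 45 - 20/4 = 40.0
print(composite_score(mc, 27.0, mc_weight=1.0, fr_weight=1.5))    # 40.0 + 40.5 = 80.5
```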

Inter-year forms of most exams are linked through a process that involves equating the MC sections using a common-item design and scaling composite scores to the equated MC scores. This is much less than a full description of the linking/equating/scaling process that is currently employed, but it is adequate for the purposes of this discussion.
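
For readers unfamiliar with common-item linking, the sketch below shows one simple variant, chained linear equating, in which new-form scores are linked to the anchor within the new-form group and the anchor is linked to old-form scores within the old-form group. It is a generic illustration with simulated data, not the AP Program's operational procedure.

```python
import numpy as np

def linear_link(scores_from, scores_to):
    """Mean/sigma linear linking: returns a function mapping the 'from' scale to the 'to' scale."""
    scores_from, scores_to = np.asarray(scores_from, float), np.asarray(scores_to, float)
    a = scores_to.std(ddof=1) / scores_from.std(ddof=1)
    b = scores_to.mean() - a * scores_from.mean()
    return lambda x: a * x + b

def chained_linear_equate(x_new, v_new, v_old, y_old):
    """Common-item design: link new-form X to anchor V (new-form group),
    then anchor V to old-form Y (old-form group); compose the two links."""
    x_to_v = linear_link(x_new, v_new)
    v_to_y = linear_link(v_old, y_old)
    return lambda x: v_to_y(x_to_v(x))

# Simulated data standing in for two cohorts, each taking its own form plus a common anchor.
rng = np.random.default_rng(1)
x_new = rng.normal(50, 10, 2000); v_new = 0.5 * x_new + rng.normal(0, 4, 2000)
y_old = rng.normal(53, 10, 2000); v_old = 0.5 * y_old + rng.normal(0, 4, 2000)

equate = chained_linear_equate(x_new, v_new, v_old, y_old)
print(round(float(equate(50.0)), 1))   # a new-form score of 50 expressed on the old form's scale
```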

Clearly, the AP Program represents an academic bridge between high school and college. Furthermore, almost certainly, the AP Examinations expand educational opportunities of students while they are in high school and when they get to college. It is clear, however, that the characteristics and breadth of the AP Program present substantial challenges in many measurement and psychometric areas including, but not limited to, test development.

THE INTERSECTION OF AP® AND ECD

It is my contention that the effort being undertaken by the College Board to revise certain AP exams using ECD principles is laudable in many ways, but it is both more and less than a “full” implementation of ECD. It is more than a full implementation in that the AP Program involves not only tests but also extensive sets of material and training that support instruction. ECD principles are clearly supportive of such material and training, but ECD per se provides relatively little guidance about how such material and training should be created or provided.

The AP Program is also less than a full implementation of ECD, in part because of numerous a priori constraints placed on AP test development, scoring, and reporting processes. (These constraints probably affect the Evidence Model more than most of the other components of ECD.) Some of these constraints are budgetary, but many are not; rather, they are features of the current AP Program that the developers, users, and/or the public are extraordinarily unwilling to change. This resistance to change is not unique to the AP Program. Revision of any large-scale testing program is always a series of decisions about what to hold on to and what to change, with the majority of stakeholders often favoring the “hold on to” side of the arguments, at least initially.

Among the constraints are the following:

  • each examination is administered in a time period that is usually three hours;

  • scores are reported only on a five-point scale (1, 2, 3, 4, and 5), with almost exclusive attention given to scores of 3, 4, and 5;

  • examinations generally consist of both a multiple-choice (MC) section and a free-response (FR) section;

  • FR items are generally scored by only one rater;

  • currently, MC items are scored using a correction-for-guessing formula; and

  • by fiat, generally the FR section of an exam must contribute 40–60% of the score points on the composite scale for the exam.

The combination of the first and last constraints, in particular, places rather severe restrictions on test development, whether it is approached via ECD or more traditional procedures (see Schmeiser & Welch, 2006). The entire set of constraints also has serious implications for efforts to achieve score comparability across forms.

THE ARTICLES IN THIS SPECIAL ISSUE

The AP tests under revision using ECD are in various stages of development, and the first forms of the revised tests will not be administered for some time. In this sense, the articles in this special issue describe a very ambitious work in progress. In 2009, the 13 primary tests under revision included four science tests, three history tests, and six world language tests; calculus and English language arts are in earlier stages of revision.

The Huff et al. (2010/this issue) article provides an excellent overview of why and how the College Board is using ECD for these revised AP tests. This article emphasizes that the AP Program is facing two serious challenges: (a) the need to create courses and exams that reflect contemporary advances in understanding how students learn, as well as advances in assessment theory and practice; and (b) ensuring comparability of scores within and across years. In a sense, the Huff et al. (2010/this issue) article is an overview of the entire AP revision project, while the remaining three articles in this issue describe specific ECD activities or components (what are called “layers” by Mislevy & Haertel, 2006) in considerable detail. Note that all articles emphasize the need to view ECD as an iterative process.

The Ewing, Packman, Hamen, and Clark (2010/this issue) article focuses on Domain Analysis and Domain Modeling, as well as the products of these activities, which are called “artifacts.” For Domain Analysis the artifacts are content and skills (as well as a prioritization of them). For the Domain Model the artifacts are claims, evidence, and achievement level descriptors (ALDs). This article does a good job of describing the practical challenges of using ECD. Two issues loom particularly large: the steep learning curve for subject matter experts grappling with the novel features of ECD, and defining an appropriate level of specificity (or grain size) for claims and evidence. The authors argue, however, that these challenges are largely offset by the end result of stronger links among curriculum, instruction, and assessment, which is essential for the AP Program.

The Plake, Huff, and Reshetar (2010/this issue) article focuses exclusively on the ALDs and how they might be used in standard setting. In developing ALDs for the revised AP exams, a central focus was the identification of exemplar claims (always with supporting evidence) that characterize student performance for score levels 3, 4, and 5, which are the levels typically used to grant college credit. Most descriptions of ECD do not explicitly emphasize ALDs, but they are crucial for the use of ECD in the AP Program, since ALDs are the artifacts that most directly connect student performance to the score scale. As such, the ALDs bear a substantial part of the burden for the Evidence Model.

The Hendrickson, Huff, and Luecht (2010/this issue) article considers how claims, evidence, and ALDs are used as input in the construction of the assessment framework. This article does a particularly good job of describing and articulating relationships among claims, evidence, tasks, templates, and items. For the AP Program, well-framed tasks and templates seem crucial if comparability of scores within and across years is to be achieved, which is one of the two major challenges identified in the Huff et al. (2010/this issue) article.

I believe these four articles do a very commendable job of describing the very complex and ambitious effort being undertaken by the College Board to use ECD for revised AP tests. Because this is a work in progress, however, the articles are notably lacking in data to support the presumed efficacy of ECD. Also, discussions are occasionally quite complicated since the tests under revision do not all fit into the same ECD “mold.” In particular, the artifacts occasionally differ across disciplines.

ECD AND VALIDATION

As noted previously, the ECD paradigm has remarkable similarities with the validation framework laid out by Kane (2006). Both rely on claim-evidence arguments; both share some common philosophical (e.g., Toulmin, 1958) and historical (Cronbach, 1971; Messick, 1989) precedents; and both are highly consistent with the generally accepted notion that validity involves principled arguments about the inferences concerning test scores. There are differences, however, between ECD and validation.

For example, ECD is less extensive than validation as that term is used by Kane and others: ECD is intimately concerned with the interpretation of scores, but it does not explicitly grapple with many of the complicated consequential issues that often characterize the use of test scores. Also, at least for observable attributes, Kane (2006) views validation as involving not only scoring and scaling issues but also generalization and extrapolation; ECD, by contrast, is relatively silent about generalization and extrapolation. There can be no a priori guarantee, for instance, that an AP test created via ECD will perform adequately for granting college credit. There may be good reason to be hopeful, but validation requires evidence.

It is sometimes stated or at least implied that when a test is developed using ECD, validity is essentially built into the test a priori. There is an element of truth in such an assertion, but I would argue that any broad validity claim is largely unsubstantiated. It cannot be emphasized enough that validity is about inferences concerning test score interpretations and uses; validity is not about tests or test items per se, no matter how carefully constructed they may be. The focus on scores is crucial. That is why content validity has been so much maligned in recent years. Many users of test scores never see the tests or the items in the test. Furthermore, even if they do, such users rarely know how test scores are obtained from item-level responses. That is why scaling and equating issues are so important in validation arguments.

SCALING, EQUATING, AND COMPARABILITY

Scaling is one of the most crucial and challenging tasks in psychometrics. Contrary to uninformed intuition, it is virtually never obvious how item scores should be accumulated to obtain the test scores that are provided to users—that is the process of scaling. Some ways of performing scaling can enhance test score interpretation; other ways can blur it, or even undermine it. In the context of the AP Program this may be a matter of some concern. Recall that one of the program constraints is that the reporting scale must be 1, 2, 3, 4, and 5. No matter how sophisticated the scaling process may be, this five-point scale may not be rich enough to capture the essence of the evidentiary arguments embodied in the claims, evidence, and tasks. That is one very compelling reason for the use of ALDs and standard setting as described in the Plake et al. (2010/this issue) article. Still, in the end, there is reason to speculate about the adequacy of the rather coarse five-point scale for capturing the complexities of an ECD evidentiary argument.
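
Mechanically, the final reporting step is simple, which is part of the concern: once standard setting yields cut scores, the entire composite distribution collapses into five categories. The sketch below uses entirely hypothetical cut scores and a function name of my own choosing; operational cuts come from the standard-setting process described by Plake et al.

```python
import bisect

def ap_grade(composite, cuts=(30.0, 50.0, 70.0, 90.0)):
    """Map a composite score to the 1-5 reporting scale via cut scores.
    The default cuts are hypothetical placeholders, not operational values."""
    return 1 + bisect.bisect_right(list(cuts), composite)

for c in (25, 55, 95):
    print(c, "->", ap_grade(c))   # 25 -> 1, 55 -> 3, 95 -> 5
```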

Although one of the ultimate goals of test construction is to enable the creation of forms that are similar in all relevant respects, that goal has never been fully achieved by any real testing program, to the best of my knowledge. Those who claim otherwise make extraordinarily strong assumptions that are both unwarranted and untested. Proponents of ECD reasonably argue that ECD may do a better job than traditional approaches in creating comparable forms, but I see no reason to believe that ECD will eliminate the need for equating (statistical procedures that transform scores on a new form to the scale of an old form) in order to correct for the fact that forms are not completely comparable.

As noted by Huff et al. (2010/this issue), equating poses a particularly difficult challenge for the AP Program. In fact, the revised tests in the AP Program are using ECD in large part because the College Board believes that ECD will facilitate the development of forms that are more comparable than they otherwise would be. If that is true (and I think it will be), then equating should work better in helping to ensure score comparability. However, there are still serious challenges. Foremost among them is the fact that, for practical and political reasons, it is almost certain that for most (if not all) tests the common-item sets linking new and old forms will not include FR items. If MC and FR items truly test different content and/or skills (as most AP observers believe), then excluding FR items from common-item sets could undermine equating and negatively impact score comparability. ECD may mitigate the problem, but it is not likely to eliminate it.
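
A small simulation can make this concern concrete. In the sketch below (all numbers invented, and the "equating" a deliberately crude caricature of common-item logic rather than any operational method), the two cohorts are equal in MC proficiency but differ in FR proficiency; an MC-only anchor cannot see that difference, so the cohort difference in composite scores is misattributed to form difficulty.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
cov = [[1.0, 0.6], [0.6, 1.0]]   # MC and FR proficiencies: distinct but moderately correlated

# Old-form and new-form cohorts: equal in MC proficiency, but the new cohort is
# stronger in FR proficiency by 0.5 SD (illustrative assumption).
theta_old = rng.multivariate_normal([0.0, 0.0], cov, size=n)
theta_new = rng.multivariate_normal([0.0, 0.5], cov, size=n)

def composite(theta):            # the two forms are equally difficult by construction
    mc = 25 + 5 * theta[:, 0] + rng.normal(0, 2, len(theta))
    fr = 40 + 8 * theta[:, 1] + rng.normal(0, 3, len(theta))
    return mc + fr

def anchor(theta):               # MC-only common items
    return 10 + 2 * theta[:, 0] + rng.normal(0, 1, len(theta))

comp_old, comp_new = composite(theta_old), composite(theta_new)
anch_old, anch_new = anchor(theta_old), anchor(theta_new)

# Crude common-item logic: use the anchor to estimate the cohort difference on the
# composite, and attribute whatever remains to a difference in form difficulty.
slope = np.cov(comp_old, anch_old)[0, 1] / anch_old.var(ddof=1)
cohort_diff_est = slope * (anch_new.mean() - anch_old.mean())          # ~0: the anchor is blind to FR skill
form_diff_est = (comp_new.mean() - comp_old.mean()) - cohort_diff_est  # ~4 points, entirely misattributed

print("true form difficulty difference:        0.0")
print(f"'equating' estimate of that difference: {form_diff_est:.1f}")
```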

HOW WELL DOES ECD WORK?

I view the evidentiary argument embodied by claims and supporting observable evidence as the central feature of ECD. Since ECD makes rather strong claims, I suggest that the same standard (i.e., an evidentiary argument) should be applied to ECD itself.

For example, among the claims that need to be examined relative to observable evidence are:

  • ECD leads to forms that are more comparable than forms created using traditional procedures;

  • test scores are more interpretable when ECD is used than when traditional procedures are used for test construction;

  • ECD is a cost-effective alternative to traditional approaches;

  • ECD can provide the detail and explicitness that are often the Achilles' heel of conventional test specifications (Hendrickson et al., 2010/this issue);

  • the complexities of ECD can be managed given the typical personnel and resource constraints in large-scale testing programs; and

  • ECD works well for claims-evidence-tasks based on reading passages and/or similar types of stimuli.

One approach to addressing such claims and collecting supporting evidence would be a psychometric “bake-off” in which ECD and one or more traditional approaches are used to develop at least two forms of the same test. Claims about these “products” could be evaluated and compared (evidence) with respect to any number of criteria (e.g., comparability of forms). This is essentially the “duplicate-construction experiment” proposed by Cronbach (1971, p. 455) in a different context nearly 40 years ago.
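
Whatever its exact design, such an experiment would ultimately come down to comparing the two sets of forms on explicit criteria. The sketch below, with placeholder data and a function of my own invention, illustrates one such criterion: for forms administered to randomly equivalent groups, how far apart the score distributions are, with smaller gaps taken as evidence of greater comparability.

```python
import numpy as np

def comparability_gap(form_a, form_b):
    """One simple comparability criterion for two forms given to randomly equivalent
    groups: absolute differences in score means and SDs (smaller = more comparable)."""
    a, b = np.asarray(form_a, float), np.asarray(form_b, float)
    return {"mean_gap": abs(a.mean() - b.mean()),
            "sd_gap": abs(a.std(ddof=1) - b.std(ddof=1))}

# Placeholder data standing in for scores on two forms built under one approach; the same
# criterion would be computed for forms built under the other approach and the results compared.
rng = np.random.default_rng(2)
print(comparability_gap(rng.normal(65, 12, 5000), rng.normal(66, 12.5, 5000)))
```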

To the best of my knowledge no such experiment has ever been undertaken for any testing program. Doing so certainly would be costly and time-consuming, and it is probably not practical for a testing program as complex as the AP. Still, such an experiment could directly address important questions that other approaches can consider only approximately. Realistically, other less ambitious approaches would probably need to be employed. In any case, somehow or other the strong claims that are made about ECD need to be validated in the context of specific testing programs, such as the AP.

Although the articles in this special issue suggest that the College Board's claims about ECD may be supportable, much of the relevant evidence and data will not be available for several years. Even at this time, however, the efforts that have been undertaken to employ ECD for the revised AP tests are extensive and impressive and, in my opinion, likely to contribute substantially to valid inferences about scores for these tests.

REFERENCES

  • Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
  • Ewing, M., Packman, S., Hamen, C., & Clark, A. (2010/this issue). Representing targets of measurement within evidence-centered design. Applied Measurement in Education.
  • Hendrickson, A., Huff, K., & Luecht, R. (2010/this issue). Claims, evidence, and achievement-level descriptors as a foundation for item design and test specifications. Applied Measurement in Education.
  • Huff, K., Steinberg, L., & Matts, T. (2010/this issue). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education.
  • Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
  • Mislevy, R. J. (2006). Cognitive psychology and educational assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 257–306). Westport, CT: American Council on Education/Praeger.
  • Mislevy, R. J., & Haertel, G. (2006). Implications of evidence-centered design for educational assessment. Educational Measurement: Issues and Practice, 25, 6–20.
  • Plake, B. S., Huff, K., & Reshetar, R. (2010/this issue). Evidence-centered assessment design as a foundation for achievement-level descriptor development and for standard setting. Applied Measurement in Education.
  • Schmeiser, C. B., & Welch, C. J. (2006). Test development. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 307–354). Westport, CT: American Council on Education/Praeger.
  • Toulmin, S. (1958). The uses of argument. Cambridge: Cambridge University Press.
