
Toward a Theory of Socioculturally Responsive Assessment

ABSTRACT

In the United States, opposition to traditional standardized tests is widespread, particularly obvious in the admissions context but also evident in elementary and secondary education. This opposition is fueled in significant part by the perception that tests perpetuate social injustice through their content, design, and use. To survive, as well as contribute positively, the measurement field must rethink assessment, including how to make it more socioculturally responsive. This paper offers a rationale for that rethinking and then employs provisional design principles drawn from various literatures to formulate a working definition and the beginnings of a theory. In the closing section, a path toward implementation is suggested.

Introduction

This paper poses an initial theory of socioculturally responsive assessment. The first section provides a detailed rationale as to why some of the basic premises of educational assessment must be rethought. The second part describes provisional principles for that rethinking, along with examples to illustrate selected aspects. Next, those principles are woven into a working definition and an initial theory. In the closing section, a path toward implementation is suggested.

A rationale for socioculturally responsive assessment

In this paper, “culture” refers to the ways of life passed down across generations. Those ways include social behavior and norms; laws and institutions; customs, habits, traditions, and rituals; language; knowledge and ways of knowing; beliefs and values; arts; and cuisine. In purpose and content, standardized tests can be viewed as cultural artifacts (Greenfield, 1997; Solano-Flores, 2019). That is, they are products of our nation’s dominant common culture, shaped by it, and used for its benefit. For example, state accountability tests help stakeholders understand how effectively schools impart the English language arts (ELA), math, and science competencies that our common culture values. Tests for admission to selective postsecondary programs help allocate further learning opportunity in keeping with a culturally determined meritocratic ideal of hard work, ability, and achievement (Brint, 2022; Lemann, 1999). Finally, occupational and professional licensure examinations assess candidates on culturally valued competencies deemed key to protecting the public, as seen, for instance, in the focus of medical licensure tests on Western rather than Eastern medicine. Exactly how these competencies are measured is shaped by the backgrounds, perspectives, and sociocultural identities of those responsible for test design, item creation, performance scoring, analysis, and reporting.

Because test performance on all the above examination types differs considerably among racial/ethnic and socioeconomic groups, and because those differences have significant material consequences, the sociocultural positioning of tests matters greatly. Historically, that positioning has been problematic because of the eugenicist beliefs of many of our field’s founders (Brigham, 1923, pp. 71, 205–210; Goddard, 1920, pp. vii, 6–7, 18–20, 127–128; Terman, 1916, pp. 3–4, 91–92). More recently, that positioning has been increasingly framed as part of the larger set of structural inequities affecting our society. For some demographic groups, particularly people of color, those inequities have been present for generations, impacting finances, employment, health, housing, community safety, and education quality, among other things (JEC, n.d.; Braun & Kirsch, 2016; Goosby & Heidbrink, 2013; Hussar et al., 2020, Figure 1, p. 50; Warnken & Lauritsen, 2019). In combination, the pervasiveness and longstanding nature of these factors have limited children’s opportunity to learn culturally valued academic competencies in school, at home, and in the community (Bennett, 2021, April 10). Because test performance is so heavily influenced by these inequitable learning opportunities, state accountability, admissions, and licensure tests are perceived by significant segments of our society as racist – i.e., as helping to institutionalize a common culture, depicting too many of those who don’t belong to that culture as less capable, and abetting the continued inequitable distribution of educational and economic opportunity.

There is considerable evidence that this perspective is taking hold more widely. For example, in his 2019 New York Times best seller, Ibram Kendi (2019, p. 101) wrote, “We degrade Black minds every time we speak of an academic achievement gap based on [standardized tests]. The acceptance of an ‘academic-achievement gap’ is just the latest method of reinforcing the oldest racist idea: Black intellectual inferiority.” In line with that thinking, the University of California Board of Regents voted in May 2020 to suspend admissions testing for in-state students and in November 2021 decided to eliminate it permanently (Nietzel, 2021). For the 2021–22 admissions cycle, close to 1,800 undergraduate institutions (>75% of the total) were reported to have adopted test-optional or test-blind policies (FairTest, 2021). Although these policies were initially invoked in large part because of COVID-19, among selective institutions the belief that tests impede diversity is one of the primary motivations for the policies’ continuation (Editor, 2020).

A change in perspective is also apparent within the measurement community. That community has a long history of addressing fairness issues dating at least to the seminal work of Cleary (1966), primarily centered on methodology to identify unfairness and on issues of inappropriate test use. The more recent discourse, however, has begun to question long-accepted ideas and practices. Lyons, Hinds, and Poggio (2021, p. 41), for example, write, “The well-documented, disparate adverse impact on … marginalized students as a result of … college entrance exams … is sufficient evidence for suspending this practice.” Huff (American Consortium for Equity in Education, 2021) states, “I think we’re at an inflection point … , because we’re only at the very beginning of undoing … our theory, method and practices to make our assessment[s] culturally relevant and anti-racist.” Finally, Randall (2021b, p. 33), referring to K-12 accountability tests generally, maintains, “ … at a minimum [testing companies should] acknowledge explicitly what is being measured … and stop referring to it … as college readiness. Instead, refer to it as the knowledge, values, and ways of understanding of the white dominant class.”

Rather than a transitory phenomenon, the loss of faith evident in the above quotes and in institutional elimination of testing requirements is likely to grow simply because of national demographics and the increased cultural diversity these population changes imply (Bennett, 2022). In 1980, the US population was 80% non-Hispanic White (Hobbs & Stoops, 2002). As of 2021, that figure was estimated to be 59% (US Census Bureau, 2021). Our public-school population currently has students of color in the majority. According to NCES (2020a, Tables 203.70 and 204.10), as of 2018, that population was 47% non-Hispanic White (and 52% eligible for free and reduced-price lunch). By 2060, the Census Bureau projects that the under-18 share of the US population will be 36% non-Hispanic White (Vespa, Medina, & Armstrong, 2020). Although 2060 might seem distant, in some states that future has arrived. As of 2021–22, California’s public-school population was 56% Hispanic or Latino and 21% White non-Hispanic (California Department of Education [CDE], 2022).

These demographics suggest that cultural sensibilities among at least some societal segments are shifting and will continue to evolve, though with significant resistance from opposing social and political factions. Even so, evidence of evolution can be seen in such occurrences as the declaration of Juneteenth as a federal holiday (Karni & Broadwater, 2021); the designation of Indigenous Peoples’ Day as a replacement for Columbus Day in 13 states, the District of Columbia, and 130 cities (Andrew & Willingham, 2020); the establishment of multiple official languages in Arkansas, Hawaii, and South Dakota; the remaking of the New York City Police Department as a majority-minority organization (NYC, 2021); the emergence of diversity/equity/inclusion as a significant business in its own right (Newkirk, 2019); the issuance by the New York State Education Department of a framework for culturally responsive teaching (NYSED, n.d.); the mandated inclusion of Native American history and culture in school curricula in Connecticut, North Dakota, and Oregon (Haigh, 2021); similar laws in New Jersey and Illinois for Asian American and Pacific Islander history (Bellamy-Walker, 2022); and the requirement of an ethnic studies course for a California public high-school diploma (Mays, 2021).

In sum, we are witnessing dramatic changes in US society that are affecting the measurement field. Those changes include a growing recognition that standardized tests reflect the dominant common culture in which they are embedded. By being so embedded, tests may too often limit opportunities for individuals from racial/ethnic, socioeconomic, immigrant, native language, exceptionality, gender identity, sexual orientation, or other groups whose ways of knowing may not always completely align with those of our tests. If the educational measurement field is to not only survive but make positive contributions, it must adapt its theories and practices to the more multicultural, pluralistic society the United States is rapidly becoming.

What might that rethinking mean for how we define and implement such foundational concepts as comparability, standardization, validity, and fairness? And what impact might that rethinking have on test performance? We begin to address these questions by drawing on key ideas from several literatures, which are briefly described in the next section.

Foundational ideas

Work directly and indirectly relevant to a more pluralistic rethinking of educational assessment goes back at least several decades. One indirect line traces to the literature on a category of related ideas and approaches to improving the teaching of traditionally underserved students, particularly those of color. Members of that category include culturally responsive education (Cazden & Leggett, 1976), culturally relevant pedagogy (Ladson-Billings, 1995, 2021), culturally responsive pedagogy (Lee, 1998), culturally responsive teaching (Gay, 2002, 2018), culturally sustaining pedagogy (Paris & Alim, 2014; Paris, 2012), and funds of knowledge (Gonzalez, Moll, & Amanti, 2005; Moll, 2019; Moll, Amanti, Neff, & Gonzalez, 1992). Differences among the members include concern with theory vs. practice (e.g., pedagogy in contrast to teaching), target population, and precepts (e.g., developing critical consciousness). An idea common to most members is to view and utilize the knowledge, practices, and identities that students bring from home and community as assets upon which to build academic competencies.

A second indirect line comes from the learning sciences (NASEM, 2018; NRC, 2000), especially the segment dealing with sociocultural learning theory (Vygotsky, 1978). This segment locates learning within communal activity systems (Greeno & Engeström, 2014). Such systems are characterized by regular and recurring patterns of linguistic, cultural, and substantive activity, or practices, in which group members engage (Mislevy, 2018). These socially mediated practices in turn foster the development of knowledge, sense-making, and modes of behaving that are particular to the assemblage, be it a classroom, school, family, disciplinary community, or cultural group. Among other things, these developments influence how members understand and interact with their world (Gee, 2008), highlighting that ways of knowing and interactional patterns may vary by group.

A final, more direct line focuses on improving the assessment of traditionally underserved students. Like that on teaching, this line represents a grouping of related ideas and approaches. Approaches include equitable assessment (Gordon, 1995), culturally responsive assessment (Hood, 1998; Lee, 1998; Qualls, 1998; Solano-Flores & Nelson-Barber, 1999; Solano-Flores, 2019), antiracist assessment (Inoue, 2015; Randall, 2021a; Randall, Slomp, Poe, & Oliveri, 2022), culturally sustaining assessment (Randall, Slomp, Poe, & Oliveri, 2022), justice-oriented assessment (Randall, Slomp, Poe, & Oliveri, 2022), assessment based on sociocultural/situative theory (Moss, 2008; Penuel & Shepard, 2016), socioculturally responsive assessment (Bennett, 2021, April 10), and universal design for assessment (Ketterlin-Geller, 2005). Among the differences are target population (e.g., focused on a narrowly defined demographic group vs. more inclusive), primary underlying literatures (e.g., bilingual education, sociocultural learning theory), key precepts (e.g., use of multiple modalities, inclusion of antiracist test content), ultimate goals (e.g., more accurate measurement, development of critical consciousness), rhetorical stance (e.g., appealing to specific vs. broad constituencies), preferred test format (e.g., performance assessment), and hierarchical relationship (e.g., one member subsuming others). Variation is also apparent within each approach, as evidenced by the different orientations of the cited authors. Even so, a key idea common to most approaches is to design assessment for the social, cultural, and other relevant characteristics of diverse students and the contexts in which they live.

Socioculturally responsive assessment

Collectively, the summarized literatures suggest that both teaching and assessment should be structured to account for the variation in ways of knowing and interactional patterns that diverse students bring to school. Although versions of this idea first appeared decades ago, within the mainstream measurement community the possibility of applying it to standardized assessment generally – i.e., to tests for state accountability, NAEP, teacher licensure, and postsecondary admissions and the broad populations they serve – is just emerging (e.g., Bennett, 2021, April 10; Randall, 2021a; Sireci, 2020). Evans (2021) offered a pertinent hierarchy of application: sensitive (awareness of cultural differences without assigning value), relevant (linking student background to learning), responsive (adapting to background), and sustaining (honoring and extending background). Evans suggested that, by their nature, standardized tests cannot go beyond being sensitive, whereas classroom assessment can achieve the sustaining level.

The present paper elaborates on “socioculturally responsive assessment” (Bennett, 2021, April 10), one member of the last line of research mentioned above. The intention is for application to standardized testing, with the belief that such measures can indeed be made sustaining, though not as tests are traditionally conceived. As with assessment based on sociocultural/situative theory, the “socio” prefix is used to emphasize that learning and assessment are embedded in social activity systems which advantage certain ways of knowing and interactional patterns. That prefix also is employed to highlight the need to consider a multiplicity of social factors – race/ethnicity, socioeconomic level, exceptionality, first language, immigrant status, gender identity, and sexual orientation, as well as their intersectionalities (Crenshaw, 1989) – in defining who it is that our assessment practices must be better designed to serve. This inclusive focus is notably distinct from the more targeted foci of many other members of the category.

In the next section, several provisional principles are described. Those principles convey key ideas drawn from the cited literatures. For each principle, a rationale is given for its application to standardized assessment based on logic, prior theory, or empirical research, along with an example. Those principles lead, in turn, to what are perhaps the most important contributions of the paper – a working definition built from principles, an initial theory of empirically testable propositions based on those principles, and the beginnings of a path forward. The principled definition and theory in particular distinguish the described approach from most other related proposals.

Provisional principles

Five provisional principles might be used to guide the design of socioculturally responsive assessments and, in that way, make the overarching concept more concrete. Although the following discussion centers primarily on K-12 assessment, the principles are intended as universals – i.e., applicable regardless of assessment purpose, intended inference, and target population. Purposes, inferences, and populations will, however, dictate how the principles are instantiated for any given use case.

Principle 1:

Present problem situations that connect to, and value, examinee experience, culture, and identity

There are several rationales for this principle. One is that students are more likely to be able to show what they know and can do in familiar as opposed to foreign contexts because the former may help to activate prior knowledge (NRC, 2007, pp. 19, 119, 142). A theoretical basis for this principle can be found in the literatures on teaching, assessment, and the learning sciences cited above. The intent is to connect the academic content and processes valued by the school to the prior knowledge diverse learners bring from home and community (Gay, 2002, 2018; Hood, 1998; Moll, Amanti, Neff, & Gonzalez, 1992; Nelson-Barber & Trumbull, 2007; Solano-Flores & Nelson-Barber, 2001). That connection recognizes both the importance of academic competency and the underutilized assets that students possess. A second rationale is to communicate respect for different cultural identities in ways that help sustain those identities (Paris & Alim, 2014; Paris, 2012), as well as give other students an understanding of them. A third rationale is to begin to broaden the content and processes valued by the school beyond the Westernized subset currently privileged (Paris & Alim, 2014; Trumbull & Nelson-Barber, 2019). Both this last rationale and the prior sustaining one are consistent with state frameworks and curriculum requirements for culturally responsive teaching, Native American history and culture, Asian American and Pacific Islander history, and ethnic studies noted above.

One approach to implementing this principle is to craft item contexts that depict members of a specific demographic group or, preferably, appeal to the experiences of that group, an approach that some assessment companies are reportedly pursuing. An alternative path that appears to be more rarely taken is given in Table 1. This prompt, drawn from a teacher credentialing test, was adapted to make it suitable for a secondary-school ELA writing assessment. The example is purposefully structured to resonate with the lived experiences of multiple groups simultaneously, including intersectionalities, rather than targeting any single group. The example also invites examinees to discuss cultural assets, which could include their own. To make the task suitable for all students, some prior ethnic-studies exposure might be presumed.

Table 1. A secondary-level test item presenting a problem situation that connects to, and values, examinee experience, culture, and identity.

Principle 2:

Allow for multiple forms of representation and expression in problem stimuli and in responses

The rationale for this principle is that some forms may be more common to the ways of knowing and interactional patterns found in particular communities (Ascher, 1991; Cazden & Leggett, 1976; Greenfield, 1997; Solano-Flores & Nelson-Barber, 2001; Trumbull & Nelson-Barber, 2019). Forcing an assessment interaction into a particular form may therefore create construct-irrelevant variance for individuals not conversant with that form. This principle has long been practiced in the assessment of students with exceptionalities, as exemplified by the availability of not only Braille but also large-type and audio versions for examinees with visual limitations. The principle has, however, not been extended more broadly. A generalization to all examinees would argue for greater availability in problem presentation of alternate representations – verbal, graphical, and symbolic. (See Mayer, 2009, however, for guidance on simultaneous use.) Greater allowance might also be made for such representations in test responses, as well as for accepting responses in different modes. An example might be allowing students to dictate their essay instead of writing it. A second illustration would be to permit submission of a slide presentation, hierarchical list, concept map, or text response when the focus of assessment is content knowledge.

Principle 3:

Promote instruction for deeper learning through assessment design

This principle is rooted in several premises. The first premise is that the development of conceptual understanding is critical to domain proficiency and that greater attention to that understanding is warranted in schooling (NRC, 2000). The second premise, from an equity perspective, is that students attending under-funded schools are less likely to get this attention, as their teachers frequently have less access to appropriate curricular resources (e.g., science labs) and are on average less experienced and qualified than those instructing their majority-group peers (Cardichon et al., 2020; Changing the Equation, n.d.; Mehrotra, Morgan, & Socol, 2021; Rahman, Fox, Ikoma, & Gray, 2017, pp. 15–16, B-39). A third premise is that assessment design can influence teaching and learning behavior (Frederiksen, 1984; Resnick & Resnick, 1990; Shepard, 2021). In the negative direction, the sole use of multiple-choice questions can lead to instruction oriented toward discrete bits of knowledge and memorized procedures. In contrast, research suggests that the inclusion of well-designed performance tasks encourages teaching for conceptual understanding, knowledge application in realistic settings, and the integration of skills (Stecher, 2010, pp. 24–25).

From an assessment design perspective, promoting instruction for deeper learning suggests several desiderata. As just implied, one would be the inclusion of performance tasks involving problems with reasonable fidelity (Darling-Hammond & Adamson, 2014; Taylor, 2022, chapter 3), an approach that some assessment programs currently follow. A second desideratum might be to provide consultative resources like those used in real-world problem solving. Precedent can be found in state accountability assessments that call upon students to write based on given source material, as well as in occupational and professional licensure examinations. In licensure, the National Council of Architectural Registration Boards’ Architect Registration Examination (ARE) includes case studies requiring the examinee to synthesize multiple pieces of information and make evaluative judgments (NCARB, 2020, p. 147). Each case study contains a scenario and associated resources like a site plan, design specification, zoning ordinance, geotechnical report, and international building-code excerpts. Appropriate resources are also made available in the Uniform Certified Public Accountant Examination’s task-based simulations (AICPA, 2020). A third desideratum is to structure tasks so that they model for students the strategies and habits of mind proficient performers are likely to employ. As an example, proficient performers routinely evaluate their work, along with that of others, against domain-relevant quality standards, a practice that could be modeled by including such standards as a task resource (e.g., guidelines for writing a quality argumentative essay).

Although operational assessments generally do not structure tasks to model proficiency in this way, an example can be found in the ELA scenario-based assessment prototypes produced by the Cognitively Based Assessment of, for, and as Learning (CBAL) initiative (Bennett, Deane, & van Rijn, 2016). Each assessment centers on a scenario, or topical context, intended to reflect an issue of contemporary importance to middle-school students. The assessments were built to instantiate a theory of key practices (Deane et al., 2015), where each such practice is an integrated bundle of reading, writing, and thinking skills required to participate in an activity system through specific, meaningful modes of interaction with other members of a literate community. “Discuss and debate ideas,” or argumentation, constitutes one of those practices. That practice is decomposed into phases – understand the issue, explore the subject, consider positions, create and evaluate arguments, and organize and present arguments – with each phase having questions to direct activity. For example, understanding the issue entails identifying whose opinions matter, what people interested in the issue care about, whom the writer needs to convince, and how the writer will try to convince them. The phases, and questions, thus represent a canonical argumentation process, one that a proficient performer might automatically employ. The assessment models this process, taking the student through lead-in tasks that call for engaging with provided resources in ways that follow the phases and their associated questions. Culminating the assessment is an argumentative essay in which the student must offer and substantiate their position, with reasons and evidence derived from the sources. The assessment thus functions as an argumentation-process model for students to internalize and for less-experienced teachers to incorporate into their instructional practice.

A last deeper-learning design requirement is to allow student agency. Agency should go beyond selection of response mode (Principle 2) to encompass some form of problem choice. Agency has been shown to increase motivation to learn (NASEM, 2018; Shepard, 2021) and has been said to help students discover how to choose problems wisely. Because agency is a type of personalization, it is discussed in conjunction with the next principle.

Principle 4:

Adapt the assessment to student characteristics

Educational testing is built on the premise of treating every examinee the same. That idea is the foundation of comparability and of fairness to individuals. Administering the same content in the same task formats under the same conditions leads to scores that can be meaningfully compared. Departures on any of those facets are asserted to reduce comparability (Bennett, 2020).

However, principled departures from sameness have been part of standardized testing for a long time. Personalization to competency level dates to the Stanford-Binet Intelligence Scales, in which the examiner adjusted difficulty depending upon the responses of the examinee (Reckase, 1989). More sophisticated versions of that idea appeared in the 1980s, when the first computerized adaptive tests were introduced (Ward, 1988). In such tests – ranging from current state accountability measures like Smarter Balanced to postsecondary admissions exams like the GRE General Test – examinees not only receive different items but may be administered tests of considerably different difficulty.
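To make the adaptive logic concrete, the sketch below simulates maximum-information item selection under a two-parameter logistic (2PL) IRT model, the general model family underlying such programs. It is a minimal illustration with an invented item bank and a simple grid-search ability estimate, not any program’s operational algorithm, which would add content balancing, exposure control, and more refined estimation.

```python
import numpy as np

# Minimal sketch of computerized adaptive testing under a 2PL IRT
# model. Item parameters are invented for illustration; operational
# programs use calibrated banks plus content and exposure controls.
rng = np.random.default_rng(7)
bank_a = rng.uniform(0.8, 2.0, 200)    # item discriminations
bank_b = rng.uniform(-3.0, 3.0, 200)   # item difficulties

def p_correct(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information each item provides at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, grid=np.linspace(-4, 4, 161)):
    """Grid-search maximum-likelihood ability estimate."""
    loglik = np.zeros_like(grid)
    for a, b, u in responses:
        p = p_correct(grid, a, b)
        loglik += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return grid[int(np.argmax(loglik))]

true_theta = 1.2                # simulated examinee
theta_hat, responses, used = 0.0, [], set()
for _ in range(20):             # administer a 20-item adaptive test
    info = item_information(theta_hat, bank_a, bank_b)
    info[list(used)] = -np.inf  # never readminister an item
    j = int(np.argmax(info))    # most informative remaining item
    used.add(j)
    u = int(rng.random() < p_correct(true_theta, bank_a[j], bank_b[j]))
    responses.append((bank_a[j], bank_b[j], u))
    theta_hat = estimate_theta(responses)

print(f"Estimated ability after 20 items: {theta_hat:+.2f}")
```

Because each item is chosen to be maximally informative at the current ability estimate, two examinees who respond differently quickly diverge onto tests of very different difficulty – precisely the departure from sameness described above.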

Accounting for special needs dates at least to the 1937 creation of the Braille edition of the SAT (Fuess, 1950, p. 140). Although the scores resulting from accommodations to standard procedure and modifications to test format were for many years flagged as nonstandard, they are no longer so identified, effectively treated by testing programs and users as comparable to scores obtained under standard conditions. Moreover, in testing programs like Smarter Balanced, accommodations designated as “universal tools” are available to all examinees at their choosing. Such tools include an English glossary (i.e., pop-up definitions for selected construct-irrelevant words), highlighter, strikethrough, zoom, and notepad.

Personalization to the examinee’s first language, while far from universal, is also not novel. Encouraged by federal law, 31 states plus the District of Columbia offer native-language assessments in mathematics, 21 in science, six in social studies, and four in reading/language arts (Sugarman & Villegas, 2020). Almost all these jurisdictions have versions in Spanish, with other languages including Arabic, Chinese, Haitian Creole, Korean, Russian, Somali, and Vietnamese. Three states (MI, NY, and WA) offer tests in multiple native languages. In almost all cases, these assessments are either direct translations of the English versions or transadaptations intended to account for some types of cultural difference. States vary in whether students must use only the native-language version or may use both the English and native-language ones simultaneously.

Considerably more novel is designing an assessment specifically for the members of a language and cultural group. In this regard, the approach taken by the Hawaii State Department of Education is unique. The Kaiapuni Assessment of Education Outcomes (KĀ‘EO) is grounded in the language, culture, and worldview of the students enrolled in Hawaiian language immersion programs (HSDE, n.d.; Shultz & Englert, 2021). KĀ‘EO is administered in grades 3–8 in Hawaiian language arts, math, and science. The assessment results are used for community, state, and federal accountability purposes.

Yet another departure from sameness involves adapting to examinee interests and prior knowledge, which dates to the offering of problem choice in the original College Board essay tests of the early 1900s (Wainer & Thissen, 1994). This type of adaptation is rarely employed in operational testing programs today. An exception can be found in several of the Board’s Advanced Placement Program examinations, including the long essay tasks on the US History and European History tests (College Board, 2020a, 2020c). Affording choice is based on the fact that, due to prior knowledge and possibly other factors, task difficulty varies across examinees, a person-by-task interaction (Linn & Burton, 1994; Shavelson, Baxter, & Gao, 1993). Such affordance assumes, of course, that most examinees will make beneficial choices.

Considerable empirical research was conducted during the 1990s on the effects of allowing examinee choice. Two lines of research were undertaken, both now dated. The results of the first line, concerning constructed-response tasks like essays, were summarized by Powers and Bennett (1999, p. 265) as follows: “Currently, it is possible, therefore (depending on the research one cites), to either strongly favor examinee choice or vigorously oppose it.” That summarization reflected the finding that test takers did not appear to make good choices uniformly, that the prevalence of poor choices may be associated with the nature of the test and the characteristics of examinees, and that the average effect may be negative for some tests and positive for others. The second line of work focused on self-adaptive multiple-choice testing in which examinees choose among difficulty levels but not specific items. A meta-analysis of those investigations was more positive, finding that choice led to marginally higher ability estimates and a small reduction in posttest anxiety (Pitkin & Vispoel, 2001).

The above discussion should make clear that there are several paths to adaptation currently deployed operationally. Those paths adapt to different degrees and in disparate ways that may be more or less socioculturally attuned. One path is machine driven, as in computerized adaptive testing, which offers a high degree of individualization but only on a single cognitive attribute. Incorporating sociocultural characteristics into the adaptation algorithm would require test designers to know which characteristics to personalize upon (e.g., cultural identity, interests), how to measure each of them effectively, how they worked conjointly, and what that conjoint result implied for adapting test content and format – none of which are yet well understood. A second path is to create bespoke assessments for specific populations, where those populations have been defined by such factors as exceptionality, language, or language and culture. Indeed, major testing programs today provide many such adapted versions. This path, however, becomes less feasible as the number of groups increases and less sensible as intersectionalities take on prominence (Evans, 2021). A third path is examinee driven, as in choice (of problems or tools) within an assessment, allowing cognitive and noncognitive factors to come into play, including sociocultural ones. In that path, responsibility for making appropriate choices shifts largely from the testing organization to the examinee, thereby implicitly adding that skill to the construct definition.

A significant generalization of the examinee-driven path is also possible, though with its own peculiar challenges. In this generalization, the same standardized assessment can arguably personalize to cultural identity, interests, and prior knowledge; use those attributes as assets; help sustain diverse backgrounds; and, in the process, aid instruction. Only a few operational examples currently exist. One such example can be found in the Advanced Placement (AP) Art and Design Program. The Program consists of three courses (Drawing, 2D Art and Design, 3D Art and Design), each with a portfolio examination (College Board, 2021a, pp. 36–37). Although this discussion focuses on the 3D Art and Design Portfolio Exam, the three assessments share essential features and have similar implications for personalization.

In the 3D Art and Design Portfolio Exam, students can work with any materials, processes, and ideas they choose. Their submissions may include figurative or nonfigurative sculpture, architectural models, metal work, ceramics, glasswork, installation, performance, assemblage, 3-D fabric/fiber arts, still images from videos or film, or composite images. For the Selected Works section of the exam, digital images of each of five artifacts must be submitted that demonstrate 3D skills and synthesis of materials, processes, and ideas. Those submissions must be accompanied by a written description of the ideas evident in the work, the materials employed, and the processes used. For the Sustained Investigation section, 15 digital images must be submitted that evidence the same competencies as for Selected Works. In addition, those works must show sustained investigation through practice, experimentation, and revision of materials, processes, and ideas. The accompanying written submission must identify the inquiry guiding the investigation and give evidence of how that inquiry directed the investigation in terms of practice, experimentation, and revision.

Several aspects of the exam are especially pertinent. First, student choices are intended to be negotiated with their teachers. Teachers are expected to help students make informed decisions by ensuring they understand the portfolio requirements and the assessment rubrics. Second, with respect to cultural identity, the exam allows one to choose among the artistic traditions prevalent in one’s community or that otherwise pique one’s interest.

Examples of artifacts submitted, and of written responses accompanying the sustained investigations, are shown in Figures 1–3 and Tables 2–4, respectively. These examples were selected to suggest the wide range of complex individualities that can be found among the submissions. Each investigation (not shown) illustrates a high score on a particular rubric dimension (inquiry; practice, experimentation, and revision; materials, processes, and ideas; or 3D/drawing art and design).

Figure 1. 2021 AP® 3-D Art and Design sustained investigation of impact of stress/anxiety (Image 6).

Note. Material(s): Nickel sheet metal. Process(es): Wearable cage neckpiece created by piercing nickel sheet and forming using an anvil and mallet. © 2020-2021 by Jessica McAnn. Used by permission. Sustained investigation can be viewed at https://apcentral.collegeboard.org/pdf/ap21-3d-art-and-design-sustained-investigation-row-c-score3.pdf

Table 2. Written response from 2021 AP® 3-D Art and Design sustained investigation of impact of stress/anxiety.

Figure 2. 2021 AP® 3-D Art and Design sustained investigation of religion and gender identity (Image 1).

Note. Material(s): Performance - wood, found cloth, mascara. Process(es): Constructed wood and fabric garment, recorded act of putting on and taking off. © 2020-2021 by Is Perlman. Used by permission. Sustained investigation can be viewed at https://apcentral.collegeboard.org/pdf/ap21-3d-art-and-design-sustained-investigation-row-a-score3.pdf

Table 3. Written response from 2021 AP® 3-D Art and Design sustained investigation of religion and gender identity.

Figure 3. 2021 AP® 3-D Art and Design sustained investigation of black hair/beauty (Image 4).

Note. “Flexin My Complexion, 2021.” Medium: Photography, Fiber Arts, Metal. Location: Greenville, SC. Model: Karis Finklin. © 2020-2021 by Kristen-Alyse Finklin. Used by permission. Sustained investigation can be viewed at https://apcentral.collegeboard.org/pdf/ap21-3d-art-and-design-sustained-investigation-row-b-score3.pdf

Table 4. Written response from 2021 AP® 3-D Art and Design sustained investigation of black hair/beauty.

As the examples make clear, cultural identity is at the center of these submissions: exceptionality in one case, religion and gender identity in another, and race/ethnicity and gender in the third. That is, these students chose to integrate their unique intersectional identities into their demonstrations of competency. It should be noted that many students, perhaps most, choose not to pursue work so entwined with identity. The critical point, however, is that this option is available for those who wish to take it.

Although the AP Art and Design model vividly illustrates sociocultural responsiveness, such an assessment raises obvious technical questions, including ones of rater agreement, score reliability, standardization, score comparability, and, central to this discussion, fairness. Stecher (2010) reviewed the literature on the technical quality and impact of performance and portfolio assessment in K-12 assessment contexts generally. He concluded that scoring consistency among raters was achievable given raters who sufficiently understood the domain and the rating criteria, task design based on clear definition of what constitutes proficient performance, scoring guides that minimize inference, effective training, and careful monitoring. With respect to score reliability, he noted that this characteristic depends greatly on the number of tasks employed, which might be only performance or some combination of performance with selected response. In the case of the 3D Art and Design Portfolio Exam, the raters are domain experts (AP or college teachers), the rubric is general but made more concrete through extensive training with work samples that mark score categories, and each submission gets multiple independent ratings. Although 20 images are submitted by each student, the images and written documentation are judged as each of two collections – five images plus the written submission for the first section and 15 images with the associated written submission for the second – which is sensible given the related nature of the materials within a section.
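Stecher’s observation that reliability depends on the number of tasks can be made concrete with the classical Spearman-Brown projection, a standard result that assumes the added tasks are parallel to the original ones – an assumption that complex performance tasks only approximate:

```latex
% Projected reliability \rho_k when an assessment consisting of one
% task with reliability \rho_1 is lengthened to k comparable tasks:
\rho_k = \frac{k\,\rho_1}{1 + (k - 1)\,\rho_1}
% Illustration: a single performance task with \rho_1 = .40 projects
% to \rho_4 = 4(.40) / (1 + 3(.40)) \approx .73 with four tasks.
```

The formula helps explain the design tension in portfolio assessment: judging submissions as a small number of related collections, rather than as many independent tasks, trades some reliability for depth and coherence of evidence.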

Perhaps more fundamentally, this exam challenges traditional notions of standardization and score comparability by directly rejecting sameness. As should be evident, the wearable choker (Figure 1), wood and fabric garment (Figure 2), and suit jacket (Figure 3) vary dramatically from one another, as do their larger sustained investigations. That variation emanates from differences among the identities, interests, and prior knowledge brought by the students.

The challenges posed by the Art and Design Portfolio Exams are in fact being taken up by the measurement community. Randall (2021a), for example, asserts that test content is not neutral. That point is powerfully illustrated in the flexibility of the Art and Design Portfolio Exams, which allow students to mold test content positionally. Similarly, Mislevy (2018, chapter 9) sees a key goal of personalization as making tasks less comparable across individuals to produce evidence that is more comparable. He lays out the theoretical foundation and some of the technical methodology for “conditional fairness,” a principled basis for complex adaptation of the type found in Art and Design. The goal is to facilitate inferences that are conditional on examinees taking assessment variations specifically engineered to lower construct-irrelevant requirements for them (Mislevy et al., 2013). Along similar lines, Sireci (2020) proposes the concept of UNDERSTANDardization to encompass “ … (a) what each student brings to the testing situation in addition to the proficiency measured, (b) how these personal characteristics may interact with testing conditions, and (c) how the testing conditions can be flexible enough to accommodate and account for these potential interactions” (p. 101). Finally, Moses (2022) describes an expanded linking framework for just such situations where tests are developed to measure constructs differently for different examinees.

These various proposals, then, suggest that standardization, as traditionally conceived, might be less necessary than once believed. Such is especially the case when the consequences attached to scores are relatively limited, lessening the need for high levels of comparability – a situation somewhat truer today than, for example, when state accountability assessment was widely used for educator compensation, promotion, and termination decisions (Bennett, 2016). Strict standardization might likewise be less necessary when restrictions on interpretation can be imposed to discourage unwarranted comparability inferences (Moses, 2022).

Aside from the technical questions are more practical ones. To be sure, the Art and Design Portfolio Exams constitute a small program, having examinee volumes of about 50,000 students in 2020 (College Board, 2020d). Further, the exams are focused on a single unusual domain at the high school level. An example of generalization to other domains, including STEM, is AP Capstone. In its two-year course sequence, students investigate topics in multiple disciplines, culminating in an academic paper and presentation that includes an oral defense of their research. Notably, student projects allow for personalization with respect to identity, background, and interest much like that of the Art and Design Portfolio Exams. In addition, by the nature of research, projects afford opportunities for engagement with external resources, including experts (College Board, 2020b, p. 43). In 2019, Capstone had a total volume similar to that of Art and Design, with some 31% of participants coming from underrepresented groups. However, only about a third of the program’s total volume participated in the second-year research course (College Board, 2021b, 2021c), where the greater opportunity for sociocultural responsiveness would seem to occur.

A further generalization of the AP performance assessment model can be found in the New York Performance Standards Consortium, operating for over 20 years (Fine & Pryiomka, 2020). Thirty-eight schools in New York City, Rochester, and Ithaca compose the consortium. In the consortium’s high schools, under a waiver from the New York State Education Department, performance assessments are substituted for the state Regents Examinations in all subjects except English. Required for graduation, the performance assessments include written and orally presented projects in history, science, mathematics, and ELA. With teacher facilitation, students explore questions they generate and have their results evaluated by external scholars and other professionals. Under a pilot arrangement, the City University of New York and the consortium’s New York City high schools have collaborated in using the results for college admission.

Common to virtually all programs of the type described is that teachers play a significant role. Not only are the assessments intended to be learning activities, but teachers may help create tasks and score responses (though not necessarily those of their own students). It is well established, however, that teachers in schools serving predominantly minoritized students are typically less qualified than those teaching in schools having mostly White students (Cardichon et al., 2020; Mehrotra, Morgan, & Socol, 2021; Rahman, Fox, Ikoma, & Gray, 2017, pp. 15–16, B-39). Additionally, as of 2018, the sociocultural background of the nation’s public-school teaching force differed markedly from that of students, with teachers being overwhelmingly non-Hispanic White and most students being from other racial/ethnic groups (NCES, 2020a, Table 203.70; NCES, 2020b). Those factors raise questions about whether such models can be scaled in ways that allow the assessments to be socioculturally responsive and technically adequate for their intended purposes. That said, significant training is generally a key element, and participation is argued to be an important path to improving teaching practice (Darling-Hammond & Falk, 2013). Such training could well be fashioned to include sociocultural responsiveness, further enhancing the value of teacher participation.

Principle 5:

Represent assessment results as an interaction among what the examinee brings to the assessment, the types of tasks engaged, and the conditions and context of that engagement

Test scores have long been interpreted as indicators of personal characteristics. That interpretation dates to the beginnings of educational measurement, when intelligence test scores were viewed as indices of genetic endowment unequally distributed across groups (e.g., Brigham, 1923, pp. 64, 71, 205–210). After genetic associations were abandoned, the field continued to conceptualize test scores as indexing hypothesized personal characteristics like intelligence, aptitude, ability, knowledge, skill, and personality. These so-called trait interpretations derived from patterns of test responses that were relatively stable over tasks, time, and situations (Kane & Bridgeman, 2017). Construct interpretations subsequently propounded by Cronbach and Meehl (1955), Loevinger (1957), and Messick (1975) added a theoretical account of the trait’s relationships to other variables but perpetuated the view of an underlying characteristic resident in the person. This conception is problematic to the degree that it personalizes achievement, ability, and noncognitive differences, which are then too often propagated into pernicious depictions of groups. (See Zwick, 2022, for this argument applied to the context of university admissions.)

A more helpful conception is suggested by sociocognitive interpretations (Mislevy, 2018). These interpretations depict an assessment result – summative or formative – as reflecting the interaction among the person, their history of opportunity to learn, the tasks presented, and the contexts in which they are posed. Due to its situated nature, test performance can be taken as indicating a likelihood to behave in certain ways conditional on these factors, rather than only on personal ones. Sociocultural interpretations go further, not only recognizing performance as situated but also emphasizing the identities, knowledge, ways of knowing, interactional patterns, and practices valued in students’ families and communities that can affect performance (Penuel & Shepard, 2016). Both interpretations, but especially sociocultural ones, highlight that a score results from the intersection of these multiple intrinsic and extrinsic influences, rather than from a generalized competency that an individual applies across all tasks and situations.

Operational testing programs rarely, if ever, report results in ways that reflect sociocognitive or sociocultural interpretations. A hypothetical sociocognitive illustration from Bennett (2020, p. 232) concerning the 2011 NAEP writing assessment is: Female students scored higher than male students at the 8th-grade level when composing online essays on demand to persuade, explain, or convey experience. Notably, the example conditionalizes its claim to a single task type – the essay – written to only three of a larger universe of composition purposes. Further, it gives critical context, including that the essays were written on demand (i.e., without preparation), unlike much of classroom composition, and that they were composed on computer. Whether 8th-grade girls of that period wrote more effective text messages, e-mails, letters, stories, or poems, for any of a variety of purposes, cannot be inferred from the results. More pointedly, the unsupportable inference that females generally write better than males (e.g., APA, 2018; Jacobs, 2018) is avoided. The example would be strengthened by a brief description of differences in relevant prior knowledge, interests, and identity between the groups.

A working definition and initial theory

From the above principles, a working definition of socioculturally responsive assessment might be formulated as follows: Socioculturally responsive assessment (1) includes problems that connect to the cultural identity, background, and lived experiences of all individuals, especially from traditionally underserved groups; (2) allows forms of expression and representation in problem presentation and solution that help individuals show what they know and can do; (3) promotes deeper learning by design; (4) adapts to personal characteristics including cultural identity; and (5) characterizes performance as an interaction among extrinsic and intrinsic factors. Socioculturally responsive assessment is assessment that people can see and affirm themselves in and from which they can learn.

An initial theory might use the components of this definition to form propositions stating causal relationships that should be empirically testable. Those propositions are described below and summarized in Figure 4. In the figure, the principles are on the left, outcomes on the right, and mediators in the center.

Figure 4. An initial theory of socioculturally responsive assessment.


Proposition 1

Problems that connect to the cultural identity, background, and lived experiences of all learners – but especially diverse ones – are posited to cause increased learner identification with the assessment, thereby promoting engagement and motivation to perform. Such problems should help to activate prior knowledge that builds on the assets these learners bring to school, causing students to perform better than they would on problems that don’t make such connections. Better performance should build confidence and a sense of efficacy that, in a virtuous circle, facilitate further learning and test performance, which in turn reinforce confidence and efficacy. Finally, these problems should lead to perceptions among stakeholders that assessment is fairer.

Proposition 2

Thoughtfully incorporating multiple forms of representation, and permitting alternate modes of expression, should cause students to show more of what they know and can do than would be apparent under the typically limited means of expression and representation provided on standardized tests, likewise increasing student performance and the perceptions of all stakeholders that testing is fair. Problems that aid students in making deep-structure connections among representational forms and expressive modes should enhance the chances for subsequent transfer of learning, as well as improved test performance.

Proposition 3

Promoting deeper learning through assessment design should cause teachers unfamiliar with approaches to such instruction to begin to incorporate these approaches in their practice. In conjunction with teachers giving greater attention to deeper learning, modeling such learning in the assessment should cause students to increase meta-cognitive self-regulatory behavior, including monitoring their performance against quality standards and internalizing the processes employed by proficient domain performers. These changes in student and teacher behavior should lead to greater learning.

Proposition 4

Adapting to personal characteristics should cause stakeholders to feel the assessment is fairer because it aligns better with student interests, cultural identity, background, and prior knowledge than does a traditional test. Adaptation should also cause higher levels of motivation and engagement with the test, thereby increasing examination performance. Given appropriate guidance, allowing choice should enhance students’ competency in exercising agency effectively, which should, in turn, positively affect learning. To the extent that agency encourages examinees to explore cultural identity and share their explorations, those identities should be reinforced and sustained. The greater the degree of adaptation to these personal characteristics, the larger the salutary effects should be, especially for students from traditionally underserved groups.

Proposition 5

Characterizing results as an interaction among what the examinee brings to the assessment, the types of tasks engaged, and the conditions and context of that engagement should cause examinees, teachers, the public, and policy makers to interpret, communicate about, and act on assessment results more carefully than is currently the case. More careful interpretation means recognizing that, absent other evidence, results are bound to task types, conditions, and contexts like those employed in the assessment – selections that developers should have made and justified on a defensible basis. Understanding results as an interaction should cause students to know better how task features, conditions, and contexts affect their performance. Similarly, that knowledge should lead teachers and students to experiment with modifications of these factors that facilitate learning and improve test performance.

The theory depicted in Figure 4 encapsulates a significant research program, with each link and box implying one or more studies. More generally, the above propositions suggest the operation of both differential boost and interaction hypotheses (Sireci, Scarpati, & Li, 2005). The former hypothesis predicts that students from diverse backgrounds thought to be disadvantaged by traditional test designs will see a larger score increase from socioculturally responsive assessment than other students see, even if both groups benefit. The latter, stricter hypothesis predicts that socioculturally responsive assessment should produce higher scores for diverse students relative to traditional measures while leaving the scores of other students essentially unchanged.
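Both hypotheses reduce to a group-by-assessment-form interaction that a straightforward regression can test. The sketch below is a hypothetical illustration, assuming the pandas and statsmodels libraries and using invented group labels and effect sizes, of how such a study’s primary analysis might be specified.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated study of the differential boost hypothesis. All effect
# sizes are invented solely to illustrate the analysis, not drawn
# from any real assessment data.
rng = np.random.default_rng(42)
n = 800
group = rng.integers(0, 2, n)  # 1 = traditionally underserved
form = rng.integers(0, 2, n)   # 1 = socioculturally responsive form
# Everyone gains a little from the responsive form (+2); the
# underserved group gains more (+5): a differential-boost pattern.
score = (50 - 4 * group + 2 * form + 5 * group * form
         + rng.normal(0, 8, n))

df = pd.DataFrame({"score": score, "group": group, "form": form})
model = smf.ols("score ~ group * form", data=df).fit()
# The group:form coefficient estimates the differential boost. Under
# the stricter interaction hypothesis, the form main effect would be
# near zero while group:form remained positive.
print(model.params)
print(model.pvalues[["form", "group:form"]])
```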

A path forward

The above propositions are framed as statements in which an element of assessment design causes an effect among teachers, students, or some other population. In isolation, of course, changes to assessment design may have little or no impact if other aspects of the ecosystem in which assessment is embedded act in countervailing ways. For that reason, the theorized effects are more likely to be realized to the extent that the described principles are broadly applied. Broad application means extending beyond state accountability and admissions tests to all components of a balanced assessment system and, as importantly, to curriculum and instruction. In this way, principles like these might serve as an underlying basis for creating more coherent systems of assessment, curriculum, and instruction, thereby opening multiple avenues for achieving the intended effects of sociocultural responsiveness.

Because changing even a single assessment system is a huge undertaking, it might be wise to work on several fronts simultaneously. One front could target incremental implementation in accountability testing, national assessment, and admissions testing. A second could concentrate on implementation in classroom assessment. Pursuing both implementation fronts is critical so that classroom and external assessments are designed to reinforce one another. A third front might involve a research program to support the implementation fronts, as well as to evaluate the earlier-stated propositions.

Pursuing these several fronts has the virtue of allowing the experiences of each to inform the others. The classroom-assessment context may afford room for trying more complete implementations than traditional testing programs can allow. Through qualitative and quantitative research, workable approaches might be found that can then be modified for incorporation into external assessment. AP Art and Design and AP Capstone already live in both worlds, operating as classroom assessments and external tests. Studying them more closely may identify elements transferable to other programs.

Simultaneously, more traditional testing programs might implement the most achievable increments first. Assessment organization staff likely believe they are already at the first level in the Evans (Citation2021) framework (sensitivity); that is, they put significant effort toward creating and reviewing content that shows awareness of cultural differences without assigning value to those differences. The larger challenge lies in incorporating relevant content within existing tests in ways that are politically achievable and logistically feasible. Meeting that challenge is complicated by several factors. For one, political sensitivities are such that content relevant to one group can readily trigger a negative public reaction from another constituency. Second, creating enough relevant content to achieve balance across the many interest groups served may be logistically unworkable. Last, relevant content demands the active involvement of members of the target populations; otherwise, organizations risk being driven by their current staff members’ perceptions, which may not align with examinee experience. Target-population involvement means hiring diverse staff, regularly engaging community members as advisors, and interacting deeply with examinees to understand what is relevant to them. Assessment organizations have not historically excelled on any of these counts.

One place to find an implementation toehold is in recognizing that not every test item needs to be immediately socioculturally relevant. But some, beyond a token number, certainly do. Beginning with a relatively small number may make for a more considered and controlled change process, dictated in part by results from the research program. But because empirical research takes years to produce results, which even then may not be consistent, the field should not wait to begin implementation. In the interim, principles, theory, and logic may be serviceable substitutes for deontic action (Gordon Commission, Citation2013, p. 158), presuming reasonable consensus around one or more of those guideposts.

A second toehold is to borrow ideas from what is already well established. In the high-school context, AP Art and Design and AP Capstone offer operational models for how to incorporate choice, very broadly conceived. As noted, such a broad conception may not be generally feasible, but more limited modifications might be possible. For example, ELA writing assessment may offer opportunities, as may the assessment of STEM competencies. At high school (and possibly middle school), prompts might be offered in which students are encouraged to choose, within the same topic, paths that build on their background, prior knowledge, interests, and identity. That idea can be supplemented, to the extent feasible, with choice among prompts engineered to give similar opportunities.

Also well established are population-specific assessments. As noted, these assessments currently serve selected exceptionality, cultural, and language groups. Although proliferating test versions might not generally be logistically or economically feasible, there is precedent among the international assessments for further expansion, where justified. PISA 2018, for example, was offered in over 90 language versions (Dept, Citation2018).

A third toehold might be through new technology. In assessments like AP Art and Design, as well as in some other programs, technology makes it more feasible to capture and submit student work processes and products. Technology additionally makes the remote training of raters and online rating possible. Further afield is socioculturally responsive machine personalization (e.g., using known or inferred examinee characteristics to automatically adjust problem content, context, format, and other aspects of presentation and response modality). As with exploring complex implementations more generally, this development might best be pursued in the learning context, porting what is found effective to testing programs over time.
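Purely as an illustrative sketch of what such machine personalization might involve (all class names, fields, and scoring rules below are hypothetical, not drawn from any operational program), a rule-based adapter might select among pre-built, construct-equivalent variants of the same item:

```python
from dataclasses import dataclass

@dataclass
class ItemVariant:
    """One of several construct-equivalent surface framings of the same problem."""
    item_id: str
    context_tags: set[str]   # e.g., {"sports", "music", "community"}
    language: str            # presentation language, e.g., "en", "es"
    text: str

@dataclass
class ExamineeProfile:
    """Known or inferred characteristics used for personalization (hypothetical)."""
    interests: set[str]
    preferred_language: str

def select_variant(variants: list[ItemVariant], profile: ExamineeProfile) -> ItemVariant:
    """Choose the variant whose context and language best match the examinee.

    Scoring is deliberately simple: one point per shared interest tag,
    plus a bonus for a language match. Ties fall back to list order.
    """
    def score(v: ItemVariant) -> int:
        s = len(v.context_tags & profile.interests)
        if v.language == profile.preferred_language:
            s += 2
        return s
    return max(variants, key=score)

# Example: two framings of the same proportional-reasoning item.
variants = [
    ItemVariant("Q1-a", {"sports"}, "en", "A team scores 3 goals every 2 games ..."),
    ItemVariant("Q1-b", {"music"}, "en", "A drummer plays 3 beats every 2 seconds ..."),
]
profile = ExamineeProfile(interests={"music", "art"}, preferred_language="en")
print(select_variant(variants, profile).item_id)  # Q1-b
```

Any operational use would, of course, require evidence that the variants measure the same construct and that examinee characteristics are collected and applied transparently.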

How will we know which implementations of the described principles are meaningful improvements over current practice? Working from the initial theory, the most meaningful implementations for a given purpose will be those designs that do some, ideally all, of the following. For examinees, especially those from marginalized groups: enable identification with the assessment, engender engagement and motivation to perform, enhance learning, build confidence and sense of efficacy, reinforce and sustain identity, show differential boost or interaction, and produce perceptions of fairness. For teachers: encourage deeper-learning practices; build understanding of how task features, conditions, and context affect performance; and promote experimentation with those factors to improve learning. For stakeholders generally: facilitate careful characterizations of performance and encourage positive views of tests.

To summarize, the growing opposition to traditional standardized tests is fueled in good part by the perception that tests represent a worldview no longer suited to the pluralistic society we are rapidly becoming. That misfit may eventually make traditional tests unsustainable, a decline already well underway in the admissions context. Measures that are designed to be (i.e., “born”) socioculturally responsive seem a reasonable direction for exploration (Bennett, Citation2021, April 10). Getting to widespread operational use will require considerable time, thought, iteration, and political skill. This paper assembled provisional design principles from various literatures, formulated a definition from those principles, provided illustrative examples, and used the principles to suggest an initial theory, with intended effects expected from implementation in a mutually reinforcing ecosystem. Multiple approaches consistent with one or more design principles were described. Those approaches included increasing the cultural relevance of content in current measures, providing population-specific assessment where sensible, exploring machine adaptation to relevant student characteristics, and encouraging learner agency to the maximum extent feasible. None of these approaches is currently logistically straightforward, applicable to every use case, or likely to achieve all the outcomes postulated by the theory. Rather, the long-term development of this assessment conception might best evolve from playing out these approaches interactively in the classroom context and in external assessments, alongside a sustained program of research. Through that implementation and research, we might well discover which specific approaches, and combinations of approaches, work best for which use cases, contexts, and desired outcomes.

Disclosure statement

No potential conflict of interest was reported by the author.

References