Commentary

Language Testing for Migrants: Co-Constructing Validation

Micheline Chalhoub-Deville & Barry O’Sullivan

Introduction

The present issue provides a much-needed space for key issues not yet visible in the discourse of language testing and research. The various articles delve into research, policy, test development, and validity considerations for migrants, who are increasingly mandated to take language and literacy tests. The papers point to issues of “test misuse,” bias, negative impact, and altogether different test taker populations, which tend to have low literacy in their first languages. While many of the concerns raised in this special issue relate specifically to the testing of language learners with low print literacy, there are lessons here for test development and validation theory across the board. In our commentary, we focus primarily on issues of validation, a critical theme in all the papers included in the present issue. We and the authors in this special issue argue that the language testing community needs to revisit validity theory in light of the intricate connections between language testing and migration policies. Validation, as clearly shown in this issue, needs to be co-constructed by key stakeholder groups at the design, development, administration, research, and use levels.

Taking a position on validity and validation

The position Barry has taken on validation over the years is to be found in O’Sullivan (Citation2016, Citation2019). Micheline’s position has been articulated in publications such as Chalhoub-Deville (Citation2009a, Citation2016, Citation2020). We most recently discussed our shared views in Chalhoub-Deville and O’Sullivan (Citation2020). Our conversations leading up to the joint 2020 book publication revealed our commitment to several key elements in a conceptualization of validity and validation. First, consequences matter not only for research investigations but also as part of validation itself. Second, we believe we need to integrate stakeholder groups into validation work and, in turn, redistribute the total control that test developers (and policy makers) currently hold into a shared, negotiated responsibility. Finally, we argue that it is not sufficient to engage in confirmatory approaches to validation; we must also commit to critical validity research. These elements are elaborated below with references to papers in the present issue.

Consequences

The traditional definition sees validity as “the degree to which a test measures what it claims, or purports, to be measuring” (Brown, Citation1996, p. 231). This test-centred definition continues to attract some support from theorists such as Borsboom and Markus (Citation2013) and Cizek (Citation2020), who essentially argue that a test can be valid in and of itself, irrespective of use. This position sees use as lying outside of the developers’ remit, which is to identify the target ability/construct and accurately measure it. Our understanding of validity has changed over the years to a position where there is broad consensus that validity evidence to support the test itself is not sufficient; validity evidence to support test score use for a particular purpose is also critical. The prevailing shorthand consensus is that validity concerns documenting the quality of scores so that they support adequate and appropriate inferences about test takers’ abilities for a given purpose or use. There is also broad consensus, especially in language testing, that issues of consequences are germane to test use considerations.

Any test will have consequences for a range of stakeholder groups. Some of these will be predictable (e.g., university admission for those who display a specific level of performance) while others will be less so (e.g., failure due to misinterpretation and/or misunderstanding of multimodal input in test items, as seen in the Altherr Flores paper, this issue). In this issue, the authors of all the articles make a strong case for the need to pay close attention to issues of migration test impact, intended and unintended, on the lives and life chances of migrants as well as on society at large. In the case of language tests for migrants, establishing evidence of the validity of the use of particular test scores is critical. Failure to recognise this could (and should) lead to the legal liability of institutions that insist on using inappropriate, non-validated tests.

The articles in this issue, for example, Altherr Flores, Hooft et al., and Rüsseler et al., also provide compelling research showing that our traditional engagement with language tests has focused on populations that differ markedly from those typically encountered in migration and integration testing contexts. Their studies make clear how relatively little we know about the language constructs at play in these contexts. They also show how such lack of knowledge results in policies and tests that fail to provide necessary support or accommodations, which impinges on fairness and results in violations of human rights.

Audience

Our engagement with argument-based approaches has opened our eyes to one previously ignored though important element of validation, i.e., the audience. Kane (Citation1992, Citation2006, Citation2013) proposed that validation reports should adopt a legal-style structure in which argumentation takes prominence. This focus on an argument-based approach compels us to think about the audience(s) for the argument. Traditionally and almost exclusively, validity research is carried out by one test developer, provider, and researcher group and is intended for the researchers, providers, and developers in the extended stakeholder group. We are trying to convince each other of the merits of our arguments. This position has been articulated in publications such as O’Sullivan and Weir (Citation2011) and Chalhoub-Deville and O’Sullivan (Citation2020).

Authors such as Carlsen and Rocca (this issue) contend that the conceptualisation of validation as being primarily, or even uniquely, a measurement issue is critically flawed. The statistics we use to validate test score interpretation and use need to be meaningful and/or convincing to other stakeholder groups. The articles in the present issue provide compelling evidence that one critical stakeholder group we need to engage in our validity conversation is policy makers. For example, Deygers et al., in the present issue, make a strong case for engaging policy makers regarding the impact of language testing policy on vulnerable and/or marginalized populations. In short, our research evidence is of little use unless we can convert it into language that policy makers can understand, so that we can engage them in communication that helps yield more appropriate policies and more meaningful services to our test takers.

Critical validation

Kane (Citation2006) portrays validity research as a clearly structured argument, supported by theory and analyses of test data from various perspectives, intended to defend the properties of test scores. Kane states that the confirmatory approach taken in almost all validation research, where the research attempts to confirm the intended arguments of the developer, is too limited. Kane, following Cronbach (Citation1988) and Shepard (Citation1993), argues that an additional, more critical approach should also be undertaken. While the confirmatory approach continues to be widespread, we scarcely see critical validation undertaken. Additionally, very few studies by external, independent researchers are commissioned by developers to undertake critical validation research.

Many developers see critical validation as an internal affair that forms a crucial part of the test development phase, and for many practical purposes this is quite reasonable. However, instead of supporting only confirmatory validation studies, as is the case with most developers, research grants could include truly independent critical research into elements of the testing system itself. Understandably, there are significant risks in doing so given the increased commercialisation of education and assessment. Test developers and other decision makers are loath to support research which may not add value to the brand. Additionally, there will always be a genuine concern that external researchers may not be cognisant of contextual constraints in the development process and may reach conclusions that are not based on an adequate understanding of the development processes, the inevitable limitations of resources, or the context. Nonetheless, the call that emerges from the studies and arguments presented in this special issue compels a critical validation approach to migration-oriented assessments, one that better characterizes the scope of testing, the constructs measured, and the wide-reaching impact of these policy-related testing systems on individual test takers as well as society at large.

Socio-cognitive validity as integrated arguments

Over a number of years, we exchanged ideas and learnt from each other in order to work out a validation model that embraces both a U.K. and a U.S. tradition. Our shared views resulted in the validity as integrated arguments framework of Chalhoub-Deville and O’Sullivan (Citation2020). Our approach to validity research places the test firmly in its context-of-use and looks to four interactive evidence sources: 1) the measurement model, which comprises scoring, data analyses, norming, and equating; 2) the development model, which includes an underlying language development or progression model and a test development model; 3) the theory of action model, which defines exactly what we hope to achieve and how we hope to achieve it; and 4) the communication model, which articulates who we communicate with and how. In the context of this special issue, the two key stakeholder groups are the vulnerable, low script literate migrants described in the papers you have read and the government actors who influence, develop, and administer migrant language policy.
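
To give a concrete sense of the kind of procedure the measurement model subsumes, consider equating, one of its named components. As a minimal sketch, assume a simple mean-sigma linear equating design (our illustration, not a component of the framework itself): a score x on form X is placed on the scale of form Y by matching standardized scores,

\[
\frac{y - \mu_Y}{\sigma_Y} = \frac{x - \mu_X}{\sigma_X}
\quad\Longrightarrow\quad
y(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),
\]

where the μ and σ terms denote form means and standard deviations. Even this simplest of designs rests on assumptions (e.g., randomly equivalent groups) that themselves call for validation evidence.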

Another key element of this approach is that it defines the context-of-use in terms of the key stakeholders who make up that context and integrates these stakeholders into the entire process. The context-of-use comprises a description of the educational, social, and political context in which the test is to be developed and administered. This context is defined by those groups who are in any way affected by the test. These people should be identified at the initiation of any consultation process, whether related to policy decisions or test development decisions. This entails defining the interactive nature of language use and the approximate domain of test use, as well as the equally complex nature of policy making and implementation. These issues are illustrated by Carlsen and Rocca and by Lo Bianco (in Deygers et al., this issue). It is critical that the test developer and/or commissioning body is aware of these complexities if any test initiative is to succeed.

Instead of ignoring or passing the buck on consequences, test developers should engage with key stakeholder groups from the beginning of the development process. Meaningful interaction with a range of stakeholder groups helps the developer define the language and test use domains. In other words, it provides the rationale for the theory of action that will drive the whole process. It also begins a dialogue between the developer and stakeholders, and across different stakeholder groups (curated by the developer). This dialogue then forms the basis of a communication engagement model or plan, through which the developer updates stakeholders and receives feedback from them. The resulting communication is, in effect, validation-in-action. While this may be a contentious view for those who see validation as being based around an informative communication model (i.e., one-way information transfer), the reality is that an interactive communication engagement model, in which validity is co-constructed, is a prerequisite if the use of a test score is to be recognised as valid by the stakeholders involved.

An important benefit of engaging with stakeholders from the earliest possible stage of development is that it allows the developer to consider the value implications of introducing the test (see McNamara, Citation2013). Without an open and transparent interaction with key stakeholders, this cannot happen, and if it cannot happen, then it becomes difficult, if not impossible, to communicate with these groups in a truly meaningful way. We are again reminded that validity is co-constructed.

The core of the socio-cognitive model focuses on the development of the test instrument itself. Critical to this is a clear understanding of the target candidates. We define these from a number of perspectives: individual, e.g., physical, psychological, and experiential characteristics (touched on in various papers in this issue); linguistic, e.g., target language development and L1 print proficiency (as highlighted by Rüsseler et al., this issue); and cognitive, e.g., functional connectivity and processing surplus/deficit (also highlighted by Rüsseler et al.). In addition to understanding the candidature, we should also ensure that we are working to a clear definition of language progression, so that we are explicit about what exactly the test is testing. This will be augmented by a test development model, such as that proposed by Mislevy et al. (Citation2003) with their Evidence-Centered Design (ECD) approach, or by Weir (Citation2005) with his socio-cognitive frameworks.

With this representation of validation in mind, we first turn our attention to Hooft et al. (this issue). This article allows us to explore the issues of validity expressed by Chalhoub-Deville and O’Sullivan (Citation2020). Following this, we reflect on notable issues raised by the various papers in the special issue.

Reflecting on Hooft et al.

Since one of the papers in this issue (Hooft et al.) describes a low-stakes test developed for low script literate migrants, we take the opportunity to reflect on the features presented in the paper and how they correspond to the socio-cognitive validity as integrated arguments framework discussed in Chalhoub-Deville and O’Sullivan (Citation2020). In doing so, we are mindful that validation is a complex and broad program of research, and any one publication or report typically speaks to only part of the integrated evidence or arguments that undergird a testing system. Our comments should be taken as a call for additional research, whether that research already exists but is not reported here, or has yet to be undertaken.

The measurement argument

Hooft et al. argue in their paper that “[v]alid, reliable instruments to identify low-literate migrants and policy-oriented research into the assessment of low-literates are scarce.” The paper fills such a gap in the published literature. The authors present the measurement properties of the instrument they have developed. This research represents not only much-needed but also high-quality work in terms of the measurement argument. As already stated, however, a validation program is sorely incomplete if it is based on measurement-argument research alone.

The test development argument

The paper operationalises language development through the items included in the two-phase instrument, though we are not really made aware of the actual focus of the items in phase two. In any future, or fuller, validity argument, the authors should make explicit the connection between the test content and the target language domain, defined here as the language which migrants are likely to be expected to deal with in their daily lives. As proposed by Weir (Citation2005), it is important that a more specific rationale be offered for the various decisions made during the development process; this is touched on, though not in detail, by Hooft et al. By carefully documenting these decisions throughout the process, we can provide evidence that the test we develop is likely to work on a technical level, while also targeting the appropriate population and language.

The theory of action argument

The test rationale needs to be explicitly linked to the test use context. This was hinted at in the introduction to the paper, but it needs to take the form of a more explicitly written theory of change which rationalises a theory of action in which the proposed process is detailed. So, for example, it would be useful to know where the driver for this test originated. Did it come from government authorities, from organisations dealing with migrants (whether supporting them or managing them), or from the experiences or theories of the authors? It would also be useful to know where the proposed solution originated; in this case we assume it came from the authors, but in many cases the solution can be mandated by government; see Chalhoub-Deville and O’Sullivan (Citation2020) on this point.

The communication engagement argument

As stated, the authors “are collaborating closely with the asylum centres to guide them in implementing the tool and undertaking actions based on its results” (p. 15). While this is very positive to hear, it would be interesting to know how the authors plan to communicate about the instrument to all key stakeholders, including candidates, policymakers, and influencers. We tend to agree with the points raised by Tani and Nadarajan (in Deygers et al., this issue) that language tests for migration purposes can be defended. However, this can only be the case where tests are seen by the individual stakeholders concerned as offering an accurate, fair, and useful indicator of appropriate language proficiency whose use is transparent and justified. Realistically, it is unlikely that all stakeholders will be satisfied, given the often-competing agendas and concerns.

Developing and validating tests of language for migrants: additional comments

From the arguments and evidence presented in this special issue, it is clear that language test developers and validators need to change the way they approach these tests. This change needs to be on a number of levels:

Complex language models

There needs to be a recognition that all aspects of the test development process are critical to any validation claim. These need to be made as clear and transparent as possible. Test development approaches such as those set out in Weir’s (Citation2005) frameworks or in Mislevy’s Evidence-Centered Design (Mislevy, Citation2007; Mislevy et al., Citation2003; Mislevy & Haertel, Citation2007) offer guidance on how this can be achieved in practice. Language models such as those suggested for reading by Khalifa and Weir (Citation2009) and for listening by Field (Citation2019) have been adopted for use in a range of tests globally, including the British Council’s Aptis test. However, as the authors point out (see Altherr Flores, Hooft et al., and Rüsseler et al., this issue), the construct in language testing for migrants necessitates very different considerations given test takers’ often low print literacy and limited formal schooling. More research is needed to better document the constructs for testing purposes. This is a significant issue for tests for migrants, where the language use domain is so complex and varied. Without a clearly described language model, test interpretation and use are ambiguous and indefensible.

Language tests for migrants, perhaps even more than tests for other purposes, will need to be highly context- and small-group specific as well as locally appropriate. In the context of testing for English language learners in U.S. schools, Chalhoub-Deville (Citation2009b) argues that the educational system and the testing sector need to consider how to specify the types of instruction and accommodations learners require on tests according to their background variables, e.g., first and/or second language literacy, second language proficiency, age, etc. The same argument applies here. The various papers in this issue call attention to the individual needs of this migrant population. Altherr Flores (this issue) describes the adults in her article as having to learn “to read and write for the first time in their lives, and for whom second language acquisition coincides with literacy development.” It is imperative that test providers accommodate these test takers’ literacy, second language proficiency, and other relevant background variables.

Scaffolded test development and delivery

The evidence presented by Rüsseler et al. in this issue suggests that a single multi-level proficiency test can never hope to work across a population ranging from low to high literate candidates. The two-stage design beginning with a pre-test scan element, as proposed by Hooft et al., appears to be an ideal approach. However, it is important to remember that low literacy is not the same as low proficiency; hence the suggestion by Rüsseler et al. that a test aimed at low literate candidates should be far more scaffolded than existing low-proficiency tests. It is also clear that the way in which these tests are delivered will need to change so that candidates can be supported in interpreting what test developers are expecting of them, for example, where multimodal input is presented, as shown by Altherr Flores (this issue). Also, Deygers et al. (this issue) make a strong case that “[d]eveloping and administering language tests for a population that includes people with low print literacy constitutes an ethical and a political problem for the people and organizations involved in it.” Our responsibility as test developers and/or providers is to engage responsibly and responsively. The call in this issue is to overhaul our thinking and our operations to accommodate the specific needs of this vulnerable test taker population. Issues of fairness are central to test design, development, and administration.
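
To illustrate the kind of two-stage, scaffolded delivery logic discussed above, consider the following minimal sketch. It is purely our hypothetical illustration: the threshold, scan score, and form names are invented and do not describe Hooft et al.’s actual instrument.

```python
from dataclasses import dataclass

@dataclass
class ScanResult:
    """Outcome of a short pre-test literacy scan (hypothetical)."""
    candidate_id: str
    literacy_score: float  # 0.0-1.0; higher means stronger print literacy

# Invented cut-off; a real threshold would require empirical standard setting.
LITERACY_THRESHOLD = 0.4

def route_candidate(scan: ScanResult) -> str:
    """Route a candidate to a test form based on the literacy scan.

    Low scan scores route to a scaffolded form (spoken instructions,
    practice items, reduced reliance on print and multimodal input);
    higher scores route to the standard form. Note that this routing
    concerns test delivery, not a judgement of language proficiency:
    low literacy is not the same as low proficiency.
    """
    if scan.literacy_score < LITERACY_THRESHOLD:
        return "scaffolded_form"
    return "standard_form"

# Example: a low scan score routes the candidate to the scaffolded form.
print(route_candidate(ScanResult("A123", 0.25)))  # -> scaffolded_form
```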

Transparent validity arguments

The current orthodoxy is to represent validation as an argument, though we are not always made aware of whom the argument is meant to convince, let alone what it is trying to convince them to believe. The critical weakness of the whole argument-based approach is seen by Carlsen and Rocca (this issue) as being that “the test developer is very much in control of the whole process from test development to test use, interpretations and consequences.” In other words, the test developer has overall control over the arguments presented, including any claim rebuttals. We feel there is a role for such approaches, but they really need to be informed by a more candidate- and other stakeholder-focused orientation. It is also useful to consider the trend in impact evaluation practice that values independent evaluative research over internal research projects (see O’Sullivan et al., Citation2020). The recent expansion of the socio-cognitive model proposed by Chalhoub-Deville and O’Sullivan (Citation2020) offers a good first step in this new thinking, but it is just a first step. We need to work with colleagues in other disciplines to understand the needs and expectations of test takers and relevant stakeholder groups. In the case of testing low literate candidates, we should work closely with colleagues in migration studies and other fields such as discrimination studies, as well as with experts in graphic design and online customer experience where tests are delivered digitally. In the digital age, language testing is about far more than language and measurement. If a candidate’s experience of taking the test is not intuitive, then we are allowing construct-irrelevant variance to potentially muddy the waters when it comes to interpreting and using test scores. A good example of this is seen in the paper by Altherr Flores, where candidates interpreted the multimodal input in unexpected and unpredictable ways.

Policy and assessment literacy

Lo Bianco suggests in the Deygers et al. paper (this issue) that we need to engage with the concept of policy literacy. We concur that test developers need to become educated about and sophisticated in how they engage with policy. Engagement in socio-educational policies is part of validation efforts, as Chalhoub-Deville (Citation2016) argues in the context of accountability testing. We also contend that the field needs to continue to promote assessment literacy for all relevant groups involved in testing processes and outcomes. This requires that we rethink communication in the field. We recommend that we learn from and work with experts in the varied fields of communication. The goal is to learn how to effectively communicate our (validation) messages, not only to academics but also to a host of stakeholder groups, including candidates, parents, teachers, policymakers, etc. We must also work to find ways to make evidence available and accessible to a broad range of stakeholders.

An argument can be made that it will not be possible to convince test developers to make the entire validation process transparent. While some proprietary intellectual property (IP) will never be made public, there are precedents in other areas that we can consider. Two papers from the technology world are of particular interest here. One is by Gebru et al. (Citation2019), who argue that technology developers should create a detailed Datasheet for the datasets used in the development of the artificial intelligence (AI) models used across different fields (from education to security to health). The Datasheet presents the developer with a series of questions focusing on seven distinct areas of development; here, it is assumed that proprietary information will only be contained in internal versions of the Datasheet. Another is by Mitchell et al. (Citation2019), who propose that developers devise what they call a Model Card to demonstrate how AI models have been validated for a specific use. The card is designed to be short and readily accessible, with only limited technical information. It is not unreasonable to suggest that language test developers create similar publicly available and accessible documentation for their tests. These would sit well in tandem with more technically complete documents; see, for example, the Aptis General Technical Manual (O’Sullivan et al., Citation2020).
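
As a concrete, hedged illustration of what a Model Card-style public document for a language test might contain, consider the sketch below. The structure and field names are our invention, loosely modelled on Mitchell et al.’s proposal; they do not represent an existing standard or any particular test.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TestCard:
    """Hypothetical public-facing 'test card', loosely modelled on the
    Model Cards of Mitchell et al. All fields are illustrative."""
    test_name: str
    intended_use: str               # the specific decision scores may support
    target_population: str          # who the test has been validated for
    constructs_measured: List[str]
    unsupported_uses: List[str]     # uses the developer explicitly disclaims
    known_limitations: List[str]    # e.g., groups for which evidence is thin
    validation_evidence: List[str]  # pointers to fuller technical documents
    feedback_channel: str           # how stakeholders can respond

# An invented example instance; names and details are placeholders.
card = TestCard(
    test_name="Example Migrant Language Screener",
    intended_use="Placement into language support classes",
    target_population="Adult migrants, including those with low print literacy",
    constructs_measured=["oral comprehension", "basic print literacy"],
    unsupported_uses=["residency or citizenship decisions"],
    known_limitations=["limited evidence for candidates with interrupted schooling"],
    validation_evidence=["public technical manual", "equating report"],
    feedback_channel="public consultation portal",
)
```

Such a card would sit alongside, not replace, a fuller technical manual of the kind cited above.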

Concluding reflections

In this paper, we have considered some of the lessons learnt from the articles included in this special issue. The title of the paper reflects our belief, and that of our much-missed colleague Cyril Weir, that the whole process of test development should be documented as evidence of validation. This is because each decision made along the way reflects the developer’s theoretical and pragmatic perspective. Not all stakeholders will be interested in this level of technical detail, nor will all be able to understand such technical information. It is, therefore, imperative that we present information in ways that are accessible to stakeholders, and we can only learn how to do this by communicating with those stakeholders.

Not surprisingly, we agree with Carlsen and Rocca’s (this issue) proposal that the socio-cognitive model of test development and validation, as most recently proposed by Chalhoub-Deville and O’Sullivan (Citation2020), offers a viable approach to validation research in the context discussed in this issue. The primary advantage of the Chalhoub-Deville and O’Sullivan (Citation2020) validity as integrated arguments model is its four-pronged approach, which moves away from a heavy reliance on measurement and/or test development arguments. A differentiating feature of our approach is its call for inclusivity: taking into account the values, needs, expectations, and concerns of key stakeholders. By engaging in truly interactional communication with stakeholders throughout the conceptualisation, development, and administration process, using a theory of action approach, we empower them to ask the sort of critical questions that are meaningful to them. Different stakeholder groups are likely to ask very different types of questions, some highly technical, some more focused on social or educational values or conditions. Validation is all about answering these questions.

Articles in this special issue elaborate changes to our customary thinking about various matters such as:

  • our understanding of the context-of-use, highlighted by Carlsen and Rocca’s test misuse argument;

  • our engagement with the underlying construct. This is reflected in our growing understanding of the way in which low script-literate learners interact with test input, as highlighted by Hooft et al., Altherr Flores, and Rüsseler et al.;

  • our awareness of the implications of the finding that low script-literate learners display physically different brain structures, which means that their cognitive processing differs from that of other test takers when they interact with the test (Rüsseler et al.). This suggests that existing tests are highly unlikely to be fair to this population;

  • our interactions with policy and policy makers, which are vital given the close connections between policy and testing for migrants, as discussed in the Deygers et al. paper.

These changes suggest that Messick (Citation1989) and Shepard (Citation1993) were correct in suggesting that validation is an ongoing process. As long as there are questions to be answered about the interpretation and use of our tests, there will be a need for us to revise our established systems and to undertake further validation, both confirmatory and critical. We do not see it stopping there. The issues raised in this special issue suggest that conceptualising validation as a series of integrated arguments (Chalhoub-Deville & O’Sullivan, Citation2020) offers a meaningful way to fully reflect the importance of understanding both context and consequence to all stakeholders. It is only by involving as broad a range of stakeholders as feasible in the process that developers, users, and other stakeholders can co-construct validation in a way meaningful to all.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References