
Validation of sub-constructs in reading comprehension tests using teachers’ classification of cognitive targets

ABSTRACT

Reading comprehension is often treated as a multidimensional construct. In many reading tests, items are distributed over reading process categories to represent the subskills expected to constitute comprehension. This study explores (a) the extent to which specified subskills of reading comprehension tests are conceptually conceivable to teachers, who score and use national reading test results and (b) the extent to which teachers agree on how to locate and define item difficulty in terms of expected text comprehension. Eleven teachers of Swedish were asked to classify items from a national reading test in Sweden by process categories similar to the categories used in the PIRLS reading test. They were also asked to describe the type of comprehension necessary for solving the items. Findings of the study suggest that the reliability of item classification is limited and that teachers’ perception of item difficulty is diverse. Although the data set in the study is limited, the findings indicate, in line with recent validity theory, that the division of reading comprehension into subskills by cognitive process level will require further validity evidence and should be treated with caution. Implications for the interpretation of test scores and for test development are discussed.

Introduction

According to Messick’s (Citation1989) classic definition, validity refers to “the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions on the basis of test scores or other modes of assessment” (p. 13, italics in original). Argument for test validity, then, following Kane (Citation2013), must be based on an evaluation of the plausibility of the proposed interpretations and uses of test scores. In national test systems, like the ones found in the Nordic countries, classroom teachers all over the country, instead of small panels of experts, take responsibility both for the scoring of tests and for the interpretation and use of test results in the classroom (EACEA, Citation2009). Therefore, in these systems, teacher interpretations of test content and score meaning provide vital evidence for test validity.

This article concerns two issues of importance to the development and use of reading comprehension tests: the validation of sub-constructs of the test and the consistency in perceiving the difficulty of individual test items. The article reports an empirical study of (a) the extent to which specified sub-constructs are conceptually conceivable to teachers, who score and use national reading test results, and (b) the extent to which teachers agree on how to locate and define item difficulty in terms of expected text comprehension. In this way the article aims to contribute relevant knowledge about validity in the assessment of reading comprehension and to discuss ways of ensuring construct and content representativeness in tests.

Reading comprehension is commonly conceived of as a multidimensional form of processing, as opposed to being a unitary construct (Campbell, Citation2005; OECD, Citation2009; van den Broek, Citation2012). This suggests that the reading situation, text, topic, and reading purposes affect the nature of reading in such a way that, when measuring reading proficiency, the selection of reading material, the assumed reading purposes, and the construction of items will have a significant impact on students’ test results. This has been verified in many empirical studies (Best, Ozuru, Floyd, & McNamara, Citation2006; Francis et al., Citation2006; Keenan, Betjemann, & Olson, Citation2008; van den Bergh, Citation1990). Moreover, for the last couple of decades it has been fashionable in the construction of standardized tests of reading to categorize comprehension components by taxonomical models of cognitive complexity, often ranging from locating explicitly mentioned pieces of information to inferencing, interpreting, and critically evaluating text (Pearson & Hamm, Citation2005; Song, Citation2008). These are the categories, or reading processes, defined in, for instance, the PISA reading test (OECD, Citation2009). Similar categories are used in the PIRLS test (IEA, Citation2009) as well as in many national reading tests (Tengberg, Citation2017).

However, while reading processes may be identified conceptually (i.e., perceived as psychological multidimensionality, Henning, Citation1992), evidence of psychometric divisibility of reading comprehension has been scarce. Although many researchers have suggested classifications of different dimensions (Davis, Citation1968; Khalifa & Weir, Citation2009; Langer, Citation1995; McNamara & Magliano, Citation2009), there is still little consensus in the field as to how reading comprehension processes might validly be classified, or even if separable subskills of comprehension really exist at all (Alderson, Citation2000; Rupp, Citation2012). Several empirical studies, using both analyses of intercorrelations between proposed subskills and factor analyses, have demonstrated the difficulty of separating reading comprehension into reliable factors or sub-components (Meijer & van Gelderen, Citation2002; Rost, Citation1993; Schedl, Gordon, Carey, & Tang, Citation1996; Song, Citation2008; Spearritt, Citation1972; van Steensel, Oostdam, & van Gelderen, Citation2012).

Another way of validating a multidimensional view of reading comprehension involves studying the cognitive processes activated by test takers while solving tasks of different categories. In an experimental study, Rouet, Vidal-Abarca, Erboul, and Millogo (Citation2001) found that undergraduate students presented with a scientific text activated a “review-and-integrate search pattern” when answering high-level questions, and a “locate-and-memorize pattern” when answering low-level questions.Footnote1 Field (Citation2013), among others, has argued that validation of the cognitive processes underlying language test performance, concerning both listening and reading comprehension, is an area where much more research is needed. Cognitive validation means ensuring, first, that the processes activated during test-taking resemble the cognitive processing that test developers actually aim to assess, and second, that the processes elicited by test items are representative of the processes activated and needed during non-test reading (Khalifa & Weir, Citation2009; Pearson & Hamm, Citation2005).

In cases where teachers of reading are expected to make formative use of reading test results, as in the Nordic countries, cognitive validation of test results also entails that teachers interpret the cognitive complexity and difficulty of test items relatively consistently. This is especially important when tests are meant to measure comprehension according to curriculum goals, which is the case for national reading tests in the Nordic countries (Tengberg, Citation2017). In other words, teachers should be able to reach a shared understanding of the cognitive demands posed by test items. Validating the test by ensuring reliable classification of items thus works as a prerequisite for common standards in teachers’ feedback on student performance. If strengths and weaknesses of student performance are not identified correctly, then the information provided by test scores will have less valid implications for future reading instruction. The test design may thus fail to support student progress, something that inevitably will have the most damaging effect on the weakest-performing readers (i.e., those who are in the direst need of adequate instructional support).

A framework of reading processes that has gained a great deal of attention over the past decade is the threefold division into “aspects of reading” used in the PISA reading test. Originally, five aspects of reading were defined, but because reporting on five separate subscales would require a much more extensive collection of items, the framework was organized into three broader categories: access and retrieve; integrate and interpret; and reflect and evaluate (OECD, Citation2009). In the original PISA framework (OECD, Citation1999), the distinguishing characteristics of the different aspects were defined. First, a distinction is made between items requiring the reader to use information primarily from within the text and items that require the reader to draw upon outside knowledge. A second distinction concerns whether the reader must focus on the text as a whole or on specific parts of the text. A third distinguishing characteristic concerns whether the reader is asked to relate different pieces or parts of information to each other or to focus on parts of information in isolation. Finally, a fourth distinction is made between items where the reader should focus on the content of the text and items that deal with text structure (OECD, Citation1999, p. 29).

OECD admits that the definitions and the distinguishing markers necessarily oversimplify the complexity of each aspect. In addition, the specificity of tasks and their difficulty, in terms of required text comprehension, inevitably relate to the way information is structured in the text (Cerdan, Vidal-Abarca, Martinez, Gilabert, & Gil, Citation2009; Lumley, Routitsky, Mendelovits, & Ramalingam, Citation2012; Mosenthal, Citation1996; Rouet et al., Citation2001). The semantic relationships between propositions, the embeddedness of propositions, and the range and type of match between question formulation and the text available to the reader are all qualities that affect the way the cognitive load of items is perceived (Mosenthal, Citation1996). In addition, the level and the type of difficulty of items will be affected by readers’ familiarity with the topic, the genre, and the format of the text (Alderson, Citation2000), and by the depth of understanding or the extent of textual support requested by the scoring rubric (Solheim & Skaftun, Citation2009).

These complexities are important to analyze because, as Song (Citation2008) points out:

Although the concept of describable subskills is highly controversial, in practice, it is commonplace for language teachers and language test developers to distinguish different comprehension subskills or levels of understanding of a given text as a basis for planning a syllabus, describing students’ language competence, and developing test items in a mother tongue as well as in a foreign language (Song, Citation2008, p. 436).

To summarize, cognitive validation of reading comprehension testing requires not only that suggested cognitive dimensions are psychometrically distinct from each other but also that they are conceivable as psychological dimensions (Henning, Citation1992; Pearson & Hamm, Citation2005). Reliable reporting by subscales necessitates psychometric evidence that the subskills exist, but for such results to bear meaning in the classroom, experienced reading teachers should be able to identify items in terms of the suggested reading process categories. If teachers’ classification of reading test items, which are intended to measure reading comprehension according to curriculum goals, is not reliable, it will pose a threat to test validity in two different ways. First, it may be an indication that the items in question do not measure the intended skills according to the given framework. Second, it may be an indication that there is a lack of alignment between how subskills are defined in the test and how they are used and taught in the classroom. In both cases the interpretation of test scores will be undermined by a lack of validity evidence.

In alignment studies, where ratings of cognitive demand are used to evaluate items against standards, agreement between subject matter experts is essential. Herman, Webb, and Zuniga (Citation2007), for instance, investigated the reliability of such alignment indices and noted considerable variability among high school Mathematics teachers in judging item dimensionality and depth of knowledge.Footnote2 In a similar vein, Lombardi, Seburn, Conley, and Snow (Citation2010) found higher reliability for cognitive demand ratings of Mathematics items than for English/Language Arts items, and higher reliability for ratings of cognitive demand than for rigor (level of difficulty) of items.

For the cognitive processes of reading comprehension items, previous empirical studies have shown that classifying items according to a reading framework is difficult even for trained reading researchers. In 1992 the L1 reading section of the American National Assessment of Educational Progress (NAEP) included a framework of four different reading processes, or stances (Pearson & Hamm, Citation2005).Footnote3 After administration, a study was conducted to analyze the alignment between items and framework. According to Glaser and Linn (Citation1994), the results showed that the fit was moderate (only 50% agreement) and that the categories were judged not to be discrete from one another (see also DeStefano, Pearson, & Afflerbach, Citation1997). In an earlier study Alderson and Lukmani (Citation1989) presented nine university teachers with an L2 reading test and let them rate whether the items were targeting “lower,” “middle,” or “higher” order abilities. The analysis showed that for a majority of the items (27 of 41) there was very little agreement on classification between judges. Similarly, when judges were asked to describe what they thought each item tested, their accounts often varied considerably. Looking further at the items where the judges actually did reach some agreement, the researchers found little relationship between these items either in terms of difficulty or in terms of process level.

However, the suggestion that judges would be unable to reach agreement on subskills tested by particular items has been challenged in later studies (Bachman, Davidson, & Milanovic, Citation1996; Brutten, Perkins, & Upshur, Citation1991; Lumley, Citation1993). For instance, Lumley (Citation1993) let a group of five raters judge the difficulty of subskills and match subskills to individual items in an EAP test for students of non-English-speaking background. The results showed a high level of agreement between raters and thus supported the validity and, in Lumley’s words, the usefulness of the construct (p. 230). It is worth noting that these studies all concern subskills in L2 reading tests and might not be generalizable to the perception of cognitive dimensions in L1 reading tests. In addition, the raters in these studies were few and were trained experts in the constructs, usually being test constructors themselves. To the extent that reading test construction and results are expected to influence both the content of classroom teaching and teachers’ semester grading, cognitive validation should entail that tested subskills, or cognitive dimensions, are conceivable not only to testing experts but also to classroom teachers.

Context of the study

The national reading test in Swedish and Swedish as a second languageFootnote4 in ninth grade (students aged 15–16) aims at measuring reading comprehension according to curriculum goals (Swedish National Agency for Education, Citation2011). It is based on a reading framework including four reading process levels (see Table 1). The process categories were introduced with inspiration primarily from the PIRLS test and serve to organize the item sample and to ensure a balanced coverage of the intended construct (Swedish National Agency for Education, Citation2014).Footnote5 However, while both the PIRLS and PISA frameworks provide extended descriptions of the process categories involved, along with examples of items that might be used to assess them (IEA, Citation2009, p. 23ff; OECD, Citation2009, p. 45ff), the descriptors of the Swedish framework are much briefer (see Table 1).

Table 1. Reading Process Categories With Descriptors.

In the Swedish test, student results are not reported by subscales, but to receive a passing grade (an E on the Swedish grading scale), students need to score points in at least three of the four process categories. In this way the process categories have a high-stakes function in the test, which is why cognitive validity of the framework is essential.
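As a minimal illustration of this grading rule, the sketch below expresses the category-coverage requirement in R (the software named later in the Method section). The function name and the points-per-category layout are hypothetical and introduced only for illustration; they are not part of the actual scoring guidelines.

```r
# Hypothetical sketch of the category-coverage rule described above:
# a passing grade requires points in at least three of the four process categories.
# `points_by_category` is an illustrative named vector of one student's points per category.
meets_coverage_rule <- function(points_by_category, min_categories = 3) {
  sum(points_by_category > 0) >= min_categories
}

student <- c(cat1 = 4, cat2 = 3, cat3 = 0, cat4 = 2)
meets_coverage_rule(student)  # TRUE: the student has points in three of the four categories
```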

In Sweden, ninth grade is the final year of compulsory school. The national reading test is usually administered in March and serves the purpose of assessing student achievement according to curriculum goals and monitoring school performance. It is not an examination that decides the students’ final grades, but it is regarded as an important piece of evidence in assessing students’ progress according to the curriculum goals. A test result indicating that a student underachieves in some respect may, therefore, be put to formative use by the teacher in the period from March to June to help the student improve and thereby qualify for a higher final grade in Swedish. The way knowledge and skills are defined by national tests might also impact teachers’ general understanding of the skills they teach, not least as they prepare students for taking the national tests (Koretz, Citation2008; Shohamy, Donitsa-Schmidt, & Ferman, Citation1996; Stobart, Citation2008). The cognitive validity of the test is, therefore, of direct importance both to the students’ learning progress and to their final grades.

Research questions

Reliable classification of what different items actually test is crucial to the cognitive validity of any test design assumed to measure ability along some scheme of different subskills. To contribute relevant knowledge about the validity of cognitive process schemes in reading comprehension tests, this article addresses two distinct but interrelated research questions. First, to what extent are teachers able to classify reading test items reliably by reading process levels? Second, to what extent do teachers agree on how to define the difficulty of individual items in terms of expected text comprehension?

Method

Participants

The participants of the study were 12 qualified secondary school (Grades 6‒9) teachers of Swedish (all female) from 7 different schools in both rural and urban areas. Teaching experience ranged between 7 and 21 years, and the teachers had all been administering and scoring the national tests in Swedish annually throughout their working careers. In Sweden, students receive semester grades in each subject, meaning that the participants were also well experienced in assessing student achievement according to curricular goals. All of them volunteered by responding to an open invitation. The data collection was conducted during a one-day in-service workshop on assessment. One participant, however, repeatedly assigned items to several categories, against the explicit instruction given, which is why the final data set contains 11 raters.Footnote6

Procedure

The teachers were presented with 4 texts and 19 test items sampled from test material in use from 2013 to 2014. A week in advance, they were emailed pdf versions of the texts and asked to read them carefully a couple of times. During the workshop they were provided with the test constructors’ definitions of the four processes (see Table 1) and asked to classify items accordingly. Each item was, thus, classified by all teachers. Having administered the national tests several times, the teachers were familiar with the test design and most of them had read the texts and seen the items previously. It was expected, however, that they would not remember the way in which particular items had been classified by cognitive processes.

In addition to classification, the teachers were also asked to write an open-ended description for each item of what they believed the student needed to understand specifically in the text to be able to answer the question. These descriptions were produced together with the classifications. In the data collection protocol each item was thus followed by an instruction (“This is what the student needs to understand in the text to be able to answer the question”) and space for an open-ended response.

Data analysis

The data were analyzed in several steps. First, the classifications were plotted in a table (Rater x Item) for an initial review of agreement for each item. Data were then restructured (Item x Process), and Fleiss’s kappa for agreement between raters was computed using the software package R (all calculations were double-checked in Excel). Fleiss’s kappa is the recommended measure for assessing overall reliability between multiple raters on categorical (nominal) data (Gwet, Citation2008; Landis & Koch, Citation1977). While Cohen’s kappa measures the agreement between two raters, Fleiss’s kappa can be extended to measure agreement between multiple raters. As with Cohen’s kappa, Fleiss’s kappa controls for the agreement expected by chance alone and uses no weighting. To interpret values of kappa, Landis and Koch proposed that 0–0.20 indicates slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect or perfect agreement. These guidelines, however, have been criticized, and Krippendorff (Citation1980), for instance, has suggested a more conservative guide for the interpretation of kappa values, in which observed agreements below 0.67 should be disregarded, values between 0.67 and 0.80 should be seen as grounds for tentative conclusions only, and definite conclusions should be drawn only for values above 0.80. As noted by Hallgren (Citation2012), however, all interpretation of reliability estimates depends on both the method of study and the research question. Furthermore, it is essential to bear in mind that some scales and distributions are easier to judge consistently than others. Thus, most reliability estimates need to be analyzed for the particular performances being assessed and the particular scale used.
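As a minimal sketch of the reliability analysis described above, Fleiss’s kappa can be computed in R with the kappam.fleiss function from the irr package, which handles multiple raters on nominal data. The ratings matrix below is hypothetical and only illustrates the assumed Item x Rater layout; it is not the study’s data or script, and the interpret_kappa helper is simply an illustrative encoding of the Landis and Koch bands.

```r
# Sketch of the agreement analysis, assuming an Item x Rater layout (hypothetical data).
library(irr)  # provides kappam.fleiss() for multiple raters on nominal data

# Hypothetical classifications: 19 items (rows) x 11 raters (columns),
# each cell holding the assigned reading process category (1-4).
set.seed(1)
ratings <- matrix(sample(1:4, 19 * 11, replace = TRUE), nrow = 19, ncol = 11)

kappam.fleiss(ratings)                 # overall Fleiss's kappa across items and raters
kappam.fleiss(ratings, detail = TRUE)  # category-wise kappas (cf. Table 3)

# Illustrative encoding of the Landis and Koch (1977) interpretation bands
interpret_kappa <- function(k) {
  cut(k, breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, 1.00),
      labels = c("slight", "fair", "moderate", "substantial", "almost perfect"))
}
interpret_kappa(0.35)  # "fair"
```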

When analyzing how the teachers located and defined difficulty in items, all written accounts were charted and coded according to the type of textual references that they made and according to the cognitive processes that they thought were required. The coding scheme was thus developed during the analysis, and both categories and coding were recurrently reassessed so that all teacher accounts eventually fit into the final version of the coding scheme. Results of the coding have not been subjected to kappa analysis. This is partly because the reports included some missing data, to which Fleiss’s kappa is highly sensitive. The main reason, however, is that the purpose of the coding is not to display statistics of item difficulty definitions but rather to structure the teacher accounts into meaningful groups to reveal topical patterns in how reading test items are conceived by those who score the tests. It needs to be stressed that this part of the analysis is not primarily focused on what the teachers believe the challenge of an item is but on how they describe it or, more precisely, on how they locate and define the type of comprehension necessary for successfully solving the item.

Results

The logic of the study is that if teachers, whose task it is to score the tests and use the results, can classify test items consistently, it will contribute to the validation of the reading framework by showing that the reading processes are conceptually conceivable to professionals in the field. If the teachers are not able to do this, then it may be seen as a threat to test validity and may also indicate that the transfer of test results into formative use is jeopardized.

Classification of items by reading process categories

The teachers’ classifications of the 19 items are displayed in Table 2, together with an account of the official classification by the test constructor. As seen in the table, there are items (e.g., i14) for which all raters agree. They have all placed i14 in category 4, in alignment with the test constructor. Similarly, 10 of the 11 raters have placed i10 in the official category. Both of these items deal with conceptual knowledge. In i14 students are asked to assign a genre to the input text. But if we look at i11, the test constructor classifies it as belonging to category 3, which none of the teachers do. For i18 all teachers agree on choosing category 2, while the test constructor defines it as a 3. There are also items where the variation is total in the sense that all four categories are suggested (see i6 and i17).

Table 2. Raters’ Classification of Items Into Reading Process Categories.

To evaluate the overall agreement statistically within the group of raters, Fleiss’s kappa was calculated and is reported in Table 3. The results indicate what Landis and Koch (Citation1977) would define as a fair level of agreement between raters (κ = .35, 95% CI [.31, .38], p < .0001). In practice this means that there is considerable variation in how teachers classify the items according to reading process category. Table 3 also shows that there is significantly higher agreement when classifying items as category 4 than when using the other three categories.

Table 3. Fleiss’s Kappa for Interrater Agreement Between the 11 Raters.

The Fleiss’s kappa analysis also yields information about the proportion of pairs of raters that agree in their classification of each item (pi). For five of the 19 items (i9, i10, i13, i14, i18) there was fairly high interrater agreement (pi > .65), while for four items (i5, i6, i12, i17) interrater agreement was fairly low (pi < .35). Studying these items separately, no association was found between the level of agreement and reading process category. However, four of the five items with high interrater agreement were multiple choice (MC), while three of the four items with low interrater agreement were constructed response (CR). The average pi was also larger for MC items (.58) than for CR items (.42). According to an independent t-test the difference is not significant for the given sample (t(17) = −1.60, p = .13), but it may certainly be argued that the number of items is too small and that a larger sample might have detected a significant difference.
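The per-item agreement proportions and the format comparison reported above can be sketched as follows, continuing the hypothetical ratings matrix from the earlier example. The MC/CR labelling of items is purely illustrative and does not correspond to the actual test material; the pi formula is the standard per-item agreement term of Fleiss’s kappa, which is assumed here to match the reported statistic.

```r
# Per-item proportion of agreeing rater pairs (the p_i of Fleiss's kappa),
# computed from the hypothetical `ratings` matrix (19 items x 11 raters) above.
n_raters <- ncol(ratings)
counts   <- t(apply(ratings, 1, function(x) table(factor(x, levels = 1:4))))
p_i      <- (rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1))

# Illustrative response-format labels for the 19 items (not the actual test design)
item_format <- rep(c("MC", "CR"), length.out = 19)

# Classical independent-samples t-test of mean p_i for MC vs. CR items
# (var.equal = TRUE gives the pooled-variance test with df = n1 + n2 - 2)
t.test(p_i[item_format == "MC"], p_i[item_format == "CR"], var.equal = TRUE)
```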

The principal finding in this section is that even teachers with extensive experience of scoring national tests (although not being trained experts) find it difficult to reach agreement about classification of reading test items into cognitive process categories. One way of exploring further why this is so difficult is to analyze how teachers locate and describe the particular challenge or difficulty in items, by using their own words rather than fixed labels.

Defining what students need to understand to solve individual items

Thus, in addition to classification, the teachers were also asked to describe in their own words what the students need to understand to solve the given items. These reports were then analyzed and grouped into different categories, depending on the type of references they made in explaining item difficulty (Table 4). It is worth repeating that the coding is not primarily focused on what they believe the challenge is but rather on how they locate and define the difficulty of items. The coding and the table should thus be understood as displaying the points of reference teachers make when they describe what it is more precisely that a test taker must comprehend to solve the particular task.

Table 4. Type of References When Locating and Defining Item Difficulty.

A first impression of the accounts is that the majority of them are quite short (20 words or less). Still, they appear to represent an honest attempt to characterize the primary challenge of each item as precisely as possible. Naturally, there can be no single correct definition of what is necessary for a student to understand in a given text (or in a given test item). Rather, the categories suggested in Table 4 point to the multidimensionality of both reading comprehension in itself and the challenges imposed by reading test items. Both categories 4 and 6 include accounts that do not point directly to some aspect of comprehending the input text but rather to necessary knowledge of vocabulary or terminology used in the item. Table 5 displays the distribution of teachers’ definitions over the whole set of 19 items. Items are also identified by official reading process category.

Table 5. Distribution of Raters’ Definitions of Item Difficulty.

Patterns in the table seem to support the assumption that for a couple of items it is much easier to arrive at consensus about a description of the required type of comprehension than for other items. For instance, item no. 14 reads:

i14. To which genre does [the title of the text] belong? Choose the correct alternative.

  • ∘ Letter to the editor

  • ∘ Column

  • ∘ Blog post

  • ∘ Report

To this item all raters provided explanations that referred to necessary knowledge about the form and structure of text (category 6 in Table 4), as in the following examples:

R1: They need to recognize the text type, both in terms of layout and in terms of the way it is written.

R7: The student needs to understand what it is that defines the text type.

R10: To have knowledge about the form of different texts.

In contrast with many other items in the sample, i14 requires a limited amount of interpretive processing and little, if any, searching through the input text for information. Instead, the test taker should pay attention to global surface features of the text. Provided that s/he possesses adequate knowledge of genre, s/he should be able to identify the correct answer. R1 stresses the recognition of text type, while R10 highlights the need to “have knowledge.” Central to both of these accounts is familiarity with genre traits. All of the 11 raters agree that this is central for solving the item. In Table 5 we also find that many of the items for which teachers agree to a relatively large extent about item characteristics are items targeting reading process category 4. Item i15 may be an exception to this pattern, because the teachers’ descriptions of item difficulty are more varied. However, i15 is also a dubious example of category 4.

i15. Is [name of the author]’s attitude towards flying primarily positive or negative? Choose the correct alternative.

  • ∘ Primarily positive

  • ∘ Primarily negative

Justify your opinion with examples from the text.

_______________________________________________

_______________________________________________

_______________________________________________

Compared to other process 4 items, i15 does not focus on the evaluation of any textual features by using previous knowledge but rather on inferring an authorial perspective by integrating and interpreting different parts of the text. In the classification task none of the raters placed i15 in category 4, but rather in category 2 or 3 (Table 2), where the teachers’ item descriptions generally vary more substantially. Item i5 (category 2) may serve as an example:

i5: Explain why [name of main character] feels that the lesson is “pleasantly monotonous”.

_______________________________________________

_______________________________________________

_______________________________________________

When describing what the students need to understand to solve this task, the following accounts were given:

R2: That the boring lesson makes those who receive roses less “bubbling” so that [name of main character] will not become so exposed for the rapture in which he cannot participate.

R3: Be able to interpret that “pleasantly” means something positive and understand what he feels in the situation.

R4: “[name of teacher in the narrative] demonstrates equations that calm down even the most bubbling Valentine’s rapture.”

R6: That [name of main character] likes it when the lessons are monotonous.

R7: The student must understand the “figurative” description of the situation as well as direct descriptions.

The cited accounts represent various ways to conceive or explain textual challenge or difficulty in relation to the given item, and they all fall into different categories according to Table 4. Note that the issue here is not the extent to which these accounts provide more or less accurate understandings of the interaction between the input text and the item. What is important is the extent to which such differences in perceiving the comprehension and reading skills needed for a correct item response also reflect differences of a more profound content-related nature. It may, for instance, indicate variation in how teachers conceive task and text complexity in general, which may, in turn, affect both classroom instruction and assessment of reading comprehension.

From one perspective, variation in teachers’ description of item difficulty can, and perhaps should, be seen as a display of the multidimensionality inherent in many reading test items on an inferential level. As demonstrated in the example above, an item in which students are required to explain a character’s emotions will involve not only higher-order processing, such as integration of segments from different parts of the text, but also lower-order processing, such as interpretation of word meanings (in both the text and the item) and of sentences. From another perspective, validity of an achievement measure requires a certain degree of consensus among subject matter experts about what different items actually measure (Haladyna & Rodriguez, Citation2013). The fact that both categorization of items into reading process categories and open-ended descriptions of item difficulty vary considerably between teachers might indicate that the cognitive validity of the test is threatened. Implications of this will be discussed in the final section.

Discussion

The aim of the present study is to contribute relevant knowledge about validity in reading comprehension assessment by investigating (a) the extent to which specified reading process categories are conceptually conceivable to teachers and (b) the extent to which teachers agree on how to locate and define item difficulty in terms of expected text comprehension. Because test validity relies on adequate interpretation and use of test results, and because teachers in Sweden both score the national tests and use test results formatively in the classroom after administration, their perception of test content is critical. Findings of the study indicate that there is a clear limit to the reliability with which reading test items are classified by reading process category and to the consistency with which item difficulty is located and defined by teachers.

However, while the reliability of classification was generally low, the data suggested that some items were more consistently discerned than others. Agreement was, for instance, significantly higher for items dealing with examination of language and textual elements (category 4), both in terms of defining process level and in terms of perceiving item difficulty. Whether this can be explained by more easily discernible item characteristics or by a more intelligible instructional definition of the intended reading process cannot be determined from the existing data. However, it is worth noting that the instructional definition of category 4 is the only one containing an example of items included in the category (“Examples of this aspect of comprehension are to perceive that a given text is an article…,” Table 1). Definitions of the other three categories remain at a more abstract level. A concrete but simple implication of this might be that a more systematic description of the cognitive processes, including typical items and comprehension targets, could help support a more reliable discernment of the sub-constructs.

The study also indicated higher interrater agreement for MC items than for CR items, although the difference was nonsignificant. It is possible that the MC format facilitates discernment of the cognitive target because alternative answers (including the correct answer) are provided, thereby making the suggested route of textual understanding, or processing, more concrete. However, both of these suggestions (that agreement is higher when classifying category 4 items and when classifying MC items) need to be verified by future studies using larger samples.

On a more general note, it seems safe to assume that classification of items by reading process levels is a difficult task for teachers who are not trained as expert raters. This finding is consistent with some of the previous studies in the field (Alderson & Lukmani, Citation1989; DeStefano et al., Citation1997). Yet the complexity of the reading process, at the intersection of reader, text, and reading situation (Alderson, Citation2000; Snow, Citation2002), inevitably leaves substantial room for interpreting lexical, semantic, structural, and thematic constraints on comprehension differently. Therefore, the type of variation found in the present study is perhaps not very surprising. Following Alderson (Citation2000),

[a]nswering a test question is likely to involve a variety of interrelated skills, rather than one skill only or even mainly. Even if there are separate skills in the reading process which one could identify by a rational process of analysis of one’s own reading behavior, it appears to be extremely difficult if not impossible to isolate them for the sake of testing or research (p. 49).

Thus, if test grades are to depend on a subdivision of the construct, as is already the case in the Swedish national reading tests, clear evidence must be provided for both psychometric and psychological multidimensionality (cf. Henning, Citation1992). Empirical research has indicated that sub-constructs of reading are difficult to prove psychometrically (Meijer & van Gelderen, Citation2002; Spearritt, Citation1972; van Steensel et al., Citation2012). And while some studies have investigated the extent to which suggested sub-constructs can be made useful and conceivable to those who are responsible for development and scoring (Bachman et al., Citation1996; Lumley, Citation1993), few studies have examined the cognitive validity of such sub-constructs in testing systems where classroom teachers are responsible for the whole chain of scoring, interpreting, and using test results. Findings from the present study do not suggest that the concept of reading processes cannot be used for distinguishing sub-constructs of reading ability. Yet they demonstrate the necessity of collecting further empirical evidence for whether the postulated sub-constructs really exist as discernible psychological dimensions. Without such evidence the national test scores may be of limited use for reliable identification of students’ strengths and weaknesses.

The reported variation in teachers’ accounts of item difficulty and of the requested form of comprehension may certainly, following Alderson (Citation2000), Cerdan et al. (Citation2009), and Mosenthal (Citation1996), reflect a multidimensionality of the comprehension and strategic thinking inherent in any successful solving of reading test items at the inferential or reflective level. It has been pointed out that item difficulty as an empirical issue cannot necessarily be inferred from process level, response format, or even from the input text (Lumley et al., Citation2012; OECD, Citation2009; Rupp, Ferne, & Choi, Citation2006). Rather, item difficulty arises from a specific form of interaction between characteristics of certain segments in the text and the particular item. Yet by investigating how teachers locate and define the comprehension necessary for solving a set of given items, we also learn about patterns in the teachers’ perspectives. These perspectives are informative when it comes to understanding their perception of reading process levels, which in turn affects perhaps not their scoring but certainly their interpretation and use of student test scores.

According to Kane (Citation2013), test validity should be concerned with the evidence that supports suggested interpretations and uses of test scores. In the present context, this implies that the meaning of test results ought to be shared by those who score, interpret, and use them (i.e., qualified and experienced secondary teachers of Swedish). These teachers may be regarded as subject-matter experts (SMEs), in Haladyna and Rodriguez’s (Citation2013) terms, and should be able to distinguish the content and align the construct of the test to the goals in the curriculum. When SMEs disagree with the official categorization of specific items, the validity of those categorizations needs to be questioned. If the disagreement among SMEs is substantial, as in the present case, then the entire process of scale development and item categorization needs to be evaluated thoroughly. An initial step for future test development might thus be to benchmark the process of scale development against comparable reading tests. If the tests are to include a subdivision of the construct into separate processes, or subscales, then strong empirical evidence for their existence must be collected.

The size of the data set used in the study obviously entails limitations to the type of conclusions that may validly be drawn from the results. As for rater reliability, the findings cannot be generalized to the population of teachers of Swedish. Yet if the study is seen as an examination of the intelligibility and transparency of the test design itself, it should be expected that reliable classification appears not only on an aggregated level but also in any random sample of qualified and experienced teachers of Swedish. Furthermore, if treated as a case study, the findings reported may also provide some clues for future studies. For instance, if classification of the subskills tested by reading test items is difficult, it would be interesting to examine whether this is a general trait related to the broad definitions of the reading process categories themselves, or whether it is related to other characteristics of the particular items examined here.

Conclusions

The findings reported in the study point primarily to limitations of the cognitive validity of the Swedish national reading test. Together with indications from some of the previous classification studies as well as from psychometric analyses of the dimensionality of reading, the findings also point to a potential weakness of the separation into cognitive processes itself. As long as sub-constructs are identified for the purpose of organizing the reading domain and ensuring a broad representation of the construct, the consequences of limited classification reliability are perhaps minor. But when scores are reported on subscales or, as in the present case, when credits are required on items of several different subscales for test takers to earn a particular grade, the consequences may be significant and stronger evidence for validity is required. In line with Kane’s (Citation2013) argument that what is to be validated is not tests or test scores but the proposed interpretations of test scores, it may be argued that, regardless of whether the test has other qualities, such as a good representation of the construct at large or good predictive validity of future reading achievement, its capability of reliably identifying strengths and weaknesses at the sub-construct level should be questioned.

Insofar as the content of the national test is expected to reflect curriculum goals for reading, it may also be inferred that scores on subscales would allow for more detailed formative use of the test results. However, because teachers largely disagree on what separate items actually test, as well as on what the primary difficulties of particular items are, such use of test scores, either in the classroom or elsewhere, does not appear to be justified. Reading process categories of the sort examined here (which are also found in the PIRLS and PISA tests) therefore need to be treated with caution. Until the divisibility of reading comprehension into process categories can be supported by substantial validity evidence, it should at least not form the basis of high-stakes dimensions of tests.

Acknowledgments

This research was partially supported by funding from Gunvor och Josef Anérs stiftelse.

Notes

1 High-level questions, according to Rouet et al., refer to tasks that require “focus on a broader set of concepts or on concepts higher in a hierarchy” (i.e., inferential reading) (Rouet et al., p. 167). Low-level questions, in contrast, “focus on a single concept and require only superficial processing of the materials” (Ibid.).

2 Ratings of item dimensionality in the Herman, Webb, & Zuniga (Citation2007) study concerned whether items were unidimensional (targeting one topic) or multidimensional (targeting two topics). Depth-of-knowledge ratings concerned the degree of cognitive demand in a way similar to the ratings investigated in the present study.

3 The categories were Form an initial understanding; Develop an interpretation; Personally reflect and respond to the reading; and Demonstrate a critical stance (Pearson & Hamm, Citation2005).

4 The same test is used for students in both Swedish and Swedish as a second language. Thus, the test is used for measuring both L1 and L2 reading comprehension.

5 A slight moderation has been made in the third category, which in the Swedish framework includes reflection (see ). In PIRLS the reflection aspect belongs to the fourth category.

6 In the study the teacher participants serve as “raters” and will interchangeably be referred to as teachers and raters.

References

  • Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
  • Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as embodied in test questions. Reading in a Foreign Language, 5(2), 253–270.
  • Bachman, L. F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in the content analysis and design of EFL proficiency tests. Language Testing, 13(2), 125‒150. doi:10.1177/026553229601300201
  • Best, R., Ozuru, Y., Floyd, R., & McNamara, D. S. (2006). Children’s text comprehension: Effects of genre, knowledge, and text cohesion. In S. A. Barab, K. E. Hay, & D. T. Hickey (Eds.), Proceedings of the seventh international conference of the learning sciences (pp. 37–42). Mahwah, NJ: Erlbaum.
  • Brutten, S. R., Perkins, K., & Upshur, J. A. (1991). Measuring growth in ESL reading. Paper presented at the 13th Annual Language Testing Research Colloquium. Princeton, NJ.
  • Campbell, J. R. (2005). Single instrument, multiple measures: Considering the use of multiple item formats to assess reading comprehension. In S. G. Paris & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 347–368). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Cerdan, R., Vidal-Abarca, E., Martinez, T., Gilabert, R., & Gil, L. (2009). Impact of question-answering tasks on search processes and reading comprehension. Learning and Instruction, 19(1), 13–27. doi:10.1016/j.learninstruc.2007.12.003
  • Davis, F. B. (1968). Research in comprehension in reading. Reading Research Quarterly, 3, 499–545. doi:10.2307/747153
  • DeStefano, L., Pearson, P., & Afflerbach, P. (1997). Content validation of the 1994 NAEP in Reading: Assessing the relationship between the 1994 assessment and the reading framework. In R. Linn, R. Glaser, & G. Bohrnstedt (Eds.), Assessment in transition. 1994 trial state assessment report on reading: Background studies (pp. 1–50). Stanford, CA: The National Academy of Education.
  • EACEA/Eurydice. (2009). National testing of pupils in Europe: Objectives, organisation and use of results. Brussels, Belgium: Eurydice.
  • Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds), Examining listening: Research and practice in assessing second language (pp. 77‒151). Cambridge: Cambridge University Press.
  • Francis, D. J., Snow, C. E., August, D., Carlson, C. D., Miller, J., & Iglesias, A. (2006). Measures of reading comprehension: A latent variable analysis of the diagnostic assessment of reading comprehension. Scientific Studies of Reading, 10(3), 301–322. doi:10.1207/s1532799xssr1003_6
  • Glaser, R., & Linn, R. (1994). Assessing the content of the trial state assessment of the National Assessment of Educational Progress in reading. Educational Assessment, 2(3), 273–274. doi:10.1207/s15326977ea0203_4
  • Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48. doi:10.1348/000711006X126600
  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. New York, NY: Routledge.
  • Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34. doi:10.20982/tqmp.08.1.p023
  • Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9(1), 1‒11. doi:10.1177/026553229200900102
  • Herman, J. L., Webb, N. M., & Zuniga, S. A. (2007). Measurement issues in the alignment of standards and assessments: A case study. Applied Measurement in Education, 20(1), 101–126.
  • IEA. (2009). PIRLS 2011 Assessment framework. Chestnut Hill, MA: Boston College.
  • Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. doi:10.1111/jedm.2013.50.issue-1
  • Keenan, J. M., Betjemann, R. S., & Olson, R. K. (2008). Reading comprehension tests vary in the skills they assess: Differential dependence on decoding and oral comprehension. Scientific Studies of Reading, 12(3), 281–300. doi:10.1080/10888430802132279
  • Khalifa, H., & Weir, C. (2009). Examining reading: Research and practice in assessing second language learning. New York, NY: Cambridge University Press.
  • Koretz, D. (2008). Measuring up: What educational testing really tells us. Cambridge, MA: Harvard University Press.
  • Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Beverly Hills, CA: Sage Publications.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. doi:10.2307/2529310
  • Langer, J. (1995). Envisioning literature: Literary understanding and literature instruction. New York, London: Teachers College Press.
  • Lombardi, A., Seburn, M., Conley, D., & Snow, E. (2010). A generalizability investigation of cognitive demand and rigor ratings of items and standards in an alignment study. Paper presented at the annual conference of the American Educational Research Association, Denver, CO, April 2010. Retrieved from https://files.eric.ed.gov/fulltext/ED509419.pdf
  • Lumley, T. (1993). The notion of subskills in reading comprehension tests: An EAP example. Language Testing, 10(3), 211‒234. doi:10.1177/026553229301000302
  • Lumley, T., Routitsky, A., Mendelovits, J., & Ramalingam, D. (2012). A framework for predicting item difficulty in reading tests. Paper presented at the American Educational Research Association Meeting, Vancouver, BC.
  • McNamara, D. S., & Magliano, J. (2009). Toward a comprehensive model of comprehension. Psychology of Learning and Motivation, 51, 297–384.
  • Meijer, J., & van Gelderen, A. (2002). Lezen voor het leven: Een empirische vergelijking van een nationale en een internationale leesvaardigheidspeiling [Reading for life: An empirical comparison of a national and an international reading proficiency assessment]. Amsterdam, Netherlands: SCO-Kohnstamminstituut.
  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: American Council on Education.
  • Mosenthal, P. B. (1996). Understanding the strategies of document literacy and their conditions of use. Journal of Educational Psychology, 88(2), 314–332. doi:10.1037/0022-0663.88.2.314
  • OECD. (1999). PISA 2000. Measuring student knowledge and skills. A new framework for assessment. Retrieved from www.oecd.org/
  • OECD (2009). PISA 2009. Assessment framework: Key competencies in reading, mathematics and science. Retrieved from www.oecd.org/.
  • Pearson, P. D., & Hamm, D. N. (2005). The assessment of reading comprehension: A review of practices: Past, present, and future. In S. G. Paris & S. A. Stahl (Eds.), Children’s reading comprehension and assessment (pp. 13–70). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Rost, D. H. (1993). Assessing different components of reading comprehension: Fact or fiction? Language Testing, 10(1), 79–92. doi:10.1177/026553229301000105
  • Rouet, J.-F., Vidal-Abarca, E., Erboul, A. B., & Millogo, V. (2001). Effects of information search tasks on the comprehension of instructional text. Discourse Processes, 31(2), 163–186. doi:10.1207/S15326950DP3102_03
  • Rupp, A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with multiple-choice questions shapes the construct: A cognitive processing perspective. Language Testing, 23(4), 441–474. doi:10.1191/0265532206lt337oa
  • Rupp, A. A. (2012). Psychological vs. psychometric dimensionality in reading assessment. In J. Sabatini, E. R. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 135–152). New York, NY: Rowman & Littlefield Education.
  • Schedl, M., Gordon, A., Carey, P. A., & Tang, K. L. (1996). An analysis of the dimensionality of TOEFL reading comprehension items. Princeton, NJ: Educational Testing Service.
  • Shohamy, E., Donitsa-Schmidt, S., & Ferman, I. (1996). Test impact revisited: Washback effect over time. Language Testing, 13(3), 298–317. doi:10.1177/026553229601300305
  • Snow, C. E. (Ed.). (2002). Reading for understanding: Toward an R&D program in reading comprehension. Santa Monica, CA: Rand.
  • Solheim, O. J., & Skaftun, A. (2009). The problem of semantic openness and constructed response. Assessment in Education: Principles, Policy & Practice, 16(2), 149–164. doi:10.1080/09695940903075909
  • Song, M.-Y. (2008). Do divisible subskills exist in second language (L2) comprehension? A structural equation modeling approach. Language Testing, 25(4), 435–464. doi:10.1177/0265532208094272
  • Spearritt, D. (1972). Identification of sub-skills of reading comprehension by maximum likelihood factor analysis. Reading Research Quarterly, 8, 92–111. doi:10.2307/746983
  • Stobart, G. (2008). Testing times: The uses and abuses of assessment. London, UK: Routledge.
  • Swedish National Agency for Education. (2011). Curriculum for the compulsory school, preschool class and the leisure-time centre 2011. Stockholm: Swedish National Agency for Education.
  • Swedish National Agency for Education. (2014). Teacher guideline. National test in Swedish/Swedish as a second language 2014. Part A. To read and to comprehend. Stockholm: Swedish National Agency for Education.
  • Tengberg, M. (2017). National reading tests in Denmark, Norway, and Sweden: A comparison of construct definitions, cognitive targets, and response formats. Language Testing, 34(1), 83–100. doi:10.1177/0265532215609392
  • van den Bergh, H. (1990). On the construct validity of multiple-choice items for reading comprehension. Applied Psychological Measurement, 14(1), 1–12. doi:10.1177/014662169001400101
  • van den Broek, P. (2012). Individual and developmental differences in reading comprehension: Assessing cognitive processes and outcomes. In J. P. Sabatini, E. R. Albro, & T. O’Reilly (Eds.), Measuring up: Advances in how we assess reading ability (pp. 39–58). New York, NY: Rowman & Littlefield Education.
  • van Steensel, R., Oostdam, R., & van Gelderen, A. (2012). Assessing reading comprehension in adolescent low achievers: Subskills identification and task specificity. Language Testing, 30(1), 3–21. doi:10.1177/0265532212440950