
Same grade for different reasons, different grades for the same reason?


Abstract

It is widely acknowledged in research that common criteria and aligned standards do not result in consistent assessment of a performance as complex as the final undergraduate thesis. Assessment is determined by examiners’ understanding of rubrics and their views on thesis quality. There is still a gap in the research literature about how analytic and holistic judgements are made and how they are integrated in decisions about the final grade. This interview study aims to identify the sources of inconsistency in analytic and holistic assessment. Ten examiners assessed three final undergraduate theses from a primary school teacher education programme at a Swedish university. The analysis of the analytic assessment focused on two criteria, theoretical framework and academic language style, both of which revealed inconsistencies in assessment. The analysis of the holistic assessment showed how examiners weighted different dimensions of the thesis and how they made judgements about the final grade awarded. The findings reveal several sources of inconsistency, such as examiners’ own constructs, their interpretations and their expectations about students’ ability to manage academic work. The study calls for further discussion about whether it is possible to make criterion-based assessment reliable.

Introduction

In 1999, the Bologna Declaration recommended that all university programmes at the bachelor’s level should involve a final thesis. In line with this declaration, the main purpose of the bachelor’s final thesis in teacher education in Sweden was defined as developing student teachers’ research competence (Mattsson 2008). In writing the final thesis, students are supposed to learn to conduct a research study following the same structure as a PhD thesis. Accordingly, the assessment criteria resemble the criteria used to assess academic work. There are different views on what academic quality involves, and this calls for more transparency in the assessment process and clearer justifications for decisions on the final grade (Grainger, Purnell, and Zipf 2008; Williams and Kemp 2019). Hence, there is growing research interest in the assessment of final theses (Bloxham et al. 2016; Haagsman et al. 2021; Koris and Pello 2022).

Research indicates that common criteria and aligned standards do not always lead to consistent assessment among examiners (Sadler 2009; Golding, Sharmini, and Lazarovitch 2014; Bloxham et al. 2016). The final assessment is therefore determined by examiners’ interpretation of the standard descriptors and their own understanding of the quality of the thesis. Assessment based on judgements against a limited number of pre-set criteria is often referred to as analytic assessment, while holistic assessment involves judging the overall quality of the work (Sadler 2009). Sadler notes that analytic and holistic assessment are not mutually exclusive categories. However, holistic assessment often masks the variation in how the underlying descriptors for a given criterion are interpreted and applied by examiners (Sadler 2009; Harsch and Martin 2013; Bloxham et al. 2016; Chetcuti, Cacciottolo, and Vella 2022). The resulting grade may, therefore, not reflect the intended judgement. There is still a gap in the research literature about how examiners weigh the rubric criteria in analytic and holistic judgements when assessing final theses (Sadler 2009; Haagsman et al. 2021). To address this gap, this study investigates consistency and variation in teacher educators’ analytic and holistic assessment of undergraduate final theses written as part of the primary teacher education programme at a university in Sweden. The aim is to illustrate the process by showing how examiners use a rubric in analytic and holistic judgement and in making decisions about the final grade.

Rubrics in higher education

The research literature accentuates the benefits of using rubrics in higher education (Reddy and Andrade 2010; Grainger and Weir 2020; Weir 2020; Koris and Pello 2022), suggesting that rubrics (1) support and enhance student learning by clarifying the targets for student work and (2) contribute to transparency and fairness in assessment practices. Rubrics are typically presented as a grid and contain three elements: criteria, commonly with the same number of gradations of quality for each criterion (Humphry and Heldsinger 2014), standards, and standard descriptors (Reddy and Andrade 2010; Grainger and Weir 2020). The standard descriptors are often short and open to interpretation, which results in inconsistent assessment. The language of standard descriptors is therefore crucial and is considered one of the most challenging aspects of rubric design.

Variability in assessment—sources of inconsistency

Standard descriptors that are open to interpretation and provide insufficient guidance for examiners have been shown to be one source of inconsistency in assessment (Bloxham et al. 2016; Bearman and Ajjawi 2018; Grainger and Weir 2020; Weir 2020). Another source of inconsistency is how assessors weigh the criteria in the rubrics, as some criteria might be considered more important than others (Sadler 2009; Bloxham, Boyd, and Orr 2011). Humphry and Heldsinger (2014, 254) assert that if separate criteria reflect ‘a general rater impression of performance’, this may induce a halo effect; that is, judgements that are strongly influenced by an overall impression or general feeling. They argue that criteria should neither be largely independent of each other nor overlap with each other, in order to allow raters to capture distinctive and complementary aspects of students’ performances.

Haagsman et al. (2021) found that examiners prioritize criteria differently in the assessment of undergraduate biology theses. Criteria such as ‘scientific quality’ and ‘structure’ were better predictors of the final grade, while the criterion ‘writing skills’ was a worse predictor. One possible reason for this could be that ‘scientific quality’ and ‘structure’ are considered to be more important than ‘writing skills’, and supervisors are therefore advised to give extra guidance and put more emphasis on the criterion ‘scientific quality’. In a study of the assessment of doctoral work, Chetcuti, Cacciottolo, and Vella (2022) show that examiners combine the explicit criteria outlined in university regulations with implicit criteria emerging from examiners’ tacit knowledge and personal expectations. Thus, assessment is influenced by implicit factors that examiners might not be aware of.

Similarly, Golding, Sharmini, and Lazarovitch (2014) show that the first impression of the work, good or bad, is decisive for the assessment. These findings concur with those of Stolpe et al. (2021), who illustrate how examiners combine institutional and personal criteria in the assessment of undergraduate theses, and how they prioritize between different assessment profiles, such as ‘logic text structure’, ‘research process’ and ‘results’. Stolpe et al. show that a holistic approach involves a lack of coherence about whether the process, the product or the logic of the text should be the main focus in assessment.

Conceptual framing—analytic and holistic assessment

Sadler (2009) notes that grading is never entirely analytic or holistic. Analytic assessment identifies the specific elements that contribute to the final grade. In holistic assessment, quality is recognized through the experience of the examiners, who choose whether it is rational to build up a judgement from discrete decisions on the criteria. Research indicates that examiners use a holistic approach rather than leaning on formal criteria in the process of marking (Bloxham, Boyd, and Orr 2011). They decide which criteria are relevant in each specific case in a process of hermeneutic interpretation, where parts are combined to form a whole (Sellbjer 2017). However, assessment is never a mechanical aggregation of parts to form a whole (Sadler 1989, 2009). Qualitative judgements cannot be reduced to a formula that can be applied in assessment, since they consist of multiple combinations made by a person who becomes the source of, as well as the instrument for, the appraisal. A work that is judged as outstanding on each criterion may be judged overall to be mediocre, while a brilliant work may not get the highest rating on each criterion (Sadler 2009).

The current study is informed by Sadler’s (1989) characterization of criteria as sharp and fuzzy, where sharp criteria relate to an identifiable transition from one state to another, such as from incorrect to correct, and fuzzy criteria are characterized by a continuous gradation from one state to another. A sharp criterion is rational, based on specific descriptors, such as the length of the text (the number of words), grammar, referencing and structure, and focuses on the measurable and unambiguous aspects of the final thesis, requiring a set of clarified prescriptions. A fuzzy criterion is an abstract mental construct with ‘no absolute and unambiguous meaning independent of its context’ (Sadler 1989, 124). These concepts guided the data collection and analysis in this study.

The aim and research questions

The aim of the study is to identify sources of inconsistency in assessment of final theses in teacher education, by exploring how examiners use a rubric in analytic and holistic assessment.

The research questions are:

  1. How do the examiners’ views on rubrics reflect sources of inconsistency in analytic assessment?

  2. What do examiners interpret as essential in integrating analytic and holistic judgements to award an overall grade?

Data sources and procedures

Ten examiners with varying experience of assessment were recruited from a primary teacher education programme at a university in Sweden to assess three undergraduate final theses (A, B, and C). The theses had all passed the examination but nevertheless had some obvious weaknesses in certain respects. All three theses were qualitative studies using semi-structured interviews as a method. Thesis A was the weakest one with regard to the theoretical framework, since there was no explicit description of theory in the thesis, nor of how it was used in the study. Furthermore, there were clear shortcomings in the structure of the thesis. In Theses B and C, the students aimed for a theoretical framework but in both cases some theoretical deficiencies were identified. Thesis C involved some language shortcomings that could be difficult to assess. Both Theses B and C were originally awarded Pass with Distinction.

Rubric and grades in the current study

The rubric used in this study provides standard descriptors for the following criteria: aim and problem statement, method and theory, findings and analysis, sources and references, discussion and conclusion, structure, academic language style and wider research context (Table 1). The rubric is used for evaluation, as well as to provide feedback to students before the final judgement, when a grade (Fail, Pass, Pass with Distinction) is assigned by the examiners. The rubric does not provide any standard descriptors for Fail.

Table 1. Rubric template used for assessing undergraduate final thesis in teacher education (translated to English from Swedish by the author).

Course setting

The course for writing the final thesis involves 10 weeks of full-time study. The time is mainly spent on individual writing and supervision. The supervisor has a set amount of time for supervision, and it is up to the student to decide to what extent to turn to the supervisor for support within the hours given. There is an introductory lecture at the beginning of the course which informs the students about the course guidelines and the rubric. At the end of the course, students participate in an examination seminar where they use the rubric to oppose each other’s theses. Students have the right to decide for themselves whether to hand in the thesis for examination; supervisors can only discourage them from doing so and have no right to prevent them from submitting. After the seminar, the students are given five days to improve their theses in accordance with the examiner’s comments, based on the rubric. The examiner decides on the grade when the final, revised version of the thesis is handed in.

Data collection and analysis

The data were collected through semi-structured interviews. In advance of the interviews, each of the examiners was provided with three theses and the rubric used in the programme. The examiners were not given any information about the theses except that they had been graded. The interviews followed a thematically structured guide consisting of two parts. The first part involved questions about the examiners’ backgrounds, their assessment experience and their procedure for assessing. The second part focused on the assessment of each of the three theses. The examiners were asked for an analytic assessment of method and theory, findings and analysis, academic language style and structure, to make an overall holistic judgement, and to provide an overall grade as if they were considering the thesis as a final product.

The interviews were carried out at the examiners’ workplace. Each interview lasted for 60–70 min and was audio recorded. The transcribed interviews were coded according to the themes that structured the interview guide: (1) background information and general assessment procedure, (2) analytic and holistic assessment, (3) final grade awarded. The first step of analysis involved creating profiles for each examiner (E). The profiles were based on summaries of the first theme in the interview guide, involving disciplinary backgrounds, and teaching, supervising and assessment experience, including how examiners learned to assess undergraduate theses. The examiners had a disciplinary background in science (E1, E3, E5, and E9), social science (E2, E6, and E10), mathematics (E7 and E8) and language (E4). All examiners except E7 and E10 had a teaching degree and had worked in schools before their PhD studies. The number of theses previously assessed by each examiner varied from about 12 (E3 and E8) to 25-50 (E2, E4, and E5) to hundreds (E1, E6, E7, E9, and E10). All examiners learned assessment through their own experiences of being PhD students or by participating in assessment practices occasionally offered by the department.

The second step of the analysis involved thematic coding in line with Braun and Clarke (2006). The following themes were defined: (1) examiners’ views on the rubric and their interpretation of the standard descriptors in the assessment of the theses (analytic assessment), (2) an overall assessment of the weaknesses and strengths of each thesis (holistic assessment). Summaries of the assessment of each criterion for every thesis were outlined in separate tables, with the weaknesses and strengths identified in the assessment of each thesis serving as analytical categories. Axial coding was then carried out for each category (Corbin and Strauss 2008) to identify consistency and inconsistency in assessment by relating concepts to each other. Inconsistency was particularly salient in the assessment of theoretical framework and language.

The analysis showed that the theoretical framework (involving method, analysis and results) was assessed as weak overall in all three theses. Of 30 judgements, 19 addressed specific deficiencies in theory, but there was considerable variation in what these deficiencies were. The language was assessed as good or acceptable in 20 judgements, while in 10 it was considered unacceptable in relation to academic standards. The assessments of theoretical framework and of language were therefore chosen to illustrate the sources of inconsistency in assessment, leading to the formulation of the first research question.

The third step of analysis involved summaries of the aspects that examiners interpreted as essential when combining analytic and holistic judgements to award an overall grade. During this analysis, the second research question was formulated.

Ethical considerations

The study has followed the ethical principles of the Swedish Research Council (2017). The empirical material has been collected and reported anonymously. To protect the examiners’ identities, background variables such as gender and age are not revealed. In the presentation of the findings, the examiners are referred to by numbers (E1–E10) and, where a pronoun is necessary, the gender-neutral they is used.

Findings

To identify the sources of inconsistency in assessment, examiners’ views on rubrics are presented in this section, followed by an analysis of how analytic and holistic judgements were made, and how they were combined into an overall grade. All quotations have been translated from Swedish by the author.

Examiners’ views on rubrics

A general view on rubrics is that they are useful both in supervision and in assessment. Examiners claimed that rubrics are necessary and helpful when the assessment must be justified to the students, even though they primarily trusted their own judgement as a source for assessment decisions rather than using rubrics, as E4’s comment illustrates:

The rubric is helpful, but I think I have a routine, even though I have not supervised so many theses yet. Anyway, I think I know how to do it, so I do not usually focus too narrowly on rubrics.

Discrepancies in examiners’ views on rubrics in general were as follows:
  1. The standard descriptors are open to interpretation but adequate (E1, 7, 9).

  2. The expectations on students are reflected in their low performance (E1).

  3. The standard descriptors are vague and impossible to use (E6). Discussions among examiners would be more valuable than using rubrics for achieving equal views on thesis quality, which would benefit the students.

  4. The standard descriptors are interpretable but nevertheless helpful in assessment and supervision (E2, 3, 4, 5, 8, 10).

These discrepancies were not related to the examiners’ experience or disciplinary background.

The assessment of theoretical framework

A general point of view among the examiners was that the standard descriptor for theoretical framework (Table 1) is vague and that there was no shared understanding of its meaning. Theory was considered one of the knowledge fields that is challenging for students to learn, which affected the requirements and expectations for what students were to achieve. In general, the examiners noted that the assessment of theory is particularly critical since teacher education does not prepare students for theory-based research, including an understanding of theory and how to use it in a thesis. This indicates that the theoretical framework was a problematic criterion for analytic assessment.

Two contrasting views on the standard descriptors in relation to students’ achievements emerged: (a) the level of the standard is achievable for students, and teacher education should provide the necessary knowledge about how theory can be applied in undergraduate theses (E1, 3, 4, 5); (b) the level of the standard is too high and therefore difficult or perhaps impossible for the students to achieve (E2, 6, 7, 9). According to the latter view, the standard descriptor needs to be revised since students rarely manage to achieve it:

It is extremely rare that the students understand what a theory is… Basically, 90% of all papers could be rejected due to this standard descriptor. (E2)

Even though all examiners acknowledged that teacher education did not sufficiently prepare students for research, they implied that it should. Despite this, the view of the standard descriptor as difficult, even impossible, to achieve influenced their assessment, affecting their own requirements and lowering their expectations regarding student performance: ‘The students do not get sufficient knowledge on theory in teacher education. How can we make sure they fulfil these requirements?’ (E6). However, E1 had a different view, claiming that the overall expectations on students must remain high and the standards cannot be lowered, which led to E1’s assessment of all three theses as ‘Fail’.

The theoretical weakness in Thesis A was pointed out as critical for the final grade by seven of the ten examiners (E1, 3, 4, 6, 7, 8, 9), while three (E2, 5, 10) provided a more positive judgement. E1 stated that Thesis A had not reached the level required for submission for examination, since there was no theoretical framework in it, which contributed to the weak analysis. However, E10 could discern some theoretical concepts: ‘There is no theory, but it is not always necessary. There are some theoretical concepts’ (E10), while E5 was more generous in their judgement, pointing out that there is ‘an attempt to use theory here’ (E5), though not elaborating further on this. Similarly, in Thesis C, the theory was judged as weak by all examiners, even though three of them (E5, 6, 10) asserted that it corresponded to the level that could be expected. In Thesis B, the discrepancies in the assessment of theoretical framework were shown in two contradictory judgements. Seven examiners stated that the study was ‘theoretically strong’ or a ‘well theoretically anchored study’ (E1, 3, 4, 5, 7, 8, 10), while three examiners stated that the theory was ‘tacked on’ (E2, 6, 9) or was some kind of ‘decoration’:

I can see that this student does not understand what a theory is. She seems to think that theory is a framework to decorate the thesis with some kind of diamonds. (E2)

In contrast to Haagsman et al. (2021), who showed consistency in the assessment of the criterion ‘scientific quality’, these findings reveal rather contradictory views on the assessment of theory. The examiners’ own constructs, expectations of students and personal interpretations of quality are shown to be sources of inconsistency. These findings replicate those of Chetcuti, Cacciottolo, and Vella (2022), who found that assessment is influenced by implicit factors such as tacit knowledge and personal expectations. The examiners’ varying views on rubrics and their interpretations of the standard descriptors point to a range of implicit factors allowing continuous gradation, characterizing the criterion on theory as fuzzy (Sadler 1989). This indicates that examiners become the source of, as well as the instrument for, the judgement, bringing in personal expectations and interpretations of the standard descriptors.

The assessment of academic language style

The main source of inconsistency in the assessment of language emerged in the examiners’ views on whether some students were actually capable of reaching the requirements for Pass (Table 1). In Theses A and B, the language was in general not pointed out as a decisive factor in passing the thesis, even though some minor linguistic errors were noted in Thesis B by E2, 4, 5, 6 and 7. The discrepancies in the assessment of language were most striking in Thesis C, which contains shortcomings not typical of a native speaker. E10 assessed the language as ‘good’ and E6 as ‘very good’, pointing out Thesis C as the best of the three theses. Similarly, E3 and E10 stated that the level for Pass was reached without paying any further attention to the language shortcomings, while E1, 2, 4, 5, 7 and 9 noticed them. However, it was only E4, with a disciplinary background in language, who regarded the language as decisive for the final grade awarded:

There are too many shortcomings, and this thesis can’t pass. The more I look at it, the more I can see that it can’t pass because of the language. (E4)

E4 justified the decision not to pass the thesis from a disciplinary point of view, considering language as fundamental for the teaching profession:

One cannot let it pass because these students are going to teach in schools. If you cannot make yourself understood, then you cannot teach. (E4)

E4’s comment indicates a view of the language criterion as sharp, that is, either correct or incorrect (Sadler 1989). On the other hand, E9 appears to view the criterion as fuzzy since they did not consider the shortcomings as critical to the final grade, because the student was obviously not a native speaker:

One can see that the student has language difficulties, the sentence structure is wrong, but it is hard to do anything about it. I would not fail the thesis because of the language. (E9)

Similarly, E2 and E5 stated that they would not fail the thesis because of these shortcomings, even though they claimed that there were passages that were difficult to follow because of deficiencies in language. E9 and E5 required some revisions in Thesis C. However, in the final grading, E9 awarded the thesis a Pass, while E5 failed the thesis because of the typographical errors, not because of the shortcomings in language. Even though the standard descriptor on language was not achieved, most examiners did not consider the shortcomings sufficient reason to fail Thesis C.

In conclusion, similarly to the assessment of the theoretical framework, the analysis shows contradictory assessment of language. Sources of inconsistency emerge from three opposing views: (1) the language cannot be further improved, and assessment therefore focuses on other qualities, (2) the language is of particular importance for the teaching profession and should be improved, (3) the shortcomings in language overshadow other qualities in the thesis. One possible explanation for these discrepancies could be found in the first impression, which has been shown to have a strong influence on assessment (Golding, Sharmini, and Lazarovitch 2014). E6 claimed that they were impressed by the typographical correctness, thorough summary and introduction, which made the reading of the initial three pages of Thesis C enjoyable. Despite the observed analytical shortcomings, they assessed the results as ‘very exciting’ and Thesis C as the best of the three theses. This might be the reason that E6 did not pay attention to the shortcomings in language and awarded the thesis the highest grade, Pass with Distinction. Some aspects of the thesis were thus assigned more importance, inducing a halo effect (Humphry and Heldsinger 2014). In contrast, E7 failed Thesis C, assessing it as the worst of all three theses. They stopped reading after the initial pages because of the ‘really terrible language’ and ‘references that are not anchored in research’. E7 stated that ‘there is no point reading the whole thesis if it has failed in the initial reading’.

According to Golding, Sharmini, and Lazarovitch (2014), an initial negative impression entails a more critical reading focusing on problems, which was the case with E7, while a positive first impression makes the reading enjoyable, as it was for E6. Another possible reason for the contradictory assessment of language is that the grammatical errors of non-native writers are easy to observe, and there is a risk that shortcomings in the language obscure the content and influence the holistic judgement negatively (Sadler 1989; O’Hagan and Wigglesworth 2015). However, Sadler notes that there is no independent method of confirming whether a judgement is correct, since it depends on which means are used to arrive at a conclusion. If an assessment focuses on certain textual properties, this leads to one conclusion, while focusing on qualitative aspects, with or without rubrics, leads to another. Neither judgement is right or wrong, since they are based on essentially different approaches.

Integrating analytic and holistic judgement to award an overall grade

The analysis revealed that the overall ranking was not consistent. All theses were failed by one or several examiners in the study, although none was failed in the original grading (Table 2). None of the theses was awarded the same grade by all the examiners in the study, and their grading also deviated from the original grade assigned to the papers (O’Hagan and Wigglesworth 2015; Bloxham et al. 2016).

Table 2. Summary of the grades awarded.

The deviation in grading in relation to the original grade awarded was most striking in the assessment of Thesis A and Thesis C. Both were awarded a lower grade than the original one by most examiners. Assessment of Thesis B was most consistent with the original grade awarded. Sources of inconsistency in grading were recognized in the differences in views on whether it was possible for students to achieve the standards.

In the analytic assessment, some essential commonalities emerged in relation to the specific grades. For instance, when the final grade awarded was Fail or Pass with Distinction, the examiners provided a range of discrete decisions about specific criteria in the thesis, which they did not do to the same extent when the judgement concerned Pass. Particularly when it came to Fail, the shortcomings were described more specifically and extensively, highlighting several deficiencies, even though similar analytic judgements could result in both Pass and Fail. Even when theoretical weaknesses in Thesis A were observed, it could still be given a final judgement of Pass: ‘There is no depth in either the results or in the discussion. The theoretical framework is weak, as well as the method’ (E7). The examiners who failed Thesis A (E1, 2, 3, 4, 6, 8, 10) pointed out theoretical, methodological and analytical weaknesses: ‘There is no theory and no description of how analysis was provided’ (E10).

In the final grading, the examiners focused on aspects that were possible to revise, such as reconstructing the thesis (E6, 8), making the results more coherent (E8), or illustrating how the analysis was done (E10). Aspects that are more difficult to revise, such as theoretical and methodological weaknesses, were not pointed out as critical in failing the thesis. For instance, E3 stated: ‘I don’t see any reason to fail this thesis because of the theory, but I wish that the students gained more knowledge about theory and that they could develop analytical skills’. Thus, in grading, the examiners took into consideration the difficulty for students of achieving an understanding of such critical criteria as the theoretical framework: ‘they cannot learn theory in one week’ (E2), and therefore did not require the revision of theory in the thesis. Even though the examiners rated the work as unsatisfactory in many important respects, particularly theory, they did not fail the thesis, which indicates that examiners lean on their own benchmarks when they make individual judgements, and weigh the separate criteria in analytic assessment differently (Williams and Kemp 2019; Stolpe et al. 2021).

Holistic assessment gives the examiners freedom to decide whether or not to build up the judgement from discrete decisions (Sadler 2009; Haagsman et al. 2021), and to weigh the importance of different dimensions of the thesis in marking (Williams and Kemp 2019). The first impression has a strong impact on holistic assessment (Golding, Sharmini, and Lazarovitch 2014), and in this study it turned out to be decisive, particularly when the grade awarded was Fail or Pass with Distinction. E7 stopped reading Thesis C after a couple of pages, assessing it as ‘the worst thesis’ and having no doubts about failing it. When awarding Pass with Distinction, examiners referred immediately to the high quality of the thesis in their judgements: ‘This was the best of the theses. Very good summary, excellent table of content, interesting introduction’ (E6, Thesis C); ‘It starts with a very, very well-formed and precise aim’ (E7, Thesis B).

However, even when awarding the highest grade, the examiners identified shortcomings in each thesis: ‘The weakness is the lack of analysis and consistency in the results. Actually, there should be a really interesting analysis for Pass with Distinction, so for now there is a minus in the grade’ (E6, Thesis C); ‘Indeed, the language could have been better’ (E7, Thesis B). This indicates that the examiners relied on their first holistic impression, rating the work as outstanding, to obtain an outcome with which they felt comfortable, as suggested by Baume, Yorke, and Coffey (2004). Even when the assessment based on discrete decisions did not result in the highest grade, the holistic judgement based on the first impression seemed to have a strong impact that overshadowed the weaknesses, which were not taken into consideration in the grading.

One particular criterion could be decisive for the holistic judgement (Humphry and Heldsinger 2014). However, the data in this study did not reveal any indication of the direction this halo effect might take in determining the final grade. The global impression could result in the highest as well as the lowest grade. This replicates the findings from several studies indicating that the variation in the interpretation of criteria is often masked in holistic assessment (Sadler 2009; Harsch and Martin 2013; Bloxham et al. 2016; Chetcuti, Cacciottolo, and Vella 2022).

Discussion

Despite the increased use of rubrics in higher education, Grainger and Weir (2020, 3) assert that the ‘actual process of measuring student achievement remains a mystery’. This study contributes to the research field by providing insights into how particular assessment decisions are made. Even though the study does not show how examiners weight the criteria when they integrate analytic and holistic judgements into the final grade, it reveals several sources of inconsistency, such as examiners’ personal interpretations of quality and their expectations about students’ ability to manage academic work. Examiners followed their own benchmarks in assessment: a single criterion could be decisive for the holistic assessment and the final grade awarded in some judgements but less critical in others. Thus, the same grade could be awarded for different reasons, and the same judgement could result in different grades. These findings are supported by a large body of previous research (Baume, Yorke, and Coffey 2004; Grainger, Purnell, and Zipf 2008; Sadler 2009; O’Hagan and Wigglesworth 2015; Bloxham, Boyd, and Orr 2011; Bloxham et al. 2016; Haagsman et al. 2021; Stolpe et al. 2021).

The study reveals some underpinning assumptions about examiners’ views on criteria. Both ‘fuzzy’ and ‘sharp’ criteria (Sadler 1989) could be decisive for the final grade, regardless of whether the grade was high or low. These findings indicate that even if the rubric contains criteria such as theory and academic language (Table 1), these criteria might not be considered critical for the final assessment by the examiners. Thus, there is a risk that examiners consider some features of the work to be dominant in assessment and decisive for the final grade (Sadler 2005; Humphry and Heldsinger 2014; Haagsman et al. 2021), and that examiners base the assessment on their own personal views (Williams and Kemp 2019). Sadler (2009) states that a competent judge can make an appraisal and decide which criterion is relevant. Accordingly, the concise nature of standard descriptors means that consistency in assessment relies on the professionality of the examiners. Subjective professional judgements based on different expectations thus become more decisive in determining the final grade than the examiners’ use of the rubric (Grainger, Purnell, and Zipf 2008; Harsch and Martin 2013; Chetcuti, Cacciottolo, and Vella 2022).

In higher education, rubrics have become a common tool for assessment as well as for student feedback (Grainger and Weir 2020). The research literature suggests that criteria presented in rubrics and clearly articulated standards clarify the targets for student work, support students in achieving the required learning outcomes and improve the transparency of assessment (Reddy and Andrade 2010; Weir 2020; Koris and Pello 2022). However, Bearman and Ajjawi (2018) question whether transparency is achievable. They note, with reference to Sadler (2009) and Bloxham et al. (2016) among others, that standard descriptors capture neither the complex nature of the work nor the holistic tacit knowledge involved in assessment. They argue that academic standards are social constructs rather than objective standards that can be accurately described. We therefore need to be realistic about how easily the complexity of performance can be captured in short standard descriptors.

How, then, can the use of rubrics in assessment satisfy students’ desire for transparency and fairness in grading? First, students need to understand the rubrics. However, the question is whether they possess sufficient knowledge to comprehend the standard descriptors. The variation in the quality of the three theses assessed in this study indicates that students as well as supervisors had different understandings of theory and of how it should be assessed. Inconsistency in assessment could possibly be reduced if everyone involved in the assessment accepted and used the same instrument and shared a common understanding of academic quality (Baume, Yorke, and Coffey 2004; Bloxham et al. 2016; Koris and Pello 2022). Even though educators are rarely given sufficient time to work on rubric development, a discussion of the variety of judgements made by examiners, of disciplinary norms and of assessment standards is needed to contribute to a more accurate way of determining the final grade. However, we should keep in mind that assessment should not be reduced to a checklist for scoring specific elements, as this could limit independence and creativity in students’ work and align student performance to the same standards (Bloxham et al. 2016).

Limitations

This study is limited to the assessment of the final version of the thesis, as the examiners were not involved in earlier stages of assessment, such as the oral examination. The anonymity of students and supervisors might have resulted in stricter judgements. Making the analytic assessment explicit does not necessarily reflect the natural process of assessment, since examiners rarely work systematically through each criterion (Sadler 2009). The examiners were asked to focus on specific criteria before making an overall judgement, which could mean that they looked specifically for weaknesses. Furthermore, when examiners are asked to give reasons for a judgement, the final grade tends to be lower (Baume, Yorke, and Coffey 2004).

Implications and future research

The main implication of this study is that a ‘true’ appraisal or a common ‘correct’ grade (Sadler 1989) is an ideal that is very difficult to achieve. Even though rubrics developed within a team could result in more consistent assessment, provided that there is a shared understanding of thesis quality within the team, there is still a risk that the use of rubrics gives the impression that assessment is more consistent than it is. Considering that rubrics have become a common assessment tool in higher education, we need more research on how they operate within educational settings and how their use can contribute to more consistent assessment.

Acknowledgments

The author would like to thank all the examiners who participated in this study, and her colleague Ola Strandler for data collection. She also thanks Lisa Asp Onsjö, Peter Nyström and Ali Yildirim for valuable feedback and suggestions, and Catherine MacHale Gunnarsson for offering suggestions regarding the language.

Disclosure statement

No potential conflict of interest was reported by the author.

References

  • Baume, D., M. Yorke, and M. Coffey. 2004. “What is Happening When We Assess, and How Can We Use Our Understanding of This to Improve Assessment?” Assessment & Evaluation in Higher Education 29 (4): 451–477. doi:10.1080/02602930310001689037.
  • Bearman, M., and R. Ajjawi. 2018. “From “Seeing through” to “Seeing with”: Assessment Criteria and the Myths of Transparency.” Frontiers in Education 3 (96): 1–8. doi:10.3389/feduc.2018.00096.
  • Bloxham, S., P. Boyd, and S. Orr. 2011. “Mark My Words: The Role of Assessment Criteria in UK Higher Education Grading Practices.” Studies in Higher Education 36 (6): 655–670. doi:10.1080/03075071003777716.
  • Bloxham, S., B. den-Outer, J. Hudson, and M. Price. 2016. “Let’s Stop the Pretence of Consistent Marking: Exploring the Multiple Limitations of Assessment Criteria.” Assessment & Evaluation in Higher Education 41 (3): 466–481. doi:10.1080/02602938.2015.1024607.
  • Braun, V., and V. Clarke. 2006. “Using Thematic Analysis in Psychology.” Qualitative Research in Psychology 3 (2): 77–101. doi:10.1191/1478088706qp063oa.
  • Chetcuti, D., J. Cacciottolo, and N. Vella. 2022. “What Do Examiners Look for in a PhD Thesis? Explicit and Implicit Criteria Used by Examiners across Disciplines.” Assessment & Evaluation in Higher Education 47 (8): 1358–1373. doi:10.1080/02602938.2022.2048293.
  • Corbin, J. M., and A. L. Strauss. 2008. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. Thousand Oaks, CA: Sage.
  • Golding, C., S. Sharmini, and A. Lazarovitch. 2014. “What Examiners Do: What Thesis Students Should Know.” Assessment & Evaluation in Higher Education 39 (5): 563–576. doi:10.1080/02602938.2013.859230.
  • Grainger, P., K. Purnell, and R. Zipf. 2008. “Judging Quality through Substantive Conversations between Markers.” Assessment & Evaluation in Higher Education 33 (2): 133–142. doi:10.1080/02602930601125681.
  • Grainger, P., and K. Weir. 2020. “The Significance of Rubrics.” In Assessing Learning in Higher Education Contexts Using Rubrics, edited by P. Grainger and K. Weir, 1–8. Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
  • Haagsman, M., B. Snoek, A. Peeters, K. Scager, F. Prins, and M. van Zanten. 2021. “Examiners’ Use of Rubric Criteria for Grading Bachelor Theses.” Assessment & Evaluation in Higher Education 46 (8): 1269–1284. doi:10.1080/02602938.2020.1864287.
  • Harsch, C., and G. Martin. 2013. “Comparing Holistic and Analytic Scoring Methods: Issues of Validity and Reliability.” Assessment in Education: Principles, Policy & Practice 20 (3): 281–307. doi:10.1080/0969594X.2012.742422.
  • Humphry, S. M., and S. A. Heldsinger. 2014. “Common Structural Design Features of Rubrics May Represent a Threat to Validity.” Educational Researcher 43 (5): 253–263. doi:10.3102/0013189X14542154.
  • Koris, R., and R. Pello. 2022. “We Cannot Agree to Disagree: Ensuring Consistency, Transparency and Fairness across Bachelor Thesis Writing, Supervision and Evaluation.” Assessment & Evaluation in Higher Education, ahead-of-print: 1–12. doi:10.1080/02602938.2022.2125931.
  • Mattsson, M. 2008. “Degree Projects and Praxis Development.” In Examining Praxis: Assessment and Knowledge Construction in Teacher Education, edited by M. Mattsson, I. Johansson, and B. Sandström, 55–76. Rotterdam: Sense Publishers.
  • O’Hagan, S. R., and G. Wigglesworth. 2015. “Who’s Marking My Essay? The Assessment of Non-Native-Speaker and Native-Speaker Undergraduate Essays in an Australian Higher Education Context.” Studies in Higher Education 40 (9): 1729–1747. doi:10.1080/03075079.2014.896890.
  • Reddy, Y. M., and H. Andrade. 2010. “A Review of Rubric Use in Higher Education.” Assessment & Evaluation in Higher Education 35 (4): 435–448. doi:10.1080/02602930902862859.
  • Sadler, D. R. 1989. “Formative Assessment in the Design of Instructional Systems.” Instructional Science 18 (2): 119–144. doi:10.1007/BF00117714.
  • Sadler, D. R. 2005. “Interpretations of Criteria-Based Assessment and Grading in Higher Education.” Assessment & Evaluation in Higher Education 30 (2): 175–194.
  • Sadler, D. R. 2009. “Indeterminacy in the Use of Preset Criteria for Assessment and Grading.” Assessment & Evaluation in Higher Education 34 (2): 159–179. doi:10.1080/02602930801956059.
  • Sellbjer, S. 2017. “Meaning in Constant Flow: University Teachers’ Understanding of Examination Tasks.” Assessment & Evaluation in Higher Education 42 (2): 182–194. doi:10.1080/02602938.2015.1096900.
  • Stolpe, K., L. Björklund, M. Lundström, and M. Åström. 2021. “Different Profiles for the Assessment of Student Theses in Teacher Education.” Higher Education 82 (5): 959–976. doi:10.1007/s10734-021-00692-w.
  • Swedish Research Council. 2017. Good Research Practice. Stockholm: Swedish Research Council.
  • Weir, K. 2020. “Understanding Rubrics.” In Assessing Learning in Higher Education Contexts Using Rubrics, edited by P. Grainger and K. Weir, 9–24. Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
  • Williams, L., and S. Kemp. 2019. “Independent Markers of Master’s Theses Show Low Levels of Agreement.” Assessment & Evaluation in Higher Education 44 (5): 764–771. doi:10.1080/02602938.2018.1535052.