Educational Psychology
An International Journal of Experimental Educational Psychology
Volume 43, 2023 - Issue 2-3

Halo effects in grading: an experimental approach

Pages 246-262 | Received 24 Aug 2021, Accepted 20 Mar 2023, Published online: 31 Mar 2023

Abstract

Halo effects in teacher judgments can occur when the assessment of one aspect of a person’s achievement is generalised to another aspect of achievement for that same person. We conducted an experimental study in which participants (N = 107) had to grade vignettes of the same student in two different subjects. The first vignette described either a weak, average, or strong performance, whereas the second vignette always described an average performance. Rich descriptions of achievement-relevant information were used, reducing the impact of other influencing factors. The treatment condition emerged as a significant predictor of the grade given for the second vignette, even after controlling for the gender and status of the participants and the gender of the student. The results suggest that participants’ ratings can be manipulated to show halo bias. The results support the view that teacher education should raise awareness of possible sources of the halo effect.

The professional judgments made by teachers play an important role in the academic and personal development of their students. Not only are these judgments relevant in the grading process but they also translate into classroom instruction, communication, and task development (Alvidrez & Weinstein, Citation1999). Teachers need to retrieve reliable information on their students and apply this information in order to provide them with appropriate tasks (McElvany et al., Citation2009) and respond to their students’ work with helpful feedback (Hattie & Timperley, Citation2007). As teachers often spend hours on end with their students, and thus have very extensive information on how each student is doing in class, it could be expected that their judgments would provide differential diagnostic information on their students’ achievement (Spinath, Citation2005).

Teachers’ judgments are often the main source of information on the progress and achievement of their students. Thus, valid assessment of students’ academic achievement is of high significance. Teacher judgments have been studied extensively, with results showing moderate to acceptable accuracy for student achievement, intelligence, and motivation (Hoge & Coladarci, Citation1989; Kaiser et al., Citation2013; Machts et al., Citation2016; Südkamp et al., Citation2012; Urhahne & Wijnia, Citation2021). However, research also shows that these judgments are not perfect reflections of students’ abilities but may be influenced by a plethora of biases, such as stereotypes (e.g. Retelsdorf et al., Citation2015), expectations (Jussim & Harber, Citation2005; Muntoni & Retelsdorf, Citation2018), or halo effects (Dennis, Citation2007).

In the present study, we focussed on a halo effect in teachers’ judgement, which can occur when the assessment of one aspect of someone’s achievement is transferred to another aspect of achievement in that same person. To that end, we conducted an experimental study in which teachers and student teachers had to grade vignettes of the same student in two different subjects. The first vignette described either a weak, average, or strong performance, whereas the second vignette in each case described an average performance. We expected judgments of the second vignette to be biased in the direction of the first. The vignettes are described in more detail in the Materials section. With this study we aimed to shed light on the relevance of the halo effect in the school context. Moreover, in applying an experimental approach, our study goes beyond some of the existing research on halo effects in the school context, which so far has mainly been based on correlational studies or has used context-irrelevant information as an intervention.

The halo effect in education

The term halo effect was coined by Thorndike (Citation1920) in his work on employee performance appraisals and on ratings of army officers; since then it has been extensively researched (Forgas, Citation2011). It is claimed to be a pervasive cognitive bias in the assessment of achievement or performance (Cooper, Citation1981), and its properties have been discussed in depth (e.g. Balzer & Sulsky, Citation1992; Fisicaro & Lance, Citation1990). Dennis (Citation2007) defined the halo effect (or halo error) as a ‘bias [that] operates to inflate the covariance of different assessments by the same rater relative to those that would have occurred if the bias were absent’ (p. 1169). The achievement domain in which the effect is evoked can be similar to or essentially independent of the domain in which the initial assessment is made. Thus, halo errors are of potential significance in any setting in which ratings are used to provide feedback on performance (Dennis, Citation2007). This highlights their potential relevance in the school context in particular.
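Dennis’s definition can be made concrete with a small simulation. The sketch below is purely illustrative (it is not taken from any of the cited studies, and the halo weight and noise levels are hypothetical): two achievement domains are generated to be truly independent, yet because the rater’s second rating is partly pulled toward the first, the two ratings covary.

```python
# Illustrative simulation of Dennis's (2007) definition: halo bias inflates
# the covariance of different assessments by the same rater relative to what
# would occur without the bias. All parameters here are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 10_000

true_a = rng.normal(size=n)  # true achievement in domain A
true_b = rng.normal(size=n)  # true achievement in domain B, independent of A

rating_a = true_a + rng.normal(scale=0.5, size=n)  # noisy but unbiased rating
halo = 0.6                                         # pull toward the first rating
rating_b = halo * rating_a + (1 - halo) * true_b + rng.normal(scale=0.5, size=n)

print(round(np.corrcoef(true_a, true_b)[0, 1], 2))      # ~0.00: no true relation
print(round(np.corrcoef(rating_a, rating_b)[0, 1], 2))  # ~0.72: haloed ratings covary
```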

Five sources of illusionary halo effects

Cooper (Citation1981) derived five related sources of illusionary halo effects that in practice occur together, namely undersampling, engulfing, insufficient concreteness, insufficient rater motivation and knowledge, and cognitive distortions. Undersampling refers to the rater’s overestimation of the relation between two categories that results from insufficient information, forcing the rater to rely on a global impression. Haloed ratings from engulfing stem from the belief that either a general impression or, alternatively, one or more salient features are related to the categories being rated. In other words, these ratings are influenced by implicit covariance theories that link general impressions with the rated categories; this halo source is central to the present investigation. Insufficient concreteness occurs when the nature of the rated categories is too abstract for the rater. Therefore, the rater is forced to merge their observations on the basis of the overlapping aspects of the presented categories. Insufficient rater motivation and knowledge can evoke halo effects if the raters expend insufficient effort on the rating process, or are insufficiently sensitive to the fact that their ratings might be influenced by a halo effect. Finally, cognitive distortions can be halo sources when the rater loses information from, or adds information to, the rating. Details can be lost through memory effects, whereas raters’ beliefs about cross-category correlations can be added, resulting in an overestimation of the true relations between categories. In addition to these five illusionary halo sources, Cooper (Citation1981) defined the true halo source as correlated true scores originating in the actual relation between two categories.

Previous research in education

In the educational context, halo effects can occur when the assessment of an aspect of someone’s achievement is transferred to another aspect of achievement in that same person. Using a sample of 663 third-grade pupils from 38 classes and their respective teachers, Dompnier et al. (Citation2006) showed that halo effects can influence the judgments of teachers. They used standardised test scores and grades to investigate cross-subject halo effects. Similarly, Bressoux et al. (Citation2013) used standardised tests to investigate halo effects in a sample of 180 students in 20 first-grade classes and their 20 teachers. They were able to demonstrate that teachers show a general impression halo, in line with the engulfing effect as defined by Cooper (Citation1981). In both cases the resultant findings were heavily reliant on the methodological soundness of the measure assessing achievement in the first place. Two further studies investigated the halo effect in schools by introducing achievement-irrelevant additional information and examining its effect on teachers’ gradings. In the study by Foster and Ysseldyke (Citation1976), the authors used labels such as ‘learning disabled’ and showed that these labels evoked halo effects. Abikoff et al. (Citation1993) took an experimental approach to the halo effect using a sample of N = 139 elementary school teachers. They used videos to present fourth-grade students engaging in behaviours characteristic of attention-deficit hyperactivity disorder, oppositional defiant disorder, or unobtrusive behaviour. Results indicated that the teachers showed a unidirectional halo effect in their ratings when oppositional behaviour occurred. The study by Tompkins and Boor (Citation1980) used the attractiveness of seventh-grade boys and the popularity of their names to investigate ratings of academic and social attributes. The results showed that the more attractive students were rated higher on social attributes. However, no effects were found with regard to academic attributes.

A recent study by Sanrey et al. (Citation2021) investigated halo effects in teacher judgments, and the role of judgement certainty in particular, applying a novel methodological approach with 25 teachers and their 199 students in the first study and 20 teachers and their 180 students in the second. The authors used standardised tests to compare the actual achievement of the students to the teachers’ item-by-item judgments of success for each student. In the second study, the authors included a measure of certainty after each item that aimed to assess how certain the teachers were about their response. The results showed the presence of a halo effect in teachers’ judgments, as the teachers’ judgments were more homogeneous than the students’ actual achievement levels. In addition, high judgement certainty resulted in a stronger halo effect.

Halo effects have also been investigated in university settings. In a study by Dennis (Citation2007), for example, the results showed halo effects from one piece of a student’s work to another. Notably, the supervisors knew the students personally and had supervised them over a substantial period of time. Talamas et al. (Citation2016) investigated the limiting effects of the attractiveness halo on perceptions of actual academic performance. They used ratings of perceived intelligence, perceived academic achievement, perceived conscientiousness, and attractiveness based on photographs of 100 university students. The authors found perceived conscientiousness to be a better predictor of actual academic performance than perceived intelligence and perceived academic performance. Accuracy was improved when the influence of attractiveness on judgments was controlled for. These findings emphasise the potential effect of attractiveness on the accuracy of achievement judgments. In another study, by Malouff et al. (Citation2013), the authors investigated psychology faculty members and teaching assistants who were randomly assigned to grade a student giving either a poor or a good oral presentation. Afterwards, the graders assessed an unrelated piece of written work by the same student. The results showed the hypothesised halo effect. All in all, previous research was able to identify halo effects in the educational context. More naturalistic settings yielded smaller but still statistically significant halo effects. However, it must be noted that the amount of explained variance was heavily dependent on the research design and methodological approaches used in the respective study.

Methodological approaches to the halo effect

In most research, the halo effect was measured by comparing true and inflated intercorrelations between ratings on a number of categories, or by drawing inferences from the differences in true and observed intrarater variance across categories. Both approaches rely on a sufficiently reliable assessment of the true correlation or variance, and thereby are exposed to a myriad of methodological issues (Dennis et al., Citation1996). Thorndike (Citation1920) explained that theoretically, the exact size of the halo effect cannot be assessed, due to an inability to pinpoint the true correlation or variance, given the lack of objective criteria against which to measure them. In other words, one of the core difficulties in studying halo effects is the separation of halo bias from true correlations between the qualities being rated. In a similar vein, studies showed that much of the evidence on the halo effect depends on methods that fail to take appropriate account of the impact of measurement error (Lance et al., Citation1990; Murphy et al., Citation1993).

An experimental approach circumvents some of these methodological issues because the halo effect is controlled in the respective setting. However, the experimental studies presented earlier often used achievement-irrelevant additional information, such as rated attractiveness or personal affect, to evoke halo effects. The approach we took was to give relevant information on student achievement to individuals who had experienced grading themselves as school and university students and who, in addition, either already grade regularly or are being educated to become teachers who will grade.

With regard to the halo sources named earlier we assumed differential effects derived from our research design. It is important to note that in practice, halo sources occur together, as Cooper (Citation1981) has stated. However, elaborating on the likelihood of the co-occurrence of the different halo sources can shed light on the experimental approach we took in the present investigation, as well as on the halo effect itself.

First, we would argue that engulfing is the most probable halo source to influence the grading of the second vignette. As school students’ grades tend to be correlated across subjects, we would assume that the participants’ ratings are influenced by implicit covariance theories that link general impressions with the rated categories. Second, the influence of cognitive distortions seems probable, albeit to a lesser degree. The participants may include information from the rating of the first vignette in the grading of the second vignette, due to the raters’ beliefs about cross-category correlations resulting in overestimation of the true relations between categories, similarly to engulfing. However, we assumed that the memory effects related to the same halo source were comparatively negligible, given that the rating is given immediately after reading the vignette.

Third, undersampling and insufficient concreteness were still less probable halo sources. The information on academic achievement and behaviour was rich and represented relevant illustrations of what is usually sufficient to give a grade in the school context. Thus, the present investigation differs substantially from studies that give very little information, such as the study by Murrone and Gynther (Citation1989), who investigated the effect of a single-sentence statement about the intelligence of a person on a plethora of noncognitive variables. Hence, while undersampling and insufficient concreteness have been shown to be of importance generally, their influence seemed comparatively less probable with regard to the approach we took. Finally, the influence of insufficient rater motivation and knowledge is difficult to predict for the participants in our study. On the one hand, one could argue that grading is an integral part of the teaching profession and thus, the participants would have invested sufficient effort in their decision. On the other hand, we have no information on participants’ knowledge with regard to the potential importance of halo effects on the grading of school students.

In addition to the categorisation of halo sources by Cooper (Citation1981), earlier research showed that interpersonal affect can evoke a halo effect (e.g. Conway, Citation1998; Varma et al., Citation1996). The study by Dennis (Citation2007), who investigated the halo effect using supervisors with vast knowledge about the students they rated, underscores this assumption. However, with the approach we took, the effect of interpersonal affect was expected to be comparatively small, due to the fact that the raters did not know the person they were rating, apart from the descriptions they received in the two vignettes.

The present study

In the present investigation, participants rated two vignettes describing the academic behaviour and achievement of a student in two separate subjects. In line with prior research, we expected to find a halo effect in the grading of the second vignette. Participants who received the strong performance vignette for the first subject were expected to rate the second, average performance vignette higher than participants who had received the average or weak performance vignette, and vice versa.

As noted earlier, the participants of the study were student teachers and teachers already working in schools. As earlier research has shown, assessors’ judgments are affected by their expertise in the domain they are asked to evaluate. Thus, we controlled for the status of the participants, namely novices (student teachers, with less experience) and experts (in our case, practicing teachers).

To sum up, to investigate the halo error we applied an experimental approach that used relevant information (grading on the basis of realistic vignettes). In addition, the experimental approach we took may be considered rather conservative, or a rather pure reflection of implicit covariance theories, because those halo sources that depend on more category-irrelevant biases (such as undersampling, insufficient concreteness, insufficient rater motivation, and interpersonal affect) were less likely to operate. Our study thus went beyond previous research in regard to three important concerns. First, we circumvented some of the methodological issues of previous research as outlined above. Second, we used relevant information as intervention material, rather than simply relying on the influence of additional information, such as the attractiveness of the ratee, or of single-item interventions, such as a rating of intelligence. Finally, the experimental approach enabled us to reduce the influence of additional influencing factors such as stereotypes or affect, which are difficult to control for in non-experimental settings.

Methods

Sample

The original sample comprised N = 109 participants (64.2% female; Mage = 29.72, SDage = 10.24). The data of two participants were removed due to missing values on all relevant variables (grading of the vignettes). With α = .05, power of 1 − β = .95, and a squared multiple correlation of the predictors of ρ² = .21 (f² = .27), the projected sample size needed (G*Power 3.1; Faul et al., Citation2009) was approximately N = 82 for the final regression model. Thus, our final sample size of N = 107 was more than adequate for the main objective of this study. Participants were recruited online and by telephone. No specific reason for the request was given, except that the data were needed for research purposes at the university. Most of the participants were university students training to become teachers (n = 79); the remaining 28 participants were teachers. Participants completed an online version of the survey, which took around 20 minutes. They were asked to work alone, and no help was given.
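The a priori power analysis reported above was run in G*Power 3.1; the following sketch reproduces the same computation with SciPy’s noncentral F distribution. One assumption is ours: the final model is taken to have five predictors (two contrast variables plus three controls), as the numerator degrees of freedom are not stated explicitly in the text.

```python
# A priori power analysis for a multiple regression F test (R^2 deviation
# from zero), mirroring the reported G*Power 3.1 setup. Assumption (ours):
# k = 5 predictors in the final model.
from scipy.stats import f, ncf

alpha, target_power, f2, k = 0.05, 0.95, 0.27, 5

n = k + 2                                          # smallest N with positive error df
while True:
    df1, df2 = k, n - k - 1
    f_crit = f.ppf(1 - alpha, df1, df2)            # critical F under the null
    power = 1 - ncf.cdf(f_crit, df1, df2, f2 * n)  # noncentrality lambda = f^2 * N
    if power >= target_power:
        break
    n += 1

print(n, round(power, 3))  # should land near the reported N = 82
```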

Study design

Participants were randomised into three experimental groups. Each participant received one of three different texts. The text was either a strong (n = 37), average (n = 35), or weak (n = 35) framed assessment of student achievement (Subject A) and was followed by another text, which was an assessment of average performance by the same student in a different subject (Subject B). The participants were asked to imagine teaching the student described in the vignette and to rate the student’s achievement in that specific subject on a 15-point scale, as is common in the German grading system in upper secondary school (0 = ‘failed’, 15 = ‘very good’). The participants were given as little additional information as possible; for example, the age of the student was omitted, as was whether the student was real. All information was given online in written form.

Materials

Participants read two vignettes describing the behaviour and achievement of the same student in two subjects, which were not named explicitly (Subject A and Subject B). We developed eight vignettes in total. For Subject A, three versions of a text were developed (strong, average, and weak) that referred either to a student with a common German-sounding male name (Paul) or a common German-sounding female name (Paula). The remaining two vignettes (Subject B) were average performance assessments of the same student (Paul or Paula) as in the first text (Subject A). In German, a gender-neutral-sounding name or the use of no name (e.g. ‘the student’) would not suffice to obscure the gender of the student, as the gender markers used in the text would indicate it. Thus, we decided to use two versions of the text, randomly assigned the participants to the Paul or Paula condition, and controlled for the effect of gender thereafter.

The vignettes were developed to give either a strong, average, or weak description of the student’s behaviour and achievement in class. We based the texts on sample report card comments (Lang, Citation2007; Ochi, Citation2006) and on the secondary school education plan for the federal state of Hamburg (Freie und Hansestadt Hamburg, Citation2018). The comments on the student’s achievement included verbal assessments of (1) their learning progress and (2) their learning potential, as well as (3) comments on written assessments by the student. We used the common performance and proficiency standards defined in the education plan for secondary schools (Freie und Hansestadt Hamburg, Citation2018) as a framework to develop a scaffold for the texts. The comments did not indicate the subject in which the student was assessed. They thus referred to generic competencies relevant to school success (Farrington et al., Citation2012), and allowed all participants to grade the student without subject-specific content knowledge.

Each text included assessments of academic achievement (e.g. improvement in performance), academic behaviour (e.g. pays attention to the teacher), learning behaviour (e.g. participation in classroom discussions), social behaviour (e.g. learning in cooperative settings), and non-cognitive factors from the domain of motivational or personality psychology (e.g. perseverance, motivation). The four texts were structured identically, to enhance the comparability between manipulations. The differences between the texts lay in single words describing the level of competence or the way the student interacted with the learning content, the teacher, or his or her classmates.

Vignettes for Subject A

The vignettes for Subject A (manipulation vignettes) depicted three different evaluations of a student. We aimed to give a comprehensive and consistent picture of the student in each of the three versions.

In the strong performance description, we used expressions such as ‘very good’ with regard to the academic achievement of the student. Thus, we aimed to give a clear indication of how well the student does in school by explicitly using the terminology known by the participants to describe grades. Furthermore, the students’ improvement in performance was stated to be strong and consistent. The student was considered to show no problem whatsoever in the reproduction or transfer of knowledge. Thus, we indicated that the student was able to achieve the highest competence level as described in the competence model for secondary school (Freie und Hansestadt Hamburg, Citation2018). In addition, the student was stated to show an immense ability to concentrate, willingness to learn and good working speed. Further, they were not easily distracted by what happens in the classroom, and they consistently added to the classroom discussion with insightful contributions. The student was considered open and friendly as well as self-reliant. Finally, the student helped the other students in class and showed no other problem in social interaction. All in all, we aimed to present the perfect student.

In the weak performance vignette, we described the same aspects of scholastic achievement but used descriptors that indicated the opposite of the strong performance vignette. Again, we used terminology with which the participants were familiar with regard to grading in school; here, we used terms that describe less than satisfactory or unsatisfactory performance (Freie und Hansestadt Hamburg, Citation2018). In addition, the student was represented as not participating in classroom discussion even if asked to by the teacher, and as showing weak to unacceptable social behaviour. All in all, the student we described in the weak performance vignette was the complete opposite of the student in the strong performance vignette.

For the average condition, we developed a third, average performance vignette. In contrast to the strong and weak vignettes, we refrained from using words that could explicitly or implicitly lead the participants to assume that the student did very well or very poorly in a specific aspect of school. All in all, we aimed to develop the description of a student whose achievement was thoroughly average. We did so by describing the comprehension of the student as average. As the basic requirements of the school curriculum were met, the student was considered to satisfy the average competence levels (Freie und Hansestadt Hamburg, Citation2018). The student did not reach the high level of transfer of knowledge reached by the student in the strong performance vignette. In comparison with the description in the weak performance vignette, some support by the teacher was stated to be sufficient for the student to cope with the demands of the classroom. Their verbal contributions were appropriate but did not add to the classroom discussion in a substantial way. All in all, we aimed to present a student who was average in all of his or her abilities.

Vignette for Subject B

All participants received the same vignette for Subject B. The vignette for Subject B was developed on the basis of the average performance vignette for Subject A. The important difference between the two is that the description for Subject B implied that the student did either a little better or a little worse than average in some aspects. In other words, the description was less homogeneous than the average vignette for Subject A. Thus, coming to a conclusion on how well the student was doing in school was somewhat more difficult. We hypothesised that this would lead the participants to rely more heavily on the description in Subject A, even though they were asked to evaluate the achievement of the student in Subject B exclusively on the basis of the description they were given for that subject.

The student was described as slightly better than average with regard to the consistency of participation in classroom discussions, as having a more self-reliant work ethic, and as having a more consistent ability to reproduce knowledge. However, in some aspects the student was presented as somewhat below average in the Subject B vignette: First, their contributions in classroom discussions led to no new or substantiated claims. Second, their ability to concentrate and their perseverance were described as below average. Finally, the student’s ability to transfer knowledge did not suffice to cope with demanding tasks given by the teacher. It has to be noted that none of these descriptions would, on the basis of the competence model for secondary school (Freie und Hansestadt Hamburg, Citation2018), imply a change in competence level compared to the average performance vignette in Subject A.

Control variables

Participants were asked to fill out a questionnaire before reading the first vignette. We assessed participants’ gender and whether they were student teachers or already working as teachers. We dummy coded the latter variable (0 = novice, 1 = expert). In addition, the target student’s gender was introduced as a dummy-coded variable (male = 0, female = 1).

Statistical analyses

We analysed the data in two steps. First, a manipulation check (check for implementation fidelity) was performed to verify that the vignettes for Subject A were evaluated as intended by the participants. To do so we used a one-way between-subjects ANOVA to test for the differences in the grading by the participants in the strong, average, and weak conditions. Second, we used a multiple regression analysis to investigate whether a halo effect occurred. We applied effects coding (or planned contrasts) to test for the effect of the treatment (strong and weak condition) defined as the deviation between the treatment mean and the grand mean (Alkharusi, Citation2012; Cohen & Cohen, Citation1983). In this method, the dummy variables take the values 1, 0, and −1. To the first contrast variable A1 (strong) the following values were assigned: strong = 1, weak = 0, average = −1. To the second contrast variable A2 (weak) the following values were assigned: strong = 0, weak = 1, average = −1. The grade given for the second vignette was the dependent variable. In a subsequent analysis, we controlled for the participants’ as well as the students’ gender and whether the participants were still students or already working as teachers (0 = novice, 1 = teacher). We used SPSS Version 25 and Mplus Version 8.5 (Muthén & Muthén, Citation2017) for the analyses.
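The analysis pipeline can be illustrated with a short sketch of our own; the original analyses were run in SPSS and Mplus, so the data file and column names below are hypothetical. The sketch shows the effects coding described above and the final regression model including the control variables.

```python
# Sketch of the effects-coded regression described in the text (our
# illustration; the original analyses used SPSS 25 and Mplus 8.5).
# The data file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("halo_study.csv")  # hypothetical file, one row per participant

# Effects coding with the average condition as reference (coded -1 on both)
df["A1_strong"] = df["condition"].map({"strong": 1, "weak": 0, "average": -1})
df["A2_weak"] = df["condition"].map({"strong": 0, "weak": 1, "average": -1})

# rater_gender, expert, and student_gender are dummy-coded controls,
# as described in the Control variables section.
model = smf.ols(
    "grade_B ~ A1_strong + A2_weak + rater_gender + expert + student_gender",
    data=df,
).fit()

# With effects coding, each contrast slope estimates the deviation of that
# condition's mean from the grand mean of the condition means.
print(model.summary())
```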

Results

Manipulation check and descriptive statistics

As a manipulation check, we first tested for differences in the participant gradings in the strong, average, and weak conditions. There was a significant effect of the condition on the grades the participants gave for Subject A: F(2, 104) = 841.71, p < .01, η² = .94. Post hoc comparisons using the Tukey honestly significant difference test (Levene’s test indicated homogeneous variances: F = 0.26, p = .77) showed that the mean score of the weak performance condition differed significantly from the mean scores of the average performance (ΔM = −5.49, SD = .28, p < .01, d = 4.00) and strong performance conditions (ΔM = −11.38, SD = .28, p < .01, d = 9.86), and that the mean score of the average performance condition differed significantly from the mean score of the strong performance condition (ΔM = −5.89, SD = .28, p < .01, d = 5.88). Taken together, these results suggest that the manipulation through the vignette for Subject A was successful (see Tables 1 and 2 for descriptive statistics).
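A corresponding sketch of the manipulation check, again ours and with the same hypothetical column names: Levene’s test for homogeneity of variances, the one-way ANOVA, and Tukey HSD post hoc comparisons.

```python
# Sketch of the manipulation check: one-way ANOVA on the Subject A grades
# plus Tukey HSD post hoc tests (our illustration; column names hypothetical).
import pandas as pd
from scipy.stats import f_oneway, levene
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("halo_study.csv")
groups = [df.loc[df["condition"] == c, "grade_A"]
          for c in ("strong", "average", "weak")]

print(levene(*groups))    # homogeneity of variances (cf. F = 0.26, p = .77)
print(f_oneway(*groups))  # omnibus F test for the condition effect

# Pairwise comparisons with family-wise error control
print(pairwise_tukeyhsd(df["grade_A"], df["condition"], alpha=0.05))
```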

Table 1. Results for Subject A and Subject B by manipulation and student gender.

Table 2. Results for Subject A and Subject B by expert status of the participants.

Treatment effect

Next, multiple regression analyses were conducted to see whether the treatment condition had an effect on the grading of the second (average) performance vignette of Subject B. The standardised results (effect sizes) from the stepwise regression analysis can be found in Table 3. The results showed statistically significant effects for the two contrast variables A1 (strong) and A2 (weak). More specifically, A1 indicates that participants in the strong condition gave higher grades than the other participants. The opposite was true for A2 (weak): participants in the weak condition gave lower grades in Subject B than the other participants. The treatment showed statistically significant effects even after introducing the gender (female = 0, male = 1) and status (novice = 0, expert = 1) of the participants, and the gender of the student being graded (female = 0, male = 1). Notably, expert status showed statistically significant effects on the grades: experts gave lower grades for the vignettes in Subject B than novices did. It has to be noted, however, that the expert sample was much smaller than the novice sample. The overall amount of explained variance was modest, as is expected in this field of research. All in all, the results show that the manipulation had a significant effect on the grading of the second, average performance vignette of the student in Subject B. Neither the participants’ nor the students’ gender had a statistically significant impact on the grade given in Subject B.

Table 3. Standardised and unstandardised slope estimates and R2 of the multiple regression analyses predicting the grade in subject B.

Discussion

The ability to make valid assessments of students’ academic achievement is of high significance, as teachers’ judgments are often the main source of information on the progress and achievement of students (e.g. Machts et al., Citation2016). In addition, these assessments translate into classroom instruction, communication, and task development (Alvidrez & Weinstein, Citation1999). The halo effect is one influence that may distort the assessment of the teacher. However, previous investigations into halo effects applied methodologies that suffered from issues such as the impossibility of estimating the true relations between the rated categories, or uncorrected measurement error. With the present investigation we aimed to investigate the halo effect by applying an experimental approach utilising relevant information: Participants rated an average performance vignette of student achievement and classroom behaviour after receiving a strong, average, or weak performance vignette on the same student in a different subject.

The manipulation check revealed the expected rating behaviours for the first vignette. The weak performance vignette was rated significantly worse than the average or strong performance vignette. What is more, none of the raters in the weak condition gave the same grade as any participant in the strong condition. Thus, the texts induced the intended rating behaviour and enabled us to investigate the main research question of the present investigation.

A multiple regression analysis showed that the treatment had a statistically significant effect on the grading of the second, average performance vignette describing academic achievement and behaviour in Subject B. Thus, the results support the assumption that the grading of the first vignette would have a significant effect on the rating of the second vignette, even though the effect was small, as expected. These results suggest that participants’ ratings can be manipulated in such a way as to show a halo bias. We controlled for the status of the participants; however, the number of experts in the sample was considerably smaller than the number of novices. Neither the gender of the rater nor that of the ratee emerged as a significant predictor in our analyses.

It has to be noted that the degree of explained variance in the final model might be considered rather small (R² = .16). However, one could argue that the research design provided the participants with rich information on the student’s achievement in the two subjects at hand, reducing the effect that undersampling may have had. In addition, the participants rated each vignette immediately after reading it. Memory distortions, as defined in the categorisation by Cooper (Citation1981), are thus unlikely halo sources for the rating of the second vignette. Further, studies have shown the effect of personal affect on rating decisions. With the approach we took, the participants had to rely on the descriptions of achievement and classroom behaviour. This information could be assumed to evoke less personal affect than if, for example, the participants had been asked to rate students they knew from inside and outside their own classrooms. The vignettes, developed using the secondary school education plan for the federal state of Hamburg (Freie und Hansestadt Hamburg, Citation2018), were designed to provide relevant information for assessing student achievement and classroom behaviour. Thus, the effect of insufficient concreteness should not have influenced the raters to a substantial degree. Finally, we would argue that the participants in our study did not lack the necessary motivation to give an informed rating after reading the vignettes. We can only assume that the participants, who were either teachers or were studying to become teachers, did not experience participation in the study as demotivating. Further research is needed to confirm this assumption, as well as to examine the influence that participant knowledge about halo errors might have had on the results. To do so, controlling for test-taking motivation and knowledge of the halo effect could shed light on these possible influencing factors.

All in all, we would argue that the effect the treatment condition had on the rating of the second vignette (especially given that the task we presented the participants with was relatively easy) could be interpreted as a rather pure reflection of the effect that engulfing, or implicit covariance theories, can have on grading behaviours, and thus represents an interesting avenue for future research. However, it seems worthwhile to investigate a more nuanced approach in which the achievement of the student is not as streamlined as in our study. Adding a homogeneous vs. non-homogeneous condition would extend the more narrowly defined conditions used in the present investigation, as it would require more deliberation on the part of the participants. In addition, it would be interesting to investigate whether teachers respond similarly when presented with more than one student. This would enable researchers to investigate the differential effects strong and weak students may have on the teachers. What is more, it would be interesting to know more about the teachers in general (e.g. their knowledge, biases, or any special training received) or specifically with regard to their behaviour when rating the vignettes. For example, in the study by Sanrey et al. (Citation2021), the authors accounted for the certainty the teachers had while rating.

In addition, our research could be of importance from a practical perspective, as it could warrant the inclusion of our findings in the curriculum of student teachers, as in the approach proposed by Behrmann (Citation2019). The author was able to show that student teachers experienced attitude changes towards systematic thinking after being confronted with their own biased perceptions. In a similar vein, one of the halo sources defined by Cooper (Citation1981) is lack of knowledge about the halo effect itself. Including findings on judgement accuracy in teacher education curricula should therefore be worthwhile. With our research we show again that halo error exists among both teachers and student teachers. Our research thus has practical implications, as it could lead to a better understanding of the halo effect among teachers themselves and in turn reduce its potentially distorting effect. Thus, we would argue that the halo effect is malleable, and that interventions educating about it are therefore desirable.

Limitations

Finally, some limitations of the present study need to be addressed. First, the sample consisted of student teachers and teachers already working in schools. Unfortunately, we were not able to recruit enough teachers to include the number of years worked as a teacher in our analysis. The differentiation between novices and experts is thus not as fine-grained as we would have wished. This shortcoming needs to be considered when interpreting our findings. Larger sample sizes could enable future research to control for the effect of years of work experience.

Second, the halo literature suggests that implicit covariance theories have a negative connotation. We paid heed to this suggestion by treating the effect of the first vignette on the second as a somewhat biased assessment. One could argue that the behaviour we observed is not necessarily an error in assessment, but a behaviour that is demanded of teachers working in sometimes strained environments. In fact, given that students’ grades across subjects are highly correlated, expecting similar achievement would be sensible. However, the instruction was not to give an overall impression, but to grade the two vignettes on their own, even though they described the same student.

All in all, the experimental approach of the present investigation can be considered successful in having evoked a halo effect. Our findings complement research from more naturalistic settings that also suggests that halo effects are prevalent in educational contexts (e.g. Dennis, Citation2007). The influence of the different halo sources on the findings could not be tested with the approach used in this study. Even though Cooper (Citation1981) stated that in practice these sources occur together, we would like to encourage further research to investigate the differential effects they can have. Testing for knowledge about the influence a halo effect can have on one’s own ratings would be a fruitful approach to disentangling the combined effect of the halo sources. Finally, as is typical for experimental research approaches, the study design may be prone to external validity concerns, as we cannot be certain that our results would be replicated in more natural settings. What is more, teachers in more natural settings derive their judgments from a plethora of information. The implications of this study’s findings have to be interpreted with this shortcoming in mind.

Acknowledgement

The authors thank Stephen McLaren for his editorial support during preparation of this article.

Disclosure statement

No potential competing interest was reported by the authors.

Data availability statement

The data used in this study as well as the materials (vignettes in German and an ad-hoc translated version) are open and available online at https://osf.io/4dwkc/?view_only=d61c56cf48cf4fcd839130d20dd16712

Additional information

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

  • Abikoff, H., Courtney, M., Pelham, W. E., & Koplewicz, H. S. (1993). Teachers’ ratings of disruptive behaviors: The influence of halo effects. Journal of Abnormal Child Psychology, 21(5), 519–533. https://doi.org/10.1007/BF00916317
  • Alkharusi, H. (2012). Categorical variables in regression analysis: A comparison of dummy and effect coding. International Journal of Education, 4(2), 202–210. https://doi.org/10.5296/ije.v4i2.1962
  • Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later student academic achievement. Journal of Educational Psychology, 91(4), 731–746. https://doi.org/10.1037/0022-0663.91.4.731
  • Balzer, W. K., & Sulsky, L. M. (1992). Halo and performance appraisal research: A critical examination. Journal of Applied Psychology, 77(6), 975–985. https://doi.org/10.1037/0021-9010.77.6.975
  • Behrmann, L. (2019). The halo effect as a teaching tool for fostering research-based learning. European Journal of Educational Research, 8(2), 433–441. https://doi.org/10.12973/eu-jer.8.2.433
  • Bressoux, P., Lima, L., & Pansu, P. (2013). Halo effect in teachers’ judgment about students’ achievement [Paper presentation]. 15th Biennial EARLI Conference for Research on Learning and Instruction, Munich, Germany.
  • Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Conway, J. M. (1998). Understanding method variance in multitrait-multirater performance appraisal matrices: Examples using general impressions and interpersonal affect as measured method factors. Human Performance, 11(1), 29–55. https://doi.org/10.1207/s15327043hup1101_2
  • Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90(2), 218–244. https://doi.org/10.1037/0033-2909.90.2.218
  • Dennis, I. (2007). Halo effects in grading student projects. Journal of Applied Psychology, 92(4), 1169–1176. https://doi.org/10.1037/0021-9010.92.4.1169
  • Dennis, I., Newstead, S. E., & Wright, D. E. (1996). A new approach to exploring biases in educational assessment. British Journal of Psychology, 87(4), 515–534. https://doi.org/10.1111/j.2044-8295.1996.tb02606.x
  • Dompnier, B., Pansu, P., & Bressoux, P. (2006). An integrative model of scholastic judgments: Pupils’ characteristics, class context, halo effect and internal attributions. European Journal of Psychology of Education, 21(2), 119–133. https://doi.org/10.1007/BF03173572
  • Farrington, C. A., Roderick, M., Allensworth, E., Nagaoka, J., Keyes, T. S., Johnson, D. W., & Beechum, N. O. (2012). Teaching adolescents to become learners. The role of noncognitive factors in shaping school performance: A critical literature review. University of Chicago, Consortium on Chicago School Research.
  • Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
  • Fisicaro, S. A., & Lance, C. E. (1990). Implications of three causal models for the measurement of halo error. Applied Psychological Measurement, 14(4), 419–429. https://doi.org/10.1177/014662169001400407
  • Forgas, J. P. (2011). She just doesn’t look like a philosopher…? Affective influences on the halo effect in impression formation. European Journal of Social Psychology, 41(7), 812–817. https://doi.org/10.1002/ejsp.842
  • Foster, G., & Ysseldyke, J. (1976). Expectancy and halo effects as a result of artificially induced teacher bias. Contemporary Educational Psychology, 1(1), 37–45. https://doi.org/10.1016/0361-476X(76)90005-9
  • Freie und Hansestadt Hamburg. (2018). Bildungsplan Gymnasium Sekundarstufe I: Allgemeiner Teil [Education plan for secondary school academic track: General section]. https://www.hamburg.de/contentblob/11249352/d0540346f8847d8a5601fc2903fef3a9/data/gym-a-teil.pdf
  • Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. https://doi.org/10.3102/003465430298487
  • Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of literature. Review of Educational Research, 59(3), 297–313. https://doi.org/10.2307/1170184
  • Jussim, L., & Harber, K. D. (2005). Teacher expectations and self-fulfilling prophecies: Knowns and unknowns, resolved and unresolved controversies. Personality and Social Psychology Review, 9(2), 131–155. https://doi.org/10.1207/s15327957pspr0902_3
  • Kaiser, J., Retelsdorf, J., Südkamp, A., & Möller, J. (2013). Achievement and engagement: How student characteristics influence teacher judgments. Learning and Instruction, 28, 73–84. https://doi.org/10.1016/j.learninstruc.2013.06.001
  • Lance, C. E., Fisicaro, S. A., & LaPointe, J. A. (1990). An examination of negative halo error in ratings. Educational and Psychological Measurement, 50(3), 545–554. https://doi.org/10.1177/0013164490503008
  • Lang, M. (2007). Textbausteine für Berichtszeugnisse in der Grundschule (7. Aufl.) [Sample report card comments for primary school (7th Edition)]. Stolz Verlag.
  • Machts, N., Kaiser, J., Schmidt, F. T., & Möller, J. (2016). Accuracy of teachers’ judgments of students’ cognitive abilities: A meta-analysis. Educational Research Review, 19, 85–103. https://doi.org/10.1016/j.edurev.2016.06.003
  • Malouff, J. M., Emmerton, A. J., & Schutte, N. S. (2013). The risk of a halo bias as a reason to keep students anonymous during grading. Teaching of Psychology, 40(3), 233–237. https://doi.org/10.1177/0098628313487425
  • McElvany, N., Schroeder, S., Hachfeld, A., Baumert, J., Richter, T., Schnotz, W., Horz, H., & Ullrich, M. (2009). Teachers’ diagnostic skills to assess student abilities and task difficulty of learning materials incorporating instructional pictures. German Journal of Educational Psychology, 23, 223–235. https://doi.org/10.1024/1010-0652.23.34.223
  • Muntoni, F., & Retelsdorf, J. (2018). Gender-specific teacher expectations in reading—The role of teachers’ gender stereotypes. Contemporary Educational Psychology, 54, 212–220. https://doi.org/10.1016/j.cedpsych.2018.06.012
  • Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). Nature and consequences of halo error. Journal of Applied Psychology, 78(2), 218–225. https://doi.org/10.1037/0021-9010.78.2.218
  • Murrone, J., & Gynther, M. D. (1989). Implicit theories or halo effect? Conceptions about children’s intelligence. Psychological Reports, 65(3_suppl2), 1187–1193. https://doi.org/10.2466/pr0.1989.65.3f.1187
  • Muthén, L. K., & Muthén, B. (2017). Mplus user’s guide: Statistical analysis with latent variables. Muthén & Muthén.
  • Ochi, S. (2006). Formulierungshilfen für Schulberichte und Zeugnisse (4., aktualisierte und erw. Aufl.) [Examples and guidelines for school reports and report card comments]. Medienwerkstatt Mühlacker.
  • Retelsdorf, J., Schwartz, K., & Asbrock, F. (2015). “Michael can’t read!” – Teachers’ gender stereotypes and boys’ reading self-concept. Journal of Educational Psychology, 107(1), 186–194. https://doi.org/10.1037/a0037107
  • Sanrey, C., Bressoux, P., Lima, L., & Pansu, P. (2021). A new method for studying the halo effect in teachers’ judgement and its antecedents: Bringing out the role of certainty. British Journal of Educational Psychology, 91(2), 658–675. https://doi.org/10.1111/bjep.12385
  • Spinath, B. (2005). Accuracy of teachers’ judgment of pupils’ characteristics and the construct of diagnostic competence. German Journal of Educational Psychology, 19, 85–95. https://doi.org/10.1024/1010-0652.19.12.85
  • Südkamp, A., Kaiser, J., & Möller, J. (2012). Accuracy of teachers’ judgments of students’ academic achievement: A meta-analysis. Journal of Educational Psychology, 104(3), 743–762. https://doi.org/10.1037/a0027627
  • Talamas, S. N., Mavor, K. I., & Perrett, D. I. (2016). Blinded by beauty: Attractiveness bias and accurate perceptions of academic performance. PLoS ONE, 11(2), e0148284. https://doi.org/10.1371/journal.pone.0148284
  • Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4(1), 25–29. https://doi.org/10.1037/h0071663
  • Tompkins, R. C., & Boor, M. (1980). Effects of students’ physical attractiveness and name popularity on student teachers’ perceptions of social and academic attributes. The Journal of Psychology, 106(1), 37–42. https://doi.org/10.1080/00223980.1980.9915168
  • Urhahne, D., & Wijnia, L. (2021). A review on the accuracy of teacher judgments. Educational Research Review, 32, 100374. https://doi.org/10.1016/j.edurev.2020.100374
  • Varma, A., DeNisi, A. S., & Peters, L. H. (1996). Interpersonal affect and performance appraisal: A field study. Personnel Psychology, 49(2), 341–360. https://doi.org/10.1111/j.1744-6570.1996.tb01803.x