An analysis of peer, self, and tutor assessment in problem-based learning tutorials

Pages e122-e132 | Published online: 03 Jul 2009

Abstract

Objective: The purpose of this study was to explore self-, peer-, and tutor assessment of performance in tutorials among first year medical students in a problem-based learning curriculum.

Methods: One hundred and twenty-five students enrolled in the first year of the Bachelor of Medicine and Bachelor of Surgery Program at the University of Queensland were recruited to participate in a study of metacognition and peer- and self-assessment. Both quantitative and qualitative data were collected from the assessment of PBL performance within the tutorial setting, which included elements such as responsibility and respect, communication, and critical analysis through presentation of a case summary. Self-, peer-, and tutor assessment took place concurrently.

Results: Scores obtained from tutor assessment correlated poorly with self-assessment ratings (r = 0.31–0.41), with students consistently under-marking their own performance to a substantial degree. Students with greater self-efficacy scored their PBL performance more highly. Peer-assessment was a slightly more accurate measure, with peer-averaged scores correlating moderately with tutor ratings initially (r = 0.40) and improving over time (r = 0.60). Students consistently over-marked their peers, particularly those with sceptical attitudes to the peer-assessment process. Peer over-marking led to less divergence from the tutor scoring than under-marking of one's own work.

Conclusion: According to the results of this study, first-year medical students in a problem-based learning curriculum were better able to accurately judge the performance of their peers compared to their own performance. This study has shown that self-assessment of process is not an accurate measure, in line with the majority of research in this domain. Nevertheless, it has an important role to play in supporting the development of skills in reflection and self-awareness.

Practice points

  • Self-assessment results in substantial under-marking compared to tutor assessment.

  • Scores obtained from peer-assessment are significantly more generous than those arising from tutor assessment.

  • Self-assessment is a less accurate means of assessing student performance than peer-assessment.

Introduction

Education of medical students should prepare them to deal with problems in the future, equipping them with skills necessary to become active, self-directed learners, rather than passive recipients of information (Dolmans & Schmidt Citation1996). Acknowledgment of this need was responsible, in part, for the development of problem-based learning (PBL) (Barrows & Tamblyn Citation1980).

Within the discipline of medical education, PBL is a curriculum innovation that involves students in learning activities using loosely structured medical problems to drive learning (Norman & Schmidt Citation1992). The pedagogical appeal of PBL is its perceived capacity to encourage, through these learning processes, enhanced clinical reasoning skills and the development of both an adaptable knowledge base and skills in self-directed learning necessary to become lifelong learners (Kelson & Distlehorst Citation2000). Four crucial conditions for a deep approach to learning are encompassed within the PBL approach: a well-structured knowledge base, active learning, collaborative learner interaction, and a context designed to promote internal motivation through the provision of pragmatic goals (Margetson Citation1994). Assessment of student progress in such a student-centred curriculum, however, has remained challenging (Eva Citation2001).

Assessment protocols within PBL curricula have sometimes sought to include self-, peer-, and tutor evaluation to assess a range of skills, such as self-directed learning, group cooperation, and communication (Swanson et al. Citation1997). Tutors and peers have a unique opportunity to judge each other's work in PBL tutorials, and students should develop the ability to reflect on their own strengths and weaknesses, as these are central elements of self-directed learning (Eva et al. Citation2004).

Several published quantitative studies of peer-assessment within PBL curricula reveal correlations between staff/tutor and peer ratings ranging from very low (Sluijmans et al. Citation2001; Reiter et al. Citation2002) to moderate (Sullivan et al. Citation1999; Segers & Dochy Citation2001). Of limited research undertaken with medical students in PBL, moderate correlation between peer and tutor ratings was demonstrated by Sullivan (Citation1999) while low correlation was reported by Reiter et al. (Citation2002). Findings arising from studies of medical students in non-PBL curricula show generally moderate correlations (Burnett & Cavaye Citation1980; Van Rosendaal & Jennett Citation1992; Rudy et al. Citation2001; Minion et al. Citation2002). Several factors have the potential to impact negatively on the accuracy of peer evaluations, including friendship marking, and decibel marking which favours dominant group members (Pond & ul-Haq Citation1997). These may result in peer over-marking often observed in quantitative studies of peer-assessment (for instance, Rudy et al. Citation2001). It is also possible that, in high stakes settings such as medical schools, inflated estimates of peer performance would be the norm (Norcini Citation2003).

Another format for evaluating student performance in PBL tutorials is self-assessment. Self- and peer-assessment are often combined or considered together. Peer-assessment, for instance, builds on evaluation skills that may be transferred to self-assessment tasks and enables learners to compare their self-evaluations with the assessments of others.

Despite meta-analyses of self-assessment in higher education deeming students ‘well able to self-assess accurately’ (Sluijmans et al. Citation1999, p. 300), within medical PBL programs reported correlations between self and tutor evaluations are uniformly low (Rezler Citation1989; Gordon Citation1991; Das et al. Citation1998; Sullivan et al. Citation1999; Reiter et al. Citation2002). Nor has a significant relationship been found between self-assessment scores and examination results (Tousignant & Des Marchais Citation2002; Eva et al. Citation2004). This is also true of non-PBL medical curricula, where poor association has been shown between scores obtained from self-assessment and tutor assessment (or examination results) (Arnold et al. Citation1985; Woolliscroft et al. Citation1993; Rudy et al. Citation2001; Fitzgerald et al. Citation2003).

When self-assessment scores are compared with peer-assessment scores, low correlations have been shown in all but one study of PBL curricula or of medical courses (Sullivan et al. Citation1999; Reiter et al. Citation2002; Miller Citation2003). The exception, a study by Burnett and Cavaye (Citation1980), reported a very high correlation between self- and peer-assessment scores among fifth year medical students in a traditional curriculum. However, Eva et al. (Citation2004) reported disappointingly low correlation between students’ self-assessment and performance on a test of medical knowledge, with no evidence of improvement after one year of medical education. The tendency exists for students to overestimate their competence (Mattheos et al. Citation2004), especially lower-performing students (Woolliscroft et al. Citation1993; Lejk & Wyvill Citation2001; Edwards et al. Citation2003). Young or highly capable students are more likely to undermark their work (Stefani Citation1992; Rudy et al. Citation2001; Edwards et al. Citation2003; Fitzgerald et al. Citation2003). The influence of gender on both self-assessment and peer-assessment accuracy appears to be minimal (Falchikov & Magin Citation1997; Mattheos et al. Citation2004), although the experience of self and peer-assessment may be more stressful for females (Pope Citation2005).

Several approaches have been suggested to improve the accuracy of scores generated from peer- or self-assessment. One well-supported idea is the use of co-assessment which involves students with staff in the assessment process (Sluijmans et al. Citation1999). While research has not supported the reliability of tutor assessment in PBL tutorials for measures of student knowledge (Neville Citation1998; Cunnington Citation2001; Whitfield & Xie Citation2002), the potential exists for tutor assessment to be combined with or compared to peer- or self-assessment to improve the accuracy and comprehensiveness of the evaluations generated (Dochy et al. Citation1999; Eva Citation2001). Tutors are in a reasonable position to judge group processes (Dodds et al. Citation2001).

The aim of this study was to explore peer- and self-assessment within PBL tutorials in a medical course using qualitative and quantitative approaches. Qualitative data were collected to gather students’ perceptions of these alternate forms of assessment (see Papinczak et al. Citation2007). Quantitative data were analysed to assess the ‘accuracy’ of students as assessors, with tutor scores as the comparison. The impact of specific demographic factors and students’ self-efficacy was analysed to gain greater understanding of influences on scoring. It was anticipated that confident (efficacious) students would award themselves higher marks, although this may be mediated by fears of self-aggrandisement in a public arena (see Chaves et al. Citation2006).

Self-efficacy is defined as students’ perceptions of their ability to successfully carry out a task (Bandura Citation1986). When facing a difficult learning task, a student with high self-efficacy beliefs is more likely to participate more actively, work harder, remain more problem-focussed, and persist for a longer time than a student with low self-efficacy, who is more likely to view the situation as insurmountable, get frustrated and give up (Pajares Citation1996; Nichols & Steffi Citation1999). Students with high levels of self-efficacy are more willing to take on challenging tasks (Zimmerman Citation2000), whereas students with low self-efficacy may fail to achieve even when goals are within easy reach (Bandura Citation1993). The effect of self-efficacy on scores obtained through self-assessment has not previously been evaluated within the PBL tutorial setting.

The PBL environment, with its emphasis on self-directed and collaborative learning, provides a unique context in which to explore alternative forms of assessment. As they work together in PBL tutorials, students may develop interdependent relationships facilitating learning and motivation (Willis et al. Citation2002). This study sought to incorporate qualitative and quantitative dimensions in order to gain a fuller understanding of peer- and self-assessment within collaborative small group environments.

Description of the study

Quantitative and qualitative data were gathered as part of a larger study of metacognitive processes undertaken with first-year students enrolled in the Bachelor of Medicine and Bachelor of Surgery (MBBS) Program at The University of Queensland, Australia. Only the results of the self- and peer-assessment segment of the study are reported here, including quantitative findings and qualitative results which may be explanatory or insightful.

Ethical approval was obtained from the University of Queensland's Behavioural & Social Sciences Ethical Review Committee.

Setting

The MBBS Program introduced a four-year, graduate-entry PBL curriculum in 1997. First-year students, in small groups of nine or ten, undertake five hours of PBL tutorial time each week for 33 weeks of the year. Working in collaboration with group members, students analyse a problem of practice, formulate hypotheses, and undertake self-directed learning to try to understand and explain all aspects of the patient's ‘problem’.

Subjects

The study was conducted with 125 first-year medical students and 20 tutors over a period of six months during 2004. Every student in thirteen tutorial groups took part in a program of educational activities within their PBL tutorials, including peer- and self-assessment. Subsequent statistical analysis showed that the self-selected study subjects (40.2% of the student population) were representative of the entire cohort on measures of age, gender, and primary undergraduate degree.

Instruments

Qualitative and quantitative data were generated using two instruments: the peer assessment instrument (as shown in Figure 1) and the test of self-efficacy, which all participants completed at the commencement of the study.

Figure 1. The final 17-item draft of the peer assessment instrument.

The peer assessment instrument

In order to enhance student ownership of assessment criteria (as recommended by Boud (Citation1995) and Orsmond et al. (Citation2000)), members of several PBL tutorial groups in the previous cohort were invited to participate in the development of an instrument for peer- and self-assessment of students’ performance in PBL tutorials. Students were first presented with a list of criteria derived from relevant literature (including Das et al. (Citation1998) and Willis et al. (Citation2002)), from which a set of items was selected for inclusion in the first draft of the instrument. The negotiated instrument of 19 items, labelled the peer assessment instrument, was trialled with another student group and rated as easy to use and understand by all participants. Student dissatisfaction with two items resulted in their removal from the final version of the instrument. The resulting scale measures several features of successful adult education, such as participation, punctuality, respect for others, effective communication, and critical analysis (as shown in Figure 1). However, the inclusion of items specifically targeting self-directed learning and self-awareness, core features of PBL, differentiates it from instruments which may be appropriate for open-ended, but less student-centred, approaches such as case-based instruction (Hay & Katsikitis Citation2001).

The phrasing of items on the peer assessment instrument was varied slightly to make it more relevant to self-evaluation where applicable, for instance, ‘I’ instead of ‘the student’. Qualitative data were collected from responses to an open question (inviting comments) on the final page of the questionnaire. These were analysed and coded to themes to provide insight into student perceptions.

In order to gain a measure of face validity, three experienced PBL facilitators were asked to indicate whether each of the 17 items on the instrument was relevant to PBL performance and able to be adequately assessed using the item in question. Unanimous face validity was obtained for all items in four of the sub-scores: responsibility and respect, information processing, communication, and critical analysis. Some dissent about the validity of the self-awareness sub-score was evident. Construct validity describes the degree to which the items used in the instrument define the constructs (Pressley & McCormick Citation1995). The five constructs, or domains of performance, had been reported extensively in the medical and nursing education literature. The three PBL tutors and ten PBL students were each asked to categorise the 17 items into the five specified domains. In all cases, the items were distributed in accordance with the domains as defined on the instrument. Values for Cronbach's alpha ranged from 0.76 to 0.84, indicating good internal consistency among the five sub-scores. Acceptable reliability was found, with Pearson correlation coefficients for peer-averaged and tutor assessment ranging from 0.40 to 0.60. Notably, the self-awareness items were problematic, with a significant number of students consistently entering ‘not applicable’ for those two items.
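Cronbach's alpha, used here and throughout to summarise internal consistency, can be computed directly from a respondent-by-item score matrix. The study's analyses were run in SPSS; the Python sketch below, on hypothetical ratings, is included only to illustrate the statistic:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 20 respondents rating a 4-item sub-score on a 1-5 scale,
# with item scores driven partly by each respondent's underlying tendency.
rng = np.random.default_rng(0)
base = rng.normal(4, 0.5, size=(20, 1))
ratings = np.clip(np.round(base + rng.normal(0, 0.5, size=(20, 4))), 1, 5)
print(round(cronbach_alpha(ratings), 2))
```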

Unfortunately, time constraints prevented further renegotiation of the peer assessment instrument with the subsequent cohort prior to the commencement of the study.

Test of self-efficacy

The instrument to measure students’ self-efficacy was composed specifically for this project, as existing instruments were not designed for use in problem-based learning courses. The test of self-efficacy comprises eleven closed questions relating to regulation of, and confidence in, learning, with each item scored on a five-point Likert scale. The first six items (Part A) deal with students’ perceived capability to use various self-regulation strategies, such as organizing their studies, and concentrating and participating in small-group tutorials. These were loosely based on Bandura's (1989) multidimensional scales of perceived self-efficacy reported in Zimmerman et al. (Citation1992). The original scale was designed to measure high school students’ perceived capability to use various self-regulating strategies, such as concentrating on school subjects, organising schoolwork, and participating in school discussions. Of the eleven items on the original scale, the six most applicable were chosen and rewritten to reflect more appropriately the learning and studying activities carried out by students in this medical course, creating a brief instrument measuring self-efficacy to regulate learning. The six items deal with the following self-regulation strategies: completing allocated learning objectives for the group; studying when there are distractions; planning and organising study; course motivation; and concentration and active participation in tutorials.

A further five items measuring self-efficacy for academic achievement were included in the final set of questions (Part B) of the instrument. These were framed using items taken from the ten-item measure of self-efficacy first reported by Schwarzer and Jerusalem (Citation1995) with five questions selected and modified to better measure the specific responsibilities of examination performance, tutorial participation, self-awareness, clinical reasoning, and academic achievement under consideration in this study.

Statistical testing to determine internal reliability yielded Cronbach's alpha values of 0.68 for the first six items measuring self-efficacy for self-regulation, and 0.73 for the remaining five items dealing with self-efficacy for academic achievement. Reliability was not improved by the omission of a single item from either self-efficacy measure.

Peer-, self- and tutor assessment

The study was undertaken with the 2004 cohort of medical students, using the assessment instrument previously developed with student input. Student feedback was collected during, and at the conclusion of, the study. This feedback was used to modify aspects of the larger study framework, only some of which involved the peer- and self-assessment process itself. All participants completed the test of self-efficacy before commencing the program of peer- and self-assessment.

For a period of twenty-four weeks, all members of thirteen PBL tutorial groups took part in an activity designed to enhance learning. Two key components were introduced into PBL tutorials, both of which were readily integrated into the existing tutorial format with minimal additional time required from tutors or students. These components were:

  1. Reflection on learning: Each week one student from each tutorial group was asked to compose a summary of the week's medical problem incorporating the clinical reasoning and collaborative learning occurring in their PBL tutorial group. The student was encouraged to present the summary to the group as a concept map or in mechanistic case-diagramming format (Guerrero Citation2001) to give a visual representation of both the content and the clinical reasoning entailed in solving the problem (a ‘knowledge object’ (Entwistle & Marton Citation1994)).

  2. Peer- and self-assessment: The student presentation, in association with his/her fulfilment of PBL roles and responsibilities for that week, was assessed using the peer assessment instrument. Peer-, self-, and tutor assessment took place concurrently. Scores from these worksheets were compared to explore statistical relationships. Qualitative data were analysed.

Constant monitoring of student perceptions of, and attitudes to, these educational activities helped to gauge the effects of the dual activities and to maximize student learning. This is in keeping with an action research process, which uses overlapping cycles of planning, acting and observing, and reflecting (Kemmis & Wilkinson Citation1998) to maintain a responsive and flexible study design.

In the first week of the main study (week four of the academic year), two documents were distributed to students in the thirteen tutorial groups: a copy of the peer assessment instrument, to enable students to become familiar with the evaluation criteria, and an exemplar outlining ‘good’ and ‘poor’ outcomes for the criteria. Tutors assigned to each group then led their group in a practice session, with a tutor-led presentation of a summary of the previous week's medical case, in order to establish familiarity with the instrument and process. The summary itself was written by the researcher and presented, with explanation, to each tutor in the week prior to the trial. Tutors received written information, a short information session, and frequent communication and feedback to help maintain fidelity of treatment.

In the ensuing 23 weeks, tutorial groups implemented the summarization and assessment activities at the start of each week as part of the ‘wrap-up’ of the previous week's PBL case. Each student was expected to be the focus of two marking episodes, each preceded by a summary presentation (as shown in Figure 2). Tutors were encouraged to give concise feedback (based on written peer comments on the reverse of the assessment sheet) to students as soon as possible after the completion of the peer- and self-assessment procedure. Student feedback about the exercise was regularly invited as part of the action research process.

Figure 2. Diagram showing the sequence of marking episodes and assessment events for each student in each of 13 tutorial groups.

During the course of implementation, two tutorial groups withdrew from the study (16% of participants). Their justification for withdrawal was based on perceptions of their experiences, including scepticism about the value of peer-assessment and concerns about friendship marking. Statistical analysis showed students withdrawing did not differ significantly from those remaining in the study in terms of age, gender, primary degree, or self-efficacy.

Data analysis

Analysis of the data was implemented using statistical software SPSS Version 13.0. Scores on all five sub-scores of the peer assessment instrument were summed to give an overall score, with a maximum score of 85. For each marking episode, data for each student consisted of a self-assessment score, a tutor assessment score, and up to nine peer-assessment scores. Scores obtained from the test of self-efficacy were summed to create two sub-scores: self-efficacy for self-regulation (with a maximum score of 30) and self-efficacy for academic achievement (with a maximum score of 25). These two sub-scores were analysed as separate entities.

Descriptive statistics for tutor, self-, and peer-assessment are presented in Table 1 for each of the two marking episodes. Data were missing for individuals failing to submit completed assessment instruments. Frequency histograms revealed non-normal distributions of scores resulting from peer-assessment, with some groups awarding full marks for a large proportion of assessments. As qualitative data made it apparent that some students deliberately scored 100% for peer performances, irrespective of quality, it was resolved to apply an algorithm to reduce the prevalence of deliberately skewed scores or scores resulting from friendship marking and students’ cavalier attitudes. In instances where the tutor score for a given group was 72 out of 85 (approximately 85%) or less, all peer scores of 100% were omitted from the statistical analysis for that tutorial group. In this way, the most highly skewed results were excluded from the data set (representing 4.6% data loss), yet the data remained a reflection of a peer-assessment process which operates in a climate of student generosity towards others (see, for instance, Rudy et al. Citation2001).
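The editing rule is simple to state procedurally. A minimal Python sketch, with hypothetical scores and illustrative function names (the study's analysis was performed in SPSS):

```python
MAX_SCORE = 85     # maximum total on the peer assessment instrument
TUTOR_CUTOFF = 72  # 72/85, approximately 85%

def edit_peer_scores(peer_scores, tutor_score):
    """Apply the study's editing rule: if the tutor awarded 72/85 or less,
    treat peer scores of 100% (85/85) as skewed and omit them."""
    if tutor_score <= TUTOR_CUTOFF:
        return [s for s in peer_scores if s < MAX_SCORE]
    return list(peer_scores)

def peer_average(peer_scores, tutor_score):
    """Mean of a student's edited peer scores; None if no scores remain."""
    kept = edit_peer_scores(peer_scores, tutor_score)
    return sum(kept) / len(kept) if kept else None

# Hypothetical marking episode: nine peer raters, two awarding full marks.
print(peer_average([70, 68, 85, 72, 85, 66, 71, 69, 74], tutor_score=70))
```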

Table 1. Descriptive statistics for self-, peer-, and tutor assessment for each of the two marking episodes

Averaged peer-assessment scores were calculated by computing the mean of each student's completed peer-assessment instruments (once skewed results were removed). The reliability among peer-averaged scores for all intervention tutorial groups was well within acceptable limits, with Cronbach's alpha scores ranging from 0.66 to 0.77. The results of Kolmogorov–Smirnov testing confirmed that scores for self-, peer-averaged, and tutor assessment generated from the two marking episodes followed a normal distribution. The parametric tests chosen are fairly robust and should remain relatively unaffected by the observed clustering of assessment marks at the upper end of the range. Despite reservations about its use in this capacity (see Ward et al. Citation2002), tutor scoring was utilised as the most appropriate benchmark for comparative purposes in assessing the reliability of peer- and self-assessment.
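The normality check and the correlations reported below can be illustrated with standard routines; the sketch uses simulated scores, not the study's data:

```python
import numpy as np
from scipy import stats

# Simulated paired scores for one marking episode (40 students, 0-85 scale).
rng = np.random.default_rng(1)
tutor = rng.normal(68, 8, 40)
peer_avg = tutor + rng.normal(4, 6, 40)  # peers tending to mark a little higher

# Kolmogorov-Smirnov test against a normal distribution with sample parameters.
ks_stat, ks_p = stats.kstest(peer_avg, "norm",
                             args=(peer_avg.mean(), peer_avg.std(ddof=1)))

# Pearson correlation between tutor and peer-averaged scores.
r, r_p = stats.pearsonr(tutor, peer_avg)
print(f"KS p = {ks_p:.3f} (p > 0.05 is consistent with normality); r = {r:.2f}")
```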

In order to provide evidence for claims of ‘accuracy’, Bland–Altman plots (see Bland and Altman Citation1986) were used to graphically represent levels of agreement between two sets of scores. The average of each pair of scores was plotted against the difference between them for (1) self- versus tutor scores at time 2 and (2) peer-averaged versus tutor scores at time 2. Three lines, representing the mean difference and the upper and lower limits of agreement, were drawn. The limits of agreement were set as the mean difference ± 5% of the maximum score. The limits suggested by Bland and Altman (Citation1986), the mean difference ± 2 standard deviations, were considered too wide to give meaningful results in this study and were not used. These plots are included as Figures 3 and 4.
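A sketch of the plotting procedure, assuming two equal-length score arrays on the instrument's 0–85 scale and using the study's ±5% limits rather than the conventional ±2 SD:

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman_plot(scores_a, scores_b, max_score=85,
                      label_a="self", label_b="tutor"):
    """Plot paired-score differences against means, with agreement limits
    set at the mean difference +/- 5% of the maximum score, as in the study."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    mean_pair = (a + b) / 2
    diff = a - b
    md = diff.mean()
    limit = 0.05 * max_score
    plt.scatter(mean_pair, diff)
    plt.axhline(md, linestyle="-")           # mean difference
    plt.axhline(md + limit, linestyle="--")  # upper limit of agreement
    plt.axhline(md - limit, linestyle="--")  # lower limit of agreement
    plt.xlabel(f"Mean of {label_a} and {label_b} scores")
    plt.ylabel(f"{label_a} minus {label_b} score")
    plt.show()
```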

Figure 3. Bland-Altman plot of self-assessment versus tutor scores at time 2.

Figure 4. Bland-Altman plot of peer-averaged versus tutor scores at time 2.

Results

Self-assessment

Demographic variables

Multilevel regression analysis was used to explore the relationship between demographic variables and self-assessment scores. Of four factors incorporated into the model (age, gender, primary undergraduate degree, and repeat student status), only primary degree was statistically significant in explaining the variance in self-assessment scores. Students with an arts, commerce, music, education or law degree on admission to the MBBS Program were significantly more likely to have higher initial self-assessment scores than others, while those with pure sciences or therapies degrees marked themselves significantly lower (t = 2.89; p = 0.05). This distinction was less noticeable in the second marking episode.

Self-efficacy

Initial self-efficacy for self-regulation was moderate to high with a mean of 23.85 (out of 30) and a standard deviation of 3.18, while initial self-efficacy for academic achievement also showed relatively elevated levels (mean = 19.51 out of a possible 25, standard deviation = 2.51). In order to explore the relationship between self-assessment scores and self-efficacy, a multiple linear regression analysis was undertaken. Only initial self-efficacy for self-regulation was statistically significant in explaining the variance in self-assessment scores (t = −3.85, p = 0.001).
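As an indication of how such a regression can be reproduced (the study used SPSS), a minimal multiple linear regression sketch using statsmodels on simulated data; the variable names and coefficients are illustrative only:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
# Simulated predictors: the two self-efficacy sub-scores (max 30 and 25).
eff_regulation = rng.normal(23.85, 3.18, n)
eff_achievement = rng.normal(19.51, 2.51, n)
# Simulated outcome: self-assessment totals on the 0-85 instrument.
self_scores = 40 + 0.8 * eff_regulation + rng.normal(0, 5, n)

X = sm.add_constant(np.column_stack([eff_regulation, eff_achievement]))
model = sm.OLS(self_scores, X).fit()
print(model.summary())  # t and p values for each predictor
```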

Comparison of means

Direct comparison of the self-assessment mean with the tutor score revealed consistent under-marking of students’ own work, as shown in Table 1. Paired t-tests were undertaken to determine whether statistically significant directional differences existed for each marking episode. In each marking episode, the students scored themselves significantly lower than their tutor (t = −5.27 to −8.10; p < 0.001).
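A paired t-test of this kind takes only a few lines; the paired scores below are hypothetical:

```python
from scipy import stats

# Hypothetical paired totals for one marking episode (same eight students).
self_scores  = [60, 55, 64, 58, 70, 52, 61, 66]
tutor_scores = [68, 63, 70, 65, 74, 60, 69, 71]

t, p = stats.ttest_rel(self_scores, tutor_scores)
print(f"t = {t:.2f}, p = {p:.4f}")  # a negative t indicates self under-marking
```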

Analysis of qualitative data indicated students were concerned about lack of objectivity. One student commented on their struggle to remain impartial: ‘I find it difficult to undertake self assessment—mainly because I feel that my perception of my performance may be inaccurate due to bias or distorted perceptions’.

Inter-rater agreement

To explore levels of agreement between scores for self-assessment and other tutorial-based scores, two analyses were undertaken. These were: (a) correlation to test for the strength and direction of linear relationships between scores; and (b) Bland Altman plots to graphically represent scores obtained from self- and tutor assessment.

  • (a) Pearson correlation coefficients were computed for self and peer-averaged, and self and tutor assessment scores derived from both marking episodes (see Table 2). Despite reaching statistical significance, the correlation between self and peer-averaged scores was low-to-moderate (r = 0.30–0.32). A slightly stronger correlation was observed for self and tutor scores (r = 0.31–0.41). Considerable variability existed between tutorial groups. When groups were analysed for score correlation separately, across both marking episodes, six tutorial groups showed very high levels of marking agreement, with self and tutor score correlation coefficients ranging from 0.74 to 0.92. Other groups showed low correlations.

  • (b) Bland–Altman plots charted the difference between tutor and self-assessment scores against the average of these scores. Plots showed poor accuracy of self- versus tutor assessment at both times 1 and 2, with a considerable proportion of plotted scores well outside the limits of agreement (mean ± 5% of the maximum score). Figure 3 shows a Bland–Altman plot for self- versus tutor assessment at time 2. The mean difference of −7.59 at time 2 highlights both the considerable under-marking of self compared to tutor scores and the lack of accuracy. The standard deviation was quite large (11.70 at time 2), indicating a wide spread of scores about the mean.

Table 2. Correlations between pairs of scores obtained from self-, peer-averaged, and tutor assessment. Cronbach's alpha for peer-averaged scores across the 13 tutorial groups was 0.77 in the first marking episode and 0.66 in the second.

Peer-assessment

Demographic variables

Analysis of variance demonstrated no significant differences between the marks awarded to peers based on the presenting student's gender, age, or primary degree. There was evidence, however, of a trend towards higher scores being awarded to older male students in the groups. This failed to reach statistical significance (F = 3.12; df = 12; p = 0.095).

Comparison of means

Direct comparison of the peer-averaged mean with the tutor score revealed consistent over-marking by peers (see Table 1). Paired t-tests were undertaken to determine whether statistically significant directional differences existed for each marking episode. In each marking episode, the mean of the peer scores for each student presentation was significantly higher than the score awarded by their tutor (t = 3.71 to 4.14; p < 0.001).

Inter-rater agreement

To explore levels of agreement between scores for peer-averaged assessment and other tutorial-based scores, two analyses were undertaken. These were: (a) correlation to test for the strength and direction of linear relationships between scores; and (b) Bland Altman plots to graphically represent scores obtained from peer-averaged and tutor assessment.

  • (a) Table 2 presents Pearson correlation coefficients for tutor and peer-averaged scores generated from both marking episodes. At best, moderate correlations were demonstrated initially for tutor and peer-averaged scores (r = 0.40), with some improvement over time (r = 0.60). These data support the acceptable reliability of the assessment instrument, subject to the limitations of the use of tutor assessment as the benchmark.

When tutorial groups were analysed for score correlation separately, seven of the thirteen groups achieved very high levels of marking agreement, with correlation coefficients ranging from 0.76 to 0.96. Qualitative data indicated that the majority of these groups were very supportive of, and committed to, the peer-assessment process. Comments such as: ‘… good to learn how to do this appropriately, as I think we will need to be able to assess our peers’ performance, as well as our own, throughout our careers’ were given by some enthusiastic respondents.

Scores obtained from other tutorial groups were in substantially less agreement. Most of these group members expressed negative views about peer-assessment related specifically to potential for bias. The effect of omission of highly skewed results (as discussed earlier) on correlation was briefly explored. Data editing was found to improve the peer-tutor correlation from 0.32 to 0.40 in the first marking episode.

  • (b) Bland–Altman plots charted the difference between tutor and peer-averaged scores against the average of these scores. A moderate level of agreement between peer-averaged and tutor assessment was shown at both times 1 and 2, with a considerable proportion of plotted scores within the limits of agreement (mean ± 5% of the maximum score). The mean difference of 3.75 at time 2 highlights both the over-marking of peers compared to tutor scores and the improved accuracy of peer-averaged scores compared to scores derived from self-assessment (see Figure 4).

Variability between tutorial groups

Differences between the averaged peer-assessment scores of all 13 groups were explored using multivariate analysis of variance. Statistically significant variability was evident for the first marking episode only (F = 2.09; df = 12; p = 0.028). The effect size, as measured by partial eta squared, was 0.26. Figure 5 illustrates this variability, through box plots, for the 13 tutorial groups on the 0–85 scale of the peer assessment instrument. Analysis of variance demonstrated that five tutorial groups had reliably recorded significantly lower peer-averaged scores, while three groups had consistently scored group members more favourably. Comparison between the three sources of assessment scores (self-, peer-, and tutor) revealed a small number of groups consistently reporting high student-generated scores in the absence of high tutor scores. With tutor assessment as the benchmark, members of these tutorial groups must have been awarding peers overly generous marks. Qualitative data suggest this may relate to friendship marking.

Figure 5. Box plots of averaged peer-assessment scores for thirteen tutorial groups with results obtained from the first marking episode.

A strong reaction to peer-assessment was the widespread perception that this process could be corrupted by bias due to friendship marking, fear of ‘tit-for-tat’ scoring, or lack of honesty. The following comments sum up the attitude among many study participants: ‘(It is) hard to criticise friends’ and ‘Relationships between students can colour opinions’. Some students expressed casual and/or sceptical attitudes, scoring 100% for each student in their group regardless of the quality of the work to be judged. Frequent comments such as: ‘Not taken too seriously’, and ‘Not too much thought goes into the marking’ reflect a cynical attitude to the peer-assessment process.

Discussion and conclusion

Self-assessment appears to be a less accurate means of evaluating student performance in PBL tutorials than peer-assessment. The presence of many outliers in the data confirmed both the heterogeneity of the group and the low accuracy of self-assessment among this sample of students. Subject to the variability observed between the scores obtained from 13 tutorial groups, correlations between self-assessment scores and those generated from tutor and peer-assessment were disappointingly low. Some groups showed evidence of greater accuracy, and some students within groups were also more accurate than others in their self-assessment. These findings are in accord with the general consensus in medical education that self-assessment of tutorial processes in PBL is an inexact measure (Rezler Citation1989; Gordon Citation1991; Das et al. Citation1998; Sullivan et al. Citation1999; Reiter et al. Citation2002).

Students clearly under-marked themselves, particularly those with pure science or therapies degrees. Other authors have shown that young or highly capable students are more likely to undermark their work (Stefani Citation1992; Rudy et al. Citation2001; Edwards et al. Citation2003; Fitzgerald et al. Citation2003). Analysis of qualitative data suggests that students struggled to find a balance between confidence in their performance, self-awareness, and humility. Objectivity was also a major cause of concern.

Self-efficacy was correlated with self-assessment. Students awarding themselves higher marks were more likely to have stronger self-efficacy for self-regulation. Self-assessment scores would be influenced by many factors, but confidence in one's ability to do well would be expected to provide incentive to award oneself higher marks on PBL performance. Positive collaborative learning behaviour has been shown to be related to learning self-efficacy (McLoughlin & Luca Citation2004).

Peer-assessment offers a greater likelihood of providing accurate alternate forms of assessment within the PBL tutorial environment. Correlation between tutor and peer-averaged scores was barely moderate at first, then improved with continued practice in peer-assessment. Some groups achieved very high correlation (up to 0.96) between tutor scores and peer-averaged scores. The use of peer-averaged rather than individual scores may help to account for the apparently improved accuracy compared to self-assessment. Nevertheless, some students, and some groups of students, were able to judge the performance of their peers in PBL tutorials with precision.

The removal of highly skewed results prior to the generation of peer-averaged scores improved the correlation between peer-averaged and tutor scores. Qualitative data indicated that some students were treating the peer-assessment process with casual and/or sceptical attitudes. By removing scores known to be deliberately distorted, it was anticipated that the peer-averaged scores would more appropriately reflect the genuine abilities of responsible students to assess their peers. Data analysis indicated that this was the case, with a small improvement in the correlation of tutor and peer-averaged scores in the first marking episode once highly skewed scores were omitted. The implementation of peer-assessment in any setting is likely to lead to initial scepticism and doubt about its value and validity. However, through repeated exposure to, and practice in, peer-assessment, such perceptions should be moderated (Sluijmans et al. Citation1999; Ballantyne et al. Citation2002), and highly skewed results would be expected to decrease in frequency. The improved correlation between peer-averaged and tutor assessment at time 2 (compared to time 1) lends support to this supposition.

Results from other studies of peer-assessment of processes in PBL (or small group) tutorials show variable correlations between staff and peer scores. Correlations range from very low (Reiter et al. Citation2002) to high (Magin Citation2001).

In keeping with the findings of Rudy et al. (Citation2001), students were over generous in their marking of peers. The consistent under-marking of self combined with over-marking by peers helps to account for the low correlation found for self and peer-assessment. Qualitative results showed the potential for inflated estimates of peer performance resulting from friendship marking. Friendship marking has been reported by other researchers as biasing peer-assessment responses (Pond & ul-Haq Citation1997).

This study contributes to knowledge in the area of peer- and self-assessment in PBL tutorials. The study incorporated strengths in four main areas. These were: (1) focus on the learning process in PBL tutorials; (2) duration of the program; (3) congruence with PBL philosophy; and (4) triangulation in data collection.

  1. The focus on learning process is an important strength of the study design as it enabled insights into learning processes undertaken by students. Bereiter and Scardamalia (Citation2000) call for greater research into PBL processes using reflective action research.

  2. The moderate duration of the program (exceeding six months) leads to greater confidence in study findings. Loss of participants over time (16% of participants) was not excessive given the time pressures experienced by students.

  3. Congruence with PBL philosophy allowed the study to complement the existing structure of PBL tutorials. By supporting the practice of self-assessment within PBL, the study upheld an emphasis on self-directed learning (SDL). Collaboration was enhanced through the use of collaborative assessment formats. The reflective component of the intervention built upon the review phase of the PBL learning cycle.

  4. Methodological triangulation was achieved by combining quantitative and qualitative approaches to study design. While qualitative inquiry helped to confirm theory emerging from student perspectives, quantitative inquiry enabled a set of statistical relationships to be uncovered.

Results should be interpreted within the context of potential limitations, including non-probability sampling, a relatively small sample size, subjective scoring of test items, and the different ways students and tutors interpret and apply assessment criteria. With regard to assessment of PBL processes such as communication and respect, no real benchmark exists (Ward et al. Citation2002). This casts some doubt on the validity of expert assessment in this domain.

Peer- and self-assessment within the tutorial setting has an important role to play through its reinforcement of the educational goals and instructional principles of problem-based learning (Nendaz & Tekian Citation1999; Segers & Dochy Citation2001). There is evidence that tutorial-based assessment may also reduce the overwhelming reliance on formal grading of students which encourages competition rather than collaboration (Eva Citation2001).

First-year medical students in this study demonstrated poor ability to self-assess their performance in PBL, which included the creation and presentation of a case summary. Normally a private process, self-assessments conducted publicly require students to balance unrealistic goals and perceptions, assessment anxiety, and ‘social norms about self-aggrandizement’ (Chaves et al. Citation2006, p. 30). This makes it unlikely that self-assessment accuracy in medical education is achievable (Eva & Regehr Citation2005).

Nevertheless, practice in self-assessment should be integrated into existing programs of medical education. As Eva and Regehr (Citation2005) emphasise, self-assessment is a means of identifying one's strengths and weaknesses to guide goal setting and enhance self-efficacy. This study has shown that self-assessment is not an accurate measure, in line with the majority of research in this domain. Nevertheless, it has an important role to play in supporting the development of skills in reflection and self-awareness. Self-assessment needs to be viewed from a ‘self-improvement perspective’ (Eva & Regehr, Citation2005, p. S52). Further qualitative research needs to be conducted to better understand students’ apparent inability to self-assess accurately within collaborative small group learning environments.

Peer-assessment provides a valuable opportunity for tutorial-based assessment. The act of evaluating the performance of professional peers has long been central to the referral process in medicine (Norcini Citation2003). Skills gained through peer-assessment activities may transfer to self-assessment tasks and enable learners to compare their self-assessment with the assessments of others (Searby & Ewers Citation1997; Dochy et al. Citation1999). Feedback from peers has the potential to assist learners to develop more accurate impressions of themselves and their abilities (Eva & Regehr Citation2005).

A fundamental part of the PBL process is the capacity of students to embrace their responsibilities as active members of a group of learners. These may include collaborative and self-assessment practices which have the potential to enhance reflection and self-awareness.

Additional information

Notes on contributors

Tracey Papinczak

TRACEY PAPINCZAK is completing her PhD in medical education within the School of Medicine, The University of Queensland.

Louise Young

DR LOUISE YOUNG is a senior lecturer in the School of Medicine at The University of Queensland and is currently Deputy Director of the University's Centre for Medical Education.

Michele Groves

ASSOCIATE PROFESSOR MICHELE GROVES is Deputy Head of School and Director of Medical Studies in the School of Medicine, Griffith University, Queensland.

Michele Haynes

DR MICHELE HAYNES works at The University of Queensland's Social Research Centre as Statistical Advisor and lectures in the School of Social Science.

References

  • Arnold L, Willoughby TL, Calkins EV. Self-evaluation in undergraduate medical education: a longitudinal perspective. J Med Edu 1985; 60: 21–28
  • Ballantyne R, Hughes K, Mylonas A. Developing procedures for implementing peer assessment in large classes using an action research process. Asses Eval Higher Edu 2002; 27: 427–441
  • Bandura A. Social Foundations of Thought and Action. Prentice Hall, Englewood Cliffs, NJ 1986
  • Bandura A. Perceived self-efficacy in cognitive development and functioning. Edu Psychologist 1993; 28: 117–148
  • Barrows HS, Tamblyn RM. Problem-Based Learning: an Approach to Medical Education. Springer, New York 1980
  • Bereiter C, Scardamalia M. Process and product in problem-based learning research. Problem-Based Learning: a Research Perspective on Learning Interactions, DH Evenson, CE Hmelo. Lawrence Erlbaum Associates, Mahwah, NJ 2000
  • Bland MJ, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; i: 307–311
  • Boud D. Enhancing Learning Through Self Assessment. Kogan Page, London 1995
  • Burnett W, Cavaye G. Peer assessment by fifth year students of surgery. Assess Higher Edu 1980; 5: 273–278
  • Chaves JF, Baker CM, Chaves JA, Fisher ML. Self, peer and tutor assessments of MSN competencies using the PBL-Evaluator. J Nurs Edu 2006; 45: 25–31
  • Cunnington J. Evolution of student evaluation in the McMaster MD programme. Pedagogue 2001; 10: 1–9
  • Das M, Mpofu D, Dunn E, Lanphear JH. Self and tutor evaluations in problem-based learning tutorials: is there a relationship?. Med Edu 1998; 32: 411–418
  • Dochy F, Segers M, Sluijmans D. The use of self-, peer-, and co-assessment in higher education. Studies in Higher Edu 1999; 24: 331–350
  • Dodds AE, Orsmond RH, Elliott SL. Assessment in problem-based learning: The role of the tutor. Annal Acad Med Singapore 2001; 30: 366–370
  • Dolmans DH, Schmidt HG. The advantages of problem-based curricula. Postgraduate Med J 1996; 72: 535–538
  • Edwards RK, Kellner KR, Sistrom CL, Magyari EJ. Medical student self-assessment of performance on an obstetrics and gynaecology clerkship. Am J Obstetrics and Gynaecol 2003; 188: 1078–1082
  • Entwistle NJ, Marton F. Knowledge objects: Understandings constituted through intensive academic study. Br J Edu Psychol 1994; 64: 161–178
  • Eva KW. Assessing tutorial-based assessment. Adv Health Sci Edu 2001; 6: 243–257
  • Eva KW, Regehr G. Self-assessment in the health professions: A reformulation and research agenda. Acad Med 2005; 80: S46–S54
  • Eva KW, Cunnington JPW, Reiter HI, Keane DR, Norman GR. How can I know what I don’t know? Poor self assessment in a well-defined domain. Adv Health Sci Edu 2004; 9: 211–224
  • Falchikov N, Magin D. Detecting gender bias in peer marking of students’ group process work. Asses Eval Higher Edu 1997; 22: 385–396
  • Fitzgerald JT, White CB, Gruppen LD. A longitudinal study of self-assessment accuracy. Med Edu 2003; 37: 645–649
  • Gordon MJ. A review of the validity and accuracy of self-assessments in health professions training. Acad Med 1991; 66: 762–769
  • Guerrero APS. Mechanistic case diagramming: A tool for problem-based learning. Acad Med 2001; 76: 385–389
  • Hay PJ, Katsikitis M. The ‘expert’ in problem-based and case-based learning: Necessary or not?. Med Edu 2001; 35: 22–28
  • Kelson ACM, Distlehorst LH. Groups in problem-based learning (PBL): Essential elements in theory, & practice. Problem-Based Learning: a Research Perspective on Learning Interactions, DH Evenson, CE Hmelo. Lawrence Erlbaum Associates, Mahwah, NJ 2000
  • Kemmis S, Wilkinson M. Participatory action research and the study of practice. Action Research in Practice: partnerships for Social Justice in Education, B Atweh, S Kemmis, P Weeks. Routledge, London 1998; 21–36
  • Lejk M, Wyvill M. The effect of the inclusion of self-assessment with peer-assessment of contributions to a group project: a quantitative study of secret and agreed assessments. Assess Eval Higher Edu 2001; 26: 551–561
  • Magin DJ. A novel technique for comparing the reliability of multiple peer assessments with that of a single teacher assessment of group process work. Asses Eval Higher Edu 2001; 26: 139–152
  • Margetson D. Current educational reform and the significance of problem-based learning. Stud Higher Edu 1994; 19: 5–19
  • Mattheos N, Nattestad A, Falk-Nilsson E, Attstrom R. The interactive examination: assessing students’ self-assessment ability. Med Edu 2004; 38: 378–389
  • McLoughlin C, Luca J. An investigation of the motivational aspects of peer and self assessment tasks to enhance teamwork outcomes. Paper presented at the Proceedings of the 21st ASCILITE Conference. Perth 2004; 5–8, December
  • Miller PJ. The effect of scoring criteria specificity on peer and self assessment. Asses Eval Higher Edu 2003; 28: 383–394
  • Minion DJ, Donnelly MB, Quick RC, Pulito A, Schwartz R. Are multiple objective measures of student performance necessary?. Am J Surg 2002; 183: 663–665
  • Nendaz MR, Tekian A. Assessment in problem-based learning medical schools: a literature review. Teach Learn Med 1999; 11: 232–243
  • Neville AJ. The tutor in small-group problem-based learning: teacher? Facilitator? Evaluator?. Pedagogue 1998; 8: 1–9
  • Nichols JD, Steffi BE. An evaluation of success in an alternative learning programme: motivational impact versus completion rate. Edu Rev 1999; 51: 207–219
  • Norcini JJ. The metric of medical education. Peer assessment of competence. Med Edu 2003; 37: 539–543
  • Norman G, Schmidt HG. The psychological basis of problem-based learning: a review of the evidence. Acad Med 1992; 67: 557–565
  • Orsmond P, Merry S, Reiling K. The use of student derived marking criteria in peer- and self-assessment. Asses Eval Higher Edu 2000; 25: 23–38
  • Pajares F. Role of self-efficacy beliefs in the mathematical problem-solving of gifted students. Contemporary Edu Psychol 1996; 21: 325–344
  • Papinczak T, Young L, Groves M. Peer-assessment in problem-based learning: A qualitative study. Adv Health Scie Edu 2007; 12: 169–186
  • Pond K, ul-Haq R. Learning to assess students using peer review. Stud Edu Eval 1997; 23: 331–348
  • Pope NK. The impact of stress in self- and peer-assessment. Asses Eval Higher Edu 2005; 30: 51–63
  • Pressley M, McCormick CB. Advanced Educational Psychology for Educators, Researchers and Policymakers. Harper Collins College Publishers, New York 1995
  • Reiter HI, Eva KW, Hatala RM, Norman GR. Self and peer assessment in tutorials: Application of a relative-ranking model. Acad Med 2002; 77: 1134–1139
  • Rezler AG. Self-assessment in problem-based groups. Med Teach 1989; 11: 151–156
  • Rudy DW, Fejfar MC, Griffith CH, Wilson JF. Self and peer assessment in a first-year communication and interviewing course. Eval Health Profess 2001; 24: 436–445
  • Schwarzer R, Jerusalem M. Generalized Self-efficacy Scale. Nfer-Nelson, Windsor, UK 1995
  • Searby M, Ewers T. An evaluation of the use of peer assessment in higher education: A case study in the school of music. Asses Eval Higher Edu 1997; 22: 371–383
  • Segers M, Dochy F. New assessment forms in problem-based learning: The value-added of the students’ perspective. Stud Higher Edu 2001; 26: 327–343
  • Sluijmans D, Dochy F, Moerkerke G. Creating a learning environment by using self-, peer-, and co-assessment. Learn Environ Res 1999; 1: 293–319
  • Sluijmans DMA, Moerkerke G, van Merrienboer JJG, Dochy FJRC. Peer assessment in problem-based learning. Stud Edu Eval 2001; 27: 153–173
  • Stefani LAJ. Comparison of collaborative self, peer and tutor assessment in a biochemistry practical. Biochem Edu 1992; 20: 148–151
  • Sullivan ME, Hitchcock MA, Dunnington GL. Peer and self assessment during problem-based tutorials. Am J Surg 1999; 177: 266–269
  • Swanson DB, Case SM, van der Vleuten CPM. Strategies for student assessment. The Challenge of Problem-Based Learning, D Boud, G Feletti. Kogan Page, London 1997; 269–282
  • Tousignant M, DesMarchais JE. Accuracy of student self-assessment ability compared to their own performance in a problem-based learning medical program: a correlation study. Adv Health Sci Edu 2002; 7: 19–27
  • Van Rosendaal GMA, Jennett PA. Resistance to peer evaluation in an internal medicine residency. Acad Med 1992; 67: 63
  • Ward M, Gruppen L, Regehr G. Measuring self-assessment: current state of the art. Adv Health Sci Edu 2002; 7: 63–80
  • Whitfield CF, Xie SX. Correlation of problem-based learning facilitators’ scores with student performance on written exams. Adv Health Sci Edu Theory and Pract 2002; 7: 41–51
  • Willis SC, Jones A, Bundy C, Burdett K, Whitehouse CR, O’Neill PA. Small-group work and assessment in a PBL curriculum: a qualitative and quantitative evaluation of student perceptions of the process of working in small groups and its assessment. Med Teacher 2002; 24: 495–501
  • Woolliscroft JO, Tenhaken J, Smith J, Calhoun JG. Medical students’ clinical self-assessments: comparisons with external measures of performance and the students’ self-assessments of overall performance and effort. Acad Med 1993; 68: 285–294
  • Zimmerman BJ. Self-efficacy: an essential motive to learn. Contemp Edu Psychol 2000; 25: 82–91
  • Zimmerman BJ, Bandura A, Martinez-Pons M. Self-motivation for academic attainment: the role of self-efficacy beliefs and personal goal setting. Am Edu Res J 1992; 29: 663–676
