
‘Team observation’: a six-year study of the development and use of multi-source feedback (360-degree assessment) in obstetrics and gynaecology training in the UK

Pages e177-e184 | Published online: 03 Jul 2009

Abstract

Multi-source feedback, or 360-degree assessment, is an important part of the assessment of people in the workplace, in both healthcare and industry. Almost all published work concentrates on content validity and generalizability. However, an assessment system needs construct validity, and has to have practicability and acceptability, without sacrificing fitness for purpose, content validity or inter-rater reliability. This was a six-year study of the first UK-wide hospital-based multi-source feedback system, in the specialty of obstetrics and gynaecology. This paper describes the development of the assessment tool, its use and the analyses of the results in several areas: picking up poor performance, congratulating good behaviour, construct validity, the number of domains to be measured, and the minimum number of raters. The study demonstrated that the Team Observation system in reality measured only a very limited number of attributes, and that the main construct under scrutiny is interpersonal behaviour. The system can identify those who may have a problem, using fewer than 10 raters, and yet the process can be a positive experience for the large majority of people assessed.

Introduction

Why multi-source feedback? Justifying the task

Interpersonal skills and behaviours have been consistently recognized as lying at the heart of satisfying clinical encounters (Simpson et al., Citation1991; General Medical Council, Citation1994, Citation2001; Lu et al., Citation1994; Newble et al., Citation1999; Norcini, Citation1999; Southgate & Pringle, Citation1999). Lack of such skills and inappropriate behaviours are a potent cause of system failure (National Institute for Clinical Excellence, Citation2001).

In 1994, no system for assessing such behaviour existed for hospital doctors in the UK. We decided to tackle this issue for obstetrics and gynaecology trainees by introducing an opportunity for staff and peers to record formally a subjective assessment of the trainee's behaviour in relation to communication and relationships. The intention had been that 10 assessments would be performed per trainee on a six-monthly basis, and that a summary of the results would be used to inform an appraisal meeting between the trainee and the educational supervisor.

Evidence that underpinned the development of the first multi-source feedback drafts

Summary of the preliminary (1994) research evidence

The evidence in 1994 showed that assessment of interpersonal or humanistic behaviour was important, feasible and practical (Linn, Citation1975; Dawson-Saunders & Paiva, Citation1986; Maxim & Dielman, Citation1987). Peers or other team members could do this reliably (Risucci et al., Citation1989; Ramsey et al., Citation1993; Ramsey et al., Citation1996). Nurses also performed assessment reliably, but had a slightly different perspective (Butterfield & Pearson, Citation1990; Butterfield & Mazzaferri, Citation1991). There was powerful evidence that a single factor, theme or domain had a major influence on the assessment across all non-cognitive items, even though the items were ostensibly mutually exclusive (Linn, Citation1975; Dielman et al., Citation1980; Dawson-Saunders & Paiva, Citation1986; Maxim & Dielman, Citation1987; Risucci et al., Citation1989; Davidge & Hull, Citation1990; Ramsey et al., Citation1993; Wood & Campbell, Citation2004; Whitehouse et al., Citation2005). This suggested the need for a simple tool. Very little evidence existed as to the optimum descriptors or scale. We anticipated the need for about 11 raters per trainee, on the basis of previous work, especially Ramsey's (Butterfield & Mazzaferri, Citation1991; Ramsey et al., Citation1993; Wenrich et al., Citation1993; Woolliscroft et al., Citation1994; Ramsey et al., Citation1996).

Methods

Development of the Team Observation system in an obstetrics and gynaecology context: from local to national

The initial target was to produce and pilot a local system for multi-source feedback for the obstetrics and gynaecology trainees in Coventry, within the West Midlands. However, the Royal College of Obstetricians and Gynaecologists (RCOG) soon afterwards set out to design the assessment documentation to facilitate the implementation of Calman's ‘Structured Training’. The College set up a system of communication with every obstetrics and gynaecology college tutor in the UK. This was used to help develop the content of a whole new assessment-based curriculum, of which multi-source feedback, or Team Observation as it was called, was to be an important part.

Designing the Multi-source Feedback tool: number of ‘domains’

Using a classical Delphi technique, all college tutors nationally were asked, in consultation with their colleagues, for a detailed view of the skills and behaviours (both clinical and non-clinical) required of obstetrics and gynaecology trainees at every stage of training. The non-clinical features were collected into a suggested set of six ‘domains’, which were re-circulated nationally. From this, we distilled a final list of four desirable sets of behaviour. These were once again circulated for final comment, and final adjustments were made to the wording. The final four domains were:

  • relationship with colleagues;

  • relationship to patients;

  • information gathering/notekeeping;

  • time management/diligence.

As the evidence seemed to point to the existence of a single, behavioural domain amenable to testing by multi-source feedback, an instrument might reasonably ask simply ‘How does the trainee perform in relation to interpersonal skills?’. However, we felt that raters would value the opportunity to express themselves more completely than a single choice of score in one behavioural domain would allow.

The resulting instrument was named ‘Team Observation’, and was then used in every obstetrics and gynaecology training post within the UK for six years. During this time we collected 1719 assessment forms on 113 doctors passing through one major University Hospital.

Designing the Multi-source Feedback tool: choice of scale

It is vital to select a scale that fits the purposes to which an assessment will be put. The Royal College of Obstetricians and Gynaecologists' mission was essentially threefold: first, to put interpersonal behaviours firmly on the curriculum; second, to detect the occasional doctor who may have a serious problem with interpersonal behaviours; and third, to congratulate the majority of doctors whose good performance in interpersonal behaviours had hitherto not been officially recognized.

A simple pass–fail dichotomy would clearly serve the purpose, but we felt that this would discourage choice of the fail option. Indeed, we felt that the purpose of the assessment was to alert us to the possibility of underperformance, rather than necessarily to make an absolute judgement in this regard. We felt, therefore, that there should be more than one ‘unsatisfactory’ option.

A single pass category would be seen as losing an opportunity to congratulate. However, more than one pass category, for an assessment as subjective as observation of behaviour over time, would seem to increase the likelihood of ‘doves and hawks’. This would particularly be the case if single-word descriptors were used, which may mean different things to different people (e.g. ‘good’ and ‘satisfactory’). The risk of central-tendency clustering, with raters defaulting to a middle option, would apply to an odd number of options. We therefore decided that there was a need for some sort of ‘outstanding’ category, described in such a way as to be used sparingly.

The final four-point scale was therefore:

  1. Needs serious attention.

  2. Some deficiency. Progress needed. This includes borderline candidates.

  3. Fine. No problem.

  4. Outstanding. Well done.

We added cameo descriptions of the unsatisfactory and satisfactory trainee (see Appendix). Very importantly, we felt that raters needed a free text opportunity to comment on any specific examples of good or bad behaviour.

Designing the Multi-source Feedback tool: choice, number and training of raters

The Royal College of Obstetricians and Gynaecologists decided that the team leader in each area in which a trainee regularly worked should be invited to be a rater. We decided against allowing trainees to choose raters because, when the whole idea of multi-source feedback was so new, this may have reduced the face validity. We specified that the form should only be filled in after discussion between the team leader and relevant members of the team. This, we hoped, would help to minimize the key fear of trainees: that an isolated anecdote would harm them. For these reasons also (as well as from a pragmatic perspective) we decided not to have a programme of compulsory rater training.

Results

Analysis of six years of ‘Team Observation’ in obstetrics and gynaecology

Results 1: Did Team Observation assessment pick up poor performance? We collected data on 1719 forms, which represented 219 assessments on 113 doctors. As each form collects four ratings, the data pool comprises 6876 ratings (Table 1). Some trainees did fail the assessment: 45 (0.7%) and 360 (5.2%) ratings were in the ‘needs serious attention’ and ‘some deficiency’ categories respectively (Table 2).

Table 1.  Number of doctors assessed

Table 2.  Numbers and percentages of responses in each of the four categories in all 1719 forms
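
As a quick arithmetic check, the figures quoted above can be reproduced directly; the short Python sketch below is illustrative only and is not part of the original analysis.

# Quick check of the counts quoted above: 1719 forms, each carrying four
# domain ratings, and the two unsatisfactory categories.
forms, ratings_per_form = 1719, 4
total_ratings = forms * ratings_per_form                   # 6876
print(round(45 / total_ratings * 100, 1))                  # ~0.7% 'needs serious attention'
print(round(360 / total_ratings * 100, 1))                 # ~5.2% 'some deficiency'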

Whether all under-performers have been detected may not be the main issue. That some have been detected indicates that there is an assessment ‘driver’ in the system to demand minimum performance in terms of interpersonal behaviours. Of course, it would be of concern if the trainees whose behaviour was being designated as unsatisfactory were not actually the poor ones. However, where there is consistency of agreement between more than one assessor then this is at least worthy of further enquiry.

In this study, the identification of seemingly poor performers did not lead to career disadvantage but to skilled feedback, followed by intensified scrutiny of both good and bad behaviour. We felt that this would produce benefits both to the system and to the individual, without prejudice to the future. We therefore believed that action based on only the combined subjective observation of multi-source feedback was justified.

Interestingly, negative free-text comments occurred more frequently than the 6% of negative marks would suggest. Table 3 gives the frequency of positive, negative and neutral comments. It may be seen that 11% of trainees received negative comments, and that a further 13% received mixed comments.

Table 3.  Analysis of the free text comments

We have demonstrated that Team Observation detects some trainees regarded as poor performers, and we make the assumption that this sends an important message to trainees about standards of interpersonal behaviours.

Results 2: Did Team Observation congratulate good behaviour? The study provides evidence to support this hypothesis. Table 2 showed that more than 94% of responses were rated ‘Fine, no problem’, or better. It was intended that the wording of the top category, ‘Outstanding. Well done’, would be sufficiently superlative as to encourage only sparing use. This category, however, was used for 38% of assessments.

By contrast, 5.2% of ratings were borderline, and 0.7% were rated ‘Needs serious attention’. Thus a trainee was more than 15 times more likely to be told that he/she was perceived to be good than to be told that he/she might not be. Trainees had more than a one in three chance of being classified as outstanding, but less than 1% chance of being described as needing serious attention—a fifty-fold difference. Furthermore, in the free text comments, a trainee was four times as likely to receive purely positive comments compared with purely negative, and less than a quarter of trainees received any negative comments at all.
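
These ratios follow directly from the category percentages; a small Python check, using only the figures quoted above, is shown below.

# Checking the ratios quoted above against the category percentages given
# in the text (expressed as percentages of all 6876 ratings).
serious, deficiency, outstanding = 0.7, 5.2, 38.0
satisfactory_or_better = 100.0 - serious - deficiency              # ~94.1%
print(round(satisfactory_or_better / (serious + deficiency), 1))   # ~15.9: 'more than 15 times'
print(round(outstanding / serious))                                # ~54: roughly fifty-fold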

We therefore feel that we have demonstrated that Team Observation is, on balance, a positive activity.

Results 3: Is something consistent and appropriate being measured? We looked for construct validity by first ensuring that an important construct was being measured consistently. To do this, we performed a principal component factor analysis (PCA) on 839 first-time assessments on 113 doctors (Table 4).

Table 4.  Principal components of the four domains of assessment
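
For readers unfamiliar with the technique, a minimal sketch of this kind of principal component analysis is given below. The ratings are simulated (a shared underlying factor plus noise), not the study data, so the exact output will not reproduce Table 4.

# Minimal sketch of a principal component analysis of four domain scores.
# The data are simulated; the real study analysed 839 first assessments.
import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(3.2, 0.4, size=(839, 1))        # common underlying factor
noise = rng.normal(0.0, 0.3, size=(839, 4))         # domain-specific noise
ratings = np.clip(np.round(shared + noise), 1, 4)   # four domains, 1-4 scale

# PCA via the correlation matrix: each eigenvalue's share of the total is
# the proportion of variance explained by that component.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(ratings, rowvar=False))[::-1]
print('variance explained:', np.round(eigenvalues / eigenvalues.sum(), 2))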

Principal component factor analysis of 839 first assessments gave support for the idea that something consistent was being measured: a single main component or construct accounted for 76% of the variance. We then examined the correlation between first assessments and second assessments for each factor on each of the 67 doctors having two sets of assessment (usually separated by 6–7 months). The correlation coefficients were very high (0.77 for the correlation of the principal component with individual scores). Thus assessors seemed to be measuring the same thing on the second occasion as they had on the first. This was highly statistically significant (Table 5).

Table 5.  Spearman's correlation coefficients and p-values between the first and the second mean scores
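
The first-versus-second assessment comparison can be sketched as follows; the data here are simulated under the assumption of fairly stable behaviour, purely to illustrate the calculation.

# Sketch of the test-retest check described above: Spearman correlation
# between a trainee's mean score at the first and second assessment round.
# Simulated data; the study had 67 doctors with two rounds of assessment.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
first = rng.normal(3.3, 0.3, 67)              # mean score, first round
second = first + rng.normal(0.0, 0.2, 67)     # behaviour assumed fairly stable

rho, p_value = spearmanr(first, second)
print(f'rho = {rho:.2f}, p = {p_value:.2g}')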

Furthermore, evidence for a single, consistent construct was the strong correlation between the scores for each of the four domains, across all 219 assessments (Table 6). Thus either all the scores were mainly measuring the same thing, or the same underlying attribute resulted in equivalent performance across the domains. The apparent dominance of this one domain (relationship with colleagues) might be because the other domains are closely linked to it in any individual, or because the raters in this study were best equipped to comment on an individual's relationship with colleagues (and least equipped to assess relationship with patients).

Table 6.  Spearman's correlation coefficients between mean scores of the different domains in 1719 assessments

The character and importance of this attribute cannot be derived from the data on its consistency, but may be inferred from the content of the items being assessed, the interpersonal character of many of the free text comments, and the subsequent evaluation of the system. Ideally, an analysis of the free text comments would be compared with some other measurement of the attribute under scrutiny, but no such other measurements were available at the time.

However, we feel we have provided good evidence that something consistent was being measured: our multi-source feedback system measured a consistent aspect of interpersonal behaviour, and the content of the domains being assessed reflected a nationally agreed description of desired behaviours.

Results 4: How many ‘domains’ or ‘items’ should a multi-source feedback instrument have? If mainly a single attribute is being measured, did we provide enough options for raters to distinguish between their perceptions of good and bad performers? We provided four domains, and a four-point scale. That this was ample was borne out by the ability of each domain to pick up performers regarded as poor. The first domain, relationship with colleagues (the best discerner), would have picked up half of the trainees thought to be unsatisfactory. Taking the best three of the four domains would have improved this figure to 94% (Table 7). The fourth domain added only a further 6% to the pick-up rate of performers regarded as unsatisfactory.

Table 7.  Power of each domain to add discernment to the most discerning domain, in picking up the 124 with unsatisfactory scores
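
The incremental ‘pick-up’ bookkeeping behind Table 7 can be sketched as follows; the per-domain flag probabilities are hypothetical, chosen only to illustrate the calculation.

# Sketch of the incremental pick-up calculation behind Table 7: for each
# domain in turn, how many previously undetected unsatisfactory scores
# does it add? The flag probabilities below are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n_unsatisfactory = 124
flag_probability = np.array([0.50, 0.35, 0.30, 0.20])   # hypothetical, per domain
flags = rng.random((n_unsatisfactory, 4)) < flag_probability

detected = np.zeros(n_unsatisfactory, dtype=bool)
for domain in range(4):
    newly = int((flags[:, domain] & ~detected).sum())
    detected |= flags[:, domain]
    print(f'domain {domain + 1}: +{newly} newly detected, cumulative {detected.mean():.0%}')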

We therefore feel that it is reasonable to conclude that as Team Observation mainly assesses a single ‘interpersonal’ construct (Result 3), an instrument with more than four domains and four scale points is unlikely to improve power to discern between trainees.

Results 5: What is the minimum number of raters needed in Team Observation? We looked at the agreement between assessors, using an intraclass correlation coefficient. This is the correlation between the scores from any two randomly chosen assessments of the same individual—it equals unity only if all the scores for each individual doctor are identical, and it equals zero if scores for individual doctors are no more similar than scores of different doctors.

Intraclass correlation coefficients (ICC) were calculated for the 839 first assessments for each of the 113 doctors (Table 8).

Table 8.  Demonstration of correlation between raters: intraclass correlations

Although highly significant, the intraclass correlation coefficients are not high (i.e. a weak level of agreement). The ICC for the total score for all domains was 0.34, which means that around two-thirds of the difference in scores would, on average, seem to be due to variation in raters rather than a real difference in the trainees. For higher numbers of raters, Table 9 gives some ICC values. (ICC was derived from a one-way analysis of variance, ‘ANOVA’, which found that the total variance was 4.56. This comprised two components: 1.54, the variance between trainees, and 3.02, the variance within trainees.)

Table 9.  Demonstration of the increasing generalizability of intraclass correlation coefficients, by increasing number of raters
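
The single-rater coefficient and the effect of adding raters can be reproduced from the variance components quoted above. The sketch below assumes the Spearman-Brown extrapolation; the paper does not state which formula was used to build Table 9.

# Reproducing the ICC figures from the variance components quoted above,
# then extrapolating to the mean of k raters (Spearman-Brown assumed).
between_trainees = 1.54   # variance between trainees (one-way ANOVA)
within_trainees = 3.02    # variance within trainees (rater disagreement)

icc_single = between_trainees / (between_trainees + within_trainees)   # ~0.34

def icc_mean_of_k(icc1: float, k: int) -> float:
    """Reliability of the mean score of k raters (Spearman-Brown)."""
    return k * icc1 / (1 + (k - 1) * icc1)

for k in (1, 4, 8, 12):
    print(k, round(icc_mean_of_k(icc_single, k), 2))
# With icc_single of about 0.34, eight raters give about 0.80,
# consistent with the figure quoted for eight raters in this study.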

It may be seen that the decision as to the exact minimum number of raters is an arbitrary one: the more raters, the better the coefficient, and thus the more generalizable the results. With eight raters the coefficient was 0.80 in this study, i.e. a four in five chance that any difference between two trainees is a real one, rather than due to a difference in raters. This figure might be further improved by, for instance, better training and selection of raters.

Even when these factors have been optimized, there will still be a need for a minimum number of raters: for instance, raters might differ in the extent to which they had experienced good or bad behaviour in a trainee, or might differ in their perception of the importance of such behaviour.

We conclude that, with the Team Observation system, eight raters would be enough to have a reasonable chance of a representative score.

Results 6: What could Team Observation not do? Because the purpose of the assessment had been to identify those who may have had a problem in relation to interpersonal behaviours, the statistical analysis was necessarily directed at clarifying the reliability of this pass/fail distinction.

It therefore gave no information whatever as to whether any distinction between shades of excellence was reliable. This might be measured by use of the ‘outstanding’ versus the ‘satisfactory’ points on the scale. However, this is a different construct from that which we set out to test. (For instance, the construct of ‘failure’ relates to an implicit criterion of acceptable behaviour, whereas comparison of shades of excellence is essentially normative.)

We might have been better placed to consider the issue of correctly identifying ‘outstanding-ness’ if we had been able to track trainees over time, and if we had had a monitoring system to demonstrate the degree of progress in the behaviours under scrutiny. However, trainees mainly moved on within a year, and as only one measure of behaviour existed (i.e. Team Observation), any use of it to ratify its own validity would have been tautological.

Conclusion

In this study we feel that we have produced evidence that the Team Observation system described is feasible and workable. We have confirmed that a single construct seems to influence the assessment across all non-cognitive items, even though the items are ostensibly mutually exclusive. We have demonstrated that this construct is consistently interpreted as being related to interpersonal behaviours. Four domains were adequate to fulfil the purpose of the identification of some trainees regarded as unsatisfactory in relation to this. In this system (with selection of raters) perhaps eight assessments were needed to achieve a reproducible result.

In such a system the large majority of trainees will be congratulated. However, the study did not find evidence that the system was suitable for the purpose of distinguishing between shades of excellence, nor was it of help in demonstrating progress in individuals.

This system has led on to the development of ‘TAB’ (Team Assessment of Behaviour) by our team, which is now in widespread use in the UK, in all specialties—not just in obstetrics and gynaecology (Whitehouse et al., Citation2002; Whitehouse et al., Citation2005). Further studies continue, to include various aspects of rating scales, differing responses of different raters and longitudinal aspects of measurement of behaviours of doctors in difficulty.

However, multi-source feedback (360-degree assessment) does need further development and further research to answer such questions as what training should be given to raters, whether better item construction can define other behaviours to be tested, and, importantly, the predictive validity of such assessments.

Additional information

Notes on contributors

Laurence Wood

LAURENCE WOOD is an associate postgraduate dean for education in the West Midlands Deanery and a consultant in obstetrics and gynaecology at University Hospital Coventry and Warwickshire in Coventry.

David Wall

DAVID WALL is a deputy regional postgraduate dean in the West Midlands Deanery and professor of medical education at Staffordshire University.

Alison Bullock

ALISON BULLOCK is Reader in Medical and Dental Education in the School of Education at the University of Birmingham and a senior member of its Centre for Research in Medical and Dental Education.

Andrew Hassell

ANDY HASSELL is an associate postgraduate dean for education in the West Midlands Deanery and a consultant in rheumatology at the University Hospital of North Staffordshire in Stoke on Trent.

Andrew Whitehouse

ANDREW WHITEHOUSE is a director of hospital and specialist education in the West Midlands Deanery, and consultant physician at George Eliot Hospital in Nuneaton, Warwickshire.

Ian Campbell

IAN CAMPBELL is a statistician. He qualified as a doctor but now works as a medical statistician on a number of different projects.

References

  • Butterfield PS, Pearson JA. Nurses in resident evaluation: a qualitative study of the participants’ perspectives. Evaluation and the Health Professions 1990; 13: 453–473
  • Butterfield PS, Mazzaferri EL. A new rating form for use by nurses in assessing residents’ humanistic behaviour. Journal of General Internal Medicine 1991; 6: 155–161
  • Davidge AM, Hull AL. A system for the evaluation of medical students’ clinical competence. Journal of Medical Education 1990; 55: 65–67
  • Dawson-Saunders B, Paiva REA. The validity of clerkship performance evaluations. Medical Education 1986; 20: 240–245
  • Dielman TE, Hull AL, Davis WK. Psychometric properties of clinical performance ratings. Evaluation and the Health Professions 1980; 3: 103–117
  • General Medical Council. The New Doctor—Guidance on PRHO Training. General Medical Council, London 1994
  • General Medical Council. Good Medical Practice. General Medical Council, London 2001
  • Linn BS. Performance self assessment. British Journal of Medical Education 1975; 9: 98–101
  • Lu YH, Meng XY, Lui X. Professional behaviour of medical school graduates: an analysis. Medical Education 1994; 28: 296–300
  • Maxim BR, Dielman TE. Dimensionality, internal consistency and interrater reliability of clinical performance ratings. Medical Education 1987; 21: 130–137
  • National Institute for Clinical Excellence. NICE Response to the Report of the Bristol Royal Infirmary Inquiry. National Institute for Clinical Excellence, London 2001
  • Newble D, Paget N, McLaren B. Revalidation in Australia and New Zealand: approach of Royal Australasian College of Physicians. British Medical Journal (Intl edn) 1999; 319(7218): 1185–1188
  • Norcini JJ. Recertification in the United States. British Medical Journal 1999; 319: 1183–1185
  • Ramsey PG, Carline JD, Blank LL, Wenrich MD. Feasibility of hospital-based use of peer ratings to evaluate the performances of practicing physicians. Academic Medicine 1996; 71(4): 364–370
  • Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson B, LoGerfo JP. Use of peer ratings to evaluate physician performance. Journal of the American Medical Association 1993; 269: 1655–1660
  • Risucci DA, Tortolani AJ, Ward RJ. Ratings of surgical residents by self, supervisors and peers. Surgery, Gynaecology and Obstetrics 1989; 169: 519–526
  • Simpson M, Buckman R, Stewart M. Doctor–patient communication: the Toronto consensus statement. British Medical Journal 1991; 303: 1385–1387
  • Southgate L, Pringle M. Revalidation in the United Kingdom: general principles based on experience in general practice. British Medical Journal 1999; 319: 1180–1183
  • Wenrich MD, Carline JD, Giles LM, Ramsey PG. Ratings of the performances of practising internists by hospital based registered nurses. Academic Medicine 1993; 68: 680–687
  • Whitehouse A, Walzman M, Wall D. Pilot study of 360 degree assessment of personal skills to inform record of in-training assessments for senior house officers. Hospital Medicine 2002; 63: 172–175
  • Whitehouse A, Hassell A, Wood L, Wall D, Walzman M, Campbell I. Development and reliability testing of TAB—a form for 360° assessment of senior house officers’ professional behaviour, as specified by the General Medical Council. Medical Teacher 2005; 27: 252–258
  • Wood L, Campbell I. 360 degree assessment: encouraging results of a 6 year study. Paper presented at the Annual Meeting of the Association for the Study of Medical Education, Liverpool, UK, 2004
  • Woolliscroft JO, Howell JD, Patel BP, Swanson DB. Resident–patient interactions: the humanistic qualities of internal medicine residents assessed by patients, attending physicians, program supervisors, and nurses. Academic Medicine 1994; 69(3): 216–224

Appendix 1
