Research Article

A competency based selection procedure for Dutch postgraduate GP training: A pilot study on validity and reliability

Pages 307-313 | Received 04 Jun 2013, Accepted 29 Dec 2013, Published online: 19 Mar 2014

Abstract

Background: Historically, semi-structured interviews (SSI) have been the core of the Dutch selection for postgraduate general practice (GP) training. This paper describes a pilot study on a newly designed competency-based selection procedure that assesses whether candidates have the competencies that are required to complete GP training.

Objectives: The objective was to explore reliability and validity aspects of the instruments developed.

Methods: The new selection procedure, comprising the National GP Knowledge Test (LHK), a situational judgement test (SJT), a patterned behaviour descriptive interview (PBDI) and a simulated encounter (SIM), was piloted alongside the current procedure. Forty-seven candidates volunteered to complete both procedures. The admission decision was based on the results of the current procedure.

Results: Study participants hardly differed from the other candidates. The mean scores of the candidates on the LHK and SJT were 21.9% (SD 8.7) and 83.8% (SD 3.1), respectively. The mean self-reported competency scores (PBDI) were higher than the observed competencies (SIM): 3.7 (SD 0.5) and 2.9 (SD 0.6), respectively. Content-related competencies showed low correlations with one another when measured with different instruments, whereas more diverse competencies measured by a single instrument showed moderate to strong correlations. Moreover, a moderate correlation between the LHK and SJT was found. The internal consistencies (intraclass correlation, ICC) of the LHK and SJT were poor, whereas the ICCs of the PBDI and SIM showed acceptable levels of reliability.

Conclusion: The findings on the content validity and reliability of these new instruments are promising for realizing a competency-based selection procedure. Further development of the instruments and research on their predictive validity should be pursued.

KEY MESSAGES:

  • High-stakes decisions, as in selection procedures, should be based on assessments from multiple sources to ensure reliability.

  • Candidates with questionable performance on one or more instruments should be discussed in a deliberation session in which all assessments are considered. Thresholds of the instruments for deliberation should be determined in advance.

INTRODUCTION

As in most European countries, semi-structured interviews are the core of the selection procedure for Dutch postgraduate general practice (GP) training (Citation1). However, this instrument is a weak predictor of future performance (Citation2). In addition to this shortcoming, admission decisions depend on the selection processes of the eight respective Dutch training departments, although these should be carried out uniformly according to national regulations (Citation1). Consequently, a candidate who is admitted to one department may be rejected by another, which raises doubts regarding the fairness of the procedure.

After a competency-based curriculum was introduced in 2005, a competency-based selection procedure, like those in the UK and Denmark, had to be developed; it aimed to assess whether candidates have the competencies needed to complete training successfully (Citation3,Citation4). Such a procedure is based on the principle that high-stakes decisions should be made after consulting assessments from multiple sources and a variety of instruments (Citation4,Citation5). Patterson et al. showed promising results regarding the reliability and validity of a multi-instrument selection procedure for GP training in the UK (Citation3,Citation6–8).

For the Netherlands, a comparable procedure was developed. In a Delphi procedure, GP experts selected the essential competencies from four role domains: ‘medical expertise,’ ‘communication,’ ‘collaboration’ and ‘professionalism’ (Citation9). A critical appraisal of the literature resulted in the selection of four instruments: a knowledge test, a situational judgement test, a patterned behaviour descriptive interview and a simulated encounter (Citation3,Citation4,Citation6–8,Citation10–18). These instruments were intended to assess candidates’ ability to cope with complex tasks and to manage (medical) problems in general practice at different levels of Miller's pyramid, namely ‘knows,’ ‘knows how’ and ‘shows how’ (Table 1) (Citation19).

Table 1. Domains, Miller's pyramid levels and instruments of the current and the new competency-based selection procedure.

To explore the content validity and reliability of the instruments, 47 candidates for GP training volunteered to complete the new competency-based procedure alongside the current procedure. The study addressed the following questions:

  • How are assessments of content-related and divergent competencies associated between and within the instruments?

  • What is the internal consistency of the instruments?

  • How do candidates evaluate the relevance and fairness of the instruments?

METHODS

Study design

All candidates in the April 2011 selection procedure for postgraduate GP training in Nijmegen and Utrecht received written information regarding the goal and process of the study and were invited to complete the new competency-based selection procedure. Candidates who were willing to participate signed an informed consent form. These volunteers received written feedback on their performance on the new instruments for their private use, as well as a gift voucher of €50. The current procedure determined the admission decision. Both procedures were carried out independently of each other. The study was executed according to the code of conduct for the use of personal data in scientific research (Citation20).

Data collection

Individual characteristics. Data on all candidates regarding gender, age (in years), past performance (clinical experience as a medical doctor: < one year; ≥ one year), region of medical school (north-western Europe; elsewhere), the number of times applied (first time; more than once) and the admission decision for GP training were extracted from administrative databases and clerically anonymized before data processing (Citation20).

Current procedure

Semi-structured interview (SSI). To assess the personal qualities of the candidates, interviews were conducted by a selection committee consisting of a staff member, a trainer and a trainee, who had received tailored instruction (Citation1,Citation21). Each member of the selection committee independently assessed the candidates’ motivation, orientation on the job, learning needs and personal attributes. In Nijmegen, a four-point scale was used: insufficient (0); uncertain (1); sufficient (2); good (3). In Utrecht, a three-point scale was used: below standard (1); at standard (2); above standard (3). The Nijmegen scores of ‘0’ and ‘1’ were recoded to correspond to the Utrecht score of ‘1’; the final score for each quality was the mean score of the three assessors (see the sketch below). The head of the department decided on admission, taking the scores and the supporting arguments of the selection committee members into account.
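
The scale harmonization above amounts to a simple recoding followed by averaging. The following Python sketch is purely illustrative; the function and variable names are ours, not the authors’:

```python
# Illustrative sketch of the SSI scale harmonization described above;
# not the authors' implementation.

def recode_nijmegen(score: int) -> int:
    """Map the Nijmegen four-point scale (0-3) onto the Utrecht
    three-point scale (1-3): '0' and '1' both become '1'."""
    return max(score, 1)

def ssi_quality_score(assessor_scores: list[int], department: str) -> float:
    """Final score for one personal quality: the mean of the three assessors."""
    if department == "Nijmegen":
        assessor_scores = [recode_nijmegen(s) for s in assessor_scores]
    return sum(assessor_scores) / len(assessor_scores)

# Example: a Nijmegen candidate rated 0, 2 and 3 on one quality
print(ssi_quality_score([0, 2, 3], "Nijmegen"))  # recoded to [1, 2, 3] -> 2.0
```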

New procedure

Knowledge test for general practice (LHK, Dutch abbreviation). To assess medical knowledge, the national GP knowledge test was used (Citation17,Citation22). Twice a year, GP experts develop a completely new LHK, consisting of case vignettes with multiple-choice questions addressing the entire GP domain. The number of correct answers minus the number of incorrect answers (a correction for guessing) was converted into a percentage score (Citation17,Citation22).
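
As a minimal sketch, the guessing correction described above can be expressed as follows; how unanswered items are treated is our assumption, as the exact LHK rules are not specified here:

```python
# Sketch of the guessing-corrected LHK percentage score: the number of
# correct answers minus the number of incorrect answers, expressed as a
# percentage of the total number of questions. Treating unanswered items
# as neither correct nor incorrect is an assumption on our part.

def lhk_percentage(n_correct: int, n_incorrect: int, n_questions: int) -> float:
    return (n_correct - n_incorrect) / n_questions * 100.0

# Example: 60 correct and 38 incorrect answers on a 120-question test
print(round(lhk_percentage(60, 38, 120), 1))  # -> 18.3
```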

Situational judgement test (SJT). To measure candidates’ cognitions about how to effectively resolve practical dilemmas in a GP setting, an online SJT was developed by highly experienced GPs to ensure face validity (Citation6,Citation12). The test consists of 20 situations with four behavioural options each (Box 1; a scoring sketch follows the box). The candidates indicated the extent to which they perceived each of the 80 presented options to be effective (1 = completely ineffective, 5 = completely effective). Subsequently, the extent to which these ratings corresponded to those of 15 experienced GPs was determined by calculating the absolute difference between the candidates’ ratings and the mean ratings of the experienced GPs. The absolute difference was subtracted from the maximum score and converted into a percentage score.

Box 1. Example of a situation with four behavioural options in the situational judgement test

A patient visits his GP with complaints of fatigue. Lately, the man has been working hard under a lot of pressure; he is worried that something is physically wrong. After a thorough examination, the GP tells him that there is no need to worry: his complaints are most probably a consequence of working so hard.

Patient: ‘Still, I do not trust it. Could you arrange more medical examinations in the hospital?’

Reactions of the GP:

  • OK, but if I explain to you that these complaints are normal in your situation, why do you want more examinations? What do you think will be the advantage for you?

  • OK, if you are still so worried and I cannot take away your concerns, then perhaps it is better to refer you to internal medicine.

  • You are still not convinced? OK, let us see why you are so uncertain. Maybe I can reassure you.

  • You are still worried. I do not understand why. I think you are somewhat exaggerating.
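
The SJT scoring described before Box 1 can be sketched as follows. The maximum possible difference of four points per option is our assumption, as is the per-option aggregation:

```python
# Illustrative sketch of the SJT score: the summed absolute difference
# between a candidate's effectiveness ratings (1-5) and the mean ratings
# of 15 experienced GPs, subtracted from the maximum possible difference
# and expressed as a percentage. The worst case of |1 - 5| = 4 points per
# option is an assumption, not taken from the paper.

def sjt_percentage(candidate_ratings, expert_mean_ratings):
    max_total = 4.0 * len(candidate_ratings)
    total_diff = sum(abs(c - e)
                     for c, e in zip(candidate_ratings, expert_mean_ratings))
    return (max_total - total_diff) / max_total * 100.0

# Example with three options instead of the full 80
print(round(sjt_percentage([5, 2, 4], [4.2, 1.8, 4.0]), 1))  # -> 91.7
```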

Patterned behaviour descriptive interview (PBDI). In the PBDI, candidates were asked to report on clinical experiences in which they showed: ‘empathy,’ ‘collaboration,’ ‘coping with pressure,’ ‘respect,’ ‘self-care’ and ‘reflection.’ Two staff members explored candidates’ behaviour during those situations using the STAR (S = situation, T = task, A = action, R = result) technique (Citation13). Each competency was scored by consensus on a five-point scale (1 = absent, 2 = doubtful, 3 = average, 4 = sufficient and 5 = good) with written substantiation. All assessors attended a one-day training session to familiarize them with the STAR technique and the competency assessment.

Simulated encounter (SIM). The SIM aimed to measure three competencies: ‘medical expertise,’ ‘doctor–patient communication’ and ‘professionalism’ (Citation3). Experienced GPs constructed two cases: ‘a patient with heart complaints’ and ‘a patient with dyspnoea.’ Candidates worked through one randomly chosen case. Assessors and actors also attended a one-day training session. One assessor assessed the competencies on a five-point scale (see PBDI) with written substantiation.

Deliberation. The assessments of all instruments were aggregated and validated in a deliberation session in which candidates scoring below the thresholds were evaluated. The threshold for the LHK and SJT was the mean score of the group of candidates minus one SD. The threshold for the PBDI and SIM assessments was a score of ‘1’ (absent) on at least one competency or a score of ‘2’ (doubtful) on at least two competencies.
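
The threshold rules lend themselves to a mechanical first pass before the deliberation session. The sketch below is our reading of those rules; the field names and data structure are illustrative:

```python
# Sketch of the deliberation thresholds described above; not the
# authors' implementation.
import statistics

def needs_deliberation(candidate, all_lhk_scores, all_sjt_scores):
    """Flag a candidate whose results fall below any instrument threshold."""
    # LHK/SJT threshold: group mean minus one standard deviation
    lhk_cut = statistics.mean(all_lhk_scores) - statistics.stdev(all_lhk_scores)
    sjt_cut = statistics.mean(all_sjt_scores) - statistics.stdev(all_sjt_scores)

    def weak_competencies(scores):
        # PBDI/SIM threshold: any competency scored 1 ('absent'), or at
        # least two competencies scored 2 ('doubtful')
        return 1 in scores or sum(s == 2 for s in scores) >= 2

    return (candidate["lhk"] < lhk_cut
            or candidate["sjt"] < sjt_cut
            or weak_competencies(candidate["pbdi"])
            or weak_competencies(candidate["sim"]))
```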

Evaluation. On a pre-coded response sheet, the candidates indicated the degree (ranging from 1, ‘strongly disagree,’ to 5, ‘strongly agree’) to which they considered the content of the instruments relevant to general practice and the degree to which each instrument allowed them to demonstrate their competence, as well as whether or not they considered the instrument fair for all candidates (yes/no).

Data analysis

First, the individual characteristics of the study population and the remaining candidates were explored (Citation23). Assessments of competencies, qualities and instruments were presented as descriptive statistics (mean; SD). Little data was missing (2.8% of all data); 75% of the missing values belonged to the SSI data. In those cases, the mean values of the group of candidates were imputed (Citation24).
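
Group-mean imputation of this kind is a one-liner in, for example, pandas; the sketch below assumes the scores sit in a data frame with one row per candidate, and the column names are invented:

```python
# Sketch of group-mean imputation for the sparse missing values;
# column names are invented for illustration.
import pandas as pd

scores = pd.DataFrame({
    "ssi_motivation": [2.0, None, 3.0],
    "lhk":            [21.0, 30.0, None],
})
scores = scores.fillna(scores.mean())  # replace each NaN with the column mean
```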

Next, we investigated the degree to which content-related competencies or qualities measured by different instruments, and divergent competencies or qualities measured by the same instrument, were associated. The associations between the (overall) instrument scores were also estimated. All associations were expressed as Pearson's correlation coefficient, as all rating scales were approximately continuous (Citation1). The internal consistency of the instruments was estimated using the intraclass correlation (ICC) (Citation25).
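
The paper does not state which of McGraw and Wong's ICC forms was used; the sketch below assumes the two-way consistency, average-measures form, ICC(3,k), a common choice for internal consistency, alongside Pearson's r from SciPy:

```python
# Sketch of the association and reliability estimates; ICC(3,k) is our
# assumed formulation (McGraw & Wong 1996), as the paper does not name one.
import numpy as np
from scipy.stats import pearsonr

def icc_3k(scores):
    """Two-way mixed, consistency, average-measures ICC for an
    (n candidates) x (k items or competencies) score matrix."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between candidates
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between items
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Pearson's r between two instruments' overall scores (toy data)
r, p = pearsonr([21.9, 30.1, 15.4, 28.0], [83.8, 85.0, 80.2, 86.1])
```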

Finally, we compared the mean scores (SD) of the candidates’ evaluation of the new instruments with the SSI evaluation scores (Citation23).

The analyses were performed using SPSS version 20.

RESULTS

Study population

In Nijmegen and Utrecht, 60 and 63 candidates, respectively, participated in the current procedure. Forty-eight candidates were willing to participate in both procedures; one candidate withdrew due to personal circumstances. The study population (n = 47) did not differ from the other candidates (n = 76) with regard to gender, age, number of times applied, past performance and percentage admitted, except for region of medical school (Table 2). Ten of the 47 candidates were rejected (21%).

Table 2. Individual characteristics; differences (δ, 95% confidence interval, CI) between the study population and the remaining candidates.

Candidates’ assessments

In the SSI, motivation, orientation on the job, learning needs and personal attributes were all assessed at about the same level, above the ‘2’ scale point on the three-point scale (Table 3). Four, five, three and five candidates, respectively, were assessed as ‘below standard’ for these qualities.

Table 3. Mean and overall scores (SD) of the instruments in the current (SSI) and the new procedure (LHK, SJT, PBDI, SIM) (n = 47).

The mean score of the LHK was 21.9% (SD 8.7); ten candidates scored below the threshold. The mean difference-based score of the SJT was 83.8% (SD 3.1); its standard deviation was nearly three times smaller than that of the LHK scores. Six candidates scored below the threshold.

The mean scores of the self-reported competencies assessed by the PBDI were higher than those of the observed competencies assessed by the SIM on the five-point scale: PBDI 3.7 (SD 0.5); SIM 2.9 (SD 0.6). The assessors used all scale points; thus, it was possible to assess differences between candidates. Six candidates were selected for deliberation based on the PBDI, and 12 based on the SIM.

Overall, 26 candidates (55.3%) were eligible for deliberation: 19 candidates scored below the threshold on one instrument, six on two instruments and one on three instruments.

Associations

Gender was weakly associated with motivation, learning needs and personal attributes in the current procedure (r: 0.34, 0.41 and 0.32, respectively); women received moderately better scores on these qualities. Age showed a weak negative association with learning needs (r: −0.30). The remaining individual characteristics were not associated with personal qualities in the current procedure (data not shown).

In general, content-related competencies measured by different instruments were weakly associated with one another, whereas more divergent competencies measured by a single instrument were moderately to strongly associated (Supplementary Table 1, available online only at http://www.informahealthcare.com/doi/abs/10.3109/13814788.2014.885013). For example, there was a weak correlation (r: 0.28) between empathy in the PBDI and doctor–patient communication in the SIM. However, there was a strong correlation between professionalism and doctor–patient communication in the SIM (r: 0.76). Therefore, the overall PBDI and SIM scores were considered relevant indicators of the self-reported and the general clinical competencies, respectively.

Individual characteristics were not correlated with knowledge (LHK), cognitions on effective coping (SJT), self-reported competencies (PBDI) or observed competencies (SIM) (Supplementary Table 2, available online only at http://www.informahealthcare.com/doi/abs/10.3109/13814788.2014.885013). Gender was correlated with the overall score of the SSI; female candidates scored higher. There was a moderate correlation between the overall score for personal qualities measured by the SSI and the general clinical competencies measured with the SIM (r: 0.34), but not with the remaining instruments. A moderate correlation (r: 0.34) was found between the LHK (‘knows’ level) and the SJT (‘knows how’ level). No correlations were found between the remaining instruments.

Reliability

The internal consistency of the SSI based on the four personal quality scores was moderate (ICC: 0.65). The internal consistencies of the LHK and SJT were poor (ICC: 0.59 and 0.55, respectively). The internal consistencies of the PBDI and SIM showed acceptable levels of reliability (ICC: 0.79 and 0.73, respectively).

Candidates’ evaluation

Candidates perceived the content of the LHK as the most relevant to GP training. Compared with the SSI, the SJT, PBDI and SIM assessments were perceived as providing the same or better opportunities for demonstrating GP competence and as being fairer to candidates (Table 4).

Table 4. Candidates’ evaluation scores of the instruments in the current and the new procedure on a five-point scale, ranging from 1 (strongly disagree) to 5 (strongly agree); n = 46.

DISCUSSION

Main findings

Scores of competencies measured with instruments at different levels of Miller's pyramid were hardly associated with each other. Two instruments (PBDI and SIM) showed acceptable reliability; the ICCs of the LHK and SJT were poor. In general, the candidates considered the instruments used in the new procedure to be relevant and fairer than the SSI in the current procedure; a similar result was obtained in the UK (Citation8).

Strengths and limitations

Forty-seven candidates completed both procedures, allowing us to compare the results of the assessments. Face validity was guaranteed because the selection of the competencies, the development of the four instruments and the assessments were performed by experienced GPs. However, the study has some limitations. First, the participants were volunteers; they are therefore likely to have judged the new procedure more positively, and they did not represent candidates who had finished medical school outside north-western Europe. In addition, the correlations between qualities and competencies found in the SSI, PBDI and SIM appeared to contribute to relatively high ICCs but could be attributable to a halo effect. Investigating this possibility was beyond the scope of this study.

Interpretation

The new procedure was based on competencies selected by experts in a Delphi procedure and on theoretical considerations regarding assessment programmes (Citation3–5,Citation26). We therefore assume this new procedure to be more relevant and suitable. The LHK and SJT are easy to implement; the PBDI and SIM require well-trained assessors and actors.

It is important to assess medical knowledge before training because knowledge forms the basis of Miller's pyramid, and a low level of knowledge is a predictor of poor performance (Citation2,Citation11,Citation18). The choice of the LHK, a validated and fairly reliable test for trainees, was rather pragmatic, as different progress tests are administered during medical school in the Netherlands (Citation17,Citation27). In general, the reliability of the LHK varies between 0.60 and 0.76 (Citation27). The reliability of the LHK used in this study was lower, which may be because we did not perform an item analysis to delete unsuitable items, or because our population differed from the trainee population. The grade point average of the master's programme, which includes cognitive and non-cognitive assessments, may become more relevant in the near future (Citation2,Citation4).

Compared with the SJT used in the selection for GP training in the UK, we found disappointing psychometric results: small variance and poor reliability (Citation6,Citation12). Developing more situations may improve the instrument. Further testing is needed to determine the value of the SJT in the Dutch context.

The PBDI and SIM have relatively high ICCs, suggesting that both assessments are suitable for high-stakes decisions. Given that reporting on competencies is easier than ‘showing how,’ the generally higher scores of the candidates on the PBDI were expected.

The assessments of GP knowledge, cognitions on coping with dilemmas, and self-reported and observed competencies were found to be weakly correlated with one another, which could be attributed to the different levels of Miller's pyramid that were assessed. Approximately half of the candidates were eligible for deliberation, suggesting that all ‘borderline’ candidates could be discussed to reach a fair and balanced admission decision.

Implications for the future

All of the instruments tested can be implemented and should be evaluated in the context of the entire procedure. The costs of using the LHK and SJT are relatively low. However, the SJT should undergo further development to improve reliability. The PBDI and SIM require extensive resources and logistics to implement. The known weaknesses of rater-based assessments, including the halo effect, may be reduced by expanding the number of assessors. Integrating more mini-interviews into the procedure could prevent a bias caused by context specificity (Citation5,Citation16,Citation26,Citation28). The narrative feedback provided on the LHK, PBDI and SIM assessments was unexpectedly rich and could be used at the beginning of GP training to establish an individual development plan for each trainee.

Conclusion

The four complementary instruments of the new procedure were found to be promising and applicable in a competency-based procedure. Further development of the instruments and research on their predictive validity should be pursued to establish a relevant and publicly defensible selection procedure for GP training.

Supplemental material

http://www.informahealthcare.com/doi/abs/10.3109/13814788.2014.885013


ACKNOWLEDGMENTS

The authors would like to thank Mariëlle Damoiseaux and Edger Gubbels for their assistance in developing and executing the procedure; Barend Koch for his elaborate work on the SJT and Huisartsopleiding Nederland for making the LHK available.

FUNDING

SBOH (Stichting Beroeps Opleiding Huisartsen, Dutch abbreviation).

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

REFERENCES

  • Vermeulen MI, Kuyvenhoven MM, Zuithoff NPA, Tromp F, Graaf van der Y, Pieters HM. Selection for Dutch postgraduate GP training; time for improvement. Eur J Gen Pract. 2012;18:201–5.
  • Siu E, Reiter HI. Overview: What's worked and what hasn't as a guide towards predictive admissions tool development. Adv Health Sci Educ. 2009;14:758–75.
  • Patterson F, Ferguson E, Norfolk T, Lane P. A new selection system to recruit general practice registrars: Preliminary findings from a validation study. Br Med J. 2005;330:711–4.
  • Prideaux D, Roberts C, Eva K, Centeno A, McCrorie P, McManus C, et al. Assessment for selection for the health care professions and specialty training: Consensus statement and recommendations from the Ottawa 2010 Conference. Med Teach. 2011;33:215–23.
  • van der Vleuten CPM, Schuwirth LW. Assessing professional competence: From methods to programmes. Med Educ. 2005; 39:309–17.
  • Patterson F, Ashworth V, Zibarras L, Coan P, Kerrin M, O’Neill P. Evaluations of situational judgement tests to assess non-academic attributes in selection. Med Educ. 2012;46:850–68.
  • Irish B, Patterson F. Selecting general practice specialty trainees: Where next? Br J Gen Pract. 2010;60:849–52.
  • Patterson F, Zibarras L, Carr V, Irish B, Gregory S. Evaluating candidate reactions to selection practices using organisational justice theory. Med Educ. 2011;45:289–97.
  • Tromp F, Vernooij-Dassen M, Grol R, Kramer A, Bottema B. Assessment of CanMEDS roles in postgraduate training: The validation of the Compass. Patient Educ Couns. 2012;89:199–204.
  • Eva KW, Reiter HI, Trinh K, Wasi P, Rosenfeld J, Norman GR. Predictive validity of the multiple mini-interview for selecting medical trainees. Med Educ. 2009;43:767–75.
  • Schmidt FL, Hunter JE. The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychol Bull. 1998; 124:262–74.
  • Lievens F, Patterson F. The validity and incremental validity of knowledge tests, low-fidelity simulations, and high-fidelity simulations for predicting job performance in advanced-level high-stakes selection. J Appl Psychol. 2011;96:927–40.
  • Huffcutt AI, Conway JM, Roth PL, Stone NJ. Identification and meta-analytic assessment of psychological constructs measured in employment interviews. J Appl Psychol. 2001;86:897–913.
  • Altmaier EM, Smith WL, O’Halloran CM, Franken EA, Jr. The predictive utility of behavior-based interviewing compared with traditional interviewing in the selection of radiology residents. Invest Radiol. 1992;27:385–9.
  • Brothers TE, Wetherholt S. Importance of the faculty interview during the resident application process. J Surg Educ. 2007;64: 378–85.
  • Eva KW, Rosenfeld J, Reiter HI, Norman GR. An admissions OSCE: The multiple mini-interview. Med Educ. 2004;38:314–26.
  • Ram P, van der Vleuten C, Rethans JJ, Schouten B, Hobma S, Grol R. Assessment in general practice: The predictive value of written-knowledge tests and a multiple-station examination for actual medical performance in daily practice. Med Educ. 1999;33:197–203.
  • Vermeulen MI, Kuyvenhoven MM, Zuithoff NPA, van der Graaf Y, Pieters HM. Attrition and poor performance in general practice training; age, competence and knowledge play a role. Ned Tijdschr Geneeskd. 2011;155:A2780.
  • Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990;65:S63–7.
  • Association of Universities in the Netherlands. Gedragscode voor gebruik van persoonsgegevens in wetenschappelijk onderzoek [Code of conduct for the use of personal data in scientific research; in Dutch]. 2005.
  • Vermeulen MI, Kuyvenhoven MM, Zuithoff NPA, Graaf van der Y, Damoiseaux RAMJ. Dutch postgraduate GP selection procedure; reliability of interview assessments. BMC Fam Pract 2013;14:43.
  • Van Leeuwen YD, Pollemans MC, Mol SSL, Eekhof JAH, Grol R, Drop MJ. Dutch knowledge test for general practice: Issues of validity. Eur J Gen Pract. 1995;1:113–7.
  • Gardner MJ, Altman DG. Statistics with confidence. Confidence intervals and statistical guidelines. London: Br Med J. 1989.
  • van der Heijden GJ, Donders AR, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example. J Clin Epidemiol. 2006;59:1102–9.
  • McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
  • Schuwirth LWT, Vleuten van der CPM. How to design a useful test: The principles of assessment. In: Swanwick T, editor. Understanding medical education: Evidence, theory and practice. First ed. London: Wiley-Blackwell; 2010. pp. 195–207.
  • Kramer AWM, Düsman H, Tan LHC, Jansen KJM, Grol RPTM, van der Vleuten CPM. Effect of extension of postgraduate training in general practice on the acquisition of knowledge of trainees. Fam Pract. 2003;20:207–12.
  • Gingerich A, Regehr G, Eva KW. Rater-based assessments as social judgements: Rethinking the etiology of rater errors. Acad Med. 2011;86:S1–7.
