A novel workplace-based assessment for competency-based decisions and learner feedback


Abstract

Background: Increased recognition of the importance of competency-based education and assessment has led to the need for practical and reliable methods to assess relevant skills in the workplace.

Methods: A novel milestone-based workplace assessment system was implemented in 15 pediatrics residency programs. The system provided: (1) web-based multisource feedback (MSF) and structured clinical observation (SCO) instruments that could be completed on any computer or mobile device; and (2) monthly feedback reports that included competency-level scores and recommendations for improvement.

Results: For the final instruments, an average of five MSF and 3.7 SCO assessment instruments were completed for each of 292 interns; instruments required an average of 4–8 min to complete. Generalizability coefficients >0.80 were attainable with six MSF observations. Users indicated that the new system added value to their existing assessment program; the need to complete the local assessments in addition to the new assessments was identified as a burden of the overall process.

Conclusions: Outcomes – including high participation rates and high reliability compared to what has traditionally been found with workplace-based assessment – provide evidence for the validity of scores resulting from this novel competency-based assessment system. The development of this assessment model is generalizable to other specialties.

Introduction

The intersection between medical education and safe patient care is optimized when trainees are given appropriate levels of educational guidance and supervision to prepare them for the responsibility of caring for patients. Progressively increasing patient-care responsibilities alongside decreasing levels of supervision are necessary throughout the educational process to prepare the trainee for unsupervised practice (Hicks Citation2011), and assignment of and progression through training experiences should be informed by reliable evidence of readiness to practice within a specified level of supervision (Kogan et al. Citation2014; Carraccio et al. Citation2016). Assessment methods that can provide such evidence are critical but to this point have been lacking.

For nearly two decades, organizations that accredit and certify physicians have been promoting the use and benefits of competency-based assessment frameworks. Residency programs are required to provide semiannual reports to the Accreditation Council for Graduate Medical Education (ACGME) for all residents; these reports detail the level of achievement for a specified subset of competencies, called the “reporting milestones” (Nasca et al. Citation2012). Residency programs also use competency-based performance data to make decisions about progression and promotion, all of which culminates in attestation of eligibility to sit for certifying examinations (Swing Citation2007; Hicks et al. Citation2010; Nasca et al. Citation2012; Swing et al. Citation2013). Although some competencies, such as medical knowledge or decision-making, are best assessed using traditional methods (e.g. written or computer-based multiple-choice tests), the most appropriate method to assess other competencies (e.g. interpersonal and communication skills, professionalism, patient care, practice-based learning and improvement, systems-based practice) is through direct observation in the workplace. Despite the potential advantages of collecting data in the workplace, the results of such assessments generally have been disappointing. There is a clear need to develop improved workplace-based assessment methods (Swing et al. Citation2009; Driessen and Scheele Citation2013; Holmboe et al. Citation2016) and to provide evidence for the utility of these methods in terms of reliability, validity, cost, acceptability, and educational impact (van der Vleuten Citation1996).

An effective workplace-based assessment system must not only accurately and efficiently collect data on learner performance; it also must produce actionable reports. The importance of intentional, carefully constructed feedback in medical education has been well documented (Lockyer Citation2003; Sargeant et al. Citation2005, Citation2006; Archer Citation2010). Feedback should provide learners with specific information about areas of strength and areas in need of improvement. Additionally, the feedback should stimulate a rich dialog between the learner and mentor, peer, clinical supervisor, and/or academic advisor. Despite the importance of providing this type of information, formalized processes that support the collection of specific feedback and its synthesis in a concise report are not the norm.

In 2014, the American Board of Pediatrics (ABP), the National Board of Medical Examiners (NBME), and the Association of Pediatric Program Directors (APPD) entered into a collaboration to develop a competency-based workplace-based assessment system utilizing the Pediatrics Milestones: the Pediatrics Milestones Assessment Collaborative (PMAC). The intention was to assist programs in their efforts to report learner achievement both for ACGME-required Pediatrics Milestones and for other Pediatrics Milestones critical for informing advancement decisions within a training program (key advancement decisions are listed in Table 1; Hicks et al. Citation2016). The progression decision framework allows program leaders to frame the decisions they must make around evidence of readiness to advance to the next level of responsibility (with an associated decrease in supervision). The ACGME-reportable Pediatrics Milestones provide evidence of a learner’s level of performance on individual competencies; program leaders then integrate assessment evidence across a number of Pediatrics Milestones levels to make the larger decisions about a learner’s readiness to take on higher levels of responsibility with decreasing levels of supervision. The results of a pilot study that investigated the feasibility of developing and administering assessment content for the first PMAC progression decision, “Readiness to serve as an intern in the inpatient setting”, were reported previously (Hicks et al. Citation2016; Turner et al. Citation2017). The present paper reports on the subsequent development of a workplace-based assessment system designed to assess learners with respect to the second progression decision: “Readiness to care for patients in the inpatient setting without a supervisor nearby”. Development of the assessment content and processes for this decision took place over two distinct phases. The purpose of Phase 1 (P1) was to develop content (items and instruments) that would (1) be feasible and acceptable and (2) inform the progression decision. The Phase 1 development process involved (1) testing initial assessment instruments at a small group of sites and then (2) analyzing the resulting data to inform revisions to items and instruments. During Phase 2 (P2), data were collected using the revised instruments; following data collection, analyses were completed to examine the validity and overall utility of the assessments for the intended purpose (i.e. informing decisions about readiness to care for patients without a supervisor nearby). This work specifically addresses the need for assessment methods that provide reliable evidence of learner readiness to practice within a specified level of supervision. As such, we believe that the process for developing the instruments and the validity argument to support them are applicable to a wide range of specialties across the learning continuum.

Table 1. Decisions selected by the pediatric community to inform program leaders and learners about the learner’s readiness to advance within the pediatric graduate medical education training program.

Phase 1

Methods

Content development

The purpose of Phase 1 was to develop content (items and instruments), to determine whether programs could collect a sufficient number of instruments (response process), and to evaluate whether the items performed well (content validity). Eleven pediatric clinician educators were recruited to serve as content expert panel members to guide the development and refinement of the P1 items and assessment instruments. Over the course of six months, the content experts convened through multiple webinars to select, review, and revise content. The steps, many of which involved multiple iterations, were as follows: (1) initial selection of the competencies that provided the best evidence to inform the decision; (2) identification of observable behaviors that provide support for the decision; (3) identification of the interprofessional roles best positioned to observe those behaviors; and (4) determination of what, if any, specific inpatient activity would prompt the demonstration of those behaviors. Full-group consensus was not required for any of the above steps; when consensus was not achieved, analyses such as tabulation of importance ratings (e.g. for initial competency selection) and frequency of response (to select the roles and activities that should be included) provided objective decision-making metrics. The content experts also were asked about the best methods for capturing the desired data, and they identified multisource feedback and structured clinical observation as the most useful methodologies. (Table 2 lists the competencies selected by the expert panel as critical for this advancement decision; ACGME and ABP Citation2012; Benson et al. Citation2014.) Prior to finalizing the P1 instruments, cognitive interviews were conducted with residents, faculty, and nurses. During these interviews, participants reviewed instrument content relevant to their role and provided feedback about whether items were understood and answered in a manner consistent with the intentions of the item developers. The results were used to refine the instruments prior to programming them in the delivery system.

Table 2. Readiness to care for patients in the inpatient setting without a supervisor nearby.

Instruments

As mentioned previously, two types of instruments were developed in Phase 1: multisource feedback (MSF), with 17 scored items on the resident version of the instrument (other roles had fewer items), and structured clinical observation (SCO), with 10 scored items. Items were written to capture behaviors relevant to a specific competency within a specific context; those relevant to rounds were assigned to the SCO instrument, and those for which it was important to collect data on the behavior more generally over time were assigned to the MSF. MSF instruments are intended to collect feedback about a learner after an inter-professional team member works with or observes the learner for a minimum of two days during a specific rotation; this minimum was selected because a single day of observation was not viewed as sufficient to gather information about the learner’s performance across occasions. The MSF instrument is completed by four different observer groups – supervising residents, attending faculty, nurses, and other inter-professional clinical team members such as social workers and respiratory therapists – and each group completes a version of the instrument that is specific to that clinical role. SCO instruments are intended to collect behavioral feedback from supervising residents or faculty based on a single observation of the learner’s performance on rounds.

All instruments contained scored competency-specific items as well as unscored feedback items requesting specific recommendations for improvement when an item score fell below a specified threshold, a global item that addressed the key decision “Readiness to care for patients in the inpatient setting without a supervisor nearby”, and numerous opportunities for entering free-text comments. Additional items allowed observers to report specific details about the period in which they observed the learner. Specifically, on all MSF instruments observers were asked to report the number of days they were able to observe the learner during the rotation. In addition, because performance in the resident training environment can be affected by numerous factors outside of the learner’s control, it was important to capture an indication of the conditions under which the learner’s assessment took place. To do so, residents and faculty were given specific information about factors that can affect workload (e.g. patient volume, number of discharges and/or admissions, patient acuity or complexity, administrative tasks) and then were asked to characterize (1) the typical workload for the rotation and (2) the level of the learner’s workload relative to the typical workload.

Scoring and platform

A complex scoring equation was developed that allowed for aggregating within-competency item-level scores across instruments; this equation produces a weighted score that is based on a number of factors including the duration of observation reported by observers, the included assessment instruments (MSF or SCO), and the number of items completed for a given competency and by a given observer role. (Authors can provide specific details about the scoring equations for those who may be interested in collaborating in this research work.)
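The full equation is not reproduced here, but the general approach can be illustrated with a minimal sketch; the weights, field names, and caps below are invented for illustration and are not the PMAC equation. The sketch shows how item-level scores for one competency might be pooled across instruments, with each instrument's contribution weighted by observation duration, instrument type, observer role, and the number of items completed.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: weights for instrument type, observer role,
# and duration of observation are invented for the sketch.
TYPE_WEIGHT = {"MSF": 1.0, "SCO": 0.8}
ROLE_WEIGHT = {"faculty": 1.0, "resident": 1.0, "nurse": 0.9, "other": 0.8}

@dataclass
class Observation:
    instrument: str               # "MSF" or "SCO"
    role: str                     # observer's clinical role
    days_observed: int            # duration of observation reported by the observer
    item_scores: dict = field(default_factory=dict)  # {competency: [1-5 item scores]}

def competency_score(observations, competency):
    """Weighted mean of item scores for one competency across instruments."""
    num, den = 0.0, 0.0
    for obs in observations:
        scores = obs.item_scores.get(competency, [])
        if not scores:
            continue
        # Example weighting: more observed days and more completed items contribute
        # more; instrument type and observer role adjust the weight.
        w = (TYPE_WEIGHT[obs.instrument] * ROLE_WEIGHT[obs.role]
             * min(obs.days_observed, 10) * len(scores))
        num += w * (sum(scores) / len(scores))
        den += w
    return num / den if den else None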

Instruments were implemented using the Qualtrics™ platform, which provided point-of-care mobile access to the assessment instruments, allowed instrument selection to be based on the observer’s clinical role, and permitted item-level scores to trigger the presentation of follow-up items to allow the observer to further specify the nature of performance and make recommendations for improvement.
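As a rough sketch (not the actual Qualtrics configuration), the two delivery behaviors described above, role-based instrument selection and score-triggered follow-up items, can be expressed as simple routing logic; the role names and trigger threshold are assumptions.

```python
# Hypothetical sketch of the delivery logic described above; in the study this
# behavior was configured within the Qualtrics platform rather than hand-coded.
INSTRUMENT_BY_ROLE = {
    "faculty": "MSF-faculty",
    "resident": "MSF-resident",
    "nurse": "MSF-nurse",
    "other": "MSF-other",
}
FOLLOW_UP_THRESHOLD = 3  # assumed cut point on the 1-5 scale

def select_instrument(role: str) -> str:
    """Route the observer to the instrument version for their clinical role."""
    return INSTRUMENT_BY_ROLE.get(role, "MSF-other")

def follow_up_items(item_scores: dict) -> list:
    """Trigger unscored recommendation items when an item score is below threshold."""
    return [f"Recommend one specific improvement for: {item}"
            for item, score in item_scores.items()
            if score < FOLLOW_UP_THRESHOLD]
```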

Reports

Monthly feedback reports were designed with input from the content expert panel. These reports presented competency-level scores and feedback within three main content areas: (1) Clinical Care: Cognitive, Diagnostic, and Management Aspects; (2) Communication: Teamwork and Interpersonal Skills; and (3) Professionalism: Personal and Professional Development. Each section presented the associated competencies and the respective scores, responses to free text recommendation items, and other observer comments.

Implementation

MSF and SCO instruments were used in randomly selected inpatient rotations at eight geographically-dispersed residency programs during a four-month data collection period; individual programs determined which inpatient rotations would be used to collect PMAC assessment data. Typically, eligible observers for an inpatient rotation include at least four different faculty, two different (supervisory) residents, and varying numbers of nurses and other health professionals. All observers who observed the learner for at least two days were eligible to complete and submit an assessment. Each program (site) received orientation that included role-specific demonstration of instruments and details about expectations for the number and type of instruments to be completed by role. Programs provided additional orientation to key stakeholder groups to ensure that all observer and learner groups had sufficient orientation to the study. Observer names and email addresses were captured for analytic purposes; information about individual observers was not revealed to learners either during the observation period or in the feedback reports (see below). IRB approval was obtained for each site.

Each P1 data collection period was defined by the program based on rotation dates. Following each data collection period, programs received feedback reports for learners who had a minimum of four MSF observations. A minimum of two completed SCO instruments was required for SCO data to be included in the report. The initial rationale for specifying a minimum goal of four MSFs to generate a report was based on the desire to encourage participation and to have as much evidence as possible to generate competency-level scores. An interpretation guide that explained the report content in detail was distributed along with the reports.
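Stated as a rule, the reporting thresholds described above amount to the following check (a minimal sketch; the function and field names are illustrative):

```python
def report_eligibility(n_msf: int, n_sco: int) -> dict:
    """Apply the Phase 1 reporting thresholds described above."""
    return {
        "generate_report": n_msf >= 4,    # at least four completed MSFs required
        "include_sco_data": n_sco >= 2,   # SCO section included only with two or more SCOs
    }
```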

Analysis

After initial data collection, we analyzed items to investigate: (1) observers’ use of item scales (e.g. did they use the full scale?); (2) relationships between items within each competency; and (3) the relationship between competency-level items and global items. Multivariate generalizability analyses also were completed to investigate the extent to which the competency-level scores provided stable measures of learner performance and to provide a measure of the reliability of a composite score produced by averaging across competencies. Following initial review of the output by project staff, the content expert panel evaluated poorly performing items and made decisions about item revision, replacement, and deletion.

Results

During the Phase 1 data collection (February–June 2015), 165 learners received at least one completed instrument. In total, 873 MSFs and 500 SCOs were returned, averaging 5.3 MSFs and 3.0 SCOs per learner per rotation. Completion times ranged from 3 to 10 min for MSF and 4 to 6 min for SCO instruments.

Phase 1 data analysis identified eight scored MSF items and seven scored SCO items that did not function as intended. More detailed review of these items by staff and the content expert panel identified several ways in which they were problematic. For example, the correlations between items that were scored dichotomously (e.g. a learner either did or did not display a particular behavior) and the other items within the same competency were consistently low for both the MSF and the SCO instruments. Because this finding was consistent across all items of this type for both instruments, the response scale was viewed as problematic and was modified to broaden the response options. The analysis also called attention to items that functioned differently across programs. One such item evaluated the extent to which the resident included inter-professional team members in specific clinical activities, the expectation being that a higher-performing resident would do this more consistently than a lower-performing resident. Additional investigation revealed, however, that some programs do not expect residents to solicit this type of input from inter-professional team members; because the goal was to develop assessment instruments that could be used in a variety of programs with different structures, cultures, and systems, the item could not be used as originally written. Finally, the review identified items for which the wording of the item stem or options was unclear; these items were revised to clarify the language. Following the detailed item review, two MSF items and five SCO items were deleted. The remaining six MSF items were revised; of the two remaining SCO items, one was revised and one was changed from a scored to an unscored item.

Results of the generalizability analysis indicated a generalizability coefficient for a composite MSF score of 0.74 for an assessment based on five observations made by supervising residents; this value increased to 0.77 with six observations.
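For readers less familiar with generalizability theory, the dependence of the coefficient on the number of observations can be seen in the simplified single-facet form (the study itself used a multivariate design, so this expression is illustrative only):

E\rho^{2}(n') = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{\delta}/n'}

where \sigma^{2}_{p} is the learner (universe-score) variance, \sigma^{2}_{\delta} is the relative error variance for a single observation, and n' is the number of observations averaged; increasing n' shrinks the error term and raises the coefficient.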

Phase 2

Methods

Phase 1 was intended to provide insight into the feasibility of developing and administering a multi-site workplace-based assessment system. Phase 2 provided additional evidence about the logistical feasibility and practicality of the system along with more direct evidence about the utility of the resulting scores and comments to support decisions about a learner’s “Readiness to care for patients in the inpatient setting without a supervisor nearby”. To broadly evaluate the assessment system, we followed the framework of van der Vleuten’s model of utility (van der Vleuten Citation1996). This model considers numerous factors including educational impact, reliability, validity, acceptability, and cost. To provide an additional focus on aspects related to reliability and validity we used Messick’s framework which includes consideration of: construct validity and content representation; response process; relationship to other variables; internal structure; and consequences (Messick Citation1995).

Implementation

The revised versions of the instruments were tested in an eight-month data collection at 11 geographically-dispersed residency programs (four of which also had participated in P1). The study was IRB approved at each site.

Data collection periods again were defined by program rotation start and end dates. After each rotation, assessment results were distributed to interns and program directors in an end-of-rotation report for each learner. As in Phase 1, a minimum of four MSF instruments was required to generate a report, and at least two completed SCO instruments were required for the SCO data to be included. In addition to the monthly reports, summary reports that aggregated data for a learner across all rotations were given to program directors and Clinical Competency Committees (CCCs). For these summary reports, a learner’s rotation-level data were included if at least three MSF instruments were completed for that rotation; the reporting threshold was decreased from four to three for the summary reports to increase the number of learners for whom reports were generated.

The 1–5 score scale used for the PMAC items was designed so that the different score points clearly differentiated learners across the range of performance that content experts felt was typically seen in authentic clinical settings. To facilitate reporting of milestone levels to the ACGME based on PMAC scores, each score point on the 1–5 PMAC scale was translated to an ACGME Pediatrics Milestone reporting level for each competency. This translation was done by the same content expert panel that did the development work: panelists reviewed each item on the PMAC instruments to identify the relevant content, reviewed the Pediatrics Milestones levels for each competency, and, for each point on the score scale, recommended a correspondence between the PMAC score and the Pediatrics Milestone level. These recommendations were aggregated across panelists to arrive at the range of Pediatrics Milestone levels associated with each PMAC score.
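A minimal sketch of the aggregation step is shown below, with invented panelist recommendations for a single competency; it assumes the reported range is the minimum and maximum level recommended across panelists, which is one plausible reading of the process described.

```python
# Hypothetical panelist recommendations for one competency: for each PMAC score
# point (1-5), each panelist suggests a Pediatrics Milestone level. Values are
# invented for illustration.
panelist_recommendations = {
    1: [1.0, 1.5, 1.0],
    2: [2.0, 2.0, 2.5],
    3: [2.5, 3.0, 3.0],
    4: [3.5, 3.5, 4.0],
    5: [4.0, 4.5, 4.5],
}

def milestone_range(recommendations: dict) -> dict:
    """Aggregate across panelists into a (min, max) milestone range per PMAC score."""
    return {score: (min(levels), max(levels))
            for score, levels in recommendations.items()}

print(milestone_range(panelist_recommendations))
# {1: (1.0, 1.5), 2: (2.0, 2.5), 3: (2.5, 3.0), 4: (3.5, 4.0), 5: (4.0, 4.5)}
```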

The summary reports were provided to all participating programs for their use in informing both the ACGME-required Pediatrics Milestones reporting process and internal program progression decisions. For a subset of programs, the reports were used to further investigate the extent to which the data were useful for the purpose of informing milestone reporting decisions. For each PGY-1 learner, programs provided: (1) tentative milestone scores assigned by the CCC prior to receiving PMAC summary reports (if any); and (2) final milestone scores submitted to the ACGME. Resulting data were analyzed to evaluate the extent to which access to PMAC data was associated with differences between tentative and final milestone scores. In addition, programs responded to a survey about planned uses of PMAC data for purposes other than milestone reporting. Focus groups and surveys were used to investigate the extent to which participants viewed the PMAC assessment as valuable, useful, and overall worthwhile.

Analysis

Descriptive analyses were completed to provide insight into response process. Correlations were estimated to investigate the internal structure of the instruments. As with P1, multivariate generalizability analysis was used to investigate the extent to which the composite and competency-level scores provided stable measures of learner performance. Additional analyses focused on the CCC study data and addressed (1) the extent to which the PMAC data influenced milestone reporting decisions; and (2) the perceived utility of the PMAC assessment system.

Results

Construct validity and content representation

The Phase 1 process of instrument development and the post-Phase 1 instrument refinement, previously described, provided evidence to support arguments for construct validity.

Response process

During the Phase 2 data collection (October 2015–May 2016), 289 interns were assessed with the revised instruments. In total, 2181 instruments (1355 MSFs and 826 SCOs) were completed (an average of 4.7 MSF and three SCO per learner). To accurately calculate instrument counts for individual observers, nurse instruments that were submitted using a generic “nurse” email account (rather than an individual email address) were excluded; this reduced the MSF count to 1241 and resulted in 426 unique observers. Instrument counts by observer role are as follows: 207 faculty (1212 instruments), 179 residents (626 instruments), 32 nurses (168 instruments), and eight other health professionals (61 instruments). The majority of observers (150 faculty, 159 resident, 23 nurse, and five other) completed between 1 and 6 instruments; the maximum number of instruments completed by a single observer was 44 (faculty). A non-trivial number of observers (122 faculty, 103 residents, and one nurse) completed more than one instrument for the same learner. Instruments required between 4 and 8 min to complete.
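The exclusion of instruments submitted from a generic account before counting unique observers can be sketched as a simple data-cleaning step; the column names and example records below are hypothetical.

```python
import pandas as pd

# Hypothetical record layout; column names and values are assumptions for illustration.
records = pd.DataFrame({
    "observer_email": ["a@x.org", "nurse@unit.org", "b@x.org", "a@x.org"],
    "observer_role":  ["faculty", "nurse", "resident", "faculty"],
    "instrument":     ["MSF", "MSF", "SCO", "MSF"],
    "learner_id":     ["L1", "L1", "L2", "L2"],
})

GENERIC_ACCOUNTS = {"nurse@unit.org"}  # shared unit-level account, not an individual

# Keep only instruments attributable to an individual observer before counting.
individual = records[~records["observer_email"].isin(GENERIC_ACCOUNTS)]
n_unique_observers = individual["observer_email"].nunique()
instruments_per_observer = individual.groupby("observer_email").size()
```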

Internal structure

MSF inter-item correlations ranged from 0.34 to 0.81 and SCO correlations ranged from 0.49 to 0.82; all within-competency items displayed significant (p < 0.01) positive correlations. Additionally, all items were significantly correlated with the global item score from the same instrument: these correlations ranged from 0.22 to 0.67 for the MSF and from 0.50 to 0.56 for the SCO.

Generalizability analyses of the MSF instrument included 69 learners who had observations from at least two supervisory residents. (Because of the design of the analysis, examinees could only be included if they had more than one instrument completed.) The generalizability of the composite score was 0.79 for an assessment based on five observations and was 0.82 with six observations. This result suggests that the instrument revisions substantially improved the generalizability of the score. Similar levels of generalizability were observed when considering MSF instruments completed by faculty observers (0.84 for five observations; 0.86 for six observations, N = 137) and by nurses (0.80 for five observations and 0.83 for six observations, N = 41). The composite generalizability for the SCO was 0.48 with four observations (by residents or faculty) per learner.
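The analyses reported here used a multivariate generalizability design; as a simplified illustration of how projected coefficients for different numbers of observations are obtained, the sketch below estimates variance components for a single-facet learner-by-observer design on simulated data and projects the coefficient for five and six observations.

```python
import numpy as np

# Simplified single-facet (learner x observer) crossed G study on simulated data;
# the paper's analysis was multivariate, so this is illustrative only.
rng = np.random.default_rng(0)
true_level = rng.normal(3.5, 0.4, size=(30, 1))          # simulated learner effects
scores = true_level + rng.normal(0.0, 0.5, size=(30, 6))  # 30 learners x 6 observers

n_p, n_r = scores.shape
grand = scores.mean()
ss_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

var_p = max((ms_p - ms_res) / n_r, 0.0)  # learner (universe-score) variance
var_err = ms_res                         # relative error variance per observation

def g_coefficient(n_obs: int) -> float:
    """Projected generalizability for a mean over n_obs observations."""
    return var_p / (var_p + var_err / n_obs)

print(round(g_coefficient(5), 2), round(g_coefficient(6), 2))
```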

Relationship to other variables and consequences

Seven programs provided the ACGME milestones for their PGY-1 learners; six of these programs also provided tentative milestones assigned prior to CCC meetings (and prior to receiving PMAC summary reports). Sites reported that PMAC summary reports gave CCC members more confidence in their tentative milestone ratings. Final milestone scores averaged 0.13 lower (on a five-point scale) than tentative scores for learners who did not have a summary report but only 0.06 lower for learners who received the summary reports (interaction p < 0.003). We expected that providing CCCs with additional unique evidence about learner performance would lead them to incorporate this evidence in their decisions, producing a greater absolute difference between tentative and final milestones when the additional evidence was available than when it was not. We did not have any expectations about the direction of this effect – whether CCCs would uprate vs. downrate learners with PMAC reports more strongly than those without, on average. One possibility is that CCC discussion in general tends to slightly lower scores (consistent with other group merit review processes that overweight negative vs. positive evidence; Thorngate Citation2009) even though a group CCC may give learners “the benefit of the doubt”. Availability of better evidence reduces doubt, and thereby reduces this tendency to err upward.
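The paper does not specify the exact model behind the reported interaction; one simple way to test a timepoint-by-report interaction of this kind is sketched below, with hypothetical data and column names (a mixed model with a random effect for learner would be a natural refinement).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Minimal sketch of one possible analysis; data and column names are invented.
# Long format: one row per learner per timepoint (tentative vs final milestone).
df = pd.DataFrame({
    "learner":    ["L1", "L1", "L2", "L2", "L3", "L3", "L4", "L4"],
    "timepoint":  ["tentative", "final"] * 4,
    "has_report": [1, 1, 0, 0, 1, 1, 0, 0],
    "milestone":  [3.2, 3.15, 3.0, 2.85, 3.5, 3.45, 3.1, 2.95],
})

model = smf.ols("milestone ~ C(timepoint) * has_report", data=df).fit()
print(model.summary())  # the C(timepoint):has_report coefficient is the interaction of interest
```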

Of 13 Program Directors or Associate Program Directors who responded to the survey about planned uses of PMAC data, all reported that they planned to use the PMAC reports to assist with other institutional reporting requirements, nomination of residents for awards, and letters of recommendation for residents. In addition, 12 of the 13 reported that they planned to use the reports for providing progress reports to the American Board of Pediatrics (ABP) on marginal or unsatisfactory residents, verifying competence to sit the ABP certifying examination, and other administrative programmatic needs.

Acceptability and educational impact

Observers were able to use the assessment system and did so with reasonable frequency. We also employed focus groups and surveys to investigate the extent to which the system was perceived as valuable, useful, and overall worthwhile. Because the response rate for the survey was low, we consider the survey data anecdotal and discuss it together with the focus group responses. The majority of respondents in the focus groups and surveys reported that the assessment process was useful and provided valuable information about learners’ areas of strength and areas for improvement. (Additional comments from these sources are reported below in discussing potential limitations of the instruments.) It is clear that further evaluation of educational impact is an important long-term goal of this work.

Discussion

The healthcare profession has a duty to provide the public with physicians who are well trained and have demonstrated the necessary competence to provide safe and effective patient care. Physician competence continues to develop throughout the educational process. Accurately assessing the quality of a resident’s care, and aligning that assessment with a level of supervision that promotes both safe care and professional development, is therefore critical. These decisions should be informed by reliable evidence gathered in the authentic patient-care setting.

The present approach to developing a competency-based workplace assessment system that guides progression decisions about trainees is intended to balance the components of van der Vleuten’s equation for assessment utility: reliability, validity, acceptability, educational impact, and potentially cost (van der Vleuten Citation1996). This system was designed to provide data to inform decisions about a learner’s “Readiness to care for patients in the inpatient setting without a supervisor nearby”, and it was developed using a rigorous process to support the validity of decisions that would be made based on reported scores from direct observations. The data derived from these instruments demonstrated acceptable levels of reliability with four to six instruments (depending on observer role). Generalizability coefficients at or above 0.80 typically are considered acceptable for observational or performance-based assessments; in fact, these values exceed some of those reported in the literature for high-stakes performance assessments used in medical education (Clauser Citation2009). These results also compare favorably with Donnon et al.’s systematic review of MSF assessments, which found that an average of eight observers usually was necessary to achieve sufficient reliability (Donnon et al. Citation2014). The PMAC SCO (of rounds) instrument did not achieve this level of reliability, which is not surprising: SCO ratings are made after observation of a single clinical encounter and therefore are likely to be highly context specific, so SCOs would be expected to provide a full picture of a resident’s performance on rounds only after many observations across different rotations and conditions.

With respect to limitations of the system, program directors felt that faculty buy-in and effort were challenges; this was particularly true in programs without dedicated observers or administrative support for assessment and for programs in which PMAC tools were being tested alongside the program’s traditional tools rather than in place of them. These concerns are common to many workplace-based assessment efforts. In fact, even when programs have incentives or consequences associated with assessment completion, completion rates often are far lower than for the PMAC instruments (Warm et al. Citation2016). During Phase 2, over 400 unique observers completed nearly 2200 instruments for almost 300 learners, the majority of observers completed more than two instruments, and more than half – including nearly 60% of participating faculty – completed multiple observations of the same learner.

Some sites participated in both Phase 1 and Phase 2, so it is likely that some of the same observers participated in both phases of the assessment. Because no learner-participants were interns in both data collection phases, however, observer–learner pairs could not be duplicated across phases. In addition, repeated observations of the same learner by the same observer within a data collection phase were limited, both because only a subset of rotations was chosen for PMAC study at each institution and because observer assignments on those rotations were sporadic.

The PMAC instruments demonstrated educational impact beyond the purely psychometric value of improved assessment, both through participants’ statements and through observed effects on the milestones assigned by CCCs. Perhaps most importantly, interns reported that the quality, quantity, frequency, specificity, and timeliness of feedback improved with PMAC. Underscoring recent discussion in the literature of the value of qualitative data in assessment, narrative comments seemed to have the greatest value, particularly when organized around a clear reporting framework. Program directors felt the PMAC tools had greater specificity around areas of focus pertinent to learner development, and most CCC members felt that PMAC reports were valuable for providing useful feedback to interns who were struggling. CCC members also indicated that PMAC reports were helpful in reinforcing milestone assignments; this was confirmed by the pattern of tentative and final milestone assignments among learners with and without PMAC summary reports. Despite these encouraging findings, it will be important for future work to more systematically evaluate the full educational impact of these assessments.

This assessment system offers considerable value to pediatrics residency programs that seek to draw crucial inferences about their residents’ preparedness for inpatient services where a supervisor is not present. More importantly, this approach to developing a competency-based assessment system aligns with contemporary thinking about competency-based and workplace-based assessment, providing a model for other specialties to identify the decisions, items, and instruments that address the progression of critical skills needed to advance their learners along the continuum towards unsupervised practice.

Ethical approval

This study was reviewed and approved by each participating residency program’s Institutional Review Board and by the Institutional Review Board of the University of Illinois at Chicago.

Glossary

Competency-based decisions: Decisions, based on a competency or performance framework, to advance learners; to justify a learner’s or practitioner’s readiness or privilege to care for patients (i.e. to take on a level of responsibility with a particular level of supervision or independence); or to support claims about an individual practitioner or learner (e.g. one who may need additional training).

Acknowledgements

The following members of the PMAC Module 1 Study Group also meet the criteria for authorship of this paper and should be so indexed: Beatrice Boateng, Ann Burke, Su-Ting T. Li, Julia Shelburne, Teri L. Turner. Additional members of the group who should be indexed as collaborators on this work include Dorene Balmer, Vasu Bhavaraju, Kim Boland, Alan Chin, Sophia Goslings, Hilary Haftel, Nicola Orlov, Amanda Osta, Sara Multerer, Jeanine Ronan, Sahar Rooholamini, Rebecca Tenney-Soeiro, Rebecca Wallihan, and Anna Weiss.

Disclosure statement

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

Additional information

Funding

This study was funded by the National Board of Medical Examiners and the American Board of Pediatrics. Authors Margolis and Clauser are employees of the National Board of Medical Examiners. Author Carraccio is an employee of the American Board of Pediatrics. Authors Hicks and Schwartz were supported in part by contracts between the funders and their respective institutions.

Notes on contributors

Patricia J. Hicks

Patricia J. Hicks, MD, MHPE, is Professor of Clinical Pediatrics in the Perelman School of Medicine at the University of Pennsylvania and Director, Pediatrics Milestones Assessment Collaborative, a joint project of the American Board of Pediatrics, the Association of Pediatric Program Directors and the National Board of Medical Examiners©.

Melissa J. Margolis

Melissa J. Margolis, PhD, is Senior Measurement Scientist at the National Board of Medical Examiners (NBME). Her work over the past 20 years has focused on workplace-based assessment, assessment and instrument design, validity, automated scoring of complex performance tasks, and standard setting.

Carol L. Carraccio

Carol L. Carraccio, MD, MA, is Vice President of Competency-based Assessment at the American Board of Pediatrics where she leads many national research projects in assessment.

Brian E. Clauser

Brian E. Clauser, EdD, is Vice-President for the Center for Advanced Assessment at the National Board of Medical Examiners. His research interests include automated scoring of simulations, validity theory, standard setting, applications of generalizability theory, and workplace-based assessment.

Kathleen Donnelly

Kathleen Donnelly, MD, is a critical care attending and Director of the Inova Children’s Hospital Pediatric Residency Program.

H. Barrett Fromme

H. Barrett Fromme, MD, MHPE, is an Associate Professor of Pediatrics at the University of Chicago Pritzker School of Medicine where she is the Associate Pediatric Residency Program Director and the Director of Faculty Development in Medical Education.

Kimberly A. Gifford

Kimberly A. Gifford, MD, is assistant professor of pediatrics and primary care faculty at Geisel School of Medicine at Dartmouth and Director of Competency-based Education at the Children’s Hospital at Dartmouth-Hitchcock.

Sue E. Poynter

Sue E. Poynter, MD, MEd, is Associate Professor of Pediatrics and Director of the Pediatric Residency Program at Cincinnati Children’s/University of Cincinnati where she is also a critical care attending and Medical Director, Division of Respiratory Care.

Daniel J. Schumacher

Daniel J. Schumacher, MD, MEd, is Assistant Professor of Pediatrics and Pediatric Emergency Medicine physician at Cincinnati Children’s Hospital/University of Cincinnati and PhD Candidate at Maastricht University School of Health Professions Education in Maastricht, The Netherlands. His research efforts focus on entrustment and the association between learner and patient outcomes.

Alan Schwartz

Alan Schwartz, PhD, is the Michael Reese Endowed Professor of Medical Education, Associate Head and Director of Research in the Department of Medical Education at the University of Illinois at Chicago. He is also Director of the Association of Pediatric Program Directors Longitudinal Educational Assessment Research Network.

References

  • ACGME, ABP. 2012. ACGME-selected Pediatrics Milestones for reporting. [accessed 2018 Sep 4]. https://www.acgme.org/acgmeweb/Portals/0/PDFs/Milestones/PediatricsMilestones.pdf
  • Archer JC. 2010. State of the science in health professional education: effective feedback. Med Educ. 44:101–108.
  • Benson BJ, Burke A, Carraccio C, Englander R, Guralnick S, Hicks PJ, Ludwig S, Schumacher DJ. 2014. The Pediatrics Milestones project. [accessed 2017 May 1]. https://www.abp.org/sites/abp/files/pdf/milestones.pdf
  • Carraccio C, Englander R, Holmboe ES, Kogan JR. 2016. Driving care quality: aligning trainee assessment and supervision through practical application of entrustable professional activities, competencies, and milestones. Acad Med. 91:199–203.
  • Clauser B, Balog K, Harik P, Mee J, Kahraman N. 2009. A multivariate generalizability analysis of history-taking and physical-examination scores from the USMLE Step 2 Clinical Skills Examination. Acad Med. 84:S86–S89.
  • Donnon T, Al Ansari A, Al Alawi S, Violato C. 2014. The reliability, validity, and feasibility of multisource feedback physician assessment: a systematic review. Acad Med. 89:511–516.
  • Driessen E, Scheele F. 2013. What is wrong with assessment in postgraduate training? Lessons from clinical practice and educational research. Med Teach. 35:569–574.
  • Hicks PJ. 2011. The role of supervision in creating responsible and competent physicians. Acad Pediatr. 11:9–10.
  • Hicks PJ, Margolis M, Poynter SE, Chaffinch C, Tenney-Soeiro R, Turner TL, Waggoner-Fountain L, Lockridge R, Clyman SG, Schwartz A. 2016. The pediatrics milestones assessment pilot: development of workplace-based assessment content, instruments, and processes. Acad Med. 91:701–709.
  • Hicks PJ, Schumacher DJ, Benson BJ, Burke AE, Englander R, Guralnick S, Ludwig S, Carraccio C. 2010. The pediatrics milestones: conceptual framework, guiding principles, and approach to development. JGME. 2:410–418.
  • Holmboe ES, Edgar L, Hamstra S. 2016. The milestones guidebook. [accessed 2016 Aug 21]. http://www.acgme.org/Portals/0/MilestonesGuidebook.pdf
  • Kogan JR, Conforti LN, Iobst WF, Holmboe ES. 2014. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 89:721–727.
  • Lockyer J. 2003. Multisource feedback in the assessment of physician competencies. J Contin Educ Health Prof. 23:4–12.
  • Messick S. 1995. The interplay of evidence and consequences in the validation of performance assessments. Educ Res. 23:13–23.
  • Nasca TJ, Philibert I, Brigham T, Flynn T. 2012. The next GME accreditation system – rationale and benefits. N Engl J Med. 366:1051–1056.
  • Sargeant J, Mann K, Ferrier S. 2005. Exploring family physicians’ reactions to multisource feedback: perceptions of credibility and usefulness. Med Educ. 39:497–504.
  • Sargeant J, Mann K, Sinclair D, Van der Vleuten C, Metsemakers J. 2006. Understanding the influence of emotions and reflection upon multi-source feedback acceptance and use. Adv Health Sci Educ. 13:275–288.
  • Swing SR. 2007. The ACGME outcome project: retrospective and prospective. Med Teach. 29:648–654.
  • Swing SR, Beeson MS, Carraccio C, Coburn M, Iobst W, Selden NR, Stern PJ, Vydareny K. 2013. Educational milestone development in the first 7 specialties to enter the next accreditation system. JGME. 5:98–106.
  • Swing SR, Clyman SG, Holmboe ES, Williams RG. 2009. Advancing resident assessment in graduate medical education. JGME. 1:278–286.
  • Thorngate W, Dawes RM, Foddy M. 2009. Judging merit. 1st ed. New York (NY): Psychology Press. Chapter 3, Lessons from clinical research; p. 27–31.
  • Turner TL, Bhavaraju VL, Luciw-Dubas UA, Hicks PJ, Multerer S, Osta A, McDonnell J, Poynter SE, Schumacher DJ, Tenney-Soeiro R, et al. 2017. Validity evidence from ratings of pediatric interns and subinterns on a subset of pediatric milestones. Acad Med. 92:809–819.
  • van der Vleuten CPM. 1996. The assessment of professional competence: developments, research and practical implications. Adv Health Sci Educ. 1:41–67.
  • Warm EJ, Held JD, Hellmann M, Kelleher M, Kinnear B, Lee C, O’Toole JK, Mathis B, Mueller C, Sall D, Tolentino J, Schauer DP. 2016. Entrusting observable practice activities and milestones over the 36 months of an internal medicine residency. Acad Med. 91:1398–1405.