Research Methodology

A Proposal for Evaluating the Validity of Holistic-Based Admission Processes

Pages 103-107 | Published online: 18 Jan 2013

Abstract

Background. Admission decisions require that information about an applicant be combined using either holistic (human judges) or statistical (actuarial) methods. For optimizing a defined measurable outcome, there is a consistent body of research evidence demonstrating that statistical methods yield superior decisions compared to those generated by judges. It is possible, however, that the benefits of holistic decisions are reflected in unmeasured outcomes. If such benefits exist, they would necessarily appear as systematic variance in raters’ scores that deviate from statistically based decisions. Purpose. To estimate this variance, we propose a design examining the interrater reliability of difference scores (i.e., the difference between observed committee rankings and rankings based on statistical approaches). Methods. Example calculations and G study models are presented to demonstrate how rater agreement on difference scores can be analyzed under various circumstances. High interrater reliability of difference scores would support, but not prove, the assertion that the holistic process adds useful information beyond that achieved by much less costly statistical approaches. Conversely, if the interrater reliability of difference scores is near zero, this would clearly demonstrate that committee judgments add random error to the decision process. Results. Evidence needed to conduct such studies already exists within most highly selective medical schools and graduate programs, and the proposed validity research could be conducted on existing data. Conclusions. Such research evidence is critical for establishing the validity of widely used holistic admission approaches.

INTRODUCTION

Over the last 50 years, many studies have compared decisions using statistical (actuarial) approaches with holistic decisions made using human judges.Citation 1–Citation 4 Holistic decisions are based on a rater or judge's overall subjective evaluation of multiple sources of information, whereas decisions using statistical approaches mathematically weight each source of information to derive a composite score for ranking applicants. Although the research evidence consistently favors the statistical approach, many continue to prefer intuitive holistic-based methods that employ human judges.Citation 5 As an example of the widespread preference for the holistic approach, the Association of American Medical Colleges (AAMC) has recently begun to promote “holistic” approaches for making admission decisions, and the U.S. Supreme CourtCitation 6, Citation 7 suggests that “holistic” admission is more likely to be fair and legally defensible compared to approaches that rely on mathematical formulas.Citation 8, Citation 9 Given that holistic approaches are widely advocated, the validity of the holistic review deserves further research to determine whether there might be some previously undiscovered psychometric merit to holistic-based selection.

Although research studies in medical education have extensively examined the reliability and validity of undergraduate grade point average (GPA), the Medical College Admission Test (MCAT), and the preadmission interview, there has been little reporting on the reliability and validity of the methods used to make the final decision to admit or reject an applicant.Citation 10–Citation 13 For example, research has not closely examined which techniques work best for combining information about applicants to highly competitive professional training programs such as medicine. Although some measures used in the admissions process have accumulated solid research evidence affirming an acceptable level of reliability and validity, these desirable psychometric attributes can be easily compromised or lost if the measures are inefficiently combined during the decision process.Citation 14

Admission decisions always require the use of either a statistical (actuarial) formula or holistic judgments to summarize and combine information about an applicant. Ultimately, the reliability and validity of the final admission decision is the outcome of paramount importance for determining the success of the selection process. When using statistical methods to combine information, it is important to understand how various techniques for weighting information impact the reliability and validity of the final composite score.Citation 15, Citation 16 On the other hand, if the decision process uses human judges in a holistic review, it is important to consider how rater agreement impacts the reliability and validity of the final decision.Citation 3, Citation 17, Citation 18

This study uses medical school admissions as an example of a highly competitive selection procedure. A recent AAMC/MCAT-sponsored survey of admission offices shows that the most common practice at U.S. medical colleges is to use an admissions committee to holistically combine applicant information.Citation 19 Although applicant pools are narrowed by approximately half using various statistical approaches that utilize both quantitative and qualitative applicant information (e.g., academic aptitude, academic achievement, state residency status, and other demographic variables) to determine who will be granted a preadmission interview, the vast majority of U.S. medical schools employ an admissions committee to holistically review and evaluate all interviewed applicants.Citation 19

Despite the fact that the holistic review is promoted by both the AAMC and the U.S. Supreme Court, there exists little psychometric evidence regarding an admission committee's contribution to the admissions process.Citation 8, Citation 14, Citation 20, Citation 21 A new research methodology is offered here for evaluating the contribution of admission committee decisions.

BACKGROUND

Service on an admission committee requires a substantial time commitment from its members. At medical colleges in the United States, it is estimated that approximately 74,000 medical school applicants are offered a preadmission interview and campus visit each year.Citation 19 It is these interviewed applicants that admission committees review to determine who will and who will not be offered entry to the medical school. Although published research does not provide an estimate of the average amount of time committee members devote to reviewing an applicant's file, based upon the quantity of information contained within these files, it seems reasonable to suggest that a committee member would need a minimum of 20 min to carefully read and consider a single file. As medical school committees review an average of approximately 525 applicants per year, usually with multiple reviewers per applicant, the total time spent reviewing these files amounts to more than 1,000 hr at many U.S. medical colleges. Given that admission committees are typically composed of the medical college's most highly paid faculty and staff, labor costs for these reviews can be quite substantial.
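For illustration, a rough calculation shows how such a total accumulates; the assumption of roughly six reviews per file is ours and is not a figure reported in the cited survey:

```latex
% Rough, illustrative calculation (six reviews per file is an assumed figure)
525 \;\text{applicants} \times 6 \;\text{reviews per file} \times 20 \;\text{min per review}
  = 63{,}000 \;\text{min} \approx 1{,}050 \;\text{hr}
```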

Although applicant review is an expensive process, the importance of understanding the committee's performance is highlighted not just by its costs but also by the consequences of its decisions. For example, medical education's low attrition rate (< 3%) implies an admission committee's decisions are practically the equivalent of deciding who will become tomorrow's physicians.Citation 20, Citation 21 Although it is clear based on both cost and consequence that an evaluation of committee performance is important, it is less clear on what dimensions an evaluation should be based.

One possible approach to evaluating committee performance might be to compare committee reviewers’ decisions with statistically optimal decisions, where “optimal” is defined as the decision rule that maximizes class performance on one or more of the important outcomes of medical education (e.g., medical school grades and/or United States Medical Licensing Examination [USMLE] scores). Previous research examining the medical school admissions process has shown that it is possible to generate statistically derived equations to achieve a maximum predicted class performance as defined by a measurable medical school outcome.Citation 14, Citation 22 Once one has obtained statistically optimal applicant rankings, it is a relatively easy and objective process to compare statistical rankings with committee-based rankings.

As previously mentioned, for cases involving the optimization of a measurable outcome, there is a large and consistent body of research demonstrating that statistically derived formulas yield superior decisions compared to those generated by human judges.Citation 3, Citation 4, Citation 9 Because of these well-established research findings, applying this same research approach to evaluate the decisions of an admissions committee is unlikely to reveal new validity evidence related to holistic review. Given this, another approach to validation is needed.

It seems one could reasonably assert that the primary benefit of using an admissions committee is not in predicting who will score highest on the USMLE, but rather from the committee's ability to subjectively consider and evaluate nonquantifiable aspects of the applicants. Hence, a reasonable validity argument for using an admissions committee is that human judges are better able to use qualitative information to uniquely weigh each applicant's attributes. In fact, this argument for holistic review may be defensible, as there are almost certainly important applicant attributes and medical education outcomes that are not measured, and past comparative research has used only measurable applicant attributes and measurable outcomes. A validity approach that simply evaluates quantified and coded applicant data and measured outcomes may exclude key dependent and independent variables that are required to conduct a complete validity study of committee decisions.

To more comprehensively evaluate the validity of an admission committee's rankings or decisions, it is necessary to acknowledge that the committee's contribution might not be reflected in measures that are currently collected by the medical college. For instance, characteristics related to social commitment and/or the ethical aspects of a medical class would be quite difficult to measure and are seldom available in a reliable fashion. It could be argued that an admission committee's contribution may well reside in its ability to evaluate an applicant's life circumstances from the biographical statement or letters of reference and to use that information holistically, weighing the MCAT, GPA, and/or interview score in an applicant-specific way to produce a more insightful assessment of the applicant's character. If this view of the committee's contribution is correct, it will be reflected in agreement among committee members’ judgments where those judgments deviate from the statistically based judgments that could alternatively have been used to make the final admission decision.

THE VALIDITY ARGUMENT

This study design examines whether differences between statistically based rankings and committee-based rankings might reasonably be explained by meaningful characteristics possessed by the applicant. Specifically, if the interrater reliability of difference scores (DiffSco) [DiffSco = (statistical ranking) − (committee member ranking)] is high, this would support, but not prove, the assertion that committee reviewers add useful information beyond that achieved by simple statistical approaches that could alternatively have been used to make the final admission decision. On the other hand, if the interrater reliability of difference scores is near zero, this result would clearly demonstrate that committee judgments add random error to the decision process.

The design proposed here uses the interrater reliability of modified judges’ rankings of medical school applicants, where the modified rankings reflect the difference between observed committee rankings and rankings based on simple statistical approaches. For the purposes of providing a simple example, two statistically based approaches applied within the context of a medical school are described. Each statistical formula uses just three quantifiable independent variables: the MCAT score (MCAT), undergraduate science GPA (USGPA), and interview score (IS). The first statistical formula maximizes predicted USMLE scores (the dependent variable), and the second statistical formula describes the average effective weight committee members have previously placed on the three independent variables (MCAT, USGPA, IS). In actual application, all variables in the applicant file that can be quantified or categorized could be used.

If rater error accounts for the variance not explained by statistically based rankings, the conclusion is that, after removing the descriptive or predictive variance, a particular admission committee's decision is entirely dependent on which sample of committee members is assigned to rate a particular file. Stated another way, low or zero interrater reliability of observed difference scores would indicate that each committee member has an idiosyncratic interpretative system and that committees add random rater error to the decision process.

Modern validity theory suggests that the interpretation of any score or judgment not only must seek evidence to support a particular interpretation but also must examine evidence that could potentially refute that interpretation.Citation 23, Citation 24 If the value of committee member judgments is viewed as related to their ability to holistically evaluate applicants and uniquely weigh all information within the admissions file to achieve a decision that is superior to statistically based decisions, this contribution can be estimated as the degree to which raters agree on rankings or decisions that deviate from statistical rankings or decisions. If rater agreement on these difference scores is zero, then committee judgments must be regarded as adding random rater error to the admissions decision. On the other hand, appreciable committee agreement in this context would support, but not prove, the assertion that committees add important decision information compared to decisions using only statistical methods.

VALIDITY STUDY METHODS

The specific methods needed to conduct a validity study using difference scores will vary depending on the committee procedures used at each institution. To provide a concrete example of the methods, the analytic steps needed at a hypothetical medical college are recounted here. To implement the validity study, two formula-based admission equations can be computed. For example, the first (Equation 1 – Predictive Equation) is based on optimizing the mean standardized USMLE score with regression weighting of MCAT, USGPA, and IS. The second equation (Equation 2 – Descriptive Equation), based on the same three independent variables described in Equation 1, describes past committee decisions at the institution. It represents the collective average effective weight judges have historically applied to MCAT, USGPA, and IS to generate an average committee ranking (the dependent variable). In Equation 1 and Equation 2, X is simply a scaling constant and the betas (β) are regression-based weights.
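The equations themselves are not reproduced here. A plausible reconstruction of their general form, based only on the description above (the linear form and the symbol choices are assumptions), is:

```latex
% Equation 1 (Predictive): weights chosen to optimize predicted USMLE performance
\widehat{\text{USMLE}} = X + \beta_1(\text{MCAT}) + \beta_2(\text{USGPA}) + \beta_3(\text{IS})

% Equation 2 (Descriptive): weights that reproduce past average committee rankings
\widehat{\text{Committee ranking}} = X + \beta_1(\text{MCAT}) + \beta_2(\text{USGPA}) + \beta_3(\text{IS})

% Note: the fitted values of X and the betas differ between the two equations,
% because they are estimated against different dependent variables.
```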

Results from Equation 1 and Equation 2 can be used to generate the difference scores used in the validity research design. Figure 1 graphically displays how applicant rankings by a committee of raters can be subtracted from the statistical ranking to derive difference scores for each applicant. To demonstrate how difference scores can be derived, assume three committee members each rate an applicant's file. The three difference scores can be calculated by simply subtracting each committee member's ranking from the ranking indicated by the predictive or descriptive equation [DiffSco = (statistical ranking) − (committee member ranking)].
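A minimal sketch of these data steps, assuming the quantified predictors and past average committee rankings are available as arrays (all variable names and numbers below are illustrative, not taken from the article):

```python
# Minimal, illustrative sketch of the data steps shown in Figure 1.
import numpy as np

# Historical data used to fit the descriptive equation (Equation 2):
# columns are MCAT, USGPA, and IS; one row per previously reviewed applicant.
X_hist = np.array([[30, 3.6, 7.0],
                   [34, 3.9, 8.5],
                   [27, 3.2, 6.0],
                   [32, 3.7, 9.0],
                   [28, 3.4, 6.5]], dtype=float)
avg_committee_rank = np.array([3.0, 1.0, 5.0, 2.0, 4.0])  # past average committee rankings

# Least-squares fit: ranking ~ X + b1*MCAT + b2*USGPA + b3*IS
design = np.column_stack([np.ones(len(X_hist)), X_hist])
coeffs, *_ = np.linalg.lstsq(design, avg_committee_rank, rcond=None)

# Apply the fitted equation to the current pool and convert predictions to rankings
# (rank 1 = the applicant the equation places first).
X_new = np.array([[33, 3.8, 8.0],
                  [29, 3.4, 7.5],
                  [31, 3.5, 6.5]], dtype=float)
pred = np.column_stack([np.ones(len(X_new)), X_new]) @ coeffs
stat_rank = pred.argsort().argsort() + 1

# Difference scores: DiffSco = (statistical ranking) - (committee member ranking)
committee_ranks = np.array([[1, 2, 1],   # rows = applicants, columns = raters
                            [3, 3, 2],
                            [2, 1, 3]])
diff_scores = stat_rank[:, None] - committee_ranks
print(diff_scores)
```

The same subtraction applies when the rankings come from the predictive equation; only the fitted weights change.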

FIG. 1 Description of the sequential data steps used to derive difference scores for each applicant. Note. MCAT = Medical College Admission Test; GPA = grade point average.


Each rater of an applicant can award a ranking consistent with, above, or below the statistical ranking. The difference scores are simply the number of positions above or below the statistical ranking. A positive number indicates the number of positions by which the statistically based method ranked the applicant above the committee member's ranking, a zero difference score indicates a ranking consistent with the statistical ranking, and a negative difference score indicates the degree to which the statistical ranking was lower than the committee member's ranking. A generalizability study can be conducted on difference scores like those shown in the right-hand column of Figure 1.

When a different subset of committee members rates each applicant, the difference scores can then be entered into a simple random rater (r)-nested-within-applicant (a) [r : a] generalizability study to determine the level of agreement between raters. Or, if each applicant is rated by all the committee members, a rater-crossed-with-applicant [r × a] random model can be employed.
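A minimal sketch of the nested [r : a] analysis for a balanced design (the same number of raters per file), using the standard single-facet nested G study decomposition; the data and function names are illustrative:

```python
# Minimal, illustrative G study for the rater-nested-within-applicant [r : a] design.
# Assumes a balanced design: every applicant is rated by the same number of raters.
import numpy as np

def g_study_nested(scores, n_raters_decision=None):
    """scores: applicants x raters matrix (e.g., difference scores).
    Returns (applicant variance, rater-within-applicant variance, G coefficient)."""
    n_a, n_r = scores.shape
    grand_mean = scores.mean()
    applicant_means = scores.mean(axis=1)

    # Mean squares for the single-facet nested design
    ms_a = n_r * np.sum((applicant_means - grand_mean) ** 2) / (n_a - 1)
    ms_r_in_a = np.sum((scores - applicant_means[:, None]) ** 2) / (n_a * (n_r - 1))

    # Expected-mean-square solutions for the variance components
    var_r_in_a = ms_r_in_a
    var_a = max((ms_a - ms_r_in_a) / n_r, 0.0)

    # Generalizability coefficient for a decision based on n' raters per file
    n_prime = n_raters_decision or n_r
    denom = var_a + var_r_in_a / n_prime
    g = var_a / denom if denom > 0 else 0.0
    return var_a, var_r_in_a, g

# Illustrative difference scores: 5 applicants, 3 raters per file
diff = np.array([[ 4,  6,  5],
                 [-2, -1, -3],
                 [ 0,  1, -1],
                 [ 7,  8,  6],
                 [-5, -4, -6]], dtype=float)
var_a, var_ra, g = g_study_nested(diff)
print(f"applicant variance = {var_a:.2f}, rater:applicant variance = {var_ra:.2f}, G = {g:.2f}")
```

Running the same analysis on the raw rankings and on the difference scores supports the Step 1 versus Step 4 comparison described next.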

Figure 2 graphically displays the data analytic steps for the validity study. As shown, a simple random G study model can also be applied to the raw ranking scores to evaluate rater agreement using raw data. The G study of the raw data (Step 1 in Figure 2) will convey the absolute level of committee agreement. A comparison with the G study in Step 4 of Figure 2 will provide an estimate of the “value added” by the committee process.

FIG. 2 Steps in the analysis. Note. MCAT = Medical College Admission Test; USGPA = undergrad science grade point average; IS = interview scores.


It should be noted that the G study model can be modified depending on a particular institution's procedures. For example, some institutions intentionally introduce variability into admission committee decisions by using stratified sampling from different committee rater populations or categories (e.g., faculty, community members, and students). In this case, a fixed facet would need to be added to the G study design. When raters are sampled from categories (c), the G study model to estimate rater agreement of difference scores is a mixed-model design with raters nested within category and applicant, and category crossed with applicant [r : (c × a)]. If all raters review all files, a rater-nested-within-category, crossed-with-applicant [(r : c) × a] design can be used instead.

SIGNIFICANCE OF RESULTS

As discussed, it is important for institutions to evaluate the validity and reliability of admission committee decisions. The AAMC specifically recommends “identifying what in the holistic review admissions process is working and what is not, as well as the location of the impediments.”Citation 8 The validity design reported here will provide very useful evidence about whether holistic decisions enhance the decision process. Although it is possible that some admissions personnel may resist conducting validity studies to avoid the possibility of negative findings, electing not to conduct validity research is clearly damaging to the trust we place in medical education and medicine. Evaluations like the one described in this article can be conducted on data that already exist at most medical schools, and such studies are crucial to establishing the validity of the procedures we use to grant admission to medical school. Whether or not medical school admission policymakers value the results of the proposed research, its practical importance cannot be dismissed. Medical school admission policy governs very high-stakes decisions, and to intentionally use inefficient methods to make those decisions is unethical and unscientific.

REFERENCES

  • Meehl, P.E. 1954. Clinical vs. statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis, MN: University of Minnesota Press.
  • Schofield, W., and Garrard, J. 1975. Longitudinal study of medical students selected for admission to medical school by actuarial and committee methods. British Journal of Medical Education, 9:86–90.
  • Dawes, R.M., Faust, D., and Meehl, P.E. 1989. Clinical versus actuarial judgment. Science, 243:1668–74.
  • Grove, W.M., and Meehl, P.E. 1996. Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2:293–323.
  • Kleinmuntz, B. 1990. Why we still use our heads instead of formulas: Toward an integrative approach. Psychological Bulletin, 107:178–200.
  • Gratz v. Bollinger. 2003. 539 U.S. 244.
  • Grutter v. Bollinger. 2003. 539 U.S. 306.
  • Association of American Medical Colleges. 2010. Roadmap to diversity: Integrating holistic review practices into medical school admission processes. http://www.cossa.org/diversity/reports/Integrating_Holistic_Review_Practices.pdf. Accessed July 26, 2011.
  • McGaghie, W.C., and Kreiter, C.D. 2005. Holistic versus actuarial student selection. Teaching and Learning in Medicine, 17:89–91.
  • Kulatunga-Moruzi, C., and Norman, G.R. 2002. Validity of admission measures in predicting performance outcomes: The contribution of cognitive and non-cognitive dimensions. Teaching and Learning in Medicine, 14:34–42.
  • Kreiter, C.D., and Kreiter, Y. 2007. A validity generalization perspective on the ability of undergraduate GPA and the Medical College Admission Test to predict important outcomes. Teaching and Learning in Medicine, 19:95–100.
  • Kreiter, C.D., Yin, P., Solow, C.M., and Brennan, R.L. 2004. Investigating the reliability of the medical school admission interview. Advances in Health Science Education, 9:147–59.
  • Callahan, C.A., Hojat, M., Veloski, J., Erdmann, J.B., and Gonnella, J.S. 2010. The predictive validity of three versions of the MCAT in relation to performance in medical school, residency, and licensing examinations: A longitudinal study of 36 classes of Jefferson Medical College. Academic Medicine, 85:980–7.
  • Kreiter, C.D. 2006. A commentary on the use of cut-scores to increase the emphasis on non-cognitive variables in medical school admissions. Advances in Health Science Education, 12:315–9.
  • Wainer, H., and Thissen, D. 2001. “True score theory: The traditional method.” In Test scoring, edited by D. Thissen and H. Wainer, 34–52. Mahwah, NJ: Erlbaum.
  • Kane, M., and Case, S.M. 2004. The reliability and validity of weighted composite scores. Applied Measurement in Education, 17:221–40.
  • Stemler, S.E. 2004. A comparison of consensus, consistency, and measurement approaches to estimating inter-rater reliability. Practical Assessment Research and Evaluation, 9:1–29.
  • Grove, W.M. 2005. Clinical versus statistical prediction: The contribution of Paul E. Meehl. Journal of Clinical Psychology, 61:1233–43.
  • Association of American Medical Colleges. 2008. Admissions policies and practices. Presentation at the first meeting of the 5th Comprehensive Review of the MCAT, Washington, DC.
  • McGaghie, W.C. 2002. “Student selection.” In International handbook of research in medical education, edited by G. Norman, C. van der Vleuten, and D. Newble, 303–35. Dordrecht, the Netherlands: Kluwer.
  • Barzansky, B., Jonas, H.S., and Etzel, S.I. 1999. Educational programs in US medical schools, 1989–1999. Journal of the American Medical Association, 282:840–6.
  • Kreiter, C.D. 2002. The use of constrained optimization to facilitate admissions decisions. Academic Medicine, 77:148–51.
  • Kane, M.T. 2006. “Validation.” In Educational measurement, 4th ed., edited by R.L. Brennan, 17–64. New York, NY: American Council on Education and Greenwood.
  • Messick, S. 1995. Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50:741–9.
