
Using personal statements in college admissions: An investigation of gender bias and the effects of increased structure

Pages 5-20 | Received 13 Feb 2021, Accepted 19 Nov 2021, Published online: 27 Jan 2022

Abstract

Personal statements are among the most commonly used instruments in college admissions procedures. Yet, little research exists on their reliability, validity, and fairness. The first aim of this paper was to investigate hypotheses about adverse impact and underprediction for female applicants, which could result from a lower tendency to use agentic language compared to male applicants. Second, we examined whether rating personal statements in a more structured manner would increase reliability and validity. Using personal statements (250 words) from a large cohort of applicants to an undergraduate psychology program at a Dutch university, we found no evidence of adverse impact for female applicants or of more agentic language use by male applicants, and no relationship between agentic language use and personal statement ratings. In contrast, we found that personal statements of female applicants were rated slightly more positively than those of male applicants. Exploratory analyses suggest that female applicants’ better writing skills might explain this difference. A more structured approach to rating personal statements yielded higher, but still only ‘moderate’, inter-rater reliability, and virtually identical, negligible predictive validity for first year GPA and dropout.

Personal statements are among the most commonly used sources to gather information about applicants in college admissions procedures for undergraduate and graduate programs (Clinedinst, 2019; Klieger et al., 2017; Woo et al., 2020). They are often used to collect information about motivation to study in a particular field, goals and interests, strengths and weaknesses, and writing skills (Kuncel et al., 2020; Kyllonen et al., 2005). A small meta-analysis (k = 8–10) by Murphy et al. (2009) showed that ratings of personal statements were poor predictors of academic achievement, with r = .13 for GPA and r = .09 for faculty performance ratings, and had no incremental validity (ΔR2 = .002) over admission test scores and prior grades. A plausible explanation is that personal statements are typically highly unstructured (Kuncel et al., 2020; Woo et al., 2020), both in terms of the instructions applicants receive on what to include in their statement and in how the statements are judged or rated, if they are formally rated at all. This lack of structure likely results in low reliability, low construct validity, and hence, low predictive validity.

Potential gender bias in personal statement ratings

Aside from low validity, there are concerns about possible biases resulting from using personal statements in admissions. One major concern is the possibility of gender bias (Woo et al., 2020). Some studies showed that male applicants tend to use more agentic and self-promotional language than female applicants (Babal et al., 2019; Osman et al., 2015; Ostapenko et al., 2018), which, while not previously tested, is expected to result in higher personal statement ratings for male applicants. Moreover, assuming that the tendency to use agentic language is unrelated to academic achievement, ratings on personal statements could subsequently result in overpredicting male performance and underpredicting female performance.

The aim of this study was twofold. First, we aimed to investigate gender differences and predictive gender bias in ratings of personal statements. Since males have been found to use more self-promotional and agentic language, we expected that:

H1: Male applicants obtain higher personal statement ratings than female applicants.

H2: Male applicants use more agentic language in their personal statements than female applicants.

Since female students tend to perform better academically (Voyer & Voyer, 2014) and we expect higher personal statement ratings for male applicants, we also expected:

H3: Ratings on personal statements result in overprediction of male academic achievement and underprediction of female academic achievement.

Since we hypothesize that using agentic language explains the differences in personal statement ratings, we expected:

H4: Controlling for the use of agentic language reduces gender-based differential prediction.

Structure in personal statement ratings

Second, we investigated whether evaluating personal statements in a more structured way improved the reliability and predictive validity of personal statement ratings, as previously suggested (Kuncel et al., 2020; Murphy et al., 2009; Woo et al., 2020). To that end, we compared general, impressionistic ratings and ratings based on anchored rating scales on several dimensions, both made for the same personal statements. We have the following hypotheses:

H5: Single, impressionistic ratings of personal statements are unreliable and poor predictors of academic achievement.

H6: Using more structured ratings of personal statements will lead to improvements in reliability and predictive validity.

In previous studies, predictive validity was mainly examined for GPA. Since some have hypothesized that personal statements may be more useful for predicting ‘fit’-related outcomes such as retention (Chiu, 2019; Murphy et al., 2009), we investigated predictive validity for first year GPA and dropout. Furthermore, we explored whether more structured ratings show different results regarding gender differences than general, impressionistic ratings.

Using personal statements in college admissions procedures

Finally, we present the effect of including personal statements in admissions procedures by demonstrating the validity of an admission test + personal statement composite under different hypothetical weighting schemes. Presenting composite validity under different possible weighting schemes adds to the traditional regression approach by showing the practical implications of using personal statements in conjunction with other instruments, since optimal regression weights are seldom available or used in admissions procedures. We expected that using a suboptimally weighted composite of personal statement ratings and admission test scores would result in lower predictive validity than using the admission test alone, especially when general, impressionistic ratings are used.

Method

Sample

Archival personal statements of 806 applicants to an undergraduate psychology program at a Dutch university were retrieved from the university records. Of all applicants, 627 were accepted and started the program; for these students, information on gender, admission test scores, first year GPA, and dropout was retrieved as well. The mean age was M = 19.62 (SD = 1.69), 66% were female, 45% had Dutch nationality, 46% German nationality, 7% another EU nationality, and 2% a non-EU nationality. The program’s courses were offered in Dutch and in English; 42% of the students took their courses in Dutch. An a priori power analysis (1 – β = .80, α = .05, one-tailed) showed that this sample size is sufficient to detect a predictive validity of r = .10 and small differences in personal statement ratings (d = .30).
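For readers who wish to verify this calculation, the following is a minimal sketch of such an a priori power analysis in R using the pwr package. It is not the authors’ actual analysis code (which is available on OSF); the sample size of 627 enrolled students is simply taken from the text.

    # Sketch of an a priori power analysis (not the authors' OSF code).
    library(pwr)

    # Smallest correlation detectable with n = 627, alpha = .05, power = .80, one-tailed.
    pwr.r.test(n = 627, sig.level = .05, power = .80, alternative = "greater")

    # Required group sizes to detect a small difference (d = .30) in a one-tailed,
    # two-sample t-test at alpha = .05 and power = .80.
    pwr.t.test(d = .30, sig.level = .05, power = .80,
               type = "two.sample", alternative = "greater")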

Materials and procedure

All applicants had to submit a personal statement of approximately 250 words, indicating their motivation for the program. The mean number of words was M = 232 (SD = 34). Of all applicants, 38% wrote their statement in Dutch and 62% wrote their statement in English. Each personal statement was rated in two ways: by providing a single, impressionistic rating (second author and a research assistant) and by using pre-defined dimensions with anchored rating scales (first author and a research assistant). Thus, each statement was rated by two raters under each approach. The scales were piloted beforehand to check calibration and clarity among the raters. The statements were presented to the raters in random order, and personally identifiable information such as names and addresses was omitted to ensure the anonymity of the applicants. No other information, such as prior educational achievement, grades, test scores, or applicant gender, was available when rating the statements. Rating the statements spanned about five weeks and was done in multiple sittings to limit fatigue effects.

Measures

To obtain a single, impressionistic rating, each statement was rated based on the following question: How suitable is this applicant for studying Psychology at this university?, using a five-point scale (not suitable at all to very suitable). The mean rating across the two raters was used in all analyses.

For the structured ratings, we consulted staff and the literature (Chiu, 2019; GlenMaye & Oakes, 2002; Max et al., 2010) to determine what kind of information is typically extracted from personal statements. The resulting five dimensions (motivation for the discipline, expressing a future career perspective, motivation to study and learn, possessing relevant competencies and skills, and writing skills) with anchored rating scales are shown in the Appendix. Again, mean ratings across the two raters were used in the analyses.

The use of agentic language was operationalized as the proportion of agentic words in each statement, using the agentic language dictionary developed by Pietraszkiewicz et al. (2019) for the LIWC linguistic analysis program. They report satisfactory convergent and divergent validity and strong relationships with subjective ratings of agentic language use. Furthermore, we used a translate and back-translate procedure to develop a similar dictionary in Dutch (available via the first author). The analyses resulted in comparable, though slightly higher, proportions of agentic language in statements written in Dutch than in statements written in English (see Table 1; Hedges’ g = .29, 95% CI [.14, .43]). All personal statements were checked and corrected for spelling errors and typos before the linguistic analysis.
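LIWC itself is proprietary, but the underlying computation is a simple dictionary look-up. Below is a minimal sketch in base R of how such a proportion of agentic words could be computed; the dictionary entries shown are illustrative placeholders, not the actual Pietraszkiewicz et al. (2019) word list, and LIWC-style asterisk wildcards are handled with a simple regular-expression conversion.

    # Sketch: proportion of dictionary (agentic) words in a statement.
    # The entries below are illustrative placeholders, not the actual dictionary.
    agentic_dict <- c("ambitious", "confident", "leader*", "achiev*", "determin*")

    agentic_proportion <- function(text, dict) {
      words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
      words <- words[nzchar(words)]
      # Convert LIWC-style wildcards ("achiev*") to anchored regular expressions.
      patterns <- paste0("^", gsub("\\*", ".*", dict), "$")
      hits <- vapply(words, function(w) any(sapply(patterns, grepl, x = w)), logical(1))
      sum(hits) / length(words)
    }

    agentic_proportion("I am a confident and ambitious student who achieved top grades.",
                       agentic_dict)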

Table 1. Descriptive statistics.

First year GPA, dropout, gender, and admission test scores were obtained from the university administration. First year GPA was the mean grade obtained in the first year (1–10, with 10 being the highest grade). Dropout was defined as unenrolling during the first year or not re-enrolling for the second year. The admission test score was the raw score on a 40-item psychology test that showed high predictive validity in prior studies (Niessen et al., 2018), comparable to other commonly used admission tests such as the SAT and ACT (Kuncel & Hezlett, 2010; Westrick et al., 2015).

Analytic approach

A proposal for this study was submitted to the editors before the data were collected and analyzed. The project proposal, R code used for analyses, and the Dutch LIWC dictionary file are available on OSF (https://osf.io/b5e2p/). All directional hypotheses were tested using one-tailed tests. For exploratory analyses or when the data indicated effects in the opposite direction than expected, no p-values were computed. Instead, we report descriptive statistics and effect sizes with confidence intervals. Hypotheses that followed from other hypothesized effects were not tested when the former hypothesized effects were not detected (see Note 1).

Inter-rater reliability was estimated using intra-class correlations (ICCs, type 2), for single ratings and the average of two raters. Since we cannot verify whether the small differences in agentic language between statements written in Dutch or English reflect true differences, or are caused by factors such as differences in linguistic customs or using different dictionaries, all analyses including agentic language use were conducted separately for personal statements written in English and Dutch. Admission test + personal statement composite scores under different weighting schemes were computed by creating weighted sum scores using the standardized predictor scores.
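As an illustration of these two computations, a sketch in R is given below, using the ICC function from the psych package and a simple weighted sum of standardized scores. The object names (ratings, test_score, ps_rating) and the 80/20 weighting are assumptions for the example, not values taken from the study.

    # Sketch: inter-rater reliability and a weighted composite (illustrative names).
    library(psych)

    # `ratings` is assumed to be an n x 2 matrix with one column per rater.
    icc_out <- ICC(ratings)
    icc_out$results   # rows "Single_random_raters" (ICC2) and "Average_random_raters" (ICC2k)

    # Composite under one hypothetical weighting scheme (80% test, 20% statement),
    # built from standardized predictors as described in the text.
    composite <- 0.8 * scale(test_score) + 0.2 * scale(ps_rating)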

The personal statements were submitted as part of the admissions procedure, but were not previously used in the admissions process. Therefore, range restriction in personal statement ratings was not expected. For verification, the ratio of the standard deviation among enrolled applicants to the standard deviation in the entire applicant pool (ux) was computed, resulting in ux = 0.97 for general, impressionistic ratings and ux > 0.99 for structured ratings. Hence, range restriction was negligible, so no corrections were applied.
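The ux check itself is just a ratio of two standard deviations; a brief sketch with assumed vector names is:

    # Sketch: u_x = SD among enrolled students / SD in the full applicant pool.
    u_x <- sd(ratings_enrolled, na.rm = TRUE) / sd(ratings_all, na.rm = TRUE)
    u_x   # values close to 1 indicate negligible range restriction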

Results

Descriptive statistics of all study variables are presented in Table 1.

Gender differences

To test the hypothesis that male applicants obtained higher personal statement ratings than female applicants when general, impressionistic ratings are made (H1), a one-tailed t-test was planned. However, contrary to this expectation, the descriptive statistics (Table 1) show that female applicants obtained slightly higher personal statement ratings than male applicants (Hedges’ g = .24, 95% CI [.08, .41]). Since this difference is in the opposite direction than hypothesized, no statistical test was conducted.

To investigate whether male applicants used more agentic language than female applicants (H2), the proportion of agentic words in personal statements written by male and female applicants was compared. Descriptive statistics again show that the difference was in the opposite direction than expected; female applicants used more agentic language than male applicants in statements written in Dutch (Hedges’ g = .14, 95% CI [–.12, .40]) and English (Hedges’ g = .08, 95% CI [–.13, .30]), albeit slightly. Again, no statistical tests were conducted. Furthermore, we did not find an association between agentic language use and ratings of personal statements written in Dutch (r = .04, 95% CI [–.08, .15]) or English (r = −.01, 95% CI [–.10, .08]).
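A sketch of how such comparisons could be computed in R is shown below, using the cohen.d function from the effsize package (with Hedges’ correction) and cor.test for the correlation and its confidence interval. The data frame `d` and its column names are assumptions for the example, not the authors’ actual objects.

    # Sketch: gender difference in ratings (Hedges' g) and the agentic-language
    # correlation, using assumed column names in a data frame `d`.
    library(effsize)

    cohen.d(rating ~ gender, data = d, hedges.correction = TRUE)

    # Association between agentic word proportion and personal statement ratings,
    # with a 95% confidence interval.
    cor.test(d$agentic_prop, d$rating)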

The hypotheses that personal statements would result in overprediction of male performance and underprediction of female performance (H3), and that controlling for the use of agentic language would reduce differential prediction (H4), followed from the expectation that male applicants would receive higher personal statement ratings and would use more agentic language. Since we found no evidence for either of those expectations, potential differential prediction could not be explained by the hypothesized mechanism. Therefore, the analyses that were planned to investigate H3 and H4 were not conducted.

Rating structure

To investigate if rating personal statements in a more structured fashion increased reliability and validity compared to using general, impressionistic ratings, inter-rater reliability and predictive validity were computed for the two rating procedures. In agreement with H5, general, impressionistic ratings resulted in low inter-rater reliability (single-rater ICC = .32, 95% CI [.27, .37], average ratings ICC = .49, 95% CI [.42, .54]) and negligible to near-zero predictive validity for first year GPA (r = .09, 95% CI [.01, .16], p = .01, one-sided) and dropout (r = −.01, 95% CI [–.09, .07], p = .40, one-sided).

Ratings of personal statements made using a more structured approach resulted in somewhat higher reliability (H6, single-rater ICC = .50, 95% CI [.35, .61], average ratings ICC = .67, 95% CI [.52, .76]), although reliability was only ‘moderate’ according to most guidelines (LeBreton & Senter, 2008; Nunnally & Bernstein, 1994). Moreover, the increase in reliability was not accompanied by increased predictive validity for first year GPA (H6, r = .09, 95% CI [.02, .17], p = .01, one-sided) or dropout (r < .01, 95% CI [–.08, .08], p = .49, one-sided), as the correlations were virtually identical.
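The one-sided validity tests reported here can be reproduced in outline with base R’s cor.test; a sketch with assumed variable names follows. Dropout is assumed to be coded 1 = dropped out, so a valid rating should correlate negatively with it.

    # Sketch: one-tailed predictive validity tests (assumed variable names).
    cor.test(d$structured_rating, d$gpa, alternative = "greater")      # first year GPA
    cor.test(d$structured_rating, d$dropout, alternative = "less")     # dropout (1 = dropped out)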

While gender differences in general ratings were not detected in the expected direction, we still explored gender differences in structured ratings. The results were very similar to those for general ratings, with statements of female applicants rated slightly higher than those of male applicants (Hedges’ g = .23, 95% CI [.07, .40]).

Using personal statements in admissions

To demonstrate the effects of adding personal statements to an admission procedure that contains a commonly used valid predictor (i.e., an admission test), Table 2 shows the validity of an admission test + personal statement composite under different hypothetical weighting schemes that could be used in practice. Since predictive validity and correlations between admission test scores and personal statement ratings were very similar, regardless of rating structure (general ratings: r = .17, 95% CI [.11, .24], structured ratings: r = .15, 95% CI [.08, .22]), the general ratings were used to generate the composite scores.

Table 2. Admission test – personal statement composite validity under different weighting schemes.

An optimal, regression-based composite of the admission test scores and personal statement ratings would result in R2 = .20 for first year GPA, with a relative weights analysis (Tonidandel & LeBreton, 2015) resulting in a weight of 98% for the admission test score and 2% for the personal statement rating. For dropout, an optimal composite would result in pseudo R2 = .08, with relative weights of 99% for the admission test score and 1% for the personal statement. Additionally, weighting the statements 0% yielded identical results. These results and the hypothetical weighting schemes presented in Table 2 show that using the personal statement ratings in admissions procedures would, at best, not have improved predictive validity for GPA or dropout, and would have been detrimental to validity when the statements received a weight above 10%.
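A sketch of the regression-based composites is given below: an ordinary least squares model for first year GPA and a logistic model for dropout, with McFadden’s pseudo R2 shown as one common choice (the article does not state which pseudo R2 was used). Variable names are assumed, and the relative weights analysis itself (Tonidandel & LeBreton, 2015) is not reproduced here.

    # Sketch: optimal (regression-based) composites, assumed variable names.
    fit_gpa <- lm(gpa ~ test_score + ps_rating, data = d)
    summary(fit_gpa)$r.squared                       # R^2 for first year GPA

    fit_drop  <- glm(dropout ~ test_score + ps_rating, data = d, family = binomial)
    null_drop <- glm(dropout ~ 1, data = d, family = binomial)
    1 - as.numeric(logLik(fit_drop)) / as.numeric(logLik(null_drop))  # McFadden pseudo R^2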

Exploratory analyses

To shed some more light on the unexpected findings regarding gender differences, we conducted exploratory analyses to investigate whether writing skills (as perceived by the raters) could explain the higher ratings for female applicants. Women tend to score higher on writing skills than men (Kaufman et al., 2009; Petersen, 2018; Reilly et al., 2019), also when writing in English as a foreign language (Keller et al., 2020). Furthermore, writing skills could have unintentionally influenced both the general ratings and the structured ratings on dimensions other than writing skills. Including writing skills as one aspect to be rated in the structured rating form (single-rater ICC = .39, 95% CI [.30, .47], average ratings ICC = .56, 95% CI [.46, .64]) allowed an exploration of this possibility. The average writing skills rating across the two raters was used in the analyses presented below.

Writing skills ratings were moderately related to general, impressionistic ratings (r = .36, 95% CI [.30, .41]) and structured ratings (with the writing skills item excluded, r = .26, 95% CI [.20, .32]). Furthermore, female applicants were rated slightly higher on writing skills than male applicants (Hedges’ g = .34, 95% CI [.17, .50]; see Note 2). In addition, exploratory hierarchical regression analyses (Table 3) show that the (already small) relationship between gender and personal statement ratings was substantially reduced when writing skills ratings were included in the models. The regression coefficients for gender were reduced by at least half, and the partial correlations, showing the relationship between gender and personal statement ratings with writing skills ratings partialled out, were substantially smaller than the zero-order correlations.
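The exploratory hierarchical regressions and partial correlations can be sketched as follows; the column names are assumptions, gender is dummy-coded for the partial correlation, and complete cases are assumed so that the two residual vectors align.

    # Sketch: hierarchical regression of ratings on gender, then gender + writing skills,
    # and the partial correlation of gender with ratings controlling for writing skills.
    d$female <- as.numeric(d$gender == "female")

    step1 <- lm(ps_rating ~ female, data = d)
    step2 <- lm(ps_rating ~ female + writing_rating, data = d)
    summary(step1); summary(step2)
    anova(step1, step2)   # test of the R^2 change

    # Partial correlation via residuals (assumes complete cases).
    cor(resid(lm(female ~ writing_rating, data = d)),
        resid(lm(ps_rating ~ writing_rating, data = d)))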

Table 3. Hierarchical regression analyses of personal statement ratings by gender and writing skills ratings.

Discussion

We found no evidence of adverse impact against female applicants in personal statement ratings, nor that male applicants use more agentic language in their personal statements. The latter is at odds with earlier studies that found that men used more agentic language in personal statements for medical residency programs (Babal et al., 2019; Osman et al., 2015). Furthermore, we found no evidence for the suggested relationship (Woo et al., 2020) between agentic language use and personal statement ratings. Contrary to our expectations, we found that personal statements of female applicants were rated slightly more positively than those of male applicants. Exploratory analyses suggest that these small differences might be explained by better writing skills of female applicants (Kaufman et al., 2009; Reilly et al., 2019), and the relationship between (perceived) writing skills and personal statement ratings. However, since these analyses were conducted to explore possible reasons for unexpected results, these results should be interpreted very tentatively.

While no evidence for substantial gender bias was detected, the results concerning reliability and predictive validity once again paint a bleak picture for using personal statements in college admissions, at least when rated in the ways we investigated. In line with previous studies (Murphy et al., 2009), general, impressionistic ratings resulted in low inter-rater reliability and predictive validity for first year GPA. Additionally, the near-zero correlations with dropout do not support the suggestion that personal statements may be more useful for predicting ‘fit’-related outcomes than performance-related outcomes (Chiu, 2019; Murphy et al., 2009). Furthermore, rating personal statements in a more structured manner improved inter-rater reliability to some extent, but reliability was still only ‘moderate’ and we found no evidence for improvements in predictive validity. So, increasing rating structure, at least in the way adopted in this study, does not seem to solve the issues concerning the psychometric quality and utility of personal statement ratings in college admissions.

Finally, our findings align with earlier conclusions that using personal statement ratings, whether derived using an unstructured or a more structured approach, for the purpose of predicting academic performance (Murphy et al., 2009), either alone or in combination with a valid admission test, is ill advised. In terms of predictive validity, and hence, making admission decisions, reading and rating personal statements in this way seems, at best, a waste of time. Moreover, if given substantial weight, which seems to be quite common in practice (Klieger et al., 2017), including personal statements in admissions procedures has a substantial detrimental effect on validity.

Limitations, strengths and future research

We investigated some previously posited hypotheses that had not been tested before. Furthermore, the data allowed estimations of validity and reliability of personal statements that were demonstrably unaffected by range restriction, which is rare when using data obtained in applied settings.

The results presented in this manuscript were based on data obtained from a single undergraduate program at one university, representing one specific personal statement format (a brief format of 250 words) and one level of selectivity (low, in this case), which limits the generalizability of our findings. Therefore, replications with larger and more diverse samples are strongly encouraged. A question raised by an anonymous reviewer was whether agentic language use would be considered indicative of good ‘fit’ to a psychology program. While prior research in an organizational setting found that agentic descriptions were related to perceived competence and hireability (Rudman, 1998), we did not find relationships between agentic language use and personal statement ratings. It is possible that agentic language is perceived more favorably in some disciplines than in others. However, we think it is plausible that language demonstrating confidence, competence, and ambition would generally be perceived positively.

In this study, the raters were unaware of the gender of the applicants while rating the personal statements. Perhaps results would have been different if that had not been the case (Heilman, 2001; Rudman, 1998), and gender-blind rating may not be representative of real-world admissions procedures; names would usually provide quite accurate cues about applicant gender. Furthermore, our study was based on ratings made by a few raters without experience in rating personal statements. This limitation prevented us from investigating the effects of rater characteristics such as rater gender and experience. For example, it is possible that male and female raters would respond differently to agentic language use, or to statements written by male and female applicants in general. It is also possible that training raters, for example using frame-of-reference training (Roch et al., 2012), would result in higher inter-rater reliability and predictive validity, or would affect the influence of structure on reliability and validity. Future research should shed more light on these possibilities.

The degree of structure in the structured rating process may not have been sufficient to yield reliability and validity gains. The large body of research on the effect of structure on interview reliability and validity (Huffcutt et al., 2014; Levashina et al., 2014) shows that higher degrees of structure, both regarding what information is asked of the applicants and the response rating process, are more beneficial. The highest level of structure is reached when all applicants are asked to answer the exact same questions and each individual answer is rated using formal, anchored rating scales. We were confined to adjusting the structure of the rating process alone, and in the absence of detailed instructions to applicants on what to write about in their statements, the degree to which the rating process could be structured was limited as well (Huffcutt et al., 2013). Even in rating the quite broad dimensions we defined, we already noticed that some applicants did not write about some of them at all, complicating the rating process. Possibly, increasing the amount of structure analogous to the highest level of interview structure would result in more positive results. However, when adopting such an approach, such personal statements would perhaps best be defined as open-ended biodata measures, or written structured interviews, which is considered a different type of instrument (Murphy et al., 2009).

Additional challenges in ensuring that personal statements yield valid and fair assessments are the large number of tips and tricks on writing personal statements available online and the apparent abundance of plagiarism (Shuker, 2014). A notable example is that over 200 applicants to colleges in the U.K. in 2007 used the exact same opening sentence in their personal statement (“Ever since I accidentally burnt holes in my pyjamas after experimenting with a chemistry set on my 8th birthday, I have always had a passion for science”, Shuker, 2014). Additionally, the difficulty of verifying and controlling the amount and type of support applicants receive in writing their statements is a potential source of bias. Wright and Bradley (2010) found that applicants to UK medical schools from state schools received lower scores on their personal statements than applicants from grammar and independent schools (often populated by students from higher SES backgrounds), but performed equally well in medical school. Applicants from state schools also received less support in writing their personal statements (Wright, 2015).

Furthermore, if (perceived) writing skills affect personal statement ratings, that could result in adverse impact for applicants with a migration background or a low SES background. While writing skills are often mentioned as one of the attributes assessed in personal statements (Chiu, 2019; Kyllonen et al., 2005), personal statements do not seem to be valid indicators of writing skills (Kuncel et al., 2020; Powers & Fowles, 1997). If assessing writing skills is considered relevant, other, more valid tools should be used.

Conclusion

In short, while our results do not provide support for earlier concerns of adverse impact and bias against female applicants when personal statements are used, the results in terms of validity and reliability were in line with earlier meta-analytic findings (Murphy et al., 2009). Given these findings, we echo earlier advice not to use personal statements in admissions procedures (Kuncel et al., 2020), at least when they are intended to contribute to predicting academic performance, and until evidence is presented that they can yield reliable and valid ratings. Whether the latter is possible remains an open question.

Acknowledgment

We thank Lotte Mensink for her help in organizing and collecting the data used for this study.

Disclosure statement

We have no conflicts of interest to disclose.

Notes

1 This was the case for differential prediction; the expected gender differences in personal statement ratings that could subsequently result in differential prediction of academic achievement were not detected. Therefore, differential prediction analyses were not conducted, and the intended analytical approach for those analyses is not described. However, the hypotheses were retained in the paper for transparency.

2 Results were very similar for statements written in Dutch (Hedges’ g = .31, 95% CI [.05, .57]) and English (Hedges’ g = .35, 95% CI [.13, .56]).

References