Research Article

General mental ability testing and adverse impact in the United Kingdom: a meta-analysis with more than two million observations

Received 31 Oct 2023, Accepted 02 Jul 2024, Published online: 13 Jul 2024

ABSTRACT

We review ethnic group differences on high-stakes General Mental Ability (GMA) tests based on 21st century UK data. Thereafter, we meta-analyse scores on 23 occupational, public sector, educational, military, or general-public selection tests, with a sample size exceeding two million. Relative to White GMA, the grand meta-analytic effect sizes (Cohen’s d) by major ethnic groups were: Mixed d = .14 (k = 24, N = 67,114), Blacks d = .65 (k = 32, N = 112,975), Asians d = .33 (k = 32, N = 311,695), and Other d = .49 (k = 24, N = 42,846). Further, although Chinese residents outscored White British residents (d = −.15, k = 20, N = 18,897), all other Asian ethnic groups scored slightly to substantially lower. For example, South Asians as a whole averaged d = .37 (k = 13, N = 67,566). By subgroups, these averages were: Indians (d = .17, k = 10, N = 28,236), Pakistanis (d = .49, k = 9, N = 19,371), and Bangladeshis (d = .55, k = 7, N = 19,772). Implications for practice and theory are discussed.

Introduction

General mental ability testing and adverse impact

In the UK, tests measuring general mental ability (GMA) are often used in high-stakes selection contexts. For example, a survey by the Chartered Institute of Personnel and Development (CIPD, Citation2017) found that fully 41% of organizations in the UK rely on GMA tests for employee selection purposes. Thus, GMA tests routinely help determine whether job applicants will get hired/promoted (and whether potential students will be admitted into degree programs). This is not surprising, as GMA tests possess high validity when predicting work and/or job training performance, together with educational achievement (for a meta-analytic review, see Schmidt & Hunter, Citation1998).

A double-edged sword exists for UK practitioners wishing to use GMA tests in selection contexts. On the one hand, as described above, GMA testing produces impressive validity when predicting employee or student success. On the other, these tests also produce medium to large mean differences across racial and ethnic groups of test-takers. In the industrial/organizational literature, this trade-off is known as the “diversity–validity dilemma” (Ployhart & Holtz, Citation2008): despite GMA’s relatively high validity, using these tests creates medium to large average differences across groups, such that lower-scoring group members will be underrepresented in employment and/or educational selection contexts.

As one example, consider the situation in the USA (Baron et al., Citation2003). There, the Black-White standardized mean difference (d) on cognitive tests is approximately one standard deviation, while the difference for the Hispanic-White comparison is also substantial (Jensen, Citation1998; Roth et al., Citation2017). These differences can create substantial underrepresentation for minority groups in selection contexts.

Worse, the existence of these selection disparities by group membership may also be illegal. The legal theory here is termed “adverse (aka disparate) impact.” Note that this case law was first created in the USA, via the US Supreme Court’s landmark decision in Griggs v Duke Power (Citation1971). Griggs was a Black male who shovelled coal as an employee of Duke Power. He applied for a promotion but was rejected under a company policy stating that his GMA test score was too low for him to be eligible for promotion. Griggs sued and ultimately won.

Thus, the US Supreme Court effectively “amended” Title VII of the Civil Rights Act of 1964 by creating a new form of discrimination known as adverse impact. This type of potentially illegal discrimination exists when a neutral selection practice (e.g., GMA testing; physical abilities testing) nonetheless “harms” a protected class like race or sex (in the UK, see: Essop and others v Home Office, Citation2017). The word “harms” here indicates that members of the affected class were selected at significantly lower rates relative to some reference group. In general, adverse impact exists when a test or other selection method differentially excludes the selection of one protected-class group (e.g., minorities) over another (e.g., majorities).

In the UK, adverse impact is typically referred to as “particular disadvantage” (see Essop and others v Home Office, Citation2017). Except with regard to how “harm” is proven, these across-country legal theories are essentially the same. Thus, here we effectively treat “adverse impact” and “particular disadvantage” as synonyms.

The Equality Act (Citation2010) is the primary law addressing indirect discrimination and/or adverse impact in the UK. The Act identifies various “protected characteristics,” including age, disability, gender reassignment, race, among others, and the law prohibits both direct and indirect discrimination based on these characteristics. According to the Act, indirect discrimination exists when applying a “provision, criterion or practice” that puts members of protected groups at “a particular disadvantage” [i.e., one that causes adverse impact] if the provision, criterion, or practice cannot be shown to be a “proportionate means of achieving a legitimate aim” [i.e., one that is not shown to be “job related”].

By explicitly prohibiting indirect discrimination, the Equality Act underscores a commitment to preventing practices that, though seemingly neutral, result in unequal outcomes across protected characteristics. Similar to the principles of “job relatedness,” and “business necessity” in the USA, the UK’s Equality Act incorporates the principle of “objective justification,” which gives employers an opportunity to demonstrate the legitimacy and proportionality of any employment practice that resulted in adverse impact. In this context, proportionality is assessed with the “no more than necessary” test (Essop and others v Home Office, Citation2017), wherein adverse impact would be illegal if plaintiffs can show that an alternative selection practice exists, which is just as valid as the challenged practice, but causes less adverse impact.

To be clear, adverse impact (or “particular disadvantage”) is not synonymous with group difference. Adverse impact exists when members of a protected class are selected at lower rates relative to some reference group (whatever the cause), and when the difference in selection ratios is deemed to be of practical significance by the legal system. When the resulting adverse impact is neither “proportionate” nor “legitimate,” the selection practice is illegal.

Given the UK’s current legal environment, together with its recent and rapid diversification, a better understanding of GMA differences across ethnicities (together with any resulting adverse impact) seems critical. To wit, the 2011 UK census reported that 14% of the UK population is of non-European/White ethnicity (of which 38% are South Asian; 16% are Chinese/other Asian; 24% are African/Black; and 23% are mixed/other; ONS, Citation2011). This percentage is expected to increase to 26% by 2051 (Rees et al., Citation2017). Thus, here we provide a comprehensive, meta-analytic review of the magnitude of the group differences that GMA testing creates across various ethnicities in the UK. First, however, we briefly review GMA itself, and the adverse impact resulting from using these tests in selection contexts.

GMA and group differences in measured GMA

Cognitive ability is structured in a hierarchy comprising three strata (Carroll, Citation1993; Jensen, Citation1998; McGrew, Citation2009). GMA appears alone at the top of the hierarchy (Stratum III). Below it, broad mental abilities (e.g., visual, spatial, verbal) occupy Stratum II, while narrow mental abilities comprise Stratum I. The strata display moderate to large intercorrelations with one another. A massive literature shows that GMA tests possess substantial predictive validity for school and work performance, including success in training programs. While debate currently exists about whether GMA is the strongest predictor in this literature (Oh et al., Citation2023; Ones & Viswesvaran, Citation2023; Sackett et al., Citation2023), it is undeniable that GMA provides substantial utility in high-stakes selection contexts.

Much research on group differences has focused on ethnic group differences specific to the USA (e.g., Roth et al., Citation2001, Citation2017). The oversampling of the USA is likely due to a strong research focus on discrimination in this country, together with relatively high rates of ethnic diversity. Research here has shown moderate to large differences in measures of GMA between major ethnic groups. Research has also shown that cognitive differences tend to be larger on measures of GMA (i.e., Stratum III) than on measures of broad or narrow mental abilities (i.e., Strata II and I; see, e.g., Jensen, Citation1998; Roth et al., Citation2001).

Especially compared with the USA, the literature on cognitive score differences in the UK is both sparse and dated. To wit, the UK test score differences reported in the industrial/organizational literature mostly come from samples collected at least twenty years ago (e.g., Baron et al., Citation2003; Evers et al., Citation2005). It is unwise to assume that ethnic differences will remain constant over time, as new groups culturally assimilate to the UK, and as ethnic compositions change due to recent migrations. Moreover, these older UK-based studies often used overly broad ethnic categories such as “White versus Non-White.” Categories like these are less than ideal for estimating true differences, especially given that scores sometimes exhibit large heterogeneity across the types of GMA tests and subtests examined (Jensen, Citation1998).

While there are many cultural similarities between the USA and the UK, each country’s race/ethnic group classifications are not automatically comparable, and similarly defined groups cannot be assumed to perform the same in both countries (Stillwell, Citation2022). For example, in some instances, no comparable socially defined groups exist across the two countries. In the USA, the Office of Management and Budget requires the collection of data on six minimum categories (Hispanic, American Indian, Asian, Black, Pacific Islander, and White), while an 18-group system exists in the UK, with five major groupings (White, Mixed or Multiple, Asian, Black, and Other ethnic group). Thus, the USA has no broad ethnic group categories matching the UK’s “Mixed or Multiple” and “Other ethnic group(s),” nor are data in the USA generally collected on more narrowly defined ethnicities.

Even when USA/UK labels are identical (e.g., “White,” “Asian,” “Black”), the classification rules differ (Stillwell, Citation2022). In the USA, Middle Eastern and North African individuals of Arab ancestry are classified as “White”, while, in the UK, Arabs are classified under “Any other Ethnic group”. Likewise, in terms of geographic origin, the ethnic composition of broad ethnic/racial groups differs across countries. For example, the Asian group in the UK includes a much larger share of individuals of South Asian descent relative to the Asian group in the USA (in which there are relatively more East Asians). This point may be especially relevant, given findings that East Asians have elevated mathematical versus verbal skills (Suzuki et al., Citation2002).

Moreover, identical labels for ethnic groups within the USA and the UK are still not completely comparable for other reasons, as group members between the countries have different backgrounds and histories. For example, the vast majority of Black British are the descendants of educationally selective African and West Indian migrants, who came to the UK within the last three generations (Adesote & Osunkoya, Citation2018; Model, Citation2008). Conversely, the vast majority of Black Americans are descendants of “involuntary migrants” (Gibson & Ogbu, Citation1991) as a result of the Transatlantic slave trade. As another example, Asian-Indian migrants to the USA tend to be hyper-selected in terms of education (Chakravorty et al., Citation2016), whereas this happens less so in the UK, as many Indians migrated after World War II to fill semi-skilled labour shortages (Sharma, Citation2015). These differences are salient to our comparisons here, as more educationally selective migrants and their children are expected to have higher GMA scores than less selective migrants (e.g., Cattaneo & Wolter, Citation2012). Finally, ethnic groups across countries can differ in terms of the proportion of first-, second-, and third-generation migrants, and thus the time it takes for cultural assimilation into one or the other country.

Just as group categories cannot be generalized across countries, neither can theories for why these differences exist. In the USA, GMA differences (in particular between Whites, Blacks, and Hispanics) are typically attributed to discrimination and social exclusion. To wit, Cottrell et al. (Citation2015) theorized that race in the USA signifies socio-political conflicts and interests and explained GMA differences in terms of centuries of housing, educational, and occupational segregation. It is not clear, however, whether Cottrell et al.’s model for the USA might also apply to residents in the UK (whose immigrants mostly arrived last century).

Tentatively modeling group differences in GMA in the UK

While the cause of GMA differences is not legally relevant to adverse impact claims, a theoretical model of their origin may be useful for understanding them. In a recent study, Pesta et al. (Citation2023) examined group-level cognitive ability differences for several nationally representative samples of UK adults. They reported a correlation of r = .93 (N = 16) between group-level cognitive ability differences in the UK and international achievement test scores from each ethnic group’s region of origin. When considering just UK-born and/or English-fluent individuals, the correlation remained substantial but fell to .77 (N = 16). Pesta et al. (Citation2023) surmised that first- and second-generation migrants from countries with less well-developed educational systems tend to perform poorly relative to those from countries with more-developed educational systems (all else being equal).

We tentatively adopt Pesta et al.’s (Citation2023) suggestion as a starting point for understanding group differences in the UK. Additionally, we assume that differential migrant selection will also affect scores. That is (all else being equal), groups that are more educationally-selected should perform relatively better on GMA tests than ones that are less so. Additionally, we theorize that time and the number of generations spent in the UK will also impact GMA scores. Over time, groups will generally culturally assimilate, and so we expect there will be reduced cultural and language bias on GMA tests. Consistent with this expectation, Pesta et al. (Citation2023) reported that GMA gaps for British Asians and Blacks are substantially smaller when limiting consideration to UK-born individuals. Insofar as assimilation is stalled, Cottrell et al.’s (Citation2015) model of social segregation (described above) may be especially appealing. To wit, Pesta et al. (Citation2023) suggested that adverse impact from GMA tests could inadvertently perpetuate differences across generations by restricting minority access to employment, educational, and other opportunities.

In sum, given how widely GMA tests are used in selection, the gap in the current literature as it relates to the UK, and the ethnic differences reviewed above, more knowledge is needed regarding differences in GMA. Here, we review mean GMA test scores by ethnicity for UK test takers. Following research in the USA (Roth et al., Citation2001, Citation2017), we focus on differences in measured GMA, since measured differences, not the cause of them (e.g., psychometric bias), are directly relevant to adverse impact claims. Finally, our review covers the five broad and 18 narrow ethnic groups identified by the 2011 UK Census, since the data are regularly collected for these groups.

Method

Study identification, screening, and selection

We first created a database of all UK samples containing scores on educational or occupational selection tests. Consideration was limited to samples from the 21st century that also included data on multiple ethnic groups (see the Supplementary Materials file for a summary). Next, we conducted a meta-analysis of these samples, and we report on the PRISMA statement requirements relevant to our review and meta-analysis.

Information sources

In searching for data, we reviewed narrative discussions of GMA differences and selection testing in the UK (e.g., Baron et al., Citation2003; Dewberry, Citation2011; Evers et al., Citation2005; Woolf et al., Citation2011) and also searched the references therein. We then conducted searches in ERIC, PsycINFO, Business Source Premier, ProQuest, and Google Scholar for dissertations and English-language articles published between 2000 and 2020 concerning ethnic differences on selection tests. We scanned all abstracts and full texts (when appropriate). The searches we employed were:

  1. ERIC: TX (ethnic OR minority OR Black OR Asian) AND TX (test OR exam OR score OR assessment OR cognitive OR aptitude OR achievement) AND TX (Great Britain OR United Kingdom OR Northern Ireland OR England OR Scotland OR Wales)

  2. PsycINFO: AB (ethnic OR minority OR Black OR Asian) AND AB (test OR exam OR score OR assessment OR cognitive OR aptitude OR achievement) AND PL (Great Britain OR United Kingdom OR Northern Ireland OR England OR Scotland OR Wales)

  3. Business Source Premier: AB (ethnic OR minority OR Black OR Asian) AND TX (Adverse Impact OR test OR exam OR score OR assessment OR cognitive ability OR mental ability OR IQ OR aptitude OR achievement OR selection) AND GE (Great Britain OR United Kingdom OR Northern Ireland OR England OR Scotland OR Wales)

  4. ProQuest: ab(ethnic OR minority OR Black OR Asian) AND ab(test OR exam OR score OR assessment OR cognitive OR aptitude OR achievement) AND (Great Britain OR United Kingdom OR Northern Ireland OR England OR Scotland OR Wales)

  5. Google Scholar: (“mental ability” OR “cognitive ability” OR “adverse impact”) AND (“selection tests” OR “high stakes tests”) AND (Asian OR Black) AND (Great Britain OR United Kingdom OR Northern Ireland OR England OR Scotland OR Wales)

The searches were slightly different across search engines (e.g., Google Scholar vs. Business Source Premier) and tailored to maximize relevant results. We next contacted publishing companies, organizations, and consortiums (e.g., Rail Safety and Standards Board, GL Assessment, Talent Lens, Team Focus, and LNAT) for technical reports and data. To identify publishing companies, we reviewed the British Psychological Society’s Test Publishers List (The British Psychological Society & Testing, Citationn.d).

In addition, we consulted with UK citizens about common governmental/public tests and submitted a substantial number of Freedom of Information (FOI) Act requests for data held by relevant ministries and governmental offices (e.g., the Ministry of Defence). We requested data for recent years only, because there were limits on the amount of data/time the governmental offices could spend on requests; moreover, many of the offices only held data for the most recent years. A complete list of the thirty publishing companies, governmental departments, etc., that we contacted is provided in SM File 2. Finally, we also searched for technical reports and pilot studies published on GOV.UK, an informational website hosting government data and reports, created by the Government Digital Service. The search process was concluded in late 2020.

Selection of samples

In SM File 1, we provide a detailed narrative review of samples identified through our searches. We adopted the following criteria for inclusion in the meta-analysis:

  1. The test is used for selection (i.e., a test taken as part of the entry process to an educational or occupational organization). We excluded, for example, university-course grades and class-test scores.

  2. Scores were reported either for specific ethnic groups (e.g., Chinese or Caribbean) or for major ethnic groups aggregated consistent with government classifications (e.g., Black or Asian), but not for generic categories such as Black and minority ethnic (BME) or non-White.

  3. Information was available for the computation of effect sizes (e.g., means and standard deviations) and to allow for meta-analysis (e.g., sample sizes provided). We did not include samples with effect sizes computed from single threshold pass rates since assumptions of equal variances and normality were likely violated due to the heterogeneity of some of the ethnic groups (e.g., Asians; Ho & Reardon, Citation2012).

  4. Samples featured UK residents. We note that citizens and residents were typically not disaggregated.

  5. Samples were of adults (i.e., the average age of the samples was over 16), so as to reduce age-related heterogeneity.

  6. Tests were administered between 2000 and 2020.

  7. The reported information was gathered at the individual rather than the group level.

  8. The tests were administered to normal (nonclinical) populations.

  9. Samples did not contain redundant material. For example, a number of samples reported in Woolf et al. (Citation2011) were retrospective studies reporting University Clinical Aptitude Test (UCAT) scores. We instead used data from UCAT technical reports, which already contained these scores by virtue of reporting all scores for all UCAT tests taken.

The following eight categories of data are discussed in SM File 1: Aptitude Tests for Occupational Selection, Public Sector Tests, Military Tests, Driving Tests, College Entrance Tests & Qualifications, Law School Tests, Medical School Tests, and Business School Tests. Two of the authors independently coded the samples discussed in the review for meeting the criteria above. Inter-rater reliability, measured as percent agreement, was 90% (54 of 60 samples). Discrepancies were discussed until consensus was reached. A list of excluded samples resulting from the narrative review, along with detailed reasons for their exclusion, appears in SM File 2.

Ethnic classifications

In England and Wales, governmental agencies use an 18-group system of ethnic classifications. These 18 narrow classifications are grouped into 5 broad ones:

  1. White (English, Welsh, Scottish, Northern Irish or British; Irish; Any other White; Gypsy and Irish Traveller),

  2. Mixed or Multiple (White and Black Caribbean; White and Black African; White and Asian; Any other Mixed or Multiple),

  3. Asian or Asian British (Indian; Pakistani; Bangladeshi; Chinese; Any other Asian, which includes other South and Eastern Asians),

  4. Black (African; Caribbean or Black British; Any other Black),

  5. Other ethnic group (Arab; Any other ethnic group).

We note that private organizations use similar classification systems. As in the USA, ethnicity in the UK is self-identified. Moreover, according to the Office for National Statistics (ONS, Citation2003), ethnicity is “subjectively meaningful” and not based on “objective, quantifiable information”.

It should be noted that the meaning of the classifications has changed somewhat over time. This can be seen in particular with the “Any other Asian” and “Any other ethnic group” categories in the UK census. In 2001, the Chinese group was subgrouped under “Other.” Moreover, the Office for National Statistics treated “Any other Asian” as a residual South Asian category (e.g., Sri Lankan) and classified other geographically Asian groups (e.g., Japanese) as “Any other ethnic group.” In 2011, however, the Chinese were subgrouped under “Asian.” Since 2011, non-Chinese East Asians have tended to identify as “Any other Asian” (Aspinall, Citation2003, Citation2013).

Since classifications are “subjective”, they do not necessarily map perfectly to geographic origin. For example, Iranians, a geographically defined Asian group of West Eurasian (i.e., Caucasian) ancestry, might identify either as “Any Other White,” or “Any other Asian.” This situation is similar to that in the USA, where there is often uncertainty as to how Middle Eastern and North African ancestry individuals identify (Cassino, Citation2023).

Since data in the UK are collected using this system, we rely on it here with a few alterations. And since the vast majority of our datasets came from after 2011, we use the 2011 framework. For the purpose of this meta-analysis, we used the following groupings:

  1. White (English, Welsh, Scottish, Northern Irish or British; Irish; Any other White),

  2. Mixed or Multiple (White and Black Caribbean; White and Black African; White and Asian; Any other Mixed or Multiple),

  3. Asian, divided into three overlapping subcategories: non-Chinese Asian (all Asian groups except Chinese); Chinese and Any other Asian; and South Asian (Indian, Pakistani, and Bangladeshi),

  4. Black (African; Caribbean or Black British; Any other Black),

  5. Other ethnic group (Arab; Gypsy, Roma and Irish Traveller; Any other ethnic group).

This system differs slightly from the government classification in that we also divided the broad “Asian” category into three subcategories: non-Chinese Asian, Chinese and other Asian, and South Asian. The non-Chinese Asian category (i.e., Indian; Pakistani; Bangladeshi; Any other Asian) overlaps with both the “Chinese and Any other Asian” and the “South Asian” categories, as it includes all Asians except Chinese. We created these subcategories for purely practical reasons: sources presented their data in different groupings, and the subcategories allowed us to match classifications across sources.

For example, some technical reports, such as the UCAT, provide data for Chinese separate from the remaining Asian groups (thus, “Chinese” & “non-Chinese Asian”). Other technical reports provided data for South Asians separate from the remaining Asian groups (thus, “South Asian” and “Chinese and Other Asian”). While including these three subcategories leads to a category that overlaps with other categories, it allows us to add extra data points to the meta-analysis. The only other difference was that we placed the Gypsy, Roma and Irish Traveller group under the Other category instead of the White category. This was because, at the time of collecting data, Irish Travellers and Gypsy/Roma were not distinguished and because the Romani people are a South Asian-origin ethnic group that migrated to Europe around the 14th century and so seem to fit better in the Other group.

When only narrow ethnic group data were available (e.g., Indian & Pakistani), we created scores for broad groups (e.g., South Asians) by N-weighted averaging of the available narrow group scores. We also provide data for an Unclassified group in SM File 2, but this group is not analysed in the meta-analysis.
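This N-weighted averaging can be sketched as follows; the function is a minimal illustration, and the example figures are hypothetical (not taken from our datasets):

```python
def n_weighted_mean(groups):
    """Combine narrow-group mean scores into a broad-group mean,
    weighting each narrow group by its sample size (N).

    `groups` is a list of (n, mean) tuples, one per narrow group."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n

# Hypothetical narrow-group data: (N, mean score on an SD = 15 scale)
indian = (28000, 102.0)
pakistani = (19000, 97.0)
south_asian = n_weighted_mean([indian, pakistani])  # broad-group mean
```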

Data items

For each study included in the meta-analysis, we recorded the: 1) source or author, 2) test battery or sample name (e.g., UCAT), 3) sample type, 4) level of mental ability, 5) size of the total sample, 6) mean score of the total sample (on a scale with standard deviations set to 15), 7) ethnic category (e.g., White), 8) ethnic subgroup sample size, and 9) ethnic subgroup mean score (on a scale with the standard deviation set to 15).

Many tests featured multiple years of data (e.g., twelve years of UCAT data), giving us the option of treating these as independent samples in the meta-analysis, or of combining them into a single, large sample prior to meta-analysis. In the Schmidt and Hunter tradition of meta-analysis (Schmidt & Hunter, Citation2015), one is discouraged from splitting samples into subsamples (e.g., by sex or by year), as this increases the chance of finding differences that are largely caused by sampling error. In line with this tradition, we decided to combine the data from various years. For instance, we considered the twelve UCAT datasets essentially interchangeable, so we combined them. Thus, we first combined all data for the same test and the same dataset, and then meta-analysed results across all combined and single datasets.

When different sample sizes were reported for overlapping subtest scores, we computed and used the harmonic N of the sample sizes. In computing means, we first converted test scores for each ethnic group into standardized differences with the White group set to 0. To do this, we used the standard deviation pooled across all available ethnic groups in the given sample.
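These steps (the harmonic N across overlapping subtests, the pooled standard deviation, and the standardized difference with the White group set to 0) might be sketched as follows; the function names are ours, and the sign convention follows the paper (positive d indicates a group mean below the White mean):

```python
import math

def harmonic_n(ns):
    """Harmonic mean of the sample sizes reported for overlapping subtests."""
    return len(ns) / sum(1.0 / n for n in ns)

def pooled_sd(groups):
    """Standard deviation pooled across all available ethnic groups in a
    sample; `groups` is a list of (n, sd) tuples, one per ethnic group."""
    ss = sum((n - 1) * sd ** 2 for n, sd in groups)
    df = sum(n - 1 for n, _ in groups)
    return math.sqrt(ss / df)

def d_vs_white(white_mean, group_mean, sd):
    """Standardized difference with the White group set to 0; positive
    values mean the group scored below the White mean."""
    return (white_mean - group_mean) / sd
```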

For sample type, we created three groups: Industrial/Governmental (which included occupational, public sector, and public driving samples), Military (which included Army, Navy, and Airforce samples), and Educational (which included College entrance, Law school, Medical school/Biomedical program, and Business school samples).

For test level, we coded the following constructs: g composite, general summary, verbal, and quantitative. Following Roth et al. (Citation2001) and Sackett and Shen (Citation2010), we computed g composite scores when scores were reported for two or more subtests but no composite score was reported. For example, for the Civil Service exam, verbal and quantitative scores were reported, but not overall scores. For the general summary scores, we used 1) g composite scores when available, or 2) scores from a single general test when not. Thus, a general summary score could be based on either a composite or a single test, and all samples (except no. 9) have a general summary score, while not all samples have g composite, verbal, and quantitative scores.
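The exact composite formula is not spelled out above, but in the Roth et al. (Citation2001)/Sackett and Shen (Citation2010) tradition, the d for an equally weighted composite of k standardized subtests is typically derived from the subtest ds and their mean intercorrelation. A sketch under that assumption:

```python
import math

def composite_d(subtest_ds, mean_r):
    """d for an equally weighted composite of k standardized subtests,
    given each subtest's d and the mean subtest intercorrelation.

    The composite's SD is sqrt(k + k*(k - 1)*mean_r), so the summed
    subtest ds are rescaled by that factor."""
    k = len(subtest_ds)
    return sum(subtest_ds) / math.sqrt(k + k * (k - 1) * mean_r)
```

Note that with perfectly correlated subtests the composite d equals the mean subtest d, while lower intercorrelations yield a larger composite d.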

The data for the meta-analysis were drawn from a number of sources, including technical manuals, published papers, and Freedom of Information (FOI) requests. SM File 2 provides a list of the data used in the meta-analysis along with details on where these data can be found.

Meta-analysis

To carry out the meta-analysis, we relied on the Hunter and Schmidt psychometric meta-analysis package developed by Schmidt and Le (Citation2014), using the random-effects model.

Effect sizes (d)

We set the White d of every dataset at 0 so that other groups were compared to that value. Thus, positive d-values reflect scores lower than the White mean, whereas negative d-values reflect scores higher than the White mean. As already noted, there were almost no nationally representative datasets in our meta-analysis, so almost all the ds are relative to the White mean for that specific sample. It is highly unlikely, however, that the White means of non-nationally representative samples all equal the population mean. So, this procedure does not allow us to make a meta-analytical statement about the population value for mean GMA, but it does allow conclusions about how strongly ethnic group means deviate from the White mean.

Analyses with the Schmidt and Hunter package

Supplementary File 2 is an Excel file showing each sample’s basic information. For each ethnic category, ethnic subcategory, and ethnic group, we computed the total number of data points (k) and the sample-size weighted effect size. The percentage of between-data-point variance explained by sampling error constituted our measure of consistency.
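A bare-bones version of these computations might look as follows. This is a simplified sketch of the Hunter and Schmidt approach, not the Schmidt and Le (Citation2014) package itself, and the sampling-error variance uses a common approximation for d values:

```python
def bare_bones_meta(ds, ns):
    """Bare-bones psychometric meta-analysis of d values: returns the
    N-weighted mean d, the observed between-study variance, and the
    percentage of that variance explained by sampling error."""
    total_n = sum(ns)
    d_bar = sum(n * d for d, n in zip(ds, ns)) / total_n
    var_obs = sum(n * (d - d_bar) ** 2 for d, n in zip(ds, ns)) / total_n
    # Approximate sampling-error variance of each d, then N-weight it:
    # var_e_i ~= (4 / n_i) * (1 + d_bar**2 / 8)
    var_err = sum(4.0 * (1 + d_bar ** 2 / 8.0) for _ in ns) / total_n
    pct = 100.0 if var_obs == 0 else min(100.0, 100.0 * var_err / var_obs)
    return d_bar, var_obs, pct
```

When the percentage approaches 100, the data points are treated as consistent, since their spread is attributable to sampling error alone; lower percentages suggest real moderators.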

Moderators

We tested three moderator variables, including: 1) ethnicity (ethnic category, ethnic subcategory, and ethnic group), 2) sample type (industrial/governmental, military, and educational), and 3) level of mental ability (general summary score versus verbal and quantitative scores). We included ethnicity because a vast literature shows substantial variability in mean GMA scores across various ethnic categories, ethnic subcategories, and ethnic groups. We included sample type because in their meta-analysis of Black/White differences, Roth et al. (Citation2001) report outcomes by sample type, showing d = 1.12 for educational samples, d = 1.10 for military samples, and d = .99 for industrial samples. So, there is a moderator effect for sample type, but most of the effects are clearly quite small (for instance, d = .13 between educational samples and industrial samples). Sample type might also act as a moderator in our analysis, and so we empirically test for this possibility via moderator analysis.

Table 1. Meta-analytic outcomes for all data points plus results for the moderator analysis on ethnicity: number of data points, mean ds, and percentage variance explained by Ethnic category, Ethnic subcategory, and Ethnic group for general summary scores.

We included the level of mental ability as a possible moderator because it has been found that Asian Americans show moderate differences in verbal versus mathematical test performance (Roth et al., Citation2017). Thus, following other reviews (Roth et al., Citation2001, Citation2017; Sackett & Shen, Citation2010), we examined the effect of level of mental ability, in this case, the general (g) scores versus the verbal and/or quantitative scores.

Additional analyses

While our samples come from a diverse range of occupations, industries, and educational backgrounds, it is not clear to what extent our meta-analytic results represent the cognitive profile of UK adult populations. Therefore, we compared our meta-analytic d values with d values computed from the cognitive ability scores reported by Pesta et al. (Citation2023), which were computed based on several nationally representative samples obtained between 2000 and 2016.

Additionally, to examine Pesta et al.’s (Citation2023) hypothesis that differences among ethnic groups might be attributed to assimilation factors—including proficiency in English and duration of residence in the UK—we correlated our meta-analytic d values with the percentage of each ethnic group that speaks English as a main language or speaks English well to very well, and with the percentage of the ethnic group born in the UK. These demographic data were sourced from the UK government’s Ethnicity Facts and Figures website (UK Government, Citation2024a, Citation2024b).

Pesta et al. (Citation2023) report strong correlations between group-level cognitive ability scores and scores predicted based on international achievement test scores from the countries or regions where each ethnic group originates. We repeated their analysis using average d values from our own meta-analysis and the predicted scores they reported. Additionally, we developed alternative predicted scores by matching the 288 specific ethnic categories from the 2021 UK Census (Office for National Statistics, Citation2022) to countries in the World Bank dataset (for example, Polish corresponds to Poland). For each of our narrow ethnic groups, we calculated predicted scores by multiplying the 2020 international test scores from the World Bank with the percentage of people from each country within that ethnic group. For instance, “Irish” was assigned the World Bank score for Ireland, and “Other White” received a score based on the weighted average of percentages of “Other White” individuals from countries like Poland, Romania, and Turkey, each weighted by their respective World Bank scores. Mixed ethnic groups were assigned the average World Bank score of the parental populations based on the assumption of a simple vertical cultural transmission model.
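The weighting logic for these predicted scores can be sketched as follows. The composition percentages and World Bank-style scores below are made-up illustrative numbers; only the weighted-average computation follows the procedure described above:

```python
# Hypothetical sketch of the region-of-origin prediction described above.
# A group's predicted score is the average of the origin countries' scores,
# weighted by the percentage of the group coming from each country.

def predicted_score(composition, country_scores):
    """Weighted average of country scores by the group's country composition."""
    total = sum(composition.values())
    return sum(country_scores[c] * p for c, p in composition.items()) / total

# e.g. an "Other White" group drawn from three origin countries:
other_white = {"Poland": 40, "Romania": 35, "Turkey": 25}   # % of the group
scores = {"Poland": 530, "Romania": 442, "Turkey": 478}     # illustrative scores
print(round(predicted_score(other_white, scores), 1))        # 486.2
```

For a single-origin group (e.g., "Irish"), the composition dictionary has one entry and the function simply returns that country's score; for mixed groups, the parental populations each receive weight 50 under the vertical-transmission assumption described above.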

Results

Descriptive statistics

Supplementary File 2 displays basic information for all the samples. The data include: the study number, the authors of the data point, the GMA test used, the sample type, the level of mental ability, the group (the ethnic category, the ethnic subcategory, the ethnic group), the sample size, the mean for the GMA test, the standard deviation (which equals 15 in all cases), and the effect size (d). A brief inspection of the file shows that there is a large variability in the d scores for the many data points.

Meta-analytic results

Table 1 shows the meta-analytic outcomes for all the data points together with the results of the moderator analysis on ethnicity for GMA summary scores. Column two shows the number of data points; column three shows the effect sizes, with a positive d indicating a mean score lower than that of the White group and a negative d indicating a mean score higher than that of the White group. Column four shows the percentage of variance in the data points explained by sampling error.

In Schmidt and Hunter’s approach to meta-analysis, two requirements have to be met before concluding that a moderator effect exists. First, there should be substantial differences in the d-values between the groups comprising the moderator analysis. Second, the percentage of variance explained should increase substantially after splitting up the sample of all the data points into each subgroup.

Table 1 begins with an analysis of all K = 328 data points for GMA. Note that this analysis includes all data points from both the group categories and their subcategories. The analysis of all the data points shows that a large amount of the variance between studies is explained by sampling error, but clearly not all of it. In other words, it is appropriate to look for other causes of the differences between studies, including potential moderators.

We next address whether ethnicity acts as a moderator at the level of ethnic categories. To evaluate these effects, we relied on Cohen (Citation1988), who characterized d = .2 as a small effect, d = .5 as a medium effect, and d = .8 as a large effect. The five ethnic categories analysed here are clearly not comparable regarding mean GMA scores: all minority groups scored below the mean of the White group, with ds varying from small to large. The amount of variance explained across all the data points is 70%, while the amount of variance explained within the five ethnic categories was minuscule in every case. So, we have only limited evidence that a moderator effect exists in this analysis.

We illustrate our approach by looking at the data for the three ethnic subcategories within the White ethnic category. The White category, by definition, had d = 0.00, with 100% of the variance between the K = 32 data points explained by sampling error, which is not surprising as all 32 ds have a value of 0. The White British subcategory had d = −.01, the White Irish subcategory had d = .09, and the White Other subcategory had d = .02. Given that Cohen (Citation1988) characterizes d = .2 as small, even the largest effect here (d = .09) seems trivial. As the variance explained is minuscule to small, we conclude there is no moderator effect at the level of the White subcategories. For all other comparisons (with one exception), the differences between the ds for the categories and their subcategories or ethnic groups were also minuscule to small. The exception is the d = −.48 value observed for the Chinese.

The percent of variance explained by sampling error is minuscule for the categories, and in a modest number of cases, it is somewhat larger for the subcategories and ethnic groups (but only the Arab group produces a large value here). At best, we have modest support for the existence of a small moderator effect in this analysis.

Again, an exception is the data point for the Chinese, who displayed a modest-sized effect but only a small increase in the percentage of variance explained. Thus we conclude there is no clear moderator effect here. A second exception is the data point for Arab samples, showing a small effect, but with all variance explained by sampling error. We conclude there is a small but reliable moderator effect in this analysis.

Table 2 shows whether the level of mental ability acts as a moderator. The first column shows the ethnic categories, the second column shows the sample sizes, the third column shows the number of data points, the fourth column shows the mean ds, and the fifth column shows the standard deviations of the observed ds. The sixth column shows the lower and upper values of the 80% credibility interval. The seventh column shows the percentage of variance explained. The eighth column shows the deltas (Δ), which indicate how much the ds increase or decrease for, respectively, a verbal score and a quantitative score relative to the ds for the general scores. A positive delta means the value of d increases, and a negative delta means the value of d decreases.

Table 2. Meta-analytic moderator analysis on level of mental ability: number of data points, mean ds, percentage variance explained in ds by Ethnic category, for General scores, verbal scores, and quantitative scores and mean Δs, which show how much the mean ds increase or decrease for, respectively, a verbal score and a quantitative score in comparison to the mean ds of the general scores.

In line with Schmidt and Hunter’s approach to moderators, we checked whether the ds in each comparison were meaningfully different, and whether the percentage of variance explained increased for most of the categories. The data in Table 2 show that almost all the values of Δ are minuscule or very small (there is only one small effect, for the Other group). The amount of variance explained for the categories is trivial for GMA, and does not increase noticeably for the verbal and quantitative scores. In sum, we conclude that no substantial moderator effect exists for the type of GMA test.

Table 3 shows whether sample type acts as a moderator of GMA scores. The first column shows the ethnic categories, the second column shows the sample sizes, the third column shows the number of data points, the fourth column shows the mean ds, the fifth column shows the standard deviation of the observed ds, the sixth column shows the lower and upper values of the 80% credibility interval, and the last column shows the percentage of variance explained. With two exceptions, there are only minuscule to small differences in d-values between the groups in the moderator analysis. The only two meaningfully different effect sizes are found for military samples in the Black and Other categories, respectively. The variance explained does not increase at all. So, despite the sizable effects for two of the military samples, we conclude there are no moderator effects in this analysis.

Table 3. Meta-analytical moderator analysis on sample type: number of data points, mean ds, and percentage variance explained by ethnic category for all types, Industrial/Governmental, military, and educational for general scores.

Additional analyses

Pesta et al. (Citation2023) reported cognitive ability scores for UK adults by ethnic group. For the 18 narrow ethnic groups, the d values from our meta-analysis and the corresponding d values computed from the data for all adults reported by Pesta et al. (Citation2023; Table S4) correlated at r = .84. This indicates a high degree of consistency. Despite this high correlation, the ethnic group differences found presently were smaller than those reported by Pesta et al. (Citation2023). For Mixed, Asians, Blacks, and Others, the d values in this meta-analysis were smaller by .12, .34, .14, and .16, respectively. More detailed results are provided in SM File 2.

Contrary to the suggestion by Pesta et al. (Citation2023), there were negligible and non-significant correlations between ethnic d values and the percentage of the ethnic group proficient in English (r = −.05; N = 18), and between ethnic d values and the percentage of the ethnic group born in the UK (r = .00; N = 18). This absence of association is illustrated by the observation that the group with the lowest mean d value (i.e., the highest mean GMA score), the Chinese, also had one of the lowest percentages of English proficiency (85%) and one of the lowest percentages of UK-born individuals (24%). So, these outcomes do not support the hypothesis that differences among ethnic groups are attributable to assimilation factors.

However, ethnic d values correlated strongly and statistically significantly with the predicted region-of-origin test scores reported by Pesta et al. (r = −.73; N = 16) and with the predicted region-of-origin test scores computed by the present authors (r = −.71; N = 18). The correlations remained high when dropping ethnic groups for whom a substantial percentage of the group had an unclear region of origin (i.e., Other Black; Gypsy, Roma and Irish Travellers; and Other Mixed) (r = −.79 to −.80; N = 15), when additionally dropping all other mixed groups (i.e., Caribbean-White, African-White, and Asian-White) (r = −.82 to −.83; N = 12), or when additionally dropping fairly heterogeneous groups (i.e., Other White, Other Asian, Any Other) (r = −.85 to −.86; N = 9). Excluding British Whites, on the grounds that they are not a migrant group, did not substantially change the results (r = −.67, N = 17, to r = −.83, N = 8). These results tentatively suggest that ethnic groups from countries and regions with more developed educational infrastructures (e.g., Irish, Other White, and Chinese) tend to outperform those from regions with less developed educational systems (e.g., Pakistanis, Caribbeans, Africans, and Arabs).

Discussion

Main findings

We reviewed group differences in GMA selection test scores by ethnicity for UK test takers. Our aim was to assess the magnitude of differences on these tests, as this is relevant to concerns about potential adverse impact. Compared with Whites, the four major non-White ethnic categories scored slightly to substantially lower, with ds = .14, .65, .33, .49 for the Mixed, Black, Asian, and Other groups, respectively. A more fine-grained analysis showed that the Chinese slightly outscored Whites at d = −.15, while the non-Chinese Asian groups scored slightly to substantially lower, with ds of .37 (South Asian) and .27 (Other Asian). Moreover, among South Asians, scores varied by nationality, with ds of .17 (Indian), .49 (Pakistani), and .55 (Bangladeshi). Finally, we found that the three Black groups (Caribbean, African, and Other) scored similarly (ds = .66 to .68), as did the three Other groups (Arabs; Gypsy, Roma and Irish Travellers; Any Other; ds = .51 to .61). Given Cohen’s (Citation1988) effect size guidelines reviewed above, we interpret the results as broadly showing small to moderately large GMA differences between White and non-White ethnic groups.

A modest amount of heterogeneity in test scores existed in our analyses, depending on two factors: the level of ability (general test score versus verbal and quantitative test scores), and the sample type (industrial/governmental, military, educational). That said, the general trends indicated small to very small effects. Unfortunately, for these two analyses, the sample sizes did not allow us to decompose the results by more narrow ethnic categories. As a result of using broad categories, our moderator analysis may have missed some of this heterogeneity.

We also found that the meta-analytic d values for the 18 narrow ethnic groups correlated very strongly with Pesta et al.’s (Citation2023) GMA test scores, which were derived from representative samples of 21st-century UK adults. Therefore, GMA ethnic differences based on selection samples generally correspond with population-level ethnic differences, albeit the effect sizes we found for the non-White groups were generally smaller than those reported by Pesta et al. (Citation2023).

Comparisons with USA data

An extensive literature on group differences in selection tests exists in the USA. For example, Roth et al. (Citation2017) meta-analysed The College Board’s SAT data from 1996–2014. They found a Black-White composite score gap of d = 1.1, together with an Asian-White gap of d = −.16. The authors also noted that the Asian-White gap increased to d = −.28 for the most recent years in their analyses. In comparison, we find a UK Black-White GMA gap of d = .65 and an Asian-White gap of d = .33.

As a more recent example, Murray (Citation2021) reviewed U.S. applicant test score data encompassing undergraduate (ACT & SAT), law (LSAT), medical (MCAT), and graduate business (GMAT) school entrance exams. After converting the mean score differences into composite scores, the Black-White postgraduate differences ranged from d = .79 (GMAT) to d = 1.08 (LSAT). The Black-White college entrance gap was d = .90 (ACT/SAT). Meanwhile, the Asian-White postgraduate differences ranged from d = .05 (GMAT) to d = −.10 (MCAT), with the college entrance gap being d = −.54 (ACT/SAT). Thus, even Murray’s (Citation2021) more recent data (relative to Roth et al., Citation2017) are consistent with our interpretation. In sum, Asians and Blacks perform substantially differently in the USA versus the UK.

The relative advantage of Asians in the USA over those in the UK might be due to differences in ethnic composition and selectivity, as Asian migrants to the USA are often more selective and ancestrally North East Asian. Conversely, the relative advantage of Blacks in the UK over those in the USA may be due to selectivity or to differing histories of race relations, with Blacks in the USA largely being descendants of a historically marginalized group. Future theorizing on adverse impact should consider a cross-national comparative perspective.

Potential reasons for group differences in the UK

Four potential reasons for subgroup differences in the UK were discussed in the Introduction, and now we relate the findings to these explanations and see if they were supported by the data. First, Cottrell et al. (Citation2015) tried to explain GMA differences in the USA in terms of centuries of housing, educational, and occupational segregation. This explanation is less plausible for the UK, where immigrants generally arrived within the last century and did not experience such long-term segregation.

Second, we hypothesized that ethnic groups might differ in terms of degree of cultural assimilation as indexed by English fluency and birth in the UK. However, the data showed a virtual absence of a relationship between the size of group differences in GMA on the one hand and both the percentage of the ethnic group proficient in English and the percentage of the ethnic group born in the UK on the other hand, offering no support for this hypothesis. On the other hand, as noted previously, we found that the differences between non-White and White groups were smaller than reported by Pesta et al. (Citation2023). A possible reason for the effect size differences is that the samples here tend to be more culturally assimilated, since the data were more recent and the participants are younger (e.g., the median publication year for the six studies reported by Pesta et al. (Citation2023) was approximately 2010, while the median publication year for samples in the present study was approximately 2017). Consistent with this hypothesis, Pesta et al.’s Table A7 shows that both the Asian-White and Black-White differences are smaller for younger cohorts, especially for individuals born in the UK. On balance, these findings offer some support for the position that more recent cohorts are more culturally assimilated, leading to smaller group differences. However, we lack the data to test this hypothesis rigorously, so the smaller differences reported here could also be due to unexamined moderators.

Third, we hypothesized that differential migrant selection would affect scores. While we could not directly test this hypothesis, Luthra and Platt (Citation2023) reported only a weak relationship between migrant selection and cognitive differences among UK immigrant groups, suggesting that differences in migrant selection may not be a strong factor in accounting for the ethnic differences in the UK.

Fourth, we hypothesized that region of origin factors could explain GMA differences because first- and second-generation migrants from countries with less well-developed educational systems tend to perform poorly relative to those from other countries with more-developed educational systems. These findings suggest that, in the UK, ethnic groups from countries and regions with a more highly developed educational infrastructure (e.g., Irish, Other White, and Chinese) tend to outperform those from regions with less highly developed educational systems (e.g., Indians, Pakistanis, Bangladeshi, Caribbeans, Africans, and Arabs).

In sum, as to the potential causes of group differences: Cottrell et al.’s (Citation2015) segregation hypothesis is less plausible, there is mixed support for the assimilation hypothesis, and differential migrant selection does not show a substantial effect. On the other hand, there is evidence that region of origin strongly predicts ethnic differences.

Theoretical implications of findings to ethnic adverse impact research

As underscored by the comparison between socially-defined Asians and Blacks in the USA vs. in the UK, ethnic differences on GMA tests cannot be presumed to generalize across countries. This conclusion has obvious implications for adverse impact research. Most notably, ethnic/racial group differences need to be investigated on a country-by-country basis.

Since first-generation immigrants often perform poorly on cognitive measures partially due to linguistic and other forms of psychometric bias (e.g., Batalova & Fix, Citation2015), it is not surprising that there might be corresponding ethnic differences in measured GMA scores. However, ethnic differences often remain after adjusting for the effects of language bias, and members from majority and various second-generation migrant groups, who are more fluent in the language of the test, also often show substantially different scores on GMA tests (e.g., Te Nijenhuis et al., Citation2016).

To account for such differences, we hypothesize that, all else being equal, first- and second-generation migrants from nations with better-performing educational systems will tend to outperform those from nations with less well-performing systems. This hypothesis aligns with the results of a substantial body of educational research examining migrants’ academic achievements across different countries, connecting their performance to sociocultural influences from the countries of origin in addition to influences in the countries of destination (De Philippis & Rossi, Citation2021; Dronkers et al., Citation2014; Figlio et al., Citation2019; Hanushek et al., Citation2022). This hypothesis does not discount other factors, such as country-specific influences (e.g., Cottrell et al., Citation2015) or the role of migrant selectivity (e.g., Model, Citation2008).

While ethnic group differences on GMA tests in the USA are often attributed to discrimination (e.g., Cottrell et al., Citation2015), we contend that this framework may not be generalizable to other countries with distinctly different historical contexts. Future research—particularly concerning recent migrants—would benefit by incorporating both country-of-destination and country-of-origin factors, as is done in educational research (e.g., Dicks et al., Citation2019).

Practical implications regarding adverse impact

A problem with assessing whether GMA tests create adverse impact in the UK is that no simple rule-of-thumb exists as a preliminary test of whether a selection practice has created legally significant disadvantages. This judgement is instead based on contextual factors.

Nonetheless, while collecting data for this study, we were notified by the Ministry of Education (in the wake of Essop and others v Home Office, Citation2017) that it was discontinuing the use of the Professional Skills Test due to adverse impact concerns. Note that the Skills Test produced ds of .19, .50, .78, and .53 for the Mixed, Asian, Black, and Other ethnic groups, respectively. These ds represent small- to large-sized differences.

Additionally, in the case of Essop and others v Home Office (Citation2017), the UK Supreme Court ruled in favour of Essop. The court found that a skills test created legally significant adverse impact, as Black and Minority Ethnic (BME) candidates had a pass rate that was 43% of the rate for White candidates. This discrepancy aligns with the expected outcome from a standardized group difference of d = 0.5 when selecting the top 10% of candidates, as noted in Sackett and Shen (Citation2010; Table 17.1).
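The link between a standardized difference and unequal pass rates can be illustrated with a simple normal model. This is our own rough sketch, not Sackett and Shen's actual table; the exact figure depends on assumptions about the applicant pool (here the cutoff is set on the majority distribution alone):

```python
# A rough normal-model illustration of how a d = 0.5 gap translates into
# unequal pass rates when only the top 10% of applicants clear the cutoff.
from statistics import NormalDist

d, top_fraction = 0.5, 0.10
z = NormalDist()

# Cutoff set so that 10% of the majority group (mean 0, SD 1) pass.
cutoff = z.inv_cdf(1 - top_fraction)

pass_majority = 1 - z.cdf(cutoff)                  # 0.10 by construction
pass_minority = 1 - NormalDist(mu=-d).cdf(cutoff)  # mean shifted down by d SDs

ratio = pass_minority / pass_majority
print(f"minority pass rate is {ratio:.0%} of the majority rate")
```

Under these assumptions the minority pass rate comes out at a bit over a third of the majority rate, the same order of magnitude as the 43% figure in the Essop case.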

We might reasonably infer from both the Ministry of Education’s and the Supreme Court’s decision that medium-size mean differences may lead to adverse impact in the UK. Our meta-analytic results imply that South Asians from Pakistan and Bangladesh, along with the Black and Other groups, score in this range (with more pronounced differences existing on Military tests). Thus, GMA testing in the UK may create legally significant adverse impact such that members of affected groups will be less successful in getting hired or promoted, or in being accepted into degree programs.

According to Essop and others v Home Office (Citation2017), “a wise employer will monitor how his policies and practices impact upon various groups … and will try and see what can be modified to remove that impact while achieving the desired result”. In the case of selection practices, this may involve supplementing or replacing GMA tests with alternative predictors of work and/or job training performance, such as personality tests (Ployhart & Holtz, Citation2008). Strategies for reducing ethnic differences have been extensively discussed in the literature (De Soete et al., Citation2012; Ployhart & Holtz, Citation2008; Roth et al., Citation2017).

The diversity/validity dilemma

An important tradeoff exists for employers or educators opting to use GMA tests when selecting applicants (De Soete et al., Citation2012; Ployhart & Holtz, Citation2008). Specifically, and until recently, GMA tests were considered second to none in terms of predictive validity (but see Sackett et al., Citation2022). However, the cost for this validity is moderate-to-large adverse impact against various ethnic and racial groups. The tradeoff, though, is not unprecedented as a similar situation exists regarding sex as a protected class, and when selecting people based on physical strength for various jobs (e.g., firefighter) where strength is an essential job function.

In either event, employers will likely be able to defend against any resulting adverse impact because GMA tests are valid and unbiased predictors of job and/or educational success, and so, too, are physical strength requirements for jobs that obviously demand them. Recall also that the mere existence of harm is not necessarily illegal in the UK. Here, adverse impact is legally permissible when the selection practice is “proportionate.” To be proportionate, the practice must cause no more adverse impact than is necessary, which translates to: No other equally valid alternative method exists that results in less adverse impact.

But is it worth the hassle and the resulting lack of diversity that results when tests like these are used in employment decisions? Unfortunately, we have no one-size-fits-all answer to this question. Regardless, it is still critical for decision-makers to understand the legal context when opting for this or that selection practice.

The situation regarding GMA testing has become more complicated, given Sackett et al.’s (Citation2022) recent analyses showing that the predictive validity of these tests may have been overstated in prior studies. We speculate that the debate regarding Sackett et al.’s arguments will not be resolved quickly (see, e.g., Ones & Viswesvaran, Citation2023; Oh et al., Citation2023 for recent replies to Sackett et al., Citation2022). With the literature here in flux, we are reluctant to offer specific recommendations on whether using GMA tests is still ideal. Instead, we suspect that GMA testing will remain ubiquitous until and unless the science suggests that other methods exist for selecting people, and that these methods are about as valid as GMA testing, but do not harm protected classes. Schmidt and Hunter (Citation1998) warn against discontinuing the use of GMA tests, as they have several important advantages: good predictive validity for both job performance and training performance, relatively cheap use, and, importantly, a clear understanding of how they predict job performance.

An elegant way to increase both workforce diversity and average job performance is to rely on an extensive selection procedure, that is, one using a careful combination of various instruments and procedures. As ds are considerably smaller on, for instance, Situational Judgement Tests and personality questionnaires than on GMA tests, adding them to the selection procedure will reduce the value of d considerably; the d for the extensive selection procedure will come closer to the d for work performance. Choosing which selection measures to add to GMA tests involves several considerations. The first is the predictive validity of the additional measure, with a preference for high predictive validity. A second consideration is the correlation between the GMA test and the additional measure, with a preference for low correlations; interview scores and GMA scores correlate substantially, whereas Big Five scores and GMA scores generally show small correlations. So, the ideal additions are measures with minimal correlations with GMA scores and high predictive validity. Obviously, using an extensive selection procedure comes with increased costs.
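The mechanism can be sketched with the standard formula for the subgroup d of a unit-weighted composite of two standardised predictors (assuming the within-group correlation between them is the same in both groups; the ds and correlation below are hypothetical):

```python
# A minimal sketch of why adding a low-adverse-impact predictor shrinks the
# composite d. For a unit-weighted sum of two standardised predictors with
# subgroup ds d1 and d2 and within-group correlation r:
#   d_composite = (d1 + d2) / sqrt(2 + 2r)
import math

def composite_d(d1, d2, r):
    """Subgroup d of the unit-weighted composite of two predictors."""
    return (d1 + d2) / math.sqrt(2 + 2 * r)

# A GMA test with d = .65 combined with a personality scale with d = .10,
# assuming the two correlate at r = .10 within groups:
print(round(composite_d(0.65, 0.10, 0.10), 2))  # 0.51, down from .65
```

Note that the same formula also shows the cost side of the tradeoff: a low-d, low-validity addition dilutes the composite's predictive power even as it reduces adverse impact, which is why the text recommends additions with both low GMA correlations and high validity.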

Limitations of the study

Our data collection process was limited by the fact that British government institutions allow only a small number of FOI requests, so that not all data could be accessed. Also, the government stores its data for only a limited number of years, and even some quite recent datasets are no longer accessible. Still, with a total sample size of over two million, we are confident that our estimates are reliable.

Indeed, the percentage of variance explained by sampling error (which depends on sample size) was extremely small in most of our analyses, whereas sampling error often explains a substantial amount of variance in other studies featuring smaller samples. Moreover, because our computed ds were based on comparing two groups, we adopted Schmidt and Hunter’s (Citation2015) recommendation to use the Ns of both groups when computing the total N for d. This method gives less weight to the larger group in the analysis. As the Ns for the White group are large to very large, even Schmidt and Hunter’s formula still results in very large total Ns in the frequent case where the N for the second group is quite modest. This could be one reason why, in so many of our analyses, an extremely small percentage of variance is explained by sampling error. Using a strict harmonic N, which gives much less weight to the largest sample, might instead have led to a sizable amount of variance being explained.
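The weighting issue can be made concrete with hypothetical group sizes (our own illustration of the total-N versus harmonic-N contrast discussed above):

```python
# Illustration of the weighting issue discussed above (hypothetical Ns):
# the total N treats a study with a huge White group as if it were very
# precise, while the harmonic N is dominated by the smaller group.
n_white, n_minority = 50_000, 300

total_n = n_white + n_minority
harmonic_n = 2 * n_white * n_minority / (n_white + n_minority)

print(total_n)            # 50300
print(round(harmonic_n))  # 596 -- barely more than twice the smaller group
```

Because the precision of d is effectively limited by the smaller group, a study weighted by its total N of 50,300 looks roughly 84 times more precise than the harmonic N suggests, which is consistent with the very small percentages of variance attributed to sampling error in our analyses.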

Another limitation of this study is that we did not know the g loadings for all the different test batteries we analysed, and so we could not include this variable as a moderator in our meta-analysis. For some context, it is well known that cognitive test batteries differ in how strongly they measure GMA. Reasons for this include test differences in item difficulties, the number of subtests, and the strength of their g loadings (e.g., Farmer et al., Citation2020). While some of the test batteries we meta-analysed are likely strongly g-loaded (e.g., Military tests), we are less sure about the g loadings for the other batteries (e.g., Civil Service tests) in our analyses.

This issue is relevant for two reasons. First, in both the USA and the Netherlands, the magnitude of ethnic group differences varies with a test’s or subtest’s g loading (McDaniel & Kepes, Citation2014; Te Nijenhuis et al., Citation2016). We could not test this hypothesis with the UK data, but we consider g loading a plausible unexplored moderator of group differences in this country. Second, we suspect our GMA measures contained considerable heterogeneity in their g loadings (although we assume our measures of verbal and quantitative ability were more homogeneous). Thus, our moderator test for “level of ability” was likely suboptimal, as we lacked information on each test’s g loading.

Suggestions for future research

First, since our analyses demonstrate that GMA testing in the UK disadvantages various ethnic groups, more research is needed to better understand why these differences exist (e.g., language bias). It would therefore be helpful if future research explored the psychometric nature of group differences in the UK. One question is whether tests are unbiased in the sense that measurement invariance holds between ethnic groups, as often found in the USA (e.g., Wilson et al., Citation2023). We are not aware of any research that has addressed this issue in the UK, despite its importance for understanding the source of differences (Maassen et al., Citation2023).

Second, research should investigate patterns of differences within ethnic groups. Our data did not allow us to examine whether test-takers’ ages, or generational differences across test-takers (e.g., migrant generation), partially explain the GMA gaps we observed. Likewise, we could not examine whether test type (e.g., maths vs. verbal) interacted with group differences for the more narrowly represented ethnic groups in the UK; our broad coding of ethnicity may have attenuated such effects. For example, Asian groups may show different performance profiles on verbal versus quantitative tests: Pesta et al. (Citation2023) reported large discrepancies in GMA test scores for the Chinese across these test types. Differences in English language proficiency may be at play here, but these issues can only be resolved with more data.

Third, the d values for the Black-White comparisons here were noticeably lower than those typically found in the USA. Although we hypothesized that this gap might reflect differential patterns of Black immigration into the two countries, this remains an untested empirical question that future research could address. One might, for example, compare Black-White differences in the UK with differences between Whites and first- and second-generation Blacks in the USA.

Conclusion

The present meta-analysis featured a combined sample of more than two million UK test takers. We believe this large sample allowed us to draw relatively strong conclusions, as detailed above. At any rate, it is clear that selection based on GMA testing in the UK may create moderate to strong adverse impact against certain ethnic groups. How to explain, and then ameliorate, these differences remain critical but open empirical questions.

Supplemental material

Supplemental file 2 v17a.xls

Supplemental file 1 v17a.docx

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

All of the data used for the computation of results are available at, or through, the sites/linked resources described in Supplementary File 2.

Supplementary data

Supplemental data for this article can be accessed online at https://doi.org/10.1080/1359432X.2024.2377780

References

  • Adesote, S. A., & Osunkoya, O. A. (2018). The brain drain, skilled labour migration and its impact on Africa’s development, 1990s–2000s. Africology: The Journal of Pan African Studies, 12(1), 395–420.
  • Aspinall, P. J. (2003). Who is Asian? A category that remains contested in population and health research. Journal of Public Health, 25(2), 91–97. https://doi.org/10.1093/pubmed/fdg021
  • Aspinall, P. J. (2013). Do the “Asian” categories in the British censuses adequately capture the Indian sub-continent diaspora population? South Asian Diaspora, 5(2), 179–195. https://doi.org/10.1080/19438192.2013.740226
  • Baron, H., Martin, T., Proud, A., Weston, K., & Elshaw, C. (2003). Ethnic group differences and measuring cognitive ability. In C. Cooper & I. Robertson (Eds.), International Review of Industrial and Organizational Psychology (Vol. 18, pp. 191–238). John Wiley and Sons Ltd. https://doi.org/10.1002/0470013346.ch6
  • Batalova, J., & Fix, M. (2015). Through an immigrant lens: PIAAC assessment of the competencies of adults in the United States. Migration Policy Institute.
  • The British Psychological Society. (n.d.). Psychological Testing. List of Test Publishers. https://portal.bps.org.uk/PTC/List-of-Test-Publishers
  • Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press.
  • Cassino, D. (2023). Symposium on adding a Middle Eastern or North African category to the US census. Survey Practice, 16(1), 1–3. https://doi.org/10.29115/SP-2023-0021
  • Cattaneo, M. A., & Wolter, S. C. (2012). Migration policy can boost PISA results – findings from a natural experiment. Swiss coordination centre for research in education staff paper 7.
  • Chakravorty, S., Kapur, D., & Singh, N. (2016). The other one percent: Indians in America. Oxford University Press.
  • CIPD. (2017). Resourcing and talent planning survey 2017. http://www.cipd.com
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Cottrell, J. M., Newman, D. A., & Roisman, G. I. (2015). Explaining the black–white gap in cognitive test scores: Toward a theory of adverse impact. Journal of Applied Psychology, 100(6), 1713. https://doi.org/10.1037/apl0000020
  • De Philippis, M., & Rossi, F. (2021). Parents, schools and human capital differences across countries. Journal of the European Economic Association, 19(2), 1364–1406. https://doi.org/10.1093/jeea/jvaa036
  • De Soete, B., Lievens, F., & Druart, C. (2012). An update on the diversity-validity dilemma in personnel selection: A review. Psihologijske teme, 21(3), 399–424.
  • Dewberry, C. (2011). Aptitude testing and the legal profession. Legal Services Board.
  • Dicks, A., Dronkers, J., & Levels, M. (2019). Cross-nationally comparative research on racial and ethnic skill disparities: Questions, findings, and pitfalls. In P. A. J. Stevens & A. G. Dworkin (Eds.), The palgrave handbook of race and ethnic inequalities in education (pp. 1183–1215). Palgrave Macmillan US.
  • Dronkers, J., Levels, M., & de Heus, M. (2014). Migrant pupils’ scientific performance: The influence of educational system features of origin and destination countries. Large-Scale Assessments in Education, 2(1), 1–28. https://doi.org/10.1186/2196-0739-2-3
  • Equality Act. (2010). https://www.legislation.gov.uk/ukpga/2010/
  • Essop and others v Home Office. (2017). UKSC 27.
  • Evers, A., Te Nijenhuis, J., & van der Flier, H. (2005). Ethnic bias and fairness in personnel selection: Evidence and consequences. In A. Evers, N. Anderson, & O. Voskuijl (Eds.), The Blackwell handbook of personnel selection (pp. 306–328). Blackwell.
  • Farmer, R. L., Floyd, R. G., Reynolds, M. R., & Berlin, K. S. (2020). How can general intelligence composites most accurately index psychometric g and what might be good enough? Contemporary School Psychology, 24(1), 52–67. https://doi.org/10.1007/s40688-019-00244-1
  • Figlio, D., Giuliano, P., Özek, U., & Sapienza, P. (2019). Long-term orientation and educational performance. American Economic Journal, Economic Policy, 11(4), 272–309. https://doi.org/10.1257/pol.20180374
  • Gibson, M. A., & Ogbu, J. U. (Eds.). (1991). Minority status and schooling: A comparative study of immigrants and involuntary minorities. Garland.
  • Griggs v. Duke Power Co. (1971). 401 U.S. 424.
  • Hanushek, E. A., Kinne, L., Lergetporer, P., & Woessmann, L. (2022). Patience, risk-taking, and human capital investment across countries. The Economic Journal, 132(646), 2290–2307. https://doi.org/10.1093/ej/ueab105
  • Ho, A. D., & Reardon, S. F. (2012). Estimating achievement gaps from test scores reported in ordinal “proficiency” categories. Journal of Educational and Behavioral Statistics, 37(4), 489–517. https://doi.org/10.3102/1076998611411918
  • Jensen, A. R. (1998). The g factor: The science of mental ability. Praeger.
  • Luthra, R. R., & Platt, L. (2023). Do immigrants benefit from selection? Migrant educational selectivity and its association with social networks, skills and health. Social Science Research, 113, 102887. https://doi.org/10.1016/j.ssresearch.2023.102887
  • Maassen, E., D’Urso, E. D., Van Assen, M. A. L. M., Nuijten, M. B., De Roover, K., & Wicherts, J. M. (2023). The dire disregard of measurement invariance testing in psychological science. Psychological Methods. Online ahead of print. https://doi.org/10.1037/met0000624
  • McDaniel, M. A., & Kepes, S. (2014). An evaluation of Spearman’s hypothesis by manipulating g saturation. International Journal of Selection and Assessment, 22(4), 333–342. https://doi.org/10.1111/ijsa.12081
  • McGrew, K. S. (2009). CHC theory and the human cognitive abilities project: Standing on the shoulders of the giants of psychometric intelligence research. Intelligence, 37(1), 1–10. https://doi.org/10.1016/j.intell.2008.08.004
  • Model, S. (2008). West Indian immigrants: A black success story? Russell Sage Foundation.
  • Murray, C. (2021). Facing reality: Two truths about race in America. Encounter Books.
  • Office for National Statistics. (2003). Ethnic group statistics: A guide for the collection and classification of ethnicity data. The Stationery Office.
  • Office for National Statistics. (2011). Census 2011. Office for National Statistics.
  • Office for National Statistics. (2022). Ethnic group (detailed). Accessed at: https://www.ons.gov.uk/datasets/TS022/editions/2021/versions/1
  • Oh, I.-S., Le, H., & Roth, P. L. (2023). Revisiting Sackett et al.’s (2022) rationale behind their recommendation against correcting for range restriction in concurrent validation studies. Journal of Applied Psychology, 108(8), 1300–1310. https://doi.org/10.1037/apl0001078
  • Ones, D. S., & Viswesvaran, C. (2023). A response to speculations about concurrent validities in selection: Implications for cognitive ability. Industrial and Organizational Psychology, 16(3), 358–365. https://doi.org/10.1017/iop.2023.43
  • Pesta, B. J., Te Nijenhuis, J., Fuerst, J. G., & Shibaev, V. (2023). Links between ethnicity, socioeconomic status, and measured cognition in diverse samples of UK adults. Comparative Sociology, 22(6), 785–823. https://doi.org/10.1163/15691330-bja10094
  • Ployhart, R. E., & Holtz, B. C. (2008). The diversity–validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61(1), 153–172. https://doi.org/10.1111/j.1744-6570.2008.00109.x
  • Rees, P. H., Wohland, P., Norman, P., Lomax, N., & Clark, S. D. (2017). Population projections by ethnicity: Challenges and solutions for the United Kingdom. In D. A. Swanson (Ed.), The frontiers of applied demography (pp. 383–408). Springer International Publishing.
  • Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S. I., & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54(2), 297–330. https://doi.org/10.1111/j.1744-6570.2001.tb00094.x
  • Roth, P. L., Van Iddekinge, C. H., DeOrtentiis, P. S., Hackney, K. J., Zhang, L., & Buster, M. A. (2017). Hispanic and Asian performance on selection procedures: A narrative and meta-analytic review of 12 common predictors. Journal of Applied Psychology, 102(8), 1178–1202. https://doi.org/10.1037/apl0000195
  • Sackett, P. R., & Shen, W. (2010). Subgroup differences on cognitive tests in contexts other than personnel selection. In J. Outtz (Ed.), Adverse impact: Implications for organizational staffing and high stakes selection (pp. 323–348). Routledge.
  • Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107(11), 2040. https://doi.org/10.1037/apl0000994
  • Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2023). Revisiting the design of selection systems in light of new findings regarding the validity of widely used predictors. Industrial and Organizational Psychology, 16(3), 1–18. https://doi.org/10.1017/iop.2023.24
  • Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274. https://doi.org/10.1037/0033-2909.124.2.262
  • Schmidt, F. L., & Hunter, J. E. (2015). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Sage.
  • Schmidt, F. L., & Le, H. (2014). Software for the Hunter-Schmidt meta-analysis methods, version 2.0. University of Iowa, Department of Management & Organizations.
  • Sharma, S. (2015). South Asian diaspora in Europe and the United States. In B. Kachru, Y. Kachru, & S. Sridhar (Eds.), Language in South Asia (pp. 515–533). Cambridge University Press.
  • Stillwell, D. (2022, January 25). Comparing ethnicity data for different countries. Data in government. Retrieved: https://dataingovernment.blog.gov.uk/2022/01/25/comparing-ethnicity-data-for-different-countries/
  • Suzuki, L., Mogami, T., & Kim, E. (2002). Interpreting cultural variations in cognitive profiles. In K. Kurasaki, S. Okazaki, & S. Sue (Eds.), Asian American mental health: Assessment theories and methods (pp. 159–171). Kluwer Academic/Plenum.
  • Te Nijenhuis, J., Willigers, D., Dragt, J., & van der Flier, H. (2016). The effects of language bias and cultural bias estimated using the method of correlated vectors on a large database of IQ comparisons between native Dutch and ethnic minority immigrants from non-Western countries. Intelligence, 54, 117–135. https://doi.org/10.1016/j.intell.2015.12.003
  • UK Government. (2024a). English language skills. Ethnicity facts and figures. Retrieved: https://www.ethnicity-facts-figures.service.gov.uk/uk-population-by-ethnicity/demographics/english-language-skills/latest/
  • UK Government. (2024b). People born outside the UK. Ethnicity facts and figures. Retrieved: https://www.ethnicity-facts-figures.service.gov.uk/uk-population-by-ethnicity/demographics/people-born-outside-the-uk/latest/#ethnic-groups-by-region-of-birth
  • Wilson, C. J., Bowden, S. C., Byrne, L. K., Joshua, N. R., Marx, W., & Weiss, L. G. (2023). The cross-cultural generalizability of cognitive ability measures: A systematic literature review. Intelligence, 98, 101751. https://doi.org/10.1016/j.intell.2023.101751
  • Woolf, K., Potts, H. W., & McManus, I. C. (2011). Ethnicity and academic performance in UK trained doctors and medical students: Systematic review and meta-analysis. BMJ, 342, d901. https://doi.org/10.1136/bmj.d901