Research Article

An examination of model fit and measurement invariance of general mental ability and personality measures used in the multilingual context of the Swiss Armed Forces: A Bayesian structural equation modeling approach

Pages 96-113 | Received 23 Mar 2021, Accepted 19 Jul 2021, Published online: 28 Oct 2021

ABSTRACT

Measurement invariance of psychological test batteries is an essential quality criterion when the test batteries are administered in different cultural and language contexts. The purpose of this study was to examine to what extent measurement model fit and measurement invariance across the two largest language groups in Switzerland (i.e., German and French speakers) can be assumed for selected general mental ability and personality tests used in the Swiss Armed Forces’ cadre selection process. For the model fit and invariance testing, we used Bayesian structural equation modeling (BSEM). Because the sizes of the language group samples were unbalanced, we reran the invariance testing with a subsampling procedure as a robustness check. The results showed that at least partial approximate scalar invariance can be assumed for the constructs. However, comparisons in the full sample and subsamples also showed that certain test items function differently across the language groups. The results are discussed with regard to the following three issues: First, we critically discuss the applied criterion and alternative effect size measures for assessing the practical importance of non-invariances. Second, we highlight potential remedies and further testing options that can be applied once certain items have been detected to function differently. Third, we discuss alternative modeling and invariance testing approaches to BSEM and outline future research avenues.

What is the public significance of this article?—If psychological test batteries are meant to be valid in a cross-cultural context, two central prerequisites must be met by the utilized measures. First, the scale items have to adequately reflect the underlying intended constructs. Second, the employed measures need to have the same psychometric properties in each cultural group. By examining these two important issues in the case of the multilingual psychological cadre assessment of the Swiss Armed Forces, this study helped to ensure the quality of this assessment and to identify possibilities for improvement. In a broader perspective, however, the study may also provide guidance to researchers who have to deal with challenges similar to those that we faced (i.e., examining misfitting models for invariance in the context of unbalanced samples) and thus may help them to obtain robust findings despite these challenges.

Introduction

In the conscription system of the Swiss Armed Forces, cadre selection is based on various assessments. In an initial selection step, the training camp commander and his staff conduct interviews, military exercises, and knowledge tests with the recruits during the 18 to 21 weeks of basic military training. Recruits who are considered most promising for a cadre position are sent to a cadre school after basic military training. The final assessments take place at the cadre school. There, it is decided which candidates are fit for lower cadre positions (i.e., squad leader positions) and thus will complete cadre school only and which candidates are fit for higher noncommissioned officer (NCO) or officer positions and thus are permitted entry into the higher NCO or officer school (Swiss Armed Forces, 2012).

In this final selection of NCOs and officers, psychological cadre assessment at the recruitment centers has a key role (Swiss Armed Forces, 2012). This includes the assessment of 19 leadership characteristics/dimensions that are grouped into the five broad sections: “leadership motivation,” “general mental ability,” “self-competence,” “social-competence,” and the category “extra dimensions” (see Table 1). Assessment of 13 of these 19 dimensions is based on the candidate’s self-ratings (i.e., motivation and personality), two dimensions on objective test results (i.e., general mental ability), and four dimensions on the candidate’s performance on a concept development exercise (Footnote 1), which is rated by two psychologists (Swiss Armed Forces, 2012). Scores on the 19 dimensions are then converted to stanine values, which is done to allow easier comparison of a candidate’s assessment result with the population norms of cadre applicants (Swiss Armed Forces, 2012). For the final cadre selection decision by the training camp commander and his staff, the stanine values are then summarized on a one-page output sheet (Swiss Armed Forces, 2012).

Table 1. Scales, their features, and Cronbach’s alphas for the previously studied (i.e., 2012) and the presently studied (i.e., 2019) assessment cycles.

As this psychological assessment takes place in Switzerland’s multilingual context, the employed measures not only need to be valid but also need to be equally valid across the language groups (e.g., German and French speakers). Regarding construct validity, for instance, not only do the items of the employed measures have to adequately reflect the underlying intended construct (i.e., the measurement model should fit the data); in addition, the employed measures need to have the same psychometric properties in each language group (i.e., the measurement model parameters should be invariant across groups).

Despite the relevance of this issue, no up-to-date and rigorous examination of the measurement invariance of the psychological assessment is available. Also, previous studies on the psychometric properties of the measures (Goldammer, 2019; Maier, 2008) relied primarily on conventional model fit and invariance testing procedures (i.e., [single and multiple group] confirmatory factor analysis [CFA] with maximum likelihood [ML]), which offer limited options for dealing appropriately with models that fail the exact fit and/or exact invariance test (i.e., χ2 and/or Δχ2, respectively) (Muthén & Asparouhov, 2012, 2013).

The aim of this article is therefore to (re)examine selected general mental ability (GMA) and personality tests used in the psychological cadre assessment (rows in gray, Table 1) regarding model fit and measurement invariance across two major language groups in Switzerland (i.e., German and French speakers) by using Bayesian structural equation modeling (BSEM; Muthén & Asparouhov, 2012) – a modeling and invariance testing approach that takes potential model misfit and non-invariance more appropriately into account (e.g., Muthén & Asparouhov, 2012, 2013). In this study we focus on examining the most illustrative constructs of the test battery (i.e., general mental ability [concentration-stress test], conscientiousness, extraversion, agreeableness [cooperation], integrity, neuroticism), which are also commonly used by many other Armed Forces to assess the cognitive ability and personality of their personnel (e.g., Darr, 2011). By (re)examining the model fit and the measurement invariance of these constructs, this study therefore helps to maintain the high quality of the psychological cadre assessment of the Swiss Armed Forces and to improve it where necessary. In a broader perspective, the study may also provide guidance to researchers who have to deal with challenges similar to those that we faced (i.e., examining misfitting models for invariance in the context of unbalanced samples) and thus help them to obtain robust findings despite the challenges.

In the rest of this introduction, we first provide a brief overview of the theoretical background of the GMA and personality tests examined and describe the adaptations and new developments undertaken. We then describe the analytical framework applied (i.e., BSEM) in more detail. More specifically, we highlight the advantages of BSEM compared to traditional approaches (e.g., CFA with ML) when it comes to the evaluation of model fit and measurement invariance.

Test development and content

All GMA and personality tests used in the psychological cadre assessment are grounded in well-established theory and scales. However, to make the measures suitable for the military context and the target population (i.e., young adults aged 20), items of existing tests had to be adapted and some completely new items had to be developed (Boss & Brenner, 2006; Boss & Fischer, 2006; Egger & Boss, 2006c).

General mental ability

The largest GMA measure included in the psychological cadre assessment is the concentration-stress test, which is based on the Berlin Model of Intelligence Structure (BIS; Jäger, 1982, 1984) – a model that has had considerable influence on intelligence measurement in German-speaking countries (e.g., Amthauer, Brocke, Liepmann, & Beauducel, 2001; Jäger, Süss, & Beauducel, 1997). The BIS distinguishes between two modalities (i.e., operations and contents). The four operations (i.e., processing speed, memory, creativity, processing capacity) and the three contents (i.e., verbal, numerical, figural) form a matrix of 12 unique tasks, which are hypothesized to indicate a general mental ability construct (Jäger, 1982, 1984). The four test modules that make up the concentration-stress test (i.e., numerological processing speed [math test] (Footnote 2); verbal reasoning [analogy test]; verbal and numerological memory [coordinate test]; figural reasoning [cutting pattern test]) reflect five to six cells of the BIS (Boss & Fischer, 2006). The concentration-stress test has a specific sequential order. At the start, the test taker must complete the first parts of the math, analogy, and memory tests. Next, the challenging tasks of the cutting pattern test have to be solved, which for most participants will induce the intended frustration, as only a few participants will be able to rearrange the cutting patterns correctly. Afterward, the second parts of the math, analogy, and memory tests must be completed, to determine deviations from the results of the corresponding first parts and thus to infer the candidate’s consistency of performance or lack of frustration tolerance (Boss & Fischer, 2006).

Personality

Another major part of the test battery assesses the candidate’s personality, which is largely based on Goldberg’s (1981) Big Five framework. For the development of the tailored Big Five scales used in the battery (e.g., conscientiousness, extraversion, agreeableness, and neuroticism), an item pool was first generated that included adapted items from existing Big Five or related inventories (e.g., NEO-FFI (Costa & McCrae, 1992); 16 PF (Cattell, Eber, & Tatsuoka, 1970)) and newly developed items (Boss & Brenner, 2006). Based on the results of three pretests with conscripts, this pool was then cleared of items that consistently showed an unclear factor loading pattern (i.e., low major loadings and/or high cross-loadings) and a low item-total correlation, which left 12 items per Big Five domain (Boss & Brenner, 2006). Based on the results of a subsequent study (Maier, 2008), the item set was further reduced. This resulted in the present item set of nine to 10 items per Big Five trait.

A similar procedure was applied when developing the integrity scale (Egger & Boss, 2006c). Based on three attitude-based facets of integrity (e.g., attitude toward rules and regulations) and five trait-based facets of integrity (e.g., self-control), which were considered as especially relevant for the present cadre selection context (Egger & Boss, 2006c), a large item pool was generated that contained adapted items from existing integrity inventories (e.g., Mussel, 2003) and newly developed items (Egger & Boss, 2006c). Expert ratings were then used to clear the pool of items with ambiguous wording, and based on a subsequently conducted study with conscripts, it was additionally cleared of items with an unclear factor loading pattern (i.e., low major loadings and/or high cross-loadings) and a low item-total correlation. With this reduction, eight items for each of the eight integrity facets remained in the final scale (Egger & Boss, 2006c).

Model-fit and invariance testing: From CFA with ML to BSEM

Once a newly developed test battery has reached a certain maturity, researchers normally move on from the explorative stage and try to confirm the hypothesized measurement model (i.e., the factor structure) in new samples, as was done in the development process of the Swiss Armed Forces’ test battery. In a conventional confirmatory measurement model set-up, selected parameters are freely estimated (e.g., major factor loadings), and other parameters are fixed to specific values (e.g., cross-loadings and residual correlations are fixed to zero) (Brown, 2015, pp. 35–87). This set-up of freely estimated and fixed parameters expresses the researchers’ hypotheses on how the latent (unobserved) factors are related to observed measures (i.e., test/questionnaire items) (Brown, 2015, pp. 35–87). If the hypothesized structure should also be tested for equivalence across different groups (as needed in our case), the parameter set-up can be extended by including additional equality constraints for selected parameters (e.g., equality restrictions for factor loadings across groups) (Brown, 2015, pp. 206–286). In any case, the desired outcome of such a confirmatory test (i.e., CFA) is, of course, that the measurement model fits the data, or more technically, that the model-implied variance-covariance matrix matches the sample variance-covariance matrix (Brown, 2015, pp. 35–87).
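In formal notation, this conventional single-group set-up can be summarized as follows (a standard CFA sketch, not specific to the present test battery):

$$
y \;=\; \nu + \Lambda \eta + \varepsilon, \qquad
\Sigma(\theta) \;=\; \Lambda \Psi \Lambda^{\top} + \Theta,
$$

where cross-loadings in $\Lambda$ and off-diagonal elements of the residual covariance matrix $\Theta$ are fixed to exactly zero, and the model is said to fit if the implied matrix $\Sigma(\theta)$ reproduces the sample variance-covariance matrix $S$.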

Most often, however, the proposed measurement model(s) of the constructs used in the Swiss Armed Forces test battery turned out to fit the data unsatisfactorily (Goldammer, 2019; Maier, 2008), which means they failed the exact fit and/or exact invariance test (i.e., χ2 and/or Δχ2, respectively). Unfortunately, conventional CFA with ML mainly offers two limited (if not inappropriate) options to deal with such misfitting models (Muthén & Asparouhov, 2012, 2013). First, researchers can use approximate fit indices, such as CFI or RMSEA, to judge the fit of a model (e.g., Hu & Bentler, 1999), and the change in these indices when examining measurement invariance (e.g., Cheung & Rensvold, 2002). Second, they can undertake a series of model modifications (e.g., adding residual correlations or freely estimating otherwise constrained parameters) based on the modification indices, to reach a well-fitting model. Both approaches seem unsatisfactory: the first, because it does not offer a remedy for the model misfit and/or non-invariance but rather an excuse to interpret the biased model estimates anyway; the second, because it bears the risk of ending up with a model that does not generalize beyond the present sample (MacCallum, Roznowski, & Necowitz, 1992) or with a model that does not entail the correct composition of invariant and non-invariant parameters across groups (Asparouhov & Muthén, 2014).

In contrast, BSEM is a newer modeling and invariance testing approach that can take potential model misfit and non-invariance into account more appropriately (e.g., Muthén & Asparouhov, 2012, 2013). Let us take, for instance, the single-group case, in which the evaluation of measurement model fit is usually of primary interest. Compared to a conventional CFA with ML, in which strict assumptions for certain parameters are imposed (i.e., zero cross-loadings and residual correlations), BSEM allows the incorporation of prior beliefs for certain parameters in a more nuanced way. This could entail specifying a prior distribution with a mean of zero and a small variance for cross-loadings and/or residual correlations (see Asparouhov, Muthén, & Morin, 2015; Muthén & Asparouhov, 2012). The larger the prior variances on such parameters need to be to obtain a model that fits the data, the more flexible the model becomes and thus the weaker the theory becomes (Asparouhov et al., 2015). Thus, using BSEM offers two central benefits: First, it allows the researcher to keep working with a measurement model that fits the data and nevertheless represents substantive theory. Second, it provides the researcher with a quantification of the strength of the underlying theory, in terms of the prior variance that is specified for the parameters (see Asparouhov et al., 2015).
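Schematically, and assuming the standardized metric used later in the Methods section, BSEM replaces an exact zero constraint by a small-variance normal prior, for example

$$
\lambda_{jk} \sim N(0, v), \qquad v \text{ small (e.g., } v = 0.01\text{)},
$$

for a cross-loading (and analogously for a residual correlation), instead of fixing $\lambda_{jk} = 0$. In practice, approximate zero residual correlations are implemented via an inverse-Wishart prior on the residual covariance matrix to keep it positive definite (see Footnote 4).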

In addition, BSEM also offers versatile modeling and analysis options for the multiple-group case, in which the evaluation of measurement invariance is usually of primary interest. Compared to a conventional multiple-group CFA with ML, in which exact equality constraints across groups are imposed for certain parameters (e.g., loadings and intercepts), BSEM allows the incorporation of prior beliefs about parameter invariance in a more nuanced way. This could entail, for instance, specifying a prior distribution with a mean of zero and a small variance for the difference between the factor loadings and/or indicator intercepts (Muthén & Asparouhov, 2013). The larger the prior variances on such difference parameters need to be to obtain a model that fits the data, the larger the magnitude of non-invariance (Muthén & Asparouhov, 2013). In the BSEM context, parameters are considered as significantly different from each other (i.e., non-invariant) if the 95% credibility interval of the difference between the parameters does not include zero. In addition, the BSEM approach of (approximate) invariance testing is also very attractive because it can be applied with models that cannot otherwise be tested for invariance (i.e., measurement models with a small variance prior on cross-loadings and/or residual correlations).
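In the two-group case considered here (German- and French-speaking groups, G and F), approximate invariance thus amounts to small-variance priors on the parameter differences, for example

$$
\lambda_j^{(G)} - \lambda_j^{(F)} \sim N(0, v), \qquad
\tau_j^{(G)} - \tau_j^{(F)} \sim N(0, v),
$$

with a difference flagged as non-invariant when its 95% credibility interval excludes zero.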

BSEM therefore seems to be a far better modeling and analysis option than conventional CFA with ML, especially if the hypothesized models fail the exact fit and/or invariance test. This was also the reason why we chose BSEM for the present investigation.

Methods

Sample

For this study we used the data from the complete assessment cycle in 2019, in which 1,843 cadre candidates completed the psychological assessment. The cadre candidates were on average 21.11 years old (SD = 1.81) and predominantly men (1,776, or 96.4%). The sample consisted of 1,284 German-speaking, 474 French-speaking, and 85 Italian-speaking candidates. Thus, with the exception of the language group, the sample was relatively homogeneous.

However, we decided to use only the two larger language groups (i.e., German and French speakers) for the present study. The reason for disregarding the Italian-speaking group was that we had planned to estimate models with up to 75 parameters in the single-group analyses and models with up to 152 parameters in the multiple-group analyses, which would have resulted in poor coverage of central estimates. For instance, the power to detect potential residual correlations within the Italian language group and parameter non-invariances between the Italian and the other language groups would have been below an acceptable minimum of .8 (see e.g., Muthén & Asparouhov, 2012).

The data set contained no missing values. For one, this is because non-responses in the cognitive tests are treated by default as incorrect responses. For another, this is because, in the personality test part, candidates have to provide a response to each item in order to proceed to the next survey page (i.e., a forced-response format is used).

Study measures

General mental ability

The concentration-stress test (Boss & Fischer, 2006) is composed of four subtests that measure the GMA facets numerological processing speed (math test; two parts with 50 items each, 8-minute time limit for each part), verbal reasoning (analogy test; two parts with 20 items each, 6-minute time limit for each part), verbal and numerological memory (coordinate test; two parts with 10 items each, 7-minute time limit for each part), and figural reasoning (cutting pattern test; 20 items, 8-minute time limit). The language group-specific Cronbach’s alpha values for the subtests in the assessment cycle studied here (i.e., 2019 cycle) were the following: αGerman = .96, αFrench = .96 for both parts of the math test; αGerman = .79, αFrench = .70 for both parts of the analogy test; αGerman = .78, αFrench = .78 for both parts of the coordinate test; and αGerman = .72, αFrench = .68 for the cutting pattern test. At least at first sight (i.e., without conducting any formal test), the language group-specific alpha values of the presently studied cycle appeared to be fairly similar to those that were reported for the previously studied assessment cycle (i.e., column Cronbach’s α – 2012 cycle, Table 1).

Personality

The candidates rated all items of the Big Five scales (Boss & Brenner, 2006) and the integrity scale (Egger & Boss, 2006c) on a Likert scale ranging from 1 = completely disagree to 6 = completely agree. For each of the five personality scales, the following provides a sample item and the corresponding language group-specific Cronbach’s alpha values for the presently studied assessment cycle (i.e., 2019 cycle): “I like to work very precisely” (conscientiousness; 10 items; αGerman = .86, αFrench = .79), “I can make new friends quickly” (extraversion; 10 items; αGerman = .87, αFrench = .85), “I prefer tasks that I can do together with work colleagues” (agreeableness [cooperation]; 10 items; αGerman = .78, αFrench = .78), “I take honesty more seriously than many others” (integrity; 64 items; αGerman = .93, αFrench = .92), and “There are times when I feel absolutely worthless” (neuroticism; 9 items; αGerman = .79, αFrench = .81). With the exception of the agreeableness [cooperation] scale, for which the language group-specific alpha values tended to be lower, the language group-specific alpha values of the other four personality scales in the presently studied cycle appeared to be fairly similar (i.e., without conducting any formal test) to those that were reported for the previously studied assessment cycle (i.e., column Cronbach’s α – 2012 assessment cycle, Table 1).

Analytical procedure

Measurement model evaluation

To evaluate the measurement model fit of the constructs, we used BSEM (Muthén & Asparouhov, 2012) in Mplus Version 8.4 (Muthén & Muthén, 1998–2017). A series of BSEM models were estimated for each construct (Footnote 3). For identification purposes, the factor variance was fixed to 1 within these models. Moreover, factor indicators were standardized so that the prior variance had a similar interpretation across items. First, a measurement model with non-informative priors for the major loadings and exact zero residual correlations was estimated. Estimating a BSEM model with non-informative priors for the major loadings and exact zero residual correlations is comparable to estimating a conventional CFA with ML (Muthén & Asparouhov, 2012). This restricted BSEM model then served as a baseline model for more liberal BSEM models in which approximate zero residual correlations with increasingly larger prior variances were specified. In this comparison process, which is also referred to as sensitivity analysis (Asparouhov et al., 2015), we examined the fit of models with five different prior variances ranging from 0.0001 to 0.01. With a standardized factor variance and standardized indicators, a prior variance of 0.01, for instance, implies a prior belief that 95% of the distribution of a residual correlation will fall within a range of ± .2 (Muthén & Asparouhov, 2012). All these analyses were conducted separately for the German-speaking and the French-speaking group.
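The stated range follows directly from the normal prior: 95% of an $N(0, v)$ distribution lies within $\pm 1.96\sqrt{v}$, so

$$
\pm 1.96\sqrt{0.01} \approx \pm 0.20, \qquad \pm 1.96\sqrt{0.0001} \approx \pm 0.02 .
$$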

The models were estimated with the Markov chain Monte Carlo (MCMC) algorithm using the Gibbs sampler and two chains, each of which was run with 50,000 iterations. The convergence of the estimation process was assessed with the potential scale reduction factor (PSRF). In this case, PSRF values lower than 1.10 were taken as an indication of model convergence (Gelman, Carlin, Stern, & Rubin, 2004, p. 287). All models listed in Table 2 reached convergence according to this criterion.
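For readers unfamiliar with the PSRF (Gelman-Rubin) diagnostic, the following Python sketch illustrates the underlying computation for a single parameter traced by two chains. It is a simplified illustration of the general formula, not the exact routine Mplus applies internally.

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin) for one parameter.

    `chains` has shape (m, n): m chains, n retained iterations each.
    Values close to 1 indicate convergence; the study used a cutoff of 1.10.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # average within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_hat / W)

# Example with two synthetic chains of 50,000 draws each
rng = np.random.default_rng(1)
draws = rng.normal(0.5, 0.1, size=(2, 50_000))
print(psrf(draws))  # close to 1.00, i.e., below the 1.10 criterion
```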

Table 2. Model fit results of models with different levels of residual correlations.

The fit of the measurement models was assessed with posterior predictive checking and the associated posterior predictive (PP) p-value. Whereas a PP p-value around .50 is indicative of an excellent-fitting model (Muthén & Asparouhov, 2012), PP p-values above .10 or .15 have been taken as indicating a reasonable fit (Cain & Zhang, 2019). In addition, the deviance information criterion (DIC) was used for model comparisons within each construct. In this case, lower DIC values were indicative of a better fitting model (Asparouhov et al., 2015).
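Conceptually, the PP p-value is the proportion of MCMC iterations in which the model fits data replicated from the posterior at least as poorly as it fits the observed data. A minimal Python sketch of this computation is given below; `d_obs` and `d_rep` are assumed to already hold the per-iteration (e.g., likelihood-ratio chi-square) discrepancies for the observed and replicated data.

```python
import numpy as np

def ppp_value(d_obs, d_rep):
    """Posterior predictive p-value from per-iteration discrepancy measures.

    d_obs[i]: discrepancy between model and observed data at iteration i.
    d_rep[i]: discrepancy between model and replicated data at iteration i.
    Values near .50 indicate excellent fit; values near 0 indicate misfit.
    """
    d_obs, d_rep = np.asarray(d_obs), np.asarray(d_rep)
    return np.mean(d_rep >= d_obs)
```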

The final measurement model for each construct was selected based on whether it offered an adequate balance between model fit (i.e., of two models with different priors, the model that exhibited a better fit to the data was favored) and model parsimony (i.e., if two models fitted equally well, the one with the smaller prior variance was favored over the model with the larger prior variance), and on whether the model estimates were comparable and in the expected range in both language groups (i.e., we favored models that had the same configuration of major loadings and residual correlations in the groups over models that had a different configuration of major loadings and residual correlations in the groups).

Measurement invariance evaluation

The final measurement models of the constructs were then further tested for invariance across the language groups (Footnote 4). For this evaluation, we again used the BSEM modeling and testing framework (Muthén & Asparouhov, 2013).

A series of scalar invariance models (i.e., models in which invariance of loadings and intercepts is imposed) with language group-specific residual (co)variances were estimated for each construct. These models were identified by setting the factor variance and the factor mean in the French-speaking group to 1 and 0, respectively. We focused on testing for scalar invariance, as this type of invariance is sufficient for most structural comparisons across groups (e.g., latent means) (Footnote 5) (Steenkamp & Baumgartner, 1998). However, if the focus were on comparing observed composite scores, an additional test for (approximate) invariance of residual (co)variances would be necessary (Steenkamp & Baumgartner, 1998). First, a measurement model with exact scalar invariance was estimated. This restricted BSEM model then served as a baseline model for more liberal BSEM models in which approximate scalar invariance with increasingly larger prior variances was specified. In this sensitivity analysis (Asparouhov et al., 2015), we examined the fit of models with five different prior variances ranging from 0.001 to 0.1 (see Muthén & Asparouhov, 2013; Winter & Depaoli, 2020). With a standardized factor variance and standardized indicators, a prior variance of 0.1, for instance, implies a prior belief that 95% of the distribution of the difference/non-invariance will fall within a range of ± .62 (Muthén & Asparouhov, 2013). Note, however, that we performed the invariance tests based on the unstandardized model estimates, as the use of standardized estimates would have implied the unlikely assumption of equal indicator variances across groups.
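As in the single-group case, the implied range follows from the normal prior on the difference parameters:

$$
\pm 1.96\sqrt{0.1} \approx \pm 0.62, \qquad \pm 1.96\sqrt{0.001} \approx \pm 0.06 .
$$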

In this measurement invariance testing, the same estimation procedures were used, and the same criteria were applied when examining convergence (all models listed in Table 3 achieved convergence according to the PSRF criterion) and model fit as in the previous evaluation of the measurement models.

Table 3. Model fit results of models with different levels of scalar invariance across language groups.

The final scalar invariance model for each construct was selected based on whether it offered an adequate balance between model fit (i.e., of two models with different priors, the model that exhibited a better fit to the data was favored) and model parsimony (i.e., if two models fitted equally well, the one with the smaller prior variance was favored over the model with the larger prior variance). In addition, we also checked the plausibility of the estimates and favored well-fitting models with plausible estimates over well-fitting models with implausible estimates (e.g., models in which loadings turned out to be negative and/or non-significant, and/or residual covariances were very large and significant).

Subsampling

Since the sample sizes were not equal across the language groups, we additionally ran subsample comparisons based on the final scalar invariance model (Yoon & Lai, 2018) as a robustness check. If measurement invariance is tested in the context of unbalanced sample sizes, the power to detect a true non-invariance may be diminished (Kaplan & George, 1995; Yoon & Lai, 2018), because larger groups have a stronger influence than smaller groups on the joint fit function of the corresponding multiple group analysis (Brown, 2015, p. 251; Yoon & Lai, 2018, p. 203). In other words, a misfit (i.e., non-invariance) may not be detected, because the model is estimated in favor of the larger group (i.e., it fits for the majority but not for the minority). As a remedy, Yoon and Lai (2018) therefore proposed repeating the invariance test across multiple randomly drawn subsamples in which the comparison groups are of equal size (Footnote 6).

For the subsample comparisons, we generated 100 data sets, each of which contained 474 randomly drawn participants from the German-speaking sample (n = 1,284) and the 474 participants from the French-speaking sample. In addition to the non-invariances determined in the full-sample comparisons, we considered a parameter difference as potentially substantial (i.e., worth further investigation) if the difference between the parameters turned out to be significant in a substantial proportion (i.e., we considered 20% as substantial) of the subsample comparisons, as sketched below.
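The subsampling logic can be written down compactly in Python. The function `fit_invariance_model` is a hypothetical placeholder for refitting the final approximate scalar invariance model (in our case via Mplus) and reporting, for each difference parameter, whether its 95% credibility interval excludes zero; it is not part of any existing package.

```python
import numpy as np
import pandas as pd

def subsample_invariance_check(german: pd.DataFrame,
                               french: pd.DataFrame,
                               fit_invariance_model,
                               n_reps: int = 100,
                               seed: int = 2019) -> pd.Series:
    """Repeat the invariance test across balanced random subsamples.

    fit_invariance_model(data) is a hypothetical callable that refits the
    final approximate scalar invariance model and returns a dict mapping each
    difference parameter (e.g., 'tau2_diff') to True if its 95% credibility
    interval excludes zero (i.e., significant non-invariance). The function
    returns the proportion of subsamples in which each difference was
    significant; proportions of at least .20 were treated as worth further study.
    """
    rng = np.random.default_rng(seed)
    flags = []
    for _ in range(n_reps):
        german_sub = german.sample(n=len(french),
                                   random_state=int(rng.integers(0, 2**32 - 1)))
        data = pd.concat([german_sub, french], ignore_index=True)
        flags.append(pd.Series(fit_invariance_model(data)))
    return pd.concat(flags, axis=1).mean(axis=1)
```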

Results

Descriptive statistics

The language-group specific correlation matrices as well as the means and standard deviations of the construct indicators are reported in the supplementary material, Excel file “Descriptive statistics.”

Measurement model fit

Table 2 provides a summary of model fit for models in which exact or different levels of approximate zero residual correlations were specified. The results in Table 2 are reported per construct and separately for each language group.

For all constructs, the strictest BSEM model (i.e., the model with exact zero residual correlations) provided a poor fit to the data, not only according to the PPP but also if alternative fit indices (i.e., the comparative fit index [CFI; Bentler, 1990] or the root mean square error of approximation [RMSEA; Steiger & Lind, 1980]) had been used for model fit evaluation. Thus, if we had insisted on using the model specifications that are conventionally applied in a CFA with ML, our examination of measurement invariance of the constructs would have been over before it even started, as the basic prerequisite of a well-fitting measurement model was not met for the examined constructs (see Steenkamp & Baumgartner, 1998, p. 80). In contrast, a well-fitting measurement model could be obtained for all constructs and in both language groups if more liberal BSEM models were estimated that allowed for a small prior variance of the residual correlations.

Take the model fit evaluation of the extraversion construct, for instance. Even the specification of a tiny prior variance (v < .0001) for the residual correlations resulted in a model that showed a noticeably better fit to the data (i.e., PPPGerman = .402, PPPFrench = .265) than the model in which exact zero residual correlations were specified (i.e., PPPGerman = .000, PPPFrench = .000). Thus, if “fit to the data” had been the only criterion by which we judged the suitability of a model, the specification of a tiny prior variance (e.g., v < .0001) would have been sufficient for the extraversion and also all other constructs. In addition to the model fit, however, we also aimed for models that had the same configuration of major loadings and residual correlations in the language groups, in order to have an optimal basis for the subsequent invariance testing. In the case of the extraversion construct, for instance, models with smaller prior variances (i.e., v < .0001, v = .0001, v = .0003) on the residual correlations resulted in different sets of significant residual correlations across the language groups. In contrast, in the model with a prior variance of .001, only two residual correlations turned out to be significant. One of them occurred in both language groups and the other one uniquely in the French-speaking group (for details, see supplementary material, Excel file “Model testing summary”). Therefore, the model with a prior variance of .001 was selected as the final BSEM model for the extraversion construct. The same principle was applied when selecting the final BSEM models of the other constructs.

In all the final BSEM models (rows in gray, Table 2), all loadings were significant and sufficiently large (i.e., except for five loadings, all of them were larger than .4) (Brown, 2015). In addition, the vast majority of the freely estimated residual correlations in these final models turned out to be close to zero and non-significant (for details, see supplementary material, Excel file “Model testing summary”). In the rare case in which a residual correlation escaped the prior variance and turned out to be significant despite the wiggle room that we allowed (i.e., a small prior variance for the parameter), the involved items were flagged to allow closer examination of their content. This examination revealed that significant residual correlations were most likely due to similar item wording.

In general, the results therefore indicated that the misfit of the strict BSEM models was mostly due to many small unmodeled residual correlations, which, however, are of little practical importance and thus do not call into question the general model specification (see Asparouhov et al., 2015). We therefore concluded that the final measurement models offered an adequate representation of the (co)variances in the data and consequently could be used to test for measurement invariance across language groups.

Measurement invariance

Table 3 provides a summary of model fit for models in which exact or different levels of approximate scalar invariance across the language groups were specified. The results in Table 3 are listed separately for each construct.

For all constructs, liberal BSEM models (i.e., models with approximate scalar invariance) offered a better representation of the (non)invariance in the measurement models than the strictest BSEM model (i.e., model with exact scalar invariance), not only in terms of model fit but also in terms of the plausibility of the model estimates.

Again, take the extraversion construct as an example. When the loadings and intercepts of this construct were forced to be exactly equal across groups, the inequality of these parameters seemingly had to channel through other unconstrained parameters, such as the residual (co)variances, which became inflated and significant. Similar effects (i.e., inflated significant residual correlations) occurred for approximate scalar invariance models in which only a small prior variance for the difference parameters was specified (i.e., v = .001, v = .005). This stands in contrast to the regular pattern of loadings and residual (co)variances (i.e., significant loadings and non-significant residual covariances, similar to those observed in the single-group analyses) that could be observed for the model in which a prior variance of .1 was specified for the difference parameters (for details, see supplementary material, Excel file “Invariance testing summary”). Therefore, the model with the prior variance of .1 was selected as the final approximate scalar invariance model of the extraversion construct. Notably, the pattern that we observed for the extraversion construct (i.e., implausible estimates for models in which no or only small prior variances for the difference parameters were specified) also occurred when we examined the other constructs. For selecting the final approximate scalar invariance models of the other constructs (rows in gray, Table 3), we applied the same principle as in the case of the extraversion construct (i.e., we favored well-fitting models with plausible estimates over well-fitting models with implausible estimates).

These final approximate scalar invariance models were then further examined for significant loading and intercept differences (i.e., non-invariances). First, we examined the parameter differences in the full sample. Table 4 (column “Full sample comparisons”) shows the loading and intercept estimates of the construct extraversion for both language groups, as well as the differences of the parameters (Footnote 7). The results of the full sample comparisons of the other constructs are available as supplementary material (see supplementary material, Tables S1-S5, columns “Full sample comparisons”).

Table 4. Parameter difference tests for the extraversion construct across the full sample and the subsamples.

When examining the loadings of the extraversion construct across the language groups, none of the comparisons revealed a significant difference. Thus, full approximate metric (i.e., loading) invariance may be claimed for the extraversion items. However, four intercepts (i.e., τ2, τ6, τ7, τ9) turned out to be significantly different across the language groups despite the wiggle room that we allowed (i.e., the small prior variance for the difference parameter). Thus, full approximate metric, but only partial approximate scalar, invariance may be claimed for the extraversion items. The same type of invariance (i.e., full approximate metric, partial approximate scalar) could be imposed for the other examined constructs (Tables S2-S5), with the exception of the concentration-stress test, for which full approximate metric and scalar invariance could be established (Table S1).

As a robustness check, we then reexamined the parameter differences with the subsampling procedure. Table 4 (column “Subsample comparisons”) shows the average parameter difference and the corresponding standard deviation across the subsample comparisons, as well as the percentage of subsample comparisons in which the parameter difference turned out to be significantly larger than zero. The results of the subsample comparisons of the other constructs are available as supplementary material (see supplementary material, Tables S1-S5, columns “Subsample comparisons”). Not surprisingly, parameter differences that had been discovered as significant in the full sample comparison also turned out to be significant in a large proportion of the subsample comparisons. For instance, the parameter difference of τ2 was larger than zero in 100% of the subsample comparisons and that of τ6 in 79% of the subsample comparisons (Table 4, column “Subsample comparisons”). More interesting, however, were the parameter differences that turned out to be significant in a substantial proportion of the subsample comparisons (i.e., in at least 20%), even though the differences appeared to be non-significant in the full sample comparison. An exemplary case is the difference of τ8 (row in gray, Table 4). Although this difference was not found to be significant in the full sample comparison, it proved to be significantly larger than zero in 52% of the subsample comparisons. Such masked differences were also found for the items of the other constructs (rows in gray, Tables S3-S5), even though the proportions of significant subsample comparisons were not as large (i.e., 20–25%) as was the case for τ8 of the extraversion construct. Nevertheless, all these masked differences are certainly worth further examination.

Discussion

The aim of this study was to (re)examine selected GMA and personality tests used by the Swiss Armed Forces in psychological cadre assessment regarding model fit and measurement invariance across two major language groups in Switzerland (i.e., German and French speakers) by using BSEM.

The measurement model fit testing showed that the strictest BSEM model (i.e., the model with exact zero residual correlations) provided a poor fit to the data. Accordingly, if we had insisted on using the strict specifications that are applied in a conventional CFA with ML, the examination of measurement invariance of the constructs would have been over before it even began, as the central prerequisite of a well-fitting measurement model (Steenkamp & Baumgartner, 1998, p. 80) was not given. In contrast, more liberal BSEM models (i.e., models with approximate zero residual correlations) offered an adequate representation of the (co)variances in the data. By using liberal BSEM models, we could therefore obtain well-fitting measurement models that could be used for further invariance testing. On the other hand, however, we also had to accept in turn that our measurement models and the underlying theories are at best approximations.

The subsequent measurement invariance testing then showed that at least partial approximate scalar invariance can be claimed for the items of the constructs. Interestingly, a large proportion of the items that were non-invariant in the presently studied assessment cycle (i.e., 2019 cycle) were also found to be non-invariant in the previously studied assessment cycle (i.e., 2012 cycle; see Goldammer, 2019). For instance, all intercepts of the extraversion construct that turned out to be non-invariant in the present study (i.e., τ2, τ6, τ7, τ8, τ9 in Table 4) were also found to function differently across the German- and French-speaking language groups in the 2012 assessment cycle (see Goldammer, 2019). This can be taken as further evidence for the robustness of the detected non-invariances, since they occurred in different samples and independently of the estimation method. However, this consistent finding also means that these non-invariances should be taken seriously.

Reflections on invariance testing and the assessment of the importance of non-invariances

Of course, the number of items detected to function differently across the groups depends strongly on how strict the desired invariance criterion is. Naturally, setting a stricter invariance criterion (e.g., exact invariance) most likely goes along with the finding that more items function differently across the groups. However, applying exact invariance as a criterion also entails the risk of finding many statistically significant yet potentially trivial non-invariances, such that not even a minimal level of invariance is reached that is needed to make meaningful comparisons across groups (e.g., Cieciuch, Davidov, Algesheimer, & Schmidt, 2018; Marsh et al., 2018). In light of the potential difficulties that may arise when using such a strict criterion of invariance, we therefore used BSEM as an analytical framework that allows the application of a more lenient and potentially more realistic criterion of invariance (i.e., approximate invariance). More specifically, BSEM allows the researcher to anticipate a certain degree of trivial non-invariance, or to expect a certain degree of approximate invariance, by specifying a prior distribution with a mean of zero and a small variance for the difference between the parameters (e.g., loadings and/or intercepts). In this framework, parameter differences that are truly trivial therefore have credibility intervals that include zero. In contrast, if a difference between parameters turns out to be significant (i.e., the credibility interval does not include zero) despite the wiggle room that was allowed (i.e., set by the prior variance), the parameter difference may be considered to be non-trivial and thus of practical importance. This was also the criterion that we applied in our examination when we assessed the practical importance of non-invariances. In other words, we considered any non-invariance that went beyond approximate invariance as worthy of further investigation (i.e., adjustments had to be made, or at least the items had to be examined more closely). We considered this criterion for the practical importance of a non-invariance as best suited for our high-stakes questionnaire/test.

However, this is by far not the only possibility that researchers can use to determine the practical importance of a non-invariance. In the context of BSEM, for instance, Shi et al. (2019) proposed using the region of practical equivalence (ROPE) to determine whether a non-invariance is of practical relevance or not. For this type of non-invariance assessment, the researcher first defines a ROPE (preferably on a standardized metric), such as ±.1, against which the credibility interval of the parameter difference can be compared (for details, see also Kruschke, 2014, pp. 336–340). If the 95% credibility interval (preferably the highest density interval) falls completely within the ROPE, the researcher may conclude that the non-invariance is not of practical importance. On the other hand, if the 95% credibility interval lies completely outside the ROPE, the researcher may conclude that the parameter difference is of practical importance. However, the finding is considered as inconclusive if the ROPE and the credibility interval of the difference partially overlap (see Shi et al., 2019) (Footnote 8).
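The ROPE decision rule described above can be written down compactly. The following Python sketch assumes the 95% credibility (or highest density) interval of a parameter difference has already been extracted from the posterior, and simply classifies it against a chosen ROPE; the interval bounds in the example are hypothetical.

```python
def rope_decision(ci_lower: float, ci_upper: float,
                  rope_lower: float = -0.1, rope_upper: float = 0.1) -> str:
    """Classify a parameter difference against a region of practical equivalence.

    ci_lower/ci_upper: bounds of the 95% credibility (or HDI) interval of the
    difference. rope_lower/rope_upper: the ROPE, e.g., +/-.1 on a standardized metric.
    """
    if rope_lower <= ci_lower and ci_upper <= rope_upper:
        return "not of practical importance (interval inside ROPE)"
    if ci_upper < rope_lower or ci_lower > rope_upper:
        return "of practical importance (interval outside ROPE)"
    return "inconclusive (interval and ROPE partially overlap)"

# Example with hypothetical interval bounds for an intercept difference
print(rope_decision(-0.60, -0.35))  # lies entirely outside the +/-.1 ROPE
```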

These Bayesian-based approaches aside, there are also many effect size measures available outside the Bayesian modeling and testing framework (for an overview, see Gunn, Grimm, & Edwards, 2020; Meade, 2010). In addition, rather than the item-level perspective that we used, one could use an effect size measure that determines the practical importance of a non-invariance by the extent to which it affects the overall scale score (see e.g., Stark, Chernyshenko, & Drasgow, 2004) or a measure that additionally takes into account the impact on the selection decision (see e.g., Stark et al., 2004). Accordingly, the most comprehensive picture of the practical importance of a non-invariance is likely to be obtained by considering several complementary indices (e.g., an item-level and a scale-level index) (Stark et al., 2004, p. 498).

Among the multitude of available effect size measures of non-invariance, however, BSEM-based indices have a central advantage. Because with BSEM it is possible to obtain a theory-guided well-fitting measurement model with unbiased estimates, the effect size measures derived from the estimates are unbiased as well. In contrast, if a conventional CFA with ML is used, the derived effect size measures of non-invariance may be biased to the extent to which the CFA model is misspecified (see Nye & Drasgow, 2011, p. 977). In our understanding, this advantage of BSEM-based indices should therefore be clearly taken into account when researchers choose an effect size measure of non-invariance for their study of measurement invariance.

Different item functioning: Remedies and further testing options

Once a non-invariance has been detected and judged as substantial, the question remains how best to deal with it. Clearly, more options, and more effective ones, are available if the causes of the non-invariance can be identified. If research is conducted in a multilingual context like ours, a good starting point for identifying the reasons for non-invariances is to check the translations of the items for imprecision. Indeed, an imprecise translation may have been the reason for the intercept difference in τ2 (−.478) of the extraversion construct. Whereas the German item is, “Ich sprühe oft vor Energie” [I am often bursting with energy], the French item is, “je suis souvent plein d’énergie” [I am often full of energy]. Admittedly, this is only a slight nuance. However, the fact that the German item wording is stronger than the French wording may have ultimately led the German-speaking respondents to endorse it less often than the French-speaking respondents. If an imprecise translation has been identified as the most likely reason, as in our case, the adjustment is straightforward and entails finding a more suitable translation than the original one, or, if this is not possible, finding an expression that has the same meaning for both groups. In our case, this meant adjusting the German item wording to “Ich bin oft voller Energie” [I am often full of energy].

In addition to imprecise translations, however, non-invariances can also, and probably far more frequently, occur because an item triggers a different social norm or attitude in the language groups. For instance, behavior (e.g., outgoingness) in a social situation (e.g., at a party) may depend not only on the degree of extraversion but also on whether being reserved or being outgoing is considered as polite or as in accordance with the social norm in this circumstance. Under the premise that a concrete social norm can be identified as a reason for the non-invariance, the adjustment would entail replacing the item with one that does not trigger this norm but only indicates the intended construct. However, correctly attributing a non-invariance to a social norm is a challenging task and prone to speculation.

Eventually, the researcher may end up with a couple of non-invariant items for which no obvious reasons for the different functioning across the (language) groups can be found. However, this does not necessarily mean the end for any further tests and/or comparisons across groups. Even if several items are found to function differently, valid comparisons across groups may still be possible, as long as there are at least some invariant items in the questionnaire/test battery. For instance, Pokropek, Davidov, and Schmidt (2019) showed in their large simulation study that population latent means and path coefficients can be effectively recovered even if only some of the questionnaire items are approximately invariant. In such a case (i.e., partial approximate invariance) it seems most appropriate to use a BSEM-based CFA, as this analysis approach showed reasonable levels of coverage and precision even under unfavorable conditions (i.e., a larger number of non-invariant items and a larger prior variance for the difference parameters) and performed far better than the other approaches examined (e.g., conventional multiple-group CFA with ML) (Pokropek et al., 2019).

Alternatively, if the comparability of individual questionnaire/test scores across groups is of primary interest, test equating may be applied (e.g., Kolen & Brennan, 2014). For instance, in our examination (i.e., a common-item non-equivalent groups design), the set of invariant items could have served as internal anchor items, on the basis of which the different test versions (i.e., German and French versions) could have been equated. Mean, linear, or equipercentile equating are a few of the available methods that could have been used to put the scores of the language versions on the same scale (e.g., Bandalos, 2018, pp. 547–584; Kolen & Brennan, 2014, pp. 103–169). Readers looking for a concrete and illustrative example of how test equating may be applied in the context of linguistically different tests are referred to Elosua and López-Jáuregui (2008).
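To illustrate the general idea (not the specific procedure of Elosua and López-Jáuregui, 2008), the following Python sketch shows simple mean and linear equating of scores from two language versions. It is a simplified illustration that ignores the synthetic-population weighting of Tucker- or Levine-type methods for the common-item non-equivalent groups design, and the variable names and numbers are hypothetical.

```python
import numpy as np

def mean_equate(scores_x, mean_x, mean_y):
    """Shift form-X scores so that their mean matches the form-Y mean."""
    return np.asarray(scores_x) + (mean_y - mean_x)

def linear_equate(scores_x, mean_x, sd_x, mean_y, sd_y):
    """Map form-X scores onto the form-Y scale by matching means and SDs."""
    z = (np.asarray(scores_x) - mean_x) / sd_x
    return mean_y + sd_y * z

# Hypothetical example: placing French-version (X) scores on the German-version (Y) scale
french_scores = np.array([21.0, 25.0, 30.0])
print(linear_equate(french_scores, mean_x=24.0, sd_x=5.0, mean_y=26.0, sd_y=4.0))
```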

Limitations and future research directions

Some limitations should be noted. For instance, we used BSEM to better deal with models that fail the exact fit and/or invariance test, under the assumption that the approximate nature of the underlying measurement theory would be the primary reason for the model misfit. However, this perspective neglects the possibility that contextual factors of the survey administration may also have had a considerable influence on the participants’ response behavior. For one, there is the fact that the survey is used for selection purposes, which in turn may motivate certain cadre candidates to present themselves in a more favorable way (i.e., they fake some responses). Although there is an ongoing debate about the extent to which faking has adverse effects at all (for an overview, see Holden, 2008; Tracey, 2016), several studies have shown its distorting effects on the hypothesized factor structure (e.g., Ion & Iliescu, 2017; Lee, Joo, & Fyffe, 2019). For another, certain participants, presumably less motivated cadre candidates who had been implicitly coerced to take on a cadre position, may feel reluctant to go through the lengthy questionnaire and thus complete some parts or the whole questionnaire rather carelessly. This represents another potential response behavior with adverse effects on the factor structure (e.g., Goldammer, Annen, Stöckli, & Jonas, 2020). Future studies could therefore additionally include response style factors in their BSEM models. This would allow a clearer differentiation between model misfit that is due to response styles and model misfit that may uniquely be attributed to the imprecision of the measurement model.

In addition, it can be argued that the criterion that we applied for assessing the practical importance of a non-invariance (i.e., anything that goes beyond approximate invariance) is only a heuristic in which the actual size of the non-invariance is not really taken into account. Obviously, in our approach to assessing the practical importance of a non-invariance, the size of the non-invariance/difference is not explicitly compared against a standardized metric, as is the case, for example, in the ROPE procedure outlined by Shi et al. (2019). However, it nevertheless entails a comparison, although a more implicit one, namely, in terms of the prior variance that is specified for the difference between the parameters (e.g., loadings and intercepts). Naturally, in a strict BSEM model, in which only a small prior variance is specified for the difference between the parameters, it is likely that smaller non-invariances may turn out to be significant. But in more liberal BSEM models with more prior variance for the difference between the parameters, only substantial non-invariances may turn out to be significant. Accordingly, substantial differences can also be identified during the course of a sensitivity analysis, because substantial non-invariances turn out to be significant independent of the size of the prior variance that has been specified for the difference between the parameters (see supplementary material, Excel file “Invariance testing summary”). Nevertheless, future studies should examine to what extent the criterion that we used to judge a non-invariance as practically important (i.e., anything that goes beyond approximate invariance) leads to the same conclusions as other (BSEM-based) effect size measures. In this context, it would also be interesting to see which combination of different effect size measures (e.g., item- and test-level) provides the most comprehensive picture of the practical importance of different types of non-invariance (e.g., different numbers of non-invariant items, loading and/or intercept non-invariance, size of non-invariance).

Lastly, it might be argued that we focused (too) strongly on BSEM in our examination and thereby neglected other approaches that might be equally or even more suited. Undoubtedly, regularized structural equation modeling (RegSEM; Jacobucci, Grimm, & McArdle, 2016), BSEM-based alignment (Asparouhov & Muthén, 2014), and model-implied instrumental variable two-stage least squares estimation (MIIV-2SLS; Bollen, Gates, & Fisher, 2018) have their merits when dealing with models that fail the exact fit and/or invariance test. However, only a few studies up to now have examined how these different approaches perform compared to each other (e.g., see Jacobucci & Grimm, 2018; Kim, Cao, Wang, & Nguyen, 2017, for a partial comparison between these approaches). Thus, at present, the researcher’s choice between these approaches is guided by relatively little information. Future studies should therefore examine under what types of misspecifications and/or non-invariances BSEM, RegSEM, BSEM-based alignment, and MIIV-2SLS perform best, so that applied researchers can make their choices on a more informed basis.

Conclusions

If GMA and personality constructs in a tailored psychological cadre assessment are meant to be valid in a cross-cultural context, two central prerequisites need to be met by the measures that are used. First, the scale items have to adequately reflect the intended underlying constructs. Second, the employed measures need to have the same psychometric properties in each cultural group. By examining these two important issues in the case of the multilingual psychological cadre assessment of the Swiss Armed Forces, this study helped to ensure the quality of this assessment and to identify possibilities for improvement. From a broader perspective, however, the study may also provide guidance to researchers who have to deal with challenges similar to those we faced (i.e., examining misfitting models for invariance in the context of unbalanced samples) and may thus help them to obtain robust findings despite these challenges.


Data availability statement

The data that support the findings of this study are available from the first author upon reasonable request.

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplemental data for this article can be accessed on the publisher’s website.

Additional information

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Notes

1. In the concept development exercise, the candidates have 30 minutes to prepare a concept for a randomly assigned task (e.g., for a company outing or a club’s anniversary). After these 30 minutes, the candidates hand in their concept on a single A4 sheet and present their solution in front of their peers and two psychologists.

2. In the math test, the candidates are presented with six-digit numbers and have to decide whether each number is divisible by three, by four, by both, or by neither. At the beginning of the test, the candidates are given instructions on how to check whether a number is divisible by three or by four. Thus, this test should primarily measure numerical processing speed.
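For illustration, the standard divisibility rules (the digit sum for three, the last two digits for four), which presumably correspond to the rules given in the test instructions, can be expressed as follows; the snippet is purely illustrative and not part of the test material:

```python
def divisible_by_three(number: int) -> bool:
    # Standard rule: a number is divisible by 3 if the sum of its digits is divisible by 3.
    return sum(int(d) for d in str(number)) % 3 == 0

def divisible_by_four(number: int) -> bool:
    # Standard rule: a number is divisible by 4 if its last two digits form a number divisible by 4.
    return int(str(number)[-2:]) % 4 == 0

print(divisible_by_three(123456), divisible_by_four(123456))  # True True
```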

3. Initially, we wanted to test all constructs in a joint model, which would have entailed 1,698 freely estimated parameters in each language group if approximate zero cross-loadings and residual correlations were specified. In such a large model, however, a single iteration required a processing time of approximately 30 seconds (i.e., roughly 17 days for a single run of 50,000 iterations). We therefore decided to base our analyses on computationally less demanding models (i.e., we ran our tests for each construct separately), especially since we had planned to run several models with different priors (each with 50,000 iterations) to determine the best-fitting model and additionally to run subsample comparisons with 100 repetitions for this final model.

4. The inverse Wishart (IW) settings (i.e., the settings of the degrees of freedom for the residual correlations) determined in the single-group analyses with standardized factor variances and indicators were taken as a proxy for the sample-specific IW settings when testing the unstandardized model estimates for measurement invariance across the language groups. This approach seemed justified because the variances of the indicators tended to be close to 1 (see Muthén, 2013). However, the four indicators of the concentration-stress test were an exception. Here, a smaller prior variance had to be chosen than in the single-group analyses, since the unstandardized variances of these indicators were markedly different from 1 (on average only 0.0265).

5. For the appropriate comparison of structural parameters (e.g., latent means or covariances) in the BSEM context, Muthén and Asparouhov (2013) proposed a two-step procedure: First, a model is run to identify non-invariant parameters. Second, another model is run in which all parameters except those identified as non-invariant are constrained to be exactly equal (Muthén & Asparouhov, 2013). Because the comparison of structural parameters was not the focus of our study, we completed only the first of these two steps.

6. The subsampling approach also has its pitfalls. First, due to the reduced sample size, the overall power decreases, and non-invariance may therefore remain undetected. Second, due to the reduced sample size, the influence of the prior on the posterior distribution of the difference parameter increases (see Muthén, Muthén, & Asparouhov, 2017, pp. 385, 415; Zyphur & Oswald, 2015, p. 398), which again can reduce the chance of detecting a substantial non-invariance. Thus, for the subsampling procedure to yield valid conclusions, the subsampled groups must still be large enough to provide sufficient power for the comparisons. Applying subsampling in the case of the German-speaking (n = 1,284) and French-speaking (n = 474) samples (i.e., using 474 randomly drawn German-speaking participants together with the 474 French-speaking participants for the invariance analysis) is justifiable, but it would hardly be justifiable if the Italian-speaking group (n = 85) had been included as well (i.e., using 85 randomly drawn German-speaking and 85 randomly drawn French-speaking participants together with the 85 Italian-speaking participants for the invariance analysis).
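For readers who wish to implement such a check, the following Python sketch illustrates the general subsampling logic under simple assumptions; run_invariance_test is a hypothetical placeholder for whatever routine fits the two-group model (in our case, Mplus runs), and the column and group names are illustrative only.

```python
import numpy as np
import pandas as pd

def subsampled_invariance_checks(data: pd.DataFrame,
                                 group_col: str = "language",   # illustrative column name
                                 majority: str = "German",
                                 minority: str = "French",
                                 n_repetitions: int = 100,
                                 seed: int = 42) -> list:
    """Repeatedly draw a random majority-group subsample of the same size as the
    minority group and collect the invariance results for each balanced sample."""
    rng = np.random.default_rng(seed)
    majority_df = data[data[group_col] == majority]
    minority_df = data[data[group_col] == minority]
    n_min = len(minority_df)

    results = []
    for _ in range(n_repetitions):
        # Balanced sample: all minority cases plus an equally sized random majority draw
        idx = rng.choice(majority_df.index, size=n_min, replace=False)
        balanced = pd.concat([majority_df.loc[idx], minority_df])
        results.append(run_invariance_test(balanced, group_col))  # hypothetical analysis routine
    return results
```

Aggregating the parameters flagged as non-invariant across the repetitions then shows which non-invariances replicate across the balanced subsamples rather than appearing in only a few random draws.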

7. We chose to report the extraversion construct because it best illustrates the point of the subsampling procedure, namely, to detect non-invariances that might otherwise be masked by the unbalanced sample sizes.

8. In preliminary analyses, we also tried to examine factorial invariance by using the procedure outlined by Shi et al. (2019). However, the specified models (i.e., configural models with language group-specific approximate zero residual correlations) almost never converged, most likely because the models were too liberal (i.e., not identified).

References

  • Amthauer, R., Brocke, B., Liepmann, D., & Beauducel, A. (2001). Intelligenz-Struktur-Test 2000 R [Test of intelligence structure 2000 R]. Göttingen, Germany: Hogrefe.
  • Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21(4), 495–508. doi:10.1080/10705511.2014.919210.
  • Asparouhov, T., Muthén, B. O., & Morin, A. J. S. (2015). Bayesian structural equation modeling with cross-loadings and residual covariances: Comments on Stromeyer et al. Journal of Management, 41(6), 1561–1577. doi:10.1177/0149206315591075.
  • Bandalos, D. L. (2018). Measurement theory and applications for the social sciences. New York, NY: Guilford Press.
  • Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246. doi:10.1037/0033-2909.107.2.238.
  • Bollen, K. A., Gates, K. M., & Fisher, Z. (2018). Robustness conditions for MIIV-2SLS when the latent variable or measurement model is structurally misspecified. Structural Equation Modeling: A Multidisciplinary Journal, 25(6), 848–859. doi:10.1080/10705511.2018.1456341.
  • Boss, P., & Brenner, C. (2006). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skalen B5 (Big Five) [Cadre assessment II: Personality questionnaire: Scales B5 (Big Five)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Boss, P., & Fischer, S. (2006). Kaderbeurteilung II: Konzentrations-Belastungs-Test (KBT) [Cadre assessment II: Concentration-stress test]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York, NY: Guilford Press.
  • Cain, M. K., & Zhang, Z. (2019). Fit for a Bayesian: An evaluation of PPP and DIC for structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 26(1), 39–50. doi:10.1080/10705511.2018.1490648.
  • Cattell, R. B., Eber, H. W., & Tatsuoka, M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire (16PF). Champaign, IL: Institute for Personality and Ability Testing.
  • Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9(2), 233–255. doi:10.1207/S15328007SEM0902_5.
  • Cieciuch, J., Davidov, E., Algesheimer, R., & Schmidt, P. (2018). Testing for approximate measurement invariance of human values in the European Social Survey. Sociological Methods & Research, 47(4), 665–686. doi:10.1177/0049124117701478.
  • Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI): Professional manual. Odessa, FL: Psychological Assessment Resources.
  • Darr, W. (2011). Military personality research: A meta-analysis of the Self Description Inventory. Military Psychology, 23(3), 272–296. doi:10.1080/08995605.2011.570583.
  • Egger, I., & Boss, P. (2006a). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala SV (Soziales Verhalten) [Cadre assessment II: Personality questionnaire: Scale SV (social behavior)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Egger, I., & Boss, P. (2006b). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala KV (Konfliktverhalten) [Cadre assessment II: Personality questionnaire: Scale KV (proactive conflict behavior)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Egger, I., & Boss, P. (2006c). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala IN (Integrität) [Cadre assessment II: Personality questionnaire: Scale IN (integrity)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Elosua, P., & López-Jáuregui, A. (2008). Equating between linguistically different tests: Consequences for assessment. The Journal of Experimental Education, 76(4), 387–402. doi:10.3200/JEXE.76.4.387-402.
  • Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall.
  • Giauque, N., Vaso, H. M., & Boss, P. (2006). Kaderbeurteilung II: Führungsmotivations-Fragebogen [Cadre assessment II: Leadership motivation questionnaire]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Goldammer, P. (2019). The benefits of careless response screenings and regularized structural equation models in obtaining credible and interpretable study results: An illustration based in the evaluation of cadre selection tools of the Swiss Armed Forces (Doctoral dissertation). Zurich, Switzerland: University of Zurich.
  • Goldammer, P., Annen, H., Stöckli, P. L., & Jonas, K. (2020). Careless responding in questionnaire measures: Detection, impact, and remedies. The Leadership Quarterly, 31(4), 101384. doi:10.1016/j.leaqua.2020.101384.
  • Goldberg, L. R. (1981). Language and individual differences: The search for universals in personality lexicons. Review of Personality and Social Psychology, 2, 141–165.
  • Gunn, H. J., Grimm, K. J., & Edwards, M. C. (2020). Evaluation of six effect size measures of measurement non-invariance for continuous outcomes. Structural Equation Modeling: A Multidisciplinary Journal, 27(4), 503–514. doi:10.1080/10705511.2019.1689507.
  • Gürber, C., & Skupnjak, A. (2006). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala LM (Leistungsmotivation) [Cadre assessment II: Personality questionnaire: Scale LM (achievement motivation)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Holden, R. R. (2008). Underestimating the effects of faking on the validity of self-report personality scales. Personality and Individual Differences, 44(1), 311–321. doi:10.1016/j.paid.2007.08.012.
  • Hornke, L. F., Etzel, S., & Küppers, A. (2000). Konstruktion und Evaluation eines adaptiven Matrizentests [Design and evaluation of an adaptive matrices test]. Diagnostica, 46, 182–188. doi:10.1026//0012-1924.46.4.182
  • Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.
  • Ion, A., & Iliescu, D. (2017). The measurement equivalence of personality measures across high-and low-stake test taking settings. Personality and Individual Differences, 110, 1–6. doi:10.1016/j.paid.2017.01.008.
  • Jacobucci, R., & Grimm, K. J. (2018). Comparison of frequentist and Bayesian regularization in structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 639–649. doi:10.1080/10705511.2017.1410822.
  • Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(4), 555–566. doi:10.1080/10705511.2016.1154793.
  • Jäger, A. O. (1982). Mehrmodale Klassifikation von Intelligenzleistungen: Experimentell kontrollierte Weiterentwicklung eines deskriptiven Intelligenzstrukturmodells [Multimodal classification of intelligence tests: Experimentally controlled development of a descriptive model of intelligence structure]. Diagnostica, 28, 195–225.
  • Jäger, A. O. (1984). Intelligenzstrukturforschung: Konkurrierende Modelle, neue Entwicklungen, Perspektiven [Research on intelligence structure: Competing models, new developments, perspectives]. Psychologische Rundschau, 35, 21–35.
  • Jäger, A. O., Süss, H. M., & Beauducel, A. (1997). Berliner Intelligenzstruktur-Test: BIS-Test, Form 4 [Berlin Intelligence Structure Test, BIS-Test, Form 4]. Göttingen, Germany: Hogrefe.
  • Kaplan, D., & George, R. (1995). A study of the power associated with testing factor mean differences under violations of factorial invariance. Structural Equation Modeling: A Multidisciplinary Journal, 2(2), 101–118. doi:10.1080/10705519509539999.
  • Kim, E. S., Cao, C., Wang, Y., & Nguyen, D. T. (2017). Measurement invariance testing with many groups: A comparison of five approaches. Structural Equation Modeling: A Multidisciplinary Journal, 24(4), 524–544. doi:10.1080/10705511.2017.1304822.
  • Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). New York, NY: Springer.
  • Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Boston, MA: Academic Press.
  • Lécher, C., & Boss, P. (2006). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala BV (Beeinflussungsverhalten) [Cadre assessment II: Personality questionnaire: Scale BV (influencing behavior)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Lee, P., Joo, S. H., & Fyffe, S. (2019). Investigating faking effects on the construct validity through the Monte Carlo simulation study. Personality and Individual Differences, 150, 109491. doi:10.1016/j.paid.2019.07.001.
  • MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111(3), 490–504. doi:10.1037/0033-2909.111.3.490.
  • Maier, E. (2008). Testkürzungen: Persönlichkeitstests Kaderbeurteilung II [Test shortening: Personality tests cadre assessment II]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Marsh, H. W., Guo, J., Parker, P. D., Nagengast, B., Asparouhov, T., Muthén, B., & Dicke, T. (2018). What to do when scalar invariance fails: The extended alignment method for multi-group factor analysis comparison of latent means across many groups. Psychological Methods, 23(3), 524–545. doi:10.1037/met0000113.
  • Meade, A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743. doi:10.1037/a0018966.
  • Melliger, E., & Boss, P. (2006). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala SS (Selbstständigkeit) [Cadre assessment II: Personality questionnaire: Scale SS (independence)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Mussel, P. (2003). Persönlichkeitsinventar zur Integritätsabschätzung (PIA) [Integrity test PIA]. In J. Erpenbeck & L. von Rosenstiel (Eds.), Handbuch Kompetenzmessung [Manual of competence measurement] (pp. 3–18). Stuttgart, Germany: Schäffer-Poeschel.
  • Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17(3), 313–335. doi:10.1037/a0026802.
  • Muthén, B. O., & Asparouhov, T. (2013). BSEM measurement invariance analysis. (Mplus Web Notes: No. 17). Retrieved from: http://www.statmodel.com/examples/webnotes/webnote17.pdf
  • Muthén, B. O., Muthén, L. K., & Asparouhov, T. (2017). Regression and mediation analysis using Mplus. Los Angeles, CA: Muthén & Muthén.
  • Muthén, L. K. (2013, November 13). Standardization in BSEM? [Discussion board message]. Retrieved from: http://www.statmodel.com/discussion/messages/9/17396.html?1574717796
  • Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
  • Nye, C. D., & Drasgow, F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96(5), 966–980. doi:10.1037/a0022955.
  • Pokropek, A., Davidov, E., & Schmidt, P. (2019). A monte carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 26(5), 724–744. doi:10.1080/10705511.2018.1561293.
  • Shi, D., Song, H., DiStefano, C., Maydeu-Olivares, A., McDaniel, H. L., & Jiang, Z. (2019). Evaluating factorial invariance: An interval estimation approach using Bayesian structural equation modeling. Multivariate Behavioral Research, 54(2), 224–245. doi:10.1080/00273171.2018.1514484.
  • Stark, S., Chernyshenko, O. S., & Drasgow, F. (2004). Examining the effects of differential item (functioning and differential) test functioning on selection decisions: When are statistically significant effects practically important? Journal of Applied Psychology, 89(3), 497–508. doi:10.1037/0021-9010.89.3.497.
  • Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–90. doi:10.1086/209528.
  • Steiger, J. H., & Lind, J. C. (1980, May). Statistically-based tests for the number of common factors. Paper presented at the meeting of the Psychometric Society, Iowa City, IA.
  • Stoll, M. (2006). Kaderbeurteilung II: Persönlichkeitsfragebogen: Skala BB (Belastbarkeit) [Cadre assessment II: Personality questionnaire: Scale BB (stress tolerance)]. Zurich, Switzerland: Department of Social and Business Psychology, Institute of Psychology, University of Zurich.
  • Swiss Armed Forces. (2012). Qualifikations- und Mutationswesen in der Armee [Qualifications and redeployments in the Swiss Armed Forces]. Bern, Switzerland: BBL.
  • Tracey, T. J. (2016). A note on socially desirable responding. Journal of Counseling Psychology, 63(2), 224–232. doi:10.1037/cou0000135.
  • Winter, S. D., & Depaoli, S. (2020). An illustration of Bayesian approximate measurement invariance with longitudinal data and a small sample size. International Journal of Behavioral Development, 44(4), 371–382. doi:10.1177/0165025419880610.
  • Yoon, M., & Lai, M. H. (2018). Testing factorial invariance with unbalanced samples. Structural Equation Modeling: A Multidisciplinary Journal, 25(2), 201–213. doi:10.1080/10705511.2017.1387859.
  • Zyphur, M. J., & Oswald, F. L. (2015). Bayesian estimation and inference: A user’s guide. Journal of Management, 41(2), 390–420. doi:10.1177/0149206313501200.