
Why Full, Partial, or Approximate Measurement Invariance Are Not a Prerequisite for Meaningful and Valid Group Comparisons

Pages 859-870 | Received 18 Nov 2022, Accepted 12 Mar 2023, Published online: 03 May 2023

Abstract

It is frequently stated in the literature that measurement invariance is a prerequisite for the comparison of group means or standard deviations of the latent variable in factor models. This article argues that measurement invariance is not necessary for meaningful and valid comparisons across groups. There is unavoidable ambiguity in how researchers can define comparisons if measurement invariance is violated. Moreover, there is no support for preferring the partial invariance approach over competing approaches, such as invariance alignment, robust linking, or Bayesian approximate invariance. Furthermore, we also argue why an intentionally misspecified multiple-group factor model with invariant item parameters can be justified if measurement invariance is violated.

1. Introduction

In the social and behavioral sciences, concepts are frequently measured utilizing questionnaires. Different indicators (i.e., questions or items) are used to infer values of a latent factor variable from a set of item responses. The unidimensional factor model is typically used for relating items to the latent variable of interest (Mulaik, 2009). In many applications, it is of interest to compare the distribution of the factor variable across different groups of subjects. Frequently used grouping variables are demographic variables, such as gender, social status, or migration background, or different cultures or countries. In the psychometric literature, it is emphasized that the concept of measurement invariance (Flake et al., 2017; Meredith, 1993; Millsap, 2011; Molenaar & Borsboom, 2013; Somaraju et al., 2022; Vandenberg & Lance, 2000; Wicherts, 2016) plays a vital role in justifying meaningful and valid comparisons of means and standard deviations of the latent factor across groups (e.g., Boer et al., 2018; D’Urso et al., 2022; Kankaraš & Moors, 2014). Different levels of measurement invariance are discussed in the literature, depending on the restrictions that are imposed on the parameters of the factor model across groups. In the context of group mean comparisons, the focus is typically on scalar (also labeled strong) measurement invariance. Scalar invariance assumes the existence of equal (i.e., invariant) item intercepts and item loadings in the factor model across groups.

The current social science consensus is that establishing measurement invariance is a necessary prerequisite for group comparisons because it ensures that items have the same meaning across groups (Davidov et al., 2014; Lacko et al., 2022; Meredith & Teresi, 2006; Meuleman et al., 2022; Putnick & Bornstein, 2016; but see for opposing views Funder, 2020; Protzko, 2022; Welzel & Inglehart, 2016). In his popular textbook on structural equation modeling, Kline (2016, p. 398) argues that “strong invariance is the minimal level required for meaningful interpretation of group mean contrasts.” Boer et al. (2018) explicitly write that “only scalar invariance allows validly comparing scale mean scores across cultures.” Furthermore, they also note that “comparing scale mean scores across cultures (using t-tests, multivariate ANOVAs, SEM with mean structures, or multilevel analyses) is appropriate only if scalar invariance is established.” Unequivocally, Davidov and Meuleman (2019) also state that “scalar invariance is a necessary condition to make valid comparisons of latent [group] means.” In line with these statements, Putnick and Bornstein (2016) suggest that “…measurement invariance is fast becoming de rigueur in psychological and developmental research.” Given these quotes from the literature, it seems legitimate to say that there exist strong claims that measurement invariance would be a prerequisite for meaningful and valid comparisons across groups.

In this article, we question the relevance of measurement invariance testing for meaningful and valid group comparisons. Our argument proceeds in two steps. In the first step, we clarify that the assumption of scalar measurement invariance across groups imposes the absence of item-by-group interactions in item parameters. However, we challenge the importance of establishing scalar measurement invariance and provide reasons why, in our view, the presence of item-by-group interactions does not preclude group comparisons. In addition, empirical research has shown that “scalar invariance, however, has been found to rarely fit the data well, especially in the case of many groups” (Muthén & Asparouhov, 2018, p. 640). In response to this, different approaches for dealing with violations of measurement invariance (e.g., partial invariance, invariance alignment) have been proposed. These approaches allow for group comparisons under the assumption of partial (Byrne et al., 1989) or approximate (Lek et al., 2019; Muthén & Asparouhov, 2018; Seddig & Leitgöb, 2018) measurement invariance. However, in the second step, we clarify that group mean comparisons under partial or approximate invariance rely on strong—and often untestable—assumptions about the structure of the item-by-group interactions. Thus, there is ambiguity in defining group comparisons when the assumption of full measurement invariance is not met (i.e., measurement non-invariance) because none of the competing estimation approaches can be generally preferred. Consequently, we consider statistical techniques for investigating measurement invariance only as tools for exploring data that might help understand questionnaire data. However, we are convinced that the violation of measurement invariance typically does not threaten validity.

The article is structured as follows. Section 2 reviews the unidimensional factor analysis model for multiple groups. Section 3 discusses the assumptions of measurement invariance, and Section 4 argues why we think that measurement invariance has only minor relevance for ensuring meaningful and valid group comparisons. Furthermore, Section 5 discusses different estimation strategies for dealing with measurement non-invariance. We argue that there is ambiguity in defining group comparisons because a researcher must individually make untestable model assumptions. The paper closes with a discussion in Section 6.

2. Unidimensional Factor Analysis

Assume that there are $I$ items $X_1, \ldots, X_I$ that are measured in a finite number of $G$ groups. The unidimensional factor model for the $I$ items in group $g$ is given as (see, e.g., Meredith, 1993)

$$X_i = \nu_{ig} + \lambda_{ig} F + e_i, \quad E(F) = \alpha_g, \quad \mathrm{SD}(F) = \psi_g, \tag{1}$$

where $F$ is the latent factor, and $\nu_{ig}$ and $\lambda_{ig}$ are the item intercept and item loading for item $i$ in group $g$, respectively. The item residual variables $e_i$ ($i = 1, \ldots, I$) are assumed to have zero means and are uncorrelated with each other and with $F$. Note that the group-specific mean $\alpha_g$ and standard deviation $\psi_g$ cannot be identified unless additional constraints on item intercepts or item loadings are imposed. Under a multivariate normal distribution assumption, the group-specific mean vector $\boldsymbol{\mu}_g$ and covariance matrix $\boldsymbol{\Sigma}_g$ are sufficient statistics of the factor model in group $g$. In the rest of the article, we deal with population-level data and do not distinguish sample statistics from population statistics because we focus on identification conditions and asymptotic bias for population-level data.

The typical first step in the assessment of measurement invariance is fitting the unidimensional factor model (1) within each group $g$ (see, e.g., Schroeders & Gnambs, 2020; van de Schoot et al., 2012). One sets $\alpha_g = 0$ and $\psi_g = 1$ as identification constraints. The group-specific item means $\mu_{i,g} = E(X_i \mid G = g)$ provide saturated estimates of the identified item intercepts $\nu_{i,g}^*$ (i.e., $\mu_{i,g} = \nu_{i,g}^*$). The covariances $\sigma_{ij,g} = \mathrm{Cov}(X_i, X_j \mid G = g)$ for pairs of items $i$ and $j$ in group $g$ follow the restricted model

$$\sigma_{ij,g} = \lambda_{i,g}^* \lambda_{j,g}^*, \tag{2}$$

if the unidimensional factor model holds. In Equation (2), $\lambda_{i,g}^*$ are the identified group-specific item loadings in each of the groups. The item loadings can be identified from observed covariances (Steyer, 1989):

$$\lambda_{i,g}^* = \sqrt{\frac{\sigma_{ij,g}\, \sigma_{ik,g}}{\sigma_{jk,g}}} \tag{3}$$

for two items $j$ and $k$ that are distinct from $i$, assuming that all covariances are positive (Note 1). Overall, these properties make clear that the identified group-specific intercepts $\nu_{i,g}^*$ and loadings $\lambda_{i,g}^*$ in the unidimensional factor model are functions of the observed sufficient statistics $\mu_{i,g}$ and $\sigma_{ij,g}$ only.
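To illustrate this identification result numerically, consider the following minimal sketch (in Python with NumPy; the loadings and residual variances are hypothetical values chosen for illustration). It recovers a loading from three pairwise covariances via Equation (3):

```python
# Numeric check of Equation (3) under a unidimensional factor model.
import numpy as np

lam = np.array([0.8, 0.6, 0.7])        # hypothetical loadings lambda_i (psi = 1)
Sigma = np.outer(lam, lam)             # model-implied covariances sigma_ij
np.fill_diagonal(Sigma, lam**2 + 0.3)  # add residual variances on the diagonal

# Recover the loading of item 1 (index 0) from the covariances of
# item pairs (1,2), (1,3), and (2,3)
lam1 = np.sqrt(Sigma[0, 1] * Sigma[0, 2] / Sigma[1, 2])
print(lam1)  # 0.8, matching the data-generating loading
```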

For ordinal items, factor analysis can be based on item thresholds and polychoric correlations if the probit link function and the graded response model are used (Muthén, 1984). In the case of ordinal items, violations of measurement invariance are also labeled differential item functioning (DIF; Holland & Wainer, 1993). In the following, we focus on the case in which items are treated as continuous variables. Nevertheless, the main arguments in this paper also apply to ordinal items.

3. Absence of Item-By-Group Interactions Under Full Invariance

If the unidimensional factor model is fitted separately for each group, group-specific differences in the factor means $\alpha_g$ and standard deviations $\psi_g$ ($g = 1, \ldots, G$) are absorbed into the identified item intercepts $\nu_{i,g}^*$ and loadings $\lambda_{i,g}^*$. Hence, some identification constraints must be imposed to enable the computation (i.e., identification) of $\alpha_g$ and $\psi_g$. One popular identification constraint is assuming full invariance of item parameters across groups; that is, the existence of common item intercepts and loadings across groups. In the literature, two types of measurement invariance are distinguished (Millsap, 2011): measurement invariance regarding item loadings (metric or weak invariance) and measurement invariance regarding item intercepts and loadings (scalar or strong invariance). Both invariance types will also be referred to as full invariance if the conditions hold exactly.

Metric invariance assumes equal loadings across groups, which enables the identification of the group-specific standard deviations $\psi_g$ of the factor variable. The assumption of the existence of common item loadings $\lambda_{i,0}$ results in

$$\lambda_{i,g}^* = \lambda_{i,0}\, \psi_g. \tag{4}$$

For model identification, we set the standard deviation $\psi_1$ of the first group equal to 1. With positive item loadings (Note 2), we can use the notation $l_{i,g}^* = \log \lambda_{i,g}^*$, $l_{i,0} = \log \lambda_{i,0}$, and $p_g = \log \psi_g$. From Equation (4), we obtain

$$l_{i,g}^* = l_{i,0} + p_g. \tag{5}$$

It can be seen that (5) is an additive model for the two-way data $l_{i,g}^*$ ($i = 1, \ldots, I$; $g = 1, \ldots, G$), which is equivalent to presupposing the absence of interactions in an analysis of variance (ANOVA) model for the log-transformed item loadings $l_{i,g}^*$ (see Robitzsch & Lüdtke, 2020; van der Linden, 1994). Note that condition (4) holds if

$$\frac{\lambda_{i,g}^*}{\lambda_{i,h}^*} = \frac{\lambda_{j,g}^*}{\lambda_{j,h}^*} \tag{6}$$

for all items $i$ and $j$ and all pairs of groups $g$ and $h$. Thus, metric invariance across groups is established if the ratio of item loadings is constant across items for all pairs of groups. The literature often argues that metric invariance is required for a meaningful and valid comparison (Note 3) of group standard deviations.
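As a numerical illustration of Equations (4) to (6), the following sketch (Python with NumPy; all parameter values are hypothetical) shows that group-wise identified loadings under metric invariance yield additive log-loadings and item-constant loading ratios:

```python
# Metric invariance: lambda*_{i,g} = lambda_{i,0} * psi_g (Equation 4).
import numpy as np

lam0 = np.array([0.8, 0.6, 0.7])   # common loadings lambda_{i,0}
psi = np.array([1.0, 1.3])         # group SDs; psi_1 = 1 for identification

lam_star = np.outer(lam0, psi)     # identified group-specific loadings
log_lam = np.log(lam_star)         # l*_{i,g} = l_{i,0} + p_g: additive, no interaction (Equation 5)

# Equation (6): the loading ratio between groups is the same for every item
print(lam_star[:, 0] / lam_star[:, 1])  # identical entries 1/1.3 for all items
```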

Scalar invariance presupposes the existence of common item intercepts $\nu_{i,0}$ in addition to common item loadings $\lambda_{i,0}$ across groups. This property allows the identification of the group means $\alpha_g$ of the factor variable. Assuming that metric invariance holds (i.e., the existence of common item loadings $\lambda_{i,0}$), the assumption of common item intercepts results in

$$\nu_{i,g}^* = \nu_{i,0} + \lambda_{i,0}\, \alpha_g. \tag{7}$$

By dividing both sides of Equation (7) by $\lambda_{i,0}$ and using the notation for transformed item intercepts $u_{i,g}^* = \nu_{i,g}^*/\lambda_{i,0}$ and $u_{i,0} = \nu_{i,0}/\lambda_{i,0}$, and writing $a_g = \alpha_g$, we obtain the additive model

$$u_{i,g}^* = u_{i,0} + a_g. \tag{8}$$

For model identification, the mean $\alpha_1$ of the first group is set to 0. Again, Equation (8) corresponds to an ANOVA model for the transformed item intercepts $u_{i,g}^*$ with only main effects. Hence, scalar invariance excludes the possibility of item-by-group interactions in the transformed item intercepts $u_{i,g}^*$. In the literature, it is often argued that scalar invariance is required for a meaningful and valid comparison of group means.
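Analogously, a minimal sketch of Equations (7) and (8) (Python with NumPy; hypothetical parameter values) shows that under scalar invariance the transformed intercepts differ between groups by the same constant for every item:

```python
# Scalar invariance: nu*_{i,g} = nu_{i,0} + lambda_{i,0} * alpha_g (Equation 7).
import numpy as np

nu0 = np.array([0.2, -0.1, 0.4])   # common intercepts nu_{i,0}
lam0 = np.array([0.8, 0.6, 0.7])   # common loadings lambda_{i,0}
alpha = np.array([0.0, 0.5])       # group means; alpha_1 = 0 for identification

nu_star = nu0[:, None] + np.outer(lam0, alpha)  # identified intercepts
u_star = nu_star / lam0[:, None]                # transformed intercepts u*_{i,g}

# Equation (8): the group difference equals the factor mean contrast for every item
print(u_star[:, 1] - u_star[:, 0])              # 0.5 for all items
```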

To sum up, we have shown that the assumptions of metric and scalar invariance correspond to ANOVA models without interaction effects of transformed group-specific item loadings and intercepts. We critically reflect on the relevance of this property in the following Section 4.

4. Why Full Invariance Has Limited Relevance

In this section, we present three arguments why the assumption that metric and scalar measurement invariance exactly holds (i.e., full invariance) has only limited relevance for ensuring a meaningful and valid comparison of group means and standard deviations.

First, we question the usefulness of the implied assumption of the absence of interaction effects in the ANOVA models for transformed item loadings and item intercepts (see Equations 5 and 8). Of course, one can compute average effects in ANOVA models if interaction effects occur. However, additional constraints (i.e., weighting schemes) must be imposed on the interaction effects (Maxwell et al., 2017). In the causal effects literature, it is pointed out that average treatment effects are well-defined even if there are heterogeneous treatment effects (i.e., if the treatment effect is moderated by covariates; see, e.g., Morgan & Winship, 2015). It is sometimes argued that one compares apples and oranges if measurement invariance is violated (see, e.g., Chen, 2008, or Greiff & Scherer, 2018, for a discussion). If items function differentially across groups (i.e., there are item-by-group interactions in item parameters), items supposedly do not measure “the same concept” (i.e., factor; see also Little, 2013). This would threaten validity and preclude researchers from making meaningful group comparisons. We think such statements can only be regarded as metaphorical claims without statistical or empirical meaning (see Robitzsch & Lüdtke, 2022, for a discussion). For example, Meuleman et al. (2022) argue that group comparisons “should reflect true differences rather than measurement differences.” It remains unclear what is meant by “true differences” because these have to be identified from “measurement differences” (i.e., observed data) in any case. Such statements indicate that those researchers simply believe there should be no differential functioning of the items because this would compromise meaningful and valid group comparisons. In the same way, Meredith and Teresi (2006) argue that “the failure of invariance to hold is […] evidence that the manifest variables [i.e., observed variables] fail to measure the same latent attributes [i.e., latent variables] in the same way in different situations.” As is also evident in this quote, the definition of latent variables becomes circular because they do not even exist in the statistical model and must always be inferred from manifest variables (Note 4). Instead, latent variables only emerge from restrictions in the mean vector and covariance matrix of the observed variables (i.e., they are represented by model parameters in the multiple-group factor analysis). The typical reasoning seems to imply that “true differences” are attributed to a causal interpretation of the latent factor variable (Borsboom, 2008). Notably, the substitution of assessing the adequacy of the measurement process with demonstrating acceptable model fit of psychometric (factor) models can be questioned (Uher, 2021; see also Borgstede & Eggert, 2023, or Edelsbrunner, 2022).

Second, it might be argued that full invariance is desirable because group-specific factor means and standard deviations would remain invariant if only a subset of the items were used in an analysis. Hence, empirical results of group comparisons would not depend on the particular choice of items in a questionnaire. We think the choice of items will impact the concept to be measured in any case (VanderWeele, 2022). Moreover, in many applications (e.g., comparing self-concept scales between gender groups or countries), the scale and, hence, the concrete choice of items is fixed. With a fixed scale, the property of invariance concerning subsets of items is simply irrelevant. It seems plausible that group comparisons could change if a self-concept scale is measured with four items selected from a five-item questionnaire. However, we would argue that the measured concept changes with a subscale of items, and it is unclear why the concept of interest should be restricted to fewer items.

Third, we question whether item-by-group interactions in the ANOVA models for transformed item loadings and intercepts pose a threat to validity at all if the differential functioning of items is construct-relevant (Camilli, 1993; Shealy & Stout, 1993) (Note 5). By excluding items with item-by-group interactions, validity would decrease because the purposefully constructed measurement instrument would be changed in the statistical analysis (De Los Reyes et al., 2022; El Masri & Andrich, 2020; Robitzsch & Lüdtke, 2022). We further elaborate on this issue when discussing the role of partial invariance.

In empirical research, full invariance seems to be as rare as unicorns, and metric invariance seems to be fulfilled much more frequently than scalar invariance (Rutkowski & Svetina, 2014). The following Section 5 discusses different approaches for dealing with violations of measurement invariance. It is argued that there is unavoidable ambiguity in identifying group means and standard deviations if the assumption of full invariance is not met. We assert that it is also unnecessary for meaningful group comparisons to demonstrate that measurement invariance is violated only to some tolerable extent.

5. Ambiguity in Choosing Identification Constraints Under Violations of Full Invariance

In practice, full invariance will almost always be violated to some extent (e.g., Ellis, 1993; Leitgöb et al., 2023). As argued in Section 3, the extent of non-invariance is characterized by the interaction effects in the ANOVA models defined in Equations (5) and (8). The interactions can be regarded as DIF effects in item loadings (i.e., $\tilde{l}_{i,g}$) and item intercepts (i.e., $\tilde{u}_{i,g}$):

$$l_{i,g}^* = l_{i,0} + p_g + \tilde{l}_{i,g} \tag{9}$$
$$u_{i,g}^* = u_{i,0} + a_g + \tilde{u}_{i,g} \tag{10}$$

For the transformed item loadings $l_{i,g}^*$ and item intercepts $u_{i,g}^*$ in Equations (9) and (10), there are $I \times G$ input parameters in each equation, but there are $I + G$ main effects (i.e., $l_{i,0}$ and $p_g$, or $u_{i,0}$ and $a_g$) and $I \times G$ interaction effects (i.e., $\tilde{l}_{i,g}$ and $\tilde{u}_{i,g}$) in the model equations. Hence, the number of estimated parameters would exceed the number of input parameters. This implies that identification constraints must be imposed under violations of measurement invariance because there are more unknown parameters than input parameters; see the sketch below. The statistical approaches for handling non-invariance differ in several aspects. First, we can distinguish approaches that specify a multiple-group factor model in one step from approaches that first estimate group-wise factor models and link the results onto a common metric in a subsequent linking step (Kolen & Brennan, 2014) (Note 6). It has been demonstrated that two-step methods can alternatively be estimated as one-step methods with appropriate constraints on the interaction effects in (9) and (10) (see Little et al., 2006, or von Davier & von Davier, 2007). Second, approaches can be distinguished according to whether or not they explicitly model DIF effects. As we will argue below, this distinction is practically irrelevant because it is the choice of the estimation function that matters rather than the mere fact of including DIF effects in the factor model (see Robitzsch, 2022). Third, estimation approaches can be distinguished according to whether their resulting group comparisons are robust to the presence of a few interaction effects in the transformed item loadings or item intercepts in (9) and (10). This distinction is the most important because it classifies the approaches into robust and nonrobust linking of groups (Robitzsch, 2021). Importantly, the former allows some items or their item parameters to be excluded from the comparisons, while the latter does not exclude or downweight items. We now classify competing approaches to handling non-invariance regarding this aspect.
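The identification problem can be made concrete with a small numerical sketch (Python with NumPy; all values hypothetical). One possible, and untestable, constraint sets the average DIF effect within each group to zero, in which case the group contrast is identified as a difference of column means:

```python
# Identification of a_g in Equation (10) under a sum-to-zero DIF constraint.
import numpy as np

u_star = np.array([[0.10, 0.70],   # transformed intercepts u*_{i,g}
                   [0.00, 0.40],   # rows: items i, columns: groups g
                   [-0.10, 0.60]])

col_means = u_star.mean(axis=0)    # mean item effect + a_g if DIF averages to zero
print(col_means[1] - col_means[0]) # identified contrast a_2 - a_1 = 0.567
```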

5.1. Partial Invariance Approach and Robust Linking

Among the robust linking approaches, the partial invariance approach is likely the most frequently employed solution to measurement non-invariance (Byrne et al., 1989; Steenkamp & Baumgartner, 1998). In this approach, it is assumed that only a few interaction effects in (9) and (10) differ from zero, while most interaction effects equal zero. Modification indices can be used to detect which of the parameters should be estimated as different from zero. The multiple-group factor model is then reestimated in a second analysis step in which some item parameters are allowed to be group-specific (Byrne et al., 1989). It is often argued that partial invariance ensures meaningful and valid group comparisons if full invariance is violated. From the perspective of ANOVA estimation with interaction effects, it is interesting that partial invariance corresponds to an estimation problem with a loss function that minimizes the number of nonzero interaction effects (i.e., an L0 loss function; see Davies, 2014; Robitzsch, 2020a; Robitzsch & Lüdtke, 2020); see the sketch below. We have pointed out elsewhere that the partial invariance approach comes with the disadvantage that comparisons of pairs of groups can effectively depend on different sets of items because items whose group-specific parameters were freed do not contribute to the linking process (Robitzsch & Lüdtke, 2022). For example, in a cross-cultural study, the comparison of Poland and Germany could depend on a different set of items than the comparison of Poland and Austria (Note 7). We consider this property a threat to validity. Thus, we would argue that there is a danger of comparing apples and oranges when using the partial invariance approach in applications with more than two groups (Robitzsch & Lüdtke, 2022). In contrast, the set of items is fixed for all group comparisons if a misspecified multiple-group factor model with invariant item parameters is fitted to all groups.
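The L0 perspective can be sketched as follows (Python with NumPy; hypothetical intercept differences between two groups). The group contrast is chosen so that as many DIF effects as possible are exactly zero; the remaining items would receive freed, group-specific parameters:

```python
# Partial invariance as L0 loss minimization over the group contrast a.
import numpy as np

d = np.array([0.50, 0.50, 0.50, 1.20, 0.50])  # u*_{i,2} - u*_{i,1} per item

def l0_loss(a, tol=1e-8):
    return np.sum(np.abs(d - a) > tol)        # number of nonzero DIF effects

candidates = np.unique(d)                     # an L0 minimizer lies on a data point
a_hat = min(candidates, key=l0_loss)
print(a_hat)  # 0.5: the modal difference; the item with d = 1.2 is freed
```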

5.2. Regularization Techniques

As an alternative to the two-step method of partial invariance, regularization techniques with a lasso-type penalty function (see, e.g., Finch, 2022, for an overview) can be utilized for fitting the multiple-group factor model (Bauer et al., 2020; Geminiani et al., 2021; Huang, 2018; Liang & Jacobucci, 2020; Schauberger & Mair, 2020). As in partial invariance, it is assumed that DIF effects are sparsely distributed; that is, the majority of DIF effects must equal zero to enable an unbiased estimation of group means and standard deviations. One can expect that partial invariance and regularization will provide very similar results in applications (Robitzsch, 2022).

5.3. Invariance Alignment

Minor violations of measurement invariance are also referred to as approximate invariance in the literature (Arts et al., 2021; Lomazzi, 2021; Luong & Flake, 2022; Meitinger et al., 2020; Muthén & Asparouhov, 2018; van de Schoot et al., 2013). This label covers approaches that acknowledge deviations from full invariance in multiple-group factor analysis with DIF effects. Invariance alignment is a two-step estimation procedure in which the group-specific identified item parameters are linked in a second step such that the majority of DIF effects are small while only a few DIF effects are large (Asparouhov & Muthén, 2014; Muthén & Asparouhov, 2014). In this regard, invariance alignment is ideally suited to situations close to partial invariance. The alignment method can be considered an automated version of the partial invariance approach (Leitgöb et al., 2023). In fact, invariance alignment falls into the class of robust linking techniques popular in item response theory (Robitzsch, 2020a). Alternative techniques, such as robust Haberman linking, might be preferred over invariance alignment for statistical reasons (Robitzsch, 2020a). The vital aspect of robust linking techniques, such as invariance alignment, is that a few large DIF effects are treated as outliers and are eliminated or down-weighted in the estimation of group means and standard deviations (Magis & De Boeck, 2011; Robitzsch, 2022; Robitzsch & Lüdtke, 2020; Wang et al., 2022). The similarity of lasso-type regularization and robust linking has also been pointed out by Chen et al. (2021) (Note 8). Because invariance alignment may be viewed as an alternative implementation of the partial invariance approach, our critique of comparing apples with oranges in partial invariance also applies to the alignment, robust linking, and regularization approaches.
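The outlier-downweighting behavior of robust linking can be illustrated with a grid-search sketch (Python with NumPy; hypothetical intercept differences). The power loss $|x|^{0.5}$ is used here as a stand-in for the alignment-type component loss function:

```python
# Robust linking of two groups with an alignment-type |x|^0.5 loss.
import numpy as np

d = np.array([0.50, 0.48, 0.52, 1.20, 0.50])  # intercept differences per item

grid = np.linspace(0.0, 1.5, 3001)
loss = [np.sum(np.sqrt(np.abs(d - a))) for a in grid]
a_hat = grid[int(np.argmin(loss))]
print(a_hat)  # 0.50: the outlying DIF effect of 1.2 is effectively ignored
```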

5.4. Bayesian Approximate Invariance with Normal Prior Distributions

As an alternative, Bayesian approximate invariance (Muthén & Asparouhov, 2018) has been proposed, which specifies normal prior distributions with known variances on the differences between group-specific item parameters (van de Schoot et al., 2013). It has been shown that normal prior distributions on parameter differences correspond to normal prior distributions on DIF effects (Battauz, 2020; Robitzsch, 2022). Alternatively, the variance of the prior distribution can also be estimated, which is referred to as multiple-group factor analysis with random item effects (De Boeck, 2008; De Jong et al., 2007; Fox & Verhagen, 2010). Importantly, using normal prior distributions presupposes that DIF effects are normally distributed; in particular, it is assumed that they follow a symmetric distribution. This assumption is in stark contrast to the assumption of partial invariance, in which the majority of DIF effects are zero (or very small). With only a few items per factor (say, fewer than 20), it is very unlikely that statistical modeling will help find the correct distributional assumption for the DIF effects. Hence, it is up to the researcher whether he or she believes in a particular strategy for modeling measurement non-invariance (Robitzsch, 2022). It should be emphasized that robust linking and, hence, invariance alignment treat DIF effects as outliers and essentially remove items from some group comparisons. This assumption is not imposed in Bayesian approximate invariance: all items enter the determination of group means and group standard deviations but are weighted according to the size of their DIF effects.
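For contrast with the robust linking sketch above, the following sketch (Python with NumPy; the prior variance and error variance are hypothetical values) illustrates the normal-prior logic: DIF effects are shrunken toward zero rather than selected, and every item contributes to the group contrast:

```python
# Normal priors on DIF effects act like a ridge penalty: profiling the DIF
# effects out of the penalized least squares criterion yields the plain mean
# as group contrast, with shrunken (but nonzero) DIF estimates for all items.
import numpy as np

d = np.array([0.50, 0.48, 0.52, 1.20, 0.50])  # intercept differences per item
tau2 = 0.05   # assumed prior variance of DIF effects
s2 = 0.01     # assumed error variance of the differences

a_hat = d.mean()                              # 0.64: all items enter, including the outlier
dif_hat = tau2 / (tau2 + s2) * (d - a_hat)    # shrunken DIF effects
print(a_hat, dif_hat.round(3))
```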

5.5. Intentionally Misspecified Models

Another option is to intentionally not model DIF effects (Robitzsch, 2022). That is, invariant item parameters are assumed in the statistical model, and there is no attempt to introduce additional item parameters for DIF effects. The estimation function of the multiple-group factor model then weighs sampling error against model error. M-estimation theory provides the statistical framework for obtaining valid statistical inference for misspecified models (Berk et al., 2014; Boos & Stefanski, 2013; Huber, 1964; White, 1982). It is well known that maximum likelihood estimation is the most efficient estimation method in the case of correctly specified models (i.e., under full invariance). However, under violations of invariance, alternative estimation methods, such as diagonally weighted least squares or unweighted least squares (Note 9), can be used (Robitzsch, 2022). Using unweighted least squares, model errors referring to DIF effects are equally weighted in the computation of group means, which might be seen as a desirable property. If DIF effects should instead be down-weighted or essentially eliminated from the computation of group means and standard deviations, robust estimation functions, such as least absolute deviation estimation, can be utilized (Robitzsch, 2022; Siemsen & Bollen, 2007).
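A final sketch (Python with NumPy; hypothetical intercept differences) shows how the choice of estimation function encodes the treatment of DIF in an intentionally misspecified model with invariant intercepts:

```python
# Group contrast under two estimation functions for the same misspecified model.
import numpy as np

d = np.array([0.50, 0.48, 0.52, 1.20, 0.50])  # intercept differences per item

a_uls = d.mean()      # unweighted least squares: argmin_a sum_i (d_i - a)^2
a_lad = np.median(d)  # least absolute deviation: argmin_a sum_i |d_i - a|
print(a_uls, a_lad)   # 0.64 vs. 0.50: only ULS is moved by the outlying DIF effect
```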

5.6. Summary

Overall, we would like to argue that there is unavoidable ambiguity in handling situations of measurement non-invariance. We find recommendations that particular approaches enable (approximately) unbiased estimation of group means ill-advised because individual researchers define how to handle DIF effects by choosing an estimation function (Robitzsch, 2022). We do not think there are statistical arguments for preferring a partial invariance or an invariance alignment approach over estimating the multiple-group factor analysis model assuming invariant item parameters and utilizing unweighted least squares estimation. Such recommendations depend on how the truth is defined. It is unlikely that there will be any consistent frontrunner among the competing approaches for dealing with measurement non-invariance (see Muthén & Asparouhov, 2018; Pokropek et al., 2019, 2020). Notably, the choice of estimation method under measurement non-invariance relies on assumptions about the structure of the DIF effects. There can be a few large but outlying DIF effects, or DIF effects can be rather symmetrically distributed. The former case corresponds to the partial invariance approach, while the latter refers to Bayesian approximate invariance with normal prior distributions. With only a few items, the true data-generating structure can hardly be determined. Instead, an estimation method explicitly encodes the researcher’s belief about the structure of DIF effects and is “strategically chosen” (Asparouhov & Muthén, 2023). Asparouhov and Muthén (2023) argue that alignment-based methods will typically provide less biased group comparisons than Bayesian approaches with normal prior distributions on DIF effects. We find such statements, offered as recommendations for practitioners, highly questionable because they carry a claim of general validity that could never be seriously justified.

6. Discussion

In this article, we have argued that the relevance of measurement invariance for ensuring meaningful and valid group comparisons is likely overstated (Welzel et al., 2021). It was pointed out why full invariance should not be considered a prerequisite for meaningful and valid group comparisons. Nevertheless, some kind of measurement equivalence of items (such as translational equivalence) must be ensured for validity reasons (van de Vijver, 2018). However, measurement invariance testing should not replace the difficult non-statistical task of demonstrating measurement equivalence (but see Fischer et al., 2022). Several researchers erroneously equate the concepts of measurement invariance and measurement equivalence (Boer et al., 2018; Geiser, 2020). A large part of the literature argues that measurement invariance is necessary but not sufficient for ensuring meaningful and valid group comparisons (Fischer et al., 2022; Meitinger, 2017; van de Vijver, 2018). This article argues against this perspective: measurement invariance is not a necessary condition (see our arguments in Section 4).

We want to highlight that measurement invariance assessment can only detect item-specific differential functioning across groups. Bias due to systematic effects, such as response styles, typically cannot be detected with measurement invariance assessment. Nevertheless, assessing measurement non-invariance can be a viable screening tool for detecting potential issues in scale indicators. However, in our view, items should not automatically be removed from group comparisons or receive group-specific item parameters in the partial invariance approach unless the measurement non-invariance has been determined to be construct-irrelevant. Researchers should provide substantive (and non-statistical) evidence of why they believe items should be removed from the comparisons. There should not be mechanistic rules for group comparisons solely determined by statistical criteria. In this sense, we think applying psychometric models should always be carried out from the perspective of (external) validity (Kane, 2013). We feel it is inappropriate to accuse other researchers of conducting non-rigorous research because they did not test for measurement invariance (e.g., Boer et al., 2018; D’Urso et al., 2022). Importantly, we do not want to claim that a justification for selecting particular items for measuring a construct is not required. However, we firmly believe that measurement invariance assessment is not necessarily a prerequisite for establishing sufficient measurement quality of the operationalization of a construct.

We also argued that there is unavoidable ambiguity in defining identification constraints if the assumption of full invariance is violated (i.e., measurement non-invariance). There are competing plausible methods for handling non-invariance (Note 10), and it is usually up to the individual researcher to select one of these methods (Robitzsch, 2022; Zitzmann & Loreth, 2021). In particular, we are not convinced by a preference for partial invariance over alternative approaches because it results in comparing apples with oranges for more than two groups. In addition, applying the partial invariance approach implies that some noninvariant items are excluded from group comparisons, which can threaten validity. Moreover, it is unlikely that sparsely distributed DIF effects frequently occur in practice, although this setting is the one most frequently employed in simulation studies (e.g., Asparouhov & Muthén, 2014; Steinmetz, 2013). In our experience with cognitive data and many test items, it is much more likely that DIF effects are symmetrically distributed and closely follow a normal distribution.

Furthermore, we question whether statistical tests of full invariance or model fit effect sizes have relevance in applied research. In our preferred approach, the specified multiple-group factor analysis with invariant item parameters is intentionally misspecified, and the assumption of measurement invariance is made only for identification reasons. We have argued that unweighted least squares estimation might be preferred over maximum likelihood estimation in the multiple-group factor model because the former equally weighs model errors (i.e., DIF effects). Hence, we also think that the emphasis on approximate invariance approaches under violations of full invariance is misguided (e.g., Leitgöb et al., 2023) because a misspecified multiple-group factor analysis with assumed (but empirically violated) invariance is equally defensible. Notably, the assumed pattern of DIF effects (i.e., sparsely or symmetrically distributed) is crucial for the choice of analysis.

6.1. Why Longitudinal and Multilevel Measurement Invariance Are Also Not Required

We think the concept of measurement invariance is also not necessary for ensuring meaningful and valid comparisons in similar types of analyses with factor models. Applying our arguments for groups to time points, establishing measurement invariance regarding time points in longitudinal data (Geiser, 2020; Little, 2013; Wicherts et al., 2004; Winter & Depaoli, 2020) or regarding the combination of groups and time points in longitudinal data (Avvisati, 2020; Koc & Pokropek, 2022; Lee & von Davier, 2020; Seddig et al., 2020) does not seem to us a prerequisite for making comparisons across different time points. Moreover, we do not believe that measurement invariance across levels in multilevel models (Jak, 2019; Jak & Jorgensen, 2017; Ryu, 2014; Ryu & Mehta, 2017) is required, nor does the absence of cluster bias (Jak et al., 2013) seem to be a prerequisite for analyzing multilevel data with multilevel factor models.

6.2. Latent Class and Mixture Models for Handling Measurement Non-invariance

Alternative approaches address the case of measurement invariance for a larger number of groups (say, more than ten). Latent class or mixture models can be applied that define clusters (i.e., latent classes) of invariant groups (De Roover, 2021; De Roover et al., 2022; Eid, 2019; see also Zieger et al., 2019). These approaches suggest that comparisons would only be meaningful for subsets of groups that share the same item parameters in the measurement model. Hence, one would typically conclude in applications that comparisons across all groups are not meaningful and the original research question cannot be answered. We argued in this article why this reasoning is unjustified.

6.3. Why Simulation Studies Have Limited Relevance for Recommendations for Practitioners

A lot of simulation research demonstrates how group comparisons are biased under particular violations of measurement invariance (e.g., Steinmetz, 2013). We think that simulation studies have limited relevance for generating recommendations for practitioners in applied research under measurement non-invariance. Simulation studies always define the truth by assuming specific data-generating parameters. The essential point is that the simulators already know the structure of the DIF effects in the measurement situation. However, as we argued in the previous section, applied researchers only have empirical data. They typically cannot test competing assumptions about the structure of measurement non-invariance because there is always ambiguity about what to believe regarding the structure of DIF effects. Applied researchers can only argue why they prefer a particular treatment of measurement non-invariance by choosing a particular estimation method. Notably, there are no definite criteria legitimating statements that the partial invariance or invariance alignment approach will generally provide less biased group comparisons in empirical research. Simulation studies can help methodological and applied researchers understand different estimation methods under different data-generating models. Nevertheless, the choice of a statistical method in a concrete empirical application cannot be defended based on results from simulation studies.

In fact, most simulation studies utilize the partial invariance model as the data-generating model. It comes as no surprise that such simulation studies find the partial invariance or invariance alignment approaches to be appropriate estimation methods in this case. However, with a different data-generating model, quite different conclusions would be drawn.

6.4. Measurement Invariance with Multiple Covariates

The original definition of measurement invariance dates back to Mellenbergh (1989) and Meredith (1993). Measurement invariance for a set of item responses $\mathbf{X}$ is fulfilled if the conditional distribution of the item responses given the factor $F$ does not depend on any observed or unobserved covariates summarized in a vector $\mathbf{V}$. More formally, measurement invariance requires that $P(\mathbf{X} \mid F, \mathbf{V}) = P(\mathbf{X} \mid F)$ holds for all possible covariates $\mathbf{V}$. In principle, $\mathbf{V}$ could define any subpopulation of persons from the original population. This property led Meredith (1993) to define measurement invariance under the concept of selection invariance. Selection invariance implies that a measurement model holds (after utilizing appropriate identification constraints) in any subpopulation of the original population. In particular, this property also rules out the possibility of a latent grouping variable as in factor mixture models (Cohen & Bolt, 2005; Cole et al., 2019; Lubke & Muthén, 2007). If one believed in the requirement of this stricter measurement invariance definition of Mellenbergh and Meredith, a multitude of statistical models would have to be fitted to rule out any kind of non-invariance. In principle, $\mathbf{V}$ could be a high-dimensional vector of discrete and continuous variables, and statistical techniques based on machine learning or regularization (Bauer et al., 2020; Brandmaier & Jacobucci, 2023; Jacobucci et al., 2019; Tutz & Schauberger, 2015) might help to detect non-invariance. No doubt, there are feasible and valid statistical approaches that can handle a high-dimensional covariate vector $\mathbf{V}$ for investigating DIF. However, it remains questionable what the purpose of such statistical enterprises is. Why should one believe that a data-driven modified model provides more meaningful results than inference based on an a priori clearly defined single-factor model?

6.5. Assessing Measurement Invariance for Ordinal Items

In typical measurement invariance analyses, Likert-type items with four-point, five-point, or continuous bounded scales are used. We have argued elsewhere that it is defensible to use the factor model with a continuous treatment of the items by analyzing Pearson correlations (or covariances of the raw items) instead of an ordinal treatment based on polychoric correlations (Robitzsch, 2020b; see also Kampen & Swyngedouw, 2000). To us, the choice of the metric of the latent variable (i.e., treating items as continuous or ordinal) must be based on substantive and not statistical reasons, and it is not legitimate to say that one of the two methods will typically lead to biased results because such reasoning depends on how researchers define the truth. Consequently, researchers should not generally prefer an ordinal treatment of items in the assessment of measurement invariance (see Svetina et al., 2020, for a tutorial). Moreover, we believe that it is inappropriate to ask whether violations of measurement invariance are the consequence of using a continuous instead of an ordinal treatment (see, e.g., Chen et al., 2020; Meuleman et al., 2022; Welzel et al., 2021, 2022, for such discussions). In our view, the choice of the metric of the latent variable is independent of assessing measurement invariance. Our conclusion that measurement invariance is not necessary for a meaningful and valid comparison across groups also applies to ordinal treatments of measurement invariance based on polychoric correlations or on item response models estimated with maximum likelihood. Of course, the reasoning also applies to factor models that utilize bounded distributions for items (Noel & Dauvier, 2007; Revuelta et al., 2022).

6.6. Is Scalar Invariance Sufficient for Comparing Regression Coefficients Across Groups?

Moreover, we would like to point out that establishing full invariance in a unidimensional factor model does not guarantee the invariance of relationships of the factor variable with covariates across groups. In such a case, invariance analysis must be performed at the item level by testing for potential interaction effects of groups and the covariate of interest. Unfortunately, incorrect statements that metric invariance in a multiple-group model ensures the comparability of covariances of the factor variable with covariates across groups can frequently be found in the literature (e.g., Davidov & Meuleman, 2019; He et al., 2017; Leitgöb et al., 2023). For example, Leitgöb et al. (2023) write: “If metric invariance is given, one may compare unstandardized associations (i.e., covariances or unstandardized regression coefficients between the latent variables of interest for which metric invariance holds). Considering our previous example, one may compare the covariance or the unstandardized regression coefficient between the variable ‘age’ and the latent variable ‘attitudes toward immigration.’” Moreover, He et al. (2017) note that “with metric invariance of the scales established, correlational analysis among metric invariant constructs and achievement can be considered valid.” These statements imply that it would only be necessary to demonstrate metric invariance for all involved latent variables separately to ensure invariance of unstandardized associations with regard to selecting subsets of items. However, this is incorrect: for the example mentioned in Leitgöb et al. (2023), one must show that, in a joint model involving “age” and the indicators of “attitudes toward immigration,” the residual correlations between age and the indicators are equal to zero. This assumption remains untested if metric invariance is only shown independently for each latent variable. This frequently found inconsistency in the literature is particularly annoying because “Methodologists […] insist with increasing vigor that detecting ‘non-invariance’ in […] is an infallible sign of […] in-equivalences in how respondents understand the items” (Welzel et al., 2021). Hence, the usual steps in the measurement invariance literature are anything but rigorous. In any case, as we have argued, we generally question the methodological concept of measurement invariance.

6.7. Quantifying Measurement Non-invariance as an Additional Source of Uncertainty

Finally, we believe it might be advantageous to quantify the extent of non-invariance as an additional source of uncertainty in estimated parameters, such as group means and standard deviations. Such an approach has been discussed in terms of linking errors for applications in educational assessment based on item response models (Monseur & Berezner, 2007; Robitzsch & Lüdtke, 2019) (Note 11). As an alternative approach in the multiple-group factor model, the model error can be parametrized as an additional source of uncertainty in the estimation function (Wu, 2010; Wu & Browne, 2015). This approach results in an uncertainty quantification of estimated parameters that comprises both sampling and model error. Such approaches should be more widely implemented in empirical research involving (multiple-group) confirmatory factor analysis and structural equation modeling.
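A minimal sketch of such an uncertainty quantification (Python with NumPy; hypothetical item-wise group differences) computes a linking error in the style of Monseur and Berezner (2007) as the standard deviation of item-wise differences divided by the square root of the number of items:

```python
# Linking error: DIF variability across items as extra uncertainty in the
# group contrast, to be reported in addition to the usual sampling error.
import numpy as np

d = np.array([0.50, 0.48, 0.52, 1.20, 0.50])  # item-wise group differences
linking_error = d.std(ddof=1) / np.sqrt(len(d))
print(linking_error)  # approximately 0.14
```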

Notes

1 In typical applications, this restriction is not a serious limitation because item loadings can all be made positive by multiplying item scores by −1. An example might be the presence of positively and negatively worded statements. The item scores of all negatively worded items can be multiplied by −1.

2 Ibid.

3 In the following, we use the terms “valid” and “meaningful” interchangeably.

4 Formally, one can define true scores as intraindividual distributions of manifest items in stochastic measurement theory (SMT; Eid, 2000; Steyer, 1989; Steyer et al., 2015). Factor models are defined by imposing some restrictions on the true score distributions in SMT. However, SMT does not resolve identification issues in factor models in general or measurement non-invariance in particular. SMT merely adds a redundant layer for defining true scores (and latent variables) as a notational exercise but arrives at the same identification conditions as without these definitional exercises. Hence, the ambiguity in defining group comparisons under measurement non-invariance cannot be resolved by utilizing SMT.

5 An anonymous reviewer argued that assessing and eliminating DIF items would be particularly relevant in high-stakes assessment tests. We tend to disagree with such a statement. As described by Camilli (2006), test fairness should be clearly distinguished from DIF (i.e., measurement non-invariance). Test fairness is related to unsystematic (i.e., item-specific) and systematic disadvantages of particular groups of test takers. DIF assessment focuses only on item-specific DIF and does not test for systematic group disadvantages. If DIF exists, high-stakes testing is a good showcase for why removing construct-relevant DIF items from a test is questionable. According to Camilli (2006) and our conviction, only construct-irrelevant DIF items should be removed from the test because only those items are related to test unfairness.

6 Note that two-step linking approaches are much more popular in item response modeling than in applications of structural equation modeling.

7 Zieger et al. (2019) argue that in cross-cultural studies researchers should identify groups of countries with invariant item parameters in order to allow comparisons among them. Under violations of invariance, however, not all countries could then be compared simultaneously.

8 Indeed, the recently proposed penalized structural equation modeling (Asparouhov & Muthén, 2023) can be viewed as a regularized estimation approach.

9 Unweighted least squares estimation may be justified for standardized variables or if all items were measured using the same Likert scale.

10 See, e.g., Pokropek and Pokropek (2022) for more sophisticated approaches based on deep learning techniques.

11 In fact, linking errors can be interpreted as summary statistics of the expected parameter change (Oberski, 2014) in group means and standard deviations.

References

  • Arts, I., Fang, Q., van de Schoot, R., & Meitinger, K. (2021). Approximate measurement invariance of willingness to sacrifice for the environment across 30 countries: The importance of prior distributions and their visualization. Frontiers in Psychology, 12, 624032. https://doi.org/10.3389/fpsyg.2021.624032
  • Asparouhov, T., & Muthén, B. (2023). Penalized structural equation models. Technical report. Retrieved March 3, 2023, from https://www.statmodel.com/download/PML.pdf
  • Asparouhov, T., & Muthén, B. (2014). Multiple-group factor analysis alignment. Structural Equation Modeling: A Multidisciplinary Journal, 21, 495–508. https://doi.org/10.1080/10705511.2014.919210
  • Avvisati, F. (2020). The measure of socio-economic status in PISA: A review and some suggested improvements. Large-Scale Assessments in Education, 8, 8. https://doi.org/10.1186/s40536-020-00086-x
  • Battauz, M. (2020). Regularized estimation of the four-parameter logistic model. Psych, 2, 269–278. https://doi.org/10.3390/psych2040020
  • Bauer, D. J., Belzak, W. C., & Cole, V. T. (2020). Simplifying the assessment of measurement invariance over multiple background variables: Using regularized moderated nonlinear factor analysis to detect differential item functioning. Structural Equation Modeling: A Multidisciplinary Journal, 27, 43–55. https://doi.org/10.1080/10705511.2019.1642754
  • Berk, R., Brown, L., Buja, A., George, E., Pitkin, E., Zhang, K., & Zhao, L. (2014). Misspecified mean function regression: Making good use of regression models that are wrong. Sociological Methods & Research, 43, 422–451. https://doi.org/10.1177/0049124114526375
  • Boer, D., Hanke, K., & He, J. (2018). On detecting systematic measurement error in cross-cultural research: A review and critical reflection on equivalence and invariance tests. Journal of Cross-Cultural Psychology, 49, 713–734. https://doi.org/10.1177/0022022117749042
  • Boos, D. D., & Stefanski, L. A. (2013). Essential statistical inference. Springer. https://doi.org/10.1007/978-1-4614-4818-1
  • Borgstede, M., & Eggert, F. (2023). Squaring the circle: From latent variables to theory-based measurement. Theory & Psychology, 33, 118–137. https://doi.org/10.1177/09593543221127985
  • Borsboom, D. (2008). Latent variable theory. Measurement: Interdisciplinary Research and Perspectives, 6, 25–53. https://doi.org/10.1080/15366360802035497
  • Brandmaier, A. M., & Jacobucci, R. C. (2023). Machine-learning approaches to structural equation modeling. In R. H. Hoyle (Ed.), Handbook of structural equation modeling. Guilford Press.
  • Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466. https://doi.org/10.1037/0033-2909.105.3.456
  • Camilli, G. (1993). The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In P. W. Holland & H. Wainer (Eds.), Differential item functioning: Theory and practice (pp. 397–417). Erlbaum.
  • Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (pp. 221–256). Praeger Publications.
  • Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018. https://doi.org/10.1037/a0013193
  • Chen, P. Y., Wu, W., Garnier-Villarreal, M., Kite, B. A., & Jia, F. (2020). Testing measurement invariance with ordinal missing data: A comparison of estimators and missing data techniques. Multivariate Behavioral Research, 55, 87–101. https://doi.org/10.1080/00273171.2019.1608799
  • Chen, Y., Li, C., & Xu, G. (2021). DIF statistical inference and detection without knowing anchoring items. arXiv. arXiv:2110.11112. https://doi.org/10.48550/arXiv.2110.11112
  • Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133–148. https://doi.org/10.1111/j.1745-3984.2005.00007
  • Cole, V. T., Bauer, D. J., & Hussong, A. M. (2019). Assessing the robustness of mixture models to measurement noninvariance. Multivariate Behavioral Research, 54, 882–905. https://doi.org/10.1080/00273171.2019.1596781
  • Davidov, E., & Meuleman, B. (2019). Measurement invariance analysis using multiple group confirmatory factor analysis and alignment optimisation. In F. J. van de Vijver (Ed.), Invariance analyses in large-scale studies (pp. 13–20). OECD Education. Working Papers No. 201. https://doi.org/10.1787/254738dd-en
  • Davidov, E., Meuleman, B., Cieciuch, J., Schmidt, P., & Billiet, J. (2014). Measurement equivalence in cross-national research. Annual Review of Sociology, 40, 55–75. https://doi.org/10.1146/annurev-soc-071913-043137
  • Davies, P. L. (2014). Data analysis and approximate models. CRC Press. https://doi.org/10.1201/b17146
  • De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559. https://doi.org/10.1007/s11336-008-9092-x
  • De Jong, M. G., Steenkamp, J. B. E., & Fox, J. P. (2007). Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. Journal of Consumer Research, 34, 260–278. https://doi.org/10.1086/518532
  • De Los Reyes, A., Tyrell, F., Watts, A. L., & Asmundson, G. (2022). Conceptual, methodological, and measurement factors that disqualify use of measurement invariance techniques to detect informant discrepancies in youth mental health assessments. Frontiers in Psychology, 13, 931296. https://doi.org/10.3389/fpsyg.2022.931296
  • De Roover, K. (2021). Finding clusters of groups with measurement invariance: Unraveling intercept non-invariance with mixture multigroup factor analysis. Structural Equation Modeling: A Multidisciplinary Journal, 28, 663–683. https://doi.org/10.1080/10705511.2020.1866577
  • De Roover, K., Vermunt, J. K., & Ceulemans, E. (2022). Mixture multigroup factor analysis for unraveling factor loading noninvariance across many groups. Psychological Methods, 27, 281–306. https://doi.org/10.1037/met0000355
  • D’Urso, E. D., Maassen, E., van Assen, M. A., Nuijten, M. B., De Roover, K., & Wicherts, J. (2022). The dire disregard of measurement invariance testing in psychological science. PsyArXiv, July 26. https://doi.org/10.31234/osf.io/n3f5u
  • Edelsbrunner, P. A. (2022). A model and its fit lie in the eye of the beholder: Long live the sum score. Frontiers in Psychology, 13, 986767. https://doi.org/10.3389/fpsyg.2022.986767
  • Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241–261. https://doi.org/10.1007/BF02294377
  • Eid, M. (2019). Multigroup and multilevel latent class analysis. In F. J. van de Vijver (Ed.), Invariance analyses in large-scale studies (pp. 70–90). OECD Education Working Papers No. 201. https://doi.org/10.1787/254738dd-en
  • El Masri, Y. H., & Andrich, D. (2020). The trade-off between model fit, invariance, and validity: The case of PISA science assessments. Applied Measurement in Education, 33, 174–188. https://doi.org/10.1080/08957347.2020.1732384
  • Ellis, J. L. (1993). Subpopulation invariance of patterns in covariance matrices. British Journal of Mathematical and Statistical Psychology, 46, 231–254. https://doi.org/10.1111/j.2044-8317.1993.tb01014.x
  • Finch, H. (2022). Applied regularization methods for the social sciences. CRC Press. https://doi.org/10.1201/9780367809645
  • Fischer, R., Karl, J., & Luczak-Roesch, M. (2022). Why equivalence and invariance are both different and essential for scientific studies of culture: A discussion of mapping processes and theoretical implications. PsyArXiv, September 3. https://doi.org/10.31234/osf.io/fst9k
  • Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8, 370–378. https://doi.org/10.1177/1948550617693063
  • Fox, J.-P., & Verhagen, A. J. (2010). Random item effects modeling for cross-national survey data. In E. Davidov, P. Schmidt, & J. Billiet (Eds.), Cross-cultural analysis: Methods and applications (pp. 461–482). Routledge Academic.
  • Funder, D. (2020). Misgivings: Some thoughts about “measurement invariance”. Retrieved January 31, 2020, from https://bit.ly/3caKdNN
  • Geiser, C. (2020). Longitudinal structural equation modeling with Mplus: A latent state-trait perspective. Guilford Publications.
  • Geminiani, E., Marra, G., & Moustaki, I. (2021). Single-and multiple-group penalized factor analysis: A trust-region algorithm approach with integrated automatic multiple tuning parameter selection. Psychometrika, 86, 65–95. https://doi.org/10.1007/s11336-021-09751-8
  • Greiff, S., & Scherer, R. (2018). Still comparing apples with oranges? Some thoughts on the principles and practices of measurement invariance testing. European Journal of Psychological Assessment, 34, 141–144. https://doi.org/10.1027/1015-5759/a000487
  • He, J., van de Vijver, F. J., Fetvadjiev, V. H., de Carmen Dominguez Espinosa, A., Adams, B., Alonso-Arbiol, I., Aydinli-Karakulak, A., Buzea, C., Dimitrova, R., Fortin, A., Hapunda, G., Ma, S., Sargautyte, R., Sim, S., Schachner, M. K., Suryani, A., Zeinoun, P., & Zhang, R. (2017). On enhancing the cross-cultural comparability of Likert-scale personality and value measures: A comparison of common procedures. European Journal of Personality, 31, 642–657. https://doi.org/10.1002/per.2132
  • Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning: Theory and practice. Erlbaum. https://doi.org/10.4324/9780203357811
  • Huang, P. H. (2018). A penalized likelihood method for multi-group structural equation modelling. British Journal of Mathematical and Statistical Psychology, 71, 499–522. https://doi.org/10.1111/bmsp.12130
  • Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101. https://doi.org/10.1214/aoms/1177703732
  • Jacobucci, R., Brandmaier, A. M., & Kievit, R. A. (2019). A practical guide to variable selection in structural equation modeling by using regularized multiple-indicators, multiple-causes models. Advances in Methods and Practices in Psychological Science, 2, 55–76. https://doi.org/10.1177/2515245919826527
  • Jak, S. (2019). Cross-level invariance in multilevel factor models. Structural Equation Modeling: A Multidisciplinary Journal, 26, 607–622. https://doi.org/10.1080/10705511.2018.1534205
  • Jak, S., & Jorgensen, T. D. (2017). Relating measurement invariance, cross-level invariance, and multilevel reliability. Frontiers in Psychology, 8, 1640. https://doi.org/10.3389/fpsyg.2017.01640
  • Jak, S., Oort, F. J., & Dolan, C. V. (2013). A test for cluster bias: Detecting violations of measurement invariance across clusters in multilevel data. Structural Equation Modeling: A Multidisciplinary Journal, 20, 265–282. https://doi.org/10.1080/10705511.2013.769392
  • Kampen, J., & Swyngedouw, M. (2000). The ordinal controversy revisited. Quality and Quantity, 34, 87–102. https://doi.org/10.1023/A:1004785723554
  • Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. https://doi.org/10.1111/jedm.12000
  • Kankaraš, M., & Moors, G. (2014). Analysis of cross-cultural comparability of PISA 2009 scores. Journal of Cross-Cultural Psychology, 45, 381–399. https://doi.org/10.1177/0022022113511297
  • Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). The Guilford Press.
  • Koc, P., & Pokropek, A. (2022). Accounting for cross-country-cross-time variations in measurement invariance testing. A case of political participation. Survey Research Methods, 16, 79–96. https://doi.org/10.18148/srm/2022.v16i1.7909
  • Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking. Springer. https://doi.org/10.1007/978-1-4939-0317-7
  • Lacko, D., Čeněk, J., Točík, J., Avsec, A., Đorđević, V., Genc, A., Haka, F., Šakotić-Kurbalija, J., Mohorić, T., Neziri, I., & Subotić, S. (2022). The necessity of testing measurement invariance in cross-cultural research: Potential bias in cross-cultural comparisons with individualism–collectivism self-report scales. Cross-Cultural Research, 56, 228–267. https://doi.org/10.1177/10693971211068971
  • Lee, S. S., & von Davier, M. (2020). Improving measurement properties of the PISA home possessions scale through partial invariance modeling. Psychological Test and Assessment Modeling, 62, 55–83. https://bit.ly/3FRN6Qf
  • Leitgöb, H., Seddig, D., Asparouhov, T., Behr, D., Davidov, E., De Roover, K., Jak, S., Meitinger, K., Menold, N., Muthén, B., Rudnev, M., Schmidt, P., & van de Schoot, R. (2023). Measurement invariance in the social sciences: Historical development, methodological challenges, state of the art, and future perspectives. Social Science Research, 110, 102805. https://doi.org/10.1016/j.ssresearch.2022.102805
  • Lek, K., Oberski, D., Davidov, E., Cieciuch, J., Seddig, D., & Schmidt, P. (2019). Approximate measurement invariance. In T. P. Johnson, B.-E. Pennell, I. A. L. Stoop, & B. Dorer (Eds.), Advances in comparative survey methods: Multinational, multiregional, and multicultural contexts (3MC) (pp. 911–928). Wiley. https://doi.org/10.1002/9781118884997.ch41
  • Liang, X., & Jacobucci, R. (2020). Regularized structural equation modeling to detect measurement bias: Evaluation of lasso, adaptive lasso, and elastic net. Structural Equation Modeling: A Multidisciplinary Journal, 27, 722–734. https://doi.org/10.1080/10705511.2019.1693273
  • Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.
  • Little, T. D., Slegers, D. W., & Card, N. A. (2006). A non-arbitrary method of identifying and scaling latent variables in SEM and MACS models. Structural Equation Modeling: A Multidisciplinary Journal, 13, 59–72. https://doi.org/10.1207/s15328007sem1301_3
  • Lomazzi, V. (2021). Can we compare solidarity across Europe? What, why, when, and how to assess exact and approximate equivalence of first- and second-order factor models. Frontiers in Political Science, 3, 641698. https://doi.org/10.3389/fpos.2021.641698
  • Lubke, G., & Muthén, B. O. (2007). Performance of factor mixture models as a function of model size, covariate effects, and class-specific parameters. Structural Equation Modeling: A Multidisciplinary Journal, 14, 26–47. https://doi.org/10.1080/10705510709336735
  • Luong, R., & Flake, J. K. (2022). Measurement invariance testing using confirmatory factor analysis and alignment optimization: A tutorial for transparent analysis planning and reporting. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000441
  • Magis, D., & De Boeck, P. (2011). Identification of differential item functioning in multiple-group settings: A multivariate outlier detection approach. Multivariate Behavioral Research, 46, 733–755. https://doi.org/10.1080/00273171.2011.606757
  • Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing experiments and analyzing data: A model comparison perspective. Routledge. https://doi.org/10.4324/9781315642956
  • Meitinger, K. (2017). Necessary but insufficient: Why measurement invariance tests need online probing as a complementary tool. Public Opinion Quarterly, 81, 447–472. https://doi.org/10.1093/poq/nfx009
  • Meitinger, K., Davidov, E., Schmidt, P., & Braun, M. (2020). Measurement invariance: Testing for it and explaining why it is absent. Survey Research Methods, 14, 345–349. https://doi.org/10.18148/srm/2020.v14i4.7655
  • Mellenbergh, G. J. (1989). Item bias and item response theory. International Journal of Educational Research, 13, 127–143. https://doi.org/10.1016/0883-0355(89)90002-5
  • Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. https://doi.org/10.1007/BF02294825
  • Meredith, W., & Teresi, J. A. (2006). An essay on measurement and factorial invariance. Medical Care, 44, S69–S77. https://www.jstor.org/stable/41219507
  • Meuleman, B., Żółtak, T., Pokropek, A., Davidov, E., Muthén, B., Oberski, D. L., Billiet, J., & Schmidt, P. (2022). Why measurement invariance is important in comparative research. A response to Welzel et al. (2021). Sociological Methods & Research. Advance online publication. https://doi.org/10.1177/00491241221091755
  • Millsap, R. E. (2011). Statistical approaches to measurement invariance. Routledge. https://doi.org/10.4324/9780203821961
  • Molenaar, D., & Borsboom, D. (2013). The formalization of fairness: Issues in testing for measurement invariance using subtest scores. Educational Research and Evaluation, 19, 223–244. https://doi.org/10.1080/13803611.2013.767628
  • Monseur, C., & Berezner, A. (2007). The computation of equating errors in international surveys in education. Journal of Applied Measurement, 8, 323–335. https://bit.ly/2WDPeqD
  • Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference. Cambridge University Press. https://doi.org/10.1017/CBO9781107587991
  • Mulaik, S. A. (2009). Foundations of factor analysis. CRC Press. https://doi.org/10.1201/b15851
  • Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115–132. https://doi.org/10.1007/BF02294210
  • Muthén, B., & Asparouhov, T. (2014). IRT studies of many groups: The alignment method. Frontiers in Psychology, 5, 978. https://doi.org/10.3389/fpsyg.2014.00978
  • Muthén, B., & Asparouhov, T. (2018). Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociological Methods & Research, 47, 637–664. https://doi.org/10.1177/0049124117701488
  • Noel, Y., & Dauvier, B. (2007). A beta item response model for continuous bounded responses. Applied Psychological Measurement, 31, 47–73. https://doi.org/10.1177/0146621605287691
  • Oberski, D. L. (2014). Evaluating sensitivity of parameters of interest to measurement invariance in latent variable models. Political Analysis, 22, 45–60. https://doi.org/10.1093/pan/mpt014
  • Pokropek, A., & Pokropek, E. (2022). Deep neural networks for detecting statistical model misspecifications. The case of measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 29, 394–411. https://doi.org/10.1080/10705511.2021.2010083
  • Pokropek, A., Davidov, E., & Schmidt, P. (2019). A Monte Carlo simulation study to assess the appropriateness of traditional and newer approaches to test for measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 26, 724–744. https://doi.org/10.1080/10705511.2018.1561293
  • Pokropek, A., Lüdtke, O., & Robitzsch, A. (2020). An extension of the invariance alignment method for scale linking. Psychological Test and Assessment Modeling, 62, 305–334. https://bit.ly/2UEp9GH
  • Protzko, J. (2022). Invariance: What does measurement invariance allow us to claim? PsyArXiv, April 18. https://doi.org/10.31234/osf.io/r8yka
  • Putnick, D. L., & Bornstein, M. H. (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90. https://doi.org/10.1016/j.dr.2016.06.004
  • Revuelta, J., Hidalgo, B., & Alcazar-Córcoles, M. Á. (2022). Bayesian estimation and testing of a beta factor model for bounded continuous variables. Multivariate Behavioral Research, 57, 57–78. https://doi.org/10.1080/00273171.2020.1805582
  • Robitzsch, A. (2020a). Lp loss functions in invariance alignment and Haberman linking with few or many groups. Stats, 3, 246–283. https://doi.org/10.3390/stats3030019
  • Robitzsch, A. (2020b). Why ordinal variables can (almost) always be treated as continuous variables: Clarifying assumptions of robust continuous and ordinal factor analysis estimation methods. Frontiers in Education, 5, 589965. https://doi.org/10.3389/feduc.2020.589965
  • Robitzsch, A. (2021). Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry, 13, 2198. https://doi.org/10.3390/sym13112198
  • Robitzsch, A. (2022). Estimation methods of the multiple-group one-dimensional factor model: Implied identification constraints in the violation of measurement invariance. Axioms, 11, 119. https://doi.org/10.3390/axioms11030119
  • Robitzsch, A., & Lüdtke, O. (2019). Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assessment in Education: Principles, Policy & Practice, 26, 444–465. https://doi.org/10.1080/0969594X.2018.1433633
  • Robitzsch, A., & Lüdtke, O. (2020). A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychological Test and Assessment Modeling, 62, 233–279. https://bit.ly/3ezBB05
  • Robitzsch, A., & Lüdtke, O. (2022). Mean comparisons of many groups in the presence of DIF: An evaluation of linking and concurrent scaling approaches. Journal of Educational and Behavioral Statistics, 47, 36–68. https://doi.org/10.3102/10769986211017479
  • Rutkowski, L., & Svetina, D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74, 31–57. https://doi.org/10.1177/0013164413498257
  • Ryu, E. (2014). Factorial invariance in multilevel confirmatory factor analysis. British Journal of Mathematical and Statistical Psychology, 67, 172–194. https://doi.org/10.1111/bmsp.12014
  • Ryu, E., & Mehta, P. (2017). Multilevel factorial invariance in n-level structural equation modeling (nSEM). Structural Equation Modeling: A Multidisciplinary Journal, 24, 936–959. https://doi.org/10.1080/10705511.2017.1324311
  • Schauberger, G., & Mair, P. (2020). A regularization approach for the detection of differential item functioning in generalized partial credit models. Behavior Research Methods, 52, 279–294. https://doi.org/10.3758/s13428-019-01224-2
  • Schroeders, U., & Gnambs, T. (2020). Degrees of freedom in multigroup confirmatory factor analyses. European Journal of Psychological Assessment, 36, 105–113. https://doi.org/10.1027/1015-5759/a000500
  • Seddig, D., & Leitgöb, H. (2018). Approximate measurement invariance and longitudinal confirmatory factor analysis: Concept and application with panel data. Survey Research Methods, 12, 29–41. https://doi.org/10.18148/srm/2018.v12i1.7210
  • Seddig, D., Maskileyson, D., & Davidov, E. (2020). The comparability of measures in the ageism module of the fourth round of the European Social Survey, 2008–2009. Survey Research Methods, 14, 351–364. https://doi.org/10.18148/srm/2020.v14i4.7369
  • Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194. https://doi.org/10.1007/BF02294572
  • Siemsen, E., & Bollen, K. A. (2007). Least absolute deviation estimation in structural equation modeling. Sociological Methods & Research, 36, 227–265. https://doi.org/10.1177/0049124107301946
  • Somaraju, A. V., Nye, C. D., & Olenick, J. (2022). A review of measurement equivalence in organizational research: What’s old, what’s new, what’s next? Organizational Research Methods, 25, 741–785. https://doi.org/10.1177/10944281211056524
  • Steenkamp, J. B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–107. https://doi.org/10.1086/209528
  • Steinmetz, H. (2013). Analyzing observed composite differences across groups: Is partial measurement invariance enough? Methodology, 9, 1–12. https://doi.org/10.1027/1614-2241/a000049
  • Steyer, R. (1989). Models of classical psychometric test theory as stochastic measurement models: Representation, uniqueness, meaningfulness, identifiability, and testability. Methodika, 3, 25–60. https://bit.ly/3Js7N3S
  • Steyer, R., Mayer, A., Geiser, C., & Cole, D. A. (2015). A theory of states and traits—Revised. Annual Review of Clinical Psychology, 11, 71–98. https://doi.org/10.1146/annurev-clinpsy-032813-153719
  • Svetina, D., Rutkowski, L., & Rutkowski, D. (2020). Multiple-group invariance with categorical outcomes using updated guidelines: An illustration using Mplus and the lavaan/semTools packages. Structural Equation Modeling: A Multidisciplinary Journal, 27, 111–130. https://doi.org/10.1080/10705511.2019.1602776
  • Tutz, G., & Schauberger, G. (2015). A penalty approach to differential item functioning in Rasch models. Psychometrika, 80, 21–43. https://doi.org/10.1007/s11336-013-9377-6
  • Uher, J. (2021). Psychometrics is not measurement: Unraveling a fundamental misconception in quantitative psychology and the complex network of its underlying fallacies. Journal of Theoretical and Philosophical Psychology, 41, 58–84. https://doi.org/10.1037/teo0000176
  • van de Schoot, R., Kluytmans, A., Tummers, L., Lugtig, P., Hox, J., & Muthén, B. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4, 770. https://doi.org/10.3389/fpsyg.2013.00770
  • van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9, 486–492. https://doi.org/10.1080/17405629.2012.686740
  • van de Vijver, F. J. (2018). Towards an integrated framework of bias in noncognitive assessment in international large-scale studies: Challenges and prospects. Educational Measurement: Issues and Practice, 37, 49–56. https://doi.org/10.1111/emip.12227
  • van der Linden, W. J. (1994). Fundamental measurement and the fundamentals of Rasch measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 2, pp. 3–24). Ablex.
  • Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. https://doi.org/10.1177/109442810031002
  • VanderWeele, T. J. (2022). Constructed measures and causal inference: Towards a new model of measurement for psychosocial constructs. Epidemiology, 33, 141–151. https://doi.org/10.1097/EDE.0000000000001434
  • von Davier, M., & von Davier, A. A. (2007). A unified approach to IRT scale linking and scale transformations. Methodology, 3, 115–124. https://doi.org/10.1027/1614-2241.3.3.115
  • Wang, W., Liu, Y., & Liu, H. (2022). Testing differential item functioning without predefined anchor items using robust regression. Journal of Educational and Behavioral Statistics, 47, 666–692. https://doi.org/10.3102/10769986221109208
  • Welzel, C., & Inglehart, R. F. (2016). Misconceptions of measurement equivalence: Time for a paradigm shift. Comparative Political Studies, 49, 1068–1094. https://doi.org/10.1177/0010414016628275
  • Welzel, C., Brunkert, L., Kruse, S., & Inglehart, R. F. (2021). Non-invariance? An overstated problem with misconceived causes. Sociological Methods & Research. Advance online publication. https://doi.org/10.1177/0049124121995521
  • Welzel, C., Kruse, S., & Brunkert, L. (2022). Against the mainstream: On the limitations of non-invariance diagnostics: Response to Fischer et al. and Meuleman et al. Sociological Methods & Research. Advance online publication. https://doi.org/10.1177/00491241221091754
  • White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25. https://doi.org/10.2307/1912526
  • Wicherts, J. M. (2016). The importance of measurement invariance in neurocognitive ability testing. The Clinical Neuropsychologist, 30, 1006–1016. https://doi.org/10.1080/13854046.2016.1205136
  • Wicherts, J. M., Dolan, C. V., Hessen, D. J., Oosterveld, P., van Baal, G. C. M., Boomsma, D. I., & Span, M. M. (2004). Are intelligence tests measurement invariant over time? Investigating the nature of the Flynn effect. Intelligence, 32, 509–537. https://doi.org/10.1016/j.intell.2004.07.002
  • Winter, S. D., & Depaoli, S. (2020). An illustration of Bayesian approximate measurement invariance with longitudinal data and a small sample size. International Journal of Behavioral Development, 44, 371–382. https://doi.org/10.1177/0165025419880610
  • Wu, H. (2010). An empirical Bayesian approach to misspecified covariance structures [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1282058097
  • Wu, H., & Browne, M. W. (2015). Quantifying adventitious error in a covariance structure as a random effect. Psychometrika, 80, 571–600. https://doi.org/10.1007/s11336-015-9451-3
  • Zieger, L., Sims, S., & Jerrim, J. (2019). Comparing teachers’ job satisfaction across countries: A multiple-pairwise measurement invariance approach. Educational Measurement: Issues and Practice, 38, 75–85. https://doi.org/10.1111/emip.12254
  • Zitzmann, S., & Loreth, L. (2021). Regarding an “almost anything goes” attitude toward methods in psychology. Frontiers in Psychology, 12, 612570. https://doi.org/10.3389/fpsyg.2021.612570