Abstract
We describe and evaluate a random permutation test of measurement invariance with ordered-categorical data. To calculate a p-value for the observed (∆)χ2, an empirical reference distribution is built by repeatedly shuffling the grouping variable, then saving the χ2 from a configural model, or the ∆χ2 between configural and scalar-invariance models, fitted to each permuted dataset. The current gold standard in this context is a robust mean- and variance-adjusted ∆χ2 test proposed by Satorra (2000), which yields inflated Type I errors, particularly when thresholds are asymmetric, unless sample sizes are quite large (Bandalos, 2014; Sass et al., 2014). In a Monte Carlo simulation, we compare permutation to three implementations of Satorra’s robust χ2 across a variety of conditions evaluating configural and scalar invariance. Results suggest permutation can better control Type I error rates while providing comparable power under conditions in which the standard robust test yields inflated errors.
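The permutation logic described in the abstract can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: in practice the statistic returned by `stat_fn` would be the χ2 of a configural model (or the ∆χ2 between configural and scalar models) refit to each permuted dataset; here a simple two-group mean-difference statistic, and the names `permutation_pvalue` and `mean_diff`, are assumptions introduced only for demonstration.

```python
import numpy as np

def permutation_pvalue(stat_fn, data, groups, n_perm=1000, seed=1):
    """Build an empirical reference distribution for a fit statistic by
    repeatedly shuffling the grouping variable, then return the proportion
    of permuted statistics at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(data, groups)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(groups)   # breaks any true group effect
        perm_stats[i] = stat_fn(data, shuffled)
    # add-one correction keeps the empirical p-value away from exactly zero
    return (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)

def mean_diff(x, g):
    """Toy statistic standing in for a model (Delta-)chi-square."""
    return abs(x[g == 0].mean() - x[g == 1].mean())
```

Because the null hypothesis of invariance implies the grouping labels are exchangeable, the shuffled datasets yield the distribution of the statistic when no group differences exist, regardless of the statistic's (unknown) sampling distribution.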
Acknowledgment
We would like to thank Yves Rosseel for his helpful technical discussions while investigating different implementations of the mean- and variance-adjusted test statistic, and Paul Johnson for his computational assistance while comparing software packages. We thank the Center for Research Methods and Data Analysis and the College of Liberal Sciences at the University of Kansas for access to their high-performance computing cluster, on which our Monte Carlo simulations were conducted.
Notes
1 Throughout the manuscript, we will restrict our discussion to the case of polychoric correlations for models fit only to ordered-categorical items, but this WLS estimator can also be applied to a mixture of discrete and continuous indicators. When continuous indicators are included, their observed (co)variances are included in the estimated polychoric correlation matrix, and polyserial correlations are estimated between the discrete and continuous indicators.
2 Mean- and variance-adjusted statistics can also be calculated for other estimators, such as maximum likelihood.
3 Note that it is not appropriate to calculate the difference between two adjusted statistics because that difference will not be approximately χ2 distributed. Instead, the difference between the unadjusted statistics must be calculated, then that difference must be adjusted.
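The caution in note 3 can be made concrete with the widely used mean-scaled version of the difference test; the formula below is that mean-adjusted analogue (the mean- and variance-adjusted version evaluated in this study additionally adjusts the degrees of freedom, so this is an illustration rather than the exact statistic used here):

```latex
\bar{T}_d = \frac{T_0 - T_1}{\bar{c}_d},
\qquad
\bar{c}_d = \frac{d_0\,c_0 - d_1\,c_1}{d_0 - d_1},
```

where $T_0$ and $T_1$ are the *unadjusted* statistics of the more and less restrictive models, $d_0$ and $d_1$ their degrees of freedom, and $c_0$ and $c_1$ their scaling correction factors; $\bar{T}_d$ is referred to a $\chi^2$ distribution with $d_0 - d_1$ degrees of freedom. Simply subtracting the two already-scaled statistics, $\bar{T}_0 - \bar{T}_1$, does not yield this quantity and is not approximately $\chi^2$ distributed.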
4 Details about how to use the DIFFTEST command can be found in Web Note 4 at http://www.statmodel.com/.
5 Jorgensen et al. (2017a) showed that the test of overall model fit tests an overly restrictive null hypothesis because model configurations could be equivalent across populations even if the hypothesized model is not a perfectly accurate representation of either population. This issue is discussed in greater detail elsewhere (Jorgensen, 2017; Jorgensen et al., 2017), but it is beyond the scope of the current study, which focuses on situations in which the test fails even in the ideal circumstance that the hypothesized model is a perfect representation of the population(s).
6 Jorgensen et al. (2017a) showed that permuting alternative fit indices also provides valid tests of hypotheses about measurement invariance.
7 Wu and Estabrook (2016) recently showed that it is not possible to test the equality of thresholds independently of every other type of measurement parameter. Equality of thresholds can only be tested conditional on the equality of at least one other type of measurement parameter (for items with four or more categories), at least two other types (for items with three categories), or at least three other types (for binary items). This finding has implications for how measurement invariance should be tested with ordered-categorical indicators, but such a paradigm shift is beyond the scope of the current article.
8 Appendix A also discusses the issue of sparse data, when not all levels of a variable are observed in each group.
9 The application of the permutation method to incomplete data is a topic for future research that is beyond the scope of the current investigation.
10 Jorgensen (2017) discussed modifying configurally invariant models with inadequate fit.
11 If we had fixed the factor means and variances in both groups even in the scalar model, as Sass et al. (2014) did, these differences would have been df = 16 and 40, respectively, as Sass et al. (2014) reported. We discuss the implications of this difference in the Discussion section.
12 If software is flexible enough (e.g., general Bayesian modeling software, or more flexible SEM software like OpenMx), it is possible to fit a model to each group that estimates only the thresholds between categories that were observed within each group. Equality constraints could still be imposed on loadings and the thresholds the researcher knows correspond to categories on the same response scale used in each group.