Abstract
This study investigated the effect of complex structure on dimensionality assessment in compensatory multidimensional item response models using DETECT- and NOHARM-based methods. Performance was evaluated via the accuracy of identifying the correct number of dimensions and the ability to recover item groupings, measured with a simple matching similarity (SM) coefficient. The DETECT-based methods yielded higher proportions correct than the NOHARM-based methods in two- and three-dimensional conditions, especially when correlations were ≤.60, data exhibited ≤30% complexity, and the sample size was 1,000. As complexity increased and sample size decreased, the performance of the methods typically diminished. The NOHARM-based methods recovered item groupings as well as or better than the DETECT-based methods and were affected mostly by the level of complexity. The DETECT-based methods were affected largely by test length: as the number of items increased, SM coefficients decreased substantially.
Notes
NOHARM is technically a model estimation procedure for which specific statistics to evaluate dimensionality were developed; thus, the procedures used in this study could be termed NOHARM-based dimensionality assessment procedures. For brevity, we use NOHARM-based methods to refer to these procedures.
For example, Froelich and Habing (Citation2008) drew the parameters for their simulation study from item parameter distributions of the Armed Services Vocational Aptitude Battery, ACT-Math, and SAT-Verbal tests (as reported on p. 146 of their article). The means (standard deviations) of the a parameters were 1.22 (.70), 1.09 (.35), and 1.07 (.40) for the three tests, respectively, which were fairly close to the generating item parameter distributions used in the current study.
Table 1 Item Parameters for 2D Compensatory MIRT Model for 10 Items per Dimension for All Types of Structures
In the current study, item parameters for the 3D and longer test length conditions were very similar to those presented in Table 1. The distribution means and standard deviations were nearly identical (any difference was <.03) across all generating conditions.
Svetina (Citation2013) proposed a more complex evaluation of the performance of the methods by examining correct classification at the item level rather than the item-pair level. The author provided rules for first labeling each cluster solution and then examined the percentage of items that were correctly classified. Gierl et al. (Citation2006) offered a different way of looking at item classification by resimulating the data and essentially running a dual analysis.
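To make the item-pair criterion concrete, the following is a minimal sketch of a pairwise simple matching (SM) computation, assuming SM is defined over item pairs: two partitions agree on a pair if the pair is clustered together in both or apart in both, and SM is the proportion of agreeing pairs. The function name, interface, and example partitions are illustrative, not taken from the study.

```python
from itertools import combinations

def simple_matching(true_groups, est_groups):
    """Pairwise simple matching coefficient between two item partitions.

    Each argument maps an item to its cluster label. For every pair of
    items, the partitions agree if the pair is grouped together in both
    partitions or apart in both; SM is the proportion of agreeing pairs.
    """
    items = sorted(true_groups)
    agree = total = 0
    for i, j in combinations(items, 2):
        same_true = true_groups[i] == true_groups[j]
        same_est = est_groups[i] == est_groups[j]
        agree += same_true == same_est
        total += 1
    return agree / total

# Hypothetical example: ten items with two true dimensions (items 0-4 on
# dimension 0, items 5-9 on dimension 1); the estimated solution
# misplaces item 4 onto dimension 1.
true = {i: (0 if i < 5 else 1) for i in range(10)}
est = dict(true)
est[4] = 1
print(round(simple_matching(true, est), 3))  # 36 of 45 pairs agree -> 0.8
```

An item-level criterion, by contrast, would first have to label each estimated cluster (e.g., match it to the true dimension it overlaps most) and then count correctly placed items, which is why Svetina (Citation2013) needed explicit labeling rules before computing percent correct.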