Abstract
Recent advances have allowed mixture components within latent growth models to be specified using robust, skewed mixture distributions rather than normal distributions. This feature adds flexibility for handling non-normality in longitudinal data, whether in manifest or latent variables, by directly modeling skewed or heavy-tailed latent classes rather than assuming a mixture of normal distributions. The aim of this study was to assess through simulation the potential under- or over-extraction of latent classes in a growth mixture model when the underlying data follow a normal, skewed-normal, or skewed-t distribution. To do so, we implemented skewed-t, skewed-normal, and conventional normal (i.e., not skewed) forms of the growth mixture model. The skewed-t and skewed-normal versions of this model have only recently been implemented, and relatively little is known about their performance. Model comparison, fit, and classification for correctly specified and mis-specified models were assessed through various indices. Findings suggest that the accuracy of model comparison and fit measures depends on the type of (mis)specification, as well as the degree of separation between the latent classes. A secondary simulation exposed computational and accuracy difficulties under some skewed modeling contexts. Implications of the findings, recommendations for applied researchers, and future directions are discussed; a motivating example using education data is presented.
Notes
1 We have mimicked the model fit and assessment measures reported by Nylund et al. (Citation2007) in order to present a thorough treatment of the issue. This extends beyond the results for the generalized framework in Wei et al. (Citation2017), which reported only the BIC and two classification measures.
2 We modified growth curve values from Kaplan (Citation2002) to reflect linear growth for simplicity.
3 The equation used to compute this distance between two latent classes is
$\Delta = \left[(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)\right]^{1/2}$,
where $\Sigma^{-1}$ represents the inverse of the common covariance matrix, and the $\mu_1$ and $\mu_2$ terms represent the mean vectors for the first and second latent classes, respectively (McLachlan & Peel, Citation2000). In this case, the means would be the intercept and slope growth parameters for each trajectory. For ease of computation, the variance/covariance matrix was the same for both classes in this study. Note that the Mahalanobis distance can also be computed for manifest variables, if desired.
4 We first generated data for each replication in Mplus. Next, we used the MplusAutomation package (Hallquist & Wiley, Citation2013) in R to generate input files for each replication, and then used this package to run each input file. Output files for replications with improper solutions were discarded, and the results of the remaining viable replications were then read into R for computation.
5 In addition, for the cells where the BLRT was requested, we specified the number of LRT starts to be 100 with 20 final-stage optimizations to ensure that we did not converge to a local maximum.
6 Mplus does not remove all non-converged replications, so it is always important to check the TECH9 error and warning report and manually remove any problematic replications before interpretation.
7 For the information criteria, the percentage of correctly identified class structures was calculated as
$\frac{\#\{\text{replications with } IC_{c} < IC_{m}\}}{\#\{\text{viable replications}\}} \times 100$,
where $IC_{c}$ represents the fit statistic for the correctly specified model and $IC_{m}$ represents the fit statistic for the mis-specified model. This percentage was calculated somewhat differently for the LRT-based tests. Specifically, for the LMR-LRT and BLRT, the percentage of times the correctly specified model was chosen over the mis-specified model was calculated as the proportion of replications in which the test failed to reject the simpler model (i.e., $p > \alpha$) when a GCM represented the correctly specified model, and as the proportion in which the test rejected the simpler model (i.e., $p \leq \alpha$) when a GMM represented the correctly specified model.
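These tallies can be sketched as below; the replication-level fit statistics, p-values, and the conventional .05 cutoff are illustrative assumptions, not values taken from the study:

```python
import numpy as np

def pct_ic_correct(ic_correct, ic_mis):
    """Percent of replications in which the correctly specified model has the
    lower (better) information criterion than the mis-specified model."""
    ic_correct = np.asarray(ic_correct, dtype=float)
    ic_mis = np.asarray(ic_mis, dtype=float)
    return 100.0 * np.mean(ic_correct < ic_mis)

def pct_lrt_correct(p_values, gcm_is_correct, alpha=0.05):
    """Percent of replications in which an LRT-based test (LMR-LRT or BLRT)
    points to the correct model: non-significant when the one-class GCM is
    correct, significant when the multi-class GMM is correct."""
    p = np.asarray(p_values, dtype=float)
    if gcm_is_correct:
        return 100.0 * np.mean(p > alpha)
    return 100.0 * np.mean(p <= alpha)

# Hypothetical replication-level results, for illustration only
bic_correct = [4012.3, 4050.1, 3998.7, 4021.4]
bic_mis     = [4018.9, 4046.2, 4005.0, 4033.8]
print(pct_ic_correct(bic_correct, bic_mis))          # prints 75.0

p_vals = [0.001, 0.210, 0.032, 0.047]
print(pct_lrt_correct(p_vals, gcm_is_correct=False)) # prints 75.0
```

Note that both functions simply count decisions across viable replications, matching the per-cell percentages described above.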
8 Note that four additional mis-specifications were added here beyond the main simulation: skewed-normal mis-specified as normal, skewed-normal as skewed-t, skewed-t as normal, and skewed-t as skewed-normal. Although these forms of specification error may be unlikely to arise in applied scenarios, we included them here for a complete picture of the impact of an improper distributional form on the accuracy of final results.
9 We would like to note in particular the result in the upper right plot of Figure 5, where a skewed-t distribution was analyzed as normal. Relative percent bias appears to increase as sample size increases, with some rather high bias levels at n = 3000. This larger sample size acted akin to a "population size" and allowed some very extreme intercept and slope values. When the skewed-t distribution was modeled accurately, these cases were properly assigned to their intended class, because the skewed-t distribution allows such extreme cases to exist. However, when the same data were analyzed under a normal distribution, the latent classes were not well constructed: the few very extreme cases far out in the skewed tail were treated as a latent class of their own (Class 2), and the remaining majority of cases were assigned to a separate class (i.e., Class 1, which is pictured in this plot). The plot shows many outlier dots because every case outside this very extreme category of intercept and slope values was assigned to Class 1. In reality, many of these cases were poorly classified, since they should have been identified as part of Class 2 (instead, only the extreme outliers from the skewed tail were placed in Class 2). After further investigation, we are confident that this result stemmed from the randomly generated data and would not plague an applied researcher working with a pre-set scale for items. Rather, we would argue that an applied researcher should expect decreased variability and bias (as pictured in Figure 5) as sample size increases, without many of the outlier dots we observed due to the nature of the simulation.