1,759
Views
1
CrossRef citations to date
0
Altmetric
Research Article

Model fit criteria curve behaviour in class enumeration – a diagnostic tool for model (mis)specification in longitudinal mixture modelling

ORCID Icon, ORCID Icon, ORCID Icon & ORCID Icon
Pages 1640-1672 | Received 30 Aug 2021, Accepted 04 Nov 2021, Published online: 22 Nov 2021

Abstract

The use of longitudinal finite mixture models (FMMs) to identify latent classes of individuals following similar paths of temporal development is gaining traction in applied research. However, FMM’s users may be unaware of how data features as well as the inappropriate specification of the model’s covariance structure impacts class enumeration. To elucidate this, we investigated model fit-criteria curve behaviour across an array of data conditions and covariance structures. Fit statistic patterns were variable among the fit criteria and across a range of data conditions. This variability was greatly attributable to the level of class separation and the presence/absence of random effects. Our findings support some widely held notions (e.g. BIC outperforms other criteria) while debunking others (adding random effects is not always the solution). Based on the obtained results, we present guidelines on how the behaviour of fit criteria curves can be used as a diagnostic aid during class enumeration.

1. Introduction

Longitudinal finite mixture models (FMM) are model-based clustering approaches designed to uncover latent heterogeneity in longitudinal profiles of the repeated measures type. This heterogeneity is usually represented as developmental trajectories, which comprise both inter-individual (between-subjects) and intra-individual (within-subjects) variability over time. These methods assist in identifying distinct latent classes of subjects within the population that show similar (within-class) temporal development. Assignment of subjects to such classes is typically done according to where their posterior probability (of the parameters) given the data is highest. Popular longitudinal FMMs include growth mixture models (GMM) [Citation1], latent class growth analysis (LCGA) [Citation2], and group-based trajectory models (GBTM) [Citation3].

Longitudinal FMMs are increasingly used in applied sciences, particularly health sciences to understand differences in the development and aetiology of a variety of disorders and diseases, as well as subject responses to treatment. Recent studies include whether group differences in alcohol consumption are related to cardiovascular disease [Citation4], understanding different treatment responses for adults with obsessive-compulsive disorder [Citation5], and establishing the link between cannabis use in adolescents and a variety of health factors [Citation6].

Nonetheless, it is often overlooked in practice that analysis results obtained with FMMs are sensitive to violations of their underlying assumptions, in particular the variance-covariance structure of the outcome variables in each class. This paper will investigate the impact of between-subject covariance misspecification on fit statistic behaviour during class extraction and, ultimately, on the choice of the number of classes. We examine, for instance, whether an inconclusive behaviour (e.g. continual improvement) of the considered model fit statistics (AIC, BIC, ssBIC and scaled entropy) as a function of increasing the number of fitted classes, a recurring phenomenon in practice [Citation7], is evidence of such covariance misspecification. We ascertain that identified fit statistic behaviour under such misspecifications may be used as a diagnostic tool in finding an adequate covariance structure.

We conduct a simulation study in which several data features (design conditions) are manipulated (e.g. number of repeated measures, degree of class separation, trajectory shape, true covariance structure), conforming to a specific GMM, LCGA or GBTM model. We then fit models misspecified in terms of the covariance to the data to investigate (1) whether a plateauing behaviour (or other peculiar behaviour) of the fit statistics under the fitted model is a relic of covariance misspecification, (2) how sensitive in terms of class enumeration are these fit statistics to covariance misspecification under various data features (e.g. class separation, number of time points), and (3) whether identified fit statistic patterns may assist in finding the correct model. Moreover, an empirical example using alcohol consumption data is used to illustrate the fit criteria curves as a diagnostic aid during class enumeration and model specification. Such a diagnostic tool may be useful since covariance misspecification has important consequences for both class extraction and classification performance [Citation8–14].

2. Specification of models

Longitudinal FMMs develop from the premise that within the population, K latent classes (subgroups) exist with subjects within classes following similar paths of development over time (trajectories). The marginal probability distribution P(yi) of a randomly chosen trajectory is then modelled as, (1) P(yi)=k=1KπkPk(yi)(1) where yi=(yi1,,yiT) is a vector of repeated measures for subject i,i=1,,n at time t,t=0,,T1, and Pk(yi) is the conditional distribution of the longitudinal sequence, yi, given that the subject i is in class k,k=1,,K. Further, πk is the class membership probability and conforms to πk0,k=1Kπk=1, with K>1. These models assume K to be known, but this is difficult to deduce directly from the data.

For continuous outcomes data, Pk(yi) is assumed multivariate normal (MVN) within classes, that is, (2) yikMVN(μk,Σk)(2) with yik a T×1 vector of continuous outcomes for subject i, and μk and Σk are the model-implied mean vector and covariance matrix for class k respectively. Pk(yi) is uniquely defined by the trajectory specification per class.

A GMM is the most general of our considered longitudinal FMMs. It includes both fixed effects to quantify class specific average growth curves and random effects to allow for individual differences (inter-individual differences) from the average growth curve within classes. Its class-specific trajectories may be expressed as, (3) yik=Xβk+Zbik+eik(3) where the superscript k specifies the class, X is a T×b design matrix for the fixed effects, βk is a b×1 vector of fixed effects, Z is a T×q design matrix for the random effects, bik is a q×1 vector of random effects, and eik is a T×1 residual vector. It is assumed that bikN(0,Dk) and eikN(0,Rk). Dk is the model-implied q×q random effects covariance matrix (inter-individual variation) and Rk is the T×T residual covariance matrix (intra-individual variation) of the k-th class. Ultimately, a GMM is specified where μk=Xβk and Σk=Rk+ZDkZ in Eq. (2).

LCGA and GBTM are special cases of Eq. (3) in which there are no random effects i.e. Zbik=0, such that Σk=Rk, with Σk diagonal [Citation10]. Diagonal Σk implies independence between the repeated measures within a given individual. LCGA models allow for the residual variance to differ between classes and time points. The GBTM, a popular special case of the LCGA, makes the explicit assumption of the residual variance being equal for all classes and all time points i.e. Rk=R=σ2I, where I is the identity matrix [Citation3,Citation15,Citation16].

2.1. Class enumeration

A key outcome of FMM analysis is to identify the optimal number of classes K which adequately describe the data. Several statistical fit indices can assist in selecting K [Citation16], a process known as class enumeration (synonymous with extraction). However, no fit statistic has yet emerged as the clear best performer [Citation17–22]. Therefore, practitioners are often advised to use a variety of fit statistics as well as a substantive interpretation of their models during class extraction [Citation3,Citation23,Citation24].

In this study, we restrict ourselves to the likelihood-based, information criterion (IC) model-fit indices most often encountered in practice (and widely available by default in most software), that is, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and sample-size adjusted BIC (ssBIC). Scaled entropy (sE), a statistic derived from entropy E(K), is included as a complement to the IC indices which is customarily reported as a measure of classification certainty. Table  presents their equations. The first term of the AIC, BIC and ssBIC reward models for having better log-likelihoods. The second term penalizes models for lack of model parsimony. sE is scaled to be bound between zero and one [Citation25] and is higher when models show clear classification into classes.

Table 1. Summary of fit statistic calculations.

In an ideal situation, one would expect a clear minimum of the IC and an sE close to 1 at the true number of classes. However, in practice, these ICs often do not exhibit clear-cut behaviour (e.g. a minimum value) as a function of increasing K. For instance, a ‘plateauing’ curve is frequently observed [Citation7] in that the IC continues to improve marginally as the fitted number of classes increases. It is to be established whether sE elicits similar behaviour. We hypothesize that such behaviour is evidence of random effect (between-subject) covariance misspecification, which can have serious consequences for class enumeration accuracy, classification performance, and model interpretability [Citation9,Citation10]. We ascertain whether aforesaid identified behaviour may assist in finding the correct covariance specification.

2.2. Class separation

Class separation in longitudinal FMMs typically refers to the degree of overlap between growth trajectories for latent classes [Citation26]. This may be quantified in terms of the amount of overlap between the latent classes’ growth trajectory intercept and slope or the degree of overlap between the observed repeated measures [Citation27].

Low class separation has been shown to play a substantial role in decreasing estimation accuracy in GMMs [Citation22,Citation28]. However, to date, there is no consensus on the best definition of class separation, and indeed which measure of class separation to utilize (See e.g. Nowakowska et al. [Citation29]). As such, it is largely dependent on the researcher to decide upon given the investigation at hand [Citation26]. This study employs the Cohen’s D (CD), which is often used to quantify effect sizes [Citation30], as a class separation measure. We report a time-averaged version, calculated as, (4) CDave=1Tt=0T1|μatμbt|σt(4) where t=0,,T1 is the time point, μat and μbt is the class mean (of the observed outcome variable) at time point t for class a and b respectively, and σt is the square root of the diagonal element of the total covariance matrix (Σ) corresponding to time point t.

Additionally, we report the multivariate Mahalanobis distance (MD) [Citation31], a popular class separation measure in longitudinal FMM studies [Citation32–35]. The pairwise MD in terms of the observed repeated measures is calculated as, (5) MD=(μaμb)TΣ1(μaμb)(5) where μa and μb are the mean vectors of the observed repeated measures for class a and b respectively, and Σ1 corresponds to the inverse of the covariance matrix of y which is assumed equal in both latent classes [Citation35,Citation36]. An MD of one and three usually reflects small and large class separation respectively found in the literature [Citation18,Citation32,Citation37–39].

Lastly, we provide the overlap coefficient (OVL) [Citation29] for class separation. We calculate this as the average over all time points of the overlap of two class distributions at each time point, (6) OVL=1Tt=0T1[min[fat(x;μat,Σat),fbt(x;μbt,Σbt)]dx](6) where fat(x;μat,Σat) and fbt(x;μbt,Σbt) correspond to the class density function (univariate normal) at time point t of class a and b respectively. Figure  shows low and high separation in terms of the OVL (Eq. (6)) for two univariate normal densities for a single time point. The OVL is the common area under the lower of the two density functions. The greater the overlap between the densities, the broader the x-axis range is where the minimum of the two densities is high. For the example of low separation, a CD of 0.8 corresponds to an OVL of 0.69 (see the grey area in the leftmost plot). For high separation, a CD of 2 results in an OVL of 0.32 (see the grey area in the rightmost plot).

Figure 1. An illustration of class separation for two univariate normal densities.

Figure 1. An illustration of class separation for two univariate normal densities.

2.3. Covariance misspecification

Covariance misspecification implies assuming an incorrect structure for the random effects’ covariance matrix Dk and/or for the residual covariance matrix Rk during model estimation. Such misspecification may be broadly classified into three categories [Citation9].

Covariance underspecification can occur when the true model includes class-specific covariance matrices (i.e. Dk and Rk), but the fitted model specification constrains within-class covariance matrices to be equal across classes (i.e. Dk=D or Rk=R) or even equal to zero (i.e. Dk=0). It can also be that the true model includes equal within-class covariance matrices (i.e. Dk=D and Rk=R), a GMM, but is specified such that Dk=0, an LCGA or GBTM. Covariance overspecification can occur when the true model contains equal random effect covariance matrices across classes (i.e. Dk=D) or equal residual covariance matrices across classes (i.e. Rk=R), while the model selected for analysis allows for the estimation of class-specific matrices (i.e. Dk and/or Rk). Additionally, overspecification also arises when the true model has no random effect variability within classes (i.e. Dk=0) but is estimated with such variability. In this context, the true model is an LCGA or GBTM, but the assumed model is a GMM. General covariance misspecification can occur when fitting a mixture model when one is not needed, that is, where the true model consists of a single population (i.e. growth curve model), but the analysis proceeds assuming population heterogeneity (i.e. LCGA, GBTM or GMM) [Citation9].

In our paper, we will examine the effects of under- and overspecification on fit statistic behaviour, more specifically the effect of incorrectly specifying the Dk matrix.

3. Methods

3.1. Design of the simulation study

To imitate model specifications frequently and currently used in practice [Citation4,Citation40–42], we limit ourselves to the case of underspecification where the true model has Dk=D and Rk=R (i.e. a GMM), but is estimated such that Dk=0 (i.e. an LCGA or GBTM). In the case of overspecification, we investigate the impact where the true model has Dk=0 with Rk=R (i.e. an LCGA or GBTM), but is estimated such that Dk=D (i.e. a GMM). Such equal within-class covariance matrices is the default specification of most software [Citation16] which is often inadvertently selected by practitioners.

The (true) models for data simulation are:

  • Model 1 (GBTM): With Dk=0, and Rk=R=σ2I

  • Model 2 (LCGA): With Dk=0, and Rk=R=σεt2I, that is, class-invariant but time-variant residual variance

  • Model 3 (GMM-I): With class-invariant random intercept and random linear slope allowed to covary Dk=D, and Rk=R=σ2I

  • Model 4 (GMM-II): With class-invariant random intercept and random linear slope allowed to covary Dk=D, and Rk=R=σεt2I

We then study the effect of the chosen misspecifications by considering various fitted on true model combinations. These are shown in Table . The misspecification of the R matrix in terms of time-dependency (either time-variant or time-invariant) is beyond the scope of this paper, but we do note that GMM with random slopes generates heterogeneity of variance across time points and thus may resemble a time-variant in R LCGA.

Table 2. True with fitted models considered (Misspecification: a: D underspecified, b: D overspecified, c: D correctly specified).

The design conditions underlying the data generating process are informed by previous simulation studies and applied research [Citation13,Citation43,Citation44], and are summarized in Table  and described below.

Table 3. Primary design conditions investigated.

The choice of a sample size of 1000 reflects the median condition in applied studies [Citation44]. Furthermore, a minimum sample size of 900 is suggested under conditions of multiple classes and low class separation [Citation45]. We also briefly investigate a sample size of 260 for a subset of models, as small sample sizes have a demonstrably negative impact on class enumeration [Citation33], with N=200 being the recommended minimum for complete case data and high class separation [Citation45].

Five repeated measures are chosen to be the lower bound at which to detect non-linear growth trajectories and to ensure model identifiability [Citation13], especially when including full rank covariance matrices and larger K. This is expanded to eight to mirror the higher number of repeated measures seen in applied GMM research [Citation33]. Moreover, it has been shown that increasing the number of time points has a positive effect on classification performance [Citation10]. Equally spaced time values over a fixed time interval from zero to seven are chosen, and so for T=5,t=0,1.75,3.5,5.25,7 and for T=8,t=0,1,2,3,4,5,6,7. The time interval is the same for both T-values to prevent confounding of the effect of number of time points with the effect of a change of total follow-up time.

Most simulation studies in the literature consider two or three true classes. We expand upon this by including four classes. We focus primarily on equal class sizes. However, we also explore unequal class sizes (k=1(35\%)/k=2(15\%)/k=3(15\%)/k=4(35\%)) for a subset of models since a substantial decrease in the class enumeration accuracy of the BIC compared to the AIC and ssBIC has been noted when one class is considerably smaller [Citation22].

We will impose a CDave of approximately 0.5 and 2 to reflect low and high class separation respectively. The data will be constructed in such a way that each class will be at least CDave units away from each other. MD and OVL are also reported.

Fixed effects’ parameters are altered according to the degree of class separation (low or high) corresponding with our chosen Cohen’s D separation metric. We have chosen a second-order polynomial in the fixed effects as it is a flexible function which can capture many patterns across time, including monotonic trends, and u- and n-shaped trends as well as parts thereof. Two conditions of trajectory growth are studied. One in which trajectories comprise the same intercept but different slopes between classes (natural starting (NS) point) and the second includes both different intercepts and different slopes between classes. The second condition’s functional form mimics the ‘cat’s cradle’ (CC) phenomenon often identified in applied health research with a small number of time points [Citation46–48]. Sher et al. [Citation46] present this pattern empirically in terms of alcohol use over time. Subjects’ alcohol consumption in one class starts high and remains high (chronically bad), in a second class it starts low and remains low (unaffected/non-drinkers), in a third class it starts high but reduces over time (recovery), and in the fourth class it starts low but increases over time (delayed onset).

Lastly, we impose an R that is diagonal, equal across classes, and either time-variant (TV) or time-invariant (TI). For the R matrix of the GBTM model, each diagonal element is set to equal the average of the sum of the diagonals of the full Σ matrix of the GMM. For the LCGA specification, the diagonal elements of the R matrix are set equal to the corresponding diagonal elements of the Σ of the GMM. This strategy is effected so that the total average diagonal variation is similar across the design conditions. For conditions with a non-zero D matrix (i.e. where the true model is a GMM), the proportion of total average diagonal variation explained by the random effects was set to a fixed proportion of approximately 0.5. We enforce a weak positive correlation of 0.1 between random intercept and random linear slope, in line with previous studies [Citation8,Citation13,Citation26,Citation27,Citation49].

The data generated are of the following general form, (7) yitk=(b0ik+β0k)+(b1ik+β1k)t+β2kt2+εitk(7) with k=1,..,K, i=1,,n, β0k,β1k and β2k are fixed effects quantifying the population average growth curve for class k, and b0ik and b1ik are random effects that allow for individual differences from the average growth curve of class k. In the case of LCGA and GBTM, random effects are not included. Figures  and  show the different trajectory shapes of selected true GMM-I models for different parameter sets. All considered models’ parameters are found in the supplementary material.

Figure 2. True GMM-I with NS scenario for fixed effects, T=8 and time-invariant R.

Figure 2. True GMM-I with NS scenario for fixed effects, T=8 and time-invariant R.

Figure 3. True GMM-I with CC scenario for fixed effects, T=8 and time-invariant R.

Figure 3. True GMM-I with CC scenario for fixed effects, T=8 and time-invariant R.

3.2. Simulation procedure

Longitudinal repeated measures data conforming to our true models were generated in R v3.6.3. The fitted models were estimated using the R package Mplus Automation [Citation50], which interfaces directly with Mplus [Citation51]. We used Mplus v7.3 for our analysis and ggplot2 in R [Citation52] for the plotting of figures.

Subjects were first assigned to classes according to the chosen class size, e.g. for N=1000,K=4 and equal classes, there were exactly 250 subjects in each class. Then, a vector of random effects for each subject in a class was generated according to bikMVN(0,Dk). A vector of continuous repeated measures (yi) for that subject within a class was then generated according to Eq. (7). This process was repeated 200 times for each of the 32 design conditions in Table  to generate independent datasets, giving a total of 6400 simulated datasets. Each generated dataset was used as input in the subsequent Mplus Automation step where both the true and misspecified Dk models, as given in Table , were fitted over K=1,2,,10 producing 6 400 × 2x10 = 128 000 estimated models. Anticipating that 200 replications may be too few, we ran 1000 replications for select conditions but did not observe marked differences in the results. We, therefore, adhered to 200 replications, which is in line with other published FMM research [Citation35,Citation53,Citation54].

For model estimation, we bore in mind that longitudinal FMMs are notoriously sensitive to starting values for model parameters [Citation54]. Selecting too few starting values may negatively impact the chance of finding the global solution, whilst too many may return improbable combinations, likely leading to nonconverged solutions and zero class sizes. Therefore, in line with research [Citation13,Citation54] and practical [Citation51] recommendations, for a thorough investigation of the likelihood surface, we instructed Mplus to use 100 random sets of starting values for all model parameters of a given model on a given run. The programme was then ordered to run through 20 iterations on each of these sets. Next, the programme was directed to use the 10 sets yielding the highest log-likelihood from the first stage as starting values in the final stage optimization until convergence criteria were met. The model with the highest log-likelihood from this stage was used as the basis for further analysis. Any nonconverged solutions were discarded, with the proportion of non-convergence out of all 200 runs computed per design combination per true model per fitted number of classes never exceeding 5%. For 96% of the cases, nonconvergence was below 1% (See Supplement Section S.2.). If non-convergence exceeded 2%, this was always for true or fitted model GMM, T=8, low separation, and number of fitted classes exceeding 5 (See Supplement Table S.3).

4. Results

4.1. Accuracy of class extraction in relation to design conditions and D misspecification

The impact of design conditions and D misspecification on the probability of class extraction was investigated with two logistic regression analyses; one using as outcome correct versus incorrect extraction and including all cases (logistic model 1 – LM-I), and one using as outcome over- versus underextraction and including only cases of incorrect extraction (logistic model 2 – LM-II). The K chosen from the fitted models corresponded to that K at which the IC value was lowest or the sE highest. The abundance of interactions found in these analyses between true model and every other design factor in Table  justified separate logistic regression analyses on subsets of the data to facilitate interpretation. Subset (a) included true GMM-I and GMM-II, whilst subset (b) considered true GBTM and LCGA. These subsets corresponded to examining under- and over-specification of D respectively (See Supplementary Section S.3.1 for full logistic regression results).

In the subset analyses, two-way interactions were included to test whether the effect of D misspecification on K extraction depended on the level of the design condition (e.g. low versus high class separation) and also on the fit statistic (e.g. AIC versus BIC). Further, interactions between the fit statistic and design conditions were also considered. Likelihood ratio tests were conducted to ascertain whether the included interactions yielded a significantly better fit. Many of these interactions were significant. Of particular note were; the interactions between (1) the fit statistic and class separation across all logistic regression analyses, (2) the fitted D and fit statistic in all logistic models (except LM-II(b)), and (3) the fitted D and class separation level across all logistic models (except for LM-II (a)). These interactions are displayed in Figure  (to be discussed in greater detail) and Appendix Figures A.1-A.3, which show the patterns and sizes of the effects of the design conditions on class enumeration performance for each criterion.

Figure 4. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Natural starting point, T=8 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure 4. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Natural starting point, T=8 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure  presents the estimated probabilities of K extraction for the fit statistics under natural starting point (NS) and eight time points (T=8) for all logistic models considered. The remaining combinations of NS/CC and T=5/T=8 are presented in the Appendix. These figures were chosen since they respect the prominent interactions of fitted D with class separation and fit statistic.

The findings of LM-I (outcome: correct class extraction) displayed in Figure  (a) are multifaceted. First, it shows that, irrespective of class separation, all IC fit statistics had a low probability of selecting the true K when D was underspecified (i.e. true model is GMM, fitted model is GBTM/LCGA). Further, under high class separation, BIC and ssBIC performed almost perfectly for fitted models with D overspecification (i.e. true model is GBTM/LCGA, fitted model is GMM) or correct specification, whereas the AIC and sE performed substantially worse. By contrast, all fit statistics performed poorly under low class separation irrespective of the D specification, although the AIC generally performed slightly better than the other fit statistics if D was correctly specified or overspecified. The effect of the number of time points, time-variant versus time-invariant R for all fitted models and fit statistics on correct class extraction was inconspicuous compared to the effect of class separation (See the Appendix).

Figure  (b) shows the results of LM-II (over- versus underextraction). Here, regardless of class separation, for underspecified D, the IC fit statistics had a 100% probability of overextraction (given incorrect extraction). For D over- or correct specification, all IC fit statistics showed a high probability to overextract under high separation and underextract under low separation. sE, however, showed converse behaviour, under- and overextracting under high and low separation respectively.

To conclude, all fit statistics performed poorly in terms of correct K extraction under low separation, with BIC and ssBIC performing the worst. Under high separation, BIC performs best, followed by the ssBIC. Furthermore, under high separation, underspecification of D is associated with a high risk of incorrect class extraction for the IC fit statistics, whereas overspecification of D for all fit statistics shows little risk of incorrect K extraction, particularly for the BIC and ssBIC. Moreover, among the cases that were incorrectly extracted, underspecification of D is associated with a high risk of overextraction by the IC statistics regardless of class separation. For over- and correct specification of D, IC fit statistics tended to overextract under high separation and underextract under low separation, among the subset of cases with incorrect extraction.

4.2. Identifiable patterns of fit statistic curves across all conditions

4.2.1. Screeplot behaviour

Each of the 32 simulation design conditions considered in Table  for the true model yielded fit statistic curves and summary bar charts. Given space constraints, only two of these conditions are presented in Figures  and , but the corresponding figures for all design conditions are provided in Supplement S.4. Each figure condenses the output information of 200 simulations (runs) for one design condition. For each fit criterion (rows), three distinct plots are shown (columns): the fit statistic curve given as the average over 200 runs at each number of classes (left), frequency distribution of the number of turning points of the fit statistics’ curve of a single run (middle), and frequency distribution of the final selected K (right). This information is provided in each subplot separately for each fitted model. A turning point for AIC, BIC and ssBIC is defined as a point K where IC(K)<IC(K1) and IC(K)<IC(K+1). For sE this is defined as a point K where sE(K)>sE(K1) and sE(K)>sE(K+1).

Figure 5. Fit statistic behaviour for true 4 class GMM-I with natural starting point, high class separation and T=8. Left column: The average fit statistic value over all runs (ordinate axis) against the number of estimated classes (abscissa); middle column: Frequency of the number of turning points in the individual fit statistic curves (n=200 runs); right column: frequency of specific K being selected.

Note: Turning point for AIC, BIC and ssBIC is defined as a point K where both IC(K)<IC(K1) and IC(K)<IC(K+1). For sE, a turning point is defined as a point K where both sE(K)>sE(K1) and sE(K)>sE(K+1).

Figure 5. Fit statistic behaviour for true 4 class GMM-I with natural starting point, high class separation and T=8. Left column: The average fit statistic value over all runs (ordinate axis) against the number of estimated classes (abscissa); middle column: Frequency of the number of turning points in the individual fit statistic curves (n=200 runs); right column: frequency of specific K being selected.Note: Turning point for AIC, BIC and ssBIC is defined as a point K where both IC(K)<IC(K−1) and IC(K)<IC(K+1). For sE, a turning point is defined as a point K where both sE(K)>sE(K−1) and sE(K)>sE(K+1).

Figure 6. Fit statistic behaviour for true 4 class GBTM with natural starting point, high class separation and T=8. Left column: The average fit statistic value over all runs (ordinate axis) against the number of estimated classes (abscissa); middle column: Frequency of the number of turning points in the individual fit statistic curves (n=200 runs); right column: frequency of specific K being selected.

Note: Turning point for AIC, BIC and ssBIC is defined as a point K where both IC(K)<IC(K1) and IC(K)<IC(K+1). For sE, a turning point is defined as a point K where both sE(K)>sE(K1) and sE(K)>sE(K+1).

Figure 6. Fit statistic behaviour for true 4 class GBTM with natural starting point, high class separation and T=8. Left column: The average fit statistic value over all runs (ordinate axis) against the number of estimated classes (abscissa); middle column: Frequency of the number of turning points in the individual fit statistic curves (n=200 runs); right column: frequency of specific K being selected.Note: Turning point for AIC, BIC and ssBIC is defined as a point K where both IC(K)<IC(K−1) and IC(K)<IC(K+1). For sE, a turning point is defined as a point K where both sE(K)>sE(K−1) and sE(K)>sE(K+1).

Figure  (true model = GMM-I) shows that when a GBTM is fitted (i.e. underspecified D), the IC statistics exhibit clear plateauing behaviour given their continual improvement as K increases (left column). Furthermore, the general absence of turning points (middle column) highlights their proclivity to overextract as the maximum considered K=10 is always selected (right column). This conforms with the findings in Figure , showing the ICs’ high probability of incorrect extraction for underspecified in D models (Figure  (a)), see True: GMM-I Fitted: GBTM, specifically overextraction (Figure  (b)). This pattern is repeated throughout conditions where underspecified models are fitted (see Supplement S.4). For a correctly specified GMM, both the BIC and ssBIC are highly accurate and stable showing a large majority of one turning point at K=4. sE exhibits erratic and inaccurate performance compared to the BIC and ssBIC for the true model. AIC shows a tendency to overextract, even under the correct model. Again, these observations conform to the logistic regression results (Figure ) which highlights the poor accuracy of the AIC and sE relative to the BIC and ssBIC.

In Figure  (true model = GBTM), the BIC and ssBIC of both correct and overspecified models extract the correct K. In contrast, the sE for correct D and the AIC for both correct D and overspecified D shows lower accuracy. The BIC and ssBIC do not show plateauing behaviour as there is a single turning point. This pattern recurs in similar cases (see Supplement S.4) indicating that the risk of an incorrect K under overspecification appears small with high separation.

It is noticeable that the average scree plot of the various IC fit statistics (which approximately matched the individual scree plots, one per simulated dataset) (See Supplement S.4) is smooth (i.e. gradual improvement in IC) for underspecified models, that is, when the true model is a GMM and a GBTM or LCGA is fitted. By contrast, the curves are jagged (i.e. quick uneven improvement in IC to an elbow, with no or hardly any improvement in the IC beyond the true K) for correct or overspecified models, that is, when a GBTM or LCGA is the true model underlying the data and a GBTM, LCGA or GMM is fitted. This noticeable pattern may assist practitioners in refining their model’s covariance structure as the smoothness indicates the necessity for random effects or a respecification of the covariance structure.

4.2.2. Fit statistic behaviour across all simulation conditions

Figure  summarizes the results of only four of all 32 simulation conditions whilst Figures  and  do so for one condition each. Therefore, to visualize patterns within the data for all 32 simulation conditions, heatmaps [Citation55] are presented. These heatmaps summarize the outputs of all the crossed true model by fitted model simulations for time-invariant (Figure ) and time-variant (Appendix Figure A. 4) R conditions respectively. The results of the 200 simulations per design condition (rows), per fitted model and per fit criterion (columns) are summarized by two cells in separate but complementary heatmaps within Figure  (respectively Appendix Figure A. 4):

  • each cell in panel A (for IC fit statistics) or panel C (for scaled entropy) displays the proportion of correct K extracted by the fit statistic out of all runs for each design condition,

  • each cell in panel B (for IC fit statistics) or panel D (for scaled entropy) shows the modal K (i.e. the most frequent K extracted by a fit statistic) for each design condition

Figure 7. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for different fitted models under time-invariant R conditions (GBTM, GMM-I). Ordinate axis coded as H/L_NS/CC_8/5 indicating: Class separation: H(igh) or L(ow), Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, (4) Correctly specified GBTM.

Figure 7. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for different fitted models under time-invariant R conditions (GBTM, GMM-I). Ordinate axis coded as H/L_NS/CC_8/5 indicating: Class separation: H(igh) or L(ow), Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, (4) Correctly specified GBTM.

Combined, the two cells for a given condition inform whether the fit statistics performed well in terms of extracted K accuracy (panels A and C) under each design by fitted model condition, while simultaneously hinting at their underlying fit statistic curve behaviour (panels B and D).

Each panel is divided into 4 quadrants. The top left quadrant corresponds to underspecification of D, the bottom right quadrant represents overspecification of D, and the remaining two quadrants correspond to correct specification. Within panels, the x-axis corresponds to the fitted model and its associated fit statistic, e.g. GBTM (ssBIC) shows that a GBTM was fitted with its ssBIC output given. The y-axis shows the true model and its underlying design conditions, with the naming convention of TrueModel_Trajectory shape_Degree of class separation_Number of repeated measures. For example, the performance of the AIC for a GBTM fitted to a true GMM-I with a natural starting point and high class separation for 8 measurements (row 1 in Figure ) corresponds to the coordinate (Panel: A and C, Row: GMM-I_NS_H(igh)_T=8, Column: GBTM(AIC)) in Figure . The results in the heatmaps are arranged such that the upper half of each panel corresponds to true GMM models, and the lower half to either true GBTM (Figure ) or LCGA (Appendix Figure A. 4) models. Within each half, the results are further divided by class separation, then trajectory shape, and finally the number of repeated measures.

To facilitate interpretation, consider light cells in panel A of Figure . These cells have low proportions of correct K extraction. However, these cells on their own do not indicate whether the fit statistics extracted more or fewer classes than the correct K, just that K=4 was selected hardly ever in all the 200 runs. The additional nuance of under- or overextraction is found in the corresponding cell in panel B. Here, given A, if the cell in B is also light, e.g. for (GMM-I_NS_H_T=8, GBTM(AIC)) the modal K=10, this signifies class overextraction. This occurs mainly if the true model is GMM and the fitted model is GBTM, that is, if the covariance matrix D is underspecified. The fact that this overextraction is accompanied by a plateauing behaviour of the fit statistic curve is confirmed by considering both the left and middle columns of Figure  (or associated Supplemental figures) as discussed previously. By contrast, if the cell is light in A, but dark in B (bottom right quadrant), this is an indication of underextraction by the fit statistic i.e. the fit statistic curve reached a minimum point before the correct K. This occurs if D is correctly specified or overspecified, combined with low class separation.

Consider now the darkest cells in panel A which display the highest accuracy of correct K extracted (upper half of top right quadrant and of both bottom quadrants). Their counterparts in panel B confirm that the fit statistics excelled in selecting a modal K=4. This optimal extraction performance again transpires in exemplar Figure  (considering GBTM_NS_H_T=8, GBTM(BIC,ssBIC)): these IC fit statistics curve had a clear (elbow) minimum turning point at the correct K=4, with no improvement in the curve beyond the true K (left) and highest frequency of one turning point (middle).

The heatmaps thus encapsulate unfolding fit statistic patterns over increasing K, while directing attention to specific combinations of design conditions and fitted models (x- and y-axes), in which standout curve behaviours are observed.

Information criteria (IC) for time-invariant R (Figure A and B): What is immediately apparent is the large number of zero proportion correct K in panel A. These are seen when fitted models are underspecified in D (the upper left quadrant of panel A) or for low class separation (rows 5–8 and 13–16 from the top) independent of whether D is over-, under- or correctly specified. For high separation (rows 1–4 and 9-12), we notice with correct specification of D (upper half of upper right quadrant and of lower left quadrant of panel A) and with overspecification of D (upper half of the bottom right quadrant of panel A) high accuracy of the ICs, which in most cases exceeds 98% correct. However, the AIC is considerably less accurate than the BIC and ssBIC.

Linking the above-identified behaviour to panel B, we see that under high class separation and underspecified D this is associated with a modal K=10 in a vast majority of cases – in line with plateauing behaviour (see appendix scree and turning point bar plots) and the associated risk of overextraction. Under low class separation, underspecified D is still associated with overextraction exhibiting a high modal K where most cases exceed K=8. Therefore, when the IC of fitted GBTM selects considerably more classes than a fitted GMM, this may point to underspecification in terms of D. For high separation, the overspecified or correctly specified in D models (associated with the darker regions) show a low risk of overextraction where they almost always have a modal K at the true K. Moreover, under conditions of low separation for correctly specified and overspecified models we notice the tendency of ICs to under-extract classes.

sE for time-invariant R: sE also performs better under high class separation, but its performance is inconsistent. In particular, the sE does not appear to perform better in class extraction if the correct model is fitted than if an under- or overspecified model is fitted.

The findings of time-variant R (shown in Appendix Figure A. 4) are similar to time-invariant R. This is in line with our logistic regression results, which confirms that the effect of the level of R is small relative to those of class separation and of D specification.

To conclude, it appears that when the true model is fitted under high class separation, the best IC fit statistic is most often observed at or close to the true K showing a clear elbow. If an underspecified model, GBTM respectively LCGA, is fitted to a true GMM-I and GMM-II respectively, it is frequently observed that the AIC, BIC and ssBIC fit statistics continue to improve as K increases (plateauing behaviour). No useful fit curve behaviour for sE can be found. Lastly, fit statistic class enumeration behaves poorly under low class separation, but the patterns identified under high separation repeats in the low separation conditions, however with the modal K being lower (see Supplement S.4).

4.3. Unequal class sizes and small sample size

Whether the patterns identified above also hold for select models with unequal class sizes (two classes of 35%, two of 15%) or small sample size (N=260) was briefly investigated. Only conditions of high separation (as low separation has already been shown to be detrimental to class extraction), NS and CC, T=8 and time-invariant R, were considered.

The ICs perform similarly under unequal classes (Appendix Figure A. 5). However, compared to equal classes, the overextraction behaviour of underspecified models in D is more pronounced under unequal classes given cat’s cradle as they show a higher relative frequency of modal K=10. The BIC remains accurate with correctly specified and overspecified models. sE again performs poorly and erratically under correctly fitted models. Finally, the plateauing and elbow behaviour identified previously also holds under unequal class sizes (see Supplement S.4.3).

Under small sample conditions (Appendix Figure A. 6), the previously identified behaviour of the ICs is retained. The BIC performs better than the other ICs, but does suffer a decrease in accuracy under correctly specified GMM-I and over-specified GMM-I under CC compared to the large sample condition. We also take note of the decrease in accuracy of the ssBIC under small samples, which is noticeable given that it is meant to perform better under small samples [Citation56]. The sE appears to perform better in small samples than in large samples, with 8/12 of the crossed models having a modal K=4 frequency of above 50%, but in general, remains an unreliable class enumeration measure. Moreover, plateauing and elbow behaviour of the IC fit statistics associated with the level of D specification is still clearly evident under small sample sizes (Supplement S.4.4).

5. Application

In this section, we will show how an appropriate covariance structure for a longitudinal finite mixture model can be selected using the fit statistic behaviour as an aid. We consider the log-transformed self-reported alcohol consumption (ACit=log(ACit+1)) of n=908 individuals from a former longitudinal FMM study [Citation4]. ACit is the total volume of weekly consumption (in glasses) of subject i measured at four time intervals (t = 1 Youth: 12–18 years, t = 2 Young adult: 19–27 years, t = 3 Adult: 28–44 years, t = 4 Middle age: 45–60 years). The specifications of the models fitted to the data conform to GBTM, LCGA, GMM-I and GMM-II considered in this paper. Model pairings for comparison in line with our study would be GBTM with GMM-I and LCGA with GMM-II. In line with a previous study [Citation16], each class trajectory is modelled as a quadratic function of time, such that, (8) ACitk=(β0k+b0ik)+(b1ik+β1k)timeit+β2ktimeit2+εitk(8) with the specific polynomial and equation parameters conforming to Eq. (7). The objective of such an exercise is to identify classes of individuals following distinct trajectories of alcohol consumption over time.

The IC fit statistic curves of the estimated models are displayed in Figure . We exclude sE as we have established that it exhibits no discernible pattern in identifying covariance misspecification. A plateauing behaviour of the IC statistics for the fitted GBTM and LCGA is evident, as all curves have a minimum value at K=10 (which is the maximum K examined). For the ICs, both GMM models show clear turning points at a K consistently lower than the associated GBTM or LCGA. The ICs for GMM-I suggest a K between 5 and 8 classes, whilst for GMM-II they all point to 6 classes. This preliminary evidence, taken together, hints at the LCGA and GBTM being underspecifications of the covariance structure of the data. They are, therefore, inconclusive. When presented with such fit statistic behaviour, the researcher is advised to explore models that allow for a more complex covariance structure. Accordingly, good candidate models to explore further include the GMM-I and GMM-II with their more general covariance structure. These models can then be further refined in sequential steps as has been suggested by other authors [Citation16,Citation24,Citation57,Citation58]. Such refinements would include (amongst others) inspection of models for non-convergence, non-identifiability, checking the significance of fixed effects parameters, class separation, and ascertaining distinctiveness of trajectories [Citation16]. As an illustration, using the available OVL R code (Supplementary material), we computed the magnitude of the class separation among the K=6 trajectories (shown in the Supplement) for the GMM-II model. This yielded OVLs ranging from 0.197 (between trajectories 2 and 4) to 0.73 (between 3 and 4), indicative of high to moderately low class separation levels (See appendix).

Figure 8. Fit statistic curves of estimated models (optimum fit statistic value at bold shape).

Figure 8. Fit statistic curves of estimated models (optimum fit statistic value at bold shape).

6. Discussion

6.1. Research questions recalled

Given the above results, we can answer our research questions:

  1. Is a plateauing behaviour (or other peculiar behaviour) of the fit statistics under the fitted model a relic of covariance misspecification?

We find that underspecification of the D matrix (random effects structure) across all considered design conditions leads to a continual improvement, and associated plateauing behaviour in the IC fit statistic (AIC, BIC, ssBIC) as fitted K increases. These underspecified models do not adequately capture the underlying variability (contained in D), which increases the likelihood of overextraction. This covariance misspecification is, thus, encapsulated as spurious latent classes [Citation59]. No useful consistent pattern for sE fit curves across fitted models is easily identifiable.

  1. How sensitive in terms of class enumeration are these fit statistics to covariance (D) misspecification under various data features (e.g. class separation, number of time points)?

The ICs, especially the BIC and ssBIC, of overspecified and correctly specified models enumerate accurately under high separation. The sE under similar conditions performs worse than the BIC and ssBIC. However, under low separation, all fit statistics perform poorly with the ICs tending to underextract whilst the sE overextracts. For all levels of D misspecification, the effect of the number of repeated measures, time-variant versus time-invariant R, and NS versus CC on correct K extraction by IC fit statistics is considerably lower than the effect of class separation.

  1. Do identified fit statistic patterns assist in finding the correct model?

We posit that if the ICs of a fitted GBTM or LCGA continually improve as K increases, then this is indicative of covariance underspecification. This position is even more compelling if a GMM fitted to the same data yields a better fit in terms of IC fit statistics at a considerably lower K. In this case, the guidance provided by the ICs (namely the number of classes) for the GBTM or LCGA may be misleading and prone to overextraction. A thorough investigation of the proposed covariance structure and model is then warranted.

If an overspecified GMM is fitted where an LCGA or GBTM would suffice, the value of the IC fit statistics for all three fitted models tend to be lowest and similar at the true K motivating the selection of the more parsimonious (i.e. GBTM or LCGA) model. This can be confirmed using likelihood ratio tests (LRTs) [Citation16] such as the adjusted Lo-Mendell-Rubin LRT (aLMR) [Citation19]. Finally, no identifiable diagnostic pattern for the sE was found.

6.2. Fresh insights and recommendations

Some further insights can be gleaned from our results, which debunk, confirm and/or complement several widely held opinions about FMM class extraction:

  • Firstly, although the ICs of an overspecified GMM are less likely to extract spurious classes than the ICs of an underspecified GBTM or LCGA under high separation, they, all perform poorly under low class separation. Here, the ICs of the overspecified GMM underextracts, while the ICs of the underspecified GBTM and LCGA continue to overextract, which complements established research [Citation22,Citation35,Citation60]. Crucially, with low separation, the addition of random effects is not a panacea and could potentially collapse clinically meaningful classes with distinct patterns of change into a single class.

  • Secondly, under high separation, the AIC shows a greater tendency to incorrectly extract classes (compared to BIC and ssBIC), even under correctly specified models. The BIC was the most accurate class enumeration fit statistic, followed by the ssBIC, but as with AIC, they tend to overextract when D is underspecified. Under low separation, all fit statistics perform poorly.

  • Thirdly, the use of sE for model selection during class enumeration has been cautioned against [Citation61]. Our findings warrant this cautionary tone.

  • Fourthly, the notion that the risk of overextraction is high in particular for larger samples (N>1000) [Citation9,Citation21,Citation48,Citation53] is not fully correct. We have shown that underspecification of D can lead to overextraction even for smaller sample sizes (N=260).

Additionally, it must be emphasized that the fit statistic criteria only serve as a guide in determining the number of classes. The final decision of how many classes to extract is not an automatic process and demands considerable involvement from the researcher at every step of model fitting. This includes the judicious use of statistical analysis and substantive interpretation [Citation24,Citation57]. Considering our research findings, we recommend that:

  • If a plateauing behaviour of ICs for GBTM and LCGA is evident, a visual inspection of the estimated mean trajectories within each class is warranted (particularly if practitioners do not have access to GMM capable software). Higher K solutions showing classes not substantively different from each other (e.g. trajectories that are either parallel, have especially low class separation, or exhibit very small or null class sizes) should be discarded and a more parsimonious model selected. Example code for OVLs, i.e. the class separation index, is provided in the supplementary material. Researchers are advised to compute the OVLs among the trajectories as an adjunct to assess the quality of class extraction.

  • If there are multiple candidates for K based on the BIC, then likelihood ratio tests such as the adjusted Lo-Mendell-Rubin LRT (aLMR), substantive interpretation and visual inspection of trajectories could assist in further refining K [Citation16,Citation24,Citation57,Citation58].

  • In scenarios suggesting underextraction, in particular under low-class separation, researchers are advised to carefully evaluate the distinctiveness of the longitudinal profiles of candidate models, while considering their theoretical relevance. One could check for multimodality or a wide mode of the residual distribution per time point (for GBTM), or of the random effect distribution of the intercept and slope (for GMM), with deviations from normality being indicative of possible underextraction. Failure to address this may lead to wrong inferences such as an incorrect standard error of the class trajectory slope and the slope itself may be biased. We have, however, not explored this possibility in this paper.

6.3. Class enumeration: reification and validation

In empirical sciences, FMMs are widely used for clustering purposes. Lesser known is that FMMs can be used to approximate oddly shaped distributions using a mixture of normal distributions, with specific applications in handling non-normal data including missing values [Citation62,Citation63] and outlier detection [Citation64]. In a clustering context, however, this ability to approximate a non-normal distribution becomes a liability. In 2003, Bauer and Curran [Citation49] drew attention to this by demonstrating that FMMs uncover spurious latent classes in one-class, non-normal data. Since then, other studies have replicated GMMs’ over-extracting tendency [Citation65] within a clustering context, with a solution for that developed and implemented using robust non-normal skewed distributions [Citation66–68]. We did not address models’ and fit criteria’ performances under violations of distributional assumptions, and further simulations should explore whether our findings can be replicated under such circumstances.

Moreover, the vicissitudes of class enumeration make the ‘reification fallacy’ [Citation69] admonition as relevant as ever. The caution that one should refrain from interpreting latent classes as true entities, particularly in exploratory studies, is seldom misplaced. However, this issue pertains more to the external (as against internal) validation of the classes. For instance, two recent developments in FMMs applications substantiate a more theoretically founded interpretation of identified trajectories, specifically through criterion validity (genotyping) [Citation70,Citation71] or replicability of findings (meta-analyses) [Citation72]. In these cases, a more lenient posture towards classes’ reification may be justified.

6.4. Limitations

We acknowledge the limitations of this study, which includes the focus on continuous repeated measures as encapsulated by a multivariate normal link function (as in Eq. (2)). It would be instructive to investigate such emergent fit statistic patterns for different data types (e.g. count and binary data). Parameter recovery of the extracted trajectories when the correct K is selected was not considered as our focus was on establishing fit statistic curve behaviour under different D misspecification. However, preliminary visual inspection of the trajectories when the known simulated K is selected (under high separation, correct and overspecified D) suggests that the recovery of trajectories’ temporal paths and class sizes is sound.

7. Conclusion

This paper has shown via extensive simulation that fit statistic curve behaviour can be a valuable diagnostic tool assisting model selection. Hence, practitioners of longitudinal FMM are advised to plot and inspect the fit statistics’ patterns of change as a function of increasing K during class enumeration. These plots engender a better understanding of data features which underlie problematic behaviour of model fit statistical indices, helping to identify possible covariance misspecification. Notably, a continual improvement of the IC for fitted GBTM and LCGA as the researcher increases the number of classes is a clear indication of the models not adequately capturing the underlying covariance structure (underspecification), which then manifests into spurious latent classes. As a tool, these plots represent an additional step in following a transparent and methodological approach when fitting longitudinal FMM [Citation3,Citation16,Citation24,Citation57].

Finally, the OVL may serve an ancillary role in model fitting by first establishing the level of class separation between extracted trajectories, and thus the quality of class extraction before further refinements in the model fitting process. For cases of low separated classes, researchers will need to go beyond fit criteria to transparently substantiate their choices, such as using complementary criteria (e.g. theoretical justification, residual plots).

Software acknowledgements

Additional packages not previously cited but used to conduct our research includes; brglm2 [Citation73], dplyr [Citation74], ggeffects [Citation75], lme4 [Citation76], stargazer [Citation77], viridis [Citation78].

Supplemental material

JSCS_Supplementary_Material_final.docx

Download MS Word (15.6 MB)

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Appendix

Figure A1. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Cat’s cradle, T=5 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure A1. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Cat’s cradle, T=5 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure A2. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Natural starting point, T=5 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure A2. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Natural starting point, T=5 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure A3. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Cat’s cradle, T=8 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Figure A3. Estimated probabilities of K correct (upper half) or of overextraction given incorrect K (lower half) for true by fitted model under low/high class separation given conditions: Cat’s cradle, T=8 repeated measures. Left half concerns models GMM-I and GBTM, right half concerns models GMM-II and LCGA.

Table A1. OVL for 6-class GMM-II. Each cell corresponds to the OVL between corresponding classes given in row and column headings.

Figure A4. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for different fitted models under time-variant R conditions (LCGA, GMM-II). Ordinate axis coded as H/L_NS/CC_8/5 indicating: Class separation: H(igh) or L(ow), Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified LCGA (2) Correctly specified GMM (3) Overspecified GMM (4) Correctly specified LCGA.

Figure A4. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for different fitted models under time-variant R conditions (LCGA, GMM-II). Ordinate axis coded as H/L_NS/CC_8/5 indicating: Class separation: H(igh) or L(ow), Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified LCGA (2) Correctly specified GMM (3) Overspecified GMM (4) Correctly specified LCGA.

Figure A5. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for time-invariant R with unequal class sizes (35%/15%/15%/35%) (GBTM, GMM-I). Ordinate axis coded as GMM-I/GBTM_NS/CC_H/L_NS/CC_T=8/5 indicating: True Model: GMM-I or GBTM, Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), Class separation: H(igh) or L(ow), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, 4.) Correctly specified GBTM.

Figure A5. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for time-invariant R with unequal class sizes (35%/15%/15%/35%) (GBTM, GMM-I). Ordinate axis coded as GMM-I/GBTM_NS/CC_H/L_NS/CC_T=8/5 indicating: True Model: GMM-I or GBTM, Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), Class separation: H(igh) or L(ow), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, 4.) Correctly specified GBTM.

Figure A6. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for time-invariant R for N=260 (GBTM, GMM-I). Ordinate axis coded as GMM-I/GBTM_NS/CC_H/L_NS/CC_T=8/5 indicating: True Model: GMM-I or GBTM, Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), Class separation: H(igh) or L(ow), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, (4) Correctly specified GBTM.

Figure A6. Heatmaps of the proportion correct K=4 (panels A and C) and modal K (panels B and D) extracted by different fit statistics for time-invariant R for N=260 (GBTM, GMM-I). Ordinate axis coded as GMM-I/GBTM_NS/CC_H/L_NS/CC_T=8/5 indicating: True Model: GMM-I or GBTM, Trajectory shape: N(atural) S(tart) or C(at’s) C(radle), Class separation: H(igh) or L(ow), and Time points: 8 or 5. All panels: Quadrants clockwise from upper left: (1) Underspecified GBTM, (2) Correctly specified GMM, (3) Overspecified GMM, (4) Correctly specified GBTM.