Teacher's Corner

Linear Logistic Scoring Equations for Latent Class and Latent Profile Models: A Simple Method for Classifying New Cases

Received 05 Mar 2024, Accepted 30 Jun 2024, Published online: 23 Jul 2024

Abstract

Researchers are often interested in using latent class or latent profile parameter estimates to obtain posterior class membership probabilities for observations other than those of the original sample. In this paper, we demonstrate that these probabilities typically take on the form of linear logistic equations with coefficients which are functions of the original model parameters. In other words, the posterior class membership probabilities can be computed with a prediction formula similar to that of a multinomial logistic regression model. We derive the scoring equations for nominal, ordinal, count, and continuous indicators, as well as investigate models with missing values on class indicators, local dependencies, covariates, or multiple latent variables. In addition to the mathematical derivations of the scoring equations, we describe how either exact or approximate scoring equations can be obtained by estimating a multinomial regression model using a weighted data set.

In applications of factor analysis, after selecting and estimating the factor model of interest, one will typically obtain (linear) factor-score equations which can be used to estimate the subjects’ factor scores as a function of the original items included in the model (Bartholomew et al., Citation2011). An important feature of factor-score equations is that these can be used not only for the subjects in the estimation sample, but also for new subjects, that is, for out-of-sample prediction.

When performing a latent class (LC) analysis, after selecting the final model, one may assign the individuals in the estimation sample to LCs using their posterior class membership probabilities. However, it is not well known that these posterior probabilities can be expressed exactly by a set of linear logistic equations, with “regression” weights which are functions of the original LC model parameters. More specifically, a closed-form expression for the posteriors exists, as a function of the LC model parameters, if the responses are modeled using distributions from the exponential family with canonical link functions. Availability of a set of scoring equations makes it straightforward to compute the class membership probabilities for subjects who do not belong to the original sample used to estimate the LC model. In this way, one can realize an important goal of many LC analysis applications, namely obtaining out-of-sample class membership predictions. The main advantage of this approach is that it allows one to predict the class memberships of new subjects without the need to use LC analysis software or to program the formula of the estimated LC model. Note that this is similar to what is done in factor analysis, where factor scores are obtained using a linear factor-score formula, without the need to return to the estimates of the factor covariances, factor loadings, and residual covariances.

As far as we know, LatentGOLD (Vermunt & Magidson, Citation2016; Citation2021) is currently the only software for LC analysis that allows one to obtain these logistic scoring equations, both in tabular form and in the form of SPSS or R syntax. The aim of this paper is to show how these equations are derived. As will be shown, the slopes of the linear logistic scoring equations are obtained easily, but the expression for the intercept terms (constants) may be somewhat more complex. In many situations, the equations for the posteriors will contain only main effects of the response variables. However, as in quadratic discriminant analysis, when a LC model for continuous responses assumes variances to be class specific, quadratic terms also need to be included, and when the LC model contains covariances/associations which are class specific, interactions are also required. The approach can be extended easily to LC models with covariates and multiple latent variables. More complicated are situations where responses contain missing values (in which case the constants need to be adapted to the missing data pattern), where the model contains direct effects of covariates on the responses (in which case the exact logistic form may no longer hold), and where non-canonical link functions are used (in which case there is no longer any direct relation between the LC model parameters and the scoring equation).

Rather than computing the scoring equations from the LC model parameters, one can also obtain these equations by estimating a multinomial logistic regression model using the posteriors as weights, as done in the LatentGOLD Step3-Scoring option (Vermunt & Magidson, Citation2016; Citation2021). This approach has the advantage of increased flexibility in that it is also possible to obtain approximate equations when exact closed-form solutions are not available, or when one prefers a simpler approximate set of scoring equations over more complex exact equations.

Below, we present the scoring equations for models for categorical responses (nominal, ordinal, and counts), models for continuous responses, models with local dependencies, models with covariates, models with missing values on responses, and models with multiple latent variables. We also discuss how the scoring equations can be obtained using a weighted multinomial logistic regression analysis.

1. Latent Class Models for Categorical Responses

Let $Y_j$ denote one of $J$ response variables (or indicators), with $1 \le j \le J$. A particular response on $Y_j$ and the number of categories of the $j$th response variable are referred to as $y_j$ and $R_j$, respectively, with $1 \le y_j \le R_j$. The probability of having a particular set of responses $\mathbf{y}$ is denoted by $P(\mathbf{Y}=\mathbf{y})$. The discrete latent variable is denoted by $X$, a particular latent class by $k$, and the number of classes by $K$.

1.1. Nominal Responses

The standard LC model for nominal responses has the following form (Collins & Lanza, Citation2010; Goodman, Citation1974a; Citation1974b; Hagenaars, Citation1990; McCutcheon, Citation1987): $P(\mathbf{Y}=\mathbf{y})=\sum_{k=1}^{K}P(X=k)\prod_{j=1}^{J}P(Y_j=y_j \mid X=k)$, where $P(X=k)$ is the probability of belonging to class $k$ and $P(Y_j=y_j \mid X=k)$ the conditional probability of giving response $y_j$ on variable $Y_j$ conditional on belonging to class $k$. These probabilities are often parameterized using logistic equations (Formann, Citation1992; Heinen, Citation1996; Magidson & Vermunt, Citation2004); that is, $P(X=k)=\frac{\exp(\gamma_k)}{D}$ and $P(Y_j=y_j \mid X=k)=\frac{\exp(\alpha_{y_j}+\beta_{y_j k})}{E_{jk}}$,

with $D=\sum_{k=1}^{K}\exp(\gamma_k)$ and $E_{jk}=\sum_{y_j=1}^{R_j}\exp(\alpha_{y_j}+\beta_{y_j k})$.

Here, $\gamma_k$ are intercept or constant terms in the regression model for $P(X=k)$, and $\alpha_{y_j}$ and $\beta_{y_j k}$ are intercept and slope parameters in the regression model for $P(Y_j=y_j \mid X=k)$. As always, identifying constraints need to be imposed on the logistic parameters. Typically, they are either restricted to sum to 0 over classes and response categories (referred to as effect coding), or set to 0 for one class and one response category (called dummy coding). The terms $D$ and $E_{jk}$ are normalizing constants.

The posterior probability of belonging to class $k$ conditional on response vector $\mathbf{y}$, denoted by $P(X=k \mid \mathbf{Y}=\mathbf{y})$, can be obtained as follows (Dias & Vermunt, Citation2008; Goodman, Citation1974a; Citation1974b): $P(X=k \mid \mathbf{Y}=\mathbf{y})=\frac{P(X=k)\prod_{j=1}^{J}P(Y_j=y_j \mid X=k)}{\sum_{k'=1}^{K}P(X=k')\prod_{j=1}^{J}P(Y_j=y_j \mid X=k')}$.

Replacing the model probabilities by their logit equations yields: $P(X=k \mid \mathbf{Y}=\mathbf{y})=\frac{\frac{\exp(\gamma_k)}{D}\prod_{j=1}^{J}\frac{\exp(\alpha_{y_j}+\beta_{y_j k})}{E_{jk}}}{\sum_{k'=1}^{K}\frac{\exp(\gamma_{k'})}{D}\prod_{j=1}^{J}\frac{\exp(\alpha_{y_j}+\beta_{y_j k'})}{E_{jk'}}}$.

This equation can be simplified by removing $D$ and $\alpha_{y_j}$, which are redundant because they do not depend on $k$. Moreover, the product over the $J$ responses can be replaced by a sum over the logs of the terms concerned. This yields $P(X=k \mid \mathbf{Y}=\mathbf{y})=\frac{\exp\big(\gamma_k-\sum_{j=1}^{J}\log(E_{jk})+\sum_{j=1}^{J}\beta_{y_j k}\big)}{\sum_{k'=1}^{K}\exp\big(\gamma_{k'}-\sum_{j=1}^{J}\log(E_{jk'})+\sum_{j=1}^{J}\beta_{y_j k'}\big)}=\frac{\exp\big(\gamma_k^{*}+\sum_{j=1}^{J}\beta_{y_j k}\big)}{\sum_{k'=1}^{K}\exp\big(\gamma_{k'}^{*}+\sum_{j=1}^{J}\beta_{y_j k'}\big)}$, where $\gamma_k^{*}=\gamma_k-\sum_{j=1}^{J}\log(E_{jk})$. Though not really necessary, the $\gamma_k^{*}$ parameters may be subjected to the same identifying (effect or dummy coding) constraints as the other parameters.

The above derivation shows that the posterior class membership probabilities can be written as a logistic equation with slopes equal to the LC model logistic regression slopes and with constants equal to the logistic class constants minus the sum of the logs of the normalizing constants. The more difficult part in the computation of these scoring equations is the computation of the constants $\gamma_k^{*}$. But once we have the scoring equations, we can easily compute the class membership probabilities for any response pattern, including response patterns which were not available in the original data used to estimate the LC model of interest.
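To make these computations concrete, the following R sketch builds the scoring equation for a small hypothetical three-class model with two dichotomous indicators and dummy coding (all parameter values below are made up for illustration; they are not estimates from any data set):

## Illustrative R sketch (hypothetical parameters): from LC model parameters
## to the logistic scoring equation, dummy coding (class 1 and category 1 = 0)
K <- 3; J <- 2; Rj <- 2
gamma <- c(0, 0.8, -0.5)                        # class intercepts gamma_k
alpha <- matrix(0, nrow = J, ncol = Rj)         # alpha_{y_j}, category 1 fixed to 0
alpha[, 2] <- c(0.3, -0.2)
beta  <- array(0, dim = c(J, Rj, K))            # beta_{y_j k}, class 1 and category 1 fixed to 0
beta[1, 2, ] <- c(0, -1.5, 2.0)
beta[2, 2, ] <- c(0,  1.2, -0.7)

# Normalizing constants E_jk and their logs (J x K matrix)
logE <- sapply(1:K, function(k)
  sapply(1:J, function(j) log(sum(exp(alpha[j, ] + beta[j, , k])))))

# Constants of the scoring equation: gamma_k^* = gamma_k - sum_j log(E_jk)
gamma_star <- gamma - colSums(logE)

# Posterior class membership probabilities for an arbitrary response pattern y
posterior <- function(y) {
  lin <- gamma_star + sapply(1:K, function(k)
    sum(sapply(1:J, function(j) beta[j, y[j], k])))
  exp(lin - max(lin)) / sum(exp(lin - max(lin)))   # numerically stable softmax
}
posterior(c(2, 1))   # P(X = k | Y1 = 2, Y2 = 1)

The same recipe applies with more indicators or response categories; only the dimensions of the parameter arrays change.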

It should be noted that the $\gamma_k^{*}$ terms are identical to the one-variable parameters for the latent classes in the log-linear formulation of the LC model proposed by Haberman (Citation1979). In this formulation, the joint distribution of $X$ and $\mathbf{Y}$, $P(X=k,\mathbf{Y}=\mathbf{y})$, is modelled as follows: $P(X=k,\mathbf{Y}=\mathbf{y})=\frac{\exp\big(\gamma_k^{*}+\sum_{j=1}^{J}\alpha_{y_j}+\sum_{j=1}^{J}\beta_{y_j k}\big)}{F}$.

The posterior class membership probability is obtained as $P(X=k \mid \mathbf{Y}=\mathbf{y})=P(X=k,\mathbf{Y}=\mathbf{y})/\sum_{k'=1}^{K}P(X=k',\mathbf{Y}=\mathbf{y})$. As can be seen, since $\sum_{j=1}^{J}\alpha_{y_j}$ and $F$ cancel, $P(X=k \mid \mathbf{Y}=\mathbf{y})$ has exactly the form we derived above. Thus, when using Haberman’s log-linear formulation, the constants of the scoring equations are also model parameters. However, an important disadvantage of this formulation is that it is computationally less efficient, since parameter estimation involves processing the cell entries in the joint cross-tabulation of $X$ and all $Y_j$ variables. Therefore, this log-linear approach can be used only when the number of response variables is small.

1.2. Ordinal Responses and Counts

In the LatentGOLD program for LC analysis (Vermunt & Magidson, Citation2016; Citation2021), ordinal response variables can be modeled using an adjacent-category logit model, that is, using a canonical link function (Agresti, Citation2002). More specifically, these are multinomial logit models in which the class-indicator association parameters are restricted as follows: $\beta_{y_j k}=\beta_{jk}\,y_j$; that is, to be nominal-by-linear (Goodman, Citation1979; Heinen, Citation1996). This implies that for ordinal variables, $P(Y_j=y_j \mid X=k)=\frac{\exp(\alpha_{y_j}+\beta_{jk}\,y_j)}{E_{jk}}$.

The same restrictions imposed on the $\beta_{y_j k}$ parameters also apply to the scoring equations; that is, for ordinal variables, we replace the $\beta_{y_j k}$ terms in the scoring equations by $\beta_{jk}\,y_j$. This shows that in the ordinal case, the class-membership logits are linear functions of the item responses. It should be noted that while we assumed that the category scores range from 1 to $R_j$, the adjacent-category logit model allows for any type of scoring. In its more general form, $\beta_{y_j k}=\beta_{jk}\,\nu_{y_j}$, where $\nu_{y_j}$ is the score for category $y_j$.
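Written out relative to a reference class (say class 1), the posterior log-odds are thus linear in the item scores:

$\log\frac{P(X=k \mid \mathbf{Y}=\mathbf{y})}{P(X=1 \mid \mathbf{Y}=\mathbf{y})}=(\gamma_k^{*}-\gamma_1^{*})+\sum_{j=1}^{J}(\beta_{jk}-\beta_{j1})\,y_j.$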

When modeling ordinal variables using other (non-canonical) link functions, such as cumulative logit or cumulative probit link functions, exact expressions for the scoring equations no longer exist. As will be shown below, a possible way out is to estimate the scoring equations treating the response variables as either numeric or nominal predictors of class membership.

Also for Poisson and binomial count variables, the scoring equations contain the term $\beta_{jk}\,y_j$. This can be seen from the fact that the class-specific density of a count variable $Y_j$ takes on the following form: $P(Y_j=y_j \mid X=k)\propto\frac{\exp(\alpha_j y_j+\beta_{jk} y_j)}{E_{jk}}$, where it should be noted that $\alpha_j y_j$ cancels from the scoring equation. The expression for $\log E_{jk}$ changes compared to the nominal and ordinal case. For Poisson counts, $\log E_{jk}=\exp(\alpha_j+\beta_{jk})\,e_j$, where $e_j$ is the exposure; and for binomial counts, $\log E_{jk}=e_j\log\big(1+\exp(\alpha_j+\beta_{jk})\big)$, where $e_j$ represents the number of trials. This shows that the scoring equations should also include terms for the exposure (or number of trials) when this number varies across individuals. When the $e_j$ are fixed, as for nominal and ordinal variables, the $\log E_{jk}$ terms can be included in the constants $\gamma_k^{*}$.

1.3. Local Dependencies

Thus far, we assumed that responses are independent within classes. Now we will look at the scoring equations for LC models with local dependencies (Hagenaars, Citation1988; Magidson & Vermunt, Citation2004; Oberski et al., Citation2013). In the most general case, including a local dependency between (nominal) response variables $Y_j$ and $Y_m$ implies that $P(Y_j=y_j, Y_m=y_m \mid X=k)=\frac{\exp(\alpha_{y_j}+\alpha_{y_m}+\beta_{y_j k}+\beta_{y_m k}+\delta_{y_j y_m}+\lambda_{y_j y_m k})}{E_{jmk}}$, where $E_{jmk}=\sum_{y_j=1}^{R_j}\sum_{y_m=1}^{R_m}\exp(\alpha_{y_j}+\alpha_{y_m}+\beta_{y_j k}+\beta_{y_m k}+\delta_{y_j y_m}+\lambda_{y_j y_m k})$.

This is a model with an association between $Y_j$ and $Y_m$, $\delta_{y_j y_m}$, and an interaction with the latent classes, $\lambda_{y_j y_m k}$. In other words, it represents a model in which the strength of the local dependency is allowed to vary across classes.

As can be seen, one difference with the local independence model is that the normalizing constants entering in the $\gamma_k^{*}$ coefficients of the scoring equations should be computed per set of locally dependent variables. The $\delta_{y_j y_m}$ term cancels from the scoring equations because it does not depend on the classes. In contrast, the term $\lambda_{y_j y_m k}$ becomes part of the scoring equations, which in the case of class-specific local dependencies thus contain not only main effects but also interaction terms. Note that when local dependencies are not class-specific, that is, when $\lambda_{y_j y_m k}=0$, the only remaining difference between local independence and local dependence models concerns the computation of the constants $\gamma_k^{*}$.

The scoring equations in local-dependence LC models for ordinal variables are very similar to those for nominal variables. When the ordinal variables are modelled using an adjacent-category logit specification, $\delta_{y_j y_m}=\delta_{jm}\,y_j y_m$ and $\lambda_{y_j y_m k}=\lambda_{jmk}\,y_j y_m$. The scoring equations will contain the term $\lambda_{jmk}\,y_j y_m$ when the interaction parameters $\lambda_{jmk}$ are not fixed to 0.

1.4. Missing Values

When some indicators have missing values, the LC model for the observed values $\mathbf{Y}_{\mathrm{obs}}$ can be defined as follows: $P(\mathbf{Y}_{\mathrm{obs}}=\mathbf{y}_{\mathrm{obs}})=\sum_{k=1}^{K}P(X=k)\prod_{j=1}^{J}P(Y_j=y_j \mid X=k)^{r_j}$, where $r_j=1$ if the response variable concerned is observed and $r_j=0$ when it has a missing value (Vermunt et al., Citation2008; Vermunt & Magidson, Citation2016). Note that this formulation implies that the product is taken over the observed responses only. Therefore, similar to subjects with complete data, the computation of the posteriors for subjects with missing values involves using only their observed responses. This means that the sum $\sum_{j=1}^{J}\beta_{y_j k}$ should be taken over the observed variables only or, equivalently, that $\beta_{y_j k}$ should be set to 0 for the missing value category. However, the sum $\sum_{j=1}^{J}\log(E_{jk})$ which is subtracted from the constants should also be taken over the observed variables only, implying that each pattern of missing data has its own set of constants $\gamma_k^{*}$. A way to account for this is by using the same $\gamma_k^{*}$ for all observations but adding a term $\log(E_{jk})$ to the scoring equation when variable $j$ has a missing value. In other words, in order to deal with missing data, the scoring equation should be expanded to include the term $\sum_{j=1}^{J}\log(E_{jk})(1-r_j)$. Note that this approach can be used with any missing data pattern occurring among the new subjects for which one wishes to obtain the posteriors.
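Putting these elements together, the scoring equation for an arbitrary missing data pattern can be written compactly as

$P(X=k \mid \mathbf{Y}_{\mathrm{obs}}=\mathbf{y}_{\mathrm{obs}})=\frac{\exp\big(\gamma_k^{*}+\sum_{j=1}^{J}r_j\,\beta_{y_j k}+\sum_{j=1}^{J}(1-r_j)\log(E_{jk})\big)}{\sum_{k'=1}^{K}\exp\big(\gamma_{k'}^{*}+\sum_{j=1}^{J}r_j\,\beta_{y_j k'}+\sum_{j=1}^{J}(1-r_j)\log(E_{jk'})\big)},$

which reduces to the complete-data scoring equation when all $r_j=1$.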

A special type of missing data occurs when the LC model is estimated using $J$ variables, but only the first $J_1$ of these are to be used for classification purposes (where $J=J_1+J_2$); for example, this situation may occur if one wishes to ignore the last $J_2$ variables when calculating the classification probabilities because this information will not be available when performing out-of-sample predictions. In this case, the posteriors are obtained as follows: $P(X=k \mid \mathbf{Y}_1=\mathbf{y}_1)=\frac{P(X=k)\prod_{j=1}^{J_1}P(Y_j=y_j \mid X=k)}{\sum_{k'=1}^{K}P(X=k')\prod_{j=1}^{J_1}P(Y_j=y_j \mid X=k')}$.

As can be seen, only the slope parameters and the normalizing constants of the first $J_1$ response variables will enter into the scoring equations.

1.5. An Example with Five Dichotomous Indicators

Table 1 provides an example illustrating the computation of the scoring equations for an application with five dichotomous response variables. It concerns the model with three latent classes estimated for the LatentGOLD “political.sav” demo data set. The upper part of Table 1 gives the estimates of the model parameters $\gamma_k$, $\alpha_{y_j}$, and $\beta_{y_j k}$ using dummy coding, with the parameters for the first class and the first item category fixed to 0. The lower part gives the values of $\gamma_k^{*}$ and $\log(E_{jk})$, where for consistency we also use dummy coding for the $\log(E_{jk})$ terms. To obtain the $\log(E_{1k})$ values for the first item, we first compute $\exp(\alpha_{y_1}+\beta_{y_1 k})$, which for all three classes equals 1.0000 for $y_1=1$, and 2.6533, 0.4452, and 1.4312 for $y_1=2$. Next, we sum the obtained values across the two item categories and take the log, yielding 1.2956, 0.3682, and 0.8884 for the three classes. Because of the dummy coding, we subtract the value of the first class, yielding the reported $\log(E_{1k})$ values 0.0000, −0.9275, and −0.4073. The $\log(E_{jk})$ values of all items are subtracted from the $\gamma_k$ values to get the intercepts $\gamma_k^{*}$ of the scoring equations, and the slopes $\beta_{y_j k}$ can be used without any modification in the scoring equations. In the case of a missing value, the slope parameters for the item concerned equal $\log(E_{jk})$. Appendix A shows the R code generated for this application by LatentGOLD, which can be used for classifying new observations.

Table 1. Latent class model parameters and scoring equation parameters for the political.sav data example.
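The $\log(E_{1k})$ computation described above can be checked with a few lines of R, using the quoted $\exp(\alpha_{y_1}+\beta_{y_1 k})$ values (because these inputs are rounded to four decimals, the last digit may differ slightly from the table):

## Check of the log(E_1k) values for the first item (political.sav example)
expab <- rbind(c(1.0000, 1.0000, 1.0000),   # exp(alpha_{y1} + beta_{y1 k}) for y1 = 1
               c(2.6533, 0.4452, 1.4312))   # exp(alpha_{y1} + beta_{y1 k}) for y1 = 2
logE1 <- log(colSums(expab))                # approx. 1.2956 0.3682 0.8884
logE1 - logE1[1]                            # dummy coded: approx. 0 -0.927 -0.407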

2. Other Types of Latent Class and Mixture Models

2.1. Continuous Responses

Now, let us turn to LC or mixture models for continuous response variables (McLachlan & Peel, Citation2000), also referred to as latent profile models. In a local independence model with normal within-class distributions with possibly unequal variances, the response distributions have the following form: $P(Y_j=y_j \mid X=k)\propto\exp\Big(-\tfrac{1}{2}\log\sigma_{jk}^{2}-\tfrac{1}{2}\frac{\mu_{jk}^{2}}{\sigma_{jk}^{2}}+\frac{\mu_{jk}}{\sigma_{jk}^{2}}y_j-\tfrac{1}{2}\frac{y_j^{2}}{\sigma_{jk}^{2}}\Big)$, where $\mu_{jk}$ and $\sigma_{jk}^{2}$ denote the mean and variance of $Y_j$ in latent class $k$. It can be seen that in the construction of the logistic scoring equations, the terms $-\tfrac{1}{2}\log\sigma_{jk}^{2}$ and $-\tfrac{1}{2}\mu_{jk}^{2}/\sigma_{jk}^{2}$, which do not contain the response, become part of the constants. Moreover, the equations will contain the linear and quadratic terms $\frac{\mu_{jk}}{\sigma_{jk}^{2}}y_j$ and $-\tfrac{1}{2\sigma_{jk}^{2}}y_j^{2}$.

When variances are assumed to be equal across classes, the first and last term of the above univariate normal distribution become $-\tfrac{1}{2}\log\sigma_{j}^{2}$ and $-\tfrac{1}{2}y_j^{2}/\sigma_{j}^{2}$, respectively, implying that these cancel from the scoring equations because they do not depend on the class. This yields a set of scoring equations similar to those obtained in linear discriminant analysis (Hastie et al., Citation2008).

In the more general case of multivariate normal responses with unrestricted covariance matrices $\Sigma_k$, the LC model becomes $P(\mathbf{Y}=\mathbf{y})=\sum_{k=1}^{K}P(X=k)P(\mathbf{Y}=\mathbf{y} \mid X=k)$, with $P(\mathbf{Y}=\mathbf{y} \mid X=k)\propto\exp\Big(-\tfrac{1}{2}\log|\Sigma_k|-\tfrac{1}{2}\boldsymbol{\mu}_k'\Sigma_k^{-1}\boldsymbol{\mu}_k+\boldsymbol{\mu}_k'\Sigma_k^{-1}\mathbf{y}-\tfrac{1}{2}\mathbf{y}'\Sigma_k^{-1}\mathbf{y}\Big)$.

As can be seen, the scoring equations now contain not only linear and quadratic terms, but also interaction terms. More specifically, denoting an entry of $\Sigma_k^{-1}$ by $a_{jmk}$, the weights for $y_j$, $y_j^{2}$, and $y_j y_m$ are $\sum_{m=1}^{J}\mu_{mk}a_{jmk}$, $-\tfrac{1}{2}a_{jjk}$, and $-a_{jmk}$, respectively. The first two terms of the multivariate normal density, which do not depend on the responses, become part of the constants. When variances and covariances are equal across classes, we again have equations with main effects only, as in linear discriminant analysis.
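As an illustration, the following R sketch computes these weights from hypothetical class-specific means and covariance matrices for two indicators and two classes (all numbers are made up for illustration):

## Illustrative R sketch (hypothetical means and covariances): weights of the
## scoring equation for a latent profile model with class-specific covariances
mu    <- list(c(1.0, 2.0), c(3.0, 1.0))                   # mu_k per class
Sigma <- list(matrix(c(1.0, 0.3, 0.3, 1.5), 2, 2),
              matrix(c(2.0, -0.4, -0.4, 1.0), 2, 2))      # Sigma_k per class

weights_class <- function(k) {
  A <- solve(Sigma[[k]])                                  # Sigma_k^{-1}, entries a_{jmk}
  list(constant    = as.numeric(-0.5 * log(det(Sigma[[k]])) -
                                 0.5 * t(mu[[k]]) %*% A %*% mu[[k]]),  # absorbed into gamma_k^*
       linear      = as.vector(A %*% mu[[k]]),            # weights of y_j
       quadratic   = -0.5 * diag(A),                      # weights of y_j^2
       interaction = -A[1, 2])                            # weight of y_1 * y_2
}
weights_class(1); weights_class(2)

Note that the class intercept $\gamma_k$ still has to be added to the constant part when forming $\gamma_k^{*}$.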

Various kinds of restricted mixtures of multivariate normal distributions have been proposed in which constraints are imposed on the class-specific means and/or covariances. Examples include mixture factor models (McLachlan & Peel, Citation2000; Yung, Citation1997), mixture structural equation models (Dolan & Van der Maas, Citation1997), mixture models with constrained eigenvalue decompositions of Σk (Banfield & Raftery, Citation1993), and mixture growth models (Muthén, Citation2004). For these models, the same scoring equations can be used as when means and covariances are unrestricted.

2.2. An Example with Three Continuous Indicators

Table 2 reports the model parameters and the scoring equations for a LC model with three continuous indicators (Glucose, Insulin, and SSPG) from the LatentGOLD “diabetes.dat” demo data set. It is a three-class model with a free residual covariance between the first two class indicators and with class-specific residual (co)variances. The scoring equation for this model contains not only linear terms, but also quadratic terms as well as the interaction terms between $y_1$ and $y_2$.

Table 2. Latent class model parameters and scoring equation parameters for the diabetes.dat data example.

2.3. Covariates

When covariates are included in the model, the latent class probabilities are typically modeled as a logistic function of the covariates (Bandeen-Roche et al., Citation1997; Dayton & Macready, Citation1988; Yamaguchi, Citation2000). That is, $P(X=k \mid \mathbf{z})=\frac{\exp\big(\gamma_{0k}+\sum_{p=1}^{P}\gamma_{pk}z_p\big)}{\sum_{k'=1}^{K}\exp\big(\gamma_{0k'}+\sum_{p=1}^{P}\gamma_{pk'}z_p\big)}$.

Here, $\mathbf{z}$ denotes the vector of covariates, and $\gamma_{0k}$ and $\gamma_{pk}$ represent the constants and the regression parameters for covariate $z_p$.

Since the denominator does not depend on the class, it cancels from the formula for the posterior class membership probability and thus also from the scoring equations. Assuming the response variables are nominal, the posterior probability of class membership given the responses and covariates becomes: $P(X=k \mid \mathbf{Y}=\mathbf{y},\mathbf{z})=\frac{\exp\big(\gamma_{0k}^{*}+\sum_{j=1}^{J}\beta_{y_j k}+\sum_{p=1}^{P}\gamma_{pk}z_p\big)}{\sum_{k'=1}^{K}\exp\big(\gamma_{0k'}^{*}+\sum_{j=1}^{J}\beta_{y_j k'}+\sum_{p=1}^{P}\gamma_{pk'}z_p\big)}$.

That is, the covariate terms can simply be added to the scoring equations.

Covariates may also have direct effects on the indicators. Let us assume we have a single covariate $z$ which has a direct effect on the categorical response variable $Y_j$; that is, $P(Y_j=y_j \mid X=k,z)=\frac{\exp(\alpha_{y_j}+\beta_{y_j k}+\delta_{y_j}z)}{E_{jk|z}}$, where $E_{jk|z}=\sum_{y_j=1}^{R_j}\exp(\alpha_{y_j}+\beta_{y_j k}+\delta_{y_j}z)$.

As can be seen, in this model, the normalizing constants depend on the covariate value, meaning that we no longer have a single $\log E_{jk}$ which can be subtracted from $\gamma_k$. Because the $\log E_{jk|z}$ are not linear functions of the covariate values, they cannot be absorbed into the linear term for the covariate concerned. In other words, the exact linear logistic representation of the posterior probabilities breaks down in this situation, though, as discussed below and in Appendix B, it may still be used as an approximation. An exception is the situation in which the covariate is a nominal or dichotomous variable, in which case exact scoring equations can still be obtained by subtracting $\log E_{jk|z}$ from the $\gamma_{pk}$ terms of the covariate concerned.

Note that when the direct effect of a covariate is allowed to be class specific, or equivalently, when an interaction term is included, the indicator-covariate interaction should also be added to the scoring equation. Again, the scoring equations will be exact only when the covariate concerned is nominal or dichotomous. For example, this is the specification used in multiple-group LC models in which response probabilities may be allowed to differ across subgroups for one or more indicators (Clogg & Goodman, Citation1984; Eid et al., Citation2003; Kankaras et al., Citation2010).

2.4. Multiple Latent Variables

Suppose the LC model contains two latent variables $X_1$ and $X_2$ instead of one, so that we have a LC Factor or Discrete Factor model. Such a model has the following form (Goodman, Citation1974b; Hagenaars, Citation1990; Magidson & Vermunt, Citation2001; Vermunt & Magidson, Citation2005): $P(\mathbf{Y}=\mathbf{y})=\sum_{k_1=1}^{K_1}\sum_{k_2=1}^{K_2}P(X_1=k_1,X_2=k_2)\prod_{j=1}^{J}P(Y_j=y_j \mid X_1=k_1,X_2=k_2)$.

As in the single latent variable case, $P(X_1=k_1,X_2=k_2)$ and $P(Y_j=y_j \mid X_1=k_1,X_2=k_2)$ can be modelled using logistic regression models (Magidson & Vermunt, Citation2001). For example, $P(X_1=k_1,X_2=k_2)=\frac{\exp\big(\gamma_{k_1}^{1}+\gamma_{k_2}^{2}+\gamma_{k_1 k_2}^{12}\big)}{D}$ and $P(Y_j=y_j \mid X_1=k_1,X_2=k_2)=\frac{\exp\big(\alpha_{y_j}+\beta_{y_j k_1}^{1}+\beta_{y_j k_2}^{2}\big)}{E_{j k_1 k_2}}$.

Also in this case, the posterior probabilities can be written as functions of the LC model parameters; that is, $P(X_1=k_1,X_2=k_2 \mid \mathbf{Y}=\mathbf{y})=\frac{\exp\big(\gamma_{k_1 k_2}^{*}+\sum_{j=1}^{J}\beta_{y_j k_1}^{1}+\sum_{j=1}^{J}\beta_{y_j k_2}^{2}\big)}{\sum_{k_1'=1}^{K_1}\sum_{k_2'=1}^{K_2}\exp\big(\gamma_{k_1' k_2'}^{*}+\sum_{j=1}^{J}\beta_{y_j k_1'}^{1}+\sum_{j=1}^{J}\beta_{y_j k_2'}^{2}\big)}$, where the $\gamma_{k_1 k_2}^{*}$ contain the $\gamma$ terms and the normalizing constants $E_{j k_1 k_2}$.

Note that the above equation for two latent variables can easily be generalized to an arbitrary number of $S$ latent variables. The logistic scoring equation then becomes: $P(X_1=k_1,\ldots,X_S=k_S \mid \mathbf{Y}=\mathbf{y})=\frac{\exp\big(\gamma_{k_1\cdots k_S}^{*}+\sum_{s=1}^{S}\sum_{j=1}^{J}\beta_{y_j k_s}^{s}\big)}{\sum_{k_1'=1}^{K_1}\cdots\sum_{k_S'=1}^{K_S}\exp\big(\gamma_{k_1'\cdots k_S'}^{*}+\sum_{s=1}^{S}\sum_{j=1}^{J}\beta_{y_j k_s'}^{s}\big)}$.

It should be noted that the marginal posterior probabilities $P(X_s=k_s \mid \mathbf{Y}=\mathbf{y})$, which are obtained by collapsing over the other latent variables, cannot be written as logistic functions. However, logistic approximations of the marginal posteriors may be precise enough in most applications. Below, we discuss how such approximations can be obtained.

3. Estimating the Scoring Equations Using Logistic Regression Analysis

Rather than computing the scoring equations from the parameters of the LC model, it is also possible to obtain these equations posthoc using a standard routine for multinomial logistic regression analysis. This involves the following three steps:

  1. After selection of the final LC model, save the posterior class membership probabilities to an output file. This is a feature available in all software packages for LC analysis.

  2. Create an expanded data set with K records per subject, which contains a column with the class number taking on values from 1 to K, a column with the posterior probability for the person and class concerned, and columns for the response variables and covariates used in the LC model, the latter columns containing the same values repeated in each of the K records for each subject.

  3. Estimate a logistic regression model in which the posteriors are used as weights. The class number is the dependent variable, and the responses and covariates are the predictors.

Depending on the situation, in the third step the responses and covariates are modeled as either nominal or numeric predictors, quadratic and/or interaction effects are added, and missing value dummies are included. For count variables, one should include the exposures (or total number of trials) as additional numeric predictors when these differ across individuals. Steps 2 and 3 are automated in the LatentGOLD program (Vermunt & Magidson, Citation2016; Citation2021), and are called Step3-Scoring.
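A minimal sketch of Steps 2 and 3 in R, assuming a data frame post with one row per subject, indicator columns y1 to y3, and columns post1 to post3 holding the saved posteriors (the object and column names are hypothetical, and nnet::multinom is just one of several routines that accept case weights):

## Posthoc estimation of the scoring equations via weighted multinomial regression
library(nnet)   # multinom() accepts case weights

K <- 3
# Step 2: expand to K records per subject, with the posterior as a weight
long <- do.call(rbind, lapply(1:K, function(k)
  data.frame(class = k, w = post[[paste0("post", k)]], post[, c("y1", "y2", "y3")])))
long$class <- factor(long$class)

# Step 3: weighted multinomial logistic regression of class on the indicators
fit <- multinom(class ~ y1 + y2 + y3, data = long, weights = w, trace = FALSE)
coef(fit)   # intercepts and slopes of the (exact or approximate) scoring equations

With the first class as the reference level, the rows of coef(fit) correspond to classes 2 to $K$, in line with the dummy coding used above.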

This approach can be used not only for the posthoc computation of the exact scoring equations, but also for obtaining approximate scoring equations. This is useful when an exact form does not exist, such as when direct effects of numeric covariates on indicators were included in the LC model or when non-canonical link functions were used for ordinal variables, as well as in the situation where one prefers a set of simplified equations, say without quadratic or interaction terms, that are almost as good as the exact ones. An example of the latter can be seen in Table 3, which reports the approximate scoring equations for the diabetes.dat example presented above, but leaving out the quadratic terms of $y_1$ and $y_2$ and their interaction terms. The approximate equations predict the class memberships almost as well as the exact equations; that is, the entropy $R^2$ equals 0.817, while its original value equals 0.833.

Table 3. Approximate scoring equation parameters for the diabetes.dat data example.

When ordinal variables are modeled using non-canonical link functions, such as cumulative logit or probit models, we have two options. Option 1 is to compute the exact scoring equations by treating the response variables as nominal predictors in the posthoc logistic regression analysis; that is, by making use of the fact that the estimated $P(Y_j=y_j \mid X=k)$ based on an ordinal model can be reproduced perfectly by an unrestricted multinomial model. Option 2 is to estimate the scoring equations using the response variables as numeric predictors, which in fact implies that the estimated $P(Y_j=y_j \mid X=k)$ from the original LC model are approximated by an adjacent-category logit model.

As shown above, in LC models with multiple latent variables, an exact set of logistic scoring equations exists for the joint class membership probabilities, but not for the marginal class membership probabilities. The posthoc estimation method can also be used to obtain approximate scoring equations based on the marginal posteriors. Applying these equations will be simpler than first computing the joint and subsequently collapsing over the other latent variables, especially with models containing more than two latent variables. The quality of the resulting approximation can be assessed by a goodness-of-fit measure.

4. Discussion

As in continuous latent variables models, in LC models it is important to have a simple scoring rule for predicting a person’s value on the latent variable. In this paper, we showed that for LC models this scoring rule has the form of a linear logistic equation, with weights which are simple functions of the original LC model parameters. We derived the exact scoring equations for nominal, ordinal, count, and continuous response variables, for local independence and local dependence models, for models with covariates, for models with multiple latent variables, and for models with missing values on some of the indicators. Moreover, we discussed several situations in which exact scoring equations may not exist, such as LC models with direct effects of covariates on the indicators and LC models in which the conditional response distributions are restricted using regression models based on non-canonical link functions.

We also explained how to compute exact or approximate scoring equations with the saved posterior probabilities from any LC analysis program. This can be achieved with standard routines for logistic regression analysis. In practice, this may be much easier than computing the scoring equations from the LC model parameters, where the constants from the scoring equations may be somewhat more tedious to obtain.

While not discussed explicitly, the computation of the scoring equations proceeds in exactly the same manner in LC models for mixed responses; that is, in LC models for combinations of nominal, ordinal, count, and continuous indicators (Hennig & Liao, Citation2013; Hunt & Jorgensen, Citation1999; Vermunt & Magidson, Citation2002). The only thing that needs to be done in the computation of the scoring equations is to collect the terms for the different indicators, irrespective of their scale types. When using the posthoc method based on a logistic regression analysis, things are even easier. Nominal indicators are used as nominal predictors, and ordinal, count and continuous indicators as numeric predictors. Depending on the situation, quadratic and/or interaction terms may also need to be included.

The scoring equations discussed in this article can be used to obtain point estimates of the posterior probabilities, not only for subjects in the original sample, but also for new subjects. However, an issue not dealt with in this paper is the uncertainty about these estimates. Since the “regression” weights of the scoring equations are sample estimates, it would be better to take this sampling variability into account when deriving a prediction. Note that the weights are functions of the original model parameters, for which we have the estimated asymptotic variance-covariance matrix. A possible approach to obtain the covariance matrix of the weights involves sampling, say, 100 parameter sets from their estimated multivariate normal distribution and computing the corresponding 100 sets of weights. Other options to explore are the delta method and bootstrapping (Dias & Vermunt, Citation2008). Our future research will focus on this important topic.

Disclosure Statement

Jeroen K. Vermunt and Jay Magidson are co-developers of the LatentGOLD software.

References

  • Agresti, A. (2002). Categorical data analysis (2nd ed.). Wiley.
  • Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., & Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375–1386. https://doi.org/10.1080/01621459.1997.10473658
  • Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821. https://doi.org/10.2307/2532201
  • Bartholomew, D. J., Knott, M., & Moustaki, I. (2011). Latent variable models and factor analysis. Arnold.
  • Clogg, C. C., & Goodman, L. A. (1984). Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79, 762–771. https://doi.org/10.1080/01621459.1984.10477093
  • Collins, L. M., & Lanza, S. T. (2010). Latent class and latent transition analysis: With applications in the social, behavioral, and health sciences. Wiley.
  • Dayton, C. M., & Macready, G. B. (1988). Concomitant-variable latent-class models. Journal of the American Statistical Association, 83, 173–178. https://doi.org/10.1080/01621459.1988.10478584
  • Dias, J. G., & Vermunt, J. K. (2008). A bootstrap-based aggregate classifier for model-based clustering. Computational Statistics, 23, 643–659. https://doi.org/10.1007/s00180-007-0103-7
  • Dolan, C. V., & Van der Maas, H. L. J. (1997). Fitting multivariate normal finite mixtures subject to structural equation modeling. Psychometrika, 63, 227–253. https://doi.org/10.1007/BF02294853
  • Eid, M., Langeheine, R., & Diener, E. (2003). Comparing typological structures across cultures by multigroup latent class analysis. A primer. Journal of Cross-Cultural Psychology, 34, 195–210. https://doi.org/10.1177/0022022102250427
  • Formann, A. K. (1992). Linear logistic latent class analysis for polytomous data. Journal of the American Statistical Association, 87, 476–486. https://doi.org/10.1080/01621459.1992.10475229
  • Goodman, L. A. (1974a). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231. https://doi.org/10.2307/2334349
  • Goodman, L. A. (1974b). The analysis of systems of qualitative variables when some of the variables are unobservable: Part I - A modified latent structure approach. American Journal of Sociology, 79, 1179–1259. https://doi.org/10.1086/225676
  • Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74, 537–552. https://doi.org/10.1080/01621459.1979.10481650
  • Haberman, S. J. (1979). Analysis of qualitative data, Vol 2, New developments. Academic Press.
  • Hagenaars, J. A. (1990). Categorical longitudinal data - loglinear analysis of panel, trend and cohort data. Sage.
  • Hagenaars, J. A. P. (1988). Latent structure models with direct effects between indicators local dependence models. Sociological Methods & Research, 16, 379–405. https://doi.org/10.1177/0049124188016003002
  • Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning. Springer.
  • Heinen, T. (1996). Latent class and discrete latent trait models: Similarities and differences. Sage.
  • Hennig, C., & Liao, T. F. (2013). How to find an appropriate clustering for mixed type variables with application to socioeconomic stratification (with discussion). Journal of the Royal Statistical Society Series C: Applied Statistics, 62, 309–369. https://doi.org/10.1111/j.1467-9876.2012.01066.x
  • Hunt, L., & Jorgensen, M. (1999). Mixture model clustering using the MULTIMIX program. Australian and New Zealand Journal of Statistics, 41, 153–172.
  • Kankaras, M., Moors, G., & Vermunt, J. K. (2010). Testing for measurement invariance with latent class analysis. In E. Davidov, P. Schmidt, and J. Billiet (Eds.), Cross-cultural analysis: Methods and applications (pp. 359–384). Routledge.
  • Magidson, J., & Vermunt, J. K. (2001). Latent class factor and cluster models, bi-plots and related graphical displays. Sociological Methodology, 31, 223–264. https://doi.org/10.1111/0081-1750.00096
  • Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 175–198). Sage Publications.
  • McCutcheon, A. L. (1987). Latent class analysis. Sage.
  • McLachlan, G. J., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.
  • Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 345–368). Sage.
  • Oberski, D. L., van Kollenburg, G. H., & Vermunt, J. K. (2013). A Monte Carlo evaluation of three methods to detect local dependence in binary data latent class models. Advances in Data Analysis and Classification, 7, 267–279. https://doi.org/10.1007/s11634-013-0146-2
  • Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. In J. Hagenaars & A. McCutcheon (Eds.), Applied latent class analysis (pp. 89–106). Cambridge University Press.
  • Vermunt, J. K., & Magidson, J. (2005). Factor analysis with categorical indicators: A comparison between traditional and latent class approaches. In A. Van der Ark, M.A. Croon & K. Sijtsma (Eds.), New developments in categorical data analysis for the social and behavioral sciences (pp. 41–62). Erlbaum.
  • Vermunt, J. K., & Magidson, J. (2016). Technical Guide for LatentGOLD 5.1: Basic, Advanced, and Syntax. Statistical Innovations Inc.
  • Vermunt, J. K., & Magidson, J. (2021). Upgrade Manual for LatentGOLD Basic, Advanced/Syntax, and Choice Version 6.0. Statistical Innovations Inc.
  • Vermunt, J. K., Van Ginkel, J. R., Van der Ark, L. A., & Sijtsma, K. (2008). Multiple imputation of categorical data using latent class analysis. Sociological Methodology, 38, 369–397. https://doi.org/10.1111/j.1467-9531.2008.00202.x
  • Yamaguchi, K. (2000). Multinomial logit latent-class regression models: An analysis of the predictors of gender-role attitudes among Japanese women. American Journal of Sociology, 105, 1702–1740. https://doi.org/10.1086/210470
  • Yung, Y. F. (1997). Finite mixtures in confirmatory factor-analysis models. Psychometrika, 62, 297–330. https://doi.org/10.1007/BF02294554

Appendix A:

R Code Generated by LatentGOLD for the First Example Application

With the LatentGOLD output options “ScoringEquations” and “WriteRsyntax=<filename>”, one can request an R syntax file that can be used to classify new observations. The variable names in the “political.sav” data file are sys_resp, ideo_lev, rep_pot, prot_app, and conv_par. The lg_scoring function consists of three parts:

  1. First, it creates the variables to be used as “predictors” in the scoring equations. For ordinal and continuous indicators, these are copies of the variables in the data set, and for categorical variables, these are dummies for the response categories. In addition, dummies are created for missing values.

  2. Then, it computes the class-specific linear terms using the variables created in part 1 and the scoring equations’ parameters.

  3. Subsequently, the linear terms are exponentiated and transformed to posterior probabilities.

The function returns the posterior probabilities and the modal class. Below, you find the R code, which ends with example code calling the lg_scoring function to add classification information to a data set.

## Scoring function to be called per record

lg_scoring<-function(dat) {

# Part 1: Create variables to be used as predictors in scoring

# equations

if(is.na(dat$sys_resp)) {

 sys_resp_lg_1<-0;sys_resp_lg_2<-0;sys_resp_lg_m<-1

}

else {

 if(dat$sys_resp==1) {

  sys_resp_lg_1<-1;sys_resp_lg_2<-0;sys_resp_lg_m<-0

 }

 else if(dat$sys_resp==2) {

  sys_resp_lg_1<-0;sys_resp_lg_2<-1;sys_resp_lg_m<-0

 }

 else {

  sys_resp_lg_1<-0;sys_resp_lg_2<-0;sys_resp_lg_m<-1

 }

}

# The same is done for the other 4 indicators 

# Part 2: Compute the class-specific linear terms 

Cluster_lg_1<-(0)+

  (0)*sys_resp_lg_1+(0)*sys_resp_lg_2+

  (0)*ideo_lev_lg_1+(0)*ideo_lev_lg_2+

  (0)*rep_pot_lg_1+(0)*rep_pot_lg_2+

  (0)*prot_app_lg_1+(0)*prot_app_lg_2+

  (0)*conv_par_lg_1+(0)*conv_par_lg_2+

  (0)*sys_resp_lg_m+(0)*ideo_lev_lg_m+

  (0)*rep_pot_lg_m+(0)*prot_app_lg_m+

  (0)*conv_par_lg_m

Cluster_lg_2<-(3.4185551)+

  (0)*sys_resp_lg_1+(-1.7853117)*sys_resp_lg_2+

  (0)*ideo_lev_lg_1+(-3.0502076)*ideo_lev_lg_2+

  (0)*rep_pot_lg_1+(0.56595846)*rep_pot_lg_2+

  (0)*prot_app_lg_1+ (-0.74630356)*prot_app_lg_2+

  (0)*conv_par_lg_1+(-3.039846)*conv_par_lg_2+

  (-0.9274766)*sys_resp_lg_m+(-0.49927555)*ideo_lev_lg_m+

  (0.11060958)*rep_pot_lg_m+ (-0.34180273)*prot_app_lg_m+

  (-1.8328984)*conv_par_lg_m

Cluster_lg_3<-(-3.6424675)+

  (0)*sys_resp_lg_1+ (-0.61732246)*sys_resp_lg_2+

  (0)*ideo_lev_lg_1+ (-0.23276128)*ideo_lev_lg_2+

  (0)*rep_pot_lg_1+ (3.6818864)*rep_pot_lg_2+

  (0)*prot_app_lg_1+ (3.0608866)*prot_app_lg_2+

  (0)*conv_par_lg_1+ (-1.0034281)*conv_par_lg_2+

  (-0.40726727)*sys_resp_lg_m+(-0.089565904)*ideo_lev_lg_m+

  (1.9387485)*rep_pot_lg_m+(2.5015355)*prot_app_lg_m+

  (-0.81826931)*conv_par_lg_m

# Part 3: Compute odds from logits, as well as modal class and

# probabilities from odds

max_lg<-Cluster_lg_1

if(Cluster_lg_2 > max_lg) {

 max_lg<-Cluster_lg_2

}

if(Cluster_lg_3 > max_lg) {

 max_lg<-Cluster_lg_3

}

Cluster_lg_1<-exp(Cluster_lg_1-max_lg)

Cluster_lg_2<-exp(Cluster_lg_2-max_lg)

Cluster_lg_3<-exp(Cluster_lg_3-max_lg)

max_lg<-Cluster_lg_1

Cluster_lg_modal<-1

if(Cluster_lg_2 > max_lg) {

 max_lg<-Cluster_lg_2; Cluster_lg_modal<-2

}

if(Cluster_lg_3 > max_lg) {

 max_lg<-Cluster_lg_3; Cluster_lg_modal<-3

}

sum_lg<-Cluster_lg_1 + Cluster_lg_2 + Cluster_lg_3

Cluster_lg_1<-Cluster_lg_1/sum_lg

Cluster_lg_2<-Cluster_lg_2/sum_lg

Cluster_lg_3<-Cluster_lg_3/sum_lg

return(list(

 "Cluster_modal"=Cluster_lg_modal,

 "Cluster_1"=Cluster_lg_1,

 "Cluster_2"=Cluster_lg_2,

 "Cluster_3"=Cluster_lg_3

))

}

## Example of call of scoring function in a loop over records

outdata<-inpdata

for(i in 1:nrow(outdata)){

scoring<-lg_scoring(outdata[i,])

outdata[i,"Cluster_modal"]<-scoring$Cluster_modal

outdata[i,"Cluster_1"]<-scoring$Cluster_1

outdata[i,"Cluster_2"]<-scoring$Cluster_2

outdata[i,"Cluster_3"]<-scoring$Cluster_3

}

As an example, let us take a subject with sys_resp = NA, ideo_lev = 1, rep_pot = 2, prot_app = 2, and conv_par = NA. For this data pattern, in part 1, the dummies sys_resp_lg_m, ideo_lev_lg_1, rep_pot_lg_2, prot_app_lg_2, and conv_par_lg_m are set to 1, and the remaining ones to 0. By summing the weights for which the dummies equal 1, part 2 yields the values 0, 0.4778, and 1.8748 for the three linear terms. Transforming these to probabilities in part 3 gives the posteriors .1095, .1766, and .7139 for the three classes.
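This arithmetic can be checked by summing the relevant weights from the lg_scoring function directly:

## Check of the worked example: sum the weights whose dummies equal 1
lin <- c(0,
         3.4185551 + 0.56595846 - 0.74630356 - 0.9274766 - 1.8328984,
         -3.6424675 + 3.6818864 + 3.0608866 - 0.40726727 - 0.81826931)
round(lin, 4)                       # 0.0000 0.4778 1.8748
round(exp(lin) / sum(exp(lin)), 4)  # 0.1095 0.1766 0.7139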

Appendix B: Taylor Approximation of the Normalizing Constants with a Covariate Having a Direct Effect on a Categorical Indicator

As shown in the main text, when a numeric covariate $z$ has a direct effect on a categorical indicator, the normalizing constant of the indicator concerned becomes $E_{jk|z}=\sum_{y_j=1}^{R_j}\exp(\alpha_{y_j}+\beta_{y_j k}+\delta_{y_j}z)$, and will thus depend on the value of $z$. As a result, the scoring equations will no longer be linear logistic. However, a possible way out is to approximate this term using a Taylor expansion.

For simplicity, assume covariate $z$ is centered, and thus has a mean of 0. The second-order Taylor approximation of $\log E_{jk|z}$ at $z=0$ equals $\log E_{jk|z}\approx\log E_{jk|z=0}+\frac{d\log E_{jk|z}}{dz}\Big|_{z=0}\,z+\frac{1}{2}\frac{d^{2}\log E_{jk|z}}{dz^{2}}\Big|_{z=0}\,z^{2}$, with $\frac{d\log E_{jk|z}}{dz}\Big|_{z=0}=\sum_{y_j=1}^{R_j}P(Y_j=y_j \mid X=k, z=0)\,\delta_{y_j}$

and $\frac{d^{2}\log E_{jk|z}}{dz^{2}}\Big|_{z=0}=\sum_{y_j=1}^{R_j}P(Y_j=y_j \mid X=k, z=0)\Big[\delta_{y_j}-\frac{d\log E_{jk|z}}{dz}\Big|_{z=0}\Big]\delta_{y_j}$.

The term $\log E_{jk|z=0}$ is subtracted from the intercept of class $k$, the first derivatives $\frac{d\log E_{jk|z}}{dz}\big|_{z=0}$ are subtracted from the linear terms for $z$, and the terms $\frac{1}{2}\frac{d^{2}\log E_{jk|z}}{dz^{2}}\big|_{z=0}$ are likewise subtracted, now as coefficients of the quadratic term $z^{2}$ in the scoring equations.
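The following R sketch illustrates the approximation for a single indicator with $R_j=3$ categories and hypothetical parameter values (all numbers are made up), comparing the exact $\log E_{jk|z}$ with its second-order expansion:

## Illustrative R sketch (hypothetical parameters): Taylor approximation of log E_{jk|z}
alpha <- c(0, 0.5, -0.3)   # alpha_{y_j}, category 1 fixed to 0
beta  <- c(0, 1.2, -0.8)   # beta_{y_j k} for the class considered
delta <- c(0, 0.6,  0.4)   # direct effect delta_{y_j} of covariate z

logE <- function(z) log(sum(exp(alpha + beta + delta * z)))

p0 <- exp(alpha + beta) / sum(exp(alpha + beta))   # P(Y_j = y_j | X = k, z = 0)
d1 <- sum(p0 * delta)                              # first derivative at z = 0
d2 <- sum(p0 * (delta - d1) * delta)               # second derivative at z = 0

z <- 0.7
c(exact = logE(z), taylor = logE(0) + d1 * z + 0.5 * d2 * z^2)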