
Estimating Latent-Variable Panel Data Models Using Parameter-Expanded SEM Methods

Abstract

This article presents new estimation algorithms for three types of dynamic panel data models with latent variables: factor models, discrete choice models, and persistent-transitory quantile processes. The new methods combine the parameter expansion (PX) ideas of Liu, Rubin, and Wu with the stochastic expectation-maximization (SEM) algorithm in likelihood and moment-based contexts. The goal is to facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency. This is achieved by specifying expanded models within the M step. Effectively, we are proposing new estimators for the pseudo-data within iterations that take into account the fact that the model of interest is misspecified for draws based on parameter values far from the truth. We establish the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard SEM based on the original model, implying a faster global convergence rate. Finally, in simulations we show that the new algorithms significantly improve the convergence speed relative to standard SEM algorithms, sometimes dramatically so, by reducing the total computing time from hours to a few minutes.

1 Introduction

This article presents new estimation algorithms for dynamic panel data models with latent variables. Dynamic panel data models are widely used in applied work today. They tend to exhibit many latent variables over multiple periods (e.g., time-invariant, persistent, and transitory components), which are important for capturing unobserved heterogeneity and dynamic responses (Arellano and Bonhomme 2017). However, the presence of latent variables poses challenges for estimation.

Iterative methods like the stochastic expectation-maximization (SEM) algorithm can be useful tools for estimating models with latent variables (Diebolt and Celeux 1993).[1] Specifically, as a simulated version of the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), SEM iterates between an E-step, where we draw latent variables from the posterior distribution of the model of interest given observables, and an M-step, where we estimate the model as if the draws were observed, until the parameters converge to the stationary distribution. It simplifies estimation because it replaces the complex optimization problem, which involves multiple integrals due to latent variables, with a series of much simpler optimization problems under pseudo-complete data.

However, slow convergence, an often-voiced criticism of EM and its variants, tends to diminish its practical appeal. Indeed, in practice the slow-convergence issue is even more pronounced: researchers often need to run the algorithms multiple times with different initial guesses and select the result based on criteria such as the likelihood value, both to mitigate the negative effects of a "bad" initial guess and to address the possibility of the algorithm converging to a local maximum. Recent research has explored alternative samplers for latent variables when performing the E-step to improve sampling efficiency and stability.[2] In contrast, this article focuses on potential improvements in the M-step.

In this article, we develop a new estimation method, the PX-SEM algorithm, by combining the parameter expansion ideas in Liu, Rubin, and Wu (1998) with the SEM algorithm. The goal is to facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency. Even though the general concept of the PX-SEM algorithm applies to various models, we focus on three types of dynamic panel data models containing rich latent-variable structures, where slow convergence issues are exacerbated and where we expect the method to be particularly fruitful: (a) dynamic factor models, (b) random effects discrete choice models with persistent and transitory components, and (c) persistent-transitory dynamic quantile models with individual effects.[3]

The PX-SEM algorithm consists of two steps: an E-step, where we draw values of latent variables from the posterior distribution, and a PX-M-step, where we update parameters. Sharing the same E-step as the SEM algorithm, PX-SEM replaces the SEM M-step estimator with a more robust one, taking into account the possibility that E-step draws could violate model assumptions when parameter guesses are far from the true value. The PX-M-step estimator is able to leverage additional information from the model itself, effectively "correcting" the M-step updates so that they progress toward more accurate values.

To implement the PX-SEM algorithm, one must construct an expanded model, the L model, which needs to satisfy two conditions. First, the L model must nest the original model, the O model. Second, there must exist a reduction function, a mapping from the L model parameters to the O model parameters, keeping the observed-data likelihood unchanged. After constructing a suitable L model, we can iterate between the E step and the PX-M step, which involves (a) estimating the L model and (b) mapping back to the O model parameters through the reduction function.

There are different ways to construct L models. All else being equal, a more flexible L model should improve the convergence rate. However, since our ultimate goal is to reduce the total computing time, we also need to consider the time spent in each iteration for estimating the L model and converting it to the O model. Therefore, taking these two factors into account, this article proposes a method to expand the model linearly. Linear expansion addresses the potential violation of zero-correlation assumptions.

In terms of statistical properties, Liu, Rubin, and Wu (1998) prove the monotone convergence of the parameter-expanded EM algorithm and its superior rate of convergence relative to its parent EM. By combining the results of Nielsen (2000) and Arellano and Bonhomme (2016), this article establishes the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard O model based SEM, implying a faster global convergence rate and a smaller variance for the limiting stationary distribution. Finally, in the simulations, we show that PX-SEM can significantly improve algorithmic efficiency compared to the standard SEM algorithm, sometimes dramatically so. For example, in our numerical calculations for discrete choice and quantile models, SEM has still not converged after running for 50–80 min, whereas PX-SEM converges within 2–3 min.

This article belongs to an expanding literature that considers the application of the EM algorithm (Dempster, Laird, and Rubin 1977) and its variants in estimating latent variable models (Diebolt and Celeux 1993; Liu, Rubin, and Wu 1998; Arcidiacono and Jones 2003; Pastorello, Patilea, and Renault 2003; Arellano and Bonhomme 2016; Chen 2016; Arellano et al. 2023, among others). This article contributes to this literature in two ways. First, it develops a new estimation method, PX-SEM, which combines the parameter expansion idea with the SEM algorithm.[4] The method offers appealing theoretical properties and the potential to enhance algorithmic efficiency, which is particularly valuable for complex models such as nonlinear panel data models, where SEM may encounter slow convergence issues. Second, the article proposes a specific class of linear expansions for implementing PX-SEM and develops new estimation algorithms for three types of latent-variable panel data models with enhanced algorithmic efficiency.

The article proceeds as follows. Section 2 illustrates the difference between the standard stochastic EM algorithm and the parameter-expanded stochastic EM algorithm using a simple toy model. In Section 3, a formal definition of PX-SEM is provided, along with a discussion of its statistical properties and implementation based on linear expansions. Sections 4–6 develop PX-SEM methods for three types of latent-variable panel data models: dynamic factor models, discrete choice models, and persistent-transitory dynamic quantile models, respectively. Finally, Section 7 concludes.

2 Toy Model

Using a simple toy model, this section compares the parameter-expanded stochastic EM (PX-SEM) algorithm with the standard stochastic EM (SEM) algorithm and provides the intuition behind PX-SEM.

Consider the following model we want to estimate, denoted as the O model:

$$y_i = y_i^* + \epsilon_i, \quad \text{where } \begin{bmatrix} y_i^* \\ \epsilon_i \end{bmatrix} \overset{iid}{\sim} N\left(0, \begin{bmatrix} \sigma^2 & 0 \\ 0 & 1 \end{bmatrix}\right). \tag{O Model}$$

The observed outcomes are y1,…,yN, and the latent variables whose distribution is of interest are y1*,…,yN*. The only unknown parameter is the standard deviation σ.

SEM

To implement the SEM algorithm, we start with an initial guess of the unknown parameter σ̂(0) and then iterate the following two steps for s=0,1,…,S until σ̂(s) converges to the stationary distribution:

1. Stochastic E step: Draw yi* from the posterior distribution fO(yi*|yi;σ̂(s)), where fO(·) denotes a density function of the O model.

2. M step: Estimate the O model and update σ̂(s+1), that is, σ̂(s+1)=std̂(yi*). The final estimator is the average of the last S0 iterations, $\hat{\sigma} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\sigma}^{(s)}$.

The nonstochastic version, the EM algorithm, is effective because it improves the observed-data likelihood in each iteration:

$$\sum_i \log f_O(y_i;\hat{\sigma}^{(s+1)}) - \sum_i \log f_O(y_i;\hat{\sigma}^{(s)}) \geq Q(\hat{\sigma}^{(s+1)}|\hat{\sigma}^{(s)}) - Q(\hat{\sigma}^{(s)}|\hat{\sigma}^{(s)}) \geq 0, \tag{1}$$

where $Q(\hat{\sigma}^{(s+1)}|\hat{\sigma}^{(s)}) = \sum_i \int \log f_O(y_i, y_i^*;\hat{\sigma}^{(s+1)})\, f_O(y_i^*|y_i;\hat{\sigma}^{(s)})\, dy_i^*$.[5]

PX-SEM

Now we introduce the PX-SEM algorithm. Like SEM, PX-SEM comprises an E step for draws of latent variables and an M step for updating parameters. While sharing the same E-step as SEM, PX-SEM’s M step involves (a) estimating an expanded model, the L model, and (b) mapping L model parameters to the O model parameters.

For this toy model, we propose the following L model:

$$y_i = y_i^* + \epsilon_i, \quad \text{where } \begin{bmatrix} y_i^* \\ \epsilon_i \end{bmatrix} \overset{iid}{\sim} N\left(0,\, K\begin{bmatrix} \sigma^2 & 0 \\ 0 & 1 \end{bmatrix}K'\right), \quad K = \begin{bmatrix} k & 0 \\ 1-k & 1 \end{bmatrix}. \tag{L Model}$$

In addition to σ, the L model also contains an auxiliary parameter k. It is easy to verify that when k = 1 the two models coincide, that is, fO(yi*,yi;σ)=fL(yi*,yi;k=1,σ); and when k ≠ 1, the L model expands the O model by allowing for a nonzero correlation between yi* and ϵi, as cov(yi*,ϵi)=k(1−k)σ². Moreover, the L model is unidentified with observables: for any L model with parameter values σL and kL, the O model with parameter value σ=σL has the same observed-data likelihood, that is, fO(yi;σL)=fL(yi;σL,kL).[6]

To implement PX-SEM, we begin with an initial guess of the unknown parameter σ̂(0) and then iterate the following two steps for s=0,1,…,S until σ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw yi* from the posterior distribution fO(yi*|yi;σ̂(s))

  2. PX-M step: Update parameters by

    1. Estimate the L model: $\hat{k}_L = \widehat{\mathrm{var}}(y_i^*)\,/\,\widehat{\mathrm{cov}}(y_i^*, y_i)$, $\quad \hat{\sigma}_L^{(s+1)} = \widehat{\mathrm{std}}(y_i^*)\,/\,|\hat{k}_L|$

    2. Reduction: mapping from (σ̂L(s+1),k̂L) to σ̂(s+1) maintaining the observed-data likelihood, that is fO(yi;σ̂(s+1))=fL(yi;σ̂L(s+1),k̂L), and thus σ̂(s+1)=σ̂L(s+1)

The final estimator is the average of the last S0 iterations: $\hat{\sigma} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\sigma}^{(s)}$. As we see, the difference between the two methods lies in the M-step estimators: the PX-SEM estimator is adjusted by the factor 1/|k̂L|.
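To make the comparison concrete, below is a minimal Python sketch of both algorithms for the toy model, assuming direct sampling in the E-step via the closed-form Gaussian posterior implied by the O model; all names are illustrative rather than the article's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_true = 5000, 2.0
y = rng.normal(0, sigma_true, N) + rng.normal(0, 1.0, N)  # y = y* + eps

def e_step(y, sig):
    # Posterior of y* | y is Gaussian with mean r*y and variance r,
    # where r = sig^2 / (sig^2 + 1)
    r = sig**2 / (sig**2 + 1.0)
    return r * y + np.sqrt(r) * rng.standard_normal(y.size)

def run(px, sig0=0.5, S=300, S0=100):
    sig, path = sig0, []
    for _ in range(S):
        ystar = e_step(y, sig)              # stochastic E step
        if px:                              # PX-M step
            C = np.cov(ystar, y)
            k = C[0, 0] / C[0, 1]           # k_L = var(y*) / cov(y*, y)
            sig = ystar.std() / abs(k)      # sigma_L, equal to the reduction
        else:                               # standard M step
            sig = ystar.std()
        path.append(sig)
    return np.mean(path[-S0:])              # average of last S0 iterations

print("SEM:", run(px=False), " PX-SEM:", run(px=True))
```

Starting from a guess far below the truth, the PX-SEM path reaches the neighborhood of σ = 2 in far fewer iterations, which is exactly the effect of the 1/|k̂L| adjustment described above.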

Figure 1 illustrates how PX-SEM has the potential to enhance algorithmic efficiency through the utilization of the auxiliary parameter k. Specifically, the first panel depicts a scatterplot of simulations generated by the data generating process (DGP), the O model with a true value of σ = 2. The x-axis and y-axis display yi* and ϵi, respectively, but only their sum yi=yi*+ϵi is used for estimation.

Fig. 1 Data and E-step draws under different guess values of σ.

The second panel is the scatterplot of (yi*, yi−yi*) where yi* are the E-step draws under the true value, that is, yi*∼fO(yi*|yi;σ=2), and the third panel is the scatterplot of (yi*, yi−yi*) where yi* are the E-step draws under an incorrect value σ = 1, that is, yi*∼fO(yi*|yi;σ=1). Contrary to the second panel, where the E-step draws are generated under the correct guess and there is no significant correlation between yi* and yi−yi*, the third panel presents a significant positive correlation between the E-step draws yi* and yi−yi*, which is assumed to be zero by the O model. This "false" positive correlation arises because the draws are taken under the wrong premise that the variance of yi* does not deviate significantly from one.

As a consequence, for the E-step draws in the second panel, both the M-step and PX-M-step estimators are consistent: the SEM one is under the correct constraint k = 1. However, in the case of the third panel, the M-step of SEM ignores the violation of the zero-correlation assumption at the current draws, while PX-SEM takes the "false" correlation into account by adding the parameter k. By construction, the extra flexibility induced by k leads to a better model fit of the PX-M step for the current draws and, thus, a larger pseudo-complete-data likelihood. As we will show in Section 3, similar to EM, PX-EM can improve the observed-data likelihood in each iteration by transmitting gains from the pseudo-complete-data likelihood. Therefore, improving model fit as in the PX-M step essentially equates to improving the lower bound of the inequality in (1). Finally, by mapping the L model parameters to the O model parameters while keeping the observed-data likelihood unchanged, we preserve the "gains" in likelihood. Intuitively, PX-SEM replaces the SEM M-step with a more robust estimator that leverages additional model information, namely, that there is a linear correlation between the E-step draws yi* and yi−yi* even though there should not be. Appendix A provides more explanation with illustrative figures.

Regarding the choice of the L model, there are many different ways of expanding the O model. A more flexible expansion should help increase the convergence rate. Appendix A compares different L models and PX-M-step estimators using the toy model. However, in practice, when we estimate more complex models, we must consider how easily we can estimate the L model and reduce it to the O model, to avoid spending too much time in each iteration and increasing the total computing time. With this consideration in mind, we propose a linear expansion method in Section 3 and discuss its applications in Sections 4–6.

As a final comment, the PX-SEM algorithm aims to enhance algorithmic efficiency even when E-step draws are appropriately obtained. For instance, as demonstrated in Appendix A, PX-SEM improves the convergence rate when the E-step draws are based on direct sampling. In more complex models where direct sampling is not feasible and methods such as MCMC are required, the PX-SEM algorithm demonstrates lower sensitivity to poorly generated E-step draws. This can be justified if one of the ways that "bad" draws manifest themselves is through violations of model assumptions. Moreover, an MCMC-based E-step generally requires more time for each iteration, so reducing the iterations needed for convergence can significantly reduce the total computing time. For example, as shown in Sections 5 and 6, while SEM can take more than 3000 sec without showing signs of convergence, PX-SEM converges within a couple of minutes.

3 Parameter Expanded Stochastic EM Algorithm

This section begins by defining the PX-SEM algorithm. Following that, we discuss its statistical properties and explore the reasons behind its potential to enhance algorithmic efficiency. Specifically, we establish the asymptotic equivalence between PX-SEM and an alternative SEM with a reduced expected fraction of missing information. Finally, we propose a general approach, the linear expansion method, for the implementation.

3.1 Definition of PX-SEM Algorithm

Setup

Let {Yi,Xi,Yi*} for i=1,…,N be iid random variables following the distribution of the O model, denoted as $f_O(Y_i|X_i;\theta) = \int f_O(Y_i, Y_i^*|X_i;\theta)\, dY_i^*$. Here Wi ≡ [Yi Xi] represents the observable vector, Yi* is the latent-variable vector, and θ is the unknown parameter vector to be estimated. The true value, θ̄, satisfies the equation E(ΨO(Yi*,Wi;θ̄))=0, where ΨO(·) represents the score function of the complete O model in the case of the likelihood-based PX-SEM algorithm and moment restrictions in the case of the moment-based one. The law of iterated expectations implies that the true value θ̄ also satisfies the equation:

$$E\left(\int \Psi_O(Y_i^*, W_i;\bar{\theta})\, f_O(Y_i^*|W_i;\bar{\theta})\, dY_i^*\right) = 0. \tag{2}$$

Define θ̂ as the solution of the integrated moment restrictions of the original O model, $\sum_i \int \Psi_O(Y_i^*, W_i;\hat{\theta})\, f_O(Y_i^*|W_i;\hat{\theta})\, dY_i^* = 0$, a sample analogue of (2). In the case of a likelihood-based algorithm, θ̂ is the MLE.

Denote the expanded model, the L model, as $f_L(Y_i|X_i;\theta,K) = \int f_L(Y_i, Y_i^*|X_i;\theta,K)\, dY_i^*$, where K represents the auxiliary parameter vector. The expanded L model needs to satisfy two conditions: (a) the L model nests the O model: there exists K = K0 such that fO(Yi,Yi*|Xi;θ)=fL(Yi,Yi*|Xi;θ,K0) for all θ; and (b) existence of a reduction function: there exists a mapping from the L model parameters to the O model parameters, the reduction function θ=R(θL,K), such that the observed-data likelihood is preserved: fO(Yi|Xi;R(θL,K))=fL(Yi|Xi;θL,K) for all θL, K.

Let ΨLθ(·) denote the score function of the L model with respect to θ in the case of the likelihood-based PX-SEM algorithm and the same moment restrictions as ΨO(·) in the case of the moment-based one. Under condition (a), we have ΨLθ(Yi*,Wi;θ,K0)=ΨO(Yi*,Wi;θ), and thus E(ΨLθ(Yi*,Wi;θ̄,K0))=0. Additionally, assuming that K is identified when we observe Yi*, meaning that there exists ΨLK(·) such that E(ΨLK(Yi*,Wi;θ̄,K0))=0, we then have E(ΨL(Yi*,Wi;θ̄,K0))=0, where ΨL(·)=[ΨLθ(·)′ ΨLK(·)′]′. By the law of iterated expectations, this implies:[7]

$$E\left(\int \Psi_L(Y_i^*, W_i;\bar{\theta},K_0)\, f_O(Y_i^*|W_i;R(\bar{\theta},K_0))\, dY_i^*\right) = 0. \tag{3}$$

Definition of the PX-SEM algorithm

Before we outline the general steps of PX-SEM, let us take a look at the SEM algorithm for comparison. SEM is an iterative algorithm where, in the E step, we make draws of latent variables Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s)) under the parameter guess θ̂(s), and in the M step, we update it to θ̂(s+1), which satisfies $\sum_i \Psi_O(Y_i^*, W_i;\hat{\theta}^{(s+1)}) = 0$. This stochastic version differs from the original EM algorithm in that it replaces the integral in (2) with latent draws.

In contrast, the PX-SEM algorithm proposes iterations that are better linked to (3): while we still make draws of latent variables Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s)) under the parameter guess θ̂(s), we use the expanded model to update the parameter to θ̂(s+1), satisfying θ̂(s+1)=R(θ̂L,K̂), where $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}_L,\hat{K}) = 0$.

The general steps are as follows: starting with an initial guess θ̂(0), we iterate the following two steps for s=0,1,2,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s))

  2. PX-M step: Update parameters by

    1. Estimate the L model: $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}_L,\hat{K}) = 0$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂) subject to fO(Yi|Xi;θ̂(s+1))=fL(Yi|Xi;θ̂L,K̂)

Reduction function

In practice, one of the challenges in implementing the PX-SEM algorithm is to find the reduction function associated with the L model. However, if we construct the L model such that the auxiliary parameter K does not affect the observed-data likelihood, that is fO(Yi|Xi;θL)=fL(Yi|Xi;θL,K), then immediately, the reduction function becomes R(θ,K)=θ. As a result, PX-SEM can be simplified as follows:

  1. Stochastic E step: Draw Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s))

  2. PX-M step: Update parameters by solving $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}^{(s+1)},\hat{K}) = 0$

Comparing the PX-M and M steps, we find that the M-step estimator is a constrained version of the PX-M-step estimator with the constraint K = K0. Intuitively, when the E-step draws Yi* are generated under a guess θ̂(s) close enough to the true value, the M-step estimator is under the correct restriction, leading both the M-step and PX-M-step estimators to be consistent at that iteration, as indicated by (2) and (3).

However, when the guess θ̂(s) deviates significantly from the true value, causing the draws Yi* to violate certain model assumptions, we would expect the PX-SEM estimator to exhibit greater “robustness” due to extra flexibility brought by auxiliary parameter K. As shown in the following section, the likelihood-based PX-M step can achieve a larger pseudo-complete data likelihood improvement, which could further lead to a greater observed-data likelihood improvement compared to the M step.

3.2 Statistical Properties

This section focuses on the statistical properties of likelihood-based algorithms. We will first show that the parameter-expanded EM algorithm exhibits nonnegative improvement in the observed-data log-likelihood at each iteration. Next, for the stochastic version, PX-SEM, we will establish its asymptotic equivalence to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard O model based SEM, which implies a faster global convergence rate and a smaller variance for the limiting stationary distribution in a semipositive definite order.

Convergence

Following Liu, Rubin, and Wu (1998), we now prove that the PX-EM algorithm increases the observed-data likelihood in each iteration. The change in the observed-data log-likelihood between iterations, $\sum_i \log f_O(Y_i|X_i;\hat{\theta}^{(s+1)}) - \sum_i \log f_O(Y_i|X_i;\hat{\theta}^{(s)})$, equals

$$\sum_i \log f_L(Y_i|X_i;\hat{\theta}_L,\hat{K}) - \sum_i \log f_L(Y_i|X_i;\hat{\theta}^{(s)},K_0) \geq Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0) \geq 0,$$

where $Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) = \sum_i \int \log f_L(Y_i, Y_i^*|X_i;\hat{\theta}_L,\hat{K})\, f_L(Y_i^*|W_i;\hat{\theta}^{(s)},K_0)\, dY_i^*$.

The equality holds because of both condition (a): when K = K0, the two models coincide, meaning fO(Yi|Xi;θ̂(s))=fL(Yi|Xi;θ̂(s),K0), and condition (b): the reduction function exists, and thus by construction fO(Yi|Xi;θ̂(s+1))=fL(Yi|Xi;θ̂L,K̂). We then apply Gibbs' inequality. Finally, the definition of θ̂L, namely $(\hat{\theta}_L,\hat{K}) \equiv \arg\max_{\theta,K} Q(\theta,K|\hat{\theta}^{(s)},K_0)$, leads to a nonnegative change in the observed-data likelihood. Notably, the result also implies that (θ̂,K0) is a fixed point of PX-EM, where θ̂ represents the MLE.[8]

As a final remark, the L model nesting the O model implies the following inequality:

$$Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0) \geq Q(\hat{\theta}_{EM}^{(s+1)},K_0|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0),$$

where $\hat{\theta}_{EM}^{(s+1)} = \arg\max_\theta \sum_i \int \log f_O(Y_i, Y_i^*|X_i;\theta)\, f_L(Y_i^*|W_i;\hat{\theta}^{(s)},K_0)\, dY_i^*$. Therefore, the parameter expansion technique can be intuitively interpreted as a way to improve the lower bound of the log-likelihood increment compared to the EM algorithm.

Asymptotic properties

We first characterize the dynamics of PX-SEM updates. Define Θ as the joint set of auxiliary and O model parameters, Θ ≡ [θ;K]. Accordingly, Θ̄ ≡ [θ̄;K0] represents the vector of true values, and Θ̂ ≡ [θ̂;K0] represents the MLE of the O model. Given any estimate Θ̂(s)=[θ̂(s);K0] in iteration s, PX-SEM generates the next update from a Markov process: $\sum_i \Psi_L(Y_i^*, W_i;\hat{\Theta}^{(s+1)}) = 0$, where $Y_i^* \sim f_L(Y_i^*|W_i;\hat{\Theta}^{(s)})$.[9],[10] Expanding around Θ̂ and considering $\hat{\theta} \overset{p}{\to} \bar{\theta}$, as shown in detail in Appendix B, we have

$$(\hat{\Theta}^{(s+1)} - \hat{\Theta}) = (I - A^{-1}V)(\hat{\Theta}^{(s)} - \hat{\Theta}) + A^{-1}\epsilon^{(s)} + o_p(N^{-1/2}), \tag{4}$$

where, under correct specification, $A = E(\Psi_L(Y_i^*, W_i;\bar{\Theta})\,\Psi_L(Y_i^*, W_i;\bar{\Theta})')$ is the L model-based complete-data information matrix, $V = E(\Psi_L(W_i;\bar{\Theta})\,\Psi_L(W_i;\bar{\Theta})')$ is the L model-based observed-data information matrix, $I - A^{-1}V$ is the expected fraction of missing information, and $\sqrt{N}\,\epsilon^{(s)} \overset{d}{\to} N(0, A - V)$.

The SEM iterations can be characterized in the same way:

$$(\hat{\theta}_{SEM}^{(s+1)} - \hat{\theta}) = (I - A_{\theta\theta}^{-1}V_{\theta\theta})(\hat{\theta}_{SEM}^{(s)} - \hat{\theta}) + A_{\theta\theta}^{-1}\epsilon_\theta^{(s)} + o_p(N^{-1/2}), \tag{5}$$

where $A_{\theta\theta}$ represents the O model-based complete-data information matrix, $V_{\theta\theta}$ the O model-based observed-data information matrix, $F_{SEM} \equiv I - A_{\theta\theta}^{-1}V_{\theta\theta}$ the expected fraction of missing information, and $\sqrt{N}\,\epsilon_\theta^{(s)} \overset{d}{\to} N(0, A_{\theta\theta} - V_{\theta\theta})$. Moreover, the PX-SEM and SEM dynamics are closely connected:

$$A = \begin{bmatrix} A_{\theta\theta} & A_{\theta K} \\ A_{K\theta} & A_{KK} \end{bmatrix}, \quad V = \begin{bmatrix} V_{\theta\theta} & 0 \\ 0 & 0 \end{bmatrix},$$

where $A_{\theta K} \equiv \frac{\partial}{\partial K'}\big|_{\bar{\Theta}}\, E(\tilde{\Psi}_{L\theta}(Y_i^*, W_i;\Theta))$ and $A_{KK} \equiv \frac{\partial}{\partial K'}\big|_{\bar{\Theta}}\, E(\tilde{\Psi}_{LK}(Y_i^*, W_i;\Theta))$.

We now present the main results on the asymptotic properties, building on Liu, Rubin, and Wu (1998) and Nielsen (2000), with detailed discussions provided in Appendix B.

Theorem 1.

The PX-SEM iteration of θ̂(s) is asymptotically equivalent to an SEM iteration with observed-data information matrix $V_{\theta\theta}$ and complete-data information matrix $A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$.

Proof.

Let H denote the inverse of the matrix A, that is,

$$H = \begin{bmatrix} H_{\theta\theta} & H_{\theta K} \\ H_{K\theta} & H_{KK} \end{bmatrix} \equiv \begin{bmatrix} A_{\theta\theta} & A_{\theta K} \\ A_{K\theta} & A_{KK} \end{bmatrix}^{-1} = A^{-1},$$

where, by design, $H_{\theta\theta}^{-1} = A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$. Then the coefficient matrix $I - A^{-1}V$ and the asymptotic variance of the innovation term $A^{-1}\epsilon^{(s)}$ in (4) become:

$$I - A^{-1}V = \begin{bmatrix} I - H_{\theta\theta}V_{\theta\theta} & 0 \\ -H_{K\theta}V_{\theta\theta} & I \end{bmatrix}, \quad A^{-1}(A - V)A^{-1} = \begin{bmatrix} H_{\theta\theta}(H_{\theta\theta}^{-1} - V_{\theta\theta})H_{\theta\theta} & H_{\theta K} - H_{\theta\theta}V_{\theta\theta}H_{\theta K} \\ H_{K\theta} - H_{K\theta}V_{\theta\theta}H_{\theta\theta} & H_{KK} - H_{K\theta}V_{\theta\theta}H_{\theta K} \end{bmatrix}.$$

It becomes evident that the PX-SEM process of θ̂(s+1) is asymptotically equivalent to an alternative SEM dynamics, described by (6), which shares the same observed-data information matrix $V_{\theta\theta}$ as the standard SEM in (5), but replaces the original complete-data information matrix $A_{\theta\theta}$ by $H_{\theta\theta}^{-1} = A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$:

$$(\hat{\theta}^{(s+1)} - \hat{\theta}) = (I - H_{\theta\theta}V_{\theta\theta})(\hat{\theta}^{(s)} - \hat{\theta}) + H_{\theta\theta}\tilde{\epsilon}_\theta^{(s)} + o_p(N^{-1/2}), \tag{6}$$

where $\sqrt{N}\,\tilde{\epsilon}_\theta^{(s)} \overset{d}{\to} N(0, H_{\theta\theta}^{-1} - V_{\theta\theta})$, and $F_{PX} \equiv I - H_{\theta\theta}V_{\theta\theta}$ is the expected fraction of missing information.[11] □

Since $H_{\theta\theta} = A_{\theta\theta}^{-1} + H_{\theta K}H_{KK}^{-1}H_{K\theta}$, under the condition that A is positive definite, $H_{\theta\theta} \geq A_{\theta\theta}^{-1}$ in a semipositive definite order, implying that the largest eigenvalue of $F_{PX}$ is no greater than the largest eigenvalue of $F_{SEM}$. Applying Theorem 1, this comparison of the expected fractions of missing information between PX-SEM and SEM immediately implies the dominance of PX-SEM in convergence rate, as stated in Corollary 1 (and illustrated numerically in the sketch following its proof).

Corollary 1.

PX-SEM dominates SEM in global rate of convergence.

Proof.

In Appendix B.2. □
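As an informal numerical illustration of this eigenvalue comparison (a sketch under simplifying assumptions, not part of the formal proofs), one can draw a random positive definite A, construct V with $V_{\theta\theta} \leq H_{\theta\theta}^{-1}$ so that A − V is positive semidefinite, and compare the spectral radii of the two fractions of missing information:

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_K = 3, 2
d = d_theta + d_K

S = rng.standard_normal((d, d))
A = S @ S.T + d * np.eye(d)                   # random positive definite A
A_tt, A_tK = A[:d_theta, :d_theta], A[:d_theta, d_theta:]
A_Kt, A_KK = A[d_theta:, :d_theta], A[d_theta:, d_theta:]

# Schur complement: H_tt^{-1} = A_tt - A_tK A_KK^{-1} A_Kt
H_tt_inv = A_tt - A_tK @ np.linalg.solve(A_KK, A_Kt)

# Pick V_tt strictly below H_tt^{-1} (this ensures A - V >= 0)
L = np.linalg.cholesky(H_tt_inv)
Q, _ = np.linalg.qr(rng.standard_normal((d_theta, d_theta)))
B = Q @ np.diag(rng.uniform(0.1, 0.9, d_theta)) @ Q.T   # 0 < B < I
V_tt = L @ B @ L.T

F_sem = np.eye(d_theta) - np.linalg.solve(A_tt, V_tt)     # I - A_tt^{-1} V_tt
F_px = np.eye(d_theta) - np.linalg.solve(H_tt_inv, V_tt)  # I - H_tt V_tt

rho = lambda F: np.abs(np.linalg.eigvals(F)).max()
print(rho(F_px), "<=", rho(F_sem))            # PX has the smaller spectral radius
```

Across random draws, the spectral radius of $F_{PX}$ never exceeds that of $F_{SEM}$, consistent with Corollary 1.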

Moreover, Nielsen (2000) provides conditions under which the SEM update is ergodic and characterizes the limiting stationary distribution, based on which Corollary 2 describes the limiting stationary distribution of PX-SEM and compares it with SEM.[12]

Corollary 2.

The limiting stationary distribution of the PX-SEM updates θ̂(s), conditional on Wi, is

$$\sqrt{N}(\hat{\theta}^{(s)} - \hat{\theta}) \overset{d}{\to} N\left(0,\, V_{\theta\theta}^{-1}\left(I - (I + F_{PX})^{-1}\right)\right),$$

and, unconditionally,

$$\sqrt{N}(\hat{\theta}^{(s)} - \bar{\theta}) \overset{d}{\to} N\left(0,\, V_{\theta\theta}^{-1}\left(2I - (I + F_{PX})^{-1}\right)\right),$$

with its variances being less than or equal to those of the standard O model-based SEM, that is,

$$V_{\theta\theta}^{-1}\left(2I - (I + F_{PX})^{-1}\right) - V_{\theta\theta}^{-1}\left(2I - (I + F_{SEM})^{-1}\right) = V_{\theta\theta}^{-1}\left(I - (I + F_{PX})^{-1}\right) - V_{\theta\theta}^{-1}\left(I - (I + F_{SEM})^{-1}\right) \leq 0$$

in a semipositive definite order.

Proof.

In Appendix B.2. □

Corollary 2 implies that PX-SEM updates exhibit smaller fluctuations along iterations in large samples. Moreover, since the final estimator is the average of the last S0 iterations after convergence, $\hat{\theta} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\theta}^{(s)}$, which converges to the MLE as the number of iterations increases, the PX-SEM and SEM estimators share the same asymptotic variance.[13]

When the M-step is moment-based, convergence is in general not guaranteed. Even under convergence, the speed does not necessarily dominate that of SEM. Indeed, Appendix A shows an example where moment-based PX-SEM underperforms SEM for some initial guesses.

However, moment-based PX-SEM may be the preferred choice in practice for at least two crucial reasons. First, in some cases, obtaining GMM estimators is much easier, such as in the quantile model discussed in Section 6. Since our final target is to reduce the computing time, we should consider not only the number of iterations but also the time spent in each iteration. Second, even if obtaining the MLE of the O model is feasible, restricting ourselves to a tractable MLE in the PX-M step can limit the flexibility in building the L model, negatively impacting the convergence rate. Appendix A shows an example of the toy model where the moment-based PX-SEM with a more flexible L model outperforms the likelihood-based PX-SEM, which uses a less flexible L model.

3.3 Implementation based on Linear Expansions

So far, we have shown that a new estimation method, PX-SEM, which combines the parameter expansion technique with the SEM algorithm, has attractive theoretical properties relative to ordinary SEM and the potential to achieve large computational gains.

However, the parameter expansion technique itself does not speak of the selection of the L model. On the one hand, all else being equal, a more flexible L model should improve the convergence rate. On the other hand, we also need to consider the time spent in each iteration to estimate the L model and convert it to the O model since our ultimate goal is to reduce the total computing time. Therefore, another contribution of this article is to propose a specific class of linear expansions, targeting the potential violation of zero-correlation assumptions, which can be generally applied to a wide range of models.

Consider an O model of the form $Y_i = G(Y_i^*;\theta)$, $Y_i^* \sim F_O(\theta)$, where both G(·) and FO(·) are known parametric functions up to the unknown parameter θ. We propose the following linear expansion of Yi*:[14]

$$Y_i = G(Y_i^*;\theta), \quad Y_i^* = A\tilde{Y}_i, \quad \tilde{Y}_i \sim F_O(\theta), \quad \text{s.t. } G(Y_i^*;\theta) \overset{d}{=} G(\tilde{Y}_i;\theta) \text{ (equally distributed)}, \tag{L Model}$$

where the auxiliary parameter is given by K = vec(A).[15]

The expansion is straightforward. We assume that the latent variable Ỹi follows the same distribution as its O model counterpart. However, the E-step draws Yi*, which directly enter the measurement equation and the observable Yi, result from an affine transformation applied to Ỹi. It is easy to check that when A = I the L model coincides with the O model, whereas when A ≠ I it allows us to introduce linear correlations among the elements of Yi*. The constraint ensures that the auxiliary parameters do not affect the observed-data likelihood, simplifying the reduction function to R(θ,K)=θ.[16]

To implement the PX-SEM algorithm, in the E-step we draw Yi* from the O model based posterior distribution as discussed before. In the PX-M step, we leverage moment constraints or the distribution FO(θ) to pin down A and θ.

This method has the advantage that, despite the model of interest being nonlinear, the expansion is linear in latent variables, which are drawn from the E-step and treated as observables in the PX-M step. Thus, we can identify the auxiliary parameters separately through a relatively simple linear model, regardless of the specific form of G(·).
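As a concrete sketch of this point, in the Gaussian case the whole matrix A can be recovered from the E-step draws by matching covariances: if $\tilde{Y}_i \sim N(0, \Sigma(\theta))$ and $Y_i^* = A\tilde{Y}_i$, then $cov(Y_i^*) = A\Sigma(\theta)A'$, so a lower-triangular A follows from Cholesky factors. The code below is illustrative only; sigma_model stands for the model-implied covariance of the untransformed latent vector.

```python
import numpy as np

def estimate_A(ystar_draws, sigma_model):
    """ystar_draws: (N, d) E-step draws; sigma_model: (d, d) model-implied cov."""
    S = np.cov(ystar_draws, rowvar=False)      # sample covariance of the draws
    L_star = np.linalg.cholesky(S)
    L_model = np.linalg.cholesky(sigma_model)
    # A = L_star L_model^{-1} is lower triangular with positive diagonal
    # and satisfies A sigma_model A' = S by construction
    return L_star @ np.linalg.inv(L_model)
```

When the current parameter guess is close to the truth, the draws' covariance matches the model-implied one and A is close to the identity, so the PX-M step reduces to the standard M step.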

In the following sections, we discuss three applications: (a) dynamic factor models, (b) discrete choice models, and (c) quantile models, for which we propose PX-SEM algorithms based on linearly expanded models.

4 Dynamic Factor Models

The first type of model we discuss is the dynamic factor model (Geweke 1977). The appeal of this class of models is their ability to explain variation across multiple dimensions using fewer latent common factors. Applications span multiple fields, including topics in macroeconomics and finance, among others (Bai and Ng 2008; Stock and Watson 2006, 2011). While we will focus on a specific single-factor O model, it is worth noting that the same approach for implementing the PX-SEM algorithm can be applied to models with multiple latent factors. The O model to be estimated is as follows:

$$y_{it} = \lambda_i\nu_t + \epsilon_{it} \quad \text{and} \quad \nu_t = \nu_{t-1} + u_t, \tag{O Model}$$

where $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_i^2)$, $u_t \overset{iid}{\sim} N(0,1)$, $\nu_0 = 0$, and $u_t$ is independent of $\epsilon_{it}$.

The model contains a latent common factor νt that follows a Gaussian random walk. We observe N different measures, yi, where i=1,…,N, over a total of T periods, with each measure associated with a distinct factor loading λi. The set of unknown parameters is denoted as θ ≡ (λ1,…,λN,σ1,…,σN).[17]

SEM

We first explain the SEM procedure. Let ν ≡ [ν1 ν2 ⋯ νT]′. Starting with an initial guess θ̂(0), we iterate through the E-step and the M-step for s=0,1,2,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw ν from the posterior distribution fO(ν|y;θ̂(s))

  2. M step: Update θ̂(s+1)=(λ̂1,…,λ̂N,σ̂1,…,σ̂N): $\hat{\lambda}_i = \left(\sum_t \nu_t^2\right)^{-1}\left(\sum_t \nu_t y_{it}\right)$ and $\hat{\sigma}_i = \widehat{\mathrm{std}}(y_{it} - \hat{\lambda}_i\nu_t)$, for all i

PX-SEM

To implement PX-SEM, we construct a simple L model as follows:

$$y_{it} = \lambda_i\nu_t + \epsilon_{it} \quad \text{and} \quad \nu_t = \nu_{t-1} + u_t, \tag{L Model}$$

where $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_i^2)$, $u_t \overset{iid}{\sim} N(0,k^2)$, $\nu_0 = 0$, and $u_t$ is independent of $\epsilon_{it}$.

This L model expands the O model by introducing an auxiliary parameter, k, allowing the variance of the persistent shock ut to deviate from 1. Since k can always take the value of 1, making the two models coincide, the L model satisfies condition (a). Moreover, it is easy to verify that the reduction function R(λ1,…,λN,σ1,…,σN,k)=(λ1k,…,λNk,σ1,…,σN) satisfies condition (b), that is, fO(yi;R(θ,k))=fL(yi;θ,k).

With the L model specified and an initial guess θ̂(0), we iterate through the E-step and the PX-M step for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw ν from the posterior distribution fO(ν|y;θ̂(s))

  2. PX-M step:

    1. L model estimation: obtain λ̂L, σ̂L=(λ̂L1,…,λ̂LN,σ̂L1,…,σ̂LN) and k̂:

      $\hat{\lambda}_{Li} = \left(\sum_t \nu_t^2\right)^{-1}\left(\sum_t \nu_t y_{it}\right)$ and $\hat{\sigma}_{Li} = \widehat{\mathrm{std}}(y_{it} - \hat{\lambda}_{Li}\nu_t)$, for all i; and $\hat{k} = \widehat{\mathrm{std}}(\nu_t - \nu_{t-1})$

    2. Reduction: θ̂(s+1)=(λ̂Lk̂,σ̂L)

The PX-M step estimation of the auxiliary parameter k is straightforward due to the separability of the log-likelihood function. Compared to SEM, the PX-SEM update θ̂(s+1) takes into account potential deviations from the assumption k=1 in the O model. When the guess θ̂(s) is sufficiently close to the true value, we expect k̂ to be close to 1, resulting in similar SEM and PX-SEM updates θ̂(s+1). However, when the guess θ̂(s) deviates significantly from the true value, leading to a violation of the assumption k=1 in the E-step draws, PX-SEM adjusts the estimate accordingly. For instance, if k̂ is greater than 1, it suggests scaling down the latent draws ν by a factor of k to ensure var(Δν)=1 and scaling up λi by the same factor k to maintain the same log-likelihood for observed data.
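Below is a minimal Python sketch of the full PX-SEM iteration for this model, assuming direct Gaussian sampling in the E-step (under the random-walk assumption, the posterior of ν given the data is Gaussian with a tridiagonal prior precision); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_factor(y, lam, sig):
    """Draw nu from its Gaussian posterior given y (N x T), lam (N,), sig (N,)."""
    T = y.shape[1]
    # Random-walk prior precision: nu_t = nu_{t-1} + u_t, nu_0 = 0, u_t ~ N(0,1)
    Q = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    Q[-1, -1] = 1.0
    P = Q + np.sum(lam**2 / sig**2) * np.eye(T)   # posterior precision
    b = (lam / sig**2) @ y                        # length-T posterior shift
    mean = np.linalg.solve(P, b)
    L = np.linalg.cholesky(P)
    return mean + np.linalg.solve(L.T, rng.standard_normal(T))

def px_m_step(y, nu):
    """PX-M step: estimate the L model, then reduce to the O model."""
    lam_L = (y @ nu) / (nu @ nu)                  # loadings, measure by measure
    sig_L = (y - np.outer(lam_L, nu)).std(axis=1)
    k_hat = np.diff(nu, prepend=0.0).std()        # std of nu_t - nu_{t-1}
    return lam_L * k_hat, sig_L                   # reduction: scale loadings by k

def px_sem(y, S=1500, S0=500):
    N, _ = y.shape
    lam, sig = rng.lognormal(size=N), np.ones(N)  # random initial guess
    path = []
    for _ in range(S):
        nu = sample_factor(y, lam, sig)           # E step
        lam, sig = px_m_step(y, nu)               # PX-M step
        path.append(np.concatenate([lam, sig]))
    return np.mean(path[-S0:], axis=0)            # average of last S0 iterations
```

Dropping the k_hat adjustment (returning lam_L unscaled) recovers the standard SEM iteration, which makes the two algorithms easy to compare on simulated data.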

Remark.

It is easier to see the connection to the proposed linear expansion method after reparameterizing the L model. As detailed in Appendix C, we obtain an alternative expanded model with the reduction function R(θ,K)=θ, that is, $y_{it} = \lambda_i\nu_t + \epsilon_{it}$, $[\nu_1\ \cdots\ \nu_T\ \epsilon_{i1}\ \cdots\ \epsilon_{iT}]' = A_i[\nu_1^*\ \cdots\ \nu_T^*\ \epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$, $\nu_t^* = \nu_{t-1}^* + u_t$, where

$$A_i = \begin{bmatrix} kI_{T\times T} & 0_{T\times T} \\ \lambda_i(1-k)I_{T\times T} & I_{T\times T} \end{bmatrix},$$

and ϵ*, u, and ν* follow distributions identical to their O model counterparts. Thus, it is evident that the proposed L model belongs to the class of linear expansions with a specific constraint on the matrix Ai: only contemporaneous correlations between (νt,ϵit) and (νt*,ϵit*) are allowed. Despite its advantages of easy adaptation to various models and a negligible increase in computational burden due to likelihood separability, in the other two applications we will explore more flexible L models by relaxing constraints on the matrix A, such as allowing for correlations across periods, to achieve faster convergence.

Simulation Results

Figure 2 presents simulation results for a DGP where λ=(1.22,1.07,1.62) and σ=(0.92,0.78,1.33) with N = 3 and T = 200. The x-axis represents the number of iterations s=1,…,1500, and the y-axis represents the M-step update θ̂(s). The blue line depicts the SEM trajectory, whereas the orange line depicts the PX-SEM trajectory. The horizontal green dashed line represents the true value. Starting from a randomly chosen initial guess θ̂(0), both SEM and PX-SEM updates move toward the true value and stabilize after several iterations. We use the average of the last 500 updates as the final estimate.

Fig. 2 SEM and PX-SEM iterations of θ̂(s) from a random initial guess.

NOTE: Iterations of SEM (blue line) and PX-SEM (orange line) based on direct sampling, compared with the true value (green dashed line). SEM estimates (blue diamond) and PX-SEM estimates (orange star) are calculated as the average of the last 500 iterations. Random initial guess generated from a lognormal distribution. N=3, T=200.

As shown in Figure 2, PX-SEM converges almost immediately for all the parameters. SEM, although it also converges relatively fast for the σ's, shows a notable difference in the case of the λ's: it does not converge until around 500 iterations.

Regarding the volatility of updates across iterations, Appendix D presents figures with longer trajectories, where we can observe that PX-SEM exhibits smaller volatility. Appendix D also includes results for larger sample sizes and additional figures plotting cumulative computing time, revealing significant gains, especially for larger samples.

5 Discrete Choice Models

The second type of model we discuss is the random effects discrete choice model with persistent and transitory components. Discrete choice models are widely used in empirical research on various topics, including labor supply (Hyslop 1999) and consumer demand (Keane et al. 2013), among others. Distinguishing heterogeneity from persistence is of interest for many reasons, but the nonlinearity and the presence of latent variables complicate the estimation process.[18]

In this section, we develop PX-SEM algorithms for a group of discrete choice models with rich latent-variable structures, including time-invariant, persistent, and transitory components. Specifically, the O model is as follows:

$$y_{it} = 1(z_{it} > 0), \quad z_{it} = \beta'x_{it} + \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = \rho\nu_{i,t-1} + u_{it}, \tag{O Model}$$

where $\mu_i|x_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1}|x_i \overset{iid}{\sim} N(0,1)$, $u_{it}|x_i \overset{iid}{\sim} N(0,\sigma_u^2)$, $\epsilon_{it}|x_i \overset{iid}{\sim} N(0,1)$; and $\mu_i$, $u_{it}$, $\epsilon_{it}$ are mutually independent.[19]

For each individual i=1,…,N at period t=1,…,T, we observe a vector of independent variables xit of dimension J and a binary (0-1) discrete dependent variable yit, whereas zit, the individual effect μi, the persistent component νit, and the transitory component ϵit are latent. We denote the set of unknown parameters as θ ≡ (β,σμ,ρ,σu).

SEM

Let zi ≡ [zi1 ⋯ ziT]′ and νi ≡ [νi1 ⋯ νiT]′. From an initial guess θ̂(0), we iterate through the E-step and M-step for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw (zi,μi,νi) from the posterior distribution fO(zi,μi,νi|yi,xi;θ̂(s)).

  2. M step: Update θ̂(s+1)=(β̂(s+1),σ̂μ(s+1),ρ̂(s+1),σ̂u(s+1))

    $$\hat{\beta}^{(s+1)} = \left(\sum_i\sum_t x_{it}x_{it}'\right)^{-1}\left(\sum_i\sum_t x_{it}(z_{it} - \mu_i - \nu_{it})\right), \quad \hat{\rho}^{(s+1)} = \left(\sum_i\sum_t \nu_{i,t-1}^2\right)^{-1}\left(\sum_i\sum_t \nu_{i,t-1}\nu_{it}\right),$$

    $$\hat{\sigma}_\mu^{(s+1)} = \widehat{\mathrm{std}}(\mu_i) \quad \text{and} \quad \hat{\sigma}_u^{(s+1)} = \widehat{\mathrm{std}}(\nu_{it} - \hat{\rho}^{(s+1)}\nu_{i,t-1}).$$

    A code sketch of these updates follows.
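For concreteness, here is a minimal Python sketch of this M step, treating the E-step draws as data; array shapes and names are illustrative.

```python
import numpy as np

def m_step(x, z, mu, nu):
    """x: (N,T,J) regressors; z, nu: (N,T) E-step draws; mu: (N,) E-step draws."""
    N, T, J = x.shape
    X = x.reshape(N * T, J)
    resid = (z - mu[:, None] - nu).reshape(N * T)
    beta = np.linalg.lstsq(X, resid, rcond=None)[0]   # pooled OLS for beta
    nu_lag, nu_cur = nu[:, :-1].ravel(), nu[:, 1:].ravel()
    rho = (nu_lag @ nu_cur) / (nu_lag @ nu_lag)       # AR(1) coefficient
    sig_mu = mu.std()
    sig_u = (nu_cur - rho * nu_lag).std()
    return beta, sig_mu, rho, sig_u
```

The PX-M step described next replaces these simple estimators with the estimation of the expanded model and a reduction back to θ.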

PX-SEM

One option for building the L model is to expand the O model to include only contemporaneous correlations, similar to the dynamic factor model in Section 4. Its advantage lies in the MLE being readily obtained in the PX-M step due to a separable likelihood. Appendix E provides the detailed steps and results of this approach. However, to achieve faster convergence, we now propose a more flexible L model.

Let us define $x_i \equiv [x_{i1}'\ \cdots\ x_{iT}']'$, $\epsilon_i \equiv [\epsilon_{i1}\ \cdots\ \epsilon_{iT}]'$, $\nu_i^* \equiv [\nu_{i1}^*\ \cdots\ \nu_{iT}^*]'$, $\epsilon_i^* \equiv [\epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$, and $z_i^* \equiv [z_{i1}^*\ \cdots\ z_{iT}^*]'$. We construct the following L model:

$$y_{it} = 1(z_{it} > 0), \quad z_{it} = \gamma_t'x_i + \mu_i + \nu_{it} + \epsilon_{it}, \quad [\mu_i\ \nu_i'\ \epsilon_i']' = pA[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]' + Bx_i, \quad \nu_{it}^* = \rho\nu_{i,t-1}^* + u_{it}, \tag{L Model}$$

where $\mu_i^*|x_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1}^*|x_i \overset{iid}{\sim} N(0,1)$, $u_{it}|x_i \overset{iid}{\sim} N(0,\sigma_u^2)$, $\epsilon_{it}^*|x_i \overset{iid}{\sim} N(0,1)$, and $\mu_i^*$, $u_{it}$, $\epsilon_{it}^*$ are mutually independent; and subject to $\frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$, $CA\Sigma A'C' = C\Sigma C'$, and $p > 0$, where $C = [1_{T\times 1}\ I_{T\times T}\ I_{T\times T}]$, $\Sigma = \mathrm{var}([\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]')$, $\gamma \equiv [\gamma_1\ \cdots\ \gamma_T]'$, and A is a lower triangular matrix with positive diagonal entries. Alongside θ from the O model, the L model contains a vector of auxiliary parameters $K \equiv [\mathrm{vech}(A)',\ \mathrm{vec}(B)',\ p]'$.

Following the linear expansion method, we introduce latent variables μi*, νi*, and ϵi*, which follow the same distributions as their O model counterparts. However, the E-step draws for [μi νi′ ϵi′]′, given xi, can result from an affine transformation applied to [μi* νi*′ ϵi*′]′, allowing for linear correlations among μi, νi, and ϵi, and for dependence on xi. The scalar p permits scaling zit and thus a deviation of var(ϵit) from the value of 1. Hence, the L model satisfies condition (a): when $B = 0_{(2T+1)\times(J\cdot T)}$, $A = I_{(2T+1)\times(2T+1)}$, and p = 1, the two models coincide: fO(yi,zi,μi,νi|xi;θ)=fL(yi,zi,μi,νi|xi;θ,A=I,B=0,p=1).

The L model has two key constraints: $\frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$ and $CA\Sigma A'C' = C\Sigma C'$. Beyond addressing identification, these constraints simplify the reduction function. Specifically, under these constraints, the L model implies $z_i \overset{d}{=} p(I_{T\times T} \otimes \beta')x_i + pC[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]'$, so the auxiliary parameters p, A, and B have no effect on the conditional distribution of yit given xit. Therefore, regarding condition (b), we find a reduction function, R(θ,K)=θ, satisfying fO(yi|xi;R(θ,K))=fL(yi|xi;θ,K).

Finally, with the L model specified and an initial guess θ̂(0), we iterate through the following two steps for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw (zi,μi,νi) from the posterior distribution fO(zi,μi,νi|yi,xi;θ̂(s)).

  2. PX-M step:

    1. L model estimation:

      $(\hat{\theta}_L, \hat{K}) = \arg\min_{\theta,K} \sum_i \Psi(\theta, K; y_i, z_i, x_i, \mu_i, \nu_i)$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂)=θ̂L

      where Ψ(·) is a known function whose detailed specification is presented in Appendix F. Below, we list the moments involved in function Ψ(·):

      $p\beta$: $E(x_{it}(z_{it} - p\beta'x_{it})) = 0$, $\quad \frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$

      $B$: $E(x_i([\mu_i\ \nu_i'\ \epsilon_i'] - x_i'B')) = 0$

      $\sigma_\mu, p, \Sigma, A$: $CA\Sigma A'C' = C\Sigma C'$, and moment constraints on $\Sigma$

      $\rho, \sigma_u$: $E(\nu_{i,t-1}^*(\nu_{it}^* - \rho\nu_{i,t-1}^*)) = 0$, $\quad \mathrm{var}(\nu_{it}^* - \rho\nu_{i,t-1}^*) = \sigma_u^2$

Simulation Results

We conduct simulations to compare SEM and PX-SEM from a DGP with true parameter values: β=[1.0;0.5], σμ=1.25, ρ=0.7, and σu=0.9.

The initial guess is determined as follows: (a) β̂(0) is the vector of Probit regression coefficients of yit on xit; (b) we impose ρ̂(0)=1, and the rest of the parameters are estimates of the linearly approximated model.[20] In the E-step, we employ a random-walk Metropolis-Hastings sampler with an acceptance rate controlled between 20% and 40%, as sketched below.
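The following is a minimal sketch of one possible random-walk Metropolis-Hastings update for a single individual under the O model of this section; it is illustrative only, and the step size c would be adapted along the burn-in to keep the acceptance rate roughly between 20% and 40%.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(z, mu, nu, y, x, beta, sig_mu, rho, sig_u):
    """Log posterior (up to a constant) of (z, mu, nu) given y (0-1) and x (T,J)."""
    if np.any((z > 0) != y):                     # sign constraint: y = 1(z > 0)
        return -np.inf
    lp = -mu**2 / (2 * sig_mu**2) - nu[0]**2 / 2
    lp -= np.sum((nu[1:] - rho * nu[:-1])**2) / (2 * sig_u**2)
    lp -= np.sum((z - x @ beta - mu - nu)**2) / 2
    return lp

def mh_step(z, mu, nu, y, x, theta, c):
    """One random-walk proposal on (z, mu, nu); returns new state and accept flag."""
    T = z.size
    z_p = z + c * rng.standard_normal(T)
    mu_p = mu + c * rng.standard_normal()
    nu_p = nu + c * rng.standard_normal(T)
    diff = log_post(z_p, mu_p, nu_p, y, x, *theta) - log_post(z, mu, nu, y, x, *theta)
    if np.log(rng.uniform()) < diff:
        return z_p, mu_p, nu_p, True
    return z, mu, nu, False

def tune(c, accept_rate):
    # crude adaptation toward the 20%-40% acceptance window
    if accept_rate < 0.2:
        return c * 0.9
    if accept_rate > 0.4:
        return c * 1.1
    return c
```

Proposals that violate the sign restriction implied by the observed yit are rejected automatically, since the target density is zero there.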

Figure 3 presents the estimation results of one simulation with N = 5000 and T = 8. Specifically, we plot the M-step updates θ̂(s) for 1000 iterations (S = 1000). The blue line depicts each update of SEM, while the orange one represents the PX-SEM updates. In this example, we might be interested in switching to SEM, which is likelihood-based, treating the PX-SEM estimate as an initial guess. Hence, we also present the results of a combined approach, where we run PX-SEM for 500 iterations and then continue with SEM for another 500 iterations, using the average of the last 250 PX-SEM iterations as the initial guess, as shown by the gray line.[21] The green dashed line indicates the true value. The final estimates are the average of the last 250 iterations (S0=250), represented by the blue diamond for SEM, the orange star for PX-SEM, and the gray circle for PX-SEM + SEM.

Fig. 3 SEM and PX-SEM iterations of θ̂(s) from an informed guess.

NOTE: Iterations of SEM (blue line), PX-SEM (orange line), and PX-SEM + SEM (gray line) based on 100 MH draws, compared with the true value (green dashed line). Estimates of SEM (blue diamond), PX-SEM (orange star), and PX-SEM + SEM (gray circle) are the average of the last 250 iterations. Informed initial guess.

From this comparison, it is clear that, starting from the same initial guess, PX-SEM converges almost immediately (within 100 iterations). In contrast, SEM progresses much more slowly, especially for σ̂μ(s), σ̂u(s), and ρ̂(s), and does not converge within 1000 iterations. In terms of the combined approach, the variation across iterations decreases significantly after transitioning to SEM. However, since the final estimates are the average of the last 250 updates, the difference between PX-SEM and PX-SEM + SEM is small.[22]

Appendix L provides additional figures where the x-axis is the cumulative computing time. The gain is significant: SEM takes approximately 3000 sec to run 1000 iterations without converging, while PX-SEM converges almost immediately.[23]

Appendix H compares the algorithms based on random initial guesses. Researchers often run SEM algorithms from various initial guesses and choose one based on specific criteria (e.g., the likelihood value) to avoid obtaining a local maximum, given that getting a "good" initial guess can be challenging. Appendix H shows that the dominance of PX-SEM in convergence rate remains under random initial guesses. Given that this type of exercise is often performed repeatedly in practice, the time saved could be substantial.[24]

Finally, Appendix M provides the overall trajectories of SEM and PX-SEM over iterations. Specifically, we conduct 40 simulations using the same DGP, each estimated under a different set of initial guesses shared by both SEM and PX-SEM. For each parameter, we examine the distribution of updates across 40 trajectories at each specific iteration and how this distribution evolves over the iterations for SEM and PX-SEM, respectively. We reach the same conclusion: PX-SEM significantly improves algorithmic efficiency.

6 Quantile Models

The final type of model for which we consider a PX-SEM approach is the persistent-transitory dynamic quantile model with individual effects, as proposed by Arellano, Blundell, and Bonhomme (2017) (referred to as ABB hereafter). The ABB model does not impose functional-form restrictions on the distributions of individual effects, transitory shocks, or the conditional distributions of the persistent component. Indeed, the flexible dynamics of the persistent component allow for attractive features such as nonlinear persistence, meaning that persistence can vary with the size of shocks and the accumulated history, which is shown to be empirically prominent in earnings dynamics. The model has also been applied to other topics, including firm and health dynamics.

Specifically, we focus on the ABB baseline model with an additive fixed effect, discussed in their Appendix.[25] Denote the τth conditional quantile of νit given νi,t−1 as Qν(νi,t−1,τ) for each τ∈(0,1). The O model to be estimated is as follows:

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = Q_\nu(\nu_{i,t-1}, u_{it}), \quad (u_{it}|\mu_i, u_{i,t-1}, u_{i,t-2},\ldots) \overset{iid}{\sim} \mathrm{Uniform}(0,1), \quad t = 2,\ldots,T, \tag{O Model}$$

where $\epsilon_{it}$ has zero mean, is iid over time, and is independent of $\nu_i \equiv [\nu_{i1}\ \nu_{i2}\ \cdots\ \nu_{iT}]'$ and $\mu_i$. The individual effect $\mu_i$ is assumed to be independent of $\epsilon_i \equiv [\epsilon_{i1}\ \epsilon_{i2}\ \cdots\ \epsilon_{iT}]'$ and $\nu_i$.

To estimate this model, we follow Arellano, Blundell, and Bonhomme (2017) and empirically specify the quantile function of νit given νi,t−1, Qν(νi,t−1,τ), the quantile function of ϵit, Qϵ(τ), the quantile function of νi1, Qν1(τ), and the quantile function of μi, Qμ(τ), as follows:

$$Q_\nu(\nu_{i,t-1},\tau) = \sum_{h=0}^{H}\gamma_h^Q(\tau)\,\varphi_h(\nu_{i,t-1}), \quad Q_\epsilon(\tau) = \gamma^\epsilon(\tau), \quad Q_{\nu_1}(\tau) = \gamma^{\nu_1}(\tau), \quad Q_\mu(\tau) = \gamma^\mu(\tau),$$

where $\varphi_h(\cdot)$ is the Hermite polynomial of order h and the $\gamma(\cdot)$'s are functions to be estimated.

Arellano, Blundell, and Bonhomme (2017) exploit a variation of SEM for estimation, where the M-step involves a series of quantile regressions instead of likelihood optimization, for computational convenience. We first explain their procedure. Let θ denote the set of unknown parameters, including γhQ(τ), γϵ(τ), γν1(τ), and γμ(τ).[26] With an initial guess θ̂(0), we iterate through the following two steps until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw μi and νi from the posterior distribution fO(μi,νi|yi;θ̂(s)).

  2. M step: Update parameters by computing a series of quantile regressions:

    $$\hat{\gamma}^Q(\tau) = \arg\min_{\gamma_0^Q,\ldots,\gamma_H^Q} \sum_{i=1}^{N}\sum_{t=2}^{T}\rho_\tau\left(\nu_{it} - \sum_{h=0}^{H}\gamma_h^Q\varphi_h(\nu_{i,t-1})\right), \quad \hat{\gamma}^\mu(\tau) = \arg\min_{\gamma^\mu}\sum_{i=1}^{N}\rho_\tau(\mu_i - \gamma^\mu),$$

    $$\hat{\gamma}^\epsilon(\tau) = \arg\min_{\gamma^\epsilon}\sum_{i=1}^{N}\sum_{t=1}^{T}\rho_\tau(\epsilon_{it} - \gamma^\epsilon), \quad \hat{\gamma}^{\nu_1}(\tau) = \arg\min_{\gamma^{\nu_1}}\sum_{i=1}^{N}\rho_\tau(\nu_{i1} - \gamma^{\nu_1}),$$

    where $\rho_\tau(u) = u(\tau - 1(u \leq 0))$ is the check function. A code sketch of this M step follows.
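Below is a minimal Python sketch of this M step, assuming probabilists' Hermite polynomials for φh and using the fact that the intercept-only check-function minimizations reduce to empirical τ-quantiles; the regression on lagged ν uses statsmodels' QuantReg. Names and shapes are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from numpy.polynomial.hermite_e import hermeval

def m_step(mu, nu, eps, taus, H=2):
    """mu: (N,) draws; nu: (N,T) draws; eps = y - mu - nu: (N,T); taus: grid."""
    nu_lag, nu_cur = nu[:, :-1].ravel(), nu[:, 1:].ravel()
    # design matrix with columns phi_0(nu_lag), ..., phi_H(nu_lag)
    X = np.column_stack([hermeval(nu_lag, np.eye(H + 1)[h]) for h in range(H + 1)])
    gamma_Q = {tau: sm.QuantReg(nu_cur, X).fit(q=tau).params for tau in taus}
    # intercept-only quantile regressions are just empirical quantiles
    gamma_mu = {tau: np.quantile(mu, tau) for tau in taus}
    gamma_eps = {tau: np.quantile(eps, tau) for tau in taus}
    gamma_nu1 = {tau: np.quantile(nu[:, 0], tau) for tau in taus}
    return gamma_Q, gamma_mu, gamma_eps, gamma_nu1
```

In the PX-SEM variant below, the same routine is applied to the "corrected" draws obtained after estimating the auxiliary matrix A.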

PX-SEM

We expand the O model linearly, targeting the correlations among μi, νi, and ϵi. Define $\nu_i^* \equiv [\nu_{i1}^*\ \cdots\ \nu_{iT}^*]'$ and $\epsilon_i^* \equiv [\epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$. We build the following L model:

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad [\mu_i\ \nu_i'\ \epsilon_i']' = A[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]', \quad \nu_{it}^* = Q_\nu(\nu_{i,t-1}^*, u_{it}), \quad (u_{it}|\mu_i^*, u_{i,t-1}, u_{i,t-2},\ldots) \overset{iid}{\sim} \mathrm{Uniform}(0,1), \quad t = 2,\ldots,T, \tag{L Model}$$

subject to $CA = C$, where $C = [1_{T\times 1}\ I_{T\times T}\ I_{T\times T}]$. Similarly, we assume that $\epsilon_{it}^*$ has zero mean, is iid over time, and is independent of $\nu_i^*$ and $\mu_i^*$, and that $\mu_i^*$ is independent of $\nu_i^*$. The L model contains a vector of auxiliary parameters $K \equiv \mathrm{vec}(A)$.

Consistent with the linear expansion method, the E-step draws [μi νi′ ϵi′]′ are assumed to be outcomes of an affine transformation, through the matrix A, of [μi* νi*′ ϵi*′]′, which follows a distribution identical to its O model counterpart. When A = I, the two models coincide, satisfying condition (a). Moreover, with the constraint CA = C, the L model becomes $y_i = C[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]'$, implying no effect of K on the observed-data likelihood. Thus, regarding condition (b), the reduction function is R(θ,K)=θ.

Finally, with this L model and an initial guess θ̂(0), we iterate through the following two steps for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw μi and νi from posterior distribution fO(μi,νi|yi;θ̂(s))

  2. PX-M step:

    1. L model estimation:

      $(\hat{\theta}_L, \hat{K}) = \arg\min_{\theta,K} \sum_i \Psi(\theta, K; y_i, \mu_i, \nu_i)$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂)=θ̂L

      where Ψ(·) is a known function to be discussed in the following paragraphs.

The inclusion of the matrix A adds complexity to the joint estimation: the separate quantile regressions of the SEM M-step are no longer feasible, and many more parameters are involved.[27] Thus, we employ two strategies: adding extra constraints on the entries of the matrix A, and sequential estimation.

Regarding the extra constraints, we assume that ϵit is orthogonal to μi*, ϵi1*, …, ϵi,t−1*, for t=2,…,T, and that the coefficient of ϵit* with respect to ϵit is homogeneous across all periods. The advantage of doing so is that, by exploiting moment conditions, including the zero correlation among μi*, νi*, and ϵi*, νit* following a first-order Markov process, and ϵit* being iid, we can separately estimate the matrix A through constrained GMM while restricting the number of unknowns to only two. This further facilitates the sequential estimation strategy. Once Â is obtained, we estimate θ through the same series of quantile regressions as in SEM, using $[\hat{\mu}_i^*\ \hat{\nu}_i^{*\prime}\ \hat{\epsilon}_i^{*\prime}]' = \hat{A}^{-1}[\mu_i\ \nu_i'\ \epsilon_i']'$. Appendix I provides a detailed discussion of the estimation process.[28]

Simulation Results

We simulate from the following DGP (N = 5000, T = 6):

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = \rho_\nu\nu_{i,t-1} + (\sigma_{\nu t0} + \sigma_{\nu t1}\nu_{i,t-1}^2)\,v_{it}, \tag{7}$$

where $\mu_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1} \overset{iid}{\sim} N(0,\sigma_{\nu_1}^2)$, $v_{it} \overset{iid}{\sim} N(0,1)$, $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_\epsilon^2)$, and $\mu_i$, $\nu_{i1}$, $v_{it}$, $\epsilon_{it}$ are mutually independent.

We present results for a persistent-transitory process without time-invariant heterogeneity or heteroscedasticity in the persistent shock, imposed by setting μi=0 and σνt1=0. The other parameter values are ρν=0.8, σνt0=0.15, σν1²=0.15, σϵ²=0.05. Appendix J provides simulation results for a DGP with time-invariant heterogeneity and heteroscedasticity, and for another DGP based on a flexible quantile model.

With the simulated data, we estimate the quantile model with time-invariant heterogeneity, as specified previously (the O model), assuming no knowledge of the distribution family. We set the initial guess by estimating the canonical random-walk permanent-transitory model, with details explained in Appendix J. Finally, the highest order of Hermite polynomials for the empirical specification of the νit dynamics, H, is set to two.

Figure 4 presents the results. To provide clearer visualization, instead of plotting the updates of the raw parameters in the quantile model directly (due to their large quantity), we plot the iterations of the estimated parameter values in the parametric model (7). Specifically, in each iteration, we estimate the parametric model using the E-step draws (μ, ν, and ϵ) for SEM and the "corrected" draws (μ̂*, ν̂*, and ϵ̂*) for PX-SEM. Importantly, these estimates are only for visualizing convergence and are not directly involved in the algorithm's updating procedure. Consistent with the previous exercises, PX-SEM exhibits rapid convergence for all parameters, whereas SEM converges much more slowly.

Fig. 4 SEM and PX-SEM iterations, μi=0.

NOTE: Iterations of SEM (blue solid line) and PX-SEM (orange solid line) based on 100 MH draws, compared with the true value (green dashed line). In each iteration, we estimate the parametric model, (7), using E-step draws μ, ν, ϵ for SEM and "corrected" draws μ̂*, ν̂*, ϵ̂* for PX-SEM. These estimates are only used for visualizing the convergence and are not directly involved in any algorithm. SEM estimates (blue diamond) and PX-SEM estimates (orange star) are both calculated as the average of the last 100 iterations. Informed initial guess.

Appendix L provides figures complementary to Figure 4, with cumulative computing time as the x-axis, showing a significant time gain: SEM takes over 5500 sec for 500 iterations without clear convergence, whereas PX-SEM converges almost immediately.[29] Appendix M presents the overall trajectories of 40 simulations. Finally, Appendix K shows simulation results for different sample sizes.

After 500 iterations, averaging the last 100 updates as temporary estimates, we simulate from the estimated models and compare the model fit, focusing on the persistence of the νit dynamics, $\partial\nu_{it}/\partial\nu_{i,t-1}$ evaluated at $(\nu_{i,t-1},\tau)$, one of the characteristics of interest (Arellano, Blundell, and Bonhomme 2017). Figure 5 displays heatmaps showing the absolute distance between the estimated persistence and the true persistence at different levels of the shock τ and of νi,t−1, for SEM and PX-SEM, respectively. Overall, PX-SEM shows a better model fit.

Fig. 5 Distance between estimated persistence and true persistence, μi=0. NOTE: The absolute distance between the estimated νt persistence (based on the average of the last 100 iterations) and the true persistence for each level of the shock τ and νi,t−1.

7 Conclusions

This article introduces new estimation algorithms for dynamic panel data models with latent variables. By combining the parameter expansion ideas with the SEM algorithm, we develop the PX-SEM algorithm, which could facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency.

Sharing the same E-step as SEM, PX-SEM differs in the M-step. Instead of estimating the original model (the O model), the M-step of PX-SEM requires estimating an expanded model (the L model). Effectively, we propose new estimators for the pseudo-data within iterations, accounting for the misspecification of the O model for draws based on parameter values far from the truth. Thus, PX-SEM can leverage additional model information to effectively "correct" the M-step updates in progressing to more accurate ones.

Moreover, the article proposes a method for constructing the L model through linear expansion and presents new PX-SEM-based estimation algorithms for three types of dynamic panel data models: factor models, discrete choice models, and quantile models.

Regarding statistical properties, we establish the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM with a smaller expected fraction of missing information compared to the standard O model based SEM, implying a faster global convergence rate and a smaller variance for the limiting stationary distribution. Finally, simulations show that PX-SEM can significantly improve the algorithmic efficiency relative to SEM.

Supplementary Materials

The online supplement consists of the following appendices. Appendix A presents illustrative figures for the intuition behind PX-SEM and comparisons among different L models and M-step estimators using the toy model. Appendix B provides a detailed proof for Section 3. Appendix C explains the equivalence through reparameterization among L models. Appendices E, F, and G discuss alternative L models, detailed L model estimation procedures, and PX-SEM methods applied to two extensions for the discrete choice model in Section 5. Appendix I provides detailed L model estimation procedures for the quantile model in Section 6. Appendices D, H, J–M present more simulation results for the three types of models discussed in Sections 4-6, with more iterations, different sample sizes, cumulative computing time, different initial guesses, and overall trajectories.


Acknowledgments

This work is based on Chapter 2 of my PhD thesis at CEMFI, which received the Enrique Fuentes Quintana Funcas Award in Economics, Finance and Business 2021–2022. I am deeply grateful to Manuel Arellano for his invaluable support and advice. I also thank Martin Almuzara, Dante Amengual, Dmitry Arkhangelsky, Orazio Attanasio, Richard Blundell, Stéphane Bonhomme, Micole De Vera, Jose Gutierrez, Pedro Mira, Josep Pijoan-Mas, Enrique Sentana, Liyang Sun, and seminar participants at IE University, CEMFI, the International Panel Data Conference, EEA-ESEM, and the SAEe meetings for valuable comments and suggestions. Two anonymous referees and an Associate Editor have helped greatly improve the article. All errors are my sole responsibility.

Disclosure Statement

The author reports there are no competing interests to declare.

Additional information

Funding

Grants PID2022-143184NA-I00, funded by MCIU/AEI/10.13039/501100011033 and by FEDER, UE; BES-2017-082506, funded by MCIU/AEI/10.13039/501100011033 and by “ESF Investing in your future”; and MDM-2016-0684, funded by MCIU/AEI/10.13039/501100011033, are gratefully acknowledged.

Notes

1 Arellano and Bonhomme (2017) discuss the potential of SEM in nonlinear panel data analysis.

2 For instance, Arellano et al. (2023) develop a sequential Monte Carlo sampler for the E-step.

3 Wei (2022) applies the algorithms developed in this article to a substantive analysis of the earnings and employment dynamics of older workers, which brings together elements of the three types of panel models considered here.

4 Liu, Rubin, and Wu (1998) is based on the EM algorithm; Liu and Wu (1999) applies the parameter expansion technique to Bayesian inference; Lavielle and Meza (2007) combines the parameter expansion technique with Monte Carlo EM (Wei and Tanner 1990).

5 See Wu (1983) for further discussion.

6 Since k does not affect the observed-data likelihood of the L model, fO(yi;σL)=fL(yi;σL,1)=fL(yi;σL,kL).

7 Note that the reduction function satisfies R(θ,K0)=θ.

8 In the moment-based PX-EM case, if the fixed point with K = K0 exists, then it will satisfy ∑i∫ΨO(Yi*,Wi;θ̂)fO(Yi*|Wi;θ̂)dYi*=0.

9 Note that the E-step draws of PX-SEM are based on the O model under the guess θ̂(s). This is equivalent to making draws from the L model under the guess Θ̂(s)=[θ̂(s);K0], due to condition (a).

10 Appendix B shows that for any L model, an alternative L model can be found by reparameterization, which yields identical updates of θ with the reduction function R(θ,K)=θ.

11 AV being positive definite implies that Hθθ−1Vθθ is also positive definite.

12 The author thanks an anonymous referee for his/her encouragement to develop this result.

13 With a fixed S0, the PX-SEM and SEM estimators will in general give rise to different asymptotic variances. Expressions for these variances are provided in Appendix B.

14 In this expression, Yi* also includes the error terms in the measurement equation G(·).

15 Extensions include a unit-specific matrix A (Section 4) and the addition of exogenous regressors Xi (Section 5).

16 The constraint might not be necessary in applications where reduction functions are easy to find.

17 The method can be easily adapted to models with (a) unknown persistence in the νt process, (b) multiple latent factors, (c) ϵit following an MA process, etc.

18 Chen (Citation2016) proposes a fixed effects EM estimator for a class of nonlinear panel data models.

19 Appendix G presents two extensions: (a) allowing for the dependence of μi and νi1 on xi1, and (b) a Logit specification (with strategies for the quantile model in the next section).

20 We approximate the model as follows: yit=Φ(βxit+μi+νit)+ηit≈0.5+0.25(βxit+μi+νit)+ηit.

21 We could also endogenize the switching procedure by using metrics like the distance between K̂ and K0 or the likelihood difference between the L model and the O model to guide our transition to SEM.

22 Whether PX-SEM requires more iterations in this example due to its higher volatility, impacting total computing time, is beyond this article’s scope, especially considering the PX-SEM + SEM option.

23 The results are obtained using a Mac Mini (M1, 2020) with a single processor core. We apply the Metropolis-Hastings algorithm for the E-step, with the first 100 iterations designated as a burn-in phase.

24 Appendix H also presents simulation results with more iterations and different sample sizes.

25 In practice, standard SEM generally performs well in estimating the ABB baseline model without the fixed effect, but estimation becomes challenging once a fixed effect is included. We also remove age effects.

26 Unknown parameters also include tail parameters. The functions γ(·) are piecewise-polynomial interpolating splines on a grid [τ1,τ2],[τ2,τ3],…,[τL−1,τL], and the tails on (0,τ1] and [τL,1) are modeled parametrically. See Appendix B in Arellano, Blundell, and Bonhomme (2017) for more details.

27 In the discrete choice model, the matrix A and other auxiliary parameters can be easily estimated in the PX-M step by focusing solely on the first two moments due to the normality assumption.

28 Similar strategies are used to estimate a Logit model in Appendix G.

29 The results are obtained using a Mac Mini (M1, 2020) with a single processor core. We apply the Metropolis-Hastings algorithm for the E-step, with the first 100 iterations designated as a burn-in phase.

References

  • Arcidiacono, P., and Jones, J. B. (2003), “Finite Mixture Distributions, Sequential Likelihood and the EM Algorithm,” Econometrica, 71, 933–946. DOI: 10.1111/1468-0262.00431.
  • Arellano, M., Blundell, R., and Bonhomme, S. (2017), “Earnings and Consumption Dynamics: A Nonlinear Panel Data Framework,” Econometrica, 85, 693–734. DOI: 10.3982/ECTA13795.
  • Arellano, M., Blundell, R., Bonhomme, S., and Light, J. (2023), “Heterogeneity of Consumption Responses to Income Shocks in the Presence of Nonlinear Persistence,” Journal of Econometrics, 240, 105449. DOI: 10.1016/j.jeconom.2023.04.001.
  • Arellano, M., and Bonhomme, S. (2016), “Nonlinear Panel Data Estimation via Quantile Regressions,” The Econometrics Journal, 19, C61–C94. DOI: 10.1111/ectj.12062.
  • ———(2017), “Nonlinear Panel Data Methods for Dynamic Heterogeneous Agent Models,” Annual Review of Economics, 9, 471–496.
  • Bai, J., and Ng, S. (2008), Large Dimensional Factor Analysis, Foundations and Trends® in Econometrics (Vol. 3), pp. 89–163, Hanover, MA: Now Publishers. DOI: 10.1561/0800000002.
  • Chen, M. (2016), “Estimation of Nonlinear Panel Models with Multiple Unobserved Effects,” Working Paper, Department of Economics, University of Warwick.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–22. DOI: 10.1111/j.2517-6161.1977.tb01600.x.
  • Diebolt, J., and Celeux, G. (1993), “Asymptotic Properties of a Stochastic EM Algorithm for Estimating Mixing Proportions,” Stochastic Models, 9, 599–613. DOI: 10.1080/15326349308807283.
  • Geweke, J. (1977), “The Dynamic Factor Analysis of Economic Time Series,” in Latent Variables in Socio-Economic Models, eds. D. J. Aigner and A. S. Goldberger, Amsterdam: North-Holland.
  • Hyslop, D. R. (1999), “State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women,” Econometrica, 67, 1255–1294. DOI: 10.1111/1468-0262.00080.
  • Keane, M. P. (2013), “Panel Data Discrete Choice Models of Consumer Demand,” in The Oxford Handbook of Panel Data, ed. B. H. Baltagi, pp. 548–582, Oxford: Oxford University Press.
  • Lavielle, M., and Meza, C. (2007), “A Parameter Expansion Version of the SAEM Algorithm,” Statistics and Computing, 17, 121–130. DOI: 10.1007/s11222-006-9007-6.
  • Liu, C., Rubin, D. B., and Wu, Y. N. (1998), “Parameter Expansion to Accelerate EM: The PX-EM Algorithm,” Biometrika, 85, 755–770. DOI: 10.1093/biomet/85.4.755.
  • Liu, J. S., and Wu, Y. N. (1999), “Parameter Expansion for Data Augmentation,” Journal of the American Statistical Association, 94, 1264–1274. DOI: 10.1080/01621459.1999.10473879.
  • Nielsen, S. F. (2000), “The Stochastic EM Algorithm: Estimation and Asymptotic Results,” Bernoulli, 6, 457–489. DOI: 10.2307/3318671.
  • Pastorello, S., Patilea, V., and Renault, E. (2003), “Iterative and Recursive Estimation in Structural Nonadaptive Models,” Journal of Business & Economic Statistics, 21, 449–509. DOI: 10.1198/073500103288619124.
  • Stock, J. H., and Watson, M. W. (2006), “Forecasting with Many Predictors,” Handbook of Economic Forecasting, 1, 515–554.
  • ———(2011), “Dynamic Factor Models,” in Oxford Handbook of Economic Forecasting, eds. Michael P. Clements and David F. Hendry, Oxford: Oxford University Press.
  • Wei, G. C., and Tanner, M. A. (1990), “A Monte Carlo Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms,” Journal of the American Statistical Association, 85, 699–704. DOI: 10.1080/01621459.1990.10474930.
  • Wei, S. (2022), “Income, Employment and Health Risks of Older Workers,” Documentos de Trabajo (CEMFI), (5), 1.
  • Wu, C. J. (1983), “On the Convergence Properties of the EM Algorithm,” The Annals of Statistics, 11, 95–103. DOI: 10.1214/aos/1176346060.