
Estimating Latent-Variable Panel Data Models Using Parameter-Expanded SEM Methods

Abstract

This article presents new estimation algorithms for three types of dynamic panel data models with latent variables: factor models, discrete choice models, and persistent-transitory quantile processes. The new methods combine the parameter expansion (PX) ideas of Liu, Rubin, and Wu with the stochastic expectation-maximization (SEM) algorithm in likelihood and moment-based contexts. The goal is to facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency. This is achieved by specifying expanded models within the M step. Effectively, we are proposing new estimators for the pseudo-data within iterations that take into account the fact that the model of interest is misspecified for draws based on parameter values far from the truth. We establish the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard SEM based on the original model, implying a faster global convergence rate. Finally, in simulations we show that the new algorithms significantly improve the convergence speed relative to standard SEM algorithms, sometimes dramatically so, by reducing the total computing time from hours to a few minutes.

1 Introduction

This article presents new estimation algorithms for dynamic panel data models with latent variables. Dynamic panel data models are widely used in applied work today. They tend to exhibit many latent variables over multiple periods (e.g., time-invariant, persistent, and transitory components), which are important for capturing unobserved heterogeneity and dynamic responses (Arellano and Bonhomme 2017). However, the presence of latent variables poses challenges for estimation.

Iterative methods like the stochastic expectation-maximization (SEM) algorithm can be useful tools for estimating models with latent variables (Diebolt and Celeux 1993).[1] Specifically, as a simulated version of the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), SEM iterates between an E-step, where we draw latent variables from the posterior distribution of the model of interest given observables, and an M-step, where we estimate the model as if the draws were observed, until the parameters converge to the stationary distribution. It simplifies estimation because it replaces the complex optimization problem, which involves multiple integrals due to latent variables, with a series of much simpler optimization problems under pseudo-complete data.

However, slow convergence, an often-voiced criticism of EM and its variants, tends to diminish its practical appeal. Indeed, in practice the slow-convergence issue is even more pronounced: researchers often need to run the algorithms multiple times with different initial guesses and select the result based on criteria such as the likelihood value, both to mitigate the negative effects of a "bad" initial guess and to address the possibility of the algorithm converging to a local maximum. Recent research has explored alternative samplers for latent variables when performing the E-step to improve sampling efficiency and stability.[2] In contrast, this article focuses on potential improvements in the M-step.

In this article, we develop a new estimation method, the PX-SEM algorithm, by combining the parameter expansion ideas in Liu, Rubin, and Wu (1998) with the SEM algorithm. The goal is to facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency. Even though the general concept of the PX-SEM algorithm applies to various models, we focus on three types of dynamic panel data models containing rich latent-variable structures, where slow convergence issues are exacerbated and where we expect the method to be particularly fruitful: (a) dynamic factor models, (b) random effects discrete choice models with persistent and transitory components, and (c) persistent-transitory dynamic quantile models with individual effects.[3]

The PX-SEM algorithm consists of two steps: an E-step, where we draw values of latent variables from the posterior distribution, and a PX-M-step, where we update parameters. Sharing the same E-step as the SEM algorithm, PX-SEM replaces the SEM M-step estimator with a more robust one, taking into account the possibility that E-step draws could violate model assumptions when parameter guesses are far from the true value. The PX-M-step estimator is able to leverage additional information from the model itself, effectively "correcting" the M-step updates so that they progress toward more accurate values.

To implement the PX-SEM algorithm, one must construct an expanded model, the L model, which needs to satisfy two conditions. First, the L model must nest the original model, the O model. Second, there must exist a reduction function, a mapping from the L model parameters to the O model parameters, keeping the observed-data likelihood unchanged. After constructing a suitable L model, we can iterate between the E step and the PX-M step, which involves (a) estimating the L model and (b) mapping back to the O model parameters through the reduction function.

There are different ways to construct L models. All else being equal, a more flexible L model should improve the convergence rate. However, since our ultimate goal is to reduce the total computing time, we also need to consider the time spent in each iteration for estimating the L model and converting it to the O model. Therefore, taking these two factors into account, this article proposes a method to expand the model linearly. Linear expansion addresses the potential violation of zero-correlation assumptions.

In terms of statistical properties, Liu, Rubin, and Wu (1998) prove the monotone convergence of the parameter-expanded EM algorithm and its superior rate of convergence relative to its parent EM. By combining the results of Nielsen (2000) and Arellano and Bonhomme (2016), this article establishes the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard O model based SEM, implying a faster global convergence rate and a smaller variance for the limiting stationary distribution. Finally, in the simulations, we show that PX-SEM can significantly improve algorithmic efficiency compared to the standard SEM algorithm, sometimes dramatically so. For example, in our numerical calculations for discrete choice and quantile models, SEM has still not converged after running for 50–80 min, whereas PX-SEM converges within 2–3 min.

This article belongs to an expanding literature that considers the application of the EM algorithm (Dempster, Laird, and Rubin 1977) and its variants in estimating latent variable models (Diebolt and Celeux 1993; Liu, Rubin, and Wu 1998; Arcidiacono and Jones 2003; Pastorello, Patilea, and Renault 2003; Arellano and Bonhomme 2016; Chen 2016; Arellano et al. 2023, among others). This article contributes to this literature in two ways. First, it develops a new estimation method, PX-SEM, which combines the parameter expansion idea with the SEM algorithm.[4] The method offers appealing theoretical properties and the potential to enhance algorithmic efficiency, which is particularly valuable for complex models such as nonlinear panel data models, where SEM may encounter slow convergence issues. Second, the article proposes a specific class of linear expansions for implementing PX-SEM and develops new estimation algorithms for three types of latent-variable panel data models with enhanced algorithmic efficiency.

The article proceeds as follows. Section 2 illustrates the difference between the standard stochastic EM algorithm and the parameter-expanded stochastic EM algorithm using a simple toy model. In Section 3, a formal definition of PX-SEM is provided, along with a discussion of its statistical properties and implementation based on linear expansions. Sections 4–6 develop PX-SEM methods for three types of latent-variable panel data models: dynamic factor models, discrete choice models, and persistent-transitory dynamic quantile models, respectively. Finally, Section 7 concludes.

2 Toy Model

Using a simple toy model, this section compares the parameter-expanded stochastic EM (PX-SEM) algorithm with the standard stochastic EM (SEM) algorithm and provides the intuition behind PX-SEM.

Consider the following model we want to estimate, denoted as the O model:

$$y_i = y_i^* + \epsilon_i, \quad \text{where } \begin{bmatrix} y_i^* \\ \epsilon_i \end{bmatrix} \overset{iid}{\sim} N\left(0, \begin{bmatrix} \sigma^2 & 0 \\ 0 & 1 \end{bmatrix}\right). \tag{O Model}$$

The observed outcomes are y1,…,yN, and the latent variables whose distribution is of interest are y1*,…,yN*. The only unknown parameter is the standard deviation σ.

SEM

To implement the SEM algorithm, we start with an initial guess of the unknown parameter σ̂(0) and then iterate the following two steps for s=0,1,…,S until σ̂(s) converges to the stationary distribution:

1. Stochastic E step: Draw yi* from the posterior distribution fO(yi*|yi;σ̂(s)), where fO(·) denotes a density function of the O model.

2. M step: Estimate the O model and update σ̂(s+1), that is, σ̂(s+1)=std̂(yi*). The final estimator is the average of the last S0 iterations, $\hat{\sigma} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\sigma}^{(s)}$.

The nonstochastic version, the EM algorithm, is effective because it improves the observed-data likelihood in each iteration:

$$\sum_i \log f_O(y_i;\hat{\sigma}^{(s+1)}) - \sum_i \log f_O(y_i;\hat{\sigma}^{(s)}) \geq Q(\hat{\sigma}^{(s+1)}|\hat{\sigma}^{(s)}) - Q(\hat{\sigma}^{(s)}|\hat{\sigma}^{(s)}) \geq 0, \tag{1}$$

where $Q(\hat{\sigma}^{(s+1)}|\hat{\sigma}^{(s)}) = \sum_i \int \log f_O(y_i, y_i^*;\hat{\sigma}^{(s+1)})\, f_O(y_i^*|y_i;\hat{\sigma}^{(s)})\, dy_i^*$.[5]

PX-SEM

Now we introduce the PX-SEM algorithm. Like SEM, PX-SEM comprises an E step for draws of latent variables and an M step for updating parameters. While sharing the same E-step as SEM, PX-SEM’s M step involves (a) estimating an expanded model, the L model, and (b) mapping L model parameters to the O model parameters.

For this toy model, we propose the following L model:

$$y_i = y_i^* + \epsilon_i, \quad \text{where } \begin{bmatrix} y_i^* \\ \epsilon_i \end{bmatrix} \overset{iid}{\sim} N\left(0,\, K\begin{bmatrix} \sigma^2 & 0 \\ 0 & 1 \end{bmatrix}K'\right), \quad K = \begin{bmatrix} k & 0 \\ 1-k & 1 \end{bmatrix}. \tag{L Model}$$

In addition to σ, the L model also contains an auxiliary parameter k. It is easy to verify that when k = 1 the two models coincide, that is, fO(yi*,yi;σ)=fL(yi*,yi;k=1,σ); and when k ≠ 1, the L model expands the O model by allowing for a nonzero correlation between yi* and ϵi, as cov(yi*,ϵi)=k(1−k)σ². Moreover, the L model is unidentified with observables: for any L model with parameter values σL and kL, the O model with parameter value σ=σL has the same observed-data likelihood, that is, fO(yi;σL)=fL(yi;σL,kL).[6]

To implement PX-SEM, we begin with an initial guess of the unknown parameter σ̂(0) and then iterate the following two steps for s=0,1,…,S until σ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw yi* from the posterior distribution fO(yi*|yi;σ̂(s))

  2. PX-M step: Update parameters by

    1. Estimate the L model: $\hat{k}_L = \widehat{\mathrm{var}}(y_i^*)\,/\,\widehat{\mathrm{cov}}(y_i^*, y_i)$, $\quad \hat{\sigma}_L^{(s+1)} = \widehat{\mathrm{std}}(y_i^*)\,/\,|\hat{k}_L|$

    2. Reduction: mapping from (σ̂L(s+1),k̂L) to σ̂(s+1) maintaining the observed-data likelihood, that is fO(yi;σ̂(s+1))=fL(yi;σ̂L(s+1),k̂L), and thus σ̂(s+1)=σ̂L(s+1)

The final estimator is the average of the last S0 iterations: $\hat{\sigma} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\sigma}^{(s)}$. As we see, the difference between the two methods lies in the M-step estimators: the PX-SEM estimator is adjusted by the factor 1/|k̂L|.
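To make the comparison concrete, below is a minimal Python sketch of both algorithms for the toy model, assuming direct sampling in the E-step via the closed-form Gaussian posterior implied by the O model; all names are illustrative rather than the article's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_true = 5000, 2.0
y = rng.normal(0, sigma_true, N) + rng.normal(0, 1.0, N)  # y = y* + eps

def e_step(y, sig):
    # Posterior of y* | y is Gaussian with mean r*y and variance r,
    # where r = sig^2 / (sig^2 + 1)
    r = sig**2 / (sig**2 + 1.0)
    return r * y + np.sqrt(r) * rng.standard_normal(y.size)

def run(px, sig0=0.5, S=300, S0=100):
    sig, path = sig0, []
    for _ in range(S):
        ystar = e_step(y, sig)              # stochastic E step
        if px:                              # PX-M step
            C = np.cov(ystar, y)
            k = C[0, 0] / C[0, 1]           # k_L = var(y*) / cov(y*, y)
            sig = ystar.std() / abs(k)      # sigma_L, equal to the reduction
        else:                               # standard M step
            sig = ystar.std()
        path.append(sig)
    return np.mean(path[-S0:])              # average of last S0 iterations

print("SEM:", run(px=False), " PX-SEM:", run(px=True))
```

Starting from a guess far below the truth, the PX-SEM path reaches the neighborhood of σ = 2 in far fewer iterations, which is exactly the effect of the 1/|k̂L| adjustment described above.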

Figure 1 illustrates how PX-SEM has the potential to enhance algorithmic efficiency through the utilization of the auxiliary parameter k. Specifically, the first panel depicts a scatterplot of simulations generated by the data generating process (DGP), the O model with a true value of σ = 2. The x-axis and y-axis display yi* and ϵi, respectively, but only their sum yi=yi*+ϵi is used for estimation.

Fig. 1 Data and E-step draws under different guess values of σ.

The second panel is the scatterplot of (yi*, yi−yi*) where yi* are the E-step draws under the true value, that is, yi*∼fO(yi*|yi;σ=2), and the third panel is the scatterplot of (yi*, yi−yi*) where yi* are the E-step draws under an incorrect value σ = 1, that is, yi*∼fO(yi*|yi;σ=1). Contrary to the second panel, where the E-step draws are generated under the correct guess and there is no significant correlation between yi* and yi−yi*, the third panel presents a significant positive correlation between the E-step draws yi* and yi−yi*, which is assumed to be zero by the O model. This "false" positive correlation arises because the draws are taken under the wrong premise that the variance of yi* does not deviate significantly from one.

As a consequence, for the E-step draws in the second panel, both the M-step and PX-M-step estimators are consistent: the SEM one is under the correct constraint k = 1. However, in the case of the third panel, the M-step of SEM ignores the violation of the zero-correlation assumption at the current draws, while PX-SEM takes the "false" correlation into account by adding the parameter k. By construction, the extra flexibility induced by k leads to a better model fit of the PX-M step for the current draws and, thus, a larger pseudo-complete-data likelihood. As we will show in Section 3, similar to EM, PX-EM can improve the observed-data likelihood in each iteration by transmitting gains from the pseudo-complete-data likelihood. Therefore, improving model fit as in the PX-M step essentially equates to improving the lower bound of the inequality in (1). Finally, by mapping the L model parameters to the O model parameters while keeping the observed-data likelihood unchanged, we preserve the "gains" in likelihood. Intuitively, PX-SEM replaces the SEM M-step with a more robust estimator that leverages additional model information, namely, that there is a linear correlation between the E-step draws yi* and yi−yi* even though there should not be. Appendix A provides more explanation with illustrative figures.

Regarding the choice of the L model, there are many different ways of expanding the O model. A more flexible expansion should help increase the convergence rate. Appendix A compares different L models and PX-M-step estimators using the toy model. However, in practice, when we estimate more complex models, we must consider how easily we can estimate the L model and reduce it to the O model, to avoid spending too much time in each iteration and increasing the total computing time. With this consideration in mind, we propose a linear expansion method in Section 3 and discuss its applications in Sections 4–6.

As a final comment, the PX-SEM algorithm aims to enhance algorithmic efficiency even when E-step draws are appropriately obtained. For instance, as demonstrated in Appendix A, PX-SEM improves the convergence rate when the E-step draws are based on direct sampling. In more complex models where direct sampling is not feasible and methods such as MCMC are required, the PX-SEM algorithm demonstrates lower sensitivity to poorly generated E-step draws. This can be justified if one of the ways that "bad" draws manifest themselves is through violations of model assumptions. Moreover, an MCMC-based E-step generally requires more time for each iteration, so reducing the iterations needed for convergence can significantly reduce the total computing time. For example, as shown in Sections 5 and 6, while SEM can take more than 3000 sec without showing signs of convergence, PX-SEM converges within a couple of minutes.

3 Parameter Expanded Stochastic EM Algorithm

This section begins by defining the PX-SEM algorithm. Following that, we discuss its statistical properties and explore the reasons behind its potential to enhance algorithmic efficiency. Specifically, we establish the asymptotic equivalence between PX-SEM and an alternative SEM with a reduced expected fraction of missing information. Finally, we propose a general approach, the linear expansion method, for the implementation.

3.1 Definition of PX-SEM Algorithm

Setup

Let {Yi,Xi,Yi*} for i=1,…,N be iid random variables following the distribution of the O model, denoted as $f_O(Y_i|X_i;\theta) = \int f_O(Y_i, Y_i^*|X_i;\theta)\, dY_i^*$. Here Wi ≡ [Yi Xi] represents the observable vector, Yi* is the latent-variable vector, and θ is the unknown parameter vector to be estimated. The true value, θ̄, satisfies the equation E(ΨO(Yi*,Wi;θ̄))=0, where ΨO(·) represents the score function of the complete O model in the case of the likelihood-based PX-SEM algorithm and moment restrictions in the case of the moment-based one. The law of iterated expectations implies that the true value θ̄ also satisfies the equation:

$$E\left(\int \Psi_O(Y_i^*, W_i;\bar{\theta})\, f_O(Y_i^*|W_i;\bar{\theta})\, dY_i^*\right) = 0. \tag{2}$$

Define θ̂ as the solution of the integrated moment restrictions of the original O model, $\sum_i \int \Psi_O(Y_i^*, W_i;\hat{\theta})\, f_O(Y_i^*|W_i;\hat{\theta})\, dY_i^* = 0$, a sample analogue of (2). In the case of a likelihood-based algorithm, θ̂ is the MLE.

Denote the expanded model, the L model, as $f_L(Y_i|X_i;\theta,K) = \int f_L(Y_i, Y_i^*|X_i;\theta,K)\, dY_i^*$, where K represents the auxiliary parameter vector. The expanded L model needs to satisfy two conditions: (a) the L model nests the O model: there exists K = K0 such that fO(Yi,Yi*|Xi;θ)=fL(Yi,Yi*|Xi;θ,K0) for all θ; and (b) existence of a reduction function: there exists a mapping from the L model parameters to the O model parameters, the reduction function θ=R(θL,K), such that the observed-data likelihood is preserved: fO(Yi|Xi;R(θL,K))=fL(Yi|Xi;θL,K) for all θL, K.

Let ΨLθ(·) denote the score function of the L model with respect to θ in the case of the likelihood-based PX-SEM algorithm and the same moment restrictions as ΨO(·) in the case of the moment-based one. Under condition (a), we have ΨLθ(Yi*,Wi;θ,K0)=ΨO(Yi*,Wi;θ), and thus E(ΨLθ(Yi*,Wi;θ̄,K0))=0. Additionally, assuming that K is identified when we observe Yi*, meaning that there exists ΨLK(·) such that E(ΨLK(Yi*,Wi;θ̄,K0))=0, we then have E(ΨL(Yi*,Wi;θ̄,K0))=0, where ΨL(·)=[ΨLθ(·)′ ΨLK(·)′]′. By the law of iterated expectations, this implies:[7]

$$E\left(\int \Psi_L(Y_i^*, W_i;\bar{\theta},K_0)\, f_O(Y_i^*|W_i;R(\bar{\theta},K_0))\, dY_i^*\right) = 0. \tag{3}$$

Definition of the PX-SEM algorithm

Before we outline the general steps of PX-SEM, let us take a look at the SEM algorithm for comparison. SEM is an iterative algorithm where, in the E step, we make draws of latent variables Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s)) under the parameter guess θ̂(s), and in the M step, we update it to θ̂(s+1), which satisfies $\sum_i \Psi_O(Y_i^*, W_i;\hat{\theta}^{(s+1)}) = 0$. This stochastic version differs from the original EM algorithm in that it replaces the integral in (2) with latent draws.

In contrast, the PX-SEM algorithm proposes iterations that are better linked to (3): while we still make draws of latent variables Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s)) under the parameter guess θ̂(s), we use the expanded model to update the parameter to θ̂(s+1), satisfying θ̂(s+1)=R(θ̂L,K̂), where $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}_L,\hat{K}) = 0$.

The general steps are as follows: starting with an initial guess θ̂(0), we iterate the following two steps for s=0,1,2,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s))

  2. PX-M step: Update parameters by

    1. Estimate the L model: $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}_L,\hat{K}) = 0$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂) subject to fO(Yi|Xi;θ̂(s+1))=fL(Yi|Xi;θ̂L,K̂)

Reduction function

In practice, one of the challenges in implementing the PX-SEM algorithm is to find the reduction function associated with the L model. However, if we construct the L model such that the auxiliary parameter K does not affect the observed-data likelihood, that is fO(Yi|Xi;θL)=fL(Yi|Xi;θL,K), then immediately, the reduction function becomes R(θ,K)=θ. As a result, PX-SEM can be simplified as follows:

  1. Stochastic E step: Draw Yi* from the posterior distribution fO(Yi*|Wi;θ̂(s))

  2. PX-M step: Update parameters by solving $\sum_i \Psi_L(Y_i^*, W_i;\hat{\theta}^{(s+1)},\hat{K}) = 0$

Comparing the PX-M and M steps, we find that the M-step estimator is a constrained version of the PX-M-step estimator with the constraint K = K0. Intuitively, when the E-step draws Yi* are generated under a guess θ̂(s) close enough to the true value, the M-step estimator is under the correct restriction, leading both the M-step and PX-M-step estimators to be consistent at that iteration, as indicated by (2) and (3).

However, when the guess θ̂(s) deviates significantly from the true value, causing the draws Yi* to violate certain model assumptions, we would expect the PX-SEM estimator to exhibit greater “robustness” due to extra flexibility brought by auxiliary parameter K. As shown in the following section, the likelihood-based PX-M step can achieve a larger pseudo-complete data likelihood improvement, which could further lead to a greater observed-data likelihood improvement compared to the M step.

3.2 Statistical Properties

This section focuses on the statistical properties of likelihood-based algorithms. We will first show that the parameter-expanded EM algorithm exhibits nonnegative improvement in the observed-data log-likelihood at each iteration. Next, for the stochastic version, PX-SEM, we will establish its asymptotic equivalence to an alternative SEM algorithm with a smaller expected fraction of missing information compared to the standard O model based SEM, which implies a faster global convergence rate and a smaller variance for the limiting stationary distribution in a semipositive definite order.

Convergence

Following Liu, Rubin, and Wu (1998), we now prove that the PX-EM algorithm increases the observed-data likelihood in each iteration. The change in the observed-data log-likelihood between iterations, $\sum_i \log f_O(Y_i|X_i;\hat{\theta}^{(s+1)}) - \sum_i \log f_O(Y_i|X_i;\hat{\theta}^{(s)})$, equals

$$\sum_i \log f_L(Y_i|X_i;\hat{\theta}_L,\hat{K}) - \sum_i \log f_L(Y_i|X_i;\hat{\theta}^{(s)},K_0) \geq Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0) \geq 0,$$

where $Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) = \sum_i \int \log f_L(Y_i, Y_i^*|X_i;\hat{\theta}_L,\hat{K})\, f_L(Y_i^*|W_i;\hat{\theta}^{(s)},K_0)\, dY_i^*$.

The equality holds because of both condition (a): when K = K0, the two models coincide, meaning fO(Yi|Xi;θ̂(s))=fL(Yi|Xi;θ̂(s),K0), and condition (b): the reduction function exists, and thus by construction fO(Yi|Xi;θ̂(s+1))=fL(Yi|Xi;θ̂L,K̂). We then apply Gibbs' inequality. Finally, the definition of θ̂L, namely $(\hat{\theta}_L,\hat{K}) \equiv \arg\max_{\theta,K} Q(\theta,K|\hat{\theta}^{(s)},K_0)$, leads to a nonnegative change in the observed-data likelihood. Notably, the result also implies that (θ̂,K0) is a fixed point of PX-EM, where θ̂ represents the MLE.[8]

As a final remark, the L model nesting the O model implies the following inequality:

$$Q(\hat{\theta}_L,\hat{K}|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0) \geq Q(\hat{\theta}_{EM}^{(s+1)},K_0|\hat{\theta}^{(s)},K_0) - Q(\hat{\theta}^{(s)},K_0|\hat{\theta}^{(s)},K_0),$$

where $\hat{\theta}_{EM}^{(s+1)} = \arg\max_\theta \sum_i \int \log f_O(Y_i, Y_i^*|X_i;\theta)\, f_L(Y_i^*|W_i;\hat{\theta}^{(s)},K_0)\, dY_i^*$. Therefore, the parameter expansion technique can be intuitively interpreted as a way to improve the lower bound of the log-likelihood increment compared to the EM algorithm.

Asymptotic properties

We first characterize the dynamics of PX-SEM updates. Define Θ as the joint set of auxiliary and O model parameters, Θ ≡ [θ;K]. Accordingly, Θ̄ ≡ [θ̄;K0] represents the vector of true values, and Θ̂ ≡ [θ̂;K0] represents the MLE of the O model. Given any estimate Θ̂(s)=[θ̂(s);K0] in iteration s, PX-SEM generates the next update from a Markov process: $\sum_i \Psi_L(Y_i^*, W_i;\hat{\Theta}^{(s+1)}) = 0$, where $Y_i^* \sim f_L(Y_i^*|W_i;\hat{\Theta}^{(s)})$.[9],[10] Expanding around Θ̂ and considering $\hat{\theta} \overset{p}{\to} \bar{\theta}$, as shown in detail in Appendix B, we have

$$(\hat{\Theta}^{(s+1)} - \hat{\Theta}) = (I - A^{-1}V)(\hat{\Theta}^{(s)} - \hat{\Theta}) + A^{-1}\epsilon^{(s)} + o_p(N^{-1/2}), \tag{4}$$

where, under correct specification, $A = E(\Psi_L(Y_i^*, W_i;\bar{\Theta})\,\Psi_L(Y_i^*, W_i;\bar{\Theta})')$ is the L model-based complete-data information matrix, $V = E(\Psi_L(W_i;\bar{\Theta})\,\Psi_L(W_i;\bar{\Theta})')$ is the L model-based observed-data information matrix, $I - A^{-1}V$ is the expected fraction of missing information, and $\sqrt{N}\,\epsilon^{(s)} \overset{d}{\to} N(0, A - V)$.

The SEM iterations can be characterized in the same way:

$$(\hat{\theta}_{SEM}^{(s+1)} - \hat{\theta}) = (I - A_{\theta\theta}^{-1}V_{\theta\theta})(\hat{\theta}_{SEM}^{(s)} - \hat{\theta}) + A_{\theta\theta}^{-1}\epsilon_\theta^{(s)} + o_p(N^{-1/2}), \tag{5}$$

where $A_{\theta\theta}$ represents the O model-based complete-data information matrix, $V_{\theta\theta}$ the O model-based observed-data information matrix, $F_{SEM} \equiv I - A_{\theta\theta}^{-1}V_{\theta\theta}$ the expected fraction of missing information, and $\sqrt{N}\,\epsilon_\theta^{(s)} \overset{d}{\to} N(0, A_{\theta\theta} - V_{\theta\theta})$. Moreover, the PX-SEM and SEM dynamics are closely connected:

$$A = \begin{bmatrix} A_{\theta\theta} & A_{\theta K} \\ A_{K\theta} & A_{KK} \end{bmatrix}, \quad V = \begin{bmatrix} V_{\theta\theta} & 0 \\ 0 & 0 \end{bmatrix},$$

where $A_{\theta K} \equiv \frac{\partial}{\partial K'}\big|_{\bar{\Theta}}\, E(\tilde{\Psi}_{L\theta}(Y_i^*, W_i;\Theta))$ and $A_{KK} \equiv \frac{\partial}{\partial K'}\big|_{\bar{\Theta}}\, E(\tilde{\Psi}_{LK}(Y_i^*, W_i;\Theta))$.

We now present the main results on the asymptotic properties, building on Liu, Rubin, and Wu (1998) and Nielsen (2000), with detailed discussions provided in Appendix B.

Theorem 1.

The PX-SEM iteration of θ̂(s) is asymptotically equivalent to an SEM iteration with observed-data information matrix $V_{\theta\theta}$ and complete-data information matrix $A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$.

Proof.

Let H denote the inverse of the matrix A, that is,

$$H = \begin{bmatrix} H_{\theta\theta} & H_{\theta K} \\ H_{K\theta} & H_{KK} \end{bmatrix} \equiv \begin{bmatrix} A_{\theta\theta} & A_{\theta K} \\ A_{K\theta} & A_{KK} \end{bmatrix}^{-1} = A^{-1},$$

where, by design, $H_{\theta\theta}^{-1} = A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$. Then the coefficient matrix $I - A^{-1}V$ and the asymptotic variance of the innovation term $A^{-1}\epsilon^{(s)}$ in (4) become:

$$I - A^{-1}V = \begin{bmatrix} I - H_{\theta\theta}V_{\theta\theta} & 0 \\ -H_{K\theta}V_{\theta\theta} & I \end{bmatrix}, \quad A^{-1}(A - V)A^{-1} = \begin{bmatrix} H_{\theta\theta}(H_{\theta\theta}^{-1} - V_{\theta\theta})H_{\theta\theta} & H_{\theta K} - H_{\theta\theta}V_{\theta\theta}H_{\theta K} \\ H_{K\theta} - H_{K\theta}V_{\theta\theta}H_{\theta\theta} & H_{KK} - H_{K\theta}V_{\theta\theta}H_{\theta K} \end{bmatrix}.$$

It becomes evident that the PX-SEM process of θ̂(s+1) is asymptotically equivalent to an alternative SEM dynamics, described by (6), which shares the same observed-data information matrix $V_{\theta\theta}$ as the standard SEM in (5), but replaces the original complete-data information matrix $A_{\theta\theta}$ by $H_{\theta\theta}^{-1} = A_{\theta\theta} - A_{\theta K}A_{KK}^{-1}A_{K\theta}$:

$$(\hat{\theta}^{(s+1)} - \hat{\theta}) = (I - H_{\theta\theta}V_{\theta\theta})(\hat{\theta}^{(s)} - \hat{\theta}) + H_{\theta\theta}\tilde{\epsilon}_\theta^{(s)} + o_p(N^{-1/2}), \tag{6}$$

where $\sqrt{N}\,\tilde{\epsilon}_\theta^{(s)} \overset{d}{\to} N(0, H_{\theta\theta}^{-1} - V_{\theta\theta})$, and $F_{PX} \equiv I - H_{\theta\theta}V_{\theta\theta}$ is the expected fraction of missing information.[11] □

Since $H_{\theta\theta} = A_{\theta\theta}^{-1} + H_{\theta K}H_{KK}^{-1}H_{K\theta}$, under the condition that A is positive definite, $H_{\theta\theta} \geq A_{\theta\theta}^{-1}$ in a semipositive definite order, implying that the largest eigenvalue of $F_{PX}$ is no greater than the largest eigenvalue of $F_{SEM}$. Applying Theorem 1, this comparison of the expected fractions of missing information between PX-SEM and SEM immediately implies the dominance of PX-SEM in convergence rate, as stated in Corollary 1 (and illustrated numerically in the sketch following its proof).

Corollary 1.

PX-SEM dominates SEM in global rate of convergence.

Proof.

In Appendix B.2. □
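As an informal numerical illustration of this eigenvalue comparison (a sketch under simplifying assumptions, not part of the formal proofs), one can draw a random positive definite A, construct V with $V_{\theta\theta} \leq H_{\theta\theta}^{-1}$ so that A − V is positive semidefinite, and compare the spectral radii of the two fractions of missing information:

```python
import numpy as np

rng = np.random.default_rng(0)
d_theta, d_K = 3, 2
d = d_theta + d_K

S = rng.standard_normal((d, d))
A = S @ S.T + d * np.eye(d)                   # random positive definite A
A_tt, A_tK = A[:d_theta, :d_theta], A[:d_theta, d_theta:]
A_Kt, A_KK = A[d_theta:, :d_theta], A[d_theta:, d_theta:]

# Schur complement: H_tt^{-1} = A_tt - A_tK A_KK^{-1} A_Kt
H_tt_inv = A_tt - A_tK @ np.linalg.solve(A_KK, A_Kt)

# Pick V_tt strictly below H_tt^{-1} (this ensures A - V >= 0)
L = np.linalg.cholesky(H_tt_inv)
Q, _ = np.linalg.qr(rng.standard_normal((d_theta, d_theta)))
B = Q @ np.diag(rng.uniform(0.1, 0.9, d_theta)) @ Q.T   # 0 < B < I
V_tt = L @ B @ L.T

F_sem = np.eye(d_theta) - np.linalg.solve(A_tt, V_tt)     # I - A_tt^{-1} V_tt
F_px = np.eye(d_theta) - np.linalg.solve(H_tt_inv, V_tt)  # I - H_tt V_tt

rho = lambda F: np.abs(np.linalg.eigvals(F)).max()
print(rho(F_px), "<=", rho(F_sem))            # PX has the smaller spectral radius
```

Across random draws, the spectral radius of $F_{PX}$ never exceeds that of $F_{SEM}$, consistent with Corollary 1.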

Moreover, Nielsen (2000) provides conditions under which the SEM update is ergodic and characterizes the limiting stationary distribution, based on which Corollary 2 describes the limiting stationary distribution of PX-SEM and compares it with SEM.[12]

Corollary 2.

The limiting stationary distribution of the PX-SEM updates θ̂(s), conditional on Wi, is

$$\sqrt{N}(\hat{\theta}^{(s)} - \hat{\theta}) \overset{d}{\to} N\left(0,\, V_{\theta\theta}^{-1}\left(I - (I + F_{PX})^{-1}\right)\right),$$

and, unconditionally,

$$\sqrt{N}(\hat{\theta}^{(s)} - \bar{\theta}) \overset{d}{\to} N\left(0,\, V_{\theta\theta}^{-1}\left(2I - (I + F_{PX})^{-1}\right)\right),$$

with its variances being less than or equal to those of the standard O model-based SEM, that is,

$$V_{\theta\theta}^{-1}\left(2I - (I + F_{PX})^{-1}\right) - V_{\theta\theta}^{-1}\left(2I - (I + F_{SEM})^{-1}\right) = V_{\theta\theta}^{-1}\left(I - (I + F_{PX})^{-1}\right) - V_{\theta\theta}^{-1}\left(I - (I + F_{SEM})^{-1}\right) \leq 0$$

in a semipositive definite order.

Proof.

In Appendix B.2. □

Corollary 2 implies that PX-SEM updates exhibit smaller fluctuations along iterations in large samples. Moreover, since the final estimator is the average of the last S0 iterations after convergence, $\hat{\theta} = \frac{1}{S_0}\sum_{s=S-S_0+1}^{S}\hat{\theta}^{(s)}$, which converges to the MLE as the number of iterations increases, the PX-SEM and SEM estimators share the same asymptotic variance.[13]

When the M-step is moment-based, convergence is in general not guaranteed. Even under convergence, the speed does not necessarily dominate that of SEM. Indeed, Appendix A shows an example where moment-based PX-SEM underperforms SEM for some initial guesses.

However, moment-based PX-SEM may be the preferred choice in practice for at least two crucial reasons. First, in some cases, obtaining GMM estimators is much easier, such as in the quantile model discussed in Section 6. Since our final target is to reduce the computing time, we should consider not only the number of iterations but also the time spent in each iteration. Second, even if obtaining the MLE of the O model is feasible, restricting ourselves to a tractable MLE in the PX-M step can limit the flexibility in building the L model, negatively impacting the convergence rate. Appendix A shows an example of the toy model where the moment-based PX-SEM with a more flexible L model outperforms the likelihood-based PX-SEM, which uses a less flexible L model.

3.3 Implementation based on Linear Expansions

So far, we have shown that a new estimation method, PX-SEM, which combines the parameter expansion technique with the SEM algorithm, has attractive theoretical properties relative to ordinary SEM and the potential to achieve large computational gains.

However, the parameter expansion technique itself does not speak of the selection of the L model. On the one hand, all else being equal, a more flexible L model should improve the convergence rate. On the other hand, we also need to consider the time spent in each iteration to estimate the L model and convert it to the O model since our ultimate goal is to reduce the total computing time. Therefore, another contribution of this article is to propose a specific class of linear expansions, targeting the potential violation of zero-correlation assumptions, which can be generally applied to a wide range of models.

Consider an O model of the form $Y_i = G(Y_i^*;\theta)$, $Y_i^* \sim F_O(\theta)$, where both G(·) and FO(·) are known parametric functions up to the unknown parameter θ. We propose the following linear expansion of Yi*:[14]

$$Y_i = G(Y_i^*;\theta), \quad Y_i^* = A\tilde{Y}_i, \quad \tilde{Y}_i \sim F_O(\theta), \quad \text{s.t. } G(Y_i^*;\theta) \overset{d}{=} G(\tilde{Y}_i;\theta) \text{ (equally distributed)}, \tag{L Model}$$

where the auxiliary parameter is given by K = vec(A).[15]

The expansion is straightforward. We assume that the latent variable Ỹi follows the same distribution as its O model counterpart. However, the E-step draws Yi*, which directly enter the measurement equation and the observable Yi, result from an affine transformation applied to Ỹi. It is easy to check that when A = I the L model coincides with the O model, whereas when A ≠ I it allows us to introduce linear correlations among the elements of Yi*. The constraint ensures that the auxiliary parameters do not affect the observed-data likelihood, simplifying the reduction function to R(θ,K)=θ.[16]

To implement the PX-SEM algorithm, in the E-step we draw Yi* from the O model based posterior distribution as discussed before. In the PX-M step, we leverage moment constraints or the distribution FO(θ) to pin down A and θ.

This method has the advantage that, despite the model of interest being nonlinear, the expansion is linear in latent variables, which are drawn from the E-step and treated as observables in the PX-M step. Thus, we can identify the auxiliary parameters separately through a relatively simple linear model, regardless of the specific form of G(·).
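As a concrete sketch of this point, in the Gaussian case the whole matrix A can be recovered from the E-step draws by matching covariances: if $\tilde{Y}_i \sim N(0, \Sigma(\theta))$ and $Y_i^* = A\tilde{Y}_i$, then $cov(Y_i^*) = A\Sigma(\theta)A'$, so a lower-triangular A follows from Cholesky factors. The code below is illustrative only; sigma_model stands for the model-implied covariance of the untransformed latent vector.

```python
import numpy as np

def estimate_A(ystar_draws, sigma_model):
    """ystar_draws: (N, d) E-step draws; sigma_model: (d, d) model-implied cov."""
    S = np.cov(ystar_draws, rowvar=False)      # sample covariance of the draws
    L_star = np.linalg.cholesky(S)
    L_model = np.linalg.cholesky(sigma_model)
    # A = L_star L_model^{-1} is lower triangular with positive diagonal
    # and satisfies A sigma_model A' = S by construction
    return L_star @ np.linalg.inv(L_model)
```

When the current parameter guess is close to the truth, the draws' covariance matches the model-implied one and A is close to the identity, so the PX-M step reduces to the standard M step.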

In the following sections, we discuss three applications: (a) dynamic factor models, (b) discrete choice models, and (c) quantile models, for which we propose PX-SEM algorithms based on linearly expanded models.

4 Dynamic Factor Models

The first type of model we discuss is the dynamic factor model (Geweke 1977). The appeal of this class of models is their ability to explain variation across multiple dimensions using fewer latent common factors. Applications span multiple fields, including topics in macroeconomics and finance, among others (Bai and Ng 2008; Stock and Watson 2006, 2011). While we will focus on a specific single-factor O model, it is worth noting that the same approach for implementing the PX-SEM algorithm can be applied to models with multiple latent factors. The O model to be estimated is as follows:

$$y_{it} = \lambda_i\nu_t + \epsilon_{it} \quad \text{and} \quad \nu_t = \nu_{t-1} + u_t, \tag{O Model}$$

where $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_i^2)$, $u_t \overset{iid}{\sim} N(0,1)$, $\nu_0 = 0$, and $u_t$ is independent of $\epsilon_{it}$.

The model contains a latent common factor νt that follows a Gaussian random walk. We observe N different measures, yi, where i=1,…,N, over a total of T periods, with each measure associated with a distinct factor loading λi. The set of unknown parameters is denoted as θ ≡ (λ1,…,λN,σ1,…,σN).[17]

SEM

We first explain the SEM procedure. Let ν ≡ [ν1 ν2 ⋯ νT]′. Starting with an initial guess θ̂(0), we iterate through the E-step and the M-step for s=0,1,2,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw ν from the posterior distribution fO(ν|y;θ̂(s))

  2. M step: Update θ̂(s+1)=(λ̂1,…,λ̂N,σ̂1,…,σ̂N): $\hat{\lambda}_i = \left(\sum_t \nu_t^2\right)^{-1}\left(\sum_t \nu_t y_{it}\right)$ and $\hat{\sigma}_i = \widehat{\mathrm{std}}(y_{it} - \hat{\lambda}_i\nu_t)$, for all i

PX-SEM

To implement PX-SEM, we construct a simple L model as follows:

$$y_{it} = \lambda_i\nu_t + \epsilon_{it} \quad \text{and} \quad \nu_t = \nu_{t-1} + u_t, \tag{L Model}$$

where $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_i^2)$, $u_t \overset{iid}{\sim} N(0,k^2)$, $\nu_0 = 0$, and $u_t$ is independent of $\epsilon_{it}$.

This L model expands the O model by introducing an auxiliary parameter, k, allowing the variance of the persistent shock ut to deviate from 1. Since k can always take the value of 1, making the two models coincide, the L model satisfies condition (a). Moreover, it is easy to verify that the reduction function R(λ1,…,λN,σ1,…,σN,k)=(λ1k,…,λNk,σ1,…,σN) satisfies condition (b), that is, fO(yi;R(θ,k))=fL(yi;θ,k).

With the L model specified and an initial guess θ̂(0), we iterate through the E-step and the PX-M step for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw ν from the posterior distribution fO(ν|y;θ̂(s))

  2. PX-M step:

    1. L model estimation: obtain λ̂L, σ̂L=(λ̂L1,…,λ̂LN,σ̂L1,…,σ̂LN) and k̂:

      $\hat{\lambda}_{Li} = \left(\sum_t \nu_t^2\right)^{-1}\left(\sum_t \nu_t y_{it}\right)$ and $\hat{\sigma}_{Li} = \widehat{\mathrm{std}}(y_{it} - \hat{\lambda}_{Li}\nu_t)$, for all i; and $\hat{k} = \widehat{\mathrm{std}}(\nu_t - \nu_{t-1})$

    2. Reduction: θ̂(s+1)=(λ̂Lk̂,σ̂L)

The PX-M step estimation of the auxiliary parameter k is straightforward due to the separability of the log-likelihood function. Compared to SEM, the PX-SEM update θ̂(s+1) takes into account potential deviations from the assumption k=1 in the O model. When the guess θ̂(s) is sufficiently close to the true value, we expect k̂ to be close to 1, resulting in similar SEM and PX-SEM updates θ̂(s+1). However, when the guess θ̂(s) deviates significantly from the true value, leading to a violation of the assumption k=1 in the E-step draws, PX-SEM adjusts the estimate accordingly. For instance, if k̂ is greater than 1, it suggests scaling down the latent draws ν by a factor of k to ensure var(Δν)=1 and scaling up λi by the same factor k to maintain the same log-likelihood for observed data.
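Below is a minimal Python sketch of the full PX-SEM iteration for this model, assuming direct Gaussian sampling in the E-step (under the random-walk assumption, the posterior of ν given the data is Gaussian with a tridiagonal prior precision); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_factor(y, lam, sig):
    """Draw nu from its Gaussian posterior given y (N x T), lam (N,), sig (N,)."""
    T = y.shape[1]
    # Random-walk prior precision: nu_t = nu_{t-1} + u_t, nu_0 = 0, u_t ~ N(0,1)
    Q = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    Q[-1, -1] = 1.0
    P = Q + np.sum(lam**2 / sig**2) * np.eye(T)   # posterior precision
    b = (lam / sig**2) @ y                        # length-T posterior shift
    mean = np.linalg.solve(P, b)
    L = np.linalg.cholesky(P)
    return mean + np.linalg.solve(L.T, rng.standard_normal(T))

def px_m_step(y, nu):
    """PX-M step: estimate the L model, then reduce to the O model."""
    lam_L = (y @ nu) / (nu @ nu)                  # loadings, measure by measure
    sig_L = (y - np.outer(lam_L, nu)).std(axis=1)
    k_hat = np.diff(nu, prepend=0.0).std()        # std of nu_t - nu_{t-1}
    return lam_L * k_hat, sig_L                   # reduction: scale loadings by k

def px_sem(y, S=1500, S0=500):
    N, _ = y.shape
    lam, sig = rng.lognormal(size=N), np.ones(N)  # random initial guess
    path = []
    for _ in range(S):
        nu = sample_factor(y, lam, sig)           # E step
        lam, sig = px_m_step(y, nu)               # PX-M step
        path.append(np.concatenate([lam, sig]))
    return np.mean(path[-S0:], axis=0)            # average of last S0 iterations
```

Dropping the k_hat adjustment (returning lam_L unscaled) recovers the standard SEM iteration, which makes the two algorithms easy to compare on simulated data.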

Remark.

It is easier to see the connection to the proposed linear expansion method after reparameterizing the L model. As detailed in Appendix C, we obtain an alternative expanded model with the reduction function R(θ,K)=θ, that is, $y_{it} = \lambda_i\nu_t + \epsilon_{it}$, $[\nu_1\ \cdots\ \nu_T\ \epsilon_{i1}\ \cdots\ \epsilon_{iT}]' = A_i[\nu_1^*\ \cdots\ \nu_T^*\ \epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$, $\nu_t^* = \nu_{t-1}^* + u_t$, where

$$A_i = \begin{bmatrix} kI_{T\times T} & 0_{T\times T} \\ \lambda_i(1-k)I_{T\times T} & I_{T\times T} \end{bmatrix},$$

and ϵ*, u, and ν* follow distributions identical to their O model counterparts. Thus, it is evident that the proposed L model belongs to the class of linear expansions with a specific constraint on the matrix Ai: only contemporaneous correlations between (νt,ϵit) and (νt*,ϵit*) are allowed. Despite its advantages of easy adaptation to various models and a negligible increase in computational burden due to likelihood separability, in the other two applications we will explore more flexible L models by relaxing constraints on the matrix A, such as allowing for correlations across periods, to achieve faster convergence.

Simulation Results

Figure 2 presents simulation results for a DGP where λ=(1.22,1.07,1.62) and σ=(0.92,0.78,1.33) with N = 3 and T = 200. The x-axis represents the number of iterations s=1,…,1500, and the y-axis represents the M-step update θ̂(s). The blue line depicts the SEM trajectory, whereas the orange line depicts the PX-SEM trajectory. The horizontal green dashed line represents the true value. Starting from a randomly chosen initial guess θ̂(0), both SEM and PX-SEM updates move toward the true value and stabilize after several iterations. We use the average of the last 500 updates as the final estimate.

Fig. 2 SEM and PX-SEM iterations of θ̂(s) from a random initial guess.

NOTE: Iterations of SEM (blue line) and PX-SEM (orange line) based on direct sampling, compared with the true value (green dashed line). SEM estimates (blue diamond) and PX-SEM estimates (orange star) are calculated as the average of the last 500 iterations. Random initial guess generated from a lognormal distribution. N=3, T=200.

As shown in Figure 2, PX-SEM converges almost immediately for all the parameters. SEM, although it also converges relatively fast for the σ's, shows a notable difference in the case of the λ's: it does not converge until around 500 iterations.

Regarding the volatility of updates across iterations, Appendix D presents figures with longer trajectories, where we can observe that PX-SEM exhibits smaller volatility. Appendix D also includes results for larger sample sizes and additional figures plotting cumulative computing time, revealing significant gains, especially for larger samples.

5 Discrete Choice Models

The second type of model we discuss is the random effects discrete choice model with persistent and transitory components. Discrete choice models are widely used in empirical research on various topics, including labor supply (Hyslop 1999) and consumer demand (Keane et al. 2013), among others. Distinguishing heterogeneity from persistence is of interest for many reasons, but the nonlinearity and the presence of latent variables complicate the estimation process.[18]

In this section, we develop PX-SEM algorithms for a group of discrete choice models with rich latent-variable structures, including time-invariant, persistent, and transitory components. Specifically, the O model is as follows:

$$y_{it} = 1(z_{it} > 0), \quad z_{it} = \beta'x_{it} + \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = \rho\nu_{i,t-1} + u_{it}, \tag{O Model}$$

where $\mu_i|x_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1}|x_i \overset{iid}{\sim} N(0,1)$, $u_{it}|x_i \overset{iid}{\sim} N(0,\sigma_u^2)$, $\epsilon_{it}|x_i \overset{iid}{\sim} N(0,1)$; and $\mu_i$, $u_{it}$, $\epsilon_{it}$ are mutually independent.[19]

For each individual i=1,…,N at period t=1,…,T, we observe a vector of independent variables xit of dimension J and a binary (0-1) discrete dependent variable yit, whereas zit, the individual effect μi, the persistent component νit, and the transitory component ϵit are latent. We denote the set of unknown parameters as θ ≡ (β,σμ,ρ,σu).

SEM

Let zi ≡ [zi1 ⋯ ziT]′ and νi ≡ [νi1 ⋯ νiT]′. From an initial guess θ̂(0), we iterate through the E-step and M-step for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw (zi,μi,νi) from the posterior distribution fO(zi,μi,νi|yi,xi;θ̂(s)).

  2. M step: Update θ̂(s+1)=(β̂(s+1),σ̂μ(s+1),ρ̂(s+1),σ̂u(s+1))

    $$\hat{\beta}^{(s+1)} = \left(\sum_i\sum_t x_{it}x_{it}'\right)^{-1}\left(\sum_i\sum_t x_{it}(z_{it} - \mu_i - \nu_{it})\right), \quad \hat{\rho}^{(s+1)} = \left(\sum_i\sum_t \nu_{i,t-1}^2\right)^{-1}\left(\sum_i\sum_t \nu_{i,t-1}\nu_{it}\right),$$

    $$\hat{\sigma}_\mu^{(s+1)} = \widehat{\mathrm{std}}(\mu_i) \quad \text{and} \quad \hat{\sigma}_u^{(s+1)} = \widehat{\mathrm{std}}(\nu_{it} - \hat{\rho}^{(s+1)}\nu_{i,t-1}).$$

    A code sketch of these updates follows.
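For concreteness, here is a minimal Python sketch of this M step, treating the E-step draws as data; array shapes and names are illustrative.

```python
import numpy as np

def m_step(x, z, mu, nu):
    """x: (N,T,J) regressors; z, nu: (N,T) E-step draws; mu: (N,) E-step draws."""
    N, T, J = x.shape
    X = x.reshape(N * T, J)
    resid = (z - mu[:, None] - nu).reshape(N * T)
    beta = np.linalg.lstsq(X, resid, rcond=None)[0]   # pooled OLS for beta
    nu_lag, nu_cur = nu[:, :-1].ravel(), nu[:, 1:].ravel()
    rho = (nu_lag @ nu_cur) / (nu_lag @ nu_lag)       # AR(1) coefficient
    sig_mu = mu.std()
    sig_u = (nu_cur - rho * nu_lag).std()
    return beta, sig_mu, rho, sig_u
```

The PX-M step described next replaces these simple estimators with the estimation of the expanded model and a reduction back to θ.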

PX-SEM

One option for building the L model is to expand the O model to include only contemporaneous correlations, similar to the dynamic factor model in Section 4. Its advantage lies in the MLE being readily obtained in the PX-M step due to a separable likelihood. Appendix E provides the detailed steps and results of this approach. However, to achieve faster convergence, we now propose a more flexible L model.

Let us define $x_i \equiv [x_{i1}'\ \cdots\ x_{iT}']'$, $\epsilon_i \equiv [\epsilon_{i1}\ \cdots\ \epsilon_{iT}]'$, $\nu_i^* \equiv [\nu_{i1}^*\ \cdots\ \nu_{iT}^*]'$, $\epsilon_i^* \equiv [\epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$, and $z_i^* \equiv [z_{i1}^*\ \cdots\ z_{iT}^*]'$. We construct the following L model:

$$y_{it} = 1(z_{it} > 0), \quad z_{it} = \gamma_t'x_i + \mu_i + \nu_{it} + \epsilon_{it}, \quad [\mu_i\ \nu_i'\ \epsilon_i']' = pA[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]' + Bx_i, \quad \nu_{it}^* = \rho\nu_{i,t-1}^* + u_{it}, \tag{L Model}$$

where $\mu_i^*|x_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1}^*|x_i \overset{iid}{\sim} N(0,1)$, $u_{it}|x_i \overset{iid}{\sim} N(0,\sigma_u^2)$, $\epsilon_{it}^*|x_i \overset{iid}{\sim} N(0,1)$, and $\mu_i^*$, $u_{it}$, $\epsilon_{it}^*$ are mutually independent; and subject to $\frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$, $CA\Sigma A'C' = C\Sigma C'$, and $p > 0$, where $C = [1_{T\times 1}\ I_{T\times T}\ I_{T\times T}]$, $\Sigma = \mathrm{var}([\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]')$, $\gamma \equiv [\gamma_1\ \cdots\ \gamma_T]'$, and A is a lower triangular matrix with positive diagonal entries. Alongside θ from the O model, the L model contains a vector of auxiliary parameters $K \equiv [\mathrm{vech}(A)',\ \mathrm{vec}(B)',\ p]'$.

Following the linear expansion method, we introduce latent variables μi*, νi*, and ϵi*, which follow the same distributions as their O model counterparts. However, the E-step draws for [μi νi′ ϵi′]′, given xi, can result from an affine transformation applied to [μi* νi*′ ϵi*′]′, allowing for linear correlations among μi, νi, and ϵi, and for dependence on xi. The scalar p permits scaling zit and thus a deviation of var(ϵit) from the value of 1. Hence, the L model satisfies condition (a): when $B = 0_{(2T+1)\times(J\cdot T)}$, $A = I_{(2T+1)\times(2T+1)}$, and p = 1, the two models coincide: fO(yi,zi,μi,νi|xi;θ)=fL(yi,zi,μi,νi|xi;θ,A=I,B=0,p=1).

The L model has two key constraints: $\frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$ and $CA\Sigma A'C' = C\Sigma C'$. Beyond addressing identification, these constraints simplify the reduction function. Specifically, under these constraints, the L model implies $z_i \overset{d}{=} p(I_{T\times T} \otimes \beta')x_i + pC[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]'$, so the auxiliary parameters p, A, and B have no effect on the conditional distribution of yit given xit. Therefore, regarding condition (b), we find a reduction function, R(θ,K)=θ, satisfying fO(yi|xi;R(θ,K))=fL(yi|xi;θ,K).

Finally, with the L model specified and an initial guess θ̂(0), we iterate through the following two steps for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw (zi,μi,νi) from the posterior distribution fO(zi,μi,νi|yi,xi;θ̂(s)).

  2. PX-M step:

    1. L model estimation:

      $(\hat{\theta}_L, \hat{K}) = \arg\min_{\theta,K} \sum_i \Psi(\theta, K; y_i, z_i, x_i, \mu_i, \nu_i)$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂)=θ̂L

      where Ψ(·) is a known function whose detailed specification is presented in Appendix F. Below, we list the moments involved in function Ψ(·):

      $p\beta$: $E(x_{it}(z_{it} - p\beta'x_{it})) = 0$, $\quad \frac{1}{p}(CB + \gamma) = I_{T\times T} \otimes \beta'$

      $B$: $E(x_i([\mu_i\ \nu_i'\ \epsilon_i'] - x_i'B')) = 0$

      $\sigma_\mu, p, \Sigma, A$: $CA\Sigma A'C' = C\Sigma C'$, and moment constraints on $\Sigma$

      $\rho, \sigma_u$: $E(\nu_{i,t-1}^*(\nu_{it}^* - \rho\nu_{i,t-1}^*)) = 0$, $\quad \mathrm{var}(\nu_{it}^* - \rho\nu_{i,t-1}^*) = \sigma_u^2$

Simulation Results

We conduct simulations to compare SEM and PX-SEM from a DGP with true parameter values: β=[1.0;0.5], σμ=1.25, ρ=0.7, and σu=0.9.

The initial guess is determined as follows: (a) β̂(0) is the vector of Probit regression coefficients of yit on xit; (b) we impose ρ̂(0)=1, and the rest of the parameters are estimates of the linearly approximated model.[20] In the E-step, we employ a random-walk Metropolis-Hastings sampler with an acceptance rate controlled between 20% and 40%, as sketched below.
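The following is a minimal sketch of one possible random-walk Metropolis-Hastings update for a single individual under the O model of this section; it is illustrative only, and the step size c would be adapted along the burn-in to keep the acceptance rate roughly between 20% and 40%.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(z, mu, nu, y, x, beta, sig_mu, rho, sig_u):
    """Log posterior (up to a constant) of (z, mu, nu) given y (0-1) and x (T,J)."""
    if np.any((z > 0) != y):                     # sign constraint: y = 1(z > 0)
        return -np.inf
    lp = -mu**2 / (2 * sig_mu**2) - nu[0]**2 / 2
    lp -= np.sum((nu[1:] - rho * nu[:-1])**2) / (2 * sig_u**2)
    lp -= np.sum((z - x @ beta - mu - nu)**2) / 2
    return lp

def mh_step(z, mu, nu, y, x, theta, c):
    """One random-walk proposal on (z, mu, nu); returns new state and accept flag."""
    T = z.size
    z_p = z + c * rng.standard_normal(T)
    mu_p = mu + c * rng.standard_normal()
    nu_p = nu + c * rng.standard_normal(T)
    diff = log_post(z_p, mu_p, nu_p, y, x, *theta) - log_post(z, mu, nu, y, x, *theta)
    if np.log(rng.uniform()) < diff:
        return z_p, mu_p, nu_p, True
    return z, mu, nu, False

def tune(c, accept_rate):
    # crude adaptation toward the 20%-40% acceptance window
    if accept_rate < 0.2:
        return c * 0.9
    if accept_rate > 0.4:
        return c * 1.1
    return c
```

Proposals that violate the sign restriction implied by the observed yit are rejected automatically, since the target density is zero there.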

Figure 3 presents the estimation results of one simulation with N = 5000 and T = 8. Specifically, we plot the M-step updates θ̂(s) for 1000 iterations (S = 1000). The blue line depicts each update of SEM, while the orange one represents the PX-SEM updates. In this example, we might be interested in switching to SEM, which is likelihood-based, treating the PX-SEM estimate as an initial guess. Hence, we also present the results of a combined approach, where we run PX-SEM for 500 iterations and then continue with SEM for another 500 iterations, using the average of the last 250 PX-SEM iterations as the initial guess, as shown by the gray line.[21] The green dashed line indicates the true value. The final estimates are the average of the last 250 iterations (S0=250), represented by the blue diamond for SEM, the orange star for PX-SEM, and the gray circle for PX-SEM + SEM.

Fig. 3 SEM and PX-SEM iterations of θ̂(s) from an informed guess.

NOTE: Iterations of SEM (blue line), PX-SEM (orange line), and PX-SEM + SEM (gray line) based on 100 MH draws, compared with the true value (green dashed line). Estimates of SEM (blue diamond), PX-SEM (orange star), and PX-SEM + SEM (gray circle) are the average of the last 250 iterations. Informed initial guess.

From this comparison, it is clear that, starting from the same initial guess, PX-SEM converges almost immediately (within 100 iterations). In contrast, SEM progresses much more slowly, especially for σ̂μ(s), σ̂u(s), and ρ̂(s), and does not converge within 1000 iterations. In terms of the combined approach, the variation across iterations decreases significantly after transitioning to SEM. However, since the final estimates are the average of the last 250 updates, the difference between PX-SEM and PX-SEM + SEM is small.[22]

Appendix L provides additional figures where the x-axis is the cumulative computing time. The gain is significant: SEM takes approximately 3000 sec to run 1000 iterations without converging, while PX-SEM converges almost immediately.[23]

Appendix H compares the algorithms based on random initial guesses. Researchers often run SEM algorithms from various initial guesses and choose one based on specific criteria (e.g., the likelihood value) to avoid obtaining a local maximum, given that getting a "good" initial guess can be challenging. Appendix H shows that the dominance of PX-SEM in convergence rate remains under random initial guesses. Given that this type of exercise is often performed repeatedly in practice, the time saved could be substantial.[24]

Finally, Appendix M provides the overall trajectories of SEM and PX-SEM over iterations. Specifically, we conduct 40 simulations using the same DGP, each estimated under a different set of initial guesses shared by both SEM and PX-SEM. For each parameter, we examine the distribution of updates across 40 trajectories at each specific iteration and how this distribution evolves over the iterations for SEM and PX-SEM, respectively. We reach the same conclusion: PX-SEM significantly improves algorithmic efficiency.

6 Quantile Models

The final type of model for which we consider a PX-SEM approach is the persistent-transitory dynamic quantile model with individual effects, as proposed by Arellano, Blundell, and Bonhomme (2017) (referred to as ABB hereafter). The ABB model does not impose functional-form restrictions on the distributions of individual effects, transitory shocks, or the conditional distributions of the persistent component. Indeed, the flexible dynamics of the persistent component allow for attractive features such as nonlinear persistence, meaning that persistence can vary with the size of shocks and the accumulated history, which is shown to be empirically prominent in earnings dynamics. The model has also been applied to other topics, including firm and health dynamics.

Specifically, we focus on the ABB baseline model with an additive fixed effect, discussed in their Appendix.[25] Denote the τth conditional quantile of νit given νi,t−1 as Qν(νi,t−1,τ) for each τ∈(0,1). The O model to be estimated is as follows:

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = Q_\nu(\nu_{i,t-1}, u_{it}), \quad (u_{it}|\mu_i, u_{i,t-1}, u_{i,t-2},\ldots) \overset{iid}{\sim} \mathrm{Uniform}(0,1), \quad t = 2,\ldots,T, \tag{O Model}$$

where $\epsilon_{it}$ has zero mean, is iid over time, and is independent of $\nu_i \equiv [\nu_{i1}\ \nu_{i2}\ \cdots\ \nu_{iT}]'$ and $\mu_i$. The individual effect $\mu_i$ is assumed to be independent of $\epsilon_i \equiv [\epsilon_{i1}\ \epsilon_{i2}\ \cdots\ \epsilon_{iT}]'$ and $\nu_i$.

To estimate this model, we follow Arellano, Blundell, and Bonhomme (2017) and empirically specify the quantile function of νit given νi,t−1, Qν(νi,t−1,τ), the quantile function of ϵit, Qϵ(τ), the quantile function of νi1, Qν1(τ), and the quantile function of μi, Qμ(τ), as follows:

$$Q_\nu(\nu_{i,t-1},\tau) = \sum_{h=0}^{H}\gamma_h^Q(\tau)\,\varphi_h(\nu_{i,t-1}), \quad Q_\epsilon(\tau) = \gamma^\epsilon(\tau), \quad Q_{\nu_1}(\tau) = \gamma^{\nu_1}(\tau), \quad Q_\mu(\tau) = \gamma^\mu(\tau),$$

where $\varphi_h(\cdot)$ is the Hermite polynomial of order h and the $\gamma(\cdot)$'s are functions to be estimated.

Arellano, Blundell, and Bonhomme (2017) exploit a variation of SEM for estimation, where the M-step involves a series of quantile regressions instead of likelihood optimization, for computational convenience. We first explain their procedure. Let θ denote the set of unknown parameters, including γhQ(τ), γϵ(τ), γν1(τ), and γμ(τ).[26] With an initial guess θ̂(0), we iterate through the following two steps until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw μi and νi from the posterior distribution fO(μi,νi|yi;θ̂(s)).

  2. M step: Update parameters by computing a series of quantile regressions:

    $$\hat{\gamma}^Q(\tau) = \arg\min_{\gamma_0^Q,\ldots,\gamma_H^Q} \sum_{i=1}^{N}\sum_{t=2}^{T}\rho_\tau\left(\nu_{it} - \sum_{h=0}^{H}\gamma_h^Q\varphi_h(\nu_{i,t-1})\right), \quad \hat{\gamma}^\mu(\tau) = \arg\min_{\gamma^\mu}\sum_{i=1}^{N}\rho_\tau(\mu_i - \gamma^\mu),$$

    $$\hat{\gamma}^\epsilon(\tau) = \arg\min_{\gamma^\epsilon}\sum_{i=1}^{N}\sum_{t=1}^{T}\rho_\tau(\epsilon_{it} - \gamma^\epsilon), \quad \hat{\gamma}^{\nu_1}(\tau) = \arg\min_{\gamma^{\nu_1}}\sum_{i=1}^{N}\rho_\tau(\nu_{i1} - \gamma^{\nu_1}),$$

    where $\rho_\tau(u) = u(\tau - 1(u \leq 0))$ is the check function. A code sketch of this M step follows.
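Below is a minimal Python sketch of this M step, assuming probabilists' Hermite polynomials for φh and using the fact that the intercept-only check-function minimizations reduce to empirical τ-quantiles; the regression on lagged ν uses statsmodels' QuantReg. Names and shapes are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from numpy.polynomial.hermite_e import hermeval

def m_step(mu, nu, eps, taus, H=2):
    """mu: (N,) draws; nu: (N,T) draws; eps = y - mu - nu: (N,T); taus: grid."""
    nu_lag, nu_cur = nu[:, :-1].ravel(), nu[:, 1:].ravel()
    # design matrix with columns phi_0(nu_lag), ..., phi_H(nu_lag)
    X = np.column_stack([hermeval(nu_lag, np.eye(H + 1)[h]) for h in range(H + 1)])
    gamma_Q = {tau: sm.QuantReg(nu_cur, X).fit(q=tau).params for tau in taus}
    # intercept-only quantile regressions are just empirical quantiles
    gamma_mu = {tau: np.quantile(mu, tau) for tau in taus}
    gamma_eps = {tau: np.quantile(eps, tau) for tau in taus}
    gamma_nu1 = {tau: np.quantile(nu[:, 0], tau) for tau in taus}
    return gamma_Q, gamma_mu, gamma_eps, gamma_nu1
```

In the PX-SEM variant below, the same routine is applied to the "corrected" draws obtained after estimating the auxiliary matrix A.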

PX-SEM

We expand the O model linearly, targeting the correlations among μi, νi, and ϵi. Define $\nu_i^* \equiv [\nu_{i1}^*\ \cdots\ \nu_{iT}^*]'$ and $\epsilon_i^* \equiv [\epsilon_{i1}^*\ \cdots\ \epsilon_{iT}^*]'$. We build the following L model:

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad [\mu_i\ \nu_i'\ \epsilon_i']' = A[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]', \quad \nu_{it}^* = Q_\nu(\nu_{i,t-1}^*, u_{it}), \quad (u_{it}|\mu_i^*, u_{i,t-1}, u_{i,t-2},\ldots) \overset{iid}{\sim} \mathrm{Uniform}(0,1), \quad t = 2,\ldots,T, \tag{L Model}$$

subject to $CA = C$, where $C = [1_{T\times 1}\ I_{T\times T}\ I_{T\times T}]$. Similarly, we assume that $\epsilon_{it}^*$ has zero mean, is iid over time, and is independent of $\nu_i^*$ and $\mu_i^*$, and that $\mu_i^*$ is independent of $\nu_i^*$. The L model contains a vector of auxiliary parameters $K \equiv \mathrm{vec}(A)$.

Consistent with the linear expansion method, the E-step draws [μi νi′ ϵi′]′ are assumed to be outcomes of an affine transformation, through the matrix A, of [μi* νi*′ ϵi*′]′, which follows a distribution identical to its O model counterpart. When A = I, the two models coincide, satisfying condition (a). Moreover, with the constraint CA = C, the L model becomes $y_i = C[\mu_i^*\ \nu_i^{*\prime}\ \epsilon_i^{*\prime}]'$, implying no effect of K on the observed-data likelihood. Thus, regarding condition (b), the reduction function is R(θ,K)=θ.

Finally, with this L model and an initial guess θ̂(0), we iterate through the following two steps for s=0,1,…,S until θ̂(s) converges to the stationary distribution:

  1. Stochastic E step: Draw μi and νi from posterior distribution fO(μi,νi|yi;θ̂(s))

  2. PX-M step:

    1. L model estimation:

      $(\hat{\theta}_L, \hat{K}) = \arg\min_{\theta,K} \sum_i \Psi(\theta, K; y_i, \mu_i, \nu_i)$

    2. Reduction: θ̂(s+1)=R(θ̂L,K̂)=θ̂L

      where Ψ(·) is a known function to be discussed in the following paragraphs.

The inclusion of the matrix A adds complexity to the joint estimation: the separate quantile regressions of the SEM M-step are no longer feasible, and many more parameters are involved.[27] Thus, we employ two strategies: adding extra constraints on the entries of the matrix A, and sequential estimation.

Regarding the extra constraints, we assume that ϵit is orthogonal to μi*, ϵi1*, …, ϵi,t−1*, for t=2,…,T, and that the coefficient of ϵit* with respect to ϵit is homogeneous across all periods. The advantage of doing so is that, by exploiting moment conditions, including the zero correlation among μi*, νi*, and ϵi*, νit* following a first-order Markov process, and ϵit* being iid, we can separately estimate the matrix A through constrained GMM while restricting the number of unknowns to only two. This further facilitates the sequential estimation strategy. Once Â is obtained, we estimate θ through the same series of quantile regressions as in SEM, using $[\hat{\mu}_i^*\ \hat{\nu}_i^{*\prime}\ \hat{\epsilon}_i^{*\prime}]' = \hat{A}^{-1}[\mu_i\ \nu_i'\ \epsilon_i']'$. Appendix I provides a detailed discussion of the estimation process.[28]

Simulation Results

We simulate from the following DGP (N = 5000, T = 6):

$$y_{it} = \mu_i + \nu_{it} + \epsilon_{it}, \quad \nu_{it} = \rho_\nu\nu_{i,t-1} + (\sigma_{\nu t0} + \sigma_{\nu t1}\nu_{i,t-1}^2)\,v_{it}, \tag{7}$$

where $\mu_i \overset{iid}{\sim} N(0,\sigma_\mu^2)$, $\nu_{i1} \overset{iid}{\sim} N(0,\sigma_{\nu_1}^2)$, $v_{it} \overset{iid}{\sim} N(0,1)$, $\epsilon_{it} \overset{iid}{\sim} N(0,\sigma_\epsilon^2)$, and $\mu_i$, $\nu_{i1}$, $v_{it}$, $\epsilon_{it}$ are mutually independent.

We present results for a persistent-transitory process without time-invariant heterogeneity or heteroscedasticity in the persistent shock, imposed by setting μi=0 and σνt1=0. The other parameter values are ρν=0.8, σνt0=0.15, σν1²=0.15, σϵ²=0.05. Appendix J provides simulation results for a DGP with time-invariant heterogeneity and heteroscedasticity, and for another DGP based on a flexible quantile model.

With the simulated data, we estimate the quantile model with time-invariant heterogeneity, as specified previously (the O model), assuming no knowledge of the distribution family. We set the initial guess by estimating the canonical random-walk permanent-transitory model, with details explained in Appendix J. Finally, the highest order of Hermite polynomials for the empirical specification of the νit dynamics, H, is set to two.

Figure 4 presents the results. To provide clearer visualization, instead of plotting the updates of the raw parameters in the quantile model directly (due to their large quantity), we plot the iterations of the estimated parameter values in the parametric model (7). Specifically, in each iteration, we estimate the parametric model using the E-step draws (μ, ν, and ϵ) for SEM and the "corrected" draws (μ̂*, ν̂*, and ϵ̂*) for PX-SEM. Importantly, these estimates are only for visualizing convergence and are not directly involved in the algorithm's updating procedure. Consistent with the previous exercises, PX-SEM exhibits rapid convergence for all parameters, whereas SEM converges much more slowly.

Fig. 4 SEM and PX-SEM iterations, μi=0.

NOTE: Iterations of SEM (blue solid line) and PX-SEM (orange solid line) based on 100 MH draws, compared with the true value (green dashed line). In each iteration, we estimate the parametric model, (7), using E-step draws μ, ν, ϵ for SEM and "corrected" draws μ̂*, ν̂*, ϵ̂* for PX-SEM. These estimates are only used for visualizing the convergence and are not directly involved in any algorithm. SEM estimates (blue diamond) and PX-SEM estimates (orange star) are both calculated as the average of the last 100 iterations. Informed initial guess.

Appendix L provides figures complementary to Figure 4, with cumulative computing time as the x-axis, showing a significant time gain: SEM takes over 5500 sec for 500 iterations without clear convergence, whereas PX-SEM converges almost immediately.[29] Appendix M presents the overall trajectories of 40 simulations. Finally, Appendix K shows simulation results for different sample sizes.

After 500 iterations, averaging the last 100 updates as temporary estimates, we simulate from the estimated models and compare the model fit, focusing on the persistence of the νit dynamics, $\partial\nu_{it}/\partial\nu_{i,t-1}$ evaluated at $(\nu_{i,t-1},\tau)$, one of the characteristics of interest (Arellano, Blundell, and Bonhomme 2017). Figure 5 displays heatmaps showing the absolute distance between the estimated persistence and the true persistence at different levels of the shock τ and of νi,t−1, for SEM and PX-SEM, respectively. Overall, PX-SEM shows a better model fit.

Fig. 5 Distance between estimated persistence and true persistence, μi=0. NOTE: The absolute distance between the estimated νt persistence (based on the average of the last 100 iterations) and the true persistence for each level of the shock τ and νi,t−1.

7 Conclusions

This article introduces new estimation algorithms for dynamic panel data models with latent variables. By combining the parameter expansion ideas with the SEM algorithm, we develop the PX-SEM algorithm, which could facilitate convergence in models with a large space of latent variables by improving algorithmic efficiency.

Sharing the same E-step as SEM, PX-SEM differs in the M-step. Instead of estimating the original model (the O model), the M-step of PX-SEM requires estimating an expanded model (the L model). Effectively, we propose new estimators for the pseudo-data within iterations, accounting for the misspecification of the O model for draws based on parameter values far from the truth. Thus, PX-SEM can leverage additional model information to effectively "correct" the M-step updates in progressing to more accurate ones.

Moreover, the article proposes a method for constructing the L model through linear expansion and presents new PX-SEM-based estimation algorithms for three types of dynamic panel data models: factor models, discrete choice models, and quantile models.

Regarding statistical properties, we establish the asymptotic equivalence of the likelihood-based PX-SEM to an alternative SEM with a smaller expected fraction of missing information compared to the standard O model based SEM, implying a faster global convergence rate and a smaller variance for the limiting stationary distribution. Finally, simulations show that PX-SEM can significantly improve the algorithmic efficiency relative to SEM.

Supplementary Materials

The online supplement consists of the following appendices. Appendix A presents illustrative figures for the intuition behind PX-SEM and comparisons among different L models and M-step estimators using the toy model. Appendix B provides a detailed proof for Section 3. Appendix C explains the equivalence through reparameterization among L models. Appendices E, F, and G discuss alternative L models, detailed L model estimation procedures, and PX-SEM methods applied to two extensions for the discrete choice model in Section 5. Appendix I provides detailed L model estimation procedures for the quantile model in Section 6. Appendices D, H, J–M present more simulation results for the three types of models discussed in Sections 4-6, with more iterations, different sample sizes, cumulative computing time, different initial guesses, and overall trajectories.


Acknowledgments

This work is based on Chapter 2 of my PhD thesis at CEMFI, which received the Enrique Fuentes Quintana Funcas Award in Economics, Finance and Business 2021–2022. I am deeply grateful to Manuel Arellano for his invaluable support and advice. I also thank Martin Almuzara, Dante Amengual, Dmitry Arkhangelsky, Orazio Attanasio, Richard Blundell, Stéphane Bonhomme, Micole De Vera, Jose Gutierrez, Pedro Mira, Josep Pijoan-Mas, Enrique Sentana, Liyang Sun, and seminar participants at IE University, CEMFI, the International Panel Data Conference, EEA-ESEM, and the SAEe meetings for valuable comments and suggestions. Two anonymous referees and an Associate Editor have helped greatly improve the article. All errors are my sole responsibility.

Disclosure Statement

The author reports there are no competing interests to declare.

Additional information

Funding

Grants PID2022-143184NA-I00, funded by MCIU/AEI/10.13039/501100011033 and by FEDER, UE; BES-2017-082506, funded by MCIU/AEI/10.13039/501100011033 and by “ESF Investing in your future”; and MDM-2016-0684, funded by MCIU/AEI/10.13039/501100011033, are gratefully acknowledged.

Notes

1 Arellano and Bonhomme (2017) discuss the potential of SEM in nonlinear panel data analysis.

2 For instance, Arellano et al. (2023) develop a sequential Monte Carlo sampler for the E-step.

3 Wei (2022) applies the algorithms developed in this article to a substantive analysis of the earnings and employment dynamics of older workers, which brings together elements of the three types of panel models considered here.

4 Liu, Rubin, and Wu (1998) is based on the EM algorithm; Liu and Wu (1999) applies the parameter expansion technique to Bayesian inference; Lavielle and Meza (2007) combines the parameter expansion technique with Monte Carlo EM (Wei and Tanner 1990).

5 See Wu (1983) for further discussion.

6 Since k does not affect the observed-data likelihood of the L model, fO(yi;σL)=fL(yi;σL,1)=fL(yi;σL,kL).

7 Note that the reduction function satisfies R(θ,K0)=θ.

8 In the moment-based PX-EM case, if the fixed point with K = K0 exists, then it will satisfy ∑i∫ΨO(Yi*,Wi;θ̂)fO(Yi*|Wi;θ̂)dYi*=0.

9 Note that the E-step draws of PX-SEM are based on the O model under the guess θ̂(s). This is equivalent to making draws from the L model under the guess Θ̂(s)=[θ̂(s);K0], due to condition (a).

10 Appendix B shows that for any L model, an alternative L model can be found by reparameterization, which yields identical updates of θ with the reduction function R(θ,K)=θ.

11 AV being positive definite implies that Hθθ−1Vθθ is also positive definite.

12 The author thanks an anonymous referee for his/her encouragement to develop this result.

13 With a fixed S0, the PX-SEM and SEM estimators will in general give rise to different asymptotic variances. Expressions for these variances are provided in Appendix B.

14 In this expression, Yi* also includes the error terms in the measurement equation G(·).

15 Extensions include a unit-specific matrix A (Section 4) and the addition of exogenous regressors Xi (Section 5).

16 The constraint might not be necessary in applications where reduction functions are easy to find.

17 The method can be easily adapted to models with (a) unknown persistence in the νt process, (b) multiple latent factors, (c) ϵit following an MA process, etc.

18 Chen (Citation2016) proposes a fixed effects EM estimator for a class of nonlinear panel data models.

19 Appendix G presents two extensions: (a) allowing for the dependence of μi and νi1 on xi1, and (b) a Logit specification (with strategies for the quantile model in the next section).

20 We approximate the model as follows: yit=Φ(βxit+μi+νit)+ηit≈0.5+0.25(βxit+μi+νit)+ηit.

21 We could also endogenize the switching procedure by using metrics like the distance between K̂ and K0 or the likelihood difference between the L model and the O model to guide our transition to SEM.

22 Whether PX-SEM requires more iterations in this example due to its higher volatility, impacting total computing time, is beyond this article’s scope, especially considering the PX-SEM + SEM option.

23 The results are obtained using a Mac Mini (M1, 2020) with a single processor core. We apply the Metropolis-Hastings algorithm for the E-step, with the first 100 iterations designated as a burn-in phase.

24 Appendix H also presents simulation results with more iterations and different sample sizes.

25 In practice, standard SEM generally performs well in estimating the ABB baseline model without the fixed effect, but estimation becomes challenging once a fixed effect is included. We also remove age effects.

26 Unknown parameters also include tail parameters. The functions γ(·) are piecewise-polynomial interpolating splines on a grid [τ1,τ2],[τ2,τ3],…,[τL−1,τL], and the tails on (0,τ1] and [τL,1) are modeled parametrically. See Appendix B in Arellano, Blundell, and Bonhomme (2017) for more details.

27 In the discrete choice model, the matrix A and other auxiliary parameters can be easily estimated in the PX-M step by focusing solely on the first two moments due to the normality assumption.

28 Similar strategies are used to estimate a Logit model in Appendix G.

29 The results are obtained using a Mac Mini (M1, 2020) with a single processor core. We apply the Metropolis-Hastings algorithm for the E-step, with the first 100 iterations designated as a burn-in phase.

References

  • Arcidiacono, P., and Jones, J. B. (2003), “Finite Mixture Distributions, Sequential Likelihood and the EM Algorithm,” Econometrica, 71, 933–946. DOI: 10.1111/1468-0262.00431.
  • Arellano, M., Blundell, R., and Bonhomme, S. (2017), “Earnings and Consumption Dynamics: A Nonlinear Panel Data Framework,” Econometrica, 85, 693–734. DOI: 10.3982/ECTA13795.
  • Arellano, M., Blundell, R., Bonhomme, S., and Light, J. (2023), “Heterogeneity of Consumption Responses to Income Shocks in the Presence of Nonlinear Persistence,” Journal of Econometrics, 240, 105449. DOI: 10.1016/j.jeconom.2023.04.001.
  • Arellano, M., and Bonhomme, S. (2016), “Nonlinear Panel Data Estimation via Quantile Regressions,” The Econometrics Journal, 19, C61–C94. DOI: 10.1111/ectj.12062.
  • ———(2017), “Nonlinear Panel Data Methods for Dynamic Heterogeneous Agent Models,” Annual Review of Economics, 9, 471–496.
  • Bai, J., and Ng, S. (2008), Large Dimensional Factor Analysis, Foundations and Trends® in Econometrics (Vol. 3), pp. 89–163, Hanover, MA: Now Publishers. DOI: 10.1561/0800000002.
  • Chen, M. (2016), “Estimation of Nonlinear Panel Models with Multiple Unobserved Effects,” Working Paper, Department of Economics, University of Warwick.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–22. DOI: 10.1111/j.2517-6161.1977.tb01600.x.
  • Diebolt, J., and Celeux, G. (1993), “Asymptotic Properties of a Stochastic EM Algorithm for Estimating Mixing Proportions,” Stochastic Models, 9, 599–613. DOI: 10.1080/15326349308807283.
  • Geweke, J. (1977), “The Dynamic Factor Analysis of Economic Time Series,” in Latent Variables in Socio-Economic Models, eds. D. J. Aigner and A. S. Goldberger, Amsterdam: North-Holland.
  • Hyslop, D. R. (1999), “State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women,” Econometrica, 67, 1255–1294. DOI: 10.1111/1468-0262.00080.
  • Keane, M. P. (2013), “Panel Data Discrete Choice Models of Consumer Demand,” in The Oxford Handbook of Panel Data, ed. B. H. Baltagi, pp. 548–582, Oxford: Oxford University Press.
  • Lavielle, M., and Meza, C. (2007), “A Parameter Expansion Version of the SAEM Algorithm,” Statistics and Computing, 17, 121–130. DOI: 10.1007/s11222-006-9007-6.
  • Liu, C., Rubin, D. B., and Wu, Y. N. (1998), “Parameter Expansion to Accelerate EM: The PX-EM Algorithm,” Biometrika, 85, 755–770. DOI: 10.1093/biomet/85.4.755.
  • Liu, J. S., and Wu, Y. N. (1999), “Parameter Expansion for Data Augmentation,” Journal of the American Statistical Association, 94, 1264–1274. DOI: 10.1080/01621459.1999.10473879.
  • Nielsen, S. F. (2000), “The Stochastic EM Algorithm: Estimation and Asymptotic Results,” Bernoulli, 6, 457–489. DOI: 10.2307/3318671.
  • Pastorello, S., Patilea, V., and Renault, E. (2003), “Iterative and Recursive Estimation in Structural Nonadaptive Models,” Journal of Business & Economic Statistics, 21, 449–509. DOI: 10.1198/073500103288619124.
  • Stock, J. H., and Watson, M. W. (2006), “Forecasting with Many Predictors,” Handbook of Economic Forecasting, 1, 515–554.
  • ———(2011), “Dynamic Factor Models,” in Oxford Handbook of Economic Forecasting, eds. Michael P. Clements and David F. Hendry, Oxford: Oxford University Press.
  • Wei, G. C., and Tanner, M. A. (1990), “A Monte Carlo Implementation of the EM Algorithm and the Poor Man’s Data Augmentation Algorithms,” Journal of the American Statistical Association, 85, 699–704. DOI: 10.1080/01621459.1990.10474930.
  • Wei, S. (2022), “Income, Employment and Health Risks of Older Workers,” Documentos de Trabajo (CEMFI), (5), 1.
  • Wu, C. J. (1983), “On the Convergence Properties of the EM Algorithm,” The Annals of Statistics, 11, 95–103. DOI: 10.1214/aos/1176346060.