Recovering Latent Variables by Matching: Journal of the American Statistical Association: Vol 118, No 541

ABSTRACT

We propose an optimal-transport-based matching method to nonparametrically estimate linear models with independent latent variables. The method consists in generating pseudo-observations from the latent variables, so that the Euclidean distance between the model’s predictions and their matched counterparts in the data is minimized. We show that our nonparametric estimator is consistent, and we document that it performs well in simulated data. We apply this method to study the cyclicality of permanent and transitory income shocks in the Panel Study of Income Dynamics. We find that the dispersion of income shocks is approximately acyclical, whereas the skewness of permanent shocks is procyclical. By comparison, we find that the dispersion and skewness of shocks to hourly wages vary little with the business cycle. Supplementary materials for this article are available online.

Supplementary material

In addition to these extensions, in the online appendix we provide the proofs and present some additional simulations.

Acknowledgments

We thank Colin Mallows, Tincho Almuzara, Alfred Galichon, Jiaying Gu, Kei Hirano, Pierre Jacob, Roger Koenker, Thibaut Lamadon, Guillaume Pouliot, Azeem Shaikh, Tim Vogelsang, Daniel Wilhelm, and audiences at various places for comments. Tincho Almuzara and Beatriz Zamorra provided excellent research assistance.

Notes

1 Code to implement the estimator is available on the second author’s webpage.

2 When $A' A$ is nonsingular and A is known, $\hat{X} = {(A' A)}^{- 1} A' Y$ recovers X exactly. We are interested in situations, such as deconvolution and filtering, where exact recovery of the latent variables is not possible.
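The exact-recovery formula in note 2 can be checked numerically. The sketch below is only an illustration on simulated data: the dimensions, the random seed, and the names `A_dec` and `rank_dec` are assumptions made for this example and do not come from the article. It also shows why the formula is unavailable in a deconvolution-type setting, where a single outcome is the sum of two latent variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T = 4 outcomes, K = 2 latent variables, N = 5 units.
T, K, N = 4, 2, 5
A = rng.normal(size=(T, K))   # known loading matrix; full column rank with probability one
X = rng.normal(size=(K, N))   # latent variables, one column per unit
Y = A @ X                     # outcomes generated by the linear model Y = A X

# X_hat = (A'A)^{-1} A'Y, computed by solving the normal equations.
X_hat = np.linalg.solve(A.T @ A, A.T @ Y)
recovered = np.allclose(X_hat, X)

# Deconvolution-type case (illustrative): one outcome, two latent variables.
A_dec = np.array([[1.0, 1.0]])
rank_dec = np.linalg.matrix_rank(A_dec.T @ A_dec)  # A'A is 2 x 2 but has rank 1
```

Here `recovered` is true because `A` has full column rank, whereas `A_dec.T @ A_dec` is singular, so the formula cannot be applied and the latent variables are not recovered exactly.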

3 An interesting possibility would be to jointly estimate A and the distribution of X. Although we do not study it formally, we comment on this possibility in Section 4.2.

4 Other applications in economics include the estimation of the heterogeneous effects of an exogenous binary treatment under the assumption that the potential outcome in the absence of treatment is independent of the gains from treatment (Heckman, Smith, and Clements 1997), and the estimation of the distribution of time-invariant random coefficients of binary treatments in panel data models (Arellano and Bonhomme 2012).

5 See also Botosaru and Sasaki (2015). Our approach may be used to estimate linear autoregressive specifications of the form $η_{t} = α + ρ η_{t - 1} + v_{t}$, where we estimate $(α, ρ)$, that is, the matrix A, in a first step. An important application of error components models is to relax independence in repeated measurements models such as Equation (1). This can be done provided T is large enough. Modeling $ε_{t}$ in Equation (1) as a finite-order moving average or autoregressive process with independent innovations preserves the linear independent factor structure of the model (Arellano and Bonhomme 2012; see also Hu, Moffitt, and Sasaki 2019). In addition, Schennach (2013b) pointed out that, in model (1), full independence between the factors is not necessary, and that sub-independence suffices to establish identification.

6 Ben Moshe (2017) showed how to allow for arbitrary subsets of dependent factors, and proposed characteristic-function-based estimators.

7 The sample sizes being the same for Y and X₂ is not essential and can easily be relaxed. In a setting where the cdf $F_{X_{2}}$ is known, one can draw a sample from it, or alternatively work with an integral counterpart to our estimator.

8 Specifically, one could compute $X_{σ (i, r), 1} + X_{i 2}$, with $σ (\cdot, 1), \dots, σ (\cdot, R)$ being R independent permutations. In that case, π would be a generalized permutation, mapping $\{1, \dots, N\}^{R}$ to $\{1, \dots, N\}$.

9 It is common in applications to assume that some of the $X_{k}$’s have zero mean while leaving the remaining means unrestricted. For example, in the repeated measurements model, assuming that $E (X_{1}) = 0$ suffices for identification. Our algorithm can easily be adapted to such cases.

10 If $A = \{a_{t k}\}$ is unknown and $\hat{A} = \{{\hat{a}}_{t k}\}$ is a consistent estimate of it, then we replace $a_{t k}$ by ${\hat{a}}_{t k}$ in Equation (4). We proceed similarly in the algorithm we propose in the next section. Alternatively, one could jointly minimize the objective function on the right-hand side of Equation (4) with respect to both X and $\{a_{t k}\}$. Here, we do not study the formal properties of such a joint estimation method.

11 Notice that, since π is a permutation, $\sum_{i = 1}^{N} \sum_{t = 1}^{T} Y_{π (i), t}^{2} = \sum_{i = 1}^{N} \sum_{t = 1}^{T} Y_{i t}^{2}$ does not depend on π.
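Why the invariance in note 11 is useful can be seen by expanding the square. The display below is a sketch that assumes the objective in Equation (4) is the sum of squared distances between $Y_{π (i), t}$ and the model prediction, abbreviated here as $m_{i t} \equiv \sum_{k = 1}^{K} a_{t k} X_{σ_{k} (i), k}$ (the symbol $m_{i t}$ is introduced only for this illustration).

```latex
\sum_{i = 1}^{N} \sum_{t = 1}^{T} \left( Y_{\pi(i), t} - m_{i t} \right)^{2}
  = \sum_{i = 1}^{N} \sum_{t = 1}^{T} Y_{i t}^{2}
  - 2 \sum_{i = 1}^{N} \sum_{t = 1}^{T} Y_{\pi(i), t} \, m_{i t}
  + \sum_{i = 1}^{N} \sum_{t = 1}^{T} m_{i t}^{2}
```

Under that assumption, the first term is the same for every π and the last term does not involve π, so for given X the minimization over π only involves the cross term.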

12 See, for example, Galichon (2016, chap. 3) on discrete Monge–Kantorovitch problems.
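When both samples have N equally weighted points, a discrete Monge–Kantorovitch problem of the kind mentioned in note 12 reduces to a linear assignment problem. The sketch below illustrates that matching step on simulated data using SciPy's `linear_sum_assignment`. The arrays `Y` and `pred`, the sizes, and the seed are hypothetical, and this is not presented as the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, T = 6, 2
Y = rng.normal(size=(N, T))      # observed outcomes (simulated for illustration)
pred = rng.normal(size=(N, T))   # model predictions for some candidate X (simulated)

# cost[i, j] = squared Euclidean distance between prediction i and observation j.
cost = ((pred[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

# Solve the assignment problem: pi[i] is the observation matched to prediction i.
rows, pi = linear_sum_assignment(cost)
matched_cost = cost[rows, pi].sum()
```

The array `pi` is a permutation of the observation indices, and `matched_cost` is no larger than the cost of any other one-to-one matching, for instance the identity matching `np.trace(cost)`.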

13 An entropic-regularized counterpart to Equation (4) is, for $ϵ_{N} > 0$:

$\hat{X} = \underset{X \in \mathcal{X}_{N}}{\operatorname{argmin}} \left\{ \min_{P \in \mathcal{P}_{N}} \sum_{i = 1}^{N} \sum_{j = 1}^{N} P_{i j} \sum_{t = 1}^{T} {\left( Y_{j t} - \sum_{k = 1}^{K} a_{t k} X_{σ_{k} (i), k} \right)}^{2} + ϵ_{N} \sum_{i = 1}^{N} \sum_{j = 1}^{N} P_{i j} (\ln (P_{i j}) - 1) \right\} .$
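For a fixed X, the inner minimization over P in note 13 is an entropy-regularized transport problem, which is commonly solved by Sinkhorn iterations. The sketch below is a minimal illustration under stated assumptions: the constraint set for P is taken to be the nonnegative N-by-N matrices with unit row and column sums, the data are simulated, the cost matrix is rescaled to lie in [0, 1], and the function name `sinkhorn`, the value `eps=0.5`, and the iteration count are choices made only for this example.

```python
import numpy as np

def sinkhorn(cost, eps, n_iter=200):
    """Minimize sum(P * cost) + eps * sum(P * (log(P) - 1)) over matrices P
    with unit row and column sums, by alternating scaling (Sinkhorn)."""
    n = cost.shape[0]
    kernel = np.exp(-cost / eps)   # the optimal P has the form diag(u) @ kernel @ diag(v)
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iter):
        u = 1.0 / (kernel @ v)     # rescale so that row sums equal one
        v = 1.0 / (kernel.T @ u)   # rescale so that column sums equal one
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
N, T = 6, 2
Y = rng.normal(size=(N, T))      # observed outcomes (simulated)
pred = rng.normal(size=(N, T))   # model predictions for a candidate X (simulated)

# cost[i, j] = squared distance between prediction i and observation j, rescaled to [0, 1].
cost = ((pred[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
cost = cost / cost.max()

P = sinkhorn(cost, eps=0.5)
```

The returned `P` is a soft matching: every entry is positive, and its rows and columns sum to one. Smaller values of `eps` concentrate `P` on fewer pairs, but typically require more iterations and more care with numerical underflow.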

14 Strictly speaking, Mallows (2007) redefined ${\hat{X}}_{i 1}^{(s + 1)} \equiv {\hat{X}}_{σ^{(s)} (i), 1}^{(s + 1)}$ for all $i = 1, \dots, N$ at the end of step s, and then applied the random permutation $σ^{(s + 1)}$ to the new ${\hat{X}}^{(s + 1)}$ values. This difference with the algorithm outlined here turns out to be immaterial, since the composition of $σ^{(s + 1)}$ and $σ^{(s)}$ is also a random permutation of $\{1, \dots, N\}$.

15 It is not necessary for ${\hat{X}}_{k}$ to be an exact minimizer of Equation (4). As we show in the proof, it suffices that the value of the objective function at $({\hat{X}}_{1}, \dots, {\hat{X}}_{K})$ be in an $ϵ_{N}$-neighborhood of the global minimum, for $ϵ_{N}$ tending to zero as N tends to infinity.

16 Part (i) ensures that $F_{X_{k}}^{- 1}$ belongs to an $‖ \cdot ‖_{1, \infty}$-ball, which is compact under $‖ \cdot ‖_{\infty}$ (Gallant and Nychka 1987). Compactness can be preserved when norms are replaced by weighted norms (e.g., using polynomial or exponential weights); see, for example, Freyberger and Masten (2019, theor. 7), and the analysis in Newey and Powell (2003).

17 The alternative density estimator ${\tilde{f}}_{X_{k}} (x) \equiv 1 / \nabla {\hat{H}}_{k} ({\hat{H}}_{k}^{- 1} (x))$ can be shown to be uniformly consistent for $f_{X_{k}}$ as N tends to infinity under the same conditions.

18 A simple recommendation for practice is based on a truncated normal distribution. Let ${\hat{σ}}_{k}$ denote a consistent estimate of the standard deviation of $X_{k}$, for example, obtained by covariance-based minimum distance, and let c > 0 be a tuning parameter. Possible penalization constants are: $2.3 c {\hat{σ}}_{k}$ (upper bound on quantile values), $2.5 c^{- 1} {\hat{σ}}_{k}$ and $37 c {\hat{σ}}_{k}$ (lower and upper bounds for first derivatives), and $3300 c {\hat{σ}}_{k}$ (upper bound on the second derivatives). When c = 1, these constants are binding when $X_{k}$ follows a normal truncated at the 99th percentiles. As a default choice one may take c = 2.
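The constants in note 18 are simple multiples of the estimated standard deviation, so they can be tabulated with a small helper. The sketch below only transcribes the four multipliers stated in the note. The function name `penalization_constants` and the dictionary keys are hypothetical names chosen for this illustration.

```python
def penalization_constants(sigma_hat, c=2.0):
    """Return the four bounds suggested in note 18 for one latent factor.

    sigma_hat: estimated standard deviation of the factor.
    c: tuning parameter (the note suggests c = 2 as a default).
    """
    if sigma_hat <= 0 or c <= 0:
        raise ValueError("sigma_hat and c must be positive")
    return {
        "quantile_upper": 2.3 * c * sigma_hat,
        "first_derivative_lower": 2.5 * sigma_hat / c,
        "first_derivative_upper": 37.0 * c * sigma_hat,
        "second_derivative_upper": 3300.0 * c * sigma_hat,
    }

# Bounds for a factor with unit standard deviation, at c = 1 and at the default c = 2.
bounds_c1 = penalization_constants(sigma_hat=1.0, c=1.0)
bounds_default = penalization_constants(sigma_hat=1.0)
```

Raising c loosens every bound: the upper bounds scale with c, while the lower bound on first derivatives scales with its inverse.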

19 When implementing the Fourier estimator we enforce the non-negativity and integral constraints ex post. To select the tuning parameter, we minimize the Monte Carlo MISE of the estimator on a grid of values.

20 Indeed, we have

$\underbrace{\begin{pmatrix} Δ Y_{1} \\ Δ Y_{2} \\ Δ Y_{3} \\ \vdots \\ Δ Y_{T} \end{pmatrix}}_{\equiv Y} = \underbrace{\begin{pmatrix} 1 & 0 & \dots & 0 & 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 & - 1 & 1 & \dots & 0 \\ 0 & 0 & \dots & 0 & 0 & - 1 & \dots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \dots & 1 & 0 & 0 & \dots & - 1 \end{pmatrix}}_{\equiv A} \underbrace{\begin{pmatrix} v_{1} - ε_{0} \\ v_{2} \\ \vdots \\ v_{T} + ε_{T} \\ ε_{1} \\ ε_{2} \\ \vdots \\ ε_{T - 1} \end{pmatrix}}_{\equiv X} .$
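The matrix A in note 20 can be built programmatically for any T. The sketch below assumes, as the rows of the display imply, that $Δ Y_{t} = v_{t} + ε_{t} - ε_{t - 1}$, and checks the factorization on one simulated path of a permanent-transitory process. The function name `first_difference_loadings`, the simulated process, and the seed are assumptions made for this illustration.

```python
import numpy as np

def first_difference_loadings(T):
    """Build the T x (2T - 1) matrix A of note 20."""
    A = np.zeros((T, 2 * T - 1))
    A[:, :T] = np.eye(T)           # first T columns: v_1 - eps_0, v_2, ..., v_T + eps_T
    for s in range(T - 1):         # column T + s carries eps_{s+1}
        A[s, T + s] = 1.0          # +eps_{s+1} enters Delta Y_{s+1}
        A[s + 1, T + s] = -1.0     # -eps_{s+1} enters Delta Y_{s+2}
    return A

# Check on a simulated path (illustrative): Y_t = eta_t + eps_t, eta_t = eta_{t-1} + v_t.
rng = np.random.default_rng(0)
T = 4
v = rng.normal(size=T + 1)         # v[1], ..., v[T] are used; v[0] is ignored
eps = rng.normal(size=T + 1)       # eps[0], ..., eps[T]
eta = np.cumsum(np.concatenate(([0.0], v[1:])))   # eta[0] = 0
dY = np.diff(eta + eps)            # dY[t - 1] = v[t] + eps[t] - eps[t - 1]

# Stack the factors in the order used in note 20.
X = np.concatenate(([v[1] - eps[0]], v[2:T], [v[T] + eps[T]], eps[1:T]))
A = first_difference_loadings(T)
```

With these definitions, `A @ X` reproduces the first differences `dY`, and `A` has more columns (2T - 1) than rows (T), which is why the factors cannot be recovered by inverting the linear system.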

21 For example, Storesletten, Telmer, and Yaron (2004) estimated an AR(1) process for the persistent component, whose baseline value for the autoregressive coefficient is 0.96. While they estimated the model in levels, our motivation for estimating (11) in first differences is that differences are robust to heterogeneity between cohorts.

22 We compute the Newey-West formula with one lag. Using two or three lags instead has little impact. In this calculation we do not account for the fact that the quantiles are estimated, our rationale being that the cross-sectional sizes are large relative to the length of the time series.
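For readers unfamiliar with the Newey-West formula mentioned in note 22, the sketch below computes a Bartlett-kernel long-run variance for a scalar time series with a chosen number of lags. It is a generic illustration only: the article applies the formula in its own time-series calculation, which is not reproduced here, and the function name `newey_west_variance` and the toy series are assumptions.

```python
import numpy as np

def newey_west_variance(series, lags=1):
    """Bartlett-kernel (Newey-West) long-run variance of a scalar series."""
    u = np.asarray(series, dtype=float)
    u = u - u.mean()                       # demean the series
    n = u.size
    lrv = u @ u / n                        # lag-0 autocovariance
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1.0)  # Bartlett weight
        gamma = u[lag:] @ u[:-lag] / n     # lag-`lag` autocovariance
        lrv += 2.0 * weight * gamma
    return lrv

# Toy series with strong negative first-order autocorrelation.
toy = [1.0, -1.0, 1.0, -1.0]
lrv_one_lag = newey_west_variance(toy, lags=1)
lrv_no_lag = newey_west_variance(toy, lags=0)
```

With `lags=0` the function returns the ordinary variance of the series, and each additional lag adds a down-weighted autocovariance term. A standard error for the sample mean would then be the square root of the long-run variance divided by the series length.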

23 Indeed, in the univariate case, $X_{i + 1, k} \geq X_{i k}$ for all i is equivalent to $\sum_{j = 1}^{m} X_{σ_{k} (i_{j}), k} (σ_{k} (i_{j + 1}) - σ_{k} (i_{j})) \leq 0$ for all $m \leq N$ and length-m cycle $i_{1}, \dots, i_{m}, i_{m + 1} = i_{1}$.

24 Note that (14) is linear in the $X_{l, i, k}$’s. However, it may be impractical to enforce all restrictions in (14) in the update step. In applications, a possibility is to select $S_{N}$ restrictions at random, where $S_{N}$ depends on the sample size.

Additional information

Funding

Arellano acknowledges research funding from the Ministerio de Economía y Competitividad, Grant ECO2016-79848-P. Bonhomme acknowledges support from the NSF, Grant SES-1658920.
