Original Articles

Constructing Common Factors from Continuous and Categorical Data

Pages 1141-1171 | Published online: 03 Sep 2014
 

Abstract

The method of principal components is widely used to estimate common factors in large panels of continuous data. This article first reviews alternative methods that obtain the common factors by solving a Procrustes problem. While these matrix decomposition methods do not specify the probabilistic structure of the data and hence do not permit statistical evaluations of the estimates, they can be extended to analyze categorical data. This involves the additional step of quantifying the ordinal and nominal variables. The article then reviews and explores the numerical properties of these methods. An interesting finding is that the factor space can be quite precisely estimated directly from categorical data without quantification. This may require using a larger number of estimated factors to compensate for the information loss in categorical variables. Separate treatment of categorical and continuous variables may not be necessary if structural interpretation of the factors is not required, such as in forecasting exercises.

JEL Classification:

ACKNOWLEDGMENTS

I thank Aman Ullah for teaching me econometrics and am especially grateful for his guidance and support over the years. Comments from two anonymous referees are greatly appreciated. I also thank Nickolay Trendafilov for helpful comments and discussions.

Notes

1. A 1983 issue of the Journal of Econometrics (de Leeuw and Wansbeek, editors) was devoted to these methods.

2. The focus is rather different from the structural factor analysis considered in Cunha and Heckman (2008) and Almlund et al. (2011).

3. It is important that the lower bound is attainable and does not depend on x. If the lower bound were 1 instead of 2, no meaning could be attached to x = 3 because the lower bound of 1 is not attainable.

4. A matrix is sub-orthonormal if it can be made orthonormal by appending rows or columns.

5. The orthogonal Procrustes problem was solved in Schönemann (1966). See Gower and Dijksterhuis (2004) for a review of subsequent work.
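As an illustration of the orthogonal Procrustes problem mentioned in this note, the sketch below uses the standard SVD-based solution: the orthogonal Q minimizing the Frobenius norm of AQ − B is UV', where A'B = UΣV'. The function name `orthogonal_procrustes` and the simulated matrices are assumptions made for this example, not code from the article.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal matrix Q that minimizes ||A @ Q - B||_F."""
    # SVD of the cross-product matrix A'B; the minimizer is U V'
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Hypothetical example: B is an exact rotation of A, so Q should be recovered
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
Q_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
B = A @ Q_true
Q_hat = orthogonal_procrustes(A, B)
```

When B is a noisy rather than exact rotation of A, the same function returns the best-fitting rotation in the least-squares sense.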

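6. Suppose that two continuous variables $X_1$ and $X_2$ are jointly normal, standardized to zero mean and unit variance, with correlation coefficient $\rho$. The probability that $(X_1 > \tau_1, X_2 > \tau_2)$ is given by

$$p_{12}(\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{\tau_1}^{\infty}\int_{\tau_2}^{\infty} \exp\left(-\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)}\right) dx_2\, dx_1.$$

The tetrachoric correlation proposed by Pearson (1900) is the $\rho$ such that $p_{12}(\rho)$ equals the sample proportion $\hat{p}_{12}$. Polychoric correlations are then generalizations of tetrachoric correlations from two dichotomous indicators to indicators with multiple ordered classes.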
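The definition in this note can be turned into a small numerical sketch: compute the upper-orthant probability of a standard bivariate normal for a given ρ, then search for the ρ that matches an observed proportion. The function names, the quadrature grid, and the bisection bounds are assumptions chosen for illustration; the thresholds are taken as given, whereas in practice they would be estimated from the marginal proportions.

```python
import math
import numpy as np

def upper_orthant_prob(rho, tau1, tau2, n_grid=4001):
    """P(X1 > tau1, X2 > tau2) for a standard bivariate normal with correlation rho."""
    # Integrate phi(x) * P(X2 > tau2 | X1 = x) over x in [tau1, tau1 + 8]
    x = np.linspace(tau1, tau1 + 8.0, n_grid)
    pdf = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    z = (tau2 - rho * x) / math.sqrt(1.0 - rho**2)
    surv = 0.5 * np.array([math.erfc(v / math.sqrt(2.0)) for v in z])
    f = pdf * surv
    dx = x[1] - x[0]
    # Trapezoid rule written out by hand
    return float(dx * (f.sum() - 0.5 * (f[0] + f[-1])))

def tetrachoric(p12_hat, tau1, tau2, tol=1e-8):
    """Bisection for the rho at which the orthant probability equals p12_hat."""
    lo, hi = -0.999, 0.999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The orthant probability is increasing in rho
        if upper_orthant_prob(mid, tau1, tau2) < p12_hat:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# With both thresholds at zero, the probability is 1/4 + arcsin(rho) / (2*pi),
# so a proportion of 1/3 corresponds to rho = 0.5.
rho_hat = tetrachoric(1.0 / 3.0, 0.0, 0.0)
```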

7. The initial procedure proposed by Takane et al. (1979) and its refinement by Nevels (1989) both have shortcomings. FACTALS fixes those bugs. Special thanks to H. Kiers for sharing the MATLAB code.

8. The method has been discovered and rediscovered under different names, including quantification, multiple correspondence analysis, dual or optimal scaling, and homogeneity analysis. See Tenenhaus and Young (1985) for a synthesis of these procedures. However, none of these methods are familiar to economists.

9. Olsson et al. (1982) show that $\rho_{Yz}$ is downward biased for $\rho_{YZ}$ if Y and Z are jointly normal. The greatest attenuation occurs when there are few categories and the data are skewed in opposite directions. In the special case when consecutive integers are assigned to categories of Y, it can be shown that $\rho_{Yz} = q\,\rho_{YZ}$, where q, the categorization attenuation factor, is a function of the standard normal density $\phi(\cdot)$.

For y = X (latent continuous data), x (categorical data), and G (adjacency matrix of indicators), $IC_y$ denotes the number of factors selected by the Bai and Ng (2002) criterion with penalty $g_2$ when the principal components are constructed from data y. $AO_y$ denotes the number of factors determined using the criterion of Onatski (2010). The $R^2$ columns report the average $R^2$ when each of the factors estimated from y is regressed on the true factors.

10. In an earlier version of the article, in which x and G were not demeaned, PCA estimated one more factor in both x and G.

For y = X, x, G, Z, where Z denotes quantified data, the table reports the number of factors estimated by the IC criterion of Bai and Ng (2002) with penalty $g_2$, and the average $R^2$ when each of the factors estimated from y is regressed on all the true factors.
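To make the factor-selection step concrete, the sketch below computes an information criterion of the form ln V(k) + k·g_2 from principal components, where V(k) is the average squared residual after removing k components. The penalty is taken to be g_2 = ((N + T)/(NT)) ln(min(N, T)), the form commonly associated with the second Bai and Ng (2002) criterion; the page does not spell it out, so this is an assumption, as are the function name `select_num_factors`, the choice of `kmax`, and the simulated panel.

```python
import numpy as np

def select_num_factors(X, kmax=8):
    """Pick k in 0..kmax minimizing ln V(k) + k * g2 for a T x N panel X."""
    T, N = X.shape
    Xc = X - X.mean(axis=0)                          # demean each series
    s2 = np.linalg.svd(Xc, compute_uv=False) ** 2    # squared singular values
    total = s2.sum()
    g2 = (N + T) / (N * T) * np.log(min(N, T))       # assumed penalty per factor
    ic = []
    for k in range(kmax + 1):
        # Residual variance after removing the first k principal components
        V = (total - s2[:k].sum()) / (N * T)
        ic.append(np.log(V) + k * g2)
    return int(np.argmin(ic))

# Hypothetical panel with two strong factors and unit-variance noise
rng = np.random.default_rng(0)
T, N, r = 200, 50, 2
F = rng.standard_normal((T, r))
Lam = rng.standard_normal((N, r))
X = F @ Lam.T + rng.standard_normal((T, N))
k_hat = select_num_factors(X)
```

The same function can be applied to categorical or indicator data after demeaning, which is the kind of comparison the table notes describe.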

11. FACTALS has convergence problems when N is 100 and the dimension of G is large.

12. Given weights $w_1,\ldots,w_T$ and real numbers $x_1,\ldots,x_T$, the monotone (isotonic) regression problem finds $y_1,\ldots,y_T$ to minimize $\sum_{t=1}^{T} w_t (x_t - y_t)^2$ subject to the monotonicity condition that $t \preceq k$ implies $y_t \le y_k$, where $\preceq$ is a partial ordering on the index set $\{1,\ldots,T\}$. An up-and-down-block algorithm is given in Kruskal (1964). See also de Leeuw (2005).
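For the special case of a total ordering (y_1 ≤ … ≤ y_T), the monotone regression problem in this note can be solved with the pool-adjacent-violators approach sketched below. This is not Kruskal's up-and-down-block algorithm itself, only a minimal stand-in that solves the same weighted least-squares problem under a simple ordering; the function name `isotonic_regression` is an assumption, and the weights are assumed to be strictly positive.

```python
def isotonic_regression(x, w=None):
    """Weighted least-squares fit with y_1 <= ... <= y_T via pool-adjacent-violators."""
    if w is None:
        w = [1.0] * len(x)
    # Each block stores [weighted mean, total weight, number of points]
    blocks = []
    for xi, wi in zip(x, w):
        blocks.append([float(xi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    # Expand block means back to one fitted value per observation
    y = []
    for mean, _, n in blocks:
        y.extend([mean] * n)
    return y

fit = isotonic_regression([1.0, 3.0, 2.0, 4.0])
# fit is [1.0, 2.5, 2.5, 4.0]: the violating pair (3, 2) is pooled to its mean
```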
