Original Articles

Constructing Common Factors from Continuous and Categorical Data

 

Abstract

The method of principal components is widely used to estimate common factors in large panels of continuous data. This article first reviews alternative methods that obtain the common factors by solving a Procrustes problem. While these matrix decomposition methods do not specify the probabilistic structure of the data and hence do not permit statistical evaluations of the estimates, they can be extended to analyze categorical data. This involves the additional step of quantifying the ordinal and nominal variables. The article then reviews and explores the numerical properties of these methods. An interesting finding is that the factor space can be quite precisely estimated directly from categorical data without quantification. This may require using a larger number of estimated factors to compensate for the information loss in categorical variables. Separate treatment of categorical and continuous variables may not be necessary if structural interpretation of the factors is not required, such as in forecasting exercises.

JEL Classification:

ACKNOWLEDGMENTS

I thank Aman Ullah for teaching me econometrics and am especially grateful for his guidance and support over the years. Comments from two anonymous referees are greatly appreciated. I also thank Nickolay Trendafilov for helpful comments and discussions.

Notes

1. A 1983 issue of the Journal of Econometrics (de Leeuw and Wansbeek, editors) was devoted to these methods.

2. The focus is rather different from the structural factor analysis considered in Cunha and Heckman (2008) and Almlund et al. (2011).

3. It is important that the lower bound is attainable and does not depend on x. If the lower bound were 1 instead of 2, no meaning could be attached to x = 3 because the lower bound of 1 is not attainable.

4. A matrix is sub-orthonormal if it can be made orthonormal by appending rows or columns.

5. The orthogonal Procrustes problem was solved in Schönemann (1966). See Gower and Dijksterhuis (2004) for a review of subsequent work.
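For reference, the textbook statement of the orthogonal Procrustes problem and its closed-form solution can be written as follows. The symbols A, B, Q, U, Σ, and V are generic notation assumed here for illustration and are not necessarily those used in the article.

```latex
\min_{Q}\ \|A - BQ\|_F^2 \quad \text{subject to} \quad Q'Q = I .
% Expanding the norm under Q'Q = I:
%   \|A - BQ\|_F^2 = \mathrm{tr}(A'A) + \mathrm{tr}(B'B) - 2\,\mathrm{tr}(Q'B'A),
% so the problem is equivalent to maximizing \mathrm{tr}(Q'B'A).
% With the singular value decomposition B'A = U \Sigma V',
% the maximizer is
Q^{*} = U V' .
```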

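6. Suppose that two continuous variables $X_1$ and $X_2$ are jointly normal with correlation coefficient ρ, each standardized to have mean zero and unit variance. The probability that ($X_1 > \tau_1$, $X_2 > \tau_2$) is given by

$$p_{12}(\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{\tau_1}^{\infty}\int_{\tau_2}^{\infty} \exp\left(-\frac{u^2 - 2\rho u v + v^2}{2(1-\rho^2)}\right) dv\, du.$$

The tetrachoric correlation proposed by Pearson (1900) is the ρ such that $p_{12}(\rho)$ equals the corresponding sample proportion. Polychoric correlations are then generalizations of tetrachoric correlations from two dichotomous indicators to multiple ordered classes.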
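The tetrachoric correlation described in note 6 can be sketched numerically. The snippet below is a minimal illustration and not the article's implementation: it assumes standardized variables, takes the thresholds τ1 and τ2 as given, evaluates the orthant probability by Simpson's rule, and solves for ρ by bisection. The function names `orthant_prob` and `tetrachoric` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its pdf and cdf


def orthant_prob(tau1, tau2, rho, upper=8.0, n=2000):
    """P(X1 > tau1, X2 > tau2) for a standard bivariate normal with correlation rho.

    Uses P = integral over u in [tau1, inf) of phi(u) * P(X2 > tau2 | X1 = u),
    where X2 | X1 = u is N(rho * u, 1 - rho^2). The infinite upper limit is
    truncated at `upper` (an assumption; the tail mass beyond 8 is negligible).
    """
    if tau1 >= upper:
        return 0.0
    s = sqrt(1.0 - rho * rho)

    def integrand(u):
        return _STD.pdf(u) * _STD.cdf((rho * u - tau2) / s)

    # Composite Simpson's rule; n must be even.
    h = (upper - tau1) / n
    total = integrand(tau1) + integrand(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(tau1 + i * h)
    return total * h / 3.0


def tetrachoric(p_hat, tau1, tau2, tol=1e-8):
    """Find rho such that orthant_prob(tau1, tau2, rho) matches p_hat.

    Bisection works because the orthant probability is increasing in rho.
    The search is restricted to (-0.999, 0.999) to keep 1 - rho^2 away from zero.
    """
    lo, hi = -0.999, 0.999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if orthant_prob(tau1, tau2, mid) < p_hat:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice the thresholds would themselves be estimated from the marginal proportions of each dichotomous indicator, for example with `NormalDist().inv_cdf`; that step is omitted here.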

7. The initial procedure proposed by Takane et al. (1979) and its refinement by Nevels (1989) both have shortcomings. FACTALS fixes those bugs. Special thanks to H. Kiers for sharing the MATLAB code.

8. The method has been discovered and rediscovered under different names, including quantification, multiple correspondence analysis, dual or optimal scaling, and homogeneity analysis. See Tenenhaus and Young (1985) for a synthesis of these procedures. However, none of these methods are familiar to economists.

9. Olsson et al. (1982) show that ρ Yz is downward biased for ρ YZ if Y and Z are jointly normal. The greatest attenuation occurs when there are few categories and the data are skewed in opposite directions. In the special case when consecutive integers are assigned to categories of Y, it can be shown that and φ(·) is the standard normal density and q is the categorization attenuation factor.

For y = X (latent continuous data), x (categorical data), and G (adjacency matrix of indicators), $IC_y$ denotes the number of factors selected by the Bai and Ng (2002) criterion with penalty $g_2$ when the principal components are constructed from data y. $AO_y$ denotes the number of factors determined using the criterion of Onatski (2010). The columns denote the average $R^2$ when each of the factors estimated from y is regressed on the true factors.

10. In an earlier version of the article, when x and G were not demeaned, PCA estimated one more factor in both x and G.

For y = X, x, G, Z where Z denotes quantified data, denotes the number of factors estimated by the IC criterion of Bai and Ng (2002) with penalty $g_2$. is the average $R^2$ when each of the factors estimated from y is regressed on all the true factors.

11. FACTALS has convergence problems when N is 100 and the dimension of G is large.

12. Given weights $w_1, \ldots, w_T$ and real numbers $x_1, \ldots, x_T$, the monotone (isotonic) regression problem finds $y_1, \ldots, y_T$ to minimize $\sum_{t=1}^{T} w_t (x_t - y_t)^2$ subject to the monotonicity condition that $t \preceq k$ implies $y_t \le y_k$, where $\preceq$ is a partial ordering on the index set $\{1, \ldots, T\}$. An up-and-down-block algorithm is given in Kruskal (1964). See also de Leeuw (2005).
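To make the monotone regression problem in note 12 concrete, here is a minimal sketch. It assumes the simplest case in which the ordering is the natural total order 1 ≤ 2 ≤ … ≤ T, and it uses the pool-adjacent-violators approach rather than Kruskal's up-and-down-block algorithm. The function name `isotonic_regression` is hypothetical.

```python
def isotonic_regression(x, w=None):
    """Weighted least-squares fit y to x subject to y[0] <= y[1] <= ... <= y[-1].

    Pool-adjacent-violators: scan left to right, keeping a stack of blocks.
    Whenever the newest block's mean falls below the previous block's mean,
    merge the two blocks into one with their weighted mean.
    """
    if w is None:
        w = [1.0] * len(x)  # equal weights by default (an assumption)

    blocks = []  # each block is [weighted mean, total weight, number of points]
    for xi, wi in zip(x, w):
        blocks.append([float(xi), float(wi), 1])
        # Merge backwards while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, c1 + c2])

    # Expand the blocks back to one fitted value per observation.
    y = []
    for mean, _, count in blocks:
        y.extend([mean] * count)
    return y
```

For example, `isotonic_regression([1, 3, 2, 4])` pools the violating pair (3, 2) into its mean and returns `[1.0, 2.5, 2.5, 4.0]`.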
