Abstract
Data in the social and behavioral sciences are routinely collected using questionnaires, and each domain of interest is tapped by multiple indicators. Structural equation modeling (SEM) is one of the most widely used methods to analyze such data. However, conventional methods for SEM face difficulty when the number of variables is large, even when the sample size is also rather large. This article addresses the issue of model inference with the likelihood ratio statistic. Using the method of empirical modeling, mean-and-variance corrected statistics for SEM with many variables are developed. Results show that the new statistics not only perform much better than the likelihood ratio statistic but also are substantial improvements over other corrections to it. When combined with a robust transformation, the new statistics also perform well with non-normally distributed data.
Notes
1 While there are other types of big data, for simplicity we will refer to big data in this article as samples with large and small .
2 Lasso regression is not used in our selection of predictors because variables with smaller regression coefficients have been repeatedly found to be more effective in reducing the sum of squares of residuals than variables with larger coefficients.
3 Note that the two quantities are not literally independent. However, the distribution involved is approximately symmetric when the degrees of freedom exceed 20 (see Figure 11.1 of Forbes et al., Citation2011, p. 70), and the correlation between the two quantities then becomes tiny. With many variables, the degrees of freedom for typical SEM models will be much greater than 20, as for the conditions described in the following section. Thus, instead of using generalized least squares, we will just use WLS to estimate the model in Equation 6, ignoring the correlation between the two.
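The symmetry claim above can be checked numerically. The following sketch (our own illustration, not part of the article's method; the function name is ours) estimates the skewness of a chi-square variate by Monte Carlo and shows that it shrinks as the degrees of freedom grow:

```python
import numpy as np

def mc_skewness(df, n=200_000, seed=0):
    """Monte Carlo estimate of the skewness of a chi-square variate with df degrees of freedom."""
    x = np.random.default_rng(seed).chisquare(df, n)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# The theoretical skewness is sqrt(8/df): about 1.26 at df = 5,
# 0.63 at df = 20, and 0.28 at df = 100, so the density is close to
# symmetric once df is well above 20.
```
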
4 The number of best subsets retained is somewhat arbitrary, but our experience indicates that the additional gain becomes minimal as more best subsets are selected. Also, best-subset regression becomes less effective when too many variables, together with their product terms, are carried into the following step.
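As a hypothetical illustration of best-subset regression (not the article's implementation; all names are ours), one can enumerate candidate subsets up to a given size and rank them by residual sum of squares:

```python
import itertools
import numpy as np

def best_subsets(X, y, max_size, n_best=5):
    """Return the n_best predictor subsets (by smallest RSS) of size up to max_size."""
    results = []
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            Xk = X[:, cols]
            beta, rss, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            if rss.size == 0:
                # Rank-deficient or perfect fit: compute RSS directly.
                rss_val = float(np.sum((y - Xk @ beta) ** 2))
            else:
                rss_val = float(rss[0])
            results.append((rss_val, cols))
    results.sort(key=lambda t: t[0])
    return results[:n_best]

# Illustrative data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = X[:, 0] + 0.5 * X[:, 2] + rng.standard_normal(100)
top = best_subsets(X, y, max_size=3)
```

With many candidate variables (and their product terms), the number of subsets grows combinatorially, which is the practical limitation the note refers to.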
5 The table does not contain that information, which was kindly provided by Dr. Dexin Shi via personal communication.
6 The option “model y = v1-v10/selection = maxR; weight w;” under Proc Reg allows us to select the best predictors from v1 to v10 according to weighted least squares.
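For readers without SAS, the following is a rough Python analogue of weighted-least-squares predictor selection. It implements plain forward selection rather than the full maxR swapping algorithm, so it is a simplified sketch under that assumption; the function names are ours:

```python
import numpy as np

def wls_rss(X, y, w):
    """Weighted least squares: minimize sum(w * (y - X @ beta)**2) and return the minimum."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    resid = y - X @ beta
    return float(np.sum(w * resid ** 2))

def forward_select(X, y, w, n_select):
    """Greedily add the predictor that most reduces the weighted RSS."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        rss_best, j_best = np.inf, None
        for j in remaining:
            rss = wls_rss(X[:, chosen + [j]], y, w)
            if rss < rss_best:
                rss_best, j_best = rss, j
        chosen.append(j_best)
        remaining.remove(j_best)
    return chosen
```
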
7 For the statistic in question, 92.6% and 46.18% of the respective values are greater than 1.96; by contrast, 0.0% and 0.54% are smaller than −1.96.
8 For a non-negative definite matrix A, its power A^a can be obtained by A^a = VΛ_aV′, where V = (v_1, v_2, …, v_p), with v_j being the jth eigenvector of A corresponding to the jth eigenvalue λ_j, and Λ_a is the diagonal matrix with jth diagonal element given by λ_j to the power of a.
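The eigendecomposition construction in Note 8 can be sketched in Python (our own illustration; the function name is ours):

```python
import numpy as np

def matrix_power_nnd(A, a):
    """Power of a non-negative definite symmetric matrix via its eigensystem."""
    lam, V = np.linalg.eigh(A)     # lam[j]: jth eigenvalue, V[:, j]: jth eigenvector
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative round-off values
    return V @ np.diag(lam ** a) @ V.T
```

For example, `matrix_power_nnd(A, 0.5)` yields a symmetric square root B of A with B @ B equal to A up to numerical precision.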
9 For a sample from a normally distributed population, the squared Mahalanobis distances approximately follow a chi-square distribution.
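A quick Monte Carlo check of Note 9 (our own illustration; variable names are ours): for draws from a p-variate standard normal, the squared Mahalanobis distances based on the sample mean and covariance average close to p, the mean of a chi-square variate with p degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 4
X = rng.standard_normal((n, p))                 # standard multivariate normal sample
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse sample covariance
# Row-wise quadratic forms (x_i - mu)' S^{-1} (x_i - mu).
d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)
# The empirical mean of d2 should be close to p.
```
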