Abstract
Survey data often contain many variables. Structural equation modeling (SEM) is commonly used in analyzing such data. However, conventional SEM methods are not crafted to handle data with a large number of variables (p). A large p can cause Tml, the most widely used likelihood ratio statistic, to depart drastically from the assumed chi-square distribution even with normally distributed data and a relatively large sample size N. A key element affecting this behavior of Tml is its mean bias. The focus of this article is to determine the cause of the bias. To this end, empirical means of Tml via Monte Carlo simulation are used to obtain the empirical bias. The most effective predictors of the mean bias are subsequently identified and their predictive utility examined. The results are further used to predict type I errors of Tml. The article also illustrates how to use the obtained results to determine the required sample size for Tml to behave reasonably well. A real data example is presented to show the effect of the mean bias on model inference as well as how to correct the bias in practice.
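The Monte Carlo logic behind the empirical bias can be sketched in a simplified setting. The example below is an illustration under stated assumptions, not the models or simulation design used in the article: it tests the fully specified hypothesis that the population covariance matrix is the identity, for which the statistic has the closed form Tml = (N - 1)[tr(S) - log|S| - p] with p(p + 1)/2 degrees of freedom. The function names, replication count, and seed are hypothetical choices made for the sketch.

```python
import numpy as np


def tml_identity(sample):
    """Likelihood ratio statistic for H0: Sigma = I (illustrative assumption).

    Uses the unbiased sample covariance S and the multiplier N - 1.
    """
    n_obs, p = sample.shape
    s = np.cov(sample, rowvar=False)           # divisor N - 1
    _, logdet = np.linalg.slogdet(s)           # log|S|, numerically stable
    f_ml = np.trace(s) - logdet - p            # ML discrepancy at Sigma = I
    return (n_obs - 1) * f_ml


def empirical_mean_bias(p, n_obs, reps=500, seed=0):
    """Return (empirical mean of Tml, df, empirical mean minus df)."""
    rng = np.random.default_rng(seed)
    df = p * (p + 1) // 2
    stats = np.array(
        [tml_identity(rng.standard_normal((n_obs, p))) for _ in range(reps)]
    )
    mean_t = stats.mean()
    return mean_t, df, mean_t - df


if __name__ == "__main__":
    for p in (5, 20):
        mean_t, df, bias = empirical_mean_bias(p, n_obs=100)
        print(f"p={p:2d}  df={df:3d}  mean Tml={mean_t:8.2f}  bias={bias:6.2f}")
```

In this simplified setting, the difference between the average simulated statistic and the degrees of freedom plays the role of the empirical mean bias; rerunning the sketch with a larger p at the same N should make the gap more visible.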
Article information
Conflict of Interest Disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.
Ethical Principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.
Funding: This work was supported by Grant SES-1461355 from the National Science Foundation.
Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.
Acknowledgments: The authors would like to thank Stephen West, Brenna Gomer and two reviewers for their comments on prior versions of this manuscript. The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors' institutions or the National Science Foundation is not intended and should not be inferred.
Notes
1 The function $\log(x)$ is regarded as a transformation of x with power 0 in the development of Box and Cox (1964).
2 The selected number of best subsets might be arbitrary, but our experience indicates that the additional gain becomes minimal as we select more subsets. Also, best-subset regression becomes less effective with too many variables being included in the following step that involves product terms.
3 The option “model y = v1-v10/selection = maxR; weight w;” under Proc Reg allows us to select the best predictors from v1 to v10 according to weighted least squares.
4 The variables in these subsets are reported in .
5 Note that the dots corresponding to and are close to overlap in , and so are the two corresponding to and
6 We use $\lambda_{jk}$ to represent the factor loading of the jth variable on the kth factor. For example, $\lambda_{13,3}$ is the loading of the 13th variable (Straight-Curved Capitals) on the 3rd factor (Speed); including $\lambda_{13,1}$ makes variable 13 also load on the 1st factor (Spatial).
7 We used 5 decimals in order to see the change in p-values with different models and different methods of evaluation.
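As an illustration of the selection step described in Note 3, the following sketch picks the subset of predictors of a given size with the largest weighted R-squared. It is a minimal sketch under assumptions: it uses an exhaustive search in Python, which is not necessarily how Proc Reg proceeds with selection = maxR, and the function names and the synthetic inputs are hypothetical.

```python
from itertools import combinations

import numpy as np


def weighted_r2(x, y, w):
    """R-squared of a weighted least squares fit of y on x with an intercept."""
    design = np.column_stack([np.ones(len(y)), x])
    sw = np.sqrt(w)
    # Weighted least squares via ordinary least squares on rescaled rows.
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    resid = y - design @ beta
    y_bar = np.average(y, weights=w)
    return 1.0 - np.sum(w * resid**2) / np.sum(w * (y - y_bar) ** 2)


def best_weighted_subset(x, y, w, size):
    """Search all column subsets of the given size; return (columns, R-squared)."""
    best_cols, best_r2 = None, -np.inf
    for cols in combinations(range(x.shape[1]), size):
        r2 = weighted_r2(x[:, list(cols)], y, w)
        if r2 > best_r2:
            best_cols, best_r2 = cols, r2
    return best_cols, best_r2
```

An exhaustive search is feasible with ten candidate predictors, as in the note, but its cost grows quickly as more candidates are added.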