ABSTRACT
Finite mixture distributions arise in sampling a heterogeneous population. Data drawn from such a population will exhibit extra variability relative to any single subpopulation. Statistical models based on finite mixtures can assist in the analysis of categorical and count outcomes when standard generalized linear models (GLMs) cannot adequately express variability observed in the data. We propose an extension of GLMs where the response follows a finite mixture distribution and the regression of interest is linked to the mixture’s mean. This approach may be preferred over a finite mixture of regressions when the population mean is of interest; here, only one regression must be specified and interpreted in the analysis. A technical challenge is that the mixture’s mean is a composite parameter that does not appear explicitly in the density. The proposed model maintains its link to the regression through a certain random effects structure and is completely likelihood-based. We consider typical GLM cases where means are either real-valued, constrained to be positive, or constrained to be on the unit interval. The resulting model is applied to two example datasets through Bayesian analysis. Supporting the extra variation is seen to improve residual plots and produce widened prediction intervals reflecting the uncertainty. Supplementary materials for this article are available online.
Supplementary Materials
Hiroshima Code: Demonstrates mixture link Binomial analysis on the Hiroshima dataset (R script).
Arizona Medpar Code: Demonstrates mixture link Poisson analysis on the azpro dataset (R script).
Acknowledgments
We thank Professors Thomas Mathew, Yi Huang, and Yaakov Malinovsky at the University of Maryland, Baltimore County (UMBC) for serving on the committee of the dissertation in which this work was initiated. We thank the UMBC High Performance Computing Facility for use of its computational resources, and for financial support of the first author through a multiyear graduate assistantship. Finally, we thank the referee and associate editor for their feedback.
Disclaimer: This article is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.
Notes
1 The package currently implements mixture link Binomial and Poisson distributions and MCMC samplers. Functions to compute maximum likelihood estimates using numerical optimization are also provided.
2 Analogous statements for some of these remarks can be made about the mixture link Poisson and mixture link Normal distributions, discussed in Sections 4 and 5. We have focused on the binomial case for brevity.
3 This is the number of unique permutations of $\{v_1^*, \ldots, v_J^*\}$, keeping one of the elements fixed.
4 This dataset is available from the publisher of Morel and Neerchal (2012) via http://www.sas.com.
5 These long and thin tails also cause some variability in the computation of the prediction limits. Using the same 2500 draws from MCMC, the upper quantiles can shift by several values each time they are computed because of the random noise in the residuals.
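The effect described in note 5 can be illustrated with a minimal sketch. It is not the authors' code (the supplementary scripts are in R), and every number in it is an assumption made for illustration: the 2500 "posterior draws" are hypothetical means taken from a two-point mixture (5 with probability 0.97, 40 with probability 0.03), the response is Poisson, and the names `draw_poisson` and `prediction_limit` are invented here. The draws are fixed once; only the simulated response noise changes between recomputations of the upper limit.

```python
import math
import random


def draw_poisson(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def prediction_limit(posterior_means, prob, rng):
    """Simulate one response per retained draw and return an upper quantile."""
    simulated = sorted(draw_poisson(rng, m) for m in posterior_means)
    index = min(len(simulated) - 1, math.ceil(prob * len(simulated)) - 1)
    return simulated[index]


# Hypothetical posterior draws: fixed once, as if saved from a single MCMC run.
# The mixture weights and means below are assumptions, not values from the article.
draw_rng = random.Random(1)
posterior_means = [40.0 if draw_rng.random() < 0.03 else 5.0 for _ in range(2500)]

# Recompute the 97.5% limit five times from the SAME draws, each time with
# fresh response noise (a different seed for the predictive simulation).
limits = [
    prediction_limit(posterior_means, 0.975, random.Random(seed))
    for seed in range(5)
]
print(limits)
```

Because the 97.5% quantile in this setup sits where the thin upper tail begins, the printed limits may differ from one recomputation to the next even though the posterior draws never change. Fixing the seed of the predictive simulation, as in `random.Random(seed)`, makes a given limit reproducible.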