Abstract
Several studies have shown that the Poisson-lognormal (PLN) is a better alternative to the Poisson-gamma (PG) when data are skewed, while the PG is the more reliable option otherwise. However, it is not clear when the analyst should switch from the PG to the PLN, or vice versa. In addition, the comparison has so far usually been made using goodness-of-fit statistics or statistical tests. Such metrics rarely give any insight into why a specific distribution or model is preferred over another. This paper addresses these topics by (1) designing characteristics-based heuristics to select a distribution between the PG and PLN, and (2) prioritizing the most important summary statistics for selecting between these two options. The results show that the kurtosis and the percentage of zeros in the data are among the most important summary statistics needed to distinguish between these two options.
Acknowledgements
The authors would like to thank the Safe-D UTC center for their support throughout the completion of this research. We would also like to thank Dr. Soma Dhavala for sharing his valuable insights and comments with us. [Disclaimer: The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated in the interest of information exchange. The research was funded partially or entirely by a grant from the U.S. Department of Transportation's University Transportation Centers Program. However, the U.S. Government assumes no liability for the contents or use thereof.]
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1 The term ‘goodness of logic’ was first used in the work of Miaou and Lord (2003). It implies that researchers and analysts should not select one model over another based solely on goodness-of-fit measures, but should also examine the logic behind the selection of the ‘best model.’ More specifically, the model should appropriately characterize the crash generation process through the selected distribution, the functional form linking the number of crashes to the explanatory variables, and how that form relates to the boundary conditions.
2 We assumed that the mean of crash data varies from 0.1 to 20 in our simulation protocol. In some instances, crash data may have a larger mean. However, our analysis showed that in those situations the difference between the Poisson-gamma and the Poisson-lognormal becomes negligible, and both perform similarly when modelling the data.
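As an illustration of the kind of comparison described in Note 2 and in the abstract, the following is a minimal sketch, not the authors' simulation protocol. It draws counts from a Poisson-gamma and a Poisson-lognormal mixture with the same mean and the same mixing variance, then reports summary statistics such as the kurtosis and the percentage of zeros. The function names (`poisson`, `simulate_pg`, `simulate_pln`, `summarize`), the dispersion value, the sample size, and the seed are all assumptions introduced here for illustration.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate by counting unit-rate
    exponential arrivals that fall before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_pg(mean, dispersion, n, rng):
    """Poisson-gamma counts: the Poisson rate is `mean` times a gamma
    multiplier with mean 1 and variance `dispersion` (assumed setup)."""
    shape = 1.0 / dispersion
    return [poisson(mean * rng.gammavariate(shape, dispersion), rng)
            for _ in range(n)]


def simulate_pln(mean, dispersion, n, rng):
    """Poisson-lognormal counts: the multiplier is lognormal with
    mean 1 and variance `dispersion`, matching the gamma case."""
    sigma2 = math.log(1.0 + dispersion)
    mu = -sigma2 / 2.0  # makes the multiplier's mean equal to 1
    sigma = math.sqrt(sigma2)
    return [poisson(mean * rng.lognormvariate(mu, sigma), rng)
            for _ in range(n)]


def summarize(counts):
    """Return mean, variance, skewness, (non-excess) kurtosis and the
    percentage of zeros, using population moments."""
    n = len(counts)
    m = sum(counts) / n
    centered = [c - m for c in counts]
    var = sum(d * d for d in centered) / n
    if var == 0:
        skew = kurt = float("nan")  # moments undefined for constant data
    else:
        skew = (sum(d ** 3 for d in centered) / n) / var ** 1.5
        kurt = (sum(d ** 4 for d in centered) / n) / var ** 2
    return {
        "mean": m,
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
        "pct_zeros": 100.0 * counts.count(0) / n,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for repeatability only
    # Hypothetical settings: mean within the 0.1 to 20 range of Note 2.
    for name, sim in (("PG", simulate_pg), ("PLN", simulate_pln)):
        stats = summarize(sim(1.0, 1.0, 5000, rng))
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

Rerunning the sketch over a grid of means (for example from 0.1 to 20) and dispersion values would give a rough picture of where the two mixtures produce visibly different summary statistics and where they become hard to tell apart.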