Abstract
Predicting software quality prior to system tests and operations has proven useful for achieving effective reliability improvements. Poisson (pure) regression modelling is the most commonly used count modelling technique for predicting the expected number of faults in software modules. It is best suited to equidispersed fault data, that is, cases where the mean of the fault data (the dependent variable) equals its variance. However, software fault data often contain a large proportion of zeros (modules with no faults), especially in high-assurance systems. In such cases a pure Poisson regression model (PRM) may yield inaccurate fault predictions. A zero-inflated Poisson (ZIP) model changes the mean structure of a PRM, resulting in improved predictive quality. To illustrate this, we examined software data collected from a full-scale industrial software system. Fault prediction models were calibrated using both pure Poisson and ZIP regression techniques. To avoid claims based on a biased data split (into fit and test data sets), the data set was randomly split 50 times, and models were calibrated using each of these splits. A comparative hypothesis test between the pure Poisson and ZIP modelling techniques revealed that the ZIP model fitted better than its counterpart. The comprehensive empirical comparative study presented in this paper showed that the ZIP model yielded better predictions than the PRM and was more robust in prediction accuracy across the 50 data splits.
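As a minimal sketch of why excess zeros break equidispersion, the following Python fragment implements the standard textbook ZIP distribution with constant parameters. It is not the calibration procedure used in the study: in a ZIP regression model, the Poisson rate and the zero-inflation probability would normally depend on software metrics through link functions, which are omitted here. The names `poisson_pmf`, `zip_pmf`, `zip_mean`, `zip_variance`, `lam`, and `pi` are hypothetical and chosen for illustration only.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k faults under a pure Poisson model with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Probability of k faults under a zero-inflated Poisson model.

    pi is the assumed probability that a module is a structural zero
    (fault-free regardless of the Poisson process); lam is the Poisson rate
    for the remaining modules. Both are constants here by assumption.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def zip_mean(lam, pi):
    """Expected number of faults under ZIP: (1 - pi) * lam."""
    return (1.0 - pi) * lam


def zip_variance(lam, pi):
    """Variance under ZIP: (1 - pi) * lam * (1 + pi * lam)."""
    return (1.0 - pi) * lam * (1.0 + pi * lam)
```

With `pi = 0` the ZIP model reduces to the pure Poisson model, whose mean equals its variance. For any `pi > 0` and `lam > 0` the variance exceeds the mean, so fault data with many zeros cannot be equidispersed, which is the situation in which a pure PRM may predict poorly.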
Acknowledgments
Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, and statistical modeling. He has published more than 200 refereed papers in these areas. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the Association for Computing Machinery, the IEEE Computer Society, and the IEEE Reliability Society. He served as the general chair of the 1999 International Symposium on Software Reliability Engineering (ISSRE’99) and of the 2001 International Conference on Engineering of Computer Based Systems. He has also served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.
Kehan Gao received the Ph.D. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2003. She is currently an Assistant Professor in the Department of Mathematics and Computer Science at Eastern Connecticut State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, computational intelligence, and data mining. She is a member of the IEEE Computer Society and the Association for Computing Machinery.
Robert M. Szabo received the Ph.D. degree in computer science from Florida Atlantic University, Boca Raton, FL, USA, in 1995. He received the M.S. degree (1981) in computer science and the B.S. degree (1980, Summa Cum Laude) in computer science from Cleveland State University, Cleveland, OH, USA. He is currently a Senior I/T Architect in IBM Software Group, Public Sector Solutions Development, Boca Raton, FL, USA. He is a member of the IEEE and of the Empirical Software Engineering Laboratory, Florida Atlantic University, Boca Raton, FL, USA, and an Industrial Affiliate of the Center for Cardiovascular Bioinformatics and Modeling, Johns Hopkins University, Baltimore, MD, USA. His research interests include software quality engineering, software quality modeling, and databases for supporting biological data mining.