Advanced Methods in Health Psychology and Behavioral Medicine

Too many zeros and/or highly skewed? A tutorial on modelling health behaviour as count data with Poisson and negative binomial regression

Pages 436-455 | Received 22 Jun 2020, Accepted 16 Apr 2021, Published online: 06 May 2021
 

ABSTRACT

Background

Dependent variables in health psychology are often counts, for example, of a behaviour or number of engagements with an intervention. These counts can be very strongly skewed, and/or contain large numbers of zeros as well as extreme outliers. Consider the question ‘How many cigarettes do you smoke on an average day?’: the modal answer may be zero, but answers may range from 0 to 40+. The same can be true for minutes of moderate-to-vigorous physical activity: for some people this may be near zero, but it can take on extreme values for someone training for a marathon. Typical analytical strategies for these data involve explicit (or implied) transformations (smoker v. non-smoker, log transformations). However, these data types are ‘counts’ (i.e. non-negative whole numbers) or quasi-counts (time is ratio but discrete minutes of activity could be analysed as a count), and can be modelled using count distributions, including the Poisson and negative binomial distributions (and their zero-inflated and hurdle extensions, which allow even more zeros).

Methods

In this tutorial paper I demonstrate (in R, Jamovi, and SPSS) the easy application of these models to health psychology data, and their advantages over alternative ways of analysing this type of data, using two datasets: one with a highly dispersed dependent variable (number of views on YouTube), and another with a large number of zeros (number of days on which symptoms were reported over a month).

Results

The negative binomial distribution had the best fit for the overdispersed number of views on YouTube. The negative binomial and zero-inflated negative binomial distributions were both good fits for the symptom data with over-abundant zeros.

Conclusions

In both cases, count distributions not only provided a better fit but would also lead to different conclusions compared to the poorly fitting traditional regression/linear models.

This article is part of the following collections:
Advanced Methods in Health Psychology and Behavioral Medicine

Acknowledgements

I would like to thank David Fisher for introducing me to count models. Many of the ideas presented in this paper are based on his, though any errors are mine. Luke Danagher, David Fletcher and two anonymous reviewers provided helpful feedback.

Disclosure statement

No potential conflict of interest was reported by the author.

Notes

1 Although I am discussing count distributions here, it is also worth noting that very small values just above 0 (e.g., 0.001) become very large and influential values when logged. So logging a continuous variable with very small values will also not perform well.
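As a minimal sketch of the point in Note 1, the snippet below logs a handful of illustrative values (they are made up for this example, not taken from the paper's datasets) and measures how far each lands from log(1) = 0. The helper name `log_distance_from_one` is hypothetical.

```python
import math

def log_distance_from_one(value):
    """Absolute distance of log(value) from log(1) = 0 on the natural-log scale."""
    return abs(math.log(value))

# Illustrative values only: two just above zero, then moderate and large ones.
values = [0.001, 0.01, 1.0, 10.0, 40.0]

for v in values:
    print(f"{v:>7}: log = {math.log(v):7.3f}")
```

On the log scale, 0.001 sits further from zero than 40 does, so a value that was negligible on the raw scale can end up among the most extreme observations after logging.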

2 The negative binomial is a distribution developed to model a series of independent and identically distributed trials with a binary outcome, with a probability of success (p), before a set number of failures (r) occurs. Here, I have described an alternative parameterisation which better describes the shape of the distribution, rather than its underlying mechanism. It is also more obviously analogous to the normal and Poisson distributions when described in terms of μ and θ.
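The mean and dispersion parameterisation in Note 2 can be sketched as follows. The sketch assumes that μ and θ refer to the common form in which the variance is μ + μ²/θ (the form used by, for example, `glm.nb` in R's MASS package); the paper's exact formula is not reproduced here. Under the trial-based description, θ plays the role of r and the mean is μ = rp/(1 − p). The function names and the values μ = 5, θ = 2 are illustrative.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial probability of count y, given mean mu and dispersion theta."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

def nb_moments(mu, theta, upper=2000):
    """Numerically sum the pmf over 0..upper-1 to recover the mean and variance."""
    probs = [nb_pmf(y, mu, theta) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mu, theta = 5.0, 2.0  # illustrative values
mean, var = nb_moments(mu, theta)
print(f"mean = {mean:.4f}, variance = {var:.4f}, mu + mu^2/theta = {mu + mu**2 / theta:.4f}")
```

The smaller θ is, the larger the variance becomes relative to the mean; as θ grows large, the variance approaches μ, which is the Poisson case.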

3 Student’s t-distribution is probably the most well-known fat-tailed distribution. As the sample size/degrees of freedom increases, the distribution converges to the normal distribution. But with smaller samples, the tails fatten, meaning that the value of t required to reject the null hypothesis increases well above 1.96, making the t-test more conservative as the sample size decreases.

4 There are, of course, more distributions (e.g. the beta-binomial), but in practice, Poisson and negative binomial are the most commonly reported and compared (Zeng et al., 2014).

5 You might note that I have not discussed the binomial distribution, but am jumping straight to the quasi-binomial. The binomial distribution does not have the shape we are interested in here, but the quasi-binomial can sometimes be useful.

6 This includes NOT selecting the ‘negative binomial’ option to conduct a negative binomial regression.

7 AIC estimates how much ‘information’ a model loses, so lower values indicate better fit.
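To make Note 7 concrete, here is a rough sketch that computes AIC = 2k − 2 log L for intercept-only Poisson and negative binomial models. Everything in it is an assumption for illustration: the counts are made up (not the paper's data), the negative binomial uses the μ, θ form with variance μ + μ²/θ, μ is set to the sample mean, and θ is found by a crude grid search rather than a proper optimiser.

```python
import math

def poisson_loglik(ys, mu):
    """Poisson log-likelihood of counts ys with mean mu."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in ys)

def nb_loglik(ys, mu, theta):
    """Negative binomial log-likelihood with mean mu and dispersion theta."""
    return sum(
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
        for y in ys
    )

def aic(loglik, n_params):
    """Akaike information criterion: lower values indicate better fit."""
    return 2 * n_params - 2 * loglik

# Made-up counts with many zeros and a few large values (illustrative only).
counts = [0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 20]
mu_hat = sum(counts) / len(counts)

# Crude grid search for theta over 0.05, 0.10, ..., 20.0 (an assumption, not a fitted GLM).
theta_grid = [0.05 * i for i in range(1, 401)]
theta_hat = max(theta_grid, key=lambda t: nb_loglik(counts, mu_hat, t))

aic_poisson = aic(poisson_loglik(counts, mu_hat), n_params=1)
aic_nb = aic(nb_loglik(counts, mu_hat, theta_hat), n_params=2)
print(f"Poisson AIC = {aic_poisson:.1f}, negative binomial AIC = {aic_nb:.1f}")
```

For overdispersed counts like these, the negative binomial pays a penalty for its extra parameter but still ends up with the lower AIC. In practice the models would be fitted with the software demonstrated in the paper rather than by hand.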

8 Named, apparently, because the y-axis is the square-root of the frequency.

9 AIC is available for linear regression, but the values are not comparable.

10 I have no strong justification, theoretical or otherwise, for the choice of these variables. The original design rationale would have been that people with more negative attitudes towards medicines might be less likely to seek conventional medical treatment.

11 As this particular example was constructed for this tutorial, I don’t have a clear choice here.