The Classical Occupancy Distribution: Computation and Approximation

The American Statistician, Vol 75, No 4

Abstract

We examine the discrete distributional form that arises from the “classical occupancy problem,” which looks at the behavior of the number of occupied bins when we allocate a given number of balls uniformly at random to a given number of bins. We review the mass function and moments of the classical occupancy distribution and derive exact and asymptotic results for the mean, variance, skewness and kurtosis. We develop an algorithm to compute a cubic array of log-probabilities from the classical occupancy distribution. This algorithm allows the computation of large blocks of values while avoiding underflow problems in computation. Using this algorithm, we compute the classical occupancy distribution for a large block of values of balls and bins, and we measure the accuracy of its asymptotic approximation using the normal distribution. We analyze the accuracy of the normal approximation with respect to the variance, skewness and kurtosis of the distribution. Based on this analysis, we give some practical guidance on the feasibility of computing large blocks of values from the occupancy distribution, and when approximation is required.

Keywords:

Approximation
Birthday problem
Classical occupancy distribution
Computation
Coupon collector problem
Distribution and moments

Acknowledgments

The author would like to thank two anonymous journal referees and the associate editor for suggestions that improved the article from a previous draft.

Notes

1 First published in English as De Moivre (1718), Problems XXIX–XXX, pp. 73–77. These problems differ slightly from the classical occupancy problem as it is presently conceived. Nevertheless, both Hald, de Moivre, and McClintock (1984, p. 232) and Holst (1986, p. 16) credit De Moivre with the genesis of the occupancy problem on the basis of these two problems. [Note that different editions of De Moivre’s work number the problems differently. The correspondence between the problems is shown in Table 1 of Hald, de Moivre, and McClintock (1984, p. 234).]

2 Other analyses of the distribution can be found in Arfwedson (1951), Weiss (1958), Harris (1968), Park (1972), Samuel-Cahn (1974), Chao (1984), and Holst (1986).

3 This example is particularly interesting; it involves a strange result that occurred in the Quebec “Super Loto” on July 25, 1982. The lottery allocated 500 prize cars by random sampling with replacement to 2.4 million unclaimed lottery tickets. One participant was lucky enough to win two cars from his single ticket! Hanley notes that this outcome is less improbable than might be imagined.
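The arithmetic behind this note can be checked with a birthday-problem calculation. The following is a minimal sketch, assuming the 500 cars are drawn independently and uniformly from exactly 2,400,000 tickets (the rounded figures quoted above). The function name is illustrative and does not come from the article or from Hanley.

```python
import math

def prob_some_ticket_repeats(draws, tickets):
    """P(at least one ticket is drawn two or more times) when `draws`
    tickets are sampled uniformly with replacement from `tickets`."""
    if draws > tickets:
        return 1.0  # pigeonhole: a repeat is certain
    # log P(all draws distinct) = sum_{i=0}^{draws-1} log(1 - i / tickets)
    log_all_distinct = sum(math.log1p(-i / tickets) for i in range(draws))
    # 1 - exp(x), computed accurately for x close to zero
    return -math.expm1(log_all_distinct)

# Assumed figures from the note: 500 cars, 2.4 million tickets.
p = prob_some_ticket_repeats(500, 2_400_000)
print(f"P(some ticket wins at least two cars) = {p:.4f}")
```

Under these assumptions the probability comes out at roughly 5%, which is the chance that any ticket at all wins twice, not the chance for one particular participant.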

4 List of distributions at https://en.wikipedia.org/wiki/List_of_probability_distributions. At the time of writing there is no page for the occupancy distribution, but there is a page for the coupon collector’s problem. [Search undertaken on January 13, 2019.]

5 The name “classical occupancy distribution” was used in Johnson and Kotz (1977, p. 110). In an earlier book (Johnson and Kotz 1969, p. 251) they noted that it is sometimes called “Arfwedson’s distribution” after Arfwedson (1951). Williamson et al. (2009) proposed to call the distribution the “Stirling2” distribution, due to the presence of the Stirling numbers of the second kind. The present author is of the view that James Stirling has already received sufficient credit for the Stirling numbers, and since he made no specific contribution to the analysis of the occupancy problem, it is preferable to use the more impersonal terminology used by Johnson and Kotz. This name is already established in some of the literature, and has the advantage of linking the distribution directly to the problem from which it is derived. It also avoids adding another case of Stigler’s “law of eponymy”: that no scientific discovery is named after its original discoverer (see Stigler 1980).

6 The probability vector $p$ determines the number of bins $m$ through its length, and so there is no need to specify the latter as a parameter when the former is already included. For this reason our later distribution notation will not include explicit dependence on $m$, but this is implicit, through the fact that there is dependence on $p$.

7 Notation for this function is awkward, since the forward difference operator operates on the monomial function yⁿ to yield a new function, and the latter is then evaluated at a particular argument value x. We have chosen to use different variable symbols for these two parts to give a clear distinction between the monomial function and the argument used for evaluation of the kth forward difference $Δ^{k}$ of the monomial function. In our analysis there will be occasions where we will alter the monomial function in this expression, and occasions where we evaluate the result at a particular argument value x. The notation we use allows us to clearly distinguish those two things. Our notation contrasts with the notation used in Johnson and Kotz (Citation1977) where they write our $(Δ^{k} y^{n}) (0)$ in a shorthand as $Δ^{k} 0^{n}$ (the kth difference of the monomial evaluated at zero), and use similar shorthand when they make changes in the monomial function. We have chosen to be more explicit to prevent any confusion.

8 Our skewness and kurtosis are $\mathrm{Skew}(K) \equiv E(Z^{3})$ and $\mathrm{Kurt}(K) \equiv E(Z^{4})$ where $Z \equiv (K - μ_{n, m}) / σ_{n, m}$ (i.e., we refer to the third and fourth standardized moments).
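To make these definitions concrete, the sketch below builds the occupancy mass function for small $n$ and $m$ and evaluates $E(Z^{3})$ and $E(Z^{4})$ directly. It is an illustration only, not the article's algorithm: it assumes uniform allocation, works in ordinary probability space (so it will underflow for large arguments), and relies on the standard one-ball-at-a-time recursion in which a new ball lands in an occupied bin with probability $k/m$. The function names are hypothetical.

```python
import math

def occupancy_pmf(n, m):
    """Return pmf where pmf[k] = P(K = k), K being the number of occupied
    bins after n balls are allocated uniformly at random to m bins."""
    pmf = [1.0] + [0.0] * m  # zero balls: zero occupied bins
    for _ in range(n):
        nxt = [0.0] * (m + 1)
        for k, p in enumerate(pmf):
            if p == 0.0:
                continue
            nxt[k] += p * k / m                # ball hits an occupied bin
            if k < m:
                nxt[k + 1] += p * (m - k) / m  # ball hits an empty bin
        pmf = nxt
    return pmf

def standardized_moments(pmf):
    """Return (mean, variance, skewness, kurtosis) with skewness = E(Z^3)
    and kurtosis = E(Z^4), Z being the standardized variable."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    if var == 0.0:
        raise ValueError("degenerate distribution: Z is undefined")
    sd = math.sqrt(var)
    skew = sum(((k - mean) / sd) ** 3 * p for k, p in enumerate(pmf))
    kurt = sum(((k - mean) / sd) ** 4 * p for k, p in enumerate(pmf))
    return mean, var, skew, kurt

mean, var, skew, kurt = standardized_moments(occupancy_pmf(10, 5))
print(mean, var, skew, kurt)
```

The kurtosis here is the raw fourth standardized moment, so a normal distribution would give 3 rather than 0.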

9 This is computed as $\mathrm{logsumexp}(l_{1}, l_{2}) = \max(l_{1}, l_{2}) + \log(1 + \exp(-|l_{1} - l_{2}|))$, where the function $\log(1 + \exp(-x))$ can be computed with high accuracy using its Maclaurin series. The function logsumexp is directly available in most statistical computing programs.
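A minimal sketch of the two-argument formula in this note follows. It swaps in Python's standard `math.log1p` for the series evaluation of $\log(1 + \exp(-x))$ mentioned above, so it should be read as one possible implementation rather than the article's; the name `logsumexp2` is hypothetical.

```python
import math

def logsumexp2(l1, l2):
    """Return log(exp(l1) + exp(l2)) without leaving log space."""
    hi = max(l1, l2)
    if hi == -math.inf:
        return -math.inf  # both terms represent log(0)
    # exp(-|l1 - l2|) lies in [0, 1], so it can never overflow
    return hi + math.log1p(math.exp(-abs(l1 - l2)))

# Two log-probabilities far below the double-precision underflow threshold.
l1, l2 = -1000.0, -1001.0
print(logsumexp2(l1, l2))

# The naive route underflows: exp(-1000) is 0.0, so the log fails.
try:
    math.log(math.exp(l1) + math.exp(l2))
except ValueError:
    print("naive evaluation underflowed")
```

Factoring out the larger argument keeps the remaining exponential between 0 and 1, which is why sums of very small probabilities stay representable.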

10 With $N = 25{,}000$ this matrix of log-probabilities takes up 2.19 GB as an rmd file in R. Computation on a standard personal computer at the time of writing takes less than an hour.

11 In the special case where $\min (n, m) = 1$ the occupancy distribution is a point-mass distribution at $k = 1$ and we have $μ_{n, m} = 1$ and $σ_{n, m}^{2} = 0$. In this special case, we take the density $N (k | μ_{n, m}, σ_{n, m}^{2})$ to be the Dirac function at $k = μ_{n, m} = 1$, which means that the approximation is a point mass at $k = 1$ (which is the exact distribution). In this case both the occupancy distribution and its approximation are point-mass distributions at the same value, so there is no approximation error. This will be programmed as a special case in our algorithm.

References

de Moivre, A. (1718), The Doctrine of Chances, London: Printed by W. Pearson [Publicly Available on Google Books].

Hald, A., de Moivre, A., and McClintock, B. (1984), “A. de Moivre: ‘De Mensura Sortis’ or ‘On the Measurement of Chance’,” International Statistical Review/Revue Internationale de Statistique, 52, 229–262. DOI: https://doi.org/10.2307/1403045.

Arfwedson, G. (1951), “A Probability Distribution Connected With Stirling’s Second Class Numbers,” Scandinavian Actuarial Journal, 34, 121–132. DOI: https://doi.org/10.1080/03461238.1951.10432133.

Weiss, I. (1958), “Limiting Distributions in Some Occupancy Problems,” The Annals of Mathematical Statistics, 29, 878–884. DOI: https://doi.org/10.1214/aoms/1177706544.

Harris, B. (1968), “Statistical Inference in the Classical Occupancy Problem: Unbiased Estimation of the Number of Classes,” Journal of the American Statistical Association, 63, 837–847. DOI: https://doi.org/10.2307/2283876.

Park, C. J. (1972), “A Note on the Classical Occupancy Problem,” The Annals of Mathematical Statistics, 43, 1698–1701. DOI: https://doi.org/10.1214/aoms/1177692405.

Samuel-Cahn, E. (1974), “Asymptotic Distributions for Occupancy and Waiting Time Problems With Positive Probability of Falling Through the Cells,” The Annals of Probability, 2, 515–521. DOI: https://doi.org/10.1214/aop/1176996669.

Chao, A. (1984), “Nonparametric Estimation of the Number of Classes in a Population,” Scandinavian Journal of Statistics, 11, 265–270.

Johnson, N. L., and Kotz, S. (1977), Urn Models and Their Applications, New York: Wiley, pp. 107–175, 318–370.

Johnson, N. L., and Kotz, S. (1969), Discrete Distributions, New York: Wiley, pp. 251–253.

Williamson, P. P., Mays, D. P., Abay Asmerom, G., and Yang, Y. (2009), “Revisiting the Classical Occupancy Problem,” The American Statistician, 63, 356–360. DOI: https://doi.org/10.1198/tast.2009.08104.

Stigler, S. M. (1980), “Stigler’s Law of Eponymy,” Transactions of the New York Academy of Sciences, 39, 147–158. DOI: https://doi.org/10.1111/j.2164-0947.1980.tb02775.x.
