762
Views
7
CrossRef citations to date
0
Altmetric
General

The Classical Occupancy Distribution: Computation and Approximation

Pages 364-375 | Received 26 Aug 2019, Accepted 09 Nov 2019, Published online: 06 Jan 2020
 

Abstract

We examine the discrete distributional form that arises from the “classical occupancy problem,” which looks at the behavior of the number of occupied bins when we allocate a given number of balls uniformly at random to a given number of bins. We review the mass function and moments of the classical occupancy distribution and derive exact and asymptotic results for the mean, variance, skewness and kurtosis. We develop an algorithm to compute a cubic array of log-probabilities from the classical occupancy distribution. This algorithm allows the computation of large blocks of values while avoiding underflow problems in computation. Using this algorithm, we compute the classical occupancy distribution for a large block of values of balls and bins, and we measure the accuracy of its asymptotic approximation using the normal distribution. We analyze the accuracy of the normal approximation with respect to the variance, skewness and kurtosis of the distribution. Based on this analysis, we give some practical guidance on the feasibility of computing large blocks of values from the occupancy distribution, and when approximation is required.

Acknowledgments

The author would like to thank two anonymous journal referees and the associate editor for suggestions that improved the article from a previous draft.

Notes

1 First published in English as De Moivre (Citation1718), Problems XXIX–XXX, pp. 73–77. These problems are slightly different to the classical occupancy problem as it is presently conceived. Nevertheless, both Hald, de Moivre, and McClintock (Citation1984, p. 232) and Holst (1986, p. 16) give De Moivre credit for the genesis of the occupancy problem on the basis of these two problems. [Note that there are different editions of De Moivre’s work which label the problems with different numbers. Correspondence of the problems is shown in Table 1 of Hald, deMoivre, and McClintock (1984, p. 234).]

2 Other analysis of the distribution can be found in Arfwedson (Citation1951), Weiss (Citation1958), Harris (Citation1968), Park (Citation1972), Samuel-Cahn (Citation1974), Chao (Citation1984), and Holst (1986).

3 This example is particularly interesting; it involves a strange result that occurred in the Quebec “Super Loto” on July 25, 1982. The lottery allocated 500 prize cars by random sampling with replacement to 2.4 million unclaimed lottery tickets. One participant was lucky enough to win two cars from his single ticket! Hanley notes that this outcome is less improbable than might be imagined.

4 List of distributions at https://en.wikipedia.org/wiki/List_of_probability_distributions. At the time of writing there is no page for the occupancy distribution, but there is a page for the coupon collector’s problem. [Search undertaken on January 13, 2019.]

5 The name “classical occupancy distribution” was used in Johnson and Kotz (Citation1977, p. 110). In an earlier book (Johnson and Kotz Citation1969, p. 251) they noted that it is sometimes called “Arfwedson’s distribution” after Arfwedson (Citation1951). Williamson et al. (Citation2009) proposed to call the distribution the “Stirling2” distribution, due to the presence of the Stirling numbers of the second kind. The present author is of the view that James Stirling has already received sufficient credit for the Stirling numbers, and since he made no specific contribution to the analysis of the occupancy problem, it is preferable to use the more impersonal terminology used by Johnson and Kotz. This name is already established in some of the literature, and has the advantage of linking the distribution directly to the problem from which it is derived. It also avoids adding another case of Stigler’s “law of eponymy”: that no scientific discovery is named after its original discoverer (see Stigler Citation1980).

6 The probability vector p determines the number of bins m through its length, and so there is no need to specify the latter as a parameter when the former is already included. For this reason our later distribution notation will not include explicit dependence on m, but this is implicit, through the fact that there is dependence on p.

7 Notation for this function is awkward, since the forward difference operator operates on the monomial function yn to yield a new function, and the latter is then evaluated at a particular argument value x. We have chosen to use different variable symbols for these two parts to give a clear distinction between the monomial function and the argument used for evaluation of the kth forward difference Δk of the monomial function. In our analysis there will be occasions where we will alter the monomial function in this expression, and occasions where we evaluate the result at a particular argument value x. The notation we use allows us to clearly distinguish those two things. Our notation contrasts with the notation used in Johnson and Kotz (Citation1977) where they write our (Δkyn)(0) in a shorthand as Δk0n (the kth difference of the monomial evaluated at zero), and use similar shorthand when they make changes in the monomial function. We have chosen to be more explicit to prevent any confusion.

8 Our skewness and kurtosis are Skew(K)E(Z3) and Kurt(K)E(Z4) where Z(Kμn,m)/σn,m (i.e., we refer to the third and fourth standardized moments).

9 This is computed as logsumexp(l1,l2)=max(l1,l2)+log(1+exp( |l1l2|)) where the function log(1+exp(x)) can be computed with high accuracy using its Maclaurin series. The function logsumexp is directly available in most statistical computing programs.

10 With N=25,000 this matrix of log-probabilities takes up 2.19 GB as an rmd file in R. Computation on a standard personal computer at the time of writing takes less than an hour.

11 In the special case where min(n,m)=1 the occupancy distribution is a point-mass distribution at k = 1 and we have μn,m=1 and σn,m2=0. In this special case, we take the density N(k|μn,m,σn,m2) to be the Dirac function at k=μn,m=1 which means that the approximation is a point mass at k = 1 (which is the exact distribution). In this case both the occupancy distribution and its approximation are point-mass distributions at the same value, so there is no approximation error. This will be programmed as a special case in our algorithm.

Log in via your institution

Log in to Taylor & Francis Online

PDF download + Online access

  • 48 hours access to article PDF & online version
  • Article PDF can be downloaded
  • Article PDF can be printed
USD 61.00 Add to cart

Issue Purchase

  • 30 days online access to complete issue
  • Article PDFs can be downloaded
  • Article PDFs can be printed
USD 106.00 Add to cart

* Local tax will be added as applicable

Related Research

People also read lists articles that other readers of this article have read.

Recommended articles lists articles that we recommend and is powered by our AI driven recommendation engine.

Cited by lists all citing articles based on Crossref citations.
Articles with the Crossref icon will open in a new tab.