
On the MLE of the Waring distribution

Pages 144-158 | Received 17 Sep 2022, Accepted 27 Jan 2023, Published online: 13 Feb 2023

Abstract

The two-parameter Waring distribution is an important heavy-tailed discrete distribution that extends the famous Yule-Simon distribution and provides more flexibility when modelling data. The commonly used EFF (Expectation-First Frequency) method for parameter estimation can only be applied when the first moment exists, and it uses only the information in the expectation and the first frequency, so it is not as efficient as the maximum likelihood estimator (MLE). However, the MLE may not exist for some sample data. We apply the profile method to the log-likelihood function and derive the necessary and sufficient conditions for the existence of the MLE of the Waring parameters. We conduct extensive simulation studies to compare the MLE and EFF methods, and to compare the goodness of fit with the Yule-Simon distribution. We also apply the Waring distribution to fit an insurance data set.

1. Introduction

Power-law distributions are a class of heavy-tailed univariate distributions that describe a quantity whose probability decreases as a power of its magnitude; they are widely used in social science, network science and elsewhere. Two commonly used discrete examples are the Zipf distribution and the Yule-Simon distribution (or Yule distribution). Zipf's law was found by the linguist Zipf when studying the words in a linguistic corpus: the frequency of a word is proportional to r^{−d}, where r is the corresponding rank and d is some positive value. The Yule-Simon distribution is a highly skewed discrete probability distribution with a very long upper tail, named after Udny Yule and Herbert Simon, winner of the 1978 Nobel Prize in economics, with probability mass function P(X = k) = αΓ(k)Γ(α+1)/Γ(α+k+1), α > 0, k = 1, 2, 3, …, where Γ(·) is the Gamma function and α is the parameter. Yule (1925) proposed the distribution first, applying it to model the number of species in biological genera. Simon (1955) rediscovered the 'Yule' distribution later, using it to examine city populations, income distributions, and word frequency in publications (Mills, 2017). In Price (1965, 1976), Price, a famous American scientist, found that the number of citations of the literature follows the Yule distribution, when linking published papers with the papers they cite to form a directed network of scientific and technological literature. It is a cumulative advantage distribution based on the mechanism of 'success breeds success'.

The two-parameter Waring distribution is a generalization of the Yule-Simon distribution, and it provides more flexibility than the commonly used one-parameter Zipf, Yule-Simon and negative binomial distributions, among others. The Waring distribution can describe a wide variety of phenomena in actuarial science, network science, and library and information science, such as the number of shares purchased by each customer, the number of traffic accidents, the number of nodes in internet connections, and the frequency of authors who publish a certain number of papers (Huete-Morales & Marmolejo-Martín, 2020; Panaretos & Xekalaki, 1986; Seal, 1952; Xekalaki, 1983). The probability mass function of X ∼ W(α, β) is given by
(1) P(X = k) = αΓ(β+k−1)Γ(α+β) / {Γ(β)Γ(α+β+k)}, α > 0, β > 0, k = 1, 2, 3, …,
where α, β are the parameters of the Waring distribution. It is easy to prove that the Waring distribution is heavy-tailed, with a polynomial tail of order α+1. We can also derive that E(X) = 1 + β/(α−1) if α > 1, and var(X) = αβ(α+β−1)/{(α−1)²(α−2)} if α > 2. The Yule-Simon distribution is the special case of the Waring distribution with β = 1.
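For concreteness, the probability mass function (1) and the mean can be sketched in Python (the paper's own computations use R; the function names here are ours, purely illustrative):

```python
import math

def waring_pmf(k, alpha, beta):
    # P(X = k) = alpha * Gamma(beta+k-1) Gamma(alpha+beta) / {Gamma(beta) Gamma(alpha+beta+k)},
    # evaluated on the log scale via lgamma for numerical stability
    log_p = (math.log(alpha)
             + math.lgamma(beta + k - 1) - math.lgamma(beta)
             + math.lgamma(alpha + beta) - math.lgamma(alpha + beta + k))
    return math.exp(log_p)

def waring_mean(alpha, beta):
    # E(X) = 1 + beta/(alpha - 1), which exists only for alpha > 1
    if alpha <= 1:
        raise ValueError("the mean does not exist for alpha <= 1")
    return 1 + beta / (alpha - 1)
```

With β = 1 the pmf reduces to the Yule-Simon probability mass function αΓ(k)Γ(α+1)/Γ(α+k+1).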

Parameter estimation is extremely important for statistical inference. Garcia (2011) provided a fixed-point algorithm to estimate the Yule-Simon distribution parameter. For the Waring distribution, a commonly used method is the EFF (Expectation-First Frequency), which is essentially the method of moments. More specifically, the EFF method uses the sample mean X̄ to estimate E(X) = 1 + β/(α−1) and the empirical first frequency P̂(X=1) to estimate P(X=1) = α/(α+β), leading to
α̂ = P̂(X=1)(X̄−1) / {P̂(X=1)X̄ − 1},  β̂ = {1 − P̂(X=1)}(X̄−1) / {P̂(X=1)X̄ − 1}.
The EFF method has two drawbacks: first, it requires α > 1, so it cannot be used when the first moment does not exist; second, it uses only the information in P(X=1) and E(X), which loses information in the data. Xekalaki (1985) proposed a factorial moment estimator for the bivariate generalized Waring distribution, which suffers from the same drawbacks.
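The EFF estimator above is straightforward to implement; a minimal Python sketch (the function name `eff_estimate` is ours, not from the paper):

```python
def eff_estimate(data):
    # EFF (Expectation-First Frequency): match the sample mean to
    # E(X) = 1 + beta/(alpha-1) and the empirical first frequency to
    # P(X = 1) = alpha/(alpha+beta), then solve for (alpha, beta)
    n = len(data)
    xbar = sum(data) / n
    p1 = sum(1 for x in data if x == 1) / n
    denom = p1 * xbar - 1.0
    if denom <= 0:
        raise ValueError("EFF estimates do not exist for this sample")
    alpha = p1 * (xbar - 1.0) / denom
    beta = (1.0 - p1) * (xbar - 1.0) / denom
    return alpha, beta
```

For a sample whose empirical first frequency and mean match the population values exactly, the estimator recovers the parameters exactly.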

In the current literature, researchers have also considered the maximum likelihood estimator (MLE) of the Waring parameters. However, they usually applied an optimization algorithm directly to the log-likelihood function, without verifying the existence of the MLE (Rivas & Campos, 2021). As is well known, the MLE does not exist in all cases. In fact, for some sample data the MLE of the Waring parameters exists, while for other sample data it does not. For example, in the insurance share data analysed in Section 4, the MLE of the Waring parameters does not exist for the groups with central ages 17.5, 22.5 and 67.5; each group covers an age range of length 5. If we compute the MLE without knowing whether it exists, its credibility is questionable. Based on this consideration, the existence of the MLE is investigated in this paper. More specifically, we apply the profile method to the log-likelihood function and derive the necessary and sufficient conditions for the existence of the MLE of the Waring parameters. When the largest value in the observed sample is small, we also verify our theory by exactly solving the estimating equation system. Furthermore, we obtain two byproducts in the proof of the main result. The first is Lemma 2.3, which provides an alternative way to prove the existence of an MLE for two parameters, whereas the conventional proof involves a complicated calculation of the Hessian matrix. The second is Lemma 2.4, which provides a comparison method for two increasing and concave functions. These results may be useful in other applications.

Through extensive simulation studies, we find that when the sample size is as small as n = 100, both MLE and EFF yield relatively poor estimates. When n ≥ 200, MLE always has much smaller biases than EFF; the relative bias of MLE decreases from 6%-7% when n = 200 to around 1% when n = 1000, while that of EFF is still around 10% even when n = 1000 for α ≤ 1.2. The relative standard errors from MLE are comparable with those from EFF for medium-sized samples (n = 200 and 400), but smaller for n = 1000. Overall, the MLE method performs better than the EFF method when α/β is not large or the sample size is large enough. The performance of EFF is relatively better when α/β is large, say α/β ≥ 2. Our explanation is that, since P(X=1) = α/(α+β) = (α/β)/(α/β+1), if α/β is large, then P(X=1) is close to 1, and thus EFF includes relatively more information than in the case with small α/β. We also compare the Waring distribution and the Yule-Simon distribution in terms of goodness of fit, and we find that the Waring distribution fits the data similarly to the Yule-Simon distribution when β = 1, and much better when β departs from 1.

The rest of the paper is organized as follows. Section 2 presents the main result based on the profile method. Section 3 gives some numerical studies to show the advantage of MLE over the EFF method, and that of Waring distribution over the Yule-Simon distribution. The real insurance data analysis is presented in Section 4. All technical details are deferred to the Appendix.

2. Maximum likelihood estimator of the Waring parameters

For the two-parameter Waring distribution, we have
P(X = 1) = αΓ(β)Γ(α+β) / {Γ(β)Γ(α+β+1)} = α/(α+β),
P(X = k) = αΓ(β+k−1)Γ(α+β) / {Γ(β)Γ(α+β+k)} = αβ(β+1)(β+2)⋯(β+k−2) / {(α+β)(α+β+1)⋯(α+β+k−1)}, k = 2, 3, ….
Suppose that x1, …, xn is a random sample from the Waring distribution W(α, β); let m = max{x1, …, xn} be the largest observed value and nk be the number of observations equal to k, k = 1, …, m, so that Σ_{k=1}^m nk = n. Based on the data {x1, …, xn}, we can easily derive the likelihood function as
Ln(α, β) = {α/(α+β)}^{n1} ∏_{k=2}^m [αβ(β+1)⋯(β+k−2) / {(α+β)(α+β+1)⋯(α+β+k−1)}]^{nk}
= {α/(α+β)}^n {β/(α+β+1)}^{Σ_{s=2}^m ns} {(β+1)/(α+β+2)}^{Σ_{s=3}^m ns} ⋯ {(β+m−2)/(α+β+m−1)}^{nm}.
Then the log-likelihood is
(2) ℓn(α, β) = log Ln(α, β) = n{log α − log(α+β)} + Σ_{s=2}^m ns {log β − log(α+β+1)} + Σ_{s=3}^m ns {log(β+1) − log(α+β+2)} + ⋯ + nm {log(β+m−2) − log(α+β+m−1)}.
Taking partial derivatives with respect to α and β leads to the following maximum likelihood equations
(3) (1/n) ∂ℓn(α, β)/∂α = 1/α − {1/(α+β) + (Σ_{s=2}^m ps)/(α+β+1) + (Σ_{s=3}^m ps)/(α+β+2) + ⋯ + pm/(α+β+m−1)} = 0,
(4) (1/n) ∂ℓn(α, β)/∂β = −{1/(α+β) + (Σ_{s=2}^m ps)/(α+β+1) + (Σ_{s=3}^m ps)/(α+β+2) + ⋯ + pm/(α+β+m−1)} + {(Σ_{s=2}^m ps)/β + (Σ_{s=3}^m ps)/(β+1) + ⋯ + pm/(β+m−2)} = 0,
where pk = nk/n with pm > 0.
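The log-likelihood (2) depends on the data only through the counts nk. A Python sketch that evaluates (2) (the dictionary-of-counts representation and the function name are our assumptions, not from the paper):

```python
import math

def waring_loglik(alpha, beta, counts):
    # counts[k] = n_k, the number of observations equal to k; implements eq. (2)
    n = sum(counts.values())
    m = max(counts)
    ll = n * (math.log(alpha) - math.log(alpha + beta))
    for j in range(2, m + 1):
        tail = sum(counts.get(s, 0) for s in range(j, m + 1))  # sum_{s=j}^m n_s
        ll += tail * (math.log(beta + j - 2) - math.log(alpha + beta + j - 1))
    return ll
```

Because (2) is just a regrouping of Σ_k nk log P(X = k), this telescoped form agrees with the direct pmf-based log-likelihood.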

We first consider Equation (3), which can be treated as the conditional maximum likelihood equation of α given a positive β. When m = 1, that is, when all observed values equal 1, we have (1/n) ∂ℓn(α, β)/∂α = 1/α − 1/(α+β) = β/{α(α+β)} > 0, so there is no solution to the likelihood equation. We therefore focus on the situation where m ≥ 2.

In the following, we first consider the conditional maximum likelihood Equation (3) for any given positive β, which can be regarded as a generalization of the Yule-Simon case, and we prove that the results for the Yule-Simon distribution (β = 1) also hold for any β > 0. More specifically, given a positive β, we denote the conditional MLE of α by α(β). According to (3), α(β) satisfies
(5) α(β) = 1 / [1/{α(β)+β} + (Σ_{s=2}^m ps)/{α(β)+β+1} + (Σ_{s=3}^m ps)/{α(β)+β+2} + ⋯ + pm/{α(β)+β+m−1}].
For notational ease, we define
(6) η1 = Σ_{t=2}^m Σ_{s=t}^m ps = Σ_{t=2}^m (t−1)pt, η2 = Σ_{t=2}^m t Σ_{s=t}^m ps = Σ_{t=2}^m {(t−1)(t+2)/2} pt, η3 = Σ_{t=2}^m t² Σ_{s=t}^m ps = Σ_{t=2}^m {(t−1)(2t²+5t+6)/6} pt,
and present the properties of α(β) in the following Proposition 2.1.

Proposition 2.1

Let α(β) be defined as in (Equation5). We have the following properties.

  1. Property 1. If β → 0, we have α(β) → 0.

  2. Property 2. If β → ∞, we have α(β) = (1/η1)β + (η2−η1)/{η1(1+η1)} + [{(η2²−η1η3)/η1 + 2η2 − η3}/(1+η1)³](1/β) + O(1/β²).

  3. Property 3. α(β) is an increasing and concave function of β.

  4. Property 4. The first derivative α′(β) → ∞ if β → 0, and α′(β) → 1/η1 if β → ∞.

  5. Property 5. When β > 0, the number of solutions to α(β) = Z(β) is finite, where Z(β) is any polynomial or fractional function of β.

Next we discuss the existence of the MLE of (α, β). By (3), we have
(7) 1/{α(β)+β} + (Σ_{s=2}^m ps)/{α(β)+β+1} + (Σ_{s=3}^m ps)/{α(β)+β+2} + ⋯ + pm/{α(β)+β+m−1} = 1/α(β).
By (4), we have
1/{α(β)+β} + (Σ_{s=2}^m ps)/{α(β)+β+1} + (Σ_{s=3}^m ps)/{α(β)+β+2} + ⋯ + pm/{α(β)+β+m−1} = (Σ_{s=2}^m ps)/β + (Σ_{s=3}^m ps)/(β+1) + ⋯ + pm/(β+m−2).
Let
(8) h(β) = 1 / {(Σ_{s=2}^m ps)/β + (Σ_{s=3}^m ps)/(β+1) + ⋯ + pm/(β+m−2)}.
If the curves y = h(β) and y = α(β) intersect at some β > 0, then the equation system (3)-(4) has a solution. Later we prove that the intersection is unique and is the MLE of the Waring distribution.
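The two curves can be computed numerically. A Python sketch follows: `alpha_profile` iterates the fixed-point map implied by (5) and `h` evaluates (8); the function names, and the convergence of this particular fixed-point iteration, are our assumptions rather than claims from the paper:

```python
def alpha_profile(beta, p, iters=500):
    # fixed-point iteration on eq. (5); p[k] = p_k for k = 1..m, summing to 1
    m = max(p)
    a = 1.0  # arbitrary positive starting value
    for _ in range(iters):
        s = 1.0 / (a + beta)
        for j in range(2, m + 1):
            tail = sum(p.get(t, 0.0) for t in range(j, m + 1))  # sum_{s=j}^m p_s
            s += tail / (a + beta + j - 1)
        a = 1.0 / s
    return a

def h(beta, p):
    # eq. (8): reciprocal of the beta-part of the score equation (4)
    m = max(p)
    s = 0.0
    for j in range(2, m + 1):
        tail = sum(p.get(t, 0.0) for t in range(j, m + 1))
        s += tail / (beta + j - 2)
    return 1.0 / s
```

Scanning β > 0 for a crossing of the two curves then locates the candidate MLE.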

To discuss whether y=h(β) and y=α(β) intersect at some β>0, we first present the properties of h(β) in the following proposition.

Proposition 2.2

Let h(β) be defined as in (Equation8), we have the following properties.

  1. Property 1*. If β → 0, we have h(β) → 0.

  2. Property 2*. If β → ∞, we have h(β) = (1/η1)β + (η2−2η1)/η1² + {(η2²−η1η3)/η1³}(1/β) + O(1/β²).

  3. Property 3*. h(β) is an increasing and concave function of β.

  4. Property 4*. The first derivative h′(β) → 1/(Σ_{s=2}^m ps) if β → 0, and h′(β) → 1/η1 if β → ∞.

  5. Property 5*. When β > 0, the number of solutions to h(β) = Z(β) is finite, where Z(β) is any polynomial or fractional function of β.

Based on Properties 1 and 4 of Proposition 2.1 and Properties 1* and 4* of Proposition 2.2, it is easy to derive that α(β) > h(β) when β is small. Therefore, if we can prove that α(β) < h(β) for some large β, then by the continuity of the two functions there must exist a solution to the equation system (3)-(4). This is the key idea for checking the existence of the MLE.

Before presenting the main result, we first give two important lemmas.

Lemma 2.3

For the log-likelihood function ℓn(α, β), assume that for any β, ℓn(α(β), β) = max_α ℓn(α, β), and that there exists β1 such that

∂ℓn(α, β)/∂β |_{α=α(β1), β=β1} = 0, ∂ℓn(α, β)/∂β |_{α=α(β)} > 0 for β < β1, and

∂ℓn(α, β)/∂β |_{α=α(β)} < 0 for β > β1. Then we have ℓn(α(β1), β1) = max_{α,β} ℓn(α, β).

Lemma 2.3 provides an alternative proof of the existence of the MLE based on the profile method, which is simpler than the conventional proof that involves a complicated calculation of the Hessian matrix.

Lemma 2.4

Assume that t1(x) and t2(x) are increasing and concave functions for x > 0, that the curves y = t1(x) and y = t2(x) intersect only finitely many times, and that the number of solutions to ti(x) = Z(x) is finite for both i = 1, 2, where Z(x) is any polynomial or fractional function of x. Further assume that

  (A) t1(a) = t2(a) for some a;

  (B) there exists some δ > 0 such that t1(x) > t2(x) for x ∈ (a, a+δ);

  (C) lim_{x→∞} t1(x)/x = lim_{x→∞} t2(x)/x = c > 0;

  (D) there exists δ4 such that t1(x) > t2(x) for x ∈ (δ4, ∞).

Then, we have t1(x) ≥ t2(x) for all x ∈ (a, ∞).

Lemma 2.4 provides a general method to compare two increasing and concave functions, without requiring the explicit form of the functions, which not only simplifies the comparison of α(β) and h(β), but also has its own value in other applications.

Based on Propositions 2.1-2.2 and Lemmas 2.3-2.4, we summarize the existence of the MLE in the following Theorem 2.5.

Theorem 2.5

Suppose that {x1, …, xn} is a random sample from the Waring distribution W(α, β), and m = max{x1, …, xn}. Let pk = nk/n be the proportion of {xi = k}, with pm > 0. Let
α_intercept = (η2−η1)/{η1(1+η1)},  h_intercept = (η2−2η1)/η1².

  1. If α_intercept < h_intercept, then the MLE of (α, β) exists.

  2. If α_intercept > h_intercept, then the MLE of (α, β) does not exist.

  3. If α_intercept = h_intercept, or equivalently, η1² + 2η1 − η2 = 0, we denote dα = {(η2²−η1η3)/η1 + 2η2 − η3}/(1+η1)³ and dh = (η2²−η1η3)/η1³. The MLE exists if dα < dh and does not exist if dα > dh.

To derive the necessary and sufficient conditions for MLE existence, we start from the conditional MLE of α for a given β, because it is easier to discuss the possible solutions via the intersection of the two curves determined by the estimating equations. Numerically, since we only have two parameters to estimate, it is quite efficient to solve the problem with the 'optim' function in R.
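Theorem 2.5 reduces the existence check to comparing the two intercepts computed from η1 and η2. A Python sketch (the helper name `mle_exists` is ours; the knife-edge case of part 3, which needs η3 as well, is left unresolved here):

```python
def mle_exists(p):
    # Theorem 2.5 existence check; p[k] = p_k with p_m > 0.
    # Returns None in the boundary case alpha_intercept == h_intercept,
    # which requires comparing d_alpha and d_h (Theorem 2.5, part 3).
    eta1 = sum((t - 1) * pt for t, pt in p.items())
    eta2 = sum((t - 1) * (t + 2) / 2 * pt for t, pt in p.items())
    a_int = (eta2 - eta1) / (eta1 * (1 + eta1))
    h_int = (eta2 - 2 * eta1) / eta1**2
    if a_int < h_int:
        return True
    if a_int > h_int:
        return False
    return None
```

For m = 2 this always returns False, in line with Remark 2.1.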

Remark 2.1

Unlike the existing literature, which directly applied optimization algorithms to the log-likelihood function without verifying the existence of the MLE (Huete-Morales & Marmolejo-Martín, 2020; Rivas & Campos, 2021), we present the necessary and sufficient conditions for the existence of the MLE of the Waring parameters, which is the first such attempt. It is easy to see that the sign of α_intercept − h_intercept is equal to the sign of
A(p2, …, pm) = η1² + 2η1 − η2.
For m = 2, we have A(p2, …, pm) = p2² = (n2/n)² > 0, and thus the MLE of (α, β) does not exist. For m ≥ 3, it depends, and we can check the sign of
A(p2, …, pm) = {p2 + 2p3 + ⋯ + (m−1)pm}² − {p3 + 3p4 + ⋯ + (m−1)(m−2)/2 · pm}
for a general m. For m = 2, 3, we also carefully checked the existence of real-valued solutions to the equation system (3)-(4), and found that the sign of α_intercept − h_intercept indeed determines the existence of the MLE. The details are available from the authors upon request.

One more comment on Theorem 2.5 is as follows. If α_intercept < h_intercept, or α_intercept = h_intercept with dα < dh, the MLE of the Waring parameters is a finite vector. Then the Waring distribution fits the data better than the Yule-Simon distribution if the estimated β departs from 1, and similarly to the Yule-Simon distribution if the estimated β is close to 1. If α_intercept > h_intercept, or α_intercept = h_intercept with dα > dh, the likelihood function is maximized at the boundary of the parameter region, i.e., at infinity. Therefore, if we directly apply an optimization algorithm to the likelihood function, the 'MLE' may be far from the true parameters; for example, in the real data application, the optimizer returns α̂ = 1,687,133.2, β̂ = 675,078.4 for the group with central age 67.5 (ages from 65 to 70), where in fact the MLE does not exist. In such cases, we can use the EFF method if the EFF estimates are on reasonable scales, and the Waring distribution will still fit the data better than the Yule-Simon distribution.

3. Simulation studies

3.1. Comparison of MLE and EFF

In this section, we give some numerical studies to compare the MLE and the EFF method in the Waring parameter estimation.

The Waring-distributed observations are generated by the function rWARING in the R package gamlss.dist. We should mention that in the function rWARING, the parameters are {μ, σ}, and the probability mass function is given by
P(X = k) = (1+σ)Γ(k+μ/σ)Γ((μ+σ+1)/σ) / {σΓ(k+(μ+1)/σ+2)Γ(μ/σ)}, k = 0, 1, 2, …, μ > 0, σ > 0.
Comparing the above probability mass function to (1), we find that we need to add 1 to the values generated by rWARING, and the relationship between the parameters is α = 1 + 1/σ and β = μ/σ. Thus rWARING automatically restricts α > 1, and the EFF estimator exists. We consider 20 combinations of (α, β), where α = 2, 1.5, 1.2, 1.1, 1.05 and β = 0.5, 1, 1.5, 2, with sample sizes n = 100, 200, 400 and 1000. We generate 500 replicates for each case.
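The reparameterization between rWARING's (μ, σ) and the (α, β) of (1) is a pair of one-line maps; a Python sketch (function names are ours):

```python
def gamlss_to_waring(mu, sigma):
    # map the gamlss.dist WARING parameters (mu, sigma) to (alpha, beta) of (1):
    # alpha = 1 + 1/sigma, beta = mu/sigma
    return 1 + 1 / sigma, mu / sigma

def waring_to_gamlss(alpha, beta):
    # inverse map; requires alpha > 1, matching rWARING's implicit restriction
    sigma = 1 / (alpha - 1)
    mu = beta * sigma
    return mu, sigma
```

Remember that rWARING generates values supported on k = 0, 1, 2, …, so 1 must be added to each draw before fitting model (1).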

Probably due to the parameter specification and the restricted data-generating process of the function rWARING, we find that α_intercept < h_intercept is satisfied in all cases, except for two replicates in the case α = 2, β = 0.5 with small sample size n = 100. By Remark 2.1, α_intercept < h_intercept is equivalent to
(9) {p2 + 2p3 + ⋯ + (m−1)pm}² < p3 + 3p4 + ⋯ + (m−1)(m−2)/2 · pm.
It is easy to see that
{p2 + 2p3 + ⋯ + (m−1)pm}² = {Σ_{k=1}^m (k−1)pk}² = {En(X) − 1}²,
p3 + 3p4 + ⋯ + (m−1)(m−2)/2 · pm = Σ_{k=1}^m {(k−1)(k−2)/2} pk = En(X²)/2 − 3En(X)/2 + 1,
where En denotes the empirical distribution. When 1 < α ≤ 2, E(X) exists while E(X²) diverges. Thus (9) is very likely to hold, and the MLE exists. However, in real applications, it is possible that α_intercept > h_intercept (Section 4).
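Condition (9) can be checked directly from the empirical moments; a Python sketch (the helper name `condition_9` is ours):

```python
def condition_9(data):
    # checks (9) through its empirical-moment form:
    # {E_n(X) - 1}^2 < E_n(X^2)/2 - 3 E_n(X)/2 + 1
    n = len(data)
    m1 = sum(data) / n          # E_n(X)
    m2 = sum(x * x for x in data) / n  # E_n(X^2)
    return (m1 - 1.0) ** 2 < m2 / 2.0 - 1.5 * m1 + 1.0
```

This agrees with the sign of A(p2, …, pm) from Remark 2.1: a heavy empirical tail inflates En(X²) and makes (9), and hence MLE existence, more likely.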

As mentioned immediately after Theorem 2.5, we use the 'optim' function to solve for the MLE after verifying its existence. We tried four methods to initialize the parameters: (i) small values, (α(0), β(0)) = (1.1, 0.1); (ii) large values, (α(0), β(0)) = (2.5, 3); (iii) the true parameter values plus a random N(0, 0.2²) perturbation, restricted so that α(0) ≥ 1.1 and β(0) ≥ 0.1; (iv) the EFF estimates. Extensive numerical studies show that these four initialization methods yield almost the same results, which indicates that the optimization is not sensitive to the initial values. Therefore, we use the EFF estimator for initialization if EFF produces positive estimates; otherwise, we set the initial values to (α(0), β(0)) = (1.1, 0.1).

Among all the cases, the EFF method produces negative estimates in only one replicate, in the case α = 2, β = 0.5 with small sample size n = 100; in another replicate, the denominator P̂(X=1)X̄ − 1 is exactly 0, so the estimator does not exist; these two replicates are deleted for fair comparison. The parameters are on different scales; for the parameter β in particular, the maximal value is four times the minimal one. Thus, for fair comparison, we report the rBias (relative bias, defined as the bias divided by the true value of the parameter) and rStd (relative standard error, defined as the standard error divided by the true value of the parameter) in Tables 1 and 2. We find that, when the sample size is as small as n = 100, both MLE and EFF yield relatively poor estimates, with standard errors larger than or close to 50% of the true value of the parameter, which indicates that it is challenging to accurately estimate the parameters with small sample sizes. Therefore, we focus on the comparison of MLE and EFF for n ≥ 200. First, MLE always has much smaller biases than EFF. Though the rBias of EFF decreases as the sample size increases, it increases as the true α decreases, and it is still around 10% even when n = 1000 for α ≤ 1.2; the rBias of MLE decreases from 6%-7% to around 1% as n increases from 200 to 1000, regardless of the true α. Second, MLE has rStd comparable with EFF for medium-sized samples (n = 200 and 400), but smaller rStd for n = 1000. Overall, the MLE method performs better than the EFF method when α/β is not large or when the sample size is large enough. The performance of EFF is relatively better when α/β is large, e.g., α/β ≥ 2. Our explanation is that, since P(X=1) = α/(α+β) = (α/β)/(α/β+1), if α/β is large, then P(X=1) is close to 1. Thus EFF includes relatively more information than in the case with small α/β.

Table 1. Relative biases and relative standard errors of estimated parameters, for β=0.5 and 1.

Table 2. Relative biases and relative standard errors of estimated parameters, for β=1.5 and 2.

3.2. Goodness-of-fit comparison with Yule-Simon distribution

In this section, we compare the Waring distribution and the Yule-Simon distribution, in terms of goodness-of-fit to the data.

We fix α = 1.5 and generate data from the Waring distribution with β = 1, 1.5, 2; data are generated by the function rWARING as in Section 3.1. When β = 1, this is exactly the Yule-Simon distribution, and when β departs from 1, the Yule-Simon assumption is violated. We consider 500 replicates with sample sizes n = 100, 200, 400, 1000. To initialize the optimization for the MLE of the Yule-Simon parameter α, we use the first frequency P(X=1) = α/(α+1), that is, α̃ = P̂(X=1)/{1 − P̂(X=1)}. Figure 1 presents the box-plots of the likelihood ratio statistic
Tn = 2{ℓn(α̂, β̂) − ℓn(α̂)},
where ℓn(α̂, β̂) is the log-likelihood of the Waring fit evaluated at the MLE (α̂, β̂), and ℓn(α̂) is the log-likelihood of the Yule-Simon fit evaluated at the MLE α̂. If the true β equals 1, the Yule-Simon distribution is correct, and it is easy to prove that Tn → χ²₁ in distribution; if the true β departs from 1, the Yule-Simon distribution is incorrect, and Tn will be large. The box-plots in Figure 1 confirm that the Waring distribution fits the data similarly to the Yule-Simon distribution when β = 1, and much better when β departs from 1. We further report the proportion of replicates in which the Yule-Simon distribution is rejected at nominal level 0.05 in Table 3.
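The statistic Tn requires only the two maximized log-likelihoods; a Python sketch (the Yule-Simon log-likelihood uses the pmf from Section 1; function names are ours):

```python
import math

def yule_simon_loglik(alpha, counts):
    # log-likelihood of the Yule-Simon model (the Waring model with beta = 1);
    # counts[k] = n_k, the number of observations equal to k
    ll = 0.0
    for k, nk in counts.items():
        ll += nk * (math.log(alpha) + math.lgamma(k) + math.lgamma(alpha + 1)
                    - math.lgamma(alpha + k + 1))
    return ll

def lr_statistic(ll_waring, ll_yule):
    # T_n = 2 {l_n(alpha_hat, beta_hat) - l_n(alpha_tilde)};
    # under beta = 1, compare with the chi-square(1) critical value 3.84 at level 0.05
    return 2.0 * (ll_waring - ll_yule)
```

At β = 1 and the same α, the Waring and Yule-Simon log-likelihoods coincide, so Tn vanishes, as the nesting requires.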

Figure 1. Box-plots of Tn corresponding to β=1 (first row), 1.5 (second row), and 2 (third row), respectively, where the dashed line indicates the critical value 3.84, and the number at the right side of the figure is the proportion that Tn>3.84. In the last piece, all Tn's are much larger than 3.84, and thus the dashed line and the rejection proportion are not shown in the figure.


Table 3. Proportion of replicates that the Yule-Simon distribution is rejected at nominal level 0.05.

4. Real data application

Seal (1947, 1952) provided data on insurance shares for 12 different age periods. The original data concern male lives assured in a British life office, maintained for administrative purposes. The analysed data set is a random subset: every tenth name in the list was included until a total of 2000 was reached. The sampled lives are tabulated according to year of birth and the number of policies in force. Each group is represented by its central age.

Seal (1952) fitted the data using the discrete Pareto law, with probability mass function P(X = k) = k^{−d}/ζ(d), k = 1, 2, 3, …, d > 1, where ζ(d) is a normalization constant, and the parameter d is estimated by the MLE. Here we apply the Waring distribution to fit the data. For the age periods centred at 17.5 and 22.5, the maximal number of shares is 2. The EFF method leads to negative parameter estimates, while the MLE is proved not to exist, as in Remark 2.1. We focus on the remaining 10 groups, with central ages from 27.5 to 72.5. Among these 10 groups, for the group with central age 67.5, we have n = 45 and n1 = 33, n2 = 7, n3 = 4, n4 = 1, and it is easy to verify that (9) does not hold. Thus the MLE does not exist. If we directly apply the optimization algorithm, we get α̂ = 1,687,133.2, β̂ = 675,078.4, which is meaningless. However, if we use the EFF method, we get α̂ = 11, β̂ = 4, and the resulting fit is reasonably good. Thus, we need to be careful in using the MLE. Table 4 summarizes the comparison of the actual distribution with the discrete Pareto law fit and the Waring fits via EFF and MLE; we find that the Waring distribution fits the data slightly better than the discrete Pareto law.

Table 4. Comparison of actual distribution (A) with discrete Pareto law fitting (P), Waring fitting with EFF (E) and MLE (M).

5. Discussion

To fit a given data set by the Waring distribution, we need to verify the existence condition for the MLE of the Waring parameters before using the MLE. If the existence condition is not satisfied, the likelihood is maximized at the boundary, i.e., at infinity. Therefore, if we directly apply an optimization algorithm to the likelihood function, the 'MLE' may be far from the true parameters; for example, we get α̂ = 1,687,133.2, β̂ = 675,078.4 for the group with central age 67.5, where in fact the MLE does not exist. In such cases, we can use the EFF method if the EFF estimates are on reasonable scales. Based on the simulation studies and the real data analysis, we find that when the sample size is small or the maximum observed value is small, the MLE is less likely to exist, and when the sample size is large and the maximum observed value is large, the MLE is more likely to exist. Nevertheless, we always need to verify the existence condition for the MLE.

Acknowledgements

The authors would like to thank two anonymous reviewers, an associate editor and the editor for constructive comments and helpful suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is partially supported by National Natural Science Foundation of China [Grant Numbers 11671096, 11690013, 11731011, 11871376] and Natural Science Foundation of Shanghai [Grant Number 21ZR1420700].

References

  • Garcia, J. M. (2011). A fixed-point algorithm to estimate the Yule–Simon distribution parameter. Applied Mathematics and Computation, 217(21), 8560–8566. https://doi.org/10.1016/j.amc.2011.03.092
  • Huete-Morales, M. D., & Marmolejo-Martín, J. A. (2020). The Waring distribution as a low-frequency prediction model: A study of organic livestock farms in Andalusia. Mathematics, 8(11), 2025. https://doi.org/10.3390/math8112025
  • Mills, T. (2017). A statistical biography of George Udny Yule: A loafer of the world. Cambridge Scholars Press.
  • Panaretos, J., & Xekalaki, E. (1986). The stuttering generalized Waring distribution. Statistics and Probability Letters, 4(6), 313–318. https://doi.org/10.1016/0167-7152(86)90051-9
  • Price, D. (1965). Network of scientific papers. Science, 149(3683), 510–515. https://doi.org/10.1126/science.149.3683.510
  • Price, D. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5), 292–306. https://doi.org/10.1002/(ISSN)1097-4571
  • Rivas, L., & Campos, F. (2021). Zero inflated Waring distribution. Communications in Statistics – Simulation and Computation, to appear. https://doi.org/10.1080/03610918.2021.1944638
  • Seal, H. L. (1947). A probability distribution of deaths at age x when policies are counted instead of lives. Scandinavian Actuarial Journal, 1947, 118–43. https://doi.org/10.1080/03461238.1947.10419647
  • Seal, H. L. (1952). The maximum likelihood fitting of the discrete Pareto law. Journal of the Institute of Actuaries, 78(1), 115–121. https://doi.org/10.1017/S0020268100052501
  • Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42(3–4), 425–440. https://doi.org/10.1093/biomet/42.3-4.425
  • Xekalaki, E. (1983). The univariate generalized Waring distribution in relation to accident theory: Proneness, spells or contagion? Biometrics, 39(4), 887–895. https://doi.org/10.2307/2531324
  • Xekalaki, E. (1985). Factorial moment estimation for the bivariate generalized Waring distribution. Statistical Papers, 26(1), 115–129. https://doi.org/10.1007/BF02932525
  • Yule, G. U. (1925). A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society B, 213, 21–87.

Appendices

The appendix contains some useful lemmas and technical proofs.

Appendix 1. Some useful lemmas

Lemma A.1

Define g(x) = 1 / {1/(a1x+b1) + ⋯ + 1/(akx+bk)}, x > 0, where a1, …, ak are positive and b1, …, bk are nonnegative. Then g(x) is an increasing and concave function.

Proof.

It is easy to derive that
g′(x) = g²(x){a1/(a1x+b1)² + ⋯ + ak/(akx+bk)²} > 0,
g″(x) = 2g³(x)[{a1/(a1x+b1)² + ⋯ + ak/(akx+bk)²}² − {1/(a1x+b1) + ⋯ + 1/(akx+bk)}{a1²/(a1x+b1)³ + ⋯ + ak²/(akx+bk)³}]
= −2g³(x) Σ_{1≤i<j≤k} [1/{(aix+bi)(ajx+bj)}] {ai/(aix+bi) − aj/(ajx+bj)}² < 0.
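Lemma A.1 can also be checked numerically on a grid; a Python sketch (the particular ai, bi values used below are arbitrary; the function name is ours):

```python
def g(x, a, b):
    # g(x) = 1 / {1/(a1 x + b1) + ... + 1/(ak x + bk)} from Lemma A.1
    return 1.0 / sum(1.0 / (ai * x + bi) for ai, bi in zip(a, b))
```

On a grid, the first differences of g should be positive (increasing) and themselves decreasing (concave); note that if all ratios ai/(ai x + bi) coincide, g is exactly linear, matching the equality case of the Cauchy-Schwarz step in the proof.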

Lemma A.2

When x → ∞, we have
(x^m + a1x^{m−1} + a2x^{m−2} + ⋯) / (b1x^{m−1} + b2x^{m−2} + b3x^{m−3} + ⋯) = x/b1 + (a1b1−b2)/b1² + {(a2b1² − b1b3 − a1b1b2 + b2²)/b1³}(1/x) + O(1/x²).

Proof.

Assume that
(x^m + a1x^{m−1} + a2x^{m−2} + ⋯) / (b1x^{m−1} + b2x^{m−2} + b3x^{m−3} + ⋯) = x/b1 + c + d(1/x) + O(1/x²).
Then
x^m + a1x^{m−1} + a2x^{m−2} + ⋯ = (b1x^{m−1} + b2x^{m−2} + b3x^{m−3} + ⋯){x/b1 + c + d(1/x) + ⋯} = x^m + (b1c + b2/b1)x^{m−1} + (b1d + b2c + b3/b1)x^{m−2} + ⋯,
which indicates that: (i) b1c + b2/b1 = a1, and then c = (a1b1−b2)/b1²; (ii) b1d + b2c + b3/b1 = a2, and then d = (a2b1² − b1b3 − a1b1b2 + b2²)/b1³. The proof is completed.
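The expansion of Lemma A.2 is easy to validate numerically; a Python sketch comparing the exact ratio with the three-term expansion (truncating both polynomials at three terms, which is our simplification; function names are ours):

```python
def rational_tail(a1, a2, b1, b2, b3, x, m=6):
    # exact value of (x^m + a1 x^{m-1} + a2 x^{m-2}) / (b1 x^{m-1} + b2 x^{m-2} + b3 x^{m-3})
    num = x**m + a1 * x**(m - 1) + a2 * x**(m - 2)
    den = b1 * x**(m - 1) + b2 * x**(m - 2) + b3 * x**(m - 3)
    return num / den

def expansion(a1, a2, b1, b2, b3, x):
    # three-term expansion of Lemma A.2: x/b1 + c + d/x
    c = (a1 * b1 - b2) / b1**2
    d = (a2 * b1**2 - b1 * b3 - a1 * b1 * b2 + b2**2) / b1**3
    return x / b1 + c + d / x
```

The discrepancy should shrink like 1/x², so doubling x should shrink the error by roughly a factor of four.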

Appendix 2. Technical Proofs

Appendix 2.1. Proof of Lemmas 2.3–2.4

Proof

Proof of Lemma 2.3

Since for any β, ℓn(α(β), β) = max_α ℓn(α, β), to prove ℓn(α(β1), β1) = max_{α,β} ℓn(α, β) we only need to prove that β1 maximizes ℓn(α(β), β). Therefore, we only need to prove that dℓn(α(β), β)/dβ |_{β=β1} = 0, dℓn(α(β), β)/dβ > 0 for β < β1, and dℓn(α(β), β)/dβ < 0 for β > β1.

Consider the following decomposition:
{ℓn(α(β+Δβ), β+Δβ) − ℓn(α(β), β)}/Δβ = {ℓn(α(β+Δβ), β+Δβ) − ℓn(α(β), β+Δβ)}/Δβ + {ℓn(α(β), β+Δβ) − ℓn(α(β), β)}/Δβ → {∂ℓn(α, β)/∂α · α′(β) + ∂ℓn(α, β)/∂β}|_{α=α(β)},
where ∂ℓn(α, β)/∂α |_{α=α(β)} = 0, and thus ∂ℓn(α, β)/∂β |_{α=α(β)} completely determines the sign of dℓn(α(β), β)/dβ. The proof is completed.

Proof

Proof of Lemma 2.4

We use the method of contradiction. If the conclusion is not correct, then there exists x1>a such that t1(x1)=t2(x1), t1(x)>t2(x) for x<x1 and t1(x)<t2(x) for x(x1,x1+δ0) for some δ0>0. By assumption (D), the curves y=t1(x) and y=t2(x) will intersect again after (x1,t1(x1)), i.e., there exists x2>x1 such that t1(x2)=t2(x2), t1(x)<t2(x) for x(x1,x2) and t1(x)>t2(x) for x>x2 (suppose that there exists only one such x2, otherwise, we consider the largest intersection). According to assumption (D), take one point x(δ4,) (which is of course greater than x2), use (x2,t(x2)) as the starting point, and then take a ray interpolating (x,t1(x)). Let x diverge to infinity so that the point (x,t1(x)) moves along the curve y=t1(x). Since t1(x) is increasing and concave, the ray interpolating (x,t1(x)) tilts down around the start point (x2,t(x2)). By assumption (C), when x, the slope of the ray t1(x)t1(x2)xx2c.Thus the limit of the ray is a ray with start point (x2,t1(x2)) and slope c, denoted as L, and the curve y=t1(x) is above L.

Note that the start point of the ray L, (x2,t(x2)), is on the curve y=t2(x). By assumption (D), there exists an x, which satisfies that, the curve y=t2(x) intersects L at (x,t2(x)) and y=t2(x) lies below L for x(x,x+δ5) with some positive δ5. Without loss of generality, we assume that x2 is such point, that is, y=t2(x) lies below L for x(x2,x2+δ5).

Through the intersection (x2,t1(x2)), we make tangent line of the curve y=t2(x). If the tangent line coincides with the ray, then take another point x(x2,x2+δ5), and make another tangent line of the curve y=t2(x) through the point (x,t2(x)). Since y=t2(x) is increasing and concave, if the tangent line (of y=t2(x)) through (x2,t1(x2)) coincides with the ray L, the tangent line through (x,t2(x)) does not coincide with L. Note that the curve y=t1(x) is above L, while y=t2(x) is below the tangent line (a concave curve is always below its tangent line) which is below the ray L (the one which does not coincide with L must be below L according to the above discussion). Therefore, limxt1(x)xlimxt2(x)x, which contradicts with assumption (C).

To summarize, there exists no $x_1>a$ such that $t_1(x_1)=t_2(x_1)$, $t_1(x)>t_2(x)$ for $x<x_1$, and $t_1(x)<t_2(x)$ for $x\in(x_1,x_1+\delta_0)$ with some $\delta_0>0$. We conclude that $t_1(x)\geq t_2(x)$ for $x\in(a,\infty)$. The proof of Lemma 2.4 is completed.

Appendix 2.2. Proof of Propositions 2.1–2.2

Proof of Proposition 2.1

Proof of Property 1. If $\beta\to0$, we have
$$g_1(\alpha,\beta)\to\cfrac{1}{\cfrac{1}{\alpha}+\cfrac{\sum_{s=2}^m p_s}{\alpha+1}+\cfrac{\sum_{s=3}^m p_s}{\alpha+2}+\cdots+\cfrac{p_m}{\alpha+m-1}}\to0,$$
when $\alpha\to0$. Therefore, when $\beta\to0$, the intersection of $y=g_1(\alpha,\beta)$ and $y=\alpha$ converges to the origin of coordinates.

Proof of Property 2. If $\beta\to\infty$, then for any $\alpha>0$, we have $g_1(\alpha,\beta)\to\infty$. Thus, if $\beta\to\infty$, then $\alpha(\beta)\to\infty$ because $(\alpha(\beta),\beta)$ is the intersection. We have
$$\alpha(\beta)=\cfrac{1}{\cfrac{1}{\alpha(\beta)+\beta}+\cfrac{\sum_{s=2}^m p_s}{\alpha(\beta)+\beta+1}+\cfrac{\sum_{s=3}^m p_s}{\alpha(\beta)+\beta+2}+\cdots+\cfrac{p_m}{\alpha(\beta)+\beta+m-1}}=\frac{\{\alpha(\beta)+\beta\}^m+a_1\{\alpha(\beta)+\beta\}^{m-1}+a_2\{\alpha(\beta)+\beta\}^{m-2}+\cdots}{b_1\{\alpha(\beta)+\beta\}^{m-1}+b_2\{\alpha(\beta)+\beta\}^{m-2}+b_3\{\alpha(\beta)+\beta\}^{m-3}+\cdots},$$
where
$$a_1=\frac{m(m-1)}{2},\quad a_2=\frac{1}{24}m(m-1)(m-2)(3m-1),\quad b_1=1+\eta_1,\quad b_2=\frac{m(m-1)}{2}(1+\eta_1)+\eta_1-\eta_2,$$
$$b_3=\frac{m(m-1)(m-2)(3m-1)}{24}+\frac{m(m-1)(3m^2-7m+14)+24}{24}\eta_1-\left\{\frac{m(m-1)}{2}+2\right\}\eta_2+\eta_3.$$
Based on Lemma A.2, tedious calculation yields
$$\alpha(\beta)=\frac{1}{1+\eta_1}\{\alpha(\beta)+\beta\}+c_\alpha+\frac{d_\alpha}{\alpha(\beta)+\beta}+\cdots,$$
where
$$c_\alpha=\frac{\eta_2-\eta_1}{(1+\eta_1)^2},\qquad d_\alpha=\frac{\eta_2^2-\eta_1\eta_3-\eta_1+2\eta_2-\eta_3}{(1+\eta_1)^3}.$$
Simple algebra yields
$$\alpha(\beta)=\frac{1}{\eta_1}\beta+\frac{c_\alpha(1+\eta_1)}{\eta_1}+\frac{1+\eta_1}{\eta_1}\cdot\frac{d_\alpha}{\alpha(\beta)+\beta}+\cdots=\frac{1}{\eta_1}\beta+c_\alpha'+\frac{d_\alpha}{\beta}+O(1/\beta^2),$$
where $c_\alpha'=\dfrac{c_\alpha(1+\eta_1)}{\eta_1}=\dfrac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}$.
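The expansion above can be sanity-checked numerically. The following sketch is our own illustration, not part of the paper: the frequencies $p_s$ are made up, and the readings $q_i=\sum_{s=i}^m p_s$, $\eta_1=\sum_{t=2}^m q_t$ and $\eta_2=\sum_{t=2}^m t\,q_t$ are assumptions about the paper's notation. It solves the conditional likelihood equation for $\alpha(\beta)$ by bisection and watches $\alpha(\beta)-\beta/\eta_1$ approach $c_\alpha'$.

```python
# Numeric sanity check (ours, not from the paper) of the expansion
# alpha(beta) = beta/eta1 + c_alpha' + O(1/beta), using made-up frequencies p_s.
# The reading eta2 = sum_{t=2}^m t * q_t is an assumption about the notation.
p = [0.4, 0.3, 0.2, 0.1]                         # hypothetical p_1, ..., p_m
m = len(p)
q = [sum(p[i:]) for i in range(m)]               # q[i] = sum_{s=i+1}^m p_s, q[0] = 1
eta1 = sum(q[1:])                                # eta_1 = sum_{t=2}^m sum_{s=t}^m p_s
eta2 = sum((i + 1) * q[i] for i in range(1, m))  # assumed eta_2 = sum_{t=2}^m t q_t
c_alpha_prime = (eta2 - eta1) / (eta1 * (1 + eta1))

def alpha_of_beta(beta, tol=1e-12):
    """Solve 1/alpha = sum_i q_i/(alpha + beta + i - 1) by bisection."""
    f = lambda x: x * sum(q[i] / (x + beta + i) for i in range(m)) - 1.0
    lo, hi = 1e-12, 10.0 * (beta + 10.0)         # f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

for beta in (10.0, 100.0, 1000.0):
    print(beta, alpha_of_beta(beta) - beta / eta1)  # approaches c_alpha_prime
print(c_alpha_prime)
```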

Proof of Property 3. Since
$$\frac{1}{\alpha(\beta)}=\frac{1}{\alpha(\beta)+\beta}+\frac{\sum_{s=2}^m p_s}{\alpha(\beta)+\beta+1}+\cdots+\frac{p_m}{\alpha(\beta)+\beta+m-1},\tag{A1}$$
taking derivative with respect to $\beta$ on both sides of (A1), we have
$$\alpha'(\beta)=\alpha^2(\beta)\left[\frac{\alpha'(\beta)+1}{\{\alpha(\beta)+\beta\}^2}+\frac{\{\alpha'(\beta)+1\}\sum_{s=2}^m p_s}{\{\alpha(\beta)+\beta+1\}^2}+\cdots+\frac{\{\alpha'(\beta)+1\}p_m}{\{\alpha(\beta)+\beta+m-1\}^2}\right].$$
Simple algebra leads to
$$\alpha'(\beta)=\frac{u(\beta)}{1-u(\beta)},$$
where
$$u(\beta)=\cfrac{\cfrac{1}{\{\alpha(\beta)+\beta\}^2}+\cfrac{\sum_{s=2}^m p_s}{\{\alpha(\beta)+\beta+1\}^2}+\cdots+\cfrac{p_m}{\{\alpha(\beta)+\beta+m-1\}^2}}{\left\{\cfrac{1}{\alpha(\beta)+\beta}+\cfrac{\sum_{s=2}^m p_s}{\alpha(\beta)+\beta+1}+\cdots+\cfrac{p_m}{\alpha(\beta)+\beta+m-1}\right\}^2}>0.\tag{A2}$$
Furthermore, since
$$\left\{\frac{1}{\alpha(\beta)+\beta}+\frac{\sum_{s=2}^m p_s}{\alpha(\beta)+\beta+1}+\cdots+\frac{p_m}{\alpha(\beta)+\beta+m-1}\right\}^2=\sum_{i=1}^m\left[\frac{\sum_{s=i}^m p_s}{\alpha(\beta)+\beta+i-1}\left\{\sum_{j=1}^m\frac{\sum_{s=j}^m p_s}{\alpha(\beta)+\beta+j-1}\right\}\right]>\sum_{i=1}^m\left\{\frac{\sum_{s=i}^m p_s}{\alpha(\beta)+\beta+i-1}\cdot\frac{1}{\alpha(\beta)+\beta}\right\}>\sum_{i=1}^m\frac{\sum_{s=i}^m p_s}{\{\alpha(\beta)+\beta+i-1\}^2},$$
which indicates that $u(\beta)<1$. Therefore, $\alpha'(\beta)>0$.
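The identity $\alpha'(\beta)=u(\beta)/\{1-u(\beta)\}$ and the bound $0<u(\beta)<1$ can be illustrated numerically. The sketch below is our own check with hypothetical frequencies $p_s$: it evaluates $u(\beta)$ at the bisection solution $\alpha(\beta)$ and compares $u/(1-u)$ with a central finite difference of $\alpha(\beta)$.

```python
# Check alpha'(beta) = u(beta)/(1 - u(beta)) with 0 < u(beta) < 1, comparing the
# formula against a central finite difference (illustration with made-up p_s).
p = [0.4, 0.3, 0.2, 0.1]
m = len(p)
q = [sum(p[i:]) for i in range(m)]               # tail sums, q[0] = 1

def alpha_of_beta(beta, tol=1e-13):
    f = lambda x: x * sum(q[i] / (x + beta + i) for i in range(m)) - 1.0
    lo, hi = 1e-13, 10.0 * (beta + 10.0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def u_of_beta(beta):
    a = alpha_of_beta(beta)
    s1 = sum(q[i] / (a + beta + i) for i in range(m))       # sum in (A1), = 1/alpha
    s2 = sum(q[i] / (a + beta + i) ** 2 for i in range(m))  # squared-denominator sum
    return s2 / s1 ** 2

beta, step = 2.0, 1e-5
u = u_of_beta(beta)
fd = (alpha_of_beta(beta + step) - alpha_of_beta(beta - step)) / (2 * step)
print(u, u / (1 - u), fd)   # u < 1; the last two numbers agree
```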

Taking derivative with respect to $\beta$ twice on both sides of (A1), we have
$$\begin{aligned}\alpha''(\beta)={}&2\alpha^3(\beta)\Bigg(\left[\frac{\alpha'(\beta)+1}{\{\alpha(\beta)+\beta\}^2}+\frac{\{\alpha'(\beta)+1\}\sum_{s=2}^m p_s}{\{\alpha(\beta)+\beta+1\}^2}+\cdots+\frac{\{\alpha'(\beta)+1\}p_m}{\{\alpha(\beta)+\beta+m-1\}^2}\right]^2\\&-\left\{\frac{1}{\alpha(\beta)+\beta}+\frac{\sum_{s=2}^m p_s}{\alpha(\beta)+\beta+1}+\frac{\sum_{s=3}^m p_s}{\alpha(\beta)+\beta+2}+\cdots+\frac{p_m}{\alpha(\beta)+\beta+m-1}\right\}\\&\times\left[\frac{\{\alpha'(\beta)+1\}^2}{\{\alpha(\beta)+\beta\}^3}+\frac{\{\alpha'(\beta)+1\}^2\sum_{s=2}^m p_s}{\{\alpha(\beta)+\beta+1\}^3}+\cdots+\frac{\{\alpha'(\beta)+1\}^2p_m}{\{\alpha(\beta)+\beta+m-1\}^3}\right]\Bigg)\\={}&-2\alpha^3(\beta)\{\alpha'(\beta)+1\}^2\left[\sum_{0\le i<j\le m-1}\frac{\left(\sum_{s=i+1}^m p_s\right)\left(\sum_{s=j+1}^m p_s\right)}{\{\alpha(\beta)+\beta+i\}\{\alpha(\beta)+\beta+j\}}\left\{\frac{1}{\alpha(\beta)+\beta+i}-\frac{1}{\alpha(\beta)+\beta+j}\right\}^2\right]<0.\end{aligned}$$

Proof of Property 4. If $\beta\to0$, then $\alpha(\beta)\to0$, and thus (A2) indicates that $u(\beta)\to1$; therefore $\alpha'(\beta)\to\infty$. If $\beta\to\infty$, then $\alpha(\beta)\to\infty$, and thus (A2) indicates that $u(\beta)\to\dfrac{1}{1+\sum_{t=2}^m\sum_{s=t}^m p_s}$; therefore $\alpha'(\beta)\to\dfrac{1}{\sum_{t=2}^m\sum_{s=t}^m p_s}$.

Proof of Property 5. According to (3), the conditional maximum likelihood equation of $\alpha$ can be rewritten as
$$\frac{\alpha}{\alpha+\beta}+\frac{\alpha\sum_{s=2}^m p_s}{\alpha+\beta+1}+\frac{\alpha\sum_{s=3}^m p_s}{\alpha+\beta+2}+\cdots+\frac{\alpha p_m}{\alpha+\beta+m-1}=1.$$
Let
$$f(\alpha)=\frac{\alpha}{\alpha+\beta}+\frac{\alpha\sum_{s=2}^m p_s}{\alpha+\beta+1}+\frac{\alpha\sum_{s=3}^m p_s}{\alpha+\beta+2}+\cdots+\frac{\alpha p_m}{\alpha+\beta+m-1};$$
then $f(\alpha)$ is an increasing function of $\alpha$. Since $f(\alpha(\beta))=1$, $x$ is less than, equal to, or greater than $\alpha(\beta)$ if and only if $f(x)$ is less than, equal to, or greater than $1$, respectively. Therefore, $\alpha(\beta)=Z(\beta)$ is equivalent to $f(Z(\beta))=1$. Since $Z(\beta)$ is a polynomial or fractional function of $\beta$,
$$f(Z(\beta))=\frac{Z(\beta)}{Z(\beta)+\beta}+\frac{Z(\beta)\sum_{s=2}^m p_s}{Z(\beta)+\beta+1}+\frac{Z(\beta)\sum_{s=3}^m p_s}{Z(\beta)+\beta+2}+\cdots+\frac{Z(\beta)p_m}{Z(\beta)+\beta+m-1}=1$$
is a high-order polynomial equation, which has a finite number of solutions.
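The sign equivalence used in this proof is easy to observe numerically. In the sketch below (our own illustration with hypothetical frequencies $p_s$), $f$ is evaluated below, at, and above the bisection root $\alpha(\beta)$.

```python
# Property 5's sign equivalence: f(x) is increasing, f(alpha(beta)) = 1, and
# x < alpha(beta) iff f(x) < 1. Illustration with made-up frequencies p_s.
p = [0.4, 0.3, 0.2, 0.1]
m = len(p)
q = [sum(p[i:]) for i in range(m)]   # q[i] = sum_{s=i+1}^m p_s
beta = 2.0

def f(x):
    """Left-hand side of the conditional likelihood equation for alpha."""
    return sum(x * q[i] / (x + beta + i) for i in range(m))

# f increases from 0 towards sum_i q_i (> 1 here), so it crosses 1 exactly once.
lo, hi = 1e-12, 1e6
while hi - lo > 1e-12:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 1 else (lo, mid)
alpha_hat = 0.5 * (lo + hi)

print(f(0.5 * alpha_hat), f(alpha_hat), f(2.0 * alpha_hat))  # < 1, = 1, > 1
```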

Proof of Proposition 2.2

The proofs of Properties 1* and 5* are similar to the proofs of Properties 1 and 5 in Proposition 2.1, respectively, and Property 3* follows from Lemma A.1. In the following, we present the proofs of Properties 2* and 4*.

Proof of Property 2*. By Lemma A.2, it is easy to obtain
$$h(\beta)=\frac{\beta^{m-1}+a_1\beta^{m-2}+a_2\beta^{m-3}+\cdots}{b_1\beta^{m-2}+b_2\beta^{m-3}+b_3\beta^{m-4}+\cdots},$$
where
$$a_1=\frac{(m-1)(m-2)}{2},\quad a_2=\frac{(m-1)(m-2)(m-3)(3m-4)}{24},\quad b_1=\eta_1,\quad b_2=\frac{(m-2)(m-1)+4}{2}\eta_1-\eta_2,$$
$$b_3=\frac{(m-2)(m-1)(3m^2-13m+36)+96}{24}\eta_1-\frac{m^2-3m+10}{2}\eta_2+\eta_3.$$
Therefore, we have
$$h(\beta)=\frac{1}{\eta_1}\beta+c_h+d_h\frac{1}{\beta}+O(1/\beta^2),$$
where
$$c_h=\frac{a_1b_1-b_2}{b_1^2}=\frac{\eta_2-2\eta_1}{\eta_1^2},\qquad d_h=\frac{a_2b_1^2-b_1b_3-a_1b_1b_2+b_2^2}{b_1^3}=\frac{\eta_2^2-\eta_1\eta_3}{\eta_1^3}.$$

Proof of Property 4*. It is easy to derive that
$$h'(\beta)=\cfrac{\cfrac{\sum_{s=2}^m p_s}{\beta^2}+\cfrac{\sum_{s=3}^m p_s}{(\beta+1)^2}+\cdots+\cfrac{p_m}{(\beta+m-2)^2}}{\left(\cfrac{\sum_{s=2}^m p_s}{\beta}+\cfrac{\sum_{s=3}^m p_s}{\beta+1}+\cdots+\cfrac{p_m}{\beta+m-2}\right)^2}=\frac{\left(\sum_{t=2}^m\sum_{s=t}^m p_s\right)\beta^{2(m-2)}+\cdots+\left(\sum_{s=2}^m p_s\right)\{(m-2)!\}^2}{\left(\sum_{t=2}^m\sum_{s=t}^m p_s\right)^2\beta^{2(m-2)}+\cdots+\left(\sum_{s=2}^m p_s\right)^2\{(m-2)!\}^2},$$
and we have
$$h'(\beta)\to\frac{\left(\sum_{s=2}^m p_s\right)\{(m-2)!\}^2}{\left(\sum_{s=2}^m p_s\right)^2\{(m-2)!\}^2}=\frac{1}{\sum_{s=2}^m p_s},\quad\text{when }\beta\to0,$$
$$h'(\beta)\to\frac{\sum_{t=2}^m\sum_{s=t}^m p_s}{\left(\sum_{t=2}^m\sum_{s=t}^m p_s\right)^2}=\frac{1}{\sum_{t=2}^m\sum_{s=t}^m p_s}=\frac{1}{\eta_1},\quad\text{when }\beta\to\infty.$$
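Both limits of $h'(\beta)$ can be checked by direct evaluation. The sketch below is our own illustration with hypothetical frequencies $p_s$: it evaluates the displayed ratio for $h'(\beta)$ at a very small and a very large $\beta$.

```python
# Numeric check of the two limits of h'(beta) in Property 4*
# (our own illustration, using made-up frequencies p_s).
p = [0.4, 0.3, 0.2, 0.1]
m = len(p)
q = [sum(p[i:]) for i in range(m)]   # q[i] = sum_{s=i+1}^m p_s
eta1 = sum(q[1:])

def h_prime(beta):
    """The displayed ratio: tail sums over beta, beta+1, ..., beta+m-2."""
    num = sum(q[i] / (beta + i - 1) ** 2 for i in range(1, m))
    den = sum(q[i] / (beta + i - 1) for i in range(1, m)) ** 2
    return num / den

print(h_prime(1e-8))   # close to 1 / (p_2 + ... + p_m)
print(h_prime(1e8))    # close to 1 / eta1
```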

Appendix 2.3. Proof of Theorem 2.5

By Properties 1 and 4 of $\alpha(\beta)$ and Properties 1* and 4* of $h(\beta)$, when $\beta\to0$, $h(\beta)\to0$ and $\alpha(\beta)\to0$; however, $h'(\beta)\to\dfrac{1}{\sum_{s=2}^m p_s}$ while $\alpha'(\beta)\to\infty$. Thus, there exists $\delta_1>0$ such that $\alpha(\beta)>h(\beta)$ for $\beta\in(0,\delta_1)$.

By Property 2 of $\alpha(\beta)$ and Property 2* of $h(\beta)$, when $\beta\to\infty$,
$$h(\beta)=\frac{1}{\eta_1}\beta+\frac{\eta_2-2\eta_1}{\eta_1^2}+\frac{\eta_2^2-\eta_1\eta_3}{\eta_1^3}\frac{1}{\beta}+O\left(\frac{1}{\beta^2}\right),$$
$$\alpha(\beta)=\frac{1}{\eta_1}\beta+\frac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}+\frac{\eta_2^2-\eta_1\eta_3-\eta_1+2\eta_2-\eta_3}{(1+\eta_1)^3}\frac{1}{\beta}+O\left(\frac{1}{\beta^2}\right).$$
We first discuss the situation $\dfrac{\eta_2-2\eta_1}{\eta_1^2}\ne\dfrac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}$. In this case, there exists $\delta_2>0$ such that for $\beta\in(\delta_2,\infty)$,
$$\begin{cases}\alpha(\beta)<h(\beta),&\text{if }\dfrac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}<\dfrac{\eta_2-2\eta_1}{\eta_1^2},\\[2mm]\alpha(\beta)>h(\beta),&\text{if }\dfrac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}>\dfrac{\eta_2-2\eta_1}{\eta_1^2}.\end{cases}\tag{A3}$$
In the case of $\dfrac{\eta_2-2\eta_1}{\eta_1^2}=\dfrac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}$, that is, $\eta_2=\eta_1^2+2\eta_1$, we need to compare $d_\alpha=\dfrac{\eta_2^2-\eta_1\eta_3-\eta_1+2\eta_2-\eta_3}{(1+\eta_1)^3}$ and $d_h=\dfrac{\eta_2^2-\eta_1\eta_3}{\eta_1^3}$. If $d_\alpha>d_h$, then $\alpha(\beta)>h(\beta)$ for large $\beta$, and if $d_\alpha<d_h$, then $\alpha(\beta)<h(\beta)$.

Therefore, if
$$\frac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}<\frac{\eta_2-2\eta_1}{\eta_1^2},\quad\text{or}\quad\frac{\eta_2-\eta_1}{\eta_1(1+\eta_1)}=\frac{\eta_2-2\eta_1}{\eta_1^2}\ \text{and}\ d_\alpha<d_h,\tag{A4}$$
there must exist an intersection of the curves $y=h(\beta)$ and $y=\alpha(\beta)$. Part (II) of Theorem 2.5 follows directly from Lemma 2.4. Thus we only need to prove part (I). In the following, we assume that (A4) holds, so that $y=h(\beta)$ and $y=\alpha(\beta)$ intersect at least once at some positive $\beta$.

Suppose that $y=h(\beta)$ and $y=\alpha(\beta)$ intersect first at $(\beta_1,\alpha_1)$, where $\alpha_1=h(\beta_1)=\alpha(\beta_1)$; then $\alpha(\beta)>h(\beta)$ for $\beta\in(0,\beta_1)$. By Property 5 of $\alpha(\beta)$ in Proposition 2.1, the curves $y=h(\beta)$ and $y=\alpha(\beta)$ intersect only finitely many times. Therefore, there exists $\delta_3>\beta_1$ such that $y=\alpha(\beta)$ and $y=h(\beta)$ do not intersect for $\beta\in(\beta_1,\delta_3)$. Suppose that for $\beta\in(\beta_1,\delta_3)$ the curve $y=\alpha(\beta)$ is above $y=h(\beta)$. Then, due to (A4), the curve $y=\alpha(\beta)$ will finally be below the curve $y=h(\beta)$, and thus the two curves will intersect again. However, because the number of intersections is finite, it cannot always be the case that the curve $y=\alpha(\beta)$ lies above $y=h(\beta)$ after an intersection; i.e., there exists an intersection such that $y=\alpha(\beta)$ lies below $y=h(\beta)$ after that intersection. Without loss of generality, we assume that
$$y=\alpha(\beta)\ \text{lies below}\ y=h(\beta)\ \text{after the first intersection}\ (\beta_1,\alpha_1).\tag{A5}$$
Next, we prove that $(\alpha(\beta_1),\beta_1)$ is the maximizer of the log-likelihood function $\ell_n(\alpha,\beta)$. Since $\alpha(\beta)$ is the conditional maximum likelihood estimator of $\alpha$, i.e.,
$$\max_{\alpha,\beta>0}\ell_n(\alpha,\beta)=\max_{\beta>0}\ell_n(\alpha(\beta),\beta),$$
we only need to prove that $\beta=\beta_1$ is a maximizer of $\ell_n(\alpha(\beta),\beta)$.

Since $(\alpha(\beta_1),\beta_1)$ is a solution to the equation system (3)–(4), we have
$$\frac{1}{n}\frac{\partial\ell_n(\alpha,\beta)}{\partial\alpha}\bigg|_{\alpha=\alpha(\beta_1),\beta=\beta_1}=\frac{1}{\alpha(\beta_1)}-\left(\frac{1}{\alpha(\beta_1)+\beta_1}+\frac{\sum_{s=2}^m p_s}{\alpha(\beta_1)+\beta_1+1}+\frac{\sum_{s=3}^m p_s}{\alpha(\beta_1)+\beta_1+2}+\cdots+\frac{p_m}{\alpha(\beta_1)+\beta_1+m-1}\right)=0,$$
$$\frac{1}{n}\frac{\partial\ell_n(\alpha,\beta)}{\partial\beta}\bigg|_{\alpha=\alpha(\beta_1),\beta=\beta_1}=-\frac{1}{\alpha(\beta_1)}+\left(\frac{\sum_{s=2}^m p_s}{\beta_1}+\frac{\sum_{s=3}^m p_s}{\beta_1+1}+\cdots+\frac{p_m}{\beta_1+m-2}\right)=0.$$
To prove that $\beta=\beta_1$ maximizes $\ell_n(\alpha(\beta),\beta)$, by Lemma 2.3, we only need to prove that $\partial\ell_n(\alpha,\beta)/\partial\beta\big|_{\alpha=\alpha(\beta)}$ is greater than zero for $\beta\in(0,\beta_1)$ and smaller than zero for $\beta\in(\beta_1,\infty)$.

We first consider $\beta\in(0,\beta_1)$. When $\beta\in(0,\beta_1)$, we have $\alpha(\beta)>h(\beta)$. Therefore,
$$\frac{1}{n}\frac{\partial\ell_n(\alpha,\beta)}{\partial\beta}\bigg|_{\alpha=\alpha(\beta)}=-\frac{1}{\alpha(\beta)}+\left(\frac{\sum_{s=2}^m p_s}{\beta}+\frac{\sum_{s=3}^m p_s}{\beta+1}+\cdots+\frac{p_m}{\beta+m-2}\right)>0.\tag{A6}$$
We next consider $\beta\in(\beta_1,\infty)$. By (A5), $\alpha(\beta)<h(\beta)$ if $\beta\in(\beta_1,\delta_3)$. Then, by Lemma 2.4, $y=\alpha(\beta)$ cannot be above $y=h(\beta)$ at any $\beta>\beta_1$, i.e., $\alpha(\beta)\le h(\beta)$ for all $\beta>\beta_1$. Therefore,
$$\frac{1}{n}\frac{\partial\ell_n(\alpha,\beta)}{\partial\beta}\bigg|_{\alpha=\alpha(\beta)}=-\frac{1}{\alpha(\beta)}+\left(\frac{\sum_{s=2}^m p_s}{\beta}+\frac{\sum_{s=3}^m p_s}{\beta+1}+\cdots+\frac{p_m}{\beta+m-2}\right)<0.\tag{A7}$$
The proof is completed. We see that the overall proof depends on the fact that
$$\text{if }\alpha(\beta)>h(\beta),\ \text{then }\frac{\partial\ell_n(\alpha(\beta),\beta)}{\partial\beta}>0,\ \text{and thus }\ell_n(\alpha(\beta),\beta)\ \text{increases with }\beta;$$
$$\text{if }\alpha(\beta)<h(\beta),\ \text{then }\frac{\partial\ell_n(\alpha(\beta),\beta)}{\partial\beta}<0,\ \text{and thus }\ell_n(\alpha(\beta),\beta)\ \text{decreases with }\beta.$$
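The concluding dichotomy can be seen numerically: $h(\beta)$ inverts the sum appearing in (A6)–(A7), so the profile derivative equals $1/h(\beta)-1/\alpha(\beta)$ and its sign matches that of $\alpha(\beta)-h(\beta)$. The sketch below is our own illustration with hypothetical frequencies $p_s$.

```python
# The sign of the profile derivative (1/n) d l_n(alpha(beta), beta)/d beta
# equals the sign of alpha(beta) - h(beta). Illustration with made-up p_s.
p = [0.4, 0.3, 0.2, 0.1]
m = len(p)
q = [sum(p[i:]) for i in range(m)]   # q[i] = sum_{s=i+1}^m p_s

def alpha_of_beta(beta, tol=1e-12):
    f = lambda x: x * sum(q[i] / (x + beta + i) for i in range(m)) - 1.0
    lo, hi = 1e-12, 10.0 * (beta + 10.0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def h_of_beta(beta):
    return 1.0 / sum(q[i] / (beta + i - 1) for i in range(1, m))

def profile_slope(beta):
    """(1/n) d l_n(alpha(beta), beta) / d beta, as in (A6)-(A7)."""
    return -1.0 / alpha_of_beta(beta) + 1.0 / h_of_beta(beta)

for beta in (0.5, 1.0, 2.0, 5.0, 10.0):
    s = profile_slope(beta)
    print(beta, s, (s > 0) == (alpha_of_beta(beta) > h_of_beta(beta)))
```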