Abstract
In income and wealth data modeling, the Pareto distribution and its several variants play an important role. Both univariate and multivariate variants of this model have been used extensively as suitable models for various non-negative socio-economic variables. In this article, we consider the most general Feller-Pareto (FP) distribution, which subsumes all four types of Pareto distributions, and show that it can be represented as a mixture of a conditional generalized gamma and an unconditional gamma distribution. Using this representation, we develop a data-augmentation-based method (within the Bayesian paradigm) to estimate the parameters of the FP distribution. The mixture representation also allows us to easily derive conditional Jeffreys-type non-informative priors. For illustrative purposes, one data set is considered to exhibit the utility of the proposed method.
1 Introduction
Applications of several Pareto models and their various generalizations in modeling socio-economic phenomena are well established in the literature. For an excellent survey of several Pareto models along with their stochastic properties, see Arnold (2015) and the references cited therein. In the hierarchy of Pareto models, Pareto (Type IV) is the most general, subsuming the three other types. Feller (1971) came up with a different representation for the Pareto (Type IV) model, expressing it as a ratio of two independent gamma variables, a distribution alternatively known as the Beta distribution of the second kind. According to Feller, if U_1 \sim \mathrm{Gamma}(\gamma_1, 1) and U_2 \sim \mathrm{Gamma}(\gamma_2, 1) are independent random variables, and if for \mu \in \mathbb{R}, \sigma > 0, \gamma > 0 we define W = \mu + \sigma (U_1/U_2)^{\gamma}, then W has a Feller-Pareto (FP, henceforth, in short) distribution, and we write W \sim FP(\mu, \sigma, \gamma, \gamma_1, \gamma_2). The corresponding density can be easily obtained as
f(w) = \frac{1}{\gamma\sigma B(\gamma_1,\gamma_2)}\left(\frac{w-\mu}{\sigma}\right)^{\gamma_1/\gamma - 1}\left[1+\left(\frac{w-\mu}{\sigma}\right)^{1/\gamma}\right]^{-(\gamma_1+\gamma_2)}, \quad w > \mu. (1.1)
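Feller's gamma-ratio construction also gives a direct way to simulate FP variates. Below is a minimal Python sketch (the paper's own computations use R and the actuar package); the parameter values are illustrative, and the parametrization W = mu + sigma * (U1/U2)**gamma is the one assumed throughout the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def rfeller_pareto(n, mu, sigma, gamma, g1, g2, rng):
    """Draw n Feller-Pareto variates via the gamma-ratio construction:
    W = mu + sigma * (U1/U2)**gamma, U1 ~ Gamma(g1, 1), U2 ~ Gamma(g2, 1)."""
    u1 = rng.gamma(g1, size=n)
    u2 = rng.gamma(g2, size=n)
    return mu + sigma * (u1 / u2) ** gamma

# illustrative parameter values
w = rfeller_pareto(10_000, mu=0.0, sigma=2.0, gamma=0.5, g1=1.0, g2=3.0, rng=rng)
# the sample mean should be near sigma * G(g1+gamma) * G(g2-gamma) / (G(g1) * G(g2)),
# about 1.18 for these values (the k-th moment requires g2 > k * gamma)
print(w.min(), w.mean())
```

With mu = 0 every draw is positive, and the empirical moments can be checked against the closed-form moments whenever the existence condition holds.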
There is another construction of the FP distribution, defined next. Let X and Y be two independent random variables having gamma distributions with a common scale parameter and shape parameters \gamma_1 and \gamma_2, respectively. Then W = \mu + \sigma (X/Y)^{\gamma} has an FP(\mu, \sigma, \gamma, \gamma_1, \gamma_2) distribution, since the common scale cancels in the ratio. Indeed, Arnold (2015) has shown that the FP distribution is a generalization of the Pareto (IV) distribution.
The FP family appears to be a unimodal distribution which subsumes a variety of continuous probability models as special cases. For example, it includes Pareto (type I), Pareto (type II), Pareto (type III), and Pareto (type IV), which are identified as special cases by appropriately selecting the model parameters in Eq. (1.1), as given below:
Pareto (I)(\sigma, \alpha) = FP(\sigma, \sigma, 1, 1, \alpha)
Pareto (II)(\mu, \sigma, \alpha) = FP(\mu, \sigma, 1, 1, \alpha)
Pareto (III)(\mu, \sigma, \gamma) = FP(\mu, \sigma, \gamma, 1, 1)
Pareto (IV)(\mu, \sigma, \gamma, \alpha) = FP(\mu, \sigma, \gamma, 1, \alpha)
Another special case of the FP distribution is the transformed beta family, which includes several well-known probability models such as the Burr, Generalized Pareto, and Inverse Burr, among others. Noticeably, a salient feature of these distributions is that they possess relatively high probability in the upper tail. However, it is also interesting to note that there are some distributions that exhibit distinctly non-Paretian behavior in the upper tail. For instance, the Log-logistic, Inverse Pareto, and Inverse Paralogistic (each a special case of the Inverse Burr) have relatively "light" tails, as independently observed by Brazauskas (2002). Applications of such probability models cover a wide spectrum of areas ranging from actuarial science, economics, and finance to the health sciences and telecommunications, for distributions of variables such as sizes of insurance claims, incomes in a population, stock price fluctuations, durations of responses to medical treatment, and lengths of telephone calls. For pertinent details, see Arnold (1983, 2015), Johnson et al. (1994), and Klein and Moeschberger (1997). Moreover, some of these distributions are relevant within much broader classes of probability models. For example, a generalized Pareto distribution arises in semiparametric modeling of upper observations in samples from distributions which are regularly varying or in the domain of attraction of extreme value distributions; see Embrechts et al. (1997). This motivates us to study this distribution from the estimation perspective.
Next, without loss of generality, we consider \mu = 0. Alternatively, we can estimate \mu from a given data set as \hat{\mu} = \min\{w_1, \ldots, w_n\}, which is a consistent estimate of \mu. Then we can subtract \hat{\mu} from w and set \mu equal to zero. Kalbfleisch and Prentice (1980) identified Eq. (1.1) as a generalized F density. However, the seemingly unattractive and complicated (although closed-form) expression of the density function discouraged researchers from investigating this model further. In the literature, attempts have been made to discuss structural properties of the FP distribution as well as methods of estimation under a frequentist approach. A non-exhaustive list of such references follows. Tahmasebi and Behboodian (2010) derived exact analytical expressions of entropy for the FP family and order statistics of FP subfamilies. Dutang, Goulet, and Langevin (2022) developed the R package actuar for implementing the FP distribution numerically in actuarial applications. Odubote and Oluyede (2014) discussed a weighted version of the Feller-Pareto distribution along with several useful structural properties. Brazauskas (2002) derived the exact form of the Fisher information matrix for the FP distribution. However, none of the above-cited references discuss estimation of the model parameters under a Bayesian approach. Additionally, classical estimation appears to have serious limitations in efficiently estimating the parameters of the FP model, which has a total of 5 parameters. In particular, one is faced with the following problems:
The classical maximum likelihood method of estimation may not perform satisfactorily well, because we have 4 parameters (assuming \mu = 0) in total, and the associated likelihood is not a well-behaved function as it involves gamma functions. Moreover, even if we obtain estimates for the parameters, it is quite difficult to examine, mathematically or otherwise, whether or not the estimated values correspond to a global or a local maximum.
From Eq. (1.1), with \mu = 0, the k-th order moment of the FP distribution will be
E(W^k) = \sigma^k\,\frac{\Gamma(\gamma_1 + k\gamma)\,\Gamma(\gamma_2 - k\gamma)}{\Gamma(\gamma_1)\,\Gamma(\gamma_2)}.
It is quite obvious from this expression that for the k-th order moment to exist, we must have \gamma_2 > k\gamma. This is quite a strong assumption, and at the same time we do not know for sure whether, for a given data set, this condition will be satisfied. One may instead use the method of fractional moments (which always exist), but that too might not yield satisfactory results, as several estimation methods under the classical set-up involve selecting a starting value for the parameters under study.
As a remedy, we seek a different approach in this paper. We write the density function in Eq. (1.1) as a mixture of one conditional generalized gamma and an unconditional gamma distribution. Later on, we will show that this mixture representation helps us easily derive a conditional Jeffreys-type non-informative prior. This is quite appealing in the sense that we are not assuming any external prior; rather, we rewrite the model in terms of two known distributions by invoking the argument of data augmentation. Needless to say, a similar technique might well be considered for a multivariate FP distribution. The rest of the paper is organized as follows. In Section 2, the maximum likelihood estimation method is used to estimate the model parameters of the FP distribution given in Eq. (1.1). In Section 3, we discuss the mixture representation of the density given in Eq. (1.1). In Section 4, we discuss a simulation study. Section 5 deals with the application of the proposed methodology to a data set having varying complexities. Finally, some concluding remarks are presented in Section 6.
2 Estimation by the method of maximum likelihood
For a random sample w_1, \ldots, w_n of size n drawn from the probability model in Eq. (1.1) (with \mu = 0), the associated log-likelihood function will be
\ell(\sigma,\gamma,\gamma_1,\gamma_2) = -n\log B(\gamma_1,\gamma_2) - n\log(\gamma\sigma) + \left(\frac{\gamma_1}{\gamma}-1\right)\sum_{i=1}^{n}\log\left(\frac{w_i}{\sigma}\right) - (\gamma_1+\gamma_2)\sum_{i=1}^{n}\log\left[1+\left(\frac{w_i}{\sigma}\right)^{1/\gamma}\right]. (2.1)
The corresponding likelihood equations, obtained by differentiating Eq. (2.1) with respect to \gamma_1, \gamma_2, \gamma, and \sigma, will be
\frac{\partial \ell}{\partial \gamma_1} = n[\psi(\gamma_1+\gamma_2)-\psi(\gamma_1)] + \frac{1}{\gamma}\sum_{i=1}^{n}\log\left(\frac{w_i}{\sigma}\right) - \sum_{i=1}^{n}\log\left[1+\left(\frac{w_i}{\sigma}\right)^{1/\gamma}\right] (2.2)
\frac{\partial \ell}{\partial \gamma_2} = n[\psi(\gamma_1+\gamma_2)-\psi(\gamma_2)] - \sum_{i=1}^{n}\log\left[1+\left(\frac{w_i}{\sigma}\right)^{1/\gamma}\right] (2.3)
\frac{\partial \ell}{\partial \gamma} = -\frac{n}{\gamma} - \frac{\gamma_1}{\gamma^2}\sum_{i=1}^{n}\log\left(\frac{w_i}{\sigma}\right) + \frac{\gamma_1+\gamma_2}{\gamma^2}\sum_{i=1}^{n}\frac{(w_i/\sigma)^{1/\gamma}\log(w_i/\sigma)}{1+(w_i/\sigma)^{1/\gamma}} (2.4)
\frac{\partial \ell}{\partial \sigma} = -\frac{n\gamma_1}{\gamma\sigma} + \frac{\gamma_1+\gamma_2}{\gamma\sigma}\sum_{i=1}^{n}\frac{(w_i/\sigma)^{1/\gamma}}{1+(w_i/\sigma)^{1/\gamma}} (2.5)
The maximum likelihood estimates of the parameters are obtained by equating Eqs. (2.2)–(2.5) to zero, where \psi(\cdot) is the digamma function.
For the elements of the Fisher information matrix (FIM, in short), the interested reader is referred to the paper by Brazauskas (2002), with the parameters relabeled as per the notation used in the probability density function in Eq. (1.1) of this paper.
The asymptotic variance-covariance matrix of the MLE \hat{\theta} = (\hat{\sigma}, \hat{\gamma}, \hat{\gamma}_1, \hat{\gamma}_2) can be obtained as the inverse of the observed FIM, \hat{V} = I^{-1}(\hat{\theta}).
Lehmann and Casella (2006) have discussed the asymptotic normality of the MLE under certain conditions (see Theorem 3.10, p. 449). For the FP distribution given in Eq. (1.1), it is straightforward to see that these regularity conditions hold. In addition, all the remaining necessary and sufficient conditions of Theorem 3.10 of Lehmann and Casella (2006) are satisfied, and therefore, it can be assumed that, asymptotically, \sqrt{n}(\hat{\theta} - \theta) converges in distribution to N(0, I^{-1}(\theta)).
Consequently, a 100(1-\alpha)\% approximate confidence interval for the j-th parameter will be
\hat{\theta}_j \mp z_{\alpha/2}\sqrt{\hat{V}_{jj}}, (2.6)
where z_q is the 100q-th upper percentile of the standard normal distribution.
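The interval in Eq. (2.6) is straightforward to compute once the observed information has been inverted. The Python sketch below uses purely hypothetical MLEs and a hypothetical diagonal observed information matrix, and also illustrates a fallback for the case where an inverted-information diagonal entry is negative.

```python
import numpy as np

def wald_ci(theta_hat, obs_info, z=1.959963984540054):
    """Wald-type 95% intervals: theta_j -/+ z * sqrt([I^{-1}(theta_hat)]_{jj}).
    Entries where inversion yields a negative variance are returned as None;
    a parametric-bootstrap variance can be substituted in those cases."""
    cov = np.linalg.inv(obs_info)
    cis = []
    for est, var in zip(theta_hat, np.diag(cov)):
        if var <= 0:  # observed information need not be positive definite
            cis.append(None)
        else:
            half = z * np.sqrt(var)
            cis.append((est - half, est + half))
    return cis

# hypothetical MLEs for (sigma, gamma, gamma1, gamma2) and a hypothetical
# diagonal observed information matrix, purely for illustration
theta_hat = [2.1, 0.6, 1.2, 3.4]
info = np.diag([50.0, 120.0, 30.0, 8.0])
print(wald_ci(theta_hat, info))
```

In a real fit the observed information would come from the Hessian of the log-likelihood at the MLE and would generally not be diagonal.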
2.1 Simulation study
In this section, we conduct a Monte Carlo simulation study to evaluate the performance of the likelihood inference for the FP distribution given in Eq. (1.1). Random samples from the FP distribution are obtained using the actuar package in R. In particular, we consider the sample sizes n = 50, 75, and 100 with the following six sets of choices of the model parameters.
Choice 1: , and
Choice 2: , and
Choice 3: , and
Choice 4: , and
Choice 5: , and
Choice 6: , and
For each setting, replicated random samples are generated. For the MLE computations, we mimic the strategy adopted in Dutang, Goulet, and Langevin (2022). For each simulated random sample, we also compute the 95% approximate confidence intervals for the parameters based on Eq. (2.6), with the asymptotic variances obtained from inverting the observed Fisher information matrix as well as the approximated variances obtained from a parametric bootstrap method with 400 bootstrap samples; for details on bootstrapping, see Efron and Tibshirani (1994). The estimated biases and mean squared errors (MSEs) of the MLEs are presented in the corresponding table, as are the estimated coverage probabilities and average widths of the confidence intervals. Note that the observed information matrix need not be positive definite, which can result in negative asymptotic variance estimates; see, for example, Verbeke and Molenberghs (2007). Accordingly, we also report the percentage of cases in which the asymptotic variances are negative and the confidence intervals cannot be computed. In those cases, we recommend a parametric bootstrap method as an alternative way to approximate the variances of the MLEs.
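The parametric bootstrap fallback mentioned above can be sketched as follows (in Python; the study itself was run in R). To keep the sketch short and runnable, a simple statistic (the median of the log-data) stands in for the full MLE, and the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def rfp(n, sigma, gamma, g1, g2, rng):
    # Feller-Pareto variates via the gamma-ratio construction (mu = 0)
    return sigma * (rng.gamma(g1, size=n) / rng.gamma(g2, size=n)) ** gamma

def parametric_bootstrap_var(estimator, theta_hat, n, B, rng):
    """Approximate Var(estimator) by re-estimating on B samples drawn from the
    fitted model -- the fallback used when the inverted observed information
    yields a negative variance estimate."""
    reps = [estimator(rfp(n, *theta_hat, rng)) for _ in range(B)]
    return np.var(reps, ddof=1)

# stand-in estimator instead of the full MLE, purely for illustration
est = lambda w: np.median(np.log(w))
v = parametric_bootstrap_var(est, (2.0, 0.5, 1.5, 2.5), n=100, B=400, rng=rng)
print(v)  # a bootstrap variance is non-negative by construction
```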
One may observe from the tabulated results that the estimated MSEs for the four parameters decrease as the sample size increases. However, for the estimated biases, there is not a steady decreasing pattern with increasing sample size; on the contrary, in some cases there appears to be a negligible amount (0.01–0.05) of increase. Whether this is an anomaly or not will be investigated in a separate article. We also observe that the estimated MSEs of one of the parameters are consistently larger than those of the remaining parameters. Next, one may also observe the following:
the proportion of cases in which negative variance estimates are obtained is negligibly small;
the computed approximate confidence intervals based on bootstrap variances perform satisfactorily well. Note that these approximate confidence intervals can be used as an alternative when the asymptotic variances are negative; for pertinent details, see Ghosh and Ng (2019) and the references cited therein.
3 Mixture representation
In this section, we present the mixture representation of the FP distribution. Suppose Z_1 \sim \mathrm{Gamma}(\gamma_1, 1) and Z_2 \sim \mathrm{Gamma}(\gamma_2, 1), and they are independent. Let us define X = \sigma Z_1^{\gamma} and Y = Z_2^{\gamma}, with \sigma > 0 and \gamma > 0. Then, the distribution of X will be
f_X(x) = \frac{1}{\gamma\sigma\Gamma(\gamma_1)}\left(\frac{x}{\sigma}\right)^{\gamma_1/\gamma - 1}\exp\left\{-\left(\frac{x}{\sigma}\right)^{1/\gamma}\right\}, \quad x > 0.
Note that the above density function has been independently derived and explored by Stacy (1962). The joint density of X and Y will be
f_{X,Y}(x,y) = \frac{1}{\gamma^2\sigma\Gamma(\gamma_1)\Gamma(\gamma_2)}\left(\frac{x}{\sigma}\right)^{\gamma_1/\gamma - 1} y^{\gamma_2/\gamma - 1}\exp\left\{-\left(\frac{x}{\sigma}\right)^{1/\gamma} - y^{1/\gamma}\right\}, \quad x, y > 0.
Next, let us consider the following transformation: W = X/Y and U = Y^{1/\gamma}, with w > 0 and u > 0. We are interested in the distribution of W. The Jacobian of the above transformation is \gamma u^{2\gamma - 1}. Then, the joint distribution of W and U will be
f_{W,U}(w,u) = \frac{1}{\gamma\sigma\Gamma(\gamma_1)\Gamma(\gamma_2)}\left(\frac{w}{\sigma}\right)^{\gamma_1/\gamma - 1} u^{\gamma_1+\gamma_2-1}\exp\left\{-u\left[1+\left(\frac{w}{\sigma}\right)^{1/\gamma}\right]\right\}.
Hence, the marginal distribution of W will be
f_W(w) = \frac{\Gamma(\gamma_1+\gamma_2)}{\gamma\sigma\Gamma(\gamma_1)\Gamma(\gamma_2)}\left(\frac{w}{\sigma}\right)^{\gamma_1/\gamma - 1}\left[1+\left(\frac{w}{\sigma}\right)^{1/\gamma}\right]^{-(\gamma_1+\gamma_2)}, \quad w > 0. (3.1)
Noticeably, W is the ratio of two independent Stacy random variables with the same power parameter. Jordanova et al. (2023) discussed and studied the distribution of the ratio of two independent Stacy random variables when the power parameters of both the numerator and denominator variables are equal to one. However, it must be noted that the present work is not subsumed by Jordanova et al. (2023): the authors of that paper work with different sets of values of the powers in the numerator and denominator, intersecting with our setting only when these powers equal 1.
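The ratio-of-Stacy representation can be cross-checked by simulation against the hierarchical (mixture) route: first draw a gamma mixing variable, then draw W conditionally. The Python sketch below assumes, for illustration, the parametrization W = sigma * (V/U)**gamma with V ~ Gamma(g1, 1) and U ~ Gamma(g2, 1); both routes should produce the same distribution.

```python
import numpy as np

rng = np.random.default_rng(123)
sigma, gamma, g1, g2 = 1.0, 0.5, 2.0, 3.0  # illustrative values
n = 50_000

# Hierarchical (data-augmentation) route:
u = rng.gamma(g2, size=n)          # unconditional gamma mixing variable
v = rng.gamma(g1, size=n)          # W | U = u is then sigma * (V/u)**gamma,
w_mix = sigma * (v / u) ** gamma   # a generalized-gamma draw given u

# Direct ratio-of-gammas route:
w_dir = sigma * (rng.gamma(g1, size=n) / rng.gamma(g2, size=n)) ** gamma

# The two empirical distributions should agree closely
qs = [0.25, 0.50, 0.75]
print(np.quantile(w_mix, qs))
print(np.quantile(w_dir, qs))
```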
Next, observe that we can rewrite Eq. (3.1) as
f_W(w) = \int_0^{\infty} f(w \mid u)\,\pi(u)\,du,
where
f(w \mid u) = \frac{u^{\gamma_1}}{\gamma\sigma\Gamma(\gamma_1)}\left(\frac{w}{\sigma}\right)^{\gamma_1/\gamma - 1}\exp\left\{-u\left(\frac{w}{\sigma}\right)^{1/\gamma}\right\}
is a conditional generalized gamma density and
\pi(u) = \frac{u^{\gamma_2 - 1}e^{-u}}{\Gamma(\gamma_2)}
is an unconditional \mathrm{Gamma}(\gamma_2, 1) density, with w > 0 and u > 0.
In this case, the associated complete-data likelihood takes the following form:
L(\sigma,\gamma,\gamma_1,\gamma_2 \mid \boldsymbol{w}, \boldsymbol{u}) = \prod_{i=1}^{n}\frac{u_i^{\gamma_1+\gamma_2-1}}{\gamma\sigma\Gamma(\gamma_1)\Gamma(\gamma_2)}\left(\frac{w_i}{\sigma}\right)^{\gamma_1/\gamma - 1}\exp\left\{-u_i\left[1+\left(\frac{w_i}{\sigma}\right)^{1/\gamma}\right]\right\}. (3.2)
Next, we derive from Eq. (3.2) the full conditionals of the augmented variables and the model parameters, given as follows:
For i = 1, \ldots, n, the full conditional distribution of u_i is \mathrm{Gamma}\left(\gamma_1+\gamma_2,\; 1+\left(w_i/\sigma\right)^{1/\gamma}\right).
Note that this function is log concave.
Note that this function is also log concave.
Finally,
Therefore, all full conditionals can be sampled by Acceptance-Rejection (AR) sampling via the R package ars.
On the choice of hyperparameters for the priors:
In this case, the hyperparameter values for the gamma priors are obtained by matching the first two theoretical moments with the sample moments. Needless to say, there are other available strategies for selecting hyperparameters, such as the procedure adopted by Giannone et al. (2015) in relation to vector autoregression models and by Singh and Hellander (2018) in the context of hyperparameter optimization. However, the methodologies adopted in those references might be more applicable to spatio-temporal models than to the model we have here. Additionally, whether such strategies would be beneficial (in the sense of computational efficiency and computation time) is the subject matter of a future study. Furthermore, we are not claiming that this set of priors with these specific choices of hyperparameters is optimal for the associated Bayesian analysis; in our cases, we tried several other prior choices and found minimal changes in the final estimates. Also, the acceptance probabilities across the chains were found to lie in a range indicating that the chain mixing was satisfactory. A full-scale study covering a wide range of prior choices is beyond the scope of the present paper.
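Matching the first two moments of a Gamma(shape, rate) prior to a pilot estimate's mean and variance reduces to two closed-form expressions, since the prior mean is shape/rate and the prior variance is shape/rate². A small Python sketch with illustrative numbers:

```python
def gamma_hyperparameters(target_mean, target_var):
    """Solve shape/rate = target_mean and shape/rate**2 = target_var
    for the Gamma(shape, rate) hyperparameters."""
    rate = target_mean / target_var
    shape = target_mean * rate
    return shape, rate

# e.g. centre the prior at a pilot estimate of 2.0 with variance 0.5
shape, rate = gamma_hyperparameters(2.0, 0.5)
print(shape, rate)  # 8.0 4.0 -> prior mean 8/4 = 2.0, prior variance 8/16 = 0.5
```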
3.1 Comment on the convergence of MCMC procedure
For the convergence diagnostics of the adopted MCMC procedure, the method of Gelman and Rubin (1992) is utilized. It involves two stages. The first step, before sampling begins, involves obtaining an over-dispersed estimate of the target distribution(s) and using it to generate the starting points for the desired number of independent chains (in our case, we consider running parallel independent chains of length 2k = 40000 each). The second step involves using the last k iterations to re-estimate the target distribution of each scalar quantity of interest (in our case \sigma, \gamma, \gamma_1, and \gamma_2). The convergence of the MCMC procedure is monitored by the following quantity:
\hat{R} = \frac{\hat{V}}{W}\cdot\frac{df}{df-2}, \quad \hat{V} = \frac{k-1}{k}W + \frac{B}{k},
where B is the variance between the means from the parallel chains, W is the average of the within-chain variances, and df is the degrees of freedom of the approximating t density; for details, see p. 465, Eq. (20) of Gelman and Rubin (1992). Convergence is said to have been achieved once the value of \hat{R} is near 1 for all scalar estimands of interest for a sufficiently large k. It is to be noted here that Gelman and Rubin (1992) proposed the method for the Gibbs sampler, but here we have applied their procedure to the output of our MCMC procedure, which is a particular case of Gibbs sampling. For illustrative purposes, we provide the values of \hat{R} for all of our scalar quantities of interest under the partial dependent conjugate prior set-up (provided in the Appendix).
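The between/within comparison underlying the diagnostic can be sketched in a few lines of Python. This is a simplified version of Gelman and Rubin's statistic, omitting the df/(df-2) correction of their Eq. (20), and the synthetic chains are illustrative.

```python
import numpy as np

def psrf(chains):
    """Simplified potential scale reduction factor from m parallel chains of
    length k (no df/(df-2) correction): pooled variance over within variance."""
    chains = np.asarray(chains, dtype=float)
    m, k = chains.shape
    means = chains.mean(axis=1)
    B = k * means.var(ddof=1)                # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # average within-chain variance
    v_hat = (k - 1) / k * W + B / k          # pooled estimate of target variance
    return v_hat / W

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))           # four chains with a common target
stuck = mixed + 5.0 * np.arange(4)[:, None]  # chains stuck in different regions
print(psrf(mixed), psrf(stuck))              # near 1, and much larger than 1
```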
4 Simulation study
For the associated Bayesian analysis, we consider two different sets of prior choices, namely a non-informative (improper) set-up and a conjugate prior set-up. At first, we consider the non-informative prior set-up, which we label as Choice 1.
Choice 1 (Non-informative prior set-up):
Therefore, the joint prior in this case will be
The associated posterior summary is presented in the corresponding table.
Choice 2 (Conjugate prior set-up):
Next, under the conjugate prior set-up, we consider the same prior choice for all the parameters, i.e., each follows a gamma distribution with shape = 0.2 and rate = 0.2. We use WinBUGS (a software package for Bayesian computation) with an R interface to conduct the simulation study. We do not claim that the prior choice made here is optimal, but among the different choices of hyperparameters tried for this study, this choice appears to produce satisfactory results for the Bayesian analysis based on a simulated random sample. The associated posterior summary is provided in the corresponding table.
For both the cases, we consider the following four sets of true parameter choices.
Choice 1: , and
Choice 2: , and
Choice 3: , and
Choice 4: , and
Remark.
From the posterior simulation studies reported in the corresponding tables, one may note the following:
Under the non-informative prior set-up the associated 95% HPD intervals are slightly wider.
We cannot make a general comment as to whether prior conjugacy in this scenario will perform uniformly better or not. A full-scale study involving other possible prior choices and hyperparameter values, along with a flat prior, will be the subject matter of a separate article.
5 Real data application
In this section, we consider data on fault-trace lengths from Clark, Cox, and Laslett (1999). As per the authors, mimicking their words, "It has often been observed that fault-trace lengths tend to follow a power-law or Pareto distribution, at least for sufficiently large lengths." The applicability of the FP model under the classical paradigm (using the MLE method) has already been discussed in Clark, Cox, and Laslett (1999). In this paper, we discuss the Bayesian estimation of the FP model parameters. One prominent reason for selecting the FP distribution is that it is the most general model in the hierarchy of Pareto distributions, and many other distributions assumed to fit this particular data set exhibit Pareto-like behavior (especially in the tail). Here, we obtain estimates of the parameters using a moment-matching strategy (i.e., equating the sample moments with the theoretical moments) and treat them as the initial prior choice for the Bayesian analysis. The summary of the Bayesian output is given in the corresponding table. Regarding the convergence of the adopted MCMC, we provide the diagnostic values in the Appendix. For the Bayesian analysis, we consider the following two different sets of prior choices:
Choice 1 (Non-informative prior set-up):
Choice 2 (Conjugate prior set-up):
From the 95% HPD intervals for the conjugate prior set-up, it appears that the Bayesian estimation under the non-informative (and improper) prior set-up is less efficient, as expected. The summary of the output is given in the corresponding table.
Sensitivity analysis with respect to the hyperparameters:
An anonymous referee expressed concern that the results of this analysis might be influenced by the choice of hyperparameters. As an initial effort to investigate hyperparameter sensitivity, we re-analyze the data set with two other sets of prior choices, given as follows (for the conjugate prior set-up):
Choice 3:
Choice 4:
The posterior summaries based on the above two prior choices (with varying choices of the hyperparameters) are given in the corresponding table. It appears that with these two new sets of hyperparameters for the respective priors, the final conclusion remains the same, i.e., the Bayesian inference under the conjugate prior set-up performs slightly better compared with that under the non-informative prior set-up.
6 Concluding remarks
In this paper, we have discussed estimation of the model parameters of the Feller-Pareto distribution under both the frequentist and Bayesian paradigms. Under the frequentist approach, we considered the MLE method and also discussed the asymptotic normality of the estimates under certain regularity conditions. Noticeably, as mentioned in Section 1, estimation of the parameters under the classical approach is usually hindered by the fact that the likelihood function is not well behaved and the parameter space has constraints. Some of these problems can be avoided by using a computing environment such as R, with specific packages designed to handle the estimation of such complicated models. Nevertheless, efficient estimation of the model parameters for the most general member of the Pareto family, namely the Feller-Pareto distribution, under the classical approach is difficult to achieve, as the existence of moments requires conditions that are hard to verify from a practical point of view. However, by rewriting the density function in Eq. (1.1) as a mixture of two well-known continuous probability models and by invoking the strategy of data augmentation, we have conducted a Bayesian analysis under both a non-informative (improper) and a conjugate prior set-up. Based on our study, it is difficult to say which of the two procedures (MLE and Bayesian) will perform uniformly better than the other. However, in the absence of large samples and in the presence of prior information, one can hope that inference under the Bayesian paradigm will be better than any strategy under the classical set-up. The results of a small simulation study are encouraging. We have also demonstrated the utility of the proposed estimation strategy for the Feller-Pareto distribution in the context of a real-life data set.
A more comprehensive study is required to explore the usefulness of Bayesian estimation from a standard Bayesian analysis point of view.
Several possible works in the future direction can be considered, including but not limited to
the choice of other types of improper priors, as opposed to the ones assumed in this paper, followed by a Bayesian analysis;
selecting dependent, full-conditional priors from the exponential family, as opposed to the subjective conjugate and proper priors assumed in this paper;
estimation of the model parameters by other methods under the classical paradigm, such as the method of maximum product of spacings, Cramér-von Mises, and ordinary and weighted least squares estimators, together with a comparison study of the relative efficiency of each of these estimation strategies in the context of such probability models.
Disclosure Statement
No potential conflict of interest was reported by the author(s).
References
- Feller W. 1971. An introduction to probability theory and its applications. Vol. 2. New York: Wiley. p. 766.
- Arnold BC. 1983. Pareto distributions. Fairland (MD): International Cooperative Publishing House.
- Johnson NL, Kotz S, Balakrishnan N. 1994. Continuous univariate distributions. Vol. 1, 2nd ed. New York: Wiley.
- Klein JP, Moeschberger S. 1997. Survival analysis: techniques for censored and truncated data. Berlin: Springer.
- Embrechts P, Klüppelberg C, Mikosch T. 1997. Modelling extremal events. Stochastic modelling and applied probability. Vol. 33. Berlin: Springer. p. 283–370.
- Arnold BC. 2015. Pareto distributions. 2nd ed. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Boca Raton, FL: Chapman & Hall/CRC.
- Brazauskas V. 2002. Fisher information matrix for the Feller–Pareto distribution. Stat Probab Lett. 59(2):159–167.
- Clark RM, Cox SJD, Laslett GM. 1999. Generalizations of power-law distributions applicable to sampled fault-trace lengths: model choice, parameter estimation and caveats. Geophys J Int. 136(2):357–372.
- Dutang C, Goulet V, Langevin N. 2022. Feller-Pareto and related distributions: Numerical implementation and actuarial applications. J Stat Softw. 103:1–22.
- Efron B, Tibshirani RJ. 1994. An introduction to the bootstrap. New York: CRC Press.
- Ghosh I, Ng HKT. 2019. A class of skewed distributions with applications in environmental data. Commun Stat Case Stud Data Anal Appl. 5(4):346–365.
- Gelman A, Carlin JB, Stern HS, Rubin DB. 1995. Bayesian data analysis. New York: Chapman and Hall.
- Gelman A, Rubin DB. 1992. Inference from iterative simulation using multiple sequences (with discussion). Stat Sci. 7:457–511.
- Giannone D, Lenza M, Primiceri GE. 2015. Prior selection for vector autoregressions. Rev Econ Stat. 97(2):436–451.
- Kalbfleisch JD, Prentice RL. 1980. The statistical analysis of failure time data. New York: Wiley.
- Lehmann EL, Casella G. 2006. Theory of point estimation. New York: Springer.
- Lunn DJ, Thomas A, Best N, Spiegelhalter D. 2000. WinBUGS-a Bayesian modeling framework: concepts, structure, and extensibility. Stat Comput. 10(4):325–337.
- Odubote O, Oluyede BO. 2014. Theoretical properties of the weighted Feller-Pareto distributions. Asian J Math Appl. 1:1–12.
- Jordanova PK, Savov M, Tchorbadjieff A, Stehlík M. 2023. Mixed Poisson process with Stacy mixing variable. Stochastic Anal Appl. 42(2):289–305.
- Robert CP, Casella G. 1999. Monte Carlo statistical methods. New York: Springer.
- Rubinstein RY. 1981. Simulation and the Monte Carlo method. New York: Wiley.
- Singh P, Hellander A. 2018. Hyperparameter optimization for approximate bayesian computation. In: 2018 Winter Simulation Conference (WSC). IEEE, p. 1718–1729.
- Stacy EW. 1962. A generalization of the Gamma distribution. Ann Math Stat 33(3):1187–1192.
- Tahmasebi S, Behboodian J. 2010. Shannon entropy for the Feller-Pareto (FP) family and order statistics of FP subfamilies. Appl Math Sci 4(10):495–504.
- Verbeke G, Molenberghs G. 2007. What can go wrong with the score test? Amer Stat 61(4):289–290.
Appendix
In this section, we discuss the proofs of the results mentioned in Section 3.
Log concavity of distributions corresponding to mixture representation in Section 3:
First of all, note that a real-valued function f is said to be log-concave on an interval (a, b) if the function \log f is concave on (a, b). Equivalently, one can say that
f'(x)/f(x) is monotonically decreasing on (a, b), or (\log f(x))'' \le 0 on (a, b), where '' stands for the second derivative and ' stands for the first derivative, respectively.
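The second-derivative criterion above can be checked numerically. The Python sketch below applies a second-difference test to the Gamma(k, 1) kernel x^(k-1) e^(-x), which is log-concave exactly when k >= 1 (since (log f)'' = -(k-1)/x^2).

```python
import math

def is_log_concave(f, grid):
    """Numerically check (log f)'' <= 0 on an equally spaced grid
    using second differences."""
    h = grid[1] - grid[0]
    lf = [math.log(f(x)) for x in grid]
    second = [(lf[i - 1] - 2 * lf[i] + lf[i + 1]) / h ** 2
              for i in range(1, len(lf) - 1)]
    return all(s <= 1e-8 for s in second)

def gamma_kernel(k):
    return lambda x: x ** (k - 1) * math.exp(-x)

grid = [0.1 + 0.01 * i for i in range(500)]
print(is_log_concave(gamma_kernel(2.5), grid))  # True: shape >= 1
print(is_log_concave(gamma_kernel(0.5), grid))  # False: shape < 1
```

The same check can be applied to the kernels of the full conditionals before handing them to an AR sampler, which requires log-concavity.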
Result 1.
The kernel of the conditional density of is log concave and integrable.
Proof.
We need to show that the kernel of this distribution is log concave and integrable. Let us define . Then
Therefore, (A.1)
Note that, from Eq. (A.1), for integer values and with , as increasing, the quantity is decreasing. Also, the term inside the second parenthesis is decreasing. Hence is log-concave.
Next, consider
For the sake of notational simplicity, we write
Note that B is positive (based on the argument as mentioned earlier.) Also, .
Therefore,
So, is log-convex. Again, since is log-concave is log-convex. Now, our kernel for the conditional density
This is the sum of two log-convex functions, and therefore, it is log-convex.
We must make a note of the following:
For integer values of is a decreasing function.
For fractional values of (i.e., ) is increasing.
For any values of and with any fractional positive values of the ratio is decreasing.
But, for integer values of and for any values of the above ratio is increasing.
However, since the quantity is of the form , it will be always .
Similarly, one can establish the log-concavity (and/or log-convexity) of all the other conditional distributions.
On the integrability of all the conditional density functions for AR sampling:
Here we begin with the following:
Since the conditional density of given is a generalized gamma, indeed it is integrable.
The conditional density of is given by (without the normalizing constant)
where , and .
Next, note that will imply . Also, . So, the function will exhibit the following:
provided .
.
Therefore, is a bounded function on the interval .
Next, let us consider the integrability of the quantity
This integral will be bounded and integrable, provided . This appears to be an impossible condition. Hence the proof.
□