β-divergence loss for the kernel density estimation with bias reduced

Hamza Dhaker (corresponding author), Mathématiques et statistique, Université de Moncton, Moncton, Canada. https://orcid.org/0000-0003-0712-9467

El Hadji Deme, UFR SAT, Université Gaston Berger, Saint-Louis, Senegal

Youssou Ciss, UFR SAT, Université Gaston Berger, Saint-Louis, Senegal


Abstract

In this paper, we investigate the problem of estimating the probability density function. Although kernel density estimation with bias reduction is nowadays a standard technique in exploratory data analysis, there is still considerable debate on how to assess the quality of the estimate and which choice of bandwidth is optimal. This paper examines the most important bandwidth selection methods for kernel density estimation in the context of bias reduction. The normal reference, least squares cross-validation, biased cross-validation and β-divergence loss methods are described and their expressions are presented. To assess the performance of the various bandwidth selectors, numerical simulations are carried out and environmental data are analysed.

Keywords:

1. Introduction

Selecting an appropriate bandwidth for a kernel density estimator is of crucial importance, and the purpose of the estimation may be an influential factor in the selection method. In many situations, it is sufficient to choose the smoothing parameter subjectively by looking at the density estimates produced by a range of bandwidths. Good overviews of kernel density estimators are given by Silverman (1986), Scott (1992) and Mugdadi and Ibrahim (2004). Let $(X_{1}, \dots, X_{n})$ be a sample of size n of identically distributed random variables with unknown probability density function (p.d.f.) f. The kernel density estimator was introduced by Parzen (1962). Let K be a kernel function on the real line, and let h be a positive value called the bandwidth. Then the kernel density estimator of f is defined as $f_{n, h} (x) = \frac{1}{n h} \sum_{i = 1}^{n} K \left(\frac{x - X_{i}}{h}\right) .$ (1)

To make the estimator meaningful, the kernel function is usually required to satisfy the conditions $K (x) > 0$ , $\int K (x) d x = 1$ , $\int x K (x) d x = 0$ and $\int x^{2} K (x) d x < \infty$ . Note that the bandwidth $h := h_{n} \downarrow 0$ as $n \uparrow \infty$ . The choice of this bandwidth is very important. Several approaches are known for choosing the bandwidth in kernel smoothing methods, for example cross-validation or the minimisation of a measure of error.
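The estimator in (1) can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: it assumes a Gaussian kernel, and the function names `gaussian_kernel` and `kde`, the simulated sample and the bandwidth value 0.4 are all hypothetical choices made here.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, one kernel satisfying the stated conditions."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Evaluate f_{n,h}(x) = (1/(n h)) * sum_i K((x - X_i)/h) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    u = (x[:, None] - sample[None, :]) / h   # one row per evaluation point
    return kernel(u).sum(axis=1) / (sample.size * h)

# Illustrative data: 200 standard normal draws and an arbitrary bandwidth.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, sample, h=0.4)
```

Because the kernel integrates to one, so does the estimate, which gives a quick sanity check on any implementation.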

Studies have shown that the kernel density estimator of f in (1) is biased. Recently, Xie and Wu (2014) studied a bias reduced version of $f_{n}$ and demonstrated its performance by comparing it with the usual methods. If the density f is twice continuously differentiable, this bias reduced estimator is given by $\begin{aligned} {\hat{f}}_{n, h} (x) & = f_{n, h} (x) - \widehat{Bias} (f_{n, h} (x)) \\ & = f_{n, h} (x) - \frac{h^{2}}{2} f_{n}^{\prime\prime} (x) \int t^{2} K (t) d t . \end{aligned}$ (2) The bandwidth h is the most dominant parameter in the kernel density estimator. This parameter controls the amount of smoothing and is analogous to the bin width in a histogram. Even though the kernel estimator depends on the kernel and the bandwidth in a rather complicated way, a graphical representation clearly illustrates the difference in importance between these two parameters; see Figures 3.3 and 2.6(a) in Wand and Jones (1995). For the most relevant bandwidth selection methods in density estimation for complete data, see the reviews of Turlach (1993), Cao et al. (1994), Jones et al. (1996), Mammen et al. (2011) and Mammen et al. (2014), and the recent work on the β-divergence for bandwidth selection by Dhaker et al. (2018).
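As a hedged illustration of the bias reduced estimator in (2), the sketch below assumes a Gaussian kernel and takes $f_{n}^{\prime\prime}$ to be the second derivative of the kernel estimate (1) computed with the same bandwidth h; both choices are assumptions made here, and the names `phi` and `bias_reduced_kde` are hypothetical. For the Gaussian kernel, $K^{\prime\prime} (u) = (u^{2} - 1) K (u)$ and $\int t^{2} K (t) d t = 1$ .

```python
import numpy as np

def phi(u):
    """Standard normal density (the assumed kernel K)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def bias_reduced_kde(x, sample, h):
    """Estimator (2): f_{n,h}(x) - (h^2 / 2) * f_n''(x) * int t^2 K(t) dt."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    u = (x[:, None] - sample[None, :]) / h
    f_nh = phi(u).sum(axis=1) / (n * h)                          # estimator (1)
    f_second = ((u**2 - 1.0) * phi(u)).sum(axis=1) / (n * h**3)  # its second derivative
    mu2 = 1.0                                                    # int t^2 K(t) dt, Gaussian kernel
    return f_nh - 0.5 * h**2 * f_second * mu2
```

Under these assumptions the estimator is an ordinary kernel estimator with the effective kernel $(3 - u^{2}) K (u) / 2$ , which is negative for $| u | > \sqrt{3}$ , so the estimate can dip slightly below zero in the tails.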

It should be noted that nonparametric estimation procedures have recently been applied to environmental data, e.g., by Schmalensee et al. (1998), Taskin and Zaim (2000), Millimet and Stengos (2000) and Millimet et al. (2003). However, the nonparametric modelling used in this paper serves another purpose, namely to study the dynamics of the entire distribution of CO₂ emissions per capita.

Our aim in this paper is to propose and compare several bandwidth selection procedures for the kernel density estimator in (2). The procedures we study are bandwidth selectors based on the β-divergence criterion with different values of β. A simulation study is then carried out to assess the finite sample behaviour of these bandwidth selectors.

The remainder of the paper is organised as follows. In Section 2, we state our main results, which present the proposed bandwidth selector based on the β-divergence $D_{β}$ . Section 3 deals with the estimation of the optimal bandwidth. Section 4 is devoted to our simulation results, Section 5 applies the methods to real datasets and, finally, Section 6 concludes the paper.

2. Bandwidth selection based on β-divergence

The β-divergence (see, e.g., Basu et al., 1998; Cichocki et al., 2006; Eguchi & Kano, 2001) is a general framework of similarity measures induced from various statistical models, such as the Poisson, Gamma, Gaussian, inverse Gaussian and compound Poisson distributions. For the connection between the β-divergence and various statistical distributions, see Jorgensen (1997). The β-divergence was proposed in Basu et al. (1998) and Minami and Eguchi (2002) and is defined as a dissimilarity between the density function and its estimator: $\begin{aligned} D_{β} ({\hat{f}}_{n, h}, f) & = \frac{1}{β} \int {({\hat{f}}_{n, h} (x))}^{β} d x - \frac{1}{β - 1} \int {({\hat{f}}_{n, h} (x))}^{β - 1} f (x) d x \\ & \quad + \frac{1}{β (β - 1)} \int {(f (x))}^{β} d x . \end{aligned}$ In the case $β = 2$ , we have $2 D_{2} ({\hat{f}}_{n, h}, f) = ISE ({\hat{f}}_{n, h}) = \int {({\hat{f}}_{n, h} (x) - f (x))}^{2} d x .$ Before stating our results, we introduce the following assumptions on the probability density function f and on the kernel K:

(F1)	f is compactly supported on I.
(F2)	f is four times continuously differentiable on I.
(F3)	$\int_{I} (f^{(4)} (x))^{2} (f (x))^{β - 2} d x < \infty$ .
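The β-divergence defined above can be approximated numerically when both densities are available on a common grid. The sketch below is only an illustration: the function name `beta_divergence`, the Riemann-sum integration and the two normal densities standing in for $f$ and ${\hat{f}}_{n, h}$ are assumptions made here. It also checks the identity $2 D_{2} = ISE$ stated for $β = 2$ .

```python
import numpy as np

def beta_divergence(f_hat, f, dx, beta):
    """Riemann-sum approximation of D_beta(f_hat, f); beta must not be 0 or 1."""
    term1 = np.sum(f_hat**beta) * dx / beta
    term2 = np.sum(f_hat**(beta - 1.0) * f) * dx / (beta - 1.0)
    term3 = np.sum(f**beta) * dx / (beta * (beta - 1.0))
    return term1 - term2 + term3

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd)**2) / (sd * np.sqrt(2.0 * np.pi))

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
f = normal_pdf(grid, 0.0, 1.0)        # stands in for the true density
f_hat = normal_pdf(grid, 0.3, 1.2)    # stands in for an estimate
ise = np.sum((f_hat - f)**2) * dx     # integrated squared error
d2 = beta_divergence(f_hat, f, dx, 2.0)
d15 = beta_divergence(f_hat, f, dx, 1.5)
```

For β > 1 the divergence is non-negative and vanishes when the two densities coincide, so values of `d15` close to zero indicate a close fit.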

Proposition 2.1

Under assumptions (F1)–(F3), the mean of $D_{β} ({\hat{f}}_{n, h}, f)$ is given by $E D_{β} ({\hat{f}}_{n, h}, f) = AED_{β} ({\hat{f}}_{n, h}, f) + O_{p} (n^{- c}) + O (h^{6}), \quad 0 < c < \frac{1}{8},$ (3) where $AED_{β} ({\hat{f}}_{n, h}, f)$ is the asymptotic mean of $D_{β} ({\hat{f}}_{n, h}, f)$ , expressed as $\begin{aligned} AED_{β} ({\hat{f}}_{n, h}, f) & = \frac{h^{8}}{2 \times 576} {\left(\int_{I} t^{4} K (t) d t\right)}^{2} \int {(f (x))}^{β - 2} {(f^{(4)} (x))}^{2} d x \\ & \quad + \frac{1}{2 n h} \int_{I} {(K (t))}^{2} d t \int {(f (x))}^{β - 1} d x . \end{aligned}$ (4)

For the proof of Proposition 2.1, see the appendix (Section A). The following theorem gives the analytical value of the bandwidth that minimises the asymptotic mean of $D_{β} ({\hat{f}}_{n, h}, f)$ .

Theorem 2.2

Assume that (F1)–(F3) hold. Then the bandwidth $h_{E D_{β}}$ that minimises $AED_{β} ({\hat{f}}_{n, h}, f)$ is $h_{β} = h_{E D_{β}} = {\left\{72 \frac{\int {(K (t))}^{2} d t \int_{I} {(f (x))}^{β - 1} d x}{{\left(\int t^{4} K (t) d t\right)}^{2} \int_{I} {(f (x))}^{β - 2} {(f^{(4)} (x))}^{2} d x}\right\}}^{1 / 9} n^{- 1 / 9} .$ (5)
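The minimisation step behind (5) can be written out explicitly. As a sketch, let $A = {(\int t^{4} K (t) d t)}^{2} \int {(f (x))}^{β - 2} {(f^{(4)} (x))}^{2} d x$ and $B = \int {(K (t))}^{2} d t \int {(f (x))}^{β - 1} d x$ ; the shorthand A and B is introduced here for illustration and does not appear in the original. Then, from (4), $\begin{aligned} AED_{β} ({\hat{f}}_{n, h}, f) & = \frac{A}{1152} h^{8} + \frac{B}{2 n} h^{- 1}, \\ \frac{\partial}{\partial h} AED_{β} ({\hat{f}}_{n, h}, f) & = \frac{A}{144} h^{7} - \frac{B}{2 n} h^{- 2} = 0 \\ \Longleftrightarrow \quad h^{9} & = \frac{72 B}{A n} . \end{aligned}$ Since the second derivative $\frac{7 A}{144} h^{6} + \frac{B}{n} h^{- 3}$ is positive for $h > 0$ , this stationary point is the minimiser, which gives the constant 72 and the rate $n^{- 1 / 9}$ in (5).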

The proof of Theorem 2.2 is derived from Proposition 2.1. From Theorem 2.2, we deduce the optimal bandwidth in the particular case $β = 2$ .

Corollary 2.3

Assume that the assumptions of Theorem 2.2 hold. Then, for $β = 2$ , we have $\begin{aligned} E D_{2} ({\hat{f}}_{n, h}, f) & = \frac{1}{2} MISE ({\hat{f}}_{n, h}), \\ AED_{2} ({\hat{f}}_{n, h}, f) & = \frac{1}{2} AMISE ({\hat{f}}_{n, h}), \end{aligned}$ where $AMISE ({\hat{f}}_{n, h})$ is the asymptotic $MISE ({\hat{f}}_{n, h}) = E \, ISE ({\hat{f}}_{n, h}),$ and the corresponding optimal bandwidth is $h_{AMISE} := h_{2} = {\left\{\frac{9}{2} \frac{R (K)}{{(μ_{4} (K))}^{2} R (f^{(4)})}\right\}}^{1 / 9} n^{- 1 / 9},$ (6) where $R (g) = \int {(g (t))}^{2} d t$ and $μ_{4} (K) = \int x^{4} K (x) d x .$

3. The choice of the bandwidth h

In this section, we describe bandwidth selection methods for the density estimator defined in (2). These methods are adapted from common automatic selectors for kernel density estimation. We consider two selection methods: the normal reference rule and cross-validation. The normal reference bandwidth is based on estimating the infeasible optimal expression (6), in which the unknown element is $R (f^{(4)})$ .

3.1. Rule-of-thumb for bandwidth selection

This method is based on the rule-of-thumb for complete data (see, e.g., Silverman, 1986). The idea is to assume that the underlying distribution is normal, $N (μ, σ^{2})$ , in which case we have the following result.

Proposition 3.1

If f is a normal density function with mean μ and variance $σ^{2},$ then the asymptotically optimal bandwidth $h_{β}$ in (5) becomes the normal reference bandwidth $h_{N R_{β}} = σ {\left\{\sqrt{\frac{2}{π}} \frac{4 β^{4}}{9 β^{4} - 36 β^{3} + 90 β^{2} + 270 β + 105}\right\}}^{1 / 9} n^{- 1 / 9} .$ (7)

In the particular case $β = 2$ , we have $h_{N R_{2}} = σ {\left\{\sqrt{\frac{2}{π}} \frac{64}{861}\right\}}^{1 / 9} n^{- 1 / 9} .$ The standard deviation σ can be estimated by the sample standard deviation S or, for robustness against outliers, by the standardised interquartile range $IQR / 1.34$ (where $1.34 \approx Φ^{- 1} (3 / 4) - Φ^{- 1} (1 / 4)$ ), but a better rule of thumb (e.g., Silverman, 1986, pp. 45–47; Härdle, 1991, p. 91) is to use $\hat{σ} = \min \left(S, \frac{IQR}{1.34}\right)$ and to define the following estimator of $h_{N R_{β}}$ : ${\hat{h}}_{N R_{β}} = \hat{σ} {\left\{\sqrt{\frac{2}{π}} \frac{4 β^{4}}{9 β^{4} - 36 β^{3} + 90 β^{2} + 270 β + 105}\right\}}^{1 / 9} n^{- 1 / 9} .$ Proof: See Appendix.
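The normal reference estimator above is simple to compute. The sketch below transcribes formula (7) as printed and uses $\hat{σ} = \min (S, IQR / 1.34)$ ; the function names `nr_constant` and `normal_reference_bandwidth` are hypothetical, and the use of the unbiased sample standard deviation and of NumPy's default percentile interpolation are assumptions made here.

```python
import numpy as np

def nr_constant(beta):
    """Bracketed factor of (7), raised to the power 1/9."""
    poly = 9.0 * beta**4 - 36.0 * beta**3 + 90.0 * beta**2 + 270.0 * beta + 105.0
    return (np.sqrt(2.0 / np.pi) * 4.0 * beta**4 / poly) ** (1.0 / 9.0)

def normal_reference_bandwidth(sample, beta=2.0):
    """Estimate h_{NR_beta} with sigma_hat = min(S, IQR / 1.34)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)                      # sample standard deviation S
    q75, q25 = np.percentile(sample, [75, 25])  # quartiles for the IQR
    sigma_hat = min(s, (q75 - q25) / 1.34)
    return sigma_hat * nr_constant(beta) * n ** (-1.0 / 9.0)
```

The bandwidth scales linearly with the data, so a change of measurement units rescales the estimate by the same factor.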

3.2. Cross-Validation

The method defined previously is based on minimising an estimate of the mean $E D_{β} ({\hat{f}}_{n, h}, f)$ , more precisely of the asymptotic mean $AED_{β} ({\hat{f}}_{n, h}, f)$ . Least squares cross-validation is the most popular method and is based on minimising the ISE (integrated squared error), i.e., the particular case of the β-divergence with $β = 2$ (see, e.g., Bowman, 1984; Rudemo, 1982). As a generalisation of the ISE, we introduce a β-divergence cross-validation ( $D_{β} C V$ ) method. Recall that $D_{β} ({\hat{f}}_{n, h}, f) = \frac{1}{β} \int {\hat{f}}_{n, h}^{β} (x) d x - \frac{1}{β - 1} \int {\hat{f}}_{n, h}^{β - 1} (x) f (x) d x + \frac{1}{β (β - 1)} \int f^{β} (x) d x .$ Since $\frac{1}{β (β - 1)} \int f^{β} (x) d x$ does not depend on h, our β-divergence cross-validation approach, like the ISE method, is based on minimising the following loss function: $\begin{aligned} L_{β} (h) & = D_{β} ({\hat{f}}_{n, h}, f) - \frac{1}{β (β - 1)} \int f^{β} (x) d x \\ & = \frac{1}{β} \int {\hat{f}}_{n, h}^{β} (x) d x - \frac{1}{β - 1} \int {\hat{f}}_{n, h}^{β - 1} (x) f (x) d x \\ & = \frac{1}{β} \int {\hat{f}}_{n, h}^{β} (x) d x - \frac{1}{β - 1} E ({\hat{f}}_{n, h}^{β - 1} (X)) . \end{aligned}$ Using the same methodology as the least squares cross-validation method, we estimate $L_{β} (h)$ from the data and minimise it over h. Replacing the expectation by its leave-one-out sample average, we consider the following estimator of $L_{β} (h)$ : $D_{β} C V (h) = \frac{1}{β} \int {\hat{f}}_{n, h}^{β} (x) d x - \frac{1}{n (β - 1)} \sum_{i = 1}^{n} {\hat{f}}_{n, h, - i}^{β - 1} (X_{i}),$ with ${\hat{f}}_{n, h, - i} (X_{i}) = \frac{1}{h (n - 1)} \sum_{j = 1, j \neq i}^{n} K \left(\frac{X_{i} - X_{j}}{h}\right) .$ Hence, the bandwidth that minimises the estimator $D_{β} C V (h)$ is ${\hat{h}}_{D_{β} C V} = \arg \min_{h} D_{β} C V (h) .$
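The cross-validation criterion can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' code: the leave-one-out term is weighted by $1 / (n (β - 1))$ , i.e. it is the sample-average estimate of the expectation in $L_{β} (h)$ ; the integral is approximated by a Riemann sum on a grid; a Gaussian kernel is the default; and estimates are clipped at zero so that non-integer powers are defined. The names `dbeta_cv` and `select_bandwidth` are hypothetical.

```python
import numpy as np

def gauss(u):
    """Standard normal kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def dbeta_cv(h, data, beta, kernel=gauss, grid_size=512):
    """Riemann-sum version of the D_beta CV criterion for bandwidth h."""
    data = np.asarray(data, dtype=float)
    n = data.size
    grid = np.linspace(data.min() - 4.0 * h, data.max() + 4.0 * h, grid_size)
    dens = kernel((grid[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)
    dens = np.clip(dens, 0.0, None)          # assumption: negative values set to 0
    integral = np.sum(dens**beta) * (grid[1] - grid[0])
    pair = kernel((data[:, None] - data[None, :]) / h)
    np.fill_diagonal(pair, 0.0)              # leave observation i out
    loo = np.clip(pair.sum(axis=1) / ((n - 1) * h), 0.0, None)
    return integral / beta - np.mean(loo**(beta - 1.0)) / (beta - 1.0)

def select_bandwidth(data, beta, candidates):
    """Return the candidate bandwidth with the smallest criterion value."""
    scores = [dbeta_cv(h, data, beta) for h in candidates]
    return candidates[int(np.argmin(scores))]
```

A grid search over candidate bandwidths is used here for simplicity; with this weighting, the criterion for $β = 2$ is one half of the usual least squares cross-validation score, so both have the same minimiser.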

Remark 3.1

In the preceding subsections, the bandwidths $h_{N R_{β}}$ and ${\hat{h}}_{D_{β} C V}$ were presented as possible optimal choices for density estimation. However, in practice neither of them is known, since they depend on the unknown parameter β. Dhaker et al. (2018) have shown that the optimal β satisfies $1 < β < 2$ . For β close to 1, we obtain the optimal h under the Kullback-Leibler criterion, and for β close to 2 we obtain that of the mean integrated squared error.

Remark 3.2

From Theorem 2.1 in Xie and Wu (2014), we have $Var ({\hat{f}}_{n, h} (x)) = \frac{1}{n h} f (x) {\left(\int u^{2} K (u) d u\right)}^{2} \int (K^{\prime\prime})^{2} (u) d u + O (n^{- 1}) .$ (8) This variance is decreasing in h, while the optimal h for $f_{n, h} (x)$ is given by $\hat{h} = {\left\{\frac{\int K (t)^{2} d t \int_{I} f (x)^{β - 1} d x}{{\left[\int t^{2} K (t) d t\right]}^{2} \int_{I} f (x)^{β - 2} f^{(2)} (x)^{2} d x}\right\}}^{1 / 5} n^{- 1 / 5} ;$ for further details, see Dhaker et al. (2018). The optimal bandwidth $\hat{h}$ of the ordinary kernel estimator $f_{n, h} (x)$ is asymptotically smaller than that of the bias reduced kernel density estimator ${\hat{f}}_{n, h} (x)$ , since it is of order $O (n^{- 1 / 5})$ compared with $O (n^{- 1 / 9})$ for the bias reduced estimator; the larger bandwidth results in a decrease in the variance (8).

4. Simulations

In this section, we evaluate the performance of the bandwidth selection procedures presented in Section 3. To this end, we have carried out a simulation study including the rule-of-thumb ( ${\hat{h}}_{N R_{2}}$ ), the least squares cross-validation bandwidth ( ${\hat{h}}_{L S C V} := {\hat{h}}_{D_{2} C V}$ ) and the β-divergence cross-validation bandwidth ( ${\hat{h}}_{D_{β} C V}$ with $β \in \{1.1, 1.5, 1.9\}$ ). Two simulation studies are carried out to evaluate different situations. In the first, the population density is a normal mixture. In the second, it is a lognormal mixture, which is heavy-tailed and subexponential.

4.1. Simulation study 1

For reasons of computation and generality, assume that the true density f is a normal mixture $m (μ, σ^{2}) = 0.5 N (0, 1) + 0.5 N (μ, σ^{2}),$ (9) where $μ \in \{0, 1, 5\}$ and $σ \in \{1, 0.5, 0.1\}$ . One thousand Monte Carlo samples of size n are generated from the normal mixture model in Equation (9) for each combination of μ, σ and $n \in \{50, 200, 700\}$ . The results of our different sets of experiments are presented in Tables 1–3. Table 1 gives the simulated relative efficiency $R E (\hat{h}) = M I S E ({\hat{f}}_{n, h_{M I S E}}) / M I S E ({\hat{f}}_{n, \hat{h}})$ of the kernel estimator, where $\hat{h}$ is one of the bandwidth estimators ${\hat{h}}_{N R_{2}}$ , ${\hat{h}}_{L S C V}$ and ${\hat{h}}_{D_{β} C V}$ ; it is lower than 1 because the optimal bandwidth $h_{M I S E}$ minimises the MISE. For each bandwidth, the mean $E (\hat{h})$ and the mean relative error $E (\hat{h} / h_{M I S E} - 1)$ are also obtained; these values are given in Tables 2 and 3, respectively.
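The Monte Carlo design described above can be sketched as follows. This is a reduced illustration under assumptions made here: 100 replications instead of 1000, a single setting $(μ, σ, n) = (1, 0.5, 50)$ , fixed candidate bandwidths in place of data-driven selectors, $h_{M I S E}$ approximated by the best candidate on a small grid, and the bias reduced estimator computed with a Gaussian kernel. All function names are hypothetical.

```python
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def sample_normal_mixture(n, mu, sigma, rng):
    """Draw n points from 0.5 N(0, 1) + 0.5 N(mu, sigma^2)."""
    from_second = rng.random(n) < 0.5
    return np.where(from_second, rng.normal(mu, sigma, n), rng.normal(0.0, 1.0, n))

def normal_mixture_pdf(x, mu, sigma):
    """Density of the mixture in (9)."""
    return 0.5 * phi(x) + 0.5 * phi((x - mu) / sigma) / sigma

def ise(sample, h, mu, sigma, grid):
    """Integrated squared error of the bias reduced Gaussian estimate on a grid."""
    u = (grid[:, None] - sample[None, :]) / h
    estimate = (0.5 * (3.0 - u**2) * phi(u)).sum(axis=1) / (sample.size * h)
    dx = grid[1] - grid[0]
    return np.sum((estimate - normal_mixture_pdf(grid, mu, sigma))**2) * dx

def mise(selector, n, mu, sigma, reps, rng, grid):
    """Monte Carlo MISE; `selector` maps a sample to a bandwidth."""
    errors = []
    for _ in range(reps):
        sample = sample_normal_mixture(n, mu, sigma, rng)
        errors.append(ise(sample, selector(sample), mu, sigma, grid))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 9.0, 701)
candidates = np.linspace(0.3, 1.5, 7)
mise_values = np.array([mise(lambda s, h=h: h, 50, 1.0, 0.5, 100, rng, grid)
                        for h in candidates])
relative_efficiency = mise_values.min() / mise_values  # at most 1 by construction
```

Passing a data-driven rule as `selector` instead of a constant would give the relative efficiency of that rule, with the smallest fixed-bandwidth MISE playing the role of $M I S E ({\hat{f}}_{n, h_{M I S E}})$ .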

In all situations, each relative efficiency satisfies $R E (\hat{h}) < 1$ because the optimal bandwidth $h_{M I S E}$ minimises the MISE.
The normal reference bandwidth ${\hat{h}}_{N R_{2}}$ performs well if the true density is not very far from normal, as in the cases $(μ, σ) \in \{(0, 1), (0, 0.5), (1, 1), (1, 0.5)\}$ . Otherwise, it usually has the smallest $R E (\hat{h})$ and the largest $E (\hat{h})$ , and tends to oversmooth its kernel density estimate the most.
Table 1 shows that ${\hat{h}}_{L S C V}$ needs a large sample size in order to be competitive. Table 2 shows that $E ({\hat{h}}_{L S C V})$ is close to the optimal $h_{M I S E}$ , but the corresponding relative error in Table 3 is large, which means that the bias of ${\hat{h}}_{L S C V}$ is small but its variation is large.
The bandwidth ${\hat{h}}_{D_{β} C V}$ is one of the best bandwidth selectors in most situations. However, it behaves very poorly for small σ (when the true density curve is sharp).

Table 1. $R E (\hat{h})$ for normal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


Table 2. $E (\hat{h})$ for normal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


Table 3. $E | \hat{h} / h_{M I S E} - 1 |$ for normal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


Figure 1 compares, for the densities with $μ = 0, 1, 5$ and $σ = 1, 0.5, 0.1$ , the results of the five bandwidth selectors ${\hat{h}}_{N R_{2}}$ , ${\hat{h}}_{L S C V}$ and ${\hat{h}}_{D_{β} C V}$ with $β = 1.1, 1.5, 1.9$ (discussed in Section 3), relative to the results obtained by using the MISE optimal bandwidth ( $h_{M I S E}$ ). The figure presents boxplots of the ratio $R E (\hat{h}) = M I S E ({\hat{f}}_{n, h_{M I S E}}) / M I S E ({\hat{f}}_{n, \hat{h}})$ for each of these estimators. We see that the LSCV and $D_{β} C V$ (with $β = 1.5$ ) methods gave the best ratios overall across all simulations, and that this ratio was rather large in general.

Figure 1. Boxplots of the relative values RE for the bandwidth selectors for the estimation of densities $μ = 0, 1, 5$ and $σ = 1, 0.5, 0.1$ . The sample size varies from 100 to 2000.

4.2. Simulation study 2

As the population density, we used a lognormal mixture $m (μ, σ^{2}) = 0.5 \log N (0, 1) + 0.5 \log N (μ, σ^{2}),$ (10) where $μ \in \{0, 1, 5\}$ and $σ \in \{1, 0.5, 0.1\}$ are the means and standard deviations, respectively. As in the previous subsection, samples are generated for each combination of $n = 50, 200, 700$ , $μ = 0, 1, 5$ and $σ = 1, 0.5, 0.1$ . For each case, Table 4 gives the simulated relative efficiency RE, and Tables 5 and 6 give the $E (\hat{h})$ and $E | \hat{h} / h_{M I S E} - 1 |$ corresponding to each bandwidth.
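Sampling from the lognormal mixture in (10) can be sketched as below. The sketch assumes that μ and σ are the mean and standard deviation on the log scale, which is the parametrisation NumPy uses; the text does not state this explicitly, and the function names are hypothetical.

```python
import numpy as np

def sample_lognormal_mixture(n, mu, sigma, rng):
    """Draw n points from 0.5 logN(0, 1) + 0.5 logN(mu, sigma^2)."""
    from_second = rng.random(n) < 0.5
    return np.where(from_second,
                    rng.lognormal(mu, sigma, n),
                    rng.lognormal(0.0, 1.0, n))

def lognormal_mixture_pdf(x, mu, sigma):
    """Density of the mixture in (10); zero for x <= 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = x > 0
    log_x = np.log(x[pos])

    def component(m, s):
        # Lognormal density with log-scale mean m and standard deviation s.
        return np.exp(-0.5 * ((log_x - m) / s)**2) / (x[pos] * s * np.sqrt(2.0 * np.pi))

    out[pos] = 0.5 * component(0.0, 1.0) + 0.5 * component(mu, sigma)
    return out

rng = np.random.default_rng(0)
sample = sample_lognormal_mixture(20000, 1.0, 0.5, rng)
```

Because the support is the positive half-line and the right tail is heavy, evaluation grids for the integrated squared error need to extend much further to the right than in the normal mixture case.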

A summary of the results is provided below.

Firstly, Table 4 shows that the RE values for ${\hat{h}}_{N R}$ and ${\hat{h}}_{L S C V}$ increase with n and approach 1, but their performance is not so good in the cases $(μ, σ) \in \{(1, 0.1), (5, 1), (5, 0.5), (5, 0.1)\}$ . However, ${\hat{h}}_{D_{β} C V}$ outperforms the others, especially ${\hat{h}}_{D_{1.9} C V}$ , which has RE values close to 1 in all situations.

Table 4. $R E (\hat{h})$ for lognormal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


Table 5. $E (\hat{h})$ for lognormal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


Table 6. $E | \hat{h} / h_{M I S E} - 1 |$ for lognormal mixture $f (x) = 0.5 ϕ (x) + 0.5 ϕ_{σ} (x - μ)$ .


5. Real data analysis

A very natural use of density estimates is in the informal investigation of the properties of a given set of data. Density estimates can give valuable indication of such features as skewness, multimodality and heavy tail in the data. In some cases, they will yield conclusions that may then be regarded as self-evidently true, while in others all they will do is to point the way to further analysis and data collection.

Three data examples are provided to illustrate the performance of kernel density estimation with different bandwidths, where the Gaussian kernel is used. They are classical examples of unimodal, bimodal and heavy-tailed distributions, respectively.

5.1. Application 1

The first data set comprises the ${CO}_{2}$ emissions per capita in 2014 and is available on the World Bank website. Figure 2 shows the estimated density of ${CO}_{2}$ per capita in 2014 computed with the bandwidth estimates ${\hat{h}}_{N R_{2}} = 1.38$ , ${\hat{h}}_{L S C V} = 0.439$ , ${\hat{h}}_{D_{1.5} C V} = 0.832$ , ${\hat{h}}_{D_{1.1} C V} = 0.932$ and ${\hat{h}}_{D_{1.9} C V} = 0.542$ . The densities estimated with the ${\hat{h}}_{L S C V}$ and ${\hat{h}}_{D_{1.9} C V}$ bandwidths capture the peak that characterises the mode, while the densities estimated with ${\hat{h}}_{N R_{2}}$ , ${\hat{h}}_{D_{1.5} C V}$ and ${\hat{h}}_{D_{1.1} C V}$ smooth out this peak. This happens because the outliers in the tail of the distribution cause ${\hat{h}}_{N R_{2}}$ , ${\hat{h}}_{D_{1.5} C V}$ and ${\hat{h}}_{D_{1.1} C V}$ to be larger than the other bandwidths.

Figure 2. Estimated density of ${CO}_{2}$ per capita in 2008 using the different bandwidths. ${\hat{h}}_{D_{1.1} C V}$ (solid line); ${\hat{h}}_{D_{1.9} C V}$ (dashed line); ${\hat{h}}_{D_{1.5} C V}$ (dotted line); ${\hat{h}}_{L S C V}$ (dotdash line) and ${\hat{h}}_{N R_{2}}$ (longdash line).

5.2. Application 2

We use the time between eruptions data set for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA (107 observations; source: Silverman, 1986). Figure 3 plots the data points and the kernel density estimates for the Old Faithful geyser data, using the bandwidths ${\hat{h}}_{N R} = 0.442$ , ${\hat{h}}_{L S C V} = 0.162$ , ${\hat{h}}_{D_{1.5} C V} = 0.176$ , ${\hat{h}}_{D_{1.1} C V} = 0.281$ and ${\hat{h}}_{D_{1.9} C V} = 0.210$ .

Figure 3. Estimated density of repair times (hours) for an airborne communication transceiver: ${\hat{h}}_{D_{1.1} C V}$ (solid line); ${\hat{h}}_{D_{1.9} C V}$ (dashed line); ${\hat{h}}_{D_{1.5} C V}$ (dotted line); ${\hat{h}}_{L S C V}$ (dotdash line) and ${\hat{h}}_{N R_{2}}$ , normal reference (longdash line).

An important point to note is that the density curve for eruption length is similar to a bimodal normal density (normal mixture). In Application 2, ${\hat{h}}_{N R_{2}}$ is larger than the other bandwidths; it heavily oversmooths its kernel density curve, underestimating the two peaks of the curve but overestimating the valley between them. The bandwidths ${\hat{h}}_{L S C V}$ , ${\hat{h}}_{D_{1.5} C V}$ and ${\hat{h}}_{D_{1.9} C V}$ seem to undersmooth the curve too much, overestimating the two peaks but underestimating the valley. However, ${\hat{h}}_{D_{1.1} C V}$ is an appropriate bandwidth, since its density estimate captures the features of the true density curve.

5.3. Application 3

Maintenance data on 46 active repair times (in hours) for an airborne communication transceiver, reported by Von Alven (1964), have been analysed by Sultan and Al-Moisheer (2015), who conclude that a mixture of inverse Weibull and lognormal models is a good fit. The estimated density function of the maintenance data is presented in Figure 4, using the commonly used bandwidths ${\hat{h}}_{N R} = 1.3150$ and ${\hat{h}}_{L S C V} = 0.5207$ , as well as the newly developed bandwidths ${\hat{h}}_{D_{1.5} C V} = 2.143$ , ${\hat{h}}_{D_{1.1} C V} = 2012$ and ${\hat{h}}_{D_{1.9} C V} = 1859$ .

Figure 4. Estimated density of repair times (hours) for an airborne communication transceiver using the different bandwidths: ${\hat{h}}_{D_{1.1} C V}$ (solid line); ${\hat{h}}_{D_{1.9} C V}$ (dashed line); ${\hat{h}}_{D_{1.5} C V}$ (dotted line); ${\hat{h}}_{L S C V}$ (dotdash line) and ${\hat{h}}_{N R_{2}}$ , normal reference (longdash line).

As expected, the normal reference bandwidth ${\hat{h}}_{N R}$ heavily oversmooths its kernel density curve. It seems that ${\hat{h}}_{D_{1.9} C V}$ is an appropriate bandwidth, since its density estimate captures the features of the true density curve.

6. Conclusion

This paper proposed a method for bandwidth selection for the bias reduced kernel density estimator given in (2). Various bandwidth selection strategies have been considered, namely the normal reference ${\hat{h}}_{N R_{2}}$ , least squares cross-validation ${\hat{h}}_{L S C V}$ and β-divergence cross-validation ${\hat{h}}_{D_{β} C V}$ with $β = 1.1, 1.5$ and 1.9. The normal reference bandwidth is a simple and quick selector, but it is of limited practical use, since it is restricted to situations where a pre-specified family of densities is correctly selected. The least squares cross-validation method does not provide a smooth density estimate: although it is asymptotically optimal, the finite sample behaviour of ${\hat{h}}_{L S C V}$ is disappointing because of its variability and undersmoothing. We have evaluated the choice of the bandwidths ${\hat{h}}_{L S C V}$ and ${\hat{h}}_{N R_{2}}$ using the β-divergence. Compared with traditional bandwidth selection methods designed for kernel density estimation, our proposed $D_{β}$ bandwidth selection method is always among the best, with large $R E (\hat{h})$ and small $E (\hat{h} / h_{M I S E} - 1)$ . Simulation studies showed that the proposed $D_{β}$ method adapts to different situations and outperforms the other bandwidths. We conclude that the choice of bandwidth based on the real data is consistent with the one based on simulations: the $D_{β}$ method ( $β = 1.1$ and 1.5 ) gives a smoother density estimate.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Hamza Dhaker

Hamza Dhaker (PhD) is Assistant Professor in Probability and Statistics at the Faculty of Sciences of Université de Moncton (Canada). His work revolves around non-parametric statistics, extreme value statistics, divergence measures and risk measures.

El Hadji Deme

El Hadji Deme (PhD) is Associate Professor in Probability and Statistics at the Faculty of Applied Sciences and Technology of Gaston Berger University in Saint-Louis (Senegal). His work revolves around non-parametric statistics, extreme value statistics, empirical processes, divergence measures, risk measures (in finance and insurance), inequality indices and social well-being.

Youssou Ciss

Youssou Ciss (PhD) is Doctor of Applied Mathematics, Probability and Statistics at the Faculty of Sciences and Technology of Gaston Berger University (Senegal). Field of work: non-parametric statistics.

References

Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559. https://doi.org/https://doi.org/10.1093/biomet/85.3.549
Web of Science ®Google Scholar
Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2), 353–360. https://doi.org/https://doi.org/10.1093/biomet/71.2.353
Web of Science ®Google Scholar
Cao, R., Cuevas, A., & Gonalez-Manteiga, W. (1994). A comparative study of several smoothing methods in density estimation? Computational Statistics and Data Analysis, 17(2), 153–176. https://doi.org/https://doi.org/10.1016/0167-9473(92)00066-Z
Web of Science ®Google Scholar
Cichocki, A., Zdunek, R., & Amari, S. (2006). Csiszar's divergences for nonnegative matrix factorization: Family of new algorithms. In Lecture notes in computer science (pp. 32–39). Springer.
Google Scholar
Dhaker, H., Ngom, P., Deme, E., & Mbodj, M. (2018). New approach for bandwidth selection in the kernel density estimation based on β-divergence. Journal of Mathematical Sciences: Advances and Applications, 51(1), 57–83. https://doi.org/10.18642/jmsaa_7100121962
Google Scholar
Eguchi, S., & Kano, Y. (2001). Robustifying maximum likelihood estimation (Technical Report). Institute of Statistical Mathematics, June.
Google Scholar
Eugene, F. S. (1969). Estimation of a probability density function and its derivatives. The Annals of Mathematical Statistics, 40(4), 1187–1195. https://doi.org/https://doi.org/10.1214/aoms/1177697495
Google Scholar
Härdle, W. K. (1991). Smoothing techniques: With implementation in S. Springer Science and Business Media.
Google Scholar
Jones, M. C., Marron, J. S., & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433), 401–407. https://doi.org/https://doi.org/10.1080/01621459.1996.10476701
Web of Science ®Google Scholar
Jorgensen, B. (1997). The Theory of Dispersion Models. Chapman Hall/CRC Monographs on Statistics and Applied Probability.
Google Scholar
Kanazawa, Y. (1993). Hellinger distance and Kullback-Leibler loss for the kernel density estimator. Statistics and Probability Letters, 18(4), 315–321. https://doi.org/https://doi.org/10.1016/0167-7152(93)90022-B
Web of Science ®Google Scholar
Mammen, E., Martinez-Miranda, M. D., Nielsen, J. P., & Sperlich, S. (2011). Do-validation for kernel density estimatio? Journal of the American Statistical Association, 106(494), 651–660. https://doi.org/https://doi.org/10.1198/jasa.2011.tm08687
Web of Science ®Google Scholar
Mammen, E., Martinez-Miranda, M. D., Nielsen, J. P., & Sperlich, S. (2014). Further theoretical and practical insight to the do-validated bandwidth selector. Journal of the Korean Statistical Society, 43(3), 355–365. https://doi.org/https://doi.org/10.1016/j.jkss.2013.11.001
Web of Science ®Google Scholar
Millimet, D. L., List, J. A., & Stengos, T. (2003). The Environmental Kuznets Curve: Real Progress or Misspecified Models? Review of Economics and Statistics, 85(4), 1038–1047. https://doi.org/10.1162/003465303772815916
Millimet, D. L., & Stengos, T. (2000). A semiparametric approach to modelling the environmental Kuznets curve across U.S. states. Department of Economics working paper, Southern Methodist University.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by Beta-divergence. Neural Computation, 14(8), 1859–1886. https://doi.org/10.1162/089976602760128045
Mugdadi, A. R., & Ibrahim, A. A. (2004). A bandwidth selection for kernel density estimation of functions of random variables. Computational Statistics and Data Analysis, 47(1), 49–62. https://doi.org/10.1016/j.csda.2003.10.013
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3), 1065–1076. https://doi.org/10.1214/aoms/1177704472
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9(2), 65–78.
Schmalensee, R., Stoker, T. M., & Judson, R. A. (1998). World Carbon Dioxide Emissions, 1950–2050. The Review of Economics and Statistics, 80(1), 15–27. https://doi.org/10.1162/003465398557294
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. Wiley.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman and Hall.
Sultan, K. S., & Al-Moisheer, A. S. (2015). Mixture of inverse Weibull and lognormal distributions: Properties, estimation, and illustration. Mathematical Problems in Engineering, 2015. https://doi.org/10.1155/2015/526786
Taskin, F., & Zaim, O. (2000). Searching for a Kuznets Curve in Environmental Efficiency Using Kernel Estimation. Economics Letters, 68(2), 217–223. https://doi.org/10.1016/S0165-1765(00)00250-0
Turlach, B. A. (1993). Bandwidth selection in kernel density estimation: A review (Technical Report). Université catholique de Louvain.
Von Alven, W. H. (Ed.). (1964). Reliability engineering. Prentice Hall.
Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. Chapman and Hall.
Xie, X., & Wu, J. (2014). Some Improvement on Convergence Rates of Kernel Density Estimator. Applied Mathematics, 5(11), 1684–1696. https://doi.org/10.4236/am.2014.511161

Appendix

Proof of Proposition 2.1

$\hat{f}_{n}^{β}(x) = \bigl\{f_{n}(x) - \widehat{\mathrm{Bias}}(f_{n}(x))\bigr\}^{β}.$
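As an illustration of the bias-reduced estimator, the following is a minimal sketch in Python. It assumes a Gaussian kernel (for which $\int t^{2} K(t)\,dt = 1$ and $K''(u) = (u^{2} - 1) K(u)$) and assumes that the second derivative $f_{n}^{(2)}$ is estimated with the same bandwidth $h$ as $f_{n}$; the function names are hypothetical and not taken from the paper.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Ordinary kernel density estimate f_n(x) = (1 / (n h)) * sum K((x - X_i) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


def kde_second_derivative(x, sample, h):
    """Second derivative of f_n at x: (1 / (n h^3)) * sum K''((x - X_i) / h)."""
    n = len(sample)
    total = 0.0
    for xi in sample:
        u = (x - xi) / h
        total += (u * u - 1.0) * gaussian_kernel(u)  # K''(u) for the Gaussian kernel
    return total / (n * h ** 3)


def bias_reduced_kde(x, sample, h):
    """f_n(x) minus the estimated leading bias (h^2 / 2) * f_n''(x) * mu_2(K)."""
    mu2 = 1.0  # second moment of the Gaussian kernel
    return kde(x, sample, h) - 0.5 * h * h * mu2 * kde_second_derivative(x, sample, h)
```

Under these assumptions the corrected estimate coincides with a kernel estimate that uses the fourth-order kernel $K(u)(3 - u^{2})/2$, so it can take negative values in the tails; raising it to a non-integer power $β$ is then only meaningful where the estimate is positive.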

With a random variable $ξ = O_{p}(1)$ whose expectation is 0 and variance is 1, we can write $f_{n}(x)$ as (see Kanazawa, 1993)
$\begin{aligned} f_{n}(x) = f(x) \Bigl[ & 1 + \frac{h^{2}}{2} \frac{f^{(2)}(x)}{f(x)} \int_{I} t^{2} K(t)\,dt + \frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + O(h^{6}) \\ & + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ + O_{p}(n^{-1/2}) \Bigr]. \end{aligned}$ (A1)

Using Corollary 2.6 of Eugene (1969), $\lim_{n \to \infty} \sup_{x} n^{c} | f_{n}^{(r)}(x) - f^{(r)}(x) | = 0$ with $0 < c < \frac{1}{2r + 4}$, we have
$\begin{aligned} \hat{f}_{n}(x) &= f_{n}(x) - \widehat{\mathrm{Bias}}(f_{n}(x)) = f_{n}(x) - \frac{h^{2}}{2} f_{n}^{(2)}(x) \int_{I} t^{2} K(t)\,dt \\ &= f_{n}(x) - \frac{h^{2}}{2} f^{(2)}(x) \int_{I} t^{2} K(t)\,dt + O(n^{-c}) \\ &= f(x) \Bigl[1 + \frac{h^{2}}{2} \frac{f^{(2)}(x)}{f(x)} \int_{I} t^{2} K(t)\,dt + \frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + O(h^{6}) \\ &\qquad + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ + O_{p}(n^{-1/2})\Bigr] - \frac{h^{2}}{2} f^{(2)}(x) \int_{I} t^{2} K(t)\,dt + O(n^{-c}) \\ &= f(x) \Bigl[1 + \frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + O(h^{6}) \\ &\qquad + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ + O_{p}(n^{-1/2}) + O(n^{-c})\Bigr], \end{aligned}$
where the $O(h^{6})$ terms depend upon $x$. Using $(1 + z)^{β} = 1 + β z + \frac{β(β - 1)}{2} z^{2} + O(z^{3})$,

$\begin{aligned} \hat{f}_{n}^{β}(x) &= f(x)^{β} \Bigl[1 + \frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + O(h^{6}) \\ &\qquad + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ + O_{p}(n^{-1/2}) + O(n^{-c})\Bigr]^{β} \\ &= f(x)^{β} \Bigl[1 + β \Bigl(\frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ\Bigr) \\ &\qquad + \frac{β(β - 1)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) \\ &\qquad + O_{p}(n^{-c}) + O(h^{6})\Bigr], \end{aligned}$
and
$\begin{aligned} \hat{f}_{n}^{β - 1}(x) &= f(x)^{β - 1} \Bigl[1 + \frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + O(h^{6}) \\ &\qquad + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ + O_{p}(n^{-1/2}) + O(n^{-c})\Bigr]^{β - 1} \\ &= f(x)^{β - 1} \Bigl[1 + (β - 1) \Bigl(\frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ\Bigr) \\ &\qquad + \frac{(β - 1)(β - 2)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) \\ &\qquad + O_{p}(n^{-c}) + O(h^{6})\Bigr]. \end{aligned}$

Substituting these two expansions into the β-divergence gives
$\begin{aligned} D_{β}(\hat{f}_{n}(x), f(x)) &= \frac{1}{β} \int \hat{f}_{n}^{β}(x)\,dx - \frac{1}{β - 1} \int \hat{f}_{n}^{β - 1}(x) f(x)\,dx + \frac{1}{β(β - 1)} \int f^{β}(x)\,dx \\ &= \frac{1}{β} \int f(x)^{β} \Bigl[1 + β \Bigl(\frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ\Bigr) \\ &\qquad + \frac{β(β - 1)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) + O_{p}(n^{-c}) + O(h^{6})\Bigr] dx \\ &\quad - \frac{1}{β - 1} \int f(x)^{β} \Bigl[1 + (β - 1) \Bigl(\frac{h^{4}}{24} \frac{f^{(4)}(x)}{f(x)} \int_{I} t^{4} K(t)\,dt + \Bigl\{\frac{\int_{I} K(t)^{2}\,dt}{n h f(x)}\Bigr\}^{1/2} ξ\Bigr) \\ &\qquad + \frac{(β - 1)(β - 2)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) + O_{p}(n^{-c}) + O(h^{6})\Bigr] dx \\ &\quad + \frac{1}{β(β - 1)} \int f^{β}(x)\,dx \\ &= \frac{1}{β} \int f(x)^{β} \Bigl[\frac{β(β - 1)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) + O_{p}(n^{-c}) + O(h^{6})\Bigr] dx \\ &\quad - \frac{1}{β - 1} \int f(x)^{β} \Bigl[\frac{(β - 1)(β - 2)}{2} \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) + O_{p}(n^{-c}) + O(h^{6})\Bigr] dx \\ &= \int f(x)^{β} \Bigl[\Bigl(\frac{β - 1}{2} - \frac{β - 2}{2}\Bigr) \Bigl(\frac{h^{8}}{576} \frac{(f^{(4)}(x))^{2}}{f^{2}(x)} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} + \frac{\int_{I} K(t)^{2}\,dt}{n h f(x)} ξ^{2}\Bigr) + O_{p}(n^{-c}) + O(h^{6})\Bigr] dx \\ &= \frac{1}{2} \Bigl[\frac{h^{8}}{576} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} \int f^{β - 2}(x) (f^{(4)}(x))^{2}\,dx + \frac{1}{n h} \int_{I} K^{2}(t)\,dt \int f^{β - 1}(x)\,dx\; ξ^{2}\Bigr] + O_{p}(n^{-c}) + O(h^{6}). \end{aligned}$
In the third equality the constant terms cancel because $\frac{1}{β} - \frac{1}{β - 1} + \frac{1}{β(β - 1)} = 0$, and the terms linear in the expansion cancel because their coefficients are $1$ and $-1$. Since $ξ$ has expectation 0 and variance 1, $E ξ^{2} = 1$, and taking expectations yields
$\begin{aligned} E D_{β}(\hat{f}_{n}(x), f(x)) &= \frac{1}{2} \Bigl[\frac{h^{8}}{576} \Bigl(\int_{I} t^{4} K(t)\,dt\Bigr)^{2} \int f^{β - 2}(x) (f^{(4)}(x))^{2}\,dx + \frac{1}{n h} \int_{I} K^{2}(t)\,dt \int f^{β - 1}(x)\,dx\Bigr] \\ &\quad + O_{p}(n^{-c}) + O(h^{6}). \end{aligned}$
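The leading terms of $E D_{β}$ have the form $\frac{1}{2}\{a h^{8}/576 + b/(nh)\}$, and setting the derivative in $h$ to zero gives $h^{9} = 72 b/(a n)$, which is the form of the bandwidth in Equation (5). The Python sketch below checks this minimiser numerically. The symbols `a` and `b` are shorthand introduced only for this illustration, with $a = (\int t^{4} K)^{2} \int f^{β-2} (f^{(4)})^{2}$ and $b = \int K^{2} \int f^{β-1}$, and the numerical values used are arbitrary.

```python
def expected_beta_divergence(h, n, a, b):
    """Leading terms of E D_beta: 0.5 * (a * h^8 / 576 + b / (n * h)).

    a and b are illustrative shorthand for the two integral constants;
    remainder terms are ignored.
    """
    return 0.5 * (a * h ** 8 / 576.0 + b / (n * h))


def optimal_bandwidth(n, a, b):
    """Closed-form minimiser of the leading terms: h = (72 * b / (a * n))^(1/9)."""
    return (72.0 * b / (a * n)) ** (1.0 / 9.0)


# Arbitrary illustrative constants (not taken from the paper).
n, a, b = 200, 2.0, 0.5
h_star = optimal_bandwidth(n, a, b)

# A coarse grid search around h_star should not find a smaller value.
grid = [h_star * (0.5 + 0.01 * k) for k in range(101)]
h_grid = min(grid, key=lambda h: expected_beta_divergence(h, n, a, b))
```

Because both $h^{8}$ and $1/h$ are convex for $h > 0$, the stationary point is the unique minimiser, so the grid search and the closed form should agree up to the grid spacing.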

Proof of Proposition 3.1

Let $f$ be the normal density, $f(x) = \frac{1}{σ \sqrt{2π}} e^{-\frac{1}{2} (\frac{x - m}{σ})^{2}},$ so

$\begin{aligned} f^{(4)}(x) &= \frac{1}{σ^{5} \sqrt{2π}} e^{-\frac{1}{2} (\frac{x - m}{σ})^{2}} \Bigl(3 - 6 \Bigl(\frac{x - m}{σ}\Bigr)^{2} + \Bigl(\frac{x - m}{σ}\Bigr)^{4}\Bigr), \\ (f^{(4)}(x))^{2} &= \frac{1}{σ^{10}\, 2π} e^{-(\frac{x - m}{σ})^{2}} \Bigl(9 - 36 \Bigl(\frac{x - m}{σ}\Bigr)^{2} + 30 \Bigl(\frac{x - m}{σ}\Bigr)^{4} + 18 \Bigl(\frac{x - m}{σ}\Bigr)^{6} + \Bigl(\frac{x - m}{σ}\Bigr)^{8}\Bigr), \\ \int f^{β - 2}(x) (f^{(4)}(x))^{2}\,dx &= \frac{1}{σ^{β + 7} \sqrt{β}\, (2π)^{\frac{β - 2}{2}}} \Bigl(\frac{9β^{4} - 36β^{3} + 90β^{2} + 270β + 105}{β^{4}}\Bigr), \end{aligned}$
and $\int f^{β - 1}(x)\,dx = \frac{1}{\sqrt{β - 1}\, (2π)^{\frac{β - 2}{2}}}.$
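The closed form for the fourth derivative of the normal density can be checked numerically. The Python sketch below compares the expression $f^{(4)}(x) = f(x)\,(3 - 6z^{2} + z^{4})/σ^{4}$, with $z = (x - m)/σ$, against a central finite difference; the function names, the evaluation point and the step size are choices made for this illustration only.

```python
import math


def normal_pdf(x, m, sigma):
    """Normal density with mean m and standard deviation sigma."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_pdf_fourth_derivative(x, m, sigma):
    """Closed form: f(x) * (3 - 6 z^2 + z^4) / sigma^4, with z = (x - m) / sigma."""
    z = (x - m) / sigma
    return normal_pdf(x, m, sigma) * (3.0 - 6.0 * z ** 2 + z ** 4) / sigma ** 4


def fourth_central_difference(func, x, step):
    """Second-order accurate central difference approximation of the fourth derivative."""
    return (
        func(x - 2.0 * step)
        - 4.0 * func(x - step)
        + 6.0 * func(x)
        - 4.0 * func(x + step)
        + func(x + 2.0 * step)
    ) / step ** 4


m, sigma, x0 = 0.2, 1.5, 0.7
closed_form = normal_pdf_fourth_derivative(x0, m, sigma)
numerical = fourth_central_difference(lambda x: normal_pdf(x, m, sigma), x0, 0.01)
```

The two values should agree to within the truncation error of the difference scheme, which is of order `step ** 2`.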

In that case the asymptotically optimal bandwidth $h_{β}$ in Equation (5) becomes the normal reference bandwidth. Writing $R(K) = \int K(t)^{2}\,dt$ and $μ_{4}(K) = \int t^{4} K(t)\,dt$,
$\begin{aligned} h_{β} = h_{E D_{β}} &= \Bigl\{72 \frac{R(K) \int_{I} f(x)^{β - 1}\,dx}{μ_{4}(K)^{2} \int_{I} f(x)^{β - 2} (f^{(4)}(x))^{2}\,dx}\Bigr\}^{1/9} n^{-1/9} \\ &= (72 R(K))^{1/9} \Bigl(\sqrt{β - 1}\, (2π)^{\frac{β - 2}{2}} μ_{4}(K)^{2} \frac{1}{σ^{β + 7} \sqrt{β}\, (2π)^{\frac{β - 2}{2}}}\Bigr)^{-1/9} \\ &\quad \times \Bigl(\frac{9β^{4} - 36β^{3} + 90β^{2} + 270β + 105}{β^{4}}\Bigr)^{-1/9} n^{-1/9}, \end{aligned}$
with σ being the standard deviation of $f$.

For the Gaussian kernel, $μ_{4}(K) = 3$ and $R(K) = (4π)^{-1/2}$, so that $h_{N R_{β}} = \Bigl\{\sqrt{\frac{2}{π}} \frac{4β^{4}}{9β^{4} - 36β^{3} + 90β^{2} + 270β + 105} \frac{1}{n}\Bigr\}^{1/9} σ.$
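The two Gaussian-kernel constants quoted above, $μ_{4}(K) = 3$ and $R(K) = (4π)^{-1/2}$, can be confirmed by numerical integration. The Python sketch below uses a composite trapezoidal rule on $[-10, 10]$; the helper names and the integration range are assumptions made for this illustration.

```python
import math


def gaussian_kernel(t):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def trapezoid(func, lower, upper, steps=20000):
    """Composite trapezoidal rule on [lower, upper] with equally spaced nodes."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width


# Fourth moment of the kernel: integral of t^4 K(t) dt.
mu4 = trapezoid(lambda t: t ** 4 * gaussian_kernel(t), -10.0, 10.0)

# Roughness of the kernel: integral of K(t)^2 dt.
roughness = trapezoid(lambda t: gaussian_kernel(t) ** 2, -10.0, 10.0)
```

The kernel tails beyond $|t| = 10$ are negligible at double precision, so truncating the range does not affect the comparison.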

In the particular case $β = 2$,
$h_{N R_{2}} = \Bigl\{\sqrt{\frac{2}{π}} \frac{16}{861} \frac{1}{n}\Bigr\}^{1/9} σ.$ (A2)
The standard deviation σ can be estimated by the sample standard deviation $s$ or, for robustness against outliers, by the standardised interquartile range $IQR / 1.34$ (where $1.34 ≈ Φ^{-1}(3/4) - Φ^{-1}(1/4)$). A better rule of thumb (e.g., Silverman, 1986, pp. 45–47; Härdle, 1991, p. 91) is
${\hat{h}}_{N R_{2}} = \Bigl\{\sqrt{\frac{2}{π}} \frac{16}{861} \frac{1}{n}\Bigr\}^{1/9} \hat{σ},$ (A3)
with $\hat{σ} = \min(s, IQR / 1.34)$.
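The rule of thumb (A3) is straightforward to compute from a sample. The Python sketch below transcribes the constant in (A3) as printed and uses the standard library for the sample standard deviation and quartiles. The function names are hypothetical, and the quartile convention (the default `exclusive` method of `statistics.quantiles`) is an assumption, since the paper does not specify one.

```python
import math
import statistics


def robust_sigma(sample):
    """sigma_hat = min(s, IQR / 1.34), with s the sample standard deviation."""
    s = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    return min(s, (q3 - q1) / 1.34)


def normal_reference_bandwidth_beta2(sample):
    """Bandwidth (A3): {sqrt(2 / pi) * (16 / 861) / n}^(1/9) * sigma_hat."""
    n = len(sample)
    constant = math.sqrt(2.0 / math.pi) * 16.0 / 861.0
    return (constant / n) ** (1.0 / 9.0) * robust_sigma(sample)


# Illustrative data only: a single large value inflates s but not the IQR.
data_with_outlier = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
bandwidth = normal_reference_bandwidth_beta2(data_with_outlier)
```

Taking the minimum of the two scale estimates keeps the bandwidth from being inflated by a few extreme observations, which is the motivation given for the robust variant.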
