
β-divergence loss for the kernel density estimation with bias reduced

Hamza Dhaker, El Hadji Deme & Youssou Ciss
Pages 221-231 | Received 26 Oct 2019, Accepted 30 Nov 2020, Published online: 14 Dec 2020

Abstract

In this paper, we investigate the problem of estimating the probability density function. Although kernel density estimation with bias reduction is nowadays a standard technique in exploratory data analysis, there is still considerable dispute over how to assess the quality of the estimate and which choice of bandwidth is optimal. This paper examines the most important bandwidth selection methods for kernel density estimation with bias reduction. Normal reference, least squares cross-validation, biased cross-validation and β-divergence loss methods are described and their expressions are presented. To assess the performance of the various bandwidth selectors, numerical simulations and applications to environmental data are carried out.

1. Introduction

Selecting an appropriate bandwidth for a kernel density estimator is of crucial importance, and the purpose of the estimation may be an influential factor in the selection method. In many situations, it is sufficient to choose the smoothing parameter subjectively by looking at the density estimates produced by a range of bandwidths. A good overview of kernel density estimators is supplied by Silverman (1986), Scott (1992), and Mugdadi and Ibrahim (2004). Let $(X_1,\ldots,X_n)$ be a sample of size $n$, identically distributed with unknown probability density function (p.d.f.) $f$. The kernel density estimator was introduced by Parzen (1962). Let $K$ be a kernel function on the real line, and let $h$ be a positive value called the bandwidth. Then the kernel density estimator of $f$ is defined as
$$f_{n,h}(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-X_i}{h}\right). \tag{1}$$

To make the estimator meaningful, the kernel function is usually required to satisfy the conditions $K(x)>0$, $\int K(x)\,dx=1$, $\int xK(x)\,dx=0$ and $\int x^2K(x)\,dx<\infty$. Note that the bandwidth $h:=h_n\to 0$ as $n\to\infty$. The choice of this bandwidth is very important. Several approaches are known for the choice of bandwidth in kernel smoothing methods, via cross-validation or by minimising a measure of error.
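As an illustration of definition (1), the following is a minimal sketch assuming a Gaussian kernel (which satisfies the conditions above). The function names `gaussian_kernel` and `kde`, the simulated sample and the bandwidth value are illustrative choices, not part of the original paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Evaluate f_{n,h}(x) = (1/(n h)) * sum_i K((x - X_i)/h) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    u = (x[:, None] - sample[None, :]) / h      # shape (len(x), n)
    return kernel(u).sum(axis=1) / (sample.size * h)

# Illustrative usage on simulated data (hypothetical sample and bandwidth).
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, sample, h=0.4)
```

Because the kernel integrates to one, the estimate itself integrates to one for any bandwidth; only its smoothness changes with `h`.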

Studies have shown that the kernel density estimator of $f$ in (1) is biased. Recently, Xie and Wu (2014) studied a bias reduced version of $f_n$ and demonstrated its performance by comparing it with the usual methods. If the density $f$ is twice continuously differentiable, this bias reduced estimator is given by
$$\hat f_{n,h}(x)=f_{n,h}(x)-\widehat{\mathrm{Bias}}(f_{n,h}(x))=f_{n,h}(x)-\frac{h^2}{2}f_{n}^{(2)}(x)\int t^2K(t)\,dt. \tag{2}$$
The bandwidth $h$ is the dominant parameter in the kernel density estimator. This parameter controls the amount of smoothing and is analogous to the bin width in a histogram. Even though the kernel estimator depends on the kernel and the bandwidth in a rather complicated way, a graphical representation clearly illustrates the difference in importance between these two parameters; see Figures 3.3 and 2.6(a) in Wand and Jones (1995). For the most relevant bandwidth selection methods in density estimation for complete data, see the reviews of Turlach (1993), Cao et al. (1994), Jones et al. (1996), Mammen et al. (2011) and Mammen et al. (2014), and the recent work on β-divergence for bandwidth selection by Dhaker et al. (2018).
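The bias correction in (2) can be sketched numerically. The code below assumes a Gaussian kernel (so that $\int t^2K(t)\,dt=1$ and $K''(u)=(u^2-1)K(u)$) and assumes the second derivative $f_n^{(2)}$ is estimated with the same bandwidth $h$ as $f_{n,h}$; all function names are hypothetical.

```python
import numpy as np

def phi(u):
    """Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def _scaled_diffs(x, sample, h):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    return (x[:, None] - sample[None, :]) / h

def kde(x, sample, h):
    """Ordinary estimator f_{n,h}(x) of equation (1)."""
    u = _scaled_diffs(x, sample, h)
    return phi(u).sum(axis=1) / (u.shape[1] * h)

def kde_second_derivative(x, sample, h):
    """f_n''(x) = (1/(n h^3)) * sum_i K''((x - X_i)/h), K''(u) = (u^2 - 1) phi(u)."""
    u = _scaled_diffs(x, sample, h)
    return ((u**2 - 1.0) * phi(u)).sum(axis=1) / (u.shape[1] * h**3)

def bias_reduced_kde(x, sample, h):
    """Equation (2): f_{n,h}(x) - (h^2 / 2) * f_n''(x) * mu_2(K)."""
    mu2 = 1.0  # second moment of the Gaussian kernel
    return kde(x, sample, h) - 0.5 * h**2 * mu2 * kde_second_derivative(x, sample, h)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-8.0, 8.0, 1601)
estimate = bias_reduced_kde(grid, sample, h=0.5)
```

Under these assumptions the corrected estimator is equivalent to a kernel estimate with the effective kernel $\tfrac12(3-u^2)K(u)$, which still integrates to one but takes negative values for $|u|>\sqrt3$, so the estimate can dip slightly below zero in the tails.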

It should be noted that nonparametric estimation procedures have recently been applied to environmental data, e.g., Schmalensee et al. (1998), Taskin and Zaim (2000), Millimet and Stengos (2000), and Millimet et al. (2003). However, the nonparametric modelling used in this paper serves another purpose, namely to study the dynamics of the entire distribution of CO2 emissions per capita.

Our aim in this paper is to propose and compare several bandwidth selection procedures for the kernel density estimator in (2). The procedures we study are bandwidth selectors based on the β-divergence criterion with different values of β. A simulation study is then carried out to assess the finite sample behaviour of these bandwidth selectors.

The remainder of the paper is organised as follows. In Section 2, we state our main results, which present the proposed bandwidth selection method based on the β-divergence $D_\beta$. Section 3 gives the estimation of the optimal bandwidth. Section 4 is devoted to our simulation results, Section 5 applies the methods to real datasets and, finally, Section 6 concludes the paper.

2. Bandwidth selection based on β-divergence

The β-divergence (see, e.g., Basu et al., 1998; Cichocki et al., 2006; Eguchi & Kano, 2001) is a general framework of similarity measures induced from various statistical models, such as the Poisson, Gamma, Gaussian, Inverse Gaussian and compound Poisson distributions. For the connection between the β-divergence and various statistical distributions, see Jorgensen (1997). The β-divergence was proposed in Basu et al. (1998) and Minami and Eguchi (2002) and is defined as a dissimilarity between the density function and its estimator:
$$D_\beta(\hat f_{n,h},f)=\frac{1}{\beta}\int \hat f_{n,h}(x)^{\beta}\,dx-\frac{1}{\beta-1}\int \hat f_{n,h}(x)^{\beta-1}f(x)\,dx+\frac{1}{\beta(\beta-1)}\int f(x)^{\beta}\,dx.$$
In the case where $\beta=2$, we have
$$2D_2(\hat f_{n,h},f)=\mathrm{ISE}(\hat f_{n,h})=\int\left(\hat f_{n,h}(x)-f(x)\right)^2dx.$$
Before stating our results, we introduce the following assumptions on the probability density function $f$ and on the kernel $K$:
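The definition of $D_\beta$ and its reduction to the ISE at $\beta=2$ can be checked numerically. The sketch below is only illustrative: it evaluates the three integrals by the trapezoidal rule on a grid, and the two normal densities standing in for $f$ and $\hat f_{n,h}$ are hypothetical choices.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule for values y on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def beta_divergence(f_hat, f, x, beta):
    """D_beta(f_hat, f) for beta not in {0, 1}; inputs are density values on x."""
    term1 = integrate(f_hat**beta, x) / beta
    term2 = integrate(f_hat**(beta - 1.0) * f, x) / (beta - 1.0)
    term3 = integrate(f**beta, x) / (beta * (beta - 1.0))
    return term1 - term2 + term3

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
f = normal_pdf(x, 0.0, 1.0)        # stand-in for the true density
f_hat = normal_pdf(x, 0.3, 1.2)    # stand-in for an estimate

d2 = beta_divergence(f_hat, f, x, beta=2.0)
ise = integrate((f_hat - f) ** 2, x)
d15 = beta_divergence(f_hat, f, x, beta=1.5)
```

With these inputs `2 * d2` agrees with `ise` up to rounding, and `d15` is positive, as expected for $\beta>1$ since the integrand of $D_\beta$ is pointwise non-negative and vanishes only where the two densities coincide.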

(F1) $f$ is compactly supported on $I$.

(F2) $f$ is four times continuously differentiable on $I$.

(F3) $\int_I \left(f^{(4)}(x)\right)^2\left(f(x)\right)^{\beta-2}dx<\infty$.

Proposition 2.1

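Under assumptions (F1)-(F3), the mean of $D_\beta(\hat f_{n,h},f)$ is given by
$$E D_\beta(\hat f_{n,h},f):=AED_\beta(\hat f_{n,h},f)+O_p(n^{-c})+O(h^6),\qquad 0<c<\frac18, \tag{3}$$
where $AED_\beta(\hat f_{n,h},f)$ is the asymptotic mean of $D_\beta(\hat f_{n,h},f)$, expressed as
$$AED_\beta(\hat f_{n,h},f)=\frac{h^8}{2\times 576}\left(\int_I t^4K(t)\,dt\right)^2\int f(x)^{\beta-2}f^{(4)}(x)^2\,dx+\frac{1}{2nh}\int_I K(t)^2\,dt\int f(x)^{\beta-1}\,dx. \tag{4}$$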

For the proof of Proposition 2.1, see the Appendix. The following theorem gives the analytical value of the bandwidth that minimises the asymptotic mean of $D_\beta(\hat f_{n,h},f)$.

Theorem 2.2

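Assume that (F1)-(F3) hold. Then the bandwidth $h_{ED_\beta}$ that minimises $AED_\beta(\hat f_{n,h},f)$ is
$$h_\beta=h_{ED_\beta}=\left(\frac{72\int K(t)^2\,dt\int_I f(x)^{\beta-1}\,dx}{\left(\int t^4K(t)\,dt\right)^2\int_I f(x)^{\beta-2}f^{(4)}(x)^2\,dx}\right)^{1/9}\times n^{-1/9}. \tag{5}$$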
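The minimisation behind (5) can be written out in a few lines. In the sketch below, $A$ and $B$ are shorthand introduced here for the two $h$-free factors of (4); they do not appear in the original text.

```latex
\begin{aligned}
AED_\beta(h) &= A\,h^{8}+\frac{B}{nh},
\qquad
A=\frac{1}{1152}\left(\int t^{4}K(t)\,dt\right)^{2}\int_I f(x)^{\beta-2}f^{(4)}(x)^{2}\,dx,
\qquad
B=\frac12\int K(t)^{2}\,dt\int_I f(x)^{\beta-1}\,dx,\\[4pt]
\frac{d}{dh}AED_\beta(h) &= 8A\,h^{7}-\frac{B}{nh^{2}}=0
\;\Longrightarrow\;
h^{9}=\frac{B}{8An}
=\frac{72\int K(t)^{2}\,dt\int_I f(x)^{\beta-1}\,dx}
{n\left(\int t^{4}K(t)\,dt\right)^{2}\int_I f(x)^{\beta-2}f^{(4)}(x)^{2}\,dx},\\[4pt]
\frac{d^{2}}{dh^{2}}AED_\beta(h) &= 56A\,h^{6}+\frac{2B}{nh^{3}}>0 .
\end{aligned}
```

The constant 72 arises as $1152/(2\times 8)$, and the positive second derivative confirms that the stationary point is a minimum.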

The proof of Theorem 2.2 is derived from Proposition 2.1. From Theorem 2.2, we deduce the particular case where β=2 of optimal bandwidth selection.

Corollary 2.3

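Assume that the assumptions of Theorem 2.2 hold. Then, for $\beta=2$, we have
$$ED_2(\hat f_{n,h},f)=\frac12\,\mathrm{MISE}(\hat f_{n,h}),\qquad AED_2(\hat f_{n,h},f)=\frac12\,\mathrm{AMISE}(\hat f_{n,h}),$$
where $\mathrm{AMISE}(\hat f_{n,h})$ is the asymptotic $\mathrm{MISE}(\hat f_{n,h})=E\,\mathrm{ISE}(\hat f_{n,h})$, and the corresponding optimal bandwidth is
$$h_{AMISE}:=h_2=\left(\frac{72R(K)}{\mu_4(K)^2R(f^{(4)})}\right)^{1/9}n^{-1/9}, \tag{6}$$
where $R(g)=\int g(t)^2\,dt$ and $\mu_4(K)=\int x^4K(x)\,dx$.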

3. The choice of the bandwidth h

In this section, we describe bandwidth selection methods for the density estimator defined in (2). These methods are adapted from common automatic selectors for kernel density estimation. We propose two selection methods: a normal reference method and a cross-validation method. The normal reference bandwidth is based on estimating the infeasible optimal expression (6), in which the unknown element is $R(f^{(4)})$.

3.1. Rule-of-thumb for bandwidth selection

This method is based on the rule-of-thumb for complete data (see, e.g., Silverman, 1986). The idea is to assume that the underlying distribution is normal, $N(\mu,\sigma^2)$, in which case we have the following result.

Proposition 3.1

If $f$ is a normal density function with mean $\mu$ and variance $\sigma^2$, then the asymptotically optimal bandwidth $h_\beta$ in (5) becomes the normal reference bandwidth
$$h_{NR_\beta}=\sigma\left(\frac{2\pi\,4\beta^4}{9\beta^4-36\beta^3+90\beta^2+270\beta+105}\right)^{1/9}\times n^{-1/9}. \tag{7}$$

In the particular case where $\beta=2$, we have
$$h_{NR_2}=\sigma\left(\frac{2\pi\,64}{861}\right)^{1/9}n^{-1/9}.$$
The standard deviation $\sigma$ can be estimated by the sample standard deviation $S$ or, for robustness against outliers, by the standardised interquartile range $\mathrm{IQR}/1.34$ (where $1.34\approx\Phi^{-1}(3/4)-\Phi^{-1}(1/4)$). A better rule of thumb (e.g., Silverman, 1986, pp. 45-47; Härdle, 1991, p. 91) is to use $\hat\sigma=\min(S,\mathrm{IQR}/1.34)$ and to define the following estimator of $h_{NR_\beta}$:
$$\hat h_{NR_\beta}=\hat\sigma\left(\frac{2\pi\,4\beta^4}{9\beta^4-36\beta^3+90\beta^2+270\beta+105}\right)^{1/9}\times n^{-1/9}.$$
Proof: See Appendix.
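The robust scale estimate $\hat\sigma=\min(S,\mathrm{IQR}/1.34)$ and the resulting plug-in bandwidth can be sketched as follows. The β-dependent constant is deliberately left as an argument (`constant`, a hypothetical parameter standing for the bracketed factor of (7) raised to the power $1/9$) rather than hard-coded; the function names and the toy data are likewise illustrative.

```python
import numpy as np

def robust_sigma(sample):
    """sigma_hat = min(S, IQR / 1.34), with S the sample standard deviation."""
    sample = np.asarray(sample, dtype=float)
    s = sample.std(ddof=1)
    q75, q25 = np.percentile(sample, [75, 25])
    return min(s, (q75 - q25) / 1.34)

def normal_reference_bandwidth(sample, constant):
    """h = constant * sigma_hat * n^(-1/9).

    `constant` is supplied by the caller (assumption of this sketch); it plays
    the role of the beta-dependent factor in the normal reference rule.
    """
    n = len(sample)
    return constant * robust_sigma(sample) * n ** (-1.0 / 9.0)

# A single outlier inflates S but barely moves the interquartile range.
data = [0.0, 1.0, 2.0, 3.0, 4.0, 100.0]
sigma_hat = robust_sigma(data)
h = normal_reference_bandwidth(data, constant=1.0)
```

In this toy sample the outlier at 100 makes $S$ very large, so the minimum is attained by $\mathrm{IQR}/1.34$, which is the robustness the rule is designed to provide.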

3.2. Cross-Validation

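The method previously defined is based on minimising estimates of the mean $ED_\beta(\hat f_{n,h},f)$, more precisely of the asymptotic mean $AED_\beta(\hat f_{n,h},f)$. Least squares cross-validation is the most popular method and relies on minimising the ISE (integrated squared error), i.e., the particular case of the β-divergence with $\beta=2$ (see, e.g., Bowman (1984) and Rudemo (1982)). As a generalisation of the ISE approach, we introduce a β-divergence cross-validation ($D_\beta CV$) method. Recall that
$$D_\beta(\hat f_{n,h},f)=\frac{1}{\beta}\int \hat f_{n,h}^{\beta}(x)\,dx-\frac{1}{\beta-1}\int \hat f_{n,h}^{\beta-1}(x)f(x)\,dx+\frac{1}{\beta(\beta-1)}\int f^{\beta}(x)\,dx.$$
Since $\frac{1}{\beta(\beta-1)}\int f^{\beta}(x)\,dx$ does not depend on $h$, our β-divergence cross-validation approach, like the ISE method, is based on minimising the following loss function:
$$\begin{aligned}
L_\beta(h)&=D_\beta(\hat f_{n,h},f)-\frac{1}{\beta(\beta-1)}\int f^{\beta}(x)\,dx\\
&=\frac{1}{\beta}\int \hat f_{n,h}^{\beta}(x)\,dx-\frac{1}{\beta-1}\int \hat f_{n,h}^{\beta-1}(x)f(x)\,dx\\
&=\frac{1}{\beta}\int \hat f_{n,h}^{\beta}(x)\,dx-\frac{1}{\beta-1}E\,\hat f_{n,h}^{\beta-1}(X).
\end{aligned}$$
Using the same methodology as the least squares cross-validation method, we estimate $L_\beta(h)$ from the data and minimise it over $h$. Consider the following estimator of $L_\beta(h)$:
$$D_\beta CV(h)=\frac{1}{\beta}\int \hat f_{n,h}^{\beta}(x)\,dx-\frac{1}{n(\beta-1)}\sum_{i=1}^{n}\hat f_{n,h,-i}^{\beta-1}(X_i),$$
with the leave-one-out estimate
$$\hat f_{n,h,-i}(X_i)=\frac{1}{h(n-1)}\sum_{j\neq i}^{n}K\left(\frac{X_i-X_j}{h}\right).$$
Hence, the optimal bandwidth that minimises the estimator $D_\beta CV(h)$ is
$$\hat h_{D_\beta CV}=\arg\min_h D_\beta CV(h).$$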
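The cross-validation criterion can be sketched as a grid search. This is a simplified illustration rather than the authors' implementation: it uses the ordinary Gaussian-kernel estimator (1) in both terms, approximates the integral by a Riemann sum on a uniform grid, and weights the leave-one-out sum by $1/(n(\beta-1))$ so that $\beta=2$ gives half of the least squares cross-validation score. Applying it to the bias reduced estimator (2) would additionally need a convention for points where that estimator is negative. All names are hypothetical.

```python
import numpy as np

def phi(u):
    """Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def dbeta_cv(h, sample, beta, grid):
    """(1/beta) * int f^beta dx - (1/(n (beta-1))) * sum_i f_{-i}(X_i)^(beta-1)."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    # Integral term: density on a uniform grid, then a Riemann sum.
    f_grid = phi((grid[:, None] - x[None, :]) / h).sum(axis=1) / (n * h)
    integral = np.sum(f_grid**beta) * (grid[1] - grid[0])
    # Leave-one-out values: remove the diagonal term K(0) from each row sum.
    k = phi((x[:, None] - x[None, :]) / h)
    loo = (k.sum(axis=1) - phi(0.0)) / ((n - 1) * h)
    return integral / beta - np.sum(loo ** (beta - 1.0)) / (n * (beta - 1.0))

def select_bandwidth(sample, beta, h_grid, grid):
    """Return the candidate bandwidth with the smallest criterion value."""
    scores = [dbeta_cv(h, sample, beta, grid) for h in h_grid]
    return float(h_grid[int(np.argmin(scores))])

rng = np.random.default_rng(0)
sample = rng.normal(size=100)
grid = np.linspace(-8.0, 8.0, 1601)
h_grid = np.linspace(0.1, 1.5, 29)
h_hat = select_bandwidth(sample, beta=1.5, h_grid=h_grid, grid=grid)
```

A grid search is used instead of a numerical optimiser because cross-validation scores can have several local minima; the candidate range `h_grid` is an arbitrary choice for this example.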

Remark 3.1

In the preceding section, the bandwidths $h_{NR_\beta}$ and $\hat h_{D_\beta CV}$ were presented as possible optimal choices for density estimation. However, in practice none of them is known, since they depend on the unknown parameter β. In Dhaker et al. (2018), the authors showed that the optimal β satisfies $1<\beta<2$. For a value of β close to 1 we obtain the optimal $h$ given by the Kullback-Leibler criterion, and for β close to 2 we obtain that of the mean integrated squared error.

Remark 3.2

From Theorem 2.1 in Xie and Wu (2014), we have
$$\mathrm{Var}(\hat f_{n,h}(x))=\frac{1}{nh}f(x)\left(\frac{\int u^2K(u)\,du}{2}\right)^2\times\int (K'')^2(u)\,du+O(n^{-1}). \tag{8}$$
This variance is decreasing in $h$, while the optimal $h$ for $f_{n,h}(x)$ is given by
$$\hat h=\left(\frac{\int K(t)^2\,dt\int_I f(x)^{\beta-1}\,dx}{\left(\int t^2K(t)\,dt\right)^2\int_I f(x)^{\beta-2}f^{(2)}(x)^2\,dx}\right)^{1/5}n^{-1/5};$$
for further details see Dhaker et al. (2018). The ordinary kernel estimator $f_{n,h}(x)$ with its optimal $\hat h$ is asymptotically inferior to the bias reduced kernel density estimator $\hat f_{n,h}(x)$: its optimal bandwidth is of order $O(n^{-1/5})$, compared with $O(n^{-1/9})$ for the bias reduced estimator, and the larger bandwidth of the latter results in a decrease of the variance (8).

4. Simulations

In this section, we evaluate the performance of the bandwidth selection procedures presented in Section 3. To this end, we have carried out a simulation study including the rule-of-thumb bandwidth ($\hat h_{NR_2}$), the least squares cross-validation bandwidth ($\hat h_{LSCV}:=\hat h_{D_2CV}$) and the β-divergence cross-validation bandwidth ($\hat h_{D_\beta CV}$ with $\beta\in\{1.5,1.1,1.9\}$). Two simulation studies are carried out to evaluate different situations. In the first, the population density is a normal mixture. In the second, it is a lognormal mixture, which is a heavy-tailed (subexponential) distribution.

4.1. Simulation study 1

For the sake of computation and generality, assume that the true density $f$ is a normal mixture
$$m(\mu,\sigma^2)=0.5\,N(0,1)+0.5\,N(\mu,\sigma^2), \tag{9}$$
where $\mu\in\{0,1,5\}$ and $\sigma\in\{1,0.5,0.1\}$. One thousand Monte Carlo samples of size $n$ are generated from the normal mixture model in (9) for each $n\in\{50,200,700\}$. The results of our different sets of experiments are presented in Tables 1-3. Table 1 exhibits the simulated relative efficiency $RE(\hat h)=\mathrm{MISE}(\hat f_{n,h_{MISE}})/\mathrm{MISE}(\hat f_{n,\hat h})$ of the kernel estimator, where $\hat h$ is one of the bandwidth estimators $\hat h_{NR_2}$, $\hat h_{LSCV}$ and $\hat h_{D_\beta CV}$; it is lower than 1 because the optimal bandwidth $h_{MISE}$ minimises the MISE. For each bandwidth, the mean $E(\hat h)$ and the mean relative error $E|\hat h/h_{MISE}-1|$ are given in Tables 2 and 3, respectively.

  1. For all situations, each relative efficiency RE(hˆ)<1 because the optimal bandwidth hMISE minimises the MISE.

  2. The normal reference bandwidth hˆNR2 performs well if the true density is not very far from normal, such as the cases of (μ,σ){(0,1),(0,0.5),(1,1),(1,0.5)}. Otherwise, it usually has the smallest RE(hˆ) and largest E(hˆ), tending to oversmooth its kernel density estimate the most.

  3. In Table 1, $\hat h_{LSCV}$ needs a large sample size in order to be competitive. Note also that in Table 2, $E(\hat h_{LSCV})$ is close to the optimal $h_{MISE}$, but the corresponding $E|\hat h_{LSCV}/h_{MISE}-1|$ in Table 3 is large, which means that the bias of $\hat h_{LSCV}$ is small but its variation is large.

  4. The bandwidth $\hat h_{D_\beta CV}$ seems to be among the best of the bandwidth selectors considered. In most situations, it is indeed one of the best; however, it behaves very poorly for small σ (when the true density curve is sharp).
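The simulation design of this subsection can be sketched in code. The following is a reduced illustration and not the authors' program: it draws from the mixture (9), approximates the MISE of the bias reduced estimator by Monte Carlo for a few fixed candidate bandwidths (rather than data-driven ones), and uses far fewer replications than the one thousand reported. The Gaussian-kernel form $\tfrac12(3-u^2)K(u)$ of estimator (2) assumes the derivative is estimated with the same bandwidth; all names and settings are hypothetical.

```python
import numpy as np

def sample_normal_mixture(n, mu, sigma, rng):
    """Draw n points from 0.5*N(0,1) + 0.5*N(mu, sigma^2)."""
    from_second = rng.random(n) < 0.5
    return np.where(from_second, rng.normal(mu, sigma, n), rng.normal(0.0, 1.0, n))

def mixture_density(x, mu, sigma):
    phi = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return 0.5 * phi(x) + 0.5 * phi((x - mu) / sigma) / sigma

def bias_reduced_kde(x, sample, h):
    """Estimator (2) with a Gaussian kernel, written via its effective kernel."""
    u = (x[:, None] - sample[None, :]) / h
    w = 0.5 * (3.0 - u**2) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return w.sum(axis=1) / (sample.size * h)

def mc_mise(h, n, mu, sigma, grid, reps, rng):
    """Monte Carlo average of the ISE over `reps` simulated samples."""
    dx = grid[1] - grid[0]
    f = mixture_density(grid, mu, sigma)
    ise = []
    for _ in range(reps):
        sample = sample_normal_mixture(n, mu, sigma, rng)
        ise.append(np.sum((bias_reduced_kde(grid, sample, h) - f) ** 2) * dx)
    return float(np.mean(ise))

rng = np.random.default_rng(1)
grid = np.linspace(-6.0, 11.0, 851)
h_candidates = np.linspace(0.2, 2.0, 8)
mise = np.array([mc_mise(h, 200, 5.0, 1.0, grid, 30, rng) for h in h_candidates])
h_best = float(h_candidates[np.argmin(mise)])
relative_efficiency = mise.min() / mise   # equals 1 at h_best, below 1 elsewhere
```

By construction the ratio is at most 1, mirroring observation 1 above: no candidate can beat the bandwidth that minimises the (approximated) MISE.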

Table 1. $RE(\hat h)$ for normal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

Table 2. $E(\hat h)$ for normal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

Table 3. $E|\hat h/h_{MISE}-1|$ for normal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

Figure 1 compares, for densities with $\mu=0,1,5$ and $\sigma=1,0.5,0.1$, the results of the five bandwidth selectors $\hat h_{NR_2}$, $\hat h_{LSCV}$ and $\hat h_{D_\beta CV}$ with $\beta=1.1,1.5,1.9$ (discussed in Section 3), relative to the results obtained by using the MISE optimal bandwidth $h_{MISE}$. The figure presents boxplots of the ratio $RE(\hat h)=\mathrm{MISE}(\hat f_{n,h_{MISE}})/\mathrm{MISE}(\hat f_{n,\hat h})$ for each of these selectors. We see that the LSCV and $D_\beta CV$ (with $\beta=1.5$) methods gave the best ratios overall across all simulations, and that this ratio was rather large in general.

Figure 1. Boxplots of the relative values RE for the bandwidth selectors for the estimation of densities μ=0,1,5 and σ=1,0.5,0.1. The sample size varies from 100 to 2000.


4.2. Simulation study 2

As the population density, we used a lognormal mixture
$$m(\mu,\sigma^2)=0.5\,\log N(0,1)+0.5\,\log N(\mu,\sigma^2), \tag{10}$$
where $\mu\in\{0,1,5\}$ and $\sigma\in\{1,0.5,0.1\}$ are the means and standard deviations, respectively. As in the previous subsection, we consider each combination of $n=50, 200, 700$, $\mu=0,1,5$ and $\sigma=1,0.5,0.1$. For each case, Table 4 exhibits the simulated relative efficiency RE, and Tables 5 and 6 give $E(\hat h)$ and $E|\hat h/h_{MISE}-1|$ for each bandwidth.

A summary of the results is provided below.

Table 4 shows that the RE values for $\hat h_{NR}$ and $\hat h_{LSCV}$ increase with $n$ and approach 1, but the performance is not so good in the cases $(\mu,\sigma)\in\{(1,0.1),(5,1),(5,0.5),(5,0.1)\}$. However, $\hat h_{D_\beta CV}$ outperforms the others, especially $\hat h_{D_{1.9}CV}$, which has RE values close to 1 in all situations.

Table 4. $RE(\hat h)$ for lognormal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

Table 5. $E(\hat h)$ for lognormal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

Table 6. $E|\hat h/h_{MISE}-1|$ for lognormal mixture $f(x)=0.5\phi(x)+0.5\phi_\sigma(x-\mu)$.

5. Real data analysis

A very natural use of density estimates is in the informal investigation of the properties of a given set of data. Density estimates can give valuable indication of such features as skewness, multimodality and heavy tail in the data. In some cases, they will yield conclusions that may then be regarded as self-evidently true, while in others all they will do is to point the way to further analysis and data collection.

Three data examples are provided to illustrate the performance of kernel density estimation with different bandwidths, using the Gaussian kernel. They are classical examples of unimodal, bimodal and heavy-tailed distributions, respectively.

5.1. Application 1

The first data set comprises the CO2 emissions per capita in the year 2014. This data set is available on the World Bank website. Figure 2 shows the estimated density of CO2 per capita in 2014 computed with the bandwidth estimators $\hat h_{NR_2}=1.38$, $\hat h_{LSCV}=0.439$, $\hat h_{D_{1.5}CV}=0.832$, $\hat h_{D_{1.1}CV}=0.932$ and $\hat h_{D_{1.9}CV}=0.542$. The estimated densities computed with the $\hat h_{LSCV}=0.439$ and $\hat h_{D_{1.9}CV}$ bandwidths capture the peak that characterises the mode, while the estimated densities computed with $\hat h_{NR_2}$, $\hat h_{D_{1.5}CV}$ and $\hat h_{D_{1.1}CV}$ smooth out this peak. This happens because the outliers in the tail of the distribution cause $\hat h_{NR_2}$, $\hat h_{D_{1.5}CV}$ and $\hat h_{D_{1.1}CV}$ to be larger than the other bandwidths.

Figure 2. Estimated density of CO2 per capita in 2008 using the different bandwidths: $\hat h_{D_{1.1}CV}$ (solid line); $\hat h_{D_{1.9}CV}$ (dashed line); $\hat h_{D_{1.5}CV}$ (dotted line); $\hat h_{LSCV}$ (dotdash line) and $\hat h_{NR_2}$ (longdash line).


5.2. Application 2

We use the time between eruptions data set for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA (107 observations; source: Silverman, 1986). Figure 3 plots the data points and the kernel density estimates for the Old Faithful geyser data, using the bandwidths $\hat h_{NR}=0.442$, $\hat h_{LSCV}=0.162$, $\hat h_{D_{1.5}CV}=0.176$, $\hat h_{D_{1.1}CV}=0.281$ and $\hat h_{D_{1.9}CV}=0.210$.

Figure 3. Estimated density of the Old Faithful geyser data using the different bandwidths: $\hat h_{D_{1.1}CV}$ (solid line); $\hat h_{D_{1.9}CV}$ (dashed line); $\hat h_{D_{1.5}CV}$ (dotted line); $\hat h_{LSCV}$ (dotdash line) and $\hat h_{NR_2}$, normal reference (longdash line).


An important point to note is that the density curve for eruption length resembles a bimodal normal density (normal mixture). In Application 2, we see that $\hat h_{NR_2}$ is larger than all the other bandwidths; it heavily oversmooths the kernel density curve, underestimating the two peaks of the curve but overestimating the valley between them. The bandwidths $\hat h_{LSCV}$, $\hat h_{D_{1.5}CV}$ and $\hat h_{D_{1.9}CV}$ seem to undersmooth the curve too much, overestimating the two peaks but underestimating the valley. However, $\hat h_{D_{1.1}CV}$ is a suitable bandwidth, as its density estimate captures the features of the true density curve.

5.3. Application 3

Maintenance data on 46 active repair times (in hours) for an airborne communication transceiver, reported by Von Alven (1964), have been analysed by Sultan and Al-Moisheer (2015), who conclude that a mixture of inverse Weibull and lognormal models is a good fit. The estimated density function of the maintenance data is presented in Figure 4, using the commonly used bandwidths $\hat h_{NR}=1.3150$ and $\hat h_{LSCV}=0.5207$, as well as the newly developed bandwidths $\hat h_{D_{1.5}CV}=2.143$, $\hat h_{D_{1.1}CV}=2012$ and $\hat h_{D_{1.9}CV}=1859$.

Figure 4. Estimated density of repair times (hours) for an airborne communication transceiver using the different bandwidths: $\hat h_{D_{1.1}CV}$ (solid line); $\hat h_{D_{1.9}CV}$ (dashed line); $\hat h_{D_{1.5}CV}$ (dotted line); $\hat h_{LSCV}$ (dotdash line) and $\hat h_{NR_2}$, normal reference (longdash line).


As expected, the normal reference bandwidth $\hat h_{NR}$ heavily oversmooths its kernel density curve. It seems that $\hat h_{D_{1.9}CV}$ is an appropriate bandwidth, as its density estimate is able to capture the features of the true density curve.

6. Conclusion

This paper proposed a method for bandwidth selection for the bias reduced kernel density estimator given in (2). Various bandwidth selection strategies have been considered, namely the normal reference $\hat h_{NR_2}$, least squares cross-validation $\hat h_{LSCV}$ and the β-divergence cross-validation $\hat h_{D_\beta CV}$, with $\beta=1.5$, $1.1$ and $1.9$. The normal reference bandwidth method is a simple and quick selector, but its practical use is limited, since it is restricted to situations where a pre-specified family of densities is correctly selected. The least squares cross-validation method does not provide a smooth density estimate: although it is asymptotically optimal, the finite sample behaviour of $\hat h_{LSCV}$ is disappointing because of its variability and undersmoothing. We have attempted to evaluate the choice of the optimal bandwidth against $\hat h_{LSCV}$ and $\hat h_{NR_2}$, using the β-divergence. Compared to traditional bandwidth selection methods designed for kernel density estimation, our proposed $D_\beta$ bandwidth selection method is always one of the best, having large $RE(\hat h)$ and small $E|\hat h/h_{MISE}-1|$. Simulation studies showed that our proposed $D_\beta$ bandwidth method adapts to different situations and outperforms the other bandwidths. We conclude that the choice of bandwidth based on the real data is consistent with the one based on simulations: the $D_\beta$ method ($\beta=1.1$ and $1.5$) gives a smoother density estimate.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Notes on contributors

Hamza Dhaker

Hamza Dhaker (PhD) is Assistant Professor in Probability and Statistics at the Faculty of Sciences of Université de Moncton (Canada). His work revolves around non-parametric statistics, extreme value statistics, divergence measures and risk measures.

El Hadji Deme

El Hadji Deme (PhD) is Associate Professor in Probability and Statistics at the Faculty of Applied Sciences and Technology of Gaston Berger University in Saint-Louis (Senegal). His work revolves around non-parametric statistics, extreme value statistics, empirical processes, divergence measures, risk measures (in finance and insurance), inequality indices and social well-being.

Youssou Ciss

Youssou Ciss (PhD) holds a doctorate in Applied Mathematics, Probability and Statistics from the Faculty of Sciences and Technology of Gaston Berger University (Senegal). Field of work: non-parametric statistics.

References

Appendix

Proof of Proposition 2.1

$$\hat f_n^{\beta}(x)=\left(f_n(x)-\widehat{\mathrm{Bias}}(f_n(x))\right)^{\beta}.$$

With a random variable $\xi=O_p(1)$ whose expectation is 0 and variance 1, we can write $f_n(x)$ as (see Kanazawa, 1993)
$$f_n(x)=f(x)\left[1+\frac{h^2}{2}\frac{f^{(2)}(x)}{f(x)}\int_I t^2K(t)\,dt+\frac{h^4}{24}\frac{f^{(4)}(x)}{f(x)}\int_I t^4K(t)\,dt+O(h^6)+\left(\frac{\int_I K(t)^2\,dt}{nhf(x)}\right)^{1/2}\xi+O_p(n^{-1/2})\right]. \tag{A1}$$

Using Corollary 2.6 of Eugene (1969),
$$\lim_{n\to\infty}\sup_x n^{c}\left|f_n^{(r)}(x)-f^{(r)}(x)\right|=0\quad\text{with}\quad 0<c<\frac{1}{2r+4},$$
we have
$$\begin{aligned}
\hat f_n(x)&=f_n(x)-\widehat{\mathrm{Bias}}(f_n(x))=f_n(x)-\frac{h^2}{2}f_n^{(2)}(x)\int_I t^2K(t)\,dt\\
&=f_n(x)-\frac{h^2}{2}f^{(2)}(x)\int_I t^2K(t)\,dt+O(n^{-c})\\
&=f(x)\left[1+\frac{h^4}{24}\frac{f^{(4)}(x)}{f(x)}\int_I t^4K(t)\,dt+O(h^6)+\left(\frac{\int_I K(t)^2\,dt}{nhf(x)}\right)^{1/2}\xi+O_p(n^{-1/2})+O(n^{-c})\right],
\end{aligned}$$
where the $O(h^6)$ terms depend upon $x$; the last line follows by substituting (A1), whose $h^2$ term cancels. Using $(1+z)^{\beta}=1+\beta z+\frac{\beta(\beta-1)}{2}z^2+O(z^3)$,
$$\begin{aligned}
\hat f_n^{\beta}(x)=f(x)^{\beta}\Bigg[&1+\beta\left(\frac{h^4}{24}\frac{f^{(4)}(x)}{f(x)}\int_I t^4K(t)\,dt+\left(\frac{\int_I K(t)^2\,dt}{nhf(x)}\right)^{1/2}\xi\right)\\
&+\frac{\beta(\beta-1)}{2}\left(\frac{h^8}{576}\frac{(f^{(4)}(x))^2}{f^2(x)}\left(\int_I t^4K(t)\,dt\right)^2+\frac{\int_I K(t)^2\,dt}{nhf(x)}\xi^2\right)+O_p(n^{-c})+O(h^6)\Bigg],
\end{aligned}$$
and
$$\begin{aligned}
\hat f_n^{\beta-1}(x)=f(x)^{\beta-1}\Bigg[&1+(\beta-1)\left(\frac{h^4}{24}\frac{f^{(4)}(x)}{f(x)}\int_I t^4K(t)\,dt+\left(\frac{\int_I K(t)^2\,dt}{nhf(x)}\right)^{1/2}\xi\right)\\
&+\frac{(\beta-1)(\beta-2)}{2}\left(\frac{h^8}{576}\frac{(f^{(4)}(x))^2}{f^2(x)}\left(\int_I t^4K(t)\,dt\right)^2+\frac{\int_I K(t)^2\,dt}{nhf(x)}\xi^2\right)+O_p(n^{-c})+O(h^6)\Bigg].
\end{aligned}$$
Substituting these expansions into
$$D_\beta(\hat f_n,f)=\frac{1}{\beta}\int\hat f_n^{\beta}(x)\,dx-\frac{1}{\beta-1}\int\hat f_n^{\beta-1}(x)f(x)\,dx+\frac{1}{\beta(\beta-1)}\int f^{\beta}(x)\,dx,$$
the constant terms cancel because $\frac1\beta-\frac{1}{\beta-1}+\frac{1}{\beta(\beta-1)}=0$, and the first-order terms cancel as well, leaving
$$\begin{aligned}
D_\beta(\hat f_n,f)&=\int f(x)^{\beta}\left(\frac{\beta-1}{2}-\frac{\beta-2}{2}\right)\left(\frac{h^8}{576}\frac{(f^{(4)}(x))^2}{f^2(x)}\left(\int_I t^4K(t)\,dt\right)^2+\frac{\int_I K(t)^2\,dt}{nhf(x)}\xi^2\right)dx+O_p(n^{-c})+O(h^6)\\
&=\frac12\left[\frac{h^8}{576}\left(\int_I t^4K(t)\,dt\right)^2\int f^{\beta-2}(x)f^{(4)}(x)^2\,dx+\frac{1}{nh}\int_I K^2(t)\,dt\int f^{\beta-1}(x)\,dx\;\xi^2\right]+O_p(n^{-c})+O(h^6).
\end{aligned}$$
Taking expectations and using $E\xi^2=1$,
$$ED_\beta(\hat f_n,f)=\frac12\left[\frac{h^8}{576}\left(\int_I t^4K(t)\,dt\right)^2\int f^{\beta-2}(x)f^{(4)}(x)^2\,dx+\frac{1}{nh}\int_I K^2(t)\,dt\int f^{\beta-1}(x)\,dx\right]+O_p(n^{-c})+O(h^6).$$

Proof of Proposition 3.1

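$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac12\left(\frac{x-m}{\sigma}\right)^2},$$
so
$$f^{(4)}(x)=\frac{1}{\sigma^5\sqrt{2\pi}}e^{-\frac12\left(\frac{x-m}{\sigma}\right)^2}\times\left[3-6\left(\frac{x-m}{\sigma}\right)^2+\left(\frac{x-m}{\sigma}\right)^4\right],$$
$$f^{(4)}(x)^2=\frac{1}{\sigma^{10}\,2\pi}e^{-\left(\frac{x-m}{\sigma}\right)^2}\times\left[9-36\left(\frac{x-m}{\sigma}\right)^2+30\left(\frac{x-m}{\sigma}\right)^4+18\left(\frac{x-m}{\sigma}\right)^6+\left(\frac{x-m}{\sigma}\right)^8\right],$$
$$\int f^{\beta-2}(x)f^{(4)}(x)^2\,dx=\frac{1}{\sigma^{\beta+7}\beta(2\pi)^{\frac{\beta-2}{2}}}\times\frac{9\beta^4-36\beta^3+90\beta^2+270\beta+105}{\beta^4},$$
and
$$\int f^{\beta-1}(x)\,dx=\frac{1}{(\beta-1)(2\pi)^{\frac{\beta-2}{2}}}.$$

In that case, the asymptotically optimal bandwidth $h_\beta$ in (5) becomes the normal reference bandwidth:
$$\begin{aligned}
h_\beta=h_{ED_\beta}&=\left(\frac{72R(K)\int_I f(x)^{\beta-1}\,dx}{\mu_4(K)^2\int_I f(x)^{\beta-2}f^{(4)}(x)^2\,dx}\right)^{1/9}n^{-1/9}\\
&=(72R(K))^{1/9}\left(\frac{\dfrac{1}{(\beta-1)(2\pi)^{\frac{\beta-2}{2}}}}{\mu_4(K)^2\times\dfrac{1}{\sigma^{\beta+7}\beta(2\pi)^{\frac{\beta-2}{2}}}}\right)^{1/9}\times\left(\frac{9\beta^4-36\beta^3+90\beta^2+270\beta+105}{\beta^4}\right)^{-1/9}n^{-1/9},
\end{aligned}$$
with $\sigma$ being the standard deviation of $f$.

For the Gaussian kernel, $\mu_4(K)=3$ and $R(K)=(4\pi)^{-1/2}$, so that
$$h_{NR_\beta}=\left(\frac{2\pi\,4\beta^4}{9\beta^4-36\beta^3+90\beta^2+270\beta+105}\right)^{1/9}n^{-1/9}\sigma.$$

In the particular case $\beta=2$,
$$h_{NR_2}=\left(\frac{2\pi\,64}{861}\right)^{1/9}n^{-1/9}\sigma. \tag{A2}$$
The standard deviation $\sigma$ can be estimated by the sample standard deviation $s$ or, for robustness against outliers, by the standardised interquartile range $\mathrm{IQR}/1.34$ (where $1.34\approx\Phi^{-1}(3/4)-\Phi^{-1}(1/4)$), but a better rule of thumb (e.g., Silverman, 1986, pp. 45-47; Härdle, 1991, p. 91) is
$$\hat h_{NR_2}=\left(\frac{2\pi\,64}{861}\right)^{1/9}n^{-1/9}\hat\sigma, \tag{A3}$$
with $\hat\sigma=\min(s,\mathrm{IQR}/1.34)$.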
