Abstract
The commonly used survey technique of clustering introduces dependence into sample data. Such data is frequently used in economic analysis, though the dependence induced by the sample structure of the data is often ignored. In this paper, the effect of clustering on the non-parametric, kernel estimate of the density, f(x), is examined. The window width commonly used for density estimation for the case of i.i.d. data is shown to no longer be optimal. A new optimal bandwidth using a higher-order kernel is proposed and is shown to give a smaller integrated mean squared error than two window widths which are widely used for the case of i.i.d. data. Several illustrations from simulation are provided.
ACKNOWLEDGMENTS
I would like to thank Aman Ullah for the initial inspiration for this paper. I also appreciated the comments of two anonymous referees, which helped to make the paper clearer.
Notes
Clustering is frequently found in economic data gathered from surveys. One common example is the income and expenditure survey, where first a sample of villages is chosen and then, within each village, households are randomly selected. Households within the same village (or cluster) can be assumed to face similar conditions; for example, we expect heating fuel costs to be correlated for households in the same area. In this paper, I assume that the data has already been gathered and that the analyst has information about the structure of the data. Kish (1965) and Thompson (1992) provide details of how clustered surveys are conducted.
The intra-cluster correlation coefficient, ρ, can be surprisingly large in cross-sectional data. Deaton (1997) provides examples using World Bank data where intra-cluster correlation coefficients range from 0.2 to 0.5.
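In general, in the i.i.d. case, for an $r$-th order kernel, $h^{**} = \left( \frac{\lambda_2 (r!)^2}{2r\,\lambda_{1r}} \right)^{1/(2r+1)} n^{-1/(2r+1)}$, where $\lambda_{1r} = \mu_r^2 \int \left( f^{(r)}(x) \right)^2 dx$ and $\lambda_2 = \int_\psi K^2(\psi)\, d\psi$. $\lambda_{1r} = 105/32$ for this density.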
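As a rough illustration of the i.i.d. bandwidth formula in the note above, the following Python sketch evaluates h** from given values of λ2, λ1r, r and n. The function name and the example inputs are hypothetical and are not taken from the paper. The check assumes a second-order Gaussian kernel and a standard normal density, and reads μ_r as the r-th moment of the kernel (so μ_2 = 1 here). Under those assumptions the formula should reduce to the familiar normal-reference constant (4/3)^(1/5) ≈ 1.06.

```python
import math


def iid_optimal_bandwidth(lambda_2, lambda_1r, r, n):
    """Evaluate h** = (lambda_2 (r!)^2 / (2 r lambda_1r))^(1/(2r+1)) * n^(-1/(2r+1)).

    lambda_2  : integral of K(psi)^2 over psi
    lambda_1r : mu_r^2 times the integral of the squared r-th derivative of f
    r         : kernel order (positive integer)
    n         : sample size (positive integer)
    """
    if r < 1 or n < 1 or lambda_2 <= 0 or lambda_1r <= 0:
        raise ValueError("r and n must be >= 1; lambda_2 and lambda_1r must be > 0")
    exponent = 1.0 / (2 * r + 1)
    constant = lambda_2 * math.factorial(r) ** 2 / (2 * r * lambda_1r)
    return constant ** exponent * n ** (-exponent)


# Illustrative inputs only (assumed, not the density used in the paper):
# Gaussian kernel, r = 2, standard normal f.
lambda_2 = 1 / (2 * math.sqrt(math.pi))    # integral of K^2 for the Gaussian kernel
lambda_12 = 3 / (8 * math.sqrt(math.pi))   # mu_2^2 * integral of (f'')^2, with mu_2 = 1
h = iid_optimal_bandwidth(lambda_2, lambda_12, r=2, n=1000)
print(round(h, 3))  # about 0.266, i.e. (4/3)**0.2 * 1000**(-0.2)
```

This covers only the i.i.d. benchmark; it does not implement the clustered-data bandwidth proposed in the paper.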