Abstract
We propose a kernel function for ordered categorical data that overcomes limitations of existing ordered kernel functions used in the estimation of probability mass functions for multinomial ordered data. Some of these limitations arise from assumptions made about the support of the underlying random variable. Furthermore, many existing ordered kernel functions lack a particularly appealing property, namely the ability to deliver discrete uniform probability estimates for some value of the smoothing parameter. We propose an asymmetric, empirical-support kernel function that adapts to the data at hand and possesses certain desirable features. It suffers no difficulties arising from zero counts caused by gaps in the data, and it delivers the empirical proportions and the discrete uniform probabilities at the lower and upper boundaries of the smoothing parameter, respectively. We propose likelihood and least-squares cross-validation for smoothing parameter selection and study their asymptotic and finite-sample behaviour.
Keywords:
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1 Unordered kernel functions place a binary counting weight on each observation, for example, a weight 1(X_i = x) that equals 1 when X_i = x and 0 when X_i ≠ x, but cannot assess distance (see Aitchison and Aitken Citation1976, p. 419).
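To make the point concrete, the sketch below contrasts the binary counting weight with the Aitchison and Aitken (1976) unordered kernel. The function name and example values are illustrative; note that every mismatching category receives the same weight, so ordering and distance play no role.

```python
import numpy as np

def aitchison_aitken(Xi, x, lam, c):
    """Aitchison-Aitken (1976) unordered kernel for a variable with c categories.

    Weight 1 - lam on a match and lam / (c - 1) on any mismatch: all
    mismatching categories get the same weight, so distance is not assessed.
    """
    Xi = np.asarray(Xi)
    return np.where(Xi == x, 1.0 - lam, lam / (c - 1))

# With lam = 0 this reduces to the binary counting weight 1(X_i = x).
X = np.array([0, 1, 2, 2, 1])
print(aitchison_aitken(X, 2, 0.0, 3))   # [0. 0. 1. 1. 0.]
print(aitchison_aitken(X, 2, 0.3, 3))   # every mismatch weighted 0.15
```

Observations at categories 0 and 1 receive identical weight even though category 1 is "closer" to x = 2, which is precisely the limitation an ordered kernel addresses.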
2 We are grateful to an anonymous referee who pointed out that Kokonendji and Zocchi (Citation2010) proposed a family of symmetric and asymmetric counting kernels that encompass the properties discussed herein: they write ‘the empirical proportions and the discrete uniform distribution at the extreme of the admissible smoothing parameter’.
3 To see this, note that when λ = 0 the kernel places weight 1 on X_i = x and 0 otherwise, delivering the empirical proportions, and when λ = 1 the kernel places equal weight on every point of the empirical support. So, for any x and X_i, the kernel weight is the same whether X_i equals x or any other value of the empirical support, hence the estimator delivers the discrete uniform probabilities when λ = 1.
4 One can show that the cross-validation function contains two leading terms that are related to λ. The first term is a positive deterministic quantity which is minimised at λ = 0, and the second term is a zero-mean random variable which is minimised at the upper extreme value of λ with some constant positive probability δ. Therefore, the cross-validation function is minimised at the upper extreme value of λ with a positive probability δ. In general, it is difficult to determine the exact value of δ because the exact value of δ depends on the unknown probability function p(·). Simulation in Ouyang et al. (Citation2006) shows that the cross-validated λ takes the upper extreme value one with non-trivial probability and takes values between zero and one otherwise. Ouyang et al. (Citation2009) showed similar theoretical and simulation results for the nonparametric regression model with categorical regressors.
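A minimal sketch of least-squares cross-validation for a discrete probability estimator is given below. It uses the illustrative λ-mixture kernel from above (not the paper's kernel) and the standard criterion CV(λ) = Σ_x p̂(x)² − (2/n) Σ_i p̂₋ᵢ(X_i), where p̂₋ᵢ is the leave-one-out estimate; minimising CV over a grid yields the data-driven λ.

```python
import numpy as np

def pmf_hat(X, x, lam, c):
    # lam-mixture kernel estimator (illustrative stand-in kernel).
    return ((1 - lam) * (X == x).astype(float) + lam / c).mean()

def lscv(X, lam, support):
    """Least-squares cross-validation criterion:
    CV(lam) = sum_x phat(x)^2 - (2/n) * sum_i phat_{-i}(X_i)."""
    n, c = len(X), len(support)
    term1 = sum(pmf_hat(X, x, lam, c) ** 2 for x in support)
    loo = 0.0
    for i in range(n):
        X_minus_i = np.delete(X, i)        # leave observation i out
        loo += pmf_hat(X_minus_i, X[i], lam, c)
    return term1 - 2.0 * loo / n

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=50)
grid = np.linspace(0.0, 1.0, 21)
lam_star = min(grid, key=lambda lam: lscv(X, lam, range(4)))
print(lam_star)
```

Repeating this over simulated samples is one way to observe the behaviour discussed in this note, i.e. the minimiser occasionally landing on the upper extreme value.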
5 The Wang and van Ryzin kernel function is l(X_i, x, λ) = 1 − λ when X_i = x and l(X_i, x, λ) = ((1 − λ)/2) λ^|X_i − x| when X_i ≠ x, where λ ∈ [0, 1]. Note, however, that this kernel, like all existing ordered kernel functions of which we are aware, lacks the ability to deliver discrete uniform probabilities for some value of the smoothing parameter. For the normalised version of this kernel, we divide the probability estimate at each support point by the sum of the estimates over the support to ensure that the probabilities indeed sum to one as per Glad, Hjort, and Ushakov (Citation2003).
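The Wang and van Ryzin kernel and the renormalisation step can be sketched as follows (function names are illustrative; the renormalisation simply rescales the estimates to sum to one, in the spirit of Glad, Hjort, and Ushakov):

```python
import numpy as np

def wang_van_ryzin(Xi, x, lam):
    """Wang and van Ryzin ordered kernel: 1 - lam on a match,
    (1 - lam)/2 * lam**|Xi - x| otherwise."""
    Xi = np.asarray(Xi)
    d = np.abs(Xi - x)
    return np.where(d == 0, 1.0 - lam,
                    0.5 * (1.0 - lam) * lam ** d.astype(float))

def pmf_wvr(X, support, lam, normalise=False):
    p = np.array([wang_van_ryzin(X, x, lam).mean() for x in support])
    if normalise:
        p = p / p.sum()   # rescale so the estimates sum to one
    return p

X = np.array([1, 1, 2, 4, 4, 4])
support = np.arange(1, 6)
raw = pmf_wvr(X, support, 0.4)
print(raw.sum())                             # below 1 on bounded support
print(pmf_wvr(X, support, 0.4, True).sum())  # exactly 1 after rescaling
```

On a bounded support the raw kernel weights leak mass beyond the boundaries, which is why the unnormalised estimates sum to less than one and the rescaling is needed.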
6 We are grateful to an anonymous referee who suggested incorporating these methods as comparators and who directed us to the R package Ake (Wansouwé et al. Citation2015) for their implementation. Note that the authors of this package consider and implement only least-squares cross-validation for smoothing parameter selection, so these comparators appear alongside the least-squares cross-validation results but not the likelihood cross-validation results.
7 Typically, a P-value of 0 is simply recorded as, say, '< 2.2e-16' when using the R statistical platform. However, a P-value of 1 would be unusual in a two-sided test setting (the two means would have to be identical) but is commonplace in a one-sided setting when the two means differ in the direction of the null, i.e. one is much larger than the other.
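The contrast described in this note can be reproduced with a hand-rolled two-sample z-test (a sketch; the note does not specify which test was used). When one sample's mean is far above the other's, the two-sided P-value is numerically zero while the one-sided P-value in the direction of the null is numerically one.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(1)
a = rng.normal(10.0, 1.0, 200)   # mean far above that of b
b = rng.normal(0.0, 1.0, 200)

# Two-sample z statistic for H0: mean(a) = mean(b).
z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                    + b.var(ddof=1) / len(b))

p_less = norm_cdf(z)                  # one-sided H1: mean(a) < mean(b)
p_two = 2 * (1 - norm_cdf(abs(z)))    # two-sided alternative
print(p_less, p_two)
```

Because the data differ in the direction of the one-sided null, p_less saturates at one, while p_two underflows to zero and would be reported as a tiny bound.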
8 The median SSE (× 100) is 0.009 for the normalised Wang and van Ryzin kernel versus 0 for the proposed kernel, and the median SSEs are identical for likelihood and least-squares cross-validation.
9 The negative binomial provided the best fit overall, but its tail is too thin, i.e. as the number of successful patent applications increases, the probability estimates approach zero too quickly, yet a substantial amount of empirical probability mass remains in the tail.
10 Using a Taylor expansion, …, where …. Note that … because ….