
A multi-step kernel–based regression estimator that adapts to error distributions of unknown form

Pages 6211-6230 | Received 16 Aug 2018, Accepted 06 Mar 2020, Published online: 19 Mar 2020

Abstract

For linear regression models, we propose and study a multi-step kernel density-based estimator that is adaptive to unknown error distributions. We establish asymptotic normality and almost sure convergence. An efficient EM algorithm is provided to implement the proposed estimator. We also compare its finite sample performance with five other adaptive estimators in an extensive Monte Carlo study of eight error distributions. Our method generally attains high mean-square-error efficiency. An empirical example illustrates the gain in efficiency of the new adaptive method when making statistical inference about the slope parameters in three linear regressions.


1. Introduction

In parametric linear regression analysis one often imposes the model assumptions that the errors are independent and normally distributed. The normality assumption is convenient, as it is well known that the maximum likelihood estimator (MLE) of the unknown parameter vector simplifies to the least squares estimator (LSE). Naturally, an invalid assumption on the error distribution F comes at a cost; the MLE is in general neither consistent nor asymptotically efficient under model misspecification. Moreover, in practice, it can lead to inaccurate or invalid statistical inference; see Sec. 5. This has motivated the search for alternative, (semi)parametric, estimators that retain asymptotic efficiency when F is unknown.

One approach is adaptive estimation, which “adapts” to an unknown, or incorrectly specified, distribution F by maximizing an estimated likelihood function based on an initial estimate of the error distribution; see Bickel (1982), Linton and Xiao (2007), Yuan and De Gooijer (2007), and the references therein. The adaptive idea has been studied for (non)linear regression models using non- and semiparametric methods to estimate F or its probability density function (pdf) f.

There are various alternative adaptive estimation methods for non- and semiparametric regression problems with errors of unknown distributional form. For instance, the empirical likelihood method of Owen (2001) has been used to obtain adaptive confidence limits and likelihood ratio test statistics for regression parameters; see also Owen (1988, 1990, 1991), Qin and Lawless (1994), and Kitamura (1997, 2007), among many others. Another example is the multivariate adaptive regression splines (MARS) method (Friedman 1991), a global adaptive nonparametric method for fitting nonlinear regression models. PolyMARS (Kooperberg, Bose, and Stone 1997) is an extension of MARS that allows for multiple polychotomous regression. Time series MARS, or TSMARS, can be used for nonlinear time series analysis and forecasting; see, e.g., De Gooijer (2017). More recently, Wang and Yao (2012) proposed a minimum average variance estimation method, a dimension reduction technique, which can be adaptive to different error distributions. Also, Chen, Wang, and Yao (2015) developed an adaptive estimation method for varying coefficient models.

Recently, Yao and Zhao (2013) proposed an adaptive kernel density-based estimator for classical linear regression models, called KDRE. In particular, given an estimate of the “true” parameter vector, f is modeled by a kernel density-based estimator of the regression residuals. In the second step, parameter estimates are obtained by maximizing a local, kernel-based, log-likelihood function that treats the first-step estimated density function as the true one. Through a simulation study, Yao and Zhao (2013) show that the resulting KDRE is asymptotically equivalent to the oracle estimator in which the true error pdf is known.

Now, it is well known that each LSE-based residual is the sum of two components: one is the true error, the other is a linear function of the entire vector of errors. Since, in finite samples, the second term will tend to be normally distributed (as long as the errors have finite variance), the residuals for small samples will appear more normal than would the unobserved values of the errors. This tendency is called supernormality; see Bassett and Koenker (1982), Bloomfield (1974), and White and MacDonald (1980). Hence, for the two-step KDRE method, the finite-sample properties of the estimator are likely to depend strongly on the empirical distribution of the residuals, which resembles normality more closely than the distribution of the errors itself would, if the latter were in fact available. This makes the KDRE method nonoptimal.

In this paper, we remedy this deficiency of the KDRE method by further iteration. That is, we maximize over different kernel-based likelihood functions. These different likelihood functions follow iteratively from parameter estimates that result from maximizing the (previous) likelihood function. The algorithm iterates until the parameter estimates (and, hence, the estimated likelihood function) reach a fixed point. The new estimator is called multi-step KDRE (M-KDRE). In finite samples, one may expect M-KDRE to yield better estimation results; see Robinson (1988). Indeed, in an extensive Monte Carlo study in which we compare the finite-sample performance of M-KDRE with that of five other parametric and (semi)nonparametric adaptive estimators (AEs), we find that iterating over the likelihood functions can strongly increase finite-sample performance. In fact, we find that M-KDRE outperforms all other considered AEs. In an empirical example based on Andrabi, Das, and Khwaja (2017), we show that LSE estimates may be misleading. M-KDRE outperforms the other considered estimators in terms of out-of-sample prediction performance. Furthermore, M-KDRE provides strong evidence that the treatment effect described in Andrabi, Das, and Khwaja (2017) does not exist.

Theoretically, we establish strong (almost sure) convergence to the true parameter vector under relatively weak conditions. We also show that the M-KDRE method is adaptive, i.e., asymptotically normal and efficient. Furthermore, we propose an EM type algorithm that makes its computation convenient.

The rest of the paper is organized as follows. Section 2 introduces the new adaptive M-KDRE method and contains our theoretical results. An efficient EM type algorithm to implement the proposed estimator is also given in this section. In Section 3, we describe and explain five alternative adaptive estimation methods. Section 4 contains results of a simulation-based study of the finite sample properties of the M-KDRE method and compares it with the adaptive estimation methods discussed in Section 3. In Section 5, we present the empirical application of our method to the educational data set used in Andrabi, Das, and Khwaja (2017). Section 6 gives a summary and some concluding remarks. Proofs are presented in Appendix A.

2. Multi-step kernel density-based regression estimation

2.1. Model and method

Consider the general linear regression problem with observations
(1) yi = xi′β0 + εi, (i=1,…,n),
where yi is a univariate response variable, xi = (xi,1,…,xi,p)′ is a p-dimensional (p < n) vector of covariates, and β0 ∈ B ⊂ Rp is an unknown parameter vector including an intercept. Here (yi,xi,εi) are independent and identically distributed (i.i.d.) realizations from a common random source (y,x,ε). Moreover, the εi's are assumed to have some common unknown pdf f(ε), with E[εi|xi,β0] = 0 and E[|εi| | xi,β0] < ∞ (i=1,…,n). Model (1) is semiparametric, with β and f(·) its parametric and non-parametric part, respectively.

Let β̂LSE be the LSE of β0 in (1), which is a natural estimator with which to start the M-KDRE method. Also, let β̂(u) denote the estimator of β0 at iteration step u = 0, 1, …. Then, under the conditions introduced above, the proposed M-KDRE can be obtained as follows.

  1. Initial step: At u = 0, start with β̂(0) = β̂LSE. Compute the residuals ε̂i(0) = yi − xi′β̂(0) (i=1,…,n).

  2. Compute the Rosenblatt-Parzen kernel-based estimator f̂n(u)(·) of f(·). That is,
(2) f̂n(u)(x) = (nhn)^−1 Σi=1n K((x − ε̂i(u−1))/hn),
where hn > 0 is the bandwidth.

  3. Let c(β) = n^−1 Σi=1n (yi − xi′β). Then, using (2), compute
(3) β̂(u) = argmaxβ∈B Q̂u(β)  s.t.  c(β(u)) = 0,
where Q̂u(β) is the local log-likelihood function
(4) Q̂u(β) = Σi=1n ln f̂n(u)(yi − xi′β).

  4. Repeat steps (ii)–(iii) until convergence, at some iteration step (u+1) (u=1,2,…).
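To make the recursion concrete, the following minimal Python sketch implements steps (i)–(iv), assuming a Gaussian kernel, a user-supplied bandwidth hn, and a generic constrained optimizer (SLSQP) for (3); the helper names kde and m_kdre, and the small numerical guard inside the logarithm, are ours and not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def kde(x, resid, h):
    """Rosenblatt-Parzen estimate of f at the points x, built from residuals (Eq. (2))."""
    u = (x[:, None] - resid[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(resid) * h * np.sqrt(2 * np.pi))

def m_kdre(X, y, h, max_iter=50, tol=1e-6):
    """Multi-step KDRE: iterate constrained kernel-likelihood maximization to a fixed point."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                # step (i): LSE starting value
    for _ in range(max_iter):
        resid = y - X @ beta                                   # residuals of the current fit
        def neg_loglik(b, r=resid):                            # step (iii): minus Eq. (4)
            e = y - X @ b
            return -np.sum(np.log(kde(e, r, h) + 1e-300))
        cons = {"type": "eq", "fun": lambda b: np.mean(y - X @ b)}   # constraint c(beta) = 0
        new_beta = minimize(neg_loglik, beta, method="SLSQP", constraints=[cons]).x
        if np.max(np.abs(new_beta - beta)) < tol:              # step (iv): stop at a fixed point
            return new_beta
        beta = new_beta
    return beta
```

In this sketch the kernel density of step (ii) is rebuilt from the latest residuals at every pass, which is precisely the feature that distinguishes M-KDRE from the two-step KDRE.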

It is worth mentioning that Yao and Zhao (2013) only iterate over the likelihood function estimated from the initial LSE residuals, while in the above scheme we allow for stochastic fluctuations in f̂n(u)(·), which is re-estimated at every step. As we show, this generalization increases the mean-square-error efficiency of the parameter estimates.

2.2. Asymptotic properties

In this section we give the asymptotic properties of the uth M-KDRE, denoted hereafter by β̂(u). For convenience, when we emphasize the dependence on β, we use f(yi|β) to denote f(yi − xi′β). Technical details, lemmas, and proofs are given in Appendix A.

Theorem 2.1.

(Almost sure (a.s.) convergence) Under the assumptions of Lemma A.1 and

  1. f(yi|β̂) is non-parametrically identifiable,

  2. β ∈ B, where B ⊂ Rp is compact,

  3. f(yi|β) is continuous for each β ∈ B,

  4. E[supβ∈B |ln f(yi|β)|] < ∞,

    then

(5) β̂(u) →a.s. β0.

The following notation is used throughout the next part of the paper. Let Ln(yi|β) = ln f(yi|β) = ln f(yi − xi′β). Then dβ(β) = ∂Ln(yi|β)/∂β = −f′(yi|β) f^−1(yi|β) xi and dββ(β) = ∂²Ln(yi|β)/∂β∂β′ = ((f″(yi|β)f(yi|β) − f′2(yi|β))/f²(yi|β)) xixi′, where f′(u) = ∂f(u)/∂u, f″(u) = ∂f′(u)/∂u, and f′2(u) = f′(u)f′(u). Given these notations, the Fisher information matrix (FIM) for the unconstrained linear regression problem, evaluated at β0, can be defined as
(6) Iββ = E(dβ(β0) dβ(β0)′) = −E(dββ(β0)),
where the second equality holds under mild regularity conditions. If Iββ is nonsingular, then Iββ^−1 is the unconstrained Cramér-Rao bound (CRB) for the mean-square error (MSE) covariance matrix of any unbiased estimator of β0.

Observe that incorporation of the one-dimensional linear constraint c(β(u)) = 0 in step (iii) of the M-KDRE method leads to a p-dimensional parameter vector that has only p − 1 independent components. As a consequence, the FIM is singular and the CRB may not be an informative lower bound on the MSE matrix of the resulting estimator, so the asymptotic distribution of β̂(u) degenerates. For deterministic linear parameter constraints, Stoica and Ng (1998) formulated a constrained CRB (CCRB) that explicitly combines the active constraint information with the original FIM, singular or nonsingular. Their general setting is that of q (q < p) continuously differentiable constraints g(β) = 0. Assuming β is regular in the active set of linear constraints, the q × p gradient matrix G = ∂g(β)/∂β′ has full row rank q, with G independent of β. Hence, there exists a matrix U ∈ Rp×(p−q) whose columns form an orthonormal basis of the null space of the rows of G, i.e., such that
(7) GU = 0  and  U′U = Ip−q,
where Ip−q denotes the identity matrix of size p−q. For nonlinear deterministic constraints, G and U are functions of β; see, e.g., Moore, Sadler, and Kozick (2008).
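As a small illustration of (7), an orthonormal basis U of the null space of the gradient matrix G can be computed numerically; the sketch below does this for the single mean-residual constraint c(β) = n^−1 Σi (yi − xi′β) = 0, whose gradient row is −n^−1 Σi xi′. The use of scipy.linalg.null_space is our choice and is not prescribed by Stoica and Ng (1998).

```python
import numpy as np
from scipy.linalg import null_space

def constraint_null_basis(X):
    """U with G U = 0 and U'U = I_{p-1}, for the constraint c(beta) = mean(y - X beta) = 0."""
    G = -X.mean(axis=0, keepdims=True)   # 1 x p gradient row of c(beta) w.r.t. beta
    U = null_space(G)                    # p x (p-1) orthonormal basis of the null space of G
    return U
```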

Recently, Ren et al. (2015) extended the deterministic CCRB of Stoica and Ng (1998) to a hybrid parameter vector with both nonrandom and random parameter constraints. In the case of the M-KDRE method the constraint is not deterministic, depending on the random variables xi. Then the matrix U in (7) depends on the estimate β̂ of β0, say U(β̂). The resulting CCRB, as a special case of the hybrid CCRB of Ren et al. (2015), can be stated as follows.

Theorem 2.2.

Let β̂ be an unbiased estimate of β0 satisfying the active functional constraints g(β) = 0, and let U = U(β̂) be as defined in (7). Then, under certain regularity conditions, if U′IββU is nonsingular, E((β̂ − β0)(β̂ − β0)′) ⪰ U(U′IββU)^−1U′, where the equality is achieved if and only if β̂ − β0 = U(U′IββU)^−1U′ dβ(β), in the mean-square sense.

Remark 1.

Note that rather than requiring a nonsingular FIM Iββ, the alternative condition is that U′IββU is nonsingular. Thus, the unconstrained FIM may be singular, or, equivalently, the unconstrained model unidentifiable, but the constrained model must be identifiable, at least locally. Ren et al. (2015) show that the difference between the standard CRB-based covariance matrix and the CCRB-based covariance matrix is a positive semi-definite matrix. This result is expected since the presence of parameter constraints can be considered as additional information to improve the performance of the estimator under study.

Theorem 2.3.

(Normality and efficiency) If model (1) holds, {εi}i=1n are i.i.d. with unknown density f(x), where f is a continuous function symmetric around zero with bounded continuous derivatives that satisfy

  1. ∫ x f(x)dx = 0,

  2. E[(∂ln f(x)/∂x)² + |∂²ln f(x)/∂x²| + |∂³ln f(x)/∂x³|] < ∞,

    {xi}i=1n satisfies

  3. ∃ 0 < M < ∞ such that ‖xi‖ < M,

    K(·) is a symmetric and four times continuously differentiable function such that

  4. ∃ 0 < ρ < ∞ such that K(x) = 0 for all x with |x| ≥ ρ

    holds, and

  5. when n → ∞, nhn⁴ → ∞ and nhn⁸ → 0,

then β̂(u) (u=1,2,…) is asymptotically normal and efficient. That is, as n → ∞,
(8) √n(β̂(u) − β0) →d N(0, U(U′IββU)^−1U′).

Remark 2.

All conditions are practical and easy to satisfy. Condition (ii) is used to guarantee the adaptiveness of β̂(u).

2.3. EM algorithm

In this section, we propose an EM type algorithm by noticing that (4) has a mixture log-likelihood form with an imposed constraint. Specifically, given an initial parameter estimate β̂(0) and the set of initial estimates {ε̂i(0)}i=1n, the (k+1)th iteration of the EM algorithm to maximize (4) (the uth likelihood function) proceeds as follows.

E-step: Calculate the classification probabilities,
(9) pij,(k+1)(u) = exp{−(yi − xi′β̂(k)(u) − ε̂j(u−1))²/(2hn²)} / Σℓ=1n exp{−(yi − xi′β̂(k)(u) − ε̂ℓ(u−1))²/(2hn²)}.

M-step: Update β̂(k)(u) with
(10) β̂(k+1)(u) = β̂LSE − (Σi=1n xixi′)^−1 Σi=1n (xi Σj=1n pij,(k+1)(u) ε̂j(u−1)) + (1/n)(Σi=1n xixi′)^−1 (Σi=1n xi)(Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1)),
where (10) follows from using a Gaussian kernel for density estimation. The choice of the kernel is not critical; any symmetric kernel can be used for our method. However, the Gaussian second-order kernel provides an explicit solution of the EM algorithm.
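For illustration, one E/M pass of (9)–(10) can be coded directly; in the sketch below the row-wise rescaling of the exponentials is a numerical safeguard we add, and the function name em_step is ours.

```python
import numpy as np

def em_step(X, y, beta_k, resid_prev, h, beta_lse):
    """One EM pass for the u-th kernel likelihood: E-step of Eq. (9), M-step of Eq. (10)."""
    # E-step: classification probabilities over the previous-step residuals
    d = y - X @ beta_k
    z = -0.5 * ((d[:, None] - resid_prev[None, :]) / h) ** 2
    z -= z.max(axis=1, keepdims=True)          # rescale rows before exponentiating (stability)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)          # each row of p_ij sums to one

    # M-step: closed-form update under the Gaussian kernel and the constraint c(beta) = 0
    w = p @ resid_prev                         # w_i = sum_j p_ij * previous residual j
    XtX_inv = np.linalg.inv(X.T @ X)
    return beta_lse - XtX_inv @ (X.T @ w) + XtX_inv @ X.sum(axis=0) * w.sum() / len(y)
```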

Theorem 2.4.

Under the linearity constraint c(β) = 0, each iteration of the above E- and M-steps will monotonically increase the local log-likelihood (4), i.e., Q̂n(β̂(k+1)(u)) ≥ Q̂n(β̂(k)(u)), for all k.

Remark 3.

For the EM type algorithm, we use a full-kernel method rather than a leave-one-out method as in Yao and Zhao (2013). This approach has the following advantage. If a certain residual ε̂j(u−1) is extremely large, pij,(k+1)(u) will be close to zero for all i ≠ j and close to one for i = j. This implies that the effect of such a residual is limited to the observation for which the next iterate of β is likely to lead to a residual of similar magnitude. Hence, the effect of a large residual on the maximization of (4) is small. In the leave-one-out method, the effect of the residual may be considerably larger, as pij,(k+1)(u) is likely to take a substantial value for several observations.

Remark 4.

The EM type algorithm is considered converged when max|β̂(k)(u) − β̂(k+1)(u)| is smaller than a threshold value, where max|A| denotes the largest absolute element of A. In the uth step of the M-KDRE method, the EM algorithm is initialized with the estimate from the (u−1)th step. That is, β̂(0)(u) = β̂(u−1).

3. Some alternative adaptive estimation methods

3.1. SBS method

Stone (1975), Bickel (1982), and Schick (1993) (henceforth SBS) introduce a two-step AE. Let β˜ be a √n-consistent estimator of β0. Then, an infeasible two-step estimator can be defined as β̂ = β˜ + n^−1 Iββ^−1(β˜) dβ(β˜), where Iββ(β˜) is Fisher's information matrix evaluated at β˜ and dβ(β˜) is the corresponding p×1 score vector. The infeasibility of β̂ follows from the fact that f is unknown, and hence Iββ and dβ are unknown. The approach of SBS is to replace dβ(β˜) by d̂β(β˜) = −Σi=1n f̂n′(yi|β˜) f̂n^−1(yi|β˜) xi, where f̂n(x) is defined in a similar way as in (2) and f̂n′(x) is its derivative with respect to x. Similarly, Iββ^−1(β˜) is replaced with n²[Σi=1n xixi′ (f̂n′(yi|β˜) f̂n^−1(yi|β˜))²]^−1.

The conditions under which the two-step AE approach can be shown to be asymptotically efficient have been researched extensively. Most importantly, the kernel estimator of the score function must be (i) i.i.d., and (ii) independent of xi. These conditions are restrictive and not easy to verify in practice; see, e.g., Yuan and De Gooijer (2007). Bickel (1982) solved the i.i.d. problem by splitting the sample in two: one sub-sample to estimate the score and another to solve for β. However, Manski (1984) finds that the estimator works much better when the sample is not split, i.e., if the estimated score and β˜ are both computed using the entire sample. If (i) and (ii) are satisfied, a sufficient condition for adaptiveness is that (iii) E[(f′(yi|β) f^−1(yi|β) − f̂n′(yi|β˜) f̂n^−1(yi|β˜))²] → 0 as n → ∞.

Since f̂n appears in the denominator of d̂β, unstable estimates may follow for near-zero values of f̂n(·). Hence, Bickel (1982) suggests to trim the estimator of the kernel score as follows:
f̂n′(yi|β˜)/f̂n(yi|β˜) = { f̂n′(yi|β˜)/f̂n(yi|β˜), if |yi − xi′β˜| ≤ t1, f̂n(yi|β˜) > t2, and |f̂n′(yi|β˜)/f̂n(yi|β˜)| < t3;  0, otherwise. }

This trimming mechanism ensures that near-zero density values do not have an unreasonably large influence on the estimate. If t1 → ∞, t2 → 0, t3 → ∞, hn → 0, t1/(nhn³) → 0, and hn t1 → 0 as n → ∞, then condition (iii) is satisfied. Hence, adaptiveness is established under the proper trimming parameters and conditions (i) and (ii).

Naturally, the growth rates of the trimming parameters are of little use to a practitioner, and as such the choice of the trimming parameters is a practical disadvantage. Hsieh and Manski (1987) reduce the problem to selecting a one-dimensional parameter t by suggesting the following relation between the trimming parameters: t1 = t, t2 = exp(−t²/2), t3 = t. These authors consider t = 3, 4, and 8. For a sample size of 50, t = 8 works best in almost all cases under study.
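A sketch of the resulting SBS update with a Gaussian kernel and the Hsieh–Manski trimming rule is given below. It uses the full sample both for the kernel score and for the update (no sample splitting, following Manski's finding above), and it folds the explicit n-normalizations into the estimated information matrix; the helper names are ours.

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def kde_and_deriv(e, resid, h):
    """Gaussian-kernel estimates of f(e) and f'(e) built from the residuals."""
    u = (e[:, None] - resid[None, :]) / h
    k = np.exp(-0.5 * u**2) / SQRT2PI
    f = k.sum(axis=1) / (len(resid) * h)
    fp = (-u * k).sum(axis=1) / (len(resid) * h**2)
    return f, fp

def sbs_onestep(X, y, beta_tilde, h, t=8.0):
    """Two-step SBS estimator: initial estimate plus an estimated efficient-score step."""
    e = y - X @ beta_tilde
    f, fp = kde_and_deriv(e, e, h)
    score = -fp / f                                        # estimated -f'/f at each residual
    t1, t2, t3 = t, np.exp(-t**2 / 2.0), t                 # Hsieh-Manski trimming parameters
    keep = (np.abs(e) <= t1) & (f > t2) & (np.abs(score) < t3)
    score = np.where(keep, score, 0.0)                     # trimmed kernel score
    info = (X * (score**2)[:, None]).T @ X                 # sum_i x_i x_i' (f'/f)^2
    return beta_tilde + np.linalg.solve(info, X.T @ score)
```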

3.2. LGMM and LGMMS methods

Newey (1988) describes a two-step AE that avoids kernel estimation. His approach is based on moment conditions that can be derived from certain assumptions on the error distribution. Two situations are analyzed. First, the case where the error terms are i.i.d. and independent of xi. This model implies the moment condition that any function of the errors is uncorrelated with any function of the regressors. Second, the case where the distribution of εi is symmetric (S) around zero conditional on xi. The assumption that the errors are symmetrically distributed around zero yields the moment condition that any odd function of the errors is uncorrelated with any function of the regressors. Hence, in both situations we can exploit moment restrictions. In particular, in the first case, we refer to the linearized generalized method of moments estimator as LGMM. In the second case, we use the short-hand notation LGMMS. For LGMM, natural moment conditions arise from the fact that E[xi(εi^j − E(εi^j))] = 0 for j=1,2,…. However, Newey (1988) finds that these high-order “raw” moments, mj(εi) = εi^j, are sensitive to a fat-tailed error distribution.

Estimates that are more robust against fat tails can be obtained using the “transformed” powers, with mj(εi) = (εi/(1+|εi|))^j, or the “weighted” powers, with mj(εi) = exp(εi/2) εi^j. Similarly, for LGMMS we may use E[xi εi^(2j−1)] = 0 (j=1,2,…). As for LGMM, performance may be improved by using the odd powers of the transformed method instead. However, for technical reasons the weighted powers cannot be used for LGMMS; see Newey (1988). In general, both for LGMM and LGMMS, we use the moment conditions E[xi(mj(εi) − μj)] = 0 (j=1,2,…), where μj = E[mj(εi)].

To define the LGMM(S) estimator, we introduce the following notation for some fixed value J = J(n) of j:
(11) ζi = (m1(εi) − μ1, …, mJ(εi) − μJ)′,  w = E[(m1,ε(εi), …, mJ,ε(εi))′],  Vζζ = Cov(ζi),
where mj,ε(·) = ∂mj(·)/∂ε. Let {ε̂i}i=1n denote the residuals corresponding to the initial estimate β̂; then the quantities in (11) can be consistently estimated by their corresponding sample statistics, i.e., ζ̂i = (m1(ε̂i) − μ̂1, …, mJ(ε̂i) − μ̂J)′, ŵ = (n^−1 Σi m1,ε(ε̂i), …, n^−1 Σi mJ,ε(ε̂i))′, and V̂ζζ = n^−1 Σi ζ̂i ζ̂i′.

The LGMM(S) estimator is given by
(12) β̂LGMM(S) = β̂ + [(ŵ ⊗ X′X)′ (V̂ζζ^−1 ⊗ [X′X]^−1) (ŵ ⊗ X′X)]^−1 × (ŵ ⊗ X′X)′ (V̂ζζ^−1 ⊗ [X′X]^−1) (IJ ⊗ X′) vec(ζ̂),
where ζ̂ is the n × J matrix (ζ̂1,…,ζ̂n)′, IJ is the J × J identity matrix, and X is the n × p regressor matrix whose first column is an n×1 vector of ones. Under certain assumptions, Newey (1988) proves asymptotic normality of the LGMM and LGMMS estimators. In particular, it should hold that J → ∞ and J ln J/ln n → 0 as n → ∞. Asymptotic efficiency is obtained only for LGMMS, not for LGMM. By means of simulation, Newey (1988) finds for LGMM that J = 3 performs best for sample sizes between n = 50 and n = 200. However, the MSE efficiency of the estimator as a function of J flattens out as n increases. Also, the transformed method is in general preferred over the weighted method.
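The sketch below codes (12) for the transformed powers with J = 3. The Kronecker-product layout reflects our reading of (12), and the derivative of the transformed power is worked out by hand, so it should be read as an illustration rather than as Newey's implementation.

```python
import numpy as np

def transformed_moments(e, J=3):
    """'Transformed' powers m_j(e) = (e/(1+|e|))**j, j = 1,...,J, as an n x J matrix."""
    t = e / (1.0 + np.abs(e))
    return np.column_stack([t**j for j in range(1, J + 1)])

def transformed_moment_derivs(e, J=3):
    """d m_j / d e, using d/de [e/(1+|e|)] = 1/(1+|e|)**2."""
    t = e / (1.0 + np.abs(e))
    dt = 1.0 / (1.0 + np.abs(e)) ** 2
    return np.column_stack([j * t**(j - 1) * dt for j in range(1, J + 1)])

def lgmm(X, y, beta_init, J=3):
    """Linearized GMM update of Eq. (12) with transformed powers (a sketch)."""
    n, _ = X.shape
    e = y - X @ beta_init
    M = transformed_moments(e, J)
    Z = M - M.mean(axis=0)                              # zeta_hat_i, an n x J matrix
    w = transformed_moment_derivs(e, J).mean(axis=0)    # w_hat, a J-vector
    V = Z.T @ Z / n                                     # V_hat_zetazeta, J x J
    XtX = X.T @ X
    A = np.kron(w[:, None], XtX)                        # w_hat (x) X'X, of size (J*p) x p
    W = np.kron(np.linalg.inv(V), np.linalg.inv(XtX))   # V^-1 (x) (X'X)^-1
    b = (X.T @ Z).ravel(order="F")                      # (I_J (x) X') vec(zeta_hat)
    AW = A.T @ W
    return beta_init + np.linalg.solve(AW @ A, AW @ b)
```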

3.3. KDRE method

The KDRE method of Yao and Zhao (2013) can be viewed as the unconstrained two-step version of M-KDRE. That is, it follows from unconstrained maximization of the kernel likelihood function that is estimated on the basis of the residuals corresponding to an initial estimate. Under conditions (i)–(v) of Theorem 2.3 these authors prove that the KDRE method is adaptive. For technical reasons, this property is proven for a trimmed version. The untrimmed maximizer of the kernel-based likelihood is the solution to n^−1 Σi=1n {f̂n′(yi|β)/f̂n(yi|β)} xi = 0. The trimmed version is then defined as the solution to n^−1 Σi=1n {f̂n′(yi|β)/f̂n(yi|β)} xi Gb(f̂n(yi|β)) = 0.

Here,
(13) Gb(x) = 0 if x < b;  ∫b^x gb(z)dz if b ≤ x ≤ 2b;  1 if x > 2b,
where gb(·) is a four times continuously differentiable function with support on [b, 2b], and b → 0 as n → ∞. This trimming function is introduced by Linton and Xiao (2007), who suggest the use of the beta function. For the purpose of KDRE, the trimming parameter is only used to simplify the proof and is not part of the actual implementation of the method.

3.4. YDG method

Yuan and De Gooijer (2007) (hereafter YDG) propose another estimator of β0 based on estimating the error density by means of a kernel. The method is a one-step approach and as such does not require an initial estimate. The proposed estimator is given by
(14) β̂ = argmaxβ∈B Σi=1n ln [ ((n−1)hn)^−1 Σj≠i K((r(yi − xi′β) − r(yj − xj′β))/hn) ],
where r(z) = 10 × e^z/(1+e^z). The nonlinear function r(·) is introduced to avoid cancelation of the intercept coefficient in xj′β − xi′β. However, as Yao and Zhao (2013) note, this comes with an efficiency loss; r(z) = z is efficient in the sense that, even though the intercept is canceled out, the slope coefficients are adaptively estimated. They suggest the use of the following estimator:
(15) β̂YDG* = argmaxβ*∈B Σi=1n ln [ ((n−1)hn)^−1 Σj≠i K(((yi − xi*′β*) − (yj − xj*′β*))/hn) ],
where xi* denotes xi without the intercept element, so that xi = (1, xi*′)′, and β̂YDG = (α̂YDG, β̂YDG*′)′ with α̂YDG = n^−1 Σi (yi − xi*′β̂YDG*). The intercept estimate α̂YDG, however, is not in general asymptotically efficient.
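A sketch of the modified estimator (15) with a Gaussian kernel follows; the derivative-free Nelder–Mead maximizer and the least-squares starting value are our choices, made only to keep the example short.

```python
import numpy as np
from scipy.optimize import minimize

def ydg(Xs, y, h):
    """YDG-type estimator: slopes from the pairwise-difference kernel likelihood of Eq. (15),
    intercept from the mean residual. Xs holds the covariates without the column of ones."""
    n = len(y)
    def neg_loglik(b):
        e = y - Xs @ b
        d = (e[:, None] - e[None, :]) / h                  # all pairwise residual differences
        k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)       # Gaussian kernel
        np.fill_diagonal(k, 0.0)                           # leave-one-out: j != i
        dens = k.sum(axis=1) / ((n - 1) * h)
        return -np.sum(np.log(dens + 1e-300))
    b0 = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)[0]  # crude starting value
    b_hat = minimize(neg_loglik, b0, method="Nelder-Mead").x
    alpha_hat = np.mean(y - Xs @ b_hat)                    # intercept estimate
    return alpha_hat, b_hat
```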

4. Simulation study

4.1. Setup

In order to assess the finite sample practical performance of all reviewed AEs, we conduct a Monte Carlo study. We generate i.i.d. data {(xi,yi)}i=1n from the regression model
(16) yi = xi′β + εi,  β = (1, 1, 2, 0.5, 3, 1, 1, 2, 0.5, 3)′,
where β is a p×1 parameter vector containing an intercept and the parameters corresponding to p − 1 explanatory variables. Here p = 10, but we also consider the cases p = 2 and p = 5, consisting of the first two and first five coefficients of β, respectively. The sample size is set at n = 50, 100, 500, and 1,000. All simulation results are based on 500 replications. The explanatory variables in xi are independent realizations of an N(0, 1) distribution. The errors εi are i.i.d., and we consider the following eight error distributions:

  1. standard normal;

  2. variance-contaminated normal, the mixture 0.9N(0,1/9)+0.1N(0,9);

  3. t-distribution with two degrees of freedom;

  4. bimodal symmetric mixture of two normals, 0.5N(−3,1)+0.5N(3,1);

  5. Unif[−3, 3];

  6. Gamma(2,2);

  7. skewed mixture of normals, 0.3N(1.4,1)+0.7N(−0.6,0.16); and

  8. log-normal, being the distribution of exp(z) for z ∼ N(0,1).

The distributions are centered and scaled to have mean zero and unit variance, when necessary and possible. The t(2)-distribution is left unscaled as its variance is infinite.
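For reference, a sketch of a single draw from error distributions (a)–(h) is given below. Here the draws are centered and scaled empirically, which only approximates the exact standardization of the distributions described above, and the Gamma(2,2) case assumes the shape–scale parameterization.

```python
import numpy as np

def draw_errors(dist, n, rng):
    """Draw n errors from distributions (a)-(h); (a) and (c) are returned unmodified."""
    if dist == "a":                      # standard normal
        return rng.standard_normal(n)
    if dist == "c":                      # t(2): infinite variance, left unscaled
        return rng.standard_t(2, n)
    if dist == "b":                      # variance-contaminated normal
        e = np.where(rng.random(n) < 0.9, rng.normal(0, 1/3, n), rng.normal(0, 3, n))
    elif dist == "d":                    # bimodal symmetric normal mixture
        e = np.where(rng.random(n) < 0.5, rng.normal(-3, 1, n), rng.normal(3, 1, n))
    elif dist == "e":                    # uniform on [-3, 3]
        e = rng.uniform(-3, 3, n)
    elif dist == "f":                    # Gamma(2,2), shape-scale parameterization assumed
        e = rng.gamma(shape=2.0, scale=2.0, size=n)
    elif dist == "g":                    # skewed normal mixture
        e = np.where(rng.random(n) < 0.3, rng.normal(1.4, 1, n), rng.normal(-0.6, 0.4, n))
    else:                                # "h": log-normal
        e = np.exp(rng.standard_normal(n))
    e = e - e.mean()                     # center (empirical approximation)
    return e / e.std()                   # scale to unit sample variance
```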

If required, we use β̂LSE as the initial parameter estimate. In addition, we adopt the standard normal kernel K(·). For the SBS method, we set the trimming parameter at t = 8. Following Newey (1988), we compute the LGMM and LGMMS estimators for the transformed moments with J = 3. Implementing the kernel density-based estimators requires a method for choosing the bandwidth hn. There is a vast literature on this topic, ranging from simple to involved methods, but none of the proposed methods performs best overall. In an extensive simulation study of model (1) with n = 100 and p = 2, Reichardt (2017) concludes that for M-KDRE the bandwidth hn,1 = 1.06 σ̂n n^(−1/5) is preferable for symmetric error distributions in terms of the root mean squared error (RMSE) of the estimators. Here σ̂n is the standard deviation of the data. For skewed distributions, he recommends hn,2 = 0.9 A n^(−1/5), where A = min(σ̂n, R/1.34) with R the inter-quartile range of the data. The KDRE and YDG estimators perform best under hn,1. The SBS estimator generally shows the smallest RMSE for hn,2. Hence, throughout the simulations, we use the above bandwidths.
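The two bandwidth rules are simple to code; a sketch (hn,1 is the normal-reference rule, hn,2 the robust Silverman-type rule quoted above):

```python
import numpy as np

def bandwidth_symmetric(e):
    """h_{n,1} = 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * np.std(e, ddof=1) * len(e) ** (-0.2)

def bandwidth_skewed(e):
    """h_{n,2} = 0.9 * A * n^(-1/5), with A = min(sigma_hat, IQR/1.34)."""
    q75, q25 = np.percentile(e, [75, 25])
    A = min(np.std(e, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * A * len(e) ** (-0.2)
```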

4.2. Results

Averaged over all replications, Table 1 provides summary information on the computed RMSEs of the slope and intercept coefficients for all n and p, and across all error distributions. Clearly, M-KDRE shows the best overall performance of all estimators for the slope coefficient, with 64 lowest RMSE values out of a total of 84, i.e., 7 estimation methods (M-KDRE, KDRE, YDG, SBS, LGMM, LGMMS, LSE), 3 values for p, and 4 values for n. On the other hand, the SBS estimator has only one lowest RMSE value, for n = 1,000 and p = 2. The other estimators have counts of lowest RMSE values lying in between these two extremes, with the LSE results as a benchmark. Finally, from the last column of Table 1, we see that M-KDRE performs equally well as YDG, LGMM, and LGMMS in estimating the intercept term, and that M-KDRE markedly outperforms KDRE.

Table 1. Number of times the RMSE attains its lowest value for seven regression estimators, for each n and p and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.

Reichardt (2017) reports RMSE values for each of the eight error distributions. For the sake of space we omit the details. However, for the slope coefficient the simulation results can be summarized as follows.

  • In terms of RMSE, the M-KDRE method performs very well for the log-normal error distribution (h): the RMSE of the second most efficient estimators (KDRE and LGMM) is approximately 40% larger, even for n = 1,000. Under error distributions (b) and (c), M-KDRE is also the most efficient, but here the efficiency gain arises mostly for n = 50 and n = 100. In particular, M-KDRE has a superior performance in small samples under the t(2) error distribution. It is also close to best for error distributions (d)–(g).

  • The YDG method performs well for error distributions (d)–(g), but fails quite dramatically for error distributions with fat tails, i.e. (b), (c), and (h).

  • Efficiency of the SBS estimator is in general low relative to alternatives, but performance is especially weak under error distributions (c) and (d).

  • Overall, LGMM is a reasonable estimator, but its efficiency is lost under error distributions (e) and (f). This efficiency loss persists even for n=1,000.

  • LGMMS is by construction inefficient when the error distribution is skewed: (f)–(h). More interestingly, the LGMMS-estimate of the slope coefficient is no improvement over LGMM under symmetric error distributions. The inefficiency of LGMMS with respect to LGMM is likely to be due to the fact that LGMMS uses moment restrictions on odd powers of the disturbances only and, hence, for a particular value of J, uses higher order moments that may lead to less efficient estimation.

Table 2 shows summary results for the bias of both the slope and intercept estimators for all n and p, and across all error distributions. For the slope coefficient, M-KDRE has the best performance in terms of the lowest bias values. Again, from Reichardt (2017) we learn that the intercept bias of the different estimators is usually of similar magnitude in the symmetric cases. Under the asymmetric error distributions, the bias of the intercept is much larger for KDRE, SBS, and LGMMS than for the other estimators.

Table 2. Number of times the bias of seven regression estimators attains its lowest value for each n and p, and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.

5. Empirical application

Andrabi, Das, and Khwaja (2017) study the impact of providing information in the form of school report cards on educational outcomes such as school fees, test scores, and enrollment in markets with multiple public and private providers. The report cards, given to both households and schools in n randomly sampled villages across three districts in the Punjab province of Pakistan, include information on the performance of the child, the average score of the different schools in the village, and the average village score in mathematics, English, and Urdu. The following three linear regression models are of interest:
(17) Yi,j = αj + βj RCi + γj Yi,j* + δj′Xi,j + εi, (i=1,…,n; j=1, 2, 3),
where Yi,1, Yi,2, and Yi,3 are, respectively, the average fees, test scores, and enrollment rate of village i in the post-intervention year. Yi,j* denotes the baseline measurement of the same variable. RCi is the treatment assignment dummy for village i, so that βj, the parameter of interest, measures the impact of the report card assignment. Xi,j is a vector of village-level baseline controls. All models in the paper are estimated using LSE.

Table 3, column 1, shows the LSEs of βj (j=1, 2, 3) and their corresponding standard errors (in parentheses), as shown in, respectively, (1) panel C, (4) panel C, and (1) panel C of Andrabi, Das, and Khwaja (2017). The Shapiro-Wilk test for normality indicates that the LSE residuals are far from normally distributed, with p-values 0.000 (j=1, n=104), 0.002 (j=2, n=112), and 0.000 (j=3, n=112). Indeed, in all cases, diagnostic statistics show that the residuals have fatter tails than one would expect under normality. Based on the LSE results, Andrabi, Das, and Khwaja (2017) report the following main findings. First, private schools decreased their annualized fees (Yi,1) by an average of 187 rupees, about 17% of their baseline fees, in response to the report card intervention. Second, test scores (Yi,2) increased by 0.11 standard deviations. Third, primary enrollment (Yi,3) increased by 3.2 percentage points, or 4.5%, in treatment villages.

Table 3. Effect of report cards on school fees, test scores, and enrollment as given by parameter estimates of βj (j=1, 2, 3) using seven estimation methods. For columns 2–7, standard errors (in parentheses) are based on 500 bootstrap replicates.

Table 4. Median absolute prediction error (MAPE) of six AEs relative to LSE.

Table 3, columns 2–7, shows the estimates of the six AEs for models j = 1, 2, and 3. We see that these estimators pull the estimated treatment effect toward zero for all models. The results for model 1 are especially striking: the M-KDRE estimate of β1 is more than 40 times smaller, in absolute value, than the LSE. Also, for model 1, the AEs differ substantially. In that respect, it is interesting to investigate the prediction performance of the respective methods.

Table 4 shows the ratio of the median absolute prediction error (MAPE) of each estimator relative to the LSE. The training set is a random sample of the data of size 0.8n. We see that M-KDRE has the lowest MAPE for model 1. In addition, observe that the prediction performance is generally better for AEs with a low estimate of β1, such as KDRE and SBS. This suggests that the effect of the report cards on school fees, if it exists at all, is much lower than reported. For models 2 and 3, there is less difference between the estimates of the AEs. Also, these estimates are adjusted less strongly with respect to LSE.
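One replicate of this out-of-sample comparison can be sketched as follows; the 80/20 split and the pairing against LSE follow the description above, and the estimator argument is a placeholder for any of the AEs.

```python
import numpy as np

def relative_mape(X, y, estimator, rng, train_frac=0.8):
    """Median absolute prediction error of an AE relative to LSE on one random train/test split."""
    n = len(y)
    idx = rng.permutation(n)
    tr, te = idx[: int(train_frac * n)], idx[int(train_frac * n):]
    beta_ae = estimator(X[tr], y[tr])
    beta_ls = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
    mape_ae = np.median(np.abs(y[te] - X[te] @ beta_ae))
    mape_ls = np.median(np.abs(y[te] - X[te] @ beta_ls))
    return mape_ae / mape_ls
```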

Table 5 reports two bootstrapped 95% confidence intervals for βj (j=1, 2, 3) as estimated by M-KDRE. The confidence interval termed “Normal” is based on an asymptotic normality assumption, and the interval termed “Percentile” is based on the central 95% range of the empirical distribution of the 500 bootstrap replicates. Both intervals show that the estimated treatment effect for model 1 is not significantly different from zero.
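A sketch of how the two intervals can be computed is given below. The text does not state whether pairs or residuals are resampled, so we show a pairs bootstrap; the estimator argument stands for the M-KDRE routine (or any other AE).

```python
import numpy as np

def bootstrap_cis(X, y, estimator, n_boot=500, seed=0):
    """Normal-approximation and percentile 95% bootstrap intervals for each coefficient."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = estimator(X, y)
    boot = np.empty((n_boot, len(est)))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                     # resample (x_i, y_i) pairs
        boot[b] = estimator(X[idx], y[idx])
    se = boot.std(axis=0, ddof=1)
    z975 = 1.96                                         # 97.5% standard-normal quantile
    normal_ci = np.stack([est - z975 * se, est + z975 * se], axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    percentile_ci = np.stack([lo, hi], axis=1)
    return normal_ci, percentile_ci
```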

Table 5. Bootstrapped 95% confidence intervals of the effect of report cards for the M-KDRE method.

In summary, the above results demonstrate the practical relevance of AEs in general and of the proposed M-KDRE method in particular. None of the other AEs adjusted the treatment effect on school fees as far toward zero as M-KDRE, while the prediction performance suggests that this method should be preferred over the other methods for this particular linear regression model and sample size. Thus, there is no support for the first finding of Andrabi, Das, and Khwaja (2017) at any reasonable significance level. Further, Table 3 shows that, according to the AEs, the effect of report cards on test scores is not significantly different from zero at the 5% nominal level. Lastly, the effect of report cards on the enrollment rate, even though marginally significant for M-KDRE, also seems questionable.

6. Summary and concluding remarks

In this paper, we proposed an adaptive multi-step kernel density-based regression estimator for linear regression models. We have established the theoretical properties of our estimation method, including asymptotic normality and almost sure convergence. In an extensive simulation study, we have shown that the finite-sample performance of M-KDRE is second to none among five alternative AEs. For several error distributions, it is up to twice as efficient, in terms of RMSE, as the second-best estimator. Further, for every other error distribution, it is either the most efficient or very close to the most efficient estimator. All other AEs show a loss of efficiency for certain specific error distributions. Our empirical application provides a good illustration of many of these issues. In particular, using the M-KDRE method and its corresponding bootstrap standard errors, we found fairly compelling evidence that the treatment effects reported by Andrabi, Das, and Khwaja (2017) are not significantly different from zero.

The results raise several questions for further research. For instance, one may wish to estimate nonlinear regressions via the M-KDRE method. In that case the EM algorithm, at least in its present form, needs to be adjusted. Another issue concerns the fact that the multi-step method makes use of β̂LSE in the initial step. This choice was primarily based on computational convenience. Perhaps, efficiency may be further enhanced by a more prudent choice of the initial estimator. It may also be of interest to assess the robustness of M-KDRE to a violation of the independence assumption. In particular, adaptive estimation is not in general possible when the vector of covariates and the error process are not mutually independent. We leave these questions for future research.

Acknowledgments

The authors would like to thank two anonymous referees for their valuable comments and suggestions.

References

Appendix A: Proofs of results

Lemma A.1.

(Zhang 1990, Theorem 5) If model (1) holds, {εi}i=1n are i.i.d. with unknown density f(x), where f(·) is a uniformly continuous function that satisfies (i) ∫ x f(x)dx = 0, (ii) 0 < ∫ x² f(x)dx < ∞; the set of covariates {xi}i=1n satisfies (iii) ∃ 0 < M < ∞ such that ‖xi‖ < M for all i=1,…,n, (iv) Sn → Q = E(xx′), where Sn = n^−1 Σi=1n xixi′; and the following assumptions on the kernel function K(·) hold: (v) K(x) is uniformly bounded and ∃ 0 < ρ < ∞ such that K(x) = 0 for all x with |x| ≥ ρ, (vi) K(x) is Riemann integrable on [−ρ, ρ], (vii) when n → ∞, 0 < hn → 0 and nhn/ln n → ∞; then
(A.1) supx∈R |f̂n,LSE(x) − f(x)| →a.s. 0,
where f̂n,LSE is the kernel density-based estimator of the LSE residuals ε̂i,LSE = yi − xi′β̂LSE (i=1,…,n).

Lemma A.2.

Suppose that the assumptions of Lemma A.1 hold. Then, if an estimator β* satisfies Pr(limn→∞ max|β* − β0| ≤ max|Sn^−1| · ln max|Sn^−1|) = 1, where max|A| = maxi,j |aij| with aij the elements of a matrix A, then
(A.2) supx∈R |f̂n*(x) − f(x)| →a.s. 0,
where f̂n* is the kernel density-based estimator of the residuals ε̂i = yi − xi′β* (i=1,…,n).

Proof.

This result follows immediately from Theorem 5 and Lemma 4 in Zhang (1990), in conjunction with Theorem 4 and Eq. (29) in Chai, Li, and Tian (1991). □

Lemma A.3.

If there is a function Q0(β) such that (i) Q0(β) is uniquely maximized at β0, (ii) B is compact, (iii) Q0(β) is continuous, (iv) supβ∈B |Q̂n(β) − Q0(β)| →a.s. 0, then, for u=1,2,…, β̂(u) →a.s. β0, where β̂ maximizes the objective function Q̂n(β) subject to β ∈ B. The weak convergence result, i.e., β̂ →p β0, is obtained by replacing condition (iv) with supβ∈B |Q̂n(β) − Q0(β)| →p 0.

Proof:

The proof is similar to the proof of Theorem 2.1 of Newey and McFadden (1994). □

Lemma A.4.

If fn : B → R is a continuous function, B is compact, and fn →a.s. f, then
(A.3) limn→∞ ∫B fn du = ∫B f du.

Proof.

Since B is compact and fn is continuous, the image fn(B) is a compact subset of R and hence closed and bounded. Then, the result follows from the bounded convergence theorem; see, e.g., Wade (1974). □

Proof of Theorem 2.1.

Following Newey and McFadden (1994, Thm. 2.5), we verify the conditions in Lemma A.3. Note that conditions (i)–(iii) are on the density f(·) of ε and on the parameter space B. These conditions hold under the usual regularity conditions of MLE. Condition (iv) of Lemma A.3 implies that we have to prove that supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]| →a.s. 0.

Since f̂n(1) = f̂n,LSE, we have by Lemma A.1 that supx∈R |f̂n(1)(x) − f(x)| →a.s. 0.

Now note that supβ∈B |f̂n(1)(yi|β) − f(yi|β)| ≤ supβ∈Rp |f̂n(1)(yi|β) − f(yi|β)| ≤ supx∈R |f̂n(1)(x) − f(x)|, implying that
(A.4) supβ∈B |f̂n(1)(yi|β) − f(yi|β)| →a.s. 0.

Condition (iii) implies that infβ∈B f(yi|β) > 0. Thus, there exists ε > 0 such that infβ∈B f(yi|β) > ε. Also, by (A.4), for any ε > 0, Pr(limn→∞ supβ∈B |f̂n(1)(yi|β) − f(yi|β)| < ε) = 1, which implies Pr(limn→∞ infβ∈B f̂n(1)(yi|β) > 0) = 1.

This, together with condition (ii), ensures that for n large enough both ln f(yi|β) and ln f̂n(1)(yi|β) are uniformly continuous with probability one, so that by the uniform continuous mapping theorem supβ∈B |ln f̂n(1)(yi|β) − ln f(yi|β)| →a.s. 0.

Note that by conditions (ii) and (iii), ln f̂n(1)(yi|β) is bounded, and we may invoke the uniform law of large numbers, so that
(A.5) supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f̂n(1)(yi|β)]| →a.s. 0.

Also, by Lemma A.4,
(A.6) limn→∞ E[ln f̂n(1)(yi|β)] = E[ln f(yi|β)].

Now define the following quantities: A ≡ |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]|, A1 ≡ |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f̂n(1)(yi|β)]|, and A2 ≡ |E[ln f̂n(1)(yi|β)] − E[ln f(yi|β)]|.

Then, by the triangle inequality, we have A ≤ A1 + A2, and by (A.5) and (A.6), supβ∈B A1 →a.s. 0 and limn→∞ supβ∈B A2 = 0. It follows that condition (iv) of Lemma A.3 holds, i.e.,
(A.7) supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]| →a.s. 0.

Thus, by Lemma A.3,
(A.8) β̂(1) →a.s. β0.

For the sake of completeness, we also prove that the constraint c(β) = 0 does not affect this result. Let C ⊆ B be the subset for which c(β) = 0, that is, C = {β ∈ B : c(β) = 0}. First, note that C is a level set of the continuous function c(β), so that C is closed. Also, C is bounded since C ⊆ B and B is bounded. Hence, C is compact, and β̂(1) = arg supβ∈C Σi=1n ln f̂n(1)(yi|β). Denote by β˜ = arg supβ∈B Σi=1n ln f̂n(1)(yi|β) the global maximizer of the objective function over B. Newey and McFadden (1994, p. 2122) show that for (A.8) to hold, it suffices to prove that
(A.9) n^−1 Σi=1n ln f̂n(1)(yi|β̂(1)) − n^−1 Σi=1n ln f̂n(1)(yi|β˜) →a.s. 0.

For that purpose, define B ≡ |supβ∈C n^−1 Σi=1n ln f̂n(1)(yi|β) − supβ∈B n^−1 Σi=1n ln f̂n(1)(yi|β)|, B1 ≡ |supβ∈C n^−1 Σi=1n ln f̂n(1)(yi|β) − supβ∈C E[ln f(yi|β)]|, B2 ≡ |supβ∈C E[ln f(yi|β)] − supβ∈B E[ln f(yi|β)]|, and B3 ≡ |supβ∈B E[ln f(yi|β)] − supβ∈B n^−1 Σi=1n ln f̂n(1)(yi|β)|.

Again by the triangle inequality, B ≤ B1 + B2 + B3. From (A.7), it is easy to show that B1 →a.s. 0 and B3 →a.s. 0. To show that B2 →a.s. 0, first observe that by conditions (i) and (ii) of Lemma A.1 and the strong law of large numbers,
(A.10) Pr(limn→∞ [n^−1 Σi=1n (yi − xi′β0)] = 0) = 1.

This implies Pr(limn→∞ β0 ∈ C) = 1, hence Pr(limn→∞ [arg supβ∈B E[ln f(yi|β)]] ∈ C) = 1, and hence Pr(limn→∞ B2 = 0) = 1; the last result implies, by the definition of almost sure convergence, that B2 →a.s. 0. Hence, B →a.s. 0 and the constraint does not affect the result.

Lastly, to show that the algorithm asymptotically converges to β0, remark that (A.8) implies, by Lemma A.2, that supx∈R |f̂n(2)(x) − f(x)| →a.s. 0, where f̂n(2)(x) is the kernel density-based estimator of the residuals corresponding to β̂(1). Thus, by identical reasoning, we obtain β̂(u) →a.s. β0 for u=1,2,…. □

Remark 5.

Conditions (i)–(iv) of Theorem 2.1 are the regularity conditions that are necessary for the convergence of the MLE under the true density. Thus, the only additional conditions imposed are those in Lemma A.1, of which condition (i) of zero mean is without loss of generality in the context of linear regression, as we can always adjust the intercept parameter in β if the center of f(·) is not zero. Condition (ii) of Lemma A.1 may be restrictive in some cases as it rules out, for instance, the t(v)-distribution with 1 < v ≤ 2. However, in Section 4.2 we observed that M-KDRE performs well for t(2); in fact, its performance is the best of all considered estimators under that error distribution. Hence, the practical use of M-KDRE does not seem to be restricted to distributions with finite variance. Conditions (iii) and (iv) of Lemma A.1 are easy to verify in practice, and conditions (v)–(vii) are technical requirements on the kernel and bandwidth. Note that (v) is not satisfied by the Gaussian kernel, since that kernel does not have bounded support. In practice, however, the Gaussian kernel entails a significant computational advantage.

Proof of Theorem 2.3

(sketch): For the case of q (q < p) linear random or deterministic equality constraints, the proof of consistency and of the asymptotic distribution of β̂ can be based on results in Crowder (1984) and Osborne (2000). In particular, given these results, it follows that β̂(1) is asymptotically normal and efficient. The only condition on the initial estimator β̂(0) is that (β̂(0) − β0) = Op(n^−1/2). For β̂(1) this follows from the proof of Theorem 2.1 in Yao and Zhao (2013). Hence, all subsequent estimates β̂(u) (u=2,3,…) also satisfy (8). □

Proof of Theorem 2.4.

Under the Gaussian kernel, the linear constraint c(β) = 0, and a full-kernel method, the M-step in (10) becomes
(A.11) β̂(k+1)(u) = argminβ Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1))²  s.t.  Σi=1n (yi − xi′β) = 0.

This can be solved by Lagrangian optimization. Define the Lagrangian L as
(A.12) L(β,λ) = Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1))² − λ Σi=1n (yi − xi′β),
with first-order conditions
(A.13) ∂L/∂β = −2 Σi=1n Σj=1n pij,(k+1)(u) xi (yi − xi′β − ε̂j(u−1)) + λ Σi=1n xi = 0,
(A.14) ∂L/∂λ = −Σi=1n (yi − xi′β) = 0.

By setting xi,1 ≡ 1 (i=1,…,n), the first element of the first-order condition in (A.13) implies
(A.15) λ = (2/n) Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1)) = (2/n) Σi=1n (yi − xi′β) Σj=1n pij,(k+1)(u) − (2/n) Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1) = −(2/n) Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1),
where the last equality follows from (A.14). Then, by plugging λ into (A.13), rearranging terms, and using Σj=1n pij,(k+1)(u) = 1, we obtain
(A.16) β̂(k+1)(u) = (Σi=1n xixi′)^−1 Σi=1n xiyi − (Σi=1n xixi′)^−1 Σi=1n (xi Σj=1n pij,(k+1)(u) ε̂j(u−1)) + (1/n)(Σi=1n xixi′)^−1 (Σi=1n xi)(Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1)).
Recognize that the first term is equal to β̂LSE. Then, the fact that (9) and (10) are the E- and M-step, respectively, of an EM type algorithm follows trivially from the proof of Theorem 2.2 in Yao and Zhao (2013). □

Recognize that the first term is equal to β̂LSE. Then, the fact that (Equation9) and (Equation10) are the E- and M-step, respectively, of an EM type algorithm follows trivially from the proof of Theorem 2.2 in Yao and Zhao (Citation2013). □