
A multi-step kernel–based regression estimator that adapts to error distributions of unknown form

Pages 6211-6230 | Received 16 Aug 2018, Accepted 06 Mar 2020, Published online: 19 Mar 2020

Abstract

For linear regression models, we propose and study a multi-step kernel density-based estimator that is adaptive to unknown error distributions. We establish asymptotic normality and almost sure convergence. An efficient EM algorithm is provided to implement the proposed estimator. We also compare its finite sample performance with five other adaptive estimators in an extensive Monte Carlo study of eight error distributions. Our method generally attains high mean-square-error efficiency. An empirical example illustrates the gain in efficiency of the new adaptive method when making statistical inference about the slope parameters in three linear regressions.


1. Introduction

In parametric linear regression analysis one often imposes the model assumptions that the errors are independent and normally distributed. The normality assumption is convenient, as it is well known that the maximum likelihood estimator (MLE) of the unknown parameter vector simplifies to the least squares estimator (LSE). Naturally, an invalid assumption on the error distribution F comes at a cost; the MLE is in general neither consistent nor asymptotically efficient under model misspecification. Moreover, in practice, it can lead to inaccurate or invalid statistical inference; see Sec. 5. This has motivated the search for alternative, (semi)parametric, estimators that retain asymptotic efficiency when F is unknown.

One approach is adaptive estimation, which “adapts” to an unknown, or incorrectly specified, distribution F by maximizing an estimated likelihood function based on an initial estimate of the error distribution; see Bickel (1982), Linton and Xiao (2007), Yuan and De Gooijer (2007), and the references therein. The adaptive idea has been studied for (non)linear regression models using non- and semiparametric methods to estimate F or its probability density function (pdf) f.

There are various alternative adaptive estimation methods for non- and semiparametric regression problems with errors of unknown distributional form. For instance, the empirical likelihood method of Owen (2001) has been used to obtain adaptive confidence limits and likelihood ratio test statistics for regression parameters; see also Owen (1988, 1990, 1991), Qin and Lawless (1994), and Kitamura (1997, 2007), among many others. Another example is the multivariate adaptive regression splines (MARS) method (Friedman 1991), a global adaptive nonparametric method for fitting nonlinear regression models. PolyMARS (Kooperberg, Bose, and Stone 1997) is an extension of MARS that allows for multiple polychotomous regression. Time series MARS, or TSMARS, can be used for nonlinear time series analysis and forecasting; see, e.g., De Gooijer (2017). More recently, Wang and Yao (2012) proposed a minimum average variance estimation method, a dimension reduction technique, which can be adaptive to different error distributions. Also, Chen, Wang, and Yao (2015) developed an adaptive estimation method for varying coefficient models.

Recently, Yao and Zhao (2013) proposed an adaptive kernel density-based estimator for classical linear regression models, called KDRE. In particular, given an estimate of the “true” parameter vector, f is modeled by a kernel density-based estimator of the regression residuals. In the second step, parameter estimates are obtained by maximizing a local, kernel-based, log-likelihood function that treats the first-step estimated density function as the true one. Through a simulation study, Yao and Zhao (2013) show that the resulting KDRE is asymptotically equivalent to the oracle estimator in which the true error pdf is known.

Now, it is well known that each LSE-based residual is the sum of two components: one is the true error, the other is a linear function of the entire vector of errors. Since, in finite samples, the second term will tend to be normally distributed (as long as the errors have finite variance), the residuals for small samples will appear more normal than would the unobserved values of the errors. This tendency is called supernormality; see Bassett and Koenker (1982), Bloomfield (1974), and White and MacDonald (1980). Hence, for the two-step KDRE method, the finite-sample properties of the estimator are likely to depend strongly on the empirical distribution of the residuals, which resembles normality more closely than the distribution of the errors itself would, if the latter were in fact available. This makes the KDRE method nonoptimal.

In this paper, we remedy this deficiency of the KDRE method by further iteration. That is, we maximize over different kernel-based likelihood functions. These different likelihood functions follow iteratively from parameter estimates that result from maximizing the (previous) likelihood function. The algorithm iterates until the parameter estimates (and, hence, the estimated likelihood function) reach a fixed point. The new estimator is called multi-step KDRE (M-KDRE). In finite samples, one may expect M-KDRE to yield better estimation results; see Robinson (1988). Indeed, in an extensive Monte Carlo study in which we compare the finite-sample performance of M-KDRE with that of five other parametric and (semi)nonparametric adaptive estimators (AEs), we find that iterating over the likelihood functions can strongly increase finite-sample performance. In fact, we find that M-KDRE outperforms all other considered AEs. In an empirical example based on Andrabi, Das, and Khwaja (2017), we show that LSE estimates may be misleading. M-KDRE outperforms the other considered estimators in terms of out-of-sample prediction performance. Furthermore, M-KDRE provides strong evidence that the treatment effect described in Andrabi, Das, and Khwaja (2017) does not exist.

Theoretically, we establish strong (almost sure) convergence to the true parameter vector under relatively weak conditions. We also show that the M-KDRE method is adaptive, i.e., asymptotically normal and efficient. Furthermore, we propose an EM type algorithm that makes its computation convenient.

The rest of the paper is organized as follows. Section 2 introduces the new adaptive M-KDRE method and contains our theoretical results. An efficient EM type algorithm to implement the proposed estimator is also given in this section. In Section 3, we describe and explain five alternative adaptive estimation methods. Section 4 contains results of a simulation-based study of the finite sample properties of the M-KDRE method and compares it with the adaptive estimation methods discussed in Section 3. In Section 5, we present the empirical application of our method to the educational data set used in Andrabi, Das, and Khwaja (2017). Section 6 gives a summary and some concluding remarks. Proofs are presented in Appendix A.

2. Multi-step kernel density-based regression estimation

2.1. Model and method

Consider the general linear regression problem with observations
(1) yi = xi′β0 + εi, (i=1,…,n),
where yi is a univariate response variable, xi = (xi,1,…,xi,p)′ is a p-dimensional (p < n) vector of covariates, and β0 ∈ B ⊂ Rp is an unknown parameter vector including an intercept. Here (yi,xi,εi) are independent and identically distributed (i.i.d.) realizations from a common random source (y,x,ε). Moreover, the εi's are assumed to have some common unknown pdf f(ε), with E[εi|xi,β0] = 0 and E[|εi| | xi,β0] < ∞ (i=1,…,n). Model (1) is semiparametric, with β and f(·) its parametric and non-parametric part, respectively.

Let β̂LSE be the LSE of β0 in (1), which is a natural estimator with which to start the M-KDRE method. Also, let β̂(u) denote the estimator of β0 at iteration step u = 0, 1, …. Then, under the conditions introduced above, the proposed M-KDRE can be obtained as follows.

  1. Initial step: At u = 0, start with β̂(0) = β̂LSE. Compute the residuals ε̂i(0) = yi − xi′β̂(0) (i=1,…,n).

  2. Compute the Rosenblatt-Parzen kernel-based estimator f̂n(u)(·) of f(·). That is,
(2) f̂n(u)(x) = (nhn)^−1 Σi=1n K((x − ε̂i(u−1))/hn),
where hn > 0 is the bandwidth.

  3. Let c(β) = n^−1 Σi=1n (yi − xi′β). Then, using (2), compute
(3) β̂(u) = argmaxβ∈B Q̂u(β)  s.t.  c(β(u)) = 0,
where Q̂u(β) is the local log-likelihood function
(4) Q̂u(β) = Σi=1n ln f̂n(u)(yi − xi′β).

  4. Repeat steps (ii)–(iii) until convergence, at some iteration step (u+1) (u=1,2,…).
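To make the recursion concrete, the following minimal Python sketch implements steps (i)–(iv), assuming a Gaussian kernel, a user-supplied bandwidth hn, and a generic constrained optimizer (SLSQP) for (3); the helper names kde and m_kdre, and the small numerical guard inside the logarithm, are ours and not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def kde(x, resid, h):
    """Rosenblatt-Parzen estimate of f at the points x, built from residuals (Eq. (2))."""
    u = (x[:, None] - resid[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(resid) * h * np.sqrt(2 * np.pi))

def m_kdre(X, y, h, max_iter=50, tol=1e-6):
    """Multi-step KDRE: iterate constrained kernel-likelihood maximization to a fixed point."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]                # step (i): LSE starting value
    for _ in range(max_iter):
        resid = y - X @ beta                                   # residuals of the current fit
        def neg_loglik(b, r=resid):                            # step (iii): minus Eq. (4)
            e = y - X @ b
            return -np.sum(np.log(kde(e, r, h) + 1e-300))
        cons = {"type": "eq", "fun": lambda b: np.mean(y - X @ b)}   # constraint c(beta) = 0
        new_beta = minimize(neg_loglik, beta, method="SLSQP", constraints=[cons]).x
        if np.max(np.abs(new_beta - beta)) < tol:              # step (iv): stop at a fixed point
            return new_beta
        beta = new_beta
    return beta
```

In this sketch the kernel density of step (ii) is rebuilt from the latest residuals at every pass, which is precisely the feature that distinguishes M-KDRE from the two-step KDRE.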

It is worth mentioning that Yao and Zhao (2013) only iterate over the likelihood function estimated from the initial LSE residuals, while in the above scheme we allow for stochastic fluctuations in f̂n(u)(·), which is re-estimated at every step. As we show, this generalization increases the mean-square-error efficiency of the parameter estimates.

2.2. Asymptotic properties

In this section we give the asymptotic properties of the uth M-KDRE, denoted hereafter by β̂(u). For convenience, when we emphasize the dependence on β, we use f(yi|β) to denote f(yi − xi′β). Technical details, lemmas, and proofs are given in Appendix A.

Theorem 2.1.

(Almost sure (a.s.) convergence) Under the assumptions of Lemma A.1 and

  1. f(yi|β̂) is non-parametrically identifiable,

  2. β ∈ B, where B ⊂ Rp is compact,

  3. f(yi|β) is continuous for each β ∈ B,

  4. E[supβ∈B |ln f(yi|β)|] < ∞,

    then

(5) β̂(u) →a.s. β0.

The following notation is used throughout the next part of the paper. Let Ln(yi|β) = ln f(yi|β) = ln f(yi − xi′β). Then dβ(β) = ∂Ln(yi|β)/∂β = −f′(yi|β) f^−1(yi|β) xi and dββ(β) = ∂²Ln(yi|β)/∂β∂β′ = ((f″(yi|β)f(yi|β) − f′2(yi|β))/f²(yi|β)) xixi′, where f′(u) = ∂f(u)/∂u, f″(u) = ∂f′(u)/∂u, and f′2(u) = f′(u)f′(u). Given these notations, the Fisher information matrix (FIM) for the unconstrained linear regression problem, evaluated at β0, can be defined as
(6) Iββ = E(dβ(β0) dβ(β0)′) = −E(dββ(β0)),
where the second equality holds under mild regularity conditions. If Iββ is nonsingular, then Iββ^−1 is the unconstrained Cramér-Rao bound (CRB) for the mean-square error (MSE) covariance matrix of any unbiased estimator of β0.

Observe that incorporation of the one-dimensional linear constraint c(β(u)) = 0 in step (iii) of the M-KDRE method leads to a p-dimensional parameter vector that has only p − 1 independent components. As a consequence, the FIM is singular and the CRB may not be an informative lower bound on the MSE matrix of the resulting estimator, so the asymptotic distribution of β̂(u) degenerates. For deterministic linear parameter constraints, Stoica and Ng (1998) formulated a constrained CRB (CCRB) that explicitly combines the active constraint information with the original FIM, singular or nonsingular. Their general setting is that of q (q < p) continuously differentiable constraints g(β) = 0. Assuming β is regular in the active set of linear constraints, the q × p gradient matrix G = ∂g(β)/∂β′ has full row rank q, with G independent of β. Hence, there exists a matrix U ∈ Rp×(p−q) whose columns form an orthonormal basis of the null space of the rows of G, i.e., such that
(7) GU = 0  and  U′U = Ip−q,
where Ip−q denotes the identity matrix of size p−q. For nonlinear deterministic constraints, G and U are functions of β; see, e.g., Moore, Sadler, and Kozick (2008).
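As a small illustration of (7), an orthonormal basis U of the null space of the gradient matrix G can be computed numerically; the sketch below does this for the single mean-residual constraint c(β) = n^−1 Σi (yi − xi′β) = 0, whose gradient row is −n^−1 Σi xi′. The use of scipy.linalg.null_space is our choice and is not prescribed by Stoica and Ng (1998).

```python
import numpy as np
from scipy.linalg import null_space

def constraint_null_basis(X):
    """U with G U = 0 and U'U = I_{p-1}, for the constraint c(beta) = mean(y - X beta) = 0."""
    G = -X.mean(axis=0, keepdims=True)   # 1 x p gradient row of c(beta) w.r.t. beta
    U = null_space(G)                    # p x (p-1) orthonormal basis of the null space of G
    return U
```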

Recently, Ren et al. (2015) extended the deterministic CCRB of Stoica and Ng (1998) to a hybrid parameter vector with both nonrandom and random parameter constraints. In the case of the M-KDRE method the constraint is not deterministic, depending on the random variables xi. Then the matrix U in (7) depends on the estimate β̂ of β0, say U(β̂). The resulting CCRB, as a special case of the hybrid CCRB of Ren et al. (2015), can be stated as follows.

Theorem 2.2.

Let β̂ be an unbiased estimate of β0 satisfying the active functional constraints g(β) = 0, and let U = U(β̂) be as defined in (7). Then, under certain regularity conditions, if U′IββU is nonsingular, E((β̂ − β0)(β̂ − β0)′) ⪰ U(U′IββU)^−1U′, where the equality is achieved if and only if β̂ − β0 = U(U′IββU)^−1U′ dβ(β), in the mean-square sense.

Remark 1.

Note that rather than requiring a nonsingular FIM Iββ, the alternative condition is that U′IββU is nonsingular. Thus, the unconstrained FIM may be singular, or, equivalently, the unconstrained model unidentifiable, but the constrained model must be identifiable, at least locally. Ren et al. (2015) show that the difference between the standard CRB-based covariance matrix and the CCRB-based covariance matrix is a positive semi-definite matrix. This result is expected since the presence of parameter constraints can be considered as additional information to improve the performance of the estimator under study.

Theorem 2.3.

(Normality and efficiency) If model (1) holds, {εi}i=1n are i.i.d. with unknown density f(x), where f is a continuous function symmetric around zero with bounded continuous derivatives that satisfy

  1. ∫ x f(x)dx = 0,

  2. E[(∂ln f(x)/∂x)² + |∂²ln f(x)/∂x²| + |∂³ln f(x)/∂x³|] < ∞,

    {xi}i=1n satisfies

  3. ∃ 0 < M < ∞ such that ‖xi‖ < M,

    K(·) is a symmetric and four times continuously differentiable function such that

  4. ∃ 0 < ρ < ∞ such that K(x) = 0 for all x with |x| ≥ ρ

    holds, and

  5. when n → ∞, nhn⁴ → ∞ and nhn⁸ → 0,

then β̂(u) (u=1,2,…) is asymptotically normal and efficient. That is, as n → ∞,
(8) √n(β̂(u) − β0) →d N(0, U(U′IββU)^−1U′).

Remark 2.

All conditions are practical and easy to satisfy. Condition (ii) is used to guarantee the adaptiveness of β̂(u).

2.3. EM algorithm

In this section, we propose an EM type algorithm by noticing that (4) has a mixture log-likelihood form with an imposed constraint. Specifically, given an initial parameter estimate β̂(0) and the set of initial estimates {ε̂i(0)}i=1n, the (k+1)th iteration of the EM algorithm to maximize (4) (the uth likelihood function) proceeds as follows.

E-step: Calculate the classification probabilities,
(9) pij,(k+1)(u) = exp{−(yi − xi′β̂(k)(u) − ε̂j(u−1))²/(2hn²)} / Σℓ=1n exp{−(yi − xi′β̂(k)(u) − ε̂ℓ(u−1))²/(2hn²)}.

M-step: Update β̂(k)(u) with
(10) β̂(k+1)(u) = β̂LSE − (Σi=1n xixi′)^−1 Σi=1n (xi Σj=1n pij,(k+1)(u) ε̂j(u−1)) + (1/n)(Σi=1n xixi′)^−1 (Σi=1n xi)(Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1)),
where (10) follows from using a Gaussian kernel for density estimation. The choice of the kernel is not critical; any symmetric kernel can be used for our method. However, the Gaussian second-order kernel provides an explicit solution of the EM algorithm.
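For illustration, one E/M pass of (9)–(10) can be coded directly; in the sketch below the row-wise rescaling of the exponentials is a numerical safeguard we add, and the function name em_step is ours.

```python
import numpy as np

def em_step(X, y, beta_k, resid_prev, h, beta_lse):
    """One EM pass for the u-th kernel likelihood: E-step of Eq. (9), M-step of Eq. (10)."""
    # E-step: classification probabilities over the previous-step residuals
    d = y - X @ beta_k
    z = -0.5 * ((d[:, None] - resid_prev[None, :]) / h) ** 2
    z -= z.max(axis=1, keepdims=True)          # rescale rows before exponentiating (stability)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)          # each row of p_ij sums to one

    # M-step: closed-form update under the Gaussian kernel and the constraint c(beta) = 0
    w = p @ resid_prev                         # w_i = sum_j p_ij * previous residual j
    XtX_inv = np.linalg.inv(X.T @ X)
    return beta_lse - XtX_inv @ (X.T @ w) + XtX_inv @ X.sum(axis=0) * w.sum() / len(y)
```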

Theorem 2.4.

Under the linearity constraint c(β) = 0, each iteration of the above E- and M-steps will monotonically increase the local log-likelihood (4), i.e., Q̂n(β̂(k+1)(u)) ≥ Q̂n(β̂(k)(u)), for all k.

Remark 3.

For the EM type algorithm, we use a full-kernel method rather than a leave-one-out method as in Yao and Zhao (2013). This approach has the following advantage. If a certain residual ε̂j(u−1) is extremely large, pij,(k+1)(u) will be close to zero for all i ≠ j and close to one for i = j. This implies that the effect of such a residual is limited to the observation for which the next iterate of β is likely to lead to a residual of similar magnitude. Hence, the effect of a large residual on the maximization of (4) is small. In the leave-one-out method, the effect of the residual may be considerably larger, as pij,(k+1)(u) is likely to take a substantial value for several observations.

Remark 4.

The EM type algorithm is considered converged when max|β̂(k)(u) − β̂(k+1)(u)| is smaller than a threshold value, where max|A| denotes the largest absolute element of A. In the uth step of the M-KDRE method, the EM algorithm is initialized with the estimate from the (u−1)th step. That is, β̂(0)(u) = β̂(u−1).

3. Some alternative adaptive estimation methods

3.1. SBS method

Stone (1975), Bickel (1982), and Schick (1993) (henceforth SBS) introduce a two-step AE. Let β˜ be a √n-consistent estimator of β0. Then, an infeasible two-step estimator can be defined as β̂ = β˜ + n^−1 Iββ^−1(β˜) dβ(β˜), where Iββ(β˜) is Fisher's information matrix evaluated at β˜ and dβ(β˜) is the corresponding p×1 score vector. The infeasibility of β̂ follows from the fact that f is unknown, and hence Iββ and dβ are unknown. The approach of SBS is to replace dβ(β˜) by d̂β(β˜) = −Σi=1n f̂n′(yi|β˜) f̂n^−1(yi|β˜) xi, where f̂n(x) is defined in a similar way as in (2) and f̂n′(x) is its derivative with respect to x. Similarly, Iββ^−1(β˜) is replaced with n²[Σi=1n xixi′ (f̂n′(yi|β˜) f̂n^−1(yi|β˜))²]^−1.

The conditions under which the two-step AE approach can be shown to be asymptotically efficient have been researched extensively. Most importantly, the kernel estimator of the score function must be (i) i.i.d., and (ii) independent of xi. These conditions are restrictive and not easy to verify in practice; see, e.g., Yuan and De Gooijer (2007). Bickel (1982) solved the i.i.d. problem by splitting the sample in two: one sub-sample to estimate the score and another to solve for β. However, Manski (1984) finds that the estimator works much better when the sample is not split, i.e., if the estimated score and β˜ are both computed using the entire sample. If (i) and (ii) are satisfied, a sufficient condition for adaptiveness is that (iii) E[(f′(yi|β) f^−1(yi|β) − f̂n′(yi|β˜) f̂n^−1(yi|β˜))²] → 0 as n → ∞.

Since f̂n appears in the denominator of d̂β, unstable estimates may follow for near-zero values of f̂n(·). Hence, Bickel (1982) suggests to trim the estimator of the kernel score as follows:
f̂n′(yi|β˜)/f̂n(yi|β˜) = { f̂n′(yi|β˜)/f̂n(yi|β˜), if |yi − xi′β˜| ≤ t1, f̂n(yi|β˜) > t2, and |f̂n′(yi|β˜)/f̂n(yi|β˜)| < t3;  0, otherwise. }

This trimming mechanism ensures that near-zero density values do not have an unreasonably large influence on the estimate. If t1 → ∞, t2 → 0, t3 → ∞, hn → 0, t1/(nhn³) → 0, and hn t1 → 0 as n → ∞, then condition (iii) is satisfied. Hence, adaptiveness is established under the proper trimming parameters and conditions (i) and (ii).

Naturally, the growth rates of the trimming parameters are of little use to a practitioner, and as such the choice of the trimming parameters is a practical disadvantage. Hsieh and Manski (1987) reduce the problem to selecting a one-dimensional parameter t by suggesting the following relation between the trimming parameters: t1 = t, t2 = exp(−t²/2), t3 = t. These authors consider t = 3, 4, and 8. For a sample size of 50, t = 8 works best in almost all cases under study.
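A sketch of the resulting SBS update with a Gaussian kernel and the Hsieh–Manski trimming rule is given below. It uses the full sample both for the kernel score and for the update (no sample splitting, following Manski's finding above), and it folds the explicit n-normalizations into the estimated information matrix; the helper names are ours.

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def kde_and_deriv(e, resid, h):
    """Gaussian-kernel estimates of f(e) and f'(e) built from the residuals."""
    u = (e[:, None] - resid[None, :]) / h
    k = np.exp(-0.5 * u**2) / SQRT2PI
    f = k.sum(axis=1) / (len(resid) * h)
    fp = (-u * k).sum(axis=1) / (len(resid) * h**2)
    return f, fp

def sbs_onestep(X, y, beta_tilde, h, t=8.0):
    """Two-step SBS estimator: initial estimate plus an estimated efficient-score step."""
    e = y - X @ beta_tilde
    f, fp = kde_and_deriv(e, e, h)
    score = -fp / f                                        # estimated -f'/f at each residual
    t1, t2, t3 = t, np.exp(-t**2 / 2.0), t                 # Hsieh-Manski trimming parameters
    keep = (np.abs(e) <= t1) & (f > t2) & (np.abs(score) < t3)
    score = np.where(keep, score, 0.0)                     # trimmed kernel score
    info = (X * (score**2)[:, None]).T @ X                 # sum_i x_i x_i' (f'/f)^2
    return beta_tilde + np.linalg.solve(info, X.T @ score)
```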

3.2. LGMM and LGMMS methods

Newey (1988) describes a two-step AE that avoids kernel estimation. His approach is based on moment conditions that can be derived from certain assumptions on the error distribution. Two situations are analyzed. First, the case where the error terms are i.i.d. and independent of xi. This model implies the moment condition that any function of the errors is uncorrelated with any function of the regressors. Second, the case where the distribution of εi is symmetric (S) around zero conditional on xi. The assumption that the errors are symmetrically distributed around zero yields the moment condition that any odd function of the errors is uncorrelated with any function of the regressors. Hence, in both situations we can exploit moment restrictions. In particular, in the first case, we refer to the linearized generalized method of moments estimator as LGMM. In the second case, we use the short-hand notation LGMMS. For LGMM, natural moment conditions arise from the fact that E[xi(εi^j − E(εi^j))] = 0 for j=1,2,…. However, Newey (1988) finds that these high-order “raw” moments, mj(εi) = εi^j, are sensitive to a fat-tailed error distribution.

Estimates that are more robust against fat tails can be obtained using the “transformed” powers, with mj(εi) = (εi/(1+|εi|))^j, or the “weighted” powers, with mj(εi) = exp(εi/2) εi^j. Similarly, for LGMMS we may use E[xi εi^(2j−1)] = 0 (j=1,2,…). As for LGMM, performance may be improved by using the odd powers of the transformed method instead. However, for technical reasons the weighted powers cannot be used for LGMMS; see Newey (1988). In general, both for LGMM and LGMMS, we use the moment conditions E[xi(mj(εi) − μj)] = 0 (j=1,2,…), where μj = E[mj(εi)].

To define the LGMM(S) estimator, we introduce the following notation for some fixed value J = J(n) of j:
(11) ζi = (m1(εi) − μ1, …, mJ(εi) − μJ)′,  w = E[(m1,ε(εi), …, mJ,ε(εi))′],  Vζζ = Cov(ζi),
where mj,ε(·) = ∂mj(·)/∂ε. Let {ε̂i}i=1n denote the residuals corresponding to the initial estimate β̂; then the quantities in (11) can be consistently estimated by their corresponding sample statistics, i.e., ζ̂i = (m1(ε̂i) − μ̂1, …, mJ(ε̂i) − μ̂J)′, ŵ = (n^−1 Σi m1,ε(ε̂i), …, n^−1 Σi mJ,ε(ε̂i))′, and V̂ζζ = n^−1 Σi ζ̂i ζ̂i′.

The LGMM(S) estimator is given by
(12) β̂LGMM(S) = β̂ + [(ŵ ⊗ X′X)′ (V̂ζζ^−1 ⊗ [X′X]^−1) (ŵ ⊗ X′X)]^−1 × (ŵ ⊗ X′X)′ (V̂ζζ^−1 ⊗ [X′X]^−1) (IJ ⊗ X′) vec(ζ̂),
where ζ̂ is the n × J matrix (ζ̂1,…,ζ̂n)′, IJ is the J × J identity matrix, and X is the n × p regressor matrix whose first column is an n×1 vector of ones. Under certain assumptions, Newey (1988) proves asymptotic normality of the LGMM and LGMMS estimators. In particular, it should hold that J → ∞ and J ln J/ln n → 0 as n → ∞. Asymptotic efficiency is obtained only for LGMMS, not for LGMM. By means of simulation, Newey (1988) finds for LGMM that J = 3 performs best for sample sizes between n = 50 and n = 200. However, the MSE efficiency of the estimator as a function of J flattens out as n increases. Also, the transformed method is in general preferred over the weighted method.
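The sketch below codes (12) for the transformed powers with J = 3. The Kronecker-product layout reflects our reading of (12), and the derivative of the transformed power is worked out by hand, so it should be read as an illustration rather than as Newey's implementation.

```python
import numpy as np

def transformed_moments(e, J=3):
    """'Transformed' powers m_j(e) = (e/(1+|e|))**j, j = 1,...,J, as an n x J matrix."""
    t = e / (1.0 + np.abs(e))
    return np.column_stack([t**j for j in range(1, J + 1)])

def transformed_moment_derivs(e, J=3):
    """d m_j / d e, using d/de [e/(1+|e|)] = 1/(1+|e|)**2."""
    t = e / (1.0 + np.abs(e))
    dt = 1.0 / (1.0 + np.abs(e)) ** 2
    return np.column_stack([j * t**(j - 1) * dt for j in range(1, J + 1)])

def lgmm(X, y, beta_init, J=3):
    """Linearized GMM update of Eq. (12) with transformed powers (a sketch)."""
    n, _ = X.shape
    e = y - X @ beta_init
    M = transformed_moments(e, J)
    Z = M - M.mean(axis=0)                              # zeta_hat_i, an n x J matrix
    w = transformed_moment_derivs(e, J).mean(axis=0)    # w_hat, a J-vector
    V = Z.T @ Z / n                                     # V_hat_zetazeta, J x J
    XtX = X.T @ X
    A = np.kron(w[:, None], XtX)                        # w_hat (x) X'X, of size (J*p) x p
    W = np.kron(np.linalg.inv(V), np.linalg.inv(XtX))   # V^-1 (x) (X'X)^-1
    b = (X.T @ Z).ravel(order="F")                      # (I_J (x) X') vec(zeta_hat)
    AW = A.T @ W
    return beta_init + np.linalg.solve(AW @ A, AW @ b)
```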

3.3. KDRE method

The KDRE method of Yao and Zhao (2013) can be viewed as the unconstrained two-step version of M-KDRE. That is, it follows from unconstrained maximization of the kernel likelihood function that is estimated on the basis of the residuals corresponding to an initial estimate. Under conditions (i)–(v) of Theorem 2.3 these authors prove that the KDRE method is adaptive. For technical reasons, this property is proven for a trimmed version. The untrimmed maximizer of the kernel-based likelihood is the solution to n^−1 Σi=1n {f̂n′(yi|β)/f̂n(yi|β)} xi = 0. The trimmed version is then defined as the solution to n^−1 Σi=1n {f̂n′(yi|β)/f̂n(yi|β)} xi Gb(f̂n(yi|β)) = 0.

Here,
(13) Gb(x) = 0 if x < b;  ∫b^x gb(z)dz if b ≤ x ≤ 2b;  1 if x > 2b,
where gb(·) is a four times continuously differentiable function with support on [b, 2b], and b → 0 as n → ∞. This trimming function is introduced by Linton and Xiao (2007), who suggest the use of the beta function. For the purpose of KDRE, the trimming parameter is only used to simplify the proof and is not part of the actual implementation of the method.

3.4. YDG method

Yuan and De Gooijer (2007) (hereafter YDG) propose another estimator of β0 based on estimating the error density by means of a kernel. The method is a one-step approach and as such does not require an initial estimate. The proposed estimator is given by
(14) β̂ = argmaxβ∈B Σi=1n ln [ ((n−1)hn)^−1 Σj≠i K((r(yi − xi′β) − r(yj − xj′β))/hn) ],
where r(z) = 10 × e^z/(1+e^z). The nonlinear function r(·) is introduced to avoid cancelation of the intercept coefficient in xj′β − xi′β. However, as Yao and Zhao (2013) note, this comes with an efficiency loss; r(z) = z is efficient in the sense that, even though the intercept is canceled out, the slope coefficients are adaptively estimated. They suggest the use of the following estimator:
(15) β̂YDG* = argmaxβ*∈B Σi=1n ln [ ((n−1)hn)^−1 Σj≠i K(((yi − xi*′β*) − (yj − xj*′β*))/hn) ],
where xi* denotes xi without the intercept element, so that xi = (1, xi*′)′, and β̂YDG = (α̂YDG, β̂YDG*′)′ with α̂YDG = n^−1 Σi (yi − xi*′β̂YDG*). The intercept estimate α̂YDG, however, is not in general asymptotically efficient.
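A sketch of the modified estimator (15) with a Gaussian kernel follows; the derivative-free Nelder–Mead maximizer and the least-squares starting value are our choices, made only to keep the example short.

```python
import numpy as np
from scipy.optimize import minimize

def ydg(Xs, y, h):
    """YDG-type estimator: slopes from the pairwise-difference kernel likelihood of Eq. (15),
    intercept from the mean residual. Xs holds the covariates without the column of ones."""
    n = len(y)
    def neg_loglik(b):
        e = y - Xs @ b
        d = (e[:, None] - e[None, :]) / h                  # all pairwise residual differences
        k = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)       # Gaussian kernel
        np.fill_diagonal(k, 0.0)                           # leave-one-out: j != i
        dens = k.sum(axis=1) / ((n - 1) * h)
        return -np.sum(np.log(dens + 1e-300))
    b0 = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)[0]  # crude starting value
    b_hat = minimize(neg_loglik, b0, method="Nelder-Mead").x
    alpha_hat = np.mean(y - Xs @ b_hat)                    # intercept estimate
    return alpha_hat, b_hat
```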

4. Simulation study

4.1. Setup

In order to assess the finite sample practical performance of all reviewed AEs, we conduct a Monte Carlo study. We generate i.i.d. data {(xi,yi)}i=1n from the regression model
(16) yi = xi′β + εi,  β = (1, 1, 2, 0.5, 3, 1, 1, 2, 0.5, 3)′,
where β is a p×1 parameter vector containing an intercept and the parameters corresponding to p − 1 explanatory variables. Here p = 10, but we also consider the cases p = 2 and p = 5, consisting of the first two and first five coefficients of β, respectively. The sample size is set at n = 50, 100, 500, and 1,000. All simulation results are based on 500 replications. The explanatory variables in xi are independent realizations of an N(0, 1) distribution. The errors εi are i.i.d., and we consider the following eight error distributions:

  1. standard normal;

  2. variance-contaminated normal, the mixture 0.9N(0,1/9)+0.1N(0,9);

  3. t-distribution with two degrees of freedom;

  4. bimodal symmetric mixture of two normals, 0.5N(−3,1)+0.5N(3,1);

  5. Unif[−3, 3];

  6. Gamma(2,2);

  7. skewed mixture of normals, 0.3N(1.4,1)+0.7N(−0.6,0.16); and

  8. log-normal, being the distribution of exp(z) for z ∼ N(0,1).

The distributions are centered and scaled to have mean zero and unit variance, when necessary and possible. The t(2)-distribution is left unscaled as its variance is infinite.
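For reference, a sketch of a single draw from error distributions (a)–(h) is given below. Here the draws are centered and scaled empirically, which only approximates the exact standardization of the distributions described above, and the Gamma(2,2) case assumes the shape–scale parameterization.

```python
import numpy as np

def draw_errors(dist, n, rng):
    """Draw n errors from distributions (a)-(h); (a) and (c) are returned unmodified."""
    if dist == "a":                      # standard normal
        return rng.standard_normal(n)
    if dist == "c":                      # t(2): infinite variance, left unscaled
        return rng.standard_t(2, n)
    if dist == "b":                      # variance-contaminated normal
        e = np.where(rng.random(n) < 0.9, rng.normal(0, 1/3, n), rng.normal(0, 3, n))
    elif dist == "d":                    # bimodal symmetric normal mixture
        e = np.where(rng.random(n) < 0.5, rng.normal(-3, 1, n), rng.normal(3, 1, n))
    elif dist == "e":                    # uniform on [-3, 3]
        e = rng.uniform(-3, 3, n)
    elif dist == "f":                    # Gamma(2,2), shape-scale parameterization assumed
        e = rng.gamma(shape=2.0, scale=2.0, size=n)
    elif dist == "g":                    # skewed normal mixture
        e = np.where(rng.random(n) < 0.3, rng.normal(1.4, 1, n), rng.normal(-0.6, 0.4, n))
    else:                                # "h": log-normal
        e = np.exp(rng.standard_normal(n))
    e = e - e.mean()                     # center (empirical approximation)
    return e / e.std()                   # scale to unit sample variance
```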

If required, we use β̂LSE as the initial parameter estimate. In addition, we adopt the standard normal kernel K(·). For the SBS method, we set the trimming parameter at t = 8. Following Newey (1988), we compute the LGMM and LGMMS estimators for the transformed moments with J = 3. Implementing the kernel density-based estimators requires a method for choosing the bandwidth hn. There is a vast literature on this topic, ranging from simple to involved methods, but none of the proposed methods performs best overall. In an extensive simulation study of model (1) with n = 100 and p = 2, Reichardt (2017) concludes that for M-KDRE the bandwidth hn,1 = 1.06 σ̂n n^(−1/5) is preferable for symmetric error distributions in terms of the root mean squared error (RMSE) of the estimators. Here σ̂n is the standard deviation of the data. For skewed distributions, he recommends hn,2 = 0.9 A n^(−1/5), where A = min(σ̂n, R/1.34) with R the inter-quartile range of the data. The KDRE and YDG estimators perform best under hn,1. The SBS estimator generally shows the smallest RMSE for hn,2. Hence, throughout the simulations, we use the above bandwidths.
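The two bandwidth rules are simple to code; a sketch (hn,1 is the normal-reference rule, hn,2 the robust Silverman-type rule quoted above):

```python
import numpy as np

def bandwidth_symmetric(e):
    """h_{n,1} = 1.06 * sigma_hat * n^(-1/5)."""
    return 1.06 * np.std(e, ddof=1) * len(e) ** (-0.2)

def bandwidth_skewed(e):
    """h_{n,2} = 0.9 * A * n^(-1/5), with A = min(sigma_hat, IQR/1.34)."""
    q75, q25 = np.percentile(e, [75, 25])
    A = min(np.std(e, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * A * len(e) ** (-0.2)
```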

4.2. Results

Averaged over all replications, Table 1 provides summary information on the computed RMSEs of the slope and intercept coefficients for all n and p, and across all error distributions. Clearly, M-KDRE shows the best overall performance of all estimators for the slope coefficient, with 64 lowest RMSE values out of a total of 84, i.e., 7 estimation methods (M-KDRE, KDRE, YDG, SBS, LGMM, LGMMS, LSE), 3 values for p, and 4 values for n. On the other hand, the SBS estimator has only one lowest RMSE value, for n = 1,000 and p = 2. The other estimators have counts of lowest RMSE values lying in between these two extremes, with the LSE results as a benchmark. Finally, from the last column of Table 1, we see that M-KDRE performs equally well as YDG, LGMM, and LGMMS in estimating the intercept term, and that M-KDRE markedly outperforms KDRE.

Table 1. Number of times the RMSE attains its lowest value for seven regression estimators, for each n and p and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.

Reichardt (2017) reports RMSE values for each of the eight error distributions. For the sake of space we omit the details. However, for the slope coefficient the simulation results can be summarized as follows.

  • In terms of RMSE, the M-KDRE method performs very well for the log-normal error distribution (h): the RMSE of the second most efficient estimators (KDRE and LGMM) is approximately 40% larger, even for n = 1,000. Under error distributions (b) and (c), M-KDRE is also the most efficient, but here the efficiency gain arises mostly for n = 50 and n = 100. In particular, M-KDRE has a superior performance in small samples under the t(2) error distribution. It is also close to best for error distributions (d)–(g).

  • The YDG method performs well for error distributions (d)–(g), but fails quite dramatically for error distributions with fat tails, i.e. (b), (c), and (h).

  • Efficiency of the SBS estimator is in general low relative to alternatives, but performance is especially weak under error distributions (c) and (d).

  • Overall, LGMM is a reasonable estimator, but its efficiency is lost under error distributions (e) and (f). This efficiency loss persists even for n=1,000.

  • LGMMS is by construction inefficient when the error distribution is skewed: (f)–(h). More interestingly, the LGMMS-estimate of the slope coefficient is no improvement over LGMM under symmetric error distributions. The inefficiency of LGMMS with respect to LGMM is likely to be due to the fact that LGMMS uses moment restrictions on odd powers of the disturbances only and, hence, for a particular value of J, uses higher order moments that may lead to less efficient estimation.

Table 2 shows summary results for the bias of both the slope and intercept estimators for all n and p, and across all error distributions. For the slope coefficient, M-KDRE has the best performance in terms of the lowest bias values. Again, from Reichardt (2017) we learn that the intercept bias of the different estimators is usually of similar magnitude in the symmetric cases. Under the asymmetric error distributions, the bias of the intercept is much larger for KDRE, SBS, and LGMMS than for the other estimators.

Table 2. Number of times the bias of seven regression estimators attains its lowest value for each n and p, and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.

5. Empirical application

Andrabi, Das, and Khwaja (2017) study the impact of providing information in the form of school report cards on educational outcomes such as school fees, test scores, and enrollment in markets with multiple public and private providers. The report cards, given to both households and schools in n randomly sampled villages across three districts in the Punjab province of Pakistan, include information on the performance of the child, the average score of the different schools in the village, and the average village score in mathematics, English, and Urdu. The following three linear regression models are of interest:
(17) Yi,j = αj + βj RCi + γj Yi,j* + δj′Xi,j + εi, (i=1,…,n; j=1, 2, 3),
where Yi,1, Yi,2, and Yi,3 are, respectively, the average fees, test scores, and enrollment rate of village i in the post-intervention year. Yi,j* denotes the baseline measurement of the same variable. RCi is the treatment assignment dummy for village i, so that βj, the parameter of interest, measures the impact of the report card assignment. Xi,j is a vector of village-level baseline controls. All models in the paper are estimated using LSE.

Table 3, column 1, shows the LSEs of βj (j=1, 2, 3) and their corresponding standard errors (in parentheses), as shown in, respectively, (1) panel C, (4) panel C, and (1) panel C of Andrabi, Das, and Khwaja (2017). The Shapiro-Wilk test for normality indicates that the LSE residuals are far from normally distributed, with p-values 0.000 (j=1, n=104), 0.002 (j=2, n=112), and 0.000 (j=3, n=112). Indeed, in all cases, diagnostic statistics show that the residuals have fatter tails than one would expect under normality. Based on the LSE results, Andrabi, Das, and Khwaja (2017) report the following main findings. First, private schools decreased their annualized fees (Yi,1) by an average of 187 rupees, about 17% of their baseline fees, in response to the report card intervention. Second, test scores (Yi,2) increased by 0.11 standard deviations. Third, primary enrollment (Yi,3) increased by 3.2 percentage points, or 4.5%, in treatment villages.

Table 3. Effect of report cards on school fees, test scores, and enrollment as given by parameter estimates of βj (j=1, 2, 3) using seven estimation methods. For columns 2–7, standard errors (in parentheses) are based on 500 bootstrap replicates.

Table 4. Median absolute prediction error (MAPE) of six AEs relative to LSE.

Table 3, columns 2–7, shows the estimates of the six AEs for models j = 1, 2, and 3. We see that these estimators pull the estimated treatment effect toward zero for all models. The results for model 1 are especially striking: the M-KDRE estimate of β1 is more than 40 times smaller, in absolute value, than the LSE. Also, for model 1, the AEs differ substantially. In that respect, it is interesting to investigate the prediction performance of the respective methods.

Table 4 shows the ratio of the median absolute prediction error (MAPE) of each estimator relative to the LSE. The training set is a random sample of the data of size 0.8n. We see that M-KDRE has the lowest MAPE for model 1. In addition, observe that the prediction performance is generally better for AEs with a low estimate of β1, such as KDRE and SBS. This suggests that the effect of the report cards on school fees, if it exists at all, is much lower than reported. For models 2 and 3, there is less difference between the estimates of the AEs. Also, these estimates are adjusted less strongly with respect to LSE.
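One replicate of this out-of-sample comparison can be sketched as follows; the 80/20 split and the pairing against LSE follow the description above, and the estimator argument is a placeholder for any of the AEs.

```python
import numpy as np

def relative_mape(X, y, estimator, rng, train_frac=0.8):
    """Median absolute prediction error of an AE relative to LSE on one random train/test split."""
    n = len(y)
    idx = rng.permutation(n)
    tr, te = idx[: int(train_frac * n)], idx[int(train_frac * n):]
    beta_ae = estimator(X[tr], y[tr])
    beta_ls = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
    mape_ae = np.median(np.abs(y[te] - X[te] @ beta_ae))
    mape_ls = np.median(np.abs(y[te] - X[te] @ beta_ls))
    return mape_ae / mape_ls
```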

Table 5 reports two bootstrapped 95% confidence intervals for βj (j=1, 2, 3) as estimated by M-KDRE. The confidence interval termed “Normal” is based on an asymptotic normality assumption, and the interval termed “Percentile” is based on the central 95% range of the empirical distribution of the 500 bootstrap replicates. Both intervals show that the estimated treatment effect for model 1 is not significantly different from zero.
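A sketch of how the two intervals can be computed is given below. The text does not state whether pairs or residuals are resampled, so we show a pairs bootstrap; the estimator argument stands for the M-KDRE routine (or any other AE).

```python
import numpy as np

def bootstrap_cis(X, y, estimator, n_boot=500, seed=0):
    """Normal-approximation and percentile 95% bootstrap intervals for each coefficient."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = estimator(X, y)
    boot = np.empty((n_boot, len(est)))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                     # resample (x_i, y_i) pairs
        boot[b] = estimator(X[idx], y[idx])
    se = boot.std(axis=0, ddof=1)
    z975 = 1.96                                         # 97.5% standard-normal quantile
    normal_ci = np.stack([est - z975 * se, est + z975 * se], axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    percentile_ci = np.stack([lo, hi], axis=1)
    return normal_ci, percentile_ci
```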

Table 5. Bootstrapped 95% confidence intervals of the effect of report cards for the M-KDRE method.

In summary, the above results demonstrate the practical relevance of AEs in general and of the proposed M-KDRE method in particular. None of the other AEs adjusted the treatment effect on school fees as far toward zero as M-KDRE, while the prediction performance suggests that this method should be preferred over the other methods for this particular linear regression model and sample size. Thus, there is no support for the first finding of Andrabi, Das, and Khwaja (2017) at any reasonable significance level. Further, Table 3 shows that, according to the AEs, the effect of report cards on test scores is not significantly different from zero at the 5% nominal level. Lastly, the effect of report cards on the enrollment rate, even though marginally significant for M-KDRE, also seems questionable.

6. Summary and concluding remarks

In this paper, we proposed an adaptive multi-step kernel density-based regression estimator for linear regression models. We have established the theoretical properties of our estimation method, including asymptotic normality and almost sure convergence. In an extensive simulation study, we have shown that the finite-sample performance of M-KDRE is second to none among five alternative AEs. For several error distributions, it is up to twice as efficient, in terms of RMSE, as the second-best estimator. Further, for every other error distribution, it is either the most efficient or very close to the most efficient estimator. All other AEs show a loss of efficiency for certain specific error distributions. Our empirical application provides a good illustration of many of these issues. In particular, using the M-KDRE method and its corresponding bootstrap standard errors, we found fairly compelling evidence that the treatment effects reported by Andrabi, Das, and Khwaja (2017) are not significantly different from zero.

The results raise several questions for further research. For instance, one may wish to estimate nonlinear regressions via the M-KDRE method. In that case the EM algorithm, at least in its present form, needs to be adjusted. Another issue concerns the fact that the multi-step method makes use of β̂LSE in the initial step. This choice was primarily based on computational convenience. Perhaps, efficiency may be further enhanced by a more prudent choice of the initial estimator. It may also be of interest to assess the robustness of M-KDRE to a violation of the independence assumption. In particular, adaptive estimation is not in general possible when the vector of covariates and the error process are not mutually independent. We leave these questions for future research.

Acknowledgments

The authors would like to thank two anonymous referees for their valuable comments and suggestions.

References

Appendix A: Proofs of results

Lemma A.1.

(Zhang 1990, Theorem 5) If model (1) holds, {εi}i=1n are i.i.d. with unknown density f(x), where f(·) is a uniformly continuous function that satisfies (i) ∫ x f(x)dx = 0, (ii) 0 < ∫ x² f(x)dx < ∞; the set of covariates {xi}i=1n satisfies (iii) ∃ 0 < M < ∞ such that ‖xi‖ < M for all i=1,…,n, (iv) Sn → Q = E(xx′), where Sn = n^−1 Σi=1n xixi′; and the following assumptions on the kernel function K(·) hold: (v) K(x) is uniformly bounded and ∃ 0 < ρ < ∞ such that K(x) = 0 for all x with |x| ≥ ρ, (vi) K(x) is Riemann integrable on [−ρ, ρ], (vii) when n → ∞, 0 < hn → 0 and nhn/ln n → ∞; then
(A.1) supx∈R |f̂n,LSE(x) − f(x)| →a.s. 0,
where f̂n,LSE is the kernel density-based estimator of the LSE residuals ε̂i,LSE = yi − xi′β̂LSE (i=1,…,n).

Lemma A.2.

Suppose that the assumptions of Lemma A.1 hold. Then, if an estimator β* satisfies Pr(limn→∞ max|β* − β0| ≤ max|Sn^−1| · ln max|Sn^−1|) = 1, where max|A| = maxi,j |aij| with aij the elements of a matrix A, then
(A.2) supx∈R |f̂n*(x) − f(x)| →a.s. 0,
where f̂n* is the kernel density-based estimator of the residuals ε̂i = yi − xi′β* (i=1,…,n).

Proof.

This result follows immediately from Theorem 5 and Lemma 4 in Zhang (1990), in conjunction with Theorem 4 and Eq. (29) in Chai, Li, and Tian (1991). □

Lemma A.3.

If there is a function Q0(β) such that (i) Q0(β) is uniquely maximized at β0, (ii) B is compact, (iii) Q0(β) is continuous, (iv) supβ∈B |Q̂n(β) − Q0(β)| →a.s. 0, then, for u=1,2,…, β̂(u) →a.s. β0, where β̂ maximizes the objective function Q̂n(β) subject to β ∈ B. The weak convergence result, i.e., β̂ →p β0, is obtained by replacing condition (iv) with supβ∈B |Q̂n(β) − Q0(β)| →p 0.

Proof:

The proof is similar to the proof of Theorem 2.1 of Newey and McFadden (1994). □

Lemma A.4.

If fn : B → R is a continuous function, B is compact, and fn →a.s. f, then
(A.3) limn→∞ ∫B fn du = ∫B f du.

Proof.

Since B is compact and fn is continuous, the image fn(B) is a compact subset of R and hence closed and bounded. Then, the result follows from the bounded convergence theorem; see, e.g., Wade (1974). □

Proof of Theorem 2.1.

Following Newey and McFadden (1994, Thm. 2.5), we verify the conditions in Lemma A.3. Note that conditions (i)–(iii) are on the density f(·) of ε and on the parameter space B. These conditions hold under the usual regularity conditions of MLE. Condition (iv) of Lemma A.3 implies that we have to prove that supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]| →a.s. 0.

Since f̂n(1) = f̂n,LSE, we have by Lemma A.1 that supx∈R |f̂n(1)(x) − f(x)| →a.s. 0.

Now note that supβ∈B |f̂n(1)(yi|β) − f(yi|β)| ≤ supβ∈Rp |f̂n(1)(yi|β) − f(yi|β)| ≤ supx∈R |f̂n(1)(x) − f(x)|, implying that
(A.4) supβ∈B |f̂n(1)(yi|β) − f(yi|β)| →a.s. 0.

Condition (iii) implies that infβ∈B f(yi|β) > 0. Thus, there exists ε > 0 such that infβ∈B f(yi|β) > ε. Also, by (A.4), for any ε > 0, Pr(limn→∞ supβ∈B |f̂n(1)(yi|β) − f(yi|β)| < ε) = 1, which implies Pr(limn→∞ infβ∈B f̂n(1)(yi|β) > 0) = 1.

This, together with condition (ii), ensures that for n large enough both ln f(yi|β) and ln f̂n(1)(yi|β) are uniformly continuous with probability one, so that by the uniform continuous mapping theorem supβ∈B |ln f̂n(1)(yi|β) − ln f(yi|β)| →a.s. 0.

Note that by conditions (ii) and (iii), ln f̂n(1)(yi|β) is bounded, and we may invoke the uniform law of large numbers, so that
(A.5) supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f̂n(1)(yi|β)]| →a.s. 0.

Also, by Lemma A.4,
(A.6) limn→∞ E[ln f̂n(1)(yi|β)] = E[ln f(yi|β)].

Now define the following quantities: A ≡ |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]|, A1 ≡ |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f̂n(1)(yi|β)]|, and A2 ≡ |E[ln f̂n(1)(yi|β)] − E[ln f(yi|β)]|.

Then, by the triangle inequality, we have A ≤ A1 + A2, and by (A.5) and (A.6), supβ∈B A1 →a.s. 0 and limn→∞ supβ∈B A2 = 0. It follows that condition (iv) of Lemma A.3 holds, i.e.,
(A.7) supβ∈B |n^−1 Σi=1n ln f̂n(1)(yi|β) − E[ln f(yi|β)]| →a.s. 0.

Thus, by Lemma A.3,
(A.8) β̂(1) →a.s. β0.

For the sake of completeness, we also prove that the constraint c(β) = 0 does not affect this result. Let C ⊆ B be the subset for which c(β) = 0, that is, C = {β ∈ B : c(β) = 0}. First, note that C is a level set of the continuous function c(β), so that C is closed. Also, C is bounded since C ⊆ B and B is bounded. Hence, C is compact, and β̂(1) = arg supβ∈C Σi=1n ln f̂n(1)(yi|β). Denote by β˜ = arg supβ∈B Σi=1n ln f̂n(1)(yi|β) the global maximizer of the objective function over B. Newey and McFadden (1994, p. 2122) show that for (A.8) to hold, it suffices to prove that
(A.9) n^−1 Σi=1n ln f̂n(1)(yi|β̂(1)) − n^−1 Σi=1n ln f̂n(1)(yi|β˜) →a.s. 0.

For that purpose, define B ≡ |supβ∈C n^−1 Σi=1n ln f̂n(1)(yi|β) − supβ∈B n^−1 Σi=1n ln f̂n(1)(yi|β)|, B1 ≡ |supβ∈C n^−1 Σi=1n ln f̂n(1)(yi|β) − supβ∈C E[ln f(yi|β)]|, B2 ≡ |supβ∈C E[ln f(yi|β)] − supβ∈B E[ln f(yi|β)]|, and B3 ≡ |supβ∈B E[ln f(yi|β)] − supβ∈B n^−1 Σi=1n ln f̂n(1)(yi|β)|.

Again by the triangle inequality, B ≤ B1 + B2 + B3. From (A.7), it is easy to show that B1 →a.s. 0 and B3 →a.s. 0. To show that B2 →a.s. 0, first observe that by conditions (i) and (ii) of Lemma A.1 and the strong law of large numbers,
(A.10) Pr(limn→∞ [n^−1 Σi=1n (yi − xi′β0)] = 0) = 1.

This implies Pr(limn→∞ β0 ∈ C) = 1, hence Pr(limn→∞ [arg supβ∈B E[ln f(yi|β)]] ∈ C) = 1, and hence Pr(limn→∞ B2 = 0) = 1; the last result implies, by the definition of almost sure convergence, that B2 →a.s. 0. Hence, B →a.s. 0 and the constraint does not affect the result.

Lastly, to show that the algorithm asymptotically converges to β0, remark that (A.8) implies, by Lemma A.2, that supx∈R |f̂n(2)(x) − f(x)| →a.s. 0, where f̂n(2)(x) is the kernel density-based estimator of the residuals corresponding to β̂(1). Thus, by identical reasoning, we obtain β̂(u) →a.s. β0 for u=1,2,…. □

Remark 5.

Conditions (i)–(iv) of Theorem 2.1 are the regularity conditions that are necessary for the convergence of the MLE under the true density. Thus, the only additional conditions imposed are those in Lemma A.1, of which condition (i) of zero mean is without loss of generality in the context of linear regression, as we can always adjust the intercept parameter in β if the center of f(·) is not zero. Condition (ii) of Lemma A.1 may be restrictive in some cases as it rules out, for instance, the t(v)-distribution with 1 < v ≤ 2. However, in Section 4.2 we observed that M-KDRE performs well for t(2); in fact, its performance is the best of all considered estimators under that error distribution. Hence, the practical use of M-KDRE does not seem to be restricted to distributions with finite variance. Conditions (iii) and (iv) of Lemma A.1 are easy to verify in practice, and conditions (v)–(vii) are technical requirements on the kernel and bandwidth. Note that (v) is not satisfied by the Gaussian kernel, since that kernel does not have bounded support. In practice, however, the Gaussian kernel entails a significant computational advantage.

Proof of Theorem 2.3

(sketch): For the case of q (q < p) linear random or deterministic equality constraints, the proof of consistency and of the asymptotic distribution of β̂ can be based on results in Crowder (1984) and Osborne (2000). In particular, given these results, it follows that β̂(1) is asymptotically normal and efficient. The only condition on the initial estimator β̂(0) is that (β̂(0) − β0) = Op(n^−1/2). For β̂(1) this follows from the proof of Theorem 2.1 in Yao and Zhao (2013). Hence, all subsequent estimates β̂(u) (u=2,3,…) also satisfy (8). □

Proof of Theorem 2.4.

Under the Gaussian kernel, the linear constraint c(β) = 0, and a full-kernel method, the M-step in (10) becomes
(A.11) β̂(k+1)(u) = argminβ Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1))²  s.t.  Σi=1n (yi − xi′β) = 0.

This can be solved by Lagrangian optimization. Define the Lagrangian L as
(A.12) L(β,λ) = Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1))² − λ Σi=1n (yi − xi′β),
with first-order conditions
(A.13) ∂L/∂β = −2 Σi=1n Σj=1n pij,(k+1)(u) xi (yi − xi′β − ε̂j(u−1)) + λ Σi=1n xi = 0,
(A.14) ∂L/∂λ = −Σi=1n (yi − xi′β) = 0.

By setting xi,1 ≡ 1 (i=1,…,n), the first element of the first-order condition in (A.13) implies
(A.15) λ = (2/n) Σi=1n Σj=1n pij,(k+1)(u) (yi − xi′β − ε̂j(u−1)) = (2/n) Σi=1n (yi − xi′β) Σj=1n pij,(k+1)(u) − (2/n) Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1) = −(2/n) Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1),
where the last equality follows from (A.14). Then, by plugging λ into (A.13), rearranging terms, and using Σj=1n pij,(k+1)(u) = 1, we obtain
(A.16) β̂(k+1)(u) = (Σi=1n xixi′)^−1 Σi=1n xiyi − (Σi=1n xixi′)^−1 Σi=1n (xi Σj=1n pij,(k+1)(u) ε̂j(u−1)) + (1/n)(Σi=1n xixi′)^−1 (Σi=1n xi)(Σi=1n Σj=1n pij,(k+1)(u) ε̂j(u−1)).
Recognize that the first term is equal to β̂LSE. Then, the fact that (9) and (10) are the E- and M-step, respectively, of an EM type algorithm follows trivially from the proof of Theorem 2.2 in Yao and Zhao (2013). □

Recognize that the first term is equal to β̂LSE. Then, the fact that (Equation9) and (Equation10) are the E- and M-step, respectively, of an EM type algorithm follows trivially from the proof of Theorem 2.2 in Yao and Zhao (Citation2013). □