Abstract
For linear regression models, we propose and study a multi-step kernel density-based estimator that is adaptive to unknown error distributions. We establish asymptotic normality and almost sure convergence. An efficient EM algorithm is provided to implement the proposed estimator. We also compare its finite sample performance with five other adaptive estimators in an extensive Monte Carlo study of eight error distributions. Our method generally attains high mean-square-error efficiency. An empirical example illustrates the gain in efficiency of the new adaptive method when making statistical inference about the slope parameters in three linear regressions.
1. Introduction
In parametric linear regression analysis one often imposes the model assumptions that the errors are independent and normally distributed. The normality assumption is convenient, as it is well known that the maximum likelihood estimator (MLE) of the unknown parameter vector simplifies to the least squares estimator (LSE). Naturally, an invalid assumption on the error distribution F comes at a cost; the MLE is in general neither consistent nor asymptotically efficient under model misspecification. Moreover, in practice, it can lead to inaccurate or invalid statistical inference; see Sec. 5. This has motivated the search for alternative, (semi)parametric, estimators that retain asymptotic efficiency when F is unknown.
One approach is adaptive estimation, which “adapts” to an unknown, or incorrectly specified, distribution F by maximizing an estimated likelihood function based on an initial estimate of the error distribution; see Bickel (1982), Linton and Xiao (2007), Yuan and De Gooijer (2007), and the references therein. The adaptive idea has been studied for (non)linear regression models using non- and semiparametric methods to estimate F or its probability density function (pdf) f.
There are various alternative adaptive estimation methods for non- and semiparametric regression problems with errors of unknown distributional form. For instance, the empirical likelihood method of Owen (2001) has been used to obtain adaptive confidence limits and likelihood ratio test statistics for regression parameters; see also Owen (1988, 1990, 1991), Qin and Lawless (1994), and Kitamura (1997, 2007), among many others. Another example is multivariate adaptive regression splines (MARS) (Friedman 1991), which is a global adaptive nonparametric method for fitting nonlinear regression models. PolyMARS (Kooperberg, Bose, and Stone 1997) is an extension of MARS that allows for multiple polychotomous regression. Time series MARS, or TSMARS, can be used for nonlinear time series analysis and forecasting; see, e.g., De Gooijer (2017). More recently, Wang and Yao (2012) proposed a minimum average variance estimation method, a dimension reduction technique, which can be adaptive to different error distributions. Also, Chen, Wang, and Yao (2015) developed an adaptive estimation method for varying coefficient models.
Recently, Yao and Zhao (2013) proposed an adaptive kernel density-based estimator for classical linear regression models, called KDRE. In the first step, given an estimate of the “true” parameter vector, f is modeled by a kernel density-based estimator of the regression residuals. In the second step, parameter estimates are obtained by maximizing a local, kernel-based, log-likelihood function that treats the first-step estimated density function as the true one. Through a simulation study, Yao and Zhao (2013) show that the resulting KDRE is asymptotically equivalent to the oracle estimator in which the true error pdf is known.
Now, it is well known that each LSE-based residual is the sum of two components: one is the true error, the other is a linear function of the entire vector of errors. Since, in finite samples, the second term will tend to be normally distributed (as long as the errors have finite variance), the residuals in small samples will appear more normal than the unobserved errors themselves. This tendency is called supernormality; see Bassett and Koenker (1982), Bloomfield (1974), and White and MacDonald (1980). Hence, for the two-step KDRE method, the finite-sample properties of the estimator are likely to depend strongly on the empirical distribution of the residuals, which resembles normality more closely than the distribution of the errors itself would, if the errors were in fact observable. This makes the KDRE method nonoptimal.
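The supernormality effect is easy to reproduce numerically. The sketch below (ours, not from the paper) draws markedly non-normal (uniform) errors, forms the LSE residuals through the hat matrix, and compares excess kurtosis: the residuals sit visibly closer to the normal value of zero than the errors do.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.standard_normal((n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix

# Markedly non-normal errors: uniform, scaled to unit variance
eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(5000, n))
resid = eps @ (np.eye(n) - H).T               # LSE residuals: (I - H) eps

def excess_kurtosis(a):
    a = a - a.mean()
    return (a**4).mean() / (a**2).mean()**2 - 3.0

# Uniform errors have excess kurtosis -1.2; the residuals move toward 0
print(excess_kurtosis(eps.ravel()), excess_kurtosis(resid.ravel()))
```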
In this paper, we remedy this deficiency of the KDRE method by further iteration. That is, we maximize over a sequence of kernel-based likelihood functions, each of which follows from the parameter estimates obtained by maximizing the previous likelihood function. The algorithm iterates until the parameter estimates (and, hence, the estimated likelihood function) reach a fixed point. The new estimator is called multi-step KDRE (M-KDRE). In finite samples, one may expect that M-KDRE yields better estimation results; see Robinson (1988). Indeed, in an extensive Monte Carlo study where we compare the finite-sample performance of M-KDRE with five other parametric and (semi)nonparametric adaptive estimators (AEs), we find that iterating over the likelihood functions can strongly increase finite-sample performance. In fact, we find that M-KDRE outperforms all other considered AEs. In an empirical example based on Andrabi, Das, and Khwaja (2017), we show that LSE estimates may be misleading. M-KDRE outperforms the other considered estimators in terms of out-of-sample prediction performance. Furthermore, M-KDRE provides strong evidence that the treatment effect as described in Andrabi, Das, and Khwaja (2017) does not exist.
Theoretically, we establish strong (almost sure) convergence to the true parameter vector under relatively weak conditions. We also show that the M-KDRE method is adaptive, i.e., asymptotically normal and efficient. Furthermore, its computation is made convenient by an EM type algorithm.
The rest of the paper is organized as follows. Section 2 introduces the new adaptive M-KDRE method and contains our theoretical results. An efficient EM type algorithm to implement the proposed estimator is also given in this section. In Section 3, we describe and explain five alternative adaptive estimation methods. Section 4 contains results of a simulation-based study of the finite sample properties of the M-KDRE method and compares it with those of the adaptive estimation methods discussed in Section 3. In Section 5, we present the empirical application of our method to the educational data set as used in Andrabi, Das, and Khwaja (2017). Section 6 gives a summary and some concluding remarks. Proofs are presented in Appendix A.
2. Multi-step kernel density-based regression estimation
2.1. Model and method
Consider the general linear regression problem with observations

$$y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where $y_i$ is a univariate response variable, $\mathbf{x}_i$ is a $p$-dimensional ($p < n$) vector of covariates, and $\boldsymbol{\beta}$ is an unknown parameter vector including an intercept. Here $(\mathbf{x}_i, y_i)$ are independent and identically distributed (i.i.d.) realizations from a common random source. Moreover, the $\varepsilon_i$'s are assumed to have some common unknown pdf $f(\cdot)$, with mean zero and with $\varepsilon_i$ independent of $\mathbf{x}_i$. Model (1) is semiparametric with $\boldsymbol{\beta}$ and $f$ its parametric and non-parametric part, respectively.
Let $\hat{\boldsymbol{\beta}}_{\mathrm{LSE}}$ be the LSE of $\boldsymbol{\beta}$ in (1), which is a natural estimator to start the M-KDRE method. Also, let $\hat{\boldsymbol{\beta}}^{(u)}$ denote an estimator of $\boldsymbol{\beta}$ at iteration step $u$. Then, with the conditions introduced above, the proposed M-KDRE can be obtained as follows.

Initial step: At $u = 0$, start with $\hat{\boldsymbol{\beta}}^{(0)} = \hat{\boldsymbol{\beta}}_{\mathrm{LSE}}$.

(i) Compute the residuals $\hat{\varepsilon}_i^{(u)} = y_i - \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}^{(u)}$, $i = 1, \ldots, n$.

(ii) Compute the Rosenblatt-Parzen kernel-based estimator $\hat{f}^{(u)}(\cdot)$ of $f(\cdot)$. That is,

$$\hat{f}^{(u)}(t) = \frac{1}{n h_n} \sum_{j=1}^{n} K\Big(\frac{t - \hat{\varepsilon}_j^{(u)}}{h_n}\Big), \qquad (2)$$

where $h_n > 0$ is the bandwidth and $K(\cdot)$ a kernel function.

(iii) Then, using (2), compute

$$\hat{\boldsymbol{\beta}}^{(u+1)} = \arg\max_{\boldsymbol{\beta}} \, \ell_n^{(u)}(\boldsymbol{\beta}), \qquad (3)$$

where $\ell_n^{(u)}(\boldsymbol{\beta})$ is the local log-likelihood function

$$\ell_n^{(u)}(\boldsymbol{\beta}) = \sum_{i=1}^{n} \log\Big\{\frac{1}{n h_n} \sum_{j=1}^{n} K\Big(\frac{y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta} - \hat{\varepsilon}_j^{(u)}}{h_n}\Big)\Big\}. \qquad (4)$$

Repeat steps (i)-(iii) until convergence, at iteration step $u = U$ say.
It is worth mentioning that Yao and Zhao (2013) only iterate over the estimated likelihood function of the first, LSE-based, estimate, while in the above case we allow the estimated residuals, and hence the estimated error density, to be updated across iterations. As we show, this generalization increases mean-square-error parameter efficiency.
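The initial LSE step and the iteration over (2)-(4) can be sketched in a few lines of code. The sketch below (all names are ours) assumes a Gaussian kernel and uses the EM update of Section 2.3 as the inner maximizer; it omits the linear constraint imposed in the paper's step (iii), so it is an illustration rather than the authors' implementation.

```python
import numpy as np

def em_update(beta, X, y, resid_fixed, h):
    """Inner maximizer of the kernel likelihood (4), Gaussian kernel assumed."""
    z = ((y - X @ beta)[:, None] - resid_fixed[None, :]) / h
    w = np.exp(-0.5 * z**2)
    p = w / w.sum(axis=1, keepdims=True)          # component probabilities
    return np.linalg.lstsq(X, y - p @ resid_fixed, rcond=None)[0]

def m_kdre(X, y, h, outer=20, inner=50, tol=1e-6):
    """M-KDRE sketch: LSE start, then alternate residual computation (i)
    and maximization of the kernel likelihood (ii)-(iii) to a fixed point."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # initial step: LSE
    for _ in range(outer):
        resid = y - X @ beta                      # step (i): residuals
        b = beta.copy()
        for _ in range(inner):                    # steps (ii)-(iii)
            b_new = em_update(b, X, y, resid, h)
            if np.max(np.abs(b_new - b)) < tol:
                break
            b = b_new
        if np.max(np.abs(b - beta)) < tol:        # outer fixed point reached
            return b
        beta = b
    return beta
```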
2.2. Asymptotic properties

In this section we give the asymptotic properties of the uth M-KDRE, denoted hereafter by $\hat{\boldsymbol{\beta}}^{(u)}$. For convenience, when we emphasize the dependence on the sample size n, we use $\hat{\boldsymbol{\beta}}_n^{(u)}$ to denote $\hat{\boldsymbol{\beta}}^{(u)}$. Technical details, lemmas, and proofs are given in Appendix A.
Theorem 2.1.

(Almost sure (a.s.) convergence) Suppose the assumptions of Lemma A.1 hold, the true parameter vector $\boldsymbol{\beta}_0$ is non-parametrically identifiable, the parameter space containing $\boldsymbol{\beta}_0$ is compact, and the criterion function is continuous in $\boldsymbol{\beta}$ for each n. Then $\hat{\boldsymbol{\beta}}_n^{(u)} \rightarrow \boldsymbol{\beta}_0$ a.s. as $n \rightarrow \infty$.
The following notation is used throughout the next part of the paper. Let $\varepsilon = y - \mathbf{x}^{\top}\boldsymbol{\beta}_0$. Then the score function of $f$ is $\psi(\cdot) = -f'(\cdot)/f(\cdot)$, where $f'$ is the derivative of $f$, and $\mathcal{I}_f = E\{\psi^2(\varepsilon)\}$ is the Fisher information of $f$ for location. Given these notations, the Fisher information matrix (FIM) for the unconstrained linear regression problem, evaluated at $\boldsymbol{\beta}_0$, can be defined as

$$\mathbf{I}(\boldsymbol{\beta}_0) = E\{\psi^2(\varepsilon)\}\,E(\mathbf{x}\mathbf{x}^{\top}) = E\{\psi'(\varepsilon)\}\,E(\mathbf{x}\mathbf{x}^{\top}), \qquad (6)$$

where the second equality holds under mild regularity conditions. If $\mathbf{I}(\boldsymbol{\beta}_0)$ is nonsingular, then $\mathbf{I}^{-1}(\boldsymbol{\beta}_0)$ is the unconstrained Cramér-Rao bound (CRB) for the mean-square error (MSE) covariance matrix of any unbiased estimator of $\boldsymbol{\beta}_0$.
Observe that incorporation of the one-dimensional linear constraint in step (iii) of the M-KDRE method leads to a p-dimensional parameter vector that has only p − 1 independent components. As a consequence, the FIM is singular and the CRB may not be an informative lower bound on the MSE matrix of the resulting estimator, so the asymptotic distribution of $\hat{\boldsymbol{\beta}}_n^{(u)}$ degenerates. For deterministic linear parameter constraints, Stoica and Ng (1998) formulated a constrained CRB (CCRB) that explicitly incorporates the active constraint information with the original FIM, singular or nonsingular. Their general setting is for $q$ ($q < p$) continuously differentiable constraints $g(\boldsymbol{\beta}) = \mathbf{0}$. Assuming $\boldsymbol{\beta}_0$ is regular in the active set of linear constraints, the $q \times p$ gradient matrix $\mathbf{G} = \partial g(\boldsymbol{\beta})/\partial \boldsymbol{\beta}^{\top}$ has full row rank $q$, with $\mathbf{G}$ independent of $\boldsymbol{\beta}$. Hence, there exists a matrix $\mathbf{U}$ whose p-dimensional columns form an orthonormal basis for the null space of $\mathbf{G}$, i.e., such that

$$\mathbf{G}\mathbf{U} = \mathbf{0}, \qquad \mathbf{U}^{\top}\mathbf{U} = \mathbf{I}_{p-q}, \qquad (7)$$

where $\mathbf{I}_{p-q}$ denotes the identity matrix of size p − q. For nonlinear deterministic constraints, $\mathbf{G}$ and $\mathbf{U}$ are functions of $\boldsymbol{\beta}$; see, e.g., Moore, Sadler, and Kozick (2008).
Recently, Ren et al. (2015) extended the deterministic CCRB of Stoica and Ng (1998) to a hybrid parameter vector with both nonrandom and random parameter constraints. In the case of the M-KDRE method the constraint is not deterministic, as it depends on the random residuals $\hat{\varepsilon}_i^{(u)}$. Then the matrix $\mathbf{U}$ in (7) depends on the estimate $\hat{\boldsymbol{\beta}}^{(u)}$ of $\boldsymbol{\beta}_0$, say $\mathbf{U}(\hat{\boldsymbol{\beta}}^{(u)})$. The resulting CCRB, as a special case of the hybrid CCRB of Ren et al. (2015), can be stated as follows.
Theorem 2.2.

Let $\hat{\boldsymbol{\beta}}$ be an unbiased estimate of $\boldsymbol{\beta}_0$ satisfying the active functional constraints $g(\hat{\boldsymbol{\beta}}) = \mathbf{0}$, and let $\mathbf{U}$ be defined in (7). Then, under certain regularity conditions, if $\mathbf{U}^{\top}\mathbf{I}(\boldsymbol{\beta}_0)\mathbf{U}$ is nonsingular,

$$E\big\{(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^{\top}\big\} \geq \mathbf{U}\big\{\mathbf{U}^{\top}\mathbf{I}(\boldsymbol{\beta}_0)\mathbf{U}\big\}^{-1}\mathbf{U}^{\top},$$

where the equality is achieved if and only if $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0$ equals $\mathbf{U}\{\mathbf{U}^{\top}\mathbf{I}(\boldsymbol{\beta}_0)\mathbf{U}\}^{-1}\mathbf{U}^{\top}$ times the score vector, in the mean-square sense.
Remark 1.
Note that rather than requiring a nonsingular FIM, the alternative condition is that $\mathbf{U}^{\top}\mathbf{I}(\boldsymbol{\beta}_0)\mathbf{U}$ is nonsingular. Thus, the unconstrained FIM may be singular, or, equivalently, the unconstrained model unidentifiable, but the constrained model must be identifiable, at least locally. Ren et al. (2015) show that the difference between the standard CRB-based covariance matrix and the CCRB-based covariance matrix is a positive semi-definite matrix. This result is expected since the presence of parameter constraints can be considered as additional information that improves the performance of the estimator under study.
Theorem 2.3.

(Normality and efficiency) Suppose that: (i) model (1) holds; (ii) the $\varepsilon_i$ are i.i.d. with unknown density f(x), where f is a continuous function symmetric around zero with bounded continuous derivatives satisfying suitable integrability conditions; (iii) $K(\cdot)$ is a symmetric and four times continuously differentiable kernel function with bounded support, such that K(x) = 0 outside that support; (iv) the bandwidth $h_n$ tends to zero at a suitable rate as $n \rightarrow \infty$; and (v) the initial estimator is $\sqrt{n}$-consistent. Then $\hat{\boldsymbol{\beta}}_n^{(u)}$ is asymptotically normal and efficient. That is, as $n \rightarrow \infty$,

$$\sqrt{n}\,\big(\hat{\boldsymbol{\beta}}_n^{(u)} - \boldsymbol{\beta}_0\big) \xrightarrow{d} N\big(\mathbf{0}, \, \mathbf{U}\{\mathbf{U}^{\top}\mathbf{I}(\boldsymbol{\beta}_0)\mathbf{U}\}^{-1}\mathbf{U}^{\top}\big). \qquad (8)$$
Remark 2.
All conditions are practical and easy to satisfy. Condition (ii) is used to guarantee the adaptiveness of $\hat{\boldsymbol{\beta}}_n^{(u)}$.
2.3. EM algorithm
In this section, we propose an EM type algorithm by noticing that (4) has a mixture log-likelihood form with an imposed constraint. Specifically, given an initial parameter estimate $\boldsymbol{\beta}^{(u,0)}$ and the set of residuals $\{\hat{\varepsilon}_j^{(u)}\}_{j=1}^{n}$, the $(k+1)$th iteration of the EM algorithm to maximize (4) (the uth likelihood function) is as follows.

E-step: Calculate the classification probabilities,

$$p_{ij}^{(k+1)} = \frac{K_{h_n}\big(y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta}^{(u,k)} - \hat{\varepsilon}_j^{(u)}\big)}{\sum_{l=1}^{n} K_{h_n}\big(y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta}^{(u,k)} - \hat{\varepsilon}_l^{(u)}\big)}, \qquad i, j = 1, \ldots, n, \qquad (9)$$

where $K_{h_n}(\cdot) = K(\cdot/h_n)/h_n$.

M-step: Update the parameter estimate with

$$\boldsymbol{\beta}^{(u,k+1)} = \Big(\sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^{\top}\Big)^{-1} \sum_{i=1}^{n} \mathbf{x}_i \Big(y_i - \sum_{j=1}^{n} p_{ij}^{(k+1)} \hat{\varepsilon}_j^{(u)}\Big), \qquad (10)$$

where (10) follows from using a Gaussian kernel for density estimation. The choice of the kernel is not critical; any symmetric kernel can be used for our method. However, the Gaussian second-order kernel provides an explicit solution of the EM algorithm.
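A minimal sketch of one E-step/M-step pair, assuming a Gaussian kernel so that the M-step (10) reduces to weighted least squares. All names are ours; `resid_fixed` plays the role of the fixed residuals from the current M-KDRE iteration, and the linearity constraint is omitted.

```python
import numpy as np

def loglik(beta, X, y, resid_fixed, h):
    """Mixture log-likelihood (4), up to an additive constant."""
    z = ((y - X @ beta)[:, None] - resid_fixed[None, :]) / h
    return np.log(np.exp(-0.5 * z**2).sum(axis=1) + 1e-300).sum()

def em_step(beta, X, y, resid_fixed, h):
    """One EM update: E-step (9) gives classification probabilities over
    the n kernel components; M-step (10) is weighted least squares."""
    z = ((y - X @ beta)[:, None] - resid_fixed[None, :]) / h
    w = np.exp(-0.5 * z**2)
    p = w / w.sum(axis=1, keepdims=True)      # E-step: p_ij
    y_adj = y - p @ resid_fixed               # shift out the component means
    return np.linalg.lstsq(X, y_adj, rcond=None)[0]   # M-step
```

Iterating `em_step` with `resid_fixed` held fixed never decreases `loglik`, by the standard EM monotonicity argument.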
Theorem 2.4.
Under the linearity constraint, each iteration of the above E- and M-steps will monotonically increase the local log-likelihood (4), i.e., $\ell_n^{(u)}(\boldsymbol{\beta}^{(u,k+1)}) \geq \ell_n^{(u)}(\boldsymbol{\beta}^{(u,k)})$, for all k.
Remark 3.
For the EM type algorithm, we use a full-kernel method rather than a leave-one-out method as in Yao and Zhao (2013). The approach has the following advantage. If a certain residual $\hat{\varepsilon}_j^{(u)}$ is extremely large, the classification probability $p_{ij}$ will be close to zero for all $i \neq j$ and close to one for i = j. This implies that the effect of the residual is limited to its own observation, for which the following iteration is likely to produce a residual of similar magnitude. Hence, the effect of a large residual on the maximization of (4) is small. In the leave-one-out method, the effect of the residual may be considerably larger, as $p_{ij}$ is likely to take a substantial value for several observations.
Remark 4.
The EM type algorithm is considered converged when the largest (absolute) element of $\boldsymbol{\beta}^{(u,k+1)} - \boldsymbol{\beta}^{(u,k)}$ is smaller than a threshold value. In the uth step of the M-KDRE method, the EM algorithm is initialized by the estimate from the previous step, i.e., $\boldsymbol{\beta}^{(u,0)} = \hat{\boldsymbol{\beta}}^{(u)}$.
3. Some alternative adaptive estimation methods
3.1. SBS method
Stone (1975), Bickel (1982), and Schick (1993) (henceforth SBS) introduce a two-step AE. Let $\hat{\boldsymbol{\beta}}_1$ be a certain $\sqrt{n}$-consistent estimator of $\boldsymbol{\beta}_0$. Then, an infeasible two-step estimator can be defined as

$$\tilde{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_1 + \mathbf{I}^{-1}(\hat{\boldsymbol{\beta}}_1)\,\mathbf{S}_n(\hat{\boldsymbol{\beta}}_1),$$

where $\mathbf{I}(\hat{\boldsymbol{\beta}}_1)$ is Fisher's information matrix evaluated at $\hat{\boldsymbol{\beta}}_1$ and $\mathbf{S}_n(\hat{\boldsymbol{\beta}}_1) = n^{-1}\sum_{i=1}^{n} \mathbf{x}_i\,\psi(y_i - \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}_1)$, with $\psi = -f'/f$, is a corresponding score vector. The infeasibility of $\tilde{\boldsymbol{\beta}}_2$ follows from the fact that f is unknown, and hence $\psi$ and $\mathbf{I}(\cdot)$ are unknown. The approach of SBS is to replace $f$ by the kernel estimator $\hat{f}$, where $\hat{f}$ is defined in a similar way as in (2) and $\hat{f}'$ is its derivative with respect to x. Similarly, $\psi$ is replaced with $\hat{\psi} = -\hat{f}'/\hat{f}$.
The conditions under which the two-step AE approach can be shown to be asymptotically efficient have been researched extensively. Most importantly, the residuals entering the kernel estimator of the score function must be (i) i.i.d., and (ii) independent of the initial estimate $\hat{\boldsymbol{\beta}}_1$. These conditions are restrictive and not easy to verify in practice; see, e.g., Yuan and De Gooijer (2007). Bickel (1982) solved the i.i.d. problem by splitting the sample in two: one sub-sample to estimate the score and another to solve for $\boldsymbol{\beta}$. However, Manski (1984) finds that the estimator works much better when the sample is not split, i.e., if the estimated score and $\boldsymbol{\beta}$ are both computed using the entire sample. If (i) and (ii) are satisfied, a sufficient condition for adaptiveness is that (iii) the estimated score converges to the true score in quadratic mean as $n \rightarrow \infty$.

Since $\hat{f}$ appears in the denominator of $\hat{\psi}$, unstable estimates may follow for near-zero values of $\hat{f}$. Hence, Bickel (1982) suggests to trim the estimator of the kernel score, setting it to zero whenever the estimated density falls below a sequence of thresholds. This trimming mechanism ensures that near-zero density values do not have an unreasonably large influence on the estimate. If the trimming parameters grow or shrink at appropriate rates as $n \rightarrow \infty$, then condition (iii) is satisfied. Hence, adaptiveness is established under the proper trimming parameters and conditions (i) and (ii).

Naturally, the growth rates of the trimming parameters are of little use to a practitioner, and as such the choice of the trimming parameters is a practical disadvantage. Hsieh and Manski (1987) reduce the problem to selecting a one-dimensional parameter t by expressing all trimming parameters in terms of t. These authors vary t between 3, 4, and 8. For a sample size of 50, t = 8 works best in almost all cases under study.
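The one-step construction with a trimmed kernel score can be sketched as follows. This is an illustration only: it uses a Gaussian kernel, a crude fixed density threshold in place of the trimming schemes of Bickel (1982) and Hsieh and Manski (1987), and names of our own choosing.

```python
import numpy as np

def sbs_one_step(X, y, h=0.5, trim=1e-2):
    """Two-step update beta_2 = beta_1 + I^{-1} * score, with the kernel
    score -f'/f set to zero where the estimated density is small."""
    n, p = X.shape
    beta1 = np.linalg.lstsq(X, y, rcond=None)[0]   # root-n consistent start
    e = y - X @ beta1
    z = (e[:, None] - e[None, :]) / h
    phi = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    fhat = phi.mean(axis=1) / h                    # kernel density at residuals
    fprime = (-z * phi).mean(axis=1) / h**2        # its derivative
    # Trimmed score estimate: psi_hat = -f'/f, zeroed where fhat is tiny
    score = np.where(fhat > trim, -fprime / np.maximum(fhat, trim), 0.0)
    info = (score**2).mean() * (X.T @ X) / n       # estimated Fisher information
    return beta1 + np.linalg.solve(info, X.T @ score / n)
```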
3.2. LGMM and LGMMS methods
Newey (1988) describes a two-step AE that avoids kernel estimation. His approach is based on moment conditions that can be derived from certain assumptions on the error distribution. Two situations are analyzed. First, the case where the error terms are i.i.d. and independent of the regressors. This model implies the moment condition that any function of the errors is uncorrelated with any function of the regressors. Second, the case where the distribution of $\varepsilon_i$ is symmetric (S) around zero conditional on $\mathbf{x}_i$. The assumption that the errors are symmetrically distributed around zero yields the moment conditions that any odd function of the errors is uncorrelated with any function of the regressors. Hence, in both situations we can exploit moment restrictions. In particular, in the first case, we refer to the linearized general method of moments estimator as LGMM; in the second case, we use the short-hand notation LGMMS. For LGMM, natural moment conditions arise from the fact that $E(\varepsilon^j \mathbf{x}) = E(\varepsilon^j)E(\mathbf{x})$ for $j = 1, 2, \ldots$. However, Newey (1988) finds that these high-order "raw" moments, $\varepsilon^j$, are sensitive to a fat-tailed error distribution.

Estimates that are more robust against fat tails can be obtained using the "transformed" powers or the "weighted" powers of the errors. Similarly, for LGMMS we may use the odd raw powers. As for LGMM, performance may be improved if we use the odd powers of the transformed method instead. However, for technical reasons the weighted powers cannot be used for LGMMS; see Newey (1988). In general, both for LGMM and LGMMS, we use moment conditions of the form $E\{g_j(\varepsilon)(\mathbf{x} - E\mathbf{x})\} = \mathbf{0}$, where $g_j$ denotes the jth moment function.
To define the LGMM(S) estimator, we introduce, for some fixed number of moment functions J, the population moments of the $g_j(\varepsilon)$, their derivatives, and their covariances, (11), where the expectations are taken with respect to f. Let $\hat{\varepsilon}_i$ denote the residuals corresponding to the initial estimate $\hat{\boldsymbol{\beta}}_1$; then the quantities in (11) can be consistently estimated by their corresponding sample statistics. The LGMM(S) estimator is given by a one-step GMM update of $\hat{\boldsymbol{\beta}}_1$ based on these estimates, (12), where the moment functions evaluated at the residuals are collected in an n × J matrix, $\mathbf{I}_J$ is a J × J identity matrix, and $\mathbf{X}$ is an n × p matrix with its first column an n × 1 vector of ones. Under certain assumptions, Newey (1988) proves asymptotic normality of the LGMM and LGMMS estimators; in particular, the number of moment functions J must grow with n at a suitable rate. Asymptotic efficiency is obtained only for LGMMS, not for LGMM. By means of simulation, Newey (1988) finds for LGMM that J = 3 performs best for sample sizes between n = 50 and n = 200. However, the MSE efficiency of the estimator as a function of J flattens out as n increases. Also, the transformed method is in general preferred over the weighted method.
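As a toy illustration of the symmetry-based moment restrictions behind LGMMS (not Newey's linearized estimator, which also involves his transformed powers and optimal weighting), one can minimize a quadratic form in the sample moments of odd raw powers of the residuals; all names below are ours.

```python
import numpy as np
from scipy.optimize import minimize

def lgmms_toy(X, y, powers=(1, 3)):
    """Toy GMM estimator exploiting symmetry: odd powers of the errors are
    uncorrelated with the regressors. Raw powers are used for simplicity;
    they are sensitive to fat tails, as noted in the text."""
    def gbar(beta):
        e = y - X @ beta
        # stacked sample moments n^{-1} sum_i x_i * e_i^j, j odd
        return np.concatenate([X.T @ (e**j) / len(y) for j in powers])
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # initial estimate: LSE
    obj = lambda b: np.sum(gbar(b)**2)             # identity weighting
    return minimize(obj, beta0, method="Nelder-Mead").x
```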
3.3. KDRE method
The KDRE method of Yao and Zhao (2013) can be viewed as the unconstrained two-step version of M-KDRE. That is, it follows from unconstrained maximization of the kernel likelihood function that is estimated on the basis of the residuals corresponding to an initial estimate. Under conditions (i)-(v) of Theorem 2.3, these authors prove that the KDRE method is adaptive. For technical reasons, this property is proven for a trimmed version: the untrimmed maximizer of the kernel-based likelihood solves the first-order condition of the likelihood, while the trimmed version solves the corresponding first-order condition weighted by a trimming function, (13), where the trimming function is four times continuously differentiable with bounded support. This trimming function is introduced by Linton and Xiao (2007), who suggest the use of the beta function. For the purpose of KDRE, the trimming parameter is only used to simplify the proof and is not part of the actual implementation of the method.
3.4. YDG method
Yuan and De Gooijer (2007) (hereafter YDG) propose another estimator of $\boldsymbol{\beta}_0$ based on estimating the error density by means of a kernel. The method is a one-step approach and as such does not require an initial estimate. The proposed estimator, (14), maximizes a kernel-based likelihood in which the errors enter through a nonlinear function $r(\cdot)$, introduced to avoid cancelation of the intercept coefficient. However, as Yao and Zhao (2013) note, this comes with an efficiency loss; r(z) = z is efficient in the sense that even though the intercept is canceled out, the slope coefficients are adaptively estimated. They therefore suggest an estimator, (15), that estimates the slope coefficients with r(z) = z and recovers the intercept separately. The resulting intercept estimate, however, is not in general asymptotically efficient.
4. Simulation study
4.1. Setup
In order to assess the finite sample performance of all reviewed AEs, we conduct a Monte Carlo study. We generate i.i.d. data from the regression model

$$y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (16)$$

where $\boldsymbol{\beta}$ is a $p \times 1$ parameter vector containing an intercept and the parameters corresponding to p − 1 explanatory variables. Here p = 10, but we also consider the cases p = 2 and p = 5, consisting of the first two and first five coefficients of $\boldsymbol{\beta}$, respectively. Four sample sizes n are considered. All simulation results are based on 500 replications. The explanatory variables in $\mathbf{x}_i$ are independent realizations of an N(0, 1) distribution. The errors $\varepsilon_i$ are i.i.d., and we consider the following eight error distributions:

(a) standard normal;
(b) variance-contaminated normal, a mixture of two zero-mean normals with different variances;
(c) t-distribution with two degrees of freedom;
(d) bimodal symmetric mixture of two normals;
(e) uniform;
(f) Gamma(2,2);
(g) skewed mixture of normals; and
(h) log-normal.

The distributions are centered and scaled to have mean zero and unit variance, when necessary and possible. The t(2)-distribution is left unscaled as its variance is infinite.
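Illustrative generators (names ours) for the error laws among (a)-(h) whose parameterizations are unambiguous after standardization; the mixtures (b), (d), (g) and the log-normal depend on parameters given in the paper's notation and are therefore omitted.

```python
import numpy as np

def make_errors(dist, n, rng):
    """Draw n i.i.d. errors from some of the simulation designs (a)-(h)."""
    if dist == "normal":       # (a) standard normal
        return rng.standard_normal(n)
    if dist == "t2":           # (c) t(2), left unscaled (infinite variance)
        return rng.standard_t(2, size=n)
    if dist == "uniform":      # (e) uniform, centered and scaled to variance 1
        return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)
    if dist == "gamma22":      # (f) Gamma(2,2): shape 2, rate 2 (scale 1/2)
        g = rng.gamma(shape=2.0, scale=0.5, size=n)
        return (g - 1.0) / np.sqrt(0.5)    # mean 1, variance 1/2
    raise ValueError(dist)
```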
If required, we use the LSE as an initial parameter estimate. In addition, we adopt the standard normal kernel density for $K(\cdot)$. For the SBS method, we set the trimming parameter at t = 8. Following Newey (1988), we compute the LGMM and LGMMS estimators for the transformed moments with J = 3. Implementing the kernel density-based estimators requires a method for choosing the bandwidth hn. There is a vast literature on this topic, ranging from simple to involved methods, but none of the proposed methods performs best overall. In an extensive simulation study of model (1) with n = 100 and p = 2, Reichardt (2017) concludes that for M-KDRE a bandwidth based on the standard deviation $\hat{\sigma}$ of the data is preferable for symmetric error distributions in terms of root mean squared error (RMSE) of the estimators. For skewed distributions, he recommends a bandwidth based on a robust spread measure involving $\hat{\sigma}$ and the inter-quartile range R of the data. The KDRE, YDG, and SBS estimators each attain their smallest RMSE under particular choices among these bandwidth rules. Hence, throughout the simulations, we use the above bandwidths.
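For reference, Silverman-type plug-in rules of the general kind discussed above, one built from $\hat{\sigma}$ and one from $\min(\hat{\sigma}, R/1.34)$, can be computed as below. The constants 1.06 and 0.9 are the textbook defaults and may differ from those recommended by Reichardt (2017); all names are ours.

```python
import numpy as np

def silverman_bandwidths(x):
    """Reference bandwidths of the form c * A * n^(-1/5)."""
    n = len(x)
    s = np.std(x, ddof=1)
    r = np.subtract(*np.percentile(x, [75, 25]))   # inter-quartile range
    a = min(s, r / 1.34)                           # robust spread estimate
    return {"normal_ref": 1.06 * s * n**(-1/5),
            "robust_ref": 0.9 * a * n**(-1/5)}
```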
4.2. Results
Averaged over all replications, Table 1 provides summary information on the computed RMSEs of the slope and intercept coefficients for all n and p, and across all error distributions. Clearly, M-KDRE shows the best overall performance of all estimators for the slope coefficient, with 64 lowest RMSE values out of a total of 84, i.e., 7 estimation methods (M-KDRE, KDRE, YDG, SBS, LGMM, LGMMS, LSE), 3 values for p, and 4 values for n. On the other hand, the SBS estimator has only one lowest RMSE value, occurring for p = 2. The other estimators attain counts of lowest RMSE values lying in between these two extremes, with the LSE results as a benchmark. Finally, from the last column of Table 1, we see that M-KDRE performs equally well as YDG, LGMM, and LGMMS in estimating the intercept term, and M-KDRE markedly outperforms KDRE.
Table 1. Number of times the RMSE attains its lowest value for seven regression estimators, for each n and p and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.
Reichardt (2017) reports RMSE values for each of the eight error distributions. For the sake of space we omit details. However, for the slope coefficient the simulation results can be summarized as follows.

In terms of RMSE, the M-KDRE method performs very well for the log-normal error distribution (h). That is, the RMSE of the second most efficient estimators (KDRE and LGMM) is approximately 40% larger, even for the largest sample size. Under error distributions (b) and (c), M-KDRE is also most efficient, but here the efficiency is gained mostly for n = 50 and n = 100. Furthermore, M-KDRE has a superior performance in small samples for the t(2) error distribution. It is also close to best for error distributions (d)-(g).
The YDG method performs well for error distributions (d)–(g), but fails quite dramatically for error distributions with fat tails, i.e. (b), (c), and (h).
Efficiency of the SBS estimator is in general low relative to alternatives, but performance is especially weak under error distributions (c) and (d).
Overall, LGMM is a reasonable estimator, but its efficiency is lost under error distributions (e) and (f). This efficiency loss persists even for the largest sample size.
LGMMS is by construction inefficient when the error distribution is skewed: (f)–(h). More interestingly, the LGMMS-estimate of the slope coefficient is no improvement over LGMM under symmetric error distributions. The inefficiency of LGMMS with respect to LGMM is likely to be due to the fact that LGMMS uses moment restrictions on odd powers of the disturbances only and, hence, for a particular value of J, uses higher order moments that may lead to less efficient estimation.
Table 2 shows summary results for the bias of both slope and intercept estimators for all n and p, and across all error distributions. For the slope coefficient, M-KDRE has the best performance in terms of the lowest bias values. Again, from Reichardt (2017) we learn that the intercept bias of the different estimators is usually of similar magnitude in the symmetric cases. Under the asymmetric error distributions, the bias of the intercept is much larger for KDRE, SBS, and LGMMS than for the other estimators.
Table 2. Number of times the bias of seven regression estimators attains its lowest value for each n and p, and across all eight error distribution functions, for the slope coefficient and, in parentheses, for the intercept.
5. Empirical application
Andrabi, Das, and Khwaja (2017) study the impact of providing information in the form of school report cards on educational outcomes such as school fees, test scores, and enrollment in markets with multiple public and private providers. The report cards given to both households and schools in n randomly sampled villages across three districts, in the Punjab province of Pakistan, include information on the performance of the child, the average score of different schools in the village, and the average village score in mathematics, English, and Urdu. The following three linear regression models are of interest:

$$y_i^{(j)} = \beta_j T_i + \delta_j y_{0,i}^{(j)} + \boldsymbol{\gamma}_j^{\top}\mathbf{z}_i + u_i^{(j)}, \qquad j = 1, 2, 3, \qquad (17)$$

where $y_i^{(1)}$, $y_i^{(2)}$, and $y_i^{(3)}$ are average fees, test scores, and enrollment rate in the post-intervention year of village i, respectively; $y_{0,i}^{(j)}$ denotes the baseline measurement of the same variables; $T_i$ is the treatment dummy assigned to village i, which makes βj the variable of interest, an estimate of the impact of the report card assignment; and $\mathbf{z}_i$ is a vector of village-level baseline controls. All models in the paper are estimated using LSE.
Table 3, column 1, shows the LSEs of βj and their corresponding standard errors (in parentheses) as reported in, respectively, Tables (1) panel C, (4) panel C, and (1) panel C of Andrabi, Das, and Khwaja (2017). The Shapiro-Wilk test for normal data indicates that the LSE residuals are far from normally distributed, with p-values 0.000, 0.002, and 0.000 for the three models, respectively. Indeed, in all cases, diagnostic statistics show that the residuals have fatter tails than one would expect based on normality. Based on the LSE results, Andrabi, Das, and Khwaja (2017) report the following main findings. First, private schools decreased their annualized fees by an average of 187 rupees, about 17% of their baseline fees, in response to the report card intervention. Second, test scores increased by 0.11 standard deviation. Third, primary enrollment increased by 3.2 percentage points, or 4.5%, in treatment villages.
Table 3. Effect of report cards on school fees, test scores, and enrollment as given by parameter estimates of βj using seven estimation methods. For columns 2–7, standard errors (in parentheses) are based on 500 bootstrap replicates.
Table 4. Median absolute prediction error (MAPE) of six AEs relative to LSE.
Table 3, columns 2–7, shows the estimates of the six AEs for models 1–3. We see that these estimators pull the estimated treatment effect toward zero for all models. The results for model 1 are especially striking: the M-KDRE of β1 is more than 40 times smaller than the LSE, in absolute value. Also, for model 1, the AEs differ substantially. In that respect, it is interesting to investigate the prediction performance of the respective methods.

Table 4 shows the ratio of the median absolute prediction error (MAPE) of an estimator relative to the LSE. The training set is a random sample of the data. We see that M-KDRE has the lowest MAPE for model 1. In addition, observe that the prediction performance is generally better for AEs with a low estimate of β1, such as KDRE and SBS. This suggests that the effect of the report cards on school fees, if it exists at all, is much lower than reported. For models 2 and 3, there is less difference between the estimates of the AEs. Also, the estimates are adjusted less strongly with respect to LSE.

Table 5 reports two bootstrapped 95% confidence intervals for βj as estimated by M-KDRE. The confidence interval termed “Normal” is based on an asymptotic normality assumption, and the column called “Percentile” is based on the 2.5% and 97.5% percentiles of the empirical distribution of 500 bootstrap replicates. Both intervals show that the estimated treatment effect for model 1 is not significantly different from zero.
Table 5. Bootstrapped 95% confidence intervals of the effect of report cards for the M-KDRE method.
In summary, the above results demonstrate the practical relevance of the AEs in general and that of the proposed M-KDRE method in particular. None of the other AEs adjusted the treatment effect on school fees as far toward zero as M-KDRE, while prediction performance suggests that this method should be preferred over the other methods for this particular linear regression model and sample size. Thus, there is no support for the first finding of Andrabi, Das, and Khwaja (2017) at any reasonable significance level. Further, Table 3 shows that the AEs find that the effect of report cards on test scores is not significantly different from zero at the 5% nominal level. Lastly, the effect of report cards on the enrollment rate seems, even though marginally significant for M-KDRE, also questionable.
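The two interval types reported above can be sketched as follows for a generic coefficient. All names are ours; `estimator` is any map from data to a coefficient vector (illustrated with the LSE), and the use of a pairs bootstrap is an assumption on our part.

```python
import numpy as np

def bootstrap_cis(X, y, estimator, B=500, seed=0):
    """'Normal' and 'Percentile' 95% bootstrap intervals for the last
    coefficient, using a pairs bootstrap over the observations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    bhat = estimator(X, y)[-1]
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)             # resample (x_i, y_i) pairs
        reps[b] = estimator(X[idx], y[idx])[-1]
    se = reps.std(ddof=1)
    z = 1.959963984540054                       # Phi^{-1}(0.975)
    lo, hi = np.percentile(reps, [2.5, 97.5])
    return {"normal": (bhat - z * se, bhat + z * se),
            "percentile": (lo, hi)}
```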
6. Summary and concluding remarks
In this paper, we proposed an adaptive multi-step kernel density-based regression estimator for linear regression models. We established the theoretical properties of our estimation method, including asymptotic normality and almost sure convergence. In an extensive simulation study, we showed that the finite sample performance of M-KDRE is second to none of five alternative AEs. For several error distributions, it is up to twice as efficient in terms of RMSE as the second best estimator. Further, for any other error distribution, it is either the most efficient or very close to the most efficient estimator. All other AEs show a loss of efficiency for certain specific error distributions. Our empirical application provides a good illustration of many of these issues. In particular, using the M-KDRE method and its corresponding bootstrap standard errors, we found fairly compelling evidence that the treatment effects that Andrabi, Das, and Khwaja (2017) find are not significantly different from zero.
The results raise several questions for further research. For instance, one may wish to estimate nonlinear regressions via the M-KDRE method. In that case the EM algorithm, at least in its present form, needs to be adjusted. Another issue concerns the choice of initial estimator in the first step of the multi-step method, which was primarily based on computational convenience. Efficiency may perhaps be further enhanced by a more prudent choice of the initial estimator. It may also be of interest to assess the robustness of M-KDRE to a violation of the independence assumption. In particular, adaptive estimation is not in general possible when the vector of covariates and the error process are not mutually independent. We leave these questions for future research.
Acknowledgments
The authors would like to thank two anonymous referees for their valuable comments and suggestions.
References
- Andrabi, T., J. Das, and A. I. Khwaja. 2017. Report cards: The impact of providing school and child test scores on educational markets. American Economic Review 107 (6):1535–63. doi:https://doi.org/10.1257/aer.20140774.
- Bassett, G. W., and R. W. Koenker. 1982. An empirical quantile function for linear models with iid errors. Journal of the American Statistical Association 77:407–15. doi:https://doi.org/10.2307/2287261.
- Bickel, P. J. 1982. On adaptive estimation. The Annals of Statistics 10 (3):647–71. doi:https://doi.org/10.1214/aos/1176345863.
- Bloomfield, P. 1974. On the distribution of the residuals from a fitted linear model. Technical report, Department of Statistics, Princeton University, Princeton, NJ, Series 2; 56.
- Chai, G., Z. Li, and H. Tian. 1991. Consistent nonparametric estimation of error distributions in linear model. Acta Mathematicae Applicatae Sinica 7 (3):245–56. doi:https://doi.org/10.1007/BF02005973.
- Chen, Y., Q. Wang, and W. Yao. 2015. Adaptive estimation for varying coefficient models. Journal of Multivariate Analysis 137:17–31. doi:https://doi.org/10.1016/j.jmva.2015.01.017.
- Crowder, M. 1984. On constrained maximum likelihood estimation with non-i.i.d. observations. Annals of the Institute of Statistical Mathematics 36 (2):239–49. doi:https://doi.org/10.1007/BF02481968.
- De Gooijer, J. G. 2017. Elements of nonlinear time series analysis and forecasting. New York: Springer-Verlag.
- Friedman, J. H. 1991. Multivariate adaptive regression splines. The Annals of Statistics 19 (1):1–141 (with discussion).
- Hsieh, D. A., and C. F. Manski. 1987. Monte Carlo evidence on adaptive maximum likelihood estimation of a regression. The Annals of Statistics 15:541–51. doi:https://doi.org/10.1214/aos/1176350359.
- Kitamura, Y. 1997. Empirical likelihood methods with weakly dependent processes. The Annals of Statistics 25 (5):2084–102. doi:https://doi.org/10.1214/aos/1069362388.
- Kitamura, Y. 2007. Empirical likelihood methods in econometrics: Theory and practice. In Advances in economics and econometrics: Theory and applications, Ninth World Congress, Econometric Society Monographs, ed. R. Blundell, W. Newey, and T. Persson, pp. 174–237. Cambridge: Cambridge University Press. doi:https://doi.org/10.1017/CBO9780511607547.008.
- Kooperberg, C., S. Bose, and C. J. Stone. 1997. Polychotomous regression. Journal of the American Statistical Association 92 (437):117–27. doi:https://doi.org/10.1080/01621459.1997.10473608.
- Linton, O., and Z. Xiao. 2007. A nonparametric regression estimator that adapts to error distribution of unknown form. Econometric Theory 23 (3):371–413. doi:https://doi.org/10.1017/S026646660707017X.
- Manski, C. F. 1984. Adaptive estimation of non-linear regression models. Econometric Reviews 3 (2):145–94. doi:https://doi.org/10.1080/07474938408800060.
- Moore, T. J., B. M. Sadler, and R. J. Kozick. 2008. Maximum-likelihood estimation, the Cramér-Rao bound, and the method of scoring with parameter constraints. IEEE Transactions on Signal Processing 56 (3):895–908. doi:https://doi.org/10.1109/TSP.2007.907814.
- Newey, W. K. 1988. Adaptive estimation of regression models via moment restrictions. Journal of Econometrics 38 (3):301–39. doi:https://doi.org/10.1016/0304-4076(88)90048-6.
- Newey, W. K., and D. McFadden. 1994. Large sample estimation and hypothesis testing. In Handbook of econometrics, ed. R. F. Engle and D. L. McFadden, Vol. 4, 2111–245. New York: Elsevier.
- Osborne, M. R. 2000. Scoring with constraints. The ANZIAM Journal 42 (1):9–25. doi:https://doi.org/10.1017/S1446181100011561.
- Owen, A. B. 1988. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75 (2):237–49. doi:https://doi.org/10.1093/biomet/75.2.237.
- Owen, A. B. 1990. Empirical likelihood confidence regions. The Annals of Statistics 18 (1):90–120. doi:https://doi.org/10.1214/aos/1176347494.
- Owen, A. B. 1991. Empirical likelihood for linear models. The Annals of Statistics 19 (4):1725–47. doi:https://doi.org/10.1214/aos/1176348368.
- Owen, A. B. 2001. Empirical likelihood. Boca Raton, FL: Chapman & Hall/CRC.
- Qin, J., and J. Lawless. 1994. Empirical likelihood and general estimating equations. The Annals of Statistics 22 (1):300–25. doi:https://doi.org/10.1214/aos/1176325370.
- Reichardt, H. 2017. Adaptive estimation in linear regression using repeated kernel error density estimation. MSc thesis, Econometrics, Erasmus University Rotterdam. https://thesis.eur.nl/pub/38903/.
- Ren, C., J. Le Kernec, J. Galy, E. Chaumette, P. Larzabal, and A. Renaux. 2015. A constrained hybrid Cramér-Rao bound for parameter estimation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, 3472–6.
- Robinson, P. M. 1988. Root-n-consistent semiparametric regression. Econometrica 56 (4):931–54. doi:https://doi.org/10.2307/1912705.
- Schick, A. 1993. On efficient estimation in regression models. The Annals of Statistics 21 (3):1486–521. doi:https://doi.org/10.1214/aos/1176349269.
- Stoica, P., and B. C. Ng. 1998. On the Cramér-Rao bound under parametric constraints. IEEE Signal Processing Letters 5:177–9. doi:https://doi.org/10.1109/97.700921.
- Stone, C. J. 1975. Adaptive maximum likelihood estimators of a location parameter. The Annals of Statistics 3 (2):267–84. doi:https://doi.org/10.1214/aos/1176343056.
- Wade, W. 1974. The bounded convergence theorem. The American Mathematical Monthly 81 (4):387–9. doi:https://doi.org/10.2307/2319009.
- Wang, Q., and W. Yao. 2012. An adaptive estimation of MAVE. Journal of Multivariate Analysis 104 (1):88–100. doi:https://doi.org/10.1016/j.jmva.2011.07.001.
- White, H., and G. M. MacDonald. 1980. Some large-sample tests for nonnormality in the linear regression model. Journal of the American Statistical Association 75 (369):16–28. doi:https://doi.org/10.1080/01621459.1980.10477415.
- Yao, W., and Z. Zhao. 2013. Kernel density-based linear regression estimate. Communications in Statistics - Theory and Methods 42 (24):4499–512. doi:https://doi.org/10.1080/03610926.2011.650269.
- Yuan, A., and J. G. De Gooijer. 2007. Semiparametric regression with kernel error model. Scandinavian Journal of Statistics 34:841–69. doi:https://doi.org/10.1111/j.1467-9469.2006.00531.x.
- Zhang, W. Y. 1990. On the congruent kernel estimate of error distributions in linear model (in Chinese). Journal of Sichuan University (Natural Science Edition) 27:132–44.
Appendix A: Proofs of results
Lemma A.1.
(Zhang Citation1990, Theorem 5) If model (1) holds, the errors are i.i.d. with unknown density f(x), where
is a uniformly continuous function that satisfies (i)
, (ii)
, the set of covariates
satisfies (iii)
such that
, (iv)
where
, and the following assumptions on the kernel function
hold: (v) K(x) is uniformly bounded and
such that K(x) = 0
(vi) K(x) is Riemann integrable on
(vii) when
and
, then
(A.1)
where
is the kernel density-based estimator of the
residuals
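As a numerical illustration of the uniform convergence in (A.1), the following sketch simulates a linear model with standard-normal errors, builds a Gaussian-kernel density estimate from the OLS residuals (note that the Gaussian kernel does not satisfy condition (v), so this is purely illustrative), and reports the sup-distance to the true density on a grid. All names and design choices here are ours, not the paper's.

```python
import numpy as np

def sup_error_residual_kde(n, h=None, seed=0):
    """sup |f_hat - f| on a grid, where f_hat is a Gaussian-kernel KDE
    built from the OLS residuals (not the true errors) of a simulated
    linear model with standard-normal errors.  Illustrative only."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    if h is None:
        h = 1.06 * resid.std(ddof=1) * n ** (-0.2)   # Silverman bandwidth
    grid = np.linspace(-4, 4, 401)
    z = (grid[:, None] - resid[None, :]) / h
    f_hat = np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    f_true = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
    return np.max(np.abs(f_hat - f_true))
```

As n grows the reported sup-error shrinks, in line with the almost sure convergence the lemma asserts for residual-based density estimates.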
Lemma A.2.
Suppose that the assumptions of Lemma A.1 hold. Then, if any estimator satisfies
where
with aij the elements of a matrix A,
(A.2)
where
is the kernel density-based estimator of the residuals
Proof.
This result follows immediately from Theorem 5 and Lemma 4 in Zhang (Citation1990), in conjunction with Theorem 4 and Eq. (29) in Chai, Li, and Tian (Citation1991). □
Lemma A.3.
If there is a function such that (i)
is uniquely maximized at
, (ii)
is compact, (iii)
is continuous, (iv)
, then for
, where
maximizes the objective function
subject to
. The weak convergence result, i.e.,
can be obtained by replacing condition (iv) by
Proof.
The proof is similar to the proof of Theorem 2.1 of Newey and McFadden (Citation1994). □
Lemma A.4.
If is a continuous function,
is compact, and
, then
(A.3)
Proof.
Since is compact and fn is continuous, the image
is a compact subset of
and hence, closed and bounded. Then, the result follows from the bounded convergence theorem; see, e.g., Wade (Citation1974). □
Proof of Theorem 2.1.
Following Newey and McFadden (Citation1994, Thm. 2.5), we verify the conditions in Lemma A.3. Note that conditions (i)–(iii) are on the density of
and on the parameter space
These conditions hold under the usual regularity conditions of MLE. Condition (iv) of Lemma A.3 implies that we have to prove that
Since we have by Lemma A.1
Now note that
implying that
(A.4)
Condition (iii) implies that Thus,
such that
Also, by (A.4) for any
This, together with condition (ii), ensures that for n large enough both and
are uniformly continuous with probability one such that by the uniform continuous mapping theorem
Note that by conditions (ii) and (iii), is bounded and we may invoke the uniform law of large numbers such that,
(A.5)
Also, by Lemma A.4,
(A.6)
Now define the following variables
Then, by the triangle inequality, we have and by (A.5) and (A.6),
and
From condition (iv) of Lemma A.3 it follows that
(A.7)
Thus, by Lemma A.3,
(A.8)
For the sake of completeness, we prove also that the constraint does not affect this result. Let
be the subset for which
That is,
First, note that
is the level set of the continuous function
such that
is closed. Also,
is bounded since
and
is bounded. Hence,
is compact such that
Denote
as the global maximizer of the objective function over
Newey and McFadden (Citation1994, p. 2122) show that for (A.8) to hold, it suffices to prove that
(A.9)
For that purpose, define
Again by the triangle inequality, From (A.7), it is easy to show that
and
To show that
first observe that by conditions (i) and (ii) of Lemma A.1 and the strong law of large numbers,
(A.10)
This implies
and the last result implies, by definition of almost sure convergence, that
Hence,
and the constraint does not affect the result.
Lastly, to show that the algorithm asymptotically converges to remark that (A.8) implies by Lemma A.2 that
where
is the kernel density-based estimator of the residuals corresponding to
Thus, by identical reasoning, we obtain
for
□
Remark 5.
Conditions (i)–(iv) of Theorem 2.1 are the regularity conditions necessary for the convergence of the MLE under the true density. Thus, the only additional conditions imposed are those in Lemma A.1, of which the zero-mean condition (i) holds without loss of generality in the context of linear regression, as we can always adjust the intercept parameter if the center of the error distribution is not zero. Condition (ii) of Lemma A.1 may be restrictive in some cases, as it rules out, for instance, the t(v)-distribution with small v.
However, in Section 4.2 we observed that M-KDRE performs well for t(2). In fact, its performance is the best of all considered estimators under that error distribution. Hence, the practical use of M-KDRE does not seem to be restricted to distributions with finite variance. Conditions (iii) and (iv) of Lemma A.1 are easy to verify in practice, and conditions (v)–(vii) are technical requirements on the kernel and bandwidth. Note that (v) is not satisfied by the Gaussian kernel, since that kernel does not have bounded support. In practice, however, the Gaussian kernel entails a significant computational advantage.
Proof of Theorem 2.3
(sketch): For the case of q (q < p) linear random or deterministic equality constraints, the proof of consistency and of the asymptotic distribution of can be based on results in Crowder (Citation1984) and Osborne (Citation2000). In particular, given these results, it follows that
is asymptotically normal and efficient. The only condition on the initial estimator
is that
For
this follows from the proof of Yao and Zhao (Citation2013, Thm. 2.1). Hence, all subsequent estimates also satisfy (8). □
Proof of Theorem 2.4.
Under the Gaussian kernel, the linear constraint, and a full-kernel method, the M-step in (10) becomes
(A.11)
This can be solved by Lagrangian optimization. Define the Lagrangian as
(A.12)
with first-order conditions
(A.13)
(A.14)
By setting
the first element of the first-order condition in (A.13) implies
(A.15)
where the last equality follows from (A.14). Then, by plugging λ into (A.13), rearranging terms, and using
we obtain
(A.16)
Recognize that the first term is equal to Then, the fact that (9) and (10) are the E- and M-step, respectively, of an EM-type algorithm follows trivially from the proof of Theorem 2.2 in Yao and Zhao (Citation2013). □
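The Lagrange-multiplier computation above follows a standard pattern: differentiate, solve the first-order conditions for the multiplier using the constraint, and substitute back. The toy problem below (a quadratic objective with one linear equality constraint; the objective and all names are our own assumptions, not the paper's M-step) goes through exactly these steps in closed form.

```python
import numpy as np

def constrained_quadratic_max(Q, c, a, d):
    """Maximize -0.5 b'Qb + c'b subject to a'b = d, Q positive definite.

    Mirrors the structure of a Lagrangian argument: the stationarity
    condition gives b = Q^{-1}(c - lam*a); imposing the constraint
    yields lam in closed form, which is then substituted back.
    """
    Qi_c = np.linalg.solve(Q, c)               # Q^{-1} c
    Qi_a = np.linalg.solve(Q, a)               # Q^{-1} a
    lam = (a @ Qi_c - d) / (a @ Qi_a)          # solve the constraint for lambda
    b = Qi_c - lam * Qi_a                      # plug lambda back into the FOC
    return b, lam
```

At the solution, the gradient of the objective equals λ times the constraint gradient, which is the same stationarity structure the proof exploits when solving for λ.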