A short note on fitting a single-index model with massive data

Pages 49-60 | Received 17 Jun 2021, Accepted 02 Oct 2022, Published online: 20 Oct 2022

Abstract

This paper studies the inference problem for the index coefficient in single-index models with massive datasets. Analysis of a massive dataset is challenging owing to formidable computational costs or memory requirements. A natural method is the averaging divide-and-conquer approach, which splits the data into several blocks, obtains an estimator for each block and then aggregates the estimators via averaging. However, this approach imposes a restriction on the number of blocks. To overcome this limitation, this paper proposes a computationally efficient method that only requires an initial estimator and then successively refines it via multiple rounds of aggregation. The proposed estimator achieves the optimal convergence rate without any restriction on the number of blocks. We present both theoretical analysis and experiments to explore the properties of the proposed method.

1. Introduction

Single-index models provide an efficient way of coping with high-dimensional nonparametric estimation problems and avoid the ‘curse of dimensionality’ by assuming that the response is related to the covariates only through a single linear combination. Because of their usefulness in several areas, such as discrete choice analysis in econometrics and dose–response models in biometrics, we restrict our attention to the single-index model of the form
(1) Y = g_0(X^⊤γ_{01}) + ε,
where Y is the univariate response and X is the p-dimensional covariate vector. The function g_0(·) is an unspecified nonparametric smooth function and γ_{01} is the unknown index coefficient vector. For identifiability, one imposes certain conditions on γ_{01}, and we assume that γ_{01} = (1, γ_0^⊤)^⊤ with γ_0 ∈ R^{p−1}. This ‘remove-one-component’ parametrization of γ_{01} has also been applied in Christou and Akritas (2016), Delecroix et al. (2006) and Ichimura (1993). The errors ε are independent and identically distributed with E[ε | X] = 0.

In the single-index model (1), the primary parameter of interest is the coefficient γ_{01} in the index X^⊤γ_{01}, since γ_{01} makes explicit the relationship between the response Y and the covariates X. Various strategies for estimating γ_{01} have been proposed in the literature; see, for example, Jiang et al. (2013), Jiang et al. (2016), Tang et al. (2018), Wu et al. (2010) and Xia et al. (2002).

The development of modern technology has enabled data collection of unprecedented size. For instance, Facebook had 1.55 billion monthly active users in the third quarter of 2015. In recent years, the statistical analysis of such massive datasets has become a subject of increased interest. However, when the sample size is excessively large, there are two major obstacles: first, the data can be too big to be held in a computer's memory; second, the computation can take too long to complete. Statisticians have made important contributions to overcoming these obstacles. One of the resulting methods, the averaging divide-and-conquer (ADC) approach, has been widely adopted. The main idea of ADC is to first compute local estimators on each block and then take their average; see Chen and Xie (2014), Chen et al. (2019), Jiang et al. (2020), Lin and Xi (2011), among others.

These averaging-based ADC approaches suffer from one drawback. For the averaging estimator to achieve the optimal convergence rate of the estimator that uses all data points at once, each block must contain at least O(√n) samples, where n is the sample size of the full data set. In other words, the number of blocks should be much smaller than √n, which is a highly restrictive assumption. Jordan et al. (2019) proposed the communication-efficient surrogate likelihood procedure for distributed statistical learning, which relaxes the condition on the number of blocks. However, their method cannot be applied to estimate the unknown index coefficient vector in the single-index model (1), owing to the unknown nonparametric link function.

This paper proposes an iterative divide-and-conquer (IDC) method for estimating the unknown index coefficient vector in model (1) with massive datasets, which reduces the required primary memory and computation time. The proposed IDC method removes the constraint on the number of blocks in the ADC method: it only requires an initial estimator and then successively refines the estimator via multiple rounds of aggregation. The resulting estimator is as efficient as the estimator computed on the entire dataset.

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed procedures for model (1). Section 3 presents simulation examples and applications to two real datasets to illustrate the proposed procedures. Final remarks are given in Section 4. All the conditions and their discussion, as well as the technical proofs, are relegated to the Appendix.

2. Methodology and main results

2.1. Iterative divide-and-conquer method

We first review the estimation method for the full data (Wang et al., 2010), which can be carried out on a single machine. Let {X_i, Y_i}_{i=1}^{n} be an independent and identically distributed (i.i.d.) sample from (X, Y). We can obtain the estimator γ̂ of γ_0 by minimizing
(2) Σ_{i=1}^{n} {Y_i − ĝ(X_i^⊤γ_1, γ)}²,
where γ_1 = (1, γ^⊤)^⊤, γ ∈ R^{p−1},
(3) ĝ(u, γ) = [A_{2,0}(u, γ_1, h_1)A_{0,1}(u, γ_1, h_1) − A_{1,0}(u, γ_1, h_1)A_{1,1}(u, γ_1, h_1)] / [A_{0,0}(u, γ_1, h_1)A_{2,0}(u, γ_1, h_1) − A_{1,0}²(u, γ_1, h_1)],
A_{l,s}(u, γ_1, h_r) = Σ_{i=1}^{n} (X_i^⊤γ_1 − u)^l Y_i^s K_{h_r}(X_i^⊤γ_1 − u) for l = 0, 1, 2, s = 0, 1, r = 1, 2, K_h(·) = K(·/h)/h, K(·) is a symmetric kernel function and h_1, h_2 are bandwidths.
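
For concreteness, the local linear smoother in (3) can be written out directly. The following R sketch computes ĝ(u, γ) on the full data; the Epanechnikov kernel (used later in Section 3) is picked here for illustration, and the helper names (K, Kh, A, g_hat) are ours, not the authors' code.

```r
# A minimal sketch of the full-data local linear estimator (3).
K  <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)   # Epanechnikov kernel
Kh <- function(u, h) K(u / h) / h                    # scaled kernel K_h

# A_{l,s}(u, gamma_1, h) = sum_i (X_i'gamma_1 - u)^l * Y_i^s * K_h(X_i'gamma_1 - u)
A <- function(l, s, u, index, Y, h) sum((index - u)^l * Y^s * Kh(index - u, h))

# ghat(u, gamma): intercept of the local linear fit of Y on the index at point u
g_hat <- function(u, X, Y, gamma1, h1) {
  index <- drop(X %*% gamma1)                        # X_i' gamma_1 for all i
  (A(2, 0, u, index, Y, h1) * A(0, 1, u, index, Y, h1) -
   A(1, 0, u, index, Y, h1) * A(1, 1, u, index, Y, h1)) /
  (A(0, 0, u, index, Y, h1) * A(2, 0, u, index, Y, h1) -
   A(1, 0, u, index, Y, h1)^2)
}
```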

For a massive dataset, however, we cannot obtain the estimator of γ_0 in this way, because a single computer may be unable to store the full data or may take too long to solve the optimization problem in (2).

Assume that the n samples are partitioned into M subsets. In particular, we split the index set {1, …, n} into S_1, …, S_M, where S_m denotes the set of indices in the m-th block, m = 1, …, M. Without loss of generality, each block has sample size ñ = n/M, which is assumed to be an integer.

The averaging divide-and-conquer (ADC) estimator of γ_0 (Jiang et al., 2020) is given by
(4) γ̂_ADC = (1/M) Σ_{m=1}^{M} γ̂_m,
where γ̂_m is obtained by minimizing (2) over the subset S_m.
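
The averaging step in (4) is simple to express in code. The sketch below assumes a user-supplied routine fit_block() that minimizes (2) on a single block (for instance via the np package); the function names and the cyclic splitting rule are our own illustrative choices.

```r
# A schematic ADC estimator: split, fit each block, average the block estimators.
adc_estimate <- function(X, Y, M, fit_block) {
  n      <- nrow(X)
  blocks <- split(seq_len(n), rep(seq_len(M), length.out = n))  # index sets S_1,...,S_M
  gammas <- lapply(blocks, function(idx)
    fit_block(X[idx, , drop = FALSE], Y[idx]))                  # gamma-hat_m on each block
  Reduce(`+`, gammas) / M                                       # (4): average over blocks
}
```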

In many applications, such as sensor networks, data are naturally collected and stored across many machines. However, by the results of Theorem 4.1 in Jiang et al. (2020), for γ̂_ADC to achieve the optimal convergence rate O_p(n^{−1/2}), the number of machines M has to be fixed, which is a highly restrictive assumption. In this section, we propose a method that remains valid when M → ∞, and it is also valid for fixed M.

Note that ĝ(·) in (3) is generally not a linear function, so solving (2) is a nonlinear optimization problem and the computation can be challenging. Instead, we use a local linear approximation of ĝ(X_i^⊤γ_1, γ) around an initial value γ̂_1^0, where γ̂_1^0 = (1, γ̂^{0⊤})^⊤. This yields
ĝ(X_i^⊤γ_1, γ) ≈ ĝ(X_i^⊤γ̂_1^0) + ĝ′(X_i^⊤γ̂_1^0)(X_i^⊤γ_1 − X_i^⊤γ̂_1^0) = ĝ(X_i^⊤γ̂_1^0) + ĝ′(X_i^⊤γ̂_1^0)(X_{i,−1}^⊤γ − X_{i,−1}^⊤γ̂^0),
where X_{i,−1} is the (p−1)-dimensional vector consisting of coordinates 2, …, p of X_i and
(5) ĝ′(u, γ) = [A_{0,0}(u, γ_1, h_2)A_{1,1}(u, γ_1, h_2) − A_{1,0}(u, γ_1, h_2)A_{0,1}(u, γ_1, h_2)] / [A_{0,0}(u, γ_1, h_2)A_{2,0}(u, γ_1, h_2) − A_{1,0}²(u, γ_1, h_2)].
We write ĝ(X_i^⊤γ̂_1^0) = ĝ(X_i^⊤γ̂_1^0, γ̂^0) and ĝ′(X_i^⊤γ̂_1^0) = ĝ′(X_i^⊤γ̂_1^0, γ̂^0) for simplicity. The proposed estimator is then obtained by minimizing the following least squares criterion,
(6) γ̂ = argmin_γ Σ_{m=1}^{M} Σ_{i∈S_m} {Y_i − ĝ(X_i^⊤γ̂_1^0) − ĝ′(X_i^⊤γ̂_1^0)(X_{i,−1}^⊤γ − X_{i,−1}^⊤γ̂^0)}² = {Σ_{m=1}^{M} Σ_{i∈S_m} ĝ′²(X_i^⊤γ̂_1^0) X_{i,−1}X_{i,−1}^⊤}^{−1} {Σ_{m=1}^{M} Σ_{i∈S_m} ĝ′(X_i^⊤γ̂_1^0) X_{i,−1} Y_i^*},
where Y_i^* = Y_i − ĝ(X_i^⊤γ̂_1^0) + ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤γ̂^0.
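
Because the local linear approximation turns (6) into a weighted least squares problem, the update has the closed form displayed above. The R sketch below performs one such update; ghat and gder stand for the fitted values ĝ(X_i^⊤γ̂_1^0) and ĝ′(X_i^⊤γ̂_1^0), and all names are our own illustrative choices rather than the authors' code.

```r
# One closed-form update as in (6): weighted least squares of the pseudo-response
# Y* on X_{i,-1}, with weights given by the estimated derivative gder.
ls_update <- function(Y, X1, ghat, gder, gamma_old) {
  Ystar <- Y - ghat + gder * drop(X1 %*% gamma_old)  # Y_i^* in (6)
  C <- crossprod(X1 * gder)                          # sum_i gder_i^2 X1_i X1_i'
  D <- crossprod(X1, gder * Ystar)                   # sum_i gder_i X1_i Y_i^*
  drop(solve(C, D))                                  # updated gamma
}
```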

By the forms of (3) and (5), for a given γ it is straightforward to estimate g_0(·) and g_0′(·) with a massive dataset. The statistic A_{l,s}(u, γ_1, h_r) in (3) and (5) can be rewritten as
(7) A_{l,s}(u, γ_1, h_r) = Σ_{m=1}^{M} {Σ_{i∈S_m} (X_i^⊤γ_1 − u)^l Y_i^s K_{h_r}(X_i^⊤γ_1 − u)},
where l = 0, 1, 2, s = 0, 1 and r = 1, 2. Thus, by (3), (5) and (7), we can obtain the estimators of g_0(·) and g_0′(·) for a massive dataset. Note that these estimators are identical to the estimators in (3) and (5) computed directly from the full data. We can therefore use (3), (5), (6) and (7) to iteratively update the estimate of γ_0 until convergence.
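
The decomposition in (7) means that each block only has to report a handful of scalar partial sums, which are then added up. A minimal R sketch, reusing Kh() from the earlier snippet (the names blocks and A_pooled are ours):

```r
# A_{l,s}(u, gamma_1, h_r) assembled from block-wise partial sums as in (7);
# the pooled value equals the full-data statistic exactly.
A_pooled <- function(l, s, u, index, Y, h, blocks) {
  partial <- vapply(blocks, function(idx)
    sum((index[idx] - u)^l * Y[idx]^s * Kh(index[idx] - u, h)),
    numeric(1))                                      # one scalar per block
  sum(partial)                                       # aggregate over the M blocks
}
```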

2.2. Asymptotic normality of the resulting estimator

To establish the asymptotic property of the proposed estimator, the following technical conditions are imposed.

(C1)

The density function of X^⊤γ_1 is positive and satisfies a Lipschitz condition of order 1 for γ_1 in a neighbourhood of γ_{01}. Further, X^⊤γ_{01} has a positive and bounded density function on Λ, where Λ = {t = X^⊤γ_{01} : X ∈ Θ} and Θ is the compact support of X.

(C2)

g_0(t) and the j-th (2 ≤ j ≤ p) component of E[X | X^⊤γ_{01} = t] have two bounded and continuous derivatives.

(C3)

E[ε | X] = 0 and E[ε⁴ | X] < ∞.

(C4)

The kernel K(·) is a bounded, continuous and symmetric probability density function satisfying ∫u²K(u)du ≠ 0 and ∫|u|⁴K(u)du < ∞. In addition, Σ = E[g_0′(X^⊤γ_{01})² X_{−1}X_{−1}^⊤] is a positive definite matrix, where X_{−1} denotes the vector of coordinates 2, …, p of X.

Remark 2.1

Conditions (C1)–(C4) are commonly used in the literature; see Wang et al. (2010). Condition (C1) is used to bound the density function of X^⊤γ_1 away from zero. This ensures that the denominators of ĝ(u, γ) and ĝ′(u, γ) are, with high probability, bounded away from 0 for u = X^⊤γ_1, where X ∈ Θ and γ_1 is near γ_{01}. The Lipschitz condition and the two derivatives in conditions (C1) and (C2) are standard smoothness conditions. Condition (C3) is necessary for the asymptotic normality of the estimator. Condition (C4) is a usual assumption on the kernel function.

Theorem 2.1

Suppose conditions (C1)–(C4) hold, ‖γ̂^0 − γ_0‖_2 = O_p(ñ^{−1/2}) with ñ = n^c, 0 < c ≤ 1, and, as n → ∞, h_1 = O(n^{−1/4}/log n), h_2 = O(n^{−1/4} log² n) and Q ≥ 1 + log(log n / log ñ)/log 2. Then, for the estimator γ̂ after the Q-th iteration,
√n (γ̂ − γ_0) →_L N(0, Σ^{−1}SΣ^{−1}),
where →_L stands for convergence in distribution, Σ = E{g_0′(X^⊤γ_{01})² X_{−1}X_{−1}^⊤} and
S = E[g_0′(X^⊤γ_{01})² {X_{−1} − E(X_{−1} | X^⊤γ_{01})}{X_{−1} − E(X_{−1} | X^⊤γ_{01})}^⊤ ε²].
The initial estimator γ̂^0 can be obtained by the method in Ichimura (1993) based on S_1, which satisfies ‖γ̂^0 − γ_0‖_2 = O_p(ñ^{−1/2}) under some regularity conditions.

Theorem 2.1 shows that γ̂ achieves the same asymptotic efficiency as the estimator in (2) computed directly from all the samples; see Theorem 2 in Wang et al. (2010). Compared with the averaging divide-and-conquer method, which can also achieve the convergence rate O_p(n^{−1/2}) but only for fixed M, our approach removes the restriction on the number of machines M by applying multiple rounds of aggregation. It is also important to note that the required number of rounds Q is usually quite small. For example, if n = 10^{20} and ñ = 10^{5}, then Q = 3.
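
The number of rounds is easy to evaluate numerically; a one-line R check of the example above (the base of the logarithm cancels in the ratio, so natural logarithms are used):

```r
# Smallest integer Q satisfying Q >= 1 + log(log n / log n_tilde) / log 2.
Q_required <- function(n, n_tilde) ceiling(1 + log(log(n) / log(n_tilde)) / log(2))
Q_required(1e20, 1e5)   # returns 3, as in the text
```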

After obtaining the estimate γ̂ of γ_0, for any given point u we can estimate g_0(·) in model (1) with a massive dataset by (3) and (7).

2.3. Algorithm

Based on the above analysis, we now introduce an iterative divide-and-conquer method for estimating γ0.

Step 1: Without loss of generality, the entire data set is partitioned into M subsets S_1, …, S_M.

Step 2: Calculate the initial estimator γ̂^0 based on S_1.

Step 3: Compute the estimators
ĝ(X_i^⊤γ̂_1^0) = [Σ_{k=1}^{M} B_{2,0,1}^{k} · Σ_{k=1}^{M} B_{0,1,1}^{k} − Σ_{k=1}^{M} B_{1,0,1}^{k} · Σ_{k=1}^{M} B_{1,1,1}^{k}] / [Σ_{k=1}^{M} B_{0,0,1}^{k} · Σ_{k=1}^{M} B_{2,0,1}^{k} − (Σ_{k=1}^{M} B_{1,0,1}^{k})²]
and
ĝ′(X_i^⊤γ̂_1^0) = [Σ_{k=1}^{M} B_{0,0,2}^{k} · Σ_{k=1}^{M} B_{1,1,2}^{k} − Σ_{k=1}^{M} B_{1,0,2}^{k} · Σ_{k=1}^{M} B_{0,1,2}^{k}] / [Σ_{k=1}^{M} B_{0,0,2}^{k} · Σ_{k=1}^{M} B_{2,0,2}^{k} − (Σ_{k=1}^{M} B_{1,0,2}^{k})²],
where B_{l,s,r}^{k} = Σ_{j∈S_k} (X_j^⊤γ̂_1^0 − X_i^⊤γ̂_1^0)^l Y_j^s K_{h_r}(X_j^⊤γ̂_1^0 − X_i^⊤γ̂_1^0), l = 0, 1, 2, s = 0, 1, r = 1, 2.

Step 4: Compute the estimator γ̂:
γ̂ = (Σ_{m=1}^{M} C_m)^{−1} (Σ_{m=1}^{M} D_m),
where C_m = Σ_{i∈S_m} ĝ′²(X_i^⊤γ̂_1^0) X_{i,−1}X_{i,−1}^⊤ and D_m = Σ_{i∈S_m} ĝ′(X_i^⊤γ̂_1^0) X_{i,−1} Y_i^*.

Step 5: Set γ̂^0 = γ̂ and repeat Steps 3 and 4, for Q ≥ 1 + log(log n / log ñ)/log 2 iterations in total.
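
Putting Steps 1–5 together, a compact (and deliberately naive) R sketch is given below. It reuses Kh() and ls_update() from the earlier snippets, evaluates ĝ and ĝ′ at every index value by brute force, and all function and argument names are our own, not the authors' code.

```r
# Iterative divide-and-conquer estimation of gamma_0 (Steps 1-5).
idc_estimate <- function(X, Y, M, gamma_init, h1, h2, Q) {
  n      <- nrow(X)
  blocks <- split(seq_len(n), rep(seq_len(M), length.out = n))  # Step 1: S_1,...,S_M
  X1     <- X[, -1, drop = FALSE]                               # coordinates 2,...,p of X
  gamma  <- gamma_init                                          # Step 2: initial estimator
  for (q in seq_len(Q)) {                                       # Step 5: Q rounds
    index <- drop(X %*% c(1, gamma))                            # X_i' gamma_1^0
    # Step 3: pool block-wise sums B_{l,s,r}^k, as in (7), at every X_i' gamma_1^0
    B <- function(l, s, h) Reduce(`+`, lapply(blocks, function(idx)
      sapply(index, function(u)
        sum((index[idx] - u)^l * Y[idx]^s * Kh(index[idx] - u, h)))))
    ghat <- (B(2, 0, h1) * B(0, 1, h1) - B(1, 0, h1) * B(1, 1, h1)) /
            (B(0, 0, h1) * B(2, 0, h1) - B(1, 0, h1)^2)          # ghat at each index
    gder <- (B(0, 0, h2) * B(1, 1, h2) - B(1, 0, h2) * B(0, 1, h2)) /
            (B(0, 0, h2) * B(2, 0, h2) - B(1, 0, h2)^2)          # ghat' at each index
    gamma <- ls_update(Y, X1, ghat, gder, gamma)                 # Step 4: update (6)
  }
  gamma
}
```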

3. Numerical studies

In this section, we first use Monte Carlo simulation studies to assess the finite-sample performance of the proposed procedures and then demonstrate their application with two real data analyses. All programs are written in R, and our computer has a 3.3 GHz Pentium processor and 8 GB of memory.

The Epanechnikov kernel K(u) = 0.75(1 − u²)I(|u| ≤ 1) is used throughout this section. When calculating the estimator γ̂ in (6), following Wang et al. (2010), we choose the bandwidths h_1 = ĥ·n^{1/5}·n^{−1/4}/log n and h_2 = ĥ·n^{1/5}·n^{−1/4}·log² n. We use the method in Ichimura (1993) to obtain γ̂_m in (4), which can be computed with 'npindexbw' in R. All the simulations are run with 100 replicates.

3.1. Simulation example 1: effect of M with fixed n

In this example, we fix the total sample size at n = 10,000 and vary the number of blocks M over {10, 50, 100} to assess the influence of M on the proposed estimation method. The model for the simulated data is
(8) Y = 5cos(πX^⊤γ_{01}) + exp(|X^⊤γ_{01}|) + ε,
where X is uniformly distributed on [0,1]³, γ_{01} = (1, γ_0^⊤)^⊤, γ_0 = (2, 1)^⊤ and ε ∼ N(0, 1).
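
For reference, one replicate from model (8) can be generated in a few lines of R; this is our reading of the setup above, not code from the paper.

```r
# Simulate one data set of size n = 10000 from model (8).
set.seed(1)
n       <- 10000
X       <- matrix(runif(3 * n), n, 3)          # X uniform on [0,1]^3
gamma01 <- c(1, 2, 1)                          # (1, gamma_0') with gamma_0 = (2, 1)
index   <- drop(X %*% gamma01)                 # X_i' gamma_01
Y       <- 5 * cos(pi * index) + exp(abs(index)) + rnorm(n)
```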

We compare the proposed iterative divide-and-conquer (IDC) method for γ_0 with the oracle procedure (Oracle), which solves (2) using the full data, and with the averaging divide-and-conquer (ADC) method.

Table 1 reports the mean squared error (MSE = (γ̂ − γ_0)^⊤(γ̂ − γ_0)) and the absolute bias (|γ̂ − γ_0|) of the estimate γ̂ to assess the accuracy of the estimation methods. From Table 1, the following conclusions can be drawn.

  1. All the estimators are close to the true value, since the absolute biases are very small.

  2. In terms of MSE, the IDC estimator performs better than ADC when M = 100 and worse than ADC when M = 10 and M = 50.

  3. t in Table 1 is the average computing time in seconds used to estimate the index parameter. From t, we see that ADC and IDC run much faster than Oracle. Moreover, IDC is faster than ADC in the case M = 10.

Table 1. The means of Absolute Bias, MSE (standard deviation) and t for simulation Example 1.

3.2. Simulation example 2: effect of M with fixed n~

To compare how the number of blocks affects the three methods when the sample size of each block is fixed (ñ = 100), we consider M ∈ {10, 20, …, 100}. The model for the simulated data is again (8).

The MSE results are presented in Figure 1, and the average computing time in seconds used to estimate the index parameter is presented in Figure 2. From Figures 1 and 2, the following conclusions can be drawn.

  1. From Figure 1, we can see that the Oracle method performs best among the three methods for every value of M. However, as Figure 2 shows, the computing times of the Oracle method are much greater than those of ADC and IDC for all M.

  2. As the number of blocks M increases, the MSEs of Oracle and IDC decrease. ADC, however, does not show this pattern.

  3. If the number of blocks is less than 30, the MSEs of the ADC method are smaller than those of IDC. However, as the number of blocks continues to increase, IDC significantly outperforms the ADC method. Furthermore, if the number of blocks is less than 60, the computing times of the IDC method are smaller than those of ADC.

Figure 1. Comparison of MSE versus the number of blocks M with ñ = 100 for the three methods in simulation Example 2.

Figure 2. The mean computing time of γ̂ (in seconds) for simulation Example 2.

3.3. Simulation example 3: effect of n

To examine the effect of increasing the sample size, n = 5000, 10,000 and 20,000 are considered. We consider the following single-index model:
(9) Y = sin(0.75 X^⊤γ_{01}) + ε,
where X = (X_1, X_2)^⊤ is a two-dimensional standard normal vector, the correlation between X_1 and X_2 is 0.5, γ_{01} = (1, γ_0)^⊤, γ_0 = 2 and ε ∼ N(0, 0.2²).

Table 2 presents the averages of the absolute bias and the computing time t over the 100 data sets, along with their estimated standard errors. As expected, the absolute bias becomes smaller and the computing time t becomes larger as the sample size grows.

Table 2. The means of Absolute Bias (standard deviation) and t for simulation Example 3.

3.4. Real data example 1: combined cycle power plant data

We apply the proposed method to combined cycle power plant data. The dataset contains 9568 data points collected from a combined cycle power plant over six years (2006–2011), when the power plant was set to work with a full load. The features consist of hourly average ambient variables, namely Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (EP) of the plant. The data set is available online at https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.

In this study, the following single-index model is used to fit the data:
(10) EP = g{γ_1 AT + γ_2 AP + γ_3 RH + γ_4 V} + ε,
where all the variables are normalized. We consider the number of blocks M ∈ {8, 16, 26, 46, 92}; hence, the corresponding local sample sizes are ñ ∈ {1196, 598, 368, 208, 104}. Table 3 summarizes the estimated coefficients for the above model, showing that AP has the smallest effect on EP among the four covariates and that AT is the most important covariate. Figure 3 shows the estimated EP by the Oracle method along with the observations, illustrating that the single-index model (10) fits the combined cycle power plant data well. Furthermore, we evaluate the performance of the three estimators by the mean square fitting error (MSFE = Σ_{i=1}^{9568} |EP_i − ÊP_i| / 9568), where ÊP_i is the fitted value of EP_i obtained from (3). From Table 3, the following conclusions can be drawn.

  1. The MSFEs of IDC under different M are very close to that of the Oracle method. Thus, IDC performs well.

  2. As the number of blocks M increases, the MSFEs of ADC increase. The IDC estimator performs better than ADC when M = 92 and worse than ADC when M is small.

  3. t in Table 3 is the average computing time in seconds used to estimate the index parameter. From t, we see that IDC runs faster than both Oracle and ADC.

Figure 3. Estimated single index function for the combined cycle power plant data. The dots are the observations EP and the curve is the estimated EP by the Oracle method.


Table 3. The coefficient estimates and MSFE for the combined cycle power plant data.

3.5. Real data example 2: airline on-time data

Airline on-time performance data from the 2009 ASA Data Expo are used as a case study. These data are publicly available at http://stat-computing.org/dataexpo/2009/the-data.html. The dataset consists of flight arrival and departure details for all commercial flights within the United States from October 1987 to April 2008. About 12 million flights were recorded with 29 variables. In this section, we consider only the data from the year 2008 (1,011,963 samples). The first 1,000,000 data points are used for estimation and the remaining 11,963 for prediction.

Schifano et al. (2016) developed a linear model that fits the data as follows:
(11) AD = γ_1 HD + γ_2 DIS + γ_3 NF + γ_4 WF + ε,
where AD is the arrival delay (ArrDelay), a continuous variable modelled as log(ArrDelay − min(ArrDelay) + 1); HD is the departure hour (range 0 to 24); DIS is the distance (in 1000 miles); NF is the dummy variable for a night flight (1 if departure occurred between 8 p.m. and 5 a.m., 0 otherwise); and WF is the dummy variable for a weekend flight (1 if departure occurred during the weekend, 0 otherwise).

In this study, the following single-index model is used to fit the data:
(12) AD = g{γ_1 HD + γ_2 DIS + γ_3 NF + γ_4 WF} + ε.
For comparison, we use the least squares (LS) method to estimate (γ_1, γ_2, γ_3, γ_4) in model (11), and the ADC and IDC methods proposed in Section 2 to estimate (γ_1, γ_2, γ_3, γ_4) in model (12). The number of blocks is 1000 for all three methods. Furthermore, we evaluate the out-of-sample prediction performance of these estimators with the mean absolute prediction error (MAPE),
MAPE = (1/n) Σ_{i=1}^{n} |AD_i − ÂD_i|,
where ÂD_i is the fitted value of AD_i obtained by (3), i = 1, …, n, with n = 11,963. Table 4 presents the estimated coefficients and MAPEs of the three methods. From Table 4, we can see that the IDC method performs better than LS and ADC, as it attains the smallest MAPE.

Table 4. The coefficient estimates and MAPE for the airline on-time data.

4. Final remarks

We have proposed an iterative divide-and-conquer method for fitting the single-index model with massive datasets. The proposed method significantly reduces the required amount of primary memory and enjoys a very low computational cost. It achieves the same asymptotic efficiency as the estimator using all the data. Furthermore, it imposes only a weak condition on the sample size as a function of the memory size.

Additional information

Funding

We would like to acknowledge support for this project from the Fundamental Research Funds for the Central Universities of China (No. 2232020D-43).

References

  • Chen, X., Liu, W., & Zhang, Y. (2019). Quantile regression under memory constraint. The Annals of Statistics, 47(6), 3244–3273. https://doi.org/10.1214/18-AOS1777
  • Chen, X. Y., & Xie, M. G. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4), 1655–1684. https://doi.org/10.5705/ss.2013.088
  • Christou, E., & Akritas, M. (2016). Single index quantile regression for heteroscedastic data. Journal of Multivariate Analysis, 150, 169–182. https://doi.org/10.1016/j.jmva.2016.05.010
  • Delecroix, M., Hristache, M., & Patilea, V. (2006). On semiparametric M-estimation in single-index regression. Journal of Statistical Planning and Inference, 136(3), 730–769. https://doi.org/10.1016/j.jspi.2004.09.006
  • Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1-2), 71–120. https://doi.org/10.1016/0304-4076(93)90114-K
  • Jiang, R., Guo, M. F., & Liu, X. (2020). Composite quasi-likelihood for single-index models with massive datasets. Communications in Statistics – Simulation and Computation, 51(9), 5024–5040. https://doi.org/10.1080/03610918.2020.1753074
  • Jiang, R., Qian, W. M., & Zhou, Z. G. (2016). Weighted composite quantile regression for single-index models. Journal of Multivariate Analysis, 148, 34–48. https://doi.org/10.1016/j.jmva.2016.02.015
  • Jiang, R., Zhou, Z. G., Qian, W. M., & Chen, Y. (2013). Two step composite quantile regression for single-index models. Computational Statistics & Data Analysis, 64, 180–191. https://doi.org/10.1016/j.csda.2013.03.014
  • Jordan, M. I., Lee, J. D., & Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526), 668–681. https://doi.org/10.1080/01621459.2018.1429274
  • Lin, N., & Xi, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface, 4(1), 73–83. https://doi.org/10.4310/SII.2011.v4.n1.a8
  • Schifano, E., Wu, J., Wang, C., Yan, J., & Chen, M. H. (2016). Online updating of statistical inference in the big data setting. Technometrics, 58(3), 393–403. https://doi.org/10.1080/00401706.2016.1142900
  • Tang, Y., Wang, H., & Liang, H. (2018). Composite estimation for single-index models with responses subject to detection limits. Scandinavian Journal of Statistics, 45(3), 444–464. https://doi.org/10.1111/sjos.v45.3
  • Wang, J. L., Xue, L., Zhu, L., & Chong, Y. (2010). Estimation for a partial-linear single-index model. The Annals of Statistics, 38(1), 246–274. https://doi.org/10.1214/09-AOS712
  • Wu, T., Yu, K., & Yu, Y. (2010). Single-index quantile regression. Journal of Multivariate Analysis, 101(7), 1607–1621. https://doi.org/10.1016/j.jmva.2010.02.003
  • Xia, Y., Tong, H., Li, W., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society Series B, 64(3), 363–410. https://doi.org/10.1111/rssb.2002.64.issue-3
  • Zhu, L., & Xue, L. (2006). Empirical likelihood confidence regions in a partially linear single-index model. Journal of the Royal Statistical Society: Series B, 68(3), 549–570. https://doi.org/10.1111/rssb.2006.68.issue-3

Appendix

Proof of Theorem 2.1

We denote the estimator from the first iteration by γ̂^1 and note that (6) can be equivalently written as

γ̂^1 − γ_0 = {Σ_{i=1}^{n} ĝ′²(X_i^⊤γ̂_1^0) X_{i,−1}X_{i,−1}^⊤}^{−1} {Σ_{i=1}^{n} ĝ′(X_i^⊤γ̂_1^0) X_{i,−1} Ỹ_i} ≡ U_n^{−1}V_n,

where Ỹ_i = ε_i + g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ̂_1^0) − ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤(γ_0 − γ̂^0), V_n = n^{−1} Σ_{i=1}^{n} ĝ′(X_i^⊤γ̂_1^0) X_{i,−1} Ỹ_i and U_n = n^{−1} Σ_{i=1}^{n} ĝ′²(X_i^⊤γ̂_1^0) X_{i,−1}X_{i,−1}^⊤. Note that

V_n = (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01}){X_{i,−1} − E(X_{−1} | X_i^⊤γ_{01})}ε_i
+ (1/n) Σ_{i=1}^{n} {ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ_{01})} X_{i,−1} ε_i
+ (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01})[X_{i,−1}{g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})} + E(X_{−1} | X_i^⊤γ_{01})ε_i]
+ (1/n) Σ_{i=1}^{n} X_{i,−1}{ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ_{01})}{g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})}
+ (1/n) Σ_{i=1}^{n} ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}{ĝ(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ̂_1^0) − ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤(γ_0 − γ̂^0)}
≡ V_{n1} + V_{n2} + V_{n3} + V_{n4} + V_{n5}.

We first show that ‖V_{n2}‖_2 = o_p(n^{−1/2}). Let V_{n2,s} denote the s-th component of V_{n2}. Then we have

V_{n2,s} = (1/n) Σ_{i=1}^{n} {ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ_{01})} X_{is,−1} ε_i
= (1/n) Σ_{i=1}^{n} {ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)} X_{is,−1} ε_i + (1/n) Σ_{i=1}^{n} {g_0′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ_{01})} X_{is,−1} ε_i
≡ V_{n21,s} + V_{n22,s},

where X_{is,−1} denotes the s-th component of X_{i,−1}. Note that ĝ′(u, γ) in (5) can be rewritten as ĝ′(u, γ) = Σ_{i=1}^{n} W̃_{ni}(u, γ_1) Y_i, where

W̃_{ni}(u, γ_1) = K_{h_2}(X_i^⊤γ_1 − u){(X_i^⊤γ_1 − u) A_{0,0}(u, γ_1, h_2) − A_{1,0}(u, γ_1, h_2)} / {A_{0,0}(u, γ_1, h_2) A_{2,0}(u, γ_1, h_2) − A_{1,0}²(u, γ_1, h_2)}.

Thus V_{n21,s} can be rewritten as

V_{n21,s} = (1/n) Σ_{i=1}^{n} {Σ_{j=1}^{n} W̃_{nj}(X_i^⊤γ̂_1^0, γ̂_1^0) g_0(X_j^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)} X_{is,−1} ε_i + (1/n) Σ_{i=1}^{n} W̃_{ni}(X_i^⊤γ̂_1^0, γ̂_1^0) X_{is,−1} ε_i² + (1/n) Σ_{i=2}^{n} Σ_{j=1}^{i−1} a_{ij} ε_j ε_i
≡ V_{n211,s} + V_{n212,s} + V_{n213,s},

where a_{ij} = W̃_{nj}(X_i^⊤γ̂_1^0) X_{is,−1} + W̃_{ni}(X_j^⊤γ̂_1^0) X_{js,−1}. Hence, by Lemma 1 in Zhu and Xue (2006) and conditions (C2) and (C3), γ̂^0 is a consistent estimate of γ_0 and, by the Cauchy–Schwarz inequality, for c_1 and c_2 large enough we have

n E[V_{n211,s}²] = (1/n) Σ_{i=1}^{n} E[{Σ_{j=1}^{n} W̃_{nj}(X_i^⊤γ̂_1^0, γ̂_1^0) g_0(X_j^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)}² X_{is,−1}² E(ε_i² | X_i)]
≤ c_1 (1/n) Σ_{i=1}^{n} [E{Σ_{j=1}^{n} W̃_{nj}(X_i^⊤γ̂_1^0, γ̂_1^0) g_0(X_j^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)}⁴]^{1/2} {E(X_{is,−1}⁴)}^{1/2} ≤ c_2 h_2² → 0.

For V_{n212,s}, by Lemma 2 in Zhu and Xue (2006), for c_3 and c_4 large enough we have

√n E|V_{n212,s}| ≤ (1/√n) Σ_{i=1}^{n} E{|W̃_{ni}(X_i^⊤γ̂_1^0, γ̂_1^0) X_{is,−1}| E(ε_i² | X_i)} ≤ c_3 (1/√n) Σ_{i=1}^{n} E{W̃_{ni}(X_i^⊤γ̂_1^0, γ̂_1^0)}² E(X_{is,−1}²) ≤ c_4 {(n h_2²)^{−1/2} + (n h_2^{5/2})^{−1}} → 0.

We now consider V_{n213,s}. Note that E(a_{ij} ε_j ε_i | X, ε_j) = 0 and E(a_{i_1 j_1} ε_{j_1} ε_{i_1} a_{i_2 j_2} ε_{j_2} ε_{i_2} | X) = 0 when {i_1, j_1} ≠ {i_2, j_2}; by Lemma 2 in Zhu and Xue (2006), for c_5 and c_6 large enough we have

n E(V_{n213,s}²) = (1/n) Σ_{i=2}^{n} Σ_{j=1}^{i−1} E{a_{ij}² E(ε_j² | X_j) E(ε_i² | X_i)} ≤ c_5 (1/n) Σ_{i≠j} E{W̃_{nj}(X_i^⊤γ̂_1^0, γ̂_1^0)}⁴ E(X_{is,−1}⁴) ≤ c_6 {(n² h_2³)^{−1} + n^{−1} h_2 + (n h_2^{5/2})^{−2}} → 0.

By the condition ‖γ̂^0 − γ_0‖_2 = O_p(ñ^{−1/2}), we have V_{n22,s} = o_p(n^{−1/2}). Combining the above results with the Markov inequality, we obtain ‖V_{n2}‖_2 = o_p(n^{−1/2}).

Next we prove that the mean and the variance of √n V_{n3,s} tend to 0. Using E{ĝ(X_i^⊤γ_{01}) − g_0(X_i^⊤γ_{01})} = O(h_1²) and condition (C2), we have √n E V_{n3,s} ≤ O(n^{1/2} h_1²) → 0. Using conditions (C1)–(C4) and (A.35) of Chang et al. (2010), we obtain n E V_{n3,s}² ≤ O(n h_1⁴ + h_1 + (n h_1)^{−1}) → 0. This proves that ‖V_{n3}‖_2 = o_p(n^{−1/2}).

We now consider V_{n4}. Write

V_{n4,s} = (1/n) Σ_{i=1}^{n} X_{is,−1}{ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)}{g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})} + (1/n) Σ_{i=1}^{n} X_{is,−1}{g_0′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ_{01})}{g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})} ≡ V_{n41,s} + V_{n42,s}.

By Lemma 3 in Zhu and Xue (2006) and the Markov inequality, for any ε > 0,

P(√(n h_2³/log n) |ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)| ≥ ε) ≤ (n h_2³/log n) E{ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)}²/ε² → 0.

Hence we have, uniformly over 1 ≤ i ≤ n, |ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)| = O_p(√(log n/(n h_2³))). Similarly, |ĝ(X_i^⊤γ̂_1^0) − g_0(X_i^⊤γ̂_1^0)| = O_p(√(log n/(n h_1))). Thus we have

√n |V_{n41,s}| ≤ (1/√n) Σ_{i=1}^{n} |X_{is,−1}| |ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)| |g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})| = O_p(√(log² n/(n h_1 h_2³))).

Noting that n h_1 h_2³/log² n → ∞, we obtain V_{n41,s} = o_p(n^{−1/2}). For V_{n42}, by a Taylor expansion we get

V_{n42,s} = (1/n) Σ_{i=1}^{n} g_0″(ξ_i) X_{is,−1} X_i^⊤(γ̂_1^0 − γ_{01}){g_0(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ_{01})},

where ξ_i is a point between X_i^⊤γ_{01} and X_i^⊤γ̂_1^0. By the condition ‖γ̂^0 − γ_0‖_2 = O_p(ñ^{−1/2}) and using E{ĝ(X_i^⊤γ_{01}) − g_0(X_i^⊤γ_{01})} = O(h_1²), we have √n E V_{n42,s} = O(ñ^{−1/2} n^{1/2} h_1²) → 0 and n E V_{n42,s}² ≤ O(h_1⁴ ñ^{−1} + (ñ n h_1)^{−1}) → 0. Thus we obtain V_{n42,s} = o_p(n^{−1/2}) and therefore ‖V_{n4}‖_2 = o_p(n^{−1/2}).

Finally, we consider V_{n5}. Write V_{n5,s} = V_{n51,s} + V_{n52,s}, where

V_{n51,s} = n^{−1} Σ_{i=1}^{n} g_0′(X_i^⊤γ̂_1^0) X_{is,−1}{ĝ(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ̂_1^0) − ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤(γ_0 − γ̂^0)}

and

V_{n52,s} = n^{−1} Σ_{i=1}^{n} {ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)} X_{is,−1}{ĝ(X_i^⊤γ_{01}) − ĝ(X_i^⊤γ̂_1^0) − ĝ′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤(γ_0 − γ̂^0)}.

We rewrite V_{n51,s} as

V_{n51,s} = (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ̂_1^0) X_{is,−1}{g_0(X_i^⊤γ_{01}) − g_0(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0) X_{i,−1}^⊤(γ_0 − γ̂^0)}
+ (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ̂_1^0) X_{is,−1}{ĝ(X_i^⊤γ_{01}) − g_0(X_i^⊤γ_{01})}
− (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ̂_1^0) X_{is,−1}{ĝ(X_i^⊤γ̂_1^0) − g_0(X_i^⊤γ̂_1^0)}
− (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ̂_1^0) X_{is,−1}{ĝ′(X_i^⊤γ̂_1^0) − g_0′(X_i^⊤γ̂_1^0)} X_{i,−1}^⊤(γ_0 − γ̂^0)
≡ V_{n511,s} + V_{n512,s} + V_{n513,s} + V_{n514,s}.

For V_{n511,s}, by a Taylor expansion, we get V_{n511,s} = O_p(ñ^{−1}). Similarly to the proof for V_{n42,s}, we get

√n E V_{n512,s} = √n E V_{n513,s} = O(n^{1/2} h_1²) → 0, n E V_{n512,s}² = n E V_{n513,s}² = O(h_1⁴ + (n h_1)^{−1}) → 0,

and

√n E V_{n514,s} = O(√(n/ñ) h_2²) → 0, n E V_{n514,s}² = O(ñ^{−1} h_2² + ñ^{−1}(n h_2³)^{−1}) → 0.

Hence ‖V_{n5}‖_2 = o_p(n^{−1/2}) + O_p(ñ^{−1}). Therefore we have

V_n = (1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01}){X_{i,−1} − E(X_{−1} | X_i^⊤γ_{01})}ε_i + o_p(n^{−1/2}) + O_p(ñ^{−1}).

Similarly to the treatment of V_n, we can obtain U_n = Σ + o_p(1), where Σ = E{g_0′²(X^⊤γ_{01}) X_{−1}X_{−1}^⊤}. Thus

γ̂^1 − γ_0 = Σ^{−1}(1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01}){X_{i,−1} − E(X_{−1} | X_i^⊤γ_{01})}ε_i + o_p(n^{−1/2}) + O_p(ñ^{−1}).

Note that one round of aggregation thus refines the estimator, reducing its bias from ñ^{−1/2} to ñ^{−1}. Therefore, iterative refinement of the initial estimator successively improves the estimation accuracy, and the q-th iterative divide-and-conquer estimator γ̂^q satisfies

γ̂^q − γ_0 = Σ^{−1}(1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01}){X_{i,−1} − E(X_{−1} | X_i^⊤γ_{01})}ε_i + o_p(n^{−1/2}) + O_p(ñ^{−2^{q−1}}).

Thus, after Q iterations, where Q ≥ 1 + log(log n/log ñ)/log 2, we have

γ̂ − γ_0 = Σ^{−1}(1/n) Σ_{i=1}^{n} g_0′(X_i^⊤γ_{01}){X_{i,−1} − E(X_{−1} | X_i^⊤γ_{01})}ε_i + o_p(n^{−1/2}).

By the central limit theorem, Theorem 2.1 follows.