Abstract

This paper studies inference for the index coefficient in single-index models with massive datasets. Analysis of massive datasets is challenging owing to formidable computational costs and memory requirements. A natural method is the averaging divide-and-conquer approach, which splits the data into several blocks, obtains an estimator for each block and then aggregates the estimators via averaging. However, this approach restricts the number of blocks. To overcome this limitation, this paper proposes a computationally efficient method, which requires only an initial estimator and then successively refines the estimator via multiple rounds of aggregation. The proposed estimator achieves the optimal convergence rate without any restriction on the number of blocks. We present both theoretical analysis and experiments to explore the properties of the proposed method.

1. Introduction

Single-index models provide an efficient way of coping with high-dimensional nonparametric estimation problems and avoid the ‘curse of dimensionality’ by assuming that the response is related only to a single linear combination of the covariates. Because of its usefulness in areas such as discrete choice analysis in econometrics and dose–response models in biometrics, we restrict our attention to the single-index model of the following form: $Y = g_{0} (X^{⊤} γ_{01}) + ϵ,$ (1) where $Y$ is the univariate response and $X$ is a p-dimensional vector of covariates. The function $g_{0} (\cdot)$ is an unspecified smooth nonparametric function, and $γ_{01}$ is the unknown index coefficient vector. For identifiability, one imposes certain conditions on $γ_{01}$ ; we assume that $γ_{01} = (1, γ_{0}^{⊤})^{⊤}$ with $γ_{0} \in R^{p - 1}$ . This ‘remove-one-component’ method for $γ_{01}$ has also been applied in Christou and Akritas (2016), Delecroix et al. (2006) and Ichimura (1993). The random errors $ϵ$ are assumed to be independent and identically distributed with $E [ϵ ∣ X] = 0$ .

In the single-index model (1), the primary parameter of interest is the coefficient $γ_{01}$ in the index $X^{⊤} γ_{01}$ , since $γ_{01}$ makes explicit the relationship between the response variable $Y$ and the covariate $X$ . Various strategies for estimating $γ_{01}$ have been proposed in the literature; see Jiang et al. (2013), Jiang et al. (2016), Tang et al. (2018), Wu et al. (2010) and Xia et al. (2002), among others.

The development of modern technology has enabled data collection of unprecedented size. For instance, Facebook had 1.55 billion monthly active users in the third quarter of 2015. In recent years, statistical analysis of such massive datasets has become a subject of increased interest. However, when the sample size is excessively large, there are two major obstacles. First, the data can be too big to be held in a computer's memory. Second, the computation can take too long to deliver results. Several methods have been proposed to address these obstacles. One of them, the averaging divide-and-conquer (ADC) method, has been widely adopted. The main idea of ADC is to first compute local estimators on each block and then take their average; see Chen and Xie (2014), Chen et al. (2019), Jiang et al. (2020) and Lin and Xi (2011), among others.

These averaging-based ADC approaches suffer from one drawback. For the averaging estimator to achieve the optimal convergence rate attained when all data points are used at once, each block must have access to at least $O (\sqrt{n})$ samples, where n is the sample size of the full data set. In other words, the number of blocks should be much smaller than $\sqrt{n}$ , which is a highly restrictive assumption. Jordan et al. (2019) proposed the communication-efficient surrogate likelihood procedure for distributed statistical learning problems, which relaxes the condition on the number of blocks. However, their method cannot be applied to estimate the unknown index coefficient vector in the single-index model (1), because of the unknown nonparametric function.

This paper proposes an iterative divide-and-conquer (IDC) method for estimating the unknown index coefficient vector in model (1) with massive datasets, which reduces the required primary memory and computation time. The proposed IDC method removes the constraint on the number of blocks imposed by the ADC method: it requires only an initial estimator and then successively refines the estimator via multiple rounds of aggregation. The resulting estimator is as efficient as the estimator based on the entire dataset.

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed procedures for model (1). Simulation examples and applications to two real datasets are given in Section 3 to illustrate the proposed procedures. Final remarks are given in Section 4. All the conditions and their discussions as well as technical proofs are relegated to the Appendix.

2. Methodology and main results

2.1. Iterative divide-and-conquer method

We first review the estimation method for the full data (Wang et al., 2010), which can be carried out on a single machine. Let $\{X_{i}, Y_{i}\}_{i = 1}^{n}$ be an independent and identically distributed (i.i.d.) sample from $(X, Y)$ . We can obtain the estimator $\hat{γ}$ of $γ_{0}$ by minimizing $\sum_{i = 1}^{n} \{Y_{i} - \hat{g} (X_{i}^{⊤} γ_{1}, γ)\}^{2},$ (2) where $γ_{1} = (1, γ^{⊤})^{⊤}$ , $γ \in R^{p - 1}$ , $\hat{g} (u, γ) = \frac{A_{2, 0} (u, γ_{1}, h_{1}) A_{0, 1} (u, γ_{1}, h_{1}) - A_{1, 0} (u, γ_{1}, h_{1}) A_{1, 1} (u, γ_{1}, h_{1})}{A_{0, 0} (u, γ_{1}, h_{1}) A_{2, 0} (u, γ_{1}, h_{1}) - A_{1, 0}^{2} (u, γ_{1}, h_{1})},$ (3) $A_{l, s} (u, γ_{1}, h_{r}) = \sum_{i = 1}^{n} (X_{i}^{⊤} γ_{1} - u)^{l} Y_{i}^{s} K_{h_{r}} (X_{i}^{⊤} γ_{1} - u)$ , for l = 0, 1, 2, s = 0, 1, r = 1, 2, $K_{h} (\cdot) = K (\cdot / h) / h$ , $K (\cdot)$ is a symmetric kernel function and h is a bandwidth.
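The local linear estimator in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the index values $X_{i}^{⊤} γ_{1}$ are assumed to be precomputed, and the Epanechnikov kernel used in Section 3 is taken as the kernel $K$.

```python
import math


def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on |u| <= 1 (the kernel of Section 3)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def moment(l, s, u, index, y, h):
    """A_{l,s}(u, gamma_1, h): kernel-weighted moment sum over all observations.

    `index` holds the precomputed values X_i^T gamma_1 (an assumption of this sketch).
    """
    total = 0.0
    for t, yi in zip(index, y):
        d = t - u
        total += d ** l * yi ** s * epanechnikov(d / h) / h
    return total


def g_hat(u, index, y, h1):
    """Local linear estimate of g_0(u), following the ratio in equation (3)."""
    a00 = moment(0, 0, u, index, y, h1)
    a10 = moment(1, 0, u, index, y, h1)
    a20 = moment(2, 0, u, index, y, h1)
    a01 = moment(0, 1, u, index, y, h1)
    a11 = moment(1, 1, u, index, y, h1)
    return (a20 * a01 - a10 * a11) / (a00 * a20 - a10 ** 2)
```

A useful sanity check is that a local linear fit reproduces an exactly linear function at any point where the denominator is nonzero, including boundary points.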

However, for a massive dataset, we cannot obtain the estimator of $γ_{0}$ in this way, because a single computer either cannot store the data or would take too long to solve the optimization problem (2).

Let us assume that the n samples are partitioned into M subsets. In particular, we split the data index set $\{1, \dots, n\}$ into $S_{1}, \dots, S_{M}$ , where $S_{m}$ denotes the set of indices on the m-th block, $m = 1, \dots, M$ . Without loss of generality, each block has the sample size $\tilde{n} = n / M$ , where $\tilde{n}$ is assumed to be an integer.

The averaging divide-and-conquer (ADC) estimator of $γ_{0}$ is obtained, following Jiang et al. (2020), as ${\hat{γ}}_{A D C} = \frac{1}{M} \sum_{m = 1}^{M} {\hat{γ}}_{m},$ (4) where ${\hat{γ}}_{m}$ is obtained by minimizing (2) over the subset $S_{m}$ , $m = 1, \dots, M$ .
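The partition and the averaging step in (4) can be illustrated as follows. This is a minimal sketch with hypothetical helper names; the block-level estimates ${\hat{γ}}_{m}$ are assumed to have been computed elsewhere, and contiguous blocks are only one of many valid ways to split the index set.

```python
def partition_indices(n, M):
    """Split {0, ..., n-1} into M contiguous blocks of equal size n / M."""
    if n % M != 0:
        raise ValueError("this sketch assumes n / M is an integer")
    size = n // M
    return [list(range(m * size, (m + 1) * size)) for m in range(M)]


def adc_average(block_estimates):
    """Average the block-level estimates componentwise, as in equation (4)."""
    M = len(block_estimates)
    dim = len(block_estimates[0])
    return [sum(est[j] for est in block_estimates) / M for j in range(dim)]
```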

Sensor network data are naturally collected by many sensors. However, by Theorem 4.1 in Jiang et al. (2020), for ${\hat{γ}}_{A D C}$ to achieve the optimal convergence rate $O_{p} (n^{- 1 / 2})$ , the number of machines M has to be fixed. This is a highly restrictive assumption. In this section, we propose a method for the case of $M \to \infty$ , which is also valid for fixed M.

Note that $\hat{g} (\cdot)$ in (3) may not be a linear function, so solving (2) is a nonlinear optimization problem, and the computation can be challenging. Instead, we use a local linear approximation of $\hat{g} (X_{i}^{⊤} γ_{1}, γ)$ around an initial value ${\hat{γ}}_{1}^{0}$ , where ${\hat{γ}}_{1}^{0} = (1, {\hat{γ}}^{0 ⊤})^{⊤}$ . This yields $\begin{aligned} \hat{g} (X_{i}^{⊤} γ_{1}, γ) & \approx \hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) + {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) (X_{i}^{⊤} γ_{1} - X_{i}^{⊤} {\hat{γ}}_{1}^{0}) \\ & = \hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) + {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) (X_{i, - 1}^{⊤} γ - X_{i, - 1}^{⊤} {\hat{γ}}^{0}), \end{aligned}$ where $X_{i, - 1}$ is the $(p - 1)$ -dimensional vector consisting of coordinates $2, \dots, p$ of $X_{i}$ and ${\hat{g}}^{'} (u, γ) = \frac{A_{0, 0} (u, γ_{1}, h_{2}) A_{1, 1} (u, γ_{1}, h_{2}) - A_{1, 0} (u, γ_{1}, h_{2}) A_{0, 1} (u, γ_{1}, h_{2})}{A_{0, 0} (u, γ_{1}, h_{2}) A_{2, 0} (u, γ_{1}, h_{2}) - A_{1, 0}^{2} (u, γ_{1}, h_{2})} .$ (5) We denote $\hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) = \hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}, {\hat{γ}}^{0})$ and ${\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) = {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}, {\hat{γ}}^{0})$ for simplicity. 
Then, the proposed estimator is obtained by minimizing the following least squares function, $\begin{aligned} \hat{γ} & = \arg\min_{γ} \sum_{m = 1}^{M} \sum_{i \in S_{m}} \left\{Y_{i} - \hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) - {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) (X_{i, - 1}^{⊤} γ - X_{i, - 1}^{⊤} {\hat{γ}}^{0})\right\}^{2} \\ & = \left\{\sum_{m = 1}^{M} \sum_{i \in S_{m}} {\hat{g}}^{' 2} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) X_{i, - 1} X_{i, - 1}^{⊤}\right\}^{- 1} \left\{\sum_{m = 1}^{M} \sum_{i \in S_{m}} {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) X_{i, - 1} Y_{i}^{*}\right\}, \end{aligned}$ (6) where $Y_{i}^{*} = Y_{i} - \hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) + {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) X_{i, - 1}^{⊤} {\hat{γ}}^{0}$ .
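The closed form in (6) is an ordinary least squares solve on the working response $Y_{i}^{*}$ with regressors ${\hat{g}}^{'} X_{i, - 1}$ . The sketch below illustrates this step only; the function names are hypothetical, the fitted values of $\hat{g}$ and ${\hat{g}}^{'}$ at the initial index are assumed to be supplied, and a small Gaussian elimination routine stands in for a linear algebra library.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def one_step_update(y, x_rest, g_vals, dg_vals, gamma_init):
    """One update of equation (6).

    x_rest[i] is X_{i,-1}; g_vals[i] and dg_vals[i] are the fitted values of
    g-hat and g-hat' at the initial index (assumed to be computed beforehand).
    """
    d = len(gamma_init)
    C = [[0.0] * d for _ in range(d)]
    D = [0.0] * d
    for yi, xi, gi, dgi in zip(y, x_rest, g_vals, dg_vals):
        # Working response Y_i^* = Y_i - g-hat + g-hat' * X_{i,-1}^T gamma^0.
        y_star = yi - gi + dgi * sum(a * b for a, b in zip(xi, gamma_init))
        for j in range(d):
            D[j] += dgi * xi[j] * y_star
            for k in range(d):
                C[j][k] += dgi * dgi * xi[j] * xi[k]
    return solve_linear(C, D)
```

If the responses satisfy the linearized model exactly, the update recovers the target coefficient in a single step, which gives a simple check of the algebra.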

By the forms of (3) and (5), for given $γ$ , it is easy to estimate $g_{0} (\cdot)$ and $g_{0}^{'} (\cdot)$ with a massive dataset. $A_{l, s} (u, γ_{1}, h_{r})$ in (3) and (5) can be rewritten as $A_{l, s} (u, γ_{1}, h_{r}) = \sum_{m = 1}^{M} \left\{\sum_{i \in S_{m}} (X_{i}^{⊤} γ_{1} - u)^{l} Y_{i}^{s} K_{h_{r}} (X_{i}^{⊤} γ_{1} - u)\right\},$ (7) where l = 0, 1, 2, s = 0, 1 and r = 1, 2. Thus, by (3), (5) and (7), we can obtain the estimators of $g_{0} (\cdot)$ and $g_{0}^{'} (\cdot)$ for a massive dataset. Note that these estimators are the same as the estimators in (3) and (5) computed directly from the full data. Thus, we can use (3), (5), (6) and (7) to iteratively update the estimate of $γ_{0}$ until convergence.
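The decomposition in (7) says that each block only has to report its own partial sum, and adding the partial sums gives the full-data quantity. A small sketch of this additivity, with hypothetical function names and the Epanechnikov kernel of Section 3 assumed:

```python
def kernel_moment(l, s, u, t_vals, y_vals, h):
    """Partial sum of (t - u)^l * y^s * K_h(t - u) over one set of observations."""
    total = 0.0
    for t, y in zip(t_vals, y_vals):
        z = (t - u) / h
        k = 0.75 * (1.0 - z * z) / h if abs(z) <= 1.0 else 0.0
        total += (t - u) ** l * y ** s * k
    return total


def aggregated_moment(l, s, u, blocks, h):
    """A_{l,s} as in equation (7): add the partial sums reported by each block.

    `blocks` is a list of (index_values, responses) pairs, one per block.
    """
    return sum(kernel_moment(l, s, u, t_b, y_b, h) for t_b, y_b in blocks)
```

Up to floating-point rounding, the aggregated value agrees with the sum computed on the pooled data, so no accuracy is lost by splitting.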

2.2. Asymptotic normality of the resulting estimator

To establish the asymptotic property of the proposed estimator, the following technical conditions are imposed.

(C1)	The density function of $X^{⊤} γ_{1}$ is positive and satisfies a Lipschitz condition of order 1 for $γ_{1}$ in a neighbourhood of $γ_{01}$ . Further, $X^{⊤} γ_{01}$ has a positive and bounded density function on Λ, where $Λ = \{t = X^{⊤} γ_{01} : X \in Θ\}$ and Θ is the compact support set of $X$ .
(C2)	$g_{0} (t)$ and the j-th ( $2 \leq j \leq p$ ) component of $E [X ∣ X^{⊤} γ_{0} = t]$ have two bounded and continuous derivatives.
(C3)	$E [ϵ ∣ X] = 0$ and $E [ϵ^{4} ∣ X] < \infty$ .
(C4)	The kernel $K (\cdot)$ is a bounded, continuous and symmetric probability density function, satisfying $\int_{- \infty}^{\infty} u^{2} K (u) d u \neq 0$ and $\int_{- \infty}^{\infty} \| u \|^{4} K (u) d u < \infty .$ In addition, $Σ = E [g_{0}^{'} (X^{⊤} γ_{0}) X_{- 1} X_{- 1}^{⊤}]$ is a positive definite matrix.

Remark 2.1

Conditions (C1)–(C4) are commonly used in the literature; see Wang et al. (2010). Condition (C1) is used to bound the density function of $X^{⊤} γ_{1}$ away from zero. This ensures that the denominators of $\hat{g} (u, γ_{1})$ and ${\hat{g}}^{'} (u, γ_{1})$ are, with high probability, bounded away from 0 for $u = X^{⊤} γ_{1}$ , where $X \in Θ$ and $γ_{1}$ is near $γ_{01}$ . The Lipschitz condition and the two derivatives in conditions (C1) and (C2) are standard smoothness conditions. Condition (C3) is a necessary condition for the asymptotic normality of an estimator. Condition (C4) is a usual assumption on the kernel function.

Theorem 2.1

Suppose conditions (C1)–(C4) hold, $‖ {\hat{γ}}^{0} - γ_{0} ‖_{2} = O_{p} ({\tilde{n}}^{- 1 / 2})$ with $\tilde{n} = n^{c}$ , $0 < c \leq 1$ , $n \to \infty$ , $h_{1} = O (n^{- 1 / 4} / \log n)$ , $h_{2} = O (n^{- 1 / 4} \log^{2} n)$ , and $Q \geq 1 + \log (\log n / \log \tilde{n}) / \log 2$ . Then the estimator $\hat{γ}$ from the Q-th iteration satisfies $\sqrt{n} (\hat{γ} - γ_{0}) \stackrel{L}{\Longrightarrow} N (0, Σ^{- 1} S Σ^{- 1}),$ where $\stackrel{L}{\Longrightarrow}$ stands for convergence in distribution, $Σ = E \{g_{0}^{'} (X^{⊤} γ_{0}) X_{- 1} X_{- 1}^{⊤}\}$ and $S = E [g_{0}^{'} (X^{⊤} γ_{01})^{2} \{X_{- 1} - E (X_{- 1} ∣ X^{⊤} γ_{01})\} \{X_{- 1} - E (X_{- 1} ∣ X^{⊤} γ_{01})\}^{⊤} ϵ^{2}] .$ The initial estimator ${\hat{γ}}^{0}$ can be obtained by the method in Ichimura (1993) based on $S_{1}$ , which satisfies $‖ {\hat{γ}}^{0} - γ_{0} ‖_{2} = O_{p} ({\tilde{n}}^{- 1 / 2})$ under some regularity conditions.

Theorem 2.1 shows that $\hat{γ}$ achieves the same asymptotic efficiency as the estimator in (2) computed directly on all the samples; see Theorem 2 in Wang et al. (2010). Compared with the averaging divide-and-conquer method, which can also achieve the convergence rate $O_{p} (n^{- 1 / 2})$ but only for fixed M, our approach removes the restriction on the number of machines M by applying multiple rounds of aggregation. It is also important to note that the required number of rounds Q is usually quite small. For example, if $n = 10^{20}$ and $\tilde{n} = 10^{5}$ , then Q = 3 suffices.
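The bound on Q in Theorem 2.1 is easy to evaluate numerically. The helper below is a sketch with a hypothetical name; it returns the smallest integer satisfying the bound, with a small tolerance (our choice) to absorb floating-point rounding when the bound is itself an integer.

```python
import math


def min_rounds(n, n_block):
    """Smallest integer Q with Q >= 1 + log(log n / log n_block) / log 2."""
    bound = 1.0 + math.log(math.log(n) / math.log(n_block)) / math.log(2.0)
    # The tolerance guards against rounding noise just above an integer bound.
    return max(1, math.ceil(bound - 1e-12))


# The example from the text: n = 10^20 and block size 10^5 need Q = 3 rounds.
rounds = min_rounds(10 ** 20, 10 ** 5)
```

Because the bound grows like the logarithm of a ratio of logarithms, Q stays small even when n is many orders of magnitude larger than the block size.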

After obtaining the estimate $\hat{γ}$ of $γ_{0}$ , for any given point u, we can estimate $g_{0} (\cdot)$ in model (1) with a massive dataset by (3).

2.3. Algorithm

Based on the above analysis, we now introduce an iterative divide-and-conquer method for estimating $γ_{0}$ .

Step 1: Without loss of generality, the entire data set is partitioned into M subsets: $S_{1}, \dots, S_{M}$ .

Step 2: Calculate the initial estimator ${\hat{γ}}^{0}$ based on $S_{1}$ .

Step 3: Compute the estimators $\hat{g} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) = \frac{\sum_{k = 1}^{M} B_{2, 0, 1}^{k} \sum_{k = 1}^{M} B_{0, 1, 1}^{k} - \sum_{k = 1}^{M} B_{1, 0, 1}^{k} \sum_{k = 1}^{M} B_{1, 1, 1}^{k}}{\sum_{k = 1}^{M} B_{0, 0, 1}^{k} \sum_{k = 1}^{M} B_{2, 0, 1}^{k} - (\sum_{k = 1}^{M} B_{1, 0, 1}^{k})^{2}}$ and ${\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) = \frac{\sum_{k = 1}^{M} B_{0, 0, 2}^{k} \sum_{k = 1}^{M} B_{1, 1, 2}^{k} - \sum_{k = 1}^{M} B_{1, 0, 2}^{k} \sum_{k = 1}^{M} B_{0, 1, 2}^{k}}{\sum_{k = 1}^{M} B_{0, 0, 2}^{k} \sum_{k = 1}^{M} B_{2, 0, 2}^{k} - (\sum_{k = 1}^{M} B_{1, 0, 2}^{k})^{2}},$ where $B_{l, s, r}^{k} = \sum_{j \in S_{k}} (X_{j}^{⊤} {\hat{γ}}_{1}^{0} - X_{i}^{⊤} {\hat{γ}}_{1}^{0})^{l} Y_{j}^{s} K_{h_{r}} (X_{j}^{⊤} {\hat{γ}}_{1}^{0} - X_{i}^{⊤} {\hat{γ}}_{1}^{0}),$ l = 0, 1, 2, $s = 0, 1,$ $r = 1, 2.$

Step 4: Compute the estimator $\hat{γ}$ : $\hat{γ} = {(\sum_{m = 1}^{M} C_{m})}^{- 1} (\sum_{m = 1}^{M} D_{m}),$ where $C_{m} = \sum_{i \in S_{m}} {\hat{g}}^{' 2} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) X_{i, - 1} X_{i, - 1}^{⊤}$ and $D_{m} = \sum_{i \in S_{m}} {\hat{g}}^{'} (X_{i}^{⊤} {\hat{γ}}_{1}^{0}) X_{i, - 1} Y_{i}^{*} .$

Step 5: Repeat Steps 3 and 4 for $Q \geq 1 + \log (\log n / \log \tilde{n}) / \log 2$ rounds, using the estimate $\hat{γ}$ from Step 4 as the initial value ${\hat{γ}}^{0}$ of the next round.
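Steps 1 to 5 can be sketched as follows. To keep the example short and free of linear algebra dependencies, the sketch assumes p = 2, so that $γ$ is a scalar; it also assumes an initial value supplied by the caller (Step 2 is not implemented), an interleaved partition of the indices, and a guard of our own that skips points where the local linear denominator is numerically zero. All function names are hypothetical, and the code is an illustration rather than the authors' R implementation.

```python
import math
import random


def kern(d, h):
    """Scaled Epanechnikov kernel K_h(d) = K(d / h) / h."""
    z = d / h
    return 0.75 * (1.0 - z * z) / h if abs(z) <= 1.0 else 0.0


def block_sums(u, t_blk, y_blk, h):
    """(B00, B10, B20, B01, B11) of Step 3 for one block and one bandwidth."""
    b = [0.0] * 5
    for t, y in zip(t_blk, y_blk):
        d = t - u
        w = kern(d, h)
        b[0] += w
        b[1] += d * w
        b[2] += d * d * w
        b[3] += y * w
        b[4] += d * y * w
    return b


def aggregate(u, t_blks, y_blks, h):
    """Add the per-block sums; each block contributes only five numbers."""
    per_block = [block_sums(u, t, y, h) for t, y in zip(t_blks, y_blks)]
    return [sum(col) for col in zip(*per_block)]


def fit_at(u, t_blks, y_blks, h1, h2):
    """Step 3: g-hat (bandwidth h1) and g-hat' (bandwidth h2) at the point u."""
    a = aggregate(u, t_blks, y_blks, h1)
    b = aggregate(u, t_blks, y_blks, h2)
    det1 = a[0] * a[2] - a[1] ** 2
    det2 = b[0] * b[2] - b[1] ** 2
    if det1 < 1e-10 or det2 < 1e-10:  # guard added for this sketch
        return None
    g = (a[2] * a[3] - a[1] * a[4]) / det1
    dg = (b[0] * b[4] - b[1] * b[3]) / det2
    return g, dg


def idc_scalar(x, y, gamma_init, M, Q, h1, h2):
    """IDC for p = 2: x is a list of (x1, x2) pairs, index = x1 + gamma * x2."""
    n = len(y)
    blocks = [list(range(m, n, M)) for m in range(M)]       # Step 1
    y_blks = [[y[i] for i in blk] for blk in blocks]
    gamma = gamma_init                                      # Step 2 (supplied)
    for _ in range(Q):                                      # Step 5
        t = [x1 + gamma * x2 for x1, x2 in x]
        t_blks = [[t[i] for i in blk] for blk in blocks]
        C = D = 0.0
        for blk in blocks:                                  # Step 4: C_m and D_m
            for i in blk:
                fit = fit_at(t[i], t_blks, y_blks, h1, h2)
                if fit is None:
                    continue
                g, dg = fit
                x2 = x[i][1]
                y_star = y[i] - g + dg * x2 * gamma
                C += dg * dg * x2 * x2
                D += dg * x2 * y_star
        gamma = D / C
    return gamma
```

Since every quantity is a sum over blocks, the result should not depend on M beyond floating-point rounding, which is a convenient property to check when adapting the sketch. The pairwise kernel evaluation costs on the order of n squared operations per round, so the sketch is meant for small illustrative samples only.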

3. Numerical studies

In this section, we first use Monte Carlo simulation studies to assess the finite sample performance of the proposed procedures and then demonstrate the application of the proposed methods with two real data analyses. All programs are written in R, and our computer has a 3.3 GHz Pentium processor and 8 GB of memory.

The Epanechnikov kernel $K (u) = 0.75 (1 - u^{2}) I (| u | \leq 1)$ is used in this section. When calculating the estimator $\hat{γ}$ in (6), following Wang et al. (2010), we choose the bandwidths $h_{1} = \hat{h} n^{1 / 5} n^{- 1 / 4} / \log n$ and $h_{2} = \hat{h} n^{1 / 5} n^{- 1 / 4} \log^{2} n$ . We can use the method in Ichimura (1993) to obtain ${\hat{γ}}_{m}$ in (4), which can be computed with ‘npindexbw’ in R. All the simulations are run for 100 replicates.
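For reference, the kernel and the two bandwidth formulas can be written down directly. In this sketch the pilot bandwidth $\hat{h}$ is treated as an input, since the text does not say how it is computed; the function names are ours.

```python
import math


def epanechnikov(u):
    """K(u) = 0.75 (1 - u^2) I(|u| <= 1)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def bandwidths(h_hat, n):
    """Return (h1, h2) from the formulas in the text, given a pilot value h_hat."""
    base = h_hat * n ** (1.0 / 5.0) * n ** (-1.0 / 4.0)
    h1 = base / math.log(n)
    h2 = base * math.log(n) ** 2
    return h1, h2
```

By construction the two bandwidths differ by a factor of $\log^{3} n$ , so the derivative estimate is smoothed much more heavily than the function estimate.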

3.1. Simulation example 1: effect of M with fixed n

In this example, we fix the total sample size n = 10000 and vary the number of blocks M over $\{10, 50, 100\}$ , to assess the influence of M on the proposed estimation method. The model for the simulated data is $Y = 5 \cos (π X^{⊤} γ_{01}) + \exp (| X^{⊤} γ_{01} |) + ϵ,$ (8) where $X$ is uniformly distributed on $[0, 1]^{3}$ , $γ_{01} = (1, γ_{0}^{⊤})^{⊤}$ , $γ_{0} = (2, - 1)^{⊤}$ and $ϵ \sim N (0, 1)$ .
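Data from model (8) can be generated as in the following sketch. The function name and the seeding scheme are our own choices, made only so that the example is reproducible.

```python
import math
import random


def simulate_model8(n, seed=0):
    """Draw n observations from model (8) with gamma_01 = (1, 2, -1)."""
    rng = random.Random(seed)
    gamma01 = (1.0, 2.0, -1.0)
    xs, ys = [], []
    for _ in range(n):
        x = [rng.random() for _ in range(3)]           # X uniform on [0, 1]^3
        index = sum(a * b for a, b in zip(x, gamma01))
        mean = 5.0 * math.cos(math.pi * index) + math.exp(abs(index))
        xs.append(x)
        ys.append(mean + rng.gauss(0.0, 1.0))          # epsilon ~ N(0, 1)
    return xs, ys
```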

We compare the proposed iterative divide-and-conquer (IDC) method for $γ_{0}$ with the oracle procedure (Oracle), which is obtained by solving (2) with the full data, and with the averaging divide-and-conquer (ADC) method.

Table 1 reports the mean squared error ( $M S E = \sqrt{(\hat{γ} - γ_{0})^{⊤} (\hat{γ} - γ_{0})}$ ) and the Absolute Bias ( $| \hat{γ} - γ_{0} |$ ) of the estimate $\hat{γ}$ , which we use to assess the accuracy of the estimation methods. From Table 1, the following conclusions can be drawn.

All the estimators are close to the true value, since the Absolute Bias values are very small.
In terms of MSE, the IDC estimator performs better than ADC when M = 100 and worse than ADC when M = 10 and M = 50.
In Table 1, t is the average computing time in seconds used to estimate the index parameter. From t, we see that ADC and IDC are both faster than Oracle. Moreover, IDC is faster than ADC when M = 10.
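The two accuracy measures used above can be computed per replicate as in this sketch (function names are ours). Note that the MSE, as defined in the text, is the Euclidean distance between $\hat{γ}$ and $γ_{0}$ , and the Absolute Bias is taken componentwise.

```python
import math


def mse(gamma_hat, gamma_true):
    """MSE as defined in the text: square root of the squared distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(gamma_hat, gamma_true)))


def absolute_bias(gamma_hat, gamma_true):
    """Componentwise absolute deviation |gamma_hat - gamma_0|."""
    return [abs(a - b) for a, b in zip(gamma_hat, gamma_true)]


def mean_over_replicates(values):
    """Average a list of per-replicate scalars."""
    return sum(values) / len(values)
```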

Table 1. The means of Absolute Bias, MSE (standard deviation) and t for simulation Example 1.

3.2. Simulation example 2: effect of M with fixed $\tilde{n}$

To compare how the three methods behave as the number of blocks varies with a fixed sample size on each block ( $\tilde{n} = 100$ ), we consider $M \in \{10, 20, \dots, 100\}$ . The simulated data are again generated from model (8).

The MSE results are presented in Figure 1, and the average computing time in seconds used to estimate the index parameter is presented in Figure 2. From Figures 1 and 2, the following conclusions can be drawn.

From Figure 1, we can see that the Oracle method performs best among the three methods for every M. However, by Figure 2, the computing times of the Oracle method are much greater than those of ADC and IDC for every M.
As the number of blocks M increases, the MSEs of Oracle and IDC decrease. However, ADC does not show this pattern.
If the number of blocks is less than 30, the MSEs of the ADC method are smaller than those of IDC. However, as the number of blocks continues to increase, IDC can significantly outperform the ADC method. Furthermore, if the number of blocks is less than 60, the computing times of the IDC method are shorter than those of ADC.

Figure 1. Comparison of MSE versus the number of blocks M with $\tilde{n} = 100$ for three methods for simulation Example 3.2.

Figure 2. The mean computing time of $\hat{γ}$ (in seconds) for simulation Example 3.2.

3.3. Simulation example 3: effect of n

To examine the effect of increasing the sample size, n = 5000, 10000 and 20000 are considered. The following single-index model is considered: (9) $Y = \sin (0.75 X^{⊤} γ_{01}) + ϵ,$ (9) where $X = (X_{1}, X_{2})^{⊤}$ is a two-dimensional standard normal variable, the correlation between $X_{1}$ and $X_{2}$ is 0.5, $γ_{01} = (1, γ_{0}^{⊤})^{⊤}$ , $γ_{0} = 2$ and $ϵ \sim N (0, {0.2}^{2})$ .
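Model (9) needs a bivariate standard normal covariate with correlation 0.5. One standard construction, used in the sketch below, builds $X_{2}$ from two independent standard normals; the function name and seeding are our own choices.

```python
import math
import random


def simulate_model9(n, seed=0):
    """Draw n observations from model (9) with gamma_01 = (1, 2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = 0.5 * z1 + math.sqrt(0.75) * z2       # Var(x2) = 1, Corr(x1, x2) = 0.5
        index = x1 + 2.0 * x2
        xs.append((x1, x2))
        ys.append(math.sin(0.75 * index) + rng.gauss(0.0, 0.2))  # sd 0.2
    return xs, ys
```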

Table 2 presents the averages of the Absolute Bias and computing time t over the 100 data sets, along with the estimated standard errors. As expected, the Absolute Bias becomes smaller and the computing time t becomes larger as the sample size grows.

Table 2. The means of Absolute Bias (standard deviation) and t for simulation Example 3.

3.4. Real data example 1: combined cycle power plant data

We apply the proposed method to combined cycle power plant data. The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006–2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant. The data set is obtained from online site: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.

In this study, the following single-index model is used to fit the data: $E P = g \{γ_{1} A T + γ_{2} A P + γ_{3} R H + γ_{4} V\} + ϵ,$ (10) where all the data are normalized. We considered the number of blocks $M \in \{8, 16, 26, 46, 92\}$ ; hence, the respective local sample sizes are $\tilde{n} \in \{1196, 598, 368, 208, 104\}$ . Table 3 summarizes the estimated coefficients for the above model, showing that AP has the smallest effect on EP among the four covariates and AT is the most important covariate. Figure 3 shows the estimated EP by the Oracle method along with the observations, illustrating that the single-index model (10) fits the combined cycle power plant data well. Furthermore, we evaluate the performance of the three estimators based on the mean square fitting error ( $M S F E = \sum_{i = 1}^{9568} | {E P}_{i} - {\hat{E P}}_{i} | / 9568$ ), where ${\hat{E P}}_{i}$ is the fitted value of ${E P}_{i}$ by (3). From Table 3, the following conclusions can be drawn.

The MSFEs of IDC for the different values of M are very close to that of the Oracle method. Thus IDC performs well.
As the number of blocks M increases, the MSFEs of ADC increase. The IDC estimator performs better than ADC when M = 92 and worse than ADC when M is small.
In Table 3, t is the average computing time in seconds used to estimate the index parameter. From t, we see that IDC is faster than both Oracle and ADC.
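The block sizes and the fitting criterion used in this example are simple to reproduce. The sketch below uses hypothetical function names; note that the MSFE, as written in the text, averages absolute rather than squared deviations, and the code follows the formula as written.

```python
def local_sample_sizes(n, block_counts):
    """Block size n / M for each M, requiring an exact split."""
    sizes = []
    for M in block_counts:
        if n % M != 0:
            raise ValueError("n is not divisible by M = %d" % M)
        sizes.append(n // M)
    return sizes


def msfe(observed, fitted):
    """Fitting error as written in the text: mean of |EP_i - fitted EP_i|."""
    return sum(abs(o - f) for o, f in zip(observed, fitted)) / len(observed)


# The five block counts used for the 9568 power plant observations.
sizes = local_sample_sizes(9568, [8, 16, 26, 46, 92])
```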

Figure 3. Estimated single index function for the combined cycle power plant data. The dots are the observations EP and the curve is the estimated EP by the Oracle method.

Table 3. The coefficient estimates and MSFE for the combined cycle power plant data.

3.5. Real data example 2: airline on-time data

Airline on-time performance data from the 2009 ASA Data Expo are used as a case study. These data are publicly available (http://stat-computing.org/dataexpo/2009/the-data.html). This dataset consists of flight arrival and departure details for all commercial flights within the United States from October 1987 to April 2008. About 12 million flights were recorded with 29 variables. In this section, we only consider the data of year 2008 (the number of samples is 1011963). The first 1000000 data points are used for the estimation and the remaining 11963 data are used for the prediction.

Schifano et al. (2016) developed a linear model that fits the data as follows: $A D = γ_{1} H D + γ_{2} D I S + γ_{3} N F + γ_{4} W F + ϵ,$ (11) where $A D$ is the arrival delay (ArrDelay), which is a continuous variable found by modelling $\log (A r r D e l a y - \min (A r r D e l a y) + 1)$ , $H D$ is the departure hour (range 0 to 24), $D I S$ is the distance (in 1000 miles), $N F$ is the dummy variable for a night flight (1 if departure between 8 p.m. and 5 a.m., 0 otherwise), and $W F$ is the dummy variable for a weekend flight (1 if departure occurred during the weekend, 0 otherwise).
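The variables in model (11) can be constructed from raw flight records roughly as follows. This is a sketch under assumptions that the text does not fix: the function names are hypothetical, the departure hour is taken as a number in [0, 24], the night window is read as 20:00 (inclusive) to 05:00 (exclusive), and the day of the week is coded 1 = Monday through 7 = Sunday.

```python
import math


def log_shifted_delay(arr_delays):
    """AD = log(ArrDelay - min(ArrDelay) + 1), applied to a list of delays."""
    lowest = min(arr_delays)
    return [math.log(d - lowest + 1.0) for d in arr_delays]


def night_flight(dep_hour):
    """NF = 1 for departures from 20:00 up to 05:00 (boundary handling assumed)."""
    return 1 if dep_hour >= 20 or dep_hour < 5 else 0


def weekend_flight(day_of_week):
    """WF = 1 on Saturday or Sunday, assuming 1 = Monday, ..., 7 = Sunday."""
    return 1 if day_of_week in (6, 7) else 0


def covariates(dep_hour, distance_miles, day_of_week):
    """Return (HD, DIS, NF, WF) with the distance expressed in 1000 miles."""
    return (dep_hour, distance_miles / 1000.0,
            night_flight(dep_hour), weekend_flight(day_of_week))
```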

In this study, the following single-index model is used to fit the data: $A D = g \{γ_{1} H D + γ_{2} D I S + γ_{3} N F + γ_{4} W F\} + ϵ .$ (12) For comparison purposes, we use the least squares (LS) method to estimate $(γ_{1}, γ_{2}, γ_{3}, γ_{4})^{⊤}$ in model (11), and use the ADC and IDC methods described in Section 2 to estimate $(γ_{1}, γ_{2}, γ_{3}, γ_{4})^{⊤}$ in model (12). The number of blocks is 1000 for these three methods. Furthermore, we evaluate the performance of these estimators based on their out-of-sample prediction with the mean absolute prediction error (MAPE) of the predictions, $M A P E = \frac{1}{n} \sum_{i = 1}^{n} | {A D}_{i} - {\hat{A D}}_{i} |,$ where ${\hat{A D}}_{i}$ is the predicted value of ${A D}_{i}$ , $i = 1, \dots, n$ with n = 11963. ${\hat{A D}}_{i}$ can be obtained by (3). Table 4 presents the estimated coefficients and MAPEs of the three methods. From Table 4, we can see that the IDC method performs better than LS and ADC, since it has the smallest MAPE.

Table 4. The coefficient estimates and MAPE for the airline on-time data.

4. Final remarks

We proposed a divide-and-conquer method to deal with single-index model for massive dataset. The proposed method significantly reduces the required amount of primary memory and enjoys a very low computational cost. The proposed method achieves the same asymptotic efficiency as the estimator using all the data. Furthermore, it allows a weak condition on the sample size as a function of memory size.

Additional information

Funding

We would like to acknowledge support for this project from the Fundamental Research Funds for the Central Universities of China (No. 2232020D-43).

References

Chen, X., Liu, W., & Zhang, Y. (2019). Quantile regression under memory constraint. The Annals of Statistics, 47(6), 3244–3273. https://doi.org/10.1214/18-AOS1777
Chen, X. Y., & Xie, M. G. (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4), 1655–1684. https://doi.org/10.5705/ss.2013.088
Christou, E., & Akritas, M. (2016). Single index quantile regression for heteroscedastic data. Journal of Multivariate Analysis, 150, 169–182. https://doi.org/10.1016/j.jmva.2016.05.010
Delecroix, M., Hristache, M., & Patilea, V. (2006). On semiparametric M-estimation in single-index regression. Journal of Statistical Planning and Inference, 136(3), 730–769. https://doi.org/10.1016/j.jspi.2004.09.006
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1-2), 71–120. https://doi.org/10.1016/0304-4076(93)90114-K
Web of Science ®Google Scholar
Jiang, R., Guo, M. F., & Liu, X. (2020). Composite quasi-likelihood for single-index models with massive datasets. Communications in Statistics – Simulation and Computation, 51(9), 5024–5040. https://doi.org/10.1080/03610918.2020.1753074
Web of Science ®Google Scholar
Jiang, R., Qian, W. M., & Zhou, Z. G. (2016). Weighted composite quantile regression for single-index models. Journal of Multivariate Analysis, 148, 34–48. https://doi.org/10.1016/j.jmva.2016.02.015
Web of Science ®Google Scholar
Jiang, R., Zhou, Z. G., Qian, W. M., & Chen, Y. (2013). Two step composite quantile regression for single-index models. Computational Statistics & Data Analysis, 64, 180–191. https://doi.org/10.1016/j.csda.2013.03.014
Web of Science ®Google Scholar
Jordan, M., Lee, J., & Yang, Y. (2019). Communication-efficient distributed statistical learning. Journal of the American Statistical Association, 114(526), 668–681. https://doi.org/10.1080/01621459.2018.1429274
Lin, N., & Xi, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface, 4(1), 73–83. https://doi.org/10.4310/SII.2011.v4.n1.a8
Schifano, E., Wu, J., Wang, C., Yan, J., & Chen, M. H. (2016). Online updating of statistical inference in the big data setting. Technometrics, 58(3), 393–403. https://doi.org/10.1080/00401706.2016.1142900
Tang, Y., Wang, H., & Liang, H. (2018). Composite estimation for single-index models with responses subject to detection limits. Scandinavian Journal of Statistics, 45(3), 444–464. https://doi.org/10.1111/sjos.v45.3
Wang, J. L., Xue, L., Zhu, L., & Chong, Y. (2010). Estimation for a partial-linear single-index model. The Annals of Statistics, 38(1), 246–274. https://doi.org/10.1214/09-AOS712
Wu, T., Yu, K., & Yu, Y. (2010). Single-index quantile regression. Journal of Multivariate Analysis, 101(7), 1607–1621. https://doi.org/10.1016/j.jmva.2010.02.003
Xia, Y., Tong, H., Li, W., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society Series B, 64(3), 363–410. https://doi.org/10.1111/rssb.2002.64.issue-3
Zhu, L., & Xue, L. (2006). Empirical likelihood confidence regions in a partially linear single-index model. Journal of the Royal Statistical Society: Series B, 68(3), 549–570. https://doi.org/10.1111/rssb.2006.68.issue-3

Appendix

Proof of Theorem 2.1

We denote the estimator after the first iteration by $\hat{γ}^{1}$, and note that Equation (6) can be equivalently written as

$\hat{γ}^{1} - γ_{0} = \left\{\sum_{i = 1}^{n} \hat{g}^{\prime 2} (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1} X_{i, -1}^{⊤}\right\}^{-1} \left\{\sum_{i = 1}^{n} \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1} \tilde{Y}_{i}\right\} = U_{n}^{-1} V_{n},$

where $\tilde{Y}_{i} = ϵ_{i} + g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0})$, $V_{n} = n^{-1} \sum_{i = 1}^{n} \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1} \tilde{Y}_{i}$, and $U_{n} = n^{-1} \sum_{i = 1}^{n} \hat{g}^{\prime 2} (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1} X_{i, -1}^{⊤}$. Note that

$\begin{aligned} V_{n} & = \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) \{X_{i, -1} - E (X_{-1} ∣ X_{i}^{⊤} γ_{01})\} ϵ_{i} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} γ_{01})\} X_{i, -1} ϵ_{i} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) [X_{i, -1} \{g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01})\} + E (X_{-1} ∣ X_{i}^{⊤} γ_{01}) ϵ_{i}] \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} X_{i, -1} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} γ_{01})\} \{g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01})\} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1} \{\hat{g} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0})\} \\ & \equiv V_{n1} + V_{n2} + V_{n3} + V_{n4} + V_{n5} . \end{aligned}$
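The form $U_{n}^{-1} V_{n}$ used above is a ratio of two sums over observations, so each sum can be accumulated block by block and only small summaries need to be kept in memory (the common factor $n^{-1}$ cancels). The following is a minimal sketch of that bookkeeping only. It assumes the per-observation quantities (the slope values $\hat{g}'$, the covariates $X_{i,-1}$ and the working responses) have already been computed; all function names are hypothetical and the small linear solver is included only to keep the example self-contained.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Block = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[float]]


def block_summaries(slopes, covariates, responses) -> Tuple[Matrix, Vector]:
    """Per-block sums: U = sum w_i^2 x_i x_i^T and V = sum w_i x_i y_i."""
    d = len(covariates[0])
    u = [[0.0] * d for _ in range(d)]
    v = [0.0] * d
    for w, x, y in zip(slopes, covariates, responses):
        for a in range(d):
            v[a] += w * x[a] * y
            for b in range(d):
                u[a][b] += w * w * x[a] * x[b]
    return u, v


def solve(matrix: Matrix, rhs: Vector) -> Vector:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, d):
            factor = m[r][col] / m[col][col]
            for c in range(col, d + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * d
    for r in reversed(range(d)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, d))
        x[r] = (m[r][d] - tail) / m[r][r]
    return x


def aggregate_and_solve(blocks: Sequence[Block]) -> Vector:
    """Add the block summaries, then solve the aggregated linear system."""
    d = len(blocks[0][1][0])
    u_total = [[0.0] * d for _ in range(d)]
    v_total = [0.0] * d
    for slopes, covariates, responses in blocks:
        u, v = block_summaries(slopes, covariates, responses)
        for a in range(d):
            v_total[a] += v[a]
            for b in range(d):
                u_total[a][b] += u[a][b]
    return solve(u_total, v_total)
```

Because the block summaries are simply added, the aggregated solution coincides (up to floating-point rounding) with the one computed from all observations at once, regardless of how the data are split.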

We first show that $‖ V_{n2} ‖_{2} = o_{p} (n^{-1/2})$. Let $V_{n2, s}$ denote the s-th component of $V_{n2}$. Then, we have

$\begin{aligned} V_{n2, s} & = \frac{1}{n} \sum_{i = 1}^{n} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} γ_{01})\} X_{is, -1} ϵ_{i} \\ & = \frac{1}{n} \sum_{i = 1}^{n} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\} X_{is, -1} ϵ_{i} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} \{g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} γ_{01})\} X_{is, -1} ϵ_{i} \\ & \equiv V_{n21, s} + V_{n22, s} . \end{aligned}$

Note that $\hat{g}' (u, γ)$ in Equation (3) can be rewritten as

$\hat{g}' (u, γ) = \sum_{i = 1}^{n} \tilde{W}_{ni} (u, γ_{1}) Y_{i},$

where

$\tilde{W}_{ni} (u, γ_{1}) = \frac{K_{h_{2}} (X_{i}^{⊤} γ_{1} - u) \{(X_{i}^{⊤} γ_{1} - u) A_{0, 0} (u, γ_{1}, h_{2}) - A_{1, 0} (u, γ_{1}, h_{2})\}}{A_{0, 0} (u, γ_{1}, h_{2}) A_{2, 0} (u, γ_{1}, h_{2}) - A_{1, 0}^{2} (u, γ_{1}, h_{2})} .$
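The weights above make $\hat{g}'$ a ratio of kernel-weighted moments, as is $\hat{g}$ in Equation (3). A minimal sketch of both ratios follows. It assumes $A_{j, k} (u, γ_{1}, h) = \sum_{i} K_{h} (X_{i}^{⊤} γ_{1} - u) (X_{i}^{⊤} γ_{1} - u)^{j} Y_{i}^{k}$, which is inferred here from the form of the weights rather than restated in this appendix, and it uses a Gaussian kernel purely as an illustrative choice. The function names are hypothetical, and `index` stands for the precomputed values $X_{i}^{⊤} γ_{1}$.

```python
import math
from typing import Sequence, Tuple


def gaussian_kernel(z: float) -> float:
    """Standard normal density, used here as an illustrative kernel K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def moment(index: Sequence[float], y: Sequence[float],
           u: float, h: float, j: int, k: int) -> float:
    """Assumed A_{j,k}(u, gamma_1, h) = sum_i K_h(t_i - u) (t_i - u)^j y_i^k."""
    return sum(
        gaussian_kernel((t - u) / h) / h * (t - u) ** j * yi ** k
        for t, yi in zip(index, y)
    )


def local_linear(index: Sequence[float], y: Sequence[float],
                 u: float, h1: float, h2: float) -> Tuple[float, float]:
    """Return (g_hat(u), g_hat'(u)) using bandwidth h1 for g and h2 for g'."""
    def a(j: int, k: int, h: float) -> float:
        return moment(index, y, u, h, j, k)

    # Level estimate: the ratio in Equation (3), with bandwidth h1.
    g_hat = (a(2, 0, h1) * a(0, 1, h1) - a(1, 0, h1) * a(1, 1, h1)) / (
        a(0, 0, h1) * a(2, 0, h1) - a(1, 0, h1) ** 2
    )
    # Slope estimate: sum_i W_ni(u) * y_i written out, with bandwidth h2.
    dg_hat = (a(0, 0, h2) * a(1, 1, h2) - a(1, 0, h2) * a(0, 1, h2)) / (
        a(0, 0, h2) * a(2, 0, h2) - a(1, 0, h2) ** 2
    )
    return g_hat, dg_hat
```

A quick sanity check of the sketch is that it reproduces a straight line exactly: if the responses lie on $a + b t$, the returned pair at any $u$ is $(a + b u, b)$ up to rounding, whatever the bandwidths.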

Thus $V_{n21, s}$ can be rewritten as

$\begin{aligned} V_{n21, s} & = \frac{1}{n} \sum_{i = 1}^{n} \left\{\sum_{j = 1}^{n} \tilde{W}_{nj} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) g_{0} (X_{j}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\right\} X_{is, -1} ϵ_{i} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} \tilde{W}_{ni} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) X_{is, -1} ϵ_{i}^{2} + \frac{1}{n} \sum_{i = 2}^{n} \sum_{j = 1}^{i - 1} a_{ij} ϵ_{j} ϵ_{i} \\ & \equiv V_{n211, s} + V_{n212, s} + V_{n213, s}, \end{aligned}$

where $a_{ij} = \tilde{W}_{nj} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) X_{is, -1} + \tilde{W}_{ni} (X_{j}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) X_{js, -1}$.

Since $\hat{γ}^{0}$ is a consistent estimate of $γ_{0}$, by Lemma 1 in Zhu and Xue (2006), conditions (C2) and (C3) and the Cauchy–Schwarz inequality, for constants $c_{1}$ and $c_{2}$ large enough, we have

$\begin{aligned} n E [V_{n211, s}^{2}] & = \frac{1}{n} \sum_{i = 1}^{n} E \left[\left\{\sum_{j = 1}^{n} \tilde{W}_{nj} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) g_{0} (X_{j}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\right\}^{2} X_{is, -1}^{2} E (ϵ_{i}^{2} ∣ X_{i})\right] \\ & \leq c_{1} \frac{1}{n} \sum_{i = 1}^{n} \left[E \left\{\sum_{j = 1}^{n} \tilde{W}_{nj} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) g_{0} (X_{j}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\right\}^{4}\right]^{1/2} \{E (X_{is, -1}^{4})\}^{1/2} \\ & \leq c_{2} h_{2}^{2} \to 0. \end{aligned}$

For $V_{n212, s}$, by Lemma 2 in Zhu and Xue (2006), for constants $c_{3}$ and $c_{4}$ large enough, we have

$\begin{aligned} \sqrt{n} E | V_{n212, s} | & \leq \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} E \{| \tilde{W}_{ni} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0}) X_{is, -1} | E (ϵ_{i}^{2} ∣ X_{i})\} \\ & \leq c_{3} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \sqrt{E \{\tilde{W}_{ni} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0})\}^{2}} \sqrt{E (X_{is, -1}^{2})} \\ & \leq c_{4} \{(n h_{2}^{2})^{-1/2} + (n h_{2}^{5/2})^{-1}\} \to 0. \end{aligned}$

We now consider $V_{n213, s}$. Note that $E (a_{ij} ϵ_{j} ϵ_{i} ∣ X, ϵ_{j}) = 0$ and $E (a_{i_{1} j_{1}} ϵ_{j_{1}} ϵ_{i_{1}} a_{i_{2} j_{2}} ϵ_{j_{2}} ϵ_{i_{2}} ∣ X) = 0$ when $\{i_{1}, j_{1}\} \neq \{i_{2}, j_{2}\}$; by Lemma 2 in Zhu and Xue (2006), for constants $c_{5}$ and $c_{6}$ large enough, we have

$\begin{aligned} n E (V_{n213, s}^{2}) & = \frac{1}{n} \sum_{i = 2}^{n} \sum_{j = 1}^{i - 1} E \{a_{ij}^{2} E (ϵ_{j}^{2} ∣ X_{js}) E (ϵ_{i}^{2} ∣ X_{is})\} \\ & \leq c_{5} \frac{1}{n} \sum_{i \neq j} \sqrt{E \{\tilde{W}_{nj} (X_{i}^{⊤} \hat{γ}_{1}^{0}, \hat{γ}_{1}^{0})\}^{4}} \sqrt{E (X_{is, -1}^{4})} \\ & \leq c_{6} \{(n_{2} h_{2}^{3})^{-1} + n^{-1} \sqrt{h_{2}} + (n h_{2}^{5/2})^{-2}\} \to 0. \end{aligned}$

By the condition $‖ \hat{γ}^{0} - γ_{0} ‖_{2} = O_{p} (\tilde{n}^{-1/2})$, we have

$V_{n22, s} = o_{p} (n^{-1/2}) .$

Combining the above results, the moments of $\sqrt{n} V_{n211, s}$, $\sqrt{n} V_{n212, s}$ and $\sqrt{n} V_{n213, s}$ bounded above all converge to 0. By the Markov inequality and the bound on $V_{n22, s}$, we have

$‖ V_{n2} ‖_{2} = o_{p} (n^{-1/2}) .$

We next prove that the mean and the variance of $n^{1/2} V_{n3, s}$ tend to 0. Using $E \{\hat{g} (X_{i}^{⊤} γ_{01}) - g_{0} (X_{i}^{⊤} γ_{01})\} = O (h_{1}^{2})$ and condition (C2), we have

$\sqrt{n} E V_{n3, s} \leq O (n^{1/2} h_{1}^{2}) \to 0.$

Using conditions (C1)–(C4) and (A.35) of Chang et al. (2010), we obtain

$n E V_{n3, s}^{2} \leq O (n h_{1}^{4} + \sqrt{h_{1}} + (n h_{1})^{-1}) \to 0.$

This proves that

$‖ V_{n3} ‖_{2} = o_{p} (n^{-1/2}) .$

We now consider $V_{n4}$. Its s-th component can be written as

$\begin{aligned} V_{n4, s} & = \frac{1}{n} \sum_{i = 1}^{n} X_{is, -1} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\} \{g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01})\} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} X_{is, -1} \{g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} γ_{01})\} \{g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01})\} \\ & \equiv V_{n41, s} + V_{n42, s} . \end{aligned}$

By Lemma 3 in Zhu and Xue (2006) and the Markov inequality, for any $ϵ > 0$,

$\begin{aligned} & P \left(\sqrt{\frac{n h_{2}^{3}}{\log n}} | \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) | \geq ϵ\right) \\ & \quad \leq \frac{n h_{2}^{3}}{\log n} E \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\}^{2} / ϵ^{2} \to 0. \end{aligned}$

Hence, we have, uniformly over $1 \leq i \leq n$,

$| \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) | = O_{p} \left(\sqrt{\frac{\log n}{n h_{2}^{3}}}\right) .$

Similarly,

$| \hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0} (X_{i}^{⊤} \hat{γ}_{1}^{0}) | = O_{p} \left(\sqrt{\frac{\log n}{n h_{1}}}\right) .$

Thus we have

$\begin{aligned} \sqrt{n} | V_{n41, s} | & \leq \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} | X_{is, -1} | \, | \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) | \, | g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01}) | \\ & = O_{p} \left(\sqrt{\frac{\log^{2} n}{n h_{1} h_{2}^{3}}}\right) . \end{aligned}$

Noting that $n h_{1} h_{2}^{3} / \log^{2} n \to \infty$, we obtain

$V_{n41, s} = o_{p} (n^{-1/2}) .$

For $V_{n42, s}$, by a Taylor expansion, we get

$V_{n42, s} = \frac{1}{n} \sum_{i = 1}^{n} g_{0}'' (ξ_{i}) X_{is, -1} X_{i}^{⊤} (\hat{γ}_{1}^{0} - γ_{01}) \{g_{0} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} γ_{01})\},$

where $ξ_{i}$ is a point between $X_{i}^{⊤} γ_{01}$ and $X_{i}^{⊤} \hat{γ}_{1}^{0}$. By the condition $‖ \hat{γ}^{0} - γ_{0} ‖_{2} = O_{p} (\tilde{n}^{-1/2})$ and using $E \{\hat{g} (X_{i}^{⊤} γ_{01}) - g_{0} (X_{i}^{⊤} γ_{01})\} = O (h_{1}^{2})$, we have

$\sqrt{n} E V_{n42, s} = O (\tilde{n}^{-1/2} \sqrt{n} h_{1}^{2}) \to 0,$

and

$n E V_{n42, s}^{2} \leq O (h_{1}^{4} \tilde{n}^{-1} + (\tilde{n} n h_{1})^{-1}) \to 0.$

Thus we can obtain $V_{n42, s} = o_{p} (n^{-1/2})$. Therefore,

$‖ V_{n4} ‖_{2} = o_{p} (n^{-1/2}) .$

Finally, we consider $V_{n5}$. Write

$V_{n5, s} = V_{n51, s} + V_{n52, s},$

where

$V_{n51, s} = n^{-1} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{is, -1} \{\hat{g} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0})\}$

and

$V_{n52, s} = n^{-1} \sum_{i = 1}^{n} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\} X_{is, -1} \{\hat{g} (X_{i}^{⊤} γ_{01}) - \hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - \hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0})\} .$

We rewrite $V_{n51, s}$ as

$\begin{aligned} V_{n51, s} & = \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{is, -1} \{g_{0} (X_{i}^{⊤} γ_{01}) - g_{0} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0})\} \\ & \quad + \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{is, -1} \{\hat{g} (X_{i}^{⊤} γ_{01}) - g_{0} (X_{i}^{⊤} γ_{01})\} \\ & \quad - \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{is, -1} \{\hat{g} (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0} (X_{i}^{⊤} \hat{γ}_{1}^{0})\} \\ & \quad - \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) X_{is, -1} \{\hat{g}' (X_{i}^{⊤} \hat{γ}_{1}^{0}) - g_{0}' (X_{i}^{⊤} \hat{γ}_{1}^{0})\} X_{i, -1}^{⊤} (γ_{0} - \hat{γ}^{0}) \\ & \equiv V_{n511, s} + V_{n512, s} + V_{n513, s} + V_{n514, s} . \end{aligned}$

For $V_{n511, s}$, by a Taylor expansion, we get $V_{n511, s} = O_{p} (\tilde{n}^{-1})$. Similar to the proof of $V_{n42, s}$, we get

$\begin{aligned} & \sqrt{n} E V_{n512, s} = \sqrt{n} E V_{n513, s} = O (\sqrt{n} h_{1}^{2}) \to 0, \\ & n E V_{n512, s}^{2} = n E V_{n513, s}^{2} = O (h_{1}^{4} + (n h_{1})^{-1}) \to 0, \end{aligned}$

and

$\sqrt{n} E V_{n514, s} = O (\sqrt{n / \tilde{n}} \, h_{2}^{2}) \to 0, \quad n E V_{n514, s}^{2} = O (\tilde{n}^{-1} h_{2}^{2} + \tilde{n}^{-1} n^{-1} h_{2}^{-3}) \to 0.$

Hence,

$‖ V_{n5} ‖_{2} = o_{p} (n^{-1/2}) + O_{p} (\tilde{n}^{-1}) .$

Therefore, we have

$V_{n} = \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) \{X_{i, -1} - E (X_{-1} ∣ X_{i}^{⊤} γ_{01})\} ϵ_{i} + o_{p} (n^{-1/2}) + O_{p} (\tilde{n}^{-1}) .$

Similar to the proof of $V_{n}$, we can obtain

$U_{n} = Σ + o_{p} (1),$

where $Σ = E \{g_{0}^{\prime 2} (X^{⊤} γ_{01}) X_{-1} X_{-1}^{⊤}\}$. Thus

$\hat{γ}^{1} - γ_{0} = Σ^{-1} \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) \{X_{i, -1} - E (X_{-1} ∣ X_{i}^{⊤} γ_{01})\} ϵ_{i} + o_{p} (n^{-1/2}) + O_{p} (\tilde{n}^{-1}) .$

Note that one round of aggregation refines the estimator by reducing its bias from $\tilde{n}^{-1/2}$ to $\tilde{n}^{-1}$. Therefore, an iterative refinement of the initial estimator will successively improve the estimation accuracy. The estimator $\hat{γ}^{q}$ obtained after the q-th iteration of the divide-and-conquer method satisfies

$\begin{aligned} \hat{γ}^{q} - γ_{0} & = Σ^{-1} \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) \{X_{i, -1} - E (X_{-1} ∣ X_{i}^{⊤} γ_{01})\} ϵ_{i} \\ & \quad + o_{p} (n^{-1/2}) + O_{p} (\tilde{n}^{-2^{q - 1}}) . \end{aligned}$

Thus, after Q iterations, where $Q \geq 1 + \log (\log n / \log \tilde{n}) / \log 2$, we have

$\hat{γ} - γ_{0} = Σ^{-1} \frac{1}{n} \sum_{i = 1}^{n} g_{0}' (X_{i}^{⊤} γ_{01}) \{X_{i, -1} - E (X_{-1} ∣ X_{i}^{⊤} γ_{01})\} ϵ_{i} + o_{p} (n^{-1/2}) .$

By the central limit theorem, we can prove Theorem 2.1.
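The lower bound on the number of iterations in the proof, $Q \geq 1 + \log (\log n / \log \tilde{n}) / \log 2$, is equivalent to $\tilde{n}^{2^{Q - 1}} \geq n$ whenever $\tilde{n} > 1$. A minimal sketch that finds the smallest such $Q$ with exact integer arithmetic (avoiding rounding issues in the logarithms) is given below; the function name is hypothetical, and it assumes $n$ and $\tilde{n}$ are integers with $\tilde{n} \geq 2$.

```python
def min_iterations(n: int, n_tilde: int) -> int:
    """Smallest integer Q >= 1 such that n_tilde ** (2 ** (Q - 1)) >= n."""
    if n_tilde < 2:
        raise ValueError("n_tilde must be at least 2")
    if n < 1:
        raise ValueError("n must be positive")
    q = 1
    power = n_tilde  # equals n_tilde ** (2 ** (q - 1)) throughout the loop
    while power < n:
        power *= power  # squaring doubles the exponent 2 ** (q - 1)
        q += 1
    return q
```

Because the exponent doubles with each round, the required number of iterations grows only like the logarithm of $\log n / \log \tilde{n}$. For example, with a total sample size of $10^{6}$ and an initial sample of size $10^{3}$, the bound evaluates to $1 + \log 2 / \log 2 = 2$.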

A short note on fitting a single-index model with massive data
