Review Article

A review of distributed statistical inference

Yuan Gao, Weidong Liu, Hansheng Wang, Xiaozhou Wang, Yibo Yan & Riquan Zhang
Pages 89-99 | Received 02 Sep 2020, Accepted 01 Aug 2021, Published online: 13 Sep 2021

Abstract

The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of divide-and-conquer, various distributed frameworks for statistical estimation and inference have been proposed to deal with large-scale statistical optimization problems. This paper provides a comprehensive review of the related literature, covering parametric models, nonparametric models, and other frequently used models. Their key ideas and theoretical properties are summarized. The trade-off between communication cost and estimation precision, together with other practical concerns, is also discussed.

1. Introduction

With the rapid development of information technology, datasets of massive size have become increasingly available. E-commerce companies like Amazon have to analyse billions of transaction records for personalized recommendation. Bioinformatics scientists need to locate the genes relevant to a specific phenotype or disease from massive SNP data. For Internet-related companies, large amounts of text, image, voice, and even video data are in urgent need of effective analysis. Due to the accelerated growth of data sizes, the computing power and memory of a single computer are no longer sufficient. Constraints on network bandwidth, together with privacy and security considerations, also make it difficult to process the whole dataset on one central machine. Accordingly, distributed computing systems have become increasingly popular.

Similar to parallel computing executed on a single computer, distributed computing is closely related to the idea of divide-and-conquer. Simply speaking, for some statistical problems we can divide a complicated large task into many small pieces that can be tackled simultaneously on multiple CPUs or machines. Their outcomes are then aggregated to obtain the final result. It is conceivable that this procedure can save computing time substantially if the algorithm can be executed in a parallel way. The main difference between a traditional parallel computing system and a distributed computing system lies in the way they access memory. In parallel computing, different processors share the same memory and can therefore exchange information with each other very efficiently. In distributed computing, by contrast, distinct machines are physically separated and typically connected by a network, so each machine can directly access only its own memory. Consequently, the inter-machine communication cost, in terms of time spent, can be significant and should be considered prudently.

The rest of this article is organized as follows. Section 2 studies parametric models. Section 3 focuses on nonparametric methods. Section 4 discusses some other related works. The article is concluded with a short discussion in Section 5.

2. Parametric models

Assume a total of N observations denoted as $Z_i=(X_i^\top,Y_i)^\top\in\mathbb{R}^{p+1}$ with $1\le i\le N$. Here $X_i\in\mathbb{R}^p$ is the covariate vector and $Y_i\in\mathbb{R}$ is the corresponding scalar response. Define $\{P_\theta:\theta\in\Theta\}$ to be a family of statistical models parameterized by $\theta\in\Theta\subset\mathbb{R}^p$. We further assume that the $Z_i$'s are independent and identically distributed with distribution $P_\theta$, where $\theta=(\theta_1,\ldots,\theta_p)^\top$ is the true parameter. Consider a distributed setting, where the N sample units are allocated randomly and evenly to K local machines $\mathcal{M}_k$, $1\le k\le K$, such that each machine has n observations. Obviously, we should have $N=nK$. Write $S=\{1,\ldots,N\}$ as the index set of the whole sample. Then let $S_k$ denote the index set of the local sample on $\mathcal{M}_k$, with $S_{k_1}\cap S_{k_2}=\emptyset$ for any $k_1\ne k_2$. Other than the local machines, there also exists a central machine represented by $\mathcal{M}_{\rm center}$. A standard architecture has $\mathcal{M}_{\rm center}$ connected with every $\mathcal{M}_k$.

Let $L:\Theta\times\mathbb{R}^{p+1}\to\mathbb{R}$ be the loss function. Assume that the true parameter $\theta$ minimizes the population risk $\mathcal{L}(\theta)=E[L(\theta;Z)]$, where E stands for expectation with respect to $P_\theta$. Define the local loss on the kth machine as $\mathcal{L}_k(\theta)=n^{-1}\sum_{i\in S_k}L(\theta;Z_i)$. Correspondingly, define the global loss function based on the whole sample as $\mathcal{L}_N(\theta)=N^{-1}\sum_{i\in S}L(\theta;Z_i)=K^{-1}\sum_{k=1}^K\mathcal{L}_k(\theta)$, whose minimizer is $\hat\theta=\arg\min_{\theta\in\Theta}\mathcal{L}_N(\theta)$. In most cases, the whole sample estimator $\hat\theta$ should be $\sqrt{N}$-consistent and asymptotically normal (Lehmann & Casella, 2006). If N is small enough so that the whole sample S can be read into the memory of a single computer, then $\hat\theta$ can be easily computed; the entire computation can be executed in the memory of that computer. On the other hand, if N is too large for the whole sample S to be placed on any single computer, then a distributed system must be used. In this case, the whole sample estimator $\hat\theta$ is no longer computable (or at least very difficult to compute) in practice. How to develop novel statistical methods for distributed systems then becomes a problem of great importance.

2.1. One-shot approach

To address this problem, various methods have been proposed. They can be roughly divided into two classes. The first class contains the so-called one-shot methods, which are reviewed in this subsection. The other class contains various iterative methods, which are reviewed in the next subsection.

The basic idea of the one-shot approach is to calculate some relevant statistics on each local machine. Subsequently, these statistics are sent to a central machine, where they are assembled into the final estimator. The most popular and direct way of aggregation is simple averaging. Specifically, for each $1\le k\le K$, machine $\mathcal{M}_k$ uses the local sample $S_k$ to compute the local empirical minimizer $\hat\theta_k=\arg\min_{\theta\in\Theta}\mathcal{L}_k(\theta)$. These local estimates (i.e., the $\hat\theta_k$'s) are then transferred to the central machine $\mathcal{M}_{\rm center}$, where they are averaged as $\bar\theta=K^{-1}\sum_{k=1}^K\hat\theta_k$. This leads to the final simple averaging estimator $\bar\theta$ (see Figure 1(a)).
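
To make the one-shot averaging procedure concrete, the following minimal sketch (in Python with NumPy) simulates the whole pipeline under an assumed least-squares loss, so that each local minimizer has a closed form; the data-generating model, sample sizes and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 20           # whole sample size, dimension, number of machines
n = N // K                        # local sample size
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)

# Step 1: each "machine" computes its local least-squares estimate on its own block.
local_estimates = []
for k in range(K):
    Xk, Yk = X[k * n:(k + 1) * n], Y[k * n:(k + 1) * n]
    theta_k, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
    local_estimates.append(theta_k)

# Step 2: the central machine averages the K local estimates (one round of communication).
theta_bar = np.mean(local_estimates, axis=0)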

Figure 1. Illustrations of the two different approaches. (a) one-shot approach and (b) iterative approach.


Obviously, the one-shot style of distributed framework is highly communication-efficient, because it requires only a single round of communication between each $\mathcal{M}_k$ and $\mathcal{M}_{\rm center}$. Hence, the communication cost is of the order O(Kp), where p is the dimension of each estimate $\hat\theta_k$. Theoretical properties of the simple averaging estimator have also been studied in the literature. For example, it was shown in Zhang et al. (2013, Corollary 2) that, under appropriate regularity conditions, (1) $E\|\bar\theta-\theta\|_2^2\le \frac{C_1}{N}+\frac{C_2}{n^2}+O\big(\frac{1}{Nn}+\frac{1}{n^3}\big)$, where $C_1$, $C_2$ are some positive constants. If n is sufficiently large such that $n^{-2}=o(N^{-1})$, then the dominant term in (1) becomes $C_1/N$, which is of the order $O(N^{-1})$. This is the same as that of the whole sample estimator. It also implies that, in order to attain the global convergence rate, we should not divide the whole sample into too many parts. A further improved theoretical result was obtained by Rosenblatt and Nadler (2016). They showed that the one-shot estimator is first-order equivalent to the whole sample estimator; however, the second-order error terms of $\bar\theta$ can be non-negligible for nonlinear models. A similar observation was also made by Huang and Huo (2015). The work of Duchi et al. (2014) revealed that the minimal communication budget to attain the global estimation error for linear regression is O(Kp) bits up to a logarithmic factor under some conditions. This result matches the simple averaging procedure and confirms the sharpness of the bound in (1). To further reduce the bias, a novel subsampling method was developed by Zhang et al. (2013). With this technique, the error bound is improved to $O(N^{-1}+n^{-3})$, which relaxes the restriction on the number of machines.

Instead of a linear combination of the local maximum likelihood estimates (MLEs) as in simple averaging, Liu and Ihler (2014) proposed a KL-divergence based combination method. The final estimator is computed as $\hat\theta_{\rm KL}=\arg\min_{\theta\in\Theta}\sum_{k=1}^K {\rm KL}\big(p(x|\hat\theta_k)\,\|\,p(x|\theta)\big)$, where $p(x|\theta)$ is the probability density of $P_\theta$ with respect to some proper measure $\mu$, and the KL-divergence is defined by ${\rm KL}(p(x)\,\|\,q(x))=\int_{\mathcal X}p(x)\log\{p(x)/q(x)\}\,d\mu(x)$. It was shown that $\hat\theta_{\rm KL}$ is exactly the global MLE $\hat\theta$ if $\{P_\theta:\theta\in\Theta\}$ is a full exponential family (as defined in their paper). This sheds light on inference for generalized linear models (GLMs) based on the exponential-family likelihood.

In many cases, some local machines might suffer from data of poor quality. This could lead to abnormal local estimates, which further degrade the statistical efficiency of the final estimator. To fix the problem, Minsker (2019) devised a robust assembling method, which leads to the estimator $\hat\theta_{\rm robust}=\arg\min_{\theta\in\Theta}\sum_{k=1}^K\rho(\|\theta-\hat\theta_k\|)$, where $\rho(\cdot)$ is a robust loss function satisfying certain conditions. For example, when $\rho(u)=u$ and p = 1 (the univariate case), $\hat\theta_{\rm robust}$ is the median of the $\hat\theta_k$'s, which is more robust against outliers than the simple average. Under some regularity conditions, it was shown that $\hat\theta_{\rm robust}$ achieves the same convergence rate as the whole sample estimator provided $K=O(\sqrt{N})$.
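
As an illustration of the robust aggregation idea, the sketch below replaces the simple average by a coordinate-wise median, a simple robust combiner in the same spirit; it is not the exact estimator of Minsker (2019), and all simulated quantities are made up.

import numpy as np

rng = np.random.default_rng(1)
K, p = 20, 5
# Simulated local estimates: most machines are accurate, one is badly corrupted.
local_estimates = rng.normal(loc=1.0, scale=0.05, size=(K, p))
local_estimates[0] += 100.0                         # a machine with poor-quality data

theta_mean = local_estimates.mean(axis=0)           # simple average: dragged far from 1 by the outlier
theta_median = np.median(local_estimates, axis=0)   # coordinate-wise median: stays close to 1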

2.2. Iterative approach

Although the one-shot approach involves little communication cost, it suffers from several disadvantages. First, the local machines need a sufficient amount of data (e.g., $n\gg\sqrt{N}$); otherwise the aggregated estimator cannot attain the same convergence rate as the global estimator. This prevents us from utilizing many machines to speed up the computation (Jordan et al., 2019; Wang et al., 2017). Second, the simple averaging estimator often performs poorly for nonlinear models (Huang & Huo, 2015; Jordan et al., 2019; Rosenblatt & Nadler, 2016). Last, when p diverges with N, the situation can be even worse (Lee et al., 2017; Rosenblatt & Nadler, 2016). This suggests that carefully designed algorithms allowing a reasonable number of iterations should be useful for a distributed system.

Inspired by the one-step method in M-estimation theory, Huang and Huo (2015) proposed a one-step refinement of the simple averaging estimator. Recall that $\bar\theta$ is the one-shot averaging estimator. To further improve its statistical efficiency, it is broadcast to each local machine. Next, the local gradient $\nabla\mathcal{L}_k(\bar\theta)$ and local Hessian $\nabla^2\mathcal{L}_k(\bar\theta)$ are computed on each $\mathcal{M}_k$. They are then reported to $\mathcal{M}_{\rm center}$ to form the whole sample gradient $\nabla\mathcal{L}_N(\bar\theta)=K^{-1}\sum_{k=1}^K\nabla\mathcal{L}_k(\bar\theta)$ and Hessian $\nabla^2\mathcal{L}_N(\bar\theta)=K^{-1}\sum_{k=1}^K\nabla^2\mathcal{L}_k(\bar\theta)$. Thus a one-step updated estimator can be constructed on $\mathcal{M}_{\rm center}$ as (2) $\hat\theta^{(1)}=\bar\theta-[\nabla^2\mathcal{L}_N(\bar\theta)]^{-1}\nabla\mathcal{L}_N(\bar\theta)$. Compared with the one-shot estimator, $\hat\theta^{(1)}$ involves one more round of communication. Nevertheless, the statistical efficiency of the resulting estimator can be substantially improved. In fact, Huang and Huo (2015) showed that $E\|\hat\theta^{(1)}-\theta\|_2^2\le \frac{C_1}{N}+O\big(\frac{1}{n^4}+\frac{1}{N^2}\big)$, where $C_1>0$ is some constant. Obviously, this upper bound on the mean squared error is lower than that in (1). To attain the global convergence rate, the local sample size only needs to satisfy $n^{-4}=o(N^{-1})$, which is a much milder condition. Furthermore, they showed that $\hat\theta^{(1)}$ has the same asymptotic efficiency as the whole sample estimator $\hat\theta$ under some regularity conditions.
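
For illustration, the sketch below mimics the one-step refinement under an assumed logistic-regression loss: each machine evaluates its local gradient and Hessian at the averaged estimator, and the central machine performs the single Newton update in (2). The data-generating model, the plain local Newton solver and all constants are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 20_000, 4, 20
n = N // K
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

def local_grad_hess(Xk, Yk, theta):
    # Gradient and Hessian of the local (negative log-likelihood) logistic loss.
    mu = 1.0 / (1.0 + np.exp(-Xk @ theta))
    grad = Xk.T @ (mu - Yk) / len(Yk)
    hess = (Xk * (mu * (1.0 - mu))[:, None]).T @ Xk / len(Yk)
    return grad, hess

def local_minimizer(Xk, Yk, iters=25):
    # A plain local Newton solver standing in for each machine's own optimizer.
    theta = np.zeros(Xk.shape[1])
    for _ in range(iters):
        g, H = local_grad_hess(Xk, Yk, theta)
        theta = theta - np.linalg.solve(H, g)
    return theta

blocks = [(X[k*n:(k+1)*n], Y[k*n:(k+1)*n]) for k in range(K)]
theta_bar = np.mean([local_minimizer(Xk, Yk) for Xk, Yk in blocks], axis=0)   # one-shot average

# Each machine evaluates its gradient and Hessian at theta_bar; the centre averages them.
grads, hessians = zip(*(local_grad_hess(Xk, Yk, theta_bar) for Xk, Yk in blocks))
grad_bar = np.mean(grads, axis=0)
hess_bar = np.mean(hessians, axis=0)

# Single Newton-type update on the central machine, as in (2).
theta_one_step = theta_bar - np.linalg.solve(hess_bar, grad_bar)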

A natural idea to further extend the one-step estimator is to allow the iteration (2) to be executed multiple times. Specifically, let $\hat\theta^{(t)}$ be the estimator at the tth iteration. Then we can apply (2) with $\bar\theta$ replaced by $\hat\theta^{(t)}$ to generate the next estimator $\hat\theta^{(t+1)}$ (see Figure 1(b)). However, this requires a large number of Hessian matrices to be computed and transferred. If the parameter dimension p is relatively high, this leads to a significant communication cost of the order $O(Kp^2)$. To fix the problem, Shamir et al. (2014) proposed an approximate Newton method, which conducts Newton-type iterations distributedly without transferring the Hessian matrices. Following this strategy, Jordan et al. (2019) developed an approximate likelihood approach. Their key idea is to update the Hessian matrix on one single machine (e.g., $\mathcal{M}_{\rm center}$) only. Then, (2) can be revised to $\hat\theta^{(t+1)}=\hat\theta^{(t)}-[\nabla^2\mathcal{L}_{\rm center}(\hat\theta^{(t)})]^{-1}\nabla\mathcal{L}_N(\hat\theta^{(t)})$, where $\nabla^2\mathcal{L}_{\rm center}$ is the Hessian matrix computed on the central machine. By doing so, the communication cost of transmitting Hessian matrices is saved. Under some conditions, they showed that (3) $\|\hat\theta^{(t+1)}-\hat\theta\|_2\le \frac{C_1}{\sqrt{n}}\|\hat\theta^{(t)}-\hat\theta\|_2$, for $t\ge 0$, holds with high probability, where $C_1>0$ is some constant. By the linear convergence formula (3), we can see that about $\lceil\log K/\log n\rceil$ iterations are required to achieve the same $\sqrt{N}$-consistency as the whole sample estimator $\hat\theta$, provided $\hat\theta^{(0)}$ is $\sqrt{n}$-consistent. Note that if $n=K=\sqrt{N}$, one iteration suffices to attain the optimal convergence rate. However, the satisfactory performance of this method relies on a good choice of the machine on which the Hessian is updated (Fan, Guo et al., 2019a). To fix this problem, Fan, Guo et al. (2019a) added an extra regularization term to the approximate likelihood used in Jordan et al. (2019). With this modification, the performance of the resulting estimator is much improved. Theoretically, they established a similar linear convergence rate under more general conditions, which require no strict homogeneity of the local loss functions.
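
The following least-squares sketch conveys the idea of keeping the Hessian on a single machine: every round communicates only length-p gradients, while the (approximate) Hessian is formed once from the central machine's own data. It is a simplified caricature of this class of algorithms rather than the exact procedure of Shamir et al. (2014) or Jordan et al. (2019); all settings are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 20
n = N // K
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)
blocks = [(X[k*n:(k+1)*n], Y[k*n:(k+1)*n]) for k in range(K)]

def global_gradient(theta):
    # Each machine returns its local least-squares gradient (a length-p vector); the centre averages them.
    return np.mean([Xk.T @ (Xk @ theta - Yk) / n for Xk, Yk in blocks], axis=0)

# Only the "central" machine (taken to be machine 0 here) ever forms a Hessian, from its own n points.
X0 = blocks[0][0]
hess_center = X0.T @ X0 / n

theta = np.zeros(p)   # in theory a sqrt(n)-consistent initial value; zero is used here for brevity
for _ in range(5):    # a few rounds, each costing O(Kp) communication instead of O(Kp^2)
    theta = theta - np.linalg.solve(hess_center, global_gradient(theta))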

2.3. Shrinkage methods

We study shrinkage methods for sparse estimation in this subsection. For a high-dimensional problem, especially when the dimension of $\theta$ is larger than the sample size N, it is difficult to estimate $\theta$ without additional assumptions (Hastie et al., 2015). A popular constraint for tackling such problems is sparsity, which assumes that only a subset of the entries of $\theta$ is non-zero. The index set of non-zero entries is called the support of $\theta$, that is, ${\rm supp}(\theta)=\mathcal{A}=\{1\le j\le p:\ \theta_j\ne 0\}$. To induce a sparse solution, an additional regularization term on $\theta$ is often added to the loss function. Specifically, we solve the shrinkage regression problem $\min_{\theta\in\Theta}\{\mathcal{L}_N(\theta)+\sum_{j=1}^p\rho_\lambda(|\theta_j|)\}$, where $\rho_\lambda(\cdot)$ is a penalty function with a regularization parameter $\lambda>0$. Popular choices are the LASSO (Tibshirani, 1996), the SCAD (Fan & Li, 2001) and others discussed in Zhang and Zhang (2012). For simplicity, we consider the LASSO estimator in the framework of linear regression. Specifically, the whole sample estimator is computed as $\hat\theta_\lambda=\arg\min_{\theta\in\Theta}\{\frac{1}{N}\|Y-X\theta\|_2^2+\lambda\|\theta\|_1\}$, where $Y=(Y_1,\ldots,Y_N)^\top\in\mathbb{R}^N$ is the response vector, $X=(X_1,\ldots,X_N)^\top\in\mathbb{R}^{N\times p}$ is the design matrix, and $\|\theta\|_1=\sum_{j=1}^p|\theta_j|$ denotes the $\ell_1$-norm of $\theta$. It is known that the LASSO procedure produces biased estimates of the large coefficients. This is undesirable for simple averaging procedures, since averaging cannot eliminate a systematic bias. To reduce the bias, Javanmard and Montanari (2014) proposed a debiasing technique for the LASSO estimator, namely (4) $\hat\theta_\lambda^{(d)}=\hat\theta_\lambda+\frac{1}{N}MX^\top(Y-X\hat\theta_\lambda)$, where $M\in\mathbb{R}^{p\times p}$ is an approximation to the inverse of $\hat\Sigma=X^\top X/N$. Note that when $\hat\Sigma$ is invertible (e.g., when $N\ge p$), setting $M=\hat\Sigma^{-1}$ gives $\hat\theta_\lambda^{(d)}=(X^\top X)^{-1}X^\top Y$, which is the ordinary least squares estimator and obviously unbiased. Hence, procedure (4) compensates for the bias incurred by the $\ell_1$ regularization in some sense.
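
A small sketch of the debiasing step (4) in the easy regime N > p, where M can simply be taken as the inverse of the sample covariance matrix, is given below; the genuinely high-dimensional case requires a more careful construction of M (e.g., by nodewise regression), which is not shown. The example assumes scikit-learn's Lasso solver and an arbitrary choice of the regularization level.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p, s = 2_000, 50, 5
theta_true = np.zeros(p)
theta_true[:s] = 2.0                                  # a sparse truth
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)

theta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y).coef_

# Debiasing as in (4); here M is the inverse sample covariance, which exists since N > p.
Sigma_hat = X.T @ X / N
M = np.linalg.inv(Sigma_hat)
theta_debiased = theta_lasso + M @ X.T @ (Y - X @ theta_lasso) / N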

With this debiasing technique, Lee et al. (2017) developed a one-shot type estimator for the LASSO problem. Specifically, let $\hat\theta_{k,\lambda}^{(d)}$ be the debiased LASSO estimator computed on $\mathcal{M}_k$. Then an averaging estimator can be constructed on $\mathcal{M}_{\rm center}$ as $\bar\theta_\lambda=K^{-1}\sum_{k=1}^K\hat\theta_{k,\lambda}^{(d)}$. Unfortunately, the sparsity level can be seriously degraded by averaging. For this reason, a hard-thresholding step often comes as a remedy. It was also noticed that the debiasing step is computationally expensive; hence an improved algorithm was proposed to alleviate the computational cost of this step. Under certain conditions, they showed that the resulting estimator has the same convergence rate as the whole sample estimator. Battey et al. (2018) investigated the same problem with an additional study of hypothesis testing. Furthermore, a refitted estimation procedure was used to preserve the global oracle property of the distributed estimator. Extensions to high-dimensional GLMs can also be found in Lee et al. (2017) and Battey et al. (2018). For this model, Chen and Xie (2014) implemented a majority voting method to aggregate the regularized local estimates. For low-dimensional sparse problems with smooth loss functions (e.g., GLMs, the Cox model), Zhu et al. (2019) developed a local quadratic approximation method with an adaptive-LASSO type penalty. They showed rigorously that the resulting estimator can be as good as the global oracle estimator.

Intuitively, the above one-shot methods may need a stringent condition on the local sample size to attain the global convergence rate, due to the limited communication. In fact, the simple averaging estimator requires $n\gtrsim Ks^2\log p$ to match the oracle rate in the context of the sparse linear model (Lee et al., 2017), where $s=|\mathcal{A}|$ is the number of non-zero entries of $\theta$. For this problem, Wang et al. (2017) and Jordan et al. (2019) independently proposed a communication-efficient iterative algorithm, which constructs a regularized likelihood using the local Hessian matrix. As demonstrated by Wang et al. (2017), a one-step estimator $\hat\theta^{(1)}$ suffices to achieve the global convergence rate if $n\gtrsim Ks^2\log p$ (the condition used in Lee et al., 2017). Furthermore, if multi-round communication is allowed, $\hat\theta^{(t+1)}$ (i.e., the estimator at the (t+1)th iteration) can match the whole sample estimator as long as $n\gtrsim s^2\log p$ and $t\gtrsim\log K$, under certain conditions.

2.4. Non-smooth loss based models

The methods described above typically require the loss function L to be sufficiently smooth, although a non-smooth regularization term is permitted (see, e.g., Jordan et al., 2019; Wang et al., 2017; Zhang et al., 2013; Zhu et al., 2019). However, there are also some useful models involving non-smooth loss functions, such as quantile regression and the support vector machine. It is then of great interest to develop distributed methods for these models.

We first focus on the quantile regression (QR) model. The QR model is widely used in statistics and econometrics, and is more robust against outliers than the ordinary quadratic loss (Koenker, 2005). Specifically, a QR model assumes $Y_i=X_i^\top\theta_\tau+\epsilon_i$, $i\in S$, where $X_i\in\mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta_\tau\in\mathbb{R}^p$ is the true regression coefficient, and $\epsilon_i$ is the random noise satisfying $P(\epsilon_i\le 0|X_i)=\tau$, where $\tau\in(0,1)$ is a known quantile level. It is known that $\theta_\tau$ is the minimizer of $E[\rho_\tau(Y_i-X_i^\top\theta)]$. Here $\rho_\tau(u)=u(\tau-1\{u\le 0\})=u(1\{u>0\}+\tau-1)$ is the non-differentiable check-loss function, where $1\{\cdot\}$ is the indicator function. When the data size N is moderate, we can estimate $\theta_\tau$ by $\hat\theta_\tau=\arg\min_{\theta\in\Theta}N^{-1}\sum_{i\in S}\rho_\tau(Y_i-X_i^\top\theta)$ on a single machine. However, when N is very large, a distributed system has to be used. Accordingly, distributed estimators have to be developed.
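
To fix notation, the check loss and a single-machine QR fit can be sketched as follows; solving the non-smooth problem with a generic derivative-free optimizer is only a convenience for illustration (QR is usually solved by linear programming), and the design and noise distribution are made up so that P(eps <= 0 | X) = tau holds.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u <= 0})
    return u * (tau - (u <= 0))

rng = np.random.default_rng(0)
n, p, tau = 1_000, 3, 0.6
theta_tau = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
eps = rng.normal(size=n) - norm.ppf(tau)        # noise shifted so that P(eps <= 0 | X) = tau
Y = X @ theta_tau + eps

def local_objective(theta):
    return np.mean(check_loss(Y - X @ theta, tau))

# Derivative-free optimization is used only for illustration.
theta_hat = minimize(local_objective, x0=np.zeros(p), method="Nelder-Mead").x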

In this regard, Volgushev et al. (2019) studied the one-shot averaging estimator. Specifically, a local estimator $\hat\theta_{k,\tau}$ is first computed on each local machine $\mathcal{M}_k$. Then the averaging estimator is assembled as $\bar\theta_\tau=K^{-1}\sum_{k=1}^K\hat\theta_{k,\tau}$ on the central machine $\mathcal{M}_{\rm center}$. They further investigated the theoretical properties of the averaging estimator in detail. It was shown that if the number of machines satisfies $K=o(\sqrt{N}/\log N)$, then $\bar\theta_\tau$ is as efficient as the whole sample estimator $\hat\theta_\tau$ under some regularity conditions. Chen and Zhou (2020) proposed an estimating-equation based one-shot approach for the QR problem. The asymptotic equivalence between the resulting estimator and the whole sample estimator was also established under $K=o(N^{1/4})$ and some other conditions. It can be seen that the performance of one-shot approaches relies heavily on the local sample size. In fact, Volgushev et al. (2019) showed that $K=o(\sqrt{N})$ is a necessary condition for the global efficiency of the simple averaging estimator $\bar\theta_\tau$. To remove the constraint $K=o(\sqrt{N})$ on the number of machines, Chen et al. (2019) proposed an iterative approach. Their key idea is to approximate the check-loss function by a smooth alternative. More specifically, they approximated $1\{u>0\}$ by a smooth function $H(u/h)$, where $H(\cdot)$ is a smooth cumulative distribution function and $h>0$ is a tuning parameter controlling the approximation accuracy (see Figure 2(a)). With this modification, the algorithm updates the estimates by (5) $\hat\theta_\tau^{(t+1)}=[V(\hat\theta_\tau^{(t)})]^{-1}U(\hat\theta_\tau^{(t)})$, where $U(\theta)=\sum_{k=1}^KU_k(\theta)$, $V(\theta)=\sum_{k=1}^KV_k(\theta)$, and $U_k\in\mathbb{R}^p$, $V_k\in\mathbb{R}^{p\times p}$ depend only on the bandwidth h and the local sample $S_k$. It was shown that a constant number of rounds of iteration suffices to match the convergence rate of the whole sample estimator. The communication cost is thus roughly of the order $O(Kp^2)$, which is not applicable when p is very large. For the high-dimensional QR problem, Zhao et al. (2014) and Zhao et al. (2019) therefore adopted a one-shot averaging method based on debiased local estimates as in (4). In addition, Chen et al. (2020) proposed a communication-efficient multi-round algorithm inspired by the approximate Newton method (Shamir et al., 2014). This iterative approach removes the restriction on the number of machines. A revised divide-and-conquer stochastic gradient descent method for QR and other models with diverging dimension can be found in Chen, Liu et al. (2021b).
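
The smoothing device can be illustrated by taking H to be the standard normal distribution function, as in the minimal sketch below; only the smoothed loss itself is shown, not the full iterative estimator (5), and the bandwidth value is arbitrary.

import numpy as np
from scipy.stats import norm

def check_loss(u, tau):
    return u * (tau - (u <= 0))

def smoothed_check_loss(u, tau, h):
    # Replace the indicator 1{u > 0} in rho_tau(u) = u * (1{u > 0} + tau - 1) by H(u / h),
    # with H taken to be the standard normal distribution function.
    return u * (norm.cdf(u / h) + tau - 1.0)

u = np.linspace(-2.0, 2.0, 9)
tau, h = 0.6, 0.2
print(np.c_[check_loss(u, tau), smoothed_check_loss(u, tau, h)])   # the two columns are close for small h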

Figure 2. Approximation of two non-smooth loss functions. (a) QR loss with τ=0.6 and (b) hinge loss.


We next consider the support vector machine (SVM), which is one of the most successful statistical learning methods (Vapnik, 2013). The classical SVM is aimed at the binary classification problem, i.e., the response variable $Y_i\in\{-1,1\}$. Formally, a standard linear SVM solves the problem $\min_{\theta\in\Theta}N^{-1}\sum_{i\in S}(1-Y_iX_i^\top\theta)_++\lambda\|\theta\|_2^2$, where $(u)_+=u\,1\{u>0\}$ is the hinge loss and $\lambda>0$ is the regularization parameter. Using the same smoothing technique as in Chen et al. (2019), i.e., replacing the hinge loss with a smooth alternative (see Figure 2(b)), Wang, Yang et al. (2019) proposed an iterative algorithm like (5). To reduce the communication cost incurred by transferring matrices, they further employed the approximate Newton method (Shamir et al., 2014). Theoretically, they established the asymptotic normality of the estimator, which can be used to construct confidence intervals. For the ultra-high dimensional SVM problem, Lian and Fan (2018) studied the one-shot averaging method with a debiasing procedure similar to (4).
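
In the same spirit, the kink of the hinge loss can be smoothed out; the sketch below uses a Gaussian-smoothed positive part as one possible surrogate. This is only meant to illustrate the smoothing idea and is not claimed to be the exact surrogate used by Wang, Yang et al. (2019).

import numpy as np
from scipy.stats import norm

def hinge(u):
    # Hinge loss as a function of the margin u = y * x' theta.
    return np.maximum(1.0 - u, 0.0)

def smoothed_hinge(u, h):
    # A Gaussian-smoothed positive part: E[max(v + h Z, 0)] = v * Phi(v / h) + h * phi(v / h), with v = 1 - u.
    v = 1.0 - u
    return v * norm.cdf(v / h) + h * norm.pdf(v / h)

u = np.linspace(-1.0, 3.0, 9)
print(np.c_[hinge(u), smoothed_hinge(u, 0.1)])   # close agreement away from the kink, smooth at u = 1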

3. Nonparametric models

Different from parametric models, a nonparametric model typically involves infinite-dimensional parameters. In this section, we focus mainly on nonparametric regression problems. Specifically, consider a general regression model $Y_i=f(X_i)+\epsilon_i$, $i\in S$, where $f(\cdot)$ is an unknown but sufficiently smooth function and $\epsilon_i$ is random noise with zero mean. The aim of nonparametric regression is to estimate the function $f\in\mathcal{F}$, where $\mathcal{F}$ is a given nonparametric class of functions.

3.1. Local smoothing

One way to estimate $f(\cdot)$ is to fit a locally constant model by kernel smoothing (Fan & Gijbels, 1996). More concretely, the whole sample estimator is given by $\hat f_h(x)=\sum_{i\in S}W_{h,X_i}(x)Y_i$, where $W_{h,X_i}(x)\ge 0$ is the local weight at X = x satisfying $\sum_{i\in S}W_{h,X_i}(x)=1$. Specifically, for the Nadaraya–Watson kernel estimator, we have $W_{h,X_i}(x)=K((X_i-x)/h)/\sum_{i'\in S}K((X_{i'}-x)/h)$, where $K(\cdot)$ is a kernel function and $h>0$ is the bandwidth. In the univariate case (p = 1), classical theory states that the mean squared error of $\hat f_h(x)$ is of the order $O(h^4+(Nh)^{-1})$ (Fan & Gijbels, 1996). Thus, the optimal rate $O(N^{-4/5})$ can be achieved by choosing the bandwidth $h=O(N^{-1/5})$.
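
For concreteness, a minimal univariate Nadaraya–Watson sketch is given below, with a Gaussian kernel, simulated data and the bandwidth constant set to one; all of these choices are illustrative.

import numpy as np

def nadaraya_watson(x0, X, Y, h):
    # Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel and bandwidth h.
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
N = 5_000
X = rng.uniform(-2.0, 2.0, size=N)
Y = np.sin(np.pi * X) + 0.3 * rng.normal(size=N)

h = N ** (-1 / 5)                       # the O(N^{-1/5}) bandwidth order, with the constant set to one
print(nadaraya_watson(0.5, X, Y, h))    # should be close to sin(0.5 * pi) = 1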

For distributed kernel smoothing, a one-shot estimator can also be constructed. Let $\hat f_{k,h}(x)$ be the local estimator computed on $\mathcal{M}_k$. Then an averaging estimator can be obtained as $\bar f_h(x)=K^{-1}\sum_{k=1}^K\hat f_{k,h}(x)$. Chang, Lin, and Wang (2017) studied the theoretical properties of $\bar f_h(x)$ in a specific function space $\mathcal{F}$. They established the same minimax convergence rate for $\bar f_h(x)$ as that of the whole sample estimator. However, they found that a strict restriction on the number of machines K is needed to achieve this optimal rate. To fix the problem, two solutions were provided: a data-dependent bandwidth selection algorithm and an algorithm with a qualification step.

The nearest neighbours method can be regarded as another local smoothing method. Qiao et al. (2019) studied nearest neighbours classification in a distributed setting, where the optimal number of neighbours needed to achieve the optimal rate of convergence was derived. Li et al. (2013) discussed the problem of density estimation for scattered datasets. Kaplan (2019) focused on the choice of bandwidth for nonparametric smoothing techniques. All the works in this subsection indicate that the bandwidth (or local smoothing parameter) used in the distributed setting should be adjusted according to the whole sample size N, rather than the local sample size n.

3.2. RKHS methods

We next discuss another popular nonparametric regression method, the reproducing kernel Hilbert space (RKHS) method. An RKHS $\mathcal{H}$ can be induced by a continuous, symmetric and positive semi-definite kernel function $K(\cdot,\cdot):\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}$. Two typical examples are the polynomial kernel $K(x_1,x_2)=(x_1^\top x_2+1)^d$ with an integer $d\ge 1$, and the radial kernel $K(x_1,x_2)=\exp(-\gamma\|x_1-x_2\|_2^2)$ with $\gamma>0$. Refer to, for example, Berlinet and Thomas-Agnan (2011) and Wahba (1990) for more details about RKHSs. Our target is then to find an $\hat f\in\mathcal{H}$ minimizing a penalized empirical loss. That leads to the whole sample estimator (6) $\hat f_\lambda=\arg\min_{f\in\mathcal{H}}\{\frac{1}{N}\sum_{i\in S}(Y_i-f(X_i))^2+\lambda\|f\|_{\mathcal{H}}^2\}$, where $\|\cdot\|_{\mathcal{H}}$ is the norm associated with the RKHS $\mathcal{H}$ and $\lambda>0$ is the regularization parameter. This problem is also known as kernel ridge regression (KRR). By the representer theorem for the RKHS (Wahba, 1990), any solution to problem (6) must have the linear form $\hat f_\lambda(x)=\sum_{i\in S}\alpha_iK(X_i,x)$, where $\alpha_i\in\mathbb{R}$ for each $i\in S$. By this property, we can treat the KRR as a parametric problem with unknown parameter $\alpha=(\alpha_1,\ldots,\alpha_N)^\top\in\mathbb{R}^N$. The error bounds of the whole sample estimator $\hat f_\lambda$ have been well established in the existing literature (e.g., Steinwart et al., 2009; Zhang, 2005). However, a standard implementation of the KRR involves inverting a kernel matrix in $\mathbb{R}^{N\times N}$ (Saunders et al., 1998). Therefore, when N is extremely large, it is time consuming or even computationally infeasible to process the whole sample on a single machine, and a distributed system should be considered.
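
Via the representer theorem, computing (6) reduces to solving a linear system in the coefficient vector alpha, as the following sketch with a radial kernel shows; the kernel width and regularization level are arbitrary, and the O(N^3) solve is precisely the bottleneck that motivates the distributed treatments below.

import numpy as np

rng = np.random.default_rng(0)
N, gamma, lam = 500, 1.0, 1e-3
X = rng.uniform(-2.0, 2.0, size=(N, 1))
Y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.normal(size=N)

# Radial kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-gamma * sq_dist)

# Representer theorem: f(x) = sum_i alpha_i K(X_i, x); minimizing (6) gives (K + N * lam * I) alpha = Y.
alpha = np.linalg.solve(Kmat + N * lam * np.eye(N), Y)

def f_hat(x_new):
    k_new = np.exp(-gamma * np.sum((X - x_new) ** 2, axis=1))
    return k_new @ alpha

print(f_hat(np.array([0.5])))           # should be close to sin(0.5 * pi) = 1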

In this regard, Zhang et al. (2015) studied distributed KRR via the one-shot averaging approach. Specifically, each machine $\mathcal{M}_k$ computes the local KRR estimate $\hat f_{k,\lambda}$ by (6) based on the local sample $S_k$. Then the central machine $\mathcal{M}_{\rm center}$ averages them to obtain the final estimator $\bar f_\lambda=K^{-1}\sum_{k=1}^K\hat f_{k,\lambda}$. Theoretically, they established the optimal convergence rate of the mean squared error of $\bar f_\lambda$ for different types of kernel functions, under some regularity conditions. Lin et al. (2017) derived a similar optimal error bound under relaxed conditions. Xu et al. (2016) extended the loss function in (6) to a more general form. Related works on the distributed KRR problem via the one-shot averaging approach can be found in Shang and Cheng (2017), Lin and Zhou (2018), Guo et al. (2019), Mücke and Blanchard (2018), Wang (2019) and many others. It has been noted that these one-shot approaches require the number of machines to diverge at a relatively slow rate in order to attain the global convergence rate. To fix the problem, Chang, Lin, and Zhou (2017) proposed a semi-supervised learning framework that utilizes additional unlabelled data (i.e., observations without the response $Y_i$). The recent work of Lin et al. (2020) allows communication between machines to improve the performance. To choose an optimal tuning parameter $\lambda$ in (6), Xu et al. (2018) proposed a distributed generalized cross-validation method.

For semiparametric models, Zhao et al. (2016) considered a partially linear model with heterogeneous data in a distributed setting. Specifically, they assumed the model (7) $Y_i=X_i^\top\theta^{(k)}+f(W_i)+\epsilon_i$, $i\in S_k$, where $W_i\in\mathbb{R}$ is an additional covariate, $f(\cdot)$ is an unknown function, and $\theta^{(k)}\in\mathbb{R}^p$ is the true linear coefficient associated with the data on $\mathcal{M}_k$ for $1\le k\le K$. In other words, the local data on different machines are assumed to share the same nonparametric part, but are allowed to have different linear coefficients. To estimate the unknown function and coefficients, they extended the classical RKHS theory to cope with the partially linear function space. Under some regularity conditions, the resulting estimator of the nonparametric part is shown to be as efficient as the whole sample estimator, provided the number of machines does not grow too fast. The case with a high-dimensional linear part has also been investigated. For example, under the homogeneity assumption (i.e., the linear coefficients $\theta^{(k)}$ are assumed to be identical to $\theta$ across different machines), Lv and Lian (2017) adopted the one-shot averaging approach with a debiasing technique analogous to (4) to estimate the linear coefficient. Lian et al. (2019) considered the same heterogeneous model as in (7), but with the linear part in a high-dimensional setting (i.e., p > N). For this model, they proposed a novel projection approach to estimate the common nonparametric part (not in an RKHS framework). Theoretically, the asymptotic normality of the one-shot averaging estimator of the nonparametric function was established under certain conditions.

4. Other related works

4.1. Principal component analysis

Principal component analysis (PCA) is a common procedure for reducing the dimension of the data and is widely used in practical data analysis. Unlike regression problems, PCA is an unsupervised method, which does not require a response variable Y. To conduct a PCA, a covariance matrix is constructed as $\hat\Sigma=N^{-1}\sum_{i\in S}X_iX_i^\top$, where the $X_i$'s are assumed to be centred already. Next, a standard singular value decomposition (SVD) is applied to $\hat\Sigma$, which leads to $\hat\Sigma=\hat V\hat D\hat V^\top$, where $\hat D$ is a diagonal matrix of eigenvalues and $\hat V$ is an orthogonal matrix of eigenvectors. The columns of $\hat V$ are then the principal component directions that we need.

In a distributed setting, simply averaging the locally estimated eigenvectors does not give a valid result. To solve this problem, Fan, Wang et al. (2019b) developed a divide-and-conquer algorithm for estimating eigenspaces. It involves only a single round of communication and is quite easy to implement. We state it as follows (Fan, Wang et al., 2019b, Algorithm 1):

  1. For each $k=1,\ldots,K$, machine $\mathcal{M}_k$ computes the d leading eigenvectors of the local sample covariance matrix $\hat\Sigma_k=n^{-1}\sum_{i\in S_k}X_iX_i^\top$, denoted by $\hat v_{1,k},\ldots,\hat v_{d,k}\in\mathbb{R}^p$. They are arranged as the columns of $\hat V_k=(\hat v_{1,k},\ldots,\hat v_{d,k})\in\mathbb{R}^{p\times d}$, which is then sent to the central machine $\mathcal{M}_{\rm center}$.

  2. The central machine $\mathcal{M}_{\rm center}$ averages the K local projection matrices to obtain $\tilde\Sigma=K^{-1}\sum_{k=1}^K\hat V_k\hat V_k^\top$. It then computes the d leading eigenvectors of $\tilde\Sigma$, denoted by $\tilde v_1,\ldots,\tilde v_d\in\mathbb{R}^p$, which are the estimators of the first d principal component directions that we need.
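
A compact sketch of the two steps above, under a toy spiked-covariance model with illustrative dimensions, is as follows.

import numpy as np

rng = np.random.default_rng(0)
N, p, K, d = 20_000, 30, 20, 3
n = N // K
# Toy spiked-covariance data: a few strong directions plus isotropic noise.
V_true = np.linalg.qr(rng.normal(size=(p, d)))[0]
X = rng.normal(size=(N, d)) @ (3.0 * V_true.T) + rng.normal(size=(N, p))

def top_eigvecs(S, d):
    # d leading eigenvectors (as columns) of a symmetric matrix S, ordered by decreasing eigenvalue.
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1][:d]]

# Step 1: each machine sends the d leading eigenvectors of its local covariance (a p x d matrix).
V_locals = [top_eigvecs(X[k*n:(k+1)*n].T @ X[k*n:(k+1)*n] / n, d) for k in range(K)]

# Step 2: the centre averages the projection matrices and extracts d leading eigenvectors again.
Sigma_tilde = np.mean([Vk @ Vk.T for Vk in V_locals], axis=0)
V_distributed = top_eigvecs(Sigma_tilde, d)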

It is noticeable that the communication cost of the above one-shot algorithm is of the order O(Kdp). This can be considered communication-efficient, since d is usually much smaller than p in practice. Fan, Wang et al. (2019b) showed that, under appropriate conditions, the distributed estimator achieves the same convergence rate as the global estimator. Cases with heterogeneous local data were also investigated in their work. To further remove the restriction on the number of machines, Chen, Lee et al. (2021) proposed a communication-efficient multi-round algorithm based on the approximate Newton method (Shamir et al., 2014).

4.2. Feature screening

Massive datasets often involve ultrahigh-dimensional data, for which feature screening is critically important (Fan & Lv, 2008). To fix ideas, consider a standard linear regression model $Y_i=X_i^\top\theta+\epsilon_i$, $i\in S$, where $X_i\in\mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta\in\mathbb{R}^p$ is the true parameter, and $\epsilon_i$ is random noise. To screen for the most promising features, the seminal method of sure independence screening (SIS) was proposed by Fan and Lv (2008). Specifically, let $\mathcal{A}=\{1\le j\le p:\theta_j\ne 0\}$ be the true sparse model. Let $\omega_j$ be the Pearson correlation between the jth feature and the response Y. Then SIS screens features by a hard-thresholding procedure, $\hat{\mathcal{A}}_\gamma=\{1\le j\le p:|\hat\omega_j|>\gamma\}$, where $\gamma$ is a prespecified threshold and $\hat\omega_j$ is the whole sample estimator of $\omega_j$. Under some specific conditions, Fan and Lv (2008) showed the sure screening property of SIS, that is, $P(\mathcal{A}\subset\hat{\mathcal{A}}_\gamma)\to 1$ as $N\to\infty$. However, the estimator $\hat\omega_j$ is usually biased for many correlation measures. This indicates that a direct one-shot averaging approach is unlikely to be the best practice for a distributed system. To fix the problem, Li et al. (2020) proposed a novel debiasing technique. They found that many correlation measures can be expressed as $\omega_j=g(\nu_1,\ldots,\nu_s)$, including the Pearson correlation used above, the Kendall $\tau$ rank correlation, the SIRS correlation (Zhu et al., 2011), etc. Therefore, they used U-statistics to estimate the components $\nu_q$, $1\le q\le s$, on the local machines. These unbiased estimators of the $\nu_q$'s are then averaged on the central machine $\mathcal{M}_{\rm center}$. Consequently, $\mathcal{M}_{\rm center}$ can construct the distributed estimator $\tilde\omega_j$ by plugging the averaged component estimators into the known function g. Finally, they showed the sure screening property of $\hat{\mathcal{A}}_\gamma$ based on the distributed estimators under some regularity conditions. When the feature dimension is much larger than the sample size (i.e., $p\gg N$), another distributed computing strategy is to partition the whole dataset by features rather than by samples. Refer to, for example, Song and Liang (2015) and Yang et al. (2016) for more details.
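
The marginal screening step itself is straightforward to sketch with the Pearson correlation, as below; the distributed, U-statistic based debiasing of Li et al. (2020) is not reproduced, and the threshold and data-generating settings are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p, s = 1_000, 5_000, 4
theta = np.zeros(p)
theta[:s] = 1.5                                   # the true sparse model: the first s features
X = rng.normal(size=(N, p))
Y = X @ theta + rng.normal(size=N)

# Marginal Pearson correlations between each feature and the response.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
omega_hat = (Xc.T @ Yc) / (N * Xc.std(axis=0) * Yc.std())

gamma = 0.3                                       # prespecified hard threshold
A_hat = np.flatnonzero(np.abs(omega_hat) > gamma)
print(A_hat)                                      # should recover the first s indices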

4.3. Bootstrap

Bootstrap and related resampling techniques provide a general and easily implemented procedure for automatic statistical inference. However, these methods are usually computationally expensive; when the sample size N is very large, they can even be practically infeasible. To mitigate this computational issue, various alternative methods have been proposed, such as the subsampling approach (Politis et al., 1999) and the ‘m-out-of-n’ bootstrap (Bickel et al., 2012). Their key idea is to reduce the resample size. However, due to the difference between the sizes of the whole sample and the resample, an additional correction step is generally required to rescale the result, which makes these methods less automatic.

To solve this problem, Kleiner et al. (2014) proposed the bag of little bootstraps (BLB) method. It integrates the idea of subsampling and can be computed distributedly without a correction step. Suppose that the N sample units have been randomly and evenly partitioned across K machines, and that we want to assess the accuracy of a point estimator of some parameter $\theta$. Their algorithm can be summarized as follows.

  1. For each $1\le k\le K$, machine $\mathcal{M}_k$ draws r resamples of size N (instead of n) from $S_k$ with replacement, and computes r estimates of $\theta$, one from each resample. After that, each $\mathcal{M}_k$ computes some accuracy measure (e.g., variance, confidence region) from the r estimates, denoted by $\hat\xi_k$. Finally, all the local machines send their $\hat\xi_k$'s to the central machine $\mathcal{M}_{\rm center}$.

  2. The central machine $\mathcal{M}_{\rm center}$ aggregates these $\hat\xi_k$'s as $\bar\xi=K^{-1}\sum_{k=1}^K\hat\xi_k$, and $\bar\xi$ is the final accuracy measure that we need.

Remarkably, the local machines never actually need to process datasets of size N, even though the nominal resample size is N. This is because each machine holds at most n sample units; in fact, randomly generating appropriate weight vectors of length n suffices to approximate the resampling process.
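
The weight-vector trick can be sketched for a simple sample-mean example as follows: each nominal-size-N resample is represented by a length-n vector of multinomial counts, so no machine ever materializes N observations. The accuracy measure (a bootstrap standard error) and all constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, K, r = 100_000, 20, 50
n = N // K
data_blocks = [rng.normal(loc=1.0, size=n) for _ in range(K)]   # the K local subsamples

xi_local = []
for block in data_blocks:
    estimates = []
    for _ in range(r):
        # A resample of nominal size N from the n local points, stored only as multinomial counts.
        w = rng.multinomial(N, np.full(n, 1.0 / n))
        estimates.append(np.sum(w * block) / N)                 # weighted point estimate (here a mean)
    xi_local.append(np.std(estimates))                          # local accuracy measure: a bootstrap SE
xi_bar = np.mean(xi_local)                                      # aggregated on the central machine
print(xi_bar)                                                   # close to 1 / sqrt(N)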

5. Future study

To conclude the article, we discuss a number of interesting topics for future study. First, for datasets of truly massive size, a distributed system is definitely needed, since no single machine can store the data. On the other hand, for datasets of sufficiently small size, traditional in-memory statistical methods can be used immediately. This leaves a big gap between the big and small datasets. Such middle-sized data are often much larger than the computer memory but smaller than the hard drive. Consequently, they can be comfortably placed on a personal computer, but can hardly be processed in memory as a whole. For such datasets, the sizes are not large enough to justify an expensive distributed system, yet not small enough to be handled by traditional statistical methods. How to analyse datasets of this size seems to be a topic worth studying. Second, when the whole dataset is allocated to local machines randomly and evenly, the data on different machines are independent, identically distributed and balanced. Then all of the methods discussed above can be safely implemented. However, when the data on different machines are collected from (for example) different regions, the homogeneity of the local data is usually hard to guarantee. The situation can be even worse if the sample sizes allocated to different local machines are very different. How to cope with such heterogeneous and unbalanced local data is a problem of great importance (Wang et al., 2020). The idea of meta-analysis may be applicable in these situations (Liu et al., 2015; Xu & Shao, 2020; Zhou & Song, 2017). Finally, in the era of big data, personal privacy is under unprecedented threat. How to protect users' private information during the learning process deserves urgent attention. In this regard, differential privacy (DP) provides a theoretical framework for privacy-preserving data analysis (Dwork, 2008). Related works on distributed learning include Agarwal et al. (2018), Truex et al. (2019), Wang, Ishii et al. (2019) and many others. Although this is currently a hot research area, there are still many open challenges. Thus, it is of great interest to study privacy-preserving distributed statistical learning problems both practically and theoretically.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work is supported by National Natural Science Foundation of China (No. 11971171), the 111 Project (B14019) and Project of National Social Science Fund of China (15BTJ027). Weidong Liu's research is supported by National Program on Key Basic Research Project (973 Program, 2018AAA0100704), National Natural Science Foundation of China (No. 11825104, 11690013), Youth Talent Support Program, and a grant from Australian Research Council. Hansheng Wang's research is partially supported by National Natural Science Foundation of China (No. 11831008, 11525101, 71532001). It is also supported in part by China's National Key Research Special Program (No. 2016YFC0207704).

Notes on contributors

Yuan Gao

Mr. Yuan Gao is a Ph.D. candidate in the School of Statistics at East China Normal University.

Weidong Liu

Dr. Weidong Liu is a Distinguished Professor in the School of Mathematical Sciences at Shanghai Jiao Tong University.

Hansheng Wang

Dr. Hansheng Wang is a professor in the Guanghua School of Management at Peking University.

Xiaozhou Wang

Dr. Xiaozhou Wang is an assistant professor in the School of Statistics at East China Normal University.

Yibo Yan

Mr. Yibo Yan is a Ph.D. candidate in the School of Statistics at East China Normal University.

Riquan Zhang

Dr. Riquan Zhang is a professor in the School of Statistics at East China Normal University.

References