Review Article

A review of distributed statistical inference

Yuan Gao, Weidong Liu, Hansheng Wang, Xiaozhou Wang, Yibo Yan & Riquan Zhang
Pages 89-99 | Received 02 Sep 2020, Accepted 01 Aug 2021, Published online: 13 Sep 2021

Abstract

The rapid emergence of massive datasets in various fields poses a serious challenge to traditional statistical methods. Meanwhile, it provides opportunities for researchers to develop novel algorithms. Inspired by the idea of divide-and-conquer, various distributed frameworks for statistical estimation and inference have been proposed to deal with large-scale statistical optimization problems. This paper provides a comprehensive review of the related literature, covering parametric models, nonparametric models, and other frequently used models. Their key ideas and theoretical properties are summarized. The trade-off between communication cost and estimation precision, together with other practical concerns, is also discussed.

1. Introduction

With the rapid development of information technology, datasets of massive size have become increasingly available. E-commerce companies like Amazon have to analyse billions of transaction records for personalized recommendation. Bioinformatics scientists need to locate the genes relevant to a specific phenotype or disease from massive SNP data. For Internet-related companies, large amounts of text, image, voice, and even video data are in urgent need of effective analysis. Due to the accelerated growth of data sizes, the computing power and memory of a single computer are no longer sufficient. Constraints on network bandwidth, together with privacy and security considerations, also make it difficult to process the whole dataset on one central machine. Accordingly, distributed computing systems have become increasingly popular.

Similar to parallel computing executed on a single computer, distributed computing is closely related to the idea of divide-and-conquer. Simply speaking, for some statistical problems we can divide a complicated large task into many small pieces that can be tackled simultaneously on multiple CPUs or machines. Their outcomes are then aggregated to obtain the final result. It is conceivable that this procedure can save computing time substantially if the algorithm can be executed in a parallel way. The main difference between a traditional parallel computing system and a distributed computing system lies in the way they access memory. In parallel computing, different processors share the same memory and can therefore exchange information with each other very efficiently. In distributed computing, by contrast, distinct machines are physically separated and typically connected by a network, so each machine can directly access only its own memory. Consequently, the inter-machine communication cost, in terms of time spent, can be significant and should be considered prudently.

The rest of this article is organized as follows. Section 2 studies parametric models. Section 3 focuses on nonparametric methods. Section 4 discusses some other related works. The article is concluded with a short discussion in Section 5.

2. Parametric models

Assume a total of N observations denoted as $Z_i=(X_i^\top,Y_i)^\top\in\mathbb{R}^{p+1}$ with $1\le i\le N$. Here $X_i\in\mathbb{R}^p$ is the covariate vector and $Y_i\in\mathbb{R}$ is the corresponding scalar response. Define $\{P_\theta:\theta\in\Theta\}$ to be a family of statistical models parameterized by $\theta\in\Theta\subset\mathbb{R}^p$. We further assume that the $Z_i$'s are independent and identically distributed with distribution $P_\theta$, where $\theta=(\theta_1,\ldots,\theta_p)^\top$ is the true parameter. Consider a distributed setting, where the N sample units are allocated randomly and evenly to K local machines $\mathcal{M}_k$, $1\le k\le K$, such that each machine has n observations. Obviously, we should have $N=nK$. Write $S=\{1,\ldots,N\}$ as the index set of the whole sample. Then let $S_k$ denote the index set of the local sample on $\mathcal{M}_k$, with $S_{k_1}\cap S_{k_2}=\emptyset$ for any $k_1\ne k_2$. Other than the local machines, there also exists a central machine represented by $\mathcal{M}_{\rm center}$. A standard architecture has $\mathcal{M}_{\rm center}$ connected with every $\mathcal{M}_k$.

Let $L:\Theta\times\mathbb{R}^{p+1}\to\mathbb{R}$ be the loss function. Assume that the true parameter $\theta$ minimizes the population risk $\mathcal{L}(\theta)=E[L(\theta;Z)]$, where E stands for expectation with respect to $P_\theta$. Define the local loss on the kth machine as $\mathcal{L}_k(\theta)=n^{-1}\sum_{i\in S_k}L(\theta;Z_i)$. Correspondingly, define the global loss function based on the whole sample as $\mathcal{L}_N(\theta)=N^{-1}\sum_{i\in S}L(\theta;Z_i)=K^{-1}\sum_{k=1}^K\mathcal{L}_k(\theta)$, whose minimizer is $\hat\theta=\arg\min_{\theta\in\Theta}\mathcal{L}_N(\theta)$. In most cases, the whole sample estimator $\hat\theta$ should be $\sqrt{N}$-consistent and asymptotically normal (Lehmann & Casella, 2006). If N is small enough so that the whole sample S can be read into the memory of a single computer, then $\hat\theta$ can be easily computed; the entire computation can be executed in the memory of that computer. On the other hand, if N is too large for the whole sample S to be placed on any single computer, then a distributed system must be used. In this case, the whole sample estimator $\hat\theta$ is no longer computable (or at least very difficult to compute) in practice. How to develop novel statistical methods for distributed systems then becomes a problem of great importance.

2.1. One-shot approach

To address this problem, various methods have been proposed. They can be roughly divided into two classes. The first class contains the so-called one-shot methods, which are reviewed in this subsection. The other class contains various iterative methods, which are reviewed in the next subsection.

The basic idea of the one-shot approach is to calculate some relevant statistics on each local machine. Subsequently, these statistics are sent to a central machine, where they are assembled into the final estimator. The most popular and direct way of aggregation is simple averaging. Specifically, for each $1\le k\le K$, machine $\mathcal{M}_k$ uses the local sample $S_k$ to compute the local empirical minimizer $\hat\theta_k=\arg\min_{\theta\in\Theta}\mathcal{L}_k(\theta)$. These local estimates (i.e., the $\hat\theta_k$'s) are then transferred to the central machine $\mathcal{M}_{\rm center}$, where they are averaged as $\bar\theta=K^{-1}\sum_{k=1}^K\hat\theta_k$. This leads to the final simple averaging estimator $\bar\theta$ (see Figure 1(a)).
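
To make the one-shot averaging procedure concrete, the following minimal sketch (in Python with NumPy) simulates the whole pipeline under an assumed least-squares loss, so that each local minimizer has a closed form; the data-generating model, sample sizes and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 20           # whole sample size, dimension, number of machines
n = N // K                        # local sample size
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)

# Step 1: each "machine" computes its local least-squares estimate on its own block.
local_estimates = []
for k in range(K):
    Xk, Yk = X[k * n:(k + 1) * n], Y[k * n:(k + 1) * n]
    theta_k, *_ = np.linalg.lstsq(Xk, Yk, rcond=None)
    local_estimates.append(theta_k)

# Step 2: the central machine averages the K local estimates (one round of communication).
theta_bar = np.mean(local_estimates, axis=0)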

Figure 1. Illustrations of the two different approaches. (a) one-shot approach and (b) iterative approach.


Obviously, the one-shot style of distributed framework is highly communication-efficient, because it requires only a single round of communication between each $\mathcal{M}_k$ and $\mathcal{M}_{\rm center}$. Hence, the communication cost is of the order O(Kp), where p is the dimension of each estimate $\hat\theta_k$. Theoretical properties of the simple averaging estimator have also been studied in the literature. For example, it was shown in Zhang et al. (2013, Corollary 2) that, under appropriate regularity conditions, (1) $E\|\bar\theta-\theta\|_2^2\le \frac{C_1}{N}+\frac{C_2}{n^2}+O\big(\frac{1}{Nn}+\frac{1}{n^3}\big)$, where $C_1$, $C_2$ are some positive constants. If n is sufficiently large such that $n^{-2}=o(N^{-1})$, then the dominant term in (1) becomes $C_1/N$, which is of the order $O(N^{-1})$. This is the same as that of the whole sample estimator. It also implies that, in order to attain the global convergence rate, we should not divide the whole sample into too many parts. A further improved theoretical result was obtained by Rosenblatt and Nadler (2016). They showed that the one-shot estimator is first-order equivalent to the whole sample estimator; however, the second-order error terms of $\bar\theta$ can be non-negligible for nonlinear models. A similar observation was also made by Huang and Huo (2015). The work of Duchi et al. (2014) revealed that the minimal communication budget to attain the global estimation error for linear regression is O(Kp) bits up to a logarithmic factor under some conditions. This result matches the simple averaging procedure and confirms the sharpness of the bound in (1). To further reduce the bias, a novel subsampling method was developed by Zhang et al. (2013). With this technique, the error bound is improved to $O(N^{-1}+n^{-3})$, which relaxes the restriction on the number of machines.

Instead of a linear combination of the local maximum likelihood estimates (MLEs) as in simple averaging, Liu and Ihler (2014) proposed a KL-divergence based combination method. The final estimator is computed as $\hat\theta_{\rm KL}=\arg\min_{\theta\in\Theta}\sum_{k=1}^K {\rm KL}\big(p(x|\hat\theta_k)\,\|\,p(x|\theta)\big)$, where $p(x|\theta)$ is the probability density of $P_\theta$ with respect to some proper measure $\mu$, and the KL-divergence is defined by ${\rm KL}(p(x)\,\|\,q(x))=\int_{\mathcal X}p(x)\log\{p(x)/q(x)\}\,d\mu(x)$. It was shown that $\hat\theta_{\rm KL}$ is exactly the global MLE $\hat\theta$ if $\{P_\theta:\theta\in\Theta\}$ is a full exponential family (as defined in their paper). This sheds light on inference for generalized linear models (GLMs) based on the exponential-family likelihood.

In many cases, some local machines might suffer from data of poor quality. This could lead to abnormal local estimates, which further degrade the statistical efficiency of the final estimator. To fix the problem, Minsker (2019) devised a robust assembling method, which leads to the estimator $\hat\theta_{\rm robust}=\arg\min_{\theta\in\Theta}\sum_{k=1}^K\rho(\|\theta-\hat\theta_k\|)$, where $\rho(\cdot)$ is a robust loss function satisfying certain conditions. For example, when $\rho(u)=u$ and p = 1 (the univariate case), $\hat\theta_{\rm robust}$ is the median of the $\hat\theta_k$'s, which is more robust against outliers than the simple average. Under some regularity conditions, it was shown that $\hat\theta_{\rm robust}$ achieves the same convergence rate as the whole sample estimator provided $K=O(\sqrt{N})$.
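
As an illustration of the robust aggregation idea, the sketch below replaces the simple average by a coordinate-wise median, a simple robust combiner in the same spirit; it is not the exact estimator of Minsker (2019), and all simulated quantities are made up.

import numpy as np

rng = np.random.default_rng(1)
K, p = 20, 5
# Simulated local estimates: most machines are accurate, one is badly corrupted.
local_estimates = rng.normal(loc=1.0, scale=0.05, size=(K, p))
local_estimates[0] += 100.0                         # a machine with poor-quality data

theta_mean = local_estimates.mean(axis=0)           # simple average: dragged far from 1 by the outlier
theta_median = np.median(local_estimates, axis=0)   # coordinate-wise median: stays close to 1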

2.2. Iterative approach

Although the one-shot approach involves little communication cost, it suffers from several disadvantages. First, the local machines need a sufficient amount of data (e.g., $n\gg\sqrt{N}$); otherwise the aggregated estimator cannot attain the same convergence rate as the global estimator. This prevents us from utilizing many machines to speed up the computation (Jordan et al., 2019; Wang et al., 2017). Second, the simple averaging estimator often performs poorly for nonlinear models (Huang & Huo, 2015; Jordan et al., 2019; Rosenblatt & Nadler, 2016). Last, when p diverges with N, the situation can be even worse (Lee et al., 2017; Rosenblatt & Nadler, 2016). This suggests that carefully designed algorithms allowing a reasonable number of iterations should be useful for a distributed system.

Inspired by the one-step method in M-estimation theory, Huang and Huo (2015) proposed a one-step refinement of the simple averaging estimator. Recall that $\bar\theta$ is the one-shot averaging estimator. To further improve its statistical efficiency, it is broadcast to each local machine. Next, the local gradient $\nabla\mathcal{L}_k(\bar\theta)$ and local Hessian $\nabla^2\mathcal{L}_k(\bar\theta)$ are computed on each $\mathcal{M}_k$. They are then reported to $\mathcal{M}_{\rm center}$ to form the whole sample gradient $\nabla\mathcal{L}_N(\bar\theta)=K^{-1}\sum_{k=1}^K\nabla\mathcal{L}_k(\bar\theta)$ and Hessian $\nabla^2\mathcal{L}_N(\bar\theta)=K^{-1}\sum_{k=1}^K\nabla^2\mathcal{L}_k(\bar\theta)$. Thus a one-step updated estimator can be constructed on $\mathcal{M}_{\rm center}$ as (2) $\hat\theta^{(1)}=\bar\theta-[\nabla^2\mathcal{L}_N(\bar\theta)]^{-1}\nabla\mathcal{L}_N(\bar\theta)$. Compared with the one-shot estimator, $\hat\theta^{(1)}$ involves one more round of communication. Nevertheless, the statistical efficiency of the resulting estimator can be substantially improved. In fact, Huang and Huo (2015) showed that $E\|\hat\theta^{(1)}-\theta\|_2^2\le \frac{C_1}{N}+O\big(\frac{1}{n^4}+\frac{1}{N^2}\big)$, where $C_1>0$ is some constant. Obviously, this upper bound on the mean squared error is lower than that in (1). To attain the global convergence rate, the local sample size only needs to satisfy $n^{-4}=o(N^{-1})$, which is a much milder condition. Furthermore, they showed that $\hat\theta^{(1)}$ has the same asymptotic efficiency as the whole sample estimator $\hat\theta$ under some regularity conditions.
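
For illustration, the sketch below mimics the one-step refinement under an assumed logistic-regression loss: each machine evaluates its local gradient and Hessian at the averaged estimator, and the central machine performs the single Newton update in (2). The data-generating model, the plain local Newton solver and all constants are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 20_000, 4, 20
n = N // K
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true)))

def local_grad_hess(Xk, Yk, theta):
    # Gradient and Hessian of the local (negative log-likelihood) logistic loss.
    mu = 1.0 / (1.0 + np.exp(-Xk @ theta))
    grad = Xk.T @ (mu - Yk) / len(Yk)
    hess = (Xk * (mu * (1.0 - mu))[:, None]).T @ Xk / len(Yk)
    return grad, hess

def local_minimizer(Xk, Yk, iters=25):
    # A plain local Newton solver standing in for each machine's own optimizer.
    theta = np.zeros(Xk.shape[1])
    for _ in range(iters):
        g, H = local_grad_hess(Xk, Yk, theta)
        theta = theta - np.linalg.solve(H, g)
    return theta

blocks = [(X[k*n:(k+1)*n], Y[k*n:(k+1)*n]) for k in range(K)]
theta_bar = np.mean([local_minimizer(Xk, Yk) for Xk, Yk in blocks], axis=0)   # one-shot average

# Each machine evaluates its gradient and Hessian at theta_bar; the centre averages them.
grads, hessians = zip(*(local_grad_hess(Xk, Yk, theta_bar) for Xk, Yk in blocks))
grad_bar = np.mean(grads, axis=0)
hess_bar = np.mean(hessians, axis=0)

# Single Newton-type update on the central machine, as in (2).
theta_one_step = theta_bar - np.linalg.solve(hess_bar, grad_bar)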

A natural idea to further extend the one-step estimator is to allow the iteration (2) to be executed multiple times. Specifically, let $\hat\theta^{(t)}$ be the estimator at the tth iteration. Then we can apply (2) with $\bar\theta$ replaced by $\hat\theta^{(t)}$ to generate the next estimator $\hat\theta^{(t+1)}$ (see Figure 1(b)). However, this requires a large number of Hessian matrices to be computed and transferred. If the parameter dimension p is relatively high, this leads to a significant communication cost of the order $O(Kp^2)$. To fix the problem, Shamir et al. (2014) proposed an approximate Newton method, which conducts Newton-type iterations distributedly without transferring the Hessian matrices. Following this strategy, Jordan et al. (2019) developed an approximate likelihood approach. Their key idea is to update the Hessian matrix on one single machine (e.g., $\mathcal{M}_{\rm center}$) only. Then, (2) can be revised to $\hat\theta^{(t+1)}=\hat\theta^{(t)}-[\nabla^2\mathcal{L}_{\rm center}(\hat\theta^{(t)})]^{-1}\nabla\mathcal{L}_N(\hat\theta^{(t)})$, where $\nabla^2\mathcal{L}_{\rm center}$ is the Hessian matrix computed on the central machine. By doing so, the communication cost of transmitting Hessian matrices is saved. Under some conditions, they showed that (3) $\|\hat\theta^{(t+1)}-\hat\theta\|_2\le \frac{C_1}{\sqrt{n}}\|\hat\theta^{(t)}-\hat\theta\|_2$, for $t\ge 0$, holds with high probability, where $C_1>0$ is some constant. By the linear convergence formula (3), we can see that about $\lceil\log K/\log n\rceil$ iterations are required to achieve the same $\sqrt{N}$-consistency as the whole sample estimator $\hat\theta$, provided $\hat\theta^{(0)}$ is $\sqrt{n}$-consistent. Note that if $n=K=\sqrt{N}$, one iteration suffices to attain the optimal convergence rate. However, the satisfactory performance of this method relies on a good choice of the machine on which the Hessian is updated (Fan, Guo et al., 2019a). To fix this problem, Fan, Guo et al. (2019a) added an extra regularization term to the approximate likelihood used in Jordan et al. (2019). With this modification, the performance of the resulting estimator is much improved. Theoretically, they established a similar linear convergence rate under more general conditions, which require no strict homogeneity of the local loss functions.
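
The following least-squares sketch conveys the idea of keeping the Hessian on a single machine: every round communicates only length-p gradients, while the (approximate) Hessian is formed once from the central machine's own data. It is a simplified caricature of this class of algorithms rather than the exact procedure of Shamir et al. (2014) or Jordan et al. (2019); all settings are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p, K = 10_000, 5, 20
n = N // K
theta_true = rng.normal(size=p)
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)
blocks = [(X[k*n:(k+1)*n], Y[k*n:(k+1)*n]) for k in range(K)]

def global_gradient(theta):
    # Each machine returns its local least-squares gradient (a length-p vector); the centre averages them.
    return np.mean([Xk.T @ (Xk @ theta - Yk) / n for Xk, Yk in blocks], axis=0)

# Only the "central" machine (taken to be machine 0 here) ever forms a Hessian, from its own n points.
X0 = blocks[0][0]
hess_center = X0.T @ X0 / n

theta = np.zeros(p)   # in theory a sqrt(n)-consistent initial value; zero is used here for brevity
for _ in range(5):    # a few rounds, each costing O(Kp) communication instead of O(Kp^2)
    theta = theta - np.linalg.solve(hess_center, global_gradient(theta))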

2.3. Shrinkage methods

We study shrinkage methods for sparse estimation in this subsection. For a high-dimensional problem, especially when the dimension of $\theta$ is larger than the sample size N, it is difficult to estimate $\theta$ without additional assumptions (Hastie et al., 2015). A popular constraint for tackling such problems is sparsity, which assumes that only a subset of the entries of $\theta$ is non-zero. The index set of non-zero entries is called the support of $\theta$, that is, ${\rm supp}(\theta)=\mathcal{A}=\{1\le j\le p:\ \theta_j\ne 0\}$. To induce a sparse solution, an additional regularization term on $\theta$ is often added to the loss function. Specifically, we solve the shrinkage regression problem $\min_{\theta\in\Theta}\{\mathcal{L}_N(\theta)+\sum_{j=1}^p\rho_\lambda(|\theta_j|)\}$, where $\rho_\lambda(\cdot)$ is a penalty function with a regularization parameter $\lambda>0$. Popular choices are the LASSO (Tibshirani, 1996), the SCAD (Fan & Li, 2001) and others discussed in Zhang and Zhang (2012). For simplicity, we consider the LASSO estimator in the framework of linear regression. Specifically, the whole sample estimator is computed as $\hat\theta_\lambda=\arg\min_{\theta\in\Theta}\{\frac{1}{N}\|Y-X\theta\|_2^2+\lambda\|\theta\|_1\}$, where $Y=(Y_1,\ldots,Y_N)^\top\in\mathbb{R}^N$ is the response vector, $X=(X_1,\ldots,X_N)^\top\in\mathbb{R}^{N\times p}$ is the design matrix, and $\|\theta\|_1=\sum_{j=1}^p|\theta_j|$ denotes the $\ell_1$-norm of $\theta$. It is known that the LASSO procedure produces biased estimates of the large coefficients. This is undesirable for simple averaging procedures, since averaging cannot eliminate a systematic bias. To reduce the bias, Javanmard and Montanari (2014) proposed a debiasing technique for the LASSO estimator, namely (4) $\hat\theta_\lambda^{(d)}=\hat\theta_\lambda+\frac{1}{N}MX^\top(Y-X\hat\theta_\lambda)$, where $M\in\mathbb{R}^{p\times p}$ is an approximation to the inverse of $\hat\Sigma=X^\top X/N$. Note that when $\hat\Sigma$ is invertible (e.g., when $N\ge p$), setting $M=\hat\Sigma^{-1}$ gives $\hat\theta_\lambda^{(d)}=(X^\top X)^{-1}X^\top Y$, which is the ordinary least squares estimator and obviously unbiased. Hence, procedure (4) compensates for the bias incurred by the $\ell_1$ regularization in some sense.
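
A small sketch of the debiasing step (4) in the easy regime N > p, where M can simply be taken as the inverse of the sample covariance matrix, is given below; the genuinely high-dimensional case requires a more careful construction of M (e.g., by nodewise regression), which is not shown. The example assumes scikit-learn's Lasso solver and an arbitrary choice of the regularization level.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p, s = 2_000, 50, 5
theta_true = np.zeros(p)
theta_true[:s] = 2.0                                  # a sparse truth
X = rng.normal(size=(N, p))
Y = X @ theta_true + rng.normal(size=N)

theta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y).coef_

# Debiasing as in (4); here M is the inverse sample covariance, which exists since N > p.
Sigma_hat = X.T @ X / N
M = np.linalg.inv(Sigma_hat)
theta_debiased = theta_lasso + M @ X.T @ (Y - X @ theta_lasso) / N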

With this debiasing technique, Lee et al. (2017) developed a one-shot type estimator for the LASSO problem. Specifically, let $\hat\theta_{k,\lambda}^{(d)}$ be the debiased LASSO estimator computed on $\mathcal{M}_k$. Then an averaging estimator can be constructed on $\mathcal{M}_{\rm center}$ as $\bar\theta_\lambda=K^{-1}\sum_{k=1}^K\hat\theta_{k,\lambda}^{(d)}$. Unfortunately, the sparsity level can be seriously degraded by averaging. For this reason, a hard-thresholding step often comes as a remedy. It was also noticed that the debiasing step is computationally expensive; hence an improved algorithm was proposed to alleviate the computational cost of this step. Under certain conditions, they showed that the resulting estimator has the same convergence rate as the whole sample estimator. Battey et al. (2018) investigated the same problem with an additional study of hypothesis testing. Furthermore, a refitted estimation procedure was used to preserve the global oracle property of the distributed estimator. Extensions to high-dimensional GLMs can also be found in Lee et al. (2017) and Battey et al. (2018). For this model, Chen and Xie (2014) implemented a majority voting method to aggregate the regularized local estimates. For low-dimensional sparse problems with smooth loss functions (e.g., GLMs, the Cox model), Zhu et al. (2019) developed a local quadratic approximation method with an adaptive-LASSO type penalty. They showed rigorously that the resulting estimator can be as good as the global oracle estimator.

Intuitively, the above one-shot methods may need a stringent condition on the local sample size to attain the global convergence rate, due to the limited communication. In fact, the simple averaging estimator requires $n\gtrsim Ks^2\log p$ to match the oracle rate in the context of the sparse linear model (Lee et al., 2017), where $s=|\mathcal{A}|$ is the number of non-zero entries of $\theta$. For this problem, Wang et al. (2017) and Jordan et al. (2019) independently proposed a communication-efficient iterative algorithm, which constructs a regularized likelihood using the local Hessian matrix. As demonstrated by Wang et al. (2017), a one-step estimator $\hat\theta^{(1)}$ suffices to achieve the global convergence rate if $n\gtrsim Ks^2\log p$ (the condition used in Lee et al., 2017). Furthermore, if multi-round communication is allowed, $\hat\theta^{(t+1)}$ (i.e., the estimator at the (t+1)th iteration) can match the whole sample estimator as long as $n\gtrsim s^2\log p$ and $t\gtrsim\log K$, under certain conditions.

2.4. Non-smooth loss based models

The methods described above typically require the loss function L to be sufficiently smooth, although a non-smooth regularization term is permitted (see, e.g., Jordan et al., 2019; Wang et al., 2017; Zhang et al., 2013; Zhu et al., 2019). However, there are also some useful models involving non-smooth loss functions, such as quantile regression and the support vector machine. It is then of great interest to develop distributed methods for these models.

We first focus on the quantile regression (QR) model. The QR model is widely used in statistics and econometrics, and is more robust against outliers than the ordinary quadratic loss (Koenker, 2005). Specifically, a QR model assumes $Y_i=X_i^\top\theta_\tau+\epsilon_i$, $i\in S$, where $X_i\in\mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta_\tau\in\mathbb{R}^p$ is the true regression coefficient, and $\epsilon_i$ is the random noise satisfying $P(\epsilon_i\le 0|X_i)=\tau$, where $\tau\in(0,1)$ is a known quantile level. It is known that $\theta_\tau$ is the minimizer of $E[\rho_\tau(Y_i-X_i^\top\theta)]$. Here $\rho_\tau(u)=u(\tau-1\{u\le 0\})=u(1\{u>0\}+\tau-1)$ is the non-differentiable check-loss function, where $1\{\cdot\}$ is the indicator function. When the data size N is moderate, we can estimate $\theta_\tau$ by $\hat\theta_\tau=\arg\min_{\theta\in\Theta}N^{-1}\sum_{i\in S}\rho_\tau(Y_i-X_i^\top\theta)$ on a single machine. However, when N is very large, a distributed system has to be used. Accordingly, distributed estimators have to be developed.
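
To fix notation, the check loss and a single-machine QR fit can be sketched as follows; solving the non-smooth problem with a generic derivative-free optimizer is only a convenience for illustration (QR is usually solved by linear programming), and the design and noise distribution are made up so that P(eps <= 0 | X) = tau holds.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u <= 0})
    return u * (tau - (u <= 0))

rng = np.random.default_rng(0)
n, p, tau = 1_000, 3, 0.6
theta_tau = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
eps = rng.normal(size=n) - norm.ppf(tau)        # noise shifted so that P(eps <= 0 | X) = tau
Y = X @ theta_tau + eps

def local_objective(theta):
    return np.mean(check_loss(Y - X @ theta, tau))

# Derivative-free optimization is used only for illustration.
theta_hat = minimize(local_objective, x0=np.zeros(p), method="Nelder-Mead").x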

In this regard, Volgushev et al. (2019) studied the one-shot averaging estimator. Specifically, a local estimator $\hat\theta_{k,\tau}$ is first computed on each local machine $\mathcal{M}_k$. Then the averaging estimator is assembled as $\bar\theta_\tau=K^{-1}\sum_{k=1}^K\hat\theta_{k,\tau}$ on the central machine $\mathcal{M}_{\rm center}$. They further investigated the theoretical properties of the averaging estimator in detail. It was shown that if the number of machines satisfies $K=o(\sqrt{N}/\log N)$, then $\bar\theta_\tau$ is as efficient as the whole sample estimator $\hat\theta_\tau$ under some regularity conditions. Chen and Zhou (2020) proposed an estimating-equation based one-shot approach for the QR problem. The asymptotic equivalence between the resulting estimator and the whole sample estimator was also established under $K=o(N^{1/4})$ and some other conditions. It can be seen that the performance of one-shot approaches relies heavily on the local sample size. In fact, Volgushev et al. (2019) showed that $K=o(\sqrt{N})$ is a necessary condition for the global efficiency of the simple averaging estimator $\bar\theta_\tau$. To remove the constraint $K=o(\sqrt{N})$ on the number of machines, Chen et al. (2019) proposed an iterative approach. Their key idea is to approximate the check-loss function by a smooth alternative. More specifically, they approximated $1\{u>0\}$ by a smooth function $H(u/h)$, where $H(\cdot)$ is a smooth cumulative distribution function and $h>0$ is a tuning parameter controlling the approximation accuracy (see Figure 2(a)). With this modification, the algorithm updates the estimates by (5) $\hat\theta_\tau^{(t+1)}=[V(\hat\theta_\tau^{(t)})]^{-1}U(\hat\theta_\tau^{(t)})$, where $U(\theta)=\sum_{k=1}^KU_k(\theta)$, $V(\theta)=\sum_{k=1}^KV_k(\theta)$, and $U_k\in\mathbb{R}^p$, $V_k\in\mathbb{R}^{p\times p}$ depend only on the bandwidth h and the local sample $S_k$. It was shown that a constant number of rounds of iteration suffices to match the convergence rate of the whole sample estimator. The communication cost is thus roughly of the order $O(Kp^2)$, which is not applicable when p is very large. For the high-dimensional QR problem, Zhao et al. (2014) and Zhao et al. (2019) therefore adopted a one-shot averaging method based on debiased local estimates as in (4). In addition, Chen et al. (2020) proposed a communication-efficient multi-round algorithm inspired by the approximate Newton method (Shamir et al., 2014). This iterative approach removes the restriction on the number of machines. A revised divide-and-conquer stochastic gradient descent method for QR and other models with diverging dimension can be found in Chen, Liu et al. (2021b).
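
The smoothing device can be illustrated by taking H to be the standard normal distribution function, as in the minimal sketch below; only the smoothed loss itself is shown, not the full iterative estimator (5), and the bandwidth value is arbitrary.

import numpy as np
from scipy.stats import norm

def check_loss(u, tau):
    return u * (tau - (u <= 0))

def smoothed_check_loss(u, tau, h):
    # Replace the indicator 1{u > 0} in rho_tau(u) = u * (1{u > 0} + tau - 1) by H(u / h),
    # with H taken to be the standard normal distribution function.
    return u * (norm.cdf(u / h) + tau - 1.0)

u = np.linspace(-2.0, 2.0, 9)
tau, h = 0.6, 0.2
print(np.c_[check_loss(u, tau), smoothed_check_loss(u, tau, h)])   # the two columns are close for small h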

Figure 2. Approximation of two non-smooth loss functions. (a) QR loss with τ=0.6 and (b) hinge loss.


We next consider the support vector machine (SVM), which is one of the most successful statistical learning methods (Vapnik, 2013). The classical SVM is aimed at the binary classification problem, i.e., the response variable $Y_i\in\{-1,1\}$. Formally, a standard linear SVM solves the problem $\min_{\theta\in\Theta}N^{-1}\sum_{i\in S}(1-Y_iX_i^\top\theta)_++\lambda\|\theta\|_2^2$, where $(u)_+=u\,1\{u>0\}$ is the hinge loss and $\lambda>0$ is the regularization parameter. Using the same smoothing technique as in Chen et al. (2019), i.e., replacing the hinge loss with a smooth alternative (see Figure 2(b)), Wang, Yang et al. (2019) proposed an iterative algorithm like (5). To reduce the communication cost incurred by transferring matrices, they further employed the approximate Newton method (Shamir et al., 2014). Theoretically, they established the asymptotic normality of the estimator, which can be used to construct confidence intervals. For the ultra-high dimensional SVM problem, Lian and Fan (2018) studied the one-shot averaging method with a debiasing procedure similar to (4).
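
In the same spirit, the kink of the hinge loss can be smoothed out; the sketch below uses a Gaussian-smoothed positive part as one possible surrogate. This is only meant to illustrate the smoothing idea and is not claimed to be the exact surrogate used by Wang, Yang et al. (2019).

import numpy as np
from scipy.stats import norm

def hinge(u):
    # Hinge loss as a function of the margin u = y * x' theta.
    return np.maximum(1.0 - u, 0.0)

def smoothed_hinge(u, h):
    # A Gaussian-smoothed positive part: E[max(v + h Z, 0)] = v * Phi(v / h) + h * phi(v / h), with v = 1 - u.
    v = 1.0 - u
    return v * norm.cdf(v / h) + h * norm.pdf(v / h)

u = np.linspace(-1.0, 3.0, 9)
print(np.c_[hinge(u), smoothed_hinge(u, 0.1)])   # close agreement away from the kink, smooth at u = 1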

3. Nonparametric models

Different from parametric models, a nonparametric model typically involves infinite-dimensional parameters. In this section, we focus mainly on nonparametric regression problems. Specifically, consider a general regression model $Y_i=f(X_i)+\epsilon_i$, $i\in S$, where $f(\cdot)$ is an unknown but sufficiently smooth function and $\epsilon_i$ is random noise with zero mean. The aim of nonparametric regression is to estimate the function $f\in\mathcal{F}$, where $\mathcal{F}$ is a given nonparametric class of functions.

3.1. Local smoothing

One way to estimate $f(\cdot)$ is to fit a locally constant model by kernel smoothing (Fan & Gijbels, 1996). More concretely, the whole sample estimator is given by $\hat f_h(x)=\sum_{i\in S}W_{h,X_i}(x)Y_i$, where $W_{h,X_i}(x)\ge 0$ is the local weight at X = x satisfying $\sum_{i\in S}W_{h,X_i}(x)=1$. Specifically, for the Nadaraya–Watson kernel estimator, we have $W_{h,X_i}(x)=K((X_i-x)/h)/\sum_{i'\in S}K((X_{i'}-x)/h)$, where $K(\cdot)$ is a kernel function and $h>0$ is the bandwidth. In the univariate case (p = 1), classical theory states that the mean squared error of $\hat f_h(x)$ is of the order $O(h^4+(Nh)^{-1})$ (Fan & Gijbels, 1996). Thus, the optimal rate $O(N^{-4/5})$ can be achieved by choosing the bandwidth $h=O(N^{-1/5})$.
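
For concreteness, a minimal univariate Nadaraya–Watson sketch is given below, with a Gaussian kernel, simulated data and the bandwidth constant set to one; all of these choices are illustrative.

import numpy as np

def nadaraya_watson(x0, X, Y, h):
    # Nadaraya-Watson estimate of E[Y | X = x0] with a Gaussian kernel and bandwidth h.
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
N = 5_000
X = rng.uniform(-2.0, 2.0, size=N)
Y = np.sin(np.pi * X) + 0.3 * rng.normal(size=N)

h = N ** (-1 / 5)                       # the O(N^{-1/5}) bandwidth order, with the constant set to one
print(nadaraya_watson(0.5, X, Y, h))    # should be close to sin(0.5 * pi) = 1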

For distributed kernel smoothing, a one-shot estimator can also be constructed. Let $\hat f_{k,h}(x)$ be the local estimator computed on $\mathcal{M}_k$. Then an averaging estimator can be obtained as $\bar f_h(x)=K^{-1}\sum_{k=1}^K\hat f_{k,h}(x)$. Chang, Lin, and Wang (2017) studied the theoretical properties of $\bar f_h(x)$ in a specific function space $\mathcal{F}$. They established the same minimax convergence rate for $\bar f_h(x)$ as that of the whole sample estimator. However, they found that a strict restriction on the number of machines K is needed to achieve this optimal rate. To fix the problem, two solutions were provided: a data-dependent bandwidth selection algorithm and an algorithm with a qualification step.

The nearest neighbours method can be regarded as another local smoothing method. Qiao et al. (2019) studied nearest neighbours classification in a distributed setting, where the optimal number of neighbours needed to achieve the optimal rate of convergence was derived. Li et al. (2013) discussed the problem of density estimation for scattered datasets. Kaplan (2019) focused on the choice of bandwidth for nonparametric smoothing techniques. All the works in this subsection indicate that the bandwidth (or local smoothing parameter) used in the distributed setting should be adjusted according to the whole sample size N, rather than the local sample size n.

3.2. RKHS methods

We next discuss another popular nonparametric regression method, the reproducing kernel Hilbert space (RKHS) method. An RKHS $\mathcal{H}$ can be induced by a continuous, symmetric and positive semi-definite kernel function $K(\cdot,\cdot):\mathbb{R}^p\times\mathbb{R}^p\to\mathbb{R}$. Two typical examples are the polynomial kernel $K(x_1,x_2)=(x_1^\top x_2+1)^d$ with an integer $d\ge 1$, and the radial kernel $K(x_1,x_2)=\exp(-\gamma\|x_1-x_2\|_2^2)$ with $\gamma>0$. Refer to, for example, Berlinet and Thomas-Agnan (2011) and Wahba (1990) for more details about RKHSs. Our target is then to find an $\hat f\in\mathcal{H}$ minimizing a penalized empirical loss. That leads to the whole sample estimator (6) $\hat f_\lambda=\arg\min_{f\in\mathcal{H}}\{\frac{1}{N}\sum_{i\in S}(Y_i-f(X_i))^2+\lambda\|f\|_{\mathcal{H}}^2\}$, where $\|\cdot\|_{\mathcal{H}}$ is the norm associated with the RKHS $\mathcal{H}$ and $\lambda>0$ is the regularization parameter. This problem is also known as kernel ridge regression (KRR). By the representer theorem for the RKHS (Wahba, 1990), any solution to problem (6) must have the linear form $\hat f_\lambda(x)=\sum_{i\in S}\alpha_iK(X_i,x)$, where $\alpha_i\in\mathbb{R}$ for each $i\in S$. By this property, we can treat the KRR as a parametric problem with unknown parameter $\alpha=(\alpha_1,\ldots,\alpha_N)^\top\in\mathbb{R}^N$. The error bounds of the whole sample estimator $\hat f_\lambda$ have been well established in the existing literature (e.g., Steinwart et al., 2009; Zhang, 2005). However, a standard implementation of the KRR involves inverting a kernel matrix in $\mathbb{R}^{N\times N}$ (Saunders et al., 1998). Therefore, when N is extremely large, it is time consuming or even computationally infeasible to process the whole sample on a single machine, and a distributed system should be considered.
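
Via the representer theorem, computing (6) reduces to solving a linear system in the coefficient vector alpha, as the following sketch with a radial kernel shows; the kernel width and regularization level are arbitrary, and the O(N^3) solve is precisely the bottleneck that motivates the distributed treatments below.

import numpy as np

rng = np.random.default_rng(0)
N, gamma, lam = 500, 1.0, 1e-3
X = rng.uniform(-2.0, 2.0, size=(N, 1))
Y = np.sin(np.pi * X[:, 0]) + 0.3 * rng.normal(size=N)

# Radial kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Kmat = np.exp(-gamma * sq_dist)

# Representer theorem: f(x) = sum_i alpha_i K(X_i, x); minimizing (6) gives (K + N * lam * I) alpha = Y.
alpha = np.linalg.solve(Kmat + N * lam * np.eye(N), Y)

def f_hat(x_new):
    k_new = np.exp(-gamma * np.sum((X - x_new) ** 2, axis=1))
    return k_new @ alpha

print(f_hat(np.array([0.5])))           # should be close to sin(0.5 * pi) = 1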

In this regard, Zhang et al. (2015) studied distributed KRR via the one-shot averaging approach. Specifically, each machine $\mathcal{M}_k$ computes the local KRR estimate $\hat f_{k,\lambda}$ by (6) based on the local sample $S_k$. Then the central machine $\mathcal{M}_{\rm center}$ averages them to obtain the final estimator $\bar f_\lambda=K^{-1}\sum_{k=1}^K\hat f_{k,\lambda}$. Theoretically, they established the optimal convergence rate of the mean squared error of $\bar f_\lambda$ for different types of kernel functions, under some regularity conditions. Lin et al. (2017) derived a similar optimal error bound under relaxed conditions. Xu et al. (2016) extended the loss function in (6) to a more general form. Related works on the distributed KRR problem via the one-shot averaging approach can be found in Shang and Cheng (2017), Lin and Zhou (2018), Guo et al. (2019), Mücke and Blanchard (2018), Wang (2019) and many others. It has been noted that these one-shot approaches require the number of machines to diverge at a relatively slow rate in order to attain the global convergence rate. To fix the problem, Chang, Lin, and Zhou (2017) proposed a semi-supervised learning framework that utilizes additional unlabelled data (i.e., observations without the response $Y_i$). The recent work of Lin et al. (2020) allows communication between machines to improve the performance. To choose an optimal tuning parameter $\lambda$ in (6), Xu et al. (2018) proposed a distributed generalized cross-validation method.

For semiparametric models, Zhao et al. (2016) considered a partially linear model with heterogeneous data in a distributed setting. Specifically, they assumed the model (7) $Y_i=X_i^\top\theta^{(k)}+f(W_i)+\epsilon_i$, $i\in S_k$, where $W_i\in\mathbb{R}$ is an additional covariate, $f(\cdot)$ is an unknown function, and $\theta^{(k)}\in\mathbb{R}^p$ is the true linear coefficient associated with the data on $\mathcal{M}_k$ for $1\le k\le K$. In other words, the local data on different machines are assumed to share the same nonparametric part, but are allowed to have different linear coefficients. To estimate the unknown function and coefficients, they extended the classical RKHS theory to cope with the partially linear function space. Under some regularity conditions, the resulting estimator of the nonparametric part is shown to be as efficient as the whole sample estimator, provided the number of machines does not grow too fast. The case with a high-dimensional linear part has also been investigated. For example, under the homogeneity assumption (i.e., the linear coefficients $\theta^{(k)}$ are assumed to be identical to $\theta$ across different machines), Lv and Lian (2017) adopted the one-shot averaging approach with a debiasing technique analogous to (4) to estimate the linear coefficient. Lian et al. (2019) considered the same heterogeneous model as in (7), but with the linear part in a high-dimensional setting (i.e., p > N). For this model, they proposed a novel projection approach to estimate the common nonparametric part (not in an RKHS framework). Theoretically, the asymptotic normality of the one-shot averaging estimator of the nonparametric function was established under certain conditions.

4. Other related works

4.1. Principal component analysis

Principal component analysis (PCA) is a common procedure for reducing the dimension of the data and is widely used in practical data analysis. Unlike regression problems, PCA is an unsupervised method, which does not require a response variable Y. To conduct a PCA, a covariance matrix is constructed as $\hat\Sigma=N^{-1}\sum_{i\in S}X_iX_i^\top$, where the $X_i$'s are assumed to be centred already. Next, a standard singular value decomposition (SVD) is applied to $\hat\Sigma$, which leads to $\hat\Sigma=\hat V\hat D\hat V^\top$, where $\hat D$ is a diagonal matrix of eigenvalues and $\hat V$ is an orthogonal matrix of eigenvectors. The columns of $\hat V$ are then the principal component directions that we need.

In a distributed setting, simply averaging the locally estimated eigenvectors does not give a valid result. To solve this problem, Fan, Wang et al. (2019b) developed a divide-and-conquer algorithm for estimating eigenspaces. It involves only a single round of communication and is quite easy to implement. We state it as follows (Fan, Wang et al., 2019b, Algorithm 1):

  1. For each $k=1,\ldots,K$, machine $\mathcal{M}_k$ computes the d leading eigenvectors of the local sample covariance matrix $\hat\Sigma_k=n^{-1}\sum_{i\in S_k}X_iX_i^\top$, denoted by $\hat v_{1,k},\ldots,\hat v_{d,k}\in\mathbb{R}^p$. They are arranged as the columns of $\hat V_k=(\hat v_{1,k},\ldots,\hat v_{d,k})\in\mathbb{R}^{p\times d}$, which is then sent to the central machine $\mathcal{M}_{\rm center}$.

  2. The central machine $\mathcal{M}_{\rm center}$ averages the K local projection matrices to obtain $\tilde\Sigma=K^{-1}\sum_{k=1}^K\hat V_k\hat V_k^\top$. It then computes the d leading eigenvectors of $\tilde\Sigma$, denoted by $\tilde v_1,\ldots,\tilde v_d\in\mathbb{R}^p$, which are the estimators of the first d principal component directions that we need.
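
A compact sketch of the two steps above, under a toy spiked-covariance model with illustrative dimensions, is as follows.

import numpy as np

rng = np.random.default_rng(0)
N, p, K, d = 20_000, 30, 20, 3
n = N // K
# Toy spiked-covariance data: a few strong directions plus isotropic noise.
V_true = np.linalg.qr(rng.normal(size=(p, d)))[0]
X = rng.normal(size=(N, d)) @ (3.0 * V_true.T) + rng.normal(size=(N, p))

def top_eigvecs(S, d):
    # d leading eigenvectors (as columns) of a symmetric matrix S, ordered by decreasing eigenvalue.
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1][:d]]

# Step 1: each machine sends the d leading eigenvectors of its local covariance (a p x d matrix).
V_locals = [top_eigvecs(X[k*n:(k+1)*n].T @ X[k*n:(k+1)*n] / n, d) for k in range(K)]

# Step 2: the centre averages the projection matrices and extracts d leading eigenvectors again.
Sigma_tilde = np.mean([Vk @ Vk.T for Vk in V_locals], axis=0)
V_distributed = top_eigvecs(Sigma_tilde, d)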

It is noticeable that the communication cost of the above one-shot algorithm is of the order O(Kdp). This can be considered communication-efficient, since d is usually much smaller than p in practice. Fan, Wang et al. (2019b) showed that, under appropriate conditions, the distributed estimator achieves the same convergence rate as the global estimator. Cases with heterogeneous local data were also investigated in their work. To further remove the restriction on the number of machines, Chen, Lee et al. (2021) proposed a communication-efficient multi-round algorithm based on the approximate Newton method (Shamir et al., 2014).

4.2. Feature screening

Massive datasets often involve ultrahigh-dimensional data, for which feature screening is critically important (Fan & Lv, 2008). To fix ideas, consider a standard linear regression model $Y_i=X_i^\top\theta+\epsilon_i$, $i\in S$, where $X_i\in\mathbb{R}^p$ is the covariate vector, $Y_i$ is the corresponding response, $\theta\in\mathbb{R}^p$ is the true parameter, and $\epsilon_i$ is random noise. To screen for the most promising features, the seminal method of sure independence screening (SIS) was proposed by Fan and Lv (2008). Specifically, let $\mathcal{A}=\{1\le j\le p:\theta_j\ne 0\}$ be the true sparse model. Let $\omega_j$ be the Pearson correlation between the jth feature and the response Y. Then SIS screens features by a hard-thresholding procedure, $\hat{\mathcal{A}}_\gamma=\{1\le j\le p:|\hat\omega_j|>\gamma\}$, where $\gamma$ is a prespecified threshold and $\hat\omega_j$ is the whole sample estimator of $\omega_j$. Under some specific conditions, Fan and Lv (2008) showed the sure screening property of SIS, that is, $P(\mathcal{A}\subset\hat{\mathcal{A}}_\gamma)\to 1$ as $N\to\infty$. However, the estimator $\hat\omega_j$ is usually biased for many correlation measures. This indicates that a direct one-shot averaging approach is unlikely to be the best practice for a distributed system. To fix the problem, Li et al. (2020) proposed a novel debiasing technique. They found that many correlation measures can be expressed as $\omega_j=g(\nu_1,\ldots,\nu_s)$, including the Pearson correlation used above, the Kendall $\tau$ rank correlation, the SIRS correlation (Zhu et al., 2011), etc. Therefore, they used U-statistics to estimate the components $\nu_q$, $1\le q\le s$, on the local machines. These unbiased estimators of the $\nu_q$'s are then averaged on the central machine $\mathcal{M}_{\rm center}$. Consequently, $\mathcal{M}_{\rm center}$ can construct the distributed estimator $\tilde\omega_j$ by plugging the averaged component estimators into the known function g. Finally, they showed the sure screening property of $\hat{\mathcal{A}}_\gamma$ based on the distributed estimators under some regularity conditions. When the feature dimension is much larger than the sample size (i.e., $p\gg N$), another distributed computing strategy is to partition the whole dataset by features rather than by samples. Refer to, for example, Song and Liang (2015) and Yang et al. (2016) for more details.
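
The marginal screening step itself is straightforward to sketch with the Pearson correlation, as below; the distributed, U-statistic based debiasing of Li et al. (2020) is not reproduced, and the threshold and data-generating settings are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, p, s = 1_000, 5_000, 4
theta = np.zeros(p)
theta[:s] = 1.5                                   # the true sparse model: the first s features
X = rng.normal(size=(N, p))
Y = X @ theta + rng.normal(size=N)

# Marginal Pearson correlations between each feature and the response.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
omega_hat = (Xc.T @ Yc) / (N * Xc.std(axis=0) * Yc.std())

gamma = 0.3                                       # prespecified hard threshold
A_hat = np.flatnonzero(np.abs(omega_hat) > gamma)
print(A_hat)                                      # should recover the first s indices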

4.3. Bootstrap

Bootstrap and related resampling techniques provide a general and easily implemented procedure for automatic statistical inference. However, these methods are usually computationally expensive; when the sample size N is very large, they can even be practically infeasible. To mitigate this computational issue, various alternative methods have been proposed, such as the subsampling approach (Politis et al., 1999) and the ‘m-out-of-n’ bootstrap (Bickel et al., 2012). Their key idea is to reduce the resample size. However, due to the difference between the sizes of the whole sample and the resample, an additional correction step is generally required to rescale the result, which makes these methods less automatic.

To solve this problem, Kleiner et al. (2014) proposed the bag of little bootstraps (BLB) method. It integrates the idea of subsampling and can be computed distributedly without a correction step. Suppose that the N sample units have been randomly and evenly partitioned across K machines, and that we want to assess the accuracy of a point estimator of some parameter $\theta$. Their algorithm can be summarized as follows.

  1. For each $1\le k\le K$, machine $\mathcal{M}_k$ draws r resamples of size N (instead of n) from $S_k$ with replacement, and computes r estimates of $\theta$, one from each resample. After that, each $\mathcal{M}_k$ computes some accuracy measure (e.g., variance, confidence region) from the r estimates, denoted by $\hat\xi_k$. Finally, all the local machines send their $\hat\xi_k$'s to the central machine $\mathcal{M}_{\rm center}$.

  2. The central machine $\mathcal{M}_{\rm center}$ aggregates these $\hat\xi_k$'s as $\bar\xi=K^{-1}\sum_{k=1}^K\hat\xi_k$, and $\bar\xi$ is the final accuracy measure that we need.

Remarkably, the local machines never actually need to process datasets of size N, even though the nominal resample size is N. This is because each machine holds at most n sample units; in fact, randomly generating appropriate weight vectors of length n suffices to approximate the resampling process.
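
The weight-vector trick can be sketched for a simple sample-mean example as follows: each nominal-size-N resample is represented by a length-n vector of multinomial counts, so no machine ever materializes N observations. The accuracy measure (a bootstrap standard error) and all constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, K, r = 100_000, 20, 50
n = N // K
data_blocks = [rng.normal(loc=1.0, size=n) for _ in range(K)]   # the K local subsamples

xi_local = []
for block in data_blocks:
    estimates = []
    for _ in range(r):
        # A resample of nominal size N from the n local points, stored only as multinomial counts.
        w = rng.multinomial(N, np.full(n, 1.0 / n))
        estimates.append(np.sum(w * block) / N)                 # weighted point estimate (here a mean)
    xi_local.append(np.std(estimates))                          # local accuracy measure: a bootstrap SE
xi_bar = np.mean(xi_local)                                      # aggregated on the central machine
print(xi_bar)                                                   # close to 1 / sqrt(N)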

5. Future study

To conclude the article, we discuss a number of interesting topics for future study. First, for datasets of truly massive size, a distributed system is definitely needed, since no single machine can store the data. On the other hand, for datasets of sufficiently small size, traditional in-memory statistical methods can be used immediately. This leaves a big gap between the big and small datasets. Such middle-sized data are often much larger than the computer memory but smaller than the hard drive. Consequently, they can be comfortably placed on a personal computer, but can hardly be processed in memory as a whole. For such datasets, the sizes are not large enough to justify an expensive distributed system, yet not small enough to be handled by traditional statistical methods. How to analyse datasets of this size seems to be a topic worth studying. Second, when the whole dataset is allocated to local machines randomly and evenly, the data on different machines are independent, identically distributed and balanced. Then all of the methods discussed above can be safely implemented. However, when the data on different machines are collected from (for example) different regions, the homogeneity of the local data is usually hard to guarantee. The situation can be even worse if the sample sizes allocated to different local machines are very different. How to cope with such heterogeneous and unbalanced local data is a problem of great importance (Wang et al., 2020). The idea of meta-analysis may be applicable in these situations (Liu et al., 2015; Xu & Shao, 2020; Zhou & Song, 2017). Finally, in the era of big data, personal privacy is under unprecedented threat. How to protect users' private information during the learning process deserves urgent attention. In this regard, differential privacy (DP) provides a theoretical framework for privacy-preserving data analysis (Dwork, 2008). Related works on distributed learning include Agarwal et al. (2018), Truex et al. (2019), Wang, Ishii et al. (2019) and many others. Although this is currently a hot research area, there are still many open challenges. Thus, it is of great interest to study privacy-preserving distributed statistical learning problems both practically and theoretically.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This work is supported by National Natural Science Foundation of China (No. 11971171), the 111 Project (B14019) and Project of National Social Science Fund of China (15BTJ027). Weidong Liu's research is supported by National Program on Key Basic Research Project (973 Program, 2018AAA0100704), National Natural Science Foundation of China (No. 11825104, 11690013), Youth Talent Support Program, and a grant from Australian Research Council. Hansheng Wang's research is partially supported by National Natural Science Foundation of China (No. 11831008, 11525101, 71532001). It is also supported in part by China's National Key Research Special Program (No. 2016YFC0207704).

Notes on contributors

Yuan Gao

Mr. Yuan Gao is a Ph.D. candidate in the School of Statistics at East China Normal University.

Weidong Liu

Dr. Weidong Liu is a Distinguished Professor in the School of Mathematical Sciences at Shanghai Jiao Tong University.

Hansheng Wang

Dr. Hansheng Wang is a professor in the Guanghua School of Management at Peking University.

Xiaozhou Wang

Dr. Xiaozhou Wang is an assistant professor in the School of Statistics at East China Normal University.

Yibo Yan

Mr. Yibo Yan is a Ph.D. candidate in the School of Statistics at East China Normal University.

Riquan Zhang

Dr. Riquan Zhang is a professor in the School of Statistics at East China Normal University.

References