Short Communications

Discussion on the paper ‘A review of distributed statistical inference’

Pages 108-110 | Received 11 Nov 2021, Accepted 20 Nov 2021, Published online: 16 Dec 2021

Abstract

Distributed statistical inference has attracted increasing attention in recent years with the emergence of massive data. We are grateful to the authors for their excellent review of the literature in this active area. Beyond the progress covered in the review, we would like to discuss some additional developments in this interesting area. Specifically, we focus on the balance between communication cost and statistical efficiency of divide-and-conquer (DC) type estimators in linear discriminant analysis and hypothesis testing. The DC approach behaves differently in these problems than it does in estimation problems. Furthermore, we discuss some issues concerning statistical inference under restricted communication budgets.

1. Linear discriminant analysis

Linear discriminant analysis (LDA) is a classical classification method (Anderson, 2003). For simplicity, we consider the two-sample problem, assuming that
$$X \sim N_p(\mu_1, \Sigma), \qquad Y \sim N_p(\mu_2, \Sigma),$$
where $\mu_i \in \mathbb{R}^p$, $i = 1, 2$, are the mean vectors with $\mu_1 \neq \mu_2$ and $\Sigma \in \mathbb{R}^{p \times p}$ is the covariance matrix. Furthermore, assume that an observation comes either from $X$ with probability $\pi_1$ or from $Y$ with probability $\pi_2$ such that $\pi_1 + \pi_2 = 1$. For a new observation $Z$, Fisher's linear discriminant rule is defined as follows:
$$\psi(Z) = 1\{(Z - \mu_a)^\top \Theta \mu_d > \log(\pi_1/\pi_2)\}, \tag{1}$$
where $\mu_a = (\mu_1 + \mu_2)/2$, $\mu_d = \mu_1 - \mu_2$, $\Theta = \Sigma^{-1}$ is the precision matrix, and $1\{\cdot\}$ is the indicator function. Suppose that $\{X_i,\ i = 1, \ldots, N_1\}$ and $\{Y_i,\ i = 1, \ldots, N_2\}$ are independently and identically distributed copies of $X$ and $Y$, respectively. Let $N = N_1 + N_2$ be the total sample size and suppose that $N > p$. For $i = 1, 2$, denote by $\hat\mu_i$ the sample means and by $\hat\Sigma_i$ the sample covariances computed from the $X_i$'s and $Y_i$'s, respectively. Then the estimators of $\mu_a$, $\mu_d$ and $\Theta$ can be defined, respectively, as
$$\hat\mu_a = (\hat\mu_1 + \hat\mu_2)/2, \qquad \hat\mu_d = \hat\mu_1 - \hat\mu_2, \qquad \hat\Theta = (\hat\Sigma_{\mathrm{pool}})^{-1},$$
where $\hat\Sigma_{\mathrm{pool}} = (N_1/N)\hat\Sigma_1 + (N_2/N)\hat\Sigma_2$ denotes the pooled sample covariance matrix. The empirical version of $\psi(Z)$, denoted by $\hat\psi(Z)$, is then obtained by plugging these estimators into (1).
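As a concrete illustration, the following minimal sketch (in Python with NumPy; the function name and the covariance conventions are ours) computes the plug-in rule $\hat\psi(Z)$ from training samples of the two classes. It only makes the construction above explicit and assumes $N > p$ so that the pooled sample covariance is invertible.

```python
import numpy as np

def fisher_lda_rule(Z, X, Y, log_prior_ratio=0.0):
    """Plug-in version of Fisher's rule (1).

    X, Y are (N1, p) and (N2, p) arrays of training observations from the two
    classes; Z is a new p-vector; log_prior_ratio is log(pi_1/pi_2), zero for
    equal priors.  Returns 1 if Z is assigned to the first class.
    """
    N1, N2 = X.shape[0], Y.shape[0]
    mu1_hat, mu2_hat = X.mean(axis=0), Y.mean(axis=0)
    mu_a = (mu1_hat + mu2_hat) / 2.0
    mu_d = mu1_hat - mu2_hat
    # Pooled sample covariance (N1/N) * Sigma1_hat + (N2/N) * Sigma2_hat,
    # with each class covariance computed under the 1/N_i convention.
    S1 = np.cov(X, rowvar=False, bias=True)
    S2 = np.cov(Y, rowvar=False, bias=True)
    Sigma_pool = (N1 * S1 + N2 * S2) / (N1 + N2)
    Theta_hat = np.linalg.inv(Sigma_pool)
    return int((Z - mu_a) @ Theta_hat @ mu_d > log_prior_ratio)
```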

In a distributed setting, one has a central machine (or hub) and many local machines. Suppose that the data are split randomly and evenly, and are stored on $K$ local machines. Denote by $\{X_i^{(k)},\ i = 1, \ldots, N_1/K\}$ and $\{Y_i^{(k)},\ i = 1, \ldots, N_2/K\}$ the samples from the two classes on the $k$-th local machine, $k = 1, \ldots, K$. Tian and Gu (2017) considered sparse LDA in the high-dimensional regime in the case $\pi_1 = \pi_2 = 1/2$, under the assumption that $\beta = \Theta\mu_d$ is a sparse vector. They proposed a one-shot estimator, which is communication efficient and attains the same convergence rate as the global estimator if $K = O(\sqrt{N/\log p}/\max\{s, s'\})$, where $s$ and $s'$ stand for the sparsity levels of certain parameters.

Li and Zhao (2021) considered distributed LDA without sparsity assumptions under the settings where $p/N \to 0$ and $Kp/N \to r \in [0, 1)$. Note that to compute $\hat\Sigma^{-1}$, one would need to transfer $p \times p$ matrices to the central machine, the communication cost of which can be expensive. Li and Zhao (2021) therefore proposed a two-round estimator and a one-shot estimator, defined as follows.

Denote by $\hat\mu_i^{(k)}$ the estimator of $\mu_i$ computed with the data on the $k$-th machine, for $i = 1, 2$ and $k = 1, \ldots, K$. The one-shot estimator uses the following decision rule:
$$\psi_{\mathrm{one}}(Z) = 1\Bigg\{ Z^\top\Bigg(K^{-1}\sum_{k=1}^{K}\hat\Theta^{(k)}\hat\mu_d^{(k)}\Bigg) - K^{-1}\sum_{k=1}^{K}\big(\hat\mu_a^{(k)}\big)^\top\hat\Theta^{(k)}\hat\mu_d^{(k)} > \log(N_1/N_2)\Bigg\}, \tag{2}$$
where $\hat\Theta^{(k)} = (\hat\Sigma_{\mathrm{pool}}^{(k)})^{-1}$ is the inverse of the pooled sample covariance matrix computed from the data on the $k$-th machine, $\hat\mu_a^{(k)} = (\hat\mu_1^{(k)} + \hat\mu_2^{(k)})/2$ and $\hat\mu_d^{(k)} = \hat\mu_1^{(k)} - \hat\mu_2^{(k)}$. Note that $\hat\Theta^{(k)}$ and $\hat\mu_i^{(k)}$ can be computed with the data on the $k$-th machine only, and that it suffices to transmit the vectors $\hat\Theta^{(k)}\hat\mu_d^{(k)} \in \mathbb{R}^p$ and the scalars $(\hat\mu_a^{(k)})^\top\hat\Theta^{(k)}\hat\mu_d^{(k)}$ for all $k$ to the hub. The two-round estimator is an improved version of $\psi_{\mathrm{one}}(Z)$, which replaces the local estimators $\hat\mu_a^{(k)}, \hat\mu_d^{(k)}$ in (2) by the global ones $\hat\mu_a, \hat\mu_d$ at the cost of an additional round of communication. Indeed, by transferring the $\hat\mu_i^{(k)}$'s to the central hub, we can obtain $\hat\mu_i = K^{-1}\sum_{k=1}^{K}\hat\mu_i^{(k)}$ and consequently $\hat\mu_a = (\hat\mu_1 + \hat\mu_2)/2$ and $\hat\mu_d = \hat\mu_1 - \hat\mu_2$.
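To make the communication pattern explicit, the following sketch (Python/NumPy; the helper names are ours, and the two rounds are simulated within a single process) implements the one-shot rule (2) and its two-round refinement. Only $p$-dimensional vectors, scalars and local class means are exchanged; the local $p \times p$ precision estimates never leave the machines.

```python
import numpy as np

def local_stats(X_k, Y_k):
    """Local class means and local precision estimate on one machine."""
    n1, n2 = X_k.shape[0], Y_k.shape[0]
    mu1_k, mu2_k = X_k.mean(axis=0), Y_k.mean(axis=0)
    S_pool_k = (n1 * np.cov(X_k, rowvar=False, bias=True)
                + n2 * np.cov(Y_k, rowvar=False, bias=True)) / (n1 + n2)
    return mu1_k, mu2_k, np.linalg.inv(S_pool_k)

def one_shot_rule(Z, locals_, N1, N2):
    """One-shot rule (2): each machine ships the p-vector Theta_k @ mu_d_k and
    the scalar mu_a_k' Theta_k mu_d_k; the hub only averages them."""
    vec = np.mean([Th @ (m1 - m2) for m1, m2, Th in locals_], axis=0)
    sca = np.mean([((m1 + m2) / 2) @ Th @ (m1 - m2) for m1, m2, Th in locals_])
    return int(Z @ vec - sca > np.log(N1 / N2))

def two_round_rule(Z, locals_, N1, N2):
    """Two-round variant: round 1 aggregates the local class means into the
    global mu_a, mu_d; in round 2 each machine would return Theta_k @ mu_d and
    mu_a' Theta_k mu_d (both rounds are simulated here in one process)."""
    mu1 = np.mean([m1 for m1, _, _ in locals_], axis=0)
    mu2 = np.mean([m2 for _, m2, _ in locals_], axis=0)
    mu_a, mu_d = (mu1 + mu2) / 2, mu1 - mu2
    vec = np.mean([Th @ mu_d for _, _, Th in locals_], axis=0)
    sca = np.mean([mu_a @ Th @ mu_d for _, _, Th in locals_])
    return int(Z @ vec - sca > np.log(N1 / N2))
```

Here `locals_` is the list of `local_stats` outputs from the $K$ machines.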

Li and Zhao (2021) compared the classification accuracy of the global estimator with those of the distributed ones. They showed that when $K = o(N/p)$, both the two-round estimator and the one-shot estimator can be as good as the global one under mild conditions. Moreover, they found that if $Kp/N \to r \in [0, 1)$ and $\pi_1 = \pi_2$, the two-round estimator can be as good as the global one, whereas the one-shot estimator is inferior to it. This is an interesting result, since when $Kp/N \to r > 0$, $\hat\Sigma_{\mathrm{pool}}^{(k)}$ is not a consistent estimator of $\Sigma$ by random matrix theory. Therefore, at the price of more communication, the two-round estimator achieves better statistical efficiency.

2. Hypothesis testing of the mean vectors

In this section, we discuss the DC approach for the one-sample testing problem in a distributed system. We observe that DC-type test statistics always incur a loss of power, which differs from the situation in point estimation, where the DC-type estimator can be as good as the global one.

Suppose that $X \in \mathbb{R}^p$ is a random vector with $E(X) = \mu$. For a given vector $\mu_0$, consider the hypothesis testing problem
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$$
Suppose that $X$ follows the normal distribution $N(\mu, \Sigma)$ with unknown covariance matrix $\Sigma$, and let $\{X_i,\ i = 1, \ldots, n\}$ be independent and identically distributed copies of $X$. In the setting $p < n$, the classical test statistic is Hotelling's $T^2$ (Anderson, 2003), defined as
$$T^2 = (n - 1)(\bar X - \mu_0)^\top\hat\Theta(\bar X - \mu_0),$$
where $\bar X$ denotes the sample mean and $\hat\Theta = (\hat\Sigma)^{-1}$ with $\hat\Sigma$ the sample covariance matrix. In high-dimensional cases with $p > n$, the sample covariance matrix is singular and the Hotelling $T^2$ statistic is not well defined. Many works extend the Hotelling $T^2$ to large or high-dimensional regimes (Bai & Saranadasa, 1996; Srivastava & Du, 2008; Wang et al., 2015, among others).
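For reference, here is a minimal NumPy/SciPy sketch of the Hotelling statistic and its exact F calibration in the Gaussian case with $p < n$ (the function name is ours; with the sample covariance under the $1/n$ convention, the $(n-1)$ factor above gives the same value as the $n$ factor used below with the $1/(n-1)$ convention).

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, mu0):
    """Hotelling's T^2 test of H0: mu = mu0 for an (n, p) Gaussian sample, p < n.

    Returns the statistic and the p-value based on the exact null law
    T^2 * (n - p) / (p * (n - 1)) ~ F(p, n - p).
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)        # sample covariance, 1/(n-1) convention
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)
    F = (n - p) / (p * (n - 1)) * T2
    return T2, stats.f.sf(F, p, n - p)
```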

Du and Zhao (2021) considered distributed versions of these test statistics. Specifically, based on the DC approach, they extended the Hotelling $T^2$ statistic to the setting $Kp/n \to r \in [0, 1)$, and the nonparametric test statistics of Wang et al. (2015) to high-dimensional settings. The ratio of the communication cost of computing the global test statistics to that of the distributed test statistics is of order $O(p^2)$ in the case $Kp/n \to r \in [0, 1)$, and of order $O(p)$ in the high-dimensional regimes.
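The exact constructions of Du and Zhao (2021) are not reproduced here; purely for illustration, the sketch below shows one natural DC-type statistic of this flavour, in which each machine ships a single standardised local Hotelling statistic to the hub (Python/NumPy; assumes Gaussian data and local sample sizes exceeding $p + 4$; not necessarily the authors' statistic).

```python
import numpy as np

def dc_hotelling(machines, mu0):
    """Combine local Hotelling statistics from K machines into one test.

    Each machine ships a single standardised scalar, so the communication
    cost is O(K) numbers rather than the O(K p^2) needed to rebuild the
    global statistic.  Illustrative construction only; assumes Gaussian
    data and n_k > p + 4 on every machine.
    """
    zs = []
    for X_k in machines:                          # X_k: (n_k, p) array
        n_k, p = X_k.shape
        xbar = X_k.mean(axis=0)
        S = np.cov(X_k, rowvar=False)
        T2 = n_k * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)
        F = (n_k - p) / (p * (n_k - 1)) * T2      # ~ F(p, n_k - p) under H0
        d1, d2 = p, n_k - p
        mean = d2 / (d2 - 2.0)
        var = 2.0 * d2**2 * (d1 + d2 - 2.0) / (d1 * (d2 - 2.0)**2 * (d2 - 4.0))
        zs.append((F - mean) / np.sqrt(var))      # standardised local statistic
    # Sum of K standardised statistics, rescaled; approximately N(0, 1)
    # under H0, rejecting for large values.
    return np.sum(zs) / np.sqrt(len(zs))
```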

They compared the power of the distributed statistics with that of the global ones, showing that the distributed test statistics are less efficient than the global ones whenever $K > 1$. Denote by $\beta_d(n)$ and $\beta_g(n)$ the powers of the distributed and global test statistics as functions of the sample size $n$, respectively, and define the relative efficiency as the ratio $n_g/n_d$ such that $\beta_d(n_d) = \beta_g(n_g)$. The asymptotic relative efficiencies of the distributed test statistics are of order $1/\sqrt{K}$.

Hence, the story of the DC approach in the testing problem above is quite different from that of point estimation, where the mean squared error (MSE) of DC estimators can be as good as that of the global ones (Lee et al., 2017; Volgushev et al., 2019; Zhang et al., 2013, among others). On the other hand, Shi et al. (2018) and Banerjee et al. (2019) showed that, in some nonstandard problems, DC estimators converge at a rate much faster than the global ones. These results illustrate the different behaviours of the DC approach across statistical inference problems.

3. Statistical inferences under a restricted communication budget

As discussed above, the DC method is communication efficient compared with the global one, but the statistical efficiency of DC estimators is inferior to that of the global ones in many cases. To improve the efficiency of DC estimators, several iterative methods have been proposed in the literature at the price of higher communication costs. This leads to an interesting problem: how to carry out statistical inference under a given communication budget.

For distributed mean estimation, Garg et al. (2014) proved bounds on the number of bits of communication required to achieve the minimax squared-error loss. Zhang et al. (2013) and Braverman et al. (2016) derived the minimax rate for estimating the mean vector under a restricted communication cost. Cai and Wei (2020) discussed the estimation of the mean vector of a Gaussian distribution under a restricted communication budget.
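As a toy illustration of the trade-off (not the procedure of any of the papers cited above), the sketch below lets each of $K$ machines send its local sample mean quantised to a fixed number of bits per coordinate over a known range $[-B, B]$, so the total budget is $Kpb$ bits; shrinking $b$ saves communication at the price of extra distortion.

```python
import numpy as np

def quantize(v, bits, B):
    """Uniformly quantise each coordinate of v in [-B, B] to `bits` bits."""
    levels = 2 ** bits
    clipped = np.clip(v, -B, B)
    codes = np.round((clipped + B) / (2 * B) * (levels - 1))  # integers actually transmitted
    return codes / (levels - 1) * (2 * B) - B                 # dequantised values at the hub

def budgeted_mean(machines, bits=4, B=5.0):
    """Toy distributed mean estimate under a budget of K * p * bits bits.

    Each machine quantises its local sample mean and the hub averages the
    dequantised vectors.  This only illustrates the budget/accuracy
    trade-off; it is not a minimax-optimal scheme.
    """
    return np.mean([quantize(X_k.mean(axis=0), bits, B) for X_k in machines], axis=0)
```

With a generous bit budget the estimator approaches the full-data sample mean; the works cited above study precisely this kind of trade-off in a minimax framework.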

However, how to handle statistical problems under a restricted communication budget in other settings remains an interesting direction for future work. For example, for the hypothesis testing problem discussed in Section 2, how to design test statistics that achieve good statistical efficiency under a given communication budget needs further investigation.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

References