Short Communications

Discussion on the paper ‘A review of distributed statistical inference’

Pages 108-110 | Received 11 Nov 2021, Accepted 20 Nov 2021, Published online: 16 Dec 2021

Abstract

Distributed statistical inference has attracted increasing attention in recent years with the emergence of massive data. We are grateful to the authors for their excellent review of the literature in this active area. Beyond the progress covered in the review, we would like to discuss some additional developments in this interesting area. Specifically, we focus on the balance between communication cost and statistical efficiency of divide-and-conquer (DC) type estimators in linear discriminant analysis and hypothesis testing. The DC approach behaves differently in these problems than it does in estimation problems. Furthermore, we discuss some issues concerning statistical inference under restricted communication budgets.

1. Linear discriminant analysis

Linear discriminant analysis (LDA) is a classical classification method (Anderson, 2003). For simplicity, we consider the two-sample problem, assuming that
$$X \sim N_p(\mu_1, \Sigma), \qquad Y \sim N_p(\mu_2, \Sigma),$$
where $\mu_i \in \mathbb{R}^p$, $i = 1, 2$, are the mean vectors with $\mu_1 \neq \mu_2$ and $\Sigma \in \mathbb{R}^{p \times p}$ is the covariance matrix. Furthermore, assume that an observation comes either from $X$ with probability $\pi_1$ or from $Y$ with probability $\pi_2$ such that $\pi_1 + \pi_2 = 1$. For a new observation $Z$, Fisher's linear discriminant rule is defined as follows:
$$\psi(Z) = 1\{(Z - \mu_a)^\top \Theta \mu_d > \log(\pi_1/\pi_2)\}, \tag{1}$$
where $\mu_a = (\mu_1 + \mu_2)/2$, $\mu_d = \mu_1 - \mu_2$, $\Theta = \Sigma^{-1}$ is the precision matrix, and $1\{\cdot\}$ is the indicator function. Suppose that $\{X_i,\ i = 1, \ldots, N_1\}$ and $\{Y_i,\ i = 1, \ldots, N_2\}$ are independently and identically distributed copies of $X$ and $Y$, respectively. Let $N = N_1 + N_2$ be the total sample size and suppose that $N > p$. For $i = 1, 2$, denote by $\hat\mu_i$ the sample means and by $\hat\Sigma_i$ the sample covariances computed from the $X_i$'s and $Y_i$'s, respectively. Then the estimators of $\mu_a$, $\mu_d$ and $\Theta$ can be defined, respectively, as
$$\hat\mu_a = (\hat\mu_1 + \hat\mu_2)/2, \qquad \hat\mu_d = \hat\mu_1 - \hat\mu_2, \qquad \hat\Theta = (\hat\Sigma_{\mathrm{pool}})^{-1},$$
where $\hat\Sigma_{\mathrm{pool}} = (N_1/N)\hat\Sigma_1 + (N_2/N)\hat\Sigma_2$ denotes the pooled sample covariance matrix. The empirical version of $\psi(Z)$, denoted by $\hat\psi(Z)$, is then obtained by plugging these estimators into (1).
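As a concrete illustration, the following minimal sketch (in Python with NumPy; the function name and the covariance conventions are ours) computes the plug-in rule $\hat\psi(Z)$ from training samples of the two classes. It only makes the construction above explicit and assumes $N > p$ so that the pooled sample covariance is invertible.

```python
import numpy as np

def fisher_lda_rule(Z, X, Y, log_prior_ratio=0.0):
    """Plug-in version of Fisher's rule (1).

    X, Y are (N1, p) and (N2, p) arrays of training observations from the two
    classes; Z is a new p-vector; log_prior_ratio is log(pi_1/pi_2), zero for
    equal priors.  Returns 1 if Z is assigned to the first class.
    """
    N1, N2 = X.shape[0], Y.shape[0]
    mu1_hat, mu2_hat = X.mean(axis=0), Y.mean(axis=0)
    mu_a = (mu1_hat + mu2_hat) / 2.0
    mu_d = mu1_hat - mu2_hat
    # Pooled sample covariance (N1/N) * Sigma1_hat + (N2/N) * Sigma2_hat,
    # with each class covariance computed under the 1/N_i convention.
    S1 = np.cov(X, rowvar=False, bias=True)
    S2 = np.cov(Y, rowvar=False, bias=True)
    Sigma_pool = (N1 * S1 + N2 * S2) / (N1 + N2)
    Theta_hat = np.linalg.inv(Sigma_pool)
    return int((Z - mu_a) @ Theta_hat @ mu_d > log_prior_ratio)
```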

In a distributed setting, one has a central machine (or hub) and many local machines. Suppose that the data are split randomly and evenly, and are stored on $K$ local machines. Denote by $\{X_i^{(k)},\ i = 1, \ldots, N_1/K\}$ and $\{Y_i^{(k)},\ i = 1, \ldots, N_2/K\}$ the samples from the two classes on the $k$-th local machine, $k = 1, \ldots, K$. Tian and Gu (2017) considered sparse LDA in the high-dimensional regime in the case $\pi_1 = \pi_2 = 1/2$, under the assumption that $\beta = \Theta\mu_d$ is a sparse vector. They proposed a one-shot estimator, which is communication efficient and attains the same convergence rate as the global estimator if $K = O(\sqrt{N/\log p}/\max\{s, s'\})$, where $s$ and $s'$ stand for the sparsity levels of certain parameters.

Li and Zhao (2021) considered distributed LDA without sparsity assumptions under the settings where $p/N \to 0$ and $Kp/N \to r \in [0, 1)$. Note that to compute $\hat\Sigma^{-1}$, one would need to transfer $p \times p$ matrices to the central machine, the communication cost of which can be expensive. Li and Zhao (2021) therefore proposed a two-round estimator and a one-shot estimator, defined as follows.

Denote by $\hat\mu_i^{(k)}$ the estimator of $\mu_i$ computed with the data on the $k$-th machine, for $i = 1, 2$ and $k = 1, \ldots, K$. The one-shot estimator uses the following decision rule:
$$\psi_{\mathrm{one}}(Z) = 1\Bigg\{ Z^\top\Bigg(K^{-1}\sum_{k=1}^{K}\hat\Theta^{(k)}\hat\mu_d^{(k)}\Bigg) - K^{-1}\sum_{k=1}^{K}\big(\hat\mu_a^{(k)}\big)^\top\hat\Theta^{(k)}\hat\mu_d^{(k)} > \log(N_1/N_2)\Bigg\}, \tag{2}$$
where $\hat\Theta^{(k)} = (\hat\Sigma_{\mathrm{pool}}^{(k)})^{-1}$ is the inverse of the pooled sample covariance matrix computed from the data on the $k$-th machine, $\hat\mu_a^{(k)} = (\hat\mu_1^{(k)} + \hat\mu_2^{(k)})/2$ and $\hat\mu_d^{(k)} = \hat\mu_1^{(k)} - \hat\mu_2^{(k)}$. Note that $\hat\Theta^{(k)}$ and $\hat\mu_i^{(k)}$ can be computed with the data on the $k$-th machine only, and that it suffices to transmit the vectors $\hat\Theta^{(k)}\hat\mu_d^{(k)} \in \mathbb{R}^p$ and the scalars $(\hat\mu_a^{(k)})^\top\hat\Theta^{(k)}\hat\mu_d^{(k)}$ for all $k$ to the hub. The two-round estimator is an improved version of $\psi_{\mathrm{one}}(Z)$, which replaces the local estimators $\hat\mu_a^{(k)}, \hat\mu_d^{(k)}$ in (2) by the global ones $\hat\mu_a, \hat\mu_d$ at the cost of an additional round of communication. Indeed, by transferring the $\hat\mu_i^{(k)}$'s to the central hub, we can obtain $\hat\mu_i = K^{-1}\sum_{k=1}^{K}\hat\mu_i^{(k)}$ and consequently $\hat\mu_a = (\hat\mu_1 + \hat\mu_2)/2$ and $\hat\mu_d = \hat\mu_1 - \hat\mu_2$.
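To make the communication pattern explicit, the following sketch (Python/NumPy; the helper names are ours, and the two rounds are simulated within a single process) implements the one-shot rule (2) and its two-round refinement. Only $p$-dimensional vectors, scalars and local class means are exchanged; the local $p \times p$ precision estimates never leave the machines.

```python
import numpy as np

def local_stats(X_k, Y_k):
    """Local class means and local precision estimate on one machine."""
    n1, n2 = X_k.shape[0], Y_k.shape[0]
    mu1_k, mu2_k = X_k.mean(axis=0), Y_k.mean(axis=0)
    S_pool_k = (n1 * np.cov(X_k, rowvar=False, bias=True)
                + n2 * np.cov(Y_k, rowvar=False, bias=True)) / (n1 + n2)
    return mu1_k, mu2_k, np.linalg.inv(S_pool_k)

def one_shot_rule(Z, locals_, N1, N2):
    """One-shot rule (2): each machine ships the p-vector Theta_k @ mu_d_k and
    the scalar mu_a_k' Theta_k mu_d_k; the hub only averages them."""
    vec = np.mean([Th @ (m1 - m2) for m1, m2, Th in locals_], axis=0)
    sca = np.mean([((m1 + m2) / 2) @ Th @ (m1 - m2) for m1, m2, Th in locals_])
    return int(Z @ vec - sca > np.log(N1 / N2))

def two_round_rule(Z, locals_, N1, N2):
    """Two-round variant: round 1 aggregates the local class means into the
    global mu_a, mu_d; in round 2 each machine would return Theta_k @ mu_d and
    mu_a' Theta_k mu_d (both rounds are simulated here in one process)."""
    mu1 = np.mean([m1 for m1, _, _ in locals_], axis=0)
    mu2 = np.mean([m2 for _, m2, _ in locals_], axis=0)
    mu_a, mu_d = (mu1 + mu2) / 2, mu1 - mu2
    vec = np.mean([Th @ mu_d for _, _, Th in locals_], axis=0)
    sca = np.mean([mu_a @ Th @ mu_d for _, _, Th in locals_])
    return int(Z @ vec - sca > np.log(N1 / N2))
```

Here `locals_` is the list of `local_stats` outputs from the $K$ machines.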

Li and Zhao (2021) compared the classification accuracy of the global estimator with those of the distributed ones. They showed that when $K = o(N/p)$, both the two-round estimator and the one-shot estimator can be as good as the global one under mild conditions. Moreover, they found that if $Kp/N \to r \in [0, 1)$ and $\pi_1 = \pi_2$, the two-round estimator can be as good as the global one, whereas the one-shot estimator is inferior to it. This is an interesting result, since when $Kp/N \to r > 0$, $\hat\Sigma_{\mathrm{pool}}^{(k)}$ is not a consistent estimator of $\Sigma$ by random matrix theory. Therefore, at the price of more communication, the two-round estimator achieves better statistical efficiency.

2. Hypothesis testing of the mean vectors

In this section, we discuss the DC approach for the one-sample testing problem in a distributed system. We observe that DC-type test statistics always incur a loss of power, which differs from the situation in point estimation, where the DC-type estimator can be as good as the global one.

Suppose that $X \in \mathbb{R}^p$ is a random vector with $E(X) = \mu$. For a given vector $\mu_0$, consider the hypothesis testing problem
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$$
Suppose that $X$ follows the normal distribution $N(\mu, \Sigma)$ with unknown covariance matrix $\Sigma$, and let $\{X_i,\ i = 1, \ldots, n\}$ be independent and identically distributed copies of $X$. In the setting $p < n$, the classical test statistic is Hotelling's $T^2$ (Anderson, 2003), defined as
$$T^2 = (n - 1)(\bar X - \mu_0)^\top\hat\Theta(\bar X - \mu_0),$$
where $\bar X$ denotes the sample mean and $\hat\Theta = (\hat\Sigma)^{-1}$ with $\hat\Sigma$ the sample covariance matrix. In high-dimensional cases with $p > n$, the sample covariance matrix is singular and the Hotelling $T^2$ statistic is not well defined. Many works extend the Hotelling $T^2$ to large or high-dimensional regimes (Bai & Saranadasa, 1996; Srivastava & Du, 2008; Wang et al., 2015, among others).
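For reference, here is a minimal NumPy/SciPy sketch of the Hotelling statistic and its exact F calibration in the Gaussian case with $p < n$ (the function name is ours; with the sample covariance under the $1/n$ convention, the $(n-1)$ factor above gives the same value as the $n$ factor used below with the $1/(n-1)$ convention).

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, mu0):
    """Hotelling's T^2 test of H0: mu = mu0 for an (n, p) Gaussian sample, p < n.

    Returns the statistic and the p-value based on the exact null law
    T^2 * (n - p) / (p * (n - 1)) ~ F(p, n - p).
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)        # sample covariance, 1/(n-1) convention
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)
    F = (n - p) / (p * (n - 1)) * T2
    return T2, stats.f.sf(F, p, n - p)
```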

Du and Zhao (2021) considered distributed versions of these test statistics. Specifically, based on the DC approach, they extended the Hotelling $T^2$ statistic to the setting $Kp/n \to r \in [0, 1)$, and the nonparametric test statistics of Wang et al. (2015) to high-dimensional settings. The ratio of the communication cost of computing the global test statistics to that of the distributed test statistics is of order $O(p^2)$ in the case $Kp/n \to r \in [0, 1)$, and of order $O(p)$ in the high-dimensional regimes.
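The exact constructions of Du and Zhao (2021) are not reproduced here; purely for illustration, the sketch below shows one natural DC-type statistic of this flavour, in which each machine ships a single standardised local Hotelling statistic to the hub (Python/NumPy; assumes Gaussian data and local sample sizes exceeding $p + 4$; not necessarily the authors' statistic).

```python
import numpy as np

def dc_hotelling(machines, mu0):
    """Combine local Hotelling statistics from K machines into one test.

    Each machine ships a single standardised scalar, so the communication
    cost is O(K) numbers rather than the O(K p^2) needed to rebuild the
    global statistic.  Illustrative construction only; assumes Gaussian
    data and n_k > p + 4 on every machine.
    """
    zs = []
    for X_k in machines:                          # X_k: (n_k, p) array
        n_k, p = X_k.shape
        xbar = X_k.mean(axis=0)
        S = np.cov(X_k, rowvar=False)
        T2 = n_k * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)
        F = (n_k - p) / (p * (n_k - 1)) * T2      # ~ F(p, n_k - p) under H0
        d1, d2 = p, n_k - p
        mean = d2 / (d2 - 2.0)
        var = 2.0 * d2**2 * (d1 + d2 - 2.0) / (d1 * (d2 - 2.0)**2 * (d2 - 4.0))
        zs.append((F - mean) / np.sqrt(var))      # standardised local statistic
    # Sum of K standardised statistics, rescaled; approximately N(0, 1)
    # under H0, rejecting for large values.
    return np.sum(zs) / np.sqrt(len(zs))
```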

They compared the power of the distributed statistics with that of the global ones, showing that the distributed test statistics are less efficient than the global ones whenever $K > 1$. Denote by $\beta_d(n)$ and $\beta_g(n)$ the powers of the distributed and global test statistics as functions of the sample size $n$, respectively, and define the relative efficiency as the ratio $n_g/n_d$ such that $\beta_d(n_d) = \beta_g(n_g)$. The asymptotic relative efficiencies of the distributed test statistics are of order $1/\sqrt{K}$.

Hence, the story of the DC approach in the testing problem above is quite different from that of point estimation, where the mean squared error (MSE) of DC estimators can be as good as that of the global ones (Lee et al., 2017; Volgushev et al., 2019; Zhang et al., 2013, among others). On the other hand, Shi et al. (2018) and Banerjee et al. (2019) showed that, in some nonstandard problems, DC estimators converge at a rate much faster than the global ones. These results illustrate the different behaviours of the DC approach across statistical inference problems.

3. Statistical inferences under a restricted communication budget

As discussed above, the DC method is communication efficient compared with the global one, but the statistical efficiency of DC estimators is inferior to that of the global ones in many cases. To improve the efficiency of DC estimators, several iterative methods have been proposed in the literature at the price of higher communication costs. This leads to an interesting problem: how to carry out statistical inference under a given communication budget.

For distributed mean estimation, Garg et al. (2014) proved bounds on the number of bits of communication required to achieve the minimax squared-error loss. Zhang et al. (2013) and Braverman et al. (2016) derived the minimax rate for estimating the mean vector under a restricted communication cost. Cai and Wei (2020) discussed the estimation of the mean vector of a Gaussian distribution under a restricted communication budget.
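As a toy illustration of the trade-off (not the procedure of any of the papers cited above), the sketch below lets each of $K$ machines send its local sample mean quantised to a fixed number of bits per coordinate over a known range $[-B, B]$, so the total budget is $Kpb$ bits; shrinking $b$ saves communication at the price of extra distortion.

```python
import numpy as np

def quantize(v, bits, B):
    """Uniformly quantise each coordinate of v in [-B, B] to `bits` bits."""
    levels = 2 ** bits
    clipped = np.clip(v, -B, B)
    codes = np.round((clipped + B) / (2 * B) * (levels - 1))  # integers actually transmitted
    return codes / (levels - 1) * (2 * B) - B                 # dequantised values at the hub

def budgeted_mean(machines, bits=4, B=5.0):
    """Toy distributed mean estimate under a budget of K * p * bits bits.

    Each machine quantises its local sample mean and the hub averages the
    dequantised vectors.  This only illustrates the budget/accuracy
    trade-off; it is not a minimax-optimal scheme.
    """
    return np.mean([quantize(X_k.mean(axis=0), bits, B) for X_k in machines], axis=0)
```

With a generous bit budget the estimator approaches the full-data sample mean; the works cited above study precisely this kind of trade-off in a minimax framework.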

However, how to handle statistical problems under a restricted communication budget in other settings remains an interesting direction for future work. For example, for the hypothesis testing problem discussed in Section 2, how to design test statistics that achieve good statistical efficiency under a given communication budget needs further investigation.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

References