Abstract
Distributed statistical inference has attracted increasing attention in recent years with the emergence of massive data. We are grateful to the authors for the excellent review of the literature in this active area. Beyond the progress mentioned by the authors, we would like to discuss some additional developments in this interesting area. Specifically, we focus on the balance between the communication cost and the statistical efficiency of divide-and-conquer (DC) type estimators in linear discriminant analysis and hypothesis testing. The DC approach behaves differently in these problems than it does in estimation problems. Furthermore, we discuss some issues concerning statistical inference under restricted communication budgets.
1. Linear discriminant analysis
Linear discriminant analysis (LDA) is a classical classification method (Anderson, Citation2003). For simplicity, we consider the two-sample problem, assuming that $X \sim N_p(\mu_1, \Sigma)$ and $Y \sim N_p(\mu_2, \Sigma)$, where $\mu_1, \mu_2 \in \mathbb{R}^p$ are the mean vectors with $\mu_1 \neq \mu_2$, and $\Sigma$ is the common covariance matrix. Furthermore, assume that observations come either from X with probability $\pi_1$ or from Y with probability $\pi_2$ such that $\pi_1 + \pi_2 = 1$. For a new observation Z, Fisher's linear discriminant rule is defined as follows:
(1) $\mathcal{D}(Z) = I\left\{(Z - \mu)^{\top} \Theta \delta + \log(\pi_1/\pi_2) > 0\right\},$
where $\mu = (\mu_1 + \mu_2)/2$, $\delta = \mu_1 - \mu_2$, $\Theta = \Sigma^{-1}$ represents the precision matrix, and $I(\cdot)$ is the indicator function. Suppose that $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are independently and identically distributed copies of X and Y, respectively. Let $N = n_1 + n_2$ be the total sample size and suppose that N>p. For i = 1, 2, denote $\hat{\mu}_i$ as the sample means and $\hat{\Sigma}_i$ as the sample covariances using the observations $X_j$'s and $Y_j$'s, respectively. Then the estimators of $\mu$, $\delta$, $\pi_i$ and $\Theta$ can be defined respectively as follows:
$\hat{\mu} = (\hat{\mu}_1 + \hat{\mu}_2)/2, \quad \hat{\delta} = \hat{\mu}_1 - \hat{\mu}_2, \quad \hat{\pi}_i = n_i/N, \quad \hat{\Theta} = \hat{\Sigma}^{-1},$
where $\hat{\Sigma} = \{(n_1 - 1)\hat{\Sigma}_1 + (n_2 - 1)\hat{\Sigma}_2\}/(N - 2)$ denotes the pooled sample covariance matrix. Then the empirical version of $\mathcal{D}(Z)$, denoted as $\hat{\mathcal{D}}(Z)$, can be derived by plugging the above estimators into (1).
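As a concrete illustration, the plug-in rule above can be sketched in a few lines of NumPy. The simulated data, sample sizes and variable names below are our own, not taken from any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 5, 200, 300
mu1, mu2 = np.zeros(p), np.full(p, 0.8)

# Simulated training samples from the two classes (common covariance I_p).
X = rng.normal(size=(n1, p)) + mu1
Y = rng.normal(size=(n2, p)) + mu2

# Plug-in estimators: sample means, pooled covariance, precision matrix.
mu1_hat, mu2_hat = X.mean(axis=0), Y.mean(axis=0)
N = n1 + n2
Sigma_hat = ((n1 - 1) * np.cov(X, rowvar=False)
             + (n2 - 1) * np.cov(Y, rowvar=False)) / (N - 2)
Theta_hat = np.linalg.inv(Sigma_hat)

mu_hat = (mu1_hat + mu2_hat) / 2      # midpoint of the class means
delta_hat = mu1_hat - mu2_hat         # estimated mean difference
pi1_hat, pi2_hat = n1 / N, n2 / N     # estimated class proportions

def lda_rule(z):
    """Empirical Fisher rule (1): 1 = assign z to the X class, 0 = Y class."""
    score = (z - mu_hat) @ Theta_hat @ delta_hat + np.log(pi1_hat / pi2_hat)
    return int(score > 0)

# A point at mu1 should be assigned to the X class, a point at mu2 to the Y class.
print(lda_rule(mu1), lda_rule(mu2))
```

Note that `np.cov(..., rowvar=False)` already uses the $(n_i - 1)$ denominator, so multiplying by $(n_i - 1)$ recovers the scatter matrices that the pooled estimator combines.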
In a distributed setting, one has a central machine (or hub) and many local machines. Suppose that the data are split randomly and evenly, and are stored on K local machines. Denote by $\{X_j^{(k)}\}$ and $\{Y_j^{(k)}\}$ the samples from the two classes on the kth local machine, $k = 1, \ldots, K$. Tian and Gu (Citation2017) considered sparse LDA in the high dimensional regime where p may exceed N, under the assumption that $\Theta\delta$ is a sparse vector. They proposed a one-shot estimator, which is communication efficient and attains the same convergence rate as the global estimator provided that the number of machines does not grow too fast, where the precise condition depends on the sparsity levels of the underlying parameters.
Li and Zhao (Citation2021) considered distributed LDA without sparsity assumptions under settings where p may diverge with N and N>p. Note that to compute $\hat{\Theta}$, one needs to transfer $p \times p$ matrices to the central machine, of which the communication cost can be expensive. Li and Zhao (Citation2021) proposed a two-round estimator and a one-shot estimator, defined as follows.
Denote by $\hat{\mu}_i^{(k)}$, $\hat{\mu}^{(k)}$ and $\hat{\delta}^{(k)}$ the estimators of $\mu_i$, $\mu$ and $\delta$ with the data at the kth machine, for $k = 1, \ldots, K$ and $i = 1, 2$. The one-shot estimator considers the following decision rule,
(2) $\hat{\mathcal{D}}_{\mathrm{os}}(Z) = I\Big\{\frac{1}{K}\sum_{k=1}^{K} (Z - \hat{\mu}^{(k)})^{\top} (\hat{\Sigma}^{(k)})^{-1} \hat{\delta}^{(k)} + \log(\hat{\pi}_1/\hat{\pi}_2) > 0\Big\},$
where $\hat{\Sigma}^{(k)}$ is the pooled sample covariance matrix using the data at the kth machine. Note that $(\hat{\Sigma}^{(k)})^{-1}\hat{\delta}^{(k)}$ and $(\hat{\mu}^{(k)})^{\top}(\hat{\Sigma}^{(k)})^{-1}\hat{\delta}^{(k)}$ can be computed with the data only at the kth machine, and that it is sufficient to transmit the vectors $(\hat{\Sigma}^{(k)})^{-1}\hat{\delta}^{(k)}$ and the scalars $(\hat{\mu}^{(k)})^{\top}(\hat{\Sigma}^{(k)})^{-1}\hat{\delta}^{(k)}$ for all k to the hub. The two-round estimator is an improved version of the one-shot estimator, replacing the local estimators $\hat{\mu}^{(k)}$ and $\hat{\delta}^{(k)}$ in (2) by the global ones $\hat{\mu}$ and $\hat{\delta}$ with an additional round of communication. In fact, by transferring the $\hat{\mu}_i^{(k)}$'s to the central hub, we can obtain the $\hat{\mu}_i$'s and consequently $\hat{\mu}$ and $\hat{\delta}$.
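The two decision rules can be sketched schematically as follows. This is a minimal illustration under our own simulation settings; the variable names are ours, and the exact normalization in Li and Zhao (Citation2021) may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, n1, n2 = 4, 5, 100, 100          # n1, n2: per-machine class sample sizes
mu1, mu2 = np.zeros(p), np.full(p, 1.0)

# Data held locally: machine k stores (X_k, Y_k) and never ships raw data.
machines = [(rng.normal(size=(n1, p)) + mu1,
             rng.normal(size=(n2, p)) + mu2) for _ in range(K)]

def local_summaries(X, Y):
    """Quantities computable from one machine's data alone."""
    m1, m2 = X.mean(axis=0), Y.mean(axis=0)
    S = ((n1 - 1) * np.cov(X, rowvar=False)
         + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, m1 - m2)    # p-vector: (Sigma_k)^{-1} delta_k
    c = (m1 + m2) / 2 @ w              # scalar: mu_k^T (Sigma_k)^{-1} delta_k
    return m1, m2, w, c

summ = [local_summaries(X, Y) for X, Y in machines]
log_prior = np.log(n1 / n2)            # log(pi1_hat / pi2_hat)

# One-shot rule: each machine sends only the vector w_k and the scalar c_k.
def one_shot(z):
    score = np.mean([z @ w - c for _, _, w, c in summ]) + log_prior
    return int(score > 0)

# Two-round rule: a first round pools the local means into global mu_hat
# and delta_hat, which then replace the local ones in the scores.
mu1_hat = np.mean([m1 for m1, _, _, _ in summ], axis=0)
mu2_hat = np.mean([m2 for _, m2, _, _ in summ], axis=0)
mu_hat, delta_hat = (mu1_hat + mu2_hat) / 2, mu1_hat - mu2_hat

def two_round(z):
    scores = []
    for X, Y in machines:
        S = ((n1 - 1) * np.cov(X, rowvar=False)
             + (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
        scores.append((z - mu_hat) @ np.linalg.solve(S, delta_hat))
    return int(np.mean(scores) + log_prior > 0)

print(one_shot(mu1), one_shot(mu2), two_round(mu1), two_round(mu2))
```

The identity $(z - \hat{\mu}^{(k)})^{\top}(\hat{\Sigma}^{(k)})^{-1}\hat{\delta}^{(k)} = z^{\top}w_k - c_k$ is what makes the one-shot rule communication efficient: only a p-vector and a scalar travel per machine, never a $p \times p$ matrix.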
Li and Zhao (Citation2021) compared the classification accuracy of the global estimator with those of the distributed ones. They showed that when the dimension p is negligible relative to the local sample size, both the two-round estimator and the one-shot estimator can be as good as the global one under mild conditions. Moreover, they found that if p grows proportionally to the local sample size, the two-round estimator can still be as good as the global one, but the one-shot estimator is inferior to the global one. This is an interesting result, since when p is proportional to the local sample size, $\hat{\Sigma}^{(k)}$ is not a consistent estimator of $\Sigma$ by the random matrix theory. Therefore, at the price of more communication cost, the two-round estimator achieves better statistical efficiency.
2. Hypothesis testing of the mean vectors
In this section, we discuss the DC approach for the one-sample testing problem in a distributed system. We observe that DC type test statistics always incur a loss of power, in contrast to point estimation, where the DC type estimator can be as good as the global one.
Suppose that $X \in \mathbb{R}^p$ is a random vector with mean $\mu = E(X)$. For a given vector $\mu_0$, consider the hypothesis testing problem
$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$
Suppose that X follows the normal distribution $N_p(\mu, \Sigma)$ with unknown covariance matrix $\Sigma$. Let $X_1, \ldots, X_n$ be independent and identically distributed copies of X. In the setting of p<n, the classical test statistic is Hotelling's $T^2$ (Anderson, Citation2003), defined as follows,
$T^2 = n(\bar{X} - \mu_0)^{\top} S^{-1} (\bar{X} - \mu_0),$
where $\bar{X} = n^{-1}\sum_{j=1}^{n} X_j$ denotes the sample mean and $S^{-1}$ is the inverse of the sample covariance matrix $S = (n-1)^{-1}\sum_{j=1}^{n}(X_j - \bar{X})(X_j - \bar{X})^{\top}$. In high dimensional cases with p>n, the sample covariance matrix is singular and the Hotelling $T^2$ test statistic is not well defined. Many works have been developed to extend the Hotelling $T^2$ to large or high dimensional regimes (Bai & Saranadasa, Citation1996; Srivastava & Du, Citation2008; Wang et al., Citation2015, etc.).
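Hotelling's $T^2$ is straightforward to compute when p<n. The NumPy sketch below (simulation settings are our own) contrasts its value under the null with its value under a mean shift.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
mu0 = np.zeros(p)

def hotelling_T2(X, mu0):
    """Hotelling's T^2 = n (Xbar - mu0)^T S^{-1} (Xbar - mu0); requires p < n."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)       # sample covariance, (n - 1) denominator
    return n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)

X_null = rng.normal(size=(n, p))           # H0 true: mean is mu0 = 0
X_alt = rng.normal(size=(n, p)) + 0.5      # every coordinate shifted by 0.5

T2_null = hotelling_T2(X_null, mu0)        # moderate value under H0
T2_alt = hotelling_T2(X_alt, mu0)          # large value under the shift
print(T2_null, T2_alt)
```

Under $H_0$, $(n - p)T^2/\{p(n - 1)\}$ follows an $F(p, n - p)$ distribution, which is what a calibrated test would compare against.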
Du and Zhao (Citation2021) considered distributed versions of these test statistics. Specifically, based on the DC approach, they extended the Hotelling $T^2$ statistic in the setting p<n, as well as the nonparametric test statistic of Wang et al. (Citation2015) for high dimensional settings. In both regimes, the ratio of the communication cost of deriving the global test statistics over that of the distributed ones grows polynomially in the dimension p, since the global statistics require transferring $p \times p$ matrices while the distributed ones only require low dimensional local summaries.
They compared the power of the distributed statistics with that of the global ones, showing that the distributed test statistics are less efficient than the global ones whenever K>1. Denote by $\beta_D(n)$ and $\beta_G(n)$ the powers of the distributed and global test statistics as functions of the sample size n, respectively, and define the relative efficiency as the factor r such that $\beta_D(rn)$ asymptotically matches $\beta_G(n)$. The asymptotic relative efficiency of the distributed test statistics deteriorates as the number of machines K grows.
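The power loss can be illustrated with a toy Monte Carlo experiment. This is our own construction, not the setting of Du and Zhao (Citation2021): in one dimension with known unit variance, we compare the global statistic $N\bar{X}^2$ with the naive DC statistic that averages the local statistics $n\bar{X}_{(k)}^2$, calibrating both tests at the same empirical level.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, reps = 4096, 64, 2000
n = N // K                       # local sample size on each machine
h = 3.0                          # local alternative: mu = h / sqrt(N)

def simulate(mu):
    """Global and DC statistics for `reps` simulated data sets of size N."""
    X = rng.normal(loc=mu, size=(reps, K, n))
    local = n * X.mean(axis=2) ** 2          # n * Xbar_k^2 on each machine
    dc = local.mean(axis=1)                  # DC statistic: average of local stats
    glob = N * X.mean(axis=(1, 2)) ** 2      # global statistic: N * Xbar^2
    return glob, dc

g0, d0 = simulate(0.0)                       # null: mu = 0
g1, d1 = simulate(h / np.sqrt(N))            # local alternative

# Calibrate both tests at empirical level 5% and compare power.
crit_g, crit_d = np.quantile(g0, 0.95), np.quantile(d0, 0.95)
power_g, power_d = (g1 > crit_g).mean(), (d1 > crit_d).mean()
print(power_g, power_d)                      # global power clearly exceeds DC power
```

Intuitively, each machine sees only a fraction $1/K$ of the signal in its quadratic statistic, and averaging the K local values does not recover it, which is exactly the efficiency loss absent in linear (mean-type) DC estimation.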
Hence, the story of the DC approach in the hypothesis testing problem above is quite different from that of point estimation, where the mean squared error (MSE) of the DC estimators can be as good as that of the global ones (Lee et al., Citation2017; Volgushev et al., Citation2019; Zhang et al., Citation2013, etc.). On the other hand, Shi et al. (Citation2018) and Banerjee et al. (Citation2019) showed that, in some nonstandard problems, the DC estimators converge at a rate much faster than the global ones. These results illustrate the different behaviours of the DC approach across statistical inference problems.
3. Statistical inferences under a restricted communication budget
As discussed above, the DC method is communication efficient compared with the global one, but the statistical efficiency of DC estimators is inferior to that of the global ones in many cases. To improve the efficiency of the DC estimators, some iterative methods have been proposed in the literature at the price of more communication cost. This leads to an interesting problem: how to conduct statistical inference within a given communication budget.
For distributed mean estimation, Garg et al. (Citation2014) established lower bounds on the number of bits of communication required to achieve the minimax squared-error loss. Zhang et al. (Citation2013) and Braverman et al. (Citation2016) derived the minimax rate for estimating the mean vector under a restricted communication cost. Cai and Wei (Citation2020) discussed the estimation of the mean vector of a Gaussian distribution under a restriction on the communication budget.
However, how to handle statistical problems with a restricted budget in other settings remains an interesting problem for future work. For example, for the hypothesis testing problem discussed in Section 2, how to design test statistics that achieve good statistical efficiency under a given communication budget needs further investigation.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Correction Statement
This article has been republished with minor changes. These changes do not impact the academic content of the article.
References
- Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). John Wiley & Sons.
- Bai, Z., & Saranadasa, H. (1996). Effect of high dimension: By an example of a two sample problem. Statistica Sinica, 6(2), 311–329.
- Banerjee, M., Durot, C., & Sen, B. (2019). Divide and conquer in nonstandard problems and the super-efficiency phenomenon. Annals of Statistics, 47(2), 720–757. https://doi.org/10.1214/17-AOS1633
- Braverman, M., Garg, A., Ma, T., Nguyen, H. L., & Woodruff, D. P. (2016). Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing (pp. 1011–1020).
- Cai, T. T., & Wei, H. (2020). Distributed Gaussian mean estimation under communication constraints: Optimal rates and communication-efficient algorithms. arXiv:2001.08877.
- Du, B., & Zhao, J. (2021). Hypothesis testing of one sample mean vector in distributed frameworks. arXiv:2110.02588.
- Garg, A., Ma, T., & Nguyen, H. (2014). On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems, 27, 2726–2734.
- Lee, J. D., Liu, Q., Sun, Y., & Taylor, J. E. (2017). Communication-efficient sparse regression. Journal of Machine Learning Research, 18(1), 115–144.
- Li, M., & Zhao, J. (2021). Communication-efficient distributed linear discriminant analysis for binary classification. Statistica Sinica. https://doi.org/10.5705/ss.202020.0374
- Shi, C., Lu, W., & Song, R. (2018). A massive data framework for M-estimators with cubic-rate. Journal of the American Statistical Association, 113(524), 1698–1709. https://doi.org/10.1080/01621459.2017.1360779
- Srivastava, M. S., & Du, M. (2008). A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99(3), 386–402. https://doi.org/10.1016/j.jmva.2006.11.002
- Tian, L., & Gu, Q. (2017). Communication-efficient distributed sparse linear discriminant analysis. In Artificial Intelligence and Statistics (pp. 1178–1187).
- Volgushev, S., Chao, S. K., & Cheng, G. (2019). Distributed inference for quantile regression processes. Annals of Statistics, 47(3), 1634–1662. https://doi.org/10.1214/18-AOS1730
- Wang, L., Peng, B., & Li, R. (2015). A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association, 110(512), 1658–1669. https://doi.org/10.1080/01621459.2014.988215
- Zhang, Y., Duchi, J. C., Jordan, M. I., & Wainwright, M. J. (2013). Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Neural Information Processing Systems (pp. 2328–2336).