Abstract
This article studies distributed estimation and inference for a general statistical problem with a convex loss that could be nondifferentiable. For the purpose of efficient computation, we restrict ourselves to stochastic first-order optimization, which enjoys low per-iteration complexity. To motivate the proposed method, we first investigate the theoretical properties of a straightforward divide-and-conquer stochastic gradient descent (DC-SGD) approach. Our theory shows that there is a restriction on the number of machines, and this restriction becomes more stringent when the dimension $p$ is large. To overcome this limitation, this article proposes a new multi-round distributed estimation procedure that approximates the Newton step using only stochastic subgradients. The key component in our method is the proposal of a computationally efficient estimator of $\Sigma^{-1} w$, where $\Sigma$ is the population Hessian matrix and $w$ is any given vector. Instead of estimating $\Sigma$ (or $\Sigma^{-1}$), which usually requires the second-order differentiability of the loss, the proposed first-order Newton-type estimator (FONE) directly estimates the vector of interest $\Sigma^{-1} w$ as a whole and is applicable to nondifferentiable losses. Our estimator also facilitates inference for the empirical risk minimizer. It turns out that the key term in the limiting covariance has the form $\Sigma^{-1} w$, which can be estimated by FONE.
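To make the idea concrete, below is a minimal NumPy sketch of a FONE-style iteration for the smooth least-squares loss, for which the population Hessian is $\Sigma = E[xx^\top]$. All names, the step size, the perturbation size, and the iterate averaging are illustrative assumptions under this toy setup, not the paper's exact algorithm or tuning; the sketch only shows how a stochastic iteration built from first-order gradient differences can have $d = \Sigma^{-1} w$ as its fixed point.

import numpy as np

# Toy setup: least-squares loss 0.5*(x @ theta - y)^2, whose population
# Hessian is Sigma = E[x x^T]. Step size `eta`, perturbation `delta`, and
# the iterate averaging are illustrative choices, not the paper's tuning.

def grad_loss(theta, x, y):
    """Stochastic (sub)gradient of the loss at a single sample (x, y)."""
    return x * (x @ theta - y)

def fone_sketch(theta_hat, w, X, Y, eta=0.05, delta=1e-3, T=20000, seed=0):
    """Estimate Sigma^{-1} w using only first-order gradient evaluations.

    Each step forms the gradient difference
        [g(theta_hat + delta * d) - g(theta_hat)] / delta,
    whose expectation approximates Sigma @ d, and then updates
        d <- d - eta * (that difference - w).
    The fixed point solves Sigma d = w, i.e., d = Sigma^{-1} w.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    d = np.zeros(p)
    d_bar = np.zeros(p)  # average of the last T/2 iterates to tame the noise
    for t in range(T):
        i = rng.integers(n)  # draw one sample per iteration
        g1 = grad_loss(theta_hat + delta * d, X[i], Y[i])
        g0 = grad_loss(theta_hat, X[i], Y[i])
        d -= eta * ((g1 - g0) / delta - w)
        if t >= T // 2:
            d_bar += d
    return d_bar / (T - T // 2)

# Usage on simulated data: compare with the exact solve of Sigma d = w.
rng = np.random.default_rng(1)
n, p = 20000, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
w = np.ones(p)
theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # stand-in for the ERM
d_fone = fone_sketch(theta_hat, w, X, Y)
d_exact = np.linalg.solve(X.T @ X / n, w)
print(np.linalg.norm(d_fone - d_exact))  # small for this smooth toy loss

For a nondifferentiable loss, grad_loss would return a subgradient instead; since the update only uses (sub)gradient differences averaged over random samples, no second-order differentiability of the loss is required.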
Supplementary Materials
The supplementary material provides the verification of conditions, the theory of mini-batch SGD with diverging dimension, the proofs of all technical results in the main paper, and additional numerical experiments.
Acknowledgments
The authors are very grateful to anonymous referees and the associate editor for their detailed and constructive comments that considerably improved the quality of this article.
Notes
1 With a slight abuse of notation, we use $n$ to denote either the sample size in nondistributed settings or the local sample size of a single machine in distributed settings.
2 Note that although we present the evenly distributed setting for DC-SGD for ease of illustration, one can easily see from the proof that the convergence rate is actually determined by the smallest subsample size.
3 U.S. Census, http://www.census.gov/census2000/PUMS5.html