Short Communications

A discussion on “A selective review of statistical methods using calibration information from similar studies”

Pages 196-198 | Received 26 Mar 2022, Accepted 02 May 2022, Published online: 10 Jun 2022

It is our pleasure to have the opportunity to comment on this fine work, in which the authors present a comprehensive review of empirical likelihood (EL) methods for integrative data analysis. The paper focuses on a unified methodological framework based on EL and estimating equations (EE) that sequentially combines summary information from individual data batches to obtain estimation and inference comparable to those of the EL method utilizing all individual-level data. The latter is sometimes referred to as oracle estimation and inference in the setting of massively distributed data batches. An obvious strength of this review is its detailed treatment of theoretical properties, in particular the improved estimation efficiency gained from the use of auxiliary information.

In this paper, the authors consider a typical data integration situation in which individual-level data from the Kth data batch are combined with certain 'good' summary information from the previous K−1 data batches. While appreciating the theoretical strengths of the paper, we notice a few interesting aspects that are worth some discussion.
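To fix ideas, the following is a minimal sketch, written by us in Python, of the generic EL-plus-EE construction underlying the reviewed framework: the estimating function for the parameter of interest is augmented with an auxiliary moment constraint that plays the role of summary information from previous batches. All names (g, m2_known) are hypothetical, and this is an illustration of the construction, not the authors' exact procedure.

```python
# Minimal sketch: empirical likelihood (EL) for a mean, augmented with an
# auxiliary moment constraint standing in for summary information from
# earlier data batches. Names (g, m2_known) are ours, for illustration only.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)    # current (Kth) data batch
m2_known = 5.0                                  # E[X^2] from previous batches

def g(theta):
    # Stacked estimating functions: mean of interest + calibration constraint
    return np.column_stack([x - theta, x**2 - m2_known])

def inner_lambda(G):
    # Dual problem: maximize sum(log(1 + lam'g_i)) subject to 1 + lam'g_i > 0
    def neg_dual(lam):
        t = 1.0 + G @ lam
        if np.any(t <= 1e-8):
            return 1e10                         # crude barrier for feasibility
        return -np.sum(np.log(t))
    res = minimize(neg_dual, np.zeros(G.shape[1]), method="Nelder-Mead")
    return res.x

def neg_profile_logELR(theta):
    G = g(theta)
    lam = inner_lambda(G)
    return np.sum(np.log(1.0 + G @ lam))        # = -log EL ratio at theta

fit = minimize_scalar(neg_profile_logELR, bounds=(-5, 5), method="bounded")
print("EL estimate with calibration:", fit.x)
print("sample mean (no calibration):", x.mean())
```

When the auxiliary constraint is informative, the profiled EL estimate typically improves on the estimator that ignores the calibration information.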

Distributed data structures: In practice, both the individual data batch sizes and the number of data batches may be rather heterogeneous, requiring different theory and algorithms in the data analysis. Such heterogeneity in distributed data structures is not well aligned with the methodological framework reviewed in the paper. One important practical scenario is that the number of data batches tends to infinity. Such a setting may arise from distributed data collected from millions of mobile device users, or from electronic health records (EHR) distributed across thousands of hospitals. In the presence of massively distributed data batches, a natural question pertains to the trade-off between data communication efficiency and analytic approximation accuracy. Although one-round data communication is popular in this type of integrative data analysis, multiple rounds of communication may also be viable when implemented via high-performance computing clusters. Our experience suggests that sacrificing flexibility in data communication (e.g., limiting it to one round, as in the Hadoop paradigm), although it gains computational speed, may pay a substantial price in approximation accuracy, leading to potentially accumulated estimation bias as the number of data batches increases. This estimation bias is a technical challenge in nonlinear models, owing to the approximations invoked to linearize both the estimation procedure and the numerical search algorithm. On the other hand, relaxing the restrictions on data communication, such as operations within the lambda architecture, can help reduce the approximation error and lower the estimation bias. Clearly, the latter requires more computational resources.
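The communication trade-off can be seen in a toy example of our own construction (not a procedure from the paper): a one-round scheme averages locally fitted logistic regression estimates once, while a multi-round scheme ships local gradients and Hessians at each iteration and thus reproduces the all-data Newton step exactly.

```python
# Toy sketch (our own construction): one-round communication averages local
# estimates once; multi-round communication iterates global Newton steps
# assembled from local summaries shipped at every round.
import numpy as np

rng = np.random.default_rng(1)
K, n_k, p = 50, 100, 3                       # batches, batch size, dimension
beta_true = np.array([0.5, -1.0, 2.0])
batches = []
for _ in range(K):
    X = rng.normal(size=(n_k, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
    batches.append((X, y))

def local_grad_hess(beta, X, y):
    # Score and Fisher information of the local logistic log-likelihood
    mu = 1 / (1 + np.exp(-X @ beta))
    return X.T @ (y - mu), X.T @ (X * (mu * (1 - mu))[:, None])

def newton_fit(X, y, iters=25):
    beta = np.zeros(p)
    for _ in range(iters):
        grad, info = local_grad_hess(beta, X, y)
        beta += np.linalg.solve(info, grad)
    return beta

# One round: each batch sends only its local estimate; the center averages.
one_shot = np.mean([newton_fit(X, y) for X, y in batches], axis=0)

# Multiple rounds: each round every batch sends (gradient, information) at
# the current global iterate, recovering the full-data Newton step exactly.
beta = np.zeros(p)
for _ in range(25):
    gs, Hs = zip(*(local_grad_hess(beta, X, y) for X, y in batches))
    beta += np.linalg.solve(sum(Hs), sum(gs))

print("one-round averaging :", one_shot)
print("multi-round Newton  :", beta)
```

With many small batches, the one-round average inherits each batch's nonlinear estimation bias, which averaging does not cancel, whereas the multi-round iterate converges to the full-data maximum likelihood estimate.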

This important issue was investigated by Zhou et al. (2022), who studied the asymptotic equivalence between the distributed EL estimator and the oracle EL estimator under both one-round and unlimited-rounds communication when the number of distributed data batches grows without bound. They found that under one-round communication, if the number of data batches K increases with the sample size n at the slow order O(n^{1/2−δ}) for some 0 < δ ≤ 1/2, and all individual batch sizes increase (i.e., n_min = min_k n_k → ∞), their proposed distributed EL estimator is asymptotically equivalent to the oracle EL estimator in the mode of convergence in distribution. Interestingly, if there is no limit on communication, both technical conditions above can be removed; moreover, under much weaker conditions the distributed EL estimator and the oracle EL estimator are asymptotically equivalent in the mode of convergence in probability, a stronger result than the former. Furthermore, assisted by the ADMM algorithm, the distributed EL estimator can still work well even when covariate distributions are seriously unbalanced across several data batches, whereas conventional meta-analysis methods fail miserably.
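The ADMM-assisted distributed EL estimator of Zhou et al. (2022) is too involved for a short sketch, but the communication pattern of consensus ADMM can be illustrated with distributed least squares; the example below is our own stand-in, with deliberately unbalanced covariate means across batches, showing the local proximal updates and the global averaging step.

```python
# Minimal consensus-ADMM sketch for distributed least squares (a stand-in
# for the ADMM-assisted distributed EL estimator, which is more involved).
# Each batch solves a ridge-like local problem; only (beta_k + u_k) is shared.
import numpy as np

rng = np.random.default_rng(2)
K, p, rho = 5, 3, 1.0
beta_true = np.array([1.0, -2.0, 0.5])
data = []
for k in range(K):
    X = rng.normal(loc=k, size=(80, p))       # deliberately unbalanced means
    y = X @ beta_true + rng.normal(size=80)
    data.append((X, y))

betas = np.zeros((K, p))                      # local variables
u = np.zeros((K, p))                          # scaled dual variables
z = np.zeros(p)                               # global consensus variable
for _ in range(100):
    for k, (X, y) in enumerate(data):
        # Local update: argmin ||y - X b||^2 / 2 + (rho/2)||b - z + u_k||^2
        A = X.T @ X + rho * np.eye(p)
        betas[k] = np.linalg.solve(A, X.T @ y + rho * (z - u[k]))
    z = (betas + u).mean(axis=0)              # global averaging step
    u += betas - z                            # dual ascent
print("consensus estimate:", z)
```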

Heterogeneity: Good theoretical properties, including estimation consistency, asymptotic normality, and estimation efficiency, are reviewed in this paper. We notice that these properties are established under a strong assumption: a homogeneous underlying data-generating mechanism and a homogeneous statistical model across all data batches. In practice, this assumption can easily be violated, especially when the number of data batches increases. Generally speaking, the bias-variance trade-off is a common criterion in statistical analysis, and with distributed data, heterogeneity issues are unavoidable. Aggregating information from heterogeneous data batches through a homogeneous modelling approach can suffer severe estimation bias and failures in inference. Indeed, it is well known that the bias of an estimator is a more dominant concern than its variance when the volume of data at hand is big. Thus, investigating similarity among the available data batches is a critical early step in analyzing distributed data.

Addressing both data and model heterogeneity has been extensively considered in the literature on distributed data analysis. For example, federated learning (McMahan et al., 2017) aims to find effective methods for borrowing information across similar datasets while accounting for individualized heterogeneity. Li et al. (2020) utilized a proximal term specific to each local objective to tackle heterogeneity in federated network learning; Collins et al. (2021) and Fallah et al. (2020) considered federated learning of a shared data representation or shared models across data batches; and Smith et al. (2017) studied heterogeneous models via multi-task learning (or meta learning) with shared sparsity across different models.
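To make the proximal idea of Li et al. (2020) concrete, here is a minimal FedProx-style sketch under assumptions of our choosing (a quadratic local loss and inexact local solves by gradient descent): each client's update penalizes deviation from the current global model, which tames client drift under heterogeneity.

```python
# Sketch of a FedProx-style local update (after Li et al., 2020): each client
# minimizes its local loss plus a proximal term (mu/2)||w - w_global||^2.
# The quadratic loss and all names are our illustrative choices.
import numpy as np

def fedprox_local_update(X, y, w_global, mu=0.1, lr=0.01, steps=50):
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)     # local least-squares gradient
        grad += mu * (w - w_global)           # proximal pull toward global
        w -= lr * grad
    return w

rng = np.random.default_rng(3)
p = 3
clients = []
for k in range(10):                           # heterogeneous client truths
    w_k = np.array([1.0, -1.0, 0.5]) + 0.5 * rng.normal(size=p)
    X = rng.normal(size=(50, p))
    clients.append((X, X @ w_k + rng.normal(size=50)))

w_global = np.zeros(p)
for rnd in range(20):                         # communication rounds
    local_models = [fedprox_local_update(X, y, w_global) for X, y in clients]
    w_global = np.mean(local_models, axis=0)  # FedAvg-style aggregation
print("global model:", w_global)
```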

As mentioned above, unbalanced covariate distributions or uneven covariate dimensions across data batches are pervasive in practice, and little work is available in the literature to handle this technical challenge. Zhou et al. (2022) considered a simple case of unbalanced covariate distributions with the same covariate dimension across data batches; with the help of the ADMM algorithm, their proposed distributed EL estimator worked well. In addition, some researchers have studied non-IID data in the development of distributed estimation and inference. For example, the renewable estimation and incremental inference proposed by Luo et al. (2022) allow both estimation and inference to be sequentially updated for clustered data streams; Wang et al. (2012) proposed an integrative analysis of distributed longitudinal data; Hector and Song (2021) considered a distributed generalized method of moments (GMM) for multi-dimensional outcomes with a diverging dimension; and Tang et al. (2020) utilized the confidence distribution approach to establish a distributed lasso estimator over distributed datasets, to name just a few.
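The flavor of sequential updating is easiest to see in the linear model, where maintaining running sufficient statistics reproduces the all-data estimator exactly without revisiting earlier batches; renewable estimation (Luo et al., 2022) extends this idea to GLMs by incrementally updating score and information quantities rather than exact sufficient statistics.

```python
# Sketch of sequential updating in the linear model: the running X'X and X'y
# reproduce the all-data least-squares estimator exactly as batches stream in.
import numpy as np

rng = np.random.default_rng(4)
p = 3
beta_true = np.array([0.3, -0.7, 1.2])
XtX, Xty = np.zeros((p, p)), np.zeros(p)

for batch in range(100):                      # data batches arrive in a stream
    X = rng.normal(size=(50, p))
    y = X @ beta_true + rng.normal(size=50)
    XtX += X.T @ X                            # update summary statistics only
    Xty += X.T @ y
    beta_hat = np.linalg.solve(XtX, Xty)      # current 'renewed' estimate

print("streaming estimate after 100 batches:", beta_hat)
```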

Implementation: One notable omission in this paper is a review of algorithms and software packages related to implementation. Several R packages are available in the literature, such as DDIMM (Hector & Song, 2020), which performs data integration with dependent data sources, and metafuse (Tang & Song, 2016), which fuses heterogeneous parameters across independent data sources into subgroups. Both algorithms and software packages play important roles in translational research, which leads to broader impacts.

Some future directions: The authors have built up an interesting framework that may motivate many important future research problems. With our limited knowledge of this field, we humbly suggest three. First, although the unified framework is appealing in the low-dimensional case, high-dimensional data would present great challenges, from potentially heavy computational burdens to the notoriously hard problem of post-model-selection inference. Second, in the big-data era, data with special structures, such as spatially and/or temporally correlated data, are pervasive; extending the low-dimensional framework to handle distributed spatio-temporal data is an important direction. Third, in massive distributed datasets, outliers and contaminated data are ubiquitous, so it is important to develop robust distributed EL methods that deliver reliable and stable estimation and inference.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by NSF [grant number 2113564].

References