Editorial

Editorial to the special issue: Statistical Approaches for Big Data and Machine Learning


The special issue ‘Statistical Approaches for Big Data and Machine Learning’ of the Journal of Applied Statistics (JAS), Taylor & Francis, contains papers that were presented at the 6th International Conference on Fuzzy Systems and Data Mining (FSDM 2020), held virtually during 13–16 November 2020. The conference featured plenary sessions, including keynote speeches, invited speeches, oral presentations and poster presentations, and was organized and hosted by Huaqiao University. In addition, the guest editors solicited paper submissions for the special issue, which drew broad attention from researchers all over the world; many of the accepted papers were submitted by authors who had not attended the conference. This special issue of JAS includes 19 excellent papers, accepted after several rounds of review by referees following the journal's standards. Because the refereeing process for some submissions was slowed by the COVID-19 pandemic, the editorial process for the special issue extended over more than two years. The accepted papers cover a range of topics on novel statistical approaches for big data and machine learning, and they provide readers with new statistical methods and innovative applications in the big data era.

This timely issue presents the most recent developments in big data analysis, the frontiers of data mining and machine learning, and related interdisciplinary areas. The papers in this special issue cover big data analysis for missing data, data mining tools and statistical learning techniques, predictions of the COVID-19 outbreak using machine learning, and novel methods for neural network analysis. These research papers reflect a comprehensive view of recent advances and challenges in the methodology of big data analysis and its applications to biostatistics and bioinformatics. Finally, we hope that readers can learn about current trends in big data research and deep learning from the articles in this special issue.

For high-dimensional two-sample Behrens–Fisher problems, both non-scale-invariant and scale-invariant tests have been investigated. Zhang et al. [Citation1] proposed a normal reference scale-invariant test. The benefit of the new method is that it neither imposes strong assumptions on the underlying group covariance matrices nor assumes their equality. It is shown that the distribution of the chi-square-type mixture can be well approximated by the Welch–Satterthwaite chi-square approximation.
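
To illustrate the general idea, the following is a minimal sketch (not the test of Zhang et al. [Citation1]) of how a chi-square-type mixture can be approximated by a Welch–Satterthwaite scaled chi-square; the mixture weights are hypothetical and chosen only for demonstration.

```python
# Schematic Welch–Satterthwaite (W–S) approximation for a mixture sum_i c_i * chi2_1.
# The weights c below are illustrative assumptions, not values from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
c = np.array([3.0, 1.5, 0.8, 0.2])          # hypothetical mixture weights

# Monte Carlo draws from the chi-square-type mixture
T = (c[None, :] * rng.chisquare(df=1, size=(100_000, c.size))).sum(axis=1)

# W–S approximation: match the first two moments with g * chi2_d
g = (c ** 2).sum() / c.sum()                 # scale factor
d = c.sum() ** 2 / (c ** 2).sum()            # approximate degrees of freedom

# Compare upper-tail quantiles of the mixture and its W–S approximation
for q in (0.90, 0.95, 0.99):
    print(q, np.quantile(T, q), g * stats.chi2.ppf(q, df=d))
```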

Deep learning can help to develop an effective COVID-19 diagnosis model that attains a maximum detection rate with minimum computing time. Pustokhin et al. [Citation2] proposed a new residual network (ResNet) based class attention layer, called RCAL-BiLSTM, for COVID-19 diagnosis.

The proposed RCAL-BiLSTM model involves a series of processes, namely bilateral filtering-based preprocessing, RCAL-BiLSTM-based feature extraction, and softmax (SM) based classification. The RCAL-BiLSTM-based feature extraction takes place using three modules, i.e. the ResNet-based feature extraction, CAL, and Bi-LSTM modules. The SM layer is then applied to categorize the feature vectors into the corresponding classes.
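
As a rough structural sketch only (not the authors' architecture or code), the following PyTorch snippet shows a ResNet backbone followed by a simple attention layer, a bidirectional LSTM, and a softmax head; the layer sizes, attention form, and two-class output are illustrative assumptions.

```python
# Hypothetical ResNet -> attention -> BiLSTM -> softmax pipeline (illustration only).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ResNetAttnBiLSTM(nn.Module):
    def __init__(self, num_classes: int = 2, hidden: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)
        # keep everything up to (but not including) the global pool / fc head
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 512, H, W)
        self.attn = nn.Linear(512, 1)                                   # simple attention scores
        self.bilstm = nn.LSTM(input_size=512, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        f = self.features(x)                         # (B, 512, H, W) feature maps
        seq = f.flatten(2).transpose(1, 2)           # (B, H*W, 512) spatial sequence
        w = torch.softmax(self.attn(seq), dim=1)     # attention over spatial positions
        out, _ = self.bilstm(seq * w)                # (B, H*W, 2*hidden)
        logits = self.head(out.mean(dim=1))          # pool over the sequence
        return torch.softmax(logits, dim=-1)         # softmax class probabilities

probs = ResNetAttnBiLSTM()(torch.randn(1, 3, 224, 224))
```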

Network (graph) data analysis is a popular research topic in statistics and machine learning. One important problem is graph two-sample hypothesis testing. Yuan and Wen [Citation3] investigated the weighted graph two-sample hypothesis testing problem and proposed a practical test statistic. The proposed test statistic converges in distribution to the standard normal distribution under the null hypothesis.
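
The snippet below is only a simplified stand-in, not the statistic of Yuan and Wen [Citation3]: it standardizes a difference in mean edge weight between two weighted graphs and refers it to N(0,1), to convey the "standardize, then compare with the normal limit" structure; the graphs are simulated.

```python
# Schematic two-sample comparison of edge weights for two weighted graphs (illustration).
import numpy as np
from scipy import stats

def edge_weights(A):
    """Upper-triangular edge weights of a symmetric weighted adjacency matrix."""
    iu = np.triu_indices_from(A, k=1)
    return A[iu]

def two_sample_graph_stat(A, B):
    wa, wb = edge_weights(A), edge_weights(B)
    diff = wa.mean() - wb.mean()
    se = np.sqrt(wa.var(ddof=1) / wa.size + wb.var(ddof=1) / wb.size)
    z = diff / se
    return z, 2 * stats.norm.sf(abs(z))      # two-sided p-value from N(0,1)

rng = np.random.default_rng(1)
n = 50
A = rng.exponential(1.0, size=(n, n)); A = (A + A.T) / 2
B = rng.exponential(1.0, size=(n, n)); B = (B + B.T) / 2
print(two_sample_graph_stat(A, B))
```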

Zhao et al. [Citation4] studied estimation and model selection for longitudinal partial linear varying coefficient errors-in-variables models. They proposed a bias-corrected penalized quadratic inference functions approach with two penalty terms. Their method can not only handle measurement errors in the covariates but also simultaneously estimate and select the significant non-zero parametric and nonparametric components.

When data are stored in a distributed manner, direct application of traditional hypothesis testing procedures is often prohibitive due to communication costs and privacy concerns. Xie et al. [Citation5] developed a distributed two-node Kolmogorov–Smirnov hypothesis testing scheme, implemented via a divide-and-conquer strategy. The authors also provided a distributed fraud detection procedure and a distribution-based classification method for multi-node machines based on the proposed scheme. These new methods can improve the accuracy of statistical inference in a distributed storage architecture.
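
The sketch below is an illustrative simplification, not the exact scheme of Xie et al. [Citation5]: each node ships only a compressed summary (here, an empirical CDF on a shared grid) to the coordinator, and the result is compared with the centralized Kolmogorov–Smirnov test; the grid size and data are assumptions.

```python
# Rough divide-and-conquer sketch for two-node Kolmogorov–Smirnov testing.
import numpy as np
from scipy import stats

def local_ecdf(x, grid):
    """Empirical CDF of the node-local sample evaluated on a shared grid."""
    return np.searchsorted(np.sort(x), grid, side="right") / x.size

rng = np.random.default_rng(2)
node1 = rng.normal(0.0, 1.0, size=5_000)      # data stored on node 1
node2 = rng.normal(0.1, 1.0, size=5_000)      # data stored on node 2

grid = np.linspace(-5, 5, 512)                # shared evaluation grid
D = np.max(np.abs(local_ecdf(node1, grid) - local_ecdf(node2, grid)))

# Reference: the centralized two-sample KS test on the pooled raw data
print("distributed KS distance:", D)
print("centralized ks_2samp:   ", stats.ks_2samp(node1, node2))
```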

Unsupervised document classification for imbalanced data sets faces a major challenge: to obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time, and money. Thielmann et al. [Citation6] proposed an integration of web scraping, one-class support vector machines and latent Dirichlet allocation (LDA) topic modeling into a multi-step classification rule that circumvents manual labeling. The proposed method outperforms common machine learning classifiers.
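
A condensed sketch of the one-class-SVM plus LDA topic-modeling step follows; toy documents replace the web-scraping stage, and all parameter values are illustrative assumptions rather than the settings of Thielmann et al. [Citation6].

```python
# Toy pipeline: LDA topic features from "scraped" positive documents, then a
# one-class SVM scores unlabeled documents as target class vs. outlier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import OneClassSVM

scraped_positive = [            # stand-in for documents gathered by web scraping
    "solar panels renewable energy grid storage",
    "wind turbines renewable power generation",
]
unlabeled = [
    "renewable energy storage for the power grid",
    "recipe for chocolate cake with vanilla icing",
]

vec = CountVectorizer()
X_pos = vec.fit_transform(scraped_positive)
X_new = vec.transform(unlabeled)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
T_pos = lda.fit_transform(X_pos)             # topic proportions of positive docs
T_new = lda.transform(X_new)

clf = OneClassSVM(kernel="rbf", nu=0.5).fit(T_pos)
print(clf.predict(T_new))                    # +1 = target class, -1 = outlier
```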

Fan et al. [Citation7] considered the problem of fault detection in data collection in wireless sensor networks. They combined evolutionary computing and machine learning to propose a productive technical solution, choosing classical particle swarm optimization (PSO) as the evolutionary component. The proposed RS-PPSO algorithm was successfully used to optimize the initial weights and biases of a backpropagation neural network and raise the prediction accuracy. The resulting optimized machine learning technique can effectively identify faulty data and ensure the effective operation of wireless sensor networks.
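
The following is a bare-bones sketch of the generic "PSO searches for initial weights, backpropagation then refines them" idea; it is not RS-PPSO, and the network size, PSO constants, and toy data are assumptions.

```python
# Minimal PSO loop searching for good initial weights of a one-hidden-layer network.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy fault / no-fault labels

H = 5                                          # hidden units
dim = 4 * H + H + H + 1                        # W1, b1, W2, b2 flattened

def loss(theta):
    W1 = theta[:4 * H].reshape(4, H); b1 = theta[4 * H:4 * H + H]
    W2 = theta[4 * H + H:4 * H + 2 * H]; b2 = theta[-1]
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return np.mean((p - y) ** 2)

# PSO state: positions, velocities, personal and global bests
pos = rng.normal(scale=0.5, size=(30, dim)); vel = np.zeros_like(pos)
pbest = pos.copy(); pbest_f = np.array([loss(q) for q in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(100):
    r1, r2 = rng.random((2, 30, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([loss(q) for q in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print("PSO-selected initial weights reach training loss:", loss(gbest))
# These weights would then be refined by ordinary backpropagation.
```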

Modeling cyber risks has been an important but challenging task in the study of cyber security. Wu et al. [Citation8] combined a deep learning approach with extreme value theory to propose an approach for modeling multivariate cyber risks. In the proposed approach, highly accurate point predictions are obtained by training the deep learning network, and extreme value theory is used to enhance the prediction of high quantiles. In practice, this method can potentially guide defenders in preparing resources for both regular attack and worst-case attack scenarios.
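
To convey the flavour of the extreme-value step, the sketch below applies a standard peaks-over-threshold (POT) generalized Pareto adjustment to the high quantiles of prediction residuals; the "predictions" are a placeholder, and the threshold and data are assumptions rather than anything from Wu et al. [Citation8].

```python
# Schematic POT/GPD adjustment of high quantiles of prediction residuals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
observed = rng.pareto(2.5, size=5_000) + 1.0          # heavy-tailed attack magnitudes
predicted = np.full_like(observed, observed.mean())   # stand-in for deep learning predictions
residuals = observed - predicted

u = np.quantile(residuals, 0.95)                      # POT threshold
exceed = residuals[residuals > u] - u
xi, _, sigma = stats.genpareto.fit(exceed, floc=0)    # fit GPD to exceedances

def pot_quantile(p, n=residuals.size, n_u=exceed.size):
    """Level-p quantile of the residual distribution via the standard POT formula."""
    return u + (sigma / xi) * ((n / n_u * (1 - p)) ** (-xi) - 1)

for p in (0.99, 0.999):
    print(p, "empirical:", np.quantile(residuals, p), "EVT:", pot_quantile(p))
```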

The National Heart, Lung and Blood Institute Growth and Health Study is a large longitudinal study of childhood health. Zhang et al. [Citation9] proposed a dynamic copula approach for estimating an outcome's joint distributions at two time points given a large number of time-varying covariates. The new models depend on the outcome's time-varying distributions at one time point, the bivariate copula densities, etc. They also proposed a three-step procedure for variable selection and estimation, which selects the influential covariates using a machine learning procedure based on spline lasso-regularized least squares, and estimates the functional copula parameter of the dynamic copula models.
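
The simplified sketch below shows only the spline-plus-lasso variable screening ingredient: each covariate is expanded in a spline basis and a lasso fit shrinks unimportant covariates toward zero. The copula-estimation step is omitted, and an ordinary (non-grouped) lasso is used for brevity, so this is not the three-step procedure of Zhang et al. [Citation9].

```python
# Spline expansion of covariates followed by cross-validated lasso screening.
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

model = make_pipeline(SplineTransformer(degree=3, n_knots=5), LassoCV(cv=5))
model.fit(X, y)

coefs = model.named_steps["lassocv"].coef_.reshape(p, -1)   # basis coefficients per covariate
print("covariates kept by the lasso:", np.where(np.abs(coefs).sum(axis=1) > 1e-8)[0])
```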

Discriminative subspace clustering (DSC) can make full use of linear discriminant analysis (LDA) to reduce the dimension of data and achieve effective clustering of high-dimensional data. However, most existing DSC algorithms do not account for noise and outliers. Zhi et al. [Citation10] discussed the sensitivity of DSC to noise and outliers. By replacing the Euclidean distance in the objective function of LDA with an exponential non-Euclidean distance, the authors developed a noise-insensitive LDA (NILDA) algorithm. They then proposed a noise-insensitive discriminative subspace fuzzy clustering (NIDSFC) algorithm.
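
The toy example below only illustrates why a bounded, exponential-type distance down-weights gross outliers relative to the squared Euclidean distance in a scatter-type quantity; it does not reproduce the NILDA objective of Zhi et al. [Citation10], and the data and bandwidth are assumptions.

```python
# Outlier contribution under squared Euclidean vs. a bounded exponential-type distance.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
X[:3] += 20.0                                   # a few gross outliers
mu = X.mean(axis=0)
d2 = ((X - mu) ** 2).sum(axis=1)                # squared Euclidean distances

sigma2 = np.median(d2)                          # bandwidth (illustrative choice)
w_euclid = d2                                   # usual unbounded contribution
w_exp = 1.0 - np.exp(-d2 / sigma2)              # bounded exponential-type distance

print("outlier share of Euclidean scatter:   ", w_euclid[:3].sum() / w_euclid.sum())
print("outlier share of exponential scatter: ", w_exp[:3].sum() / w_exp.sum())
```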

Today's large amounts of data and advanced technologies have produced new types of complex data, such as histogram-valued data. Kang et al. [Citation11] focused on classification problems in which predictors are observed as, or aggregated into, histograms. They developed a margin-based classifier called the support histogram machine (SHM) for histogram-valued data, adopting the support vector machine framework and the Wasserstein–Kantorovich metric to measure distances between histograms.
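
The schematic below mimics the flavour of an SVM acting on histogram-valued observations: pairwise one-dimensional Wasserstein distances are turned into a Gaussian-type kernel and passed to a precomputed-kernel SVM. It is not the SHM algorithm of Kang et al. [Citation11]; the data, bandwidth, and kernel form (which need not be positive definite in general) are assumptions.

```python
# Wasserstein-distance kernel between histogram-valued observations + precomputed-kernel SVM.
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# each observation is a distribution, summarized here by a raw 1-D sample
samples = [rng.normal(loc=0, scale=1, size=100) for _ in range(20)] + \
          [rng.normal(loc=1, scale=1, size=100) for _ in range(20)]
y = np.array([0] * 20 + [1] * 20)

n = len(samples)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_distance(samples[i], samples[j])

K = np.exp(-(D ** 2) / np.median(D[D > 0]) ** 2)   # Wasserstein-based kernel matrix
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```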

Triple-negative breast cancer (TNBC) is generally considered an aggressive breast cancer subtype associated with poor prognostic outcomes. Liu et al. [Citation12] developed a semiparametric model with kernels for gene-based analysis of breast cancer GWAS data. The MATLAB software SPMGBA is available on GitHub (https://github.com/zliu3/SPMGBA). They discovered genetic signatures associated with breast cancer; in particular, they found that SEL1L is associated with the overall survival of TNBC with a p-value of 0.0002.

Feature selection is an important data dimension reduction method. However, most existing works set the matrix norms used in the loss function and the regularization terms to the same norm. To address this problem, Zhi et al. [Citation13] presented a generalized norm regression-based feature selection method, based on a new optimization criterion that allows the loss function and the regularization terms to use different matrix norms. The paper also proposed a new optimization criterion in a regression framework without regularization.
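
As a simplified illustration of the underlying idea, and not the optimization algorithm of Zhi et al. [Citation13], the sketch below scores features by an l2,p-type row norm of a multi-output regression coefficient matrix; the value of p, the ridge working model, and the toy data are assumptions.

```python
# Row-norm (l2,p-style) feature ranking from a multi-output linear fit.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(8)
n, d, k = 200, 15, 3                       # samples, features, response dimensions
X = rng.normal(size=(n, d))
W_true = np.zeros((d, k)); W_true[:4] = rng.normal(size=(4, k))
Y = X @ W_true + 0.1 * rng.normal(size=(n, k))

W_hat = Ridge(alpha=1.0).fit(X, Y).coef_.T          # (d, k) coefficient matrix
p = 0.5
scores = np.linalg.norm(W_hat, axis=1) ** p         # l2,p-style row scores
print("top-ranked features:", np.argsort(scores)[::-1][:4])
```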

Knowledge of the network structure among covariates can be incorporated into regularized regression via a network penalty term, but the connection signs then have to be estimated jointly with the covariate coefficients. This can be done with an algorithm that iterates a connection sign estimation step and a covariate coefficient estimation step. Weber et al. [Citation14] proposed such an algorithm, called 3CoSE, and presented simulation results as well as an application to forecasting event times.
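
The sketch below conveys only the alternation between (i) estimating connection signs from the current coefficients and (ii) re-estimating coefficients under a network fusion penalty; it is not the 3CoSE algorithm of Weber et al. [Citation14]. The penalty is implemented naively via data augmentation, and the network, tuning constants, and stopping rule are assumptions.

```python
# Alternating sign estimation and penalized re-estimation with a crude network fusion penalty.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 150, 6
edges = [(0, 1), (1, 2), (3, 4)]                  # hypothetical covariate network
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 2.0, 1.5, -1.0, -1.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam2, beta = 2.0, np.ones(p)
for _ in range(10):
    signs = [np.sign(beta[j] * beta[k]) or 1.0 for j, k in edges]   # step (i): connection signs
    # step (ii): augment the design so each edge adds a row nudging beta_j toward sign * beta_k
    aug = np.zeros((len(edges), p))
    for r, ((j, k), s) in enumerate(zip(edges, signs)):
        aug[r, j], aug[r, k] = np.sqrt(lam2), -s * np.sqrt(lam2)
    X_aug = np.vstack([X, aug]); y_aug = np.concatenate([y, np.zeros(len(edges))])
    beta = Lasso(alpha=0.05).fit(X_aug, y_aug).coef_

print("estimated signs:", signs, "\ncoefficients:", np.round(beta, 2))
```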

Evidence-based fall prevention programs are delivered nationwide, supported by funding from the Administration for Community Living (ACL). Cheng et al. [Citation15] analyzed data from 39 ACL grantees in 22 states from 2014 to 2017. The large number of missing values for falls efficacy may lead to potentially biased statistical results. They used multiple imputation-stepwise regression (MI-stepwise) and multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) methods and conducted simulation studies. In particular, they evaluated the performance of the MI-LASSO method in addressing the over-selection issue in cross-validation (CV)-based LASSO.
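
The compact sketch below mirrors the generic multiple-imputation-plus-LASSO workflow (impute several completed data sets, run a cross-validated lasso on each, keep variables selected in a majority of imputations); it is not the implementation of Cheng et al. [Citation15], and the missingness pattern, number of imputations, and voting threshold are assumptions.

```python
# MI + LASSO-style variable selection across multiply imputed data sets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(10)
n, p = 200, 8
X_full = rng.normal(size=(n, p))
y = X_full[:, 0] - 2 * X_full[:, 2] + rng.normal(scale=0.5, size=n)
X = X_full.copy()
X[rng.random((n, p)) < 0.2] = np.nan               # 20% of covariate values missing

selections = []
for m in range(5):                                  # M = 5 imputations
    X_imp = IterativeImputer(random_state=m, sample_posterior=True).fit_transform(X)
    coef = LassoCV(cv=5, random_state=m).fit(X_imp, y).coef_
    selections.append(np.abs(coef) > 1e-8)

votes = np.mean(selections, axis=0)
print("selected in a majority of imputations:", np.where(votes >= 0.5)[0])
```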

Causal inference under the potential outcome framework relies on the ignorable treatment assignment assumption, which is questionable in observational studies. Zhou and Yao [Citation16] proposed a new sensitivity analysis procedure to evaluate the impact of an unmeasured confounder by leveraging ideas from doubly robust estimators, the exponential tilt method, and the super learner algorithm. The exponential tilting method proposed in this paper does not impose any restriction on the structure or models of the unmeasured confounders. The authors also incorporate the super learner machine learning algorithm to perform nonparametric model estimation and the corresponding sensitivity analysis.
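
The following is a deliberately simplified sensitivity check, conveying only the flavour of the approach and not the procedure of Zhou and Yao [Citation16]: an AIPW (doubly robust) estimate of a counterfactual mean is recomputed while the outcome-model prediction for untreated subjects is exponentially tilted over a grid of sensitivity parameters. The tilt form, parametric nuisance models (no super learner here), and data are assumptions.

```python
# AIPW estimate of E[Y(1)] under a crude exponential tilt of the counterfactual predictions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(11)
n = 2_000
X = rng.normal(size=(n, 2))
e = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))        # true propensity score
T = rng.binomial(1, e)
Y = 1.0 + X[:, 0] + 0.5 * T + rng.normal(scale=1.0, size=n)

ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)

for gamma in (-0.2, 0.0, 0.2):
    # tilt the predicted Y(1) for untreated subjects (possible unmeasured confounding)
    m1_tilt = np.where(T == 1, m1, m1 * np.exp(gamma))
    mu1 = np.mean(T * Y / ps - (T - ps) / ps * m1_tilt)  # AIPW estimate of E[Y(1)]
    print(f"gamma = {gamma:+.1f}:  E[Y(1)] estimate = {mu1:.3f}")
```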

Model-assisted estimators have attracted a great deal of attention over the last three decades. Dagdoug et al. [Citation17] studied model-assisted estimators, such as linear regression and penalized estimators, from a design-based point of view in a high-dimensional setting. They conducted an extensive simulation study to evaluate the model-assisted estimators in terms of bias and efficiency in this high-dimensional setting.
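
A small sketch of a model-assisted (difference) estimator of a population total follows: a working model fitted on the sample predicts y for every population unit, and design-weighted residuals correct the prediction total. The penalized (ridge) working model and the simple random sampling design are illustrative choices in the spirit of Dagdoug et al. [Citation17], not their exact estimators.

```python
# Model-assisted difference estimator vs. Horvitz–Thompson under simple random sampling.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(12)
N, n, p = 10_000, 500, 50                        # population size, sample size, covariates
X_pop = rng.normal(size=(N, p))
y_pop = X_pop[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=N)

idx = rng.choice(N, size=n, replace=False)       # SRS without replacement
pi = n / N                                       # inclusion probabilities
X_s, y_s = X_pop[idx], y_pop[idx]

m = Ridge(alpha=10.0).fit(X_s, y_s)              # penalized working model
t_model_assisted = m.predict(X_pop).sum() + ((y_s - m.predict(X_s)) / pi).sum()
t_horvitz_thompson = (y_s / pi).sum()

print("true total:             ", y_pop.sum())
print("Horvitz–Thompson:       ", t_horvitz_thompson)
print("model-assisted (ridge): ", t_model_assisted)
```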

Missing data in high-dimensional settings should be handled properly in order to reduce nonresponse bias. Chen and Xu [Citation18] discussed modern machine learning techniques, including penalized regression approaches, tree-based approaches, and deep learning (DL), for handling missing data with high dimensionality. The proposed methods can be used for estimating parameters, including means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators.
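
The brief sketch below shows the three generic estimator types in a toy missing-at-random setting, with tree-based models as the nuisance learners; it is only an illustration of the estimator forms discussed by Chen and Xu [Citation18], with models and data chosen as assumptions.

```python
# Imputation, inverse-propensity (Hajek), and doubly robust estimators of a mean
# when the response is missing at random.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(13)
n = 3_000
X = rng.normal(size=(n, 3))
Y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=n)
pr_obs = 1 / (1 + np.exp(-(0.5 + X[:, 0])))      # response probability
R = rng.binomial(1, pr_obs)                      # R = 1 if Y is observed

m = RandomForestRegressor(random_state=0).fit(X[R == 1], Y[R == 1])
pi = RandomForestClassifier(random_state=0).fit(X, R).predict_proba(X)[:, 1]
pi = np.clip(pi, 0.05, 1.0)                      # avoid tiny estimated propensities
m_hat = m.predict(X)

mu_imp = m_hat.mean()                                    # imputation estimator
mu_ipw = np.mean(R * Y / pi) / np.mean(R / pi)           # propensity score (Hajek) estimator
mu_dr = np.mean(m_hat + R * (Y - m_hat) / pi)            # doubly robust estimator
print(f"imputation {mu_imp:.3f}  IPW {mu_ipw:.3f}  DR {mu_dr:.3f}  truth {Y.mean():.3f}")
```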

It is well known that multi-parametric MRI (mpMRI) is a critical tool in prostate cancer (PCa) diagnosis. Jin et al. [Citation19] developed a machine learning-based approach whose idea is to apply an ensemble learning procedure to capture the regional heterogeneity in the data, where classifiers are developed using the super learner algorithm and account for the between-voxel correlation. The benefit is that any type of classifier can serve as a base learner. They developed useful algorithms for binary PCa classification and for classifying the ordinal clinical significance of PCa, improving the detection of less prevalent cancer categories.
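
As a short stand-in for the ensembling idea only, scikit-learn's stacking classifier combines several base learners through a cross-validated meta-learner, which is the core principle behind a super learner. The base learners, meta-learner, and toy features below are illustrative assumptions; the multi-resolution, correlation-aware construction of Jin et al. [Citation19] is not reproduced.

```python
# Cross-validated stacking of heterogeneous base classifiers (super-learner-style ensembling).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),       # meta-learner fitted on out-of-fold predictions
    cv=5,
    stack_method="predict_proba",
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=3).mean())
```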

Acknowledgements

We would like to thank all of the people who have supported the creation of this special issue with JAS. First, we are very grateful to the Editor-in-Chief, Professor Jie Chen, for her full support from the establishment of the special issue through to its publication. Second, we acknowledge the kind help of Ms. Cindy Shen, the FSDM conference secretary, who enabled us to conceptualize the special issue. Third, we deeply appreciate the authors of each paper for their great contributions. Last but not least, our sincere appreciation goes to all the dedicated referees for their rigorous and very careful reviews.

References

  • L. Zhang, T. Zhu, and J.T. Zhang, Two-sample Behrens–Fisher problems for high-dimensional data: A normal reference scale-invariant test. J. Appl. Stat. 50 (2023), pp. 456–476.
  • D. Pustokhin, I. Pustokhina, P. Dinh, S. Phan, G. Nguyen, G. Joshi, and K. Shankar, An effective deep residual network based class attention layer with bidirectional LSTM for diagnosis and classification of COVID-19. J. Appl. Stat. 50 (2023), pp. 477–494.
  • M. Yuan and Q. Wen, A practical two-sample test for weighted random graphs. J. Appl. Stat. 50 (2023), pp. 495–511.
  • M. Zhao, X. Xu, Y. Zhu, K. Zhang, and Y. Zhou, Model estimation and selection for partial linear varying coefficient EV models with longitudinal data. J. Appl. Stat. 50 (2023), pp. 512–534.
  • M. Weber, J. Striaukas, M. Schumacher, and H. Binder, Regularized regression when covariates are linked on a network: The 3CoSE algorithm. J. Appl. Stat. 50 (2023), pp. 535–554.
  • X. Xie, J. Shi, and K. Song, A distributed multiple sample testing for massive data. J. Appl. Stat. 50 (2023), pp. 555–573.
  • A. Thielmann, C. Weisser, A. Krenz, and B. Säfken, Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling. J. Appl. Stat. 50 (2023), pp. 574–591.
  • F. Fan, S.-C. Chu, J.-S. Pan, C. Lin, and H. Zhao, An optimized machine learning technology scheme and its application in fault detection in wireless sensor networks. J. Appl. Stat. 50 (2023), pp. 592–609.
  • M. Zhang Wu, J. Luo, X. Fang, M. Xu, and P. Zhao, Modeling multivariate cyber risks: Deep learning dating extreme value theory. J. Appl. Stat. 50 (2023), pp. 610–630.
  • W. Zhang, C.O. Wu, X. Ma, X. Tian, and Q. Li, Analysis of multivariate longitudinal data using dynamic lasso-regularized copula models with application to large pediatric cardiovascular studies. J. Appl. Stat. 50 (2023), pp. 631–658.
  • X. Zhi, T. Yu, L. Bi, and Y. Li, Noise-insensitive discriminative subspace fuzzy clustering. J. Appl. Stat. 50 (2023), pp. 659–674.
  • I. Kang, C. Park, Y.J. Yoon, C. Park, S.-S. Kwon, and H. Choi, Classification of histogram-valued data with support histogram machines. J. Appl. Stat. 50 (2023), pp. 675–690.
  • X. Liu, G. Tian, and Z. Liu, Identification of novel genes for triple-negative breast cancer with semiparametric gene-based analysis. J. Appl. Stat. 50 (2023), pp. 691–702.
  • X. Zhi, J. Liu, S. Wu, and C. Niu, A generalized l2,p-norm regression based feature selection algorithm. J. Appl. Stat. 50 (2023), pp. 703–723.
  • Y. Cheng, Y. Li, M.L. Smith, C. Li, and Y. Shen, Analyzing evidence-based falls prevention data with significant missing information using variable selection after multiple imputation. J. Appl. Stat. 50 (2023), pp. 724–743.
  • M. Zhou and W. Yao, Sensitivity analysis of unmeasured confounding in causal inference based on exponential tilting and super learner. J. Appl. Stat. 50 (2023), pp. 744–760.
  • M. Dagdoug, C. Goga, and D. Haziza, Model-assisted estimation in high-dimensional settings for survey data. J. Appl. Stat. 50 (2023), pp. 761–785.
  • S. Chen and C. Xu, Handling high-dimensional data with missing values by modern machine learning techniques. J. Appl. Stat. 50 (2023), pp. 786–804.
  • J. Jin, L. Zhang, E. Leng, G.J. Metzger, and J.S. Koopmeiners, Multi-resolution super learner for voxel-wise classification of prostate cancer using multi-parametric MRI. J. Appl. Stat. 50 (2023), pp. 805–826.
