Research Article

Robust ranking by ensembling of diverse models and assessment metrics

Pages 77-102 | Received 20 Mar 2022, Accepted 21 Jun 2022, Published online: 10 Jul 2022
