Research Article

Robust ranking by ensembling of diverse models and assessment metrics

Pages 77-102 | Received 20 Mar 2022, Accepted 21 Jun 2022, Published online: 10 Jul 2022
