
Filter feature selectors in the development of binary QSAR models

Pages 313-345 | Received 29 Jan 2019, Accepted 25 Feb 2019, Published online: 21 May 2019

References

  • C. Helma, T. Cramer, S. Kramer, and L. De Raedt, Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds, J. Chem. Inf. Comput. Sci. 44 (2004), pp. 1402–1411.
  • G. Cerruela García, N. García-Pedrajas, I. Luque Ruiz, and M.Á. Gómez-Nieto, Molecular activity prediction by means of supervised subspace projection based ensembles of classifiers, SAR QSAR Environ. Res. 29 (2018), pp. 187–212.
  • M.K. Qasim, Z.Y. Algamal, and H.T.M. Ali, A binary QSAR model for classifying neuraminidase inhibitors of influenza A viruses (H1N1) using the combined minimum redundancy maximum relevancy criterion with the sparse support vector machine, SAR QSAR Environ. Res. 29 (2018), pp. 517–527.
  • M. Goodarzi, B. Dejaegher, and Y.V. Heyden, Feature selection methods in QSAR studies, J. AOAC Int. 95 (2012), pp. 636–651.
  • Z.Y. Algamal and M.H. Lee, A new adaptive L1-norm for optimal descriptor selection of high-dimensional QSAR classification model for anti-hepatitis C virus activity of thiourea derivatives, SAR QSAR Environ. Res. 28 (2017), pp. 75–90.
  • Z.Y. Algamal, M.K. Qasim, and H.T.M. Ali, A QSAR classification model for neuraminidase inhibitors of influenza A viruses (H1N1) based on weighted penalized support vector machine, SAR QSAR Environ. Res. 28 (2017), pp. 415–426.
  • X. Liu and H. Wang, A discretization algorithm based on a heterogeneity criterion, IEEE Trans. Knowl. Data Eng. 17 (2005), pp. 1166–1173.
  • S.-U. Guan, J. Liu, and Y. Qi, An incremental approach to contribution-based feature selection, Int. J. Intell. Syst. 13 (2004), pp. 15–42.
  • R.K. Sivagaminathan and S. Ramakrishnan, A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Syst. Appl. 33 (2007), pp. 49–60.
  • D. Newby, A.A. Freitas, and T. Ghafourian, Pre-processing feature selection for improved C&RT models for oral absorption, J. Chem. Inf. Model. 53 (2013), pp. 2730–2742.
  • Y. Liu, A comparative study on feature selection methods for drug discovery, J. Chem. Inf. Comput. Sci. 44 (2004), pp. 1823–1828.
  • J. Weston, F. Pérez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff, and B. Schölkopf, Feature selection and transduction for prediction of molecular bioactivity for drug design, Bioinformatics 19 (2003), pp. 764–771.
  • M.A. Demel, A.G.K. Janecek, W.N. Gansterer, and G.F. Ecker, Comparison of contemporary feature selection algorithms: Application to the classification of ABC‐transporter substrates, QSAR Comb. Sci. 28 (2009), pp. 1087–1091.
  • T.A. Alhaj, M.M. Siraj, A. Zainal, H.T. Elshoush, and F. Elhaj, Feature selection using information gain for improved structural-based alert correlation, PLoS One 11 (2016), p. e0166017.
  • A.M. Wassermann, B. Nisius, M. Vogt, and J. Bajorath, Identification of descriptors capturing compound class-specific features by mutual information analysis, J. Chem. Inf. Model. 50 (2010), pp. 1935–1940.
  • J.W. Godden and J. Bajorath, An information-theoretic approach to descriptor selection for database profiling and QSAR modeling, QSAR Comb. Sci. 22 (2003), pp. 487–497.
  • J.R. Vergara and P.A. Estévez, A review of feature selection methods based on mutual information, Neural Comput. Appl. 24 (2014), pp. 175–186.
  • D.C. Whitley, M.G. Ford, and D.J. Livingstone, Unsupervised forward selection: A method for eliminating redundant variables, J. Chem. Inf. Comput. Sci. 40 (2000), pp. 1160–1168.
  • D.W. Salt, L. Maccari, M. Botta, and M.G. Ford, Variable selection and specification of robust QSAR models from multicollinear data: Arylpiperazinyl derivatives with affinity and selectivity for α2-adrenoceptors, J. Comput.-Aided Mol. Des. 18 (2004), pp. 495–509.
  • I. Kononenko, Estimating attributes: Analysis and extensions of RELIEF, European Conference on Machine Learning, Catania, Italy, April 6–8, 1994.
  • A.S. Reddy, S. Kumar, and R. Garg, Hybrid-genetic algorithm based descriptor optimization and QSAR models for predicting the biological activity of Tipranavir analogs for HIV protease inhibition, J. Mol. Graph. 28 (2010), pp. 852–862.
  • M.A. Hall, Correlation-based feature selection of discrete and numeric class machine learning, Working Paper, Department of Computer Science, University of Waikato, New Zealand, 2000.
  • L. Yu and H. Liu, Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res. 5 (2004), pp. 1205–1224.
  • D. Veltri, U. Kamath, and A. Shehu, Improving recognition of antimicrobial peptides and target selectivity through machine learning and genetic programming, IEEE-ACM Trans. Comput. Biol. Bioinform. 14 (2017), pp. 300–313.
  • P. Mitra, C. Murthy, and S.K. Pal, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002), pp. 301–312.
  • Chembench repository. Carolina Exploratory Center for Cheminformatics Research, Chapel Hill, NC, 2016; software available at https://chembench.mml.unc.edu.
  • S.-Y. Choi, J.H. Shin, C.K. Ryu, K.-Y. Nam, K.T. No, and H.-Y. Park Choo, The development of 3D-QSAR study and recursive partitioning of heterocyclic quinone derivatives with antifungal activity, Bioorg. Med. Chem. 14 (2006), pp. 1608–1617.
  • V. Schattel, G. Hinselmann, A. Jahn, A. Zell, and S. Laufer, Modeling and benchmark data set for the inhibition of c-jun N-terminal kinase-3, J. Chem. Inf. Model. 51 (2011), pp. 670–679.
  • F. Hammann, C. Suenderhauf, and J. Huwyler, A binary ant colony optimization classifier for molecular activities, J. Chem. Inf. Model. 51 (2011), pp. 2690–2696.
  • S. Ekins, R. Pottorf, R.C. Reynolds, A.J. Williams, A.M. Clark, and J.S. Freundlich, Looking back to the future: Predicting in vivo efficacy of small molecules versus Mycobacterium tuberculosis, J. Chem. Inf. Model. 54 (2014), pp. 1070–1082.
  • J. Mohr, B. Jain, A. Sutter, A.T. Laak, T. Steger-Hartmann, N. Heinrich, and K. Obermayer, A maximum common subgraph kernel method for predicting the chromosome aberration test, J. Chem. Inf. Model. 50 (2010), pp. 1821–1838.
  • C.L. Russom, C.R. Williams, T.W. Stewart, A.E. Swank, and R. Am, DSSTox EPA Fathead Minnow Acute Toxicity Database (EPAFHM): SDF files and documentation, version: EPAFHM_v4b_617_15Feb2008, 2008; software available at www.epa.gov/ncct/dsstox/sdf_epafhm.html.
  • F. Fontaine, M. Pastor, I. Zamora, and F. Sanz, Anchor−GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent descriptors, J. Med. Chem. 48 (2005), pp. 2687–2694.
  • B. Gaüzère, L. Brun, and D. Villemin, Two new graphs kernels in chemoinformatics, Pattern Recognit. Lett. 33 (2012), pp. 2038–2047.
  • G. Subramanian, B. Ramsundar, V. Pande, and R.A. Denny, Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches, J. Chem. Inf. Model. 56 (2016), pp. 1936–1949.
  • I.F. Martins, A.L. Teixeira, L. Pinheiro, and A.O. Falcao, A Bayesian approach to in silico blood–brain barrier penetration modeling, J. Chem. Inf. Model. 52 (2012), pp. 1686–1697.
  • K.M. Gayvert, N.S. Madhukar, and O. Elemento, A data-driven approach to predicting successes and failures of clinical trials, Cell Chem. Biol. 23 (2016), pp. 1294–1301.
  • V. Poongavanam, N. Haider, and G.F. Ecker, Fingerprint-based in silico models for the prediction of P-glycoprotein substrates and inhibitors, Bioorg. Med. Chem. 20 (2012), pp. 5388–5395.
  • A. Dalby, J.G. Nourse, W.D. Hounshell, A.K.I. Gushurst, D.L. Grier, B.A. Leland, and J. Laufer, Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci. 32 (1992), pp. 244–255.
  • Open-Source Cheminformatics Software (RDKit). 2019. Available at https://www.rdkit.org/docs/.
  • Chemistry Development Kit (CDK). 2019. Available at https://cdk.github.io/.
  • K. Varmuza, P. Filzmoser, M. Hilchenbach, H. Krüger, and J. Silén, KNN classification — Evaluated by repeated double cross validation: Recognition of minerals relevant for comet dust, Chemom. Intell. Lab. Syst. 138 (2014), pp. 64–71.
  • H. Ishibuchi and Y. Nojima, Repeated double cross-validation for choosing a single solution in evolutionary multi-objective fuzzy classifier design, Knowledge-Based Syst. 54 (2013), pp. 22–31.
  • P. Filzmoser, B. Liebmann, and K. Varmuza, Repeated double cross validation, J. Chemom. 23 (2009), pp. 160–171.
  • A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn. 30 (1997), pp. 1145–1159.
  • J.A. Hanley and B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1982), pp. 29–36.
  • Q. Song, J. Ni, and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Trans. Knowl. Data Eng. 25 (2013), pp. 1–14.
  • X. He, D. Cai, and P. Niyogi, Laplacian score for feature selection, Proceedings of the 18th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2005.
  • C. Huertas and R. Juárez-Ramírez, Filter feature selection performance comparison in high-dimensional data: A theoretical and empirical analysis of most popular algorithms, 17th International Conference on Information Fusion, Salamanca, Spain, 2014.
  • H. Almuallim and T.G. Dietterich, Learning with many irrelevant features, AAAI Conference, Anaheim, California, 1991.
  • H. Almuallim and T. Dietterich, Efficient algorithms for identifying relevant features, 9th Canadian Conference on Artificial Intelligence, Vancouver, Canada, 1992.
  • Y. Yang and J.O. Pedersen, A comparative study on feature selection in text categorization, ICML, Nashville, Tennessee, 1997.
  • H. Liu and R. Setiono, Scalable feature selection for large sized databases, Proceedings of the Fourth World Congress on Expert Systems, Mexico City, Mexico, 1998.
  • T.M. Cover and J.A. Thomas, Elements of Information Theory, John Wiley & Sons, New Jersey, 2012.
  • M. Dash, Feature selection via set cover, Knowledge and Data Engineering Exchange Workshop, Newport Beach, CA, 1997.
  • D.S. Johnson, Approximation algorithms for combinatorial problems, J. Comput. Syst. Sci. 9 (1974), pp. 256–278.
  • C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011), pp. 1–27.
  • J.R. Quinlan, Improved use of continuous attributes in C4.5, J. Artif. Intell. Res. 4 (1996), pp. 77–90.
  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006), pp. 1–30.
  • R.L. Iman and J.M. Davenport, Approximations of the critical region of the Friedman statistic, Commun. Stat.-Theory Meth. 9 (1980), pp. 571–595.
  • S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Stat. (1979), pp. 65–70.
  • P.B. Nemenyi, Distribution-Free Multiple Comparisons, Princeton University, New Jersey, 1963.
