Editorial

How far have decision tree models come for data mining in drug discovery?

Pages 1067-1069 | Received 21 Aug 2018, Accepted 15 Oct 2018, Published online: 23 Oct 2018

1. Introduction

Machine learning (ML) methods assist in drug discovery mostly by way of data mining in virtual screening (VS). If the target is sufficiently characterized, say by knowledge of its three-dimensional (3D) structure or gene sequence, we can take a structure-based VS (SBVS) approach and run molecular docking and dynamics, or 3D-similarity matching experiments. More often, however, we only know a set of molecular structures and their biological activities, and so we may perform a ligand-based VS (LBVS) [Citation1]. The results of an LBVS, which are computationally much less expensive to obtain than those of an SBVS, can be the basis of chemical database queries and, optimally, at an early stage of drug discovery enhance our understanding of how a molecule’s action may come about (hypothesis generation). The concept of decision trees is well suited for this.

2. Divide and conquer

The decision tree induction (DTI) paradigm is one of the staples of ML. By splitting a dataset into increasingly homogeneous subsets through rule-based decisions (a ‘divide and conquer’ approach), it acts as a classifier that not only mimics human decision-making but can also be easily visualized in the form of a tree, making it easy to convey and interpret results. In its most basic iteration, DTI provides binary classification, but modifications exist that allow for numerical and regression predictions. Because of these properties and their widespread implementation in easy-to-use software packages, we have seen many applications of DTI for endpoints relevant to drug discovery, such as absorption, distribution, metabolism, and elimination (ADME), and toxicity [Citation2]. The roots of DTI in drug design can be traced back to the 1970s, with pioneering work from Topliss and his manually deduced ‘operational schemes for analog synthesis in drug design’ which are – in essence – decision trees [Citation3,Citation4]. Ever since then, DTI has been in use for questions relevant to drug discovery and design. In 2000, Wagener and van Geerestein developed a DTI model to distinguish between potential drugs and nondrugs by identifying important structural features [Citation5]. One more recent example is a study from 2015 in which Su et al. employed an approach based on C5.0, which is a variation of the classic DTI algorithm C4.5 [Citation6] and uses an information entropy approach to find the attributes that best classify the data at each given level. They established rule sets for structural features related to inhibition of five major cytochrome P450 (CYP) enzymes (CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4) from training datasets of over 10,000 substances [Citation7]. Another study by Newby et al. in 2015 used DTI to define the roles of permeability and solubility in the prediction of oral absorption in the lower intestines [Citation8]. In drug discovery, applying the rule sets of such decision trees to predict, e.g., metabolism via CYP enzymes or oral absorption helps to identify promising new compounds. For more historical perspective, the reader is referred to prior reviews [Citation9–Citation11].
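The entropy-based choice of a splitting attribute described above can be illustrated with a minimal sketch. It computes plain information gain for categorical attributes; C4.5 itself uses a normalized variant (the gain ratio), and nothing here reproduces the C5.0 implementation. The attribute names and the four-molecule dataset are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets created by the split.
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: two binary structural flags per molecule.
molecules = [
    {"has_basic_amine": True,  "is_lipophilic": True},
    {"has_basic_amine": True,  "is_lipophilic": False},
    {"has_basic_amine": False, "is_lipophilic": True},
    {"has_basic_amine": False, "is_lipophilic": False},
]
inhibitor = ["yes", "yes", "no", "no"]  # hypothetical class labels

# The attribute with the highest gain becomes the rule at this tree level.
gains = {a: information_gain(molecules, inhibitor, a) for a in molecules[0]}
best_attribute = max(gains, key=gains.get)
print(best_attribute, gains)
```

In this toy case the first flag separates the two classes completely, so it would be chosen as the root rule; a tree learner would then repeat the same calculation within each resulting subset.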

There are important limitations to DTI that can hamper its use for the high volumes of data often encountered in data mining within the pharmaceutical sciences (e.g. from ChEMBL, National Center for Biotechnology Information, or PDB [Citation12–Citation14]). For instance, because the structure of the final model depends to a great degree on the training dataset, seemingly innocuous changes to these data can result in rather different models. This is due to the decreasing influence rules have as the process proceeds downstream [Citation15]. Smaller datasets exhibit this behavior more readily than larger ones, but larger ones are prone to overfitting even if the number of predictive attributes is carefully selected, trees are pruned, and cross-validation measures are in place. While it is impossible to place precise limits on the number of instances (e.g. molecules) for which DTI paradigms yield the best results, personal experience suggests that datasets in the hundreds to low thousands produce robust models that perform well in cross-validated settings.

3. Strength in numbers

Although DTI models are fast and can successfully predict even noisy and complex clinical endpoints, it is important to be wary of their susceptibility to errors due to variance. When very large datasets are being analyzed, as is increasingly becoming the case with the massive amounts of data generated from high-throughput assays, or if the data are highly skewed – e.g. when only a few molecules show the desired property (positive controls) in a dataset with a larger proportion of negative controls – ensemble variations of classic DTI may give the more robust model and are seeing increasing use. In an ensemble predictor, many ‘weaker’ predictors can be combined to give a single ‘stronger’ predictor that represents the final hypothesis, and these are finding more and more acceptance in drug discovery applications.
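Why combining many ‘weaker’ predictors can yield a ‘stronger’ one may be illustrated with a small calculation. The sketch below assumes a binary endpoint, an odd number of voters, and, importantly, that the individual predictors err independently of each other, which real tree ensembles only approximate. The function name and the accuracy value of 0.6 are illustrative assumptions.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent binary classifiers,
    each correct with probability p, votes for the correct class."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# Hypothetical weak learner that is right 60% of the time.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the consensus accuracy increases with the number of voters. Correlated errors between trees reduce this benefit, which is one motivation for the resampling and randomization steps used in ensemble methods.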

Random Forests (RF) and Decision Forests (DF) are two examples that differ only modestly in their implementation. Both are consensus modeling techniques. The probably better known of the two, RF, creates a large number of shallow trees by resampling with replacement from the training data, while DF, distributed by the National Center for Toxicological Research (NCTR) of the Food and Drug Administration, does the opposite: it relies on a small sample of deep, complex trees [Citation16]. A recent case study by the NCTR reported on a model for drug-induced liver injury (DILI) using DF with fivefold cross-validation (CV) on 1241 compounds that predicted the trinary endpoint (‘most-DILI’, ‘less-DILI’, and ‘no-DILI’) with an average accuracy of 72.9% [Citation17]. This is nearly identical to results from a study published in 2017 using support vector machines (SVM), RF, and 10-fold CV for the trinary classification of 721 drugs that reported accuracies of 73.8% and 72.6% [Citation18]. Recently, RF and DTI algorithms were used to model OATP1B1 inhibition in a dataset of 1203 molecules, where they were on par with SVM in terms of predictive accuracy [Citation19]. Furthermore, RF was used in 2017 to predict human intestinal absorption in a dataset of 970 substances with a high accuracy of over 90% [Citation20].
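The resampling with replacement that underlies RF can be sketched in a few lines. The example only shows how one bootstrap sample is drawn for a single tree and which instances are left out (‘out-of-bag’); actual RF implementations additionally randomize the attributes considered at each split. The dataset size, seed, and function name are hypothetical.

```python
import random

def bootstrap_sample(n_items, rng):
    """Draw n_items indices with replacement.

    Returns the in-bag indices (duplicates possible) and the sorted
    out-of-bag indices that were never drawn."""
    in_bag = [rng.randrange(n_items) for _ in range(n_items)]
    out_of_bag = sorted(set(range(n_items)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)   # fixed seed, only for reproducibility
n_molecules = 1000       # hypothetical dataset size
in_bag, out_of_bag = bootstrap_sample(n_molecules, rng)

# Each tree of the forest would be trained on its own in-bag sample.
oob_fraction = len(out_of_bag) / n_molecules
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

For large datasets roughly a third of the instances (about 1/e) are expected to fall out-of-bag for any given tree, so they can serve as an internal test set for that tree.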

Aside from outright prediction, RF can work as a powerful feature selector, e.g. to automatically condense a large set of molecular descriptors into a smaller ranked list of the most predictive attributes. A recent study showed how an RF-based approach to feature selection gave stronger classification results than SVM and artificial neural networks (ANN) with manually selected features on small molecule drugs [Citation21].

Adaptive Boosting (AdaBoost), though not strictly a DTI ensemble predictor, is often used with DTI as the ‘weak’ learner. In 2018, Afolabi et al. used AdaBoost with DTI classifiers for the prediction of on-target activities in three mid-size datasets (5083–8568 instances) from the MDL Drug Data Report database, and saw consistently better results compared to SVM models [Citation22].
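How AdaBoost combines decision trees as ‘weak’ learners can be shown with a minimal sketch that uses one-level trees (decision stumps) on a single numeric attribute. It follows the standard discrete AdaBoost weight update and is not the setup used by Afolabi et al.; the six data points, the attribute, and all function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """One-level decision tree: predicts polarity if x >= threshold."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train an ensemble of weighted stumps; labels must be +1 or -1."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified instances gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: actives (+1) at both ends of a descriptor range.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
print([predict(single, x) for x in xs])
print([predict(boosted, x) for x in xs])
```

No single threshold can separate this pattern, so one stump necessarily misclassifies some instances, whereas three boosted stumps reproduce all training labels in this toy case.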

4. Going deeper

Cheaper computational power has recently led to a renaissance of ANN in the form of deep learning (DL), two terms that are now often used in conjunction with each other. This is already being applied to the high-volume datasets encountered in drug discovery [Citation23]. In 2015, authors at Merck contrasted DL and RF for 15 large in-house datasets from a Kaggle [Citation24] competition they sponsored in 2012, with endpoints spanning ADME (e.g. CYP inhibition, plasma protein binding) and on-target activity (e.g. angiotensin-II receptor interaction), and found DL to be superior in up to 13 of these endpoints [Citation25]. A more exhaustive analysis of other datasets for ADME and on-target activities ranked DL over SVM and a set of ‘classical’ ML algorithms, of which DTI-based paradigms were represented by AdaBoost DTI and RF, by comparing spider plots of different metrics of predictive power [Citation26].

Although the DL concept of stacked algorithms representing hierarchical layers of representation of data is commonly implemented with ANN, there are already efforts being made to integrate other learners such as DTI or RF directly into DL setups [Citation27]. This does not seem to have been applied to drug discovery yet, although successful ensemble implementations exist in toxicology. One example is the DeepTox pipeline, the grand challenge winner in the 2014 Tox21 Data Challenge hosted by the National Center for Advancing Translational Sciences [Citation28]. Data consisted of over 12,000 drugs and environmental chemicals with up to 12 different toxicity endpoints. The DeepTox pipeline, while relying primarily on DL ANN, used RF, SVM, and elastic nets to supplement the DL models [Citation29].

5. Expert opinion

Although the paradigm has been around for decades, DTI in its many different guises has managed to remain relevant in drug discovery. While single-tree DTI algorithms such as C5.0, Classification and Regression Trees, or Chi-squared Automatic Interaction Detector are giving way to more complex ensemble setups, they are versatile, fast, and easy-to-handle learners that can drive hypothesis generation, generate intuitively understandable visualizations, and are often sufficient in small to midsize datasets. All of this makes the DTI family of learners an attractive tool for (quantitative) structure-activity relationships ((Q)SAR) that can even be used by researchers outside of the ML community. The investment is low due to the multitude of often free software packages with intricate graphical user interfaces (e.g. the Weka collection of data mining software [Citation30] or the Konstanz Information Miner [Citation31]). Of course, a basic knowledge of data mining and ML principles is necessary for users to build robust models and avoid drawing the wrong conclusions because models were misspecified.

Still, ensemble and consensus learners such as RF or AdaBoost are becoming more widespread, both in predictive model building and as an adjunct tool for feature selection. Even though tree-based algorithms are sometimes outperformed by other learners, one of their main advantages is the prioritization of the predictors, which allows deducing general rule sets that might also be applied without using the whole model or in dimensionality reduction and feature selection.

In the face of the rise of DL and big data for drug discovery, DTI has yet to carve out its niche within this subdomain of ML. It is unlikely that single-tree learners will play a role here by themselves, but ensemble predictors are well up to the task of handling big data problems. DL workflows mostly employ ANN (deep nets), but efforts outside of drug discovery have shown that DTI can act as one of the weak learners in such a setup. DL at its heart is multilevel learning, where raw data are successively integrated into more complex features to predict the actual outcome of interest. In drug discovery, this could start with physicochemical properties such as lipophilicity, proceed to cell membrane passage and target engagement, and, finally, arrive at central nervous system activity. Individual subproblems can be structured very differently, and different weak learners may be more capable of describing certain features than others.

For VSs, we will in all likelihood see more and more technically involved workflows and algorithms, although the success of DL will depend on our ability to provide it with high-quality data (a considerable feat given that, by their very nature, huge data stores cannot be surveyed and curated by human experts in as much detail as the small to midsize datasets that are the substrate for more traditional ML studies). Such models, for all their predictive power, will have one major drawback: they are black boxes, even to ML experts. Single decision trees will probably remain the most powerful tool for machine-assisted hypothesis generation.

Declaration of interest

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer Disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

This manuscript was not funded.

References

  • Lima AN, Philot EA, Trossini GH, et al. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov. 2016;11(3):225–239. PubMed PMID: 26814169.
  • Hammann F, Drewe J. Decision tree models for data mining in hit discovery. Expert Opin Drug Discov. 2012 Apr;7(4):341–352. PubMed PMID: 22458505.
  • Topliss JG. Utilization of operational schemes for analog synthesis in drug design. J Med Chem. 1972 Oct;15(10):1006–1011. PubMed PMID: 5069767.
  • Topliss JG. A manual method for applying the Hansch approach to drug design. J Med Chem. 1977 Apr;20(4):463–469. PubMed PMID: 321782.
  • Wagener M, van Geerestein VJ. Potential drugs and nondrugs: prediction and identification of important structural features. J Chem Inf Comput Sci. 2000 Mar;40(2):280–292. PubMed PMID: 10761129.
  • Quinlan JR. C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc; San Francisco, California, USA. 1993.
  • Su BH, Tu YS, Lin C, et al. Rule-based prediction models of cytochrome P450 inhibition. J Chem Inf Model. 2015 Jul 27;55(7):1426–1434. PubMed PMID: 26108525.
  • Newby D, Freitas AA, Ghafourian T. Decision trees to characterise the roles of permeability and solubility on the prediction of oral absorption. Eur J Med Chem. 2015 Jan 27;90:751–765.
  • Blower PE, Cross KP. Decision tree methods in pharmaceutical research. Curr Top Med Chem. 2006;6(1): 31–39. PubMed PMID: 16454756.
  • Gertrudes JC, Maltarollo VG, Silva RA, et al. Machine learning techniques and drug design. Curr Med Chem. 2012;19(25):4289–4297. PubMed PMID: 22830342.
  • Zhang L, Tan J, Han D, et al. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today. 2017 Nov;22(11):1680–1685. PubMed PMID: 28881183.
  • ChEMBL. [Cited 2018 Aug 01]. Available from: https://www.ebi.ac.uk/chembl/
  • The PubChem Project. [Cited 2018 Aug 01]. Available from: https://pubchem.ncbi.nlm.nih.gov/
  • RCSB Protein Data Bank. [Cited 2018 Aug 01]. Available from: https://www.rcsb.org/
  • Lavecchia A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today. 2015 Mar;20(3):318–331. PubMed PMID: 25448759.
  • Decision Forest. [Cited 2018 Jul 30]. Available from: https://www.fda.gov/ScienceResearch/BioinformaticsTools/DecisionForest/default.htm
  • Hong H, Thakkar S, Chen M, et al. Development of decision forest models for prediction of drug-induced liver injury in humans using a large set of FDA-approved drugs. Sci Rep. 2017 Dec 11;7(1):17311.
  • Kim E, Nam H. Prediction models for drug-induced hepatotoxicity by using weighted molecular fingerprints. BMC Bioinformatics. 2017 May 31;18(Suppl 7):227. PubMed PMID: 28617228; PubMed Central PMCID: PMC5471939.
  • Danielson ML, Sawada GA, Raub TJ, et al. In silico and in vitro assessment of OATP1B1 inhibition in drug discovery. Mol Pharm. 2018;15:3060–3068.
  • Wang N-N, Huang C, Dong J, et al. Predicting human intestinal absorption with modified random forest approach: a comprehensive evaluation of molecular representation, unbalanced data, and applicability domain issues. RSC Adv. 2017;7(31):19007–19018.
  • Cano G, Garcia-Rodriguez J, Garcia-Garcia A, et al. Automatic selection of molecular descriptors using random forest: application to drug discovery. Expert Syst Appl. 2017;72:151–159.
  • Afolabi LT, Saeed F, Hashim H, et al. Ensemble learning method for the prediction of new bioactive molecules. PLoS One. 2018;13(1):e0189538. PubMed PMID: 29329334; PubMed Central PMCID: PMC5766097.
  • Chen H, Engkvist O, Wang Y, et al. The rise of deep learning in drug discovery. Drug Discov Today. 2018 Jun 01;23(6):1241–1250.
  • Kaggle. [Cited 2018 Aug 01]. Available from: http://www.kaggle.com
  • Ma J, Sheridan RP, Liaw A, et al. Deep neural nets as a method for quantitative structure–activity relationships. J Chem Inf Model. 2015;55(2):263–274.
  • Korotcov A, Tkachenko V, Russo DP, et al. Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol Pharm. 2017 Dec 04;14(12):4462–4475.
  • Kontschieder P, Fiterau M, Criminisi A, et al. Deep neural decision forests. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile. 2015.
  • Tox21 Data Challenge 2014. [Cited 2018 Aug 01]. Available from: https://tripod.nih.gov/tox21/challenge
  • Mayr A, Klambauer G, Unterthiner T, et al. DeepTox: toxicity prediction using deep learning. Front Environ Sci. 2016 Feb 02;3:80.
  • Weka 3: Data Mining Software in Java. [Cited 2018 Aug 01]. Available from: https://www.cs.waikato.ac.nz/ml/weka/
  • Konstanz Information Miner (KNIME). [Cited 2018 Aug 01]. Available from: http://www.knime.com
