Expert Opinion on Drug Discovery Volume 13, 2018 - Issue 12

Editorial

How far have decision tree models come for data mining in drug discovery?

Verena Schöning, Division of Clinical Pharmacology and Toxicology, Department of Clinical Research and Department of Internal Medicine, University Hospital Basel, University of Basel, Basel, Switzerland

Felix Hammann (corresponding author), Division of Clinical Pharmacology and Toxicology, Department of Clinical Research and Department of Internal Medicine, University Hospital Basel, University of Basel, Basel, Switzerland

Pages 1067-1069 | Received 21 Aug 2018, Accepted 15 Oct 2018, Published online: 23 Oct 2018

https://doi.org/10.1080/17460441.2018.1538208

KEYWORDS:

QSAR
decision tree induction
random forest
decision forest
deep learning

1. Introduction

Machine learning (ML) methods assist in drug discovery mostly by way of data mining in virtual screening (VS). If the target is sufficiently characterized, say by knowledge of its three-dimensional (3D) structure or gene sequence, we can take a structure-based VS (SBVS) approach and run molecular docking and dynamics, or 3D-similarity matching experiments. More often, however, we only know a set of molecular structures and their biological activities, and so we may perform a ligand-based VS (LBVS) [Citation1]. The results of an LBVS, which are computationally much less expensive to obtain than those of an SBVS, can be the basis of chemical database queries and, optimally, at an early stage of drug discovery enhance our understanding of how a molecule’s action may come about (hypothesis generation). The concept of decision trees is well suited for this.

2. Divide and conquer

The decision tree induction (DTI) paradigm is one of the staples of ML. By splitting a dataset into increasingly homogeneous subsets through rule-based decisions (a ‘divide and conquer’ approach), it acts as a classifier that not only mimics human decision-making but can also be easily visualized in the form of a tree, making it easy to convey and interpret results. In its most basic iteration, DTI provides binary classification, but modifications exist that allow for numerical and regression predictions. Because of these properties and the widespread implementation of DTI in easy-to-use software packages, we have seen many applications of DTI for endpoints relevant to drug discovery, such as absorption, distribution, metabolism, and elimination (ADME), and toxicity [Citation2]. The roots of the approach can be traced back to the 1970s, with pioneering work from Topliss and his manually deduced ‘operational schemes for analog synthesis in drug design’ which are – in essence – decision trees [Citation3,Citation4]. Ever since then, DTI has been in use for questions relevant to drug discovery and design. In 2000, Wagener and van Geerestein developed a DTI model to distinguish between potential drugs and nondrugs by identifying important structural features [Citation5]. One more recent example is a study from 2015 in which Su et al. employed an approach based on C5.0, which is a variation of the classic DTI algorithm C4.5 [Citation6] and uses an information entropy approach to find the attributes that best classify the data at each given level. They established rule sets for structural features related to inhibition of five major cytochrome P450 (CYP) enzymes (CYP1A2, CYP2C19, CYP2C9, CYP2D6, and CYP3A4) from training datasets of over 10,000 substances [Citation7]. Another study by Newby et al. in 2015 used DTI to define the roles of permeability and solubility on the prediction of oral absorption in the lower intestines [Citation8]. In drug discovery, applying the rule sets of such decision trees to predict, for example, metabolism via CYP enzymes or oral absorption helps to identify promising new compounds. For more historical perspective, the reader is referred to prior reviews [Citation9–Citation11].
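To make the ‘information entropy approach’ concrete, the following is a minimal sketch of how a tree learner might score candidate splits at a single node. It uses plain information gain (as in ID3); C4.5 additionally normalizes the gain by the split information (gain ratio), which is omitted here. The descriptor names and the four toy molecules are hypothetical and are not taken from any of the cited studies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the child nodes after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Attribute with the highest information gain at this node."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: two made-up binary descriptors per molecule.
molecules = [
    {"has_aromatic_amine": True, "logp_above_3": True},
    {"has_aromatic_amine": True, "logp_above_3": False},
    {"has_aromatic_amine": False, "logp_above_3": True},
    {"has_aromatic_amine": False, "logp_above_3": False},
]
is_inhibitor = [True, True, False, False]

gains = {a: information_gain(molecules, is_inhibitor, a) for a in molecules[0]}
root_split = best_attribute(molecules, is_inhibitor)
```

In this toy case the first descriptor separates the classes completely, so it would be chosen as the root split; a full learner would repeat the same scoring recursively on each resulting subset.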

There are important limitations to DTI that can hamper its use for the high volumes of data often encountered in data mining within the pharmaceutical sciences (e.g. from ChEMBL, the National Center for Biotechnology Information, or PDB [Citation12–Citation14]). For instance, because the structure of the final model depends to a great degree on the training dataset, seemingly innocuous changes to these data can result in rather different models. This is due to the decreasing influence rules have as the process proceeds downstream [Citation15]. Smaller datasets exhibit this behavior more readily than larger ones, but larger ones are prone to overfitting even if the number of predictive attributes is carefully selected, trees are pruned, and cross-validation measures are in place. While it is impossible to place precise limits on the number of instances (e.g. molecules) for which DTI paradigms yield the best results, personal experience shows that datasets in the hundreds to low thousands produce robust models that perform well in cross-validated settings.

3. Strength in numbers

Although DTI models are fast and can successfully predict even noisy and complex clinical endpoints, it is important to be wary of their susceptibility to errors due to variance. When very large datasets are being analyzed, as is increasingly becoming the case with the massive amounts of data generated from high-throughput assays, or if the data are highly skewed – e.g. when only a few molecules show the desired property (positive controls) in a dataset with a larger proportion of negative controls – ensemble variations of classic DTI may give the more robust model and are seeing increasing use. In an ensemble predictor, many ‘weaker’ predictors can be combined to give a single ‘stronger’ predictor that represents the final hypothesis, and these are finding more and more acceptance in drug discovery applications.

Random Forests (RF) and Decision Forests (DF) are two examples that differ only modestly in their implementation. Both are consensus modeling techniques. Probably the better known of the two, RF, creates a large number of shallow trees by resampling with replacement from the training data, while DF, distributed by the National Center for Toxicological Research (NCTR) of the Food and Drug Administration, does the opposite: it relies on a small sample of deep, complex trees [Citation16]. A recent case study by the NCTR reported on a model for drug-induced liver injury (DILI) using DF with fivefold cross-validation in 1241 compounds that predicted the trinary endpoint (‘most-DILI’, ‘less-DILI’, and ‘no-DILI’) with an average accuracy of 72.9% [Citation17]. This is nearly identical to results from a study published in 2017 that used support vector machines (SVM), RF, and 10-fold cross-validation for the trinary classification of 721 drugs and reported accuracies of 73.8% and 72.6% [Citation18]. Recently, RF and DTI algorithms were used to model OATP1B1 inhibition in a dataset of 1203 molecules, where they were on par with SVM in terms of predictive accuracy [Citation19]. Furthermore, RF was used in 2017 to predict human intestinal absorption in a dataset of 970 substances with a high accuracy of over 90% [Citation20].
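The consensus idea can be illustrated with a small sketch of bootstrap aggregation: each tree is trained on a sample drawn with replacement from the training data, and the ensemble predicts by majority vote. This is only the resampling component of RF (a full RF also draws a random subset of descriptors at each split), and it is not the DF algorithm. The single numeric descriptor, the activity labels, and all function names below are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """One-split tree on a single numeric descriptor; returns a predict function."""
    values = sorted(set(xs))
    if len(values) < 2 or len(set(ys)) < 2:
        constant = majority(ys)  # nothing to split on in this sample
        return lambda x: constant
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = majority([y for x, y in zip(xs, ys) if x <= threshold])
        right = majority([y for x, y in zip(xs, ys) if x > threshold])
        errors = sum((left if x <= threshold else right) != y
                     for x, y in zip(xs, ys))
        if best is None or errors < best[0]:
            best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def fit_bagged_ensemble(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Consensus prediction: majority vote over all trees."""
    return majority([tree(x) for tree in trees])


# Hypothetical toy data: one numeric descriptor, 0 = inactive, 1 = active.
descriptor = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
active = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

ensemble = fit_bagged_ensemble(descriptor, active)
```

Because every tree sees a slightly different sample, individual split points vary, but the vote across trees is much less sensitive to any single resampled dataset than one tree would be.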

Aside from outright prediction, RF can work as a powerful feature selector, e.g. to automatically condense a large set of molecular descriptors into a smaller ranked list of the most predictive attributes. A recent study showed how an RF-based approach to feature selection gave stronger classification results than SVM and artificial neural networks (ANN) with manually selected features on small molecule drugs [Citation21].

Adaptive Boosting (AdaBoost), though not strictly a DTI ensemble predictor, is often used with DTI as the ‘weak’ learner. In 2018, Afolabi et al. used AdaBoost with DTI classifiers for the prediction of on-target activities in three mid-size datasets (5083–8568 instances) from the MDL Drug Data Report database, and saw consistently better results compared to SVM models [Citation22].
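As an illustration of boosting with a tree-based weak learner, the sketch below implements discrete AdaBoost with one-split trees (decision stumps) on a single numeric descriptor. It is a minimal sketch under simplifying assumptions, not the setup used by Afolabi et al.; the ten data points, the +1/-1 label encoding, and the function names are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: `polarity` at or below the threshold, its opposite above."""
    return polarity if x <= threshold else -polarity


def fit_adaboost(xs, ys, n_rounds=3):
    """Discrete AdaBoost with stumps on one descriptor; labels are +1 / -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    model = []  # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted training error.
        best = None
        for threshold in thresholds:
            for polarity in (1, -1):
                error = sum(w for x, y, w in zip(xs, ys, weights)
                            if stump_predict(x, threshold, polarity) != y)
                if best is None or error < best[0]:
                    best = (error, threshold, polarity)
        error, threshold, polarity = best
        error = min(max(error, 1e-12), 1 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, threshold, polarity))
        # Up-weight the instances this stump got wrong, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single threshold can separate.
descriptor = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
active = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = fit_adaboost(descriptor, active, n_rounds=1)
boosted = fit_adaboost(descriptor, active, n_rounds=3)
single_hits = sum(predict(single, x) == y for x, y in zip(descriptor, active))
boosted_hits = sum(predict(boosted, x) == y for x, y in zip(descriptor, active))
```

On this toy set a single stump classifies 7 of the 10 points correctly, whereas the weighted vote of three boosted stumps fits all 10, which is the sense in which several ‘weak’ learners combine into a ‘stronger’ one.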

4. Going deeper

Cheaper computational power has recently led to a renaissance of ANN in the form of deep learning (DL), two terms that are now often used in conjunction with each other. DL is already being applied to the high-volume datasets encountered in drug discovery [Citation23]. In 2015, authors at Merck contrasted DL and RF for 15 large in-house datasets from a Kaggle [Citation24] competition they sponsored in 2012, with endpoints spanning ADME (e.g. CYP inhibition, plasma protein binding) and on-target activity (e.g. angiotensin-II receptor interaction), and found DL to be superior in up to 13 of these endpoints [Citation25]. A more exhaustive analysis of other datasets for ADME and on-target activities ranked DL over SVM and a set of ‘classical’ ML algorithms, of which DTI-based paradigms were represented by AdaBoost DTI and RF, by comparing spider plots of different metrics of predictive power [Citation26].

Although the DL concept of stacked algorithms representing hierarchical layers of representation of data is commonly implemented with ANN, efforts are already being made to integrate other learners such as DTI or RF directly into DL setups [Citation27]. This does not yet seem to have been applied to drug discovery, although successful ensemble implementations exist in toxicology. One example is the DeepTox pipeline, the grand challenge winner in the 2014 Tox21 Data Challenge hosted by the National Center for Advancing Translational Sciences [Citation28]. Data consisted of over 12,000 drugs and environmental chemicals with up to 12 different toxicity endpoints. The DeepTox pipeline, while relying primarily on DL ANN, used RF, SVM, and elastic nets to supplement the DL models [Citation29].

5. Expert opinion

Although the paradigm has been around for decades, DTI in its many different guises has managed to remain relevant in drug discovery. While single-tree DTI algorithms such as C5.0, Classification and Regression Trees, or Chi-squared Automatic Interaction Detector are giving way to more complex ensemble setups, they are versatile, fast, and easy-to-handle learners that can drive hypothesis generation, generate intuitively understandable visualizations, and are often sufficient for small to midsize datasets. All of this makes the DTI family of learners an attractive tool for (quantitative) structure-activity relationships ((Q)SAR) that can even be used by researchers outside of the ML community. The investment is low due to the multitude of often free software packages with intricate graphical user interfaces (e.g. the Weka collection of data mining software [Citation30] or the Konstanz Information Miner [Citation31]). Of course, a basic knowledge of data mining and ML principles is necessary for users to build robust models and avoid drawing the wrong conclusions because models were misspecified.

Still, ensemble and consensus learners such as RF or AdaBoost are becoming more widespread, both in predictive model building and as an adjunct tool for feature selection. Even though tree-based algorithms are sometimes outperformed by other learners, one of their main advantages is the prioritization of the predictors, which allows deducing general rule sets that might also be applied without using the whole model or in dimensionality reduction and feature selection.

In the face of the rise of DL and big data for drug discovery, DTI has yet to carve out its niche within this subdomain of ML. It is unlikely that single-tree learners will play a role here by themselves, but ensemble predictors are well up to the task of handling big data problems. DL workflows mostly employ ANN (deep nets), but efforts outside of drug discovery have shown that DTI can act as one of the weak learners in such a setup. DL at its heart is multilevel learning, where raw data are successively integrated into more complex features to predict the actual outcome of interest. In drug discovery, this could start with physicochemical properties such as lipophilicity, proceed to cell membrane passage and target engagement, and, finally, arrive at central nervous activity. Individual subproblems can be structured very differently, and different weak learners may be more capable of describing certain features than others.

For VS, we will in all likelihood see more and more technically involved workflows and algorithms, although the success of DL will depend on our ability to provide it with high-quality data. This is a considerable feat, given that by their very nature huge data stores cannot be surveyed and curated by human experts in as much detail as the small to midsize datasets that are the substrate for more traditional ML studies. While we fully recognize their predictive power, such models will have one major drawback: they are black boxes, even to ML experts. Single decision trees will probably remain the most powerful tool for machine-assisted hypothesis generation.

Declaration of interest

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer Disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

This manuscript was not funded.

References

1. Lima AN, Philot EA, Trossini GH, et al. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov. 2016;11(3):225–239. PubMed PMID: 26814169.
2. Hammann F, Drewe J. Decision tree models for data mining in hit discovery. Expert Opin Drug Discov. 2012 Apr;7(4):341–352. PubMed PMID: 22458505.
3. Topliss JG. Utilization of operational schemes for analog synthesis in drug design. J Med Chem. 1972 Oct;15(10):1006–1011. PubMed PMID: 5069767.
4. Topliss JG. A manual method for applying the Hansch approach to drug design. J Med Chem. 1977 Apr;20(4):463–469. PubMed PMID: 321782.
5. Wagener M, van Geerestein VJ. Potential drugs and nondrugs: prediction and identification of important structural features. J Chem Inf Comput Sci. 2000 Mar;40(2):280–292. PubMed PMID: 10761129.
6. Quinlan JR. C4.5: programs for machine learning. San Francisco, California, USA: Morgan Kaufmann Publishers Inc; 1993.
7. Su BH, Tu YS, Lin C, et al. Rule-based prediction models of cytochrome P450 inhibition. J Chem Inf Model. 2015 Jul 27;55(7):1426–1434. PubMed PMID: 26108525.
8. Newby D, Freitas AA, Ghafourian T. Decision trees to characterise the roles of permeability and solubility on the prediction of oral absorption. Eur J Med Chem. 2015 Jan 27;90:751–765.
9. Blower PE, Cross KP. Decision tree methods in pharmaceutical research. Curr Top Med Chem. 2006;6(1):31–39. PubMed PMID: 16454756.
10. Gertrudes JC, Maltarollo VG, Silva RA, et al. Machine learning techniques and drug design. Curr Med Chem. 2012;19(25):4289–4297. PubMed PMID: 22830342.
11. Zhang L, Tan J, Han D, et al. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today. 2017 Nov;22(11):1680–1685. PubMed PMID: 28881183.
12. ChEMBL. [Cited 2018 Aug 01]. Available from: https://www.ebi.ac.uk/chembl/
13. The PubChem Project. [Cited 2018 Aug 01]. Available from: https://pubchem.ncbi.nlm.nih.gov/
14. RCSB Protein Data Bank. [Cited 2018 Aug 01]. Available from: https://www.rcsb.org/
15. Lavecchia A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today. 2015 Mar;20(3):318–331. PubMed PMID: 25448759.
16. Decision Forest. [Cited 2018 Jul 30]. Available from: https://www.fda.gov/ScienceResearch/BioinformaticsTools/DecisionForest/default.htm
17. Hong H, Thakkar S, Chen M, et al. Development of decision forest models for prediction of drug-induced liver injury in humans using a large set of FDA-approved drugs. Sci Rep. 2017 Dec 11;7(1):17311.
18. Kim E, Nam H. Prediction models for drug-induced hepatotoxicity by using weighted molecular fingerprints. BMC Bioinformatics. 2017 May 31;18(Suppl 7):227. PubMed PMID: 28617228; PubMed Central PMCID: PMC5471939.
19. Danielson ML, Sawada GA, Raub TJ, et al. In silico and in vitro assessment of OATP1B1 inhibition in drug discovery. Mol Pharm. 2018;15:3060–3068.
20. Wang N-N, Huang C, Dong J, et al. Predicting human intestinal absorption with modified random forest approach: a comprehensive evaluation of molecular representation, unbalanced data, and applicability domain issues. RSC Adv. 2017;7(31):19007–19018.
21. Cano G, Garcia-Rodriguez J, Garcia-Garcia A, et al. Automatic selection of molecular descriptors using random forest: application to drug discovery. Expert Syst Appl. 2017;72:151–159.
22. Afolabi LT, Saeed F, Hashim H, et al. Ensemble learning method for the prediction of new bioactive molecules. PLoS One. 2018;13(1):e0189538. PubMed PMID: 29329334; PubMed Central PMCID: PMC5766097.
23. Chen H, Engkvist O, Wang Y, et al. The rise of deep learning in drug discovery. Drug Discov Today. 2018 Jun 1;23(6):1241–1250.
24. Kaggle. [Cited 2018 Aug 01]. Available from: http://www.kaggle.com
25. Ma J, Sheridan RP, Liaw A, et al. Deep neural nets as a method for quantitative structure–activity relationships. J Chem Inf Model. 2015;55(2):263–274.
26. Korotcov A, Tkachenko V, Russo DP, et al. Comparison of deep learning with multiple machine learning methods and metrics using diverse drug discovery data sets. Mol Pharm. 2017 Dec 4;14(12):4462–4475.
27. Kontschieder P, Fiterau M, Criminisi A, et al. Deep neural decision forests. Proceedings of the IEEE International Conference on Computer Vision; Santiago, Chile. 2015.
28. Tox21 Data Challenge 2014. [Cited 2018 Aug 01]. Available from: https://tripod.nih.gov/tox21/challenge
29. Mayr A, Klambauer G, Unterthiner T, et al. DeepTox: toxicity prediction using deep learning. Front Environ Sci. 2016 Feb 2;3:80.
30. Weka 3: Data Mining Software in Java. [Cited 2018 Aug 01]. Available from: https://www.cs.waikato.ac.nz/ml/weka/
31. Konstanz Information Miner (KNIME). [Cited 2018 Aug 01]. Available from: http://www.knime.com
