Editorial

Practical constraints with machine learning in drug discovery

Pages 929-931 | Received 11 Dec 2020, Accepted 03 Feb 2021, Published online: 19 Feb 2021

1. Introduction

Nowadays, machine learning methods, as part of artificial intelligence, play a significant role at almost all stages of drug discovery. With their help, scientists can identify promising drug targets and druggable binding pockets in them, assist in protein-ligand docking and assess docking results, build models for virtual screening of pre-prepared databases of potential drug molecules, perform de novo generation of novel molecules with desired biological activity, build QSAR models for lead optimization and for assessing the toxicity, ADME, environmental, and hazard properties of designed drugs, plan routes for their synthesis, predict side effects and interactions with other medications, and so on [Citation1–4]. At the same time, however, one should be aware that the use of machine learning in drug discovery always comes with certain practical constraints.

2. Data analysis

One of the main sources of the limitations associated with the use of machine learning in drug discovery is its primary reliance on data analysis and its neglect of deep knowledge from the natural sciences. This has several important implications [Citation5]. The first one concerns the scarceness and sparseness of data (Figure 1). It is commonly believed that the more data is used to build a machine learning model, the more accurate its predictions will be. At the same time, drug development tends to generate little data, especially in the early stages. This significantly limits the possibility of using machine learning methods for those problems where such a deficit is felt especially acutely, for example, for models that require training data from time-consuming and resource-intensive biological or clinical tests. There are several fundamentally different approaches to solving this problem.

Figure 1. Features of data in drug discovery.

First, with a small amount of data, one should build the simplest models, that is, from the standpoint of statistical learning theory, models with the minimum acceptable value of the Vapnik-Chervonenkis dimension, which can be interpreted as the effective number of descriptors. The easiest way to achieve this is to use linear models with as few uncorrelated descriptors as possible. This exactly corresponds to the scheme adopted when constructing classical QSAR models by multiple linear regression on a fixed number of descriptors with a clear physicochemical or structural interpretation. Among modern approaches based on a similar idea, the group of methods combining perturbation theory and machine learning (PTML) should be noted [Citation6], including mt-QSAR and mtk-QSBER. Unfortunately, the ideal case of having a fixed set of several easily interpreted descriptors is not typical, and most current publications prefer to select descriptors from a large pool of pre-calculated, non-interpretable ones.
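As a rough illustration of this classical scheme, the hedged sketch below fits an ordinary least-squares model on a small, fixed set of descriptors after checking that they are not strongly correlated. The descriptor names, the synthetic data, and the use of scikit-learn are illustrative assumptions, not material from this editorial.

```python
# A minimal sketch (assumed, not from the editorial) of the classical scheme above:
# an ordinary least-squares QSAR model on a small, fixed set of descriptors,
# fitted only after checking that the descriptors are not strongly correlated.
# The descriptor names (logP, MW, HBD) and the data are purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_compounds = 30                          # small training set, typical of early-stage data
X = rng.normal(size=(n_compounds, 3))     # columns stand for e.g. logP, MW, HBD (hypothetical)
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.3, size=n_compounds)

# Check pairwise correlations between descriptors before fitting
corr = np.corrcoef(X, rowvar=False)
print("max |r| between descriptors: %.2f" % np.max(np.abs(corr - np.eye(3))))

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept: %.2f" % model.intercept_)
```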

The problem of selecting descriptors for structure-activity models built with machine learning methods, although it may seem simple at first glance, is very nontrivial and can lead to important practical constraints [Citation7]. First, it should be borne in mind that the effective number of descriptors in a model obtained using an automatic selection procedure can be significantly higher than the final number of selected descriptors. It also depends on the number of descriptors in the initial pool from which the selection was carried out, as well as on the selection method. As a result, the classical apparatus of applied mathematical statistics, which is based on testing statistical hypotheses, assessing the statistical significance of revealed dependencies, and calculating confidence intervals for model parameters and prediction intervals for predictions, cannot be applied correctly and therefore provides an assessment of model quality that is distorted in the 'optimistic' direction. This corresponds to the overfitting phenomenon well known in the theory and practice of machine learning. Many factors can lead to overfitting, but when building QSAR models the main one is related to descriptor selection. This is especially the case for so-called wrapper descriptor selection methods based on stochastic optimization procedures, such as genetic algorithms.
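A hedged illustration of this selection bias, under synthetic assumptions (pure-noise descriptors, with scikit-learn's SelectKBest filter and linear regression standing in for a real wrapper procedure), is sketched below: selecting descriptors on the full data set before cross-validation gives a misleadingly optimistic score, whereas refitting the selection inside each fold does not.

```python
# A hedged illustration (synthetic data, not the author's code) of selection bias:
# with pure-noise descriptors, picking the "best" ones on the full data set before
# cross-validation yields a deceptively optimistic score, whereas refitting the
# selection inside every fold does not. SelectKBest stands in for a real wrapper.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 500))   # 500 random "descriptors" with no real signal
y = rng.normal(size=50)

# Wrong: descriptors selected using all data, then cross-validated
X_sel = SelectKBest(f_regression, k=10).fit_transform(X, y)
biased = cross_val_score(LinearRegression(), X_sel, y, cv=5, scoring="r2")

# Correct: the selection step is refit inside every training fold
pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
unbiased = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print("biased mean R2:   %.2f" % biased.mean())    # typically misleadingly high
print("unbiased mean R2: %.2f" % unbiased.mean())  # typically near or below zero
```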

Where overfitting is possible, the standard approach to assessing the predictive ability of a model, and thereby to tackling this phenomenon, is a cross-validation procedure. With a small amount of data, the most popular procedure is n-fold cross-validation with m random reshuffles of the compounds in the data sets (n = 3, 5, 10; m = 5, 10, 20), which provides a compromise between a computationally efficient single split of the data into training and test sets, which leads to a biased and unstable assessment of predictive ability, and the leave-one-out method, which leads to a more accurate assessment but requires significant computational resources. When the predictive ability estimate obtained by cross-validation is itself used as the optimized criterion for selecting descriptors, the predictive performance of the final model turns out to be overestimated, since information from the test sets penetrates indirectly into the training sets through the set of selected descriptors. In this case, the correct solution is to use a nested cross-validation procedure, in which the inner loop is used to compute the model performance measures needed to tune descriptor selection, while the outer loop is used to obtain an unbiased estimate of the model's predictive performance. As applied to problems in drug discovery, such a nested scheme was proposed in [Citation8], where the inner loop was termed internal cross-validation and the outer loop external cross-validation.
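A minimal sketch of this nested (internal/external) cross-validation scheme is given below, using scikit-learn's GridSearchCV inside cross_val_score; the estimator, the hyperparameter grid, and the synthetic data are illustrative assumptions rather than the protocol of the cited paper.

```python
# A minimal sketch of nested (internal/external) cross-validation with scikit-learn.
# The SVR estimator, the hyperparameter grid, and the synthetic data are assumptions
# chosen for illustration, not the protocol of the cited paper.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # "internal" CV: tune the model
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # "external" CV: assess the model

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=inner_cv,
    scoring="r2",
)
# Each outer fold retunes the hyperparameters on its own training part only,
# so the outer score is not inflated by information leaking from the test folds.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("external CV R2: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))
```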

An alternative to the selection of descriptors is the extraction (formation) of new descriptors by combining the original ones, as in multivariate linear methods. Another powerful alternative is the use of regularization, which reduces the effective number of descriptors while leaving their actual number unchanged. In this case, descriptor selection is replaced by descriptor weighting and model selection; a typical example of model selection is hyperparameter optimization. Both approaches are combined in neural networks, where new descriptors are formed on hidden neurons. Although for 'shallow' neural networks such means of preventing overfitting as controlling the size of the hidden layer, early stopping, using ensembles of networks, or Bayesian regularization are sufficient, the transition to deep learning requires new powerful regularization tools, such as dropout [Citation4,Citation9]. Consensus prediction is also an efficient approach (see examples in Refs. [Citation1,Citation5]). The result of a consensus prediction is a combination of the predictions made by different models, which can be built using different algorithms, different descriptors and, perhaps, different datasets.
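The sketch below illustrates a simple unweighted consensus of this kind: predictions from three models built with different algorithms are averaged on a held-out set. The particular scikit-learn estimators and the synthetic data are assumptions made purely for illustration.

```python
# A simple sketch of consensus prediction: predictions from models built with
# different algorithms are averaged on a held-out set. The particular estimators
# and the synthetic data are placeholders chosen for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 15))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    Ridge(alpha=1.0),                                         # regularized linear model
    SVR(kernel="rbf", C=10.0),                                # kernel method
    RandomForestRegressor(n_estimators=200, random_state=0),  # tree ensemble
]
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])
consensus = preds.mean(axis=1)                                # simple unweighted consensus

for m, p in zip(models, preds.T):
    print("%s RMSE: %.2f" % (type(m).__name__, np.sqrt(np.mean((p - y_te) ** 2))))
print("Consensus RMSE: %.2f" % np.sqrt(np.mean((consensus - y_te) ** 2)))
```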

In addition to controlling the simplicity of models, other approaches can be used to build models from scarce data. Transfer learning adapts a model built on a large amount of data into a model for predicting another property for which less data is available [Citation10]; a practical constraint here is that the properties predicted by the two models must be sufficiently similar. A special case is multi-task learning, in which several properties are predicted simultaneously, e.g. using a neural network with several output units [Citation11]. Semi-supervised learning allows one, under certain conditions, to improve the quality of models by using additional information on unlabeled examples, such as molecules with unknown properties [Citation12]; one of the practical constraints here is the need to organize cross-validation correctly so as to exclude any penetration of information from the test set into the model. Active learning makes it possible to form a training set in such a way that the smallest number of necessary experimental studies leads to the best models [Citation13]. One-shot and few-shot learning allow building models to predict even those types of biological activity for which only one or a few examples are known, thanks to transfer learning with deep neural networks [Citation14].
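As a hedged sketch of the multi-task setting mentioned above, the PyTorch snippet below trains one shared network body with several output units, one per property, using a mask so that each task contributes to the loss only where a label exists; the architecture, synthetic data, and masking scheme are illustrative assumptions, not taken from the cited works.

```python
# A hedged sketch of multi-task learning: one shared network body with several
# output units, one per predicted property. A mask lets each task contribute to
# the loss only where a label exists. Architecture, data, and masking scheme are
# illustrative assumptions, not taken from the cited works.
import torch
import torch.nn as nn

n_descriptors, n_tasks = 64, 3
model = nn.Sequential(
    nn.Linear(n_descriptors, 128), nn.ReLU(),
    nn.Dropout(p=0.2),                # dropout as a regularizer, as discussed above
    nn.Linear(128, n_tasks),          # one output unit per predicted property
)

X = torch.randn(200, n_descriptors)   # descriptors for 200 hypothetical compounds
Y = torch.randn(200, n_tasks)         # three related properties (synthetic labels)
mask = torch.rand(200, n_tasks) > 0.3 # not every property is measured for every compound

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    optimizer.zero_grad()
    pred = model(X)
    # masked mean-squared error: each task contributes only where a label exists
    loss = ((pred - Y) ** 2 * mask).sum() / mask.sum()
    loss.backward()
    optimizer.step()
print("final masked MSE: %.3f" % loss.item())
```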

3. Expert opinion

Despite the widespread use of machine learning methods in drug discovery, many unresolved problems remain; listed below are only those related to the approaches discussed above. It is unclear to what extent the procedures for assessing model predictive performance by cross-validation and the concept of the applicability domain can be applied to the discovery of fundamentally new drugs. It is unclear in which cases the tasks of predicting different properties help or hinder each other in transfer and multi-task learning. It is unclear which factors determine how semi-supervised learning influences model predictive performance.

Other sources of constraints concern the distribution and nature of the data involved in drug discovery (Figure 1) [Citation5]. Because of the uneven distribution of data and the non-representativeness of training sets, a clear definition of model applicability domains, the use of generative neural network models capable of learning heterogeneities in the data distribution, and methods of chemical cartography capable of describing the distribution of data across chemical space [Citation15] should play a particularly important role. Given the time-based evolutionary nature of drug discovery, traditional static methods for testing the predictive power of models must be complemented by dynamic approaches based on time separation. The primary reliance on data manipulation, to the detriment of a deep understanding of the phenomena being studied, requires compensating for this imbalance by striving to build interpretable models [Citation16] and by integrating deep scientific knowledge into the model-building process. In addition to the laws of physics and chemistry, such knowledge should include information on the spatial structures of ligands and proteins, both static and dynamic, as well as on the networks of interactions between genetic and protein factors in living organisms.
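A small sketch of such a time-separated ('dynamic') validation is given below: compounds that appeared earlier form the training set and later ones the test set, in place of a random split. The dates, data, and random-forest model are synthetic placeholders chosen only to illustrate the splitting scheme.

```python
# A small sketch of time-separated ("dynamic") validation: compounds that appeared
# earlier form the training set and later ones the test set, instead of a random
# split. The dates, data, and random-forest model are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 300
dates = np.sort(rng.integers(2010, 2021, size=n))   # year in which each compound appeared
X = rng.normal(size=(n, 20))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.4, size=n)

train_idx = dates < 2018          # train on the "older" chemistry
test_idx = ~train_idx             # test on compounds made afterwards

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[train_idx], y[train_idx])
print("time-split R2: %.2f" % model.score(X[test_idx], y[test_idx]))
```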

Declaration of interest

II Baskin has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

II Baskin is supported by the Ministry of Education, Youth and Sports of the Czech Republic via grant MSMT-5727/2018-2 and by the Ministry of Science and Higher Education of the Russian Federation via grant 14.587.21.0049.

References

1. Muratov EN, Bajorath J, Sheridan RP, et al. QSAR without borders. Chem Soc Rev. 2020;49(11):3525–3564.
2. Brown N, Ertl P, Lewis R, et al. Artificial intelligence in chemistry and drug design. J Comput-Aided Mol Des. 2020;34(7):709–715.
3. Sellwood MA, Ahmed M, Segler MH, et al. Artificial intelligence in drug discovery. Future Med Chem. 2018;10(17):2025–2028.
4. Baskin II. The power of deep learning to ligand-based novel drug discovery. Expert Opin Drug Discov. 2020;15(7):756–764.
5. Varnek A, Baskin I. Machine learning methods for property prediction in chemoinformatics: quo vadis? J Chem Inf Model. 2012;52(6):1413–1437.
6. Simón-Vidal L, García-Calvo O, Oteo U, et al. Perturbation-theory and machine learning (PTML) model for high-throughput screening of Parham reactions: experimental and theoretical studies. J Chem Inf Model. 2018;58(7):1384–1396.
7. Livingstone DJ, Salt DW. Variable selection - spoilt for choice? Rev Comp Chem. 2005;21:287–348.
8. Tetko IV, Solov’ev VP, Antonov AV, et al. Benchmarking of linear and nonlinear approaches for quantitative structure-property relationship studies of metal complexation with ionophores. J Chem Inf Model. 2006;46(2):808–819.
9. Baskin II, Winkler D, Tetko IV. A renaissance of neural networks in drug discovery. Expert Opin Drug Discov. 2016;11(8):785–795.
10. Simões RS, Maltarollo VG, Oliveira PR, et al. Transfer and multi-task learning in QSAR modeling: advances and challenges. Front Pharmacol. 2018;9:74.
11. Varnek A, Gaudin C, Marcou G, et al. Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue-air partition coefficients. J Chem Inf Model. 2009;49(1):133–144.
12. Kondratovich E, Baskin II, Varnek A. Transductive support vector machines: promising approach to model small and unbalanced datasets. Mol Inf. 2013;32(3):261–266.
13. Fujiwara Y, Yamashita Y, Osoda T, et al. Virtual screening system for finding structurally diverse hits by active learning. J Chem Inf Model. 2008;48(4):930–940.
14. Baskin II. Is one-shot learning a viable option in drug discovery? Expert Opin Drug Discov. 2019;14(7):601–603.
15. Kireeva N, Baskin II, Gaspar HA, et al. Generative topographic mapping (GTM): universal tool for data visualization, structure-activity modeling and dataset comparison. Mol Inf. 2012;31(3–4):301–312.
16. Polishchuk P. Interpretation of quantitative structure–activity relationship models: past, present, and future. J Chem Inf Model. 2017;57(11):2618–2639.
