Editorial

Practical constraints with machine learning in drug discovery

Pages 929-931 | Received 11 Dec 2020, Accepted 03 Feb 2021, Published online: 19 Feb 2021

1. Introduction

Nowadays, machine learning methods, as part of artificial intelligence, play a significant role at almost all stages of drug discovery. With their help, scientists can identify promising drug targets and druggable binding pockets in them, assist in protein-ligand docking and assess docking results, build models for virtual screening of pre-prepared databases of potential drug molecules, perform de novo generation of novel molecules with desired biological activity, build QSAR models for lead optimization and for assessing the toxicity, ADME, environmental, and hazard properties of designed drugs, plan routes for their synthesis, predict side effects and interactions with other medications, and so on [Citation1–4]. At the same time, however, one should be aware that the use of machine learning in drug discovery always comes with certain practical constraints.

2. Data analysis

One of the main sources of the limitations associated with the use of machine learning in drug discovery is its primary reliance on data analysis and its neglect of deep knowledge from the natural sciences. This has several important implications [Citation5]. The first one concerns the scarceness and sparseness of data (Figure 1). It is commonly believed that the more data is used to build a machine learning model, the more accurate its predictions will be. At the same time, drug development tends to generate little data, especially in the early stages. This significantly limits the possibility of using machine learning methods for those problems where such a deficit is felt especially acutely, for example, for models that require training data from time-consuming and resource-intensive biological or clinical tests. There are several fundamentally different approaches to solving this problem.

Figure 1. Features of data in drug discovery.

First, with a small amount of data, one should build the simplest models, that is, from the standpoint of statistical learning theory, models with the minimum acceptable value of the Vapnik-Chervonenkis dimension, which can be interpreted as the effective number of descriptors. The easiest way to achieve this is to use linear models with as few uncorrelated descriptors as possible. This exactly corresponds to the scheme adopted when constructing classical QSAR models by multiple linear regression on a fixed number of descriptors with a clear physicochemical or structural interpretation. Among modern approaches based on a similar idea, the group of methods combining perturbation theory and machine learning (PTML) should be noted [Citation6], including mt-QSAR and mtk-QSBER. Unfortunately, the ideal case of having a fixed set of several easily interpreted descriptors is not typical, and most current publications prefer to select descriptors from a large pool of pre-calculated, non-interpretable ones.
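As a rough illustration of this classical scheme, the hedged sketch below fits an ordinary least-squares model on a small, fixed set of descriptors after checking that they are not strongly correlated. The descriptor names, the synthetic data, and the use of scikit-learn are illustrative assumptions, not material from this editorial.

```python
# A minimal sketch (assumed, not from the editorial) of the classical scheme above:
# an ordinary least-squares QSAR model on a small, fixed set of descriptors,
# fitted only after checking that the descriptors are not strongly correlated.
# The descriptor names (logP, MW, HBD) and the data are purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_compounds = 30                          # small training set, typical of early-stage data
X = rng.normal(size=(n_compounds, 3))     # columns stand for e.g. logP, MW, HBD (hypothetical)
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.3, size=n_compounds)

# Check pairwise correlations between descriptors before fitting
corr = np.corrcoef(X, rowvar=False)
print("max |r| between descriptors: %.2f" % np.max(np.abs(corr - np.eye(3))))

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept: %.2f" % model.intercept_)
```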

The problem of selecting descriptors for structure-activity models built with machine learning methods, although it may seem simple at first glance, is very nontrivial and can lead to important practical constraints [Citation7]. First, it should be borne in mind that the effective number of descriptors in a model obtained using an automatic selection procedure can be significantly higher than the final number of selected descriptors. It also depends on the number of descriptors in the initial pool from which the selection was carried out, as well as on the selection method. As a result, the classical apparatus of applied mathematical statistics, which is based on testing statistical hypotheses, assessing the statistical significance of revealed dependencies, and calculating confidence intervals for model parameters and prediction intervals for predictions, cannot be applied correctly and therefore provides an assessment of model quality that is distorted in the 'optimistic' direction. This corresponds to the overfitting phenomenon well known in the theory and practice of machine learning. Many factors can lead to overfitting, but when building QSAR models the main one is related to descriptor selection. This is especially the case for so-called wrapper descriptor selection methods based on stochastic optimization procedures, such as genetic algorithms.
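A hedged illustration of this selection bias, under synthetic assumptions (pure-noise descriptors, with scikit-learn's SelectKBest filter and linear regression standing in for a real wrapper procedure), is sketched below: selecting descriptors on the full data set before cross-validation gives a misleadingly optimistic score, whereas refitting the selection inside each fold does not.

```python
# A hedged illustration (synthetic data, not the author's code) of selection bias:
# with pure-noise descriptors, picking the "best" ones on the full data set before
# cross-validation yields a deceptively optimistic score, whereas refitting the
# selection inside every fold does not. SelectKBest stands in for a real wrapper.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 500))   # 500 random "descriptors" with no real signal
y = rng.normal(size=50)

# Wrong: descriptors selected using all data, then cross-validated
X_sel = SelectKBest(f_regression, k=10).fit_transform(X, y)
biased = cross_val_score(LinearRegression(), X_sel, y, cv=5, scoring="r2")

# Correct: the selection step is refit inside every training fold
pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
unbiased = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print("biased mean R2:   %.2f" % biased.mean())    # typically misleadingly high
print("unbiased mean R2: %.2f" % unbiased.mean())  # typically near or below zero
```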

Where overfitting is possible, the standard approach to assessing the predictive ability of a model, and thereby to tackling this phenomenon, is a cross-validation procedure. With a small amount of data, the most popular procedure is n-fold cross-validation with m random reshuffles of the compounds in the data sets (n = 3, 5, 10; m = 5, 10, 20), which provides a compromise between a computationally efficient single split of the data into training and test sets, which leads to a biased and unstable assessment of predictive ability, and the leave-one-out method, which leads to a more accurate assessment but requires significant computational resources. When the predictive ability estimate obtained by cross-validation is itself used as the optimized criterion for selecting descriptors, the predictive performance of the final model turns out to be overestimated, since information from the test sets penetrates indirectly into the training sets through the set of selected descriptors. In this case, the correct solution is to use a nested cross-validation procedure, in which the inner loop is used to compute the model performance measures needed to tune descriptor selection, while the outer loop is used to obtain an unbiased estimate of the model's predictive performance. As applied to problems in drug discovery, such a nested scheme was proposed in [Citation8], where the inner loop was termed internal cross-validation and the outer loop external cross-validation.
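A minimal sketch of this nested (internal/external) cross-validation scheme is given below, using scikit-learn's GridSearchCV inside cross_val_score; the estimator, the hyperparameter grid, and the synthetic data are illustrative assumptions rather than the protocol of the cited paper.

```python
# A minimal sketch of nested (internal/external) cross-validation with scikit-learn.
# The SVR estimator, the hyperparameter grid, and the synthetic data are assumptions
# chosen for illustration, not the protocol of the cited paper.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # "internal" CV: tune the model
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # "external" CV: assess the model

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=inner_cv,
    scoring="r2",
)
# Each outer fold retunes the hyperparameters on its own training part only,
# so the outer score is not inflated by information leaking from the test folds.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("external CV R2: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))
```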

An alternative to the selection of descriptors is the extraction (formation) of new descriptors by combining the original ones, as in multivariate linear methods. Another powerful alternative is the use of regularization, which reduces the effective number of descriptors while leaving their actual number unchanged. In this case, descriptor selection is replaced by descriptor weighting and model selection; a typical example of model selection is hyperparameter optimization. Both approaches are combined in neural networks, where new descriptors are formed on hidden neurons. Although for 'shallow' neural networks such means of preventing overfitting as controlling the size of the hidden layer, early stopping, using ensembles of networks, or Bayesian regularization are sufficient, the transition to deep learning requires new powerful regularization tools, such as dropout [Citation4,Citation9]. Consensus prediction is also an efficient approach (see examples in Refs. [Citation1,Citation5]). The result of a consensus prediction is a combination of the predictions made by different models, which can be built using different algorithms, different descriptors and, perhaps, different datasets.
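The sketch below illustrates a simple unweighted consensus of this kind: predictions from three models built with different algorithms are averaged on a held-out set. The particular scikit-learn estimators and the synthetic data are assumptions made purely for illustration.

```python
# A simple sketch of consensus prediction: predictions from models built with
# different algorithms are averaged on a held-out set. The particular estimators
# and the synthetic data are placeholders chosen for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 15))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = [
    Ridge(alpha=1.0),                                         # regularized linear model
    SVR(kernel="rbf", C=10.0),                                # kernel method
    RandomForestRegressor(n_estimators=200, random_state=0),  # tree ensemble
]
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])
consensus = preds.mean(axis=1)                                # simple unweighted consensus

for m, p in zip(models, preds.T):
    print("%s RMSE: %.2f" % (type(m).__name__, np.sqrt(np.mean((p - y_te) ** 2))))
print("Consensus RMSE: %.2f" % np.sqrt(np.mean((consensus - y_te) ** 2)))
```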

In addition to controlling the simplicity of models, other approaches can be used to build models from scarce data. Transfer learning adapts a model built on a large amount of data into a model for predicting another property for which less data is available [Citation10]; a practical constraint here is that the properties predicted by the two models must be sufficiently similar. A special case is multi-task learning, in which several properties are predicted simultaneously, e.g. using a neural network with several output units [Citation11]. Semi-supervised learning allows one, under certain conditions, to improve the quality of models by using additional information on unlabeled examples, such as molecules with unknown properties [Citation12]; one of the practical constraints here is the need to organize cross-validation correctly so as to exclude any penetration of information from the test set into the model. Active learning makes it possible to form a training set in such a way that the smallest number of necessary experimental studies leads to the best models [Citation13]. One-shot and few-shot learning allow building models to predict even those types of biological activity for which only one or a few examples are known, thanks to transfer learning with deep neural networks [Citation14].
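As a hedged sketch of the multi-task setting mentioned above, the PyTorch snippet below trains one shared network body with several output units, one per property, using a mask so that each task contributes to the loss only where a label exists; the architecture, synthetic data, and masking scheme are illustrative assumptions, not taken from the cited works.

```python
# A hedged sketch of multi-task learning: one shared network body with several
# output units, one per predicted property. A mask lets each task contribute to
# the loss only where a label exists. Architecture, data, and masking scheme are
# illustrative assumptions, not taken from the cited works.
import torch
import torch.nn as nn

n_descriptors, n_tasks = 64, 3
model = nn.Sequential(
    nn.Linear(n_descriptors, 128), nn.ReLU(),
    nn.Dropout(p=0.2),                # dropout as a regularizer, as discussed above
    nn.Linear(128, n_tasks),          # one output unit per predicted property
)

X = torch.randn(200, n_descriptors)   # descriptors for 200 hypothetical compounds
Y = torch.randn(200, n_tasks)         # three related properties (synthetic labels)
mask = torch.rand(200, n_tasks) > 0.3 # not every property is measured for every compound

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    optimizer.zero_grad()
    pred = model(X)
    # masked mean-squared error: each task contributes only where a label exists
    loss = ((pred - Y) ** 2 * mask).sum() / mask.sum()
    loss.backward()
    optimizer.step()
print("final masked MSE: %.3f" % loss.item())
```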

3. Expert opinion

Despite the widespread use of machine learning methods in drug discovery, many unresolved problems remain; listed below are only those related to the approaches discussed above. It is unclear to what extent the procedures for assessing model predictive performance by cross-validation and the concept of the applicability domain can be applied to the discovery of fundamentally new drugs. It is unclear in which cases the tasks of predicting different properties help or hinder each other in transfer and multi-task learning. It is unclear which factors determine how semi-supervised learning influences model predictive performance.

Other sources of constraints concern the distribution and nature of the data involved in drug discovery (Figure 1) [Citation5]. Because of the uneven distribution of data and the non-representativeness of training sets, a clear definition of model applicability domains, the use of generative neural network models capable of learning heterogeneities in the data distribution, and methods of chemical cartography capable of describing the distribution of data across chemical space [Citation15] should play a particularly important role. Given the time-based evolutionary nature of drug discovery, traditional static methods for testing the predictive power of models must be complemented by dynamic approaches based on time separation. The primary reliance on data manipulation, to the detriment of a deep understanding of the phenomena being studied, requires compensating for this imbalance by striving to build interpretable models [Citation16] and by integrating deep scientific knowledge into the model-building process. In addition to the laws of physics and chemistry, such knowledge should include information on the spatial structures of ligands and proteins, both static and dynamic, as well as on the networks of interactions between genetic and protein factors in living organisms.
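A small sketch of such a time-separated ('dynamic') validation is given below: compounds that appeared earlier form the training set and later ones the test set, in place of a random split. The dates, data, and random-forest model are synthetic placeholders chosen only to illustrate the splitting scheme.

```python
# A small sketch of time-separated ("dynamic") validation: compounds that appeared
# earlier form the training set and later ones the test set, instead of a random
# split. The dates, data, and random-forest model are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n = 300
dates = np.sort(rng.integers(2010, 2021, size=n))   # year in which each compound appeared
X = rng.normal(size=(n, 20))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.4, size=n)

train_idx = dates < 2018          # train on the "older" chemistry
test_idx = ~train_idx             # test on compounds made afterwards

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[train_idx], y[train_idx])
print("time-split R2: %.2f" % model.score(X[test_idx], y[test_idx]))
```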

Declaration of interest

II Baskin has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

II Baskin is supported by the Ministry of Education, Youth and Sports of the Czech Republic via grant MSMT-5727/2018-2 and by the Ministry of Science and Higher Education of the Russian Federation via grant 14.587.21.0049.

References

1. Muratov EN, Bajorath J, Sheridan RP, et al. QSAR without borders. Chem Soc Rev. 2020;49(11):3525–3564.
2. Brown N, Ertl P, Lewis R, et al. Artificial intelligence in chemistry and drug design. J Comput-Aided Mol Des. 2020;34(7):709–715.
3. Sellwood MA, Ahmed M, Segler MH, et al. Artificial intelligence in drug discovery. Future Med Chem. 2018;10(17):2025–2028.
4. Baskin II. The power of deep learning to ligand-based novel drug discovery. Expert Opin Drug Discov. 2020;15(7):756–764.
5. Varnek A, Baskin I. Machine learning methods for property prediction in chemoinformatics: quo vadis? J Chem Inf Model. 2012;52(6):1413–1437.
6. Simón-Vidal L, García-Calvo O, Oteo U, et al. Perturbation-theory and machine learning (PTML) model for high-throughput screening of Parham reactions: experimental and theoretical studies. J Chem Inf Model. 2018;58(7):1384–1396.
7. Livingstone DJ, Salt DW. Variable selection - spoilt for choice? Rev Comp Chem. 2005;21:287–348.
8. Tetko IV, Solov’ev VP, Antonov AV, et al. Benchmarking of linear and nonlinear approaches for quantitative structure-property relationship studies of metal complexation with ionophores. J Chem Inf Model. 2006;46(2):808–819.
9. Baskin II, Winkler D, Tetko IV. A renaissance of neural networks in drug discovery. Expert Opin Drug Discov. 2016;11(8):785–795.
10. Simões RS, Maltarollo VG, Oliveira PR, et al. Transfer and multi-task learning in QSAR modeling: advances and challenges. Front Pharmacol. 2018;9:74.
11. Varnek A, Gaudin C, Marcou G, et al. Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue-air partition coefficients. J Chem Inf Model. 2009;49(1):133–144.
12. Kondratovich E, Baskin II, Varnek A. Transductive support vector machines: promising approach to model small and unbalanced datasets. Mol Inf. 2013;32(3):261–266.
13. Fujiwara Y, Yamashita Y, Osoda T, et al. Virtual screening system for finding structurally diverse hits by active learning. J Chem Inf Model. 2008;48(4):930–940.
14. Baskin II. Is one-shot learning a viable option in drug discovery? Expert Opin Drug Discov. 2019;14(7):601–603.
15. Kireeva N, Baskin II, Gaspar HA, et al. Generative topographic mapping (GTM): universal tool for data visualization, structure-activity modeling and dataset comparison. Mol Inf. 2012;31(3–4):301–312.
16. Polishchuk P. Interpretation of quantitative structure–activity relationship models: past, present, and future. J Chem Inf Model. 2017;57(11):2618–2639.
