Editorial

Lessons learnt from machine learning in early stages of drug discovery

Pages 631-633 | Received 20 Mar 2024, Accepted 08 May 2024, Published online: 10 May 2024

1. Introduction

With the promise of a big leap, the field of Drug Discovery (DD) seems to have been permeated by Machine Learning (ML). It is not unreasonable to think that for every ‘classical’ computational method within DD there exists an ML-based counterpart, for example for docking, Molecular Dynamics (MD), and protein modeling. Furthermore, the amount of money being invested in ML for DD is growing steadily. Evidently, ML methods are here to stay and, in our opinion, they will be a valuable aid in accelerating the drug discovery pipeline.

In essence, supervised learning models establish a statistical relationship between training input and output data, typically in a non-linear fashion, which is why many models lack interpretability, especially those based on Deep Learning (DL). It is well known that, for prospective cases, there is no guarantee that a model’s performance will be consistent with what was observed in retrospective stages. Moreover, the absence of a clear rational link between input and output data gives many ML models a ‘black-box’ nature, which entails various challenges that need to be thoroughly tackled. While useful and interesting applications have emerged [Citation1–3], there has also been a tendency to overestimate the capacity of ML models. Therefore, rather than showcasing some role-model applications, we discuss three significant aspects identified in previous implementations concerning the early stages of drug discovery, in order to highlight their advantages, disadvantages, and risks.

a) Reaching accurate outcomes with ML for the wrong reasons: Some ML-based scoring functions that showed superior performance in virtual screening were, in fact, inappropriately validated using the DUD-E dataset [Citation4,Citation5]. As this dataset was built based on the structural difference between ligands and decoys, a model that uses structural descriptors of small molecules as input data can separate ligands from decoys without learning anything about protein-ligand binding, so the dataset is not suitable for validating such methods. Efforts have been made to improve this scenario by collecting more suitable data for ML applications, as in the case of LIT-PCBA [Citation6], although further contributions are highly necessary.

Moreover, it was demonstrated that, among the docking poses predicted by five DL-based docking methods, a notable percentage were physically implausible, showing, for example, wrong ligand stereochemistry or non-planar aromatic rings, despite having RMSD values lower than 2.0 Å with respect to the experimental binding modes [Citation7]. It was proposed, therefore, that physical plausibility must be considered when validating ML-based docking pose prediction.
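To make the point concrete, the following is a minimal sketch of how a pose can pass the 2.0 Å RMSD criterion and still fail a simple geometric sanity check. It is not the validation suite used in [Citation7]; the function names, the toy benzene-like coordinates, and the 0.25 Å planarity tolerance are assumptions chosen purely for illustration.

```python
import math

def rmsd(pred, ref):
    """RMSD between two poses given as lists of (x, y, z) in the same atom order."""
    if len(pred) != len(ref):
        raise ValueError("poses must contain the same number of atoms")
    total = sum((p[k] - r[k]) ** 2 for p, r in zip(pred, ref) for k in range(3))
    return math.sqrt(total / len(pred))

def max_out_of_plane(ring):
    """Largest distance of any ring atom from the ring's mean plane."""
    n = len(ring)
    centroid = [sum(atom[k] for atom in ring) / n for k in range(3)]
    vecs = [[atom[k] - centroid[k] for k in range(3)] for atom in ring]
    # The sum of cross products of consecutive centroid-to-atom vectors is the
    # polygon's area vector; it is used here as the normal of the mean plane.
    normal = [0.0, 0.0, 0.0]
    for i in range(n):
        a, b = vecs[i], vecs[(i + 1) % n]
        normal[0] += a[1] * b[2] - a[2] * b[1]
        normal[1] += a[2] * b[0] - a[0] * b[2]
        normal[2] += a[0] * b[1] - a[1] * b[0]
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("degenerate ring: cannot define a plane")
    return max(abs(sum(v[k] * normal[k] for k in range(3))) / length for v in vecs)

def check_pose(pred, ref, ring_indices, rmsd_cutoff=2.0, planarity_tol=0.25):
    """Report the RMSD criterion and the ring-planarity criterion separately.

    rmsd_cutoff and planarity_tol are in angstroms; the tolerance is an
    illustrative assumption, not a value taken from the editorial.
    """
    ring = [pred[i] for i in ring_indices]
    return {
        "rmsd_ok": rmsd(pred, ref) <= rmsd_cutoff,
        "planar_ok": max_out_of_plane(ring) <= planarity_tol,
    }

# Toy reference: a flat six-membered ring with 1.39 A centre-to-atom distance.
flat_ring = [
    (1.39 * math.cos(math.radians(60 * i)), 1.39 * math.sin(math.radians(60 * i)), 0.0)
    for i in range(6)
]
# Toy prediction: the same ring with one atom pushed 0.8 A out of the plane.
distorted = [(x, y, 0.8) if i == 0 else (x, y, z) for i, (x, y, z) in enumerate(flat_ring)]

report = check_pose(distorted, flat_ring, ring_indices=range(6))
print(report)
```

In this toy case the distorted ring stays well within the RMSD cutoff yet is flagged as non-planar, which is why the two criteria are reported separately rather than merged into a single score.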

An additional example involves the use of ML for the identification of a kinase inhibitor in a mere 21 days [Citation8], which seemed to establish a new paradigm for de novo drug design. The study proposed a novel ML-driven pipeline for the generation of compounds based on a series of molecular properties of interest, but it was later pointed out that the identified inhibitor was very similar to an already known drug present in the training set [Citation9]. This prompted the publication of rational guidelines for the use of ML in de novo design, which essentially call for a stringent validation of generative artificial intelligence (AI) models in DD [Citation9,Citation10].
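One simple check in the spirit of such guidelines is to compare every generated compound against the training set before claiming novelty. The sketch below does this with the Tanimoto coefficient on fingerprints represented as sets of on-bit indices. The fingerprints, compound names, and the 0.7 flagging threshold are hypothetical and are not taken from the cited studies; in practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_training_neighbor(candidate_fp, training_fps):
    """Return (name, similarity) of the training compound most similar to the candidate."""
    return max(
        ((name, tanimoto(candidate_fp, fp)) for name, fp in training_fps.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical fingerprints, for illustration only.
training_fps = {
    "known_drug": {1, 2, 3, 4, 5, 6, 7, 8},
    "unrelated_compound": {20, 21, 22},
}
generated_fp = {1, 2, 3, 4, 5, 6, 7, 9}

name, similarity = nearest_training_neighbor(generated_fp, training_fps)
NOVELTY_THRESHOLD = 0.7  # assumed cutoff; any real threshold depends on the fingerprint used
needs_review = similarity >= NOVELTY_THRESHOLD
print(name, round(similarity, 3), needs_review)
```

A high similarity to a training compound does not invalidate a hit, but it does indicate that the result should not be presented as evidence of de novo design without further scrutiny.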

In addition to rigorous validation, a natural way to mitigate the harm of achieving results for the wrong reasons with ML models is to use explainable AI methods, which have been covered comprehensively, along with their implementations in drug discovery [Citation11].

b) The availability of high-quality data and the development of ML methods in DD: Evidently, data is a crucial component of any ML development. Unlike other fields where ML has had a great impact, such as computer vision or natural language processing, in the field of DD the gathering of a high-quality dataset is a non-trivial task [Citation12]. Biological data is highly complex, as an observed in vivo effect might be caused by many different factors, which often act in combination. Therefore, obtaining specific and high-quality biological data that properly represents the problem at hand is equally complex. Although techniques such as multitask learning or self-supervised learning could mitigate the impact of data limitations, they provide only an incomplete solution to the problem.

In the context of toxicity prediction, for example, the numerous and diverse toxicity endpoints require separate analysis. Consequently, an enormous amount of data related to each of these in vivo safety endpoints is necessary for the appropriate use of ML predictive methods, and such data is currently not accessible in most cases [Citation13]. This suggests that current efforts in the area should be directed toward the collection of high-quality toxicity data, whose description is closely linked to the particular toxicity endpoint under investigation, rather than solely focusing on improving a validation metric.

c) The advent of AlphaFold as a protein modeling tool: AlphaFold (AF), an AI methodology developed to computationally characterize the 3D structure of a protein from its amino acid sequence [Citation14], is a paradigmatic case to analyze in light of the repercussions it had on the scientific community [Citation15], and may provide general considerations regarding the use of ML in DD. As a first consideration, AlphaFold illustrates the capital importance of data availability. The vast number of crystal structures found in the PDB nowadays covers a wide variety of protein families. Evidently, the availability of these good-quality structures allowed the development of more accurate protein structure prediction models such as AlphaFold.

The enormous success of AlphaFold led some experts to claim that it has solved a 50-year-old grand challenge of structural biology. Without diminishing either the success or the usefulness of AF, modeling 3D structures through a complex relationship learned from amino acid sequences is not the same as understanding why a protein folds in a particular way, or how it does so. An ML model does not ‘learn’ the fundamental laws of nature; AF has no knowledge of the underlying protein physics. No matter how much data is utilized during the training stage, or how many physical constraints are imposed, the method will ultimately perform some sort of statistics based on the training data. As Finkelstein [Citation16] clearly pointed out, the prediction of a protein's 3D structure is based on the similarity between some parts of its amino acid sequence and parts of sequences with already known 3D structures.

Also, extreme caution must be taken when inferring any conclusions involving biological behavior from the outcomes of an ML model. For example, in some AF 3D structures, proteins display large regions that, visually, do not conform to any type of secondary structure. While these regions quantitatively display a low confidence in the model predictions, it has been proposed, based on the results of AF, that they may be unstructured in isolation [Citation17]. An alternative and more cautious conclusion would be that disorder in AF models reflects regions not well represented in the training data.

2. Expert opinion

It is rather surprising that, shortly after the AF boom, AF-modeled structures could be found along with crystallized structures in the PDB. While this represents a step toward the use of AF structures, ML models, as with any other methodology, should be evaluated as carefully and extensively as possible, exploring the limits of their applicability. In our case, for example, we have shown that AF structures did not have the accuracy needed to be readily used in a docking-based virtual screening scenario [Citation18], which was also concluded by other studies [Citation19,Citation20]. Finding model limitations will not only guide a more conscious use of ML models but could also lead to improvements, as in the particular case of AF [Citation21].

Although ML will continue to be routinely used in the field of DD, in our view it is unlikely that the use of ML will bring about any major paradigm shift within DD in the short term. The only route that we can envision to such an impact is through the generation and availability of large amounts of high-quality data, which is a grand challenge in the area of drug discovery. The utmost importance of biological data, and the challenges associated with its utilization, have already been discussed in depth [Citation22].

Certainly, we do expect the development of enhanced methodologies based on ML, especially where good-quality datasets are already available or could be easily acquired. This was the case, for example, in the discovery of a novel antibiotic based on a DL predictor [Citation23]. An illustrative scenario in which data collection is relatively straightforward is using ML to accelerate computational methods, for example, to drive accelerated ultra-large docking campaigns [Citation24] or, although more computationally demanding, to guide molecular dynamics simulations [Citation25]. Here, the data required to develop an ML method (for example, molecular descriptors and docking scores) can be generated in silico in a more cost-effective manner compared to experimental data. An excellent example of coupling ML with molecular docking is showcased by the work of Graff et al. [Citation26], where a 40-fold reduction was achieved regarding the number of docking calculations needed to screen a large database.
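The general idea of using a surrogate model to decide which molecules are worth docking can be sketched as follows. This is a toy illustration under stated assumptions and not the protocol of Graff et al. [Citation26]: `mock_dock` is a hypothetical stand-in for a real docking calculation, the two-component descriptors are random numbers, the surrogate is a plain k-nearest-neighbor average, and molecules are acquired greedily by best predicted score.

```python
import random

def mock_dock(descriptor):
    """Hypothetical stand-in for an expensive docking run; lower scores are better."""
    x, y = descriptor
    return (x - 0.7) ** 2 + (y - 0.2) ** 2

def knn_predict(descriptor, labelled, k=3):
    """Surrogate model: mean score of the k nearest already-docked molecules."""
    nearest = sorted(
        labelled,
        key=lambda item: (item[0][0] - descriptor[0]) ** 2
        + (item[0][1] - descriptor[1]) ** 2,
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)

def surrogate_guided_screen(library, n_init, batch_size, n_rounds, seed=0):
    """Dock a random seed set, then repeatedly dock the molecules that the
    surrogate predicts to score best. Returns {library index: docking score}."""
    rng = random.Random(seed)
    docked = {i: mock_dock(library[i]) for i in rng.sample(range(len(library)), n_init)}
    for _ in range(n_rounds):
        labelled = [(library[i], score) for i, score in docked.items()]
        remaining = [i for i in range(len(library)) if i not in docked]
        # Greedy acquisition: rank undocked molecules by predicted score.
        remaining.sort(key=lambda i: knn_predict(library[i], labelled))
        for i in remaining[:batch_size]:
            docked[i] = mock_dock(library[i])
    return docked

# Toy library of 500 molecules described by two made-up descriptors.
rng = random.Random(42)
library = [(rng.random(), rng.random()) for _ in range(500)]

docked = surrogate_guided_screen(library, n_init=25, batch_size=25, n_rounds=3)
fraction_docked = len(docked) / len(library)
print(f"docked {len(docked)} of {len(library)} molecules ({fraction_docked:.0%})")
```

The design choice worth noting is that every label used to train the surrogate is itself generated in silico, so the cost of each additional round is only the docking of one more batch; how much of the true top-scoring set is recovered depends entirely on the surrogate, the descriptors, and the acquisition strategy, none of which this sketch attempts to optimize.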

Contrary to current opinion, we envision ML as a powerful mathematical tool that can optimize different methods or processes, and not as a revolutionary set of methodologies that will bring a magical solution overnight. The latter will simply not happen. In this sense, we consider that ML methods must be incorporated alongside the already established methodologies within the field of DD, rather than pursued as an end-to-end ML protocol. In our view, conceiving ML as a mechanism for a possible automatic drug discovery pipeline is a misconception. Even in the ideal case of having a set of ML-optimized methods within each stage of the DD pipeline that could be streamlined, there are plenty of biological, physical, and chemical subtleties that must be considered in between. Only in the hands of a drug discovery expert may ML approaches offer a superior approach.

Declaration of interest

The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Acknowledgments

Computing time from the CCAD (Centro de Computación de Alto Desempeño de la Universidad Nacional de Córdoba) is greatly appreciated.

Additional information

Funding

This work was supported by the National Agency for the Promotion of Science and Technology (ANPCyT) (PICT-2021-1129).

References

  • Jiménez-Luna J, Grisoni F, Weskamp N, et al. Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opin Drug Discov. 2021 Sep;16(9):949–959. doi: 10.1080/17460441.2021.1909567
  • Vamathevan J, Clark D, Czodrowski P, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov. 2019 Jun;18(6):463–477. doi: 10.1038/s41573-019-0024-5
  • Di Filippo JI, Cavasotto CN. Guided structure-based ligand identification and design via artificial intelligence modeling. Expert Opin Drug Discov. 2022;17(1):71–78. doi: 10.1080/17460441.2021.1979514
  • Chen L, Cruz A, Ramsey S, et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS One. 2019;14(8):e0220113. doi: 10.1371/journal.pone.0220113
  • Sieg J, Flachsenberg F, Rarey M. In need of bias control: evaluating chemical data for machine learning in structure-based virtual screening. J Chem Inf Model. 2019 Mar 25;59(3):947–961. doi: 10.1021/acs.jcim.8b00712
  • Tran-Nguyen VK, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inf Model. 2020 Sep 28;60(9):4263–4273. doi: 10.1021/acs.jcim.0c00155
  • Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem Sci. 2024 Feb 28;15(9):3130–3139. doi: 10.1039/D3SC04185A
  • Zhavoronkov A, Ivanenkov YA, Aliper A, et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat Biotechnol. 2019 Sep;37(9):1038–1040. doi: 10.1038/s41587-019-0224-x
  • Walters WP, Murcko M. Assessing the impact of generative AI on medicinal chemistry. Nat Biotechnol. 2020 Feb;38(2):143–145. doi: 10.1038/s41587-020-0418-2
  • Zhavoronkov A, Aspuru-Guzik A. Reply to ‘assessing the impact of generative AI on medicinal chemistry’. Nat Biotechnol. 2020 Feb;38(2):146. doi: 10.1038/s41587-020-0417-3
  • Jiménez-Luna J, Grisoni F, Schneider G. Drug discovery with explainable artificial intelligence. Nat Mach Intell. 2020 Oct 01;2(10):573–584. doi: 10.1038/s42256-020-00236-4
  • Rodrigues T. The good, the bad, and the ugly in chemical and biological data for machine learning. Drug Discov Today Technol. 2019 Dec;32-33:3–8. doi: 10.1016/j.ddtec.2020.07.001
  • Cavasotto CN, Scardino V. Machine learning toxicity prediction: latest advances by toxicity end point. ACS Omega. 2022 Dec 27;7(51):47536–47546. doi: 10.1021/acsomega.2c05693
  • Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583–589. doi: 10.1038/s41586-021-03819-2
  • Subramaniam S, Kleywegt GJ. A paradigm shift in structural biology. Nat Methods. 2022 Jan 01;19(1):20–23. doi: 10.1038/s41592-021-01361-7
  • Finkelstein AV. Does AlphaFold predict the spatial structure of a protein from physics or recognize it (its main parts and their association) using databases? bioRxiv. 2022:2022.11.21.517308. doi: 10.1101/2022.11.21.517308
  • Tunyasuvunakool K, Adler J, Wu Z, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021 Aug;596(7873):590–596. doi: 10.1038/s41586-021-03828-1
  • Scardino V, Di Filippo JI, Cavasotto CN. How good are AlphaFold models for docking-based virtual screening? iScience. 2023 Jan 20;26(1):105920. doi: 10.1016/j.isci.2022.105920
  • Diaz-Rovira AM, Martin H, Beuming T, et al. Are deep learning structural models sufficiently accurate for virtual screening? Application of docking algorithms to AlphaFold2 predicted structures. J Chem Inf Model. 2023 Mar 27;63(6):1668–1674. doi: 10.1021/acs.jcim.2c01270
  • Zhang Y, Vass M, Shi D, et al. Benchmarking refined and unrefined AlphaFold2 structures for hit discovery. J Chem Inf Model. 2023 Mar 27;63(6):1656–1667. doi: 10.1021/acs.jcim.2c01219
  • Evans R, O’Neill M, Pritzel A, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021:2021.10.04.463034. doi: 10.1101/2021.10.04.463034
  • Bender A, Cortes-Ciriano I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data. Drug Discov Today. 2021 Apr 01;26(4):1040–1052. doi: 10.1016/j.drudis.2020.11.037
  • Stokes JM, Yang K, Swanson K, et al. A deep learning approach to antibiotic discovery. Cell. 2020 Feb 20;180(4):688–702.e13. doi: 10.1016/j.cell.2020.01.021
  • Cavasotto CN, Di Filippo JI. The impact of supervised learning methods in ultralarge high-throughput docking. J Chem Inf Model. 2023 Apr 24;63(8):2267–2280. doi: 10.1021/acs.jcim.2c01471
  • Noé F, Tkatchenko A, Müller K-R, et al. Machine learning for molecular simulation. Annu Rev Phys Chem. 2020;71(1):361–390. doi: 10.1146/annurev-physchem-042018-052331
  • Graff DE, Shakhnovich EI, Coley CW. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem Sci. 2021 Apr 29;12(22):7866–7881. doi: 10.1039/D0SC06805E
