Commentary

Chemical and biological language models in molecular design: opportunities, risks and scientific reasoning

Article: FSO957 | Received 20 Dec 2023, Accepted 03 Jan 2024, Published online: 07 Feb 2024

In the physical and life sciences, including drug discovery, the use of deep learning (DL) models, such as language models (LMs) or graph neural networks (GNNs), is on the rise for various applications. LMs are generally designed to translate sequences of characters and are versatile and adaptable to many different machine translation tasks, which has driven their popularity in many areas. Transformer networks, with their multi-head self-attention mechanism and encoder–decoder framework [1], have become powerful LMs for many applications. However, the versatility of DL architectures comes at a price. While they open the door to novel applications, their use is also prone to misconceptions or misunderstandings, often leading to false assumptions and controversial views of scientific applications. This commentary discusses general requirements, potential caveats or pitfalls, and the explanation of DL models in the context of molecular design.

Model explanation & implications

In the realm of complex DL models, characterized by their ‘black box’ nature, procedures providing insights into their operations and predictions are crucial for avoiding incorrect expectations or questionable conclusions. In machine learning (ML) including DL, it is encouraging to note the increasing application of approaches for model explanation, including feature attribution methods such as Shapley additive explanations (SHAP) [2,3]. Such methods belong to explainable artificial intelligence (XAI) and are used to analyze model decisions and explain predictions. Model-agnostic methods such as SHAP are particularly useful since they can be applied to different ML models and enable comparison of their prediction characteristics. There are also other approaches for shedding light on black boxes and helping to rationalize predictions. For example, for transformers, attention weights can be visualized in feature maps to identify features driving predictions. However, a common misunderstanding in model explanation is to regard the results of feature attribution analysis as a chemical or biological interpretation. Any feature attribution methodology aims to identify features that determine a prediction. Importantly, identifying such features does not ensure interpretability. Whether the identified key features are chemically or biologically intuitive and interpretable must be addressed subsequently. For example, structural features driving correct predictions of active compounds can be mapped onto test compounds and visualized. This additional analysis step makes it possible to determine whether key features form coherent substructures that might be associated with the biological activity of test compounds [3,4]. Features determining predictions might not always be understandable based on human reasoning because the decisions of ML models are statistically determined. Thus, a potential lack of feature interpretability is not a shortcoming of correctly used feature attribution or visualization methods. This is not always considered when attempting to rationalize predictions, reflecting a misconception.
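As a concrete illustration of this two-step logic, the minimal sketch below applies SHAP to a hypothetical fingerprint-based potency model trained on synthetic data; the model, data and feature indices are placeholder assumptions rather than any published protocol. Ranking the most influential fingerprint bits corresponds to the attribution step only, while deciding whether the corresponding substructures are chemically meaningful requires the subsequent mapping and visualization described above.

```python
# Minimal sketch: feature attribution with SHAP for a fingerprint-based
# potency model. Data, model and feature indices are synthetic placeholders.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 128)).astype(float)  # toy binary fingerprint matrix
y = 2.0 * X[:, 5] + 1.5 * X[:, 17] + rng.normal(0, 0.1, size=200)  # toy potency values

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions

# Rank fingerprint bits by mean absolute SHAP value. This only identifies the
# features driving the predictions; whether the corresponding substructures
# are chemically interpretable is a separate analysis step.
importance = np.abs(shap_values).mean(axis=0)
top_bits = np.argsort(importance)[::-1][:10]
print("Most influential fingerprint bits:", top_bits)
```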

In a similar vein, the fundamental distinction between correlation and causality in ML [5] is often not taken into consideration. For example, in activity prediction, structural features shared by training and test compounds might strongly correlate with prediction accuracy. However, correlation in ML does not ensure causality. In this example, causality would apply if characteristic structural features not only determined prediction accuracy but were also directly responsible for the given biological activity. However, this is a different question. For instance, the presence of structural features distinguishing active from inactive compounds might be coincidental or result from data bias. Accordingly, one might hypothesize causality in light of accurate predictions, but firmly establishing causality would require experimental work further investigating these features. For many ML applications in the life sciences, testing causality hypotheses requires experimental follow-up.

Model-dependent risks

A major attraction of GNNs, transformers, or other DL models is their potential to address prediction tasks that were previously unfeasible. However, when addressing novel prediction tasks, there is potential for confusion because accurate predictions might be obtained for other than apparent or expected reasons, representing ‘Clever Hans’ incidents [6,7]. This name originates from the true story of a horse (named Hans) that was long believed to be able to count, until artifacts were uncovered [7]. In ML, Clever Hans effects have often been identified retrospectively. For example, in drug discovery, accurately predicting the binding affinity or relative free energies of active compounds continues to be challenging [8]. GNNs have recently been used for compound affinity predictions based on graph representations of protein–ligand interactions extracted from X-ray structures. These studies typically produced fairly accurate predictions, leading to conclusions that GNNs are capable of learning protein–ligand interactions and quantifying binding energies. However, detailed control calculations and XAI analysis subsequently demonstrated that these predictions were largely determined by ligand memorization effects [9,10]. Similar compounds often bind to the same or related targets with comparable potency. Therefore, depending on the composition of training and test sets, reasonable predictions might be obtained if GNNs memorize similar compounds and their affinity. Hence, the results of affinity predictions using GNNs did not depend on learning protein–ligand interactions, representing an exemplary Clever Hans effect. It follows that special care must be taken when exploring novel prediction scenarios with black box DL models. Formulating a clear hypothesis that can be directly tested by an ML model, together with appropriate controls, often helps to avoid Clever Hans effects and incorrect conclusions [11].
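A simple form of such a control is sketched below with synthetic data and placeholder descriptors: a regression model is trained on concatenated ligand and protein features and then retrained after shuffling the protein features across complexes. If performance does not drop, the protein input contributes little and the predictions likely rest on ligand information alone, as in the Clever Hans scenario described above. This is a hypothetical illustration of the idea, not a reproduction of the published control protocols.

```python
# Minimal sketch of a shuffling control against Clever Hans effects in
# protein-ligand affinity prediction. All features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
ligand_feats = rng.normal(size=(n, 64))   # placeholder ligand descriptors
protein_feats = rng.normal(size=(n, 64))  # placeholder protein descriptors
# Toy affinity label that, by construction, depends on the ligand side only.
affinity = 2.0 * ligand_feats[:, 0] + rng.normal(0, 0.2, size=n)

def fit_and_score(prot: np.ndarray) -> float:
    """Train on concatenated ligand/protein features and return test-set R2."""
    X = np.hstack([ligand_feats, prot])
    X_tr, X_te, y_tr, y_te = train_test_split(X, affinity, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

r2_real = fit_and_score(protein_feats)
r2_shuffled = fit_and_score(rng.permutation(protein_feats))  # pairing destroyed

# Comparable scores indicate that the protein input contributes little, i.e.,
# accuracy rests on ligand information alone (a Clever Hans warning sign).
print(f"R2 with real protein features:     {r2_real:.2f}")
print(f"R2 with shuffled protein features: {r2_shuffled:.2f}")
```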

Focus on language models

Models originating from natural language processing are increasingly employed for machine translation tasks in other fields. LMs are built on recurrent neural networks (RNNs) or, increasingly, transformers, and are particularly versatile in learning translations between different types of sequential or textual data representations. Small molecules are typically encoded as string representations such as SMILES. For instance, in the life sciences and drug discovery, LMs can be trained to learn compound-to-compound, protein-to-protein, or protein-to-compound mappings. In generative compound design, this makes it possible to predict new compounds from reference molecules or protein sequence data. LMs for compound-to-compound learning are often referred to as ‘Chemical LMs’ (CLMs), while LMs for supervised or unsupervised learning from protein sequences are referred to as ‘Protein LMs’ (PLMs).
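As a minimal illustration of how compounds enter such models, the sketch below tokenizes a SMILES string with a simplified regular expression so that a CLM can treat it like a sentence in machine translation; the pattern is a generic, commonly used choice rather than the tokenizer of any specific published model.

```python
# Minimal sketch: splitting a SMILES string into tokens for a chemical LM.
# The regex is a simplified, generic tokenization pattern.
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|@|=|#|\(|\)|\.|/|\\|\+|-|%\d{2}|\d|[A-Za-z])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

# Example: aspirin
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```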

Sequence-based compound design

Compound design based on reference molecules is a standard approach in drug design that is not limited to LMs but feasible with a variety of computational methods. By contrast, the prediction of new active compounds from protein sequences is difficult, if not impossible, using other computational approaches. Accordingly, attempts have recently been made to distinguish true protein–ligand pairings (complexes) from randomly assembled (false) pairings [12–15] or to predict compounds directly from protein sequence data [15–18]. For these and other applications, PLMs are also used for representation learning from amino acid sequences, yielding sequence embeddings that implicitly capture structural and functional characteristics of proteins [19,20]. For predicting protein–ligand pairings using LMs, tokenized amino acid sequence and compound representations are combined. Potential applications of these models include target validation (for example, for active compounds from phenotypic screens) or compound repurposing (finding alternative targets and applications for active compounds or drugs). For the prediction of new compounds from protein sequences, representations such as protein embeddings serve as input for generating compound (output) strings. Models for the assessment of protein–ligand pairs or sequence-based compound predictions typically combine PLM and CLM components. Such LMs have correctly predicted pairs or active compounds in benchmark calculations and prospective applications involving experimental evaluation [15].
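The sketch below illustrates, at a purely schematic level, how a pairing model of this kind can combine a protein embedding (as produced by a PLM) with a compound embedding (as produced by a CLM) to score whether a pairing is true or random. The dimensions, architecture and random inputs are illustrative assumptions and do not reproduce any specific published model.

```python
# Minimal sketch: scoring protein-ligand pairings from pre-computed embeddings.
# Embedding sizes and the classification head are illustrative placeholders.
import torch
import torch.nn as nn

class PairingClassifier(nn.Module):
    """Scores whether a (protein, compound) pair is a true or a random pairing."""

    def __init__(self, protein_dim: int = 1024, compound_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(protein_dim + compound_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit per pair
        )

    def forward(self, protein_emb: torch.Tensor, compound_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the PLM-derived and CLM-derived embeddings and score the pairing.
        return self.head(torch.cat([protein_emb, compound_emb], dim=-1)).squeeze(-1)

model = PairingClassifier()
protein_emb = torch.randn(4, 1024)   # stand-in for pooled PLM sequence embeddings
compound_emb = torch.randn(4, 256)   # stand-in for pooled CLM compound embeddings
print(torch.sigmoid(model(protein_emb, compound_emb)))  # pairing probabilities for 4 pairs
```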

Scientific reasoning

Sequence-to-compound learning is an instructive example of a prediction task at the crossroads between computational feasibility and scientific reasoning. The idea of predicting active compounds from protein sequences is not new, but it can now be addressed computationally in sophisticated ways using LMs. This section delves deeper into the underlying scientific challenges of LMs or other DL models designed for this purpose. In structure-based drug design, the availability of 3D target structure information enables the delineation of ligand binding sites and the application of computational approaches to identify candidate compounds with a high degree of shape and chemical complementarity to given sites. In three dimensions, in the context of a folded protein structure, only a limited number of amino acid residues participate in ligand binding. These contact residues are typically widely distributed across the primary structure of the protein and might form characteristic sequence motifs for individual protein families. By contrast, the majority of residues in protein sequences are not involved in ligand binding but are important for the structural integrity of a given protein fold. These residues might also form sequence motifs that are characteristic of secondary structure elements or other structural features. However, during evolution, protein structure has been much more conserved than sequence, and large statistical variations are often observed in sequences adopting a given fold, up to the point that global sequence similarity is no longer statistically detectable.

Insights into the formation of ligand binding sites by a limited number of residues, and into the statistical variation of sequences among proteins with similar structures, are not available to a computational model learning sequence-to-compound mappings. Instead, the model learns to associate protein sequences with structures of active compounds based on large volumes of sequence and compound training data. Protein representation learning via PLMs might recognize characteristic patterns in sequences that are indicative of structural or functional features and produce informative embeddings. Consistent with our insights into protein sequence-structure relationships and ligand binding sites, one might hypothesize that an LM for sequence-to-compound predictions must be capable of learning residue patterns implicated in ligand binding to correctly predict novel compounds; an ambitious conjecture. Protein sequences can be modified through ‘computational mutations’ and re-tested to identify residues that are important for correctly predicting an active compound [15]. In selected cases, such calculations might provide evidence for the importance of individual binding site residues for accurate predictions [15]. However, must this be the case? Can LMs only predict active compounds based on sequence data if binding site motifs are recognized? Or can predictions be driven by associating compound structures with residue patterns in global sequences that become statistical signatures although they are not implicated in ligand binding (potentially moving into Clever Hans territory…)? These questions point to critical issues. In sequence-to-compound modeling, LM predictions might produce promising results not only if a model indeed learns what we know to be the determinants of specific protein–ligand interactions. There might be many other reasons for the success (or failure) of a model, which might be hidden from us. Ultimately, if predictions of events that have solid physical foundations are solely driven by statistical associations, LMs or other DL models will strongly depend on training data and protocols and have limited generalization potential. Their predictions will not be sustainable. In this context, sound scientific reasoning requires awareness that ML might often not conform to our knowledge or preconceived notions and that stringent control calculations are required for hypothesis testing aiming to rationalize predictions. For sequence-to-compound modeling, systematic XAI analysis using alternative approaches will be essential for exploring the origins of prediction outcomes, avoiding misinterpretation, and critically judging model performance and the most influential factors.
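A schematic version of such a computational mutation scan is sketched below; score_pairing is a hypothetical stand-in for any sequence-to-compound or pairing model, and the toy scoring rule serves only to make the loop runnable. The scan flags the residues a model relies on; whether these coincide with known binding site residues remains a separate question, as discussed above.

```python
# Minimal sketch of a "computational mutation" scan: each residue is replaced
# (here by alanine) and the model score for a given compound is recomputed;
# large score drops flag positions the model relies on.
def score_pairing(sequence: str, smiles: str) -> float:
    """Hypothetical placeholder for a sequence-to-compound model; toy rule only."""
    return float(sum(1 for i, aa in enumerate(sequence) if aa in "HDS" and i < 50))

def mutation_scan(sequence: str, smiles: str, substitute: str = "A") -> list[tuple[int, float]]:
    """Replace each residue in turn and record the drop in the model score."""
    baseline = score_pairing(sequence, smiles)
    effects = []
    for i in range(len(sequence)):
        mutated = sequence[:i] + substitute + sequence[i + 1:]
        effects.append((i, baseline - score_pairing(mutated, smiles)))
    # Positions with the largest score drop are the residues the model relies on;
    # whether they coincide with known binding site residues is a separate question.
    return sorted(effects, key=lambda x: x[1], reverse=True)

print(mutation_scan("MKTHDSLLGHSD", "CC(=O)Oc1ccccc1C(=O)O")[:5])
```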

Conclusion

Interpretation of ML models is of critical importance for interdisciplinary applications. However, the identification of features determining predictions must be distinguished from chemical or biological interpretability, for which feature attribution analysis provides a starting point. Furthermore, in life science and drug discovery applications, establishing causality for predictions often requires experimental follow-up. In general, special care must be taken to avoid ascribing prediction outcomes to incorrect reasons. Clearly formulated hypotheses that can be directly tested using an ML model often help to avoid such pitfalls. In pharmaceutical research, LMs offer many opportunities for addressing previously difficult or unfeasible prediction scenarios as machine translation tasks. Sequence-based compound design is an instructive example of a task that has become feasible through the use of LMs but might be viewed controversially. Here, scientific reasoning becomes critically important at different levels, for example, by considering that positive predictions might not be a consequence of learning the physical foundations of binding events, designing scientifically meaningful controls for such predictions, and avoiding premature interpretation of prediction results. Without doubt, for new prediction tasks tackled using LMs, the development of explanatory methods for analyzing the learning characteristics of these models and the origins of their predictions will become increasingly important.

Financial disclosure

The author has no financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Writing disclosure

No writing assistance was utilized in the production of this manuscript.

Editorial Board

J Bajorath is a member of the Future Science OA Editorial Board. They were not involved in any editorial decisions related to the publication of this article, and all author details were blinded to the article's peer reviewers as per the journal's double-blind peer review policy.

Competing interests disclosure

The author has no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

References

1. Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. Adv. Neur. Inf. Proc. Sys. 30(1), 5998–6008 (2017).
2. Chen H, Covert IC, Lundberg SM, Lee S. Algorithms to estimate Shapley value feature attributions. Nat. Mach. Intell. 5(6), 590–601 (2023).
3. Rodríguez-Pérez R, Bajorath J. Explainable machine learning for property predictions in compound optimization. J. Med. Chem. 64(24), 17744–17752 (2021).
4. Feldmann C, Bajorath J. Machine learning reveals that structural features distinguishing promiscuous and non-promiscuous compounds depend on target combinations. Sci. Rep. 11(1), 7863 (2021).
5. Bontempi G, Flauder M. From dependency to causality: a machine learning approach. J. Mach. Learn. Res. 16(1), 2437–2457 (2015).
6. Lapuschkin S, Wäldchen S, Binder A, Montavon G, Samek W, Müller KR. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 10(1), 1096 (2019).
7. Pfungst O. Clever Hans (the horse of Mr. Von Osten): contribution to experimental animal and human psychology. J. Philos. Psychol. Sci. Meth. 8(1), 663–666 (1911).
8. Abel R, Wang L, Harder ED, Berne BJ, Friesner RA. Advancing drug discovery through enhanced free energy calculations. Acc. Chem. Res. 50(7), 1625–1632 (2017).
9. Volkov M, Turk J-A, Drizard N et al. On the frustration to predict binding affinities from protein–ligand structures with deep neural networks. J. Med. Chem. 65(11), 7946–7958 (2022).
10. Mastropietro A, Pasculli G, Bajorath J. Learning characteristics of graph neural networks predicting protein–ligand affinities. Nat. Mach. Intell. 5(12), 1427–1436 (2023).
11. Bajorath J. Potential inconsistencies or artifacts in deriving and interpreting deep learning models and key criteria for scientifically sound applications in the life sciences. Artif. Intell. Life Sci. 5(1), 100093 (2024).
12. Chen L, Tan X, Wang D et al. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 36(16), 4406–4414 (2020).
13. Nguyen T, Le H, Quinn TP et al. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 37(8), 1140–1147 (2020).
14. Zhao Q, Zhao H, Zheng K, Wang J. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 38(3), 655–662 (2022).
15. Chen L, Fan Z, Chang J et al. Sequence-based drug design as a concept in computational drug design. Nat. Commun. 14(1), 4217 (2023).
16. Grechishnikova D. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Sci. Rep. 11(1), 321 (2021).
17. Qian H, Lin C, Zhao D et al. AlphaDrug: protein target specific de novo molecular generation. PNAS Nexus 1(4), pgac227 (2022).
18. Yoshimori A, Bajorath J. Motif2Mol: prediction of new active compounds based on sequence motifs of ligand binding sites in proteins using a biochemical language model. Biomolecules 13(5), 833 (2023).
19. Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118(25), e2016239118 (2021).
20. Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 7112–7127 (2022).