Toward generalizable prediction of antibody thermostability using machine learning on sequence and structure features

Ameya Harmalkar (a), Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, USA

https://orcid.org/0000-0001-6863-9634

Roshan Rao (b), Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA

https://orcid.org/0000-0003-4412-3742

Yuxuan Richard Xie (c), Department of Bioengineering and Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, USA

https://orcid.org/0000-0003-1664-9114

Jonas Honer (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Wibke Deisting (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Jonas Anlahr (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Anja Hoenig (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Julia Czwikla (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

https://orcid.org/0000-0001-7856-789X

Eva Sienz-Widmann (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Doris Rau (d), Therapeutic Discovery, Amgen Research (Munich) GmbH, Munich, Germany

Austin J. Rice (e), Therapeutic Discovery, Amgen Research, Amgen Inc, Thousand Oaks, CA, USA

https://orcid.org/0000-0002-4165-4241

Timothy P. Riley (e), Therapeutic Discovery, Amgen Research, Amgen Inc, Thousand Oaks, CA, USA

Danqing Li (e), Therapeutic Discovery, Amgen Research, Amgen Inc, Thousand Oaks, CA, USA

Hannah B. Catterall (e), Therapeutic Discovery, Amgen Research, Amgen Inc, Thousand Oaks, CA, USA

Christine E. Tinberg (f), Therapeutic Discovery, Amgen Research, Amgen Inc, South San Francisco, CA, USA

https://orcid.org/0000-0002-6179-0435

Jeffrey J. Gray (a), Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, USA

https://orcid.org/0000-0001-6380-2324

Kathy Y. Wei (f), Therapeutic Discovery, Amgen Research, Amgen Inc, South San Francisco, CA, USA. Correspondence: [email protected]

https://orcid.org/0000-0002-8794-1385

Figures & data

Figure 1. A pipeline to identify scFv thermostability using deep learning. (a) The biological challenge of antibody thermostability prediction from sequences. Antibody thermostability can determine the manufacturability of antibodies in downstream processes. The biological question that we aim to tackle is whether we can predict the thermal characteristics of an scFv, given its sequence. Available data for this challenge can comprise amino acid sequences, structures and calculated energetics. Antibodies with pre-determined temperature characteristics are paramount for this task, but such datasets are scarce. (b) Thermostability data generation. To generate a dataset of scFv sequences with known temperature-specific features, we measured the loss of target binding of each scFv after high-temperature stress to obtain a TS50 measurement. (c) Training a classification network for predicting TS50 bins. One approach is transfer learning with unsupervised models (top branch). We used pre-trained BERT-like models (such as ESM-1b, ESM-1v, etc.) to make (1) zero-shot predictions and (2) fine-tuned predictions with the labeled TS50 dataset. Another approach is to train a supervised model with calculated thermodynamic energies (bottom branch). We used sequence- and structure-based features to train a classifier with simple convolutional models. The trained ML models can be employed either to predict the thermostability of generated antibody sequences or to computationally validate experimental designs.

Panel A: The biological challenge of thermostability prediction from sequence, structure or energy features. Panel B: Curation of experimental TS50 data for training. Panel C: An overview of machine learning models with pre-trained language models (top branch) and supervised convolutional networks (bottom branch) trained for the thermostability prediction task.

Figure 2. Fine-tuning over pre-trained unsupervised models improves correlation on withheld targets. (a) Zero-shot and (b) Fine-tuned sequence scoring methods for thermostability prediction. (c) Zero-shot likelihood-based predictions with pre-trained models do not correlate strongly with the TS50 datasets. (d) Fine-tuning the pre-trained models on TS50 data from $n - 1$ targets significantly improves correlation on the held-out target. (e) Zero-shot likelihood-based predictions on blind test sets. (f) Models fine-tuned on TS50 data do not generalize well to blind test sets.

Panels A, B: Zero-shot and fine-tuned training on the pre-trained language models, respectively. Panels C, D: Performance on validation sets for zero-shot and fine-tuned models, respectively. Zero-shot models do not show strong correlation. Panels E, F: Performance on out-of-distribution test sets for zero-shot and fine-tuned models, respectively. Both show poor generalization.
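The held-out-target evaluation described for panel (d), training on TS50 data from $n - 1$ targets and scoring the remaining one, can be sketched as a leave-one-target-out split. This is a minimal illustration, not the authors' code: the `(target, sequence, ts50)` record layout, the function name and the toy records are all assumptions.

```python
from collections import defaultdict


def leave_one_target_out(records):
    """Yield (held_out_target, train_records, validation_records) splits.

    `records` is a list of (target, sequence, ts50) tuples; this layout is
    an assumption made for the illustration.
    """
    by_target = defaultdict(list)
    for rec in records:
        by_target[rec[0]].append(rec)
    for held_out in sorted(by_target):
        # Train on every target except the held-out one.
        train = [r for t, recs in by_target.items() if t != held_out for r in recs]
        yield held_out, train, by_target[held_out]


# Toy placeholder records (hypothetical names and values, not real data).
records = [
    ("target_A", "SEQ1", 61.0),
    ("target_A", "SEQ2", 72.5),
    ("target_B", "SEQ3", 55.0),
    ("target_C", "SEQ4", 68.0),
]
splits = list(leave_one_target_out(records))
```

Each split keeps every sequence for one target out of training, so the validation score reflects performance on a target the model has not seen.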

Figure 3. Energy features can extract ‘generalizable’ information about thermostability. (a) The supervised convolutional network architecture for classification of antibody sequences. The input scFv sequences pass through a structure-generation module with DeepAb, followed by Rosetta-based evaluation to estimate per-residue energies for each amino acid residue in the scFv structure. The sequences are one-hot encoded (top branch) and the energetic features, represented as an i-j matrix (bottom branch), are provided to the network. The outputs from the sequence branch and the energy branch are passed through a dense layer to generate the probabilities that the sequence lies in each of the temperature bins. (b) t-distributed stochastic neighbor embeddings (t-SNE) from the energetics-only model, colored by temperature bin. (c) Receiver-operating characteristic curve demonstrating the classification of the test sequences for the above-70 bin with the energetics-only model. Note that Test scFv and Isolated scFv have a smaller sample size, explaining the relatively less rugged nature of the curves. (d) The model’s performance metrics for the classification task on completely blind test scFv sequences are reported with Spearman’s correlation coefficient.
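As an illustration of the one-hot encoding used in the sequence branch of panel (a), the sketch below maps each residue to a 20-dimensional indicator vector. The alphabet ordering, the function name and the four-residue toy fragment are assumptions; the source does not specify how the authors order or pad their encodings.

```python
# Alphabet order is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return a list of 20-element rows, one row per residue."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


encoded = one_hot("EVQL")  # toy fragment, not a sequence from the study
```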

Panel A: Supervised model architecture trained on sequence and energy features. Panel B: Model embeddings show that sequence-only networks are clustered by experimental set, but energy-only models are independent of experimental bias. Panel C: ROC curve for the 70-up bin shows more than 0.7 true positive rate on test sets. Panel D: Performance of the supervised models is evaluated with Spearman’s correlation coefficient.
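Spearman's correlation coefficient, the metric named for panel (d), is the Pearson correlation of the ranks of two variables. The sketch below is a plain-Python version with average ranks for ties; the function names and the toy inputs are assumptions, and it does not guard against constant inputs (which would divide by zero).

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy example: predicted bin indices against made-up measured values.
rho = spearman([0, 1, 1, 2], [48.0, 60.0, 66.0, 74.0])
```

Because it depends only on rank order, the coefficient suits a comparison between discrete predicted bins and continuous measurements.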

Figure 4. Computational deep mutational scan of an antibody variable fragment shows agreement with experimental thermal denaturation data. Scoring all point mutants of an anti-VEGF antibody (PDB ID: 2FJG (bound) and PDB ID: 2FJF (unbound)) with our supervised convolutional network (SCNN) and fine-tuned ESM-1b models, we observed agreement between the mutants predicted in the over-70°C bin and the experimental thermal denaturation data available from prior work.^{10,33} Spheres indicate the experimentally validated mutants that improved Tm; pink indicates network predictions at the same residue position but with a different amino acid mutation; red indicates predictions matching experimental data; and gray indicates mutations that were not observed in computational predictions. Thumbnails highlight the mutations in agreement with experiments and their potential interactions. The table compares the experimental data with the computational predictions.

Comparison of experimental thermal denaturation data from a mAb with the computational predictions made by the pre-trained language models and supervised models. Out of the 20 mutations experimentally validated for the mAb, the models could identify 18 residue positions and 5 exact amino-acid mutations.
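The first step of a computational deep mutational scan is enumerating every single-residue substitution of the parent sequence, which is then scored by the models. The sketch below shows only that enumeration, under stated assumptions: the function name and toy fragment are hypothetical, and positions are numbered sequentially from 1 rather than with an antibody numbering scheme.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutants(sequence):
    """Yield (label, mutant_sequence) for every single-residue substitution.

    Labels use the wild-type residue, a sequential 1-based position and the
    new residue (e.g. "E1A"); the numbering scheme is an assumption.
    """
    for pos, wild_type in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the unmutated residue
            yield f"{wild_type}{pos + 1}{aa}", sequence[:pos] + aa + sequence[pos + 1:]


mutants = dict(point_mutants("EVQL"))  # toy fragment, not the anti-VEGF sequence
```

A sequence of length L yields 19 × L mutants, so the number of candidates to score grows linearly with sequence length.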

Data availability statement

The source code for TherML (zero-shot, fine-tuned and supervised models) is available at https://github.com/AmeyaHarmalkar/therML for non-commercial use only. The experimental thermostability data and sequences come from internal antibody engineering studies and cannot be made available because the sequences are the intellectual property of Amgen. Any additional information required to reanalyze the data reported in this paper is available from the lead author upon request.
