Machine learning in the analysis of biomolecular simulations

Article: 2006080 | Received 19 Mar 2021, Accepted 09 Nov 2021, Published online: 10 Jan 2022

Figures & data

Figure 1. Typical Machine Learning Workflow. Data generated from simulation trajectories are first represented by selecting suitable features, usually reducing the dimensionality. A subset of the data is then chosen to train the ML task and generate a model, whose hyperparameters are optimized and which is validated and tested for overfitting.
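
The workflow in Figure 1 can be sketched with standard tooling. The snippet below is a minimal illustration, not the authors' pipeline, and assumes that per-frame features have already been extracted from a trajectory into an array X with hypothetical labels y; it uses scikit-learn for dimensionality reduction, hyperparameter optimization by cross-validation, and a held-out test set to check for overfitting.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for featurized trajectory frames
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))            # 1000 frames, 50 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # hypothetical per-frame labels

# 1) Feature representation / dimensionality reduction
X_red = PCA(n_components=10).fit_transform(X)

# 2) Choose data for training and keep a test set aside
X_train, X_test, y_train, y_test = train_test_split(
    X_red, y, test_size=0.2, random_state=0)

# 3) Train the model and optimize its hyperparameters by cross-validation
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)
search.fit(X_train, y_train)

# 4) Validate and test for overfitting on held-out data
print("cross-validation score:", search.best_score_)
print("test score:            ", search.score(X_test, y_test))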

Figure 2. Dimensionality Reduction with Machine Learning. A. PCA is used to detect the directions of highest variance. In a two-dimensional case, PCA resolves the variance into two orthogonal Principal Components (PCs). B. Typical eigenvalue spectrum obtained from PCA. PCA can be used to reduce dimensionality by selecting a cut-off where the variance starts to decay asymptotically to zero. C. Comparison of PCA and tICA on a two-well potential. Left: projection of the first PC and the first tIC on the data; the first tIC correctly identifies the two minima in the free energy surface (FES). Right: the first tIC points along the direction connecting the two minima in the FES. D. Kernel PCA solves the PCA problem by first applying a non-linear transform that embeds the data into a higher-dimensional space, where a hyperplane can linearly separate data points that were not linearly separable before.
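
As a concrete, hedged illustration of panels A to D, the sketch below applies PCA with a variance cut-off, a bare-bones tICA built from the time-lagged covariance matrix, and kernel PCA to synthetic data standing in for trajectory features; the lag time and the 90% variance threshold are arbitrary choices, not values from the reviewed work.

import numpy as np
from scipy.linalg import eigh
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))            # placeholder trajectory: frames x features
X -= X.mean(axis=0)                        # center the data

# A/B: PCA; keep components up to a cut-off in the eigenvalue spectrum
pca = PCA().fit(X)
n_keep = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.9)) + 1
X_pca = pca.transform(X)[:, :n_keep]

# C: tICA; solve the generalized eigenvalue problem C_tau v = lambda C_0 v,
# which yields the slowest (most autocorrelated) linear coordinates
tau = 10                                   # lag time in frames (arbitrary)
C0 = X.T @ X / len(X)
Ctau = X[:-tau].T @ X[tau:] / (len(X) - tau)
Ctau = 0.5 * (Ctau + Ctau.T)               # symmetrize the time-lagged covariance
eigvals, eigvecs = eigh(Ctau, C0)
tic1 = X @ eigvecs[:, -1]                  # projection onto the slowest tIC

# D: kernel PCA; a non-linear (RBF) transform embeds the data in a space
# where a hyperplane can separate points that were not linearly separable
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)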

Figure 3. ANNs for Dimensionality Reduction. Autoencoder networks (shown in gray) are used to train a lower-dimensional representation of the simulation data by reconstructing sampled structures (deep blue) as decoded structures (light blue). A trained autoencoder can be used to generate a latent-space representation of the data set (blue points), which can be used to generate unseen latent-space data (red points) and mine unsampled structures (red structures). Figure adapted from Degiacomi et al. [Citation62].
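
The autoencoder idea can be illustrated with a few lines of PyTorch. This is a deliberately small sketch on assumed placeholder data, not the architecture of the cited work: an encoder compresses flattened coordinates into a two-dimensional latent space, a decoder reconstructs the structures, and new latent points are decoded into candidate unsampled conformations.

import torch
from torch import nn

n_atoms = 100
X = torch.randn(2000, 3 * n_atoms)         # flattened Cartesian coordinates (placeholder)

encoder = nn.Sequential(nn.Linear(3 * n_atoms, 64), nn.ReLU(), nn.Linear(64, 2))
decoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3 * n_atoms))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

# Train by minimizing the reconstruction error between sampled and decoded structures
for epoch in range(50):
    opt.zero_grad()
    z = encoder(X)
    loss = nn.functional.mse_loss(decoder(z), X)
    loss.backward()
    opt.step()

# Sample unseen points in the latent space (red points in the figure) and
# decode them into candidate, previously unsampled structures
with torch.no_grad():
    z_new = torch.randn(10, 2) * z.std(dim=0) + z.mean(dim=0)
    new_structures = decoder(z_new).reshape(10, n_atoms, 3)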

Figure 4. Principal Component Regression (PCR). A. PCR-based ensemble-weighted mode for the Leucine Binding Protein. B. Coefficients α_i giving the contribution of the largest PCA eigenvectors to the PCR model. C. Eigenvalues of the PCs used to construct the PCR model. D. Contribution of the variance of the PCs to the variance of the collective mode. Figure adapted from Hub et al. [Citation63].
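
In PCR, the target observable is regressed onto the projections of the trajectory onto the largest principal components, so that each PC enters the model with a coefficient α_i (panel B). The sketch below reproduces this logic on synthetic data with scikit-learn; the observable y and all dimensions are placeholders.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 30))            # frames x features (placeholder)
y = X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=3000)   # placeholder observable

# Project the data onto the largest principal components (panel C)
n_pcs = 10
pca = PCA(n_components=n_pcs).fit(X)
proj = pca.transform(X)

# Regress the observable onto the PC projections; the fitted coefficients
# are the contributions alpha_i of each PC to the model (panel B)
reg = LinearRegression().fit(proj, y)
alpha = reg.coef_

# The ensemble-weighted collective mode (panel A) is the alpha-weighted
# combination of the PCA eigenvectors in the original feature space
mode = alpha @ pca.components_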

Figure 5. ANNs for Regression. ANNs can be trained to develop coarse-grained force fields by the force-matching method. Physical constraints of translational and rotational invariance and of conservative forces are imposed on the regression task performed by the CGnet architecture through the choice of internal coordinates and the GDML layer, whereby the atomistic force field is reduced to a coarse-grained force field. Figure adapted from Wang et al. [Citation71]. Further permissions related to the material excerpted (https://pubs.acs.org/doi/full/10.1021/acscentsci.8b00913) should be directed to the ACS.
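
The force-matching regression behind this figure can be sketched in PyTorch. The snippet is a schematic stand-in for CGnet, not the published implementation: it assumes the coarse-grained configuration is already expressed in invariant internal coordinates (placeholder tensors here), predicts a scalar CG energy with a small network, and obtains conservative forces as the negative gradient of that energy, which are then matched to mapped atomistic reference forces.

import torch
from torch import nn

n_coords = 12
R = torch.randn(4000, n_coords, requires_grad=True)   # internal coordinates (placeholder)
F_ref = torch.randn(4000, n_coords)                   # mapped atomistic forces (placeholder)

energy_net = nn.Sequential(nn.Linear(n_coords, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)

for epoch in range(20):
    opt.zero_grad()
    E = energy_net(R).sum()
    # Conservative CG forces by construction: F = -dE/dR via autograd
    F_pred = -torch.autograd.grad(E, R, create_graph=True)[0]
    # Force-matching loss: mean squared deviation from the reference forces
    loss = ((F_pred - F_ref) ** 2).mean()
    loss.backward()
    opt.step()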

Figure 6. Classification and Clustering (Machine Learning) Techniques. A. Partial Least Squares Discriminant Analysis (PLS-DA) is used to identify the collective mode that differentiates between bound and unbound ubiquitin simulations. Projection of the simulation data from the bound and unbound simulations on the difference vector given by the PLS-DA mode separates their distributions. Inset: the structural ensembles of the ubiquitin binding region in the unbound and bound modes identified by the PLS-DA eigenvector. Figure adapted from Peters et al. [Citation74]. B. Active and inactive states of the Src kinase are identified from MD simulations by clustering and then classified using a Random Forest (RF) classifier. Using the Gini index, importance is assigned to the residues that contribute most significantly to the classification. Figure adapted from Sultan et al. [Citation76]. Further permissions related to the material excerpted (https://pubs.acs.org/doi/10.1021/ct500353m) should be directed to the ACS. C. The L11·23S protein complex is first subjected to dimensionality reduction. The simulation data are projected on the first two eigenvectors and then clustered with the k-means algorithm to identify the structure in the data. Figure adapted from Wolf et al. [Citation35]. D. Gaussian Mixture Model (GMM)-based clustering is used to identify the free energy minima in calmodulin simulations using the InfleCS methodology. The GMM is built on a 2D surface using the reciprocal interatomic distances (DRID) and the linker backbone dihedral angle correlation (BDAC) as coordinates. The most likely transition pathways between these states are identified and plotted on the free energy surface. Figure adapted from Westerlund et al. [Citation96]. Further permissions related to the material excerpted (https://pubs.acs.org/doi/abs/10.1021/acs.jctc.9b00454) should be directed to the ACS.
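
The four panels combine standard classifiers and clustering methods that are all available in scikit-learn. The sketch below is a hedged, synthetic-data illustration of the same building blocks, not the analyses of the cited works: PLS regression onto class labels as a stand-in for PLS-DA (panel A), a random forest with Gini-based feature importances (panel B), k-means on a two-dimensional PCA projection (panel C), and a Gaussian mixture model whose components approximate free energy minima (panel D). All system sizes and labels are placeholders.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (500, 20)), rng.normal(3, 1, (500, 20))])   # two "states"
y = np.repeat([0, 1], 500)                                                  # state labels

# Panel A: PLS-DA; projection on the first PLS component separates the two distributions
pls = PLSRegression(n_components=1).fit(X, y)
proj_pls = pls.transform(X)

# Panel B: random-forest classification with Gini-based residue/feature importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_features = np.argsort(rf.feature_importances_)[::-1][:5]

# Panel C: project onto the first two eigenvectors, then cluster with k-means
X2 = PCA(n_components=2).fit_transform(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)

# Panel D: Gaussian mixture model; its density maxima play the role of free energy minima
gmm = GaussianMixture(n_components=2, random_state=0).fit(X2)
gmm_labels = gmm.predict(X2)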

Figure 7. ANNs for Classification. ANNs are used to classify GPCRs based on differences in the bound agonist. The geometric coordinates are first encoded as RGB values to generate a two-dimensional image. Using a convolutional neural network architecture, the mapping between the image and the corresponding label is learned. Sensitivity analysis is used to trace the classification back to the pixels in the image, and from them to the residues that are mainly responsible for the classification. Figure adapted from Plante et al. [Citation98].
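
The pipeline in Figure 7 combines image encoding, CNN training, and gradient-based sensitivity analysis. The snippet below is a minimal PyTorch sketch of those three steps under assumed toy dimensions, not the authors' implementation: random tensors stand in for RGB-encoded coordinate images, a small CNN learns the image-to-label mapping, and the gradient of the class score with respect to the input highlights the most influential pixels.

import torch
from torch import nn

n_frames, H, W = 512, 16, 16
images = torch.rand(n_frames, 3, H, W)     # RGB-encoded coordinates (placeholder)
labels = torch.randint(0, 2, (n_frames,))  # e.g. class of bound agonist (placeholder)

cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * (H // 2) * (W // 2), 2),
)
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Learn the mapping between image and label
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(cnn(images), labels)
    loss.backward()
    opt.step()

# Sensitivity analysis: gradient of the predicted class score with respect to
# the input pixels; large values mark pixels (and hence residues) that drive
# the classification
x = images[:1].clone().requires_grad_(True)
score = cnn(x)[0, labels[0]]
score.backward()
saliency = x.grad.abs().sum(dim=1)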
