Review

Another string to your bow: machine learning prediction of the pharmacokinetic properties of small molecules

Pages 683-698 | Received 23 Oct 2023, Accepted 23 Apr 2024, Published online: 10 May 2024
 

ABSTRACT

Introduction

Prediction of pharmacokinetic (PK) properties is crucial for drug discovery and development. Machine-learning (ML) models, which use statistical pattern recognition to learn correlations between input features (such as chemical structures) and target variables (such as PK parameters), are being increasingly used for this purpose. To embed ML models for PK prediction into workflows and to guide future development, a solid understanding of their applicability, advantages, limitations, and synergies with other approaches is necessary.

Areas covered

This narrative review discusses the design and application of ML models to predict PK parameters of small molecules, especially in light of established approaches including in vitro-in vivo extrapolation (IVIVE) and physiologically based pharmacokinetic (PBPK) models. The authors illustrate scenarios in which the three approaches are used and emphasize how they enhance and complement each other. In particular, they highlight the achievements, the state of the art, and the potential of machine learning for PK prediction through a comprehensive literature review.

Expert opinion

ML models, when carefully crafted, regularly updated, and appropriately used, empower users to prioritize molecules with favorable PK properties. Informed practitioners can leverage these models to improve the efficiency of the drug discovery and development process.

Article highlights

  • Predicting pharmacokinetic (PK) properties of drug candidates in animals and in humans is an essential task for drug discovery and development.

  • Machine learning is increasingly applied to predict absorption, distribution, metabolism, and excretion (ADME) and PK properties of small molecules.

  • Machine-learning (ML) models complement established methods including in vitro-in vivo extrapolation (IVIVE) and physiologically based pharmacokinetic (PBPK) modeling, enhancing the ability to design and prioritize molecules with favorable PK properties.

  • Successful prediction of PK parameters with ML models requires high-quality, continuously updated data; a reliable infrastructure; mechanisms to assess model performance regularly and to retrain the model when necessary; feedback and retrospective analysis comparing predictions with observations; and research and education on how to integrate the models into drug discovery workflows.

  • ML-based PK prediction warrants further research, in particular on enriching data, improving model interpretability, reducing bias, and exploring synergies with other models, especially in clinical settings.

  • The integration of ML models with IVIVE and PBPK approaches can provide a more comprehensive understanding of a drug’s behavior, potentially improving the efficiency of drug discovery and development.

List of abbreviations

AAFE=

Absolute average fold error, a value indicating the closeness of the model predictions to the real values. The value is a positive number equal to or greater than 1. The higher the value, the further the predictions are from the real values (and therefore the poorer the performance). The definition is given below (the vertical bars denote the absolute value; n = number of observations).

$$\mathrm{AAFE} = 10^{\frac{1}{n}\sum_{i=1}^{n}\left|\log_{10}\left(\frac{\mathrm{prediction}_i}{\mathrm{observation}_i}\right)\right|}$$

AFE=

Average fold error, a measure of the average over- or underestimation of a predicted property by a model. Its value is positive: values above 1 indicate a tendency to overprediction, while values below 1 indicate a tendency to underprediction. The definition is given below (n = number of observations).

$$\mathrm{AFE} = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10}\left(\frac{\mathrm{prediction}_i}{\mathrm{observation}_i}\right)}$$
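
As a minimal sketch of the two fold-error metrics defined above, the following Python functions compute AFE and AAFE from paired predictions and observations. The function names and the numeric values are hypothetical and chosen only for illustration; they are not data from the review.

```python
import math


def afe(predictions, observations):
    """Average fold error: 10 ** mean(log10(prediction / observation))."""
    logs = [math.log10(p / o) for p, o in zip(predictions, observations)]
    return 10 ** (sum(logs) / len(logs))


def aafe(predictions, observations):
    """Absolute average fold error: 10 ** mean(|log10(prediction / observation)|)."""
    logs = [abs(math.log10(p / o)) for p, o in zip(predictions, observations)]
    return 10 ** (sum(logs) / len(logs))


# Hypothetical values (illustrative only): one twofold overprediction,
# one twofold underprediction, and one exact prediction.
predicted = [2.0, 5.0, 40.0]
observed = [1.0, 10.0, 40.0]

afe_value = afe(predicted, observed)
aafe_value = aafe(predicted, observed)
print(f"AFE = {afe_value:.3f}, AAFE = {aafe_value:.3f}")
```

In this example the over- and underprediction cancel in the AFE, which stays at 1, while the AAFE is about 1.59. This is why the two metrics are usually read together: AFE reports bias, AAFE reports the typical size of the error.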

ADME=

Absorption, distribution, metabolism, and excretion.

AUC=

Area under the (time/concentration) curve. It is defined as the definite integral of the concentration of a drug in plasma as a function of time.
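
In practice the integral is approximated from discrete plasma samples. The sketch below uses the linear trapezoidal rule, which is a common numerical choice but an assumption here, since the definition above does not prescribe a method. It covers only the sampled interval (no extrapolation to infinity), and the function name, units, and sample values are hypothetical.

```python
def auc_trapezoid(times, concentrations):
    """Linear trapezoidal AUC between the first and last sampling time."""
    if len(times) != len(concentrations) or len(times) < 2:
        raise ValueError("need at least two paired time/concentration samples")
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("sampling times must be strictly increasing")
        # Area of one trapezoid: interval width times mean concentration.
        total += dt * (concentrations[i] + concentrations[i - 1]) / 2.0
    return total


# Hypothetical profile (illustrative only): time in h, concentration in mg/L.
times_h = [0.0, 1.0, 2.0, 4.0]
conc_mg_per_l = [0.0, 4.0, 3.0, 1.0]

auc_value = auc_trapezoid(times_h, conc_mg_per_l)
print(f"AUC(0-4 h) = {auc_value} mg*h/L")
```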

CLp=

Plasma clearance, the volume of plasma that is cleared of the drug per unit of time.

Cmax=

Maximum plasma concentration, also written as $C_{\max}$.

DL=

Deep learning. A term used to describe machine-learning methodologies that use neural network architectures with multiple layers.

DNN=

Deep neural network.

F%=

Bioavailability. It is expressed as the percentage of the administered compound that reaches the systemic circulation. It is calculated as the ratio between the AUC of the administration route of interest and the AUC of the intravenous route, which has F% = 100% by definition.
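
The ratio can be sketched in a few lines of Python. The dose normalization in the function below is an assumption added for the common case in which the two routes are dosed differently; with equal doses it reduces to the plain AUC ratio described above. The function name and all numbers are hypothetical.

```python
def bioavailability_percent(auc_route, dose_route, auc_iv, dose_iv):
    """F% = 100 * (AUC_route / dose_route) / (AUC_iv / dose_iv).

    Dose normalization is an assumption of this sketch; with equal doses
    the expression is simply 100 * AUC_route / AUC_iv.
    """
    if auc_route < 0 or dose_route <= 0 or auc_iv <= 0 or dose_iv <= 0:
        raise ValueError("doses and IV AUC must be positive, route AUC non-negative")
    return 100.0 * (auc_route / dose_route) / (auc_iv / dose_iv)


# Hypothetical study (illustrative only): oral dose 10 mg/kg, IV dose 2 mg/kg.
f_percent = bioavailability_percent(
    auc_route=12.0, dose_route=10.0, auc_iv=8.0, dose_iv=2.0
)
print(f"F% = {f_percent:.1f}")
```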

GMFE=

Geometric mean fold error, synonymous with AAFE.

GNN=

Graph neural network, a neural network architecture for graph-based learning. For instance, 2D or 3D structures of small molecules can be represented as a graph, i.e. a collection of nodes (atoms) and edges (bonds).
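
To make the graph view concrete, the sketch below encodes the heavy atoms of ethanol as nodes and its bonds as edges, then performs one unlearned neighbor-sum step. Real GNN layers apply learned weights and nonlinearities to richer atom features; the sum over atomic numbers used here is a deliberate simplification, and all names are hypothetical.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only: nodes are atoms, edges are bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]


def adjacency(num_nodes, edges):
    """Build an undirected adjacency list from a list of (a, b) edges."""
    neighbors = {i: [] for i in range(num_nodes)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def aggregate(features, neighbors):
    """One simplified message-passing step: own feature plus neighbor sum."""
    return [
        features[i] + sum(features[j] for j in neighbors[i])
        for i in range(len(features))
    ]


# Use the atomic number as a single toy feature per node.
atomic_number = {"C": 6, "O": 8}
features = [atomic_number[a] for a in atoms]

neighbors = adjacency(len(atoms), bonds)
updated = aggregate(features, neighbors)
print(neighbors)
print(updated)
```

After one step, each atom's value already reflects its bonded neighbors, which is the basic mechanism by which a GNN propagates structural context through a molecule.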

HT=

High-throughput.

HT-PBPK=

High-throughput PBPK modeling.

IVIVE=

In vitro-in vivo extrapolation.

MESN=

Multi embedding-based synthetic network.

MLP=

Multilayer perceptron, a classical architecture of an artificial neural network, in which every neuron is fully connected to all the neurons in the previous and next layer.

MPS=

Microphysiological systems.

NAM=

New approach methodologies.

NN=

Neural network, synonymous with artificial neural network in this context.

PBPK=

Physiologically based pharmacokinetic modeling.

PCA=

Principal component analysis.

PD=

Pharmacodynamics.

PK=

Pharmacokinetics.

ML=

Machine learning.

ODE=

Ordinary differential equation.

PopPK=

Population pharmacokinetics.

QML=

Quantum machine learning.

RMSE=

Root mean square error, a measure of the goodness of the model fit. It is defined as the root mean square of the residuals, using the following equation (pred = prediction; obs = observation; n = number of prediction/observation pairs evaluated).

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\mathrm{pred}_i - \mathrm{obs}_i\right)^2}{n}}$$

R2=

Coefficient of determination. There are multiple definitions of R². In our context, it is defined with the equation below (the bar over the observation denotes the average observation), which is a relative measure of the discrepancy between predictions and observations, compared with a null model that uses just the average value of the observations as its prediction.

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\mathrm{observation}_i - \mathrm{prediction}_i\right)^2}{\sum_{i=1}^{n}\left(\mathrm{observation}_i - \overline{\mathrm{observation}}\right)^2}$$
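
The following sketch computes RMSE and the coefficient of determination as defined in this list, and checks the null-model interpretation: predicting the mean observation everywhere gives a value of 0. Function names and numbers are hypothetical and serve only as an illustration.

```python
import math


def rmse(predictions, observations):
    """Root mean square of the residuals (prediction - observation)."""
    n = len(observations)
    squared = sum((p - o) ** 2 for p, o in zip(predictions, observations))
    return math.sqrt(squared / n)


def r_squared(predictions, observations):
    """1 - (residual sum of squares) / (sum of squares around the mean)."""
    mean_obs = sum(observations) / len(observations)
    ss_res = sum((o - p) ** 2 for p, o in zip(predictions, observations))
    ss_tot = sum((o - mean_obs) ** 2 for o in observations)
    return 1.0 - ss_res / ss_tot


# Hypothetical values (illustrative only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

rmse_value = rmse(predicted, observed)
r2_value = r_squared(predicted, observed)

# Null model: always predict the mean of the observations.
mean_observed = sum(observed) / len(observed)
r2_null = r_squared([mean_observed] * len(observed), observed)

print(f"RMSE = {rmse_value:.3f}, R2 = {r2_value:.3f}, null-model R2 = {r2_null:.3f}")
```

Under this definition the value can become negative when a model performs worse than the null model.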

RF=

Random forest, a class of machine learning models built with an ensemble of individual decision trees. Each decision tree makes its prediction, and the random forest model makes predictions by pooling individual predictions.

SVR=

Support vector regression, a variant of a machine-learning algorithm known as the support vector machine (SVM). SVM is a supervised learning algorithm for classification tasks. It works by mapping input data into a higher-dimensional space and finding a decision boundary there (known as a hyperplane) that separates the classes of input data. The term ‘support vector’ refers to the data point(s) that lie closest to the decision boundary. SVR is derived from SVM and addresses regression tasks.

t1/2=

Half-life. Time necessary for a substance to reach a plasma concentration equal to half of its initial value.

Tmax=

Time point at which Cmax is observed in the plasma concentration/time curve of a substance, also written as Tmax or tmax.
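
For sampled data, Cmax and Tmax are typically read directly from the observed profile. The sketch below does this for a hypothetical set of samples; the function name, units, and values are illustrative assumptions, and if the maximum occurs more than once the earliest time is returned.

```python
def cmax_tmax(times, concentrations):
    """Return (Cmax, Tmax) from paired samples; the earliest maximum wins ties."""
    if not times or len(times) != len(concentrations):
        raise ValueError("need non-empty, equally long time and concentration lists")
    # Index of the highest observed concentration.
    idx = max(range(len(concentrations)), key=concentrations.__getitem__)
    return concentrations[idx], times[idx]


# Hypothetical profile (illustrative only): time in h, concentration in mg/L.
times_h = [0.5, 1.0, 2.0, 4.0]
conc_mg_per_l = [2.0, 4.0, 3.0, 1.0]

c_max, t_max = cmax_tmax(times_h, conc_mg_per_l)
print(f"Cmax = {c_max} mg/L at Tmax = {t_max} h")
```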

Vss=

Volume of distribution at steady state. It represents a theoretical volume into which a drug is distributed at steady-state conditions.

Acknowledgments

The authors would like to thank Stephen Fowler, Andrea Andrews-Morger, Julia Pletz, and Leonid Komissarov for their valuable comments and feedback. The authors are indebted to the input of many colleagues in the department of Pharmaceutical Science, and the support of Fabian Birzele, Sherri Dudal and Marianne Manchester. The authors also thank Matthew Wright from Genentech, who shared with us valuable experience and helpful suggestions.

Declaration of Interest

D Bassani was kindly supported by the Roche Postdoc Fellowship (RPF). All authors are employees of Hoffmann-La Roche. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer Disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/17460441.2024.2348157.

Additional information

Funding

This paper was not funded.