Editorial

Could advances in representation learning in Artificial Intelligence provide the new paradigm for data integration in drug discovery?

Pages 191-194 | Received 19 Oct 2018, Accepted 21 Jan 2019, Published online: 30 Jan 2019

1. Introduction

The goal of data integration in drug discovery is to enable cross-dataset analyses that strengthen or fill gaps in our knowledge of biological, pharmacological, or clinical phenomena. For example, this could be done by testing for consistent statistical associations among similar datasets, as in a meta-analysis, or by chaining together statistical associations across datasets to discover novel associations, as in an enrichment analysis. The potential for data integration to yield valuable insights is high given the vast quantity and variety of data in our field. Drug discovery and development programs generate large amounts of data on genes/proteins, compounds, genomes, variants, patients, animal models, interventions, and more. But these data are heterogeneous, sparse, and high-dimensional, originating from high-throughput sequencing, mass spectrometry, flow cytometry, high-content imaging, high-throughput screening, functional assays, clinical records, pharmacy records, etc. This heterogeneity, sparsity, and high dimensionality, together with the historically limited consideration given to data reuse and integration at the time data were generated, make data integration in our field a formidable challenge [Citation1].

Data integration projects have made great progress in the face of these difficulties. There are excellent data integration resources that primarily focus on one data type, such as Gene Expression Omnibus (GEO), UniProt, and ZINC15 [Citation2–Citation4]. There are excellent portals, such as The European Bioinformatics Institute (EBI) and The National Center for Biotechnology Information (NCBI), that offer a multitude of these resources [Citation5,Citation6]. It has been encouraging to see a new class of resources that integrate data across multiple data types, such as Open Targets, Open PHACTS, and Pharos [Citation7–Citation12]. The solution adopted is often to represent each data type in its own column, with the ability to query across data types. These may be termed tabular or spreadsheet-based resources, and they offer ease of use.

Given the great progress in collecting, organizing, and summarizing data, as well as building tools for querying data, our field may be ready for a new paradigm for data integration. We postulate that Artificial Intelligence-based representation learning solutions are poised to revolutionize data integration for drug discovery, as they offer a flexible, dimensionless way of integrating and converting between disparate data, as we describe below.

2. Artificial Intelligence and autoencoders

Artificial Intelligence (AI) is a branch of computer science that aims to equip machines with human-like intelligence and the ability to learn from ever-changing environments to successfully achieve their goals. For instance, artificial neural networks (ANNs) are mathematical models for data processing, born out of AI, designed to mimic the large array of neurons in the brain that enable humans to parse sensory inputs, learn, and make decisions. A basic ANN consists of an input layer, a hidden layer that holds a transformed representation of the input data, and an output layer. Recent advances in computing power and algorithmic innovations have led to the development of deep neural networks from ANNs, which use multiple hidden layers to learn a hierarchy of increasingly abstract but hopefully more meaningful representations of the data [Citation13]. Deep Learning with these deep neural networks has delivered impressive achievements in the last decade, from self-driving cars to language translation [Citation13]. A key concept that has become popular in Deep Learning is that of an Autoencoder.
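As a toy illustration of these architectures (not part of the original editorial, and with arbitrary layer sizes), a basic ANN and a deeper variant can be written in a few lines of PyTorch:

```python
import torch.nn as nn

# A basic ANN: input layer -> single hidden layer -> output layer.
shallow_net = nn.Sequential(
    nn.Linear(100, 32),   # input features -> hidden representation
    nn.ReLU(),
    nn.Linear(32, 1),     # hidden representation -> output
)

# A deep neural network: several hidden layers learn a hierarchy of
# increasingly abstract representations of the same input.
deep_net = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
```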

An Autoencoder is a model that transforms data in some way (encoding) and then attempts to reconstruct the data (decoding) [Citation14]. The output of the encoder (and input to the decoder) is typically a vector of latent variables known as the latent representation, embedding, or code. Autoencoders are typically used for compression, representation learning, and/or visualization of high-dimensional data. Compression can be a preprocessing step or auxiliary task for supervised learning, where the aim is to eliminate noise, collapse correlated variables, and possibly incorporate unlabeled data, making it easier to learn the distribution of the data and hopefully improve generalization [Citation14]. Principal Components Analysis (PCA) is likely the best-known and simplest Autoencoder. In PCA, the encoder and decoder functions are linear transformations, and the parameters of the transformation matrices can be determined analytically [Citation15].
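To make the PCA-as-Autoencoder view concrete, here is a minimal sketch in Python; the data matrix, sample counts, and latent dimension are illustrative assumptions, not values from the editorial:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))      # 500 samples, 50 features (synthetic)
X = X - X.mean(axis=0)              # PCA assumes centered data

k = 5                               # size of the latent representation ("code")
# The top-k right singular vectors act as both the linear encoder and decoder.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                        # 50 x k projection matrix

Z = X @ W                           # encode: latent representation / embedding
X_hat = Z @ W.T                     # decode: linear reconstruction

print(np.mean((X - X_hat) ** 2))    # reconstruction error
```

Here the encoder and decoder are linear maps whose parameters are determined analytically, exactly the special case described above.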

Recently, deep neural network Autoencoders such as the Variational Autoencoder have become popular [Citation14,Citation16]. In these models, the encoder and decoder functions are deep neural networks, and their parameters are determined by minimizing the reconstruction error on a set of training data. Deep neural networks allow a hierarchy of increasingly abstract and ideally more meaningful features of the input data to be learned, such as patterns of input variables corresponding to pathways, regulatory motifs, or binding domains [Citation16]. Deep neural networks are also extremely flexible. Their architectures can be customized to process data in a way that takes advantage of intrinsic characteristics of a data type, such as the regular 2-D arrangement of pixels in an image. Many data-type-specific architectures already exist and can be plugged in as encoders or decoders. The greater flexibility and expressive power of deep neural networks over other machine learning models come at the cost of increased model engineering time, increased computational demand, and the need for large amounts of training data.
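As a hedged sketch of the deep, nonlinear case, the following PyTorch snippet defines a simple deep Autoencoder and performs one training step by minimizing reconstruction error; all dimensions, layer choices, and the synthetic batch are assumptions for illustration rather than a recommended configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepAutoencoder(nn.Module):
    def __init__(self, n_features=1000, latent_dim=32):
        super().__init__()
        # Encoder: learns a hierarchy of increasingly abstract features.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder: mirrors the encoder and reconstructs the input.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)              # latent representation / embedding
        return self.decoder(z), z

model = DeepAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 1000)                # one batch of synthetic high-dimensional data
x_hat, z = model(x)
loss = F.mse_loss(x_hat, x)              # reconstruction error
loss.backward()
optimizer.step()
```

A Variational Autoencoder adds a probabilistic latent layer and a regularization term to this setup, but the overall encoder/decoder structure is the same.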

Large amounts of diverse but sparse data are produced by bespoke projects throughout the drug discovery process. Historically, data have been integrated within specific data types, and only recently across multiple data types (e.g. Open Targets, Pharos). Artificial Intelligence, and Autoencoders in particular, can provide compact encodings that span data types. This promises a dimensionless way of integrating drug discovery data, ushering in a next generation of data integration that can use measurements on a system to infer its hidden state. Discovering this hidden state will enable machine learning-based discovery of predictive biomarkers and allow us to create virtual patient cohorts for trials, providing a data-driven approach to decision making in drug discovery.

3. Expert opinion

A useful way to interpret an Autoencoder’s latent variables is to consider them master dials that completely define the ‘state’ of a system, entity, or object. From this point of view, if we know a system’s latent variables, then we know its state, and we should be able to predict almost everything that we could possibly observe about that system. For example, we could conceive of patients as having a set of latent variables that define their health state; clinical records, prescriptions, and lab results would all be observations determined by that health state. This point of view leads to an approach to data integration that is conceptually straightforward, general purpose, and data-driven. If different types of data about a system are simply different types of observations of the same latent state, then we should be able to use an Autoencoder (or set of Autoencoders) to infer a single set of latent variables from which we can explain all our observations (Figure 1). More importantly, the encoder and decoder functions would provide mappings between the different types of data, which we could use for cross-dataset predictive analytics.

Figure 1. Autoencoders: a natural framework for data integration that may learn the hidden state of the biological system.

Autoencoders consist of encoder and decoder functions that can compress high-dimensional data into a latent representation and then reconstruct the original data, shedding light on hidden states that may exist in data from disparate sources. Drug discovery problems include target identification, biomarker selection, compound prioritization, and patient stratification.

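To make the shared-latent-state idea of Figure 1 concrete, here is a hypothetical sketch in which two patient data types are encoded into a single latent vector and decoded back into both; the modality names, dimensions, and the simple fusion-by-averaging step are all illustrative assumptions, not the authors' method:

```python
import torch
import torch.nn as nn

class MultiModalAutoencoder(nn.Module):
    """One shared latent state; one encoder/decoder pair per data type."""

    def __init__(self, modality_dims, latent_dim=16):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, latent_dim))
            for name, d in modality_dims.items()
        })
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, d))
            for name, d in modality_dims.items()
        })

    def forward(self, batch):
        # Each available data type proposes a latent state; averaging them
        # yields a single set of "master dials" per sample.
        z = torch.stack([self.encoders[name](x) for name, x in batch.items()]).mean(dim=0)
        # Decoding the shared latent state into every data type also gives
        # cross-modality predictions (e.g. expression -> cytokines).
        return {name: dec(z) for name, dec in self.decoders.items()}, z

# Hypothetical usage with two modalities and a batch of 8 patients.
model = MultiModalAutoencoder({"expression": 2000, "cytokines": 40})
batch = {"expression": torch.randn(8, 2000), "cytokines": torch.randn(8, 40)}
reconstructions, latent_state = model(batch)   # latent_state: 8 x 16 "master dials"
```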

As an example, imagine we have oncology clinical trial data on 100 patients, with 30 responders and 70 non-responders, and we would like to find a response biomarker. For each patient at baseline, we have anthropometric and lifestyle traits, clinical data, lab values, blood cytokine levels, tissue mRNA expression, immunohistochemistry, and biopsy images. We do not have complete data for any single patient. Given the small sample size and high dimensionality of the data, it is unlikely that we will be able to find a response predictor using conventional statistical or machine learning approaches [Citation17,Citation18]. This raises a troubling issue: having a large amount of data on these patients does not seem to help us in this situation. Rather, it seems to make the problem more difficult by increasing the probability of false discovery. This is strange; why does having more data seem to provide no benefit?

We argue the latent state approach to data integration provides a way to unlock the benefits of high-dimensional, multi-modal data, with Autoencoders being a key enabling tool. If we knew our patients’ latent states, we should be able to predict their response. For the response prediction task, the Autoencoder provides a reduced set of candidate predictors (the latent variables) that also explain the other patient observations. We expect these candidate predictors to generalize better than those found by a purely supervised approach, which need not contain any coherent information about the patient.
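A minimal, hypothetical sketch of how the latent variables could serve as candidate predictors: pretrain an Autoencoder on all available patient data, then fine-tune its encoder together with a small response head on the labeled trial subjects. The encoder architecture, dimensions, and synthetic labels below are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assume `pretrained_encoder` is the encoder of an Autoencoder already trained
# on all available (including unlabeled) patient data; shown here as a stand-in.
pretrained_encoder = nn.Sequential(nn.Linear(2000, 64), nn.ReLU(), nn.Linear(64, 16))

response_head = nn.Linear(16, 1)             # latent variables -> response logit
optimizer = torch.optim.Adam(
    list(pretrained_encoder.parameters()) + list(response_head.parameters()), lr=1e-4
)

x = torch.randn(100, 2000)                   # baseline data for 100 patients (synthetic)
y = torch.randint(0, 2, (100, 1)).float()    # 1 = responder, 0 = non-responder

z = pretrained_encoder(x)                    # candidate predictors: the latent variables
loss = F.binary_cross_entropy_with_logits(response_head(z), y)
loss.backward()
optimizer.step()
```

Freezing the encoder and fitting only the response head is an equally reasonable variant when labeled samples are very scarce.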

Several challenges must be resolved for Autoencoders to be used successfully for data integration. For example, a significant amount of neural network architecture engineering will be required, with different data types needing different encoder/decoder architectures (e.g. convolutional neural networks for images, recurrent neural networks for sequences, and graph convolutional neural networks for compounds). Experimentation with the training objective will also be required: reconstruction error may need to be computed differently for different data types (e.g. squared error loss for normally distributed data and cross-entropy loss for binary data), data types may need weights, with lower-quality data receiving less weight, and data types may need to be randomly masked at the input during training to force the model to produce reconstructions given only partial observations. Validation will depend on the use case, and each use case may require fine-tuning of the model, e.g. training an Autoencoder to integrate patient data and then fine-tuning the encoder to predict response to treatment. Finally, a strategy to collect the same types of data across clinical trials or across preclinical projects will be required to have a sufficient number of samples to fit the models.

Autoencoders could be used to integrate data across patients, targets, compounds, diseases, and more. They could also be used to integrate data of only one of these types from multiple resources while accounting for batch effects. Some early examples of using Autoencoders for data integration have emerged [Citation19–Citation22], and we anticipate that research on this topic will grow. Recent research has shown that Autoencoders can be used to select features and reproduce virtual patient images, thus producing virtual cohorts [Citation23]. The emergence of affordable data generation methods has produced huge amounts of labeled omic, genetic, compound, and clinical data. This large-scale data generation, coinciding with advances in computing hardware and deep learning, suggests that now is the right time for integration across drug discovery data.
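A hypothetical sketch of two of the training choices mentioned above: computing data-type-specific reconstruction losses (squared error for approximately normal data, cross-entropy for binary data) and randomly masking data types at the input so the model learns to reconstruct from partial observations. It assumes a multimodal model like the earlier sketch that returns per-data-type reconstructions; the weights, mask probability, and the 'mutations' example are illustrative:

```python
import random
import torch.nn.functional as F

def masked_multimodal_loss(model, batch, binary_types=("mutations",),
                           type_weights=None, mask_prob=0.3):
    """Weighted reconstruction loss over all data types, with random input masking."""
    type_weights = type_weights or {}

    # Randomly hide some data types at the input (keeping at least one),
    # but still require the model to reconstruct all of them.
    visible = {name: x for name, x in batch.items() if random.random() > mask_prob}
    if not visible:
        first = next(iter(batch))
        visible = {first: batch[first]}
    reconstructions, _ = model(visible)

    loss = 0.0
    for name, target in batch.items():
        pred = reconstructions[name]
        if name in binary_types:
            # Cross-entropy for binary data types (e.g. mutation calls).
            term = F.binary_cross_entropy_with_logits(pred, target)
        else:
            # Squared error for (approximately) normally distributed data.
            term = F.mse_loss(pred, target)
        # Lower-quality data types can be down-weighted here.
        loss = loss + type_weights.get(name, 1.0) * term
    return loss
```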

Data integration has promised to revolutionize drug discovery for the past two decades. Though resources around single data types continue to flourish, advances have come from assembling disparate pieces of data together in large tables and databases, with convenient access provided through web interfaces and APIs [Citation10]. Representation learning developments in Artificial Intelligence, such as Autoencoders, will provide the new paradigm for data integration (Figure 2). For example, the ability of an Autoencoder to learn latent master dials that completely define the state of a system may provide a solution to the problem of combining very different aspects and measurements of a system in a dimensionless way.

Figure 2. The future of drug discovery with AI-driven data integration.

The current state of drug discovery has been achieved by integrating data of single types from individual projects and by querying large databases built to collate data from disparate sources. AI and Autoencoders can learn hidden states present in high-dimensional, sparse data from disparate sources, thus building a platform for data-driven decision making in drug discovery.


This will enable us to create virtual clinical subject cohorts and simulate their response to experimental therapeutic interventions in silico, prior to any human clinical intervention study [Citation23]. It could also help stratify patients based on these master dials, which will incorporate anthropometric, genetic, genomic, and drug treatment data, leading to better biomarkers and diagnoses. Integrating data across the pipeline using Autoencoders will provide a data-driven approach, marking a new paradigm for decision making throughout the drug discovery and development pipeline.

Declaration of Interest

All authors are employees of GlaxoSmithKline. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer Disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

This manuscript was supported by GlaxoSmithKline.

References
