Review

A new wave of innovation in Semantic web tools for drug discovery

Pages 433-444 | Received 18 Dec 2018, Accepted 21 Feb 2019, Published online: 19 Mar 2019

ABSTRACT

Introduction: The use of semantic web technologies to aid drug discovery has gained momentum over recent years. Researchers in this domain have realized that semantic web technologies are key to dealing with the high levels of data for drug discovery. These technologies enable us to represent the data in a formal, structured, interoperable and comparable way, and to tease out undiscovered links between drug data (be it identifying new drug-targets or relevant compounds, or links between specific drugs and diseases).

Areas covered: This review focuses on explaining how semantic web technologies are being used to aid advances in drug discovery. The main types of semantic web technologies are explained, outlining how they work and how they can be used in the drug discovery process, with a consideration of how the use of these technologies has progressed from their initial usage.

Expert opinion: The increased availability of shared semantic resources (tools, data and, importantly, the communities) has enabled the application of semantic web technologies to facilitate semantic (context-dependent) search across multiple data sources, which can be used by machine learning to produce better predictions by exploiting the semantic links in knowledge graphs and linked datasets.

1. Introduction

Drug discovery is a complex and long-term scientific investigation involving interdisciplinary research methods coupled with large heterogeneous datasets [Citation1,Citation2]. Recognition of drug targets [Citation3] and the identification of new compounds that can be used in drugs against specific diseases are major aspects of drug discovery [Citation4]. All of these processes are complex in their own right and involve analyzing large quantities of data [Citation5]. With such a substantial volume of data relating to different areas of drug discovery [Citation6,Citation7], it is unsurprising that scientists frequently have research questions that cannot be addressed using a single dataset and that require the integration of multiple sources to obtain the answers they need. The drug discovery process has become increasingly reliant on creating and using computational methods to find innovative ways of managing, curating and integrating these datasets to ensure that researchers have access to the relevant knowledge they need [Citation4].

Figure 3. A basic example of a drug discovery knowledge graph.


There has been a wave of computation-based innovation around creating tools to aid drug discovery [Citation8,Citation9,Citation10] and many of these tools utilize semantic web technologies. This is due both to the data management capabilities of the semantic tools and to researchers’ recognition that purely integrating sources together is not enough. Semantic web technologies are needed to facilitate interoperability, to interpret the information in the correct context [Citation11], to make use of inferencing and richer data structures that provide a superior level of knowledge management [Citation12,Citation13], to search and mine large integrated datasets [Citation14], and to effectively utilize ‘Big Data’, Artificial Intelligence and Machine Learning [Citation15-Citation17].

2. What is the semantic web? The current state of the technologies

The conceptualization of the web broke down some of the barriers to entry into data driven fields (such as drug discovery), as it provided a key information channel to reaching public data [Citation18], and this combined with movements towards making scientific datasets open provided access to vital resources to conduct research in this area. However, these resources could only be of limited use due to a lack of data standards and interoperability across different data resources and domains.

The Semantic Web was conceptualized by Tim Berners-Lee [Citation19] with the main goal of creating the underlying technology to support a ‘Web of Data’ that enables machines to better interpret information and provide a set of common interoperable data formats to be used across different platforms [Citation20]. In order to achieve this aim, several core formats were created to facilitate a richer way of modelling, characterizing and querying data.

2.1. Knowledge representation

RDF (Resource Description Framework) is the semantic web linked data format [Citation21]. A graph model is used to store data in triples (subject → predicate → object) whereby the predicate defines the relationship between the subject and object, thus permitting almost any dataset to be broken down into these triples (as demonstrated in Figure 1) [Citation20].

Figure 1. RDF graph example of linking a drug to a disease.


Using this linked data format, data relating to drug discovery can be formalized and linked in a standard interoperable format within a knowledge base that provides a much richer description. Below is a textual and graph example of an RDF triple illustrating the relationship between a drug and a disease that it can treat.

  • Subject: http://www.exampledrugontology.org/drugs/drug1 (the URI for the drug in question)

  • Predicate: http://www.exampledrugontology.org/terms/#cantreat (the URI that provides the term ‘can treat’, which links together drugs and diseases)

  • Object: (the URI for the disease in question)
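As a rough illustration of the triple described above, the following Python sketch stores the three parts as a tuple and writes them out in the N-Triples line format. The drug and predicate URIs are taken from the example; the disease URI is not given in the text, so the value used here is a purely hypothetical placeholder, as is the helper name to_ntriples.

```python
# A triple is simply (subject, predicate, object); here all three are URIs.
DRUG = "http://www.exampledrugontology.org/drugs/drug1"
CAN_TREAT = "http://www.exampledrugontology.org/terms/#cantreat"
# Hypothetical placeholder: the text does not give the disease URI.
DISEASE = "http://www.exampledrugontology.org/diseases/disease1"

triple = (DRUG, CAN_TREAT, DISEASE)


def to_ntriples(s: str, p: str, o: str) -> str:
    """Serialize one URI-only triple as a single N-Triples line."""
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples(*triple)
print(line)
```

Literal values (strings, numbers) would need quoting and datatype handling that this sketch deliberately leaves out.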

It is possible to produce code to export data from a relational database model into RDF using the Relational Database Model (RDB) to RDF Mapping Language (R2RML) [Citation22], which facilitates defining customizable mappings between relational datasets to RDF. The creation of this language enabled a substantially faster method of converting large datasets into RDF as opposed to being required to manually make the conversion.
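The idea of mapping relational rows to triples can be sketched in plain Python. This is not R2RML itself (which is a declarative mapping language), only an assumed, hand-written analogue of a single mapping rule; the table name treatment, its columns and the disease URI pattern are hypothetical.

```python
import sqlite3

BASE = "http://www.exampledrugontology.org"
CAN_TREAT = f"{BASE}/terms/#cantreat"

# Hypothetical relational table linking drug ids to disease ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treatment (drug_id TEXT, disease_id TEXT)")
conn.executemany(
    "INSERT INTO treatment VALUES (?, ?)",
    [("drug1", "disease1"), ("drug2", "disease1")],
)


def rows_to_triples(connection):
    """Apply one fixed mapping: each row becomes a 'can treat' triple."""
    query = "SELECT drug_id, disease_id FROM treatment ORDER BY drug_id"
    for drug_id, disease_id in connection.execute(query):
        yield (
            f"{BASE}/drugs/{drug_id}",            # subject URI built from the key
            CAN_TREAT,                            # predicate fixed by the mapping
            f"{BASE}/diseases/{disease_id}",      # object URI (hypothetical pattern)
        )


triples = list(rows_to_triples(conn))
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

A real R2RML mapping would express the same subject, predicate and object templates declaratively so that a mapping engine can apply them to arbitrarily large tables.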

RDF supports storing data in a linked graph format, but is not sufficient by itself to represent the domain knowledge that provides the context and meaning behind the data. In order to achieve this, ontologies need to be created. Ontologies are vocabularies or dictionaries for the semantic web, which provide a formal definition of the common terms used within a specific domain; essentially, they describe the hierarchy of classes used to define the relationships and restrictions of different concepts within the ontology. This technology allows reusable terms to be built up for use within other systems employing the same terminology [Citation23]. Ontologies are typically written in the Web Ontology Language (OWL) [Citation24] due to its expressive capabilities but can also be written in the simpler ontology language RDF(S) [Citation25]. These two vocabularies are similar, but OWL has a much larger vocabulary and stronger syntax than RDF(S). With RDF(S) one can define classes and properties, and restrict the type of the subject (domain) and object (range) linked together by a property (as demonstrated in Figure 2).

Figure 2. Basic example of RDF(S) and OWL.
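The effect of RDF(S) domain and range declarations can be sketched in a few lines of Python. Under the standard RDFS semantics these declarations act as inference rules (entailment patterns rdfs2 and rdfs3): using a property lets a reasoner conclude the types of its subject and object. The prefixed names (ex:canTreat, ex:Drug, ex:Disease) and the function name are hypothetical, chosen only for illustration.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema: 'ex:canTreat' links a Drug (domain) to a Disease (range).
domain = {"ex:canTreat": "ex:Drug"}
range_ = {"ex:canTreat": "ex:Disease"}

data = {("ex:drug1", "ex:canTreat", "ex:disease1")}


def infer_types(triples, domain, range_):
    """Apply the RDFS domain (rdfs2) and range (rdfs3) rules once."""
    inferred = set()
    for s, p, o in triples:
        if p in domain:
            inferred.add((s, RDF_TYPE, domain[p]))   # subject gets the domain class
        if p in range_:
            inferred.add((o, RDF_TYPE, range_[p]))   # object gets the range class
    return inferred


types = infer_types(data, domain, range_)
for t in sorted(types):
    print(t)
```

Rejecting data that uses a property with the wrong kind of subject or object is a validation task, which needs something beyond plain RDF(S) inference.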


With OWL, however, it is possible to define richer semantic relationships. As shown in Figure 2, the domain and range of a property can be set, which restricts which types of classes this property can be used to link. With RDF(S) that is as far as the description capabilities go, whereas OWL has description logic capabilities, which were enhanced when the latest version of OWL (OWL2) was formalized [Citation26]. Classes and object properties (which define relationships between classes) can be given qualified cardinality restrictions, data range restrictions, and a wider range of defined properties. In addition to using description logic, rules can also be made in ontologies using other languages such as SWRL (Semantic Web Rule Language) [Citation27]. Ontologies are useful in drug discovery because they facilitate a formal description of concept relations in a set domain [Citation28]. Many drug discovery ontologies have been developed over the last decade and these are starting to be utilized in further applications; as researchers extend these ontologies, their usefulness is advancing. For example, work has been undertaken to extend the Gene Ontology to describe subcellular structure concepts in the extracellular RNA (exRNA) domain [Citation29], which simplified the process of annotating and querying exRNA data because there was a standardized, shared set of terms and relationships for this domain.

2.2. Semantic annotation

RDFa (Resource Description Framework in Attributes) is a W3C recommendation that adds a set of attribute-level extensions to HTML, XHTML and certain XML document types to allow rich metadata to be embedded within web pages. In addition to RDFa, two other lightweight formats have emerged for semantic metadata: the JavaScript Object Notation for Linked Data (JSON-LD) [Citation30] and, more recently, Microdata [Citation31], an HTML-extension specification that defines HTML attributes to embed machine-readable data in HTML documents (although this is even more lightweight and less expressive than RDFa or JSON-LD). This means that HTML documents containing drug discovery information can have semantic metadata embedded within them, thus enabling search engines like Google to return more accurate, comprehensive results for users seeking answers to drug discovery questions. Using the same terms as Figure 1, below are examples of using RDFa and JSON-LD for semantic annotations in this domain (Box 1 and Box 2):

Box 1. Simple example of using RDFa for semantic annotations in drug discovery.

Box 2. Simple example of using JSON-LD for semantic annotations in drug discovery.
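The contents of the boxes are not reproduced in this text. As a stand-in, the Python sketch below builds what a minimal JSON-LD annotation for the earlier drug and disease triple might look like and wraps it in the script element commonly used to embed JSON-LD in HTML. The drug and predicate URIs come from the example above; the disease URI is a hypothetical placeholder, and the structure shown is an assumption rather than the authors' own box content.

```python
import json

annotation = {
    # The context maps the short key 'cantreat' to the full predicate URI
    # and states that its value should be read as a URI, not a string.
    "@context": {
        "cantreat": {
            "@id": "http://www.exampledrugontology.org/terms/#cantreat",
            "@type": "@id",
        }
    },
    "@id": "http://www.exampledrugontology.org/drugs/drug1",
    # Hypothetical placeholder: the text does not give the disease URI.
    "cantreat": "http://www.exampledrugontology.org/diseases/disease1",
}

jsonld_text = json.dumps(annotation, indent=2)

# JSON-LD is typically embedded in a web page inside a script element.
script_tag = f'<script type="application/ld+json">\n{jsonld_text}\n</script>'
print(script_tag)
```

An RDFa annotation would carry the same triple in HTML attributes on the visible page content instead of a separate script block.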

2.3. Semantic queries for search

SPARQL (SPARQL Protocol and RDF Query Language) is the query language for RDF and enables linked data to be retrieved using significantly more complex queries than are possible over single datasets with an unlinked structure. Once drug discovery data has been put into a linked data format using domain-specific ontologies, SPARQL can be used to make complex queries over these combined datasets to pull out new, previously unexplored connections. An increasing amount of work has been done to create new semantic knowledge bases that can be searched using SPARQL, and to combine the use of SPARQL with other technologies such as text mining and similarity scoring to enable smarter semantic searches [Citation32].
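The core of a SPARQL query is a set of triple patterns with variables that must all match at once. The Python sketch below is a simplified, assumed analogue of that matching step rather than a SPARQL engine; the graph contents, the ex: names and the functions match and query are hypothetical.

```python
# Roughly equivalent SPARQL (prefix declaration omitted):
#   SELECT ?drug ?disease WHERE {
#     ?drug ex:targets ?protein .
#     ?protein ex:associatedWith ?disease .
#   }

graph = [
    ("ex:drug1", "ex:targets", "ex:proteinA"),
    ("ex:proteinA", "ex:associatedWith", "ex:disease1"),
    ("ex:drug2", "ex:targets", "ex:proteinB"),
]


def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                    # variable term
            if new.setdefault(term, value) != value:
                return None                         # conflicts with earlier binding
        elif term != value:                         # constant must match exactly
            return None
    return new


def query(patterns, graph):
    """Evaluate a conjunction of triple patterns over the graph."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


patterns = [
    ("?drug", "ex:targets", "?protein"),
    ("?protein", "ex:associatedWith", "?disease"),
]
results = query(patterns, graph)
print(results)
```

Because the shared variable ?protein must take the same value in both patterns, only the drug whose target is linked to a disease is returned.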

2.4. Rules, reasoning & inferencing

Another facet of the semantic web toolkit is its rule creation, reasoning and inferencing capabilities. As touched on above, the rules regarding the relationships defined in ontologies can be made more complex using either the inbuilt description logic capabilities of OWL (which were extended in OWL2) or one of the rule-based languages developed for the Semantic Web. The data that exists within the drug discovery domain is multi-faceted and, in addition to having many different concepts and relationships to represent, many different rulesets and modelling techniques are used on this data when conducting research. Several rule languages have been created as part of the semantic web toolkit to facilitate the inclusion of mathematical expressions and more complex rule-based constraints. The Semantic Web Rule Language (SWRL) [Citation27] was created to extend OWL axioms to include Horn-like rules [Citation33] by combining the OWL DL and OWL Lite versions of OWL with the Unary/Binary Rule ML version of the Rule Markup Language. SPIN [Citation34] (SPARQL Inferencing Notation) and its successor SHACL [Citation35] (Shapes Constraint Language) were created as RDF syntaxes for representing complex SPARQL constraints by specifying inferencing rules. Thus far the available research only demonstrates the use of SWRL in drug discovery; however, SHACL is a relatively new language (proposed in 2017) and, given the increase in semantic search applications over the last few years, it is feasible that future applications could begin to include the use of these languages.
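A Horn-like rule of the kind SWRL expresses has a conjunction of conditions and a single conclusion. The Python sketch below applies one such rule by forward chaining until no new facts appear. It is not SWRL syntax or a SWRL engine; the rule, the predicate names (targets, associatedWith, candidateFor) and the facts are all hypothetical.

```python
facts = {
    ("drug1", "targets", "proteinA"),
    ("proteinA", "associatedWith", "disease1"),
}


def rule_candidate(facts):
    """targets(d, p) AND associatedWith(p, x)  ->  candidateFor(d, x)"""
    derived = set()
    for drug, pred1, protein in facts:
        if pred1 != "targets":
            continue
        for subject, pred2, disease in facts:
            if pred2 == "associatedWith" and subject == protein:
                derived.add((drug, "candidateFor", disease))
    return derived


def forward_chain(facts, rules):
    """Apply every rule repeatedly until a fixpoint is reached."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new


closure = forward_chain(facts, [rule_candidate])
print(sorted(closure))
```

The derived candidateFor fact is not stored anywhere in the input; it exists only because the rule was applied to the linked data.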

2.5. Increasing use of semantic web technologies

These core technologies were developed nearly 20 years ago, and still make up the backbone of the semantic web. However, in the last decade, improvements have been made to advance these core technologies [Citation26,Citation36,Citation37], and as semantic web technologies are being further recognized as necessary enablers to support broader technologies [Citation38] further innovations that utilize these technologies are being made across multiple domains, including drug discovery.

As the use of semantic technologies has become more commonplace, other novel tools, which are based on semantics, have started to emerge. A classic example of this is the knowledge graph. This in itself is not an overtly ‘new’ advancement, as Google introduced the concept in 2012 [Citation39]. However, similarly to the core technologies themselves, this concept has taken a while to gain traction, and it is only in the last three years or so that drug discovery applications that utilize knowledge graphs have started to emerge. Knowledge graphs are essentially graph network structures that describe real-world entities and their relationships through the combination of linked data and ontologies. These graphs of linked entities are powerful structures, and enhancing them has led to the creation of new applications that exploit the networks of links to apply advanced learning techniques to the data. Figure 3 provides a simplified example of how drug discovery data could be represented in a knowledge graph.
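To make the idea of exploiting a network of links concrete, the Python sketch below stores a toy knowledge graph as labelled edges and uses breadth-first search to find a chain of relationships from a drug to a disease. The entities, relationship names and the function find_path are hypothetical and are not taken from the figure.

```python
from collections import defaultdict, deque

# Hypothetical edges: (subject, relationship, object).
edges = [
    ("drug1", "targets", "proteinA"),
    ("proteinA", "participatesIn", "pathway1"),
    ("pathway1", "implicatedIn", "disease1"),
    ("drug2", "targets", "proteinB"),
]

# Adjacency list keyed by subject, keeping the relationship label.
graph = defaultdict(list)
for s, p, o in edges:
    graph[s].append((p, o))


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of triples or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, predicate, neighbour)]))
    return None


path = find_path(graph, "drug1", "disease1")
for step in path:
    print(step)
```

In this toy graph drug2 has no route to disease1, so the same call for that drug returns None.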

Using semantic web technologies has the potential to be of great value; however, there is a certain time cost involved in formalizing the ontologies and putting datasets into a linked data form such that they can be linked, searched and used in conjunction with other technologies such as machine learning. Semantic web technologies have been around since 2001, and slow and steady work to use these technologies to aid and advance drug discovery began shortly after. In recent years this work has gained momentum and a new wave of innovation in semantic web-based tools for drug discovery has come to the forefront. This paper will consider the recent advancements that have been made in this area, both with regards to creating and advancing ontologies and in creating new semantic web-based tools for drug discovery. An overview of the progress that has been made in recent years will be presented, and an expert opinion on where the advances have been made and how they are aiding drug discovery further will be given.

3. Advances in semantic web tools for drug discovery

In 2012 the latest version of OWL (OWL2) was created, and since then advancements have also been made in the latest versions of RDF, OWL and SPARQL, to enhance their capabilities and offer additional features. Moreover, there has been a steady uptake of these technologies in the area of drug discovery. Initial research in this domain when the Semantic Web was first conceptualized proposed that semantic web technologies would support ‘flexible, extensible and evolvable knowledge transfer and reuse of scientific data’ [Citation12]. As semantic web technologies have developed, researchers have periodically evaluated their use within the drug discovery domain.

Borkum and Frey in 2014 [Citation40] expressed the value of linked datasets and ontologies to aid chemical research, giving examples of general vocabularies such as SKOS [Citation41] and Dublin Core [Citation42], in addition to describing their own work to create controlled vocabularies for specific chemical terminology. This line of thinking has gained traction in recent years as many more linked repositories, datasets and ontologies are being curated and created in the area of drug discovery, and in general the uptake of semantic web technologies to help manage and extract greater value from drug discovery data has increased significantly. An example of one of these semantic repositories is Chem2BioRDF [Citation43], a cross-domain chemical biology resource which contains aggregated data from multiple chemogenomics repositories that have been cross-linked into BioRDF (a platform for querying biological data in RDF). This repository was created to enable users to make specific chemical/biological queries, which, as the authors demonstrate, could have strong uses in areas such as multiple pathway inhibition and adverse drug reaction pathway mapping.

In 2010 Chen and Xie [Citation44] cited creating linked networks of open data, ad hoc collaboration to improve pipeline productivity, and efficient semantic data mining as the main benefits of using semantic web technology at the time of their article. These remain valid points, although since these articles were published the number of available semantic resources has increased, and there have been notable advancements in semantic web tools for drug discovery. There have been endeavors to create new ontologies and to extend those that already exist to improve their coverage in certain areas [Citation45-Citation47]. As these ontologies have gained traction, grown in number and improved, new platforms have been created to build enhanced predictive systems that utilize the semantic links afforded by ontologies and network graphs in conjunction with other computational techniques such as network analysis and supervised learning approaches. Since Google introduced knowledge graphs in 2012, their use in drug discovery has increased and has advanced the previous data mining capabilities.

In 2015 Machado et al [Citation18] reviewed 11 different applications from the last decade that make use of semantic web technologies in drug discovery, illustrating that the main areas in which these technologies can be utilized are data and knowledge representation, and resource mapping and integration services. Companies have also started to realize the potential of using these technologies in their drug discovery work. OntoText (a semantic web company) agreed a partnership with DrugBank (a bioinformatics and cheminformatics drug data resource) in 2018 [Citation48] to provide DrugBank’s database in an RDF format, thus providing a new semantic resource for researchers.

All three reviews of semantic web technologies within the scientific domain had a strong focus on the linking of data through the use of RDF and ontologies [Citation18,Citation40,Citation44]. Machado et al illustrated that, out of the 11 semantic drug discovery applications that they surveyed, seven used controlled vocabularies in their systems, demonstrating that this was a common method of using these technologies. However, one of the key benefits of producing this type of linked data is the enhanced semantic search capability that it affords. As new semantic knowledge bases, ontologies, and knowledge graphs have started to emerge, so too has new and improved semantic search. Semantic applications in drug discovery have evolved past purely using knowledge bases to become systems that use intelligent semantic search techniques, and that use other technologies such as machine learning to exploit the semantic links in the data.

The following sections detail the work done to make more drug discovery data available in a linked format, to create and extend ontologies and to create community-driven knowledge bases for semantically annotating data with drug discovery information. Following this, the more notable advances of new semantic search and machine learning based applications will be described, including providing exemplars of applications from both industry and academia that utilize these technologies.

3.1. Enhanced knowledge representation for drug discovery

Much of today’s scientific knowledge and data has progressed from being kept in the heads of scientists, through being written down in books, to being put into a digital form that (depending on its openness and format) can be searched and consumed much faster than on paper. However, just because something is in a digital form, does not mean it is necessarily going to be useful. Humans process data to turn it into meaningful information, which enables them to build knowledge; the semantic web provides these capabilities by offering the mechanisms to markup data in a linked format, using ontologies to add context and meaning. These techniques have been used in drug discovery to improve the knowledge representation of drug discovery data, thus providing enhanced knowledge bases that can be used to explore undiscovered links and be used in conjunction with other technologies.

In order to create well-formed, consistent, semantic drug discovery data, there is a need to create and maintain ontologies that define the many different concepts and relationships within this space [Citation49]. Much fundamental groundwork has been done to create these ontologies over the last 15 years [Citation50,Citation51], creating both specific ontologies for a certain area (e.g. the Drug Ontology [Citation52] or the Vaccine Ontology [Citation53]) and the underlying ontologies that define core scientific terms (such as the Basic Formal Ontology [Citation54]). This reuse and extension of ontologies is vitally important, as it promotes a standard for certain terms and means that similar terminology is not needlessly replicated. Some of the ontologies detailed above have been created with these ideals in mind; however, there is still an issue of interoperability between some ontologies in the biomedical domain, as noted by Shen and Lee [Citation55].

With regards to using existing ontologies, one of the most widely used ontologies in the drug discovery space, and the one that has spawned the most advancements as a result of its creation, is the Gene Ontology (GO) [Citation28,Citation56]. The Gene Ontology Consortium is a collaborative project that aims to provide consistent descriptions of gene products across different databases, and includes controlled vocabulary terms for cellular components, biological processes and molecular functions. There have been a number of projects either to extend the functionality of the Gene Ontology, or to create new semantic platforms that utilize it. Foulger et al [Citation57] undertook work to extend the Gene Ontology with new classes to model processes relating to virus-host interactions by describing microbial and viral gene products. This extension was not created explicitly for drug discovery, but it nonetheless has many potential future applications. Drug discovery is a complex process that involves the consideration and combination of many different types of data, including viral receptors. A potential future application of this ontology could be to construct complex search queries across datasets annotated using these ontology terms to locate all of the viral receptors for a specific cell type. These new ontologies are therefore highly important to the overall advancement of drug discovery.

Another area of vital importance to drug discovery is drug-target identification. Research has suggested that only a very small number of potential druggable targets have been extensively studied within this space, leaving a great deal of unexplored territory to be investigated [Citation46]. With such a vast amount of data and potential untapped knowledge, it is unsurprising that there have been many endeavors to improve this process and facilitate the prediction and identification of relevant drug-target pairs using semantic web technologies. Lin et al [Citation46] recently created the Drug Target Ontology (DTO), which provides a formal semantic model for druggable targets. This was created to facilitate integrating, navigating and analyzing drug discovery data using consistent standards and classifications, thus providing a more efficient, well-defined method of utilizing this data to its greatest potential.

These new ontologies have been created in part due to the lack of standards to formally and consistently represent aspects of drug discovery. The BioAssay Ontology (BAO) [Citation58] was created by Visser et al [Citation59] to provide a common vocabulary for describing drug and probe screening assays and results. Subsequent work has been conducted over the last five years to enhance and extend this ontology [Citation45]. This was to both integrate with further external ontologies, and to modularize the ontology such that the computing costs of the description logic capabilities are spread across different aspects of the ontology, thus reducing these costs where possible depending on which module of the ontology is used. Additionally, after the ontology was improved it was utilized in a new application in 2016 by Clark et al [Citation60] to create a platform to allow domain experts who do not necessarily possess ontology expertise to semantically annotate assay protocols. This is an example of how building these underlying data models facilitates advances in the technology for users within the drug discovery domain.

These new and extended ontologies facilitate advancement in drug discovery, enabling new computational methods to be created that require the rich, interoperable, clearly defined data terms made available by using a combination of RDF and OWL [Citation50].

3.2. Community knowledge bases and semantic annotation for drug discovery

In addition to small groups of researchers creating ontologies, there have been a number of community efforts to create shared schemas for a wide range of disciplines to improve the amount of structured data on the web. These vocabularies are then used to mark up data on websites or in emails to create structured data with semantic annotations that define their context and meaning. Some of these community knowledge bases are domain-specific, such as WikiPathways [Citation61], whereas others are more general purpose, such as WikiData [Citation62], which has over 52 million data items, and schema.org, which is used by over 10 million websites [Citation63]. These large vocabularies have grown to contain data items relevant to drug discovery (for example, schema.org has a specific health-lifesci extension to its core vocabulary, and WikiData contains some data items related to drug discovery) and, now that this initial work has been put in, use cases for these vocabularies are starting to emerge.

Earlier in 2018 Xin et al [Citation44] published their work on using JSON-LD with BioThings [Citation45], which comprises three APIs that provide annotation for Genes, Variants and Chemicals/Drugs. This was to provide interoperability and cross-linking between the different APIs and to provide more complex query capabilities, thus allowing links and answers to be extracted from multiple datasets. The authors postulate that there is still more to be done to standardize URIs for biological concepts, and recognize schema.org as one of the entities that provides some of the vocabularies for the standard concepts. Ekins et al [Citation64] identified WikiData, among others, as a data source for information on the Zika virus, illustrating that multiple data sources would need to be linked together in order to achieve the proposed research on this virus.

Crowdsourcing these vocabularies is a way to bring domain experts (both technical and scientific) together and divide the workload. Furthermore this to an extent decouples the process of adding semantic metadata from creating the vocabularies themselves, and defining how they can be used to markup data. Once these processes are put in place this removes some of the expertise barrier that comes with the use of semantic web technologies. Typically, creating ontologies and semantic data requires the use of both domain and technical experts [Citation65] to capture the appropriate domain knowledge, and also to use it correctly to create the necessary semantic resources. However, once these have been put in place applications can be made that mean domain experts that do not necessarily have the required level of technical experience can still partake in these activities. Tang et al [Citation66] have worked on a community effort to create a knowledge base of drug-target interactions whereby users can help each other annotate their data, working with data approvers who will ensure that this process is performed correctly.

3.3. Semantic search engines for drug discovery

As detailed in Section 2, one of the outcomes of the increase in the availability of linked drug discovery data, and of the creation and use of ontologies to mark up this data, is the ability to design semantic search engines to exploit this data [Citation67]. Semantic search is an advanced technique that facilitates more complex queries, and enables searching via concepts and concept relationships rather than purely by text matching [Citation68]. Sometimes just having the data is not enough: it needs to be combined in ways that are useful to scientists, and undiscovered links need to be found. Furthermore, there are many scientific questions that cannot be addressed with a single data source, but instead require multiple sources. In this vein, Elkseth et al [Citation69] took 37 datasets and marked them up semantically into a linked data format, then normalized them such that they used the same concepts across the different datasets, thus enabling the researchers to build KnittingTools, a semantic search engine for these datasets. Similarly, Djokic-Petrovic et al [Citation69] created a web-based application called PIBAS FedSPARQL, which uses semantic technologies to enable researchers to search across multiple chemical, biological and pharmacological datasets.
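The normalization step described above, in which different datasets are made to use the same concepts, can be sketched in Python as a mapping from dataset-specific labels to shared concept identifiers, followed by a search on the concept rather than on the text. This is an assumed toy illustration, not the method used by any of the cited tools; the dataset names, concept identifiers, target names and functions are hypothetical.

```python
# Hypothetical mapping from (dataset, local label) to a shared concept id.
CONCEPT_MAP = {
    ("datasetA", "ASA"): "concept:aspirin",
    ("datasetB", "acetylsalicylic acid"): "concept:aspirin",
    ("datasetB", "ibuprofen"): "concept:ibuprofen",
}

# Toy records from two sources that label the same compound differently.
records = [
    {"source": "datasetA", "compound": "ASA", "target": "targetX"},
    {"source": "datasetB", "compound": "acetylsalicylic acid", "target": "targetY"},
    {"source": "datasetB", "compound": "ibuprofen", "target": "targetY"},
]


def normalise(record):
    """Attach the shared concept id (None if the label is unmapped)."""
    key = (record["source"], record["compound"])
    return {**record, "concept": CONCEPT_MAP.get(key)}


def search_by_concept(records, concept):
    """Return every record about `concept`, whatever its local label."""
    return [r for r in map(normalise, records) if r["concept"] == concept]


hits = search_by_concept(records, "concept:aspirin")
print(sorted(r["target"] for r in hits))
```

A plain text search for either label alone would miss one of the two records; searching on the shared concept retrieves both.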

Along similar lines, Open PHACTS (Open Pharmacological Concept Triple Store) [Citation70], was created as part of the Open PHACTS project to answer complex questions in drug discovery [Citation71]. It essentially provides an application layer on top of the Open PHACTS API, such that users can search for drug discovery data without needing to understand the RDF data behind it, or even understand SPARQL [Citation72]. This platform combines a number of the well-used datasets in drug discovery (including the Gene Ontology and ChEBI – Chemical Entities of Biological Interest Ontology [Citation73]), such that it is possible to see the relationships between compounds, targets, pathways etc. This work has subsequently been extended by Miller et al [Citation74] to use WikiPathways data [Citation61] (which is a database of biological pathways). López-Massaguer [Citation75] used data extracted from Open PHACTS in their work to obtain QSAR-ready compounds using semantic web technologies.

These are clear examples of how the previous work undertaken to create semantic web foundations for drug discovery has led to new innovations. The initial ontologies needed to be created to put the data into a linked format using the correct concepts, and the linked data needed to be available before it could be linked with other data. Creating these resources enabled these datasets and ontologies that each had their own individual uses to be combined to provide a new layer of knowledge.

3.4. Rules, reasoning, inferencing & learning for drug discovery

Building new ontologies and vocabularies has paved the way for new applications to be created that can make use of not only the semantic links facilitated by the combination of OWL and RDF, but also the rules and reasoning aspects of ontologies. Rule-based reasoning and Inferencing is a powerful aspect of the Semantic Web toolkit. By defining hierarchies, relationships, and the nature of these relationships between concepts, reasoning can be performed to infer where relationships could exist, and what classes different data concepts belong to. Furthermore, researchers have started to use machine learning techniques in conjunction with semantic web technologies, as their training algorithms and neural networks can exploit the semantic links and relationships between the data concepts to find new patterns and make better predictions [Citation76].
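Inferring class membership from a hierarchy can be sketched in Python as follows. The class names and the individual are hypothetical, and the sketch assumes each class has at most one direct superclass and that the hierarchy contains no cycles, which real ontologies do not guarantee.

```python
# Hypothetical class hierarchy: subclass -> direct superclass.
SUBCLASS_OF = {
    "KinaseInhibitor": "SmallMoleculeDrug",
    "SmallMoleculeDrug": "Drug",
}

# Hypothetical asserted fact: drug1 is declared only as a KinaseInhibitor.
asserted_types = {"drug1": "KinaseInhibitor"}


def superclasses(cls, subclass_of):
    """Walk up the hierarchy and collect every ancestor class in order."""
    result = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        result.append(cls)
    return result


def inferred_types(individual, asserted, subclass_of):
    """The asserted class plus every class it is transitively a subclass of."""
    cls = asserted[individual]
    return [cls] + superclasses(cls, subclass_of)


types = inferred_types("drug1", asserted_types, SUBCLASS_OF)
print(types)
```

Only one type was stated for drug1; the other two memberships follow from the hierarchy, so a query for all instances of the most general class would still find it.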

The Gene Ontology has been used to facilitate several drug-target identification and classification systems. For example, Chen et al [Citation77] used GO in conjunction with KEGG (Kyoto Encyclopedia of Genes and Genomes, a drug-target classification system) to score and identify drug-target-based classes and determine which GO terms are the most important for this process.

As with any data-driven domain, time is a key element in conducting investigations and making important discoveries, which makes data and literature mining a vital part of this process [Citation78–Citation80]. Semantic web technologies can be used to enhance and improve the data mining process [Citation81], as data can be mined based on concepts and relationships rather than just text [Citation82]. In this vein, the knowledge graph has been found to be a very useful aspect of semantic data mining [Citation83]. Knowledge graphs can be constructed to provide the relevant graph pathways between related concepts to yield new information and links. Malas et al [Citation84] highlighted how semantic knowledge graphs can be useful in identifying drug candidates, pointing out that they can connect different databases and reflect relationships between different concepts, in this instance, gene pathways and diseases. Similarly, Sang et al [Citation85] developed a knowledge graph based on literature mining in order to discover new drug therapies from the literature. The knowledge graph was created from extractions from PubMed abstracts and was then used by SemTyP (Semantic Path Type – an intelligent drug discovery method) to exploit the semantic paths to discover new drug therapies.
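The notion of a graph pathway between related concepts can be sketched with a small example. The code below is not the method of any cited work; it is a plain breadth-first search over a hypothetical toy graph, showing how a chain of typed relations can connect a drug to a disease that it is not directly linked to. All node and relation names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy knowledge graph as (subject, relation, object) edges.
edges = [
    ("drug_A", "inhibits", "gene_X"),
    ("gene_X", "participates_in", "pathway_P"),
    ("pathway_P", "associated_with", "disease_D"),
    ("drug_B", "binds", "gene_Y"),
]


def build_adjacency(edge_list):
    """Index outgoing (relation, object) pairs by subject."""
    adjacency = {}
    for subject, relation, obj in edge_list:
        adjacency.setdefault(subject, []).append((relation, obj))
    return adjacency


def shortest_semantic_path(edge_list, start, goal):
    """Breadth-first search; returns a list of (relation, node) hops or None."""
    adjacency = build_adjacency(edge_list)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(relation, neighbor)]))
    return None


path = shortest_semantic_path(edges, "drug_A", "disease_D")
```

Here `path` records both the intermediate concepts and the relation types along the way, which is the kind of information a path-based method can use as evidence for a candidate drug-disease link. `drug_B` has no route to `disease_D` in this toy graph, so the same call for `drug_B` returns `None`.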

These projects illustrate the power of linking and reasoning in ontologies that have helped to further drug discovery capabilities. They also demonstrate how different learning techniques can be used in conjunction with semantic web technologies to exploit those much-needed links between data concepts to make new discoveries.

3.5. Semantic applications in drug discovery

In addition to the work conducted to extend these resources and technologies, a number of new semantic web applications for drug discovery have been created, both in industry and academia. New systems are being built that demonstrate creating and using ontologies for drug discovery and using semantic search and machine learning techniques for drug discovery. Furthermore, scientific companies are also making partnership agreements with semantic web companies to enable them to use semantic capabilities with their data and products. As noted in the Laboratory Informatics Guide 2019: Semantic search and metadata alongside other software approaches are being increasingly used to bridge the gap between traditional methods and the new data-driven research methods that are being used today.

Passi et al [Citation86] created the drug repurposing platform RepTB, which uses molecular function correlations among known drug-target pairs to predict novel drug-target pairings that could be repurposed for tuberculosis (TB). To facilitate this platform, a new Gene Ontology Network was created (based on the Molecular Function Ontology – part of the GO set) and Network Based Inference (NBI) [Citation87] was used to predict which existing drugs could potentially be repurposed to combat TB by identifying new drug-target interactions and computing their association scores. Whilst NBI in itself is not a semantic technology, research has shown that the semantic links between the data can significantly improve the prediction capabilities of supervised learning models, as demonstrated by [Citation88]. In a similar vein, Olier et al [Citation89] made use of ontologies and semantic datasets in their QSAR (Quantitative Structure Activity Relationships) learning algorithm, illustrating that their techniques could improve performance in these areas by up to 13%.
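As a rough illustration of how association scores can be computed on a drug-target network, the sketch below implements a simplified two-step resource-spreading scheme in the general style of network-based inference. It is not the RepTB implementation, and the interaction data are hypothetical. Resource is placed on the known targets of a query drug, each target shares it equally among the drugs that hit it, and each drug then shares what it received equally among its own targets.

```python
# Hypothetical known drug -> targets interactions (illustrative only).
known = {
    "drug_1": {"target_a", "target_b"},
    "drug_2": {"target_b", "target_c"},
    "drug_3": {"target_c"},
}


def nbi_scores(interactions, query_drug):
    """Two-step resource spreading: targets -> drugs -> targets."""
    # Invert the map so each target lists the drugs known to hit it.
    drugs_of = {}
    for drug, targets in interactions.items():
        for target in targets:
            drugs_of.setdefault(target, set()).add(drug)

    # Initial state: one unit of resource on each known target of the query drug.
    target_resource = {t: 1.0 for t in interactions[query_drug]}

    # Step 1: every target splits its resource equally among its drugs.
    drug_resource = {}
    for target, amount in target_resource.items():
        share = amount / len(drugs_of[target])
        for drug in drugs_of[target]:
            drug_resource[drug] = drug_resource.get(drug, 0.0) + share

    # Step 2: every drug splits what it received equally among its targets.
    final = {}
    for drug, amount in drug_resource.items():
        share = amount / len(interactions[drug])
        for target in interactions[drug]:
            final[target] = final.get(target, 0.0) + share
    return final


scores = nbi_scores(known, "drug_1")
# Targets not yet linked to the query drug are the candidate predictions.
candidates = {t: s for t, s in scores.items() if t not in known["drug_1"]}
```

In this toy network `target_c` receives a non-zero score for `drug_1` solely because `drug_2` shares `target_b` with it. The total amount of resource is conserved across both steps, so the scores are comparable across targets for a given query drug.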

Han et al [Citation9] harnessed the power of ontology inferencing combined with network analysis to identify new immune-relevant drug-target genes for Alzheimer’s Disease (AD). This work was based on the theory that, given the strong relation of AD to the immune system, there were potentially undiscovered immune-relevant AD drug-target genes that could help combat the disease. Han et al created an ontology of AD drug, gene, SNP, disease, and haplotype data, and reasoned over it to identify which target genes to analyze further. Enrichment analysis was then performed on these targets, making use of the Gene, PANTHER and Reactome ontologies, in addition to eight other databases. In addition to using description logic-based rules in OWL, work has also been undertaken to use SWRL (Semantic Web Rule Language). Herrero-Zazo et al [Citation90] created DINTO (Drug-Drug Interactions Ontology) to formally represent different types of drug-drug interactions (DDIs), using SWRL rules to infer different types of DDIs. By using this rule-based inferencing technique, the researchers were able to demonstrate that DDIs and their mechanisms could be inferred on a much larger scale than with previous knowledge bases.
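The flavor of rule-based DDI inference can be conveyed with a small sketch. The rule below is a hypothetical example written in plain Python rather than SWRL, and it is not taken from DINTO: if one drug inhibits an enzyme and a different drug is metabolized by that same enzyme, a potential interaction between the two is inferred. The drug, enzyme and predicate names are all assumptions made for illustration.

```python
# Hypothetical asserted facts as (subject, predicate, object) triples.
facts = {
    ("drug_A", "inhibits", "enzyme_E1"),
    ("drug_B", "is_metabolized_by", "enzyme_E1"),
    ("drug_C", "is_metabolized_by", "enzyme_E2"),
}


def infer_potential_ddis(asserted):
    """Apply one illustrative rule:
    inhibits(a, e) AND is_metabolized_by(b, e) AND a != b
        => may_interact_with(a, b)
    """
    inhibitors = {}   # enzyme -> drugs that inhibit it
    substrates = {}   # enzyme -> drugs metabolized by it
    for subject, predicate, obj in asserted:
        if predicate == "inhibits":
            inhibitors.setdefault(obj, set()).add(subject)
        elif predicate == "is_metabolized_by":
            substrates.setdefault(obj, set()).add(subject)

    inferred = set()
    for enzyme, inhibiting_drugs in inhibitors.items():
        for precipitant in inhibiting_drugs:
            for affected in substrates.get(enzyme, set()):
                if precipitant != affected:
                    inferred.add((precipitant, "may_interact_with", affected))
    return inferred


ddis = infer_potential_ddis(facts)
```

Only `drug_A` and `drug_B` share an enzyme in this toy fact base, so a single potential interaction is inferred, and the enzyme that links them doubles as a candidate mechanism. Expressing such rules declaratively over an ontology allows the same logic to be applied across a whole knowledge base rather than pair by pair.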

Vitrana [Citation91] are now using semantic and ontology-enabled search in their HiLIT Platform (which delivers specialized clinical and healthcare information management services) [Citation92]. BioMax, a semantic search company, partnered with Royal DSM (a company working in health, nutrition and materials) to provide them with a sophisticated AI-Semantic search platform which uses AI-augmented search algorithms over their semantic network. Zheng et al [Citation93] recently reported successfully using ontologies to help identify factors in schizophrenia, and NuMedi [Citation94] announced their adoption of semantic web technologies to accelerate drug discovery in complex and rare diseases [Citation95].

These endeavors and applications illustrate the new wave of innovation in semantic web tools for drug discovery. Industry and academia alike have demonstrated an increased uptake in the use of semantic web technologies in this domain, both through the efforts that have been made to create new drug discovery semantic resources (knowledgebases, platforms and ontologies), and also through the ways they have progressed to use these technologies.

4. Expert opinion

There is a circular issue in using semantic web technologies to aid in data-driven spaces such as drug discovery. Very powerful applications can be made that harness semantic capabilities and use them in conjunction with other technologies to make predictions and identify new links in drug discovery. However, in order to do this, the underlying standards and formal descriptions of the data are required, which involves first creating and then curating the linked datasets and ontologies. Therefore, whilst these technologies have been in existence for nearly two decades, it is only recently that real advancements in terms of utilizing novel computational methods within drug discovery have emerged.

Much of the initial work in the drug discovery space was to produce ontologies and create and integrate linked datasets related to this domain. Furthermore, some of these ontologies (such as the Basic Formal Ontology (BFO) [Citation35]) were not intended directly for drug discovery but were created for general-purpose representation of scientific data, and many of the specifically developed drug discovery ontologies have built upon this early work. These underlying ontologies may not have seemed like overt advancements at the time of creation, but the overall set of ontologies and integrated linked datasets that now exist for drug discovery has paved the way for some real advancement in what semantic tools can do to aid in this space. Furthermore, researchers have been taking steps to enhance and extend existing ontologies with necessary information, and to reuse ontologies where possible, rather than creating new ones that replicate the same information. For example, the Ontology of Adverse Vaccine Events [Citation96] was created to extend the original Ontology of Adverse Events [Citation97] and the Vaccine Ontology [Citation53], demonstrating that the initial work done to create these ontologies helped enable further development of a formal vocabulary to describe these terms. There have also been many community endeavors to provide open schemas and vocabularies with which to embed semantic metadata within HTML documents, thus improving the general representation of data on existing websites. Due to these efforts, much of the recent work undertaken in this area has been to create novel applications that make use of the semantic links, logical capabilities, and inferencing and reasoning capabilities of these technologies for drug discovery (as illustrated in Figure 4).

Figure 4. A variant on the Semantic layer cake [Citation98] illustrating the added value of using Semantic web technologies for drug discovery.

The capabilities of the core semantic web technologies, and indeed researchers’ understanding of how best to get value out of their use, have been advanced in the past six years. Newer, improved versions of RDF, OWL and SPARQL have all been developed with more advanced capabilities: more clearly defined data types, better ways of representing and structuring the rules around the data, and improved methods for querying linked data. Furthermore, new formats have become available to embed metadata within existing websites, and of course the knowledge graph has been conceptualized, providing new ways of creating and exploiting semantic links in and between datasets. However, whilst these advancements have been useful, the authors assert that the main ‘advancement’ has been human-driven rather than technology-driven. The core concepts of the Semantic Web have been around for nearly two decades, and it has taken that time for its value to be appreciated on a larger scale, both as a whole, and for specific endeavors such as drug discovery. The drug discovery community has now reached a state where a lot of useful groundwork has been put in, both to make specific ontologies and drug discovery knowledge bases, and also for the overall semantic web community to crowdsource large-scale knowledge bases.

This paper has shown a shift towards using more powerful semantic searching capabilities for drug discovery (based on the plethora of new semantic resources available) and the work conducted by industry and academia to produce superior semantic search tools. It has also become clear that researchers and companies are starting to realize the power of harnessing machine learning capabilities alongside semantic web technologies to exploit linked datasets of drug discovery information.

Further work should be undertaken, not only to extend these endeavors now that there is a solid starting point, but also to exploit the nature of semantic web technologies in conjunction with other technologies, such as machine learning, to create intelligent semantic systems to aid and advance drug discovery. This can only be achieved if researchers continue to work together to ensure that the necessary data and ontologies are made available in the areas of drug discovery that need to be addressed. When considering problems that need to be solved in drug discovery, researchers should consider what semantic data sources are available to them, and exploit their links in their applications. If the data sources that they require are not available, efforts should be made to create them. It is important to note that, where possible, reuse of ontologies should be encouraged. When creating new vocabularies, researchers should consider extending existing resources rather than duplicating information. To this end, the ontology communities that facilitate collaborative ontology and vocabulary development have great potential to help in this area.

There are a number of community endeavors, such as schema.org, that allow the public to request new items for vocabularies and to ask which vocabularies to use and how best to use them. These resources should be utilized, ensuring a strong interdisciplinary collaboration between technical and domain experts. The work considered in this article illustrates the importance of the human factor in using semantic web technologies for drug discovery. The core technologies of linked data, annotations and ontologies are key enablers for facilitating semantic search and enhancing the capabilities of other intelligent technologies. As demonstrated in this article, these technologies will be of considerable value to the drug discovery process. If the momentum of semantic web development in this domain continues along its current trajectory, there is the potential to make a real difference to drug discovery.

Article highlights

  • The uptake of using semantic web technologies to aid and advance drug discovery has increased in both academia and industry over recent years.

  • Semantic web applications for drug discovery began as predominantly knowledge bases, and they have evolved to include powerful semantic search capabilities, and have made use of other technologies such as machine learning to exploit the links in semantic datasets to make better predictions.

  • A significant driver in creating new semantic web applications for drug discovery has been the increased availability of shared semantic resources (tools, data and the community). Creating (and extending existing) semantic drug discovery resources enabled datasets and ontologies that each had their own individual uses to be combined to provide a new layer of knowledge.

  • The drug discovery community has now reached a state where a lot of useful groundwork has been put in, both to make specific ontologies and drug discovery knowledge bases, and also for the overall semantic web community to crowdsource large-scale knowledge bases. This means that there is now a potential for exploitation; as evidenced by the uptake of commercial semantic drug discovery applications.

  • Further work should be undertaken, not only to extend these endeavors now that there is a solid starting point, but also to exploit the nature of semantic web technologies in conjunction with other technologies, such as machine learning, to create intelligent semantic systems to aid and advance drug discovery.

Declaration of interest

The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Acknowledgements

The authors would like to thank Dr Colin Bird for his invaluable help with editing and reviewing the manuscript.

Additional information

Funding

This work was supported in part by the Web Science Centre for Doctoral Training at the University of Southampton, funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under Grant No: EP/G036926/1, and by the Artificial Intelligence and Augmented Intelligence for Automated Investigations for Scientific Discovery Network+ (AI3SD), which is also funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under Grant No: EP/S000356/1.

References