
A Minimal Metadata Schema and Its Tool to Improve the Searchableness of Research Data in Bioinformatics


Abstract

Bioinformatics develops methods to understand biological data and to explain biological processes. The discipline operates on huge data sets and is computationally intensive. The fast growth of the field and its specialization into many subdisciplines make it hard to search for, find, and keep track of scientific results relevant to one’s own research. Being aware of prior studies, and of the data they rely on, is a prerequisite for furthering one’s own research and for ensuring effective progress of bioinformatics as a field. At the same time, scientists’ own research and reputation benefit from the best possible searchableness of their research data. The FAIR data movement draws from this motivation. Before research data can be accessed and reused, however, it must first be found or discovered. For this purpose, the large space of highly diverse research data must be conquered: it must offer a high quality of searchableness. To increase searchableness, we have devised a metadata schema that describes the entire field of bioinformatics with a small set of descriptors. Our metadata schema has been inspired by Dublin Core and aims at replicating its success in the domain of bioinformatics. The schema aims at complementing the many metadata schemes used by bioinformaticians in practice by extracting their common core, yielding a schema that can be used across bioinformatics subdisciplines. Our minimal schema for bioinformatics metadata is complemented by a Web-based annotation tool with which such metadata can be provided in an effective, time-saving, and concise manner.

Introduction

Like many disciplines, bioinformatics is growing and diversifying into different subfields. The broadening and deepening of the field can be observed in the many metadata approaches and the large number of special-purpose deposition databases that are specific to the subdisciplines of bioinformatics. This makes it hard to implement a FAIR-compliant research data management policy across the entire bioinformatics community. While the research communities each acknowledge the need for FAIR compliance in their respective subfield, each new metadata standard created for a subfield, and each new deposition database, dilutes the FAIR principles at the overarching level of bioinformatics. Paradoxically, we believe that yet another metadata standard, together with a tool to create descriptions adhering to it, will improve the current state of affairs. The metadata scheme we propose aims at maximizing concise information content while minimizing the number of metadata fields, each of which describes a single, specific aspect of bioinformatics studies at a high level of granularity; proper tool support minimizes the work needed to fill in all fields with maximum conciseness by making use of external, widely accepted controlled vocabularies.

The motivation of our work stems from the authors’ involvement in the BioDATEN project,1 a three-year project that aimed at exploring and contributing to a common infrastructure for research data management in bioinformatics, encompassing the entire life cycle from data creation to data reuse. The project profited from a wide range of partner organizations located in the German federal state of Baden-Wuerttemberg, namely, the universities of Tübingen, Freiburg, Heidelberg, Hohenheim, Konstanz, and Ulm, the Max Planck Institute for Animal Behavior, the Deutsches Krebsforschungszentrum, and the European Molecular Biology Laboratory in Heidelberg. Partners contributed their computing infrastructure and their existing expertise in research data management, or held research data from a wide range of biomedical domains, broadly speaking, from plant biology, cell and tumor biology, and functional and structural genomics to molecular biology. Within the project, a central repository was to be built to host and give access to this diverse set of research data. As a result, we were faced with the problem of describing such diverse data with a common vocabulary so that users from the project partner community could search for and discover research data relevant to their interests in a principled, uniform manner.

To some degree, existing research data came with rich metadata, and we soon found that the diversity in research data was matched by the diversity of its metadata. For example, whole genome sequencing (WGS) analyzes the complete genome of a given organism based on, e.g., a blood sample, while single cell sequencing (SCS) targets the genome of a specific cell. And while the research questions of WGS are rather broad, those of SCS have a narrower scope. Naturally, this is reflected in the metadata required to adequately describe the data. For WGS, it might be sufficient to record the sample origin as blood, while for SCS, the specific cell type must be recorded. Other common research methods do not target specific organisms at all. Researchers interested, say, in the bacterial microbiome or biodiversity of soil, water, or air samples use 16S sequencing to detect and assess the bacterial population in a given sample. In this case, there is no specific organism or cell type to record. Instead, researchers are required to record the sample origin, say, in terms of geolocation and type. In transcriptomic research, researchers try to establish links between genes that are activated at different developmental stages or in different environmental conditions, or that are linked to certain metabolic processes, and this, again, requires yet another descriptive toolset.

In the project, we hence set out to extract the common core of all these different descriptive means: a common vocabulary that can be used to describe bioinformatics data at a high level of granularity. This more abstract description should not replace the many metadata schemes available for each specific subdiscipline, but complement them; we were looking for an overarching scheme that can be easily understood and used by all bioinformaticians.

Our work is inspired by the Dublin Core (DC) metadata standard (ISO/TC 46/SC 4 Technical Committee, 2017). DC consists of only 15 elementary descriptors, but they suffice to find hundreds of millions of books, newspapers, microfilm reels, maps, sheet music, sound recordings, prints and photographic images, and other resources. While it is true that librarians use more expressive metadata standards such as MARC (Library of Congress, 2023) (with hundreds of descriptors), there is a well-established crosswalk that maps MARC-based metadata to Dublin Core. Naturally, such a crosswalk comes with information loss, which, however, is outweighed by DC’s main benefit, namely, its simplicity and user-friendliness. Clearly, librarians do not seek to replace MARC with Dublin Core, as they wish to maintain the expressiveness of that standard. But they are happy to complement the MARC-based annotations of their holdings with DC-based descriptions to make them more searchable for the layman. Also, once an interested party has identified a resource using DC, it is often possible to access the more expressive MARC-based description of the resource to get information that is only expressed there. We believe that this approach of moving from simple, generic metadata to metadata that is rich and specialized is urgently needed in the bioinformatics domain.
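To give an impression of the level at which Dublin Core operates, the following is a minimal sketch of a DC record in the common oai_dc XML serialization; the resource and all values are invented for illustration only.

    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- A handful of the 15 DC elements already supports discovery. -->
      <dc:title>Soil microbiome field notes, site A (invented example)</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:subject>soil microbiome</dc:subject>
      <dc:description>Illustrative record; all values are placeholders.</dc:description>
      <dc:date>2020</dc:date>
      <dc:type>Dataset</dc:type>
      <dc:identifier>https://example.org/records/12345</dc:identifier>
    </oai_dc:dc>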

The question is whether a DC-like minimal metadata standard can be defined for the entire field of bioinformatics resources, one that is widely accepted and used across bioinformatics subdisciplines. Such a standard would not replace the many metadata standards that exist for the many bioinformatics subfields, but rather complement them at the overarching level of bioinformatics. Researchers would use the minimal standard to search for resources across many different databases, and for each resource found, they could always inspect the other, more specific metadata attached to the resource to get more information about it.

In this paper, we describe such a minimal metadata standard, aiming, in particular, at increasing the searchableness of research data in agreement with the FAIR principles. The paper is structured as follows. We first describe the current situation of research data management in bioinformatics. We review existing standards for describing data, and we also list some data repositories or databases where such data is stored. In light of this background, we discuss the shortcomings of the current situation in terms of the FAIR principles for research data management and prepare the ground for our work. In the main part of the paper, we first present a minimal set of descriptors for the markup of bioinformatics data; we then describe an annotation tool that supports researchers in the provision of the minimal metadata. This includes easy access to controlled vocabularies that should be used to provide the values for many of the descriptors. The paper concludes with a discussion of our approach.

Background

Researchers in bioinformatics make extensive use of computer-supported workflows or pipelines, and data analysis is performed on both local workstations and high performance computing clusters. Intermediate data is stored locally or in cloud space, and is often reanalyzed or reused in modified workflows. Once results have been obtained, a scientific paper has been written to describe the approach, and the paper has been accepted for publication in a journal, metadata must be provided to satisfy the journal’s metadata guidelines and the paper’s reviewers. The metadata asked for by the publisher differs from case to case, with varying formatting requirements (e.g., upload metadata as a tab-separated file). The metadata provided to the publisher is often available as supplementary information for download on the journal’s website. To complement the publication of the scientific paper and its supplementary part, the underlying research data is often uploaded to subdiscipline-specific repositories such as ArrayExpress2 for functional genomics data (Athar et al., 2019), the European Nucleotide Archive (ENA) for nucleotide sequence data,3 the PRoteomics IDEntifications Archive (PRIDE) for proteomics data,4 the European Genome-phenome Archive (EGA) for personally identifiable genetic and phenotypic data,5 BioSamples for biological samples and associated experimental data (Courtot et al., 2021), and archives for many other types of data.

To keep track of the many existing special-purpose deposition databases for bioinformatics data, sites like the National Center for Biotechnology Information6 (NCBI), EMBL’s European Bioinformatics Institute,7 and ELIXIR8 serve as central portals for researchers to access them.

Ideally, metadata and the data it describes should have a number of qualities that allow researchers to find studies of interest, understand them, and potentially reevaluate and reuse the data in other research contexts.

In the context of research data, we understand the term “searchableness” as the quality or capability of being searchable. In other words, the term describes how easily research data can be found or located through a search process. To discover research data of interest, users must formulate appropriate search queries to retrieve results relevant to their interests. A highly searchable universe of data requires that the data’s description is well organized, adequately indexed in a database holding all metadata, and easily accessible through search mechanisms, while a low level of searchableness suggests that finding specific information may be challenging or inefficient for users. Our minimal metadata scheme aims at describing and organizing research data in the bioinformatics domain in terms of a few simple organizational principles.

FAIR research data management

Across scientific areas, there is an increasing awareness that proper and disciplined management of research data is required so that research data can be more easily discovered and reused, and that research work must be increasingly well documented so that published results can be reproduced and verified (Chervitz et al., 2011; Cernava et al., 2022). In recent years, the FAIR data movement for proper research data management has gained considerable momentum with its four guiding principles: Findability, Accessibility, Interoperability, and Reusability, see Figure 1.

Figure 1. The FAIR principles according to Wilkinson et al. (2016).

With the field of bioinformatics making extensive use of data, and of computational methods to produce and process such data, the implementation of FAIR principles for research data management is anything but trivial. Clearly, the field offers many different databases where research data can be deposited and searched for (F), but the sheer number of such databases that researchers would need to consult to find data has a negative impact on the findability criterion. Data and metadata are scattered across the scientific publication that describes a study, supplementary information somehow attached to the publication, and research data deposited in a community-specific repository (A). And while many formally defined metadata standards and ontologies are available, it is often the case that metadata is made available in individually constructed Excel tables whose column identifiers and cell values use free-form strings rather than controlled terms (I). The ambiguity of such Excel-based representations makes the data they describe less reusable (R).

Our work aims at contributing to the FAIR principles by defining a metadata schema that increases the searchableness dimension of research data. With our schema, data can be richly described in terms specific to the bioinformatics domain (F2), using a vocabulary that accurately captures the attributes of this domain (R1) and whose attribute values must, for the most part, stem from ontologies that are accessible and broadly shared in the bioinformatics community (I1).

Metadata standards in bioinformatics

For the following discussion on metadata standards, we take a step back and reflect on the nature of any scientific discipline. Each discipline can be characterized by how it answers two main questions: What are the objects of its scientific inquiry, and in which manner, by which methods, are these objects being studied? In bioinformatics, for example, an object of study may be the genome, the genetic information of an organism and the nucleotide sequences of DNA or RNA it consists of. A study may only focus on a particular gene, or on a particular interaction between DNA, RNA, proteins, or other substances. Or a study may focus on a range of targets, such as a regulon, an association of multiple genes or operons that are regulated by a common regulatory protein. Examples of methods of study are DNA sequencing, protein analysis, or gene regulatory network modeling. FAIR metadata for describing studies in bioinformatics must be able to accurately describe both the objects of study and the methods used to study them.

There is no single metadata standard in bioinformatics that is generally accepted and in wide use. On the contrary, the metadata landscape can be characterized as manifold, and often a metadata standard only covers a particular subfield of bioinformatics. The manifold metadata landscape thus mirrors the manifold research topics and specializations in the vast field of omics.

Nonetheless, two standards aim at covering the entire field: bioschemas.org and ISA.9 The schema bioschemas.org is built upon schema.org, a metadata language that was initiated by the major search engine companies (Google, Microsoft, Yahoo and Yandex) (Guha et al., 2015). The initial aim was to define a vocabulary that website providers could use to enrich their (often HTML-based) websites with semantic information, allowing search engines to complement their full-text search with semantic features. Since its inception, the terminology of schema.org has benefited from regular vocabulary extensions, making available terms such as https://schema.org/Dataset and https://schema.org/Observation. The bioinformatics community extends schema.org with terminology specific to its discipline such as https://schema.org/BioChemEntity with its subterms ChemicalSubstance, Gene, MolecularEntity, and Protein. While an Observation can be described in terms of a measurement technique, there is little controlled vocabulary in place to describe method types, or instances thereof, in detail.

ISA is a metadata framework that focuses on the description of experiments in the life sciences. It offers vocabulary to express administrative and domain-specific metadata. Its three components are Investigation (the project context), Study (a unit of research), and Assay (analytical measurements). Its rich vocabulary allows researchers to provide extensive descriptions of experimental metadata (i.e., sample characteristics, technology and measurement types, sample-to-data relationships) so that the resulting data and discoveries are reproducible and reusable. The ISA metadata schema comes with a variety of tools supporting researchers with the task of providing FAIR metadata; for a Python-based API, see Johnson et al. (2021).

There are many “smaller” schemas that cater for bioinformatics subdisciplines. If the object of study is the genome, then researchers can use the relevant standard from the MIxS family of standards,10 such as MIGSEukaryote for eukaryotes, MIGSPlant for plants, and MIGSVirus for viruses. For methods to study objects of interest in bioinformatics, there are special-purpose standards such as MIAME, the Minimum Information About a Microarray Experiment (Brazma et al., 2001), or MINSEQE, the Minimum Information about a high-throughput Nucleotide Sequencing Experiment (Brazma et al., 2012).

The situation is more complex for omics data; please consult Chervitz et al. (2011) for a good overview of metadata standards for this domain.

The FAIR principles stress the need for interoperable (meta)data. In bioinformatics, there exists a good number of different ontologies, ranging from high-level ones such as EDAM11 (Ison et al., 2013) to very detailed ones such as the ontology of genes and genomes12 (70k terms), the ontology for cells13 (15k terms), or the Protein Ontology for taxon-specific and taxon-neutral protein-related entities14 (Natale et al., 2017) (230k terms). In fact, ontology building has a long history in bioinformatics and the life sciences, probably stemming from the large number of (non-)organisms that must be organized and named in a concise and unambiguous manner to make scientific communication effective. The art of ontology building is by no means trivial (Smith et al., 2006), and there have been efforts toward building a reference terminology for ontology research and development in the biomedical domain (Ceusters & Smith, 2010). With the number and sizes of ontologies increasing, it also becomes increasingly problematic to use them together to organize, curate, and interpret the vast quantities of data arising from biological experiments. The Open Biological and Biomedical Ontologies (OBO) Foundry was created to address this by facilitating the development, harmonization, application, and sharing of ontologies, guided by a set of overarching principles (Jackson et al., 2021). The OBO website15 lists 20 principles that ontology developers shall take into account, such as being openly available, using a common formal language, maintaining an identifier space, and providing textual definitions for classes and terms.

In sum, bioinformatics researchers describe their data with extensive metadata adhering to varying metadata schemes and informally defined, Excel-based formats, which are often issued by publishers (the journals where research studies are published) or by the repositories they use for archiving their data. The manifold nature of the metadata and the distributed storage of research data and their metadata in external research data repositories or on publishers’ websites make it hard to discover and reuse research data of interest. Often, the scientific publication, well-documented research data, and metadata are all required to make research data FAIR. The sheer amount and highly distributed nature of research data, together with the use of many different metadata schemas, however, affect searchableness in a negative manner.

To remedy the situation, we believe that research data must get an additional higher-level description that is understandable across the bioinformatics sub-disciplines. Clearly, the new description by itself contributes to an increased searchableness, but is insufficient to achieve FAIRness. This is only possible in combination with more specific metadata schemes and scientific publications where research data is described.

The CMDI metadata infrastructure

The design of a new metadata schema is by no means trivial. Basically, it is a knowledge representation enterprise that, ideally, profits from both knowledge engineers (with expertise in terminology/ontology management) and domain experts who make available their knowledge, or relevant parts or views thereof, so that it can be organized and expressed in term hierarchies. The knowledge engineer, in turn, needs access to techniques (such as interview techniques) and to a knowledge representation toolset that helps express all knowledge in a formal language that is both machine-processable and readable by humans.

The creation of our minimal metadata schema for bioinformatics has benefited from CMDI, the Component MetaData Infrastructure (Broeder et al., 2010). The infrastructure supports constructing schemas in a brick-like manner by making use of predefined, hierarchically structured components (maintained in the Component Registry16) and elementary metadata fields (maintained in the Concept Registry,17 or in any other data category registry). The CMDI framework encourages its users to reuse components and fields whenever possible; otherwise, new components and fields can be defined and added to the respective registry.

While both registries precede the formulation of the OBO Foundry principles (Jackson et al., 2021), they adhere to many of them, at least to a certain extent. Both registries implement, for instance, some notion of openness, support a common language (XML), offer a persistent identifier for each of their entries, and require each entry to come with a textual definition.

Table 2. CMDI metadata component for method; measurement conditions.

In our work, we have used the CMDI metadata infrastructure, in particular, the Component Registry, to define the Bioinformatics Core Metadata Set (BC), a small set of descriptors to describe bioinformatics data at a high level of granularity.

The bioinformatics core metadata set (BC)

In this section, we present a minimal set of descriptors required to mark up any bioinformatics resource at a high level so as to enhance the searchableness aspect of FAIR metadata. The metadata schema is complemented by an annotation tool that we present in the next section.

The underlying idea and main structure of our metadata set stem from the observation that any scientific discipline can be described in terms of the objects it studies and the methods it develops and employs to make scientifically backed statements about the objects of study. Our metadata schema is an attempt to instantiate this understanding for the domain of bioinformatics. Figure 2, a screenshot taken from the XML editor Oxygen, depicts the basic structure of our minimal schema for describing research data in bioinformatics. The schema consists of four components, each of which must occur at least once but may occur multiple times: the object being studied, the method used to study the object, experimental runs that link together the studied object and the method via their respective identifiers, and a File component that lists the data files resulting from experimental runs. We describe the first two components in detail and skip the components Run and File as they are not relevant for the purpose of this paper.18
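To make this structure concrete, the following schematic sketch shows how an instance with the four components could be laid out. The component names StudiedObject, Method, Run, and File follow the schema described here; the root element, the id attributes, and the reference elements inside Run and File are illustrative placeholders and not necessarily the names used in the actual serialization.

    <BioinformaticsCoreMetadata>
      <StudiedObject id="so-1">
        <!-- what was studied; see the subsection on objects of study -->
      </StudiedObject>
      <Method id="m-1">
        <!-- how it was studied; see the subsection on methods of study -->
      </Method>
      <Run>
        <!-- links a studied object and a method via their identifiers -->
        <studiedObjectRef>so-1</studiedObjectRef>
        <methodRef>m-1</methodRef>
      </Run>
      <File>
        <!-- data files resulting from the experimental run -->
        <fileName>run1_results.tsv</fileName>
      </File>
    </BioinformaticsCoreMetadata>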

Figure 2. The BC Schema for bioinformatics data, centered around the two ingredients StudiedObject (description of what was studied) and Method (how the object was studied).

Objects of study

A bioinformatics experiment has one or more objects of study that are analyzed using one or more scientific methods. Table 1 shows the (CMDI-based) metadata component for the description of a StudiedObject.19

Table 1. CMDI metadata component for studied object/sample: material entity from which the data was derived.

As with all other components, the component StudiedObject comes with a unique identifier (later used to link it to a method id) and a description in natural language. The field typeStudiedObject has two possible values: organism (e.g., eukaryota, bacteria, archaea) or non-organism (e.g., virus, water, environmental RNA). If organism is chosen, then the fields cellType and organism become mandatory and must be filled out; otherwise these fields are not shown to the metadata annotator. In the organism field, the organism should be named, preferably by making reference to an ontology such as https://bioportal.bioontology.org/ontologies/BERO/. Similarly, in the cellType field the cell type shall be specified, again using a controlled vocabulary such as https://bioportal.bioontology.org/ontologies/CL.20 Information additional to cell type and organism can be specified in an extra field. Also, the field nameMaterial is used to name the material (structure, substance, device) removed from a source (patient, donor, physical location, product), or to name a material entity that has the specimen role. Here, terms from https://bioportal.bioontology.org/ontologies/MESH shall be used. The measurement target of the studied object or sample also needs to be specified, say by naming the relevant gene, protein, or compound. This information is complemented by a link to a database where information on the measurement target is stored (e.g., by specifying an ENA number). All fields have placeholder information attached to them to inform metadata annotators about expected values (see below).
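As an illustration, a StudiedObject description for a plant sample could look roughly as follows. The field names id, description, typeStudiedObject, organism, cellType, nameMaterial, and additionalInformation follow Table 1 and the text; the element names for the measurement target and the database link, as well as all values, are illustrative assumptions.

    <StudiedObject id="so-1">
      <description>Seed tissue sample from Arabidopsis thaliana</description>
      <typeStudiedObject>organism</typeStudiedObject>
      <!-- the next two fields are shown only because an organism was chosen -->
      <organism>Arabidopsis thaliana</organism>
      <cellType>plant cell</cellType>
      <nameMaterial>Seeds</nameMaterial>
      <measurementTarget>ABI3 regulon</measurementTarget>
      <databaseLink><!-- e.g., a link to the corresponding ENA record --></databaseLink>
      <additionalInformation>Free-text notes on sample handling.</additionalInformation>
    </StudiedObject>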

Methods of study

In the component Method, users specify how the studied object/sample was prepared or processed, e.g., via extraction, purification, concentration, or derivation, see Table 2. By default, the use of the controlled vocabulary at https://ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0000094 is supported. The field typeOfMethod is used to specify the class of the method being used: there are seven method types to choose from, for instance, sequencing method, analytical method, protein–protein interaction method, or particle analysis method. An Other field is supplied if none of the suggested method types applies. Once users have decided on the method class, follow-up fields are shown. For example, when the method is a sequencing method, users can choose between 16 different sequencing methods (including an Other value):

  • Maxam–Gilbert sequencing

  • Chain termination (Sanger sequencing)

  • Pyrosequencing

  • Ion semiconductor sequencing

  • Single-molecule real-time sequencing

  • Sequencing by synthesis

  • Combinatorial probe anchor synthesis

  • Sequencing by ligation

  • Nanopore Sequencing

  • GenapSys Sequencing

  • Massively parallel signature sequencing (MPSS)

  • Polony sequencing

  • DNA nanoball sequencing

  • Helicos single molecule fluorescent sequencing

  • Microfluidic Systems

  • Other

When the method is an analytical method, it can be further detailed into, say, high performance liquid chromatography, gas chromatography, or bioassay. Again, an Other field is supplied for each method instance if none of the predefined methods applies. Also, users are asked to give a URL pointing to a description of the method (e.g., a link to a scientific paper where the method has been described).

Note that most metadata fields must be provided with values as they are either mandatory or mandatory if applicable. There are two fields with a recommended status, namely additionalInformation, where the studied object can be described in plain text, and URLMethodDescription, where users should provide a link to a scientific paper where their method is described.
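A corresponding Method description could be sketched as follows. The field names typeOfMethod and URLMethodDescription appear in the text; the remaining element names and all values are illustrative assumptions rather than the schema's actual serialization.

    <Method id="m-1">
      <description>Genome-wide expression profiling of the sample</description>
      <samplePreparation>RNA extraction</samplePreparation>
      <typeOfMethod>sequencing method</typeOfMethod>
      <!-- follow-up field shown because a sequencing method was chosen -->
      <sequencingMethod>Sequencing by synthesis</sequencingMethod>
      <URLMethodDescription><!-- link to a paper describing the method --></URLMethodDescription>
    </Method>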

The schema has been constructed using the graphical user interface of the CLARIN component registry at https://concepts.clarin.eu/ccr/browser/. Once a schema is defined, it can be exported in XML or XSD format.21 In addition to the metadata fields and their value space, we have also enriched the schema with annotations that are used by the annotation tool to help users enter their data.

The BC metadata annotation tool

We have developed a purpose-built annotation tool that helps researchers provide both research metadata using the BC minimal schema and administrative metadata using DataCite22 (El-Gebali & Stathis, 2024). The tool is bootstrapped by an XSLT stylesheet that converts the XSD serialization of both schemas into HTML-based user interfaces that gather user input. To streamline the provision of metadata, our BC schema definition encodes cues that instruct the annotation tool to show or hide metadata fields depending on prior user input. Figure 3 gives the XML representation of the element typeStudiedObject: the cardinality information defines the element as mandatory, and the documentation specifies the string to be shown to the user and the range of values for this element (the choice between organism and non-organism). The cue attribute then specifies that two metadata fields, namely organism and cellType, should be shown to the user only in case the object of study is an organism. The second element, if shown, asks users to enter a value for an organism. Note the placeholder attribute, which cues users with an example value. Also note our use of the Concept Link attribute that semantically grounds our use of the term organism in a BioPortal ontology.
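Since Figure 3 is rendered as an image, the following simplified sketch indicates the kind of information encoded for the two fields; it is not the actual schema serialization, and the attribute syntax shown here (cardinality, values, cue, placeholder, ConceptLink) is only an illustrative approximation of the mechanism described above.

    <!-- simplified, illustrative approximation of the two element definitions -->
    <Element name="typeStudiedObject" cardinality="1-1"
             documentation="Is the studied object an organism or a non-organism?"
             values="organism | non-organism"
             cue="organism: show organism, cellType"/>

    <Element name="organism" cardinality="0-1"
             documentation="Name of the organism under study"
             placeholder="e.g., Arabidopsis thaliana"
             ConceptLink="https://bioportal.bioontology.org/ontologies/BERO/"/>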

Figure 3. XML representation of two metadata fields.

Figure 4. Screenshot of the annotation tool depicting a part of the component “StudiedObject.”

The tool also comes with a convenient configuration mechanism that allows tool providers to adapt individual metadata fields (hence overriding information stemming from the XSD-based schema), to configure controlled vocabularies, and to attach them as value spaces for specific metadata fields. XSD-based schemas can also be added, removed, and exchanged very easily. The GUI-based configuration capability allows non-technical experts to adapt the tool to their needs without manipulating the underlying schemas that are used to create the default user view.

Figure 4 depicts a screenshot of the annotation tool for the metadata tab “Research Metadata.” A part of our component StudiedObject is displayed; here, a user has identified the type of the object under study as an organism, and hence the additional fields to name the organism and its relevant cell type are shown. Figure 5 shows a part of the annotation tool where users are encouraged to provide a value for the field nameMaterial. The example shows how users get access to a pull-down menu showing possible completions of their partial input. For this field, the external ontology https://bioportal.bioontology.org/ontologies/MESH is made available to users. In this ontology, some terms come with a description, which the annotation tool indicates with a small, clickable icon. A click on the icon displays a pop-up window with this information. The annotation tool can validate all user input against the two schemas provided (DataCite and our BC minimal schema). Once input is complete and validated, the tool can export all metadata to XML. The tool is also connected to our institutional repository system so that users can semi-automatically transfer all metadata and associated data to this repository. No re-entry of data is required.

Figure 5. Screenshot of a part of the annotation tool demonstrating auto-complete functionality.

The annotation tool’s capability of giving users easy access to controlled vocabularies is a crucial prerequisite to satisfy the interoperability aspect of FAIR research data management.

Discussion

Validation

The BC metadata schema went through numerous iterations with our bioinformatics project partners; their valuable feedback led to schema modifications such as the addition or deletion of metadata fields, changes to their naming and documentation, or the proposed use of controlled vocabularies to define their value space. The cooperation between the authors and our bioinformatics experts led to quite a few compromises when it comes to the use of controlled vocabularies. While some documentation strings make explicit reference to bioinformatics ontologies or term hierarchies, no agreement could be reached to attach exactly one controlled vocabulary to each metadata field. Instead, our researchers argued that no single ontology should be preferred over others. As a consequence, we augmented the annotation tool with configuration capabilities to link one or multiple ontologies with each metadata field. As a result, the technical value space for those fields is string, while the tool encourages users to select such strings from the controlled vocabularies linked to the field.

To further validate the metadata schema, research data from several scientific studies were annotated with it. These comprised research data from different scientific fields within bioinformatics. To give an example, Mönke et al. (2012) studied the ABI3 regulon in the model organism Arabidopsis thaliana. For this study, samples (StudiedObject) were generated and each sample was measured with three different methods (Method), see Figure 6. From each experimental Run (the linkage of StudiedObject and Method) datasets were derived, which in turn were aggregated into a further data table. In the vast majority of the scientific studies used for validation, the metadata schema we developed proved to be suitable for describing research data in terms of the objects under study and the scientific methods applied.

Figure 6. Example of a study used to validate the developed metadata schema. For the identification of the ABI3 regulon, samples from the model organism Arabidopsis thaliana were generated. Each sample was measured with three different methods. From each experimental run (combination of StudiedObject and Method) three datasets (Dataset 1–3) were derived, which themselves were aggregated to another table (Table S2).

The annotation tool equally profited from feedback by project partners. Bioinformaticians used beta versions of the tool and tested it for usability. Overall, the feedback was overwhelmingly positive: the Web-based nature of the tool (no software installation required) and its browser-based interface (working flawlessly on different browsers and operating systems) were perceived as beneficial to users. The tool’s capability to hide or show metadata fields depending on the context also proved a useful feature. If the object of study, for instance, was not an organism, the metadata fields to name the organism or the cell type involved were hidden; once the method type was specified, say, a sequencing method was chosen, an extra field asking to name the sequencing method using a controlled vocabulary was shown, and all fields gathering information on other types of method were hidden. We believe that this feature reduces the users’ cognitive load and increases their willingness to supply a maximum of metadata.

Reusability

Both the BC schema and the annotation tool can be easily adapted for use in bioinformatics, but also in other scientific disciplines.

Metadata schema

The BC schema has been implemented using the CMDI metadata framework. All components are stored in the CLARIN concept registry and are available for reuse. Components can also be easily forked and adapted as necessary. For write access, a user registration is required.

The metadata schema has been designed around the very general notion that any scientific discipline can be characterized by the objects it studies and by the methods it uses or develops to answer questions about the objects of study or their relationships. The components StudiedObject and Method are bound together by a Run component because any scientific study may subject multiple objects of study to multiple methods. For bioinformatics, it is straightforward to extend the method types (e.g., analytical method or sequencing method) with other method types, or to add a specific method instance to a given type, and as time progresses such adaptations will very likely be required. It should be emphasized, however, that the schema’s basic structure could also be used to describe research in, say, psychology or literature (and it would be a worthwhile exercise for members of these disciplines to give such a formalization a try).

Annotation tool

The annotation tool makes use of XSL transformations, allowing it to convert any XSD-based metadata schema into HTML-based code that is displayable in any standard browser. After transformation, the tool’s design allows users to further fine-tune its look and feel. Its render options allow the easy adaptation of metadata fields, headings, and descriptions; also, metadata fields can easily be linked to external ontologies so that users benefit from autocomplete functionality when providing values for metadata fields. The tool, hence, supports the interoperability aspect of FAIR, and we think that such tool support is a prerequisite for constructing rich metadata that is both valid with regard to a given schema and interoperable, as the underlying machine-readable ontologies support the machine processability of all data. As a result, it is straightforward to bootstrap a metadata annotation tool for any discipline and the schemas used therein. The standard blueprint system can be adapted in many ways, and also linked together with existing infrastructures, in particular, repository systems. Clearly, any metadata obtained from the tool must be converted to the format understood by the repository back-end, and if formats other than DataCite are required, extra conversion work is needed here.
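As a rough illustration of this bootstrapping idea (not the project's actual stylesheet), an XSLT along the following lines turns each named element declaration of an XSD schema into a labeled HTML input field; a real stylesheet would additionally handle cardinalities, value ranges, cues, and ontology links.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xsl:output method="html"/>

      <!-- Wrap all generated fields in a simple HTML form. -->
      <xsl:template match="/xs:schema">
        <form class="metadata-form">
          <xsl:apply-templates select="//xs:element[@name]"/>
        </form>
      </xsl:template>

      <!-- One labeled text input per named element declaration;
           the element's documentation, if present, becomes the field label. -->
      <xsl:template match="xs:element[@name]">
        <p>
          <label>
            <xsl:value-of select="xs:annotation/xs:documentation"/>
            <input type="text" name="{@name}"/>
          </label>
        </p>
      </xsl:template>
    </xsl:stylesheet>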

Searching across research data repositories

The complexity of the bioinformatics field with its many subdisciplines, and the highly distributed nature of its data, is at the core of the research data management problem. To find data of interest, researchers need to consult the established data portals and query each database one by one, often without proficient knowledge of a database’s content or query language. This makes data discovery hard and time-consuming, especially for newcomers to the field.

The linguistics community has been in a comparable situation. Research data is held locally at the research organizations, each of which has often used its own subdiscipline-specific annotation scheme for describing its data. The linguistics community addressed the issue with the creation of the CLARIN Virtual Language Observatory23 (VLO), see Van Uytvanck et al. (2012). At regular intervals, the VLO harvests CMDI-based metadata from 40+ different repository sites24 and maps their metadata fields to the search facets of the VLO, which serve as a minimal, common schema for the entire linguistics community. VLO users can combine faceted and full-text search to get easy access to a highly distributed set of over 1 million language-related resources. All institution-specific metadata is visible inside the VLO, often providing direct links to the research data it describes.

Ideally, the bioinformatics community would get together to create a Virtual Bioinformatics Observatory where our minimal schema serves as a blueprint for the facets it provides to support the exploration of data sets that continue to be stored in a distributed manner. The construction of such an observatory is a daunting but worthy enterprise. At its very center is the mapping of subdiscipline-specific metadata such as MIAPE, ISA, or the metadata used to construct BioSamples to the metadata fields of our minimal schema; a hypothetical fragment of such a crosswalk is sketched below. The creation of such crosswalks, however, also has societal benefits, as the use of a common, core language for the description of bioinformatics data brings together communities, creates common ground, and fosters mutual understanding and exchange.
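To indicate what such a crosswalk could look like, the following hypothetical mapping fragment relates a few fields of existing standards to BC fields. Both the mapping format and the source field names are purely illustrative; actual crosswalks would have to be negotiated with the respective communities.

    <crosswalk target="BC">
      <!-- all names below are illustrative placeholders -->
      <map source="MIAME"      sourceField="organism"                    bcField="StudiedObject/organism"/>
      <map source="ISA"        sourceField="study assay technology type" bcField="Method/typeOfMethod"/>
      <map source="BioSamples" sourceField="cell type attribute"         bcField="StudiedObject/cellType"/>
    </crosswalk>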

Initiatives across scientific disciplines

The OpenAIRE consortium (https://www.openaire.eu/) is an EU-funded nonprofit organization that aims at improving “discoverability, accessibility, shareability, reusability, reproducibility, and monitoring of data-driven research results, globally.” As part of its Open Science agenda, the consortium publishes a metadata standard (called “Application Profile”) that covers 32 metadata fields to describe research data. Its guidelines for literature, institutional, and thematic repositories have 12 metadata fields mapped to DataCite, six fields mapped to Dublin Core, one field mapped to DC Terms, and 13 fields mapped to none of the aforementioned metadata standards. Of these 13 fields, one is the mandatory field “Resource Type,” which takes its value from the COAR Resource Type Vocabulary,25 which has only one value to describe bioinformatics data, namely, “genomic data.”26 All other fields are domain agnostic and only add detail to administrative metadata.

The GO FAIR initiative (https://www.go-fair.org) aims at advocating the FAIR principles to make data findable, accessible, interoperable, and reusable. Its webpage describes the four pillars of FAIR data in detail, and its main focus is on raising awareness and on training activities. The GO FAIR initiative also has so-called Implementation Networks to build the “Internet of FAIR Data and Services.” Implementation networks include “BiodiFAIRse” (biodiversity), “Marine Data Centers,” and “GAIA Data” (Global Integrated Earth Data Implementation Network). These initiatives seek to develop their own sets of metadata technology to make their data FAIR, such as agreeing on or harmonizing metadata standards, adopting and maintaining commonly agreed vocabularies and ontologies, and aggregating research data (and their metadata) from distributed services to make them centrally available for discovery.

Our work can be seen as a small GO FAIR initiative that focuses on the particular needs of bioinformaticians.

Conclusion and future work

We presented a metadata schema that describes research data at a high level of granularity and spans the entire field of bioinformatics. The schema is accompanied by a tool that allows researchers to annotate research data with this schema in a concise manner and with little effort. We believe that our work increases the searchableness of research data in bioinformatics by increasing both the findability and interoperability aspects of FAIR data.

Our work has been inspired by small metadata standards such as Dublin Core, whose small set of descriptors suffices to organize large and diverse sets of (mostly physical) resources. Confronted with a diverse set of bioinformatics research data stemming from different subdisciplines, we asked ourselves whether we could identify a core set of descriptors to describe bioinformatics data at an equally high level so that users can use this vocabulary to gain easy, principled access to all data via a central repository gateway.

We wanted our metadata schema to consist of core elements only, and ideally, each metadata field should take its values from existing controlled vocabularies. The two central pillars of our metadata scheme are StudiedObject and Method, because any scientific discipline must be able to describe what its objects of study are, and it must also specify the scientific methods and techniques it employs to answer the research questions of the field. In this respect, our metadata schema can serve as a blueprint for other discipline-specific metadata schemas. For the bioinformatics domain, we made one distinction between organisms and non-organisms. When it comes to specific fields such as cellType or nameMaterial that describe the object of study in detail, we make available to users existing vocabularies or ontologies that the bioinformatics community has already built. The use of commonly used controlled vocabularies or ontologies increases the interoperability aspect of FAIR data and hence their searchableness. Unfortunately, we did not find an established controlled vocabulary for classifying methods. Here, we invented our own controlled vocabulary to classify methods into different types (e.g., sequencing method, protein–lipid interaction). This controlled vocabulary is open and will continue to grow as the science of bioinformatics progresses. However, a controlled vocabulary is often insufficient to describe the method(s) used adequately. As a result, we have added to our schema plain-text fields such as methodDescription and URLMethodDescription to briefly describe the method employed, or to link to such descriptions via a URL.

We believe that our metadata schema is indeed minimal, and with our annotation tool, whose generic nature makes it possible to use it with any metadata scheme, users should be empowered to annotate their research data rather quickly. To gain traction, the annotation of new research data must be complemented by mapping existing metadata for research data to our scheme. With an abundance of existing rich metadata in bioinformatics, a coordinated mapping effort would allow the bioinformatics community to drastically increase the searchableness of their research data. Following the example of the Virtual Language Observatory for language resources, a Virtual Observatory for Bioinformatics (VOBI) would give research data in this discipline a truly FAIR game. VOBI would harvest rich metadata from many different metadata providers; it would map parts of the rich metadata to its search facets, which in turn are directly defined in terms of the metadata fields of our BC. Users could then explore all metadata via faceted search or via full-text search on the harvested metadata. With the full metadata available, and access to the underlying data given by the repositories being harvested, all four pillars of FAIR are met: Findability and Interoperability at the BC level of granularity, Accessibility via both the central VOBI portal and the participating repositories, and Reusability via the fine-grained, subdiscipline-specific metadata.

Acknowledgments

The minimal schema for bioinformatics data and its corresponding annotation tool has been developed within the Bioinformatics DATa ENvironment (BioDATEN) project (https://portal.biodaten.info), which is driven by prominent members of the life sciences community in the German federal state of Baden-Württemberg. The project aims at providing a state-wide, FAIR-compliant science data center for bioinformatics data. We thank all BioDATEN project partners for their valuable input in the last four years.

The authors would like to thank the reviewers for their careful reading of our manuscript; their insightful comments and suggestions helped improve the paper considerably.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

The project BioDATEN (Bioinformatics DATa ENvironment) has been funded with 2.5 million euros by the Ministry for Science, Research and Arts, Baden-Württemberg, Germany (July 1, 2019 – June 30, 2023).

References

  • Athar, A., Füllgrabe, A., George, N., Iqbal, H., Huerta, L., Ali, A., Snow, C., Fonseca, N. A., Petryszak, R., Papatheodorou, I., Sarkans, U., & Brazma, A. (2019). Arrayexpress update–From bulk to single-cell expression data. Nucleic Acids Research, 47(D1), D711–D715. https://doi.org/10.1093/nar/gky964
  • Brazma, A., Ball, C., Bumgarner, R., Furlanello, C., Miller, M., Quackenbush, J., Reich, M., Rustici, G., Stoeckert, C., Chervitz, S., & Taylor, R. C. (2012, June). MINSEQE: Minimum information about a high-throughput Nucleotide SeQuencing Experiment - A proposal for standards in functional genomic data reporting. Zenodo. https://doi.org/10.5281/zenodo.5706412
  • Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., … Vingron, M. (2001). Minimum Information About a Microarray Experiment (MIAME) - Toward standards for microarray data. Nature Genetics, 29(4), 365–371. https://doi.org/10.1038/ng1201-365
  • Broeder, D., Kemps-Snijders, M., Van Uytvanck, D., Windhouwer, M., Withers, P., Wittenburg, P., & Zinn, C. (2010, May 17–23). A data category registry- and component-based metadata framework. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010 (pp. 43–47). European Language Resources Association.
  • Cernava, T., Rybakova, D., Buscot, F., Clavel, T., McHardy, A. C., Meyer, F., Overmann, J., Stecher, B., Sessitsch, A., Schloter, M., Berg, G., & The MicrobiomeSupport Team (2022). Metadata harmonization-standards are the key for a better usage of omics data for integrative microbiome analysis. Environmental Microbiome, 17(33), 1–10.
  • Ceusters, W., & Smith, B. (2010). A unified framework for biomedical terminologies and ontologies. In C. Safran, S. R. Reti, & H. F. Marin (Eds.), MEDINFO 2010 - Proceedings of the 13th World Congress on Medical Informatics, Cape Town, South Africa, September 12-15, 2010, volume 160 of Studies in Health Technology and Informatics (pp. 1050–1054). IOS Press.
  • Chervitz, S. A., Deutsch, E. W., Field, D., Parkinson, H., Quackenbush, J., Rocca-Serra, P., Sansone, S.-A., Stoeckert, C. J., Taylor, C. F., Taylor, R., & Ball, C. A. (2011). Data standards for omics data: The basis of data sharing and reuse (pp. 31–69). Humana Press.
  • Courtot, M., Gupta, D., Liyanage, I., Xu, F., & Burdett, T. (2021). BioSamples database: FAIRer samples metadata to accelerate research data management. Nucleic Acids Research, 50(D1), D1500–D1507. https://doi.org/10.1093/nar/gkab1046
  • El-Gebali, S., & Stathis, K. (2024). Updating the DataCite metadata schema: Introducing schema 4.5 and deprecating schema 3. Presentation. https://doi.org/10.5281/zenodo.10813183
  • Guha, R. V., Brickley, D., & MacBeth, S. (2015). Schema.org: Evolution of structured data on the web: Big data makes common schemas even more necessary. Queue, 13(9), 10–37. https://doi.org/10.1145/2857274.2857276
  • Ison, J., Kalas, M., Jonassen, I., Bolser, D., Uludag, M., McWilliam, H., Malone, J., Lopez, R., Pettifer, S., & Rice, P. (2013). EDAM: An ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics, 29(10), 1325–1332. https://doi.org/10.1093/bioinformatics/btt113
  • ISO/TC 46/SC 4 Technical Committee. ISO 15836-1:2017. (2017). Information and documentation - The Dublin Core metadata element set - part 1: Core elements (Technical report). International Organization for Standardization.
  • Jackson, R., Matentzoglu, N., Overton, J. A., Vita, R., Balhoff, J. P., Buttigieg, P. L., Carbon, S., Courtot, M., Diehl, A. D., Dooley, D. M., Duncan, W. D., Harris, N. L., Haendel, M. A., Lewis, S. E., Natale, D. A., Osumi-Sutherland, D., Ruttenberg, A., Schriml, L. M., Smith, B., … Peters, B. (2021). OBO Foundry in 2021: Operationalizing open data principles to evaluate ontologies. Database, 2021, baab069. https://doi.org/10.1093/database/baab069
  • Johnson, D., Batista, D., Cochrane, K., Davey, R. P., Etuk, A., Gonzalez-Beltran, A., Haug, K., Izzo, M., Larralde, M., Lawson, T. N., Minotto, A., Moreno, P., Nainala, V. C., O’Donovan, C., Pireddu, L., Roger, P., Shaw, F., Steinbeck, C., Weber, R. J. M., Sansone, S.-A., & Rocca-Serra, P. (2021). ISA API: An open platform for interoperable life science experimental metadata. GigaScience, 10(9), 1–13. https://doi.org/10.1093/gigascience/giab060
  • Library of Congress. (2023). MARC 21 format for bibliographic data (including update no. 37). https://www.loc.gov/marc/bibliographic/
  • Mönke, G., Seifert, M., Keilwagen, J., Mohr, M., Grosse, I., Hähnel, U., Junker, A., Weisshaar, B., Conrad, U., Bäumlein, H., & Altschmied, L. (2012). Toward the identification and regulation of the Arabidopsis thaliana ABI3 regulon. Nucleic Acids Research, 40(17), 8240–8254. https://doi.org/10.1093/nar/gks594
  • Natale, D. A., Arighi, C. N., Blake, J. A., Bona, J., Chen, C., Chen, S.-C., Christie, K. R., Cowart, J., D’Eustachio, P., Diehl, A. D., Drabkin, H. J., Duncan, W. D., Huang, H., Ren, J., Ross, K., Ruttenberg, A., Shamovsky, V., Smith, B., Wang, Q., … Wu, C. H. (2017). Protein ontology (pro): Enhancing and scaling up the representation of protein entities. Nucleic Acids Research, 45(D1), D339–D346. https://doi.org/10.1093/nar/gkw1075
  • Smith, B., Kusnierczyk, W., Schober, D., & Ceusters, W. (2006). Towards a reference terminology for ontology research and development in the biomedical domain. In B. Olivier (Ed.), KR-MED 2006, Formal Biomedical Knowledge Representation, Proceedings of the Second International Workshop on Formal Biomedical Knowledge Representation: "Biomedical Ontology in Action" (KR-MED 2006), Baltimore, Maryland, USA (pp. 57–65). CEUR-WS.org. https://ceur-ws.org/Vol-222/krmed2006-p07.pdf
  • Van Uytvanck, D., Stehouwer, H., & Lampen, L. (2012). Semantic metadata mapping in practice: The virtual language observatory. In N. Calzolari (Ed.), Proceedings of LREC 2012: 8th International Conference on Language Resources and Evaluation (pp. 1029–1034). European Language Resources Association (ELRA).
  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., Gonzalez-Beltran, A., Gray, A. J. G., Groth, P., Goble, C., Grethe, J. S., Heringa, J., ‘t Hoen, P. A. C., Hooft, R., Kuhn, T., Kok, R., Kok, J., Lusher, S. J., Martone, M. E., Mons, A., Packer, A. L., Persson, B., Rocca-Serra, P., Roos, M., van Schaik, R., Sansone, S.-A., Schultes, E., Sengstag, T., Slater, T., Strawn, G., Swertz, M. A., Thompson, M., van der Lei, J., van Mulligen, E., Velterop, J., Waagmeester, A., Wittenburg, P., Wolstencroft, K., Zhao, J., & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18