Editorial

What Are The Obstacles for an Integrated System for Comprehensive Interpretation of Cross-Platform Metabolic Profile Data?

Pages 1511-1514 | Published online: 07 Dec 2009

The metabolomics researcher’s dream

The metabolomics researcher’s dream is a software tool that automatically confirms or contradicts a given hypothesis and, additionally, reports a substantial gain in knowledge in grammatically correct sentences, together with a complete statistical analysis, metabolic pathway graph mappings and a literature analysis. Such a fictional integrated workbench, or Swiss army knife of data interpretation, does not currently exist. However, the building blocks for such a workbench are not far from reality.

Data interpretation starts with the idea of the experiment & requires ontologies & semantic annotation of raw data

Over the past 5 years, the Metabolomics Facility at the UC Davis Genome Center has covered metabolic profiling studies on more than 40 different species, spanning 260 different experimental setups and over 18,000 samples measured. This diversity in study designs cannot be embraced without detailed experimental descriptions that are stored, queried and disseminated via electronic resources. We learned from both poorly and well-described experiments that data without metadata is meaningless. In order to analyze data, it is of utmost importance to include the greatest possible detail about the provenance of the data, including information that may not be deemed directly important by the investigators of the study, such as the manufacturer and distributor of the ‘standard chow’ that rodents are fed when testing the effects of allelic variation in metabolic enzymes. Indeed, we found in one study that the nutritional content of rodent chows varies over the years, which can confound the interpretation of metabolic results if such information is not provided. Whether any of these parameters had an impact on the final metabolic conclusions can be tested by supervised learning and compared in magnitude to the effects studied in the experimental design. Moreover, contaminants from laboratory supplies and reagents might interfere with data analysis and can be removed from statistical evaluations if complete metadata descriptions are available, including sample preparation, storage, methods for quenching metabolism and data acquisition.

Biological metadata contains taxonomic data about superkingdoms (archaea, bacteria, eukaryota and viruses) as well as fine-grained descriptors such as subspecies or single-nucleotide polymorphism genotypes. Apart from actual treatments that influence metabolism (e.g., a drug-exposure time series or biotic perturbations such as pathogen infections), further parameters include the physical object and its history, such as sample origin, growth, disease history of the organism and the correct description of organs, tissue types, cell types and subcellular compartments. As metabolomics tries to capture spatial and temporal snapshots of metabolism, the simplest experiment description is therefore ‘biosource × treatment’ (equivalent to ‘genotype × environment’ in agronomic studies). The biosource description would include all the taxonomy and compartment data, whereas the treatment parameters would detail the type of treatment(s), dose, time dependencies and treatment combinations.
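The ‘biosource × treatment’ scheme described above can be sketched as a minimal data structure. All field names and example values below are purely illustrative and are not drawn from any published metadata schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a 'biosource x treatment' study description.
@dataclass
class Biosource:
    superkingdom: str              # e.g., "eukaryota"
    species: str                   # NCBI taxonomy name
    genotype: str = "wild type"
    organ: str = ""                # Plant Ontology / anatomy term
    compartment: str = ""          # subcellular compartment, if any

@dataclass
class Treatment:
    kind: str                      # e.g., "drug exposure", "pathogen infection"
    dose: str = ""
    time_point: str = ""

@dataclass
class StudyDesign:
    biosource: Biosource
    treatments: list = field(default_factory=list)

design = StudyDesign(
    biosource=Biosource(superkingdom="eukaryota",
                        species="Arabidopsis thaliana",
                        organ="leaf"),
    treatments=[Treatment(kind="pathogen infection", time_point="24 h")],
)
```

In practice, each free-text field would instead hold a term from a controlled vocabulary, as discussed in the next paragraph.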

The descriptive language of metabolomics experiments must use ontologies (i.e., controlled vocabularies together with the structural hierarchies and dependencies of their terms). For genomics, the Gene Ontology has been established and is used in many different experimental settings. Other areas of research have developed case-specific ontologies, such as the Plant Ontology Consortium’s vocabulary for plant organs. For discerning species, the curated NCBI taxonomy can be used; unfortunately, ‘treatment’ parameters are far less well covered by ontologies because of the diverse nature of potential studies. Most of the governing principles were outlined in the Metabolomics Society journal in the September 2007 issue, which covered the developments of the metabolomics standards initiative and proposed minimum reporting standards for chemical analysis in different biological studies.

Metabolomics requires public data repositories similar to GenBank

The necessity of experimental data exchange was correctly perceived and acted upon by molecular biology researchers in 1982; since then, nucleic acid sequence data from all organisms has been collected in GenBank. Researchers are required to submit their experimental data before they can publish a scientific report in a journal. Similar approaches are used for transcriptomics data with the Gene Expression Omnibus database, which covers annotated transcript data from more than 100 organisms and more than 13,000 experiments. The advantages are clear: by using database queries, researchers worldwide can plan new experiments based on existing knowledge or can make new discoveries by mining different combinations of studies deposited in databases. This ‘public data’ approach opens up new frontiers in theoretical bioinformatics. By contrasting results from different study designs, researchers can classify specific and generic biological responses without performing experimental work, which may lead to novel hypotheses that can be validated in subsequent, more advanced studies. Unfortunately, no such data-submission requirements are in place in analytical chemistry and metabolomics. Instead, researchers publish only parts of their data (e.g., in bar graphs or tables). Even this limited information is not directly accessible, as the original (complex) data is soon forgotten on old hard drives and DVDs, even in the investigators’ own laboratories. Collecting such data from the existing literature via text mining and optical character or structure recognition is prone to errors and data losses and does not keep up with current technology. The UC Davis Genome Center has mechanisms in place to store all annotated studies together with the corresponding raw data and processed results, albeit not across all data-acquisition platforms that are currently used.

Cross-platform metabolic profile data requires open data-exchange formats

Metabolites vary in size, chemistry and physicochemical properties. Therefore, no single platform can assess all metabolites in one analytical run. Consequently, several platforms are used, necessitating cross-platform data integration spanning different technologies, such as MS or NMR, as well as different types of instruments, models and vendors. Instead of using proprietary data formats, researchers and MS vendors have developed open data-exchange formats such as mzXML and mzData, which were ultimately merged into the mzML format. The development of these formats was driven by the much larger proteomics community, whereas the traditional analytical chemistry community (including the environmental, food and natural product branches) still pursues classic and less adequate ways of publishing data. Ultimately, the success of open data-exchange formats depends not only on goodwill but requires a ‘push-and-pull’ strategy: ‘pulling’ from practical software solutions, the participation of multiple user groups and exemplary studies, and ‘pushing’ from data-exchange requirements mandated by funding agencies and journals. Open data exchange will allow data to be imported into and exported from metabolomics data-analysis platforms without the problems associated with undisclosed and undocumented formats. Furthermore, data exchange between different research groups would be streamlined without the need to install multiple software packages, hence increasing the re-use of data sets.
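One practical benefit of XML-based open formats such as mzML is that any standard XML parser can read them. The toy document below is heavily simplified for illustration; real mzML files use XML namespaces, controlled-vocabulary (cvParam) annotations and base64-encoded peak arrays:

```python
import xml.etree.ElementTree as ET

# Toy, heavily simplified mzML-like document (not schema-valid mzML).
doc = """
<mzML>
  <run id="run1">
    <spectrumList count="2">
      <spectrum index="0" id="scan=1" defaultArrayLength="512"/>
      <spectrum index="1" id="scan=2" defaultArrayLength="480"/>
    </spectrumList>
  </run>
</mzML>
"""

root = ET.fromstring(doc)
# Enumerate spectra without any vendor-specific library.
spectra = root.findall(".//spectrum")
ids = [s.get("id") for s in spectra]
```

The point is not the parsing code itself but that, with a disclosed format, such access requires no undocumented vendor software.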

Automation in metabolomics data handling requires software pipelines & workflows

While many bioanalytical studies consist of only a few sample sets, metabolic profiling studies may cover hundreds to thousands of samples. Once a machine acquisition method is set up and the instrument data are acquired, metabolites need to be annotated or identified automatically. Classic target-oriented analysis usually surveys fewer than 50 metabolites, whereas modern multi-target approaches may screen more than 500 compounds, requiring different annotation approaches. Such methods often comprise the detection and reporting of unknown compounds, because such unknown metabolites may emerge as novel biomarkers that could be identified by authentic reference standards at a later time point. While ‘unknowns’ are hard to publish, analytical parameters such as accurate mass, mass fragmentations and retention indices may be collectively stored and could unambiguously assign such novel entities. The BinBase metabolomics database at the UC Davis Genome Center currently accumulates over 10 million spectra that have been annotated as fewer than 6000 metabolites, including unknowns, which can be queried via public web interfaces. While the current programs are hard-coded as single monolithic applications, more flexible approaches might use smaller library components and exploit scientific workflows. Workflows would also allow adjustment to different data types and data sources, could integrate commercial as well as open-source components and could lead from study designs to mzML data, processed datasets, statistical treatments and biological interpretations.
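The workflow idea sketched above, with small composable components instead of one monolithic program, can be illustrated with a minimal linear pipeline. All step logic, masses and tolerances below are invented for illustration:

```python
# Minimal sketch of a linear processing workflow: each step is a plain
# function from data to data, and the pipeline simply chains them.

def baseline_correct(peaks):
    """Subtract the lowest observed intensity from every peak."""
    floor = min(intensity for _, intensity in peaks)
    return [(mz, intensity - floor) for mz, intensity in peaks]

def make_annotator(library, tol=0.01):
    """Return a step that replaces m/z values with library names."""
    def annotate(peaks):
        hits = []
        for mz, intensity in peaks:
            for name, lib_mz in library.items():
                if abs(mz - lib_mz) < tol:   # naive nearest-m/z lookup
                    hits.append((name, intensity))
        return hits
    return annotate

def run_pipeline(data, steps):
    for step in steps:
        data = step(data)
    return data

library = {"glucose [M-H]-": 179.056, "citrate [M-H]-": 191.020}
peaks = [(179.057, 1200.0), (191.021, 800.0), (300.123, 50.0)]
result = run_pipeline(peaks, [baseline_correct, make_annotator(library)])
```

Because each step has the same signature, commercial, open-source or in-house components could be swapped in and out of such a chain without rewriting the surrounding program.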

Statistical interpretation requires deep inspection of raw data

Metabolite annotations require at least two independent parameters, such as retention time plus mass spectral matching or, in the case of NMR, 2D coupling experiments. While automatic routines may correctly assign metabolic signals in most instances, retrospective inspection of raw data is often advantageous, for example to examine the contribution of noise or co-eluting compounds in order to retrieve possible false-negative or false-positive candidates. Therefore, statistical software tools are highly beneficial if they allow an inspection of the associated raw data directly from the statistics software.
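The two-independent-parameter rule can be sketched as follows: a candidate annotation is accepted only when both the retention time and the spectral similarity agree with the reference. The thresholds, spectra and compound below are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity between two intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def annotate_peak(rt, spectrum, reference, rt_tol=0.1, min_score=0.8):
    """Accept only if BOTH independent parameters match the reference."""
    rt_ok = abs(rt - reference["rt"]) <= rt_tol
    spectrum_ok = cosine_similarity(spectrum, reference["spectrum"]) >= min_score
    return rt_ok and spectrum_ok

# Hypothetical reference entry (values are illustrative only).
reference = {"name": "alanine", "rt": 6.42, "spectrum": [100, 40, 15, 5]}

accepted = annotate_peak(6.45, [98, 42, 14, 6], reference)   # both criteria met
rejected = annotate_peak(9.80, [98, 42, 14, 6], reference)   # RT mismatch
```

A spectral match alone (the `rejected` case) is not enough, which mirrors why retrospective raw-data inspection of borderline candidates remains valuable.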

Pathway databases still lack known & unknown metabolites

It may sound like a paradox, but how can software map unknown compounds to a pathway map? The reality is that the majority of detected metabolic signals do not have identified chemical structures. How are such metabolites incorporated into final experimental results? One solution is to use spectral similarities or substructure similarities to group metabolites together and then map them to an existing pathway. Unknown metabolites recognized as fatty acids, for example, would be mapped together with known fatty acids to an existing pathway. Such an approach only works well if the structural complexity is low, that is, if the compound does not carry too many different substructure features. A more complicated issue is that many compounds are missing from pathway databases. For instance, plant galactolipids such as monogalactosyldiacylglycerols and digalactosyldiacylglycerols, comprising a total of 200 individual lipid species, are well known and have been experimentally determined for more than 40 years. However, these lipid species are still missing from major pathway databases. This problem can only be overcome if new data-sharing policies are implemented, with the direct submission of experimental results, including chemical structures and molecular spectra, to journals, databases or public repositories. We extensively discussed this problem in our 2009 Public Library of Science publication: ‘How large is the metabolome? A critical analysis of data exchange practices in chemistry’ [1]. Better semantic annotation of texts and new text-mining concepts may also help to ease this problem.
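Grouping an unknown with a known compound class by spectral similarity, as described above, amounts to a nearest-class assignment. The class spectra and threshold below are invented placeholders, not real reference data:

```python
import math

def cosine(a, b):
    """Cosine similarity between two spectral intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical representative spectra for two compound classes.
class_spectra = {
    "fatty acid": [90, 60, 30, 5],
    "sugar": [10, 20, 80, 100],
}

def assign_class(unknown, classes, min_score=0.7):
    """Map an unknown to its most similar class, or leave it unassigned."""
    best, best_score = None, 0.0
    for name, spectrum in classes.items():
        score = cosine(unknown, spectrum)
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= min_score else "unassigned"

label = assign_class([85, 65, 28, 4], class_spectra)
```

As the text notes, such grouping degrades for structurally complex compounds, where a single spectrum mixes many substructure features and no one class dominates the similarity score.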

Towards systems biology: software for transcript, proteome & metabolome data handling

The interpretation of metabolomics studies from a more comprehensive view will always require the inclusion of secondary lines of evidence, such as protein and RNA-expression levels. Software that can handle such complex experimental setups across multiple platforms and technologies is still under development and not yet mature. Scientists from different backgrounds need to agree to use the same vocabulary and to develop common interfaces in order to combine and understand the results from these technologies. Such processes include a more stringent use of minimum standards and a mandated enforcement of those rules and standards. Julian Griffin outlined it correctly in his 2006 publication ‘The Cinderella story of metabolic profiling: does metabolomics get to go to the functional genomics ball?’ when he suggested that metabolomics can only attend the functional genomics ball if it dresses in fine clothes similar to those worn by other omics approaches [2].

Financial & competing interests disclosure

The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending or royalties.

No writing assistance was utilized in the production of this manuscript.

Bibliography

  • Kind T, Scholz M, Fiehn O. How large is the metabolome? A critical analysis of data exchange practices in chemistry. PLoS ONE 4(5), e5440 (2009).
  • Griffin JL. The Cinderella story of metabolic profiling: does metabolomics get to go to the functional genomics ball? Phil. Trans. R. Soc. B 361(1465), 147–161 (2006).
