Editorial

The Proteomics Identifications Database (PRIDE) and the ProteomExchange Consortium: making proteomics data accessible

Pages 1-3 | Published online: 09 Jan 2014

The field of proteomics has undergone an explosive development in recent years. Proteomics technologies are now part of the standard armory of modern life sciences research. Organizationally, proteomics is undergoing the same evolution as its neighbor fields genomics, structural proteomics and transcriptomics, by radiating outwards from small-scale, single-laboratory approaches towards highly automated large-scale experiments Citation[1] and large-scale collaborative efforts, such as the Human Proteome Organization (HUPO) Proteome Projects Citation[2,3].

However, proteomics still lags significantly behind in the key area of systematic data capture. Proteomics data are being produced on a large scale, and being lost on a large scale. Data supporting published research are highly fragmented: often they are not available at all, or are scattered across local databases and authors’ and journals’ websites in a variety of formats, including tables in PDF format, which are more or less computationally inaccessible. It is currently all but impossible to answer questions such as ‘In which published proteomics experiments has my protein of interest been observed?’ or ‘Who has observed a set of proteins similar to the set I have observed?’. Partly owing to the diversity of proteomics data, and partly owing to the relative youth of the field and its fast-paced development, no generally accepted proteomics data repositories exist. Data representation standards are still under development, with the exception of molecular interactions, where the HUPO Proteomics Standards Initiative (PSI) Molecular Interactions (MI) Extensible Markup Language (XML) standard is now widely accepted Citation[4].

But times are changing. Guidelines on what to report as part of a proteomics publication are being developed Citation[5], as are formats for representing the data in a standardized manner Citation[6,7]. The HUPO PSI has released mzData Citation[8], a standard format for the representation of mass spectra Citation[101]. mzData will be complemented by the analysisXML format for the representation of search engine results. mzData is already supported by instrument vendors, search engine providers and databases, among them the European Bioinformatics Institute’s Proteomics Identifications Database (PRIDE) Citation[102,9].

PRIDE provides a key public repository for proteomics data supporting research publications. It was originally developed as a data repository for the HUPO Plasma Proteome Project, and is now open for general submission of proteomics data supporting peer-reviewed publications Citation[10].

PRIDE aims to fully represent all relevant aspects of a proteomics experiment and to make them easily accessible to the user. PRIDE implements the PSI mzData format, and extends it by additional data items, which will later become part of analysisXML. Elements of a PRIDE experiment comprise title and description, references, detailed sample description, a description of the instrumentation, software, and procedures used in the experiment and, finally, identified peptides and proteins, as well as potential protein modifications.
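The elements listed above can be pictured as a small nested record. The following Python sketch is illustrative only: the class and field names are assumptions made for this example and do not reproduce the PRIDE XML or mzData schema, and the sample values are invented. It also shows how such a record makes the question ‘In which experiments has my protein of interest been observed?’ straightforward to answer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peptide:
    sequence: str
    # Free-text modification labels; a real schema would use controlled terms.
    modifications: List[str] = field(default_factory=list)


@dataclass
class ProteinIdentification:
    accession: str
    peptides: List[Peptide] = field(default_factory=list)


@dataclass
class Experiment:
    # Hypothetical field names mirroring the elements named in the text.
    title: str
    description: str
    references: List[str] = field(default_factory=list)
    sample_description: str = ""
    instrumentation: str = ""
    software: str = ""
    procedures: str = ""
    identifications: List[ProteinIdentification] = field(default_factory=list)


def experiments_with_protein(experiments, accession):
    """Return the titles of experiments in which the accession was identified."""
    return [
        e.title
        for e in experiments
        if any(p.accession == accession for p in e.identifications)
    ]


# Invented example data, for illustration only.
plasma = Experiment(
    title="Plasma sample A",
    description="Illustrative record",
    identifications=[
        ProteinIdentification("P02768", [Peptide("LVNEVTEFAK")]),
        ProteinIdentification("P02787"),
    ],
)
brain = Experiment(
    title="Brain sample B",
    description="Illustrative record",
    identifications=[ProteinIdentification("P02787")],
)

titles = experiments_with_protein([plasma, brain], "P02787")
```

Here `titles` lists both illustrative experiments, because the accession appears in each of them.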

A key feature of PRIDE is its support for private, collaborative data access. In the prepublication stage, an author can submit a file in PRIDE XML to the database. This will be stored in PRIDE, but will be accessible only to the author. PRIDE accession numbers suitable for inclusion in a publication will be returned to the author. In addition, the author can invite other identified users into their PRIDE collaboration (e.g., colleagues in a large, distributed proteomics project). Thus, data can be jointly assessed and discussed in the same form in which they will be published. For each collaboration, PRIDE automatically creates a reviewer account. As part of the peer review process, this account can be used by the reviewers of a publication, thus allowing them to access the data in their final, computationally accessible form. Finally, on publication of the manuscript, or at an author-defined date, the PubMed identifier will be added to the data set, and the data will become publicly available and searchable, indirectly increasing the visibility of the publication.
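The access model described above can be summarized as a small state model: a data set starts private, is visible to the author, invited collaborators and a reviewer account, and becomes visible to everyone on release. The Python sketch below is a simplified illustration under those assumptions; the class, method names, reviewer account naming and sample identifiers are hypothetical and are not taken from the PRIDE implementation.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Submission:
    accession: str
    author: str
    collaborators: Set[str] = field(default_factory=set)
    reviewer_account: str = ""
    pubmed_id: Optional[str] = None
    public: bool = False

    def __post_init__(self):
        # Assumption: one reviewer account is created automatically.
        if not self.reviewer_account:
            self.reviewer_account = f"reviewer-{self.accession}"

    def invite(self, user: str) -> None:
        """Add an identified user to the collaboration."""
        self.collaborators.add(user)

    def release(self, pubmed_id: Optional[str] = None) -> None:
        """Make the data public, attaching a PubMed identifier if known."""
        self.pubmed_id = pubmed_id
        self.public = True

    def can_view(self, user: str) -> bool:
        if self.public:
            return True
        return (
            user == self.author
            or user in self.collaborators
            or user == self.reviewer_account
        )


# Invented identifiers, for illustration only.
sub = Submission(accession="EXP-1", author="alice")
sub.invite("bob")
before = [sub.can_view(u) for u in ("alice", "bob", "reviewer-EXP-1", "carol")]
sub.release(pubmed_id="00000000")
after = sub.can_view("carol")
```

Before release, only the author, the invited collaborator and the reviewer account can see the data set; after release, any user can.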

A planned feature of PRIDE is the comparison of data sets with simple set operations like union and intersection. This will be of huge benefit for general use, but also in the prepublication stage, as it will allow easy comparison of a submitted, but still private, data set to existing public data sets in PRIDE.
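As a rough illustration of such a comparison, the Python sketch below treats each data set as a set of protein accessions and applies the built-in set operators. The function name and the example accessions are assumptions made for this sketch; it does not describe how PRIDE itself implements the feature.

```python
def compare_data_sets(first, second):
    """Compare two collections of protein accessions with basic set operations."""
    a, b = set(first), set(second)
    return {
        "shared": a & b,        # intersection: seen in both data sets
        "combined": a | b,      # union: seen in either data set
        "only_first": a - b,    # unique to the first data set
        "only_second": b - a,   # unique to the second data set
    }


# Invented example: a private submission compared with a public data set.
private_set = ["P02768", "P01024", "P02787"]
public_set = ["P02768", "P02787", "P00738"]

result = compare_data_sets(private_set, public_set)
```

The `shared` entry answers ‘which of my proteins have already been reported?’, while `only_first` highlights identifications that are new relative to the public data set.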

All PRIDE data are fully publicly available, both for interactive access via web pages and for download in (compressed) XML format. In addition, all source code is freely available for both academic and commercial use, in an attempt to foster local PRIDE installations, the spread of systematic proteomics data capture and, ultimately, a higher availability of proteomics data in the public domain.

It is important to emphasize that PRIDE is only one of a number of public proteomics databases, among them Global Proteome Machine Citation[11], Open Proteomics Database Citation[12], Proteome Experimental Data Repository (PEDRo) Citation[13] and PeptideAtlas Citation[14], all with their own strengths and specific contributions to the proteomics community. All of these repositories capture proteomics data and make them publicly available. However, each of them captures a different subset of data and currently provides it in a different form. Thus, accessing all proteomics data available in public databases still requires an enormous effort, and still captures only a subset of all published data, as many data sets are never even submitted to a public database. While it might, at first sight, be desirable to have a single, global repository for proteomics data, it is unlikely that such a concentration process will happen any time soon. In addition, such a single, monolithic repository would be unlikely to react quickly enough to changing user requests in a fast-paced field. Moreover, from an organizational point of view, a single resource might be too vulnerable to changing funding, a move into the private sector or technical challenges. A network of collaborating, independent databases, such as the long-established International Nucleotide Sequence Database Collaboration Citation[103] or the nascent International Molecular Exchange Consortium Citation[104], is likely to provide a more stable long-term perspective.

At a satellite meeting to the HUPO 4th Annual World Congress in Munich, September 2005, major proteomics data providers, software developers and journal representatives met to form the ProteomExchange group, with the aim of establishing a regular data exchange between major proteomics repositories. While this process is still at a very early stage, it is hoped that it will ultimately overcome the current fragmentation of proteomics data and lead to a network of stable, synchronized proteomics resources and, perhaps in the future, enable us to ‘BLAST’ our newly derived proteome against the vast majority of published data sets in a single operation.

References

  • Washburn MP, Wolters D, Yates JR III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnol. 19(3), 242–247 (2001).
  • Hamacher M, Meyer HE. HUPO Brain Proteome Project: aims and needs in proteomics. Expert Rev. Proteomics 2(1), 1–3 (2005).
  • Omenn GS, States DJ, Adamski M et al. Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core data set of 3020 proteins and a publicly-available database. Proteomics 5(13), 3226–3245 (2005).
  • Hermjakob H, Montecchi-Palazzi L, Bader G et al. The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nature Biotechnol. 22(2), 177–183 (2004).
  • Carr S, Aebersold R, Baldwin M et al. The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol. Cell. Proteomics 3(6), 531–533 (2004).
  • Taylor CF, Paton NW, Garwood KL et al. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nature Biotechnol. 21(3), 247–254 (2003).
  • Pedrioli PG, Eng JK, Hubley R et al. A common open representation of mass spectrometry data and its application to proteomics research. Nature Biotechnol. 22(11), 1459–1466 (2004).
  • Orchard S, Hermjakob H, Taylor CF et al. Second proteomics standards initiative spring workshop. Expert Rev. Proteomics 2(3), 287–289 (2005).
  • Jones P, Cote RG, Martens L et al. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 34(Database issue), D659–D663 (2006).
  • Adamski M, Blackwell T, Menon R et al. Data management and preliminary data analysis in the pilot phase of the HUPO Plasma Proteome Project. Proteomics 5(13), 3246–3261 (2005).
  • Craig R, Cortens JP, Beavis RC. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 3(6), 1234–1242 (2004).
  • Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM. The need for a public proteomics repository. Nature Biotechnol. 22(4), 471–472 (2004).
  • Garwood K, McLaughlin T, Garwood C et al. PEDRo: a database for storing, searching and disseminating experimental proteomics data. BMC Genomics 5(1), 68 (2004).
  • Desiere F, Deutsch EW, King NL et al. The PeptideAtlas project. Nucleic Acids Res. 34(Database issue), D655–D658 (2006).

Websites

  • PSI-MS: Mass Spectrometry Standards Working Group http://psidev.sourceforge.net/ms/index.html
  • PRIDE: PRoteomics IDEntifications database www.ebi.ac.uk/pride
  • International Nucleotide Sequence Database Collaboration www.insdc.org
  • The International Molecular Exchange Consortium (IMEx) http://imex.sf.net
