ABSTRACT
Introduction: The technological and scientific progress achieved within the Human Proteome Project (HPP) has provided the scientific community with a new set of experimental and bioinformatic methods in the challenging field of shotgun and SRM/MRM-based proteomics. The requirements for a protein to be considered experimentally validated are now well established, and the information about the human proteome is available in the neXtProt database, while targeted proteomic assays are stored in SRMAtlas. However, the study of the missing proteins remains an outstanding issue.
Areas covered: This review focuses on the implementation of proteogenomic methods designed to improve the detection and validation of the missing proteins. It describes the evolution of methodological strategies that combine different omic technologies with large publicly available datasets, taking the Chromosome 16 Consortium as a reference.
Expert commentary: Proteogenomics and other data analysis strategies implemented within the C-HPP initiative could serve as guidance for completing the catalog of human proteins in the near future. In addition, in the coming years we will probably witness their use in the B/D-HPP initiative to take a step forward in understanding the implications of proteins in human biology and disease.
Article highlights
Proteogenomics is a promising area of research in several technological and scientific fields, especially biology, biomedicine and, in the last few years, clinical biomarker discovery.
Proteogenomic methods rest on two steps: the creation of customized protein sequence databases for the proteomic searches, and the subsequent statistical analysis to estimate the false discovery rate (FDR) of the results, taking into account the size effect that these databases introduce.
One of the key objectives of the HPP is the experimental detection, in a biological matrix and using stringent statistical thresholds, of the proteins annotated in the neXtProt database with protein existence (PE) levels PE2, PE3 and PE4 (missing proteins, MPs).
The availability in public repositories such as GEO and PRIDE of large numbers of high-throughput experiments on the human transcriptome (microarrays, RNA-Seq) and proteome (shotgun proteomics) has enabled the development of new bioinformatic workflows that search for MPs by reanalyzing these datasets in accordance with the HPP guidelines.
The integration of transcriptomic and proteomic experiments (proteogenomics) has been used to characterize the missing proteins. The results describe the functions and pathways in which they are involved and their tissue specificity, and serve as guidance for the design of validation experiments in certain biological matrices (for example, brain and testis tissues and embryonic cell lines).
The study of peptide detectability using a machine learning approach reveals the limitations of mass spectrometry (MS) in detecting a subset of peptides, especially MP peptides.
Unfortunately, even the predictions made with the most sophisticated algorithms are difficult to validate. New experimental approaches, such as protein enrichment or depletion strategies, and new biological matrices must be incorporated into the project in order to complete the human proteome catalog. The bioinformatic methods provided by the HPP scientific community to study the MPs can also be applied in the B/D-HPP initiative to investigate the roles of human proteins in cellular processes and human diseases.
This box summarizes key points contained in the article.
Declaration of interest
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
Reviewer declarations
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.
Supplementary material
Supplemental data for this article are available online.