Editorial

Genomics and proteomics: of hares, tortoises and the complexity of tortoises

Pages 469-472 | Published online: 09 Jan 2014

The draft human genome sequence was published in 2001 and took 13 years to complete. Necessity being the mother of invention, the accelerated development of DNA sequencing technologies led to the completion of the Human Genome Project nearly a year ahead of schedule Citation[1]. In 1998, the total sequencing output from the entire Human Genome Project was about 200 million base pairs for the year. By January 2003, the Department of Energy Joint Genome Institute had reached the capacity to sequence 1.5 billion bases per month.

Today, the Beijing Genome Institute in China has the capacity to generate the equivalent of 10,000 human genomes per year. In 2010, the Beijing Genome Institute generated ten times the amount of data the National Center for Biotechnology Information had generated in the previous 20 years. In early 2012, Oxford Nanopore Technologies announced that its second-generation GridIons would be able to sequence a human genome in 15 min Citation[2]. Enter proteomics, the proverbial tortoise in the race with the hare.

The number of genes & the number of proteins

In 1953, Wilkins et al. Citation[3], Franklin and Gosling Citation[4] and Watson and Crick Citation[5] simultaneously published the structure of DNA in the same issue of Nature. (Earlier that year, Pauling and Corey Citation[6] had published an incorrect triple helix model of DNA.) By 1964, Vogel Citation[7] had very accurately calculated the size of the human genome, but inaccurately predicted that it might consist of over 6 million genes. By recent estimates, the human genome has only 22,000–23,000 genes, slightly more than the worm Caenorhabditis elegans, which has only 959 cells but 20,000 genes Citation[8], and fewer than a grape plant, which has 30,434 genes Citation[9].

There is a surprising lack of correspondence between genome size and complexity of the organism Citation[10,11]. If the DNA from a single human cell were uncoiled and laid end to end, it would measure over two meters in length. By comparison, the genome of the flowering plant Paris japonica is 50 times larger than the human genome Citation[12], and outstretched, the DNA from a single cell would span the length of a soccer field.

While transcriptomics attempts to monitor gene expression by quantifying mRNA transcripts, the poor correlation between mRNA levels and protein expression is well known Citation[13]. Hence, neither genomics nor transcriptomics can accurately predict the constituency of the proteomic complement, since an undeterminable number of possible post-translational modifications (PTMs) can potentially produce multiple isoforms of each protein. For instance, there are at least 3778 distinct genes encoding plasma proteins, of which at least 51% encode more than one protein isoform Citation[14]. Therefore, the total number of proteins in the human proteome is expected to reach into the millions. Consider the immunoglobulins, for which there are 70 genes encoding 320 possible light chain combinations and 10,530 possible heavy chain combinations, which collectively could contribute 3,369,600 possible antibody types Citation[15].
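The antibody figure quoted above is simply the product of the light and heavy chain counts. A minimal Python sketch of that arithmetic follows; it assumes that any light chain can pair with any heavy chain, and the variable names are illustrative only.

```python
# Combination counts quoted in the text (Citation[15]).
light_chain_combinations = 320
heavy_chain_combinations = 10_530

# Assumption: every light chain can pair with every heavy chain,
# so the two counts multiply.
antibody_types = light_chain_combinations * heavy_chain_combinations

print(f"{antibody_types:,} possible antibody types")
```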

The diversity of proteins

To emphasize the diversity of proteins, consider a single-stranded DNA hexamer, for which there are 4^6 = 4096 possible combinations of A, G, T and C. By comparison, there are 20 proteinogenic amino acids, so the number of possible sequences for a hexapeptide is 20^6, or 64,000,000. Now consider only one of the most common PTMs, the glycosylation of Asn, Arg, Ser, Tyr and Thr residues, and the number of possible sequences for the same theoretical hexapeptide increases to 25^6, or 244,140,625. To further add to this complexity, a calculation of all possible oligosaccharide isomers, both branched and linear, predicts 10^12 structures for a reducing hexasaccharide Citation[16].
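Each count in this paragraph is an alphabet size raised to the power of the sequence length. The sketch below reproduces the three figures in Python. The helper name `sequence_space` is hypothetical, and treating each glycosylated residue as one extra "letter" (20 + 5 = 25) is the simplification implied by the text, not a full model of glycosylation.

```python
def sequence_space(alphabet_size: int, length: int) -> int:
    """Number of distinct linear sequences of the given length."""
    return alphabet_size ** length


HEXAMER = 6

dna_hexamers = sequence_space(4, HEXAMER)    # A, G, T and C
hexapeptides = sequence_space(20, HEXAMER)   # 20 proteinogenic amino acids

# Assumption: Asn, Arg, Ser, Tyr and Thr each contribute one additional
# glycosylated form, giving 20 + 5 = 25 possible residues per position.
glyco_hexapeptides = sequence_space(20 + 5, HEXAMER)

for label, count in [
    ("DNA hexamers", dna_hexamers),
    ("hexapeptides", hexapeptides),
    ("hexapeptides with glycosylation", glyco_hexapeptides),
]:
    print(f"{label}: {count:,}")
```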

The median length of eukaryotic proteins is 360 amino acids Citation[17,18]; therefore, the number of possible sequences for an ‘average’ protein, not including PTMs, reaches an astronomical 20^360. Incidentally, titin is the largest known protein, at 34,350 amino acids in length. Written out, its full chemical name would be the longest word in the English language, consisting of 189,819 letters.
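A number as large as 20 raised to the power 360 is easier to grasp by its order of magnitude. The following Python sketch, using only the figures quoted above, counts its decimal digits directly and cross-checks the result with a logarithm; the variable names are illustrative.

```python
import math

AMINO_ACIDS = 20
MEDIAN_LENGTH = 360  # median eukaryotic protein length quoted in the text

# Python integers have arbitrary precision, so the exact value can be computed.
sequences = AMINO_ACIDS ** MEDIAN_LENGTH
digits = len(str(sequences))

# Cross-check with logarithms: log10(20^360) = 360 * log10(20).
log10_sequences = MEDIAN_LENGTH * math.log10(AMINO_ACIDS)

print(f"20^360 has {digits} digits (about 10^{log10_sequences:.1f})")
```

The result is a 469-digit number, roughly 2.3 × 10^468.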

As of 2011, more than 87,000 post-translational modifications had been identified experimentally from over 530,000 protein entries in the Swiss-Prot protein database, while another 234,938 are postulated Citation[19]. While over 200 different protein PTMs have been identified Citation[20], there remains much disagreement over the frequency at which PTMs occur. For example, Apweiler et al. Citation[21] predicted that as many as 50% of all human proteins might be glycosylated, whereas more recent estimates by Khoury et al. Citation[19] suggest the actual number might be fewer than 20%.

Carpe articulum (Seize the important moment)

It is estimated that 30% of all cellular proteins are phosphorylated on at least one residue Citation[22], with the relative frequencies of serine, threonine and tyrosine phosphorylation sites being 54%, 31% and 14%, respectively. PhosphoNET presently holds over 657,000 known and putative phosphorylation sites in over 23,000 human proteins, only 15% of which have been experimentally validated Citation[23].

There are 518 known kinases Citation[24] and possibly over 1000 phosphatases Citation[25,26] that regulate their activity. The addition or elimination of a phosphate at one or multiple sites on a protein alters its net charge and conformation, and regulates its function by rapidly switching the protein between functional and nonfunctional states. Isoelectric focusing is particularly sensitive for simultaneously detecting changes in the phosphorylation state of multiple proteins Citation[27].

Protein phosphorylation state, at any given moment, is the net result of the opposing activities of kinases and phosphatases and can change very rapidly. For example, one molecule of tissue nonspecific alkaline phosphatase can convert 971 molecules of substrate per second Citation[28], and a turnover rate of over 2000 molecules per second has been reported for human intestinal phosphatase Citation[29]. Hence, it is extremely difficult to obtain a phosphoprotein profile that accurately reflects the in vivo state. Juhl showed that protein phosphorylation states were altered in surgical specimens even prior to resection, where the ligation of arteries induces an ischemic response Citation[30].

Membrane proteins

It is estimated that membrane proteins constitute roughly 20% of the human proteome and, more recently, Almén et al. Citation[31] predicted that this number might be as high as 27%. As of early 2012, the Protein Data Bank of Transmembrane Proteins held over 83,700 structures, of which 1690 have so far been predicted to be transmembrane proteins Citation[32,33]. Membrane proteins are of particular importance, since over 60% of current drugs target this group of proteins Citation[34].

The dynamic range of protein concentrations

The complexity of proteomes is exacerbated by the broad range of protein concentrations, spanning at least 12 orders of magnitude as exemplified by human plasma where the mass of albumin is nearly ten billion times greater than that of clinically important proteins like the interleukins Citation[35,36]. Moreover, it is most likely that trace proteins which are not supposed to be in plasma, those which have leaked from damaged tissues, will be of the most diagnostic and prognostic importance.

Every tissue in the body is supplied by over 60,000 miles of veins, arteries and capillaries that comprise the human circulatory system, so conversely, tissues will have a background of plasma proteins. Perfusion of tissues to remove blood components is not a viable option, since the in vivo state of proteins is unlikely to be preserved under these conditions. For example, Quissell et al. demonstrated a half-life of less than 6 min for the phosphorylated state of an integral membrane protein regulating exocytosis Citation[37].
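To put such a half-life in perspective, the Python sketch below estimates how much of a phosphorylated form would survive a given handling delay. It assumes simple first-order (exponential) decay, which the cited study is not stated here to have established, and it uses 6 min as the half-life even though the text gives that only as an upper bound, so the fractions should be read as upper bounds too. The function name `fraction_remaining` is hypothetical.

```python
def fraction_remaining(elapsed_min: float, half_life_min: float = 6.0) -> float:
    """Fraction of the phosphorylated form left after elapsed_min minutes.

    Assumes first-order decay, an illustrative simplification.
    """
    return 0.5 ** (elapsed_min / half_life_min)


# Hypothetical handling delays, in minutes.
for delay in (6, 12, 30, 60):
    print(f"{delay:>2} min: {fraction_remaining(delay):.4f} of the original signal")
```

Under these assumptions, about 3% of the phosphorylated form would remain after a 30 min delay.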

Strategies for the depletion of high abundance structural and housekeeping proteins and the enrichment of extremely low abundance proteins will be necessary to effectively mine the proteome. Like peeling away the layers of an onion, the systematic depletion of the most abundant proteins reveals the next innermost layer of the proteome, but as with chefs, this has not happened without sometimes bringing tears to the eyes of some scientists.

Most of the development in this area has been for plasma and serum, from which it is hoped a ‘biopsy’ of the human condition might be derived. In human plasma, only six proteins constitute more than 99% of the total protein mass. Immunoaffinity chromatography has been used to deplete the 20 most abundant proteins from plasma; however, this risks the depletion of other proteins, peptides and potential biomarkers that may be associated with the molecules targeted for depletion. The study of proteins and peptides bound to albumin has been termed ‘albuminomics’, and Gundry et al. Citation[38] showed that at least 35 other proteins of both high and low abundance were retained in the albumin fraction. In terms of mining for biomarkers, these approaches seem to come with some risk of ‘throwing the baby away with the wash water’.

Conversely, immunoaffinity can be used in enrichment schemes designed to isolate and quantify very low abundance proteins, as exemplified by sandwich ELISAs, which utilize capture antibodies to selectively immobilize specific antigens. However, it is interesting to note that mouse monoclonal antibodies will be ineffective in many cases, since heterophilic antibodies such as human anti-mouse antibodies are expressed in 10–40% of the human population Citation[39,40].

Alternatively, the synthetic peptide ligand libraries first described by Lam et al. Citation[41] and later extensively developed by Righetti and Boschetti Citation[42,43] have been shown to be highly effective for isolating extremely low abundance proteins and potential biomarkers from complex samples such as serum Citation[44], urine Citation[45], saliva Citation[46] and cerebrospinal fluid Citation[47]. Immobilized on a single column, a hexapeptide library constructed from 17 amino acids would represent over 24 million possible ligands Citation[48].

Challenges facing the proteomics generation

Nature’s Antibodypedia antibody database lists over 186,000 reviewed antibodies, covering approximately 84% of all human genes Citation[49]. However, the ability to quantify potentially valuable protein biomarkers is frequently hindered by the inability to reliably isolate these proteins. The Human Proteome Organization’s Human Antibody Initiative is aimed toward compiling comprehensive tissue profiles for normal and diseased tissues Citation[50]. This will be particularly challenging, since the most stringent conditions are often required to isolate the total protein constituency from some tissues. A random survey of 327 validated antibodies listed on Antibodypedia showed that 17% of ELISA-compatible antibodies were not compatible with western blotting, indicating that conformational epitopes needed to be preserved for antibody recognition.

While sequencing the human genome in 13 years has been heralded as one of humankind’s greatest scientific achievements, compared with walking on the moon and the splitting of the atom, decrypting the enormous complexity of the human proteome and its ‘interactome’ will be no less daunting a task, requiring many more decades to complete.

Financial & competing interests disclosure

The author has no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

No writing assistance was utilized in the production of this manuscript.

References

  • Venter JC, Adams MD, Myers EW et al. The sequence of the human genome. Science 291(5507), 1304–1351 (2001).
  • Pollack A. Company unveils DNA sequencing device meant to be portable, disposable and cheap. New York Times, 17 February 2012.
  • Wilkins MH, Stokes AR, Wilson HR. Molecular structure of deoxypentose nucleic acids. Nature 171(4356), 738–740 (1953).
  • Franklin RE, Gosling RG. Molecular configuration in sodium thymonucleate. Nature 171(4356), 740–741 (1953).
  • Watson JD, Crick FH. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171(4356), 737–738 (1953).
  • Pauling L, Corey RB. A Proposed structure for the nucleic acids. Proc. Natl Acad. Sci. USA 39(2), 84–97 (1953).
  • Vogel F. A Preliminary estimate of the number of human genes. Nature 201, 847 (1964).
  • Hillier LW, Coulson A, Murray JI, Bao Z, Sulston JE, Waterston RH. Genomics in C. elegans: so many genes, such a little worm. Genome Res. 15(12), 1651–1660 (2005).
  • Pertea M, Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome Biol. 11(5), 206 (2010).
  • Van Straalen NI, Roelofs D. Introduction to Ecological Genetics. Oxford University Press, NY, USA, 56–112 (2006).
  • Parfrey LW, Lahr DJ, Katz LA. The dynamic nature of eukaryotic genomes. Mol. Biol. Evol. 25(4), 787–794 (2008).
  • Gregory TR, Nicol JA, Tamm H et al. Eukaryotic genome size databases. Nucleic Acids Res. 35(Database issue), D332–D338 (2007).
  • Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19(3), 1720–1730 (1999).
  • Muthusamy B, Hanumanthu G, Suresh S et al. Plasma proteome database as a resource for proteomics research. Proteomics 5, 3531–3536 (2005).
  • Scott M, Roberts G, Kurukulaaratchy RJ, Matthews S, Nove A, Arshad SH. Multifaceted allergen avoidance during infancy reduces asthma during childhood with the effect persisting until age 18 years. Thorax doi:10.1136/thoraxjnl-2012-202150 (2012) (Epub ahead of print).
  • Laine RA. A calculation of all possible oligosaccharide isomers both branched and linear yields 1.05 x 10(12) structures for a reducing hexasaccharide: the Isomer Barrier to development of single-method saccharide sequencing or synthesis systems. Glycobiology 4(6), 759–767 (1994).
  • Brocchieri L, Karlin S. Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33(10), 3390–3400 (2005).
  • Zhang J. Protein-length distributions for the three domains of life. Trends Genet. 16(3), 107–109 (2000).
  • Khoury GA, Baliban RC, Floudas CA. Proteome-wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database. Sci. Rep. 1, 90 (2011).
  • Zhao Y, Jensen ON. Modification-specific proteomics: strategies for characterization of post-translational modifications using enrichment techniques. Proteomics 9(20), 4632–4641 (2009).
  • Apweiler R, Hermjakob H, Sharon N. On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta 1473(1), 4–8 (1999).
  • Ubersax JA, Ferrell JE Jr. Mechanisms of specificity in protein phosphorylation. Nat. Rev. Mol. Cell Biol. 8(7), 530–541 (2007).
  • Linding R, Jensen LJ, Pasculescu A et al. NetworKIN: a resource for exploring cellular phosphorylation networks. Nucleic Acids Res. 36(Database issue), D695–D699 (2008).
  • Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S. The protein kinase complement of the human genome. Science 298(5600), 1912–1934 (2002).
  • Mustelin T. A brief introduction to the protein phosphatase families. Methods Mol. Biol. 365, 9–22 (2007).
  • Hooft van Huijsduijnen R. Protein tyrosine phosphatases: counting the trees in the forest. Gene 225(1-2), 1–8 (1998).
  • Smejkal GB, Rivas-Morello C, Chang JH et al. Thermal stabilization of tissues and the preservation of protein phosphorylation states for two-dimensional gel electrophoresis. Electrophoresis 32, 1–10 (2011).
  • Numa N, Ishida Y, Nasu M et al. Molecular basis of perinatal hypophosphatasia with tissue-nonspecific alkaline phosphatase bearing a conservative replacement of valine by alanine at position 406. Structural importance of the crown domain. FEBS J. 275(11), 2727–2737 (2008).
  • Schlamowitz M, Bodansky O. Tissue sources of human serum alkaline phosphatase, as determined by immunochemical procedures. J. Biol. Chem. 234(6), 1433–1437 (1959).
  • Juhl H. Effects of intraoperative ischemia on cancer and normal tissue. Fifth Annual Biospecimen Research Network Symposium. National Cancer Institute, Washington, USA, 22–23 (2012).
  • Almén MS, Nordström KJ, Fredriksson R, Schiöth HB. Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol. 7, 50 (2009).
  • Tusnády GE, Dosztányi Z, Simon I. Transmembrane proteins in the Protein Data Bank: identification and classification. Bioinformatics 20(17), 2964–2972 (2004).
  • Tusnády GE, Dosztányi Z, Simon I. PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res. 33(Database issue), D275–D278 (2005).
  • Lundstrom KH. Structural genomics on membrane proteins. CRC Press, Boca Raton, FL, USA, 400 (2006).
  • Service RF. Proteomics. Proteomics ponders prime time. Science 321(5897), 1758–1761 (2008).
  • Anderson NL, Anderson NG. The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867 (2002).
  • Quissell DO, Deisher LM, Barzen KA. The rate-determining step in cAMP-mediated exocytosis in the rat parotid and submandibular glands appears to involve analogous 26-kDa integral membrane phosphoproteins. Proc. Natl Acad. Sci. USA 82(10), 3237–3241 (1985).
  • Gundry RL, Fu Q, Jelinek CA, Van Eyk JE, Cotter RJ. Investigation of an albumin-enriched fraction of human serum and its albuminome. Proteomics. Clin. Appl. 1(1), 73–88 (2007).
  • Klee GG. Mouse anti-human antibodies. Arch. Pathol. Lab. Med. 124, 921–923 (2000).
  • Bertholf RL, Johannsen L, Guy B. False elevation of serum CA-125 level caused by human anti-mouse antibodies. Ann. Clin. Lab. Sci. 32(4), 414–418 (2002).
  • Lam KS, Salmon SE, Hersh EM, Hruby VJ, Kazmierski WM, Knapp RJ. A new type of synthetic peptide library for identifying ligand-binding activity. Nature 354(6348), 82–84 (1991).
  • Boschetti E, Righetti PG. The art of observing rare protein species in proteomes with peptide ligand libraries. Proteomics 9(6), 1492–1510 (2009).
  • Righetti PG, Boschetti E. The proteominer and the fortyniners: searching for gold nuggets in the proteomics arena. Mass Spectrom. Rev. 27, 596–608 (2008).
  • Di Girolamo F, Bala K, Chung MC, Righetti PG. “Proteomineering” serum biomarkers. A study in scarlet. Electrophoresis 32(9), 976–980 (2011).
  • Decramer S, Gonzalez de Peredo A, Breuil B et al. Urine in clinical proteomics. Mol. Cell Proteomics 7(10), 1850–1862 (2008).
  • Bandhakavi S, Stone MD, Onsongo G, Van Riper SK, Griffin TJ. A dynamic range compression and three-dimensional peptide fractionation analysis platform expands proteome coverage and the diagnostic potential of whole saliva. J. Proteome Res. 8(12), 5590–5600 (2009).
  • Mouton-Barbosa E, Roux-Dalvai F, Bouyssié D et al. In-depth exploration of cerebrospinal fluid by combining peptide ligand library treatment and label-free protein quantification. Mol. Cell Proteomics 9(5), 1006–1021 (2010).
  • Righetti PG, Boschetti E, Zanella A, Fasoli E, Citterio A. Plucking, pillaging and plundering proteomes with combinatorial peptide ligand libraries. J. Chromatogr. A 1217(6), 893–900 (2010).
  • Björling E, Uhlén M. Antibodypedia, a portal for sharing antibody and antigen validation data. Mol. Cell Proteomics 7(10), 2028–2037 (2008).
  • Uhlen M, Ponten F. Antibody-based proteomics for human tissue profiling. Mol. Cell Proteomics 4(4), 384–393 (2005).
