Abstract
Intrinsically disordered proteins are now widely accepted to play crucial roles in biological functions. Identifying signatures of intrinsic disorder is a key step towards building a proper repertoire of their occurrence in proteomes. In this work, systematic computational synthesis of a library of all 3368400 possible dipeptides, tripeptides, tetrapeptides and pentapeptides from the 20 natural amino acids allowed us to identify 36 unique tetrapeptides that are present exclusively in intrinsically disordered proteins and absent from the complete primary sequence space of naturally occurring structured proteins. Further, out of more than 530000 known naturally occurring primary sequences without any structural information, 1349 sequences contain these unique signatures of intrinsic disorder. These sequences, whose cellular functions range from housekeeping to metabolism to transport, more than double the number of currently known intrinsically disordered proteins. Along similar lines, we report 26577 pentapeptide signatures that are exclusive to intrinsically disordered proteins and absent from naturally occurring structured proteins; these identify ∼50% of the more than half a million curated protein sequences without structural information as intrinsically disordered. The results reported are a major leap forward in exploring functional manifestations of intrinsically disordered proteins.
Communicated by Ramaswamy H. Sarma
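The screening outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy sequences, and the choice to skip windows containing non-standard residue codes are all assumptions made for illustration. The sketch checks the size of the peptide library, collects the peptides of a given length that occur in a set of disordered sequences but in none of the structured ones, and then flags unannotated sequences that contain any such signature.

```python
from itertools import product

# One-letter codes of the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def library_size(min_len=2, max_len=5):
    """Count all possible peptides with lengths min_len..max_len."""
    return sum(len(AMINO_ACIDS) ** k for k in range(min_len, max_len + 1))


def all_peptides(k):
    """Lazily generate every possible peptide of length k."""
    return ("".join(p) for p in product(AMINO_ACIDS, repeat=k))


def kmers(sequence, k):
    """Return the set of length-k windows in a sequence.

    Windows containing non-standard residue codes are skipped
    (an assumption of this sketch).
    """
    allowed = set(AMINO_ACIDS)
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {w for w in windows if set(w) <= allowed}


def exclusive_signatures(disordered, structured, k):
    """Peptides of length k seen in disordered but never in structured sequences."""
    in_disordered = set().union(*(kmers(s, k) for s in disordered))
    in_structured = set().union(*(kmers(s, k) for s in structured))
    return in_disordered - in_structured


def flag_sequences(unannotated, signatures, k):
    """Return the unannotated sequences containing at least one signature."""
    return [s for s in unannotated if kmers(s, k) & signatures]


# Toy data, purely hypothetical (not real protein sequences).
toy_disordered = ["PEPSK"]
toy_structured = ["PEPTA"]
toy_unannotated = ["AAPEPSAA", "GGGGGG"]

toy_signatures = exclusive_signatures(toy_disordered, toy_structured, k=4)
toy_flagged = flag_sequences(toy_unannotated, toy_signatures, k=4)
```

Working from the peptides actually observed in the disordered set, rather than iterating over the full library, gives the same exclusive set, because a peptide absent from every disordered sequence cannot be a signature. The full enumeration in `all_peptides` is only needed when peptides absent from both sets are also of interest.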
Acknowledgements
AMC is grateful to IIT Delhi for fellowship support. The authors also thank IIT Delhi for providing access to the HPC facility. AM is grateful to Kusuma Trust (UK) for its generous funding, which assisted him in establishing the teaching and research programs of the School of Biological Sciences (subsequently renamed the Kusuma School of Biological Sciences) at IIT Delhi. AM is also grateful to the Dept. of Biotechnology, Government of India, and the National Supercomputing Mission, Government of India, for their support to the Supercomputing Facility for Bioinformatics & Computational Biology at IIT Delhi.
Author contributions
AMC and ST collected the data. AMC collected the complete peptide count data and ST independently confirmed the dipeptide and tripeptide count data. AMC also analyzed some of the data. AM designed the study, analyzed the data, prepared the figures and wrote the manuscript.
Disclosure statement
No potential conflict of interest was reported by the author(s).