Abstract
The availability of computational methods that predict disorder from protein sequences fuels rapid advancements in the protein disorder field. The most accurate predictions are usually obtained with consensus-based approaches. However, their design is performed in an ad hoc manner. We perform a first-of-its-kind rational design in which we empirically search for an optimal mixture of base methods, selected from a comprehensive set of 20 modern predictors, and we explore several novel ways to build the consensus. Our method for the prediction of disorder based on Consensus of Predictors (disCoP) combines seven base methods, utilizes a custom-designed set of 11 selected features that aggregate base predictions over a sequence window, and uses binomial deviance loss-based regression to implement the consensus. Empirical tests performed on an independent benchmark set (with low sequence similarity compared with the proteins used to design disCoP) show that disCoP provides statistically significant improvements with at least a moderate magnitude of differences. disCoP outperforms 28 predictors, including other state-of-the-art consensuses, and achieves an Area Under the ROC Curve of 0.85 and a Matthews Correlation Coefficient of 0.5, compared with 0.83 and 0.48, respectively, for the best considered approach. Our consensus provides a high rate of correct disorder predictions, especially when a low rate of incorrect disorder predictions is desired. We are the first to comprehensively assess predictions in the context of several functional types of disorder, and we demonstrate that disCoP generates accurate predictions of disorder located at post-translational modification sites (in particular phosphorylation sites) and in autoregulatory and flexible linker regions. disCoP is available at http://biomine.ece.ualberta.ca/disCoP/.
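The general idea of a window-based consensus fitted with a binomial deviance loss can be illustrated with a minimal sketch. This is not the disCoP implementation: the seven base methods and the 11 selected features are not reproduced here. The sketch assumes one window-averaged feature per base predictor and uses plain logistic regression trained by batch gradient descent as a stand-in for the regression; all function names, the window size, and the toy scores are hypothetical.

```python
import math

def window_mean(scores, half_width):
    """Average per-residue scores over a sliding window, truncated at the termini."""
    n = len(scores)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def build_features(base_predictions, half_width=2):
    """One feature per base predictor (an assumption): its window-averaged score."""
    smoothed = [window_mean(p, half_width) for p in base_predictions]
    return [list(column) for column in zip(*smoothed)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, X):
    """Consensus disorder propensity for each residue."""
    return [sigmoid(bias + sum(w * v for w, v in zip(weights, x))) for x in X]

def binomial_deviance(weights, bias, X, y):
    """Mean binomial deviance, i.e. twice the mean negative log-likelihood."""
    total = 0.0
    for p, t in zip(predict(weights, bias, X), y):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -2.0 * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(X)

def fit_consensus(X, y, learning_rate=0.5, epochs=500):
    """Minimize the binomial deviance with batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, p, t in zip(X, predict(weights, bias, X), y):
            err = p - t
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Toy per-residue scores from two hypothetical base predictors (made-up values).
base_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.2, 0.7, 0.9]
base_b = [0.7, 0.9, 0.6, 0.4, 0.4, 0.1, 0.3, 0.1, 0.6, 0.8]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]  # 1 = disordered, 0 = ordered

features = build_features([base_a, base_b], half_width=1)
weights, bias = fit_consensus(features, labels)
consensus_scores = predict(weights, bias, features)
```

In this sketch the fitted weights play the role of the consensus: each residue's score is a weighted combination of the window-averaged base predictions, passed through a logistic function to give a propensity between 0 and 1.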
Acknowledgement
The authors thank Mr. Marcin Mizianty for help with the implementation and testing of the disCoP_WS method.