67
Views
1
CrossRef citations to date
0
Altmetric
Original Research

The development of an affinity evaluation and prediction system by using protein–protein docking simulations and parameter tuning

, , , &
Pages 1-15 | Published online: 12 Jan 2009

Abstract

A system was developed to evaluate and predict the interaction between protein pairs by using the widely used shape complementarity search method as the algorithm for docking simulations between the proteins. We used this system, which we call the affinity evaluation and prediction (AEP) system, to evaluate the interaction between 20 protein pairs. The system first executes a “round robin” shape complementarity search of the target protein group, and evaluates the interaction between the complex structures obtained by the search. These complex structures are selected by using a statistical procedure that we developed called ‘grouping’. At a prevalence of 5.0%, our AEP system predicted protein–protein interactions with a 50.0% recall, 55.6% precision, 95.5% accuracy, and an F-measure of 0.526. By optimizing the grouping process, our AEP system successfully predicted 10 protein pairs (among 20 pairs) that were biologically relevant combinations. Our ultimate goal is to construct an affinity database that will provide cell biologists and drug designers with crucial information obtained using our AEP system.

Introduction

Proteins have diverse functions ranging from molecular machines to signaling. They participate in catalytic reactions, provide transport, form viral capsids, traverse membranes and construct regulated channels, transmit information from DNA to RNA, thus making it possible to synthesize new proteins. Given their importance, considerable effort has centered on the prediction of protein functions. A major way of achieving this is through the identification of binding partners. If we know the function of at least one of the components with which the protein interacts, that should enable us to determine its function(s) and the pathway(s) in which it plays a role. This holds since the vast majority of the functions performed by proteins in living cells involve protein–protein interactions (PPI).Citation1,Citation2 Hence, by observing the intricate network of these interactions we can map cellular pathways, their interconnectivities and their dynamic regulation. Their identification is at the heart of functional genomics; their prediction is crucial to the discovery of new drugs. Knowledge of these pathways, their topologies, lengths, and dynamics may provide useful information for forecasting side effects.

The PPI affinity problem involves finding a protein pair that is interactive from a protein group with a known structure.Citation1,Citation2 Recently, pharmaceutical companies have been trying to produce medicines based on candidates discovered using protein– chemical compound docking simulations. Docking analysis with respect to PPI remains limited, however, because the macromolecules involved in these simulations are large and thus require extensive computational resources. PPI is a basic element of model construction in research that first analyzes the cell-signaling pathway and then identifies the target protein of a drug. The prediction of the interactions between vast numbers of proteins is crucial if we are to understand such biological phenomena as cellular signaling, enzyme reaction and gene expression regulation.

PPI prediction has been extensively studied both experimentallyCitation3Citation7 and theoretically. In experimental studies, Giot and colleagues,Citation8 and Stanyon and colleaguesCitation9 constructed a protein interaction map on a genome scale for Drosophila melanogaster, and Rain and colleaguesCitation10 constructed that for human Helicobacter pylori. Based on theoretical calculations, Wojcik and colleaguesCitation11 constructed an interaction map by using domain profile pairs in which the input was the amino acid sequence for H. pylori.

These studies mainly used either genome scale analysisCitation8,Citation10 or sequence analysis.Citation12 However, in vivo, PPI is achieved by using protein structural information. Therefore, ultimately, PPI evaluation and prediction should also utilize protein structural information. Such an interaction evaluation and prediction system has been discussed by Smith and colleagues,Citation13 they did not consider the composition, performance or accuracy of the evaluation or the prediction of a concrete system. If we can perform such an ultimate PPI evaluation and prediction of the interaction between pairs of a large number of proteins, we can assume a signal transmission pathway, and then we can evaluate and predict the affinity between an enzyme and inhibitor(s) or substrate(s), an antibody and an antigen. Such a large-scale PPI affinity evaluation system has been discussed by Park and colleaguesCitation14 and Gong and colleagues.Citation15 Their system, Protein Structural Interactome map (PSIMAP) (http://psimap.com/), performs evaluations by using Euclidean distance between two domains and the number of residue pairs within a given distance threshold as the PPI measurement base. This method is simple and fast. We know that the data provided by PSIMAP are useful, but such a method based on simply distance comparisons is not available for making PPI maps that include sufficient three-dimensional (3D) structural information for supporting a drug discovery. Therefore, in this study we developed a high-speed and more detailed PPI affinity evaluation and prediction system that is based on our original docking simulation program and the original high-speed FFT library produced by Hourai and colleaguesCitation16 and can be performed using a massive parallel PC cluster. In this system, which we call an affinity evaluation and prediction (AEP) system, first a “round robin” shape complementarity search is executed for the target protein group, and then the interaction between the proteins is evaluated based on complex structures determined by using a statistical procedure that we developed called ‘grouping’.

Methods

Evaluation of shape complementarity

When the shape complementarityCitation17Citation19 between two proteins is evaluated, the structure of a protein is converted into a rigid body object represented in a 3D grid space. Typically, in the protein–protein docking simulation, each point of the protein represented on the grid is classified as being in the core area, the surface area, or the cavity area as described below. Because the accuracy of docking simulations that use the shape complementarity search depends strongly on the representation method, various representation methods have been developed.

Here, we have developed a representation method called the voxel model, which has two distinguishing features. (1) The molecular surface can be faithfully modeled by changing the radius of the probe sphere to define as the extended solvent-accessible surface, which is traced out by the probe sphere center as it rolls over the protein, of each atomic type. (2) The shape complementarity can be determined much more accurately than just made voxel models by reducing the thickness of the molecular surface. However, when the thickness is reduced too much, the response of the shape complementarity evaluation function becomes too sensitive. Next we describe in detail the voxel model and the procedure for converting the crystal structure of a protein to the voxel model (voxelization).

Voxelization

The voxel model is expressed as shown in expression (Equation1), where VOXEL (V) represents the voxel model, INNER (I) defines the area on the atom, where the protein exists, OUTER (O) is the area that is not INNER, in other words it indicates a solvent. SURFACE (S) is the area on the molecular surface of the protein, CAVITY (C) is the area inside SURFACE that is not INNER, namely, it means a cavity. N is a set of natural numbers, and φ is an empty set. nI, nO, nS, and nC are the upper bound values of each voxel type (I, O, S, C) decided from the molecular size of the protein. All the areas in the voxel space in the initial state are defined as DEFAULT (D), and a temporary molecular surface of protein is defined as TMP_SURFACE (Stmp).

ν={I,O,S,CIOSC=φ}I={IiiN,0inI}O={OjjN,0jnO}S={SkkN,0knS}C={CIlN,0lnC}0nI,nO,nS,nCnVnVN,nV00nI+nO+nS+nCnVnVN,nV0(1)

The procedure of voxelization is formulated using expression (Equation2). details each step of this procedure.

Figure 1 Six-step voxelization process for the quantization and characterization of protein surfaces in our AEP system.

Abbreviation: AEP, affinity evaluation and prediction.
Figure 1 Six-step voxelization process for the quantization and characterization of protein surfaces in our AEP system.

ν=(step1){D}-(step2){I,D}-(step3){I,O,D}-(step4){I,O,Stmp,D}-(step5){I,O,S,D}-(step6){I,O,S,C}(2)
  • Step 1 (DEFAULT):

    In the voxel space, all grid points are assigned to DEFAULT as the initial state.

  • Step 2 (INNER):

    The grid points that exist within a spheroid of radius Ri (=RviΔri) are assigned to INNER, where Rvi is the van der Waals radius of each atom, and Δri is the increment in the radius of a defined type of atom.

  • Step 3 (OUTER):

    All the remaining grid points are assigned as OUTER.

  • Step 4 (TMP_SURFACE):

    The INNER grid points that adjoin the OUTER grid points are newly assigned to TMP_SURFACE.

  • Step 5 (SURFACE):

    A TMP_SURFACE grid point is finally replaced by SURFACE or INNER according to the following process. When a TMP_SURFACE grid point is enclosed by six or more OUTER grid points, the TMP_SURFACE grid point is re-assigned to SURFACE. After this process has been completed for all TMP_SURFACE grid points, the remaining TMP_SURFACE grid points are re-assigned to INNER. Thus, finally, the TMP_SURFACE grid points vanish from the voxel space.

  • Step 6 (CAVITY):

    The grid points which remain as DEFAULT are re-assigned to CAVITY. Hence, the DEFAULT grid points also disappear from the voxel space.

Score function of contact surface area

In the shape complementarity evaluation, the surface contact between proteins represented by the grid points is evaluated by using the score function of the contact surface area.Citation17 This score function defines a score for INNER, OUTER, SURFACE, and CAVITY as described in Voxelization. The contact surface score function we adopted for our voxel model is the pair-wise shape complementarity (PSC) score function SPSC that Weng and colleaguesCitation17 defined with ZDOCK.Citation17,Citation18,Citation20 As defined in expressions (Equation3) and (Equation4), this SPSC is a primitive function where the electrostatic interaction is not considered. The score function SPSC consists of two factors RPSC, LPSC.

Re[RPSC(l,m,n)]={numberofSURFACEwithin(vdwradiusofCarbon+bonusR)OUTER0otherwiseRe[LPSC(l,m,n)]={1SURFACE,INNER0otherwiseIm[RPSC(l,m,n)]=Im[LPSC(l,m,n)]={3SURFACE9INNER,CAVITY0OUTER(3)

The factor RPSC(l, m, n) is for the receptor protein, which is a protein fixed on the grid space, and the factor LPSC(l, m, n) is for the probe protein, which is a protein that is rotated and translated on the grid space. In expressions (Equation3) and (Equation4), l, m, and n are the coordinates of the grid point, and “Re[]” and “Im[]”, respectively, indicate the real and imaginary parts of the factor and the PSC scoring function. Moreover, bonusR, which appears in the real part of RPSC(l, m, n), indicates “bonus” points (ie, additional points added to the PSC scoring) when the probe protein penetrates the pocket structure of the receptor protein. The final SPSC, as shown by expression (Equation4), is obtained by calculating the correlation function between RPSC and LPSC. The computational complexity of this calculation is O(N6), but can be decreased to O(N3logN) by using a fast Fourier transform (FFT) as follows.

SPSC(o,p,q)=Re[l=1Nm=1Nn=1NRPSC(l,m,n)·LPSC(l+o,m+p,n+q)]=Re[1N3IFT(IFT(RPSC)·DFT(LPSC))]-Im[1N3IFT(IFT(RPSC)·DFT(LPSC))](4)

Abbreviations: (DFT, Discrete Fourier Transform; IFT, Inverse Fourier transform).

In the SPSC calculation, we adopted the FFT library CONV3D developed by Hourai and Nukada.Citation16 The CONV3D is highly optimized for specific CPU architectures (Pentium, Xeon, Athlon, POWER5, etc.), achieves high-speed performance, and is about three times faster than the widely used FFT library package FFTWCitation21,Citation22 (according to the results obtained with Protein–Protein Docking Benchmark 2.0;Citation23 data not shown).

At the beginning, a rotational search is performed in an actual evaluation of the contact surface between proteins. The rotational search works by explicitly rotating the ligand protein around each of its three Cartesian angles by a certain increment (15 degrees in this study). A total of 3600 rotated bodies of the ligand protein are thus generated. Next, the shape complementarity evaluation is undertaken for all rotated bodies. If the grid size of the ligand protein is N3, then N3 × 3600 PSC scores must be calculated. Based on preliminary experiments, in all the analyses we only extracted the top-ranked 512 PSC scores for the following procedures.

Reliability of shape complementarity evaluation

To confirm the reliability of the shape complementarity evaluation by using our voxel model, we performed simple docking calculations using some bound structures adopted from Weng’s Protein–Protein Docking Benchmark 2.0 as samples. It is widely recognized that the main challenge in protein docking lies in the unbound–bound transition. Although we are also more interested in the unbound experiments than in the bound, the latter enables us to primitively confirm the advantages of our AEP’s improvement at the initial stage. Because a protein complex for the latter, which is pulled apart and reassembled, is more complementary than individually crystallized component structures for the former. The widely used i-RMSD (backbone RMSD at the interface) was the criterion we used to evaluate the reliability. In this way we evaluated the difference between the complex structure obtained by using the docking calculation and the X-ray crystal structure.

Affinity analysis

To calculate protein-protein affinity, we assess the biological relevance of target proteins by statistically analyzing the characteristics of shape complementarity. The affinity analysis involved processing the 512 top-ranked values of SPSC obtained from the shape complementarity evaluation by using a clustering method that we call grouping. The following describes the affinity prediction procedure and includes a description of the grouping method.

  • Step 1 (Shape complementarity evaluation):

    A shape complementarity evaluation of protein pairs is performed. Only the 512 top-ranked SPSC scores are then considered for step 2.

  • Step 2 (Grouping):

    The purpose of step 2 is to classify the 512 SPSC scores into 10 clusters, Ci (i = 1–10). First, the clustering method is performed hierarchically by using the nearest neighbor method for the 512 values of SPSC as follows. We define the targeted dataset X = {x1, ..., x512} as consisting of 512 SPSC scores from step 1. For each data point, xi is represented as xi = (Si, Ti(TX, TY, TZ), Ri(Rθ, Rφ, Rψ)). Here, Si = SPSC from expression (Equation4), Ti(TX, TY, TZ) is the amount of translation for the ligand protein in the grid space, and Ri(Rθ, Rφ, Rψ) is a rotation parameter for the ligand protein. If (o,p,q) in expression (Equation4) is used, then o = TX, p = TY, and q = TZ in Ti(TX, TY, TZ), respectively.

    All data x belonging to the dataset X are sorted by the rank order of Si. In X, we define the data point with the maximum Si (maxSi) as the central data y1 of cluster 1 (y1C1). Here, the term central data indicates data that represent a characteristic of a cluster. Now, we introduce the distances D1(y1, xi) and D2(y1, xi) between y1 and x. Here, in this dataset x, the data y1 is not included and is not classified into any clusters Ci (xi ∈ (Ci)C). The distance D1 is represented as the Euclidean translation distance, and the rotational distance D2 is a value defined to represent the difference in rotation of data xi and data y1. Here, we define the center of gravity of a ligand as point G and the coordinates of the Cα atom in the C- and N-terminals of a ligand as points C and N, respectively. We also define vector u as the normal vector of a plane, which is defined by three points (G, C, and N). Thus, the rotational difference D2 indicates the cosine between ui of data xi and u1 of data y1. When the following two conditions exist between the above data xi (ie, xi ∈ (Ci)C) and the central data y1, the data xi are classified into cluster 1 (C1).

    1. When the translation distance |D1| between Ti of data xi and T1 of the central data y1 is less than a certain translational threshold (eg, 10.0Å).

    2. When the rotational distance |D2| between xi and y1 is less than a certain rotational threshold (eg, π/6).

      Cluster C1 is determined by repeating step 2 for all xi where y1 is excluded. Similarly, step 2 is repeated for clusters Ci (i = 2–10). Step 2 ends either when the ten clusters (C1C10) are generated or when all xi belong to any cluster Ci.

  • Step 3 (Calculation of Z-score):

    The purpose of step 3 is to prepare for the affinity analysis. The Z-score indicates the number of standard deviations that an observation is above or below the mean. It allows us to compare observations from different normal distributions. Here, a set of central data yi of each cluster Ci obtained in step 2 is defined as the central dataset Y. The distribution Zscore(yi) is the Z-score distribution of the PSC score of each set of central data yiY when the dataset X is considered a population. The distribution Zconformation(Ci) is the distribution of the Z-score of mCi. Here mCi is the number of data that constitute cluster Ci when the dataset Mcluster = {mci | iN, i ≧0} is considered a population. In step 3, the distributions Zscore(yi) and Zconformation(Ci) are calculated.

  • Step 4 (Calculation of affinity score Saffinity):

    The measure for the affinity analysis is called an affinity score Saffinity, which is defined in expression (Equation5) as the maximum value of a linear combination of Zscore(yi) and Zconformation(Ci) and their corresponding weighting factors ws and wc. The weighting factors ws and wc are introduced to determine the balance between the SPSC score distribution Zscore(yi) and the conformation distribution Zconformation(Ci) when the affinity score Saffinity is calculated. The analysis of the affinity between protein pairs can be quantified by using Saffinity.

    Saffinity=max(ws·Zscore(yi)+wC·Zconformation(Ci))ws:weightingfactorforZscore(yi)we:weightingfactorforZconformation(ci)(5)

  • Step 5:

    Steps 1 to 4 are repeated. For example, if the affinity analysis is needed for 20 receptors and 20 ligands (20 × 20), then steps 1 to 4 are repeated 20 times because there are 20 ligands for each receptor. The Saffinity matrix is obtained by repeating steps 1–5 for all protein pairs.

When several target proteins have an unknown affinity, we can consider our AEP system to be an affinity prediction system. To evaluate the prediction performance of our AEP system, we used receiver operating characteristics curve (ROC) analysis. Based on the ROC analysis, we evaluated the maximum prediction performance of our AEP system by using a recall, a precision and a cutoff value. The parameters of ws and wc in expression (Equation5) were defined at the maximum prediction performance.

Performance measures

Sensitivity and specificity are one of the performance measures for assessing the statistical significance of AEP’s predictions. The sensitivity or the recall measures the percentage of biologically relevant pairs which are correctly identified as having the high-affinity scores; and the specificity measures the percentage of biologically not relevant combinations which are identified as having the low-affinity. Here, the evaluation of each affinity scores (ie, high or low) is determined by a cutoff value from ROC analysis. Then, precision is the number of predictions statistically identified as belonging to the protein pairs divided by the total number of pairs. “Accuracy” is closely related to precision, measures the percentage of correctly identified protein pairs in the total number of pairs. Moreover, prevalence is defined as the total number of biologically relevant pairs in all the pairs. F-measure, which is the weighted harmonic mean of precision and recall, can be used as a single measure of performance.

Implementation

Our AEP system was implemented using C language, which we ran on the cluster PC (16 cores and 64 GB memory). To improve the performance, we parallelized the core program of AEP using the Message-Passing Interface (MPI) library. For example, the time needed for the docking calculation is about 12 seconds for a grid size of 128Citation3 with a rotation resolution of 15 degrees. Thus, the peak performance of AEP is 7,200 protein–protein pairs per day.

Protein pair dataset

All protein pairs used in this study were selected manually from complex structures registered with the Protein Data Bank (PDB). We adopted the 20 protein pairs (bound dataset) shown in , and analyzed the affinity of each pair.

Table 1 Targeted 20 protein pairs

In a docking simulation that uses a general shape complementarity search, the receptor is fixed to the grid space and the ligand is rotated and translated in the grid space.

Results and discussion

Validation of docking accuracy using our AEP system

We selected seven protein pairs from Protein–Protein Docking Benchmark 2.0 according to the distribution of difficulty (“moderate”, “medium”, and “difficult”) that Weng and colleagues defined. The breakdown of the difficulty in the seven selected pairs was five pairs in the moderate difficultly class (1DFJCitation24, 1F34Citation25, 1HIACitation26, 1AKJCitation27, 1F51Citation28), one pair (1I2MCitation29) in the medium difficultly class and one pair (1EERCitation30) in the difficult class. summarizes the docking calculation results for seven protein pairs in which i-RMSD was used to confirm the reliability of the shape complementarity evaluation. For all seven pairs, i-RMSD < 1.10 Å, indicating enough accuracy for docking prediction. As regards 1F34, 1HIA, 1EER, 1I2M, and 1F51, the rank that gives these highly accurate results is within the top 20 ranks. On the other hand, the SPSC for 1DFJ and 1AKJ is relatively low, and therefore their ranks are also low, namely 129th and 230th, respectively. However, the affinity analysis successfully predicted the active sites of these pairs, including 1DFJ and 1AKJ. Although a general docking simulation uses a more complex evaluation function, the results obtained here by using a primitive evaluation function consisting only of the shape complementarity were good. Based on these docking results (), the shape complementarity evaluation is sufficiently reliable for affinity analysis.

Table 2 Docking calculation results to validate the shape complementarity evaluation using voxel model

Shape complementarity evaluation

Prediction performance comparison between AEP and ZDOCK 3.0.1

As mentioned in Evaluation of shape complementarity, an original docking simulation code is included in our AEP system. We evaluated the prediction performance of the system when replacing our code with ZDOCK 3.0.1,Citation31 which is a widely used docking program, and compared the prediction accuracy of the ZDOCK 3.0.1 version with that of the original AEP. It is the purpose of the comparison to show the effect of voxelization, grouping, and parallelization, which we have introduced.

We have already confirmed this effect. For this comparison, we adopted 20 different protein pairs (bound dataset) from the Protein–Protein Docking Benchmark 2.0. The highest score of ZDOCK 3.0.1 output was adopted as the affinity score for each protein pair.

The prediction performance obtained using ZDOCK 3.0.1 is shown in . This table represents the advantages of performance parameters of AEP over ZDOCK 3.0.1. Here, these parameters are of the peak performance by using the chosen cutoff from the ROC analysis. The F-measure of AEP is 5.1% more accurate than that of ZDOCK 3.0.1. That means the high specificity of AEP over ZDOCK 3.0.1 while the difference of sensitivity is slightly 10.0% (ie, the difference of the value of true positives is only 2). AEP has an ability to correctly identify many true negatives (ie, TN = 307) with few false positives (ie, FP = 73) rather than ZDOCK 3.0.1 (ie, TN = 260, FP = 120). The ability to eliminate false positives is very important in terms of our goal that tries to deeply penetrate protein–protein networks. It is significantly valuable not to predict nonbonding pairs as well as identifying correct combinations.

Table 3 Comparison prediction performance between AEP and ZDOCK

Moreover, our AEP required much less computational time than ZDOCK 3.0.1. ZDOCK 3.0.1 needs over 3600 seconds for one protein pair whereas our AEP system needs only 12 seconds, namely it is 300 times faster. As shown by these results, we have revealed the advantage of our AEP with respect to prediction and computational performance.

The effect of grid size on the identification capability of our affinity analysis

The accuracy of the representation of the protein model is dominated by the grid-size of the voxelization procedure. In shape complementarity docking, the optimum grid size is defined as being 1.2 Å by Weng and colleaguesCitation17 as regards accuracy of docking, although, for our affinity analysis, which is derived from that shape complementarity docking scheme, the optimum grid size is unclear. Thus, we first investigated an optimal grid size for our affinity analysis. In this investigation using 20 × 20 protein pairs as shown in , the grid size ranged from 0.8 to 2.2 Å at 0.2 Å intervals. As the standard for our evaluation of the accuracy of the affinity analysis we used F-measure, which we obtained as the cutoff value from the ROC analysis. The results of F-measure are shown in . According to these results, the highest F-measure was derived with a grid size of 1.2 Å. Thus, we use this grid size for all analyses as mentioned below.

Table 4 Evaluations about the prediction performance for various of grid size

Evaluation for 20 × 20 protein pairs

shows the results of the shape complementarity evaluation for about 20 × 20 protein pairs. It also shows the normalized SPSC for the receptor of each ligand. The vertical axis shows the normalized SPSC where the maximum value is 1.0, and the horizontal axis shows the PDB index of the receptors (The relation between the index and PDB-ID is shown in ).

Figure 2 Normalized PSC score SSPC of the receptor of each ligand. The horizontal axis shows the PDB index as in , and the vertical axis shows the normalized PSC score SSPC.

Abbreviations: PDB, Protein Data Bank; PSC, pair-wise shape complementarity.
Figure 2 Normalized PSC score SSPC of the receptor of each ligand. The horizontal axis shows the PDB index as in Table 1, and the vertical axis shows the normalized PSC score SSPC.

The polygonal lines of 12 out of 20 receptors have no characteristic peak. For these receptors, the score of each predicted ligand is not the maximum (ie, SPSC < 1.0) or the number of ligands with maximal score is not only one: 1AY7Citation32(index = 0), 1CAUCitation33(2), 1D4VCitation34(3), 1EAYCitation35(4), 1EM8Citation36(5), 1VETCitation37(11), 1WMHCitation38(12), 1WRDCitation39(13), 1XG2Citation40(14), 2F9ZCitation41(16), 2HSNCitation42(18), and 2NXNCitation43(19). Thus, when using only the normalized SPSC, it is too difficult to evaluate and quantify the affinity between ligand and receptor.

Using our affinity calculation, and show the affinity score Saffinity with wc = 0 and ws = 1. These values for wc and ws (parameters used for calculating affinity scores) were chosen to provide the maximum prediction performance (ie, F-measure) from our preliminary experiments.

Figure 3 Affinity score Saffinity of the receptor of each ligand. The cutoff value is 7.54σ. The filled arrowhead (▾) denotes a protein pair in which Saffinity ≥ 7.54σ and is thus biologically relevant; the unfilled arrowhead is a protein pair incorrectly identified as being biologically relevant (ie, false positive). The horizontal axis shows the PDB index, and the vertical axis shows the affinity score Saffinity.

Abbreviation: PDB, Protein Data Bank.
Figure 3 Affinity score Saffinity of the receptor of each ligand. The cutoff value is 7.54σ. The filled arrowhead (▾) denotes a protein pair in which Saffinity ≥ 7.54σ and is thus biologically relevant; the unfilled arrowhead is a protein pair incorrectly identified as being biologically relevant (ie, false positive). The horizontal axis shows the PDB index, and the vertical axis shows the affinity score Saffinity.

Figure 4 Matrix of affinity score Saffinity with weighting parameters wc = 0, and ws = 1. The points on the diagonal line show the biologically relevant pairs, and the circles show the biologically relevant pairs where the AEP system provided a successful prediction.

Abbreviation: AEP, affinity evaluation and prediction.
Figure 4 Matrix of affinity score Saffinity with weighting parameters wc = 0, and ws = 1. The points on the diagonal line show the biologically relevant pairs, and the circles show the biologically relevant pairs where the AEP system provided a successful prediction.

shows the Saffinity of the receptor of each ligand, and the vertical axis shows the Saffinity and the horizontal axis shows the PDB index of the receptor. When our AEP system is used as a prediction system, the horizontal dotted line shows that the cutoff value at which the system achieves its maximum prediction performance is 7.54σ. This value provides the optimal tradeoff between sensitivity and specificity. Each filled arrowhead (▾) denotes a protein pair where Saffinity exceeds the chosen cutoff value, thus indicating that the biologically relevant pair with a statistically significant score; each unfilled arrowhead is a protein pair incorrectly identified as being biologically relevant (ie, false positive).

shows the Saffinity matrix in which the 10 data points (denoted by circles) along the diagonal of the matrix represent the protein pairs that have been experimentally confirmed to be biologically relevant. The vertical axis shows the index of the ligand, and the horizontal axis that of the receptor.

Based on and , the Saffinity obtained by our grouping process is more suitable than the simple SPSC for estimating protein–protein affinity. For example, the Saffinity for a specific ligand is the highest for the following receptors: 1BPLCitation44(1), 1F60Citation45(7), 1GO3Citation46(8), 1KA9Citation47(9), 1RKECitation48(10), 1VET(11), 1Y96Citation49(15), 2G2UCitation50(17), and 2NXN(19). These protein pairs with the highest score are also biologically relevant pairs. Moreover, for 1F2TCitation51(6), the Saffinity of a biologically relevant pair is the 2nd high score.

Our AEP system can therefore constitute a prediction system if we introduce a recall, a precision and a cutoff value. Our system predicts pairs whose Saffinity exceeds the cutoff value and are thus biologically relevant pairs. Based on the ROC curve in , a cutoff value of 7.54σ demonstrates the maximal prediction performance of our system. When this cutoff is adopted, the performance of our AEP as a prediction system has an F-measure of 0.526, a recall of 50.0%, a precision of 55.6% (sensitivity of 50.0% and a specificity of 97.9%), and an accuracy of 95.5%. When we consider that the prevalence of the current sample size is 5%, this performance is sufficiently high. Our AEP system identified 10 of 20 biologically relevant combinations from among 400 pairs of combination candidates. In addition, because our AEP system can identify the true negative pairs, the specificity and the accuracy are sufficiently high at 97.9%, 95.5%, respectively. This means that the prediction provided by our AEP system is reliable.

Figure 5 The ROC curve for the prediction performance of the AEP system. The peak performance achieved at point P (cutoff value is 7.54σ), and their values are shown in the bottom-right corner.

Abbreviations: AEP, affinity evaluation and prediction; ROC, receiver operating characteristics.
Figure 5 The ROC curve for the prediction performance of the AEP system. The peak performance achieved at point P (cutoff value is 7.54σ), and their values are shown in the bottom-right corner.

Efficacy of “asymmetrically biased” PSC score function for affinity analysis results by comparison of molecular size

We compiled the relationship between the number of atoms in terms of a molecular size and the results of the affinity analysis as shown in . There is a relation between the affinity results and the size of receptors or ligands. This size indicates the number of atoms for a protein. The correlation coefficient between the identification probability of biologically relevant pairs and the size of receptor is 0.40. This positive correlation is related to the characteristic of the score function of the shape complementarity. As mentioned above, the score function we used for the receptor is different from that for the ligand. Because the score function for a receptor has a “bias” that tends to pull into the ligand to own pocket structure, we expect our AEP system to exhibit increasing docking structure accuracy when searching between a “large” receptor and a “small” ligand. The current results reveal the efficacy as regards the evaluation of affinity. Moreover, because this characteristic is always good for a molecule that consists of a large number of atoms, to make the estimation performance more accurate, the dataset should be rearranged in terms of the assignment of either a receptor or a ligand by the order of the molecular size. In other word, a “large” molecule should always be assigned as a receptor regardless of whether it functions biologically as a receptor or a ligand.

Figure 6 Relation between the size of each protein pair (ie, number of atoms) needed for biological relevance and the affinity analysis results. The horizontal axis shows the number of atoms in the receptor, and the vertical axis shows that in the ligand. The symbols denote the protein pair. The unfilled circles (○) denote a protein pair that was identified as biologically relevant by the AEP system, and the filled circles (●) denotes a protein pair identified as not biologically relevant.

Abbreviation: AEP, affinity evaluation and prediction.
Figure 6 Relation between the size of each protein pair (ie, number of atoms) needed for biological relevance and the affinity analysis results. The horizontal axis shows the number of atoms in the receptor, and the vertical axis shows that in the ligand. The symbols denote the protein pair. The unfilled circles (○) denote a protein pair that was identified as biologically relevant by the AEP system, and the filled circles (●) denotes a protein pair identified as not biologically relevant.

Problem with shape complementarity evaluation

In the current affinity analysis, our AEP system succeeded in predicting the following 10 biologically relevant protein pairs from 400 (ie, 20 × 20) protein pairs: 1BPL(1), 1F2T(6), 1F60(7), 1GO3(8), 1KA9(9), 1RKE(10), 1VET(11), 1Y69(15), 2G2U(17), and 2NXN(19). The common feature of those complex structures is that, except for 1F2T(6) and 1Y96(15), they have a typical docking type that we called “surface-to-surface”. The ”surface-to-surface” type as shown in is docking formed under a wide contact surface. Even the affinity evaluation was derived from a shape complementarity search, because in practice, the affinity analysis revealed that a shape complementarity search can identify a pair with such a docking type in a sophisticated manner from 400 protein pair candidates.

Figure 7 Models for docking styles. Type (a), called “surface-to-surface”, is most popular docking among our docking complex structures. Type (b) is called “edged-to-edge” contacting only using each edge of receptor and ligand. Type (c) is called “point-to-point” which is most less performance by shape complementarity search.

Figure 7 Models for docking styles. Type (a), called “surface-to-surface”, is most popular docking among our docking complex structures. Type (b) is called “edged-to-edge” contacting only using each edge of receptor and ligand. Type (c) is called “point-to-point” which is most less performance by shape complementarity search.

In addition, the docking type of 1F2T(6) or 1Y96(15) is different from the typical type as mentioned above. We call their docking type “edge-to-edge”. The “edge-to-edge” type is a docking style where the edge of the surface of a receptor comes in contact with a ligand edge as shown in . Naturally, a shape complementarity search has difficulty in dealing with the “edge-to-edge” docking type. Indeed as shown in , with 1F2T(6), even our AEP system succeeded in predicting the pair, and the affinity score from the correct pair is the 2nd highest score rather than the highest. Our AEP system also succeeded for 1Y96(15), however, the 2nd highest score from an irrelevant pair is also over the cutoff value of 7.54σ.

On the other hand, our AEP system failed to predict the following 10 biologically relevant pairs: 1AY7(0), 1CAU(2), 1D4V(3), 1EAY(4), 1EM8(5), 1WHM(12), 1WRD(13), 1XG2(14), 2F9Z(16), and 2HSN(18). This prediction failure might be due to a structural feature of the protein pair that hinders the shape complementarity evaluation.

Because the docking type of 1AY7(0), 1CAU(2), 1EAY(4), 1WHM(12), 1WRD(13), 1XG2(14), and 2F9Z(16) is “edge-to-edge”, the affinity scores Saffinity are low (ie, under 7.54σ) for almost all the ligands. The complex structures of 1D4V(3), 1EM8(5), and 2HSN(18) are “point-to-point” docking type. This docking type has a smaller contact surface area than the “edge-to-edge” type as shown in . In fact, the docking site of the 1EM8(5) receptor is too small to allow us to detect the correct pair. This makes a shape complementarity search difficult to perform with this type.

Most of the protein pairs that our AEP system failed to predict were not the typical docking type (ie, surface-to-surface) with a complex formation as mentioned above but edge-to-edge and/or point-to-point type, because the shape complementarity search cannot deal easily with these types. We are now extracting the common characteristics from these negative docking types and developing methods to overcome these issues.

Conclusion

A shape complementarity search method and a statistical method called ‘grouping’ were combined to evaluate and predict the affinity between various protein pairs with known structures. At a prevalence of 5.0%, prediction with our AEP system had a recall of 50.0%, a precision of 55.6% and an accuracy of 95.5%. An evaluation of the affinity was difficult when only the PSC score based on the shape complementarity evaluation was considered.

The affinity evaluation was significantly improved by the grouping process that we introduced. Future research will be undertaken on the relation between the physicochemical characteristic of proteins and the prediction accuracy of our AEP system, and then, based on those findings, research will be undertaken on further improving the accuracy of the system. Although an unbound dataset was not considered in this study, our future research will also include how to improve the prediction accuracy of an unbound dataset.

Disclosure

The authors report no conflicts of interest in this work.

References

  • JonesSThorntonJMPrinciples of protein-protein interactionsProc Natl Acad Sci U S A19969313208552589
  • TeichmannSAPrinciples of protein-protein interactionsBioinformatics200218S24924912386009
  • AyubMJSmulskiCRNyambegaBProtein-protein interaction map of the Trypanosoma cruzi ribosomal P protein complexGene200535712913616120475
  • RualJFVenkatesanKHaoTTowards a proteome-scale map of the human protein-protein interaction networkNature20054371173117816189514
  • CollinsSRMillerKMMaasNLFunctional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction mapNature200744680681017314980
  • ParrishJRYuJKLiuGZA proteome-wide protein interaction map for Campylobacter jejuniGenome Biol.20078R13017615063
  • WongJNakajimaYWestermannSA protein interaction map of the mitotic spindleMol Biol Cell2007183800380917634282
  • GiotLBaderJSBrouwerCA protein interaction map of Drosophila melanogasterScience20033021727173614605208
  • StanyonCALiuGZMangiolaBAA Drosophila protein-interaction map centered on cell-cycle regulatorsGenome Biol.20045R9615575970
  • RainJCSeligLDe ReuseHThe protein-protein interaction map of Helicobacter pyloriNature200140921121511196647
  • WojcikJSchachterVProtein-protein interaction map inference using interacting domain profile pairsBioinformatics200117Suppl 1S29630511473021
  • OfranYRostBPredicted protein-protein interaction sites from local sequence informationFEBS Lett200354423623912782323
  • SmithGRSternbergMJPrediction of protein-protein interactions by docking methodsCurr Opin Struct Biol200212283511839486
  • ParkDULeeSMBolserDComparative interactomics analysis of protein family interaction networks using PSIMAP (protein structural interactome map)Bioinformatics2005213234324015914543
  • GongSYoonGJangIPSIbase: A database of Protein Structural Interactome map (PSIMAP)Bioinformatics2005212541254315749693
  • NukadaAHouraiYNishidaAHigh performance 3D convolution for protein docking on IBM Blue GeneProc. ISPA 2007 Lecture Notes in Computer Science20074742958969
  • ChenRLiLWengZPZDOCK: An initial-stage protein-docking algorithmProteins200352808712784371
  • ChenRTongWWMintserisJLiLWengZPZDOCK predictions for the CAPRI ChallengeProteins200352687312784369
  • PierceBTongWWWengZPM-ZDOCK: a grid-based approach for C-n symmetric multimer dockingBioinformatics2005211472147815613396
  • ZhangCLiuSZhouYQDocking prediction using biological information, ZDOCK sampling technique, and clustering guided by the DFIRE statistical energy functionProteins20056031431815981255
  • KiselyovOTahaWRelating FFTW and split-radixEmbedded Software and Systems20053605488493
  • VuducRDemmelJWCode generators for automatic tuning of numerical kernels: Experiences with FFTW - Position paperSemantics, Applications and Implementation of Program Generation, Proceedings20001924190211
  • MintserisJWieheKPierceBProtein-protein docking benchmark 2.0: An updateProteins20056021421615981264
  • KobeBDeisenhoferJA structural basis of the interactions between leucine-rich repeats and protein ligandsNature19953741831867877692
  • NgKKSPetersenJFWCherneyMMStructural basis for the inhibition of porcine pepsin by Ascaris pepsin inhibitor-3Nature Struct Biol2000765365710932249
  • MittlPREDiMarcoSFendrichGA new structural class of serine protease inhibitors revealed by the structure of the hirustasin-kallikrein complexStructure19975585585
  • GaoGFTormoJGerthUCCrystal structure of the complex between human CD8 alpha alpha and HLA-A2Nature19973876306349177355
  • ZapfJSenUMadhusudanHochJAVarugheseKIA transient interaction between two phosphorelay proteins trapped in a crystal lattice reveals the mechanism of molecular recognition and phosphotransfer in signal transductionStructure2000885186210997904
  • RenaultLKuhlmannJHenkelAWittinghoferAStructural basis for guanine nucleotide exchange on Ran by the regulator of chromosome condensation (RCC1)Cell200110524525511336674
  • SyedRSReidSWLiCWEfficiency of signalling through cytokine receptors depends critically on receptor orientationNature19983955115169774108
  • MintserisJPierceBWieheKAndersonRChenRWengZPIntegrating statistical pair potentials into protein complex predictionProteins20076951152017623839
  • SevcikJUrbanikovaLDauterZWilsonKSRecognition of RNase Sa by the Inhibitor Barstar: Structure of the complex at 1.7 angstrom resolutionActa Crystallogr D Biol Crystallogr1998549549639757110
  • KoTPNgJDDayJGreenwoodAMcphersonADetermination of three crystal structures of canavalin by molecular replacementActa Crystallogr D Biol Crystallogr19934947848915299507
  • MongkolsapayaJGrimesJMChenNStructure of the TRAILDR5 complex reveals mechanisms conferring specificity in apoptotic initiationNature Struct Biol199961048105310542098
  • McEvoyMMHausrathACRandolphGBRemingtonSJDahlquistFWTwo binding modes reveal flexibility in kinase/response regulator interactions in the bacterial chemotaxis pathwayProc Natl Acad Sci U S A199895733373389636149
  • GulbisJMKazmirskiSLFinkelsteinJKelmanZO’DonnellMKuriyanJCrystal structure of the chi: psi subassembly of the Escherichia coli DNA polymerase clamp-loader complexEur J Biochem200427143944914717711
  • KurzbauerRTeisDde AraujoMEGCrystal structure of the p14/MP1 scaffolding complex flow a twin couple attaches mitogen-activated protein kinase signaling to late endosomesProc Natl Acad Sci U S A2004101109841098915263099
  • HiranoYYoshinagaSTakeyaRStructure of a cell polarity regulator, a complex between atypical PKC and Par6 PB1 domainsJ Biol Chem20052809653966115590654
  • AkutsuMKawasakiMKatohYStructural basis for recognition of ubiquitinated cargo by Tom1-GAT domainFebs Lett1020055795385539116199040
  • Di MatteoAGiovaneARaiolaAStructural basis for the interaction between pectin methylesterase and a specific inhibitor proteinPlant Cell20051784985815722470
  • ChaoXJMuffTJParkSYA receptor-modifying deamidase in complex with a signaling phosphatase reveals reciprocal regulationCell200612456157116469702
  • SimaderHHothornMKohlerCBasquinJSimosGSuckDStructural basis of yeast aminoacyl-tRNA synthetase complex formation revealed by crystal structures of two binary sub-complexesNucleic Acid Res2006343968397916914447
  • DemirciHGregorySTDahlbergAEJoglGRecognition of ribosomal protein L11 by the protein trimethyltransferase PrmAEmbo J20072656757717215866
  • MachiusMWiegandGHuberRCrystal-structure of calcium-depleted Bacillus-licheniformis alpha-amylase at 2.2-angstrom resolutionJ Mol Biol19952465455597877175
  • AndersenGRPedersenLValenteLStructural basis for nucleotide exchange and competition with tRNA in the yeast elongation factor complex eEF1A: eEF1B alphaMol Cell200061261126611106763
  • TodoneFBrickPWernerFWeinzierlROJOnestiSStructure of an archaeal homolog of the eukaryotic RNA polymerase II RPB4/RPB7 complexMol Cell200181137114311741548
  • OmiRMizuguchiHGotoMStructure of imidazole glycerol phosphate synthase from Thermus thermophilus HB8: Open-closed conformational change and ammonia tunnelingJ Biochem200213275976512417026
  • IzardTEvansGBorgonRARushCLBricogneGBoisPRJVinculin activation by talin through helical bundle conversionNature200442717117514702644
  • MaYLDostieJDreyfussGVan DuyneGDThe Gemin6-Gemin7 heterodimer from the survival of motor neurons complex has an Sm protein-like structureStructure20051388389215939020
  • ReynoldsKAThomsonJMCorbettKDStructural and computational characterization of the SHV-1 beta-lactamase-beta- lactamase inhibitor protein interfaceJ Biol Chem2006281267452675316809340
  • HopfnerKPKarcherAShinDSStructural biology of Rad50 ATPase: ATP-driven conformational control in DNA double-strand break repair and the ABC-ATPase superfamilyCell200010178980010892749