• Description
  • Method
  • Contact
  • Definition of terms

  • Description

    The database of protein-protein co-crystallized structures is built on the basis of biological unit files from PDB (Biounit files).

    PDB files are filtered to exclude "illegitimate" complexes. The chains and complexes are annotated according to their classification or other structural features. A complex can be a pairwise complex (2 chains and one interface) or, in the multi-n-ary case, an association of pairwise complexes. Data are stored in a relational database. The request page makes it possible to select a non-redundant subset of complexes for a given degree of sequence identity. The database is updated and annotated semi-annually.

    The current database of protein-protein complexes contains:

    49 521 PDB entries from the overall 109 274 PDB Structures

    169 295 Biounit chains

    214 029 pairwise complexes (interfaces between two Biounit chains)



    Primary developer: Dominique Douguet



    Method


    Algorithm


    • The structure is resolved by X-ray diffraction only
    • Obsolete PDB entries are excluded
    • Chains contain at least 30 residues
    • Binary combinations of chains are generated (e.g., 2hhbA and 2hhbB)
    • A physical contact between the 2 subunits must exist
    • Complexes with alternative binding modes (Figure 1c) and multi-n-ary complexes (Figure 1d) are annotated
    • Non-redundant sets of pairwise complexes are extracted
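
    The chain-length filter and the binary-combination step from the list above can be sketched as follows. This is only an illustration: the function name, the input dictionary and the residue counts are assumptions, and the other filters (X-ray only, obsolete entries, physical contact) are not modeled.

```python
from itertools import combinations

MIN_RESIDUES = 30  # minimum chain length, taken from the list above


def candidate_pairs(pdb_code, chain_lengths):
    """Return all binary combinations of chains with at least MIN_RESIDUES residues.

    chain_lengths maps a chain identifier to its residue count (hypothetical input).
    """
    kept = sorted(c for c, n in chain_lengths.items() if n >= MIN_RESIDUES)
    return [(f"{pdb_code}{a}", f"{pdb_code}{b}") for a, b in combinations(kept, 2)]


# Illustrative residue counts; the short chain "P" is dropped by the length filter.
pairs = candidate_pairs("2hhb", {"A": 141, "B": 146, "P": 12})
print(pairs)
```

    Each surviving pair would then be tested for physical contact before being stored as a pairwise complex.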


    Figure 1. Examples of special cases of complexes.


    (a) an alternative binding mode, the subunit identical to the green ligand is shown in blue, (b) a ternary complex.


    hierarchy figure



    Data Source


    The original data are the Biounit files (see the following link for additional information) along with the original PDB file generated from the mmCIF file by the translation program CIFTr. The latter allows us to retrieve curated SEQRES and DBREF data blocks. The DBREF record provides a cross-reference link to a sequence database such as Swiss-Prot or TrEMBL. The EBI's SRS system then allows us to check whether the studied chain has a transmembrane segment.

    PDB keywords (KEYWDS data block) and sequence keywords in the sequence database are also scanned to identify a family classification (ANTIBODY and ANTIGEN, VIRUS, ELECTRON and TRANSPORT, ELECTRON and TRANSFERT, DNA or RNA).



    Analysis of complexes


    Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A). Identifying the original chain is important since the previously extracted data (see the above paragraph) are related to the original chain name. Biounit files containing more than 24 MODELs are excluded.



    Illegitimate complexes


    The program NSC is used to calculate the interface area and to identify the interacting residues (Eisenhaber and Argos, J. Comput. Chem., 11:1272-1280, 1993). An automatic filtering step discards 'illegitimate' complexes, because the presence of false complexes (Figure 1) significantly skews results based on their use (see Vakser et al., 1999; Tovchigrechko & Vakser, 2001; Tovchigrechko et al., 2002).


    Interwoven chains are identified using the information contained in the DBREF record. Two chains are interwoven when two original PDB chains represent a single polymer with a residue gap; such sequences have to be consolidated into a single PDB chain. These complexes are discarded from our Biounit relational database. For example, in 2ltn, chains A and B have the same protein accession number (P02867), and the last residue number of the chain A segment (211) in the Swiss-Prot sequence database is smaller than the first residue number of the chain B segment (218) in the same database. These two chains are therefore consecutive segments of the same protein, and chains A and B may be merged into a single chain A.


    DBREF of interwoven chains 2ltn
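
    The DBREF test described above can be sketched as a small check: two chains are flagged when they share an accession number and their database segments do not overlap. The record type and function name are hypothetical, and in the example only the accession P02867 and the residue numbers 211 and 218 come from the text; the other two residue numbers are placeholders.

```python
from typing import NamedTuple


class DbrefSegment(NamedTuple):
    chain: str       # chain identifier in the original PDB file
    accession: str   # sequence database accession number
    db_start: int    # first residue number in the sequence database
    db_end: int      # last residue number in the sequence database


def are_interwoven(a, b):
    """True if both chains map to non-overlapping segments of the same database entry."""
    if a.accession != b.accession:
        return False
    first, second = sorted((a, b), key=lambda s: s.db_start)
    return first.db_end < second.db_start


# 2ltn-like case: db_start of A (31) and db_end of B (269) are placeholders.
chain_a = DbrefSegment("A", "P02867", 31, 211)
chain_b = DbrefSegment("B", "P02867", 218, 269)
print(are_interwoven(chain_a, chain_b))
```

    Two copies of the same protein in a homodimer share an accession number but cover the same database range, so this check does not flag them.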



    Additional annotations

    The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.


    Example 1GNO (ligand UOE):


    Ligand at the interface


    Example 1GTD with DNA:


    DNA at the interface
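
    The 5 Å criterion can be illustrated with a plain distance test. The text does not say which distance is measured, so this sketch assumes the minimum atom-to-atom distance between the ligand (or nucleic acid) and the interacting residues; the function name and coordinates are hypothetical.

```python
import math

CUTOFF = 5.0  # distance cutoff in Å, taken from the text


def is_at_interface(ligand_atoms, interface_atoms, cutoff=CUTOFF):
    """True if any ligand atom lies within `cutoff` Å of any atom of an interacting residue.

    Both arguments are iterables of (x, y, z) coordinates in Å.
    """
    interface_atoms = list(interface_atoms)
    return any(
        math.dist(p, q) <= cutoff
        for p in ligand_atoms
        for q in interface_atoms
    )


# Made-up coordinates: one ligand atom 3 Å away from an interface atom.
print(is_at_interface([(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]))
```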




    Additional annotations indicate multimeric states higher than dimer, complex type (HOMO or HETERO) and the presence of alternative binding modes. The threshold of the interface area is 250 Ų (mean ASA buried by each chain). For this purpose, we also use the DBREF record extracted from the mmCIF PDB file.


    multimeric cases


    multimeric cases





    Limitations


    A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes existing in vivo from crystal packing artifacts. The in vivo complexes have to be strong enough to be formed at the biological concentration of monomers with no help of the crystal lattice. However, the experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing binding energy-estimating computational procedures to systematic separation of "strong" (biologically relevant) and "weak" (artifacts of crystallization) complexes is not obvious. However, functional considerations, including evolutionary factors, may provide additional help in discriminating crystal packing complexes.



    List of excluded PDB structures




    Contact


    If you have a general question about the database, send an email to dockground@ku.edu.



    Definition of terms


    The multimeric state (or oligomeric state) is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state = 2. If all interface areas are less than 250 Ų, the multimeric state is set to 0. The interface area is the mean ASA buried by each interacting chain. If the PDB entry contains only one chain (e.g. interwoven chains, see 2ltn and 1cov), the multimeric state is also set to 0.

    Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Ų. Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file, because the interface area between chains B and C is 174 Ų. Example 2: 1qzv has 22 interfaces, but only one (A:B) is greater than 250 Ų, so 1qzv has a multimeric state of 2.
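
    As a rough sketch of this definition, the multimeric state can be computed by counting the chains that take part in at least one interface of 250 Ų or more. The function name and input format are assumptions. The example mirrors the 1gyr case, but only the B-C value (174 Ų) comes from the text; the A-B value is a placeholder.

```python
THRESHOLD = 250.0  # minimum interface area in Å^2 (mean ASA buried by each chain)


def multimeric_state(interface_areas):
    """Count chains involved in at least one interface of THRESHOLD Å^2 or more.

    interface_areas maps a (chain1, chain2) pair to its interface area in Å^2.
    Returns 0 when no interface reaches the threshold.
    """
    interacting = set()
    for (chain1, chain2), area in interface_areas.items():
        if area >= THRESHOLD:
            interacting.update((chain1, chain2))
    return len(interacting)


# Three chains, but the B-C interface is too small to count (A-B value is a placeholder).
areas = {("A", "B"): 900.0, ("B", "C"): 174.0}
print(multimeric_state(areas))
```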



    The area is the mean accessible surface area (ASA) buried by each chain in the pairwise complex:

          area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2

    The accessible surface area is computed by the NSC program (Eisenhaber and Argos, J. Comput. Chem., 11:1272-1280, 1993).
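
    The formula above translates directly into code. In the database the ASA values come from NSC; here they are plain numbers chosen for illustration, and the function name is hypothetical.

```python
def buried_area(asa_chain1, asa_chain2, asa_complex):
    """Mean ASA buried by each chain in a pairwise complex, in Å^2.

    asa_chain1, asa_chain2: ASA of each chain computed in isolation.
    asa_complex: ASA of the two chains computed together.
    """
    return ((asa_chain1 + asa_chain2) - asa_complex) / 2.0


# Illustrative values only: 6000 + 5000 - 10000 = 1000 Å^2 buried in total.
area = buried_area(6000.0, 5000.0, 10000.0)
print(area, area >= 250.0)
```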



    Chain names: The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A and the biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g. the GI number, SCOP domain, etc.). Biounit files containing more than 24 MODELs are excluded from the database.



    SEQRES is the amino acid sequence of the current chain, retrieved from the mmCIF PDB file (the PDB file generated from the mmCIF file by the program CIFTr). In the PDB file, you can find:


    hierarchy figure




    Unbound structures are PDB entries that have only one chain in the Biounit file (Biological unit file) and only one chain in the original PDB file (no crystal packing). Example: 1g83A (chain A of PDB entry 1g83) is the monomeric form of the complexed chain C in PDB entry 1avz.



    An alternative binding mode means that a chain/protein may bind another chain/protein at more than one position WITH a mean ASA buried by each chain ≥ 250 Ų.

    To identify such cases, we use the DBREF record extracted from the mmCIF PDB file. If the DBREF record is missing, the annotation is 'ND' (Not Determined). The following examples were extracted from the original PDB content:
    multimeric cases

    For example, the pairwise complex of 1apx chains A and B has the alternative binding mode annotation '[ A-B A-C ][ B-A B-D ]'.



    A homo-n-ary or hetero-n-ary complex indicates a multimeric state higher than 2. In such a complex, each chain involved must interact with all the others WITH a mean ASA buried by each chain ≥ 250 Ų. For this purpose, we use the DBREF record extracted from the mmCIF PDB file. If the DBREF record is missing, the annotation is 'ND' (Not Determined).


    multimeric cases

    The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (Chain A in the MODEL 1 may interact with chain A in the MODEL 2). MODEL is usually used in NMR-determined structures.

    Therefore, a homo-n-ary or hetero-n-ary complex is defined by the association of the chain name and the model number (e.g. PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4: the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is omitted (e.g. PDB entry 13pk is a homo-n-ary complex combining 3 chains, with no MODEL section, and is annotated 'A B C').
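
    The annotation strings quoted above can be reproduced with a short helper. This is a sketch of the naming rule only; the function name and the use of None for "no MODEL section" are assumptions.

```python
def nary_annotation(members):
    """Build an n-ary annotation from (chain, model) pairs.

    model is the MODEL number, or None when the Biounit file has no MODEL section.
    """
    return " ".join(
        chain if model is None else f"{chain}{model}"
        for chain, model in members
    )


# The two cases quoted in the text: 1q5n (four models of chain A) and 13pk (no MODEL).
print(nary_annotation([("A", 1), ("A", 2), ("A", 3), ("A", 4)]))
print(nary_annotation([("A", None), ("B", None), ("C", None)]))
```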



    The interaction zone indicates the numbers of the first and the last residues in the PDB sequence that interact with the other chain. Combined with the SCOP domain boundaries, it may indicate which domain interacts (especially when the protein contains several domains).
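
    A minimal sketch of the interaction zone, assuming the interacting residues are available as a list of residue numbers. The comparison with domain boundaries is one possible reading of "may indicate which domain interacts"; the function names, domain labels and boundaries are made up for illustration.

```python
def interaction_zone(interacting_residue_numbers):
    """Return (first, last) interacting residue numbers, or None if there are none."""
    numbers = list(interacting_residue_numbers)
    if not numbers:
        return None
    return min(numbers), max(numbers)


def overlapping_domains(zone, domains):
    """Return the names of domains whose (start, end) range overlaps the zone."""
    first, last = zone
    return [name for name, (start, end) in domains.items()
            if start <= last and end >= first]


# Hypothetical residue numbers and domain boundaries.
zone = interaction_zone([45, 12, 88])
print(zone)
print(overlapping_domains(zone, {"d1": (1, 60), "d2": (61, 150), "d3": (151, 300)}))
```

    Because the zone only records the first and last interacting residues, a domain lying between them is reported even if none of its own residues is in contact.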



    The complex type is set according to the BLAST result: HOMO means that the sequence identity is >= 70% and the E-value is < 0.0001.
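
    The rule can be written as a one-line classifier. The text defines only HOMO; labeling every other pair HETERO is an assumption of this sketch, as is the function name.

```python
IDENTITY_CUTOFF = 70.0   # percent sequence identity, from the text
EVALUE_CUTOFF = 0.0001   # BLAST E-value, from the text


def complex_type(identity_percent, evalue):
    """Classify a pairwise complex from the BLAST comparison of its two chains."""
    if identity_percent >= IDENTITY_CUTOFF and evalue < EVALUE_CUTOFF:
        return "HOMO"
    return "HETERO"  # assumption: anything that is not HOMO


print(complex_type(98.5, 1e-80))
print(complex_type(25.0, 0.5))
```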





    The manually created representative dataset in Easy Mode is generated based on the following constraints:


    Constraints (1)-(7) and (11)-(13) are the same as in the automatically selected representative dataset.

    The differences from the automatically selected set are:

    -- Keep interfaces containing a metal ion, PO4, SO4 or an S-S bond (annotated)

    -- Keep chains associated with a membrane

    -- If several chains interact with other chains as a whole, treat them as one entity (e.g., 1dvf AB:CD)

    -- If structures with sequence identity > 30% have different binding modes, consider them as different entries. An annotation is added for such entries.

    -- Eliminate obligate and crystal packing interactions based on related references and visual inspection of interface size/packing.



    Templates for structural alignment v1.0 are generated based on the following constraints:


    Interfaces. The initial set of all bound hetero- and homo-dimers from DOCKGROUND was reduced using the following requirements:

    (1) X-ray structures with resolution <= 3.5 Å;

    (2) mean accessible surface area buried by each chain >= 250 Ų;

    (3) number of residues at the interface in each chain >= 10.

    Each complex was further checked for inter-penetration by an automated procedure. Application of these criteria resulted in 12,134 redundant complexes. The structural redundancy was eliminated using MM-align (Mukherjee & Zhang, Nucleic Acids Res. 2009, 37: e83). Two interfaces were considered similar if their TM-score was > 0.9. A similarity graph was generated and clustered by an in-house graph clustering procedure, producing 7,107 clusters. Cluster representatives were selected based on the lowest number of missing residues and the best resolution.
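
    The in-house clustering procedure is not described here, so the sketch below substitutes plain connected components on the similarity graph (an edge wherever the TM-score is > 0.9) to illustrate the idea. The order of the two selection criteria (missing residues first, then resolution) is also an assumption, as are all names and values.

```python
def cluster_by_similarity(ids, tm_scores, cutoff=0.9):
    """Group ids into connected components of the graph with edges where TM-score > cutoff.

    tm_scores maps an (id1, id2) pair to its TM-score.
    """
    neighbors = {i: set() for i in ids}
    for (a, b), score in tm_scores.items():
        if score > cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, clusters = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


def representative(cluster, missing_residues, resolution):
    """Pick the member with the fewest missing residues, then the lowest resolution value."""
    return min(cluster, key=lambda i: (missing_residues[i], resolution[i]))


# Made-up complexes and scores.
clusters = cluster_by_similarity(["c1", "c2", "c3"],
                                 {("c1", "c2"): 0.95, ("c2", "c3"): 0.5})
print(clusters)
print(representative(clusters[0], {"c1": 3, "c2": 0}, {"c1": 1.8, "c2": 2.5}))
```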

    The following notation is used to name the PDB files:

    iXXXXM1CH1M2CH2_N.pdb

    'i' - indicates that the template is from protein interface library
    XXXX - PDB code
    M1, M2 - serial number of the model in the corresponding Biounit file for chains CH1 and CH2
    CH1, CH2 - chain identifiers for the two interacting proteins
    N - 1 for chain CH1 and 2 for chain CH2



    Full structures. The additional requirement for the initial set of 12,134 redundant complexes (see 'Interfaces' above) was the presence of at least three regular secondary structure elements (alpha-helices and/or beta-strands) in each subunit, reducing the number of complexes to 11,774. The structural redundancy was eliminated by MM-align comparison of full structures, with the same criteria as for the interface (see above). The final set consisted of 5,050 structurally non-redundant complexes.

    The following notation is used to name the PDB files:

    XXXXM1CH1M2CH2_N.pdb

    XXXX - PDB code
    M1, M2 - serial number of the model in the corresponding Biounit file for chains CH1 and CH2
    CH1, CH2 - chain identifiers for the two interacting proteins
    N - 1 for chain CH1 and 2 for chain CH2
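
    The two naming schemes can be illustrated with a small helper that assembles a file name from its parts. It assumes that model numbers are written without padding and that the PDB code is used as given; neither detail is specified above. The function name and the example entry are hypothetical.

```python
def template_filename(pdb_code, model1, chain1, model2, chain2, n, interface=True):
    """Assemble a template file name following the notation described above.

    interface=True adds the leading 'i' used for the interface library;
    n is 1 for chain CH1 and 2 for chain CH2.
    """
    if n not in (1, 2):
        raise ValueError("n must be 1 (chain CH1) or 2 (chain CH2)")
    prefix = "i" if interface else ""
    return f"{prefix}{pdb_code}{model1}{chain1}{model2}{chain2}_{n}.pdb"


# Hypothetical entry "1abc": chain A of model 1 paired with chain B of model 1.
print(template_filename("1abc", 1, "A", 1, "B", 1))
print(template_filename("1abc", 1, "A", 1, "B", 2, interface=False))
```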



    Templates for structural alignment v1.1 were generated using a more sophisticated graph clustering algorithm by Hartuv and Shamir (Inform. Process. Lett. 2000, 76: 175-181) to eliminate redundancies. The sets contain 4,950 full structures and 5,936 interfaces.