Build Database Info

 

  • Description
  • Method
  • Contact
  • Definition of terms

  • Description

    The database of protein-protein co-crystallized structures is built on the basis of biological unit files from PDB (Biounit files).

    PDB files are filtered to exclude "illegitimate" complexes. The chains and complexes are annotated according to their classification or other structural features. A complex can be a pairwise complex (2 chains and one interface) or, in the multi-n-ary case, an association of pairwise complexes. Data is stored in a relational database. The request page offers the possibility to select a non-redundant subset of complexes for a given degree of sequence identity. The database is updated and annotated semi-annually.

    The current database of protein-protein complexes contains:

    49 521 PDB entries from the overall 109 274 PDB Structures

    169 295 Biounit chains

    214 029 pairwise complexes (interfaces between two Biounit chains)



    Primary developer: Dominique Douguet from the Center of Structural Biochemistry (CBS), Montpellier, France.



    Method


    Algorithm


    • The structure is resolved by X-ray diffraction only
    • Obsolete PDB entries are excluded
    • Chains contain at least 30 residues
    • Binary combinations of chains are generated (e.g., 2hhbA and 2hhbB)
    • A physical contact between the 2 subunits must exist
    • Complexes are automatically analysed to exclude 'illegitimate' ones (see Figure 1).
    • Complexes with alternative binding modes (Figure 1c) and multi-n-ary complexes (Figure 1d) are annotated.
    • Non-redundant sets of pairwise complexes are extracted
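
    The chain-length filter and the generation of binary combinations listed above can be illustrated with a minimal Python sketch. The function name pairwise_candidates and the input dictionary are hypothetical, chain X is a made-up short chain, and the contact test and the filtering of 'illegitimate' complexes are not shown.

```python
from itertools import combinations

MIN_RESIDUES = 30  # chains must contain at least 30 residues


def pairwise_candidates(pdb_id, chain_lengths):
    """Return candidate chain pairs for one entry.

    chain_lengths maps a chain name to its residue count (hypothetical input).
    """
    kept = sorted(c for c, n in chain_lengths.items() if n >= MIN_RESIDUES)
    return [(f"{pdb_id}{a}", f"{pdb_id}{b}") for a, b in combinations(kept, 2)]


# Chain "X" (12 residues) is an invented short chain that the filter drops.
pairs = pairwise_candidates("2hhb", {"A": 141, "B": 146, "X": 12})
print(pairs)  # [('2hhbA', '2hhbB')]
```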


    Figure 1. Examples of special cases of complexes.


    (a) Disordered termini that are part of the interface, (b) interwoven chains, (c) an alternative binding mode, the subunit identical to the green ligand is shown in blue, (d) a ternary complex.


    hierarchy figure



    Data Source


    The original data are the Biounit files (see the following link for additional information) along with the original PDB file generated from the mmCIF file by the translation program CIFTr. The latter allows us to retrieve curated SEQRES and DBREF data blocks. The DBREF record provides a cross-reference to a sequence database such as Swiss-Prot or TrEMBL. The EBI's SRS system then allows us to check whether the studied chain has a transmembrane segment.

    PDB keywords (KEYWDS data block) and sequence keywords in the sequence database are also scanned to identify a family classification (ANTIBODY and ANTIGEN, VIRUS, ELECTRON and TRANSPORT, ELECTRON and TRANSFERT, DNA or RNA).



    Analysis of complexes


    Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains are mismatched (e.g. 2csm chain A). Identifying the original chain is important since the previously extracted data (see the above paragraph) is related to the original chain name. Biounit files containing more than 24 MODELs are excluded.



    Illegitimate complexes


    The program NSC is used to calculate the interface area and to identify the interacting residues (Eisenhaber and Argos, J. Comput. Chem., 11:1272-1280, 1993). An automatic filter discards 'illegitimate' complexes, because the presence of false complexes (Figure 1) significantly skews results that rely on them (see Vakser et al., 1999; Tovchigrechko & Vakser, 2001; Tovchigrechko et al., 2002).


    Interwoven chains are identified using information contained in the DBREF record. These complexes are discarded from our Biounit relational database. Two chains are interwoven when two original PDB chains represent a single polymer with a residue gap; such sequences should be consolidated into a single PDB chain. For example, in 2ltn, chains A and B have the same protein accession number (P02867), and the last residue number of the chain A segment (211) in the Swiss-Prot sequence database is smaller than the first residue number of the chain B segment (218) in the same database. The two chains are therefore consecutive segments of the same protein, and chains A and B may be merged into a single chain A.
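
    The DBREF comparison described above might be sketched as follows. This is only an illustration under stated assumptions: the DbRef record and the function are_interwoven are hypothetical names, and in the 2ltn example only the accession number and the residue numbers 211 and 218 come from the text; the other two boundaries are placeholders.

```python
from typing import NamedTuple


class DbRef(NamedTuple):
    """Hypothetical, simplified view of one DBREF record."""
    chain: str
    accession: str  # sequence database accession number
    db_start: int   # first residue number in the sequence database
    db_end: int     # last residue number in the sequence database


def are_interwoven(a: DbRef, b: DbRef) -> bool:
    """True if both chains map to non-overlapping segments of one protein."""
    if a.accession != b.accession:
        return False
    first, second = sorted((a, b), key=lambda r: r.db_start)
    return first.db_end < second.db_start


# 211 and 218 are from the text; 1 and 300 are placeholder values.
chain_a = DbRef("A", "P02867", 1, 211)
chain_b = DbRef("B", "P02867", 218, 300)
print(are_interwoven(chain_a, chain_b))  # True
```

    Two identical chains of a homodimer map to the same database segment, so the overlap test keeps them from being flagged.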


    DBREF of interwoven chains 2ltn


    Tangled chains have a free and unfolded segment of more than 6 residues interacting exclusively with the other chain (see the example below, 1adu). Here, an unfolded segment is a segment with residues having accessible surface area >= 40 Ų (calculated for the extracted/unbound chain by the program NSC). This algorithm can identify some interwoven chains not previously determined because of missing sequence information or errors (e.g. 1lgbAB or 1lGH). However, we also annotate some false positives (2rsp, 1xva, 1ath,...).


    tangled chain
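
    As a rough illustration of the 'unfolded segment' criterion, the sketch below finds runs of more than 6 consecutive residues whose accessible surface area is >= 40 Ų. The function name unfolded_segments and the ASA values are hypothetical, and the second condition (the segment interacts exclusively with the other chain) is not covered.

```python
ASA_THRESHOLD = 40.0  # Å², per-residue ASA computed on the unbound chain
MIN_SEGMENT = 7       # "more than 6 residues"


def unfolded_segments(residue_asa, threshold=ASA_THRESHOLD, min_len=MIN_SEGMENT):
    """Return inclusive (start, end) index pairs of runs of exposed residues."""
    segments = []
    start = None
    for i, asa in enumerate(residue_asa):
        if asa >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(residue_asa) - start >= min_len:
        segments.append((start, len(residue_asa) - 1))
    return segments


# Invented ASA profile: 3 buried, 8 exposed, 2 buried, 4 exposed residues.
profile = [10.0] * 3 + [50.0] * 8 + [5.0] * 2 + [60.0] * 4
print(unfolded_segments(profile))  # [(3, 10)]
```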


    Chains that are disordered at the interface are also annotated (e.g. 1fcbAB in the figure below). A pairwise complex receives this annotation when an unfolded C-ter or N-ter segment of more than 10 residues interacts with a similar segment of the other chain (1sidBF, 1ldcAB, 1sva26,...).


    Disordered at the interface


    The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.


    Example 1GNO (ligand UOE):


    Ligand at the interface


    Example 1GTD with DNA:


    DNA at the interface
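
    The 5 Å proximity test for a ligand, DNA or RNA at the interface could look like the sketch below. It assumes that the distance is measured between atom positions, which the text does not specify; the function name and the coordinates are hypothetical.

```python
import math

CUTOFF = 5.0  # Å


def near_interface(hetero_atoms, interface_atoms, cutoff=CUTOFF):
    """True if any ligand/DNA/RNA atom lies within cutoff of an interface atom.

    Both arguments are iterables of (x, y, z) coordinates in Å.
    """
    return any(
        math.dist(h, r) <= cutoff
        for h in hetero_atoms
        for r in interface_atoms
    )


# Invented coordinates: the first pair is exactly 5 Å apart.
print(near_interface([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))  # True
print(near_interface([(0.0, 0.0, 0.0)], [(3.0, 4.0, 6.0)]))  # False
```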



    Additional annotations


    Additional annotations indicate multimeric states higher than dimer, complex type (HOMO or HETERO) and the presence of alternative binding modes. The threshold of the interface area is 250 Ų (mean ASA buried by each chain). For this purpose, we also use the DBREF record extracted from the mmCIF PDB file.


    multimeric cases


    multimeric cases



    Generating representative sets


    The dataset has options to exclude redundancies based on sequence similarity. Working with representatives avoids the overrepresentation of some classes of proteins and subsequent bias.

    Two pairwise complexes may be representative if one chain of the first pairwise complex is similar to one chain of the second pairwise complex and if the other chains are different.



    Limitations


    A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes existing in vivo from crystal packing artifacts. The in vivo complexes have to be strong enough to be formed at the biological concentration of monomers with no help of the crystal lattice. However, the experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing binding energy-estimating computational procedures to systematic separation of "strong" (biologically relevant) and "weak" (artifacts of crystallization) complexes is not obvious. However, functional considerations, including evolutionary factors, may provide additional help in discriminating crystal packing complexes.



    List of excluded PDB structures





    Contact


    If you have a general question about the database, send an email to dockground@ku.edu.



    Definition of terms


    The multimeric state (or oligomeric state) is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state = 2. If all interface areas are less than 250 Ų, the multimeric state is set to 0. The interface area is the mean ASA buried by each interacting chain. If the PDB entry contains only one chain (e.g. interwoven chains, see 2ltn, 1cov,...), the multimeric state is also set to 0.

    Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Ų. Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file; the interface area between chains B and C is 174 Ų. Example 2: 1qzv has 22 interfaces but only one (A:B) is greater than 250 Ų, so 1qzv has a multimeric state = 2.
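
    The rule above can be sketched in a few lines of Python. The function name multimeric_state and the chain labels X, Y, Z are hypothetical; of the areas used in the example, only the 174 Ų value echoes the text, the other is invented.

```python
AREA_THRESHOLD = 250.0  # Å², mean ASA buried by each chain


def multimeric_state(interface_areas, threshold=AREA_THRESHOLD):
    """Count chains that take part in at least one interface >= threshold.

    interface_areas maps a (chain, chain) pair to its interface area in Å².
    Returns 0 when no interface reaches the threshold.
    """
    chains = set()
    for (a, b), area in interface_areas.items():
        if area >= threshold:
            chains.update((a, b))
    return len(chains)


# Three chains, but only the X-Y interface is large enough.
areas = {("X", "Y"): 900.0, ("Y", "Z"): 174.0}
print(multimeric_state(areas))  # 2
```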



    The area is the mean accessible surface area (ASA) buried by each chain in the pairwise complex:

          area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2

    The accessible surface area is computed by the NSC program (Eisenhaber and Argos, J. Comput. Chem., 11:1272-1280, 1993).
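
    As a worked example of the formula above, the following sketch takes precomputed ASA values (in Ų) and returns the mean buried area. The function name and the numbers are illustrative only; the ASA values themselves would come from a program such as NSC.

```python
def mean_buried_area(asa_chain1, asa_chain2, asa_complex):
    """Mean ASA buried by each chain in a pairwise complex (Å²)."""
    return (asa_chain1 + asa_chain2 - asa_complex) / 2.0


# Invented values: 6000 + 5000 - 10000 = 1000 Å² buried in total.
area = mean_buried_area(6000.0, 5000.0, 10000.0)
print(area)  # 500.0
```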



    Chain names. The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A and the biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g. the GI number, SCOP domain,...). Biounit files containing more than 24 MODELs are excluded from the database.



    SEQRES is the amino acid sequence of residues in the current chain, retrieved from the mmCIF PDB file (the PDB file generated from the mmCIF file by the program CIFTr). In the PDB file, you can find:


    hierarchy figure




    Unbound structures are PDB entries that have only one chain in the Biounit file (Biological unit file) and only one chain in the original PDB file (no crystal packing). Examples: 1g83A (chain A of the PDB entry 1g83) is the monomeric form of the complexed chain C in the 1avz PDB entry.



    A pairwise complex is annotated 'disordered at the interface' when an unfolded C-ter or N-ter segment with more than 10 residues interacts with a similar segment of the other chain (1fcbAB, 1sidBF, 1ldcAB, 1sva26,...).
    disordered



    A pairwise complex is annotated 'tangled' when a free and unfolded segment of more than 6 residues interacts exclusively with the other chain (see the example below, 1adu). Here, an unfolded segment is a segment with accessible surface area >= 40 Ų.


    tangled



    The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interfacing residues) is also identified and annotated.


          Example 1GNO (ligand UOE):


    Ligand at the interface


          Example 1GTD with DNA:


    DNA at the interface



    An alternative binding mode means that a chain/protein may bind another chain/protein at more than one position WITH a mean ASA buried by each chain ≥ 250 Ų.

    To identify such cases, we use the DBREF record extracted from the mmCIF PDB file. If the DBREF record is missing, then, the annotation is 'ND' (Not Determined). The following examples were extracted from the Original PDB content:
    multimeric cases

    For example, the pairwise complex 1apx chain A and B has the annotation '[ A-B A-C ][ B-A B-D ]' as alternative binding mode



    A homo-n-ary or a hetero-n-ary complex indicates a multimeric state higher than 2. In such a complex, each involved chain must interact with all the others with a mean ASA buried by each chain ≥ 250 Ų. For this purpose, we use the DBREF record extracted from the mmCIF PDB file. If the DBREF record is missing, the annotation is 'ND' (Not Determined).


    multimeric cases

    The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (Chain A in the MODEL 1 may interact with chain A in the MODEL 2). MODEL is usually used in NMR-determined structures.

    Therefore, a homo-n-ary or hetero-n-ary complex is defined by the association of the chain name and the model number (e.g. PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4: the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is skipped (e.g. PDB entry 13pk is a homo-n-ary complex combining 3 chains, with no MODEL, and is annotated 'A B C').
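
    The two annotation examples above can be reproduced with a short sketch. The function name nary_annotation and the (chain, model) tuple representation are assumptions made for illustration.

```python
def nary_annotation(members):
    """Build an n-ary annotation string from (chain, model) tuples.

    model is None when the Biounit file has no MODEL section.
    """
    return " ".join(
        chain if model is None else f"{chain}{model}"
        for chain, model in members
    )


print(nary_annotation([("A", 1), ("A", 2), ("A", 3), ("A", 4)]))  # A1 A2 A3 A4
print(nary_annotation([("A", None), ("B", None), ("C", None)]))   # A B C
```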



    The interaction zone indicates the numbers of the first and the last residues in the PDB sequence that interact with the other chain. Combined with the SCOP domain boundaries, it may indicate which domain interacts (especially when the protein contains several domains).



    The complex type is set according to the BLAST result: HOMO means that the sequence identity between the two chains is >= 70% with an E-value < 0.0001.
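
    A minimal sketch of this rule is given below. The text only defines HOMO; returning HETERO for every other case is an assumption, as is the function name complex_type.

```python
IDENTITY_CUTOFF = 70.0  # percent sequence identity
EVALUE_CUTOFF = 1e-4    # BLAST E-value


def complex_type(identity_percent, evalue):
    """Classify a pairwise complex from a BLAST comparison of its two chains."""
    if identity_percent >= IDENTITY_CUTOFF and evalue < EVALUE_CUTOFF:
        return "HOMO"
    return "HETERO"  # assumption: everything that is not HOMO


print(complex_type(98.5, 1e-50))  # HOMO
print(complex_type(25.0, 0.5))    # HETERO
```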



    The automatically selected representative dataset in Easy Mode is generated based on the following constraints:


    (1) PDB entry is not obsolete

    (2) Mean area buried by each chain > 400 Ų

    (3) Multimeric state = 2

    (4) Chains are not tangled

    (5) Chains are not interwoven

    (6) Chains are not disordered at the interface

    (7) No DNA or RNA at the interface

    (8) No S-S bonds between chains

    (9) No ligand at the interface

    (10) Chains are not membrane associated


    In oligomer mode (additional pairwise complexes present in the Biounit file are included):

    (11) Monomeric chains are clustered with sequence identity 30% (by standalone PISCES package)

    (12) Pairwise complexes are re-clustered

    (13) Oligomer complex representatives are selected. The selection is based on the best resolution

    (14) Dimeric representatives are selected (only one interface in the Biounit file)
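
    Constraints (1)-(10) amount to a simple predicate over the annotations of a pairwise complex. The sketch below is illustrative only: the PairwiseComplex record and its field names are hypothetical and do not reflect the actual database schema, and the clustering steps (11)-(14) are not shown.

```python
from dataclasses import dataclass


@dataclass
class PairwiseComplex:
    """Hypothetical record holding the annotations used by the filter."""
    obsolete: bool = False
    mean_buried_area: float = 0.0  # Å²
    multimeric_state: int = 0
    tangled: bool = False
    interwoven: bool = False
    disordered_at_interface: bool = False
    nucleic_acid_at_interface: bool = False
    interchain_disulfide: bool = False
    ligand_at_interface: bool = False
    membrane_associated: bool = False


def passes_easy_mode(c: PairwiseComplex) -> bool:
    """Apply constraints (1)-(10) of the automatically selected set."""
    excluded = (
        c.obsolete, c.tangled, c.interwoven, c.disordered_at_interface,
        c.nucleic_acid_at_interface, c.interchain_disulfide,
        c.ligand_at_interface, c.membrane_associated,
    )
    return (
        c.mean_buried_area > 400.0
        and c.multimeric_state == 2
        and not any(excluded)
    )


good = PairwiseComplex(mean_buried_area=650.0, multimeric_state=2)
small = PairwiseComplex(mean_buried_area=300.0, multimeric_state=2)
print(passes_easy_mode(good), passes_easy_mode(small))  # True False
```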



    The manually created representative dataset in Easy Mode is generated based on the following constraints:


    Constraints (1)-(7) and (11)-(13) are the same as in the automatically selected representative dataset.

    The differences from the automatically selected set:

    -- Keep interfaces containing a metal ion, PO4, SO4 or an S-S bond (annotated)

    -- Keep chains associated with membrane

    -- If several chains interact with other chains as a whole, treat them as one entity (e.g., 1dvf AB:CD)

    -- If structures with sequence identity > 30% have different binding modes, consider them as different entries. Annotation is added for such entries.

    -- Eliminate obligate and crystal packing interactions based on related references and visual inspection of interface size/packing.



    Templates for structural alignment v1.0 are generated based on the following constraints:


    Interfaces. The initial set of all bound hetero- and homo-dimers from DOCKGROUND was reduced using the following requirements:

    (1) X-ray structures with resolution <= 3.5 Å;

    (2) mean accessible surface area buried by each chain >= 250 Ų;

    (3) number of residues at the interface in each chain >= 10.

    Each complex was further checked for inter-penetration by an automated procedure. Application of these criteria resulted in 12,134 redundant complexes. The structural redundancy was eliminated by MM-align (Mukherjee & Zhang, Nucleic Acids Res. 2009, 37: e83). Two interfaces were similar if their TM-score was > 0.9. The similarity graph was generated and clustering performed by an in-house graph clustering procedure, producing 7,107 clusters. Cluster representatives were selected based on the lowest number of missing residues and the best resolution.
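
    The in-house graph clustering procedure is not described here, so the sketch below substitutes plain connected-components clustering (single linkage via union-find) on the similarity graph, where an edge joins two interfaces with TM-score > 0.9. It is a stand-in for illustration, not the procedure actually used; all names and scores are hypothetical.

```python
TM_CUTOFF = 0.9


def cluster_by_similarity(items, scores, cutoff=TM_CUTOFF):
    """Group items into connected components of the similarity graph.

    scores maps an (item, item) pair to its TM-score.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score > cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


# Invented scores: a-b and b-c are similar, d is on its own.
scores = {("a", "b"): 0.95, ("b", "c"): 0.92, ("c", "d"): 0.50}
print(cluster_by_similarity(["a", "b", "c", "d"], scores))
# [['a', 'b', 'c'], ['d']]
```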

    The following notation is used to name the PDB files:

    iXXXXM1CH1M2CH2_N.pdb

    'i' - indicates that the template is from protein interface library
    XXXX - PDB code
    M1, M2 - serial number of the model in the corresponding Biounit file for chains CH1 and CH2
    CH1, CH2 - chain identifiers for the two interacting proteins
    N - 1 for chain CH1 and 2 for chain CH2



    Full structures. The additional requirement for the initial set of 12,134 redundant complexes (see 'Interfaces' above) was the presence of at least three regular secondary structure elements (alpha-helices and/or beta-strands) in each subunit, reducing the number of complexes to 11,774. The structural redundancy was eliminated by MM-align comparison of full structures, with the same criteria as for the interface (see above). The final set consisted of 5,050 structurally non-redundant complexes.

    The following notation is used to name the PDB files:

    XXXXM1CH1M2CH2_N.pdb

    XXXX - PDB code
    M1, M2 - serial number of the model in the corresponding Biounit file for chains CH1 and CH2
    CH1, CH2 - chain identifiers for the two interacting proteins
    N - 1 for chain CH1 and 2 for chain CH2
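
    The naming scheme can be illustrated with a small helper that assembles a template file name from its parts. The function name and the entry "1abc" are hypothetical, and writing model numbers without zero padding is an assumption.

```python
def template_filename(pdb_code, model1, chain1, model2, chain2, part,
                      interface=False):
    """Assemble XXXXM1CH1M2CH2_N.pdb, with an 'i' prefix for interfaces."""
    if part not in (1, 2):
        raise ValueError("part must be 1 (chain CH1) or 2 (chain CH2)")
    prefix = "i" if interface else ""
    return f"{prefix}{pdb_code}{model1}{chain1}{model2}{chain2}_{part}.pdb"


# Hypothetical entry 1abc, chain A of model 1 with chain B of model 2.
print(template_filename("1abc", 1, "A", 2, "B", 1))                  # 1abc1A2B_1.pdb
print(template_filename("1abc", 1, "A", 2, "B", 2, interface=True))  # i1abc1A2B_2.pdb
```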



    Templates for structural alignment v1.1 were generated using a more sophisticated graph clustering algorithm by Hartuv and Shamir (Inform. Process. Lett. 2000, 76: 175-181) to eliminate redundancies. The sets contain 4,950 full structures and 5,936 interfaces.