Dockground

Bound Set


Description


The database of experimentally determined protein-protein structures is built from the biological unit files of the PDB (Biounit files).

The chains and complexes are annotated according to their classification or other structural features. Each structure is split into pairs of interacting chains. Data are stored in a relational PostgreSQL database. The Build Database page lets users select a subset of complexes with user-defined search criteria. The database is regularly updated and annotated.


The database of protein-protein complexes currently contains (as of Nov. 17th, 2020):


Method


Selection criteria




Data source


Data in the repository are extracted from PDB biounit files. When an entry has multiple biounit files, the first one is used.



Analysis of complexes


Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). Chains are considered interacting if the interface area between them is larger than 250 Ų (mean ASA buried by each chain). Interface areas are read from local files pre-computed for all PDB structures.
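The selection step described above can be sketched as follows. This is only an illustration, not the Dockground implementation: the `(chain name, MODEL number)` tuple used as a chain identifier, the function name `interacting_pairs`, and the input dictionary are assumptions, and the example areas are toy values. The strict comparison follows the wording "larger than 250 Ų"; how a value of exactly 250 Ų is treated is also an assumption.

```python
from typing import Dict, List, Tuple

# Threshold from the text: mean ASA buried by each chain, in square angstroms.
AREA_THRESHOLD = 250.0

# Hypothetical chain identifier: (chain name, MODEL number).
ChainId = Tuple[str, int]
Pair = Tuple[ChainId, ChainId]


def interacting_pairs(areas: Dict[Pair, float]) -> List[Pair]:
    """Return the chain pairs whose pre-computed interface area exceeds the threshold."""
    return sorted(pair for pair, area in areas.items() if area > AREA_THRESHOLD)


# Toy pre-computed interface areas (not taken from any PDB entry).
areas = {
    (("A", 1), ("A", 2)): 812.4,  # same chain name, distinguished by MODEL number
    (("A", 1), ("B", 1)): 174.0,  # below the threshold, so not kept
}
print(interacting_pairs(areas))
```

Keeping the MODEL number inside the identifier is what allows chain A in MODEL 1 and chain A in MODEL 2 to be treated as two distinct interaction partners.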



Additional annotations


The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.


Example 1GNO (ligand UOE):


Ligand at the interface


Example 1GTD with DNA:


DNA at the interface
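The 5 Å criterion for a ligand, DNA or RNA at the interface could be checked along these lines. This is a minimal sketch under stated assumptions: atoms are given as plain `(x, y, z)` coordinates in Å, the function name `at_interface` is hypothetical, and the distance is measured atom to atom, which the text does not specify.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]

# Cutoff from the text: <= 5 angstroms from the interacting residues.
CUTOFF = 5.0


def at_interface(molecule_atoms: Sequence[Coord],
                 interface_atoms: Sequence[Coord],
                 cutoff: float = CUTOFF) -> bool:
    """True if any atom of the molecule lies within `cutoff` of any interface atom."""
    return any(
        math.dist(a, b) <= cutoff
        for a in molecule_atoms
        for b in interface_atoms
    )


# Toy coordinates (not taken from any PDB entry).
ligand = [(0.0, 0.0, 0.0)]
interface = [(3.0, 4.0, 0.0), (20.0, 0.0, 0.0)]
print(at_interface(ligand, interface))
```

A brute-force double loop is enough for a sketch; for whole-PDB processing a spatial index would be the usual choice.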




Additional annotations include multimeric states higher than dimer (homo-n-ary or hetero-n-ary) and complex type (HOMO or HETERO). The complex type is defined by a BLAST alignment of the interacting chains: if the sequence identity is greater than 70% and the E-value is below 0.0001, the complex is HOMO; otherwise it is HETERO.
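The HOMO/HETERO rule above reduces to a two-condition test. The sketch below assumes the identity (in percent) and the E-value have already been parsed from a BLAST result; the function name `complex_type` is hypothetical.

```python
def complex_type(identity_pct: float, evalue: float) -> str:
    """Classify a pair of interacting chains from their BLAST alignment.

    HOMO requires identity > 70% and E-value < 0.0001 (thresholds from the text);
    anything else is HETERO.
    """
    if identity_pct > 70.0 and evalue < 1e-4:
        return "HOMO"
    return "HETERO"


print(complex_type(98.5, 1e-50))
print(complex_type(35.0, 1e-50))
```

Both conditions must hold, so a short, highly identical but statistically insignificant alignment is still classified as HETERO.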


multimeric cases


In homo-n-ary and hetero-n-ary complexes, each chain involved must interact with all the others, with a mean ASA buried by each chain ≥ 250 Ų. For this purpose, we use the UniProt accession number extracted from the mmCIF PDB file. If the UniProt accession number is missing, the annotation is 'ND' (Not Determined). A homo-n-ary or hetero-n-ary complex is identified by the combination of the chain name and the model number (e.g., PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4: the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is omitted (e.g., PDB entry 13pk is a homo-n-ary complex combining 3 chains (there is no MODEL) and is annotated 'A B C').
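The two rules in this paragraph, all-against-all interaction and the chain/MODEL label, can be illustrated as below. This is a sketch only: the function names, the `(chain name, MODEL number or None)` representation, and the set of interacting pairs are assumptions made for the example.

```python
from itertools import combinations
from typing import FrozenSet, Optional, Sequence, Set, Tuple

# Hypothetical chain identifier: (chain name, MODEL number or None if no MODEL section).
ChainId = Tuple[str, Optional[int]]


def is_fully_connected(chains: Sequence[ChainId],
                       interacting: Set[FrozenSet[ChainId]]) -> bool:
    """True if every chain interacts with every other chain in the group."""
    return all(frozenset(pair) in interacting for pair in combinations(chains, 2))


def annotation(chains: Sequence[ChainId]) -> str:
    """Build the label, e.g. 'A1 A2 A3 A4', or 'A B C' when there is no MODEL."""
    return " ".join(
        name + (str(model) if model is not None else "")
        for name, model in chains
    )


print(annotation([("A", 1), ("A", 2), ("A", 3), ("A", 4)]))
print(annotation([("A", None), ("B", None), ("C", None)]))
```

The set `interacting` is expected to hold only the pairs that already passed the ≥ 250 Ų criterion.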





Limitations


A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes that exist in vivo from crystal packing artifacts. In vivo complexes have to be strong enough to form at the biological concentration of monomers without the help of the crystal lattice. However, the experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing computational procedures for estimating binding energy to the systematic separation of "strong" (biologically relevant) and "weak" (artifacts of crystallization) complexes is not obvious. Functional considerations, including evolutionary factors, may nevertheless provide additional help in discriminating crystal packing complexes.



Contact


If you have a general question about the database, send an email to dockground@ku.edu.



Definition of terms


The multimeric state (or oligomeric state) is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state of 2. If all interface areas are less than 250 Ų, the multimeric state is set to 0. The interface area is the sum of the mean ASA buried by the interacting chains. If the PDB entry contains only one chain (e.g., interwoven chains; see 2ltn, 1cov, ...), the multimeric state is also set to 0.

Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Ų. Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file, because the interface area between chains B and C is 174 Ų. Example 2: 1qzv has 22 interfaces, but only one (A:B) is greater than 250 Ų, so 1qzv has a multimeric state of 2.
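The definition of the multimeric state can be sketched as a count over the pairwise interface areas. The function name `multimeric_state` and the input layout are assumptions, the example areas are toy values rather than data from a PDB entry, and counting an area of exactly 250 Ų as interacting follows the "less than 250 Ų" wording of this section.

```python
from typing import Dict, Set, Tuple

AREA_THRESHOLD = 250.0  # square angstroms, from the text


def multimeric_state(pair_areas: Dict[Tuple[str, str], float],
                     threshold: float = AREA_THRESHOLD) -> int:
    """Count the chains that interact with at least one other chain.

    Returns 0 when no interface reaches the threshold, which also covers
    an entry with a single chain (no pairs at all).
    """
    interacting: Set[str] = set()
    for (chain1, chain2), area in pair_areas.items():
        if area >= threshold:
            interacting.update((chain1, chain2))
    return len(interacting)


# Toy three-chain entry: only the X:Y interface is large enough.
toy = {("X", "Y"): 900.0, ("Y", "Z"): 174.0, ("X", "Z"): 0.0}
print(multimeric_state(toy))
```

In the toy entry three chains are present but the state is 2, the same situation described for entries whose Biounit file holds more chains than the reported multimeric state.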



The area is the mean accessible surface area (ASA) buried by each chain in the pairwise complex:

      area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2

The accessible surface area is computed using the FreeSASA program (Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculation. F1000Research 5:189, doi: 10.12688/f1000research.7931.1).
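As a worked instance of the area formula, the sketch below takes three ASA values that are assumed to have been computed beforehand (for example with FreeSASA, on each isolated chain and on the pairwise complex). The function name `mean_buried_area` and the numbers are illustrative only.

```python
def mean_buried_area(asa_chain1: float, asa_chain2: float, asa_complex: float) -> float:
    """Mean ASA buried by each chain in a pairwise complex, in square angstroms.

    area = [(ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex)] / 2
    """
    return ((asa_chain1 + asa_chain2) - asa_complex) / 2.0


# Toy values: the two isolated chains expose 5000 and 4000 square angstroms,
# and the pairwise complex exposes 8400.
area = mean_buried_area(5000.0, 4000.0, 8400.0)
print(area)  # (9000 - 8400) / 2 = 300.0
```

With these toy values the interface buries 300 Ų per chain on average, which is above the 250 Ų threshold used throughout this page.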



Chain names: The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of each biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g., 2csm chain A and the biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g., the GI number, ECOD domain, ...). Biounit files containing more than 24 MODELs are excluded from the database.



SEQRES is the amino acid sequence of the residues in the current chain, retrieved from the mmCIF PDB file (a PDB file generated from the mmCIF file using the program CIFTr). In the PDB file, you can find:


hierarchy figure




Unbound structures are PDB entries that have only one chain in the Biounit file (biological unit file) and only one chain in the original PDB file (no crystal packing). Example: 1g83A (chain A of PDB entry 1g83) is the monomeric form of the complexed chain C in PDB entry 1avz.



The interaction range indicates the numbers of the first and last residues in the PDB sequence that interact with the other chain. Combined with the ECOD domain boundaries, it may indicate which domain interacts, especially when the protein contains several domains (under development). (H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B. H. Kim, N. V. Grishin. (2014) ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12): e1003926; H. Cheng, Y. Liao, R. D. Schaeffer, N. V. Grishin. (2015) Manual classification strategies in the ECOD database. Proteins 83(7): 1238-1251.)
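The interaction range as defined above is simply the lowest and highest residue number among the interacting residues of a chain. A minimal sketch, assuming the interacting residue numbers are already known as integers (insertion codes are ignored here) and using the hypothetical name `interaction_range`:

```python
from typing import Iterable, Optional, Tuple


def interaction_range(residue_numbers: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (first, last) interacting residue numbers, or None if there are none."""
    numbers = sorted(set(residue_numbers))
    if not numbers:
        return None
    return numbers[0], numbers[-1]


# Toy residue numbers (not taken from any PDB entry).
print(interaction_range([57, 12, 103, 58, 12]))
```

The range only brackets the interface; residues between the first and last number do not necessarily take part in the interaction.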