Dockground
Description
The database of experimentally determined protein-protein structures is built from the biological unit files of the PDB (Biounit files).
The chains and complexes are annotated according to their classification or other structural features. Each structure is split into pairs of interacting chains. Data are stored in a relational PostgreSQL database. The Build Database page lets users select a subset of complexes with user-defined search criteria. The database is regularly updated and annotated.
The database of protein-protein complexes is updated weekly and currently contains:
- 101948 PDB entries
- 461667 Chains
- 1277021 Pairwise complexes (interfaces between two Biounit chains)
Membrane Set
The database of experimentally determined membrane protein-protein structures is built on the basis of the Orientations of Proteins in Membranes (OPM) database.
Each structure is split into pairs of chains that interact in the membrane. These chains are parsed so that only the transmembrane section of the protein is in the download file.
The database of membrane protein-protein experimentally determined complexes currently contains (as of Sept. 29th, 2021):
- 275 PDB entries
Method
Selection criteria
Bound Set
- The structure's resolution must be less than 6 Angstroms
- Obsolete PDB entries are excluded
- Chains contain at least 20 residues
- Interface between interacting chains should exceed 250 Ų per chain.
Membrane Set
- The structure's resolution must be less than 6 Angstroms
- Chains contain at least 20 transmembrane residues
- Interface between interacting chains should exceed 250 Ų per chain.
- Transmembrane portion must contain at least one alpha helix
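The selection criteria for both sets can be read as a simple filter over candidate pairs of chains. The following is a minimal sketch, not the Dockground pipeline itself: the `PairCandidate` record and its field names are hypothetical, "exceed 250 Ų" is read as a strict inequality, and the helix count is assumed to refer to the transmembrane portion of the pair as a whole.

```python
from dataclasses import dataclass

MAX_RESOLUTION = 6.0        # Angstroms (exclusive upper bound)
MIN_RESIDUES = 20           # per chain; transmembrane residues for the Membrane Set
MIN_INTERFACE_AREA = 250.0  # square Angstroms buried per chain


@dataclass
class PairCandidate:
    """Hypothetical record describing one candidate pair of chains."""
    resolution: float
    residues_chain1: int
    residues_chain2: int
    interface_area: float
    obsolete: bool = False
    tm_helices: int = 0      # alpha helices in the transmembrane portion


def _common_checks(p: PairCandidate) -> bool:
    # Criteria shared by the Bound Set and the Membrane Set.
    return (
        p.resolution < MAX_RESOLUTION
        and min(p.residues_chain1, p.residues_chain2) >= MIN_RESIDUES
        and p.interface_area > MIN_INTERFACE_AREA
    )


def passes_bound_set(p: PairCandidate) -> bool:
    return _common_checks(p) and not p.obsolete


def passes_membrane_set(p: PairCandidate) -> bool:
    return _common_checks(p) and p.tm_helices >= 1
```

For the Membrane Set, the residue counts passed in would be the transmembrane residue counts rather than the full chain lengths.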
Data source
Bound Set
Data in the repository are extracted from PDB Biounit files. When an entry has multiple Biounit files, the first one is used.
Membrane Set
Data in the repository are modified from the OPM polytopic alpha-helical dataset.
Analysis of complexes
Bound Set
Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). Chains are considered interacting if the interface area between them is larger than 250 Ų (mean ASA buried by each chain). Interface areas are extracted from local files pre-computed for all PDB structures.
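As an illustration of this step, the sketch below keys each chain by its name together with its MODEL number and keeps the pairs whose pre-computed interface area is above the cutoff. The data layout (a dictionary of areas keyed by chain pairs) and the area values are assumptions made for the example; the format of the local pre-computed files is not described here.

```python
from typing import Dict, List, Optional, Tuple

# A chain is identified by (chain name, MODEL number); the MODEL number is
# None when the Biounit file has no MODEL section.
ChainId = Tuple[str, Optional[int]]
Pair = Tuple[ChainId, ChainId]

INTERFACE_CUTOFF = 250.0  # square Angstroms, mean ASA buried by each chain


def interacting_pairs(areas: Dict[Pair, float],
                      cutoff: float = INTERFACE_CUTOFF) -> List[Pair]:
    """Return the chain pairs whose interface area is larger than the cutoff."""
    return [pair for pair, area in areas.items() if area > cutoff]


# Hypothetical pre-computed areas for a Biounit file with two MODELs.
example_areas = {
    (("A", 1), ("A", 2)): 812.4,   # chain A in MODEL 1 vs chain A in MODEL 2
    (("A", 1), ("B", 1)): 96.0,
    (("A", 2), ("B", 1)): 431.7,
}
pairs = interacting_pairs(example_areas)
```

Keeping the MODEL number in the key is what allows the two copies of chain A to be treated as distinct interaction partners.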
Membrane Set
PDB entries are analyzed to extract chains that interact with each other in the membrane.
Chains are considered interacting if the interface area between them is larger than 250 Ų (mean ASA buried by each chain). Transmembrane segments extracted from the OPM alpha-helical polytopic dataset were used to extract interface areas.
These complexes were then filtered based on TM-score. Clustering was done using highly connected subgraphs with a clustering cutoff of 0.6 Å.
Additional annotations
Bound Set
The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.
Examples: 1GNO (ligand UOE) and 1GTD (DNA).
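The 5 Å criterion can be sketched as a distance test between the atoms of the ligand (or nucleic acid) and the atoms of the interacting residues. This is only an illustration under the assumption that the distance is measured atom to atom; the function name and the coordinates are hypothetical.

```python
import math
from typing import Iterable, Tuple

Coord = Tuple[float, float, float]
INTERFACE_DISTANCE = 5.0  # Angstroms


def is_at_interface(hetero_atoms: Iterable[Coord],
                    interface_atoms: Iterable[Coord],
                    cutoff: float = INTERFACE_DISTANCE) -> bool:
    """True if any ligand/DNA/RNA atom lies within `cutoff` of any atom
    belonging to an interacting (interface) residue."""
    interface_atoms = list(interface_atoms)  # allow repeated iteration
    return any(
        math.dist(h, a) <= cutoff
        for h in hetero_atoms
        for a in interface_atoms
    )


# Hypothetical coordinates: one ligand atom and two interface atoms.
ligand = [(0.0, 0.0, 0.0)]
near_interface = [(3.0, 4.0, 0.0), (12.0, 0.0, 0.0)]   # first atom is 5 A away
far_interface = [(6.0, 0.0, 0.0)]
```

With these coordinates, `is_at_interface(ligand, near_interface)` is true and `is_at_interface(ligand, far_interface)` is false.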
Additional annotations include multimeric states higher than dimer (homo-n-ary or hetero-n-ary) and complex type (HOMO or HETERO). The complex type is defined by BLAST alignment of the interacting chains: if the sequence identity is greater than 70% and the E-value is < 0.0001, the complex is HOMO; otherwise it is HETERO.
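The HOMO/HETERO rule reduces to a two-condition test on the alignment result. The sketch below assumes the sequence identity (in percent) and the E-value have already been obtained from a BLAST alignment of the two chains; running BLAST itself is outside its scope, and the function name is hypothetical.

```python
IDENTITY_CUTOFF = 70.0   # percent sequence identity
EVALUE_CUTOFF = 0.0001


def complex_type(identity_percent: float, evalue: float) -> str:
    """Classify a pair of interacting chains as 'HOMO' or 'HETERO'."""
    if identity_percent > IDENTITY_CUTOFF and evalue < EVALUE_CUTOFF:
        return "HOMO"
    return "HETERO"
```

Both conditions must hold for HOMO, so a pair with exactly 70% identity, or with high identity but a weak E-value, is annotated HETERO.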
In homo-n-ary and hetero-n-ary complexes, each chain involved must interact with all the others, with a mean ASA buried by each chain ≥ 250 Ų. For this purpose, we use the UniProt accession number extracted from the mmCIF PDB file. If the UniProt accession number is missing, the annotation is 'ND' (Not Determined). A homo-n-ary or hetero-n-ary complex is therefore defined by the association of the chain name and the MODEL number (e.g. PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4; the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is omitted (e.g. PDB entry 13pk is a homo-n-ary complex combining 3 chains, with no MODEL section, and is annotated 'A B C').
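The all-against-all requirement and the resulting annotation string can be sketched as follows. The layout of the area table and the area values are assumptions for illustration (the real interface areas of 1q5n and 13pk are not given here); only the label format 'A1 A2 A3 A4' and 'A B C' comes from the description above.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional, Tuple

ChainId = Tuple[str, Optional[int]]  # (chain name, MODEL number or None)
NARY_CUTOFF = 250.0                  # square Angstroms


def nary_label(chains: List[ChainId],
               areas: Dict[FrozenSet[ChainId], float],
               cutoff: float = NARY_CUTOFF) -> Optional[str]:
    """Return the annotation (e.g. 'A1 A2 A3 A4') if every chain interacts
    with every other chain with an area >= cutoff, otherwise None."""
    for c1, c2 in combinations(chains, 2):
        if areas.get(frozenset((c1, c2)), 0.0) < cutoff:
            return None
    # Chain name followed by the MODEL number; the number is skipped when
    # the Biounit file has no MODEL section.
    return " ".join(
        name if model is None else f"{name}{model}" for name, model in chains
    )


# Four copies of chain A in MODELs 1-4, with hypothetical areas of 400.
with_models = [("A", m) for m in (1, 2, 3, 4)]
areas_models = {frozenset(p): 400.0 for p in combinations(with_models, 2)}

# Three chains without MODEL section, with hypothetical areas of 300.
no_models = [("A", None), ("B", None), ("C", None)]
areas_no_models = {frozenset(p): 300.0 for p in combinations(no_models, 2)}
```

Here `nary_label(with_models, areas_models)` gives 'A1 A2 A3 A4' and `nary_label(no_models, areas_no_models)` gives 'A B C'; removing or lowering any single pairwise area makes the function return None.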
Limitations
A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes existing in vivo from crystal packing artifacts. The in vivo complexes have to be strong enough to be formed at the biological concentration of monomers with no help of the crystal lattice. However, the experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing binding energy-estimating computational procedures to systematic separation of "strong" (biologically relevant) and "weak" (artifacts of crystallization) complexes is not obvious. However, functional considerations, including evolutionary factors, may provide additional help in discriminating crystal packing complexes.
Contact
If you have a general question about the database, send an email to dockground@ku.edu.
Definition of terms
The multimeric state (or oligomeric state) is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state of 2. If all interface areas are less than 250 Ų, the multimeric state is set to 0. The interface area is the sum of the mean ASA buried by each interacting chain. If the PDB entry contains only one chain (e.g. interwoven chains, see 2ltn, 1cov, ...), the multimeric state is also set to 0.
Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Ų. Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file: the interface area between chains B and C is 174 Ų. Example 2: 1qzv has 22 interfaces, but only one (A:B) is greater than 250 Ų, so 1qzv has a multimeric state of 2.
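The definition of the multimeric state amounts to counting the chains that take part in at least one interface at or above the cutoff. The sketch below follows that reading; the function name and the table layout are hypothetical, and in the example only the B:C area of 174 Ų is taken from the 1gyr case above, while the other two areas are invented for illustration.

```python
from typing import Dict, Tuple

STATE_CUTOFF = 250.0  # square Angstroms


def multimeric_state(areas: Dict[Tuple[str, str], float],
                     cutoff: float = STATE_CUTOFF) -> int:
    """Count the chains involved in at least one interface >= cutoff.

    Returns 0 when there is a single chain (no pairs at all) or when every
    interface area is below the cutoff."""
    interacting = set()
    for (chain1, chain2), area in areas.items():
        if area >= cutoff:
            interacting.update((chain1, chain2))
    return len(interacting)


# Three chains, but only one interface reaches the cutoff.
three_chain_example = {
    ("A", "B"): 900.0,   # hypothetical value
    ("A", "C"): 50.0,    # hypothetical value
    ("B", "C"): 174.0,   # value quoted for 1gyr
}
state = multimeric_state(three_chain_example)
```

In this example the state is 2 even though three chains are present, which is the situation described for 1gyr.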
The Release Date refers to the initial release date of the protein structure into the PDB.
The area is the mean accessible surface area (ASA) buried by each chain in the pairwise complex:
area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2
The accessible surface area is computed with the FreeSASA program (Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculation. F1000Research 5:189, doi: 10.12688/f1000research.7931.1).
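As a worked example of the area formula, the function below takes the three ASA values (each isolated chain and the pairwise complex) and returns the mean area buried per chain. The ASA values would come from an ASA calculation such as FreeSASA; the numbers used here are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mean_buried_area(asa_chain1: float,
                     asa_chain2: float,
                     asa_complex: float) -> float:
    """area = [(ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex)] / 2"""
    return ((asa_chain1 + asa_chain2) - asa_complex) / 2.0


# Hypothetical ASA values in square Angstroms:
# (6000 + 5500 - 10500) / 2 = 1000 / 2 = 500
area = mean_buried_area(6000.0, 5500.0, 10500.0)
```

With these values the area is 500 Ų, so the pair would count as interacting under the 250 Ų cutoff.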
Chain names: The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the Biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A and the Biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g. the GI number, ECOD domain, ...). Biounit files containing more than 24 MODELs are excluded from the database.
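A pair identifier written in the style of the 2csm example can be split back into chain names and MODEL numbers with a small parser. This is a sketch that generalizes from that single example: it assumes every identifier has the form chain_<name>_MODEL_<number>, that two such labels are joined by a hyphen, and that chain names contain neither underscores nor hyphens. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed label format, generalized from 'chain_A_MODEL_1'.
_LABEL = re.compile(r"chain_(?P<chain>[^_-]+)_MODEL_(?P<model>\d+)")
MAX_MODELS = 24  # Biounit files with more MODELs are excluded


def parse_pair(identifier: str) -> List[Tuple[str, int]]:
    """Split 'chain_A_MODEL_1-chain_A_MODEL_2' into [('A', 1), ('A', 2)]."""
    result = []
    for part in identifier.split("-"):
        match = _LABEL.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognized chain label: {part!r}")
        result.append((match.group("chain"), int(match.group("model"))))
    return result


def exceeds_model_limit(n_models: int) -> bool:
    """True if a Biounit file with `n_models` MODELs would be excluded."""
    return n_models > MAX_MODELS
```

For example, `parse_pair("chain_A_MODEL_1-chain_A_MODEL_2")` yields the two chain-model pairs for the 2csm interface, and a label that does not match the assumed format raises a `ValueError` rather than being silently accepted.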
SEQRES is the amino acid sequence of the residues in the current chain, retrieved from the mmCIF PDB file (a PDB file generated from the mmCIF file with the program CIFTr). In the PDB file, you can find:
Unbound structures are PDB entries that have only one chain in the Biounit file (biological unit file) and only one chain in the original PDB file (no crystal packing). Example: 1g83A (chain A of the PDB entry 1g83) is the monomeric form of the complexed chain C in the 1avz PDB entry.
The complex type specifies whether the protein complex is a homomultimeric or heteromultimeric protein complex. A homomultimeric complex is a complex where all chains have a sequence identity of at least 90% to each other, and a heteromultimeric complex is a complex containing two or more chains with sequence identities lower than 90%.
The membrane complex field specifies whether the complex is a transmembrane protein complex, as denoted by UniProt. Both chains in the complex must be marked as transmembrane for the complex to be marked as a transmembrane complex.
The nucleic acid field specifies whether the protein complex is involved with nucleic acid chains, and to what extent.
- none: There are no nucleic acids involved with the complex.
- related: Nucleic acids are included in the structure file but are not in the interface.
- associated: Nucleic acids are found in the complex's interface.
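The three nucleic acid values form a small decision table, sketched below. The two boolean inputs and the function name are hypothetical; how the presence of nucleic acids in the file and at the interface is detected is not covered by this sketch.

```python
def nucleic_acid_status(in_structure_file: bool, in_interface: bool) -> str:
    """Map the two observations onto 'none', 'related' or 'associated'."""
    if not in_structure_file:
        return "none"          # no nucleic acids involved with the complex
    if in_interface:
        return "associated"    # nucleic acids found in the interface
    return "related"           # present in the file but not in the interface
```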
The disulfide bonds field indicates whether the complex contains disulfide bonds.
The number of interface residues field is the number of residues in contact with the opposite protein chain in the complex determined by changes in accessible surface area.
The interaction ranges field indicates the numbers of the first and the last residue in the PDB sequence that interact with the other chain. Combined with the ECOD domain boundaries, it may indicate which domain interacts, especially when the protein contains several domains (under development). (H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B. H. Kim, N. V. Grishin (2014) ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12): e1003926; H. Cheng, Y. Liao, R. D. Schaeffer, N. V. Grishin (2015) Manual classification strategies in the ECOD database. Proteins 83(7): 1238-1251.)
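The number of interface residues and the interaction range can both be derived from per-residue ASA values of a chain, computed once for the isolated chain and once within the complex. The sketch below assumes that any loss of ASA greater than zero marks a residue as being in contact; the threshold actually used by the database is not stated, and the function names and ASA values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def interface_residues(asa_free: Dict[int, float],
                       asa_in_complex: Dict[int, float],
                       min_loss: float = 0.0) -> List[int]:
    """Residue numbers whose ASA decreases by more than `min_loss`
    when the chain goes from isolated to complexed."""
    return sorted(
        res for res, free in asa_free.items()
        if free - asa_in_complex.get(res, free) > min_loss
    )


def interaction_range(residues: List[int]) -> Optional[Tuple[int, int]]:
    """First and last interacting residue numbers, or None if there are none."""
    if not residues:
        return None
    return (min(residues), max(residues))


# Hypothetical per-residue ASA values (square Angstroms) for one chain.
free = {10: 50.0, 11: 30.0, 12: 80.0, 45: 20.0}
bound = {10: 50.0, 11: 5.0, 12: 40.0, 45: 0.0}

contacts = interface_residues(free, bound)
span = interaction_range(contacts)
```

In this example residues 11, 12 and 45 lose accessible surface, so the number of interface residues is 3 and the interaction range runs from residue 11 to residue 45; comparing such a range with domain boundaries is what would suggest which domain interacts.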