The database of experimentally determined protein-protein structures is built from the biological unit files of the PDB (Biounit files).
The chains and complexes are annotated according to their classification or other structural features. Each structure is split into pairs of interacting chains. Data is stored in a relational PostgreSQL database. The Build Database page offers the possibility to select a subset of complexes using user-defined search criteria. The database is regularly updated and annotated.
The database of protein-protein complexes currently contains (as of Nov. 17th, 2020):
- 77167 PDB entries
- 270739 Chains
- 635107 Pairwise complexes (interfaces between two Biounit chains)
Entries are selected according to the following criteria:
- The structure's resolution must be less than 6 Å
- Obsolete PDB entries are excluded
- Chains must contain at least 20 residues
- The interface between interacting chains must exceed 250 Å² per chain
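The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the database's actual code: the `PairwiseComplex` record and its field names are hypothetical, and the strict comparison on the interface area follows the wording "exceed" used in the list.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_RESOLUTION = 6.0        # Å; the resolution must be below this value
MIN_CHAIN_LENGTH = 20       # residues per chain
MIN_INTERFACE_AREA = 250.0  # Å² per chain


@dataclass
class PairwiseComplex:
    """Hypothetical record for one pair of interacting Biounit chains."""
    pdb_id: str
    resolution: float              # Å
    obsolete: bool                 # True if the PDB entry is obsolete
    chain_lengths: Tuple[int, int] # residues in each of the two chains
    interface_area: float          # Å², mean ASA buried per chain


def passes_criteria(c: PairwiseComplex) -> bool:
    """Return True if the complex satisfies all four selection criteria."""
    return (
        c.resolution < MAX_RESOLUTION
        and not c.obsolete
        and min(c.chain_lengths) >= MIN_CHAIN_LENGTH
        and c.interface_area > MIN_INTERFACE_AREA
    )
```

How structures without a reported resolution (for example NMR entries) are handled is not stated here, so the sketch does not cover that case.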
Data in the repository are extracted from PDB biounit files. When an entry has multiple biounit files, the first one is used.
Analysis of complexes
Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). Chains are considered interacting if the interface area between them is larger than 250 Å² (mean ASA buried by each chain). Interface areas are extracted from local files pre-computed for all PDB structures.
The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.
Example: 1GNO (ligand UOE).
Example: 1GTD (DNA).
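The 5 Å test for a ligand, DNA or RNA at the interface can be illustrated with a plain distance check. This is only a sketch under assumptions: the function name and the representation of atoms as `(x, y, z)` tuples are hypothetical, and whether the database measures distances between atoms or some other reference points is not specified here.

```python
import math


def near_interface(ligand_atoms, interface_atoms, cutoff=5.0):
    """Return True if any ligand atom lies within `cutoff` Å of any interface atom.

    Both arguments are iterables of (x, y, z) coordinates in Å.
    """
    return any(
        math.dist(lig, atom) <= cutoff
        for lig in ligand_atoms
        for atom in interface_atoms
    )
```

This brute-force comparison is quadratic in the number of atoms; for large structures a spatial index would be a more practical choice.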
Additional annotations include multimeric states higher than dimer (homo-n-ary or hetero-n-ary) and complex type (HOMO or HETERO). The complex type is defined by a BLAST alignment of the interacting chains: if the sequence identity is greater than 70% and the E-value is less than 0.0001, the complex is HOMO; otherwise it is HETERO.
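The HOMO/HETERO rule can be written as a small function. This sketch assumes the sequence identity (in percent) and the E-value have already been obtained from a BLAST alignment of the two chains; the function name and parameter names are illustrative only.

```python
IDENTITY_THRESHOLD = 70.0  # percent sequence identity
EVALUE_THRESHOLD = 1e-4    # BLAST E-value


def complex_type(identity_pct: float, evalue: float) -> str:
    """Classify a pair of interacting chains as 'HOMO' or 'HETERO'."""
    if identity_pct > IDENTITY_THRESHOLD and evalue < EVALUE_THRESHOLD:
        return "HOMO"
    return "HETERO"
```

Both conditions are strict, so a pair with exactly 70% identity is classified as HETERO.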
In homo-n-ary and hetero-n-ary complexes, each chain involved must interact with all the others, with a mean ASA buried by each chain ≥ 250 Å². For this purpose, we use the UniProt accession number extracted from the mmCIF PDB file. If the UniProt accession number is missing, the annotation is 'ND' (Not Determined). A homo-n-ary or hetero-n-ary complex is therefore defined by the association of the chain name and the model number (e.g. PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4: the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is omitted (e.g. PDB entry 13pk is a homo-n-ary complex combining 3 chains, with no MODEL, and is annotated 'A B C').
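The all-against-all requirement and the resulting annotation string can be sketched as follows. The data layout is an assumption made for illustration: chains are `(chain_name, model_number)` tuples, with `None` as the model number when the Biounit file has no MODEL section, and interface areas are stored in a dictionary keyed by unordered chain pairs.

```python
from itertools import combinations


def nary_annotation(chains, areas, threshold=250.0):
    """Return the n-ary annotation, or None if some pair does not interact.

    chains: list of (chain_name, model_number or None) tuples.
    areas:  dict mapping frozenset({chain_a, chain_b}) to the mean ASA
            buried by each chain, in Å²; missing pairs count as 0.
    """
    # Every chain must interact with every other chain.
    for a, b in combinations(chains, 2):
        if areas.get(frozenset((a, b)), 0.0) < threshold:
            return None
    # Chain name followed by the MODEL number, when there is one.
    return " ".join(
        name if model is None else f"{name}{model}" for name, model in chains
    )
```

With four copies of chain A in MODELs 1 to 4 that all interact pairwise, the function returns 'A1 A2 A3 A4'; with chains A, B and C and no MODEL section, it returns 'A B C'.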
A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes that exist in vivo from crystal packing artifacts. In vivo complexes have to be strong enough to form at the biological concentration of monomers without the help of the crystal lattice. However, experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing computational procedures for estimating binding energy to the systematic separation of "strong" (biologically relevant) and "weak" (crystallization artifact) complexes is not obvious. Functional considerations, including evolutionary factors, may nevertheless provide additional help in identifying crystal packing complexes.
If you have a general question about the database, send an email to email@example.com.
Definition of terms
The multimeric state, or oligomeric state, is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state of 2. If all interface areas are less than 250 Å², the multimeric state is set to 0. The interface area is the mean ASA buried by each interacting chain. If the PDB entry contains only one chain (e.g. interwoven chains, see 2ltn, 1cov,...), the multimeric state is also set to 0.
Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Å². Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file: the interface area between chains B and C is 174 Å². Example 2: 1qzv has 22 interfaces but only one, A:B, is greater than 250 Å², so 1qzv has a multimeric state of 2.
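The definition of the multimeric state can be sketched as a count over pairwise interface areas. The function name and the input layout (a dictionary keyed by chain pairs) are assumptions for illustration, and the strict comparison follows the "greater than 250 Å²" wording; the text does not say how an area of exactly 250 Å² is treated.

```python
def multimeric_state(interface_areas, threshold=250.0):
    """Count the chains that have at least one interface above the threshold.

    interface_areas: dict mapping (chain_a, chain_b) tuples to the mean ASA
    buried by each chain, in Å². An empty dict (single chain, or no
    interface at all) gives a multimeric state of 0.
    """
    interacting = set()
    for (chain_a, chain_b), area in interface_areas.items():
        if area > threshold:
            interacting.update((chain_a, chain_b))
    return len(interacting)
```

For a three-chain entry in which only the A:B interface is above the threshold, the function returns 2, which matches the behaviour described for 1gyr and 1qzv.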
area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2
The accessible surface area is computed using the FreeSASA program (Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculation. F1000Research 5:189, doi: 10.12688/f1000research.7931.1).
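The interface area formula translates directly into code. In this sketch the three ASA values are assumed to have been computed beforehand (for example with FreeSASA) for each isolated chain and for the pairwise complex; the function itself is illustrative and does not call FreeSASA.

```python
def interface_area(asa_chain1: float, asa_chain2: float, asa_complex: float) -> float:
    """Mean ASA buried by each chain, in Å².

    area = [(ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex)] / 2
    """
    return ((asa_chain1 + asa_chain2) - asa_complex) / 2.0
```

For instance, two chains with 5000 Å² and 6000 Å² of accessible surface that form a complex with 10000 Å² bury 1000 Å² in total, giving an interface area of 500 Å² per chain.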
Chain names: The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A and the biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g. the GI number, ECOD domain,...). Biounit files containing more than 24 MODELs are excluded from the database.
SEQRES is the amino acid sequence of the residues in the current chain, retrieved from the mmCIF PDB file (PDB file generated from the mmCIF file using the program CIFTr). In the PDB file, you can find:
Unbound structures are PDB entries that have only one chain in the Biounit file (biological unit file) and only one chain in the original PDB file (no crystal packing). Example: 1g83A (chain A of the PDB entry 1g83) is the monomeric form of the complexed chain C in the 1avz PDB entry.
The interaction range indicates the numbers of the first and the last residues in the PDB sequence that interact with the other chain. Combined with the ECOD domain boundaries, it may indicate which domain interacts, especially when the protein contains several domains (under development). (H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B. H. Kim, N. V. Grishin (2014) ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12): e1003926; H. Cheng, Y. Liao, R. D. Schaeffer, N. V. Grishin (2015) Manual classification strategies in the ECOD database. Proteins 83(7): 1238-1251.)