Dockground

Bound Set


Description


The Bound Set is a database of experimentally determined protein-protein structures, built from PDB biological unit files (Biounit files).

The chains and complexes are annotated according to their classification or other structural features. Each structure is split into pairs of interacting chains. Data is stored in a relational PostgreSQL database. The Build Database page offers the possibility to select a subset of complexes using user-defined search criteria. The database is regularly updated and annotated.


The database of protein-protein complexes updates on a weekly basis and currently contains:


Membrane Set

The Membrane Set is a database of experimentally determined membrane protein-protein structures, built on the basis of the Orientations of Proteins in Membranes (OPM) database.

Each structure is split into pairs of chains that interact in the membrane. These chains are parsed so that only the transmembrane section of the protein is in the download file.

The database of membrane protein-protein experimentally determined complexes currently contains (as of Sept. 29th, 2021):


Method


Selection criteria


Bound Set

Membrane Set





Data source

Bound Set


Data in the repository are extracted from PDB Biounit files. If an entry has multiple Biounit files, the first one is used.


Membrane Set


Data in the repository are modified from the OPM polytopic alpha-helical dataset.


Analysis of complexes

Bound Set


Biounit PDB entries are analyzed to extract chains that interact with each other. The MODEL tag in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). Chains are considered interacting if the interface area between them is larger than 250 Ų (mean ASA buried by each chain). Interface areas are extracted from local files pre-computed for all PDB structures.


Membrane Set

PDB entries are analyzed to extract chains that interact with each other in the membrane. Chains are considered interacting if the interface area between them is larger than 250 Ų (mean ASA buried by each chain). Transmembrane segments extracted from the OPM alpha-helical polytopic dataset were used to compute interface areas. These complexes were then filtered based on TM-score. Clustering was done using highly connected subgraphs with a clustering cutoff of 0.6 Å.


Additional annotations


Bound Set

The presence of a ligand, DNA or RNA at the interface (<= 5 Å from interacting residues) is also identified and annotated.


Example 1GNO (ligand UOE):


Ligand at the interface


Example 1GTD with DNA:


DNA at the interface
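The 5 Å rule above can be illustrated with a minimal sketch. It assumes that atom coordinates are already available as (x, y, z) tuples in Å; the function name `at_interface` and its inputs are hypothetical and are not part of the Dockground pipeline.

```python
import math
from typing import Iterable, Tuple

Coord = Tuple[float, float, float]


def at_interface(hetero_atoms: Iterable[Coord],
                 interface_atoms: Iterable[Coord],
                 cutoff: float = 5.0) -> bool:
    """Return True if any ligand/DNA/RNA atom lies within `cutoff` Å
    of any atom of the interacting residues (assumed inputs)."""
    interface = list(interface_atoms)  # materialize so it can be re-scanned
    return any(math.dist(h, i) <= cutoff
               for h in hetero_atoms
               for i in interface)
```

For example, `at_interface([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])` is true, because the two atoms are exactly 5 Å apart and the rule is inclusive (<= 5 Å).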




Additional annotations include multimeric states higher than dimer (homo-n-ary or hetero-n-ary) and complex type (HOMO or HETERO). The complex type is defined by BLAST alignment of the interacting chains: if the sequence identity is greater than 70% and the E-value is < 0.0001, the complex is HOMO; otherwise it is HETERO.
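As a rough illustration of the HOMO/HETERO rule, the sketch below assumes that the sequence identity (in percent) and the E-value have already been parsed from a BLAST alignment of the two chains. The function name `complex_type` is hypothetical.

```python
def complex_type(identity_percent: float, evalue: float) -> str:
    """Classify a chain pair from its BLAST alignment statistics.

    HOMO requires identity > 70% AND E-value < 0.0001, as stated
    in the text; everything else is HETERO.
    """
    if identity_percent > 70.0 and evalue < 1e-4:
        return "HOMO"
    return "HETERO"
```

Both conditions are strict, so a pair with exactly 70% identity is classified as HETERO under this reading.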


Multimeric cases


In homo-n-ary and hetero-n-ary complexes, each chain involved must interact with all the others, with a mean ASA buried by each chain ≥ 250 Ų. For this purpose, we use the UniProt accession number extracted from the mmCIF PDB file. If the UniProt accession number is missing, the annotation is 'ND' (Not Determined). A homo-n-ary or hetero-n-ary complex is defined by the association of the chain name and the MODEL number (e.g. PDB entry 1q5n is a homo-n-ary complex combining chain A in MODEL 1, chain A in MODEL 2, chain A in MODEL 3 and chain A in MODEL 4: the annotation is 'A1 A2 A3 A4'). If the Biounit structure does not contain a MODEL section, the MODEL number is omitted (e.g. PDB entry 13pk is a homo-n-ary complex combining 3 chains, with no MODEL, and is annotated 'A B C').
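The labelling and the all-against-all contact requirement can be sketched as follows. This is only an illustration: the helper names (`member_name`, `multimer_label`, `fully_connected`) and the dictionary of pairwise areas keyed by `frozenset` are assumptions, not the actual Dockground implementation.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Optional, Sequence, Tuple


def member_name(chain: str, model: Optional[int]) -> str:
    """Chain name followed by the MODEL number, or the chain alone if no MODEL."""
    return chain if model is None else f"{chain}{model}"


def multimer_label(members: Sequence[Tuple[str, Optional[int]]]) -> str:
    """Build an annotation such as 'A1 A2 A3 A4' or 'A B C'."""
    return " ".join(member_name(chain, model) for chain, model in members)


def fully_connected(names: Sequence[str],
                    areas: Dict[FrozenSet[str], float],
                    cutoff: float = 250.0) -> bool:
    """True if every pair of members buries at least `cutoff` Å^2 (mean per chain).

    Pairs missing from `areas` are treated as non-interacting (assumption).
    """
    return all(areas.get(frozenset(pair), 0.0) >= cutoff
               for pair in combinations(names, 2))
```

With the examples from the text, `multimer_label([("A", 1), ("A", 2), ("A", 3), ("A", 4)])` gives `'A1 A2 A3 A4'`, and `multimer_label([("A", None), ("B", None), ("C", None)])` gives `'A B C'`.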





Limitations


A major problem in compiling representative databases of protein-protein complexes is the lack of credible criteria for distinguishing complexes existing in vivo from crystal packing artifacts. The in vivo complexes have to be strong enough to be formed at the biological concentration of monomers with no help of the crystal lattice. However, the experimental data reflecting these properties are not available in many cases. In addition, the practical applicability of existing binding energy-estimating computational procedures to systematic separation of "strong" (biologically relevant) and "weak" (artifacts of crystallization) complexes is not obvious. However, functional considerations, including evolutionary factors, may provide additional help in discriminating crystal packing complexes.



Contact


If you have a general question about the database, send an email to dockground@ku.edu.



Definition of terms


The multimeric state (or oligomeric state) is the number of chains that interact with at least one other chain in the PDB file. Thus, a dimer has a multimeric state of 2. If all interface areas are less than 250 Ų, the multimeric state is set to 0. The interface area is the mean ASA buried by each interacting chain. If the PDB entry contains only one chain (e.g. interwoven chains; see 2ltn, 1cov, ...), the multimeric state is also set to 0.

Thus, the number of chains in the Biounit file may be higher than the indicated multimeric state if some interface areas are less than 250 Ų. Example 1: PDB entry 1gyr has a multimeric state of 2 although 3 chains are present in the PDB file, because the interface area between chains B and C is 174 Ų. Example 2: 1qzv has 22 interfaces, but only one (A:B) is greater than 250 Ų, so 1qzv has a multimeric state of 2.
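The definition of the multimeric state can be sketched in a few lines. The function name and the input format (a dictionary of pairwise interface areas) are assumptions made for illustration, and the treatment of an area of exactly 250 Ų as interacting is a guess, since the text only says that areas below 250 Ų do not count.

```python
from typing import Dict, Tuple


def multimeric_state(pair_areas: Dict[Tuple[str, str], float],
                     cutoff: float = 250.0) -> int:
    """Count chains that interact with at least one other chain.

    `pair_areas` maps a pair of chain names to the mean ASA buried
    by each chain (Å^2). Returns 0 if no pair reaches the cutoff,
    which also covers single-chain entries (no pairs at all).
    """
    interacting = set()
    for (chain_1, chain_2), area in pair_areas.items():
        if area >= cutoff:  # boundary handling is an assumption
            interacting.update((chain_1, chain_2))
    return len(interacting)
```

With invented areas `{("X", "Y"): 900.0, ("Y", "Z"): 174.0}`, three chains are present but only X and Y interact, so the state is 2. This mirrors the situation described for 1gyr, where the B:C interface (174 Ų) is too small to count.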



The Release Date refers to the initial release date of the protein structure into the PDB.



The area is the mean accessible surface area (ASA) buried by each chain in the pairwise complex:

      area = [ (ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex) ] / 2

The accessible surface area is computed with the FreeSASA program (Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculation. F1000Research 5:189, doi: 10.12688/f1000research.7931.1).
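The area formula can be written as a short function. This sketch assumes the three ASA values (each isolated chain and the pairwise complex) have already been computed, for example with FreeSASA; the function names are hypothetical and the numbers in the usage note are invented.

```python
def mean_buried_area(asa_chain_1: float,
                     asa_chain_2: float,
                     asa_complex: float) -> float:
    """Mean ASA buried by each chain in a pairwise complex, in Å^2.

    area = [(ASA(chain 1) + ASA(chain 2)) - ASA(pairwise complex)] / 2
    """
    return ((asa_chain_1 + asa_chain_2) - asa_complex) / 2.0


def is_interacting(area: float, cutoff: float = 250.0) -> bool:
    """Apply the 'larger than 250 Å^2' criterion to a pairwise area."""
    return area > cutoff
```

For instance, two chains with ASA values of 5000 and 6000 Ų that form a complex with an ASA of 10400 Ų bury (11000 - 10400) / 2 = 300 Ų each, which passes the cutoff.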



Chain names: The MODEL number in the Biounit file has to be associated with the chain name in order to distinguish chains (chain A in MODEL 1 may interact with chain A in MODEL 2). MODEL is usually used in NMR-determined structures. We also compare the sequence of the Biounit chain-model to the original PDB content to find (and check) the original chain name. Some chains turn out to be mismatched (e.g. 2csm chain A and the Biounit file 2csm chain_A_MODEL_1-chain_A_MODEL_2). Identifying the original chain is important since most of the sequence information is associated with the original chain name (e.g. the GI number, ECOD domain, ...). Biounit files containing more than 24 MODELs are excluded from the database.
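A minimal sketch of the chain-model naming, following the pattern shown in the 2csm example (chain_A_MODEL_1-chain_A_MODEL_2), together with the 24-MODEL exclusion rule. The helper names are hypothetical, and the exact identifier format used internally by Dockground may differ.

```python
from typing import Tuple

MAX_MODELS = 24  # Biounit files with more MODELs than this are excluded


def chain_model_id(chain: str, model: int) -> str:
    """Identifier that ties a chain name to its MODEL number."""
    return f"chain_{chain}_MODEL_{model}"


def pair_id(first: Tuple[str, int], second: Tuple[str, int]) -> str:
    """Identifier for a pair of interacting chain-models."""
    return "-".join(chain_model_id(chain, model) for chain, model in (first, second))


def is_excluded(model_count: int) -> bool:
    """True if a Biounit file has too many MODELs to be included."""
    return model_count > MAX_MODELS
```

Here `pair_id(("A", 1), ("A", 2))` yields `'chain_A_MODEL_1-chain_A_MODEL_2'`, which keeps the two copies of chain A distinct.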



SEQRES is the amino acid sequence of residues in the current chain, retrieved from the mmCIF PDB file (the PDB file is generated from the mmCIF file using the program CIFTr). In the PDB file, you can find:


hierarchy figure




Unbound structures are PDB entries that have only one chain in the Biounit file (biological unit file) and only one chain in the original PDB file (no crystal packing). Example: 1g83A (chain A of the PDB entry 1g83) is the monomeric form of the complexed chain C in the 1avz PDB entry.


The complex type specifies whether the protein complex is a homomultimeric or heteromultimeric protein complex. A homomultimeric complex is a complex where all chains have a sequence identity of at least 90% to each other, and a heteromultimeric complex is a complex containing two or more chains with sequence identities lower than 90%.


The membrane complex field specifies whether the complex is a transmembrane protein complex, as denoted by UniProt. Both chains in the complex must be marked as transmembrane for the complex to be marked as a transmembrane complex.


The nucleic acid field specifies whether the protein complex is involved with nucleic acid chains, and to what extent.


The disulfide bonds field indicates whether the complex contains disulfide bonds.


The number of interface residues field is the number of residues in contact with the opposite protein chain in the complex determined by changes in accessible surface area.
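One possible way to count interface residues from ASA changes is sketched below. It assumes per-residue ASA values are available for the isolated chain and for the same chain within the complex. The function name, the input dictionaries and the `min_loss` threshold are assumptions, since the text does not state how large a change must be.

```python
from typing import Dict, List


def interface_residues(asa_free: Dict[int, float],
                       asa_in_complex: Dict[int, float],
                       min_loss: float = 0.0) -> List[int]:
    """Residue numbers whose ASA drops by more than `min_loss` Å^2
    when the partner chain is present.

    Residues missing from `asa_in_complex` are treated as unchanged.
    """
    return sorted(res for res, free in asa_free.items()
                  if free - asa_in_complex.get(res, free) > min_loss)
```

The number of interface residues is then `len(interface_residues(asa_free, asa_in_complex))`.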



The interaction range gives the numbers of the first and the last residues in the PDB sequence that interact with the other chain. Combined with the ECOD domain boundaries, it may indicate which domain interacts, especially when the protein contains several domains (under development). References: H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B. H. Kim, N. V. Grishin (2014) ECOD: An evolutionary classification of protein domains. PLoS Comput Biol 10(12): e1003926; H. Cheng, Y. Liao, R. D. Schaeffer, N. V. Grishin (2015) Manual classification strategies in the ECOD database. Proteins 83(7): 1238-1251.
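The interaction range and its comparison with domain boundaries can be sketched as follows. The function names and the domain dictionary are hypothetical; in particular, the domain names and boundaries in the usage note are invented and do not come from ECOD.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def interaction_range(residue_numbers: Iterable[int]) -> Optional[Tuple[int, int]]:
    """First and last interacting residue numbers, or None if there are none."""
    numbers = list(residue_numbers)
    if not numbers:
        return None
    return min(numbers), max(numbers)


def overlapping_domains(rng: Tuple[int, int],
                        domains: Dict[str, Tuple[int, int]]) -> List[str]:
    """Names of domains whose (start, end) boundaries overlap the range."""
    first, last = rng
    return [name for name, (start, end) in domains.items()
            if start <= last and first <= end]
```

For interacting residues 45, 12 and 88 the range is (12, 88). With made-up domains `{"d1": (1, 100), "d2": (101, 200)}`, only `d1` overlaps that range.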