Database Statistics

Our database stores 515,920 known and 9,160,203 putative binding sites obtained from the Protein Data Bank (PDB) [1].

[Def. 1] Definition of a ligand:
- Any molecule other than buffers (glycerol and ethanediol), water molecules, nucleic acids (DNA/RNA), and protein peptides

[Def. 2] Definition of a potential ligand-binding region:
- A probe cluster of at most 200 probes, generated using Ghecom [2]

[Def. 3] Definition of a binding site:
- A known ligand-binding site is a set of amino acids close to a non-polymer molecule
- A putative binding site is a set of amino acids close to a probe cluster
- Each residue in a binding site must have at least one heavy atom within 5.0 angstroms of the ligand or probe cluster
- A binding site must contain at least 6 residues
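The distance and size criteria in Def. 3 can be illustrated with a minimal sketch. This is not the code used to build the database; the function name `binding_site` and the input layout (plain coordinate tuples keyed by a residue ID) are assumptions made for illustration only.

```python
import math

CUTOFF = 5.0       # angstroms, from Def. 3
MIN_RESIDUES = 6   # minimum number of residues in a site, from Def. 3


def binding_site(residues, ligand_atoms, cutoff=CUTOFF, min_residues=MIN_RESIDUES):
    """Return the IDs of residues forming a site, or [] if the site is too small.

    residues: dict mapping a residue ID to a list of heavy-atom (x, y, z) tuples.
    ligand_atoms: list of (x, y, z) tuples for the ligand or probe cluster.
    (Both layouts are hypothetical; real input would come from a PDB parser.)
    """
    site = [
        res_id
        for res_id, atoms in residues.items()
        # A residue qualifies if any of its heavy atoms is within the cutoff
        # of any ligand/probe atom.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in ligand_atoms)
    ]
    # Sites with fewer than min_residues residues are rejected.
    return site if len(site) >= min_residues else []
```

The brute-force distance check is quadratic in the number of atoms, which is acceptable for a single site but would need a spatial index for whole-PDB processing.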

All sites have been annotated, as far as possible, with CATH [3], SCOP [4], EC number [5], and Gene Ontology [6] information.

  1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28(1):235-242.
  2. Kawabata T. Detection of multi-scale pockets on protein surfaces using mathematical morphology. Proteins 2010;78(5):1195-1211.
  3. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 2007;35(Database issue):D291-7.
  4. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008;36(Database issue):D419-25.
  5. Martin AC. PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics 2004;20:986-988.
  6. The Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Res 2008;36(Database issue):D440-4.

Comparison Methods

We have developed an ultrafast method [1] that can detect similar pairs among over 1 million binding sites in a reasonable computation time. In our method, ligand-binding sites are first encoded as feature vectors based on their physicochemical and geometric properties. Similar sites are then enumerated using a fast neighbor search algorithm called SketchSort [2], whose source code is publicly available. We applied our method to all-pair similarity searches for the 3.4 million known and potential ligand-binding sites. Consequently, we discovered over 24 million similar binding sites.
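To make the similarity criterion concrete, the sketch below computes cosine similarity between feature vectors and enumerates pairs above a cut-off by brute force. It is only an illustration: it is not SketchSort, which avoids comparing every pair, and the function names and the toy vectors a caller would pass in are assumptions, not the actual site encoding.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, cutoff):
    """Brute-force all-pair search over a dict of site ID -> feature vector.

    Returns (id_a, id_b, similarity) for every pair at or above the cutoff.
    This takes quadratic time, which is exactly the cost a fast neighbor
    search is designed to avoid at PDB scale.
    """
    hits = []
    for (id_a, u), (id_b, v) in combinations(vectors.items(), 2):
        sim = cosine_similarity(u, v)
        if sim >= cutoff:
            hits.append((id_a, id_b, sim))
    return hits
```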

  1. Ito J, Tabei Y, Shimizu K, Tomii K, and Tsuda K. PDB-Scale analysis of known and putative ligand binding sites with structural sketches. Proteins 2011;80:747-63.
  2. Tabei Y, Uno T, Sugiyama M, Tsuda K. Single Versus Multiple Sorting for All Pairs Similarity Search. The 2nd Asian Conference on Machine Learning (ACML2010) 2010.


Submit Form

Search K:
If you want to find local protein regions that are similar to a known ligand-binding site in the PDB, Search K is useful. This search mode enumerates known and putative sites that are similar to a query site.
The required inputs for Search K are the "PDB ID" of a structure and the "HET code" of a ligand bound to that structure.

  1. PDB ID (4 characters) of the protein structure to be searched (e.g. 1DJQ). (Required for search)
  2. HET code (3 letters) of a ligand (e.g. ATP) bound to the structure selected in form (1). (Required for search)

    (The following inputs are optional parameters for search)

  3. Chain name of the ligand, such as "A" or "B". If this field is left blank, all ligand-binding sites matching the PDB ID and HET code entered in forms (1) and (2) are treated as queries.
  4. Database to search. The query can be searched against three types of databases:
    + Known ligand-binding sites
    + Putative binding sites
    + Both known and putative sites
    For details of each dataset, please see the database definition.
  5. Cut-off value for cosine similarity. A larger cosine value means higher similarity between binding sites. Binding sites whose similarity is at or above this cut-off are retrieved as hits.
  6. Cut-off value for the number of aligned residues (i.e. Ca atoms). Hits below this value are discarded.
  7. Maximum number of hits to report.
  8. Report type. Two types of reports are available: viewing the search results in the web interface, or downloading them as a plain text file.
  9. Annotations to display. CATH classification code, SCOP classification code, EC number, and Gene Ontology (GO) terms are currently available. CATH and SCOP codes were assigned to each binding site by matching the residues of the site against the residues of the CATH/SCOP domains. If the residues of a binding site span multiple domains, multiple CATH/SCOP codes are shown. In addition, for each binding site we identified the protein chain with the largest overlap, and the EC number and GO terms of that chain are shown.
  10. Start the search.
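The Search K inputs listed above can be summarized as a small validated record. This is a hypothetical sketch for organizing the parameters on the client side, not the server's API: the class name `SearchKQuery`, the field names, and the database labels are all assumptions. The check on the HET code accepts one to three alphanumeric characters, which is also an assumption beyond the "3 letters" stated in the form.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the three database choices in input (4).
DATABASES = ("known", "putative", "both")


@dataclass(frozen=True)
class SearchKQuery:
    pdb_id: str                  # input (1), e.g. "1DJQ"
    het_code: str                # input (2), e.g. "ATP"
    database: str                # input (4), one of DATABASES
    cosine_cutoff: float         # input (5)
    min_aligned: int             # input (6), aligned residues (Ca atoms)
    max_hits: int                # input (7)
    chain: Optional[str] = None  # input (3); None means all matching sites

    def __post_init__(self):
        if not re.fullmatch(r"[0-9A-Za-z]{4}", self.pdb_id):
            raise ValueError("PDB ID must be 4 alphanumeric characters")
        if not re.fullmatch(r"[0-9A-Za-z]{1,3}", self.het_code):
            raise ValueError("HET code must be 1 to 3 alphanumeric characters")
        if self.database not in DATABASES:
            raise ValueError(f"database must be one of {DATABASES}")
        if not -1.0 <= self.cosine_cutoff <= 1.0:
            raise ValueError("cosine cut-off must lie in [-1, 1]")
        if self.min_aligned < 0 or self.max_hits < 1:
            raise ValueError("min_aligned must be >= 0 and max_hits >= 1")
```

For example, `SearchKQuery("1DJQ", "ADP", "known", 0.77, 7, 100)` mirrors the thresholds shown in the sample text output, with the maximum of 100 hits chosen arbitrarily.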

Search P:
When you have a PDB structure of interest and would like to predict ligands that potentially bind to it, you can use Search P. This search mode searches the known ligand-binding sites for those similar to sites in the query protein structure.
The only required input for Search P is the "PDB ID" of the structure of interest.

  1. PDB ID (4 characters) of the protein structure to be searched (e.g. 1DJQ). (Required for search)

  2. The other parameters (5 to 10) are identical to those of Search K.

Search Results

After submission, a result page will be displayed as follows (usually within a few seconds).

  1. Summary of your request.
  2. The submitted binding site can be viewed in the Jmol viewer. If no ligand chain name was specified in the submit form, multiple sites may be treated as queries, and only one of them, selected at random, is shown.
  3. The number of hits (i.e., detected similar binding sites) against your query.
  4. All results can be downloaded as a text file, which includes the corresponding amino acids after structural alignment in addition to the annotations.
  5. Information related to each hit is presented as a row in the list. Rows for known ligand-binding sites are shown with a blue background; rows for putative sites are shown in green.
    (5 - 1) The first column gives access to the superposition of the query on the hit. You can visualize the superposition by clicking the No.
    (5 - 2:4) The second to fourth columns respectively show the PDB ID of the structure, the name of the bound ligand, and the chain ID of the ligand. For putative sites, the name of the bound ligand is simply displayed as 'PRB' and the chain ID is displayed as 'X'.
    (5 - 5:8) The fifth to eighth columns show similarity/dissimilarity values: cosine similarity, p-value (based on an empirical study of 3D alignment; see our paper for details), aligned length, and RMSD of Ca atoms.
    (5 - 9) The ninth column shows the protein name (macromolecule name).
    (5 - 10:13) The CATH, SCOP, EC number, and GO term annotations selected in the submit form are shown.
    (5 - 14) The last column provides a shortcut for follow-up searches. For example, if a known ligand-binding site of interest is detected as a hit, you may want to find the binding sites related to that hit. You can then simply click the 'Search it!' button instead of returning to the top page and resubmitting the site using Search K.
    For further analyses, all results can also be downloaded as a plain text file, which includes a list of well-aligned amino acids whose Ca atoms are within 5.0 angstroms of each other after 3D superposition.

Superposition on Jmol

The binding sites of the query and the target (hit) are shown as pink and cyan sticks, respectively.
For hits to known binding sites, the ligands of the query and the target are shown as red and green sticks, respectively.

For hits to putative sites, only the ligand of the query site is displayed, as red sticks.

For more information about Jmol, please see the Jmol website.

Save the results as a text file

The downloadable plain text file lists the well-aligned amino acids (Ca atoms within 5.0 angstroms after 3D superposition) and consists of three parts: Summary of your query, Search results, and Records. The first and second parts are essentially identical to those shown in the web interface. In the Records part, fields are delimited by "|" and the first row gives the name of each field. The last field lists the corresponding residues between the two sites obtained from structural alignment. For example, "A_V395-A_I9" in No.6 means that V395 of the A-chain of your query (1DJQ) is aligned with I9 of the A-chain of the hit (3KD9).

----- Summary of Your query -----------------------------------
HET code of ligand: ADP
Chain name of ligand: Any
CATH code of query site :
SCOP code of query site : c.4.1.1
EC number of query chain :
GO term 1: Molecular Function : GO:0016491 GO:0046872 GO:0050470 GO:0051536 GO:0051539
GO term 2: Biological Process : GO:0055114
GO term 3: Cellular Component : none
Search against 'Known ligand binding sites'
Cosine similarity >= 0.77
Aligned length >= 7
Annotates hits with CATH/SCOP/EC/GO: Yes/Yes/Yes/Yes
----- Search Results ------------------------------------------
Number of hits for known ligand binding sites: 5
Number of hits for putative sites: 5
----- Records -------------------------------------------------
No. of hits | PDB ID | HET code | HET_chain | cos-value | p-val | Aligned length | RMSD(Ca) | Protein Name | CATH code | SCOP code | EC Numbers | GO: Molecular Function | GO: Biological Process | GO: Cellular Component | Aligned residues (Ca atoms) |
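The Records part can be read with a few lines of code. The sketch below assumes only what is described above: fields are delimited by "|", the first row holds the field names, and an aligned pair looks like "A_V395-A_I9". The function names are hypothetical, and the pair parser assumes one-letter residue codes with non-negative residue numbers and no insertion codes; how multiple pairs are separated within the last field is not specified here, so the parser handles a single pair token.

```python
def parse_records(lines):
    """Turn the Records part into a list of dicts keyed by the header names."""
    def split(line):
        fields = [f.strip() for f in line.split("|")]
        # The sample header ends with a trailing '|', which leaves one empty
        # field at the end; drop it.
        if fields and fields[-1] == "":
            fields.pop()
        return fields

    rows = [split(line) for line in lines if line.strip()]
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


def parse_aligned_pair(token):
    """Split one pair such as 'A_V395-A_I9' into two (chain, residue, number) tuples."""
    def side(text):
        chain, residue = text.split("_", 1)
        return chain, residue[0], int(residue[1:])

    query, hit = token.split("-", 1)
    return side(query), side(hit)
```

With the example from the text, `parse_aligned_pair("A_V395-A_I9")` returns `(("A", "V", 395), ("A", "I", 9))`: the query residue first, then the hit residue.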