The KNOTTIN database

Documentation


Jérôme Gracy1, Dung Le-Nguyen2, Jean-Christophe Gelly3, Quentin Kaas4, Annie Heitz1 and Laurent Chiche1

1 Centre de Biochimie Structurale, CNRS UMR5048, INSERM U554, Université, 29 rue de Navacelles, F-34090 MONTPELLIER
2 SysDiag, FRE3009 CNRS/BIO-RAD, Complex system modelling and engineering for diagnostic, F-34184 MONTPELLIER
3 Laboratoire de Biochimie Theorique - CNRS UPR9080 IBPC, 11, rue Pierre et Marie Curie, F-75005 PARIS
4 Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia.
 
 Introduction

The elucidation, in 1982, of the X-ray structure of PCI, a carboxypeptidase inhibitor from potato, revealed for the first time a "knotted" structure in which a disulfide bridge penetrates a macrocycle formed by two other disulfides and the interconnecting backbone segments [Rees & Lipscomb, 1982]. In 1989, this peculiar scaffold was also found in the squash trypsin inhibitors [Heitz et al., 1989; Chiche et al., 1989; Bode et al., 1989], and later in toxins from cone snails and spiders [Davis et al., 1993; Yu et al., 1993]. This structural family now comprises almost 30 different protein families, 150 experimentally determined structures and more than 1200 protein sequences. We proposed that this structural family be referred to as knottins [Le-Nguyen et al., 1990], although other names were later suggested (e.g. Inhibitor Cystine Knots or ICK [Pallaghy et al., 1994]). The specific interest in this scaffold comes from the observation that these proteins are very small, and thus readily accessible to chemical synthesis, yet remarkably stable thanks to their high content of disulfide bridges and their "knotted" topology. Various uses of this scaffold have been reported in protein engineering, drug design and combinatorial approaches [Baggio et al., 2002; Hilpert et al., 2002; Heitz et al., 2000; Craik et al., 2000; Smith et al., 1998], and reviews have been published [Norton & Pallaghy, 1998; Craik et al., 2001].

 To boost research in this area we have set up a dedicated information system, the KNOTTIN website, that gathers essential data on knottin discoveries, folding, applications, functions and bibliography. It is complemented by the KNOTTIN database, a relational database that stores information on known structures and sequences extracted from the Protein Data Bank and the UniProt (TrEMBL/Swiss-Prot) knowledgebase. Geometrical data (secondary structures, hydrogen bonds, contacts, solvent accessibilities, etc.) have been pre-computed for all structures and stored in the database. A single standardization is applied throughout the database, which facilitates knottin analyses and comparisons.

The system is available at http://knottin.cbs.cnrs.fr or http://knottin.com

Different information themes are accessible via the left menu:

  • Home points at an introductory page that displays news and current statistics about the database, and shows main knottin features.

  • Search Database allows retrieval of data stored in the KNOTTIN database, producing dynamic tabular reports that further allow retrieval of standardized sequence alignments. The KNOTTIN database contains two subsections, 'Structures' and 'Sequences', that can be searched independently.

  • Blast Database allows users to BLAST a sequence against the sequences included in the KNOTTIN database. Searches against a database of HMM models (one for each knottin family) are also available.

  • Segment Search allows users to search the KNOTTIN structural database for 20-residue long segments based on PDB ID, standardized positions, sequence, or secondary structure pattern.

  • Tools allows users to check if their own protein structure or protein sequence is a knottin. If yes, the corresponding knottin nomenclature, a renumbered PDB file or a standardized sequence alignment, and a standardized 2D representation (Colliers de Perles) are provided.

  • Sequences points at static pre-computed sequence alignments of all sequences in the KNOTTIN database (one alignment for each family). These alignments are NOT renumbered, but were manually checked to ensure that the cysteines of the knot are correctly aligned.

  • Structures points at static pre-computed tabular information on knottin structures.

  • Functions displays the major known biological functions exerted by knottins.

  • Folding presents the main historical and recent efforts to improve knowledge and understanding of the knottin folding processes.

  • Synthesis presents the main historical and recent efforts to obtain knottin samples from non-natural sources.

  • Modeling & Drug Design presents the main historical and recent modeling works based on knottin structures. Includes, e.g., amino acid mutations, protein engineering, circular permutations, homology modeling, computer simulations, or combinatorial libraries.

  • Landmarks displays main historical and recent landmarks regarding knottins.

  • References displays a list of the references cited in the different sections. PubMed links are available.

  • Links provides a short selected list of Internet sites in relation to knottins.

 New KNOTTIN nomenclature


To facilitate analyses and comparisons, a nomenclature and a unique numbering scheme are proposed and applied throughout the structural database. The knottin scaffold is based upon the I-IV, II-V, III-VI connectivity of six cysteines, which form three disulfide bridges. The proposed nomenclature gives, in order, the lengths of the loops between cysteines I and II, II and III, etc., referred to as a, b, c, d and e in the figures below. The two loops involved in the disulfide macrocycle are shown in parentheses and, where necessary, numbers are separated by dots (e.g. PDB ID 2eti: nomenclature = "(6)5.3(1)5").

For macrocyclic knottins, in which cysteines VI and I are connected by a peptidic segment, an additional loop length is shown between brackets (e.g. PDB ID 1ha9: nomenclature = "(6)5.3(1)5[8]").

  It is worth noting that this nomenclature could also be applied to the growth factor cystine knots, the only other structural protein family with a disulfide bridge penetrating a disulfide macrocycle (e.g. PDB ID 1bet: nomenclature = "42(9)11.27(1)").
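
The nomenclature strings quoted above can be read mechanically. The following is a minimal sketch, not part of the KNOTTIN software: the function name `parse_nomenclature` and the labels "macrocycle", "cyclic" and "plain" are illustrative assumptions. It only tokenizes a string into loop lengths and records whether each length was written in parentheses, in brackets, or bare.

```python
import re

# One token is a parenthesised, bracketed, or bare loop length.
TOKEN = re.compile(r"\((\d+)\)|\[(\d+)\]|(\d+)")
# A whole string may contain only such tokens and separating dots.
VALID = re.compile(r"(?:\(\d+\)|\[\d+\]|\d+|\.)+")


def parse_nomenclature(text):
    """Split a nomenclature string into (length, kind) pairs, in order.

    kind is "macrocycle" for a length in parentheses, "cyclic" for a
    length in brackets, and "plain" otherwise (labels are assumptions).
    """
    text = text.strip()
    if not VALID.fullmatch(text):
        raise ValueError(f"not a loop-length nomenclature: {text!r}")
    loops = []
    for match in TOKEN.finditer(text):
        macro, cyclic, plain = match.groups()
        if macro is not None:
            loops.append((int(macro), "macrocycle"))
        elif cyclic is not None:
            loops.append((int(cyclic), "cyclic"))
        else:
            loops.append((int(plain), "plain"))
    return loops


print(parse_nomenclature("(6)5.3(1)5[8]"))
```

Because the tokenizer makes no assumption about which loops are parenthesised, the same function also accepts the growth factor example "42(9)11.27(1)".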

 
 New unique numbering scheme
 
A uniform numbering system has been set up for all knottins, whatever their function or origin. This greatly facilitates sequence and structure comparisons between knottins that are structurally similar but divergent in sequence. Such a unique numbering has already proved extremely useful for Immunoglobulins and T cell Receptors [Lefranc MP et al., 2003].

  • The maximum observed lengths for loops (a), b, c, (d), e, and [f] in known knottin structures are (7), 11, 7, (16), 13 and [10], respectively.

  • Cysteine IV is special because it does not have a fixed position. It is most often adjacent to cysteine III (e.g. in many toxins) or two positions before CYS V (e.g. in Serine protease inhibitor1 and Cyclotides).

 

 

 

  •  To allow space for future insertions, and get a numbering scheme easy to remember, we number cysteines as follows:

    CYS I = 20
    CYS II = 40
    CYS III = 60
    CYS V = 80
    CYS VI = 100

  • CYS IV is free and will be numbered either
    61 (e.g. in Spider toxins)
    78 (e.g. in Serine protease inhibitor1)
    etc.

 

  • The loop residues between cysteines are distributed evenly at each end of the loop.
  • An example of the numbering is shown below for a squash trypsin inhibitor (2btcI), the minimized Agouti-related protein (1hykA), a conotoxin (1ag7) and a spider toxin (1agg).

    Cysteines I, II, III, V and VI are in RED
    Cysteine IV is in BLUE
              1                  20                    40         50
2btcI/501-529 ---------- -------RVC PKI------- ------LMEC KKD-------
1hykA/1-46    ---------- ---------C VRL------- ------HESC LGQ-------
1agg/1-48     ---------- ------EDNC IAED------ ------YGKC TWG------- 
1ag7/1-34     ---------- --------AC SGR------- ------GSRC XX--------

              51      60                    80                   100
2btcI/501-529 -------SDC LAE------- -------CIC LEH------- -------GYC
1hykA/1-46    ------QVPC CDP------- ------CATC YCRFF----- ----NAFCYC
1agg/1-48     ------GTKC CRG------- -------RPC RCSMI----- ----GTNCEC
1ag7/1-34     --------QC CMG------- -------LRC GRGN------ ------PQKC

              101                                     140
2btcI/501-529 G--------- ---------- ---------- ----------
1hykA/1-46    RKLGTAMNPC SRT------- ---------- ----------
1agg/1-48     TPRLIMEGLS FA-------- ---------- ----------
1ag7/1-34     IGAHXDV--- ---------- ---------- ----------
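
The rule that loop residues are distributed evenly at each end of the loop can be sketched as follows. This is an illustration only, not the KNOTTIN renumbering code: the function name `number_loop` is hypothetical, and giving the odd residue to the N-terminal half is an assumption inferred from the alignment above (for example the seven-residue loop of 1agg between positions 20 and 40). The sketch covers loops between two fixed cysteines and does not handle the free position of CYS IV.

```python
def number_loop(cys_start, cys_end, loop_len):
    """Return standardized numbers for the residues of one loop.

    cys_start and cys_end are the fixed numbers of the flanking cysteines
    (e.g. 20 and 40). Half of the residues are numbered upwards from
    cys_start, the other half downwards from cys_end. When loop_len is
    odd, the extra residue goes to the N-terminal half (assumption).
    """
    capacity = cys_end - cys_start - 1
    if not 0 <= loop_len <= capacity:
        raise ValueError(f"loop of {loop_len} residues does not fit in {capacity} slots")
    n_head = (loop_len + 1) // 2
    n_tail = loop_len - n_head
    head = [cys_start + i for i in range(1, n_head + 1)]
    tail = [cys_end - n_tail + i for i in range(n_tail)]
    return head + tail


# Loop between CYS I (20) and CYS II (40) of 2btcI: residues PKI...LME
print(number_loop(20, 40, 6))
```

With these inputs the six residues receive the numbers 21-23 and 37-39, which is where PKI and LME sit in the 2btcI row of the alignment.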

The KNOTTIN database


As of September 2007, the KNOTTIN database is updated using mostly the automatic procedures KNOTER3D and KNOTER1D, according to the flow chart shown below.

KNOTER3D determines if a three-dimensional structure is a knottin:

  1. New Protein Data Bank entries are filtered for proteins with at least 3 disulfide bridges and a connectivity that is typical of knottins (although not exclusive to them), i.e. I-IV, II-V, III-VI.
  2. The selected protein structures are renumbered according to the unique knottin numbering.
  3. The renumbered structures are superimposed onto a reference knottin structure (the X-ray structure of Cucurbita Pepo Trypsin Inhibitor II, CPTI-II, PDB ID 2btcI) for residues 40, 60-61, 79-81 and 99-100. These residues approximately define the Cystine-Stabilized Beta-sheet (CSB) elementary motif in knottins that includes the II-V and III-VI disulfide bridges.
    Proteins that display both a CSB motif and a third I-IV disulfide bridge are indeed knottins.
  4. Root mean square deviations below 2.5 define knottins. Ambiguous hits ( 2.5 < RMSD < 3.5 ) are manually checked. If the RMSD is above 3.5, the structure is rejected.
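
Steps 1 and 4 above can be sketched in a few lines. This is a hedged illustration, not the KNOTER3D implementation: the function names, the input format (disulfide bridges as pairs of residue numbers) and the residue numbers in the example are assumptions. The text does not say how an RMSD of exactly 2.5 or 3.5 is treated; the sketch sends both to manual checking.

```python
from itertools import combinations


def has_knottin_connectivity(bridges):
    """True if three of the bridges pair six cysteines as I-IV, II-V, III-VI.

    bridges is a list of (residue_number, residue_number) pairs.
    """
    normalised = [tuple(sorted(bridge)) for bridge in bridges]
    for trio in combinations(normalised, 3):
        cysteines = sorted(pos for bridge in trio for pos in bridge)
        if len(set(cysteines)) != 6:
            continue  # a cysteine appears in two bridges: not a valid trio
        c1, c2, c3, c4, c5, c6 = cysteines
        if set(trio) == {(c1, c4), (c2, c5), (c3, c6)}:
            return True
    return False


def classify_rmsd(rmsd, accept_below=2.5, reject_above=3.5):
    """Map an RMSD to the reference structure onto the three outcomes of step 4."""
    if rmsd < accept_below:
        return "knottin"
    if rmsd > reject_above:
        return "rejected"
    return "manual check"


# Hypothetical bridge list with the I-IV, II-V, III-VI pattern.
print(has_knottin_connectivity([(2, 19), (9, 21), (15, 27)]), classify_rmsd(3.0))
```

Testing every combination of three bridges keeps the check valid for proteins with additional disulfides, at the cost of a cubic number of combinations, which is negligible for proteins of this size.
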
KNOTER1D predicts if a protein sequence is a knottin:
  1. The SwissProt/TrEMBL (UniProt) release is filtered for cysteine-rich proteins. Each protein seq1 in the resulting subset is compared to all known knottin sequences knot_seq2 using a similarity score S based on: 
    - the BLAST P-value for the seq1/knot_seq2 comparison.
    - the number of conserved cysteines of the knot when aligning seq1 onto the multiple alignment of the family of knot_seq2.
    - the compatibility of intercysteine loop lengths with known knottin loop lengths.
    - the taxonomic proximity between seq1 and knot_seq2.
    - the similarity between the seq1 function and knottin functions
    - the similarity between seq1 keywords and knottin keywords, and/or dissimilarity between seq1 keywords and non-knottin keywords.
  2. New sequences are classified by decreasing S scores. This list is then manually annotated for sequences to be included or rejected. 
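
One ingredient of the score, the compatibility of intercysteine loop lengths with known knottin loop lengths, can be illustrated with the maximum observed lengths quoted in the numbering section, i.e. (7), 11, 7, (16), 13 for loops a to e. The sketch below is an assumption about how such a check might look; it is not the KNOTER1D score S, whose formula is not given here, and the function names are hypothetical. It only tests upper bounds, because no minimum lengths are stated.

```python
# Maximum observed loop lengths a-e, taken from the numbering section.
MAX_LOOP_LENGTHS = {"a": 7, "b": 11, "c": 7, "d": 16, "e": 13}


def loop_lengths(cys_positions):
    """Loop lengths a-e for the six cysteine positions of a candidate knot."""
    if len(cys_positions) != 6:
        raise ValueError("exactly six cysteine positions are expected")
    ordered = sorted(cys_positions)
    gaps = [nxt - cur - 1 for cur, nxt in zip(ordered, ordered[1:])]
    return dict(zip("abcde", gaps))


def loops_compatible(cys_positions, max_lengths=MAX_LOOP_LENGTHS):
    """True if no loop is longer than the maximum observed for that loop."""
    lengths = loop_lengths(cys_positions)
    return all(lengths[name] <= max_lengths[name] for name in lengths)


# Hypothetical cysteine positions (1-based) in a candidate sequence.
print(loop_lengths([2, 9, 15, 19, 21, 27]), loops_compatible([2, 9, 15, 19, 21, 27]))
```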

General data for all selected knottin structures are obtained from the Protein Data Bank or MMDB and are stored in Table 1.
Local structural information is computed for each residue and stored in Table 2.
Non-local information is computed for all interactions and stored in Table 3.
General data for all selected knottin sequences are obtained from the SwissProt/TrEMBL database and stored in Table 4.

 

 
 Searching the database
 
 The button "Search Database" points at a web form that allows users to search either knottin structures OR knottin sequences.
 
  1.  Select the SEQUENCE or STRUCTURE database

  2. Make selections based on various criteria
    • Family Select one or all knottin protein families
    • Keyword Enter a keyword (currently only one) to select entries that contain it in the Family, SwissProt ID, PDB ID, Descriptor, Source or Function fields. The keyword is case-insensitive.
      examples: 2eti, agatoxin, sodium, ITR2_momco, homo
    • Method allows selection of the method used to build the model (X-ray or NMR)
    • Nomenclature (i.e. loop lengths). Use Perl Regular expressions to select particular loop lengths. For example, "^[2-4]$" will select loops with 2, 3, or 4 residues.

  3. Sort the results

    1. Family: sort by family name
    2. SwissProt ID: sort by SwissProt ID
    3. PDB code: sort by PDB code
    4. Length: sort by sequence length
    5. RMS dev: sort by root mean square deviation from the reference structure 2btcI.
    6. Nomenclature: sort by the a-e loop lengths as used in the knottin nomenclature

  4. Output tables
    1. The checkboxes allow selections for further sequence alignments or structural superimpositions available via menus immediately above and (possibly) below the output tables.
    2. The green buttons point at individual knottin cards
    3. The "Jmol" link in structure tables forwards you to 3D viewing of the structure with the Jmol Java applet.
    4. Nomenclature: The knottin nomenclature based on loop lengths
    5. Family: The protein family
    6. SwissProt: The SwissProt ID ("-" means that no SwissProt ID is available yet)
    7. PDBsum: The PDB code points at the corresponding page in PDBsum
    8. For 3D structures
      R points at renumbered PDB files according to the unique knottin numbering scheme
      F points at renumbered AND structurally fitted PDB files. The renumbered structures are superimposed onto a reference knottin structure (the X-ray structure of Cucurbita Pepo Trypsin Inhibitor II, CPTI-II, PDB ID 2btcI) for residues 40, 60-61, 79-81 and 99-100. These residues approximately define the Cystine-Stabilized Beta-sheet elementary motif in knottins that includes the II-V and III-VI disulfide bridges.
      Chain: The chain ID in the PDB file
      Length: The sequence length
      RMS: The RMS deviation from the reference structure for backbone atoms of residues 40, 60-61, 79-81 and 99-100
      Experimental: The method for structure determination.
    9. Descriptor
    10. Source
    11. Tissue
    12. Function
    13. PubMed

  5. Individual knottin cards
    General data similar to those in the Main tabular output are available
    Supplementary information essentially includes

    - A "Collier de Perles" two-dimensional representation. We gratefully thank Marie-Paule Lefranc for authorizing the use of the expression "Collier de Perles", which originally refers to standardized 2D representations in IMGT, the international ImMunoGenetics information system® (http://imgt.cines.fr) [Lefranc, M.-P. et al. Nucleic Acids Res., 1999, 27, 209-212].
    The original program for automatic drawing of Colliers de Perles was rewritten and adapted to knottins by Quentin Kaas.

    - For 3D structures: a sequence-structure table showing:
    1. Number: The knottin unique numbering.
    2. Residue: The amino acid at each position
    3. Secondary: The secondary structure computed using the Stride program.
    4. Phi: The Phi angle computed using the Stride program.
    5. Psi: The Psi angle computed using the Stride program.
    6. ASA: The accessible surface area computed using the Stride program
    7. PAC: The percent of accessibility computed using the local PDBgeo program
    8. Yellow columns indicate the knotted disulfide bridges
      Brown columns indicate the additional disulfides

  6. Sequence alignments
    1. A "Get Alignment" button is available on top of the output tables.
    2. It is first necessary to select which knottins must be aligned using either the "Select All" button or the checkboxes on the left side of each protein line.
    3. The renumbered sequences of the selected proteins are passed to PAT (Protein Analysis Toolkit, written by Jerome Gracy, jgracy@cbs.cnrs.fr), which produces and displays an HTML color-coded standardized alignment.
    4. Each alignment displayed on the KNOTTIN website is preceded by a drop-down menu that allows users to:
      - retrieve a text-formatted version of the alignment in various formats for download
      - create a sequence logo
      - send the alignment to PAT and switch the browser page to the PAT webserver for more powerful analyses.

  BLASTing the database
 
  The button "Blast Database" points at a web form that allows users to upload their own sequence and BLAST it (or search an HMM model database with it) against the KNOTTIN database.
 
  1. Select search method
    BLAST: selecting BLAST means that a blastp is performed against all sequences in the database (including sequences of the PDB files)
    HMM: selecting HMM means that an HMMsearch is performed against HMM models pre-computed for each family in the database. The output displays first the HMMsearch result, then the alignment of the user sequence with the sequences used to build the matching model (HMMalign is used for this).

  2. Paste sequence in FASTA format
    The given sequence must conform to the FASTA format (see below) and must only contain permitted characters.

    A sequence in FASTA format begins with a single-line description, followed by lines of one-letter code sequence data.

    • The description line starts with a greater than symbol (">").
    • The word immediately following the greater than symbol (">") is the name of the sequence; the rest of the line is the description.
    • Name and description are optional.
    • All lines of text should be shorter than 80 characters.
    • The sequence ends if there is another greater than symbol (">") at the beginning of a line.

    Example:
    >ITR2_ECBEL Trypsin Inhibitor II (EETI-II).
    GCPRILMRCKQDSDCLAGCVCGPNGFCGSP


  3. Select MAX number of hits
    Restrict the number of hits shown in the output

  4. Select MAX p-value
    Display only hits with a p-value below the selected maximum

  5. Submit
    The user sequence is passed to PAT (Protein Analysis Toolkit, written by Jerome Gracy), which runs BLAST against the knottin database.
    The BLAST output is then HTML color-coded by PAT for display.
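
The FASTA rules listed in step 2 can be checked before submitting a sequence. The parser below is a minimal sketch written for this purpose; it is not the code used by the web form, and the name `parse_fasta` is hypothetical. It does not validate residue letters, because the set of permitted characters is not listed here.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) records.

    A record starts at a line beginning with ">"; the first word after
    ">" is the name and the rest of the line is the description (both
    optional). Sequence lines are concatenated until the next ">".
    """
    records = []
    name = description = ""
    chunks = []
    started = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if started:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
            started = True
        else:
            if not started:
                raise ValueError("sequence data found before the first '>' line")
            chunks.append(line.upper())
    if started:
        records.append((name, description, "".join(chunks)))
    return records


example = """>ITR2_ECBEL Trypsin Inhibitor II (EETI-II).
GCPRILMRCKQDSDCLAGCVCGPNGFCGSP
"""
print(parse_fasta(example))
```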

   SEGMENT SEARCH
 
   The button "SEGMENT SEARCH" points at a web form that allows users to search the KNOTTIN database for 20-residue long segments that satisfy the selection explained below:
 
  1. Select display
    Default display includes sequence and position according to the unique knottin numbering.
    Users can choose to display the Phi and Psi angles, the secondary structure and the percent of accessibility (PAC) by checking the appropriate radio button.

    All subsequent selections are optional, although at least one selection must be performed. The maximum output is currently limited to 1000 segments.

  2. Select protein
    To select protein(s), type a list of PDB IDs separated by blanks. (example "2btc 1ha9 1c6w").
    The segment search will only be done on these proteins.
    Note that if no other selection is performed, results will include all sliding segments for each protein, i.e. as many segments as there are residues in the protein. To avoid this, users can select a starting position for the segment (see below).

  3. Starting position
    To select only segments starting at a given position, users can use the scrolling list. The position corresponds to the unique knottin numbering. "*" means any position.

  4. Select sequence pattern
    Users can select a specific residue for each position along the 20-residue segment using the scrolling lists.


    "*" means any residue


  5. Select Ramachandran quadrant
    Users can select the Ramachandran quadrant for each residue along the 20-residue segment.
     

    *: any quadrant

    UL: Upper Left (Phi <0, Psi>0)

    UR: Upper Right (Phi >0, Psi>0)

    LL: Lower Left (Phi <0, Psi<0)

    LR: Lower Right (Phi >0, Psi<0)

  6. Select secondary structure
    Users can select the secondary structure for each residue along the 20-residue segment. Secondary structure is computed using the STRIDE program and uses the following coding:


    *: any secondary structure

    H: alpha-helix

    G: 3-10 helix

    I: PI-helix

    E: Extended

    B: Isolated bridge

    T: Turn

    C: Coil (none of the above)

  7. Output display

    The output consists of a succession of segments extracted from the KNOTTIN database.
    The first line indicates the protein (PDB ID, chain) and the sequence of the segment.
    The second line indicates the numbering of each residue according to the knottin unique numbering.
    Subsequent lines indicate dihedral angles, percent of accessibility or secondary structure, depending on the output request.
    Output positions that match the requested selection are shown in red.
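
The Ramachandran quadrant codes of step 5 and the "*" wildcard can be illustrated with a short sketch. It is not the Segment Search implementation: the function names and the input format (a list of Phi/Psi pairs in degrees) are assumptions, angles of exactly 0 are assigned arbitrarily because the text does not cover them, and only windows lying fully inside the chain are tested. The pattern length sets the window size, so a 20-entry pattern reproduces the 20-residue segments.

```python
def quadrant(phi, psi):
    """Return UL, UR, LL or LR for a pair of dihedral angles in degrees.

    First letter: U if Psi > 0, else L. Second letter: L if Phi < 0, else R.
    Angles equal to 0 fall in the "else" branches (assumption).
    """
    vertical = "U" if psi > 0 else "L"
    horizontal = "L" if phi < 0 else "R"
    return vertical + horizontal


def find_segments(angles, pattern):
    """Return the start indices of windows whose quadrants match pattern.

    angles is a list of (phi, psi) pairs, one per residue; pattern is a
    list of quadrant codes in which "*" matches any quadrant.
    """
    size = len(pattern)
    hits = []
    for start in range(len(angles) - size + 1):
        window = angles[start:start + size]
        if all(want == "*" or want == quadrant(phi, psi)
               for want, (phi, psi) in zip(pattern, window)):
            hits.append(start)
    return hits


# Hypothetical angles for four residues, searched with a two-position pattern.
angles = [(-60, -45), (-120, 130), (60, 45), (-70, 140)]
print(find_segments(angles, ["*", "UL"]))
```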

   TOOLS
 
The "Tools" menu points at a web form that allows users to submit a PDB file or a protein sequence.

KNOTER3D will determine if a 3D structure is a knottin.
KNOTER1D will predict if a protein sequence is a knottin.

 

  1. Submit a protein structure or a protein sequence
    The 3D structure must be in PDB format and the sequence in the FASTA format.

  2. Provide additional information if a sequence is submitted
    KNOTER1D will search for homologs in the SwissProt/TrEMBL database, and extract taxonomy, function or keywords from the best homolog if any. The user has the possibility to provide  this information directly to KNOTER1D.

  3. Output
    KNOTER3D or KNOTER1D will indicate whether the protein is, or probably is, a knottin.
    A detailed output of KNOTER1D is available via an HTML link.
    If the protein is a knottin or putative knottin, standardized information is provided
    - The original and standardized sequence numbers of the cysteines of the knot
    - The knottin nomenclature
    - a standardized Collier de Perles 2D representation.
    - a standardized renumbered alignment (knotted cysteines I, II, III, V and VI are renumbered 20, 40, 60, 80 and 100, respectively).
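
To make the first two output items concrete, the sketch below derives the nomenclature and the original-to-standardized numbers of the knotted cysteines from a sequence. It is an illustration under strong assumptions, not the KNOTER1D procedure: the function name `describe_knot` is hypothetical, the sequence must contain exactly six cysteines, which are taken as I to VI in order, and a dot is always written between loops b and c. CYS IV is left out of the renumbering because it has no fixed number.

```python
# Fixed standardized numbers of the knotted cysteines (CYS IV is free).
STANDARD_NUMBERS = {"I": 20, "II": 40, "III": 60, "V": 80, "VI": 100}


def describe_knot(sequence):
    """Return (nomenclature, {label: (original_number, standard_number)})."""
    cys = [i + 1 for i, aa in enumerate(sequence.upper()) if aa == "C"]
    if len(cys) != 6:
        raise ValueError("this sketch only handles sequences with six cysteines")
    # Loop lengths a-e are the residue counts between consecutive cysteines.
    a, b, c, d, e = (nxt - cur - 1 for cur, nxt in zip(cys, cys[1:]))
    nomenclature = f"({a}){b}.{c}({d}){e}"
    original = dict(zip(["I", "II", "III", "IV", "V", "VI"], cys))
    renumbered = {label: (original[label], number)
                  for label, number in STANDARD_NUMBERS.items()}
    return nomenclature, renumbered


# Sequence taken from the FASTA example in the BLAST section.
print(describe_knot("GCPRILMRCKQDSDCLAGCVCGPNGFCGSP"))
```

For this sequence the sketch returns "(6)5.3(1)5", the same string as the one quoted for PDB ID 2eti in the nomenclature section.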

   Regular Expressions
 
  Regular expressions can be used to select knottins in the KNOTTIN database according to the knottin nomenclature, i.e. according to loop lengths. To facilitate writing, separate expressions are entered for each loop.

Main metacharacters used in regular expressions are indicated below:

character  meaning                          example  matches
^          beginning of string              ^6       the string begins with "6"
$          end of string                    6$       the string ends with "6"
.          any character (except newline)   a.6      an "a" followed by any single character followed by a "6"
*          preceding item, 0 or more times  c*3      any number of (or no) "c" followed by a "3"
+          preceding item, 1 or more times  c+3      any number of "c" (at least one) followed by a "3"
?          preceding item, 0 or 1 time      c?3      no or one "c" followed by a "3"
[ ]        set of characters                [abc]3   either "a" or "b" or "c" followed by a "3"
|          alternative                      ^1|3$    either the string begins with a "1" or ends with a "3"
{ }        repetition modifier              c{2,5}3  two to five "c" followed by a "3"
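
The per-loop selection can be tried offline with Python's `re` module, which treats the metacharacters listed above in the same way as Perl. This is a sketch, not the database query code: the function name `select_by_loops` and the entry "demo" are hypothetical, and each loop length is assumed to be matched as a decimal string.

```python
import re


def select_by_loops(entries, patterns):
    """Return the names of entries whose loop lengths match every pattern.

    entries maps a name to a dict of loop lengths, e.g. {"a": 6, "b": 5};
    patterns maps a loop letter to a regular expression.
    """
    compiled = {loop: re.compile(rx) for loop, rx in patterns.items()}
    return [name for name, loops in entries.items()
            if all(rx.search(str(loops[loop])) for loop, rx in compiled.items())]


entries = {
    "2eti": {"a": 6, "b": 5, "c": 3, "d": 1, "e": 5},   # from "(6)5.3(1)5"
    "demo": {"a": 3, "b": 11, "c": 4, "d": 1, "e": 5},  # hypothetical entry
}
print(select_by_loops(entries, {"c": "^[2-4]$"}))
print(select_by_loops(entries, {"b": "1"}))
```

The second query shows why the anchors matter: the unanchored pattern "1" also selects a loop of 11 residues, whereas "^1$" would select only loops of exactly one residue.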