Downloads

The following table summarizes the substrate datasets that we extracted from MEROPS to develop the PROSPER tool for predicting the cleavage sites of multiple proteases. Each substrate dataset can be downloaded by clicking the hyperlink on the MEROPS ID of the corresponding protease in this table.

These substrate data were derived mainly from the MEROPS database, an online information resource for proteases and their inhibitors (Rawlings et al., Nucleic Acids Res 2008, 36, D320-D325). To avoid over-training, we performed sequence homology reduction within the training and testing datasets so that the sequence identity between any two peptide sequences is no greater than 70%.
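The 70% identity cut-off can be illustrated with a minimal sketch. The page does not say which tool or identity measure was used, so everything below is an assumption made for illustration: peptides are taken to be equal-length windows, identity is the fraction of matching positions, and filtering is a simple greedy pass. The function names `identity` and `reduce_redundancy` are hypothetical.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length peptides match.

    ASSUMPTION: position-wise identity on equal-length windows; the
    original pipeline may have used an alignment-based measure instead.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length peptides")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def reduce_redundancy(peptides: List[str], max_identity: float = 0.70) -> List[str]:
    """Greedy filter: keep a peptide only if its identity to every
    previously kept peptide is at most `max_identity`."""
    kept: List[str] = []
    for pep in peptides:
        if all(identity(pep, k) <= max_identity for k in kept):
            kept.append(pep)
    return kept


# Toy peptides, invented for illustration only.
toy = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAACCCC", "CCCCCCCCCC"]
non_redundant = reduce_redundancy(toy)
```

A greedy pass depends on input order, so a different ordering of the same peptides can keep a different (equally valid) subset.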

 

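| Protease class | Protease subfamily (MEROPS ID) | Number of substrate sequences | Number of cleavage sites | P8-P8' sequence logo |
|---|---|---|---|---|
| Aspartic protease | | 239 | 376 | |
| Cysteine protease | Cathepsin K (C01.036) | 69 | 85 | |
| Cysteine protease | | 42 | 82 | |
| Cysteine protease | | 41 | 50 | |
| Cysteine protease | | 235 | 347 | |
| Cysteine protease | | 74 | 89 | |
| Cysteine protease | | 62 | 168 | |
| Cysteine protease | | 41 | 58 | |
| Metalloprotease | Matrix metallopeptidase-2 (M10.003) | 575 | 1185 | |
| Metalloprotease | Matrix metallopeptidase-9 (M10.004) | 44 | 211 | |
| Metalloprotease | | 53 | 152 | |
| Metalloprotease | | 43 | 95 | |
| Serine protease | | 161 | 293 | |
| Serine protease | | 235 | 276 | |
| Serine protease | | 154 | 249 | |
| Serine protease | | 121 | 176 | |
| Serine protease | | 148 | 154 | |
| Serine protease | | 1660 | 7395 | |
| Serine protease | | 76 | 94 | |
| Serine protease | | 44 | 96 | |
| Serine protease | | 377 | 703 | |
| Serine protease | | 72 | 82 | |
| Serine protease | | 269 | 269 | |
| Serine protease | | 42 | 43 | |
| Serine protease | | 303 | 303 | |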

Supplementary material downloads

1. The compiled substrate datasets cover 24 different proteases from four major protease families: Aspartic (A), Cysteine (C), Metallo (M) and Serine (S). After sequence homology reduction, the final datasets contain 3520 substrate sequences and 5635 cleavage sites. The curated substrate dataset of each protease can be downloaded by clicking the MEROPS ID of that protease in the above table. Alternatively, you can download the whole substrate dataset for all twenty-four proteases at this link: Substrate_seq.txt

For each substrate entry (starting with ">"):
1) The first line gives the UniProt ID, followed by the MEROPS ID of the protease that can cleave the substrate. These two annotations are separated by "|";
2) The second line, starting with "site:", gives the substrate cleavage site from the P4 to the P4' position, with "|" marking the cleavage site. Note that a substrate may have one or more experimentally verified cleavage sites;
3) The third line gives the substrate sequence, as in FASTA format;
4) The fourth line gives the secondary structure predicted by the PSIPRED program (Jones, 1999). "H" denotes alpha-helix, "E" denotes beta-strand, and "C" denotes coils or loops;
5) The fifth line gives the solvent accessibility predicted by the SCRATCH program (Cheng et al., 2005). "e" denotes exposed, while "b" denotes buried;
6) The last line gives the natively unstructured or disordered regions predicted by the DISOPRED2 program (Ward et al., 2004). "*" denotes disordered, while "." denotes structured or ordered.
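The six-line entry layout described above can be read with a short parser. This is a minimal sketch rather than an official reader: the class and function names are hypothetical, the example record is invented, and the separator between multiple cleavage sites on the "site:" line is an assumption (whitespace, commas or semicolons), since the description does not specify it.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class SubstrateEntry:
    uniprot_id: str
    merops_ids: List[str]
    sites: List[str]             # e.g. "DEVD|GSGS" (P4-P4', "|" = cleavage site)
    sequence: str
    secondary_structure: str     # H / E / C per residue
    solvent_accessibility: str   # e / b per residue
    disorder: str                # * / . per residue


def _build(block: List[str]) -> SubstrateEntry:
    if len(block) != 6 or not block[0].startswith(">"):
        raise ValueError(f"expected a 6-line entry starting with '>': {block[:1]}")
    header, site_line, seq, ss, acc, dis = block
    ids = [part.strip() for part in header[1:].split("|") if part.strip()]
    if not ids:
        raise ValueError("header line carries no identifiers")
    if not site_line.startswith("site:"):
        raise ValueError(f"second line must start with 'site:': {site_line!r}")
    # ASSUMPTION: several sites may share the line, separated by
    # whitespace, commas or semicolons.
    sites = [t for t in re.split(r"[\s,;]+", site_line[len("site:"):]) if t]
    if not len(seq) == len(ss) == len(acc) == len(dis):
        raise ValueError(f"per-residue lines differ in length for {ids[0]}")
    return SubstrateEntry(ids[0], ids[1:], sites, seq, ss, acc, dis)


def parse_entries(lines: Iterable[str]) -> Iterator[SubstrateEntry]:
    """Yield one SubstrateEntry per '>' record, skipping blank lines."""
    block: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">") and block:
            yield _build(block)
            block = []
        block.append(line)
    if block:
        yield _build(block)


def p1_positions(entry: SubstrateEntry) -> List[int]:
    """1-based sequence positions of the P1 residue for each listed site."""
    positions = set()
    for site in entry.sites:
        left, _, right = site.partition("|")
        motif = left + right
        if not motif:
            continue
        start = entry.sequence.find(motif)
        while start != -1:
            positions.add(start + len(left))
            start = entry.sequence.find(motif, start + 1)
    return sorted(positions)


# Invented record for illustration; not taken from Substrate_seq.txt.
EXAMPLE = """>P00000|C14.003
site: DEVD|GSGS
MKDEVDGSGSAA
CCCCHHHHCCCC
eeeebbbbeeee
**........**
"""
entries = list(parse_entries(EXAMPLE.splitlines()))
```

To read the real file, pass an open file handle instead of the example string, for instance `parse_entries(open("Substrate_seq.txt"))`. If the actual site separator differs from the assumption above, only the `re.split` pattern needs to change.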

 

2. The independent test substrate datasets for caspase-3, MMP-2, granzyme B (human) and granzyme B (mouse). The MMP-2 substrate dataset was extracted from a recent proteome profiling study (Kleifeld et al., 2010), while the others were extracted from the recent update of MEROPS (Rawlings et al., 2010): Independent_test_set.txt

 

3. Distribution of structural determinants of cleavage sites across three structural feature types (secondary structure, solvent accessibility and native disorder): Structural_determinants.xlsx

 

4. We applied PROSPER at high stringency (100% specificity level) to the human and mouse proteomes, which contain 87,040 and 56,687 proteins, respectively. The predicted cleavage sites of caspase-1, 3, 7, 6, 8, granzyme B (human) and granzyme B (mouse) can be downloaded from this link: Large-scale.Prediction.tgz

 

5. Analysis of Gene Ontology assignments for the predicted substrates of caspase-1, 3, 7, 6, 8, granzyme B (human) and granzyme B (mouse) in the human proteome: GO_distribution_Specificity100%.xlsx
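Items 4 and 5 both refer to predictions made at the 100% specificity level. The page does not describe how PROSPER derives that cut-off, so the following is only a generic sketch of the idea: given prediction scores for known non-cleavage sites, choose the score threshold above which the requested fraction of those negatives is rejected. The function names and the toy scores are hypothetical.

```python
import math
from typing import List, Sequence


def threshold_for_specificity(negative_scores: Sequence[float],
                              specificity: float = 1.0) -> float:
    """Return a cutoff t such that calling a site 'cleaved' only when
    score > t rejects at least `specificity` of the known negatives."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0.0 < specificity <= 1.0:
        raise ValueError("specificity must be in (0, 1]")
    ordered = sorted(negative_scores)
    # Number of negatives that must score at or below the cutoff;
    # round() guards against floating-point noise in the product.
    k = math.ceil(round(specificity * len(ordered), 9))
    return ordered[k - 1]


def call_sites(scores: Sequence[float], threshold: float) -> List[bool]:
    """True where a candidate site scores strictly above the cutoff."""
    return [s > threshold for s in scores]


# Toy scores, invented for illustration only.
negatives = [0.1, 0.4, 0.35, 0.8, 0.2]
strict = threshold_for_specificity(negatives, 1.0)
relaxed = threshold_for_specificity(negatives, 0.8)
calls = call_sites([0.5, 0.9, 0.8], strict)
```

At 100% specificity the cutoff equals the highest score among the negatives, so no known non-cleavage site is called as cleaved; the price is that true sites scoring below that cutoff are missed.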

 

 

Standalone version and source code downloads

The local standalone version and source code of the PROSPER tool can be downloaded at the following link. Users who want to run PROSPER locally on a large number of substrate sequences are encouraged to download this version instead of using the online web server.

prosper.tgz

 

The compressed PROSPER package is approximately 210 MB and expands to about 4.0 GB after decompression. Most of this space is taken up by the SVM models, which were trained using different combinations of sequence and structure profiles as described in the Methodology section.

 


Copyright © 2012-2018. Monash Bioinformatics Platform, School of Biomedical Sciences, Faculty of Medicine, Monash University, Australia