Editor: 被控制998 | 2019-07-16
SP3 . . . 3.05/2.00
| Method | Data set | Validation | Pearson's ρ | σ (kcal/mol) |
|---|---|---|---|---|
| mCSM^a | SP1 | 5-fold cross validation | 0.54 | 1.23 |
| MAESTRO-Score | SP1 | 5-fold cross validation | 0.45 | - |
| MAESTRO | SP1 | 5-fold cross validation | 0.67 | 1.12 |
| MAESTRO-Score | SP3 | 20-fold cross validation | 0.44 | - |
| MAESTRO | SP3 | 20-fold cross validation | 0.74 | 1.23 |
| MAESTRO-Score | SP4 | 10-fold cross validation | 0.40 | - |
| MAESTRO | SP4 | 10-fold cross validation | 0.65 | 1.36 |
| mCSM^a | SP1 351 | blind test | 0.67 | 1.19 |
| PoPMuSiC^a | SP1 351 | blind test | 0.73 | 1.09 |
| MAESTRO-Score | SP1 351 | blind test | 0.59 | - |
| MAESTRO | SP1 351 | blind test | 0.71 | 1.16 |

Table S2: Prediction performance in case of excluded mutation sites. ^a Data obtained from Pires et al. (supplementary material) [5].

In the second type of blind test experiments we investigated the effect of excluded proteins. This best reflects the real-world application of a prediction method. Therefore we first performed n-fold cross validation experiments on the SP1, SP3 and SP4 data sets, where all mutations of a given protein are either exclusively in the training set or exclusively in the test set.

In a second set of experiments we aimed to determine the impact of sequence similarity between a protein in the training set and a protein in the test set. All proteins in a given set (SP1, SP3, SP4) were clustered by sequence similarity using BLASTclust with a similarity cutoff of 30% identical residues in the alignment (BLASTclust parameter -S = 30; the remaining parameters were left at their default values). In the blind test, a given protein cluster is then either exclusively in the training set or exclusively in the test set.

We finally performed an experiment on data set SP1 where we used the n-fold definition kindly provided by Pires et al. [5] on their web pages². The results are summarized in Table S3 below.

| Method | Data set | Validation | Pearson's ρ | σ (kcal/mol) |
|---|---|---|---|---|
| mCSM^a | SP1 | 5-fold cross validation | 0.51 | 1.26 |
| MAESTRO | SP1 | 5-fold cross validation | 0.63 | 1.17 |
| MAESTRO | SP1 | 5-fold cross validation (BLASTclust) | 0.63 | 1.17 |
| MAESTRO | SP1 | 5-fold cross validation (Pires def.) | 0.62 | 1.18 |
| MAESTRO | SP3 | 20-fold cross validation | 0.70 | 1.32 |
| MAESTRO | SP3 | 20-fold cross validation (BLASTclust) | 0.69 | 1.33 |
| MAESTRO | SP4 | 10-fold cross validation | 0.60 | 1.44 |
| MAESTRO | SP4 | 10-fold cross validation (BLASTclust) | 0.61 | 1.44 |

Table S3: Prediction performance in case of excluded proteins. ^a Data obtained from Pires et al. (supplementary material) [5].

In general, we observe a decrease in performance with this protein-based blind test compared to the random n-fold tests (see results on single point mutations in the main text) and also compared to the blind test regarding the mutation site (Table S2). However, the performance decrease is less pronounced for MAESTRO than for mCSM. The appearance of homologous proteins in both training set and test set has little impact on the results. The differently grouped 5-fold cross validation sets for data set SP1 (ours vs. the mCSM ones) do not influence the MAESTRO result.

Besides the regression performance, we analyzed the impact of the two blind test experiments on the binary classification performance. The results in Table S4 show that the classification performance is less affected than the regression performance.

² http://bleoberis.bioc.cam.ac.uk/mcsm/data
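The protein-level split described above can be illustrated with a minimal Python sketch. It is not the MAESTRO implementation: the function names, the greedy balancing of fold sizes, and the reading of σ as a root-mean-square error are assumptions made for illustration. The group label can be a protein identifier, or a cluster identifier when proteins are first clustered by sequence similarity.

```python
import math
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign sample indices to folds so that no group spans two folds.

    `groups` holds one label per sample (a protein ID or a cluster ID).
    Groups are placed largest-first into the currently smallest fold,
    a simple greedy balancing heuristic (an assumption of this sketch).
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Sort by size (descending), then by label, for a deterministic result.
    for label in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(folds, key=len)
        smallest.extend(members[label])
    return folds


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(predicted, observed):
    """Root-mean-square error, used here as a stand-in for sigma."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Hypothetical protein labels for seven mutations.
proteins = ["P1", "P1", "P1", "P2", "P2", "P3", "P4"]
folds = grouped_folds(proteins, n_folds=2)
```

Each fold serves once as the test set while the remaining folds form the training set; because whole groups are moved together, no protein (or cluster) contributes mutations to both sides of a split.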
| Data set | Blind test | Acc. | Recall | Prec. | Recall | Prec. | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| SP1 | 5-fold mutation site | 0.81 | 0.55 | 0.59 | 0.88 | 0.87 | 0.45 | 0.83 |
| SP1 | 5-fold protein | 0.80 | 0.55 | 0.55 | 0.87 | 0.87 | 0.42 | 0.81 |
| SP1 | 5-fold protein (BLASTclust) | 0.80 | 0.53 | 0.55 | 0.87 | 0.86 | 0.41 | 0.81 |
| SP1 | 5-fold protein (Pires def.) | 0.79 | 0.56 | 0.54 | 0.86 | 0.87 | 0.42 | 0.81 |
| SP3 | 20-fold mutation site | 0.82 | 0.70 | 0.69 | 0.87 | 0.87 | 0.57 | 0.85 |
| SP3 | 20-fold protein | 0.81 | 0.70 | 0.67 | 0.85 | 0.87 | 0.55 | 0.85 |
| SP3 | 20-fold protein (BLASTclust) | 0.80 | 0.73 | 0.65 | 0.83 | 0.88 | 0.54 | 0.84 |
| SP4 | 10-fold mutation site | 0.82 | 0.39 | 0.57 | 0.93 | 0.86 | 0.37 | 0.79 |
| SP4 | 10-fold protein | 0.82 | 0.32 | 0.57 | 0.94 | 0.85 | 0.33 | 0.77 |
| SP4 | 10-fold protein (BLASTclust) | 0.82 | 0.39 | 0.57 | 0.93 | 0.86 | 0.37 | 0.78 |

Table S4: Classification performance on blind test experiments on mutation site and protein level.

Finally, we performed jack knife tests on the SP1 data set, where either a wild type amino acid or an exchange amino acid type was excluded from the training. In both cases the predictive power was reduced only marginally. The jack knife test on the wild type amino acids results in an overall ρ = 0.65 with σ = 1.14, while the jack knife test on the exchange amino acid results in an overall ρ = 0.67 with σ = 1.13.
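The jack knife procedure can be sketched as a leave-one-type-out split. The following Python sketch is an illustration only: the tuple layout `(wild_type, position, exchange)` and the function name `jackknife_splits` are hypothetical, and pooling the held-out predictions to obtain one overall ρ and σ is an assumption about how the overall values are formed.

```python
def jackknife_splits(mutations, key):
    """Yield (held_out, train_idx, test_idx) for each amino acid type.

    `mutations` is a list of (wild_type, position, exchange) tuples
    (a hypothetical layout). `key` picks the amino acid that defines
    the held-out type: the wild type or the exchange residue.
    """
    labels = sorted({key(m) for m in mutations})
    for held_out in labels:
        # All mutations of the held-out type form the test set ...
        test = [i for i, m in enumerate(mutations) if key(m) == held_out]
        # ... and are excluded from training entirely.
        train = [i for i, m in enumerate(mutations) if key(m) != held_out]
        yield held_out, train, test


# Toy data: (wild type, position, exchange amino acid).
mutations = [("A", 10, "G"), ("A", 25, "V"), ("L", 7, "A"), ("G", 3, "A")]

by_wild_type = list(jackknife_splits(mutations, key=lambda m: m[0]))
by_exchange = list(jackknife_splits(mutations, key=lambda m: m[2]))
```

In such a setup a model would be trained once per held-out type, and the predictions on all test sets would then be collected before the correlation and error are computed.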