
Supplementary information: A two-step site and mRNA-level model for predicting microRNA targets

Takaya Saito and Pål Sætrom

Citation:
Saito, T and Sætrom, P. A two-step site and mRNA-level model for predicting microRNA targets. BMC Bioinformatics 11, 612 (2010). https://doi.org/10.1186/1471-2105-11-612.
Table of contents
  1. Supplementary Methods
    1. Data partition into positive (down-regulated) and negative genes
    2. Data retrieval for benchmarks
    3. Performance evaluation on independent dataset
  2. Supplementary Results
    1. Analysis of target site level training data: Selection of seed types influences the coverage of potential target sites
    2. Prediction of targets with optimal distance sites: Optimal distance is effectively incorporated in our SVM prediction
    3. Site accessibility and sequence conservation: Explicit usage of site accessibility and sequence conservation does not improve SVM prediction
    4. Classification vs. Regression: Support vector classification (SVC) performs better than support vector regression (SVR)
    5. Dataset selection: Different parameters for constructing training or test data do not affect SVM prediction performance
  3. Supplementary Tables
    1. Table S1. Target site level features.
    2. Table S2. Analysis of target site level feature influence.
    3. Table S3. mRNA level features.
    4. Table S4. Analysis of mRNA level feature influence.
    5. Table S5. Benchmarks on the Linsley dataset.
    6. Table S6. miRNA/siRNA sequences from five microarray and two proteomics experiments.
    7. Table S7. Parameters used for the Needle software.
    8. Table S8. Benchmarks on the Selbach dataset.
    9. Table S9. Benchmarks on the Baek dataset.
    10. Table S10. Benchmarks on the Linsley dataset – Positive records defined by log ratio -0.1.
    11. Table S11. Benchmarks on the Linsley dataset – Positive records defined by log ratio -0.3.
    12. Table S12. Benchmarks on the Selbach dataset – Positive records defined by log ratio -0.1.
    13. Table S13. Benchmarks on the Selbach dataset – Positive records defined by log ratio -0.3.
    14. Table S14. Benchmarks on the Baek dataset – Positive records defined by log ratio -0.1.
    15. Table S15. Benchmarks on the Baek dataset – Positive records defined by log ratio -0.3.
  4. Supplementary Figures
    1. Figure S1. 10-fold cross-validation of target site level classifiers trained on three microarray datasets.
    2. Figure S2. Density plot for the number of potential target sites divided by 3’UTR length.
    3. Figure S3. Comparison between two SVM trainings with or without explicit addition of negative records.
    4. Figure S4. 10-fold cross-validation of mRNA level classifiers trained on three microarray datasets.
    5. Figure S5. Three method-specific benchmarks of 8 different algorithms on the Linsley dataset.
    6. Figure S6. siRNA benchmarks on the Jackson dataset.
    7. Figure S7. miRNA benchmarks on the Lim dataset.
    8. Figure S8. siRNA benchmarks on the Birmingham dataset.
    9. Figure S9. miRNA benchmarks on the Grimson dataset.
    10. Figure S10. Targets with optimal distance sites show higher average discriminant values than the other three target types.
    11. Figure S11. Adding site accessibility and sequence conservation features did not affect target prediction accuracy.
    12. Figure S12. SVC performs better than SVR at both target site and mRNA levels.
    13. Figure S13. SVM performance is similar between the training set with up-regulated genes and the one without up-regulated genes.
    14. Figure S14. Two-step SVM retains the performance when trained with proteomics data.
    15. Figure S15. Benchmarks on the Selbach dataset.
    16. Figure S16. Benchmarks on the Baek dataset.
    17. Figure S17. Benchmarks on the Linsley dataset with positive record threshold by log ratio -0.1.
    18. Figure S18. Benchmarks on the Linsley dataset with positive record threshold by log ratio -0.3.
    19. Figure S19. Benchmarks on the Selbach dataset with positive record threshold by log ratio -0.1.
    20. Figure S20. Benchmarks on the Selbach dataset with positive record threshold by log ratio -0.3.
    21. Figure S21. Benchmarks on the Baek dataset with positive record threshold by log ratio -0.1.
    22. Figure S22. Benchmarks on the Baek dataset with positive record threshold by log ratio -0.3.
  5. References

Supplementary Methods

Data partition into positive (down-regulated) and negative genes

Down-regulated (positive) genes of GEO datasets were defined as records with p-value < 0.001 and log intensity ratio < -0.3 to obtain strongly down-regulated genes. Down-regulated genes for the Birmingham dataset were obtained directly from the published data [1].

Some microarray records contained different results for the same combination of miRNA/siRNA and mRNA; for example, a miRNA could have five records for the same mRNA, of which four were regarded as down-regulated but one as unaffected. In such cases, all records belonging to that combination were disregarded to enhance data quality.

Negative records of GEO datasets consisted of both unaffected and up-regulated genes. Unaffected genes were defined as records with p-value > 0.3 and log intensity ratio between -0.3 and 0.3, whereas up-regulated genes were defined as records with p-value < 0.001 and log intensity ratio > 0.3. Negative records for the Birmingham dataset were defined as all RefSeq transcripts except the positive records.

As a final filter, all miRNA:mRNA pairs that had either very high (>1.25) or very low (<-1.5) log intensity values in the control experiments were eliminated to reduce potential noise from extreme expression values.

All positive records were used in the training dataset, whereas negative records were randomly selected to give a total of 20000 records, unless otherwise specified. This cap of 20000 records was imposed due to computational time and memory constraints.

For the test data, positive records were selected as for the training data, whereas all the remaining records were used as negative records.

For the Linsley dataset [2], down-regulated genes were defined as records with p-value < 0.001 and log intensity ratio < -0.2.

For the Selbach and Baek datasets, down-regulated (positive) genes were defined as records with log ratio values < -0.2, whereas unaffected (negative) genes were defined as records with log ratio values > 0. Records that did not fit these criteria were not used in this study.
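Concretely, the partitioning above amounts to a handful of threshold filters followed by random downsampling of the negatives. The following is a minimal sketch of the GEO partitioning rules, not the original pipeline; the pandas column names p_value and log_ratio are assumptions:

```python
# Minimal sketch of the GEO record partitioning described above.
# Column names "p_value" and "log_ratio" are hypothetical.
import pandas as pd

def partition_geo_records(df: pd.DataFrame, n_total: int = 20000, seed: int = 0):
    positive = df[(df.p_value < 0.001) & (df.log_ratio < -0.3)]
    unaffected = df[(df.p_value > 0.3) & df.log_ratio.between(-0.3, 0.3)]
    up_regulated = df[(df.p_value < 0.001) & (df.log_ratio > 0.3)]
    negative = pd.concat([unaffected, up_regulated])
    # Training sets keep all positives and randomly downsample negatives
    # so that the total does not exceed 20000 records.
    n_negative = max(n_total - len(positive), 0)
    if len(negative) > n_negative:
        negative = negative.sample(n=n_negative, random_state=seed)
    return positive, negative
```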

Data retrieval for benchmarks

Prediction data were downloaded from the Segal Lab site (https://genie.weizmann.ac.il/pubs/mir07/) for PITA All and PITA Top, the TargetScanHuman 5.1 site (http://www.targetscan.org) for TargetScan and TargetScan with conserved genes, the MicroCosm Targets Version 5 site (http://www.ebi.ac.uk/enright-srv/microcosm) for miRanda, the miRDB site (http://mirdb.org) for mirTarget2, and the PicTar web interface (https://pictar.mdc-berlin.de/) for PicTar. Several methods provide only target site level scores for download; in those cases, the summation methods described on the websites or in the original publications were applied to obtain mRNA level scores. For siRNA predictions, the PITA and TargetScan executable programs were downloaded and used to calculate prediction scores.
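As a hypothetical illustration of such a collapse (column names are assumptions, and each method's exact summation rule follows its own website or publication), summing site level scores per miRNA:mRNA pair might look like this:

```python
# Collapse target site level scores to mRNA level scores by summation.
import pandas as pd

site_scores = pd.DataFrame({
    "mirna":  ["miR-1", "miR-1", "miR-1", "miR-124"],
    "refseq": ["NM_0001", "NM_0001", "NM_0002", "NM_0001"],
    "score":  [-1.2, -0.4, -2.1, -0.8],
})
# One score per miRNA:mRNA pair: the sum over that pair's target sites.
mrna_scores = site_scores.groupby(["mirna", "refseq"], as_index=False)["score"].sum()
```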

Performance evaluation on independent dataset

Six datasets – “ROC with All genes”, “ROC10*n”, “ROC with 7mer + Conservation”, and three method-specific datasets for TargetScan, miRanda, and PicTar – were created for evaluation. The “ROC with All genes” dataset comprised all the records from the microarray dataset.

“ROC10*n” was based on the standard ROC50 [3] benchmark. The ROC50 benchmark score is the area under the ROC curve until 50 negatives are found. Correspondingly, the “ROC10*n” benchmark datasets consisted of all true positives found in the microarray experiments and the first 10*n negatives found by each benchmarked method.
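A minimal sketch of this truncated-ROC score, using the common ROCn normalization [3] in which the truncated area is divided by n times the number of positives, so that a perfect ranking scores 1. Judging from the benchmark tables, the negative cutoff 10*n equals ten times the number of miRNAs assayed (e.g., 90 negatives for the nine Linsley miRNAs), though that reading is an inference from the tables:

```python
def roc_n_score(scores, labels, n_negatives):
    """Area under the ROC curve truncated at the first n_negatives
    negatives, normalized so a perfect ranking scores 1.
    scores: higher = stronger target prediction;
    labels: 1 = positive (down-regulated), 0 = negative."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = area = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
            area += tp  # positives ranked above this negative
            if fp == n_negatives:
                break
    total_pos = sum(labels)
    return area / (n_negatives * total_pos) if total_pos else 0.0

# e.g. roc_n_score(scores, labels, n_negatives=90) for the Linsley benchmark.
```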

The “ROC with 7mer + Conservation” dataset comprised genes with conserved 8mer, 7mer-m8, or 7mer-A1 sites. Sites were defined as conserved if the average phyloP 44-way score in the seed region was > 0.9.
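As a sketch, this filter reduces to an average over per-position scores, assuming the phyloP values for the seed region have already been extracted:

```python
# A site counts as conserved when the mean phyloP 44-way score over
# its seed region exceeds 0.9.
def is_conserved(seed_phylop_scores, threshold=0.9):
    return sum(seed_phylop_scores) / len(seed_phylop_scores) > threshold
```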

The three method-specific datasets for TargetScan, miRanda, and PicTar were generated by using only the records from the microarray experiments that had corresponding predictions for each method.

Supplementary Results

Analysis of target site level training data: Selection of seed types influences the coverage of potential target sites

Current methods for miRNA target prediction base their analyses on identifying so-called seed sites in 3’ UTRs. The definition of seed types varies from one prediction algorithm to another [4,5,6], however, and this heavily influences the coverage of potential target sites. Lower coverage means that predictions may miss many true positive sites, whereas higher coverage may result in poor accuracy because of too many false positive sites.

Based on four different microarray datasets – Birmingham [1], Jackson [7], Lim [8], and Grimson [5] – we identified a set of 823 genes that were down-regulated in response to artificial miRNA or siRNA transfection, or up-regulated in response to miRNA inhibition (see Supplementary Methods). This positive set of genes constituted 0.14% of the 571570 potential miRNA:gene interactions assayed in the microarray experiments. Searching for seed sites among the positive genes showed that 347 and 510 genes contained stringent and any type of seed sites, respectively. Thus, basing miRNA target predictions on stringent seeds or all seed types would cover 42% or 62% of the positive genes. The corresponding numbers for the negative set of 570747 genes were 31277 and 128254. Consequently, the predictive accuracy, as measured by the positive predictive value (PPV = true positive predictions / all positive predictions), was almost three times greater for stringent seeds (PPV = 1.1%) than for all seed types (PPV = 0.40%). Even so, this showed that predicting miRNA targets purely based on identifying stringent seed sites would give too many false positive predictions to be viable in practice.
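The coverage and PPV arithmetic above, spelled out with the counts given in the text:

```python
# Seed-type coverage and PPV from the counts in the text.
positives = 823  # down-regulated genes (negatives: 570747)

for name, tp, fp in [("stringent seeds", 347, 31277),
                     ("all seed types", 510, 128254)]:
    coverage = tp / positives        # fraction of positives with a site
    ppv = tp / (tp + fp)             # true positives / all predictions
    print(f"{name}: coverage = {coverage:.0%}, PPV = {ppv:.2%}")
# stringent seeds: coverage = 42%, PPV = 1.10%
# all seed types:  coverage = 62%, PPV = 0.40%
```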

Several approaches have shown that other target-related features besides the seed type and additional data such as miRNA and mRNA expression profiles can improve predictive accuracy [9,10]. We therefore chose to include all seed types to achieve high coverage in the training data and then improve the accuracy through two-step SVM training.

Prediction of targets with optimal distance sites: Optimal distance is effectively incorporated in our SVM prediction

We were interested in how well the optimal distance features were incorporated into our algorithm. Our model uses two mRNA features that represent optimal distances: (i) the number of sites that have neighboring sites within 14-46 nt and (ii) the number of sites that have neighboring sites within 17-25 nt. Furthermore, one target site level feature, “distance to the next-nearest-neighbor site”, could help recognize targets with optimally spaced sites. However, these features showed little contribution when the influence of both target site and mRNA level features was analyzed (Tables S2 and S4). Therefore, we took a different approach and examined the difference in prediction power between target genes with optimal distance sites and other target genes.

First, we categorized all potential targets into four target types, defined by the number of potential target sites, their seed types, and the existence of optimal distance sites within the gene’s 3’ UTR. “Single non-stringent” and “Single stringent” were targets with only one non-stringent or stringent seed site, respectively, whereas “Multiple without optimal distance” and “Multiple with optimal distance” were targets with multiple sites of which none or at least one pair, respectively, lay within optimal distance of each other. Then, we checked the mRNA level classifier’s distributions of discriminant values for these four target types. The resulting box plots (Fig. S10) show that for each target type, the mRNA-level SVM clearly separates the positive (down-regulated) and negative (non-down-regulated) targets. Moreover, targets with multiple sites had higher discriminant values than targets with single stringent or single non-stringent sites, and targets with optimally spaced sites had the highest discriminant values. Thus, site distance contributed positively to the SVM’s predictions.
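A minimal sketch of this categorization, under the assumptions that stringent seeds are the 8mer, 7mer-m8, and 7mer-A1 types, that the optimal distance window is the 14-46 nt range used by the mRNA level features (Table S3), and that each site is given by its 3’ UTR position:

```python
# Illustrative categorization into the four target types of Fig. S10.
# Sites are (position, seed_type) tuples; a sketch, not the original code.
STRINGENT = {"8mer", "7mer-m8", "7mer-A1"}

def target_type(sites, lo=14, hi=46):
    if len(sites) == 1:
        seed = sites[0][1]
        return "Single stringent" if seed in STRINGENT else "Single non-stringent"
    positions = sorted(pos for pos, _ in sites)
    gaps = (b - a for a, b in zip(positions, positions[1:]))
    if any(lo <= gap <= hi for gap in gaps):
        return "Multiple with optimal distance"
    return "Multiple without optimal distance"

print(target_type([(120, "7mer-m8")]))              # Single stringent
print(target_type([(120, "8mer"), (150, "6mer")]))  # Multiple with optimal distance
```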

Site accessibility and sequence conservation: Explicit usage of site accessibility and sequence conservation does not improve SVM prediction

Although target site accessibility is important for both miRNA and siRNA targeting [11,12], it is unclear how useful computational predictions of target site accessibility are for genome-wide miRNA target analyses [13,6]. Similarly, even though many target prediction tools rely on conserved sequences, as these indicate conserved function and targeting, sequences with little or no evidence of conservation can still be functional target sites [14]. Moreover, calculating the site accessibility and conservation features demands much more computational power and resources than calculating the other features. Therefore, we constructed classifiers without the site accessibility and sequence conservation features to evaluate their importance in our SVM model.

To evaluate these features, we compared the ROC scores of three mRNA level classifiers that used the same mRNA level features but three different target site classifiers: (i) without site accessibility or conservation information, (ii) with site accessibility, and (iii) with conservation information. The classifiers were then tested on the independent dataset.

All the results from the 10-fold cross-validation and independent tests showed very little difference among the three classifiers (Fig. S11). Although theoretically sound, the computationally intensive accessibility predictions could therefore, in practice, be eliminated from the SVM classifier. In addition, although many miRNA target sites appear to be well conserved, the small influence of the conservation-related features on SVM performance indicates that many non-conserved miRNA targets exist. This is consistent with the results of the benchmarks of the 8 existing miRNA target prediction methods. Consequently, the proportion of non-conserved miRNA target sites among all potential target sites seems too large to be ignored. Therefore, we opted to train our SVM classifiers without the site accessibility and sequence conservation features.

Classification vs. Regression: Support vector classification (SVC) performs better than support vector regression (SVR)

The microarray data used in our experiments measured actual changes in gene expression in response to miRNA or siRNA over-expression. Consequently, even though we treated miRNA target prediction as a classification problem, target prediction could also be solved as a regression problem. To test whether treating target prediction as a regression problem would further improve our predictions, we used the PyML library to train SVR models to fit the log ratio values from the Jackson, Lim, and Grimson microarray experiments. We did not use the Birmingham dataset because pre-processed log ratio values were not available. The SVR models used the same two-step approach and the same features as our previous SVC models. The same optimization approach as we used for the SVC models showed that a linear kernel with default parameters gave the best predictions at both the target site and mRNA levels. This model selection often resulted in over-fitting or non-convergence, however, and hence required more analysis time than SVC to find an optimal kernel. Moreover, the SVR target site and mRNA level models showed lower prediction performance in the 10-fold cross-validation and on the independent dataset than the SVC models did (Fig. S12). Thus, when the input is based on features derived from the target sequence, classification models are as good as or better than regression models at predicting miRNA or siRNA target gene expression fold changes.
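The comparison itself used the PyML library; the sketch below reproduces the experimental design with scikit-learn as a stand-in and synthetic placeholder data, not the authors’ code. The classifier is fit on binary down-regulation labels, the regressor on the log ratios, and both are scored by how well they rank the down-regulated examples:

```python
# SVC vs. SVR under the same ranking evaluation (scikit-learn stand-in).
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC, LinearSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # placeholder feature vectors
y_reg = -0.5 * X[:, 0] + rng.normal(0.0, 0.3, 200)   # fake log ratios
y_cls = (y_reg < -0.2).astype(int)                   # down-regulated labels

svc_aucs, svr_aucs = [], []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    svc = LinearSVC().fit(X[train], y_cls[train])    # classification on labels
    svr = LinearSVR().fit(X[train], y_reg[train])    # regression on log ratios
    svc_aucs.append(roc_auc_score(y_cls[test], svc.decision_function(X[test])))
    # Lower predicted log ratio = stronger down-regulation, so negate.
    svr_aucs.append(roc_auc_score(y_cls[test], -svr.predict(X[test])))

print(f"SVC mean AUC: {np.mean(svc_aucs):.3f}, SVR mean AUC: {np.mean(svr_aucs):.3f}")
```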

Dataset selection: Different parameters for constructing training or test data do not affect SVM prediction performance

It is important to verify that our SVM approach remains effective when different criteria are used to select positive (down-regulated) and negative (non-down-regulated) records for training and testing, as this shows that the approach is not tuned to one particular set of data selection parameters. We therefore generated seven datasets – one for training and six for independent testing – to further investigate SVM prediction power under different data selection parameters.

Two major steps were applied to create the datasets in our study: (i) all microarray records were separated into three groups – down-regulated, unaffected, and up-regulated – depending on their log ratio values, and (ii) positive data were selected from the down-regulated genes, whereas negative records were drawn from both the unaffected and up-regulated genes. However, some miRNAs are known to potentially up-regulate genes. The mechanism of this up-regulation is unknown, and the features of up-regulated target sites/genes are possibly quite different from those involved in down-regulation. To check the influence of including these up-regulated genes in our SVM training sets, we constructed a dataset without up-regulated genes, trained classifiers on it, and tested them on the independent datasets. The 10-fold cross-validation ROC plots for both target site and mRNA level classifiers trained without up-regulated genes (Fig. S13 A, C) were almost identical to those trained with up-regulated genes (Fig. S13 E, G). The prediction power was also retained when the classifiers were tested on the Linsley dataset (Fig. S13 B, D) and compared with those trained with up-regulated genes (Fig. S13 F, H).

Moreover, we generated six independent subsets for one transcriptomics (Linsley) and two proteomics (Selbach and Baek) experiments. Two positive record sets, defined by log ratio values <-0.1 and <-0.3, were created for each experiment, resulting in six independent datasets altogether. We performed benchmarks on these datasets and compared the results with the datasets defined by log ratio values <-0.2, which had been used as the default threshold for all the other benchmarks. As expected, the number of positive records detected by all eight algorithms in the benchmarks was higher in the log ratio <-0.1 sets – Linsley (Table S10), Selbach (Table S12), and Baek (Table S14) – and lower in the log ratio <-0.3 sets – Linsley (Table S11), Selbach (Table S13), and Baek (Table S15). The ROC scores from the benchmarks show that SVM outperforms the other algorithms in most cases regardless of the data selection criteria. We also analyzed the ROC curves of the benchmarks to check the trends among the log ratio <-0.2 sets – Linsley (Fig. 5, S5), Selbach (Fig. 6, S15) and Baek (Fig. 6, S16) – the log ratio <-0.1 sets – Linsley (Fig. S17), Selbach (Fig. S19) and Baek (Fig. S21) – and the log ratio <-0.3 sets – Linsley (Fig. S18), Selbach (Fig. S20) and Baek (Fig. S22). All ROC curves showed very similar trends among the different positive record sets. More importantly, our SVM approach retained its prediction power even when tested on the log ratio <-0.3 and <-0.1 sets.

These results indicate that our SVM prediction power is not affected when trained with or without up-regulated genes, or tested on strongly down-regulated pairs (log ratio <-0.3) as well as moderately down-regulated pairs (log ratio <-0.1).

Supplementary Tables

Table S1. Target site level features.

Feature Vector sizea Siteb Consvc Incld
1 Seed Type 9
2 Similarity (1-20 nt) 1
3 Similarity in the 5' portion (1-9 nt) 1
4 Similarity in the 3' portion (>10 nt) 1
5 Similarity in the 3' portion (position 13,14,15,16 ) 1
6 A:U rich context in the 30nt upstream 30
7 A:U rich context in the 30nt downstream 30
8 Relative site position (position / 3’ UTR length) 1
9 Distance to the next-nearest-neighbor site 1
10 Site accessibility: dGduplex 1
11 Site accessibility: dGopen 1
12 Site accessibility: ddG 1
13 Site accessibility with 3/15 flank: dGopen 1
14 Site accessibility with 3/15 flank: ddG 1
15 20nt site matches (Match:1, AU:0.4, No match:0) 20
16 Pos 1 nucleotide of the target site (A:1, UCG:0) 1
17 multiz17way score for the seed site 1
18 multiz17way score for the target site 1
19 PhastCons44way scores for the target site 1
20 PhastCons44wayMammal scores for the target site 1
21 PhastCons44wayPrimate scores for the target site 1
22 PhyloP44ways scores for the seed site 8
23 PhyloP44waysMammal scores for the seed site 8
24 PhyloP44waysPrimate scores for the seed site 8

aVector size used in SVM training. bSite accessibility features. ○: Yes, blank: No. cSequence conservation features. ○: Yes, Blank: No. dFeatures included in the final version of our classifier. ○: included features, blank: features excluded during optimization steps.


Table S2. Analysis of target site level feature influence.

Rank ROC ROC diffa Removed feature
1 0.6734 0.0317 Seed Type
2 0.6958 0.0093 20nt site matches (Match:1, AU:0.4, No match:0)
3 0.6998 0.0053 A:U rich context in the 30nt upstream
4 0.7047 0.0004 PhastCons44wayMammal scores for the target site
5 0.7047 0.0003 Similarity in the 3' portion (>10 nt)
6 0.7048 0.0002 Relative site position (position / 3’UTR length)
7 0.7049 0.0001 Similarity in the 5' portion (1-9 nt)
8 0.7050 0.0000 Site accessibility: dGopen
9 0.7050 0.0000 Distance to the next-nearest-neighbor site
10 0.7050 0.0000 Site accessibility: dGduplex
11 0.7050 0.0000 Site accessibility: ddG
12 0.7050 0.0000 Site accessibility with 3/15 flank: dGopen
13 0.7050 0.0000 Site accessibility with 3/15 flank: ddG
14 0.7051 -0.0001 Similarity in the 3' portion (position 13,14,15,16 )
15 0.7053 -0.0002 PhyloP44ways scores for the seed site
16 0.7054 -0.0003 Similarity (1-20 nt)
17 0.7054 -0.0004 multiz17way score for the seed site
18 0.7056 -0.0006 Pos 1 nucleotide of the target site (A:1, UCG:0)
19 0.7057 -0.0006 A:U rich context in the 30nt downstream
20 0.7059 -0.0008 multiz17way score for the target site
21 0.7060 -0.0009 PhastCons44wayPrimate scores for the target site
22 0.7064 -0.0014 PhyloP44waysMammal scores for the seed site
23 0.7068 -0.0017 PhastCons44way scores for the target site
24 0.7081 -0.0030 PhyloP44waysPrimate scores for the seed site

aROC diff shows the difference of ROC scores between the classifier trained with all the features (0.7050) and classifiers trained with one feature removed.


Table S3. mRNA level features.

Feature Vector sizea Inclb
1 3'UTR Length 1
2 Number of target sites 1
3 Total discriminant value / 3’ UTR length 1
4 Number of 8mer sites 1
5 Number of 7mer-m8 sites 1
6 Number of 7mer-A1 sites 1
7 Number of 6mer sites 1
8 Number of GUM sites 1
9 Number of GUT sites 1
10 Number of LP sites 1
11 Number of BT sites 1
12 Number of BM sites 1
13 Distribution bins of discriminant values 16
14 Site counts within optimal distance 14-46 nt 1
15 Site counts within optimal distance 17-25 nt 1
16 Site number flags (1, 2-7, or ≥8) 3
17 Total discriminant value 1

aVector size used in SVM training. bFeatures included in the final version of our classifier. ○: included features, blank: features excluded during optimization steps.


Table S4. Analysis of mRNA level feature influence.

Rank ROC ROC diffa Removed feature
1 0.7565 0.0432 Distribution bins of discriminant values
2 0.7904 0.0093 Number of GUM sites
3 0.7971 0.0026 Site number flags (1, 2-7, or ≥8)
4 0.7978 0.0019 Total discriminant value / 3' UTR length
5 0.7981 0.0016 Number of 7mer-m8 sites
6 0.7984 0.0013 Site counts within optimal distance 17-25 nt
7 0.7986 0.0011 Number of BT sites
8 0.7994 0.0003 Number of 6mer sites
9 0.7995 0.0002 Number of 7mer-A1 sites
10 0.7996 0.0001 Total discriminant value
11 0.7996 0.0001 3' UTR Length
12 0.7996 0.0001 Number of GUT sites
13 0.7997 0.0000 Number of BM sites
14 0.7997 0.0000 Number of 8mer sites
15 0.7999 -0.0002 Number of LP sites
16 0.8003 -0.0006 Site counts within optimal distance 14-46 nt
17 0.8009 -0.0012 Number of target sites

aROC diff shows the difference of ROC scores between the classifier trained with all the features (0.7997) and classifiers trained with one feature removed.


Table S5. Benchmarks on the Linsley dataset.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 667 158374
SVM 564 69037 0.81
PITA 539 55953 0.76
TargetScan 400 21228 0.75
miRanda 93 6641 0.55
PITA_top 147 4867 0.6
TargetScan_consv 174 5634 0.61
mirTarget2 191 3386 0.63
PicTar 124 3826 0.58
ROC10*n All 667 90
SVM 16 90 0.0168
PITA 1 90 0.0058
TargetScan 13 90 0.0267
miRanda 11 90 0.0144
PITA_top 2 90 0.0076
TargetScan_consv 14 90 0.0209
mirTarget2 19 90 0.0235
PicTar 17 90 0.0151
7mer + Conservation All 353 15823
SVM 339 12731 0.73
PITA 339 13230 0.61
TargetScan 310 10715 0.69
miRanda 93 6641 0.43
PITA_top 147 4867 0.57
TargetScan_consv 174 5634 0.59
mirTarget2 191 3386 0.67
PicTar 124 3826 0.57
TargetScan All 174 5658
SVM 168 5022 0.67
PITA 174 5440 0.56
TargetScan 174 5634 0.65
miRanda 36 827 0.53
PITA_top 119 3527 0.56
TargetScan_consv 174 5634 0.65
mirTarget2 109 1926 0.66
PicTar 87 2418 0.57
miRanda All 93 6641
SVM 89 4301 0.82
PITA 88 4942 0.69
TargetScan 82 3072 0.8
miRanda 93 6641 0.61
PITA_top 31 801 0.61
TargetScan_consv 36 819 0.64
mirTarget2 41 602 0.67
PicTar 25 554 0.6
PicTar All 124 3826
SVM 119 3683 0.69
PITA 122 3420 0.61
TargetScan 105 3003 0.63
miRanda 25 554 0.53
PITA_top 92 2152 0.63
TargetScan_consv 87 2406 0.61
mirTarget2 77 1408 0.67
PicTar 124 3826 0.65

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S6. miRNA/siRNA sequences from five microarray and two proteomics experiments.

Accession miRNA/siRNA name Sequence Strand Type
GSE5814 MAPK14-pos1 mismatch CCUACAGAGAACUGCGGUA sense siRNA
GSE5814 MAPK14-pos2 mismatch CCUACAGAGAACUGCGGAU sense siRNA
GSE5814 MAPK14-pos8 mismatch CCUACAGAGAAAUGCGGUU sense siRNA
GSE5814 MAPK14-pos7 mismatch CCUACAGAGAACGGCGGUU sense siRNA
GSE5814 MAPK14-pos6 mismatch CCUACAGAGAACUACGGUU sense siRNA
GSE5814 MAPK14-pos4 mismatch CCUACAGAGAACUGCAGUU sense siRNA
GSE5814 MAPK14-pos3 mismatch CCUACAGAGAACUGCGAUU sense siRNA
GSE5814 MAPK14-193a CCUACAGAGAACUGCGGUU sense siRNA
GSE5814 PLK1 1319 CACGCCTCATCCTCTACAA sense siRNA
GSE5814 PLK1 772 GAGACCTACCTCCGGATCA sense siRNA
GSE5814 MAPK14-pos5 mismatch CCUACAGAGAACUGAGGUU sense siRNA
GSE5814 PIK3CB-6338 GUGACAACAUCAUGGUCAA sense siRNA
GSE5814 PIK3CB-6340 CUCCUAAUAUGAAUCCUAU sense siRNA
GSE2075 miR-1 UGGAAUGUAAAGAAGUAUGUAA antisense miRNA
GSE2075 miR-124 UAAGGCACGCGGUGAAUGCCA antisense miRNA
GSE2075 miR-124mut5-6 UAAGCGACGCGGUGAAUGCCA antisense siRNA
GSE2075 miR-373 GAAGUGCUUCGAUUUUGGGGUGU antisense miRNA
GSE6838 hsa-miR-16 UAGCAGCACGUAAAUAUUGGCG antisense miRNA
GSE6838 hsa-miR-15a UAGCAGCACAUAAUGGUUUGUG antisense miRNA
GSE6838 hsa_miR_106b UAAAGUGCUGACAGUGCAGAU antisense miRNA
GSE6838 hsa_miR_103 AGCAGCAUUGUACAGGGCUAUGA antisense miRNA
GSE6838 hsa-miR-192 CUGACCUAUGAAUUGACAGCC antisense miRNA
GSE6838 hsa-miR-215 AUGACCUAUGAAUUGACAGAC antisense miRNA
GSE6838 hsa-miR-17-5p CAAAGUGCUUACAGUGCAGGUAGU antisense miRNA
GSE6838 hsa-miR-20 UAAAGUGCUUAUAGUGCAGGUA antisense miRNA
GSE6838 hsa-let-7c UGAGGUAGUAGGUUGUAUGGUU antisense miRNA
GSE8501 hsa-miR-7 UGGAAGACUAGUGAUUUUGUUG antisense miRNA
GSE8501 hsa-miR-9 UCUUUGGUUAUCUAGCUGUAUGA antisense miRNA
GSE8501 hsa-miR-122a UGGAGUGUGACAAUGGUGUUUGU antisense miRNA
GSE8501 hsa-miR-128a UCACAGUGAACCGGUCUCUUUU antisense miRNA
GSE8501 hsa-miR-132 UAACAGUCUACAGCCAUGGUCG antisense miRNA
GSE8501 hsa-miR-133a UUGGUCCCCUUCAACCAGCUGU antisense miRNA
GSE8501 hsa-miR-142-3p UGUAGUGUUUCCUACUUUAUGGA antisense miRNA
GSE8501 hsa-miR-148a UCAGUGCACUACAGAACUUUGU antisense miRNA
GSE8501 hsa-miR-181a AACAUUCAACGCUGUCGGUGAGU antisense miRNA
E-MEXP-668 C52 CAGGGCGGAGACTTCACCA sense siRNA
E-MEXP-668 G4 TGGTTTACATGTTCCAATA sense siRNA
E-MEXP-668 C4 GGAAAGACTGTTCCAAAAA sense siRNA
E-MEXP-668 G41 GTATGACAACAGCCTCAAG sense siRNA
E-MEXP-668 M3 GAGGTTCTCTGGATCAAGT sense siRNA
E-MEXP-668 M1 GCACATGGATGGAGGTTCT sense siRNA
E-MEXP-668 M2 GCAGAGAGAGCAGATTTGA sense siRNA
E-MEXP-668 C14 GGCCTTAGCTACAGGAGAG sense siRNA
E-MEXP-668 C1 GAAAGAGCATCTACGGTGA sense siRNA
E-MEXP-668 C2 GAAAGGATTTGGCTACAAA sense siRNA
E-MEXP-668 C3 ACAGCAAATTCCATCGTGT sense siRNA
E-MEXP-668 M4 GAGCAGATTTGAAGCAACT sense siRNA
Selbach hsa-miR-30a UGUAAACAUCCUCGACUGGAAG antisense miRNA
Selbach hsa-miR-16 UAGCAGCACGUAAAUAUUGGCG antisense miRNA
Selbach hsa-miR-155 UUAAUGCUAAUCGUGAUAGGGGU antisense miRNA
Selbach hsa-miR-1 UGGAAUGUAAAGAAGUAUGUAU antisense miRNA
Selbach hsa-let-7b UGAGGUAGUAGGUUGUGUGGUU antisense miRNA
Baek hsa-mir-181a AACAUUCAACGCUGUCGGUGAGU antisense miRNA
Baek hsa-mir-124 UAAGGCACGCGGUGAAUGCC antisense miRNA
Baek hsa-mir-1 UGGAAUGUAAAGAAGUAUGUAU antisense miRNA

Jackson (GSE5814), Lim (GSE2075), Grimson (GSE8501), and Birmingham (E-MEXP-668) were used for training, and Linsley (GSE6838) was used for the independent test. Two datasets, Selbach and Baek, from mass spectrometry experiments were used as both training datasets (for the proteomics-based classifiers) and independent test sets (for the mRNA-based classifiers).

aMAPK14-193 has three different sample groups: (i) GSM133702 and GSM134468, (ii) GSM134488, and (iii) GSM134491. We assigned a unique ID to each group, so the groups were treated as different sequences internally in our model.


Table S7. Parameters used for the Needle software.

Parameter Value
Gap open (-gapopen) 10.0
Gap extend (-gapextend) 7.0
Scoring matrix (-datafile):

      A    U    G    C
A    -3    5   -3   -3
U     5   -3    2   -3
G    -3    2   -3    5
C    -3   -3    5   -3

The scoring matrix was generated based on the matrix used in the miRanda algorithm [15].
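The alignments themselves were computed with the EMBOSS Needle program; as an illustrative substitute (not the original pipeline), the same scoring scheme can be set up with Biopython’s PairwiseAligner. Note that EMBOSS expresses gap penalties as positive numbers while Biopython uses negative scores, and the two tools’ gap bookkeeping differs slightly:

```python
# Global RNA:RNA alignment with the Table S7 scoring scheme (Biopython
# substitute for EMBOSS Needle; an illustrative sketch).
from Bio.Align import PairwiseAligner, substitution_matrices

matrix = substitution_matrices.Array("AUGC", dims=2)
for a in "AUGC":
    for b in "AUGC":
        matrix[a, b] = -3.0          # default mismatch score
matrix["A", "U"] = matrix["U", "A"] = 5.0   # Watson-Crick pairs
matrix["G", "C"] = matrix["C", "G"] = 5.0
matrix["G", "U"] = matrix["U", "G"] = 2.0   # G:U wobble

aligner = PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = matrix
aligner.open_gap_score = -10.0       # EMBOSS -gapopen 10.0
aligner.extend_gap_score = -7.0      # EMBOSS -gapextend 7.0

# miR-1 (Table S6) against its perfect complement scores 22 * 5 = 110.
score = aligner.score("UGGAAUGUAAAGAAGUAUGUAA", "UUACAUACUUCUUUACAUUCCA")
print(score)
```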


Table S8. Benchmarks on the Selbach dataset.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 2311 12890
SVM 1493 5936 0.64
PITA 1230 4276 0.61
TargetScan 688 1220 0.61
miRanda 183 493 0.52
PITA_top 266 274 0.55
TargetScan_consv 305 309 0.55
mirTarget2 205 155 0.54
PicTar 185 175 0.53
ROC10*n All 2311 50
SVM 95 50 0.0253
PITA 17 50 0.0042
TargetScan 84 50 0.0212
miRanda 28 50 0.0079
PITA_top 57 50 0.0138
TargetScan_consv 82 50 0.0213
mirTarget2 83 50 0.0231
PicTar 81 50 0.021
7mer + Conservation All 576 1005
SVM 560 800 0.71
PITA 554 835 0.61
TargetScan 495 616 0.69
miRanda 183 493 0.42
PITA_top 266 274 0.6
TargetScan_consv 305 309 0.63
mirTarget2 205 155 0.6
PicTar 185 175 0.58
TargetScan All 306 313
SVM 305 298 0.63
PITA 302 310 0.52
TargetScan 305 309 0.63
miRanda 76 41 0.56
PITA_top 187 172 0.55
TargetScan_consv 305 309 0.63
mirTarget2 133 84 0.59
PicTar 139 125 0.55
miRanda All 183 493
SVM 169 305 0.76
PITA 166 328 0.66
TargetScan 131 152 0.74
miRanda 183 493 0.54
PITA_top 68 41 0.64
TargetScan_consv 76 39 0.67
mirTarget2 58 28 0.63
PicTar 54 29 0.62
PicTar All 185 175
SVM 185 172 0.61
PITA 184 172 0.54
TargetScan 168 163 0.62
miRanda 54 29 0.56
PITA_top 126 94 0.58
TargetScan_consv 139 124 0.6
mirTarget2 86 47 0.6
PicTar 185 175 0.62

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S9. Benchmarks on the Baek dataset.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 1301 4257
SVM 725 2005 0.56
PITA 605 1481 0.56
TargetScan 319 557 0.56
miRanda 81 216 0.51
PITA_top 103 143 0.52
TargetScan_consv 142 172 0.53
mirTarget2 70 92 0.52
PicTar 73 102 0.52
ROC10*n All 1301 30
SVM 39 30 0.0193
PITA 15 30 0.0046
TargetScan 28 30 0.0157
miRanda 17 30 0.0081
PITA_top 31 30 0.0148
TargetScan_consv 32 30 0.0174
mirTarget2 23 30 0.0086
PicTar 29 30 0.0131
7mer + Conservation All 258 486
SVM 235 407 0.59
PITA 234 389 0.6
TargetScan 217 322 0.62
miRanda 81 216 0.44
PITA_top 103 143 0.56
TargetScan_consv 142 172 0.61
mirTarget2 70 92 0.54
PicTar 73 102 0.54
TargetScan All 142 176
SVM 136 169 0.57
PITA 135 164 0.57
TargetScan 142 172 0.57
miRanda 28 32 0.5
PITA_top 86 98 0.56
TargetScan_consv 142 172 0.57
mirTarget2 48 54 0.52
PicTar 51 58 0.52
miRanda All 81 216
SVM 65 144 0.64
PITA 70 142 0.62
TargetScan 51 83 0.62
miRanda 81 216 0.5
PITA_top 21 32 0.56
TargetScan_consv 28 31 0.59
mirTarget2 19 25 0.55
PicTar 16 16 0.57
PicTar All 73 102
SVM 73 100 0.54
PITA 71 91 0.62
TargetScan 65 85 0.63
miRanda 16 16 0.53
PITA_top 42 48 0.6
TargetScan_consv 51 57 0.63
mirTarget2 27 26 0.56
PicTar 73 102 0.59

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S10. Benchmarks on the Linsley dataset – Positive records defined by log ratio -0.1.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 851 158190
SVM 692 68909 0.79
PITA 655 55837 0.74
TargetScan 474 21154 0.73
miRanda 110 6624 0.54
PITA_top 180 4834 0.59
TargetScan_consv 221 5587 0.61
mirTarget2 231 3346 0.62
PicTar 159 3791 0.58
ROC10*n All 851 90
SVM 20 90 0.0133
PITA 1 90 0.0014
TargetScan 13 90 0.0147
miRanda 13 90 0.0093
PITA_top 4 90 0.0039
TargetScan_consv 14 90 0.0142
mirTarget2 22 90 0.0186
PicTar 19 90 0.0106
7mer + Conservation All 425 15751
SVM 409 12661 0.73
PITA 404 13165 0.60
TargetScan 368 10657 0.70
miRanda 110 6624 0.43
PITA_top 180 4834 0.57
TargetScan_consv 221 5587 0.61
mirTarget2 231 3346 0.67
PicTar 159 3791 0.58
TargetScan All 221 5611
SVM 215 4975 0.67
PITA 219 5395 0.55
TargetScan 221 5587 0.66
miRanda 48 815 0.54
PITA_top 151 3495 0.55
TargetScan_consv 221 5587 0.66
mirTarget2 134 1901 0.65
PicTar 116 2389 0.58
miRanda All 110 6624
SVM 104 4286 0.81
PITA 105 4925 0.71
TargetScan 97 3057 0.81
miRanda 110 6624 0.62
PITA_top 39 793 0.62
TargetScan_consv 48 807 0.67
mirTarget2 49 594 0.68
PicTar 30 549 0.60
PicTar All 159 3791
SVM 154 3648 0.68
PITA 154 3388 0.59
TargetScan 135 2973 0.63
miRanda 30 549 0.52
PITA_top 113 2131 0.60
TargetScan_consv 116 2377 0.61
mirTarget2 94 1391 0.64
PicTar 159 3791 0.63

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S11. Benchmarks on the Linsley dataset – Positive records defined by log ratio -0.3.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 128 158913
SVM 111 69490 0.83
PITA 109 56383 0.80
TargetScan 84 21544 0.78
miRanda 19 6715 0.55
PITA_top 21 4993 0.57
TargetScan_consv 21 5787 0.57
mirTarget2 38 3539 0.64
PicTar 20 3930 0.57
ROC10*n All 128 90
SVM 1 90 0.0148
PITA 0 90 0.0078
TargetScan 1 90 0.0148
miRanda 3 90 0.0237
PITA_top 1 90 0.0109
TargetScan_consv 0 90 0.0078
mirTarget2 1 90 0.0134
PicTar 2 90 0.0151
7mer + Conservation All 69 16107
SVM 68 13002 0.74
PITA 64 13505 0.61
TargetScan 55 10970 0.63
miRanda 19 6715 0.44
PITA_top 21 4993 0.51
TargetScan_consv 21 5787 0.49
mirTarget2 38 3539 0.68
PicTar 20 3930 0.53
TargetScan All 21 5811
SVM 21 5169 0.76
PITA 21 5593 0.57
TargetScan 21 5787 0.70
miRanda 8 855 0.62
PITA_top 16 3630 0.61
TargetScan_consv 21 5787 0.70
mirTarget2 13 2022 0.67
PicTar 12 2493 0.61
miRanda All 19 6715
SVM 19 4371 0.87
PITA 18 5012 0.73
TargetScan 19 3135 0.88
miRanda 19 6715 0.59
PITA_top 8 824 0.65
TargetScan_consv 8 847 0.66
mirTarget2 9 634 0.69
PicTar 4 575 0.57
PicTar All 20 3930
SVM 20 3782 0.72
PITA 20 3522 0.71
TargetScan 17 3091 0.65
miRanda 4 575 0.54
PITA_top 14 2230 0.62
TargetScan_consv 12 2481 0.55
mirTarget2 12 1473 0.64
PicTar 20 3930 0.65

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S12. Benchmarks on the Selbach dataset – Positive records defined by log ratio -0.1.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 4221 10980
SVM 2451 4978 0.59
PITA 1946 3560 0.58
TargetScan 929 979 0.57
miRanda 279 397 0.52
PITA_top 331 209 0.53
TargetScan_consv 379 235 0.53
mirTarget2 248 112 0.52
PicTar 237 123 0.52
ROC10*n All 4221 50
SVM 126 50 0.0168
PITA 35 50 0.0033
TargetScan 114 50 0.0144
miRanda 41 50 0.0061
PITA_top 105 50 0.0135
TargetScan_consv 112 50 0.015
mirTarget2 98 50 0.0159
PicTar 152 50 0.0196
7mer + Conservation All 793 788
SVM 738 622 0.66
PITA 734 655 0.58
TargetScan 629 482 0.65
miRanda 279 397 0.43
PITA_top 331 209 0.58
TargetScan_consv 379 235 0.6
mirTarget2 248 112 0.58
PicTar 237 123 0.58
TargetScan All 382 237
SVM 379 224 0.62
PITA 377 235 0.52
TargetScan 379 235 0.6
miRanda 86 31 0.55
PITA_top 230 129 0.55
TargetScan_consv 379 235 0.6
mirTarget2 157 60 0.58
PicTar 177 87 0.57
miRanda All 279 397
SVM 227 247 0.67
PITA 225 269 0.58
TargetScan 160 123 0.66
miRanda 279 397 0.53
PITA_top 78 31 0.6
TargetScan_consv 85 30 0.61
mirTarget2 64 22 0.58
PicTar 62 21 0.58
PicTar All 237 123
SVM 236 121 0.59
PITA 235 121 0.57
TargetScan 216 115 0.59
miRanda 62 21 0.55
PITA_top 154 66 0.57
TargetScan_consv 176 87 0.58
mirTarget2 101 32 0.58
PicTar 237 123 0.64

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S13. Benchmarks on the Selbach dataset – Positive records defined by log ratio -0.3.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 1311 13890
SVM 933 6496 0.68
PITA 805 4701 0.65
TargetScan 504 1404 0.65
miRanda 131 545 0.53
PITA_top 200 340 0.56
TargetScan_consv 231 383 0.58
mirTarget2 168 192 0.55
PicTar 145 215 0.55
ROC10*n All 1311 50
SVM 75 50 0.0369
PITA 12 50 0.0045
TargetScan 67 50 0.0322
miRanda 21 50 0.0111
PITA_top 33 50 0.0142
TargetScan_consv 64 50 0.0306
mirTarget2 58 50 0.0264
PicTar 55 50 0.0258
7mer + Conservation All 423 1158
SVM 414 946 0.72
PITA 410 979 0.61
TargetScan 372 739 0.71
miRanda 131 545 0.42
PITA_top 200 340 0.6
TargetScan_consv 231 383 0.64
mirTarget2 168 192 0.61
PicTar 145 215 0.59
TargetScan All 231 388
SVM 230 373 0.65
PITA 229 383 0.52
TargetScan 231 383 0.67
miRanda 61 56 0.56
PITA_top 144 215 0.55
TargetScan_consv 231 383 0.67
mirTarget2 109 108 0.6
PicTar 110 154 0.57
miRanda All 131 545
SVM 123 351 0.78
PITA 120 374 0.63
TargetScan 101 182 0.77
miRanda 131 545 0.53
PITA_top 50 59 0.63
TargetScan_consv 61 54 0.69
mirTarget2 50 36 0.65
PicTar 42 41 0.62
PicTar All 145 215
SVM 145 212 0.64
PITA 144 212 0.52
TargetScan 135 196 0.66
miRanda 42 41 0.55
PITA_top 100 120 0.56
TargetScan_consv 110 153 0.63
mirTarget2 73 60 0.62
PicTar 145 215 0.6

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S14. Benchmarks on the Baek dataset – Positive records defined by log ratio -0.1.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 1992 3566
SVM 1062 1668 0.55
PITA 861 1225 0.55
TargetScan 446 430 0.55
miRanda 113 184 0.5
PITA_top 140 106 0.52
TargetScan_consv 191 123 0.53
mirTarget2 99 63 0.52
PicTar 100 75 0.51
ROC10*n All 1992 30
SVM 63 30 0.0196
PITA 30 30 0.0055
TargetScan 56 30 0.016
miRanda 21 30 0.0069
PITA_top 51 30 0.0155
TargetScan_consv 53 30 0.017
mirTarget2 57 30 0.014
PicTar 54 30 0.0138
7mer + Conservation All 364 380
SVM 331 311 0.6
PITA 322 301 0.57
TargetScan 300 239 0.63
miRanda 113 184 0.41
PITA_top 140 106 0.56
TargetScan_consv 191 123 0.61
mirTarget2 99 63 0.55
PicTar 100 75 0.54
TargetScan All 192 126
SVM 184 121 0.56
PITA 181 118 0.51
TargetScan 191 123 0.58
miRanda 36 24 0.5
PITA_top 115 69 0.55
TargetScan_consv 191 123 0.58
mirTarget2 68 34 0.55
PicTar 63 46 0.49
miRanda All 113 184
SVM 89 120 0.64
PITA 91 121 0.6
TargetScan 67 67 0.61
miRanda 113 184 0.49
PITA_top 27 26 0.56
TargetScan_consv 36 23 0.59
mirTarget2 27 17 0.57
PicTar 22 10 0.57
PicTar All 100 75
SVM 100 73 0.59
PITA 97 65 0.65
TargetScan 89 61 0.64
miRanda 22 10 0.54
PITA_top 56 34 0.6
TargetScan_consv 63 45 0.59
mirTarget2 37 16 0.59
PicTar 100 75 0.58

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Table S15. Benchmarks on the Baek dataset – Positive records defined by log ratio -0.3.

Benchmark Method # of targetsa ROCb
Positive Negative
All genes All 837 4721
SVM 492 2238 0.58
PITA 416 1670 0.57
TargetScan 228 648 0.57
miRanda 50 247 0.50
PITA_top 83 163 0.53
TargetScan_consv 113 201 0.55
mirTarget2 59 103 0.52
PicTar 53 122 0.52
ROC10*n All 837 30
SVM 29 30 0.0230
PITA 8 30 0.0047
TargetScan 24 30 0.0220
miRanda 7 30 0.0080
PITA_top 19 30 0.0143
TargetScan_consv 30 30 0.0264
mirTarget2 20 30 0.0107
PicTar 16 30 0.0106
7mer + Conservation All 184 560
SVM 171 471 0.62
PITA 168 455 0.60
TargetScan 163 376 0.66
miRanda 50 247 0.42
PITA_top 83 163 0.59
TargetScan_consv 113 201 0.65
mirTarget2 59 103 0.57
PicTar 53 122 0.54
TargetScan All 113 205
SVM 109 196 0.60
PITA 106 193 0.56
TargetScan 113 201 0.60
miRanda 21 39 0.50
PITA_top 72 112 0.57
TargetScan_consv 113 201 0.60
mirTarget2 44 58 0.55
PicTar 39 70 0.50
miRanda All 50 247
SVM 41 168 0.67
PITA 44 168 0.61
TargetScan 35 99 0.67
miRanda 50 247 0.54
PITA_top 14 39 0.56
TargetScan_consv 21 38 0.63
mirTarget2 15 29 0.57
PicTar 10 22 0.56
PicTar All 53 122
SVM 53 120 0.56
PITA 52 110 0.59
TargetScan 49 101 0.66
miRanda 10 22 0.50
PITA_top 34 56 0.63
TargetScan_consv 39 69 0.67
mirTarget2 22 31 0.59
PicTar 53 122 0.57

aPositive and Negative show the total number of positive (down-regulated) and negative (unaffected) genes present within the six benchmarks (Method “All”) and among the predictions of each method on the six benchmarks. bROC is the method’s ROC score on the specific benchmark.


Supplementary Figures


Figure S1

Figure S1. 10-fold cross-validation of target site level classifiers trained on three microarray datasets.

The ROC graphs show the classification performance of four classifiers trained with three microarray datasets at the target site level. The curve is the average ROC curve of the ten individual cross-validation test results; error bars show standard errors. Dotted lines illustrate random prediction. Avg. AUC is the average of the area under the curve (AUC; ROC score) of the ten individual cross-validation test results.
(A) Trained with the Birmingham, Lim and Grimson datasets.
(B) Trained with the Jackson, Birmingham and Grimson datasets.
(C) Trained with the Jackson, Lim and Grimson datasets.
(D) Trained with the Jackson, Birmingham and Lim datasets.
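The averaged curves and error bars in this and the following cross-validation figures can be obtained by vertical averaging: interpolating each fold’s ROC curve onto a common false positive rate grid and reporting the mean and standard error. A sketch of that assumed procedure, with scikit-learn standing in for the original PyML tooling:

```python
# Vertical averaging of per-fold ROC curves (assumed procedure).
import numpy as np
from sklearn.metrics import auc, roc_curve

def average_roc(fold_results, grid=np.linspace(0, 1, 101)):
    """fold_results: list of (y_true, y_score) pairs, one per CV fold."""
    tprs, aucs = [], []
    for y_true, y_score in fold_results:
        fpr, tpr, _ = roc_curve(y_true, y_score)
        tprs.append(np.interp(grid, fpr, tpr))  # resample onto common grid
        aucs.append(auc(fpr, tpr))
    tprs = np.array(tprs)
    sem = tprs.std(axis=0, ddof=1) / np.sqrt(len(tprs))  # standard error
    return grid, tprs.mean(axis=0), sem, np.mean(aucs)   # mean curve, Avg. AUC
```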



Figure S2

Figure S2. Density plot for the number of potential target sites divided by 3’UTR length.

The plot shows the density distribution of the number of potential miRNA target sites divided by 3’ UTR length. The red line shows the distribution for down-regulated genes, whereas the blue line shows the distribution for unaffected genes. The difference between the two density distributions is highly significant (Kolmogorov-Smirnov test, p-value: 3.862e-11).
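A two-sample Kolmogorov-Smirnov comparison like this one can be reproduced with SciPy; the gamma-distributed site densities below are placeholders, since the per-gene values themselves are not part of this supplement:

```python
# Two-sample KS test on per-gene site densities (placeholder data).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
site_density_down = rng.gamma(2.0, 0.002, size=800)         # down-regulated genes
site_density_unaffected = rng.gamma(1.5, 0.002, size=5000)  # unaffected genes

stat, p_value = ks_2samp(site_density_down, site_density_unaffected)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3e}")
```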



Figure S3

Figure S3. Comparison between two SVM trainings with or without explicit addition of negative records.

Two box plots show the number of potential target sites vs. discriminant values from mRNA level training. The left panel shows the distributions of discriminant values by the number of potential target sites for the classifier trained without any explicit addition of negative records. The right panel shows the classifier trained on data enriched with 1000 randomly selected non-target genes with more than 7 target sites. Records with more than 25 target sites and/or negative discriminant values were disregarded. Whereas the left box plot shows that discriminant values for negative (non-down-regulated) genes generally increase with the number of potential target sites, the right box plot shows that the discriminant values for negative genes remain relatively stable.



Figure S4

Figure S4. 10-fold cross-validation of mRNA level classifiers trained on three microarray datasets.

The ROC graphs show the classification performance of four classifiers trained with three different combinations of microarray datasets at the mRNA level. The curve is the average ROC curve of the ten individual cross-validation test results; error bars show standard errors. Dotted lines illustrate a random prediction. Avg. AUC is the average of the area under the curve (AUC; ROC score) of the ten individual cross-validation test results.
(A) Trained with the Birmingham, Lim and Grimson datasets.
(B) Trained with the Jackson, Birmingham and Grimson datasets.
(C) Trained with the Jackson, Lim and Grimson datasets.
(D) Trained with the Jackson, Birmingham and Lim datasets.



Figure S5

Figure S5. Three method-specific benchmarks of 8 different algorithms on the Linsley dataset.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2, and PicTar – on the Linsley dataset. Dotted lines illustrate random prediction. The values of the area under the ROC curve (AUC) are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with TargetScan dataset, (B) ROC with miRanda dataset, and (C) ROC with PicTar dataset.



Figure S6

Figure S6. siRNA benchmarks on the Jackson dataset.

Receiver operating characteristic (ROC) graphs show the performance of 3 different target prediction algorithms – SVM not trained on the Jackson dataset (SVM), PITA, and TargetScan (TargetS) – on the Jackson dataset. The SVM classifier was trained with the Lim, Birmingham and Grimson datasets. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, and (C) ROC with 7mer + Conservation.



Figure S7

Figure S7. miRNA benchmarks on the Lim dataset.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM not trained on the Lim dataset (SVM), PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Lim dataset. The SVM classifier was trained with the Jackson, Birmingham and Grimson datasets. Dotted lines illustrate random prediction. The values of the area under the ROC curve (AUC) are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S8

Figure S8. siRNA benchmarks on the Birmingham dataset.

Receiver operating characteristic (ROC) graphs show the performances of 3 different target prediction algorithms – SVM not trained on the Birmingham dataset (SVM), PITA, and TargetScan (TargetS) – on the Birmingham dataset. The SVM classifier was trained with the Jackson, Lim and Grimson datasets. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, and (C) ROC with 7mer + Conservation.



Figure S9

Figure S9. miRNA benchmarks on the Grimson dataset.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM not trained on the Grimson dataset (SVM), PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Grimson dataset. The SVM classifier was trained with the Jackson, Lim, and Birmingham datasets. Dotted lines illustrate random prediction. The values of the area under the ROC curve (AUC) are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S10

Figure S10. Targets with optimal distance sites show higher average discriminant values than the other three target types.

The box plot compares mRNA level discriminant values among the four target types. The result shows that our SVM approach scores the target types in the order: “multiple with optimal distance” > “multiple without optimal distance” > “single stringent” > “single non-stringent”.



Figure S11

Figure S11. Adding site accessibility and sequence conservation features did not affect target prediction accuracy.

Receiver operating characteristic (ROC) graphs show the classification performance of three mRNA level classifiers. These classifiers had the same mRNA level features, but their discriminant values were generated by three different target site classifiers: (i) with neither site accessibility nor sequence conservation, (ii) with site accessibility, and (iii) with sequence conservation information. Panels A-C show the three classifiers’ 10-fold cross-validation performance. The curve is the average ROC curve of the ten individual cross-validation test results; error bars show standard errors; Avg. AUC is the average of the area under the curve (AUC; ROC score) of the ten individual cross-validation test results. Panels D-F show the three classifiers’ performance on the independent (Linsley) dataset; ROC graphs were plotted by only considering the genes with prediction scores in the Linsley dataset. Dotted lines illustrate random prediction.
(A) 10-fold cross-validation of the classifier with neither site accessibility nor sequence conservation used
(B) 10-fold cross-validation of the classifier with site accessibility
(C) 10-fold cross-validation of the classifier with sequence conservation
(D) Classifier with neither site accessibility nor conservational information tested on the independent dataset
(E) Classifier with site accessibility tested on the independent dataset
(F) Classifier with sequence conservation tested on the independent dataset



Figure S12

Figure S12. SVC performs better than SVR at both target site and mRNA levels.

The ROC graphs show the prediction performance of SVM regression (SVR) and SVM classification (SVC). Panels A, C, E, and G show the models’ 10-fold cross-validation performance. The curve is the average ROC curve of the ten individual cross-validation test results; error bars show standard errors; Avg. AUC is the average of the area under the curve (AUC; ROC score) of the ten individual cross-validation test results. Panels B, D, F, and H show the models’ performance on the independent (Linsley) dataset; ROC graphs were plotted by only considering the genes with SVC and SVR prediction scores in the Linsley dataset. Dotted lines illustrate random prediction. SVR and SVC show very similar performance, but SVC is slightly better than SVR at both the target site and mRNA levels.
(A) 10-fold cross-validation of SVR target site level
(B) Target site level SVR classifier tested on the independent dataset
(C) 10-fold cross-validation of SVR mRNA level
(D) mRNA level SVR classifier tested on the independent dataset
(E) 10-fold cross-validation of SVC target site level
(F) Target site level SVC classifier tested on the independent dataset
(G) 10-fold cross-validation of SVC mRNA level
(H) mRNA level SVC classifier tested on the independent dataset



Figure S13

Figure S13. SVM performance is similar between the training set with up-regulated genes and the one without up-regulated genes.

The ROC graphs show the prediction performance of two SVM classifiers trained either with or without up-regulated genes. Panels A, C, E, and G show the models’ 10-fold cross-validation performance. The curve is the average ROC curve of the ten individual cross-validation test results; error bars show standard errors; Avg. AUC is the average of the area under the curve (AUC; ROC score) of the ten individual cross-validation test results. Panels B, D, F, and H show the models’ performance on the independent (Linsley) dataset; ROC graphs were plotted by only considering the genes with SVM prediction scores in the Linsley dataset. Dotted lines illustrate random prediction. The two classifiers show almost identical performance in terms of both ROC scores and ROC curves.
(A) 10-fold cross-validation of SVM without up-regulated genes at target site level
(B) Target site level SVM without up-regulated genes tested on the independent dataset
(C) 10-fold cross-validation of SVM without up-regulated genes at mRNA level
(D) mRNA level SVM without up-regulated genes tested on the independent dataset
(E) 10-fold cross-validation of SVM with up-regulated genes at target site level
(F) Target site level SVM with up-regulated genes tested on the independent dataset
(G) 10-fold cross-validation of SVM with up-regulated genes at mRNA level
(H) mRNA level SVM with up-regulated genes tested on the independent dataset



Figure S14

Figure S14. Two-step SVM retains the performance when trained with proteomics data.

(A) The SVM classifier was trained on the Selbach and Baek datasets. The definitions of the error bars, AUC, and true and false positive rates are the same as described in Figure 2. (B) Both the proteomics and transcriptomics classifiers were tested on the Linsley dataset. The proteomics classifier was trained on the Selbach and Baek datasets, whereas the transcriptomics classifier was trained on the Jackson, Lim, Birmingham, and Grimson datasets. The ROC scores are shown in the legend box.



Figure S15

Figure S15. Benchmarks on the Selbach dataset.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Selbach dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with TargetScan dataset, (B) ROC with miRanda dataset, and (C) ROC with PicTar dataset.



Figure S16

Figure S16. Benchmarks on the Baek dataset.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Baek dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. The benchmarks used for the evaluation were (A) ROC with TargetScan dataset, (B) ROC with miRanda dataset, and (C) ROC with PicTar dataset.



Figure S17

Figure S17. Benchmarks on the Linsley dataset with positive record threshold by log ratio -0.1.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Linsley dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by log ratio value -0.1 instead of the default value -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S18

Figure S18. Benchmarks on the Linsley dataset with positive records defined by log ratio -0.3.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Linsley dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by a log ratio threshold of -0.3 instead of the default -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.
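
The ROC10*n benchmark used in these panels is a truncated ROC statistic in the family of the ROCn score introduced by Gribskov and Robinson (reference 3 below): the ROC curve is accumulated only over the first n false positives and normalized so that a perfect ranking scores 1. The sketch below implements a generic ROCn; how n is set for ROC10*n follows the main text, so it is left as a caller-supplied parameter here.

```python
import numpy as np

def roc_n(y_true, y_score, n):
    """Truncated ROC score (ROCn): area under the ROC curve up to the
    first n false positives, normalized so a perfect ranking scores 1.
    Assumes y_true is 0/1 and higher scores mean 'more likely positive'.
    Ties in y_score are broken arbitrarily in this sketch."""
    order = np.argsort(-np.asarray(y_score, dtype=float))  # descending
    y = np.asarray(y_true)[order]
    tp = fp = area = 0
    for label in y:
        if label:
            tp += 1          # a true positive raises the curve
        else:
            fp += 1          # a false positive sweeps out area under it
            area += tp
            if fp == n:      # stop after the first n false positives
                break
    positives = int(np.sum(y_true))
    return area / (n * positives) if positives else 0.0
```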



Figure S19

Figure S19. Benchmarks on the Selbach dataset with positive records defined by log ratio -0.1.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Selbach dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by a log ratio threshold of -0.1 instead of the default -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S20

Figure S20. Benchmarks on the Selbach dataset with positive records defined by log ratio -0.3.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Selbach dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by a log ratio threshold of -0.3 instead of the default -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S21

Figure S21. Benchmarks on the Baek dataset with positive records defined by log ratio -0.1.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Baek dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by a log ratio threshold of -0.1 instead of the default -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.



Figure S22

Figure S22. Benchmarks on the Baek dataset with positive records defined by log ratio -0.3.

Receiver operating characteristic (ROC) graphs show the performances of 8 different target prediction algorithms – SVM, PITA All (PITA), PITA Top (PITA_top), TargetScan (TargetS), TargetScan with conserved genes (TargetS_c), MicroCosm miRanda (miRanda), mirTarget2 and PicTar – on the Baek dataset. Dotted lines illustrate random prediction. The ROC scores are shown in the legend box. Positive records (down-regulated) were selected by a log ratio threshold of -0.3 instead of the default -0.2. The benchmarks used for the evaluation were (A) ROC with All genes, (B) ROC10*n, (C) ROC with 7mer + Conservation, (D) ROC with TargetScan dataset, (E) ROC with miRanda dataset, and (F) ROC with PicTar dataset.


References

  1. Birmingham A, Anderson EM, Reynolds A, Ilsley-Tyree D, Leake D, Fedorov Y, et al. 3′ UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nature Methods 2006;3:199–204. https://doi.org/10.1038/nmeth854.
  2. Linsley PS, Schelter J, Burchard J, Kibukawa M, Martin MM, Bartz SR, et al. Transcripts targeted by the MicroRNA-16 family cooperatively regulate cell cycle progression. Molecular and Cellular Biology 2007;27:2240–52. https://doi.org/10.1128/mcb.02005-06.
  3. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry 1996;20:25–33. https://doi.org/10.1016/s0097-8485(96)80004-0.
  4. Gaidatzis D, van Nimwegen E, Hausser J, Zavolan M. Inference of miRNA targets using evolutionary conservation and pathway analysis. BMC Bioinformatics 2007;8:69. https://doi.org/10.1186/1471-2105-8-69.
  5. Grimson A, Farh KK-H, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP. MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular Cell 2007;27:91–105. https://doi.org/10.1016/j.molcel.2007.06.017.
  6. Hammell M, Long D, Zhang L, Lee A, Carmack CS, Han M, et al. mirWIP: microRNA target prediction based on microRNA-containing ribonucleoprotein-enriched transcripts. Nature Methods 2008;5:813–9. https://doi.org/10.1038/nmeth.1247.
  7. Jackson AL, Burchard J, Schelter J, Chau BN, Cleary M, Lim L, et al. Widespread siRNA "off-target" transcript silencing mediated by seed region sequence complementarity. RNA 2006;12:1179–87. https://doi.org/10.1261/rna.25706.
  8. Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 2005;433:769–73. https://doi.org/10.1038/nature03315.
  9. Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, et al. Using expression profiling data to identify human microRNA targets. Nature Methods 2007;4:1045–9. https://doi.org/10.1038/nmeth1130.
  10. Sales G, Coppe A, Bicciato S, Bortoluzzi S, Romualdi C. Impact of probe annotation on the integration of miRNA-mRNA expression profiles for miRNA target detection. Nucleic Acids Research 2010;38:e97. https://doi.org/10.1093/nar/gkp1239.
  11. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E. The role of site accessibility in microRNA target recognition. Nature Genetics 2007;39:1278–84. https://doi.org/10.1038/ng2135.
  12. Long D, Lee R, Williams P, Chan CY, Ambros V, Ding Y. Potent effect of target structure on microRNA function. Nature Structural & Molecular Biology 2007;14:287–94. https://doi.org/10.1038/nsmb1226.
  13. Baek D, Villén J, Shin C, Camargo FD, Gygi SP, Bartel DP. The impact of microRNAs on protein output. Nature 2008;455:64–71. https://doi.org/10.1038/nature07242.
  14. Sethupathy P, Megraw M, Hatzigeorgiou AG. A guide through present computational approaches for the identification of mammalian microRNA targets. Nature Methods 2006;3:881–6. https://doi.org/10.1038/nmeth954.
  15. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets in Drosophila. Genome Biology 2003;5:R1. https://doi.org/10.1186/gb-2003-5-1-r1.
