Computational analysis of regulatory mechanism and interactions of microRNAs
Takaya Saito
Faculty of Medicine
Norwegian University of Science and Technology
Abstract
For years, RNAs were thought to have only two broad functions in cells: transmitting information between DNA and protein as messenger RNA (mRNA), and playing structural, catalytic, and information-decoding roles in protein synthesis as ribosomal RNA (rRNA) and transfer RNA (tRNA). However, the discovery of RNA interference (RNAi) changed this picture. RNAi is a regulatory process that uses small non-coding RNAs (ncRNAs) to suppress gene expression at the post-transcriptional level. This discovery led to the identification of many classes of functional ncRNAs. MicroRNAs (miRNAs) are one such class of ncRNAs; they are ∼22 nucleotides long, abundant, and found in most eukaryotic cells. This thesis focuses on revealing the regulatory roles and characteristics of miRNAs through bioinformatics approaches by addressing three research questions.
The first research question is whether we can enhance miRNA target prediction in animals by considering multiple target sites. Many algorithms exist for miRNA target prediction, but most do not consider multiple target sites. Accurate prediction of miRNA target genes is important for inferring miRNA regulatory roles, since annotations of miRNA regulation are still poor. To address this limitation, we developed a two-step support vector machine (SVM) model. Benchmark tests showed that our two-step model outperformed existing miRNA target prediction algorithms.
The second research question is whether there are factors that explain the differences between miRNA high-throughput experiments. Several high-throughput technologies, such as microarrays and quantitative proteomics, are widely used for miRNA experiments, but the results from these technologies are often inconsistent. By statistically analyzing several such high-throughput miRNA experiments, we revealed the characteristics of the different technologies and identified several factors that cause the differences.
The third research question is whether miRNAs interact with other classes of ncRNAs. There is strong evidence that some miRNAs are involved in transcription by interacting with other ncRNAs. We investigated ncRNAs in complex loci to find potential miRNA:ncRNA interactions. A complex locus is a locus that contains multiple genes that interact with each other. We found evidence that some miRNAs are involved in transcriptional regulation with ncRNAs in complex loci.
In summary, this thesis provides solutions for these research questions and contributes to a better understanding of several important aspects of miRNA characteristics and regulation. It also demonstrates effective bioinformatics approaches for developing a robust machine learning model and for analyzing different miRNA high-throughput experiments.
Acknowledgements
This thesis is based on four years of research funded by the Functional Genomics Program of the Norwegian Research Council. During the course of the research I have been helped by many individuals.
First and foremost, I would like to thank my two supervisors. I owe my deepest gratitude to Pål Sætrom for his contributions of time, ideas, and guidance to make this thesis possible. His wide knowledge and logical way of thinking have been of great value to me. I also gratefully acknowledge Finn Drabløs for his supervision. He has helped to make bioinformatics fun for me throughout my PhD.
I am indebted to the members of the Bioinformatics and Gene Regulation group for their contributions to providing an excellent working environment. I am especially grateful to Laurent Thomas and Even Skaland for their collaboration. I am also grateful to Tony Håndstad for his advice on manuscript preparation.
I would like to thank my friends and colleagues for making my time at NTNU tremendously enjoyable.
Lastly, I wish to thank my parents, Kikuo Saito and Mihoko Saito, for their love and support. Arigatou.
Contents
High-throughput biological experiments
4.1 One microarray experiment can detect thousands of gene expressions
4.2 The next generation sequencing methods are faster and more cost-effective than Sanger sequencing
4.3 The second generation sequencing technologies can cover a wide range of applications
4.5 Most preprocessed and raw data sets from high-throughput experiments are publicly available
5.1 Parametric statistics: Parameters and Hypothesis testing
5.2 Non-parametric statistical methods: Wilcoxon rank-sum and Kolmogorov-Smirnov tests
5.4 Multiple comparison tests: Analysis of variance, Bonferroni correction, and False discovery rate
5.5 Correlation: Pearson’s and Spearman’s correlation coefficients
List of Figures
List of Tables
List of Papers
- Paper 1
- MicroRNAs - targeting and target prediction
- Takaya Saito and Pål Sætrom
- New biotechnology 2010
- DOI: 10.1016/j.nbt.2010.02.016
- Paper 2
- A two step site and mRNA-level model for predicting microRNA targets
- Takaya Saito and Pål Sætrom
- BMC bioinformatics 2010
- DOI: 10.1186/1471-2105-11-612
- Paper 3
- Target gene expression levels and competition between transfected and endogenous microRNAs are strong confounding factors in microRNA high-throughput experiments
- Takaya Saito and Pål Sætrom
- Silence 2012
- DOI: 10.1186/1758-907X-3-3
- Paper 4
- MicroRNAs affect gene expression by targeting cis-transcribed non-coding RNAs
- Takaya Saito, Even Skaland, and Pål Sætrom
- (Submitted)
- Paper 5
- Inferring causative variants in microRNA target sites
- Laurent F. Thomas, Takaya Saito, and Pål Sætrom
- Nucleic Acids Research 2011
- DOI: 10.1093/nar/gkr414
Glossary
- 3’ UTR
- Three prime untranslated region; non-coding regions of mRNA on the 3’ end
- 5’ UTR
- Five prime untranslated region; non-coding regions of mRNA on the 5’ end
- A
- Adenine; a purine nucleobase paired with thymine in DNA and uracil in RNA
- ACC
- Accuracy; (TP + TN) / (P + N) in a binary classification model
- ADTree
- Alternating decision tree; a machine learning algorithm that combines more than one decision tree
- Agile
- Agile software development; a type of RAD methodology
- Ago
- Argonaute protein; a key component of the RISC complex
- ANN
- Artificial neural network; a machine learning method that mimics biological neural networks
- ANOVA
- Analysis of variance; a statistical method to infer differences among multiple groups
- AU rich
- Adenine:Uracil rich; nucleotide sequences with many adenines and uracils
- AUC
- Area under the ROC curve; a performance measure to evaluate the ROC curves
- C. elegans
- Caenorhabditis elegans; a transparent roundworm about 1 mm in length
- bp
- Base pair; a unit of nucleotide sequence length, where one base pair is one Watson-Crick pair
- C
- Cytosine; a pyrimidine nucleobase paired with guanine
- CAR
- Chromatin associated RNA; experimentally validated non-coding RNAs that are associated with chromatin
- cis-NAT
- Cis-natural antisense transcript; a pair of sense and antisense transcripts that overlap each other in the same locus
- CLIP
- Cross-linking and immunoprecipitation; a technique used to pull down RNA-protein complexes
- CROC
- Concentrated ROC; a version of ROC for evaluating early retrieval performance
- Cy3
- Cyanine 3; a green fluorescent dye used in the microarray assay
- Cy5
- Cyanine 5; a red fluorescent dye used in the microarray assay
- cDNA
- Complementary DNA; DNA synthesized from mRNA by reverse transcriptase
- CDS
- Coding sequence; coding region of mRNA
- CNS
- Central nervous system; the part of the nervous system that consists of the brain and spinal cord
- DGCR8
- DiGeorge syndrome critical region gene 8; a protein that recognizes a miRNA stem loop in pri-miRNA
- DNA
- Deoxyribonucleic acid; a nucleic acid that contains genetic information
- dsRNA
- Double-stranded RNA; RNA with two complementary strands
- EBI
- European bioinformatics institute; a center for research and services in bioinformatics in Europe
- ERR
- Error rate; (FP + FN) / (P + N) in a binary classification model
- FDR
- False discovery rate; FP / (FP + TP) in a binary classification model
- EST
- Expressed sequence tag; a short sub-sequence of a cDNA sequence
- FLcDNA
- Full-length cDNA; full-length cDNA used by the Sanger sequencing method
- FN
- False negative; prediction outcome is false while the actual value is true
- FP
- False positive; prediction outcome is true while the actual value is false
- G
- Guanine; a purine nucleobase paired with cytosine
- Gb
- Giga base pair; 1,000,000,000 bp
- GEO
- Gene expression omnibus; a public repository for microarray data
- GFF
- General feature format; a text file format for genomic positional information
- GPMDB
- Global Proteome Machine database; a public repository for proteomics data
- G:U wobble
- Guanine:Uracil wobble; guanine and uracil wobble pairing
- GTP
- Guanosine triphosphate; a purine nucleotide that is used for energy transfer within the cell
- HITS
- High throughput sequencing; the next generation sequencing
- iTRAQ
- Isotope tags for relative and absolute quantification; a non-gel-based technique for quantifying proteins
- INSDC
- International nucleotide sequence database collaboration; a group that organizes SRA repositories
- K-S test
- Kolmogorov-Smirnov test; a non-parametric statistical method
- k-NN
- k-nearest neighbor; a type of machine learning algorithm
- LC-MS/MS
- Liquid chromatography-tandem mass spectrometry; MS/MS with liquid chromatography. Liquid chromatography separates ions or molecules dissolved in a solvent
- ML
- Machine learning; a class of computational algorithms that can imitate learning
- MAF
- Multiple alignment format; a text file format for multiple alignments
- MIAME
- Minimum information about a microarray experiment; a standard for reporting microarray experiments
- miRISCs
- miRNA RISC; RISC loaded with miRNA
- miRNA
- Micro RNA; a class of small ncRNA that regulates protein expression
- MS
- Mass-spectrometry; a technique that measures the mass-to-charge ratio of charged particles
- MS/MS
- Tandem mass spectrometry; a technique that involves multiple steps of mass spectrometry
- N
- Negative; actual negative values in a binary classification model
- NB
- Naive Bayes; a type of statistical learning algorithm that uses Bayes’ theorem
- NCBI
- National center for biotechnology information; U.S. government-funded national resource for molecular biology information
- NPV
- Negative predictive value; TN / (TN + FN) in a binary classification model
- OOP
- Object-oriented programming; a computer programming paradigm
- P
- Positive; actual positive values in a binary classification model
- PCR
- Polymerase chain reaction; a technique used to amplify DNA sequences
- piRNA
- Piwi-interacting RNA; siRNA/miRNA-like ncRNAs found in germline cells
- Pol II
- RNA polymerase II; an enzyme that synthesizes several types of RNAs
- Pol III
- RNA polymerase III; an enzyme that synthesizes 5S rRNA, tRNA, and other small RNAs
- PRIDE
- Proteomics identifications database; a public repository for proteomics data
- RDB
- Relational database; a computational data storage method. Data are stored in tables with a collection of relations
- RIP
- Ribonucleoprotein immunoprecipitation; a technique used to pull down RNA-protein complexes
- ncRNA
- Non-protein-coding RNA; functional RNA that is not translated into protein
- pri-miRNA
- Primary miRNA; an RNA molecule that contains one or more miRNA stem loops
- pre-miRNA
- Precursor miRNA; a miRNA precursor with a hairpin stem loop that is exported to the cytoplasm
- PRC
- Precision; equivalent to PPV or Positive predictive value
- PPV
- Positive predictive value; TP / (TP + FP) in a binary classification model
- QP
- Quadratic programming; a class of optimization methods that optimize a quadratic function subject to linear constraints
- RAD
- Rapid application development; a software development methodology
- Ran
- Ras-related nuclear protein; a GTP binding protein that is involved in transport between nucleus and cytoplasm
- RNA
- Ribonucleic acid; a nucleic acid that transmits genetic information and plays structural, catalytic, and regulatory roles in cells
- RBF
- Radial basis function; a real-valued function whose value depends only on the distance from the origin
- RNAi
- RNA interference; a regulatory process that suppresses gene expression at the post-transcriptional level with small RNAs
- RNase
- Ribonuclease; an enzyme that degrades RNAs into smaller components
- rRNA
- Ribosomal RNA; RNA components of the ribosome
- RISC
- RNA-induced silencing complex; a key multiprotein complex in RNAi
- RITS
- RNA-induced initiation of transcriptional gene-silencing; a complex involved in regulation of chromatin structure
- RT-qPCR
- Reverse transcription quantitative PCR; a variant of PCR that can be used to measure RNA expression levels
- ROC
- Receiver operating characteristic; a graph that shows true positive rate versus false positive rate
- SAGE
- Serial analysis of gene expression; a sequencing technique that uses short tags generated from 3’ ends of mRNA transcripts
- siRISC
- siRNA RISC; RISC loaded with siRNA
- siRNA
- Small interfering RNA; small ncRNAs involved in RNAi for gene silencing
- SILAC
- Stable isotope labeling with amino acids in cell culture; a technique for in vivo incorporation of a label into proteins
- SRA
- Sequence read archive; data repository for next generation sequencing data
- SRM
- Structural Risk Minimization; a machine learning principle
- SN
- Sensitivity; TP / P in a binary classification model
- SP
- Specificity; TN / N in a binary classification model
- SQL
- Structured query language; a language used with RDB
- SNP
- Single-nucleotide polymorphism; DNA polymorphism with a single nucleotide difference between members of a species
- ssRNA
- Single-stranded RNA; RNA with one strand
- SVM
- Support vector machine; a machine learning algorithm that finds the decision boundary with the maximum margin between classes
- SVR
- Support vector regression; a version of SVM for regression
- TNR
- True negative rate; equivalent to SP or Specificity
- TPR
- True positive rate; equivalent to SN or Sensitivity
- TDD
- Test-driven development; a type of RAD methodology
- tRNA
- Transfer RNA; an RNA that transfers a specific amino acid to the ribosome during protein synthesis
- TN
- True negative; prediction outcome is false while the actual value is false
- TP
- True positive; prediction outcome is true while the actual value is true
- U
- Uracil; a pyrimidine nucleobase paired with adenine in RNA
- UV
- Ultraviolet; electromagnetic radiation with shorter wavelength than visible light
- VC dimension
- Vapnik Chervonenkis dimension; a measure of capacity for the data point separation by hyperplanes
- WTSS
- Whole transcriptome shotgun sequencing; high throughput technique at the whole transcriptome level with next generation sequencing
- XP
- Extreme programming; a type of RAD methodology
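
Several glossary entries (ACC, ERR, SN, SP, PPV, NPV) are defined as ratios of the four outcome counts TP, FP, TN, and FN. As a minimal illustration of how these measures relate to each other, the following Python sketch computes them from one set of counts. The class name `ConfusionCounts`, the function names, and the example counts are hypothetical and chosen only for this illustration; they are not taken from the thesis. FDR is computed under the standard definition FP / (FP + TP).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts of a binary classification model (hypothetical helper)."""
    tp: int  # prediction is true, actual value is true
    fp: int  # prediction is true, actual value is false
    tn: int  # prediction is false, actual value is false
    fn: int  # prediction is false, actual value is true

    @property
    def p(self) -> int:
        """Number of actual positives (P)."""
        return self.tp + self.fn

    @property
    def n(self) -> int:
        """Number of actual negatives (N)."""
        return self.tn + self.fp


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute the glossary's performance measures from outcome counts."""
    total = c.p + c.n
    return {
        "ACC": ratio(c.tp + c.tn, total),   # accuracy
        "ERR": ratio(c.fp + c.fn, total),   # error rate
        "SN": ratio(c.tp, c.p),             # sensitivity, same as TPR
        "SP": ratio(c.tn, c.n),             # specificity, same as TNR
        "PPV": ratio(c.tp, c.tp + c.fp),    # precision (PRC)
        "NPV": ratio(c.tn, c.tn + c.fn),    # negative predictive value
        "FDR": ratio(c.fp, c.fp + c.tp),    # false discovery rate, 1 - PPV
    }


# Made-up counts, used only to demonstrate the calculation.
example = ConfusionCounts(tp=40, fp=10, tn=45, fn=5)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an empty denominator is a design choice of this sketch; it keeps an undefined measure (for example, PPV when nothing is predicted positive) from being mistaken for a score of zero.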