
Computational analysis of regulatory mechanism and interactions of microRNAs

Takaya Saito

Faculty of Medicine
Norwegian University of Science and Technology


Table of contents
  1. Abstract
  2. Acknowledgements
  3. Contents
  4. List of Figures
  5. List of Tables
  6. List of Papers
  7. Glossary

Abstract

For years, RNAs were thought to have only two broad functions in cells: transmitting information between DNA and protein as messenger RNA (mRNA), and playing structural, catalytic, and information-decoding roles in protein synthesis as ribosomal RNA (rRNA) and transfer RNA (tRNA). However, the discovery of RNA interference (RNAi) changed this picture. RNAi is a regulatory process that uses small non-coding RNAs (ncRNAs) to suppress gene expression at the post-transcriptional level. This discovery led to the identification of many classes of functional ncRNAs. MicroRNAs (miRNAs) are one such class of ncRNAs; they are ∼22 nucleotides long, abundant, and found in most eukaryotic cells. This thesis focuses on revealing regulatory roles and characteristics of miRNAs through bioinformatics approaches by addressing three research questions.

The first research question is whether we can enhance miRNA target prediction in animals by considering multiple target sites. Many algorithms exist for miRNA target prediction, but most do not consider multiple target sites. Accurately predicting miRNA target genes is important for inferring miRNA regulatory roles, since annotations of miRNA regulation are still poor. To address this limitation, we developed a two-step support vector machine (SVM) model. Benchmark tests showed that our two-step model outperformed existing miRNA target prediction algorithms.

The second research question is whether there are factors that explain the differences between miRNA high-throughput experiments. Several high-throughput technologies, such as microarrays and quantitative proteomics, are widely used for miRNA experiments, but their results are often inconsistent. By statistically analyzing several such high-throughput miRNA experiments, we revealed the characteristics of the different technologies and identified several factors that cause the differences.

The third research question is whether miRNAs interact with other classes of ncRNAs. There is strong evidence that some miRNAs are involved in transcription by interacting with other ncRNAs. We investigated ncRNAs in complex loci to find potential miRNA:ncRNA interactions. A complex locus is a locus that contains multiple genes that interact with each other. We found evidence that some miRNAs are involved in transcriptional regulation with ncRNAs in complex loci.

In summary, this thesis provides solutions for these research questions and contributes to a better understanding of several important aspects of miRNA characteristics and regulation. It also demonstrates effective bioinformatics approaches for developing a robust machine learning model and for analyzing different miRNA high-throughput experiments.

Acknowledgements

This thesis is based on four years of research funded by the Functional Genomics Program of the Norwegian Research Council. During the course of the research I have been helped by many individuals.

First and foremost, I would like to thank my two supervisors. I owe my deepest gratitude to Pål Sætrom for his contributions of time, ideas, and guidance to make this thesis possible. His wide knowledge and logical way of thinking have been of great value to me. I also gratefully acknowledge Finn Drabløs for his supervision. He has helped to make bioinformatics fun for me throughout my PhD.

I am indebted to the members of the Bioinformatics and Gene Regulation group for their contributions to providing an excellent working environment. I am especially grateful to Laurent Thomas and Even Skaland for their collaboration. I am also grateful to Tony Håndstad for his advice on manuscript preparation.

I would like to thank my friends and colleagues for making my time at NTNU tremendously enjoyable.

Lastly, I wish to thank my parents, Kikuo Saito and Mihoko Saito, for their love and support. Arigatou.

Contents

  1. Introduction

  2. Papers and their corresponding sub-goals

  3. MicroRNAs and other non-coding RNAs

  4. High-throughput biological experiments

  5. Statistical tests and methods

  6. Machine learning theory and Support vector machine

  7. Computational implementation

  8. Future perspectives

List of Figures

List of Tables

List of Papers

Paper 1
MicroRNAs - targeting and target prediction
Takaya Saito and Pål Sætrom
New biotechnology 2010
DOI: 10.1016/j.nbt.2010.02.016
Paper 2
A two step site and mRNA-level model for predicting microRNA targets
Takaya Saito and Pål Sætrom
BMC bioinformatics 2010
DOI: 10.1186/1471-2105-11-612
Paper 3
Target gene expression levels and competition between transfected and endogenous microRNAs are strong confounding factors in microRNA high-throughput experiments
Takaya Saito and Pål Sætrom
Silence 2012
DOI: 10.1186/1758-907X-3-3
Paper 4
MicroRNAs affect gene expression by targeting cis-transcribed non-coding RNAs
Takaya Saito, Even Skaland, and Pål Sætrom
(Submitted)
Paper 5
Inferring causative variants in microRNA target sites
Laurent F. Thomas, Takaya Saito, and Pål Sætrom
Nucleic Acids Research 2011
DOI: 10.1093/nar/gkr414

Glossary

3’ UTR
Three prime untranslated region; non-coding regions of mRNA on the 3’ end
5’ UTR
Five prime untranslated region; non-coding regions of mRNA on the 5’ end
A
Adenine; a purine nucleobase paired with thymine in DNA and uracil in RNA
ACC
Accuracy; (TP + TN) / (P + N) in a binary classification model
ADTree
Alternating decision tree; a machine learning algorithm that combines more than one decision tree
Agile
Agile software development; a type of RAD methodology
Ago
Argonaute protein; a key component of the RISC complex
ANN
Artificial neural network; a machine learning method that mimics biological neural networks
ANOVA
Analysis of variance; a statistical method to infer differences among multiple groups
AU rich
Adenine:Uracil rich; nucleotide sequences with many adenines and uracils
AUC
Area under the ROC curve; a performance measure to evaluate the ROC curves
C. elegans
Caenorhabditis elegans; a transparent roundworm about 1 mm in length
bp
Base pair; a unit of nucleotide sequence length, where one base pair is one Watson-Crick pair
C
Cytosine; a pyrimidine nucleobase paired with guanine
CAR
Chromatin associated RNA; experimentally validated non-coding RNAs that are associated with chromatin
cis-NAT
Cis-natural antisense transcript; a pair of sense and antisense transcripts that overlap each other in the same locus
CLIP
Cross-linking and immunoprecipitation; a technique used to pull down RNA-protein complexes
CROC
concentrated ROC; a version of ROC for evaluating early retrieval performance
Cy3
Cyanine 3; a green fluorescent dye used in the microarray assay
Cy5
Cyanine 5; a red fluorescent dye used in the microarray assay
cDNA
Complementary DNA; DNA synthesized from mRNA by reverse transcriptase
CDS
Coding sequence; coding region of mRNA
CNS
Central nervous system; the part of the nervous system consisting of the brain and spinal cord
DGCR8
DiGeorge syndrome critical region gene 8; a protein that recognizes a miRNA stem loop in pri-miRNA
DNA
Deoxyribonucleic acid; a nucleic acid that contains genetic information
dsRNA
Double-stranded RNA; RNA with two complementary strands
EBI
European bioinformatics institute; a center for research and services in bioinformatics in Europe
ERR
Error rate; (FP + FN) / (P + N) in a binary classification model
FDR
False discovery rate; FP / (FP + TP) in a binary classification model
EST
Expressed sequence tag; a short sub-sequence of a cDNA sequence
FLcDNA
Full-length cDNA; full-length cDNA used by the Sanger sequencing method
FN
False negative; prediction outcome is false while the actual value is true
FP
False positive; prediction outcome is true while the actual value is false
G
Guanine; a purine nucleobase paired with cytosine
Gb
Giga base pair; 1,000,000,000 bp
GEO
Gene expression omnibus; a public repository for microarray data
GFF
General feature format; a text file format for genomic positional information
GPMDB
Global Proteome Machine database; a public repository for proteomics data
G:U wobble
Guanine:Uracil wobble; guanine and uracil wobble pairing
GTP
Guanosine triphosphate; a purine nucleotide that is used for energy transfer within the cell
HITS
High throughput sequencing; the next generation sequencing
iTRAQ
Isotope tags for relative and absolute quantification; a non-gel-based technique for quantifying proteins
INSDC
International nucleotide sequence database collaboration; a group that organizes SRA repositories
K-S test
Kolmogorov-Smirnov test; a non-parametric statistical method
k-NN
k-nearest neighbor; a type of machine learning algorithm
LC-MS/MS
Liquid chromatography-tandem mass spectrometry; MS/MS with liquid chromatography. Liquid chromatography separates ions or molecules dissolved in a solvent
ML
Machine learning; a class of computational algorithms that can imitate learning
MAF
Multiple alignment format; a text file format for multiple alignments
MIAME
Minimum information about a microarray experiment; a standard for reporting microarray experiments
miRISCs
miRNA RISC; RISC loaded with miRNA
miRNA
Micro RNA; a class of small ncRNA that regulates protein expression
MS
Mass-spectrometry; a technique that measures the mass-to-charge ratio of charged particles
MS/MS
Tandem mass spectrometry; a technique that involves multiple steps of mass spectrometry
N
Negative; actual negative values in a binary classification model
NB
Naive Bayes; a type of statistical learning algorithm that uses Bayes’ theorem
NCBI
National center for biotechnology information; U.S. government-funded national resource for molecular biology information
NPV
Negative predictive value; TN / (TN + FN) in a binary classification model
OOP
Object-oriented programming; a computer programming paradigm
P
Positive; actual positive values in a binary classification model
PCR
Polymerase chain reaction; a technique used to amplify DNA sequences
piRNA
Piwi-interacting RNA; siRNA/miRNA like ncRNAs found in germline cells
Pol II
RNA polymerase II; an enzyme that synthesizes several types of RNAs
Pol III
RNA polymerase III; an enzyme that synthesizes rRNA, tRNA and other small RNAs
PRIDE
Proteomics identifications database; a public repository for proteomics data
RDB
Relational database; a computational data storage method. Data are stored in tables with a collection of relations
RIP
Ribonucleoprotein immunoprecipitation; a technique used to pull down RNA-protein complexes
ncRNA
Non-protein-coding RNA; functional RNA that is not translated into protein
pri-miRNA
Primary miRNA; an RNA molecule that contains one or more miRNA stem loops
pre-miRNA
Precursor miRNA; a miRNA precursor with a hairpin stem loop that is exported to the cytoplasm
PRC
Precision; equivalent to PPV or Positive predictive value
PPV
Positive predictive value; TP / (TP + FP) in a binary classification model
QP
Quadratic programming; a class of optimization algorithms to maximize a quadratic function subject to linear constraints
RAD
Rapid application development; a software development methodology
Ran
Ras-related nuclear protein; a GTP binding protein that is involved in transport between nucleus and cytoplasm
RNA
Ribonucleic acid; a nucleic acid that carries genetic information and performs catalytic and regulatory functions in cells
RBF
Radial basis function; a real-valued function whose value depends only on the distance from the origin
RNAi
RNA interference; a regulatory process that suppresses gene expression at the post-transcriptional level with small RNAs
RNase
Ribonuclease; an enzyme that degrades RNAs into smaller components
rRNA
Ribosomal RNA; RNA components of the ribosome
RISC
RNA-induced silencing complex; a key multiprotein complex in RNAi
RITS
RNA-induced initiation of transcriptional gene-silencing; a complex involved in regulation of chromatin structure
RT-qPCR
Reverse transcription quantitative PCR; a variant of PCR that can be used to measure RNA expression levels
ROC
Receiver operating characteristics; a graph that shows true positive rate versus false positive rate
SAGE
Serial analysis of gene expression; a sequencing technique that uses short tags generated from 3’ ends of mRNA transcripts
siRISC
siRNA RISC; RISC loaded with siRNA
siRNA
Small interfering RNA; small ncRNAs involved in RNAi for gene silencing
SILAC
Stable isotope labeling with amino acids in cell culture; a technique for in vivo incorporation of a label into proteins
SRA
Sequence read archive; data repository for next generation sequencing data
SRM
Structural Risk Minimization; a machine learning principle
SN
Sensitivity; TP / P in a binary classification model
SP
Specificity; TN / N in a binary classification model
SQL
Structured query language; a language used with RDB
SNP
Single-nucleotide polymorphism; DNA polymorphism with a single nucleotide difference between members of a species
ssRNA
Single-stranded RNA; RNA with one strand
SVM
Support vector machine; a machine learning algorithm that finds the decision boundary with the maximum margin between classes
SVR
Support vector regression; a version of SVM for regression
TNR
True negative rate; equivalent to SP or Specificity
TPR
True positive rate; equivalent to SN or Sensitivity
TDD
Test-driven development; a type of RAD methodology
tRNA
Transfer RNA; RNA that transfers a specific amino acid to the ribosome during protein synthesis
TN
True negative; prediction outcome is false while the actual value is false
TP
True positive; prediction outcome is true while the actual value is true
U
Uracil; a pyrimidine nucleobase paired with adenine in RNA
UV
Ultraviolet; electromagnetic radiation with shorter wavelength than visible light
VC dimension
Vapnik Chervonenkis dimension; a measure of capacity for the data point separation by hyperplanes
WTSS
Whole transcriptome shotgun sequencing; high throughput technique at the whole transcriptome level with next generation sequencing
XP
Extreme programming; a type of RAD methodology
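
Several glossary entries (ACC, ERR, SN, SP, PPV, NPV) are defined as ratios of the four confusion-matrix counts TP, FP, TN, and FN. The following is a minimal Python sketch of those ratios, not code from the thesis. The names `ConfusionCounts`, `ratio`, and `classification_measures` are hypothetical, and returning NaN for an undefined ratio is an assumption made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of the four outcomes of a binary classifier (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    @property
    def p(self) -> int:
        """P: number of actual positives (TP + FN)."""
        return self.tp + self.fn

    @property
    def n(self) -> int:
        """N: number of actual negatives (TN + FP)."""
        return self.tn + self.fp


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero.

    Returning NaN is an illustrative choice, not something the thesis specifies.
    """
    return numerator / denominator if denominator else math.nan


def classification_measures(c: ConfusionCounts) -> dict:
    """Compute the glossary's measures from confusion-matrix counts."""
    total = c.p + c.n
    return {
        "ACC": ratio(c.tp + c.tn, total),    # accuracy: (TP + TN) / (P + N)
        "ERR": ratio(c.fp + c.fn, total),    # error rate: (FP + FN) / (P + N)
        "SN": ratio(c.tp, c.p),              # sensitivity (TPR): TP / P
        "SP": ratio(c.tn, c.n),              # specificity (TNR): TN / N
        "PPV": ratio(c.tp, c.tp + c.fp),     # precision: TP / (TP + FP)
        "NPV": ratio(c.tn, c.tn + c.fn),     # TN / (TN + FN)
    }


if __name__ == "__main__":
    # Made-up counts, used only to show the call pattern.
    example = ConfusionCounts(tp=40, fp=10, tn=45, fn=5)
    for name, value in classification_measures(example).items():
        print(f"{name}: {value:.3f}")
```

Because ACC and ERR share the denominator P + N and their numerators cover all four counts between them, they always sum to 1 whenever both are defined.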