MicroRNAs: computational target prediction and analysis with machine learning methods

Scientific area: Bioinformatics

Project team

Stefano Rovetta, p.i. (DISI)
Francesco Masulli (DISI)
Maura E. Monville (DISI)
Giuseppe Russo (CROM)

DISI - Dept. of Computer and Information Sciences - University of Genova
Via Dodecaneso 35 - 16166 Genova (www.disi.unige.it) Tel: +39 010 353 6604 Fax: +39 010 353 6699

PI: Stefano Rovetta

CROM - Oncology Research Centre of Mercogliano "Fiorentino Lo Vuolo"
Via Ammiraglio Bianco - 83013 Mercogliano (Av) - Italy (http://www.cro-m.eu) Tel: +39 0825-1911711

PI: Giuseppe Russo

General description

MicroRNA. MicroRNAs (miRNAs) are short, non-coding RNAs, typically ~22 nt long, that regulate gene expression by suppressing translation and destabilizing messenger RNAs with specific target sequences. The action mechanism of each individual miRNA is not always understood; however, regulation by miRNAs is peculiar in that it guarantees rapid and reversible changes in protein synthesis without altering transcription, and it is clear that collectively miRNAs play a central role in controlling both physiological and pathological processes.

As an example, several miRNAs have been found to have a significant involvement in the development of some types of cancer, for instance prostate cancer. Evidence suggests that miRNAs are aberrantly expressed in prostate cancer compared with normal tissue, although inconsistencies in the data collected prevent a clear understanding of which miRNAs are specific biomarkers for this pathology. However, decreased expression of miR-23b, miR-100, miR-145 and miR-205 has been consistently observed.

As another example, hundreds of miRNAs are expressed in mammalian brain, and many of them, such as let-7a, miR-218, and miR-125, are specifically expressed or enriched in brain tissues. Recent studies suggest a possible role of miRNAs in neurogenesis. It has also been shown that synaptic plasticity is critically dependent upon regulation of specific protein synthesis near or within the pre- and/or post-synaptic sites. Numerous components of the microRNA machinery, including Dicer, are expressed within dendrites, and mature miRNAs and their precursors are detected in nerve terminals.

Moreover, a single miRNA may target hundreds of mRNAs including global regulators of translation, and the hypothesis of a combinatorial action of sets of miRNAs has been proposed; so it is possible that miRNAs play an even more profound role than what is currently known.

miRNA prediction. The mechanisms for miRNA expression and maturation are quite well known. In contrast, the action mechanisms are not completely understood. A miRNA does not require a perfectly complementary (Watson-Crick) match to its target mRNA to perform its control action. In fact, imperfect complementarity and the resulting secondary conformation of the miRNA-target complex may explain part of the action mechanism. It is also known that miRNAs reduce protein expression in two ways, namely, either inhibiting translation of the target mRNA, or promoting its cleavage; in addition, they may also activate gene expression.

Several studies have highlighted different patterns of partial match. For instance, it is well known that a perfect match in the so-called seed region (bases 2-8 from the 5' end of the miRNA) accounts for a large number of experimentally validated miRNA targets; but it is also known (Grimson, 2007) that other patterns, in addition or as alternatives to seed matching, are observable. This makes it difficult to analyze miRNA-mRNA interactions in a purely computational way (target prediction). The difficulty accounts for the relatively large number of computational prediction methods currently available, and for the fact that they tend to disagree with each other, or at least to agree only among methods based on similar hypotheses.
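As an illustration of the seed-matching hypothesis only, the following minimal Python sketch looks for sites in a transcript that pair perfectly with bases 2-8 of a miRNA. The function names and the example sequences are illustrative assumptions, and the sketch deliberately ignores the other match patterns mentioned above.

```python
# Watson-Crick complement table for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna: str, utr: str, start: int = 2, end: int = 8) -> list[int]:
    """Return 0-based UTR offsets that pair perfectly with the miRNA seed.

    The seed is taken as miRNA bases start..end (1-based, inclusive),
    counted from the 5' end.
    """
    seed = mirna[start - 1:end]
    site = reverse_complement(seed)  # what the mRNA must read, 5'->3'
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping occurrences
    return hits


if __name__ == "__main__":
    # Illustrative sequences, not taken from any database.
    print(seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAAA"))
```

A site list produced this way would be only a first set of candidates; validated targets that lack a perfect seed match would be missed by construction.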

Research project: An optimization approach to automatically tune miRNA prediction software

Computational target prediction. Most currently available prediction tools are based on a given set of predefined hypotheses. These hypotheses often stem from either empirical observation, or statistical analysis of nucleotide sequence patterns (frequency of observation, enrichment analysis, phylogenetic conservation analysis). They are tested on a miRNA-mRNA site pair; partial scores are given to specific features measured on the pair, and if the overall score passes a threshold, then the mRNA site is predicted to be a target for the given miRNA.

For instance, TargetScan analyzes seed complementarity and phylogenetic conservation; RNAhybrid finds sites with minimum free energy. PicTar and miRanda (Enright, 2003) are somewhat more comprehensive tools. For instance, miRanda takes into account several effects, such as seed complementarity, non-complementarity of the bases following the seed region, overall complementarity, free energy, and conservation analysis. However, even miRanda does not provide perfectly consistent results with respect to available knowledge (targets which have been validated in lab experiments).

Design goals. The main desirable properties of target prediction software are discussed below.

In order to identify all possible binding sites, in principle a biological validation should be performed on an immense number of gene sites. A sophisticated and efficient in-silico approach has the potential to drastically reduce the number of putative sites to be validated, thus reducing the number and cost of actual lab experiments. Given the currently incomplete knowledge about miRNA targets, this is not an easy requirement to implement.

Identification of target transcripts can be done offline, and its results can be provided as a library. Therefore, it need not be particularly time-efficient, since it does not affect the workflow's throughput. On the other hand, the identification should be very precise, since errors can have an effect on the overall cost of the analysis.

Improving existing methods. The DISI group has recently developed a methodology to improve the performance of current prediction tools for the estimation of target mRNAs for miRNA. The method replaces the current approach, based on working hypotheses, with a data-based approach. It builds upon miRanda, which was proposed in 2003 by Enright et al. (Enright, 2003).

In addition to its flexibility, this tool has been selected because its code is available (whereas other tools, like TargetScan, only give access to a library of results). The approach followed by miRanda is to split the target gene prediction task into three distinct steps carried out in sequence, in order of increasing computational cost: (A) Homology evaluation; (B) Free energy computation; (C) Evolutionary conservation computation.

The first step in miRanda is based on sequence matching: miRNA and 3'-UTR mRNA sequences are (reverse-)aligned in order to find sites with a certain level of complementarity, so as to assess whether there are any potential binding sites. Sequence alignment is carried out using a slightly modified version of the Smith-Waterman algorithm, a technique based on dynamic programming. miRanda uses it to compute alignment based on complementarity rather than match, so the score matrix assigns positive scores to complementary nucleotides and G:U 'wobble' pairs, and negative scores to all other base pairs. Score values are selected so as to favour the presence of known effects, such as those cited above (seed complementarity, non-complementarity of the small region following the seed, and so on).
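The idea of a local alignment scored on complementarity can be sketched as follows. This is a simplified illustration and not miRanda's code: the score values, the single linear gap penalty, and the function names are assumptions, and the position-dependent weighting that favours the seed region is omitted.

```python
def pair_score(a: str, b: str) -> int:
    """Score a miRNA base against an mRNA base by complementarity."""
    pair = {a, b}
    if pair in ({"A", "U"}, {"G", "C"}):
        return 5    # Watson-Crick pair (illustrative value)
    if pair == {"G", "U"}:
        return 2    # G:U wobble pair (illustrative value)
    return -3       # any other combination (illustrative value)


def local_complementarity(mirna: str, utr: str, gap: int = -8) -> tuple[int, int]:
    """Smith-Waterman local alignment scored on complementarity.

    The miRNA is reversed so that its 3'->5' reading pairs with the
    UTR read 5'->3'. Returns (best score, 0-based UTR index at which
    the best local alignment ends).
    """
    rev = mirna[::-1]
    prev = [0] * (len(utr) + 1)
    best, best_end = 0, -1
    for i in range(1, len(rev) + 1):
        curr = [0] * (len(utr) + 1)
        for j in range(1, len(utr) + 1):
            curr[j] = max(
                0,                                                 # restart alignment
                prev[j - 1] + pair_score(rev[i - 1], utr[j - 1]),  # pair two bases
                prev[j] + gap,                                     # gap in the UTR
                curr[j - 1] + gap,                                 # gap in the miRNA
            )
            if curr[j] > best:
                best, best_end = curr[j], j - 1
        prev = curr
    return best, best_end


if __name__ == "__main__":
    print(local_complementarity("GAGGUAG", "AAACUACCUCAAA"))
```

Only two rows of the dynamic programming table are kept, which is enough to obtain the best score; recovering the alignment itself would require storing the full table or traceback pointers.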

Only those alignments whose score exceeds a given threshold are considered potential binding sites and passed to the subsequent processing step, i.e., the free energy computation. This second step is carried out using the RNA folding routine RNAfold included in the Vienna RNA secondary structure library (RNAlib). This routine computes the secondary structure and the free energy of a single RNA sequence folding. Again, only hybridization sites whose free energy is under a given threshold are considered valid. Finally, a third filter is applied to binding sites that passed the previous two filtering stages. In order to reduce false positives, only predicted target sites which are conserved among different species are considered valid. The evolutionary conservation computation is carried out using PhastCons, a software tool based on a phylogenetic hidden Markov model that estimates the degree of sequence conservation starting from a multi-alignment of different sequences. PhastCons is not integrated in the miRanda code, so this computation is carried out after the execution of the miRanda program itself.
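The three stages act as a cascade in which each more expensive test is run only on the sites that survived the cheaper ones. A minimal sketch of that control flow is given below; the three scoring functions are passed in as hypothetical callables (they stand in for the alignment, folding, and conservation tools and are not their real interfaces), and expressing conservation as a numeric score with a threshold is an assumption.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def cascade_filter(
    candidates: Iterable[T],
    alignment_score: Callable[[T], float],   # stage A, cheapest
    free_energy: Callable[[T], float],       # stage B
    conservation: Callable[[T], float],      # stage C, most expensive
    min_score: float,
    max_energy: float,
    min_conservation: float,
) -> list[T]:
    """Keep the candidates that pass all three thresholds, in order."""
    accepted = []
    for site in candidates:
        if alignment_score(site) <= min_score:
            continue  # score must exceed the threshold
        if free_energy(site) >= max_energy:
            continue  # free energy must be under the threshold
        if conservation(site) < min_conservation:
            continue  # site must be sufficiently conserved
        accepted.append(site)
    return accepted
```

The thresholds `min_score` and `max_energy` are examples of the heuristically chosen parameters that the optimization procedure described in this document is meant to tune.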

Optimization-based algorithm design. These steps, especially the first two, are based on a set of parameters which have been heuristically determined. We have shown (Masulli, 2009) that it is possible to increase the performance of this procedure by automatically tuning these parameters with a global optimization method based on Genetic Algorithms. A preliminary implementation of this technique has been successfully applied to the improvement of computational target identification for investigating known miRNA involvement in prostate cancer. Note, however, that neither the use of a Genetic Algorithm nor the application to miRanda is essential to the developed procedure; in fact, experiments with alternative optimization techniques (Particle Swarm Optimization) and with more comprehensive prediction methods (committees of several methods) are scheduled for future studies.

The optimization problem can be cast as a classifier training problem, where the classifier is the prediction software, the optimization variables are the method's parameters, and the objective function is an evaluation of the classifier performance with respect to the database of validated targets. To achieve the stated design goals (which in this setting amount to low empirical error and low expected error) in the absence of sound theoretical bounds, the main strategy is of the brute force type, that is, to use as large a data set as possible.
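The formulation above can be illustrated with a deliberately small genetic search over integer parameters. This is a generic sketch under stated assumptions and not the implementation used in the project: plain accuracy stands in for the performance measure, `predict` is a hypothetical wrapper around the prediction software, and the selection, crossover, and mutation operators are the simplest possible choices.

```python
import random
from typing import Callable, Sequence

Params = tuple[int, ...]


def accuracy(params: Params, predict: Callable, labelled: Sequence) -> float:
    """Fraction of labelled examples (x, y) that the predictor gets right."""
    correct = sum(predict(params, x) == y for x, y in labelled)
    return correct / len(labelled)


def genetic_search(
    fitness: Callable[[Params], float],
    bounds: Sequence[tuple[int, int]],   # inclusive range of each parameter
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> Params:
    """Return the best parameter tuple found by a simple elitist GA."""
    rng = random.Random(seed)

    def random_individual() -> Params:
        return tuple(rng.randint(lo, hi) for lo, hi in bounds)

    def crossover(a: Params, b: Params) -> Params:
        # Uniform crossover: each gene comes from either parent.
        return tuple(rng.choice(pair) for pair in zip(a, b))

    def mutate(ind: Params) -> Params:
        # Each gene is resampled within its bounds with a fixed probability.
        return tuple(
            rng.randint(lo, hi) if rng.random() < mutation_rate else gene
            for gene, (lo, hi) in zip(ind, bounds)
        )

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]          # the better half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)
```

For a toy predictor with a single threshold parameter, one might call `genetic_search(lambda p: accuracy(p, predict, data), bounds=[(0, 10)])`. In the real setting every fitness evaluation would mean running the prediction software on the whole validated data set, which is why the cost of this loop matters.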

To this end, software for automated query of gene transcript and miRNA databases has been set up. At the time of writing, we have data for 695 miRNAs and 557972 transcripts. The optimization procedure that we have selected is inherently suitable for parallel implementation, but the discrete (or discretized) optimization domain and the sheer volume of the data imply that computational requirements remain large.

A gain in this respect may be obtained by using a randomized offspring selection procedure within the genetic algorithm. This also provides the added bonus of operating on a stochastic approximation of performance, which, on such a large data set, is a more reliable estimate of the expected performance than simple empirical performance.
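One possible reading of this idea, offered here as an assumption and not as the procedure actually adopted, is to score each candidate parameter set on a random subsample of the validated data instead of the full set. The sketch below shows such a stochastic fitness estimate; `predict` is again a hypothetical wrapper around the prediction software.

```python
import random
from typing import Callable, Sequence


def subsampled_accuracy(
    params,
    predict: Callable,
    labelled: Sequence,      # list of (example, label) pairs
    sample_size: int,
    rng: random.Random,
) -> float:
    """Estimate accuracy on a random subset drawn without replacement."""
    batch = rng.sample(labelled, min(sample_size, len(labelled)))
    correct = sum(predict(params, x) == y for x, y in batch)
    return correct / len(batch)
```

Because a fresh subset is drawn at every call, a parameter set that only fits a few specific examples is less likely to keep a high score across generations, at a fraction of the cost of a full evaluation.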

Expected output of the project. The immediate result of the activity will be a prediction method whose performance and reliability are better than those of other available alternatives. In turn, this method will be used to perform in-silico experiments (reliable predictions), which will be available to the biologist as a ground truth against which hypotheses about the operation of miRNAs can be validated and compared, to shed light on the actual targeting mechanisms.

Further developments of the proposed project include in-silico analysis of putative miRNA target sites in genes related to specific application domains, and lab experiments to study the function of these miRNAs and their gene targets. Two current application projects are related to the previously cited examples, namely, identification of biomarkers in prostate cancer and study of neural development and synaptic plasticity.


  1. Enright, A. J., John, B., Gaul, U., Tuschl, T., Sander, C., and Marks, D. S., "MicroRNA targets in Drosophila," Genome Biol, vol. 5, 2003.
  2. Grimson, A., Farh, K. K., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P., "MicroRNA targeting specificity in mammals: Determinants beyond seed pairing," Molecular Cell, vol. 27, no. 1, pp. 91-105, July 2007.
  3. Masulli, F., Parini, A., Rovetta, S., and Russo, G., "Searching for microRNA prostate cancer target genes," IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2009), pp. 2021-2026, 2009.