MicroRNAs are a highly conserved class of small endogenous RNAs, about 22 nt in length, that are involved in post-transcriptional gene silencing and have prominent roles in disease and development. Though microRNA discovery was once an arduous task, the advent of high-throughput sequencing technology has led to novel microRNAs being discovered at a rapid rate. Several data-driven pipelines and machine learning-based methods have been devised so that the beginning stages of microRNA discovery can be performed in silico. Despite these efforts, several challenges persist in the computational prediction of microRNAs. These challenges include the identification of microRNAs with low expression, proper determination of the precursor span, and the precise labeling of the cleavage sites involved in their biogenesis. This thesis addresses these challenges with two new machine learning-based approaches. The first, miRWoods, improves precursor detection and uses stacked random forests for the sensitive detection of microRNAs. We report that miRWoods has a 10% higher recall of annotated microRNAs when compared with other software. We applied this method to the genomes of human, mouse, Felis catus (cat), and Bos taurus (cow) and identified hundreds of novel microRNAs in small RNA sequencing datasets. Our novel predictions include a microRNA in an intron of tyrosine kinase 2 (TYK2) that is present in both cat and cow, as well as a family of mirtrons with two instances in the human genome. Our predictions support an expanded miR-2284 family in the bovine genome, a larger mir-548 family in the human genome, and a larger let-7 family in the feline genome. The second, DeepMirCut, is a deep learning approach for identifying cleavage sites within microRNAs. This approach is inspired by site-labeling methods for natural language processing and can accurately predict how the microRNA processing enzymes Dicer and Drosha cleave the microRNA precursor.
Funding Statement
This work is supported by OHSU MRF grant number 1414, NIH grant R56 AG053460, start-up funds from Oregon State University, and NIH grant R01 AG061406 for David A. Hendrix. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.