Graduate Thesis Or Dissertation
 

Interpretable Machine Learning: Applications in Biology and Genomics

Public Deposited

https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/zk51vp627

Descriptions

Attribute Name / Values
Abstract
  • Machine learning (ML) and deep learning (DL) models affect our daily lives through applications in natural language modeling, image analysis, healthcare, genomics, and bioinformatics. The exponential growth of biological sequence data necessitates accompanying advances in computational methods. Although deep learning is highly effective for detecting and classifying biological sequences, challenges remain in extracting meaningful patterns and information from the learned models. To realize the potential of deep learning in biology, we need strategies for model interpretation that reveal or further clarify biological principles. In this thesis, we first present problems and methods for classifying patterns in biological sequence data. Next, we describe a series of techniques we developed to understand the machine learning models and identify meaningful biological patterns. For each problem, we created an interpretable, intelligent system without sacrificing performance. To test our approaches to model interpretation, we first focused our analysis on known biological patterns and then extended the search beyond what is known. This work can be categorized into four applications. I) The development of bpRNA, a novel annotation tool capable of parsing RNA secondary structures. bpRNA also provides a richly annotated database containing over 100,000 structures from seven different sources, along with base-pairing information. II) The detection of pseudoknots from sequence data alone with a machine learning model, Pseudoknow. Pseudoknots are among the most common RNA structural motifs and are crucial for RNA regulation. Improving the prediction of RNA pseudoknot structure will allow for a better understanding of how RNA structure informs regulation and metabolism. III) Classification from gene expression data using stacked denoising autoencoders (SDAE) to distinguish healthy cells from cancerous ones and to predict post-mortem time of death. These classification methods were developed with the goal of identifying the genes that are most informative for prediction and hence most biologically relevant. Our study suggests that the most influential genes from the dimensionality reduction performed by the SDAE were highly predictive of cancerous versus non-cancerous cell type. IV) Interpretation of the rules learned by a deep convolutional neural network to recognize known and previously uncharacterized core promoter sequence motifs in human whole-genome sequences. We proposed and compared new training strategies to identify transcription start sites (TSS), located within core promoters, from biological sequences. The main goal of this application was to develop new strategies for interpreting how the convolutional neural network learns biological patterns and for understanding the correlations between and within the convolutional layers. These new techniques could aid in deriving unknown patterns in biology and genomics and are applicable more broadly to other areas of data science.
Embargo reason
  • Pending Publication
Embargo date range
  • 2019-12-13 to 2020-07-14
