Graduate Thesis Or Dissertation
 

Deep Learning for Human and Biological Languages

Publicly Deposited

Downloadable Content

Download PDF
https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/70795g883

Descriptions

Attribute Name / Values
Creator
Abstract
• We explore the application of deep learning to the disparate fields of natural language processing and computational biology. Both the sentences uttered by humans and the RNA and protein sequences found within the cells of their bodies can be treated as formal languages in the computer science sense: sets of strings over an alphabet, generated by grammar rules. To briefly characterize these languages: natural language has a large number of word types but short token sequences, while biological sequences have few token types but long sequences. A sentence draws on a vocabulary of more than 100,000 word types yet in practice rarely exceeds 20-30 words; RNA sequences have only 4 possible tokens but range from fewer than 100 to more than 10,000 nucleotides in length, and protein sequences similarly have only 20 possible amino acid tokens. These practical differences inform our modeling choices to make deep learning tractable and effective, and they further influence what additional algorithms are needed to attain strong results. Each of these widely different domains has its own form of syntactic structure, and the respective grammars dictate how words, nucleotides, and amino acids interact to form structures: in language, syntactic parse trees; in RNA, secondary-structure base pairings; and in proteins, tertiary-structure contact maps.

We present a deep learning approach for predicting syntactic structures for human languages (parsing), together with dynamic programming techniques that allow fast linear-time decoding while maintaining close to state-of-the-art accuracy. Reordering the traditional exhaustive cubic-time $O(n^3)$ CKY parsing algorithm to run left-to-right and bottom-up allowed us to apply inexact beam search and then cube pruning, attaining linear $O(n \cdot b \log b)$ runtime complexity (a simplified beam-search sketch appears below). Despite using inexact search, our model attained results (91.97 F1) better than the previous state-of-the-art model (91.79 F1), which used exhaustive decoding on the same underlying neural network architecture.

Analogous to linguistic grammar rules, nucleotides in RNA sequences are subject to base-pairing potentials: Adenine (A) prefers to bind with Uracil (U), and Cytosine (C) prefers to bind with Guanine (G). The secondary-structure base-pairing behavior of RNA often involves interactions across the entire sequence. We present a deep learning approach for predicting secondary structure for RNA sequences (folding), called RNA-Fix, which uses self-attention-based Transformer models to visualize and correct errors made by other structure prediction algorithms. We find that a simple architecture consisting of LSTM and Transformer layers (sketched below) attains a strong baseline, which improves further when predictions made by another program are provided as additional input. Visualizing our model's attention weights, we find that the last layer attends strongly to bracketed structural sections of the output. We further connect this to our human-language parsing work by presenting the Nussinov dynamic programming decoding algorithm adapted for deep learning, which guarantees a balanced and valid base-pairing output (also sketched below).
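To make the search idea concrete, here is a minimal, generic left-to-right beam search, not the thesis's actual cube-pruned span parser: `expand` and `score` are hypothetical stand-ins for a parser's legal actions (e.g. shift/reduce) and its neural scoring model.

```python
import heapq

def beam_search(init, expand, score, n_steps, beam_size):
    """Generic left-to-right beam search. Keeping only the `beam_size`
    best partial hypotheses at each of n steps gives O(n * b log b)
    total work, versus exhaustive O(n^3) CKY-style search."""
    beam = [(0.0, init)]  # (cumulative model score, partial hypothesis)
    for _ in range(n_steps):
        candidates = [
            (s + score(hyp, action), hyp + (action,))
            for s, hyp in beam
            for action in expand(hyp)  # legal next actions for this hypothesis
        ]
        # inexact pruning: the source of the b log b factor
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]
```

Because pruning is inexact, the best hypothesis can fall off the beam; the F1 comparison above shows that in practice accuracy did not suffer.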
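The RNA-Fix description above suggests a per-position tagger that reads the nucleotide sequence together with another predictor's dot-bracket output. The PyTorch sketch below is one plausible reading; the layer sizes, vocabularies, and summed-embedding input encoding are assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class RNAFixStyleTagger(nn.Module):
    """Minimal LSTM + Transformer tagger in the spirit of RNA-Fix: it
    reads a nucleotide sequence plus another program's dot-bracket
    prediction and emits a corrected per-position structure label.
    All sizes here are illustrative assumptions."""

    def __init__(self, d_model=128, n_heads=8, n_layers=2):
        super().__init__()
        self.nuc_embed = nn.Embedding(5, d_model)     # A, C, G, U, pad
        self.struct_embed = nn.Embedding(4, d_model)  # '(', ')', '.', pad
        self.lstm = nn.LSTM(d_model, d_model // 2,
                            batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, 3)  # corrected '(', ')', '.'

    def forward(self, nucs, prior_struct):
        # sum the embeddings so the model sees sequence and prior guess
        x = self.nuc_embed(nucs) + self.struct_embed(prior_struct)
        x, _ = self.lstm(x)   # local, order-aware features
        x = self.encoder(x)   # global self-attention across the sequence
        return self.out(x)    # per-position logits

# usage: a batch of 2 sequences of length 50
model = RNAFixStyleTagger()
logits = model(torch.randint(0, 4, (2, 50)), torch.randint(0, 3, (2, 50)))
```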
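The classical Nussinov recurrence behind that decoder maximizes the number of valid nested base pairs; the deep learning adaptation would replace the hard-coded reward of 1 per pair below with learned pairing scores. A minimal pair-counting version:

```python
def nussinov(seq, min_loop=3):
    """Nussinov dynamic program: dp[i][j] is the maximum number of valid,
    nested (balanced) base pairs in seq[i..j]. Cubic time, analogous to
    CKY. min_loop enforces a minimum hairpin-loop distance between i and
    its pairing partner."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # widen spans bottom-up
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]                  # case 1: i left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:    # case 2: pair i with k
                    right = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAAUCC"))  # -> 2 (the two outer G-C pairs)
```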
Like CKY, this decoder has cubic runtime complexity; on a dataset of RNA sequences limited to length 50, it attains accuracies surpassing our RNA-Fix models. We also discuss how to linearize the runtime, which would allow us to scale to datasets of longer sequences.

Protein sequences are more complex still, featuring many more possible interactions among the 20 different types of amino acids. A typical way to model how a protein sequence will eventually fold into a 3D molecule is to first search a database for many similar or homologous sequences, use the resulting multiple sequence alignment (MSA) as input, and then predict the distance from each amino acid position to every other position, a representation called a contact map. We present a deep learning approach for predicting tertiary structure for protein sequences (contact map prediction), together with an algorithm that improves the input and output simultaneously by iteratively realigning the former based on the latter (see the sketch below). Focusing on cases where few or no homologous sequences can be found for a given input protein sequence (MSA size $\leq$ 10), we find that this iterative process of realigning input sequences against output structures yields improvements especially in short-range, but also in medium- and long-range, contacts.
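The iterate-and-realign loop itself is simple to state. This is a structural sketch only: `predict_contacts` and `realign_msa` are hypothetical stand-ins for the thesis's contact-map predictor and structure-guided aligner.

```python
def iterative_realignment(query, msa, predict_contacts, realign_msa,
                          n_rounds=3):
    """Alternately improve the MSA (input) and the contact map (output):
    predict contacts from the current alignment, realign the homologs
    using the predicted structure as guidance, and repeat. Intended for
    the low-homology regime (MSA size <= 10) targeted above."""
    contacts = predict_contacts(query, msa)
    for _ in range(n_rounds):
        msa = realign_msa(query, msa, contacts)   # structure-guided realignment
        contacts = predict_contacts(query, msa)   # re-predict from improved MSA
    return contacts, msa
```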
License
Resource Type
Date Issued
Degree Level
Degree Name
Degree Field
Degree Grantor
Commencement Year
Advisor
Committee Member
Academic Affiliation
Copyright Statement
Publisher
Peer Reviewed
Language

Relationships

Parents:

This work has no parents.

In Collection:

Articles