Intuitively, it seems as though natural language processing tasks might benefit from explicit representations of the syntactic and semantic properties of text. OntoNotes is a dataset that attempts to annotate texts so that as much as possible of their meaning is represented explicitly within the annotation. Many tools exist...
Commercial and public safety usage of Unmanned Aerial Vehicles in the National Airspace is currently restricted by federal regulation. The Federal Aviation Administration is interested in modifying the restrictions; however, research is needed to study the human factor and required aptitude for a single human operating multiple UAVs. This Master’s...
In-hand manipulations consist of dexterous motions that come easily to humans but still pose a challenge to robotic systems. It is difficult to control finger motions in long complicated sequences due to high DOFs and intricate contact interactions. For such complex motions, in-hand manipulations have generally been broken into a...
We take for granted how quickly we, as humans, form mental models of the world around us. By the time we are toddlers, we have an observable intuition around the physical rules of the world. Stacking blocks such that they don’t fall over becomes such a trivial task that it...
Severe weather in the United States frequently causes huge insured losses to crops and property. It has a major impact on the weather insurance industry and elicits diverse responses from it. Events such as hail, storms, and hurricanes are more likely to cause catastrophic losses. It therefore becomes crucial to collect and analyze these extreme...
We consider the problem of finding unknown patterns that are recurring across multiple sets. For example, finding multiple objects that are present in multiple images or a short DNA code that is repeated across multiple DNA sequences. We first consider a simple problem of finding a single unknown pattern in...
Novelty detection plays an important role in machine learning and signal processing. This
project studies novelty detection in a new setting where the data object is represented as
a bag of instances and associated with multiple class labels, referred to as multi-instance
multi-label (MIML) learning. Contrary to the common assumption...
Object recognition is a fundamental problem in computer vision. Recognition is
required by many applications. This thesis presents a distance based approach to
recognize objects. We are interested in objects that belong to very similar classes,
where each class has large variations. This problem is called fine-grained object
recognition. Given...
Monte-Carlo planning algorithms such as UCT make decisions at each step by
intelligently expanding a single search tree given the available time and then
selecting the best root action. Recent work has provided evidence that it can be
advantageous to instead construct an ensemble of search trees and make a...
Gusset plates are an important component of bridges. They are thick sheets of steel that join steel members together using fasteners and also strengthen their joint. Transportation agencies regularly evaluate and rate their inventories of gusset plate connections using visual inspection, which is very costly. To address this issue, we...
Many large-scale data analysis applications involve data that can vary over both time and space. Often the primary goal of analyzing spatiotemporal data is identifying trends, movements, and sudden changes with respect to time, location, or both. This can include a variety of applications in economics (housing prices, unemployment, job...
Machine learning models for natural language processing have traditionally relied on large numbers of discrete features, built up from atomic categories such as word forms and part-of-speech labels, which are considered completely distinct from each other. Recently however, the advent of dense feature representations coupled with deep learning techniques has...
Many applications in surveillance, monitoring, scientific discovery, and data cleaning require the identification of anomalies. Although many methods have been developed to identify statistically significant anomalies, a more difficult task is to identify anomalies that are both interesting and statistically significant. Category detection is an emerging area of machine learning...
Simultaneous speech translation (SimulST) is widely useful in many cross-lingual communication scenarios, including multinational conferences and international traveling. Since text-based simultaneous machine translation (SimulMT) has achieved great success in recent years, the conventional cascaded approach for SimulST uses a pipeline of streaming ASR followed by simultaneous MT but suffers from...
Machine Translation, the task of automatically translating between human languages, has been studied for decades. This task used to be solved by count-based statistical models, e.g. Phrase-based Statistical Machine Translation (PBSMT), which solves the translation problem by separately training a statistical language model and a translation model. Recently, Neural...
There has been tremendous growth in using data analytics and machine learning algorithms to make critical decisions, such as in the national power grid, healthcare operations, and autonomous vehicles. Employing data analytics for decision-making allows cyber attackers to manipulate the decisions of these algorithms through data falsification. Hence, the trustworthiness...
Compositional data is a type of data where the features are non-negative and always sum to a constant. This type of data is frequently encountered in many fields such as microbiology, geology, economics and natural language processing. Compositional data has unique statistical properties that make it difficult to apply standard...
We describe a series of novel computational models, CERENKOV (Computational Elucidation of the REgulatory NonKOding Variome) and its successors CERENKOV2, CERENKOV3, and Convolutional CERENKOV3, for discriminating regulatory single nucleotide polymorphisms (rSNPs) from non-regulatory SNPs within non-coding genetic loci. The CERENKOV models are designed for recognizing rSNPs in the context of...
As robots become more relevant to our lives, they still have a hard time accomplishing simple tasks such as picking and lifting. Problems that include environmental constraints, pose uncertainties and hardware noise prevent robots from grasping an object successfully from a perceivable environment. Many have looked into finding best...
RNA structure prediction is a challenging problem, especially with pseudoknots. Recently, there has been a shift from the classical minimum free energy-based methods (MFE) to partition function-based ones that assemble structures based on base-pairing probabilities. Two typical examples of the latter group are the popular maximum expected accuracy (MEA) method...
This thesis studies the problem of structured prediction (SP), where the agent needs to predict a structured output for a given structured input (e.g., Part-of-Speech tagging sequence for an input sentence). Many important applications including machine translation in natural language processing (NLP) and image interpretation in computer vision can be...
Narratives are central to communication and the human experience. For a computer system to understand a narrative, it must be able to identify the key facts or plot elements that describe what happened or how the world has changed. These elements are called events; establishing a document’s events and the relationships...
Heatmap regression has become one of the mainstream approaches to localize facial landmarks. As Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming popular in solving computer vision tasks, extensive research has been done on these architectures. However, the loss function for heatmap regression is rarely studied. In...
We consider multiple Compressive Sensing (CS) problems wherein the supports of signal vectors of CS problems are restricted to satisfy a collection of joint logical constraints, which we refer to as coupling constraints. We consider a case where the coupling constraints are encoded in a graph and present a sequential...
This dissertation addresses the problem of video labeling at both the frame and pixel levels using deep learning. For pixel-level video labeling, we have studied two problems: i) Spatiotemporal video segmentation and ii) Boundary detection and boundary flow estimation. For the problem of spatiotemporal video segmentation, we have developed recurrent...
An important impact of the genome technology revolution will be the elucidation of mechanisms of cancer pathogenesis, leading to improvements in the diagnosis of cancer and the selection of cancer treatment. Integrated with current well-studied massive knowledge and findings about the role of protein-coding mutations in cancer, demystifying the functional...
Social media sources such as Twitter represent a massively distributed social sensor over diverse topics ranging from social and political events to entertainment and sports news. However, due to the overwhelming volume of content, it can be difficult to identify novel and significant content within a broad topic in a...
The Focused Ion Beam (FIB) tool is a versatile instrument for nano-machining in
circuit editing. Circuit editing is one of the most important steps in the design of an
electronic circuit on a chip. Circuit editing can be improved by imaging of silicon
plates and analyzing the resultant images. However...
Object categorization is one of the fundamental topics in computer vision research. Most current work in object categorization aims to discriminate among generic object classes with gross differences. However, many applications require much finer distinctions. This thesis focuses on the design, evaluation and analysis of learning algorithms for fine-grained...
Data can be represented in multiple views. Traditional multi-view learning methods (e.g., co-training, multi-task learning) focus on improving learning performance using information from the auxiliary view, although information from the target view is sufficient for the learning task. However, this work addresses a semi-supervised case of multi-view learning, the surrogate supervision...
Probabilistic models have been successfully applied to a wide variety of problems, such as but not limited to information retrieval, computer vision, bio-informatics and speech processing. Probabilistic models allow us to encode our assumptions about the data in an elegant fashion and enable us to perform machine learning tasks such...
Montane meadows comprise a small area of the predominantly forested landscape
of the Oregon Cascade Range. Tree encroachment in the last century in these areas has
threatened a loss of biodiversity and habitat. Climate change in the coming century may
accelerate tree encroachment into meadows, and exacerbate biodiversity loss. Multiple...
Recently, delta-sigma modulation has become a widely applied technique for high-performance analog-to-digital conversion of narrow-band signals. Most of the early designs used discrete-time structure for good accuracy and good linearity. The transfer functions are independent of the clock frequency. However, high unity-gain bandwidths of the opamps are required to satisfy...
Markov Decision Processes (MDPs) are the de-facto formalism for studying sequential decision making problems with uncertainty, ranging from classical problems such as inventory control and path planning, to more complex problems such as reservoir control under rainfall uncertainty and emergency response optimization for fire and medical emergencies. Most prior research...
Machine learning systems are generally trained offline using ground truth data that has been labeled by experts. However, these batch training methods are not a good fit for many applications, especially in the cases where complete ground truth data is not available for offline training. In addition, batch methods do...
Citizen Science is a paradigm in which volunteers from the general public participate in scientific studies, often by performing data collection. This paradigm is especially useful if the scope of the study is too broad to be performed by a limited number of trained scientists. Although citizen scientists can contribute...
Sequential supervised learning problems arise in many real applications. This dissertation focuses on two important research directions in sequential supervised learning: efficient training and feature induction.
In the direction of efficient training, we study the training of conditional random fields (CRFs), which provide a flexible and powerful model for sequential...
This paper addresses the high model complexity and overconfident frame labeling of state-of-the-art (SOTA) action segmenters. Their complexity is typically justified by the need to sequentially refine action segmentation through multiple stages of a deep architecture. However, this multistage refinement does not take into account uncertainty of frame labeling predicted...
The focus of this thesis is to design, characterize, and apply novel computational methods and molecular systems to interrogate heterogeneous human gut microbiome-related phenomena. In Chapter 2, I design, implement, and characterize a method for embedding co-occurrence patterns derived from massive 16S amplicon datasets. I use this method to 1....
Iterative algorithms are simple yet efficient in solving large-scale optimization problems in practice. With a surge in the amount of data in past decades, these methods have become increasingly important in many application areas including matrix/tensor recovery, deep learning, data mining, and reinforcement learning. To optimize or improve iterative algorithms,...
Papers proposing novel machine learning algorithms tend to present the algorithm or technique in question in the best possible light. The standard practice is generally for authors to emphasize their proposed algorithms' performance in the precise setting where it is maximally impressive, often by only fully evaluating their best known...
The primary goal of this dissertation is to improve the quality of nuclear data available to the nuclear science community. We propose to accomplish this by applying machine learning algorithms to the large number of available benchmark experiments and simulations, with the goal of determining which nuclear data have strong...
Hand detection is a fundamental step for many hand-related computer vision tasks, such as gesture recognition, hand pose estimation, hand sign language translation, and so on. However, robustly detecting hands is a challenging task because of drastic changes in appearance based on finger articulation and changes in lighting conditions, camera...
Deep learning is becoming the latest trend in sensitive applications, such as healthcare, criminal justice, and finance. As these new applications emerge, adversaries are circumventing them.
Further, there have been concerns about the possibility of bias and discrimination in predictive applications.
In order to address these issues, we propose an...
Ecological domains seeking to understand the environment and the behavior of species have received little attention in machine learning (ML), despite the fact that environmental changes have a significant impact on humans as well as ecosystems. Some ecological problems can be formulated similarly to other common ML applications, but there...
In weak supervision learning, label information can be provided at different levels of granularity. For example, in multi-instance multi-label learning, samples are organized into bags and labels for each class are provided at the bag level. For small datasets, this approach offers means of reducing the labeling efforts. However, in...
The emergence of highly accurate Convolutional Neural Networks (CNNs) with the capability to process large datasets has led to their popularity in many applications, including safety- and security-sensitive ones (e.g., disease recognition, self-driving cars). Despite the high accuracy of convolutional neural networks, they have been found to be susceptible to adversarial noise added to...
Uses for materials with a large surface area and high porosity have grown significantly in recent years. Porous materials have found usage in applications such as separation, gas storage, sensing, purification and more, prompting researchers to find and discover numerous new porous materials to suit a specific purpose. Hundreds of...
Simultaneous translation, which translates concurrently with the source language speech, is widely used in many scenarios including multilateral organizations. However, it is well known to be one of the most challenging tasks for humans due to the simultaneous perception and production in two languages. On the other hand, simultaneous translation...
In supervised learning, label information can be provided at different levels of granularity. For small datasets, it is possible to acquire a label for each data instance. However, in the big-data regime, this fine granularity approach is prohibitively costly. For example, in semi-supervised learning, only a limited number of samples...
This dissertation addresses the problem of semantic labeling of image pixels. In the course of our work, we considered different types of semantic labels, including object classes (e.g., car, person), 3D depth values (in the range 0 to 80 meters), and affordance classes (e.g., walkable, sittable). Semantic pixel labeling is...
Intermodal freight transportation uses at least two different transportation modes (e.g., truck, rail, ship, air) to move freight loads that are in the same transportation unit (e.g., a shipping container) from origin to destination without handling the goods themselves. The increasing shift to intermodal transportation and the growth of freight...
Maintaining the sustainability of the earth’s ecosystems has attracted much attention as these ecosystems are facing more and more pressure from human activities. Machine learning can play an important role in promoting sustainability as a large amount of data is being collected from ecosystems. There are at least three important...
Software testing is of critical importance for the success of software projects. Current inefficient testing methods often still take up half or more of a software project's budget. Automatic test data generation is the most promising way to lower the software testing cost. Manually creating testing data is expensive and...
Many problems in ecology and conservation biology can be formulated and solved using machine learning algorithms for multi-label classification. This dissertation addresses three topics related to predicting the distributions of multiple species. It improves existing methods and proposes a new modeling paradigm to address the multi-species, multi-label problem. The first...
This thesis considers the problem in which a teacher is interested in teaching action policies to computer agents for sequential decision making. The vast majority of policy
learning algorithms offer teachers little flexibility in how policies are taught. In particular,
one of two learning modes is typically considered: 1)...
Worst-case analysis is often meaningless in practice. Some problems never reach the anticipated worst-case complexity. Other solutions get bogged down with impractical constants during implementation, despite having favorable asymptotic running times. In this thesis, we investigate these contrasts in the context of finding maximum flows in planar digraphs. We suggest...
This dissertation constitutes a multi-scale quantitative and qualitative investigation of patterns of urban development in metropolitan regions of the United States. This work has generated a comprehensive data set on spatial patterns of metropolitan development in the U.S. and an approach to the study of such patterns that can be...
The thermophilic cyanobacterium Thermosynechococcus elongatus was examined for the ability to sequester CO₂ while producing hydrogen (H₂), polyhydroxybutyrate (PHB), lipids, and glycogen. H₂ was produced at a maximum rate of 188 nmol H₂ mg Chl a⁻¹ hr⁻¹. Hydrogen production occurred in the presence of methyl viologen but the cells were...
Linear transformation for dimension reduction is a well established problem in the field of machine learning. Due to the numerous observability of parameters and data, processing of the data in its raw form is computationally complex and difficult to visualize. Dimension reduction by means of feature extraction offers a strong...
Automated recognition of object categories in images is a critical step for many real-world computer vision applications. Interest region detectors and region descriptors have been widely employed to tackle the variability of objects in pose, scale, lighting, texture, color, and so on. Different types of object recognition problems usually require...
Machine learning (ML) and deep learning (DL) models impact our daily lives with applications in natural language modeling, image analysis, healthcare, genomics, and bioinformatics. The exponential growth of biological sequence data necessitates accompanying advances in computational methods. Although deep learning is highly effective for detecting and classifying biological sequences, challenges...
Currently, forecasts produced by the Oregon-Washington (OR-WA) Coastal Ocean Forecast System are constrained by assimilation of only surface observations. The 4-dimensional variational (4DVAR) data assimilation (DA) algorithm is utilized to combine the model and the data, with the time-independent forecast ("background") error covariance B. In this study, two possible improvements...
Detection of illicit drug residues from wastewater provides a new route toward community level assessment of drug abuse that is critical to public health. However, traditional chemistry analytical tools such as liquid chromatography in tandem with mass spectrometry cannot meet the large-scale testing requirement in terms of cost, throughput, and...
Networks of distributed, remote sensors are providing ecological scientists with a view of our environment that is unprecedented in detail. However, these networks are subject to harsh conditions, which lead to malfunctions in individual sensors and failures in network communications. This behavior manifests as corrupt or missing measurements in the...
Remote sensors are becoming the standard for observing and recording ecological data in the field. Such sensors can record data at fine temporal resolutions, and they can operate under extreme conditions prohibitive to human access. Unfortunately, sensor data streams exhibit many kinds of errors ranging from corrupt communications to partial...