Model-based Clustering Methods in Exploratory Analysis of RNA-Seq Experiments

Zhang, Wanli

Graduate Thesis Or Dissertation

Public Deposited

Citeable URL: https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/rb68xh750

Descriptions

Attribute Name	Values
Creator	Zhang, Wanli
Abstract	Differential expression (DE) analysis identifies genes that respond differently under varying experimental conditions, and thereby helps us understand the molecular basis of phenotypic variation. After identifying significantly differentially expressed genes, researchers often want to uncover hidden patterns or groups in the expression data to gain insight into the genes' potential functions. A first step towards this goal is exploratory analysis with data mining tools, including cluster analysis and efficient data visualization. In this dissertation, we develop statistical methods for the exploration and visualization of RNA-Seq gene expression data. The dissertation comprises two separate but thematically consistent studies: a new model-based clustering method for observations with errors, and a visualization tool for high-dimensional data. Model-based clustering with finite mixture models has become a widely used clustering method. One of the recent implementations is MCLUST. When the objects to be clustered are summary statistics such as regression coefficient estimates, they are naturally associated with estimation errors, which can often be calculated exactly or approximated using asymptotic theory. This work proposes an extension to Gaussian finite mixture modeling, called MCLUST-ME, that properly accounts for the estimation errors. More specifically, we assume that the distribution of each observation consists of an underlying true component distribution and an independent measurement error distribution. Under this assumption, each unique value of the estimation error covariance corresponds to its own classification boundary, which consequently results in a grouping different from that of MCLUST.
Through simulation and application to an RNA-Seq data set, we found that, under certain circumstances, explicitly modeling estimation errors improves clustering accuracy compared with simply ignoring the errors, and that the degree of improvement depends on factors such as the distribution of the error covariance matrices. The accumulation of RNA-Seq gene expression data in recent years has resulted in large, complex, high-dimensional data sets. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by the data's high dimensionality. The scatterplot matrix is a commonly used tool for visualizing multivariate data, and allows us to view multiple bivariate relationships simultaneously. However, the scatterplot matrix becomes less effective for high-dimensional data because the number of bivariate displays increases quadratically with data dimensionality. In this study, we introduce a selection criterion for each bivariate scatterplot and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which the separation between two pre-defined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to pinpoint the visualization angles where genes from two biological pathways are the most separated, as well as identify potential outliers.
License	All rights reserved
Resource Type	Dissertation
Date Issued	2017-12-05
Degree Level	Doctoral
Degree Name	Doctor of Philosophy (Ph.D.)
Degree Field	Statistics
Degree Grantor	Oregon State University
Commencement Year	2018
Advisor	Di, Yanming
Committee Member	Sharpton, Thomas; Chang, Jeff; Jiang, Duo; Emerson, Sarah
Academic Affiliation	Statistics
Rights Statement	In Copyright
Publisher	Oregon State University
Peer Reviewed	No
Language	English [eng]
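
The measurement-error idea described in the abstract can be illustrated with a small sketch. It is not the dissertation's implementation: it assumes a one-dimensional setting, a zero-mean Gaussian error that is independent of the true value, and already-known mixture parameters. Under those assumptions an observation from component k with error variance s has total variance (component variance + s), so the membership probabilities, and hence the classification boundary, depend on each observation's own error variance. All function names and numeric values below are hypothetical.

```python
import math


def normal_logpdf(y, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def membership_probabilities(y, err_var, weights, means, variances):
    """Posterior component probabilities for one observation.

    Assumption (not taken from the thesis text): the error is zero-mean
    Gaussian, so each component's variance is inflated by err_var.
    """
    log_terms = [
        math.log(w) + normal_logpdf(y, m, v + err_var)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def classify(y, err_var, weights, means, variances):
    """Index of the component with the highest posterior probability."""
    probs = membership_probabilities(y, err_var, weights, means, variances)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical two-component mixture with unequal variances.
WEIGHTS = [0.5, 0.5]
MEANS = [0.0, 3.0]
VARIANCES = [1.0, 4.0]

if __name__ == "__main__":
    for err_var in (0.0, 4.0):
        label = classify(1.5, err_var, WEIGHTS, MEANS, VARIANCES)
        print(f"y=1.5, error variance={err_var}: component {label}")
```

With these illustrative parameters, the same observed value 1.5 is assigned to the second component when its error variance is 0 and to the first component when its error variance is 4, which is the sense in which each error variance has its own classification boundary.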

Relationships

Parents:

This work has no parents.

In Collection:

Graduate Theses and Dissertations (GTD)

Items

Thumbnail	Title	Date Uploaded	Visibility	Actions
	WanliZhang2017.pdf	2017-12-21	Public	Download
