Model-based Clustering Methods in Exploratory Analysis of RNA-Seq Experiments


Abstract or Summary
  • Differential expression (DE) analysis identifies genes that respond differently under varying experimental conditions, giving us insight into the molecular basis of phenotypic variation. After significantly differentially expressed genes have been identified, researchers often want to uncover hidden patterns or groups in the expression data to gain insight into the genes' potential functions. A first step toward this goal is exploratory analysis using data mining tools, including cluster analysis and efficient data visualization. In this dissertation, we develop statistical methods for the exploration and visualization of RNA-Seq gene expression data. The dissertation comprises two separate but thematically consistent studies: a new model-based clustering method for observations with errors, and a visualization tool for high-dimensional data.

Model-based clustering with finite mixture models has become a widely used clustering method, and MCLUST is one of its recent implementations. When the objects to be clustered are summary statistics, such as regression coefficient estimates, they are naturally associated with estimation errors, which can often be calculated exactly or approximated using asymptotic theory. The first study proposes an extension to Gaussian finite mixture modeling, called MCLUST-ME, that properly accounts for these estimation errors. More specifically, we assume that the distribution of each observation consists of an underlying true component distribution and an independent measurement error distribution. Under this assumption, each distinct value of the estimation error covariance corresponds to its own classification boundary, which results in a grouping different from that of MCLUST.
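The measurement error assumption above can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that the error is Gaussian with a known covariance for each observation, so that an observation from component k has covariance equal to the component covariance plus its own error covariance. The function names and the one-dimensional toy parameters are hypothetical and are not taken from the dissertation or from the MCLUST software.

```python
import numpy as np

def gaussian_logpdf(y, mean, cov):
    """Log density of a multivariate normal N(mean, cov) evaluated at y."""
    d = y.shape[0]
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def membership_probs(y, err_cov, weights, means, covs):
    """Posterior component probabilities for one observation y.

    Assumption: y = truth + error, with truth ~ N(mean_k, cov_k) and
    error ~ N(0, err_cov) independent, so y | k ~ N(mean_k, cov_k + err_cov).
    """
    log_post = np.array([
        np.log(w) + gaussian_logpdf(y, m, c + err_cov)
        for w, m, c in zip(weights, means, covs)
    ])
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical toy mixture: a tight component at 0 and a wide one at 3.
weights = [0.5, 0.5]
means = [np.array([0.0]), np.array([3.0])]
covs = [np.array([[0.25]]), np.array([[4.0]])]
y = np.array([1.2])

# Same observed value, two different (known) error variances.
p_exact = membership_probs(y, np.zeros((1, 1)), weights, means, covs)
p_noisy = membership_probs(y, np.array([[4.0]]), weights, means, covs)
print(p_exact.argmax(), p_noisy.argmax())
```

In this toy setting the same observed value is assigned to the wide component when it is measured without error and to the tight component when its error variance is large, which is one way to see why each error covariance has its own classification boundary.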
Through simulation and application to an RNA-Seq data set, we found that, under certain circumstances, explicitly modeling estimation errors improves clustering accuracy compared with simply ignoring them, and that the degree of improvement depends on factors such as the distribution of the error covariance matrices.

The accumulation of RNA-Seq gene expression data in recent years has produced large, complex, high-dimensional data sets. Exploratory analysis, including data mining and visualization, reveals hidden patterns and potential outliers in such data, but is often challenged by their high dimensionality. The scatterplot matrix is a commonly used tool for visualizing multivariate data, and allows us to view multiple bivariate relationships simultaneously. However, it becomes less effective for high-dimensional data because the number of bivariate displays increases quadratically with the data dimensionality. In the second study, we introduce a selection criterion for bivariate scatterplots and design and implement an algorithm that automatically scans and ranks all possible scatterplots, with the goal of identifying the plots in which the separation between two pre-defined groups is maximized. By applying our method to a multi-experiment Arabidopsis RNA-Seq data set, we were able to pinpoint the visualization angles at which genes from two biological pathways are most separated, and to identify potential outliers.
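The scan-and-rank idea can be sketched as follows. The abstract does not state the selection criterion, so this example substitutes a stand-in: the squared Mahalanobis distance between the two group means under the pooled within-group covariance of each pair of variables. The function names and the simulated data are hypothetical and only illustrate the overall procedure.

```python
from itertools import combinations

import numpy as np

def separation_score(xa, xb):
    """Stand-in criterion (an assumption, not the dissertation's):
    squared Mahalanobis distance between the two group means,
    using the pooled within-group covariance of the two variables."""
    diff = xa.mean(axis=0) - xb.mean(axis=0)
    na, nb = len(xa), len(xb)
    pooled = ((na - 1) * np.cov(xa, rowvar=False)
              + (nb - 1) * np.cov(xb, rowvar=False)) / (na + nb - 2)
    return float(diff @ np.linalg.solve(pooled, diff))

def rank_scatterplots(data, labels):
    """Score every pair of columns and return (score, (i, j)) tuples,
    best-separated pair first. Labels must be 0 or 1."""
    group_a = data[labels == 0]
    group_b = data[labels == 1]
    scores = []
    for i, j in combinations(range(data.shape[1]), 2):
        cols = [i, j]
        score = separation_score(group_a[:, cols], group_b[:, cols])
        scores.append((score, (i, j)))
    return sorted(scores, reverse=True)

# Simulated example: four variables, two groups that differ only in
# variables 1 and 3.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
labels = np.repeat([0, 1], 50)
data[labels == 1, 1] += 4.0
data[labels == 1, 3] += 4.0

ranking = rank_scatterplots(data, labels)
for score, pair in ranking:
    print(pair, round(score, 2))
```

With p variables the loop visits p(p - 1)/2 pairs, which is the quadratic growth mentioned above; ranking the pairs lets an analyst inspect only the top few scatterplots instead of the full matrix.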


