Graduate Thesis Or Dissertation

 

Higher-level Analysis of RNA-Seq Experiment: Multiple Data Sets and Multiple Genes

https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/jm214r66g

Abstract
  • Differential expression (DE) analysis is a key task in gene expression studies because it uncovers the association between the expression levels of a gene and the covariates of interest. This dissertation addresses two particular aspects of DE analysis: identifying stably expressed genes for count normalization, and accounting for correlation between DE test statistics in gene-set tests. RNA-Sequencing (RNA-Seq) has become the tool of choice for measuring gene expression over the past few years, and data generated from RNA-Seq experiments are the focus of this thesis. Identifying stably expressed genes is useful for count normalization and DE analysis. We examined RNA-Seq data on 211 biological samples from 24 different experiments conducted by different labs, and identified genes that are stably expressed across samples, treatment conditions, and experiments. We fit a Poisson log-linear mixed-effect model to the count data and decomposed the total variance into between-sample, between-treatment, and between-experiment variance components. The variance component analysis that we explore here is a first step towards understanding the sources and nature of RNA-Seq count variation. The stability ranking of genes, when quantified by a numerical stability measure, depends on several factors: the background sample set and the reference gene set used for count normalization, the technology used to measure gene expression, and the specific stability measure. Since DE is measured by relative frequencies, we argue that DE is a relative concept. We advocate using an explicit reference gene set for count normalization to improve the interpretability of DE results, and recommend using a common reference gene set when analyzing multiple RNA-Seq experiments to avoid potentially inconsistent conclusions. We also investigate the relationship between the correlation among test statistics and the correlation of the underlying observed data.
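The claim that DE is a relative concept can be illustrated with a small sketch. The code below is not the normalization procedure used in the thesis; it assumes, for illustration only, that the size factor of a sample is simply the total count over a chosen reference gene set. The function names and the toy count matrix are hypothetical.

```python
import numpy as np

def normalize_by_reference(counts, ref_idx):
    """Divide each sample's counts by its total count over the reference genes.

    counts: array of shape (genes, samples); ref_idx: row indices of the
    reference gene set. This size factor is an illustrative assumption.
    """
    counts = np.asarray(counts, dtype=float)
    ref_totals = counts[ref_idx, :].sum(axis=0)
    return counts / ref_totals

def log2_fold_change(counts, ref_idx, group_a, group_b):
    """log2 ratio of mean relative frequency, group B over group A, per gene."""
    rel = normalize_by_reference(counts, ref_idx)
    return np.log2(rel[:, group_b].mean(axis=1) / rel[:, group_a].mean(axis=1))

# Made-up toy counts: 4 genes (rows) x 4 samples (columns).
# Samples 0 and 1 form group A, samples 2 and 3 form group B.
toy = np.array([
    [10, 10, 20, 20],
    [50, 50, 50, 50],
    [100, 100, 100, 100],
    [100, 100, 200, 200],
])

lfc_ref2 = log2_fold_change(toy, [2], [0, 1], [2, 3])
lfc_ref3 = log2_fold_change(toy, [3], [0, 1], [2, 3])
print(lfc_ref2[0])  # 1.0: gene 0 appears doubled relative to gene 2
print(lfc_ref3[0])  # 0.0: gene 0 appears unchanged relative to gene 3
```

The same gene is called differentially expressed under one reference set and not under the other, which is why an explicit, shared reference set makes results from different experiments comparable.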
In false discovery rate (FDR) control procedures and gene-set tests, DE test statistics are often pooled, so the correlation among them needs to be taken into account. The sample correlation of the observed data is often used to approximate the test-statistic correlation. We show, however, that such an approximation is valid only under limited settings. In particular, we derive a formula for the correlation between test statistics when they take a specific form, and as a special case, we present the exact expression of the test-statistic correlation for the equal-variance two-sample t-test statistic under a bivariate normal assumption. We conclude that, in the context of the equal-variance two-sample t-test, the test-statistic correlation is weaker than the correlation of the underlying (normally distributed) observed data. The competitive gene-set test is a widely used tool for interpreting high-throughput biological data, such as gene expression and proteomics data. It tests categories of genes for enriched association signals in a list of genes inferred from genome-wide data. Most conventional enrichment testing methods ignore or do not properly account for the widespread correlations among genes, which, as we show, can result in inflated type I error rates, power loss, or both. We propose a new framework, MEACA, for gene-set testing based on a mixed-effects quasi-likelihood model, in which the data are not required to be Gaussian. Our method effectively adjusts for completely unknown, unstructured correlations among genes. It uses a score test approach and allows for analytical assessment of p-values. Compared to existing methods such as GSEA and CAMERA, our method offers robust and substantially improved control over type I error and maintains good power across a variety of correlation structures and association settings. We also present two real data analyses to illustrate our approach.
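The statement that test-statistic correlation is weaker than data correlation for the equal-variance two-sample t-test can be checked numerically. The sketch below is a Monte Carlo illustration, not the exact expression derived in the thesis. It assumes two genes with unit variances, bivariate normal data with correlation `rho`, and no mean difference between the two groups; the function name, sample sizes, and seed are hypothetical choices.

```python
import numpy as np

def simulate_t_correlation(rho, n_per_group=4, n_rep=100_000, seed=0):
    """Estimate the correlation between the two genes' pooled-variance
    t-statistics when the genes' observations have correlation rho.

    Assumes the null case: both groups share the same mean for both genes.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # Shape: (replicate, group, sample within group, gene)
    data = rng.multivariate_normal(np.zeros(2), cov,
                                   size=(n_rep, 2, n_per_group))
    means = data.mean(axis=2)                            # (rep, group, gene)
    ss = ((data - means[:, :, None, :]) ** 2).sum(axis=2)
    pooled_var = ss.sum(axis=1) / (2 * n_per_group - 2)  # (rep, gene)
    t = (means[:, 0, :] - means[:, 1, :]) / np.sqrt(pooled_var * 2.0 / n_per_group)
    return np.corrcoef(t[:, 0], t[:, 1])[0, 1]

rho = 0.5
t_corr = simulate_t_correlation(rho)
print(f"data correlation: {rho}, estimated t-statistic correlation: {t_corr:.3f}")
```

Under these assumptions the numerators of the two t-statistics have correlation `rho`, while the separately estimated pooled standard deviations in the denominators dilute it, so the estimate should fall below `rho`. The gap shrinks as `n_per_group` grows.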