Statistical analysis of RNA sequencing count data

Mi, Gu

Graduate Thesis Or Dissertation

Public Deposited

Citeable URL: https://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/2n49t651w

Descriptions

Attribute Name	Values
Creator	Mi, Gu
Abstract	RNA-Sequencing (RNA-Seq) has rapidly become the de facto technique in transcriptome studies. However, established statistical methods for analyzing experimental and observational microarray studies need to be revised or completely re-invented to accommodate RNA-Seq data's unique characteristics. In this dissertation, we focus on statistical analyses performed at two particular stages in the RNA-Seq pipeline, namely, regression analysis of gene expression levels, including tests for differential expression (DE), and the downstream Gene Ontology (GO) enrichment analysis. The negative binomial (NB) distribution has been widely adopted to model RNA-Seq read counts for its flexibility in accounting for any extra-Poisson variability. Because of the relatively small number of samples in a typical RNA-Seq experiment, power-saving strategies include assuming some commonalities of the NB dispersion parameters across genes, via simple models relating them to mean expression rates. Many such NB dispersion models have been proposed, but there is limited research on evaluating model adequacy. We propose a simulation-based goodness-of-fit (GOF) test with diagnostic graphics to assess the NB assumption for a single gene via parametric bootstrap and empirical probability plots, and assess the adequacy of NB dispersion models by combining individual GOF test p-values from all genes. Our simulation studies and real data analyses suggest the NB assumption is valid for modeling a gene's read counts, and provide evidence on how NB dispersion models differ in capturing the variation in the dispersion. It is not well understood to what degree a dispersion-modeling approach can still be useful when a fitted dispersion model captures a significant part, but not all, of the variation in the dispersion. As a further step towards understanding the power-robustness trade-offs of NB dispersion models, we propose a simple statistic to quantify the inadequacy of a fitted NB dispersion model. Subsequent power-robustness analyses are guided by this estimated residual dispersion variation and other controlling factors estimated from real RNA-Seq datasets. The proposed measure for quantifying residual dispersion variation indicates whether we can gain statistical power by a dispersion-modeling approach. Our real-data-based simulations also provide benchmarking investigations into the power and robustness properties of the many NB dispersion methods in the current RNA-Seq community. For statistical tests of enriched GO categories, which aim to relate the outcome of DE analysis to biological functions, the transcript length becomes a confounding factor as it correlates with both the GO membership and the significance of the DE test. We propose to adjust for such bias using logistic regression, incorporating the transcript length as a covariate. The use of continuous measures of differential expression via transformations of DE test p-values also avoids the subjective specification of a p-value threshold adopted by contingency-table-based approaches. Simulation and real data examples indicate that enriched categories no longer favor longer transcripts after the adjustment, which supports the effectiveness of our proposed method.
License	All rights reserved
Resource Type	Dissertation
Date Available	2015-06-24T08:00:09+00:00
Date Issued	2014-06-10
Degree Level	Doctoral
Degree Name	Doctor of Philosophy (Ph.D.)
Degree Field	Statistics
Degree Grantor	Oregon State University
Commencement Year	2015
Advisor	Di, Yanming; Schafer, Daniel W.
Committee Member	Emerson, Sarah C.; Chang, Jeff H.; Jiang, Yuan
Academic Affiliation	Statistics
Non-Academic Affiliation	Oregon State University. Graduate School
Subject	Nucleotide sequence -- Data processing; Nucleotide sequence -- Statistical methods; Sequence alignment (Bioinformatics) -- Statistical methods; RNA -- Data processing
Rights Statement	In Copyright
Publisher	Oregon State University
Peer Reviewed	No
Language	English [eng]
Replaces	http://hdl.handle.net/1957/49422
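
Illustration (not part of the deposited record): the length adjustment described in the abstract can be sketched as a logistic regression fitted with and without transcript length. This is only a minimal sketch under stated assumptions, not the dissertation's implementation. It assumes the response is GO category membership and the predictors are a continuous DE score and log transcript length; the abstract does not spell out this parameterization. The simulated data, the helper names `solve` and `fit_logistic`, and all numeric settings are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        # Partial pivoting: move the largest remaining entry of column c up.
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression; rows of X include a 1 for the intercept."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for a in range(k):
                grad[a] += (yi - mu) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Hypothetical simulated genes: category membership depends on length only,
# and the DE score is inflated by length, so length confounds the two.
rng = random.Random(2014)
n_genes = 4000
X_unadj, X_adj, member = [], [], []
for _ in range(n_genes):
    log_len = rng.gauss(0.0, 1.0)                   # centered log transcript length
    de_score = 0.8 * log_len + rng.gauss(0.0, 1.0)  # e.g. a transformed DE p-value
    p_member = 1.0 / (1.0 + math.exp(-(-1.0 + log_len)))
    member.append(1 if rng.random() < p_member else 0)
    X_unadj.append([1.0, de_score])
    X_adj.append([1.0, de_score, log_len])

beta_unadj = fit_logistic(X_unadj, member)  # intercept, DE score
beta_adj = fit_logistic(X_adj, member)      # intercept, DE score, log length
print("DE-score coefficient without length covariate:", round(beta_unadj[1], 3))
print("DE-score coefficient with length covariate:   ", round(beta_adj[1], 3))
```

In this simulation the category has no true association with differential expression, so a clearly positive DE-score coefficient in the unadjusted fit reflects length bias alone, and the coefficient should shrink toward zero once log length enters the model. A continuous DE score is used here instead of a thresholded indicator, in line with the abstract's point about avoiding a subjective p-value cutoff.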

Relationships

Parents:

This work has no parents.

In Collection:

Graduate Theses and Dissertations (GTD)

Items

Thumbnail	Title	Date Uploaded	Visibility	Actions
	MiGu2014.pdf	2017-10-27	Public	Download
