RNA-Sequencing (RNA-Seq) has rapidly become the de facto standard technique in transcriptome studies. However, established statistical methods for analyzing experimental and observational microarray studies need to be revised or completely re-invented to accommodate the unique characteristics of RNA-Seq data. In this dissertation, we focus on statistical analyses performed at two particular stages in the RNA-Seq pipeline, namely, regression analysis of gene expression levels, including tests for differential expression (DE), and the downstream Gene Ontology (GO) enrichment analysis.
The negative binomial (NB) distribution has been widely adopted to model RNA-Seq read counts for its flexibility in accounting for extra-Poisson variability. Because of the relatively small number of samples in a typical RNA-Seq experiment, a common power-saving strategy is to assume some commonality among the NB dispersion parameters across genes, via simple models relating them to mean expression rates. Many such NB dispersion models have been proposed, but there is limited research on evaluating model adequacy. We propose a simulation-based goodness-of-fit (GOF) test with diagnostic graphics to assess the NB assumption for a single gene via parametric bootstrap and empirical probability plots, and to assess the adequacy of NB dispersion models by combining individual GOF test p-values from all genes. Our simulation studies and real data analyses suggest the NB assumption is valid for modeling a gene's read counts, and provide evidence on how NB dispersion models differ in capturing the variation in the dispersion.
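The parametric-bootstrap GOF idea above can be sketched as follows. The test statistic used here, a Pearson-type sum of squared standardized residuals under the NB mean-variance relation Var = mu + phi*mu^2, is an illustrative assumption; the dissertation's actual statistic and fitting procedure may differ (a method-of-moments fit is used below for simplicity).

```python
import numpy as np


def fit_nb(counts):
    """Method-of-moments fit of NB(mu, phi), with Var = mu + phi * mu^2.
    (Illustrative choice; MLE could be used instead.)"""
    mu = counts.mean()
    var = counts.var(ddof=1)
    phi = max((var - mu) / mu**2, 1e-8)  # extra-Poisson variability
    return mu, phi


def nb_gof_pvalue(counts, n_boot=1000, seed=0):
    """Parametric-bootstrap GOF p-value for the NB assumption on one gene.

    Simulate datasets from the fitted NB, refit on each, and compare the
    observed statistic to its bootstrap distribution."""
    rng = np.random.default_rng(seed)
    mu, phi = fit_nb(counts)

    def statistic(y, m, ph):
        # Pearson-type statistic: squared residuals over NB variance
        return np.sum((y - m) ** 2 / (m + ph * m**2))

    t_obs = statistic(counts, mu, phi)
    # numpy parameterization: n = 1/phi successes, p = n / (n + mu),
    # which gives mean mu and variance mu + phi * mu^2
    n = 1.0 / phi
    p = n / (n + mu)
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        y_star = rng.negative_binomial(n, p, size=counts.size)
        mu_s, phi_s = fit_nb(y_star)
        t_boot[b] = statistic(y_star, mu_s, phi_s)
    # upper-tail empirical p-value with the +1 continuity adjustment
    return (1 + np.sum(t_boot >= t_obs)) / (n_boot + 1)
```

Per-gene p-values produced this way could then be combined across genes (e.g., by Fisher's method or by inspecting their empirical distribution against uniformity) to assess the adequacy of a shared dispersion model, as the paragraph above describes.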
It is not well understood to what degree a dispersion-modeling approach can still be useful when a fitted dispersion model captures a significant part, but not all, of the variation in the dispersion. As a further step towards understanding the power-robustness trade-offs of NB dispersion models, we propose a simple statistic to quantify the inadequacy of a fitted NB dispersion model. Subsequent power-robustness analyses are guided by this estimated residual dispersion variation and other controlling factors estimated from real RNA-Seq datasets. The proposed measure of residual dispersion variation indicates whether a dispersion-modeling approach can be expected to gain statistical power. Our real-data-based simulations also provide benchmarking investigations into the power and robustness properties of the many NB dispersion methods currently used in the RNA-Seq community.
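One simple way such residual dispersion variation could be quantified is as the spread of gene-level dispersion estimates around the fitted dispersion-mean trend on the log scale. This is only an illustrative sketch; the dissertation's actual statistic may be defined differently.

```python
import numpy as np


def residual_dispersion_variation(phi_genewise, phi_fitted):
    """Illustrative measure of residual dispersion variation: the sample
    variance of log-ratios between gene-level dispersion estimates and the
    values predicted by a fitted dispersion-mean model. A value near zero
    suggests the model captures most dispersion variation."""
    log_resid = np.log(phi_genewise) - np.log(phi_fitted)
    return log_resid.var(ddof=1)
```

Under this sketch, a small value would hint that borrowing strength through the dispersion model is safe, while a large value would warn that genewise methods may be more robust.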
For statistical tests of enriched GO categories, which aim to relate the outcome of DE analysis to biological functions, transcript length becomes a confounding factor, as it correlates with both GO membership and the significance of the DE test. We propose to adjust for such bias using logistic regression, incorporating transcript length as a covariate. The use of continuous measures of differential expression, via transformations of DE test p-values, also avoids the subjective specification of a p-value threshold required by contingency-table-based approaches. Simulation and real data examples indicate that enriched categories no longer favor longer transcripts after the adjustment, demonstrating the effectiveness of our proposed method.
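The length-adjusted enrichment test above can be sketched as a logistic regression of GO membership on a continuous DE score plus log transcript length, with a Wald test on the DE coefficient. The particular DE-score transform and the IRLS fitting routine below are illustrative assumptions, not the dissertation's exact specification.

```python
import numpy as np
from scipy.stats import norm


def logistic_irls(X, y, n_iter=50, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares
    (Newton-Raphson); returns coefficients and Wald standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        # Newton step: (X' W X)^{-1} X' (y - mu)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    se = np.sqrt(np.diag(np.linalg.inv(X.T @ (w[:, None] * X))))
    return beta, se


def go_enrichment_test(de_score, log_length, in_category):
    """Length-adjusted GO enrichment test (illustrative sketch):
        logit P(gene in category) = b0 + b1*de_score + b2*log_length,
    where de_score is a continuous transform of the DE p-value
    (e.g. -log10 p; an assumed choice). Enrichment corresponds to b1 > 0;
    returns the Wald z-statistic and two-sided p-value for b1."""
    X = np.column_stack([np.ones(len(de_score)), de_score, log_length])
    beta, se = logistic_irls(X, in_category.astype(float))
    z = beta[1] / se[1]
    return z, 2 * norm.sf(abs(z))
```

Because length enters as its own covariate, a category whose genes are merely long (but no more likely to be DE) should no longer produce a significant coefficient on the DE score, which is the adjustment the paragraph above describes.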