Faculty Research Publications (Statistics)
http://hdl.handle.net/1957/29656
2015-07-02T16:40:30Z
The Level of Residual Dispersion Variation and the Power of Differential Expression Tests for RNA-Seq Data
http://hdl.handle.net/1957/55822
Mi, Gu; Di, Yanming
RNA-Sequencing (RNA-Seq) has been widely adopted for quantifying gene expression changes in comparative transcriptome analysis. For detecting differentially expressed genes, a variety of statistical methods based on the negative binomial (NB) distribution have been proposed. These methods differ in the ways they handle the NB nuisance parameters (i.e., the dispersion parameters associated with each gene) to save power, such as by using a dispersion model to exploit an apparent relationship between the dispersion parameter and the NB mean. Presumably, dispersion models with fewer parameters will result in greater power if the models are correct, but will produce misleading conclusions if not. This paper investigates this power and robustness trade-off by assessing rates of identifying true differential expression using the various methods under realistic assumptions about NB dispersion parameters. Our results indicate that the relative performances of the different methods are closely related to the level of dispersion variation unexplained by the dispersion model. We propose a simple statistic to quantify the level of residual dispersion variation from a fitted dispersion model and show that the magnitude of this statistic gives hints about whether and how much we can gain statistical power by a dispersion-modeling approach.
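To make the terms in the abstract above concrete, here is a minimal sketch, not the authors' code. It assumes the common NB parameterization Var = mu + phi * mu^2 and a hypothetical log-linear dispersion trend, log(phi) = alpha0 + alpha1 * log(mu); the coefficient values, the function names, and the use of the standard deviation of log-scale residuals as a measure of "residual dispersion variation" are illustrative assumptions and are not the statistic proposed in the paper.

```python
import math
import random
import statistics


def nb_variance(mu, phi):
    """NB variance under the assumed parameterization Var = mu + phi * mu^2."""
    return mu + phi * mu ** 2


def trend_dispersion(mu, alpha0=-1.0, alpha1=-0.5):
    """Hypothetical dispersion model: log(phi) = alpha0 + alpha1 * log(mu)."""
    return math.exp(alpha0 + alpha1 * math.log(mu))


def simulate_dispersions(means, sigma, seed=0):
    """Per-gene dispersions scattered around the trend on the log scale.

    sigma is the standard deviation of the log-scale residuals, i.e. the
    amount of dispersion variation the trend does not explain.
    """
    rng = random.Random(seed)
    return [trend_dispersion(mu) * math.exp(rng.gauss(0.0, sigma)) for mu in means]


def residual_sd(means, dispersions):
    """Standard deviation of log(phi) minus the log of the trend value."""
    residuals = [
        math.log(phi) - math.log(trend_dispersion(mu))
        for mu, phi in zip(means, dispersions)
    ]
    return statistics.pstdev(residuals)


if __name__ == "__main__":
    means = [10.0 * (i + 1) for i in range(2000)]
    for sigma in (0.0, 0.5, 1.0):
        phis = simulate_dispersions(means, sigma)
        print(sigma, round(residual_sd(means, phis), 3))
```

When sigma is near zero the trend alone describes every gene's dispersion, which is the situation in which a dispersion model with few parameters would be expected to gain power; as sigma grows, the trend becomes a poorer summary of the individual genes.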
This is the publisher’s final pdf. The published article is copyrighted by the author(s) and published by the Public Library of Science. The published article can be found at: http://www.plosone.org/.
2015-04-07T00:00:00Z
Goodness-of-Fit Tests and Model Diagnostics for Negative Binomial Regression of RNA Sequencing Data
http://hdl.handle.net/1957/55722
Mi, Gu; Di, Yanming; Schafer, Daniel W.
This work is about assessing model adequacy for negative binomial (NB) regression, particularly (1) assessing the adequacy of the NB assumption, and (2) assessing the appropriateness of models for NB dispersion parameters. Tools for the first are appropriate for NB regression generally; those for the second are primarily intended for RNA sequencing (RNA-Seq) data analysis. The typically small number of biological samples and large number of genes in RNA-Seq analysis motivate us to address the trade-offs between robustness and statistical power using NB regression models. One widely used power-saving strategy, for example, is to assume some commonalities of NB dispersion parameters across genes via simple models relating them to mean expression rates, and many such models have been proposed. As RNA-Seq analysis is becoming ever more popular, it is appropriate to make more thorough investigations into power and robustness of the resulting methods, and into practical tools for model assessment. In this article, we propose simulation-based statistical tests and diagnostic graphics to address model adequacy. We provide simulated and real data examples to illustrate that our proposed methods are effective for detecting the misspecification of the NB mean-variance relationship as well as judging the adequacy of fit of several NB dispersion models.
This is the publisher’s final pdf. The published article is copyrighted by the author(s) and published by the Public Library of Science. The published article can be found at: http://www.plosone.org/.
2015-03-18T00:00:00Z
Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates
http://hdl.handle.net/1957/50198
Wang, Li; Xue, Lan; Qu, Annie; Liang, Hua
We propose generalized additive partial linear models for complex data which allow one to capture nonlinear patterns of some covariates, in the presence of linear components. The proposed method improves estimation efficiency and increases statistical power for correlated data through incorporating the correlation information. A unique feature of the proposed method is its capability of handling model selection in cases where it is difficult to specify the likelihood function. We derive the quadratic inference function-based estimators for the linear coefficients and the nonparametric functions when the dimension of covariates diverges, and establish asymptotic normality for the linear coefficient estimators and the rates of convergence for the nonparametric function estimators for both finite and high-dimensional cases. The proposed method and theoretical development are quite challenging since the numbers of linear covariates and nonlinear components both increase as the sample size increases. We also propose a doubly penalized procedure for variable selection which can simultaneously identify nonzero linear and nonparametric components, and which has an asymptotic oracle property. Extensive Monte Carlo studies have been conducted and show that the proposed procedure works effectively even with moderate sample sizes. The proposed method is illustrated with a pharmacokinetics study on renal cancer data.
This is the publisher’s final pdf. The published article is copyrighted by the Institute of Mathematical Statistics and can be found at: http://www.imstat.org/aos/.
2014-04-01T00:00:00Z
In defense of P values
http://hdl.handle.net/1957/49298
Murtaugh, Paul A.
Statistical hypothesis testing has been widely criticized by ecologists in recent years. I review some of the more persistent criticisms of P values and argue that most stem from misunderstandings or incorrect interpretations, rather than from intrinsic shortcomings of the P value. I show that P values are intimately linked to confidence intervals and to differences in Akaike’s information criterion (ΔAIC), two metrics that have been advocated as replacements for the P value. The choice of a threshold value of ΔAIC that breaks ties among competing models is as arbitrary as the choice of the probability of a Type I error in hypothesis testing, and several other criticisms of the P value apply equally to ΔAIC. Since P values, confidence intervals, and ΔAIC are based on the same statistical information, all have their places in modern statistical practice. The choice of which to use should be stylistic, dictated by details of the application rather than by dogmatic, a priori considerations.
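The link between P values and ΔAIC mentioned in the abstract above can be sketched for one special case; this is an illustration under stated assumptions, not the article's own derivation or code. Assume two nested models that differ by a single parameter and are compared with a likelihood ratio test, and define ΔAIC as the AIC of the reduced model minus the AIC of the full model. Then the likelihood ratio statistic equals ΔAIC + 2, and the P value is the upper tail of a chi-squared distribution with 1 degree of freedom at that value. The function name is hypothetical.

```python
import math


def p_value_from_delta_aic(delta_aic):
    """P value implied by delta_aic = AIC(reduced) - AIC(full).

    Assumes nested models differing by exactly one parameter, so the
    likelihood ratio statistic is delta_aic + 2 and is referred to a
    chi-squared distribution with 1 degree of freedom.
    """
    lrt = delta_aic + 2.0
    if lrt <= 0.0:
        return 1.0
    # Upper tail of chi-squared(1): P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))


if __name__ == "__main__":
    for delta in (0.0, 2.0, 4.0, 7.0):
        print(delta, round(p_value_from_delta_aic(delta), 4))
```

Under these assumptions a ΔAIC of 2 corresponds to a P value of roughly 0.046, so a cutoff on one scale maps directly to a cutoff on the other, which is consistent with the abstract's point that both thresholds are equally arbitrary.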
This is the publisher’s final pdf. The published article is copyrighted by the Ecological Society of America and can be found at: http://www.esajournals.org/loi/ecol.
2014-03-01T00:00:00Z