Department of Statistics
http://hdl.handle.net/1957/18467
2015-05-04T06:05:42Z

Goodness-of-Fit Tests and Model Diagnostics for Negative Binomial Regression of RNA Sequencing Data
http://hdl.handle.net/1957/55722
2015-04-29T19:21:23Z
2015-03-18T00:00:00Z
Mi, Gu; Di, Yanming; Schafer, Daniel W.
This work is about assessing model adequacy for negative binomial (NB) regression, particularly
(1) assessing the adequacy of the NB assumption, and (2) assessing the appropriateness
of models for NB dispersion parameters. Tools for the first are appropriate for NB
regression generally; those for the second are primarily intended for RNA sequencing
(RNA-Seq) data analysis. The typically small number of biological samples and large number
of genes in RNA-Seq analysis motivate us to address the trade-offs between robustness
and statistical power using NB regression models. One widely used power-saving
strategy, for example, is to assume some commonalities of NB dispersion parameters
across genes via simple models relating them to mean expression rates, and many such
models have been proposed. As RNA-Seq analysis is becoming ever more popular, it is appropriate
to make more thorough investigations into power and robustness of the resulting
methods, and into practical tools for model assessment. In this article, we propose simulation-based statistical tests and diagnostic graphics to address model adequacy. We provide
simulated and real data examples to illustrate that our proposed methods are effective for
detecting the misspecification of the NB mean-variance relationship as well as judging the
adequacy of fit of several NB dispersion models.
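
As a rough illustration of what a simulation-based check of the NB mean-variance relationship can look like, the sketch below compares an observed variance-to-mean ratio with ratios from data simulated under an NB model with Var = mu + phi * mu^2. It is not the authors' procedure: the test statistic, the function names, and the assumption that mu and phi are known rather than estimated are all choices made for this example.

```python
import math
import random
import statistics


def sample_nb(mu, phi, rng):
    """Draw one NB count with mean mu and variance mu + phi * mu**2,
    generated as a gamma-Poisson mixture (hypothetical helper)."""
    # Gamma with shape 1/phi and scale phi*mu has mean mu and variance phi*mu^2.
    lam = rng.gammavariate(1.0 / phi, phi * mu)
    # Knuth's multiplication method for Poisson(lam); fine for moderate lam.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def dispersion_index(counts):
    """Sample variance divided by sample mean (0.0 if the mean is zero)."""
    m = statistics.mean(counts)
    return statistics.variance(counts) / m if m > 0 else 0.0


def mc_pvalue(observed, mu, phi, n_sim=500, seed=1):
    """Monte Carlo p-value: share of simulated datasets whose dispersion
    index is at least as large as the observed one."""
    rng = random.Random(seed)
    t_obs = dispersion_index(observed)
    exceed = 0
    for _ in range(n_sim):
        sim = [sample_nb(mu, phi, rng) for _ in observed]
        if dispersion_index(sim) >= t_obs:
            exceed += 1
    # Adding 1 to numerator and denominator keeps the p-value above zero.
    return (exceed + 1) / (n_sim + 1)
```

A small p-value from `mc_pvalue` would suggest that the observed counts are more variable than the assumed NB model allows. In a real analysis the mean and dispersion would be estimated from the data, which this sketch does not account for.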
This is the publisher’s final PDF. The published article is copyrighted by the author(s) and published by the Public Library of Science. The published article can be found at: http://www.plosone.org/.

Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates
http://hdl.handle.net/1957/50198
2014-07-08T20:53:54Z
2014-04-01T00:00:00Z
Wang, Li; Xue, Lan; Qu, Annie; Liang, Hua
We propose generalized additive partial linear models for complex data
which allow one to capture nonlinear patterns of some covariates, in the presence
of linear components. The proposed method improves estimation efficiency
and increases statistical power for correlated data through incorporating
the correlation information. A unique feature of the proposed method is
its capability of handling model selection in cases where it is difficult to specify
the likelihood function. We derive the quadratic inference function-based
estimators for the linear coefficients and the nonparametric functions when
the dimension of covariates diverges, and establish asymptotic normality for
the linear coefficient estimators and the rates of convergence for the nonparametric
function estimators for both finite and high-dimensional cases. The
proposed method and theoretical development are quite challenging since the
numbers of linear covariates and nonlinear components both increase as the
sample size increases. We also propose a doubly penalized procedure for variable
selection which can simultaneously identify nonzero linear and nonparametric
components, and which has an asymptotic oracle property. Extensive
Monte Carlo studies have been conducted and show that the proposed procedure
works effectively even with moderate sample sizes. We illustrate the proposed
method with a pharmacokinetics study on renal cancer data.
This is the publisher’s final PDF. The published article is copyrighted by the Institute of Mathematical Statistics and can be found at: http://www.imstat.org/aos/.

In defense of P values
http://hdl.handle.net/1957/49298
2014-06-23T16:24:06Z
2014-03-01T00:00:00Z
Murtaugh, Paul A.
Statistical hypothesis testing has been widely criticized by ecologists in recent
years. I review some of the more persistent criticisms of P values and argue that most stem
from misunderstandings or incorrect interpretations, rather than from intrinsic shortcomings
of the P value. I show that P values are intimately linked to confidence intervals and to
differences in Akaike’s information criterion (ΔAIC), two metrics that have been advocated as
replacements for the P value. The choice of a threshold value of ΔAIC that breaks ties among
competing models is as arbitrary as the choice of the probability of a Type I error in
hypothesis testing, and several other criticisms of the P value apply equally to ΔAIC. Since P
values, confidence intervals, and ΔAIC are based on the same statistical information, all have
their places in modern statistical practice. The choice of which to use should be stylistic,
dictated by details of the application rather than by dogmatic, a priori considerations.
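
The link between P values and ΔAIC can be made concrete for two nested models that differ by one parameter. Writing ΔAIC as AIC(reduced) minus AIC(full), the likelihood ratio statistic equals ΔAIC + 2, and under the usual large-sample chi-square approximation with one degree of freedom the P value follows directly. The function name below is hypothetical, and the restriction to a one-parameter difference is an assumption of this sketch.

```python
import math


def p_value_from_delta_aic(delta_aic):
    """P value of the likelihood ratio test implied by a given ΔAIC,
    where ΔAIC = AIC(reduced) - AIC(full) and the nested models differ
    by exactly one parameter (assumption of this sketch)."""
    # AIC = -2*loglik + 2*(number of parameters), so LRT = ΔAIC + 2*1.
    lrt = delta_aic + 2.0
    if lrt <= 0:
        return 1.0
    # Survival function of chi-square with 1 df: P(|Z| > sqrt(lrt)).
    return math.erfc(math.sqrt(lrt / 2.0))
```

For example, `p_value_from_delta_aic(2.0)` is about 0.046 and `p_value_from_delta_aic(0.0)` is about 0.157, so a ΔAIC cutoff corresponds to a particular significance level and vice versa.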
This is the publisher’s final PDF. The published article is copyrighted by the Ecological Society of America and can be found at: http://www.esajournals.org/loi/ecol.

Detecting Differential Gene Expression in Subgroups of a Disease Population
http://hdl.handle.net/1957/47936
2014-09-18T08:00:08Z
2013-09-18T00:00:00Z
Emerson, Sarah C.; Emerson, Scott S.
In many disease settings, it is likely that only a subset of the disease population will exhibit
certain genetic or phenotypic differences from the healthy population. Therefore, when seeking to identify
genes or other explanatory factors that might be related to the disease state, we might expect a mixture
distribution of the variable of interest in the disease group. A number of methods have been proposed for
performing tests to identify situations for which only a subgroup of samples or patients exhibit differential
expression levels. Our discussion here focuses on how inattention to standard statistical theory can lead to
approaches that exhibit some serious drawbacks. We present and discuss several approaches motivated by
theoretical derivations and compare them to an ad hoc approach based upon identification of outliers. We find
that the outlier-sum statistic proposed by Tibshirani and Hastie offers little benefit over a t-test even in the
most idealized scenarios and suffers from a number of limitations including difficulty of calibration, lack of
robustness to underlying distributions, high false positive rates owing to its asymmetric treatment of
groups, and poor power or discriminatory ability under many alternatives.
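
For readers unfamiliar with the outlier-sum statistic, the sketch below follows one common formulation for a single gene: standardize all values by the pooled median and median absolute deviation, then sum the standardized disease-group values that exceed the pooled third quartile plus the interquartile range. The function name, the unscaled MAD, and the quartile convention of Python's `statistics.quantiles` are assumptions of this example and may differ from the original definition.

```python
import statistics


def outlier_sum(healthy, disease):
    """Outlier-sum statistic for one gene (illustrative sketch)."""
    pooled = list(healthy) + list(disease)
    med = statistics.median(pooled)
    mad = statistics.median([abs(x - med) for x in pooled])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    # Standardize every sample using the pooled median and MAD.
    z_pooled = [(x - med) / mad for x in pooled]
    z_disease = [(x - med) / mad for x in disease]
    # Outlier cutoff: pooled third quartile plus the interquartile range.
    q1, _, q3 = statistics.quantiles(z_pooled, n=4)
    cutoff = q3 + (q3 - q1)
    # Only large values in the disease group contribute to the sum.
    return sum(z for z in z_disease if z > cutoff)
```

The last line makes the asymmetry visible: unusually large values in the healthy group, or unusually small values in either group, never enter the statistic.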
This is the publisher’s final PDF. The published article is copyrighted by Walter de Gruyter GmbH and can be found at: http://www.degruyter.com/view/j/ijb.