Statistical Hurdle Models for Single Cell Gene Expression: Differential Expression and Graphical Modeling


Author(s): McDavid, Andrew
Contributor(s)

Gottardo, Raphael

Drton, Mathias

Date(s)

14/07/2016

14/07/2016

01/06/2016

Abstract

Thesis (Ph.D.)--University of Washington, 2016-06

This dissertation describes a set of statistical methods developed for the analysis of single cell gene expression. A characteristic of single cell expression is bimodality, in which two clusters of expression are present. For any given transcript, the null cluster corresponds to cells without detectable expression (hence a non-zero measurement reflects measurement error), while the signal cluster contains cells with a positive, detectable level of expression. Statistical models that accommodate this characteristic are considered.
• In Chapter 1, the motivation for and history of single cell gene expression are considered. Scientific and statistical questions addressable through single cell expression are discussed, and some statistical frameworks for bulk and single cell expression are described.
• In Chapter 2, I consider data generated from replicates of single cells and 100-cell aggregates that were assayed through single cell reverse-transcriptase qPCR (rt-qPCR). In rt-qPCR the null cluster manifests as bona fide zeros, so expression is characterized by zero-inflation of otherwise continuous values. The average expression from single cells and 100-cell replicates is compared to develop quality control metrics that optimize the concordance between single-cell and 100-cell measurements. A Hurdle model is proposed, which accounts for the fact that a gene at the single-cell level can be on (and a continuous expression measure is recorded) or dichotomously off (and the recorded expression is zero). Based on this model, I derive a combined likelihood-ratio test for differential expression that incorporates both the discrete and continuous components. This chapter was originally published in McDavid et al. [2013].
• In Chapter 3, I consider application of the Hurdle model to single cell RNA sequencing (scRNAseq). In these technologies, the binary zero-inflation found in rt-qPCR-based assays manifests itself as continuous, bimodal expression, motivating a clustering and thresholding procedure to assign each expression measurement to a cluster. The Hurdle model, extended and cast as a vector generalized linear model (vGLM), is provided as an R package named MAST. The cellular detection rate (CDR) is defined as the number of expressed genes found in a cell. It is identified as an important latent factor in single cell experiments, and is argued to measure size and efficiency variations among cells. Gene set enrichment analysis using the Hurdle model and the use of residuals defined through such models are discussed. Parts of this chapter were originally published in Finak et al. [2015] and McDavid et al. [2014].
• In Chapter 4, the Hurdle model is generalized to model multivariate dependences between cells, permitting the parametrization of graphical models. A neighborhood selection-based method is proposed to leverage group-l1 penalized regression. Networks estimated on single-cell and multi-cell experiments are contrasted and found to be very distinct. In order to synthesize graphs estimated on transcriptome-scale data, a test for enrichment of connections between and within gene ontology categories is proposed.
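As an illustration of the combined likelihood-ratio test summarized for Chapter 2, the following is a minimal sketch for a two-group comparison of a single gene. It is not the thesis code and not the MAST API: the function name `hurdle_lrt`, the Bernoulli model for the on/off component, and the normal model with a shared variance for the positive values are all assumptions made for this example. Under those assumptions the two component statistics are summed and referred to a chi-square distribution with 2 degrees of freedom, whose upper tail is exp(-x/2).

```python
import math


def _bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k expressing cells out of n."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def _rss(values):
    """Residual sum of squares around the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def hurdle_lrt(group_a, group_b):
    """Two-group hurdle likelihood-ratio test for one gene (illustrative only).

    Inputs are per-cell expression values; zero means "off".
    Assumes each group has at least two positive values that are not all equal.
    Returns (discrete statistic, continuous statistic, combined statistic, p-value).
    """
    pos_a = [x for x in group_a if x > 0]
    pos_b = [x for x in group_b if x > 0]
    if len(pos_a) < 2 or len(pos_b) < 2:
        raise ValueError("need at least two positive values per group")

    # Discrete component: does the expression frequency differ between groups?
    ll_alt = (_bernoulli_loglik(len(pos_a), len(group_a))
              + _bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = _bernoulli_loglik(len(pos_a) + len(pos_b),
                                len(group_a) + len(group_b))
    stat_disc = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous component: does the mean of the positive values differ?
    # Normal model with a shared variance (an assumption of this sketch).
    rss_alt = _rss(pos_a) + _rss(pos_b)
    rss_null = _rss(pos_a + pos_b)
    if rss_alt == 0:
        raise ValueError("positive values have zero within-group variance")
    n_pos = len(pos_a) + len(pos_b)
    stat_cont = max(0.0, n_pos * math.log(rss_null / rss_alt))

    # Combined test: sum of the two statistics, chi-square with 1 + 1 = 2 df.
    stat = stat_disc + stat_cont
    p_value = math.exp(-stat / 2)
    return stat_disc, stat_cont, stat, p_value
```

For example, `hurdle_lrt([0, 0, 0, 1.0, 1.2], [0, 3.0, 3.1, 2.9, 3.2])` returns a positive statistic for each component, so a gene can contribute evidence through a change in expression frequency, a change in expression level, or both.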

Format

application/pdf

Identifier

McDavid_washington_0250E_15661.pdf

http://hdl.handle.net/1773/36846

Language(s)

en_US

Keywords

#gene expression #graphical model #hurdle model #RNA sequencing #single cell #zero-inflated #Biostatistics #statistics

Type

Thesis