Mixture-based procedures for statistical analysis of RNA-seq data – MixStatSeq
Issues
In recent years, significant advances in next generation sequencing technologies have made RNA sequencing (RNA-seq) a popular choice for studies of gene expression. Although microarrays and RNA-seq both aim to characterize transcriptional activity, the statistical tools developed for the analysis of the former are ill-suited to the latter. To date, the methodological developments for RNA-seq data have mainly focused on normalization and differential analysis; little methodological research has been devoted to the identification of co-expressed genes in RNA-seq data. However, as costs for RNA-seq experiments continue to decrease, it is likely that such studies will replace the use of microarrays for many applications involving investigations of the transcriptome. It is therefore crucial to pursue research on the development of statistical methods that allow biologists to exploit RNA-seq data. In the MixStatSeq project, we focus on three main biological questions: the detection of differentially expressed genes, the detection of co-expressed gene clusters, and the detection of invariant genes, i.e., those with stable expression in several biological conditions.
To address these biological questions, we propose to develop a suite of statistically sound methods based on mixture models, and to pursue theoretical studies of mixture models. Throughout the MixStatSeq project, the team will foster collaborations with biologists from several laboratories to validate the chosen models and test the developed approaches on real RNA-seq data obtained from different organisms. The originality of the MixStatSeq project will be the continuous exchange between theoretical, methodological and applied research, including assessment by biologists, in order to ensure the immediate potential impact of the developed procedures.
In Rigaill et al. (2016), the main statistical ingredients of a differential analysis (count modelling, low-count filtering and dispersion modelling) are studied using a synthetic dataset. The two most important points are to use a negative binomial GLM and to model the mean well, taking all the covariates into account.
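As a minimal sketch of what a negative binomial GLM involves, the Python snippet below writes out the log-probability of a count and a log-link mean built from covariates. The parameterisation (variance equal to mu + phi * mu^2), the size factor, and the names nb_logpmf and glm_mean are assumptions made for illustration; they are not taken from Rigaill et al. (2016).

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial distribution.

    Assumed parameterisation: mean mu > 0, dispersion phi > 0,
    so that Var(Y) = mu + phi * mu**2.
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

def glm_mean(size_factor, covariates, coefficients):
    """Log-link mean: mu = size_factor * exp(sum_k x_k * beta_k)."""
    eta = sum(x * b for x, b in zip(covariates, coefficients))
    return size_factor * math.exp(eta)

# Hypothetical gene: an intercept plus one condition effect (fold change 2).
beta = [math.log(50.0), math.log(2.0)]
mu_control = glm_mean(1.0, [1.0, 0.0], beta)
mu_treated = glm_mean(1.0, [1.0, 1.0], beta)
```

Adding further covariates (batch, replicate, and so on) only lengthens the covariate and coefficient vectors; the dispersion phi is what separates this model from a Poisson GLM.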
For the detection of co-expressed genes, a Poisson mixture model, implemented in the package HTSCluster, is proposed in Rau et al. (2015). Next, Gaussian mixtures are considered on transformed data (transformations of the gene expression proportions). An ICL-like penalized criterion is used to select the number of clusters and the best transformation. This clustering procedure is tested on four real datasets and is implemented in the package coseq. This work has been submitted for publication.
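The two ingredients named above, expression proportions and an ICL-type criterion, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the coseq implementation: the arcsine transformation is only one possible choice, the criterion is written in a common "BIC-like penalty plus entropy" form where larger is better, and all function names are hypothetical.

```python
import math

def expression_profiles(counts):
    """Turn one gene's counts across samples into proportions summing to 1.

    Assumes the gene has a non-zero total count (for example after
    low-count filtering).
    """
    total = sum(counts)
    return [c / total for c in counts]

def arcsine_transform(profile):
    """Arcsine square-root transformation of proportions in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in profile]

def icl(log_likelihood, n_free_params, n_obs, posteriors):
    """ICL-type criterion (larger is better, by the convention assumed here).

    posteriors[i][k] is the conditional probability that observation i
    belongs to cluster k under the fitted mixture.
    """
    bic_like = log_likelihood - 0.5 * n_free_params * math.log(n_obs)
    entropy = -sum(t * math.log(t) for row in posteriors for t in row if t > 0)
    return bic_like - entropy
```

With such a criterion, one would fit a Gaussian mixture for each candidate number of clusters and each transformation, then keep the combination with the best value. The entropy term penalises fits whose clusters overlap heavily.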
In Laurent et al. (2014), we are interested in the detection of a unidimensional two-component mixture. A procedure based on several spacings of the order statistics is proposed. We prove that the power of our procedure is optimal in various situations; the procedure adapts automatically to the proportion of the mixture and to the difference between the means of the two components under the alternative. We are currently studying the optimal separation conditions for multidimensional two-component mixture detection.
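For readers unfamiliar with spacings, the snippet below computes the k-spacings of a sample, i.e., the gaps X_(i+k) - X_(i) between order statistics. It only illustrates this raw ingredient: the test statistic, the choice of spacings and the calibration used in Laurent et al. (2014) are not reproduced here, and the function name is hypothetical.

```python
def k_spacings(sample, k):
    """Return the k-spacings X_(i+k) - X_(i) of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[i + k] - ordered[i] for i in range(len(ordered) - k)]

# Toy sample with two well-separated groups of values.
sample = [5, 1, 9, 2]
one_spacings = k_spacings(sample, 1)
two_spacings = k_spacings(sample, 2)
```

Intuitively, a sample drawn from two separated components tends to show unusually large gaps between some order statistics compared with a sample from a single distribution.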
In Gadat et al. (2016, submitted), we consider a parametric density contamination model. Under general hypotheses on the contaminated distribution, we establish the optimal rates of convergence for the estimation of the mixture parameters.
In the final period of this project, we want to pursue the theoretical work on mixture models (estimation, test procedures, …). We will also continue our work on co-expression in order to propose alternative clustering methods.
In recent years, significant advances in next generation sequencing technologies have made RNA sequencing (RNA-seq) a popular choice for studies of gene expression. Although microarrays and RNA-seq both aim to characterize transcriptional activity, the statistical tools developed for the analysis of the former are ill-suited to the latter. To date, the methodological developments for RNA-seq data have mainly focused on normalization and differential analysis, but the testing procedures currently proposed lack power to detect differentially expressed genes; little methodological research has been devoted to the identification of co-expressed genes in RNA-seq data. However, as costs for RNA-seq experiments continue to decrease, it is likely that such studies will replace the use of microarrays for many applications involving investigations of the transcriptome. It is therefore crucial to pursue research on the development of statistical methods that allow biologists to exploit RNA-seq data.
In the MixStatSeq project, we focus on three main biological questions for RNA-seq data: (i) the detection of differentially expressed genes, (ii) the detection of co-expressed gene clusters, and (iii) the detection of invariant genes, i.e., those with stable expression in several biological conditions. To address these three biological questions, we propose to develop a suite of statistically sound methods based on mixture models.
For the analysis of differential expression, two points of view are envisaged. In the first, we aim to construct a powerful testing procedure by first performing a gene clustering step, followed by a testing procedure for each subgroup of genes and a correction for multiple testing. In the second, we will investigate model-based clustering procedures that directly cluster genes into groups representing differential and non-differential expression.
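The multiple-testing correction in the first point of view can be illustrated with the Benjamini-Hochberg step-up procedure. This is an assumption for the sake of example: the text does not say which correction the project uses, and the clustering and per-subgroup testing steps are not shown.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected
    at false discovery rate level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * alpha / m (0 if none).
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, one per gene.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06]
decisions = benjamini_hochberg(p_values, alpha=0.05)
```

In a cluster-then-test scheme, the p-values passed to such a function would come from the tests run within each gene subgroup.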
For the detection of co-expressed gene clusters, we will extend our preliminary work on the use of mixture models. In particular, as the number of RNA-seq experiments will continue to increase in the coming years, it is crucial to develop variable selection procedures, as well as to incorporate external biological knowledge, in order to improve the interpretability of gene clustering.
For the detection of invariant genes, we aim to develop a non-asymptotic multiple hypothesis testing procedure to test a single distribution against a mixture of distributions, and to study its theoretical properties to ensure a powerful test. Beyond the biological application, such a development is a difficult theoretical challenge.
Throughout the MixStatSeq project, the team will foster collaborations with biologists from several laboratories to validate the chosen models and test the developed approaches on real RNA-seq data obtained from different organisms. The originality of the MixStatSeq project will be the continuous exchange between theoretical, methodological and applied research, including assessment by biologists, in order to ensure the immediate potential impact of the developed procedures. Moreover, beyond the study of RNA-seq data, this project will provide new theoretical and methodological knowledge for the study of count data with mixtures.
Project coordination
Cathy MAUGIS-RABUSSEAU (INSTITUT DE MATHEMATIQUES DE TOULOUSE)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partner
IMT INSTITUT DE MATHEMATIQUES DE TOULOUSE
ANR grant: 95,000 euros
Beginning and duration of the scientific project: February 2014 - 48 months