CE45 - Mathematics, computer science, automatic control, and signal processing to address the challenges of biology and health 2018

Statistics and Machine Learning for Single Cell Genomics – SingleStatOmics

Cell-to-cell variability, central to gene regulation and differentiation, reveals molecular networks and promises to transform our understanding of genome regulation. However, the high dimensionality of single-cell data demands new mathematical models. This project aims to develop methodologies to study cell identity and differentiation by integrating single-cell expression and epigenomics, leveraging unique expertise and global collaborations.

Take up the challenge of single-cell data analysis.

The ability to measure genome-wide gene expression or mutations from large cell populations revolutionized biology in the late 1990s, enabling characterization of cancer subtypes and comprehensive gene expression profiling. However, traditional bulk genomics masks critical cell-to-cell variability within samples. Advances in sequencing and high-throughput cell biology now enable genome-wide measurements at the single-cell level, encompassing DNA, RNA, chromatin states, and proteins. This emerging field, single-cell genomics, reveals intra-tissue heterogeneity in cell types such as T cells, lung cells, and myeloid progenitors, and supports the construction of a comprehensive human cell atlas. Cell variability, central to processes like gene regulation and differentiation, offers insights into stochastic molecular processes and functional roles in cellular decision-making. Single-cell genomics holds transformative potential for understanding gene regulation and resolving longstanding biological debates.

Despite its promise, single-cell genomics introduces significant computational and mathematical challenges. Issues such as multiplets, high missing data rates (~90%), experimental artifacts, and vast data scales (millions of cells) require novel statistical models and scalable algorithms. Furthermore, emerging biological questions, such as modeling differentiation or integrating genetic and epigenetic data, necessitate innovative approaches. Dedicated analytical tools are essential to fully leverage single-cell genomics.

This project aims to address key challenges in single-cell genomics through the development of mathematical models and computational tools for three critical biological problems: (i) analyzing sample heterogeneity and cell identity, (ii) modeling cell differentiation and gene regulation dynamics, and (iii) exploring single-cell multi-omics. Our consortium combines expertise in high-dimensional statistics, machine learning, optimal transport, bioinformatics, and systems biology, supported by a broad network of collaborators in France and abroad. This integrated effort seeks to advance the field and unlock the full potential of single-cell genomics.

WP1: Analyzing Sample Heterogeneity and Cell Identity

Modeling Heterogeneity: Latent variable models were central to our probabilistic modeling of count data such as scRNA-seq data, with the family of multivariate Poisson log-normal (PLN) models as the starting point. Graphical models were also developed to investigate cell heterogeneity using dimension reduction techniques such as SNE and UMAP.
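To make the PLN model concrete, the following minimal sketch simulates counts from its two layers: a latent Gaussian vector per cell, then Poisson counts given the exponentiated latent values. It uses only NumPy and does not rely on the pyPLNmodels API; the function name `simulate_pln` and the toy parameters `mu` and `sigma` are assumptions chosen for illustration, and covariates and offsets are omitted.

```python
import numpy as np

def simulate_pln(n_cells, mu, sigma, seed=0):
    """Draw counts from a Poisson log-normal model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Latent layer: one correlated Gaussian vector Z_i per cell.
    z = rng.multivariate_normal(mean=mu, cov=sigma, size=n_cells)
    # Observation layer: counts are Poisson given the latent log-rates.
    y = rng.poisson(np.exp(z))
    return y, z

# Hypothetical parameters for three genes; genes 0 and 1 are correlated.
mu = np.array([1.0, 0.5, 0.0])
sigma = np.array([[1.0, 0.6, 0.0],
                  [0.6, 1.0, 0.0],
                  [0.0, 0.0, 0.5]])
counts, latent = simulate_pln(n_cells=2000, mu=mu, sigma=sigma)
```

The dependence between genes lives entirely in the latent covariance `sigma`, which is why estimating that matrix (or its inverse) is the key inference problem for this family of models.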

Scalability: We developed implementations of our models that scale to single-cell data. This involved designing estimation algorithms tailored to the size of the data (variational approaches, stochastic algorithms, hybrid deep/statistical approaches). We also used resampling techniques (the Nyström method) to increase the scalability of our approaches, and developed GPU-dedicated computing methods to scale our algorithms.
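As an illustration of how the Nyström method reduces the cost of kernel computations, the sketch below approximates an n x n Gaussian kernel matrix from a random subset of landmark cells. It is a generic textbook version written with NumPy, not the project's implementation; the function names, the kernel parameter `gamma`, and the toy data are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(x, n_landmarks, gamma=0.5, seed=0):
    """Return features phi such that phi @ phi.T approximates the kernel matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=n_landmarks, replace=False)
    landmarks = x[idx]
    k_mm = rbf_kernel(landmarks, landmarks, gamma)  # small m x m block
    k_nm = rbf_kernel(x, landmarks, gamma)          # n x m cross block
    # Pseudo-inverse square root of k_mm through its eigen-decomposition.
    w, v = np.linalg.eigh(k_mm)
    keep = w > 1e-10
    inv_sqrt = v[:, keep] / np.sqrt(w[keep])
    return k_nm @ inv_sqrt

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 5))                 # toy data: 300 cells, 5 features
phi = nystrom_features(x, n_landmarks=60)
k_exact = rbf_kernel(x, x)
rel_err = np.linalg.norm(phi @ phi.T - k_exact) / np.linalg.norm(k_exact)
```

Only the m x m and n x m blocks are ever computed, so the cost grows linearly with the number of cells for a fixed number of landmarks.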

WP2: Modeling Cell Differentiation and Gene Regulation Dynamics

Causal GRN Inference: We developed variational algorithms to estimate graphical models based on our PLN model for count data. We also provided a version of the method that handles zero inflation, a major characteristic of single-cell data, and investigated the precision of the variational estimators in the PLN model.
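One common way to write a zero-inflated PLN graphical model is sketched below. The notation is an assumption made for illustration and is not taken from the project's publications: $\Omega$ is the precision matrix of the latent layer, $\pi_j$ is the zero-inflation probability of gene $j$, and $\delta_0$ is a point mass at zero.

```latex
\begin{align*}
Z_i &\sim \mathcal{N}\left(\mu, \Omega^{-1}\right), \\
W_{ij} &\sim \mathcal{B}(\pi_j), \\
Y_{ij} \mid Z_{ij}, W_{ij} &\sim (1 - W_{ij}) \, \mathcal{P}\left(e^{Z_{ij}}\right) + W_{ij} \, \delta_0 .
\end{align*}
```

In this formulation, genes $j$ and $k$ are linked in the inferred network when $\Omega_{jk} \neq 0$, and a sparse network is typically obtained by adding an $\ell_1$ penalty on $\Omega$ to the variational objective.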

Dynamical GRN Inference: We developed a framework based on piecewise deterministic Markov processes (PDMP) to model gene expression regulation. We then proposed a reduction of this model to a discrete coarse-grained model with a limited number of cell types. We developed analytical results and numerical tools to investigate the functioning of the underlying Gene Regulatory Network (GRN). Finally, we tackled the inverse problem, that is, inferring the GRN from transcriptional profiles, and proposed a simulation algorithm based on the mechanistic model together with a proof-of-concept inference method derived from likelihood maximization.
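To show what a PDMP looks like in this setting, the sketch below simulates a single gene whose promoter switches randomly between off and on states, while the mRNA level follows a deterministic ODE between switches. This is a deliberately simplified, uncoupled building block and not the project's model: a regulatory network would additionally make the switching rates depend on the other genes. The function name and the parameters `k_on`, `k_off`, `s` (synthesis rate), and `d` (degradation rate) are assumptions.

```python
import numpy as np

def simulate_two_state_pdmp(k_on, k_off, s, d, t_end, m0=0.0, seed=0):
    """Simulate one gene: random promoter jumps, deterministic mRNA flow."""
    rng = np.random.default_rng(seed)
    t, m, promoter_on = 0.0, m0, False
    times, levels, states = [t], [m], [promoter_on]
    while t < t_end:
        # Random part: exponential waiting time until the next promoter switch.
        rate = k_off if promoter_on else k_on
        wait = rng.exponential(1.0 / rate)
        jump = t + wait < t_end
        dt = wait if jump else t_end - t
        # Deterministic part: dm/dt = s * E - d * m, solved in closed form.
        target = s / d if promoter_on else 0.0
        m = target + (m - target) * np.exp(-d * dt)
        t = t + dt if jump else t_end
        if jump:
            promoter_on = not promoter_on
        times.append(t)
        levels.append(m)
        states.append(promoter_on)
    return np.array(times), np.array(levels), np.array(states)

times, levels, states = simulate_two_state_pdmp(
    k_on=0.5, k_off=1.0, s=10.0, d=1.0, t_end=50.0
)
```

Because the flow between jumps has a closed form, the trajectory is simulated exactly at the jump times, without any time discretization.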

WP3: Exploring Single-Cell Multi-Omics

Chromatin States: Using computer experiments, we investigated how the parametrization of analysis pipelines affects the quality of the representation of sc-epigenomics data. We then proposed theoretical frameworks that account for the 1D nature of the data, in order to model the uncertainty of sc-ChIP-seq coordinates and to account for multiple testing when comparing sc-epigenomics data. We also proposed a non-parametric framework based on kernel methods to test the difference between sc-ChIP-seq distributions.
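The idea of a kernel-based test between two distributions can be sketched with a maximum mean discrepancy (MMD) permutation test. This is a simpler stand-in chosen for illustration and is not the procedure implemented in the ktest package; the function names, the Gaussian kernel bandwidth, and the toy Gaussian data are assumptions.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    """Gaussian kernel Gram matrix of the rows of x."""
    sq = (x ** 2).sum(1)[:, None] + (x ** 2).sum(1)[None, :] - 2 * x @ x.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth ** 2))

def mmd2(gram, is_a):
    """Biased estimate of the squared MMD between groups a and b."""
    a, b = is_a, ~is_a
    return (gram[np.ix_(a, a)].mean() + gram[np.ix_(b, b)].mean()
            - 2 * gram[np.ix_(a, b)].mean())

def mmd_permutation_test(x_a, x_b, bandwidth=1.0, n_perm=500, seed=0):
    """Two-sample test: permute group labels to build the null distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x_a, x_b])
    gram = gaussian_gram(pooled, bandwidth)   # computed once, reused below
    labels = np.arange(len(pooled)) < len(x_a)
    observed = mmd2(gram, labels)
    null = np.array([mmd2(gram, rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, size=(80, 2))  # toy stand-in for one cell population
group_b = rng.normal(1.5, 1.0, size=(80, 2))  # shifted population
stat, p_value = mmd_permutation_test(group_a, group_b)
```

The test makes no parametric assumption about either distribution: all the information enters through the kernel, which is what makes this kind of approach applicable to non-Gaussian single-cell profiles.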

3D Genome Structure: Due to the recomposition of the consortium and the departure of J.-P. Vert, this part of the project was not developed.

Multiomics Integration: Due to the early termination of C. Gayral's thesis after one year, this part of the project was not further developed.

Most of our results are in the form of methods and packages that we provide to the community for single-cell data analysis.

WP1: Analyzing Sample Heterogeneity and Cell Identity

Modeling Heterogeneity: We released the pyPLNmodels package, available on PyPI, to use the Poisson log-normal model in practice (with a zero-inflation option). It is complemented by the prob-dim-red package, which implements Gaussian probabilistic PCA (pPCA). We also provided the first theoretical framework that characterizes the probabilistic nature of the most widely used embedding methods for dimension reduction (SNE, UMAP), and we proposed the ktest package, which performs non-linear differential analysis based on kernel methods.

Scalability: We proposed a new algorithmic framework that combines two stochastic gradient variants to improve the scalability of our method. We also proved the consistency of resampling techniques when performing non-parametric testing.

Applications (Illustration 1): We characterized the gene-expression diversity of the immune response following a vaccine shot, and showed how individual T cell clones contribute to this heterogeneity throughout immune responses.

WP2: Modeling Cell Differentiation and Gene Regulation Dynamics

Causal GRN Inference (Illustration 2): The PLN package has been updated to infer causal gene regulatory networks from scRNA-seq data (with zero inflation).

Dynamical GRN Inference (Illustration 3): We developed CARDAMOM, a new algorithm for inferring a GRN from timestamped scRNA-seq data. We demonstrated its ability to infer a reliable GRN from in silico expression datasets, with good computational speed. To the best of our knowledge, this was the first description of a method that uses the concept of metastability to perform GRN inference. We also obtained new analytical results to perform model reduction for piecewise deterministic Markov process (PDMP) models. These results now allow us to describe the functioning of an underlying Gene Regulatory Network (GRN).

WP3: Exploring Single-Cell Multi-Omics

Chromatin States: We provide a benchmark of sc-ChIP-seq data that investigates the impact of tuning parameters on analysis pipelines. We then developed the first method that performs dimension reduction (PCA) for point processes, which allows for the characterization of sc-ChIP-seq heterogeneity. More generally, we obtained new theoretical results on the characterization of functional PCA, a dimension reduction method that is widely used for spatial data. In another framework (using our package ktest), based on sc-ChIP-seq data, we identified a putative reservoir population of cancer cells that may be related to breast cancer resistance to treatment (Illustration 4). Finally, we proposed the first multiple testing procedure that accounts for the spatial distribution of 1D-structured genomic data.

WP1: Analyzing Sample Heterogeneity and Cell Identity

The SingleStatOmics project has enabled significant improvements in the modeling of single-cell data using count models, which are notoriously challenging to infer. Thanks to our advances in optimizing the inference of these models, they can now be applied to very large datasets, which was previously out of reach. Modeling the counts directly allows us to better characterize biological variability, and we anticipate that new biological insights will emerge from the framework we have proposed. Our contributions to dimensionality reduction and to the probabilistic foundations of popular methods like UMAP and t-SNE open numerous avenues for exploration. These include providing stronger theoretical guarantees for these methods and enhancing them through their proper probabilistic formulation.

WP2: Modeling Cell Differentiation and Gene Regulation Dynamics

Our developments in gene regulatory network (GRN) modeling represent a significant step forward in the physical modeling of biological regulatory processes. Our next goal is to apply this framework to various differentiation sequences, particularly in the context of cancer research. Additionally, in a more predictive capacity, our model will serve as a foundation for proposing physics-informed neural networks capable of predicting the trajectories of differentiating cells. This approach has strong potential, including for clinical applications.

WP3: Exploring Single-Cell Multi-Omics

Our framework for modeling the 1D structure of the sc-ChIP-seq signal will be expanded and enriched to model multi-target sc-ChIP-seq (Spatial-CUT&Tag). This represents an unprecedented opportunity to decode the so-called chromatin code at the single-cell level. Furthermore, the kernel-based method we proposed for comparing single-cell data will be further developed to address the perturbation framework, a critical area in single-cell research. Specifically, this involves identifying significant changes in single-cell expression following biological or chemical perturbations of populations.

The ability to measure genome-wide gene expression or mutations from a biological sample made of thousands or millions of cells revolutionized biology in the late 1990s, making it possible, for example, to characterize subtypes of cancers from their molecular profile or to identify comprehensive lists of genes expressed or inhibited in particular conditions. However, the cells within a sample are never all the same, and measuring an average over thousands of cells may mask or even misrepresent signals of interest that vary between individual cells. Fortunately, recent technological advances in massively parallel sequencing and high-throughput cell biology now make it possible to obtain, at the level of individual cells, genome-wide measurements based on DNA, RNA, chromatin states or proteins. The use of these techniques, which we collectively refer to as single-cell genomics, allows us to study cell-to-cell variability within a biological sample and investigate new questions out of reach for classical bulk genomics. For example, intra-tissue heterogeneity is now clearly established in many cell types including T cells, lung cells, or myeloid progenitors. The construction of a comprehensive atlas of human cell types is now within our reach. Cell-to-cell variability is also central in many biological processes such as gene regulation or cell differentiation, as it reflects the intrinsic stochastic molecular processes and provides information on the underlying molecular networks. This variability has been shown to play an important functional role in the cell decision-making process and beyond. Consequently, the measurement of gene expression in single cells has the promise of revolutionizing our understanding of gene regulation and resolving many longstanding debates in biology.

Besides technological aspects, single-cell genomics raises new mathematical and computational challenges. The nature of the data produced by single-cell genomics techniques, as well as the questions we need to answer, differ substantially from standard bulk genomics. For example, due to the extremely small amount of biological material present in a single cell, it is common to have 90% of missing values in a single-cell experiment, and the observed values can themselves be strongly distorted by particular experimental artifacts, calling for new statistical modelling of these data. In addition, the number of cells investigated simultaneously by the latest (and future) single-cell technologies easily reaches the millions, orders of magnitude more than the number of samples in standard bulk genomics, raising new computational challenges for scalability. Finally, new biological questions are raised, such as modelling a differentiation process or integrating genetic and epigenetic data at the single-cell level, which call for new mathematical models and algorithms. In short, new dedicated analytical tools are crucially needed to unleash the full power of single-cell genomics.

The goal of this project is to address some of these pressing challenges by developing new mathematical models and computational tools for three biological problems: (i) investigating sample heterogeneity and cell identity, (ii) modelling the dynamics of cell differentiation and gene regulation, and (iii) exploring single-cell epigenomics. For that purpose, we have gathered a consortium with a unique combined experience in high-dimensional statistics, machine learning, bioinformatics, computational and systems biology, and an extended network of collaborators on single-cell genomics in France and abroad.

Project coordination

Franck PICARD (Laboratoire biologie et modélisation de la cellule)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

LBBE Laboratoire de biométrie et biologie évolutive
Mathématiques et Informatique Appliquées
LBMC LABORATOIRE DE BIOLOGIE ET MODELISATION DE LA CELLULE
LBMC UMR 5239 Laboratoire biologie et modélisation de la cellule

ANR funding: 597,436 euros
Beginning and duration of the scientific project: February 2019 - 48 Months
