Genomic differences explain in great part why patients experience disease differently. However, current methods to discover associations between single nucleotide polymorphisms (SNPs) and a phenotype account for little of these differences. SCAPHE builds on the hypothesis that this is due to the effect of non-additive interactions between SNPs, together with a lack of robustness stemming from the relatively small sample sizes, which can be alleviated by integrating biological networks.
The goal of SCAPHE is to develop methods that enable the discovery, from data generated by high-throughput genomic technologies, of SNP combinations associated with a given phenotype. Ultimately, this project aims at generating novel biological hypotheses based on strong statistical evidence.
One way to reduce the statistical issues that stem from having orders of magnitude more features (SNPs) than samples is to integrate the organization of these features in networks of interactions, regulatory relationships, or contact maps defining the 3D structure of the genomes.
Recent endeavors of statistics and machine learning to elegantly incorporate data structure directly in the learning procedure are giving promising results. However, these methods only contemplate additive effects between features, although many biological phenomena are non-linear; and, because they focus on building predictive rather than interpretable models, fail to guarantee the robustness of their selection, meaning that different sets of SNPs might be selected on overlapping subsets of the same data.
In SCAPHE, we surmise that part of the missing heritability of many phenotypes can be discovered by combining GWAS data with established biological knowledge. Achieving this calls for novel data mining procedures, which successfully model non-linear interactions between genetic loci and compensate for the lack of statistical power by incorporating biological networks as well as data collected for multiple related phenotypes.
This will be achieved by developing network-guided GWAS through three orthogonal work packages (WP):
- the development of methods for non-additive, multi-locus, network-guided GWAS (WP 1);
- the development of biomarker discovery algorithms explicitly designed for robustness (WP 2);
- the joint analysis of multiple related phenotypes (WP 3).
We propose casting GWAS as a feature selection problem and addressing the objectives of SCAPHE within the regularized relevance framework, which lets us draw on a large body of work from statistical genetics and offers computational efficiency in very high dimension.
In our first work package, we will propose variants of published network-guided models to account for interactions between SNPs. This work package includes the development of methods to score sets of SNPs non-linearly, and of heuristics to alleviate the computational and statistical burdens of non-additive multi-locus models.
The second work package of SCAPHE will address robustness by building on stability selection, which combines the results of a large number of runs of a feature selection procedure on bootstrap samples of the data. We will integrate the network structure into the creation of bootstrap samples and integrate stability selection directly into the regularized relevance formulation.
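The stability selection idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SCAPHE implementation: the base selector (`top_k_by_correlation`), the toy data, the number of runs and the 0.8 frequency threshold are all hypothetical choices made for the example, and the network structure is not used here.

```python
import random
from collections import Counter


def top_k_by_correlation(X, y, k):
    """Illustrative base selector: the k features most correlated with y."""
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((b - mean_y) ** 2 for b in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        var_x = sum((a - mean_x) ** 2 for a in col)
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        denom = (var_x * var_y) ** 0.5
        # A constant column (possible in a bootstrap sample) gets score 0.
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def stability_selection(X, y, k=2, n_runs=100, threshold=0.8, seed=0):
    """Run the base selector on bootstrap samples and keep the features
    selected in at least `threshold` of the runs."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_runs):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        counts.update(top_k_by_correlation([X[i] for i in idx],
                                           [y[i] for i in idx], k))
    freq = {j: counts[j] / n_runs for j in range(len(X[0]))}
    stable = sorted(j for j, f in freq.items() if f >= threshold)
    return stable, freq


# Toy data (assumption): 200 samples, 10 SNPs coded 0/1/2,
# with SNPs 0 and 3 additively driving a quantitative phenotype.
data_rng = random.Random(42)
X = [[data_rng.choice([0, 1, 2]) for _ in range(10)] for _ in range(200)]
y = [2.0 * row[0] + 2.0 * row[3] + data_rng.gauss(0.0, 0.5) for row in X]

stable, freq = stability_selection(X, y)
print("stable SNPs:", stable)
print("selection frequencies:", freq)
```

With a signal this strong, the two causal SNPs should be selected in essentially every run while the noise SNPs rarely are; the selection frequencies, rather than a single run, are what the final selection relies on.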
Finally, the assumption that there are benefits to be gained from jointly learning on related tasks has long driven the field of multitask learning. The goal of our third work package will be to propose and evaluate new tools for multi-phenotype network-guided GWAS that integrate a notion of similarity between the phenotypes. We will start from existing additive formulations, then extend those to non-additive formulations, and finally focus on the specific case of eQTL studies, in which the phenotypes are abundances of gene transcripts.
These work packages will be supported by three transversal tasks:
- the quantification of power gains, through appropriate empirical evaluations and theoretical analysis;
- high-performance computing to deal with the large dimensionality of the data;
- biological applications, which will guide the methodological developments we propose.
In the context of WP1, we compared approaches for the integration of biological networks in GWAS on a non-BRCA1/BRCA2 familial breast cancer data set. We proposed combining different approaches to build a consensus network.
In collaboration with the International Inflammatory Bowel Disease Genetics Consortium, we are currently investigating an extension of this method to detect purely epistatic interactions.
In a multiple sclerosis GWAS, we showed how to combine metabolic pathway data with an epistasis detection method.
Finally, we have adapted kernelPSI, a generic method for post selection inference for non-linear feature selection methods, to the specific context of GWAS; this required intensive developments in high-performance computing.
We are currently working on the development of a multitask group lasso to analyze GWAS data stratified by population structure. This work is part of both WP2 (looking for robustness) and WP3 (the different populations constituting different data sets with distinct but similar phenotypes).
- Asma Nouira, Chloé-Agathe Azencott, Multitask group lasso for genome-wide association studies, poster at SMPGD 2020.
- Lotfi Slim, Hélène de Foucauld, Clément Chatelain, Chloé-Agathe Azencott. A systematic analysis of gene-gene interaction in multiple sclerosis. bioRxiv (2020).
- Héctor Climente-González, Christine Lonjou, Fabienne Lesueur, GENESIS Study collaborators, Dominique Stoppa-Lyonnet, Nadine Andrieu, Chloé-Agathe Azencott. Combining network-guided GWAS to discover susceptibility mechanisms for breast cancer. bioRxiv (2020).
- gwas-tools (2020): github.com/hclimente/gwas-tools
- epiGWAS (2019): cran.r-project.org/web/packages/epiGWAS/index.html
- kernelPSI CUDA (2020): github.com/EpiSlim/kernelPSI
Differences in how patients experience disease can be explained in great part by their genomic differences. Enabling precision medicine, that is to say, being able to tailor treatment to the personal characteristics of patients, hence requires identifying genomic features associated with disease risk, prognosis or response to treatment. This is often achieved using genome-wide association studies (GWAS), which look for associations between single nucleotide polymorphisms (SNPs) and a phenotype. However, for many complex traits, the SNPs these studies uncover account for little of the known heritable variation.
One key explanation for this missing heritability is that few of the established approaches for GWAS account for the joint epistatic effect of multiple SNPs, although several SNPs might act together towards a phenotype, for example by regulating multiple redundant parts of a same pathway.
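To make the point about joint effects concrete, here is a deliberately stylized toy example, not taken from any SCAPHE data set. It assumes two binary SNPs (carrier or non-carrier) and a phenotype that follows an exclusive-or of the two, so that neither SNP shows any marginal effect while the pair fully determines the phenotype. All names are illustrative.

```python
from itertools import product

# Balanced toy design (assumption): each of the four genotype
# combinations (a, b) appears 25 times.
samples = [(a, b) for a, b in product([0, 1], repeat=2) for _ in range(25)]
# Purely epistatic phenotype: 1 if exactly one of the two SNPs is carried.
phenotype = [a ^ b for a, b in samples]


def mean(values):
    return sum(values) / len(values)


def marginal_effect(snp_index):
    """Difference in mean phenotype between carriers and non-carriers
    of a single SNP, ignoring the other SNP."""
    carriers = [p for s, p in zip(samples, phenotype) if s[snp_index] == 1]
    others = [p for s, p in zip(samples, phenotype) if s[snp_index] == 0]
    return mean(carriers) - mean(others)


def joint_effect():
    """Same contrast, but on the interaction term a XOR b."""
    on = [p for (a, b), p in zip(samples, phenotype) if a ^ b == 1]
    off = [p for (a, b), p in zip(samples, phenotype) if a ^ b == 0]
    return mean(on) - mean(off)


print("marginal effect of SNP 1:", marginal_effect(0))
print("marginal effect of SNP 2:", marginal_effect(1))
print("joint effect of the pair:", joint_effect())
```

A single-SNP test sees no difference between carriers and non-carriers of either SNP, whereas a model that scores the pair jointly captures the association. Real epistatic effects are rarely this clean, but the example shows why single-locus scans can miss them.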
Moreover, GWAS are statistically underpowered, as the number of SNPs investigated is orders of magnitude larger than the sample sizes: only SNPs with a large effect size can be detected. This additionally results in a robustness issue, particularly when using complex models: which SNPs are deemed associated with the phenotype can vary substantially across related datasets. This suggests that current approaches often capture spurious associations rather than truly relevant SNPs.
SCAPHE is built on the hypothesis that part of the missing heritability can be discovered by combining GWAS data with established biological knowledge. We surmise that this calls for novel machine learning procedures, which successfully model non-linear interactions between genetic loci and compensate for the lack of statistical power due to relatively small sample sizes by incorporating multiple sources of evidence.
More specifically, these include molecular networks and data collected for multiple related phenotypes.
SCAPHE proposes to develop novel machine learning algorithms for GWAS, cast as a feature selection problem, through three orthogonal research directions: (1) the development of methods for non-additive, multi-locus, network-guided GWAS; (2) the development of biomarker discovery algorithms explicitly designed for robustness, that is to say, to reliably return the same SNPs on overlapping subsets of the same data; and (3) the joint analysis of multiple related phenotypes.
These three research directions will be complemented by three transversal tasks, ensuring a focus throughout the project on the control of false discovery rate, high-performance computing, and applicative aspects.
To achieve its objectives, SCAPHE will build on a machine learning framework called regularized relevance. This framework formalizes the idea of encouraging the selected loci to be connected on a pre-defined biological network, supposing that SNPs along pathways or in a set of co-expressed genes are more likely to act together towards the phenotype of interest. It also allows for the combination of evidence from multiple data sets pertaining to related phenotypes, and the inclusion of nonlinear interactions between SNPs.
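One way to picture such a framework is as a trade-off between the relevance of the selected SNPs, a penalty on the number of SNPs selected, and a penalty on network edges that are cut by the selection. The sketch below is an assumption about the general shape of such an objective, not the project's actual formulation: the function names, the per-SNP relevance scores, the cut-based connectivity penalty and the parameters `lam` and `eta` are all hypothetical, and exhaustive enumeration is only viable at toy scale.

```python
from itertools import combinations


def objective(selected, relevance, edges, lam, eta):
    """Relevance of the selected SNPs, minus a connectivity penalty
    (edges with exactly one endpoint selected) and a sparsity penalty."""
    selected = set(selected)
    total_relevance = sum(relevance[i] for i in selected)
    cut_edges = sum(1 for u, v in edges if (u in selected) != (v in selected))
    return total_relevance - lam * cut_edges - eta * len(selected)


def best_subset(relevance, edges, lam, eta):
    """Brute-force search over all subsets (illustration only)."""
    n = len(relevance)
    best, best_value = (), 0.0  # the empty selection scores 0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            value = objective(subset, relevance, edges, lam, eta)
            if value > best_value:
                best, best_value = subset, value
    return set(best), best_value


# Toy network (assumption): five SNPs on a path 0 - 1 - 2 - 3 - 4.
# SNPs 0 and 2 have strong individual scores; SNP 1 lies between them.
relevance = [3.0, 0.5, 3.0, 0.2, 0.1]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

without_network, _ = best_subset(relevance, edges, lam=0.0, eta=0.8)
with_network, _ = best_subset(relevance, edges, lam=1.0, eta=0.8)
print("no network penalty:  ", sorted(without_network))
print("with network penalty:", sorted(with_network))
```

In this toy setting, SNP 1 is too weak to be selected on its own score, but it is picked up once the network penalty is switched on because it connects the two strongly associated SNPs. That is the behavior the paragraph above describes: selections are encouraged to form connected regions of the biological network.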
SCAPHE will propose new tools that will benefit human geneticists and clinicians by providing novel precision medicine insights, potentially resulting in new diagnostic tools or therapeutic targets. Moreover, the application of feature selection methods for high-dimensional data, far from being restricted to genomic studies, is of broad interest in a variety of domains ranging from medical imaging to quantitative finance and climate science.
To facilitate the dissemination of our work, the results of SCAPHE will be published in Open Access peer-reviewed publications, and we will put a strong emphasis both on Open Source code development and on facilitating usability via tutorials and user-friendly interfaces.
Madame Chloé-Agathe Azencott (ARMINES)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
ARMINES - CBIO ARMINES
ANR funding: 251,639 euros
Beginning and duration of the scientific project: December 2018 - 36 Months