Specific Advanced graphical MOdels for Genome-Wide Association Studies – SAMOGWAS
In biomedical research, high-throughput genotyping produces massive data (from a few hundred thousand to one or two million genetic markers for each individual observed). Downstream of high-throughput genotyping, genome-wide association studies (GWASs) aim to identify the DNA variations responsible for genetic diseases, and have to cope with such vast amounts of data. Moreover, these data, which consist of genetic markers called single nucleotide polymorphisms (SNPs), are complex because they are characterized by short- and long-range dependences between variables along the genome. These dependences are called linkage disequilibrium (LD). The connecting thread of this interdisciplinary project is the concept of graphical models, used to design advanced algorithms dedicated to data mining for GWAS purposes. The project will explore strategies based on Bayesian networks on the one hand and on specific random forests on the other hand. Indeed, very few existing approaches attempt to model dependences between SNPs while addressing the scalability problem of GWASs.
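To make the notion of dependence between SNPs concrete, the following minimal sketch computes a common LD proxy: the squared Pearson correlation (r²) between two genotype vectors. It is an illustration only, not part of the project's methodology. The 0/1/2 coding (number of copies of one allele per individual), the function name `r_squared`, and the toy genotypes are all assumptions made for this example.

```python
def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2.

    Used here as a simple proxy for linkage disequilibrium (assumption:
    unphased genotype dosages, one value per individual).
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        # A SNP with no variation has no defined correlation; report 0 by convention.
        return 0.0
    return cov * cov / (var_a * var_b)


# Hypothetical toy genotypes for six individuals.
snp1 = [0, 1, 2, 0, 1, 2]
snp2 = [0, 1, 2, 0, 1, 2]  # identical to snp1: complete dependence
snp3 = [1, 1, 1, 0, 2, 1]  # only weakly related to snp1

print(r_squared(snp1, snp2))
print(r_squared(snp1, snp3))
```

A value near 1 indicates two markers that carry largely redundant information, while a value near 0 indicates near independence. Computing this for every pair of markers grows quadratically with the number of SNPs, which gives a sense of why scalability is a concern at the genome-wide scale.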
Complexity and high dimensionality call for the use of specific Bayesian networks (BNs) to model LD: a novel, purpose-designed type of BN will be defined and implemented, the forest of hierarchical latent class models (or F model). Such a model will reduce the dimensionality of the data through latent variables, thus allowing scalability. On top of this model, the integration of supplementary data (transcriptomic data) and additional knowledge (from ontologies and from gene annotation databases) will allow cross-confirmation of putative associations between genetic factors and disease. In parallel with data integration combined with the use of an F model, the potential of specific random forests (or T models) combined with data integration will be investigated. Besides data integration, the integration of models will be explored: a hybrid model obtained by integrating F and T models will be proposed and evaluated. To evaluate the power of the model-based GWAS strategies and their integrative variants, an innovative method will be developed for the fast simulation of realistic datasets.
In summary, this project will design, implement and test advanced algorithms and strategies to drive progress in the field of GWASs. Modeling a complex natural system from massive data and scalability are two keywords of this project; in addition, evaluating the innovative GWAS strategies will require the fast generation of genome-wide simulated data. Simulation of massive data is therefore another dimension of the SAMOGWAS project. Finally, whether to increase speed or to ensure tractability, this project will deploy intensive calculations on grids; any stage may be involved: modeling, use of a model for GWAS purposes, simulation of GWAS data, and thorough evaluation of GWAS strategies.
In the SAMOGWAS project, the targets are advances in machine learning techniques, data mining and knowledge discovery for very high-dimensional data, including highly correlated data. These advances will be enhanced through the integration of heterogeneous sources of data. To serve biomedical research through scientific advances in computer science, this multidisciplinary methodological project will make available innovative software prototypes dedicated to GWASs. Finally, as more and more plant genomes are sequenced, plant genetics is opening up to genome-scale analyses. Not only will the biomedical research domain benefit from the methodology and prototypes developed (e.g. personalized medicine, public healthcare control in aging Western populations), but the animal and plant biology domains will also benefit with respect to the selection of phenotypes of interest in agronomy.
Madame Christine SINOQUET (Laboratoire d'Informatique de Nantes Atlantique - UMR CNRS 6241) – email@example.com
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
GIGA-R GIGA-R / Institut Montefiore
LPMA Laboratoire de Probabilités et Modèles Aléatoires - UMR CNRS 7599
l'institut du thorax Unité INSERM UMR 1087 / CNRS UMR 6291
LINA Laboratoire d'Informatique de Nantes Atlantique - UMR CNRS 6241
ANR funding: 398,941 euros
Beginning and duration of the scientific project: September 2013 - 42 Months