CE23 - Artificial Intelligence

Learning causal effects between phenome and exposome from large amounts of heterogeneous data in human complex diseases – GePhEx

GePhEx - Genome, Phenome, Exposome

Learning causal relationships between symptoms, extrinsic factors and altered genes.

Over the last ten years, the considerable expansion of omics data has produced an explosion of publicly available heterogeneous datasets. Recent genotyping and profiling technologies enable the scientific community to investigate disease-related genomic alterations in human disorders. At the same time, it is becoming increasingly clear that some complex diseases, such as lung or coronary heart disease, result from the interaction between an individual's genetic background and environmental factors. While promising biological treatments are being explored, health professionals increasingly advocate educational or preventive medical interventions, whose clinical benefits have been positively evaluated in previous studies.

The GePhEx (Genome-Phenome-Exposome) project proposes to automatically discover the phenome and the exposome associated with genomic alterations in the context of a given human complex disease, and to learn the causal relationships between symptoms, environmental factors and impacted genes. The project addresses critical public health issues, since the discovery of new environmental determinants or phenotypic traits of a disease could help establish effective medical recommendations and favor earlier diagnosis. The novel analytic methods proposed by GePhEx will (i) automatically discover exposures and their associated phenotypic traits from large amounts of publicly available data and scientific literature, (ii) causally relate phenome, exposome and genome entities in the context of a specific disease and (iii) provide an easy-to-use web application to improve patient self-awareness and support earlier diagnosis by practitioners.

The first phase of the project will provide a robust approach to simultaneously partition human genes and the scientific documents reporting on these genes into homogeneous co-clusters. This novel method will extend classical co-clustering algorithms by integrating heterogeneous large-scale datasets to guide the partitioning of scientific documents. This strategy is intended to ensure robust identification of information-rich document-gene co-clusters.
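To make the idea of document-gene co-clustering concrete, here is a minimal sketch in plain Python. It is not the GePhEx algorithm: it uses a generic modularity-style alternating heuristic for block-diagonal co-clustering and does not integrate any external datasets. The count matrix, the function names (`assign`, `cocluster`) and the starting labels are all illustrative assumptions.

```python
# Toy document x gene matrix: COUNTS[i][j] is a made-up number of times
# document i mentions gene j (hypothetical data, for illustration only).
COUNTS = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [4, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
    [0, 0, 1, 4],
]


def assign(vectors, sums, other_labels, other_sums, total, n_clusters):
    """Assign each vector (a row or a column) to the cluster of the other
    axis where its observed mass most exceeds the mass expected if rows
    and columns were independent."""
    labels = []
    for vec, vec_sum in zip(vectors, sums):
        scores = [0.0] * n_clusters
        for value, label, other_sum in zip(vec, other_labels, other_sums):
            scores[label] += value - vec_sum * other_sum / total
        # Ties are broken in favor of the lowest cluster index.
        labels.append(max(range(n_clusters), key=scores.__getitem__))
    return labels


def cocluster(counts, n_clusters=2, max_iter=20):
    """Alternate row and column reassignment until the labels stop changing."""
    rows = [list(r) for r in counts]
    cols = [list(c) for c in zip(*counts)]
    row_sums = [sum(r) for r in rows]
    col_sums = [sum(c) for c in cols]
    total = sum(row_sums)
    row_labels = None
    # Arbitrary deterministic start: columns are labeled round-robin.
    col_labels = [j % n_clusters for j in range(len(cols))]
    for _ in range(max_iter):
        new_rows = assign(rows, row_sums, col_labels, col_sums, total, n_clusters)
        new_cols = assign(cols, col_sums, new_rows, row_sums, total, n_clusters)
        if new_rows == row_labels and new_cols == col_labels:
            break
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels


doc_labels, gene_labels = cocluster(COUNTS)
print(doc_labels)   # [0, 0, 0, 1, 1, 1]
print(gene_labels)  # [0, 0, 1, 1]
```

On this toy matrix the heuristic groups the first three documents with the first two genes and the last three documents with the last two genes. Like most alternating schemes, it can stop in a local optimum that depends on the starting labels.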

The second phase of the project will focus on discovering the most representative phenotypic traits and environmental exposures, and their causal relationships, for each subset of documents. This will be achieved by first exploiting word vector representations, or embeddings, which have been shown to recover syntactic and semantic information from a corpus automatically. In particular, word vector representations are well suited to discovering synonymous words, where "synonymous" here means words that occur in the same semantic context within a corpus.
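The notion of "synonymous" as "sharing the same contexts" can be illustrated with a small sketch. As a stand-in for learned embeddings, it builds sparse count-based vectors weighted by positive pointwise mutual information (PPMI) and compares them with cosine similarity. The toy corpus, the sentence-level context window and the function names are assumptions made for illustration, not the project's pipeline.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy corpus; each sentence is treated as one context window.
CORPUS = [
    "smoking increases lung cancer risk".split(),
    "tobacco increases lung cancer risk".split(),
    "exercise reduces heart disease risk".split(),
    "activity reduces heart disease risk".split(),
]


def ppmi_vectors(sentences):
    """Build one sparse PPMI vector per word from sentence co-occurrences."""
    pair_counts = Counter()
    for tokens in sentences:
        # Ordered (word, context) pairs; assumes no repeated word in a sentence.
        pair_counts.update(permutations(tokens, 2))
    word_counts = Counter()
    for (word, _), n in pair_counts.items():
        word_counts[word] += n
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (word, ctx), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[word] * word_counts[ctx]))
        if pmi > 0:  # keep only positive associations
            vectors[word][ctx] = pmi
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the other word whose context vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


vectors = ppmi_vectors(CORPUS)
print(most_similar("smoking", vectors))  # tobacco
```

Here "smoking" and "tobacco" never co-occur, yet they end up with nearly identical vectors because they appear with the same neighbors. "smoking" and "exercise" share only the uninformative context "risk" and score close to zero.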

The major outputs of the first phase are (i) an original text mining algorithm supported by multi-source data, (ii) information-rich sub-corpora corresponding to document-gene co-clusters and (iii) groups of genes automatically learned by the partitioning algorithm. The gene groups should be of great interest to the biomedical community, as new gene associations could trigger novel experiments on possible therapeutic targets. The text mining and natural language processing communities will also benefit from this novel co-clustering algorithm, which improves information retrieval from large biomedical corpora. The main outputs of the second phase are (i) the identification of the phenome and exposome of a human complex disease, (ii) the learning of causal networks that associate phenotypic traits, symptoms and risk factors and (iii) the implementation of a web application and a source package for visualizing disease-causing relationships.

The project addresses critical public health issues, since the discovery of new environmental determinants or phenotypic traits of a disease could help establish effective recommendations and favor earlier diagnosis. The novel analytic methods proposed by GePhEx will enable automatic and systematic discovery of exposures and their associated phenotypic traits. The resulting methodology and web application will facilitate the investigations of health professionals and researchers into human complex diseases. In particular, GePhEx will provide an easy-to-use Python package with open source code freely available under a General Public Licence. The software resulting from this project will also be accessible online through a user-friendly web application.

Affeldt, S., Labiod, L. & Nadif, M. Regularized bi-directional co-clustering. Stat Comput 31, 32 (2021).
doi.org/10.1007/s11222-021-10006-w

Affeldt, S., Labiod, L. & Nadif, M. Ensemble Block Co-clustering: A Unified Framework for Text Data. Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM '20), 5–14. Association for Computing Machinery (2020).
doi.org/10.1145/3340531.3412058

Affeldt, S., Labiod, L. & Nadif, M. Regularized Dual-PPMI Co-clustering for Text Data. SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021).
sigir.org/sigir2021/accepted-papers/

Affeldt, S., Labiod, L. & Nadif, M. Approche ensemble pour le co-clustering par blocs sur des données textuelles: Application au biomédical. Extraction et Gestion des Connaissances: Actes EGC'2021 (2021).
editions-rnti.fr

Over the last ten years, the considerable expansion of omics data has produced an explosion of publicly available heterogeneous biological datasets. Recent genotyping and profiling technologies enable the scientific community to investigate disease-related genomic alterations in human disorders. At the same time, it is becoming increasingly clear that some complex diseases, such as lung or coronary heart disease, result from the interaction between an individual's genetic background and environmental factors. While promising biological treatments are being explored, health professionals increasingly advocate educational or preventive medical interventions, whose clinical benefits have been positively evaluated in previous studies.

Such interventions include the transmission of medical knowledge on phenotypic traits or symptoms, and can thereby improve patient survival, for instance by triggering earlier testing. They also provide guidance on measures that can counteract the onset of a complex disease (e.g., diabetes, chronic respiratory diseases or rheumatoid arthritis) by avoiding or modifying key extrinsic risk factors (e.g., tobacco or alcohol consumption, unhealthy diet). It is also crucial to identify the combined causal effects of environmental factors in order to propose effective treatments with few or no side effects. Hence, effective interventions should be based on the most exhaustive and accurate information available on the phenotypic traits (phenome) and the environmental exposures (exposome) in the context of a complex disease.

The GePhEx (Genome-Phenome-Exposome) project proposes to automatically discover the phenome and the exposome associated with genomic alterations in the context of a given human complex disease, and to learn the causal relationships between symptoms, environmental factors and impacted genes. The project addresses critical public health issues, since the discovery of new environmental determinants or phenotypic traits of a disease could help establish effective medical recommendations and favor earlier diagnosis. The novel analytic methods proposed by GePhEx will (i) automatically discover exposures and their associated phenotypic traits from large amounts of publicly available biological data and scientific literature, (ii) causally relate phenome, exposome and genome entities in the context of a specific disease and (iii) provide an easy-to-use web application to improve patient self-awareness and support earlier diagnosis by practitioners.

Humans encounter numerous environmental exposures over their lifespan (e.g., smoking, air pollution, dietary imbalance), and a non-negligible portion of complex disease risk is likely due to interactions between these exposures and genetic factors. New machine learning approaches are needed to analyze complex data that include genomic and environmental information. The discovery of environmental causes, acting either alone or in concert, could strengthen the basis for risk assessment and prevention. GePhEx proposes big data analytics and visualization tools to accelerate research on the human exposome, establish mechanisms of disease causality and promote public health interventions.

Project coordinator

Ms. Severine Affeldt (Centre Borelli (CNRS, UMR 9010))

The author of this summary is the project coordinator, who is responsible for its content. The ANR accepts no responsibility for its contents.

Partner

UDPESCARTES-Centre Borelli Centre Borelli (CNRS, UMR 9010)

ANR funding: 109,080 euros
Beginning and duration of the scientific project: November 2019 - 36 months
