CE45 - Interfaces: mathematics, digital sciences - biology, health

Unbiased massive RNA data-mining for medical applications – full-RNA

Submission summary

High-throughput RNA sequencing (RNA-seq) is a unique tool for the discovery of medical biomarkers and drug targets. However, while nearly one million human RNA-seq libraries are publicly available, this treasure trove of medical information cannot realize its full potential because it is impossible to directly query this resource to measure the expression of an RNA of interest. Several bioinformatics projects have addressed this issue, but they rely on normal reference RNAs that do not capture the full diversity of pathological transcripts. New reference-free data structures using k-mers allow querying of large sequence databases, but they do not allow quantitative queries.
Our goal is to develop new indexing structures capable of handling reference-free quantitative queries across tens of thousands of RNA-seq libraries while optimizing disk and memory consumption. To this end, we will build on our Reindeer indexing system. We will introduce important innovations to reduce the tool's disk and memory footprint, and we will extend it to long-read sequences. In addition, we will implement statistical tools in the new version of Reindeer to screen the indexes for RNAs significantly associated with qualitative or quantitative traits related to the phenotype of the samples. This will allow us to discover RNAs associated with clinical or cellular characteristics, and ultimately to produce new diagnostic and prognostic models. We will first create two indexes of about 10,000 samples from the Sequence Read Archive (SRA) and GTEx databases. Using these indexes, we propose a series of applications aiming to better understand the determinants of aging and cellular senescence, two related processes involved in a large number of pathologies. We will generate the first predictive models of aging and senescence using unlisted RNAs such as retrotransposons, lncRNAs and novel splice variants. The distributed architecture of our system, combined with web servers allowing public queries, will let a large community evaluate our tools, opening the way to a wide range of applications. Our consortium is composed of bioinformaticians from four institutions, with strong experience in informatics, string data structures, high-throughput RNA sequence analysis and health transcriptomics.
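To make the notion of a reference-free quantitative query concrete, the following toy sketch in Python counts k-mers per library and reports, for a query sequence, the median k-mer count in each library. It is only an illustration under stated assumptions: the dictionary-based index, the function names, the k-mer size and the median aggregation are all hypothetical choices made here, and none of them is taken from Reindeer's actual design.

```python
from collections import Counter, defaultdict
from statistics import median

K = 5  # toy k-mer size chosen for readability (assumption)


def kmers(seq, k=K):
    """Yield every k-mer of a sequence (no reverse-complement handling)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(libraries, k=K):
    """Map each k-mer to a Counter {library name: occurrences in its reads}."""
    index = defaultdict(Counter)
    for name, reads in libraries.items():
        for read in reads:
            for km in kmers(read, k):
                index[km][name] += 1
    return index


def query(index, sequence, library_names, k=K):
    """Return, per library, the median count over the k-mers of the query."""
    result = {}
    for name in library_names:
        # k-mers absent from the index count as 0 for every library.
        counts = [index.get(km, {}).get(name, 0) for km in kmers(sequence, k)]
        result[name] = median(counts) if counts else 0
    return result


if __name__ == "__main__":
    # Hypothetical miniature "libraries" of reads.
    libraries = {
        "sample_A": ["ACGTACGTAC", "ACGTACGTAC"],
        "sample_B": ["ACGTACGTAC"],
        "sample_C": ["TTTTTTTTTT"],
    }
    index = build_index(libraries)
    print(query(index, "ACGTACG", list(libraries)))
```

No reference transcriptome appears anywhere in this sketch: the query is matched directly against the k-mer content of the reads, which is what allows pathological or unlisted transcripts to be measured. A plain dictionary like this one would not scale to thousands of libraries, which is the disk and memory problem the project sets out to address.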

Project coordination

Daniel GAUTHERET (Centre national de la recherche scientifique)

The project coordinator is the author of this summary and is responsible for its content. The ANR declines any responsibility for its contents.

Partners

IP INSTITUT PASTEUR
IRMB Institut de Médecine Régénératrice & Biothérapies-Université de Montpellier
Université de Lille (EPE)
I2BC Centre national de la recherche scientifique

ANR funding: 599,698 euros
Beginning and duration of the scientific project: September 2022 - 42 months
