Gene expression produces a virtually limitless diversity of transcripts. Finding and quantifying these transcripts in the immense sequence repositories accumulating worldwide is currently an impossible task. We propose here, for the first time, a solution to this problem through a new structure for indexing massive sequence collections. This tool opens up unique perspectives in biology and health.
High-throughput sequencing is revolutionizing our view of gene expression through its ability to capture the wide variety of transcripts produced by each cell. However, bioinformatic analysis of these data, which most often relies on comparison with reference sequences, fails to identify a very large number of RNAs carrying biologically essential variations. Here we propose a new concept of transcriptomic analysis that uses selected k-mers to represent each variation in a transcript, together with an indexing system that allows an unprecedented number of transcriptomic variants to be searched efficiently. This system will make it possible to re-analyze very large sets of public data, opening the way to a wide range of applications such as diagnosis by RNA-seq or analysis of regulatory networks by RNAs.
Here we propose a system for analyzing transcript variants from RNA-seq data, based on the concept of k-mer signatures. This concept uses minimal sequence information to capture each event, whether transcriptional, post-transcriptional or genetic, independently of any reference transcriptome. We will develop a new data structure to store signatures in an efficient "encyclopedia" that will associate the signatures with a variety of biological events such as splice variants, SNVs, indels, circular RNAs, fusion transcripts, etc. To allow querying of large RNA-seq datasets with k-mer signatures, we will develop a new index structure that can efficiently link a k-mer to all its occurrences in the reads of an RNA-seq library.
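The signature idea can be illustrated with a minimal Python sketch. It makes one simplifying assumption that is not stated in the project text: a signature is taken to be the k-mers of a variant sequence that are absent from a background set of sequences. The function names `kmers` and `signature_kmers` are hypothetical, and strand handling (canonical k-mers) is ignored.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(variant, background, k):
    """Return the k-mers of `variant` that occur in no background sequence.

    Assumption for illustration only: "signature" means k-mers specific
    to the variant relative to the given background sequences.
    """
    background_kmers = set()
    for seq in background:
        background_kmers |= kmers(seq, k)
    # Walk the variant left to right so the signature keeps transcript order.
    return [variant[i:i + k]
            for i in range(len(variant) - k + 1)
            if variant[i:i + k] not in background_kmers]


background = ["ACGTACGGTCA"]
variant = "ACGTACTGTCA"  # differs from the background at one position (G -> T)
signature = signature_kmers(variant, background, k=5)
print(signature)
```

With these toy sequences, only the k-mers overlapping the changed base are retained, which is why a handful of short k-mers can be enough to stand for a single-nucleotide event.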
Partner 1 has extracted the signatures of all Gencode transcripts (deliverable 1.B). Combined with the tools of deliverables 3.x, this allows us to efficiently quantify any reference human transcript in an RNA-seq dataset. Work on the database structure (1A) and the ontology (1C) is in progress but not yet finalized. As this does not block progress on the other deliverables, we prefer to continue this work and postpone the delivery dates.
For deliverable 2B, partners 1 and 2 analyzed several hundred RNA-seq libraries from leukemias, lung cancer and prostate cancer in order to extract signatures of transcripts specific to these pathologies. For deliverable 2.A, new statistics have been integrated into DEkupl, making it possible to process hundreds of libraries, and a new tool (KamRat) has been developed in C++ to search very quickly for "signature" k-mers with statistics other than those of DEkupl, notably logistic regression, naive Bayes classification and ANOVA for problems with multiple conditions. Publications on this work are in preparation. Finally, partner 1 published an opinion article on reference-free approaches such as those we advocate in this ANR project.
Partner 3 reached a major milestone in the project with the development and publication of the REINDEER software in collaboration with partner 1. This software makes it possible to index several thousand RNA-seq libraries and to search efficiently for k-mer signatures in these libraries (deliverable 4B). Building on REINDEER, partner 3 produced the covid19seqsearch site (see highlights), which is a first version of deliverable 4A, i.e. a complete workflow including input of a sequence of interest, extraction of k-mers and quantification in more than 1850 fastq files.
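The workflow described above (input a sequence, extract its k-mers, quantify them across many libraries) can be sketched as follows. This is a toy stand-in, not REINDEER's data structure or interface: each library is represented by a plain in-memory `Counter`, and the names `count_kmers` and `query` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def query(sequence, library_counts, k):
    """For each library, return (fraction of query k-mers found, mean count of found k-mers)."""
    query_kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not query_kmers:
        raise ValueError("query sequence is shorter than k")
    report = {}
    for name, counts in library_counts.items():
        # Counter returns 0 for absent k-mers without inserting them.
        found = [counts[km] for km in query_kmers if counts[km] > 0]
        mean_count = sum(found) / len(found) if found else 0.0
        report[name] = (len(found) / len(query_kmers), mean_count)
    return report


# Toy libraries standing in for fastq files.
libraries = {
    "sample1": ["ACGTAC", "ACGTAC"],
    "sample2": ["TTTTTT"],
}
library_counts = {name: count_kmers(reads, k=4) for name, reads in libraries.items()}
result = query("ACGTA", library_counts, k=4)
print(result)
```

Reporting both the fraction of k-mers found and their abundance is one simple way to separate "the sequence is present" from "the sequence is expressed at a given level"; a production index would need a far more compact representation to scale to thousands of libraries.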
The main benefits of the project include (1) the ability to re-analyze RNA-seq projects in any type of organism, allowing the identification of an unprecedented variety of transcriptional events; (2) the discovery of RNA biomarkers of diagnostic and prognostic value; (3) a new way for groups managing large public datasets to offer access to their data; (4) in the longer term, the emergence of an ecosystem for the curation of an index of transcriptomic events based on k-mer signatures, with applications in health and research; and (5) a powerful platform for commercial services that industry partners can combine with manual curation and machine learning to develop targeted biological or medical applications.
- Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. (2020) REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics. In press.
- Morillon A, Gautheret D. (2019). Bridging the gap between reference and real transcriptomes. Genome Biol. 20:112.
- Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. (2020) REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. ISMB (published proceedings)
- Marchet C. et al. 2019 Indexing De Bruijn graphs with minimizers, BiATA, St Petersburg (Russia)
- Marchet C. et al. 2019 Survey of k-mer set of sets data structures for querying large collections of sequencing datasets, DSB, Dortmund (Germany)
- Marchet C. et al. 2019 Survey of k-mer set of sets data structures for querying large collections of sequencing datasets, Helsinki Bioinformatics Day (Finland)
- Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets bioRxiv 2020.03.29.014159; doi: doi.org/10.1101/2020.03.29.014159
- Riquier S, Mathieu M, Boureux A, Ruffle F, Lemaitre JM, Djouad F, Gilbert N, Commes T. Detailed analysis of public RNAseq data and long non-coding RNA: a proposed enhancement to mesenchymal stem cell characterisation. bioRxiv. doi: doi.org/10.1101/2020.03.09.976001
- Marchet C, Kerbiriou M, Limasset A. BLight: Efficient exact associative structure for k-mers, bioRxiv 2020.04.28.546309; doi: doi.org/10.1101/546309
- Marchet C, Boucher C, Puglisi S, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing datasets, bioRxiv 2019.12.06.866756; doi: doi.org/10.1101/866756
Transcript diversity is a product of genetic, transcriptional and post-transcriptional variations. The combination of these three layers of variation produces a virtually limitless transcript catalogue for any given species. RNA-seq deep sequencing technology provides a fascinating insight into this diversity through its ability to measure transcript expression levels as well as to discover new transcripts. However, current software for analyzing RNA-seq data does not exploit the full potential of the technology. Leading protocols involve mapping and/or assembly procedures that are error-prone and do not scale well to the number of publicly available RNA-seq datasets (about 235,000 for human alone). Recent k-mer based approaches have considerably improved the computing time and scalability of RNA-seq analysis. Yet these methods are limited by their reliance on a reference transcriptome and cannot infer new transcription or processing events.
Here we propose a new system for the efficient analysis of transcript variants from RNA-seq data. Our methodology is based on a k-mer signature design that seeks the minimum sequence information needed to capture each variant, whether it results from transcriptional, post-transcriptional or genetic variation, irrespective of a reference transcriptome. We will develop a novel data structure to store signatures in a space-efficient and curated "encyclopedia" that will associate signatures with a limitless variety of biological events such as splice variants, SNVs, indels, circular transcripts or fusion transcripts. To enable querying of large RNA-seq datasets with k-mer signatures, we will develop a new index structure that can efficiently link a k-mer to all its occurrences in reads across multiple datasets.
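To make the last point concrete, the sketch below builds a naive inverted index that links each k-mer to its occurrences in reads across several datasets. It is only an illustration of the mapping itself: a Python dictionary is used as a stand-in for the space-efficient structure the project aims to develop, and `build_index` and the dataset names are hypothetical.

```python
from collections import defaultdict


def build_index(datasets, k):
    """Map each k-mer to a list of (dataset name, read index, offset) occurrences."""
    index = defaultdict(list)
    for name, reads in datasets.items():
        for read_idx, read in enumerate(reads):
            for offset in range(len(read) - k + 1):
                index[read[offset:offset + k]].append((name, read_idx, offset))
    return index


# Two toy datasets, each a list of reads.
datasets = {
    "libA": ["ACGTAC", "GTACGT"],
    "libB": ["TTACGT"],
}
index = build_index(datasets, k=4)

# Look up every occurrence of one k-mer across all datasets.
hits = index.get("ACGT", [])
print(hits)
```

Storing every occurrence explicitly, as done here, grows linearly with the total read volume, which is precisely why a dedicated compressed index is needed at the scale of public RNA-seq archives.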
In parallel, signature inference tools will be developed to enable the discovery of new k-mer signatures of biological interest from RNA-seq experiment data. We will specifically seek predictive signatures related to human disease, based on the large public collection of medical RNA-seq data. Our hypothesis-free approach has the potential to reveal important diagnostic or prognostic biomarkers that have escaped previous screens, such as non-coding RNAs, splice variants, gene fusions and even foreign RNA from pathogens. All inferred signatures will be integrated into the encyclopedia.
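As a rough illustration of signature inference, the sketch below ranks k-mers by how strongly their mean count differs between two groups of samples. The scoring used here, a log2 fold change with a pseudocount, is a deliberately simple stand-in chosen for this example; it is not the statistical model of the project's tools, and the names `count_kmers` and `rank_kmers` are hypothetical.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rank_kmers(cases, controls, k, pseudocount=1.0):
    """Rank k-mers by log2 fold change of mean count (cases over controls)."""
    case_counts = [count_kmers(sample, k) for sample in cases]
    ctrl_counts = [count_kmers(sample, k) for sample in controls]
    all_kmers = set().union(*case_counts, *ctrl_counts)
    scores = {}
    for km in all_kmers:
        mean_case = sum(c[km] for c in case_counts) / len(case_counts)
        mean_ctrl = sum(c[km] for c in ctrl_counts) / len(ctrl_counts)
        # The pseudocount avoids division by zero for k-mers absent from one group.
        scores[km] = math.log2((mean_case + pseudocount) / (mean_ctrl + pseudocount))
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Each sample is a list of reads; two samples per condition.
cases = [["ACGTT"], ["ACGTT"]]
controls = [["ACGAA"], ["ACGAA"]]
ranked = rank_kmers(cases, controls, k=4)
print(ranked)
```

No reference sequence enters the computation: candidate signatures come directly from the reads, which is the sense in which such an approach is hypothesis-free. Real data would additionally require normalization for library size and a proper statistical test.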
The encyclopedia and associated query tools will be provided both as a set of standalone, open-source programs and through web-based interfaces. For end users, TranSiPedia will enable (1) retrieving the expression profile of any curated or user-provided k-mer signature across large sets of RNA-seq libraries (N > 10,000) and (2) analyzing user-provided RNA-seq libraries for the presence of any signature present in the encyclopedia. Working prototypes are already available for each element of the system.
Major benefits and outcomes of the project include (1) the capacity to re-analyse RNA-seq projects in any type of organism, enabling identification of an unprecedented diversity of transcripts; (2) the discovery of RNA biomarkers of diagnostic and prognostic value; (3) a new way for groups managing large public datasets to offer access to their data; (4) in the longer term, the emergence of an ecosystem involved in the curation of an index of transcriptomic events based on k-mer signatures, with applications in healthcare and research; and (5) a powerful platform for commercial services that industrial partners can combine with manual curation and machine learning to develop targeted biological or medical applications.
Mr Daniel Gautheret (Institut de Biologie Intégrative de la Cellule)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
IRMB Cellules souches, plasticité cellulaire, régénération tissulaire et immunothérapie des maladies inflammatoires
CRIStAL Centre de Recherche en Informatique, Signal et Automatique de Lille
I2BC Institut de Biologie Intégrative de la Cellule
ANR funding: 519,949 euros
Beginning and duration of the scientific project: November 2018 - 36 Months