We propose to develop algorithms and software for analyzing third generation sequencing data. Third generation sequencing is an emerging technology that promises to give a more complete picture of the genomes, transcriptomes, metagenomes and metatranscriptomes of all living organisms. It will be key to discovering new fundamental mechanisms in cell biology, with broad implications for environmental research, health and agriculture.
Compared to second generation sequencing, third generation sequencing is able to produce fragments that cover significantly larger regions of the molecule, up to several thousand bases. This important feature makes it possible to overcome the main limitations of second generation sequencers and offers real potential for disruption. Remarkably, this transition does not significantly increase the difficulty or cost of obtaining sequence data. One can even expect that third generation sequencing will make sequencing technologies still more accessible with the advent of low-cost and highly portable instruments, such as the MinION commercialized by Oxford Nanopore Technologies.
In this project, we focus on transcriptome sequencing by nanopore technology. Transcriptome sequencing is the sequencing of expressed RNA in a population of cells. It is of great interest for understanding what fraction of the genome is expressed and for characterizing it, and it serves as a basis for multiple downstream analyses, including gene prediction, variant calling and species identification. However, analyzing this data is computationally challenging, due to a very high rate of sequencing errors on the one hand and the intrinsic complexity of transcriptomes on the other. There is therefore a pressing need for models and algorithms that can accommodate this new kind of data and that are also scalable. To this end, we will develop innovative computational analysis methods for transcriptomes (RNA from a single organism), 16S ribosomal RNA and metatranscriptomes (RNA sampled from a community). We will consider several settings, depending on whether a reference genome and/or supporting second generation data are available. This will give rise to a number of specialized algorithms for several primary analysis steps that complement one another: alignment, error correction, identification of gene structures and identification of variants. To achieve these goals, we will make use of state-of-the-art techniques in text algorithms and invent new ones: new models for seeds, alignment-free heuristics, compression, graph structures and text indexes.
The project unites two expert groups in bioinformatics algorithms (Bonsai, CRIStAL in Lille and Erable, LBBE in Lyon) and two sequencing and analysis platforms that have been very active in the MinION Access Program (Genoscope and Institut Pasteur de Lille). Bonsai and Erable both have long-standing experience in the design of algorithms and software for high-throughput sequencing data analysis (KisSplice, CRAC, SortMeRNA). Genoscope and Institut Pasteur de Lille will give all partners of the project early access to the latest data from the MinION and the upcoming PromethION, as well as an expert view on these data. For example, Genoscope has recently developed NAS, a comprehensive bioinformatics pipeline for error correction of nanopore data.
All algorithms proposed within the project will be made available to a broader community through the development of open-source, user-friendly bioinformatics software, which will benefit from fast dissemination through the national network France Génomique and through high-level publications. In parallel, the underlying components will be added to the GATB library, which will further broaden the audience of this work. The generated sequencing data will also be made publicly available and deposited in open archives, in order to serve as benchmarks for other research groups.
Project coordinator: Ms. Hélène Touzet (Centre de Recherche en Informatique, Signal et Automatique de Lille)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Bonsai - CRIStAL Centre de Recherche en Informatique, Signal et Automatique de Lille
Erable - LBBE Laboratoire de Biométrie et Biologie Evolutive - U LYON1
INSTITUT PASTEUR DE LILLE
CEA - GENOSCOPE Commissariat à l'énergie atomique et aux énergies alternatives
ANR funding: 562,841 euros
Start date and duration of the scientific project: November 2016, 48 months