Full-length and in-depth analysis of RNA – Find-RNA
RNA is a fundamental molecule of life: it carries the genetic material of some viruses and very often conveys messages in the cell for the production and regulation of proteins. Studying RNA reveals functional aspects of cells and addresses fundamental questions about nucleic acids, their properties and their evolution. Through sequencing, these molecules are accessed as digital sequences called reads, which are studied notably with text and graph algorithms. Short-read technologies provide snapshots of small parts of RNA molecules. A short-read dataset can show in depth all kinds of RNAs found at a given time in a tissue or an environment, and gives access to rare RNAs. However, these technologies can yield extremely large amounts of data. Our computational power and methodology do not evolve at the same pace, making the data increasingly difficult to search and analyze. Novel long-read technologies offer renewed ways to access RNA by covering a larger part of each molecule at the expense of more noise. Being more recent, they have benefited from fewer methodological developments.
We do not know the genetic material of a majority of species on Earth. While most computational solutions rely on prior knowledge, and are therefore well adapted to species such as human or mouse, a large part of the living world remains more difficult to reach despite advances in sequencing techniques. This is even more pronounced for species, groups of species and symbioses that cannot be cultivated. Find-RNA's main goal is to provide new solutions that allow RNA analysis for these organisms. Our main objective is to develop efficient and scalable methods to create read catalogs that will enable RNA identification. We want to promote the adoption of long-read sequencing for RNA by combining long reads with short reads in the development of these catalogs. This involves developing methods to reduce the storage footprint of datasets, improving long reads, and opening new possibilities for querying these catalogs.
Find-RNA has three scientific work packages and a management work package. In the first work package, the goal is to create and test methods that can store and organize sets of small sequences extracted from short reads. These short sequences are in practice the bread and butter of multiple computational techniques for DNA and RNA. The aim is to make these sets use as little space as possible while remaining efficient to use. One important milestone of this work is to reveal the inner structure of these short-sequence sets, using a new and unexplored data structure. The second work package focuses on creating methods that can quickly update and search through large catalogs of RNA datasets. The aim is to build a "dictionary" data structure specialized for RNA, to which new sequences can be added easily. This structure is the main milestone of the project. The third work package builds on the previous ones and works on improving long-read accuracy by removing noise. The goal is to create a new "dictionary" structure that can handle both short and long sequences, which is also a major contribution of Find-RNA. Finally, the programs will be tested on a real-life application case: studying symbiosis in plankton.
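To make the idea of a catalog of short sequences more concrete, here is a minimal Python sketch. It is not the project's data structure: it is a plain hash-based inverted index, shown only for illustration. The names `canonical`, `kmers` and `KmerCatalog` are hypothetical, and the sketch assumes that reads are given in the DNA alphabet (A, C, G, T), as is common for sequenced RNA, and that short sequences are fixed-length substrings (k-mers) of the reads.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq: str, k: int):
    """Yield the canonical k-mers of a sequence, skipping any with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield canonical(kmer)


class KmerCatalog:
    """Toy 'dictionary' mapping each k-mer to the datasets that contain it."""

    def __init__(self, k: int = 5):
        self.k = k
        self.index = defaultdict(set)

    def add_dataset(self, name: str, reads) -> None:
        """Insert all k-mers of a new dataset without rebuilding the index."""
        for read in reads:
            for kmer in kmers(read, self.k):
                self.index[kmer].add(name)

    def query(self, seq: str) -> dict:
        """Return, per dataset, the fraction of the query's k-mers found in it."""
        hits = defaultdict(int)
        total = 0
        for kmer in kmers(seq, self.k):
            total += 1
            for name in self.index.get(kmer, ()):
                hits[name] += 1
        if total == 0:
            return {}
        return {name: count / total for name, count in hits.items()}


catalog = KmerCatalog(k=3)
catalog.add_dataset("sample_A", ["ACGTAC"])
catalog.add_dataset("sample_B", ["GTACC"])
print(catalog.query("GTAC"))
```

A hash-based index like this one stores every k-mer explicitly, so its memory use grows quickly with the number of datasets. That cost is the kind of limitation that motivates more compact representations.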
In addition to open-access scientific publications, the project will deliver several pieces of software for bioinformaticians and biologists who study organisms through their sequenced RNA. It will open access to under-studied organisms, in particular in ecology and conservation biology. By exploring data structures from remote fields, it will also build bridges within computer science. Finally, it will integrate long reads in an original fashion, in order to take advantage of the opportunities offered by this technology.
Project coordination
Camille MARCHET (UMR 9189 - CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
Partnership
CRIStAL UMR 9189 - CRISTAL - Centre de Recherche en Informatique, Signal et Automatique de Lille
ANR funding: 194,221 euros
Beginning and duration of the scientific project: December 2023, 48 months