Emergence - Emergence

Genomic Assembly Tool Box – GATB

Genomic Assembly & Analysis Tool Box

Genomics is undergoing an unprecedented change with the advent of high-throughput sequencing. These technologies, known as next-generation sequencing (NGS), generate huge volumes of data. The GATB project focuses on critical NGS processing steps such as genome assembly. For complex genomes, billions of short DNA fragments need to be processed, which is time-consuming and requires computers with very large memories. This is a serious bottleneck in many analyses, in both academia and industry.

Fast and efficient design of NGS software

The Genome Assembly & Analysis Tool Box aims to provide an easy way to develop fast and efficient NGS tools. GATB is designed as a C++ library of high-level functions that build on recent research advances in NGS data structures. The GATB software environment offers:

1 - An open-source library from which new, efficient NGS tools can be developed quickly and easily;

2 - Optimized functions with a very low memory footprint: for example, assembling a human genome requires less than 6 GB of memory, whereas competing software requires hundreds of GB;

3 - A transparent parallel implementation targeting multi-core processors, which are today's main hardware resource.

One of the main concerns of the GATB library is to provide
computing modules able to run on standard machines, i.e. computers without a large amount of main memory.

The central data structure is a de Bruijn graph on which numerous operations can be performed: error correction, assembly,
motif detection (e.g. SNPs), etc. The graph is constructed
by extracting and counting all the distinct k-mers from one or
several sequencing data sets. This time- and space-consuming task
is carried out by a very efficient disk streaming algorithm,
which adapts its memory requirement to the
available computer memory. The trade-off between execution time and
memory occupancy can be tuned: the larger the available memory,
the shorter the computation time (fewer disk accesses).
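
The counting step described above can be illustrated with a minimal Python sketch. It is not the GATB or DSK implementation: the function names (`partitions_for_budget`, `count_kmers_on_disk`), the CRC32-based partitioning and the plain-text partition files are assumptions made for illustration only. The idea it shows is that k-mers are first streamed to disk partitions, then each partition is counted on its own, so that peak memory is bounded by the largest partition rather than by the whole data set.

```python
import math
import os
import tempfile
import zlib
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def iter_kmers(read, k):
    """Yield the canonical form of every k-mer in a read, skipping non-ACGT windows."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield canonical(kmer)


def partitions_for_budget(n_kmers, bytes_per_kmer, memory_budget):
    """Hypothetical helper: more memory means fewer partitions, hence less disk traffic."""
    return max(1, math.ceil(n_kmers * bytes_per_kmer / memory_budget))


def count_kmers_on_disk(reads, k, n_partitions, min_abundance=1):
    """Yield (k-mer, count) pairs, holding only one partition in memory at a time."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"part_{p}.txt") for p in range(n_partitions)]
        files = [open(path, "w") for path in paths]
        try:
            # Pass 1: stream every k-mer to the partition chosen by its hash.
            for read in reads:
                for kmer in iter_kmers(read.upper(), k):
                    p = zlib.crc32(kmer.encode()) % n_partitions
                    files[p].write(kmer + "\n")
        finally:
            for f in files:
                f.close()
        # Pass 2: all copies of a given k-mer sit in the same partition,
        # so each partition can be counted independently.
        for path in paths:
            with open(path) as f:
                partition = Counter(line.strip() for line in f)
            for kmer, count in partition.items():
                if count >= min_abundance:
                    yield kmer, count
```

For example, `dict(count_kmers_on_disk(["ACGTACGT", "TTACGTAA"], k=4, n_partitions=4))` gives the same counts as with `n_partitions=1`; only the balance between memory use and disk traffic changes.
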
The de Bruijn graph memory footprint is kept very low thanks
to an optimized Bloom filter representation. Only the vertices of the de Bruijn graph are stored. Edges are deduced by querying the Bloom filter.
False positives (due to the probabilistic behavior of the Bloom
filter) are eliminated by adding an extra data structure that enumerates
the critical vertices. This very efficient de Bruijn graph representation
fits, for example, a complete mammalian genome in a few GB of memory.
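
The Bloom filter representation described above can be sketched as follows. This is a simplified illustration in Python, not the GATB-CORE data structure: the class names, the double-hashing scheme and the use of plain strings are assumptions, reverse complements are ignored, and the set of true k-mers is held in memory during construction for brevity. It shows the two ingredients named in the text: vertices stored in a Bloom filter, and a small extra set of "critical" false positives (false positives adjacent to a true vertex) that makes neighbor queries from true vertices exact.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter using double hashing over one BLAKE2b digest."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def extensions(kmer):
    """The 8 k-mers overlapping `kmer` by k-1 bases (4 successors, 4 predecessors)."""
    return [kmer[1:] + b for b in "ACGT"] + [b + kmer[:-1] for b in "ACGT"]


class BloomDeBruijnGraph:
    """Vertices live in a Bloom filter; edges are deduced by querying it."""

    def __init__(self, kmers, n_bits, n_hashes=3):
        kmers = set(kmers)  # needed only while building, not kept afterwards
        self.bloom = BloomFilter(n_bits, n_hashes)
        for kmer in kmers:
            self.bloom.add(kmer)
        # Critical false positives: neighbors of true vertices that the
        # filter wrongly reports as present.
        self.critical_fp = {
            ext for kmer in kmers for ext in extensions(kmer)
            if ext in self.bloom and ext not in kmers
        }

    def contains(self, kmer):
        """Exact for true vertices and for any neighbor of a true vertex."""
        return kmer in self.bloom and kmer not in self.critical_fp

    def successors(self, kmer):
        return [s for s in (kmer[1:] + b for b in "ACGT") if self.contains(s)]
```

Starting from a true vertex, a traversal that only follows `successors` never leaves the true graph, because every false positive it could step onto is listed in `critical_fp`. Shrinking `n_bits` saves filter memory but enlarges that set, which is the trade-off the representation has to balance.
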

The GATB project is organized as a three-layer construction for designing NGS tools:

GATB-CORE: a C++ library providing all the services needed to develop software dedicated to NGS data sets. This library is available as open-source software for the scientific community.

GATB-TOOLS: a set of elementary NGS tools mainly built upon the GATB library. During the project, the following tools have been designed:

- Minia: contig assembler
- DSK: k-mer counter
- Bloocoo: read corrector
- TakeABreak: detection of inversion breakpoints
- Leon: read compressor
- DiscoSNP: SNP detection
- MindTheGap: detection and assembly of insertions
- Mapsembler2: targeted assembler

All of these tools are available from the GATB web site.

GATB-PIPELINE: a set of NGS pipelines that link together tools from the previous layer.

Today, the GATB tool box offers cutting-edge technology for designing efficient software to process high-throughput sequencing data, especially short reads. A mid-term objective is to adapt the tool box to long-read technologies.

The GATB-core library is available to academic and industrial users under the AGPL licence. A start-up company project to exploit this technology is currently under discussion (spring 2015).

E. Drezen, G. Rizk, R. Chikhi, C. Deltel, C. Lemaitre, P. Peterlongo, D. Lavenier, GATB: Genome Assembly & Analysis Tool Box, Bioinformatics, 2014
G. Rizk, A. Goin, R. Chikhi, C. Lemaitre, MindTheGap: integrated detection and assembly of short and long insertions, Bioinformatics, August 2014
G. Rizk, D. Lavenier, R. Chikhi, DSK: k-mer counting with very low memory usage, Bioinformatics, 2013 Mar 1;29(5):652-3
R. Chikhi, G. Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter, Algorithms for Molecular Biology 2013, 8:22
G. Collet, G. Rizk, R. Chikhi, D. Lavenier, Minia on Raspberry Pi, assembling a 100 Mbp genome on a Credit Card Sized Computer, Poster at the JOBIM conference, 2013 Jul 1-4 (Toulouse) Best poster award.
K. Salikhov, G. Sacomoto, G. Kucherov, Using Cascading Bloom Filters to Improve the Memory Usage for de Bruijn Graphs, Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 8126, 2013, pp 364-376





A few years ago, genomics underwent a profound change with the advent of High
Throughput Sequencing (HTS), also known as Next Generation Sequencing (NGS).
These technologies generate huge volumes of genomic data. Crucial computational developments
are currently needed to extract knowledge from this mass of data.

The GATB project focuses on one specific, critical HTS processing step: assembly. Genomic assembly
consists of reconstructing a genome from sets of very short DNA or RNA sequences, called reads,
generated by NGS machines. For complex genomes, billions of reads need to be ordered, which is
time-consuming and requires computers with very large memories. This is a serious
bottleneck in many HTS analyses, in both academia and industry.

The INRIA GenScale team has developed fast, innovative assembly algorithms with a very low memory
footprint. Two prototypes, called Monument and Mapsembler, have been developed
as proofs of concept. Monument is dedicated to de novo assembly, i.e. reconstructing complete genomes.
Mapsembler, which is a more general HTS processing tool, makes it possible to assemble specific
regions of interest.

In this project we propose to develop a Genomic Assembly Tool Box allowing end-users to customize
the assembly process according to (1) the nature of the genomic data generated by NGS machines,
(2) the complexity of the genome to assemble, or (3) the specific biological question to be answered.
The final goal is to prepare an industrial transfer targeting a wide range of genomic domains (health, agronomy, ecology, etc.).

Project coordination

Dominique LAVENIER (Institut National de la Recherche en Informatique et Automatique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partner

INRIA Institut National de la Recherche en Informatique et Automatique
INRIA Institut National de la Recherche en Informatique et Automatique
INRIA Institut National de la Recherche en Informatique et Automatique

ANR funding: 183,372 euros
Beginning and duration of the scientific project: January 2013 - 24 Months
