CE45 - Mathematics and digital sciences for biology and health

Search engine for environmental genomic sequencing data – SeqDigger

Search engine for genomic sequencing data

A scaling breakthrough that lets users query large, unassembled raw sequencing data directly and on the fly, tapping into the largest underexploited resource in the life sciences.

Provide an ultra fast and user-friendly search engine for querying genomic data

The central objective of this proposal is to provide an ultra fast and user-friendly search engine that compares a query sequence, typically a read or a gene (or a small set of such sequences), against the exhaustive set of all available data corresponding to one or several large-scale (meta)genomic sequencing projects, such as the New York City metagenome, the Human Microbiome Projects (HMP or MetaHIT), the Tara Oceans project, Airborne Environment, etc. This would be the first tool of such a comprehensive scope, and it would strongly benefit the scientific community, from environmental genomics to biomedicine.

New data structures and sequence analysis tools

We will propose core data structures for indexing k-mers from numerous read sets, capable of assigning a k-mer to the read sets in which it occurs (P1) and of providing the abundance of a k-mer in each read set (P2). The practical solution will differ for the two problems. The data structures must have minimal lookup delay and minimal memory footprint, and should be updatable: accepting the addition of new read sets or the removal of read sets (outdated or incorrect metadata). We will separate the work into two distinct subtasks. The first will focus on proposing core data structures and the second will be dedicated to their plasticity. Although these two tasks are deeply entangled, we prefer to tackle them separately: the dynamicity of the “core” data structures (Minimal Perfect Hash Functions, Bloom Filters, Counting Quotient Filter, BWT approaches) is a fundamental question that has to be addressed independently of the practical solutions.
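The two problems P1 and P2, together with the update requirement, can be made concrete with a naive exact baseline. The sketch below is only an illustration: the class `NaiveKmerIndex` and its method names are hypothetical, and it stores every k-mer in a plain dictionary, which is exactly the memory cost that the compact structures named above are meant to avoid.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class NaiveKmerIndex:
    """Exact, uncompressed k-mer index over several read sets (hypothetical baseline)."""

    def __init__(self, k):
        self.k = k
        # k-mer -> {read set name -> number of occurrences}
        self.table = defaultdict(dict)

    def add_read_set(self, name, reads):
        """Update: register all k-mers of a new read set."""
        for read in reads:
            for km in kmers(read, self.k):
                counts = self.table[km]
                counts[name] = counts.get(name, 0) + 1

    def remove_read_set(self, name):
        """Update: forget a read set and drop k-mers left without any owner."""
        for km in list(self.table):
            self.table[km].pop(name, None)
            if not self.table[km]:
                del self.table[km]

    def read_sets(self, kmer):
        """P1: names of the read sets in which kmer occurs."""
        return set(self.table.get(kmer, {}))

    def abundance(self, kmer):
        """P2: number of occurrences of kmer in each read set."""
        return dict(self.table.get(kmer, {}))


index = NaiveKmerIndex(k=3)
index.add_read_set("A", ["ACGTAC"])
index.add_read_set("B", ["CGTCGT"])
print(index.read_sets("CGT"))   # read sets containing CGT
print(index.abundance("CGT"))   # per-read-set counts of CGT
index.remove_read_set("B")
print(index.read_sets("CGT"))   # read set B no longer appears
```

Removal here scans the whole table, which hints at why plasticity is treated as a subtask of its own: supporting deletions cheaply is much harder once the index is compressed or probabilistic.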
During the project, all SeqDigger deployments will be performed by members of the project. All implemented tools will be made open-source and available on GitHub. Software will be developed in such a way that it can be easily run by other labs. While we will provide packaged stand-alone tools, supporting too many platforms and architectures would place an unnecessary engineering burden on the project. We will therefore restrict ourselves to ensuring that our software runs correctly on recent versions of Linux and macOS.

Results

To date (July 2021), two algorithmic solutions and their implementations have been proposed.
- kmtricks: github.com/tlemane/kmtricks (preprint www.biorxiv.org/content/10.1101/2021.02.16.429304v1). Proposes a novel, efficient way to generate a set of data structures (Bloom filters) used downstream for indexing large numbers of huge datasets (up to dozens of terabytes so far).
- findere: github.com/lrobidou/findere (SPIRE 2021, preprint www.biorxiv.org/content/10.1101/2021.05.31.446182v1). This is a simple strategy, with its implementation, for reducing the false-positive rate of any approximate membership query (AMQ) data structure indexing k-mers (words of length k). The method speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls, and with no memory overhead. With no drawback, this method, as simple as it is effective, reduces either the false-positive rate or the space required to represent a set for a user-defined false-positive rate.
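One way a query-time reduction of this kind can work is sketched below, under the assumption that a longer word of length k + z is reported present only when all of its z + 1 consecutive k-mers are found in the AMQ. This is an illustration of that general idea, not the findere code: `BloomFilter`, `index_sequence`, `long_kmer_hits`, and the parameter `z` are hypothetical names, and the toy Bloom filter uses one byte per bit for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical helper): false positives possible, no false negatives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive n_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def index_sequence(amq, seq, k):
    """Insert every k-mer of seq into amq (any object with add and `in`)."""
    for i in range(len(seq) - k + 1):
        amq.add(seq[i:i + k])


def long_kmer_hits(amq, query, k, z):
    """For each (k+z)-mer of query, report True only if all z+1 of its
    constituent k-mers are reported present by amq."""
    hits = [query[i:i + k] in amq for i in range(len(query) - k + 1)]
    n_long = len(query) - (k + z) + 1
    return [all(hits[i:i + z + 1]) for i in range(n_long)]


bf = BloomFilter(n_bits=1024, n_hashes=3)
index_sequence(bf, "ACGTACGT", k=3)
# Every 5-mer of the indexed sequence is found: no false negatives.
print(long_kmer_hits(bf, "ACGTACGT", k=3, z=2))
```

The index itself is untouched and no extra memory is used. A false positive on a long word now requires z + 1 simultaneous false positives on its k-mers, while a word that is truly present still has all of its k-mers in the filter.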

Prospects

Scale up to index petabytes (PB) of raw sequencing data.

Scientific productions and patents

findere: github.com/lrobidou/findere (SPIRE 2021, preprint www.biorxiv.org/content/10.1101/2021.05.31.446182v1)

kmtricks: github.com/tlemane/kmtricks (preprint www.biorxiv.org/content/10.1101/2021.02.16.429304v1)

Automated strain separation in low-complexity metagenomes using long reads
www.biorxiv.org/content/10.1101/2021.02.24.429166v2.abstract

Submission summary

We are currently witnessing a deep knowledge revolution due to the availability of exponentially expanding sequence databases, made possible by the continuously accelerating throughput of sequencing techniques. This trend is highlighted, for instance, by the Earth BioGenome Project, which was presented at the World Economic Forum in Davos in 2018. This project aims to “use genomics to help discover the remaining 80 to 90 percent of species that are currently hidden from science”.

Sequencing data is accumulating faster than Moore’s Law, bringing fundamental new biological insights, conjecture, and understanding, with impacts on medicine, agronomy and ecology. The main objectives have been to assemble new genomes in order to compare specific organisms to representative reference species, highlighting genomic variations that reveal genetic properties correlated to ecological, agronomical or clinical markers. Today, the International Nucleotide Sequence Database Collaboration (INSDC) Sequence Read Archive (SRA) stores over 10,000 Pb nucleotides in the form of short sequences (<1000 bp), which represent fragments from generally unknown genomic locations (randomly sampled “reads” from shotgun sequencing projects).

However, the overwhelming majority of those sequences have only been analysed within the context of a single project, each addressing only a small fraction of the total resource. It is therefore of primary importance to maintain a pattern of diversity for meta-analyses in the future and to develop technologies to interrogate data across project boundaries. Access to entire data sets, as opposed to a single or a limited number of read sets, would provide researchers with unparalleled opportunities to make novel discoveries.

Unfortunately, raw sequences stored in genomic data banks such as the SRA are not indexed and therefore cannot be queried efficiently, apart from direct accession lookups. Oftentimes, these data sets are never revisited because of the huge overhead involved in manipulating such voluminous data. Today, it would be unthinkable to access the Internet without powerful search engines. However, this is precisely the current situation for raw read archives, where precious data sleep undisturbed in rarely-opened drawers.

The central objective of this proposal is to provide an ultra fast and user-friendly search engine that compares a query sequence, typically a read or a gene (or a small set of such sequences), against the exhaustive set of all available data corresponding to one or several large-scale metagenomic sequencing projects, such as the New York City metagenome, the Human Microbiome Projects (HMP or MetaHIT), the Tara Oceans project, Airborne Environment, etc. This would be the first tool of such a comprehensive scope, and it would strongly benefit the scientific community, from environmental genomics to biomedicine.

Pierre Peterlongo (Centre de Recherche Inria Rennes - Bretagne Atlantique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

IP INSTITUT PASTEUR
AMU-MIO UNIVERSITE d'AIX-MARSEILLE-Institut Méditerranéen d’Océanologie
UMR 8030 / CEA UMR 8030 / GENOSCOPE / CEA
Inria Rennes - Bretagne Atlantique Centre de Recherche Inria Rennes - Bretagne Atlantique

ANR funding: 544,306 euros
Beginning and duration of the scientific project: December 2019 - 48 months
