CE45 - Mathematics and digital sciences for biology and health 2019

Search engine for environmental genomic sequencing data – SeqDigger

Search engine for genomic sequencing data

New scaling breakthrough, allowing users to directly query large unassembled raw sequencing data on the fly in order to tap into the largest underexploited resource in life sciences.

Provide an ultra fast and user-friendly search engine for querying genomic data

The central objective of this proposal is to provide an ultra fast and user-friendly search engine that compares a query sequence, typically a read or a gene (or a small set of such sequences), against the exhaustive set of all available data corresponding to one or several large-scale (meta)genomic sequencing project(s), such as New York City metagenome, Human Microbiome Projects (HMP or MetaHIT), Tara Oceans project, Airborne Environment, etc. This would be the first ever occurrence of such a comprehensive tool, and would strongly benefit the scientific community, from environmental genomics to biomedicine.

We will propose core data structures for indexing k-mers from numerous read sets, capable of assigning a k-mer to the read sets in which it occurs (P1) and of providing the abundance of a k-mer in each read set (P2). The practical solution will differ for the two problems. The data structures must have minimal lookup delay and minimal memory footprint, and should be updatable: accepting the addition of new read sets or the removal of read sets (outdated or incorrect metadata). We will separate the work into two distinct subtasks. The first task will focus on proposing core data structures and the second task will be dedicated to their plasticity. Although these two tasks are deeply entangled, we prefer to tackle them in distinct subtasks: the dynamicity of the “core” data structures (Minimal Perfect Hash Functions, Bloom Filters, Counting Quotient Filter, BWT approaches) is a fundamental question that has to be tackled separately from the practical solutions.

During the project, all SeqDigger deployments will be performed by members of the project. All implemented tools will be made open source and available on GitHub. Software will be developed in such a way that it can be easily run by other labs. While we will provide packaged stand-alone tools, supporting too many platforms and architectures would place an unnecessary engineering burden on the project; we will therefore restrict ourselves to ensuring that our software runs well on recent versions of Linux and macOS.
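To make the two queries P1 and P2 concrete, here is a minimal Python sketch. It is not the SeqDigger implementation. It assumes one small Bloom filter per read set for P1 and, as an exact stand-in, one counter per read set for P2. Updatability is reduced to adding or dropping a whole read set. All names (`BloomFilter`, `ReadSetIndex`, `read_sets_containing`, `abundance`) are hypothetical.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits  # assumed to be a multiple of 8
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # One salted hash per seed gives n_hashes bit positions.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


class ReadSetIndex:
    """Toy index over several read sets (illustrative assumption only)."""

    def __init__(self, k):
        self.k = k
        self.filters = {}  # read-set name -> BloomFilter (P1)
        self.counts = {}   # read-set name -> Counter of k-mers (P2, exact)

    def add_read_set(self, name, reads):
        bloom, counts = BloomFilter(), Counter()
        for read in reads:
            for kmer in kmers(read, self.k):
                bloom.add(kmer)
                counts[kmer] += 1
        self.filters[name], self.counts[name] = bloom, counts

    def remove_read_set(self, name):
        del self.filters[name]
        del self.counts[name]

    def read_sets_containing(self, kmer):
        """P1: names of read sets whose filter reports the k-mer."""
        return {name for name, bloom in self.filters.items() if kmer in bloom}

    def abundance(self, kmer):
        """P2: abundance of the k-mer in each read set where it occurs."""
        return {name: c[kmer] for name, c in self.counts.items() if c[kmer] > 0}


index = ReadSetIndex(k=5)
index.add_read_set("ocean", ["ACGTACGTAA"])
index.add_read_set("gut", ["ACGTATTTT"])
print(index.read_sets_containing("ACGTA"))
print(index.abundance("ACGTA"))  # {'ocean': 2, 'gut': 1}
index.remove_read_set("gut")     # update: drop an outdated read set
print(index.abundance("ACGTA"))  # {'ocean': 2}
```

Keeping one structure per read set makes addition and removal trivial, at the cost of a lookup time that grows with the number of read sets. That trade-off is the kind of question the two subtasks address.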

As of September 2025, several algorithmic solutions and their implementations have been proposed to optimize the indexing and searching of k-mers in massive datasets. These approaches aim to improve efficiency, accuracy, and query speed, particularly for genomics and metagenomics applications. Among the most notable solutions are:

kmindex: github.com/natir/kmindex (published in Nature Computational Science, DOI:10.1038/s43588-024-00596-6)

kmindex is a groundbreaking approach capable of indexing thousands of metagenomes and performing sequence searches in a fraction of a second. Index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With a negligible false positive rate (below 0.01%), kmindex outperforms the precision of existing approaches by four orders of magnitude. Its scalability has been demonstrated by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project.

findere: github.com/lrobidou/findere (published at SPIRE 2021)

This is a simple and effective strategy for reducing false positive rates in any Approximate Membership Query (AMQ) data structure indexing k-mers. The method accelerates queries by a factor of two while reducing false positives by two orders of magnitude. It operates on-the-fly, without modifying the original indexing data structure, without introducing false negatives, and without memory overhead. With no major drawbacks, it either reduces the false positive rate or decreases the space required to represent a given set, while respecting a user-defined false positive rate.
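The idea can be illustrated with a short Python sketch. It assumes, as our reading of the approach and not as a statement from this summary, that the AMQ indexes shorter words (s-mers) and that a queried k-mer is reported present only if all of its s-mers are reported present. `ToyAMQ` is a hypothetical stand-in for a real AMQ with hand-picked false positives, and the query-skipping optimisation that speeds up findere is not reproduced.

```python
def smers(seq, s):
    """Return all substrings of length s of seq, in order."""
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]


class ToyAMQ:
    """Stand-in for an AMQ: an exact set plus hand-picked false positives."""

    def __init__(self, items, false_positives=()):
        self._items = set(items)
        self._false_positives = set(false_positives)

    def __contains__(self, item):
        return item in self._items or item in self._false_positives


def index_sequence(seq, s, false_positives=()):
    """Index the s-mers of seq (not its k-mers) in the toy AMQ."""
    return ToyAMQ(smers(seq, s), false_positives)


def contains_kmer(amq, kmer, s):
    """Report a k-mer as present only if every one of its s-mers is."""
    return all(smer in amq for smer in smers(kmer, s))


# Index 4-mers, query 6-mers; "TTTT" is a simulated false positive.
amq = index_sequence("ACGTACGGA", s=4, false_positives={"TTTT"})
print("TTTT" in amq)                      # True: a false positive of the AMQ
print(contains_kmer(amq, "ACGTAC", s=4))  # True: all three 4-mers are indexed
print(contains_kmer(amq, "ATTTTC", s=4))  # False: "ATTT" and "TTTC" are absent
```

A single false positive is no longer enough to produce a wrong answer: several overlapping s-mers would all have to be false positives at once, which is why the wrapped structure itself does not need to change.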

fimpera: github.com/lrobidou/fimpera

Fimpera is a simple strategy for reducing false positives in any AMQ data structure that supports abundance queries. The proposed implementation uses a counting Bloom filter and provides a method for indexing and querying k-mers from biological sequences (in fastq or fasta format, compressed or not). Thanks to its use of templates, fimpera can be easily adapted to other AMQs supporting abundance queries, for a wide range of applications.
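A comparable sketch for abundance queries is given below, in Python and not in fimpera's templated implementation. It assumes that s-mers are stored in a counting Bloom filter and that the abundance of a queried k-mer is estimated as the minimum abundance of its s-mers. That assumption is ours and is not stated in this summary. The class and function names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter: counts may be overestimated, never underestimated."""

    def __init__(self, n_counters=1 << 16, n_hashes=3):
        self.n_hashes = n_hashes
        self.counters = [0] * n_counters

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counters)

    def add(self, item):
        # set() avoids incrementing twice if two hashes hit the same counter.
        for p in set(self._positions(item)):
            self.counters[p] += 1

    def count(self, item):
        return min(self.counters[p] for p in self._positions(item))


def smers(seq, s):
    """Return all substrings of length s of seq, in order."""
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]


def index_reads(reads, s):
    """Count every s-mer occurrence of every read in the filter."""
    cbf = CountingBloomFilter()
    for read in reads:
        for smer in smers(read, s):
            cbf.add(smer)
    return cbf


def kmer_abundance(cbf, kmer, s):
    """Estimate a k-mer's abundance as the minimum count over its s-mers."""
    return min(cbf.count(smer) for smer in smers(kmer, s))


cbf = index_reads(["ACGTACGT", "ACGTAC"], s=4)
# "ACGTAC" occurs once in each read, so the estimate is at least 2.
print(kmer_abundance(cbf, "ACGTAC", s=4))
```

Taking the minimum over several s-mers means that an overestimated counter only affects the answer if every s-mer of the k-mer is overestimated, which is the abundance analogue of the presence check used for membership queries.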

These solutions have paved the way for significant advancements in genomic data analysis, enabling fast, precise, and scalable indexing, which is essential for projects like Tara Oceans and other large-scale metagenomic initiatives.

Scaling up to index petabytes of raw sequencing data.

We are currently witnessing a deep knowledge revolution due to the availability of exponentially expanding sequence databases, made possible by the continuously accelerating throughput of sequencing techniques. This trend is highlighted, for instance, by the Earth BioGenome Project, which was presented at the World Economic Forum in Davos in 2018; this project aims to “use genomics to help discover the remaining 80 to 90 percent of species that are currently hidden from science”.

Sequencing data is accumulating faster than Moore’s Law, bringing fundamental new biological insights, conjecture, and understanding, with impacts on medicine, agronomy and ecology. The main objectives have been to assemble new genomes in order to compare specific organisms to representative reference species, highlighting genomic variations that reveal genetic properties correlated to ecological, agronomical or clinical markers. Today, the International Nucleotide Sequence Database Collaboration (INSDC) Sequence Read Archive (SRA) stores over 10,000 Pb nucleotides in the form of short sequences (<1000 bp), which represent fragments from generally unknown genomic locations (randomly sampled “reads” from shotgun sequencing projects).

However, the overwhelming majority of those sequences have only been analysed within the context of a single project, each addressing only a small fraction of the total resource. It is therefore of primary importance to maintain a pattern of diversity for meta-analyses in the future and to develop technologies to interrogate data across project boundaries. Access to entire data sets, as opposed to a single read set or a limited number of read sets, would provide researchers with unparalleled opportunities to make novel discoveries.

Unfortunately, raw sequences stored in genomic data banks such as the SRA are not indexed and therefore cannot be queried efficiently, apart from direct accession lookups. Oftentimes, these data sets are never revisited because of the huge overhead involved in manipulating such voluminous data. Today, it would be unthinkable to access the Internet without powerful search engines. However, this is precisely the current situation for raw read archives, where precious data sleep undisturbed in rarely-opened drawers.

The central objective of this proposal is to provide an ultra fast and user-friendly search engine that compares a query sequence, typically a read or a gene (or a small set of such sequences), against the exhaustive set of all available data corresponding to one or several large-scale metagenomic sequencing project(s), such as New York City metagenome, Human Microbiome Projects (HMP or MetaHIT), Tara Oceans project, Airborne Environment, etc. This would be the first ever occurrence of such a comprehensive tool, and would strongly benefit the scientific community, from environmental genomics to biomedicine.

Project coordination

Pierre Peterlongo (Centre de Recherche Inria Rennes - Bretagne Atlantique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partnership

Inria Rennes - Bretagne Atlantique: Centre de Recherche Inria Rennes - Bretagne Atlantique
UMR 8030 / CEA: UMR 8030 / GENOSCOPE / CEA
AMU-MIO: UNIVERSITE d'AIX-MARSEILLE - Institut Méditerranéen d’Océanologie
IP: INSTITUT PASTEUR

ANR funding: 544,269 euros
Beginning and duration of the scientific project: December 2019 - 48 Months
