Statistical Methods to Infer Transmissions of Infectious Diseases from deep sequencing data – SMITID
Viruses can cause high-impact epidemics in developing and developed countries alike. For such pathogens, inferring transmission links within a host population or between host populations (e.g. for zoonoses) is crucial for building epidemiological predictions and control strategies. To this end, for fast-evolving pathogens, one can take advantage of the statistical analysis of pathogen sequence data, because these data indicate which hosts contain the most closely related pathogen variants. However, existing models have so far mostly exploited a limited amount of information from sequencing data, such as consensus Sanger sequences, although deep Sanger sequencing (DSS; based on amplicon cloning) and high-throughput sequencing (HTS) techniques can reveal the polymorphic nature of within-host pathogen populations. In this project, we propose a novel modelling and statistical approach that will exploit DSS and HTS data to infer disease transmission links for fast-evolving pathogens, such as viruses, and to infer relationships between transmission and the environment.
Our approach will be based on an original pseudo-evolutionary model (and an associated estimation method) concisely describing transitions between sets of sequences sampled at different times, either from a single host unit or from a host unit and its suspected source. As a proof of concept, we developed a preliminary version of the pseudo-evolutionary model to identify the source of an infection, and we obtained encouraging results. In the project, we will develop this approach to obtain an accurate, robust and rapid method for estimating transmission links based on DSS and HTS data. The approach will be applied to simulated data in order to assess its efficiency with varying sampling effort, with diverse sequencing techniques (corresponding to diverse depths, read lengths and accuracies) and with diverse models of the evolution and transmission of the virus. Then, the approach will be applied to two data sets concerning influenza A viruses sampled from animal populations, to two data sets concerning viruses sampled from wild and cultivated plant populations, and to a data set generated from the 2014 Ebola outbreak. An R software package will be developed to facilitate the dissemination of the approach.
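To make the idea of comparing sets of within-host sequences concrete, the following minimal Python sketch ranks candidate sources for a recipient host. It is not the pseudo-evolutionary model of the project, whose details are not given in this summary; it swaps in a much simpler heuristic (mean distance from each recipient variant to its closest variant in the candidate source), and all function names, host names and sequences are hypothetical.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def source_score(recipient: List[str], source: List[str]) -> float:
    """Mean, over recipient variants, of the distance to the closest source variant.

    Illustrative heuristic only (an assumption of this sketch), not the
    pseudo-evolutionary model described in the project summary.
    """
    return sum(min(hamming(r, s) for s in source) for r in recipient) / len(recipient)


def rank_sources(recipient: List[str], candidates: Dict[str, List[str]]) -> List[str]:
    """Candidate host names sorted from lowest score (most similar) to highest."""
    return sorted(candidates, key=lambda name: source_score(recipient, candidates[name]))


# Hypothetical toy data: three aligned within-host variants per host.
recipient = ["ACGTAC", "ACGTAT", "ACGAAC"]
candidates = {
    "host_A": ["ACGTAC", "ACGTTT", "TCGTAC"],
    "host_B": ["TTGTCC", "TTGTCA", "ATGTCC"],
}
ranking = rank_sources(recipient, candidates)
print(ranking)
```

In this toy case host_A is ranked first because each recipient variant lies within one substitution of some host_A variant. A consensus-only comparison would discard exactly this within-host polymorphism, which is the information DSS and HTS data make available.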
This project will enable major advances in quantitative molecular epidemiology and computational biology, both through its innovative statistical approach and through the ability to automatically infer a large number of transmission links from DSS and HTS data. This should lead to more accurate inferences of transmission links, better insights into the spread of pathogens within or between host populations, a better understanding of the links between transmission and the environment and, consequently, more robust predictions of epidemics and more efficient disease control strategies.
The proposed methodological project requires skills in statistics, probability, modelling, software development, epidemiology, virology and evolutionary biology. Our complementary expertise in these fields and our existing collaborative relationship will allow us to achieve the project goals.
Mr Samuel Soubeyrand (INSTITUT NATIONAL DE LA RECHERCHE AGRONOMIQUE - Biostatistique et processus spatiaux)
The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.
INRA PACA - BioSP INSTITUT NATIONAL DE LA RECHERCHE AGRONOMIQUE - Biostatistique et processus spatiaux
ANR funding: 251,228 euros
Beginning and duration of the scientific project: October 2016 - 48 months