Data science

Objective: To develop statistical and population genetic methods to understand genetic diversity and the evolution of pathogen and vector populations.

Essential systems are now in place to isolate and sequence pathogens from many thousands of clinical samples per year. This influx of data presents new challenges for informatics and data analysis, including the sheer volume of data, which can be on the order of billions of base pairs for each sample sequenced.

The unprecedented scale of this raw genetic data can be harnessed to increase the statistical power of genomic analyses and, as a result, our confidence in the findings. But to do this, researchers need to overcome several non-trivial challenges inherent in assembling the whole genome sequences of parasites and the vectors that transmit them.

Parasite and vector genomes tend to be much larger and more complex than those of bacteria and viruses, and are characterised by high rates of genetic recombination. Complex life-cycles can present additional challenges. DNA sequence data needs to be standardised and integrated, so that the results of different studies can be combined to construct an accurate picture of genome variation in pathogen populations around the world.

From the outset, we’ve worked to develop quality-assured, standard methods for the lab and the field, including for sample collection, DNA extraction, and our work on sequencing from small sample sizes. Concurrently, we’re developing statistical and computational methods to derive quality-assured genotypes and to facilitate the analysis of large-scale genomic and epidemiological data.

The methodological advances we’ve made have played a crucial role in our ability to generate and analyse large, high-quality data sets to uncover profound insights into the diversity and evolution of pathogen populations around the world. One example of this work is the discovery of distinct patterns of genetic variation associated with drug resistance in malaria parasite populations. These findings demonstrate that large-scale genomics, especially when combined with epidemiological data, can offer detailed insight into the way that pathogen populations are responding to public health interventions.

Key activities

A broad range of research pertains to this theme, including:

  • Evaluating and optimising sequencing technologies
  • Generating and improving reference sequences
  • Joint analysis of human and pathogen phenotypes
  • Developing the statistical framework and methods for analysing pathogen phenotypes in highly structured populations
  • Analysing geographical structuring of hypervariable immune-related genes