Large public data set gives unprecedented view of African mosquito genome variation
The first major dataset from the Anopheles gambiae 1000 Genomes (Ag1000G) project has been released. Comprising data on genetic variation between 765 mosquitoes from 8 countries spanning sub-Saharan Africa, it is one of the largest public datasets on genome variation in any species and a key resource for malaria research.
The Ag1000G project has made this data release available through a web application using the Panoptes software developed by the MRC Centre for Genomics and Global Health. “The sheer scale and complexity of these data can be daunting, even for a seasoned bioinformatician,” says Alistair Miles, who is leading data analyses for the Ag1000G project. “The Panoptes web application has completely transformed our ability to visualise and interact with large-scale genome variation data. We hope it will enable many people to access the Ag1000G data and explore its full potential.”
Anopheles gambiae is one of the primary vectors of the malaria parasite Plasmodium falciparum. This new dataset is a major advance in the breadth and depth of data available to researchers studying how the species has become such an effective transmitter of malaria and tracking the emergence and spread of insecticide resistance.
The release incorporates data on 44 million single nucleotide polymorphisms (SNPs). That is more than were discovered by the human 1000 Genomes Project from a similar number of genomes, even though the human genome is more than ten times larger than that of A. gambiae, and it highlights the spectacular natural diversity that exists in mosquito populations.
“Understanding how and why mosquitoes are genetically different from each other is fundamental to many areas of malaria research,” says Dominic Kwiatkowski, one of the project’s founders and chair of the Ag1000G data analysis group. “We hope these data will be a valuable resource for the community and will lead to new discoveries that make a difference for malaria control in Africa.”
The Ag1000G project is using whole genome deep sequencing to provide a high-resolution view of genetic variation in natural populations of A. gambiae. The wild-caught mosquitoes are collected at field sites across Africa by Ag1000G partners and sequenced in the UK at the Wellcome Trust Sanger Institute in Hinxton.
Martin Donnelly, of the Liverpool School of Tropical Medicine, explained the importance of this shared data resource. “We hope that these data will allow the Anopheles research community to understand the evolutionary processes that make Anopheles gambiae such a formidable malaria vector.” Professor Donnelly is a co-founder of the project and chairs the Ag1000G partner working group, which brings together partners from 13 research institutions.
The project has a strong ethos of releasing data prior to publication, in the expectation that they will be valuable to other researchers and in keeping with the Fort Lauderdale principles regarding the publication of global analyses of the data.
Scientific collaboration is a well-established path for the creation of big-data resources, particularly in the area of genomics. However, datasets of this size offer both extraordinary possibilities and challenges when it comes to data analysis. By using the Panoptes software, the Ag1000G Consortium is enabling researchers across the world to interrogate the world’s largest public dataset on genome variation in a clinically important disease vector.
For more information about the Panoptes software framework, visit Panoptes on GitHub. To find out more about the Ag1000G project, visit the Ag1000G home page. For technical information about this data release, visit the Ag1000G phase 1 AR2 data release web page.