The Vertebrate Genomes Project

About VGP

Deciphering the genetic code of organisms has become an essential part of understanding life. Being able to access the complete set of genetic information of an organism is critical for scientists to address the pressing biological, ecological, and medical questions of our time.

Recent advances in genomic technologies now allow us to produce genomes of the quality needed for meaningful research. The Vertebrate Genomes Project (VGP), an international collaboration of hundreds of scientists from around the world, is working to generate the most accurate and complete reference genome assemblies of all 70,000+ extant vertebrate species (see details of our approach).

Four phases to achieve a moonshot project

To tackle all 70,000 species, we plan to sequence the genomes in four phases. Phase 1 includes 268 species representing all vertebrate orders with a divergence time of 50 million years ago or greater from their most recent common ordinal ancestor, including human and several species on the brink of extinction. Phase 2 will cover approximately 1,100 species representing all families, Phase 3 over 10,000 species representing all genera, and Phase 4 all ~70,000 species.

High-quality reference genome assemblies are valuable resources for many biological fields, including conservation genomics, biomedical research and comparative genomics. Expanding the availability of such genome assemblies to all vertebrates will not only allow us to answer long-standing questions but will also enable biologists to ask new ones.

A view of the institutes around the world that will collectively contribute to accomplishing the goals set in this proposal.

Collective structure of collaboration

The VGL is part of an ecosystem of VGP hubs located around the world. The Wellcome Sanger Institute DToL project (UK) and the Max Planck Dresden Genome Center (Germany) are already established VGP hubs, while Minderoo, AfricaBP, Monash University Malaysia, the Qatar Falcon Genome Project, and Amazoomics are VGP hubs that are currently being established. These hubs, including the VGL, are located in 8 countries on 7 continents and play a key role in collectively achieving the goals of the project. We are also collaborating with the Earth BioGenome Project and the European Reference Genome Atlas.

▶︎ Check our publication: "Towards complete and error-free genome assemblies of all vertebrate species". https://doi.org/10.1038/s41586-021-03451-0

Phase 1 progress

We are currently finishing Phase 1 of the project (264 genomes) and are more than halfway done (see graph below). Phase 1 is set to be completed in 2022. Raw data and assemblies are made publicly available immediately after production in our GenomeArk repository, and can be used in accordance with our data policy (see the bottom of this page for full details).

Conservation genomics

The high-quality reference genomes we create with the VGP can also be a valuable resource for species conservation research. Having a complete and accurate picture of a species' genetic makeup can provide key information for managing populations, assessing population health, and planning conservation efforts.

▶︎ Check our publication: "The era of reference genomes in conservation genomics". https://doi.org/10.1016/j.tree.2021.11.008

The kākāpō

The kākāpō (Strigops habroptilus) is an emblematic bird endemic to New Zealand and deeply associated with Māori culture. This flightless parrot was once very abundant throughout the North and South Islands, occupying many different habitats. During the European colonization of the nineteenth century, its population rapidly declined due to many factors, including habitat destruction, overhunting, and the introduction of mammalian predators such as stoats, weasels and rats. By the 1970s, only 50 individuals remained, pushing the species close to extinction. Thanks to the Kākāpō Recovery Program, the population is now back to 201 individuals. Inbreeding and the accumulation of deleterious mutations can be exacerbated by such a population bottleneck, so genetic monitoring combined with breeding programs is essential for maintaining the health of a recovering population. Analyses of population data using the reference genome we generated in collaboration with Bruce Robertson and Nicolas Dussex (University of Otago) demonstrated that many deleterious mutations were purged from kākāpō genomes over the last 10,000 years, making the survival and recovery of the small population possible.


▶︎ Raw data and assembly available here.

▶︎ Check our publication: "Population genomics of the critically endangered kākāpō". https://doi.org/10.1016/j.xgen.2021.100002

The black rhinoceros

The black rhinoceros (Diceros bicornis) is a native species of southern Africa. This species was very common at the beginning of the 20th century (several hundred thousand individuals). A combination of hunting, habitat destruction and other human-related pressures contributed significantly to the species' decline, down to 2,410 individuals in 2004. Conservation efforts have helped the species recover to 5,500 individuals in 2019. Habitat protection, poaching prevention, breeding and reintroduction are key strategies that can rescue this species in the coming decade. A high-quality genome will be useful for assessing population health and guiding reintroduction and breeding. For this purpose, and in collaboration with Klaus-Peter Koepfli (Smithsonian Conservation Biology Institute), we have sequenced the genome of a black rhino trio (offspring and parents). Further analyses are in progress.


▶︎ Raw data and assembly available here.

The vaquita

The vaquita (Phocoena sinus) is a critically endangered marine mammal. Fewer than 19 individuals remain. Over the past two decades, its population has decreased significantly, mostly due to bycatch in large-mesh gillnets. In collaboration with Phillip Morin (NOAA), we generated and assembled a high-quality reference genome of the vaquita. This assembly allowed us to infer the demographic history of the species. A decrease in population size often causes inbreeding depression, which reduces the biological fitness of the population. We found that the vaquita population has remained very small (<5,000 individuals) for over 200,000 years. These results suggest that the vaquita population has probably purged deleterious alleles, and they raise hope for a possible species recovery.


▶︎ Raw data and assembly available here.

▶︎ Check our publication: "Reference genome and demographic history of the most endangered marine mammal, the vaquita". https://doi.org/10.1111/1755-0998.13284

The leatherback sea turtle

The leatherback sea turtle (Dermochelys coriacea) is the largest turtle. Human activities have caused a significant decline in its populations, some of which have declined by more than 95%. In collaboration with Lisa Komoroske (University of Massachusetts, Amherst), we have generated and assembled a high-quality reference genome of the leatherback sea turtle. Demographic and diversity analyses using this assembly showed extremely low diversity and a high proportion of deleterious variants. These results raise concerns about the future of this species. This high-quality genome assembly will allow conservation biologists to further study the global distribution and long-distance migratory connectivity of this species with additional population markers, and it provides a valuable resource to guide future conservation efforts.


▶︎ Raw data and assembly available here.

▶︎ Check our preprint: "Differential sensory and immune gene evolution in sea turtles with contrasting demographic and life history". https://doi.org/10.1101/2022.01.10.475373

Biomedical research

High-quality reference genomes of animal species can also help guide the forefront of biomedical research. Scientists can learn the genetic basis of regeneration from the axolotl (Ambystoma mexicanum), or of cancer resistance from the naked mole-rat (Heterocephalus glaber), and apply it to the development of new therapeutics. These reference genomes can also help scientists understand emerging zoonotic diseases, identify intermediate host species, and inform strategies for disease intervention.

The Chinese pangolin

High-quality reference genomes play an important role in understanding zoonotic diseases such as COVID-19. By comparing the genomes of species that are susceptible to, carriers of, or immune to the virus, scientists can design better vaccines and treatments. They can also use these genomes to determine which local wildlife or domestic species could serve as natural reservoirs or intermediate hosts of SARS-CoV-2, and thereby help curb the further spread of the virus. In collaboration with William Murphy (Texas A&M University), we have sequenced the genome of the Chinese pangolin (Manis pentadactyla), suspected to be at the origin of the SARS-CoV-2 pandemic, and are in the process of performing the assembly.


▶︎ Raw data available here.

The Nile rat

The Nile rat (Arvicanthis niloticus) is a model for studying type 2 diabetes. Other rodents, such as the mouse and the rat, are used as model systems in medical research, but not all traits of interest can be modeled well by these organisms. Specifically, the Nile rat rapidly develops diet-induced diabetes when fed a conventional rodent diet and exhibits human-like symptoms. In contrast, the mouse and the rat are quite resistant to type 2 diabetes. In collaboration with Yury Bukhman (Morgridge Institute for Research), we have generated and assembled a high-quality reference genome of the Nile rat. This assembly will potentially expand the use of this model species in future genetic studies.


▶︎ Raw data and assembly available here.

▶︎ Check our preprint: "A haplotype-resolved genome assembly of the Nile rat facilitates exploration of the genetic basis of diabetes". https://doi.org/10.1101/2021.12.08.471837

The ring-tailed lemur

The ring-tailed lemur (Lemur catta) is a species endemic to Madagascar. Lemurs are an evolutionarily interesting group of primates. Like other strepsirrhine primates, they are a basal group that shares many ancestral traits with early primates. Comparative genomics is a powerful approach for studying gene function and disease susceptibility, but comparing the human genome to only the macaque and mouse genomes can be limiting. Extending high-quality genomic resources to all non-human primates, including lemurs, would provide a valuable resource for the biomedical field. In collaboration with Tomas Marques-Bonet and Marc Palmada-Flores (Universitat Pompeu Fabra-CSIC) and Mads F Bertelsen (University of Copenhagen), we have generated and assembled a high-quality reference genome of the ring-tailed lemur.

SARS-CoV-2, which in humans leads to the disease COVID-19, has caused a worldwide pandemic. The primary entry point of the virus is the cellular receptor angiotensin-converting enzyme-2 (ACE2). We have used these genomic data to study SARS-CoV-2 susceptibility. Specifically, we compared 71 strepsirrhine primate ACE2 sequences, including that of the ring-tailed lemur. While it is known that all primates are at risk of infection, we observed that the risk is not equal across the order. Our results suggest that several species of lemurs have a higher potential susceptibility to SARS-CoV-2 infection.


▶︎ Raw data available here.

▶︎ Check our publication: "Variation in predicted COVID-19 risk among lemurs and lorises". https://doi.org/10.1002/ajp.23255

▶︎ Check our publication: "A high-quality, long-read genome assembly of the endangered ring-tailed lemur (Lemur catta)". https://doi.org/10.1093/gigascience/giac026

The common marmoset

The common marmoset (Callithrix jacchus) is a widely used primate model system for biomedical research, especially in neuroscience, stem cell biology and regenerative medicine. In collaboration with Stephanie Marcus (The Rockefeller University) and Guojie Zhang (University of Copenhagen), we have generated and assembled a high-quality reference genome of the common marmoset using a trio approach (offspring and parents). By comparing this genome with the human genome, we found four genes with fixed substitutions in the marmoset that encode amino acids known to be pathogenic in humans, where they result in nervous system diseases. This finding suggests that the marmoset has acquired mechanisms to compensate for these mutations. Further studies of the genomic context of these mutations may provide crucial information for understanding the mechanisms of their pathological effect.


▶︎ Raw data and assembly available here.

▶︎ Check our publication: "Evolutionary and biomedical insights from a marmoset diploid genome assembly". https://doi.org/10.1038/s41586-021-03535-x

Comparative genomics

One of the benefits of having a database of high-quality reference genomes from a variety of species is that it allows scientists to understand the genetics behind some of the most complex behaviors found in evolutionarily diverse species. Comparing the genomes of different species allows scientists to home in on the genes that drive uniquely shared traits.

The bottlenose dolphin

The bottlenose dolphin (Tursiops truncatus) is capable of vocal mimicry, a behavior found in only a handful of other species, such as humans and parrots. Scientists can use high-quality reference genomes to identify the genes responsible for the neural circuitry required for this incredibly rare ability. Dolphins are an especially useful species to compare to humans, as they have been shown to use learned vocalizations to identify themselves to others (known as “signature whistles”), similar to how humans use names. In collaboration with Brigid Maloney and Marcelo Magnasco (The Rockefeller University), we have generated and assembled a high-quality reference genome of the bottlenose dolphin using a trio approach (offspring and parents). Further analyses are in progress.


▶︎ Raw data and assembly available here.

The zebra finch

The zebra finch (Taeniopygia guttata) is a model system for studying the neurogenetics of vocal learning. Functional genomics studies (RNA-seq, methyl-seq, ChIP-seq) require a reference genome. However, the quality of the genome significantly influences any biological interpretation of the data. For example, an incorrect genome assembly can lead to wrong inferences of gene homology or gene structure. The recurrence of such erroneous conclusions convinced us to remedy this lack of an accurate genomic resource. We have generated and assembled a high-quality reference genome of the zebra finch using a trio approach (offspring and parents). With this improved resource, we were able to reevaluate and correct the hierarchical naming system of the avian brain nomenclature.


▶︎ Raw data and assembly available here.

▶︎ Check our publication: "Towards complete and error-free genome assemblies of all vertebrate species". https://doi.org/10.1038/s41586-021-03451-0

▶︎ Check our publication: "As above, so below: Whole transcriptome profiling demonstrates strong molecular similarities between avian dorsal and ventral pallial subdivisions". https://doi.org/10.1002/cne.25159

The barn swallow

The barn swallow (Hirundo rustica) is a small, semi-colonial, migratory bird. Thanks to its nesting habits, typically connected to farms and other human-made structures, it became one of the bird species most closely tied to humans and thus acquired significant cultural value worldwide. Although it is not an endangered species, the modernization of agricultural practices and climate change raise concerns about the stability of barn swallow population sizes in the years to come. In collaboration with Simona Secomandi and Luca Gianfranceschi (University of Milan), we generated a chromosome-level reference genome for the European barn swallow. A pangenome for this subspecies was also generated. This genome was compared with other bird genomes to identify genes under selection, and was used to perform genome-wide linkage disequilibrium scans using all publicly available data. We found that several genes associated with the onset of tameness-related traits, such as stress response and fear memory formation, were under selection, likely driven by the barn swallow's strict association with humans.


▶︎ Raw data and assembly available here.

▶︎ Check our preprint: "Pangenomics provides insights into the role of synanthropy in barn swallow evolution". https://doi.org/10.1101/2022.03.28.486082

Data policy

Data release policy

  • The VGP will release raw, intermediate and final genome assemblies and transcriptome data immediately after production to a public S3 Bucket specifically dedicated to the VGP.

  • Soon after the final assembly is produced, the VGP will release raw and assembled genome and transcriptome data publicly through GenBank and Gene Expression Omnibus (GEO) at NCBI, the European Nucleotide Archive (ENA) and ArrayExpress at EMBL-EBI, and the DNA Data Bank of Japan (DDBJ).

  • Data can be viewed and used by anyone, but publication by non-VGP members is limited by our data embargo policy.

Data embargo policy for non-VGP

  • Non-VGP members may not publish or present on VGP genomes until 2 years after public release of the data or until the VGP has published or presented on a genome. Early permission may also be granted for publication by the VGP member(s) affiliated with a genome.

  • Exceptions to the policy also include analyses of either a single locus, or a single gene family in a species, or a maximum of 5 gene loci across multiple species, or for use as a reference for mapping reads from independent studies by non-VGP members. 
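
As a reading aid, the embargo rules above can be sketched as a small check. This is an illustrative sketch only, not an official VGP tool: the names Genome, embargo_end and may_publish are hypothetical, the listed exceptions are collapsed into a single exempt_use flag that the caller must judge against the policy text, and treating the embargo as lifted on the exact two-year anniversary of release is an assumption.

```python
from dataclasses import dataclass
from datetime import date

EMBARGO_YEARS = 2  # from the policy text: 2 years after public release


def embargo_end(release: date) -> date:
    """Return the assumed date on which the embargo lapses."""
    try:
        return release.replace(year=release.year + EMBARGO_YEARS)
    except ValueError:
        # Release on 29 Feb and the target year is not a leap year:
        # fall back to 28 Feb (an assumption, not stated in the policy).
        return release.replace(year=release.year + EMBARGO_YEARS, day=28)


@dataclass
class Genome:
    """Hypothetical record of the facts the embargo policy depends on."""
    public_release: date
    vgp_has_published: bool = False   # VGP has published or presented on it
    early_permission: bool = False    # granted by the affiliated VGP member(s)


def may_publish(genome: Genome, today: date, exempt_use: bool = False) -> bool:
    """Return True if a non-VGP member may publish under this reading.

    exempt_use stands for any of the listed exceptions (single locus,
    single gene family in one species, at most 5 gene loci across species,
    or use as a mapping reference for independent studies).
    """
    return (
        exempt_use
        or genome.vgp_has_published
        or genome.early_permission
        or today >= embargo_end(genome.public_release)
    )


# Example with made-up dates: released 28 April 2021, checked 1 January 2022.
genome = Genome(public_release=date(2021, 4, 28))
print(may_publish(genome, date(2022, 1, 1)))
```

In any real case the policy text and the affiliated VGP member(s) remain the authority; the sketch only makes explicit that each condition on its own is enough to lift the restriction.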

Data policy for VGP members

  • A VGP member is defined as anyone who has provided samples, funded a genome, and/or contributed to data production of a genome.

  • A VGP member may publish at any time on a genome which they have funded and/or provided the samples for, and is not subject to the embargo.