Viral reference sequences
There are three kinds of reference sets for virus sequences:
Viral complete genomes
Complete virus genomes cannot be retrieved directly from INSDC data: the tag “complete genome” is sometimes missing, or is inappropriately applied to complete segments.
- Vertebrate virus complete genomes: Here we present a list of complete virus genomes gathered and manually checked from several annotation sources: GenBank, ViPR, Virus Variation Resource, IRD, HIV sequence database, Virology.ca. Data from these sources had to be curated. For example, we found 4845 complete segments of Bunyaviridae, but only 2514 of them belong to a truly complete genome.
The September 2017 list (SEPT_2017) contains 70,352 complete virus genomes, made up of 254,914 sequences.
Download Complete genomes (Vertebrate viruses) GenBank accession list in Excel format
These files list the NCBI accession number of each nucleotide sequence, together with the virus species and isolate name.
- Bacterial virus complete genomes: Bacteriophages genomes is a database of curated complete genomes for bacterial viruses.
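The difference between a complete segment and a complete genome, noted above for Bunyaviridae, can be illustrated with a minimal Python sketch. It is not the curation procedure used for the list. It assumes each record carries an isolate name and a segment label, and that a genome counts as complete when the S, M and L segments are all present (an assumption that fits tri-segmented bunyaviruses only). The accessions and isolate names are made up.

```python
from collections import defaultdict

# Assumption: a tri-segmented genome (S, M, L). Other families need other sets.
REQUIRED_SEGMENTS = frozenset({"S", "M", "L"})


def complete_genomes(records, required=REQUIRED_SEGMENTS):
    """Return {isolate: {segment: accession}} for isolates with every required segment.

    `records` is an iterable of (accession, isolate, segment) tuples.
    """
    by_isolate = defaultdict(dict)
    for accession, isolate, segment in records:
        # Keep the first accession seen for each segment of an isolate.
        by_isolate[isolate].setdefault(segment, accession)
    return {
        isolate: segments
        for isolate, segments in by_isolate.items()
        if required.issubset(segments)
    }


# Made-up records: isolate A has all three segments, isolate B lacks L.
records = [
    ("ACC0001", "isolate A", "S"),
    ("ACC0002", "isolate A", "M"),
    ("ACC0003", "isolate A", "L"),
    ("ACC0004", "isolate B", "S"),
    ("ACC0005", "isolate B", "M"),
]

for isolate, segments in complete_genomes(records).items():
    print(isolate, sorted(segments.values()))
```

Here five complete segments yield a single complete genome, since isolate B is missing its L segment. Real data also needs checks that this sketch leaves out, such as reconciling inconsistent isolate names across sources.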
Reference sequences for annotation
These sequences are manually annotated by experts to serve as a gold standard. For example, if a user wants to analyse a new HIV sequence, which sequence should be used as the standard? The answer is NCBI RefSeq and/or UniProt reference proteomes.
- Nucleic acid references: NCBI hosts a list of viral reference genomes that are manually curated and updated to provide the best annotation available. There is about one reference sequence per viral species. NCBI RefSeq: 7475 complete genomes in September 2018.
- Proteomic references: these reference sets are annotated manually or automatically from sequences that are well curated for gene prediction. Human and veterinary viruses are manually annotated. These reference proteomes are standards for protein expression, gene and protein names, as well as for proteomic annotation. There is about one reference proteome per viral genus.
UniProt reference proteomes, all viruses: 5,884 viruses; 264,284 entries in release 2018_08. Note that reference proteomes for non-vertebrate viruses are still incomplete.
UniProt reference proteins, human viruses: 3,913 entries in release 2018_08
The third kind of reference set is produced computationally to cover the full diversity of viral sequences. Users can compute their own set from the databases, depending on the focus of their study. UniProt offers UniRef sequences, which reduce the complexity of the sequence landscape by clustering at a threshold of 90% or 50% identity. For example, Xu et al. (2015) created a synthetic human virome using human virus UniRef 90% sequences.
Human virus representative proteins, 90% identity: all human viruses; 107,858 entries in release 2018_08
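To show how an identity threshold reduces a sequence set to representatives, here is a minimal greedy clustering sketch in Python. It is a toy, not the UniRef pipeline: the similarity score from difflib.SequenceMatcher is only a stand-in for alignment-based sequence identity, and the function names and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(sequences, threshold=0.9):
    """Greedy clustering of {id: sequence}; longer sequences are tried first.

    Returns {representative id: [member ids]}. A sequence joins the first
    representative it matches at or above `threshold`, else it starts a new cluster.
    """
    clusters = {}
    ordered = sorted(sequences.items(), key=lambda item: -len(item[1]))
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if identity(sequences[rep_id], seq) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing representative is similar enough.
            clusters[seq_id] = [seq_id]
    return clusters


# Made-up protein fragments: p1 and p2 differ at one position, p3 is unrelated.
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRA",
    "p3": "GGSLPWWDDENNTTCCHHVVYY",
}

print(pick_representatives(toy, threshold=0.9))
```

With the 0.9 threshold, p1 and p2 collapse into one cluster represented by p1, while p3 stays on its own. Lowering the threshold to 0.5 mirrors the coarser 50% level. For real datasets a dedicated clustering tool is the practical choice, because this all-against-representatives loop scales poorly.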
Virus-specific reference sets: new rotavirus genotype references from the Rotavirus Classification Working Group (RCWG)
Viral immunology. Comprehensive serological profiling of human populations using a synthetic human virome
George J. Xu, Tomasz Kula, Qikai Xu, Mamie Z. Li, Suzanne D. Vernon, Thumbi Ndung'u, Kiat Ruxrungtham, Jorge Sanchez, Christian Brander, Raymond T. Chung, Kevin C. O'Connor, Bruce Walker, H. Benjamin Larman, Stephen J. Elledge
Science June 5, 2015; 348: aaa0698