Viral reference sequences

There are three kinds of reference sets for virus sequences:

Viral complete genomes

A list of complete virus genomes was hardly available as of October 2016. In INSDC, the “complete genome” tag is sometimes missing or wrongly attached to complete segments. Here we present a list of complete virus genomes gathered and manually checked from several annotation sources: GenBank, ViPR, the Virus Variation Resource, IRD, and the HIV sequence database.

Data from these sources had to be curated. For example, we found 4845 complete segments of Bunyaviridae, but only 2514 of them belong to a genome for which every segment is available.
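The segment check described above can be sketched as follows. This is a minimal illustration and not the curation pipeline actually used: the record layout, the accession strings, the isolate names and the assumption of a three-segment (L, M, S) genome are all hypothetical.

```python
from collections import defaultdict

# Assumption: a tri-segmented genome (L, M, S); adapt the set for other families.
EXPECTED_SEGMENTS = frozenset({"L", "M", "S"})


def complete_genomes(records, expected=EXPECTED_SEGMENTS):
    """Return {isolate: {segment: accession}} for isolates with every expected segment."""
    by_isolate = defaultdict(dict)
    for accession, isolate, segment in records:
        # Keep the first accession seen for each (isolate, segment) pair.
        by_isolate[isolate].setdefault(segment, accession)
    return {
        isolate: segments
        for isolate, segments in by_isolate.items()
        if expected.issubset(segments)
    }


# Hypothetical records: (accession, isolate, segment)
records = [
    ("ACC0001", "isolate_A", "L"),
    ("ACC0002", "isolate_A", "M"),
    ("ACC0003", "isolate_A", "S"),
    ("ACC0004", "isolate_B", "L"),  # isolate_B lacks the M segment
    ("ACC0005", "isolate_B", "S"),
]

genomes = complete_genomes(records)
print(sorted(genomes))
```

Here all five segments are complete, yet only isolate_A counts as a complete genome, which is the distinction the curation step has to make.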

The December 2016 list comprises 78,618 complete virus genomes, made up of 333,861 sequences.

Download the complete genomes (vertebrate viruses) GenBank accession list in Excel format

These files list the NCBI accession number of each nucleotide sequence, together with the virus species and isolate name.

Reference sequences for annotation

These sequences are manually annotated by experts to serve as a gold standard. For example, if a user wants to analyse a new HIV sequence, which sequence should be used as the standard? The answer is NCBI RefSeq and/or UniProt reference proteomes.

  • Nucleic acid references: NCBI hosts a list of viral reference genomes manually curated and updated to have the best annotation available. There is about one reference sequence per viral species.
    NCBI RefSeq: 7135 complete genomes in April 2017

  • Proteomic references: these reference sets are annotated manually or automatically from sequences whose gene predictions are well curated. Human and veterinary viruses are manually annotated. These reference proteomes are standards for protein expression, gene and protein names, as well as for proteomic annotation. There is about one reference proteome per viral genus.
    UniProt reference proteomes, all viruses: 2245 viruses; 130,175 entries in release 2017_04. Note that reference proteomes for non-vertebrate viruses are still incomplete.
    UniProt reference proteins, human viruses: 2,759 entries in release 2017_04

Representative sequences for metagenomics

These sets of sequences are produced computationally to cover the whole diversity of viral sequences. One can compute one's own set from the databases, depending on the focus of the study. UniProt offers UniRef sequences, which reduce the complexity of the sequence landscape by clustering at a threshold of 90% or 50% identity. For example, this study has created a synthetic human virome using human virus UniRef90 sequences.

Human virus representative sequences, 90% identity: all human viruses; 100,366 entries in release 2016_08
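For readers who want to compute their own representative set, the idea of reducing sequences at an identity threshold can be sketched with a greedy selection. This is a toy illustration only and not the UniRef procedure: the similarity score comes from Python's difflib and merely stands in for an alignment-based identity, and the function names and the three short sequences are hypothetical. Real data sets call for a dedicated clustering tool.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity proxy in [0, 1]; NOT an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(sequences, threshold=0.9):
    """Pick representatives so that every input sequence scores at least
    `threshold` against one of them."""
    representatives = []
    # Longest first, so the longest member of each group becomes its representative.
    for name, seq in sorted(sequences.items(), key=lambda kv: -len(kv[1])):
        if not any(identity(seq, rep) >= threshold for _, rep in representatives):
            representatives.append((name, seq))
    return [name for name, _ in representatives]


# Hypothetical toy protein sequences
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRA",  # one substitution compared with p1
    "p3": "GSHMLEDPVNWQ",
}

print(greedy_representatives(toy, threshold=0.9))
```

Lowering the threshold merges more sequences into each group and yields a smaller representative set, which is the trade-off between the 90% and 50% levels mentioned above.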

Virus-specific reference sets:
Rotavirus: new genotype references from the Rotavirus Classification Working Group (RCWG)