There are three kinds of reference sets for virus sequences:
A comprehensive list of complete virus genomes was hardly available as of October 2016. In INSDC, the tag "complete genome" is sometimes missing, or is inappropriately applied to complete segments. Here we present a list of complete virus genomes gathered and manually checked from different annotation sources: GenBank, ViPR, Virus Variation Resource, IRD, HIV sequence database, Virology.ca.
Data from these sources had to be curated. For example, we found 4845 complete segments of Bunyaviridae, but only 2514 of them belong to a genome for which every segment is available.
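The kind of completeness check described above can be sketched in a few lines of Python. This is only an illustration, not the curation procedure actually used: the function name `complete_genomes`, the record layout (accession, isolate, segment) and the accession strings are hypothetical, and the sketch assumes the three-segment (S, M, L) genome organisation of Bunyaviridae.

```python
from collections import defaultdict

# Assumption: a Bunyaviridae genome is complete when segments S, M and L
# are all available for the same isolate.
REQUIRED_SEGMENTS = frozenset({"S", "M", "L"})


def complete_genomes(records, required=REQUIRED_SEGMENTS):
    """Group complete segments by isolate and keep only complete genomes.

    `records` is an iterable of (accession, isolate, segment) tuples.
    Returns {isolate: {segment: accession}} for isolates that have
    every segment listed in `required`.
    """
    by_isolate = defaultdict(dict)
    for accession, isolate, segment in records:
        # Keep the first accession seen for each segment of an isolate.
        by_isolate[isolate].setdefault(segment, accession)
    return {
        isolate: segments
        for isolate, segments in by_isolate.items()
        if required.issubset(segments)
    }


# Made-up records for illustration only (not real accession numbers).
records = [
    ("ACC0001", "isolate A", "S"),
    ("ACC0002", "isolate A", "M"),
    ("ACC0003", "isolate A", "L"),
    ("ACC0004", "isolate B", "S"),
    ("ACC0005", "isolate B", "L"),  # the M segment is missing
]

genomes = complete_genomes(records)
print(sorted(genomes))
```

In this toy input all five records are complete segments, but only the three from "isolate A" belong to a complete genome, which is the same distinction that separates the segment count from the genome count above.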
The December 2016 list contains 78,618 complete virus genomes, made up of 333,861 sequences.
For each sequence, these files give the NCBI accession number of the nucleotide sequence and the virus species and isolate name.
Reference sequences are annotated manually by experts to serve as a gold standard. For example, if a user wants to analyse a new HIV sequence, which sequence should be used as the standard? The answer is NCBI RefSeq and/or the UniProt reference proteomes.
Representative sets of sequences are produced computationally to cover the whole diversity of viral sequences. One can compute one's own set from the databases, depending on the focus of the study. UniProt offers UniRef sequences, which reduce the complexity of the sequence landscape by clustering at a threshold of 90% or 50% identity. For example, this study has created a synthetic human virome using the UniRef90 sequences of human viruses.
Human virus representative sequences, 90% identity: all human viruses; 97,179 entries in release 2016_08
NCBI protein clusters: 2652 clusters in 08/15
Virus-specific reference sets: references for new rotavirus genotypes from the Rotavirus Classification Working Group (RCWG)
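To make the idea of an identity threshold concrete, here is a minimal sketch of greedy clustering in Python. It is not the pipeline used by UniProt or NCBI: the function names `identity` and `greedy_cluster` are hypothetical, the toy sequences are invented, and the similarity score is the ratio from the standard library's `difflib.SequenceMatcher`, a crude stand-in for an alignment-based percent identity.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; NOT an alignment-based percent identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedily cluster sequences around representatives.

    `sequences` maps a name to a sequence string. Sequences are visited
    from longest to shortest; each one joins the first cluster whose
    representative it matches at `threshold` or above, otherwise it
    becomes the representative of a new cluster.
    Returns {representative name: [member names]}.
    """
    clusters = {}
    order = sorted(sequences, key=lambda name: (-len(sequences[name]), name))
    for name in order:
        for rep in clusters:
            if identity(sequences[rep], sequences[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough.
            clusters[name] = [name]
    return clusters


# Invented toy sequences for illustration only.
toy = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSRA",  # one substitution relative to "a"
    "c": "GGGGGGGGGGPPPPPPPPPP",    # unrelated
}

print(greedy_cluster(toy, threshold=0.9))
```

With the 0.9 threshold, "a" and "b" collapse into one cluster represented by "a", while "c" stays on its own; raising the threshold keeps more representatives, and lowering it keeps fewer. This trade-off is what distinguishes a 90% set from a 50% set.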