There are three kinds of reference sets for virus sequences:
A list of complete virus genomes was not readily available as of October 2016. In INSDC records, the “complete genome” tag is sometimes missing or wrongly applied to complete segments. Here we present a list of complete virus genomes gathered and manually checked from several annotation sources: GenBank, ViPR, Virus Variation Resource, IRD, HIV Sequence Database, and Virology.ca.
Data from these sources had to be curated. For example, we found 4,845 complete segments of Bunyaviridae, but only 2,514 of them belong to a fully complete genome.
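The kind of check behind the Bunyaviridae figures can be illustrated with a minimal sketch. It is not the curation pipeline actually used for this list. It assumes each record has already been reduced to an (accession, isolate, segment) tuple, and that a complete Bunyaviridae genome requires the three segments L, M and S. The function name, accession strings and isolate names are hypothetical.

```python
from collections import defaultdict

# Assumption: a complete Bunyaviridae genome consists of segments L, M and S.
EXPECTED_SEGMENTS = frozenset({"L", "M", "S"})


def complete_genomes(records, expected=EXPECTED_SEGMENTS):
    """Group complete segments by isolate and keep fully covered isolates.

    records: iterable of (accession, isolate, segment) tuples.
    Returns {isolate: {segment: accession}} for isolates that have
    every expected segment.
    """
    by_isolate = defaultdict(dict)
    for accession, isolate, segment in records:
        # Keep the first accession seen for each segment of an isolate.
        by_isolate[isolate].setdefault(segment, accession)
    return {
        isolate: segments
        for isolate, segments in by_isolate.items()
        if expected.issubset(segments)
    }


# Hypothetical records: isolate_X has all three segments, isolate_Y lacks M.
records = [
    ("ACC0001", "isolate_X", "L"),
    ("ACC0002", "isolate_X", "M"),
    ("ACC0003", "isolate_X", "S"),
    ("ACC0004", "isolate_Y", "L"),
    ("ACC0005", "isolate_Y", "S"),
]
genomes = complete_genomes(records)
```

In this toy input, five complete segments yield a single complete genome, which mirrors why the number of complete segments overstates the number of complete genomes.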
The September 2017 list comprises 70,352 complete virus genomes, totalling 254,914 sequences.
Download: Complete genomes (Vertebrate viruses), GenBank accession list in Excel format
These files list the NCBI accession number of each nucleotide sequence together with the virus species and isolate name.
These sequences are manually annotated by experts to serve as a gold standard. For example, if a user wants to analyse a new HIV sequence, which sequence should be used as the standard? The answer is NCBI RefSeq and/or UniProt reference proteomes.
These sets of sequences are produced computationally to cover the full diversity of viral sequences. One can compute one's own set from the databases, depending on the focus of the study. UniProt offers UniRef sequences, which reduce the complexity of the sequence landscape by clustering at a threshold of 90% or 50% identity. For example, this study created a synthetic human virome from human virus UniRef90 sequences.
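To give a feel for how a representative set can be derived at an identity threshold, here is a minimal greedy sketch. It is not the method UniProt uses to build UniRef. As a loudly stated simplification, `difflib.SequenceMatcher.ratio()` stands in for alignment-based sequence identity. The function names and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude stand-in for sequence identity (assumption, not an alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_representatives(seqs, threshold=0.9):
    """Greedy clustering: longest sequences are tried as representatives first.

    seqs: {name: sequence}. Returns {representative: [member names]}.
    """
    clusters = {}
    # Sort by decreasing length; ties keep their input order (stable sort).
    for name, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        for rep in clusters:
            if similarity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)  # join the first close-enough cluster
                break
        else:
            clusters[name] = [name]  # no match: start a new cluster
    return clusters


# Hypothetical toy proteins: "a" and "b" differ at one residue, "c" is unrelated.
toy = {
    "a": "MKTAYIAKQRQISFVKSHFS",
    "b": "MKTAYIAKQRQISFVKSHFA",
    "c": "GGGGGPPPPPLLLLLWWWWW",
}
clusters = pick_representatives(toy, threshold=0.9)
```

Lowering the threshold (for example to 0.5) merges more sequences into each cluster, so the representative set gets smaller at the cost of resolution.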
Human virus representative proteins, 90% identity: all human viruses; 101,794 entries in release 2016_08
Virus specific reference sets: Rotavirus new genotypes references from Rotavirus Classification Working Group (RCWG)