Viral reference sequences

There are three kinds of reference sets for virus sequences:

Viral complete genomes

Complete virus genomes cannot be retrieved directly from INSDC (GenBank/EMBL/DDBJ) data, because the tag "complete genome" is sometimes missing or wrongly applied to complete segments.

  • Vertebrate virus complete genomes: Here we present a list of complete virus genomes gathered and manually checked from different annotation sources: GenBank, ViPR, Virus Variation Resource, IRD and the HIV sequence database. Data from these sources had to be curated. For example, we found 4,845 complete segments of Bunyaviridae, but only 2,514 of them belong to a truly complete genome.

    The SEPT_2017 list contains 70,352 complete virus genomes, made up of 254,914 sequences.

    Download the complete genomes (vertebrate viruses) GenBank accession list in Excel format

    These files list the NCBI accession number of each nucleotide sequence, together with the virus species and isolate name.

  • Bacterial virus complete genomes: Bacteriophages genomes is a database of curated complete genomes for bacterial viruses.
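The difference between complete segments and complete genomes for segmented viruses can be sketched in a few lines of Python. This is only an illustration, not the curation procedure used to build the list: the record layout, the accessions, the isolate names and the rule itself (a genome counts as complete when one isolate has a complete copy of every expected segment, here assumed to be S, M and L) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical segment set; assumed here to be three segments named S, M and L.
EXPECTED_SEGMENTS = frozenset({"S", "M", "L"})


def complete_genomes(records, expected=EXPECTED_SEGMENTS):
    """Return {isolate: sorted accessions} for isolates that have every expected segment.

    records: iterable of (accession, isolate, segment) tuples, each one
    assumed to describe a complete segment.
    """
    by_isolate = defaultdict(dict)
    for accession, isolate, segment in records:
        # Keep the first accession seen for each (isolate, segment) pair.
        by_isolate[isolate].setdefault(segment, accession)
    return {
        isolate: sorted(segments.values())
        for isolate, segments in by_isolate.items()
        if expected.issubset(segments)
    }


# Made-up accessions and isolate names, for illustration only.
records = [
    ("ACC0001", "isolate_A", "S"),
    ("ACC0002", "isolate_A", "M"),
    ("ACC0003", "isolate_A", "L"),
    ("ACC0004", "isolate_B", "S"),
    ("ACC0005", "isolate_B", "L"),
]
genomes = complete_genomes(records)
print(genomes)
```

In this toy input all five records are complete segments, but only isolate_A yields a complete genome, because isolate_B lacks the M segment.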

Reference sequences for annotation

These sequences are manually annotated by experts to serve as a gold standard. For example, if a user wants to analyse a new HIV sequence, which sequence should be used as a standard? The answer is NCBI RefSeq and/or UniProt reference proteomes.

  • Nucleic acid references: NCBI hosts a list of viral reference genomes manually curated and updated to have the best annotation available. There is about one reference sequence per viral species.
    NCBI RefSeq: 8,512 complete genomes in February 2019

  • Proteomic references: these reference sets are annotated manually or automatically from sequences that are well curated for gene prediction. Human and veterinary viruses are manually annotated. These reference proteomes are standards for protein expression, gene and protein names, as well as for proteomic annotation. There is about one reference proteome per viral genus.
    UniProt reference proteomes, all viruses: 6,183 viruses; 309,546 entries in release 2019_02. Note that reference proteomes for non-vertebrate viruses are still incomplete.
    UniProt reference proteins, human viruses: 3,274 entries in release 2019_02

Representative sequences for metagenomics

These sets of sequences are produced computationally to cover the full diversity of viral sequences.


RVDB: The Reference Viral Database (RVDB) is developed by Arifa Khan's group at CBER, FDA, to enhance virus detection using next-generation sequencing (NGS) technologies.


RVDB protein: The Reference Viral Databases (RVDB-prot and RVDB-prot-HMM) were developed by Thomas Bigot in Marc Eloit's Pathogen Discovery group, in collaboration with the Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI) at Institut Pasteur, to enhance virus detection using next-generation sequencing (NGS) technologies. They are based on the Reference Viral Database, courtesy of Arifa Khan's group at CBER, FDA.

UniProt offers UniRef sequences, which reduce the complexity of the sequence landscape by clustering at an identity threshold of 90% or 50%. For example, this study created a synthetic human virome using the human virus UniRef sequences at 90% identity.

Human virus representative proteins, 90% identity: all human viruses; 110,672 entries in release 2019_02
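How an identity threshold reduces a sequence set to representatives can be illustrated with a toy greedy clustering. This is a sketch under strong assumptions and not the procedure UniProt uses to build UniRef: the `identity` function compares positions without any alignment, the function names are hypothetical, and the peptide strings are made up.

```python
def identity(a, b):
    """Fraction of identical positions, without alignment (toy measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def greedy_representatives(sequences, threshold=0.9):
    """Group sequences so each member is >= threshold identical to its representative.

    Longer sequences are visited first, so they become the representatives.
    Returns a dict mapping each representative to the list of its members.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Made-up peptide strings, for illustration only.
proteins = [
    "MKTLLVAGAV",
    "MKTLLVAGAI",  # differs from the first at one position out of ten
    "MKTLLQQGAI",  # differs from the first at three positions out of ten
    "GGSPWWRRTY",  # unrelated
]
clusters_90 = greedy_representatives(proteins, threshold=0.9)
clusters_50 = greedy_representatives(proteins, threshold=0.5)
print(len(clusters_90), len(clusters_50))
```

Lowering the threshold merges more sequences into each cluster, which is why a 50% set is smaller than a 90% set built from the same input.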

Virus-specific reference sets:
Reference sequences for new rotavirus genotypes from the Rotavirus Classification Working Group (RCWG)