What it does
VirSorter2 applies a multi-classifier, expert-guided approach to detect diverse DNA and RNA virus genomes.
Input
A fasta sequence.
The default score cutoff (0.5) works well known viruses (RefSeq). For the real environmental data, we can expect to get false positives (non-viral) with the default cutoff. Generally, samples with more host (e.g. bulk metaG) and unknown sequences (e.g. soil) tends to have more false positives. We find a score cutoff of 0.9 work well as a cutoff for high confidence hits, but there are also many viral hits with score <0.9. It's difficult to separate the viral and non-viral hits by score alone. So we recommend using the default score cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV. Here is a tutorial of [viral identification SOP](https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-btv8nn9w) used in Sullivan Lab.
Output
identified viral sequences, including the following types:
Note that suffix ||full, ||lt2gene and ||{i}_partial have been added to original sequence names to differentiate sub-sequences in case of multiple viral subsequences found in one contig. Partial sequences can be treated as proviruses since they are extracted from longer host sequences. Full sequences, however, can be proviruses or free virus since it can be a short fragment sequenced from a provirus region. Moreover, "full" sequences are just sequences with strong viral signal as a whole ("nearly full" is more accurate). They might be trimmed due to partial gene overhang at ends, duplicate segments from circular genomes, and an end trimming step for all identified viral sequences to find the optimal viral segments (longest within 95% of peak score by default). Again, the "full" sequences trimmed by the end trimming step should not be interpreted as provirus, since genes that have low impact on score, such as unknown gene or genes shared by host and virus, could be trimmed. If you prefer the full sequences (ending with ||full) not to be trimmed and leave it to specialized tools such as checkV, you can use --keep-original-seq option.
Scores: This table can be used for further screening of results. It includes the following columns:
Boundary information: This is a intermediate file that 1) might have extra records compared to other two files and should be ignored; 2) do not include the viral sequences with < 2 gene but have >= 1 hallmark gene; 3) the group and trim_pr are intermediate results and might not match the max_group and max_score respectively in the Scores output. Only some of the columns in this file might be useful: