VirSorter (version 2.2.4+galaxy0)

Reference database:

Sequences:

Viral groups:

Classifiers for these groups will be used

Minimal score:

to be identified as viral

Minimal sequence length:

All sequences shorter than this will be removed

Keep the original sequences:

Keep the original sequences instead of the trimmed ones. By default, the untranslated regions at both ends of identified viral seqs are trimmed, and circular sequences are modified to remove the overlap between both ends and adjusted for the gene split across the two ends.

Exclude short sequences:

Short seqs (fewer than 2 genes) do not have any scores, but those with hallmark genes are included as viral by default; use this option to exclude them.

Only output high confidence viral sequences:

This is equivalent to screening final-viral-score.tsv with the following criteria: (max_score >= 0.9) OR (max_score >= 0.7 AND hallmark >= 1)

Require hallmark gene on all viral sequences:

Require hallmark gene on short viral sequences:

By default, sequences shorter than 3 kb are termed short. This can reduce false positives at a reasonable cost to sensitivity.

Require viral genes annotated:

Remove putative viral seqs with no genes annotated; this can reduce false positives at a reasonable cost to sensitivity.

Do not require more viral than cellular genes for calling full sequence viral:

This is useful when only using VirSorter2 to produce DRAMv input from viral sequences identified by other tools, or from those trimmed by checkV.

Do not add suffix to sequence names:

By default, a suffix (||full, ||{i}_partial, or ||lt2gene) is appended. Note that this option might cause partial seqs from the same contig to have the same name; it can be used when you are sure there is at most one partial sequence from each contig.

Generate viral seqfile and viral-affi-contigs for DRAMv:

Extract provirus after classifying full contigs:

This should only be done if you need results fast and are not interested in proviruses.

What it does

VirSorter2 applies a multi-classifier, expert-guided approach to detect diverse DNA and RNA virus genomes.

Usage

Input

A fasta sequence.

The default score cutoff (0.5) works well for known viruses (RefSeq). For real environmental data, we can expect false positives (non-viral sequences) with the default cutoff. Generally, samples with more host sequences (e.g. bulk metaG) and unknown sequences (e.g. soil) tend to have more false positives. We find that a score cutoff of 0.9 works well for high confidence hits, but there are also many viral hits with score <0.9. It is difficult to separate viral and non-viral hits by score alone, so we recommend using the default score cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV. Here is a tutorial on the [viral identification SOP](https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-btv8nn9w) used in the Sullivan Lab.

Output

Identified viral sequences, including the following types:

full sequences identified as viral (identified with suffix ||full);
partial sequences identified as viral (identified with suffix ||{i}_partial); here {i} is a number from 0 up to the maximum number of viral fragments found in that contig;
short (less than two genes) sequences with hallmark genes identified as viral (identified with suffix ||lt2gene);

Note that the suffixes ||full, ||lt2gene and ||{i}_partial are added to the original sequence names to differentiate subsequences when multiple viral subsequences are found in one contig. Partial sequences can be treated as proviruses since they are extracted from longer host sequences. Full sequences, however, can be proviruses or free viruses, since a full sequence can be a short fragment sequenced from a provirus region. Moreover, "full" sequences are just sequences with a strong viral signal as a whole ("nearly full" is more accurate). They might be trimmed due to partial gene overhang at the ends, duplicate segments from circular genomes, and an end trimming step applied to all identified viral sequences to find the optimal viral segment (the longest within 95% of peak score by default). Again, "full" sequences trimmed by the end trimming step should not be interpreted as proviruses, since genes that have low impact on the score, such as unknown genes or genes shared by host and virus, could be trimmed. If you prefer that the full sequences (ending with ||full) not be trimmed, leaving that to specialized tools such as checkV, you can use the --keep-original-seq option.
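The naming scheme described above can be undone with a small helper when downstream steps need the original contig name. The following is only a sketch in Python: the function name split_virsorter_name and the returned tuple layout are hypothetical and not part of VirSorter2; only the three suffixes come from the description above.

```python
import re

# Suffixes described in the text: ||full, ||{i}_partial, ||lt2gene.
# The regex and function below are illustrative, not part of VirSorter2.
_SUFFIX_RE = re.compile(
    r"^(?P<contig>.*)\|\|(?P<tag>full|lt2gene|(?P<index>\d+)_partial)$"
)


def split_virsorter_name(name):
    """Return (original_name, kind, fragment_index) for an output sequence name.

    kind is "full", "partial", "lt2gene", or None when no known suffix is found
    (for example when suffixes were turned off).
    """
    match = _SUFFIX_RE.match(name)
    if match is None:
        return name, None, None
    if match.group("index") is not None:
        return match.group("contig"), "partial", int(match.group("index"))
    return match.group("contig"), match.group("tag"), None


print(split_virsorter_name("contig_7||1_partial"))
```

Returning None for an unrecognised name keeps the helper usable on files produced without suffixes, at the price of the caller having to check for it.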

Scores: This table can be used for further screening of results. It includes the following columns:

sequence name
score of each viral sequence across groups (multiple columns)
max score across groups
max score group
contig length
hallmark gene count
viral gene %
nonviral gene %
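As an example of such screening, the high confidence criteria quoted for final-viral-score.tsv, (max_score >= 0.9) OR (max_score >= 0.7 AND hallmark >= 1), could be applied to the table roughly as follows. This is a sketch, not the tool's own code: the column headers seqname, max_score and hallmark are assumptions and should be checked against the header of your actual file, and the function names are hypothetical.

```python
import csv
import io


def is_high_confidence(max_score, hallmark):
    """Criteria from the text: max_score >= 0.9, or max_score >= 0.7 with a hallmark gene."""
    return max_score >= 0.9 or (max_score >= 0.7 and hallmark >= 1)


def screen_scores(handle):
    """Return names of rows passing the screen.

    Assumed (not confirmed) column headers: seqname, max_score, hallmark.
    A score of "nan" fails every comparison, so unscored rows are dropped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["seqname"]
        for row in reader
        if is_high_confidence(float(row["max_score"]), int(row["hallmark"]))
    ]


# Tiny made-up table, for illustration only.
example = io.StringIO(
    "seqname\tmax_score\thallmark\n"
    "a||full\t0.95\t0\n"
    "b||full\t0.75\t1\n"
    "c||0_partial\t0.75\t0\n"
    "d||lt2gene\tnan\t1\n"
)
print(screen_scores(example))  # ['a||full', 'b||full']
```

For a real run, pass an open file handle for the Scores output in place of the in-memory example.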

Boundary information: This is an intermediate file that 1) might have extra records compared to the other two files, which should be ignored; 2) does not include viral sequences with < 2 genes but >= 1 hallmark gene; 3) contains group and trim_pr values that are intermediate results and might not match max_group and max_score, respectively, in the Scores output. Only some of the columns in this file might be useful:

seqname: original sequence name
trim_orf_index_start, trim_orf_index_end: start and end ORF index on the original sequence of the identified viral sequence
trim_bp_start, trim_bp_end: start and end position on the original sequence of the identified viral sequence
trim_pr: score of final trimmed viral sequence
partial: whether the full sequence or only a partial sequence is viral; when a full sequence has a score > the score cutoff, it is full (0); otherwise any viral sequence extracted from within it is partial (1)
pr_full: score of the original sequence
hallmark_cnt: hallmark gene count
group: the viral group classifier that gives a high score; this should NOT be used as a reliable classification
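The rule behind the partial column can be written out as a one-line check. The sketch below only restates the definition given for that column; the function name partial_flag is hypothetical, and the default cutoff of 0.5 is the default score cutoff mentioned in the Usage section.

```python
def partial_flag(pr_full, score_cutoff=0.5):
    """Return 0 (full) if the whole-sequence score exceeds the cutoff, else 1 (partial).

    pr_full is the score of the original sequence; a value of 1 means that any
    viral sequence reported for this contig was extracted from within it.
    """
    return 0 if pr_full > score_cutoff else 1


# Made-up scores, for illustration only.
for score in (0.92, 0.50, 0.31):
    print(score, partial_flag(score))
```

Note that the comparison is strict, following the "score > score cutoff" wording, so a sequence scoring exactly at the cutoff is treated as partial in this sketch.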