Freyja: Demix (version 2.0.0+galaxy0)

Dataset with input variant calls:

This can be a VCF dataset or the tabular variant calls output of freyja call or ivar variants.

Set sample name:

Select autodetect to have the dataset or collection element name used as the sample name, or, for a single input dataset, provide an explicit sample name.

Sequencing depth file:

Source of UShER barcodes data:

Freyja ships with an usher_barcodes.csv file, which the tool can access internally. Since this file gets updated rather frequently, you can also download the latest version of the file from https://github.com/andersen-lab/Freyja/raw/main/freyja/data/usher_barcodes.csv, set the dataset's datatype to csv and use it as a custom barcodes file.

Custom lineage metadata file:

For additional flexibility and reproducibility of analyses, a custom lineage-to-constellation mapping metadata file can be provided.

Minimum lineage abundance to include:

e.g. 0.0001.

Remove unconfirmed lineages from the analysis:

If the UShER tree includes proposed lineages, the --confirmedonly flag removes unconfirmed lineages from the analysis.

Use larger library with non-public lineages:

Allows the use of a larger UShER barcodes library extended with non-public GISAID lineages.

Pathogen of interest:

Depth cutoff for coverage estimate:

In the result file, the coverage value reports the percentage of sites covered by at least this many reads. The default cutoff is 10 (giving a 10x coverage estimate), but it can be modified in this field.
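As a rough illustration of what this cutoff means, the following minimal Python sketch computes the percentage of sites at or above a depth cutoff. The function name coverage_percent and the toy depth values are hypothetical, and the sketch assumes every genome position appears once in the input; it is not taken from Freyja's own code.

```python
def coverage_percent(depths, cutoff=10):
    """Return the percentage of sites whose read depth is at least `cutoff`."""
    depths = list(depths)
    if not depths:
        raise ValueError("no sites given")
    covered = sum(1 for d in depths if d >= cutoff)
    return 100.0 * covered / len(depths)


# Hypothetical per-site depths for a six-site toy genome.
example_depths = [0, 5, 10, 12, 250, 9]

print(coverage_percent(example_depths))      # default cutoff of 10
print(coverage_percent(example_depths, 5))   # a lower cutoff counts more sites
```

Raising the cutoff can only lower the reported coverage (or leave it unchanged), so values obtained with different cutoffs are not directly comparable.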

What it does

Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference).

General information

Freyja is a tool to recover relative lineage abundances from mixed SARS-CoV-2 samples from a sequencing dataset (BAM aligned to the Hu-1 reference). The method uses lineage-determining mutational "barcodes" derived from the UShER global phylogenetic tree as a basis set to solve the constrained (unit sum, non-negative) de-mixing problem.
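The constrained de-mixing problem can be written out as a short sketch. The symbols are assumptions introduced here for illustration and are not notation from the Freyja documentation: A is the barcode matrix (one row per mutation, one column per lineage), b is the vector of observed SNV frequencies, w_i are non-negative per-site weights derived from sequencing depth (their exact form is not specified here), and x is the vector of lineage abundances. The absolute-value objective reflects the weighted least absolute deviation formulation mentioned under Outputs.

```latex
\hat{x} \;=\; \arg\min_{x} \; \sum_{i} w_i \,\bigl| (A x)_i - b_i \bigr|
\quad \text{subject to} \quad
\sum_{j} x_j = 1, \qquad x_j \ge 0 \ \text{ for all } j
```

The two constraints are the "unit sum" and "non-negative" conditions: abundances behave like proportions of the sample.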

Freyja is intended as a post-processing step after primer trimming and variant calling in iVar (Grubaugh and Gangavarapu et al., 2019). From measurements of SNV frequency and sequencing depth at each position in the genome, Freyja returns an estimate of the true lineage abundances in the sample.

Freyja demix estimates lineage abundances in a potentially multi-lineage input sample.

Inputs

The tool requires as input a dataset with called variants and a dataset with genome-wide sequencing depth information. Both types of data can be produced with Freyja call, but the tool also accepts variant calls in VCF format.
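For orientation outside Galaxy, the two inputs correspond to the two positional arguments of the freyja demix command. The Python sketch below only assembles such a command line and does not run it. The helper name build_demix_command and the file names are hypothetical, and the sketch assumes the positional order is variants first, depths second, with the result written via --output.

```python
import shlex


def build_demix_command(variants, depths, output, confirmed_only=False, covcut=None):
    """Assemble a `freyja demix` argument list without executing it."""
    cmd = ["freyja", "demix", variants, depths, "--output", output]
    if confirmed_only:
        cmd.append("--confirmedonly")        # drop unconfirmed lineages
    if covcut is not None:
        cmd += ["--covcut", str(covcut)]     # depth cutoff for the coverage estimate
    return cmd


# Hypothetical file names for a single sample.
cmd = build_demix_command(
    "sample1_variants.tsv",
    "sample1_depths.tsv",
    "sample1_demix.tsv",
    confirmed_only=True,
    covcut=20,
)
print(shlex.join(cmd))
```

Keeping the variants and depths files of one sample together in a single call is the command-line counterpart of the dataset pairing that the Galaxy tool performs.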

Note

For single samples it is recommended to select "Specify sample name explicitly" under "Set sample name".

To use this tool on multiple samples in parallel, please provide two collections in the same sample sort order - one with the variant calls, the other one with the sequencing depths - and select "Autodetect sample name", which will use collection element identifiers as the names of the samples. This will produce a new collection of demixing reports that can be passed to Freyja: Aggregate and visualize with sample names preserved.

Selecting multiple individual variant call and depth datasets (rather than collections) is discouraged, since proper dataset pairing cannot be guaranteed!

Outputs

The tool produces tabular output that includes the lineages detected in the sample, their corresponding abundances, and a lineage summary by constellation.

Example output:

summarized	[('Delta', 0.65), ('Other', 0.25), ('Alpha', 0.1)]
lineages	['B.1.617.2' 'B.1.2' 'AY.6' 'Q.3']
abundances	"[0.5 0.25 0.15 0.1]"
resid	3.14159
coverage	95.8

Here, summarized gives the sum of all lineage abundances within each WHO designation (e.g. the B.1.617.2 and AY.6 abundances are summed in the above example); lineages without such a designation are grouped into "Other". The lineages array lists the identified lineages in descending order of abundance, and abundances contains the corresponding abundance estimates. The value of resid corresponds to the residual of the weighted least absolute deviation problem used to estimate lineage abundances. The coverage value provides the 10x coverage estimate (percent of sites with 10 or more reads; 10 is the default but can be modified using the --covcut option).
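To make the relationship between the lineages, abundances and summarized rows concrete, here is a minimal Python sketch that parses the example output and recomputes the per-designation sums. The helper names and the lineage-to-designation mapping WHO are hypothetical and cover only this example. The parsing assumes the array formatting shown in the example, which may differ between Freyja versions.

```python
EXAMPLE = (
    "summarized\t[('Delta', 0.65), ('Other', 0.25), ('Alpha', 0.1)]\n"
    "lineages\t['B.1.617.2' 'B.1.2' 'AY.6' 'Q.3']\n"
    "abundances\t\"[0.5 0.25 0.15 0.1]\"\n"
    "resid\t3.14159\n"
    "coverage\t95.8\n"
)

# Hypothetical mapping for illustration; real mappings come from lineage metadata.
WHO = {"B.1.617.2": "Delta", "AY.6": "Delta", "Q.3": "Alpha"}


def tokens(raw):
    """Strip quotes and brackets from an array-like field and split on whitespace."""
    return raw.replace("'", " ").strip('"[] ').split()


def parse_demix(text):
    """Return ({lineage: abundance}, coverage) from a demix report."""
    fields = dict(line.split("\t", 1) for line in text.splitlines() if line.strip())
    lineages = tokens(fields["lineages"])
    abundances = [float(x) for x in tokens(fields["abundances"])]
    return dict(zip(lineages, abundances)), float(fields["coverage"])


def summarize(abund, mapping):
    """Sum abundances per designation; unmapped lineages go to 'Other'."""
    totals = {}
    for lineage, value in abund.items():
        name = mapping.get(lineage, "Other")
        totals[name] = totals.get(name, 0.0) + value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


abund, coverage = parse_demix(EXAMPLE)
for name, total in summarize(abund, WHO):
    print(name, round(total, 4))
```

With this mapping, B.1.617.2 and AY.6 both contribute to Delta, Q.3 contributes to Alpha, and B.1.2 falls into "Other", which reproduces the grouping in the summarized row.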