Galaxy |

This script sorts and (optionally) filters the rows/variants of a VCF file (containing data for 2 or more samples) based on the differences in the variant data between samples or sample groups. Degree of "difference" is determined by either the ratio of group-specific genotype calls over group size or the difference in average allelic frequency (with a gap size threshold). The the pair of samples or sample groups used to represent the difference for a variant row is the one leading to the greatest difference in consistent genotype or average allelic frequencies (i.e. observation ratios, e.g. AO/DP) of the same variant state.

This script works with VCF files generated by freeBayes (for SNP and small nucleotide variants) or svTyper (for structural variants). It will work with any other VCF data that includes GT or AO, RO, and DP tags in the FORMAT string.

Each row in a VCF file will be assumed to represent a variant (or variant position). In the context of this script, there are two ways to look at differences among the samples: genotype calls and the ratio of observations of a particular variant out of the total observatons. We'll refer to this as either "allelic frequency" or "observation ratios" throughout the documentation.

DEFAULT SORTING BEHAVIOR

VCF rows/variants are sorted by the (maximum) degree of difference that exists between the pairs of sample groups you define. If multiple pairs are defined, the maximum difference computed among those pairs is used in the sort. How the degree of difference is calculated depends on whether the --genotype or --nogenotype flag is supplied.

If --genotype is supplied, sorting will be based on the degree of difference in genotype calls between the 2 sample groups. Put simply, variants where all the genotype calls differ between sample group 1 and sample group 2 will be at the top of the results. If the genotype calls within a group are not consistent, the rank of the row falls and it will appear lower in the results. If all of the genotype calls between 2 sample groups are the same, the row will be at the bottom of the results. If samples do not have genotype calls, the rank falls even lower. The very bottom of the results will contain variant which have no genotype calls among any samples in the 2 groups.

If --nogenotype is supplied, sorting will be based on the degree of difference in allelic frequencies between the 2 samples groups. The degree of difference between allelic frequencies will be the maximum difference in observation ratios among the samples, e.g. an 'A' in sample 1 is seen in 1 out of 10 reads that map over the variant position whereas an 'A' is seen in sample 2 in 10 out of 10 reads. The difference in those observation ratios is 9/10 or 0.9. The variant state (among all the observations in the 2 sample groups) with the largest difference in observation ratios between the samples in the 2 sample groups is selected to represent the row. The difference in average ratios of each group is what is used in the sort.

SETTING A MINIMUM GROUP SIZE

Supplying a --min-group-size affects sorting and allows you to find which samples among 2 sample groups differ (by bringing them to the top of the sorted results). By default, all group members are used to compute maximum difference between 2 sample groups as described above.

When --min-group-size is supplied with --nogentotype, the maximum difference between the sample groups' average observation ratios is computed twice, between N members of sample group 1 and M members of sample group 2. When comparing sample groups, the maximum difference is determined by taking the greater difference of 2 comparisons: 1. the top N observation ratios of sample group 1 versus the bottom M (inverse) observation ratios of sample group 2. 2. the bottom N observation ratios of sample group 1 versus top M observation ratios of sample group 2. In order to avoid meaningless results, either N or M must represent a majority of their respective sample groups. It is recommended to always set -d to the group size for 1 of the 2 groups. --min-group-size should only be used when the groups being compared are not 2 sets of replicates.

When --min-group-size is supplied with --gentotype, the maximum difference between the sample groups' is computed in the same manner as described above for --genotype, except calls within a sample group are allowed to differ as long as there exists a subgroup of at least size --min-group-size with a consistent genotype call.

DYNAMIC CREATION OF SAMPLE GROUPS

When pairs of sample groups are not supplied, sample groups are greedily determined on each row independently. Up to 2 --min-group-size's can be supplied, but must not sum to more than the number of samples. The default minimum group sizes are both 1. Sorting is performed in the same manner, except (in the case of --nogenotype) the top N and bottom M samples compared are selected from a single list. In the case of --genotype, the samples are ordered by genotype call abundance and assigned to the groups from either end (omitting those with no-calls).

GROWING SAMPLE GROUPS FROM THE MINIMUM GROUP SIZE

If you have supplied a --min-group-size that is less than the number of samples defined in the group, you can allow sample groups to grow using the --grow parameter. This allows you to identify groups of different (i.e. non-replicate) samples that share a difference with the comparison group. Growing groups behaves differently depending on whether --genotype or --nogenotype is supplied.

If --nogenotype is supplied, grow groups is done using the --separation-gap threshold. It uses the difference in the obervation ratio of 1 group and its inverse observation ratio in the comparison group. For example, if you supply --grow --nogenotype --separation-gap 0.5, samples will be greedily* added to the 2 groups in order of their difference with the current group's observation ratio average and stops just before the difference in the averages crosses the threshold of 0.5.

If --genotype is supplied, all members of a sample group matching a genotype call in the sample group of size --min-group-size are added to the group. If sample groups are being created dynamically and the groups have genotype calls in common, no other samples of the common genotype call will be added.

* Sample groups are seeded with members from either the bottom or top set of observation ratios. Samples in different groups are seeded from opposite ends (top or bottom). Samples are then traversed top-down or bottom-up and greedily added to the respective sample group in order of ascending difference from the current group average.

FILTERING

There are 2 threshold options that can be used to filter variants that do not contain differences between the sample groups that meet the thresholds. In --genotype mode, the threshold is --min-group-size. In --nogenotype mode the threshold is the combination of --separation-gap and --min-group-size.

n --nogenotype mode, if the difference between the observation ratios between (all of the*) pairs of sample groups is less than the separation gap threshold, the row will not be printed.

In --genotype mode, if the (all of the*) pairs of sample groups share a common genotypoe call, the row will not be printed.

* In either case, if any pair of sample groups meets the threshold(s), the row will be printed regardless of whether or not any other pair fails the threshold(s).

EXAMPLE

To sort based on the difference between specific samples or groups of samples, those groups can be defined on the command line using -s. You can specify a minimum number of samples in the groups to differ. So for example, say you have 3 wildtype (WT) replicates and you would like to see differences that all 3 WT samples have with any one of a set of 10 mutant samples. You would do that on the command line using the sample names:

-s "wt1 wt2 wt3" -d 3 -s "m1 m2 m3 m4 m5 m6 m7 m8 m9 m10" -d 1

The largest difference that the average observation ratio of the WT samples has with 1 of the mutant samples will be at the top of the results.