Galaxy | Tool Preview

Make strain profiles (version 0.11.1.1)
Only genes with prevalence higher than the threshold are allowed
Only genes with prevalence lower than the threshold are allowed
e.g. 'Streptococcus'

What it does

HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data.

Read more about the tool: http://huttenhower.sph.harvard.edu/humann2/manual.

This script is currently at an experimental stage. Please use with caution.

The HUMAnN2 script humann2_strain_profiler can help explore strain-level variation in your data. This approach assumes you have run HUMAnN2 on a series of samples and then merged the resulting genefamilies.tsv tables with humann2_merge_tables. Cases will arise in which the same species was detected in two or more samples, but gene families within that species were not consistently present across samples. For example, four samples may contain the species Dialister invisus, but only two of them contain the gene family UniRef50_Q5WII6 within Dialister invisus. This is a form of strain-level variation in the Dialister invisus species, and one that we can connect directly to function based on annotations of the UniRef50_Q5WII6 gene family.
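
The Dialister invisus example can be illustrated with a minimal Python sketch. It is not the tool's own code: the abundance values are invented, the second gene family name (UniRef50_HYPO01) is hypothetical, and the row IDs assume the "genefamily|g__Genus.s__Species" layout of stratified HUMAnN2 tables.

```python
# Toy merged table: stratified row ID -> abundance (RPK) per sample.
# All values are made up; UniRef50_HYPO01 is a hypothetical gene family.
samples = ["S1", "S2", "S3", "S4"]
table = {
    "UniRef50_Q5WII6|g__Dialister.s__Dialister_invisus": [12.0, 0.0, 15.5, 0.0],
    "UniRef50_HYPO01|g__Dialister.s__Dialister_invisus": [11.0, 9.5, 14.0, 13.0],
}


def variable_families(table, samples):
    """Map each row that is present in some, but not all, samples
    to the list of samples in which it was detected (abundance > 0)."""
    result = {}
    for row_id, values in table.items():
        present = [s for s, v in zip(samples, values) if v > 0]
        if 0 < len(present) < len(samples):
            result[row_id] = present
    return result


variable = variable_families(table, samples)
for row_id, present in variable.items():
    print(row_id, "detected in", present)
```

Here only UniRef50_Q5WII6 is reported, because it is detected in two of the four samples that contain the species.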

humann2_strain_profiler first looks for (species, sample) pairs where (i) a large number of gene families within the species were identified (default: 500) and (ii) the mean abundance of the detected genes was high (default: mean > 10 RPK, i.e. reads per kilobase). For pairs that meet these criteria, we can infer that absent gene families are likely to be truly absent, as opposed to undersampled. Simulations suggest that the cutoff of 10 RPK results in a false negative rate below 0.001 (i.e. for every 1000 genes identified as absent, at most one would be present but missed due to undersampling). For a given species, if at least two samples pass these criteria, the species and its passing samples are sliced from the merged table and saved as a strain profile.
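
The selection logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the humann2_strain_profiler implementation: the function names, the data layout (one dict of gene family rows per species), and the treatment of the family-count threshold as inclusive are all assumptions. The demo uses a tiny threshold of 2 gene families so that the toy data stays readable.

```python
def passing_samples(species_rows, samples, min_families=500, min_mean=10.0):
    """Return the samples in which one species passes both criteria.

    species_rows: {gene_family: [abundance per sample]} for a single species.
    A sample passes if at least `min_families` gene families are detected
    (abundance > 0; inclusive bound is an assumption) and the mean abundance
    of the detected families exceeds `min_mean`.
    """
    passing = []
    for i, sample in enumerate(samples):
        detected = [vals[i] for vals in species_rows.values() if vals[i] > 0]
        if (detected
                and len(detected) >= min_families
                and sum(detected) / len(detected) > min_mean):
            passing.append(sample)
    return passing


def strain_profile(species_rows, samples, **thresholds):
    """Slice the passing samples out of the species table.

    Returns (kept_samples, sliced_rows), or None when fewer than two
    samples pass, since no strain profile is written in that case.
    """
    keep = passing_samples(species_rows, samples, **thresholds)
    if len(keep) < 2:
        return None
    idx = [samples.index(s) for s in keep]
    sliced = {gf: [vals[i] for i in idx] for gf, vals in species_rows.items()}
    return keep, sliced


# Toy data (invented values, hypothetical gene family names).
demo_samples = ["A", "B", "C"]
demo_rows = {
    "gf1": [20.0, 5.0, 30.0],
    "gf2": [15.0, 0.0, 0.0],
    "gf3": [12.0, 3.0, 25.0],
}
demo_profile = strain_profile(demo_rows, demo_samples, min_families=2, min_mean=10.0)
```

In the toy data, sample B has two detected families but a mean abundance of only 4, so it is excluded and the profile keeps samples A and C.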

Strain profiles can be additionally restricted to a subset of species (e.g. those from a particular genus) or to gene families with a high level of variability in the population (e.g. present in fewer than 80% of samples but more than 20% of samples). Additional thresholds (e.g. the minimum non-zero mean) can be configured with command line parameters.
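
The two restrictions can be sketched in Python along the same lines. Again, this is an illustration rather than the tool's code: the function names are hypothetical, the genus match assumes stratified row IDs of the form "genefamily|g__Genus.s__Species", and the 20% and 80% bounds are the example values from the text, applied as strict inequalities.

```python
def prevalence(values):
    """Fraction of samples in which a gene family is detected (abundance > 0)."""
    return sum(1 for v in values if v > 0) / len(values)


def filter_by_prevalence(rows, lower=0.2, upper=0.8):
    """Keep gene families whose prevalence lies strictly between the bounds."""
    return {gf: vals for gf, vals in rows.items()
            if lower < prevalence(vals) < upper}


def restrict_to_genus(table, genus):
    """Keep stratified rows whose taxon starts with the given genus.

    Assumes row IDs look like 'UniRef50_X|g__Genus.s__Species';
    rows without a '|' (unstratified totals) are skipped.
    """
    kept = {}
    for row_id, vals in table.items():
        if "|" not in row_id:
            continue
        taxon = row_id.split("|", 1)[1]
        if taxon.split(".")[0] == "g__" + genus:
            kept[row_id] = vals
    return kept


# Toy data (invented values, hypothetical gene family and species names).
demo = {
    "gfA|g__Streptococcus.s__Streptococcus_mitis": [1.0, 1.0, 1.0, 1.0, 1.0],
    "gfB|g__Streptococcus.s__Streptococcus_mitis": [1.0, 0.0, 1.0, 0.0, 0.0],
    "gfC|g__Dialister.s__Dialister_invisus": [1.0, 0.0, 1.0, 0.0, 0.0],
    "gfD": [3.0, 1.0, 2.0, 0.0, 0.0],
}
strep_only = restrict_to_genus(demo, "Streptococcus")
variable_strep = filter_by_prevalence(strep_only)
```

With the toy data, gfA is dropped because it is present in every sample, leaving only gfB (prevalence 0.4) for the Streptococcus genus.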