view humann2_strain_profiler.xml @ 4:1595f7cadbf9 draft

"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/humann2 commit 55eb503f50c54695ec36c3d4671c2b3e64d05f40"
author iuc
date Fri, 05 Feb 2021 18:28:23 +0000
parents eb13789029de
children
line wrap: on
line source

<tool id="humann2_strain_profiler" name="Make strain profiles" version="@WRAPPER_VERSION@.1">
    <description></description>
    <macros>
        <import>humann2_macros.xml</import>
    </macros>
    <expand macro="stdio"/>
    <expand macro="requirements"/>
    <expand macro="version"/>
    <command detect_errors="exit_code"><![CDATA[
humann2_strain_profiler
    --input '$input'
    --critical_mean '$critical_mean'
    --critical_count '$critical_count'
    --pinterval '$pinterval_1' '$pinterval_2'
    --critical_samples '$critical_samples'
    #if str($limit) != ''
        --limit '$limit'
    #end if
    ]]></command>
    <inputs>
        <param argument="--input" type="data" format="tsv,tabular,biom1" label="Merged gene families output for two or more samples"/>
        <param argument="--critical_mean" type="float" value="10.0" label="Default mean non-zero gene abundance for inclusion"/>
        <param argument="--critical_count" type="integer" value="500" label="Default non-zero number of genes for inclusion"/>
        <param name="pinterval_1" type="float" value="1e-10" label="Low prevalence threshold" help="Only genes with prevalence higher than the threshold are allowed"/>
        <param name="pinterval_2" type="float" value="1" label="High prevalence threshold" help="Only genes with prevalence lower than the threshold are allowed"/>
        <param argument="--critical_samples" type="integer" value="2" label="Threshold number of samples having strain"/>
        <param argument="--limit" type="text" value="" optional="true" label="Limit output to species matching a particular pattern?" help="e.g. 'Streptococcus'"/>
    </inputs>
    <outputs>
        <collection name="output" type="list" label="${tool.name} on ${on_string}">
            <discover_datasets pattern="(?P&lt;designation&gt;.+)-strain_profile.tsv" format="tsv" directory="."/>
        </collection>
    </outputs>
    <tests>
        <test>
            <param name="input" value="strain_profiler-input.txt"/>
            <param name="critical_mean" value="1"/>
            <param name="critical_count" value="2"/>
            <param name="pinterval_1" value="1e-10"/>
            <param name="pinterval_2" value="1"/>
            <param name="critical_samples" value="2"/>
            <output_collection name="output" type="list">
                <element name="s1" md5="09b0645f058ecdaccb3af12f655198a0"/>
                <element name="s2" md5="935698addd30312500b3cb1139c7d24b"/>
            </output_collection>
        </test>
    </tests>
    <help><![CDATA[
@HELP_HEADER@

This script is currently at an experimental stage. Please use with caution.

The HUMAnN2 script humann2_strain_profiler can help explore strain-level variation in your data. This approach assumes you have run HUMAnN2 on a series of samples and then merged the resulting genefamilies.tsv tables with humann2_merge_tables. Cases will arise in which the same species was detected in two or more samples, but gene families within that species were not consistently present across samples. For example, four samples may contain the species Dialister invisus, but only two samples contain the gene family UniRef50_Q5WII6 within Dialister invisus. This is a form of strain-level variation in the Dialister invisus species: one which we can connect directly to function based on annotations of the UniRef50_Q5WII6 gene family.

humann2_strain_profiler first looks for (species, sample) pairs where (i) a large number of gene families within the species were identified (default: 500) and (ii) the mean abundance of detected genes was high (default: mean > 10 RPK). For species that meet these criteria, we can infer that absent gene families are likely to be truly absent, as opposed to undersampled. Simulations suggest that the cutoff of 10 RPK results in a false negative rate below 0.001 (i.e. for every 1000 genes identified as absent, at most one would be present but missed due to undersampling). For a given species, if at least two samples pass these criteria, the species and passing samples are sliced from the merged table and saved as a strain profile.

Strain profiles can be additionally restricted to a subset of species (e.g. those from a particular genus) or to gene families with a high level of variability in the population (e.g. present in fewer than 80% of samples but more than 20% of samples). Additional thresholds (e.g. the minimum non-zero mean) can be configured with command line parameters.
    ]]></help>
    <expand macro="citations"/>
</tool>