Mercurial > repos > iuc > chewbbaca_createschema

<tool id="chewbbaca_createschema" name="chewBBACA CreateSchema" version="@CHEW_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
    <description>Create a gene-by-gene schema</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements" />
    <command detect_errors="exit_code"><![CDATA[
        #import re
        mkdir 'input' &&
        #for $file in $input_file
        #set escaped_element_identifier = re.sub('[^\w\-]', '_', str($file.element_identifier))
        ln -sf '$file' 'input/${escaped_element_identifier}.${file.ext}' &&
        #end for
        chewBBACA.py CreateSchema
            #if $training_file:
                --ptf '$training_file'
            #end if
            $cds_input
            @COMMON_INPUT@
            --pm $prodigal_mode
            -i 'input' -o 'output' &&
        cd 'output/' &&
        zip -r schema_seed.zip 'schema_seed'
    ]]></command>
    <inputs>
        <param format="fasta" name="input_file" type="data" multiple="True" label="Genome assemblies in FASTA format"/>
        <section name="advanced" title="Advanced options">
            <param argument="--training-file" type="data" format="binary" label="Prodigal training file" optional="true" />
            <param argument="--cds-input" type="boolean" truevalue="--cds-input" falsevalue="" checked="false" label="CDS input" optional="true"/>
            <param argument="--minimum-length" type="integer" min="0" value="201" label="Minimum sequence length value"/>
            <expand macro="common_param" />
            <param argument="--prodigal-mode" type="select" label="Prodigal Mode" help="&quot;single&quot; for finished genomes, reasonable quality draft genomes and big viruses. &quot;meta&quot; for metagenomes, low quality draft genomes, small viruses, and small plasmids">
                <option value="single" selected="true">
                        single
                    </option>
                    <option value="meta">
                        meta
                    </option>
            </param>
        </section>
        <section name="output" title="Output options">
            <param name="show_cds_invalid" type="boolean" truevalue="true" falsevalue="false" checked="false" label="Output invalid CDS file?"/>
            <param name="show_cds_coord" type="boolean" truevalue="true" falsevalue="false" checked="false" label="Output CDS coordinates File?"/>
        </section>
    </inputs>
    <outputs>
        <data format="zip" name="schema" from_work_dir="output/schema_seed.zip" label="${tool.name} on ${on_string}: Schema files"/>
        <data format="txt" name="txt_file" from_work_dir="output/invalid_cds.txt" label="${tool.name} on ${on_string}: Invalid CDS">
            <filter>output['show_cds_invalid']</filter>
        </data>
        <data format="tabular" name="tsv_file" from_work_dir="output/cds_coordinates.tsv" label="${tool.name} on ${on_string}: CDS coordinates">
            <filter>output['show_cds_coord']</filter>
        </data>
    </outputs>
    <tests>
        <test expect_num_outputs="1">
            <param name="input_file" value="GCA_000007265.1_ASM726v1_genomic.fna"/>
            <output name="schema">
                <assert_contents>
                    <has_archive_member path="schema_seed/.*\.fasta" n="204"/>
                    <has_archive_member path="schema_seed/short/.*\.fasta" n="102"/>
                    <has_archive_member path="schema_seed/\.schema_config"/>
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="input_file" value="GCA_000007265.1_ASM726v1_genomic"/>
            <output name="schema">
                <assert_contents>
                    <has_archive_member path="schema_seed/.*\.fasta" n="204"/>
                    <has_archive_member path="schema_seed/short/.*\.fasta" n="102"/>
                    <has_archive_member path="schema_seed/\.schema_config"/>
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="input_file" value="GCA_000007265.1_ASM726v1_genomic.fna"/>
            <param name="training_file" value="Streptococcus_agalactiae.trn"/>
            <output name="schema">
                <assert_contents>
                    <has_archive_member path="schema_seed/.*\.fasta" n="198"/>
                    <has_archive_member path="schema_seed/short/.*\.fasta" n="99"/>
                    <has_archive_member path="schema_seed/\.schema_config"/>
                </assert_contents>
            </output>
        </test>
        <test expect_num_outputs="1">
            <param name="input_file" value="CDS_Str_agalactiae.fasta"/>
            <param name="cds_input" value="true"/>
            <output name="schema">
                <assert_contents>
                    <has_archive_member path="schema_seed/CDS-Str-agalactiae-fasta-protein1.fasta"/>
                </assert_contents>
            </output>
        </test>
    </tests>
    <help>

chewBBACA is a software suite for the creation and evaluation of core genome and whole genome MultiLocus Sequence Typing (cg/wgMLST) schemas and results.

A Schema is a pre-defined set of loci that is used in MLST analyses. Traditional MLST schemas relied in 7 loci that were internal fragments of housekeeping genes and each locus was defined by its amplification by a pair of primers yielding a fragment of a defined size.

In genomic analyses, schemas are a set of loci that are:

- Present in the majority of strains for core genome (cg) MLST schemas, typically a threshold of presence in 95% of the strains is used in schema creation. The assumption is that in each strain up to 5% of loci may not be identified due to sequencing coverage problems, assembly problems or other issues related to the use of draft genome assemblies.
- Present in at least one of the analyzed strains in the schema creation for pan genome/whole genome (pg/wg) MLST schemas.
- Present in less than 95% of the strains for accessory genome (ag) MLST schemas.

.. class:: infomark

**Note**

These definitions are always operational in nature, in the sense that the analyses are performed on a limited number of strains representing part of the biological diversity of a given species or genus and are always dependent on the definition of thresholds.

.. class:: infomark

**Important**

The use of a prodigal training file for schema creation is highly recommended.

.. class:: infomark

**Important**

If you provide the **--cds-input** parameter, chewBBACA assumes that the input FASTA files contain coding sequences and skips the gene prediction step with Prodigal. To avoid issues related with the format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences inside a file named GCF_000007125.1_ASM712v1_cds_from_genomic.fna are renamed to GCF_000007125-protein1, GCF_000007125-protein2, …, GCF_000007125-proteinN).

    </help>
    <expand macro="citations" />
</tool>
author	iuc
date	Fri, 07 Jun 2024 14:27:26 +0000
parents	4e61ec4fd5f5
children	ae1374a5ecdf