view squirrel-qc.xml @ 1:cd668d67431e draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/main/tools/squirrel commit d684b71bf5129645fe8eb349a56fcb29c321a7ab
author iuc
date Tue, 11 Feb 2025 19:01:20 +0000
parents f63d24309f49
children
line wrap: on
line source

<tool id="squirrel_qc" name="Squirrel QC" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="21.05">
    <description>QC of MPXV (Mpox virus) sequences</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="xrefs"/>
    <expand macro="requirements"/>
    <expand macro="version_command"/>

    <command detect_errors="exit_code"><![CDATA[
      #set $mask_output = 'input.suggested_mask.csv'
      #set $exclude_output = 'suggested_to_exclude.csv'

      ln -s '${sequences}' input.fasta &&

      squirrel
      --seq-qc
      --clade $clade

      --threads \${GALAXY_SLOTS:-1}

      input.fasta &&

      mv '${mask_output}' '$mask' &&
      mv '${exclude_output}' '$exclude'
    ]]></command>

    <inputs>
        <param name="sequences"
          type="data"
          format="fasta"
          label="Sequences in FASTA format" />
        <param name="clade"
          type="select"
          label="Select MPXV Clade">
          <option value="cladei">Clade I</option>
          <option value="cladeia">Clade Ia</option>
          <option value="cladeib">Clade Ib</option>
          <option value="cladeii">Clade II</option>
          <option value="cladeiia">Clade IIa</option>
          <option value="cladeiib">Clade IIb</option>
        </param>

    </inputs>

    <outputs>
      <!-- standard outputs-->
      <data name="mask" format="csv" label="${tool.name} - flagged mutations to mask" />
      <data name="exclude" format="csv" label="${tool.name} - flagged sequences to exclude" />
    </outputs>

    <tests>
        <test expect_num_outputs="2">
          <param name="sequences" value="test-sequences.fasta" />
          <param name="clade" value="cladeii" />
          <output name="mask" file="sequences.suggested_mask.csv" />
          <output name="exclude" file="suggested_to_exclude.csv" />
        </test>
    </tests>
    <help><![CDATA[
  Squirrel in QC mode can run quality control (QC) on the alignment and flag certain sites to the user that may need to be masked. Squirrel can flag potential issues in the MPXV sequences that have been provided for alignment (e.g. SNPS near tracts of N, clusters of unique SNPs, reversions to reference alleles and convergent mutations) and outputs these in a mask file for investigation. 

  It is recommended that the user looks at these sites in an alignment viewer to judge whether the sites should be masked or not. 

  Squirrel with check within the alignment for:

  - Mutations that are adjacent to N bases

  The rationale for this is that N sites are usually a product of low coverage regions. Mutations that occur directly adjacent to low coverage regions may be a result of mis-alignment prior to the low coverage masking and may not be real SNPs. In squirrel, non-majority alleles that are present next to an N are flagged as potential sites for masking

  - Unique mutations that clump together

  If mutations are observed in only a single sequence in the genome, they are classed as unique mutations. Usually mutations do not clump closely together and may suggest an alignment or assembly issue. If these mutations are not shared with any other sequences, they are flagged for masking.

  - Sequences with a high N content
  
  Sequences that have many ambiguous bases in them are flagged that they may want to be excluded in further analysis. This may not always be appropriate, often genomes that have a lot of ambiguity can still be informative, however if there is something unusual about a sequence, having lots of ambiguities can be a flag for wider problems (like low read count during assembly).
    ]]></help>
<expand macro="citations"/>
</tool>