view immuneml_simulate_dataset.xml @ 17:b63bd4447c24 draft

"planemo upload commit baf8a7b46ce256e676d2c14b18ab58ee6f5b3cca"
author immuneml
date Fri, 04 Feb 2022 14:25:21 +0000
parents 45ca02982e1f
children
line wrap: on
line source

<tool id="immuneml_simulate_dataset" name="Simulate a synthetic immune receptor or repertoire dataset" version="@VERSION@.0">
  <description></description>
    <macros>
        <import>prod_macros.xml</import>
    </macros>
    <expand macro="requirements" />
  <command><![CDATA[

      cp "$yaml_input" yaml_copy &&
      immune-ml ./yaml_copy ${html_outfile.files_path} --tool DataSimulationTool &&

      mv ${html_outfile.files_path}/index.html ${html_outfile} && 
      mv ${html_outfile.files_path}/immuneML_output.zip $archive

      ]]>
  </command>
  <inputs>
      <param name="yaml_input" type="data" format="txt" label="YAML specification" multiple="false"/>
  </inputs>
    <outputs>
        <data format="zip" name="archive" label="Archive: dataset simulation"/>
        <data format="immuneml_receptors" name="html_outfile" label="ImmuneML dataset (simulated sequences)"/>
    </outputs>


  <help><![CDATA[

        This Galaxy tool allows you to quickly make a dummy dataset.
        The tool generates a SequenceDataset, ReceptorDataset or RepertoireDataset consisting of random CDR3 sequences, which could be used for benchmarking machine learning methods or encodings,
        or testing out other functionalities.
        The amino acids in the sequences are chosen from a uniform random distribution, and there is no underlying structure in the sequences.

        You can control:

        - The amount of sequences in the dataset, and in the case of a RepertoireDataset, the amount of repertoires

        - The length of the generated sequences

        - Labels, which can be used as a target when training ML models

        Note that since these labels are randomly assigned, they do not bear any meaning and it is not possible to train a ML model with high classification accuracy on this data.
        Meaningful labels can be added using the `Simulate immune events into existing repertoire/receptor dataset <root?tool_id=immuneml_simulation>`_ Galaxy tool.

        For the exhaustive documentation of this tool and an example YAML specification, see the tutorial `How to simulate an AIRR dataset in Galaxy <https://docs.immuneml.uio.no/latest/galaxy/galaxy_simulate_dataset.html>`_.

        **Tool output**

        This Galaxy tool will produce the following history elements:

        - ImmuneML dataset (simulated sequences): a sequence, receptor or repertoire dataset which can be used as an input to other immuneML tools. The history element contains a summary HTML page describing general characteristics of the dataset, including the name of the dataset
          (which is used in the dataset definition of a yaml specification), the dataset type and size, available labels, and a link to download the raw data files.

        - Archive: dataset simulation: a .zip file containing the complete output folder as it was produced by immuneML. This folder
          contains the output of the DatasetExport instruction including raw data files.
          Furthermore, the folder contains the complete YAML specification file for the immuneML run, the HTML output and a log file.


    ]]>
  </help>

</tool>