What it does
The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output dataset.
Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu).
Notes:
Input formats
The tool accepts SAM, BAM, fastq and fastq.gz input datasets of sequenced reads and supports both single-end and paired-end data.
The recommended approach with MiModD is to store NGS datasets in SAM/BAM format with Run Metadata (see below) stored in the file header. You can use the MiModD Run Annotation and MiModD Convert tools to convert data from fastq format to SAM/BAM format while attaching run metadata to it.
While alignments directly from fastq format are supported, they are less reliable because the fastq format is less strictly specified than SAM/BAM. If the tool complains about malformed fastq input, you can likely fix the problem by converting the data to SAM/BAM format first.
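To illustrate what "malformed fastq input" can mean in practice, here is a minimal Python sketch that checks a few common conventions. It assumes four-line records (header, sequence, separator, qualities). The function name find_fastq_problems is hypothetical, and the checks are not necessarily the ones this tool performs.

```python
def find_fastq_problems(lines):
    """Return a list of (record_number, message) tuples.

    Assumes four-line FASTQ records; multi-line records are not handled.
    """
    stripped = [line.rstrip("\n") for line in lines]
    # Ignore blank lines at the end of the input.
    while stripped and stripped[-1] == "":
        stripped.pop()

    problems = []
    leftover = len(stripped) % 4
    if leftover:
        problems.append((len(stripped) // 4 + 1, "incomplete record at end of input"))

    # Check every complete four-line record.
    for start in range(0, len(stripped) - leftover, 4):
        header, sequence, separator, qualities = stripped[start:start + 4]
        record_number = start // 4 + 1
        if not header.startswith("@"):
            problems.append((record_number, "header line does not start with '@'"))
        if not separator.startswith("+"):
            problems.append((record_number, "separator line does not start with '+'"))
        if len(sequence) != len(qualities):
            problems.append((record_number, "sequence and quality lengths differ"))
    return problems


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "III\n"]  # quality string is too short
    for record_number, message in find_fastq_problems(example):
        print(f"record {record_number}: {message}")
```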
If you wish to align paired-end data directly from fastq format, the mate sequence data has to be split over two datasets, one per mate, which is the most common layout today. If you have your paired-end data as a single dataset, you may look into the FASTQ splitter and FASTQ de-interlacer tools for Galaxy, which are available from the Fastq Manipulation category of the Galaxy Tool Shed and may be able to convert your files to the expected format.
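The expected two-dataset layout can be sketched in Python as follows. The example assumes four-line records and a single dataset in which the two mates of each pair strictly alternate. The function name deinterleave_fastq is hypothetical, and this is not the implementation used by the Galaxy tools mentioned above.

```python
def deinterleave_fastq(lines):
    """Split interleaved paired-end FASTQ lines into (mate1_lines, mate2_lines).

    Assumes four-line records and strict alternation: mate 1, mate 2, mate 1, ...
    """
    if len(lines) % 4 != 0:
        raise ValueError("input does not consist of complete four-line records")
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    if len(records) % 2 != 0:
        raise ValueError("odd number of records: at least one read has no mate")

    mate1, mate2 = [], []
    for index, record in enumerate(records):
        # Even-indexed records go to the first dataset, odd-indexed to the second.
        (mate1 if index % 2 == 0 else mate2).extend(record)
    return mate1, mate2


if __name__ == "__main__":
    interleaved = [
        "@pair1/1", "AAAA", "+", "IIII",
        "@pair1/2", "CCCC", "+", "IIII",
    ]
    first, second = deinterleave_fastq(interleaved)
    print(first[0], second[0])
```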
Run Metadata
Every input file requires accompanying Run Metadata! Most importantly, this includes a read-group ID (an identifier of the sequencing run that produced the data) and a sample name (identifying the biological sample sequenced in the run).
If an input dataset does not provide this information directly (fastq datasets never do; SAM/BAM datasets may provide it in their header), you need to specify a separate SAM/BAM dataset with an appropriate header as the source of the Run Metadata.
You can use the MiModD Run Annotation tool to generate such a file.
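For orientation, the SAM format stores a read-group ID and sample name in an @RG header line with tab-separated ID and SM fields. The Python sketch below builds such a line. The function name format_read_group and the example values are hypothetical, and the header written by the MiModD Run Annotation tool may contain additional fields.

```python
def format_read_group(read_group_id, sample_name, description=None):
    """Build a SAM @RG header line with ID, SM and an optional DS field."""
    values = [read_group_id, sample_name] + ([description] if description else [])
    for value in values:
        # Header fields are tab-separated, so values must not contain tabs or newlines.
        if "\t" in value or "\n" in value:
            raise ValueError("header values must not contain tabs or newlines")

    fields = ["@RG", f"ID:{read_group_id}", f"SM:{sample_name}"]
    if description:
        fields.append(f"DS:{description}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical identifiers, chosen only for illustration.
    print(format_read_group("run001", "mutant_strain"))
```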
If a SAM/BAM input dataset already provides Run Metadata, you can still specify a different Run Metadata source, which will then overwrite the information already present in the input. This is useful, for example, to resolve read-group ID conflicts between multiple input datasets.
Each input dataset can contain reads from only a single read-group. If you would like, for example, to realign the reads in a multi-sample SAM/BAM dataset, you should first use the MiModD Sort tool to sort the data by read names (this step is only necessary for paired-end data), then split the reads into new per-read-group datasets using the MiModD Convert tool.
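Conceptually, splitting a dataset by read-group means partitioning its alignment records by their RG tag. The Python sketch below shows this for plain-text SAM lines, where optional fields start at column 12 and the read-group is stored as RG:Z:<id>. The function name split_by_read_group is hypothetical, and this is only an illustration of the idea, not how the MiModD Convert tool is implemented.

```python
from collections import defaultdict


def split_by_read_group(sam_lines):
    """Map each read-group ID to the list of alignment lines carrying it.

    Header lines (starting with '@') are skipped. Records without an RG tag
    are collected under the key None.
    """
    groups = defaultdict(list)
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        read_group = None
        # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
        for field in fields[11:]:
            if field.startswith("RG:Z:"):
                read_group = field[len("RG:Z:"):]
                break
        groups[read_group].append(line)
    return dict(groups)


if __name__ == "__main__":
    record = "\t".join(
        ["read1", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII", "RG:Z:runA"]
    )
    print(list(split_by_read_group(["@HD\tVN:1.0", record])))
```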
Several input datasets can declare identical read-group IDs and/or sample names.
Identical read-group IDs mean that the datasets were produced in the same sequencing run, as is the case, for example, with partial fastq sequencing data. In the output dataset, the corresponding reads will be merged and it will not be possible to trace back their source.
Identical sample names (but different read-group IDs) indicate that the same sample has been sequenced multiple times. In the output dataset, the corresponding reads will be tagged appropriately and tools like the MiModD Variant Calling tool will let you decide whether you want to treat them together or separately.
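The two cases above can be pictured as a two-level grouping: datasets that share a read-group ID collapse into one group, while different read-group IDs under the same sample name stay distinguishable. The following Python sketch models that bookkeeping. The function name group_inputs and the dataset names are hypothetical. Raising an error when one read-group ID is paired with two different sample names is an assumption of this sketch, not documented tool behavior.

```python
def group_inputs(datasets):
    """Group (dataset_name, read_group_id, sample_name) tuples.

    Returns {sample_name: {read_group_id: [dataset_name, ...]}}.
    """
    sample_of_read_group = {}
    grouped = {}
    for dataset_name, read_group_id, sample_name in datasets:
        # Assumption of this sketch: one read-group ID belongs to exactly one sample.
        known_sample = sample_of_read_group.setdefault(read_group_id, sample_name)
        if known_sample != sample_name:
            raise ValueError(
                f"read-group {read_group_id!r} is used for samples "
                f"{known_sample!r} and {sample_name!r}"
            )
        grouped.setdefault(sample_name, {}).setdefault(read_group_id, []).append(dataset_name)
    return grouped


if __name__ == "__main__":
    inputs = [
        ("part1.fastq", "run1", "mutant"),   # same run, partial data
        ("part2.fastq", "run1", "mutant"),   # merged with part1 in the output
        ("rerun.bam", "run2", "mutant"),     # same sample, sequenced again
    ]
    print(group_inputs(inputs))
```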
Tool Options
The section Alignment parameters lets you configure global settings for the alignment job that will be applied to all input datasets. For each input dataset, however, you can override some or all of these settings by specifying new values in the section Alignment options for this sample. Some of the alignment parameters can strongly affect alignment quality, but their effects depend heavily on the type of input sequences. You are strongly encouraged to consult the in-depth tool documentation for detailed explanations of the available options.
For additional help see these resources: