# HG changeset patch # User wolma # Date 1423661342 18000 # Node ID 6231ae8f87b83cc6b43f21fe196f2bb088b60bfa Uploaded diff -r 000000000000 -r 6231ae8f87b8 annotate_variants.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/annotate_variants.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,166 @@ + + Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff + mimodd version -q + + mimodd annotate + + "$inputfile" + + #if $str($annotool.name)=='snpeff': + --genome "${annotool.genomeVersion}" + #if $annotool.ori_output: + --snpeff-out "$snpeff_file" + #end if + #if $annotool.stats: + --stats "$summary_file" + #end if + ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} + #if $annotool.snpeff_settings.min_cov: + --minC "${annotool.snpeff_settings.min_cov}" + #end if + #if $annotool.snpeff_settings.min_qual: + --minQ "${annotool.snpeff_settings.min_qual}" + #end if + #if $annotool.snpeff_settings.ud: + --ud "${annotool.snpeff_settings.ud}" + #end if + #end if + + --ofile "$outputfile" + #if $str($formatting.oformat) == "text": + --oformat text + #end if + #if $str($formatting.oformat) == "html": + #if $formatting.formatter_file: + --link "${formatting.formatter_file}" + #end if + #if $formatting.species + --species "${formatting.species}" + #end if + #end if + + #if $str($grouping): + --grouping $grouping + #end if + --verbose + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ## default settings for SnpEff + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (annotool['name']=="snpeff" and annotool['ori_output']) + + + (annotool['name']=="snpeff" and annotool['stats']) + + + + +.. class:: infomark + + **What it does** + +The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. + +If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. + +Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. +This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. + +As output file formats HTML or plain text are supported. +In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. + +The behavior of this feature depends on: + +1) Recognition of the species that is analyzed + + You can declare the species you are working with using the *Species* text field. + If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. + If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. + +2) Available hyperlink formatting rules for this species + + When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. 
+ If you did, and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks.
+ If no matching entry is found in the file, an error will be raised.
+
+ If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species.
+ If not, no hyperlinks will be generated and the html output will look essentially like plain text.
+
+ **TIP:**
+ MiModD's internal hyperlink formatting lookup tables are maintained and extended with every new release, but since web links change frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can do two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to override the default entry for your species.
+
+.. _tell us about the problem: mailto:mimodd@googlegroups.com
+
+
diff -r 000000000000 -r 6231ae8f87b8 bamsort.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/bamsort.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,35 @@
+
+ Sort a BAM file by coordinates (or names) of the mapped reads
+ mimodd version -q
+
+ mimodd sort "$inputfile" -o "$output" --oformat $oformat $by_name
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to.
+
+Coordinate-sorted input files are expected by most downstream MiModD tools. Note, however, that the *SNAP Read Alignment* tool produces coordinate-sorted output by default, so it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order.
+
+The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem.
+
+
+
diff -r 000000000000 -r 6231ae8f87b8 cloudmap.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cloudmap.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,70 @@
+
+ with the CloudMap series of tools.
+ mimodd version -q
+
+ mimodd cloudmap "$ifile" ${run.mode}
+
+ #if $str($run.mode) != "SVD":
+ "${run.refsample}"
+ #end if
+
+ "$sample" -o "$ofile"
+
+ #if $seqdict:
+ -s "$dictfile"
+ #end if
+
+
+
+
+
+ seqdict
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The purpose of this tool is to provide compatibility of the MiModD analysis workflow with the external `CloudMap`_ *EMS Variant Density Mapping*, *Variant Discovery Mapping* and *Hawaiian Variant Mapping* tools.
+
+These tools complement MiModD by providing easily interpreted visualizations of mapping-by-sequencing analyses.
+
+The tool converts a VCF file as generated by the *Extract Variant Sites* or *VCF Filter* tools to the format expected by the *CloudMap* series of tools.
+
+Optionally, it also extracts the chromosome names and sizes and reports them in the *CloudMap* *species configuration file* format.
+Such a file is required as input to the current versions of the *CloudMap* *Hawaiian* and *Variant Density* mapping tools, if you are working with a species other than the natively supported ones (i.e., other than *C. elegans* or *A. thaliana*). + +To use the output datasets of the tool with *CloudMap*, you only have to upload them to any public Galaxy server that hosts *CloudMap* like, e.g., the main Galaxy server at https://usegalaxy.org . + +.. class:: warningmark + + EMS Variant Density Mapping is currently limited to *C. elegans* and other species with six chromosomes on the *CloudMap* side. + +More information on combining MiModD and CloudMap in mapping-by-sequencing analyses can be found in the `corresponding section of the MiModD User Guide`_. + +.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap +.. _corresponding section of the MiModD User Guide: http://mimodd.readthedocs.org/en/latest/cloudmap.html + + + diff -r 000000000000 -r 6231ae8f87b8 convert.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/convert.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,133 @@ + + between different sequence data formats + mimodd version -q + + mimodd convert + + #for $i in $mode.input_list + "${i.file1}" + #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): + "${i.file2}" + #end if + #end for + #if $str($mode.header) != "None": + --header "$(mode.header)" + #end if + --ofile "$outputname" + --iformat $(mode.iformat) + --oformat $(mode.oformat) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool converts between different file formats used for storing next-generation sequencing data. + +As input file types it can handle uncompressed or gzipped fastq, SAM or BAM format, which it can convert to SAM or BAM format. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). + + Concatenation of SAM/BAM file during conversion is currently not supported. + +4) For input in fastq format a SAM header file providing run metadata **has to be specified**. The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. + + For input in SAM/BAM format the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. + +.. 
_Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest + + + + diff -r 000000000000 -r 6231ae8f87b8 covstats.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/covstats.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,27 @@ + + Calculate coverage statistics for a BCF file as generated by the Variant Calling tool + mimodd version -q + + mimodd covstats "$ifile" --ofile "$output_vcf" + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file produced by the *Variant Calling* tool, and calculates per-chromosome read coverage from it. + +.. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + + + diff -r 000000000000 -r 6231ae8f87b8 deletion_predictor.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/deletion_predictor.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,61 @@ + + Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes + mimodd version -q + + mimodd delcall + #for $l in $list_input + "${l.bamfile}" + #end for + "$covfile" -o "$outputfile" + --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool predicts deletions from paired-end data in a two-step process: + +1) It finds regions of low-coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. + + The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. + + .. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. + +By default, the tool only reports Deletions, i.e., the subset of low-coverage regions that pass the statistical test. +If *include low-coverage regions* is selected, regions that failed the test will also be reported. + +With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs. +With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e. the reads of all samples with a shared sample name are used to identify low-coverage regions. +In the second step, however, reads will be regrouped by their read group IDs again, i.e. the statistical assessment for real deletions is always done on a per read group basis. + +**TIP:** +Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. 
+ +In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. +With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. +Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). + + + + diff -r 000000000000 -r 6231ae8f87b8 fileinfo.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fileinfo.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,34 @@ + + for supported data formats. + mimodd version -q + + mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool inspects the input file and generates a report summarizing its contents. + +It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. + + + diff -r 000000000000 -r 6231ae8f87b8 reheader.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/reheader.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,202 @@ + + + From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file + mimodd version -q + + #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": + mimodd header + #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": + #for $rginfo in $rg.rginfo.rg + #if $str($rginfo.source_id): + --rg-id "${rginfo.source_id}" + #end if + #if $str($rginfo.rg_sm): + --rg-sm "${rginfo.rg_sm}" + #end if + #if $str($rginfo.rg_cn): + --rg-cn "${rginfo.rg_cn}" + #else: + --rg-cn "" + #end if + #if $str($rginfo.rg_ds): + --rg-ds "${rginfo.rg_ds}" + #else: + --rg-ds "" + #end if + #if $str($rginfo.rg_date): + --rg-dt "${rginfo.rg_date}" + #else: + --rg-dt "" + #end if + #if $str($rginfo.rg_lb): + --rg-lb "${rginfo.rg_lb}" + #else: + --rg-lb "" + #end if + #if $str($rginfo.rg_pl): + --rg-pl "${rginfo.rg_pl}" + #else: + --rg-pl "" + #end if + #if $str($rginfo.rg_pi): + --rg-pi "${rginfo.rg_pi}" + #else: + --rg-pi "" + #end if + #if $str($rginfo.rg_pu): + --rg-pu "${rginfo.rg_pu}" + #else: + --rg-pu "" + #end if + #end for + #end if + #if $str($co.treat_co) != "ignore": + --co + #for $comment in $co.coinfo + #if $str($comment.line): + "${comment.line}" + #end if + #end for + #end if + | + #end if + mimodd reheader "$inputfile" --sq ignore + --rg ${rg.treat_rg} + #if $str($rg.treat_rg) != "ignore": + #if $str($rg.rginfo.source) == "from_file": + "${rg.rginfo.data}" + #else: + - + #end if + #for $rgmapping in $rg.rginfo.rg + #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): + "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" + #end if + #end for + #end if + + --co ${co.treat_co} + #if $str($co.treat_co) != "ignore": + - + #end if + + #set $restr = "" + #for $rename in $rg_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '"') + #end for + #if $restr + --rgm $restr + #end if + + #set $restr = "" + #for $rename in $sq_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '"') + #end for + #if $restr + --sqm $restr + #end if + + -o "$output" + + + + + + + + + + 
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool generates a copy of the BAM input file with a modified header (i.e., metadata).
+
+It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename reference sequences declared in the header.
+
+The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and by standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings.
+
+The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file, as can be generated, for example, with the *NGS Run Annotation* tool.
+
+
+
diff -r 000000000000 -r 6231ae8f87b8 sam_header.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/sam_header.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,125 @@
+
+ Create a SAM format header from run metadata for sample annotation.
+ mimodd version -q
+
+ mimodd header
+
+ --rg-id "$rg_id"
+ --rg-sm "$rg_sm"
+
+ #if $str($rg_cn):
+ --rg-cn "$rg_cn"
+ #end if
+ #if $str($rg_ds):
+ --rg-ds "$rg_ds"
+ #end if
+ #if $str($rg_date):
+ --rg-dt "$rg_date"
+ #end if
+ #if $str($rg_lb):
+ --rg-lb "$rg_lb"
+ #end if
+ #if $str($rg_pl):
+ --rg-pl "$rg_pl"
+ #end if
+ #if $str($rg_pi):
+ --rg-pi "$rg_pi"
+ #end if
+ #if $str($rg_pu):
+ --rg-pu "$rg_pu"
+ #end if
+
+ --ofile "$outputfile"
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it.
+
+The result file can be used by the *Convert* and *Reheader* tools or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information).
+
+**Note:**
+
+**MiModD requires run metadata for every input file at the Alignment step!**
+
+**Tip:**
+
+While you can run alignments from fastq input by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** that you first convert all input files to SAM/BAM format with appropriate header information, and archive all datasets in that format, before any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future.
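+
+For orientation, the resulting header file is itself a small SAM file consisting of header lines only. The following is a minimal sketch of what its read-group line might look like; the field values are invented examples, and the exact set of tags depends on which form fields you fill in::
+
+    @RG	ID:run1	SM:sampleA	LB:library_1	PL:ILLUMINA	DT:2015-02-11
+
+Here, the ``ID`` and ``SM`` tags hold the mandatory read-group identifier and sample name, while tags like ``LB`` (library), ``PL`` (platform) and ``DT`` (date) record optional run metadata. Fields within the line are tab-separated, as required by the SAM specification.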
+ + + + diff -r 000000000000 -r 6231ae8f87b8 snap_caller.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snap_caller.xml Wed Feb 11 08:29:02 2015 -0500 @@ -0,0 +1,232 @@ + + Map sequence reads to a reference genome using SNAP + mimodd version -q + + mimodd snap-batch -s + ## SNAP calls (considering different cases) + + #for $i in $datasets + "snap ${i.mode_choose.mode} '$ref_genome' + #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): +'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' + #else: +'${i.mode_choose.input.ifile}' + #end if +--ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat +--idx-seedsize '$set.seedsize' +--idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping=$set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' + #if $i.mode_choose.input.header: +--header '${i.mode_choose.input.header}' + #end if + #if $str($i.mode_choose.mode) == "paired": +--spacing '$set.sp_min' '$set.sp_max' + #end if + #if $str($set.selectivity) != "off": +--selectivity '$set.selectivity' + #end if + #if $str($set.filter_output) != "off": +--filter-output $set.filter_output + #end if + #if $str($set.sort) != "off": +--sort $set.sort + #end if + #if $str($set.mmatch_notation) == "general": +-M + #end if +--max-mate-overlap '$set.max_mate_overlap' +--verbose +" + #end for + + + + ## mandatory arguments (and mode-conditionals) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ## optional arguments + + + + + + + + ## default settings + + + + + + + + + + + + + + + + + + + + + + ## change settings + + + + + + ## paired-end specific options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. + +Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. + + Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read group or sample names in its header. 
Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output.
+
+4) Read-group information is required for every input dataset!
+
+ We generally recommend storing NGS datasets in SAM/BAM format with the run metadata in the file header. You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information.
+
+ While it is not our recommended approach, you can, if you prefer it, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool.
+
+ Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime.
+
+5) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``.
+
+.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531
+.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy
+.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest
+
+
+
diff -r 000000000000 -r 6231ae8f87b8 snp_caller_caller.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/snp_caller_caller.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,62 @@
+
+ From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information
+ mimodd version -q
+
+ mimodd varcall
+
+ "$ref_genome"
+ #for $l in $list_input
+ "${l.inputfile}"
+ #end for
+ --ofile "$output_vcf"
+ --depth "$depth"
+ $group_by_id
+ $no_md5_check
+ --verbose
+ --quiet
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool transforms the read-centered information of its aligned reads input files into position-centered information.
+
+**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**.
+
+**Notes:**
+
+By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome, check that every sequence mentioned in any BAM input file has a counterpart with a matching MD5 sum in the reference genome, and abort with an error message if that is not the case. If it finds sequences with matching checksums but different names in the reference genome, it will use the names from the reference genome file in its output.
+
+This behavior has two benefits:
+
+1) It protects from accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls.
This is the primary reason why we recommend leaving the check activated.
+
+2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data).
+
+Since there may be rare cases where you *really* want to align against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but only do this if you know exactly why you need to.
+
+-----------
+
+Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations.
+
+It exposes just a single configuration parameter of these tools: the *maximum per-BAM depth*. Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. Consider, however, that this value caps the read number per input file, so if you have a large number of samples in one input file, it may become necessary to increase the value to get sufficient reads considered per sample.
+
+
diff -r 000000000000 -r 6231ae8f87b8 snpeff_genomes.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/snpeff_genomes.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,20 @@
+
+ Checks the local SnpEff installation to compile a list of currently installed genomes
+ mimodd version -q
+
+ mimodd snpeff-genomes -o "$outputfile"
+
+
+
+
+
+.. class:: infomark
+
+**What it does**
+
+When executed, this tool searches the host machine's SnpEff installation for properly registered and installed
+genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* tool.
+
+
+
+
diff -r 000000000000 -r 6231ae8f87b8 tool_dependencies.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,75 @@
+
+
+
+
+
+
+ http://sourceforge.net/projects/mimodd/files/MiModD-0.1.5.2.tar.gz
+
+
+
+
+ pyvenv --without-pip $INSTALL_DIR/MiModD_venv
+
+ rm $INSTALL_DIR/MiModD_venv/bin/python
+
+ $INSTALL_DIR/MiModD_venv/bin/python3 setup.py install
+
+ chmod 755 $INSTALL_DIR/MiModD_venv/lib/python3.4/site-packages/MiModD/bin/*
+
+
+
+
+ $INSTALL_DIR/MiModD_venv/bin
+
+
+
+
+ $ENV[LD_LIBRARY_PATH]
+
+
+
+
+
+
+Summary: Tools for Mutation Identification in Model Organism Genomes using Desktop PCs
+Home-page: http://sourceforge.net/projects/mimodd/
+Author: Wolfgang Maier
+Author-email: wolfgang.maier@biologie.uni-freiburg.de
+License: GPL
+Download-URL: http://sourceforge.net/projects/mimodd/
+
+MiModD - Identify Mutations from Whole-Genome Sequencing Data
+*************************************************************
+
+MiModD is an integrated solution for efficient and user-friendly analysis of
+whole-genome sequencing (WGS) data from laboratory model organisms.
+It enables geneticists to identify the genetic mutations present in an organism
+starting from just raw WGS read data and a reference genome, without the help of
+a trained bioinformatician.
+
+MiModD is designed for good performance on standard hardware and enables WGS
+data analysis for most model organisms on regular desktop PCs.
+
+MiModD can be installed under Linux and Mac OS with minimal software
+requirements and a simple setup procedure.
As a standalone package it can be
+used from the command line, but it can also be integrated seamlessly and easily
+into any local installation of a Galaxy bioinformatics server, providing a
+graphical user interface, database management of results and simple composition
+of analysis steps into workflows.
+
+
diff -r 000000000000 -r 6231ae8f87b8 varextract.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/varextract.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,97 @@
+
+ from a BCF file
+ mimodd version -q
+
+ mimodd varextract "$ifile"
+ #if $len($sitesinfo)
+ -p
+ #for $source in $sitesinfo
+ "${source.pre_vcf}"
+ #end for
+ #end if
+ --ofile "$output_vcf"
+ $keep_alts
+ --verbose
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format.
+
+If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample.
+
+In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still very large list of sites to a subset with relevance to your project.
+
+**Options:**
+
+1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample.
+
+ You can select the *keep all sites with alternate bases* option if, instead, you want to extract all sites for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but it can occasionally be helpful for closer inspection of candidate genomic regions.
+
+2) During the process of variant extraction, the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples, and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or are listed in any of the additional VCF files.
+
+ Optional VCF input can be particularly useful in one of the following situations:
+
+ *scenario i* - you have prior information that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header:
+
+ ``##fileformat=VCFv4.2``
+
+ followed by positional information like in this example::
+
+  #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
+  chrI	1222	.	.	.	.	.	.
+  chrI	2651	.	.	.	.	.	.
+  chrI	3659	.	.	.	.	.	.
+  chrI	3731	.	.	.	.	.	.
+
+ Here, columns are tab-separated and ``.`` serves as a placeholder for missing information.
+
+ *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together).
+
+ This situation is often encountered with published datasets.
Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism, and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is already in VCF format, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting VCF output file will contain all SNV sites along with the variant sites found in the BCF alone. You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` columns filled in like so::
+
+  #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
+  chrI	1897409	.	A	G	.	.	.
+  chrI	1897492	.	C	T	.	.	.
+  chrI	1897616	.	C	A	.	.	.
+  chrI	1897987	.	A	T	.	.	.
+  chrI	1898185	.	C	T	.	.	.
+  chrI	1898715	.	G	A	.	.	.
+  chrI	1898729	.	T	C	.	.	.
+  chrI	1900288	.	T	A	.	.	.
+
+ In this case, the tool will assume that the corresponding sample is homozygous for each of the SNVs. If you need to distinguish between homozygous and heterozygous SNVs, you will have to extend the format to include a FORMAT and a sample column with genotype (GT) information like in this example::
+
+  #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sampleX
+  chrI	1897409	.	A	G	.	.	.	GT	1/1
+  chrI	1897492	.	C	T	.	.	.	GT	0/1
+  chrI	1897616	.	C	A	.	.	.	GT	0/1
+  chrI	1897987	.	A	T	.	.	.	GT	0/1
+  chrI	1898185	.	C	T	.	.	.	GT	0/1
+  chrI	1898715	.	G	A	.	.	.	GT	0/1
+  chrI	1898729	.	T	C	.	.	.	GT	0/1
+  chrI	1900288	.	T	A	.	.	.	GT	0/1
+
+ In this example, sampleX would be heterozygous for all SNVs except the first.
+
+ .. class:: warningmark
+
+ If the optional VCF input contains INDEL calls, these will be ignored by the tool.
+
+
+
diff -r 000000000000 -r 6231ae8f87b8 vcf_filter.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/vcf_filter.xml	Wed Feb 11 08:29:02 2015 -0500
@@ -0,0 +1,124 @@
+
+ Extracts lines from a vcf variant file based on field-specific filters
+ mimodd version -q
+
+ mimodd vcf-filter
+ "$inputfile"
+ -o "$outputfile"
+ #if len($datasets):
+ -s
+ #for $i in $datasets
+ "$i.sample"
+ #end for
+ --gt
+ #for $i in $datasets
+ ## remove whitespace from free-text input
+ "#echo ("".join($i.GT.split()) or "ANY")#"
+ #echo " "
+ #end for
+ --dp
+ #for $i in $datasets
+ "$i.DP"
+ #end for
+ --gq
+ #for $i in $datasets
+ "$i.GQ"
+ #end for
+ #end if
+ #if len($regions):
+ -r
+ #for $i in $regions
+ #if $i.stop:
+ "$i.chrom:$i.start-$i.stop"
+ #else:
+ "$i.chrom:$i.start"
+ #end if
+ #end for
+ #end if
+ #if $vfilter:
+ --vfilter
+ ## remove ',' (and possibly adjacent whitespace) and replace with ' '
+ "#echo ('" "'.join($vfilter.split(',')))#"
+ #end if
+ $vartype
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants.
+
+The following types of variant filters can be set up:
+
+1) Sample-specific filters:
+
+ Filter variants based on their characteristics in the sequenced reads of a specific sample.
Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept.
+
+2) Region filters:
+
+ Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept.
+
+3) Variant type filter:
+
+ Filter variants by their type, i.e., whether they are single nucleotide variants (SNVs) or indels.
+
+In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter.
+The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multi-sample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. Beyond that, the filter can also be used to change the order of the samples, since it will sort the samples in the order specified in the filter field.
+
+**Examples of sample-specific filters:**
+
+*Simple genotype pattern*
+
+genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant
+
+*Complex genotype pattern*
+
+genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype
+
+*Multiple sample-specific filters*
+
+Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern: 1/1
+==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant
+
+*Combining sample-specific filter criteria*
+
+genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9
+==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9
+**and** at least three reads from the sample cover the variant site
+
+**TIP:**
+
+As in the example above, genotype quality is typically most useful in combination with a genotype pattern.
+Effectively, it then makes the genotype filter more stringent.
+
+
+
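+
+To make the correspondence between these form settings and the underlying command line concrete, here is a sketch of the call that the last example above would translate into; the dataset names are hypothetical placeholders, and the actual file paths are supplied by Galaxy::
+
+ mimodd vcf-filter "input.vcf" -o "filtered.vcf" -s "sampleX" --gt "1/1" --dp "3" --gq "9"
+
+A region filter would append, e.g., ``-r "chrI:1000000-2000000"`` to the call, following the ``chrom:start-stop`` pattern visible in the command template above.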