view fastqFilter.xml @ 0:0475e4175855 draft default tip

planemo upload commit 81ece2551cea27cbd0e718ef5b7a2fe8d4abd071-dirty
author yqiancolumbia
date Mon, 30 Apr 2018 05:25:11 -0400

<tool id="fastqFilter" name="Filter FASTQ files">
  	<description></description>
  	<command interpreter="perl">
		/home/galaxy/tools/CTK/fastq_filter.pl 
		#if $index.indexRequired == "yes":
		-index $index.sequence
		#end if
		-maxN $MaxN -v -if sanger -f $Filter  -of $OutputFormat $inputfile
		$outputfile 
  	</command>

  	<inputs>
        <param name="inputfile" format="fastq"  type="data" label="Input Sanger FASTQ file (.gz file accepted; see help below for more information)" />

	<conditional name="index">
		<param name="indexRequired" type="select" label="Filter by sample index (see help below for parameter suggestion)" >
		<option value="yes">Yes</option>
		<option value="no" selected="true">No</option>
		</param>
		<when value="yes">
			<param name="sequence" type="text" value="" label="Index position and sequence" />
		</when>
		<when value="no">
		</when>
	</conditional>	

    	<param name="Filter"  type="text" value="" label="Quality score filter string; format: Method:Start-End:Score (zero-based; see help below for parameter suggestion)" />
	<param name="MaxN" type="integer" value="-1" label="Maximum number of Ns allowed in a sequence (a value below 0 disables this filter; default: -1)" />
	<param name="OutputFormat" type="select" label="Output data type">
		<option value="fastq">FASTQ</option>
		<option value="fasta">FASTA</option>
	</param>

  	</inputs>

	<outputs>
	<data name="outputfile" format="fastq" label="Read quality filtering on ${on_string}">
		<change_format>
			<when input="OutputFormat" value="fasta" format="fasta" />
		</change_format>
	</data>	
	</outputs>

	<help>

.. class:: infomark

**What this tool does**

This tool extracts the reads that pass the specified quality filters.

It takes a Sanger FASTQ file as input and writes the filtered reads to a FASTQ or FASTA file.

-----

**FASTQ format**

Check that the quality scores in your FASTQ file use the expected Sanger encoding.

See https://en.wikipedia.org/wiki/FASTQ_format#Quality for details:

* Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126. 
* Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126.

See http://www.asciitable.com/ for ASCII table.
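
As a rough illustration of the Sanger encoding described above, the following Python sketch converts a quality string into Phred scores by subtracting the ASCII offset of 33. The function names are hypothetical and are not part of this tool; the second function is only a heuristic based on the ASCII ranges listed above.

```python
SANGER_OFFSET = 33   # Sanger FASTQ encodes Phred 0-93 as ASCII 33-126
SOLEXA_FIRST = 59    # Solexa/Illumina 1.0 starts at ASCII 59


def decode_sanger(quality_line):
    """Convert a Sanger-encoded quality string into a list of Phred scores."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_line]


def looks_like_sanger(quality_line):
    """Heuristic: characters in ASCII 33-58 cannot occur in Solexa/Illumina 1.0."""
    return any(SOLEXA_FIRST > ord(ch) >= SANGER_OFFSET for ch in quality_line)


if __name__ == "__main__":
    print(decode_sanger("!5I~"))  # [0, 20, 40, 93]
```

A result of False from the heuristic does not prove that a file is not in Sanger format; it only means that no character in that line falls below ASCII 59.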

-----

**Filter by sample index (optional)**

Use this option when the input FASTQ file contains reads from multiple libraries.

For example:

If you have six samples with indexes GTCA, GCATG, ACTG, AGCT, GCATC and TCGA, you can extract the reads of each library by specifying the position at which the index starts in the read (zero-based) followed by the index sequence. For example, specify 0:GTCA to extract the reads carrying the index GTCA at position 0, and likewise for the other indexes.
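
The index parameter can be pictured with the following minimal Python sketch. It assumes that the index must match the read exactly at the given zero-based position; the matching rules actually used by the underlying script may differ, and the function names are hypothetical.

```python
def parse_index(spec):
    """Split a 'position:sequence' string such as '0:GTCA'."""
    position, sequence = spec.split(":", 1)
    return int(position), sequence.upper()


def has_index(read, spec):
    """Return True if the read carries the index at the given position.

    Assumption: an exact match is required.
    """
    position, sequence = parse_index(spec)
    return read[position:position + len(sequence)].upper() == sequence


if __name__ == "__main__":
    print(has_index("GTCAAAGGTT", "0:GTCA"))  # True
    print(has_index("GCATGAAGGT", "0:GTCA"))  # False
```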

-----

**How to set the filter**

You can apply multiple filtering criteria based on the quality scores of each read. Separate the criteria with commas.

Each criterion is composed of four components (e.g. method1:start1-end1:score1,method2:start2-end2:score2)

1. Method: min or mean, i.e. whether the threshold applies to the minimum or the mean score of the region
2. Start:  the first nucleotide to consider (0-based)
3. End:    the last nucleotide to consider (0-based)
4. Score:  the threshold required
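
The four components above can be sketched in Python as follows. This is only an illustration of how a filter string might be interpreted, not the code used by the tool: the function names are hypothetical, and the sketch assumes that a read too short to cover a region fails the filter.

```python
def parse_filter(filter_string):
    """Split 'method:start-end:score,...' into (method, start, end, score) tuples."""
    criteria = []
    for item in filter_string.split(","):
        method, region, score = item.strip().split(":")
        start, end = region.split("-")
        if method not in ("min", "mean"):
            raise ValueError("unknown method: " + method)
        criteria.append((method, int(start), int(end), float(score)))
    return criteria


def passes(scores, filter_string):
    """Return True if the Phred scores of a read satisfy every criterion."""
    for method, start, end, threshold in parse_filter(filter_string):
        region = scores[start:end + 1]  # End is inclusive and 0-based
        if len(region) != end - start + 1:
            return False  # assumption: reads shorter than the region fail
        value = min(region) if method == "min" else sum(region) / len(region)
        if threshold > value:
            return False
    return True


if __name__ == "__main__":
    scores = [30] * 5 + [20] * 25
    print(passes(scores, "min:0-4:20,mean:5-29:20"))  # True
```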

**Parameter suggestion**

* For Standard CLIP protocol filtering: mean:0-29:20 (this specifies a mean score of 20 or above in the first 30 bases, which includes 5 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag).
* For iCLIP/BrdU CLIP filtering: mean:0-38:20 (this specifies a mean score of 20 or above in the first 39 bases, which includes 14 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag). 

Filtering in this way matters because low-quality reads can introduce mapping errors and background, and they inflate the number of unique tags after removal of PCR duplicates.

For example:

If the first 5 nucleotides are a degenerate barcode, you can use min:0-4:20,mean:5-29:20, which requires a minimum quality score of 20 in the first 5 nucleotides and a mean score of 20 in the next 25 nucleotides (i.e., the first 25 nucleotides of the RNA).

	</help>
</tool>