annotate README.md @ 4:c70137414dcd draft

sickle v1.33
author nikhil-joshi
date Wed, 23 Jul 2014 18:35:10 -0400
# sickle - A windowed adaptive trimming tool for FASTQ files using quality

## About

Most modern sequencing technologies produce reads that have
deteriorating quality towards the 3'-end and some towards the 5'-end
as well. Incorrectly called bases in both regions negatively impact
assemblies, mapping, and downstream bioinformatics analyses.

Sickle is a tool that uses sliding windows along with quality and
length thresholds to determine when quality is sufficiently low to
trim the 3'-end of reads and also determines when the quality is
sufficiently high to trim the 5'-end of reads. It will also
discard reads based upon the length threshold. It takes the quality
values and slides a window across them whose length is 0.1 times the
length of the read. If this length is less than 1, then the window is
set to be equal to the length of the read. Otherwise, the window
slides along the quality values until the average quality in the
window rises above the threshold, at which point the algorithm
determines where within the window the rise occurs and cuts the read
and quality there for the 5'-end cut. Then when the average quality
in the window drops below the threshold, the algorithm determines
where in the window the drop occurs and cuts both the read and quality
strings there for the 3'-end cut. However, if the length of the
remaining sequence is less than the minimum length threshold, then the
read is discarded entirely (or replaced with an "N" record). 5'-end
trimming can be disabled.
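The windowed trimming described above can be sketched roughly as follows. This is a minimal illustration in Python, not Sickle's actual C implementation: the function name `trim_window`, its parameters, and the choice of `>=` for "rises above the threshold" are assumptions made for the example.

```python
def trim_window(quals, qual_threshold, min_length, trim_five_prime=True):
    """Return (start, end) of the region to keep, or None to discard the read.

    quals is a list of numeric quality values, one per base (hypothetical input
    format for this sketch).
    """
    n = len(quals)
    if n == 0:
        return None

    # Window is 0.1 x read length; fall back to the whole read if that is < 1.
    window = int(0.1 * n)
    if window < 1:
        window = n

    start, end = 0, n
    # With 5'-trimming disabled, the kept region always starts at base 0.
    found_start = not trim_five_prime

    for i in range(n - window + 1):
        avg = sum(quals[i:i + window]) / window

        if not found_start and avg >= qual_threshold:
            # 5'-end cut: first base in this window at or above the threshold
            # (assumption: ">=" stands in for "rises above").
            start = next(j for j in range(i, i + window)
                         if quals[j] >= qual_threshold)
            found_start = True
        elif found_start and avg < qual_threshold:
            # 3'-end cut: first base in this window below the threshold.
            end = next(j for j in range(i, i + window)
                       if quals[j] < qual_threshold)
            break

    if not found_start:
        return None  # quality never reached the threshold
    if end - start < min_length:
        return None  # remaining sequence is too short
    return start, end
```

With a result of `(start, end)`, both the read and the quality string would be sliced as `seq[start:end]` and `qual[start:end]`.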

Sickle supports three types of quality values: Illumina, Solexa, and
Sanger. Note that the Solexa quality setting is an approximation (the
actual conversion is a non-linear transformation). The end
approximation is close. Illumina quality refers to qualities encoded
with the CASAVA pipeline between versions 1.3 and 1.7. Illumina
quality using CASAVA >= 1.8 is Sanger encoded.
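To make the three quality types concrete, here is a small Python sketch of how the quality characters map to numbers. It uses the standard FASTQ conventions (Sanger is Phred+33, Illumina 1.3 to 1.7 is Phred+64, Solexa is Solexa-scaled +64) and the exact non-linear Solexa-to-Phred formula. The names `OFFSETS`, `solexa_to_phred`, and `decode_qualities` are hypothetical, and this is not the approximation Sickle itself applies.

```python
import math

# ASCII offsets for each encoding (standard FASTQ conventions).
OFFSETS = {"sanger": 33, "illumina": 64, "solexa": 64}


def solexa_to_phred(q_solexa):
    """Exact (non-linear) conversion from a Solexa score to a Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_qualities(qual_string, qual_type):
    """Turn a FASTQ quality string into a list of Phred-scaled numbers."""
    offset = OFFSETS[qual_type]
    raw = [ord(c) - offset for c in qual_string]
    if qual_type == "solexa":
        # Solexa scores are on a different scale and need converting.
        return [solexa_to_phred(q) for q in raw]
    return raw
```

The two scales nearly coincide at high quality and diverge at low quality, which is why a simple approximation of the Solexa conversion ends up close.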

Note that Sickle will remove the 2nd fastq record header (on the "+"
line) and replace it with simply a "+". This is the default format for
CASAVA >= 1.8.

Sickle also supports gzipped file inputs and optional gzipped outputs. By default,
Sickle will produce regular (i.e. not gzipped) output, regardless of the input.
Sickle also has an option to truncate reads with Ns at the first N position.
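As an illustration of the "+" line replacement and the N truncation mentioned above, the following Python sketch applies both to a single record. The function name `normalize_record` and its `truncate_n` parameter are hypothetical, and the sketch ignores how truncation interacts with window trimming in Sickle itself.

```python
def normalize_record(header, seq, plus, qual, truncate_n=False):
    """Return a FASTQ record with a bare "+" line, optionally cut at the first N."""
    if truncate_n:
        pos = seq.find("N")
        if pos != -1:
            # Cut the sequence and the quality string at the same position.
            seq, qual = seq[:pos], qual[:pos]
    # The second header (after "+") is dropped regardless of its content.
    return header, seq, "+", qual
```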

There is also a sickle.xml file included in the package that can be used to add sickle to your
local [Galaxy](http://galaxy.psu.edu/) server.

## Citation
Sickle doesn't have a paper, but you can cite it like this:

Joshi NA, Fass JN. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files
(Version 1.33) [Software]. Available at https://github.com/najoshi/sickle.

## Requirements

Sickle requires a C compiler; GCC or clang are recommended. Sickle
relies on Heng Li's kseq.h, which is bundled with the source.

Sickle also requires Zlib, which can be obtained at
<http://www.zlib.net/>.

## Building and Installing Sickle

To build Sickle, enter:

    make

Then, copy or move "sickle" to a directory in your $PATH.

## Usage

Sickle has two modes, one for single-end and one for paired-end
reads: `sickle se` and `sickle pe`.

Running sickle by itself will print the help:

    sickle

Running sickle with either the "se" or "pe" commands will give help
specific to those commands:

    sickle se
    sickle pe

### Sickle Single End (`sickle se`)

`sickle se` takes an input fastq file and outputs a trimmed version of
that file. It also has options to change the length and quality
thresholds for trimming, as well as to disable 5'-trimming and enable
truncation of sequences with Ns.

#### Examples

    # Default thresholds
    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq
    # Quality threshold of 33 and length threshold of 40
    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq -q 33 -l 40
    # Disable 5'-trimming (-x) and truncate sequences with Ns (-n)
    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq -x -n
    # Gzipped output (-g)
    sickle se -t sanger -g -f input_file.fastq -o trimmed_output_file.fastq.gz

### Sickle Paired End (`sickle pe`)

`sickle pe` can operate with two types of input. First, it can take
two paired-end files as input and output two trimmed paired-end files
as well as a "singles" file. The second form starts with a single
combined input file of reads where you have already interleaved the
reads from the sequencer. In this form, you also supply a single
output file name as well as a "singles" file. The "singles" file
contains reads that passed filter in either the forward or reverse
direction, but not the other. Finally, there is an option (-M) to only
produce one interleaved output file where any reads that did not pass
filter will be output as a FastQ record with a single "N" (whose quality
value is the lowest possible based upon the quality type), thus
preserving the paired nature of the data. You can also change the length
and quality thresholds for trimming, as well as disable 5'-trimming and
enable truncation of sequences with Ns.
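The routing of a read pair between the paired, "singles", and `-M` style outputs can be pictured with the Python sketch below. It is an illustration only: the function `route_pair`, the `(header, seq, qual)` tuple layout, and the `keep_pairs_with_n` and `low_qual_char` parameters are assumptions for the example. The default `"!"` is the lowest quality character in Sanger encoding; other quality types would need a different character.

```python
def route_pair(fwd, rev, fwd_ok, rev_ok, keep_pairs_with_n=False, low_qual_char="!"):
    """Decide where one read pair goes after trimming.

    fwd and rev are (header, seq, qual) tuples; fwd_ok and rev_ok say whether
    each read passed the filter. Returns (destination, records).
    """
    if keep_pairs_with_n:
        # -M style: failed reads become a single "N" so pairing is preserved.
        if not fwd_ok:
            fwd = (fwd[0], "N", low_qual_char)
        if not rev_ok:
            rev = (rev[0], "N", low_qual_char)
        return "paired", [fwd, rev]

    if fwd_ok and rev_ok:
        return "paired", [fwd, rev]
    if fwd_ok:
        return "singles", [fwd]
    if rev_ok:
        return "singles", [rev]
    return "dropped", []
```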

#### Examples

    # Separate forward and reverse inputs; paired outputs plus a singles file
    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
        -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
        -s trimmed_singles_file.fastq

    # Same, with a quality threshold of 12 and a length threshold of 15
    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
        -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
        -s trimmed_singles_file.fastq -q 12 -l 15

    # Same, truncating sequences with Ns (-n)
    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
        -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
        -s trimmed_singles_file.fastq -n

    # Interleaved input (-c) and interleaved output (-m), plus a singles file
    sickle pe -c combo.fastq -t sanger -m combo_trimmed.fastq \
        -s trimmed_singles_file.fastq -n

    # Gzipped output (-g)
    sickle pe -t sanger -g -f input_file1.fastq -r input_file2.fastq \
        -o trimmed_output_file1.fastq.gz -p trimmed_output_file2.fastq.gz \
        -s trimmed_singles_file.fastq.gz

    # Interleaved input; one interleaved output with failed reads kept as "N" records (-M)
    sickle pe -c combo.fastq -t sanger -M combo_trimmed_all.fastq
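The `-c` examples assume the forward and reverse reads are already interleaved into one file. If they are not, a combined file could be produced with something like the Python sketch below. It assumes plain four-line FASTQ records (no wrapped sequence lines) and reads in the same order in both inputs; `read_records` and `interleave` are hypothetical helper names, not part of Sickle.

```python
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as lists of four lines each."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return  # end of input (or a truncated trailing record)
        yield rec


def interleave(fwd_lines, rev_lines):
    """Yield lines alternating one forward record, then its reverse mate."""
    for fwd, rev in zip(read_records(fwd_lines), read_records(rev_lines)):
        yield from fwd
        yield from rev
```

For real files, `fwd_lines` and `rev_lines` would be the two open file objects, and the yielded lines would be written to the combined output.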