comparison README @ 0:acc8d8bfeb9a

Uploaded
author jjohnson
date Wed, 08 Feb 2012 16:59:24 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:acc8d8bfeb9a
1 This document has the information on how to run CREST for structural
2 variation detection.
3
4 =============
5 Requirements:
6 =============
7 Before running CREST, you need to make sure that several pieces of software
8 and/or modules are installed on the system:
9 1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT
10 can be obtained from these links:
11 BLAT for academic use: http://www.soe.ucsc.edu/~kent
12 BLAT commercial license: http://www.kentinformatics.com/
13 2. CAP3 assembly program, available here:
14 CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html
15 CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech,
16 rakolehm@mtu.edu or (906)487-2228.
17 3: SAMtools library for accessing SAM/BAM files, available from SourceForge:
18 SAMtools: http://sourceforge.net/projects/samtools/files/
19 4. BioPerl and Bio::DB::Sam modules. They are usually available as
20 packages on most Linux distributions, but are also available at this link:
21 BioPerl: http://www.bioperl.org/
22 Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm
23 Important: you must install SAMtools library before install Bio::DB::Sam.
24 5. ptrfinder is needed if you want to remove short tandem repeat mediated
25 SVs, the executable is included in the download package, put it on the path.
26
27 Note:
28 1. You can use your own programs in place of BLAT and CAP3, but you need to
29 implement the run method in SVExtTools.pm.
30 2. The pipeline uses gfServer to mimic a standard blat server, so you need
31 to setup your own gfServer. Details on setting up the server can be found in
32 the BLAT package. Using a query server can significantly increase the speed of
33 the pipeline.
34
35 Your BAM files must contain soft-clipping signatures at the breakpoints. If
36 they do not, you will not get any results. For more information see the
37 section "About Soft-Clipping" at the end of this document.
38
39 =====================
40 Running the pipeline:
41 =====================
42
43 Make sure that all the required perl modules are in @INC. One simple way
44 is to put all .pm and .pl scripts in the same directory and run them from
45 this same directory. Also, the input bam file, must be sorted and indexed
46 before running the pipeline.
47
48 We are going to use two sample bam files (tumor.bam and germline.bam) to
49 illustate how to run the pipeline. The examples assume you want to find SV
50 in tumor.bam and you also have the matched germline sample bam file.
51
52 Important: indexing all bam files before running the pipeline is required.
53
54 1. Get soft-clipping positions.
55
56 The program extractSClip.pl will extract all soft-clipping positions first,
57 and identify those positions with a cluster of soft-clipped reads. The
58 program requires only the BAM file and the reference genome's FASTA file
59 The following is an example to extract all positions:
60
61 extractSClip.pl -i tumor.bam --ref_genome hg18.fa
62
63 Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated
64 for use in the next step.
65
66 Note: The program can use either paired or single-end sequencing data. For
67 single-end data, use the --nopaired parameter.
68
69 For whole genome sequencing project, we highly suggest running the procedure
70 in parallel by dividing the genome into pieces. One natural way is by
71 chromosome. The following is an example to extract all positons on chr4.
72
73 extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4
74
75 Important: The genome file used in this pipeline must be the same as the one
76 used to map reads, so the chromosome names need to agree. In this example,
77 the genome file and bam file all have the chromosome name as 4 instead of
78 chr4 you may encounter.
79
80 Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated
81 for use in the next step. So it's very easy to run this step in parallel and
82 combine the results together to form a final result.
83
84 The output files for this step have names with suffixes of *.cover and
85 *.sclip.txt. The .cover file is a tab-delimited text file, with columns:
86 chr, position, strand, number of soft-clipped reads, and coverage at that
87 position. The strand is just left-clipped or right-clipped to help identify
88 the SV orientation. The .sclip.txt file has the detailed information for
89 all soft-clipped reads including sequence and quality values. This file is
90 also tab-delimited with the following columns: chr, posiiton, strand, read
91 name, sequence, and quality.
92
93 Example of part of a *.cover file:
94 4 125892327 + 1 28
95 4 125892458 + 1 27
96 4 125893225 + 1 28
97 4 125893227 + 5 29
98 4 125893365 - 1 26
99 4 125893979 - 1 16
100 10 66301086 - 1 33
101 10 66301858 + 4 14
102 10 66301865 - 8 21
103 10 66301871 - 1 22
104 10 66302136 + 1 51
105
106 Example of part of a *.sclip.txt file:
107 4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC
108 4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC
109 4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC
110
111 If you run this step in parallel, you need to combine the outputs by
112 concatenating the files. Tumor and germline files must be concatenated
113 separately, for example:
114 cat tumor.bam.*.cover > tumor.bam.cover
115 cat germline.bam.*.cover > germline.bam.cover
116
117 2.Remove germline events (optional)
118
119 Running step 1 on both germline and tumor samples, you will get the
120 soft-clipping posiitons in both samples. This step will remove any position
121 in the tumor sample that also appears in germline sample, so germline events
122 will be removed. This step does not use any sequence information and could
123 remove true events. By our observations, true events are rarely removed.
124 You can skip this step and the program will do germline clean up at later step
125 (see the -g parameter for CREST.pl).
126
127 The script for this step is countDiff.pl and it only requires two parameters
128 to specity the two output files from previous step.
129
130 countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt
131
132 A file named tumor.bam.cover.somatic.cover will be generated for next step.
133
134 The program will generate a file with suffix *.somatic.cover, and it will
135 be used for the next step. The file has the same format as *.cover generated
136 in the previous step.
137
138 The standard output will show the coverage distribution. For every read count
139 in the range 1-999, it will show the number of breakpoints supported by that
140 many soft-clipped reads.
141
142 3. Running the SV detection script.
143
144 This is the core step in the detection process. The program is CREST.pl.
145
146 The program needs quite a few parameters, but you can think about what you
147 will need. Here is a partial list of required and common parameters:
148
149 -f The input soft-clipped coverage file produced in step 1 or 2.
150 -d The disease or tumor bam file
151 -g The germline bam file. If you want to identify somatic SVs only, you
152 should provide this parameter. If you also want to identify germline
153 events, you can leave this parameter unspecified. When treat your
154 germline file as disease without specify -g parameter, the program can
155 be used to identify germline events, or SV polymorphism.
156 --ref_genome The reference genome in fa format (used by bam file)
157 -t The reference genome in 2bit format (used by gfClient), this file can
158 be generated by using faToTwoBit program in BLAT program suit. This
159 file must be the same as the one you used to setup gfServer.
160 --blatserver The name or IP address of blat server, you need to use your
161 own one instead of using the public one at UCSC.
162 --blatport The port number for the blat server.
163 --nopaired Tell the program the reads are not paired.
164
165 If all of the required programs are on the path then you won't need to
166 specify them again, otherwise you need to specify the paths to the programs
167 using the corresponding parameters. Please use CREST.pl --man to show the
168 man page, which provides a detailed parameter list.
169
170 An example of running this step is:
171 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
172 hg18.fa -t hg18.2bit
173
174 There is also a -r parameter to specify the range to be searched and it's highly
175 recommended to run using -r as below:
176 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
177 hg18.fa -t /genome/hg18.2bit -r chr1
178 So it's very easy to run the program in parallel by spliting the genome into
179 pieces.
180
181 The program will generate a *.predSV.txt file. The filename will be the input
182 bam with .predSV.txt appended unless you specify the -p parameter. Also the
183 STDERR output has the full list of SVs, including rejected ones. The output
184 file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos,
185 left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand,
186 # right soft-clipped reads, SV type, coverage at left_pos, coverage at
187 right_pos, assembled length at left_pos, assembled length at right_pos,
188 average percent identity at left_pos, percent of non-unique mapping reads at
189 left_pos, average percent identity at right_pos, percent of non-unique mapping
190 reads at right_pos, start position of consensus mapping to genome,
191 starting chromosome of consensus mapping, position of the genomic mapping of
192 consensus starting position, end position of consensus mapping to genome,
193 ending chromsome of consnesus mapping, position of genomic mapping of
194 consensus ending posiiton, and consensus sequences. For inversion(INV), the
195 last 7 fields will be repeated to reflect the fact two different breakpoints
196 are needed to identify an INV event.
197
198 Example of the tumor.predSV.txt file:
199 4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT
200 5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG
201 10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA
202
203 If there are no or very few results, there may be a lack of soft-clipping. See
204 the section "About Soft-Clipping" at the end of this document.
205
206 4. Visulization of the detailed alignment at breakpoint (optional)
207
208 The bam2html.pl script builds an html view of the multiple alignment for
209 the breakpoint, so you can manually check the soft-clipping and other things.
210
211 bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \
212 --ref_genome /genome/hg18 -o diag.bam.predSV.html
213
214 The output file is specified by -o option.
215
216 ====================
217 About Soft-Clipping:
218 ====================
219
220 CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is
221 indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may
222 not occur, depending on the mapping algorithm and parameters and sometimes even
223 the library preparation.
224
225 With bwa sampe:
226 ---------------
227
228 One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end
229 reads). When BWA successfully maps one read in a pair but is not able to map
230 the other, it will attempt a more permissive Smith-Waterman alignment of the
231 unmapped read in the neighborhood of the mapped mate. If it is only able to
232 align part of the read, then it will soft-clip the portion on the end that it
233 could not align. Often this occurs at the breakpoints of structural
234 variations.
235
236 In some cases when the insert sizes approach the read length, BWA will not
237 perform Smith-Waterman alignment. Reads from inserts smaller than the read
238 length will contain primer and/or adapter and will often not map. When the
239 insert size is close to the read length, this creates a skewed distribution
240 of inferred insert sizes which may cause BWA to not attempt Smith-Waterman
241 realignment. This is indicated by the error message "weird pairing". Often
242 in these cases there are also unusually low mapping rates.
243
244 One way to fix this problem is to remap unmapped reads bwasw. To do this,
245 extract the unmapped reads as FASTQ files (this may be done with a combination
246 of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and
247 build a BAM file. Then, re-run CREST on this new BAM file, and you may pick
248 up events that would have been missed otherwise.
249
250 With other aligners:
251 --------------------
252
253 Consult the documentation or mailing list(s) for your mapper to determine its
254 behavior with regard to soft-clipping.