0
|
1 This document has the information on how to run CREST for structural
|
|
2 variation detection.
|
|
3
|
|
4 =============
|
|
5 Requirements:
|
|
6 =============
|
|
7 Before running CREST, you need to make sure that several pieces of software
|
|
8 and/or modules are installed on the system:
|
|
9 1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT
|
|
10 can be obtained from these links:
|
|
11 BLAT for academic use: http://www.soe.ucsc.edu/~kent
|
|
12 BLAT commercial license: http://www.kentinformatics.com/
|
|
13 2. CAP3 assembly program, available here:
|
|
14 CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html
|
|
15 CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech,
|
|
16 rakolehm@mtu.edu or (906)487-2228.
|
|
17 3: SAMtools library for accessing SAM/BAM files, available from SourceForge:
|
|
18 SAMtools: http://sourceforge.net/projects/samtools/files/
|
|
19 4. BioPerl and Bio::DB::Sam modules. They are usually available as
|
|
20 packages on most Linux distributions, but are also available at this link:
|
|
21 BioPerl: http://www.bioperl.org/
|
|
22 Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm
|
|
23 Important: you must install SAMtools library before install Bio::DB::Sam.
|
|
24 5. ptrfinder is needed if you want to remove short tandem repeat mediated
|
|
25 SVs, the executable is included in the download package, put it on the path.
|
|
26
|
|
27 Note:
|
|
28 1. You can use your own programs in place of BLAT and CAP3, but you need to
|
|
29 implement the run method in SVExtTools.pm.
|
|
30 2. The pipeline uses gfServer to mimic a standard blat server, so you need
|
|
31 to setup your own gfServer. Details on setting up the server can be found in
|
|
32 the BLAT package. Using a query server can significantly increase the speed of
|
|
33 the pipeline.
|
|
34
|
|
35 Your BAM files must contain soft-clipping signatures at the breakpoints. If
|
|
36 they do not, you will not get any results. For more information see the
|
|
37 section "About Soft-Clipping" at the end of this document.
|
|
38
|
|
39 =====================
|
|
40 Running the pipeline:
|
|
41 =====================
|
|
42
|
|
43 Make sure that all the required perl modules are in @INC. One simple way
|
|
44 is to put all .pm and .pl scripts in the same directory and run them from
|
|
45 this same directory. Also, the input bam file, must be sorted and indexed
|
|
46 before running the pipeline.
|
|
47
|
|
48 We are going to use two sample bam files (tumor.bam and germline.bam) to
|
|
49 illustate how to run the pipeline. The examples assume you want to find SV
|
|
50 in tumor.bam and you also have the matched germline sample bam file.
|
|
51
|
|
52 Important: indexing all bam files before running the pipeline is required.
|
|
53
|
|
54 1. Get soft-clipping positions.
|
|
55
|
|
56 The program extractSClip.pl will extract all soft-clipping positions first,
|
|
57 and identify those positions with a cluster of soft-clipped reads. The
|
|
58 program requires only the BAM file and the reference genome's FASTA file
|
|
59 The following is an example to extract all positions:
|
|
60
|
|
61 extractSClip.pl -i tumor.bam --ref_genome hg18.fa
|
|
62
|
|
63 Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated
|
|
64 for use in the next step.
|
|
65
|
|
66 Note: The program can use either paired or single-end sequencing data. For
|
|
67 single-end data, use the --nopaired parameter.
|
|
68
|
|
69 For whole genome sequencing project, we highly suggest running the procedure
|
|
70 in parallel by dividing the genome into pieces. One natural way is by
|
|
71 chromosome. The following is an example to extract all positons on chr4.
|
|
72
|
|
73 extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4
|
|
74
|
|
75 Important: The genome file used in this pipeline must be the same as the one
|
|
76 used to map reads, so the chromosome names need to agree. In this example,
|
|
77 the genome file and bam file all have the chromosome name as 4 instead of
|
|
78 chr4 you may encounter.
|
|
79
|
|
80 Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated
|
|
81 for use in the next step. So it's very easy to run this step in parallel and
|
|
82 combine the results together to form a final result.
|
|
83
|
|
84 The output files for this step have names with suffixes of *.cover and
|
|
85 *.sclip.txt. The .cover file is a tab-delimited text file, with columns:
|
|
86 chr, position, strand, number of soft-clipped reads, and coverage at that
|
|
87 position. The strand is just left-clipped or right-clipped to help identify
|
|
88 the SV orientation. The .sclip.txt file has the detailed information for
|
|
89 all soft-clipped reads including sequence and quality values. This file is
|
|
90 also tab-delimited with the following columns: chr, posiiton, strand, read
|
|
91 name, sequence, and quality.
|
|
92
|
|
93 Example of part of a *.cover file:
|
|
94 4 125892327 + 1 28
|
|
95 4 125892458 + 1 27
|
|
96 4 125893225 + 1 28
|
|
97 4 125893227 + 5 29
|
|
98 4 125893365 - 1 26
|
|
99 4 125893979 - 1 16
|
|
100 10 66301086 - 1 33
|
|
101 10 66301858 + 4 14
|
|
102 10 66301865 - 8 21
|
|
103 10 66301871 - 1 22
|
|
104 10 66302136 + 1 51
|
|
105
|
|
106 Example of part of a *.sclip.txt file:
|
|
107 4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC
|
|
108 4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC
|
|
109 4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC
|
|
110
|
|
111 If you run this step in parallel, you need to combine the outputs by
|
|
112 concatenating the files. Tumor and germline files must be concatenated
|
|
113 separately, for example:
|
|
114 cat tumor.bam.*.cover > tumor.bam.cover
|
|
115 cat germline.bam.*.cover > germline.bam.cover
|
|
116
|
|
117 2.Remove germline events (optional)
|
|
118
|
|
119 Running step 1 on both germline and tumor samples, you will get the
|
|
120 soft-clipping posiitons in both samples. This step will remove any position
|
|
121 in the tumor sample that also appears in germline sample, so germline events
|
|
122 will be removed. This step does not use any sequence information and could
|
|
123 remove true events. By our observations, true events are rarely removed.
|
|
124 You can skip this step and the program will do germline clean up at later step
|
|
125 (see the -g parameter for CREST.pl).
|
|
126
|
|
127 The script for this step is countDiff.pl and it only requires two parameters
|
|
128 to specity the two output files from previous step.
|
|
129
|
|
130 countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt
|
|
131
|
|
132 A file named tumor.bam.cover.somatic.cover will be generated for next step.
|
|
133
|
|
134 The program will generate a file with suffix *.somatic.cover, and it will
|
|
135 be used for the next step. The file has the same format as *.cover generated
|
|
136 in the previous step.
|
|
137
|
|
138 The standard output will show the coverage distribution. For every read count
|
|
139 in the range 1-999, it will show the number of breakpoints supported by that
|
|
140 many soft-clipped reads.
|
|
141
|
|
142 3. Running the SV detection script.
|
|
143
|
|
144 This is the core step in the detection process. The program is CREST.pl.
|
|
145
|
|
146 The program needs quite a few parameters, but you can think about what you
|
|
147 will need. Here is a partial list of required and common parameters:
|
|
148
|
|
149 -f The input soft-clipped coverage file produced in step 1 or 2.
|
|
150 -d The disease or tumor bam file
|
|
151 -g The germline bam file. If you want to identify somatic SVs only, you
|
|
152 should provide this parameter. If you also want to identify germline
|
|
153 events, you can leave this parameter unspecified. When treat your
|
|
154 germline file as disease without specify -g parameter, the program can
|
|
155 be used to identify germline events, or SV polymorphism.
|
|
156 --ref_genome The reference genome in fa format (used by bam file)
|
|
157 -t The reference genome in 2bit format (used by gfClient), this file can
|
|
158 be generated by using faToTwoBit program in BLAT program suit. This
|
|
159 file must be the same as the one you used to setup gfServer.
|
|
160 --blatserver The name or IP address of blat server, you need to use your
|
|
161 own one instead of using the public one at UCSC.
|
|
162 --blatport The port number for the blat server.
|
|
163 --nopaired Tell the program the reads are not paired.
|
|
164
|
|
165 If all of the required programs are on the path then you won't need to
|
|
166 specify them again, otherwise you need to specify the paths to the programs
|
|
167 using the corresponding parameters. Please use CREST.pl --man to show the
|
|
168 man page, which provides a detailed parameter list.
|
|
169
|
|
170 An example of running this step is:
|
|
171 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
|
|
172 hg18.fa -t hg18.2bit
|
|
173
|
|
174 There is also a -r parameter to specify the range to be searched and it's highly
|
|
175 recommended to run using -r as below:
|
|
176 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
|
|
177 hg18.fa -t /genome/hg18.2bit -r chr1
|
|
178 So it's very easy to run the program in parallel by spliting the genome into
|
|
179 pieces.
|
|
180
|
|
181 The program will generate a *.predSV.txt file. The filename will be the input
|
|
182 bam with .predSV.txt appended unless you specify the -p parameter. Also the
|
|
183 STDERR output has the full list of SVs, including rejected ones. The output
|
|
184 file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos,
|
|
185 left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand,
|
|
186 # right soft-clipped reads, SV type, coverage at left_pos, coverage at
|
|
187 right_pos, assembled length at left_pos, assembled length at right_pos,
|
|
188 average percent identity at left_pos, percent of non-unique mapping reads at
|
|
189 left_pos, average percent identity at right_pos, percent of non-unique mapping
|
|
190 reads at right_pos, start position of consensus mapping to genome,
|
|
191 starting chromosome of consensus mapping, position of the genomic mapping of
|
|
192 consensus starting position, end position of consensus mapping to genome,
|
|
193 ending chromsome of consnesus mapping, position of genomic mapping of
|
|
194 consensus ending posiiton, and consensus sequences. For inversion(INV), the
|
|
195 last 7 fields will be repeated to reflect the fact two different breakpoints
|
|
196 are needed to identify an INV event.
|
|
197
|
|
198 Example of the tumor.predSV.txt file:
|
|
199 4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT
|
|
200 5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG
|
|
201 10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA
|
|
202
|
|
203 If there are no or very few results, there may be a lack of soft-clipping. See
|
|
204 the section "About Soft-Clipping" at the end of this document.
|
|
205
|
|
206 4. Visulization of the detailed alignment at breakpoint (optional)
|
|
207
|
|
208 The bam2html.pl script builds an html view of the multiple alignment for
|
|
209 the breakpoint, so you can manually check the soft-clipping and other things.
|
|
210
|
|
211 bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \
|
|
212 --ref_genome /genome/hg18 -o diag.bam.predSV.html
|
|
213
|
|
214 The output file is specified by -o option.
|
|
215
|
|
216 ====================
|
|
217 About Soft-Clipping:
|
|
218 ====================
|
|
219
|
|
220 CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is
|
|
221 indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may
|
|
222 not occur, depending on the mapping algorithm and parameters and sometimes even
|
|
223 the library preparation.
|
|
224
|
|
225 With bwa sampe:
|
|
226 ---------------
|
|
227
|
|
228 One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end
|
|
229 reads). When BWA successfully maps one read in a pair but is not able to map
|
|
230 the other, it will attempt a more permissive Smith-Waterman alignment of the
|
|
231 unmapped read in the neighborhood of the mapped mate. If it is only able to
|
|
232 align part of the read, then it will soft-clip the portion on the end that it
|
|
233 could not align. Often this occurs at the breakpoints of structural
|
|
234 variations.
|
|
235
|
|
236 In some cases when the insert sizes approach the read length, BWA will not
|
|
237 perform Smith-Waterman alignment. Reads from inserts smaller than the read
|
|
238 length will contain primer and/or adapter and will often not map. When the
|
|
239 insert size is close to the read length, this creates a skewed distribution
|
|
240 of inferred insert sizes which may cause BWA to not attempt Smith-Waterman
|
|
241 realignment. This is indicated by the error message "weird pairing". Often
|
|
242 in these cases there are also unusually low mapping rates.
|
|
243
|
|
244 One way to fix this problem is to remap unmapped reads bwasw. To do this,
|
|
245 extract the unmapped reads as FASTQ files (this may be done with a combination
|
|
246 of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and
|
|
247 build a BAM file. Then, re-run CREST on this new BAM file, and you may pick
|
|
248 up events that would have been missed otherwise.
|
|
249
|
|
250 With other aligners:
|
|
251 --------------------
|
|
252
|
|
253 Consult the documentation or mailing list(s) for your mapper to determine its
|
|
254 behavior with regard to soft-clipping.
|