Mercurial > repos > jjohnson > crest
comparison README @ 0:acc8d8bfeb9a
Uploaded
author | jjohnson |
---|---|
date | Wed, 08 Feb 2012 16:59:24 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:acc8d8bfeb9a |
---|---|
1 This document has the information on how to run CREST for structural | |
2 variation detection. | |
3 | |
4 ============= | |
5 Requirements: | |
6 ============= | |
7 Before running CREST, you need to make sure that several pieces of software | |
8 and/or modules are installed on the system: | |
9 1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT | |
10 can be obtained from these links: | |
11 BLAT for academic use: http://www.soe.ucsc.edu/~kent | |
12 BLAT commercial license: http://www.kentinformatics.com/ | |
13 2. CAP3 assembly program, available here: | |
14 CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html | |
15 CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech, | |
16 rakolehm@mtu.edu or (906)487-2228. | |
17 3: SAMtools library for accessing SAM/BAM files, available from SourceForge: | |
18 SAMtools: http://sourceforge.net/projects/samtools/files/ | |
19 4. BioPerl and Bio::DB::Sam modules. They are usually available as | |
20 packages on most Linux distributions, but are also available at this link: | |
21 BioPerl: http://www.bioperl.org/ | |
22 Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm | |
23 Important: you must install SAMtools library before install Bio::DB::Sam. | |
24 5. ptrfinder is needed if you want to remove short tandem repeat mediated | |
25 SVs, the executable is included in the download package, put it on the path. | |
26 | |
27 Note: | |
28 1. You can use your own programs in place of BLAT and CAP3, but you need to | |
29 implement the run method in SVExtTools.pm. | |
30 2. The pipeline uses gfServer to mimic a standard blat server, so you need | |
31 to setup your own gfServer. Details on setting up the server can be found in | |
32 the BLAT package. Using a query server can significantly increase the speed of | |
33 the pipeline. | |
34 | |
35 Your BAM files must contain soft-clipping signatures at the breakpoints. If | |
36 they do not, you will not get any results. For more information see the | |
37 section "About Soft-Clipping" at the end of this document. | |
38 | |
39 ===================== | |
40 Running the pipeline: | |
41 ===================== | |
42 | |
43 Make sure that all the required perl modules are in @INC. One simple way | |
44 is to put all .pm and .pl scripts in the same directory and run them from | |
45 this same directory. Also, the input bam file, must be sorted and indexed | |
46 before running the pipeline. | |
47 | |
48 We are going to use two sample bam files (tumor.bam and germline.bam) to | |
49 illustate how to run the pipeline. The examples assume you want to find SV | |
50 in tumor.bam and you also have the matched germline sample bam file. | |
51 | |
52 Important: indexing all bam files before running the pipeline is required. | |
53 | |
54 1. Get soft-clipping positions. | |
55 | |
56 The program extractSClip.pl will extract all soft-clipping positions first, | |
57 and identify those positions with a cluster of soft-clipped reads. The | |
58 program requires only the BAM file and the reference genome's FASTA file | |
59 The following is an example to extract all positions: | |
60 | |
61 extractSClip.pl -i tumor.bam --ref_genome hg18.fa | |
62 | |
63 Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated | |
64 for use in the next step. | |
65 | |
66 Note: The program can use either paired or single-end sequencing data. For | |
67 single-end data, use the --nopaired parameter. | |
68 | |
69 For whole genome sequencing project, we highly suggest running the procedure | |
70 in parallel by dividing the genome into pieces. One natural way is by | |
71 chromosome. The following is an example to extract all positons on chr4. | |
72 | |
73 extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4 | |
74 | |
75 Important: The genome file used in this pipeline must be the same as the one | |
76 used to map reads, so the chromosome names need to agree. In this example, | |
77 the genome file and bam file all have the chromosome name as 4 instead of | |
78 chr4 you may encounter. | |
79 | |
80 Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated | |
81 for use in the next step. So it's very easy to run this step in parallel and | |
82 combine the results together to form a final result. | |
83 | |
84 The output files for this step have names with suffixes of *.cover and | |
85 *.sclip.txt. The .cover file is a tab-delimited text file, with columns: | |
86 chr, position, strand, number of soft-clipped reads, and coverage at that | |
87 position. The strand is just left-clipped or right-clipped to help identify | |
88 the SV orientation. The .sclip.txt file has the detailed information for | |
89 all soft-clipped reads including sequence and quality values. This file is | |
90 also tab-delimited with the following columns: chr, posiiton, strand, read | |
91 name, sequence, and quality. | |
92 | |
93 Example of part of a *.cover file: | |
94 4 125892327 + 1 28 | |
95 4 125892458 + 1 27 | |
96 4 125893225 + 1 28 | |
97 4 125893227 + 5 29 | |
98 4 125893365 - 1 26 | |
99 4 125893979 - 1 16 | |
100 10 66301086 - 1 33 | |
101 10 66301858 + 4 14 | |
102 10 66301865 - 8 21 | |
103 10 66301871 - 1 22 | |
104 10 66302136 + 1 51 | |
105 | |
106 Example of part of a *.sclip.txt file: | |
107 4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC | |
108 4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC | |
109 4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC | |
110 | |
111 If you run this step in parallel, you need to combine the outputs by | |
112 concatenating the files. Tumor and germline files must be concatenated | |
113 separately, for example: | |
114 cat tumor.bam.*.cover > tumor.bam.cover | |
115 cat germline.bam.*.cover > germline.bam.cover | |
116 | |
117 2.Remove germline events (optional) | |
118 | |
119 Running step 1 on both germline and tumor samples, you will get the | |
120 soft-clipping posiitons in both samples. This step will remove any position | |
121 in the tumor sample that also appears in germline sample, so germline events | |
122 will be removed. This step does not use any sequence information and could | |
123 remove true events. By our observations, true events are rarely removed. | |
124 You can skip this step and the program will do germline clean up at later step | |
125 (see the -g parameter for CREST.pl). | |
126 | |
127 The script for this step is countDiff.pl and it only requires two parameters | |
128 to specity the two output files from previous step. | |
129 | |
130 countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt | |
131 | |
132 A file named tumor.bam.cover.somatic.cover will be generated for next step. | |
133 | |
134 The program will generate a file with suffix *.somatic.cover, and it will | |
135 be used for the next step. The file has the same format as *.cover generated | |
136 in the previous step. | |
137 | |
138 The standard output will show the coverage distribution. For every read count | |
139 in the range 1-999, it will show the number of breakpoints supported by that | |
140 many soft-clipped reads. | |
141 | |
142 3. Running the SV detection script. | |
143 | |
144 This is the core step in the detection process. The program is CREST.pl. | |
145 | |
146 The program needs quite a few parameters, but you can think about what you | |
147 will need. Here is a partial list of required and common parameters: | |
148 | |
149 -f The input soft-clipped coverage file produced in step 1 or 2. | |
150 -d The disease or tumor bam file | |
151 -g The germline bam file. If you want to identify somatic SVs only, you | |
152 should provide this parameter. If you also want to identify germline | |
153 events, you can leave this parameter unspecified. When treat your | |
154 germline file as disease without specify -g parameter, the program can | |
155 be used to identify germline events, or SV polymorphism. | |
156 --ref_genome The reference genome in fa format (used by bam file) | |
157 -t The reference genome in 2bit format (used by gfClient), this file can | |
158 be generated by using faToTwoBit program in BLAT program suit. This | |
159 file must be the same as the one you used to setup gfServer. | |
160 --blatserver The name or IP address of blat server, you need to use your | |
161 own one instead of using the public one at UCSC. | |
162 --blatport The port number for the blat server. | |
163 --nopaired Tell the program the reads are not paired. | |
164 | |
165 If all of the required programs are on the path then you won't need to | |
166 specify them again, otherwise you need to specify the paths to the programs | |
167 using the corresponding parameters. Please use CREST.pl --man to show the | |
168 man page, which provides a detailed parameter list. | |
169 | |
170 An example of running this step is: | |
171 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ | |
172 hg18.fa -t hg18.2bit | |
173 | |
174 There is also a -r parameter to specify the range to be searched and it's highly | |
175 recommended to run using -r as below: | |
176 CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ | |
177 hg18.fa -t /genome/hg18.2bit -r chr1 | |
178 So it's very easy to run the program in parallel by spliting the genome into | |
179 pieces. | |
180 | |
181 The program will generate a *.predSV.txt file. The filename will be the input | |
182 bam with .predSV.txt appended unless you specify the -p parameter. Also the | |
183 STDERR output has the full list of SVs, including rejected ones. The output | |
184 file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos, | |
185 left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand, | |
186 # right soft-clipped reads, SV type, coverage at left_pos, coverage at | |
187 right_pos, assembled length at left_pos, assembled length at right_pos, | |
188 average percent identity at left_pos, percent of non-unique mapping reads at | |
189 left_pos, average percent identity at right_pos, percent of non-unique mapping | |
190 reads at right_pos, start position of consensus mapping to genome, | |
191 starting chromosome of consensus mapping, position of the genomic mapping of | |
192 consensus starting position, end position of consensus mapping to genome, | |
193 ending chromsome of consnesus mapping, position of genomic mapping of | |
194 consensus ending posiiton, and consensus sequences. For inversion(INV), the | |
195 last 7 fields will be repeated to reflect the fact two different breakpoints | |
196 are needed to identify an INV event. | |
197 | |
198 Example of the tumor.predSV.txt file: | |
199 4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT | |
200 5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG | |
201 10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA | |
202 | |
203 If there are no or very few results, there may be a lack of soft-clipping. See | |
204 the section "About Soft-Clipping" at the end of this document. | |
205 | |
206 4. Visulization of the detailed alignment at breakpoint (optional) | |
207 | |
208 The bam2html.pl script builds an html view of the multiple alignment for | |
209 the breakpoint, so you can manually check the soft-clipping and other things. | |
210 | |
211 bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \ | |
212 --ref_genome /genome/hg18 -o diag.bam.predSV.html | |
213 | |
214 The output file is specified by -o option. | |
215 | |
216 ==================== | |
217 About Soft-Clipping: | |
218 ==================== | |
219 | |
220 CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is | |
221 indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may | |
222 not occur, depending on the mapping algorithm and parameters and sometimes even | |
223 the library preparation. | |
224 | |
225 With bwa sampe: | |
226 --------------- | |
227 | |
228 One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end | |
229 reads). When BWA successfully maps one read in a pair but is not able to map | |
230 the other, it will attempt a more permissive Smith-Waterman alignment of the | |
231 unmapped read in the neighborhood of the mapped mate. If it is only able to | |
232 align part of the read, then it will soft-clip the portion on the end that it | |
233 could not align. Often this occurs at the breakpoints of structural | |
234 variations. | |
235 | |
236 In some cases when the insert sizes approach the read length, BWA will not | |
237 perform Smith-Waterman alignment. Reads from inserts smaller than the read | |
238 length will contain primer and/or adapter and will often not map. When the | |
239 insert size is close to the read length, this creates a skewed distribution | |
240 of inferred insert sizes which may cause BWA to not attempt Smith-Waterman | |
241 realignment. This is indicated by the error message "weird pairing". Often | |
242 in these cases there are also unusually low mapping rates. | |
243 | |
244 One way to fix this problem is to remap unmapped reads bwasw. To do this, | |
245 extract the unmapped reads as FASTQ files (this may be done with a combination | |
246 of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and | |
247 build a BAM file. Then, re-run CREST on this new BAM file, and you may pick | |
248 up events that would have been missed otherwise. | |
249 | |
250 With other aligners: | |
251 -------------------- | |
252 | |
253 Consult the documentation or mailing list(s) for your mapper to determine its | |
254 behavior with regard to soft-clipping. |