annotate README @ 1:4f6952e0af48 default tip

CREST - add crest.loc.sample
author Jim Johnson <jj@umn.edu>
date Wed, 08 Feb 2012 16:08:01 -0600
This document has the information on how to run CREST for structural
variation detection.

=============
Requirements:
=============
Before running CREST, you need to make sure that several pieces of software
and/or modules are installed on the system:
  1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT
     can be obtained from these links:
        BLAT for academic use: http://www.soe.ucsc.edu/~kent
        BLAT commercial license: http://www.kentinformatics.com/
  2. CAP3 assembly program, available here:
        CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html
        CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech,
        rakolehm@mtu.edu or (906)487-2228.
  3. SAMtools library for accessing SAM/BAM files, available from SourceForge:
        SAMtools: http://sourceforge.net/projects/samtools/files/
  4. BioPerl and Bio::DB::Sam modules. They are usually available as
     packages on most Linux distributions, but are also available at these
     links:
        BioPerl: http://www.bioperl.org/
        Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm
     Important: you must install the SAMtools library before installing
     Bio::DB::Sam.
  5. ptrfinder, which is needed if you want to remove short tandem repeat
     mediated SVs. The executable is included in the download package; put
     it on the path.

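As a quick sanity check before running the pipeline, the executables listed
above can be looked up on the path. The following Python sketch is only an
illustration: blat, gfClient, gfServer and ptrfinder are named in this
document, while "cap3" is an assumed name for the CAP3 binary, and the helper
name missing_tools is hypothetical.

```python
import shutil

# blat, gfClient, gfServer and ptrfinder are named in this document;
# "cap3" is an assumed executable name for the CAP3 assembly program.
REQUIRED_TOOLS = ["blat", "gfClient", "gfServer", "cap3", "ptrfinder"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```

This only checks that the programs can be found; it does not check the Perl
modules or the versions of any tool.
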
Note:
  1. You can use your own programs in place of BLAT and CAP3, but you need to
     implement the run method in SVExtTools.pm.
  2. The pipeline uses gfServer to mimic a standard blat server, so you need
     to set up your own gfServer. Details on setting up the server can be
     found in the BLAT package. Using a query server can significantly
     increase the speed of the pipeline.

Your BAM files must contain soft-clipping signatures at the breakpoints. If
they do not, you will not get any results. For more information see the
section "About Soft-Clipping" at the end of this document.

=====================
Running the pipeline:
=====================

Make sure that all the required Perl modules are in @INC. One simple way
is to put all .pm and .pl scripts in the same directory and run them from
that directory. Also, the input BAM file must be sorted and indexed
before running the pipeline.

We are going to use two sample BAM files (tumor.bam and germline.bam) to
illustrate how to run the pipeline. The examples assume you want to find SVs
in tumor.bam and you also have the matched germline sample BAM file.

Important: indexing all BAM files before running the pipeline is required.

1. Get soft-clipping positions.

The program extractSClip.pl first extracts all soft-clipping positions and
then identifies those positions with a cluster of soft-clipped reads. The
program requires only the BAM file and the reference genome's FASTA file.
The following is an example to extract all positions:

    extractSClip.pl -i tumor.bam --ref_genome hg18.fa

Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated
for use in the next step.

Note: The program can use either paired or single-end sequencing data. For
single-end data, use the --nopaired parameter.

For whole-genome sequencing projects, we highly suggest running the procedure
in parallel by dividing the genome into pieces. One natural way is by
chromosome. The following is an example to extract all positions on chr4:

    extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4

Important: The genome file used in this pipeline must be the same as the one
used to map reads, so the chromosome names need to agree. In this example,
the genome file and the BAM file both name the chromosome 4, rather than the
chr4 you may encounter elsewhere.

Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated
for use in the next step. This makes it easy to run this step in parallel and
combine the results into a final result.

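The per-chromosome runs can be scripted. The following Python sketch only
builds the command lines and does not run anything. It uses the
extractSClip.pl options shown in this document (-i, --ref_genome, -r,
--nopaired); the helper name and the chromosome list are assumptions for
illustration, and the chromosome names must match those in your BAM and
genome files.

```python
def build_extract_commands(bam, ref_genome, chromosomes, paired=True):
    """Return one extractSClip.pl argument list per chromosome."""
    commands = []
    for chrom in chromosomes:
        cmd = ["extractSClip.pl", "-i", bam,
               "--ref_genome", ref_genome, "-r", str(chrom)]
        if not paired:
            # Single-end data, as described in the note above.
            cmd.append("--nopaired")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical chromosome list; use the names found in your own files.
    for cmd in build_extract_commands("tumor.bam", "hg18.fa", ["1", "2", "3", "4"]):
        print(" ".join(cmd))
```

Each printed line can then be submitted as a separate job, for example to a
cluster scheduler.
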
The output files for this step have names with suffixes of *.cover and
*.sclip.txt. The .cover file is a tab-delimited text file, with columns:
chr, position, strand, number of soft-clipped reads, and coverage at that
position. The strand field just records whether the reads are left-clipped
or right-clipped, to help identify the SV orientation. The .sclip.txt file
has the detailed information for all soft-clipped reads, including sequence
and quality values. This file is also tab-delimited, with the following
columns: chr, position, strand, read name, sequence, and quality.

Example of part of a *.cover file:
4 125892327 + 1 28
4 125892458 + 1 27
4 125893225 + 1 28
4 125893227 + 5 29
4 125893365 - 1 26
4 125893979 - 1 16
10 66301086 - 1 33
10 66301858 + 4 14
10 66301865 - 8 21
10 66301871 - 1 22
10 66302136 + 1 51

Example of part of a *.sclip.txt file:
4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC
4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC
4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC

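If you want to inspect a *.cover file yourself, the five columns described
above are easy to read. The following Python sketch assumes the file is
tab-delimited as stated; the record type, the function names and the
min_clipped threshold are hypothetical and are not part of CREST.

```python
from collections import namedtuple

CoverRecord = namedtuple("CoverRecord", "chrom pos strand clipped coverage")


def parse_cover(lines):
    """Yield a CoverRecord for each well-formed line of a *.cover file."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        chrom, pos, strand, clipped, coverage = fields
        yield CoverRecord(chrom, int(pos), strand, int(clipped), int(coverage))


def clustered_positions(records, min_clipped=3):
    """Keep positions supported by at least `min_clipped` soft-clipped reads.

    The threshold is an arbitrary illustration, not a CREST default.
    """
    return [rec for rec in records if rec.clipped >= min_clipped]


if __name__ == "__main__":
    # Three rows taken from the example *.cover file above.
    sample = [
        "4\t125893227\t+\t5\t29",
        "4\t125893365\t-\t1\t26",
        "10\t66301865\t-\t8\t21",
    ]
    for rec in clustered_positions(parse_cover(sample)):
        print(rec.chrom, rec.pos, rec.strand, rec.clipped, rec.coverage)
```
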
If you run this step in parallel, you need to combine the outputs by
concatenating the files. Tumor and germline files must be concatenated
separately, for example:
    cat tumor.bam.*.cover > tumor.bam.cover
    cat germline.bam.*.cover > germline.bam.cover

2. Remove germline events (optional)

After running step 1 on both the germline and tumor samples, you will have
the soft-clipping positions in both samples. This step removes any position
in the tumor sample that also appears in the germline sample, so germline
events are removed. This step does not use any sequence information and could
remove true events, although in our observations true events are rarely
removed. You can skip this step and the program will do germline cleanup at
a later step (see the -g parameter for CREST.pl).

The script for this step is countDiff.pl, and it requires only two
parameters, which specify the two output files from the previous step:

    countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt

The program will generate a file with the suffix *.somatic.cover (here,
tumor.bam.cover.somatic.cover) for use in the next step. The file has the
same format as the *.cover file generated in the previous step.

The standard output shows the coverage distribution: for every read count
in the range 1-999, it shows the number of breakpoints supported by that
many soft-clipped reads.

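The idea behind this step can be sketched in a few lines of Python. This is
not the countDiff.pl implementation: it assumes that a tumor position is
removed when the same chr and position occur in the germline file (ignoring
strand), and it counts the distribution over the positions that remain. Both
choices, and all function names, are assumptions made for illustration.

```python
from collections import Counter


def parse_cover_line(line):
    """Split one tab-delimited *.cover line into typed fields."""
    chrom, pos, strand, clipped, coverage = line.rstrip("\n").split("\t")
    return chrom, int(pos), strand, int(clipped), int(coverage)


def somatic_positions(tumor_lines, germline_lines):
    """Drop tumor positions whose (chr, position) also occurs in germline."""
    germline_sites = set()
    for line in germline_lines:
        chrom, pos = parse_cover_line(line)[:2]
        germline_sites.add((chrom, pos))
    kept = []
    for line in tumor_lines:
        rec = parse_cover_line(line)
        if (rec[0], rec[1]) not in germline_sites:
            kept.append(rec)
    return kept


def clipped_read_distribution(records):
    """Map soft-clipped read count -> number of positions with that count."""
    return Counter(rec[3] for rec in records)


if __name__ == "__main__":
    tumor = ["4\t125893227\t+\t5\t29", "10\t66301865\t-\t8\t21"]
    germline = ["10\t66301865\t-\t2\t30"]  # hypothetical germline row
    somatic = somatic_positions(tumor, germline)
    for count, n in sorted(clipped_read_distribution(somatic).items()):
        print(count, n)
```
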
3. Running the SV detection script.

This is the core step in the detection process. The program is CREST.pl.

The program accepts quite a few parameters, but you only need to think about
the ones you will use. Here is a partial list of required and common
parameters:

    -f  The input soft-clipped coverage file produced in step 1 or 2.
    -d  The disease or tumor BAM file.
    -g  The germline BAM file. If you want to identify somatic SVs only, you
        should provide this parameter. If you also want to identify germline
        events, you can leave this parameter unspecified. If you pass your
        germline file as the disease file without specifying the -g
        parameter, the program can be used to identify germline events, or
        SV polymorphisms.
    --ref_genome  The reference genome in fa format (used by the BAM file).
    -t  The reference genome in 2bit format (used by gfClient). This file can
        be generated using the faToTwoBit program in the BLAT program suite.
        It must be the same file that you used to set up gfServer.
    --blatserver  The name or IP address of the blat server. You need to use
        your own server instead of the public one at UCSC.
    --blatport  The port number for the blat server.
    --nopaired  Tell the program the reads are not paired.

If all of the required programs are on the path, you do not need to specify
them again; otherwise you need to specify the paths to the programs using
the corresponding parameters. Use CREST.pl --man to show the man page, which
provides a detailed parameter list.

An example of running this step is:
    CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
        hg18.fa -t hg18.2bit

There is also a -r parameter to specify the range to be searched, and it is
highly recommended to run using -r as below:
    CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
        hg18.fa -t /genome/hg18.2bit -r chr1
This makes it easy to run the program in parallel by splitting the genome
into pieces.

The program will generate a *.predSV.txt file. The filename will be the input
BAM filename with .predSV.txt appended unless you specify the -p parameter.
Also, the STDERR output has the full list of SVs, including rejected ones.
The output file *.predSV.txt has the following tab-delimited columns:
   1. left_chr
   2. left_pos
   3. left_strand
   4. # of left soft-clipped reads
   5. right_chr
   6. right_pos
   7. right_strand
   8. # of right soft-clipped reads
   9. SV type
  10. coverage at left_pos
  11. coverage at right_pos
  12. assembled length at left_pos
  13. assembled length at right_pos
  14. average percent identity at left_pos
  15. percent of non-unique mapping reads at left_pos
  16. average percent identity at right_pos
  17. percent of non-unique mapping reads at right_pos
  18. start position of consensus mapping to genome
  19. starting chromosome of consensus mapping
  20. position of the genomic mapping of consensus starting position
  21. end position of consensus mapping to genome
  22. ending chromosome of consensus mapping
  23. position of genomic mapping of consensus ending position
  24. consensus sequences
For inversion (INV), the last 7 fields will be repeated to reflect the fact
that two different breakpoints are needed to identify an INV event.

Example of the tumor.bam.predSV.txt file:
4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT
5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG
10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA

If there are no or very few results, there may be a lack of soft-clipping. See
the section "About Soft-Clipping" at the end of this document.

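For downstream filtering it can help to give the *.predSV.txt columns names.
The following Python sketch follows the column order described above and
assumes tab-delimited input. The key names are hypothetical labels chosen
for this example, not names used by CREST, and any fields beyond the first
24 (as on INV lines) are kept unparsed in "extra".

```python
PREDSV_COLUMNS = [
    "left_chr", "left_pos", "left_strand", "left_clipped",
    "right_chr", "right_pos", "right_strand", "right_clipped",
    "sv_type", "left_coverage", "right_coverage",
    "left_assembled_len", "right_assembled_len",
    "left_identity", "left_nonunique", "right_identity", "right_nonunique",
    "cons_start", "cons_start_chr", "cons_start_genome_pos",
    "cons_end", "cons_end_chr", "cons_end_genome_pos", "consensus",
]


def parse_predsv_line(line):
    """Return a dict for one *.predSV.txt line; extra fields go to 'extra'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PREDSV_COLUMNS):
        raise ValueError("expected at least %d fields, got %d"
                         % (len(PREDSV_COLUMNS), len(fields)))
    record = dict(zip(PREDSV_COLUMNS, fields))
    record["extra"] = fields[len(PREDSV_COLUMNS):]
    for key in ("left_pos", "right_pos", "left_clipped", "right_clipped"):
        record[key] = int(record[key])
    return record


if __name__ == "__main__":
    # First example row above, with the consensus sequence shortened here.
    row = ("4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 "
           "0.895173453996983 0.230769230769231 0.735384615384615 0.5 "
           "1 4 125893135 176 10 66301773 TTATGAATTT").split()
    sv = parse_predsv_line("\t".join(row))
    print(sv["sv_type"], sv["left_chr"], sv["left_pos"],
          sv["right_chr"], sv["right_pos"])
```
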
4. Visualization of the detailed alignment at breakpoint (optional)

The bam2html.pl script builds an HTML view of the multiple alignment for
the breakpoint, so you can manually check the soft-clipping and other details.

    bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \
        --ref_genome /genome/hg18 -o diag.bam.predSV.html

The output file is specified by the -o option.

====================
About Soft-Clipping:
====================

CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is
indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may
not occur, depending on the mapping algorithm and parameters and sometimes even
the library preparation.

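To make the "S" elements concrete, the following Python sketch reports how
many bases are soft-clipped at each end of a CIGAR string. It is a
stand-alone illustration, not code from CREST; the function name is
hypothetical, and hard-clip ("H") operations at the ends are skipped because
the SAM format allows them outside the soft clips.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips may sit outside soft clips; ignore them here.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


if __name__ == "__main__":
    for cigar in ("100M", "20S80M", "75M25S", "5H10S60M"):
        print(cigar, soft_clip_lengths(cigar))
```

A read with a non-zero value on either side is the kind of record that
contributes to the soft-clipping positions extracted in step 1.
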
With bwa sampe:
---------------

One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end
reads). When BWA successfully maps one read in a pair but is not able to map
the other, it will attempt a more permissive Smith-Waterman alignment of the
unmapped read in the neighborhood of the mapped mate. If it is only able to
align part of the read, then it will soft-clip the portion on the end that it
could not align. Often this occurs at the breakpoints of structural
variations.

In some cases when the insert sizes approach the read length, BWA will not
perform Smith-Waterman alignment. Reads from inserts smaller than the read
length will contain primer and/or adapter and will often not map. When the
insert size is close to the read length, this creates a skewed distribution
of inferred insert sizes which may cause BWA to not attempt Smith-Waterman
realignment. This is indicated by the error message "weird pairing". Often
in these cases there are also unusually low mapping rates.

One way to fix this problem is to remap unmapped reads with bwasw. To do this,
extract the unmapped reads as FASTQ files (this may be done with a combination
of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and
build a BAM file. Then, re-run CREST on this new BAM file, and you may pick
up events that would have been missed otherwise.

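For reference, samtools view -f 4 keeps the records whose FLAG field has bit
0x4 (read unmapped) set. The following Python sketch applies the same test
to SAM text lines. It is only meant to show what the filter selects; the
function name and the sample records are made up for this example.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def unmapped_records(sam_lines):
    """Yield SAM alignment lines whose FLAG has the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED:
            yield line


if __name__ == "__main__":
    # Hypothetical records: r1 is unmapped (FLAG 77), r2 is mapped (FLAG 99).
    sam = [
        "@HD\tVN:1.0\tSO:coordinate",
        "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r2\t99\t4\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    ]
    for record in unmapped_records(sam):
        print(record.split("\t")[0])
```
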
With other aligners:
--------------------

Consult the documentation or mailing list(s) for your mapper to determine its
behavior with regard to soft-clipping.