Mercurial > repos > jjohnson > crest
diff README @ 0:acc8d8bfeb9a
Uploaded
author | jjohnson |
---|---|
date | Wed, 08 Feb 2012 16:59:24 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README Wed Feb 08 16:59:24 2012 -0500 @@ -0,0 +1,254 @@ +This document has the information on how to run CREST for structural +variation detection. + +============= +Requirements: +============= +Before running CREST, you need to make sure that several pieces of software +and/or modules are installed on the system: +1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT +can be obtained from these links: + BLAT for academic use: http://www.soe.ucsc.edu/~kent + BLAT commercial license: http://www.kentinformatics.com/ +2. CAP3 assembly program, available here: + CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html + CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech, + rakolehm@mtu.edu or (906)487-2228. +3: SAMtools library for accessing SAM/BAM files, available from SourceForge: + SAMtools: http://sourceforge.net/projects/samtools/files/ +4. BioPerl and Bio::DB::Sam modules. They are usually available as +packages on most Linux distributions, but are also available at this link: + BioPerl: http://www.bioperl.org/ + Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm +Important: you must install SAMtools library before install Bio::DB::Sam. +5. ptrfinder is needed if you want to remove short tandem repeat mediated +SVs, the executable is included in the download package, put it on the path. + +Note: +1. You can use your own programs in place of BLAT and CAP3, but you need to +implement the run method in SVExtTools.pm. +2. The pipeline uses gfServer to mimic a standard blat server, so you need +to setup your own gfServer. Details on setting up the server can be found in +the BLAT package. Using a query server can significantly increase the speed of +the pipeline. + +Your BAM files must contain soft-clipping signatures at the breakpoints. If +they do not, you will not get any results. For more information see the +section "About Soft-Clipping" at the end of this document. + +===================== +Running the pipeline: +===================== + +Make sure that all the required perl modules are in @INC. One simple way +is to put all .pm and .pl scripts in the same directory and run them from +this same directory. Also, the input bam file, must be sorted and indexed +before running the pipeline. + +We are going to use two sample bam files (tumor.bam and germline.bam) to +illustate how to run the pipeline. The examples assume you want to find SV +in tumor.bam and you also have the matched germline sample bam file. + +Important: indexing all bam files before running the pipeline is required. + +1. Get soft-clipping positions. + +The program extractSClip.pl will extract all soft-clipping positions first, +and identify those positions with a cluster of soft-clipped reads. The +program requires only the BAM file and the reference genome's FASTA file +The following is an example to extract all positions: + + extractSClip.pl -i tumor.bam --ref_genome hg18.fa + +Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated +for use in the next step. + +Note: The program can use either paired or single-end sequencing data. For +single-end data, use the --nopaired parameter. + +For whole genome sequencing project, we highly suggest running the procedure +in parallel by dividing the genome into pieces. One natural way is by +chromosome. The following is an example to extract all positons on chr4. + + extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4 + +Important: The genome file used in this pipeline must be the same as the one +used to map reads, so the chromosome names need to agree. In this example, +the genome file and bam file all have the chromosome name as 4 instead of +chr4 you may encounter. + +Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated +for use in the next step. So it's very easy to run this step in parallel and +combine the results together to form a final result. + +The output files for this step have names with suffixes of *.cover and +*.sclip.txt. The .cover file is a tab-delimited text file, with columns: +chr, position, strand, number of soft-clipped reads, and coverage at that +position. The strand is just left-clipped or right-clipped to help identify +the SV orientation. The .sclip.txt file has the detailed information for +all soft-clipped reads including sequence and quality values. This file is +also tab-delimited with the following columns: chr, posiiton, strand, read +name, sequence, and quality. + +Example of part of a *.cover file: +4 125892327 + 1 28 +4 125892458 + 1 27 +4 125893225 + 1 28 +4 125893227 + 5 29 +4 125893365 - 1 26 +4 125893979 - 1 16 +10 66301086 - 1 33 +10 66301858 + 4 14 +10 66301865 - 8 21 +10 66301871 - 1 22 +10 66302136 + 1 51 + +Example of part of a *.sclip.txt file: +4 125892327 + HWUSI-EAS1591_6113C:3:17:12332:19420#0 CC CC +4 125892458 + HWUSI-EAS1591_6113C:4:91:6281:9961#0 GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC +4 125893225 + HWI-EAS90_614M9:5:18:17924:10181#0 CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC + +If you run this step in parallel, you need to combine the outputs by +concatenating the files. Tumor and germline files must be concatenated +separately, for example: +cat tumor.bam.*.cover > tumor.bam.cover +cat germline.bam.*.cover > germline.bam.cover + +2.Remove germline events (optional) + +Running step 1 on both germline and tumor samples, you will get the +soft-clipping posiitons in both samples. This step will remove any position +in the tumor sample that also appears in germline sample, so germline events +will be removed. This step does not use any sequence information and could +remove true events. By our observations, true events are rarely removed. +You can skip this step and the program will do germline clean up at later step +(see the -g parameter for CREST.pl). + +The script for this step is countDiff.pl and it only requires two parameters +to specity the two output files from previous step. + + countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt + +A file named tumor.bam.cover.somatic.cover will be generated for next step. + +The program will generate a file with suffix *.somatic.cover, and it will +be used for the next step. The file has the same format as *.cover generated +in the previous step. + +The standard output will show the coverage distribution. For every read count +in the range 1-999, it will show the number of breakpoints supported by that +many soft-clipped reads. + +3. Running the SV detection script. + +This is the core step in the detection process. The program is CREST.pl. + +The program needs quite a few parameters, but you can think about what you +will need. Here is a partial list of required and common parameters: + + -f The input soft-clipped coverage file produced in step 1 or 2. + -d The disease or tumor bam file + -g The germline bam file. If you want to identify somatic SVs only, you + should provide this parameter. If you also want to identify germline + events, you can leave this parameter unspecified. When treat your + germline file as disease without specify -g parameter, the program can + be used to identify germline events, or SV polymorphism. + --ref_genome The reference genome in fa format (used by bam file) + -t The reference genome in 2bit format (used by gfClient), this file can + be generated by using faToTwoBit program in BLAT program suit. This + file must be the same as the one you used to setup gfServer. + --blatserver The name or IP address of blat server, you need to use your + own one instead of using the public one at UCSC. + --blatport The port number for the blat server. + --nopaired Tell the program the reads are not paired. + +If all of the required programs are on the path then you won't need to +specify them again, otherwise you need to specify the paths to the programs +using the corresponding parameters. Please use CREST.pl --man to show the +man page, which provides a detailed parameter list. + +An example of running this step is: + CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ + hg18.fa -t hg18.2bit + +There is also a -r parameter to specify the range to be searched and it's highly +recommended to run using -r as below: + CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \ + hg18.fa -t /genome/hg18.2bit -r chr1 +So it's very easy to run the program in parallel by spliting the genome into +pieces. + +The program will generate a *.predSV.txt file. The filename will be the input +bam with .predSV.txt appended unless you specify the -p parameter. Also the +STDERR output has the full list of SVs, including rejected ones. The output +file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos, +left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand, +# right soft-clipped reads, SV type, coverage at left_pos, coverage at +right_pos, assembled length at left_pos, assembled length at right_pos, +average percent identity at left_pos, percent of non-unique mapping reads at +left_pos, average percent identity at right_pos, percent of non-unique mapping +reads at right_pos, start position of consensus mapping to genome, +starting chromosome of consensus mapping, position of the genomic mapping of +consensus starting position, end position of consensus mapping to genome, +ending chromsome of consnesus mapping, position of genomic mapping of +consensus ending posiiton, and consensus sequences. For inversion(INV), the +last 7 fields will be repeated to reflect the fact two different breakpoints +are needed to identify an INV event. + +Example of the tumor.predSV.txt file: +4 125893227 + 5 10 66301858 - 4 CTX 29 14 83 71 0.895173453996983 0.230769230769231 0.735384615384615 0.5 1 4 125893135 176 10 66301773 TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT +5 7052198 - 0 10 66301865 + 8 CTX 0 22 0 81 0.761379310344828 0.482758620689655 0 0 1 5 7052278 164 10 66301947 AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG +10 66301858 + 4 4 125893225 - 1 CTX 15 28 71 81 0.735384615384615 0.5 0.889507154213037 0.243243243243243 1 10 66301777 153 4 125893154 TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA + +If there are no or very few results, there may be a lack of soft-clipping. See +the section "About Soft-Clipping" at the end of this document. + +4. Visulization of the detailed alignment at breakpoint (optional) + +The bam2html.pl script builds an html view of the multiple alignment for +the breakpoint, so you can manually check the soft-clipping and other things. + + bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \ + --ref_genome /genome/hg18 -o diag.bam.predSV.html + +The output file is specified by -o option. + +==================== +About Soft-Clipping: +==================== + +CREST uses soft-clipping signatures to identify breakpoints. Soft-clipping is +indicated by "S" elements in the CIGAR for SAM/BAM records. Soft-clipping may +not occur, depending on the mapping algorithm and parameters and sometimes even +the library preparation. + +With bwa sampe: +--------------- + +One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end +reads). When BWA successfully maps one read in a pair but is not able to map +the other, it will attempt a more permissive Smith-Waterman alignment of the +unmapped read in the neighborhood of the mapped mate. If it is only able to +align part of the read, then it will soft-clip the portion on the end that it +could not align. Often this occurs at the breakpoints of structural +variations. + +In some cases when the insert sizes approach the read length, BWA will not +perform Smith-Waterman alignment. Reads from inserts smaller than the read +length will contain primer and/or adapter and will often not map. When the +insert size is close to the read length, this creates a skewed distribution +of inferred insert sizes which may cause BWA to not attempt Smith-Waterman +realignment. This is indicated by the error message "weird pairing". Often +in these cases there are also unusually low mapping rates. + +One way to fix this problem is to remap unmapped reads bwasw. To do this, +extract the unmapped reads as FASTQ files (this may be done with a combination +of samtools view -f 4 and Picard's SamToFastq). Realign using bwa bwasw and +build a BAM file. Then, re-run CREST on this new BAM file, and you may pick +up events that would have been missed otherwise. + +With other aligners: +-------------------- + +Consult the documentation or mailing list(s) for your mapper to determine its +behavior with regard to soft-clipping.