Mercurial > repos > jjohnson > crest

This document has the information on how to run CREST for structural
variation detection.

=============
Requirements:
=============
Before running CREST, you need to make sure that several pieces of software
and/or modules are installed on the system:
1. BLAT software suite, especially blat, gfClient, and gfServer. BLAT
can be obtained from these links:
    BLAT for academic use: http://www.soe.ucsc.edu/~kent
    BLAT commercial license: http://www.kentinformatics.com/
2. CAP3 assembly program, available here:
    CAP3 for academic use: http://seq.cs.iastate.edu/cap3.html
    CAP3 commercial license: Contact Robin Kolehmainen at Michigan Tech,
    rakolehm@mtu.edu or (906)487-2228.
3: SAMtools library for accessing SAM/BAM files, available from SourceForge:
    SAMtools: http://sourceforge.net/projects/samtools/files/
4. BioPerl and Bio::DB::Sam modules.  They are usually available as
packages on most Linux distributions, but are also available at this link:
    BioPerl: http://www.bioperl.org/
	Bio::DB::Sam: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm
Important: you must install SAMtools library before install Bio::DB::Sam.
5. ptrfinder is needed if you want to remove short tandem repeat mediated
SVs, the executable is included in the download package, put it on the path.

Note:
1. You can use your own programs in place of BLAT and CAP3, but you need to
implement the run method in SVExtTools.pm.
2. The pipeline uses gfServer to mimic a standard blat server, so you need
to setup your own gfServer.  Details on setting up the server can be found in
the BLAT package.  Using a query server can significantly increase the speed of
the pipeline.

Your BAM files must contain soft-clipping signatures at the breakpoints.  If
they do not, you will not get any results.  For more information see the
section "About Soft-Clipping" at the end of this document.

=====================
Running the pipeline:
=====================

Make sure that all the required perl modules are in @INC.  One simple way
is to put all .pm and .pl scripts in the same directory and run them from
this same directory. Also, the input bam file, must be sorted and indexed
before running the pipeline.

We are going to use two sample bam files (tumor.bam and germline.bam) to
illustate how to run the pipeline. The examples assume you want to find SV
in tumor.bam and you also have the matched germline sample bam file.

Important: indexing all bam files before running the pipeline is required.

1. Get soft-clipping positions.

The program extractSClip.pl will extract all soft-clipping positions first,
and identify those positions with a cluster of soft-clipped reads.  The
program requires only the BAM file and the reference genome's FASTA file
The following is an example to extract all positions:

	extractSClip.pl -i tumor.bam --ref_genome hg18.fa

Two files named tumor.bam.cover and tumor.bam.sclip.txt will be generated
for use in the next step.

Note: The program can use either paired or single-end sequencing data.  For
single-end data, use the --nopaired parameter.

For whole genome sequencing project, we highly suggest running the procedure
in parallel by dividing the genome into pieces.  One natural way is by
chromosome. The following is an example to extract all positons on chr4.

	extractSClip.pl -i tumor.bam --ref_genome hg18.fa -r 4

Important: The genome file used in this pipeline must be the same as the one
used to map reads, so the chromosome names need to agree.  In this example,
the genome file and bam file all have the chromosome name as 4 instead of
chr4 you may encounter.

Two files named tumor.bam.4.cover and tumor.bam.4.sclip.txt will be generated
for use in the next step. So it's very easy to run this step in parallel and
combine the results together to form a final result.

The output files for this step have names with suffixes of *.cover and
*.sclip.txt.  The .cover file is a tab-delimited text file, with columns:
chr, position, strand, number of soft-clipped reads, and coverage at that
position.  The strand is just left-clipped or right-clipped to help identify
the SV orientation.  The .sclip.txt file has the detailed information for
all soft-clipped reads including sequence and quality values. This file is
also tab-delimited with the following columns: chr, posiiton, strand, read
name, sequence, and quality.

Example of part of a *.cover file:
4       125892327       +       1       28
4       125892458       +       1       27
4       125893225       +       1       28
4       125893227       +       5       29
4       125893365       -       1       26
4       125893979       -       1       16
10      66301086        -       1       33
10      66301858        +       4       14
10      66301865        -       8       21
10      66301871        -       1       22
10      66302136        +       1       51

Example of part of a *.sclip.txt file:
4       125892327       +       HWUSI-EAS1591_6113C:3:17:12332:19420#0  CC      CC
4       125892458       +       HWUSI-EAS1591_6113C:4:91:6281:9961#0    GACTAACCACCACGGTACATGTTTTCCTATGTAAAAAACCTGCACATTCTACACATGTATCCCAGAACTTAAAGTAAAACAC B@C@?:CC>CCBCCCCACBCDCCCCCC;<:<9CCCCC@CCCCCBCCCCCCCCCCCCCCCACCCCCCCCCCCCCCCCCCCCAC
4       125893225       +       HWI-EAS90_614M9:5:18:17924:10181#0      CCCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAA  #######@@7@:8@><16+6(B>AABCAA3AB@CC6CCCCCCCDCCCCCCBCCDCCCCCCCCCCCCDCCCCCCCCCCCCCC

If you run this step in parallel, you need to combine the outputs by
concatenating the files.  Tumor and germline files must be concatenated
separately, for example:
cat tumor.bam.*.cover > tumor.bam.cover
cat germline.bam.*.cover > germline.bam.cover

2.Remove germline events (optional)

Running step 1 on both germline and tumor samples, you will get the
soft-clipping posiitons in both samples.  This step will remove any position
in the tumor sample that also appears in germline sample, so germline events
will be removed.  This step does not use any sequence information and could
remove true events.  By our observations, true events are rarely removed.
You can skip this step and the program will do germline clean up at later step
(see the -g parameter for CREST.pl).

The script for this step is countDiff.pl and it only requires two parameters
to specity the two output files from previous step.

	countDiff.pl -d tumor.bam.cover -g germline.bam.cover > soft_clip.dist.txt

A file named tumor.bam.cover.somatic.cover will be generated for next step.

The program will generate a file with suffix *.somatic.cover, and it will
be used for the next step.  The file has the same format as *.cover generated
in the previous step.

The standard output will show the coverage distribution.  For every read count
in the range 1-999, it will show the number of breakpoints supported by that
many soft-clipped reads.

3. Running the SV detection script.

This is the core step in the detection process. The program is CREST.pl.

The program needs quite a few parameters, but you can think about what you
will need.  Here is a partial list of required and common parameters:

	-f The input soft-clipped coverage file produced in step 1 or 2.
	-d The disease or tumor bam file
	-g The germline bam file.  If you want to identify somatic SVs only, you
	should provide this parameter.  If you also want to identify germline
	events, you can leave this parameter unspecified. When treat your
	germline file as disease without specify -g parameter, the program can
	be used to identify germline events, or SV polymorphism.
	--ref_genome The reference genome in fa format (used by bam file)
	-t The reference genome in 2bit format (used by gfClient), this file can
	be generated by using faToTwoBit program in BLAT program suit.  This
	file must be the same as the one you used to setup gfServer.
	--blatserver The name or IP address of blat server, you need to use your
	own one instead of using the public one at UCSC.
	--blatport The port number for the blat server.
	--nopaired Tell the program the reads are not paired.

If all of the required programs are on the path then you won't need to
specify them again, otherwise you need to specify the paths to the programs
using the corresponding parameters.  Please use CREST.pl --man to show the
man page, which provides a detailed parameter list.

An example of running this step is:
	CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
	hg18.fa -t hg18.2bit

There is also a -r parameter to specify the range to be searched and it's highly
recommended to run using -r as below:
	CREST.pl -f tumor.bam.cover -d tumor.bam -g germline.bam --ref_genome \
	hg18.fa -t /genome/hg18.2bit -r chr1
So it's very easy to run the program in parallel by spliting the genome into
pieces.

The program will generate a *.predSV.txt file.  The filename will be the input
bam with .predSV.txt appended unless you specify the -p parameter.  Also the
STDERR output has the full list of SVs, including rejected ones.  The output
file *.predSV.txt has the following tab-delimited columns: left_chr, left_pos,
left_strand, # of left soft-clipped reads, right_chr, right_pos, right_strand,
# right soft-clipped reads, SV type, coverage at left_pos, coverage at
right_pos, assembled length at left_pos, assembled length at right_pos,
average percent identity at left_pos, percent of non-unique mapping reads at
left_pos, average percent identity at right_pos, percent of non-unique mapping
reads at right_pos, start position of consensus mapping to genome,
starting chromosome of consensus mapping, position of the genomic mapping of
consensus starting position, end position of consensus mapping to genome,
ending chromsome of consnesus mapping, position of genomic mapping of
consensus ending posiiton, and consensus sequences.  For inversion(INV), the
last 7 fields will be repeated to reflect the fact two different breakpoints
are needed to identify an INV event.

Example of the tumor.predSV.txt file:
4	125893227	+	5	10	66301858	-	4	CTX	29	14	83	71	0.895173453996983	0.230769230769231	0.735384615384615	0.5	1	4	125893135	176	10	66301773	TTATGAATTTTGAAATATATATCATATTTTGAAATATATATCATATTCTAAATTATGAAAAGAGAATATGATTCTCTTTTCAGTAGCTGTCACCTCCTGGGTTCAAGTGATTCTCCTGCCTCTACCTCCCGAGTAGCTGGGATTACAGGTGCCCACCACCATGCCTGGCTAATTTT
5	7052198	-	0	10	66301865	+	8	CTX	0	22	0	81	0.761379310344828	0.482758620689655	0	0	1	5	7052278	164	10	66301947	AGCCATGGACCTTGTGGTGGGTTCTTAACAATGGTGAGTCCGGAGTTCTTAACGATGGTGAGTCCGTAGTTTGTTCCTTCAGGAGTGAGCCAAGATCATGCCACTGCACTCTAGCCTGGGCAACAGAGGAAGACTCCACCTCAAAAAAAAAAAGTGGGAAGAGG
10	66301858	+	4	4	125893225	-	1	CTX	15	28	71	81	0.735384615384615	0.5	0.889507154213037	0.243243243243243	1	10	66301777	153	4	125893154	TTAGCCAGGCATGGTGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGTAGAGGCAGGAGAATCACTTGAACCCAGGAGGTGACAGCTACTGAAAAGAGAATCATATTCTCTTTTCATAATTTAGAATATGATATATATTTCAAAATATGATA

If there are no or very few results, there may be a lack of soft-clipping.  See
the section "About Soft-Clipping" at the end of this document.

4. Visulization of the detailed alignment at breakpoint (optional)

The bam2html.pl script builds an html view of the multiple alignment for
the breakpoint, so you can manually check the soft-clipping and other things.

	bam2html.pl -d diag.bam -g germline.bam -i diag.bam.predSV.txt \
		--ref_genome /genome/hg18 -o diag.bam.predSV.html

The output file is specified by -o option.

====================
About Soft-Clipping:
====================

CREST uses soft-clipping signatures to identify breakpoints.  Soft-clipping is
indicated by "S" elements in the CIGAR for SAM/BAM records.  Soft-clipping may
not occur, depending on the mapping algorithm and parameters and sometimes even
the library preparation.

With bwa sampe:
---------------

One mapping method that will soft-clip reads is bwa sampe (BWA for paired-end
reads).  When BWA successfully maps one read in a pair but is not able to map
the other, it will attempt a more permissive Smith-Waterman alignment of the
unmapped read in the neighborhood of the mapped mate.  If it is only able to
align part of the read, then it will soft-clip the portion on the end that it
could not align.  Often this occurs at the breakpoints of structural
variations.

In some cases when the insert sizes approach the read length, BWA will not
perform Smith-Waterman alignment.  Reads from inserts smaller than the read
length will contain primer and/or adapter and will often not map.  When the
insert size is close to the read length, this creates a skewed distribution
of inferred insert sizes which may cause BWA to not attempt Smith-Waterman
realignment.  This is indicated by the error message "weird pairing".  Often
in these cases there are also unusually low mapping rates.

One way to fix this problem is to remap unmapped reads bwasw.  To do this,
extract the unmapped reads as FASTQ files (this may be done with a combination
of samtools view -f 4 and Picard's SamToFastq).  Realign using bwa bwasw and
build a BAM file.  Then, re-run CREST on this new BAM file, and you may pick
up events that would have been missed otherwise.

With other aligners:
--------------------

Consult the documentation or mailing list(s) for your mapper to determine its
behavior with regard to soft-clipping.
author	jjohnson
date	Wed, 08 Feb 2012 16:59:24 -0500
parents
children