
What it does

CSEM (ChIP-Seq multi-read allocation using E-M algorithm) is a multi-read allocation algorithm. Multi-reads are reads that map to multiple locations on the reference genome. Most common analyses of ChIP-seq data use only reads that map uniquely to the relevant reference genome (uni-reads), which can lead to the omission of up to 30% of alignable reads. Chung et al. (2011) showed that incorporating multi-reads significantly increases sequencing depth, leads to the detection of novel peaks that are not otherwise identifiable with uni-reads, and improves the detection of peaks in regions of low mappability. Their computational and experimental results established that multi-reads can be of critical importance for studying DNA-protein interactions in highly repetitive regions of genomes with ChIP-seq experiments. Output from CSEM can be used with peak callers such as MOSAiCS and MACS to identify peaks in regions of both high and low mappability.

Please cite: Chung D, Kuan PF, Li B, SanalKumar R, Liang K, Bresnick E, Dewey C, and Keles S (2011), "Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data," PLoS Computational Biology, 7(7): e1002111.

Input formats

CSEM takes short reads aligned with Bowtie as input. Bowtie, in turn, accepts single-end reads in FASTA or FASTQ format. Quality scores of reads are ignored.

Pseudo-tags

For each read in the alignment file, CSEM estimates the fraction of the read allocated to each of its alignments. This fraction reflects the degree of confidence in that particular alignment. Currently, only the peak caller MOSAiCS accepts fractional reads as input. However, you can incorporate multi-reads into ChIP-seq analysis with your favorite peak caller by using the pseudo-tag functionality. Pseudo-tags are generated by assigning each multi-read to the location it maps to with the largest weight and filtering out multi-reads whose largest weight is less than 0.5. Although summarizing CSEM output as pseudo-tags decreases the number of multi-reads used, it still significantly increases the sequencing depth compared to using uni-reads alone and facilitates the identification of peaks in repetitive regions.
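
The pseudo-tag rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CSEM's actual implementation: the `Alignment` tuple and the `make_pseudo_tags` function are hypothetical names, the example alignments are made up, and keeping a weight of exactly 0.5 (rather than dropping it) is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One candidate alignment of a read (hypothetical structure, for illustration)."""
    read_id: str
    chrom: str
    strand: str
    pos: int
    frac: float  # fraction of the read allocated to this alignment


def make_pseudo_tags(alignments, min_weight=0.5):
    """Keep, per read, the alignment with the largest weight.

    Reads whose largest weight is below min_weight are dropped.
    Kept alignments are given frac = 1.0, since a pseudo-tag counts as a whole read.
    """
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln.read_id].append(aln)

    tags = []
    for candidates in by_read.values():
        best = max(candidates, key=lambda a: a.frac)
        if best.frac < min_weight:
            continue  # assumption: only weights strictly below the threshold are filtered out
        tags.append(best._replace(frac=1.0))
    return tags


# Made-up example data
alignments = [
    Alignment("r1", "chr1", "+", 100, 1.0),   # uni-read
    Alignment("r2", "chr1", "+", 500, 0.7),   # multi-read with a dominant location
    Alignment("r2", "chr2", "-", 900, 0.3),
    Alignment("r3", "chr3", "+", 40, 0.4),    # multi-read with no location at or above 0.5
    Alignment("r3", "chr4", "+", 80, 0.35),
    Alignment("r3", "chr5", "-", 10, 0.25),
]
pseudo_tags = make_pseudo_tags(alignments)
```

In this example, `r1` is kept as is, `r2` is assigned to its `chr1` location, and `r3` is discarded because none of its alignments reaches the threshold.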

Outputs

Currently, results from CSEM can be exported in BED or GFF format, or as a table. Each line of the output file specifies a single alignment. The lines are ordered so that all alignments of uni-reads appear first. If pseudo-tags are generated, FRAC equals 1 for all reads when the output is a table, and score is set to 1000 for all reads in the BED and GFF formats.

If the output is a table, it has the following columns:

Column  Description
------  --------------------------------------------------------
1 RID   ID of a read
2 CID   Chromosome of the alignment
3 DIR   Strand of the alignment (+ or -)
4 POS   Left-most position of the aligned read (the first base in a chromosome is numbered 1)
5 FRAC  Fraction of the read allocated to the alignment (which is 1 for uni-reads)
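
As a rough illustration of how the tabular output might be consumed downstream, the sketch below parses one line into the five columns listed above. It assumes the fields are separated by whitespace (the delimiter is not specified here), and both the `parse_table_line` helper and the example line are hypothetical.

```python
def parse_table_line(line):
    """Parse one line of the tabular output into a dict.

    Assumption: fields are whitespace-separated and appear in the order
    RID, CID, DIR, POS, FRAC.
    """
    rid, cid, direction, pos, frac = line.split()
    if direction not in ("+", "-"):
        raise ValueError(f"unexpected strand: {direction!r}")
    return {
        "RID": rid,
        "CID": cid,
        "DIR": direction,
        "POS": int(pos),      # 1-based left-most position
        "FRAC": float(frac),  # 1 for uni-reads and for pseudo-tags
    }


example = "read_17\tchr2\t-\t1042\t0.25"  # made-up line for illustration
record = parse_table_line(example)
```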

If the output is in BED format, it has the following columns:

Column        Description
------------  --------------------------------------------------------
1 chrom       Chromosome of the alignment
2 chromStart  Start position of the aligned read (the first base in a chromosome is numbered 0)
3 chromEnd    End position of the aligned read (the first base in a chromosome is numbered 0)
4 name        ID of a read
5 score       1000 * fraction of the read allocated to the alignment (which is 1000 for uni-reads)
6 strand      Strand of the alignment (+ or -)
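
Because the score column stores 1000 times the allocated fraction, the fraction can be recovered by dividing by 1000. The sketch below shows this for a single BED line; it assumes tab-separated fields, and the `parse_bed_line` helper and the example line are hypothetical.

```python
def parse_bed_line(line):
    """Split one 6-column BED line and recover the allocated fraction from the score.

    Assumption: fields are tab-separated.
    """
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")
    return {
        "chrom": chrom,
        "chromStart": int(start),
        "chromEnd": int(end),
        "name": name,
        "fraction": float(score) / 1000.0,  # score = 1000 * fraction
        "strand": strand,
    }


example = "chr2\t1041\t1076\tread_17\t250\t-"  # made-up line for illustration
record = parse_bed_line(example)
```

A uni-read or pseudo-tag, whose score is 1000, yields a fraction of 1.0 under this conversion.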

If the output is in GFF format, it has the following columns:

Column     Description
---------  --------------------------------------------------------
1 seqname  Chromosome of the alignment
2 source   Always "CSEM"
3 feature  ID of a read
4 start    Start position of the aligned read (the first base in a chromosome is numbered 1)
5 end      End position of the aligned read (the first base in a chromosome is numbered 1)
6 score    1000 * fraction of the read allocated to the alignment (which is 1000 for uni-reads)
7 strand   Strand of the alignment (+ or -)
8 frame    Always "."
9 group    Always "."