# HG changeset patch
# User kevyin
# Date 1356046083 18000
# Node ID f0b5827b60513f31fd809bb985d9a091afb47205
# Parent d27851e0cbbdcba7d0742ad21825d80243524a1d
Uploaded

diff -r d27851e0cbbd -r f0b5827b6051 README
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,15 @@
+Homer wrapper for Galaxy
+
+The homer tools will need to be accessible from the command line
+
+Code repo: https://bitbucket.org/gvl/homer
+
+=========================================:
+LICENSE for this wrapper:
+=========================================:
+Kevin Ying
+Garvan Institute: http://www.garvan.org.au
+GVL: https://genome.edu.au/wiki/GVL
+
+http://opensource.org/licenses/mit-license.php
+
diff -r d27851e0cbbd -r f0b5827b6051 annotatePeaks.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/annotatePeaks.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,164 @@
+
+        homer
+
+        annotatePeaks.pl $input_bed $genome_selector 1> $out_annotated
+        2> $out_log || echo "Error running annotatePeaks." >&2
+
+    .. class:: infomark
+
+    **Homer annotatePeaks**
+
+    More information on accepted formats and options
+
+    http://biowhat.ucsd.edu/homer/ngs/annotation.html
+
+    TIP: use homer_bed2pos and homer_pos2bed to convert between the homer peak positions and the BED format.
+
+**Parameter list**
+
+Command line options (not all of them are supported)::
+
+    Usage: annotatePeaks.pl <peak file | tss> <genome version> [additional options...]
+
+    Available Genomes (required argument): (name,org,directory,default promoter set)
+        -- or --
+    Custom: provide the path to genome FASTA files (directory or single file)
+
+    User defined annotation files (default is UCSC refGene annotation):
+        annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative
+        to custom annotations, such as those from de novo transcript discovery or Gencode.
+        -gtf <gtf format file> (-gff and -gff3 can work for those files, but GTF is better)
+
+    Peak vs. tss/tts/rna mode (works with custom GTF file):
+        If the first argument is "tss" (i.e. annotatePeaks.pl tss hg18 ...) then a TSS centric
+        analysis will be carried out. Tag counts and motifs will be found relative to the TSS.
+        (no position file needed) ["tts" now works too - e.g. 3' end of gene]
+        ["rna" specifies gene bodies, will automatically set "-size given"]
+        NOTE: The default TSS peak size is 4000 bp, i.e. +/- 2kb (change with -size option)
+        -list <gene id list> (subset of genes to perform analysis [unigene, gene id, accession,
+            probe, etc.], default = all promoters)
+        -cTSS <promoter position file i.e. peak file> (should be centered on TSS)
+
+    Primary Annotation Options:
+        -mask (Masked repeats, can also add 'r' to end of genome name)
+        -m <motif file 1> [motif file 2] ... (list of motifs to find in peaks)
+            -mscore (reports the highest log-odds score within the peak)
+            -nmotifs (reports the number of motifs per peak)
+            -mdist (reports distance to closest motif)
+            -mfasta <filename> (reports sites in a fasta file - for building new motifs)
+            -fm <motif file 1> [motif file 2] (list of motifs to filter from above)
+            -rmrevopp <#> (only count sites found within <#> on both strands once, i.e. palindromic)
+            -matrix <prefix> (outputs motif co-occurrence files:
+                prefix.count.matrix.txt - number of peaks with motif co-occurrence
+                prefix.ratio.matrix.txt - ratio of observed vs. expected co-occurrence
+                prefix.logPvalue.matrix.txt - co-occurrence enrichment
+                prefix.stats.txt - table of pair-wise motif co-occurrence statistics
+                additional options:
+                -matrixMinDist <#> (minimum distance between motif pairs - to avoid overlap)
+                -matrixMaxDist <#> (maximum distance between motif pairs)
+            -mbed <filename> (Output motif positions to a BED file to load at UCSC (or -mpeak))
+            -mlogic <filename> (will output stats on common motif orientations)
+        -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
+            tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
+        -bedGraph <bedGraph file 1> [bedGraph file 2] ... (read coverage counts from bedGraph files)
+        -wig <wiggle file 1> [wiggle file 2] ... (read coverage counts from wiggle files)
+        -p <peak file> [peak file 2] ... (to find nearest peaks)
+            -pdist to report only distance (-pdist2 gives directional distance)
+            -pcount to report number of peaks within region
+        -vcf <VCF file> (annotate peaks with genetic variation information, one col per individual)
+            -editDistance (Computes the # bp changes relative to reference)
+            -individuals <name1> [name2] ... (restrict analysis to these individuals)
+        -gene <data file> ... (Adds additional data to result based on the closest gene.
+            This is useful for adding gene expression data. The file must have a header,
+            and the first column must be a GeneID, Accession number, etc. If the peak
+            cannot be mapped to data in the file then the entry will be left empty.)
+        -go <output directory> (perform GO analysis using genes near peaks)
+        -genomeOntology <output directory> (perform genomeOntology analysis on peaks)
+            -gsize <#> (Genome size for genomeOntology analysis, default: 2e9)
+
+    Annotation vs. Histogram mode:
+        -hist <bin size in bp> (i.e. 1, 2, 5, 10, 20, 50, 100 etc.)
+        The -hist option can be used to generate histograms of position dependent features relative
+        to the center of peaks. This is primarily meant to be used with -d and -m options to map
+        distribution of motifs and ChIP-Seq tags. For ChIP-Seq peaks for a Transcription factor
+        you might want to use the -center option (below) to center peaks on the known motif
+        ** If using "-size given", histogram will be scaled to each region (i.e. 0-100%), with
+        the -hist parameter being the number of bins to divide each region into.
+        Histogram Mode specific Options:
+        -nuc (calculated mononucleotide frequencies at each position,
+            Will report by default if extracting sequence for other purposes like motifs)
+        -di (calculated dinucleotide frequencies at each position)
+        -histNorm <#> (normalize the total tag count for each region to 1, where <#> is the
+            minimum tag total per region - use to avoid tag spikes from low coverage)
+        -ghist (outputs profiles for each gene, for peak shape clustering)
+        -rm <#> (remove occurrences of same motif that occur within # bp)
+
+    Peak Centering: (other options are ignored)
+        -center <motif file> (This will re-center peaks on the specified motif, or remove peak
+            if there is no motif in the peak. ONLY recentering will be performed, and all other
+            options will be ignored. This will output a new peak file that can then be reanalyzed
+            to reveal fine-grain structure in peaks (It is advised to use -size < 200) with this
+            to keep peaks from moving too far (-mirror flips the position)
+        -multi (returns genomic positions of all sites instead of just the closest to center)
+
+    Advanced Options:
+        -len <#> / -fragLength <#> (Fragment length, default=auto, might want to set to 0 for RNA)
+        -size <#> (Peak size[from center of peak], default=inferred from peak file)
+            -size #,# (i.e. -size -10,50 count tags from -10 bp to +50 bp from center)
+            -size "given" (count tags etc. using the actual regions - for variable length regions)
+        -log (output tag counts as log2(x+1+rand) values - for scatter plots)
+        -sqrt (output tag counts as sqrt(x+rand) values - for scatter plots)
+        -strand <+|-|both> (Count tags on specific strands relative to peak, default: both)
+        -pc <#> (maximum number of tags to count per bp, default=0 [no maximum])
+        -cons (Retrieve conservation information for peaks/sites)
+        -CpG (Calculate CpG/GC content)
+        -ratio (process tag values as ratios - i.e. chip-seq, or mCpG/CpG)
+        -nfr (report nucleosome free region scores instead of tag counts, also -nfrSize <#>)
+        -norevopp (do not search for motifs on the opposite strand [works with -center too])
+        -noadj (do not adjust the tag counts based on total tags sequenced)
+        -norm <#> (normalize tags to this tag count, default=1e7, 0=average tag count in all directories)
+        -pdist (only report distance to nearest peak using -p, not peak name)
+        -map <mapping file> (mapping between peak IDs and promoter IDs, overrides closest assignment)
+        -noann, -nogene (skip genome annotation step, skip TSS annotation)
+        -homer1/-homer2 (by default, the new version of homer [-homer2] is used for finding motifs)
+
diff -r d27851e0cbbd -r f0b5827b6051 bed2pos.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bed2pos.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,37 @@
+
+        homer
+
+        bed2pos.pl $input_bed 1> $out_pos
+        2> $out_log || echo "Error running bed2pos." >&2
+
+    .. class:: infomark
+
+    Converts: BED -(to)-> homer peak positions
+
+    **Homer bed2pos.pl**
+
+    http://biowhat.ucsd.edu/homer/ngs/miscellaneous.html
+
diff -r d27851e0cbbd -r f0b5827b6051 findPeaks.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/findPeaks.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,122 @@
+
+        homer
+
+    Homer's peakcaller. Requires tag directories (see makeTagDirectory)
+
+        findPeaks $tagDir.extra_files_path $options -o $outputPeakFile
+
+        #if $control_tagDir:
+        -i $control_tagDir.extra_files_path
+        #end if
+
+        2> $out_log || echo "Error running findPeaks." >&2
+
+    .. class:: infomark
+
+    **Homer findPeaks**
+
+    For more options, look under: "Command line options for findPeaks"
+
+    http://biowhat.ucsd.edu/homer/ngs/peaks.html
+
+    TIP: use homer_bed2pos and homer_pos2bed to convert between the homer peak positions and the BED format.
+
+**Parameter list**
+
+Command line options (not all of them are supported)::
+
+    Usage: findPeaks <tag directory> [options]
+
+    Finds peaks in the provided tag directory. By default, peak list printed to stdout
+
+    General analysis options:
+        -o <filename|auto> (file name to output peaks to, default: stdout)
+            "-o auto" will send output to "<tag directory>/peaks.txt", ".../regions.txt",
+            or ".../transcripts.txt" depending on the "-style" option
+        -style <option> (Specialized options for specific analysis strategies)
+            factor (transcription factor ChIP-Seq, uses -center, output: peaks.txt, default)
+            histone (histone modification ChIP-Seq, region based, uses -region -size 500 -L 0, regions.txt)
+            groseq (de novo transcript identification from GroSeq data, transcripts.txt)
+            tss (TSS identification from 5' RNA sequencing, tss.txt)
+            dnase (Hypersensitivity [crawford style (nicking)], peaks.txt)
+
+    chipseq/histone options:
+        -i <input tag directory> (Experiment to use as IgG/Input/Control)
+        -size <#> (Peak size, default: auto)
+        -minDist <#> (minimum distance between peaks, default: peak size x2)
+        -gsize <#> (Set effective mappable genome size, default: 2e9)
+        -fragLength <#|auto> (Approximate fragment length, default: auto)
+        -inputFragLength <#|auto> (Approximate fragment length of input tags, default: auto)
+        -tbp <#> (Maximum tags per bp to count, 0 = no limit, default: auto)
+        -inputtbp <#> (Maximum tags per bp to count in input, 0 = no limit, default: auto)
+        -strand <both|separate> (find peaks using tags on both strands or separate, default:both)
+        -norm # (Tag count to normalize to, default 10000000)
+        -region (extends start/stop coordinates to cover full region considered "enriched")
+        -center (Centers peaks on maximum tag overlap and calculates focus ratios)
+        -nfr (Centers peaks on most likely nucleosome free region [works best with mnase data])
+            (-center and -nfr can be performed later with "getPeakTags")
+
+    Peak Filtering options: (set -F/-L/-C to 0 to skip)
+        -F <#> (fold enrichment over input tag count, default: 4.0)
+            -P <#> (poisson p-value threshold relative to input tag count, default: 0.0001)
+        -L <#> (fold enrichment over local tag count, default: 4.0)
+            -LP <#> (poisson p-value threshold relative to local tag count, default: 0.0001)
+        -C <#> (fold enrichment limit of expected unique tag positions, default: 2.0)
+        -localSize <#> (region to check for local tag enrichment, default: 10000)
+        -inputSize <#> (Size of region to search for control tags, default: 2x peak size)
+        -fdr <#> (False discovery rate, default = 0.001)
+        -poisson <#> (Set poisson p-value cutoff, default: uses fdr)
+        -tagThreshold <#> (Set # of tags to define a peak, default: 25)
+        -ntagThreshold <#> (Set # of normalized tags to define a peak, by default uses 1e7 for norm)
+        -minTagThreshold <#> (Absolute minimum tags per peak, default: expected tags per peak)
+
+    GroSeq Options: (Need to specify "-style groseq"):
+        -tssSize <#> (size of region for initiation detection/artifact size, default: 250)
+        -minBodySize <#> (size of region for transcript body detection, default: 1000)
+        -maxBodySize <#> (size of region for transcript body detection, default: 10000)
+        -tssFold <#> (fold enrichment for new initiation detection, default: 4.0)
+        -bodyFold <#> (fold enrichment for new transcript detection, default: 4.0)
+        -endFold <#> (end transcript when levels are
this much less than the start, default: 10.0)
+        -fragLength <#> (Approximate fragment length, default: 150)
+        -uniqmap <directory> (directory of binary files specifying uniquely mappable locations)
+            Download from http://biowhat.ucsd.edu/homer/groseq/
+        -confPvalue <#> (confidence p-value: 1.00e-05)
+        -minReadDepth <#> (Minimum initial read depth for transcripts, default: auto)
+        -pseudoCount <#> (Pseudo tag count, default: 2.0)
+        -gtf <filename> (Output de novo transcripts in GTF format)
+            "-o auto" will produce <dir>/transcripts.txt and <dir>/transcripts.gtf
+
diff -r d27851e0cbbd -r f0b5827b6051 makeTagDirectory.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/makeTagDirectory.py Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,94 @@
+"""
+
+
+"""
+import re
+import os
+import sys
+import subprocess
+import optparse
+import shutil
+import tempfile
+
+def getFileString(fpath, outpath):
+    """
+    format a nice file size string
+    """
+    size = ''
+    fp = os.path.join(outpath, fpath)
+    s = '? ?'
+    if os.path.isfile(fp):
+        n = float(os.path.getsize(fp))
+        if n > 2**20:
+            size = ' (%1.1f MB)' % (n/2**20)
+        elif n > 2**10:
+            size = ' (%1.1f KB)' % (n/2**10)
+        elif n > 0:
+            size = ' (%d B)' % (int(n))
+        s = '%s %s' % (fpath, size)
+    return s
+
+class makeTagDirectory():
+    """wrapper
+    """
+
+    def __init__(self,opts=None, args=None):
+        self.opts = opts
+        self.args = args
+
+    def run_makeTagDirectory(self):
+        """
+        makeTagDirectory [options] [alignment file 2]
+
+        """
+        # BAM input is passed without -format; every other format is named explicitly
+        if self.opts.format != "bam":
+            cl = [self.opts.executable] + self.args + ["-format" , self.opts.format]
+        else:
+            cl = [self.opts.executable] + self.args
+        print(cl)
+        p = subprocess.Popen(cl)
+        retval = p.wait()
+
+        # the first positional argument is the tag directory
+        html = self.gen_html(self.args[0])
+        #html = self.gen_html()
+        return html,retval
+
+    def gen_html(self, dr=None):
+        """ add a list of all files in the tagdirectory
+        """
+        # resolve the default at call time, not when the class is defined
+        if dr is None:
+            dr = os.getcwd()
+        flist = os.listdir(dr)
+        print(flist)
+        res = ['Files created by makeTagDirectory\n']
+
+        flist.sort()
+        for i,f in enumerate(flist):
+            # test entries relative to dr, not the current working directory
+            if not(os.path.isdir(os.path.join(dr, f))):
+                fn = os.path.split(f)[-1]
+                res.append('%s\n' % getFileString(fn, dr))
+
+        res.append('\n')
+
+        return res
+
+if __name__ == '__main__':
+    op = optparse.OptionParser()
+    op.add_option('-e', '--executable', default='makeTagDirectory')
+    op.add_option('-o', '--htmloutput', default=None)
+    op.add_option('-f', '--format', default="sam")
+    opts, args = op.parse_args()
+    #assert os.path.isfile(opts.executable),'## makeTagDirectory.py error - cannot find executable %s' % opts.executable
+
+    #if not os.path.exists(opts.outputdir):
+        #os.makedirs(opts.outputdir)
+    f = makeTagDirectory(opts, args)
+
+    html,retval = f.run_makeTagDirectory()
+    f = open(opts.htmloutput, 'w')
+    f.write(''.join(html))
+    f.close()
+    if retval != 0:
+        sys.stderr.write('makeTagDirectory exited with status %d\n' % retval) # indicate failure
+
diff -r d27851e0cbbd -r f0b5827b6051 makeTagDirectory.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/makeTagDirectory.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,146 @@
+
+        homer
+
+    Simple wrapper for makeTagDirectory. Used by findPeaks
+
+        makeTagDirectory.py ${tagDir.files_path}
+        #for $alignF in $alignmentFiles
+        $alignF.file -f $alignF.file.ext
+        #end for
+        -o $tagDir
+        2> $out_log || echo "Error running homer_makeTagDirectory." >&2
+
+    .. class:: infomark
+
+    **Homer makeTagDirectory**
+
+    For more options, look under: "Command line options"
+
+    http://biowhat.ucsd.edu/homer/ngs/tagDir.html
+
+**Parameter list**
+
+Command line options (not all of them are supported)::
+
+    Usage: makeTagDirectory <directory> <alignment file 1> [file 2] ... [options]
+
+    Creates a platform-independent 'tag directory' for later analysis.
+    Currently BED, eland, bowtie, and sam files are accepted. The program will try to
+    automatically detect the alignment format if not specified.
Program will also
+    unzip *.gz, *.bz2, and *.zip files and convert *.bam to sam files on the fly
+    Existing tag directories can be added or combined to make a new one using -d/-t
+    If more than one format is needed and the program cannot auto-detect it properly,
+    make separate tag directories by running the program separately, then combine them.
+    To perform QC/manipulations on an existing tag directory, add "-update"
+
+    Options:
+        -fragLength <# | given> (Set estimated fragment length - given: use read lengths)
+            By default treats the sample as a single read ChIP-Seq experiment
+        -format <X> where X can be: (with column specifications underneath)
+            bed - BED format files:
+                (1:chr,2:start,3:end,4:+/- or read name,5:# tags,6:+/-)
+                -force5th (5th column of BED file contains # of reads mapping to position)
+            sam - SAM formatted files (use samTools to convert BAMs into SAM if you have BAM)
+                -unique (keep if there is a single best alignment based on mapq)
+                -mapq <#> (Minimum mapq for -unique, default: 10, set negative to use AS:i:/XS:i:)
+                -keepOne (keep one of the best alignments even if others exist)
+                -keepAll (include all alignments in SAM file)
+                -mis (Maximum allowed mismatches, default: no limit, uses MD:Z: tag)
+            bowtie - output from bowtie (run with --best -k 2 options)
+                (1:read name,2:+/-,3:chr,4:position,5:seq,6:quality,7:NA,8:misInfo)
+            eland_result - output from basic eland
+                (1:read name,2:seq,3:code,4:#zeroMM,5:#oneMM,6:#twoMM,7:chr,
+                8:position,9:F/R,10-:mismatches)
+            eland_export - output from illumina pipeline (22 columns total)
+                (1-5:read name info,9:sequence,10:quality,11:chr,13:position,14:strand)
+            eland_extended - output from illumina pipeline (4 columns total)
+                (1:read name,2:sequence,3:match stats,4:positions[,])
+            mCpGbed - encode style mCpG reporting in extended BED format, no auto-detect
+                (1:chr,2:start,3:end,4:name,5:,6:+/-,7:,8:,9:,10:#C,11:#mC)
+            allC - Lister style output files detailing the read information about all cytosines
+                (1:chr,2:pos,3:strand,4:context,#mC,#totalC,#C)
+            -minCounts <#> (minimum number of reads to report mC/C ratios, default: 10)
+            -mCcontext <CG|CHG|CHH|all> (only use C's in this context, default: CG)
+            HiCsummary - minimal paired-end read mapping information
+                (1:readname,2:chr1,3:5'pos1,4:strand1,5:chr2,6:5'pos2,7:strand2)
+        -force5th (5th column of BED file contains # of reads mapping to position)
+        -d <tag directory> [tag directory 2] ... (add Tag directory to new tag directory)
+        -t <tag file> [tag file 2] ... (add tag file i.e. *.tags.tsv to new tag directory)
+        -single (Create a single tags.tsv file for all "chromosomes" - i.e. if >100 chromosomes)
+        -update (Use current tag directory for QC/processing, do not parse new alignment files)
+        -tbp <#> (Maximum tags per bp, default: no maximum)
+        -precision <1|2|3> (number of decimal places to use for tag totals, default: 1)
+
+    GC-bias options:
+        -genome <genome version> (To see available genomes, use "-genome list")
+            -or- (for custom genomes):
+        -genome <path-to-FASTA file or directory of FASTA files>
+
+        -checkGC (check Sequence bias, requires "-genome")
+            -freqStart <#> (offset to start calculating frequency, default: -50)
+            -freqEnd <#> (distance past fragment length to calculate frequency, default: +50)
+            -oligoStart <#> (oligo bias start)
+            -oligoEnd <#> (oligo bias end)
+        -normGC <target GC profile file> (i.e. tagGCcontent.txt file from control experiment)
+            Use "-normGC default" to match the genomic GC distribution
+        -normFixedOligo <oligoFreqFile> (normalize 5' end bias, "-normFixedOligo default" ok)
+        -minNormRatio <#> (Minimum deflation ratio of tag counts, default: 0.25)
+        -maxNormRatio <#> (Maximum inflation ratio of tag counts, default: 2.0)
+        -iterNorm <#> (Sets -max/minNormRatio to 1 and 0, iteratively normalizes such that the
+            resulting distribution is no more than #% different than target, i.e. 0.1,default: off)
+
+    Paired-end/HiC options
+        -illuminaPE (when matching PE reads, assumes last character of read name is 0 or 1)
+        -removePEbg (remove paired end tags within 1.5x fragment length on same chr)
+            -PEbgLength <#> (remove PE reads facing one another within this distance, default: 1.5x fragLen)
+        -restrictionSite <seq> (i.e. AAGCTT for HindIII, assign data < 1.5x fragment length to sites)
+            Must specify genome sequence directory too. (-rsmis <#> to specify mismatches, def: 0)
+            -both, -one, -onlyOne, -none (Keeps reads near restriction sites, default: keep all)
+            -removeSelfLigation (removes reads linking same restriction fragment)
+            -removeRestrictionEnds (removes reads starting on a restriction fragment)
+            -assignMidPoint (will place reads in the middle of HindIII fragments)
+            -restrictionSiteLength <#> (maximum distance from restriction site, default: 1.5x fragLen)
+        -removeSpikes <size bp> <#> (remove tags from regions with more than # times
+            the average tags per size bp, suggest "-removeSpikes 10000 5")
+
diff -r d27851e0cbbd -r f0b5827b6051 pos2bed.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/pos2bed.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,37 @@
+
+        homer
+
+        pos2bed.pl $input_peak 1> $out_bed
+        2> $out_log || echo "Error running pos2bed." >&2
+
+    .. class:: infomark
+
+    Converts: homer peak positions -(to)-> BED format
+
+    **Homer pos2bed.pl**
+
+    http://biowhat.ucsd.edu/homer/ngs/miscellaneous.html
+
diff -r d27851e0cbbd -r f0b5827b6051 tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml Thu Dec 20 18:28:03 2012 -0500
@@ -0,0 +1,24 @@
+
+                http://biowhat.ucsd.edu/homer/configureHomer.pl
+                perl ./configureHomer.pl -install
+
+                    ./
+                    $INSTALL_DIR
+
+                    $INSTALL_DIR/bin
+
+        I'm sorry but this does not work
+
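
The README in this changeset says the HOMER tools must be accessible from the command line, and the wrappers call `annotatePeaks.pl`, `bed2pos.pl`, `pos2bed.pl`, `findPeaks` and `makeTagDirectory` directly. A minimal pre-flight check could look like the sketch below. It assumes a POSIX-like system where the tools are installed as executable files on `PATH`. The names `HOMER_TOOLS` and `missing_tools` are illustrative and are not part of the wrapper.

```python
import shutil

# Executables invoked by the wrappers in this changeset.
HOMER_TOOLS = [
    "annotatePeaks.pl",
    "bed2pos.pl",
    "pos2bed.pl",
    "findPeaks",
    "makeTagDirectory",
]


def missing_tools(tools=HOMER_TOOLS, path=None):
    """Return the tools that cannot be found as executables.

    `path` is a search path string in the same format as $PATH;
    None means the current environment's PATH is searched.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All HOMER executables used by the wrappers were found.")
```

Running a check like this before installing the wrappers reports which HOMER commands still need to be added to `PATH`, which is worth doing because the bundled tool_dependencies.xml is marked as not working.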