Mercurial > repos > thondeboer > neat_genreads
diff utilities/README.md @ 0:6e75a84e9338 draft
planemo upload commit e96b43f96afce6a7b7dfd4499933aad7d05c955e-dirty
author | thondeboer |
---|---|
date | Tue, 15 May 2018 02:39:53 -0400 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/utilities/README.md Tue May 15 02:39:53 2018 -0400 @@ -0,0 +1,100 @@ +# computeGC.py + +Takes .genomecov files produced by BEDtools genomeCov (with -d option). + +``` +bedtools genomecov + -d \ + -ibam normal.bam \ + -g reference.fa +``` + +``` +python computeGC.py \ + -r reference.fa \ + -i genomecovfile \ + -w [sliding window length] \ + -o /path/to/model.p +``` + +# computeFraglen.py + +Takes SAM file via stdin: + +./samtools view toy.bam | python computeFraglen.py + +and creates fraglen.p model in working directory. + + +# genMutModel.py + +Takes references genome and TSV file to generate mutation models: + +``` +python genMutModel.py \ + -r hg19.fa \ + -m inputVariants.tsv \ + -o /home/me/models.p +``` + +Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file. + +# genSeqErrorModel.py + +Generates sequence error model for genReads.py -e option. + +``` +python genSeqErrorModel.py \ + -i input_read1.fq (.gz) / input_read1.sam \ + -o output.p \ + -i2 input_read2.fq (.gz) / input_read2.sam \ + -p input_alignment.pileup \ + -q quality score offset [33] \ + -Q maximum quality score [41] \ + -n maximum number of reads to process [all] \ + -s number of simulation iterations [1000000] \ + --plot perform some optional plotting +``` + +# plotMutModel.py + +Performs plotting and comparison of mutation models generated from genMutModel.py. + +``` +python plotMutModel.py \ + -i model1.p [model2.p] [model3.p]... \ + -l legend_label1 [legend_label2] [legend_label3]... \ + -o path/to/pdf_plot_prefix +``` + +# vcf_compare_OLD.py + +Tool for comparing VCF files. + +``` +python vcf_compare_OLD.py + --version show program's version number and exit \ + -h, --help show this help message and exit \ + -r <ref.fa> * Reference Fasta \ + -g <golden.vcf> * Golden VCF \ + -w <workflow.vcf> * Workflow VCF \ + -o <prefix> * Output Prefix \ + -m <track.bed> Mappability Track \ + -M <int> Maptrack Min Len \ + -t <regions.bed> Targetted Regions \ + -T <int> Min Region Len \ + -c <int> Coverage Filter Threshold [15] \ + -a <float> Allele Freq Filter Threshold [0.3] \ + --vcf-out Output Match/FN/FP variants [False] \ + --no-plot No plotting [False] \ + --incl-homs Include homozygous ref calls [False] \ + --incl-fail Include calls that failed filters [False] \ + --fast No equivalent variant detection [False] +``` +Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks + +## Controlled Data and Germline-Reference Allele Mismatch Information +ICGC's "Access Controlled Data" documention can be found at http://docs.icgc.org/access-controlled-data. To have access to controlled germline data, a DACO must be +submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference +allele. Controlled data includes unmasked germline alleles. +