comparison utilities/README.md @ 0:6e75a84e9338 draft

planemo upload commit e96b43f96afce6a7b7dfd4499933aad7d05c955e-dirty
author thondeboer
date Tue, 15 May 2018 02:39:53 -0400
-1:000000000000 0:6e75a84e9338
1 # computeGC.py
2
3 Takes .genomecov files produced by bedtools genomecov (with the -d option).
4
5 ```
6 bedtools genomecov \
7 -d \
8 -ibam normal.bam \
9 -g reference.fa
10 ```
11
12 ```
13 python computeGC.py \
14 -r reference.fa \
15 -i genomecovfile \
16 -w [sliding window length] \
17 -o /path/to/model.p
18 ```
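
As a rough illustration of what a GC-bias model relates, the sketch below slides a window along a sequence and averages the per-base coverage (as reported by `bedtools genomecov -d`) for each GC count. It is not the code in computeGC.py; the function name `mean_coverage_by_gc` and the in-memory inputs are assumptions made for this example.

```python
from collections import defaultdict


def mean_coverage_by_gc(sequence, coverage, window):
    """Hypothetical helper: map GC count per window -> mean coverage.

    sequence: reference bases as a string
    coverage: per-base depth values, same length as sequence
    window:   sliding window length (step size of 1)
    """
    if len(sequence) != len(coverage):
        raise ValueError("sequence and coverage must have the same length")

    sequence = sequence.upper()
    per_gc = defaultdict(list)
    for start in range(len(sequence) - window + 1):
        bases = sequence[start:start + window]
        gc_count = bases.count("G") + bases.count("C")
        depth = sum(coverage[start:start + window]) / window
        per_gc[gc_count].append(depth)

    # Average the window depths observed for each GC count.
    return {gc: sum(vals) / len(vals) for gc, vals in per_gc.items()}


model = mean_coverage_by_gc("GGCCAATT", [10, 10, 10, 10, 2, 2, 2, 2], 4)
```

In this toy input the GC-rich windows carry higher coverage than the AT-rich ones, which is the kind of relationship a GC-bias model captures.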
19
20 # computeFraglen.py
21
22 Takes SAM file via stdin:
23
24 ./samtools view toy.bam | python computeFraglen.py
25
26 and creates the fraglen.p model in the working directory.
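
For orientation, a fragment-length distribution can be tallied from the TLEN field (column 9) of SAM records. The sketch below is an assumption about the general approach, not the actual logic of computeFraglen.py; `fragment_length_counts` is a hypothetical name, and counting only properly paired reads with a positive TLEN is a choice made for this example.

```python
from collections import Counter


def fragment_length_counts(sam_lines):
    """Hypothetical helper: count template lengths from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        # 0x2 = read mapped in a proper pair; a positive TLEN counts
        # each fragment once (its mate carries the negative value).
        if flag & 0x2 and tlen > 0:
            counts[tlen] += 1
    return counts


example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t396\t300\tACGT\tIIII",
    "r1\t147\tchr1\t396\t60\t4M\t=\t100\t-300\tACGT\tIIII",
    "r2\t99\tchr1\t200\t60\t4M\t=\t446\t250\tACGT\tIIII",
]
counts = fragment_length_counts(example)
```

Passing `sys.stdin` instead of the example list would read records piped from `samtools view`, as in the command above.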
27
28
29 # genMutModel.py
30
31 Takes a reference genome and a TSV file of variants to generate mutation models:
32
33 ```
34 python genMutModel.py \
35 -r hg19.fa \
36 -m inputVariants.tsv \
37 -o /home/me/models.p
38 ```
39
40 Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file.
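
The idea can be sketched as follows, assuming single-nucleotide variants given as 0-based positions with an alternate base. This is a simplified illustration, not the genMutModel.py implementation; the function names, the input format, and the normalization by reference trinucleotide count are all assumptions.

```python
import pickle
from collections import Counter


def count_trinucleotides(sequence):
    """Count every overlapping 3-mer made only of A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - 2):
        trinuc = sequence[i:i + 3]
        if set(trinuc) <= set("ACGT"):
            counts[trinuc] += 1
    return counts


def trinucleotide_transition_freqs(sequence, snps):
    """Hypothetical helper: {(ref_trinuc, alt_trinuc): frequency}.

    snps: iterable of (0-based position, alternate base).
    Frequency = observed transitions / occurrences of ref_trinuc.
    """
    sequence = sequence.upper()
    ref_counts = count_trinucleotides(sequence)
    transitions = Counter()
    for pos, alt in snps:
        if pos < 1 or pos > len(sequence) - 2:
            continue  # no flanking base on one side
        ref_tri = sequence[pos - 1:pos + 2]
        if ref_tri not in ref_counts:
            continue  # context contains N or another ambiguous base
        alt_tri = ref_tri[0] + alt.upper() + ref_tri[2]
        if alt_tri != ref_tri:
            transitions[(ref_tri, alt_tri)] += 1
    return {k: n / ref_counts[k[0]] for k, n in transitions.items()}


freqs = trinucleotide_transition_freqs("ACGTACGT", [(1, "T")])
restored = pickle.loads(pickle.dumps(freqs))  # round trip through pickle
```

Here the context `ACG` occurs twice in the toy reference and is mutated once to `ATG`, so that transition gets a frequency of 0.5.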
41
42 # genSeqErrorModel.py
43
44 Generates a sequencing error model for the genReads.py -e option.
45
46 ```
47 python genSeqErrorModel.py \
48 -i input_read1.fq (.gz) / input_read1.sam \
49 -o output.p \
50 -i2 input_read2.fq (.gz) / input_read2.sam \
51 -p input_alignment.pileup \
52 -q quality score offset [33] \
53 -Q maximum quality score [41] \
54 -n maximum number of reads to process [all] \
55 -s number of simulation iterations [1000000] \
56 --plot perform some optional plotting
57 ```
58
59 # plotMutModel.py
60
61 Performs plotting and comparison of mutation models generated from genMutModel.py.
62
63 ```
64 python plotMutModel.py \
65 -i model1.p [model2.p] [model3.p]... \
66 -l legend_label1 [legend_label2] [legend_label3]... \
67 -o path/to/pdf_plot_prefix
68 ```
69
70 # vcf_compare_OLD.py
71
72 Tool for comparing VCF files.
73
74 ```
75 python vcf_compare_OLD.py \
76 --version show program's version number and exit \
77 -h, --help show this help message and exit \
78 -r <ref.fa> * Reference Fasta \
79 -g <golden.vcf> * Golden VCF \
80 -w <workflow.vcf> * Workflow VCF \
81 -o <prefix> * Output Prefix \
82 -m <track.bed> Mappability Track \
83 -M <int> Maptrack Min Len \
84 -t <regions.bed> Targeted Regions \
85 -T <int> Min Region Len \
86 -c <int> Coverage Filter Threshold [15] \
87 -a <float> Allele Freq Filter Threshold [0.3] \
88 --vcf-out Output Match/FN/FP variants [False] \
89 --no-plot No plotting [False] \
90 --incl-homs Include homozygous ref calls [False] \
91 --incl-fail Include calls that failed filters [False] \
92 --fast No equivalent variant detection [False]
93 ```
94 Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks
95
96 ## Controlled Data and Germline-Reference Allele Mismatch Information
97 ICGC's "Access Controlled Data" documentation can be found at http://docs.icgc.org/access-controlled-data. To access controlled germline data, a DACO application must be
98 submitted. Open tier data can be obtained without DACO approval, but germline alleles that do not match the reference genome are masked and replaced with the reference
99 allele. Controlled data includes unmasked germline alleles.
100