comparison README.md @ 0:6e75a84e9338 draft

planemo upload commit e96b43f96afce6a7b7dfd4499933aad7d05c955e-dirty
author thondeboer
date Tue, 15 May 2018 02:39:53 -0400 (2018-05-15)
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:6e75a84e9338
1 # neat-genreads
2 NEAT-genReads is a fine-grained read simulator. GenReads simulates real-looking data using models learned from specific datasets. There are several supporting utilities for generating models used for simulation.
3
4 This is an in-progress v2.0 of the software. For a previous stable release please see: [genReads1](https://github.com/zstephens/genReads1)
5
6
7 Table of Contents
8 =================
9
10 * [neat-genreads](#neat-genreads)
11 * [Table of Contents](#table-of-contents)
12 * [Requirements](#requirements)
13 * [Usage](#usage)
14 * [Functionality](#functionality)
15 * [Examples](#examples)
16 * [Whole genome simulation](#whole-genome-simulation)
17 * [Targeted region simulation](#targeted-region-simulation)
18 * [Insert specific variants](#insert-specific-variants)
19 * [Single end reads](#single-end-reads)
20 * [Large single end reads](#large-single-end-reads)
21 * [Parallelizing simulation](#parallelizing-simulation)
22 * [Utilities](#utilities)
23 * [computeGC.py](#computegcpy)
24 * [computeFraglen.py](#computefraglenpy)
25 * [genMutModel.py](#genmutmodelpy)
26 * [genSeqErrorModel.py](#genseqerrormodelpy)
27 * [plotMutModel.py](#plotmutmodelpy)
28 * [vcf_compare_OLD.py](#vcf_compare_oldpy)
29 * [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
30
31
32
33
34 ## Requirements
35
36 * Python 2.7
37 * Numpy 1.9.1+
38
39 ## Usage
40 Here's the simplest invocation of genReads using default parameters. This command produces a single ended fastq file with reads of length 101, ploidy 2, coverage 10X, using the default sequencing substitution, GC% bias, and mutation rate models.
41
42 ```
43 python genReads.py -r ref.fa -R 101 -o simulated_data
44 ```
45
46 The most commonly added options are --pe, --bam, --vcf, and -c.
47
48
49 Option | Description
50 ------ |:----------
51 -h, --help | Displays usage information
52 -r <str> | Reference sequence file in fasta format. A reference index (.fai) will be created if one is not found in the directory of the reference as [reference filename].fai. Required. The index can be created using samtools faidx.
53 -R <int> | Read length. Required.
54 -o <str> | Output prefix. Use this option to specify where and what to call output files. Required
55 -c <float> | Average coverage across the entire dataset. Default: 10
56 -e <str> | Sequencing error model pickle file
57 -E <float> | Average sequencing error rate. The sequencing error rate model is rescaled to make this the average value.
58 -p <int> | ploidy [2]
59 -t <str> | bed file containing targeted regions; default coverage for targeted regions is 98% of -c option; default coverage outside targeted regions is 2% of -c option
60 -to <float> | off-target coverage scalar [0.02]
61 -m <str> | mutation model pickle file
62 -M <float> | Average mutation rate. The mutation rate model is rescaled to make this the average value. Must be between 0 and 0.3. These random mutations are inserted in addition to the once specified in the -v option.
63 -s <str> | input sample model
64 -v <str> | Input VCF file. Variants from this VCF will be inserted into the simulated sequence with 100% certainty.
65 --pe <int> <int> | Paired-end fragment length mean and standard deviation. To produce paired end data, one of --pe or --pe-model must be specified.
66 --pe-model <str> | Empirical fragment length distribution. Can be generated using [computeFraglen.py](#computefraglenpy). To produce paired end data, one of --pe or --pe-model must be specified.
67 --gc-model <str> | Empirical GC coverage bias distribution. Can be generated using [computeGC.py](#computegcpy)
68 --job <int> <int>| Jobs IDs for generating reads in parallel
69 --nnr | save non-N ref regions (for parallel jobs)
70 --bam | Output golden BAM file
71 --vcf | Output golden VCF file
72 --rng <int> | rng seed value; identical RNG value should produce identical runs of the program, so things like read locations, variant positions, error positions, etc, should all be the same.
73 --gz | Gzip output FQ and VCF
74 --no-fastq | Bypass generation of FASTQ read files
75
76
77 ## Functionality
78
79 ![Diagram describing the way that genReads simulates datasets](docs/flow_new.png "Diagram describing the way that genReads simulates datasets")
80
81 NEAT genReads produces simulated sequencing datasets. It creates FASTQ files with reads sampled from a provided reference genome, using sequencing error rates and mutation rates learned from real sequencing data. The strength of genReads lies in the ability for the user to customize many sequencing parameters, produce 'golden', true positive datasets, and produce types of data that other simulators cannot (e.g. tumour/normal data).
82
83 Features:
84
85 - Simulate single-end and paired-end reads
86 - Custom read length
87 - Can introduce random mutations and/or mutations from a VCF file
88 - Supported mutation types include SNPs, indels (of any length), inversions, translocations, duplications
89 - Can emulate multi-ploid heterozygosity for SNPs and small indels
90 - Can simulate targeted sequencing via BED input specifying regions to sample from
91 - Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
92 - Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution using utilities/computeFraglen.py
93 - Simulates quality scores using either the default model or empirically learned quality scores using utilities/fastq_to_qscoreModel.py
94 - Introduces sequencing substitution errors using either the default model or empirically learned from utilities/
95 - Accounts for GC% coverage bias using model learned from utilities/computeGC.py
96 - Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
97 - Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
98 - Create paired tumour/normal datasets using characteristics learned from real tumour data
99 - Parallelized. Can run multiple "partial" simulations in parallel and merge results
100 - Low memory footprint. Constant (proportional to the size of the reference sequence)
101
102 ## Examples
103
104 The following commands are examples for common types of data to be generated. The simulation uses a reference genome in fasta format to generate reads of 126 bases with default 10X coverage. Outputs paired fastq files, a BAM file and a VCF file. The random variants inserted into the sequence will be present in the VCF and all of the reads will show their proper alignment in the BAM. Unless specified, the simulator will also insert some "sequencing error" -- random variants in some reads that represents false positive results from sequencing.
105
106 ### Whole genome simulation
107 Simulate whole genome dataset with random variants inserted according to the default model.
108
109 ```
110 python genReads.py \
111 -r hg19.fa \
112 -R 126 \
113 -o /home/me/simulated_reads \
114 --bam \
115 --vcf \
116 --pe 300 30
117 ```
118
119 ### Targeted region simulation
120 Simulate a targeted region of a genome, e.g. exome, with -t.
121
122 ```
123 python genReads.py \
124 -r hg19.fa \
125 -R 126 \
126 -o /home/me/simulated_reads \
127 --bam \
128 --vcf \
129 --pe 300 30 \
130 -t hg19_exome.bed
131 ```
132
133 ### Insert specific variants
134 Simulate a whole genome dataset with only the variants in the provided VCF file using -v and -M.
135
136 ```
137 python genReads.py \
138 -r hg19.fa \
139 -R 126 \
140 -o /home/me/simulated_reads \
141 --bam \
142 --vcf \
143 --pe 300 30 \
144 -v NA12878.vcf \
145 -M 0
146 ```
147
148 ### Single end reads
149 Simulate single-end reads by omitting the --pe option.
150
151 ```
152 python genReads.py \
153 -r hg19.fa \
154 -R 126 \
155 -o /home/me/simulated_reads \
156 --bam \
157 --vcf
158 ```
159
160 ### Large single end reads
161 Simulate PacBio-like reads by providing an error model.
162
163 ```
164 python genReads.py \
165 -r hg19.fa \
166 -R 5000 \
167 -e models/errorModel_pacbio_toy.p \
168 -E 0.10 \
169 -o /home/me/simulated_reads
170 ```
171
172 ### Parallelizing simulation
173 When possible, simulation can be done in parallel via multiple executions with different --job options. The resultant files will then need to be merged using utilities/mergeJobs.py. The following example shows splitting a simulation into 4 separate jobs (which can be run independently):
174
175 ```
176 python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 1 4
177 python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 2 4
178 python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 3 4
179 python genReads.py -r hg19.fa -R 126 -o /home/me/simulated_reads --bam --vcf --job 4 4
180
181 python mergeJobs.py -i /home/me/simulated_reads -o /home/me/simulated_reads_merged -s /path/to/samtools
182 ```
183
184 In future revisions the dependence on SAMtools will be removed.
185
186 To simulate human WGS 50X, try 50 chunks or less.
187
188 # Utilities
189 Several scripts are distributed with genReads that are used to generate the models used for simulation.
190
191 ## computeGC.py
192
193 Computes GC% coverage bias distribution from sample (bedrolls genomecov) data.
194 Takes .genomecov files produced by BEDtools genomeCov (with -d option).
195
196 ```
197 bedtools genomecov
198 -d \
199 -ibam normal.bam \
200 -g reference.fa
201 ```
202
203 ```
204 python computeGC.py \
205 -r reference.fa \
206 -i genomecovfile \
207 -w [sliding window length] \
208 -o /path/to/model.p
209 ```
210
211 ## computeFraglen.py
212
213 Computes empirical fragment length distribution from sample data.
214 Takes SAM file via stdin:
215
216 ./samtools view toy.bam | python computeFraglen.py
217
218 and creates fraglen.p model in working directory.
219
220 ## genMutModel.py
221
222 Takes references genome and TSV file to generate mutation models:
223
224 ```
225 python genMutModel.py \
226 -r hg19.fa \
227 -m inputVariants.tsv \
228 -o /home/me/models.p
229 ```
230
231 Trinucleotides are identified in the reference genome and the variant file. Frequencies of each trinucleotide transition are calculated and output as a pickle (.p) file.
232
233 ## genSeqErrorModel.py
234
235 Generates sequence error model for genReads.py -e option.
236 This script needs revision, to improve the quality-score model eventually, and to include code to learn sequencing errors from pileup data.
237
238 ```
239 python genSeqErrorModel.py \
240 -i input_read1.fq (.gz) / input_read1.sam \
241 -o output.p \
242 -i2 input_read2.fq (.gz) / input_read2.sam \
243 -p input_alignment.pileup \
244 -q quality score offset [33] \
245 -Q maximum quality score [41] \
246 -n maximum number of reads to process [all] \
247 -s number of simulation iterations [1000000] \
248 --plot perform some optional plotting
249 ```
250 ## plotMutModel.py
251
252 Performs plotting and comparison of mutation models generated from genMutModel.py.
253
254 ```
255 python plotMutModel.py \
256 -i model1.p [model2.p] [model3.p]... \
257 -l legend_label1 [legend_label2] [legend_label3]... \
258 -o path/to/pdf_plot_prefix
259 ```
260
261 ## vcf_compare_OLD.py
262
263 Tool for comparing VCF files.
264
265 ```
266 python vcf_compare_OLD.py
267 --version show program's version number and exit \
268 -h, --help show this help message and exit \
269 -r <ref.fa> * Reference Fasta \
270 -g <golden.vcf> * Golden VCF \
271 -w <workflow.vcf> * Workflow VCF \
272 -o <prefix> * Output Prefix \
273 -m <track.bed> Mappability Track \
274 -M <int> Maptrack Min Len \
275 -t <regions.bed> Targetted Regions \
276 -T <int> Min Region Len \
277 -c <int> Coverage Filter Threshold [15] \
278 -a <float> Allele Freq Filter Threshold [0.3] \
279 --vcf-out Output Match/FN/FP variants [False] \
280 --no-plot No plotting [False] \
281 --incl-homs Include homozygous ref calls [False] \
282 --incl-fail Include calls that failed filters [False] \
283 --fast No equivalent variant detection [False]
284 ```
285 Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks
286
287 ### Note on Sensitive Patient Data
288 ICGC's "Access Controlled Data" documention can be found at http://docs.icgc.org/access-controlled-data. To have access to controlled germline data, a DACO must be
289 submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference
290 allele. Controlled data includes unmasked germline alleles.
291
292
293
294