# RepeatExplorer utilities #

This repository includes utilities for preprocessing NGS data into a format suitable for RepeatExplorer and TAREAN
analysis. Each tool also includes an XML file that defines the tool interface for the Galaxy environment.

## Available tools ##

### Paired fastq reads filtering and interlacing ###
tool definition file: `paired_fastq_filtering.xml`

This tool performs memory-efficient preprocessing of two fastq files. Its output can be used as input for RepeatExplorer clustering. Input files can be gzip-compressed (.gz extension). Reads are filtered based on quality, presence of N bases and presence of adapters. The two input fastq files are processed in parallel, and only complete pairs are kept. Because the input files are processed in chunks, both files must contain complete pairs with reads in the same order. All reads that pass the filters are written to the output files. If sampling is specified, only a sample of the sequences is returned. Cutadapt is run with these options:

```
--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
--anywhere='CAAGCAGAAGACGGCATACGAGAT'
--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
--error-rate=0.05
--times=1 --overlap=15 --discard
```

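The options above could be assembled into a cutadapt command line roughly as follows. This is a minimal sketch, not the tool's actual code: the helper name `build_cutadapt_command` and the file names are hypothetical, and `-o` (cutadapt's output option) is added here only to make the command complete.

```python
import shlex

# Adapter sequences copied from the option list above.
ADAPTERS = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "ATCTCGTATGCCGTCTTCTGCTTG",
    "CAAGCAGAAGACGGCATACGAGAT",
    "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]


def build_cutadapt_command(input_fastq, output_fastq):
    """Return the cutadapt argument list for one fastq file (hypothetical helper)."""
    cmd = ["cutadapt"]
    # One --anywhere option per adapter sequence.
    cmd += [f"--anywhere={adapter}" for adapter in ADAPTERS]
    # Remaining options exactly as listed above.
    cmd += ["--error-rate=0.05", "--times=1", "--overlap=15", "--discard"]
    # Output and input paths are placeholders chosen for this sketch.
    cmd += ["-o", output_fastq, input_fastq]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so cutadapt is not required here.
    print(shlex.join(build_cutadapt_command("reads_1.fastq", "reads_1.filtered.fastq")))
```
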
Order of fastq file processing:

1. Trimming (optional)
2. Filter by quality
3. Discard single reads, keep complete pairs
4. Cutadapt filtering
5. Discard single reads, keep complete pairs
6. Sampling (optional)
7. Interlacing two fasta files

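The quality filtering, pair completion and interlacing steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names, the quality rule (every base at or above Phred 10, offset 33) and the rejection of any read containing N are all assumptions made for the example. Trimming, cutadapt filtering and sampling are left out.

```python
from itertools import zip_longest


def passes_filter(seq, qual, min_quality=10, phred_offset=33):
    """Assumed rule: reject reads with N bases or any base below min_quality."""
    if "N" in seq.upper():
        return False
    return all(ord(ch) - phred_offset >= min_quality for ch in qual)


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of fastq lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def interlace_pairs(fastq1, fastq2):
    """Yield fasta lines for pairs in which both mates pass the filter.

    Both inputs are read in step, so they must list the pairs in the same order.
    """
    for r1, r2 in zip_longest(read_fastq(fastq1), read_fastq(fastq2)):
        if r1 is None or r2 is None:
            raise ValueError("input files contain different numbers of reads")
        if passes_filter(r1[1], r1[2]) and passes_filter(r2[1], r2[2]):
            for name, seq, _ in (r1, r2):
                yield f">{name}"
                yield seq
```

Because the two files are consumed record by record, memory use stays constant regardless of file size. Any iterable of lines works as input, including handles returned by `open` or `gzip.open(path, "rt")`.
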
### Single fastq reads filtering ###
tool definition file: `single_fastq_filtering.xml`

This tool performs preprocessing of a single fastq file. The input file can be gzip-compressed (.gz extension). Reads
are filtered based on quality, presence of N bases and presence of adapters. All reads
that pass the filters are written to the output file. If sampling is
specified, only a sample of the sequences is returned.

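The optional sampling step could be done in a single pass over the filtered reads. The sketch below uses reservoir sampling, which is a swapped-in technique chosen for illustration; this README does not say how the tool samples. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random from an iterable.

    The input is read once and only k records are held in memory.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a stored record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample
```

If the input holds fewer than `k` records, all of them are returned in their original order.
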
### Fasta affixer ###
tool definition file: `fasta_affixer.xml`

This tool adds a prefix and a suffix to sequence names in fasta-formatted sequences. It is useful
if you want to do comparative analysis with RepeatExplorer and need to
add sample codes to sequence identifiers.

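What the affixer does can be illustrated with a short Python sketch. The function name `affix_fasta` is hypothetical, and it is an assumption of this example that the prefix and suffix are attached to the identifier (the first whitespace-delimited token of the header) while any description is left unchanged.

```python
def affix_fasta(lines, prefix="", suffix=""):
    """Yield fasta lines with prefix and suffix added to each sequence name."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into identifier and optional description.
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            header = f">{prefix}{name}{suffix}"
            if len(parts) > 1:
                header += " " + parts[1]
            yield header
        else:
            # Sequence lines pass through unchanged.
            yield line
```

For example, calling `affix_fasta(handle, prefix="SP1_")` on the reads of one sample marks every read with the sample code `SP1_`, so reads from several samples can be combined for a comparative analysis.
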
### ChIP-Seq-mapper ###

Analysis of NGS sequences from chromatin immunoprecipitation. ChIP and Input reads are mapped to contigs obtained from graph-based repetitive sequence clustering in order to identify enriched repeats. This method was used in (Neumann et al. 2012) for identification of repetitive sequences associated with the centromeric region.

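The idea of comparing ChIP and Input hits per cluster can be sketched as below. This is only an illustration under stated assumptions: the function name, the input dictionaries and the normalization by total read counts are hypothetical and may differ from what `ChipSeqRatioAnalysis.py` computes.

```python
def chip_input_ratio(chip_hits, input_hits, chip_total, input_total):
    """Return {cluster: ChIP/Input ratio} normalized by total read counts.

    chip_hits and input_hits map a cluster id to the number of reads mapped
    to contigs of that cluster. Clusters without Input hits are skipped
    because their ratio is undefined.
    """
    ratios = {}
    for cluster, n_input in input_hits.items():
        if n_input == 0:
            continue
        n_chip = chip_hits.get(cluster, 0)
        # Equivalent to (n_chip / chip_total) / (n_input / input_total).
        ratios[cluster] = (n_chip * input_total) / (n_input * chip_total)
    return ratios
```

Under this assumed normalization, a ratio well above 1 suggests that the repeat is enriched in the ChIP sample relative to Input.
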
#### Authors ####
Petr Novak, Jiri Macas, Pavel Neumann, Georg Hermanutz

Biology Centre CAS, Czech Republic

#### Installation and dependencies ####

ChIP-Seq-mapper requires NCBI BLAST, the R programming language with the R2HTML and base64 packages installed, and python3.

#### Usage ####

```
ChipSeqRatioAnalysis.py [-h] [-m MAX_CL] [-n NPROC] -c CHIPSEQ -i
                        INPUTSEQ [-o OUTPUT] [-ht HTML] [-t THRESHOLD]
                        -k CONTIGS

optional arguments:
  -h, --help            show this help message and exit
  -m MAX_CL, --max_cl MAX_CL
                        Sets the maximum cluster number. Default = 200
  -n NPROC, --nproc NPROC
                        Sets the number of cpus to be used. Default = all
                        available
  -c CHIPSEQ, --ChipSeq CHIPSEQ
                        Fasta file containing the Chip Sequences
  -i INPUTSEQ, --InputSeq INPUTSEQ
                        Fasta file containing the Input Sequences
  -o OUTPUT, --output OUTPUT
                        Specify a name for the CSV file to which the output
                        will be save to. Default: ChipSeqRatio.csv
  -ht HTML, --html HTML
                        Specify a name for the html report. Default :
                        ChipSeqRatioReport
  -t THRESHOLD, --threshold THRESHOLD
                        Optional plot filter. Default: mean ration between
                        Input hits and Chip hits.
  -k CONTIGS, --Contigs CONTIGS
                        Contig file for blast
```

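A typical invocation, using only the options shown in the help text above, might be built like this. The helper name `build_chipseq_command` and all file names are placeholders chosen for this sketch, and it assumes the script is in the current directory.

```python
def build_chipseq_command(chip_fasta, input_fasta, contigs_fasta,
                          output_csv="ChipSeqRatio.csv", max_cl=200):
    """Return an argument list for ChipSeqRatioAnalysis.py (hypothetical helper)."""
    return [
        "python3", "ChipSeqRatioAnalysis.py",
        "-c", chip_fasta,      # ChIP reads (fasta)
        "-i", input_fasta,     # Input reads (fasta)
        "-k", contigs_fasta,   # contigs from the clustering, used for blast
        "-o", output_csv,      # CSV output file
        "-m", str(max_cl),     # maximum cluster number
    ]


if __name__ == "__main__":
    cmd = build_chipseq_command("chip.fasta", "input.fasta", "contigs.fasta")
    # Printed rather than executed; pass cmd to subprocess.run(cmd, check=True) to run it.
    print(" ".join(cmd))
```
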
#### References ####
[PLoS Genet. Epub 2012 Jun 21. Stretching the rules: monocentric chromosomes with multiple centromere domains. Neumann P, Navrátilová A, Schroeder-Reiter E, Koblížková A, Steinbauerová V, Chocholová E, Novák P, Wanner G, Macas J.](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002777)

## Dependencies ##

- R programming environment with installed packages *optparse* and *ShortRead* (Bioconductor)
- python3
- cutadapt

## License ##

Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/)
Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.