annotate README.md @ 20:5a05925340b0 draft

Uploaded
author petr-novak
date Mon, 07 Jun 2021 08:46:07 +0000
parents e320ef2d105a
children
# RepeatExplorer utilities #

This repository includes utilities for preprocessing NGS data into a format suitable for RepeatExplorer and TAREAN
analysis. Each tool also includes an XML file that defines the tool interface for the Galaxy environment.

## Available tools ##

### Paired fastq reads filtering and interlacing ###
tool definition file: `paired_fastq_filtering.xml`

This tool performs memory-efficient preprocessing of two fastq files. Its output can be used as input for RepeatExplorer clustering. Input files can be gzip-compressed (.gz extension). Reads are filtered based on quality, the presence of N bases, and adapters. The two input fastq files are processed in parallel, and only complete pairs are kept. Because the input files are processed in chunks, paired reads must be complete and in the same order in both input files. All reads that pass the quality filter will be written to the output files. If sampling is specified, only a sample of the sequences will be returned. Cutadapt is run with these options:

```
--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
--anywhere='CAAGCAGAAGACGGCATACGAGAT'
--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
--error-rate=0.05
--times=1 --overlap=15 --discard
```
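
The option list above can be assembled into a full command line. The following is a minimal sketch in Python, not code from this repository: the helper name `cutadapt_command`, the file names, and the use of `-o` for the output file are assumptions made for illustration. The adapter sequences and option values are copied from the list above.

```python
# Adapter sequences copied from the option list above.
ADAPTERS = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "ATCTCGTATGCCGTCTTCTGCTTG",
    "CAAGCAGAAGACGGCATACGAGAT",
    "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]


def cutadapt_command(input_fastq, output_fastq):
    """Build the argument list for one cutadapt run (hypothetical helper)."""
    cmd = ["cutadapt"]
    cmd += [f"--anywhere={adapter}" for adapter in ADAPTERS]
    cmd += ["--error-rate=0.05", "--times=1", "--overlap=15", "--discard"]
    # Assumption: output is written with -o and the input file is positional.
    cmd += ["-o", output_fastq, input_fastq]
    return cmd


cmd = cutadapt_command("reads_1.fastq", "reads_1.filtered.fastq")
```

With `--discard`, reads in which an adapter is found are dropped rather than trimmed, which is why the pipeline has to re-check for complete pairs after this step.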

Order of fastq file processing:

1. Trimming (optional)
2. Filter by quality
3. Discard single reads, keep complete pairs
4. Cutadapt filtering
5. Discard single reads, keep complete pairs
6. Sampling (optional)
7. Interlacing two fasta files
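
The core of these steps (quality filtering, keeping only complete pairs, and interlacing) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's implementation: the function names, the `min_quality` threshold, and the Phred+33 offset are hypothetical, and trimming, cutadapt filtering, and sampling are left out.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) from a stream with 4 lines per record."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        name, seq, _, qual = (line.rstrip("\n") for line in record)
        yield name[1:], seq, qual  # drop the leading "@"


def passes_filter(seq, qual, min_quality=10, phred_offset=33):
    """Hypothetical filter: no N bases and every base at or above min_quality."""
    if "N" in seq.upper():
        return False
    return all(ord(q) - phred_offset >= min_quality for q in qual)


def interlace_pairs(handle1, handle2, out):
    """Write pairs where both mates pass as interlaced fasta; return pair count."""
    kept = 0
    # zip() walks both files in step, so the read order must match in both.
    for (n1, s1, q1), (n2, s2, q2) in zip(read_fastq(handle1), read_fastq(handle2)):
        if passes_filter(s1, q1) and passes_filter(s2, q2):
            out.write(f">{n1}\n{s1}\n>{n2}\n{s2}\n")
            kept += 1
    return kept


# Small in-memory example: the second pair is dropped because one mate has an N.
fq1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACNT\n+\nIIII\n")
fq2 = io.StringIO("@r1/2\nTTGG\n+\nIIII\n@r2/2\nGGCC\n+\nIIII\n")
out = io.StringIO()
kept = interlace_pairs(fq1, fq2, out)
```

Reading both files record by record keeps memory use independent of file size, which mirrors the chunked processing described above.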


### Single fastq reads filtering ###
tool definition file: `single_fastq_filtering.xml`

This tool performs preprocessing
of a single fastq file. The input file can be gzip-compressed (.gz extension). Reads
are filtered based on quality, the presence of N bases, and adapters. All reads
that pass the quality filter will be written to the output file. If sampling is
specified, only a sample of the sequences will be returned.
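
How the tool draws its sample is not described here. As a rough illustration only, the sketch below picks a fixed number of reads at random without replacement and preserves their original order; the function name `sample_reads` and the `seed` parameter are hypothetical.

```python
import random


def sample_reads(records, sample_size, seed=0):
    """Return sample_size records chosen without replacement, in input order."""
    records = list(records)
    if sample_size >= len(records):
        # Fewer reads than requested: return everything.
        return records
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    chosen = sorted(rng.sample(range(len(records)), sample_size))
    return [records[i] for i in chosen]


reads = [f"read{i}" for i in range(10)]
picked = sample_reads(reads, 3, seed=1)
```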

### Fasta affixer ###
tool definition file: `fasta_affixer.xml`

Tool for adding a prefix and suffix to sequence names in fasta-formatted sequences. This tool is useful
if you want to do comparative analysis with RepeatExplorer and need to
add sample codes to sequence identifiers.
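
The renaming itself is simple to picture. The sketch below is a minimal Python illustration, not the code behind `fasta_affixer.xml`: the function name `affix_fasta` is hypothetical, and treating only the text before the first space as the sequence name is an assumption.

```python
def affix_fasta(lines, prefix="", suffix=""):
    """Yield fasta lines with prefix/suffix added to each sequence name."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the name ends at the first space; any description
            # after it is kept unchanged.
            name, sep, description = line[1:].partition(" ")
            yield f">{prefix}{name}{suffix}{sep}{description}"
        else:
            yield line


fasta = [">read1 sample A\n", "ACGT\n", ">read2\n", "TTGG\n"]
renamed = list(affix_fasta(fasta, prefix="SP1_", suffix="_x"))
```

Here `SP1_` stands in for a sample code, so reads from different samples stay distinguishable after they are combined for comparative analysis.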

### ChIP-Seq-mapper ###


Analysis of NGS sequences from chromatin immunoprecipitation. ChIP and Input reads are mapped to contigs obtained from graph-based repetitive sequence clustering to identify enriched repeats. This method was used by Neumann et al. (2012) to identify repetitive sequences associated with centromeric regions.
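
To make the idea of enrichment concrete, the sketch below computes a ChIP/Input ratio per cluster from lists of per-read cluster assignments. It is an illustration under assumptions and is not taken from `ChipSeqRatioAnalysis.py`: the function name, the normalization by total hit count, and the skipping of clusters without Input hits are all hypothetical choices.

```python
from collections import Counter


def chip_input_ratio(chip_hits, input_hits):
    """Return {cluster: ChIP fraction / Input fraction} (hypothetical scheme)."""
    chip = Counter(chip_hits)
    inp = Counter(input_hits)
    chip_total = sum(chip.values())
    input_total = sum(inp.values())
    if chip_total == 0 or input_total == 0:
        raise ValueError("both ChIP and Input need at least one hit")
    ratios = {}
    # Only clusters with Input hits are reported, to avoid division by zero.
    for cluster, input_count in inp.items():
        chip_fraction = chip[cluster] / chip_total
        input_fraction = input_count / input_total
        ratios[cluster] = chip_fraction / input_fraction
    return ratios


# Each list entry is the cluster that one mapped read was assigned to.
chip_hits = ["CL1"] * 6 + ["CL2"] * 4
input_hits = ["CL1"] * 2 + ["CL2"] * 8
ratios = chip_input_ratio(chip_hits, input_hits)
```

A ratio above 1 means the cluster makes up a larger share of ChIP reads than of Input reads, which is what marks a repeat as a candidate for enrichment.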

#### Authors ####
Petr Novak, Jiri Macas, Pavel Neumann, Georg Hermanutz

Biology Centre CAS, Czech Republic


#### Installation and dependencies ####

ChIP-Seq-mapper requires NCBI BLAST, the R programming language with the R2HTML and base64 packages installed, and python3.

#### Usage ####

```
ChipSeqRatioAnalysis.py [-h] [-m MAX_CL] [-n NPROC] -c CHIPSEQ -i
                        INPUTSEQ [-o OUTPUT] [-ht HTML] [-t THRESHOLD]
                        -k CONTIGS

optional arguments:
  -h, --help            show this help message and exit
  -m MAX_CL, --max_cl MAX_CL
                        Sets the maximum cluster number. Default = 200
  -n NPROC, --nproc NPROC
                        Sets the number of cpus to be used. Default = all
                        available
  -c CHIPSEQ, --ChipSeq CHIPSEQ
                        Fasta file containing the Chip Sequences
  -i INPUTSEQ, --InputSeq INPUTSEQ
                        Fasta file containing the Input Sequences
  -o OUTPUT, --output OUTPUT
                        Specify a name for the CSV file to which the output
                        will be saved to. Default: ChipSeqRatio.csv
  -ht HTML, --html HTML
                        Specify a name for the html report. Default :
                        ChipSeqRatioReport
  -t THRESHOLD, --threshold THRESHOLD
                        Optional plot filter. Default: mean ratio between
                        Input hits and Chip hits.
  -k CONTIGS, --Contigs CONTIGS
                        Contig file for blast
```
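
A typical call supplies the three required files (`-c`, `-i`, `-k`) and leaves the rest at their defaults. The sketch below builds such a command line in Python; the helper name `chipseq_command`, the fasta file names, and launching the script through `python3` are assumptions for illustration, while the flags and defaults come from the usage text above.

```python
import shlex


def chipseq_command(chip_fasta, input_fasta, contigs_fasta,
                    output_csv="ChipSeqRatio.csv", max_cl=200):
    """Build the argument list for ChipSeqRatioAnalysis.py (hypothetical helper)."""
    return [
        "python3", "ChipSeqRatioAnalysis.py",
        "-c", chip_fasta,      # ChIP reads (fasta)
        "-i", input_fasta,     # Input reads (fasta)
        "-k", contigs_fasta,   # contigs used as the blast database
        "-o", output_csv,
        "-m", str(max_cl),
    ]


cmd = chipseq_command("chip.fasta", "input.fasta", "contigs.fasta")
# Render the list as a shell command; pass `cmd` to subprocess.run() to execute it.
command_line = shlex.join(cmd)
```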

#### References ####
[PLoS Genet. Epub 2012 Jun 21. Stretching the rules: monocentric chromosomes with multiple centromere domains. Neumann P, Navrátilová A, Schroeder-Reiter E, Koblížková A, Steinbauerová V, Chocholová E, Novák P, Wanner G, Macas J.](http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1002777)


## Dependencies ##

- R programming environment with installed packages *optparse* and *ShortRead* (Bioconductor)
- python3
- cutadapt

## License ##

Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/)
Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.