comparison README.md @ 0:a4cd8608ef6b draft

Uploaded
author petr-novak
date Mon, 01 Apr 2019 07:56:36 -0400
parents
children e320ef2d105a
-1:000000000000 0:a4cd8608ef6b
1 # RepeatExplorer utilities #
2
3 This repository includes utilities for preprocessing NGS data into a format suitable for RepeatExplorer and TAREAN
4 analysis. Each tool also includes an XML file that defines the tool interface for the Galaxy environment.
5
6 ## Available tools ##
7
8 ### Paired fastq reads filtering and interlacing ###
9 tool definition file: `paired_fastq_filtering.xml`
10
11 This tool performs memory-efficient preprocessing of two fastq files. Its output can be used as input for RepeatExplorer clustering. Input files can be gzip-compressed (.gz extension). Reads are filtered based on quality, presence of N bases, and adapters. The two input fastq files are processed in parallel, and only complete pairs are kept. Because the input files are processed in chunks, paired reads must be complete and in the same order in both input files. All reads that pass the quality filter will be written to the output files. If sampling is specified, only a sample of the sequences will be returned. Cutadapt is run with these options:
12
13 ```
14 --anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
15 --anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
16 --anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
17 --anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
18 --anywhere='CAAGCAGAAGACGGCATACGAGAT'
19 --anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
20 --error-rate=0.05
21 --times=1 --overlap=15 --discard
22 ```
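
As a minimal sketch, the option set above could be assembled into a cutadapt command line from Python as shown below. The helper name `build_cutadapt_command` and the file names are hypothetical, and passing files with `-o` plus a positional input is ordinary cutadapt usage rather than something this README states about the tool's internals.

```python
import shlex

# Adapter sequences copied from the option listing above.
ADAPTERS = [
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "ATCTCGTATGCCGTCTTCTGCTTG",
    "CAAGCAGAAGACGGCATACGAGAT",
    "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC",
]


def build_cutadapt_command(input_path, output_path):
    """Return the cutadapt argument list (hypothetical helper)."""
    cmd = ["cutadapt"]
    cmd += ["--anywhere=" + adapter for adapter in ADAPTERS]
    cmd += ["--error-rate=0.05", "--times=1", "--overlap=15", "--discard"]
    # Assumption: output via -o and a single positional input file.
    cmd += ["-o", output_path, input_path]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so cutadapt is not required.
    print(shlex.join(build_cutadapt_command("reads.fastq", "reads_clean.fastq")))
```

The resulting list can be handed to `subprocess.run` unchanged, which avoids shell quoting of the adapter sequences.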
23
24 Order of fastq file processing:
25
26 1. Trimming (optional)
27 2. Filter by quality
28 3. Discard single reads, keep complete pairs
29 4. Cutadapt filtering
30 5. Discard single reads, keep complete pairs
31 6. Sampling (optional)
32 7. Interlacing two fasta files
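
The pair-keeping and interlacing steps can be illustrated with a short Python sketch. This is not the tool's implementation: the function names, the quality threshold of 10, and the Phred+33 encoding are assumptions chosen for illustration, and trimming, cutadapt filtering, and sampling are left out.

```python
from itertools import zip_longest


def passes_filter(seq, qual, min_quality=10, offset=33):
    """Hypothetical read filter: no N bases and every base >= min_quality."""
    if "N" in seq.upper():
        return False
    return all(ord(ch) - offset >= min_quality for ch in qual)


def interlace_pairs(reads1, reads2):
    """Yield FASTA lines for pairs in which both mates pass the filter.

    reads1 and reads2 are iterables of (name, sequence, quality) tuples
    given in the same order, as required of the two input files.
    """
    for r1, r2 in zip_longest(reads1, reads2):
        if r1 is None or r2 is None:
            raise ValueError("input files contain different numbers of reads")
        # A pair is kept only if both mates survive; single reads are dropped.
        if passes_filter(r1[1], r1[2]) and passes_filter(r2[1], r2[2]):
            for name, seq, _ in (r1, r2):
                yield ">" + name
                yield seq
```

Because both inputs are consumed in lockstep, only one pair is held in memory at a time, which is the reason the reads must appear in the same order in both files.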
33
34
35 ### Single fastq reads filtering ###
36 tool definition file: `single_fastq_filtering.xml`
37
38 This tool is designed to perform preprocessing
39 of a single fastq file. Input files can be gzip-compressed (.gz extension). Reads
40 are filtered based on quality, presence of N bases, and adapters. All reads
41 that pass the quality filter will be written to the output files. If sampling is
42 specified, only a sample of the sequences will be returned.
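
A rough Python sketch of reading optionally gzip-compressed fastq input and returning a sample of reads follows. All function names and the `seed` parameter are hypothetical, and the sampling strategy shown (loading the reads and drawing without replacement) is an assumption, not a description of how the tool samples.

```python
import gzip
import random


def open_fastq(path):
    """Open a fastq file as text, transparently handling .gz input."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line fastq records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # the '+' separator line
        qual = next(it, "").rstrip("\n")
        yield header[1:], seq, qual


def sample_reads(reads, sample_size, seed=0):
    """Return sample_size randomly chosen reads, or all if there are fewer."""
    reads = list(reads)
    if sample_size >= len(reads):
        return reads
    return random.Random(seed).sample(reads, sample_size)
```

A file handle returned by `open_fastq` can be passed directly to `read_fastq`, since text-mode handles iterate line by line.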
43
44 ### Fasta affixer ###
45 tool definition file: `fasta_affixer.xml`
46
47 Tool for adding a prefix and suffix to sequence names in fasta-formatted sequences. This tool is useful
48 if you want to do comparative analysis with RepeatExplorer and need to
49 add sample codes to sequence identifiers.
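
The renaming can be sketched in a few lines of Python. The function name `affix_fasta` is hypothetical, and applying the affixes only to the identifier (the first whitespace-delimited token of the header) while leaving any description untouched is an assumption about the intended behavior.

```python
def affix_fasta(lines, prefix="", suffix=""):
    """Yield FASTA lines with prefix/suffix added to each sequence name."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the header into identifier and optional description.
            name, sep, description = line[1:].partition(" ")
            yield ">" + prefix + name + suffix + sep + description
        else:
            yield line  # sequence lines pass through unchanged


if __name__ == "__main__":
    records = [">read1 sample text", "ACGT", ">read2", "TTGA"]
    for out_line in affix_fasta(records, prefix="A_", suffix="_s1"):
        print(out_line)
```

With a distinct prefix per sample, reads from several samples can be concatenated into one input file and still be traced back to their origin.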
50
51
52 ## Dependencies ##
53
54 - R programming environment with the packages *optparse* and *ShortRead* (Bioconductor) installed
55 - python3
56 - cutadapt
57
58 ## License ##
59
60 Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
61 Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/)
62 Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic
63
64 This program is free software: you can redistribute it and/or modify
65 it under the terms of the GNU General Public License as published by
66 the Free Software Foundation, either version 3 of the License, or
67 (at your option) any later version.
68
69 This program is distributed in the hope that it will be useful,
70 but WITHOUT ANY WARRANTY; without even the implied warranty of
71 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
72 GNU General Public License for more details.
73 You should have received a copy of the GNU General Public License
74 along with this program. If not, see <http://www.gnu.org/licenses/>.