3
|
1 rod_finder
|
|
2 ==========
|
|
3
|
|
4 A script to find regions of difference (RODs) between a query genome and reference genome(s).
|
|
5
|
|
6 ## Synopsis
|
|
7
|
|
8 blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
|
|
9
|
|
10 or
|
|
11
|
|
12 perl blast_rod_finder.pl -q query.embl -r blastn.out -m 2000
|
|
13
|
|
14 ## Description
|
|
15
|
|
16 This script is intended to identify RODs between a nucleotide query and a nucleotide subject/reference sequence. In order to do so, a *blastn* (http://blast.ncbi.nlm.nih.gov/Blast.cgi) needs to be performed beforehand with the query and the subject sequences (see also *blast_rod_finder_legacy.sh* below). *blast_rod_finder.pl* is mainly designed to work with bacterial genomes, while a query genome can be blasted against several subject sequences to detect RODs over a number of references. Although the results are optimized towards a complete query genome, both the reference(s) as well as the query can be used in draft form. To create artificial genomes via concatenation use *cat_seq.pl* or the EMBOSS application union (http://emboss.sourceforge.net/).
|
|
17
|
|
18 The *blastn* report file, the query sequence file (preferably in RichSeq format, see below) and a minimum size for ROD detection have to be provided. Subsequently, RODs are summarized in a tab-separated summary file, a gff3 (usable e.g. in Artemis/DNAPlotter, http://www.sanger.ac.uk/resources/software/artemis/) and a BRIG (BLAST Ring Image Generator, http://brig.sourceforge.net/) output file. Nucleotide sequences of each ROD are written to a multi-fasta file.
|
|
19
|
|
20 The query sequence can be provided in RichSeq format (embl or genbank), but has to correspond to the fasta file used in querying the BLAST database (the accession numbers have to correspond to the fasta headers). Use *seq_format-converter.pl* to create a corresponding fasta file from embl|genbank files for *blastn* if needed. With RichSeq query files additional info is given in the result summary and the amino acid sequences of all non-pseudo CDSs, which are contained or overlap a ROD, are written to a result file. Furthermore, all detected RODs are saved in individual sequence files in the corresponding query sequence format.
|
|
21
|
|
22 Run *blastn* and the script *blast_rod_finder.pl* consecutively manually or use the bash shell wrapper script *blast_rod_finder_legacy.sh* (see usage below) to perform the pipeline with one command. The same folder has to contain the subject fasta file(s), the query fasta file, optionally the query RichSeq file and the script *blast_rod_finder.pl*! *blastn* is run **without** filtering of query sequences ('-F F') and an evalue cutoff of '2e-11' is set.
|
|
23
|
|
24 ## Usage
|
|
25
|
|
26 ### 1.) Manual consecutively
|
|
27
|
|
28 #### 1.1.) *blastn*
|
|
29
|
|
30 formatdb -p F -i subject.fasta -n blast_db
|
|
31 blastall -p blastn -d blast_db -i query.fasta -o blastn.out -e 2e-11 -F F
|
|
32
|
|
33 #### 1.2.) *blast_rod_finder.pl*
|
|
34
|
|
35 perl blast_rod_finder.pl -q query.[embl|gbk|fasta] -r blastn.out -m 5000
|
|
36
|
|
37 ### 2.) With one command: *blast_rod_finder_legacy.sh* pipeline
|
|
38
|
|
39 blast_rod_finder_legacy.sh subject.fasta query.fasta query.[embl|gbk|fasta] 5000
|
|
40
|
|
41 ## Options for *blast_rod_finder.pl*
|
|
42
|
|
43 ### Mandatory options
|
|
44
|
|
45 * -m, -min
|
|
46
|
|
47 Minimum size of RODs that are reported
|
|
48
|
|
49 * -q, -query
|
|
50
|
|
51 Query sequence file [fasta, embl, or genbank format]
|
|
52
|
|
53 * -r, -report
|
|
54
|
|
55 *blastn* report/output file
|
|
56
|
|
57 ### Optional options
|
|
58
|
|
59 * -h, -help: Help (perldoc POD)
|
|
60
|
|
61 ## Output
|
|
62
|
|
63 ### a.) *blast_rod_finder_legacy.sh* or *blastn*
|
|
64
|
|
65 * *blastn* database files for subject sequence(s)
|
|
66
|
|
67 \*.nhr, \*.nin, \*.nsq
|
|
68
|
|
69 * *blastn* report
|
|
70
|
|
71 Text file named 'blastn.out'
|
|
72
|
|
73 ### b.) *blast_rod_finder.pl*
|
|
74
|
|
75 * ./results
|
|
76
|
|
77 All output files are stored in this result folder
|
|
78
|
|
79 * rod_summary.txt
|
|
80
|
|
81 Summary of detected ROD regions (for embl/genbank queries includes annotation), tab-separated
|
|
82
|
|
83 * rod.gff
|
|
84
|
|
85 GFF3 file with ROD coordinates to use in Artemis/DNAPlotter etc.
|
|
86
|
|
87 * rod_BRIG.txt
|
|
88
|
|
89 ROD coordinates to use in BRIG (BLAST Ring Image Generator), tab-separated
|
|
90
|
|
91 * rod_seq.fasta
|
|
92
|
|
93 Nucleotide sequences of ROD regions (>ROD# size start..stop), multi-fasta
|
|
94
|
|
95 * rod_aa_fasta.txt
|
|
96
|
|
97 Only present if query is in RichSeq format. Amino acid sequences of all CDSs that are contained in or overlap a ROD region in multi-fasta format (>locus_tag gene product). RODs are seperated in the file via '\~\~' (\~\~ROD# size start..stop).
|
|
98
|
|
99 * ROD#.[embl|gbk]
|
|
100
|
|
101 Only present if query is in RichSeq format. Each identified ROD is written to an individual sequence file (in the same format as the query).
|
|
102
|
|
103 ## Run environment
|
|
104
|
|
105 The Perl script runs under Windows and UNIX flavors, the bash-shell script of course only under UNIX.
|
|
106
|
|
107 ## Dependencies (not in the core Perl modules)
|
|
108
|
|
109 * Legacy blast (tested version blastall 2.2.18)
|
|
110 * BioPerl (tested with version 1.006901)
|
|
111
|
|
112 ## Authors/contact
|
|
113
|
|
114 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
|
|
115
|
|
116 David Studholme (original code; D[dot]J[dot]Studholme[at]exeter[dot]ac[dot]uk; University of Exeter)
|
|
117
|
|
118 ## Citation, installation, and license
|
|
119
|
|
120 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
|
|
121
|
|
122 ## Changelog
|
|
123
|
|
124 * v0.4 (13.02.2013)
|
|
125 - included a POD
|
|
126 - options with Getopt::Long
|
|
127 - results directory for output files
|
|
128 - include accession number column for multi-sequence files in 'rod_summary.txt'
|
|
129 - include locus_tags (or alternatively gene, product, note ...) in 'rod_summary.txt'
|
|
130 - feature positions according to leading or lagging strand
|
|
131 - indicate if a primary feature overlaps ROD boundaries
|
|
132 - output each ROD in the query RichSeq format with BioPerl's Bio::SeqUtils
|
|
133 * v0.3 (23.11.2011)
|
|
134 - status messages with autoflush
|
|
135 - BRIG output file
|
|
136 - extended primary tag output for RODs (in addition to CDS): tRNA, rRNA, tmRNA, ncRNA, misc_RNA, repeat_region, misc_binding, and mobile_element
|
|
137 * v0.1 (07.11.2011)
|