annotate COG/bac-genomics-scripts/cds_extractor/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
cds_extractor
=============

`cds_extractor.pl` is a script to extract amino acid or nucleotide sequences from coding sequence (CDS) features in annotated genomes.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
    * [Extract amino acid sequences](#extract-amino-acid-sequences)
    * [Extract nucleotide sequences](#extract-nucleotide-sequences)
    * [UNIX loop to extract sequences from all files in the current working directory](#unix-loop-to-extract-sequences-from-all-files-in-the-current-working-directory)
* [Options](#options)
    * [Mandatory options](#mandatory-options)
    * [Optional options](#optional-options)
* [Output](#output)
* [Dependencies](#dependencies)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl cds_extractor.pl -i seq_file.[embl|gbk] -p

## Description

This script extracts protein or DNA sequences of CDS features from a (multi)-RichSeq file (e.g. EMBL or GENBANK format) and writes them to a multi-FASTA file. The FASTA header of each CDS uses the locus tag as identifier; if that is not available, the protein ID, the gene, or an internal CDS counter is used instead (in this order). The organism info also includes possible plasmid names. Pseudogenes (tagged by **/pseudo**) are not included (except in the CDS counter).

In addition to the identifier, FASTA headers include gene (**g=**), product (**p=**), organism (**o=**), and EC numbers (**ec=**), if these are present for a CDS. Individual EC numbers are separated by **semicolons**. The location/position (**l=** start..stop) of a CDS is always included. If the gene is used as the FASTA header ID, '**g=** gene' is only included with option **-f**.
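
To pull individual fields back out of such headers, a small shell helper may be enough. This is only a sketch: the function name `header_field` and the sample header are hypothetical, and it assumes that the fields are space-separated `tag=value` tokens without internal whitespace, which this README does not spell out. Check a real output file before relying on this layout.

```shell
# Hypothetical helper: print the value of one "tag=value" field of a FASTA header.
# ASSUMPTION: fields are separated by single spaces and contain no whitespace.
header_field() {
    # $1 = tag name (g, p, o, ec or l), $2 = complete header line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Illustrative header only, not copied from real script output
header='>b0002 g=thrA p=aspartate_kinase o=Escherichia_coli l=337..2799'
header_field g "$header"
header_field l "$header"
```

A field that is absent from the header (here **ec=**) simply yields no output, which mirrors the rule that these tags are only written if present for a CDS.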

Fuzzy locations in the feature table of a sequence file are not taken into consideration for **l=**. If you set options **-u** and/or **-d** and the feature location overlaps a **circular** replicon boundary, positions are marked with '<' or '>' in the direction of the exceeded boundary. Features whose locations overlap the boundary of a **linear** sequence (e.g. contigs) are skipped and are **not** included in the output! A CDS feature is on the lagging strand if start > stop in the location. In the special case of overlapping circular sequence boundaries this is reversed.
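
The start/stop rule can be applied to an **l=** value with a few lines of shell. The sketch below is not part of the script: `cds_strand` is a hypothetical name, the label `leading` for the other case is this sketch's own choice, and the reversed circular-boundary case is deliberately not handled.

```shell
# Hypothetical helper: classify an "l=" value of the form start..stop.
# '<' and '>' boundary markers are stripped before the comparison.
# NOT handled: the reversed special case at circular sequence boundaries.
cds_strand() {
    loc=$(printf '%s' "$1" | tr -d '<>')
    start=${loc%%..*}    # text before the first ".."
    stop=${loc##*..}     # text after the last ".."
    if [ "$start" -gt "$stop" ]; then
        echo lagging
    else
        echo leading     # label chosen for this sketch
    fi
}

cds_strand 337..2799
cds_strand 5683..5234
```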

Of course, the **l=** positions are separate for each sequence in a multi-sequence file. Thus, if you want continuous positions for the CDSs, run these files first through [`cat_seq.pl`](/cat_seq).

Optionally, a file with locus tags can be given with option **-l** to extract only these CDS features (each locus tag on a new line).
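
Such a list is a plain text file. A minimal way to create one from the shell is sketched below; the three locus tags are examples only, and the file name merely follows the usage examples in this README.

```shell
# Write one locus tag per line; tags and file name are examples only.
printf '%s\n' b0001 b0002 b0003 > locus_tags.txt

# Sanity check: count the non-empty lines in the list
grep -c . locus_tags.txt

# The list would then be passed with option -l, e.g.:
#   perl cds_extractor.pl -i Ecoli_MG1655.gbk -p -l locus_tags.txt
```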

## Usage

### Extract amino acid sequences

    perl cds_extractor.pl -i Ecoli_MG1655.gbk -p [-l locus_tags.txt -c MG1655 -f]

### Extract nucleotide sequences

    perl cds_extractor.pl -i Banthracis_Ames.embl -n [-l locus_tags.txt -u 100 -d 20 -c Ames -f]

### UNIX loop to extract sequences from all files in the current working directory

    for file in *.embl; do perl cds_extractor.pl -i "$file" -p [-l locus_tags.txt]; done
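
If a directory mixes EMBL and GENBANK files, the loop can be wrapped in a small function. The following is a sketch under assumptions: `extract_all` and the `CDS_EXTRACTOR` variable are hypothetical names introduced here, and the failure message relies on the script returning a non-zero exit status on fatal errors, which this README does not confirm.

```shell
# Hypothetical wrapper around the loop above. CDS_EXTRACTOR is an assumed
# variable so the command can be swapped out (e.g. for a dry run).
extract_all() {
    # $1 = directory to scan (defaults to the current working directory)
    dir=${1:-.}
    for file in "$dir"/*.embl "$dir"/*.gbk; do
        [ -e "$file" ] || continue    # unmatched glob pattern, nothing to do
        # ASSUMPTION: a fatal error results in a non-zero exit status
        ${CDS_EXTRACTOR:-perl cds_extractor.pl} -i "$file" -p ||
            echo "extraction failed for $file" >&2
    done
}

# Dry run: print the commands instead of executing them
CDS_EXTRACTOR="echo would run:" extract_all .
```

Leaving `CDS_EXTRACTOR` unset runs `perl cds_extractor.pl` as in the one-line loop.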

## Options

### Mandatory options

* **-i**=_str_, **-input**=_str_

Input RichSeq sequence file including CDS annotation (e.g. EMBL or GENBANK format)

* **-p**, **-protein**

Extract **protein** sequence for each CDS feature, excludes option **-n**

**or**

* **-n**, **-nucleotide**

Extract **nucleotide** sequence for each CDS feature, excludes option **-p**

### Optional options

* **-h**, **-help**

Help (perldoc POD)

* **-u**=_int_, **-upstream**=_int_

Include the given number of flanking nucleotides upstream of each CDS feature, forces option **-n**

* **-d**=_int_, **-downstream**=_int_

Include the given number of flanking nucleotides downstream of each CDS feature, forces option **-n**

* **-c**=_str_, **-cds_prefix**=_str_

Prefix for the internal CDS counter [default = 'CDS']

* **-l**=_str_, **-locustag_list**=_str_

List of locus tags to extract only those (each locus tag on a new line)

* **-f**, **-full_header**

If the gene is used as ID, additionally include '**g=** gene' in FASTA headers, so downstream analyses can recognize the gene tag (e.g. [`prot_finder.pl`](/prot_finder)).

* **-v**, **-version**

Print version number to *STDERR*

## Output

* \*.faa

Multi-FASTA file of CDS protein sequences

**or**

* \*.ffn

Multi-FASTA file of CDS DNA sequences

* (no_annotation_err.txt)

Lists input files missing CDS annotation; for these the script exited with a **fatal error**, i.e. no FASTA output file was written

* (double_id_err.txt)

Lists input files with ambiguous FASTA IDs; for these the script exited with a **fatal error**, i.e. no FASTA output file was written

* (locus_tag_missing_err.txt)

Lists CDS features without locus tags

* (linear_seq_cds_overlap_err.txt)

Lists CDS features overlapping the sequence border of a **linear** molecule, which are **not** included in the resulting multi-FASTA file
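
After a batch run it can help to check which of these optional error files were written. The helper below is a sketch, not part of the script: `report_error_files` is a hypothetical name, and it assumes the error files end up in the directory that is scanned.

```shell
# Hypothetical helper: report which of the error files listed above exist
# and are non-empty in a directory (default: current working directory).
report_error_files() {
    dir=${1:-.}
    for name in no_annotation_err.txt double_id_err.txt \
                locus_tag_missing_err.txt linear_seq_cds_overlap_err.txt; do
        if [ -s "$dir/$name" ]; then
            echo "$name: $(grep -c . "$dir/$name") non-empty line(s)"
        fi
    done
}

report_error_files .
```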

## Dependencies

* [BioPerl](http://www.bioperl.org) (tested with version 1.006923)

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.7.1 (26.10.2015)
    - changed output file extensions from **\_cds\_aa.fasta* or **\_cds\_nuc.fasta* to **.faa* or **.ffn*, respectively
    - minor syntax changes in README, included TOC
    - minor syntax changes in POD
* v0.7 (31.03.2014)
    - location (l=) and EC numbers (ec=) for CDS features are included in the FASTA header
    - 'ec=', 'g=', 'p=', and 'o=' only included in FASTA header if these tags are present for a CDS feature, or additionally for 'g=' with option **-f**
    - if, with options '-u' and/or '-d', the location of a CDS feature overlaps a sequence boundary, the positions are marked with '<' or '>' in 'l='
    - additionally, CDS features whose location overlaps the sequence boundary of a linear molecule will not be included in the output, but their IDs are written to an error file
    - new option **-c** to choose the prefix for the internal CDS counter
    - /product feature value will not be used as FASTA ID anymore, skip directly to internal CDS counter, if /locus_tag, /protein_id, or /gene is missing for a CDS (too many 'hypothetical proteins')
    - internal CDS counter counts all CDSs of multi-sequence files sequentially (does not restart with each new sequence in the multi-sequence file)
    - 'control_double' subroutine also called if /gene is used as FASTA ID
    - fixed bug introduced in v0.6 to exit if no CDS primary features found, because a draft multi-sequence file might have unannotated small contigs
    - new error files: no_annotation_err.txt, double_id_err.txt, linear_seq_cds_overlap_err.txt (the first two come in handy if you run `cds_extractor.pl` in a UNIX loop with many files)
    - included 'use autodie'
    - included version switch
    - included pod2usage with Pod::Usage
    - reorganized code into more subroutines to remove redundant duplicate code (which also contained some bugs) and to make the script more concise
    - minor changes to Perl syntax
* v0.6 (06.06.2013)
    - exit with error if no CDS primary features present in input file, as /translation feature only present in CDS features (some GENBANK files are only annotated with 'gene')
    - included Bio::SeqFeatureI's method *spliced_seq* for CDS with split nucleotide sequences (CDS position indicated by 'join')
    - minor changes to how the optional list of locus tags is handled
* v0.5 (03.06.2013)
    - included a POD
    - options with Getopt::Long
    - option **-n** to alternatively extract nucleotide sequences for CDS features (optionally with upstream and downstream sequences)
    - option to include full FASTA ID header for downstream [`prot_finder.pl`](/prot_finder) analysis
    - exit with error if the values for two (or more) /locus_tag or /protein_id tags are ambiguous
    - print message to *STDOUT* if and which locus tags were not found in a given locus tag list (option **-l**)
* v0.4 (06.02.2013)
    - replace whitespaces of /product values with underscores
* v0.3 (06.09.2012)
    - internal CDS counter to use in FASTA ID for CDS features without a /locus_tag, /protein_id, /gene, or /product tag
    - include also organism (and possible plasmid) information in FASTA ID lines
    - give a warning to *STDOUT* if a CDS feature without a /locus_tag is found (but only for the first occurrence)
    - additionally, *locus_tag_errors.txt* error file to list all CDSs without locus tags
    - catch errors with *eval* if a tag is missing
* v0.2 (04.09.2012)
    - if a CDS feature does not have a /locus_tag, then use the value for /protein_id, /gene, or /product (in this order) in the FASTA ID lines of the result file
    - optionally extract only CDSs with locus tags given in a file
* v0.1 (24.05.2012)