comparison COG/bac-genomics-scripts/cds_extractor/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
comparison
equal deleted inserted replaced
2:97e4e3e818b6 3:e42d30da7a74
1 cds_extractor
2 =============
3
4 `cds_extractor.pl` is a script to extract amino acid or nucleotide sequences from coding sequence (CDS) features in annotated genomes.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Extract amino acid sequences](#extract-amino-acid-sequences)
10 * [Extract nucleotide sequences](#extract-nucleotide-sequences)
11 * [UNIX loop to extract sequences from all files in the current working directory](#unix-loop-to-extract-sequences-from-all-files-in-the-current-working-directory)
12 * [Options](#options)
13 * [Mandatory options](#mandatory-options)
14 * [Optional options](#optional-options)
15 * [Output](#output)
16 * [Dependencies](#dependencies)
17 * [Run environment](#run-environment)
18 * [Author - contact](#author---contact)
19 * [Citation, installation, and license](#citation-installation-and-license)
20 * [Changelog](#changelog)
21
22 ## Synopsis
23
24 perl cds_extractor.pl -i seq_file.[embl|gbk] -p
25
26 ## Description
27
28 This script extracts protein or DNA sequences of CDS features from a (multi)-RichSeq file (e.g. EMBL or GENBANK format) and writes them to a multi-FASTA file. The FASTA headers for each CDS include either the locus tag, if that's not available, protein ID, gene, or an internal CDS counter as identifier (in this order). The organism info includes also possible plasmid names. Pseudogenes (tagged by **/pseudo**) are not included (except in the CDS counter).
29
30 In addition to the identifier, FASTA headers include gene (**g=**), product (**p=**), organism (**o=**), and EC numbers (**ec=**), if these are present for a CDS. Individual EC numbers are separated by **semicolons**. The location/position (**l=** start..stop) of a CDS will always be included. If gene is used as FASTA header ID '**g=** gene' will only be included with option **-f**.
31
32 Fuzzy locations in the feature table of a sequence file are not taken into consideration for **l=**. If you set options **-u** and/or **-d** and the feature location overlaps a **circular** replicon boundary, positions are marked with '<' or '>' in the direction of the exceeded boundary. Features with overlapping locations in **linear** sequences (e.g. contigs) will be skipped and are **not** included in the output! A CDS feature is on the lagging strand if start > stop in the location. In the special case of overlapping circular sequence boundaries this is reversed.
33
34 Of course, the **l=** positions are separate for each sequence in a multi- sequence file. Thus, if you want continuous positions for the CDSs run these files first through [`cat_seq.pl`](/cat_seq).
35
36 Optionally, a file with locus tags can be given to extract only these CDS features with option **-l** (each locus tag in a new line).
37
38 ## Usage
39
40 ### Extract amino acid sequences
41
42 perl cds_extractor.pl -i Ecoli_MG1655.gbk -p [-l locus_tags.txt -c MG1655 -f]
43
44 ### Extract nucleotide sequences
45
46 perl cds_extractor.pl -i Banthracis_Ames.embl -n [-l locus_tags.txt -u 100 -d 20 -c Ames -f]
47
48 ### UNIX loop to extract sequences from all files in the current working directory
49
50 for file in *.embl; do perl cds_extractor.pl -i "$file" -p [-l locus_tags.txt]; done
51
52 ## Options
53
54 ### Mandatory options
55
56 * **-i**=_str_, **-input**=_str_
57
58 Input RichSeq sequence file including CDS annotation (e.g. EMBL or GENBANK format)
59
60 * **-p**, **-protein**
61
62 Extract **protein** sequence for each CDS feature, excludes option **-n**
63
64 **or**
65
66 * **-n**, **-nucleotide**
67
68 Extract **nucleotide** sequence for each CDS feature, excludes option **-p**
69
70 ### Optional options
71
72 * **-h**, **-help**
73
74 Help (perldoc POD)
75
76 * **-u**=_int_, **-upstream**=_int_
77
78 Include given number of flanking nucleotides upstream of each CDS feature, forces option **-n**
79
80 * **-d**=_int_, **-downstream**=_int_
81
82 Include given number of flanking nucleotides downstream of each CDS feature, forces option **-n**
83
84 * **-c**=_str_, **-cds_prefix**=_str_
85
86 Prefix for the internal CDS counter [default = 'CDS']
87
88 * **-l**=_str_, **-locustag_list**=_str_
89
90 List of locus tags to extract only those (each locus tag on a new line)
91
92 * **-f**, **-full_header**
93
94 If gene is used as ID include additionally '**g=** gene' in FASTA headers, so downstream analyses can recognize the gene tag (e.g. [`prot_finder.pl`](/prot_finder)).
95
96 * **-v**, **-version**
97
98 Print version number to *STDERR*
99
100 ## Output
101
102 * \*.faa
103
104 Multi-FASTA file of CDS protein sequences
105
106 **or**
107
108 * \*.ffn
109
110 Multi-FASTA file of CDS DNA sequences
111
112 * (no_annotation_err.txt)
113
114 Lists input files missing CDS annotation, script exited with **fatal error** i.e. no FASTA output file
115
116 * (double_id_err.txt)
117
118 Lists input files with ambiguous FASTA IDs, script exited with **fatal error** i.e. no FASTA output file
119
120 * (locus_tag_missing_err.txt)
121
122 Lists CDS features without locus tags
123
124 * (linear_seq_cds_overlap_err.txt)
125
126 Lists CDS features overlapping sequence border of a **linear** molecule, which are **not** included in the result multi-FASTA file
127
128 ## Dependencies
129
130 * [BioPerl](http://www.bioperl.org) (tested with version 1.006923)
131
132 ## Run environment
133
134 The Perl script runs under Windows and UNIX flavors.
135
136 ## Author - contact
137
138 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
139
140 ## Citation, installation, and license
141
142 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
143
144 ## Changelog
145
146 * v0.7.1 (26.10.2015)
147 - changed output file extensions from **\_cds\_aa.fasta* or **\_cds\_nuc.fasta* to **.faa* or **.ffn*, respectively
148 - minor syntax changes in README, included TOC
149 - minor syntax changes in POD
150 * v0.7 (31.03.2014)
151 - location (l=) and EC numbers (ec=) for CDS features are included in the FASTA header
152 - 'ec=', 'g=', 'p=', and 'o=' only included in FASTA header if these tags are present for a CDS feature, or additionally for 'g=' with option **-f**
153 - if, with options '-u' and/or '-d', the location of a CDS feature overlaps a sequence boundary, the positions are marked with '<' or '>' in 'l='
154 - additionally, CDS features whose location overlaps the sequence boundary of a linear molecule will not be included in the output, but IDs written to an error file
155 - new option **-c** to chose prefix for internal CDS counter
156 - /product feature value will not be used as FASTA ID anymore, skip directly to internal CDS counter, if /locus_tag, /protein_id, or /gene is missing for a CDS (too many 'hypothetical proteins')
157 - internal CDS counter counts all CDSs of multi-sequence files sequential (doesn't start new with each new sequence in the multi-sequence file)
158 - 'control_double' subroutine also called if /gene is used as FASTA ID
159 - fixed bug introduced in v0.6 to exit if no CDS primary features found, because a draft multi-sequence file might have unannotated small contigs
160 - new error files: no_annotation_err.txt, double_id_err.txt, linear_seq_cds_overlap_err.txt (the first two come in handy if you run `cds_extractor.pl` in a UNIX loop with many files)
161 - included 'use autodie'
162 - included version switch
163 - included pod2usage with Pod::Usage
164 - reorganized code into more subroutines to remove useless double codings (which contained also some bugs) and to make the script more concise
165 - minor changes to Perl syntax
166 * v0.6 (06.06.2013)
167 - exit with error if no CDS primary features present in input file, as /translation feature only present in CDS features (some GENBANK files are only annotated with 'gene')
168 - included Bio::SeqFeatureI's method *spliced-seq* for CDS with split nucleotide sequences (CDS position indicated by 'join')
169 - minor changes how the optional list of locus tags is handled
170 * v0.5 (03.06.2013)
171 - included a POD
172 - options with Getopt::Long
173 - option **-n** to alternatively extract nucleotide sequences for CDS features (optionally with upstream and downstream sequences)
174 - option to include full FASTA ID header for downstream [`prot_finder.pl`](/prot_finder) analysis
175 - exit with error if the values for two (or more) /locus_tag or /protein_id tags are not unambiguous
176 - print message to *STDOUT* if and which locus tags were not found in a given locus tag list (option **-l**)
177 * v0.4 (06.02.2013)
178 - replace whitespaces of /product values with underscores
179 * v0.3 (06.09.2012)
180 - internal CDS counter to use in FASTA ID for CDS features without a /locus_tag, /protein_id, /gene, or /product tag
181 - include also organism (and possible plasmid) information in FASTA ID lines
182 - give a warning to *STDOUT* if a CDS feature without a /locus_tag is found (but only for the first occurence)
183 - additionally, *locus_tag_errors.txt* error file to list all CDSs without locus tags
184 - catch errors with *eval* if a tag is missing
185 * v0.2 (04.09.2012)
186 - if a CDS feature does not have a /locus_tag, then use the value for /protein_id, /gene, or /product (in this order) in the FASTA ID lines of the result file
187 - optional extract only CDSs with locus tags given in a file
188 * v0.1 (24.05.2012)