trunc_seq
=========

`trunc_seq.pl` is a script to truncate sequence files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl

**or**

    perl trunc_seq.pl file_of_filenames_and_coords.tsv

## Description

This script truncates sequence files according to the given
coordinates. The features/annotations in RichSeq files (e.g. EMBL or
GenBank format) are adapted accordingly. Use option **-o** to
specify a different output sequence format. For a single file, pass the
truncation coordinates and the file directly to the script: the start
position as the first argument, the stop position as the second, and
(the path to) the sequence file as the third. In this case the truncated
sequence entry is printed to *STDOUT*. Input sequence files should
contain only one sequence entry; if a multi-sequence file is used as
input, only the **first** sequence entry is truncated.
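
To illustrate what adapting coordinates to a truncation means, here is a minimal shell sketch. It assumes that start and stop are 1-based and inclusive, which this README does not state explicitly. The helpers `trunc_length` and `shift_coord` are hypothetical and not part of `trunc_seq.pl`, and the sketch ignores features that only partially overlap the truncated region.

```shell
# Hypothetical helpers, not part of trunc_seq.pl.
# Assumption: start and stop are 1-based and inclusive.

# Length of the truncated sequence for a given start and stop.
trunc_length() {
    echo $(( $2 - $1 + 1 ))
}

# Position of an original coordinate within the truncated sequence,
# given the truncation start ($1) and the original position ($2).
shift_coord() {
    echo $(( $2 - $1 + 1 ))
}

trunc_length 20 3500   # Synopsis example: prints 3481
shift_coord 20 100     # original base 100 becomes position 81
```

Under these assumptions, the truncation start itself always maps to position 1 of the new sequence.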

Alternatively, a file of filenames (fof) with respective coordinates
and sequence files in the following **tab-separated** format can be
given to the script (the header is optional):

\#start stop seq-file<br>
300 9000 (path/to/)seq-file<br>
50 1300 (path/to/)seq-file2<br>

With a fof the resulting truncated sequence files are printed into a
results directory. Use option **-r** to specify a different results
directory than the default.
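
Because the fof has to be tab-separated, it can help to write it with `printf` and check it before running the script. The following is only a sketch: `check_fof` is a hypothetical helper, not part of `trunc_seq.pl`, and its rules (exactly three fields, integer coordinates, start not larger than stop) are assumptions about what a sensible fof line looks like.

```shell
# Hypothetical pre-flight check, not part of trunc_seq.pl.
# Assumption: a valid data line has exactly three tab-separated fields,
# with integer start and stop positions and start <= stop.
check_fof() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        /^#/ || /^[ \t]*$/ { next }   # skip comment and empty lines
        (NF != 3) || ($1 !~ /^[0-9]+$/) || ($2 !~ /^[0-9]+$/) || ($1 + 0 > $2 + 0) {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Write the example fof with real tab characters (printf expands \t).
fof="${TMPDIR:-/tmp}/file_of_filenames_and_coords.tsv"
{
    printf '#start\tstop\tseq-file\n'
    printf '300\t9000\tpath/to/seq-file\n'
    printf '50\t1300\tpath/to/seq-file2\n'
} > "$fof"

check_fof "$fof" && echo "fof looks fine"
```

A line separated by spaces instead of tabs would be reported as malformed, which is the most common mistake when a fof is edited by hand.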

It is also possible to truncate a RichSeq sequence file loaded into the
[Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser
from the Sanger Institute: Select a subsequence and then go to Edit ->
Subsequence (and Features).

## Usage

    perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk

**or**

    perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta

**or**

    perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv

## Options

- **-h**, **-help**

    Help (perldoc POD)

- **-o**=*str*, **-outformat**=*str*

    Specify different sequence format for the output (files) \[fasta, embl, or gbk\]

- **-r**=*str*, **-result\_dir**=*str*

    Path to result folder for fof input \[default = './trunc\_seq\_results'\]

- **-v**, **-version**

    Print version number to *STDOUT*

## Output

- *STDOUT*

    If a single sequence file is given to the script, the truncated sequence
    file is printed to *STDOUT*. Redirect or pipe into another tool as
    needed.

**or**

- ./trunc_seq_results

    If a fof is given to the script, all output files are stored in a
    results folder.

- ./trunc_seq_results/seq-file_trunc_start_stop.format

    Truncated output sequence files are named after the input file, with
    'trunc' and the corresponding start and stop positions appended.
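
The naming pattern above can be sketched in shell to predict where a fof run will place its files. `trunc_outname` is a hypothetical helper, not part of `trunc_seq.pl`. It assumes that the directory and the last extension of the input file are dropped and that the new extension equals the chosen output format; the script itself may handle extensions differently.

```shell
# Hypothetical helper, not part of trunc_seq.pl.
# Assumptions: the input directory and last file extension are dropped,
# and the output extension equals the requested format.
# Arguments: start stop input-file format [result-dir]
trunc_outname() {
    start=$1
    stop=$2
    infile=$3
    format=$4
    result_dir=${5:-./trunc_seq_results}   # default result folder
    base=$(basename "$infile")             # remove the input path
    base=${base%.*}                        # remove the last extension, if any
    echo "$result_dir/${base}_trunc_${start}_${stop}.$format"
}

trunc_outname 300 9000 path/to/seq-file.embl gbk
# prints ./trunc_seq_results/seq-file_trunc_300_9000.gbk
```

Passing a fifth argument mimics option **-r**, for example `trunc_outname 50 1300 seq-file2 fasta path/to/trunc_dir`.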

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2015-12-07)
    * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` into a single script
    * Now allows single file input and file of filenames (fof) input with coordinates
    * output for single file input now printed to *STDOUT*
    * output for fof input printed into files in a result directory, new option **-r** to specify result directory
    * included a POD instead of a simple usage text
    * included `pod2usage` with Pod::Usage
    * included 'use autodie' pragma
    * options with Getopt::Long
    * output format now specified with option **-o**
    * included version switch, **-v**
    * fixed bug to remove input filepaths from fof input for output files
    * skip empty or comment lines (/^#/) in fof input
    * check and warn if input seq file has more than one sequence entry
* v0.1 (2013-02-08)
    * In v0.1 `trunc_seq.pl` handled only single sequence input, but an additional wrapper script `run_trunc_seq.pl` was included for fof input