comparison COG/bac-genomics-scripts/trunc_seq/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 trunc_seq
2 =========
3
4 `trunc_seq.pl` is a script to truncate sequence files.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Options](#options)
10 * [Output](#output)
11 * [Run environment](#run-environment)
12 * [Dependencies](#dependencies)
13 * [Author - contact](#author---contact)
14 * [Citation, installation, and license](#citation-installation-and-license)
15 * [Changelog](#changelog)
16
17 ## Synopsis
18
19 perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl
20
21 **or**
22
23 perl trunc_seq.pl file_of_filenames_and_coords.tsv
24
25 ## Description
26
27 This script truncates sequence files according to the given
28 coordinates. The features/annotations in RichSeq files (e.g. EMBL or
29 GENBANK format) will also be adapted accordingly. Use option **-o** to
30 specify a different output sequence format. Input can be given directly
31 as a file and truncation coordinates to the script, with the start
32 position as the first argument, stop as the second and (the path to)
33 the sequence file as the third. In this case the truncated sequence
34 entry is printed to *STDOUT*. Input sequence files should contain only
35 one sequence entry; if a multi-sequence file is used as input, only the
36 **first** sequence entry is truncated.
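
To make the coordinate handling concrete, here is a minimal Python sketch of the idea, not the script's actual BioPerl code. It assumes 1-based, inclusive start and stop positions and keeps only features that lie completely inside the window; the function name `truncate` and the tuple layout for features are hypothetical, and how `trunc_seq.pl` treats partially overlapping features is not covered here.

```python
def truncate(seq, start, stop, features=()):
    """Return the subsequence start..stop and the features inside it.

    Assumptions (not confirmed by the README): coordinates are 1-based and
    inclusive; features are (name, feat_start, feat_stop) tuples using the
    same convention; only fully contained features are kept.
    """
    if not 1 <= start <= stop <= len(seq):
        raise ValueError(f"invalid coordinates {start}..{stop} for length {len(seq)}")
    offset = start - 1                 # shift from old to new coordinates
    sub = seq[offset:stop]             # 1-based inclusive -> Python slice
    kept = [
        (name, fs - offset, fe - offset)
        for name, fs, fe in features
        if fs >= start and fe <= stop  # drop features outside the window
    ]
    return sub, kept


seq = "ATGCATGCAT"
features = [("geneA", 2, 4), ("geneB", 5, 9), ("geneC", 8, 10)]
sub, kept = truncate(seq, 4, 9, features)
print(sub)   # CATGCA
print(kept)  # [('geneB', 2, 6)]
```

Feature positions are shifted by `start - 1`, so a feature that began at the truncation start would begin at position 1 in the output.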
37
38 Alternatively, a file of filenames (fof) with respective coordinates
39 and sequence files in the following **tab-separated** format can be
40 given to the script (the header is optional):
41
42 \#start&emsp;stop&emsp;seq-file<br>
43 300&emsp;9000&emsp;(path/to/)seq-file<br>
44 50&emsp;1300&emsp;(path/to/)seq-file2<br>
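
The fof layout above can be read with a few lines of Python. This is an illustrative sketch rather than the script's own parser: it skips empty lines and lines starting with `#` (which covers the optional header), splits on tabs, and converts the coordinates to integers. The name `parse_fof` and the error handling are assumptions made for this example.

```python
def parse_fof(lines):
    """Parse tab-separated 'start<TAB>stop<TAB>seq-file' lines.

    Empty lines and comment lines starting with '#' are skipped.
    Returns a list of (start, stop, path) tuples.
    """
    entries = []
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 tab-separated fields, got {len(fields)}")
        start, stop, path = fields
        entries.append((int(start), int(stop), path))
    return entries


fof = [
    "#start\tstop\tseq-file\n",
    "300\t9000\tpath/to/seq-file\n",
    "\n",
    "50\t1300\tpath/to/seq-file2\n",
]
print(parse_fof(fof))
# [(300, 9000, 'path/to/seq-file'), (50, 1300, 'path/to/seq-file2')]
```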
45
46 With a fof the resulting truncated sequence files are printed into a
47 results directory. Use option **-r** to specify a different results
48 directory than the default.
49
50 It is also possible to truncate a RichSeq sequence file loaded into the
51 [Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser
52 from the Sanger Institute: Select a subsequence and then go to Edit ->
53 Subsequence (and Features).
54
55 ## Usage
56
57 perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk
58
59 **or**
60
61 perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta
62
63 **or**
64
65 perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv
66
67 ## Options
68
69 - **-h**, **-help**
70
71 Help (perldoc POD)
72
73 - **-o**=*str*, **-outformat**=*str*
74
75 Specify different sequence format for the output (files) [fasta, embl, or gbk]
76
77 - **-r**=*str*, **-result\_dir**=*str*
78
79 Path to result folder for fof input \[default = './trunc\_seq\_results'\]
80
81 - **-v**, **-version**
82
83 Print version number to *STDOUT*
84
85 ## Output
86
87 - *STDOUT*
88
89 If a single sequence file is given to the script the truncated sequence
90 file is printed to *STDOUT*. Redirect or pipe into another tool as
91 needed.
92
93 **or**
94
95 - ./trunc_seq_results
96
97 If a fof is given to the script, all output files are stored in a
98 results folder.
99
100 - ./trunc_seq_results/seq-file_trunc_start_stop.format
101
102 Truncated output sequence files are named after the input file, with 'trunc' and the
103 corresponding start and stop positions appended.
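
As a rough illustration of the naming pattern `seq-file_trunc_start_stop.format`, the following Python sketch builds an output path from a fof entry. It is an assumption-laden example, not the script's code: the helper name `trunc_name` is hypothetical, the input directory is dropped as the changelog describes, and the extension is assumed to be the **-o** value if given, else the input file's own extension.

```python
from pathlib import PurePosixPath


def trunc_name(seq_file, start, stop, out_format=None,
               result_dir="./trunc_seq_results"):
    """Build '<result_dir>/<basename>_trunc_<start>_<stop>.<ext>'.

    Assumption: the extension is out_format when given, otherwise the
    input file's extension; the input directory is discarded.
    """
    p = PurePosixPath(seq_file)
    ext = out_format if out_format else p.suffix.lstrip(".")
    suffix = f".{ext}" if ext else ""
    return f"{result_dir}/{p.stem}_trunc_{start}_{stop}{suffix}"


print(trunc_name("path/to/seq-file.embl", 300, 9000))
# ./trunc_seq_results/seq-file_trunc_300_9000.embl
print(trunc_name("path/to/seq-file.embl", 300, 9000, out_format="gbk"))
# ./trunc_seq_results/seq-file_trunc_300_9000.gbk
```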
104
105 ## Run environment
106
107 The Perl script runs under Windows and UNIX flavors.
108
109 ## Dependencies
110
111 - [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)
112
113 ## Author - contact
114
115 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
116
117 ## Citation, installation, and license
118
119 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
120
121 ## Changelog
122
123 * v0.2 (2015-12-07)
124 * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` in one single script
125 * Allows now single file and file of filenames (fof) with coordinates input
126 * output for single file input printed to *STDOUT* now
127 * output for fof input printed into files in a result directory, new option **-r** to specify result directory
128 * included a POD instead of a simple usage text
129 * included `pod2usage` with Pod::Usage
130 * included 'use autodie' pragma
131 * options with Getopt::Long
132 * output format now specified with option **-o**
133 * included version switch, **-v**
134 * fixed bug to remove input filepaths from fof input for output files
135 * skip empty or comment lines (/^#/) in fof input
136 * check and warn if input seq file has more than one seq entry
137 * v0.1 (2013-02-08)
138 * In v0.1 `trunc_seq.pl` handled only single sequence input; an additional wrapper script `run_trunc_seq.pl` was included for fof input