annotate COG/bac-genomics-scripts/trunc_seq/README.md @ 15:dbde253606c5 draft default tip

Uploaded
author dereeper
date Wed, 11 Dec 2024 08:25:06 +0000
parents e42d30da7a74
children
trunc_seq
=========

`trunc_seq.pl` is a script to truncate sequence files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl

**or**

    perl trunc_seq.pl file_of_filenames_and_coords.tsv

## Description

This script truncates sequence files according to the given
coordinates. Features/annotations in RichSeq files (e.g. EMBL or
GenBank format) are adapted accordingly. Use option **-o** to
specify a different output sequence format. Input can be given
directly to the script as truncation coordinates and a file: the start
position as the first argument, the stop position as the second, and
(the path to) the sequence file as the third. In this case the
truncated sequence entry is printed to *STDOUT*. Input sequence files
should contain only one sequence entry; if a multi-sequence file is
used as input, only the **first** sequence entry is truncated.
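
Because only the first entry of a multi-sequence file is truncated, it can help to count the entries in an input file beforehand. The following is a minimal shell sketch, not part of `trunc_seq.pl`. It assumes that FASTA entries start with a `>` header line and that EMBL/GenBank entries end with a line containing only `//`. The function name `count_entries`, the demo file name, and the choice of format by file extension are illustrative assumptions.

```shell
#!/bin/sh
# Count sequence entries so multi-entry input can be spotted before truncation.
# Assumption: the format is guessed from the file extension (illustrative only).
count_entries() {
    file=$1
    case "$file" in
        # FASTA: one ">" header line per entry
        *.fasta|*.fa|*.fna) grep -c '^>' "$file" || true ;;
        # EMBL/GenBank: each entry is terminated by a "//" line
        *)                  grep -c '^//$' "$file" || true ;;
    esac
}

# Hypothetical demo file with two FASTA entries.
printf '>seq1\nACGT\n>seq2\nTTGA\n' > demo_multi.fasta

n=$(count_entries demo_multi.fasta)
if [ "$n" -gt 1 ]; then
    echo "demo_multi.fasta has $n entries; only the first would be truncated"
fi
```

If the count is greater than one, splitting the file into single-entry files first avoids silently dropping the remaining entries.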

Alternatively, a file of filenames (fof) listing coordinates and
sequence files in the following **tab-separated** format can be
given to the script (the header is optional):

\#start&emsp;stop&emsp;seq-file<br>
300&emsp;9000&emsp;(path/to/)seq-file<br>
50&emsp;1300&emsp;(path/to/)seq-file2<br>

With a fof, the resulting truncated sequence files are written to a
results directory. Use option **-r** to specify a results directory
other than the default.
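
Since the fof must be tab-separated, writing it with `printf` avoids editors that silently convert tabs to spaces. The sketch below builds a small fof and runs a sanity check on it. The file name `coords.tsv`, the sequence file names, and the `awk` check are illustrative assumptions and not part of `trunc_seq.pl`.

```shell
#!/bin/sh
# Write a small fof; the file names are hypothetical placeholders.
# printf expands \t to a real tab, which the format requires.
fof=coords.tsv
{
    printf '#start\tstop\tseq-file\n'          # optional header (comment line)
    printf '300\t9000\tpath/to/seq-file.embl\n'
    printf '50\t1300\tpath/to/seq-file2.gbk\n'
} > "$fof"

# Sanity check (not done by this sketch on behalf of trunc_seq.pl): every line
# that is neither empty nor a comment must have exactly three tab-separated
# fields, the first two being integers.
bad=$(awk -F '\t' '
    /^#/ || /^$/ { next }
    NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { n++ }
    END { print n + 0 }
' "$fof")

echo "malformed lines: $bad"

# The fof would then be passed to the script, e.g.:
#   perl trunc_seq.pl -r path/to/results coords.tsv
```

A non-zero count usually points to spaces used instead of tabs or to a missing column.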

It is also possible to truncate a RichSeq sequence file loaded into the
[Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser
from the Sanger Institute: select a subsequence and then go to Edit ->
Subsequence (and Features).

## Usage

    perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk

**or**

    perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta

**or**

    perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv

## Options

- **-h**, **-help**

    Help (perldoc POD)

- **-o**=*str*, **-outformat**=*str*

    Specify a different sequence format for the output (files) [fasta, embl, or gbk]

- **-r**=*str*, **-result\_dir**=*str*

    Path to the result folder for fof input \[default = './trunc\_seq\_results'\]

- **-v**, **-version**

    Print version number to *STDOUT*

## Output

- *STDOUT*

    If a single sequence file is given to the script, the truncated sequence
    file is printed to *STDOUT*. Redirect or pipe into another tool as
    needed.

**or**

- ./trunc_seq_results

    If a fof is given to the script, all output files are stored in a
    results folder.

- ./trunc_seq_results/seq-file_trunc_start_stop.format

    Truncated output sequence files are named after the input file, with
    'trunc' and the corresponding start and stop positions appended.
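
The naming pattern above can be used to predict where the output for a given fof line should appear, for example in a downstream pipeline step. This is a rough sketch only. It assumes that the directory part and the last extension of the input file name are dropped and that the output format name becomes the new extension; the README does not spell out these details, and the helper `trunc_name` is hypothetical.

```shell
#!/bin/sh
# Predict the output path for one fof line, following the documented pattern
#   ./trunc_seq_results/seq-file_trunc_start_stop.format
# Assumptions (not confirmed by the README): the directory part and the last
# file extension of the input are dropped, and the output format name is used
# as the new extension.
trunc_name() {
    start=$1 stop=$2 infile=$3 format=$4
    result_dir=${5:-./trunc_seq_results}   # documented default of option -r
    base=$(basename "$infile")             # strip (path/to/)
    stem=${base%.*}                        # strip the last extension, if any
    printf '%s/%s_trunc_%s_%s.%s\n' "$result_dir" "$stem" "$start" "$stop" "$format"
}

trunc_name 300 9000 path/to/seq-file.embl embl
trunc_name 50 1300 path/to/seq-file2.gbk fasta path/to/trunc_dir
```

Under these assumptions the first call prints `./trunc_seq_results/seq-file_trunc_300_9000.embl`. Compare the result against an actual run before relying on it.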

## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information, please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2015-12-07)
    * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` into a single script
    * Now allows input as a single file or as a file of filenames (fof) with coordinates
    * output for single file input now printed to *STDOUT*
    * output for fof input printed into files in a result directory, new option **-r** to specify result directory
    * included a POD instead of a simple usage text
    * included `pod2usage` with Pod::Usage
    * included 'use autodie' pragma
    * options with Getopt::Long
    * output format now specified with option **-o**
    * included version switch, **-v**
    * fixed bug to remove input filepaths from fof input for output files
    * skip empty or comment lines (/^#/) in fof input
    * check and warn if input seq file has more than one seq entry
* v0.1 (2013-02-08)
    * `trunc_seq.pl` handled only single sequence input; an additional wrapper script, `run_trunc_seq.pl`, was included for fof input