comparison COG/bac-genomics-scripts/trunc_seq/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
trunc_seq
=========

`trunc_seq.pl` is a script to truncate sequence files.

* [Synopsis](#synopsis)
* [Description](#description)
* [Usage](#usage)
* [Options](#options)
* [Output](#output)
* [Run environment](#run-environment)
* [Dependencies](#dependencies)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Changelog](#changelog)

## Synopsis

    perl trunc_seq.pl 20 3500 seq-file.embl > seq-file_trunc_20_3500.embl

**or**

    perl trunc_seq.pl file_of_filenames_and_coords.tsv

## Description

This script truncates sequence files according to the given
coordinates. The features/annotations in RichSeq files (e.g. EMBL or
GENBANK format) are adapted accordingly. Use option **-o** to
specify a different output sequence format. Input can be given directly
to the script as truncation coordinates and a file, with the start
position as the first argument, stop as the second and (the path to)
the sequence file as the third. In this case the truncated sequence
entry is printed to *STDOUT*. Input sequence files should contain only
one sequence entry; if a multi-sequence file is used as input, only the
**first** sequence entry is truncated.

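To illustrate what the start and stop arguments select, here is a minimal Python sketch. It is not part of `trunc_seq.pl` (which is a Perl script) and does not show feature adjustment. It assumes coordinates are 1-based and inclusive, as is the usual BioPerl convention; this README does not state that explicitly, and the function name `truncate_sequence` is hypothetical.

```python
def truncate_sequence(sequence: str, start: int, stop: int) -> str:
    """Return the subsequence from start to stop.

    Assumption: coordinates are 1-based and inclusive (BioPerl-style);
    the README does not state this explicitly.
    """
    if not 1 <= start <= stop <= len(sequence):
        raise ValueError(
            f"invalid coordinates {start}..{stop} for length {len(sequence)}"
        )
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return sequence[start - 1:stop]


if __name__ == "__main__":
    # Positions 3 to 6 of a ten-base toy sequence.
    print(truncate_sequence("ACGTACGTAC", 3, 6))
```

Under this assumption, a call such as `perl trunc_seq.pl 20 3500 seq-file.embl` would keep positions 20 through 3500 of the first entry.
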
Alternatively, a file of filenames (fof) with respective coordinates
and sequence files in the following **tab-separated** format can be
given to the script (the header is optional):

\#start stop seq-file<br>
300 9000 (path/to/)seq-file<br>
50 1300 (path/to/)seq-file2<br>

With a fof the resulting truncated sequence files are printed into a
results directory. Use option **-r** to specify a different results
directory than the default.

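The fof layout above can be read with a few lines of code. The following Python sketch only illustrates the format; it is not the script's own Perl implementation. Skipping empty lines and lines starting with `#` follows the changelog entry for v0.2. The names `FofEntry` and `read_fof` and the error on a wrong field count are assumptions made for this example.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FofEntry(NamedTuple):
    """One data line of a fof (hypothetical helper type)."""
    start: int
    stop: int
    seq_file: str


def read_fof(handle: TextIO) -> Iterator[FofEntry]:
    """Yield one entry per data line of a tab-separated fof."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        # Skip empty lines and comment/header lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            # Assumption: malformed lines are treated as errors here.
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields")
        start, stop, seq_file = fields
        yield FofEntry(int(start), int(stop), seq_file)


if __name__ == "__main__":
    example = io.StringIO(
        "#start\tstop\tseq-file\n"
        "300\t9000\tpath/to/seq-file\n"
        "\n"
        "50\t1300\tpath/to/seq-file2\n"
    )
    for entry in read_fof(example):
        print(entry)
```
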
It is also possible to truncate a RichSeq sequence file loaded into the
[Artemis](http://www.sanger.ac.uk/science/tools/artemis) genome browser
from the Sanger Institute: Select a subsequence and then go to Edit ->
Subsequence (and Features).

## Usage

    perl trunc_seq.pl -o gbk 120 30000 seq-file.embl > seq-file_trunc_120_30000.gbk

**or**

    perl trunc_seq.pl -o fasta 5300 18500 seq-file.gbk | perl revcom_seq.pl -i fasta > seq-file_trunc_revcom.fasta

**or**

    perl trunc_seq.pl -r path/to/trunc_embl_dir -o embl file_of_filenames_and_coords.tsv

## Options

- **-h**, **-help**

    Help (perldoc POD)

- **-o**=*str*, **-outformat**=*str*

    Specify different sequence format for the output (files) \[fasta, embl, or gbk\]

- **-r**=*str*, **-result\_dir**=*str*

    Path to result folder for fof input \[default = './trunc\_seq\_results'\]

- **-v**, **-version**

    Print version number to *STDOUT*

## Output

- *STDOUT*

    If a single sequence file is given to the script, the truncated sequence
    file is printed to *STDOUT*. Redirect or pipe into another tool as
    needed.

**or**

- ./trunc_seq_results

    If a fof is given to the script, all output files are stored in a
    results folder.

- ./trunc_seq_results/seq-file_trunc_start_stop.format

    Truncated output sequence files are named after the input file, with
    'trunc' and the corresponding start and stop positions appended.

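The naming pattern listed above can be sketched in Python as follows. This is an illustration, not the script's own code. Dropping the input directory follows the v0.2 changelog entry. Stripping the input file extension and using the output format name as the new extension are assumptions based on the example file names in this README. The function name `trunc_output_path` is hypothetical.

```python
from pathlib import PurePosixPath


def trunc_output_path(seq_file: str, start: int, stop: int, out_format: str,
                      result_dir: str = "./trunc_seq_results") -> str:
    """Build a result path of the form <dir>/<name>_trunc_<start>_<stop>.<format>."""
    # Drop the input directory and (assumption) the last file extension.
    stem = PurePosixPath(seq_file).stem
    # Avoid a doubled slash if the result directory ends with one.
    return f"{result_dir.rstrip('/')}/{stem}_trunc_{start}_{stop}.{out_format}"


if __name__ == "__main__":
    print(trunc_output_path("path/to/seq-file.embl", 300, 9000, "embl"))
```
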
## Run environment

The Perl script runs under Windows and UNIX flavors.

## Dependencies

- [**BioPerl**](http://www.bioperl.org) (tested version 1.007001)

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

* v0.2 (2015-12-07)
    * Merged functionality of `trunc_seq.pl` and `run_trunc_seq.pl` into a single script
    * Now accepts either a single file or a file of filenames (fof) with coordinates as input
    * output for single file input now printed to *STDOUT*
    * output for fof input printed into files in a result directory, new option **-r** to specify result directory
    * included a POD instead of a simple usage text
    * included `pod2usage` with Pod::Usage
    * included 'use autodie' pragma
    * options with Getopt::Long
    * output format now specified with option **-o**
    * included version switch, **-v**
    * fixed bug to remove input filepaths from fof input for output files
    * skip empty or comment lines (/^#/) in fof input
    * check and warn if input seq file has more than one seq entry
* v0.1 (2013-02-08)
    * `trunc_seq.pl` accepted only single sequence input; an additional wrapper script, `run_trunc_seq.pl`, handled fof input