annotate COG/bac-genomics-scripts/order_fastx/README.md @ 14:5a5c9a6b047b draft

Uploaded
author dereeper
date Tue, 10 Dec 2024 16:20:53 +0000
parents e42d30da7a74
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
1 order_fastx
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
2 ===========
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
3
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
4 `order_fastx.pl` is a script to order sequences in FASTA or FASTQ files.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
5
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
6 * [Synopsis](#synopsis)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
7 * [Description](#description)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
8 * [Usage](#usage)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
9 * [Options](#options)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
10 * [Mandatory options](#mandatory-options)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
11 * [Optional options](#optional-options)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
12 * [Output](#output)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
13 * [Run environment](#run-environment)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
14 * [Author - contact](#author---contact)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
15 * [Citation, installation, and license](#citation-installation-and-license)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
16 * [Changelog](#changelog)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
17
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
18
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
19 ## Synopsis
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
20
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
21 perl order_fastx.pl -i infile.fasta -l order_id_list.txt > ordered.fasta
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
22
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
23 ## Description
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
24
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
25 Order sequence entries in FASTA or FASTQ sequence files according to
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
26 an ID list with a given order. Beware, the IDs in the order list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
27 have to be **identical** to the entire IDs in the sequence file.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
28
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
29 However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
30 respectively, can be omitted in the ID list.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
31
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
32 The file type is detected automatically. But, you can set the file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
33 type manually with option **-f**. FASTQ format assumes **four** lines
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
34 per read, if this is not the case run the FASTQ file through
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
35 [`fastx_fix.pl`](/fastx_fix) or use Heng Li's [`seqtk
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
36 seq`](https://github.com/lh3/seqtk):
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
37
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
38 seqtk seq -l 0 infile.fq > outfile.fq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
39
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
40 The script can also be used to pull a subset of sequences in the ID
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
41 list from the sequence file. Probably best to set option flag **-s**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
42 in this case, see [Optional options](#optional-options) below. But, rather use
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
43 [`filter_fastx.pl`](/filter_fastx).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
44
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
45 ## Usage
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
46
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
47 perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f fastq > ordered.fq
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
48
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
49 perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e > ordered.fasta
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
50
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
51 ## Options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
52
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
53 ### Mandatory options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
54
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
55 - -i, -input
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
56
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
57 Input FASTA or FASTQ file
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
58
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
59 - -l, -list
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
60
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
61 List with sequence IDs in specified order
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
62
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
63 ### Optional options
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
64
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
65 - -h, -help
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
66
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
67 Help (perldoc POD)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
68
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
69 - -f, -file_type
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
70
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
71 Set the file type manually [fasta|fastq]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
72
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
73 - -e, -error_files
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
74
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
75 Write missing IDs in the seq file or the order ID list without an equivalent in the other to error files instead of *STDERR* (see [Output](#output) below)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
76
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
77 - -s, -skip_errors
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
78
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
79 Skip missing ID error statements, excludes option **-e**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
80
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
81 - -v, -version
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
82
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
83 Print version number to *STDERR*
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
84
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
85 ## Output
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
86
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
87 - *STDOUT*
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
88
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
89 The newly ordered sequences are printed to *STDOUT*. Redirect or pipe into another tool as needed.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
90
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
91 - (order_ids_missing.txt)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
92
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
93 If IDs in the order list are missing in the sequence file with option **-e**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
94
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
95 - (seq_ids_missing.txt)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
96
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
97 If IDs in the sequence file are missing in the order ID list with option **-e**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
98
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
99 ## Run environment
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
100
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
101 The Perl script runs under Windows and UNIX flavors.
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
102
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
103 ## Author - contact
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
104
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
105 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
106
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
107 ## Citation, installation, and license
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
108
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
109 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
110
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
111 ## Changelog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
112
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
113 - v0.1 (20.11.2014)