comparison COG/bac-genomics-scripts/order_fastx/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
2:97e4e3e818b6 3:e42d30da7a74
1 order_fastx
2 ===========
3
4 `order_fastx.pl` is a script to order sequences in FASTA or FASTQ files.
5
6 * [Synopsis](#synopsis)
7 * [Description](#description)
8 * [Usage](#usage)
9 * [Options](#options)
10 * [Mandatory options](#mandatory-options)
11 * [Optional options](#optional-options)
12 * [Output](#output)
13 * [Run environment](#run-environment)
14 * [Author - contact](#author---contact)
15 * [Citation, installation, and license](#citation-installation-and-license)
16 * [Changelog](#changelog)
17
18
19 ## Synopsis
20
21 perl order_fastx.pl -i infile.fasta -l order_id_list.txt > ordered.fasta
22
23 ## Description
24
25 Order sequence entries in FASTA or FASTQ sequence files according to
26 an ID list that specifies the desired order. Note that each ID in the
27 order list has to be **identical** to the entire ID in the sequence file.
28
29 However, the leading ">" or "@" that marks the ID line in FASTA or FASTQ
30 files, respectively, can be omitted in the ID list.
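The matching rule described here can be illustrated with a minimal Python sketch. This is not the code of `order_fastx.pl`: the function names `normalize_id`, `read_fasta`, and `order_records` are hypothetical, and the sketch assumes that the "entire ID" is the complete header line after the leading marker.

```python
def normalize_id(line):
    """Strip the line end and one leading '>' or '@' (assumed ID markers)."""
    line = line.rstrip("\r\n")
    return line[1:] if line[:1] in (">", "@") else line


def read_fasta(lines):
    """Map each full header (without '>') to its list of sequence lines."""
    records, current = {}, None
    for line in lines:
        if line.startswith(">"):
            current = normalize_id(line)
            records[current] = []
        elif current is not None:
            records[current].append(line.rstrip("\r\n"))
    return records


def order_records(records, id_list):
    """Return (id, sequence lines) pairs in list order; unknown IDs are skipped."""
    wanted = [normalize_id(entry) for entry in id_list if entry.strip()]
    return [(seq_id, records[seq_id]) for seq_id in wanted if seq_id in records]


fasta = [">seq1 first\n", "ACGT\n", ">seq2 second\n", "GGCC\n", "TTAA\n"]
# The ID list may carry the ">" marker or omit it; both entries match.
ordered = order_records(read_fasta(fasta), ["seq2 second\n", ">seq1 first\n"])
for seq_id, seq_lines in ordered:
    print(">" + seq_id)
    print("\n".join(seq_lines))
```

Under the stated assumption, a partial ID such as `seq2` would not match the header `seq2 second`, which is why the order list has to carry the entire ID.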
31
32 The file type is detected automatically, but you can also set it
33 manually with option **-f**. FASTQ format assumes **four** lines
34 per read; if this is not the case, run the FASTQ file through
35 [`fastx_fix.pl`](/fastx_fix) or use Heng Li's
36 [`seqtk seq`](https://github.com/lh3/seqtk):
37
38 seqtk seq -l 0 infile.fq > outfile.fq
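To make the four-lines-per-read assumption concrete, the following Python sketch reads strictly four-line FASTQ records and raises an error otherwise. It is an illustration only, not how `order_fastx.pl` parses its input; the function name `read_fastq_four_line` is hypothetical, and the sketch assumes the input has no trailing blank lines.

```python
def read_fastq_four_line(lines):
    """Yield (id, sequence, quality) from strictly four-line FASTQ records."""
    stripped = [line.rstrip("\r\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    for start in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[start:start + 4]
        # Line 1 of a record starts with '@', line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record at line {start + 1} is not four-line FASTQ")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {start + 1}")
        yield header[1:], seq, qual


reads = list(read_fastq_four_line(["@r1\n", "ACGT\n", "+\n", "IIII\n"]))
print(reads)
```

A FASTQ file whose sequence or quality strings wrap over several lines fails this check, which is the situation the `seqtk seq -l 0` call above is meant to repair.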
39
40 The script can also be used to pull the subset of sequences named in
41 the ID list out of the sequence file. In this case it is probably best
42 to set option flag **-s**, see [Optional options](#optional-options) below. However,
43 [`filter_fastx.pl`](/filter_fastx) is the recommended tool for this task.
44
45 ## Usage
46
47 perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f fastq > ordered.fq
48
49 perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e > ordered.fasta
50
51 ## Options
52
53 ### Mandatory options
54
55 - -i, -input
56
57 Input FASTA or FASTQ file
58
59 - -l, -list
60
61 List with sequence IDs in specified order
62
63 ### Optional options
64
65 - -h, -help
66
67 Help (perldoc POD)
68
69 - -f, -file_type
70
71 Set the file type manually [fasta|fastq]
72
73 - -e, -error_files
74
75 Write IDs that are present in the sequence file or in the order ID list, but have no equivalent in the other, to error files instead of *STDERR* (see [Output](#output) below)
76
77 - -s, -skip_errors
78
79 Skip the error statements about missing IDs; excludes option **-e**
80
81 - -v, -version
82
83 Print version number to *STDERR*
84
85 ## Output
86
87 - *STDOUT*
88
89 The newly ordered sequences are printed to *STDOUT*. Redirect or pipe into another tool as needed.
90
91 - (order_ids_missing.txt)
92
93 Written with option **-e** if IDs in the order list are missing from the sequence file
94
95 - (seq_ids_missing.txt)
96
97 Written with option **-e** if IDs in the sequence file are missing from the order ID list
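The two kinds of missing IDs can be sketched in Python as two one-sided comparisons. The function name `missing_ids` is hypothetical, and the sketch makes no claim about the actual layout of the error files written by `order_fastx.pl`.

```python
def missing_ids(seq_ids, order_ids):
    """Return (order IDs absent from the sequence file,
    sequence IDs absent from the order list), keeping input order."""
    seq_set, order_set = set(seq_ids), set(order_ids)
    order_missing = [i for i in order_ids if i not in seq_set]
    seq_missing = [i for i in seq_ids if i not in order_set]
    return order_missing, seq_missing


order_missing, seq_missing = missing_ids(["a", "b", "c"], ["c", "x", "a"])
print(order_missing)  # ['x'] -> would correspond to order_ids_missing.txt
print(seq_missing)    # ['b'] -> would correspond to seq_ids_missing.txt
```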
98
99 ## Run environment
100
101 The Perl script runs under Windows and UNIX flavors.
102
103 ## Author - contact
104
105 Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
106
107 ## Citation, installation, and license
108
109 For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
110
111 ## Changelog
112
113 - v0.1 (20.11.2014)