comparison seqrequester/README.md @ 1:1085e094cf5f draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/microsatbed commit 7ceb6658309a7ababe622b5d92e729e5470e22f0-dirty
author fubar
date Sat, 13 Jul 2024 12:39:06 +0000
# seqrequester

This is 'seqrequester', a tool for summarizing, extracting, generating and
modifying DNA sequences.

## Installation
### Dependencies
* GCC (tested with 11.3.0)
* git 2.25.1 or higher

### Make
```
git clone https://github.com/marbl/seqrequester.git
cd seqrequester/src
make -j 12
```

## Summarizing

The summarize mode generates a table of Nx lengths, draws a lovely ASCII
plot of the histogram of sequence lengths, and reports GC content along
with mono-, di- and tri-nucleotide frequencies.

It can optionally split sequences at N bases before computing sequence lengths.

You can also get a simple histogram of the sequence lengths and the number of sequences
at each length, or just a simple list of all sequence lengths.

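As a rough illustration of what `-split-n`, `-simple` and `-lengths` report, here is a minimal Python sketch. It is not seqrequester's implementation: the function names are hypothetical, and treating a run of consecutive Ns (in either case) as a single split point is an assumption.

```python
import re
from collections import Counter

def sequence_lengths(sequences, split_n=False):
    """Return one length per sequence, or per piece when splitting at N bases."""
    lengths = []
    for seq in sequences:
        if split_n:
            # Assumption: a run of N/n is one split point; empty pieces are dropped.
            pieces = [p for p in re.split(r"[Nn]+", seq) if p]
        else:
            pieces = [seq]
        lengths.extend(len(p) for p in pieces)
    return lengths

def simple_histogram(lengths):
    """Return sorted (length, number_of_sequences) pairs."""
    return sorted(Counter(lengths).items())

reads = ["ACGTNNNNACG", "ACGT", "NNACG"]
print(sequence_lengths(reads))                # [11, 4, 5]
print(sequence_lengths(reads, split_n=True))  # [4, 3, 4, 3]
print(simple_histogram(sequence_lengths(reads, split_n=True)))  # [(3, 2), (4, 2)]
```
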
It will, of course, read FASTA and FASTQ, uncompressed or compressed with
gzip, bzip2 or xz.

Only one report is generated, regardless of how many sequence files are supplied.

```
% seqrequester summarize
usage: seqrequester [mode] [options] [sequence_file ...]

OPTIONS for summarize mode:
  -size          base size to use for N50 statistics
  -1x            limit NG table to 1x coverage

  -split-n       split sequences at N bases before computing length
  -simple        output a simple 'length numSequences' histogram
  -lengths       output a list of the sequence lengths

  -assequences   load data as complete sequences (for testing)
  -asbases       load data as blocks of bases (for testing)
```

```
% seqrequester summarize /archive/mothra/FLX/*gz

G=6462464889                       sum of ||              length     num
NG          length     index      lengths ||               range    seqs
----- ------------ --------- ------------ || ------------------- -------
00010          652    801160    646246790 ||        42-112          4768|-
00020          582   1862887   1292493013 ||       113-183         16961|-
00030          555   3002684   1938739802 ||       184-254         89381|--
00040          538   4186751   2584986254 ||       255-325        536862|--------
00050          523   5405461   3231232945 ||       326-396       1463599|--------------------
00060          509   6657839   3877479295 ||       397-467       1960924|---------------------------
00070          488   7952426   4523725460 ||       468-538       4616863|---------------------------------------------------------------
00080          447   9329218   5169971940 ||       539-609       2858982|----------------------------------------
00090          389  10872299   5816218777 ||       610-680        625376|---------
00100           42  12803136   6462464889 ||       681-751        252454|----
001.000x            12803137   6462464889 ||       752-822        134849|--
                                          ||       823-893         78435|--
                                          ||       894-964         47976|-
                                          ||       965-1035        30852|-
                                          ||      1036-1106        21127|-
                                          ||      1107-1177        14817|-
                                          ||      1178-1248        28461|-
                                          ||      1249-1319         4930|-
                                          ||      1320-1390         3655|-
                                          ||      1391-1461         2657|-
                                          ||      1462-1532         2120|-
                                          ||      1533-1603         1597|-
                                          ||      1604-1674         1268|-
                                          ||      1675-1745          953|-
                                          ||      1746-1816          766|-
                                          ||      1817-1887          573|-
                                          ||      1888-1958          443|-
                                          ||      1959-2029          344|-
                                          ||      2030-2100         1022|-
                                          ||      2101-2171           21|-
                                          ||      2172-2242           23|-
                                          ||      2243-2313           20|-
                                          ||      2314-2384           17|-
                                          ||      2385-2455            8|-
                                          ||      2456-2526            9|-
                                          ||      2527-2597            4|-
                                          ||      2598-2668            2|-
                                          ||      2669-2739            6|-
                                          ||      2740-2810            2|-
                                          ||      2811-2881            6|-
                                          ||      2882-2952            1|-
                                          ||      2953-3023            0|
                                          ||      3024-3094            0|
                                          ||      3095-3165            1|-
                                          ||      3166-3236            0|
                                          ||      3237-3307            0|
                                          ||      3308-3378            0|
                                          ||      3379-3449            1|-
                                          ||      3450-3520            0|
                                          ||      3521-3591            1|-

--------------------- --------------------- ----------------------------------------------------------------------------------------------
   mononucleotide         dinucleotide                                              trinucleotide
--------------------- --------------------- ----------------------------------------------------------------------------------------------
1959571306 0.3032   A   665030151 0.1031 AA   237235545 0.0369 AAA    132268487 0.0205 AAC    136675399 0.0212 AAG    158473516 0.0246 AAT
1247489432 0.1930   C   389352138 0.0604 AC   115665542 0.0180 ACA     87346626 0.0136 ACC     70986769 0.0110 ACG    114582435 0.0178 ACT
1345011807 0.2081   G   397219280 0.0616 AG   121659180 0.0189 AGA     65811037 0.0102 AGC    102037062 0.0159 AGG    106854671 0.0166 AGT
1910392344 0.2956   T   507072196 0.0786 AT   152454159 0.0237 ATA     89877335 0.0140 ATC    106195089 0.0165 ATG    158544503 0.0246 ATT
                        380831936 0.0590 CA   132169383 0.0205 CAA     76839888 0.0119 CAC     67197045 0.0104 CAG    104566859 0.0162 CAT
   --GC--    --AT--     281892951 0.0437 CC    86178881 0.0134 CCA     65022575 0.0101 CCC     50576089 0.0079 CCG     79660170 0.0124 CCT
   40.12%    59.88%     208535008 0.0323 CG    60164341 0.0093 CGA     27649662 0.0043 CGC     52322022 0.0081 CGG     67296554 0.0105 CGT
                        374626420 0.0581 CT    95122699 0.0148 CTA     75643338 0.0118 CTC     74304266 0.0115 CTG    129554475 0.0201 CTT
                        383528854 0.0595 GA   128291282 0.0199 GAA     70244915 0.0109 GAC     88104990 0.0137 GAG     96746854 0.0150 GAT
                        218253748 0.0338 GC    72696062 0.0113 GCA     51118632 0.0079 GCC     27797659 0.0043 GCG     66512705 0.0103 GCT
                        361154273 0.0560 GG    87662449 0.0136 GGA     51591104 0.0080 GGC    124820908 0.0194 GGG     89162545 0.0139 GGT
                        371793122 0.0576 GT   112773785 0.0175 GTA     61960408 0.0096 GTC     66540524 0.0103 GTG    130508651 0.0203 GTT
                        526153182 0.0816 TA   165978181 0.0258 TAA    109335141 0.0170 TAC    104450634 0.0162 TAG    146068463 0.0227 TAT
                        355583693 0.0551 TC   105474843 0.0164 TCA     77928750 0.0121 TCC     58773061 0.0091 TCG    113158614 0.0176 TCT
                        375551419 0.0582 TG   113251416 0.0176 TGA     72683975 0.0113 TGC     81429936 0.0127 TGG    107781308 0.0167 TGT
                        653083381 0.1013 TT   164899014 0.0256 TTA    127390884 0.0198 TTC    127703170 0.0198 TTG    233082150 0.0362 TTT
```

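To make the NG columns of the report concrete, the sketch below computes such a table from a list of sequence lengths. It uses the usual NG definition (sort lengths from longest to shortest and accumulate until the running sum reaches x% of G), which is an assumption about how seqrequester computes it; `ng_table` is a hypothetical name. In the sample report, G equals the total number of bases, as the 00100 row shows. The sketch reports how many sequences were used, which may not match the tool's `index` convention.

```python
def ng_table(lengths, genome_size=None, steps=range(10, 101, 10)):
    """Return (NG, length, sequences_used, sum_of_lengths) rows.

    genome_size defaults to the total number of bases (an assumption).
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        return []
    if genome_size is None:
        genome_size = sum(ordered)

    rows = []
    running = 0   # bases accumulated so far
    used = 0      # number of sequences accumulated so far
    for pct in steps:
        # Integer comparison avoids floating-point rounding at the boundary.
        while used < len(ordered) and running * 100 < genome_size * pct:
            running += ordered[used]
            used += 1
        if running * 100 < genome_size * pct:
            break  # not enough bases to reach this fraction of G
        rows.append((pct, ordered[used - 1], used, running))
    return rows

for row in ng_table([2, 3, 4, 5, 6]):
    print("%05d %12d %9d %12d" % row)
```

With a genome size larger than the total number of bases, the table simply stops at the last reachable fraction.
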
## Extracting

```
% seqrequester extract
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for extract mode:
  -bases     baselist   extract bases as specified in the 'list' from each sequence
  -sequences seqlist    extract ordinal sequences as specified in the 'list'

  -reverse              reverse the bases in the sequence
  -complement           complement the bases in the sequence
  -rc                   alias for -reverse -complement

  -compress             compress homopolymer runs to one base

  -upcase
  -downcase

  -length min-max       print sequence if it is at least 'min' bases and at most 'max' bases long

  a 'baselist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  bases are space-based; -bases 0-2,4 will print the bases between
  the first two spaces (the first two bases) and the base after the
  fourth space (the fifth base).

  a 'seqlist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  sequences are 1-based; -sequences 1,3-5 will print the first, third,
  fourth and fifth sequences.
```

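The two list formats use different coordinate systems, which is easy to get wrong. The following Python sketch mirrors the semantics described in the help text; the function names are hypothetical, the complement table covering only A, C, G and T is an assumption, and whether the tool prints extracted pieces separately or joined is not stated, so the sketch returns them as a list.

```python
from itertools import groupby

def parse_list(spec):
    """Parse 'num' and 'bgn-end' items separated by commas.

    Returns (bgn, end) pairs; end is None for a single number.
    """
    items = []
    for item in spec.split(","):
        bgn, dash, end = item.strip().partition("-")
        if dash:
            bgn, end = int(bgn), int(end)
            if bgn > end:
                raise ValueError("range %r needs bgn <= end" % item)
            items.append((bgn, end))
        else:
            items.append((int(bgn), None))
    return items

def extract_bases(seq, baselist):
    """Space-based: position k is the gap before the (k+1)-th base."""
    pieces = []
    for bgn, end in parse_list(baselist):
        if end is None:
            end = bgn + 1   # a single number selects the base after that space
        pieces.append(seq[bgn:end])
    return pieces

def sequence_ordinals(seqlist):
    """1-based and inclusive: '1,3-5' selects sequences 1, 3, 4 and 5."""
    ordinals = []
    for bgn, end in parse_list(seqlist):
        ordinals.extend(range(bgn, (bgn if end is None else end) + 1))
    return ordinals

# Assumption: only A/C/G/T are complemented; other symbols pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Equivalent of -reverse -complement under the assumption above."""
    return seq.translate(COMPLEMENT)[::-1]

def compress_homopolymers(seq):
    """Collapse each run of identical characters to one (case-sensitive here)."""
    return "".join(base for base, _ in groupby(seq))

print(extract_bases("ACGTACGT", "0-2,4"))   # ['AC', 'A']
print(sequence_ordinals("1,3-5"))           # [1, 3, 4, 5]
print(reverse_complement("AACG"))           # CGTT
print(compress_homopolymers("AAACCGTTTT"))  # ACGT
```
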
## Sampling

```
% seqrequester sample
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for sample mode:
  -paired             treat inputs as paired sequences; the first two files form the
                      first pair, and so on.

  -copies C           write C different copies of the sampling (without replacement).
  -output O           write output sequences to file O.  If paired, two files must be supplied.

  -coverage C         output C coverage of sequences, based on genome size G.
  -genomesize G

  -bases B            output B bases.

  -reads R            output R reads.
  -pairs P            output P pairs (only if -paired).

  -fraction F         output fraction F of the input bases.
```

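The size options are different ways of stating the same thing, a target number of bases: `-coverage C` with `-genomesize G` asks for roughly C × G bases, and `-fraction F` asks for F times the input bases. The Python sketch below shows that arithmetic together with one plausible way to sample without replacement. The selection strategy (shuffle, then take reads until the target is met) and all function names are assumptions, not seqrequester's algorithm.

```python
import random

def bases_for_coverage(coverage, genome_size):
    """Target number of bases for -coverage C with -genomesize G."""
    return int(coverage * genome_size)

def bases_for_fraction(fraction, reads):
    """Target number of bases for -fraction F."""
    return int(fraction * sum(len(r) for r in reads))

def sample_reads(reads, bases_wanted, seed=0):
    """Pick reads without replacement until at least bases_wanted bases are chosen.

    Returns the chosen read indices (in input order) and their total length.
    """
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)

    chosen, total = [], 0
    for i in order:
        if total >= bases_wanted:
            break
        chosen.append(i)
        total += len(reads[i])
    return sorted(chosen), total

reads = ["ACGT"] * 10                       # 40 input bases
target = bases_for_coverage(2, 10)          # 2x of a 10-base "genome" = 20 bases
chosen, total = sample_reads(reads, target)
print(len(chosen), total)                   # 5 20
```
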
## Generating

Undocumented.

## Simulating

```
% seqrequester simulate
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for simulate mode:
  -genome G           sample reads from these sequences
  -circular           treat the sequences in G as circular

  -genomesize g       genome size to use for deciding coverage below
  -coverage c         generate approximately c coverage of output
  -nreads n           generate exactly n reads of output
  -nbases n           generate approximately n bases of output

  -distribution F     generate read length by sampling the distribution in file F
                        one column  - each line is the length of a sequence
                        two columns - each line has the 'length' and 'number of sequences'

                      if file F doesn't exist, use a built-in distribution
                        ultra-long-nanopore
                        pacbio
                        pacbio-hifi

  -length min[-max]   (not implemented)
  -output x.fasta     (not implemented)
```
212 two columns - each line has the 'length' and 'number of sequences'
213
214 if file F doesn't exist, use a built-in distribution
215 ultra-long-nanopore
216 pacbio
217 pacbio-hifi
218
219 -length min[-max] (not implemented)
220 -output x.fasta (not implemented)
221 ```