comparison seqrequester/README.md @ 1:1085e094cf5f draft
planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/microsatbed commit 7ceb6658309a7ababe622b5d92e729e5470e22f0-dirty
author | fubar |
---|---|
date | Sat, 13 Jul 2024 12:39:06 +0000 |
# seqrequester

This is 'seqrequester', a tool for summarizing, extracting, generating and
modifying DNA sequences.

## Installation

### Dependencies
* GCC (tested with 11.3.0)
* git 2.25.1 or higher

### Make
```
git clone https://github.com/marbl/seqrequester.git
cd seqrequester/src
make -j 12
```

## Summarizing

The summarize mode generates a table of Nx lengths and an ASCII plot of
the histogram of sequence lengths, and reports GC content along with
mono-, di- and tri-nucleotide frequencies.

With `-split-n`, sequences are split at N bases before their lengths are computed.

With `-simple`, the output is a plain histogram listing each sequence length and the
number of sequences at that length; with `-lengths`, it is a list of all sequence lengths.

Input can be FASTA or FASTQ, uncompressed or compressed with
gzip, bzip2 or xz.

Only one report is generated, regardless of how many sequence files are supplied.
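
The length-based reports can be pictured with a short Python sketch. It is not seqrequester's implementation: the function names are hypothetical, and it assumes that splitting at N discards the N runs and any empty pieces, and that lowercase `n` also counts as N.

```python
import re
from collections import Counter

def sequence_lengths(sequences, split_n=False):
    """Return one length per sequence, or per N-free piece when split_n is set."""
    lengths = []
    for seq in sequences:
        if split_n:
            # Assumption: every maximal run of non-N bases counts as its own sequence.
            lengths.extend(len(piece) for piece in re.split(r"[Nn]+", seq) if piece)
        else:
            lengths.append(len(seq))
    return lengths

def simple_histogram(lengths):
    """Return sorted (length, numSequences) pairs, like a 'length numSequences' listing."""
    return sorted(Counter(lengths).items())

seqs = ["ACGTNNNACG", "ACGT", "GGGG"]
print(sequence_lengths(seqs))                                  # [10, 4, 4]
print(sequence_lengths(seqs, split_n=True))                    # [4, 3, 4, 4]
print(simple_histogram(sequence_lengths(seqs, split_n=True)))  # [(3, 1), (4, 3)]
```

Pooling the lengths from every input before building the histogram mirrors the single report produced for multiple files.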

```
% seqrequester summarize
usage: seqrequester [mode] [options] [sequence_file ...]

OPTIONS for summarize mode:
  -size          base size to use for N50 statistics
  -1x            limit NG table to 1x coverage

  -split-n       split sequences at N bases before computing length
  -simple        output a simple 'length numSequences' histogram
  -lengths       output a list of the sequence lengths

  -assequences   load data as complete sequences (for testing)
  -asbases       load data as blocks of bases    (for testing)
```

```
% seqrequester summarize /archive/mothra/FLX/*gz

G=6462464889                       sum of  ||               length     num
NG          length     index      lengths  ||                range    seqs
----- ------------ --------- ------------  ||  ------------------- -------
00010          652    801160    646246790  ||               42-112    4768|-
00020          582   1862887   1292493013  ||              113-183   16961|-
00030          555   3002684   1938739802  ||              184-254   89381|--
00040          538   4186751   2584986254  ||              255-325  536862|--------
00050          523   5405461   3231232945  ||              326-396 1463599|--------------------
00060          509   6657839   3877479295  ||              397-467 1960924|---------------------------
00070          488   7952426   4523725460  ||              468-538 4616863|---------------------------------------------------------------
00080          447   9329218   5169971940  ||              539-609 2858982|----------------------------------------
00090          389  10872299   5816218777  ||              610-680  625376|---------
00100           42  12803136   6462464889  ||              681-751  252454|----
001.000x            12803137   6462464889  ||              752-822  134849|--
                                           ||              823-893   78435|--
                                           ||              894-964   47976|-
                                           ||             965-1035   30852|-
                                           ||            1036-1106   21127|-
                                           ||            1107-1177   14817|-
                                           ||            1178-1248   28461|-
                                           ||            1249-1319    4930|-
                                           ||            1320-1390    3655|-
                                           ||            1391-1461    2657|-
                                           ||            1462-1532    2120|-
                                           ||            1533-1603    1597|-
                                           ||            1604-1674    1268|-
                                           ||            1675-1745     953|-
                                           ||            1746-1816     766|-
                                           ||            1817-1887     573|-
                                           ||            1888-1958     443|-
                                           ||            1959-2029     344|-
                                           ||            2030-2100    1022|-
                                           ||            2101-2171      21|-
                                           ||            2172-2242      23|-
                                           ||            2243-2313      20|-
                                           ||            2314-2384      17|-
                                           ||            2385-2455       8|-
                                           ||            2456-2526       9|-
                                           ||            2527-2597       4|-
                                           ||            2598-2668       2|-
                                           ||            2669-2739       6|-
                                           ||            2740-2810       2|-
                                           ||            2811-2881       6|-
                                           ||            2882-2952       1|-
                                           ||            2953-3023       0|
                                           ||            3024-3094       0|
                                           ||            3095-3165       1|-
                                           ||            3166-3236       0|
                                           ||            3237-3307       0|
                                           ||            3308-3378       0|
                                           ||            3379-3449       1|-
                                           ||            3450-3520       0|
                                           ||            3521-3591       1|-

--------------------- --------------------- ----------------------------------------------------------------------------------------------
mononucleotide        dinucleotide          trinucleotide
--------------------- --------------------- ----------------------------------------------------------------------------------------------
1959571306 0.3032   A  665030151 0.1031  AA  237235545 0.0369 AAA   132268487 0.0205 AAC   136675399 0.0212 AAG   158473516 0.0246 AAT
1247489432 0.1930   C  389352138 0.0604  AC  115665542 0.0180 ACA    87346626 0.0136 ACC    70986769 0.0110 ACG   114582435 0.0178 ACT
1345011807 0.2081   G  397219280 0.0616  AG  121659180 0.0189 AGA    65811037 0.0102 AGC   102037062 0.0159 AGG   106854671 0.0166 AGT
1910392344 0.2956   T  507072196 0.0786  AT  152454159 0.0237 ATA    89877335 0.0140 ATC   106195089 0.0165 ATG   158544503 0.0246 ATT
                       380831936 0.0590  CA  132169383 0.0205 CAA    76839888 0.0119 CAC    67197045 0.0104 CAG   104566859 0.0162 CAT
  --GC--  --AT--       281892951 0.0437  CC   86178881 0.0134 CCA    65022575 0.0101 CCC    50576089 0.0079 CCG    79660170 0.0124 CCT
  40.12%  59.88%       208535008 0.0323  CG   60164341 0.0093 CGA    27649662 0.0043 CGC    52322022 0.0081 CGG    67296554 0.0105 CGT
                       374626420 0.0581  CT   95122699 0.0148 CTA    75643338 0.0118 CTC    74304266 0.0115 CTG   129554475 0.0201 CTT
                       383528854 0.0595  GA  128291282 0.0199 GAA    70244915 0.0109 GAC    88104990 0.0137 GAG    96746854 0.0150 GAT
                       218253748 0.0338  GC   72696062 0.0113 GCA    51118632 0.0079 GCC    27797659 0.0043 GCG    66512705 0.0103 GCT
                       361154273 0.0560  GG   87662449 0.0136 GGA    51591104 0.0080 GGC   124820908 0.0194 GGG    89162545 0.0139 GGT
                       371793122 0.0576  GT  112773785 0.0175 GTA    61960408 0.0096 GTC    66540524 0.0103 GTG   130508651 0.0203 GTT
                       526153182 0.0816  TA  165978181 0.0258 TAA   109335141 0.0170 TAC   104450634 0.0162 TAG   146068463 0.0227 TAT
                       355583693 0.0551  TC  105474843 0.0164 TCA    77928750 0.0121 TCC    58773061 0.0091 TCG   113158614 0.0176 TCT
                       375551419 0.0582  TG  113251416 0.0176 TGA    72683975 0.0113 TGC    81429936 0.0127 TGG   107781308 0.0167 TGT
                       653083381 0.1013  TT  164899014 0.0256 TTA   127390884 0.0198 TTC   127703170 0.0198 TTG   233082150 0.0362 TTT
```
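
To make the NG columns and the GC line easier to read, here is a minimal Python sketch of how such numbers can be computed. It is an illustration under stated assumptions, not the tool's code: `ng_table` and `gc_fraction` are hypothetical names, and the third value in each row is a count of sequences used so far, which may be numbered differently from the tool's `index` column.

```python
def ng_table(lengths, genome_size, steps=range(10, 101, 10)):
    """Return (NG, length, sequences_used, running_sum) rows, longest sequences first."""
    ordered = sorted(lengths, reverse=True)
    rows, total, used = [], 0, 0
    for ng in steps:
        # Add sequences until the running sum reaches ng% of the genome size.
        while used < len(ordered) and total * 100 < genome_size * ng:
            total += ordered[used]
            used += 1
        if total * 100 < genome_size * ng:
            break  # not enough sequence to reach this NG level
        rows.append((ng, ordered[used - 1], used, total))
    return rows

def gc_fraction(a, c, g, t):
    """GC content from mononucleotide counts."""
    return (c + g) / (a + c + g + t)

for row in ng_table([10, 8, 6, 4, 2], genome_size=30):
    print(row)

# Mononucleotide counts taken from the example report above.
print(f"{gc_fraction(1959571306, 1247489432, 1345011807, 1910392344):.2%}")  # 40.12%
```

In the example report the four mononucleotide counts add up to 6462464889, the same value shown as `G`, so the 100% row ends at the full sum of lengths.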

## Extracting

```
% seqrequester extract
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for extract mode:
  -bases     baselist   extract bases as specified in the 'list' from each sequence
  -sequences seqlist    extract ordinal sequences as specified in the 'list'

  -reverse              reverse the bases in the sequence
  -complement           complement the bases in the sequence
  -rc                   alias for -reverse -complement

  -compress             compress homopolymer runs to one base

  -upcase
  -downcase

  -length min-max       print sequence if it is at least 'min' bases and at most 'max' bases long

  a 'baselist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  bases are space-based; -bases 0-2,4 will print the bases between
  the first two spaces (the first two bases) and the base after the
  fourth space (the fifth base).

  a 'seqlist' is a set of integers formed from any combination
  of the following, separated by a comma:
       num       a single number
       bgn-end   a range of numbers:  bgn <= end
  sequences are 1-based; -sequences 1,3-5 will print the first, third,
  fourth and fifth sequences.
```
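
The two list conventions in the help text are easy to mix up, so the following Python sketch spells them out. All function names are hypothetical, and the behaviour is only what the help text describes: a space-based `bgn-end` maps to the slice `seq[bgn:end]`, a single space-based `num` is the base after that space, and sequence numbers are 1-based and inclusive. Returning the pieces as a list is a choice made here, not a statement about how the tool prints them.

```python
from itertools import groupby

def parse_list(spec):
    """Parse comma-separated 'num' and 'bgn-end' items into (bgn, end, is_range)."""
    items = []
    for part in spec.split(","):
        bgn, dash, end = part.partition("-")
        if dash:
            bgn, end = int(bgn), int(end)
            if bgn > end:
                raise ValueError(f"{part!r}: a range needs bgn <= end")
            items.append((bgn, end, True))
        else:
            items.append((int(bgn), int(bgn), False))
    return items

def extract_bases(seq, baselist):
    """Space-based: 'bgn-end' covers seq[bgn:end]; 'num' is the base after space num."""
    return [seq[b:e] if is_range else seq[b:b + 1]
            for b, e, is_range in parse_list(baselist)]

def select_sequences(seqs, seqlist):
    """1-based and inclusive: '3-5' selects the third through fifth sequences."""
    picked = []
    for b, e, _ in parse_list(seqlist):
        picked.extend(seqs[b - 1:e])
    return picked

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse (what -rc is described as an alias for)."""
    return seq.translate(COMPLEMENT)[::-1]

def compress_homopolymers(seq):
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))

print(extract_bases("ACGTACGT", "0-2,4"))         # ['AC', 'A']
print(select_sequences(list("abcdef"), "1,3-5"))  # ['a', 'c', 'd', 'e']
print(reverse_complement("AACG"))                 # CGTT
print(compress_homopolymers("AAACCGTTT"))         # ACGT
```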

## Sampling

```
% seqrequester sample
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for sample mode:
  -paired             treat inputs as paired sequences; the first two files form the
                      first pair, and so on.

  -copies C           write C different copies of the sampling (without replacement).
  -output O           write output sequences to file O.  If paired, two files must be supplied.

  -coverage C         output C coverage of sequences, based on genome size G.
  -genomesize G

  -bases B            output B bases.

  -reads R            output R reads.
  -pairs P            output P pairs (only if -paired).

  -fraction F         output fraction F of the input bases.
```
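
The size options all reduce to a target amount of output. As a rough Python sketch of one way to think about it: coverage C with genome size G corresponds to about C × G bases, and reads are then drawn without replacement until that target is met. The function names, the stopping rule (stop once the target is reached) and the `seed` parameter are assumptions made for this illustration, not documented behaviour of the tool.

```python
import random

def target_bases(coverage, genome_size):
    """Bases needed for the requested coverage of a genome of the given size."""
    return int(coverage * genome_size)

def sample_reads(reads, n_bases, seed=None):
    """Pick reads in random order, without replacement, until n_bases are collected."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)
    picked, total = [], 0
    for i in order:
        if total >= n_bases:
            break
        picked.append(reads[i])
        total += len(reads[i])
    return picked

reads = ["A" * 100] * 50                  # 5000 bases of toy input
wanted = target_bases(2, 1000)            # 2x coverage of a 1000-base genome
subset = sample_reads(reads, wanted, seed=1)
print(wanted, len(subset))                # 2000 20
```

A fraction F would work the same way, with the target set to F times the total number of input bases.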

## Generating

Undocumented.

## Simulating

```
% seqrequester simulate
usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]

OPTIONS for simulate mode:
  -genome G           sample reads from these sequences
  -circular           treat the sequences in G as circular

  -genomesize g       genome size to use for deciding coverage below
  -coverage c         generate approximately c coverage of output
  -nreads n           generate exactly n reads of output
  -nbases n           generate approximately n bases of output

  -distribution F     generate read length by sampling the distribution in file F
                        one column  - each line is the length of a sequence
                        two columns - each line has the 'length' and 'number of sequences'

                      if file F doesn't exist, use a built-in distribution
                        ultra-long-nanopore
                        pacbio
                        pacbio-hifi

  -length min[-max]   (not implemented)
  -output x.fasta     (not implemented)
```
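
The one- and two-column distribution formats can be illustrated with a small Python sketch. It only mirrors the help text: each line contributes a length, weighted by the second column when present. Everything else is an assumption made for the example, including the function names, uniform start positions, truncating reads that are longer than a linear genome, and ignoring strand.

```python
import random

def load_distribution(text):
    """Parse one- or two-column text into parallel (lengths, weights) lists."""
    lengths, weights = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        lengths.append(int(fields[0]))
        # One column: each line is one sequence. Two columns: 'length count'.
        weights.append(int(fields[1]) if len(fields) > 1 else 1)
    return lengths, weights

def simulate_reads(genome, lengths, weights, n_reads, circular=False, seed=None):
    """Draw exactly n_reads substrings whose lengths follow the distribution."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        length = rng.choices(lengths, weights=weights)[0]
        if circular:
            # Repeat the genome so a read can run past the end and wrap around.
            start = rng.randrange(len(genome))
            wrapped = genome * (length // len(genome) + 2)
            reads.append(wrapped[start:start + length])
        else:
            length = min(length, len(genome))
            start = rng.randrange(len(genome) - length + 1)
            reads.append(genome[start:start + length])
    return reads

lengths, weights = load_distribution("5 3\n8 1\n")
print(lengths, weights)                   # [5, 8] [3, 1]
for read in simulate_reads("ACGTTGCAAC", lengths, weights, n_reads=4, seed=7):
    print(read)
```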