microsatbed: seqrequester/README.md comparison

comparison seqrequester/README.md @ 1:1085e094cf5f draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/microsatbed commit 7ceb6658309a7ababe622b5d92e729e5470e22f0-dirty

author	fubar
date	Sat, 13 Jul 2024 12:39:06 +0000
parents
children

comparison

equal deleted inserted replaced

-:dd71d3167476
+:1085e094cf5f
+# seqrequester
+This is 'seqrequester', a tool for summarizing, extracting, generating and
+modifying DNA sequences.
+## Installation
+### Dependency
+* GCC (tested with 11.3.0)
+* git 2.25.1 or higher
+### Make
+```
+git clone https://github.com/marbl/seqrequester.git
+cd seqrequester/src
+make -j 12
+```
+## Summarizing
+The summarize mode will generate a table of Nx lengths, a lovely ASCII
+plot of the histogram of sequence lengths, report GC content, and di-
+and tri-nucleotide frequencies.
+It can optionally split sequences at N's before computing the length of a sequence.
+You can also get a simple histogram of the sequence lengths and the number of sequences
+at each length, or just a simple list of all sequence lengths.
+It will, of course, read FASTA and FASTQ, uncompressed or compressed with
+gzip, bzip2 or xz.
+Only one report is generated, regardless of how many sequence files are supplied.
+```
+% seqrequester summarize
+usage: seqrequester [mode] [options] [sequence_file ...]
+OPTIONS for summarize mode:
+-size          base size to use for N50 statistics
+-1x            limit NG table to 1x coverage
+-split-n       split sequences at N bases before computing length
+-simple        output a simple 'length numSequences' histogram
+-lengths       output a list of the sequence lengths
+-assequences   load data as complete sequences (for testing)
+-asbases       load data as blocks of bases    (for testing)
+```
+```
+% seqrequester summarize /archive/mothra/FLX/*gz
+G=6462464889                       sum of  ||               length     num
+NG         length     index       lengths  ||                range    seqs
+----- ------------ --------- ------------  ||  ------------------- -------
+00010          652    801160    646246790  ||         42-112          4768|-
+00020          582   1862887   1292493013  ||        113-183         16961|-
+00030          555   3002684   1938739802  ||        184-254         89381|--
+00040          538   4186751   2584986254  ||        255-325        536862|--------
+00050          523   5405461   3231232945  ||        326-396       1463599|--------------------
+00060          509   6657839   3877479295  ||        397-467       1960924|---------------------------
+00070          488   7952426   4523725460  ||        468-538       4616863|---------------------------------------------------------------
+00080          447   9329218   5169971940  ||        539-609       2858982|----------------------------------------
+00090          389  10872299   5816218777  ||        610-680        625376|---------
+00100           42  12803136   6462464889  ||        681-751        252454|----
+001.000x            12803137   6462464889  ||        752-822        134849|--
+||        823-893         78435|--
+||        894-964         47976|-
+||        965-1035        30852|-
+||       1036-1106        21127|-
+||       1107-1177        14817|-
+||       1178-1248        28461|-
+||       1249-1319         4930|-
+||       1320-1390         3655|-
+||       1391-1461         2657|-
+||       1462-1532         2120|-
+||       1533-1603         1597|-
+||       1604-1674         1268|-
+||       1675-1745          953|-
+||       1746-1816          766|-
+||       1817-1887          573|-
+||       1888-1958          443|-
+||       1959-2029          344|-
+||       2030-2100         1022|-
+||       2101-2171           21|-
+||       2172-2242           23|-
+||       2243-2313           20|-
+||       2314-2384           17|-
+||       2385-2455            8|-
+||       2456-2526            9|-
+||       2527-2597            4|-
+||       2598-2668            2|-
+||       2669-2739            6|-
+||       2740-2810            2|-
+||       2811-2881            6|-
+||       2882-2952            1|-
+||       2953-3023            0|
+||       3024-3094            0|
+||       3095-3165            1|-
+||       3166-3236            0|
+||       3237-3307            0|
+||       3308-3378            0|
+||       3379-3449            1|-
+||       3450-3520            0|
+||       3521-3591            1|-
+--------------------- --------------------- ----------------------------------------------------------------------------------------------
+mononucleotide          dinucleotide                                                                                  trinucleotide
+--------------------- --------------------- ----------------------------------------------------------------------------------------------
+1959571306 0.3032 A   665030151 0.1031 AA   237235545 0.0369 AAA    132268487 0.0205 AAC    136675399 0.0212 AAG    158473516 0.0246 AAT
+1247489432 0.1930 C   389352138 0.0604 AC   115665542 0.0180 ACA     87346626 0.0136 ACC     70986769 0.0110 ACG    114582435 0.0178 ACT
+1345011807 0.2081 G   397219280 0.0616 AG   121659180 0.0189 AGA     65811037 0.0102 AGC    102037062 0.0159 AGG    106854671 0.0166 AGT
+1910392344 0.2956 T   507072196 0.0786 AT   152454159 0.0237 ATA     89877335 0.0140 ATC    106195089 0.0165 ATG    158544503 0.0246 ATT
+380831936 0.0590 CA   132169383 0.0205 CAA     76839888 0.0119 CAC     67197045 0.0104 CAG    104566859 0.0162 CAT
+--GC--  --AT--    281892951 0.0437 CC    86178881 0.0134 CCA     65022575 0.0101 CCC     50576089 0.0079 CCG     79660170 0.0124 CCT
+40.12%  59.88%    208535008 0.0323 CG    60164341 0.0093 CGA     27649662 0.0043 CGC     52322022 0.0081 CGG     67296554 0.0105 CGT
+374626420 0.0581 CT    95122699 0.0148 CTA     75643338 0.0118 CTC     74304266 0.0115 CTG    129554475 0.0201 CTT
+383528854 0.0595 GA   128291282 0.0199 GAA     70244915 0.0109 GAC     88104990 0.0137 GAG     96746854 0.0150 GAT
+218253748 0.0338 GC    72696062 0.0113 GCA     51118632 0.0079 GCC     27797659 0.0043 GCG     66512705 0.0103 GCT
+361154273 0.0560 GG    87662449 0.0136 GGA     51591104 0.0080 GGC    124820908 0.0194 GGG     89162545 0.0139 GGT
+371793122 0.0576 GT   112773785 0.0175 GTA     61960408 0.0096 GTC     66540524 0.0103 GTG    130508651 0.0203 GTT
+526153182 0.0816 TA   165978181 0.0258 TAA    109335141 0.0170 TAC    104450634 0.0162 TAG    146068463 0.0227 TAT
+355583693 0.0551 TC   105474843 0.0164 TCA     77928750 0.0121 TCC     58773061 0.0091 TCG    113158614 0.0176 TCT
+375551419 0.0582 TG   113251416 0.0176 TGA     72683975 0.0113 TGC     81429936 0.0127 TGG    107781308 0.0167 TGT
+653083381 0.1013 TT   164899014 0.0256 TTA    127390884 0.0198 TTC    127703170 0.0198 TTG    233082150 0.0362 TTT
+```
+# Extracting
+```
+% seqrequester extract
+usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]
+OPTIONS for extract mode:
+-bases     baselist extract bases as specified in the 'list' from each sequence
+-sequences seqlist  extract ordinal sequences as specified in the 'list'
+-reverse            reverse the bases in the sequence
+-complement         complement the bases in the sequence
+-rc                 alias for -reverse -complement
+-compress           compress homopolymer runs to one base
+-upcase
+-downcase
+-length min-max     print sequence if it is at least 'min' bases and at most 'max' bases long
+a 'baselist' is a set of integers formed from any combination
+of the following, seperated by a comma:
+num       a single number
+bgn-end   a range of numbers:  bgn <= end
+bases are spaced-based; -bases 0-2,4 will print the bases between
+the first two spaces (the first two bases) and the base after the
+fourth space (the fifth base).
+a 'seqlist' is a set of integers formed from any combination
+of the following, seperated by a comma:
+num       a single number
+bgn-end   a range of numbers:  bgn <= end
+sequences are 1-based; -sequences 1,3-5 will print the first, third,
+fourth and fifth sequences.
+```
+# Sampling
+```
+% seqrequester sample
+usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]
+OPTIONS for sample mode:
+-paired             treat inputs as paired sequences; the first two files form the
+first pair, and so on.
+-copies C           write C different copies of the sampling (without replacement).
+-output O           write output sequences to file O.  If paired, two files must be supplied.
+-coverage C         output C coverage of sequences, based on genome size G.
+-genomesize G
+-bases B            output B bases.
+-reads R            output R reads.
+-pairs P            output P pairs (only if -paired).
+-fraction F         output fraction F of the input bases.
+```
+# Generating
+Undocumented.
+# Simulating
+```
+seqrequester simulate
+usage: ./FreeBSD-amd64/bin/seqrequester [mode] [options] [sequence_file ...]
+OPTIONS for simulate mode:
+-genome G           sample reads from these sequences
+-circular           treat the sequences in G as circular
+-genomesize g       genome size to use for deciding coverage below
+-coverage c         generate approximately c coverage of output
+-nreads n           generate exactly n reads of output
+-nbases n           generate approximately n bases of output
+-distribution F     generate read length by sampling the distribution in file F
+one column  - each line is the length of a sequence
+two columns - each line has the 'length' and 'number of sequences'
+if file F doesn't exist, use a built-in distribution
+ultra-long-nanopore
+pacbio
+pacbio-hifi
+-length min[-max]   (not implemented)
+-output x.fasta     (not implemented)
+```

Mercurial > repos > fubar > microsatbed

comparison seqrequester/README.md @ 1:1085e094cf5f draft