comparison GEMBASSY-1.0.3/doc/text/gshuffleseq.txt @ 0:8300eb051bea draft

Initial upload
author ktnyt
date Fri, 26 Jun 2015 05:19:29 -0400
1 gshuffleseq
2 Function
3
4 Create randomized sequence with conserved k-mer composition
5
6 Description
7
8 gshuffleseq shuffles and randomizes the given sequence, conserving the
9 nucleotide/peptide k-mer content of the original sequence.
10
11 For k=1, i.e. shuffling sequences while preserving only the single-residue
12 composition, the Fisher-Yates algorithm is employed.
13 For k>1, shuffling preserves the counts of all j-mers for j = 1 to k. For
14 example, k=3 preserves the triplet, doublet, and single-nucleotide composition.
15 The algorithm for k-mer-preserving shuffling is non-trivial; it is solved
16 with a graph-theoretical approach using random Eulerian walks in the graph of
17 (k-1)-mers. See Jiang et al., Kandel et al., and Propp and Wilson for details
18 of this algorithm.
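
The k=1 case can be illustrated with a minimal Python sketch of the
Fisher-Yates shuffle, in the in-place form given by Durstenfeld (1964).
The function name and the rng parameter are illustrative assumptions and
are not taken from the gshuffleseq source.

```python
import random


def fisher_yates_shuffle(seq, rng=None):
    """Return a uniformly shuffled copy of seq (single-residue composition kept)."""
    rng = rng or random.Random()
    chars = list(seq)
    # Walk from the last position down to 1, swapping each position
    # with a randomly chosen position at or before it.
    for i in range(len(chars) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)
```

Because the shuffle only swaps positions, the output is always a
permutation of the input, so residue counts are unchanged.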
19
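For k>1, the Eulerian-walk idea can be sketched in Python as follows.
This is a simplified illustration in the spirit of the approach described
by Kandel et al. and Jiang et al., not the gshuffleseq implementation;
the function names are hypothetical. Each k-mer occurrence becomes an
edge between (k-1)-mers, a random tree of "last exit" edges toward the
final (k-1)-mer is drawn with a loop-erased random walk (Propp and
Wilson), and the remaining exits of every vertex are shuffled.

```python
import random
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shuffle_preserving_kmers(seq, k, rng=None):
    """Return a random sequence with the same k-mer counts as seq (k >= 2)."""
    if k < 2:
        raise ValueError("use a plain Fisher-Yates shuffle for k = 1")
    rng = rng or random.Random()
    n, m = len(seq), k - 1
    if n <= k:
        return seq  # at most one k-mer: nothing to rearrange

    # Multigraph on (k-1)-mers: one edge per k-mer occurrence in seq.
    out = defaultdict(list)
    for i in range(n - m):
        out[seq[i:i + m]].append(seq[i + 1:i + 1 + m])
    start, root = seq[:m], seq[n - m:]

    # Loop-erased random walk: choose a "last exit" edge for every
    # non-root vertex so that these edges form a tree directed to root.
    last_exit, in_tree = {}, {root}
    for v in list(out):
        u = v
        while u not in in_tree:
            last_exit[u] = rng.choice(out[u])
            u = last_exit[u]
        u = v
        while u not in in_tree:
            in_tree.add(u)
            u = last_exit[u]

    # Shuffle the other exits of each vertex; the tree edge is used last.
    for v, targets in out.items():
        if v != root:
            targets.remove(last_exit[v])
        rng.shuffle(targets)
        if v != root:
            targets.append(last_exit[v])

    # Follow the exits in order; this traces an Eulerian trail start -> root.
    result, cur, used = [start], start, Counter()
    for _ in range(n - m):
        nxt = out[cur][used[cur]]
        used[cur] += 1
        result.append(nxt[-1])
        cur = nxt
    return "".join(result)
```

Comparing kmer_counts(original, j) with kmer_counts(shuffled, j) for
j = 1 to k is a quick way to check that the composition was preserved.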
20 The G-language SOAP service is provided by the
21 Institute for Advanced Biosciences, Keio University.
22 The original web service is located at the following URL:
23
24 http://www.g-language.org/wiki/soap
25
26 The WSDL (RPC/Encoded) file is located at:
27
28 http://soap.g-language.org/g-language.wsdl
29
30 Documentation on G-language Genome Analysis Environment methods is
31 provided at the Document Center:
32
33 http://ws.g-language.org/gdoc/
34
35 Usage
36
37 Here is a sample session with gshuffleseq
38
39 % gshuffleseq tsw:hbb_human
40 Create randomized sequence with conserved k-mer composition
41 output sequence [hbb_human.fasta]:
42
45
46 Command line arguments
47
48 Standard (Mandatory) qualifiers:
49   [-sequence]    seqall   Sequence(s) filename and optional format, or
50                           reference (input USA)
51   [-outseq]      seqout   [<sequence>.<format>] Sequence filename and
52                           optional format (output USA)
53
54 Additional (Optional) qualifiers: (none)
55 Advanced (Unprompted) qualifiers:
56   -k             integer  [1] Sequence k-mer to preserve composition
57                           (Any integer value)
58
59 Associated qualifiers:
60
61 "-sequence" associated qualifiers
62   -sbegin1       integer  Start of each sequence to be used
63   -send1         integer  End of each sequence to be used
64   -sreverse1     boolean  Reverse (if DNA)
65   -sask1         boolean  Ask for begin/end/reverse
66   -snucleotide1  boolean  Sequence is nucleotide
67   -sprotein1     boolean  Sequence is protein
68   -slower1       boolean  Make lower case
69   -supper1       boolean  Make upper case
70   -scircular1    boolean  Sequence is circular
71   -sformat1      string   Input sequence format
72   -iquery1       string   Input query fields or ID list
73   -ioffset1      integer  Input start position offset
74   -sdbname1      string   Database name
75   -sid1          string   Entryname
76   -ufo1          string   UFO features
77   -fformat1      string   Features format
78   -fopenfile1    string   Features file name
79
80 "-outseq" associated qualifiers
81   -osformat2     string   Output seq format
82   -osextension2  string   File name extension
83   -osname2       string   Base file name
84   -osdirectory2  string   Output directory
85   -osdbname2     string   Database name to add
86   -ossingle2     boolean  Separate file for each entry
87   -oufo2         string   UFO features
88   -offormat2     string   Features format
89   -ofname2       string   Features file name
90   -ofdirectory2  string   Output directory
91
92 General qualifiers:
93   -auto          boolean  Turn off prompts
94   -stdout        boolean  Write first file to standard output
95   -filter        boolean  Read first file from standard input, write
96                           first file to standard output
97   -options       boolean  Prompt for standard and additional values
98   -debug         boolean  Write debug output to program.dbg
99   -verbose       boolean  Report some/full command line options
100  -help          boolean  Report command line options and exit. More
101                          information on associated and general
102                          qualifiers can be found with -help -verbose
103  -warning       boolean  Report warnings
104  -error         boolean  Report errors
105  -fatal         boolean  Report fatal errors
106  -die           boolean  Report dying program messages
107  -version       boolean  Report version number and exit
108
109 Input file format
110
111 The database definitions for the following commands are available at
112 http://soap.g-language.org/kbws/embossrc
113
114 gshuffleseq reads one or more nucleotide or protein sequences.
115
116 Output file format
117
118 The output from gshuffleseq is a sequence file (FASTA format in this example).
119
120 File: hbb_human.fasta
121
122 >HBB_HUMAN P68871 Hemoglobin subunit beta (Beta-globin) (Hemoglobin beta chain) (LVV-hemorphin-7)
123 KGWLDLVAGAAHFVRRLKMLLEVDWAAHEERVGTSNPNNALKNEAADVEVHSPTHVNPTQ
124 LVLVQVGFGTLHLQGVECPKPKPGGVALKPVAHLLAMKECTLVALGSDFYVDHGSDGEDK
125 GFKAYVLATSFFAYTNFLHGKVKHVLF
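
A shuffled FASTA file such as the one above can be checked against its
input with a few lines of Python. The sketch below is only an
illustration; read_fasta and same_composition are hypothetical helper
names, not part of gshuffleseq or EMBOSS.

```python
from collections import Counter


def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def same_composition(a, b):
    """True if both sequences contain each residue the same number of times."""
    return Counter(a) == Counter(b)
```

With the default k=1, same_composition(original, shuffled) is expected to
hold for every input/output record pair.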
126
127
128 Data files
129
130 None.
131
132 Notes
133
134 None.
135
136 References
137
138 Fisher R.A. and Yates F. (1938) "Example 12", Statistical Tables, London
139
140 Durstenfeld R. (1964) "Algorithm 235: Random permutation", CACM 7(7):420
141
142 Jiang M., Anderson J., Gillespie J., and Mayne M. (2008) "uShuffle:
143 a useful tool for shuffling biological sequences while preserving the
144 k-let counts", BMC Bioinformatics 9:192
145
146 Kandel D., Matias Y., Unger R., and Winkler P. (1996) "Shuffling biological
147 sequences", Discrete Applied Mathematics 71(1-3):171-185
148
149 Propp J.G. and Wilson D.B. (1998) "How to get a perfectly random sample
150 from a generic Markov chain and generate a random spanning tree of a
151 directed graph", Journal of Algorithms 27(2):170-217
152
153 Arakawa, K., Mori, K., Ikeda, K., Matsuzaki, T., Kobayashi, Y., and
154 Tomita, M. (2003) G-language Genome Analysis Environment: A Workbench
155 for Nucleotide Sequence Data Mining, Bioinformatics, 19, 305-306.
156
157 Arakawa, K. and Tomita, M. (2006) G-language System as a Platform for
158 large-scale analysis of high-throughput omics data, J. Pest Sci.,
159 31, 7.
160
161 Arakawa, K., Kido, N., Oshita, K., Tomita, M. (2010) G-language Genome
162 Analysis Environment with REST and SOAP Web Service Interfaces,
163 Nucleic Acids Res., 38, W700-W705.
164
165 Warnings
166
167 None.
168
169 Diagnostic Error Messages
170
171 None.
172
173 Exit status
174
175 It always exits with a status of 0.
176
177 Known bugs
178
179 None.
180
181 See also
182
183 shuffleseq Shuffles a set of sequences maintaining composition
184
185 Author(s)
186
187 Hidetoshi Itaya (celery@g-language.org)
188 Institute for Advanced Biosciences, Keio University
189 252-0882 Japan
190
191 Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
192 Institute for Advanced Biosciences, Keio University
193 252-0882 Japan
194
195 History
196
197 2012 - Written by Hidetoshi Itaya
198 2013 - Fixed by Hidetoshi Itaya
199
200 Target users
201
202 This program is intended to be used by everyone and everything, from
203 naive users to embedded scripts.
204
205 Comments
206
207 None.
208