genret

 

Function

Retrieves various gene related information from genome flatfile

Description

genret reads in one or more genome flatfiles and retrieves various data from
the input file. It is a wrapper program to the G-language REST service,
where a method is specified by giving a string to the "method" qualifier. By
default, genret will parse the input file to retrieve the accession ID
(or name) of the genome to query G-language REST service. By setting the
"accid" qualifier to false (or 0), genret will instead parse the sequence
and features of the genome to create a GenBank formatted flatfile and upload
the file to the G-language web server. Using the file uploaded, genret will
execute the method provided.

genret is able to perform a variety of tasks, incluing the retrieval of
sequence upstream, downstream, or around the start or stop codon,
translated gene sequences search of gene data by keyword.

Details on G-language REST service is available from the wiki page

http://www.g-language.org/wiki/rest

Documentation on G-language Genome Analysis Environment methods are
provided at the Document Center

http://ws.g-language.org/gdoc/

Usage

Here is a sample session with genret

Retrieving sequences upstream, downstream, or around the start/stop codons. The following example shows the retrieval of sequence around the start codons of all genes.

Genes to access are specified by regular expression. '*' stands for every gene.

Available methods are:
after_startcodon
after_stopcodon
around_startcodon
around_stopcodon
before_startcodon
before_stopcodon


% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]:
Feature to access: around_startcodon
Full text output file [nc_000913.around_startcodon]:

Go to the input files for this example
Go to the output files for this example

Example 2

Using flat text as target genes. The names can be split with with a space, comma, or vertical bar.


% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
List of gene name(s) to report [*]: recA,recB
Name of gene feature to access: translation
Sequence output file [nc_000913.translation.genret]: stdout
>recA
MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF
>recB
MSDVAETLDPLRLPLQGERLIEASAGTGKTFTIAALYLRLLLGLGGSAAFPRPLTVEELLV
VTFTEAATAELRGRIRSNIHELRIACLRETTDNPLYERLLEEIDDKAQAAQWLLLAERQMD
EAAVFTIHGFCQRMLNLNAFESGMLFEQQLIEDESLLRYQACADFWRRHCYPLPREIAQVV
FETWKGPQALLRDINRYLQGEAPVIKAPPPDDETLASRHAQIVARIDTVKQQWRDAVGELD
ALIESSGIDRRKFNRSNQAKWIDKISAWAEEETNSYQLPESLEKFSQRFLEDRTKAGGETP
RHPLFEAIDQLLAEPLSIRDLVITRALAEIRETVAREKRRRGELGFDDMLSRLDSALRSES
GEVLAAAIRTRFPVAMIDEFQDTDPQQYRIFRRIWHHQPETALLLIGDPKQAIYAFRGADI
FTYMKARSEVHAHYTLDTNWRSAPGMVNSVNKLFSQTDDAFMFREIPFIPVKSAGKNQALR
FVFKGETQPAMKMWLMEGESCGVGDYQSTMAQVCAAQIRDWLQAGQRGEALLMNGDDARPV
RASDISVLVRSRQEAAQVRDALTLLEIPSVYLSNRDSVFETLEAQEMLWLLQAVMTPEREN
TLRSALATSMMGLNALDIETLNNDEHAWDVVVEEFDGYRQIWRKRGVMPMLRALMSARNIA
ENLLATAGGERRLTDILHISELLQEAGTQLESEHALVRWLSQHILEPDSNASSQQMRLESD
KHLVQIVTIHKSKGLEYPLVWLPFITNFRVQEQAFYHDRHSFEAVLDLNAAPESVDLAEAE
RLAEDLRLLYVALTRSVWHCSLGVAPLVRRRGDKKGDTDVHQSALGRLLQKGEPQDAAGLR
TCIEALCDDDIAWQTAQTGDNQPWQVNDVSTAELNAKTLQRLPGDNWRVTSYSGLQQRGHG
IAQDLMPRLDVDAAGVASVVEEPTLTPHQFPRGASPGTFLHSLFEDLDFTQPVDPNWVREK
LELGGFESQWEPVLTEWITAVLQAPLNETGVSLSQLSARNKQVEMEFYLPISEPLIASQLD
TLIRQFDPLSAGCPPLEFMQVRGMLKGFIDLVFRHEGRYYLLDYKSNWLGEDSSAYTQQAM
AAAMQAHRYDLQYQLYTLALHRYLRHRIADYDYEHHFGGVIYLFLRGVDKEHPQQGIYTTR
PNAGLIALMDEMFAGMTLEEA

Go to the input files for this example
Go to the output files for this example

Example 3

Using a file with a list of gene names. The following example will retrieve the strand direction for each gene listed in the "gene_list.txt" file. String prefixed with an "@" or "list::" will be interpreted as file names.


% genret
Retrieves various gene features from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
List of gene name(s) to report [*]: @gene_list.txt
Name of gene feature to access: direction
Full text output file [nc_000913.direction]: stdout
gene,direction
thrA,direct
thrB,direct
thrC,direct

Example 4

Retrieving translations of coding sequences.
The following example will retrieve the translated protein sequence of the "recA" gene.


% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]: recA
Feature to access: translation
Full text output file [nc_000913.translation]: stdout
>recA
MAIDENKQKALAAALGQIEKQFGKGSIMRLGEDRSMDVETISTGSLSLDIALGAGGLPMGR
IVEIYGPESSGKTTLTLQVIAAAQREGKTCAFIDAEHALDPIYARKLGVDIDNLLCSQPDT
GEQALEICDALARSGAVDVIVVDSVAALTPKAEIEGEIGDSHMGLAARMMSQAMRKLAGNL
KQSNTLLIFINQIRMKIGVMFGNPETTTGGNALKFYASVRLDIRRIGAVKEGENVVGSETR
VKVVKNKIAAPFKQAEFQILYGEGINFYGELVDLGVKEKLIEKAGAWYSYKGEKIGQGKAN
ATAWLKDNPETAKEIEKKVRELLLSNPNSTPDFSVDDSEGVAETNEDF

Go to the input files for this example
Go to the output files for this example

Example 5

Retrieving feature information of the genes.
The following example will retrieve the start positions for each gene. The values for the keys in GenBank format is available for retrieval. (ex. start end direction GO* etc.)
Positions will be returned with a 1 start value.


% genret
Retrieves various gene related information from genome flatfile
Input nucleotide sequence(s): refseqn:NC_000913
Gene name(s) to lookup [*]:
Feature to access: start
Full text output file [nc_000913.start]:

Go to the input files for this example
Go to the output files for this example

Example 6

Passing extra arguments to the methods.
The following example shows the retrieval of 30 base pairs around the start codon of the "recA" gene. By default, the "around_startcodon" method returns 200 base pairs around the start codon. Using the "-argument" qualifier allows the user to change this value.


% genret refseqn:NC_000913 recA around_startcodon -argument 30,30 stdout
Retrieves various gene features from genome flatfile
>recA
ccggtattacccggcatgacaggagtaaaaatggctatcgacgaaaacaaacagaaagcgt
tg

Go to the input files for this example
Go to the output files for this example

Example 7

Re-annotating a flatfile. genret supports re-annotation of a genome flatfile via Restauro-G service developed by our team. The original software is available at [http://restauro-g.iab.keio.ac.jp].


% genret refseqn:NC_000913 '*' annotate nc_000913-annotate.gbk
Retrieves various gene features from genome flatfile

Go to the input files for this example
Go to the output files for this example

Command line arguments

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-sequence]
(Parameter 1)
seqall Nucleotide sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
[-gene]
(Parameter 2)
string List of gene name(s) to report Any string *
[-access]
(Parameter 3)
string Name of gene feature to access Any word  
[-outfile]
(Parameter 4)
outfile Sequence output file Output file <*>.genret
Additional (Optional) qualifiers
(none)
Advanced (Unprompted) qualifiers
-argument string Extra arguments to pass to method Any string  
-[no]accid boolean Include to use sequence accession ID as query Boolean value Yes/No Yes

Input file format

Database definitions for the examples are included in the embossrc_template
file of the Keio Bioinformatcs Web Service (KBWS) package.

Input files for usage example 4

File: gene_list.txt

thrA
thrB
thrC

Output file format

Output files for usage example 1

File: nc_000913.around_startcodon

>thrL
cgtgagtaaattaaaattttattgacttaggtcactaaatactttaaccaatataggcata
gcgcacagacagataaaaattacagagtacacaacatccatgaaacgcattagcaccacca
ttaccaccaccatcaccattaccacaggtaacggtgcgggctgacgcgtacaggaaacaca
gaaaaaagcccgcacctgac
>thrA
aggtaacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgc
gggctttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcggcg
gtacatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctggaaagcaatgc
caggcaggggcaggtggcca

[Part of this file has been deleted for brevity]

>yjjY
tgcatgtttgctacctaaattgccaactaaatcgaaacaggaagtacaaaagtccctgacc
tgcctgatgcatgctgcaaattaacatgatcggcgtaacatgactaaagtacgtaattgcg
ttcttgatgcactttccatcaacgtcaacaacatcattagcttggtcgtgggtactttccc
tcaggacccgacagtgtcaa
>yjtD
tttttctgcgacttacgttaagaatttgtaaattcgcaccgcgtaataagttgacagtgat
cacccggttcgcggttatttgatcaagaagagtggcaatatgcgtataacgattattctgg
tcgcacccgccagagcagaaaatattggggcagcggcgcgggcaatgaaaacgatggggtt
tagcgatctgcggattgtcg

Output files for usage example 3

File: nc_000913.start

gene,start
thrL,190
thrA,337
thrB,2801
thrC,3734
yaaX,5234
yaaA,5683
yaaJ,6529
talB,8238
mog,9306

[Part of this file has been deleted for brevity]

yjjX,4631256
ytjC,4631820
rob,4632464
creA,4633544
creB,4634030
creC,4634719
creD,4636201
arcA,4637613
yjjY,4638425
yjtD,4638965

Output files for usage example 7

File: ecoli-annotate.gbk

LOCUS NC_000913 4639675 bp DNA circular BCT 25-OCT-2010
DEFINITION Escherichia coli str. K-12 substr. MG1655 chromosome, complete
genome.
ACCESSION NC_000913
VERSION NC_000913.2 GI:49175990
DBLINK Project: 57779
KEYWORDS .
SOURCE Escherichia coli str. K-12 substr. MG1655
ORGANISM Escherichia coli str. K-12 substr. MG1655
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;

[Part of this file has been deleted for brevity]

CDS 2801..3733
/EC_number="2.7.1.39"
/codon_start="1"
/db_xref="GI:16127997"
/db_xref="ASAP:ABE-0000010"
/db_xref="UniProtKB/Swiss-Prot:P00547"
/db_xref="ECOCYC:EG10999"
/db_xref="EcoGene:EG10999"
/db_xref="GeneID:947498"
/function="enzyme; Amino acid biosynthesis: Threonine"
/function="1.5.1.8 metabolism; building block
biosynthesis; amino acids; threonine"
/function="7.1 location of gene products; cytoplasm"
/gene="thrB"
/gene_synonym="ECK0003; JW0002"
/locus_tag="b0003"
/note="GO_component: GO:0005737 - cytoplasm; GO_process:
GO:0009088 - threonine biosynthetic process"
/product="homoserine kinase"
/protein_id="NP_414544.1"
/rs_com="FUNCTION: Catalyzes the ATP-dependent
phosphorylation of L- homoserine to L-homoserine
phosphate (By similarity)."
/rs_com="CATALYTIC ACTIVITY: ATP + L-homoserine = ADP +
O-phospho-L- homoserine."
/rs_com="PATHWAY: Amino-acid biosynthesis; L-threonine
biosynthesis; L- threonine from L-aspartate: step 4/5."
/rs_com="SUBCELLULAR LOCATION: Cytoplasm (Potential)."
/rs_com="SIMILARITY: Belongs to the GHMP kinase family.
Homoserine kinase subfamily."
/rs_des="RecName: Full=Homoserine kinase; Short=HK;
Short=HSK; EC=2.7.1.39;"
/rs_protein="Level 1: similar to KHSE_ECODH 1.7e-180"
/rs_xr="EMBL; CP000948; ACB01208.1; -; Genomic_DNA."
/rs_xr="RefSeq; YP_001728986.1; -."
/rs_xr="ProteinModelPortal; B1XBC8; -."
/rs_xr="SMR; B1XBC8; 2-308."
/rs_xr="EnsemblBacteria; EBESCT00000012034;
EBESCP00000011562; EBESCG00000011096."
/rs_xr="GeneID; 6058639; -."
/rs_xr="GenomeReviews; CP000948_GR; ECDH10B_0003."
/rs_xr="KEGG; ecd:ECDH10B_0003; -."
/rs_xr="HOGENOM; HBG646290; -."
/rs_xr="OMA; GSAHADN; -."
/rs_xr="ProtClustDB; PRK01212; -."
/rs_xr="BioCyc; ECOL316385:ECDH10B_0003-MONOMER; -."
/rs_xr="GO; GO:0005737; C:cytoplasm;
IEA:UniProtKB-SubCell."
/rs_xr="GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW."
/rs_xr="GO; GO:0004413; F:homoserine kinase activity;
IEA:EC."
/rs_xr="GO; GO:0009088; P:threonine biosynthetic process;
IEA:UniProtKB-KW."
/rs_xr="HAMAP; MF_00384; Homoser_kinase; 1; -."
/rs_xr="InterPro; IPR006204; GHMP_kinase."
/rs_xr="InterPro; IPR013750; GHMP_kinase_C."
/rs_xr="InterPro; IPR006203; GHMP_knse_ATP-bd_CS."
/rs_xr="InterPro; IPR000870; Homoserine_kin."
/rs_xr="InterPro; IPR020568; Ribosomal_S5_D2-typ_fold."
/rs_xr="InterPro; IPR014721;
Ribosomal_S5_D2-typ_fold_subgr."
/rs_xr="Gene3D; G3DSA:3.30.230.10;
Ribosomal_S5_D2-type_fold; 1."
/rs_xr="Pfam; PF08544; GHMP_kinases_C; 1."
/rs_xr="Pfam; PF00288; GHMP_kinases_N; 1."
/rs_xr="PIRSF; PIRSF000676; Homoser_kin; 1."
/rs_xr="PRINTS; PR00958; HOMSERKINASE."
/rs_xr="SUPFAM; SSF54211; Ribosomal_S5_D2-typ_fold; 1."
/rs_xr="TIGRFAMs; TIGR00191; thrB; 1."
/rs_xr="PROSITE; PS00627; GHMP_KINASES_ATP; 1."
/transl_table="11"
/translation="MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETF
SLNNLGRFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACS
VVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDI
ISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQ
PELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETA
QRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN"

[Part of this file has been deleted for brevity]

4639201 gcgcagtcgg gcgaaatatc attactacgc cacgccagtt gaactggtgc cgctgttaga
4639261 ggaaaaatct tcatggatga gccatgccgc gctggtgttt ggtcgcgaag attccgggtt
4639321 gactaacgaa gagttagcgt tggctgacgt tcttactggt gtgccgatgg tggcggatta
4639381 tccttcgctc aatctggggc aggcggtgat ggtctattgc tatcaattag caacattaat
4639441 acaacaaccg gcgaaaagtg atgcaacggc agaccaacat caactgcaag ctttacgcga
4639501 acgagccatg acattgctga cgactctggc agtggcagat gacataaaac tggtcgactg
4639561 gttacaacaa cgcctggggc ttttagagca acgagacacg gcaatgttgc accgtttgct
4639621 gcatgatatt gaaaaaaata tcaccaaata aaaaacgcct tagtaagtat ttttc
//

Data files

None.

Notes

None.

References

   Arakawa, K., Mori, K., Ikeda, K., Matsuzaki, T., Konayashi, Y., and
      Tomita, M. (2003) G-language Genome Analysis Environment: A Workbench
      for Nucleotide Sequence Data Mining, Bioinformatics, 19, 305-306.

   Arakawa, K. and Tomita, M. (2006) G-language System as a Platform for
      large-scale analysis of high-throughput omics data, J. Pest Sci.,
      31, 7.

   Arakawa, K., Kido, N., Oshita, K., Tomita, M. (2010) G-language Genome
      Analysis Environment with REST and SOAP Web Service Interfaces,
      Nucleic Acids Res., 38, W700-W705.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name Description
entret Retrieve sequence entries from flatfile databases and files
seqret Read and write (return) sequences

Author(s)

Hidetoshi Itaya (celery@g-language.org)
  Institute for Advanced Biosciences, Keio University
  252-0882 Japan

Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
  Institute for Advanced Biosciences, Keio University
  252-0882 Japan

History

2012 - Written by Hidetoshi Itaya

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scrips.

Comments

None.