What it does:
This tool uses Bio::SeqFeature::Tools::Unflattener and Bio::Tools::GFF to convert GenBank flatfiles to GFF3, with gene containment hierarchies mapped for optimal display in GBrowse.
The input files are assumed to be gzipped GenBank flatfiles for RefSeq contigs. The files may contain multiple GenBank records.
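As an illustration of the input format only (the script itself is Perl and reads records through BioPerl), here is a minimal Python sketch of walking a gzipped file that holds several GenBank records. It relies on the GenBank convention that each record ends with a line containing only `//`. The function names are hypothetical and do not come from the script.

```python
import gzip
import io


def iter_genbank_records(handle):
    """Yield each GenBank record as a string; a record ends at a '//' line."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []
    # Keep a trailing record that lacks its terminator, if it has any content.
    if any(text.strip() for text in lines):
        yield "".join(lines)


def iter_gzipped_genbank(path):
    """Open a gzipped flatfile in text mode and yield its records one by one."""
    with gzip.open(path, "rt") as handle:
        yield from iter_genbank_records(handle)


# Usage with an in-memory sample holding two tiny (incomplete) records.
sample = "LOCUS       REC_A\n//\nLOCUS       REC_B\n//\n"
records = list(iter_genbank_records(io.StringIO(sample)))
print(len(records))  # prints 2
```

Reading record by record keeps memory use bounded by the largest single record rather than by the whole file.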
Designed for RefSeq
This script is designed for RefSeq genomic sequence entries. It may work for third-party annotations, but this has not been tested. But see below: UniProt/Swiss-Prot input works, as does EMBL and possibly EMBL/Ensembl, if you don't mind some gene model unflattener errors (dgg).
G-R-P-E Gene Model
Don Gilbert reworked this script to produce GFF3 suited for loading into GMOD Chado databases.
This writes GFF with an alternate but useful gene model, instead of the consensus model for GFF3:
[ gene > mRNA > (exon, CDS, UTR) ]
The alternate model is
gene > mRNA > polypeptide > exon
which means the only feature with DNA bases is the exon. The others specify only location ranges on a genome. The exon is a child of both the mRNA and the protein/polypeptide.
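To make the containment hierarchy concrete, here is a minimal Python sketch that writes one gene in this layout as GFF3 lines. It is an illustration under stated assumptions, not the script's actual output: the IDs (`gene1`, `mrna1`, `prot1`), the source column value `example`, and the choice to link every level with `Parent` are all hypothetical. The exon lists both the mRNA and the polypeptide as parents, which GFF3 permits as a comma-separated `Parent` value.

```python
def gff3_line(seqid, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 line; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, "example", ftype, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)


def grpe_lines(seqid, strand, gene, mrna, protein, exons):
    """Build gene > mRNA > polypeptide > exon lines from (start, end) ranges."""
    lines = [
        gff3_line(seqid, "gene", gene[0], gene[1], strand, {"ID": "gene1"}),
        gff3_line(seqid, "mRNA", mrna[0], mrna[1], strand,
                  {"ID": "mrna1", "Parent": "gene1"}),
        # Assumption: the polypeptide is attached to the mRNA with Parent.
        gff3_line(seqid, "polypeptide", protein[0], protein[1], strand,
                  {"ID": "prot1", "Parent": "mrna1"}),
    ]
    for start, end in exons:
        # Each exon is a child of both the mRNA and the polypeptide.
        lines.append(gff3_line(seqid, "exon", start, end, strand,
                               {"Parent": "mrna1,prot1"}))
    return lines


lines = grpe_lines("chr1", "+", (100, 600), (100, 600), (150, 550),
                   [(100, 200), (300, 400), (500, 600)])
print("\n".join(lines))
```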
The protein/polypeptide feature is an important one, carrying all the annotations of the GenBank CDS feature: protein ID, translation, GO terms, and Dbxrefs to other proteins.
UTRs, introns, CDS-exons are all inferred from the primary exon bases inside/outside appropriate higher feature ranges. Other special gene model features remain the same.
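The inference described above can be sketched in Python as follows. This is a simplified illustration, not the logic of the script or of any downstream loader: it assumes 1-based inclusive coordinates, a single polypeptide range per mRNA, and hypothetical function names. Exon bases inside the polypeptide range become CDS segments, exon bases outside it become UTRs, and the gaps between consecutive exons become introns.

```python
def split_exons(exons, cds_start, cds_end, strand="+"):
    """Split exon ranges into UTR and CDS segments around a polypeptide range."""
    # On the minus strand the low-coordinate side is the 3' end.
    low_utr = "five_prime_UTR" if strand == "+" else "three_prime_UTR"
    high_utr = "three_prime_UTR" if strand == "+" else "five_prime_UTR"
    parts = []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exon bases before the polypeptide range.
            parts.append((low_utr, start, min(end, cds_start - 1)))
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo <= hi:
            # Exon bases inside the polypeptide range.
            parts.append(("CDS", lo, hi))
        if end > cds_end:
            # Exon bases after the polypeptide range.
            parts.append((high_utr, max(start, cds_end + 1), end))
    return parts


def infer_introns(exons):
    """Return the gaps between consecutive exons as (start, end) ranges."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > 1
    ]


exons = [(100, 200), (300, 400), (500, 600)]
for part in split_exons(exons, 150, 550):
    print(part)
print(infer_introns(exons))  # prints [(201, 299), (401, 499)]
```

Because only exons carry bases, nothing derived here needs to be stored in the GFF3 itself; it can be recomputed from the exon and polypeptide ranges whenever it is needed.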
Authors
Sheldon McKay (mckays@cshl.edu)
Copyright (c) 2004 Cold Spring Harbor Laboratory.
Author of hacks for GFF2Chado loading
Don Gilbert (gilbertd@indiana.edu)