
What it does

Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.

Comprehensive & taxonomy-independent database: Bakta provides a large and taxonomy-independent database using UniProt's entire UniRef protein sequence cluster universe.
Protein sequence identification: Bakta exactly identifies known identical protein sequences (IPS) from RefSeq and UniProt allowing the fine-grained annotation of gene alleles (AMR) or closely related but distinct protein families. This is achieved via an alignment-free sequence identification (AFSI) approach using full-length MD5 protein sequence hash digests.
Small proteins/short open reading frames: Bakta detects and annotates small proteins/short open reading frames (sORF).
Expert annotation systems: To provide high-quality annotations for certain proteins of higher interest, e.g. AMR & VF genes, Bakta includes & merges different expert annotation systems. Currently, Bakta uses NCBI's AMRFinderPlus for AMR gene annotations as well as a generalized protein sequence expert system with distinct coverage, identity and priority values for each sequence, currently comprising the VFDB as well as NCBI's BlastRules.
Comprehensive workflow: Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS.
GFF3 & INSDC-compliant annotations: Bakta writes GFF3 and INSDC-compliant (GenBank & EMBL) annotation files ready for submission (checked via GenomeTools GFF3Validator, table2asn_GFF and ENA Webin-CLI for the GFF3 and EMBL file formats, using representative genomes of all ESKAPE species).
Bacteria & plasmids: Bakta was designed to annotate bacteria (isolates & MAGs) and plasmids, only.
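
The alignment-free sequence identification (AFSI) idea described above can be illustrated with a minimal Python sketch. It is not Bakta's actual implementation: the function names, the toy sequences and the normalization step (upper-casing and stripping a trailing stop symbol) are assumptions made for illustration. The point is that a full-length MD5 digest turns exact protein identification into a dictionary lookup, with no alignment involved.

```python
import hashlib


def md5_digest(protein_seq: str) -> str:
    """Return the hex MD5 digest of a full-length protein sequence.

    Assumption: sequences are normalized by upper-casing and removing a
    trailing '*' stop symbol before hashing.
    """
    normalized = protein_seq.strip().upper().rstrip("*")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(known_proteins: dict) -> dict:
    """Map digest -> identifier for a set of known reference proteins."""
    return {md5_digest(seq): ident for ident, seq in known_proteins.items()}


def identify(protein_seq: str, index: dict):
    """Return the identifier of an exactly identical known protein, or None."""
    return index.get(md5_digest(protein_seq))


# Hypothetical toy alleles that differ in a single residue.
known = {"toy_allele_1": "MKTAYIAKQR", "toy_allele_2": "MKTAYIAKQK"}
index = build_index(known)
```

Because the digest covers the whole sequence, two alleles that differ by one residue map to different keys, which is what allows alleles to be told apart; a sequence with no exact match simply returns `None`.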

Input options

Choose a genome or assembly in FASTA format to annotate with Bakta
Choose a version of the Bakta database

Organism options

You can provide information about the analysed FASTA sequences as text input for:
- genus
- species
- strain
- plasmid

Annotation options

1. You can specify whether all sequences (chromosome or plasmids) are complete.
2. You can provide your own Prodigal training file for CDS prediction.
3. The translation table can be changed; the default is table 11, which applies to bacteria.
4. You can specify whether the bacterium is Gram-negative (-), Gram-positive (+) or unknown (the default).
5. You can keep the contig names present in the input file.
6. You can provide your own replicon table as a TSV/CSV file.
7. The compliance option produces annotation files that are ready for submission to public databases (ENA, GenBank, EMBL).
8. You can provide a protein sequence file for annotation in GenBank or FASTA format. In FASTA format, each reference sequence can be provided in a short or long format:

# short:
>id gene~~~product~~~dbxrefs
MAQ...

# long:
>id min_identity~~~min_query_cov~~~min_subject_cov~~~gene~~~product~~~dbxrefs
MAQ...
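
To make the two header layouts concrete, here is a small Python sketch that splits such a FASTA header into its fields. The `ProteinHeader` type and `parse_header` function are hypothetical helpers written for this illustration and are not part of Bakta. The sketch assumes that the identifier is separated from the description by the first space, that the thresholds are numeric, and that `dbxrefs` is a comma-separated list.

```python
from typing import List, NamedTuple, Optional


class ProteinHeader(NamedTuple):
    seq_id: str
    min_identity: Optional[float]
    min_query_cov: Optional[float]
    min_subject_cov: Optional[float]
    gene: str
    product: str
    dbxrefs: List[str]


def parse_header(line: str) -> ProteinHeader:
    """Parse a short (3-field) or long (6-field) '~~~'-separated header."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    # The product may contain spaces, so split only on the first space.
    seq_id, _, description = line[1:].strip().partition(" ")
    fields = description.split("~~~")
    if len(fields) == 3:
        # Short format: no per-sequence thresholds given.
        thresholds = (None, None, None)
        gene, product, dbxrefs = fields
    elif len(fields) == 6:
        # Long format: identity and coverage thresholds come first.
        thresholds = tuple(float(value) for value in fields[:3])
        gene, product, dbxrefs = fields[3:]
    else:
        raise ValueError(f"expected 3 or 6 fields, got {len(fields)}")
    # Assumption: multiple dbxrefs are separated by commas.
    xrefs = [xref for xref in dbxrefs.split(",") if xref]
    return ProteinHeader(seq_id, *thresholds, gene, product, xrefs)
```

In the short format the three threshold fields come back as `None`, so a caller can fall back to whatever defaults the tool applies; in the long format they are parsed as floats.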

Skip steps

Some steps can be skipped:
- skip-trna: Skip tRNA detection & annotation
- skip-tmrna: Skip tmRNA detection & annotation
- skip-rrna: Skip rRNA detection & annotation
- skip-ncrna: Skip ncRNA detection & annotation
- skip-ncrna-region: Skip ncRNA region detection & annotation
- skip-crispr: Skip CRISPR array detection & annotation
- skip-cds: Skip CDS detection & annotation
- skip-pseudo: Skip pseudogene detection & annotation
- skip-sorf: Skip sORF detection & annotation
- skip-gap: Skip gap detection & annotation
- skip-ori: Skip oriC/oriT detection & annotation
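
For readers scripting outside Galaxy, the skip options listed above could be turned into command-line arguments roughly as follows. This is a sketch under one assumption: that each option name corresponds to a flag formed by prefixing it with `--`. The `skip_flags` helper is hypothetical and only builds a list of strings; it does not run any tool.

```python
# Step names taken from the skip options listed above, in the same order.
SKIPPABLE_STEPS = (
    "trna", "tmrna", "rrna", "ncrna", "ncrna-region",
    "crispr", "cds", "pseudo", "sorf", "gap", "ori",
)


def skip_flags(steps):
    """Return '--skip-<step>' arguments for the requested steps.

    Assumption: each flag is the option name prefixed with '--'.
    Unknown step names raise ValueError instead of being passed through.
    """
    requested = set(steps)
    unknown = sorted(requested - set(SKIPPABLE_STEPS))
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    # Emit flags in the documented order so the result is reproducible.
    return [f"--skip-{step}" for step in SKIPPABLE_STEPS if step in requested]
```

Validating the names up front means a typo such as `crisper` fails early rather than silently leaving a step enabled.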

Output options

Bakta produces a number of output files; you can select which types you want:
- Summary of the annotation
- Annotated files
- Sequence files for nucleotide and/or amino acid sequences