Galaxy | Tool Preview

GeneSeqToFamily preparation (version 0.4.3)
GFF3 datasets
Each FASTA header line should start with a transcript id
As required by TreeBest, part of the GeneSeqToFamily workflow. TranscriptId_species is also the only header format accepted by the Aequatus visualisation.
Region IDs are in the `seqid` column for GFF3 and in the `seq_region_name` field in JSON. Filtering by region ID is typically used to exclude regions with a non-standard genetic code, such as mitochondrial chromosomes, so that they can be analysed separately.

What it does

This tool converts a set of GFF3 and/or JSON gene feature information datasets into SQLite format.

It also filters the CDS FASTA datasets to keep only the transcripts present in the gene feature information.
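The filtering step described above can be sketched in Python as follows. This is only an illustration, not the tool's actual code: the function names `parse_fasta` and `keep_known_transcripts` are hypothetical, and it assumes the transcript ID is the first whitespace-delimited token of each FASTA header.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_known_transcripts(records, known_ids):
    """Keep only records whose header starts with an ID in known_ids.

    Assumption: the transcript ID is the first whitespace-delimited
    token of the header line.
    """
    for header, seq in records:
        tokens = header.split()
        if tokens and tokens[0] in known_ids:
            yield header, seq
```

Here `known_ids` would be the set of transcript IDs collected from the GFF3 and/or JSON gene feature information.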

Optionally it can also:
- keep only canonical transcripts (or the longest CDS per gene, if this attribute is not provided)
- remove sequences which are annotated as non protein-coding or whose length is not a multiple of 3
- change the header line of the FASTA sequences to the >TranscriptId_species format (as required by TreeBest, part of the GeneSeqToFamily workflow).
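A minimal Python sketch of these optional steps is given below. The helper names are hypothetical and the tie-breaking rule (the first transcript seen wins when two CDS have the same length) is an assumption, not documented behaviour of the tool.

```python
def is_multiple_of_three(seq):
    """True if the sequence length is a multiple of 3 (whole codons)."""
    return len(seq) % 3 == 0


def treebest_header(transcript_id, species):
    """Build a FASTA header line in the >TranscriptId_species format."""
    return f">{transcript_id}_{species}"


def longest_cds_per_gene(cds_lengths, transcript_to_gene):
    """Return {gene_id: transcript_id}, choosing the longest CDS per gene.

    cds_lengths maps transcript ID to CDS length; transcript_to_gene maps
    transcript ID to gene ID. Assumption: on a tie, the first transcript
    encountered is kept.
    """
    best = {}
    for transcript_id, length in cds_lengths.items():
        gene_id = transcript_to_gene[transcript_id]
        current = best.get(gene_id)
        if current is None or length > cds_lengths[current]:
            best[gene_id] = transcript_id
    return best
```

For example, `treebest_header("T1", "homosapiens")` gives `>T1_homosapiens`; both the transcript ID and the species name here are made up for illustration.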

Example GFF3 file:

scaffold_0  MYZPE13164_Clone_G006_v1.0  gene            44968   69413   .   -   .   ID=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030;biotype=protein_coding
scaffold_0  MYZPE13164_Clone_G006_v1.0  mRNA            44968   69413   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1;Parent=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030.1;biotype=protein_coding;_AED=0.31
scaffold_0  MYZPE13164_Clone_G006_v1.0  three_prime_utr 44968   46637   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.3utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  exon            44968   47432   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.exon1;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  CDS             46638   47432   .   -   0   ID=MYZPE13164_G006_v1.0_000000030.1.cds1;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  exon            53325   53539   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.exon2;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  CDS             53325   53539   .   -   2   ID=MYZPE13164_G006_v1.0_000000030.1.cds2;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  exon            54614   54719   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.exon3;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  CDS             54614   54719   .   -   0   ID=MYZPE13164_G006_v1.0_000000030.1.cds3;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  CDS             54852   55106   .   -   0   ID=MYZPE13164_G006_v1.0_000000030.1.cds4;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  exon            54852   55117   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.exon4;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  five_prime_utr  55107   55117   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.5utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  five_prime_utr  68851   69413   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.5utr2;Parent=MYZPE13164_G006_v1.0_000000030.1
scaffold_0  MYZPE13164_Clone_G006_v1.0  exon            68851   69413   .   -   .   ID=MYZPE13164_G006_v1.0_000000030.1.exon5;Parent=MYZPE13164_G006_v1.0_000000030.1

The following features are parsed: gene, mRNA, transcript, exon, five_prime_utr, three_prime_utr and CDS; all others are ignored. The ID and Parent attributes in the 9th column are also required to build the relations among features.
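The parsing rules above could be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the tool's implementation: columns are taken to be tab-separated (as the GFF3 specification requires), feature types are matched exactly as listed, and the function names and dictionary keys are hypothetical.

```python
from collections import defaultdict

# Feature types listed in the help text; anything else is skipped.
PARSED_TYPES = {"gene", "mRNA", "transcript", "exon",
                "five_prime_utr", "three_prime_utr", "CDS"}


def parse_attributes(column9):
    """Split the 9th GFF3 column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def parse_gff3_line(line):
    """Return a feature dict, or None for comments and ignored types."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] not in PARSED_TYPES:
        return None
    attrs = parse_attributes(columns[8])
    return {
        "seqid": columns[0],
        "type": columns[2],
        "start": int(columns[3]),
        "end": int(columns[4]),
        "strand": columns[6],
        "id": attrs.get("ID"),
        "parent": attrs.get("Parent"),
    }


def children_by_parent(features):
    """Group feature IDs under each Parent ID (Parent may be a comma list)."""
    children = defaultdict(list)
    for feature in features:
        if feature["parent"]:
            for parent_id in feature["parent"].split(","):
                children[parent_id].append(feature["id"])
    return dict(children)
```

Applied to the example file, the mRNA feature would be grouped under its gene through Parent, and each exon, CDS and UTR under the mRNA.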

If the value of an ID or Parent attribute contains a colon, everything up to and including the first colon is discarded.
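As a small Python sketch of this rule, assuming the colon itself is dropped together with the prefix (the function name `strip_type_prefix` is hypothetical):

```python
def strip_type_prefix(value):
    """Drop everything up to the first colon; leave colon-free values as-is."""
    _, separator, rest = value.partition(":")
    return rest if separator else value
```

With this sketch, a value such as `gene:ENSG00000139618` becomes `ENSG00000139618`, while only the first colon matters in a value such as `transcript:a:b`, which becomes `a:b`.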