What it does
This tool converts a set of GFF3 and/or JSON gene feature information datasets into SQLite format.
It also filters the CDS FASTA datasets to keep only the transcripts present in the gene feature information.
Optionally it can also:

- keep only canonical transcripts (or the longest CDS per gene, if this attribute is not provided)
- remove sequences which are annotated as non protein-coding or whose length is not a multiple of 3
- change the header line of the FASTA sequences to the >TranscriptId_species format (as required by TreeBest, part of the GeneSeqToFamily workflow).
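The length filter and the header renaming can be pictured with a short Python sketch. It is illustrative only and not the tool's actual code: the function names `is_valid_cds`, `treebest_header` and `filter_cds`, and the (transcript_id, sequence) record layout, are assumptions made for this example.

```python
def is_valid_cds(sequence):
    """Return True if the CDS is non-empty and its length is a multiple of 3."""
    return len(sequence) > 0 and len(sequence) % 3 == 0


def treebest_header(transcript_id, species):
    """Build a FASTA header line in the >TranscriptId_species format."""
    return ">{}_{}".format(transcript_id, species)


def filter_cds(records, species):
    """Yield (header, sequence) pairs for the CDS records that pass the filter.

    `records` is an iterable of (transcript_id, sequence) pairs
    (a hypothetical layout chosen for this sketch).
    """
    for transcript_id, sequence in records:
        if is_valid_cds(sequence):
            yield treebest_header(transcript_id, species), sequence
```

For example, given a 9-base sequence and a 5-base sequence, `filter_cds` would yield only the first, with its header rewritten to include the species name.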
Example GFF3 file:
    scaffold_0 MYZPE13164_Clone_G006_v1.0 gene 44968 69413 . - . ID=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030;biotype=protein_coding
    scaffold_0 MYZPE13164_Clone_G006_v1.0 mRNA 44968 69413 . - . ID=MYZPE13164_G006_v1.0_000000030.1;Parent=MYZPE13164_G006_v1.0_000000030;Name=MYZPE13164_G006_v1.0_000000030.1;biotype=protein_coding;_AED=0.31
    scaffold_0 MYZPE13164_Clone_G006_v1.0 three_prime_utr 44968 46637 . - . ID=MYZPE13164_G006_v1.0_000000030.1.3utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 exon 44968 47432 . - . ID=MYZPE13164_G006_v1.0_000000030.1.exon1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 CDS 46638 47432 . - 0 ID=MYZPE13164_G006_v1.0_000000030.1.cds1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 exon 53325 53539 . - . ID=MYZPE13164_G006_v1.0_000000030.1.exon2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 CDS 53325 53539 . - 2 ID=MYZPE13164_G006_v1.0_000000030.1.cds2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 exon 54614 54719 . - . ID=MYZPE13164_G006_v1.0_000000030.1.exon3;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 CDS 54614 54719 . - 0 ID=MYZPE13164_G006_v1.0_000000030.1.cds3;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 CDS 54852 55106 . - 0 ID=MYZPE13164_G006_v1.0_000000030.1.cds4;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 exon 54852 55117 . - . ID=MYZPE13164_G006_v1.0_000000030.1.exon4;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 five_prime_utr 55107 55117 . - . ID=MYZPE13164_G006_v1.0_000000030.1.5utr1;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 five_prime_utr 68851 69413 . - . ID=MYZPE13164_G006_v1.0_000000030.1.5utr2;Parent=MYZPE13164_G006_v1.0_000000030.1
    scaffold_0 MYZPE13164_Clone_G006_v1.0 exon 68851 69413 . - . ID=MYZPE13164_G006_v1.0_000000030.1.exon5;Parent=MYZPE13164_G006_v1.0_000000030.1
The following features are parsed: gene, mRNA, transcript, exon, five_prime_utr, three_prime_utr and CDS; all others are ignored. The ID and Parent attributes in the 9th column are also required, because they are used to create the relations among features.
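As a rough illustration of this selection step, the Python sketch below reads one tab-separated GFF3 line, skips feature types outside the list above, and extracts the ID and Parent attributes from the 9th column. The names `PARSED_FEATURES`, `parse_attributes` and `parse_gff3_line`, and the returned dictionary layout, are hypothetical and chosen for this example; the tool's own parser may differ.

```python
# Feature types that are kept; every other type is ignored.
PARSED_FEATURES = {"gene", "mRNA", "transcript", "exon",
                   "five_prime_utr", "three_prime_utr", "CDS"}


def parse_attributes(column9):
    """Split the 9th GFF3 column into a dict of key/value pairs."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return attributes


def parse_gff3_line(line):
    """Return a dict for a parsed feature, or None if the line is ignored."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line, directive or comment
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] not in PARSED_FEATURES:
        return None  # malformed line or feature type that is not parsed
    attributes = parse_attributes(columns[8])
    return {
        "seqid": columns[0],
        "type": columns[2],
        "start": int(columns[3]),
        "end": int(columns[4]),
        "strand": columns[6],
        "id": attributes.get("ID"),
        "parent": attributes.get("Parent"),
    }
```

The `parent` value of an exon or CDS record can then be matched against the `id` of an mRNA or transcript record, and that record's `parent` against a gene `id`, to rebuild the gene, transcript and exon hierarchy.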
If the value of an ID or Parent attribute contains a colon, everything up to the first colon is discarded.
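A minimal Python sketch of this rule follows. It assumes that the colon itself is discarded together with the text before it; the function name `strip_prefix` and the sample identifiers are hypothetical.

```python
def strip_prefix(value):
    """Drop everything up to and including the first colon, if there is one."""
    return value.split(":", 1)[-1]


# Values without a colon are returned unchanged; only the first colon counts.
print(strip_prefix("gene:G0001"))    # G0001
print(strip_prefix("G0001"))         # G0001
print(strip_prefix("a:b:c"))         # b:c
```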