Repository 'dante'
hg clone https://toolshed.g2.bx.psu.edu/repos/petr-novak/dante

Changeset 23:e2bbc79f0fac (2023-01-25)
Previous changeset 22:1eabd42e00ef (2020-04-03) Next changeset 24:df99812ded92 (2023-01-27)
Commit message:
"planemo upload commit baf4ca09569b1b709c37f2df712e778da05edaf9-dirty"
modified:
dante.xml
dante_gff_output_filtering.xml
dante_gff_to_dna.xml
dante_gff_to_tabular.xml
summarize_gff.xml
removed:
LICENCE
README.md
configuration.py
coverage2gff.py
dante.py
dante_gff_output_filtering.py
dante_gff_to_dna.py
dom_prot_seq.fa
fasta2database.R
fasta2database.py
parse_aln.py
summarize_gff.R
test-data/GEPY_test_long_1.fa
test-data/GEPY_test_long_1.fa.fai
test-data/GEPY_test_long_1_output_unfiltered.gff3
test-data/single_fasta.gff3
test-data/single_fasta_filtered.gff3
test-data/test_seq_1
test-data/vyber-Ty1_01.fasta
tests.sh
tool-data/rexdb_versions.loc.sample
tool-data/select_domain.loc.sample
tool_dependencies.xml
diff -r 1eabd42e00ef -r e2bbc79f0fac LICENCE
--- a/LICENCE Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,674 +0,0 @@
-          GNU GENERAL PUBLIC LICENSE
-                       Version 3, 29 June 2007
-
- Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
-                            Preamble
-
-  The GNU General Public License is a free, copyleft license for
-software and other kinds of works.
-
-  The licenses for most software and other practical works are designed
-to take away your freedom to share and change the works.  By contrast,
-the GNU General Public License is intended to guarantee your freedom to
-share and change all versions of a program--to make sure it remains free
-software for all its users.  We, the Free Software Foundation, use the
-GNU General Public License for most of our software; it applies also to
-any other work released this way by its authors.  You can apply it to
-your programs, too.
-
-  When we speak of free software, we are referring to freedom, not
-price.  Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-them if you wish), that you receive source code or can get it if you
-want it, that you can change the software or use pieces of it in new
-free programs, and that you know you can do these things.
-
-  To protect your rights, we need to prevent others from denying you
-these rights or asking you to surrender the rights.  Therefore, you have
-certain responsibilities if you distribute copies of the software, or if
-you modify it: responsibilities to respect the freedom of others.
-
-  For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must pass on to the recipients the same
-freedoms that you received.  You must make sure that they, too, receive
-or can get the source code.  And you must show them these terms so they
-know their rights.
-
-  Developers that use the GNU GPL protect your rights with two steps:
-(1) assert copyright on the software, and (2) offer you this License
-giving you legal permission to copy, distribute and/or modify it.
-
-  For the developers' and authors' protection, the GPL clearly explains
-that there is no warranty for this free software.  For both users' and
-authors' sake, the GPL requires that modified versions be marked as
-changed, so that their problems will not be attributed erroneously to
-authors of previous versions.
-
-  Some devices are designed to deny users access to install or run
-modified versions of the software inside them, although the manufacturer
-can do so.  This is fundamentally incompatible with the aim of
-protecting users' freedom to change the software.  The systematic
-pattern of such abuse occurs in the area of products for individuals to
-use, which is precisely where it is most unacceptable.  Therefore, we
-have designed this version of the GPL to prohibit the practice for those
-products.  If such problems arise substantially in other domains, we
-stand ready to extend this provision to those domains in future versions
-of the GPL, as needed to protect the freedom of users.
-
-  Finally, every program is threatened constantly by software patents.
-States should not allow patents to restrict development and use of
-software on general-purpose computers, but in those that do, we wish to
-avoid the special danger that patents applied to a free program could
-make it effectively proprietary.  To prevent this, the GPL assures that
-patents cannot be used to render the program non-free.
-
-  The precise terms and conditions for copying, distribution and
-modification follow.
-
-                       TERMS AND CONDITIONS
-
-  0. Definitions.
-
-  "This License" refers to version 3 of the GNU General Public License.
-
-  "Copyright" also means copyright-like laws that apply to other kinds of
-works, such as semiconductor masks.
-
-  "The Program" refers to any copyri
[...]
HE PROGRAM
-IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
-ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
-
-  16. Limitation of Liability.
-
-  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
-THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
-GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
-USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
-DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
-PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
-EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
-SUCH DAMAGES.
-
-  17. Interpretation of Sections 15 and 16.
-
-  If the disclaimer of warranty and limitation of liability provided
-above cannot be given local legal effect according to their terms,
-reviewing courts shall apply local law that most closely approximates
-an absolute waiver of all civil liability in connection with the
-Program, unless a warranty or assumption of liability accompanies a
-copy of the Program in return for a fee.
-
-                     END OF TERMS AND CONDITIONS
-
-            How to Apply These Terms to Your New Programs
-
-  If you develop a new program, and you want it to be of the greatest
-possible use to the public, the best way to achieve this is to make it
-free software which everyone can redistribute and change under these terms.
-
-  To do so, attach the following notices to the program.  It is safest
-to attach them to the start of each source file to most effectively
-state the exclusion of warranty; and each file should have at least
-the "copyright" line and a pointer to where the full notice is found.
-
-    <one line to give the program's name and a brief idea of what it does.>
-    Copyright (C) <year>  <name of author>
-
-    This program is free software: you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation, either version 3 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License
-    along with this program.  If not, see <https://www.gnu.org/licenses/>.
-
-Also add information on how to contact you by electronic and paper mail.
-
-  If the program does terminal interaction, make it output a short
-notice like this when it starts in an interactive mode:
-
-    <program>  Copyright (C) <year>  <name of author>
-    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
-    This is free software, and you are welcome to redistribute it
-    under certain conditions; type `show c' for details.
-
-The hypothetical commands `show w' and `show c' should show the appropriate
-parts of the General Public License.  Of course, your program's commands
-might be different; for a GUI interface, you would use an "about box".
-
-  You should also get your employer (if you work as a programmer) or school,
-if any, to sign a "copyright disclaimer" for the program, if necessary.
-For more information on this, and how to apply and follow the GNU GPL, see
-<https://www.gnu.org/licenses/>.
-
-  The GNU General Public License does not permit incorporating your program
-into proprietary programs.  If your program is a subroutine library, you
-may consider it more useful to permit linking proprietary applications with
-the library.  If this is what you want to do, use the GNU Lesser General
-Public License instead of this License.  But first, please read
-<https://www.gnu.org/licenses/why-not-lgpl.html>.
diff -r 1eabd42e00ef -r e2bbc79f0fac README.md
--- a/README.md Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,219 +0,0 @@
-# Domain based annotation of transposable elements  - DANTE #
-
-### Authors
- Nina Hostakova, Petr Novak, Pavel Neumann, Jiri Macas
- Biology Centre CAS, Czech Republic
-
-
-### Introduction
-
-* Protein Domains Finder [dante.py]
-	* Script performs scanning of given DNA sequence(s) in (multi)fasta format in order to discover protein domains using our protein domains database.
-	* Domains searching is accomplished engaging LASTAL alignment tool.
-	* Domains are subsequently annotated and classified - in case certain domain has multiple annotations assigned, classifation is derived from the common classification level of all of them.
-
-* Proteins Domains Filter [dante_gff_output_filtering.py]
-	* filters GFF3 output from previous step to obtain certain kind of domain and/or allows to adjust quality filtering
-
-### DEPENDENCIES ###
-
-* python3.4 or higher with packages:
-	* numpy
-	* biopython
-* [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
-* ProfRep/DANTE modules:
-	* configuration.py
-
-
-### Protein Domains Finder ###
-
-This tool provides **preliminary** output of all domains types which are not filtered for quality.
-
-#### INPUTS ####
-
-* DNA sequence [multiFasta]
-
-#### OUTPUTS ####
-
-* **All protein domains GFF3** - individual domains are reported per line as regions (start-end) on the original DNA sequence including the seq ID and strand orientation. The last "Attributes" column contains several comma-separated information related to the domain annotation, alignment and its quality. This file can undergo further filtering using Protein Domain Filter tool.
-
-#### USAGE ####
-
-		usage: dante.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
-								  CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
-								  [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
-								  [-wd WIN_DOM] [-od OVERLAP_DOM]
-
-		optional arguments:
-		  -h, --help            show this help message and exit
-		  -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
-								output domains gff format (default: None)
-		  -nld NEW_LDB, --new_ldb NEW_LDB
-								create indexed database files for lastal in case of
-								working with new protein db (default: False)
-		  -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
-								specify if you want to change the output directory
-								(default: None)
-		  -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
-								percentage of the best score in the cluster to be
-								tolerated when assigning annotations per base
-								(default: 80)
-		  -wd WIN_DOM, --win_dom WIN_DOM
-								window to process large input sequences sequentially
-								(default: 10000000)
-		  -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
-								overlap of sequences in two consecutive windows
-								(default: 10000)
-
-		required named arguments:
-		  -q QUERY, --query QUERY
-								input DNA sequence to search for protein domains in a
-								fasta format. Multifasta format allowed. (default:
-								None)
-		  -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
-								protein domains database file (default: None)
-		  -cs CLASSIFICATION, --classification CLASSIFICATION
-								protein domains classification file (default: None)
-
-
-
-#### HOW TO RUN EXAMPLE ####
-		./protein_domains.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
-
-	 When running for the first time with a new database use -nld option allowing lastal to create indexed database files:
-
-         -nld True
-
-	use other arguments if you wish to rename your outputs or they will be created automatically with standard names
-
-### Protein Domains Filter ###
-
-The script performs Protein Domains Finder output filtering for quality and/or extracting specific type of protein domain or mobile elements of origin. For the filtered domains it reports their translated protein sequence of original DNA.
-
-WHEN NO PARAMETERS GIVEN, IT PERFORMS QUALITY FILTERIN
[...]
PTIONS]
-                            [-mlen MAX_LEN_PROPORTION]
-                            [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
-                            [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]
-
-
-
-		optional arguments:
-		  -h, --help            show this help message and exit
-		  -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
-								output filtered domains gff file (default: None)
-		  -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
-								output file containg domains protein sequences
-								(default: None)
-		  -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
-								proportion of alignment length threshold (default:
-								0.8)
-		  -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
-								proportion of alignment identity threshold (default:
-								0.35)
-		  -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
-								threshold for alignment proportional similarity
-								(default: 0.45)
-		  -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
-								interruptions (frameshifts + stop codons) tolerance
-								threshold per 100 AA (default: 3)
-		  -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
-								maximal proportion of alignment length to the original
-								length of protein domain from database (default: 1.2)
-		  -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
-								filter output domains based on the domain type
-								(default: All)
-		  -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
-								filter output domains by typing substring from
-								classification (default: )
-		  -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
-								specify if you want to change the output directory
-								(default: None)
-
-		required named arguments:
-		  -dg DOM_GFF, --dom_gff DOM_GFF
-								basic unfiltered gff file of all domains (default:
-								None)
-
-
-
-#### HOW TO RUN EXAMPLE ####
-e.g. getting quality filtered integrase(INT) domains of all gypsy transposable elements:
-
-	./domains_filtering.py -dom_gff PATH_TO_INPUT_GFF -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE --selected_dom INT --element_type Ty3/gypsy
-
-
-### Extract Domains Nucleotide Sequences ###
-
-This tool extracts nucleotide sequences of protein domains from reference DNA based on DANTE's output. It can be used e.g. for deriving phylogenetic relations of individual mobile elements classes within a species.
-
-#### INPUTS ####
-
-* original DNA sequence in multifasta format to extract the domains from
-* GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
-* Domains database classification table (to check the classification level)
-
-#### OUTPUTS ####
-
-* fasta files of domains nucleotide sequences for individual transposons lineages
-* txt file of domains counts extracted for individual lineages
-
-**- For GALAXY usage all concatenated in a single fasta file**
-
-#### USAGE ####
-		usage: dante_gff_to_dna.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
-			CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]
-
-		optional arguments:
-		  -h, --help            show this help message and exit
-		  -i INPUT_DNA, --input_dna INPUT_DNA
-								path to input DNA sequence
-		  -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
-								GFF file of protein domains
-		  -cs CLASSIFICATION, --classification CLASSIFICATION
-								protein domains classification file
-		  -out OUT_DIR, --out_dir OUT_DIR
-								output directory
-		  -ex EXTENDED, --extended EXTENDED
-								extend the domains edges if not the whole datatabase
-								sequence was aligned
-
-#### HOW TO RUN EXAMPLE ####
-	./extract_domains_seqs.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA  --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True
-
-
-
-
-
-
-
-
-
diff -r 1eabd42e00ef -r e2bbc79f0fac configuration.py
--- a/configuration.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,23 +0,0 @@
-#!/usr/bin/env python3
-''' configuration file to set up the paths and constants '''
-import os
-
-MAIN_GIT_DIR = os.path.dirname(os.path.realpath(__file__))
-TOOL_DATA = os.path.join(MAIN_GIT_DIR, "tool-data")
-TMP = "tmp"
-SC_MATRIX_SKELETON = os.path.join(TOOL_DATA, "{}.txt.sample")
-AMBIGUOUS_TAG = "Ambiguous_domain"
-## IO
-CLASS_FILE = "ALL.classification-new"
-LAST_DB_FILE = "ALL_protein-domains_05.fasta"
-DOM_PROT_SEQ = "dom_prot_seq.fa"
-FILT_DOM_GFF = "domains_filtered.gff"
-EXTRACT_DOM_STAT = "domains_counts.txt"
-EXTRACT_OUT_DIR = "extracted_domains"
-FASTA_LINE = 60
-SOURCE_PROFREP = "profrep"
-SOURCE_DANTE = "dante"
-DOMAINS_FEATURE = "protein_domain"
-PHASE = "."
-HEADER_GFF = "##gff-version 3"
-DOMAINS_GFF = "output_domains.gff"
diff -r 1eabd42e00ef -r e2bbc79f0fac coverage2gff.py
--- a/coverage2gff.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,64 +0,0 @@
-#!/usr/bin/env python3
-import argparse
-import tempfile
-import shutil
-import sys
-
-def parse_args():
-    '''Argument parsin'''
-    description = """
-    parsing cap3 assembly aln output
-    """
-
-    parser = argparse.ArgumentParser(
-        description=description,
-        formatter_class=argparse.RawTextHelpFormatter)
-    parser.add_argument(
-        '-g',
-        '--gff_file',
-        default=None,
-        required=True,
-        help="input gff3 file for appending coverage information",
-        type=str,
-        action='store')
-    parser.add_argument(
-        '-p',
-        '--profile',
-        default=None,
-        required=True,
-        help="output file for coverage profile",
-        type=str,
-        action="store")
-    return parser.parse_args()
-
-def read_coverage(profile):
-    with open(profile) as p:
-        d = {}
-        for name, prof in zip(p, p):
-            d[name[1:].strip()] = [int(i) for i in prof.split()]
-    return d
-
-
-def main():
-    args = parse_args()
-    coverage_hash = read_coverage(args.profile)
-    gff_tmp = tempfile.NamedTemporaryFile()
-    with open(args.gff_file) as f, open(gff_tmp.name, 'w') as out:
-        for line in f:
-            if line[0] == "#":
-                out.write(line)
-            else:
-                line_parts = line.split()
-                start = int(line_parts[3])
-                end = int(line_parts[4])
-                coverage = round( sum(coverage_hash[line_parts[0]][(
-                    start - 1):end]) / (end - start + 1), 3)
-                new_line = "{};Coverage={}\n".format(line.strip(), coverage)
-                out.write(new_line)
-
-    shutil.copyfile(gff_tmp.name, args.gff_file)
-
-
-if __name__ == "__main__":
-
-    main()
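
The removed coverage2gff.py appended a Coverage attribute to each GFF3 record by averaging the per-base profile over the feature interval. A minimal sketch of that calculation follows, assuming the profile layout that read_coverage parsed (a '>name' line followed by a line of space-separated counts). The annotate helper and the in-memory inputs are illustrative and are not part of the tool's interface.

```python
def read_coverage(lines):
    """Parse a profile given as alternating '>name' and space-separated count lines."""
    it = iter(lines)
    # zip(it, it) pairs each header line with the line that follows it
    return {name[1:].strip(): [int(i) for i in prof.split()]
            for name, prof in zip(it, it)}


def mean_coverage(profile, start, end):
    """Mean coverage over a 1-based, inclusive interval, rounded to 3 decimals."""
    return round(sum(profile[start - 1:end]) / (end - start + 1), 3)


def annotate(gff_line, coverage):
    """Append ';Coverage=<mean>' to a GFF3 record; comment lines pass through unchanged."""
    if gff_line.startswith("#"):
        return gff_line
    parts = gff_line.split()
    value = mean_coverage(coverage[parts[0]], int(parts[3]), int(parts[4]))
    return "{};Coverage={}\n".format(gff_line.strip(), value)
```

GFF3 coordinates are 1-based and inclusive, which is why the slice starts at start - 1 and the divisor is end - start + 1.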
diff -r 1eabd42e00ef -r e2bbc79f0fac dante.py
--- a/dante.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,934 +0,0 @@
-#!/usr/bin/env python3
-
-import numpy as np
-import subprocess
-import math
-import time
-from operator import itemgetter
-from collections import Counter
-from itertools import groupby
-import os
-import re
-import configuration
-from tempfile import NamedTemporaryFile
-import sys
-import warnings
-import shutil
-from collections import defaultdict
-
-np.set_printoptions(threshold=sys.maxsize)
-
-def alignment_scoring():
-    ''' Create hash table for alignment similarity counting: for every
-	combination of aminoacids in alignment assign score from protein
-	scoring matrix defined in configuration file  '''
-    score_dict = {}
-    with open(configuration.SC_MATRIX) as smatrix:
-        count = 1
-        for line in smatrix:
-            if not line.startswith("#"):
-                if count == 1:
-                    aa_all = line.rstrip().replace(" ", "")
-                else:
-                    count_aa = 1
-                    line = list(filter(None, line.rstrip().split(" ")))
-                    for aa in aa_all:
-                        score_dict["{}{}".format(line[0], aa)] = line[count_aa]
-                        count_aa += 1
-                count += 1
-    return score_dict
-
-
-def characterize_fasta(QUERY, WIN_DOM):
-    ''' Find the sequences, their lengths, starts, ends and if
-	they exceed the window '''
-    with open(QUERY) as query:
-        headers = []
-        fasta_lengths = []
-        seq_starts = []
-        seq_ends = []
-        fasta_chunk_len = 0
-        count_line = 1
-        for line in query:
-            line = line.rstrip()
-            if line.startswith(">"):
-                headers.append(line.rstrip())
-                fasta_lengths.append(fasta_chunk_len)
-                fasta_chunk_len = 0
-                seq_starts.append(count_line + 1)
-                seq_ends.append(count_line - 1)
-            else:
-                fasta_chunk_len += len(line)
-            count_line += 1
-        seq_ends.append(count_line)
-        seq_ends = seq_ends[1:]
-        fasta_lengths.append(fasta_chunk_len)
-        fasta_lengths = fasta_lengths[1:]
-        # control if there are correct (unique) names for individual seqs:
-        # LASTAL takes seqs IDs till the first space which can then create problems with ambiguous records
-        if len(headers) > len(set([header.split(" ")[0] for header in headers
-                                   ])):
-            raise NameError(
-                '''Sequences in multifasta format are not named correctly:
-							seq IDs (before the first space) are the same''')
-
-    above_win = [idx
-                 for idx, value in enumerate(fasta_lengths) if value > WIN_DOM]
-    below_win = [idx
-                 for idx, value in enumerate(fasta_lengths)
-                 if value <= WIN_DOM]
-    lens_above_win = np.array(fasta_lengths)[above_win]
-    return headers, above_win, below_win, lens_above_win, seq_starts, seq_ends
-
-
-def split_fasta(QUERY, WIN_DOM, step, headers, above_win, below_win,
-                lens_above_win, seq_starts, seq_ends):
-    ''' Create temporary file containing all sequences - the ones that exceed
-	the window are cut with a set overlap (greater than domain size with a reserve) '''
-    with open(QUERY, "r") as query:
-        count_fasta_divided = 0
-        count_fasta_not_divided = 0
-        ntf = NamedTemporaryFile(delete=False)
-        divided = np.array(headers)[above_win]
-        row_length = configuration.FASTA_LINE
-        for line in query:
-            line = line.rstrip()
-            if line.startswith(">") and line in divided:
-                stop_line = seq_ends[above_win[
-                    count_fasta_divided]] - seq_starts[above_win[
-                        count_fasta_divided]] + 1
-                count_line = 0
-                whole_seq = []
-                for line2 in query:
-                    whole_seq.append(line2.rstrip())
-                    count_l
[...]
P
-            if not os.path.exists(OUTPUT_DIR):
-                os.makedirs(OUTPUT_DIR)
-        OUTPUT_DOMAIN = os.path.join(OUTPUT_DIR,
-                                     os.path.basename(OUTPUT_DOMAIN))
-    domain_search(QUERY, LAST_DB, CLASSIFICATION, OUTPUT_DOMAIN,
-                  THRESHOLD_SCORE, WIN_DOM, OVERLAP_DOM, SCORING_MATRIX)
-
-    print("ELAPSED_TIME_DOMAINS = {} s".format(time.time() - t))
-
-
-if __name__ == "__main__":
-    import argparse
-    from argparse import RawDescriptionHelpFormatter
-
-    class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
-                          argparse.RawDescriptionHelpFormatter):
-        pass
-
-    parser = argparse.ArgumentParser(
-        description=
-        '''Script performs similarity search on given DNA sequence(s) in (multi)fasta against our protein domains database of all Transposable element for certain group of organisms (Viridiplantae or Metazoans). Domains are subsequently annotated and classified - in case certain domain has multiple annotations assigned, classifation is derived from the common classification level of all of them. Domains search is accomplished engaging LASTAL alignment tool.
-
-	DEPENDENCIES:
-		- python 3.4 or higher with packages:
-			-numpy
-		- lastal 744 or higher [http://last.cbrc.jp/]
-		- configuration.py module
-
-	EXAMPLE OF USAGE:
-
-		./protein_domains_pd.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
-
-	When running for the first time with a new database use -nld option allowing lastal to create indexed database files:
-
-		-nld True
-
-		''',
-        epilog="""""",
-        formatter_class=CustomFormatter)
-
-    requiredNamed = parser.add_argument_group('required named arguments')
-    requiredNamed.add_argument(
-        "-q",
-        "--query",
-        type=str,
-        required=True,
-        help=
-        'input DNA sequence to search for protein domains in a fasta format. Multifasta format allowed.')
-    requiredNamed.add_argument('-pdb',
-                               "--protein_database",
-                               type=str,
-                               required=True,
-                               help='protein domains database file')
-    requiredNamed.add_argument('-cs',
-                               '--classification',
-                               type=str,
-                               required=True,
-                               help='protein domains classification file')
-    parser.add_argument("-oug",
-                        "--domain_gff",
-                        type=str,
-                        help="output domains gff format")
-    parser.add_argument(
-        "-nld",
-        "--new_ldb",
-        type=str,
-        default=False,
-        help=
-        "create indexed database files for lastal in case of working with new protein db")
-    parser.add_argument(
-        "-dir",
-        "--output_dir",
-        type=str,
-        help="specify if you want to change the output directory")
-    parser.add_argument(
-        "-M",
-        "--scoring_matrix",
-        type=str,
-        default="BL80",
-        choices=['BL80', 'BL62', 'MIQS'],
-        help="specify scoring matrix to use for similarity search (BL80, BL62, MIQS)")
-    parser.add_argument(
-        "-thsc",
-        "--threshold_score",
-        type=int,
-        default=80,
-        help=
-        "percentage of the best score in the cluster to be tolerated when assigning annotations per base")
-    parser.add_argument(
-        "-wd",
-        "--win_dom",
-        type=int,
-        default=10000000,
-        help="window to process large input sequences sequentially")
-    parser.add_argument("-od",
-                        "--overlap_dom",
-                        type=int,
-                        default=10000,
-                        help="overlap of sequences in two consecutive windows")
-
-    args = parser.parse_args()
-    main(args)
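
The removed characterize_fasta rejected multi-FASTA inputs whose headers collide once truncated at the first space, because LASTAL keeps only that part of the header as the sequence ID. A minimal sketch of the same check, which additionally reports the colliding IDs, might look like this; duplicated_seq_ids is a hypothetical helper and not a function from dante.py.

```python
from collections import Counter


def duplicated_seq_ids(headers):
    """Return the IDs (header text up to the first space) that occur more than once."""
    ids = [header.split(" ")[0] for header in headers]
    return sorted(seq_id for seq_id, n in Counter(ids).items() if n > 1)


headers = [">chr1 assembly A", ">chr1 assembly B", ">chr2"]
clashes = duplicated_seq_ids(headers)
if clashes:
    # The original code raised NameError at this point; here the clash is only reported.
    print("Non-unique sequence IDs:", ", ".join(clashes))
```

Renaming the clashing records before running the search avoids ambiguous seq IDs in the GFF3 output.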
diff -r 1eabd42e00ef -r e2bbc79f0fac dante.xml
--- a/dante.xml Fri Apr 03 07:27:59 2020 -0400
+++ b/dante.xml Wed Jan 25 13:06:55 2023 +0000
@@ -1,10 +1,7 @@
-<tool id="dante" name="Domain based ANnotation of Transposable Elements - DANTE" version="1.1.0">
+<tool id="dante" name="Domain based ANnotation of Transposable Elements - DANTE" version="1.1.4">
   <description> Tool for annotation of transposable elements based on the similarity to conserved protein domains database. </description>
   <requirements>
-    <requirement type="package">last</requirement>
-    <requirement type="package">numpy</requirement>
-    <requirement type="package" version="1.0">rexdb</requirement>
-    <requirement type="set_environment">REXDB</requirement>
+    <requirement type="package">dante=0.1.4</requirement>
   </requirements>
   <stdio>
     <regex match="Traceback" source="stderr" level="fatal" description="Unknown error" />
@@ -12,7 +9,7 @@
   </stdio>
   <command>
     #if str($input_type.input_type_selector) == "aln"
-      python3 ${__tool_directory__}/parse_aln.py -a $(input_sequences) -f sequences.fasta -p sequences.profile
+      parse_aln.py -a $(input_sequences) -f sequences.fasta -p sequences.profile
       &amp;&amp;
       INPUT_SEQUENCES="sequences.fasta"
     #else    
@@ -21,13 +18,12 @@
     &amp;&amp;
 
 
-    python3 ${__tool_directory__}/dante.py --query \${INPUT_SEQUENCES} --domain_gff ${DomGff}
-   --protein_database \${REXDB}/${db_type}_pdb
-   --classification \${REXDB}/${db_type}_class
-    --scoring_matrix ${scoring_matrix}
+    dante --query \${INPUT_SEQUENCES} --domain_gff ${DomGff}
+   --database $database
+      --scoring_matrix ${scoring_matrix}
 
     &amp;&amp;
-    python3 ${__tool_directory__}/dante_gff_output_filtering.py --dom_gff ${DomGff}
+    dante_gff_output_filtering.py --dom_gff ${DomGff}
     --domains_prot_seq ${Domains_filtered} --domains_filtered ${DomGff_filtered}
     --output_dir .
     --selected_dom All --th_identity 0.35
@@ -37,12 +33,12 @@
 
     #if str($input_type.input_type_selector) == "aln"
      &amp;&amp;
-     python3 ${__tool_directory__}/coverage2gff.py -p sequences.profile -g ${DomGff}
+     coverage2gff.py -p sequences.profile -g ${DomGff}
     #end if
 
     #if str($iterative) == "Yes"
     &amp;&amp;
-    python3 ${__tool_directory__}/dante_gff_output_filtering.py --dom_gff ${DomGff}
+   dante_gff_output_filtering.py --dom_gff ${DomGff}
     --domains_prot_seq domains_filtered.fasta --domains_filtered domains_filtered.gff
     --output_dir .
     --selected_dom All --th_identity 0.35
@@ -53,22 +49,22 @@
 
 
 
-    python3 ${__tool_directory__}/fasta2database.py domains_filtered.fasta domains_filtered.db
+    fasta2database.py domains_filtered.fasta domains_filtered.db
     domains_filtered.class
     &amp;&amp;
 
     lastdb -p domains_filtered.db domains_filtered.db
     &amp;&amp;
 
-    python3 ${__tool_directory__}/dante.py --query \${INPUT_SEQUENCES} --domain_gff ${DomGff2}
+    dante.py --query \${INPUT_SEQUENCES} --domain_gff ${DomGff2}
    --protein_database domains_filtered.db
    --classification domains_filtered.class
-    --scoring_matrix BL80
+      --scoring_matrix BL80
 
 
     #if str($input_type.input_type_selector) == "aln"
      &amp;&amp;
-     python3 ${__tool_directory__}/coverage2gff.py -p sequences.profile -g ${DomGff2}
+     coverage2gff.py -p sequences.profile -g ${DomGff2}
     #end if
     #end if
 
@@ -87,13 +83,12 @@
         <param name="input_sequences" type="data" format="txt" label="Sequences in ALN format (extracted from RepeatExplorer)"/>
       </when>
     </conditional>
-    <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
-      <options from_file="rexdb_versions.loc">
-        <column name="name" index="0"/>
-        <column name="value" index="1"/>
-      </options>
+    <param name="database" type="select" label="Select REXdb database">
+        <option value="Viridiplantae_v3.0" selected="true">Viridiplantae_v3.0</option>
+        <option value="Metazoa_v3.1">Metazoa_v3.1</option>
+        <option value="Viridiplantae_v2.2">Viridiplantae_v2.2</option>
+        <option value="Metazoa_v3.0">Metazoa_v3.0</option>
     </param>
-
     <param name="scoring_matrix" type="select" label="Select scoring matrix">
       <option value="BL80" selected="true" >BLOSUM80</option>
       <option value="BL62">BLOSUM62</option>
diff -r 1eabd42e00ef -r e2bbc79f0fac dante_gff_output_filtering.py
--- a/dante_gff_output_filtering.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,361 +0,0 @@\n-#!/usr/bin/env python3\n-import sys\n-import time\n-import configuration\n-import os\n-import textwrap\n-import subprocess\n-from tempfile import NamedTemporaryFile\n-from collections import defaultdict\n-\n-\n-class Range():\n-    \'\'\'\n-    This class is used to check float range in argparse\n-    \'\'\'\n-\n-    def __init__(self, start, end):\n-        self.start = start\n-        self.end = end\n-\n-    def __eq__(self, other):\n-        return self.start <= other <= self.end\n-\n-    def __str__(self):\n-        return "float range {}..{}".format(self.start, self.end)\n-\n-    def __repr__(self):\n-        return "float range {}..{}".format(self.start, self.end)\n-\n-\n-def check_file_start(gff_file):\n-    count_comment = 0\n-    with open(gff_file, "r") as gff_all:\n-        line = gff_all.readline()\n-        while line.startswith("#"):\n-            line = gff_all.readline()\n-            count_comment += 1\n-    return count_comment\n-\n-\n-def write_info(filt_dom_tmp, FILT_DOM_GFF, orig_class_dict, filt_class_dict,\n-               dom_dict, version_lines, TH_IDENTITY, TH_SIMILARITY,\n-               TH_LENGTH, TH_INTERRUPT, TH_LEN_RATIO, SELECTED_DOM):\n-    \'\'\'\n-\tWrite domains statistics in beginning of filtered GFF\n-\t\'\'\'\n-    with open(FILT_DOM_GFF, "w") as filt_gff:\n-        for line in version_lines:\n-            filt_gff.write(line)\n-        filt_gff.write(("##Filtering thresholdss: min identity: {}, min similarity: {},"\n-                        " min relative alingment length: {}, max interuptions(stop or "\n-                        "frameshift): {}, max relative alignment length: {}, selected"\n-                        " domains: {} \\n").format(TH_IDENTITY,\n-                                                  TH_SIMILARITY,\n-                                                  TH_LENGTH,\n-                                                  TH_INTERRUPT,\n-                                                  
TH_LEN_RATIO,\n-                                                  SELECTED_DOM))\n-        filt_gff.write("##CLASSIFICATION\\tORIGINAL_COUNTS\\tFILTERED_COUNTS\\n")\n-        if not orig_class_dict:\n-            filt_gff.write("##NO DOMAINS CLASSIFICATIONS\\n")\n-        for classification in sorted(orig_class_dict.keys()):\n-            if classification in filt_class_dict.keys():\n-                filt_gff.write("##{}\\t{}\\t{}\\n".format(\n-                    classification, orig_class_dict[\n-                        classification], filt_class_dict[classification]))\n-            else:\n-                filt_gff.write("##{}\\t{}\\t{}\\n".format(\n-                    classification, orig_class_dict[classification], 0))\n-        filt_gff.write("##-----------------------------------------------\\n"\n-                       "##SEQ\\tDOMAIN\\tCOUNTS\\n")\n-        if not dom_dict:\n-            filt_gff.write("##NO DOMAINS\\n")\n-        for seq in sorted(dom_dict.keys()):\n-            for dom, count in sorted(dom_dict[seq].items()):\n-                filt_gff.write("##{}\\t{}\\t{}\\n".format(seq, dom, count))\n-        filt_gff.write("##-----------------------------------------------\\n")\n-        with open(filt_dom_tmp.name, "r") as filt_tmp:\n-            for line in filt_tmp:\n-                filt_gff.write(line)\n-\n-\n-def get_file_start(gff_file):\n-    count_comment = 0\n-    lines = []\n-    with open(gff_file, "r") as gff_all:\n-        line = gff_all.readline()\n-        while line.startswith("#"):\n-            lines.append(line)\n-            line = gff_all.readline()\n-            count_comment += 1\n-    return count_comment, lines\n-\n-\n-def parse_gff_line(line):\n-    \'\'\'Return dictionary with gff fields  and  atributers\n-    Note - type of fields is strings\n-    \'\'\'\n-    # order of first 9 column is fixed\n-    gff_line = dict(\n-        zip(\n-            [\'seqid\', \'source\', \'type\', \'start\', \'end\',\n-             
\'score\', \'strand\', \'phase\', \'attributes\'],\n-            line.split("\\t")\n-        )\n-    )\n-    # split attributes and replace:\n-    gff_line[\'attributes\'] = dict([i.sp'..b'\t\t\tFILTERING OPTIONS:\n-\t\t\t\t> QUALITY: - Min relative length of alignemnt to the protein domain from DB (without gaps)\n-\t\t\t\t   - Identity \n-\t\t\t\t   - Similarity (scoring matrix: BLOSUM82)\n-\t\t\t\t   - Interruption in the reading frame (frameshifts + stop codons) per every starting 100 AA\n-\t\t\t\t   - Max alignment proportion to the original length of database domain sequence \n-\t\t\t\t> DOMAIN TYPE: choose from choices (\'Name\' attribute in GFF)\n-\t\t\t\tRecords for ambiguous domain type (e.g. INT/RH) are filtered out automatically\n-\t\t\t\t\n-\t\t\t\t> MOBILE ELEMENT TYPE:\n-\t\t\t\tarbitrary substring of the element classification (\'Final_Classification\' attribute in GFF)\n-\t\t\t\t\n-\t\tOUTPUTS:\n-\t\t\t- filtered GFF3 file\n-\t\t\t- fasta file of translated protein sequences (from original DNA) for the aligned domains that match the filtering criteria \n-\t\t\n-\tDEPENDENCIES:\n-\t\t- python 3.4 or higher\n-\t\t> ProfRep modules:\n-\t\t\t- configuration.py \n-\n-\tEXAMPLE OF USAGE:\n-\t\tGetting quality filtered integrase(INT) domains of all gypsy transposable elements:\n-\t\t./domains_filtering.py -dom_gff PATH_TO_INPUT_GFF -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE --selected_dom INT --element_type Ty3/gypsy \n-\n-\t\t\'\'\',\n-        epilog="""""",\n-        formatter_class=CustomFormatter)\n-    requiredNamed = parser.add_argument_group(\'required named arguments\')\n-    requiredNamed.add_argument("-dg",\n-                               "--dom_gff",\n-                               type=str,\n-                               required=True,\n-                               help="basic unfiltered gff file of all domains")\n-    parser.add_argument("-ouf",\n-                        "--domains_filtered",\n-                        
type=str,\n-                        help="output filtered domains gff file")\n-    parser.add_argument("-dps",\n-                        "--domains_prot_seq",\n-                        type=str,\n-                        help="output file containg domains protein sequences")\n-    parser.add_argument("-thl",\n-                        "--th_length",\n-                        type=float,\n-                        choices=[Range(0.0, 1.0)],\n-                        default=0.8,\n-                        help="proportion of alignment length threshold")\n-    parser.add_argument("-thi",\n-                        "--th_identity",\n-                        type=float,\n-                        choices=[Range(0.0, 1.0)],\n-                        default=0.35,\n-                        help="proportion of alignment identity threshold")\n-    parser.add_argument("-ths",\n-                        "--th_similarity",\n-                        type=float,\n-                        choices=[Range(0.0, 1.0)],\n-                        default=0.45,\n-                        help="threshold for alignment proportional similarity")\n-    parser.add_argument(\n-        "-ir",\n-        "--interruptions",\n-        type=int,\n-        default=3,\n-        help=\n-        "interruptions (frameshifts + stop codons) tolerance threshold per 100 AA")\n-    parser.add_argument(\n-        "-mlen",\n-        "--max_len_proportion",\n-        type=float,\n-        default=1.2,\n-        help=\n-        "maximal proportion of alignment length to the original length of protein domain from database")\n-    parser.add_argument(\n-        "-sd",\n-        "--selected_dom",\n-        type=str,\n-        default="All",\n-        choices=[\n-            "All", "GAG", "INT", "PROT", "RH", "RT", "aRH", "CHDCR", "CHDII",\n-            "TPase", "YR", "HEL1", "HEL2", "ENDO"\n-        ],\n-        help="filter output domains based on the domain type")\n-    parser.add_argument(\n-        "-el",\n-        
"--element_type",\n-        type=str,\n-        default="",\n-        help="filter output domains by typing substring from classification")\n-    parser.add_argument(\n-        "-dir",\n-        "--output_dir",\n-        type=str,\n-        default=None,\n-        help="specify if you want to change the output directory")\n-    args = parser.parse_args()\n-    main(args)\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac dante_gff_output_filtering.xml
--- a/dante_gff_output_filtering.xml Fri Apr 03 07:27:59 2020 -0400
+++ b/dante_gff_output_filtering.xml Wed Jan 25 13:06:55 2023 +0000
@@ -1,11 +1,16 @@
-<tool id="domains_filter" name="Protein Domains Filter" version="1.0.1">
+<tool id="domains_filter" name="Protein Domains Filter" version="1.1.4">
   <description> Tool for filtering of gff3 output from DANTE. Filtering can be performed based on domain type and alignment quality. </description>
-  <stdio>
+    <requirements>
+        <requirement type="package">dante=0.1.4</requirement>
+    </requirements>
+    <stdio>
+        <regex match="Traceback" source="stderr" level="fatal" description="Unknown error" />
+        <regex match="error" source="stderr" level="fatal" description="Unknown error" />
     <regex match="Traceback" source="stderr" level="fatal" description="Unknown error" />
     <regex match="error" source="stderr" level="fatal" description="Unknown error" />
   </stdio>
 <command>
-python3 ${__tool_directory__}/dante_gff_output_filtering.py --dom_gff ${DomGff} --domains_prot_seq ${dom_prot_seq} --domains_filtered ${dom_filtered} --selected_dom ${selected_domain} --th_identity ${th_identity} --th_similarity ${th_similarity} --th_length ${th_length} --interruptions ${interruptions} --max_len_proportion ${th_len_ratio} --element_type '${element_type}'
+dante_gff_output_filtering.py --dom_gff ${DomGff} --domains_prot_seq ${dom_prot_seq} --domains_filtered ${dom_filtered} --selected_dom ${selected_domain} --th_identity ${th_identity} --th_similarity ${th_similarity} --th_length ${th_length} --interruptions ${interruptions} --max_len_proportion ${th_len_ratio} --element_type '${element_type}'
 
 </command>
 <inputs>
@@ -16,10 +21,20 @@
  <param name="interruptions" type="integer" value="3" label="Interruptions [frameshifts + stop codons]" help="Tolerance threshold per every starting 100 amino acids of alignment sequence" />
  <param name="th_len_ratio" type="float" value="1.2" label="Maximal length proportion" help="Maximal proportion of alignment length to the original length of protein domain from database (including indels)" />
  <param name="selected_domain" type="select" label="Select protein domain type" >
-    <options from_file="select_domain.loc" >
-     <column name="name" index="0"/>
-     <column name="value" index="0"/>
- </options>
+        <option value="All" selected="true">All</option>
+        <option value="GAG">GAG</option>
+        <option value="INT">INT</option>
+        <option value="PROT">PROT</option>
+        <option value="RH">RH</option>
+        <option value="RT">RT</option>
+        <option value="aRH">aRH</option>
+        <option value="CHDCR">CHDCR</option>
+        <option value="CHDII">CHDII</option>
+        <option value="TPase">TPase</option>
+        <option value="YR">YR</option>
+        <option value="HEL1">HEL1</option>
+        <option value="HEL2">HEL2</option>
+        <option value="ENDO">ENDO</option>
    </param>
    <param name="element_type" type="text" value="" label="Filter based on classification" help="You can use preset options or enter an  arbitrary string to filter a certain repetitive element type of any level. It must be a continuous substring in a proper format of Final_Classification attribute of GFF3 file. Classification levels are separated by | character. Filtering is case sensitive">
      <option value="Ty1/copia">Ty1/copia</option>
diff -r 1eabd42e00ef -r e2bbc79f0fac dante_gff_to_dna.py
--- a/dante_gff_to_dna.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,191 +0,0 @@\n-#!/usr/bin/env python3\n-\n-import argparse\n-import time\n-import os\n-import textwrap\n-from collections import defaultdict\n-from Bio import SeqIO\n-import configuration\n-from dante_gff_output_filtering import parse_gff_line\n-t_nt_seqs_extraction = time.time()\n-\n-\n-def str2bool(v):\n-    if v.lower() in (\'yes\', \'true\', \'t\', \'y\', \'1\'):\n-        return True\n-    elif v.lower() in (\'no\', \'false\', \'f\', \'n\', \'0\'):\n-        return False\n-    else:\n-        raise argparse.ArgumentTypeError(\'Boolean value expected\')\n-\n-\n-def check_file_start(gff_file):\n-    count_comment = 0\n-    with open(gff_file, "r") as gff_all:\n-        line = gff_all.readline()\n-        while line.startswith("#"):\n-            line = gff_all.readline()\n-            count_comment += 1\n-    return count_comment, line\n-\n-\n-def extract_nt_seqs(DNA_SEQ, DOM_GFF, OUT_DIR, CLASS_TBL, EXTENDED):\n-    \'\'\' Extract nucleotide sequences of protein domains found by DANTE from input DNA seq.\n-\t\tSequences are saved in fasta files separately for each transposon lineage.\n-\t\tSequences extraction is based on position of Best_Hit alignment reported by LASTAL.\n-\t\tThe positions can be extended (optional) based on what part of database domain was aligned\n-        (Best_Hit_DB_Pos attribute).\n-\t\tThe strand orientation needs to be considered in extending and extracting the sequence itself\n-\t\'\'\'\n-    [count_comment, first_line] = check_file_start(DOM_GFF)\n-    unique_classes = get_unique_classes(CLASS_TBL)\n-    files_dict = defaultdict(str)\n-    domains_counts_dict = defaultdict(int)\n-    allSeqs = SeqIO.to_dict(SeqIO.parse(DNA_SEQ, \'fasta\'))\n-    with open(DOM_GFF, "r") as domains:\n-        for comment_idx in range(count_comment):\n-            next(domains)\n-        seq_id_stored = first_line.split("\\t")[0]\n-        allSeqs = SeqIO.to_dict(SeqIO.parse(DNA_SEQ, \'fasta\'))\n-        seq_nt = allSeqs[seq_id_stored]\n-       
 for line in domains:\n-            gff_line = parse_gff_line(line)\n-            elem_type = gff_line[\'attributes\'][\'Final_Classification\']\n-            if elem_type == configuration.AMBIGUOUS_TAG:\n-                continue  # skip ambiguous classification\n-            seq_id = gff_line[\'seqid\']\n-            dom_type = gff_line[\'attributes\'][\'Name\']\n-            strand = gff_line[\'strand\']\n-            align_nt_start = int(gff_line[\'attributes\'][\'Best_Hit\'].split(":")[\n-                -1].split("-")[0])\n-            align_nt_end = int(gff_line[\'attributes\'][\'Best_Hit\'].split(":")[\n-                -1].split("-")[1].split("[")[0])\n-            if seq_id != seq_id_stored:\n-                seq_id_stored = seq_id\n-                seq_nt = allSeqs[seq_id_stored]\n-            if EXTENDED:\n-                ## which part of database sequence was aligned\n-                db_part = gff_line[\'attributes\'][\'Best_Hit_DB_Pos\']\n-                ## db_part = line.split("\\t")[8].split(";")[4].split("=")[1]\n-                ## datatabse seq length\n-                dom_len = int(db_part.split("of")[1])\n-                ## start of alignment on database seq\n-                db_start = int(db_part.split("of")[0].split(":")[0])\n-                ## end of alignment on database seq\n-                db_end = int(db_part.split("of")[0].split(":")[1])\n-                ## number of nucleotides missing in the beginning\n-                dom_nt_prefix = (db_start - 1) * 3\n-                ## number of nucleotides missing in the end\n-                dom_nt_suffix = (dom_len - db_end) * 3\n-                if strand == "+":\n-                    dom_nt_start = align_nt_start - dom_nt_prefix\n-                    dom_nt_end = align_nt_end + dom_nt_suffix\n-                ## reverse extending for - strand\n-                else:\n-                    dom_nt_start = align_nt_start - dom_nt_suffix\n-                    dom_nt_end = align_nt_end + 
dom_nt_prefix\n-                ## correction for domain after extending having negative starting positon\n-                dom_nt_start = max(1, dom_nt_start)'..b'full_dom_nt = seq_nt.seq[dom_nt_start - 1:dom_nt_end]\n-            ## for - strand take reverse complement of the extracted sequence\n-            if strand == "-":\n-                full_dom_nt = full_dom_nt.reverse_complement()\n-            full_dom_nt = str(full_dom_nt)\n-            ## report when domain classified to the last level and no Ns in extracted seq\n-            if elem_type in unique_classes and "N" not in full_dom_nt:\n-                # lineages reported in separate fasta files\n-                if not elem_type in files_dict:\n-                    files_dict[elem_type] = os.path.join(\n-                        OUT_DIR, "{}.fasta".format(elem_type.split("|")[\n-                            -1].replace("/", "_")))\n-                with open(files_dict[elem_type], "a") as out_nt_seq:\n-                    out_nt_seq.write(">{}:{}-{}|{}[{}]\\n{}\\n".format(\n-                        seq_nt.id, dom_nt_start, dom_nt_end, dom_type,\n-                        elem_type, textwrap.fill(full_dom_nt,\n-                                                 configuration.FASTA_LINE)))\n-                domains_counts_dict[elem_type] += 1\n-    return domains_counts_dict\n-\n-\n-def get_unique_classes(CLASS_TBL):\n-    \'\'\' Get all the lineages of current domains classification table to check if domains are classified to the last level.\n-\t\tOnly the sequences of unambiguous and completely classified domains will be extracted.\n-\t\'\'\'\n-    unique_classes = []\n-    with open(CLASS_TBL, "r") as class_tbl:\n-        for line in class_tbl:\n-            line_class = "|".join(line.rstrip().split("\\t")[1:])\n-            if line_class not in unique_classes:\n-                unique_classes.append(line_class)\n-    return unique_classes\n-\n-\n-def write_domains_stat(domains_counts_dict, OUT_DIR):\n-   
 \'\'\' Report counts of domains for individual lineages\n-\t\'\'\'\n-    total = 0\n-    with open(\n-            os.path.join(OUT_DIR,\n-                         configuration.EXTRACT_DOM_STAT), "w") as dom_stat:\n-        for domain, count in domains_counts_dict.items():\n-            dom_stat.write(";{}:{}\\n".format(domain, count))\n-            total += count\n-        dom_stat.write(";TOTAL:{}\\n".format(total))\n-\n-\n-def main(args):\n-\n-    DNA_SEQ = args.input_dna\n-    DOM_GFF = args.domains_gff\n-    OUT_DIR = args.out_dir\n-    CLASS_TBL = args.classification\n-    EXTENDED = args.extended\n-\n-    if not os.path.exists(OUT_DIR):\n-        os.makedirs(OUT_DIR)\n-\n-    domains_counts_dict = extract_nt_seqs(DNA_SEQ, DOM_GFF, OUT_DIR, CLASS_TBL,\n-                                          EXTENDED)\n-    write_domains_stat(domains_counts_dict, OUT_DIR)\n-\n-    print("ELAPSED_TIME_EXTRACTION = {} s\\n".format(time.time() -\n-                                                    t_nt_seqs_extraction))\n-\n-\n-if __name__ == "__main__":\n-\n-    # Command line arguments\n-    parser = argparse.ArgumentParser()\n-    parser.add_argument(\'-i\',\n-                        \'--input_dna\',\n-                        type=str,\n-                        required=True,\n-                        help=\'path to input DNA sequence\')\n-    parser.add_argument(\'-d\',\n-                        \'--domains_gff\',\n-                        type=str,\n-                        required=True,\n-                        help=\'GFF file of protein domains\')\n-    parser.add_argument(\'-cs\',\n-                        \'--classification\',\n-                        type=str,\n-                        required=True,\n-                        help=\'protein domains classification file\')\n-    parser.add_argument(\'-out\',\n-                        \'--out_dir\',\n-                        type=str,\n-                        default=configuration.EXTRACT_OUT_DIR,\n-              
          help=\'output directory\')\n-    parser.add_argument(\n-        \'-ex\',\n-        \'--extended\',\n-        type=str2bool,\n-        default=True,\n-        help=\n-        \'extend the domains edges if not the whole datatabase sequence was aligned\')\n-    args = parser.parse_args()\n-    main(args)\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac dante_gff_to_dna.xml
--- a/dante_gff_to_dna.xml Fri Apr 03 07:27:59 2020 -0400
+++ b/dante_gff_to_dna.xml Wed Jan 25 13:06:55 2023 +0000
@@ -1,20 +1,18 @@
-<tool id="domains_extract" name="Extract Domains Nucleotide Sequences" version="1.0.0">
+<tool id="domains_extract" name="Extract Domains Nucleotide Sequences" version="1.1.4">
   <description> Tool to extract nucleotide sequences of protein domains found by DANTE </description>
   <requirements>
-    <requirement type="package">biopython</requirement>
-    <requirement type="package" version="1.0">rexdb</requirement>
-    <requirement type="set_environment">REXDB</requirement>
+    <requirement type="package">dante=0.1.4</requirement>
   </requirements>
   <command>
     TEMP_DIR_LINEAGES=\$(mktemp -d) &amp;&amp;
-    python3 ${__tool_directory__}/dante_gff_to_dna.py --domains_gff ${domains_gff} --input_dna ${input_dna} --out_dir \$TEMP_DIR_LINEAGES
+    dante_gff_to_dna.py --domains_gff ${domains_gff} --input_dna ${input_dna} --out_dir \$TEMP_DIR_LINEAGES
 
     #if $extend_edges:
    --extended True
     #else:
    --extended False
     #end if
-   --classification \${REXDB}/${db_type}_class
+   --database ${database}
     &amp;&amp;
 
     cat \$TEMP_DIR_LINEAGES/*fasta > $out_fasta &amp;&amp;
@@ -23,12 +21,12 @@
   <inputs>
    <param format="fasta" type="data" name="input_dna" label="Input DNA" help="Choose input DNA sequence(s) to extract the domains from" />
    <param format="gff" type="data" name="domains_gff" label="Protein domains GFF" help="Choose filtered protein domains GFF3 (DANTE's output)" />
-   <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
-      <options from_file="rexdb_versions.loc">
-        <column name="name" index="0"/>
-        <column name="value" index="1"/>
-      </options>
-    </param>
+   <param name="database" type="select" label="Select REXdb database">
+        <option value="Viridiplantae_v3.0" selected="true">Viridiplantae_v3.0</option>
+        <option value="Metazoa_v3.1">Metazoa_v3.1</option>
+        <option value="Viridiplantae_v2.2">Viridiplantae_v2.2</option>
+        <option value="Metazoa_v3.0">Metazoa_v3.0</option>
+      </param>
 
    <param name="extend_edges" type="boolean" truevalue="True" falsevalue="False" checked="True" label="Extend sequence edges" help="Extend extracted sequence edges to the full length of database domains sequences"/>
   </inputs>
diff -r 1eabd42e00ef -r e2bbc79f0fac dante_gff_to_tabular.xml
--- a/dante_gff_to_tabular.xml Fri Apr 03 07:27:59 2020 -0400
+++ b/dante_gff_to_tabular.xml Wed Jan 25 13:06:55 2023 +0000
@@ -1,9 +1,9 @@
-<tool id="gff_to_tabular" name="Convert dante gff3 to tab delimited file" version="0.1.0" python_template_version="3.5">
+<tool id="gff_to_tabular" name="Convert dante gff3 to tab delimited file" version="0.1.4" python_template_version="3.5">
     <requirements>
-        <requirement type="package">R</requirement>
+        <requirement type="package">dante=0.1.4</requirement>
     </requirements>
     <command detect_errors="exit_code"><![CDATA[
-        Rscript ${__tool_directory__}/summarize_gff.R '$inputgff' '$output'
+        summarize_gff.R '$inputgff' '$output'
     ]]></command>
     <inputs>
       <param type="data" name="inputgff" format="gff3" />
diff -r 1eabd42e00ef -r e2bbc79f0fac dom_prot_seq.fa
--- a/dom_prot_seq.fa Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,57 +0,0 @@
->scaffold146.1|size86774:976-1289 RH Class_I|LTR|Ty1/copia|Bianca
-ISWRSTKQTIVAISSNHVELLAIHDTSRECVWLRFMIESIIMXXXXXXXXXXXXXXXXXX
-QLKE*YIKCDRTKHISPKFFFTQDLQKNGDVIIQQIRSNDNVVD
->scaffold146.1|size86774:6810-7049 PROT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand
-LVDSGASCNLMSKRVMKQMGIPDEKLEFLDATLYAFDRRTIIPAGKIQLPVTLGEEERTR
-SEMVEFIIVDMDLAYNAILG
->scaffold146.1|size86774:8801-9241 RT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat
-DFKGVNKHCQPDPFPLPHIDRLVDAVAGSSLLSTMDAYSGYHQISLAREDQAKSSFLTED
-GVFCYVVMPFGLRNAGATYQRLVNKIFADLLGKEMEIYVDDMIVKSLNDEDHIIYLSHCF
-EVCRTHRLKLNPAKCCFGVRSGKFLGY
->scaffold146.1|size86774:10819-11667 INT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand
-RDAMDCVRRCQSCQYFAPINRKPGAEITLTELPCPFDRWGIDILGPFPQSVRQRRFCIVA
-VEYHSKWIEAEAVASITSEAVKKFVMNNIIVRFGCPRVLVSDNGPQFISDKFATFCEEYG
-IQQRTSSVYHPQTNGQAEASNKIILHGLRRNLDSLGGSWPDQLPHVLWAYRTTPKSSTGE
-TPFSLVYGSEAVAPVESTIITPRIAAYMHTESANTEFRELDLDLLEERRNEVYGRVRKQQ
-RALRKRYNQRVRPRQFEKGDLILRSVESQGHKGKLDRAWEGPY
->scaffold146.1|size86774:14592-14828 PROT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
-MVDLGASINLMPYSIYSALQLGPLQGTAIVIKLADRSNTHPEGVIEDVLVQVNNLVFPAD
-FYVLKMGKAENNDCPLLLG
->scaffold146.1|size86774:15420-15995 RT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
-IYAISDSDWVSPVHVVPKKTGFTVERNKNGELVPKRVTNGWRVCIDYRKLNDATRKDHFP
-LPFIDQMLERLAGKKFYCFLDGYSGYNQVAIAPEDQEKTTFTCTYGTYAFRKMPFGLCNA
-PATFQRCMLSIFSEFTGKFIEVFMDDFTVYGDSFEGALENLEKVLQRCVEKKLVLNSEKC
-HFMVRQGIVLGH
->scaffold146.1|size86774:16188-16634 RH Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
-FNQECQEAFNKLKSLLTAAPIIQPPNWELPFELMCDASNYALGAVLGQKIEGKRHVIYYA
-SKTLSEAQIHYTTTEKELLAIVYALEKFRSYLLGTKITVHSDHAALRHLLSKKESKPRLI
-RWILLLQEFDLEIKDRAGTENAVADNLSR
->scaffold146.1|size86774:24873-25481 INT Class_I|LTR|Ty1/copia|Bianca
-HDRLGHPGMIMMRKIIRTTSGHSLKNREILHPREYICTACAQGKLITRPSPVKIMNERIT
-FLERIQGDICGPIHPACGPFRYFIVLIDASSRWSHVSLLSTRNHAFARLLSQIIRLRAHF
-PDYPVKKIRLDNAAEFTSRTFNNYCLAMGIDVEHPVEYVHTQNGLAESLIKRLQLIARPL
-LMKSKLPVTCWGHAIIHASSLIR
->scaffold146.1|size86774:26322-27032 RT Class_I|LTR|Ty1/copia|Bianca
-WKDAIESELKSLNKRDVFGPVVRTPEGVQPVGYKWVFVRKRNDKGEISRYKARLVAQGFS
-QRPGIDYDETYSPVMDATTFRFLISLAIEYGLDLQLMDVVTAYLYGSLDCEIYMKIPEGF
-HMPERYSSEPRTDYAIKLNKSLYGLKQSGRMWYNRLSEYLIKEGYKNNLVCPCVFMKKFE
-NEFVIIAVYVDDINIVGTQKALLDAVNCLKREFEMKDLGRTKYCLGLQIEYLKNGIF
->scaffold146.1|size86774:27723-28124 RH Class_I|LTR|Ty1/copia|Bianca
-DAGYRSDPHNGRSQTGYVFLNKGAAISWRSTKQTIAATSSNHAELLAIHETSRECVWLRS
-MIESIYNACGLFTDKMPPTVLYEDNSACIIQLKEGYIKGDRTKHISPKFFFTHDLQKNGE
-VIIQQIRSSDNVAD
->scaffold146.1|size86774:10299-10658 aRH Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat
-WNMYIDGSTQSGAGVGVHYITPYGDWINLAVKLQFPATNNVAEYEALLAGMNFALSLGVT
-RLKTFSDSQLVVEQFSGHFQAKEPMLEAYKSRSQLLAAKFSEFSLEHIPRESNRAADSLA
->scaffold146.1|size86774:16812-17666 INT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
-HASDYGGHFGPNRTARRILDVGFYWPSIFRDVYQFCRTCDACQRVGNITNRREMPQNYIL
-ANEIFDIWGLDFMGPFPQSQGNNYILVAVDYVSKWVEAIPTRTDDGKTVTEFLRKNIFTR
-YGVPKAIISDRGTHFCNSTMRAMMKKYNVIHKTTTAYHPQGNGQAEATNREIKSILEKVV
-NKKRSNWSQKLPDALWAYRTAYKTPIGTTPFRLIYGKHCNLPVGLEHKAYWAIREMNFEE
-GGDAELRQMQLQELDALRLEAYDNSRIYKERLKTYHDKKLLQQNF
->scaffold146.1|size86774:19976-20212 PROT Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila
-MVDLGASINLMPYYIYSALKLGSLQGTAIIIKLADRSETHPEGVVKDVLAQVNNLVFPAD
-FYVLKMGEAENDDCPLLLG
->scaffold146.1|size86774:28912-29124 PROT Class_I|LTR|Ty1/copia|Bianca
-CLVDSATTHTILKNMRYFTSFEKRDVNIATIVCEANIVEGSGRAVIVLPSGTHIRIDDAL
-YANKSRRNLLS
diff -r 1eabd42e00ef -r e2bbc79f0fac fasta2database.R
--- a/fasta2database.R Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,14 +0,0 @@
-library(Biostrings)
-input_fasta = commandArgs(T)[1]
-## for testing input_fasta="/mnt/raid/454_data/RE2_benchmark/REPET_annotation/Prunus_persica/DANTE_proteins_filtered.fasta"
-s = readAAStringSet(input_fasta)
-names_table = do.call("rbind", strsplit(names(s)," "))
-head(names_table)
-classification_table = paste(names_table[,1], gsub("|","\t",names_table[,3], fixed = TRUE), sep="\t")
-cat(unique(classification_table), sep="\n", file = paste(input_fasta, ".classification", sep = ""))
-
-new_fasta_names = paste("NA-", names_table[,2], "__", names_table[,1], sep="")
-
-names(s) = new_fasta_names
-
-writeXStringSet(s, filepath = paste(input_fasta, ".db",sep=''))
diff -r 1eabd42e00ef -r e2bbc79f0fac fasta2database.py
--- a/fasta2database.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,25 +0,0 @@
-#!/usr/bin/env python3
-'''
-Helper script to create a DANTE database that can be used in the second iteration
-'''
-import sys
-
-fasta_input = sys.argv[1]
-db_fasta_output_file = sys.argv[2]
-db_classification_file = sys.argv[3]
-classification_table = set()
-# fasta header will be reformatted to correct REXdb classification
-with open(fasta_input, 'r') as f, open(db_fasta_output_file, 'w') as out:
-    for line in f:
-        if line[0] == ">":
-            ## modify header
-            name, domain, classification = line.split(" ")
-            name_clean=name[1:].replace("-","_")
-            new_header = ">NA-{}__{}\n".format(domain, name_clean)
-            classification_string = "\t".join(classification.split("|"))
-            classification_table.add("{}\t{}".format(name_clean, classification_string))
-            out.write(new_header)
-        else:
-            out.write(line)
-with open(db_classification_file, 'w') as f:
-    f.writelines(classification_table)
diff -r 1eabd42e00ef -r e2bbc79f0fac parse_aln.py
--- a/parse_aln.py Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,137 +0,0 @@
-#!/usr/bin/env python3
-'''
-parse .aln file - output from cap3 program. Output is fasta file and
-profile file
-'''
-import argparse
-import re
-
-
-def parse_args():
-    '''Argument parsing'''
-    description = """
-    parsing cap3 assembly aln output
-    """
-
-    parser = argparse.ArgumentParser(
-        description=description,
-        formatter_class=argparse.RawTextHelpFormatter)
-    parser.add_argument('-a',
-                        '--aln_file',
-                        default=None,
-                        required=True,
-                        help="Aln file input",
-                        type=str,
-                        action='store')
-    parser.add_argument('-f',
-                        '--fasta',
-                        default=None,
-                        required=True,
-                        help="fasta output file name",
-                        type=str,
-                        action='store')
-    parser.add_argument('-p',
-                        '--profile',
-                        default=None,
-                        required=True,
-                        help="output file for coverage profile",
-                        type=str,
-                        action="store")
-    return parser.parse_args()
-
-
-def get_header(f):
-    aln_header = "    .    :    .    :    .    :    .    :    .    :    .    :"
-    contig_lead = "******************"
-    aln_start = -1
-    while True:
-        line = f.readline()
-        if not line:
-            return None, None
-        if line[0:18] == contig_lead:
-            line2 = f.readline()
-        else:
-            continue
-        if aln_header in line2:
-            aln_start = line2.index(aln_header)
-            break
-    contig_name = line.split()[1] + line.split()[2]
-    return contig_name, aln_start
-
-
-def segment_start(f):
-    pos = f.tell()
-    line = f.readline()
-    # detect next contig or end of file
-    if "********" in line or line == "" or "Number of segment pairs = " in line:
-        segment = False
-    else:
-        segment = True
-    f.seek(pos)
-    return segment
-
-
-def get_segment(f, seq_start):
-    if not segment_start(f):
-        return None, None
-    aln = []
-    while True:
-        line = f.readline()
-        if ".    :    .    :" in line:
-            continue
-        if "__________" in line:
-            consensus = f.readline().rstrip('\n')[seq_start:]
-            f.readline()  # empty line
-            break
-        else:
-            aln.append(line.rstrip('\n')[seq_start:])
-    return aln, consensus
-
-
-def aln2coverage(aln):
-    coverage = [0] * len(aln[0])
-    for a in aln:
-        for i, c in enumerate(a):
-            if c not in " -":
-                coverage[i] += 1
-    return coverage
-
-
-def read_contig(f, seq_start):
-    contig = ""
-    coverage = []
-    while True:
-        aln, consensus = get_segment(f, seq_start)
-        if aln:
-            contig += consensus
-            coverage += aln2coverage(aln)
-        else:
-            break
-    return contig, coverage
-
-def remove_gaps(consensus, coverage):
-    if "-" not in consensus:
-        return consensus, coverage
-    new_coverage = [cov for cons, cov in zip(consensus, coverage)
-                    if cons != "-"]
-    new_consensus = consensus.replace("-", "")
-    return new_consensus, new_coverage
-
-def main():
-    args = parse_args()
-    with open(args.aln_file, 'r') as f1, open(args.fasta, 'w') as ffasta, open(args.profile, 'w') as fprofile:
-        while True:
-            contig_name, seq_start = get_header(f1)
-            if contig_name:
-                consensus, coverage = remove_gaps(*read_contig(f1, seq_start))
-                ffasta.write(">{}\n".format(contig_name))
-                ffasta.write("{}\n".format(consensus))
-                fprofile.write(">{}\n".format(contig_name))
-                fprofile.write("{}\n".format(" ".join([str(i) for i in coverage])))
-            else:
-                break
-
-
-if __name__ == "__main__":
-
-    main()
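The coverage logic of the removed parse_aln.py can be tried on a toy alignment. The sketch below restates `aln2coverage` and `remove_gaps` in standalone form. The three alignment rows and the consensus string are invented for illustration and are not real cap3 output.

```python
def aln2coverage(aln):
    """Count, per alignment column, the rows holding a base (not ' ' or '-')."""
    coverage = [0] * len(aln[0])
    for row in aln:
        for i, c in enumerate(row):
            if c not in " -":
                coverage[i] += 1
    return coverage


def remove_gaps(consensus, coverage):
    """Drop columns where the consensus is a gap, keeping coverage aligned."""
    kept = [(base, cov) for base, cov in zip(consensus, coverage) if base != "-"]
    return "".join(base for base, _ in kept), [cov for _, cov in kept]


# Invented toy data: three reads, one consensus gap in the third column.
aln = ["AC-GT", " C-GT", "ACAGT"]
consensus = "AC-GT"

coverage = aln2coverage(aln)
clean_consensus, clean_coverage = remove_gaps(consensus, coverage)
print(coverage)
print(clean_consensus, clean_coverage)
```

As in the removed script, the coverage list is sized from the first row, so the sketch assumes no row is longer than the first one.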
diff -r 1eabd42e00ef -r e2bbc79f0fac summarize_gff.R
--- a/summarize_gff.R Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,61 +0,0 @@
-## summarize hits
-output = commandArgs(T)[2] ## output table
-filepath = commandArgs(T)[1]  ## input dante gff3
-if (length(commandArgs(T))==2){
-  summarized_by = NA
-}else{
-  summarized_by = strsplit(commandArgs(T)[-(1:2)], split = ",")[[1]]
-}
-
-readGFF3fromDante = function(filepath){
-  dfraw=read.table(filepath, as.is = TRUE)
-  gff_df = dfraw[,1:8]
-  colnames(gff_df) = c("seqid", "source", "type", "start", "end", "score",
-                    "strand", "phase")
-  ## assume same order, same attributes names
-  ## TODO make it more robust - order can change!
-  gffattr_list = lapply(
-    strsplit(dfraw[,9],split=c("=|;")),
-    function(x)x[c(FALSE,TRUE)]
-  )
-  ## some rows are not complete - in case of ambiguous domains
-  L = sapply(gffattr_list, length)
-  short = L  < max(L)
-  if (any(short)){
-    gffattr_list[short] = lapply(gffattr_list[short],function(x) c(x, rep(NA, 13 - length(x))))
-  }
-  gffattr = as.data.frame(do.call(rbind, gffattr_list), stringsAsFactors = FALSE)
-
-  ## get attributes names
-  attrnames =  strsplit(dfraw[1,9],split=c("=|;"))[[1]][c(TRUE,FALSE)]
-  colnames(gffattr) = attrnames
-
-  gff_df$Final_Classification = gffattr$Final_Classification
-  gff_df$Name = gffattr$Name
-  gff_df$Region_Hits_Classifications = gffattr$Region_Hits_Classifications
-  gff_df$Best_Hit = gffattr$Best_Hit
-  gff_df$Best_Hit_DB_Pos = gffattr$Best_Hit_DB_Pos
-  gff_df$DB_Seq = gffattr$DB_Seq
-  gff_df$Query_Seq = gffattr$Query_Seq
-  gff_df$Region_Seq = gffattr$Region_Seq
-  gff_df$Identity = as.numeric(gffattr$Identity)
-  gff_df$Similarity = as.numeric(gffattr$Similarity)
-  gff_df$Relat_Length = as.numeric(gffattr$Relat_Length)
-  gff_df$Relat_Interruptions = as.numeric(gffattr$Relat_Interruptions)
-  gff_df$Hit_to_DB_Length = as.numeric(gffattr$Hit_to_DB_Length)
-  return(gff_df)
-}
-
-gff = readGFF3fromDante(filepath)
-# summarized_by = c("Final_Classification", "Name", "seqid")
-# summarized_by = c("Final_Classification")
-
-
-if (is.na(summarized_by)){
-  ## export complete table
-  write.table(gff, file = output, row.names = FALSE, quote = FALSE, sep = "\t")
-}else{
-  ## export summary
-  tbl = data.frame(table(gff[, summarized_by]))
-  write.table(tbl, file = output, row.names = FALSE, quote = FALSE, sep = "\t")
-}
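The TODO in the removed summarize_gff.R notes that its attribute parsing depends on attribute order. One possible order-independent approach, sketched here in Python rather than R, is to parse column 9 into a key/value mapping and look attributes up by name. The function name `parse_attributes` is hypothetical. The attribute names used in the example are taken from the DANTE test GFF3 files in this changeset.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict.

    Hypothetical helper: lookups by key do not depend on attribute order,
    and attributes absent from a row are simply missing from the dict.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        # partition splits on the first '=' only, so values may contain '='.
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


# Shortened rows modelled on the test data: a complete hit and an ambiguous one.
full = parse_attributes("Name=RH;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Identity=0.59")
ambiguous = parse_attributes("Name=RT/INT;Final_Classification=Ambiguous_domain")

print(full["Identity"])
print(ambiguous.get("Identity"))
```

Rows for ambiguous domains carry fewer attributes, which the removed R code handled by padding with NA. With a mapping, `dict.get` returns `None` for the missing keys instead.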
diff -r 1eabd42e00ef -r e2bbc79f0fac summarize_gff.xml
--- a/summarize_gff.xml Fri Apr 03 07:27:59 2020 -0400
+++ b/summarize_gff.xml Wed Jan 25 13:06:55 2023 +0000
@@ -1,13 +1,13 @@
-<tool id="gff_summary" name="Summarize gff3 output from DANTE" version="0.1.0" python_template_version="3.5">
+<tool id="gff_summary" name="Summarize gff3 output from DANTE" version="0.1.4" python_template_version="3.5">
     <requirements>
-        <requirement type="package">R</requirement>
+        <requirement type="package">dante=0.1.4</requirement>
     </requirements>
     <command detect_errors="exit_code"><![CDATA[
-        Rscript ${__tool_directory__}/summarize_gff.R '$inputgff' '$output' '$group'
+        summarize_gff.R '$inputgff' '$output' '$group'
     ]]></command>
     <inputs>
       <param type="data" name="inputgff" format="gff3" />
-      <param name="group" type="select" label="select categories to summarize" multiple="true" optional="false">
+      <param name="group" type="select" label="select categories to summarize" multiple="false" optional="false">
             <option value="Name">protein domain name</option>
             <option value="Final_Classification">Classification</option>
             <option value="seqid">Sequence ID</option>
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/GEPY_test_long_1.fa
--- a/test-data/GEPY_test_long_1.fa Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,378 +0,0 @@\n->scaffold146.1|size86774\n-CTAGAACACCAACACTAACAGGTACTACAGTCTGGAATCGGATATCTCGCATGTTAAAATATTCGGATGTGCTGTTTATA\n-TTCTCATTCCCCTGTCTCAAAGAACAAAAGTGGGACCCCAACGTCGATTGAAAATTTATATTGGATTTGAATCTCCTACG\n-ATTATACGATACCTTGAGCCNNNNNNNTTAACATGAGATGTGTTTACTGCTAGATTTGCAGATTGTTATTTTGATCAAAC\n-CCATTTTCATAAGTCATTGAAATAAAATGAGAAATATAAAAAATTAAGTTGGCATAAGTCATCATTGACACATTACGATC\n-CTCGTACTAAGGAATGTGAACTGGGAGTTTAGAAAATTCTTCATGTGCCGGAATAGACAATTCAATTGTCGAATGTGTTT\n-AACGATGCGAATGGGGTTTTACAATCGCATATACTTGCAGCAAACACACCGATTAAGGTTGATGTTCCTGAAGAACGCAC\n-GAAAATTGCGAACGAATCAAAAATGCGTTTGGAACGAGGTAGATCTATTGGTTCTAAGGATAAGAATCCTAGAACGAACA\n-CCGAATGTTCAAATTGAGAATGTGTCGAATCCTCTGAGCACACATATGGTGGTTAGATCTTTGGATGTGAAAATGGATCC\n-ATTCAGACCGCATGAGAACGATGAAGAAATATTAGGNNNNNGAAGTACCTTATCTTAGTGCAATCGGGGCATTGATGTAT\n-CTTGCGAATAATACGAGGCCTAATATAGCATTTGCTGTTAATCTGTTGACAAGATGTAGTTCGTCGCCTACGAAAAGATA\n-TTGGAAATGCGTGAAACATGTTCTTCGATANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n-NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n-NNNNNNNNNNNNNNTATTTCTTGGCGATCTACGAAACAAACTATCGTAGCTATCTCGTCAAATCACGTAGAATTATTAGC\n-GATACATGACACAAGTCGTGAATGCGTCTGGTTGAGATTTATGATTGAAAGCATTTATAATGNNNNNNNNNNNNNNNNNN\n-NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTACAGTTGAAAGAATGATATATTAAATGTGACCGAACGAAACATAT\n-TTCGCCAAAATTCTTTCTTTACACAAGATCTTCAAAAGAACGGAGATGTGATTATCCAGCAGATACGATCAAACGATAAT\n-GTAGTAGATTTATTCACAAAACTGCTTCATACGGCAGGTTTTGAAAAGTTGATCTACAACAATGGCATTCGAAGATTGAA\n-AGGTTTGGAGTGATGCAACCATCAGGGGTAGATGTTTTTGCTTGAAGACGGAGGGATGTAAAAAGATTATAGAAATGTAC\n-TCTTTTTCATTCACTAAGGTTTTTATCCTTTTTCCTTAGTAAAGTTTTAACGAGTCATATCCTATAATGATAGACATCCA\n-GGGTGGAGTATTACAAAACTATACTCGAAAATTAGATTGTGGATGTCTAGTTTACCAAGTTTCAAATAAAGACGGAAATA\n-AATAGTACTATACACAAAATAATGCTATTCATGTGGGGCTCACGTCATTAATTTGTTTGAATTATAAAACGGTTCAGAAC\n-CATCGCTCACCTATATAAATAGAGGTTATGTATGCTGAAATTATACAGATGAAATAATACAGATTTTATACTTTCATTTT\n-CTTTATTCTTCTTCCGTTTCTACTATATCGAAGTAATTCATAGAGAAGTTGACGTAGAACGTCCGATTGAAGATTCAAGT\n-AAATATTTTTCATTTATTGGTATTATTACTTTCCTAACAATT
ATTTGATTAAGCGCTATTGTTATTTGAGTCTCCTTCAT\n-TCACACAAATTGCATTCGAGAAAAAGACATTTTTTGTCCCCTCAAATTTTTCAACTTTCAATTTTTTGTCCCTCGACTTT\n-CAAAGAAGACACATTTGGTTTCTTAAATTTGATTTAAGGTCAATTTTGATTCATATTACAAAATTTTAATCATAAATTGA\n-CTATTTTACCTGTAAATAAATATTTTTAAAAGTTGAATTTCATTTTCTTAGACTATTTAAATGTAATTTGACTTGATTGT\n-GAGACTTATGAAGTGATTGTGAATCTTGTTTAAGACTTAAATATTCAATGTATGATAGAAAATTTATGTTGCAATCATAT\n-TATTGTGGAAATCTTATATAACATTGACGTGGAAATTTTTGTGTCGTGCCAAAAATAATAACCTCACAACAACAATAATG\n-GAAAATTTTCTGTGCTCATTTTTTGTCGTCTTCCTCCTCCTCTCTGCCGCTGCAAATGGCGACGACGTGTACACATCCTT\n-CGTTAACTTCCTCGCAAAGAACGGCATTTCCAGCGCCGAAATCTCTTCCACCGTCTACTCTCCACAAAACACTAGCTTCC\n-AGAACGTTCTACTCTCCGCCGTGAGGAACCGCCGGTTCAACCGATCCACCACCAGAAACCCAGCACGATTTTCGCGCCCA\n-CGGCGGAATCCCACGTCAGCGCCGCCGTCATTTGCTCCAAGGAACTCGGGATTCAGCTCAAGATCCGCAGCGGCGGCCAC\n-GACTTCGAGGGCATCTCCTACGTTTCTGCGGACGGCGGCGCGTTCGTCTTACTGGATGTGTCCAATTTCCGGTCGATTTC\n-CGTCGATATTCCCGGCGAGACGGCGTGGGTCGGCCCCGGGGCTTATCTCGGAGAGCTGTACTACAGGATCTGGGAGAAGA\n-GTAGCGTCCACGGTTTCCCCGCCGGGGTCCCGCCCTCCGTTCGATTTTCCAGAAAATGCTTCAAATCGGCGAAGTGGGGC\n-TGACGTTTAACTCCTACGGCGGAGTAATGGACCGGATCCCGGAATCGGAAGCTCCCTTCCTCCTAAAGAACTTAAAACGG\n-CTTATATCTCGCTATATAACGATATTTAAGGATTCGAAACCACTTATAATCTCTTTTCATGCCTTAAATGAGGTTATTTA\n-AGGATCCAATTGCATTTAAAATGCCTTATTATGCATCAAATAGCTTCAAATAGCCTTAACTATGCTAAAGAACTTATAAC\n-GGCTTATATCTCGCTATATAACGATATTTAACCATCCGAAACCACTTATAATCTCTTTTCAAGCCTTAAATGAGGTTATT\n-TAAGGATCCAATTGCATTTAAAATGTCTTATTATGCATCAAATAGCTTCAAATAGCCTAAACTATGCTAAAGAACTTATA\n-ACGGCTTATATCTCGCTATATAACGATATTAAAGCATCCGAAACCACTTATAATCTCTTTTCAAGCCTTAAATGAGGTTA\n-TTAAAGGATCCAATTGCATTTAGAATGCCTTATTATGCATCAAATAGCTTCAAATAGCCTAAACTATGCTAAAGAACTTA\n-TAACGGCTTATATCTCGCTATATAACGATATTAAAGCATCTCCTAAAGAACTTAAAACGGCTTATATCTCGCTATATAAC\n-GATATTTAAGGATTCGAAACCACTTATAATCTCTTTTCATGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTAAA\n-ATGCCTTATTATGCATCAAATAGCTTCAAATAGCCTTAACTATGCTAAAGAACTTATAACGGCTTATATCTCGCTATATA\n-ACGATATTTAACCATCCGAAACCACTTATAATCTCTTTTCAAGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTA\n-AAATGTCTTATTATGCATCAAATAGCTTCAAATAGCCTAAACTATGCTAA
AGAACTTATAACGGCTTATATCTCGCTATA\n-TAACGATATTAAAGCATC'..b'GATAGAAGCAAAGTTGAGGTTGATGAAAACTTTGCAGAATATATA\n-TCTTTCCATGTTATGAATGATCCTGAAGATCCTGAACCTAGAACAATGACTGAATGTCAGAAACGAGATGATTGGCCAAA\n-ATGGAAAGATGCTATAGAAAGTGAGCTGAAATCTCTGAATAAGAGAGATGTTTTCGGACCTGTAGTTCGAACACCTGAAG\n-GTGTACAACCGGTTGGTTATAAGTGGGTTTTTGTGAGAAAACGAAATGATAAAGGAGAAATATCTCGGTATAAGGCGAGA\n-TTAGTAGCTCAAGGGTTTTCTCAAAGGCCAGGAATTGATTATGATGAAACCTATTCACCGGTTATGGATGCCACAACTTT\n-CAGGTTTTTGATAAGTCTGGCGATTGAATATGGGCTTGATTTACAACTGATGGATGTTGTAACAGCATACTTATATGGGT\n-CACTGGATTGTGAAATATATATGAAAATCCCTGAAGGGTTTCATATGCCTGAACGATATAGTTCTGAACCCCGTACCGAT\n-TATGCGATTAAATTGAATAAATCCCTGTATGGATTAAAGCAGTCAGGACGAATGTGGTATAACCGTCTAAGTGAATACTT\n-GATTAAAGAGGGTTATAAGAACAATTTGGTTTGTCCCTGTGTTTTTATGAAGAAATTCGAAAATGAGTTCGTGATCATCG\n-CTGTGTATGTCGATGACATTAATATTGTGGGAACTCAGAAGGCATTATTGGATGCCGTGAACTGCTTGAAAAGGGAATTT\n-GAAATGAAGGATTTGGGAAGAACGAAATATTGCCTTGGTTTGCAAATTGAATATTTGAAAAATGGGATTTTTCGTACCGA\n-TTATGCTATTAAATTGAATAAATCCCTGTATGGATTAAAGCAGTCAGGACGAATGTGGTATAACCGTCTGAGTGAGTATC\n-TGATCAAAGAAGGTTATAAAAACAATTTGGTTTGTCCTTGTGTTTTTATGAAGAATTTTGAAAATGAGTTCGTGATCATC\n-GCTGTGTATGTCGATGACATTAATATTGTGGGAACTCAGAAGGCATTATTAGATGCTGTGAACTGCTTGAAAAGGGAATT\n-TGAAATGAAGGATTTGGGAAGAACGAAATATTGCCTTGGTTTGCAAATTGAATATTTGAAAAATGGGATTTTTCTTCATC\n-AGAATACGTATACCAAGAAGGTATTGAAACGTTTTTATATGGATTATTCACATCCTCTGAGCACACCTATGGTGGTTAGA\n-TCTTTAGATGTGAAAACGGATCCATTCAGGCCACAGGAGAACGATGAAGAAATATTAGGTCCTGAAGTACCTTATCTTAG\n-TGCAATCGGGGCATTAATGTATCTTGCGAATAATACGAGGCCTGACATTGCATTTGCTGTTAATCTGTTGGCAAGATATA\n-GTTCATCGCCTACGAAAAGACATTGGAAAGGCGTGAAACATGTTCTTCGATATCTTCAAGGTACTACTGATAAGGGGTTG\n-TATTATCAGAAAGATATGAAGTCAGAACTTATCGGGTATGCTGATGCTGGATATAGATCAGATCCACATAATGGGAGATC\n-TCAGACAGGATATGTTTTCCTGAATAAAGGAGCTGCTATTTCTTGGCGATCTACGAAACAGACTATCGCAGCTACCTCGT\n-CAAACCACGCAGAATTACTAGCGATACACGAAACAAGTCGTGAATGCGTTTGGTTGAGATCTATGATTGAAAGCATTTAT\n-AATGCTTGTGGATTGTTTACAGATAAGATGCCTCCGACTGTATTATATGAAGATAATAGTGCATGTATTATACAGTTGAA\n-AGAAGGATATATTAAGGGTGACAGAACGAAACATATTTCACCAAAATTCTTCTTTACACATGATCTTCAA
AAGAACGGAG\n-AGGTAATTATCCAGCAGATACGATCAAGCGATAATGTGGCAGATTTATTCACGAAACCACTCCCTACATCAACTTTTGAA\n-AAGTTGATTTACAATATTGGAATCCGAAGGTTGAAGGATTTGGAGTGATGCAGTCATCAGGGGGAGATGTTTTTGACTGA\n-GGACAAAGGGATGCAAGAAAATTATAGGAATGTACTCTTTTTCCTTCACTAAGGTTTTTATCCCATTGGGTTTTTCCTTA\n-GTAAGGTTTTAACGAGGCATATCCTATAATGATAGACATCCAAGGGGGAGTGTTGTAAAACTATACACGAAAATCGGATT\n-GTGGATGTCTATTATACCATATTTCAAATAAAGACGGAAATGCACAGTACTTACTATTCATGTGGGTCCCGCAGCATTAA\n-ATTTATCAAATTATTGATTGTAAACGAACGGATGTAATCGATGATTGATGCCTATAAATATAGGCATGGTGCAGAATGAA\n-TTTAAGCAGAACAAAATTTGAGCATAAATTTTTCTCTTCTTCTTCTCATTTCTTTTCTTGATTCAATAATACGCTGAAGG\n-AATTTCTACAGAAGTTGACGTAGAACGTCCGATTGAAGATTCAAGTAAGTTTATGAATTATTCATCTTTTATTTCTTATT\n-TTTCTAACACGTTATCAGCACGAAGTCTAACCAACTGAGTGCTTATATAATCTTGAAGATTATATATTATATGATATGAT\n-CCCGCAGATCGTATGGTTGATTCGATGATCCAGAAGATTATTTGATTTAACTTCTTTACCATTTTCGTCTGAAGTCTCAA\n-TATGATATCCATTTCGGCGAATATCTCTAAAACTCAATAAATTCCTTCGAGACTTATTCGCATATAATGCATCATCAATT\n-CTGATATGCGTTCCCGATGGTAATACTATCACAGCTCTGCCGGAGCCTTCAACAATATTCGCTTCACATACAATTGTGGC\n-AATATTTACATCCCGTTTTTCAAAACTAGTAAAATATCTCATGTTTTTCAAGATCGTATGCGTCGTTGCACTATCCACCA\n-GACATTCATCACCCTCAGCCATGCTGGAAATAAACAATATATTTCATTATCAGTTCATGATACAATTTACAATTCGGCAT\n-ACATAATACCAACAATAATGCCCAGAAATGTTAACAGCGTGGAAACAAAGACTGAAGCGTCTATGTTCTGGATCATCCAA\n-ATATACCCAAAAATAAAAGTTACCATAAAAACTATAAAAACAAGATACTGAACAGAGGTGTTCATCTGGCTCAGAGATTT\n-TGTAAAGAGATTTAGAAAAGAGAAGTTTTTAAGTGAAAAACCTTCTTGATAAAATCGTTATTTATACTGAGCCAAGTTAC\n-CGACAATATCTGCGAAATCCAAGATTTTCGTTGAAAAAACATAACACCTTATTAAAATATTCAAAACACGCCGAATATTT\n-TACAAAACACGCTGATAACATGCCGAATACAACATATTTATTTCAACCAGCATTATTAAATAGATAACTAAATCCTTCGT\n-CTGGATCACGTGGACGTGGTCGTGGCCGAGGTCGTTACTATGATTATGGTCGTGAAAAGAACAAGTATATCTGGAAGAAA\n-CCTGCCGTTGTCAAAGAGGTAAATGTGAAAAATGATCAGGGTGACCAGAATACTTGTTACAGATGTGGAAAGGAAGGACA\n-CTGGTCACGGACGTGTCGAACGCCTAAATCACTTGTCGACCTTTATCAGCGAGCGAAGAAAATTGAGGAAAAGGGGAAGA\n-AAAAGGAGACGAATAACGCTGAAGCTGAGACGTATAACGGAGAAGTCAATATGACTAAGCTGGATGTCGCAGATTTCCTG\n-GCTGATCCAGACGAAGGATTTAGTTATCTATTTAATAATGCTGGTTGAAAGAAATTGTTGTATTCGGCATGTTATCAG
CG\n-TGTTTTGTAAAATATTCGGCGTGTTTTGAATATTTTAATAAAATGTTATGTTTTTCAACGAAAATATTGGATTTCGCAGA\n-TATTGTCGGTAACTT\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/GEPY_test_long_1.fa.fai
--- a/test-data/GEPY_test_long_1.fa.fai Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,1 +0,0 @@
-scaffold146.1|size86774 30095 25 80 81
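The removed index line follows the samtools faidx layout: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line. As a minimal sketch, with hypothetical function names `parse_fai_line` and `byte_offset`, the file position of a 0-based base can be derived from those five fields:

```python
def parse_fai_line(line):
    """Split one .fai record into (name, length, offset, linebases, linewidth)."""
    name, length, offset, linebases, linewidth = line.split()
    return name, int(length), int(offset), int(linebases), int(linewidth)


def byte_offset(pos, offset, linebases, linewidth):
    """Byte position in the FASTA file of the 0-based base `pos`."""
    # Each full line before `pos` costs `linewidth` bytes (bases plus newline).
    return offset + (pos // linebases) * linewidth + pos % linebases


name, length, offset, linebases, linewidth = parse_fai_line(
    "scaffold146.1|size86774 30095 25 80 81"
)
# First base of the second sequence line.
print(byte_offset(80, offset, linebases, linewidth))
```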
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/GEPY_test_long_1_output_unfiltered.gff3
--- a/test-data/GEPY_test_long_1_output_unfiltered.gff3 Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,26 +0,0 @@\n-##gff-version 3\n-##-----------------------------------------------\n-##PIPELINE VERSION         : iter_search_optional-rv-3168(0b80fa0)\n-##PROTEIN DATABASE VERSION : Viridiplantae_v3.0_pdb\n-##-----------------------------------------------\n-scaffold146.1|size86774\tdante\tprotein_domain\t976\t1289\t293\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=RH|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-RH__REXdb_ID2558|Class_I|LTR|Ty1/copia|Bianca:976-1289[100percent];Best_Hit_DB_Pos=26:134of134;DB_Seq=ISWRSVKQTITATSSNHAELLALHEASRECVWLRSMIQHIQKNCG-LSSGRMDATIIYEDNTACIAQLKEGYIKGDRTKHISPKFF-FTHDLQKDGDISIQQIRSCDNLAD;Region_Seq=ISWRSTKQTIVAISSNHVELLAIHDTSRECVWLRFMIESI\\IMXXXXXXXXXXXXXXXXXXQLKE*YIKCDRTKHISPKFF\\FTQDLQKNGDVIIQQIRSNDNVVD;Query_Seq=ISWRSTKQTIVAISSNHVELLAIHDTSRECVWLRFMIESI-----\\IMXXXXXXXXXXXXXXXXXXQLKE*YIKCDRTKHISPKFF\\FTQDLQKNGDVIIQQIRSNDNVVD;Identity=0.59;Similarity=0.66;Relat_Length=0.813;Relat_Interruptions=1.5;Hit_to_DB_Length=0.83\n-scaffold146.1|size86774\tdante\tprotein_domain\t6810\t7049\t153\t+\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-PROT__REXdb_ID9702|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:6810-7049[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=LVDDGSKVNLLPYRVFQQMGIPEEQLVRDQAPVKGIGGVPVLVEGKVKLALTLGEAPRTRTHYAVFLVVKPPLSYNAILG;Region_Seq=LVDSGASCNLMSKRVMKQMGIPDEKLEFLDATLYAFDRRTIIPAGKIQLPVTLGEEERTRSEMVEFIIVDMDLAYNAILG;Query_Seq=LVDSGASCNLMSKRVMKQMGIPDEKLEFLDATLYAFDRRTIIPAGKIQLPVTLGEEERTRSEMVEFIIVDMDLAYNAILG;Identity=0.44;Similarity=0.62;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t7656\t8296\t.\t+\t.\tName=RT/INT;Final_Classification=Ambiguous_domain;Region_Hits_Classifications_=RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[246bp],INT|Class_I|LTR|Ty3/gypsy|non
-chromovirus|OTA|Tat|Retand[468bp]\n-scaffold146.1|size86774\tdante\tprotein_domain\t8756\t9241\t538\t+\t.\tName=RT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat;Region_Hits_Classifications=RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[486bp],RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre[441bp];Best_Hit=Ty3-RT__REXdb_ID8210|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:8801-9241[90percent];Best_Hit_DB_Pos=27:173of173;DB_Seq=DFTDLNKACPKDSFPLPHIDRLVDSTAGNELLTFMDAFSGYNQIMMNPEDQEKTSFITDRGIYCYKVMPFGLKNAGATYQRLVNKMFHNHLGKTMEVYIDDMLVKSLKKEDHVKHLEECFDILNKYQMKLNPAKCTFGVPSGEFLGY;Region_Seq=TSIATASGGRTSDGADFKGVNKHCQPDPFPLPHIDRLVDAVAGSSLLSTMDAYSGYHQISLAREDQAKSSFLTEDGVFCYVVMPFGLRNAGATYQRLVNKIFADLLGKEMEIYVDDMIVKSLNDEDHIIYLSHCFEVCRTHRLKLNPAKCCFGVRSGKFLGY;Query_Seq=DFKGVNKHCQPDPFPLPHIDRLVDAVAGSSLLSTMDAYSGYHQISLAREDQAKSSFLTEDGVFCYVVMPFGLRNAGATYQRLVNKIFADLLGKEMEIYVDDMIVKSLNDEDHIIYLSHCFEVCRTHRLKLNPAKCCFGVRSGKFLGY;Identity=0.63;Similarity=0.8;Relat_Length=0.85;Relat_Interruptions=0.0;Hit_to_DB_Length=0.85\n-scaffold146.1|size86774\tdante\tprotein_domain\t9434\t9781\t343\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=RH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-RH__REXdb_ID9729|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:9434-9772[97percent];Best_Hit_DB_Pos=1:113of149;DB_Seq=WTEECEEAFQKLKEYLGSPHLLVKPIQGEPLFLYLAVSEHATSSVLVREDDGVQRPIYYTSRALVDAETRYLSLEKIVLALIVSARRLRPYFQAHTIIVLTDQPIRQVLAKPD;Region_Seq=WTDQCDRAFKELKTYLASPPLIVSPTPTETLGLYLAVSEHAVSSVLVAERDGVQHPVYYVSHTLLPAESRYSTVEKFVLALLKSVAKLRHYFESRKVIVYTDQPIKAVLGQSDHTS;Query_Seq=WTDQCDRAFKELKTYLASPPLIVSPTPTETLGLYLAVSEHAVSSVLVAERDGVQHPVYYVSHTLLPAESRYSTVEKFVLALLKSVAKLRHYFESRKVIVYTDQPIKAVLGQSD;Identity=0.58;Similarity=0.73;Relat_Length=0.758;Relat_Interruptions=0.0;Hit_to_DB_Length=0.76\n-scaffold146.1|size86774\tdante\tprotein_domain\t10810\t11667\t747\t+\t.\tName=INT;Final_Classification=Class_I|LT
R|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classific'..b'n-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6633|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:16812-17666[98percent];Best_Hit_DB_Pos=1:285of313;DB_Seq=HSHSYGGHFGAKRTAHKVLESGFYWPSIFKDAYHFCKSCEKCQRTGNITHKNQMPLTNILVSEIFDVWGIDFMGPFPSSFGNLYILLVVDYVSKWIEAKATRTNDAKVVLDFVRTHIFNRFGIPKAIISDRGTHFCNRSMEALLRKYHVTHRTSTAYHPQTNGQAEISNREIKSILEKIVQPNRRDWSLRLGDALWAYRTAYKSPIGMSPYRMIYGKACHLPVELEHKAFWAIKQCNMDYDAAGIARKLQLQELEEIRNDAYENARIYKEKTKNLHDRMLTRKEF;Region_Seq=HASDYGGHFGPNRTARRILDVGFYWPSIFRDVYQFCRTCDACQRVGNITNRREMPQNYILANEIFDIWGLDFMGPFPQSQGNNYILVAVDYVSKWVEAIPTRTDDGKTVTEFLRKNIFTRYGVPKAIISDRGTHFCNSTMRAMMKKYNVIHKTTTAYHPQGNGQAEATNREIKSILEKVVNKKRSNWSQKLPDALWAYRTAYKTPIGTTPFRLIYGKHCNLPVGLEHKAYWAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKTYHDKKLLQQNFRERLS;Query_Seq=HASDYGGHFGPNRTARRILDVGFYWPSIFRDVYQFCRTCDACQRVGNITNRREMPQNYILANEIFDIWGLDFMGPFPQSQGNNYILVAVDYVSKWVEAIPTRTDDGKTVTEFLRKNIFTRYGVPKAIISDRGTHFCNSTMRAMMKKYNVIHKTTTAYHPQGNGQAEATNREIKSILEKVVNKKRSNWSQKLPDALWAYRTAYKTPIGTTPFRLIYGKHCNLPVGLEHKAYWAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKTYHDKKLLQQNF;Identity=0.61;Similarity=0.79;Relat_Length=0.911;Relat_Interruptions=0.0;Hit_to_DB_Length=0.91\n-scaffold146.1|size86774\tdante\tprotein_domain\t18554\t18811\t306\t-\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6693|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:18554-18802[96percent];Best_Hit_DB_Pos=231:313of313;DB_Seq=WALRLLNFDNNACGEKRKLQLQELEEMRLNAYESSRIYKERTKAYHDKKLQRREFQPGQQVLLFNSRLRLFPGKLKSKWSGPF;Region_Seq=QGNWAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKAYHDKKILQQNFREGQQVLLFNSKLRLFPGKLKSRWMGPF;Query_Seq=WAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKAYHDKKILQQNFREGQQVLLFNSKLRLFPGKLKSRWMGPF;Identity=0.65;Similarity=0.82;Relat_Length=0.265;Relat_Interruptions=0.0;Hit_to_DB_Length=0.27\n-scaffold146.1|size86774\tdante\tprotein_dom
ain\t19158\t19478\t197\t-\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6659|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:19182-19448[83percent];Best_Hit_DB_Pos=216:304of314;DB_Seq=YGKPCHLPVELEHKAWWAVKQCNMELDVAGQHRxLQLQELEEIRNDAYESSxIYKEKTKAFHDKQILRKNFEVGQKVLIFHSRLKLFPG;Region_Seq=PRGTISIGLNFGKQCKVLVGMEHENYWEIREMNYEEGADVEQKQMQLQKMDALKLEAYDNSRIDKEKLKAHHAKRILQQNCKKRQQVLIFDSKLKMFPGIPRWMEPF;Query_Seq=FGKQCKVLVGMEHENYWEIREMNYEEGADVEQKQMQLQKMDALKLEAYDNSRIDKEKLKAHHAKRILQQNCKKRQQVLIFDSKLKMFPG;Identity=0.42;Similarity=0.71;Relat_Length=0.283;Relat_Interruptions=0.0;Hit_to_DB_Length=0.28\n-scaffold146.1|size86774\tdante\tprotein_domain\t19976\t20212\t259\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-PROT__REXdb_ID6659|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:19976-20212[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=MLDLGASINVMPYSIYNSLNLGPMEETCIIIQLADRSNAYPKGVMEDVLVQVNELVFPADFYILKMEDELSPNPTPILLG;Region_Seq=MVDLGASINLMPYYIYSALKLGSLQGTAIIIKLADRSETHPEGVVKDVLAQVNNLVFPADFYVLKMGEAENDDCPLLLG;Query_Seq=MVDLGASINLMPYYIYSALKLGSLQGTAIIIKLADRSETHPEGVVKDVLAQVNNLVFPADFYVLKM-GEAENDDCPLLLG;Identity=0.62;Similarity=0.79;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t28912\t29124\t216\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=PROT|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-PROT__REXdb_ID2599|Class_I|LTR|Ty1/copia|Bianca:28912-29124[100percent];Best_Hit_DB_Pos=1:71of71;DB_Seq=CLADCATTHTILRDKRYFLELTLIKANVSTISGTTNLVEGSGRANIMLPNGTRFHINDALYSSKSRRNLLS;Region_Seq=CLVDSATTHTILKNMRYFTSFEKRDVNIATIVCEANIVEGSGRAVIVLPSGTHIRIDDALYANKSRRNLLS;Query_Seq=CLVDSATTHTILKNMRYFTSFEKRDVNIATIVCEANIVEGSGRAVIVLPSGTHIRIDDA
LYANKSRRNLLS;Identity=0.59;Similarity=0.7;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/single_fasta.gff3
--- a/test-data/single_fasta.gff3 Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,26 +0,0 @@\n-##gff-version 3\n-##-----------------------------------------------\n-##PIPELINE VERSION         : dante-rv-3081(adb2509)\n-##PROTEIN DATABASE VERSION : Viridiplantae_v3.0_pdb\n-##-----------------------------------------------\n-scaffold146.1|size86774\tdante\tprotein_domain\t976\t1289\t293\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=RH|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-RH__REXdb_ID2558|Class_I|LTR|Ty1/copia|Bianca:976-1289[100percent];Best_Hit_DB_Pos=26:134of134;DB_Seq=ISWRSVKQTITATSSNHAELLALHEASRECVWLRSMIQHIQKNCG-LSSGRMDATIIYEDNTACIAQLKEGYIKGDRTKHISPKFF-FTHDLQKDGDISIQQIRSCDNLAD;Query_Seq=ISWRSTKQTIVAISSNHVELLAIHDTSRECVWLRFMIESI-----\\IMXXXXXXXXXXXXXXXXXXQLKE*YIKCDRTKHISPKFF\\FTQDLQKNGDVIIQQIRSNDNVVD;Identity=0.59;Similarity=0.66;Relat_Length=0.813;Relat_Interruptions=1.5;Hit_to_DB_Length=0.83\n-scaffold146.1|size86774\tdante\tprotein_domain\t6810\t7049\t153\t+\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-PROT__REXdb_ID9702|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:6810-7049[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=LVDDGSKVNLLPYRVFQQMGIPEEQLVRDQAPVKGIGGVPVLVEGKVKLALTLGEAPRTRTHYAVFLVVKPPLSYNAILG;Query_Seq=LVDSGASCNLMSKRVMKQMGIPDEKLEFLDATLYAFDRRTIIPAGKIQLPVTLGEEERTRSEMVEFIIVDMDLAYNAILG;Identity=0.44;Similarity=0.62;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t7656\t8296\t.\t+\t.\tName=RT/INT;Final_Classification=Ambiguous_domain;Region_Hits_Classifications_=RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[246bp],INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[468bp]\n-scaffold146.1|size86774\tdante\tprotein_domain\t8756\t9241\t538\t+\t.\tName=RT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat;Region_Hits_Classifications=RT|Class_I|LTR|T
y3/gypsy|non-chromovirus|OTA|Tat|Retand[486bp],RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre[441bp];Best_Hit=Ty3-RT__REXdb_ID8210|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:8801-9241[90percent];Best_Hit_DB_Pos=27:173of173;DB_Seq=DFTDLNKACPKDSFPLPHIDRLVDSTAGNELLTFMDAFSGYNQIMMNPEDQEKTSFITDRGIYCYKVMPFGLKNAGATYQRLVNKMFHNHLGKTMEVYIDDMLVKSLKKEDHVKHLEECFDILNKYQMKLNPAKCTFGVPSGEFLGY;Query_Seq=DFKGVNKHCQPDPFPLPHIDRLVDAVAGSSLLSTMDAYSGYHQISLAREDQAKSSFLTEDGVFCYVVMPFGLRNAGATYQRLVNKIFADLLGKEMEIYVDDMIVKSLNDEDHIIYLSHCFEVCRTHRLKLNPAKCCFGVRSGKFLGY;Identity=0.63;Similarity=0.8;Relat_Length=0.85;Relat_Interruptions=0.0;Hit_to_DB_Length=0.85\n-scaffold146.1|size86774\tdante\tprotein_domain\t9433\t9781\t343\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=RH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-RH__REXdb_ID9729|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:9434-9772[97percent];Best_Hit_DB_Pos=1:113of149;DB_Seq=WTEECEEAFQKLKEYLGSPHLLVKPIQGEPLFLYLAVSEHATSSVLVREDDGVQRPIYYTSRALVDAETRYLSLEKIVLALIVSARRLRPYFQAHTIIVLTDQPIRQVLAKPD;Query_Seq=WTDQCDRAFKELKTYLASPPLIVSPTPTETLGLYLAVSEHAVSSVLVAERDGVQHPVYYVSHTLLPAESRYSTVEKFVLALLKSVAKLRHYFESRKVIVYTDQPIKAVLGQSD;Identity=0.58;Similarity=0.73;Relat_Length=0.758;Relat_Interruptions=0.0;Hit_to_DB_Length=0.76\n-scaffold146.1|size86774\tdante\tprotein_domain\t10810\t11667\t747\t+\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-INT__REXdb_ID9633|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:10819-11667[98percent];Best_Hit_DB_Pos=30:310of310;DB_Seq=RDTHQYVQRCIQCQKFAPLIHKPGEEMTIMSAPCPFAQWGIDLVGPFPQTAGRKKFFIVAVDYFTKWVEAEALSKITEDEVMHFIWKYICCRFGLPRSLVSDNGTQFNGKKIRAWCEEMKITQKFVAVAHPQANGQVESTNRTIVNGLKKRIDELGGSWVDELPSVLWSYRTSAKAATGETPFRLTYGTEAVIPVEVAMDTLRIATF--DEEANDGALRTRLDEIFDLREAAYLHMERSKNLIKARYDQGVRSRSFQIG
DLILRRADALKHTGKLEANWEGPY;Query_Seq=RDAMDCVRRCQSCQYFAPINRKPGAEI'..b'rus|OTA|Tat;Region_Hits_Classifications=RH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre[99bp],RH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[117bp];Best_Hit=Ty3-RH__REXdb_ID8372|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:10701-10817[100percent];Best_Hit_DB_Pos=279:317of317;DB_Seq=NREGTGRVVKWAIELSEFDLHFEPRHAIKSQALADFVVE;Query_Seq=NTDHTSRLAKWAIKVSAMDIAFEPRKAIKGQALADFVVE;Identity=0.64;Similarity=0.77;Relat_Length=0.123;Relat_Interruptions=0.0;Hit_to_DB_Length=0.12\n-scaffold146.1|size86774\tdante\tprotein_domain\t16797\t17666\t1057\t-\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6633|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:16812-17666[98percent];Best_Hit_DB_Pos=1:285of313;DB_Seq=HSHSYGGHFGAKRTAHKVLESGFYWPSIFKDAYHFCKSCEKCQRTGNITHKNQMPLTNILVSEIFDVWGIDFMGPFPSSFGNLYILLVVDYVSKWIEAKATRTNDAKVVLDFVRTHIFNRFGIPKAIISDRGTHFCNRSMEALLRKYHVTHRTSTAYHPQTNGQAEISNREIKSILEKIVQPNRRDWSLRLGDALWAYRTAYKSPIGMSPYRMIYGKACHLPVELEHKAFWAIKQCNMDYDAAGIARKLQLQELEEIRNDAYENARIYKEKTKNLHDRMLTRKEF;Query_Seq=HASDYGGHFGPNRTARRILDVGFYWPSIFRDVYQFCRTCDACQRVGNITNRREMPQNYILANEIFDIWGLDFMGPFPQSQGNNYILVAVDYVSKWVEAIPTRTDDGKTVTEFLRKNIFTRYGVPKAIISDRGTHFCNSTMRAMMKKYNVIHKTTTAYHPQGNGQAEATNREIKSILEKVVNKKRSNWSQKLPDALWAYRTAYKTPIGTTPFRLIYGKHCNLPVGLEHKAYWAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKTYHDKKLLQQNF;Identity=0.61;Similarity=0.79;Relat_Length=0.911;Relat_Interruptions=0.0;Hit_to_DB_Length=0.91\n-scaffold146.1|size86774\tdante\tprotein_domain\t18554\t18811\t306\t-\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6693|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:18554-18802[96percent];Best_Hit_DB_Pos=231:313of313;DB_Seq=WALRLLNFDNNACGEKRK
LQLQELEEMRLNAYESSRIYKERTKAYHDKKLQRREFQPGQQVLLFNSRLRLFPGKLKSKWSGPF;Query_Seq=WAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKAYHDKKILQQNFREGQQVLLFNSKLRLFPGKLKSRWMGPF;Identity=0.65;Similarity=0.82;Relat_Length=0.265;Relat_Interruptions=0.0;Hit_to_DB_Length=0.27\n-scaffold146.1|size86774\tdante\tprotein_domain\t19158\t19478\t197\t-\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6659|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:19182-19448[83percent];Best_Hit_DB_Pos=216:304of314;DB_Seq=YGKPCHLPVELEHKAWWAVKQCNMELDVAGQHRxLQLQELEEIRNDAYESSxIYKEKTKAFHDKQILRKNFEVGQKVLIFHSRLKLFPG;Query_Seq=FGKQCKVLVGMEHENYWEIREMNYEEGADVEQKQMQLQKMDALKLEAYDNSRIDKEKLKAHHAKRILQQNCKKRQQVLIFDSKLKMFPG;Identity=0.42;Similarity=0.71;Relat_Length=0.283;Relat_Interruptions=0.0;Hit_to_DB_Length=0.28\n-scaffold146.1|size86774\tdante\tprotein_domain\t19976\t20212\t259\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-PROT__REXdb_ID6659|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:19976-20212[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=MLDLGASINVMPYSIYNSLNLGPMEETCIIIQLADRSNAYPKGVMEDVLVQVNELVFPADFYILKMEDELSPNPTPILLG;Query_Seq=MVDLGASINLMPYYIYSALKLGSLQGTAIIIKLADRSETHPEGVVKDVLAQVNNLVFPADFYVLKM-GEAENDDCPLLLG;Identity=0.62;Similarity=0.79;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t28912\t29124\t216\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=PROT|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-PROT__REXdb_ID2599|Class_I|LTR|Ty1/copia|Bianca:28912-29124[100percent];Best_Hit_DB_Pos=1:71of71;DB_Seq=CLADCATTHTILRDKRYFLELTLIKANVSTISGTTNLVEGSGRANIMLPNGTRFHINDALYSSKSRRNLLS;Query_Seq=CLVDSATTHTILKNMRYFTSFEKRDVNIATIVCEANIVEGSGRAVIVLPSGT
HIRIDDALYANKSRRNLLS;Identity=0.59;Similarity=0.7;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/single_fasta_filtered.gff3
--- a/test-data/single_fasta_filtered.gff3 Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,33 +0,0 @@\n-##gff-version 3\n-##-----------------------------------------------\n-##PIPELINE VERSION         : dante-rv-3081(adb2509)\n-##PROTEIN DATABASE VERSION : Viridiplantae_v3.0_pdb\n-##-----------------------------------------------\n-##CLASSIFICATION\tORIGINAL_COUNTS\tFILTERED_COUNTS\n-##Ambiguous_domain\t1\t0\n-##Class_I|LTR|Ty1/copia|Bianca\t6\t5\n-##Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila\t7\t5\n-##Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat\t3\t2\n-##Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand\t4\t2\n-##-----------------------------------------------\n-##SEQ\tDOMAIN\tCOUNTS\n-##scaffold146.1|size86774\tINT\t3\n-##scaffold146.1|size86774\tPROT\t4\n-##scaffold146.1|size86774\tRH\t3\n-##scaffold146.1|size86774\tRT\t3\n-##scaffold146.1|size86774\taRH\t1\n-##-----------------------------------------------\n-scaffold146.1|size86774\tdante\tprotein_domain\t976\t1289\t293\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=RH|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-RH__REXdb_ID2558|Class_I|LTR|Ty1/copia|Bianca:976-1289[100percent];Best_Hit_DB_Pos=26:134of134;DB_Seq=ISWRSVKQTITATSSNHAELLALHEASRECVWLRSMIQHIQKNCG-LSSGRMDATIIYEDNTACIAQLKEGYIKGDRTKHISPKFF-FTHDLQKDGDISIQQIRSCDNLAD;Query_Seq=ISWRSTKQTIVAISSNHVELLAIHDTSRECVWLRFMIESI-----\\IMXXXXXXXXXXXXXXXXXXQLKE*YIKCDRTKHISPKFF\\FTQDLQKNGDVIIQQIRSNDNVVD;Identity=0.59;Similarity=0.66;Relat_Length=0.813;Relat_Interruptions=1.5;Hit_to_DB_Length=0.83\n-scaffold146.1|size86774\tdante\tprotein_domain\t6810\t7049\t153\t+\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-PROT__REXdb_ID9702|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:6810-7049[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=LVDDGSKVNLLPYRVFQQMGIPEEQLVRDQAPVKGIGGVPVLVEGKVKLALTLGEAPRTRTHYAVFLVVKPPLSYNAILG;Query_Seq=LVDSGASCNLMSKRVMKQMGIPDEKLEFLDATLYAFDR
RTIIPAGKIQLPVTLGEEERTRSEMVEFIIVDMDLAYNAILG;Identity=0.44;Similarity=0.62;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t8756\t9241\t538\t+\t.\tName=RT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat;Region_Hits_Classifications=RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[486bp],RT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre[441bp];Best_Hit=Ty3-RT__REXdb_ID8210|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:8801-9241[90percent];Best_Hit_DB_Pos=27:173of173;DB_Seq=DFTDLNKACPKDSFPLPHIDRLVDSTAGNELLTFMDAFSGYNQIMMNPEDQEKTSFITDRGIYCYKVMPFGLKNAGATYQRLVNKMFHNHLGKTMEVYIDDMLVKSLKKEDHVKHLEECFDILNKYQMKLNPAKCTFGVPSGEFLGY;Query_Seq=DFKGVNKHCQPDPFPLPHIDRLVDAVAGSSLLSTMDAYSGYHQISLAREDQAKSSFLTEDGVFCYVVMPFGLRNAGATYQRLVNKIFADLLGKEMEIYVDDMIVKSLNDEDHIIYLSHCFEVCRTHRLKLNPAKCCFGVRSGKFLGY;Identity=0.63;Similarity=0.8;Relat_Length=0.85;Relat_Interruptions=0.0;Hit_to_DB_Length=0.85\n-scaffold146.1|size86774\tdante\tprotein_domain\t10810\t11667\t747\t+\t.\tName=INT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand;Best_Hit=Ty3-INT__REXdb_ID9633|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:10819-11667[98percent];Best_Hit_DB_Pos=30:310of310;DB_Seq=RDTHQYVQRCIQCQKFAPLIHKPGEEMTIMSAPCPFAQWGIDLVGPFPQTAGRKKFFIVAVDYFTKWVEAEALSKITEDEVMHFIWKYICCRFGLPRSLVSDNGTQFNGKKIRAWCEEMKITQKFVAVAHPQANGQVESTNRTIVNGLKKRIDELGGSWVDELPSVLWSYRTSAKAATGETPFRLTYGTEAVIPVEVAMDTLRIATF--DEEANDGALRTRLDEIFDLREAAYLHMERSKNLIKARYDQGVRSRSFQIGDLILRRADALKHTGKLEANWEGPY;Query_Seq=RDAMDCVRRCQSCQYFAPINRKPGAEITLTELPCPFDRWGIDILGPFPQSVRQRRFCIVAVEYHSKWIEAEAVASITSEAVKKFVMNNIIVRFGCPRVLVSDNGPQFISDKFATFCEEYGIQQRTSSVYHPQTNGQAEASNKIILHGLRRNLDSLGGSWPDQLPHVLWAYRTTPKSSTGETPFSLVYGSEAVAPVESTIITPRIAAYMHTESANTEFRELDLDLLEERRNEVYGRVRKQQRALRKRYNQRVRPRQFEKGDLILRSVESQGHKGKLDRAWEGPY;Identity=0.49;Similarity=0.66;Relat_Length=0.906;Relat_Interrupt
ions=0.0;Hit_to_DB_Length=0.91\n-scaffold146.1|size86774\tdante\tprotein_domain\t14592'..b'PVMDATTFRFLISLAIEYGLDLQLMDVVTAYLYGSLDCEIYMKIPEGFHMPERYSSEPRTDYAIKLNKSLYGLKQSGRMWYNRLSEYLIKEGYKNNLVCPCVFMKKFENEFVIIAVYVDDINIVGTQKALLDAVNCLKREFEMKDLGRTKYCLGLQIEYLKNGIF;Identity=0.78;Similarity=0.91;Relat_Length=0.905;Relat_Interruptions=0.0;Hit_to_DB_Length=0.9\n-scaffold146.1|size86774\tdante\tprotein_domain\t27723\t28124\t581\t+\t.\tName=RH;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=RH|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-RH__REXdb_ID2558|Class_I|LTR|Ty1/copia|Bianca:27723-28124[100percent];Best_Hit_DB_Pos=1:134of134;DB_Seq=DAGYLSDPHHGRSQTGYLFTSGNTAISWRSVKQTITATSSNHAELLALHEASRECVWLRSMIQHIQKNCGLSSGRMDATIIYEDNTACIAQLKEGYIKGDRTKHISPKFFFTHDLQKDGDISIQQIRSCDNLAD;Query_Seq=DAGYRSDPHNGRSQTGYVFLNKGAAISWRSTKQTIAATSSNHAELLAIHETSRECVWLRSMIESIYNACGLFTDKMPPTVLYEDNSACIIQLKEGYIKGDRTKHISPKFFFTHDLQKNGEVIIQQIRSSDNVAD;Identity=0.75;Similarity=0.84;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t10299\t10658\t303\t-\t.\tName=aRH;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat;Region_Hits_Classifications=aRH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand[360bp],aRH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Ogre[360bp],aRH|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|TatII[360bp];Best_Hit=Ty3-aRH__REXdb_ID9546|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Tat|Retand:10299-10658[100percent];Best_Hit_DB_Pos=1:121of121;DB_Seq=WILHVDGASSKQGSGIGIRLQSPYGEVIEQSFCLAFNASNNEAEYESLLAGLRLAVGIGVTKLRAFCNSQLVANQFSGDYEAKDSRMEAYLAQVQELSKKFLSFELARIPRSENSAADSLA;Query_Seq=WNMYIDG-STQSGAGVGVHYITPYGDWINLAVKLQFPATNNVAEYEALLAGMNFALSLGVTRLKTFSDSQLVVEQFSGHFQAKEPMLEAYKSRSQLLAAKFSEFSLEHIPRESNRAADSLA;Identity=0.49;Similarity=0.7;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t16797\t17666\t1057\t-\t.\tName=INT;Final_Classification=Class_
I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=INT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-INT__REXdb_ID6633|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:16812-17666[98percent];Best_Hit_DB_Pos=1:285of313;DB_Seq=HSHSYGGHFGAKRTAHKVLESGFYWPSIFKDAYHFCKSCEKCQRTGNITHKNQMPLTNILVSEIFDVWGIDFMGPFPSSFGNLYILLVVDYVSKWIEAKATRTNDAKVVLDFVRTHIFNRFGIPKAIISDRGTHFCNRSMEALLRKYHVTHRTSTAYHPQTNGQAEISNREIKSILEKIVQPNRRDWSLRLGDALWAYRTAYKSPIGMSPYRMIYGKACHLPVELEHKAFWAIKQCNMDYDAAGIARKLQLQELEEIRNDAYENARIYKEKTKNLHDRMLTRKEF;Query_Seq=HASDYGGHFGPNRTARRILDVGFYWPSIFRDVYQFCRTCDACQRVGNITNRREMPQNYILANEIFDIWGLDFMGPFPQSQGNNYILVAVDYVSKWVEAIPTRTDDGKTVTEFLRKNIFTRYGVPKAIISDRGTHFCNSTMRAMMKKYNVIHKTTTAYHPQGNGQAEATNREIKSILEKVVNKKRSNWSQKLPDALWAYRTAYKTPIGTTPFRLIYGKHCNLPVGLEHKAYWAIREMNFEEGGDAELRQMQLQELDALRLEAYDNSRIYKERLKTYHDKKLLQQNF;Identity=0.61;Similarity=0.79;Relat_Length=0.911;Relat_Interruptions=0.0;Hit_to_DB_Length=0.91\n-scaffold146.1|size86774\tdante\tprotein_domain\t19976\t20212\t259\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Region_Hits_Classifications=PROT|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila;Best_Hit=Ty3-PROT__REXdb_ID6659|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila:19976-20212[100percent];Best_Hit_DB_Pos=1:80of80;DB_Seq=MLDLGASINVMPYSIYNSLNLGPMEETCIIIQLADRSNAYPKGVMEDVLVQVNELVFPADFYILKMEDELSPNPTPILLG;Query_Seq=MVDLGASINLMPYYIYSALKLGSLQGTAIIIKLADRSETHPEGVVKDVLAQVNNLVFPADFYVLKM-GEAENDDCPLLLG;Identity=0.62;Similarity=0.79;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n-scaffold146.1|size86774\tdante\tprotein_domain\t28912\t29124\t216\t-\t.\tName=PROT;Final_Classification=Class_I|LTR|Ty1/copia|Bianca;Region_Hits_Classifications=PROT|Class_I|LTR|Ty1/copia|Bianca;Best_Hit=Ty1-PROT__REXdb_ID2599|Class_I|LTR|Ty1/copia|Bianca:28912-29124[100percent];Best_Hit_DB_Pos=1:71of71;DB_Seq=CLADCATTHTILRDKRYFLELTLIKANVSTISGTTNLVEGSGRANIMLPNGTRFHINDALYSSKSRRNLLS;Query_Seq=CLVDSATTHTILKNMRYFTSFEKRDVNI
ATIVCEANIVEGSGRAVIVLPSGTHIRIDDALYANKSRRNLLS;Identity=0.59;Similarity=0.7;Relat_Length=1.0;Relat_Interruptions=0.0;Hit_to_DB_Length=1.0\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/test_seq_1
--- a/test-data/test_seq_1 Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,34 +0,0 @@
->test_seq_1
-tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
-TTGCAAAAAGAGAAGATTGGTAAAACGGGGGGGAATGTGTGAGTTGTAATGGGTTCATCT
-GCTCATGTTACTGCAGGGAGGACAGGTCCCGAACCCATTTCCAGCCTTTGGCCTGTCCCT
-CGTCCCTTACCGATCAAAGATACAACCCGGCGGAAGACCTATGGACTCTTACGGTATGCA
-AGCTAGTTCCTTTAGATTAGATTACAATGTTTACTTTTGTTATTTCTAGTTGGCAAGAGT
-GTTCCCCATAGTGTAGTTCCTTGAGCGGTAACAGTGAGCGAACCGTGAGAGCATGCATAC
-TGTTTCGGACGAGCTGTTTTACAGGTTTCCAAAGACTGTTTCTTTGCGATCCGACCAGAA
-GCTACTGCATGTTCGAGGACGAACATGGGGCAAGTGTGGGGATGTTTGGAGGTATGTTAT
-TTTTGTATGTTTTTAGCTATGCATTTTAGGCCTCTTGATAGTGGTTAGAGTTGATTTTGA
-tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
-CCGCTTCCGCACCAGGCGAGCTTAGCTCGTGTGTTCCTTCCTGGCGGACGCCTCAGAGGG
-AGATCATGTTGGTGATCAATGTTGGGGTGGGATAGGAGTTTACTCGTGGGCTATGTTGGA
-CCTCTCTTTGGGACATGGTCCAGAGCTAATTGGTCGTCTCTAGGAGGGCCAGATGACAGT
-tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
-ACCTGAAGGTGTACAACCGGTTGGTTATAAGTGGGTTTTTGTGAGAAAACGAAATGATAA
-AGGAGAAATATCTCGGTATAAGGCGAGATTAGTAGCTCAAGGGTTTTCTCAAAGGCCAGG
-AATTGATTATGATGAAACCTATTCACCGGTTATGGATGCCACAACTTTCAGGTTTTTGAT
-AAGTCTGGCGATTGAATATGGGCTTGATTTACAACTGATGGATGTTGTAACAGCATACTT
-ATATGGGTCACTGGATTGTGAAATATATATGAAAATCCCTGAAGGGTTTCATATGCCTGA
-ACGATATAGTTCTGAACCCCGTACCGATTATGCGATTAAATTGAATAAATCCCTGTATGG
-ATTAAAGCAGTCAGGACGAATGTGGTATAACCGTCTAAGTGAATACTTGATTAAAGAGGG
-tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
-tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
-ACAAGGTGGCGACAGTGGAACATGGCCCGATCGAGGACCAGCGTGAAGTCACGCATAACA
-TGGAACCAATCGGGTACAAGAACGTTTCACTATCCTCTTCCGACGGAAGGAAGAACGTCA
-AGATTGGGGTGCAAATGCCCCCAAATATCGAAGAACAACTCATCCAGGTCTTGACAGAGT
-ATCAAGACATCTTTGCTTGGGACATCTCCGAGGTCCCTGGAATTGATCGGTCACTGATGG
-AACATCGCATCAATACCGATCCTGAGGCCGTGCCCGTTCGACAAAAAAGGAGACGCTTCT
-CTCACAATCTGTGATCAACCCGAGAGAAAATGTAAGCGCTGTAACGCTGAGGAGTGGAAA
-GGTTGCAGATGAAGCAATCCAGAAGAAGAGGAAATCGCCTAAAGAAGCAGCAACAAAACC
-AGAGGATGAGAAGGAGGTCGAAGCTGCTGAACCAGTGACAGAACCCACTGCCAAGAAACA
-GAAAGAACCAGAGGTCGAGGTAACAAAGGAAAAATCTGTTATTAAACCTTACTATGAACT
-TCCACCTTTTCCAGGGAGGTTCAAGCTGGAGAAAAAGCAAGAGGAAGAGAAGGAGTTGAT
diff -r 1eabd42e00ef -r e2bbc79f0fac test-data/vyber-Ty1_01.fasta
--- a/test-data/vyber-Ty1_01.fasta Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
b'@@ -1,1015 +0,0 @@\n->Acoerulea195_58_rc\n-CTTTCTGAAACCGGCACGAAGTGGTTTGACATCAATTTTGTGACAAAATGTCTTTGGTTA\n-TCTCGCCTTGAATTCATTGTTAGGAAAATCAAGATGTACGCATGGTCGTCTCGTGTCCTT\n-GTGTCTCTTAGGTCTGTCCAGGAAAAGTTATTAGGTTCTGGACGTACGTACCGAGTTAGG\n-ACGTACGTACGTTCACTGAAGCAGCAAACATCTTTTGGTCAGGACGTACGTACTCAGTTA\n-GGACGTTCGTACGTCCAAGTAGGCAGCAGACGTGTTTTCGTATGGAGACGTACGTACCGA\n-TTTAGAGCGTTCGTACGTCATGTTAGGCAGCAGGCTTATTTTTGTGTGGCGACGTTCGAA\n-CGTCATATCGACGTACGAACGTCCAAAGTCGGTTTGCACTGTTTCGGCCCATTTTCCTAG\n-GTTTTCAACCGTAAGTATAAATAGGGTTTCTCTTCCATAGAACAAGATAGTTCTTTGGCC\n-GTCCATCCTTCTTCTTGTTCATGTGTTTTGGTTAGGTTTTTGATATTGAACTTGTTTGTG\n-GCTATTGAACTCAATTCCCTTTCTTCCCTCTTTTCTCAATTCATATCTTTGTGATCTTCA\n-AGAGTTATTGTGAATTCGGCTTCGAGTCTTCTTGTGTGTTCATCAAGAAGTCGAGTGTCT\n-TGTGCGTACATCAAGGTGCTCTGGTGAAGTCAAGGTGTTAGAGATAACCAGGAGTATCAA\n-CTCGCAGTGAGTGCAGGTAGTTGATAGGTTTGTACATCTCTTTTCATCTAGTGGATTATT\n-TTGTTGTCGCGAGGACAACCGTGGACGTTTCCCTGTTTGGGTTTTACCACGTTAAAAATC\n-TCTGTGTGTTATTTACTTTCTGCGTATTGTTTGTTTGCTTGTTTGAATTCACATTTGTTA\n-TACTAGCACTATTTGTAGTTTCTCAATTGGTATCAGTCGCTGGGGGCTGGTTAAGGAGCG\n-TAAAGGCTCACGCTAACGATCAATAGTGCTATGGATTACTCTGGTAAGACACATGTCTCG\n-GTGGTTGCTCGACTCTCTGATGAGCAGTCCATTGACAGTAAAGAATGCTCTAGTAATTGT\n-GACAGGGATTTGATTCTGGTGGACAATAATGATTGGGCTACCCTATACCGGTTATCCAGA\n-AAGGAATGCCTAAGGTTGGAGGCACAAAATTGTATTCTTCAGGATAGACTTGACTTATTG\n-ACCTCTAGTTCCTCATCTATGTCAATCCTGGATAGGGAACCTGTATCATGGCAGGTTGAT\n-AAGGTGGCTTTACTGGGTAATCTTGCTGCTCTTGAGCATGACAAGGTCGAATGGGAAGCT\n-CGGTATAAGATTGTGTCCTCAGAACTAGACAAGGTCAAAAAGGAGTTGATTCGCTTTCAG\n-TCGTTCGAAAAGTGCAATGGGTTGTCATCTTCTGAATTACCTCCACTGTTGCCCCATCCT\n-CCTACCAAAAAGTCCGACATTCATCACTCTGTGGACACGGTTGGCATCCGGCGTAAGTGG\n-AAACTGTTCAGGCGTGTTCCGCCTCAGGGAAGGAAACGTCAAATGCCGTGCTCACTATGC\n-GGACAGTTTGGACATTGGGCCTCGCAGTGTGGTGTGGCTGCTCATGTTCCACAATCATGG\n-AAGCCGTTTTCACAACCGTCGTATGATTATCCATCTTATCCATATGCCTCTTACTCTTTT\n-CCTTGTAACTTTGCAGGAAATGTTTCAAATGTGGCGATTACTGCCCTTCATGTCTCAAGC\n-TCGGATAGTGAGTGGTTACTTGATAGTGGAGCTTCAAAACATATGTCAGGTAATGCCAAA\n-CTTTTCTCCTCCGTTACTGCTATAGATGGTGGAAGTGTTACTTTTGGGAATGGTAAGAGC\n-TC
TCCTGTGATTGGTAAGGGATTTGTCGCTGGTATTGGTTTATCTCCGAATGATGTTTGT\n-TTGTTAGTTGATGGTTTGCGTGTAAATTTGATCAGTATTAGCCAACTGTGTGATACTGAC\n-CATACTGTTAATTTTTCCAAAAATATATGTACCGTGCTTGATAGTTTGGGTAAGTGTATC\n-ATGACAGGTAAACGAACATTGGATAATTGCTATGCCATTCAGCCTGTTACATCTAGCATG\n-AACTGCTTACCTAGCAAACTAAATGAGGGTCTATTGTGGCATCAACGACTCGGTCATGTC\n-AACTTTGAACACCTGGACAATCTAACTCGGAATGAATATATTAAGGGAGTTCCTAGACTT\n-GGAAGAAATCGAGACACTGTGTGTGGTGGGTGTCAACTAGGTAAACAGATACGTAGTCCA\n-CATTCCAAGAAAAAATCCATAACCACATCTTCTCCTTTAGAACTCATACACATGGATCTG\n-ATGGGTCCTACTCGTACTCCTAGTCTAGGAGGCAAACGATACATCTTGGTTATGGTTGAT\n-GACTACACTCGCTTTACCTGGGTATCATTCTTGCGTGAAAAATCTGATGCGTTTCTTGAG\n-TTTCAGGGGATATGCCTTCGTATTCAGAACGAGAAAGATACTCAAATTAAACATATCAGA\n-AGCGATAGAGGTGGTGAGTTCACAGCCACAGGTGTGATTGAGTATTGTATTGCAAATGGT\n-ACATGGCAAGAATTTTCGGCTCCATACACTCCGCAACAAAACGGAGTCGCTGAAAGGAAA\n-AATCGTGTTATTCAGGAGATGGCTCGTGCTATGTTGCATGCAAAGGATGTTCCGACCAAG\n-TTTTGGGCGGAAGTGGTTCATACTGCTTGTTACATAATGAACCGTGTATATCTAAGATCT\n-GGTACCACACAAACTGCTTATGAGCTATGGTATGGTAAGAAGCCGAATCTCAAATATATG\n-CGAGTTTTTGGTAGTGTGTGCTATGTGTGCAAGGACAGACAAAGTCTGTCCAAGTTTGAT\n-AGTCGAGGTGAAGTAGCTCTTTTACTTGGTTATTCTTCTAACAGTAGAGCCTTTCGAGTG\n-TTTAACTACACCACTCGCAAGGTCATGGAATCCTTTAATGTTGTTGTTGATGACACTATT\n-ACATCTGACTCTTCTGTTTCCACTGGTACACAGGATGTCACAGTTCTCTCACCCGTGTCA\n-GACCCGGCTGACATGTCTTCCATATCGTTATCATCACCTGATAATGGCAATGGTGGTACT\n-AAACCTTCTGATGCTGCAGAGGACGTGCCTAGTAGGACCGGTGCTGTGCTCACACCGGAT\n-GATGTGGTTCAATCACCAGATGTGATTGATGTGTCTTCAGATCTCTCCACTGTCCCTGCT\n-GACCCTGAAAGGGTGTTTAATCTAGCTTCACCTCGTGTCAAACAATATCACTCCTTAGGA\n-GATATCATTGGGGATATTAATGATCAGCGTCTGACTCGTCGGAGGGCCAAGGAGACAAAT\n-TGTGTTCATTATGTTTGTTATCTCTCTTCTCTTGAACCTAAAAATGTTACTGATGCTCTT\n-ATTGATGATGATTGGCTAGTTGCTATGCAGGAAGAACTCGGTCAGTTTAAGCGTAGTGAT\n-GTCTGGACGTTGGTTCCTAGACCTACTCACACTAATGTGGTTGGCACCAAGTGGATCTTT\n-AAAAACAAGTTGGATGAGTTCGGACAGATTGTGCGCAACAAGGCAAGGCTCGTAGCTCAG\n-GGCTACAGTCAGATTGAAGGTATTGACTATGGAGAGACATTTGCTCCCGTGGCTAGGTTG\n-GAATCTGTCAGGCTTCTTCTTGCTATGGCATGCCACTTGAATTTCAAGTTGTATCAAATG\n-GATGTCAAAAGTGCATTTCTCAATGGTATTCTTAATGAGGAGGTCTATG
TTGAACAACCT\n-AAAGGGTTTGTGGATCACACTTTCCGAATCATGTCTTCAAATTGCAAAAAGC'..b'ACAGCCTTCCAACTA\n-TCAATTTCTCAAAATTATCAGATTTTGTGTGCACCGCATGTGCAACTGGAAAATTAATTA\n-TAAAACCATCTTATCTTAAAGTTAAAAATGAGTCATTAAATTTTCTTGAACGCATTCAAG\n-GAGATATATGTGGTCCAATTCAAGCACTATCAGGACCTTTTAGATATTTCATGGTGCTCA\n-TATATGCATCTACTAGATGGTCACATGTGTGTCTATTGTCCACACGAAATCATGCTTTTT\n-CCCAGCTTATTGATCAAATTATCAAATTAAGAGCAAATCATCCTAAAAATAGGATAAAAA\n-CAATTCGAATGGATAATGCCGCTGAATTTTCTTCACGTGCATTCAATGACTATTGCATGG\n-CTATGGGCATTCATTTAGAACATTTTGTGCCTTATGTTCATACTCAAAATGGTTTGGCTG\n-AATCTCTCATCAAAAGAGTAAAATTAGTTGCTCGACCACTATTACAGAATTGTAATTTAC\n-CAGCATCATGTTGGGCACATGCGGTATTACACGCCGCAGATCTGATACAAATCAGACCAA\n-CTGCATATCATACAACCTCCCCGCTACAACTAGTACGTAGCACTCAGCCAAGTATTTCCC\n-ATCTACGAAAATTCGGTTGCGCAGTATACGTACCGATATCACCACCGCAGCGTACATCCA\n-TGGGCCCCCACAGAAAACTAGGGATCTATGTGGGTTATAACTCTCCGTCAATAATAAAAT\n-ATCTTGAACCTCTTACAGGGGACCTGTTTACTGCCCGCTACGCTGATTCAATTTTTGATG\n-AGGACCATTTTCAGGCATTAGGGGGAGAATCAAACCACAAAGAATGCCAGGAAATAGATT\n-GGAATGTAACAGGCATTCAGTCCTTAGATCCACGTACTAAAGAATCTGAAACTGAAGTTC\n-AGAGGATCATAGATTTGCAACATATTGCAAATAATCTGCCAGATGCATTTACTGACCATA\n-AAGGTGTCACTAAATCACATATTCCCGCTGTTAATGCACCAGAACGAGTGGAGGTACCAA\n-CTAAAACCACTCAAACCACAAATGAGAGTAAGAGGGGGAGAAATCTGGTTAGTCGGAATA\n-TAGCTTCTCAAAAGCCTCCGCGGAAACAGAGGAAATCAAATCCTCTACCAGTAAATGCAA\n-TTCAACCTCAAGTTGAAGGACACCAACCAGATGCTCAACATCTTGAACCTAGCATAAATG\n-CGCATAAAAACATAATTGCTGGGACATCGGGACACCATGGTTCTATTGTTGTGGGAAATC\n-ACATAGAGTCTGAAGGTATAAAAGAAATTTCCATAAACTATACAGATTCAGGAGAATCAT\n-ATAATAGAGAGACTCCAATTGTCGACATATATTTCGCCTCTAAAATTGCTGAAACCCTTC\n-AAGTGGATCCAGAACCAAAGACCGTCAGGGAGTGCCTCAAGCGTCCTGATTGGCCTAAAT\n-GGAAGGAAGCAATTGAGGCAGAAGTGCGCTCGCTCAACAAAAGAGAGGTATTTTCCTCGG\n-TAATACCTACTCCTCATAATGTATTCCCTGTTGGAGCAAAATGGGTTTTTGTTCGAAAAA\n-GGAATGAAAACAATGAGGTGGTGAGATACAAAGCGAGGCTTGTAGCACAAGGGTTCACGC\n-AGAGGCCCGACATCGATTACGATGATACATACTCTCCTGTAATGAGTGGAATAACGTTTC\n-GATACTTAATATCTTTGGCAGTACAAATGAATTTATCTATGCAGTTGATGGATGTAGTGA\n-CAACATACTTATATGGGTCACTCAAATCGGACATATATATGAAAGTCCCTGAATGACTTA\n-AAATGTCGAATCCAAAAGAAA
ATCGCAACGCATATTGTGTAAAATTACAAAAGTCACTAT\n-ATGGCTTAAAACAATCGGGTAGAATGTGGTATAACCGATTGAGTGAGTTCCTTATTCAAA\n-AAGGCTACTCAAATAATGATGATTGCCCTTGTGTATTGATAAAGAAATCCTCAAATGGAT\n-TTTGCATCATCTCAGTGTACGTTGATGACCTCAATATCATGGGAAGTACACCTGATATCG\n-AAGAAGCACACAATCATCTAATGGCGAATTTGAGATGAAAGATTTGGGAAAGACCAAATT\n-CTGCTTAGGCTTACAGCTTGAGCATCTTCCCTCGGGAATTTTAGTATACCAACCTGCATA\n-TATTCAAAAGGTTTTGGAAAATTTTAATATGGATAAATCATATCCAACCAAAACACCCAT\n-GGTTGTCAGATCCCTTGATATGAATAAAGATCCTTTTAGACCTCGGGATGATGACGAAGA\n-GATATTAGGACCTGAGTTCCCGTATCTCAGTGCCATTGGTGCGTTAATATACCTTGCAAA\n-TTGCACCAGGCGTGATATTGCATTTACAGTGAATTTACTAGTTAGACATAGCGTTGCTTC\n-ATCGTAACGTCATTGGACGGGAGTAAATAATATCCTTAGATATTTACATGGCACAAAGGA\n-TCTTGGCTTATTCTATCAGATAAACCAAGATATGACTATGGTANGATATACTGATNGCTG\n-CTATCTATCTGATCCTCACAATGTCAGGTCACAAACAGGTTTCGTTTTCTTATATGGTGG\n-AACTGCTTTTTCATGGAAGTCAACAAAACAGACTCTCCTAGCAACCTCCACTAATCATTC\n-CTGAACTTGTTGCATTTTTTGAAGCATCTTAAGATTGTGTATGGCTTCGCAGGATGATTA\n-ACCCTATTCAAACTTCATGTGGTGTTGGTTCATTAGGATCACCAACTATTATATATGAAG\n-ATAATGCAGCCTCGCCATTGTCTCAAAATGCAAATGTGGTTTATGTTAGAAAGTAATATC\n-CCCACACCTATATTCTTCCTAAGGTTATTTTAATCCTCAGTGCATTACAGAAGGGATGGA\n-GAAATTTGATATTTTCCCAAATTAAATCATGTGCCAATTTAGCAGATTTGTTCCCCAAGT\n-TTTTTCCAAATTCAACGTTCCAGAAATCCATTCATGGAAATTGGTATAGAGATGATTCCC\n-GAGATTTGCAAAGTTCAGGGGGAGAAATCTCCCTGAAAATATACCCGTTTAATTATCATC\n-AGGTAATGAATATTGTACTCTTTTCCTTTATGAGTTTTTCCAACAGGGTTTCTCATATAA\n-GGTTTTTAACGAGACAATTAAATACAAGTATTGATGCATGCCATATCATATTTCTCCTTA\n-TATTTTTCCTACTAGGTTTTAAAGGAGTTTTTTATGGCACATCTCATTGCACTCTTTTCA\n-TTATGAGTTTTTTTGACATTTTCTCTCATAATGTTTTTAATGAAGCCATATCTTATCAAT\n-GATCATATATCATACTTTCTATTTTCCCTATCGGGGTTTTAAAGGAAGTACTCAAGACAT\n-ATATTGTTCTCTAAACTCAAAAATGAGTTTTATCCCTATATAAAGGTTTTCTCAAATGAG\n-TTATCATGAGGCAATAATCATTATATGTTGCACAATTTTTTCCTTATTATTTTTCCACTG\n-GGTTTAAAGGAGTTTTAGCAACATATCTACACTATTGTCCTTATATTTTTTCCACAGGGT\n-TTTTGGAGGAGACTTTAAAGATTATACAACGACTTTTCAAGATGAAGATGAGGAACATTC\n-TTAAAGAGAAAAATTTACAAGGATTATTATTTATCAAGATGATGCACATTTACACAGACA\n-AGCATGGATTAGGGAGAGTGTTAGGAATTAATTAAATGTATTAATTAATGGGATAATCCC\n-TGCGT
TGCCGGTTGCCTTGTTTTAGCAACCGTTCCTTGTAAACCGCCTCTGTAACAAGGG\n-TATAAATACCCACATCTTCAATCAATGAAAACACTGTTCCATCATTCTGTCACTTTTACT\n-ACTTTACACTCTA\n'
diff -r 1eabd42e00ef -r e2bbc79f0fac tests.sh
--- a/tests.sh Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,25 +0,0 @@
-#!/bin/bash
-
-export DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
-export TEXT_DATA="$DIR/test-data"
-export classification_tbl=${DIR}/tool-data/protein_domains/Viridiplantae_v3.0_class
-export pdb=${DIR}/tool-data/protein_domains/Viridiplantae_v3.0_pdb
-
-# make sure  dir for testing exists
-mkdir -p $DIR/tmp
-
-######## DANTE
-## single_seq, for/rev strand of mapping
-$DIR/dante.py -q ${TEXT_DATA}/GEPY_test_long_1 -pdb $pdb -cs $classification_tbl \
-              --domain_gff $PWD/tmp/single_fasta.gff3
-## multifasta
-$DIR/dante.py -q ${TEXT_DATA}/vyber-Ty1_01.fasta -pdb $pdb -cs $classification_tbl \
-              --domain_gff $PWD/tmp/multifasta.gff3
-## multifasta_win
-$DIR/dante.py -q ${TEXT_DATA}/vyber-Ty1_01.fasta -pdb $pdb -cs $classification_tbl \
-              -wd 3100 -od 1500 --domain_gff $PWD/tmp/multifasta_win.gff3
-
-# test filtering
-$DIR/dante_gff_output_filtering.py --dom_gff $PWD/tmp/single_fasta.gff3 \
-                                   --domains_filtered $PWD/tmp/single_fasta_filtered.gff3 \
-
diff -r 1eabd42e00ef -r e2bbc79f0fac tool-data/rexdb_versions.loc.sample
--- a/tool-data/rexdb_versions.loc.sample Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,6 +0,0 @@
-#name value is base name for file with classification and pdb
-Viridiplantae_version_3.0 Viridiplantae_v3.0
-Viridiplantae_version_2.2 Viridiplantae_v2.2
-Metazoa_version_3.1 Metazoa_v3.1
-Metazoa_version_3.0 Metazoa_v3.0
-
diff -r 1eabd42e00ef -r e2bbc79f0fac tool-data/select_domain.loc.sample
--- a/tool-data/select_domain.loc.sample Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,14 +0,0 @@
-All
-GAG
-INT
-PROT
-RH
-RT
-aRH
-CHDCR
-CHDII
-TPase
-YR
-HEL1
-HEL2
-ENDO
diff -r 1eabd42e00ef -r e2bbc79f0fac tool_dependencies.xml
--- a/tool_dependencies.xml Fri Apr 03 07:27:59 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,9 +0,0 @@
-<?xml version="1.0" ?>
-<tool_dependency>
-    <package name="rexdb" version="1.0">
-        <repository changeset_revision="ac89c185fbd0" name="package_rexdb_1_0" owner="petr-novak" prior_installation_required="True" toolshed="https://toolshed.g2.bx.psu.edu"/>
-        <readme>
-      prepare rexdb database for dante
-    </readme>
-    </package>
-</tool_dependency>
\ No newline at end of file