Repository 'profrep'
hg clone https://toolshed.g2.bx.psu.edu/repos/petr-novak/profrep

Changeset 0:a5f1638b73be (2019-06-26)
Next changeset 1:bef8eb5ab4b2 (2019-06-26)
Commit message:
Uploaded
added:
README.md
configuration.py
configuration_for_testing.org
contributors.txt
dante.py
dante.xml
dante_gff_output_filtering.py
dante_gff_output_filtering.xml
dante_gff_to_dna.py
dante_gff_to_dna.xml
dependencies/dante/1.0.0/env.sh
dependencies/profrep/1.0.0/env.sh
domains_data/blosum80.txt
download_profrep_databases.sh
env.sh
extract_data_for_profrep.py
extract_data_for_profrep.xml
gff.py
gff_select_region.xml
gff_selection.py
profrep.py
profrep.xml
profrep_db_reducing.py
profrep_db_reducing.xml
profrep_masking.py
profrep_masking.xml
profrep_refine.xml
profrep_refining.py
shed.yml
test_data/GEPY_cluster_annotation
test_data/GEPY_test_long_1
test_data/classification.csv
test_data/hitsort_PID90_LCOV55.cls
test_data/proteins_all
test_data/reads_all
test_data/test_seq_1
test_data/vyber-Ty1_01.fasta
testing.sh
tool-data/prepared_datasets.txt
tool-data/rexdb_versions.txt
tool-data/select_domain.txt
tool_config_profrep.xml
tool_dependencies.xml
tool_dependencies.xml.delete
visualization.py
diff -r 000000000000 -r a5f1638b73be README.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,519 @@
+# REPEATS ANNOTATION TOOLS FOR ASSEMBLIES #
+
+
+## 1. PROFREP ##
+*- **PROF**iles of **REP**eats -*
+
+The ProfRep main tool engages outputs of RepeatExplorer for repeat annotation in DNA sequences (typically assemblies, but not necessarily). Moreover, it provides repetitive profiles of the sequence, pointing out the quantitative representation of individual repeats along the sequence as well as the overall repetitiveness.
+
+### DEPENDENCIES ###
+
+* python 3.4 or higher with packages:
+    * numpy
+    * matplotlib
+    * biopython
+* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279690/) or higher
+* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
+* [cd-hit](http://weizhongli-lab.org/cd-hit/)
+* [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**
+
+* ProfRep Modules:
+    * gff.py
+    * visualization.py
+    * configuration.py
+    * protein_domains.py
+    * domains_filtering.py
+
+* ProfRep databases
+
+There are precompiled ProfRep annotation datasets for a limited number of species. The list of species can be found in the file [prepared_datasets.txt](tool_data/prepared_datasets). The databases include large files and must be downloaded from our website:
+
+    cd tool_data
+    wget http://repeatexplorer.org/repeatexplorer/wp-content/uploads/profrep.tar.gz
+    tar xzvf profrep.tar.gz
+
+
+#### INPUTS ####
+
+* **DNA sequence(s) to annotate** [multiFASTA]
+
+* **Species-specific dataset** available from the RepeatExplorer archive, consisting of:
+
+    * NGS read sequences [multiFASTA]
+        * in RE archive: *seqclust -> sequences -> sequences.fasta*
+    * CLS file of clusters and their reads [multiFASTA]
+        * in RE archive: *seqclust -> clustering -> hitsort.cls*
+    * Classification table [TSV, CSV]
+        * in RE archive: *PROFREP_CLASSIFICATION_TEMPLATE.csv* (automatic classification)
+
+
+#### OUTPUTS ####
+
+* **HTML summary report, JBrowse Data Directory** showing basic information and repetitive profile graphs as well as protein domains (optional) for individual sequences (up to 50). This output also serves as a data directory for the [JBrowse](https://jbrowse.org/) genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using the Galaxy-integrated tool. This output can also be downloaded as an archive containing all relevant data for visualization via a locally installed JBrowse server (see more about visualization in OUTPUT VISUALIZATION below)
+* **Ns GFF** - reports regions of unspecified (N) bases in the sequence
+* **Repeats GFF** - reports repetitive regions of a certain length (**80** by default) above the hits/copy numbers threshold (**3** by default)
+* **Domains GFF** - reports protein domains, domain classification, chain orientation and alignment sequences
+* Log file
+
+
+### Running ProfRep ###
+
+    usage: profrep.py [-h] -q QUERY -rdb READS -a ANN_TBL -c CLS [-id DB_ID]
+                  [-bs BIT_SCORE] [-m MAX_ALIGNMENTS] [-e E_VALUE]
+                  [-df DUST_FILTER] [-ws WORD_SIZE] [-t TASK] [-n NEW_DB]
+                  [-w WINDOW] [-o OVERLAP] [-pd PROTEIN_DOMAINS]
+                  [-pdb PROTEIN_DATABASE] [-cs CLASSIFICATION] [-wd WIN_DOM]
+                  [-od OVERLAP_DOM] [-thsc THRESHOLD_SCORE]
+                  [-thl {float range 0.0..1.0}] [-thi {float range 0.0..1.0}]
+                  [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
+                  [-mlen MAX_LEN_PROPORTION] [-lg LOG_FILE] [-ouf OUTPUT_GFF]
+                  [-oug DOMAIN_GFF] [-oun N_GFF] [-hf HTML_FILE]
+                  [-hp HTML_PATH] [-cn COPY_NUMBERS] [-gs GENOME_SIZE]
+                  [-thr THRESHOLD_REPEAT] [-thsg THRESHOLD_SEGMENT]
+                  [-jb JBROWSE_BIN]
+
+
+    optional arguments:
+      -h, --help            show this help message and exit
+
+    required arguments:
+      -q QUERY, --query QUERY
+                            input DNA sequence in (multi)fasta format (default:
[...]
+...unts extracted for individual lineages
+
+**- For GALAXY usage, all concatenated in a single fasta file**
+
+#### USAGE ####
+        usage: extract_domains_seqs.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
+            CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]
+
+        optional arguments:
+          -h, --help            show this help message and exit
+          -i INPUT_DNA, --input_dna INPUT_DNA
+                                path to input DNA sequence
+          -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
+                                GFF file of protein domains
+          -cs CLASSIFICATION, --classification CLASSIFICATION
+                                protein domains classification file
+          -out OUT_DIR, --out_dir OUT_DIR
+                                output directory
+          -ex EXTENDED, --extended EXTENDED
+                                extend the domain edges if not the whole database
+                                sequence was aligned
+
+#### HOW TO RUN EXAMPLE ####
+    ./extract_domains_seqs.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True
+
+### GALAXY implementation ###
+
+#### Dependencies ####
+
+* python 3.4 or higher with packages:
+    * numpy
+    * matplotlib
+    * biopython
+* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279671/) or higher
+* [LAST](http://last.cbrc.jp/doc/last.html) 744 or higher:
+    * [download](http://last.cbrc.jp/)
+    * [install](http://last.cbrc.jp/doc/last.html)
+* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
+* [cd-hit](http://weizhongli-lab.org/cd-hit/)
+* [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**
+
+#### Source ####
+
+https://nina_h@bitbucket.org/nina_h/profrep.git
+
+branch "cerit" --> only Pisum Sativum Terno in the prepared annotation datasets
+
+branch "develop"/"master" --> extended internal database of species (not published, or for internal purposes)
+
+#### Configuration ####
+
+Add tools
+
+    <section name="Assembly annotation" id="annotation">
+        <label id="profrep_prepare" text="ProfRep Data Preparation" />
+            <tool file="profrep/extract_data_for_profrep.xml" />
+            <tool file="profrep/db_reducing.xml" />
+        <label id="profrep_main" text="Profrep" />
+            <tool file="profrep/profrep.xml" />
+        <label id="profrep_supplementary" text="Profrep Supplementary" />
+            <tool file="profrep/profrep_refine.xml" />
+            <tool file="profrep/profrep_masking.xml" />
+            <tool file="profrep/gff_select_region.xml" />
+        <label id="domains" text="DANTE" />
+            <tool file="profrep/protein_domains.xml" />
+            <tool file="profrep/domains_filtering.xml" />
+            <tool file="profrep/extract_domains_seqs.xml" />
+    </section>
+
+to
+
+    $__root_dir__/config/tool_conf.xml
+
+------------------------------------------------------------------------
+
+Place PROFREP_DB files to
+
+    $__tool_data_path__/profrep
+
+*REMARK* PROFREP_DB files contain prepared annotation data for the species in the roll-up menu:
+
+    * sequences.fasta - including BLAST database files which were created by:
+         makeblastdb -in >sequences.fasta -dbtype nucl
+    * hitsort.cls file
+    * classification table
+
+Place DANTE_DB files to
+
+    $__tool_data_path__/protein_domains
+
+*REMARK* DANTE_DB files contain protein domain database files:
+    * protein domains database including LASTAL database files which were created by:
+        lastdb -p -cR01 >database_name< >database_name<
+        (the lastal database files are actually enough; the original database table does not have to be present)
+    * classification table
+
+------------------------------------------------------------------------
+
+Create
+
+    $__root_dir__/database/dependencies/profrep/1.0.0/env.sh
+
+containing:
+
+    export JBROWSE_BIN=PATH_TO_JBROWSE_DIR/bin
+
+------------------------------------------------------------------------
+
+Link the following files into the galaxy tool-data dir
+
+    ln -s $__tool_directory__/profrep/domains_data/select_domain.txt $__tool_data_path__
+    ln -s $__tool_directory__/profrep/profrep_data/prepared_datasets.txt $__tool_data_path__
+
diff -r 000000000000 -r a5f1638b73be configuration.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/configuration.py Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,128 @@
+#!/usr/bin/env python3
+import os
+''' configuration file to set up the paths and constants '''
+
+######## PROFREP #######################################################
+## Constants
+N_segment = 50
+MAX_FILES_SUBPROFILES = 1000
+MAX_PIC_NUM = 50
+IMAGE_RES = 300
+FASTA_LINE = 60
+SEQ_LEN_VIZ = 200000
+FORBIDDEN_CHARS = "\\/"
+HTML_STR = '''
+ <!DOCTYPE html>
+ <html>
+ <body>
+ <h2>PROFREP OUTPUT</h2>
+ <h4> Sequences processed: </h4>
+ {}
+ <h4> Total length: </h4>
+ <pre> {} bp </pre>
+ <h4> Database: </h4>
+ <pre> {} </pre>
+ <hr>
+ <h3> Repetitive profile(s)</h3> <br/>
+ {} <br/>
+ <h4>References: </h4>
+ {}
+
+ </body>
+ </html>
+ '''
+
+## IO
+DOMAINS_GFF = "output_domains.gff"
+N_GFF = "N_regions.gff"
+REPEATS_GFF = "output_repeats.gff"
+HTML = "output.html"
+LOG_FILE = "log.txt"
+PROFREP_DATA = "tool_data/profrep"
+PROFREP_TBL = "prepared_datasets.txt"
+PROFREP_OUTPUT_DIR = "profrep_output_dir"
+## JBrowse and Tracks Conf
+jbrowse_data_dir = "data"
+JSON_CONF_R = """{"hooks" : {"modify": "function( track, f, fdiv ) {fdiv.style.backgroundColor = '#278ECF'}"}}"""
+JSON_CONF_N = """{"hooks" : {"modify": "function( track, f, fdiv ) {fdiv.style.background = '#474747'}"}}"""
+COLORS_HEX = ["#7F7F7F", "#00FF00", "#0000FF", "#FF0000", "#01FFFE", "#FFA6FE",
+              "#FFDB66", "#006401", "#010067", "#95003A", "#007DB5", "#FF00F6",
+              "#774D00", "#90FB92", "#0076FF", "#D5FF00", "#FF937E", "#6A826C",
+              "#FF029D", "#FE8900", "#7A4782", "#7E2DD2", "#85A900", "#FF0056",
+              "#A42400", "#00AE7E", "#683D3B", "#BDC6FF", "#263400", "#BDD393",
+              "#00B917", "#9E008E", "#001544", "#C28C9F", "#FF74A3", "#01D0FF",
+              "#004754", "#E56FFE", "#788231", "#0E4CA1", "#91D0CB", "#BE9970",
+              "#968AE8", "#BB8800", "#43002C", "#DEFF74", "#00FFC6", "#FFE502",
+              "#620E00", "#008F9C", "#98FF52", "#7544B1", "#B500FF", "#00FF78",
+              "#FF6E41", "#005F39", "#6B6882", "#5FAD4E", "#A75740", "#A5FFD2",
+              "#FFB167", "#009BFF", "#E85EBE"]
+COLORS_RGB = ["127,127,127", "0,255,0", "0,0,255", "255,0,0", "1,255,254",
+              "255,166,254", "255,219,102", "0,100,1", "1,0,103", "149,0,58",
+              "0,125,181", "255,0,246", "119,77,0", "144,251,146", "0,118,255",
+              "213,255,0", "255,147,126", "106,130,108", "255,2,157",
+              "254,137,0", "122,71,130", "126,45,210", "133,169,0", "255,0,86",
+              "164,36,0", "0,174,126", "104,61,59", "189,198,255", "38,52,0",
+              "189,211,147", "0,185,23", "158,0,142", "0,21,68", "194,140,159",
+              "255,116,163", "1,208,255", "0,71,84", "229,111,254",
+              "120,130,49", "14,76,161", "145,208,203", "190,153,112",
+              "150,138,232", "187,136,0", "67,0,44", "222,255,116",
+              "0,255,198", "255,229,2", "98,14,0", "0,143,156", "152,255,82",
+              "117,68,177", "181,0,255", "0,255,120", "255,110,65", "0,95,57",
+              "107,104,130", "95,173,78", "167,87,64", "165,255,210",
+              "255,177,103", "0,155,255", "232,94,190"]
+TRACK_LIST = '''
+ \t,{}\n
+ \t"storeClass" : "JBrowse/Store/SeqFeature/BigWig",
+ \t"urlTemplate" : "{}",
+ \t"type" : "JBrowse/View/Track/Wiggle/XYPlot",
+ \t"label" : "{}",
+ \t"key" : "{}",
+ \t"style": {}
+ \t\t"pos_color": "{}"
+ \t {},
+ \t"scale" : "log"
+ \t{}\n
+ '''
+
+## GFF tracks
+HEADER_GFF = "##gff-version 3"
+SOURCE_PROFREP = "profrep"
+SOURCE_DANTE = "dante"
+PHASE = "."
+DOMAINS_FEATURE = "protein_domain"
+REPEATS_FEATURE = "repeat"
+N_NAME = "N"
+N_FEATURE = "N_region"
+HEADER_WIG = "variableStep\tchrom="
+GFF_EMPTY = "."
+
+######### BIG WIG ######################################################
+CHROM_SIZES_FILE = "chrom_sizes.txt"
+
+######### EXTRACT_DATA_FOR_PROFREP #####################################
+HITSORT_CLS = "seqclust/clustering/hitsort.cls"
+READS_ALL = "seqclust/sequences/sequences.fasta"
+ANNOTATION = "PROFREP_CLASSIFICATION_TEMPLATE.csv"
+
+######### PROFREP_DB_REDUCING ##########################################
+MEM_LIM = 1500  # MB
+CLS_REDUCED = "hitsort_reduced.cls"
+READS_ALL_REDUCED = "reads_all_reduced"
+
+######### PROFREP_REFINING #############################################
+WITH_DOMAINS = "mobile_element"
+QUALITY_DIFF_TO_REMOVE = 0.05  # 5% tolerance of PID
+
+######### DANTE ##############################################
+MAIN_GIT_DIR = os.path.dirname(os.path.realpath(__file__))
+DOMAINS_DATA = os.path.join(MAIN_GIT_DIR, "domains_data")
+TMP = "tmp"
+SC_MATRIX = os.path.join(DOMAINS_DATA, "blosum80.txt")
+AMBIGUOUS_TAG = "Ambiguous_domain"
+## IO
+CLASS_FILE = "ALL.classification-new"
+LAST_DB_FILE = "ALL_protein-domains_05.fasta"
+DOM_PROT_SEQ = "dom_prot_seq.fa"
+FILT_DOM_GFF = "domains_filtered.gff"
+EXTRACT_DOM_STAT = "domains_counts.txt"
+EXTRACT_OUT_DIR = "extracted_domains"
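The GFF-track constants defined above (`HEADER_GFF`, `SOURCE_PROFREP`, `REPEATS_FEATURE`, `PHASE`, `GFF_EMPTY`) are the building blocks of the tool's GFF3 output. A minimal sketch of how they compose a standard nine-column GFF3 feature line; `repeat_gff_line` is a hypothetical helper, not code from this changeset:

```python
# Constants copied from configuration.py above.
HEADER_GFF = "##gff-version 3"
SOURCE_PROFREP = "profrep"
REPEATS_FEATURE = "repeat"
PHASE = "."
GFF_EMPTY = "."

def repeat_gff_line(seq_id, start, end, score=GFF_EMPTY, strand=GFF_EMPTY):
    # GFF3 columns: seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join([seq_id, SOURCE_PROFREP, REPEATS_FEATURE,
                      str(start), str(end), str(score), strand, PHASE,
                      "Name=repeat"])

print(HEADER_GFF)
print(repeat_gff_line("seq1", 100, 250))
```

The attributes column here is a placeholder; the real tool fills it with the repeat annotation.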
diff -r 000000000000 -r a5f1638b73be configuration_for_testing.org
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/configuration_for_testing.org Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,18 @@
+#+BEGIN_SRC sh
+# tool setup
+
+cd /home/petr/galaxy2/galaxy/tools
+ln -sf /mnt/raid/users/petr/workspace/profrep
+
+cd /home/petr/galaxy2/galaxy/tool-data
+ln -sf /mnt/raid/users/petr/workspace/profrep/tool-data/* ./
+
+cd /home/petr/galaxy2/galaxy/database/dependencies
+ln -s /mnt/raid/users/petr/workspace/profrep/dependencies/* ./
+
+
+cd /home/petr/galaxy2/galaxy/config/
+ln -s /mnt/raid/users/petr/workspace/profrep/tool_config_profrep.xml
+#+END_SRC
+
+#+RESULTS:
diff -r 000000000000 -r a5f1638b73be contributors.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/contributors.txt Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,1 @@
+Nina H
diff -r 000000000000 -r a5f1638b73be dante.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante.py Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,784 @@
+#!/usr/bin/env python3
+
+import numpy as np
+import subprocess
+import math
+import time
+from operator import itemgetter
+from collections import Counter
+from itertools import groupby
+import os
+import configuration
+from tempfile import NamedTemporaryFile
+import sys
+import warnings
+import shutil
+from collections import defaultdict
+
+np.set_printoptions(threshold=sys.maxsize)
+
+
+def alignment_scoring():
+    ''' Create hash table for alignment similarity counting: for every
+	combination of aminoacids in alignment assign score from protein
+	scoring matrix defined in configuration file '''
+    score_dict = {}
+    with open(configuration.SC_MATRIX) as smatrix:
+        count = 1
+        for line in smatrix:
+            if not line.startswith("#"):
+                if count == 1:
+                    aa_all = line.rstrip().replace(" ", "")
+                else:
+                    count_aa = 1
+                    line = list(filter(None, line.rstrip().split(" ")))
+                    for aa in aa_all:
+                        score_dict["{}{}".format(line[0], aa)] = line[count_aa]
+                        count_aa += 1
+                count += 1
+    return score_dict
+
+
+def characterize_fasta(QUERY, WIN_DOM):
+    ''' Find the sequences, their lengths, starts, ends and if
+	they exceed the window '''
+    with open(QUERY) as query:
+        headers = []
+        fasta_lengths = []
+        seq_starts = []
+        seq_ends = []
+        fasta_chunk_len = 0
+        count_line = 1
+        for line in query:
+            line = line.rstrip()
+            if line.startswith(">"):
+                headers.append(line.rstrip())
+                fasta_lengths.append(fasta_chunk_len)
+                fasta_chunk_len = 0
+                seq_starts.append(count_line + 1)
+                seq_ends.append(count_line - 1)
+            else:
+                fasta_chunk_len += len(line)
+            count_line += 1
+        seq_ends.append(count_line)
+        seq_ends = seq_ends[1:]
+        fasta_lengths.append(fasta_chunk_len)
+        fasta_lengths = fasta_lengths[1:]
+        # control if there are correct (unique) names for individual seqs:
+        # LASTAL takes seqs IDs till the first space which can then create problems with ambiguous records
+        if len(headers) > len(set([header.split(" ")[0] for header in headers
+                                   ])):
+            raise NameError(
+                '''Sequences in multifasta format are not named correctly:
+							seq IDs (before the first space) are the same''')
+
+    above_win = [idx
+                 for idx, value in enumerate(fasta_lengths) if value > WIN_DOM]
+    below_win = [idx
+                 for idx, value in enumerate(fasta_lengths)
+                 if value <= WIN_DOM]
+    lens_above_win = np.array(fasta_lengths)[above_win]
+    return headers, above_win, below_win, lens_above_win, seq_starts, seq_ends
+
+
+def split_fasta(QUERY, WIN_DOM, step, headers, above_win, below_win,
+                lens_above_win, seq_starts, seq_ends):
+    ''' Create temporary file containing all sequences - the ones that exceed
+	the window are cut with a set overlap (greater than domain size with a reserve) '''
+    with open(QUERY, "r") as query:
+        count_fasta_divided = 0
+        count_fasta_not_divided = 0
+        ntf = NamedTemporaryFile(delete=False)
+        divided = np.array(headers)[above_win]
+        row_length = configuration.FASTA_LINE
+        for line in query:
+            line = line.rstrip()
+            if line.startswith(">") and line in divided:
+                stop_line = seq_ends[above_win[
+                    count_fasta_divided]] - seq_starts[above_win[
+                        count_fasta_divided]] + 1
+                count_line = 0
+                whole_seq = []
+                for line2 in query:
+                    whole_seq.append(line2.rstrip())
+                    count_line += 1
[...]
+...T_DB, LAST_DB),
+                        shell=True)
+
+    if OUTPUT_DIR and not os.path.exists(OUTPUT_DIR):
+        os.makedirs(OUTPUT_DIR)
+
+    if not os.path.isabs(OUTPUT_DOMAIN):
+        if OUTPUT_DIR is None:
+            OUTPUT_DIR = configuration.TMP
+            if not os.path.exists(OUTPUT_DIR):
+                os.makedirs(OUTPUT_DIR)
+        OUTPUT_DOMAIN = os.path.join(OUTPUT_DIR,
+                                     os.path.basename(OUTPUT_DOMAIN))
+    domain_search(QUERY, LAST_DB, CLASSIFICATION, OUTPUT_DOMAIN,
+                  THRESHOLD_SCORE, WIN_DOM, OVERLAP_DOM)
+
+    print("ELAPSED_TIME_DOMAINS = {} s".format(time.time() - t))
+
+
+if __name__ == "__main__":
+    import argparse
+    from argparse import RawDescriptionHelpFormatter
+
+    class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
+                          argparse.RawDescriptionHelpFormatter):
+        pass
+
+    parser = argparse.ArgumentParser(
+        description=
+        '''Script performs similarity search on given DNA sequence(s) in (multi)fasta against our protein domains database of all Transposable element for certain group of organisms (Viridiplantae or Metazoans). Domains are subsequently annotated and classified - in case certain domain has multiple annotations assigned, classifation is derived from the common classification level of all of them. Domains search is accomplished engaging LASTAL alignment tool.
+
+	DEPENDENCIES:
+		- python 3.4 or higher with packages:
+			-numpy
+		- lastal 744 or higher [http://last.cbrc.jp/]
+		- configuration.py module
+
+	EXAMPLE OF USAGE:
+
+		./protein_domains_pd.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
+
+	When running for the first time with a new database use -nld option allowing lastal to create indexed database files:
+
+		-nld True
+
+		''',
+        epilog="""""",
+        formatter_class=CustomFormatter)
+
+    requiredNamed = parser.add_argument_group('required named arguments')
+    requiredNamed.add_argument(
+        "-q",
+        "--query",
+        type=str,
+        required=True,
+        help=
+        'input DNA sequence to search for protein domains in a fasta format. Multifasta format allowed.')
+    requiredNamed.add_argument('-pdb',
+                               "--protein_database",
+                               type=str,
+                               required=True,
+                               help='protein domains database file')
+    requiredNamed.add_argument('-cs',
+                               '--classification',
+                               type=str,
+                               required=True,
+                               help='protein domains classification file')
+    parser.add_argument("-oug",
+                        "--domain_gff",
+                        type=str,
+                        help="output domains gff format")
+    parser.add_argument(
+        "-nld",
+        "--new_ldb",
+        type=str,
+        default=False,
+        help=
+        "create indexed database files for lastal in case of working with new protein db")
+    parser.add_argument(
+        "-dir",
+        "--output_dir",
+        type=str,
+        help="specify if you want to change the output directory")
+    parser.add_argument(
+        "-thsc",
+        "--threshold_score",
+        type=int,
+        default=80,
+        help=
+        "percentage of the best score in the cluster to be tolerated when assigning annotations per base")
+    parser.add_argument(
+        "-wd",
+        "--win_dom",
+        type=int,
+        default=10000000,
+        help="window to process large input sequences sequentially")
+    parser.add_argument("-od",
+                        "--overlap_dom",
+                        type=int,
+                        default=10000,
+                        help="overlap of sequences in two consecutive windows")
+
+    args = parser.parse_args()
+    main(args)
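The `alignment_scoring()` function in dante.py builds a pair-to-score lookup from the BLOSUM80 file referenced in configuration.py. A self-contained sketch of the same parsing idea, using a toy inline matrix and a hypothetical `parse_score_matrix` helper (the original keeps scores as strings; integer conversion is added here for clarity):

```python
# Toy excerpt in the same layout as a BLOSUM matrix file:
# a '#'-comment, a header row of amino acids, then one row per amino acid.
matrix_text = """\
# toy excerpt of a BLOSUM-like matrix
   A  R  N
A  5 -2 -2
R -2  6 -1
N -2 -1  6
"""

def parse_score_matrix(text):
    """Build a dict keyed by amino-acid pairs, e.g. score['AR'] -> -2."""
    score = {}
    rows = [l for l in text.splitlines() if l.strip() and not l.startswith("#")]
    alphabet = rows[0].split()          # header row: column amino acids
    for row in rows[1:]:
        fields = row.split()            # fields[0] is the row amino acid
        for aa, val in zip(alphabet, fields[1:]):
            score[fields[0] + aa] = int(val)
    return score

scores = parse_score_matrix(matrix_text)
print(scores["AR"])  # -2
```

The lookup key is simply the concatenation of the two residues, mirroring `score_dict["{}{}".format(...)]` in the function above.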
diff -r 000000000000 -r a5f1638b73be dante.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante.xml Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,100 @@
+<tool id="dante" name="Domain based ANnotation of Transposable Elements - DANTE" version="1.0.0">
+  <requirements>
+    <requirement type="package">last</requirement>
+    <requirement type="package">numpy</requirement>
+    <requirement type="package" version="1.0.0">dante</requirement>
+  </requirements>
+  <stdio>
+    <regex match="Traceback" source="stderr" level="fail" description="Unknown error" />
+    <regex match="error" source="stderr" level="fail" description="Unknown error" />
+  </stdio>
+<description> Tool for annotation of transposable elements based on the similarity to conserved protein domains database. </description>
+<command>
+python3 ${__tool_directory__}/dante.py --query ${input} --domain_gff ${DomGff}
+ --protein_database ${__tool_data_path__}/protein_domains/${db_type}_pdb
+ --classification ${__tool_data_path__}/protein_domains/${db_type}_class
+</command>
+<inputs>
+ <param format="fasta" type="data" name="input" label="Choose your input sequence" help="Input DNA must be in proper fasta format, multi-fasta containing more sequences is allowed" />
+
+ <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
+   <options from_file="rexdb_versions.txt">
+     <column name="name" index="0"/>
+     <column name="value" index="1"/>
+   </options>
+ </param>
+
+</inputs>
+
+<outputs>
+ <data format="gff3" name="DomGff" label="Unfiltered GFF3 file of ALL protein domains from dataset ${input.hid}" />
+</outputs>
+
+ <help>
+
+THIS IS A PRIMARY OUTPUT THAT SHOULD UNDERGO FURTHER QUALITY FILTERING TO GET RID OF POTENTIAL FALSE POSITIVE DOMAINS
+
+**WHAT IT DOES**
+
+This tool uses the external alignment program `LAST`_ and the RepeatExplorer database of TE protein domains (REXdb) (Viridiplantae and Metazoa)
+
+.. _LAST: http://last.cbrc.jp/  
+
+*Lastal* runs a similarity search to find hits between the query DNA sequence and our database of protein domains from all Viridiplantae repetitive elements. Hits with overlapping positions in the sequence (even through other hits) form a cluster, which represents one potential protein domain. Strand orientation is taken into consideration when forming the clusters, meaning each cluster is built exclusively from forward-stranded or reverse-stranded hits. The clusters are subsequently processed separately; within one cluster, positions are scanned base-by-base and classification strings are assigned to each of them based on the database sequences that were mapped to that place. These assigned classification strings consist of the domain type as well as the class and lineage of the repetitive element the database protein comes from. Different classification levels are separated by the "|" character. Every hit is scored according to the scoring matrix used for the DNA-protein alignment (BLOSUM80). For a single position, only the hits reaching a certain percentage (80% by default) of the overall best score within the whole cluster are reported. One cluster of overlapping hits represents one domain region and is recorded as one line in the resulting GFF3 file. Regarding the classification strings assigned to one region (cluster), there are three situations that can occur:
+
+ 1. A single classification string is assigned to every position and the classifications along all positions in the region are mutually uniform; in this case the domain's final classification is this unique classification.
+ 2. Multiple classification strings are assigned to one cluster, i.e. one domain, which leads to classification at the common (less specific) level of all the strings.
+ 3. There is a conflict at the domain type level; such domains are reported with a slash (e.g. RT/INT) and the classification is in this case ambiguous.
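The second outcome above, reducing several "|"-separated classification strings to their deepest common level, can be sketched as a small helper (a hypothetical illustration, not the repository's actual code; the name `common_classification` is invented):

```python
def common_classification(strings):
    """Return the longest shared prefix of '|'-separated classification
    levels, e.g. two LTR lineages collapse to their common 'Class_I|LTR'."""
    levels = [s.split("|") for s in strings]
    common = []
    # zip(*levels) walks the classification levels in parallel and stops
    # at the shortest string; keep levels only while all strings agree.
    for parts in zip(*levels):
        if len(set(parts)) == 1:
            common.append(parts[0])
        else:
            break
    return "|".join(common)

print(common_classification(
    ["Class_I|LTR|Ty1/copia|RT", "Class_I|LTR|Ty3/gypsy|RT"]))  # Class_I|LTR
```

With no common level at all (the third outcome, a domain-type conflict), the helper returns an empty string, which corresponds to the "Ambiguous_domain" case.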
+
+**There are 2 outputs produced by this tool:**
+
+1. GFF3 file of all protein domains built from all hits found by LAST. Domains are reported per line as regions (start - end) on the original DNA sequence including the seq ID, alignment score and strand orientation. The last "Attributes" column contains several semicolon-separated pieces of information related to the annotation, repetitive classification, alignment and its quality. This file can undergo further filtering using the *Protein Domain Filter* tool
+
+- Attributes reported always:
+
+ Name
+ type of domain; if ambiguous reported with slash 
+
+ Final_classification 
+ definite classification based on all partial classifications of Region_hits_classifications attribute or 
+ "Ambiguous_domain" when there is an ambiguous domain type 
+
+ Region_Hits_Classifications
+ all hits classifications (comma separated) from a certain domain region that reach the set score threshold; in case of multiple annotations the square brackets indicate the number of bases having this particular classification
+
+- Attributes only reported in case of unambiguous domain type (all the attributes including quality information are related to the Best_Hit of the region):
+
+ Best_hit  
+ classification and position of the best alignment with the highest score within the cluster; in the square brackets is the percentage of the whole cluster range that this best hit covers
+
+ Best_Hit_DB_Pos
+ showing which part of the original database domain corresponding to the Best Hit was aligned on the query DNA (e.g. **Best_Hit_DB_Pos=17:75of79** means the Best Hit reported in the GFF represents the region from the 17th to the 75th of the total 79 aminoacids in the original domain from the database)
+
+ DB_Seq 
+ database protein sequence of the best hit mapped to the query DNA
+
+ Query_Seq 
+ alignment sequence of the query DNA for the best hit
+
+ Identity
+ ratio of identical amino acids in alignment sequence to the length of alignment
+
+ Similarity
+ ratio of alignment positions with positive score (according to the scoring matrix) to the length of alignment
+
+ Relat_Length
+ ratio of gapless length of the aligned protein sequence to the whole length of the database protein 
+
+ Relat_Interruptions
+ number of the interruptions (frameshifts + stop codons) in aligned translated query sequence per each starting 100 AA
+
+ Hit_to_DB_Length
+ proportion of alignment length to the original length of the protein domain from database
+
+
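The Identity and Similarity attributes defined above can be computed directly from an alignment: identical positions, respectively positively scoring positions, divided by the alignment length. A hedged sketch (not the repository's code): `positive_pairs` stands in for residue pairs scoring above zero in BLOSUM80, while the real tool reads the full matrix file.

```python
# Toy stand-in for BLOSUM80 pairs with a positive score (assumption:
# only a few conservative substitutions listed for illustration).
positive_pairs = {("K", "R"), ("R", "K"), ("I", "L"), ("L", "I")}

def identity_similarity(query_aln, db_aln):
    """Return (identity, similarity) over a gapless toy alignment:
    identical / positive-scoring positions divided by alignment length."""
    assert len(query_aln) == len(db_aln)
    n = len(query_aln)
    ident = sum(q == s for q, s in zip(query_aln, db_aln))
    simil = sum(q == s or (q, s) in positive_pairs
                for q, s in zip(query_aln, db_aln))
    return ident / n, simil / n

ident, simil = identity_similarity("MKVLA", "MRVIA")
print(ident, simil)  # 0.6 1.0
```

In "MKVLA" vs "MRVIA", three positions are identical and the two mismatches (K/R, L/I) still score positively, so Identity is 0.6 while Similarity reaches 1.0.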
+
+!NOTE: The tool can on average process 0.5 Gbp of DNA sequence per day. This is only a rough estimate and is highly dependent on the input data (occurrence of repetitive elements) as well as computing resources. The maximum running time of the tool is 7 days.
+
+ </help>
+</tool>
+
diff -r 000000000000 -r a5f1638b73be dante_gff_output_filtering.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante_gff_output_filtering.py Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,333 @@
+#!/usr/bin/env python3
+
+import time
+import configuration
+import os
+import textwrap
+import subprocess
+from tempfile import NamedTemporaryFile
+from collections import defaultdict
+
+
+class Range():
+    '''
+    This class is used to check float range in argparse
+    '''
+
+    def __init__(self, start, end):
+        self.start = start
+        self.end = end
+
+    def __eq__(self, other):
+        return self.start <= other <= self.end
+
+    def __str__(self):
+        return "float range {}..{}".format(self.start, self.end)
+
+    def __repr__(self):
+        return "float range {}..{}".format(self.start, self.end)
+
+
+def check_file_start(gff_file):
+    count_comment = 0
+    with open(gff_file, "r") as gff_all:
+        line = gff_all.readline()
+        while line.startswith("#"):
+            line = gff_all.readline()
+            count_comment += 1
+    return count_comment
+
+
+def write_info(filt_dom_tmp, FILT_DOM_GFF, orig_class_dict, filt_class_dict,
+               dom_dict, version_lines):
+    '''
+	Write domains statistics in beginning of filtered GFF
+	'''
+    with open(FILT_DOM_GFF, "w") as filt_gff:
+        for line in version_lines:
+            filt_gff.write(line)
+        filt_gff.write("##CLASSIFICATION\tORIGINAL_COUNTS\tFILTERED_COUNTS\n")
+        if not orig_class_dict:
+            filt_gff.write("##NO DOMAINS CLASSIFICATIONS\n")
+        for classification in sorted(orig_class_dict.keys()):
+            if classification in filt_class_dict.keys():
+                filt_gff.write("##{}\t{}\t{}\n".format(
+                    classification, orig_class_dict[
+                        classification], filt_class_dict[classification]))
+            else:
+                filt_gff.write("##{}\t{}\t{}\n".format(
+                    classification, orig_class_dict[classification], 0))
+        filt_gff.write("##-----------------------------------------------\n"
+                       "##SEQ\tDOMAIN\tCOUNTS\n")
+        if not dom_dict:
+            filt_gff.write("##NO DOMAINS\n")
+        for seq in sorted(dom_dict.keys()):
+            for dom, count in sorted(dom_dict[seq].items()):
+                filt_gff.write("##{}\t{}\t{}\n".format(seq, dom, count))
+        filt_gff.write("##-----------------------------------------------\n")
+        with open(filt_dom_tmp.name, "r") as filt_tmp:
+            for line in filt_tmp:
+                filt_gff.write(line)
+
+
+def get_file_start(gff_file):
+    count_comment = 0
+    lines = []
+    with open(gff_file, "r") as gff_all:
+        line = gff_all.readline()
+        while line.startswith("#"):
+            lines.append(line)
+            line = gff_all.readline()
+            count_comment += 1
+    return count_comment, lines
+
+
+def filter_qual_dom(DOM_GFF, FILT_DOM_GFF, TH_IDENTITY, TH_SIMILARITY,
+                    TH_LENGTH, TH_INTERRUPT, TH_LEN_RATIO, SELECTED_DOM,
+                    ELEMENT):
+    ''' Filter gff output based on domain and quality of alignment '''
+    [count_comment, version_lines] = get_file_start(DOM_GFF)
+    filt_dom_tmp = NamedTemporaryFile(delete=False)
+    with open(DOM_GFF, "r") as gff_all, open(filt_dom_tmp.name,
+                                             "w") as gff_filtered:
+        for comment_idx in range(count_comment):
+            next(gff_all)
+        dom_dict = defaultdict(lambda: defaultdict(int))
+        orig_class_dict = defaultdict(int)
+        filt_class_dict = defaultdict(int)
+        seq_ids_all = []
+        xminimals = []
+        xmaximals = []
+        domains = []
+        xminimals_all = []
+        xmaximals_all = []
+        domains_all = []
+        start = True
+        for line in gff_all:
+            attributes = line.rstrip().split("\t")[-1]
+            classification = attributes.split(";")[1].split("=")[1]
+            orig_class_dict[classification] += 1
+            ## ambiguous domains filtered out automatically
+            if classifi
[...]
+			FILTERING OPTIONS:
+				> QUALITY: - Min relative length of alignemnt to the protein domain from DB (without gaps)
+				   - Identity
+				   - Similarity (scoring matrix: BLOSUM82)
+				   - Interruption in the reading frame (frameshifts + stop codons) per every starting 100 AA
+				   - Max alignment proportion to the original length of database domain sequence
+				> DOMAIN TYPE: choose from choices ('Name' attribute in GFF)
+				Records for ambiguous domain type (e.g. INT/RH) are filtered out automatically
+
+				> MOBILE ELEMENT TYPE:
+				arbitrary substring of the element classification ('Final_Classification' attribute in GFF)
+
+		OUTPUTS:
+			- filtered GFF3 file
+			- fasta file of translated protein sequences (from original DNA) for the aligned domains that match the filtering criteria
+
+	DEPENDENCIES:
+		- python 3.4 or higher
+		> ProfRep modules:
+			- configuration.py
+
+	EXAMPLE OF USAGE:
+		Getting quality filtered integrase(INT) domains of all gypsy transposable elements:
+		./domains_filtering.py -dom_gff PATH_TO_INPUT_GFF -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE --selected_dom INT --element_type Ty3/gypsy
+
+		''',
+        epilog="""""",
+        formatter_class=CustomFormatter)
+    requiredNamed = parser.add_argument_group('required named arguments')
+    requiredNamed.add_argument("-dg",
+                               "--dom_gff",
+                               type=str,
+                               required=True,
+                               help="basic unfiltered gff file of all domains")
+    parser.add_argument("-ouf",
+                        "--domains_filtered",
    type=str,\n+                        help="output filtered domains gff file")\n+    parser.add_argument("-dps",\n+                        "--domains_prot_seq",\n+                        type=str,\n+                        help="output file containg domains protein sequences")\n+    parser.add_argument("-thl",\n+                        "--th_length",\n+                        type=float,\n+                        choices=[Range(0.0, 1.0)],\n+                        default=0.8,\n+                        help="proportion of alignment length threshold")\n+    parser.add_argument("-thi",\n+                        "--th_identity",\n+                        type=float,\n+                        choices=[Range(0.0, 1.0)],\n+                        default=0.35,\n+                        help="proportion of alignment identity threshold")\n+    parser.add_argument("-ths",\n+                        "--th_similarity",\n+                        type=float,\n+                        choices=[Range(0.0, 1.0)],\n+                        default=0.45,\n+                        help="threshold for alignment proportional similarity")\n+    parser.add_argument(\n+        "-ir",\n+        "--interruptions",\n+        type=int,\n+        default=3,\n+        help=\n+        "interruptions (frameshifts + stop codons) tolerance threshold per 100 AA")\n+    parser.add_argument(\n+        "-mlen",\n+        "--max_len_proportion",\n+        type=float,\n+        default=1.2,\n+        help=\n+        "maximal proportion of alignment length to the original length of protein domain from database")\n+    parser.add_argument(\n+        "-sd",\n+        "--selected_dom",\n+        type=str,\n+        default="All",\n+        choices=[\n+            "All", "GAG", "INT", "PROT", "RH", "RT", "aRH", "CHDCR", "CHDII",\n+            "TPase", "YR", "HEL1", "HEL2", "ENDO"\n+        ],\n+        help="filter output domains based on the domain type")\n+    parser.add_argument(\n+        "-el",\n+      
  "--element_type",\n+        type=str,\n+        default="",\n+        help="filter output domains by typing substring from classification")\n+    parser.add_argument(\n+        "-dir",\n+        "--output_dir",\n+        type=str,\n+        default=None,\n+        help="specify if you want to change the output directory")\n+    args = parser.parse_args()\n+    main(args)\n'
b
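The `Range` class at the top of `dante_gff_output_filtering.py` works because argparse validates `choices` with the `in` operator, which falls back to the candidate's `__eq__`; a single `Range` instance therefore stands in for a whole interval. A minimal self-contained sketch of the same idea (the `--th_identity` flag mirrors the script's own argument):

```python
import argparse


class Range:
    """Accept any float within [start, end] when placed in argparse `choices`.

    argparse tests `value in choices`; list membership compares with `==`,
    which ends up calling Range.__eq__, so one instance covers the interval.
    """

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __eq__(self, other):
        return self.start <= other <= self.end

    def __repr__(self):
        return "float range {}..{}".format(self.start, self.end)


parser = argparse.ArgumentParser()
parser.add_argument("--th_identity", type=float,
                    choices=[Range(0.0, 1.0)], default=0.35)

args = parser.parse_args(["--th_identity", "0.5"])
print(args.th_identity)  # 0.5
```

Values outside the range make argparse exit with a "invalid choice" style error, exactly as an explicit list of choices would.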
diff -r 000000000000 -r a5f1638b73be dante_gff_output_filtering.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante_gff_output_filtering.xml Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,78 @@
+<tool id="domains_filter" name="Protein Domains Filter" version="1.0.0">
+  <requirements>
+    <requirement type="package" version="1.0.0">domains_filter</requirement>
+  </requirements>
+  <stdio>
+    <regex match="Traceback" source="stderr" level="fail" description="Unknown error" />
+    <regex match="error" source="stderr" level="fail" description="Unknown error" />
+  </stdio>
+<description> Tool for filtering the GFF3 output from DANTE. Filtering can be performed based on domain type and alignment quality. </description>
+<command>
+python3 ${__tool_directory__}/dante_gff_output_filtering.py --dom_gff ${DomGff} --domains_prot_seq ${dom_prot_seq} --domains_filtered ${dom_filtered} --selected_dom ${selected_domain} --th_identity ${th_identity} --th_similarity ${th_similarity} --th_length ${th_length} --interruptions ${interruptions} --max_len_proportion ${th_len_ratio} --element_type '${element_type}'
+
+</command>
+<inputs>
+ <param format="gff" type="data" name="DomGff" label="Choose primary GFF3 file of all domains from Protein Domains Finder" />
+ <param name="th_identity" type="float" value="0.35" min="0" max="1" label="Minimum identity" help="Protein sequence identity threshold between input and mapped protein from db [0-1]" />
+ <param name="th_similarity" type="float" value="0.45" min="0" max="1" label="Minimum similarity" help="Protein sequence similarity threshold between input and mapped protein from db [0-1]" />
+ <param name="th_length" type="float" value="0.8" min="0" max="1" label="Minimum alignment length" help="Proportion of the hit length without gaps to the length of the database sequence [0-1]" />
+ <param name="interruptions" type="integer" value="3" label="Interruptions [frameshifts + stop codons]" help="Tolerance threshold per every starting 100 amino acids of alignment sequence" />
+ <param name="th_len_ratio" type="float" value="1.2" label="Maximal length proportion" help="Maximal proportion of alignment length to the original length of protein domain from database (including indels)" />
+ <param name="selected_domain" type="select" label="Select protein domain type" >
+    <options from_file="select_domain.txt" >
+     <column name="name" index="0"/>
+     <column name="value" index="0"/>
+ </options>
+   </param>
+   <param name="element_type" type="text" value="Ty1/copia" label="Filter based on classification" help="You can use preset options or enter an arbitrary string to filter a certain repetitive element type of any level. It must be a continuous substring in the proper format of the Final_Classification attribute of the GFF3 file. Classification levels are separated by the | character. Filtering is case sensitive">
+     <option value="Ty1/copia">Ty1/copia</option>
+     <option value="Ty3/gypsy">Ty3/gypsy</option>
+     <option value="Class_I|">Class_I|</option>
+     <option value="Class_II|">Class_II|</option>
+    <sanitizer>
+       <valid initial="string.ascii_letters,string.digits">
+         <add value="_" />
+         <add value="/" />
+         <add value="|" />
+       </valid>
+    </sanitizer>
+   </param>
+</inputs>
+<outputs>
+ <data format="gff3" name="dom_filtered" label="Filtered GFF3 file of ${selected_domain} domains from dataset ${DomGff.hid}" />
+ <data format="fasta" name="dom_prot_seq" label="Protein sequences of ${selected_domain} domains from dataset ${DomGff.hid}" />
+</outputs>
+
+ <help>
+
+**WHAT IT DOES**
+
+This tool runs filtering on either the primary GFF3 file of all domains, i.e. the output of the *Protein Domains Finder* tool, or an already filtered GFF3 file. Domains can be filtered based on:
+
+**Quality of alignment such as**:
+ - alignment sequence identity
+ - alignment similarity
+ - alignment proportion length
+ - number of interruptions (frameshifts or stop codons) per 100 AA
+
+**Protein domain type**
+ This filtration is based on the "Name" attribute of the GFF3 file
+
+**Repetitive element classification**
+ In the text field you can specify a classification string you wish to filter for. This filtration is based on the "Final_Classification" attribute of the GFF file, so it must be in the proper form (classification levels are separated by "|"). You can see which classifications occur in your data by looking at the Classification summary table output. If you leave the field blank, domains of all classifications will be reported
+
+All records containing an ambiguous domain type (e.g. RH/INT) are filtered out automatically: they do not appear in the filtered GFF file, nor are protein sequences derived from these potentially chimeric domains. Optimal results (for general usage) should be achieved with the default quality filtering parameters, which are appropriate for finding all types of protein domains. Keep in mind that the results should nevertheless be critically assessed with respect to your input data.
+
+
+**OUTPUTS PRODUCED:**
+ 1. Filtered GFF3 file
+ 2. Translated protein sequences of the filtered domain regions of the original DNA sequence in fasta format
+
+ *Translated sequences are taken from the best alignment (Best_Hit attribute) within a domain region; however, this alignment does not necessarily cover the whole region reported as a domain in the GFF file*
+
+
+ </help>
+</tool>
+
b
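The classification filter described in the help above is a case-sensitive substring match on the `Final_Classification` attribute of the GFF3 ninth column. A minimal sketch of that logic (the attribute names come from the help text; the parsing helper itself is illustrative, not the tool's actual code, which indexes the attribute by position):

```python
def parse_gff_attributes(attr_column):
    """Split the 9th GFF3 column ('key1=val1;key2=val2') into a dict."""
    attrs = {}
    for pair in attr_column.rstrip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def matches_element_type(gff_line, element_substring):
    """True when the substring occurs in Final_Classification,
    mirroring the case-sensitive substring filter described above."""
    attrs = parse_gff_attributes(gff_line.split("\t")[-1])
    return element_substring in attrs.get("Final_Classification", "")


line = ("seq1\tdante\tprotein_domain\t100\t400\t.\t+\t.\t"
        "Name=RT;Final_Classification=Class_I|LTR|Ty1/copia")
print(matches_element_type(line, "Ty1/copia"))  # True
```

An empty filter string matches every record, which is why leaving the Galaxy text field blank reports domains of all classifications.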
diff -r 000000000000 -r a5f1638b73be dante_gff_to_dna.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante_gff_to_dna.py Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,186 @@
+#!/usr/bin/env python3
+
+import argparse
+import configuration
+import time
+import os
+from collections import defaultdict
+from Bio import SeqIO
+import textwrap
+
+t_nt_seqs_extraction = time.time()
+
+
+def str2bool(v):
+    if v.lower() in ('yes', 'true', 't', 'y', '1'):
+        return True
+    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
+        return False
+    else:
+        raise argparse.ArgumentTypeError('Boolean value expected')
+
+
+def check_file_start(gff_file):
+    count_comment = 0
+    with open(gff_file, "r") as gff_all:
+        line = gff_all.readline()
+        while line.startswith("#"):
+            line = gff_all.readline()
+            count_comment += 1
+    return count_comment, line
+
+
+def extract_nt_seqs(DNA_SEQ, DOM_GFF, OUT_DIR, CLASS_TBL, EXTENDED):
+    ''' Extract nucleotide sequences of protein domains found by DANTE from input DNA seq.
+ Sequences are saved in fasta files separately for each transposon lineage.
+ Sequences extraction is based on position of Best_Hit alignment reported by LASTAL.
+ The positions can be extended (optional) based on what part of database domain was aligned (Best_Hit_DB_Pos attribute).
+ The strand orientation needs to be considered in extending and extracting the sequence itself
+ '''
+    [count_comment, first_line] = check_file_start(DOM_GFF)
+    unique_classes = get_unique_classes(CLASS_TBL)
+    files_dict = defaultdict(str)
+    domains_counts_dict = defaultdict(int)
+    allSeqs = SeqIO.to_dict(SeqIO.parse(DNA_SEQ, 'fasta'))
+    with open(DOM_GFF, "r") as domains:
+        for comment_idx in range(count_comment):
+            next(domains)
+        seq_id_stored = first_line.split("\t")[0]
+        seq_nt = allSeqs[seq_id_stored]
+        for line in domains:
+            seq_id = line.split("\t")[0]
+            dom_type = line.split("\t")[8].split(";")[0].split("=")[1]
+            elem_type = line.split("\t")[8].split(";")[1].split("=")[1]
+            strand = line.split("\t")[6]
+            align_nt_start = int(line.split("\t")[8].split(";")[3].split(":")[
+                -1].split("-")[0])
+            align_nt_end = int(line.split("\t")[8].split(";")[3].split(":")[
+                -1].split("-")[1].split("[")[0])
+            if seq_id != seq_id_stored:
+                seq_id_stored = seq_id
+                seq_nt = allSeqs[seq_id_stored]
+            if EXTENDED:
+                ## which part of database sequence was aligned
+                db_part = line.split("\t")[8].split(";")[4].split("=")[1]
+                ## database seq length
+                dom_len = int(db_part.split("of")[1])
+                ## start of alignment on database seq
+                db_start = int(db_part.split("of")[0].split(":")[0])
+                ## end of alignment on database seq
+                db_end = int(db_part.split("of")[0].split(":")[1])
+                ## number of nucleotides missing in the beginning
+                dom_nt_prefix = (db_start - 1) * 3
+                ## number of nucleotides missing in the end
+                dom_nt_suffix = (dom_len - db_end) * 3
+                if strand == "+":
+                    dom_nt_start = align_nt_start - dom_nt_prefix
+                    dom_nt_end = align_nt_end + dom_nt_suffix
+                ## reverse extending for - strand
+                else:
+                    dom_nt_start = align_nt_start - dom_nt_suffix
+                    dom_nt_end = align_nt_end + dom_nt_prefix
+                ## correct domains whose start position becomes negative after extending
+                dom_nt_start = max(1, dom_nt_start)
+            else:
+                dom_nt_start = align_nt_start
+                dom_nt_end = align_nt_end
+            full_dom_nt = seq_nt.seq[dom_nt_start - 1:dom_nt_end]
+            ## for - strand take reverse complement of the extracted sequence
+            if strand == "-":
+                full_dom_nt = full_dom_nt.reverse_complement()
+            full_dom_nt = str(full_dom_nt)
+            ## report when domain classified to the last level and no Ns in extracted seq
+            if elem_type in unique_classes and "N" not in full_dom_nt:
+                # lineages reported in separate fasta files
+                if not elem_type in files_dict:
+                    files_dict[elem_type] = os.path.join(
+                        OUT_DIR, "{}.fasta".format(elem_type.split("|")[
+                            -1].replace("/", "_")))
+                with open(files_dict[elem_type], "a") as out_nt_seq:
+                    out_nt_seq.write(">{}:{}-{}|{}[{}]\n{}\n".format(
+                        seq_nt.id, dom_nt_start, dom_nt_end, dom_type,
+                        elem_type, textwrap.fill(full_dom_nt,
+                                                 configuration.FASTA_LINE)))
+                domains_counts_dict[elem_type] += 1
+    return domains_counts_dict
+
+
+def get_unique_classes(CLASS_TBL):
+    ''' Get all the lineages of current domains classification table to check if domains are classified to the last level.
+ Only the sequences of unambiguous and completely classified domains will be extracted.
+ '''
+    unique_classes = []
+    with open(CLASS_TBL, "r") as class_tbl:
+        for line in class_tbl:
+            line_class = "|".join(line.rstrip().split("\t")[1:])
+            if line_class not in unique_classes:
+                unique_classes.append(line_class)
+    return unique_classes
+
+
+def write_domains_stat(domains_counts_dict, OUT_DIR):
+    ''' Report counts of domains for individual lineages
+ '''
+    total = 0
+    with open(
+            os.path.join(OUT_DIR,
+                         configuration.EXTRACT_DOM_STAT), "w") as dom_stat:
+        for domain, count in domains_counts_dict.items():
+            dom_stat.write(";{}:{}\n".format(domain, count))
+            total += count
+        dom_stat.write(";TOTAL:{}\n".format(total))
+
+
+def main(args):
+
+    DNA_SEQ = args.input_dna
+    DOM_GFF = args.domains_gff
+    OUT_DIR = args.out_dir
+    CLASS_TBL = args.classification
+    EXTENDED = args.extended
+
+    if not os.path.exists(OUT_DIR):
+        os.makedirs(OUT_DIR)
+
+    domains_counts_dict = extract_nt_seqs(DNA_SEQ, DOM_GFF, OUT_DIR, CLASS_TBL,
+                                          EXTENDED)
+    write_domains_stat(domains_counts_dict, OUT_DIR)
+
+    print("ELAPSED_TIME_EXTRACTION = {} s\n".format(time.time() -
+                                                    t_nt_seqs_extraction))
+
+
+if __name__ == "__main__":
+
+    # Command line arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-i',
+                        '--input_dna',
+                        type=str,
+                        required=True,
+                        help='path to input DNA sequence')
+    parser.add_argument('-d',
+                        '--domains_gff',
+                        type=str,
+                        required=True,
+                        help='GFF file of protein domains')
+    parser.add_argument('-cs',
+                        '--classification',
+                        type=str,
+                        required=True,
+                        help='protein domains classification file')
+    parser.add_argument('-out',
+                        '--out_dir',
+                        type=str,
+                        default=configuration.EXTRACT_OUT_DIR,
+                        help='output directory')
+    parser.add_argument(
+        '-ex',
+        '--extended',
+        type=str2bool,
+        default=True,
+        help=
+        'extend the domain edges if the whole database sequence was not aligned')
+    args = parser.parse_args()
+    main(args)
b
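The coordinate extension in `extract_nt_seqs` converts the unaligned amino-acid prefix and suffix of the database domain (from the `Best_Hit_DB_Pos` attribute, format `start:end of length`) into nucleotides, and swaps prefix and suffix on the minus strand. The arithmetic isolated as a worked sketch (function name and the example `"10:250of300"` value are illustrative):

```python
def extend_coordinates(align_nt_start, align_nt_end, db_part, strand):
    """Extend nucleotide alignment coordinates to cover the whole database
    domain, following the arithmetic in extract_nt_seqs above.

    `db_part` has the form 'start:end of length' in amino acids,
    e.g. '10:250of300' means residues 10-250 of a 300 aa domain aligned.
    """
    dom_len = int(db_part.split("of")[1])
    db_start = int(db_part.split("of")[0].split(":")[0])
    db_end = int(db_part.split("of")[0].split(":")[1])
    nt_prefix = (db_start - 1) * 3      # nucleotides missing at the start
    nt_suffix = (dom_len - db_end) * 3  # nucleotides missing at the end
    if strand == "+":
        start = align_nt_start - nt_prefix
        end = align_nt_end + nt_suffix
    else:  # on the minus strand the two extensions swap ends
        start = align_nt_start - nt_suffix
        end = align_nt_end + nt_prefix
    return max(1, start), end  # clamp a negative start to position 1


# 10-250 of a 300 aa domain aligned: 9 aa (27 nt) missing before,
# 50 aa (150 nt) missing after the alignment on the + strand.
print(extend_coordinates(1000, 1720, "10:250of300", "+"))  # (973, 1870)
```

The clamp at position 1 corresponds to the `max(1, dom_nt_start)` correction in the script for domains near the start of a sequence.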
diff -r 000000000000 -r a5f1638b73be dante_gff_to_dna.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dante_gff_to_dna.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,62 @@
+<tool id="domains_extract" name="Extract Domains Nucleotide Sequences" version="1.0.0">
+  <description> Tool to extract nucleotide sequences of protein domains found by DANTE </description>
+  <requirements>
+    <requirement type="package">biopython</requirement>
+  </requirements>
+  <command>
+    TEMP_DIR_LINEAGES=\$(mktemp -d) &amp;&amp;
+    python3 ${__tool_directory__}/dante_gff_to_dna.py --domains_gff ${domains_gff} --input_dna ${input_dna} --out_dir \$TEMP_DIR_LINEAGES
+
+    #if $extend_edges:
+   --extended True
+    #else:
+   --extended False
+    #end if
+   --classification ${__tool_data_path__ }/protein_domains/${db_type}_class
+    &amp;&amp;
+
+    cat \$TEMP_DIR_LINEAGES/domains_counts.txt \$TEMP_DIR_LINEAGES/*fasta > $out_fasta &amp;&amp;
+    rm -rf \$TEMP_DIR_LINEAGES
+  </command>
+  <inputs>
+   <param format="fasta" type="data" name="input_dna" label="Input DNA" help="Choose input DNA sequence(s) to extract the domains from" />
+   <param format="gff" type="data" name="domains_gff" label="Protein domains GFF" help="Choose filtered protein domains GFF3 (DANTE's output)" />
+   <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
+      <options from_file="rexdb_versions.txt">
+        <column name="name" index="0"/>
+        <column name="value" index="1"/>
+      </options>
+    </param>
+
+   <param name="extend_edges" type="boolean" truevalue="True" falsevalue="False" checked="True" label="Extend sequence edges" help="Extend extracted sequence edges to the full length of database domain sequences"/>
+  </inputs>
+  <outputs>
+    <data format="fasta" name="out_fasta" label="Concatenated fasta domains NT sequences from ${input_dna.hid}" /> 
+  </outputs>
+
+  <help>
+
+    **WHAT IT DOES**
+
+    This tool extracts nucleotide sequences of protein domains from reference DNA based on DANTE's output. It can be used e.g. for deriving phylogenetic relations of individual mobile elements within a species. This can be done separately for individual protein domain types.
+    In this case, prior to running this tool, use DANTE on the input DNA:
+
+    1. Protein Domains Finder
+    2. Protein Domains Filter (quality filter + type of domain, e.g. RT)
+   
+    INPUTS:
+   * original DNA sequence in multifasta format to extract the domains from 
+   * DANTE's output GFF3 file (preferably filtered for quality and specific domain type)
+
+    OUTPUT:
+
+    * concatenated fasta file of nucleotide sequences for individual transposon lineages
+    
+    By default sequences will be EXTENDED if the alignment reported by LASTAL does not cover the whole protein sequence from the database. 
+    As a result, the corresponding nucleotide region of the WHOLE aligned database domain will be reported. For every record in the GFF3 file the sequence is reported for the BEST HIT within the domain region, under the following conditions:
+
+   * The domain cannot be ambiguous, i.e. the FINAL CLASSIFICATION of the domains region corresponds to the last classification level
+   * Extracted sequences are not reported if they contain any Ns within the extracted region
+
+  </help>
+</tool>
b
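The Galaxy wrapper above passes `--extended True` or `--extended False` as literal strings, which `str2bool` in `dante_gff_to_dna.py` converts to a real boolean. A standalone sketch of that pattern (the function body mirrors the one shown earlier in this diff):

```python
import argparse


def str2bool(v):
    """Parse Galaxy-style boolean strings ('True'/'False', 'yes'/'no', ...)."""
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    if v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError('Boolean value expected')


parser = argparse.ArgumentParser()
parser.add_argument('--extended', type=str2bool, default=True)
print(parser.parse_args(['--extended', 'False']).extended)  # False
```

Using plain `type=bool` would not work here, because `bool("False")` is `True` for any non-empty string; a custom converter is the usual workaround.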
diff -r 000000000000 -r a5f1638b73be dependencies/dante/1.0.0/env.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dependencies/dante/1.0.0/env.sh Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,5 @@
+#dependencies for dante go here
+env > /tmp/env_galaxy.txt
+which lastal >> /tmp/env_galaxy.txt
+lastal --version >> /tmp/env_galaxy.txt
+
b
diff -r 000000000000 -r a5f1638b73be dependencies/profrep/1.0.0/env.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/dependencies/profrep/1.0.0/env.sh Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,4 @@
+#dependencies for profrep go here
+
+#for testing only:
+env > /tmp/env_galaxy.txt
b
diff -r 000000000000 -r a5f1638b73be domains_data/blosum80.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/domains_data/blosum80.txt Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,27 @@
+# Entries for the BLOSUM80 matrix at a scale of ln(2)/2.0.
+   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
+A  5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0 -2 -2 -1 -1 -6
+R -2  6 -1 -2 -4  1 -1 -3  0 -3 -3  2 -2 -4 -2 -1 -1 -4 -3 -3 -1 -3  0 -1 -6
+N -2 -1  6  1 -3  0 -1 -1  0 -4 -4  0 -3 -4 -3  0  0 -4 -3 -4  5 -4  0 -1 -6
+D -2 -2  1  6 -4 -1  1 -2 -2 -4 -5 -1 -4 -4 -2 -1 -1 -6 -4 -4  5 -5  1 -1 -6
+C -1 -4 -3 -4  9 -4 -5 -4 -4 -2 -2 -4 -2 -3 -4 -2 -1 -3 -3 -1 -4 -2 -4 -1 -6
+Q -1  1  0 -1 -4  6  2 -2  1 -3 -3  1  0 -4 -2  0 -1 -3 -2 -3  0 -3  4 -1 -6
+E -1 -1 -1  1 -5  2  6 -3  0 -4 -4  1 -2 -4 -2  0 -1 -4 -3 -3  1 -4  5 -1 -6
+G  0 -3 -1 -2 -4 -2 -3  6 -3 -5 -4 -2 -4 -4 -3 -1 -2 -4 -4 -4 -1 -5 -3 -1 -6
+H -2  0  0 -2 -4  1  0 -3  8 -4 -3 -1 -2 -2 -3 -1 -2 -3  2 -4 -1 -4  0 -1 -6
+I -2 -3 -4 -4 -2 -3 -4 -5 -4  5  1 -3  1 -1 -4 -3 -1 -3 -2  3 -4  3 -4 -1 -6
+L -2 -3 -4 -5 -2 -3 -4 -4 -3  1  4 -3  2  0 -3 -3 -2 -2 -2  1 -4  3 -3 -1 -6
+K -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1 -1 -1 -4 -3 -3 -1 -3  1 -1 -6
+M -1 -2 -3 -4 -2  0 -2 -4 -2  1  2 -2  6  0 -3 -2 -1 -2 -2  1 -3  2 -1 -1 -6
+F -3 -4 -4 -4 -3 -4 -4 -4 -2 -1  0 -4  0  6 -4 -3 -2  0  3 -1 -4  0 -4 -1 -6
+P -1 -2 -3 -2 -4 -2 -2 -3 -3 -4 -3 -1 -3 -4  8 -1 -2 -5 -4 -3 -2 -4 -2 -1 -6
+S  1 -1  0 -1 -2  0  0 -1 -1 -3 -3 -1 -2 -3 -1  5  1 -4 -2 -2  0 -3  0 -1 -6
+T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -2 -1 -1 -2 -2  1  5 -4 -2  0 -1 -1 -1 -1 -6
+W -3 -4 -4 -6 -3 -3 -4 -4 -3 -3 -2 -4 -2  0 -5 -4 -4 11  2 -3 -5 -3 -3 -1 -6
+Y -2 -3 -3 -4 -3 -2 -3 -4  2 -2 -2 -3 -2  3 -4 -2 -2  2  7 -2 -3 -2 -3 -1 -6
+V  0 -3 -4 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -2  4 -4  2 -3 -1 -6
+B -2 -1  5  5 -4  0  1 -1 -1 -4 -4 -1 -3 -4 -2  0 -1 -5 -3 -4  5 -4  0 -1 -6
+J -2 -3 -4 -5 -2 -3 -4 -5 -4  3  3 -3  2  0 -4 -3 -1 -3 -2  2 -4  3 -3 -1 -6
+Z -1  0  0  1 -4  4  5 -3  0 -4 -3  1 -1 -4 -2  0 -1 -3 -3 -3  0 -3  5 -1 -6
+X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -6
+* -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6  1
b
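The `blosum80.txt` file above uses the standard NCBI matrix layout: `#` comment lines, a header row of residue symbols, then one symmetric row per residue. A minimal loader sketch for this layout (the function is illustrative only; the actual alignment scoring in DANTE is done by LAST, which reads the matrix file itself):

```python
def load_score_matrix(lines):
    """Parse a BLOSUM-style matrix (as in domains_data/blosum80.txt)
    into a nested dict: scores[a][b] -> int. '#' lines are comments."""
    rows = [l for l in lines if l.strip() and not l.startswith("#")]
    header = rows[0].split()
    scores = {}
    for row in rows[1:]:
        fields = row.split()
        scores[fields[0]] = dict(zip(header, map(int, fields[1:])))
    return scores


# A 3x3 toy excerpt in the same layout (values copied from the matrix above)
toy = """# toy excerpt
   A  R  N
A  5 -2 -2
R -2  6 -1
N -2 -1  6
""".splitlines()
matrix = load_score_matrix(toy)
print(matrix["A"]["R"])  # -2
```

Because substitution matrices are symmetric, `matrix[a][b] == matrix[b][a]` holds for every residue pair, which makes a quick sanity check after loading.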
diff -r 000000000000 -r a5f1638b73be download_profrep_databases.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/download_profrep_databases.sh Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,4 @@
+#!/bin/sh
+cd tool_data
+wget http://repeatexplorer.org/repeatexplorer/wp-content/uploads/profrep.tar.gz
+tar xzvf profrep.tar.gz
b
diff -r 000000000000 -r a5f1638b73be env.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/env.sh Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,1 @@
+export JBROWSE_BIN="jbrowse_bin must be setup"
b
diff -r 000000000000 -r a5f1638b73be extract_data_for_profrep.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/extract_data_for_profrep.py Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+
+import zipfile
+import tempfile
+import argparse
+from shutil import copyfile
+import os
+import configuration
+
+
+def main(args):
+
+    RE_ARCHIVE = args.re_archive
+    OUTPUT_CLS = args.output_cls
+    OUTPUT_READS_ALL = args.output_reads_all
+    OUTPUT_ANNOTATION = args.output_annotation
+
+    if not os.path.isabs(OUTPUT_CLS):
+        OUTPUT_CLS = os.path.join(os.getcwd(), OUTPUT_CLS)
+
+    if not os.path.isabs(OUTPUT_READS_ALL):
+        OUTPUT_READS_ALL = os.path.join(os.getcwd(), OUTPUT_READS_ALL)
+
+    if not os.path.isabs(OUTPUT_ANNOTATION):
+        OUTPUT_ANNOTATION = os.path.join(os.getcwd(), OUTPUT_ANNOTATION)
+
+    with tempfile.TemporaryDirectory() as dirpath:
+        with zipfile.ZipFile(RE_ARCHIVE, 'r') as re_archive:
+            re_archive.extractall(dirpath)
+        copyfile(os.path.join(dirpath, configuration.HITSORT_CLS), OUTPUT_CLS)
+        copyfile(
+            os.path.join(dirpath, configuration.READS_ALL), OUTPUT_READS_ALL)
+        copyfile(
+            os.path.join(dirpath, configuration.ANNOTATION), OUTPUT_ANNOTATION)
+
+
+if __name__ == '__main__':
+
+    # Command line arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-ar',
+                        '--re_archive',
+                        type=str,
+                        required=True,
+                        help='RepeatExplorer output data archive')
+    parser.add_argument('-oc',
+                        '--output_cls',
+                        type=str,
+                        default="output_hitsort_cls",
+                        help='Output cls file of all clusters')
+    parser.add_argument('-or',
+                        '--output_reads_all',
+                        type=str,
+                        default="output_reads_all",
+                        help='Output file of all reads sequences')
+    parser.add_argument('-oa',
+                        '--output_annotation',
+                        type=str,
+                        default="output_annotation",
+                        help='Output file of clusters annotation')
+    args = parser.parse_args()
+    main(args)
b
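`extract_data_for_profrep.py` above follows a common pattern: unpack the archive into a `tempfile.TemporaryDirectory` (cleaned up automatically on exit) and copy only the members of interest to their destinations. A self-contained sketch of the same pattern (the member name `seqclust/hitsort.cls` is purely illustrative; the real paths come from `configuration.HITSORT_CLS` and friends):

```python
import os
import tempfile
import zipfile
from shutil import copyfile

# Build a throwaway archive standing in for a RepeatExplorer output.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "re_output.zip")
with zipfile.ZipFile(archive_path, "w") as zf:
    zf.writestr("seqclust/hitsort.cls", ">CL1\tread1 read2\n")

# Same pattern as extract_data_for_profrep.py: extract everything into a
# temporary directory, then copy the needed member out before cleanup.
target = os.path.join(workdir, "output_hitsort_cls")
with tempfile.TemporaryDirectory() as dirpath:
    with zipfile.ZipFile(archive_path, "r") as re_archive:
        re_archive.extractall(dirpath)
    copyfile(os.path.join(dirpath, "seqclust/hitsort.cls"), target)

print(open(target).read().startswith(">CL1"))  # True
```

For large archives, `ZipFile.extract(member, path)` would avoid unpacking everything, but extracting the whole archive keeps the script simple when several members are needed.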
diff -r 000000000000 -r a5f1638b73be extract_data_for_profrep.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/extract_data_for_profrep.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,26 @@
+<tool id="extract_data" name="Extract Data For ProfRep" version="1.0.0">
+<requirements><requirement type="package" version="1.0.0">profrep</requirement></requirements>
+<description> Extract data for ProfRep from RepeatExplorer </description>
+<command>
+python3 ${__tool_directory__}/extract_data_for_profrep.py --re_archive ${re_archive} --output_cls ${output_cls} --output_reads_all ${output_reads_all} --output_annotation ${output_annotation}
+</command>
+<inputs>
+ <param format="zip" type="data" name="re_archive" label="RepeatExplorer output data archive" help="" />
+</inputs>
+
+<outputs>
+ <data format="fasta" name="output_cls" label="Output cls file of all clusters from ${re_archive.hid} archive" />
+ <data format="fasta" name="output_reads_all" label="Output file of all reads sequences from ${re_archive.hid} archive" />
+ <data format="tabular" name="output_annotation" label="Output file of clusters annotation from ${re_archive.hid} archive" />
+</outputs>
+
+ <help>
+
+**WHAT IT DOES**
+
+This tool extracts all the input data needed for ProfRep from a RepeatExplorer output archive.
+Use it if the species is not already provided in our internal database (ProfRep drop-down menu -> Choose existing annotation dataset).
+
+
+ </help>
+</tool>
b
diff -r 000000000000 -r a5f1638b73be gff.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/gff.py Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+""" sequence repetitive profile conversion to GFF3 format """
+
+import time
+import configuration
+from operator import itemgetter
+from itertools import groupby
+from tempfile import NamedTemporaryFile
+import os
+
+
+def create_gff(THRESHOLD, THRESHOLD_SEGMENT, OUTPUT_GFF, files_dict, headers):
+    seq_id = None
+    exclude = set(['ALL'])
+    gff_tmp_list = []
+    for repeat in sorted(set(files_dict.keys()).difference(exclude)):
+        gff_tmp = NamedTemporaryFile(delete=False)
+        pos_seq_dict = {}
+        with open(files_dict[repeat][0], "r") as repeat_f, open(
+                files_dict[repeat][2], "r") as quality_f:
+            for line_r, line_q in zip(repeat_f, quality_f):
+                if "chrom" in line_r:
+                    idx_ranges(THRESHOLD_SEGMENT, seq_id, gff_tmp,
+                               configuration.REPEATS_FEATURE, repeat,
+                               pos_seq_dict)
+                    seq_id = line_r.rstrip().split("chrom=")[1]
+                    pos_seq_dict = {}
+                else:
+                    hits = int(line_r.rstrip().split("\t")[1])
+                    if hits >= THRESHOLD:
+                        position = int(line_r.rstrip().split("\t")[0])
+                        pos_seq_dict[position] = line_q.rstrip().split("\t")[1]
+        idx_ranges(THRESHOLD_SEGMENT, seq_id, gff_tmp,
+                   configuration.REPEATS_FEATURE, repeat, pos_seq_dict)
+        gff_tmp_list.append(gff_tmp.name)
+        gff_tmp.close()
+    sort_records(gff_tmp_list, headers, OUTPUT_GFF)
+    for tmp in gff_tmp_list:
+        os.unlink(tmp)
+
+
+def idx_ranges(THRESHOLD_SEGMENT, seq_id, gff_file, feature, repeat,
+               pos_seq_dict):
+    indices = sorted(pos_seq_dict.keys(), key=int)
+    for key, group in groupby(
+            enumerate(indices),
+            lambda index_item: index_item[0] - index_item[1]):
+        group = list(map(itemgetter(1), group))
+        if len(group) > THRESHOLD_SEGMENT:
+            sum_qual = 0
+            for position in group:
+                sum_qual += int(pos_seq_dict[position])
+            qual_per_reg = sum_qual / len(group)
+            # Take boundaries of the group vectors
+            gff_file.write(
+                "{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\tName={};Average_PID={}\n".format(
+                    seq_id, configuration.SOURCE_PROFREP, feature, group[
+                        0], group[-1], configuration.GFF_EMPTY,
+                    configuration.GFF_EMPTY, configuration.GFF_EMPTY, repeat,
+                    round(qual_per_reg)).encode("utf-8"))
+
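The `idx_ranges` functions above collapse consecutive profile positions into ranges with the classic `enumerate`/`groupby` idiom: within a run of consecutive values, the difference between an item's index and its value is constant. A minimal standalone sketch of just that grouping step:

```python
from itertools import groupby

# Positions that pass the hits threshold; consecutive runs become (start, end)
positions = [2, 3, 4, 8, 9, 15]
runs = []
for _, group in groupby(enumerate(positions), lambda iv: iv[0] - iv[1]):
    run = [pos for _, pos in group]
    runs.append((run[0], run[-1]))
print(runs)  # [(2, 4), (8, 9), (15, 15)]
```

In the real code each run additionally has to exceed `THRESHOLD_SEGMENT` in length before it is written out as a GFF record.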
+
+def idx_ranges_N(indices, THRESHOLD_SEGMENT, seq_id, gff_file, feature,
+                 att_name):
+    for key, group in groupby(
+            enumerate(indices),
+            lambda index_item: index_item[0] - index_item[1]):
+        group = list(map(itemgetter(1), group))
+        if len(group) > THRESHOLD_SEGMENT:
+            # Take boundaries of the group vectors
+            gff_file.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\tName={}\n".format(
+                seq_id, configuration.SOURCE_PROFREP, feature, group[0], group[
+                    -1], configuration.GFF_EMPTY, configuration.GFF_EMPTY,
+                configuration.GFF_EMPTY, att_name))
+
+
+def sort_records(gff_tmp_list, headers, OUTPUT_GFF):
+    opened_files = [open(i, "r") for i in gff_tmp_list]
+    files_lines = dict((key, "") for key in opened_files)
+    count_without_line = 0
+    count_seq = 0
+    present_seqs = headers
+    with open(OUTPUT_GFF, "w") as final_gff:
+        final_gff.write("{}\n".format(configuration.HEADER_GFF))
+        while True:
+            for file_name in opened_files:
+                if not files_lines[file_name]:
+                    line = file_name.readline()
+                    if line:
+                        files_lines[file_name] = line
+                    else:
+                        count_without_line += 1
+            if count_without_line == len(opened_files):
+                break
+            count_without_line = 0
+            count = 0
+            lowest_pos = float("inf")
+            for file_key in files_lines.keys():
+                if files_lines[file_key].split("\t")[0] == present_seqs[
+                        count_seq]:
+                    count += 1
+                    start_pos = int(files_lines[file_key].split("\t")[3])
+                    if start_pos < lowest_pos:
+                        lowest_pos = start_pos
+                        record_to_write = files_lines[file_key]
+                        lowest_file_key = file_key
+            if count == 0:
+                count_seq += 1
+            else:
+                final_gff.write(record_to_write)
+                files_lines[lowest_file_key] = ""
+
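`sort_records` is a hand-rolled k-way merge of the per-repeat temporary GFF files, ordered by sequence (in the order given by `headers`) and then by start position. The same idea, shown here on hypothetical in-memory record lists rather than the temp files, can be expressed with the standard library's `heapq.merge`:

```python
import heapq

# Hypothetical per-repeat record streams, each already sorted by (seq_id, start)
rep_a = [("chr1", 10, "repA"), ("chr1", 50, "repA")]
rep_b = [("chr1", 20, "repB"), ("chr2", 5, "repB")]

merged = list(heapq.merge(rep_a, rep_b, key=lambda rec: (rec[0], rec[1])))
print(merged[0])  # ('chr1', 10, 'repA')
```

Note that `heapq.merge` assumes one global sort key across all inputs, whereas `sort_records` walks sequences in the explicit order of `headers`; the sketch only illustrates the merge principle.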
+
+def main():
+    pass
+
+
+if __name__ == "__main__":
+    main()
b
diff -r 000000000000 -r a5f1638b73be gff_select_region.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/gff_select_region.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,34 @@
+<tool id="gff_select" name="GFF Region Selector" version="1.0.0">
+<requirements><requirement type="package" version="1.0.0">profrep</requirement></requirements>
+<description> Tool to select a specific sequence region from a GFF file </description>
+<command>
+python3 ${__tool_directory__}/gff_selection.py  --gff_input ${gff_input} --region ${region} --gff_output ${gff_output} --new_seq_id '${new_seq_id}' 
+
+</command>
+<inputs>
+ <param format="gff" type="data" name="gff_input" label="Choose input GFF3 file to cut" />
+ <param name="region" type="text" label="Choose the sequence and region to cut (including both end positions)" help="for example chr1 or chr1:1000-2000" />
+ <param name="new_seq_id" type="text" value="" label="Type in a new sequence name for the cut region" help="In case of using JBrowse it must correspond to the cut reference sequence name. If not specified, the original name with cut coordinates will be reported" />
+</inputs>
+<outputs>
+ <data format="gff3" name="gff_output" label="Region ${region} of GFF3 file ${gff_input.hid}" /> 
+</outputs>
+
+<help>
+
+**WHAT IT DOES**
+
+This tool extracts a region of interest from an input GFF. This makes it easier, for example, to compare features across several GFF files using only a fraction of the whole data. Use an arbitrary GFF as input. You can extract all records for an individual sequence, or select a specific region. The coordinates in the modified GFF file will be recalculated relative to the selected range. Type in the selected region and the corresponding seq ID to extract, in one of the following forms:
+
+**original_seq_name (e.g. chr1)**
+
+OR
+
+**original_seq_name:start-end (e.g. chr1:1000-2000)**
+
+**!PLEASE NOTE** Only the GFF records that are entirely inside the selected region will be reported.
+
+
+</help>
+</tool>
+
b
diff -r 000000000000 -r a5f1638b73be gff_selection.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/gff_selection.py Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+
+import argparse
+
+
+def check_file_start(gff_file):
+    count_comment = 0
+    with open(gff_file, "r") as gff_all:
+        line = gff_all.readline()
+        while line.startswith("#"):
+            line = gff_all.readline()
+            count_comment += 1
+    return count_comment
+
+
+def cut_region(GFF_IN, GFF_OUT, REGION, NEW_SEQ_ID):
+    '''
+ Extract records for a particular sequence and/or region from an arbitrary GFF3 file,
+ given in the form original_seq_name:start-end (e.g. chr1:1000-2000).
+ Write a new GFF containing only records from this region.
+ If a new seq ID for the extracted region is not provided, it is named after the region to cut.
+ ! ONLY ONE REGION CAN BE CUT AT A TIME
+ '''
+    ## cut a particular part of a particular sequence
+    if ":" in REGION and "-" in REGION.split(":")[-1]:
+        seq_to_cut = ":".join(REGION.split(":")[:-1])
+        int_start = int(REGION.split(":")[-1].split("-")[0])
+        int_end = int(REGION.split(":")[-1].split("-")[1])
+    ## cut the whole sequence if region is not specified
+    else:
+        ## 1-based start keeps the original coordinates unchanged below
+        int_start = 1
+        int_end = float("inf")
+        seq_to_cut = REGION
+    count_comment = check_file_start(GFF_IN)
+    with open(GFF_OUT, "w") as gff_out:
+        with open(GFF_IN, "r") as gff_in:
+            for comment_idx in range(count_comment):
+                next(gff_in)
+            gff_out.write("##gff-version 3\n")
+            gff_out.write("##sequence-region {}\n".format(REGION))
+            for line in gff_in:
+                if not line.startswith("#") and line.split("\t")[
+                        0] == seq_to_cut and int(float(line.split("\t")[
+                            3])) >= int_start and int(float(line.split("\t")[
+                                4])) <= int_end:
+                    new_start = int(line.split("\t")[3]) - int_start + 1
+                    new_end = int(line.split("\t")[4]) - int_start + 1
+                    gff_out.write("{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}".format(
+                        NEW_SEQ_ID, line.split("\t")[1], line.split("\t")[
+                            2], new_start, new_end, line.split("\t")[
+                                5], line.split("\t")[6], line.split("\t")[
+                                    7], line.split("\t")[8]))
+
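The region-string parsing and whole-sequence fallback inside `cut_region` can be isolated into a small helper; `parse_region` is a hypothetical name used here for illustration only:

```python
import math

def parse_region(region):
    """Parse 'chr1' or 'chr1:1000-2000' into (seq_id, start, end).
    A bare sequence name selects the whole sequence (1..infinity).
    Splitting on the last ':' tolerates ':' inside sequence names."""
    if ":" in region and "-" in region.rsplit(":", 1)[-1]:
        seq_id, coords = region.rsplit(":", 1)
        start, end = coords.split("-")
        return seq_id, int(start), int(end)
    return region, 1, math.inf

print(parse_region("chr1:1000-2000"))  # ('chr1', 1000, 2000)
```

Checking for `-` only after the last `:` also keeps sequence names such as `scaf-1` from being misread as a coordinate range.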
+
+def main(args):
+    # Command line arguments
+    GFF_IN = args.gff_input
+    GFF_OUT = args.gff_output
+    REGION = args.region
+    NEW_SEQ_ID = args.new_seq_id
+
+    if GFF_OUT is None:
+        GFF_OUT = "{}_cut{}.gff3".format(GFF_IN, REGION)
+
+    if not NEW_SEQ_ID:
+        NEW_SEQ_ID = REGION
+
+    cut_region(GFF_IN, GFF_OUT, REGION, NEW_SEQ_ID)
+
+
+if __name__ == "__main__":
+
+    # Command line arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-gi',
+                        '--gff_input',
+                        type=str,
+                        required=True,
+                        help='choose gff file')
+    parser.add_argument(
+        '-go', '--gff_output',
+        type=str, help='choose gff file')
+    parser.add_argument('-si', '--new_seq_id', type=str,
+                        help='new sequence ID for the extracted region')
+    parser.add_argument('-rg', '--region', type=str, required=True,
+                        help='sequence or region to extract, e.g. chr1 or chr1:1000-2000')
+    args = parser.parse_args()
+    main(args)
b
diff -r 000000000000 -r a5f1638b73be profrep.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep.py Wed Jun 26 08:01:42 2019 -0400
[
b'@@ -0,0 +1,1249 @@\n+#!/usr/bin/env python3\n+\n+import subprocess\n+import csv\n+import time\n+import sys\n+import matplotlib\n+# matplotlib.use("PDF")\n+matplotlib.use("pdf")\n+import matplotlib.pyplot as plt\n+import matplotlib.colors as colors\n+import matplotlib.cm as cmx\n+import multiprocessing\n+import argparse\n+import os\n+from functools import partial\n+from multiprocessing import Pool\n+from tempfile import NamedTemporaryFile\n+from operator import itemgetter\n+from itertools import groupby\n+import gff\n+import configuration\n+import visualization\n+import distutils\n+from distutils import dir_util\n+import tempfile\n+import re\n+from Bio import SeqIO\n+import sys\n+import pickle\n+import shutil\n+import warnings\n+import random\n+import numpy as np\n+import dante_gff_output_filtering as domains_filtering\n+import dante as protein_domains\n+\n+t_profrep = time.time()\n+np.set_printoptions(threshold=sys.maxsize)\n+warnings.filterwarnings("ignore", module="matplotlib")\n+\n+\n+class Range():\n+    \'\'\'\n+    This class is used to check float range in argparse\n+    \'\'\'\n+\n+    def __init__(self, start, end):\n+        self.start = start\n+        self.end = end\n+\n+    def __eq__(self, other):\n+        return self.start <= other <= self.end\n+\n+    def __str__(self):\n+        return "float range {}..{}".format(self.start, self.end)\n+\n+    def __repr__(self):\n+        return "float range {}..{}".format(self.start, self.end)\n+\n+\n+def get_version(path):\n+    branch = subprocess.check_output("git rev-parse --abbrev-ref HEAD",\n+                                     shell=True,\n+                                     cwd=path).decode(\'ascii\').strip()\n+    shorthash = subprocess.check_output("git log --pretty=format:\'%h\' -n 1  ",\n+                                        shell=True,\n+                                        cwd=path).decode(\'ascii\').strip()\n+    revcount = len(subprocess.check_output("git log --oneline",\n+             
                              shell=True,\n+                                           cwd=path).decode(\'ascii\').split())\n+    version_string = ("-------------------------------------"\n+                      "-------------------------------------\\n"\n+                      "PIPELINE VERSION         : "\n+                      "{branch}-rv-{revcount}({shorthash})\\n"\n+                      "-------------------------------------"\n+                      "-------------------------------------\\n").format(\n+                          branch=branch,\n+                          shorthash=shorthash,\n+                          revcount=revcount, )\n+    return (version_string)\n+\n+\n+def str2bool(v):\n+    if v.lower() in (\'yes\', \'true\', \'t\', \'y\', \'1\'):\n+        return True\n+    elif v.lower() in (\'no\', \'false\', \'f\', \'n\', \'0\'):\n+        return False\n+    else:\n+        raise argparse.ArgumentTypeError(\'Boolean value expected\')\n+\n+\n+def check_fasta_id(QUERY):\n+    forbidden_ids = []\n+    headers = []\n+    for record in SeqIO.parse(QUERY, "fasta"):\n+        if any(x in record.id for x in configuration.FORBIDDEN_CHARS):\n+            forbidden_ids.append(record.id)\n+        headers.append(record.id)\n+    if len(headers) > len(set([header.split(" ")[0] for header in headers])):\n+        raise NameError(\n+            \'\'\'Sequences in multifasta format are not named correctly:\n+\t\t\t\t\t\t\tseq IDs(before the first space) are the same\'\'\')\n+    return forbidden_ids, headers\n+\n+\n+def multifasta(QUERY):\n+    \'\'\' Create single fasta temporary files to be processed sequentially \'\'\'\n+    PATTERN = ">"\n+    fasta_list = []\n+    with open(QUERY, "r") as fasta:\n+        reader = fasta.read()\n+        splitter = reader.split(PATTERN)[1:]\n+        for fasta_num, part in enumerate(splitter):\n+            ntf = NamedTemporaryFile(delete=False)\n+            ntf.write("{}{}".format(PATTERN, part).encode("utf-8"))\n+        
    fasta_list.append(ntf.name)\n+            ntf.close()\n+        return fasta_list\n+\n+\n+def fasta_read(subfasta):\n+    \'\'\' Read fasta, gain header and sequence without gaps \'\''..b't(\n+        \'-thsc\',\n+        \'--threshold_score\',\n+        type=int,\n+        default=80,\n+        help=\n+        \'protein domains module: percentage of the best score within the cluster to  significant domains\')\n+    protOpt.add_argument("-thl",\n+                         "--th_length",\n+                         type=float,\n+                         choices=[Range(0.0, 1.0)],\n+                         default=0.8,\n+                         help="proportion of alignment length threshold")\n+    protOpt.add_argument("-thi",\n+                         "--th_identity",\n+                         type=float,\n+                         choices=[Range(0.0, 1.0)],\n+                         default=0.35,\n+                         help="proportion of alignment identity threshold")\n+    protOpt.add_argument(\n+        "-ths",\n+        "--th_similarity",\n+        type=float,\n+        choices=[Range(0.0, 1.0)],\n+        default=0.45,\n+        help="threshold for alignment proportional similarity")\n+    protOpt.add_argument(\n+        "-ir",\n+        "--interruptions",\n+        type=int,\n+        default=3,\n+        help=\n+        "interruptions (frameshifts + stop codons) tolerance threshold per 100 AA")\n+    protOpt.add_argument(\n+        "-mlen",\n+        "--max_len_proportion",\n+        type=float,\n+        default=1.2,\n+        help=\n+        "maximal proportion of alignment length to the original length of protein domain from database")\n+\n+    ################ OUTPUTS ###########################################\n+    outOpt.add_argument(\'-lg\',\n+                        \'--log_file\',\n+                        type=str,\n+                        default=LOG_FILE,\n+                        help=\'path to log file\')\n+    
outOpt.add_argument(\'-ouf\',\n+                        \'--output_gff\',\n+                        type=str,\n+                        default=REPEATS_GFF,\n+                        help=\'path to output gff of repetitive regions\')\n+    outOpt.add_argument(\'-oug\',\n+                        \'--domain_gff\',\n+                        type=str,\n+                        default=DOMAINS_GFF,\n+                        help=\'path to output gff of protein domains\')\n+    outOpt.add_argument(\'-oun\',\n+                        \'--n_gff\',\n+                        type=str,\n+                        default=N_GFF,\n+                        help=\'path to output gff of N regions\')\n+    outOpt.add_argument(\'-hf\',\n+                        \'--html_file\',\n+                        type=str,\n+                        default=HTML,\n+                        help=\'path to output html file\')\n+    outOpt.add_argument(\'-hp\',\n+                        \'--html_path\',\n+                        type=str,\n+                        default=PROFREP_OUTPUT_DIR,\n+                        help=\'path to html extra files\')\n+\n+    ################ HITS/COPY NUMBERS ####################################\n+    cnOpt.add_argument(\'-cn\',\n+                       \'--copy_numbers\',\n+                       type=str2bool,\n+                       default=False,\n+                       help=\'convert hits to copy numbers\')\n+    cnOpt.add_argument(\n+        \'-gs\',\n+        \'--genome_size\',\n+        type=float,\n+        help=\n+        \'genome size is required when converting hits to copy numbers and you use custom data\')\n+    cnOpt.add_argument(\n+        \'-thr\',\n+        \'--threshold_repeat\',\n+        type=int,\n+        default=3,\n+        help=\n+        \'threshold for hits/copy numbers per position to be considered repetitive\')\n+    cnOpt.add_argument(\n+        \'-thsg\',\n+        \'--threshold_segment\',\n+        type=int,\n+        
default=80,\n+        help=\'threshold for the length of repetitive segment to be reported\')\n+\n+    ################ JBrowse ##########################\n+    galaxyOpt.add_argument(\'-jb\',\n+                           \'--jbrowse_bin\',\n+                           type=str,\n+                           help=\'path to JBrowse bin directory\')\n+\n+    args = parser.parse_args()\n+    main(args)\n'
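The `Range` class defined at the top of profrep.py exploits how argparse validates `choices`: membership is tested with `==`, so an object whose `__eq__` implements `start <= value <= end` turns a single list entry into an interval check. A minimal self-contained sketch:

```python
import argparse

class Range:
    """Interval usable in argparse `choices`: the `in` test falls back to
    the reflected Range.__eq__, which checks start <= value <= end."""
    def __init__(self, start, end):
        self.start, self.end = start, end
    def __eq__(self, other):
        return self.start <= other <= self.end
    def __repr__(self):
        return "float range {}..{}".format(self.start, self.end)

parser = argparse.ArgumentParser()
parser.add_argument("--th_identity", type=float,
                    choices=[Range(0.0, 1.0)], default=0.35)
print(parser.parse_args(["--th_identity", "0.8"]).th_identity)  # 0.8
```

An out-of-range value (e.g. `--th_identity 1.5`) fails validation and argparse exits with a usage error, with `__repr__` supplying a readable message.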
b
diff -r 000000000000 -r a5f1638b73be profrep.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep.xml Wed Jun 26 08:01:42 2019 -0400
[
b'@@ -0,0 +1,183 @@\n+<tool id="profrep" name="ProfRep" version="1.0.0">\n+  <stdio>\n+    <regex match="Traceback" source="stderr" level="fail" description="Unknown error" />\n+  </stdio>\n+  <description> Tool to identify and visualize general repetive profile of a sequence as well as assign repetitive regions to a class from database of repeats </description>\n+<requirements>\n+    <requirement type="package">blast</requirement>\n+    <requirement type="package">last</requirement>\n+    <requirement type="package">ucsc-wigtobigwig</requirement>\n+    <requirement type="package">biopython</requirement>\n+    <requirement type="package">numpy</requirement>\n+    <requirement type="package">matplotlib</requirement>\n+    <requirement type="package">profrep</requirement>\n+    <requirement type="package" version="1.0">profrep_databases</requirement>\n+    <requirement type="package" version="1.16.4">jbrowse</requirement>\n+</requirements>\n+<command>\n+which python3 > /home/petr/tmp/profreptest/env;\n+\n+#if not $custom_data.options_custom_data:\n+\tprofrep_reads=\\$(awk -v var="${custom_data.prepared_dataset}" \'BEGIN{FS="\\t"}{if ($1 == var) print $3}\' $__tool_data_path__/profrep/prepared_datasets.txt) &amp;&amp;\n+\tprofrep_cls=\\$(awk -v var="${custom_data.prepared_dataset}" \'BEGIN{FS="\\t"}{if ($1 == var) print $4}\' $__tool_data_path__/profrep/prepared_datasets.txt) &amp;&amp;\n+\tprofrep_annotation=\\$(awk -v var="${custom_data.prepared_dataset}" \'BEGIN{FS="\\t"}{if ($1 == var) print $5}\' $__tool_data_path__/profrep/prepared_datasets.txt) &amp;&amp;\n+#end if\n+\n+python3 ${__tool_directory__}/profrep.py --query ${input} --output_gff ${ProfGff} --html_file ${html_file}\n+--html_path ${html_file.extra_files_path} --n_gff ${NGff} \n+--protein_domains ${dm.domains_switch}\n+--jbrowse_bin \\${JBROWSE_SOURCE_DIR}/bin\n+--log_file ${log_file}\n+\n+ #if $dm.domains_switch:\n+\t--domain_gff ${DomGff}\n+\t--protein_database ${__tool_data_path__ 
}/protein_domains/${dm.db_type}_pdb\n+\t--classification ${__tool_data_path__ }/protein_domains/${dm.db_type}_class\n+ #end if\n+\n+ #if $advanced_options.opts:  \n+\t--bit_score ${advanced_options.bit_score} \n+\t--word_size ${advanced_options.word_size} \n+\t--e_value ${advanced_options.e_value} \n+\t--threshold_repeat ${advanced_options.threshold} \n+\t--window ${advanced_options.window} \n+\t--overlap ${advanced_options.overlap} \n+\t#if $advanced_options.dust_filter:\n+\t\t--dust_filter "yes"\n+\t#else\n+\t\t--dust_filter "no"\n+\t#end if\n+ #end if\n+ \n+ #if $custom_data.options_custom_data:\n+    --reads ${reads}\n+    --ann_tbl ${annotations}\n+    --cls ${cls}\n+    --new_db True\n+\t#if $custom_data.cn.copy_num:\n+\t\t--copy_numbers $custom_data.cn.copy_num\n+        --genome_size ${custom_data.cn.genome_size}\n+    #end if\n+ #else\n+    --db_id ${custom_data.prepared_dataset}  \n+    --copy_numbers $custom_data.copy_numbers  \n+    --reads $__tool_data_path__/profrep/\\$profrep_reads\n+    --ann_tbl $__tool_data_path__/profrep/\\$profrep_annotation\n+    --cls $__tool_data_path__/profrep/\\$profrep_cls  \n+    --new_db False\n+ #end if \n+</command> \n+\n+<inputs>\n+ <param format="fasta" type="data" name="input" label="DNA sequence to annotate" help="Input sequence in multi-fasta format" />\n+ <conditional name="custom_data" >\n+  <param name="options_custom_data" type="boolean" truevalue="True" falsevalue="False" checked="False" label="Use custom annotation data" />\n+  <when value="False">\n+   <param name="prepared_dataset" type="select" label="Choose existing annotation dataset"  help="You can find list of all available species below in the Database section">\n+    <options from_file="profrep/prepared_datasets.txt" >\n+     <column name="name" index="1"/>\n+     <column name="value" index="0"/>\n+    </options>\n+   </param>  \n+   <param name="copy_numbers" type="boolean" truevalue="True" falsevalue="False" checked="True" label="Convert hits to 
copy numbers"  />  \n+  </when>\n+  <when value="True">\n+   <param format="fasta" type="data" name="reads" label="NGS reads" help="Input file of '..b' repeats (e.g. satellites, MITEs) arbitrary custom classification with any number of levels is allowed\n+\t\t\t\n+\t\tExample::\n+\t\t\n+\t\t\t\t42      repeat|mobile_element|Class_I|LTR|Ty1/copia|SIRE\n+\t\t\t\t43\trepeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Ogre/Tat|TatIV/Ogre\n+\t\t\t\t45      repeat|mobile_element|Class_I|LTR|Ty3/gypsy|non-chromovirus|OTA|Athila\n+\t\t\t\t48      repeat|satellite|PisTR/B\n+\t\t\t\t134     organelle|plastid\n+\t\n+\tAll the files are available from RE clustering archive. For Galaxy manipulation you can use **\'Extract Data for Profrep\'** tool to extract them. Please keep in mind that classification table from RepeatExplorer should serve as some kind of template and it is supposed to be manually adjusted anyway. For selected species these files will already be available as prepared datasets - at present this option is only available for Pisum sativum Terno (Macas et al 2015))\n+\t\n+\n+\t**Principle**\n+\n+\tThe main ProfRep tool runs blastn similarity search on given DNA against the database of all reads (low coverage sequencing). The preliminary hits have to pass quality filter (not too stringent so that the hit are not too fragmented) based on BITSCORE parameter (default **50**). These and other search parameters are all adjustable (*Advanced options* in Galaxy formular). The similarity search runs in parallel which lowers the computing times significantly especially when working with large input data - it defaultly uses all the sources available. The parallelization sliding window is set to **5kb** with **150b** overlap, both parameters are adjustable as well. When changing them, make sure that the overlap is at least of reads length so that the hits on borders are covered. 
The hits are sorted to clusters they belong to and subsequently assigned to a corresponding repetitive class based on the classification table. The hits amounts per each base are recorded for every repeat class separately in form of repetitive profile. Hits can be recalculated to copy numbers if the genome size of the species is provided (for prepared species in the Galaxy menu already included). The profiles are reported in a BigWig format to be visualized as graphs (log scale) in JBrowse. This format is binary, so it cannot be directly checked, but the quantitative information is still available form Wig text-based files in the output data structure ("data" DIR). For a quick check the profiles including the domains regions are also showed in summary HTML report (if the sequence length does not exceed 200kb). The summed profile **ALL** is created based on all individual profiles plus profiles of all mapped (but unclustered or unclassified) reads, keeping track of the overal sequence representation of repeats.\n+\tProtein domains search is accomplished by DANTE tool (see below), running defaultly as a ProfRep module (can be switched off). The protein domains outputs are already **filtered** with default quality parameters optimized for Viridiplantae species.  \n+\n+\t**Outputs** \n+\n+\t\t- **HTML summary report, JBrowse Data Directory** showing basic information and repetitive profile graphs as well as protein domains (optional) for individual sequences (up to 50). This output also serves as an data directory for [JBrowse](https://jbrowse.org/) genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using Galaxy-integrated tool. 
This output can also be downloaded as an archive containing all relevant data for visualization via locally installed JBrowse server (see more about visualization in OUTPUT VISUALIZATION below)\n+\t\t- **Ns GFF** - reports unspecified (N) bases regions in the sequence\n+\t\t- **Repeats GFF** - reports repetitive regions of a certain length (defaultly **80**) and above hits/copy numbers threshold (defaultly **5**)\n+\t\t- **Domains GFF** - reports protein domains, classification of domain, chain orientation and alignment sequences\n+\t\t- **Log file**\n+\t\t\n+\n+ </help>\n+ \n+</tool>\n+\n'
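The help text above describes parallelizing the blastn search with a 5 kb sliding window and 150 b overlap, where the overlap must be at least one read length so hits on chunk borders are covered. A hedged sketch of that chunking (the `windows` helper is illustrative, not ProfRep's actual implementation):

```python
def windows(seq_len, window=5000, overlap=150):
    """Overlapping chunks as 0-based half-open (start, end) intervals.
    Consecutive chunks share `overlap` bases; the final chunk is
    clipped to the sequence end."""
    step = window - overlap
    starts = range(0, max(seq_len - overlap, 1), step)
    return [(s, min(s + window, seq_len)) for s in starts]

print(windows(12000))  # [(0, 5000), (4850, 9850), (9700, 12000)]
```

Each chunk can then be searched in its own worker process, and hit coordinates are shifted back by the chunk start before merging.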
b
diff -r 000000000000 -r a5f1638b73be profrep_db_reducing.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_db_reducing.py Wed Jun 26 08:01:42 2019 -0400
[
b'@@ -0,0 +1,242 @@\n+#!/usr/bin/env python3\n+\n+import argparse\n+import subprocess\n+import re\n+from tempfile import NamedTemporaryFile\n+import os\n+import configuration\n+\n+\n+def group_reads(reads_files_list, IDENTITY_TH):\n+    \'\'\' Reduce the number of reads separately for each significant cluster based on similarities between them using cd-hit tool. \n+\t\tcd-hit produces reduced reads files containing only the representative reads.  \n+\t\'\'\'\n+    reads_seqs_cl_reduced_list = []\n+    ## Run cd-hit on each cluster separately\n+    for cluster_reads in reads_files_list:\n+        cl = cluster_reads.split("_")[-1]\n+        reads_seqs_cl_reduced = NamedTemporaryFile(\n+            suffix=".reduced_{}".format(cl),\n+            delete=False)\n+        subprocess.call("cd-hit-est -i {} -o {} -c {} -M {}".format(\n+            cluster_reads, reads_seqs_cl_reduced.name, IDENTITY_TH,\n+            configuration.MEM_LIM),\n+                        shell=True)\n+        reads_seqs_cl_reduced_list.append(reads_seqs_cl_reduced.name)\n+        reads_seqs_cl_reduced.close()\n+    ## Return the list of reduced reads files\n+    return reads_seqs_cl_reduced_list\n+\n+\n+def representative_reads(READS_ALL, CLS_FILE, CL_SIZE_TH, CLS_REDUCED,\n+                         IDENTITY_TH):\n+    \'\'\' Group the reads based on the sequences similarities. \n+\t\tReplace a group by only the one representative read preserving the quantitative info how much reads it represents\n+\t\t1. Loop over the original cls file and find the significant clusters (min. number of reads) \n+\t\t2. Get the reads which are in individual significant clusters\n+\t\t2. Get the reads sequences for individual clusters to run cd-hit which groups the reads for each cluster\n+\t\t3. After getting all significant ones (clusters sorted by size) process the outputs from cd-hit and to get reads representations\n+\t\t4. Create new cls file and write down significant clusters with the new reads IDs\n+\t\t5. 
Continue reading unsignificant original cls file and copy the rest of clusters to the new cls unchanged \n+\tFind groups of similar reads and replace them with only one representative also preserving the number of reads it represents\t\n+\t\'\'\'\n+    reads_dict = {}\n+    cl = None\n+    line_cl_header = None\n+    modify_files = True\n+    reads_files_list = []\n+    cls_reduced_file = open(CLS_REDUCED, "w")\n+    ## parse file of all clusters from RE\n+    with open(CLS_FILE, "r") as cls_ori:\n+        for line in cls_ori:\n+            if line.startswith(">"):\n+                line_cl_header = line\n+                ## cluster number\n+                cl = re.split(\'\\t| \', line)[0].rstrip().lstrip(">")\n+            else:\n+                reads_in_cl = line.rstrip().split("\\t")\n+                ## reduce only reads in the biggest clusters:\n+                ## the threshold of cluster size is set as a minimum number of reads it has to contain\n+                if len(reads_in_cl) >= CL_SIZE_TH:\n+                    ## for significant cluster create a separate file to write reads sequences\n+                    reads_seqs_cl_orig = NamedTemporaryFile(\n+                        suffix="_{}".format(cl),\n+                        delete=False)\n+                    reads_files_list.append(reads_seqs_cl_orig.name)\n+                    ## for every read in the cluster create entry in reads_dict to which cluster it belongs and the file of the read sequence for this cluster\n+                    for read in reads_in_cl:\n+                        ## Dictionary of reads from significant clusters -> KEY:read_id VALUE:[number of cluster, filename to reads sequences file]\n+                        reads_dict[read] = [cl, reads_seqs_cl_orig.name]\n+                        reads_seqs_cl_orig.close()\n+                ## after getting all significant clusters to be reduced (original cls file sorted by size of clusters), process the reads reads in them and write to 
the modified reads and cls files\n+                elif modify_files:\n+                    ## get reads sequences'..b' else:\n+                    reads_repre_dict[read] = 0\n+                line = clstr.readline()\n+            if read_represent:\n+                reads_repre_dict[read_represent] = count_reads\n+                reads_in_cl_mod.append("{}reduce{}".format(\n+                    read_represent.rstrip().lstrip(">"), count_reads))\n+    return reads_repre_dict, reads_in_cl_mod\n+\n+\n+def reduce_reads(READS_ALL, READS_ALL_REDUCED, reads_repre_dict):\n+    \'\'\' Report a new file of reads sequences based on the original file of ALL reads using the reads representation dictionary.\n+\t\tLoop over the reads in the original READS_ALL file \n+\t\tThere are 3 options evaluated for the read:\n+\t\t\t- the value in the dictionary equals to zero, read is not representative -> it will not take place in the new reads DB\n+\t\t\t- the value is greater than zero, the read is representative -> in new read DB encode the number of representing reads using \'reduce\' tag (<Original_read_ID>reduce<number_represented>)\n+\t\t\t- the read is not in the dictionary -> add it unchanged from the original ALL reads database\n+\t\'\'\'\n+    with open(READS_ALL_REDUCED, "w") as reads_all_red:\n+        with open(READS_ALL, "r") as reads_all_ori:\n+            for line in reads_all_ori:\n+                if line.startswith(">"):\n+                    if line.rstrip() in reads_repre_dict:\n+                        amount_represented = reads_repre_dict[line.rstrip()]\n+                        if amount_represented > 0:\n+                            reads_all_red.write("{}reduce{}\\n".format(\n+                                line.rstrip(), amount_represented))\n+                            reads_all_red.write(reads_all_ori.readline())\n+                    else:\n+                        reads_all_red.write(line)\n+                        
reads_all_red.write(reads_all_ori.readline())\n+\n+\n+def main(args):\n+    CLS_FILE = args.cls\n+    READS_ALL = args.reads_all\n+    CL_SIZE_TH = args.cluster_size\n+    IDENTITY_TH = args.identity_th\n+    CLS_REDUCED = args.cls_reduced\n+    READS_ALL_REDUCED = args.reads_reduced\n+\n+    if not os.path.isabs(CLS_REDUCED):\n+        CLS_REDUCED = os.path.join(os.getcwd(), CLS_REDUCED)\n+\n+    if not os.path.isabs(READS_ALL_REDUCED):\n+        READS_ALL_REDUCED = os.path.join(os.getcwd(), READS_ALL_REDUCED)\n+\n+    reads_repre_dict = representative_reads(READS_ALL, CLS_FILE, CL_SIZE_TH,\n+                                            CLS_REDUCED, IDENTITY_TH)\n+    reduce_reads(READS_ALL, READS_ALL_REDUCED, reads_repre_dict)\n+\n+\n+if __name__ == \'__main__\':\n+\n+    # Command line arguments\n+    parser = argparse.ArgumentParser()\n+    parser.add_argument(\'-r\',\n+                        \'--reads_all\',\n+                        type=str,\n+                        required=True,\n+                        help=\'input file containing all reads sequences\')\n+    parser.add_argument(\n+        \'-c\',\n+        \'--cls\',\n+        type=str,\n+        required=True,\n+        help=\'input sorted cls file containing reads assigned to clusters\')\n+    parser.add_argument(\'-rr\',\n+                        \'--reads_reduced\',\n+                        type=str,\n+                        default=configuration.READS_ALL_REDUCED,\n+                        help=\'output file containing reduced number of reads\')\n+    parser.add_argument(\n+        \'-cr\',\n+        \'--cls_reduced\',\n+        type=str,\n+        default=configuration.CLS_REDUCED,\n+        help=\n+        \'output cls file containing adjusted clusters for the reduced reads database\')\n+    parser.add_argument(\'-i\',\n+                        \'--identity_th\',\n+                        type=float,\n+                        default=0.90,\n+                        help=\'reads identity 
threshold for cdhit\')\n+    parser.add_argument(\'-cs\',\n+                        \'--cluster_size\',\n+                        type=int,\n+                        default=1000,\n+                        help=\'minimum cluster size to be included in reducing\')\n+\n+    args = parser.parse_args()\n+    main(args)\n'
b
diff -r 000000000000 -r a5f1638b73be profrep_db_reducing.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_db_reducing.xml Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,33 @@
+<tool id="profrep_db_reducing" name="cd-hit based size reduction of Profrep database" version="1.0.0">
+<description> Tool to reduce the database of read sequences based on their similarities in order to speed up ProfRep </description>
+<requirements>
+    <requirement type="package" version="1.0.0">profrep</requirement>
+    <requirement type="package" version="4.6.4">cd-hit</requirement>
+</requirements>
+<command>
+python3 ${__tool_directory__}/profrep_db_reducing.py --reads_all ${reads_all} --cls ${cls} --cluster_size ${cluster_size} --identity_th ${identity_th} --reads_reduced ${reads_reduced} --cls_reduced ${cls_reduced}
+</command>
+<inputs>
+ <param format="fasta" type="data" name="reads_all" label="NGS reads" help="Choose the input file containing all read sequences to be reduced" />
+ <param format="fasta" type="data" name="cls" label="RE clusters" help="Choose the file containing all clusters and their assigned reads [ RE archive -> seqclust -> clustering -> hitsort.cls]" />
+ <param name="cluster_size" type="integer" value="1000" min="1" max="1000000000" label="Min cluster size" help="Only reads from the most represented clusters are reduced; this parameter sets the minimum number of reads a cluster must contain to be involved in the reduction" />
+ <param name="identity_th" type="float" value="0.90" min="0.1" max="1.0" label="Reads identity threshold" help="Proportion of sequence identity required between reads to group and reduce them" />
+</inputs>
+
+<outputs>
+ <data format="fasta" name="cls_reduced" label="Modified cls file of ${cls.hid}" />
+ <data format="fasta" name="reads_reduced" label="Reduced reads database of ${reads_all.hid}" />
+</outputs>
+
+ <help>
+
+**WHAT IT DOES**
+
+This tool reduces the database of all reads based on similarities between them using **cd-hit**. It creates groups of similar reads, and the reduced database is then composed of one representative read per group. The new read IDs also indicate the number of reads they represent. The identity threshold between reads required to form a group (a cd-hit parameter) is set to **0.9** by default; this value usually strikes a good balance between reduction level and accuracy. Because a new reads database is produced, the CLS file connecting reads to clusters has to be modified as well. The result is a reduced database of read sequences and a CLS file adjusted to the new reads database. The actual reduction level depends on the number of clusters involved and their sizes. The default cluster size threshold is **1000**, meaning all clusters containing 1000 or more reads will undergo reduction.
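The `<Original_read_ID>reduce<number_represented>` naming scheme described above can be decoded downstream. A minimal sketch (the `represented_count` helper is hypothetical, not part of this tool) of recovering the number of reads a representative stands for:

```python
import re

def represented_count(read_id):
    # Representative reads carry a trailing "reduce<N>" tag;
    # untagged reads represent only themselves.
    match = re.search(r"reduce(\d+)$", read_id)
    return int(match.group(1)) if match else 1

print(represented_count("read_0001reduce57"))  # 57
print(represented_count("read_0002"))          # 1
```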
+
+ </help>
+</tool>
+
+
+
+
b
diff -r 000000000000 -r a5f1638b73be profrep_masking.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_masking.py Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,93 @@
+#!/usr/bin/env python3
+
+import argparse
+from Bio import SeqIO
+from Bio.Seq import MutableSeq
+from Bio.Alphabet import generic_dna
+import sys
+
+
+def main(args):
+    # Command line arguments
+    QUERY = args.query
+    MODE = args.mode
+    REPEAT_GFF = args.repeat_gff
+    MASKED = args.output_masked
+
+    repeats_all = get_indices(REPEAT_GFF)
+
+    if MODE == "lowercase":
+        lower_mask(QUERY, repeats_all, MASKED)
+    else:
+        N_mask(QUERY, repeats_all, MASKED)
+
+
+def get_indices(REPEAT_GFF):
+    '''
+ Get indices of repeats from GFF file to mask
+ '''
+    repeats_all = {}
+    with open(REPEAT_GFF, "r") as repeats_gff:
+        for line in repeats_gff:
+            if not line.startswith("#"):
+                fields = line.split("\t")
+                seq_id = fields[0]
+                start_r = int(fields[3])
+                end_r = int(fields[4])
+                if seq_id in repeats_all:
+                    repeats_all[seq_id].append([start_r, end_r])
+                else:
+                    repeats_all[seq_id] = [[start_r, end_r]]
+    return repeats_all
+
+
+def lower_mask(QUERY, repeats_all, MASKED):
+    allSeqs = list(SeqIO.parse(QUERY, 'fasta'))
+    for singleSeq in allSeqs:
+        mutable = MutableSeq(str(singleSeq.seq), generic_dna)
+        ## sequences without annotated repeats are left unmasked
+        for index in repeats_all.get(singleSeq.id, []):
+            for item in range(index[0] - 1, index[1]):
+                mutable[item] = mutable[item].lower()
+        singleSeq.seq = mutable
+    with open(MASKED, "w") as handle:
+        SeqIO.write(allSeqs, handle, 'fasta')
+
+
+def N_mask(QUERY, repeats_all, MASKED):
+    allSeqs = list(SeqIO.parse(QUERY, 'fasta'))
+    for singleSeq in allSeqs:
+        mutable = MutableSeq(str(singleSeq.seq), generic_dna)
+        ## sequences without annotated repeats are left unmasked
+        for index in repeats_all.get(singleSeq.id, []):
+            for item in range(index[0] - 1, index[1]):
+                mutable[item] = "N"
+        singleSeq.seq = mutable
+    with open(MASKED, "w") as handle:
+        SeqIO.write(allSeqs, handle, 'fasta')
+
+
+if __name__ == "__main__":
+
+    # Command line arguments
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-q',
+                        '--query',
+                        type=str,
+                        required=True,
+                        help='query sequence to be processed')
+    parser.add_argument('-rg',
+                        '--repeat_gff',
+                        type=str,
+                        required=True,
+                        help='GFF file of repeat regions to be masked')
+    parser.add_argument('-m',
+                        '--mode',
+                        default="lowercase",
+                        choices=['lowercase', 'N'],
+                        help='masking mode: lowercase or N')
+    parser.add_argument('-o',
+                        '--output_masked',
+                        type=str,
+                        default="output_masked",
+                        help='output file containing the masked sequence(s)')
+
+    args = parser.parse_args()
+    main(args)
b
diff -r 000000000000 -r a5f1638b73be profrep_masking.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_masking.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,27 @@
+<tool id="profrep_mask" name="ProfRep Masker" version="1.0.0">
+<description> Tool to mask repetitive regions of the original DNA sequence(s) based on the repeat regions reported by ProfRep </description>
+<requirements>
+    <requirement type="package" version="1.0.0">profrep</requirement>
+</requirements>
+<command>
+python3 ${__tool_directory__}/profrep_masking.py --query ${input} --repeat_gff ${rp_gff} --mode ${mode} --output_masked ${out_masked}
+</command>
+<inputs>
+ <param format="fasta" type="data" name="input" label="Choose your input sequence to be masked" help="" />
+ <param format="gff" type="data" name="rp_gff" label="Choose GFF file of repetitive regions" help="" />
+ <param name="mode" type="select" label="Select the mode of masking" help="" >
+  <option value="lowercase" selected="True">lowercase</option>
+     <option value="N">N</option>
+ </param>
+
+</inputs>
+
+<outputs>
+ <data format="fasta" name="out_masked" label="Masked DNA sequence(s) from dataset ${input.hid}" />
+</outputs>
+
+<help>
+**WHAT IT DOES**
+ This tool masks repetitive regions of the original DNA sequence(s) based on a repeats GFF. You can choose either the lowercase mode (the sequence is preserved) or the "N" mode of masking.
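The two masking modes can be illustrated on a single GFF-style interval. A minimal sketch (not the tool itself, which uses Biopython over whole FASTA files), assuming 1-based, inclusive coordinates as in GFF:

```python
def mask_region(seq, start, end, mode="lowercase"):
    # GFF coordinates are 1-based and inclusive
    region = seq[start - 1:end]
    masked = region.lower() if mode == "lowercase" else "N" * len(region)
    return seq[:start - 1] + masked + seq[end:]

print(mask_region("ACGTACGT", 3, 6))       # ACgtacGT
print(mask_region("ACGTACGT", 3, 6, "N"))  # ACNNNNGT
```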
+</help>
+</tool>
b
diff -r 000000000000 -r a5f1638b73be profrep_refine.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_refine.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,55 @@
+
+<tool id="profrep_refine" name="ProfRep Refiner" version="1.0.0">
+  <requirements>
+    <requirement type="package" version="1.0.0">profrep</requirement>
+    <requirement type="package">numpy</requirement>
+  </requirements>
+  <description> Tool to polish the raw ProfRep output: it evaluates overlapping regions of different classifications and interconnects fragmented parts of individual repeats, optionally supported by protein domain information </description>
+  <command>
+    python3 ${__tool_directory__}/profrep_refining.py --repeats_gff ${repeats_gff} --out_refined ${out_refined} --gap_threshold ${gap_th} --include_dom ${include_domains.domains}
+    #if $include_domains.domains:
+    --domains_gff ${include_domains.dom_gff}
+    --dom_number ${include_domains.dom_num}
+    --class_tbl  ${__tool_data_path__}/protein_domains/${include_domains.db_type}_class
+    #end if
+  </command>
+  <inputs>
+    <param format="gff" type="data" name="repeats_gff" label="Repeats GFF" help="Choose repeats GFF3 file from ProfRep output" />
+    <param name="gap_th" type="integer" value="250" label="Gap tolerance" help="Threshold for tolerated gap between two consecutive repeat regions of the same class to be interconnected" />
+    <conditional name="include_domains">
+      <param name="domains" type="boolean" display="checkbox" truevalue="true" falsevalue="false" checked="true" label="Include protein domains information" help = "This helps to improve the confidence of the regions merging" />
+      <when value="true">
+        <param format="gff" type="data" name="dom_gff" label="Domains GFF" help="Choose GFF3 file containing protein domains" />
+        <param name="dom_num" type="integer" value="2" min="0" max="10" label="Minimum domains" help="Min number of domains per mobile element to confirm the regions merging" />
+        <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
+          <options from_file="rexdb_versions.txt">
+            <column name="name" index="0"/>
+            <column name="value" index="1"/>
+          </options>
+        </param>
+
+      </when>
+    </conditional>
+  </inputs>
+  <outputs>
+    <data format="gff3" name="out_refined" label="Refined GFF3 file from dataset ${repeats_gff.hid}" />
+  </outputs>
+
+  <help>
+
+
+    **WHAT IT DOES**
+
+    REFINING PROCESS of repeats annotation runs in two consecutive steps:
+
+   1. REMOVING LOW-CONFIDENCE REPEAT REGIONS
+   
+   Prior to interconnecting regions, it is necessary to filter out nested regions of different classification, which might be false positives and would disrupt the merging process. However, not all overlapping regions of different classification are necessarily wrong; they can reveal inserted or chimeric elements. Therefore, only regions with significantly lower quality are removed. First, clusters of overlapping repeat regions are created. Within a cluster, regions are checked in order of descending PID. Any region occurring inside the current one (with some border tolerance on each side, 10 bp by default) is removed if its PID is more than 5% lower than that of the current region; otherwise it is preserved. Average PID (percent identity) is computed as the per-position mean over equally classified hits, and then averaged over the whole region reported in the repeats GFF.
+
+   2. INTERCONNECTING REGIONS
+   
+   The "cleaned" regions are subsequently interconnected. This step searches for consecutive repeats of the same classification that are no further apart than a gap threshold (250 bp by default) and joins them into segments. These segments must not be interrupted by repeats of a different classification. By default, the confidence of merging is supported by protein domain information (optional). In that case, a minimum number of protein domains with the same strand orientation must be present inside the expanded segment. These domains need to be unambiguously classified down to the very last level (checked against the domains classification table), and their classification must correspond to the repeat classification of the region. Repetitive elements that do not encode protein domains, such as satellites or MITEs (i.e. non-mobile elements), are not checked for domains; their regions of the same classification are merged based on the gap criterion alone.
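The gap-based interconnection of step 2 can be sketched as follows. This is an illustrative simplification, assuming intervals are `(start, end, classification)` tuples sorted by start position; the real tool additionally checks the protein-domain evidence described above:

```python
def interconnect(intervals, gap_th=250):
    # Merge consecutive same-classification intervals whose gap <= gap_th.
    merged = []
    for start, end, cls in intervals:
        if merged and cls == merged[-1][2] and start - merged[-1][1] <= gap_th:
            # same classification within gap tolerance: extend previous segment
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]), cls)
        else:
            merged.append((start, end, cls))
    return merged

regions = [(100, 400, "LTR/copia"), (550, 900, "LTR/copia"),
           (2000, 2400, "LTR/gypsy")]
print(interconnect(regions))
# [(100, 900, 'LTR/copia'), (2000, 2400, 'LTR/gypsy')]
```

Because the input is sorted by start, a region of a different classification between two fragments breaks the chain, matching the rule that segments must not be interrupted by other repeats.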
+
+  </help>
+</tool>
+
b
diff -r 000000000000 -r a5f1638b73be profrep_refining.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/profrep_refining.py Wed Jun 26 08:01:42 2019 -0400
[
b'@@ -0,0 +1,394 @@\n+#!/usr/bin/env python3\n+\n+import argparse\n+import numpy as np\n+import os\n+from tempfile import NamedTemporaryFile\n+from collections import defaultdict\n+import configuration\n+import sys\n+\n+\n+def str2bool(v):\n+    if v.lower() in (\'yes\', \'true\', \'t\', \'y\', \'1\'):\n+        return True\n+    elif v.lower() in (\'no\', \'false\', \'f\', \'n\', \'0\'):\n+        return False\n+    else:\n+        raise argparse.ArgumentTypeError(\'Boolean value expected\')\n+\n+\n+def check_dom_gff(DOM_GFF):\n+    \'\'\' Check if the GFF file of protein domains was given \'\'\'\n+    with open(DOM_GFF) as domains_f:\n+        line = domains_f.readline()\n+        while line.startswith("#"):\n+            line = domains_f.readline()\n+        if len(line.split("\\t")) == 9 and "Final_Classification" in line:\n+            pass\n+        else:\n+            raise IOError(\n+                "There was detected an input GFF file that does not contain domains. Please check it and choose the domains GFF file")\n+\n+\n+def create_dom_dict(DOM_GFF):\n+    \'\'\' Create hash table of protein domains for individual sequences \'\'\'\n+    check_dom_gff(DOM_GFF)\n+    dict_domains = defaultdict(list)\n+    count_comment = check_file_start(DOM_GFF)\n+    with open(DOM_GFF, "r") as domains:\n+        for comment_idx in range(count_comment):\n+            next(domains)\n+        for line in domains:\n+            seq_id = line.split("\\t")[0]\n+            ann_dom_lineage = line.split("\\t")[8].split(";")[1].split("=")[-1]\n+            start_dom = int(line.split("\\t")[3])\n+            end_dom = int(line.split("\\t")[4])\n+            strand_dom = line.split("\\t")[6]\n+            dict_domains[seq_id].append((start_dom, end_dom, ann_dom_lineage,\n+                                         strand_dom))\n+    for seq_id in dict_domains.keys():\n+        dict_domains[seq_id] = sorted(dict_domains[seq_id], key=lambda x: x[0])\n+    return 
dict_domains\n+\n+\n+def check_file_start(gff_file):\n+    count_comment = 0\n+    with open(gff_file, "r") as gff_all:\n+        line = gff_all.readline()\n+        while line.startswith("#"):\n+            line = gff_all.readline()\n+            count_comment += 1\n+    return count_comment\n+\n+\n+def interconnect_intervals(OUT_REFINED, gff_removed, GAP_TH, DOM_NUM,\n+                           dict_domains, domains_classes):\n+    \'\'\' Second step of refining - INTERVALS INTERCONNECTING:\n+\tGradually checking regions of GFF which are sorted by starting point.\n+\tAdding regions to be interconnected if the gap between the new and previous one is lower than threshold and theirclassification is the same as the one set by the first region.\n+\tIf the conditions are not fullfilled the adding is stopped and this whole expanded region is further evaluated:\n+\t1. if domains are not included in joining (A. choosed as parameter or B. based on the classification the element does not belong to mobile elements):\n+\t\t-> the fragments are joined reported as one region in refined gff\n+\t2. if domains should be included:\n+\t\t-> the fragments are joined if domains within the expanded region meets the criteria:\n+\t\t\t1. they are at least certain number of domains \n+\t\t\t2. they have equal strand orientation \n+\t\t\t3. they are classified to the last classification level (checked from the class. 
table of the database) and this matches the classification of the expanded region\n+\t\t-> otherwise the region is refragmented to previous parts, which are reported as they were in the original repeats gff\n+\t\'\'\'\n+    with open(OUT_REFINED, "w") as joined_intervals:\n+        start_line = check_file_start(gff_removed)\n+        with open(gff_removed, "r") as repeats:\n+            for comment_idx in range(start_line):\n+                joined_intervals.write(repeats.readline())\n+            joined = False\n+            initial = repeats.readline()\n+            ## if there are repeats in GFF, initialize\n+            if initial is not "":\n+                seq_id_ini = initial.rstrip().split("\\t")[0]\n+                start_ini = int(initial.rstrip'..b'      "{}\\t{}\\t{}\\t{}\\t{}\\t{}\\t{}\\t{}\\tName={}\\n".format(\n+                    seq_id, configuration.SOURCE_PROFREP,\n+                    configuration.REPEATS_FEATURE, part_start, part_end,\n+                    configuration.GFF_EMPTY, configuration.GFF_EMPTY,\n+                    configuration.GFF_EMPTY, ann_ini))\n+    ## delete already checked domains from the dict\n+    del (dict_domains[seq_id][0:index_dom])\n+    return dict_domains\n+\n+\n+def get_unique_classes(CLASS_TBL):\n+    \'\'\' Get all the lineages of current domains database classification table to subsequently check the protein domains if they are classified up to the last level.\n+\tOnly these domains will be considered as valid for interval joining. 
\n+\tIf their classification is be finite (based on comparing to this list of unique classes) they will not be counted for minimum number of domains criterion within the segment to be joined\n+\t\'\'\'\n+    unique_classes = []\n+    with open(CLASS_TBL, "r") as class_tbl:\n+        for line in class_tbl:\n+            line_class = "|".join(line.rstrip().split("\\t")[1:])\n+            if line_class not in unique_classes:\n+                unique_classes.append(line_class)\n+    return unique_classes\n+\n+\n+def main(args):\n+    # Command line arguments\n+    REPEATS_GFF = args.repeats_gff\n+    DOM_GFF = args.domains_gff\n+    GAP_TH = args.gap_threshold\n+    DOM_NUM = args.dom_number\n+    OUT_REFINED = args.out_refined\n+    INCLUDE_DOM = args.include_dom\n+    CLASS_TBL = args.class_tbl\n+    BORDERS = args.borders\n+\n+    # first step of refining - removing low confidence repeats regions\n+    gff_removed = cluster_regions_for_quality_check(REPEATS_GFF, BORDERS)\n+\n+    # second step of refining - interconnecting repeats regions\n+    if INCLUDE_DOM:\n+        unique_classes = get_unique_classes(CLASS_TBL)\n+        dict_domains = create_dom_dict(DOM_GFF)\n+        joined_intervals = interconnect_intervals(\n+            OUT_REFINED, gff_removed, GAP_TH, DOM_NUM, dict_domains,\n+            unique_classes)\n+    else:\n+        joined_intervals = interconnect_intervals(OUT_REFINED, gff_removed,\n+                                                  GAP_TH, DOM_NUM, None, None)\n+    os.unlink(gff_removed)\n+\n+\n+if __name__ == "__main__":\n+\n+    # Command line arguments\n+    parser = argparse.ArgumentParser()\n+    parser.add_argument(\'-rep_gff\',\n+                        \'--repeats_gff\',\n+                        type=str,\n+                        required=True,\n+                        help=\'original repeats regions GFF from PROFREP\')\n+    parser.add_argument(\n+        \'-dom_gff\',\n+        \'--domains_gff\',\n+        type=str,\n+        
help=\n+        \'protein domains GFF if you want to support repeats joining by domains information\')\n+    parser.add_argument(\n+        \'-gth\',\n+        \'--gap_threshold\',\n+        type=int,\n+        default=250,\n+        help=\'gap tolerance between consecutive repeats to be interconnected\')\n+    parser.add_argument(\'-our\',\n+                        \'--out_refined\',\n+                        type=str,\n+                        default="output_refined.gff",\n+                        help=\'query sequence to be processed\')\n+    parser.add_argument(\n+        \'-dn\',\n+        \'--dom_number\',\n+        type=int,\n+        default=2,\n+        help=\'min number of domains present to confirm the region joining\')\n+    parser.add_argument(\n+        \'-id\',\n+        \'--include_dom\',\n+        type=str2bool,\n+        default=False,\n+        help=\'include domains information to refine the repeats regions\')\n+    parser.add_argument(\n+        \'-ct\',\n+        \'--class_tbl\',\n+        help=\n+        \'classification table of protein domain database to check the level of classification\')\n+    parser.add_argument(\n+        \'-br\',\n+        \'--borders\',\n+        type=int,\n+        default=10,\n+        help=\n+        \'number of bp tolerated from one or the other side of two overlaping regions when evaluating quality\')\n+    args = parser.parse_args()\n+    main(args)\n'
b
diff -r 000000000000 -r a5f1638b73be shed.yml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shed.yml Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,10 @@
+categories: [Annotation]
+description: Annotation of repeats
+long_description: Annotation of repetitive sequences in genome assembly
+name: DANTE-Profrep
+owner: petr-novak
+exclude:
+  - tool-data/profrep/*
+  - tool-data/protein_domains/*
+  - tool-data/profrep.tar.gz
+
b
diff -r 000000000000 -r a5f1638b73be test_data/GEPY_cluster_annotation
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/GEPY_cluster_annotation Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,57 @@
+11 LTR/copia/Angela
+15 LTR/copia/Angela
+32 LTR/copia/Angela
+70 LTR/copia/Angela
+71 LTR/copia/Angela
+78 LTR/copia/Angela
+108 LTR/copia/Angela
+2 LTR/copia/Bianca
+31 LTR/copia/Bianca
+50 LTR/copia/Ivana
+66 LTR/copia/Ivana
+68 LTR/copia/Ivana
+106 LTR/copia/Ivana
+75 LTR/copia/Tork
+83 LTR/copia/Tork
+6 LTR/gypsy/Athila
+17 LTR/gypsy/Athila
+22 LTR/gypsy/Athila
+26 LTR/gypsy/Athila
+37 LTR/gypsy/Athila
+42 LTR/gypsy/Athila
+48 LTR/gypsy/Athila
+55 LTR/gypsy/Athila
+60 LTR/gypsy/Athila
+65 LTR/gypsy/Athila
+74 LTR/gypsy/Athila
+85 LTR/gypsy/Athila
+86 LTR/gypsy/Athila
+89 LTR/gypsy/Athila
+102 LTR/gypsy/Athila
+123 LTR/gypsy/Athila
+127 LTR/gypsy/Athila
+145 LTR/gypsy/Athila
+63 LTR/gypsy/chromo
+73 LTR/gypsy/chromo
+3 LTR/gypsy/Tat
+51 LTR/gypsy/Tat
+56 LTR/gypsy/Tat
+58 LTR/gypsy/Tat
+92 LTR/gypsy/Tat
+96 LTR/gypsy/Tat
+101 LTR/gypsy/Tat
+121 LTR/gypsy/Tat
+18 LTR
+20 LTR
+35 LTR
+8 MITE
+16 MITE
+24 MITE
+4 MITE
+40 MITE
+47 MITE
+111 rDNA
+1 Tandem
+44 Tandem
+109 Tandem
+124 Tandem
b
diff -r 000000000000 -r a5f1638b73be test_data/GEPY_test_long_1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/GEPY_test_long_1 Wed Jun 26 08:01:42 2019 -0400
b
b'@@ -0,0 +1,527 @@\n+>scaffold146.1|size86774\n+CTAGAACACCAACACTAACAGGTACTACAGTCTGGAATCGGATATCTCGCATGTTAAAAT\n+ATTCGGATGTGCTGTTTATATTCTCATTCCCCTGTCTCAAAGAACAAAAGTGGGACCCCA\n+ACGTCGATTGAAAATTTATATTGGATTTGAATCTCCTACGATTATACGATACCTTGAGCC\n+NNNNNNNTTAACATGAGATGTGTTTACTGCTAGATTTGCAGATTGTTATTTTGATCAAAC\n+CCATTTTCATAAGTCATTGAAATAAAATGAGAAATATAAAAAATTAAGTTGGCATAAGTC\n+ATCATTGACACATTACGATCCTCGTACTAAGGAATGTGAACTGGGAGTTTAGAAAATTCT\n+TCATGTGCCGGAATAGACAATTCAATTGTCGAATGTGTTTAACGATGCGAATGGGGTTTT\n+ACAATCGCATATACTTGCAGCAAACACACCGATTAAGGTTGATGTTCCTGAAGAACGCAC\n+GAAAATTGCGAACGAATCAAAAATGCGTTTGGAACGAGGTAGATCTATTGGTTCTAAGGA\n+TAAGAATCCTAGAACGAACACCGAATGTTCAAATTGAGAATGTGTCGAATCCTCTGAGCA\n+CACATATGGTGGTTAGATCTTTGGATGTGAAAATGGATCCATTCAGACCGCATGAGAACG\n+ATGAAGAAATATTAGGNNNNNGAAGTACCTTATCTTAGTGCAATCGGGGCATTGATGTAT\n+CTTGCGAATAATACGAGGCCTAATATAGCATTTGCTGTTAATCTGTTGACAAGATGTAGT\n+TCGTCGCCTACGAAAAGATATTGGAAATGCGTGAAACATGTTCTTCGATANNNNNNNNNN\n+NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n+NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n+NNNNNNNNNNNNNNTATTTCTTGGCGATCTACGAAACAAACTATCGTAGCTATCTCGTCA\n+AATCACGTAGAATTATTAGCGATACATGACACAAGTCGTGAATGCGTCTGGTTGAGATTT\n+ATGATTGAAAGCATTTATAATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n+NNNNNNNNNNNNNNTACAGTTGAAAGAATGATATATTAAATGTGACCGAACGAAACATAT\n+TTCGCCAAAATTCTTTCTTTACACAAGATCTTCAAAAGAACGGAGATGTGATTATCCAGC\n+AGATACGATCAAACGATAATGTAGTAGATTTATTCACAAAACTGCTTCATACGGCAGGTT\n+TTGAAAAGTTGATCTACAACAATGGCATTCGAAGATTGAAAGGTTTGGAGTGATGCAACC\n+ATCAGGGGTAGATGTTTTTGCTTGAAGACGGAGGGATGTAAAAAGATTATAGAAATGTAC\n+TCTTTTTCATTCACTAAGGTTTTTATCCTTTTTCCTTAGTAAAGTTTTAACGAGTCATAT\n+CCTATAATGATAGACATCCAGGGTGGAGTATTACAAAACTATACTCGAAAATTAGATTGT\n+GGATGTCTAGTTTACCAAGTTTCAAATAAAGACGGAAATAAATAGTACTATACACAAAAT\n+AATGCTATTCATGTGGGGCTCACGTCATTAATTTGTTTGAATTATAAAACGGTTCAGAAC\n+CATCGCTCACCTATATAAATAGAGGTTATGTATGCTGAAATTATACAGATGAAATAATAC\n+AGATTTTATACTTTCATTTTCTTTATTCTTCTTCCGTTTCTACTATATCGAAGTAATTCA\n+TAGAGAAGTTGACGTAGAACGTCCGATTGAAGATTCAAGTAAATATTTTTCATTTATTGG\
n+TATTATTACTTTCCTAACAATTATTTGATTAAGCGCTATTGTTATTTGAGTCTCCTTCAT\n+TCACACAAATTGCATTCGAGAAAAAGACATTTTTTGTCCCCTCAAATTTTTCAACTTTCA\n+ATTTTTTGTCCCTCGACTTTCAAAGAAGACACATTTGGTTTCTTAAATTTGATTTAAGGT\n+CAATTTTGATTCATATTACAAAATTTTAATCATAAATTGACTATTTTACCTGTAAATAAA\n+TATTTTTAAAAGTTGAATTTCATTTTCTTAGACTATTTAAATGTAATTTGACTTGATTGT\n+GAGACTTATGAAGTGATTGTGAATCTTGTTTAAGACTTAAATATTCAATGTATGATAGAA\n+AATTTATGTTGCAATCATATTATTGTGGAAATCTTATATAACATTGACGTGGAAATTTTT\n+GTGTCGTGCCAAAAATAATAACCTCACAACAACAATAATGGAAAATTTTCTGTGCTCATT\n+TTTTGTCGTCTTCCTCCTCCTCTCTGCCGCTGCAAATGGCGACGACGTGTACACATCCTT\n+CGTTAACTTCCTCGCAAAGAACGGCATTTCCAGCGCCGAAATCTCTTCCACCGTCTACTC\n+TCCACAAAACACTAGCTTCCAGAACGTTCTACTCTCCGCCGTGAGGAACCGCCGGTTCAA\n+CCGATCCACCACCAGAAACCCAGCACGATTTTCGCGCCCACGGCGGAATCCCACGTCAGC\n+GCCGCCGTCATTTGCTCCAAGGAACTCGGGATTCAGCTCAAGATCCGCAGCGGCGGCCAC\n+GACTTCGAGGGCATCTCCTACGTTTCTGCGGACGGCGGCGCGTTCGTCTTACTGGATGTG\n+TCCAATTTCCGGTCGATTTCCGTCGATATTCCCGGCGAGACGGCGTGGGTCGGCCCCGGG\n+GCTTATCTCGGAGAGCTGTACTACAGGATCTGGGAGAAGAGTAGCGTCCACGGTTTCCCC\n+GCCGGGGTCCCGCCCTCCGTTCGATTTTCCAGAAAATGCTTCAAATCGGCGAAGTGGGGC\n+TGACGTTTAACTCCTACGGCGGAGTAATGGACCGGATCCCGGAATCGGAAGCTCCCTTCC\n+\n+TCCTAAAGAACTTAAAACGGCTTATATCTCGCTATATAACGATATTTAAGGATTCGAAAC\n+CACTTATAATCTCTTTTCATGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTAAA\n+ATGCCTTATTATGCATCAAATAGCTTCAAATAGCCTTAACTATGCTAAAGAACTTATAAC\n+GGCTTATATCTCGCTATATAACGATATTTAACCATCCGAAACCACTTATAATCTCTTTTC\n+AAGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTAAAATGTCTTATTATGCATCA\n+AATAGCTTCAAATAGCCTAAACTATGCTAAAGAACTTATAACGGCTTATATCTCGCTATA\n+TAACGATATTAAAGCATCCGAAACCACTTATAATCTCTTTTCAAGCCTTAAATGAGGTTA\n+TTAAAGGATCCAATTGCATTTAGAATGCCTTATTATGCATCAAATAGCTTCAAATAGCCT\n+AAACTATGCTAAAGAACTTATAACGGCTTATATCTCGCTATATAACGATATTAAAGCATC\n+TCCTAAAGAACTTAAAACGGCTTATATCTCGCTATATAACGATATTTAAGGATTCGAAAC\n+CACTTATAATCTCTTTTCATGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTAAA\n+ATGCCTTATTATGCATCAAATAGCTTCAAATAGCCTTAACTATGCTAAAGAACTTATAAC\n+GGCTTATATCTCGCTATATAACGATATTTAACCATCCGAAACCACTTATAATCTCTTTTC\n+AAGCCTTAAATGAGGTTATTTAAGGATCCAATTGCATTTAAA
ATGTCTTATTATGCATCA\n+AATAGCTTCAAATAGCCTAAACTATGCTAAAGAACTTATAACGGCT'..b'ATATATCTTTCCATGTTATGAATGATCCTGAAGATCC\n+TGAACCTAGAACAATGACTGAATGTCAGAAACGAGATGATTGGCCAAAATGGAAAGATGC\n+TATAGAAAGTGAGCTGAAATCTCTGAATAAGAGAGATGTTTTCGGACCTGTAGTTCGAAC\n+ACCTGAAGGTGTACAACCGGTTGGTTATAAGTGGGTTTTTGTGAGAAAACGAAATGATAA\n+AGGAGAAATATCTCGGTATAAGGCGAGATTAGTAGCTCAAGGGTTTTCTCAAAGGCCAGG\n+AATTGATTATGATGAAACCTATTCACCGGTTATGGATGCCACAACTTTCAGGTTTTTGAT\n+AAGTCTGGCGATTGAATATGGGCTTGATTTACAACTGATGGATGTTGTAACAGCATACTT\n+ATATGGGTCACTGGATTGTGAAATATATATGAAAATCCCTGAAGGGTTTCATATGCCTGA\n+ACGATATAGTTCTGAACCCCGTACCGATTATGCGATTAAATTGAATAAATCCCTGTATGG\n+ATTAAAGCAGTCAGGACGAATGTGGTATAACCGTCTAAGTGAATACTTGATTAAAGAGGG\n+TTATAAGAACAATTTGGTTTGTCCCTGTGTTTTTATGAAGAAATTCGAAAATGAGTTCGT\n+GATCATCGCTGTGTATGTCGATGACATTAATATTGTGGGAACTCAGAAGGCATTATTGGA\n+TGCCGTGAACTGCTTGAAAAGGGAATTTGAAATGAAGGATTTGGGAAGAACGAAATATTG\n+CCTTGGTTTGCAAATTGAATATTTGAAAAATGGGATTTTTCG\n+TACCGATTATGCTATTAAATTGAATAAATCCCTGTATGGATTAAAGCAGTCAGGACGAAT\n+GTGGTATAACCGTCTGAGTGAGTATCTGATCAAAGAAGGTTATAAAAACAATTTGGTTTG\n+TCCTTGTGTTTTTATGAAGAATTTTGAAAATGAGTTCGTGATCATCGCTGTGTATGTCGA\n+TGACATTAATATTGTGGGAACTCAGAAGGCATTATTAGATGCTGTGAACTGCTTGAAAAG\n+GGAATTTGAAATGAAGGATTTGGGAAGAACGAAATATTGCCTTGGTTTGCAAATTGAATA\n+TTTGAAAAATGGGATTTTTCTTCATCAGAATACGTATACCAAGAAGGTATTGAAACGTTT\n+TTATATGGATTATTCACATCCTCTGAGCACACCTATGGTGGTTAGATCTTTAGATGTGAA\n+AACGGATCCATTCAGGCCACAGGAGAACGATGAAGAAATATTAGGTCCTGAAGTACCTTA\n+TCTTAGTGCAATCGGGGCATTAATGTATCTTGCGAATAATACGAGGCCTGACATTGCATT\n+TGCTGTTAATCTGTTGGCAAGATATAGTTCATCGCCTACGAAAAGACATTGGAAAGGCGT\n+GAAACATGTTCTTCGATATCTTCAAGGTACTACTGATAAGGGGTTGTATTATCAGAAAGA\n+TATGAAGTCAGAACTTATCGGGTATGCTGATGCTGGATATAGATCAGATCCACATAATGG\n+GAGATCTCAGACAGGATATGTTTTCCTGAATAAAGGAGCTGCTATTTCTTGGCGATCTAC\n+GAAACAGACTATCGCAGCTACCTCGTCAAACCACGCAGAATTACTAGCGATACACGAAAC\n+AAGTCGTGAATGCGTTTGGTTGAGATCTATGATTGAAAGCATTTATAATGCTTGTGGATT\n+GTTTACAGATAAGATGCCTCCGACTGTATTATATGAAGATAATAGTGCATGTATTATACA\n+GTTGAAAGAAGGATATATTAAGGGTGACAGAACGAAACATATTTCACCAAAATTCTTCTT\n+TACACATGATCTTCAA
[truncated payload: remainder of the nucleotide test-sequence data (FASTA-style lines, ~60 bp each) added for the preceding test_data file in this changeset]
diff -r 000000000000 -r a5f1638b73be test_data/classification.csv
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/classification.csv Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,401 @@
[test_data/classification.csv: 401 tab-separated lines mapping repeat element names to superfamily and lineage; middle truncated in this dump. Excerpt:]
+ATCOPI1_I	Ty1/copia	AleI/Retrofit
+ATCOPIA21I	Ty1/copia	AleI/Retrofit
+Tekay_ZeaM_ID50	Ty3/gypsy	chromovirus
+Ogre-MT1A	Ty3/gypsy	Ogre/Tat
+RIRE2_I	Ty3/gypsy	Ogre/Tat
+Tat4-1	Ty3/gypsy	Ogre/Tat
[a few entries carry a trailing '#' comment, e.g. "#domains of this element are no longer present in the TE_domains_newest.fasta"]
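The classification table above can be read with a few lines of Python. This is a minimal sketch, not ProfRep's own loader: the three-column tab-separated layout (element name, superfamily, lineage) and the trailing `#` comments are inferred from the dump, and `parse_classification` is an illustrative name.

```python
def parse_classification(lines):
    """Return {element_name: (superfamily, lineage)} from TSV lines.

    Trailing '#' comments (as seen on a few rows of classification.csv)
    and blank lines are ignored.
    """
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comment, whitespace
        if not line:
            continue
        name, superfamily, lineage = line.split("\t")[:3]
        table[name] = (superfamily, lineage)
    return table

sample = [
    "ATCOPI1_I\tTy1/copia\tAleI/Retrofit",
    "LotJ1_ID63\tTy3/gypsy\tchromovirus\t#domain no longer present",
]
print(parse_classification(sample)["ATCOPI1_I"])  # ('Ty1/copia', 'AleI/Retrofit')
```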
diff -r 000000000000 -r a5f1638b73be test_data/hitsort_PID90_LCOV55.cls
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/hitsort_PID90_LCOV55.cls Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,257660 @@
[test_data/hitsort_PID90_LCOV55.cls: 257,660 lines of RepeatExplorer cluster output; each record is a '>CLn size' header followed by the whitespace-separated member read IDs (trailing f/r marks forward/reverse reads); middle truncated in this dump. Excerpt:]
+>CL1 19412
+187743f 94702f 426720f 312327f 164509r 16474f 110280f 34469r ...
+>CL128829 2
+396404f 328610r
+>CL128830 2
+78891r 25505f
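The `.cls` cluster listing above follows a simple two-line record shape, which can be sketched as a parser. This is an assumption-laden illustration based only on what the dump shows (`>CLn size` headers, then member read IDs); `parse_cls` is a hypothetical helper, not part of ProfRep.

```python
def parse_cls(lines):
    """Return {cluster_id: [read_ids]} from a RepeatExplorer-style .cls file.

    A record is a '>CL<n> <size>' header line followed by one or more lines
    of whitespace-separated read IDs (e.g. '78891r 25505f').
    """
    clusters, current = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]  # 'CL1' from '>CL1 19412'
            clusters[current] = []
        elif line and current is not None:
            clusters[current].extend(line.split())
    return clusters

sample = [">CL128829 2 ", "78891r 25505f"]
print(parse_cls(sample))  # {'CL128829': ['78891r', '25505f']}
```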
diff -r 000000000000 -r a5f1638b73be test_data/proteins_all
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/proteins_all Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,13196 @@
[test_data/proteins_all: 13,196 lines of protein domain sequences in FASTA format, with headers of the form >Domain__ElementName (e.g. >Ty3-CHDCR__PetI2_ID76, >Ty1-RT__Llorens_Vitico1-2_AM462010.2); middle truncated in this dump. Excerpt:]
+>Ty3-CHDCR__PetI2_ID76
+DAARTRADQVFNPPRPMTRSQAKLLHDVHIHLAWSKESNYSEAKWPSSRTEGYIFKFEP
+>Ty3-CHDCR__PetI1_ID72
+DKTQTPTRPFTRSQAKLIQDTHLHLTWKGMEEGPKSDLKSMLWCAMDGGPSRTL
diff -r 000000000000 -r a5f1638b73be test_data/reads_all
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/reads_all Wed Jun 26 08:01:42 2019 -0400
@@ -0,0 +1,1720000 @@
[test_data/reads_all: 1,720,000 lines of paired sequencing reads in FASTA format, 100 bp per read, named 1f/1r through 430000f/430000r (f = forward mate, r = reverse mate); middle and tail truncated in this dump. Excerpt:]
+>1f
+ATGGTTGACTTTTCACTGCCCCCATTCGACGACATAGTAAGTAATATATTCTGATTCGAATCGAAGGGTATGGTTTTTATTTGAGAGCAATGTGTTTGAT
+>1r
+AAAATCTTTTTTTTAATCATATTTTGCAACATAGTTAAATACAGCATTAAGCTAAATTATTTAACTTCAAGCAATGTATTTAACTATGAACAACATTTAA
AGAAGATGAAGAAGAAGAAGAATTTGATGAATCT\n+>430000r\n+AAGGGGGTTGAATTTTGGGTGTTTTCTGGAAGCCAACTGGGGAAGGACAACACCAAAAACCTTCCGGAGTGACTGCCATCGGAACTGTTTTTATCTTCGT\n'
b
diff -r 000000000000 -r a5f1638b73be test_data/test_seq_1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/test_seq_1 Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,34 @@
+>test_seq_1
+tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
+TTGCAAAAAGAGAAGATTGGTAAAACGGGGGGGAATGTGTGAGTTGTAATGGGTTCATCT
+GCTCATGTTACTGCAGGGAGGACAGGTCCCGAACCCATTTCCAGCCTTTGGCCTGTCCCT
+CGTCCCTTACCGATCAAAGATACAACCCGGCGGAAGACCTATGGACTCTTACGGTATGCA
+AGCTAGTTCCTTTAGATTAGATTACAATGTTTACTTTTGTTATTTCTAGTTGGCAAGAGT
+GTTCCCCATAGTGTAGTTCCTTGAGCGGTAACAGTGAGCGAACCGTGAGAGCATGCATAC
+TGTTTCGGACGAGCTGTTTTACAGGTTTCCAAAGACTGTTTCTTTGCGATCCGACCAGAA
+GCTACTGCATGTTCGAGGACGAACATGGGGCAAGTGTGGGGATGTTTGGAGGTATGTTAT
+TTTTGTATGTTTTTAGCTATGCATTTTAGGCCTCTTGATAGTGGTTAGAGTTGATTTTGA
+tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
+CCGCTTCCGCACCAGGCGAGCTTAGCTCGTGTGTTCCTTCCTGGCGGACGCCTCAGAGGG
+AGATCATGTTGGTGATCAATGTTGGGGTGGGATAGGAGTTTACTCGTGGGCTATGTTGGA
+CCTCTCTTTGGGACATGGTCCAGAGCTAATTGGTCGTCTCTAGGAGGGCCAGATGACAGT
+tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
+ACCTGAAGGTGTACAACCGGTTGGTTATAAGTGGGTTTTTGTGAGAAAACGAAATGATAA
+AGGAGAAATATCTCGGTATAAGGCGAGATTAGTAGCTCAAGGGTTTTCTCAAAGGCCAGG
+AATTGATTATGATGAAACCTATTCACCGGTTATGGATGCCACAACTTTCAGGTTTTTGAT
+AAGTCTGGCGATTGAATATGGGCTTGATTTACAACTGATGGATGTTGTAACAGCATACTT
+ATATGGGTCACTGGATTGTGAAATATATATGAAAATCCCTGAAGGGTTTCATATGCCTGA
+ACGATATAGTTCTGAACCCCGTACCGATTATGCGATTAAATTGAATAAATCCCTGTATGG
+ATTAAAGCAGTCAGGACGAATGTGGTATAACCGTCTAAGTGAATACTTGATTAAAGAGGG
+tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
+tatactacgcgactagactacgatactagggacagcatattacacccagagaacagacta
+ACAAGGTGGCGACAGTGGAACATGGCCCGATCGAGGACCAGCGTGAAGTCACGCATAACA
+TGGAACCAATCGGGTACAAGAACGTTTCACTATCCTCTTCCGACGGAAGGAAGAACGTCA
+AGATTGGGGTGCAAATGCCCCCAAATATCGAAGAACAACTCATCCAGGTCTTGACAGAGT
+ATCAAGACATCTTTGCTTGGGACATCTCCGAGGTCCCTGGAATTGATCGGTCACTGATGG
+AACATCGCATCAATACCGATCCTGAGGCCGTGCCCGTTCGACAAAAAAGGAGACGCTTCT
+CTCACAATCTGTGATCAACCCGAGAGAAAATGTAAGCGCTGTAACGCTGAGGAGTGGAAA
+GGTTGCAGATGAAGCAATCCAGAAGAAGAGGAAATCGCCTAAAGAAGCAGCAACAAAACC
+AGAGGATGAGAAGGAGGTCGAAGCTGCTGAACCAGTGACAGAACCCACTGCCAAGAAACA
+GAAAGAACCAGAGGTCGAGGTAACAAAGGAAAAATCTGTTATTAAACCTTACTATGAACT
+TCCACCTTTTCCAGGGAGGTTCAAGCTGGAGAAAAAGCAAGAGGAAGAGAAGGAGTTGAT
b
diff -r 000000000000 -r a5f1638b73be test_data/vyber-Ty1_01.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/vyber-Ty1_01.fasta Wed Jun 26 08:01:42 2019 -0400
b
b'@@ -0,0 +1,1015 @@\n+>Acoerulea195_58_rc\n+CTTTCTGAAACCGGCACGAAGTGGTTTGACATCAATTTTGTGACAAAATGTCTTTGGTTA\n+TCTCGCCTTGAATTCATTGTTAGGAAAATCAAGATGTACGCATGGTCGTCTCGTGTCCTT\n+GTGTCTCTTAGGTCTGTCCAGGAAAAGTTATTAGGTTCTGGACGTACGTACCGAGTTAGG\n+ACGTACGTACGTTCACTGAAGCAGCAAACATCTTTTGGTCAGGACGTACGTACTCAGTTA\n+GGACGTTCGTACGTCCAAGTAGGCAGCAGACGTGTTTTCGTATGGAGACGTACGTACCGA\n+TTTAGAGCGTTCGTACGTCATGTTAGGCAGCAGGCTTATTTTTGTGTGGCGACGTTCGAA\n+CGTCATATCGACGTACGAACGTCCAAAGTCGGTTTGCACTGTTTCGGCCCATTTTCCTAG\n+GTTTTCAACCGTAAGTATAAATAGGGTTTCTCTTCCATAGAACAAGATAGTTCTTTGGCC\n+GTCCATCCTTCTTCTTGTTCATGTGTTTTGGTTAGGTTTTTGATATTGAACTTGTTTGTG\n+GCTATTGAACTCAATTCCCTTTCTTCCCTCTTTTCTCAATTCATATCTTTGTGATCTTCA\n+AGAGTTATTGTGAATTCGGCTTCGAGTCTTCTTGTGTGTTCATCAAGAAGTCGAGTGTCT\n+TGTGCGTACATCAAGGTGCTCTGGTGAAGTCAAGGTGTTAGAGATAACCAGGAGTATCAA\n+CTCGCAGTGAGTGCAGGTAGTTGATAGGTTTGTACATCTCTTTTCATCTAGTGGATTATT\n+TTGTTGTCGCGAGGACAACCGTGGACGTTTCCCTGTTTGGGTTTTACCACGTTAAAAATC\n+TCTGTGTGTTATTTACTTTCTGCGTATTGTTTGTTTGCTTGTTTGAATTCACATTTGTTA\n+TACTAGCACTATTTGTAGTTTCTCAATTGGTATCAGTCGCTGGGGGCTGGTTAAGGAGCG\n+TAAAGGCTCACGCTAACGATCAATAGTGCTATGGATTACTCTGGTAAGACACATGTCTCG\n+GTGGTTGCTCGACTCTCTGATGAGCAGTCCATTGACAGTAAAGAATGCTCTAGTAATTGT\n+GACAGGGATTTGATTCTGGTGGACAATAATGATTGGGCTACCCTATACCGGTTATCCAGA\n+AAGGAATGCCTAAGGTTGGAGGCACAAAATTGTATTCTTCAGGATAGACTTGACTTATTG\n+ACCTCTAGTTCCTCATCTATGTCAATCCTGGATAGGGAACCTGTATCATGGCAGGTTGAT\n+AAGGTGGCTTTACTGGGTAATCTTGCTGCTCTTGAGCATGACAAGGTCGAATGGGAAGCT\n+CGGTATAAGATTGTGTCCTCAGAACTAGACAAGGTCAAAAAGGAGTTGATTCGCTTTCAG\n+TCGTTCGAAAAGTGCAATGGGTTGTCATCTTCTGAATTACCTCCACTGTTGCCCCATCCT\n+CCTACCAAAAAGTCCGACATTCATCACTCTGTGGACACGGTTGGCATCCGGCGTAAGTGG\n+AAACTGTTCAGGCGTGTTCCGCCTCAGGGAAGGAAACGTCAAATGCCGTGCTCACTATGC\n+GGACAGTTTGGACATTGGGCCTCGCAGTGTGGTGTGGCTGCTCATGTTCCACAATCATGG\n+AAGCCGTTTTCACAACCGTCGTATGATTATCCATCTTATCCATATGCCTCTTACTCTTTT\n+CCTTGTAACTTTGCAGGAAATGTTTCAAATGTGGCGATTACTGCCCTTCATGTCTCAAGC\n+TCGGATAGTGAGTGGTTACTTGATAGTGGAGCTTCAAAACATATGTCAGGTAATGCCAAA\n+CTTTTCTCCTCCGTTACTGCTATAGATGGTGGAAGTGTTACTTTTGGGAATGGTAAGAGC\n+TC
TCCTGTGATTGGTAAGGGATTTGTCGCTGGTATTGGTTTATCTCCGAATGATGTTTGT\n+TTGTTAGTTGATGGTTTGCGTGTAAATTTGATCAGTATTAGCCAACTGTGTGATACTGAC\n+CATACTGTTAATTTTTCCAAAAATATATGTACCGTGCTTGATAGTTTGGGTAAGTGTATC\n+ATGACAGGTAAACGAACATTGGATAATTGCTATGCCATTCAGCCTGTTACATCTAGCATG\n+AACTGCTTACCTAGCAAACTAAATGAGGGTCTATTGTGGCATCAACGACTCGGTCATGTC\n+AACTTTGAACACCTGGACAATCTAACTCGGAATGAATATATTAAGGGAGTTCCTAGACTT\n+GGAAGAAATCGAGACACTGTGTGTGGTGGGTGTCAACTAGGTAAACAGATACGTAGTCCA\n+CATTCCAAGAAAAAATCCATAACCACATCTTCTCCTTTAGAACTCATACACATGGATCTG\n+ATGGGTCCTACTCGTACTCCTAGTCTAGGAGGCAAACGATACATCTTGGTTATGGTTGAT\n+GACTACACTCGCTTTACCTGGGTATCATTCTTGCGTGAAAAATCTGATGCGTTTCTTGAG\n+TTTCAGGGGATATGCCTTCGTATTCAGAACGAGAAAGATACTCAAATTAAACATATCAGA\n+AGCGATAGAGGTGGTGAGTTCACAGCCACAGGTGTGATTGAGTATTGTATTGCAAATGGT\n+ACATGGCAAGAATTTTCGGCTCCATACACTCCGCAACAAAACGGAGTCGCTGAAAGGAAA\n+AATCGTGTTATTCAGGAGATGGCTCGTGCTATGTTGCATGCAAAGGATGTTCCGACCAAG\n+TTTTGGGCGGAAGTGGTTCATACTGCTTGTTACATAATGAACCGTGTATATCTAAGATCT\n+GGTACCACACAAACTGCTTATGAGCTATGGTATGGTAAGAAGCCGAATCTCAAATATATG\n+CGAGTTTTTGGTAGTGTGTGCTATGTGTGCAAGGACAGACAAAGTCTGTCCAAGTTTGAT\n+AGTCGAGGTGAAGTAGCTCTTTTACTTGGTTATTCTTCTAACAGTAGAGCCTTTCGAGTG\n+TTTAACTACACCACTCGCAAGGTCATGGAATCCTTTAATGTTGTTGTTGATGACACTATT\n+ACATCTGACTCTTCTGTTTCCACTGGTACACAGGATGTCACAGTTCTCTCACCCGTGTCA\n+GACCCGGCTGACATGTCTTCCATATCGTTATCATCACCTGATAATGGCAATGGTGGTACT\n+AAACCTTCTGATGCTGCAGAGGACGTGCCTAGTAGGACCGGTGCTGTGCTCACACCGGAT\n+GATGTGGTTCAATCACCAGATGTGATTGATGTGTCTTCAGATCTCTCCACTGTCCCTGCT\n+GACCCTGAAAGGGTGTTTAATCTAGCTTCACCTCGTGTCAAACAATATCACTCCTTAGGA\n+GATATCATTGGGGATATTAATGATCAGCGTCTGACTCGTCGGAGGGCCAAGGAGACAAAT\n+TGTGTTCATTATGTTTGTTATCTCTCTTCTCTTGAACCTAAAAATGTTACTGATGCTCTT\n+ATTGATGATGATTGGCTAGTTGCTATGCAGGAAGAACTCGGTCAGTTTAAGCGTAGTGAT\n+GTCTGGACGTTGGTTCCTAGACCTACTCACACTAATGTGGTTGGCACCAAGTGGATCTTT\n+AAAAACAAGTTGGATGAGTTCGGACAGATTGTGCGCAACAAGGCAAGGCTCGTAGCTCAG\n+GGCTACAGTCAGATTGAAGGTATTGACTATGGAGAGACATTTGCTCCCGTGGCTAGGTTG\n+GAATCTGTCAGGCTTCTTCTTGCTATGGCATGCCACTTGAATTTCAAGTTGTATCAAATG\n+GATGTCAAAAGTGCATTTCTCAATGGTATTCTTAATGAGGAGGTCTATG
TTGAACAACCT\n+AAAGGGTTTGTGGATCACACTTTCCGAATCATGTCTTCAAATTGCAAAAAGC'..b'ACAGCCTTCCAACTA\n+TCAATTTCTCAAAATTATCAGATTTTGTGTGCACCGCATGTGCAACTGGAAAATTAATTA\n+TAAAACCATCTTATCTTAAAGTTAAAAATGAGTCATTAAATTTTCTTGAACGCATTCAAG\n+GAGATATATGTGGTCCAATTCAAGCACTATCAGGACCTTTTAGATATTTCATGGTGCTCA\n+TATATGCATCTACTAGATGGTCACATGTGTGTCTATTGTCCACACGAAATCATGCTTTTT\n+CCCAGCTTATTGATCAAATTATCAAATTAAGAGCAAATCATCCTAAAAATAGGATAAAAA\n+CAATTCGAATGGATAATGCCGCTGAATTTTCTTCACGTGCATTCAATGACTATTGCATGG\n+CTATGGGCATTCATTTAGAACATTTTGTGCCTTATGTTCATACTCAAAATGGTTTGGCTG\n+AATCTCTCATCAAAAGAGTAAAATTAGTTGCTCGACCACTATTACAGAATTGTAATTTAC\n+CAGCATCATGTTGGGCACATGCGGTATTACACGCCGCAGATCTGATACAAATCAGACCAA\n+CTGCATATCATACAACCTCCCCGCTACAACTAGTACGTAGCACTCAGCCAAGTATTTCCC\n+ATCTACGAAAATTCGGTTGCGCAGTATACGTACCGATATCACCACCGCAGCGTACATCCA\n+TGGGCCCCCACAGAAAACTAGGGATCTATGTGGGTTATAACTCTCCGTCAATAATAAAAT\n+ATCTTGAACCTCTTACAGGGGACCTGTTTACTGCCCGCTACGCTGATTCAATTTTTGATG\n+AGGACCATTTTCAGGCATTAGGGGGAGAATCAAACCACAAAGAATGCCAGGAAATAGATT\n+GGAATGTAACAGGCATTCAGTCCTTAGATCCACGTACTAAAGAATCTGAAACTGAAGTTC\n+AGAGGATCATAGATTTGCAACATATTGCAAATAATCTGCCAGATGCATTTACTGACCATA\n+AAGGTGTCACTAAATCACATATTCCCGCTGTTAATGCACCAGAACGAGTGGAGGTACCAA\n+CTAAAACCACTCAAACCACAAATGAGAGTAAGAGGGGGAGAAATCTGGTTAGTCGGAATA\n+TAGCTTCTCAAAAGCCTCCGCGGAAACAGAGGAAATCAAATCCTCTACCAGTAAATGCAA\n+TTCAACCTCAAGTTGAAGGACACCAACCAGATGCTCAACATCTTGAACCTAGCATAAATG\n+CGCATAAAAACATAATTGCTGGGACATCGGGACACCATGGTTCTATTGTTGTGGGAAATC\n+ACATAGAGTCTGAAGGTATAAAAGAAATTTCCATAAACTATACAGATTCAGGAGAATCAT\n+ATAATAGAGAGACTCCAATTGTCGACATATATTTCGCCTCTAAAATTGCTGAAACCCTTC\n+AAGTGGATCCAGAACCAAAGACCGTCAGGGAGTGCCTCAAGCGTCCTGATTGGCCTAAAT\n+GGAAGGAAGCAATTGAGGCAGAAGTGCGCTCGCTCAACAAAAGAGAGGTATTTTCCTCGG\n+TAATACCTACTCCTCATAATGTATTCCCTGTTGGAGCAAAATGGGTTTTTGTTCGAAAAA\n+GGAATGAAAACAATGAGGTGGTGAGATACAAAGCGAGGCTTGTAGCACAAGGGTTCACGC\n+AGAGGCCCGACATCGATTACGATGATACATACTCTCCTGTAATGAGTGGAATAACGTTTC\n+GATACTTAATATCTTTGGCAGTACAAATGAATTTATCTATGCAGTTGATGGATGTAGTGA\n+CAACATACTTATATGGGTCACTCAAATCGGACATATATATGAAAGTCCCTGAATGACTTA\n+AAATGTCGAATCCAAAAGAAA
ATCGCAACGCATATTGTGTAAAATTACAAAAGTCACTAT\n+ATGGCTTAAAACAATCGGGTAGAATGTGGTATAACCGATTGAGTGAGTTCCTTATTCAAA\n+AAGGCTACTCAAATAATGATGATTGCCCTTGTGTATTGATAAAGAAATCCTCAAATGGAT\n+TTTGCATCATCTCAGTGTACGTTGATGACCTCAATATCATGGGAAGTACACCTGATATCG\n+AAGAAGCACACAATCATCTAATGGCGAATTTGAGATGAAAGATTTGGGAAAGACCAAATT\n+CTGCTTAGGCTTACAGCTTGAGCATCTTCCCTCGGGAATTTTAGTATACCAACCTGCATA\n+TATTCAAAAGGTTTTGGAAAATTTTAATATGGATAAATCATATCCAACCAAAACACCCAT\n+GGTTGTCAGATCCCTTGATATGAATAAAGATCCTTTTAGACCTCGGGATGATGACGAAGA\n+GATATTAGGACCTGAGTTCCCGTATCTCAGTGCCATTGGTGCGTTAATATACCTTGCAAA\n+TTGCACCAGGCGTGATATTGCATTTACAGTGAATTTACTAGTTAGACATAGCGTTGCTTC\n+ATCGTAACGTCATTGGACGGGAGTAAATAATATCCTTAGATATTTACATGGCACAAAGGA\n+TCTTGGCTTATTCTATCAGATAAACCAAGATATGACTATGGTANGATATACTGATNGCTG\n+CTATCTATCTGATCCTCACAATGTCAGGTCACAAACAGGTTTCGTTTTCTTATATGGTGG\n+AACTGCTTTTTCATGGAAGTCAACAAAACAGACTCTCCTAGCAACCTCCACTAATCATTC\n+CTGAACTTGTTGCATTTTTTGAAGCATCTTAAGATTGTGTATGGCTTCGCAGGATGATTA\n+ACCCTATTCAAACTTCATGTGGTGTTGGTTCATTAGGATCACCAACTATTATATATGAAG\n+ATAATGCAGCCTCGCCATTGTCTCAAAATGCAAATGTGGTTTATGTTAGAAAGTAATATC\n+CCCACACCTATATTCTTCCTAAGGTTATTTTAATCCTCAGTGCATTACAGAAGGGATGGA\n+GAAATTTGATATTTTCCCAAATTAAATCATGTGCCAATTTAGCAGATTTGTTCCCCAAGT\n+TTTTTCCAAATTCAACGTTCCAGAAATCCATTCATGGAAATTGGTATAGAGATGATTCCC\n+GAGATTTGCAAAGTTCAGGGGGAGAAATCTCCCTGAAAATATACCCGTTTAATTATCATC\n+AGGTAATGAATATTGTACTCTTTTCCTTTATGAGTTTTTCCAACAGGGTTTCTCATATAA\n+GGTTTTTAACGAGACAATTAAATACAAGTATTGATGCATGCCATATCATATTTCTCCTTA\n+TATTTTTCCTACTAGGTTTTAAAGGAGTTTTTTATGGCACATCTCATTGCACTCTTTTCA\n+TTATGAGTTTTTTTGACATTTTCTCTCATAATGTTTTTAATGAAGCCATATCTTATCAAT\n+GATCATATATCATACTTTCTATTTTCCCTATCGGGGTTTTAAAGGAAGTACTCAAGACAT\n+ATATTGTTCTCTAAACTCAAAAATGAGTTTTATCCCTATATAAAGGTTTTCTCAAATGAG\n+TTATCATGAGGCAATAATCATTATATGTTGCACAATTTTTTCCTTATTATTTTTCCACTG\n+GGTTTAAAGGAGTTTTAGCAACATATCTACACTATTGTCCTTATATTTTTTCCACAGGGT\n+TTTTGGAGGAGACTTTAAAGATTATACAACGACTTTTCAAGATGAAGATGAGGAACATTC\n+TTAAAGAGAAAAATTTACAAGGATTATTATTTATCAAGATGATGCACATTTACACAGACA\n+AGCATGGATTAGGGAGAGTGTTAGGAATTAATTAAATGTATTAATTAATGGGATAATCCC\n+TGCGT
TGCCGGTTGCCTTGTTTTAGCAACCGTTCCTTGTAAACCGCCTCTGTAACAAGGG\n+TATAAATACCCACATCTTCAATCAATGAAAACACTGTTCCATCATTCTGTCACTTTTACT\n+ACTTTACACTCTA\n'
b
diff -r 000000000000 -r a5f1638b73be testing.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/testing.sh Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,42 @@
+#!/bin/bash
+
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+test_data="$DIR/test_data"
+classification_tbl='!!!! set up the path !!!!!'
+pdb='!!! set up the path'
+
+######## Protein Domains Finder
+## single_seq, for/rev strand of mapping
+$DIR/protein_domains_pd.py -q $test_data/GEPY_test_long_1 -pdb $pdb -cs $classification_tbl -dir $PWD/tmp/single_fasta/
+## multifasta
+$DIR/protein_domains_pd.py -q $test_data/vyber-Ty1_01.fasta -pdb $pdb -cs $classification_tbl -dir $PWD/tmp/multifasta/
+## multifasta_win
+$DIR/protein_domains_pd.py -q $test_data/vyber-Ty1_01.fasta -pdb $pdb -cs $classification_tbl -wd 3100 -od 1500 -dir $PWD/tmp/multifasta_win
+
+## testing if outputs are the same in case of using sliding window and not
+if diff -q $PWD/tmp/multifasta/output_domains.gff $PWD/tmp/multifasta_win/output_domains.gff > /dev/null; then
+ echo "Testing output of sliding window compared to no window accomplished successfully"
+else
+ echo "WARNING! There is a difference between the outputs with and without the sliding window"
+fi
+######## Protein Domains Filter
+## default params
+$DIR/domains_filtering.py -dom_gff $PWD/tmp/single_fasta/output_domains.gff 
+if [[ -e $PWD/tmp/single_fasta/domains_filtered.gff ]] && [[ -e $PWD/tmp/single_fasta/dom_prot_seq.txt ]] ; then
+ echo -e "Filtered file and protein seqs file for default parameters exist"
+else
+ echo -e "Filtered outputs for default parameters are missing"
+fi
+if [[ $(cat $PWD/tmp/single_fasta/domains_filtered.gff | wc -l) -gt 1 ]];then
+ echo "File was correctly filtered using default parameters"
+fi
+## Ty1-RT filtering
+$DIR/domains_filtering.py -dom_gff $PWD/tmp/multifasta/output_domains.gff -sd Ty1-RT
+if [[ -e $PWD/tmp/multifasta/domains_filtered.gff ]] && [[ -e $PWD/tmp/multifasta/dom_prot_seq.txt ]]; then
+ echo -e "Filtered file and protein seqs file of Ty1-RT domains exist"
+else
+ echo -e "Filtered outputs of Ty1-RT domains are missing"
+fi
+if [[ $(cat $PWD/tmp/multifasta/domains_filtered.gff | wc -l) -gt 1 ]];then
+ echo "File was correctly filtered for Ty1-RT domains"
+fi
b
diff -r 000000000000 -r a5f1638b73be tool-data/prepared_datasets.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool-data/prepared_datasets.txt Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,11 @@
+#ID Name READS CLS ANNOTATION Coverage Reference Ref_link
+PST_C Pisum sativum Cameor (2017) PST_C_reads_all PST_C_hitsort.cls PST_C_annotation 0.1038 None None
+PST_C_reduced Pisum sativum Cameor - REDUCED (2017) PST_C_reads_all_reduced PST_C_hitsort_reduced.cls PST_C_annotation 0.1038 None None
+PST Pisum sativum Terno (Macas et al. 2015) PST_reads_all PST_hitsort.cls PST_annotation 0.1038 Macas J., Novak P., Pellicer J., Cizkova J., Koblizkova A., Neumann P., Fukova I., Dolezel J., Kelly L., Leitch I. (2015) - In Depth Characterization of Repetitive DNA in 23 Plant Genomes Reveals Sources of Genome Size Variation in the Legume Tribe Fabeae PLoS ONE 10 (11): e0143424. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0143424
+CEUR Cuscuta europaea (2018) CEUR_reads_all CEUR_hitsort.cls CEUR_annotation 0.1837 None None 
+CEUR_reduced Cuscuta europaea - REDUCED (2018) CEUR_reads_all_reduced CEUR_hitsort_reduced.cls CEUR_annotation 0.1837 None None 
+GEPY Genlisea nigrocaulis (Vu et al. 2015) GEPY_reads_all GEPY_hitsort_PID90_LCOV55.cls GEPY_annotation 1 Vu, G.T.H., Schmutzer, T., Bull, F., Cao, H.X., Fuchs, J., Tran, T.D., Jovtchev, G., Pistrick, K., Stein, K., Pecinka, A., Neumann, P., Novak, P., Macas, J., Dear, P.H., Blattner, F.R., Scholz, U., Schubert, I. (2015) - Comparative genome analysis reveals divergent genome size evolution in a carnivorous plant genus. Plant Genome 8(3). https://dl.sciencesocieties.org/publications/tpg/abstracts/8/3/plantgenome2015.04.0021
+RHP Rhynchospora pubera (Marques et al. 2015) RHP_reads_all RHP_hitsort_PID90_LCOV55.cls RHP_annotation 0.5467 Marques, A., Ribeiro, T., Neumann, P., Macas, J., Novak, P., Schubert, V., Pellino, M., Fuchs, J., Ma, W., Kuhlmann, M., Brandt, R., Vanzela, A.L.L., Beseda, T., Simkova, H., Pedrosa-Harand, A., Houben, A. (2015) - Holocentromeres in Rhynchospora are associated with genome-wide centromere-specific repeat arrays interspersed amongst euchromatin. Proc. Natl. Acad. Sci. USA 112: 13633-13638. http://w3lamc.umbr.cas.cz/lamc/publ/Marques_PNAS_2015.pdf
+BVL Beta vulgaris (Kowar et al. 2016) BVL_reads_all BVL_hitsort_PID90_LCOV55.cls BVL_annotation 0.1817 Kowar, T., Zakrzewski, F., Macas, J., Koblizkova, A., Viehoever, P., Weisshaar, B., Schmidt, T. (2016) - Repeat composition of CenH3-chromatin and H3K9me2-marked heterochromatin in sugar beet (Beta vulgaris). BMC Plant Biol. 16: 120. http://bmcplantbiol.biomedcentral.com/articles/10.1186/s12870-016-0805-5
+
+
b
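The `prepared_datasets.txt` table above can be consumed with a few lines of Python. A minimal sketch (a hypothetical helper, not part of the repository; it assumes the columns are tab-separated, as is conventional for Galaxy `tool-data` tables, with field names taken from the header row):

```python
# Sketch: parse a Galaxy tool-data table such as prepared_datasets.txt.
# Assumes tab-separated columns matching the header row:
# ID, Name, READS, CLS, ANNOTATION, Coverage, Reference, Ref_link
def parse_datasets(lines):
    fields = ["ID", "Name", "READS", "CLS", "ANNOTATION",
              "Coverage", "Reference", "Ref_link"]
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip the header comment and blank lines
        values = line.split("\t")
        rows.append(dict(zip(fields, values)))
    return rows

example = [
    "#ID\tName\tREADS\tCLS\tANNOTATION\tCoverage\tReference\tRef_link",
    "PST_C\tPisum sativum Cameor (2017)\tPST_C_reads_all\t"
    "PST_C_hitsort.cls\tPST_C_annotation\t0.1038\tNone\tNone",
]
rows = parse_datasets(example)
print(rows[0]["ID"], rows[0]["Coverage"])
```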
diff -r 000000000000 -r a5f1638b73be tool-data/rexdb_versions.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool-data/rexdb_versions.txt Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,6 @@
+#name value; value is the base name of the files with classification and pdb
+Viridiplantae_version_3.0 Viridiplantae_v3.0
+Viridiplantae_version_2.2 Viridiplantae_v2.2
+Metazoa_version_3.1 Metazoa_v3.1
+Metazoa_version_3.0 Metazoa_v3.0
+
b
diff -r 000000000000 -r a5f1638b73be tool-data/select_domain.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool-data/select_domain.txt Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,14 @@
+All
+GAG
+INT
+PROT
+RH
+RT
+aRH
+CHDCR
+CHDII
+TPase
+YR
+HEL1
+HEL2
+ENDO
b
diff -r 000000000000 -r a5f1638b73be tool_config_profrep.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_config_profrep.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,18 @@
+<?xml version='1.0' encoding='utf-8'?>
+<toolbox monitor="true">
+  <section id="profrep" name="Genome Annotation">
+
+    <label id="dante_subsection" text="DANTE"/>
+    <tool file="profrep/dante.xml"/>
+    <tool file="profrep/dante_gff_output_filtering.xml"/>
+    <tool file="profrep/dante_gff_to_dna.xml"/>
+
+    <label id="profrep_subsection" text="Profrep"/>
+    <tool file="profrep/profrep.xml"/>
+    <tool file="profrep/profrep_refine.xml"/>
+    <tool file="profrep/profrep_db_reducing.xml"/>
+    <tool file="profrep/extract_data_for_profrep.xml"/>
+    <tool file="profrep/gff_select_region.xml"/>
+    <tool file="profrep/profrep_masking.xml"/>
+  </section>
+</toolbox>
b
diff -r 000000000000 -r a5f1638b73be tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,14 @@
+<?xml version="1.0"?>
+<tool_dependency>
+    <package name="profrep_databases" version="1.0">
+        <install version="1.0">
+            <actions>
+              <action type="download_by_url">http://repeatexplorer.org/repeatexplorer/wp-content/uploads/2018/10/Viridiplantae_v3.0.zip</action>
+            </actions>
+        </install>
+        <readme>
+            Profrep databases
+        </readme>
+    </package>
+</tool_dependency>
+
b
diff -r 000000000000 -r a5f1638b73be tool_dependencies.xml.delete
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml.delete Wed Jun 26 08:01:42 2019 -0400
b
@@ -0,0 +1,5 @@
+<tool_dependency>
+    <package name="jbrowse" version="1.11.6">
+        <repository changeset_revision="6cc678412457" name="package_jbrowse_1_11_6" owner="iuc" prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu" />
+    </package>
+</tool_dependency>
b
diff -r 000000000000 -r a5f1638b73be visualization.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/visualization.py Wed Jun 26 08:01:42 2019 -0400
[
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+""" visualization module """
+
+import numpy as np
+import configuration
+import matplotlib.pyplot as plt
+import matplotlib.lines as mlines
+
+
+def vis_profrep(seq_ids_all, files_dict, seq_lengths_all, CN, HTML_DATA,
+                seqs_all_part):
+    ''' visualization of repetitive profiles'''
+    graphs_dict = {}
+    seq_id_repeats = []
+    th_length = configuration.SEQ_LEN_VIZ
+    exclude = set(['ALL'])
+    sorted_keys = sorted(set(files_dict.keys()).difference(exclude))
+    sorted_keys.insert(0, "ALL")
+    plot_num = 0
+    seqs_long = []
+    seqs_count = 1
+    seqs_max_limit = []
+    for repeat in sorted_keys:
+        with open(files_dict[repeat][0], "r") as repeat_f:
+            positions_all = []
+            hits_all = []
+            include = True
+            first_line = repeat_f.readline()
+            seq_id_repeat = first_line.rstrip().split("chrom=")[1]
+            seq_len_repeat = seq_lengths_all[seq_ids_all.index(seq_id_repeat)]
+            if seq_id_repeat not in graphs_dict.keys():
+                if seq_len_repeat > th_length:
+                    if seq_id_repeat not in seqs_long:
+                        seqs_long.append(seq_id_repeat)
+                    include = False
+                else:
+                    [fig, ax] = plot_figure(seq_id_repeat, seq_len_repeat, CN)
+                    graphs_dict[seq_id_repeat] = [fig, ax]
+            seq_id_repeats.append(seq_id_repeat)
+            for line in repeat_f:
+                if "chrom" in line:
+                    seqs_count += 1
+                    if include:
+                        graphs_dict = plot_profile(
+                            graphs_dict, seq_id_repeats[-1], positions_all,
+                            hits_all, repeat, plot_num)
+                        positions_all = []
+                        hits_all = []
+                    seq_id_repeat = line.rstrip().split("chrom=")[1]
+                    seq_len_repeat = seq_lengths_all[seq_ids_all.index(
+                        seq_id_repeat)]
+                    if seq_id_repeat not in graphs_dict.keys():
+                        if seq_len_repeat > th_length:
+                            if seq_id_repeat not in seqs_long:
+                                seqs_long.append(seq_id_repeat)
+                            include = False
+                        else:
+                            [fig, ax] = plot_figure(seq_id_repeat,
+                                                    seq_len_repeat, CN)
+                            graphs_dict[seq_id_repeat] = [fig, ax]
+                    seq_id_repeats.append(seq_id_repeat)
+                    if seq_id_repeat not in seqs_all_part:
+                        break
+                else:
+                    if include:
+                        positions_all.append(int(line.rstrip().split("\t")[0]))
+                        hits_all.append(float(line.rstrip().split("\t")[1]))
+        if include:
+            graphs_dict = plot_profile(graphs_dict, seq_id_repeats[-1],
+                                       positions_all, hits_all, repeat,
+                                       plot_num)
+            seq_id_repeats.append(seq_id_repeat)
+            positions_all = []
+            hits_all = []
+        plot_num += 1
+    return graphs_dict, seqs_long
+
+
+def plot_figure(seq_id, seq_length, CN):
+    fig = plt.figure(figsize=(18, 8))
+    ax = fig.add_subplot(111)
+    ax.set_xlabel('sequence bp')
+    if CN:
+        ax.set_ylabel('copy numbers')
+    else:
+        ax.set_ylabel('hits')
+    ax.set_title(seq_id)
+    plt.xlim([0, seq_length])
+    return fig, ax
+
+
+def plot_profile(graphs_dict, seq_id_repeat, positions_all, hits_all, repeat,
+                 plot_num):
+    if "|" in repeat:
+        graphs_dict[seq_id_repeat][1].plot(
+            positions_all,
+            hits_all,
+            label="|".join(repeat.split("|")[-2:]),
+            color=configuration.COLORS_HEX[plot_num])
+    else:
+        graphs_dict[seq_id_repeat][1].plot(
+            positions_all,
+            hits_all,
+            label=repeat,
+            color=configuration.COLORS_HEX[plot_num])
+    return graphs_dict
+
+
+def vis_domains(fig, ax, seq_id, xminimal, xmaximal, domains):
+    ''' visualization of protein domains'''
+    y_upper_lim = ax.get_ylim()[1]
+    dom_uniq = list(set(domains))
+    colors = [configuration.COLORS_HEX[dom_uniq.index(domain)]
+              for domain in domains]
+    colors_dom = [
+        list(reversed(configuration.COLORS_HEX))[dom_uniq.index(domain)]
+        for domain in domains
+    ]
+    colors_legend = list(reversed(configuration.COLORS_HEX))[0:len(dom_uniq)]
+    ax.hlines([y_upper_lim + y_upper_lim / 10] * len(xminimal),
+              xminimal,
+              xmaximal,
+              color=colors_dom,
+              lw=2,
+              label=dom_uniq)
+    lines_legend = []
+    ax2 = ax.twinx()  # add second axis for domains
+    for count_uniq in list(range(len(dom_uniq))):
+        lines_legend.append(mlines.Line2D([], [],
+                                          color=colors_legend[count_uniq],
+                                          markersize=15,
+                                          label=dom_uniq[count_uniq]))
+    ax2.legend(lines_legend, [line.get_label() for line in lines_legend],
+               bbox_to_anchor=(1.05, 1),
+               loc='upper left',
+               borderaxespad=0.)
+    ax2.yaxis.set_visible(False)
+    return fig, ax
+
+
+def main():
+    pass
+
+
+if __name__ == "__main__":
+    main()
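For reference, the profile files that `vis_profrep` iterates over consist of header lines carrying `chrom=<seq_id>` followed by tab-separated position/hits pairs. The parsing step can be sketched without the matplotlib plumbing (a standalone illustration with a hypothetical `parse_profile` helper, not a function from this module):

```python
# Sketch: parse the wig-like profile format consumed by vis_profrep --
# header lines containing "chrom=<seq_id>" followed by tab-separated
# position/hits data lines.
def parse_profile(lines):
    profiles = {}  # seq_id -> (positions, hits)
    seq_id = None
    for line in lines:
        line = line.rstrip()
        if "chrom=" in line:
            # start of a new sequence block, as in vis_profrep
            seq_id = line.split("chrom=")[1]
            profiles[seq_id] = ([], [])
        elif seq_id is not None and line:
            position, hits = line.split("\t")[:2]
            profiles[seq_id][0].append(int(position))
            profiles[seq_id][1].append(float(hits))
    return profiles

example = [
    "fixedStep chrom=seq1",
    "10\t2",
    "20\t5",
    "fixedStep chrom=seq2",
    "10\t1",
]
profiles = parse_profile(example)
print(profiles["seq1"])  # → ([10, 20], [2.0, 5.0])
```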