Repository 're_utils'
hg clone https://toolshed.g2.bx.psu.edu/repos/petr-novak/re_utils

Changeset 0:a4cd8608ef6b (2019-04-01)
Next changeset 1:2e811f988e1d (2019-06-12)
Commit message:
Uploaded
added:
Galaxy_integration.org
README.html
README.md
RM_custom_search.py
RM_custom_search.py.bak
RM_custom_search.xml
RM_html_report.R
__init__.py
contributors.txt
fasta_affixer.py
fasta_affixer.xml
fasta_interlacer.py
fasta_manual_input.xml
paired_fastq_filtering.R
paired_fastq_filtering.xml
paired_fastq_filtering_wrapper.sh
parallel.py
rmsk_summary_table_multiple.r
single_fastq_filtering.R
single_fastq_filtering.xml
single_fastq_filtering_wrapper.sh
test_data/ERR215189_1_part.fastq.gz
test_data/ERR215189_2_part.fastq.gz
test_data/prefix_suffix.fasta
test_data/seqs.fasta
test_data/single_output.fasta
test_data/single_output.png
test_run1.sh
test_run2.sh
test_run_affixer.sh
tool_data/organele_ref_and_phi-X174.fasta
diff -r 000000000000 -r a4cd8608ef6b Galaxy_integration.org
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/Galaxy_integration.org Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,11 @@
+
+#+BEGIN_SRC sh
+/home/petr/anaconda3/bin/planemo shed_init --name=repeatexplorer_utilities \
+                    --owner=repeatexplorer \
+                    --description="some utilities for data preprocessing" \
+                    --long_description="some utilities for data preprocessing" \
+                    --category="Fasta Manipulation"
+#+END_SRC
+# this creates the file .shed.yml
+
+
diff -r 000000000000 -r a4cd8608ef6b README.html
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.html Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,70 @@
+<?xml version="1.0" encoding="UTF-8" ?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+
+<html xmlns="http://www.w3.org/1999/xhtml">
+
+<head>
+<title>README.html</title>
+
+</head>
+
+<body>
+
+<h1>RepeatExplorer utilities</h1>
+<p>This repository includes utilities for preprocessing NGS data into a format suitable for RepeatExplorer and TAREAN
+analysis. Each tool also includes an XML file defining its interface for the Galaxy environment.</p>
+<h2>Available tools</h2>
+<h3>Paired fastq reads filtering and interlacing</h3>
+<p>tool definition file: <code>paired_fastq_filtering.xml</code></p>
+<p>This tool performs memory-efficient preprocessing of two fastq files. Its output can be used as input for RepeatExplorer clustering. Input files can be GNU-zipped archives (.gz extension). Reads are filtered based on quality, presence of N bases and adapters. The two input fastq files are processed in parallel and only complete pairs are kept. Because the input files are processed in chunks, the paired reads must be complete and in the same order in both input files. All reads which pass the quality filter are written to the output files. If sampling is specified, only a sample of the sequences is returned. Cutadapt is run with these options:</p>
+<p><code>--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
+--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
+--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
+--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
+--anywhere='CAAGCAGAAGACGGCATACGAGAT'
+--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
+--error-rate=0.05
+--times=1 --overlap=15 --discard</code></p>
+<p>Order of fastq file processing:</p>
+<ol>
+<li>Trimming (optional)</li>
+<li>Filter by quality</li>
+<li>Discard single reads, keep complete pairs</li>
+<li>Cutadapt filtering</li>
+<li>Discard single reads, keep complete pairs</li>
+<li>Sampling (optional)</li>
+<li>Interlacing two fasta files</li>
+</ol>
+<h3>single fastq reads filtering</h3>
+<p>tool definition file: <code>single_fastq_filtering.xml</code></p>
+<p>This tool performs preprocessing
+of a single fastq file. Input files can be GNU-zipped archives (.gz extension). Reads
+are filtered based on quality, presence of N bases and adapters. All reads
+which pass the quality filter are written to the output files. If sampling is
+specified, only a sample of the sequences is returned.</p>
+<h3>fasta affixer</h3>
+<p>tool definition file: <code>fasta_affixer.xml</code></p>
+<p>Tool for appending a prefix and a suffix to sequence names in fasta-formatted files. This tool is useful
+if you want to do comparative analysis with RepeatExplorer and need to
+append sample codes to sequence identifiers.</p>
+<h2>Dependencies</h2>
+<p>R programming environment with installed packages <em>optparse</em> and <em>ShortRead</em> (Bioconductor)
+python3
+cutadapt</p>
+<h2>License</h2>
+<p>Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
+Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/)
+Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic</p>
+<p>This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.</p>
+<p>This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+You should have received a copy of the GNU General Public License
+along with this program.  If not, see <a href="http://www.gnu.org/licenses/">http://www.gnu.org/licenses/</a>.</p>
+</body>
+</html>
diff -r 000000000000 -r a4cd8608ef6b README.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,74 @@
+# RepeatExplorer utilities #
+
+This repository includes utilities for preprocessing NGS data into a format suitable for RepeatExplorer and TAREAN
+analysis. Each tool also includes an XML file defining its interface for the Galaxy environment.
+
+## Available tools ##
+
+### Paired fastq reads filtering and interlacing ###
+tool definition file: `paired_fastq_filtering.xml`
+
+This tool performs memory-efficient preprocessing of two fastq files. Its output can be used as input for RepeatExplorer clustering. Input files can be GNU-zipped archives (.gz extension). Reads are filtered based on quality, presence of N bases and adapters. The two input fastq files are processed in parallel and only complete pairs are kept. Because the input files are processed in chunks, the paired reads must be complete and in the same order in both input files. All reads which pass the quality filter are written to the output files. If sampling is specified, only a sample of the sequences is returned. Cutadapt is run with these options:
+
+```
+--anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
+--anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
+--anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
+--anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
+--anywhere='CAAGCAGAAGACGGCATACGAGAT'
+--anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
+--error-rate=0.05
+--times=1 --overlap=15 --discard
+```
+
+Order of fastq file processing (a minimal sketch of the pair-keeping step follows this list):
+
+1. Trimming (optional)
+2. Filter by quality
+3. Discard single reads, keep complete pairs
+4. Cutadapt filtering
+5. Discard single reads, keep complete pairs
+6. Sampling (optional)
+7. Interlacing two fasta files
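+
+A minimal Python sketch (an illustration only, not code shipped in this repository) of the pair-keeping idea in steps 3 and 5, assuming read IDs in the two files are identical apart from their last, pair-distinguishing character:
+
+```python
+def keep_complete_pairs(ids_a, ids_b):
+    """Return positions of reads whose mate is present in the other chunk."""
+    common = {i[:-1] for i in ids_a} & {i[:-1] for i in ids_b}
+    keep_a = [k for k, read_id in enumerate(ids_a) if read_id[:-1] in common]
+    keep_b = [k for k, read_id in enumerate(ids_b) if read_id[:-1] in common]
+    return keep_a, keep_b
+```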
+
+
+### single fastq reads filtering ###
+tool definition file: `single_fastq_filtering.xml`
+
+This tool performs preprocessing
+of a single fastq file. Input files can be GNU-zipped archives (.gz extension). Reads
+are filtered based on quality, presence of N bases and adapters. All reads
+which pass the quality filter are written to the output files. If sampling is
+specified, only a sample of the sequences is returned.
+
+### fasta affixer ###
+tool definition file: `fasta_affixer.xml`
+
+Tool for appending a prefix and a suffix to sequence names in fasta-formatted files. This tool is useful
+if you want to do comparative analysis with RepeatExplorer and need to
+append sample codes to sequence identifiers.
+
+
+## Dependencies ##
+
+- R programming environment with the *optparse* and *ShortRead* (Bioconductor) packages installed
+- python3
+- cutadapt
+
+## License ##
+
+Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
+Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/)
+Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+You should have received a copy of the GNU General Public License
+along with this program.  If not, see <http://www.gnu.org/licenses/>.
diff -r 000000000000 -r a4cd8608ef6b RM_custom_search.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/RM_custom_search.py Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+''' RepeatMasker search against custom database
+input:
+- archive with sequencing data
+- custom repeat database
+'''
+import zipfile
+import shutil
+import os
+import subprocess
+from parallel import parallel2
+import glob
+
+print(os.environ)
+def extract_sequences(f):
+    # check archive:
+    try:
+        z=zipfile.ZipFile(f)
+        # extract only dirCLXXXX/reads.fas
+        seq_list = []
+        for filein in z.namelist():
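+            # keep only seqclust/clustering/clusters/dir_CL*/reads.fas entries from the archive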
+            if filein.lower().startswith("seqclust/clustering/clusters/dir_cl") and filein.endswith("reads.fas"):
+                outdir = filein.split("/")[3]
+                outfile = outdir +"/reads.fas"
+                source = z.open(filein)
+                os.mkdir(outdir)
+                target = open(outfile, "wb")
+                shutil.copyfileobj(source, target)
+                seq_list.append(outfile)
+        if  len(seq_list) == 0:
+            raise ValueError()
+                
+    except zipfile.BadZipfile as e:
+        print("unable to extract sequences from archive!")
+        raise e
+    
+    except IOError as e:
+        print("unable to extract sequences from archive!")
+        raise e
+
+    except ValueError as e:
+        print("No sequences found in archive!")
+        raise e
+    
+    seq_list.sort()
+    return seq_list
+
+    
+def RepeatMasker(reads,database):
+    args = ["RepeatMasker", reads, "-q", "-lib", database, "-pa", "1" , "-nolow", "-dir", os.path.dirname(reads)]
+    print(args)
+    status=subprocess.call(args , stderr = open(os.devnull, 'wb'))
+    return status
+
+def summarizeRepeatMaskerOutput(htmlout = "summary.html"):
+    cmd = os.path.dirname(os.path.abspath(__file__))+"/rmsk_summary_table_multiple.r"
+    args = [ cmd, "-f", "dir_CL*/reads.fas", "-r", "dir_CL*/reads.fas.out", "-o", "RM-custom_output_table"  ]
+    status=subprocess.call(args)
+    cmd = os.path.dirname(os.path.abspath(__file__))+"/RM_html_report.R"
+    args = [cmd, htmlout]
+    status=subprocess.call(args)
+    return status
+    
+
+def main():
+    from optparse import OptionParser
+    
+    parser = OptionParser()
+    parser.add_option("-i", "--input_file", dest="input_file", help="seqclust zip archive")
+    parser.add_option("-d", "--database", dest="database", help="custom repeatmasker database")
+    parser.add_option("-g", "--galaxy_dir", dest="galaxy_dir", help="Galaxy home directory")
+    parser.add_option("-r", "--report", dest="report", help="output html file with report summary",default='report.html')
+    
+    options, args = parser.parse_args()
+    config_file = os.path.dirname(os.path.abspath(__file__))+"/seqclust.config"
+    
+    
+    seq_files = extract_sequences(options.input_file)  ### REMOVE - TESTING
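+    # run RepeatMasker on each cluster's reads.fas concurrently; parallel2 (parallel.py)
+    # recycles the single-element database list across all jobs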
+    parallel2(RepeatMasker, seq_files, [options.database])
+    
+    status = summarizeRepeatMaskerOutput(options.report)
+    
+     
+        
+if __name__== "__main__":
+    main()
+  
+    
diff -r 000000000000 -r a4cd8608ef6b RM_custom_search.py.bak
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/RM_custom_search.py.bak Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,96 @@
+#!/usr/bin/env python
+''' RepeatMasker search against custom database
+input:
+- archive with sequencing data
+- custom repeat database
+'''
+import zipfile
+import shutil
+import os
+import subprocess
+import parallel
+import glob
+
+def extract_sequences(f):
+    # check archive:
+    try:
+        z=zipfile.ZipFile(f)
+        # extract only dirCLXXXX/reads.fas
+        seq_list = []
+        for filein in z.namelist():
+            if filein.lower().startswith("seqclust/clustering/clusters/dir_cl") and filein.endswith("reads.fas"):
+                outdir = filein.split("/")[3]
+                outfile = outdir +"/reads.fas"
+                source = z.open(filein)
+                os.mkdir(outdir)
+                target = file(outfile, "wb")
+                shutil.copyfileobj(source, target)
+                seq_list.append(outfile)
+        if  len(seq_list) == 0:
+            raise ValueError()
+                
+    except zipfile.BadZipfile as e:
+        print "unable to extract sequences from archive!"
+        raise e
+    
+    except IOError as e:
+        print "unable to extract sequences from archive!"
+        raise e
+
+    except ValueError as e:
+        print "No sequences found in archive!"
+        raise e
+    
+    seq_list.sort()
+    return seq_list
+
+def get_RM_dir(config_file,galaxy_dir):
+    shutil.copy(config_file,"seqclust.config")
+    f = open("seqclust.config",'a')
+    f.write("\necho $REPEAT_MASKER")
+    f.close()
+    args = ["bash", "seqclust.config"]
+    p = subprocess.Popen(args,stdout = subprocess.PIPE)
+    RMdir = "{0}{1}".format(galaxy_dir,p.stdout.readline().strip())
+    return RMdir
+    
+def RepeatMasker(RM,reads,database):
+    args = [RM, reads, "-q", "-lib", database, "-pa", "1" , "-nolow", "-dir", os.path.dirname(reads)]
+    status=subprocess.call(args , stderr = open(os.devnull, 'wb'))
+    return status
+
+def summarizeRepeatMaskerOutput(htmlout = "summary.html"):
+    cmd = os.path.dirname(os.path.abspath(__file__))+"/rmsk_summary_table_multiple.r"
+    args = [ cmd, "-f", "dir_CL*/reads.fas", "-r", "dir_CL*/reads.fas.out", "-o", "RM-custom_output_table"  ]
+    status=subprocess.call(args)
+    cmd = cmd = os.path.dirname(os.path.abspath(__file__))+"/RM_html_report.R"
+    args = [cmd, htmlout]
+    status=subprocess.call(args)
+    return status
+    
+
+def main():
+    from optparse import OptionParser
+    
+    parser = OptionParser()
+    parser.add_option("-i", "--input_file", dest="input_file", help="seqclust zip archive")
+    parser.add_option("-d", "--database", dest="database", help="custom repeatmasker database")
+    parser.add_option("-g", "--galaxy_dir", dest="galaxy_dir", help="Galaxy home directory")
+    parser.add_option("-r", "--report", dest="report", help="output html file with report summary",default='report.html')
+    
+    options, args = parser.parse_args()
+    config_file = os.path.dirname(os.path.abspath(__file__))+"/seqclust.config"
+    
+    
+    seq_files = extract_sequences(options.input_file)  ### REMOVE - TESTING
+    RMdir = get_RM_dir(config_file, options.galaxy_dir)
+    parallel.parallel(RepeatMasker, [RMdir+"/RepeatMasker"], seq_files, [options.database])
+    
+    status = summarizeRepeatMaskerOutput(options.report)
+    
+     
+        
+if __name__== "__main__":
+    main()
+  
+    
diff -r 000000000000 -r a4cd8608ef6b RM_custom_search.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/RM_custom_search.xml Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,28 @@
+<tool id="RMsearch" name="RepeatMasker custom search" version="1.0.1">
+
+  <description>Scan clustering results using RepeatMasker against custom database of repeats</description>
+  <requirements>
+    <requirement type="package" version="4.0.7" >repeatmasker</requirement> 
+  </requirements>
+
+  <command interpreter="python3">
+    RM_custom_search.py -i $input_zip -d $RMdatabase -g $__root_dir__ -r $output_html
+  </command>
+
+  <inputs>
+    <param format="zip" type="data" name="input_zip" label="Input clustering data in as zip archive" help="zip archive obtained from previouse Graph-based sequence clustering"/>
+    <param name="RMdatabase" format="fasta" type="data" label="Library of repeats" help="Library of repeats as DNA sequences in fasta format. The recommended format for IDs in a custom library is : '>reapeatname#class/subclass'"/>
+  </inputs>
+  
+  <outputs>
+    <data format="html" name="output_html" label="HTML summary of custom database ${RMdatabase.hid} search on dataset ${input_zip.hid} " />
+  </outputs>
+
+  <help>
+    **What it does**
+
+    Use this tool to scan a previous clustering result against a custom database of repeats using RepeatMasker.
+    
+  </help>
+</tool>
+
diff -r 000000000000 -r a4cd8608ef6b RM_html_report.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/RM_html_report.R Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,169 @@
+#!/usr/bin/env Rscript
+### this script is expected to run from clustering directory! ######
+
+## assume RM-custom_output_tablesummary.csv file in active directory
+
+suppressPackageStartupMessages(library(R2HTML))
+######################################################################################
+htmlheader="
+ <html xmlns:mml=\"http://www.w3.org/1998/Math/MathML\">
+ <head>
+ <title> Clustering summary </title>
+ <style>
+ <!--
+ table { background:#FFFFFF;
+ border:1px solid gray;
+ border-collapse:collapse;
+ color:#fff;
+ font:normal 10px verdana, arial, helvetica, sans-serif;
+ }
+ caption { border:1px solid #5C443A;
+ color:#5C443A;
+ font-weight:bold;
+ font-size:20pt
+ padding:6px 4px 8px 0px;
+ text-align:center;
+
+ }
+ td, th { color:#363636;
+ padding:.4em;
+ }
+ tr { border:1px dotted gray;
+ }
+ thead th, tfoot th { background:#5C443A;
+ color:#FFFFFF;
+ padding:3px 10px 3px 10px;
+ text-align:left;
+ text-transform:uppercase;
+ }
+ tbody td a { color:#3636FF;
+ text-decoration:underline;
+ }
+ tbody td a:visited { color:gray;
+ text-decoration:line-through;
+ }
+ tbody td a:hover { text-decoration:underline;
+ }
+ tbody th a { color:#3636FF;
+ font-weight:normal;
+ text-decoration:none;
+ }
+ tbody th a:hover { color:#363636;
+ }
+ tbody td+td+td+td a { background-image:url('bullet_blue.png');
+ background-position:left center;
+ background-repeat:no-repeat;
+ color:#FFFFFF;
+ padding-left:15px;
+ }
+ tbody td+td+td+td a:visited { background-image:url('bullet_white.png');
+ background-position:left center;
+ background-repeat:no-repeat;
+ }
+ tbody th, tbody td { text-align:left;
+ vertical-align:top;
+ }
+ tfoot td { background:#5C443A;
+ color:#FFFFFF;
+ padding-top:3px;
+ }
+ .odd { background:#fff;
+ }
+ tbody tr:hover { background:#EEEEEE;
+ border:1px solid #03476F;
+ color:#000000;
+ }
+ -->
+ </style>
+
+ </head>
+
+ "
+######################################################################################
+######################################################################################
+
+
+
+#basic statistics:
+# Number of reads used for clustering
+
+RM=read.table("RM-custom_output_tablesummary.csv",sep="\t",header=TRUE,as.is=TRUE,check.names=FALSE)
+
+#Any hits to RM database?
+N=NA
+
+# convert to legible format:
+RM2=data.frame(
+ 'total length [bp]'=RM$All_Reads_Length[c(T,F,F)],
+ 'number of reads'=RM$All_Reads_Number[c(T,F,F)],
+ check.names=FALSE,stringsAsFactors=FALSE
+)
+
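+# the summary table stores three rows per cluster; the recycled logical indices
+# c(T,F,F) and c(F,T,F) select the hit counts and the hit percentages, respectively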
+RMpart1=RM[c(T,F,F),-c(1:3),drop=FALSE] #counts
+RMpart2=RM[c(F,T,F),-c(1:3),drop=FALSE] #percent
+
+RMjoined=list()
+
+for (i in colnames(RMpart1)){
+ RMjoined[[i]]=paste(RMpart1[,i],"hits, ",signif(RMpart2[,i],3),"%",sep='')
+}
+
+
+
+if (ncol(RM)>3){  # not empty
+ RM2=cbind(cluster=paste("CL",1:nrow(RM2),sep=''),
+ RM2,
+ "Genome proportion[%]"=signif(RM2$'number of reads'/N*100,3),
+ "cumulative GP [%]"=signif(cumsum(RM2$'number of reads'/N*100),3),
+ as.data.frame(RMjoined,stringsAsFactors=FALSE))
+
+ ##### RM2 formatting for html output: #####
+ ##########################################
+ bold=RMpart2>3
+ for (i in 6:ncol(RM2)){
+ rmcol=RM2[,i]
+ RM2[,i]=ifelse(bold[,i-5],paste("<b>",rmcol,"</b>",sep=''),rmcol)
+ }
+
+ # join hits to one  column
+ RMstring=character(nrow(RM2))
+ for (i in 1:nrow(RM2)){
+ x=ifelse(RMpart2[i,]>0,paste(colnames(RM2[,-(1:5),drop=FALSE])," (",RM2[i,-(1:5),drop=FALSE],")",sep=''),"")
+ # reorder based on GR
+ x=x[order(RMpart2[i,],decreasing=TRUE)]
+
+ RMstring[i]=paste(x[x!=''],collapse="<br />")
+ if (nchar(RMstring[i])>240){
+ RMstring[i]=paste(substring(RMstring[i],1,220),"......",sep='')
+ }
+
+ }
+}else{  # no RM hits
+ RM2=cbind(cluster=paste("CL",1:nrow(RM2),sep=''),
+ RM2,
+ "Genome proportion[%]"=signif(RM2$'number of reads'/N*100,3),
+ "cumulative GP [%]"=signif(cumsum(RM2$'number of reads'/N*100),3))
+ RMstring=rep("",nrow(RM)/3)
+}
+
+
+# RM2 add link to subpage
+
+
+RM2=data.frame(RM2[,1:3],'Repeat Masker'=RMstring,check.names=FALSE)
+
+
+##################################################################################################
+####################                              HTML output                                #####
+##################################################################################################
+
+
+htmlout=commandArgs(T)[1]  # full absolute path
+
+cat(htmlheader,file=htmlout)
+
+HTML.title("RepeatMasker search against custom database",file=htmlout,HR=1)
+
+HTML(RM2,file=htmlout,align='left',caption="",captionalign='')
+HTMLEndFile(htmlout)
+
diff -r 000000000000 -r a4cd8608ef6b contributors.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/contributors.txt Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,1 @@
+Petr Novak
diff -r 000000000000 -r a4cd8608ef6b fasta_affixer.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fasta_affixer.py Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+''' fasta affixer - adding prefixes and suffixes to fasta sequence names'''
+import argparse
+
+parser = argparse.ArgumentParser()
+parser.add_argument("-f", "--fasta", type=str, help="fasta file")
+parser.add_argument("-o", "--output", type=str, help="output fasta file")
+parser.add_argument(
+    "-p", "--prefix",
+    type=str, help="prefix to be added to names")
+parser.add_argument(
+    "-s", "--suffix",
+    type=str, help="suffix to be added",
+    default='')
+parser.add_argument("-n",
+                    "--nspace",
+                    type=int,
+                    help="number of spaces to ignore",
+                    default='0')
+
+args = parser.parse_args()
+
+with open(args.fasta, "r") as f, open(args.output, "w") as out:
+    for oneline in f:
+        if oneline == "":
+            continue
+        if not oneline:
+            break
+        if oneline[0] == ">":
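+            # keep the first (1 + nspace) whitespace-separated fields of the header,
+            # then insert the prefix right after '>' and append the suffix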
+            header = " ".join(oneline.split()[:1 + args.nspace])
+            header_out = header[0] + args.prefix + header[1:] + args.suffix + "\n"
+            out.write(header_out)
+        else:
+            out.write(oneline)
+
diff -r 000000000000 -r a4cd8608ef6b fasta_affixer.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fasta_affixer.xml Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,81 @@
+<tool id="fasta_affixer" name="FASTA read name affixer" version="1.0.0">
+<description> Tool appending a suffix and a prefix to sequence names </description>
+<command interpreter="python3">
+fasta_affixer.py -f $input -p "$prefix" -s "$suffix" -n $nspace -o $output
+</command>
+
+ <inputs>
+  <param format="fasta" type="data" name="input" label="Choose your fasta file" />
+  <param name="prefix" type="text" size="10" value="" label="Prefix" help="Enter prefix which will be added to all sequences names" />
+  <param name="suffix" type="text" size="10" value="" label="Suffix" help="Enter suffix which will be added to all sequences names"/>
+  <param name="nspace" type="integer" size="10" value="0" min="0" max="1000" label="Number of spaces in name to ignore" help="Sequence name is a string before the first space. If you want name to include spaces in name, enter positive integer. All other characters beyond ignored spaces are omitted"/>
+ </inputs>
+
+
+ <outputs>
+  <data format="fasta" name="output" label="fasta dataset ${input.hid} with modified sequence names" />
+ </outputs>
+
+ <tests>
+   <test>
+     <param name="input" value="single_output.fasta" />
+     <param name="prefix" value="TEST" />
+     <param name="suffux" value="OK"/>
+     <param name="nspace" value="0" />
+     <output name="output" value="prefix_suffix.fasta" />
+   </test>
+ </tests>
+ <help>
+**What it does**
+
+Tool for appending a prefix and a suffix to sequence names in fasta-formatted files. This tool is useful
+if you want to do comparative analysis with RepeatExplorer and need to
+append sample codes to sequence identifiers.
+
+**Example**
+The following fasta file:
+
+::
+
+  >123454
+  acgtactgactagccatgacg
+  >234235
+  acgtactgactagccatgacg
+
+is renamed to:
+
+::
+
+  >prefix123454suffix
+  acgtactgactagccatgacg
+  >prefix234235suffix
+  acgtactgactagccatgacg
+
+
+By default, anything after a space is
+excluded from the sequence name. For example, for the sequence:
+
+::
+  
+ >SRR352150.23846180 HWUSI-EAS1786:7:119:15910:19280/1
+ CTGGATTCTATACCTTTGGCAACTACTTCTTGGTTGATCAGGAAATTAACACTAGTAGTTTAGGCAATTTGGAATGGTGCCAAAGATGTATAGAACTTTC
+ IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIHIIIIIFIIIIIIHDHBBIHFIHIIBHHDDHIFHIHIIIHIHGGDFDEI@EGEGFGFEFB@ECG
+
+when **Number of spaces in name to ignore** is set to 0 (default), the output will be:
+
+::
+
+ >prefixSRR352150.23846180suffix
+ CTGGATTCTATACCTTTGGCAACTACTTCTTGGTTGATCAGGAAATTAACACTAGTAGTTTAGGCAATTTGGAATGGTGCCAAAGATGTATAGAACTTTC
+
+
+If you want to keep spaces, setting **Number of spaces in name to ignore** to 1 will yield
+
+::
+
+ >prefixSRR352150.23846180 HWUSI-EAS1786:7:119:15910:19280/1suffix
+ CTGGATTCTATACCTTTGGCAACTACTTCTTGGTTGATCAGGAAATTAACACTAGTAGTTTAGGCAATTTGGAATGGTGCCAAAGATGTATAGAACTTTC
+
+
+</help>
+</tool>
diff -r 000000000000 -r a4cd8608ef6b fasta_interlacer.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fasta_interlacer.py Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,200 @@
+#!/usr/bin/env python
+''' interlacing two fasta files'''
+import sys
+
+def readSingleSeq(file):
+    ''' read single seq from fasta file'''
+    line = file.readline()
+    if not line:
+        return False  # end of file
+    if line[0] != ">":
+        raise Exception("no header on the first line")
+    seqname = line[1:].strip()
+    seq = ""
+    # read sequences
+    while True:
+        last_pos = file.tell()
+        line = file.readline()
+        if not line:
+            break
+        if line[0] == ">":
+            file.seek(last_pos)
+            break
+        seq = seq + line.strip()
+    return {'name': seqname, 'sequence': seq}
+
+
+def writeSingleSeq(fileobject, seq):
+    ''' write single sequence to fasta file'''
+    fileobject.write(">")
+    fileobject.write(seq['name'] + "\n")
+    fileobject.write(seq['sequence'] + "\n")
+
+
+def main():
+    from optparse import OptionParser
+    parser = OptionParser()
+    parser.add_option("-a",
+                      "--fasta_file_A",
+                      dest="seqfileA",
+                      help="input sequences in fasta format")
+    parser.add_option("-b",
+                      "--fasta_file_B",
+                      dest="seqfileB",
+                      help="input sequences in fasta format")
+    parser.add_option("-p",
+                      "--fasta_file_pairs",
+                      dest="seqfile_pairs",
+                      help="output file with paired sequences")
+    parser.add_option("-x",
+                      "--fasta_file_singles",
+                      dest="seqfile_singles",
+                      help="output file with single sequences")
+    options, args = parser.parse_args()
+
+    # Input files
+    fA = open(options.seqfileA, 'r')
+    fB = open(options.seqfileB, 'r')
+    # Output files
+    if options.seqfile_pairs:
+        fPairs = open(options.seqfile_pairs, 'w')
+    else:
+        fPairs = open(options.seqfileA + ".pairs", 'w')
+
+    if options.seqfile_singles:
+        single = open(options.seqfile_singles, "w")
+    else:
+        single = open(options.seqfileA + ".single", "w")
+
+
+    sA1 = readSingleSeq(fA)
+    sB1 = readSingleSeq(fB)
+    if not sA1 or not sB1:
+        raise Exception("\nEmpty sequence on input, nothing to interlace!\n")
+    charA = sA1['name'][-1]
+    charB = sB1['name'][-1]
+    # validate sequence names
+    if charA == charB:
+        sys.stderr.write(
+            "last character of sequence id must be used for distinguishing pairs!")
+        exit(1)
+        # check first thousand!
+    for i in range(1000):
+        seqA = readSingleSeq(fA)
+        seqB = readSingleSeq(fB)
+        if (not seqA) or (not seqB):
+            # end of file:
+            if i == 0:
+                sys.stderr.write("input file is empty")
+                exit(1)
+            else:
+                break
+        if seqA['name'][-1] == charA and seqB['name'][-1] == charB:
+            continue
+        else:
+            sys.stderr.write(
+                "last character of sequence id must be used for distinguishing pairs!")
+            exit(1)
+
+    fA.seek(0)
+    fB.seek(0)
+
+    buffA = {}
+    buffB = {}
+    buffA_names = []
+    buffB_names = []
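+    # reads whose mate has not yet been seen are buffered by name (id without its last,
+    # pair-distinguishing character); when the mate appears in the other file the pair is
+    # written out and older buffered reads are flushed as singles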
+
+    while True:
+
+        seqA = readSingleSeq(fA)
+        seqB = readSingleSeq(fB)
+
+        if not seqA and not seqB:
+            break  # end of file
+
+        ## validation and direct checking only if not end of files
+        if seqA and seqB:
+            #validate:
+            if not (seqA['name'][-1] == charA and seqB['name'][-1] == charB):
+                sys.stderr.write(
+                    "last character of sequence id must be used for distinguishing pairs!")
+                exit(1)
+
+                # check if current seqs are pairs
+            if seqA['name'][:-1] == seqB['name'][:-1]:
+                writeSingleSeq(fPairs, seqA)
+                writeSingleSeq(fPairs, seqB)
+                continue
+
+            ### compare with buffers
+            ### seqA vs buffB
+        if seqA:
+            if seqA["name"][:-1] in buffB:
+                writeSingleSeq(fPairs, seqA)
+                seqtmp = {"name": seqA["name"][:-1] + charB,
+                          "sequence": buffB[seqA["name"][:-1]]}
+                writeSingleSeq(fPairs, seqtmp)
+                # can I empty buffA ???
+                for i in buffA_names:
+                    seqtmp = {"name": i + charA, "sequence": buffA[i]}
+                    writeSingleSeq(single, seqtmp)
+                buffA = {}
+                buffA_names = []
+
+                j = 0
+                for i in buffB_names:
+                    seqtmp = {"name": i + charB, "sequence": buffB[i]}
+                    del buffB[i]
+                    j += 1
+                    if i == seqA["name"][:-1]:
+                        del buffB_names[0:j]
+                        break
+                    else:
+                        writeSingleSeq(single, seqtmp)
+            else:
+                buffA[seqA["name"][:-1]] = seqA['sequence']
+                buffA_names.append(seqA["name"][:-1])
+
+                ### seqB vs buffA
+        if seqB:
+            if seqB["name"][:-1] in buffA:
+                seqtmp = {"name": seqB["name"][:-1] + charA,
+                          "sequence": buffA[seqB["name"][:-1]]}
+                writeSingleSeq(fPairs, seqtmp)
+                writeSingleSeq(fPairs, seqB)
+                # can I empty buffB ???
+                for i in buffB_names:
+                    seqtmp = {"name": i + charB, "sequence": buffB[i]}
+                    writeSingleSeq(single, seqtmp)
+                buffB = {}
+                buffB_names = []
+
+                j = 0
+                for i in buffA_names:
+                    seqtmp = {"name": i + charA, "sequence": buffA[i]}
+                    del buffA[i]
+                    j += 1
+                    if i == seqB["name"][:-1]:
+                        del buffA_names[0:j]
+                        break
+                    else:
+                        writeSingleSeq(single, seqtmp)
+
+            else:
+                buffB[seqB["name"][:-1]] = seqB['sequence']
+                buffB_names.append(seqB["name"][:-1])
+
+    fA.close()
+    fB.close()
+    fPairs.close()
+    # write rest of singles:
+    for i in buffA:
+        seqtmp = {"name": i + charA, "sequence": buffA[i]}
+        writeSingleSeq(single, seqtmp)
+    for i in buffB:
+        seqtmp = {"name": i + charB, "sequence": buffB[i]}
+        writeSingleSeq(single, seqtmp)
+    single.close()
+
+
+if __name__ == "__main__":
+    main()
diff -r 000000000000 -r a4cd8608ef6b fasta_manual_input.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/fasta_manual_input.xml Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,21 @@
+<tool id="fasta_input" name="Fasta interactive input" version="1.0.0">
+  <description> Enter fasta file to text window </description>
+  <command>
+    echo "${fasta_text}" > ${fasta_file}  &amp;&amp;
+    sed -i 's/__gt__/>/g' ${fasta_file}  &amp;&amp;
+    sed -i 's/__cn__/\n/g' ${fasta_file}
+  </command>
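+  <!-- the sed commands above undo Galaxy's sanitization of the text parameter: '>' is passed as __gt__ and newlines as __cn__ -->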
+
+  <inputs>
+    <param  type="text" name="fasta_text" area="True" size="50x150" label="Paste sequences in fasta format" />
+  </inputs>
+
+
+  <outputs>
+    <data format="fasta" name="fasta_file" label="fasta sequence manually edited" />
+  </outputs>
+
+  <help>
+    This is a helper tool for entering fasta sequences manually.
+  </help>
+</tool>
diff -r 000000000000 -r a4cd8608ef6b paired_fastq_filtering.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/paired_fastq_filtering.R Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,404 @@
+#!/usr/bin/env Rscript
+# TODO - add triming from start and end, removing short reads
[... remainder of the 404-line paired_fastq_filtering.R diff garbled in extraction: optparse option parsing, a megablast() helper for similarity filtering against a blast database, chunk-wise quality filtering of both fastq files, cutadapt runs on temporary files, optional sampling, and nucleotide-frequency PNG plotting ...]
diff -r 000000000000 -r a4cd8608ef6b paired_fastq_filtering.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/paired_fastq_filtering.xml Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,181 @@
+<tool id="paired_fastq_filtering" name="Preprocessing of fastq paired-reads">
[... remainder of the 181-line paired_fastq_filtering.xml diff garbled in extraction: tool description, requirements (blast, cutadapt, bioconductor-shortread, r-optparse), the paired_fastq_filtering_wrapper.sh command line, input parameters (sampling, quality cut-off, trimming, maximum Ns, custom cutadapt options, similarity filtering), outputs, a test case, and help text including a quality-encoding chart ...]
diff -r 000000000000 -r a4cd8608ef6b paired_fastq_filtering_wrapper.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/paired_fastq_filtering_wrapper.sh Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,66 @@
+#!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'
+
+# run filtering
+WD="`dirname $0`"
+ORIDIR=$PWD
+cd $WD
+WD=$PWD  # absolute path to this script
+cd $ORIDIR
+SAMPLING=""
+TRIM_END=""
+TRIM_START=""
+PERCENT_ABOVE="95"
+CUTADAPT=""
+RENAME=""
+FILTER_SEQ=""
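+# option letters: -a/-b input fastq files, -o interlaced output, -n sample size, -c quality cut-off,
+# -p percent of bases above cut-off, -e/-s trim end/start, -N maximum Ns, -C custom cutadapt options,
+# -G PNG output, -F similarity-filter database, -R rename sequences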
+while getopts "a:b:o:n:c:p:e:s:N:C:G:F:R" OPTION
+do
+    case $OPTION in
+        a)
+            FASTAA=$OPTARG;;
+        b)
+            FASTAB=$OPTARG;;
+        o)
+            PAIRED_OUTPUT=$OPTARG;;
+        n)
+            SAMPLING=( -n ${OPTARG} );;
+        c)
+            CUT_OFF=$OPTARG;;
+        p)
+            PERCENT_ABOVE=$OPTARG;;
+        e)
+            TRIM_END=( -e ${OPTARG} );;
+        s)
+            TRIM_START=( -s ${OPTARG} );;
+        N)
+            MAX_N=${OPTARG};;
+        C)
+            CUTADAPT=(-C " "${OPTARG}" " );;
+        G)
+            PNG_OUTPUT=${OPTARG};;
+        R)
+            RENAME="-R";;
+        F)
+            FILTER_SEQ=( -F ${OPTARG} );;
+
+    esac
+done
+fasta_tmp_fileX=$(mktemp)
+fasta_tmp_fileY=$(mktemp)
+
+if [ -z "$CUTADAPT" ] # test if$CUTADAPT is empty
+then
+    ${WD}/paired_fastq_filtering.R -a $FASTAA -b $FASTAB -x $fasta_tmp_fileX -y $fasta_tmp_fileY  ${SAMPLING[@]} -c $CUT_OFF -G $PNG_OUTPUT\
+         -p $PERCENT_ABOVE  ${TRIM_START[@]}  ${TRIM_END[@]} -N $MAX_N $RENAME ${FILTER_SEQ[@]}
+else
+    ${WD}/paired_fastq_filtering.R -a $FASTAA -b $FASTAB -x $fasta_tmp_fileX -y $fasta_tmp_fileY  ${SAMPLING[@]} -c $CUT_OFF -G $PNG_OUTPUT\
+         -p $PERCENT_ABOVE  ${TRIM_START[@]}  ${TRIM_END[@]} -N $MAX_N "${CUTADAPT[@]}" $RENAME ${FILTER_SEQ[@]}
+fi
+
+${WD}/fasta_interlacer.py -a $fasta_tmp_fileX -b $fasta_tmp_fileY -p $PAIRED_OUTPUT -x fasta_tmp_single
+
+
+rm $fasta_tmp_fileX
+rm $fasta_tmp_fileY
diff -r 000000000000 -r a4cd8608ef6b parallel.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/parallel.py Mon Apr 01 07:56:36 2019 -0400
@@ -0,0 +1,366 @@
+#!/usr/bin/env python3
+import multiprocessing
+import os
+import time
+from itertools import cycle
+'''
+functions for parallel processing of data chunks using worker function
+'''
[... remainder of the 366-line parallel.py diff garbled in extraction: PBS-cluster helpers (run_multiple_pbs_jobs, pbs_send_job, pbs_run), process-pool helpers (spawn, get_max_proc, parmap, parmap2) and the parallel()/parallel2() wrappers that apply a worker function across argument lists, with optional job groups and per-job load (ppn) ...]
b
diff -r 000000000000 -r a4cd8608ef6b rmsk_summary_table_multiple.r
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rmsk_summary_table_multiple.r Mon Apr 01 07:56:36 2019 -0400
[
@@ -0,0 +1,127 @@
+#! /usr/bin/env Rscript
+# repeat masker summary
+# analysis of *.out file
+# input arguments: <rmsk_result.out> <reads.fas.
+# calculates total proportion of matched repetitive families and average score and prints the table
+suppressPackageStartupMessages(library(getopt))
+
+# INPUT ARGUMENTS:
+opt = getopt(matrix(c(
+  'fasta', 'f', 1, "character",
+  'repeat_masker','r', 1, "character",
+  'output', 'o', 1, "character",
+  'help','h',0,'logical'),ncol=4,byrow=T))
+
+if (!is.null(opt$help)) {
+cat("Usage:\n\n -f   fasta file \n -r   repeat masker output\n -o   output files\n")
+stop()
+}
+
+RMfiles=system(paste("ls ",opt$repeat_masker,sep=""),intern=T)
+fasta=system(paste("ls ",opt$fasta,sep=""),intern=T)
+
+
+
+getsubdir=function(x){
+ xx=gsub(".*/","",x)
+ sdir=gsub(xx,"",x,fixed=T)
+ sdir
+}
+
+fasta.summary=function(reads){
+ suppressPackageStartupMessages(library(Biostrings,quiet=T))
+ if (exists("readDNAStringSet")){
+ seqs=readDNAStringSet(reads)
+ }else{
+ seqs=read.DNAStringSet(reads)
+ }
+
+ N=length(seqs)
+ Ls=nchar(seqs)
+ L=sum(Ls)
+ M=median(Ls)
+ A=mean(Ls)
+ output=c(N,L,M,A)
+ names(output)=c("N","length","median","mean")
+ output
+}
+
+repCont=function(repname){   # uses external variable!
+ pos=repname==classfamily
+ RC=unname(sum(lengths[pos])/totalLength)
+ RC
+}
+repScore=function(repname){   # uses external variable!
+ pos=repname==classfamily
+ S=unname(mean(score[pos]))
+ S
+}
+
+N=length(RMfiles)
+
+for (i in seq_along(RMfiles)){
+ cat(paste((RMfiles[i]),"\n"))
+
+ sdir=getsubdir(RMfiles[i])
+ seqs.summary=fasta.summary(fasta[i])
+ totalLength=seqs.summary["length"]
+
+ total.counts=data.frame(c("All_Reads_Length","All_Reads_Number"),c(NA,NA),c(NA,NA),c(totalLength,seqs.summary['N']),c(totalLength,seqs.summary['N']),c(NA,NA))
+ colnames(total.counts)=c("class/family","class","family","hits","content","Mean_Score")
+
+
+ # get RM data if not empty
+ rmsk_empty=length(grep("There were no repetitive sequences detected",scan(RMfiles[i],what=character(),n=1,sep="\n",quiet=T),fixed=T))>0
+ rmsk_notperformed = length(grep("No Repbase used with RepetMasker",scan(RMfiles[i],what=character(),n=1,sep="\n",quiet=T),fixed=T))>0
+ if (!rmsk_empty & !rmsk_notperformed) {
+ rmsk=read.table(RMfiles[i],header=F,as.is=T,skip=2,comment.char="*",fill=T)
+ score=rmsk[,1]
+ Ids=rmsk[,5]
+ starts=rmsk[,6]
+ ends=rmsk[,7]
+ lengths=ends-starts
+ classfamily=rmsk[,11]
+ out.table=table(classfamily)
+ out.table=data.frame(names(out.table),c(out.table),stringsAsFactors=F)
+ contents=sapply(out.table[,1],repCont)*100
+ meanScore=sapply(out.table[,1],repScore)
+ class=gsub("/.*","",out.table[,1])
+ families=gsub(".*/","",out.table[,1])
+ out.table=data.frame(out.table[,1],class,families,out.table[,-1],contents,meanScore,stringsAsFactors=F)
+ colnames(out.table)=c("class/family","class","family","hits","content","Mean_Score")
+ out.table=out.table[order(out.table$content,decreasing=T),]
+ out.table=rbind(out.table,total.counts)
+ }else{
+ out.table=total.counts
+ }
+ out.table=out.table[order(out.table$content,decreasing=T),]
+
+ write.table(out.table,file=paste(sdir,opt$output,sep=""),row.names=F,quote=F,sep="\t")
+
+
+ # merge all tables:
+
+ colnames(out.table)=paste(colnames(out.table),c("",rep(sdir,5)))
+
+ if (i==1) {
+ mergedTable=out.table[,c(1,4,5,6)]
+ }else{
+
+ mergedTable=merge(mergedTable,out.table[,c(1,4,5,6)],by=1,all=T)
+ }
+
+}
+ mergedTable=mergedTable[order(rowSums(mergedTable[,seq(2,N*3,by=3),drop=FALSE],na.rm=T),decreasing=T),]
+ mergedTable[is.na(mergedTable)]<-0
+
+ write.table(t(mergedTable),file=paste(opt$output,"summary.csv",sep=""),col.names=F,row.names=T,quote=F,sep="\t")
+
+#save.image("tmp.RData")
+
+
+
+
+
+
+
+
b
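A minimal R sketch (not part of the committed files) of the summary computed by rmsk_summary_table_multiple.r above: the "content" of a class/family is the summed length of its RepeatMasker hits divided by the total read length, and Mean_Score is the mean alignment score of those hits. All values below are made up; names mirror the script's variables::

    rmsk <- data.frame(score       = c(250, 300, 180),
                       start       = c(1, 10, 5),
                       end         = c(90, 95, 60),
                       classfamily = c("LTR/Copia", "LTR/Copia", "DNA/hAT"),
                       stringsAsFactors = FALSE)
    totalLength <- 10000                      # total length of all reads (from the fasta summary)
    lengths     <- rmsk$end - rmsk$start      # length of each masked hit, as in the script
    # percent of total read length matched by each class/family ("content")
    contents  <- sapply(unique(rmsk$classfamily),
                        function(f) sum(lengths[rmsk$classfamily == f]) / totalLength * 100)
    # mean RepeatMasker score per class/family ("Mean_Score")
    meanScore <- sapply(unique(rmsk$classfamily),
                        function(f) mean(rmsk$score[rmsk$classfamily == f]))
    contents    # LTR/Copia 1.74, DNA/hAT 0.55
    meanScore   # LTR/Copia 275,  DNA/hAT 180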
diff -r 000000000000 -r a4cd8608ef6b single_fastq_filtering.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/single_fastq_filtering.R Mon Apr 01 07:56:36 2019 -0400
[
b'@@ -0,0 +1,317 @@\n+#!/usr/bin/env Rscript\n+\n+\n+\n+library(optparse,quiet=TRUE)\n+matricesSum = function(m1,m2){\n+    mfin = matrix(0, nrow = nrow(m2), ncol = max(ncol(m1), ncol(m2)))\n+    rownames(mfin) = rownames(m2)\n+    if (sum(m1)>0){\n+        mfin[,1:ncol(m1)] = mfin[,1:ncol(m1)] + m1\n+    }\n+    if (sum(m2)>0){\n+        mfin[,1:ncol(m2)] = mfin[,1:ncol(m2)] + m2\n+    }\n+\n+    return(mfin)\n+}\n+\n+plotFrequencies = function(x){\n+    par(xaxs = \'i\', yaxs = \'i\')\n+    plot(NULL, xlim = c(1,ncol(x)), ylim = c(0,1),type =\'n\', xlab="position", ylab = "proportion")\n+    col = c(A="red", "T"="orange", C="blue", G="green")\n+    for (n in c("A","C", "T","G")){\n+        points(x[n,]/colSums(x), type = \'l\', col = col[n])\n+    }\n+    grid(ny = 50, nx = ncol(x) - 1  )\n+    legend("topright", col = col, lwd = 2, legend = names(col))\n+}\n+\n+megablast = function(fx,params="-F \\"m D\\" -D 4 -p 90 ",database){\n+  # assume blastn and makeblastdb in $PATH\n+\ttmp_in = tempfile()\n+\ttmp_out = tempfile()\n+\twriteFasta(fx,file=tmp_in)\n+  ## check if db is present:\n+  dbfiles = paste0(database, ".nhr")\n+  if (!file.exists(dbfiles)){\n+    cat("running makeblastdb\\n")\n+    system(paste0("makeblastdb -dbtype nucl -in ", database))\n+  }\n+  params = \'-num_alignments 1 -num_threads 4  -outfmt "6 qseqid qcovs pident" \'\n+\tcmd=paste("blastn", " -db", database, "-query", tmp_in,\n+            "-out", tmp_out, " ", params)\n+  status=system(cmd,intern=TRUE)\n+\n+\tif (file.info(tmp_out)$size != 0){\n+    results=read.table(tmp_out,sep="\\t",as.is=TRUE,comment.char="")\n+    colnames(results) = c("qseqid", "qcovs", "pident")\n+\t}else{\n+\t\tresults=NULL\n+\t}\n+  unlink(tmp_in)\n+\tunlink(tmp_out)\n+\treturn(results)\n+}\n+\n+\n+\n+option_list = list(\n+  make_option(c(\'-a\', \'--fastqA\'),action=\'store\',type=\'character\',help=\'fastq file A\',default=NA),\n+  make_option(c(\'-x\', \'--fastqX\'),action=\'store\',type=\'character\',help=\'output fastq file X\',default=NA),\n+  make_option(c(\'-c\', \'--cut_off\'),action=\'store\',type=\'numeric\',help=\'Quality cut-off value [default %default]\',default=10),\n+  make_option(c(\'-r\', \'--rnd_seed\'),action=\'store\',type=\'numeric\',help=\'seed for random number generator [default %default]\',default=123),\n+  make_option(c(\'-p\', \'--percent_above\'),action=\'store\',type=\'numeric\',\n+              help=\'Percent of bases in sequence that must have quality equal to / higher than cut-off value [default %default]\',default=95),\n+  make_option(c("-G", "--png_file_output"), action = "store", type = "character", default=NA),\n+  make_option(c(\'-e\', \'--trim_end\'), action=\'store\',type=\'numeric\',help="trimming - end position [no trimming by default]", default=NA),\n+  make_option(c(\'-s\', \'--trim_start\'), action=\'store\',type=\'numeric\',help="triming position - start  [no trimming by default]", default=NA),\n+\n+  make_option(c(\'-n\', \'--sample_size\'),action=\'store\',type=\'numeric\',help=\'requested sample size (number of pairs)[no sampling by default]\',default=NULL),\n+  make_option(c(\'-N\', \'--Max_N\'),action=\'store\',type=\'integer\',help=\'maximum number of Ns in sequences [default %default]\',default=0),\n+  make_option(c(\'-f\', \'--format\'),action=\'store\',type=\'character\',help=\'format of output - fastq or fasta [default %default] \',default="fasta"),\n+  make_option(c(\'-C\', \'--cutadapt_options\'),action=\'store\',type=\'character\',help=\'file specifying cutadapt 
options\',default=NULL),\n+  make_option(c(\'-F\', \'--filter_seq\'),action=\'store\',type=\'character\',help=\'file specifying sequences for filtering (e.g. plastid DNA)\',default=NULL),\n+  make_option(c(\'-j\', \'--chunk_size\'),action=\'store\',type=\'numeric\',help=\'Number of sequences processed in single step. This option affect speed of processing and memory usage [default %default]\',default=1000000)\n+\n+\n+\n+\n+\n+)\n+\n+description=paste(strwrap(\'Script for filterinq fastq file f\n+\t\t\t\t\t\t\'),collapse="\\n")\n+# arguments whic hcould be modifed\n+cutadapt_arguments=paste(\n+\t\t" --anywhere=\'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT\' ",\n+\t\t" --anywhere=\'AGATCGGAAGAGCGTCG'..b'=\' \',sep=" ")\n+}\n+\n+\n+cutadapt_cmd=function(input,output,cutadapt_arguments){\n+\tcmds=paste("cutadapt --format=",\n+\t\t\t"fastq",\n+\t\t\tcutadapt_arguments,\n+\t\t\t" ",\n+\t\t\tc(input),\n+\t\t\t" -o ",\n+\t\t\tc(output),\n+\t\t\tsep="")\n+\n+}\n+\n+set.seed(opt$rnd_seed)\n+\n+PROP=opt$percent_above/100\n+MINQ=opt$cut_off\n+TRIM_END=opt$trim_end\n+TRIM_START=opt$trim_start\n+\n+CHUNK_SIZE=opt$chunk_size\n+\n+MIN_LENGTH=TRIM_END - ifelse(is.na(TRIM_START),1,TRIM_START)+1\n+if (is.na(MIN_LENGTH)){\n+\tMIN_LENGTH=1\n+}\n+\n+suppressPackageStartupMessages(library(ShortRead))\n+Qfilter=srFilter(function(x){\n+\t\t\trowSums(as(quality(x),\'matrix\')>MINQ,na.rm=TRUE)/nchar(sread(x))>=PROP\n+\t\t}\n+)\n+\n+Min_Length_filter=srFilter(function(x){\n+\t\t\tnchar(sread(x))>=MIN_LENGTH\n+\t\t}\n+)\n+\n+\n+fin_filter=compose(Qfilter,nFilter(opt$Max_N),Min_Length_filter)\n+\n+if (opt$format=="fasta"){\n+\twriteFun=writeFasta\n+}else{\n+\twriteFun=writeFastq\n+}\n+\n+\n+f1out=opt$fastqX\n+\n+#find wich character specify pair: - it could be\n+\n+# check size of sequences - must be same...\n+n1=countLines(opt$fastqA)/4\n+\n+cat("number of sequences in ",opt$fastqA,":",n1,"\\n")\n+cat("--------\\n")\n+\n+\n+number_of_chunks=round(n1/CHUNK_SIZE)\n+if (number_of_chunks==0){\n+\tCHUNK_SIZE=n1\n+\tnumber_of_chunks=1\n+}\n+if (!is.null(opt$sample_size)){\n+\tsample_size_in_chunk=opt$sample_size/number_of_chunks\n+}else{\n+\tsample_size_in_chunk=CHUNK_SIZE\n+}\n+\n+# adjust the chunk size to get exact count of sequences:\n+CHUNK_SIZE = ceiling(n1/number_of_chunks)\n+save.image("tmp.RData")\n+\n+print("--------------------------------")\n+print (sample_size_in_chunk)\n+print (opt$sample_size)\n+print (CHUNK_SIZE)\n+print("--------------------------------")\n+f1 <- FastqStreamer(opt$fastqA, CHUNK_SIZE)\n+total=0\n+nucleotideFrequenciesForward = matrix(0)\n+while(TRUE){\n+\tprint (Sys.time())\n+\tfq1 <- yield(f1)\n+\tprint (Sys.time())\n+\tif (length(fq1)==0){\n+\t\tbreak\n+\t}\n+\t\n+\t\n+\t\n+\t\n+\tif (!is.na(TRIM_END)){\n+\t\tfq1=narrow(fq1,end=ifelse((TRIM_END)>width(fq1),width(fq1),TRIM_END))\n+\t}\n+\tif (!is.na(TRIM_START)){\n+\t\tfq1=narrow(fq1,start=ifelse((TRIM_START)<width(fq1),TRIM_START,width(fq1)))\n+\t}\n+\t\n+\n+\n+\tfqF1=fq1[fin_filter(fq1)]\n+\n+\tcat("running cutadapt on chunk\\n")\n+\t# cut addapt here: # filter parts:\n+\ttmp_in1=tempfile(fileext=".fastq")\n+\ttmp_out1=tempfile(fileext=".fastq")\n+        if (is.null(formals(writeFastq)$compress)){\n+            writeFastq(fqF1,file=tmp_in1)\n+        }else{\n+            writeFastq(fqF1,file=tmp_in1, compress = FALSE)\n+        }\n+\n+\n+\tcmd1=cutadapt_cmd(tmp_in1,tmp_out1, cutadapt_arguments)\n+\n+\tstatus=system(cmd1,intern=TRUE)\n+\tprint (Sys.time())\n+\t# collect 
results\n+\tftmp1=FastqFile(tmp_out1)\n+\n+\tfqF1=readFastq(ftmp1)\n+\tclose(ftmp1)\n+\n+  # remove sequences similar to filter database (e.g. plastid DNA)\n+  if (!is.null(opt$filter_seq)){\n+\t\tblast_results =  megablast(fqF1, database=opt$filter_seq)\n+\t\tif (!is.null(blast_results)){\n+\t\t\texclude=with(blast_results, unique(qseqid[pident >= 90 & qcovs >=90]))\n+      ## note - blast will truncate seq ids - anything beyond space is omitted\n+\t\t\tseq_to_exclude = gsub(" .*","",id(fqF1)) %in% exclude\n+\t\t\texcluded_contamination=signif(sum(seq_to_exclude)/length(seq_to_exclude)*100,3)\n+      cat(excluded_contamination,"% filtered out after blast\\n" )\n+\t\t\tfqF1 = fqF1[!seq_to_exclude]\n+\t\t}\n+\t}\n+\n+\n+\n+\t# clean up\n+\tunlink(tmp_out1)\n+\tunlink(tmp_in1)\n+\n+\n+  \n+\n+\t# filter complete pairs again:\n+\n+\t# create new id - last character must differentiate pair - for interlacig\n+\tif (length(fqF1)>sample_size_in_chunk){\n+\t\tsmp=sort(sample(seq_along(fqF1),sample_size_in_chunk))\n+\t\twriteFun(fqF1[smp],file=f1out,mode=\'a\')\n+    nfrq1 = alphabetByCycle(sread(fqF1[smp]))\n+\n+\t}else{\n+      writeFun(fqF1,file=f1out,mode=\'a\')\n+      nfrq1 = alphabetByCycle(sread(fqF1))\n+\t}\n+  nucleotideFrequenciesForward = matricesSum(nucleotideFrequenciesForward, nfrq1)\n+\n+}\n+\n+if( !is.na(opt$png_file_output)){\n+    png(opt$png_file_output, width = 1000, height = 500)\n+    plotFrequencies(nucleotideFrequenciesForward)\n+    dev.off()\n+}\n+\n+close(f1)\n+\n+\n+\n+\n'
b
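A minimal R sketch (not part of the committed files) of the per-read quality rule applied by single_fastq_filtering.R above: a read is kept when the fraction of bases whose Phred quality exceeds the cut-off (-c) is at least percent_above/100 (-p), mirroring the comparison used in the script. The quality vector below is made up::

    cut_off       <- 10      # -c / --cut_off
    percent_above <- 95      # -p / --percent_above
    qual <- c(30, 32, 28, 9, 35, 40, 33, 31, 38, 36,
              34, 37, 39, 40, 35, 33, 36, 38, 40, 37)   # per-base Phred scores of one read
    # keep the read when the proportion of bases above the cut-off reaches the threshold
    keep <- sum(qual > cut_off) / length(qual) >= percent_above / 100
    keep    # TRUE: 19 of 20 bases (95%) exceed the cut-off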
diff -r 000000000000 -r a4cd8608ef6b single_fastq_filtering.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/single_fastq_filtering.xml Mon Apr 01 07:56:36 2019 -0400
[
@@ -0,0 +1,170 @@
+<tool id="single_fastq_filtering" name="Preprocessing of fastq reads">
+  <description>
+    Preprocessing of a fastq file,
+    including trimming, quality filtering, cutadapt filtering and sampling
+  </description>
+  <requirements>
+    <requirement type="package">blast</requirement>
+    <requirement type="package">cutadapt</requirement>
+    <requirement type="package">bioconductor-shortread</requirement>
+    <requirement type="package">r-optparse</requirement>
+  </requirements>
+  <command interpreter="bash">
+    single_fastq_filtering_wrapper.sh -a ${A} -o ${output} -c ${cut_off} -p ${percent_above} -N ${max_n} -G ${png_output}
+
+    #if $sampling.sequence_sampling :
+    -n $sampling.sample_size
+    #end if
+
+    #if $trimming.sequence_trimming :
+    -e $trimming.trim_end -s $trimming.trim_start
+    #end if
+
+    #if $cutadapt.use_custom :
+    -C "${cutadapt.custom_options}"
+    #end if
+
+    #if $similarity_filtering.include :
+    -F "${similarity_filtering.filter_database}"
+    #end if
+
+
+  </command>
+
+  <inputs>
+    <param format="fastq,fastq.gz" type="data" name="A" label="reads in fastq format" />
+    <conditional name="sampling">
+      <param name="sequence_sampling" type="boolean" truevalue="true" falsevalue="false" checked="False" label="Sequence sampling"/>
+     <when value="false">
+        <!-- do nothing here -->
+      </when>
+      <when value="true">
+      <param name="sample_size" type="integer" label="Sample size(number of reads" help="How many sequence reads should be in resulting dataset" value="500000" min="0"/>
+      </when>
+    </conditional>
+
+    <param type="integer" name="cut_off" label="Quality cut-off" value="10" min="0" help="see below how to correctly set quality cut-off" />
+    <param type="integer" name="percent_above" label="percent above cutoff" value="95" min="0"
+           help="Percent of bases in sequence that must have quality equal to / higher than cut-off value" />
+
+    <conditional name="trimming">
+      <param name="sequence_trimming" type="boolean" truevalue="true" falsevalue="false" checked="False" label="Trim sequences"/>
+      <when value="false">
+        <!-- do nothing here -->
+      </when>      
+      <when value="true">
+        <param type="integer" name="trim_start" label="trimming - start position" value="1" min="1"
+               help="sequences are trimmed at specified start" />
+        <param type="integer" name="trim_end" label="trimming - end position" value="100" min="1"
+               help="sequences are trimmed to specified end position, shorted sequences are discarded" />
+      </when>      
+
+    </conditional>
+    <param name="max_n" type="integer" label="maximum Ns" help="Maximum number of Ns in sequence" value="0" min="0" max="10"/>
+
+    <conditional name="cutadapt">
+      <param name="use_custom" type="boolean" truevalue="true" falsevalue="false" checked="False" label="Do you want to use custom cutadapt options"/>
+     <when value="false">
+        <!-- do nothing here -->
+      </when>
+      <when value="true">
+      <param name="custom_options" type="text" area="True" size="8x30"  label="Cutadapt custom options" help="Consult cutadapt for usage" value="">
+          <sanitizer sanitize="False"/>
+          </param>
+      </when>
+    </conditional>
+
+    <conditional name="similarity_filtering">
+     <param name="include" type="boolean" truevalue="true" falsevalue="false" checked="False" label="Use similarity search filtering"/>
+     <when value="false">
+        <!-- do nothing here -->
+      </when>
+      <when value="true">
+
+      <param name="filter_database" format="fasta" type="data" label="Sequence filter database" help="Provide DNA sequences in fasta format. Sequence reads which has at least 90% similarity over 90% of length to sequence in filter database will be removed. This is suitable option if you want to remove organele DNA or contamination"/>
+      </when>
+    </conditional>
+
+  </inputs>
+
+
+  <outputs>
+    <data format="fasta" name="output" label="filtered fasta reads from datasets ${A.hid}"/>
+    <data format="png" name="png_output" label="nucleotide composition after filtering of ${A.hid}"/>"
+  </outputs>
+
+  <tests>
+    <test>
+      <param name="A" value="ERR215189_1_part.fastq.gz" />
+      <param name="max_n" value="0"/>
+      <param name="cut_off" value="10" />
+      <param name="percent_above" value="95" />
+      <output name="output" value="single_output.fasta" />
+      <output name="png_output" value="single_output.png" />
+    </test>
+  </tests>
+
+  <help>
+**What it does**
+
+This tool is designed to perform preprocessing of a fastq file. The input file can be
+a GNU zipped archive (.gz extension). Reads are filtered based on quality,
+presence of N bases and adapters. All reads which pass the quality filter will
+be written into the output file. If sampling is specified, only a sample of sequences
+will be returned.
+
+Cutadapt is run with these options::
+
+    --anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
+    --anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
+    --anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
+    --anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
+    --anywhere='CAAGCAGAAGACGGCATACGAGAT'
+    --anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
+    --error-rate=0.05
+    --times=1 --overlap=15 --discard
+
+
+**Order of fastq files processing**
+
+1. Trimming (optional)       
+#. Filter by quality      
+#. Cutadapt filtering 
+#. Sampling (optional)         
+#. Interlacing two fasta files
+
+**Setting the quality cut-off**
+
+To correctly set the quality cut-off, you need to know how quality is encoded in your fastq file. The default
+filtering, which is suitable for Sanger and Illumina 1.8+ encoding, is shown below::
+
+
+      Default filtering cut-off
+                |   
+                |
+                V
+      SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
+      ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
+      ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
+      .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
+      LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
+      !"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
+      |                         |    |        |                              |                     |
+     33                        59   64       73                            104                   126
+      0........................26...31.......40                                
+                               -5....0........9.............................40 
+                                     0........9.............................40 
+                                        3.....9.............................40 
+      0.2......................26...31........41                              
+    
+     S - Sanger        Phred+33,  raw reads typically (0, 40)
+     X - Solexa        Solexa+64, raw reads typically (-5, 40)
+     I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
+     J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
+         with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
+         (Note: See discussion above).
+     L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
+    
+  </help>    
+</tool>
+
b
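As a companion to the encoding table in the help text above: Phred scores are obtained by subtracting the encoding offset from the ASCII code of each quality character, 33 for Sanger/Illumina 1.8+ and 64 for Solexa/Illumina 1.3+/1.5+. A minimal R sketch with a few example characters (not part of the committed files)::

    chars <- c("!", "+", "I", "J", "h")
    utf8ToInt(paste(chars, collapse = "")) - 33   # Phred+33:   0  10  40  41  71
    utf8ToInt(paste(chars, collapse = "")) - 64   # Phred+64: -31 -21   9  10  40
    # '!' and '+' are only valid under Phred+33; 'h' only makes sense under Phred+64 (score 40)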
diff -r 000000000000 -r a4cd8608ef6b single_fastq_filtering_wrapper.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/single_fastq_filtering_wrapper.sh Mon Apr 01 07:56:36 2019 -0400
[
@@ -0,0 +1,59 @@
+#!/bin/bash
+
+#set -euo pipefail # this is something like "use strict" in perl
+IFS=$'\n\t'
+
+# run filtering
+WD="`dirname $0`"
+ORIDIR=$PWD
+cd $WD
+WD=$PWD  # absolute path to this script
+cd $ORIDIR
+SAMPLING=""
+TRIM_END=""
+TRIM_START=""
+PERCENT_ABOVE="95"
+CUTADAPT=""
+FILTER_SEQ=""
+while getopts "a:o:n:c:p:e:s:N:C:G:F:" OPTION
+do
+    case $OPTION in
+        a)
+            FASTAA=$OPTARG;;
+        o)
+            OUTPUT=$OPTARG;;
+        n)
+            SAMPLING=( -n ${OPTARG} );;
+        c)
+            CUT_OFF=$OPTARG;;
+        p)
+            PERCENT_ABOVE=$OPTARG;;
+        e)
+            TRIM_END=( -e ${OPTARG} );;
+        s)
+            TRIM_START=( -s ${OPTARG} );;
+        N)
+            MAX_N=${OPTARG};;
+        C)
+            CUTADAPT=(-C " "${OPTARG}" " );;
+        G)
+            PNG_OUTPUT=${OPTARG};;
+        F)
+            FILTER_SEQ=( -F ${OPTARG} );;
+    esac
+done
+
+
+
+if [ -z "$CUTADAPT" ] # test if $CUTADAPT is empty
+then
+    ${WD}/single_fastq_filtering.R -a $FASTAA -x $OUTPUT  ${SAMPLING[@]} -c $CUT_OFF\
+         -p $PERCENT_ABOVE  ${TRIM_START[@]}  ${TRIM_END[@]} -N $MAX_N -G $PNG_OUTPUT ${FILTER_SEQ[@]}
+else
+    ${WD}/single_fastq_filtering.R -a $FASTAA -x $OUTPUT  ${SAMPLING[@]} -c $CUT_OFF -G $PNG_OUTPUT\
+         -p $PERCENT_ABOVE  ${TRIM_START[@]}  ${TRIM_END[@]} -N $MAX_N "${CUTADAPT[@]}" ${FILTER_SEQ[@]}
+fi
+
+
+
+
b
diff -r 000000000000 -r a4cd8608ef6b test_data/ERR215189_1_part.fastq.gz
b
Binary file test_data/ERR215189_1_part.fastq.gz has changed
b
diff -r 000000000000 -r a4cd8608ef6b test_data/ERR215189_2_part.fastq.gz
b
Binary file test_data/ERR215189_2_part.fastq.gz has changed
b
diff -r 000000000000 -r a4cd8608ef6b test_data/prefix_suffix.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/prefix_suffix.fasta Mon Apr 01 07:56:36 2019 -0400
b
b'@@ -0,0 +1,267832 @@\n+>TESTERR215189.1OK\n+GTGAGGAGTAGTCATCGGCCGCCAACTGATCCTGGAGTGAAGTGTGTCTCGACCCTGTCCTCAATTCCCATGGTAGTACCGCCTCGTGACCATAC\n+>TESTERR215189.1OK\n+CCGATCAAGGAACGATGTTTACGTCAGGAGAATTCGAAGAGTTCACGGCCGATATGGGAATCAAATTATTAAACTCCTCTCCGTACTATGCCCAA\n+>TESTERR215189.2OK\n+ATATGCTTAAACTCAGCGGGTAGTCCCGTCTGACCTGGGGTCGCGGTCCGAGCGCCTAAACGCTTCGGTCGTGTATGGGTCCTTGAGGCCAATAA\n+>TESTERR215189.2OK\n+AAGTTTGGGCTGCCGGCATAACTTGTCGGGCACCGCACGTGGTGGGCGACATCTAGTTGTTCTCGGTGCAGTGTCCTGGCATGCAGCTGGCTTAT\n+>TESTERR215189.3OK\n+ATCTGGGGCCAACCTCTTGGTTTGGGGCCGGTATGTGCACTTGGAGGGCTCGGGGCGGCCGGGGTTGGAGAGGGGAGAGCACGATGGAGGGCAGT\n+>TESTERR215189.3OK\n+CTGCCGCCGCCCCTCTTTCCTCCCCTGCGCCAAGCCTGCCGCAGCCCCCGGTGACGAGGACCTCAGCGATAGTCCAGTTCCCCTCTTGCGACGAT\n+>TESTERR215189.4OK\n+AATACCCGAAATGCATGCCGCCTCACATTTCCGCATACGCAGTCTTATTACTAGTACTGGTAGTAGCAGTAGTAGGACTGCAGGAGACCTGCAGC\n+>TESTERR215189.4OK\n+CGGACAGGCTAGCCAGGCAAAGTCATTAAAGGCCAACGACGACGACGACGTGGGCCGTGGCCGAGTCGCACGTTGGTAGCTGGGCCGTACCGAGG\n+>TESTERR215189.6OK\n+ATTTTGTTAAGGCCCGACACTGGTAGTGTTCGTGATCAAGTGTTTGAAAGTACTAATCTCATACCTAGTATGGGATGGGGAAGCCTAGTACCTGA\n+>TESTERR215189.6OK\n+CTAATCTCACGACATCCGCGCACCACAAGGGTTGCTTCCTGTGTCGGCCGTCCCCATCAATTCCCAGGCACGTGTCAGGTCCAAACTTCCCTTGG\n+>TESTERR215189.7OK\n+GCCCACCGACGGGTGGTCCTCTGGCCAATGGGCCCCGCGACGCAGCAGCTCTGTACGTGGCGGCCCCGGGTGCGCGGAGAGGGGCTCATAAATTT\n+>TESTERR215189.7OK\n+GTCAGAAGACGCACGGACGCAAAGGTGCGAGCGCGAGTGGCACAAACACGTTTCCCTCCGTCCTCCCGCGCGCCCCACCGGAAAAGTCCAGACCC\n+>TESTERR215189.8OK\n+AGTGAATTGAAGGCCCAGATGAAATCTGGGTTATTTCTTAATGAACATGAGGTTTGTGCTACTCAAGTTCTAATTCATTTGCTTAAAGCGTTATT\n+>TESTERR215189.8OK\n+TGCTTTTTGTCATGTGTGGATACACTAAATCCAGATATGTACAGCTAAGACATGTTAAAATAAAAAATTTACATTAGTTTATGATTTATTAAATT\n+>TESTERR215189.9OK\n+CTAACTTCCAGCACTTCACCACGTGTATCATATGTGTGCCGAGTTCCGCATCACGAGTATGCGAGGGTTGATAGGGAGTTCTATTCAAAGGACCC\n+>TESTERR215189.9OK\n+CCCTTCGTGCCAGCACTCACGAAAGCGTCCCCAAGCATTACCGTTCACTATAGAAACGACACTCATGCATTTAACGCTGCATGGACAGCACAACA\n+>TESTERR215189.10OK\n+GATCGTGTTACTAACATGTAGTGCCTAAAAAATTACAAAATGTAGGAAAGGCATAATCATTGGTGGATATTGCGCCCGGGTATATTCGGAATCAT\n+>TESTERR215189.10OK\n+CGAAACATTGTAAGAAAGAGTTAATAGCGACAAATATCGAACCCGAGGGAGTACGTTACTACTGAAATGCAACGACGAATACACTTGAGCCAAAA\n+>TESTERR215189.11OK\n+AGCACCGAGCCGGTCGCCACAATGCGTGGCTGATTGCCGCTAAGCCAACATGCAGTTGCAAACGCACTGCCGTACTCTAACCAGTCAACACTCCA\n+>TESTERR215189.11OK\n+TCTCCGGGTGAAAACCATGCTTGAGTCCTCAGGCGGGCGATGGCGGCGCTTTAGCATCGTTTCTTTGGAGATCATGTTCCGGCCTTTTTCTCGGC\n+>TESTERR215189.12OK\n+GGCTCCTCCGAGCCCATCGCGACGCGAAGGTAGAAGTCTCCGACCACGCGTTGCGTGCCGTAGCGGATCCACGCGTCGATCTGGTCGGGACGGGG\n+>TESTERR215189.12OK\n+ATTCATCGGCGGCTTCCTGGACTGGGTGCTCTCCCAGCGCGGCGACGTCGACATGGACTCTCTCCTCATCTCCATGAAGACAGGGGACTGCCCCC\n+>TESTERR215189.13OK\n+CATCGCCTACGCCGCCCAATAAAGGAGGATGCGCCGCCCGGTCTCCCGCTCGGCCTAGCGGTAGGGGCGTGTAGCCGCGCCTCCTGGTCAGCACG\n+>TESTERR215189.13OK\n+GAAGGACCGAGCCCCTGGGGTCGCCGCCTTCCCTCCCGGTCTTTGCCTCCTGTCAGCATGACCGCTGAAGGGGGATGGTTGTCGTTGGGCCCCGT\n+>TESTERR215189.14OK\n+GACTTGATTGATGGTAATAAGCTATCAGTGCTTAGATTCATTACAGCTAGAAGTAGAAATTGCTTAATTTGCTGCTTACGTTATCTGATTTGTGT\n+>TESTERR215189.14OK\n+GAATAATTCCACAAGAATAGTACATGCAATTAGAGAAAGGGTTTTAATTGAAGCCTAATAAACACAGAAACAAACCAAAGCATGCAATTGAACAT\n+>TESTERR215189.15OK\n+AATTTGCCTTGGAAATAACGTGAAGAAGAAAATGCACTTTCAAAGCTAGAAAACGCAAAACAGCAAGATACGCCTTCGTCAGACGTATACGTCTC\n+>TESTERR215189.15OK\n+GTCCTCGCTTGCAGGCGTGCACTTGTGGCCACCGATGCAACCCAATATCCTTCTGCTCGGCGTCCGTCAGGTATCTCTGAGAGGATCCATGAACC\n+>TESTERR215189.16OK\n+TCAGTCCTTGATGCAGGTCAACATCTCCACCGTCTCCGGCAACAACCGCTAGCGCCACTCCTCGATCACCCTGCAGTAAGACTAAAGCAAGTTTC\n+>TESTERR215189.16OK\n+CTTTGGTGCTCCATCTAGTTCAGGTGATTCTCCTATTCTTGCAT
CTACACCTCCTATATCTCCTGCTTCTGAATTGTCATCCTACTTGGACAATG\n+>TESTERR215189.17OK\n+CGTGTCCCCGTTGTACAGAGCAGCCGCCGCCGCGTGTGTTTACTGTTTAGTGCGATGGGTGTGTAGTGTTAGGATCATCTCTTCTGTTTGTACTG\n+>TESTERR215189.17OK\n+CTGGCACATGAACCTAGCTGTCAGATCCACGTCGTAGCACAACCAGCGTGCCACACAAGCGGGTTAACATAAGTTGTGAGGAACTAGAACTTGCA\n+>TESTERR215189.18OK\n+CTTCTGGTACTCGTCCGCTGACGTCATTGAAGAGCTGCCGAATCGGTGAAGCGTCTTCTACTAATCGGCCGATGTCCCCTTGAAGGAGGGATCGG\n+>TESTERR215189.18OK\n+CCAGCATCCGATCCCTCCTTCAAGGGGACATCGGCCGATTAGTAGAAGACGCTTCACCGATTCGGCAGCTCTTCAATG'..b'TAGGA\n+>TESTERR215189.74979OK\n+TTTTGGAGGCCCTTATCCTCATTACGCCTAGCATTGAGTGGGCTGGATATTTACCTTATCAACTAGCAAATCAATAGGGGTTCTATTTGATTAAG\n+>TESTERR215189.74981OK\n+CATTTGCAGACCCAACTGGACCATCTGGACAAACAAATGTGAATTGTATGGTTCTTGTTCTTGTAGTAAAATAGTTTCCTTTTGTTGTACACAAC\n+>TESTERR215189.74981OK\n+CTTACACCACTTCCTCCAAAGCATCTGGTAAAAACAAATGAGCAGACAAGCAACCAAACACATTGCCTACCCAGCTACATGATCGTGCCAGTGCT\n+>TESTERR215189.74982OK\n+CGGCAACGTCGTCGAGGCCGTGCGCCACATCCGCTCCGTCATGGGGGACGTCCGCGCGCTGCGCAACATGGACGACGACGAGGTCTTCGCCTACG\n+>TESTERR215189.74982OK\n+CACGATGGCTCGCGCCCGCCGCGCCGGGTCGCCGCTCTTGAAGATGCCGGAGCCGACGAAGACGCCGTCGCATCCGAGCTGCATCATGAGCGCGG\n+>TESTERR215189.74984OK\n+ATTGGTTAACTTATCTTTCAAGGTGCTGAACGCAACCTCTTGTGAATCACTCCAAGCAAAGGGAACATCCTTCTTTGTAAGCTCATGTAGAGGCG\n+>TESTERR215189.74984OK\n+AAATTCGAAGCTTTCTTGGCCTTGCAGGTTTCTACCGTCGGTTTGTTCGTGATTTTAGCTCCATTGCAGCGCCTCTACATGAGCTTACAAAGAAG\n+>TESTERR215189.74986OK\n+CAACGTTGGTAGCAACAGGCCGCTACCAGGAGGACCGGCTGAGATGTGGTGTCCAAGGGCACAGGTTTTTGCTCCACTGTCGATCTTTGCATGAA\n+>TESTERR215189.74986OK\n+CCCTTTTCCTAGAACTTGTGTGCTGCCGAATCTTGACAACACTATCTGTAGAAAGTTTGTTTTCATGCAAAGATCGACAGTGGAGCAAAAACCTG\n+>TESTERR215189.74987OK\n+GCCAGGCTTTGTTGATCCATATGATCTGTACGATATGGAAGAAGCGGACGAGGATGAAATAGAGGAGCTGAAGAAATATGAGTTCGCTTTGCTGC\n+>TESTERR215189.74987OK\n+TTCAGCATACTTGCGACAGCACGGAATGCAGCAAAACGAGGGTCGTCGCAGAGATAGTCCTCCAATGCACGATGCCTACCCTCGAGTCCAGGGTG\n+>TESTERR215189.74989OK\n+ATGACTTGGCCTCATCCTCTCCTTCCTCCGGCTTAACACCGGCGGTCTGTTCAGGGTTCCAAACTCATAGTGGCAACTAAACACGAGGGTTGCGC\n+>TESTERR215189.74989OK\n+GAGGGGTGCCCTCGGGAACGCGGACACAGGTGGTGCATGGCTGTCGTCAGCTCGTGCCGTAAGGTGTTGGGTTAAGTCTCGCAACGAGCGCAACC\n+>TESTERR215189.74990OK\n+AAAGATCGTCGTACGATGTCCGATAATTTTATACGCGTAGCGGCGTAGGCCGTGTTCACGGAACGTTCCGTCACATCCCCCGTATCAACCTGAGC\n+>TESTERR215189.74990OK\n+AGCGGCCGCGGGCACGGACAATCCGGGACAGCAGGGCCTCGCCGGCATTATTACCCCCGCCGTGCCGCCGCTTAATTGCCACCGTGGAGCCGCGG\n+>TESTERR215189.74991OK\n+GGAGAAAGAGAATAATTGTTGGCACAAGTCGATAAGGTCAGATAGTCCTTACGTTGTCTAACGTATGGTGCAGGATGTCATGACTAGTCCAACTT\n+>TESTERR215189.74991OK\n+GTTGTTCCGACAGTGGCGGTCGTGAACATCCATTACTAGTCTTTAACGGCCTATTCTAAAATGCCTTTTCAGTCTTAATTCCCAACCTTTGGAGG\n+>TESTERR215189.74992OK\n+GCTGGTTAAGCAGAAGCTTTGCCGTTTTGCTTAATATTGCAAGGAGGCCATTAGCATATAAATAAATAAGCTTTTAGCAGCAGGTTTCATCCGTG\n+>TESTERR215189.74992OK\n+TTTTTTACAAGGACTGGATTTGCTAACCAGTCGGGGTGCTTACGTTCACGGATGAAACCTGCTGCTAAAAGCTTATTTATTTATATGCTAATGGC\n+>TESTERR215189.74993OK\n+GGGAGAGCAGAGCTGGGCTGCTGCCACACAGCTGTTTGCTGCTGATGGCAAAAGCGGTGTCACCGTGCCACGTGCCGGCGCAGGTGTGCGACGTG\n+>TESTERR215189.74993OK\n+CCTTTCCCGGATATGATATTTAGTAGAGTGAGTTGACAGAGCCAGGAAGAGCTAAGAGAAAAAGCACAGTTAGGGCGATCTAGTTTCTTTTTCTT\n+>TESTERR215189.74994OK\n+CTTTGGTTTTCAGCCCATCTCTCCCCATCACTCTACAATGCTTCTGCTTTAGCGTGTCTTTCATTGAACCACGCAACAAACCTGTAGAAGGTGAA\n+>TESTERR215189.74994OK\n+GGTGTGAAGGATAATTTTGAACTGTTTCAGGTTCCTGCTCGTCAGCTTGAGGACTATTACTTTCAGCACACGTTGGAGCAGGTCAGGTCCACAAC\n+>TESTERR215189.74996OK\n+TTGTTCAGCCACACCATAAGCCACCATGATAAGCTTGTTTTTCACGGGAGTAACAACCTCACAAGGAACCCTCCCAGTAATGTTGTCCACGTGGT\n+>TESTERR215189.74996OK\n+TGCAAGAACACGTAGCTTTGACAGAGGCAAGAATGGAGGAAGCAATCGACC
GGCGTGTGGCTCTAGCTCTTAGCAAACAAGCCATCAGGGGCTAA\n+>TESTERR215189.74997OK\n+TGGCGAGGAGGCGCTCGCGCGTCTGGCCGACATGGGCCGGCTAGGCGTCCCCGTCGACAGCTACAGCTACGCGCACGGGCTTAAGGCCTGCATTG\n+>TESTERR215189.74997OK\n+CAGGGCGCTCATTCTTGGCGTAGCATCCAATCATGGCACTCCATGACACAACATTCCTCTCGGGCATCCAAGCAAACACACGCTCTGCATAGGTG\n+>TESTERR215189.74998OK\n+GTGTTCCAGGCGGCGTAAGTTGTGGAACAATTCCGCAACTCTTTAGATGATTGCTAAACTCGTGGCTCAAATACTCGCCTCCACGATCAGATCGT\n+>TESTERR215189.74998OK\n+GTCTGAAACCTTTGAAAAGTTCAAGGAATTTCAGAATGAAGTTGAAAATCAGCGTGGCAAGAAAATTAAGGCCTTACGATCTGATCGTGGAGGCG\n+>TESTERR215189.74999OK\n+CTTATCCGACAAAGCAAACATGCGCCTAGCGAACTGGGCTGGCAAGCTCCTCACGCCCGCAGGAAGGAGAACCCTAGTCCGGGCAGTCCTATCCT\n+>TESTERR215189.74999OK\n+AGAAGCGAATAGGTTCTTGTCTTTTTCAGTGCACGGAGTTGGCATTTCGACCCAAGGTTTGCGTGGCTCCTTCCATTCATTCCAAAGCCAGCGTA\n+>TESTERR215189.75000OK\n+CCACTCCTCCAAGCTATATGTGTGGGGAGAGAAGGAAAGCTCCACACACGCGCACTCGCTCACCACGCCACGCCTTGCCTCGCCGGGCCGGGCGG\n+>TESTERR215189.75000OK\n+GGCAAAAACCCTAGCTTGATTAGATCATCTTTTCGGTAGACGCCAATAGGTCGTATATCTAGAAGAACCCATGATCTCACGCCACGATCGTGATC\n'
b
diff -r 000000000000 -r a4cd8608ef6b test_data/seqs.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/seqs.fasta Mon Apr 01 07:56:36 2019 -0400
b
b'@@ -0,0 +1,3000 @@\n+>ERR215189.32 HWI-ST620_0157:3:1101:4535:2056 1\n+CTGGTCTTTTCATTCTAATACTCTATCCGAGAGTTCTCCGTATTTAGCAAAGCTGTTCCTATCTAACGGAAGGCTCCTGG\n+ATCAAATGACAAAGA\n+>ERR215189.43 HWI-ST620_0157:3:1101:10667:2192 1\n+CAACCGTGAGGGTCTGCTGTGATACCACTTGTAACACCCTATCCTCAAAAGTTGCATTGCCCAGCCCTTATGTGGGGCGG\n+GCAAGAATACACATG\n+>ERR215189.83 HWI-ST620_0157:3:1101:6983:2487 1\n+CAAAAGTTGTAGAAGGAGATGGCGTGGCCACATCTAATCCCGGAGATGAAGGAGAGGATCCAAGGTGTGCTGCCCCGACA\n+GAATCTGGTGGTGGA\n+>ERR215189.264 HWI-ST620_0157:3:1101:15398:3259 1\n+ATTGAATCTGAGTCGCGAGGAACCTTTCCTCGGCGATCCTATCCACCTTCTTGAATCCTTCTTGTCTGCTCACGTCGTCC\n+GGCGATCGTCGGCGG\n+>ERR215189.279 HWI-ST620_0157:3:1101:20167:3416 1\n+CTTCAGCTTCTGAGTTGAGAACGTACAATTGTCCCTCTGCGTCCAAACGCAGACGACCATGTCCGACGTGCTTCGGGCTC\n+CCACAGCAGTCGCAG\n+>ERR215189.295 HWI-ST620_0157:3:1101:9033:3587 1\n+TGTCACGAATGGCGACGCGTACCACCACTTCCTCCTCGACAGCCAACCCTTTGCGCAGCACCTTTGTAGGCAGGTCGACC\n+GACCGAGCCTTCAGA\n+>ERR215189.457 HWI-ST620_0157:3:1101:8995:4637 1\n+CTTGCGACATAGCCTCCTGTAATCCTTAGTGAGTGTTCCGACAATGGAAGTCAATTTCTTTCCATGCCTTGAGCGCCTTC\n+CCTTAGTTGTCGTGG\n+>ERR215189.493 HWI-ST620_0157:3:1101:19976:4694 1\n+CTTTAGCATATGATTGCTTAACATATGCATGAACCGTGGATAATTGAACTAGCAACATCTTAATTTTGCAGTCTGATAAA\n+GCTGATGAACCCACT\n+>ERR215189.505 HWI-ST620_0157:3:1101:7324:4808 1\n+CAGACGTCAAGGATTGAGACTAATAGACCATCTATCCTGGCTTACTATAGGGCACATCATTTGATCTCCGTTGCTGGCAT\n+GCTGCTTGCTTTATC\n+>ERR215189.602 HWI-ST620_0157:3:1101:14342:5350 1\n+CTCCATTCACTAGTTTTCTTTTTTATTAGGACTGCGTTTGCTAATTAGTCGAGGTGCTTACATTCACAGTTGAAACCGGT\n+GCTAAAAGTTTATTT\n+>ERR215189.665 HWI-ST620_0157:3:1101:10881:5838 1\n+AAGTTGTATGACCAATGTTACTAATTTCCATACCTGTACCATTGGCCGCATGCACTTGATCACGTCCATTGTACTTGTCA\n+TGAATTGTCATCTTC\n+>ERR215189.691 HWI-ST620_0157:3:1101:19026:5948 1\n+CTGCTGCGTATCGGCCGTAGCAACACTGTAGGCGGAATCAGCGGTGACTATCTTGATGTTGTCACCGACCCATTGTACGA\n+GGCATTGGTGCATTG\n+>ERR215189.768 HWI-ST620_0157:3:1101:16448:6482 1\n+CGGAAGCTACTACCCGGACAGGGAGGTACTTGTTATTACTCAGAATAACATCCCGGGTACTAGCCAGAACAGAACCCCCA\n+GGCGATTAGCACAAG\n+>ERR215189.946 HWI-ST620_0157:3:1101:16942:7437 1\n+CTGTATACCCCTTTGATAAGTTGCTCCTGCGTTTTTTAGTCCGAAGGACATTGTTTTGTAGCAGAAGGCTCCAAAGGGCG\n+TGATGAACGCTGTTT\n+>ERR215189.1121 HWI-ST620_0157:3:1101:3988:8508 1\n+GTGGTGGAGGCGGTGGAACATTGACGATGTTTTGCGCAGGTACCTGGGCCTGCATCTCCTCCAAAGCCAGACGAAGTGAT\n+AAGATCTCATCCAAC\n+>ERR215189.1175 HWI-ST620_0157:3:1101:9471:8782 1\n+CCCCCTCGGCATGCTTGACAGCAGCAGTTTTCCTCCCCATCCTTGATGTCACGAGCTTCTTCTGTCGCATACGCTCGGGG\n+GCTAAGTGGTGGGAG\n+>ERR215189.1300 HWI-ST620_0157:3:1101:12651:9739 1\n+CCAACCACCGGTCTACCCGCGCCTACGCCTCCCTGACGGAGTAGCGCAATGATCGCTGGCGTCACCCCCGGAGTTCTGCA\n+GCACCGCCCGGTTTG\n+>ERR215189.1342 HWI-ST620_0157:3:1101:16786:9956 1\n+AATGCACCTGTGTTGCATCGTTTTTCGTGCCGTAACCTGATAGAGTCGGAGTGCCCGATCTTTCGGTGAGTGAAGATAAA\n+TTCGACTTGGTGGAA\n+>ERR215189.1366 HWI-ST620_0157:3:1101:8081:10001 1\n+GGCCATCCCATTCAGGACTCCCAGCTTGTCCTGAACCTCCTCCGCAGCCTCAACCCGTGCTTCTCCAACACCGCTAGCTT\n+CTCATCCTTCGCCCA\n+>ERR215189.1459 HWI-ST620_0157:3:1101:7970:10656 1\n+CACAAGATATCATACAGAGGAAGTCACTCATCCATGTATGAACTCAGAGTTGCAGGATCCCCAGAGAAGAACAATTACAC\n+ATGCTTACAAGGCTA\n+>ERR215189.1464 HWI-ST620_0157:3:1101:10931:10567 1\n+TTCTATCGCCCCGTTCGCTTGCAGGCCGCTAGCCGCCACGCCACCCCAGGCGATGTCCCGTGGCAGCGGACGGGAACGCC\n+GGCCTGAAGCTCCGC\n+>ERR215189.1469 HWI-ST620_0157:3:1101:13348:10534 1\n+CGAGTCGAGCGTGTGCTTCTCCGGGCCGCACGGCGGCGCGCGCGTCGTGGTTGAGAGATTCGGTGGAAGCATTCGTACGC\n+AGCCGTGTGTGAAAC\n+>ERR215189.1614 HWI-ST620_0157:3:1101:19541:11361 
1\n+CCCGCGCGGACTAGACGGGCGAGAGGAATATGGAAAGAGGGCCCTTCTGCGAGTATTATTCTCGTCAATCTTGACCGGTC\n+AGGTTCGATCCAACG\n+>ERR215189.1677 HWI-ST620_0157:3:1101:9667:11901 1\n+CACGGGGGCCCACAGAGCCCCGGAATCCTACTAAATTTGGCCCGTCGGACCATGAAGCACGAAGCCACTGACGGCCTGCT\n+CCTACTCCATCTCTC\n+>ERR215189.1699 HWI-ST620_0157:3:1101:19789:11774 1\n+AGATATTAGGTCAGTCATGTTATGGTATGGCACCTACCATTCGTAGATGCGCTGAAGCCCAGCAGTGGACACCTTGTTCC\n+AGTAGCTTGGGTCTT\n+>ERR215189.1732 HWI-ST620_0157:3:1101:10974:12002 1\n+GTTCTCGTGTTAGCCCTGGTAGCAGCATTTGAGGAGAACCCTGTGTGCGCAGAGAAGTGTTCCCTTTGGTTGGAACTGCC\n+TGGGGAGACGGTAGA\n+>ERR215189.1825 HWI-ST620_0157:3:1101:16657:12727 1\n+CTCTCGTGGATCTTGGCGT'..b'TTGCGCTCCTCGGGG\n+CCGTGGCTGAGCGCG\n+>ERR215189.73346 HWI-ST620_0157:3:1103:6211:28776 1\n+CATCGGGATCCTACCTCGATCTTGATGATTCCCAGCTCCAGAACATTGATGTGAATCTTCCTCACAAGGTGATCACGCCA\n+GTTAGGGCTTAGCTA\n+>ERR215189.73367 HWI-ST620_0157:3:1103:18533:28811 1\n+GTATGGGGCCCCCGCCTAAGCGCTACCCTGAGATCGGACTCGAAGGGTAGTAAGACATCCCTTAGGGATGCCACTGTAAG\n+GACAAGAAACATAGC\n+>ERR215189.73381 HWI-ST620_0157:3:1103:5668:29236 1\n+CATGTGTGTAAATCATTCGTGACCTGCGTGTCGCGCGTACGGCAACACGGCAGGAGCCATATGTGTTGTCACTTTATTGT\n+ATTTAATGGTCTGCG\n+>ERR215189.73426 HWI-ST620_0157:3:1103:2984:29480 1\n+CCTCCTCCTCCGCTGAGTTCGCTCGCCCGACCCAATTCCCGACGGCGATCGATTTCCAAACCAAAACCAAACACCTCCCC\n+CGCGGCCGCGGGCCC\n+>ERR215189.73449 HWI-ST620_0157:3:1103:12258:29264 1\n+GGTGTTTGCCCTGAAAAGAAAACGCTGTCGCCGCCGCCGTCTCCGGCCGCCCCTCGCTTGCATCCCCTTCCTCCACACTC\n+CGCAGCACGCCGCCG\n+>ERR215189.73528 HWI-ST620_0157:3:1103:9257:29897 1\n+TGGCTCTACGAACTTGATTGAAGAGTGGCCGCACAAGAGTTTTGAAATTTGAATTTCTGGCCATACGTTTGCAGAAATAA\n+TCTGTACGGCGCCTT\n+>ERR215189.73570 HWI-ST620_0157:3:1103:6153:30149 1\n+AACACCAGAATGGGAGCAATCAATCAACCTAAGAGCAATGACTCTCTTCTCTAATTCCCAATCAGCAGAAACAAAATGTG\n+CAACAATACTAAGGT\n+>ERR215189.73604 HWI-ST620_0157:3:1103:4311:30383 1\n+ATGATGCTCAAGGATTTCGAAGGCAAGGTGTCAAACGCTCGTGGCGCTCTCTGCGTCGACCTCACCATCGGCAGTAAGAT\n+CCTTCCTACCATGCT\n+>ERR215189.73609 HWI-ST620_0157:3:1103:9677:30394 1\n+AAATTTCATTGTACCACATCTAGTATGCTTATAATGTGCTTGCTTAGCCTTTCAATTGTTTAGATTAAATTTTTACTCTT\n+GATATATATCTACTA\n+>ERR215189.73692 HWI-ST620_0157:3:1103:19251:30711 1\n+GTACTGCTTGCAGTTTCTCTTCCAGTGACCAAGTTCATGACAGTAAAAGCACTCTTTATCTGGAGCAGGTCCAGCTTTAG\n+CCTTGGGCGCTGGGT\n+>ERR215189.73727 HWI-ST620_0157:3:1103:13244:30902 1\n+CGTCTCAAACCAGTGAATGGGGGAAGTCTGAAGCTTAGGCTCCTTTGCTATTGTTAACCTAAGCTTAGGGGCGTGTTTTC\n+CTGATTCAAATCAAA\n+>ERR215189.73789 HWI-ST620_0157:3:1103:18523:31141 1\n+AACTTTTTGGTGACGTCAGCGGACGAGTGCCAGAAGAAGCAGAGGAAGCCTTGGCGCCGGCCGCATACATCGAGTCGATG\n+AGGATTCCGGTCTTC\n+>ERR215189.73808 HWI-ST620_0157:3:1103:7321:31275 1\n+CGTGCGCACGGTCCTCCTGCACCACCTCGTCTAAAGCGGTCGTCCACATCTCAACCCACTGAATCATCCTCGGAGCCCAC\n+TAGACCCCCAGCCCA\n+>ERR215189.73833 HWI-ST620_0157:3:1103:19728:31401 1\n+ATGATAGTAGCTTAGTGGCTGACTGGAACTTGGCATGGTCAGGTATAAAACTGTAAAAGGTCTACGGTACCATGTGACAT\n+GCTACATTGGCGTGC\n+>ERR215189.73908 HWI-ST620_0157:3:1103:14001:31835 1\n+GATCGCGCACATGTGTATGTCACCATCAGGCACCACCAACCTAGCTCCTCGCGATCTCTAGCTGCTTCCACGCACCAAGA\n+GATCACACTGCACTA\n+>ERR215189.73928 HWI-ST620_0157:3:1103:20497:31750 1\n+GTACACTATCACCCTGGCAAGGCCAATGTGGTGGCCGACGCTCTTAGCCACAAGCCCTGTTGCAACTGCTTGTTGGCAAC\n+AACCTTCACCGAAAC\n+>ERR215189.74174 HWI-ST620_0157:3:1103:21108:33384 1\n+TGCAAAATTGCTTCATCATTATGCCTGTGGCCTTTTTCTTAAGCAGACCACGGTATCTTAAAACCTTGAAGCTGAATACT\n+ACGTTTTTTTAAGGG\n+>ERR215189.74267 HWI-ST620_0157:3:1103:12391:33904 1\n+GAGGCTGTGCTCAATGATTTCCCTGTCCACCCCCTTGAGATCACCTGCAGACCAAGCAAAGATGTCTTCGTTCTTGTTGA\n+GGCACTCGACGAGGT\n+>ERR215189.74547 HWI-ST620_0157:3:1103:7013:35633 
1\n+AAAGCTATACAAATTATGGATGTGTTACTTCCTGTGATAAATTCTTGTATCATCTACTGAACCAGGAAATGATACAAATG\n+ACATTCGGGATCTCT\n+>ERR215189.74592 HWI-ST620_0157:3:1103:6438:35991 1\n+GGTAAACTTGCGACATAGCCTCCTGTAATCCTTAGTGAGTGTTCCGACAATGGAAGTCAATTTCTTTCCATGCCTTGAGC\n+GCCTTCCCTTAGTTG\n+>ERR215189.74609 HWI-ST620_0157:3:1103:13206:35943 1\n+CCCTCCCCATGATCCAAAACCCAAATCCCCTCCTCTGCACCCATCTCGTCGCCACCTCGACCTCTACCCGCAGCGACCTC\n+ACCACCACGCCACTC\n+>ERR215189.74663 HWI-ST620_0157:3:1103:16554:36157 1\n+GGCTCTGCACTCCACCGTGGCGTTGCGTTGCCAAGCCATGCCATCGCCACTAGTGGATGCTAAGGGTGCGTTTAGTTTCA\n+GGACTGAAGTGGGAC\n+>ERR215189.74818 HWI-ST620_0157:3:1103:2734:37008 1\n+GATCTTGCCTTTTCCTTGTGTGGGCGCGCGTGCACAGGTTGGGAGATGGTCGCGCGTGTCGTGCTCGCTGTCCAACGAAG\n+ATCGGAAGAGCGGTT\n+>ERR215189.74842 HWI-ST620_0157:3:1103:13395:37042 1\n+CCATGACTATTGCATTTGATGATTGTTTTCTTGGTGTGGAGATCCTGTTGGGGAAATCACGGGCCGAGGGTCCTGGAGGA\n+CCTCACCGTGGAATC\n+>ERR215189.74897 HWI-ST620_0157:3:1103:17016:37406 1\n+ATCTCACGTAACCAAAACTTGTTTATACATTGACAGAGAAGAAGAAAAGGAAAAATAAACTACACCACATCTCAACAACA\n+CACGGGCAGTAAAGC\n+>ERR215189.74951 HWI-ST620_0157:3:1103:8040:37766 1\n+GCGCAAATCTAGGACCAACACACGAACGCACTCGAACGTGCAGCATGTGGAGCCGCTCCTTGCAGCACCTCTACGAGATC\n+CAAGTCAGACTAGTT\n'
b
diff -r 000000000000 -r a4cd8608ef6b test_data/single_output.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_data/single_output.fasta Mon Apr 01 07:56:36 2019 -0400
b
b'@@ -0,0 +1,267832 @@\n+>ERR215189.1 HWI-ST620_0157:3:1101:1659:1941/1\n+GTGAGGAGTAGTCATCGGCCGCCAACTGATCCTGGAGTGAAGTGTGTCTCGACCCTGTCCTCAATTCCCATGGTAGTACCGCCTCGTGACCATAC\n+>ERR215189.1 HWI-ST620_0157:3:1101:1659:1941/2\n+CCGATCAAGGAACGATGTTTACGTCAGGAGAATTCGAAGAGTTCACGGCCGATATGGGAATCAAATTATTAAACTCCTCTCCGTACTATGCCCAA\n+>ERR215189.2 HWI-ST620_0157:3:1101:1914:1875/1\n+ATATGCTTAAACTCAGCGGGTAGTCCCGTCTGACCTGGGGTCGCGGTCCGAGCGCCTAAACGCTTCGGTCGTGTATGGGTCCTTGAGGCCAATAA\n+>ERR215189.2 HWI-ST620_0157:3:1101:1914:1875/2\n+AAGTTTGGGCTGCCGGCATAACTTGTCGGGCACCGCACGTGGTGGGCGACATCTAGTTGTTCTCGGTGCAGTGTCCTGGCATGCAGCTGGCTTAT\n+>ERR215189.3 HWI-ST620_0157:3:1101:2032:1928/1\n+ATCTGGGGCCAACCTCTTGGTTTGGGGCCGGTATGTGCACTTGGAGGGCTCGGGGCGGCCGGGGTTGGAGAGGGGAGAGCACGATGGAGGGCAGT\n+>ERR215189.3 HWI-ST620_0157:3:1101:2032:1928/2\n+CTGCCGCCGCCCCTCTTTCCTCCCCTGCGCCAAGCCTGCCGCAGCCCCCGGTGACGAGGACCTCAGCGATAGTCCAGTTCCCCTCTTGCGACGAT\n+>ERR215189.4 HWI-ST620_0157:3:1101:2535:1844/1\n+AATACCCGAAATGCATGCCGCCTCACATTTCCGCATACGCAGTCTTATTACTAGTACTGGTAGTAGCAGTAGTAGGACTGCAGGAGACCTGCAGC\n+>ERR215189.4 HWI-ST620_0157:3:1101:2535:1844/2\n+CGGACAGGCTAGCCAGGCAAAGTCATTAAAGGCCAACGACGACGACGACGTGGGCCGTGGCCGAGTCGCACGTTGGTAGCTGGGCCGTACCGAGG\n+>ERR215189.6 HWI-ST620_0157:3:1101:3004:1999/1\n+ATTTTGTTAAGGCCCGACACTGGTAGTGTTCGTGATCAAGTGTTTGAAAGTACTAATCTCATACCTAGTATGGGATGGGGAAGCCTAGTACCTGA\n+>ERR215189.6 HWI-ST620_0157:3:1101:3004:1999/2\n+CTAATCTCACGACATCCGCGCACCACAAGGGTTGCTTCCTGTGTCGGCCGTCCCCATCAATTCCCAGGCACGTGTCAGGTCCAAACTTCCCTTGG\n+>ERR215189.7 HWI-ST620_0157:3:1101:3673:1834/1\n+GCCCACCGACGGGTGGTCCTCTGGCCAATGGGCCCCGCGACGCAGCAGCTCTGTACGTGGCGGCCCCGGGTGCGCGGAGAGGGGCTCATAAATTT\n+>ERR215189.7 HWI-ST620_0157:3:1101:3673:1834/2\n+GTCAGAAGACGCACGGACGCAAAGGTGCGAGCGCGAGTGGCACAAACACGTTTCCCTCCGTCCTCCCGCGCGCCCCACCGGAAAAGTCCAGACCC\n+>ERR215189.8 HWI-ST620_0157:3:1101:4115:1913/1\n+AGTGAATTGAAGGCCCAGATGAAATCTGGGTTATTTCTTAATGAACATGAGGTTTGTGCTACTCAAGTTCTAATTCATTTGCTTAAAGCGTTATT\n+>ERR215189.8 HWI-ST620_0157:3:1101:4115:1913/2\n+TGCTTTTTGTCATGTGTGGATACACTAAATCCAGATATGTACAGCTAAGACATGTTAAAATAAAAAATTTACATTAGTTTATGATTTATTAAATT\n+>ERR215189.9 HWI-ST620_0157:3:1101:5855:1853/1\n+CTAACTTCCAGCACTTCACCACGTGTATCATATGTGTGCCGAGTTCCGCATCACGAGTATGCGAGGGTTGATAGGGAGTTCTATTCAAAGGACCC\n+>ERR215189.9 HWI-ST620_0157:3:1101:5855:1853/2\n+CCCTTCGTGCCAGCACTCACGAAAGCGTCCCCAAGCATTACCGTTCACTATAGAAACGACACTCATGCATTTAACGCTGCATGGACAGCACAACA\n+>ERR215189.10 HWI-ST620_0157:3:1101:7671:1854/1\n+GATCGTGTTACTAACATGTAGTGCCTAAAAAATTACAAAATGTAGGAAAGGCATAATCATTGGTGGATATTGCGCCCGGGTATATTCGGAATCAT\n+>ERR215189.10 HWI-ST620_0157:3:1101:7671:1854/2\n+CGAAACATTGTAAGAAAGAGTTAATAGCGACAAATATCGAACCCGAGGGAGTACGTTACTACTGAAATGCAACGACGAATACACTTGAGCCAAAA\n+>ERR215189.11 HWI-ST620_0157:3:1101:7883:1898/1\n+AGCACCGAGCCGGTCGCCACAATGCGTGGCTGATTGCCGCTAAGCCAACATGCAGTTGCAAACGCACTGCCGTACTCTAACCAGTCAACACTCCA\n+>ERR215189.11 HWI-ST620_0157:3:1101:7883:1898/2\n+TCTCCGGGTGAAAACCATGCTTGAGTCCTCAGGCGGGCGATGGCGGCGCTTTAGCATCGTTTCTTTGGAGATCATGTTCCGGCCTTTTTCTCGGC\n+>ERR215189.12 HWI-ST620_0157:3:1101:8663:1847/1\n+GGCTCCTCCGAGCCCATCGCGACGCGAAGGTAGAAGTCTCCGACCACGCGTTGCGTGCCGTAGCGGATCCACGCGTCGATCTGGTCGGGACGGGG\n+>ERR215189.12 HWI-ST620_0157:3:1101:8663:1847/2\n+ATTCATCGGCGGCTTCCTGGACTGGGTGCTCTCCCAGCGCGGCGACGTCGACATGGACTCTCTCCTCATCTCCATGAAGACAGGGGACTGCCCCC\n+>ERR215189.13 HWI-ST620_0157:3:1101:8777:1872/1\n+CATCGCCTACGCCGCCCAATAAAGGAGGATGCGCCGCCCGGTCTCCCGCTCGGCCTAGCGGTAGGGGCGTGTAGCCGCGCCTCCTGGTCAGCACG\n+>ERR215189.13 
HWI-ST620_0157:3:1101:8777:1872/2\n+GAAGGACCGAGCCCCTGGGGTCGCCGCCTTCCCTCCCGGTCTTTGCCTCCTGTCAGCATGACCGCTGAAGGGGGATGGTTGTCGTTGGGCCCCGT\n+>ERR215189.14 HWI-ST620_0157:3:1101:9593:1927/1\n+GACTTGATTGATGGTAATAAGCTATCAGTGCTTAGATTCATTACAGCTAGAAGTAGAAATTGCTTAATTTGCTGCTTACGTTATCTGATTTGTGT\n+>ERR215189.14 HWI-ST620_0157:3:1101:9593:1927/2\n+GAATAATTCCACAAGAATAGTACATGCAATTAGAGAAAGGGTTTTAATTGAAGCCTAATAAACACAGAAACAAACCAAAGCATGCAATTGAACAT\n+>ERR215189.15 HWI-ST620_0157:3:1101:9720:1969/1\n+AATTTGCCTTGGAAATAACGTGAAGAAGAAAATGCACTTTCAAAGCTAGAAAACGCAAAACAGCAAGATACGCCTTCGTCAGACGTATACGTCTC\n+>ERR215189.15 HWI-ST620_0157:3:1101:9720:1969/2\n+GT'..b'CGAAGCTTTCTTGGCCTTGCAGGTTTCTACCGTCGGTTTGTTCGTGATTTTAGCTCCATTGCAGCGCCTCTACATGAGCTTACAAAGAAG\n+>ERR215189.74986 HWI-ST620_0157:3:1103:4178:38055/1\n+CAACGTTGGTAGCAACAGGCCGCTACCAGGAGGACCGGCTGAGATGTGGTGTCCAAGGGCACAGGTTTTTGCTCCACTGTCGATCTTTGCATGAA\n+>ERR215189.74986 HWI-ST620_0157:3:1103:4178:38055/2\n+CCCTTTTCCTAGAACTTGTGTGCTGCCGAATCTTGACAACACTATCTGTAGAAAGTTTGTTTTCATGCAAAGATCGACAGTGGAGCAAAAACCTG\n+>ERR215189.74987 HWI-ST620_0157:3:1103:4374:38184/1\n+GCCAGGCTTTGTTGATCCATATGATCTGTACGATATGGAAGAAGCGGACGAGGATGAAATAGAGGAGCTGAAGAAATATGAGTTCGCTTTGCTGC\n+>ERR215189.74987 HWI-ST620_0157:3:1103:4374:38184/2\n+TTCAGCATACTTGCGACAGCACGGAATGCAGCAAAACGAGGGTCGTCGCAGAGATAGTCCTCCAATGCACGATGCCTACCCTCGAGTCCAGGGTG\n+>ERR215189.74989 HWI-ST620_0157:3:1103:7120:38094/1\n+ATGACTTGGCCTCATCCTCTCCTTCCTCCGGCTTAACACCGGCGGTCTGTTCAGGGTTCCAAACTCATAGTGGCAACTAAACACGAGGGTTGCGC\n+>ERR215189.74989 HWI-ST620_0157:3:1103:7120:38094/2\n+GAGGGGTGCCCTCGGGAACGCGGACACAGGTGGTGCATGGCTGTCGTCAGCTCGTGCCGTAAGGTGTTGGGTTAAGTCTCGCAACGAGCGCAACC\n+>ERR215189.74990 HWI-ST620_0157:3:1103:7067:38203/1\n+AAAGATCGTCGTACGATGTCCGATAATTTTATACGCGTAGCGGCGTAGGCCGTGTTCACGGAACGTTCCGTCACATCCCCCGTATCAACCTGAGC\n+>ERR215189.74990 HWI-ST620_0157:3:1103:7067:38203/2\n+AGCGGCCGCGGGCACGGACAATCCGGGACAGCAGGGCCTCGCCGGCATTATTACCCCCGCCGTGCCGCCGCTTAATTGCCACCGTGGAGCCGCGG\n+>ERR215189.74991 HWI-ST620_0157:3:1103:7453:38152/1\n+GGAGAAAGAGAATAATTGTTGGCACAAGTCGATAAGGTCAGATAGTCCTTACGTTGTCTAACGTATGGTGCAGGATGTCATGACTAGTCCAACTT\n+>ERR215189.74991 HWI-ST620_0157:3:1103:7453:38152/2\n+GTTGTTCCGACAGTGGCGGTCGTGAACATCCATTACTAGTCTTTAACGGCCTATTCTAAAATGCCTTTTCAGTCTTAATTCCCAACCTTTGGAGG\n+>ERR215189.74992 HWI-ST620_0157:3:1103:7467:38189/1\n+GCTGGTTAAGCAGAAGCTTTGCCGTTTTGCTTAATATTGCAAGGAGGCCATTAGCATATAAATAAATAAGCTTTTAGCAGCAGGTTTCATCCGTG\n+>ERR215189.74992 HWI-ST620_0157:3:1103:7467:38189/2\n+TTTTTTACAAGGACTGGATTTGCTAACCAGTCGGGGTGCTTACGTTCACGGATGAAACCTGCTGCTAAAAGCTTATTTATTTATATGCTAATGGC\n+>ERR215189.74993 HWI-ST620_0157:3:1103:7344:38214/1\n+GGGAGAGCAGAGCTGGGCTGCTGCCACACAGCTGTTTGCTGCTGATGGCAAAAGCGGTGTCACCGTGCCACGTGCCGGCGCAGGTGTGCGACGTG\n+>ERR215189.74993 HWI-ST620_0157:3:1103:7344:38214/2\n+CCTTTCCCGGATATGATATTTAGTAGAGTGAGTTGACAGAGCCAGGAAGAGCTAAGAGAAAAAGCACAGTTAGGGCGATCTAGTTTCTTTTTCTT\n+>ERR215189.74994 HWI-ST620_0157:3:1103:8323:38070/1\n+CTTTGGTTTTCAGCCCATCTCTCCCCATCACTCTACAATGCTTCTGCTTTAGCGTGTCTTTCATTGAACCACGCAACAAACCTGTAGAAGGTGAA\n+>ERR215189.74994 HWI-ST620_0157:3:1103:8323:38070/2\n+GGTGTGAAGGATAATTTTGAACTGTTTCAGGTTCCTGCTCGTCAGCTTGAGGACTATTACTTTCAGCACACGTTGGAGCAGGTCAGGTCCACAAC\n+>ERR215189.74996 HWI-ST620_0157:3:1103:9716:38206/1\n+TTGTTCAGCCACACCATAAGCCACCATGATAAGCTTGTTTTTCACGGGAGTAACAACCTCACAAGGAACCCTCCCAGTAATGTTGTCCACGTGGT\n+>ERR215189.74996 HWI-ST620_0157:3:1103:9716:38206/2\n+TGCAAGAACACGTAGCTTTGACAGAGGCAAGAATGGAGGAAGCAATCGACCGGCGTGTGGCTCTAGCTCTTAGCAAACAAGCCATCAGGGGCTAA\n+>ERR215189.74997 
HWI-ST620_0157:3:1103:11145:38006/1\n+TGGCGAGGAGGCGCTCGCGCGTCTGGCCGACATGGGCCGGCTAGGCGTCCCCGTCGACAGCTACAGCTACGCGCACGGGCTTAAGGCCTGCATTG\n+>ERR215189.74997 HWI-ST620_0157:3:1103:11145:38006/2\n+CAGGGCGCTCATTCTTGGCGTAGCATCCAATCATGGCACTCCATGACACAACATTCCTCTCGGGCATCCAAGCAAACACACGCTCTGCATAGGTG\n+>ERR215189.74998 HWI-ST620_0157:3:1103:11193:38148/1\n+GTGTTCCAGGCGGCGTAAGTTGTGGAACAATTCCGCAACTCTTTAGATGATTGCTAAACTCGTGGCTCAAATACTCGCCTCCACGATCAGATCGT\n+>ERR215189.74998 HWI-ST620_0157:3:1103:11193:38148/2\n+GTCTGAAACCTTTGAAAAGTTCAAGGAATTTCAGAATGAAGTTGAAAATCAGCGTGGCAAGAAAATTAAGGCCTTACGATCTGATCGTGGAGGCG\n+>ERR215189.74999 HWI-ST620_0157:3:1103:11027:38205/1\n+CTTATCCGACAAAGCAAACATGCGCCTAGCGAACTGGGCTGGCAAGCTCCTCACGCCCGCAGGAAGGAGAACCCTAGTCCGGGCAGTCCTATCCT\n+>ERR215189.74999 HWI-ST620_0157:3:1103:11027:38205/2\n+AGAAGCGAATAGGTTCTTGTCTTTTTCAGTGCACGGAGTTGGCATTTCGACCCAAGGTTTGCGTGGCTCCTTCCATTCATTCCAAAGCCAGCGTA\n+>ERR215189.75000 HWI-ST620_0157:3:1103:11413:38069/1\n+CCACTCCTCCAAGCTATATGTGTGGGGAGAGAAGGAAAGCTCCACACACGCGCACTCGCTCACCACGCCACGCCTTGCCTCGCCGGGCCGGGCGG\n+>ERR215189.75000 HWI-ST620_0157:3:1103:11413:38069/2\n+GGCAAAAACCCTAGCTTGATTAGATCATCTTTTCGGTAGACGCCAATAGGTCGTATATCTAGAAGAACCCATGATCTCACGCCACGATCGTGATC\n'
b
diff -r 000000000000 -r a4cd8608ef6b test_data/single_output.png
b
Binary file test_data/single_output.png has changed
b
diff -r 000000000000 -r a4cd8608ef6b test_run1.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_run1.sh Mon Apr 01 07:56:36 2019 -0400
b
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+# same as the galaxy test:
+
+./single_fastq_filtering_wrapper.sh -a test_data/ERR215189_1_part.fastq.gz -o test_data/single_output.fasta -G test_data/single_output.png -c 10 -N 0 -p 95
+
+
+echo "single fastq filtering with defaults"
+./single_fastq_filtering_wrapper.sh -a test_data/ERR215189_1_part.fastq.gz -o tmp/test1.fasta -G tmp/test1.png -c 10 -N 0
+
+echo "single fastq filtering with with sampling"
+./single_fastq_filtering_wrapper.sh -a test_data/ERR215189_1_part.fastq.gz -o tmp/test2.fasta -G tmp/test2.png -c 10 -N 0 -n 500
+
+echo "single fastq filtering with contaminant removing"
+./single_fastq_filtering_wrapper.sh -F tool_data/organele_ref_and_phi-X174.fasta -a test_data/ERR215189_1_part.fastq.gz -o tmp/test3.fasta -G tmp/test3.png -c 10 -N 0 
+
+
b
diff -r 000000000000 -r a4cd8608ef6b test_run2.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_run2.sh Mon Apr 01 07:56:36 2019 -0400
b
@@ -0,0 +1,20 @@
+#!/bin/sh
+## same as the galaxy test
+./paired_fastq_filtering_wrapper.sh -a test_data/ERR215189_1_part.fastq.gz -b test_data/ERR215189_2_part.fastq.gz -o test_data/single_output.fasta -G test_data/single_output.png -c 10 -N 0 -p 95
+
+echo "paired fastq filtering with defaults"
+./paired_fastq_filtering_wrapper.sh -b test_data/ERR215189_2_part.fastq.gz -a test_data/ERR215189_1_part.fastq.gz -o tmp/test2.1.fasta -G tmp/test2.1.png -c 10 -N 0
+
+echo "paired fastq filtering with with sampling"
+./paired_fastq_filtering_wrapper.sh -b test_data/ERR215189_2_part.fastq.gz -a test_data/ERR215189_1_part.fastq.gz -o tmp/test2.2.fasta -G tmp/test2.2.png -c 10 -N 0 -n 500
+
+echo "paired fastq filtering with contaminant removing"
+./paired_fastq_filtering_wrapper.sh -F tool_data/organele_ref_and_phi-X174.fasta -b test_data/ERR215189_2_part.fastq.gz -a test_data/ERR215189_1_part.fastq.gz -o tmp/test2.3.fasta -G tmp/test2.3.png -c 10 -N 0
+
+echo "paired fastq filtering with contaminant removing + trimming"
+./paired_fastq_filtering_wrapper.sh -F tool_data/organele_ref_and_phi-X174.fasta -b test_data/ERR215189_2_part.fastq.gz -a test_data/ERR215189_1_part.fastq.gz -o tmp/test2.3.fasta -G tmp/test2.3.png -c 10 -N 0 -s 10 -e 80
+
+
+echo "paired fastq filtering with contaminant removing - full run"
+./paired_fastq_filtering_wrapper.sh -F tool_data/organele_ref_and_phi-X174.fasta -b test_data/ERR215189_2.fastq.gz -a test_data/ERR215189_1.fastq.gz -o tmp/test2.3.fasta -G tmp/test2.3.png -c 10 -N 0
+
b
diff -r 000000000000 -r a4cd8608ef6b test_run_affixer.sh
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test_run_affixer.sh Mon Apr 01 07:56:36 2019 -0400
b
@@ -0,0 +1,2 @@
+#!/bin/sh
+./fasta_affixer.py -f test_data/single_output.fasta  -p TEST -s OK  -o test_data/prefix_suffix.fasta -n 0
b
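Judging from test_data/prefix_suffix.fasta added in this changeset, the affixer test above turns a header such as ">ERR215189.1 HWI-..." into ">TESTERR215189.1OK", i.e. the prefix (-p) and suffix (-s) are attached to the bare sequence name and the description is dropped. A minimal R sketch of that renaming, as an illustration only (an assumption about the effect of fasta_affixer.py, not its actual code)::

    prefix  <- "TEST"   # -p
    suffix  <- "OK"     # -s
    headers <- c(">ERR215189.1 HWI-ST620_0157:3:1101:1659:1941/1",
                 ">ERR215189.2 HWI-ST620_0157:3:1101:1914:1875/1")
    name <- sub("^>", "", sub(" .*", "", headers))   # keep the sequence name only
    paste0(">", prefix, name, suffix)                # ">TESTERR215189.1OK" ">TESTERR215189.2OK"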
diff -r 000000000000 -r a4cd8608ef6b tool_data/organele_ref_and_phi-X174.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_data/organele_ref_and_phi-X174.fasta Mon Apr 01 07:56:36 2019 -0400
b
b'@@ -0,0 +1,1444436 @@\n+>118_J02482_phi-X174#contamination\n+GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTTGATAAAGCAG\n+GAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCA\n+GCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAG\n+TGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGT\n+TGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGT\n+TACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCG\n+AAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT\n+TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTA\n+TTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCG\n+TCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAAACACTGACGTTCTTACTGAC\n+GCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGC\n+CCTGGTCGTCCGCAGCCGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATT\n+TTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATG\n+ACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC\n+TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACAT\n+TTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTC\n+CTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATC\n+CCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGC\n+TAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGC\n+TTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTAT\n+GCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAAC\n+CTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTTGATGGAACTG\n+ACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAG\n+CATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGG\n+TGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATG\n+TTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCT\n+CCTGCTTATCACCTTCTTGAAGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCG\n+CCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT\n+ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGG\n+TAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTT\n+CAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGC\n+ACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTT\n+TTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGAT\n+GCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGT\n+TTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGC\n+CGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAAT\n+CAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCT\n+TCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGT\n+GCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACC\n+CTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCT\n+GGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATAC\n+TCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG\n+TTGACG
CCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAAT\n+GAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGA\n+GATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGC\n+AGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAA\n+ATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGG\n+CGCTACTGCAAAGGATATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAG\n+CTGT'..b'ATAAATTAATTAAATTGAAGTTAATTAATTTATGTTGACATTATACTA\n+TGTATGTGATATTATAATTTTAGTATTGCCAGAATAGCTCAGTAGGTAGAGCAGTGGACTGAAAATCCGCGTGTCACCAG\n+TTCAAATCTGGTTTCTGGCAAATTTCAATGTTATATACATTGATGATTTCAATTAATTAAGATTTTATTTTATTCTATTC\n+TATTCTATTCTATTCTATGATATTCTATTTACAGTAATATCATAGAAACCAAATATTTACAAGATTTTATGCCTTCCTAT\n+GAAGAACATCTAAAATTAGTTAAATTATAACATAATCGTTTTGTTCGTTTTTAACAAAATGTTACCAATTTATTTTTCAA\n+TTCATAAATAAATGAATAATAAAATAAAATTATGTATTGTGTAAATTAAACAACACGAACTTTTTATTCTTTTATTCAAT\n+GCTTTCAAACTGCTTTAGTTGCTAAAGCAGTTTGAAAAACATAACATGTTGTAATTATAATTTATACTACTGCACCAATA\n+GAAACACCAATGACTATTTTATTAAGATCACCAATAAAAATGTTCGTTGTTATCAACATTTTGTATGTTCATTTTGTTGA\n+CTGTGGTGATGATGTATATATATAATAAATATGCCAAAAGAGTAACGAAAAATACACAATCAATTTGCTAGATACACAAG\n+AAAAACAAACGCGTCCATCGTCTAAAGGATAGGACAGAGGTCTTCTAAACCTCTAGTATAGGTTCAATTCCTATTGGACG\n+CAAAATAGAATAAATACAAAAAAGTGTTTGTCTTTTTATTCTTGTTTTTTATCAATATAATGATACATTATTGTATTATT\n+ATTCCATCAGGTAATTGAGTGCTTGAAATCATAAAGAAAATTAGTTTGTATTATTTTAACGTCCTTGTTTCGTAGTTATT\n+GACATATTGAAGAATCATGTTTAATGAATGTTGTATCGTTAGTTATTTTGTATTGATTTGAAATTTCAACAAATGGAAGT\n+TTTTTCTTAACACTTATTTTAAGTCATAATAATTTTTATTACATTAATTTAAGTGAAAATTATTTCTGCATTTGAATTGT\n+TTCAATTGATTCAAACTTCAACATTATTCTATTGTCTAAAATGTGACAACATTTGAACACTTTTAAATGATTTTGTTTGT\n+AATATTTTTTTATATTATACTTACTTAGTTCTTGCAAATTATCAATATAATTTATTAATTAAATATTTGATAATTTAAAT\n+AAAAATTTATTTAAATTATTGTGTTATCCAACTTTACCATTACTGTCTATTCTATTGTTAATTATTAACAATTAAAATGC\n+TTGTCTATTAATATTCAATTGTGTATCCCACCCATTGTTAAAGACAAACGAAATCATTAATCATTTTTATTGAGTCTGTT\n+TAACAAGATTGACAATGATTCAGTTGACCAGCCATACTTTGGGTAGGTTATAATATAACAGTTTACGAGGCCTTTATGGT\n+TAAATTGAATTTATGAAAACATATGCAATTATTGAGGCCGGAGGTGAACAACTTCAAGTAGAGCCAGGCCGTTTTTACAA\n+TATCCGTCATCTGTCTCTCCGAGGTGTAAATTTTTGGGGACAAAACACAAAACTTTTGTTGTATCGTGTATTAATGATTC\n+GTCATCAATCTGCAACTGTGCTTGGTAATCCTTGGGTTCAAAATGCTACAGTAAAAGGTCGTATTTTAGATGCTCGACGT\n+GATGATAAACTTGTAATCTATAAAATGCGCGCAAAAAAGAAAACCAGAAGAAAGCGTGGACATCGCCAAGGTCTTACTAG\n+ATTTGTAGTAGATGCTATTTGTTTGAATGGAAAAGTACTTTTGGACTAAATCAATTGTAGAGTGATTTCTGTTATTCCAT\n+TTTAGATTGACTTGTGTAATTCGATGGTTATATAAATATGATTAAGTATGATACTAAAATTTTTTTAATTGTGTTAACAC\n+TATATAAAAAAATCTTAATAAATATAAGTGACTAACATGTTGGTTTATCCAACAAAGTCACTTATATTTATTAAGATTTT\n+TTTATAAAAATGTAATAAATATAAGTAAATGTTGGTTCTTCCAAAAGGTAAAAACCTTTTGAAAGAACGGGTCAATGATT\n+TATTCTAGCTACAAAACCATATTGTTTTCAATTTGTGTTATGTAAACAAATGAATTAATTTAACACTATAGAGAAACAAA\n+CAGACAGACAAATAGATAGATATATAAATTAATACATTATCTTGTTAATGATTTTCGTTTTAAATAAATATAGTTAATTA\n+TTAGTAAAATTGCAAATGAGACAAGTAACACTACGCTTGCAATTGCAGTAGCTCCTTTATAATCATATTGTTCTAATCTT\n+TGAAATATTAGTACAGAAACTACTAAATCTTTCATAGGAATGTTAGATGCTACTAAAACAATTGAACCATATTCTCCAAT\n+GGCTCTAGAAAAACCAAGAGCAGTTCCAGTCAACAAGGGAGATATCATAGGAGGAAATAACACATTCCAAAATGTTGTCC\n+AAGGTGATGCGCCAATACACCATGCTGCTTCTTCTGTTTCTTCTTCCATACTTTGTAAAACAGGTTGTATGGTGCGTACA\n+ATAAAGGGAAGTGATACAAACATCATAGCTATCAATACACCTAATCGGCTAAATGCTATTTTTATACCAAGCCATGAACA\n+GATTGGTCCCATCCAACCTTTATCACTATATACAGTCATAAGTGTTAGACCTCCTACTGACGTAGGAAGTGCGAAAGGAA\n+GATCTACTGCAGCATCTAAAAAATTTTTTCCTGGAAAACGATATCTTACTAAGATCCAAGCAAGGATTAAACCTAAAAAA\n+GCATTAATAACAGCTGCAAGTGCTGCAGTTAAAAAAGTTACTTTATATGCAGATAAAACAACAGGTTCAGTTACTGCCTG\n+AAAAATAGTATACCA
AGATTGTTCTTTAGTCCTTAATAAAAGAGCTGTAACAGGTAAAAATAATATTAATATTCCGTAAT\n+GTAAAGCGGCTATTAATAAAAATTCAAAATAAGTAAAAAAACGAATATCCCGTTTACGAATATTTAAGAGTACTGTTCGA\n+GATATCACACACAATAGAATCATATTATAATACTGTTATTATTTTTTGGTCACTGTCACACAATAATAAATACAAATATA\n+GCAAACATCAATAAATTATATTGTGATGTGACTATTTATTATGTTGATTTCCTATCTATTTAGATAGAACTGTCTTTAAA\n+TAGCTAGATAATTTGTGTAAACATAAACATTTCGTATAAAAAAGTATTCTATAGTATTTGTCATTTATAATAATATATAA\n+GTTATGATATATGTTAATTTTGTTATTAAAACAGACAATTAATTTATGTTATGCTAAGCTATATTTTTTTAGTACAATGA\n+ATAAAAAAGAAAAGACATAGAATTTATAAATTTAATTTAATGTTATAAAGATAATTTTTTTATTGTATACTAATTATACA\n+ATAAAAAACAAAATGTTTAAACAAGACTGAAATACATTAAAAAATAAAACAGATTCAATTGAAATCACTAAACACATAAA\n+ACCTTGAATTACATACAAAGTAAATGAAAATAAGCATATATAAAATTATTCATTTTCAAAACCATATATTTTGTTATTAG\n+TACAAATTGGAAATTGGTCACTTATTAATTCTTTCAAATTTATGTGGATACATCATTCTTTAGAATACTTATATATGAGA\n+GGCATAATGTTTTGATTTCCTATGTAAACACATAAAACTAAATTGCATTATGAATTATTTCTAACAATACTTCTATGATG\n+TTTTCAATTCACTGTTATTGTAAGGACAATCAGGATTGAACTGATGTCTTCCACCACGTCAAGGTGGCACTCTACCGCTG\n+AGTTATGTCCCT\n'