
Changeset 4:cb56cc1d5c39 (2016-03-21)

Previous changeset 3:e1a14ed7a9d6 (2016-02-24) Next changeset 5:8159dab5dbdb (2016-04-12)

Commit message:
Updates to the palfilter.py utility.

modified:
pal_filter.py
test-data/illuminaPE_assembly.out
test-data/illuminaPE_assembly_after_filters.out

diff -r e1a14ed7a9d6 -r cb56cc1d5c39 pal_filter.py
--- a/pal_filter.py Wed Feb 24 08:25:17 2016 -0500
+++ b/pal_filter.py Mon Mar 21 06:52:43 2016 -0400

@@ -1,87 +1,140 @@
 #!/usr/bin/python -tt
-#
-# pal_filter
-# https://github.com/graemefox/pal_filter
-#
-################################################################################
-# Graeme Fox - 15/02/2016 - graeme.fox@manchester.ac.uk
-# Tested on 64-bit Ubuntu, with Python 2.7
-#
-################################################################################
-# PROGRAM DESCRIPTION
-#
-# Program to pick optimum loci from the output of pal_finder_v0.02.04
-#
-# This program can be used to filter output from pal_finder and choose the
-# 'optimum' loci.
-#
-# For the paper referncing this workflow, see Griffiths et al.
-# (unpublished as of 15/02/2016) (sarah.griffiths-5@postgrad.manchester.ac.uk)
-#
-################################################################################
-#
-# This program also contains a quality-check method to improve the rate of PCR
-# success. For this QC method, paired end reads are assembled using
-# PANDAseq so you must have PANDAseq installed.
-#
-# For the paper referencing this assembly-QC method see Fox et al.
-# (unpublished as of 15/02/2016) (graeme.fox@manchester.ac.uk)
-#
-# For best results in PCR for marker development, I suggest enabling all the
-# filter options AND the assembly based QC
-#
-################################################################################
-# REQUIREMENTS
-#
-# Must have Biopython installed (www.biopython.org).
-#
-# If you with to perform the assembly QC step, you must have:
-# PandaSeq (https://github.com/neufeld/pandaseq)
-# PandaSeq must be in your $PATH / able to run from anywhere
-################################################################################
-# REQUIRED OPTIONS
-#
-# -i forward_paired_ends.fastQ
-# -j reverse_paired_ends.fastQ
-# -p pal_finder output - the "(microsatellites with read IDs and
-# primer pairs)" file
-#
-# By default the program does nothing....
-#
-# NON-REQUIRED OPTIONS
-#
-# -assembly: turn on the pandaseq assembly QC step
-# -primers: filter microsatellite loci to just those which
-# have primers designed
-#
-# -occurrences: filter microsatellite loci to those with primers
-# which appear only once in the dataset
-#
-# -rankmotifs: filter microsatellite loci to just those with perfect motifs.
-# Rank the output by size of motif (largest first)
-#
-###########################################################
-# For repeat analysis, the following extra non-required options may be useful:
-#
-# Since PandaSeq Assembly, and fastq -> fasta conversion are slow, do them the
-# first time, generate the files and then skip either, or both steps with
-# the following:
-#
-# -a: skip assembly step
-# -c: skip fastq -> fasta conversion step
-#
-# Just make sure to keep the assembled/converted files in the correct directory
-# with the correct filename(s)
-###########################################################
-#
-# EXAMPLE USAGE:
-#
-# pal_filtery.py -i R1.fastq -j R2.fastq
-# -p pal_finder_output.tabular -primers -occurrences -rankmotifs -assembly
-#
-###########################################################
+"""
+pal_filter
+https://github.com/graemefox/pal_filter
+
+Graeme Fox - 03/03/2016 - graeme.fox@manchester.ac.uk
+Tested on 64-bit Ubuntu, with Python 2.7
+
+~~~~~~~~~~~~~~~~~~~
+PROGRAM DESCRIPTION
+
+Program to pick optimum loci from the output of pal_finder_v0.02.04
+
+This program can be used to filter output from pal_finder and choose the
+'optimum' loci.
+
+For the paper referncing this workflow, see Griffiths et al.
+(unpublished as of 15/02/2016) (sarah.griffiths-5@postgrad.manchester.ac.uk)
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This program also contains a quality-check method to improve the rate of PCR
+success. For this QC method, paired end reads are assembled using
+PANDAseq so you must have PANDAseq installed.
+
+For the paper referencing this assembly-QC method see Fox et al.
+(unpublished as of 15/02/2016) (graeme.fox@manchester.ac
[...]
e output file
 assembly_seq = (assembly_sequences_index.get_raw(x).decode())
-# fasta entries need to be converted to single line so sit nicely in the output
- assembly_output = assembly_seq.replace("\n","\t")
+ # fasta entries need to be converted to single line so sit
+ # nicely in the output
+ assembly_output = assembly_seq.replace("\n","\t").strip('\t')
 R1_fasta_seq = (R1fasta_sequences_index.get_raw(x).decode())
 R1_output = R1_fasta_seq.replace("\n","\t",1).replace("\n","")
 R2_fasta_seq = (R2fasta_sequences_index.get_raw(x).decode())
 R2_output = R2_fasta_seq.replace("\n","\t",1).replace("\n","")
 assembly_no_id = '\n'.join(assembly_seq.split('\n')[1:])
 
-# check that both primer sequences can be seen in the assembled contig
+ # check that both primer sequences can be seen in the
+ # assembled contig
 if y or ReverseComplement1(y) in assembly_no_id and z or \
 ReverseComplement1(z) in assembly_no_id:
 if y in assembly_no_id:
-# get the positions of the primers in the assembly to predict fragment length
+ # get the positions of the primers in the assembly
+ # (can be used to predict fragment length)
 F_position = assembly_no_id.index(y)+len(y)+1
 if ReverseComplement1(y) in assembly_no_id:
- F_position = assembly_no_id.index(ReverseComplement1(y)\
- )+len(ReverseComplement1(y))+1
+ F_position = assembly_no_id.index( \
+ ReverseComplement1(y))+len(ReverseComplement1(y))+1
 if z in assembly_no_id:
 R_position = assembly_no_id.index(z)+1
 if ReverseComplement1(z) in assembly_no_id:
- R_position = assembly_no_id.index(ReverseComplement1(z)\
- )+1
-
-# write everything out into the output file
- output = (str(x) + "\t" + y + "\t" + str(F_position) \
- + "\t" + (z) + "\t" + str(R_position) \
- + "\t" + a + "\t" + assembly_output \
- + R1_output + "\t" + R2_output + "\n")
- outputfile.write(output)
+ R_position = assembly_no_id.index( \
+ ReverseComplement1(z))+1
+ output = (str(x),
+ str(y),
+ str(F_position),
+ str(z),
+ str(R_position),
+ str(a),
+ str(assembly_output),
+ str(R1_output),
+ str(R2_output + "\n"))
+ outputfile.write("\t".join(output))
 print "\nPANDAseq quality check complete."
 print "Results from PANDAseq quality check (and filtering, if any" \
 " any filters enabled) written to output file" \
- " ending \"_pal_filter_assembly_output.txt\".\n\n"
-
+ " ending \"_pal_filter_assembly_output.txt\".\n"
 print "Filtering of pal_finder results complete."
 print "Filtered results written to output file ending \".filtered\"."
 print "\nFinished\n"
 else:
- if (skip_assembly == 1 or skip_conversion == 1):
+ if args.skip_assembly or args.skip_conversion:
 print "\nERROR: You cannot supply the -a flag or the -c flag without \
 also supplying the -assembly flag.\n"
-
 print "\nProgram Finished\n"
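Review note on the primer check in the pal_filter.py hunk: the condition `y or ReverseComplement1(y) in assembly_no_id and z or ReverseComplement1(z) in assembly_no_id` is left unchanged by this changeset. Because `in` binds tighter than `and`, which binds tighter than `or`, Python reads it as `y or ((ReverseComplement1(y) in assembly_no_id) and z) or (ReverseComplement1(z) in assembly_no_id)`, so any non-empty `y` makes it true. The sketch below shows one possible explicit form of "both primers, in either orientation, occur in the contig". It is not the author's code: `reverse_complement`, `primer_found`, `condition_as_written` and `both_primers_found` are hypothetical names, and `reverse_complement` is only a stand-in for the script's `ReverseComplement1`, whose body is not shown in the diff.

```python
def reverse_complement(seq):
    """Hypothetical stand-in for ReverseComplement1 (not shown in the diff)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def primer_found(primer, contig):
    """True if the primer or its reverse complement occurs in the contig."""
    return primer in contig or reverse_complement(primer) in contig


def condition_as_written(y, z, contig):
    # Same shape as the condition in the diff. Python groups it as:
    #   y or ((rc(y) in contig) and z) or (rc(z) in contig)
    return bool(y or reverse_complement(y) in contig and z
                or reverse_complement(z) in contig)


def both_primers_found(y, z, contig):
    # Explicit grouping: each primer must be present in some orientation.
    return primer_found(y, contig) and primer_found(z, contig)


if __name__ == "__main__":
    contig = "AAACCGTTGCA"
    missing = "GGGGGGGG"  # neither this nor its reverse complement is in contig
    print(condition_as_written(missing, "CCGT", contig))
    print(both_primers_found(missing, "CCGT", contig))
```

With the explicit form, a read pair whose forward primer is absent from the assembled contig is rejected instead of falling through to the position lookups.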

diff -r e1a14ed7a9d6 -r cb56cc1d5c39 test-data/illuminaPE_assembly.out
--- a/test-data/illuminaPE_assembly.out Wed Feb 24 08:25:17 2016 -0500
+++ b/test-data/illuminaPE_assembly.out Mon Mar 21 06:52:43 2016 -0400

@@ -1,2 +1,2 @@
-readPairID Forward Primer F Primer Position in Assembled Read Reverse Primer R Primer Position in Assembled Read Motifs(bases) Assembled Read ID Assembled Read Sequence Raw Forward Read ID Raw Forward Read Sequence Raw Reverse Read ID Raw Reverse Read Sequence
+readPairID Forward Primer F Primer Position in Assembled Read Reverse Primer R Primer Position in Assembled Read Motifs(bases) Assembled Read ID Assembled Read Sequence Raw Forward Read ID Raw Forward Read Sequence Raw Reverse Read ID Raw Reverse Read Sequence
ILLUMINA-545855:49:FC61RLR:2:1:19063:1614 1 1 AT(14) AT(14) AT(14) AT(14) >ILLUMINA-545855:49:FC61RLR:2:1:19063:1614 TATATATATATATACACATATATATATATATTTTTTACATTATTTCACTTCGCCCAAACTAGAGAGTCTAACAAAGTACAACCCAGCATATTAAAGTTCATCTCAGTTTTGTTCTGAAATGAGAAAAAAATATATATATATATGTTTATATATATATATA >ILLUMINA-545855:49:FC61RLR:2:1:19063:1614 TATATATATATATACACATATATATATATATTTTTTACATTATTTCACTTCGCCCAAACTAGAGAGTCTAACAAAGTACAACCCAGCATATTAAAGTTCATCTCAGTTTTGTTCTG >ILLUMINA-545855:49:FC61RLR:2:1:19063:1614 TATATATATATATAAACATATATATATATATTTTTTTCTCATTTCAGAACAAAAGTGAGATGAACTTTAATATGGTGGGGTGTATTTTGAGAGACTCTCTAGTTTGGGAGGAGTGA
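For orientation, the row above is produced by the FASTA-to-columns conversion and the tab join in the pal_filter.py hunk. The following is a minimal sketch of that formatting, assuming each raw FASTA record is a `>ID` line followed by one or more sequence lines. `fasta_record_to_columns` and `build_row` are hypothetical helper names, not functions from pal_filter.py, and the field values in the demo are made up.

```python
def fasta_record_to_columns(raw_record):
    """Turn '>ID\\nSEQ...\\n' into 'ID-line<TAB>sequence' on one line.

    Mirrors the replace("\\n", "\\t", 1).replace("\\n", "") chain in the diff:
    the first newline becomes a tab, any later newlines are dropped.
    """
    return raw_record.replace("\n", "\t", 1).replace("\n", "")


def build_row(fields):
    """Join already-formatted fields with tabs and end the row with a newline."""
    return "\t".join(str(field) for field in fields) + "\n"


if __name__ == "__main__":
    record = ">read1\nACGT\nTTAA\n"
    columns = fasta_record_to_columns(record)
    # Nine fields go in; the three FASTA fields each carry an embedded tab.
    row = build_row(["read1", "ACGT", 5, "TTGA", 20, "AT(14)",
                     columns, columns, columns])
    print(repr(row))
```

Because each of the three FASTA fields contains one embedded tab, nine joined fields yield twelve tab-separated columns, which is consistent with the twelve column names in the header line if those are tab-separated in the real file.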

diff -r e1a14ed7a9d6 -r cb56cc1d5c39 test-data/illuminaPE_assembly_after_filters.out
--- a/test-data/illuminaPE_assembly_after_filters.out Wed Feb 24 08:25:17 2016 -0500
+++ b/test-data/illuminaPE_assembly_after_filters.out Mon Mar 21 06:52:43 2016 -0400

@@ -1,1 +1,1 @@
-readPairID Forward Primer F Primer Position in Assembled Read Reverse Primer R Primer Position in Assembled Read Motifs(bases) Assembled Read ID Assembled Read Sequence Raw Forward Read ID Raw Forward Read Sequence Raw Reverse Read ID Raw Reverse Read Sequence
+readPairID Forward Primer F Primer Position in Assembled Read Reverse Primer R Primer Position in Assembled Read Motifs(bases) Assembled Read ID Assembled Read Sequence Raw Forward Read ID Raw Forward Read Sequence Raw Reverse Read ID Raw Reverse Read Sequence