
pyFastqDuplicateRemover

pyFastqDuplicateRemover is part of the pyCRAC package. It removes identical sequences from FASTQ and FASTA files and returns a FASTA file with the collapsed data.

It can also process paired-end data.

Examples

Unprocessed FASTQ data with six random nucleotides at the 5' end of the read:

@FCC102EACXX:3:1101:3231:2110#TGACCAAT/1
GCGCCTGCCAATTCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC
+
bb_ceeeegggggiiiiiifghiihiihiiiiiiiiiifggfhiecccc

After pyBarcodeFilter:

@FCC102EACXX:3:1101:3231:2110#TGACCAAT/1##GCGCCT
TCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC
+
giiiiiifghiihiihiiiiiiiiiifggfhiecccc

This entry is printed to the NNNNNNGCCAAT barcode file.

After pyFastqDuplicateRemover:

>1_GCGCCT_5/1
TCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC

The '1' indicates that this is the first unique cDNA in the data.
'GCGCCT' is the random barcode sequence.
The '5' indicates that 5 reads were found with identical read and random barcode sequences.
The '/1' indicates that the sequence originates from the forward sequencing reaction.
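
The collapsing step described above can be sketched in Python. This is only an illustration of the idea, not the pyCRAC implementation: the function names `read_fastq` and `collapse` are hypothetical, and the sketch assumes that the random barcode follows '##' in the read header (as in the pyBarcodeFilter output shown above) and that unique cDNAs are numbered in the order in which they first appear.

```python
def read_fastq(lines):
    """Yield (header, sequence) pairs from an iterable of FASTQ lines.

    Assumes plain four-line FASTQ records; blank lines are ignored.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1]


def collapse(records, mate="1"):
    """Collapse reads with identical sequence and random barcode.

    Returns FASTA entries named '>index_barcode_count/mate'.
    Hypothetical helper, not the pyCRAC code.
    """
    counts = {}  # dicts keep insertion order in Python 3.7+
    for header, sequence in records:
        # Assumption: the random barcode is the text after the last '##'.
        barcode = header.rsplit("##", 1)[1] if "##" in header else ""
        key = (barcode, sequence)
        counts[key] = counts.get(key, 0) + 1

    entries = []
    for index, ((barcode, sequence), n) in enumerate(counts.items(), start=1):
        entries.append(f">{index}_{barcode}_{n}/{mate}\n{sequence}")
    return entries


if __name__ == "__main__":
    record = [
        "@FCC102EACXX:3:1101:3231:2110#TGACCAAT/1##GCGCCT",
        "TCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC",
        "+",
        "giiiiiifghiihiihiiiiiiiiiifggfhiecccc",
    ]
    # Five identical reads collapse into a single FASTA entry.
    for entry in collapse(read_fastq(record * 5)):
        print(entry)
```

Using the pair (barcode, sequence) as the key means that reads with the same sequence but different random barcodes are kept as separate cDNAs, which matches the description of the count above.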

Parameter list

Options:

-f FILE, --input_file=FILE
                                      name of the FASTQ or FASTA input file

-r FILE, --reverse_input_file=FILE
                                      name of the paired (or reverse) FASTQ or FASTA input file

-o FILE, --output_file=FILE
                                      path and name of the FASTQ or FASTA output file. Default is standard output.
                                      For paired-end data, provide only a file name without a file extension (!)
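
The options above can be combined into a command line. The following Python sketch only assembles the argument list from the flags listed in this section; the helper name `build_command` is hypothetical, and the executable name `pyFastqDuplicateRemover.py` is an assumption that may differ in your installation.

```python
def build_command(forward, reverse=None, output=None,
                  executable="pyFastqDuplicateRemover.py"):
    """Assemble an argument list from the -f, -r and -o options.

    The executable name is an assumption; adjust it to your installation.
    """
    command = [executable, "-f", forward]
    if reverse is not None:
        # Paired-end data: add the reverse (paired) input file.
        command += ["-r", reverse]
    if output is not None:
        # Without -o the tool writes to standard output.
        command += ["-o", output]
    return command


if __name__ == "__main__":
    # Paired-end example: the output name is given without a file extension.
    print(" ".join(build_command("reads_1.fastq", "reads_2.fastq", "collapsed")))
```

The resulting list could be passed to `subprocess.run` once the tool is installed and on the PATH.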