pyFastqDuplicateRemover
pyFastqDuplicateRemover is part of the pyCRAC package. It removes identical sequences from FASTQ and FASTA files and returns a FASTA file with the collapsed data.
Can also process paired-end data.
Examples
Unprocessed fastq data with six random nucleotides at the 5' end of the read:

    @FCC102EACXX:3:1101:3231:2110#TGACCAAT/1
    GCGCCTGCCAATTCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC
    +
    bb_ceeeegggggiiiiiifghiihiihiiiiiiiiiifggfhiecccc
After pyBarcodeFilter:

    @FCC102EACXX:3:1101:3231:2110#TGACCAAT/1##GCGCCT
    TCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC
    +
    giiiiiifghiihiihiiiiiiiiiifggfhiecccc

This entry is printed to the NNNNNNGCCAAT barcode file.
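
Comparing the unprocessed record with the pyBarcodeFilter record shows the transformation: the six random nucleotides are appended to the header after `##`, and the random nucleotides plus the GCCAAT barcode are trimmed from both the sequence and the quality string. The following is a minimal Python sketch of that step, inferred only from this example. It is not pyBarcodeFilter's actual code, and the function name `move_barcode_to_header` and its parameters are hypothetical.

```python
def move_barcode_to_header(header, seq, qual, random_len=6, barcode="GCCAAT"):
    """Illustrative only: mimic the header/sequence change shown in the example.

    Returns (new_header, trimmed_seq, trimmed_qual), or None when the read
    does not carry the expected barcode right after the random nucleotides.
    """
    prefix_len = random_len + len(barcode)
    # The fixed barcode is expected directly after the random nucleotides.
    if seq[random_len:prefix_len] != barcode:
        return None
    random_nt = seq[:random_len]
    # Append the random nucleotides to the header and trim both strings.
    return f"{header}##{random_nt}", seq[prefix_len:], qual[prefix_len:]
```

Applied to the unprocessed record from this section, the sketch yields the header, sequence, and quality string listed under "After pyBarcodeFilter".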
After pyFastqDuplicateRemover:

    >1_GCGCCT_5/1
    TCCATCGTAATGATTAATAGGGACGGTCGGGGGCATC

The '1' indicates that this is the first unique cDNA in the data.
GCGCCT is the random barcode sequence.
The '5' indicates that 5 reads were found with identical read and random barcode sequences.
The '/1' indicates that the sequence originates from the forward sequencing reaction.
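
The collapsing illustrated by this entry can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: reads are grouped by random barcode, read sequence, and mate number, and the unique entries are numbered in order of first appearance (the numbering order is an assumption). The helper names `parse_fastq` and `collapse` are hypothetical.

```python
def parse_fastq(lines):
    """Yield (header, sequence) pairs from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1]


def collapse(records):
    """Collapse reads that share random barcode, sequence, and mate number.

    Expects headers of the form '<name>/<mate>##<random barcode>', as in the
    pyBarcodeFilter output shown in this section.
    """
    counts = {}  # insertion-ordered in Python 3.7+
    for header, seq in records:
        name, _, random_barcode = header.partition("##")
        mate = name.rsplit("/", 1)[1] if "/" in name else "1"
        key = (random_barcode, seq, mate)
        counts[key] = counts.get(key, 0) + 1
    fasta = []
    for number, ((barcode, seq, mate), count) in enumerate(counts.items(), 1):
        fasta.append(f">{number}_{barcode}_{count}/{mate}\n{seq}")
    return fasta
```

Five copies of the pyBarcodeFilter record from the example collapse into a single entry with the header `>1_GCGCCT_5/1`.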
Parameter list
Options:
  -f FILE, --input_file=FILE
                        name of the FASTQ or FASTA input file
  -r FILE, --reverse_input_file=FILE
                        name of the paired (or reverse) FASTQ or FASTA input file
  -o FILE, --output_file=FILE
                        Provide the path and name of the fastq or fasta output file. Default is standard output. For paired-end data just provide a file name without file extension (!)
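
As a usage illustration, the sketch below assembles a command line from the three options listed above, for example to pass to `subprocess.run`. The executable name `pyFastqDuplicateRemover.py` is an assumption about how the script is installed, and the helper `build_command` and the file names in the example are hypothetical. Only the documented `-f`, `-r`, and `-o` flags are used.

```python
def build_command(input_file, reverse_file=None, output=None,
                  program="pyFastqDuplicateRemover.py"):
    """Build an argument list using only the documented -f, -r and -o options.

    `program` is an assumed executable name; adjust it to your installation.
    """
    cmd = [program, "-f", input_file]
    if reverse_file is not None:
        # Paired-end data: the reverse reads go to -r.
        cmd += ["-r", reverse_file]
    if output is not None:
        # For paired-end data, give the output name without a file extension.
        cmd += ["-o", output]
    return cmd


# Hypothetical file names, for illustration only.
single = build_command("reads.fastq", output="collapsed.fasta")
paired = build_command("reads_1.fastq", "reads_2.fastq", output="collapsed")
```

When `output` is omitted, no `-o` flag is added and the tool writes to standard output, as stated in the option description.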