pyReadCounters (version 1.0.0)

pyReadCounters

pyReadCounters is part of the pyCRAC package. It produces a gene hit table ("hittable") file and two GTF output files showing which genomic features the reads overlap. The tool also produces a read statistics file that provides information about the complexity of your dataset.

Output file examples

A hittable file:

# generated by pyReadCounters version 1.1.0, Mon Apr 16 20:34:22 2012
# /usr/local/bin/pyReadCounters.py -f RNAseq_data.novo -c 1 --unique
# total number of reads 12534556
# total number of paired reads  10947376
# total number of single reads  483095
# total number of mapped reads: 11430471
# total number of overlapping genomic features  7019550
#       sense   5960669
#       anti-sense      1058881
# feature       sense_overlap anti-sense_overlap  number of reads

YEF3        49930       3629        24221
PMA1        32621       2650        21776
COX1        24559       1037        15174
TFP1        21539       1689        13506
HSC82       21177       1458        12729
ADH1        20245       1467        11351
AI5_ALPHA   20022       918         13101
AI4         19390       886         12638
AI3         17823       798         11473
AI2         17590       790         11297
RPL10       16822       1113        8797
ENO2        16336       1125        8913
TEF1        15578       1333        5450
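
The hittable layout shown above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated data rows with the four columns named in the header comment; the names FeatureHits and parse_hittable are hypothetical and are not part of pyCRAC.

```python
from typing import NamedTuple


class FeatureHits(NamedTuple):
    """One data row of a hittable file (column names taken from its header)."""
    feature: str
    sense_overlap: int
    antisense_overlap: int
    number_of_reads: int


def parse_hittable(lines):
    """Yield one FeatureHits per data row, skipping '#' comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            continue  # not a data row in the layout assumed here
        name, sense, anti, reads = fields
        yield FeatureHits(name, int(sense), int(anti), int(reads))


example = """\
# feature       sense_overlap anti-sense_overlap  number of reads

YEF3        49930       3629        24221
PMA1        32621       2650        21776
""".splitlines()

rows = list(parse_hittable(example))
for row in rows:
    print(row.feature, row.sense_overlap, row.antisense_overlap, row.number_of_reads)
```

In the header of the example file, the sense and anti-sense counts (5960669 + 1058881) add up to the reported total of 7019550 overlapping genomic features, which is a quick sanity check when parsing these files.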

An example of a GTF 'count_output' file:

# generated by Counters version 1.2.0, Tue Jan  8 22:47:29 2013
# pyReadCounters.py -f PAR_CLIP_unique.novo --mutations=TC -v
# total number of reads:    2455251
# total number of paired reads:     0
# total number of single reads:     2455251
# total number of mapped reads:     2455251
# total number of overlapping genomic features:     5153943
#   sense:  2640600
#   anti-sense:     2513343
chrXIV      reads   exon    661572  661605  2       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661596S;
chrXIV      reads   exon    661720  661738  1       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661726S;
chrXIV      reads   exon    661839  661878  4       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661875S;

This output file also reports whether a read contains a mutation.

For example:

# 661596S

Indicates that the read had a nucleotide substitution ("S") at genomic coordinate 661596. The chromosome name can be found in the first column.
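
The mutation annotation can be pulled out of a 'count_output' line as sketched below. This assumes the standard nine-column GTF layout followed by a "#" comment of the form shown above; parse_count_line is a hypothetical helper, and only the "S" code is documented here, so any other letter code is passed through uninterpreted.

```python
import re


def parse_count_line(line):
    """Split one count_output GTF line into its columns, attributes and mutations."""
    fields = line.split(None, 8)  # eight GTF columns, then attributes plus comment
    chrom, source, feature, start, end, score, strand, frame, rest = fields
    attributes, _, comment = rest.partition("#")
    # Attributes look like: gene_id "..."; gene_name "...";
    attrs = dict(re.findall(r'(\S+) "([^"]*)";', attributes))
    # The trailing comment looks like: 661596S;
    mutations = [(int(pos), code)
                 for pos, code in re.findall(r"(\d+)([A-Za-z]+)", comment)]
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "score": int(score),
        "strand": strand,
        "attributes": attrs,
        "mutations": mutations,
    }


line = ('chrXIV      reads   exon    661572  661605  2       +       .   '
        'gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661596S;')
record = parse_count_line(line)
print(record["chrom"], record["mutations"])
```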


Parameter list

File input options:

-f FILE, --input_file=FILE
                                        provide the path to your novo, SAM/BAM or gtf data
                                        file. Default is standard input. Make sure to specify
                                        the file type of the file you want to have analyzed
                                        using the --file_type option!
-o OUTPUT_FILE, --output_file=OUTPUT_FILE
                                        Use this flag to override the standard file names. Do
                                        NOT add an extension.
--file_type=FILE_TYPE
                                        use this option to specify the file type (i.e.
                                        'novo','sam' or 'gtf'). This will tell the program
                                        which parsers to use for processing the files. Default
                                        = 'novo'
--gtf=annotation_file.gtf
                                        type the path to the gtf annotation file that you want
                                        to use

Common pyCRAC options:

--ignorestrand
                                        ignore strand information; all reads overlapping
                                        genomic features will be considered sense reads.
                                        Useful for analysing ChIP or RIP data
--overlap=1
                                        sets the number of nucleotides a read has to overlap
                                        with a gene before it is considered a hit. Default =
                                        1 nucleotide
-r 100, --range=100
                                        allows you to add regions flanking the genomic
                                        feature. If you set '-r 50' or '--range=50', then the
                                        program will add 50 nucleotides to each feature on
                                        each side regardless of whether the GTF file has genes
                                        with annotated UTRs
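
How --overlap and --range interact can be illustrated with a small sketch. It assumes 1-based inclusive coordinates, as in GTF files; the function names are hypothetical and this is not pyCRAC's own implementation.

```python
def overlap_length(read_start, read_end, feat_start, feat_end, flank=0):
    """Number of nucleotides shared by a read and a feature extended by `flank`."""
    lo = max(read_start, feat_start - flank)
    hi = min(read_end, feat_end + flank)
    return max(0, hi - lo + 1)


def is_hit(read_start, read_end, feat_start, feat_end, min_overlap=1, flank=0):
    """True if the read overlaps the (extended) feature by at least `min_overlap`."""
    shared = overlap_length(read_start, read_end, feat_start, feat_end, flank)
    return shared >= min_overlap


# A read at 95-105 misses a feature at 110-200 unless flanking regions are added.
print(is_hit(95, 105, 110, 200))            # no flanks
print(is_hit(95, 105, 110, 200, flank=50))  # comparable to --range=50
```

With a flank of 50 the feature effectively spans 60-250, so the whole read (11 nucleotides) falls inside it.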

Options for SAM/BAM and Novo files:

--mutations=delsonly
                                        Use this option to track only the mutations that are
                                        of interest. For CRAC data these are usually deletions
                                        (--mutations=delsonly). For PAR-CLIP data these are
                                        usually T-C mutations (--mutations=TC). Other options:
                                        --mutations=nomuts reports no mutations;
                                        --mutations=[TCG] reports only mutations at specific
                                        bases, in this example T's, C's and G's. The brackets
                                        are essential. Other nucleotide combinations are also
                                        possible
--align_quality=100, --mapping_quality=100
                                        with these options you can set the alignment quality
                                        (Novoalign) or mapping quality (SAM) threshold. Reads
                                        with qualities lower than the threshold will be
                                        ignored. Default = 0
--align_score=100
                                        with this option you can set the alignment score
                                        threshold. Reads with alignment scores lower than the
                                        threshold will be ignored. Default = 0
--unique
                                        with this option reads with multiple alignment
                                        locations will be removed. Default = Off
--blocks
                                        with this option reads with the same start and end
                                        coordinates on a chromosome will be counted as one
                                        cDNA. Default = Off
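
The effect of --blocks can be pictured as collapsing reads that share a position. The sketch below is only an illustration with a hypothetical collapse_blocks helper; it keys on chromosome, start and end as described above, and whether pyCRAC also takes the strand into account is not stated here.

```python
def collapse_blocks(reads):
    """Keep one read per (chromosome, start, end), preserving first-seen order."""
    seen = set()
    collapsed = []
    for chrom, start, end in reads:
        key = (chrom, start, end)
        if key not in seen:
            seen.add(key)
            collapsed.append(key)
    return collapsed


reads = [
    ("chrXIV", 661572, 661605),
    ("chrXIV", 661572, 661605),  # same coordinates: counted as the same cDNA
    ("chrXIV", 661720, 661738),
]
cdnas = collapse_blocks(reads)
print(len(reads), "reads ->", len(cdnas), "cDNAs")
```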
-m 100000, --max=100000
                                        maximum number of mapped reads that will be analyzed.
                                        Default = All
-d 1000, --distance=1000
                                        this option allows you to set the maximum number of
                                        base-pairs allowed between two non-overlapping paired
                                        reads. Default = 1000
--discarded=FILE
                                        prints the lines from the alignments file that were
                                        discarded by the parsers. This file contains reads
                                        that were unmapped (NM), of poor quality (i.e. QC) or
                                        paired reads that were mapped to different chromosomal
                                        locations or were too far apart on the same
                                        chromosome. Useful for debugging purposes
-l 100, --length=1000
                                        sets the read length threshold. Default = 1000
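
The options above can be combined into a single command line. The sketch below only assembles and prints the command; the file names reads.sam and annotation.gtf are placeholders, and the script name pyReadCounters.py is taken from the example output headers. It assumes Python 3.8 or later for shlex.join.

```python
import shlex

# Placeholder input files; replace with your own paths.
alignment_file = "reads.sam"
annotation_file = "annotation.gtf"

command = [
    "pyReadCounters.py",
    "-f", alignment_file,
    "--file_type=sam",            # tells the program which parser to use
    f"--gtf={annotation_file}",
    "--mutations=delsonly",       # track deletions only (typical for CRAC data)
    "--unique",                   # drop reads with multiple alignment locations
]

# Print the command for inspection; it could then be passed to subprocess.run.
command_line = shlex.join(command)
print(command_line)
```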