pyReadCounters

pyReadCounters is part of the pyCRAC package. It produces a gene hittable file and two GTF output files showing which genomic features the reads overlap. The tool also produces a read statistics file that provides information about the complexity of your dataset.

Output file examples

A hittable file:

# generated by pyReadCounters version 1.1.0, Mon Apr 16 20:34:22 2012
# /usr/local/bin/pyReadCounters.py -f RNAseq_data.novo -c 1 --unique
# total number of reads 12534556
# total number of paired reads  10947376
# total number of single reads  483095
# total number of mapped reads: 11430471
# total number of overlapping genomic features  7019550
#       sense   5960669
#       anti-sense      1058881
# feature       sense_overlap anti-sense_overlap  number of reads

YEF3        49930       3629        24221
PMA1        32621       2650        21776
COX1        24559       1037        15174
TFP1        21539       1689        13506
HSC82       21177       1458        12729
ADH1        20245       1467        11351
AI5_ALPHA   20022       918         13101
AI4         19390       886         12638
AI3         17823       798         11473
AI2         17590       790         11297
RPL10       16822       1113        8797
ENO2        16336       1125        8913
TEF1        15578       1333        5450
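
To work with a hittable file downstream, the rows can be read into simple records. The following is a minimal sketch, not part of pyCRAC: the names `HitRow` and `parse_hittable` are hypothetical, and it assumes that data rows are whitespace-separated with the four columns named in the header line (feature, sense_overlap, anti-sense_overlap, number of reads) and that feature names contain no whitespace.

```python
from typing import Iterable, List, NamedTuple


class HitRow(NamedTuple):
    """One data row of a hittable file (hypothetical helper type)."""
    feature: str
    sense_overlap: int
    antisense_overlap: int
    reads: int


def parse_hittable(lines: Iterable[str]) -> List[HitRow]:
    """Parse hittable lines, skipping '#' header lines and blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        name, sense, antisense, reads = fields
        rows.append(HitRow(name, int(sense), int(antisense), int(reads)))
    return rows


example = [
    "# feature       sense_overlap anti-sense_overlap  number of reads",
    "",
    "YEF3        49930       3629        24221",
    "PMA1        32621       2650        21776",
]
rows = parse_hittable(example)
for row in rows:
    print(row.feature, row.sense_overlap, row.antisense_overlap, row.reads)
```

Raising on rows with an unexpected column count is a design choice of this sketch, so that a change in file layout is noticed rather than silently skipped.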

An example of a GTF 'count_output' file:

# generated by Counters version 1.2.0, Tue Jan  8 22:47:29 2013
# pyReadCounters.py -f PAR_CLIP_unique.novo --mutations=TC -v
# total number of reads:    2455251
# total number of paired reads:     0
# total number of single reads:     2455251
# total number of mapped reads:     2455251
# total number of overlapping genomic features:     5153943
#   sense:  2640600
#   anti-sense:     2513343
chrXIV      reads   exon    661572  661605  2       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661596S;
chrXIV      reads   exon    661720  661738  1       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661726S;
chrXIV      reads   exon    661839  661878  4       +       .   gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661875S;

This output file also reports whether a read contains a mutation.

For example:

# 661596S

Indicates that the read had a nucleotide substitution ("S") at genomic coordinate 661596. The chromosome name can be found in the first column.
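
The mutation annotation can be extracted from each GTF line with a small parser. This is a minimal sketch, not the pyCRAC implementation: `parse_mutations` is a hypothetical helper, and it assumes that the annotation follows the last '#' on the line and that each entry is a genomic coordinate followed by a single uppercase letter. Only the "S" (substitution) code is documented here; the sketch returns any other letter code unchanged without interpreting it.

```python
import re
from typing import List, Tuple

# A coordinate followed by a one-letter mutation code, e.g. "661596S".
MUTATION_RE = re.compile(r"(\d+)([A-Z])")


def parse_mutations(gtf_line: str) -> List[Tuple[str, int, str]]:
    """Return (chromosome, coordinate, code) for each mutation on a GTF line."""
    if gtf_line.startswith("#"):
        return []  # header line, not a feature line
    body, sep, comment = gtf_line.rpartition("#")
    if not sep:
        return []  # no trailing annotation on this line
    chromosome = body.split()[0]  # the chromosome name is the first column
    return [(chromosome, int(pos), code)
            for pos, code in MUTATION_RE.findall(comment)]


line = ('chrXIV      reads   exon    661572  661605  2       +       .   '
        'gene_id "INT_0_6716,YNR016C"; gene_name "INT_0_6716,ACC1"; # 661596S;')
print(parse_mutations(line))
```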

Parameter list

File input options:

-f FILE, --input_file=FILE
                                        provide the path to your novo, SAM/BAM or gtf data
                                        file. Default is standard input. Make sure to specify
                                        the file type of the file you want to have analyzed
                                        using the --file_type option!
-o OUTPUT_FILE, --output_file=OUTPUT_FILE
                                        Use this flag to override the standard file names. Do
                                        NOT add an extension.
--file_type=FILE_TYPE
                                        use this option to specify the file type (i.e.
                                        'novo','sam' or 'gtf'). This will tell the program
                                        which parsers to use for processing the files. Default
                                        = 'novo'
--gtf=annotation_file.gtf
                                        type the path to the gtf annotation file that you want
                                        to use
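
When the tool is driven from a script, the file input options above can be assembled into an argument list. The sketch below is only an illustration: `build_command` is a hypothetical helper, the script name `pyReadCounters.py` is taken from the example output headers, and the file names are placeholders. It only builds the list; the result could then be passed to `subprocess.run`.

```python
from typing import List, Optional

# File types named in the --file_type description.
ALLOWED_FILE_TYPES = ("novo", "sam", "gtf")


def build_command(input_file: str,
                  gtf_file: str,
                  file_type: str = "novo",
                  output_prefix: Optional[str] = None) -> List[str]:
    """Assemble a pyReadCounters argument list from the file input options."""
    if file_type not in ALLOWED_FILE_TYPES:
        raise ValueError(f"file_type must be one of {ALLOWED_FILE_TYPES}")
    command = [
        "pyReadCounters.py",
        "-f", input_file,
        f"--file_type={file_type}",
        f"--gtf={gtf_file}",
    ]
    if output_prefix is not None:
        # The prefix should be given without a file extension.
        command += ["-o", output_prefix]
    return command


command = build_command("PAR_CLIP_unique.novo", "annotation_file.gtf")
print(" ".join(command))
```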

Common pyCRAC options:

--ignorestrand
                                        ignore strand information, so that all reads
                                        overlapping with genomic features are considered
                                        sense reads. Useful for analysing ChIP or RIP data
--overlap=1
                                        sets the number of nucleotides a read has to overlap
                                        with a gene before it is considered a hit. Default =
                                        1 nucleotide
-r 100, --range=100
                                        allows you to add regions flanking the genomic
                                        feature. If you set '-r 50' or '--range=50', then the
                                        program will add 50 nucleotides to each feature on
                                        each side regardless of whether the GTF file has genes
                                        with annotated UTRs
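
The interplay between --range and --overlap can be illustrated with a few lines of arithmetic. This is a sketch of the idea described above, not the pyCRAC code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive as in GTF files, and strand is ignored for simplicity.

```python
def extend_feature(start: int, end: int, flank: int) -> tuple:
    """Add `flank` nucleotides to each side of a feature (cf. --range)."""
    return max(1, start - flank), end + flank


def overlap_length(read_start: int, read_end: int,
                   feat_start: int, feat_end: int) -> int:
    """Number of nucleotides shared by a read and a feature (0 if none)."""
    return max(0, min(read_end, feat_end) - max(read_start, feat_start) + 1)


def is_hit(read: tuple, feature: tuple,
           min_overlap: int = 1, flank: int = 0) -> bool:
    """True if the read overlaps the (extended) feature by at least min_overlap."""
    feat_start, feat_end = extend_feature(feature[0], feature[1], flank)
    return overlap_length(read[0], read[1], feat_start, feat_end) >= min_overlap


feature = (1000, 2000)   # annotated feature
read = (960, 990)        # read mapping just upstream of the feature

print(is_hit(read, feature))                            # no flank: no overlap
print(is_hit(read, feature, flank=50))                  # flank of 50: 31 nt overlap
print(is_hit(read, feature, min_overlap=40, flank=50))  # 31 nt is below 40
```

In this example the read only counts as a hit once the feature is extended by 50 nucleotides, and it is rejected again when the required overlap is raised above the 31 shared nucleotides.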

Options for SAM/BAM and Novo files:

--mutations=delsonly
                                        Use this option to only track mutations that are of
                                        interest. For CRAC data this is usually deletions
                                        (--mutations=delsonly). For PAR-CLIP data this is
                                        usually T-C mutations (--mutations=TC). Other options
                                        are: do not report any mutations (--mutations=nomuts),
                                        or only report specific base mutations, for example
                                        only in T's, C's and G's (--mutations=[TCG]). The
                                        brackets are essential. Other nucleotide combinations
                                        are also possible
--align_quality=100, --mapping_quality=100
                                        with these options you can set the alignment quality
                                        (Novoalign) or mapping quality (SAM) threshold. Reads
                                        with qualities lower than the threshold will be
                                        ignored. Default = 0
--align_score=100
                                        with this option you can set the alignment score
                                        threshold. Reads with alignment scores lower than the
                                        threshold will be ignored. Default = 0
--unique
                                        with this option reads with multiple alignment
                                        locations will be removed. Default = Off
--blocks
                                        with this option reads with the same start and end
                                        coordinates on a chromosome will be counted as one
                                        cDNA. Default = Off
-m 100000, --max=100000
                                        maximum number of mapped reads that will be analyzed.
                                        Default = All
-d 1000, --distance=1000
                                        this option allows you to set the maximum number of
                                        base-pairs allowed between two non-overlapping paired
                                        reads. Default = 1000
--discarded=FILE
                                        prints the lines from the alignments file that were
                                        discarded by the parsers. This file contains reads
                                        that were unmapped (NM), of poor quality (i.e. QC) or
                                        paired reads that were mapped to different chromosomal
                                        locations or were too far apart on the same
                                        chromosome. Useful for debugging purposes
-l 100, --length=1000
                                        to set read length threshold. Default = 1000
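
The effect of --blocks can be pictured as collapsing reads that share the same chromosome, start and end into a single cDNA. The following is a minimal sketch of that idea under stated assumptions, not the pyCRAC implementation: `count_cdnas` is a hypothetical helper, and reads are represented as (chromosome, start, end) tuples. Whether the tool also takes strand into account is not stated above, so the sketch leaves strand out.

```python
from collections import Counter
from typing import Iterable, Tuple

Read = Tuple[str, int, int]  # (chromosome, start, end)


def count_cdnas(reads: Iterable[Read]) -> Counter:
    """Group reads by identical chromosome, start and end coordinates.

    Each key of the returned Counter stands for one cDNA; the value is the
    number of reads that were collapsed into it.
    """
    return Counter(reads)


reads = [
    ("chrXIV", 661572, 661605),
    ("chrXIV", 661572, 661605),  # same coordinates: same cDNA as above
    ("chrXIV", 661720, 661738),
]
cdnas = count_cdnas(reads)
print(len(reads), "reads collapse into", len(cdnas), "cDNAs")
```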