
pyClusterReads (version 1.0.0)
GTF format sorted by position i.e. pyReadCounters output file.
GTF file containing gene ID co-ordinates

pyClusterReads

pyClusterReads is part of the pyCRAC package. It takes a reads_count_output GTF file from pyReadCounters and generates clusters from the interval coordinates. It produces a GTF output file with cluster intervals and overlapping genomic features. The output also includes mutation frequencies (after the # character) for nucleotides in each interval, using chromosomal coordinates. The pyClusterReads GTF output file has essentially the same layout as other pyCRAC GTF output files.

NOTE! By default each cluster is labelled an "exon", but this label has no meaning: the cluster may overlap with an intron. Use bedtools to extract those intervals that overlap with introns or other features.

The maximum height of the cluster is indicated in column 8. The text after the hash character (#) at the end of each line lists the chromosomal coordinates of mutated nucleotides within the cluster interval and their mutation frequencies.

For example:

# 114099S100.0

indicates that the nucleotide at position 114099 was substituted in 100% of the reads in the cluster.
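
To make the annotation format concrete, here is a minimal Python sketch that splits the text after the hash character into (position, type, frequency) entries. The function name parse_mutations is hypothetical and not part of pyCRAC. Only the "S" code appears in this document; accepting any single letter as the mutation type is an assumption.

```python
import re

# One entry looks like "114099S100.0": position, one-letter type, percentage.
# Only "S" is shown in the documentation; other letters are an assumption.
MUTATION_RE = re.compile(r"^(\d+)([A-Za-z])(\d+(?:\.\d+)?)$")


def parse_mutations(annotation):
    """Parse the text after '#' into (position, type, frequency) tuples."""
    text = annotation.strip().lstrip("#").strip().rstrip(";")
    mutations = []
    for entry in text.split(","):
        entry = entry.strip()
        if not entry:
            continue  # nothing after '#', or a stray comma
        match = MUTATION_RE.match(entry)
        if match is None:
            raise ValueError(f"unrecognised mutation entry: {entry!r}")
        position, kind, frequency = match.groups()
        mutations.append((int(position), kind, float(frequency)))
    return mutations


print(parse_mutations("# 114099S100.0"))  # [(114099, 'S', 100.0)]
```

Entries separated by commas, as in the output file example, come back as one tuple per mutated position.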

An example of a pyClusterReads output file:

# generated by pyClusterReads.py version 0.0.1, Fri Jan 18 11:59:42 2013
# pyClusterReads.py -f count_output_reads.gtf -o count_output_clusters.gtf -v
# chromosome    feature source  start   end     cDNAs   strand  height  attributes
chrI    cluster exon    112583  112643  6       -       5   gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 112612S75.0;
chrI    cluster exon    113176  113232  3       -       3   gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 113184S100.0;
chrI    cluster exon    113334  113386  2       -       2   gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 113349S50.0,113379S100.0;
chrI    cluster exon    113534  113564  3       -       3   gene_id "INT_0_119,INT_0_114"; gene_name "INT_0_119,INT_0_114"; # 113554S33.3,113556S33.3,113557S33.3;
chrI    cluster exon    113644  113691  5       -       4   gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113649S50.0,113657S33.3,113679S25.0
chrI    cluster exon    113912  113958  2       -       2   gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113932S50.0,113946S50.0;
chrI    cluster exon    113966  114066  5       -       3   gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113987S50.0,114033S33.3,114039S33.3;
chrI    cluster exon    114067  114130  3       -       3   gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 114099S100.0;
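
For downstream processing, a line of this output can be read into named fields. The following is a rough Python sketch, not pyCRAC code: the Cluster type and parse_cluster_line function are hypothetical, the column names follow the header line shown above, and it assumes that the first '#' on a data line starts the mutation annotation.

```python
import re
from typing import NamedTuple


class Cluster(NamedTuple):
    chromosome: str
    start: int
    end: int
    cdnas: int
    strand: str
    height: int
    attributes: dict
    mutations: str  # raw text after '#', left unparsed here


# Matches attribute pairs such as: gene_id "YAL020C,INT_0_114"
ATTRIBUTE_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_cluster_line(line):
    """Return a Cluster for a data line, or None for comment and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: the first '#' separates the attributes from the mutations.
    body, _, mutations = line.partition("#")
    fields = body.split(None, 8)
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chromosome, _feature, _source, start, end, cdnas, strand, height = fields[:8]
    attr_text = fields[8] if len(fields) > 8 else ""
    attributes = {key: value.split(",") for key, value in ATTRIBUTE_RE.findall(attr_text)}
    return Cluster(chromosome, int(start), int(end), int(cdnas), strand,
                   int(height), attributes, mutations.strip().rstrip(";"))


example = ('chrI    cluster exon    114067  114130  3       -       3   '
           'gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 114099S100.0;')
cluster = parse_cluster_line(example)
print(cluster.start, cluster.end, cluster.height, cluster.attributes["gene_name"])
```

Splitting on any whitespace run keeps the sketch working whether the columns are separated by tabs or by spaces.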

Parameter list

File input options:

-f reads.gtf, --input_file=reads.gtf
                              provide the path to your GTF read data file. NOTE the
                              file has to be correctly sorted! If you used
                              pyReadCounters to generate the file you should be
                              fine. If you modified it, use the sort command
                              described in the manual to sort your file first by
                              chromosome, then by strand and then by start position.
-o clusters.gtf, --output_file=clusters.gtf
                              provide a name for an output file. By default the
                              program writes to standard output.
--gtf=Yourfavoritegtf.gtf
                              provide the path to the GTF annotation file that you
                              want to use.
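
The sort command from the manual is not reproduced here. As an illustration of the ordering that the -f option expects (chromosome, then strand, then start position), here is a small Python sketch. The function name sort_gtf_lines is hypothetical, and the column positions (chromosome in column 1, start in column 4, strand in column 7) are taken from the output example above. Chromosome names are compared as plain strings, which may differ from the order produced by the manual's command.

```python
def sort_gtf_lines(lines):
    """Sort GTF data lines by chromosome, then strand, then start position.

    Comment lines (starting with '#') are kept in their original order at the top.
    """
    comments, records = [], []
    for line in lines:
        if not line.strip():
            continue  # drop blank lines
        if line.startswith("#"):
            comments.append(line)
        else:
            records.append(line)

    def sort_key(line):
        fields = line.split(None, 8)
        # column 1 = chromosome, column 7 = strand, column 4 = start
        return (fields[0], fields[6], int(fields[3]))

    return comments + sorted(records, key=sort_key)


reads = [
    'chrII\treads\texon\t50\t80\t1\t+\t.\tgene_id "B";',
    'chrI\treads\texon\t300\t350\t1\t-\t.\tgene_id "A";',
    'chrI\treads\texon\t100\t150\t1\t-\t.\tgene_id "A";',
]
for line in sort_gtf_lines(reads):
    print(line)
```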

Common pyCRAC options:

-r 100, --range=100
                              allows you to set the length of the UTR regions. If
                              you set '-r 50' or '--range=50', then the program will
                              set a fixed length (50 bp) regardless of whether the
                              GTF annotation file has genes with annotated UTRs.
-a protein_coding, --annotation=protein_coding
                              select which annotation (e.g. protein_coding, ncRNA,
                              sRNA, rRNA, snoRNA, snRNA, depending on the source of
                              your GTF file) you would like to focus your analysis
                              on. Default = all annotations.

Options for cluster analysis:

--cic=2, --cdnasinclusters=2
                              sets the minimal number of overlapping cDNAs in each
                              cluster. Default = 2
--co=5, --clusteroverlap=5
                              sets the number of nucleotides cDNA sequences have to
                              overlap to form a cluster. Default = 1 nucleotide
--ch=5, --clusterheight=5
                              sets the minimal height of the cluster. Default = 2
                              nucleotides
--cl=100, --clusterlength=100
                              sets the maximum cluster sequence length.
--mutsfreq=10, --mutationfrequency=10
                              sets the minimal mutation frequency required for a
                              cluster position to be reported in the GTF output
                              file. Default = 0%. Example: if --mutsfreq is set to
                              10 and a cluster position is mutated in less than 10%
                              of the reads, then the mutation will not be reported.
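
To illustrate how the cluster options interact, here is a simplified Python sketch. It is not the pyClusterReads implementation; the function names are hypothetical and the greedy merging strategy is an assumption. The parameters min_overlap, min_cdnas and min_height stand in for --co, --cic and --ch with their documented defaults, and min_frequency stands in for --mutsfreq. The --cl option is left out because the documentation does not say how over-long clusters are handled.

```python
def max_depth(reads):
    """Largest number of reads covering any single position (inclusive coordinates)."""
    events = []
    for start, end in reads:
        events.append((start, 1))     # coverage begins at start
        events.append((end + 1, -1))  # coverage stops after end
    depth = best = 0
    for _, delta in sorted(events):   # at equal positions, -1 sorts before +1
        depth += delta
        best = max(best, depth)
    return best


def cluster_reads(reads, min_overlap=1, min_cdnas=2, min_height=2):
    """Group (start, end) reads from one chromosome and strand into clusters.

    Assumed strategy: a read joins the current cluster if it overlaps the
    cluster extent by at least min_overlap nucleotides.
    Returns (start, end, cdna_count, height) tuples.
    """
    groups, members, cluster_end = [], [], None
    for start, end in sorted(reads):
        if members and min(cluster_end, end) - start + 1 >= min_overlap:
            members.append((start, end))
            cluster_end = max(cluster_end, end)
            continue
        if members:
            groups.append(members)
        members, cluster_end = [(start, end)], end
    if members:
        groups.append(members)

    clusters = []
    for group in groups:
        height = max_depth(group)
        if len(group) >= min_cdnas and height >= min_height:
            clusters.append((group[0][0], max(e for _, e in group), len(group), height))
    return clusters


def filter_mutations(mutations, min_frequency=0.0):
    """Keep (position, type, frequency) tuples at or above min_frequency percent."""
    return [m for m in mutations if m[2] >= min_frequency]


print(cluster_reads([(100, 150), (120, 170), (160, 200), (300, 350)]))
print(filter_mutations([(113649, "S", 50.0), (113657, "S", 33.3), (113679, "S", 25.0)], 30))
```

In this sketch the isolated read at 300-350 is dropped because it does not reach the minimum of two cDNAs, and the mutation at 25.0% is dropped because it falls below the threshold of 30.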