pyClusterReads
pyClusterReads is part of the pyCRAC package. It takes a reads_count_output GTF file from pyReadCounters and generates clusters from the interval coordinates. It produces a GTF output file with cluster intervals and overlapping genomic features. The output also includes mutation frequencies (after the # character) for nucleotides in each interval, reported in chromosomal coordinates. The pyClusterReads GTF output file has essentially the same layout as other pyCRAC GTF output files.
NOTE! By default each cluster is labelled as an "exon", but this label has no meaning: the cluster may overlap with an intron. Use bedtools to extract the intervals that overlap with introns or other features.
The maximum height of the cluster is indicated in column 8. The text after the hash character (#) at the end of each line lists the chromosomal coordinates of mutated nucleotides within the cluster interval and their mutation frequencies.
For example:
# 114099S100.0
indicates that the nucleotide at position 114099 was substituted in 100% of the reads in the cluster.
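The annotation format described above can be read programmatically. The following is a minimal Python sketch, not part of pyCRAC: the function name parse_mutations is hypothetical, and the token layout (position, one capital letter, percentage) is inferred from the examples in this document, where only the letter S appears.

```python
import re

# Assumed token layout: <chromosomal position><type letter><frequency in %>,
# e.g. "114099S100.0". Only "S" appears in this document; the pattern accepts
# any single capital letter as an assumption.
MUTATION_TOKEN = re.compile(r"^(\d+)([A-Z])(\d+(?:\.\d+)?)$")


def parse_mutations(comment):
    """Parse the text after '#' into (position, type, frequency) tuples."""
    mutations = []
    # Tokens are comma-separated; a trailing ';' may or may not be present.
    for token in comment.strip().strip(";").split(","):
        token = token.strip()
        if not token:
            continue
        match = MUTATION_TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised mutation token: {token!r}")
        position, kind, frequency = match.groups()
        mutations.append((int(position), kind, float(frequency)))
    return mutations


print(parse_mutations("113349S50.0,113379S100.0;"))
# prints [(113349, 'S', 50.0), (113379, 'S', 100.0)]
```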
An example of a pyClusterReads output file:
# generated by pyClusterReads.py version 0.0.1, Fri Jan 18 11:59:42 2013
# pyClusterReads.py -f count_output_reads.gtf -o count_output_clusters.gtf -v
# chromosome feature source start end cDNAs strand height attributes
chrI cluster exon 112583 112643 6 - 5 gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 112612S75.0;
chrI cluster exon 113176 113232 3 - 3 gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 113184S100.0;
chrI cluster exon 113334 113386 2 - 2 gene_id "INT_0_114,YAL021C"; gene_name "INT_0_114,CCR4"; # 113349S50.0,113379S100.0;
chrI cluster exon 113534 113564 3 - 3 gene_id "INT_0_119,INT_0_114"; gene_name "INT_0_119,INT_0_114"; # 113554S33.3,113556S33.3,113557S33.3;
chrI cluster exon 113644 113691 5 - 4 gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113649S50.0,113657S33.3,113679S25.0
chrI cluster exon 113912 113958 2 - 2 gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113932S50.0,113946S50.0;
chrI cluster exon 113966 114066 5 - 3 gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 113987S50.0,114033S33.3,114039S33.3;
chrI cluster exon 114067 114130 3 - 3 gene_id "YAL020C,INT_0_114"; gene_name "ATS1,INT_0_114"; # 114099S100.0;
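A record from an output file like the one above can be split into its columns with a few lines of Python. This is a minimal sketch rather than a pyCRAC function: parse_cluster_line and the dictionary keys are hypothetical names, the column order follows the header comment in the example, and the code splits on any whitespace so that it works whether the columns are separated by tabs or by spaces.

```python
def parse_cluster_line(line):
    """Split one pyClusterReads GTF record into a dictionary of fields."""
    # The first eight columns contain no whitespace; everything after them is
    # the attribute text plus the optional '#' mutation annotation.
    fields = line.rstrip("\n").split(None, 8)
    if len(fields) < 9:
        raise ValueError("expected at least 9 columns")
    chrom, feature, source, start, end, cdnas, strand, height, rest = fields

    attributes_text, _, mutation_text = rest.partition("#")
    attributes = {}
    for item in attributes_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Values such as "INT_0_114,YAL021C" hold comma-separated names.
        attributes[key] = value.strip().strip('"').split(",")

    # Mutation tokens are kept as raw strings, e.g. "113349S50.0".
    mutations = [t.strip() for t in mutation_text.strip().strip(";").split(",") if t.strip()]

    return {
        "chromosome": chrom,
        "feature": feature,
        "source": source,
        "start": int(start),
        "end": int(end),
        "cDNAs": int(cdnas),
        "strand": strand,
        "height": int(height),
        "attributes": attributes,
        "mutations": mutations,
    }


record = parse_cluster_line(
    'chrI cluster exon 113334 113386 2 - 2 gene_id "INT_0_114,YAL021C"; '
    'gene_name "INT_0_114,CCR4"; # 113349S50.0,113379S100.0;'
)
print(record["height"], record["attributes"]["gene_name"], record["mutations"])
```

Lines starting with # (the header comments) are not records and should be skipped before calling the function.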
Parameter list
File input options:
-f reads.gtf, --input_file=reads.gtf
    Provide the path to your GTF read data file. NOTE: the file has to be correctly sorted! If you used pyReadCounters to generate the file you should be fine. If you modified it, use the sort command described in the manual to sort your file first by chromosome, then by strand and then by start position.
-o clusters.gtf, --output_file=clusters.gtf
    Provide a name for an output file. By default it writes to the standard output.
--gtf=Yourfavoritegtf.gtf
    Type the path to the GTF annotation file that you want to use.
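The required ordering (chromosome, then strand, then start position) can also be produced in Python if the sort command from the manual is not at hand. This is a minimal sketch under stated assumptions: sort_gtf_lines is a hypothetical helper, the file is assumed to be tab-delimited with the strand in column 7 and the start in column 4, and plain string ordering of chromosome names and strands is assumed to be acceptable. The command in the manual remains the reference.

```python
def sort_gtf_lines(lines):
    """Return comment lines first, then records sorted by chromosome, strand, start."""
    comments = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        fields = line.split("\t")
        # Column 1 = chromosome, column 7 = strand, column 4 = start position.
        return (fields[0], fields[6], int(fields[3]))

    return comments + sorted(records, key=sort_key)


# Example usage with hypothetical file names:
# with open("modified_reads.gtf") as handle:
#     sorted_lines = sort_gtf_lines(handle.readlines())
# with open("modified_reads_sorted.gtf", "w") as handle:
#     handle.writelines(sorted_lines)
```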
Common pyCRAC options:
-r 100, --range=100
    Allows you to set the length of the UTR regions. If you set '-r 50' or '--range=50', then the program will set a fixed length (50 bp) regardless of whether the GTF annotation file has genes with annotated UTRs.
-a protein_coding, --annotation=protein_coding
    Select which annotation (e.g. protein_coding, ncRNA, sRNA, rRNA, snoRNA, snRNA, depending on the source of your GTF file) you would like to focus your analysis on. Default = all annotations.
Options for cluster analysis:
--cic=2, --cdnasinclusters=2
    Sets the minimal number of overlapping cDNAs in each cluster. Default = 2.
--co=5, --clusteroverlap=5
    Sets the number of nucleotides cDNA sequences have to overlap to form a cluster. Default = 1 nucleotide.
--ch=5, --clusterheight=5
    Sets the minimal height of the cluster. Default = 2 nucleotides.
--cl=100, --clusterlength=100
    Sets the maximum cluster sequence length.
--mutsfreq=10, --mutationfrequency=10
    Sets the minimal mutation frequency for a cluster position in the GTF output file. Default = 0%. Example: if --mutsfreq is set to 10 and a cluster position is mutated in less than 10% of the reads, then the mutation will not be reported.
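To illustrate how the overlap, cDNA count and height thresholds could interact, here is a simplified Python sketch. It is not the pyClusterReads implementation: cluster_intervals and max_coverage are hypothetical names, the intervals are assumed to come from a single chromosome and strand with inclusive coordinates, height is taken to mean the maximum number of cDNAs covering any one position, and the maximum cluster length option is not modelled.

```python
def max_coverage(intervals):
    """Return the largest number of intervals covering any single position."""
    events = []
    for start, end in intervals:
        events.append((start, 1))      # coverage rises at the start
        events.append((end + 1, -1))   # and falls just past the inclusive end
    depth = best = 0
    # At equal positions the -1 events sort first, so adjacent intervals
    # are not counted as covering the same position.
    for _, delta in sorted(events):
        depth += delta
        best = max(best, depth)
    return best


def cluster_intervals(intervals, min_overlap=1, min_cdnas=2, min_height=2):
    """Group overlapping intervals and keep the groups that pass the thresholds.

    Returns a list of (start, end, number_of_cdnas, height) tuples.
    """
    groups = []
    current, current_end = [], None
    for start, end in sorted(intervals):
        # Number of positions shared with the cluster built so far.
        if current and min(end, current_end) - start + 1 >= min_overlap:
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current, current_end = [(start, end)], end
    if current:
        groups.append(current)

    clusters = []
    for members in groups:
        height = max_coverage(members)
        if len(members) >= min_cdnas and height >= min_height:
            cluster_end = max(end for _, end in members)
            clusters.append((members[0][0], cluster_end, len(members), height))
    return clusters


reads = [(100, 150), (120, 170), (160, 200), (300, 350)]
print(cluster_intervals(reads))
# prints [(100, 200, 3, 2)]
```

In this sketch the isolated interval (300, 350) is dropped because it contains fewer than two cDNAs, which mirrors the role the --cic option is described as playing above.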