pyReadAligner
pyReadAligner is part of the pyCRAC package. It generates multiple sequence alignments for reads mapped to individual genes or genomic regions and writes them to a fasta output file.
Parameter list
File input options:
-f FILE, --input_file=FILE
    Input file in Novoalign native output or SAM format. By default the program expects data from the standard input. Make sure to specify the type of the file you want analyzed using the --file_type option!
-o OUTPUT_FILE, --output_file=OUTPUT_FILE
    Use this flag to override the standard output file names. All alignments will be written to one output file.
-g FILE, --genes_file=FILE
    The name of your gene list file (1 column) or the hittable file.
--chr=FILE
    If you simply want to align reads against a genomic sequence, generate a tab-delimited file containing an identifier, chromosome name, start position, end position and strand, and pass it with this option.
--gtf=annotation_file.gtf
    The path to the GTF annotation file that you want to use.
--tab=tab_file.tab
    The path to the tab file that contains the genomic reference sequence.
--file_type=FILE_TYPE
    Use this option to specify the file type (i.e. 'novo', 'sam', 'gtf'). This tells the program which parsers to use for processing the files. Default = 'novo'
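The column layout expected by --chr can be illustrated with a short Python sketch that writes such a file. The region names, chromosome names and coordinates below are made up, the '+'/'-' strand notation is an assumption, and the sketch does not settle whether coordinates are 0-based or 1-based; check that against your own data.

```python
import csv
import io

# Hypothetical regions: (identifier, chromosome, start, end, strand).
regions = [
    ("region_A", "chrI", 1000, 1500, "+"),
    ("region_B", "chrII", 20000, 20350, "-"),
]


def format_chr_file(rows):
    """Return tab-delimited text with one region per line, in --chr column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for identifier, chromosome, start, end, strand in rows:
        if start > end:
            raise ValueError(f"{identifier}: start is larger than end")
        if strand not in ("+", "-"):  # assumed strand notation
            raise ValueError(f"{identifier}: unexpected strand {strand!r}")
        writer.writerow([identifier, chromosome, start, end, strand])
    return buffer.getvalue()


chr_text = format_chr_file(regions)

# Write the text to a file that can then be passed to --chr.
with open("regions.tab", "w") as handle:
    handle.write(chr_text)
```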
pyReadAligner specific options:
--limit=500
    With this option you can select how many reads mapped to a particular gene/ORF/region you want to count. Default = All
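As a rough illustration of how the file input options and --limit combine, the Python sketch below assembles a command line without running it. The executable name pyReadAligner.py and all file names are assumptions; only the flags themselves are taken from the parameter list above.

```python
import shlex

# Assumption: the script is installed as "pyReadAligner.py".
# All file names are placeholders for your own data.
command = [
    "pyReadAligner.py",
    "-f", "mapped_reads.novo",      # Novoalign native output
    "--file_type=novo",             # tells the program which parser to use
    "--gtf=annotation_file.gtf",    # GTF annotation file
    "--tab=tab_file.tab",           # genomic reference sequence
    "-g", "genes.txt",              # one-column gene list
    "--limit=500",                  # count at most 500 reads per gene/region
    "-o", "alignments.fasta",       # single output file
]

# shlex.join quotes arguments where needed, so the result can be pasted into a shell.
command_line = shlex.join(command)
print(command_line)
```

To actually run the command you could pass the `command` list to `subprocess.run`, once the names match your installation.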
Common options:
--ignorestrand
    This flag tells the program to ignore strand information, so that all overlapping reads will be considered sense reads. Useful for analysing ChIP or RIP data.
--overlap=1
    Sets the number of nucleotides a read has to overlap with a gene before it is considered a hit. Default = 1 nucleotide
-s genomic, --sequence=genomic
    With this option you can select whether you want the reads aligned to the genomic or the coding sequence. Default = genomic
-r 100, --range=100
    Allows you to set the length of the UTR regions. If you set '-r 50' or '--range=50', then the program will set a fixed length (50 bp) regardless of whether the GTF file has genes with annotated UTRs.
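The interplay of --overlap and --ignorestrand can be pictured with the small Python sketch below. It is an illustration of the described behaviour, not pyCRAC's own code: the function names, the dictionary layout and the use of inclusive coordinates are all assumptions.

```python
def overlap_length(read_start, read_end, gene_start, gene_end):
    """Number of positions shared by two intervals (inclusive coordinates assumed)."""
    return max(0, min(read_end, gene_end) - max(read_start, gene_start) + 1)


def classify_read(read, gene, min_overlap=1, ignore_strand=False):
    """Return 'sense', 'anti-sense', or None when the read is not a hit."""
    shared = overlap_length(read["start"], read["end"], gene["start"], gene["end"])
    if shared < min_overlap:
        return None  # too little overlap to count as a hit
    if ignore_strand or read["strand"] == gene["strand"]:
        return "sense"  # with ignore_strand every overlapping read is treated as sense
    return "anti-sense"


# Hypothetical gene and read: they share positions 195..200 (6 nucleotides).
gene = {"start": 100, "end": 200, "strand": "+"}
read = {"start": 195, "end": 220, "strand": "-"}

print(classify_read(read, gene))                      # anti-sense
print(classify_read(read, gene, ignore_strand=True))  # sense
print(classify_read(read, gene, min_overlap=10))      # None
```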
Options for novo, SAM and BAM files:
--align_quality=100, --mapping_quality=100
    With these options you can set the alignment quality (Novoalign) or mapping quality (SAM) threshold. Reads with qualities lower than the threshold will be ignored. Default = 0
--align_score=100
    With this option you can set the alignment score threshold. Reads with alignment scores lower than the threshold will be ignored. Default = 0
-l 100, --length=100
    Sets the read length threshold. Default = 1000
-m 100000, --max=100000
    Maximum number of mapped reads that will be analyzed. Default = All
--unique
    With this option reads with multiple alignment locations will be removed. Default = Off
--blocks
    With this option reads with the same start and end coordinates on a chromosome will only be counted once. Default = Off
--discarded=FILE
    Prints the lines from the alignments file that were discarded by the parsers. This file contains reads that were unmapped (NM), of poor quality (i.e. QC) or paired reads that were mapped to different chromosomal locations or were too far apart on the same chromosome. Useful for debugging purposes.
-d 1000, --distance=1000
    This option allows you to set the maximum number of base-pairs allowed between two non-overlapping paired reads. Default = 1000
--mutations=delsonly
    Use this option to only track mutations that are of interest. For CRAC data this is usually deletions (--mutations=delsonly). For PAR-CLIP data this is usually T-C mutations (--mutations=TC). Other options are: do not report any mutations: --mutations=nomuts. Only report specific base mutations, for example only in T's, C's and G's: --mutations=[TCG]. The brackets are essential. Other nucleotide combinations are also possible.
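To make the effect of the quality threshold and --blocks more concrete, here is a minimal Python sketch of that kind of filtering. It is not the parser pyCRAC uses: the function name, the read dictionaries and the choice to key duplicates on chromosome, start and end only (without strand) are assumptions made for illustration.

```python
def filter_reads(reads, min_quality=0, blocks=False):
    """Drop low-quality reads and, if blocks is set, count identical coordinates once."""
    seen = set()
    kept = []
    for read in reads:
        if read["quality"] < min_quality:
            continue  # below the quality threshold: ignored
        key = (read["chromosome"], read["start"], read["end"])
        if blocks and key in seen:
            continue  # same start and end on this chromosome already counted
        seen.add(key)
        kept.append(read)
    return kept


# Hypothetical reads.
reads = [
    {"name": "r1", "chromosome": "chrI", "start": 100, "end": 150, "quality": 70},
    {"name": "r2", "chromosome": "chrI", "start": 100, "end": 150, "quality": 60},
    {"name": "r3", "chromosome": "chrI", "start": 120, "end": 170, "quality": 10},
    {"name": "r4", "chromosome": "chrII", "start": 100, "end": 150, "quality": 90},
]

kept = filter_reads(reads, min_quality=20, blocks=True)
print([read["name"] for read in kept])  # ['r1', 'r4']
```

Here r2 is dropped because it has the same coordinates as r1, and r3 because its quality is below the threshold.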