Galaxy | Tool Preview

Map with PerM (version 1.1.2)

What it does

PerM is a short-read aligner designed to map long SOLiD reads to a whole genome or transcriptome very quickly. PerM can be fully sensitive to alignments with up to four mismatches and highly sensitive to alignments with more mismatches.

Development team

PerM is developed by Ting Chen's group, Center of Excellence in Genomic Sciences at the University of Southern California. If you have any questions, please email yanghoch at usc.edu or check the project page.

Citation

PerM: Efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 2009, 25 (19): 2514-2521.

Input

The input files are read files and a reference. Users can use the pre-indexed reference in Galaxy or upload their own reference.

The uploaded reference file should be in FASTA format. Multiple sequences, such as transcripts, should be concatenated into a single file, with each sequence preceded by a header line that starts with the ">" character.
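
As a minimal sketch of the layout described above, the following Python writes several sequences into one FASTA stream, each preceded by its own ">" header line. The function name write_multi_fasta, the 60-column wrap width, and the example sequence names are illustrative assumptions; none of them come from PerM or Galaxy.

```python
import io


def write_multi_fasta(records, handle, width=60):
    """Write (name, sequence) pairs as one multi-sequence FASTA stream.

    Each sequence gets its own header line starting with ">".
    The wrap width of 60 is an arbitrary choice for this sketch.
    """
    for name, seq in records:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


# Hypothetical transcript names and sequences, for illustration only.
records = [("transcript_1", "ACGTACGTAC"), ("transcript_2", "GGGTTTAAAC")]
buffer = io.StringIO()
write_multi_fasta(records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text, end="")
```

Writing to an open file handle instead of the in-memory buffer would produce a file suitable for upload as a reference.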

Read files must be in either fastqsanger or fastqcssanger format to be used with PerM. However, several other starting formats can be converted to one of those two: fastq (any type), color-space fastq, fasta, csfasta, or csfasta+qualsolid.

An uploaded base-space fastq file MUST be checked/transformed with FASTQGroomer tools in Galaxy to be converted to the fastqsanger format (this is true even if the original file is in Sanger format).

Uploaded fasta and csfasta files that have no accompanying quality score files can be transformed to fastqsanger by the FASTQGroomer, which adds pseudo quality scores.

An uploaded csfasta + qual pair can also be transformed into fastqcssanger by solid2fastq.

Outputs

The output mapping result is in SAM format, and has the following columns:

  Column  Description
--------  --------------------------------------------------------
 1 QNAME  Query (pair) NAME
 2 FLAG   bitwise FLAG
 3 RNAME  Reference sequence NAME
 4 POS    1-based leftmost POSition/coordinate of clipped sequence
 5 MAPQ   MAPping Quality (Phred-scaled)
 6 CIGAR  extended CIGAR string
 7 MRNM   Mate Reference sequence NaMe ('=' if same as RNAME)
 8 MPOS   1-based Mate POSition
 9 ISIZE  Inferred insert SIZE
10 SEQ    query SEQuence on the same strand as the reference
11 QUAL   query QUALity (ASCII-33 gives the Phred base quality)
12 OPT    variable OPTional fields in the format TAG:VTYPE:VALUE
12.1 NM   Number of mismatches (SOLiD-specific)
12.2 CS   Reads in color space (SOLiD-specific)
12.3 CQ   Base quality in color space (SOLiD-specific)

The flags are as follows:

  Flag  Description
------  -------------------------------------
0x0001  the read is paired in sequencing
0x0002  the read is mapped in a proper pair
0x0004  the query sequence itself is unmapped
0x0008  the mate is unmapped
0x0010  strand of the query (1 for reverse)
0x0020  strand of the mate
0x0040  the read is the first read in a pair
0x0080  the read is the second read in a pair
0x0100  the alignment is not primary

Here is some sample output:

QNAME   FLAG    RNAME   POS     MAPQ    CIGAR   MRNM    MPOS    ISIZE   SEQ     QUAL    NM      CS      CQ
491_28_332_F3   16      ref-1   282734  255     35M     *       0       0       AGTCAAACTCCGAATGCCAATGACTTATCCTTAGG    #%%%%%%%!!%%%!!%%%%%%%%!!%%%%%%%%%%      NM:i:3  CS:Z:C0230202330012130103100230121001212        CQ:Z:###################################
491_28_332_F3   16      ref-1   269436  255     35M     *       0       0       AGTCAAACTCCGAATGCCAATGACTTATCCTTAGG    #%%%%%%%!!%%%!!%%%%%%%%!!%%%%%%%%%%      NM:i:3  CS:Z:C0230202330012130103100230121001212        CQ:Z:###################################
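
To show how the columns map onto a record, here is a minimal Python sketch that splits one alignment line into named fields, parses the optional TAG:VTYPE:VALUE fields, and converts QUAL characters to Phred scores by subtracting 33. It assumes the line is tab-separated, as SAM files are; the function names parse_sam_line and phred_scores are made up for this example and are not part of PerM or Galaxy.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_COLUMNS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_COLUMNS, fields):
        record[name] = int(value) if name in INT_COLUMNS else value
    # Optional fields use TAG:VTYPE:VALUE; VTYPE "i" marks an integer.
    tags = {}
    for field in fields[len(SAM_COLUMNS):]:
        tag, vtype, value = field.split(":", 2)
        tags[tag] = int(value) if vtype == "i" else value
    record["TAGS"] = tags
    return record


def phred_scores(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in qual]


# First record of the sample output, rebuilt with explicit tabs.
sample = "\t".join([
    "491_28_332_F3", "16", "ref-1", "282734", "255", "35M", "*", "0", "0",
    "AGTCAAACTCCGAATGCCAATGACTTATCCTTAGG",
    "#%%%%%%%!!%%%!!%%%%%%%%!!%%%%%%%%%%",
    "NM:i:3",
    "CS:Z:C0230202330012130103100230121001212",
    "CQ:Z:###################################",
])
record = parse_sam_line(sample)
print(record["RNAME"], record["POS"], record["TAGS"]["NM"])
print(phred_scores(record["QUAL"])[:3])
```

Header lines, which start with "@" in SAM files, are not handled by this sketch and would need to be skipped before parsing.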

The user can check a checkbox for optional output containing the unmapped reads in fastqsanger or fastqcssanger format. This output is produced by default.

PerM parameter list

Below is a list of PerM command-line options. Not all of them are relevant to Galaxy's implementation, but they are included for completeness.

The command for single-end:

PerM [ref_or_index] [read] [options]

The command for paired-end:

PerM [ref_or_index] -1 [read1] -2 [read2] [options]

The command-line options:

-A                Output all alignments within the given mismatch threshold, end-to-end.
-B                Output best alignments in terms of mismatches in the given mismatch threshold. [Default]
-E                Output only the uniquely mapped reads in the given mismatch threshold.
-m                Create the reference index, without reusing the saved index.
-s PATH           Save the reference index to accelerate the mapping in the future. If PATH is not specified, the default path will be used.
-v INT            Where INT is the number of mismatches allowed in one end. [Default=2]
-T INT            Where INT is the length to truncate reads to. For example, 30 means only the first 30 bases (signals) are used. Leave blank to use the full read.
-o PATH           Where PATH is the output file for the mapping of one read set. PerM writes output in .mapping or .sam format, determined by the file extension of PATH. Ex: -o out.sam will output in SAM format; -o out.mapping will output in .mapping format.
-d PATH           Where PATH is the directory for multiple read sets.
-u PATH           Write the unmapped reads in fastq format to the file at PATH.
--noSamHeader     Print no SAM header so it is convenient to concatenate multiple SAM output files.
--includeReadsWN  Encodes N or "." with A or 3, respectively.
--statsOnly       Output the mapping statistics in stdout only, without saving alignments to files.
--ignoreQS        Ignore the quality scores in fastq or QUAL files.
--seed {F2 | S11 | F3 | F4}    Specify the seed pattern, which has a specific full sensitivity. Check the algorithm page (link below) for seed patterns to balance the sensitivity and running time.
--readFormat {fasta | fastq | csfasta | csfastq}    Read the reads in the specified format, instead of guessing from the file extension.
--delimiter CHAR  Where CHAR is the delimiter that separates the read id from the additional info in the ">" line of a fasta or csfasta file.

Paired reads options:

-e        Exclude ambiguous pairs.
-L INT    Lower bound on mate-pair separation.
-U INT    Upper bound on mate-pair separation.
-1 PATH   The forward reads file path.
-2 PATH   The reverse reads file path.
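
As a rough illustration of how the single-end and paired-end forms fit together, the sketch below assembles a PerM argument list in Python using only options listed above. The function build_perm_command, its parameter names, and the file names in the example are hypothetical; the sketch also assumes that options may follow the read arguments in any order, as the "[options]" placeholder in the synopsis suggests. Nothing is executed.

```python
def build_perm_command(reference, reads, reads2=None, mode=None,
                       mismatches=None, seed=None, output=None,
                       unmapped=None):
    """Assemble a PerM argument list from the documented options.

    Only the list is built; running it (for example with
    subprocess.run) is left to the caller.
    """
    cmd = ["PerM", reference]
    if reads2 is None:
        cmd.append(reads)                     # single-end form
    else:
        cmd += ["-1", reads, "-2", reads2]    # paired-end form
    if mode is not None:
        if mode not in ("-A", "-B", "-E"):
            raise ValueError("mode must be -A, -B, or -E")
        cmd.append(mode)
    if mismatches is not None:
        cmd += ["-v", str(mismatches)]
    if seed is not None:
        if seed not in ("F2", "S11", "F3", "F4"):
            raise ValueError("seed must be F2, S11, F3, or F4")
        cmd += ["--seed", seed]
    if output is not None:
        cmd += ["-o", output]                 # extension selects the format
    if unmapped is not None:
        cmd += ["-u", unmapped]
    return cmd


# Hypothetical file names, for illustration only.
single = build_perm_command("ref.fa", "reads.csfastq", mode="-E",
                            mismatches=3, output="out.sam")
paired = build_perm_command("ref.fa", "r1.fastq", reads2="r2.fastq",
                            seed="F3", output="out.sam")
print(" ".join(single))
print(" ".join(paired))
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.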

See the PerM algorithm page for information on algorithms and seeds.