
Pileup-to-Interval (version 1.0.3)
See "Types of pileup datasets" below for examples

What it does

Reduces the size of a result set by taking a pileup file and producing a condensed version that lists consecutive runs of bases meeting the coverage criteria. The tool works on the six- and ten-column pileup formats produced by the samtools pileup command. It assumes that the input dataset was produced by that command, although you can override this by setting the column assignments manually.


Types of pileup datasets

The description of the pileup format below is largely based on information from the SAMTools documentation page. The 6- and 10-column variants are described below.

Six column pileup:

   1    2  3  4        5        6
---------------------------------
chrM  412  A  2       .,       II
chrM  413  G  4     ..t,     IIIH
chrM  414  C  4     ...a     III2
chrM  415  C  4     TTTt     III7

where:

Column Definition
------ ----------------------------
     1 Chromosome
     2 Position (1-based)
     3 Reference base at that position
     4 Coverage (# reads aligning over that position)
     5 Bases within reads (see Galaxy wiki for more info)
     6 Quality values (phred33 scale, see Galaxy wiki for more info)

Ten column pileup:

The ten-column pileup incorporates additional consensus information generated with the -c option of the samtools pileup command:

   1    2  3  4   5   6   7   8       9       10
------------------------------------------------
chrM  412  A  A  75   0  25  2       .,       II
chrM  413  G  G  72   0  25  4     ..t,     IIIH
chrM  414  C  C  75   0  25  4     ...a     III2
chrM  415  C  T  75  75  25  4     TTTt     III7

where:

 Column Definition
------- ----------------------------
      1 Chromosome
      2 Position (1-based)
      3 Reference base at that position
      4 Consensus bases
      5 Consensus quality
      6 SNP quality
      7 Maximum mapping quality
      8 Coverage (# reads aligning over that position)
      9 Bases within reads (see Galaxy wiki for more info)
     10 Quality values (phred33 scale, see Galaxy wiki for more info)
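
For a program reading these files, the practical difference between the two layouts is where the coverage value sits: column 4 in the six-column layout and column 8 in the ten-column one. The sketch below illustrates that by choosing the layout from the number of fields. The names PILEUP_LAYOUTS and parse_pileup_line are hypothetical and are not part of the tool, and the sketch assumes whitespace-delimited lines with no manual column overrides.

```python
# Hypothetical lookup table: 1-based column numbers, as in the tables above,
# keyed by the total number of columns in a pileup line.
PILEUP_LAYOUTS = {
    6: {"chrom": 1, "pos": 2, "ref": 3, "coverage": 4},
    10: {"chrom": 1, "pos": 2, "ref": 3, "coverage": 8},
}


def parse_pileup_line(line):
    """Return (chromosome, position, reference base, coverage) for one line."""
    fields = line.split()  # assumption: columns are separated by whitespace
    layout = PILEUP_LAYOUTS.get(len(fields))
    if layout is None:
        raise ValueError(f"expected 6 or 10 columns, got {len(fields)}")
    return (
        fields[layout["chrom"] - 1],
        int(fields[layout["pos"] - 1]),   # position is 1-based in pileup
        fields[layout["ref"] - 1],
        int(fields[layout["coverage"] - 1]),
    )


six = parse_pileup_line("chrM  413  G  4     ..t,     IIIH")
ten = parse_pileup_line("chrM  413  G  G  72   0  25  4     ..t,     IIIH")
print(six == ten)  # both layouts give the same four values for this position
```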

The output format

The output file condenses the information in the pileup file so that consecutive bases are listed together as sequences. Each output line gives the start and end of one such run. The start is converted to a 0-based value while the end stays 1-based, so the end minus the start equals the number of bases in the run.

Given the following input with minimum coverage set to 3:

   1    2  3  4        5        6
---------------------------------
chr1  112  G  3     ..Ta     III6
chr1  113  T  2     aT..     III5
chr1  114  A  5     ,,..     IIH2
chr1  115  C  4      ,.,      III
chrM  412  A  2       .,       II
chrM  413  G  4     ..t,     IIIH
chrM  414  C  4     ...a     III2
chrM  415  C  4     TTTt     III7
chrM  490  T  3        a        I

the following would be the output:

   1    2    3  4
-------------------
chr1  111  112  G
chr1  113  115  AC
chrM  412  415  GCC
chrM  489  490  T

where:

 Column Definition
------- ----------------------------
      1 Chromosome
      2 Starting position (0-based)
      3 Ending position (1-based)
      4 Sequence of bases
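
The condensing step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name condense is hypothetical, the input is assumed to be sorted by chromosome and position, and a position is treated as passing when its coverage is greater than or equal to the minimum (which matches the example above, where coverage 3 passes a minimum of 3).

```python
def condense(records, min_coverage):
    """Merge consecutive passing positions into (chrom, start0, end, bases).

    records: iterable of (chrom, pos, ref_base, coverage), pos being 1-based.
    """
    intervals = []
    current = None  # open interval as [chrom, start0, end, bases]
    for chrom, pos, ref, coverage in records:
        if coverage < min_coverage:
            continue  # skipped positions leave a gap that ends the open run
        if current and current[0] == chrom and current[2] == pos - 1:
            # Directly follows the open interval: extend it by one base.
            current[2] = pos
            current[3] += ref
        else:
            if current:
                intervals.append(tuple(current))
            # New interval: 0-based start, 1-based end.
            current = [chrom, pos - 1, pos, ref]
    if current:
        intervals.append(tuple(current))
    return intervals


# Chromosome, position, reference base and coverage from the example input.
records = [
    ("chr1", 112, "G", 3), ("chr1", 113, "T", 2), ("chr1", 114, "A", 5),
    ("chr1", 115, "C", 4), ("chrM", 412, "A", 2), ("chrM", 413, "G", 4),
    ("chrM", 414, "C", 4), ("chrM", 415, "C", 4), ("chrM", 490, "T", 3),
]

for chrom, start0, end, bases in condense(records, min_coverage=3):
    print(chrom, start0, end, bases, sep="\t")
```

Run on the example input with a minimum coverage of 3, the sketch yields the same four intervals as the output table above.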