view make_families.xml @ 5:4bc49a5769ee draft

Version 0.5: Split interleaved SSCS output file into two paired files.
author nick
date Thu, 01 Dec 2016 23:22:52 -0500
parents 7f513b9b1b1e
children 9a0bee12b583
line wrap: on
line source

<?xml version="1.0"?>
<tool id="make_families" name="Du Novo: Make families" version="0.5">
  <description>of duplex sequencing reads</description>
  <requirements>
    <requirement type="package" version="0.5">duplex</requirement>
    <requirement type="set_environment">DUPLEX_DIR</requirement>
  </requirements>
  <!-- TODO: Add dependency on coreutils to get paste? -->
  <command>paste $fastq1 $fastq2
    | paste - - - -
    | awk -f \$DUPLEX_DIR/make-barcodes.awk -v TAG_LEN=$taglen -v INVARIANT=$invariant
    | sort
    &gt; $output
  </command>
  <inputs>
    <param name="fastq1" type="data" format="fastq" label="Sequencing reads, mate 1"/>
    <param name="fastq2" type="data" format="fastq" label="Sequencing reads, mate 2"/>
    <param name="taglen" type="integer" value="12" min="0" label="Tag length" help="length of each random barcode on the ends of the fragments"/>
    <param name="invariant" type="integer" value="5" min="0" label="Invariant sequence length" help="length of the sequence between the tag and actual sample sequence (the restriction site, normally)"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <tests>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="1"/>
      <output name="output" file="smoke.families.tsv"/>
    </test>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="0"/>
      <output name="output" file="smoke.families.i0.tsv"/>
    </test>
  </tests>
  <help>

**What it does**

This tool is for processing raw duplex sequencing data, removing the barcodes and grouping by them into families of reads from the same fragment.

-----

**Output**

The output will be a tabular file where each line corresponds to a pair of input reads.

The columns are::

  1: barcode (both tags joined and ordered)
  2: tag order in barcode ("ab" or "ba")
  3: read1 name
  4: read1 sequence (minus the tag and invariant sequences)
  5: read1 quality scores (minus the same tag and invariant)
  6: read2 name
  7: read2 sequence (minus the tag and invariant sequences)
  8: read2 quality scores (minus the same tag and invariant)

-----

**Barcode creation**

For each pair, the tool will remove the tag at the beginning of each read and create a barcode by concatenating the two tags. The order of the tags is determined by a string comparison so that it will make an identical barcode from pairs of either order. The original tag order will be noted in the second column.

Since pairs from opposite strands will have the same tags, but in the reverse order, this produces the same barcode for reads from the same fragment, regardless of strand. Then a simple sort will group all reads from the same strand together, separated into strands by the different "order" values.

Examples::

  +---------------+-----------------+
  |  input tags   |     output      |
  +-------+-------+-------+---------+
  | read1 | read2 | order | barcode |
  +-------+-------+-------+---------+
  |  ATG  |  CCT  |  ab   | ATGCCT  |
  +-------+-------+-------+---------+
  |  CCT  |  ATG  |  ba   | ATGCCT  |
  +-------+-------+-------+---------+

    </help>
</tool>