view tools/spades_2_5/plot_spades_stats.xml @ 3:d82f18c76309 draft

Uploaded spades wrapper 0.6. Supports spades 2.5.1. Also removes the need for the hack at installation time, by fixing the problem with input files. Shows the license for the tool as well.
author lionelguy
date Thu, 12 Sep 2013 07:46:54 -0400
parents b5ce24f34dd7
children
line wrap: on
line source

<tool id="plot_spades_stats" name="SPAdes stats" version="0.1">
  <description>coverage vs. length plot</description>
  <requirements>
    <requirement type="package">R</requirement>  
  </requirements>
  <command interpreter="bash">r_wrapper.sh $script_file</command>

  <inputs>
    <param name="input_scaffolds" type="data" format="tabular" label="Scaffold stats"/>
    <param name="input_contigs" type="data" format="tabular" label="Contig stats"/>
    <param name="length_co" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs with length under that value are shown in red"/>
    <param name="coverage_co" type="integer" value="10" min="0" label="Coverage cut-off" help="Contigs with length under that value are shown in red"/>
  </inputs>
  <configfiles>
    <configfile name="script_file">
## Setup R error handling to go to stderr
options( show.error.messages=F, 
  error = function () { 
    cat( geterrmessage(), file=stderr() ); q( "no", 1, F ) 
} )
files = c("${input_contigs}", "${input_scaffolds}")
types = c("Contigs", "Scaffolds")

## Start plotting device
png("${out_file}", w=500, h=1000)
par(mfrow=c(2,1))

## Loop over the two files
for (i in 1:length(types)){
  seqs = read.table(files[i], header=FALSE, comment.char="#")
  colnames = c("name", "length", "coverage")
  names(seqs) = colnames

  ## Stats over all sequences 
  sl_all = sort(seqs\$length, decreasing=TRUE)
  cs_all = cumsum(sl_all)
  s_all = sum(seqs\$length)
  n50_idx_all = which.min(sl_all[cs_all &lt; 0.5*s_all])
  n90_idx_all = which.min(sl_all[cs_all &lt; 0.9*s_all])
  n50_all = sl_all[n50_idx_all]
  n90_all = sl_all[n90_idx_all]
  
  ## Filter short seqs, redo stats
  seqs_filt = seqs[seqs\$length >= ${length_co} &amp; seqs\$coverage >= ${coverage_co},]
  if (nrow(seqs_filt) > 0){
    sl_filt = sort(seqs_filt\$length, decreasing=TRUE)
    cs_filt = cumsum(sl_filt)
    s_filt = sum(seqs_filt\$length)
    n50_idx_filt = which.min(sl_filt[cs_filt &lt; 0.5*s_filt])
    n90_idx_filt = which.min(sl_filt[cs_filt &lt; 0.9*s_filt])
    n50_filt = sl_filt[n50_idx_filt]
    n90_filt = sl_filt[n90_idx_filt]
  }
  seqs_bad = seqs[seqs\$length &lt; ${length_co} | seqs\$coverage &lt; ${coverage_co},]

  ## Length vs coverage
  plot(length~coverage, data=seqs, log="xy", type="n", main=paste(types[i], ": coverage vs. length", sep=""), xlab="Coverage", ylab="Length")
  if (nrow(seqs_bad) > 0){
    points(length~coverage, data=seqs_bad, cex=0.5, col="red")
  }
  if (nrow(seqs_filt) > 0){
    points(length~coverage, data=seqs_filt, cex=0.5, col="black")
  }
  abline(v=${coverage_co}, h=${length_co}, lty=2, col=grey(0.3))
  legend(x="topleft", legend=c("Before/after filtering", paste(c("N50: ", "N90: ", "Median cov.: "), c(n50_all, n90_all, round(median(seqs\$coverage))), rep("/", 3), c(n50_filt, n90_filt, round(median(seqs_filt\$coverage))), sep="")), cex=0.8)
}
dev.off()
    </configfile>
  </configfiles>
  <outputs>
    <data format="png" name="out_file" />
  </outputs>
  <help>
**What it does**

Using the output of SPAdes (a pair of fasta file and stat file for each of the contigs and scaffolds), it produces a coverage vs. contig plot. Each dot represent a contig/scaffold. Given a coverage and a length cutoff, sequences that do not meet those criteria are shown in red. Some statistics are also given (N50, N90, median contig/scaffold length) both before and after filtering.

Use the "filter SPAdes output" tool to actually filter sequences.
  </help>
</tool>