
Changeset 13:7aaa9bc23e3c (2014-10-14)

Previous changeset 12:ebb4cb1e8e35 (2014-10-01) Next changeset 14:44130e484a97 (2014-11-19)

Commit message:
Added support for paired end reads
- Changed terminology to generalise to sgRNA CRISPR experiments.
- Added option to include second factor for statistical power
- Added option to filter out samples with low counts
- Added support for paired end reads
- Added option to highlight only positive or negative fold change in smear plot
- Fixed bug that caused tool to stop if more than enough sample annotations were supplied

modified:
hairpinTool.R
hairpinTool.xml

diff -r ebb4cb1e8e35 -r 7aaa9bc23e3c hairpinTool.R
--- a/hairpinTool.R Wed Oct 01 16:00:43 2014 +1000
+++ b/hairpinTool.R Tue Oct 14 17:05:07 2014 +1100


@@ -1,45 +1,54 @@
 # ARGS: 1.inputType -String specifying format of input (fastq or table)
-# IF inputType is "fastQ":
+# IF inputType is "fastq" or "pairedFastq:
 # 2*.fastqPath -One or more strings specifying path to fastq files
-# 2.annoPath -String specifying path to hairpin annotation table
+# 2.annoPath -String specifying path to hairpin annotation table
 # 3.samplePath -String specifying path to sample annotation table
 # 4.barStart -Integer specifying starting position of barcode
 # 5.barEnd -Integer specifying ending position of barcode
-# 6.hpStart -Integer specifying startins position of hairpin
+# ### 
+# IF inputType is "pairedFastq":
+# 6.barStartRev -Integer specifying starting position of barcode
+# on reverse end
+# 7.barEndRev -Integer specifying ending position of barcode
+# on reverse end
+# ### 
+# 8.hpStart -Integer specifying startins position of hairpin
 # unique region
-# 7.hpEnd -Integer specifying ending position of hairpin
+# 9.hpEnd -Integer specifying ending position of hairpin
 # unique region
-# ### 
 # IF inputType is "counts":
 # 2.countPath -String specifying path to count table
 # 3.annoPath -String specifying path to hairpin annotation table
 # 4.samplePath -String specifying path to sample annotation table
 # ###
-# 8.cpmReq -Float specifying cpm requirement
-# 9.sampleReq -Integer specifying cpm requirement
-# 10.fdrThresh -Float specifying the FDR requirement
-# 11.lfcThresh -Float specifying the log-fold-change requirement
-# 12.workMode -String specifying exact test or GLM usage
-# 13.htmlPath -String specifying path to HTML file
-# 14.folderPath -STring specifying path to folder for output
+# 10.secFactName -String specifying name of secondary factor
+# 11.cpmReq -Float specifying cpm requirement
+# 12.sampleReq -Integer specifying cpm requirement
+# 13.readReq -Integer specifying read requirement
+# 14.fdrThresh -Float specifying the FDR requirement
+# 15.lfcThresh -Float specifying the log-fold-change requirement
+# 16.workMode -String specifying exact test or GLM usage
+# 17.htmlPath -String specifying path to HTML file
+# 18.folderPath -String specifying path to folder for output
 # IF workMode is "classic" (exact test)
-# 15.pairData[2] -String specifying first group for exact test
-# 16.pairData[1] -String specifying second group for exact test
+# 19.pairData[2] -String specifying first group for exact test
+# 20.pairData[1] -String specifying second group for exact test
 # ###
 # IF workMode is "glm"
-# 15.contrastData -String specifying contrasts to be made
-# 16.roastOpt -String specifying usage of gene-wise tests
-# 17.hairpinReq -String specifying hairpin requirement for gene-
+# 19.contrastData -String specifying contrasts to be made
+# 20.roastOpt -String specifying usage of gene-wise tests
+# 21.hairpinReq -String specifying hairpin requirement for gene-
 # wise test
-# 18.selectOpt -String specifying type of selection for barcode
+# 22.selectOpt -String specifying type of selection for barcode
 # plots
-# 19.selectVals -String specifying members selected for barcode
+# 23.selectVals -String specifying members selected for barcode
 # plots
 # 
[...]
 } else {
 selectedGenes <- selectVals
@@ -668,9 +822,13 @@
 # Generate data frame of the significant differences
 sigDiff <- data.frame(Up=upCount, Flat=flatCount, Down=downCount)
 if (workMode == "glm") {
+
 row.names(sigDiff) <- contrastData
+
 } else if (workMode == "classic") {
+
 row.names(sigDiff) <- paste0(pairData[2], "-", pairData[1])
+
 }
 
 # Output table of summarised counts
@@ -702,7 +860,8 @@
 cata("<body>\n")
 cata("<h3>EdgeR Analysis Output:</h3>\n")
 cata("<h4>Input Summary:</h4>\n")
-if (inputType=="fastq") {
+if (inputType == "fastq" || inputType == "pairedFastq") {
+
 cata("<ul>\n")
 ListItem(hpReadout[1])
 ListItem(hpReadout[2])
@@ -716,31 +875,51 @@
 cata("<br />\n")
 cata("<b>Please check that read percentages are consistent with ")
 cata("expectations.</b><br >\n")
-} else if (inputType=="counts") {
+
+} else if (inputType == "counts") {
+
 cata("<ul>\n")
 ListItem("Number of Samples: ", ncol(data$counts))
 ListItem("Number of Hairpins: ", countsRows)
 ListItem("Number of annotations provided: ", annoRows)
 ListItem("Number of annotations matched to hairpin: ", annoMatched)
 cata("</ul>\n")
+
 }
 
 cata("The estimated common biological coefficient of variation (BCV) is: ", 
 commonBCV, "<br />\n")
 
+if (secFactName == "none") {
+
+ cata("No secondary factor specified.<br />\n")
+
+} else {
+
+ cata("Secondary factor specified as: ", secFactName, "<br />\n")
+
+}
+
 cata("<h4>Output:</h4>\n")
 cata("PDF copies of JPEGS available in 'Plots' section.<br />\n")
 for (i in 1:nrow(imageData)) {
 if (grepl("barcode", imageData$Link[i])) {
+
 if (packageVersion("limma")<"3.19.19") {
+
 HtmlImage(imageData$Link[i], imageData$Label[i], 
 height=length(selectedGenes)*150)
+
 } else {
+
 HtmlImage(imageData$Link[i], imageData$Label[i], 
 height=length(selectedGenes)*300)
+
 }
 } else {
+
 HtmlImage(imageData$Link[i], imageData$Label[i])
+
 }
 }
 cata("<br />\n")
@@ -779,26 +958,42 @@
 }
 
 cata("<p>Alt-click links to download file.</p>\n")
-cata("<p>Click floppy disc icon associated history item to download ")
+cata("<p>Click floppy disc icon on associated history item to download ")
 cata("all files.</p>\n")
 cata("<p>.tsv files can be viewed in Excel or any spreadsheet program.</p>\n")
 
 cata("<h4>Additional Information:</h4>\n")
 
 if (inputType == "fastq") {
+
 ListItem("Data was gathered from fastq raw read file(s).")
+
 } else if (inputType == "counts") {
+
 ListItem("Data was gathered from a table of counts.")
+
 }
 
-if (cpmReq!=0 && sampleReq!=0) {
- tempStr <- paste("Hairpins without more than", cpmReq,
+if (cpmReq != 0 && sampleReq != 0) {
+ tempStr <- paste("Target sequences without more than", cpmReq,
 "CPM in at least", sampleReq, "samples are insignificant",
 "and filtered out.")
 ListItem(tempStr)
+
 filterProp <- round(filteredCount/preFilterCount*100, digits=2)
 tempStr <- paste0(filteredCount, " of ", preFilterCount," (", filterProp,
- "%) hairpins were filtered out for low count-per-million.")
+ "%) target sequences were filtered out for low ",
+ "count-per-million.")
+ ListItem(tempStr)
+}
+
+if (sampleReq != 0) {
+ tempStr <- paste("Samples that did not produce more than", sampleReq,
+ "counts were filtered out.")
+ ListItem(tempStr)
+
+ tempStr <- paste0(sampleFilterCount, " samples were filtered out for low ",
+ "counts.")
 ListItem(tempStr)
 }
 
@@ -809,9 +1004,9 @@
 }
 
 if (workMode == "classic") {
- ListItem("An exact test was performed on each hairpin.")
+ ListItem("An exact test was performed on each target sequence.")
 } else if (workMode == "glm") {
- ListItem("A generalised linear model was fitted to each hairpin.")
+ ListItem("A generalised linear model was fitted to each target sequence.")
 }
 
 cit <- character()
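The header comment in the hairpinTool.R diff above describes a positional argument layout in which the input-specific block always occupies positions 2 to 9, with zero placeholders in slots that a given input type does not use. As a rough illustration only, the Python sketch below maps an argument vector to the slot names from that comment. It is not part of the changeset (the tool itself is an R script). The function name `parse_input_args`, the `commonArgs` key, the assumption that `fastq::` and `fastqRev::` entries are stripped before positions are counted, and the sample file names are all hypothetical.

```python
# Illustrative sketch only; the changeset's tool is an R script.
# Slot names follow the header comment in the hairpinTool.R diff.

# Prefixes that the XML command block attaches to read files.
PREFIXES = ("fastq::", "fastqRev::")

FASTQ_SLOTS = ["annoPath", "samplePath", "barStart", "barEnd",
               "barStartRev", "barEndRev", "hpStart", "hpEnd"]
COUNT_SLOTS = ["countPath", "annoPath", "samplePath"]
INT_SLOTS = {"barStart", "barEnd", "barStartRev", "barEndRev",
             "hpStart", "hpEnd"}


def parse_input_args(argv):
    """Name the input-specific arguments (positions 1 to 9 of the header)."""
    input_type, rest = argv[0], argv[1:]

    # Assumption: prefixed read files are removed before positions are counted.
    fastq = [a[len("fastq::"):] for a in rest if a.startswith("fastq::")]
    fastq_rev = [a[len("fastqRev::"):] for a in rest
                 if a.startswith("fastqRev::")]
    positional = [a for a in rest if not a.startswith(PREFIXES)]

    if input_type in ("fastq", "pairedFastq"):
        slots = FASTQ_SLOTS
    elif input_type == "counts":
        slots = COUNT_SLOTS
    else:
        raise ValueError("unknown inputType: " + input_type)

    parsed = {"inputType": input_type,
              "fastqPath": fastq,
              "fastqRevPath": fastq_rev}
    for name, value in zip(slots, positional):
        parsed[name] = int(value) if name in INT_SLOTS else value

    # Positions 2 to 9 are always filled (0 placeholders where unused), so
    # everything after the first eight positional values is shared by all modes.
    parsed["commonArgs"] = positional[8:]
    return parsed


# Hypothetical paired-end invocation with the default positions from the XML.
example = parse_input_args([
    "pairedFastq", "fastq::r1.fq", "fastqRev::r2.fq",
    "anno.txt", "samples.txt", "1", "5", "0", "0", "37", "57", "none",
])
print(example["barStartRev"], example["hpStart"], example["commonArgs"])
```

Padding every input type to the same eight slots keeps the later arguments (from `secFactName` onward) at fixed positions, which is why the `counts` branch of the wrapper passes five zeros after its three paths.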

diff -r ebb4cb1e8e35 -r 7aaa9bc23e3c hairpinTool.xml
--- a/hairpinTool.xml Wed Oct 01 16:00:43 2014 +1000
+++ b/hairpinTool.xml Tue Oct 14 17:05:07 2014 +1100

@@ -1,12 +1,13 @@
-<tool id="shRNAseq" name="shRNAseq Tool" version="1.0.13">
+<tool id="shRNAseq" name="shRNAseq Tool" version="1.2.0">
 <description>
- Analyse hairpin differential representation using edgeR
+ Analyse differential representation for shRNAseq and sgRNA based procedures
+ using edgeR package from Bioconductor.
 </description>
 
 <requirements>
- <requirement type="R-module" version="3.6.2">edgeR</requirement>
- <requirement type="R-module" version="3.20.7">limma</requirement>
- <requirement type="package" version="3.0.3">R_3_0_3</requirement>
+ <requirement type="R-module" version="3.7.17">edgeR</requirement>
+ <requirement type="R-module" version="3.21.16">limma</requirement>
+ <requirement type="package" version="3.1.1">R_3_0_3</requirement>
 </requirements>
 
 <stdio>
@@ -14,43 +15,90 @@
 </stdio>
 
 <command interpreter="Rscript">
- hairpinTool.R $inputOpt.inputType
+ ampliconTool.R $inputOpt.inputType
 #if $inputOpt.inputType=="fastq":
+
 #for $i, $fas in enumerate($inputOpt.fastq):
 fastq::$fas.file
 #end for
 
 $inputOpt.hairpin
 $inputOpt.samples
+
+ #if $inputOpt.positions.posOption=="yes":
+ $inputOpt.positions.barstart
+ $inputOpt.positions.barend
+ 0
+ 0
+ $inputOpt.positions.hpstart
+ $inputOpt.positions.hpend
+ #else:
+ 1
+ 5
+ 0
+ 0
+ 37
+ 57
+ #end if
+ #elif $inputOpt.inputType=="pairedFastq":
+
+ #for $i, $fas in enumerate($inputOpt.fastq):
+ fastq::$fas.file
+ #end for
+
+ #for $i, $fas in enumerate($inputOpt.fastq):
+ fastqRev::$fas.fileRev
+ #end for
+ 
+ $inputOpt.hairpin
+ $inputOpt.samples
 
 #if $inputOpt.positions.posOption=="yes":
 $inputOpt.positions.barstart
 $inputOpt.positions.barend
+ $inputOpt.positions.barstartRev
+ $inputOpt.positions.barendRev
 $inputOpt.positions.hpstart
 $inputOpt.positions.hpend
 #else:
 1
 5
+ 0
+ 0
 37
 57
 #end if
- #else:
+
+ #elif $inputOpt.inputType=="counts":
 $inputOpt.counts
 $inputOpt.hairpin
 $inputOpt.samples
- 0 0 0
+ 0
+ 0
+ 0
+ 0
+ 0
 #end if
- 
+ 
+ #if $inputOpt.secondaryFactor.secFactorOpt=="yes":
+ $inputOpt.secondaryFactor.secFactName
+ #else:
+ "none"
+ #end if
+
 #if $filterCPM.filtOption=="yes":
 $filterCPM.cpmReq
 $filterCPM.sampleReq
+ $filterCPM.readReq
 #else:
 -Inf
 -Inf
+ -Inf
 #end if
 
 $fdr
 $lfc
+ $direction
 $workMode.mode
 $outFile
 $outFile.files_path
@@ -61,6 +109,7 @@
 #elif $workMode.mode=="glm":
 "$workMode.contrast"
 $workMode.roast.roastOption
+
 #if $workMode.roast.roastOption=="yes":
[...]et 
+ sequence. Simple and fast for straightforward comparisons. In this option you
+ will have the option of "*Compare* x *To* y" which implicitly subtracts the 
+ data from y from that of x to produce the comparison.
 
- * **Generalised Linear Model:** This allow for complex contrasts to be specified
- and also gene level analysis to be performed. If this option is chosen then
- contrasts must be explicitly stated in equations and multiple contrasts can be
- made. In addition there will be the option to analyse hairpins on a per-gene
- basis to see if hairpins belonging to a particular gene have any overall
- tendencies for the direction of their log-fold-change.
+ * **Generalised Linear Model:** This allow for complex contrasts to be specified 
+ and also gene level analysis to be performed. If this option is chosen then 
+ contrasts must be explicitly stated in equations and multiple contrasts can 
+ be made. In addition there will be the option to analyse hairpins/sgRNA on a 
+ per-gene basis to see if hairpins/sgRNA belonging to a particular gene have 
+ any overall tendencies for the direction of their log-fold-change.
 
 **FDR Threshold:**
-The smear plot in the output will have hairpins highlighted to signify
+The smear plot in the output will have hairpins/sgRNA highlighted to signify
 significant differential representation. The significance is determined by
 contorlling the false discovery rate, only those with a FDR lower than the
 threshold will be highlighted in the plot.
@@ -379,10 +588,10 @@
 using. The methodology articles are listed in Section 2.1 of the limma 
 User's Guide.
 
-	* Smyth, GK (2005). Limma: linear models for microarray data. In: 
-	 'Bioinformatics and Computational Biology Solutions using R and 
-	 Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, 
-	 W. Huber (eds), Springer, New York, pages 397-420.
+ * Smyth, GK (2005). Limma: linear models for microarray data. In: 
+ 'Bioinformatics and Computational Biology Solutions using R and 
+ Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, 
+ W. Huber (eds), Springer, New York, pages 397-420.
 
 .. class:: infomark
 
@@ -392,25 +601,24 @@
 the various original statistical methods implemented in edgeR. See 
 Section 1.2 in the User's Guide for more detail.
 
-	* Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor 
-	 package for differential expression analysis of digital gene expression 
-	 data. Bioinformatics 26, 139-140
-	 
-	* Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing 
-	 differences in tag abundance. Bioinformatics 23, 2881-2887
-	 
-	* Robinson MD and Smyth GK (2008). Small-sample estimation of negative 
-	 binomial dispersion, with applications to SAGE data.
-	 Biostatistics, 9, 321-332
-	 
-	* McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis 
-	 of multifactor RNA-Seq experiments with respect to biological variation. 
-	 Nucleic Acids Research 40, 4288-4297
-	 
+ * Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor 
+ package for differential expression analysis of digital gene expression 
+ data. Bioinformatics 26, 139-140
+ 
+ * Robinson MD and Smyth GK (2007). Moderated statistical tests for assessing 
+ differences in tag abundance. Bioinformatics 23, 2881-2887
+ 
+ * Robinson MD and Smyth GK (2008). Small-sample estimation of negative 
+ binomial dispersion, with applications to SAGE data.
+ Biostatistics, 9, 321-332
+ 
+ * McCarthy DJ, Chen Y and Smyth GK (2012). Differential expression analysis 
+ of multifactor RNA-Seq experiments with respect to biological variation. 
+ Nucleic Acids Research 40, 4288-4297
+ 
 Report problems to: su.s@wehi.edu.au
 
 .. _edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
 .. _limma: http://www.bioconductor.org/packages/release/bioc/html/limma.html
 </help>
 </tool>
- 
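The help text in the hairpinTool.xml diff above says that a sequence is highlighted in the smear plot only when its FDR falls below the chosen threshold. As a hedged illustration of what such a cut-off involves, the Python sketch below applies the Benjamini-Hochberg adjustment, a common way to control the false discovery rate. This changeset does not show which adjustment the tool actually applies, so treating it as Benjamini-Hochberg is an assumption, and the function names and p-values are invented for the example.

```python
# Illustrative sketch only: Benjamini-Hochberg adjustment followed by an
# FDR cut-off. Names and inputs are hypothetical, not taken from the tool.

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * n / rank over this rank and all larger ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def highlighted(pvalues, fdr_thresh):
    """Flag entries whose adjusted p-value is below the FDR threshold."""
    return [q < fdr_thresh for q in bh_adjust(pvalues)]


# Hypothetical p-values for four target sequences.
flags = highlighted([0.01, 0.04, 0.03, 0.5], fdr_thresh=0.05)
print(flags)
```

The point of the adjustment is that a raw p-value below the threshold is not sufficient on its own: in the example, 0.04 and 0.03 are both below 0.05 before adjustment but not after it.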