<tool id="rgedgeRpaired" name="edgeR" version="0.20">
  <description>1 or 2 level models for count data</description>
  <requirements>
      <requirement type="package" version="2.12">biocbasics</requirement>
      <requirement type="package" version="3.0.1">package_r3</requirement>
  </requirements>

  <command interpreter="python">
     rgToolFactory.py --script_path "$runme" --interpreter "Rscript" --tool_name "edgeR"
    --output_dir "$html_file.files_path" --output_html "$html_file" --output_tab "$outtab" --make_HTML "yes"
  </command>
  <inputs>
    <param name="input1"  type="data" format="tabular" label="Select an input matrix - rows are contigs, columns are counts for each sample"
       help="Use the HTSeq based count matrix preparation tool to create these matrices from BAM/SAM files and a GTF file of genomic features"/>
    <param name="title" type="text" value="edgeR" size="80" label="Title for job outputs" help="Supply a meaningful name here to remind you what the outputs contain">
      <sanitizer invalid_char="">
        <valid initial="string.letters,string.digits"><add value="_" /> </valid>
      </sanitizer>
    </param>
    <param name="treatment_name" type="text" value="Treatment" size="50" label="Treatment Name"/>
    <param name="Treat_cols" label="Select columns containing treatment." type="data_column" data_ref="input1" numerical="True"
         multiple="true" use_header_names="true" size="120" display="checkboxes">
        <validator type="no_options" message="Please select at least one column."/>
    </param>
    <param name="control_name" type="text" value="Control" size="50" label="Control Name"/>
    <param name="Control_cols" label="Select columns containing control." type="data_column" data_ref="input1" numerical="True"
         multiple="true" use_header_names="true" size="120" display="checkboxes" optional="true">
    </param>
    <param name="subjectids" type="text" optional="true" size="120" value = ""
       label="IF SUBJECTS NOT ALL INDEPENDENT! Enter integers to indicate sample pairing for every column in input"
       help="Leave blank if no pairing, but eg if data from sample id A99 is in columns 2,4 and id C21 is in 3,5 then enter '1,2,1,2'">
      <sanitizer>
        <valid initial="string.digits"><add value="," /> </valid>
      </sanitizer>
    </param>
    <param name="fQ" type="float" value="0.3" size="5" label="Non-differential contig count quantile threshold - zero to analyze all non-zero read count contigs"
     help="May be a good or a bad idea depending on the biology and the question. EG 0.3 = sparsest 30% of contigs with at least one read are removed before analysis"/>
    <param name="useNDF" type="boolean" truevalue="T" falsevalue="F" checked="false" size="1"
              label="Non differential filter - remove contigs below a threshold (1 per million) for half or more samples"
     help="May be a good or a bad idea depending on the biology and the question. This was the old default. Quantile based is available as an alternative"/>
    <conditional name="DESeq">
    <param name="doDESeq" type="select"
       label="Run the same model with DESeq2 and compare findings"
       help="DESeq2 is an update to the DESeq package. It uses different assumptions and methods to edgeR">
      <option value="F" selected="true">Do not run DESeq2</option>
      <option value="T">Run DESeq2 (only works if NO second GLM factor supplied at present)</option>
     </param>
     <when value="T">
         <param name="DESeq_fitType" type="select">
            <option value="parametric" selected="true">Parametric (default) fit for dispersions</option>
            <option value="local">Local fit - use this if parametric fails</option>
            <option value="mean">Mean dispersion fit - use this only if you really understand what you're doing - read the fine manual</option>
         </param>
     </when>
     <when value="F"> </when>
    </conditional>
    <param name="doVoom" type="boolean" truevalue="T" checked='false' falsevalue="F" size="1" label="Run the same model with VOOM transformation and limma."/>
    <conditional name="camera">
    <param name="doCamera" type="select" label="Run the edgeR implementation of Camera GSEA for up/down gene sets"
        help="If yes, you can choose a set of genesets to test and/or supply a gmt format geneset collection from your history">
    <option value="F" selected="true">Do not run GSEA tests with the Camera algorithm</option>
    <option value="T">Run GSEA tests with the Camera algorithm</option>
    </param>
     <when value="T">
     <conditional name="gmtSource">
      <param name="refgmtSource" type="select"
         label="Use a gene set (.gmt) from your history and/or use a built-in (MSigDB etc) gene set">
        <option value="indexed" selected="true">Use a built-in gene set</option>
        <option value="history">Use a gene set from my history</option>
        <option value="both">Add a gene set from my history to a built in gene set</option>
      </param>
      <when value="indexed">
        <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
          <options from_data_table="gseaGMT_3.1">
            <filter type="sort_by" column="2" />
            <validator type="no_options" message="No GMT v3.1 files are available - please install them"/>
          </options>
        </param>
      </when>
      <when value="history">
        <param name="ownGMT" type="data" format="gmt" label="Select a Gene Set from your history" />
      </when>
      <when value="both">
        <param name="ownGMT" type="data" format="gseagmt" label="Select a Gene Set from your history" />
        <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
          <options from_data_table="gseaGMT_3.1">
            <filter type="sort_by" column="2" />
            <validator type="no_options" message="No GMT v3.1 files are available - please fix tool_data_table and loc files"/>
          </options>
        </param>
       </when>
     </conditional>
     </when>
     <when value="F">
     </when>
    </conditional>
    <param name="priordf" type="integer" value="20" size="3" label="prior.df for tagwise dispersion - lower value = more emphasis on each tag's variance. Replaces prior.n, where prior.df = prior.n * residual.df"
     help="0 = Use edgeR default. Use a small value to 'smooth' small samples. See edgeR docs and note below"/>
    <param name="fdrthresh" type="float" value="0.05" size="5" label="P value threshold for FDR filtering for family wise error rate control"
     help="Conventional default value of 0.05 recommended"/>
    <param name="fdrtype" type="select" label="FDR (Type I error) control method"
         help="Use fdr or BH (the same method under two names) typically to control for the number of tests in a reliable way">
            <option value="fdr" selected="true">fdr</option>
            <option value="BH">Benjamini Hochberg</option>
            <option value="BY">Benjamini Yekutieli</option>
            <option value="bonferroni">Bonferroni</option>
            <option value="hochberg">Hochberg</option>
            <option value="holm">Holm</option>
            <option value="hommel">Hommel</option>
            <option value="none">no control for multiple tests</option>
    </param>
  </inputs>
  <outputs>
    <data format="tabular" name="outtab" label="${title}.xls"/>
    <data format="html" name="html_file" label="${title}.html"/>
  </outputs>
 <stdio>
     <exit_code range="4"   level="fatal"   description="Number of subject ids must match total number of samples in the input matrix" />
 </stdio>
 <tests>
<test>
<param name='input1' value='test_bams2mx.xls' ftype='tabular' />
 <param name='treatment_name' value='case' />
 <param name='title' value='edgeRtest' />
 <param name='useNDF' value='' />
 <param name='fdrtype' value='fdr' />
 <param name='priordf' value="0" />
 <param name='fdrthresh' value="0.05" />
 <param name='control_name' value='control' />
 <param name='subjectids' value='' />
  <param name='Treat_cols' value='3,4,5,9' />
 <param name='Control_cols' value='2,6,7,8' />
 <output name='outtab' file='edgeRtest1out.xls' compare='diff' />
 <output name='html_file' file='edgeRtest1out.html'  compare='diff' lines_diff='20' />
</test>
</tests>

<configfiles>
<configfile name="runme">
<![CDATA[
##
## edgeR.Rscript
## updated nov 2011 for R 2.14.0 and edgeR 2.4.0 by ross
## Performs DGE on a count table containing n replicates of two conditions
##
### Original edgeR code by: S.Lunke and A.Kaspi
reallybig = log10(.Machine\$double.xmax)
reallysmall = log10(.Machine\$double.xmin)
library('stringr')
library('gplots')
library('edgeR')

hmap2 = function(cmat,nsamp=100,outpdfname='heatmap2.pdf', TName='Treatment',group=NA,myTitle='title goes here')
{
    ### Perform clustering for significant pvalues after controlling FWER
    samples = colnames(cmat)
    gu = unique(group)
    if (length(gu) == 2) {
        col.map = function(g) {if (g==gu[1]) "#FF0000" else "#0000FF"}
        pcols = unlist(lapply(group,col.map))
        } else {
        colours = rainbow(length(gu),start=0,end=4/6)
        pcols = colours[match(group,gu)]
    }
    gn = rownames(cmat)
    dm = cmat[(! is.na(gn)),]
    ### remove unlabelled hm rows
    nprobes = nrow(dm)
    if (nprobes > nsamp) {
      dm =dm[1:nsamp,]
    }
    newcolnames = substr(colnames(dm),1,20)
    colnames(dm) = newcolnames
    pdf(outpdfname)
    heatmap.2(dm,main=myTitle,ColSideColors=pcols,col=topo.colors(100),dendrogram="col",key=T,density.info='none',
         Rowv=F,scale='row',trace='none',margins=c(8,8),cexRow=0.4,cexCol=0.5)
    dev.off()
}

hmap = function(cmat,nmeans=4,outpdfname="heatMap.pdf",nsamp=250,TName='Treatment',group=NA,myTitle="Title goes here")
{
    ## for 2 groups only was
    ## col.map = function(g) {if (g==TName) "#FF0000" else "#0000FF"}
    ## pcols = unlist(lapply(group,col.map))
    gu = unique(group)
    colours = rainbow(length(gu),start=0.3,end=0.6)
    pcols = colours[match(group,gu)]
    nrows = nrow(cmat)
    mtitle = paste(myTitle,'Heatmap: n contigs =',nrows)
    if (nrows > nsamp)  {
               cmat = cmat[c(1:nsamp),]
               mtitle = paste('Heatmap: Top ',nsamp,' DE contigs (of ',nrows,')',sep='')
          }
    newcolnames = substr(colnames(cmat),1,20)
    colnames(cmat) = newcolnames
    pdf(outpdfname)
    heatmap(cmat,scale='row',main=mtitle,cexRow=0.3,cexCol=0.4,Rowv=NA,ColSideColors=pcols)
    dev.off()
}

qqPlot = function(descr='Title',pvector, ...)
## stolen from https://gist.github.com/703512
{
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    o[o==-Inf] = reallysmall
    o[o==Inf] = reallybig
    pdfname = paste(gsub(" ","", descr , fixed=TRUE),'pval_qq.pdf',sep='_')
    maint = paste(descr,'QQ Plot')
    pdf(pdfname)
    plot(e,o,pch=19,cex=1, main=maint, ...,
        xlab=expression(Expected~~-log[10](italic(p))),
        ylab=expression(Observed~~-log[10](italic(p))),
        xlim=c(0,max(e)), ylim=c(0,max(o)))
    lines(e,e,col="red")
    grid(col = "lightgray", lty = "dotted")
    dev.off()
}

smearPlot = function(DGEList,deTags, outSmear, outMain)
        {
        pdf(outSmear)
        plotSmear(DGEList,de.tags=deTags,main=outMain)
        grid(col="blue")
        dev.off()
        }

boxPlot = function(rawrs,cleanrs,maint,myTitle)
{
        nc = ncol(rawrs)
        for (i in c(1:nc)) {rawrs[(rawrs[,i] < 0),i] = NA}
        fullnames = colnames(rawrs)
        newcolnames = substr(colnames(rawrs),1,20)
        colnames(rawrs) = newcolnames
        newcolnames = substr(colnames(cleanrs),1,20)
        colnames(cleanrs) = newcolnames
        pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"sampleBoxplot.pdf",sep='_')
        defpar = par(no.readonly=T)
        pdf(pdfname)
        l = layout(matrix(c(1,2),1,2,byrow=T))
        print.noquote('raw contig counts by sample:')
        print.noquote(summary(rawrs))
        print.noquote('normalised contig counts by sample:')
        print.noquote(summary(cleanrs))
        boxplot(rawrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('Raw:',maint))
        grid(col="blue")
        boxplot(cleanrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('After ',maint))
        grid(col="blue")
        dev.off()
        pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"samplehistplot.pdf",sep='_')
        nc = ncol(rawrs)
        print.noquote(paste('Using ncol rawrs=',nc))
        ncroot = round(sqrt(nc))
        if (ncroot*ncroot < nc) { ncroot = ncroot + 1 }
        m = c()
        for (i in c(1:nc)) {
              rhist = hist(rawrs[,i],breaks=100,plot=F)
              m = append(m,max(rhist\$counts))
             }
        ymax = max(m)
        pdf(pdfname)
        par(mfrow=c(ncroot,ncroot))
        for (i in c(1:nc)) {
                 hist(rawrs[,i], main=paste("Contig logcount",i), xlab='log raw count', col="maroon",
                 breaks=100,sub=fullnames[i],cex=0.8,ylim=c(0,ymax))
             }
        dev.off()
        par(defpar)

}

cumPlot = function(rawrs,cleanrs,maint,myTitle)
{
        pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_')
        defpar = par(no.readonly=T)
        pdf(pdfname)
        par(mfrow=c(2,1))
        lrs = log(rawrs,10)
        lim = max(lrs)
        hist(lrs,breaks=100,main=paste('Before:',maint),xlab="Reads (log)",
             ylab="Count",col="maroon",sub=myTitle, xlim=c(0,lim),las=1)
        grid(col="blue")
        lrs = log(cleanrs,10)
        hist(lrs,breaks=100,main=paste('After:',maint),xlab="Reads (log)",
             ylab="Count",col="maroon",sub=myTitle,xlim=c(0,lim),las=1)
        grid(col="blue")
        dev.off()
        par(defpar)
}

cumPlot1 = function(rawrs,cleanrs,maint,myTitle)
{
        pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_')
        pdf(pdfname)
        par(mfrow=c(2,1))
        lastx = max(rawrs)
        rawe = knots(ecdf(rawrs))
        cleane = knots(ecdf(cleanrs))
        cy = 1:length(cleane)/length(cleane)
        ry = 1:length(rawe)/length(rawe)
        plot(rawe,ry,type='l',main=paste('Before',maint),xlab="Log Contig Total Reads",
             ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
        grid(col="blue")
        plot(cleane,cy,type='l',main=paste('After',maint),xlab="Log Contig Total Reads",
             ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
        grid(col="blue")
        dev.off()
}


doGSEA = function(y=NULL,design=NULL,histgmt="",
                  bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
                  ntest=0, myTitle="myTitle", outfname="GSEA.xls", minnin=5, maxnin=2000,fdrthresh=0.05,fdrtype="BH")
{
  genesets = c()
  if (bigmt > "")
  {
    bigenesets = readLines(bigmt)
    genesets = bigenesets
  }
  if (histgmt > "")
  {
    hgenesets = readLines(histgmt)
    if (bigmt > "") {
      ### both are character vectors of lines, so concatenate rather than rbind
      genesets = c(genesets,hgenesets)
    } else {
      genesets = hgenesets
    }
  }
  print.noquote(paste("@@@read",length(genesets), 'genesets from',histgmt,bigmt))
  genesets = strsplit(genesets,'\t')
  ##### tabular. genesetid\tURLorwhatever\tgene_1\t..\tgene_n
  outf = outfname
  head=paste(myTitle,'edgeR GSEA')
  write(head,file=outfname,append=F)
  ntest=length(genesets)
  urownames = toupper(rownames(y))
  upcam = c()
  downcam = c()
  for (i in 1:ntest) {
    gs = unlist(genesets[i])
    g = gs[1] #### geneset_id
    u = gs[2]
    if (u > "") { u = paste("<a href=\'",u,"\'>",u,"</a>",sep="") }
    glist = gs[3:length(gs)] #### member gene symbols
    glist = toupper(glist)
    inglist = urownames %in% glist
    nin = sum(inglist)
    if ((nin > minnin) && (nin < maxnin)) {
      ### print(paste('@@found',sum(inglist),'genes in glist'))
      camres = camera(y=y,index=inglist,design=design)
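      ### camera returns a data frame, which cannot be used directly as a condition
      if (!is.null(camres) && nrow(camres) > 0) {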
      rownames(camres) = g
      ##### gene set name
      camres = cbind(GeneSet=g,URL=u,camres)
      if (camres\$Direction == "Up")
        {
        upcam = rbind(upcam,camres) } else {
          downcam = rbind(downcam,camres)
        }
      }
   }
  }
  uscam = upcam[order(upcam\$PValue),]
  unadjp = uscam\$PValue
  uscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
  nup = max(10,sum((uscam\$adjPValue < fdrthresh)))
  dscam = downcam[order(downcam\$PValue),]
  unadjp = dscam\$PValue
  dscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
  ndown = max(10,sum((dscam\$adjPValue < fdrthresh)))
  write.table(uscam,file=paste('upCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F)
  write.table(dscam,file=paste('downCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F)
  print.noquote(paste('@@@@@ Camera up top',nup,'gene sets:'))
  write.table(head(uscam,nup),file="",quote=F,sep='\t',row.names=F)
  print.noquote(paste('@@@@@ Camera down top',ndown,'gene sets:'))
  write.table(head(dscam,ndown),file="",quote=F,sep='\t',row.names=F)
}


edgeIt = function (Count_Matrix,group,outputfilename,fdrtype='fdr',priordf=5,
        fdrthresh=0.05,outputdir='.', myTitle='edgeR',libSize=c(),useNDF=F,
        filterquantile=0.2, subjects=c(),mydesign=NULL,
        doDESeq=T,doVoom=T,doCamera=T,org='hg19',
        histgmt="", bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
        doCook=F,DESeq_fittype="parametric")
{
  if (length(unique(group))!=2){
    print("Number of conditions identified in experiment does not equal 2")
    q()
    }
  require(edgeR)
  options(width = 512)
  mt = paste(unlist(strsplit(myTitle,'_')),collapse=" ")
  allN = nrow(Count_Matrix)
  nscut = round(ncol(Count_Matrix)/2)
  colTotmillionreads = colSums(Count_Matrix)/1e6
  rawrs = rowSums(Count_Matrix)
  nonzerod = Count_Matrix[(rawrs > 0),]
  nzN = nrow(nonzerod)
  nzrs = rowSums(nonzerod)
  zN = allN - nzN
  print('**** Quantiles for non-zero row counts:',quote=F)
  print(quantile(nzrs,probs=seq(0,1,0.1)),quote=F)
  if (useNDF == "T")
  {
    gt1rpin3 = rowSums(Count_Matrix/expandAsMatrix(colTotmillionreads,dim(Count_Matrix)) >= 1) >= nscut
    lo = colSums(Count_Matrix[!gt1rpin3,])
    workCM = Count_Matrix[gt1rpin3,]
    cleanrs = rowSums(workCM)
    cleanN = length(cleanrs)
    meth = paste( "After removing",length(lo),"contigs with fewer than ",nscut," sample read counts >= 1 per million, there are",sep="")
    print(paste("Read",allN,"contigs. Removed",zN,"contigs with no reads.",meth,cleanN,"contigs"),quote=F)
    maint = paste('Filter >= 1/million reads in >=',nscut,'samples')
  }   else {
    useme = (nzrs > quantile(nzrs,filterquantile))
    workCM = nonzerod[useme,]
    lo = colSums(nonzerod[!useme,])
    cleanrs = rowSums(workCM)
    cleanN = length(cleanrs)
    meth = paste("After filtering at count quantile =",filterquantile,", there are",sep="")
    print(paste('Read',allN,"contigs. Removed",zN,"with no reads.",meth,cleanN,"contigs"),quote=F)
    maint = paste('Filter below',filterquantile,'quantile')
  }
  cumPlot(rawrs=rawrs,cleanrs=cleanrs,maint=maint,myTitle=myTitle)
  allgenes <- rownames(workCM)
  print(paste("*** Total low count contigs per sample = ",paste(lo,collapse=',')),quote=F)
  rsums = rowSums(workCM)
  TName=unique(group)[1]
  CName=unique(group)[2]
  DGEList = DGEList(counts=workCM, group = group)
  DGEList = calcNormFactors(DGEList)

  if (is.null(mydesign)) {
    if (length(subjects) == 0)
    {
      mydesign = model.matrix(~group)
    }
    else {
      subjf = factor(subjects)
      mydesign = model.matrix(~subjf+group)
      ### we block on subject so make group last to simplify finding it
    }
  }
  print.noquote(paste('Using samples:',paste(colnames(workCM),collapse=',')))
  print.noquote('Using design matrix:')
  print.noquote(mydesign)
  DGEList = estimateGLMCommonDisp(DGEList,mydesign)
  comdisp = DGEList\$common.dispersion
  DGEList = estimateGLMTrendedDisp(DGEList,mydesign)
  if (priordf > 0) {
    print.noquote(paste("prior.df =",priordf))
    DGEList = estimateGLMTagwiseDisp(DGEList,mydesign,prior.df = priordf)
  } else {
    DGEList = estimateGLMTagwiseDisp(DGEList,mydesign)
  }
  lastcoef=ncol(mydesign)
  print.noquote(paste('*** lastcoef = ',lastcoef))
  estpriorn = getPriorN(DGEList)
  predLFC1 = predFC(DGEList,prior.count=1,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
  predLFC3 = predFC(DGEList,prior.count=3,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
  predLFC5 = predFC(DGEList,prior.count=5,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
  DGLM = glmFit(DGEList,design=mydesign)
  DE = glmLRT(DGLM)
  #### always last one - subject is first if needed
  logCPMnorm = cpm(DGEList,log=T,normalized.lib.sizes=T)
  logCPMraw = cpm(DGEList,log=T,normalized.lib.sizes=F)
  uoutput = cbind(
    Name=as.character(rownames(DGEList\$counts)),
    DE\$table,
    adj.p.value=p.adjust(DE\$table\$PValue, method=fdrtype),
    Dispersion=DGEList\$tagwise.dispersion,totreads=rsums,
    predLFC1=predLFC1[,lastcoef],
    predLFC3=predLFC3[,lastcoef],
    predLFC5=predLFC5[,lastcoef],
    logCPMnorm,
    DGEList\$counts
  )
  soutput = uoutput[order(DE\$table\$PValue),]
  heatlogcpmnorm = logCPMnorm[order(DE\$table\$PValue),]
  goodness = gof(DGLM, pcutoff=fdrthresh)
  noutl = sum(goodness\$outlier)
  if (noutl > 0) {
        print.noquote(paste('***',noutl,'GLM outliers found'))
        print(paste(rownames(DGLM)[(goodness\$outlier)],collapse=','),quote=F)
    } else {
      print('*** No GLM fit outlier genes found')
    }
  z = limma::zscoreGamma(goodness\$gof.statistic, shape=goodness\$df/2, scale=2)
  pdf(paste(mt,"GoodnessofFit.pdf",sep='_'))
  qq = qqnorm(z, panel.first=grid(), main="tagwise dispersion")
  abline(0,1,lwd=3)
  points(qq\$x[goodness\$outlier],qq\$y[goodness\$outlier], pch=16, col="maroon")
  dev.off()
  print(paste("Common Dispersion =",comdisp,"CV = ",sqrt(comdisp),"getPriorN = ",estpriorn),quote=F)
  uniqueg = unique(group)
  sample_colors =  match(group,levels(group))
  pdf(paste(mt,"MDSplot.pdf",sep='_'))
  sampleTypes = levels(factor(group))
  print.noquote(sampleTypes)
  plotMDS.DGEList(DGEList,main=paste("MDS Plot for",myTitle),cex=0.5,col=sample_colors,pch=sample_colors)
  legend(x="topleft", legend = sampleTypes,col=c(1:length(sampleTypes)), pch=19)
  grid(col="blue")
  dev.off()
  colnames(logCPMnorm) = paste( colnames(logCPMnorm),'N',sep="_")
  print(paste('Raw sample CPM',paste(colSums(logCPMraw,na.rm=T),collapse=',')))
  try(boxPlot(rawrs=logCPMraw,cleanrs=logCPMnorm,maint='TMM Normalisation',myTitle=myTitle))
  nreads = soutput\$totreads
  print('*** writing output',quote=F)
  write.table(soutput,outputfilename, quote=FALSE, sep="\t",row.names=F)
  rn = row.names(workCM)
  print.noquote('@@ rn')
  print.noquote(head(rn))
  reg = "^chr([0-9]+):([0-9]+)-([0-9]+)"
  genecards="<a href=\'http://www.genecards.org/index.php?path=/Search/keyword/"
  ucsc = paste("<a href=\'http://genome.ucsc.edu/cgi-bin/hgTracks?db=",org,sep='')
  testreg = str_match(rn,reg)
  nreads = uoutput\$totreads
  if (sum(!is.na(testreg[,1]))/length(testreg[,1]) > 0.8)
  {
    print("@@ using ucsc substitution for urls")
    urls = paste0(ucsc,"&amp;position=chr",testreg[,2],":",testreg[,3],"-",testreg[,4],"\'>",rn,"</a>")
  } else {
    print("@@ using genecards substitution for urls")
    urls = paste0(genecards,rn,"\'>",rn,"</a>")
  }
  tt = uoutput
  print.noquote("*** edgeR Top tags\n")
  tt = cbind(tt,ntotreads=nreads,URL=urls)
  tt = tt[order(DE\$table\$PValue),]
  print.noquote(head(tt,50))
  ### Plot MAplot
  deTags = rownames(uoutput[uoutput\$adj.p.value < fdrthresh,])
  nsig = length(deTags)
  print(paste('***',nsig,'tags significant at adj p=',fdrthresh),quote=F)
  if (nsig > 0) {
      print('*** deTags',quote=F)
      print(head(deTags))
    }
  deColours = ifelse(rownames(uoutput) %in% deTags,'red','black')
  pdf(paste(mt,"BCV_vs_abundance.pdf",sep='_'))
  plotBCV(DGEList, cex=0.3, main="Biological CV vs abundance")
  dev.off()
  dg = DGEList[order(DE\$table\$PValue),]
  outpdfname=paste(mt,"heatmap.pdf",sep='_')
  hmap2(heatlogcpmnorm,nsamp=100,TName=TName,group=group,outpdfname=outpdfname,myTitle=myTitle)
  outSmear = paste(mt,"Smearplot.pdf",sep='_')
  outMain = paste("Smear Plot for ",TName,' Vs ',CName,' (FDR@',fdrthresh,' N = ',nsig,')',sep='')
  smearPlot(DGEList=DGEList,deTags=deTags, outSmear=outSmear, outMain = outMain)
  qqPlot(descr=myTitle,pvector=DE\$table\$PValue)
  if (doDESeq == T)
  {
    ### DESeq2
    require('DESeq2')
    print.noquote(paste('****subjects=',subjects,'length=',length(subjects)))
    if (length(subjects) == 0)
        {
        pdata = data.frame(Name=colnames(workCM),Rx=group,row.names=colnames(workCM))
        deSEQds = DESeqDataSetFromMatrix(countData = workCM,  colData = pdata, design = formula(~ Rx))
        } else {
        pdata = data.frame(Name=colnames(workCM),Rx=group,subjects=subjects,row.names=colnames(workCM))
        deSEQds = DESeqDataSetFromMatrix(countData = workCM,  colData = pdata, design = formula(~ subjects + Rx))
        }
    deSeqDatsizefac <- estimateSizeFactors(deSEQds)
    deSeqDatdisp <- estimateDispersions(deSeqDatsizefac,fitType=DESeq_fittype)
    resDESeq <- nbinomWaldTest(deSeqDatdisp, pAdjustMethod=fdrtype)
    rDESeq = as.data.frame(results(resDESeq))
    srDESeq = rDESeq[order(rDESeq\$pvalue),]
    write.table(srDESeq,paste(mt,'DESeq2_TopTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F)
    topresults.DESeq <- rDESeq[which(rDESeq\$padj < fdrthresh), ]
    DESeqcountsindex <- which(allgenes %in% rownames(topresults.DESeq))
    DESeqcounts <- rep(0, length(allgenes))
    DESeqcounts[DESeqcountsindex] <- 1
    pdf(paste(mt,"DESeq2_dispersion_estimates.pdf",sep='_'))
    plotDispEsts(resDESeq)
    dev.off()
    if (doCook) {
       pdf(paste(mt,"DESeq2_cooks_distance.pdf",sep='_'))
       W <- mcols(resDESeq)\$WaldStatistic_condition_treated_vs_untreated
       maxCooks <- mcols(resDESeq)\$maxCooks
       idx <- !is.na(W)
       plot(rank(W[idx]), maxCooks[idx], xlab="rank of Wald statistic", ylab="maximum Cook's distance per gene",
          ylim=c(0,5), cex=.4, col="maroon")
       m <- ncol(resDESeq)
       p <- 3
       abline(h=qf(.75, p, m - p),col="darkblue")
       grid(col="lightgray",lty="dotted")
       dev.off()
    }
  }
  counts.dataframe = as.data.frame(c())
  norm.factor = DGEList\$samples\$norm.factors
  topresults.edgeR <- soutput[which(soutput\$adj.p.value < fdrthresh), ]
  edgeRcountsindex <- which(allgenes %in% rownames(topresults.edgeR))
  edgeRcounts <- rep(0, length(allgenes))
  edgeRcounts[edgeRcountsindex] <- 1
  if (doVoom == T) {
      pdf(paste(mt,"voomplot.pdf",sep='_'))
      dat.voomed <- voom(DGEList, mydesign, plot = TRUE, normalize.method="quantile", lib.size = NULL)
      dev.off()
      fit <- lmFit(dat.voomed, mydesign)
      fit <- eBayes(fit)
      rvoom <- topTable(fit, coef = length(colnames(mydesign)), adj = "BH", n = Inf)
      write.table(rvoom,paste(mt,'VOOM_topTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F)
      topresults.voom <- rvoom[which(rvoom\$adj.P.Val < fdrthresh), ]
      voomcountsindex <- which(allgenes %in% rownames(topresults.voom))
      voomcounts <- rep(0, length(allgenes))
      voomcounts[voomcountsindex] <- 1
  }
  if ((doDESeq==T) || (doVoom==T)) {
    if ((doVoom==T) && (doDESeq==T)) {
        vennmain = paste(mt,'Voom,edgeR and DESeq2 overlap at FDR=',fdrthresh)
        counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts,
                                       VOOM_limma = voomcounts, row.names = allgenes)
       } else if (doDESeq==T) {
         vennmain = paste(mt,'DESeq2 and edgeR overlap at FDR=',fdrthresh)
         counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts, row.names = allgenes)
       } else if (doVoom==T) {
        vennmain = paste(mt,'Voom and edgeR overlap at FDR=',fdrthresh)
        counts.dataframe <- data.frame(edgeR = edgeRcounts, VOOM_limma = voomcounts, row.names = allgenes)
       }

    if (nrow(counts.dataframe) > 1) {
      counts.venn <- vennCounts(counts.dataframe)
      vennf = paste(mt,'venn.pdf',sep='_')
      pdf(vennf)
      vennDiagram(counts.venn,main=vennmain,col="maroon")
      dev.off()
    }
  } ### doDESeq or doVoom
  if (doDESeq==T) {
    cat("*** DESeq top 50\n")
    print(head(srDESeq,50))
  }
  if (doVoom==T) {
      cat("*** VOOM top 50\n")
      print(head(rvoom,50))
  }
  if (doCamera) {
  doGSEA(y=DGEList,design=mydesign,histgmt=histgmt,bigmt=bigmt,ntest=20,myTitle=myTitle,
    outfname=paste(mt,"GSEA.xls",sep="_"),fdrthresh=fdrthresh,fdrtype=fdrtype)
  }
  uoutput

}
#### Done

#### sink(stdout(),append=T,type="message")

doDESeq = $DESeq.doDESeq
### make these 'T' or 'F'
doVoom = $doVoom
doCamera = $camera.doCamera
Out_Dir = "$html_file.files_path"
Input =  "$input1"
TreatmentName = "$treatment_name"
TreatmentCols = "$Treat_cols"
ControlName = "$control_name"
ControlCols= "$Control_cols"
outputfilename = "$outtab"
org = "$input1.dbkey"
if (org == "") { org = "hg19"}
fdrtype = "$fdrtype"
priordf = $priordf
fdrthresh = $fdrthresh
useNDF = "$useNDF"
fQ = $fQ
myTitle = "$title"
sids = strsplit("$subjectids",',')
subjects = unlist(sids)
nsubj = length(subjects)
builtin_gmt=""
history_gmt=""

DESeq_fittype=""
#if $DESeq.doDESeq == "T"
  DESeq_fittype = "$DESeq.DESeq_fitType"
#end if
#if $camera.doCamera == 'T'
  #if $camera.gmtSource.refgmtSource == "indexed" or $camera.gmtSource.refgmtSource == "both":
     builtin_gmt = "${camera.gmtSource.builtinGMT.fields.path}"
  #end if
  #if $camera.gmtSource.refgmtSource == "history" or $camera.gmtSource.refgmtSource == "both":
    history_gmt = "${camera.gmtSource.ownGMT}"
    history_gmt_name = "${camera.gmtSource.ownGMT.name}"
  #end if
#end if
if (nsubj > 0) {
if (doDESeq) {
 print('WARNING - cannot yet use DESeq2 for 2 way anova - see the docs')
 doDESeq = F
 }
}
TCols = as.numeric(strsplit(TreatmentCols,",")[[1]])-1
CCols = as.numeric(strsplit(ControlCols,",")[[1]])-1
cat('Got TCols=')
cat(TCols)
cat('; CCols=')
cat(CCols)
cat('\n')
useCols = c(TCols,CCols)
if (file.exists(Out_Dir) == F) dir.create(Out_Dir)
Count_Matrix = read.table(Input,header=T,row.names=1,sep='\t') #Load tab file assume header
snames = colnames(Count_Matrix)
nsamples = length(snames)
if (nsubj > 0 && nsubj != nsamples) {
options("show.error.messages"=T)
mess = paste('Fatal error: Supplied subject id list',paste(subjects,collapse=','),
   'has length',nsubj,'but there are',nsamples,'samples',paste(snames,collapse=','))
write(mess, stderr())
quit(save="no",status=4)
}

Count_Matrix = Count_Matrix[,useCols] ### reorder columns
if (length(subjects) != 0) {subjects = subjects[useCols]}
rn = rownames(Count_Matrix)
islib = rn %in% c('librarySize','NotInBedRegions')
LibSizes = Count_Matrix[subset(rn,islib),][1] # take first
Count_Matrix = Count_Matrix[subset(rn,! islib),]
group = c(rep(TreatmentName,length(TCols)), rep(ControlName,length(CCols)) )
group = factor(group, levels=c(ControlName,TreatmentName))
colnames(Count_Matrix) = paste(group,colnames(Count_Matrix),sep="_")
### pass the user's FDR method, title and filter choice instead of hard coded values
results = edgeIt(Count_Matrix=Count_Matrix,group=group,outputfilename=outputfilename,
                 fdrtype=fdrtype,priordf=priordf,fdrthresh=fdrthresh,outputdir='.',
                 myTitle=myTitle,useNDF=useNDF,libSize=c(),filterquantile=fQ,subjects=subjects,
                 doDESeq=doDESeq,doVoom=doVoom,doCamera=doCamera,org=org,
                 histgmt=history_gmt,bigmt=builtin_gmt,DESeq_fittype=DESeq_fittype)
sessionInfo()
]]>
</configfile>
</configfiles>
<help>

**What it does**

Performs digital gene expression analysis between a treatment and control on a count matrix.
Optionally adds a term for subject if not all samples are independent or if some other factor needs to be blocked in the design.

**Input**

A matrix consisting of non-negative integers. The matrix must have a unique header row identifying the samples, and a unique set of row names
as the first column. Typically the row names are gene symbols or probe IDs for downstream use in GSEA and other methods.
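
The input requirements above can be checked before a run. The following is a minimal sketch in base R, not part of the tool: the toy matrix, the temporary file and the sample names are all made up for illustration, and the checks are only one reasonable reading of "non-negative integers with unique row and column names".

```r
# Hypothetical toy count matrix, written to a temporary tab-delimited file
tmp = tempfile(fileext = ".tab")
toy = data.frame(contig = c("geneA", "geneB", "geneC"),
                 s1 = c(10, 0, 5), s2 = c(12, 1, 7),
                 s3 = c(3, 0, 9),  s4 = c(4, 2, 11))
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it the way the tool script does: header row, first column as row names
counts = read.table(tmp, header = TRUE, row.names = 1, sep = "\t")
m = as.matrix(counts)

# Checks corresponding to the requirements described above
stopifnot(is.numeric(m))                    # counts only, no text columns
stopifnot(all(m >= 0))                      # non-negative
stopifnot(all(m == round(m)))               # whole numbers
stopifnot(!any(duplicated(colnames(m))))    # unique sample names
stopifnot(!any(duplicated(rownames(m))))    # unique row names
```

Note that read.table itself stops with an error when row.names=1 meets duplicated row names, so the last check is mostly a reminder.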

If you have (eg) paired samples and wish to include a term in the GLM to account for some other factor (subject in the case of paired samples),
supply a comma separated list of integer indicators, one for every sample (whether modelled or not!), giving (eg) the subject number.
Leave the list empty if all samples are independent.
If not empty, there must be exactly as many integers in the supplied list as there are columns (samples) in the count matrix.
Integers for samples that are not in the analysis *must* be present in the string as filler even if not used.

So if you have 2 pairs out of 6 samples, you need to put in unique integers for the unpaired ones,
eg if you had 6 samples with the first two independent and the second and third pairs each coming from a different subject, you might use
8,9,1,1,2,2
as subject IDs to indicate two paired samples from the same subject in columns 3/4 and 5/6.
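
To see how such a list lines up with the selected columns, here is a minimal sketch in base R. The choice of treatment and control columns is hypothetical, the indices refer to sample columns (after the row-name column has been removed), and the additive design ~ subject + group is an assumption about how a blocking term is usually added to a GLM, not a statement of what the tool does internally.

```r
# Subject IDs for all six sample columns, as in the example above
subjectids = "8,9,1,1,2,2"
subjects = strsplit(subjectids, ",")[[1]]

# Hypothetical selection: samples 3 and 5 are treatment, 4 and 6 are control
TCols = c(3, 5)
CCols = c(4, 6)
useCols = c(TCols, CCols)

# Keep only the IDs of the samples that are modelled, in the same order
used = subjects[useCols]

group = factor(c(rep("Treatment", length(TCols)), rep("Control", length(CCols))),
               levels = c("Control", "Treatment"))
paired = data.frame(subject = factor(used), group = group)

# Assumed additive design: one term per subject plus the treatment effect
design = model.matrix(~ subject + group, data = paired)
print(design)
```

The filler IDs 8 and 9 never reach the design, but they must be in the string so that the positions of the remaining IDs match the matrix columns.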

**Output**

A summary HTML page with links to the R source code and all the outputs, grids of plot thumbnails, the run log,
and the top 50 rows of the topTable.

The main topTables of results are provided as separate excelish tabular files.

They include adjusted p values and dispersions for each region, raw and cpm sample data counts and shrunken (predicted) log fold change estimates.
These are provided for downstream analyses such as GSEA and are predictions of the logFC you might expect to see
in an independent replication of your original experiment. Higher number means more shrinkage. Shrinkage is more extreme for low coverage features
so downstream analyses are more robust against strong effect size estimates based on relatively little experimental information.
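
Why shrinkage bites harder on low coverage features can be seen with a deliberately simplified illustration. This is not the estimator used by edgeR or by this tool: it assumes shrinkage is done by adding a fixed prior count to both groups before taking the log ratio, and the function name and counts are made up.

```r
# Simplified, hypothetical shrinkage: add a prior count to both groups
shrunk_logfc = function(treat, ctrl, prior = 2) {
  log2((treat + prior) / (ctrl + prior))
}

# Same raw fold change (4x) at low and at high coverage
raw_low  = log2(4 / 1)
raw_high = log2(4000 / 1000)

shr_low  = shrunk_logfc(4, 1)        # log2(6 / 3): pulled strongly towards 0
shr_high = shrunk_logfc(4000, 1000)  # barely changed by the prior count

print(c(raw_low = raw_low, shr_low = shr_low,
        raw_high = raw_high, shr_high = shr_high))

# A larger prior shrinks the low coverage estimate even further
print(shrunk_logfc(4, 1, prior = 10))
```

Both features start with a raw logFC of 2, but only the one supported by very few reads is moved appreciably, which is the behaviour the paragraph above describes.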

**Note on prior.n**

http://seqanswers.com/forums/showthread.php?t=5591 says:

*prior.n*

The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion.
You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood
in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your
tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the
common likelihood the weight of one observation.

In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value,
or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that
you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation
(squeezing) of the tagwise dispersions. How many samples do you have in your experiment?
What is the experimental design? If you have few samples (less than 6) then I would suggest a prior.n of at least 10.
If you have more samples, then the tagwise dispersion estimates will be more reliable,
so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5.


From Bioconductor Digest, Vol 118, Issue 5, Gordon writes:

Dear Dorota,

The important settings are prior.df and trend.

prior.n and prior.df are related through prior.df = prior.n * residual.df,
and your experiment has residual.df = 36 - 12 = 24.  So the old setting of
prior.n=10 is equivalent for your data to prior.df = 240, a very large
value.  Going the other way, the new setting of prior.df=10 is equivalent
to prior.n=10/24.

To recover old results with the current software you would use

  estimateTagwiseDisp(object, prior.df=240, trend="none")

To get the new default from old software you would use

  estimateTagwiseDisp(object, prior.n=10/24, trend=TRUE)

Actually the old trend method is equivalent to trend="loess" in the new
software. You should use plotBCV(object) to see whether a trend is
required.

Note you could also use

  prior.n = getPriorN(object, prior.df=10)

to map between prior.df and prior.n.
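
The arithmetic in the message above can be reproduced directly. This is a small sketch in base R using only the stated relation prior.df = prior.n * residual.df; the two helper function names are made up for illustration and are not edgeR functions.

```r
# Numbers quoted in the message: 36 samples, 12 fitted coefficients
n_samples = 36
n_coefficients = 12
residual_df = n_samples - n_coefficients   # 24

# Hypothetical helpers implementing prior.df = prior.n * residual.df
prior_n_to_df = function(prior_n, residual_df) prior_n * residual_df
prior_df_to_n = function(prior_df, residual_df) prior_df / residual_df

old_as_df = prior_n_to_df(10, residual_df)   # old prior.n=10 expressed as prior.df
new_as_n  = prior_df_to_n(10, residual_df)   # new prior.df=10 expressed as prior.n

print(old_as_df)   # 240
print(new_as_n)    # 10/24, roughly 0.417
```

The same setting therefore means very different amounts of squeezing depending on the residual degrees of freedom of the experiment.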

**Old rant on variable name changes in Bioconductor versions**

BioC authors sometimes make small, mostly cosmetic changes to variable names (eg: from p.value to PValue),
often to make them more internally consistent or self describing. Unfortunately, these improvements
break existing code that relies on the library in ways that can take a while to track down,
increasing downstream tool maintenance effort uselessly.

Please, don't do that. It hurts us.


</help>

</tool>
author	fubar
date	Thu, 28 Aug 2014 02:33:05 -0400