Mercurial > repos > xuebing > sharplabtool
comparison tools/rgenetics/rgQQ.py @ 0:9071e359b9a3
Uploaded
| author | xuebing |
|---|---|
| date | Fri, 09 Mar 2012 19:37:19 -0500 |
| parents | |
| children |
comparison
equal
deleted
inserted
replaced
| -1:000000000000 | 0:9071e359b9a3 |
|---|---|
| 1 """ | |
| 2 oct 2009 - multiple output files | |
| 3 Dear Matthias, | |
| 4 | |
| 5 Yes, you can define number of outputs dynamically in Galaxy. For doing | |
| 6 this, you'll have to declare one output dataset in your xml and pass | |
| 7 its ID ($out_file.id) to your python script. Also, set | |
| 8 force_history_refresh="True" in your tool tag in xml, like this: | |
| 9 <tool id="split1" name="Split" force_history_refresh="True"> | |
| 10 In your script, if your outputs are named in the following format, | |
| 11 primary_associatedWithDatasetID_designation_visibility_extension | |
| 12 (_DBKEY), all your datasets will show up in the history pane. | |
| 13 associatedWithDatasetID is the $out_file.ID passed from xml, | |
| 14 designation will be a unique identifier for each output (set in your | |
| 15 script), | |
| 16 visibility can be set to visible if you want the dataset visible in | |
| 17 your history, or notvisible otherwise | |
| 18 extension is the required format for your dataset (bed, tabular, fasta | |
| 19 etc) | |
| 20 DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc) | |
| 21 | |
| 22 One of our tools "MAF to Interval converter" (tools/maf/ | |
| 23 maf_to_interval.xml) already uses this feature. You can use it as a | |
| 24 reference. | |
| 25 | |
| 26 qq.chisq Quantile-quantile plot for chi-squared tests | |
| 27 Description | |
| 28 This function plots ranked observed chi-squared test statistics against the corresponding expected | |
| 29 order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed | |
| 30 means of observed and expected values. This is useful for inspecting the results of whole-genome | |
| 31 association studies for overdispersion due to population substructure and other sources of bias or | |
| 32 confounding. | |
| 33 Usage | |
| 34 qq.chisq(x, df=1, x.max, main="QQ plot", | |
| 35 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), | |
| 36 xlab="Expected", ylab="Observed", | |
| 37 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, | |
| 38 slope.one=FALSE, slope.lambda=FALSE, | |
| 39 thin=c(0.25,50), oor.pch=24, col.shade="gray", ...) | |
| 40 Arguments | |
| 41 x A vector of observed chi-squared test values | |
| 42 df The degreees of freedom for the tests | |
| 43 x.max If present, truncate the observed value (Y) axis here | |
| 44 main The main heading | |
| 45 sub The subheading | |
| 46 xlab x-axis label (default "Expected") | |
| 47 ylab y-axis label (default "Observed") | |
| 48 conc Lower and upper probability bounds for concentration band for the plot. Set this | |
| 49 to NA to suppress this | |
| 50 overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating | |
| 51 concentration band | |
| 52 trim Quantile point for trimmed mean calculations for estimation of lambda. Default | |
| 53 is to trim at the median | |
| 54 slope.one Is a line of slope one to be superimpsed? | |
| 55 slope.lambda Is a line of slope lambda to be superimposed? | |
| 56 thin A pair of numbers indicating how points will be thinned before plotting (see | |
| 57 Details). If NA, no thinning will be carried out | |
| 58 oor.pch Observed values greater than x.max are plotted at x.max. This argument sets | |
| 59 the plotting symbol to be used for out-of-range observations | |
| 60 col.shade The colour with which the concentration band will be filled | |
| 61 ... Further graphical parameter settings to be passed to points() | |
| 62 | |
| 63 Details | |
| 64 To reduce plotting time and the size of plot files, the smallest observed and expected points are | |
| 65 thinned so that only a reduced number of (approximately equally spaced) points are plotted. The | |
| 66 precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers. | |
| 67 The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning | |
| 68 is to be applied. The second number should be an integer and sets the maximum number of points | |
| 69 to be plotted in this section. | |
| 70 The "concentration band" for the plot is shown in grey. This region is defined by upper and lower | |
| 71 probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a | |
| 72 simultaneous confidence region; the probability that the plot will stray outside the band at some | |
| 73 point exceeds 95 | |
| 74 When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its | |
| 75 expected value under the chi-squared assumption. | |
| 76 Value | |
| 77 The function returns the number of tests, the number of values omitted from the plot (greater than | |
| 78 x.max), and the estimated dispersion factor, lambda. | |
| 79 Note | |
| 80 All tests must have the same number of degrees of freedom. If this is not the case, I suggest | |
| 81 transforming to p-values and then plotting -2log(p) as chi-squared on 2 df. | |
| 82 Author(s) | |
| 83 David Clayton hdavid.clayton@cimr.cam.ac.uki | |
| 84 References | |
| 85 Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004 | |
| 86 """ | |
| 87 | |
| 88 import sys, random, math, copy,os, subprocess, tempfile | |
| 89 from rgutils import RRun, rexe | |
| 90 | |
| 91 rqq = """ | |
| 92 # modified by ross lazarus for the rgenetics project may 2000 | |
| 93 # makes a pdf for galaxy from an x vector of chisquare values | |
| 94 # from snpMatrix | |
| 95 # http://www.bioconductor.org/packages/bioc/html/snpMatrix.html | |
| 96 qq.chisq <- | |
| 97 function(x, df=1, x.max, | |
| 98 main="QQ plot", | |
| 99 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), | |
| 100 xlab="Expected", ylab="Observed", | |
| 101 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, | |
| 102 slope.one=T, slope.lambda=FALSE, | |
| 103 thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf", | |
| 104 h=6,w=6,printpdf=F,...) { | |
| 105 | |
| 106 # Function to shade concentration band | |
| 107 | |
| 108 shade <- function(x1, y1, x2, y2, color=col.shade) { | |
| 109 n <- length(x2) | |
| 110 polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color) | |
| 111 } | |
| 112 | |
| 113 # Sort values and see how many out of range | |
| 114 | |
| 115 obsvd <- sort(x, na.last=NA) | |
| 116 N <- length(obsvd) | |
| 117 if (missing(x.max)) { | |
| 118 Np <- N | |
| 119 } | |
| 120 else { | |
| 121 Np <- sum(obsvd<=x.max) | |
| 122 } | |
| 123 if(Np==0) | |
| 124 stop("Nothing to plot") | |
| 125 | |
| 126 # Expected values | |
| 127 | |
| 128 if (df==2) { | |
| 129 expctd <- 2*cumsum(1/(N:1)) | |
| 130 } | |
| 131 else { | |
| 132 expctd <- qchisq(p=(1:N)/(N+1), df=df) | |
| 133 } | |
| 134 | |
| 135 # Concentration bands | |
| 136 | |
| 137 if (!is.null(conc)) { | |
| 138 if(conc[1]>0) { | |
| 139 e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df) | |
| 140 } | |
| 141 else { | |
| 142 e.low <- rep(0, N) | |
| 143 } | |
| 144 if (conc[2]<1) { | |
| 145 e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df) | |
| 146 } | |
| 147 else { | |
| 148 e.high <- 1.1*rep(max(x),N) | |
| 149 } | |
| 150 } | |
| 151 | |
| 152 # Plot outline | |
| 153 | |
| 154 if (Np < N) | |
| 155 top <- x.max | |
| 156 else | |
| 157 top <- obsvd[N] | |
| 158 right <- expctd[N] | |
| 159 if (printpdf) {pdf(ofname,h,w)} | |
| 160 plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab, | |
| 161 main=main, sub=sub) | |
| 162 | |
| 163 # Thinning | |
| 164 | |
| 165 if (is.na(thin[1])) { | |
| 166 show <- 1:Np | |
| 167 } | |
| 168 else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) { | |
| 169 warning("invalid thin parameter; no thinning carried out") | |
| 170 show <- 1:Np | |
| 171 } | |
| 172 else { | |
| 173 space <- right*thin[1]/floor(thin[2]) | |
| 174 iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df)) | |
| 175 if (max(iat)>thin[2]) | |
| 176 show <- unique(c(iat, (1+max(iat)):Np)) | |
| 177 else | |
| 178 show <- 1:Np | |
| 179 } | |
| 180 Nu <- floor(trim*N) | |
| 181 if (Nu>0) | |
| 182 lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu]) | |
| 183 if (!is.null(conc)) { | |
| 184 if (Np<N) | |
| 185 vert <- c(show, (Np+1):N) | |
| 186 else | |
| 187 vert <- show | |
| 188 if (overdisp) | |
| 189 shade(expctd[vert], lambda*e.low[vert], | |
| 190 expctd[vert], lambda*e.high[vert]) | |
| 191 else | |
| 192 shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert]) | |
| 193 } | |
| 194 points(expctd[show], obsvd[show], ...) | |
| 195 # Overflow | |
| 196 if (Np<N) { | |
| 197 over <- (Np+1):N | |
| 198 points(expctd[over], rep(x.max, N-Np), pch=oor.pch) | |
| 199 } | |
| 200 # Lines | |
| 201 line.types <- c("solid", "dashed", "dotted") | |
| 202 key <- NULL | |
| 203 txt <- NULL | |
| 204 if (slope.one) { | |
| 205 key <- c(key, line.types[1]) | |
| 206 txt <- c(txt, "y = x") | |
| 207 abline(a=0, b=1, lty=line.types[1]) | |
| 208 } | |
| 209 if (slope.lambda && Nu>0) { | |
| 210 key <- c(key, line.types[2]) | |
| 211 txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep="")) | |
| 212 if (!is.null(conc)) { | |
| 213 if (Np<N) | |
| 214 vert <- c(show, (Np+1):N) | |
| 215 else | |
| 216 vert <- show | |
| 217 } | |
| 218 abline(a=0, b=lambda, lty=line.types[2]) | |
| 219 } | |
| 220 if (printpdf) {dev.off()} | |
| 221 # Returned value | |
| 222 | |
| 223 if (!is.null(key)) | |
| 224 legend(0, top, legend=txt, lty=key) | |
| 225 c(N=N, omitted=N-Np, lambda=lambda) | |
| 226 | |
| 227 } | |
| 228 | |
| 229 """ | |
| 230 | |
| 231 | |
| 232 | |
| 233 | |
| 234 def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title', | |
| 235 xvar='Sample',h=8,w=8,logscale=True,outdir=None): | |
| 236 """ | |
| 237 y is data for a qq plot and ends up on the x axis go figure | |
| 238 if sampling, oversample low values - all the top 1% ? | |
| 239 assume we have 0-1 p values | |
| 240 """ | |
| 241 R = [] | |
| 242 colour="maroon" | |
| 243 nrows = len(dat) | |
| 244 dat.sort() # small to large | |
| 245 fn = float(nrows) | |
| 246 unifx = [x/fn for x in range(1,(nrows+1))] | |
| 247 if logscale: | |
| 248 unifx = [-math.log10(x) for x in unifx] # uniform distribution | |
| 249 if sample < 1.0 and len(dat) > maxveclen: | |
| 250 # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points | |
| 251 # oversample part of the distribution | |
| 252 always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5% | |
| 253 skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points | |
| 254 if skip <= 1: | |
| 255 skip = 2 | |
| 256 samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)] | |
| 257 # always oversample first sorted (here lowest) values | |
| 258 yvec = [dat[i] for i in samplei] # always get first and last | |
| 259 xvec = [unifx[i] for i in samplei] # and sample xvec same way | |
| 260 maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows) | |
| 261 else: | |
| 262 yvec = [x for x in dat] | |
| 263 maint='QQ %s (n=%d)' % (title,nrows) | |
| 264 xvec = unifx | |
| 265 if logscale: | |
| 266 maint = 'Log%s' % maint | |
| 267 mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line | |
| 268 ylab = '-log10(%s) Quantiles' % title | |
| 269 xlab = '-log10(Uniform 0-1) Quantiles' | |
| 270 yvec = [-math.log10(x) for x in yvec if x > 0.0] | |
| 271 else: | |
| 272 mx = [0,1] | |
| 273 ylab = '%s Quantiles' % title | |
| 274 xlab = 'Uniform 0-1 Quantiles' | |
| 275 | |
| 276 xv = ['%f' % x for x in xvec] | |
| 277 R.append('xvec = c(%s)' % ','.join(xv)) | |
| 278 yv = ['%f' % x for x in yvec] | |
| 279 R.append('yvec = c(%s)' % ','.join(yv)) | |
| 280 R.append('mx = c(0,%f)' % (math.log10(fn))) | |
| 281 R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w)) | |
| 282 R.append("par(lab=c(10,10,10))") | |
| 283 R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour)) | |
| 284 R.append('points(mx,mx,type="l")') | |
| 285 R.append('grid(col="lightgray",lty="dotted")') | |
| 286 R.append('dev.off()') | |
| 287 RRun(rcmd=R,title='makeQQplot',outdir=outdir) | |
| 288 | |
| 289 | |
| 290 | |
| 291 def main(): | |
| 292 u = """ | |
| 293 """ | |
| 294 u = """<command interpreter="python"> | |
| 295 rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ | |
| 296 </command> | |
| 297 | |
| 298 </command> | |
| 299 """ | |
| 300 print >> sys.stdout,'## rgQQ.py. cl=',sys.argv | |
| 301 npar = 11 | |
| 302 if len(sys.argv) < npar: | |
| 303 print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar | |
| 304 print >> sys.stdout, u | |
| 305 sys.exit(1) | |
| 306 in_fname = sys.argv[1] | |
| 307 name = sys.argv[2] | |
| 308 sample = float(sys.argv[3]) | |
| 309 head = None | |
| 310 columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns! | |
| 311 allout = sys.argv[5] | |
| 312 height = int(sys.argv[6]) | |
| 313 width = int(sys.argv[7]) | |
| 314 logscale = (sys.argv[8].lower() == 'true') | |
| 315 outid = sys.argv[9] # this is used to allow multiple output files | |
| 316 outdir = sys.argv[10] | |
| 317 nan_row = False | |
| 318 rows = [] | |
| 319 for i, line in enumerate( file( sys.argv[1] ) ): | |
| 320 # Skip comments | |
| 321 if line.startswith( '#' ) or ( i == 0 ): | |
| 322 if i == 0: | |
| 323 head = line.strip().split("\t") | |
| 324 continue | |
| 325 if len(line.strip()) == 0: | |
| 326 continue | |
| 327 # Extract values and convert to floats | |
| 328 fields = line.strip().split( "\t" ) | |
| 329 row = [] | |
| 330 nan_row = False | |
| 331 for column in columns: | |
| 332 if len( fields ) <= column: | |
| 333 return fail( "No column %d on line %d: %s" % ( column, i, fields ) ) | |
| 334 val = fields[column] | |
| 335 if val.lower() == "na": | |
| 336 nan_row = True | |
| 337 else: | |
| 338 try: | |
| 339 row.append( float( fields[column] ) ) | |
| 340 except ValueError: | |
| 341 return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) ) | |
| 342 if not nan_row: | |
| 343 rows.append( row ) | |
| 344 if i > 1: | |
| 345 i = i-1 # remove header row from count | |
| 346 if head == None: | |
| 347 head = ['Col%d' % (x+1) for x in columns] | |
| 348 R = [] | |
| 349 for c,column in enumerate(columns): # we appended each column in turn | |
| 350 outname = allout | |
| 351 if c > 0: # after first time | |
| 352 outname = 'primary_%s_col%s_visible_pdf' % (outid,column) | |
| 353 outname = os.path.join(outdir,outname) | |
| 354 dat = [] | |
| 355 nrows = len(rows) # sometimes lots of NA's!! | |
| 356 for arow in rows: | |
| 357 dat.append(arow[c]) # remember, we appended each col in turn! | |
| 358 cname = head[column] | |
| 359 makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname), | |
| 360 xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir) | |
| 361 | |
| 362 | |
| 363 | |
| 364 if __name__ == "__main__": | |
| 365 main() |
