# HG changeset patch
# User petr-novak
# Date 1619599040 0
# Node ID d14b68e9fd1dcf10f54f1415e425ddd353c34343
# Parent  5376e1c9adec7f14287d4d7c09899ac109816f6c
Uploaded - new tools added

diff -r 5376e1c9adec -r d14b68e9fd1d ChipSeqRatioReport
--- a/ChipSeqRatioReport	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,265 +0,0 @@
[deleted file: a 265-line HTML report titled "ChIP-Seq Mapper Output"; its markup was rendered during extraction, so only the recoverable table content follows]

Cluster   Chip_Hits   Input_Hits   Ratio Chip/Input   Normalized ratio Chip/Input   Ratio Chip/(Chip+Input)   Normalized ratio Chip/(Chip+Input)
73        5171        91           56.8               56.8                          0.98                      0.98
112       15274       240          63.6               63.6                          0.98                      0.98
160       1           0            Inf                Inf                           1.00                      1.00
168       1306        25           52.2               52.2                          0.98                      0.98
208       3134        25           125.4              125.4                         0.99                      0.99
213       1           0            Inf                Inf                           1.00                      1.00
225       409         1            409.0              409.0                         1.00                      1.00
236       4638        55           84.3               84.3                          0.99                      0.99
250       1           0            Inf                Inf                           1.00                      1.00
294       11          5            2.2                2.2                           0.69                      0.69
Generated on: Fri Nov 22 12:59:10 2019 - R2HTML
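The two ratio columns follow directly from the hit counts: cluster 73 gives 5171/91 ≈ 56.8 and 5171/(5171+91) ≈ 0.98. A minimal sketch that recomputes them, assuming a whitespace-delimited export of the table above (file name hypothetical):

    # recompute Chip/Input and Chip/(Chip+Input) from columns 2 and 3
    awk 'NR > 1 {
        c = $2; i = $3
        ratio = (i > 0) ? sprintf("%.1f", c / i) : "Inf"
        printf "%s\t%s\t%.2f\n", $1, ratio, c / (c + i)
    }' chip_seq_ratio.tsv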
\ No newline at end of file
diff -r 5376e1c9adec -r d14b68e9fd1d Galaxy_integration.org
--- a/Galaxy_integration.org	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,58 +0,0 @@
-
-#+BEGIN_SRC sh
-/home/petr/anaconda3/bin/planemo shed_init --name=repeatexplorer_utilities \
-    --owner=repeatexplorer \
-    --description="some utilities for data preprocessing" \
-    --long_description="some utilities for data preprocessing" \
-    --category="Fasta Manipulation"
-#+END_SRC
-# this creates the file .shed.yml
-
-
-# create the repository manually:
-#+BEGIN_SRC sh
-planemo shed_build .
-#+END_SRC
-
-#+RESULTS:
-: Created: /mnt/raid/users/petr/workspace/re_utilities.tar.gz
-
-this creates the archive ../re_utilities.tar.gz;
-unnecessary files are then deleted from the archive:
-
-#+BEGIN_SRC bash :tangle clean_repository_tarball.sh
-  #!/bin/bash
-  containsElement () {
-      local e match="$1"
-      shift
-      for e; do [[ "$e" == "$match" ]] && return 0; done
-      return 1
-  }
-  ARCHIVE_GZ=../re_utilities.tar.gz
-  TMP_TAR=`mktemp`
-  echo $TMP_TAR
-  ARCHIVE_GZ_CLEAN=../re_utilities_clean.tar.gz
-  zcat $ARCHIVE_GZ > $TMP_TAR
-  ARCHIVE_FILE_LIST=`tar -tz -f ../re_utilities.tar.gz`
-  ls -l $TMP_TAR
-  GIT_LIST=`git ls-files`
-
-  for FILE in $ARCHIVE_FILE_LIST
-  do
-      containsElement $FILE ${GIT_LIST[@]}
-      if [ $? != 0 ]
-      then
-          echo "Deleting ${FILE}"
-          tar --delete -f $TMP_TAR $FILE
-      fi
-  done
-  ls -l $TMP_TAR
-  echo "compressing.."
-  gzip -c $TMP_TAR > $ARCHIVE_GZ_CLEAN
-  echo "output in ${ARCHIVE_GZ_CLEAN}"
-
-#+END_SRC
diff -r 5376e1c9adec -r d14b68e9fd1d README.html
--- a/README.html	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,70 +0,0 @@
[deleted file: a 70-line HTML README; its markup was rendered during extraction, so the cleaned text content follows]

RepeatExplorer utilities

This repository includes utilities for preprocessing NGS data into a format
suitable for RepeatExplorer and TAREAN analysis. Each tool also includes an XML
file that defines the tool interface for the Galaxy environment.

Available tools

Paired fastq reads filtering and interlacing

tool definition file: paired_fastq_filtering.xml

This tool performs memory-efficient preprocessing of two fastq files; its
output can be used as input for RepeatExplorer clustering. Input files can be
gzip-compressed (.gz extension). Reads are filtered based on quality, the
presence of N bases, and adapters. The two input fastq files are processed in
parallel, and only complete pairs are kept. Because the input files are
processed in chunks, paired reads must be complete and occur in the same order
in both input files. All reads that pass the quality filter are written to the
output files. If sampling is specified, only a sample of the sequences is
returned. Cutadapt is run with these options:

    --anywhere='AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT'
    --anywhere='AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
    --anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
    --anywhere='ATCTCGTATGCCGTCTTCTGCTTG'
    --anywhere='CAAGCAGAAGACGGCATACGAGAT'
    --anywhere='GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC'
    --error-rate=0.05
    --times=1 --overlap=15 --discard
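For reference, the corresponding standalone cutadapt call looks roughly like this (file names are placeholders and the adapter list is shortened to two of the six options above; single-end shown for brevity):

    cutadapt \
        --anywhere='GATCGGAAGAGCACACGTCTGAACTCCAGTCAC' \
        --anywhere='ATCTCGTATGCCGTCTTCTGCTTG' \
        --error-rate=0.05 --times=1 --overlap=15 --discard \
        -o filtered_1.fastq.gz reads_1.fastq.gz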

Order of fastq files processing

  1. Trimming (optional)
  2. Filter by quality
  3. Discard single reads, keep complete pairs
  4. Cutadapt filtering
  5. Discard single reads, keep complete pairs (see the sketch after this list)
  6. Sampling (optional)
  7. Interlacing two fasta files
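The two pair-completion steps drop any read whose mate was removed by the preceding filter. The same effect can be had standalone with seqkit (already a dependency of the sampling tool below); a sketch with placeholder file names, assuming the seqkit pair subcommand of a recent seqkit release:

    # keep only reads that still have a mate in the other file
    seqkit pair -1 filtered_1.fastq.gz -2 filtered_2.fastq.gz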

Single fastq reads filtering

tool definition file: single_fastq_filtering.xml

This tool performs preprocessing of a single fastq file. The input file can be
gzip-compressed (.gz extension). Reads are filtered based on quality, the
presence of N bases, and adapters. All reads that pass the quality filter are
written to the output file. If sampling is specified, only a sample of the
sequences is returned.

Fasta affixer

tool definition file: fasta_affixer.xml

Tool for appending a prefix and suffix to sequence names in fasta-formatted
files. This is useful for comparative analysis with RepeatExplorer, where
sample codes need to be appended to sequence identifiers.
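The renaming itself is plain header rewriting; a rough awk equivalent, assuming headers that contain only the identifier (prefix and suffix values are illustrative):

    # ">read1" becomes ">sampleA_read1_01"
    awk '/^>/ { print ">sampleA_" substr($0, 2) "_01"; next } { print }' input.fasta > renamed.fasta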

Dependencies

  - R programming environment with the optparse and ShortRead (Bioconductor) packages
  - python3
  - cutadapt
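One way to satisfy these dependencies is a conda environment; the package names below are the usual conda-forge/Bioconda ones, not pinned by this repository:

    conda create -n re_utils -c conda-forge -c bioconda \
        python=3 cutadapt r-base r-optparse bioconductor-shortread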

License

Copyright (c) 2012 Petr Novak (petr@umbr.cas.cz), Jiri Macas and Pavel Neumann,
Laboratory of Molecular Cytogenetics (http://w3lamc.umbr.cas.cz/lamc/),
Institute of Plant Molecular Biology, Biology Centre AS CR, Ceske Budejovice, Czech Republic

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details. You
should have received a copy of the GNU General Public License along with this
program. If not, see http://www.gnu.org/licenses/.
diff -r 5376e1c9adec -r d14b68e9fd1d __init__.py
diff -r 5376e1c9adec -r d14b68e9fd1d annot2krona.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/annot2krona.py	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,53 @@
+#!/usr/bin/env python3
+'''
+take various inputs and convert them to krona tabular format for visualization
+supported inputs:
+- DANTE gff3
+- TODO PROFREP gff3
+- TODO RE archive - normal run
+- TODO RE archive - comparative
+'''
+import argparse
+import re
+import collections
+
+
+def parse_dante_gff(f):
+    '''load gff3 file and return classification with counts'''
+    r = re.compile("Final_Classification=")
+    cls_count = collections.defaultdict(int)
+    for line in f:
+        if re.match("#", line.strip()):
+            continue
+        attributes = line.split("\t")[8].split(";")
+        cls_raw = list(filter(r.match, attributes))[0]
+        cls = re.sub(r, "", cls_raw)
+        cls_count[cls] += 1
+
+    return cls_count
+
+
+def export_classification(cls, f):
+    '''save classification to tab delimited file'''
+    for i in cls:
+        f.write('{}\t{}\n'.format(cls[i], i.replace("|", "\t")))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-f', '--format', choices=['dante', 'profrep', 're'])
+    parser.add_argument('-i', '--input', type=argparse.FileType('r'))
+    parser.add_argument('-o', '--output', type=argparse.FileType('w'))
+
+    args = parser.parse_args()
+
+    if args.format == "dante":
+        classification = parse_dante_gff(args.input)
+
+    if args.format in ["profrep", "re"]:
+        print("Not implemented")
+        exit(0)
+
+    export_classification(classification, args.output)
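A typical run converts DANTE annotations and renders them with KronaTools (file names are placeholders; ktImportText is assumed to be on PATH):

    python3 annot2krona.py -f dante -i dante_annotation.gff3 -o classification.tab
    ktImportText -o krona_plot.html classification.tab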
diff -r 5376e1c9adec -r d14b68e9fd1d clean_repository_tarball.sh
--- a/clean_repository_tarball.sh	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,29 +0,0 @@
-#!/bin/bash
-containsElement () {
-    local e match="$1"
-    shift
-    for e; do [[ "$e" == "$match" ]] && return 0; done
-    return 1
-}
-ARCHIVE_GZ=../re_utilities.tar.gz
-TMP_TAR=`mktemp`
-echo $TMP_TAR
-ARCHIVE_GZ_CLEAN=../re_utilities_clean.tar.gz
-zcat $ARCHIVE_GZ > $TMP_TAR
-ARCHIVE_FILE_LIST=`tar -tz -f ../re_utilities.tar.gz`
-ls -l $TMP_TAR
-GIT_LIST=`git ls-files`
-
-for FILE in $ARCHIVE_FILE_LIST
-do
-    containsElement $FILE ${GIT_LIST[@]}
-    if [ $? != 0 ]
-    then
-        echo "Deleting ${FILE}"
-        tar --delete -f $TMP_TAR $FILE
-    fi
-done
-ls -l $TMP_TAR
-echo "compressing.."
-gzip -c $TMP_TAR > $ARCHIVE_GZ_CLEAN
-echo "output in ${ARCHIVE_GZ_CLEAN}"
diff -r 5376e1c9adec -r d14b68e9fd1d cluster_table2krona_format.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cluster_table2krona_format.py	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,44 @@
+#!/usr/bin/env python
+import sys
+import re
+from collections import defaultdict
+import argparse
+
+parser = argparse.ArgumentParser()
+parser.add_argument("-i", "--input", type=argparse.FileType('r'), help="path to file CLUSTER_table.csv")
+parser.add_argument("-o", "--output", type=argparse.FileType('w'), help="output file name")
+parser.add_argument("-m", "--use_manual", action='store_true', default=False)
+
+args = parser.parse_args()
+
+# column 4 is Automatic_annotation, column 6 is Final_annotation
+column = 6 if args.use_manual else 4
+
+header = False
+clust_info = {}
+counts = defaultdict(lambda: 0)
+top_clusters = 0
+with open(args.input.name, 'r') as f:
+    for l in f:
+        parts = l.split()
+        if re.match('.*Cluster.+Supercluster.+Size.+Size_adjusted.+Automatic_annotation.+TAREAN_annotation.+Final_annotation', l):
+            print("header detected")
+            header = True
+            continue
+        if header:
+            classification = "Top_clusters\t" + "\t".join(parts[column].split("/")[1:]).replace('"', '')
+            counts[classification] += int(parts[3])
+            top_clusters += int(parts[3])
+        elif len(parts) >= 2:
+            clust_info[parts[0].replace('"', '')] = int(parts[1])
+
+counts['Singlets'] = clust_info['Number_of_singlets']
+counts['Small_cluster'] = int(clust_info['Number_of_reads_in_clusters']) - top_clusters
+
+with open(args.output.name, 'w') as fout:
+    for cls_line, nreads in counts.items():
+        fout.write(str(nreads) + "\t" + cls_line + "\n")
diff -r 5376e1c9adec -r d14b68e9fd1d cluster_table2krona_format.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/cluster_table2krona_format.xml	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,45 @@
[the 45 lines of tool XML markup were lost in extraction; only a stray "/" survives]
diff -r 5376e1c9adec -r d14b68e9fd1d example1.sh
--- a/example1.sh	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#!/bin/sh
-# minimal example :
-./ChipSeqRatioAnalysis.py -c test_data/seq_C_10k -i test_data/seq_I_10k -k test_data/VTS_contigs.info.minRD5 -o tmp/chi_seq_ratio.csv -ht tmp/html_report.html
-# output goes to current directory
diff -r 5376e1c9adec -r d14b68e9fd1d example2.sh
--- a/example2.sh	Fri Apr 24 08:54:30 2020 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#!/bin/sh
-# minimal example2 :
-./ChipSeqRatioAnalysis.py -c test_data/seq_C_20k -i test_data/seq_I_10k -k test_data/VTS_contigs.info.minRD5 -o tmp/chi_seq_ratio2.csv -ht tmp/html_report2.html --max_cl 300
-# output goes to current directory
diff -r 5376e1c9adec -r d14b68e9fd1d extract_files_from_re_archive.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/extract_files_from_re_archive.xml	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,49 @@
[XML markup was partly lost in extraction; the recoverable command and output-filter sections follow]
+    #for $sf in $file:
+        #if $sf == "CLUSTER_TABLE.csv"
+            unzip -p -j ${RepeatExplorer_archive} ${sf} > ${cluster_table}
+            ;
+        #end if
+        #if $sf == "COMPARATIVE_ANALYSIS_COUNTS.csv"
+            unzip -p -j ${RepeatExplorer_archive} ${sf} > ${comparative_analysis_count}
+            ;
+        #end if
+        #if $sf == "SUPERCLUSTER_TABLE.csv"
+            unzip -p -j ${RepeatExplorer_archive} ${sf} > ${supercluster_table}
+            ;
+        #end if
+    #end for
+    "CLUSTER_TABLE.csv" in file
+    "SUPERCLUSTER_TABLE.csv" in file
+    "COMPARATIVE_ANALYSIS_COUNTS.csv" in file
diff -r 5376e1c9adec -r d14b68e9fd1d fasta_tmp_single
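The cluster-table converter above feeds Krona the same way (placeholder file names; add -m to use the Final_annotation column instead of the automatic one):

    python3 cluster_table2krona_format.py -i CLUSTER_TABLE.csv -o cluster_counts.tab
    ktImportText -o cluster_krona.html cluster_counts.tab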
diff -r 5376e1c9adec -r d14b68e9fd1d plot_comparative_clustering_summary.R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/plot_comparative_clustering_summary.R	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,304 @@
+#!/usr/bin/env Rscript
+library(optparse)
+## TODO - add scale to legend!
+twenty_colors = c(
+    '#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231',
+    '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe',
+    '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000',
+    '#aaffc3', '#808000', '#ffd8b1', '#000075', "#000000"
+)
+
+get_color = function(classification, size){
+    ## up to 20 unique colors; the last one is black
+    unique_colors = twenty_colors[1:opt$number_of_colors]
+    Ncol = length(unique_colors)
+    ## the rest will be grey:
+    grey_color = "#a9a9a9"
+    ## unique repeats without All
+    include = !classification %in% "All"
+    unique_repeats = names(c(sort(by(size[include], INDICES = classification[include], FUN = sum), decreasing = TRUE)))
+    color_table = unique_colors[1:min(Ncol, length(unique_repeats))]
+    names(color_table) = unique_repeats[1:min(Ncol, length(unique_repeats))]
+    color = rep(grey_color, length(classification))
+    names(color) = classification
+    for (ac in names(color_table)){
+        color[names(color) %in% ac] = color_table[ac]
+    }
+    return(color)
+}
+
+
+make_legend = function(color){
+    ## simplify description:
+    names(color) = gsub(".+/", "", names(color))
+    description = sapply(split(names(color), color), function(x) paste(unique(x), collapse=";"))
+    description = gsub(".+;.+", "Other", description)
+    description = gsub("All", "Other", description)
+    if ("Other" %in% description & length(description) > 1){
+        description = c(description[! description %in% "Other"], description[description %in% "Other"])
+    }
+    ord = order(factor(names(description), levels = twenty_colors))
+    legend_info = list(name = gsub("All", "NA", description)[ord], color = names(description)[ord])
+}
+
+plot_rect_map = function(read_counts, cluster_annotation, output_file, GS, RL, Xcoef=1, Ycoef=1){
+    ## read_counts : corresponds to COMPARATIVE_ANALYSIS_COUNTS.csv
+    ## cluster_annotation : CLUSTER_TABLE.csv
+    counts = read.table(read_counts, header=TRUE, as.is=TRUE)
+    input_read_counts = unlist(read.table(read_counts, nrows = 1, comment.char = "", sep="\t")[-(1:2)])
+
+    counts_file_valid = ncol(counts) == (length(input_read_counts) + 2) & all(colnames(input_read_counts)[1:2] == c("cluster", "supercluster"))
+    ## find which line is the header
+    header_line = grep(".*Cluster.*Supercluster.*Size", readLines(cluster_annotation))
+    annot = read.table(cluster_annotation, sep="\t", header=TRUE, as.is=TRUE, skip = header_line - 1)
+    ## validate
+    annot_file_valid = all(colnames(annot) == c("Cluster", "Supercluster", "Size", "Size_adjusted", "Automatic_annotation", "TAREAN_annotation", "Final_annotation"))
+
+    if (!annot_file_valid | !counts_file_valid){
+        pdf(output_file)
+        plot.new()
+        text(0.5, 0.5, "Input is not valid, check input files!")
+        dev.off()
+        stop("Input files are not valid!")
+    }
+    print(annot_file_valid)
+    print(counts_file_valid)
+    ## remove counts which are not in annotation - only clusters in annot will be plotted!
+    counts = counts[annot$Cluster, ]
+    N = nrow(annot)
+
+    counts_automatic = counts
+    annot_automatic = annot
+    input_read_counts_automatic = input_read_counts
+    ## remove organellar reads and contamination if required and make count correction
+    if (opt$nuclear_only){
+        exclude = grep("contamination|organelle", annot$Automatic_annotation)
+        if (length(exclude) > 0){
+            counts_automatic = counts[-exclude, , drop=FALSE]
+            annot_automatic = annot[-exclude, , drop=FALSE]
+            input_read_counts_automatic = input_read_counts - colSums(counts[exclude, -c(1:2), drop=FALSE])
+        }
+    }
+    color_auto = get_color(annot_automatic$Automatic_annotation, annot_automatic$Size)
+
+    legend_info = make_legend(color_auto)
+    params = list(Automatic_annotation = list(
+        color = color_auto,
+        legend = legend_info,
+        counts = counts_automatic,
+        annot = annot_automatic,
+        input_read_counts = input_read_counts_automatic
+    ))
+
+    if (!is.null(annot$Final_annotation)){
+        ## column with manual annotation exists - check if correct
+        if (any(annot$Final_annotation %in% "" | any(is.na(annot$Final_annotation)))){
+            message("Final annotation is not complete, skipping")
+        }else{
+            counts_manual = counts
+            annot_manual = annot
+            input_read_counts_manual = input_read_counts
+            ## correction must be done independently in case manual and automatic classification differ in annotation
+            if (opt$nuclear_only){
+                exclude = grep("contamination|organelle", annot$Final_annotation)
+                if (length(exclude) > 0){
+                    counts_manual = counts[-exclude, , drop=FALSE]
+                    annot_manual = annot[-exclude, , drop=FALSE]
+                    input_read_counts_manual = input_read_counts - colSums(counts[exclude, -c(1:2), drop=FALSE])
+                }
+            }
+            ## append
+            color_manual = get_color(annot_manual$Final_annotation, annot_manual$Size)
+            legend_info_manual = make_legend(color_manual)
+
+            params$Final_annotation = list(
+                color = color_manual,
+                legend = legend_info_manual,
+                counts = counts_manual,
+                annot = annot_manual,
+                input_read_counts = input_read_counts_manual
+            )
+        }
+    }
+
+    ## set size of pdf output
+    wdth = (3 + N*0.03) * Xcoef
+    hgt = (2.2 + ncol(counts)*0.20) * Ycoef
+    if (!any(is.na(GS))){
+        hgt = hgt + ncol(counts)*0.20 * Ycoef
+    }
+    ptsize = round((wdth*hgt)^(1/4))*5
+
+    pdf(output_file, width=wdth, height=hgt, pointsize = ptsize)  # was 50
+    ## originally - printing of both automatic and final annotation
+    ## now - print only final_annotation if available
+    if (length(params) == 2){
+        ## remove automatic
+        params$Automatic_annotation = NULL
+    }
+
+    for (j in seq_along(params)){
+        Nclust = nrow(params[[j]]$annot)
+        ## prepare matrix for plotting
+        M = as.matrix(params[[j]]$counts[1:Nclust, -(1:2)])
+        rownames(M) = paste0("CL", rownames(M))
+        Mn1 = (M)/apply(M, 1, max)
+        Mn2 = M/max(M)
+        ord1 = hclust(dist(Mn1), method = "ward.D")$order
+        ord2 = hclust(dist(t(Mn2)))$order
+
+        ploting_area_width = 3 + log10(Nclust)*3
+        ploting_area_sides = 1.5
+        legend_width = 3
+        title_height = 0.5
+        if (any(is.na(GS))){
+            layout(matrix(c(0,0,0,3,0,2,0,3,0,1,0,3), ncol=4, byrow = TRUE),
+                   width=c(ploting_area_sides, ploting_area_width, ploting_area_sides, legend_width),
+                   height=c(title_height, 3, ncol(M)*0.8))
+        }else{
+            ## extra row for legends
+            layout(matrix(c(0,0,0,3,0,2,0,3,0,1,0,3,0,0,0,4), ncol=4, byrow = TRUE),
+                   width=c(ploting_area_sides, ploting_area_width, ploting_area_sides, legend_width),
+                   height=c(title_height, 3, ncol(M)*0.8, ncol(M)*0.8))
+        }
+
+        par(xaxs='i', yaxs = 'i')
+        par(las=2, mar=c(4,0,0,0), cex.axis=0.5)
+
+        if (any(is.na(GS))){
+            rectMap(Mn2[ord1, ord2], scale.by='row', col=params[[j]]$color[ord1], grid=TRUE)
+        }else{
+            ## use genomic sizes
+            Mn3 = t(t(M) * (GS[colnames(M), ] / params[[j]]$input_read_counts))[ord1, ord2]
+            ## rescale
+            MaxGS = max(Mn3)
+            Mn3 = Mn3/max(Mn3)
+            rectMap(Mn3, scale.by='none', col=params[[j]]$color[ord1], grid=TRUE)
+        }
+        par(las=2, mar=c(1,0,1,0), mgp = c(2,1,0))
+        barplot(params[[j]]$annot$Size[ord1], col = 1)
+        mtext(side = 2, "Cluster size", las = 3, line = 2.5, cex = 0.5)
+        mtext(side = 3, names(params)[j], las=0, line=1)
+        plot.new()
+        legend("topleft", col=params[[j]]$legend$color, legend=params[[j]]$legend$name, pch=15, cex=0.7, bty="n", pt.cex=1)
+    }
+
+    if (!any(is.na(GS))){
+        ## plot GS scale
+        par(xaxs='i', yaxs = 'i')
+        print(log(nrow(Mn3)))
+        par(las=2, mar=c(4,0,0,log(nrow(Mn3))), cex.axis=0.5)  # same par as rectMap above to keep the scale
+        Mn3scale = Mn3
+        Mn3scale[,] = 0
+        colnames(Mn3scale) = rep("", ncol(Mn3scale))
+        rownames(Mn3scale) = rep("", nrow(Mn3scale))
+        Mn3scale[,1] = seq(0, 1, length.out = nrow(Mn3))
+        rectMap(Mn3scale, scale.by='none', col="grey", grid=FALSE, boxlab="", draw_box=FALSE, center=FALSE)
+        slabels = pretty(c(0, MaxGS), n = 10)
+        sat = slabels/MaxGS * nrow(Mn3scale)
+        axis(side=1, at=sat, labels = slabels, line = 0)
+        mtext(side = 1, text = "Repeat abundance", las=1, line=2.5, cex=0.4)
+        mtext(side = 2, text = "Rectangle\n height", las=1, line=2, cex=0.4, at=1)
+
+        axis(2, at=c(0.5, 1, 1.5), labels=c(0, 0.5, 1), line=0)
+    }
+    st = dev.off()
+}
+
+rectMap = function(x, scale.by='row', col=1, xlab="", ylab="", grid=TRUE, axis_pos=c(1,4), boxlab = "Cluster Id", cexx=NULL, cexy=NULL, draw_box=TRUE, center=TRUE){
+    if (scale.by == 'row'){
+        # x = (x)/rowSums(x)
+        x = (x)/apply(x, 1, sum)
+    }
+    if (scale.by == 'column'){
+        x = t(t(x)/apply(x, 2, max))
+    }
+    nc = ncol(x)
+    nr = nrow(x)
+    coords = expand.grid(1:nr, 1:nc)
+    plot(coords[,1], coords[,2], type='n', axes=F, xlim=range(coords[,1])+c(-.5,.5), ylim=range(coords[,2])+c(-.5,.5), xlab=xlab, ylab=ylab)
+    axis(axis_pos[1], at=1:nr, labels=rownames(x), lty=0, tick=FALSE, line=0, cex.axis=0.5/log10(nr))
+    axis(axis_pos[2], at=1:nc, labels=colnames(x), lty=0, tick=FALSE, las=2, line=0, hadj=0, cex.axis=0.7)
+    axis(2, at=1:nc, labels=colnames(x), lty=0, tick=FALSE, las=2, line=0, hadj=1, cex.axis=0.7)
+
+    mtext(side = 1, boxlab, las=1, line = 3, cex = 0.5)
+    line = 1.5 + log10(nr)
+    # mtext(side = 2, "Proportions of individual samples", las=0, line = line, cex = 0.5)
+    s = x/2
+    w = c(x)/2
+    if (center){
+        rect(coords[,1]-0.5, coords[,2]-s, coords[,1]+0.5, coords[,2]+s, col=col, border=NA)
+    }else{
+        rect(coords[,1]-0.5, coords[,2]-0.5, coords[,1]+0.5, coords[,2]+x-0.5, col=col, border=NA)
+    }
+    if (grid){
+        abline(v=0:(nr)+.5, h=0:(nc)+.5, lty=2, col="#60606030", lwd=0.2)
+    }
+    if (draw_box){
+        box(col="#60606030", lty=2, lwd=0.2)
+    }
+}
+
+option_list <- list(
+    make_option(c("-c", "--cluster_table"), default=NA, type = "character",
+                help="file from RepeatExplorer2 clustering - CLUSTER_TABLE.csv"),
+
+    make_option(c("-m", "--comparative_counts"), default = NA, type = "character",
+                help="file from RepeatExplorer2 output - COMPARATIVE_ANALYSIS_COUNTS.csv"),
+
+    make_option(c("-o", "--output"), type="character",
+                default="comparative_analysis_summary.pdf",
+                help="File name for output figures (pdf document)"),
+
+    make_option(c("-N", "--number_of_colors"), type="integer", default=10,
+                help="Number of unique colors used for plotting (2-20, default is 10)"),
+
+    make_option(c("-g", "--genome_size"), default = NA, type = "character",
+                help="genome sizes of species, provided as a tab delimited file in the format:
+
+                 species_code1  GenomeSize1
+                 species_code2  GenomeSize2
+                 species_code3  GenomeSize3
+                 species_code4  GenomeSize4
+
+                 provide the same species codes as in the file COMPARATIVE_ANALYSIS_COUNTS.csv. Use of a genome
+                 sizes file implies the --nuclear_only option. If genome sizes are used, a genomic abundance scale is added.
+                 "),
+
+    make_option(c("-n", "--nuclear_only"), default = FALSE, type="logical",
+                action = "store_true",
+                help="remove all non-nuclear sequences (organelle and contamination).")
+)
+
+opt = parse_args(OptionParser(option_list = option_list))
+
+if (any(is.na(c(opt$cluster_table, opt$comparative_counts)))){
+    message("\nBoth files: CLUSTER_TABLE.csv and COMPARATIVE_ANALYSIS_COUNTS.csv must be provided\n")
+    q()
+}
+
+if (!opt$number_of_colors %in% 1:20){
+    message("number of colors must be in the range 1..20")
+    stop()
+}
+
+if (!is.na(opt$genome_size)){
+    GS = read.table(opt$genome_size, header=FALSE, as.is=TRUE, row.names = 1)
+    opt$nuclear_only = TRUE
+}else{
+    GS = NA
+    RL = NA
+}
+
+plot_rect_map(opt$comparative_counts, opt$cluster_table, opt$output, GS, RL)
diff -r 5376e1c9adec -r d14b68e9fd1d plot_comparative_clustering_summary.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/plot_comparative_clustering_summary.xml	Wed Apr 28 08:37:20 2021 +0000
@@ -0,0 +1,71 @@
[most XML markup was lost in extraction; the recoverable description, requirement, command, and help sections follow]
+    Simple utility to create a visualization of RepeatExplorer comparative analysis
+    r-optparse
+
+    $__tool_directory__/plot_comparative_clustering_summary.R
+    --cluster_table=$cluster_table
+    --comparative_counts=$counts
+    --number_of_colors=$number_of_colors
+    --output=$outpdf
+    $nuclear_only
+    #if $normalization.use_genome_size:
+    --genome_size $normalization.genome_size_table
+    #end if
+
+    **Visualization of comparative clustering**
+
+    Visualization is created from two output files of the RepeatExplorer pipeline.
+
+    The input file CLUSTER_TABLE.csv contains the automatic annotation, information about cluster sizes, and the total number of reads used for the analysis.
+    Example of CLUSTER_TABLE.csv: ::
+
+        "Number_of_reads_in_clusters"	3002
+        "Number_of_clusters"	895
+        "Number_of_superclusters"	895
+        "Number_of_singlets"	6998
+
+        "Number_of_analyzed_reads"	10000
+
+        "Cluster"	"Supercluster"	"Size"	"Size_adjusted"	"Automatic_annotation"	"TAREAN_classification"	"Final_annotation"
+        1	1	61	61	"All"	"Other"
+        2	2	59	59	"All/repeat/satellite"	"Putative satellites (high confidence)"
+        3	3	45	45	"All/repeat/satellite"	"Putative satellites (low confidence)"
+        4	4	38	38	"All"	"Other"
+        5	5	32	32	"All"	"Other"
+        6	6	28	28	"All"	"Other"
+        7	7	25	25	"All"	"Other"
+        8	8	24	24	"All"	"Other"
+        9	9	23	23	"All"	"Other"
+        10	10	22	22	"All/repeat/mobile_element/Class_I/LTR/Ty3_gypsy/non-chromovirus/OTA/Tat/Ogre"	"Other"
+        11	11	20	20	"All"	"Other"
+        12	12	20	20	"All"	"Other"
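A direct command-line call of the script behind this wrapper, using the file names from a RepeatExplorer output archive (the genome-size table is optional):

    Rscript plot_comparative_clustering_summary.R \
        --cluster_table=CLUSTER_TABLE.csv \
        --comparative_counts=COMPARATIVE_ANALYSIS_COUNTS.csv \
        --output=comparative_analysis_summary.pdf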
diff -r 5376e1c9adec -r d14b68e9fd1d sampleFasta.xml
--- a/sampleFasta.xml	Fri Apr 24 08:54:30 2020 -0400
+++ b/sampleFasta.xml	Wed Apr 28 08:37:20 2021 +0000
@@ -1,5 +1,5 @@
[XML markup around these lines was lost in extraction]
-    Tool for random sampling subsets of reads from larger dataset
+    Tool for randomly sampling subsets of reads from large datasets
     seqkit
@@ -40,8 +40,9 @@
     **What it does**
-    This tools is intended to create sample of sequences from by taking 'random' sample from larger data sets.
-    Using a same seed parameter make sampling reproducible.
+    This tool randomly samples the specified number of reads from larger datasets.
+    Using the same random number generator seed with the same dataset results in sampling the same set of reads, while
+    using different seeds generates different subsets of reads.
diff -r 5376e1c9adec -r d14b68e9fd1d tmp.RData
Binary file tmp.RData has changed
diff -r 5376e1c9adec -r d14b68e9fd1d tmp/.dummy
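The wrapper delegates to seqkit; a direct equivalent of a reproducible draw (placeholder file names, flags as in current seqkit releases):

    # same -s seed + same input => same sampled reads
    seqkit sample -n 10000 -s 11 input.fasta > sample_10k.fasta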