# HG changeset patch # User iuc # Date 1575495160 18000 # Node ID ec5a925141137dd2acba012a8046e692b5b1b034 # Parent 7fc3c8704c05afa0eee963e8ce2681eadc64178f "planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/fsd commit 2f7a8781ec284b6fefe162416786d374c032e441" diff -r 7fc3c8704c05 -r ec5a92514113 fsd.py --- a/fsd.py Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd.py Wed Dec 04 16:32:40 2019 -0500 @@ -25,7 +25,7 @@ def readFileReferenceFree(file): with open(file, 'r') as dest_f: - data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype='string') + data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype=str) return(data_array) @@ -274,7 +274,7 @@ fig.suptitle('Family Size Distribution (FSD) based on families', fontsize=14) ax = fig.add_subplot(1, 1, 1) ticks = numpy.arange(1, 22, 1) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] ticks1[len(ticks1) - 1] = ">20" ax.set_xticks([], []) if rel_freq: @@ -299,7 +299,7 @@ fig2.suptitle('Family Size Distribution (FSD) based on PE reads', fontsize=14) ax2 = fig2.add_subplot(1, 1, 1) ticks = numpy.arange(1, 22) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] ticks1[len(ticks1) - 1] = ">20" reads = [] reads_rel = [] @@ -468,7 +468,7 @@ # tick labels of x axis ticks = numpy.arange(1, 22, 1) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] ticks1[len(ticks1) - 1] = ">20" plt.xticks(numpy.array(ticks), ticks1) singl = len(data_o[data_o == 1]) @@ -549,7 +549,7 @@ fig3.suptitle("{}: FSD based on PE reads".format(name_file), fontsize=14) ax2 = fig3.add_subplot(1, 1, 1) ticks = numpy.arange(1, 22) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] ticks1[len(ticks1) - 1] = ">20" reads = [] reads_rel = [] diff -r 7fc3c8704c05 -r ec5a92514113 fsd.xml --- a/fsd.xml Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd.xml Wed Dec 04 16:32:40 2019 -0500 @@ -1,21 +1,19 @@ - + Family Size Distribution of duplex sequencing tags fsd_macros.xml - - python - matplotlib - + python '$__tool_directory__/fsd.py' #for $i, $s in enumerate($series) - --inputFile${i + 1} '${s.file}' - --inputName${i + 1} '${s.file.element_identifier}' + #set $file=$s.file + --inputFile${i + 1} '${file}' + --inputName${i + 1} @ESCAPE_IDENTIFIER@ #end for -$log_axis -$rel_freq +$log_axis +$rel_freq --output_pdf '$output_pdf' --output_tabular '$output_tabular' @@ -63,18 +61,18 @@ 20 members. This information is very useful in early decision steps of the analysis parameters, such as the minimum number of PE-reads to build the single stranded consensus sequence (SSCS). Moreover, this tool can compare several datasets or different steps in the analysis pipeline to monitor data loss or gain (e.g families re-united with barcode correction tool from the `Du Novo Analysis Pipeline `_). In an extension of this tool, each family is stratified into SSCS (ab/ba) and DSC and visualizes the allocation of DCSs respective to SSCS-ab and SSCS-ba. This is quite handy to better understand the relationship of SSCS to DCS per family and identify sources of bias (e.g. more SSCS to DCS in a particular family size, or more forward ab than reverse ba reads). - + **Input** - -This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands. At least one input file must be provided and up to 4 different datasets can be compared with this tool. All input files are produced in the same way (see section **How to generate the input**):: + +This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands. At least one input file must be provided and up to 4 different datasets can be compared with this tool. All input files are produced in the same way (see section **How to generate the input**):: 1 2 3 ----------------------------- 1 AAAAAAAAAAAAAAAAATGGTATG ba 3 AAAAAAAAAAAAAATGGTATGGAC ab - + .. class:: infomark **How to generate the input** @@ -90,7 +88,7 @@ We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool:: - 1 2 + 1 2 --------------------------- AAAAAAAAAAAAAAATAGCTCGAT ab AAAAAAAAAAAAAAATAGCTCGAT ab @@ -99,9 +97,9 @@ Next, the tags are sorted in ascending or descending order using the **Sort** tool:: - 1 2 + 1 2 --------------------------- - AAAAAAAAAAAAAAAAATGGTATG ba + AAAAAAAAAAAAAAAAATGGTATG ba AAAAAAAAAAAAAAATAGCTCGAT ab AAAAAAAAAAAAAAATAGCTCGAT ab AAAAAAAAAAAAAAATAGCTCGAT ab @@ -114,15 +112,15 @@ 3 AAAAAAAAAAAAAATGGTATGGAC ab These data can now be used in this tool. - + **Output** - + The output is a PDF file with a plot and a summary of the data in the plot. The tool analyzes the family size distribution, that is the number of PE-reads per tag or family. This information is presented graphically as a histogram. The output can interpreted the following way: The **first part** of the output from this tool is “FSD after tags” and “FSD after PE reads” depending whether the data is compared to the total number of unique tags/families or the total number of PE-reads. The data is presented in a histogram showing the frequency of family sizes (FS) ranging from FS=1 to FS>20 members in a step-wise manner. Multiple datasets or steps of an analysis can be merged in one histogram (e.g. tags before and after barcode correction). The family size distribution gives the user an insight into the allocation of DCS in the libary; information that can be used to decide on the optimal family size parameter for building SSCSs. The **second part** of the output shows the family size distribution (FSD) for each of the datasets in more detail. The tags are categorized based on whether they form DCS (duplex = complementary tags in the ab and ba direction) or form only SSCS-ab or SSCS-ba with no matching complement. -]]> +]]> diff -r 7fc3c8704c05 -r ec5a92514113 fsd_beforevsafter.py --- a/fsd_beforevsafter.py Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd_beforevsafter.py Wed Dec 04 16:32:40 2019 -0500 @@ -27,7 +27,7 @@ def readFileReferenceFree(file, delim): with open(file, 'r') as dest_f: - data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype='string') + data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype=str) return data_array @@ -202,11 +202,11 @@ counts = plt.hist(list1, bins=range(-1, maximumX + 1), stacked=False, label=labels, color=colors, align="left", alpha=1, edgecolor="black", linewidth=1) ticks = numpy.arange(0, maximumX, 1) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] ticks1[len(ticks1) - 1] = ">20" plt.xticks(numpy.array(ticks), ticks1) if ref_genome is not None: - count = numpy.array([v for k, v in sorted(Counter(quant_ab_ref).iteritems())]) # count all family sizes from all ab strands + count = numpy.array([v for k, v in sorted(Counter(quant_ab_ref).items())]) # count all family sizes from all ab strands legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)" plt.text(0.1, 0.085, legend, size=11, transform=plt.gcf().transFigure) @@ -216,7 +216,7 @@ plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure) count2 = numpy.array( - [v for k, v in sorted(Counter(quant_ba_ref).iteritems())]) # count all family sizes from all ba strands + [v for k, v in sorted(Counter(quant_ba_ref).items())]) # count all family sizes from all ba strands legend = "BA\n{}\n{}\n{:.5f}" \ .format(max(quant_ba_ref), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2)) plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure) diff -r 7fc3c8704c05 -r ec5a92514113 fsd_beforevsafter.xml --- a/fsd_beforevsafter.xml Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd_beforevsafter.xml Wed Dec 04 16:32:40 2019 -0500 @@ -1,19 +1,17 @@ - + Family Size Distribution of duplex sequencing tags during Du Novo analysis fsd_macros.xml - - python - matplotlib + biopython pysam - + -python '$__tool_directory__/fsd_beforevsafter.py' ---inputFile_SSCS '$file1' ---inputName1 '$file1.element_identifier' +python '$__tool_directory__/fsd_beforevsafter.py' +--inputFile_SSCS '$file' +--inputName1 @ESCAPE_IDENTIFIER@ --makeDCS '$makeDCS' #if $afterTrimming: --afterTrimming '$afterTrimming' @@ -21,11 +19,11 @@ #if $bamFile: --bamFile '$bamFile' #end if ---output_pdf '$output_pdf' +--output_pdf '$output_pdf' --output_tabular '$output_tabular' - + @@ -36,7 +34,7 @@ - + @@ -47,18 +45,18 @@ `_. **Input** - -**Dataset 1:** This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.:: + +**Dataset 1:** This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.:: 1 2 3 ----------------------------- 1 AAAAAAAAAAAAAAAAATGGTATG ba 3 AAAAAAAAAAAAAATGGTATGGAC ab - + .. class:: infomark **How to generate the input** @@ -74,7 +72,7 @@ We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool:: - 1 2 + 1 2 --------------------------- AAAAAAAAAAAAAAATAGCTCGAT ab AAAAAAAAAAAAAAATAGCTCGAT ab @@ -83,7 +81,7 @@ Next, the tags are sorted in ascending or descending order using the **Sort** tool:: - 1 2 + 1 2 --------------------------- AAAAAAAAAAAAAAAAATGGTATG ba AAAAAAAAAAAAAAATAGCTCGAT ab @@ -98,25 +96,25 @@ 3 AAAAAAAAAAAAAATGGTATGGAC ab These data can now be used in this tool. - + **Dataset 2:** A fasta file is required with all tags and the associated family size of both strands (forward and reverse) in the header and the read itself in the next line. This input file can be obtained by the `Du Novo Analysis Pipeline `_ with the tool **Make consensus reads**. **Dataset 3 (optional):** In addition, the fasta file with all tags, which were not filtered out after trimming, can be included. This file can also be obtained by the `Du Novo Analysis Pipeline` with the tool **Sequence Content Trimmer**. For both input files (dataset 2 and 3), only the DCSs of one mate are necessary these tools give information on both mates in the output file), since both files have the same tags and family sizes, but different DCSs, which are not required in this tool:: - + >AAAAAAAAATAGATCATAGACTCT 7-10 CTAGACTCACTGGCGTTACTGACTGCGAGACCCTCCACGTG >AAAAAAAAGGCAGAAGATATACGC 11-3 CNCNGGCCCCCCGCTCCGTGCACAGACGNNGCNACTGACAA - + **Dataset 4 (optional):** BAM file of the aligned reads. This file can be obtained by the tool `Map with BWA-MEM `_. **Output** - + The output is a PDF file with a plot and a summary of the data of the plots. The tool compares various datasets from the `Du Novo Analysis Pipeline `_ and helps in decision making of various parameters (e.g family size, minimum read length, etc). For example: Re-running trimming with different parameters allows to recover reads that would be lost due to stringent filtering by read length. This tool also allows to assess reads on target. The tool extracts the tags of reads and their family sizes before SSCS building, after DCS building, after trimming and finally after the alignment to the reference genome. In the plot, the family sizes for both SSCS-ab and SSCS-ba are shown; whereas the summary represents only counts either of SSCS-ab or of SSCS-ba. -]]> +]]> diff -r 7fc3c8704c05 -r ec5a92514113 fsd_macros.xml --- a/fsd_macros.xml Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd_macros.xml Wed Dec 04 16:32:40 2019 -0500 @@ -1,4 +1,5 @@ + @@ -9,5 +10,12 @@ } - - \ No newline at end of file + + + + + matplotlib + + + + diff -r 7fc3c8704c05 -r ec5a92514113 fsd_regions.py --- a/fsd_regions.py Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd_regions.py Wed Dec 04 16:32:40 2019 -0500 @@ -1,270 +1,270 @@ -#!/usr/bin/env python - -# Family size distribution of tags which were aligned to the reference genome -# -# Author: Monika Heinzl & Gundula Povysil, Johannes-Kepler University Linz (Austria) -# Contact: monika.heinzl@edumail.at -# -# Takes at least one TABULAR file with tags before the alignment to the SSCS, -# a BAM file with tags of reads that overlap the regions of the reference genome and -# an optional BED file with chromosome, start and stop position of the regions as input. -# The program produces a plot which shows the distribution of family sizes of the tags from the input files and -# a tabular file with the data of the plot. - -# USAGE: python FSD_regions.py --inputFile filenameSSCS --inputName1 filenameSSCS -# --bamFile DCSbamFile --rangesFile BEDfile --output_tabular outptufile_name_tabular -# --output_pdf outputfile_name_pdf - -import argparse -import collections -import re -import sys - -import matplotlib.pyplot as plt -import numpy as np -import pysam -from matplotlib.backends.backend_pdf import PdfPages - -plt.switch_backend('agg') - - -def readFileReferenceFree(file, delim): - with open(file, 'r') as dest_f: - data_array = np.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype='string') - return data_array - - -def make_argparser(): - parser = argparse.ArgumentParser(description='Family Size Distribution of tags which were aligned to regions of the reference genome') - parser.add_argument('--inputFile', help='Tabular File with three columns: ab or ba, tag and family size.') - parser.add_argument('--inputName1') - parser.add_argument('--bamFile', help='BAM file with aligned reads.') - parser.add_argument('--rangesFile', default=None, help='BED file with chromosome, start and stop positions.') - parser.add_argument('--output_pdf', default="data.pdf", type=str, help='Name of the pdf and tabular file.') - parser.add_argument('--output_tabular', default="data.tabular", type=str, help='Name of the pdf and tabular file.') - return parser - - -def compare_read_families_refGenome(argv): - parser = make_argparser() - args = parser.parse_args(argv[1:]) - - firstFile = args.inputFile - name1 = args.inputName1 - name1 = name1.split(".tabular")[0] - bamFile = args.bamFile - - rangesFile = args.rangesFile - title_file = args.output_pdf - title_file2 = args.output_tabular - sep = "\t" - - with open(title_file2, "w") as output_file, PdfPages(title_file) as pdf: - data_array = readFileReferenceFree(firstFile, "\t") - pysam.index(bamFile) - - bam = pysam.AlignmentFile(bamFile, "rb") - qname_dict = collections.OrderedDict() - - if rangesFile is not None: - with open(rangesFile, 'r') as regs: - range_array = np.genfromtxt(regs, skip_header=0, delimiter='\t', comments='#', dtype='string') - - if range_array.ndim == 0: - print("Error: file has 0 lines") - exit(2) - - if range_array.ndim == 1: - chrList = range_array[0] - start_posList = range_array[1].astype(int) - stop_posList = range_array[2].astype(int) - chrList = [chrList.tolist()] - start_posList = [start_posList.tolist()] - stop_posList = [stop_posList.tolist()] - else: - chrList = range_array[:, 0] - start_posList = range_array[:, 1].astype(int) - stop_posList = range_array[:, 2].astype(int) - - if len(start_posList) != len(stop_posList): - print("start_positions and end_positions do not have the same length") - exit(3) - - chrList = np.array(chrList) - start_posList = np.array(start_posList).astype(int) - stop_posList = np.array(stop_posList).astype(int) - for chr, start_pos, stop_pos in zip(chrList, start_posList, stop_posList): - chr_start_stop = "{}_{}_{}".format(chr, start_pos, stop_pos) - qname_dict[chr_start_stop] = [] - for read in bam.fetch(chr.tobytes(), start_pos, stop_pos): - if not read.is_unmapped: - if re.search('_', read.query_name): - tags = re.split('_', read.query_name)[0] - else: - tags = read.query_name - qname_dict[chr_start_stop].append(tags) - - else: - for read in bam.fetch(): - if not read.is_unmapped: - if re.search(r'_', read.query_name): - tags = re.split('_', read.query_name)[0] - else: - tags = read.query_name - - if read.reference_name not in qname_dict.keys(): - qname_dict[read.reference_name] = [tags] - else: - qname_dict[read.reference_name].append(tags) - - seq = np.array(data_array[:, 1]) - tags = np.array(data_array[:, 2]) - quant = np.array(data_array[:, 0]).astype(int) - group = np.array(qname_dict.keys()) - - all_ab = seq[np.where(tags == "ab")[0]] - all_ba = seq[np.where(tags == "ba")[0]] - quant_ab = quant[np.where(tags == "ab")[0]] - quant_ba = quant[np.where(tags == "ba")[0]] - - seqDic_ab = dict(zip(all_ab, quant_ab)) - seqDic_ba = dict(zip(all_ba, quant_ba)) - - lst_ab = [] - lst_ba = [] - quantAfterRegion = [] - length_regions = 0 - for i in group: - lst_ab_r = [] - lst_ba_r = [] - seq_mut = qname_dict[i] - if rangesFile is None: - seq_mut, seqMut_index = np.unique(np.array(seq_mut), return_index=True) - length_regions = length_regions + len(seq_mut) * 2 - for r in seq_mut: - count_ab = seqDic_ab.get(r) - count_ba = seqDic_ba.get(r) - lst_ab_r.append(count_ab) - lst_ab.append(count_ab) - lst_ba_r.append(count_ba) - lst_ba.append(count_ba) - - dataAB = np.array(lst_ab_r) - dataBA = np.array(lst_ba_r) - bigFamilies = np.where(dataAB > 20)[0] - dataAB[bigFamilies] = 22 - bigFamilies = np.where(dataBA > 20)[0] - dataBA[bigFamilies] = 22 - - quantAll = np.concatenate((dataAB, dataBA)) - quantAfterRegion.append(quantAll) - - quant_ab = np.array(lst_ab) - quant_ba = np.array(lst_ba) - - maximumX = np.amax(np.concatenate(quantAfterRegion)) - minimumX = np.amin(np.concatenate(quantAfterRegion)) - - # PLOT - plt.rc('figure', figsize=(11.69, 8.27)) # A4 format - plt.rcParams['axes.facecolor'] = "E0E0E0" # grey background color - plt.rcParams['xtick.labelsize'] = 14 - plt.rcParams['ytick.labelsize'] = 14 - plt.rcParams['patch.edgecolor'] = "black" - fig = plt.figure() - plt.subplots_adjust(bottom=0.3) - - colors = ["#6E6E6E", "#0431B4", "#5FB404", "#B40431", "#F4FA58", "#DF7401", "#81DAF5"] - - col = [] - for i in range(0, len(group)): - col.append(colors[i]) - - counts = plt.hist(quantAfterRegion, bins=range(minimumX, maximumX + 1), stacked=False, label=group, - align="left", alpha=1, color=col, edgecolor="black", linewidth=1) - ticks = np.arange(minimumX - 1, maximumX, 1) - - ticks1 = map(str, ticks) - ticks1[len(ticks1) - 1] = ">20" - plt.xticks(np.array(ticks), ticks1) - count = np.bincount(map(int, quant_ab)) # original counts - - legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)" - plt.text(0.15, 0.085, legend, size=11, transform=plt.gcf().transFigure) - - legend = "AB\n{}\n{}\n{:.5f}\n\n{:,}".format(max(map(int, quant_ab)), count[len(count) - 1], float(count[len(count) - 1]) / sum(count), sum(np.array(data_array[:, 0]).astype(int))) - plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure) - - count2 = np.bincount(map(int, quant_ba)) # original counts - - legend = "BA\n{}\n{}\n{:.5f}" \ - .format(max(map(int, quant_ba)), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2)) - plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure) - - plt.text(0.55, 0.2125, "total nr. of tags:", size=11, transform=plt.gcf().transFigure) - plt.text(0.8, 0.2125, "{:,} ({:,})".format(length_regions, length_regions / 2), size=11, - transform=plt.gcf().transFigure) - - legend4 = "* In the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n" - plt.text(0.1, 0.01, legend4, size=11, transform=plt.gcf().transFigure) - - space = 0 - for i, count in zip(group, quantAfterRegion): - plt.text(0.55, 0.15 - space, "{}:\n".format(i), size=11, transform=plt.gcf().transFigure) - plt.text(0.8, 0.15 - space, "{:,}\n".format(len(count) / 2), size=11, transform=plt.gcf().transFigure) - space = space + 0.02 - - plt.legend(loc='upper right', fontsize=14, bbox_to_anchor=(0.9, 1), frameon=True) - plt.xlabel("Family size", fontsize=14) - plt.ylabel("Absolute Frequency", fontsize=14) - plt.grid(b=True, which="major", color="#424242", linestyle=":") - plt.margins(0.01, None) - - pdf.savefig(fig, bbox_inch="tight") - plt.close() - - output_file.write("Dataset:{}{}\n".format(sep, name1)) - output_file.write("{}AB{}BA\n".format(sep, sep)) - output_file.write("max. family size:{}{}{}{}\n".format(sep, max(map(int, quant_ab)), sep, max(map(int, quant_ba)))) - output_file.write("absolute frequency:{}{}{}{}\n".format(sep, count[len(count) - 1], sep, count2[len(count2) - 1])) - output_file.write("relative frequency:{}{:.3f}{}{:.3f}\n\n".format(sep, float(count[len(count) - 1]) / sum(count), sep, float(count2[len(count2) - 1]) / sum(count2))) - output_file.write("total nr. of reads{}{}\n".format(sep, sum(np.array(data_array[:, 0]).astype(int)))) - output_file.write("total nr. of tags{}{} ({})\n".format(sep, length_regions, length_regions / 2)) - output_file.write("\n\nValues from family size distribution\n") - output_file.write("{}".format(sep)) - for i in group: - output_file.write("{}{}".format(i, sep)) - output_file.write("\n") - - j = 0 - for fs in counts[1][0:len(counts[1]) - 1]: - if fs == 21: - fs = ">20" - else: - fs = "={}".format(fs) - output_file.write("FS{}{}".format(fs, sep)) - if len(group) == 1: - output_file.write("{}{}".format(int(counts[0][j]), sep)) - else: - for n in range(len(group)): - output_file.write("{}{}".format(int(counts[0][n][j]), sep)) - output_file.write("\n") - j += 1 - output_file.write("sum{}".format(sep)) - - if len(group) == 1: - output_file.write("{}{}".format(int(sum(counts[0])), sep)) - else: - for i in counts[0]: - output_file.write("{}{}".format(int(sum(i)), sep)) - output_file.write("\n") - output_file.write("\n\nIn the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n") - output_file.write("Region{}total nr. of tags per region\n".format(sep)) - for i, count in zip(group, quantAfterRegion): - output_file.write("{}{}{}\n".format(i, sep, len(count) / 2)) - - print("Files successfully created!") - - -if __name__ == '__main__': - sys.exit(compare_read_families_refGenome(sys.argv)) +#!/usr/bin/env python + +# Family size distribution of tags which were aligned to the reference genome +# +# Author: Monika Heinzl & Gundula Povysil, Johannes-Kepler University Linz (Austria) +# Contact: monika.heinzl@edumail.at +# +# Takes at least one TABULAR file with tags before the alignment to the SSCS, +# a BAM file with tags of reads that overlap the regions of the reference genome and +# an optional BED file with chromosome, start and stop position of the regions as input. +# The program produces a plot which shows the distribution of family sizes of the tags from the input files and +# a tabular file with the data of the plot. + +# USAGE: python FSD_regions.py --inputFile filenameSSCS --inputName1 filenameSSCS +# --bamFile DCSbamFile --rangesFile BEDfile --output_tabular outptufile_name_tabular +# --output_pdf outputfile_name_pdf + +import argparse +import collections +import re +import sys + +import matplotlib.pyplot as plt +import numpy as np +import pysam +from matplotlib.backends.backend_pdf import PdfPages + +plt.switch_backend('agg') + + +def readFileReferenceFree(file, delim): + with open(file, 'r') as dest_f: + data_array = np.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype=str) + return data_array + + +def make_argparser(): + parser = argparse.ArgumentParser(description='Family Size Distribution of tags which were aligned to regions of the reference genome') + parser.add_argument('--inputFile', help='Tabular File with three columns: ab or ba, tag and family size.') + parser.add_argument('--inputName1') + parser.add_argument('--bamFile', help='BAM file with aligned reads.') + parser.add_argument('--rangesFile', default=None, help='BED file with chromosome, start and stop positions.') + parser.add_argument('--output_pdf', default="data.pdf", type=str, help='Name of the pdf and tabular file.') + parser.add_argument('--output_tabular', default="data.tabular", type=str, help='Name of the pdf and tabular file.') + return parser + + +def compare_read_families_refGenome(argv): + parser = make_argparser() + args = parser.parse_args(argv[1:]) + + firstFile = args.inputFile + name1 = args.inputName1 + name1 = name1.split(".tabular")[0] + bamFile = args.bamFile + + rangesFile = args.rangesFile + title_file = args.output_pdf + title_file2 = args.output_tabular + sep = "\t" + + with open(title_file2, "w") as output_file, PdfPages(title_file) as pdf: + data_array = readFileReferenceFree(firstFile, "\t") + pysam.index(bamFile) + + bam = pysam.AlignmentFile(bamFile, "rb") + qname_dict = collections.OrderedDict() + + if rangesFile is not None: + with open(rangesFile, 'r') as regs: + range_array = np.genfromtxt(regs, skip_header=0, delimiter='\t', comments='#', dtype=str) + + if range_array.ndim == 0: + print("Error: file has 0 lines") + exit(2) + + if range_array.ndim == 1: + chrList = range_array[0] + start_posList = range_array[1].astype(int) + stop_posList = range_array[2].astype(int) + chrList = [chrList.tolist()] + start_posList = [start_posList.tolist()] + stop_posList = [stop_posList.tolist()] + else: + chrList = range_array[:, 0] + start_posList = range_array[:, 1].astype(int) + stop_posList = range_array[:, 2].astype(int) + + if len(start_posList) != len(stop_posList): + print("start_positions and end_positions do not have the same length") + exit(3) + + chrList = np.array(chrList) + start_posList = np.array(start_posList).astype(int) + stop_posList = np.array(stop_posList).astype(int) + for chr, start_pos, stop_pos in zip(chrList, start_posList, stop_posList): + chr_start_stop = "{}_{}_{}".format(chr, start_pos, stop_pos) + qname_dict[chr_start_stop] = [] + for read in bam.fetch(chr, start_pos, stop_pos): + if not read.is_unmapped: + if re.search('_', read.query_name): + tags = re.split('_', read.query_name)[0] + else: + tags = read.query_name + qname_dict[chr_start_stop].append(tags) + + else: + for read in bam.fetch(): + if not read.is_unmapped: + if re.search(r'_', read.query_name): + tags = re.split('_', read.query_name)[0] + else: + tags = read.query_name + + if read.reference_name not in qname_dict: + qname_dict[read.reference_name] = [tags] + else: + qname_dict[read.reference_name].append(tags) + + seq = np.array(data_array[:, 1]) + tags = np.array(data_array[:, 2]) + quant = np.array(data_array[:, 0]).astype(int) + group = np.array(list(qname_dict.keys())) + + all_ab = seq[np.where(tags == "ab")[0]] + all_ba = seq[np.where(tags == "ba")[0]] + quant_ab = quant[np.where(tags == "ab")[0]] + quant_ba = quant[np.where(tags == "ba")[0]] + + seqDic_ab = dict(zip(all_ab, quant_ab)) + seqDic_ba = dict(zip(all_ba, quant_ba)) + + lst_ab = [] + lst_ba = [] + quantAfterRegion = [] + length_regions = 0 + for i in group: + lst_ab_r = [] + lst_ba_r = [] + seq_mut = qname_dict[i] + if rangesFile is None: + seq_mut, seqMut_index = np.unique(np.array(seq_mut), return_index=True) + length_regions = length_regions + len(seq_mut) * 2 + for r in seq_mut: + count_ab = seqDic_ab.get(r) + count_ba = seqDic_ba.get(r) + lst_ab_r.append(count_ab) + lst_ab.append(count_ab) + lst_ba_r.append(count_ba) + lst_ba.append(count_ba) + + dataAB = np.array(lst_ab_r) + dataBA = np.array(lst_ba_r) + bigFamilies = np.where(dataAB > 20)[0] + dataAB[bigFamilies] = 22 + bigFamilies = np.where(dataBA > 20)[0] + dataBA[bigFamilies] = 22 + + quantAll = np.concatenate((dataAB, dataBA)) + quantAfterRegion.append(quantAll) + + quant_ab = np.array(lst_ab) + quant_ba = np.array(lst_ba) + + maximumX = np.amax(np.concatenate(quantAfterRegion)) + minimumX = np.amin(np.concatenate(quantAfterRegion)) + + # PLOT + plt.rc('figure', figsize=(11.69, 8.27)) # A4 format + plt.rcParams['axes.facecolor'] = "E0E0E0" # grey background color + plt.rcParams['xtick.labelsize'] = 14 + plt.rcParams['ytick.labelsize'] = 14 + plt.rcParams['patch.edgecolor'] = "black" + fig = plt.figure() + plt.subplots_adjust(bottom=0.3) + + colors = ["#6E6E6E", "#0431B4", "#5FB404", "#B40431", "#F4FA58", "#DF7401", "#81DAF5"] + + col = [] + for i in range(0, len(group)): + col.append(colors[i]) + + counts = plt.hist(quantAfterRegion, bins=range(minimumX, maximumX + 1), stacked=False, label=group, + align="left", alpha=1, color=col, edgecolor="black", linewidth=1) + ticks = np.arange(minimumX - 1, maximumX, 1) + + ticks1 = [str(_) for _ in ticks] + ticks1[len(ticks1) - 1] = ">20" + plt.xticks(np.array(ticks), ticks1) + count = np.bincount([int(_) for _ in quant_ab]) # original counts + + legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)" + plt.text(0.15, 0.085, legend, size=11, transform=plt.gcf().transFigure) + + legend = "AB\n{}\n{}\n{:.5f}\n\n{:,}".format(max(map(int, quant_ab)), count[len(count) - 1], float(count[len(count) - 1]) / sum(count), sum(np.array(data_array[:, 0]).astype(int))) + plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure) + + count2 = np.bincount([int(_) for _ in quant_ba]) # original counts + + legend = "BA\n{}\n{}\n{:.5f}" \ + .format(max(map(int, quant_ba)), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2)) + plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure) + + plt.text(0.55, 0.2125, "total nr. of tags:", size=11, transform=plt.gcf().transFigure) + plt.text(0.8, 0.2125, "{:,} ({:,})".format(length_regions, length_regions / 2), size=11, + transform=plt.gcf().transFigure) + + legend4 = "* In the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n" + plt.text(0.1, 0.01, legend4, size=11, transform=plt.gcf().transFigure) + + space = 0 + for i, count in zip(group, quantAfterRegion): + plt.text(0.55, 0.15 - space, "{}:\n".format(i), size=11, transform=plt.gcf().transFigure) + plt.text(0.8, 0.15 - space, "{:,}\n".format(len(count) / 2), size=11, transform=plt.gcf().transFigure) + space = space + 0.02 + + plt.legend(loc='upper right', fontsize=14, bbox_to_anchor=(0.9, 1), frameon=True) + plt.xlabel("Family size", fontsize=14) + plt.ylabel("Absolute Frequency", fontsize=14) + plt.grid(b=True, which="major", color="#424242", linestyle=":") + plt.margins(0.01, None) + + pdf.savefig(fig, bbox_inch="tight") + plt.close() + + output_file.write("Dataset:{}{}\n".format(sep, name1)) + output_file.write("{}AB{}BA\n".format(sep, sep)) + output_file.write("max. family size:{}{}{}{}\n".format(sep, max(map(int, quant_ab)), sep, max(map(int, quant_ba)))) + output_file.write("absolute frequency:{}{}{}{}\n".format(sep, count[len(count) - 1], sep, count2[len(count2) - 1])) + output_file.write("relative frequency:{}{:.3f}{}{:.3f}\n\n".format(sep, float(count[len(count) - 1]) / sum(count), sep, float(count2[len(count2) - 1]) / sum(count2))) + output_file.write("total nr. of reads{}{}\n".format(sep, sum(np.array(data_array[:, 0]).astype(int)))) + output_file.write("total nr. of tags{}{} ({})\n".format(sep, length_regions, length_regions / 2)) + output_file.write("\n\nValues from family size distribution\n") + output_file.write("{}".format(sep)) + for i in group: + output_file.write("{}{}".format(i, sep)) + output_file.write("\n") + + j = 0 + for fs in counts[1][0:len(counts[1]) - 1]: + if fs == 21: + fs = ">20" + else: + fs = "={}".format(fs) + output_file.write("FS{}{}".format(fs, sep)) + if len(group) == 1: + output_file.write("{}{}".format(int(counts[0][j]), sep)) + else: + for n in range(len(group)): + output_file.write("{}{}".format(int(counts[0][n][j]), sep)) + output_file.write("\n") + j += 1 + output_file.write("sum{}".format(sep)) + + if len(group) == 1: + output_file.write("{}{}".format(int(sum(counts[0])), sep)) + else: + for i in counts[0]: + output_file.write("{}{}".format(int(sum(i)), sep)) + output_file.write("\n") + output_file.write("\n\nIn the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n") + output_file.write("Region{}total nr. of tags per region\n".format(sep)) + for i, count in zip(group, quantAfterRegion): + output_file.write("{}{}{}\n".format(i, sep, len(count) / 2)) + + print("Files successfully created!") + + +if __name__ == '__main__': + sys.exit(compare_read_families_refGenome(sys.argv)) diff -r 7fc3c8704c05 -r ec5a92514113 fsd_regions.xml --- a/fsd_regions.xml Thu Oct 24 09:36:09 2019 -0400 +++ b/fsd_regions.xml Wed Dec 04 16:32:40 2019 -0500 @@ -1,27 +1,25 @@ - + Family size distribution of user-specified regions in the reference genome fsd_macros.xml - - python - matplotlib + pysam - + -python '$__tool_directory__/fsd_regions.py' ---inputFile '$file1' ---inputName1 '$file1.element_identifier' +python '$__tool_directory__/fsd_regions.py' +--inputFile '$file' +--inputName1 @ESCAPE_IDENTIFIER@ --bamFile '$file2' #if $file3: --rangesFile '$file3' #end if ---output_pdf '$output_pdf' +--output_pdf '$output_pdf' --output_tabular '$output_tabular' - + @@ -31,11 +29,11 @@ - + - + + ]]> diff -r 7fc3c8704c05 -r ec5a92514113 td.py --- a/td.py Thu Oct 24 09:36:09 2019 -0400 +++ b/td.py Wed Dec 04 16:32:40 2019 -0500 @@ -18,7 +18,6 @@ # --nr_above_bars True/False --output_tabular outptufile_name_tabular import argparse -import itertools import operator import sys from collections import Counter, defaultdict @@ -70,7 +69,7 @@ plt.suptitle(subtitle, y=1, x=0.5, fontsize=14) plt.xlabel("Family size", fontsize=14) ticks = numpy.arange(0, maximumXFS + 1, 1) - ticks1 = map(str, ticks) + ticks1 = [str(_) for _ in ticks] if maximumXFS >= 20: ticks1[len(ticks1) - 1] = ">=20" plt.xticks(numpy.array(ticks), ticks1) @@ -102,7 +101,7 @@ fig = plt.figure(figsize=(6, 8)) plt.subplots_adjust(bottom=0.1) - p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).iteritems())]) + p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).items())]) maximumY = numpy.amax(p1) if relative is True: # relative difference bin1 = numpy.arange(-1, maximumX + 0.2, 0.1) @@ -129,7 +128,7 @@ plt.ylim((0, maximumY * 1.2)) plt.ylabel("Absolute Frequency", fontsize=14) bins = counts[1] # width of bins - counts = numpy.array(map(int, counts[0][5])) + counts = numpy.array([int(_) for _ in counts[0][5]]) plt.legend(loc='upper right', fontsize=14, frameon=True, bbox_to_anchor=(1.45, 1)) plt.suptitle(subtitle, y=1, x=0.5, fontsize=14) @@ -177,7 +176,7 @@ step = 1 fig = plt.figure(figsize=(6, 8)) plt.subplots_adjust(bottom=0.1) - p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).iteritems())]) + p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).items())]) maximumY = numpy.amax(p1) bin1 = maximumX + 1 if rel_freq: @@ -188,7 +187,7 @@ plt.ylim((0, 1.07)) plt.ylabel("Relative Frequency", fontsize=14) bins = counts[1] # width of bins - counts = numpy.array(map(float, counts[0][2])) + counts = numpy.array([float(_) for _ in counts[0][2]]) else: counts = plt.hist(list1, bins=bin1, edgecolor='black', linewidth=1, @@ -197,7 +196,7 @@ plt.ylim((0, maximumY * 1.2)) plt.ylabel("Absolute Frequency", fontsize=14) bins = counts[1] # width of bins - counts = numpy.array(map(int, counts[0][2])) + counts = numpy.array([int(_) for _ in counts[0][2]]) plt.legend(loc='upper right', fontsize=14, frameon=True, bbox_to_anchor=(1.45, 1)) plt.suptitle(subtitle, y=1, x=0.5, fontsize=14) @@ -581,7 +580,7 @@ i = 0 array2 = numpy.unique(array2) # remove duplicate sequences to decrease running time for a in array1: - dist = numpy.array([sum(itertools.imap(operator.ne, a, b)) for b in array2]) # fastest + dist = numpy.array([sum(map(operator.ne, a, b)) for b in array2]) # fastest res[i] = numpy.amin(dist[dist > 0]) # pick min distance greater than zero i += 1 return res @@ -589,10 +588,10 @@ def hamming_difference(array1, array2, mate_b): array2 = numpy.unique(array2) # remove duplicate sequences to decrease running time - array1_half = numpy.array([i[0:(len(i)) / 2] for i in array1]) # mate1 part1 - array1_half2 = numpy.array([i[len(i) / 2:len(i)] for i in array1]) # mate1 part 2 - array2_half = numpy.array([i[0:(len(i)) / 2] for i in array2]) # mate2 part1 - array2_half2 = numpy.array([i[len(i) / 2:len(i)] for i in array2]) # mate2 part2 + array1_half = numpy.array([i[0:int(len(i) / 2)] for i in array1]) # mate1 part1 + array1_half2 = numpy.array([i[int(len(i) / 2):len(i)] for i in array1]) # mate1 part 2 + array2_half = numpy.array([i[0:int(len(i) / 2)] for i in array2]) # mate2 part1 + array2_half2 = numpy.array([i[int(len(i) / 2):len(i)] for i in array2]) # mate2 part2 diff11 = [] relativeDiffList = [] @@ -628,7 +627,7 @@ array2_half2_withoutSame = half2_mate2[index_withoutSame] array2_withoutSame = array2[index_withoutSame] # whole tag (=not splitted into 2 halfs) # calculate HD of "a" in the tag to all "a's" or "b" in the tag to all "b's" - dist = numpy.array([sum(itertools.imap(operator.ne, a, c)) for c in + dist = numpy.array([sum(map(operator.ne, a, c)) for c in array2_half_withoutSame]) min_index = numpy.where(dist == dist.min())[0] # get index of min HD min_value = dist.min() @@ -636,7 +635,7 @@ min_tag_half2 = array2_half2_withoutSame[min_index] min_tag_array2 = array2_withoutSame[min_index] # get whole tag with min HD - dist_second_half = numpy.array([sum(itertools.imap(operator.ne, b, e)) for e in + dist_second_half = numpy.array([sum(map(operator.ne, b, e)) for e in min_tag_half2]) # calculate HD of "b" to all "b's" or "a" to all "a's" max_value = dist_second_half.max() max_index = numpy.where(dist_second_half == dist_second_half.max())[0] # get index of max HD @@ -674,7 +673,7 @@ def readFileReferenceFree(file): with open(file, 'r') as dest_f: - data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype='string') + data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype=str) integers = numpy.array(data_array[:, 0]).astype(int) return (integers, data_array) @@ -948,8 +947,8 @@ # HD analysis for a subset of the tag if subset > 0: - tag1 = numpy.array([i[0:(len(i)) / 2] for i in data_array[:, 1]]) - tag2 = numpy.array([i[len(i) / 2:len(i)] for i in data_array[:, 1]]) + tag1 = numpy.array([i[0:int(len(i) / 2)] for i in data_array[:, 1]]) + tag2 = numpy.array([i[int(len(i) / 2):len(i)] for i in data_array[:, 1]]) flanking_region_float = float((len(tag1[0]) - subset)) / 2 flanking_region = int(flanking_region_float) @@ -1068,8 +1067,8 @@ if tag1 in checked_tags: # skip tag if already written to file continue - sample_half_a = tag1[0:(len(tag1)) / 2] - sample_half_b = tag1[len(tag1) / 2:len(tag1)] + sample_half_a = tag1[0:int(len(tag1) / 2)] + sample_half_b = tag1[int(len(tag1) / 2):len(tag1)] max_tags = data_chimeraAnalysis[i, 1] if len(max_tags) > 1 and len(max_tags) != len(data_array[0, 1]) and type(max_tags) is not numpy.ndarray: @@ -1079,8 +1078,8 @@ info_maxTags = [data_array[data_array[:, 1] == t, :] for t in max_tags] - chimera_half_a = numpy.array([t[0:(len(t)) / 2] for t in max_tags]) # mate1 part1 - chimera_half_b = numpy.array([t[len(t) / 2:len(t)] for t in max_tags]) # mate1 part 2 + chimera_half_a = numpy.array([t[0:int(len(t) / 2)] for t in max_tags]) # mate1 part1 + chimera_half_b = numpy.array([t[int(len(t) / 2):len(t)] for t in max_tags]) # mate1 part 2 new_format = [] for j in range(len(max_tags)): @@ -1268,7 +1267,7 @@ "For simplicity, we used the maximum value of the relative differences and the respective delta HD.\n" "Note that when only tags that can form a DCS are included in the analysis, the family sizes for both directions (ab and ba) of the strand will be included in the plots.\n") - output_file.write("\nlength of one half of the tag{}{}\n\n".format(sep, len(data_array[0, 1]) / 2)) + output_file.write("\nlength of one half of the tag{}{}\n\n".format(sep, int(len(data_array[0, 1]) / 2))) createFileHDwithinTag(summary9, sumCol9, overallSum9, output_file, "Tag distance of each half in the tag", sep) diff -r 7fc3c8704c05 -r ec5a92514113 td.xml --- a/td.xml Thu Oct 24 09:36:09 2019 -0400 +++ b/td.xml Wed Dec 04 16:32:40 2019 -0500 @@ -1,32 +1,29 @@ - + Tag distance analysis of duplex tags fsd_macros.xml - - python - matplotlib - + - + @@ -43,11 +40,11 @@ - + - - + + `_. In this context the Hamming distance is simply the number of differences between two tags. The tool compares in a randomly selected subset of tags (default n=1000), the difference between each tag of the subset with the tags of the complete dataset. Each tag will differ by a certain number of nucleotides with the other tags; yet the tool uses the smallest difference observed with any other tag. - + **Input** - + This tools expects a tabular file with the tags of all families, the family sizes and information about forward (ab) and reverse (ba) strands:: 1 2 3 @@ -82,7 +79,7 @@ We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool:: - 1 2 + 1 2 --------------------------- AAAAAAAAAAAAAAATAGCTCGAT ab AAAAAAAAAAAAAAATAGCTCGAT ab @@ -91,7 +88,7 @@ Next, the tags are sorted in ascending or descending order using the **Sort** tool:: - 1 2 + 1 2 --------------------------- AAAAAAAAAAAAAAAAATGGTATG ba AAAAAAAAAAAAAAATAGCTCGAT ab @@ -106,17 +103,17 @@ 3 AAAAAAAAAAAAAATGGTATGGAC ab These data can now be used in this tool. - + **Output** - + The output is one PDF file with various plots of the tag distance, a tabular file with the summarized data of the plots and a tabular file with the chimeras. The PDF file contains several pages: 1. This first page contains a graph representing the minimum tag distance (smallest number of differences) categorized after the family sizes. - + 2. The second page contains the same information as the first page, but plots the family size categorized by the minimum tag distance. - + 3. The third page contains the **first step** of the **chimera analysis**, which examines the differences between the tags at both ends of a read (a/b). Chimeras can be distinguished by carrying the same tag at one end combined with multiple different tags at the other end of a read. Here, we describe the calculation of the TDs for only one tag in detail, but the process is repeated for each tag in the sample (default n=1000). First, the tool splits the tag into its upstream and downstream part (named a and b) and compares it with all other a parts of the families in the dataset. Next, the tool estimates the sequence differences (TD) among the a parts and extracts those tags with the smallest difference (TD a.min) and calculates the TD of the b part. The tags with the largest differences are extracted to estimate the maximum TD (TD b.max). The process is repeated starting with the b part instead and estimates TD a.max and TD b.min. Next, we calculate the sum of TD a.min and TD b.max. - + 4. The fourth page contains the **second step** of the **chimera analysis**: the absolute difference (=delta TD) between the partial TDs (TD a.min & TD b.max and TD b.min & TD a.max). The partial TDs of chimeric tags are normally very different which means that multiple combinations of the same a part with different b parts are likely. But it is possible that small delta TDs occur due to a half of a tag that is identical to other halves in the data. For this purpose, the relative difference between the partial TDs is estimated in the next step. 5. The fifth page contains the **third step** of the **chimera analysis**: the relative differences of the partial TDs (=relative delta TD). These are calculated as the absolute difference between TD a.min and TD b.max equal to TD delta. Since it is not known whether the absolute difference originates due to a low and a very large TD within a tag or an identical half (TD=0), the tool estimates the relative TD delta as the ratio of the difference to the sum of the partial TDs. In a chimera, it is expected that only one end of the tag contributes to the TD of the whole tag. In other words, if the same a part is observed in combination with several different b parts, then one end will have a TD = 0. Thus, the TD difference between the parts (TD a.min - TD b.max, TD a.max - TD b.min) is the same as the sum of the parts (TD a.min + TD b.max, TD a.max + TD b.min) or the ratio of the difference to the sum (relative delta TD = TD a.min - TD b.max / (TD a.min + TD b.max or TD a.max - TD b.min / (TD a.max + TD b.min)) will equal 1 in chimeric families. Here, the maximum value of the relative delta TD and the respective delta TD (in step 4) are selected and the plot can be interpreted as the following: @@ -130,7 +127,7 @@ .. class:: infomark -**Note:** +**Note:** Chimeras can be identical in the first or second part of the tag and can have an identical TD with mutliple tags. Therefore, the second column of the output file can have multiple tag entries. The file also contains the family sizes and the direction of the read (ab, ba). The asterisks mark the identical part of the tag:: 1 2 @@ -138,7 +135,7 @@ GAAAGGGAGG GCGCTTCACG 1 ba GCAATCGACG *GCGCTTCACG* 1 ba CCCTCCCTGA GGTTCGTTAT 1 ba CGTCCTTTTC *GGTTCGTTAT* 1 ba, GCACCTCCTT *GGTTCGTTAT* 1 ba ATGCTGATCT CGAATGCATA 55 ba, 59 ab AGGTGCCGCC *CGAATGCATA* 27 ba, *ATGCTGATCT* GAATGTTTAC 1 ba - ]]> + ]]>