Mercurial > repos > mheinzl > fsd_regions

--- a/fsd_regions.py	Tue Nov 20 09:51:47 2018 -0500
+++ b/fsd_regions.py	Mon Nov 26 04:25:26 2018 -0500
@@ -71,7 +71,7 @@
         seqDic_ab = dict(zip(all_ab, quant_ab))
         seqDic_ba = dict(zip(all_ba, quant_ba))

-        if re.search(r'(\d)+_(\d)+$', str(mut_array[0,0])) is None:
+        if re.search(r'_(\d)+_(\d)+$', str(mut_array[0,0])) is None:
             seq_mut, seqMut_index = numpy.unique(numpy.array(mut_array[:, 1]), return_index=True)
             group = mut_array[seqMut_index,0]
             mut_array = mut_array[seqMut_index,:]
@@ -156,7 +156,7 @@
         for i, count in zip(groupUnique, quantAfterRegion):
             index_of_current_region = numpy.where(group == i)[0]
             plt.text(0.55, 0.14 - s, "{}=\n".format(i), size=11, transform=plt.gcf().transFigure)
-            if re.search(r'(\d)+_(\d)+$', str(mut_array[0, 0])) is None:
+            if re.search(r'_(\d)+_(\d)+$', str(mut_array[0, 0])) is None:
                 nr_tags_ab = len(numpy.unique(mut_array[index_of_current_region, 1]))
             else:
                 nr_tags_ab = len(mut_array[index_of_current_region, 1])
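
The two hunks above tighten the pattern used to test the first cell of `mut_array`. The following is a minimal sketch of how the old and new patterns differ. It assumes, as a reading of the diff rather than something the diff states, that the check is meant to recognise region labels ending in `_<start>_<stop>`. The helper name `ends_with_range` and the sample strings are hypothetical; both patterns are written here as raw strings.

```python
import re

# Pattern before the change: any "<digits>_<digits>" suffix.
OLD_PATTERN = re.compile(r'(\d)+_(\d)+$')
# Pattern after the change: the suffix must also be preceded by an underscore.
NEW_PATTERN = re.compile(r'_(\d)+_(\d)+$')


def ends_with_range(label, pattern=NEW_PATTERN):
    """Return True if the label ends in a range suffix matched by `pattern`."""
    return pattern.search(str(label)) is not None


# A label built from a name plus start and stop matches both patterns.
print(ends_with_range("ACH_TDII_5regions_90_633", OLD_PATTERN))
print(ends_with_range("ACH_TDII_5regions_90_633", NEW_PATTERN))

# A bare "start_stop" string only matches the old pattern.
print(ends_with_range("87_636", OLD_PATTERN))
print(ends_with_range("87_636", NEW_PATTERN))

# A plain tag sequence matches neither.
print(ends_with_range("AAAAAACATCCCAATAAGAAATCA", NEW_PATTERN))
```

Under this reading, the new pattern is stricter: a value consisting only of two numbers joined by an underscore no longer counts as a region label.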
--- a/fsd_regions.xml	Tue Nov 20 09:51:47 2018 -0500
+++ b/fsd_regions.xml	Mon Nov 26 04:25:26 2018 -0500
@@ -1,16 +1,17 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<tool id="fsd_regions" name="Duplex Sequencing Analysis: fsd_regions" version="1.0.0">
+<tool id="fsd_regions" name="Duplex Sequencing Analysis: fsd_regions" version="1.0.1">
     <description>Family size distribution (FSD) of user-specified regions in the reference genome</description>
     <requirements>
         <requirement type="package" version="2.7">python</requirement>
         <requirement type="package" version="1.4.0">matplotlib</requirement>
     </requirements>
     <command>
-        python2 '$__tool_directory__/fsd_regions.py' --inputFile '$file1' --inputName1 '$file1.name' --ref_genome '$file2' --output_pdf $output_pdf --output_tabular $output_tabular
+        python2 '$__tool_directory__/fsd_regions.py' --inputFile '$file1' --inputName1 '$file1.name' --bamFile '$file2' --rangesFile '$file3' --output_pdf $output_pdf --output_tabular $output_tabular
     </command>
     <inputs>
         <param name="file1" type="data" format="tabular" label="Dataset 1: input tags of whole dataset" optional="false" help="Input in tabular format with the family size, tags and the direction of the strand ('ab' or 'ba') for each family."/>
-        <param name="file2" type="data" format="txt" label="Dataset 2: input tags aligned to the reference genome" help="Input in txt format with the regions in the reference genome and the tags, which were aligned to the reference genome."/>
+        <param name="file2" type="data" format="bam" label="Dataset 2: BAM file of aligned reads." help="Input in BAM format with the reads that were aligned to the reference genome."/>
+        <param name="file3" type="data" format="bed" label="Dataset 3: BED file with chromosome, start and stop positions of regions." optional="true" help="BED file with start and stop positions of regions."/>
     </inputs>
     <outputs>
         <data name="output_pdf" format="pdf" />
@@ -18,45 +19,39 @@
     </outputs>
     <tests>
         <test>
-            <param name="file1" value="Test_data.tabular"/>
-            <param name="file2" value="Test_data_regions.txt"/>
-            <output name="output_pdf" file="output_file.pdf" lines_diff="136"/>
-            <output name="output_tabular" file="output_file.tabular"/>
+            <param name="file1" value="fsd_reg.tab"/>
+            <param name="file2" value="fsd_reg.bam"/>
+            <param name="file3" value="fsd_reg_ranges.bed"/>
+            <output name="output_pdf" file="fsd_reg_output.pdf" lines_diff="136"/>
+            <output name="output_tabular" file="fsd_reg_output.tab"/>
         </test>
     </tests>
     <help> <![CDATA[

 **What it does**

-    This tool will create a distribution of family sizes of all tags, which were aligned to the reference genome. The distribution is separated after the regions of the reference genome.
+This tool creates a distribution of family sizes of all tags that were aligned to the reference genome. The distribution is split by the regions of the reference genome.


 **Input**

-    This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.
-
-    +-----+----------------------------+----+
-    | 1   | AAAAAAAAAAAATGTTGGAATCTT   | ba |
-    +-----+----------------------------+----+
-    | 10  | AAAAAAAAAAAGGCGGTCCACCCC   | ab |
-    +-----+----------------------------+----+
-    | 28  | AAAAAAAAAAATGGTATGGACCGA   | ab |
-    +-----+----------------------------+----+
+**Dataset 1:** This tool expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.
+
+    1  AAAAAAAAAAAATGTTGGAATCTT ba
+   10  AAAAAAAAAAAGGCGGTCCACCCC ab
+   28  AAAAAAAAAAATGGTATGGACCGA ab

-
-    In addition, a TXT file with the regions and all tags that were aligned to the reference genome is required.      This file can obtained from a different tool.
-
-    +-----------+------------------------------+
-    | 87_636    | AAATCAAAGTATGAATGAAGTTGCCT   |
-    +-----------+------------------------------+
-    | 87_636    | AAATTCATAGCATTAATTTCAACGGG   |
-    +-----------+------------------------------+
-    | 656_1143  | GGGGCAGCCATATTGGCAATTATCAT   |
-    +-----------+------------------------------+
+**Dataset 2:** BAM file of the aligned reads. This file can be obtained with the tool "Map with BWA-MEM".
+
+**Dataset 3 (optional):** BED file with start and stop positions of the regions. If it is not provided, then all aligned reads of the BAM file are used in the distribution of family sizes.
+
+   ACH_TDII   90    633
+   ACH_TDII  659   1140
+   ACH_TDII 1144   1561

 **Output**

-    The output is a PDF file with the plot and a tabular file with the data of the plot.
+The output is a PDF file with the plot and a tabular file with the data of the plot.


 **About Author**
--- a/test-data/Test_data.tabular	Tue Nov 20 09:51:47 2018 -0500
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,32 +0,0 @@
-10	AAAAAACATCCCAATAAGAAATCA	ab
-9	AAAAAACATCCCAATAAGAAATCA	ba
-4	AAAAAAGTCCTTCGACTCAAGCGG	ab
-5	AAAAAAGTCCTTCGACTCAAGCGG	ba
-5	AAAAAATAGTTAAGCCGACACACT	ab
-7	AAAAAATAGTTAAGCCGACACACT	ba
-7	AAAAAATGTGCCGAACCTTGGCGA	ab
-10	AAAAAATGTGCCGAACCTTGGCGA	ba
-7	AAAAACAACATAGCTTGAAAATTT	ab
-4	AAAAACAACATAGCTTGAAAATTT	ba
-81	ATTCGGATAATTCGACGCAACATT	ab
-11	ATTCGGATAATTCGACGCAACATT	ba
-41	ATTCGTCGACAATACAAAGGGGCC	ab
-226	ATTCGTCGACAATACAAAGGGGCC	ba
-6	ATTGCCAGTGTGGGCTGGTTAGTA	ab
-41	ATTGCCAGTGTGGGCTGGTTAGTA	ba
-50	ATTTCGCGACCATCCGCCACTTTG	ab
-332	ATTTCGCGACCATCCGCCACTTTG	ba
-64	CAAACTTTAGCACAGTGTGTGTCC	ab
-57	CAAACTTTAGCACAGTGTGTGTCC	ba
-85	ATAAACGGCCTTCGACATTGTGAC	ab
-15	ATAAACGGCCTTCGACATTGTGAC	ba
-11	ATAAAGTCACCTGTGAATACGTTG	ab
-35	ATAAAGTCACCTGTGAATACGTTG	ba
-83	ATAAATCGAAACCGTGCCCAACAA	ab
-63	ATAAATCGAAACCGTGCCCAACAA	ba
-9	ATTTAGATATTTTCTTCTTTTTCT	ab
-7	ATTTAGATATTTTCTTCTTTTTCT	ba
-7	ATTTAGTTATCCGTCGGCGACGAA	ab
-3	ATTTAGTTATCCGTCGGCGACGAA	ba
-8	ATTTAGTTTGAATTGCCCTGCGTC	ab
-9	ATTTAGTTTGAATTGCCCTGCGTC	ba
\ No newline at end of file
--- a/test-data/Test_data_regions.txt	Tue Nov 20 09:51:47 2018 -0500
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,16 +0,0 @@
-ACH_87_636 AAAAAACATCCCAATAAGAAATCA
-ACH_87_636 AAAAAAGTCCTTCGACTCAAGCGG
-ACH_87_636 AAAAAATAGTTAAGCCGACACACT
-ACH_87_636 AAAAAATGTGCCGAACCTTGGCGA
-ACH_87_636 AAAAACAACATAGCTTGAAAATTT
-ACH_656_1143 ATTCGGATAATTCGACGCAACATT
-ACH_656_1143 ATTCGTCGACAATACAAAGGGGCC
-ACH_656_1143 ATTGCCAGTGTGGGCTGGTTAGTA
-ACH_656_1143 ATTTCGCGACCATCCGCCACTTTG
-ACH_656_1143 CAAACTTTAGCACAGTGTGTGTCC
-ACH_1141_1564 ATAAACGGCCTTCGACATTGTGAC
-ACH_1141_1564 ATAAAGTCACCTGTGAATACGTTG
-ACH_1141_1564 ATAAATCGAAACCGTGCCCAACAA
-ACH_1892_2398 ATTTAGATATTTTCTTCTTTTTCT
-ACH_1892_2398 ATTTAGTTATCCGTCGGCGACGAA
-ACH_1892_2398 ATTTAGTTTGAATTGCCCTGCGTC
Binary file test-data/fsd_reg.bam has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/fsd_reg.tab	Mon Nov 26 04:25:26 2018 -0500
@@ -0,0 +1,32 @@
+10	AAAAAACATCCCAATAAGAAATCA	ab
+9	AAAAAACATCCCAATAAGAAATCA	ba
+4	AAAAAAGTCCTTCGACTCAAGCGG	ab
+5	AAAAAAGTCCTTCGACTCAAGCGG	ba
+5	AAAAAATAGTTAAGCCGACACACT	ab
+7	AAAAAATAGTTAAGCCGACACACT	ba
+7	AAAAAATGTGCCGAACCTTGGCGA	ab
+10	AAAAAATGTGCCGAACCTTGGCGA	ba
+7	AAAAACAACATAGCTTGAAAATTT	ab
+4	AAAAACAACATAGCTTGAAAATTT	ba
+81	ATTCGGATAATTCGACGCAACATT	ab
+11	ATTCGGATAATTCGACGCAACATT	ba
+41	ATTCGTCGACAATACAAAGGGGCC	ab
+226	ATTCGTCGACAATACAAAGGGGCC	ba
+6	ATTGCCAGTGTGGGCTGGTTAGTA	ab
+41	ATTGCCAGTGTGGGCTGGTTAGTA	ba
+50	ATTTCGCGACCATCCGCCACTTTG	ab
+332	ATTTCGCGACCATCCGCCACTTTG	ba
+64	CAAACTTTAGCACAGTGTGTGTCC	ab
+57	CAAACTTTAGCACAGTGTGTGTCC	ba
+85	ATAAACGGCCTTCGACATTGTGAC	ab
+15	ATAAACGGCCTTCGACATTGTGAC	ba
+11	ATAAAGTCACCTGTGAATACGTTG	ab
+35	ATAAAGTCACCTGTGAATACGTTG	ba
+83	ATAAATCGAAACCGTGCCCAACAA	ab
+63	ATAAATCGAAACCGTGCCCAACAA	ba
+9	ATTTAGATATTTTCTTCTTTTTCT	ab
+7	ATTTAGATATTTTCTTCTTTTTCT	ba
+7	ATTTAGTTATCCGTCGGCGACGAA	ab
+3	ATTTAGTTATCCGTCGGCGACGAA	ba
+8	ATTTAGTTTGAATTGCCCTGCGTC	ab
+9	ATTTAGTTTGAATTGCCCTGCGTC	ba
\ No newline at end of file
Binary file test-data/fsd_reg_output.pdf has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/fsd_reg_output.tab	Mon Nov 26 04:25:26 2018 -0500
@@ -0,0 +1,38 @@
+Dataset:	Tags_fsd_reg.tab
+	AB	BA
+max. family size:	11	35
+absolute frequency:	22	2
+relative frequency:	0.333	0.167
+
+total nr. of reads	1312
+total nr. of tags	24 (12)
+
+
+Values from family size distribution
+	ACH_TDII_5regions_90_633	ACH_TDII_5regions_659_1140
+FS=4	4	0
+FS=5	4	0
+FS=6	0	0
+FS=7	6	0
+FS=8	0	0
+FS=9	2	0
+FS=10	4	0
+FS=11	0	2
+FS=12	0	0
+FS=13	0	0
+FS=14	0	0
+FS=15	0	0
+FS=16	0	0
+FS=17	0	0
+FS=18	0	0
+FS=19	0	0
+FS=20	0	0
+FS>20	0	2
+sum	20	4
+
+
+In the plot, both family sizes of the ab and ba strands were used.
+Whereas the total numbers indicate only the single count of the tags per region.
+Region	total nr. of tags per region
+ACH_TDII_5regions_90_633	10
+ACH_TDII_5regions_659_1140	2
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/fsd_reg_ranges.bed	Mon Nov 26 04:25:26 2018 -0500
@@ -0,0 +1,2 @@
+ACH_TDII_5regions	90	633
+ACH_TDII_5regions	659	1140
Binary file test-data/output_file.pdf has changed
--- a/test-data/output_file.tabular	Tue Nov 20 09:51:47 2018 -0500
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,43 +0,0 @@
-Dataset:	Test_data
-	AB	BA
-max. family size:	85	332
-absolute frequency:	9	1
-relative frequency:	0.209	0.062
-
-total nr. of reads	1312
-total nr. of tags	32 (16)
-
-
-Values from family size distribution
-	ACH_87_636	ACH_656_1143	ACH_1141_1564	ACH_1892_2398
-FS=3	0	0	0	1
-FS=4	2	0	0	0
-FS=5	2	0	0	0
-FS=6	0	1	0	0
-FS=7	3	0	0	2
-FS=8	0	0	0	1
-FS=9	1	0	0	2
-FS=10	2	0	0	0
-FS=11	0	1	1	0
-FS=12	0	0	0	0
-FS=13	0	0	0	0
-FS=14	0	0	0	0
-FS=15	0	0	1	0
-FS=16	0	0	0	0
-FS=17	0	0	0	0
-FS=18	0	0	0	0
-FS=19	0	0	0	0
-FS=20	0	0	0	0
-FS>20	0	8	4	0
-sum	10	10	6	6
-
-
-In the plot, both family sizes of the ab and ba strands were used.
-Whereas the total numbers indicate only the count of the tags per region.
-
-
-Region	total nr. of tags per region
-ACH_87_636	5
-ACH_656_1143	5
-ACH_1141_1564	3
-ACH_1892_2398	3
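
The new test data pairs a BED file (`fsd_reg_ranges.bed`) with region columns named `ACH_TDII_5regions_90_633` and `ACH_TDII_5regions_659_1140` in the expected tabular output. The sketch below shows one way such labels could be derived from BED lines, assuming the label is simply `<chromosome>_<start>_<stop>`. The diff does not show how `fsd_regions.py` builds them, so the function name `bed_to_region_labels` and the parsing details are hypothetical.

```python
def bed_to_region_labels(bed_text):
    """Turn BED lines (chromosome, start, stop) into 'chrom_start_stop' labels.

    Assumption: only the first three whitespace-separated columns are used,
    and lines with fewer than three columns are skipped.
    """
    labels = []
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        chrom, start, stop = fields[0], int(fields[1]), int(fields[2])
        labels.append("{}_{}_{}".format(chrom, start, stop))
    return labels


# Content of test-data/fsd_reg_ranges.bed as added in this changeset.
example_bed = "ACH_TDII_5regions\t90\t633\nACH_TDII_5regions\t659\t1140\n"

print(bed_to_region_labels(example_bed))
# prints ['ACH_TDII_5regions_90_633', 'ACH_TDII_5regions_659_1140']
```

Labels of this shape end in `_<start>_<stop>`, which is the suffix that the updated regular expression in `fsd_regions.py` looks for.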