# HG changeset patch
# User onnodg
# Date 1760963251 0
# Node ID ff68835adb2ba2f786c9e2ab28eb464dc4ab6f34
# Parent 00d56396b32a254d629397b4b7ba6b020600b52d
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit d771f9fbfd42bcdeda1623d954550882a0863847-dirty

diff -r 00d56396b32a -r ff68835adb2b __pycache__/__init__.cpython-313.pyc
Binary file __pycache__/__init__.cpython-313.pyc has changed
diff -r 00d56396b32a -r ff68835adb2b __pycache__/cdhit_analysis.cpython-313.pyc
Binary file __pycache__/cdhit_analysis.cpython-313.pyc has changed
diff -r 00d56396b32a -r ff68835adb2b cdhit_analysis.py
--- a/cdhit_analysis.py Tue Oct 14 09:09:46 2025 +0000
+++ b/cdhit_analysis.py Mon Oct 20 12:27:31 2025 +0000
@@ -1,5 +1,3 @@
-#!/usr/bin/env python3
-
 import argparse
 import os
 import re
diff -r 00d56396b32a -r ff68835adb2b cdhit_analysis.xml
--- a/cdhit_analysis.xml Tue Oct 14 09:09:46 2025 +0000
+++ b/cdhit_analysis.xml Mon Oct 20 12:27:31 2025 +0000
@@ -1,4 +1,4 @@
-
+
 Analyze CD-HIT clustering results with taxonomic annotation
@@ -14,19 +14,19 @@
 --input_annotation '$input_annotation'
 
 #if $output_options.similarity_output:
---output_similarity_txt '$output_similarity_txt'
---output_similarity_plot '$output_similarity_plot'
+--output_similarity_txt '$similarity_txt'
+--output_similarity_plot '$similarity_plot'
 #end if
 #if $output_options.evalue_output:
---output_evalue_txt '$output_evalue_txt'
---output_evalue_plot '$output_evalue_plot'
+--output_evalue_txt '$evalue_txt'
+--output_evalue_plot '$evalue_plot'
 #end if
 #if $output_options.count_output:
---output_count '$output_count'
+--output_count '$cluster_count'
 #end if
 #if $output_options.taxa_output:
---output_taxa_clusters '$output_taxa_clusters'
---output_taxa_processed '$output_taxa_processed'
+--output_taxa_clusters '$cluster_taxa'
+--output_taxa_processed '$processed_taxa'
 #end if
 
 --simi_plot_y_min '$plot_params.simi_plot_y_min'
@@ -48,24 +48,24 @@
 ]]>
-
+ label="Excel Annotations file"
+ help="Excel workfile with annotations per header" />
@@ -104,31 +104,31 @@
- + output_options['similarity_output'] - + output_options['similarity_output'] - + output_options['evalue_output'] - + output_options['evalue_output'] - + output_options['count_output'] - + output_options['taxa_output'] - + output_options['taxa_output'] @@ -143,13 +143,13 @@ - - - - - - - + + + + + + + @@ -160,13 +160,13 @@ - - - - - - - + + + + + + + @@ -178,7 +178,7 @@
- +
@@ -187,15 +187,15 @@ -
+
-
-
-
-
-
+
+
+
+
+
@@ -212,10 +212,10 @@
 
 **Output Options:**
 
-- **Similarity output**: Creates similarity analysis with plots and text files showing intra-cluster similarity distributions
-- **E-value output**: Creates E-value analysis with plots and text files showing E-value distributions
-- **Count output**: Creates summary tables with annotated/unannotated read counts per cluster
-- **Taxa output**: Creates taxonomic analysis determining the most likely taxa for each cluster
+- **Cluster similarity output**: Creates similarity analysis with plots and text files showing intra-cluster similarity distributions
+- **Cluster e-value output**: Creates E-value analysis with plots and text files showing E-value distributions
+- **Cluster count output**: Creates summary tables with annotated/unannotated read counts per cluster
+- **Taxa annotations output**: Creates taxonomic analysis determining the most likely taxa for each cluster
 
 **Parameters:**
 
@@ -235,9 +235,23 @@
 
 **Note**: The tool expects that sequence counts are included in the cluster file headers in the format "header(count)".
 
+-------------
+
+.. class:: infomark
+
 **Credits**
-Authors = Onno de Gorter, 2025.
+
 Based on a script by Nick Kortleven, translated, modified and wrapped by Onno de Gorter,
-Developed for the New light on old remedies project, a PhD research by Anja Fischer
+Developed for the New light on old remedies project, a PhD research by Anja Fischer.
+
+Link to the project website:
+
+* https://ahm.uva.nl/funded-research-projects/new-lights-on-old-remedies/new-lights-on-old-remedies.html
+
 ]]>
+
+
+
+
+
\ No newline at end of file
diff -r 00d56396b32a -r ff68835adb2b test-data/malformed_cluster.clstr
--- a/test-data/malformed_cluster.clstr Tue Oct 14 09:09:46 2025 +0000
+++ b/test-data/malformed_cluster.clstr Mon Oct 20 12:27:31 2025 +0000
@@ -1,4 +1,4 @@
->Cluster 0
-0 100nt, >read1:50..._CONS(50) *
-invalid_line_without_proper_format
-1 90nt, >read2:25..._CONS(25) at /+/95%
+>Cluster 0
+0 100nt, >read1:50..._CONS(50) *
+invalid_line_without_proper_format
+1 90nt, >read2:25..._CONS(25) at /+/95%
diff -r 00d56396b32a -r ff68835adb2b test-data/simple_cluster.clstr
--- a/test-data/simple_cluster.clstr Tue Oct 14 09:09:46 2025 +0000
+++ b/test-data/simple_cluster.clstr Mon Oct 20 12:27:31 2025 +0000
@@ -1,2 +1,2 @@
->Cluster 0
-0 100nt, >read_no_anno:50... *
+>Cluster 0
+0 100nt, >read_no_anno:50... *
diff -r 00d56396b32a -r ff68835adb2b test-data/test_count.txt
--- a/test-data/test_count.txt Tue Oct 14 09:09:46 2025 +0000
+++ b/test-data/test_count.txt Mon Oct 20 12:27:31 2025 +0000
@@ -1,26 +1,26 @@
-cluster unannotated annotated total perc_unannotated perc_annotated
-0 2.0 408 410.0 0.49 99.51
-1 1.0 0 1.0 100.00 0.00
-2 0.0 1 1.0 0.00 100.00
-3 0.0 52 52.0 0.00 100.00
-4 1.0 0 1.0 100.00 0.00
-5 0.0 176 176.0 0.00 100.00
-6 1.0 0 1.0 100.00 0.00
-7 0.0 79 79.0 0.00 100.00
-8 1.0 0 1.0 100.00 0.00
-9 9.0 0 9.0 100.00 0.00
-10 3.0 0 3.0 100.00 0.00
-11 2.0 0 2.0 100.00 0.00
-12 1.0 0 1.0 100.00 0.00
-13 1.0 0 1.0 100.00 0.00
-14 1.0 0 1.0 100.00 0.00
-15 5.0 0 5.0 100.00 0.00
-16 21.0 0 21.0 100.00 0.00
-17 38.0 0 38.0 100.00 0.00
-18 5.0 0 5.0 100.00 0.00
-19 5.0 0 5.0 100.00 0.00
-20 1.0 0 1.0 100.00 0.00
-21 1.0 0 1.0 100.00 0.00
-22 4.0 0 4.0 100.00 0.00
-23 0.0 1 1.0 0.00 100.00
-TOTAL 103.0 717 820.0 12.56 87.44
+cluster unannotated annotated total perc_unannotated perc_annotated
+0 2.0 408 410.0 0.49 99.51
+1 1.0 0 1.0 100.00 0.00
+2 0.0 1 1.0 0.00 100.00
+3 0.0 52 52.0 0.00 100.00
+4 1.0 0 1.0 100.00 0.00
+5 0.0 176 176.0 0.00 100.00
+6 1.0 0 1.0 100.00 0.00
+7 0.0 79 79.0 0.00 100.00
+8 1.0 0 1.0 100.00 0.00
+9 9.0 0 9.0 100.00 0.00
+10 3.0 0 3.0 100.00 0.00
+11 2.0 0 2.0 100.00 0.00
+12 1.0 0 1.0 100.00 0.00
+13 1.0 0 1.0 100.00 0.00
+14 1.0 0 1.0 100.00 0.00
+15 5.0 0 5.0 100.00 0.00
+16 21.0 0 21.0 100.00 0.00
+17 38.0 0 38.0 100.00 0.00
+18 5.0 0 5.0 100.00 0.00
+19 5.0 0 5.0 100.00 0.00
+20 1.0 0 1.0 100.00 0.00
+21 1.0 0 1.0 100.00 0.00
+22 4.0 0 4.0 100.00 0.00
+23 0.0 1 1.0 0.00 100.00
+TOTAL 103.0 717 820.0 12.56 87.44
diff -r 00d56396b32a -r ff68835adb2b test-data/test_evalue.txt
--- a/test-data/test_evalue.txt Tue Oct 14 09:09:46 2025 +0000
+++ b/test-data/test_evalue.txt Mon Oct 20 12:27:31 2025 +0000
@@ -1,20 +1,20 @@
-evalue count
-unannotated 103.0
-1.41e-39 414
-4.99e-39 166
-1.54e-33 72
-6.56e-38 25
-2.32e-37 16
-7.17e-32 6
-1.82e-38 4
-5.07e-39 3
-8.21e-37 2
-1.43e-39 1
-6.45e-38 1
-6.66e-38 1
-2.28e-37 1
-8.62e-37 1
-1.06e-35 1
-1.08e-35 1
-3.33e-30 1
-8.16e-12 1
+evalue count
+unannotated 103.0
+1.41e-39 414
+4.99e-39 166
+1.54e-33 72
+6.56e-38 25
+2.32e-37 16
+7.17e-32 6
+1.82e-38 4
+5.07e-39 3
+8.21e-37 2
+1.43e-39 1
+6.45e-38 1
+6.66e-38 1
+2.28e-37 1
+8.62e-37 1
+1.06e-35 1
+1.08e-35 1
+3.33e-30 1
+8.16e-12 1
diff -r 00d56396b32a -r ff68835adb2b test-data/test_processed_taxa.xlsx
Binary file test-data/test_processed_taxa.xlsx has changed
diff -r 00d56396b32a -r ff68835adb2b test-data/test_similarity.txt
--- a/test-data/test_similarity.txt Tue Oct 14 09:09:46 2025 +0000
+++ b/test-data/test_similarity.txt Mon Oct 20 12:27:31 2025 +0000
@@ -1,14 +1,14 @@
-# Average similarity: 99.35
-# Standard deviation: 0.65
-similarity count
-100.0 383
-98.89 368
-98.88 18
-98.86 1
-98.73 7
-98.28 1
-98.21 8
-97.8 2
-97.78 29
-97.75 2
-97.73 1
+# Average similarity: 99.35
+# Standard deviation: 0.65
+similarity count
+100.0 383
+98.89 368
+98.88 18
+98.86 1
+98.73 7
+98.28 1
+98.21 8
+97.8 2
+97.78 29
+97.75 2
+97.73 1
diff -r 00d56396b32a -r ff68835adb2b test-data/test_taxa_clusters.xlsx
Binary file test-data/test_taxa_clusters.xlsx has changed
diff -r 00d56396b32a -r ff68835adb2b tests/__pycache__/test_cdhit_analysis.cpython-313-pytest-8.4.2.pyc
Binary file tests/__pycache__/test_cdhit_analysis.cpython-313-pytest-8.4.2.pyc has changed
diff -r 00d56396b32a -r ff68835adb2b tests/test_cdhit_analysis.py
--- a/tests/test_cdhit_analysis.py Tue Oct 14 09:09:46 2025 +0000
+++ b/tests/test_cdhit_analysis.py Mon Oct 20 12:27:31 2025 +0000
@@ -591,27 +591,27 @@
         assert "Processing complete" in captured.out
 
-    def test_16a_prepare_evalue_histogram_valid_data(self):
+    def test_18a_prepare_evalue_histogram_valid_data(self):
         """
-        Test 16a: prepare_evalue_histogram returns correct counts/bins.
+        Test 18a: prepare_evalue_histogram returns correct counts/bins.
         """
         from Stage_1_translated.NLOOR_scripts.process_clusters_tool import cdhit_analysis as ca
         counts, bins = ca.prepare_evalue_histogram([1e-5, 1e-3, 0.5], [])
         assert counts.sum() == 3  # 3 entries counted
         assert len(bins) == 51  # 50 bins => 51 edges
 
-    def test_16b_prepare_evalue_histogram_empty(self):
+    def test_18b_prepare_evalue_histogram_empty(self):
         """
-        Test 16b: prepare_evalue_histogram with empty/invalid data returns (None, None).
+        Test 18b: prepare_evalue_histogram with empty/invalid data returns (None, None).
         """
         from Stage_1_translated.NLOOR_scripts.process_clusters_tool import cdhit_analysis as ca
         counts, bins = ca.prepare_evalue_histogram([0, None, "bad"], [])
         assert counts is None
         assert bins is None
 
-    def test_16c_create_evalue_plot_creates_file_and_returns_data(self, tmp_path):
+    def test_18c_create_evalue_plot_creates_file_and_returns_data(self, tmp_path):
         """
-        Test 16c: create_evalue_plot saves a PNG and returns numeric data.
+        Test 18c: create_evalue_plot saves a PNG and returns numeric data.
         """
         from Stage_1_translated.NLOOR_scripts.process_clusters_tool import cdhit_analysis as ca
         out = tmp_path / "eval.png"