comparison README.md @ 4:e64af72e1b8f draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author onnodg
date Mon, 15 Dec 2025 16:44:40 +0000
parents 706b7acdb230
children
comparison
3:c6981ea453ae 4:e64af72e1b8f
1 This script processes cluster output files from cd-hit-est for use in Galaxy. 1 # CDHIT Cluster Analysis Script
2 It extracts cluster information, associates taxa and e-values from annotation files, 2
3 performs statistical calculations, and generates text and plot outputs 3 This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting.
4 summarizing similarity and taxonomic distributions. 4
5 5 It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters.
6 6
7 Main steps: 7 ## Usage
8 1. Parse cd-hit-est cluster file and (optional) annotation file. 8
9 2. Process each cluster to extract similarity, taxa, and e-value information. 9 The script performs the following main tasks:
10 3. Aggregate results across clusters. 10
11 4. Generate requested outputs: text summaries, plots, and Excel reports. 11 1. Parse command-line arguments.
12 12 2. Load the CD-HIT cluster results and annotated read information.
13 13 3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage).
14 Note: Uses a non-interactive matplotlib backend (Agg) for compatibility with Galaxy. 14 4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds.
15 5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality.
16
17 ### Command Line Interface
18 The CD-HIT cluster analysis tool can be run as a Python script:
19
20 ```bash
21 python cdhit_analysis.py [options]
22 ```
23
24 Below are detailed examples for a common use case.
25
26 #### General use case
27 This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results.
28
29 **Requirements**:
30
31 Requirements as listed in the cdhit_analysis.xml file:
32
33 - Python version = 3.12.3
34 - Matplotlib version = 3.12.3
35 - Pandas version = 2.3.2
36 - Openpyxl version = 3.1.5
37
38 **Input requirements**
39
40 - CD-HIT cluster file (.clstr) containing sequence clusters with similarity information.
41 - Excel file containing annotated reads with corresponding taxonomic or metadata columns.
42 - The read identifiers in both files must match — the script merges cluster membership with read annotations using these IDs.
43
44 **Example: Analyzing CD-HIT clusters with taxonomic annotations**
45
46 ```
47 process_clusters_tool/cdhit_analysis.sh \
48 --input_cluster 'clusters.txt' \
49 --input_annotation 'annotations.xlsx' \
50 --output_similarity_txt 'similarity_summary.txt' \
51 --output_similarity_plot 'similarity_plot.png' \
52 --output_evalue_txt 'evalue_summary.txt' \
53 --output_evalue_plot 'evalue_plot.png' \
54 --output_count 'cluster_count.txt' \
55 --output_taxa_clusters 'taxa_clustered.xlsx' \
56 --output_taxa_processed 'taxa_processed.xlsx' \
57 --simi_plot_y_min '95' \
58 --simi_plot_y_max '100' \
59 --uncertain_taxa_use_ratio '0.5' \
60 --min_to_split '0.45' \
61 --min_count_to_split '10' \
62 --show_unannotated_clusters \
63 --make_taxa_in_cluster_split \
64 --print_empty_files
65 ```
66
67 **Example Input (`clusters.txt`)**
68
69
70 ```
71 >Cluster 0
72 0 357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
73 >Cluster 1
74 0 85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
75 1 85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
76 2 84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
77 3 85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65%
78 4 85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65%
79 ...
80 1 39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44%
81 >Cluster 530
82 0 39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... *
83 >Cluster 531
84 0 38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... *
85 >Cluster 532
86 0 33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... *
87 >Cluster 533
88 0 31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... *
89 >Cluster 534
90 0 30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... *
91 >Cluster 535
92 0 29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... *
93 >Cluster 536
94 0 28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... *
95
96 ```
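
The cluster file format shown above can be read with a few lines of Python. The following is a minimal sketch, not the parser used by `cdhit_analysis.py`: the function name `parse_clstr` and the returned structure are hypothetical, and the number in parentheses after `_CONS` is assumed to be a read count.

```python
import re
from collections import defaultdict

# One member line: index, length, ">read_id(count)...", then either "*"
# (cluster representative) or "at <alignment>/<strand>/<similarity>%".
# ASSUMPTION: the "(n)" suffix is a read count and is not part of the read ID.
MEMBER_RE = re.compile(
    r"^\d+\s+(?P<length>\d+)nt,\s+>(?P<read_id>.+?)\((?P<count>\d+)\)\.\.\.\s+"
    r"(?:\*|at\s+\S*?/(?P<similarity>[\d.]+)%)\s*$"
)


def parse_clstr(lines):
    """Return {cluster_number: [(read_id, count, length, similarity_or_None), ...]}."""
    clusters = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip blank or unrecognized lines
        similarity = match.group("similarity")
        clusters[current].append((
            match.group("read_id"),
            int(match.group("count")),
            int(match.group("length")),
            float(similarity) if similarity is not None else None,
        ))
    return dict(clusters)
```

Calling `parse_clstr(open("clusters.txt"))` would give one list of members per cluster. Representative reads (marked `*`) carry no similarity value, so they are returned with `None`.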
97
98 **Example annotation file (`annotations.xlsx`)**
99
100
101 ```header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species
102 M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
103 M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
104 M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
105 ...
106 M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
107 M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
108 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
109 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
110 ```
111 **Outputs**
112
113 | Output Type | Format | Description |
114 |--------------|--------|-------------|
115 | **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. |
116 | **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. |
117 | **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). |
118 | **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. |
119 | **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. |
120 | **Taxa per cluster** | `.xlsx` | Excel file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. |
121 | **Processed taxa summary** | `.xlsx` | Aggregated view of taxonomic composition after filtering and cluster-based reassignment. |
122
123
124 **Output files (example)**
125
126 outputs/
127 ├── similarity_plot.png
128 <img width="3570" height="1765" alt="image" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" />
129
130 ├── similarity_summary.txt
131 ```
132 # Average similarity: 98.94
133 # Standard deviation: 0.68
134 similarity count
135 100.0 23803
136 99.47 1
137 99.46 1
138 ...
139 97.18 1
140 97.17 2
141 97.14 11
142 97.12 2
143 97.1 1
144 97.03 5
145 97.0 946
146 ```
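
The two header values in `similarity_summary.txt` can be reproduced from the `similarity`/`count` rows as a count-weighted mean and standard deviation. This is a sketch under an assumption: it uses the population standard deviation, and the source does not say whether the script uses the population or the sample form. The function name `weighted_mean_std` is hypothetical.

```python
import math


def weighted_mean_std(rows):
    """rows: iterable of (similarity, count) pairs.

    Returns the count-weighted mean and the population standard deviation.
    ASSUMPTION: population (not sample) standard deviation.
    """
    rows = list(rows)
    total = sum(count for _, count in rows)
    if total == 0:
        raise ValueError("no reads to summarize")
    mean = sum(value * count for value, count in rows) / total
    variance = sum(count * (value - mean) ** 2 for value, count in rows) / total
    return mean, math.sqrt(variance)
```

Each row contributes in proportion to its count, so a single value shared by many reads (such as `100.0` above) dominates the average.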
147 ├── evalue_plot.png
148 <img width="3565" height="1765" alt="image" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" />
149
150 ├── evalue_summary.txt
151 ```
152 evalue count
153 unannotated 11754.0
154 2.8e-40 59691
155 2.16e-52 6595
156 1.3e-38 6105
157 2.57e-35 3332
158 ...
159 7.3e-13 1
160 2.06e-12 1
161 5.4e-12 1
162 8.73e-11 1
163 ```
164 ├── cluster_count.txt
165 ```
166 cluster unannotated annotated total perc_unannotated perc_annotated
167 0 1.0 0 1.0 100.00 0.00
168 1 16.0 68214 68230.0 0.02 99.98
169 ...
170 535 1.0 0 1.0 100.00 0.00
171 536 1.0 0 1.0 100.00 0.00
172 TOTAL 11754.0 99826 111580.0 10.53 89.47
173 ```
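
The percentage columns in `cluster_count.txt` follow directly from the two count columns. A small sketch (the helper name `annotation_percentages` is hypothetical, and rounding to two decimals is inferred from the example output):

```python
def annotation_percentages(unannotated, annotated):
    """Return (perc_unannotated, perc_annotated), rounded to two decimals."""
    total = unannotated + annotated
    if total == 0:
        return 0.0, 0.0  # avoid division by zero for an empty cluster
    return (
        round(100 * unannotated / total, 2),
        round(100 * annotated / total, 2),
    )
```

For the `TOTAL` row above, `annotation_percentages(11754, 99826)` returns `(10.53, 89.47)`.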
174 ├── taxa_clusters.xlsx
175 ```
176 cluster count taxa_full kingdom phylum class order family genus species
177 0 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
178 1 16 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
179 1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa
180 ...
181 534 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
182 535 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
183 536 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
184 ```
185 └── taxa_processed.xlsx
186 ```
187 cluster count taxa_full kingdom phylum class order family genus species
188 1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa
189 2 7781 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula Viridiplantae Streptophyta Magnoliopsida Malpighiales Salicaceae Populus Populus tremula
190 ...
191 518 1 Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana Viridiplantae Streptophyta Magnoliopsida Myrtales Onagraceae Circaea Circaea lutetiana
192 522 1 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus Viridiplantae Streptophyta Magnoliopsida Rosales Rosaceae Rubus Rubus idaeus
193 532 1 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites Viridiplantae Streptophyta Magnoliopsida Malpighiales Euphorbiaceae Euphorbia Euphorbia myrsinites
194 ```
195
196 #### **CLI Arguments (common)**
197
198 | Argument | Description |
199 |-----------|--------------|
200 | `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). |
201 | `--input_annotation` | Path to the annotation file with annotated reads (optional; an Excel `.xlsx` file in the example above). |
202 | `--output_similarity_txt` | Output path for similarity summary text file. |
203 | `--output_similarity_plot` | Output path for similarity plot image (`.png`). |
204 | `--output_evalue_txt` | Output path for E-value summary text file. |
205 | `--output_evalue_plot` | Output path for E-value plot image (`.png`). |
206 | `--output_count` | Output path for cluster count summary file. |
207 | `--output_taxa_clusters` | Output path for taxa-per-cluster file. |
208 | `--output_taxa_processed` | Output path for processed taxa summary file. |
209 | `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). |
210 | `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). |
211 | `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). |
212 | `--min_to_split` | Minimum taxonomic proportion threshold (as a fraction, 0–1) for splitting multi-taxon clusters (default: `0.45`). |
213 | `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). |
214 | `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. |
215 | `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. |
216 | `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. |
217
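
The argument table above maps naturally onto Python's `argparse`. The sketch below is illustrative only and is not taken from `cdhit_analysis.py`: the argument names and defaults come from the table, while the types, the `required` setting, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(description="CD-HIT cluster analysis (sketch)")
    parser.add_argument("--input_cluster", required=True)  # ASSUMPTION: required
    parser.add_argument("--input_annotation")
    # Output paths; all treated as optional strings in this sketch.
    for name in (
        "output_similarity_txt", "output_similarity_plot",
        "output_evalue_txt", "output_evalue_plot",
        "output_count", "output_taxa_clusters", "output_taxa_processed",
    ):
        parser.add_argument(f"--{name}")
    # Numeric options with the defaults listed in the table.
    parser.add_argument("--simi_plot_y_min", type=float, default=95.0)
    parser.add_argument("--simi_plot_y_max", type=float, default=100.0)
    parser.add_argument("--uncertain_taxa_use_ratio", type=float, default=0.5)
    parser.add_argument("--min_to_split", type=float, default=0.45)
    parser.add_argument("--min_count_to_split", type=int, default=10)
    # Boolean switches: present means True, absent means False.
    for flag in (
        "show_unannotated_clusters", "make_taxa_in_cluster_split", "print_empty_files",
    ):
        parser.add_argument(f"--{flag}", action="store_true")
    return parser
```

With this layout, `build_parser().parse_args(["--input_cluster", "clusters.txt"])` yields the documented defaults for every numeric option and `False` for every switch.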
218
219 ### Galaxy integration
220
221 The tool is also available through the Galaxy platform:
222
223 - **Galaxy Toolshed**: The CDHIT cluster analysis tool is available in the Galaxy Toolshed,
224 enabling easy installation into any Galaxy instance.
225 - **Web-based interface**: Users can upload annotation and cluster files, configure analysis parameters through the GUI,
226 run the analysis, and download results.
227 - **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
228
229 To use the tool in Galaxy:
230 1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis")
231 2. Upload your cluster file and Excel annotation file to your Galaxy history
232 3. Configure parameters through the GUI
233 4. Run the tool
234 5. View results and download the generated summaries, plots, and taxa tables
235
236 ## License
237
238 No license yet
239
240 ## Citation
241
242 If you use this software in your research, please cite this repository.
243
244 ## Contact
245
246 For questions or issues:
247 - GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
248 - Email: onno.gorter@naturalis.nl (until February 2026)
249
250 ## Acknowledgments
251
252 This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.