comparison README.md @ 4:e64af72e1b8f draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:44:40 +0000 |
| parents | 706b7acdb230 |
| children |
| 3:c6981ea453ae | 4:e64af72e1b8f |
|---|---|
| 1 This script processes cluster output files from cd-hit-est for use in Galaxy. | 1 # CDHIT Cluster Analysis Script |
| 2 It extracts cluster information, associates taxa and e-values from annotation files, | 2 |
| 3 performs statistical calculations, and generates text and plot outputs | 3 This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting. |
| 4 summarizing similarity and taxonomic distributions. | 4 |
| 5 | 5 It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters. |
| 6 | 6 |
| 7 Main steps: | 7 ## Usage |
| 8 1. Parse cd-hit-est cluster file and (optional) annotation file. | 8 |
| 9 2. Process each cluster to extract similarity, taxa, and e-value information. | 9 The script performs the following main tasks: |
| 10 3. Aggregate results across clusters. | 10 |
| 11 4. Generate requested outputs: text summaries, plots, and Excel reports. | 11 1. Parse command-line arguments. |
| 12 | 12 2. Load the CD-HIT cluster results and annotated read information. |
| 13 | 13 3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage). |
| 14 Note: Uses a non-interactive matplotlib backend (Agg) for compatibility with Galaxy. | 14 4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds. |
| 15 5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality. | |
| 16 | |
| 17 ### Command Line Interface | |
| 18 The CD-HIT cluster analysis tool can be run as a Python script: | |
| 19 | |
| 20 ```bash | |
| 21 python cdhit_analysis.py [options] | |
| 22 ``` | |
| 23 | |
| 24 Below are detailed examples for a common use case. | |
| 25 | |
| 26 #### General use case | |
| 27 This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results. | |
| 28 | |
| 29 **Requirements**: | |
| 30 | |
| 31 Requirements as listed in the cdhit_analysis.xml file: | |
| 32 | |
| 33 - Python version = 3.12.3 | |
| 34 - Matplotlib version = 3.12.3 | |
| 35 - Pandas version = 2.3.2 | |
| 36 - Openpyxl version = 3.1.5 | |
| 37 | |
| 38 **Input requirements** | |
| 39 | |
| 40 - CD-HIT cluster file (.clstr) containing sequence clusters with similarity information. | |
| 41 - Excel file containing annotated reads with corresponding taxonomic or metadata columns. | |
| 42 - The read identifiers in both files must match — the script merges cluster membership with read annotations using these IDs. | |
| 43 | |
| 44 **Example: Analyzing CD-HIT clusters with taxonomic annotations** | |
| 45 | |
| 46 ``` | |
| 47 process_clusters_tool/cdhit_analysis.sh \ | |
| 48 --input_cluster 'clusters.txt' \ | |
| 49 --input_annotation 'annotations.xlsx' \ | |
| 50 --output_similarity_txt 'similarity_summary.txt' \ | |
| 51 --output_similarity_plot 'similarity_plot.png' \ | |
| 52 --output_evalue_txt 'evalue_summary.txt' \ | |
| 53 --output_evalue_plot 'evalue_plot.png' \ | |
| 54 --output_count 'cluster_count.txt' \ | |
| 55 --output_taxa_clusters 'taxa_clusters.xlsx' \ | |
| 56 --output_taxa_processed 'taxa_processed.xlsx' \ | |
| 57 --simi_plot_y_min '95' \ | |
| 58 --simi_plot_y_max '100' \ | |
| 59 --uncertain_taxa_use_ratio '0.5' \ | |
| 60 --min_to_split '0.45' \ | |
| 61 --min_count_to_split '10' \ | |
| 62 --show_unannotated_clusters \ | |
| 63 --make_taxa_in_cluster_split \ | |
| 64 --print_empty_files | |
| 65 ``` | |
| 66 | |
| 67 **Example Input (`clusters.txt`)** | |
| 68 | |
| 69 | |
| 70 ``` | |
| 71 >Cluster 0 | |
| 72 0 357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... * | |
| 73 >Cluster 1 | |
| 74 0 85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82% | |
| 75 1 85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65% | |
| 76 2 84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81% | |
| 77 3 85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65% | |
| 78 4 85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65% | |
| 79 ... | |
| 80 1 39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44% | |
| 81 >Cluster 530 | |
| 82 0 39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... * | |
| 83 >Cluster 531 | |
| 84 0 38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... * | |
| 85 >Cluster 532 | |
| 86 0 33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... * | |
| 87 >Cluster 533 | |
| 88 0 31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... * | |
| 89 >Cluster 534 | |
| 90 0 30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... * | |
| 91 >Cluster 535 | |
| 92 0 29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... * | |
| 93 >Cluster 536 | |
| 94 0 28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... * | |
| 95 | |
| 96 ``` | |
| 97 | |
| 98 **Example annotation file (`annotations.xlsx`)** | |
| 99 | |
| 100 | |
| 101 ``` | |
| header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species | |
| 102 M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium | |
| 103 M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium | |
| 104 M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria | |
| 105 ... | |
| 106 M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor | |
| 107 M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor | |
| 108 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | |
| 109 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | |
| 110 ``` | |
| 111 **Outputs** | |
| 112 | |
| 113 | Output Type | Format | Description | | |
| 114 |--------------|--------|-------------| | |
| 115 | **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. | | |
| 116 | **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. | | |
| 117 | **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). | | |
| 118 | **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. | | |
| 119 | **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. | | |
| 120 | **Taxa per cluster** | `.xlsx` | Excel file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. | |
| 121 | **Processed taxa summary** | `.xlsx` | Excel file with an aggregated view of taxonomic composition after filtering and cluster-based reassignment. | |
| 122 | |
| 123 | |
| 124 **Output files (example)** | |
| 125 | |
| 126 outputs/ | |
| 127 ├── similarity_plot.png | |
| 128 <img width="3570" height="1765" alt="similarity plot" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" /> | |
| 129 | |
| 130 ├── similarity_summary.txt | |
| 131 ``` | |
| 132 # Average similarity: 98.94 | |
| 133 # Standard deviation: 0.68 | |
| 134 similarity count | |
| 135 100.0 23803 | |
| 136 99.47 1 | |
| 137 99.46 1 | |
| 138 ... | |
| 139 97.18 1 | |
| 140 97.17 2 | |
| 141 97.14 11 | |
| 142 97.12 2 | |
| 143 97.1 1 | |
| 144 97.03 5 | |
| 145 97.0 946 | |
| 146 ``` | |
| 147 ├── evalue_plot.png | |
| 148 <img width="3565" height="1765" alt="E-value plot" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" /> | |
| 149 | |
| 150 ├── evalue_summary.txt | |
| 151 ``` | |
| 152 evalue count | |
| 153 unannotated 11754.0 | |
| 154 2.8e-40 59691 | |
| 155 2.16e-52 6595 | |
| 156 1.3e-38 6105 | |
| 157 2.57e-35 3332 | |
| 158 ... | |
| 159 7.3e-13 1 | |
| 160 2.06e-12 1 | |
| 161 5.4e-12 1 | |
| 162 8.73e-11 1 | |
| 163 ``` | |
| 164 ├── cluster_count.txt | |
| 165 ``` | |
| 166 cluster unannotated annotated total perc_unannotated perc_annotated | |
| 167 0 1.0 0 1.0 100.00 0.00 | |
| 168 1 16.0 68214 68230.0 0.02 99.98 | |
| 169 ... | |
| 170 535 1.0 0 1.0 100.00 0.00 | |
| 171 536 1.0 0 1.0 100.00 0.00 | |
| 172 TOTAL 11754.0 99826 111580.0 10.53 89.47 | |
| 173 ``` | |
| 174 ├── taxa_clusters.xlsx | |
| 175 ``` | |
| 176 cluster count taxa_full kingdom phylum class order family genus species | |
| 177 0 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read | |
| 178 1 16 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read | |
| 179 1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa | |
| 180 ... | |
| 181 534 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read | |
| 182 535 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read | |
| 183 536 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read | |
| 184 ``` | |
| 185 └── taxa_processed.xlsx | |
| 186 ``` | |
| 187 cluster count taxa_full kingdom phylum class order family genus species | |
| 188 1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa | |
| 189 2 7781 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula Viridiplantae Streptophyta Magnoliopsida Malpighiales Salicaceae Populus Populus tremula | |
| 190 ... | |
| 191 518 1 Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana Viridiplantae Streptophyta Magnoliopsida Myrtales Onagraceae Circaea Circaea lutetiana | |
| 192 522 1 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus Viridiplantae Streptophyta Magnoliopsida Rosales Rosaceae Rubus Rubus idaeus | |
| 193 532 1 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites Viridiplantae Streptophyta Magnoliopsida Malpighiales Euphorbiaceae Euphorbia Euphorbia myrsinites | |
| 194 ``` | |
| 195 | |
| 196 #### **CLI Arguments (common)** | |
| 197 | |
| 198 | Argument | Description | | |
| 199 |-----------|--------------| | |
| 200 | `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). | | |
| 201 | `--input_annotation` | Path to the annotation file (optional): an Excel file (`.xlsx`) containing the annotated reads. | |
| 202 | `--output_similarity_txt` | Output path for similarity summary text file. | | |
| 203 | `--output_similarity_plot` | Output path for similarity plot image (`.png`). | | |
| 204 | `--output_evalue_txt` | Output path for E-value summary text file. | | |
| 205 | `--output_evalue_plot` | Output path for E-value plot image (`.png`). | | |
| 206 | `--output_count` | Output path for cluster count summary file. | | |
| 207 | `--output_taxa_clusters` | Output path for taxa-per-cluster file. | | |
| 208 | `--output_taxa_processed` | Output path for processed taxa summary file. | | |
| 209 | `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). | | |
| 210 | `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). | | |
| 211 | `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). | | |
| 212 | `--min_to_split` | Minimum taxonomic percentage threshold for splitting multi-taxon clusters (default: `0.45`). | | |
| 213 | `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). | | |
| 214 | `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. | | |
| 215 | `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. | | |
| 216 | `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. | | |
| 217 | |
| 218 | |
| 219 ### Galaxy integration | |
| 220 | |
| 221 The tool is also available through the Galaxy platform: | |
| 222 | |
| 223 - **Galaxy Toolshed**: The CDHIT cluster analysis tool is available in the Galaxy Toolshed, | |
| 224 enabling easy installation into any Galaxy instance. | |
| 225 - **Web-based interface**: Users can upload annotation and cluster files, configure analysis parameters through the GUI, | |
| 226 run the analysis, and download results. | |
| 227 - **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines. | |
| 228 | |
| 229 To use the tool in Galaxy: | |
| 230 1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis") | |
| 231 2. Upload your cluster file and Excel annotation file to your Galaxy history | |
| 232 3. Configure parameters through the GUI | |
| 233 4. Run the tool | |
| 234 5. View the results and download the summaries, plots, and taxa tables | |
| 235 | |
| 236 ## License | |
| 237 | |
| 238 No license yet | |
| 239 | |
| 240 ## Citation | |
| 241 | |
| 242 If you use this software in your research, please cite this repository. | |
| 243 | |
| 244 ## Contact | |
| 245 | |
| 246 For questions or issues: | |
| 247 - GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues | |
| 248 - Email: onno.gorter@naturalis.nl (until February 2026) | |
| 249 | |
| 250 ## Acknowledgments | |
| 251 | |
| 252 This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer. |
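
As a supplement to the example input (`clusters.txt`) shown in the README, the following is a minimal sketch of how that `.clstr` layout could be parsed in Python. It is not the code from `cdhit_analysis.py`: the function name `parse_clstr`, the regular expression, and the dictionary keys are assumptions made for illustration, based only on the sample lines shown there.

```python
import re

# Assumed member-line layout, taken from the README sample:
#   "1 85nt, >READ_ID(COUNT)... at 1:85:1:85/+/97.65%"
#   "0 357nt, >READ_ID(COUNT)... *"        (cluster representative)
MEMBER_RE = re.compile(
    r"^(?P<index>\d+)\s+(?P<length>\d+)nt,\s+>(?P<read_id>\S+?)"
    r"\((?P<count>\d+)\)\.\.\.\s+"
    r"(?:\*|at\s+\S+/(?P<similarity>[\d.]+)%)\s*$"
)


def parse_clstr(lines):
    """Group member records by cluster number (hypothetical helper)."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip lines that do not fit the assumed layout
        similarity = match.group("similarity")
        clusters[current].append({
            "read_id": match.group("read_id"),
            "length": int(match.group("length")),
            "count": int(match.group("count")),
            # Representatives ("*") have no similarity value in the file.
            "similarity": float(similarity) if similarity else None,
            "is_representative": similarity is None,
        })
    return clusters


SAMPLE = [
    ">Cluster 0",
    "0 357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *",
    ">Cluster 1",
    "0 85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%",
    "1 85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%",
]

if __name__ == "__main__":
    for cluster_id, members in parse_clstr(SAMPLE).items():
        print(cluster_id, [(m["count"], m["similarity"]) for m in members])
```

The sketch stores `None` as the similarity of representative sequences; how the actual script treats them in the similarity summary is not stated in this README.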

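The columns of `cluster_count.txt` can be reproduced from per-cluster read counts along the following lines. This is a hedged sketch rather than the script's implementation: the function `count_annotations`, its input layout, and the rule that a read counts as annotated when its ID appears in the annotation file are all assumptions, and the read IDs in the example are placeholders.

```python
def count_annotations(members_by_cluster, annotated_ids):
    """Build cluster_count-style rows (hypothetical helper).

    members_by_cluster: {cluster_id: [(read_id, read_count), ...]}
    annotated_ids: set of read IDs assumed to be present in the annotation file
    """
    rows = []
    for cluster_id in sorted(members_by_cluster):
        annotated = 0
        unannotated = 0
        for read_id, read_count in members_by_cluster[cluster_id]:
            if read_id in annotated_ids:
                annotated += read_count
            else:
                unannotated += read_count
        total = annotated + unannotated
        rows.append({
            "cluster": cluster_id,
            "unannotated": unannotated,
            "annotated": annotated,
            "total": total,
            # Percentages rounded to two decimals, as in the example output.
            "perc_unannotated": round(100 * unannotated / total, 2) if total else 0.0,
            "perc_annotated": round(100 * annotated / total, 2) if total else 0.0,
        })
    return rows


if __name__ == "__main__":
    # Placeholder IDs; the counts mirror clusters 0 and 1 of the example output.
    members = {
        0: [("read_a", 1)],
        1: [("read_b", 16), ("read_c", 68214)],
    }
    for row in count_annotations(members, annotated_ids={"read_c"}):
        print(row)
    # cluster 1 -> perc_unannotated 0.02, perc_annotated 99.98
```

With the counts for cluster 1 from the example output (16 unannotated, 68214 annotated), the sketch yields the same percentages as the `cluster_count.txt` excerpt.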