diff README.md @ 4:e64af72e1b8f draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:44:40 +0000 |
| parents | 706b7acdb230 |
| children |
--- a/README.md Fri Oct 24 09:38:24 2025 +0000
+++ b/README.md Mon Dec 15 16:44:40 2025 +0000
@@ -1,14 +1,252 @@
-This script processes cluster output files from cd-hit-est for use in Galaxy.
-It extracts cluster information, associates taxa and e-values from annotation files,
-performs statistical calculations, and generates text and plot outputs
-summarizing similarity and taxonomic distributions.
+# CDHIT Cluster Analysis Script
+
+This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting.
+
+It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters.
+
+## Usage
+
+The script performs the following main tasks:
+
+1. Parse command-line arguments.
+2. Load the CD-HIT cluster results and annotated read information.
+3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage).
+4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds.
+5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality.
+
+### Command Line Interface
+The CD-HIT cluster analysis tool can be run as a Python script:
+
+```bash
+python cdhit_analysis.py [options]
+```
+
+Below is a detailed example for a common use case.
+
+#### General use case
+This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results.
+
+**Requirements**:
+
+Requirements as listed in the cdhit_analysis.xml file:
+
+- Python version = 3.12.3
+- Matplotlib version = 3.12.3
+- Pandas version = 2.3.2
+- Openpyxl version = 3.1.5
+
+**Input requirements**
+
+- CD-HIT cluster file (.clstr) containing sequence clusters with similarity information.
+- Excel file containing annotated reads with corresponding taxonomic or metadata columns.
+- The read identifiers in both files must match; the script merges cluster membership with read annotations using these IDs.
+
+**Example: Analyzing CD-HIT clusters with taxonomic annotations**
+
+```
+process_clusters_tool/cdhit_analysis.sh \
+--input_cluster 'clusters.txt' \
+--input_annotation 'annotations.xlsx' \
+--output_similarity_txt 'similarity_summary.txt' \
+--output_similarity_plot 'similarity_plot.png' \
+--output_evalue_txt 'evalue_summary.txt' \
+--output_evalue_plot 'evalue_plot.png' \
+--output_count 'cluster_count.txt' \
+--output_taxa_clusters 'taxa_clusters.xlsx' \
+--output_taxa_processed 'taxa_processed.xlsx' \
+--simi_plot_y_min '95' \
+--simi_plot_y_max '100' \
+--uncertain_taxa_use_ratio '0.5' \
+--min_to_split '0.45' \
+--min_count_to_split '10' \
+--show_unannotated_clusters \
+--make_taxa_in_cluster_split \
+--print_empty_files
+```
+
+**Example Input (`clusters.txt`)**
+
+
+```
+>Cluster 0
+0 357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
+>Cluster 1
+0 85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
+1 85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
+2 84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
+3 85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65%
+4 85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65%
+...
+1 39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44%
+>Cluster 530
+0 39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... *
+>Cluster 531
+0 38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... *
+>Cluster 532
+0 33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... *
+>Cluster 533
+0 31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... *
+>Cluster 534
+0 30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... *
+>Cluster 535
+0 29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... *
+>Cluster 536
+0 28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... *
+
+```
+
+**Example annotation file (`annotations.xlsx`)**
+
+
+```
+header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species
+M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
+M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
+M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
+...
+M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
+M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
+M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
+M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
+```
+**Outputs**
+
+| Output Type | Format | Description |
+|--------------|--------|-------------|
+| **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. |
+| **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. |
+| **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). |
+| **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. |
+| **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. |
+| **Taxa per cluster** | `.txt` | Text file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. |
+| **Processed taxa summary** | `.txt` | Aggregated view of taxonomic composition after filtering and cluster-based reassignment. |

-Main steps:
-1. Parse cd-hit-est cluster file and (optional) annotation file.
-2. Process each cluster to extract similarity, taxa, and e-value information.
-3. Aggregate results across clusters.
-4. Generate requested outputs: text summaries, plots, and Excel reports.
+**Output files (example)**
+
+outputs/
+├── similarity_plot.png
+<img width="3570" height="1765" alt="Similarity plot" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" />
+
+├── similarity_summary.txt
+```
+# Average similarity: 98.94
+# Standard deviation: 0.68
+similarity count
+100.0 23803
+99.47 1
+99.46 1
+...
+97.18 1
+97.17 2
+97.14 11
+97.12 2
+97.1 1
+97.03 5
+97.0 946
+```
+├── evalue_plot.png
+<img width="3565" height="1765" alt="E-value plot" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" />
+
+├── evalue_summary.txt
+```
+evalue count
+unannotated 11754.0
+2.8e-40 59691
+2.16e-52 6595
+1.3e-38 6105
+2.57e-35 3332
+...
+7.3e-13 1
+2.06e-12 1
+5.4e-12 1
+8.73e-11 1
+```
+├── cluster_count.txt
+```
+cluster unannotated annotated total perc_unannotated perc_annotated
+0 1.0 0 1.0 100.00 0.00
+1 16.0 68214 68230.0 0.02 99.98
+...
+535 1.0 0 1.0 100.00 0.00
+536 1.0 0 1.0 100.00 0.00
+TOTAL 11754.0 99826 111580.0 10.53 89.47
+```
+├── taxa_clusters.xlsx
+```
+cluster count taxa_full kingdom phylum class order family genus species
+0 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
+1 16 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
+1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa
+...
+534 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
+535 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
+536 1 Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read Unannotated read
+```
+└── taxa_processed.xlsx
+```
+cluster count taxa_full kingdom phylum class order family genus species
+1 68189 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa Viridiplantae Streptophyta Magnoliopsida Rosales Ulmaceae Ulmus Uncertain taxa
+2 7781 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula Viridiplantae Streptophyta Magnoliopsida Malpighiales Salicaceae Populus Populus tremula
+...
+518 1 Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana Viridiplantae Streptophyta Magnoliopsida Myrtales Onagraceae Circaea Circaea lutetiana
+522 1 Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus Viridiplantae Streptophyta Magnoliopsida Rosales Rosaceae Rubus Rubus idaeus
+532 1 Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites Viridiplantae Streptophyta Magnoliopsida Malpighiales Euphorbiaceae Euphorbia Euphorbia myrsinites
+```
+
+#### **CLI Arguments (common)**
+
+| Argument | Description |
+|-----------|--------------|
+| `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). |
+| `--input_annotation` | Path to the annotation file (optional, e.g. `.out` from BLAST or other source). |
+| `--output_similarity_txt` | Output path for similarity summary text file. |
+| `--output_similarity_plot` | Output path for similarity plot image (`.png`). |
+| `--output_evalue_txt` | Output path for E-value summary text file. |
+| `--output_evalue_plot` | Output path for E-value plot image (`.png`). |
+| `--output_count` | Output path for cluster count summary file. |
+| `--output_taxa_clusters` | Output path for taxa-per-cluster file. |
+| `--output_taxa_processed` | Output path for processed taxa summary file. |
+| `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). |
+| `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). |
+| `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). |
+| `--min_to_split` | Minimum taxonomic percentage threshold for splitting multi-taxon clusters (default: `0.45`). |
+| `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). |
+| `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. |
+| `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. |
+| `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. |

-Note: Uses a non-interactive matplotlib backend (Agg) for compatibility with Galaxy.
+### Galaxy integration
+
+The tool is also available through the Galaxy platform:
+
+- **Galaxy Toolshed**: The CDHIT cluster analysis tool is available in the Galaxy Toolshed,
+  enabling easy installation into any Galaxy instance.
+- **Web-based interface**: Users can upload annotation and cluster files, configure validation parameters through the GUI,
+  run validations, and download results.
+- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
+
+To use the tool in Galaxy:
+1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis")
+2. Upload your cluster and Excel annotation files to your Galaxy history
+3. Configure parameters through the GUI
+4. Run the tool
+5. View results and download validation reports and cluster annotations
+
+## License
+
+No license yet
+
+## Citation
+
+If you use this software in your research, please cite this repository.
+
+## Contact
+
+For questions or issues:
+- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
+- Email: onno.gorter@naturalis.nl (until February 2026)
+
+## Acknowledgments
+
+This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
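As a rough illustration of tasks 2 and 3 in the README's list (loading the cluster file and computing similarity statistics), the sketch below parses `.clstr` lines shaped like those in the `clusters.txt` example. It is not the code from `cdhit_analysis.py`: the names `parse_clstr` and `similarity_summary`, the regular expression, and the decision to weight each similarity by the number in parentheses after the read ID are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed member-line shape, taken from the README example:
#   "1 85nt, >READ_ID_CONS(106)... at 1:85:1:85/+/97.65%"
# A trailing "*" marks the cluster representative, which carries no similarity.
MEMBER_RE = re.compile(
    r"^\d+\s+\d+nt,\s+>(?P<read_id>.+?)\((?P<count>\d+)\)\.\.\.\s+"
    r"(?:\*|at\s+\S+/(?P<similarity>[\d.]+)%)\s*$"
)


def parse_clstr(lines):
    """Return {cluster_id: [(read_id, count, similarity_or_None), ...]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip lines this sketch does not understand (e.g. "...")
        sim = match.group("similarity")
        clusters[current].append(
            (
                match.group("read_id"),
                int(match.group("count")),
                float(sim) if sim is not None else None,
            )
        )
    return clusters


def similarity_summary(clusters):
    """Return (Counter of similarity -> reads, weighted mean, weighted std)."""
    counts = Counter()
    for members in clusters.values():
        for _read_id, count, similarity in members:
            if similarity is not None:  # representatives are skipped (assumption)
                counts[similarity] += count
    total = sum(counts.values())
    if total == 0:
        return counts, None, None
    avg = sum(s * n for s, n in counts.items()) / total
    variance = sum(n * (s - avg) ** 2 for s, n in counts.items()) / total
    return counts, avg, variance ** 0.5


if __name__ == "__main__":
    # Hypothetical miniature input, not data from the repository.
    sample = [
        ">Cluster 0",
        "0 357nt, >readA_CONS(1)... *",
        ">Cluster 1",
        "0 85nt, >readB_CONS(3)... *",
        "1 85nt, >readC_CONS(2)... at 1:85:1:85/+/98.0%",
        "2 84nt, >readD_CONS(2)... at 1:84:1:85/+/96.0%",
    ]
    counts, avg, std = similarity_summary(parse_clstr(sample))
    print(dict(counts), avg, std)
```

The read IDs returned here keep the `_CONS` suffix but drop the parenthesised count, which matches the form of the `header` column in the `annotations.xlsx` example, so they could serve as the join key when merging with the annotation table.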
