planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:44:40 +0000 |
| parents | 706b7acdb230 |
| children |
# CDHIT Cluster Analysis Script

This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting. It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters.

## Usage

The script performs the following main tasks:

1. Parse command-line arguments.
2. Load the CD-HIT cluster results and annotated read information.
3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage).
4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds.
5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality.

### Command Line Interface

The CD-HIT cluster analysis tool can be run as a Python script:

```bash
python cdhit_analysis.py [options]
```

Below are detailed examples for a common use case.

#### General use case

This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results.

**Requirements** (as listed in the `cdhit_analysis.xml` file):

- Python version = 3.12.3
- Matplotlib version = 3.12.3
- Pandas version = 2.3.2
- Openpyxl version = 3.1.5

**Input requirements**

- CD-HIT cluster file (`.clstr`) containing sequence clusters with similarity information.
- Excel file containing annotated reads with corresponding taxonomic or metadata columns.
- The read identifiers in both files must match: the script merges cluster membership with read annotations using these IDs.
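The ID-based merge described in the input requirements can be illustrated with a minimal, standard-library sketch. It is not the tool's actual implementation. It assumes that each member line of the `.clstr` file carries the read ID followed by a number in parentheses (treated here as a read count) and either `*` or an `at .../<identity>%` suffix. The names `parse_clstr`, `mean_identity`, `annotate`, and the placeholder annotation lookup are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed member-line layout of a .clstr file:
#   "<index> <length>nt, ><read_id>(<count>)... *"              (representative)
#   "<index> <length>nt, ><read_id>(<count>)... at .../<pct>%"  (other members)
MEMBER_RE = re.compile(
    r"^\d+\s+(?P<length>\d+)nt,\s+>(?P<read_id>.+?)\((?P<count>\d+)\)\.\.\.\s+"
    r"(?:\*|at\s+\S*/(?P<identity>[\d.]+)%)\s*$"
)


def parse_clstr(lines):
    """Group members per cluster from the lines of a .clstr file."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip blank or unrecognised lines
        identity = match["identity"]
        clusters[current].append({
            "read_id": match["read_id"],  # ID without the "(count)" suffix
            "count": int(match["count"]),
            # A line ending in "*" carries no identity value.
            "identity": float(identity) if identity else None,
        })
    return dict(clusters)


def mean_identity(members):
    """Average identity over members that report one, rounded to 2 decimals."""
    values = [m["identity"] for m in members if m["identity"] is not None]
    return round(mean(values), 2) if values else None


def annotate(members, annotations):
    """Attach a taxon to each member by read ID; unmatched reads are flagged."""
    return [
        {**m, "taxa": annotations.get(m["read_id"], "Unannotated read")}
        for m in members
    ]


EXAMPLE = """\
>Cluster 0
0 357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
>Cluster 1
0 85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
1 85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
2 84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
"""

# Hypothetical annotation lookup (read ID -> taxon); the taxon is a placeholder.
ANNOTATIONS = {
    "M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS": "placeholder taxon",
}

clusters = parse_clstr(EXAMPLE.splitlines())
for cluster_id, members in clusters.items():
    annotated = annotate(members, ANNOTATIONS)
    print(cluster_id, len(annotated), mean_identity(members))
```

Stripping the parenthesised number before the lookup matters because the annotation file lists read IDs without it. A pandas-based version would do the same join with a merge on the ID column.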
**Example: Analyzing CD-HIT clusters with taxonomic annotations**

```
process_clusters_tool/cdhit_analysis.sh \
    --input_cluster 'clusters.txt' \
    --input_annotation 'annotations.xlsx' \
    --output_similarity_txt 'similarity_summary.txt' \
    --output_similarity_plot 'similarity_plot.png' \
    --output_evalue_txt 'evalue_summary.txt' \
    --output_evalue_plot 'evalue_plot.png' \
    --output_count 'cluster_count.txt' \
    --output_taxa_clusters 'taxa_clustered.xlsx' \
    --output_taxa_processed 'taxa_processed.xlsx' \
    --simi_plot_y_min '95' \
    --simi_plot_y_max '100' \
    --uncertain_taxa_use_ratio '0.5' \
    --min_to_split '0.45' \
    --min_count_to_split '10' \
    --show_unannotated_clusters \
    --make_taxa_in_cluster_split \
    --print_empty_files
```

**Example input (`clusters.txt`)**

```
>Cluster 0
0	357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
>Cluster 1
0	85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
1	85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
2	84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
3	85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65%
4	85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65%
...
1	39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44%
>Cluster 530
0	39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... *
>Cluster 531
0	38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... *
>Cluster 532
0	33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... *
>Cluster 533
0	31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... *
>Cluster 534
0	30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... *
>Cluster 535
0	29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... *
>Cluster 536
0	28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... *
```

**Example annotation file (`annotations.xlsx`)**

```
header	e_value	identity percentage	coverage	bitscore	count	source	taxa	kingdom	phylum	class	order	family	genus	species
M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS	2.33E-41	98.889	100	161	12	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS	1.08E-39	97.778	100	156	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS	1.63E-37	98.795	100	148	29	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
...
M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS	1.07E-39	100	94	156	13	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS	4.96E-38	98.81	94	150	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
```

**Outputs**

| Output Type | Format | Description |
|--------------|--------|-------------|
| **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. |
| **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. |
| **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). |
| **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. |
| **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. |
| **Taxa per cluster** | `.xlsx` | Excel file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. |
| **Processed taxa summary** | `.xlsx` | Aggregated view of taxonomic composition after filtering and cluster-based reassignment. |

**Output files (example)**

outputs/

├── similarity_plot.png

<img width="3570" height="1765" alt="Similarity plot" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" />

├── similarity_summary.txt

```
# Average similarity: 98.94
# Standard deviation: 0.68
similarity	count
100.0	23803
99.47	1
99.46	1
...
97.18	1
97.17	2
97.14	11
97.12	2
97.1	1
97.03	5
97.0	946
```

├── evalue_plot.png

<img width="3565" height="1765" alt="E-value plot" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" />

├── evalue_summary.txt

```
evalue	count
unannotated	11754.0
2.8e-40	59691
2.16e-52	6595
1.3e-38	6105
2.57e-35	3332
...
7.3e-13	1
2.06e-12	1
5.4e-12	1
8.73e-11	1
```

├── cluster_count.txt

```
cluster	unannotated	annotated	total	perc_unannotated	perc_annotated
0	1.0	0	1.0	100.00	0.00
1	16.0	68214	68230.0	0.02	99.98
...
535	1.0	0	1.0	100.00	0.00
536	1.0	0	1.0	100.00	0.00
TOTAL	11754.0	99826	111580.0	10.53	89.47
```

├── taxa_clusters.xlsx

```
cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
0	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
1	16	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
...
534	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
535	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
536	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
```

└── taxa_processed.xlsx

```
cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
2	7781	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Salicaceae	Populus	Populus tremula
...
518	1	Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana	Viridiplantae	Streptophyta	Magnoliopsida	Myrtales	Onagraceae	Circaea	Circaea lutetiana
522	1	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Rosaceae	Rubus	Rubus idaeus
532	1	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Euphorbiaceae	Euphorbia	Euphorbia myrsinites
```

#### **CLI Arguments (common)**

| Argument | Description |
|-----------|--------------|
| `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). |
| `--input_annotation` | Path to the annotation file (optional, e.g. `.out` from BLAST or other source). |
| `--output_similarity_txt` | Output path for similarity summary text file. |
| `--output_similarity_plot` | Output path for similarity plot image (`.png`). |
| `--output_evalue_txt` | Output path for E-value summary text file. |
| `--output_evalue_plot` | Output path for E-value plot image (`.png`). |
| `--output_count` | Output path for cluster count summary file. |
| `--output_taxa_clusters` | Output path for taxa-per-cluster file. |
| `--output_taxa_processed` | Output path for processed taxa summary file. |
| `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). |
| `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). |
| `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). |
| `--min_to_split` | Minimum taxonomic percentage threshold for splitting multi-taxon clusters (default: `0.45`). |
| `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). |
| `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. |
| `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. |
| `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. |

### Galaxy integration

The tool is also available through the Galaxy platform:

- **Galaxy Toolshed**: The CDHIT cluster analysis tool is available in the Galaxy Toolshed, enabling easy installation into any Galaxy instance.
- **Web-based interface**: Users can upload annotation and cluster files, configure parameters through the GUI, run the analysis, and download results.
- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.

To use the tool in Galaxy:

1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis").
2. Upload your cluster and Excel annotation files to your Galaxy history.
3. Configure parameters through the GUI.
4. Run the tool.
5. View the results and download the reports and cluster annotations.

## License

No license yet.

## Citation

If you use this software in your research, please cite this repository.

## Contact

For questions or issues:

- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
- Email: onno.gorter@naturalis.nl (until February 2026)

## Acknowledgments

This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
