diff README.md @ 4:e64af72e1b8f draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author: onnodg
date: Mon, 15 Dec 2025 16:44:40 +0000
parents: 706b7acdb230
--- a/README.md	Fri Oct 24 09:38:24 2025 +0000
+++ b/README.md	Mon Dec 15 16:44:40 2025 +0000
@@ -1,14 +1,252 @@
-This script processes cluster output files from cd-hit-est for use in Galaxy.
-It extracts cluster information, associates taxa and e-values from annotation files,
-performs statistical calculations, and generates text and plot outputs
-summarizing similarity and taxonomic distributions.
+# CDHIT Cluster Analysis Script
+
+This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting.
+
+It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters.
+
+## Usage
+
+The script performs the following main tasks:
+
+1. Parse command-line arguments.
+2. Load the CD-HIT cluster results and annotated read information.
+3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage).
+4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds.
+5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality.
+   
+### Command Line Interface
+The CD-HIT cluster analysis tool can be run as a Python script:
+
+```bash
+python cdhit_analysis.py [options]
+```
+
+Below is a detailed example for a common use case.
+
+#### General use case
+This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results.
+
+**Requirements**:
+
+Requirements as listed in the cdhit_analysis.xml file:
+
+- Python version = 3.12.3
+- Matplotlib version = 3.12.3
+- Pandas version = 2.3.2
+- Openpyxl version = 3.1.5
+
+**Input requirements**
+
+- CD-HIT cluster file (.clstr) containing sequence clusters with similarity information.
+- Excel file containing annotated reads with corresponding taxonomic or metadata columns.
+- The read identifiers in both files must match — the script merges cluster membership with read annotations using these IDs.
+
+**Example: Analyzing CD-HIT clusters with taxonomic annotations**
+
+```
+process_clusters_tool/cdhit_analysis.sh \
+--input_cluster 'clusters.txt' \
+--input_annotation 'annotations.xlsx' \
+--output_similarity_txt 'similarity_summary.txt' \
+--output_similarity_plot 'similarity_plot.png' \
+--output_evalue_txt 'evalue_summary.txt' \
+--output_evalue_plot 'evalue_plot.png' \
+--output_count 'cluster_count.txt' \
+--output_taxa_clusters 'taxa_clusters.xlsx' \
+--output_taxa_processed 'taxa_processed.xlsx' \
+--simi_plot_y_min '95' \
+--simi_plot_y_max '100' \
+--uncertain_taxa_use_ratio '0.5' \
+--min_to_split '0.45' \
+--min_count_to_split '10' \
+--show_unannotated_clusters \
+--make_taxa_in_cluster_split \
+--print_empty_files
+```
+
+**Example Input (`clusters.txt`)**
+
+
+```
+>Cluster 0
+0	357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
+>Cluster 1
+0	85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
+1	85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
+2	84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
+3	85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65%
+4	85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65%
+    ...
+1	39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44%
+>Cluster 530
+0	39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... *
+>Cluster 531
+0	38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... *
+>Cluster 532
+0	33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... *
+>Cluster 533
+0	31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... *
+>Cluster 534
+0	30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... *
+>Cluster 535
+0	29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... *
+>Cluster 536
+0	28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... *
+
+```
+
+**Example annotation file (`annotations.xlsx`)**
+
+```
+header	e_value	identity percentage	coverage	bitscore	count	source	taxa	kingdom	phylum	class	order	family	genus	species
+M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS	2.33E-41	98.889	100	161	12	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
+M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS	1.08E-39	97.778	100	156	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
+M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS	1.63E-37	98.795	100	148	29	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
+...
+M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS	1.07E-39	100	94	156	13	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
+M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS	4.96E-38	98.81	94	150	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
+M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
+M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
+```
+**Outputs**
+
+| Output Type | Format | Description |
+|--------------|--------|-------------|
+| **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. |
+| **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. |
+| **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). |
+| **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. |
+| **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. |
+| **Taxa per cluster** | `.xlsx` | Excel file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. |
+| **Processed taxa summary** | `.xlsx` | Excel file with an aggregated view of taxonomic composition after filtering and cluster-based reassignment. |
 
 
-Main steps:
-1. Parse cd-hit-est cluster file and (optional) annotation file.
-2. Process each cluster to extract similarity, taxa, and e-value information.
-3. Aggregate results across clusters.
-4. Generate requested outputs: text summaries, plots, and Excel reports.
+**Output files (example)**
+
+outputs/
+├── similarity_plot.png
+<img width="3570" height="1765" alt="Similarity plot" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" />
+
+├── similarity_summary.txt
+```
+# Average similarity: 98.94
+# Standard deviation: 0.68
+similarity	count
+100.0	23803
+99.47	1
+99.46	1
+...
+97.18	1
+97.17	2
+97.14	11
+97.12	2
+97.1	1
+97.03	5
+97.0	946
+```
+├── evalue_plot.png
+<img width="3565" height="1765" alt="E-value plot" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" />
+
+├── evalue_summary.txt
+```
+evalue	count
+unannotated	11754.0
+2.8e-40	59691
+2.16e-52	6595
+1.3e-38	6105
+2.57e-35	3332
+...
+7.3e-13	1
+2.06e-12	1
+5.4e-12	1
+8.73e-11	1
+```
+├── cluster_count.txt
+```
+cluster	unannotated	annotated	total	perc_unannotated	perc_annotated
+0	1.0	0	1.0	100.00	0.00
+1	16.0	68214	68230.0	0.02	99.98
+...
+535	1.0	0	1.0	100.00	0.00
+536	1.0	0	1.0	100.00	0.00
+TOTAL	11754.0	99826	111580.0	10.53	89.47
+```
+├── taxa_clusters.xlsx
+```
+cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
+0	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
+1	16	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
+1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
+...
+534	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
+535	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
+536	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
+```
+└── taxa_processed.xlsx
+```
+cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
+1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
+2	7781	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Salicaceae	Populus	Populus tremula
+...
+518	1	Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana	Viridiplantae	Streptophyta	Magnoliopsida	Myrtales	Onagraceae	Circaea	Circaea lutetiana
+522	1	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Rosaceae	Rubus	Rubus idaeus
+532	1	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Euphorbiaceae	Euphorbia	Euphorbia myrsinites
+```
+
+#### **CLI Arguments (common)**
+
+| Argument | Description |
+|-----------|--------------|
+| `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). |
+| `--input_annotation` | Path to the annotation file (optional): an Excel file (`.xlsx`) containing the annotated reads. |
+| `--output_similarity_txt` | Output path for similarity summary text file. |
+| `--output_similarity_plot` | Output path for similarity plot image (`.png`). |
+| `--output_evalue_txt` | Output path for E-value summary text file. |
+| `--output_evalue_plot` | Output path for E-value plot image (`.png`). |
+| `--output_count` | Output path for cluster count summary file. |
+| `--output_taxa_clusters` | Output path for taxa-per-cluster file. |
+| `--output_taxa_processed` | Output path for processed taxa summary file. |
+| `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). |
+| `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). |
+| `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). |
+| `--min_to_split` | Minimum taxon fraction (0–1) threshold for splitting multi-taxon clusters (default: `0.45`). |
+| `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). |
+| `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. |
+| `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. |
+| `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. |
 
 
-Note: Uses a non-interactive matplotlib backend (Agg) for compatibility with Galaxy.
+### Galaxy integration
+
+The tool is also available through the Galaxy platform:
+
+- **Galaxy Toolshed**: The CD-HIT cluster analysis tool is available in the Galaxy Toolshed,
+  enabling easy installation into any Galaxy instance.
+- **Web-based interface**: Users can upload annotation and cluster files, configure parameters through the GUI,
+  run the analysis, and download results.
+- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
+
+To use the tool in Galaxy:
+1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis")
+2. Upload your cluster and Excel annotation files to your Galaxy history
+3. Configure parameters through the GUI
+4. Run the tool
+5. View results and download the summary files, plots, and cluster annotations
+
+## License
+
+No license has been specified yet.
+
+## Citation
+
+If you use this software in your research, please cite this repository.
+
+## Contact
+
+For questions or issues:
+- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
+- Email: onno.gorter@naturalis.nl (until February 2026)
+
+## Acknowledgments
+
+This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
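
For readers who want to inspect a `.clstr` file outside the tool, the layout shown in the README's example input can be parsed in a few lines of Python. This is a minimal sketch, not the code in `cdhit_analysis.py`: the names `MEMBER_RE`, `parse_clstr`, and `weighted_mean_similarity` are hypothetical, and it assumes that the number in parentheses after each read ID is a read count and that representative lines (ending in `*`) carry no similarity value.

```python
import re
from collections import defaultdict

# Member line, e.g. "0\t85nt, >READ_ID(59577)... at 1:85:1:85/+/98.82%",
# or a representative line, e.g. "0\t357nt, >READ_ID(1)... *".
# Assumption: the number in parentheses is a read count.
MEMBER_RE = re.compile(
    r"^\d+\s+(?P<length>\d+)nt, >(?P<read_id>.+?)\((?P<count>\d+)\)\.\.\."
    r"(?: at .*/(?P<similarity>[\d.]+)%| \*)\s*$"
)


def parse_clstr(lines):
    """Group member records by cluster number.

    Returns {cluster_id: [{"read_id", "length", "count", "similarity"}, ...]}.
    Representative lines get similarity None; unrecognised lines are skipped.
    """
    clusters = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue
        similarity = match["similarity"]
        clusters[current].append({
            "read_id": match["read_id"],
            "length": int(match["length"]),
            "count": int(match["count"]),
            "similarity": float(similarity) if similarity is not None else None,
        })
    return dict(clusters)


def weighted_mean_similarity(members):
    """Mean similarity weighted by read count; None if nothing is scored."""
    scored = [m for m in members if m["similarity"] is not None]
    total = sum(m["count"] for m in scored)
    if total == 0:
        return None
    return sum(m["similarity"] * m["count"] for m in scored) / total


if __name__ == "__main__":
    # Lines copied from the README's example input.
    sample = [
        ">Cluster 0",
        "0\t357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *",
        ">Cluster 1",
        "0\t85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%",
        "1\t85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%",
    ]
    for cluster_id, members in parse_clstr(sample).items():
        print(cluster_id, len(members), weighted_mean_similarity(members))
```

The parsed read IDs come out without the `(N)` suffix, which is the form used in the `header` column of the annotation example, so they could serve as the key when joining cluster membership with the annotation table.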