planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_clusters_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author onnodg
date Mon, 15 Dec 2025 16:44:40 +0000

# CD-HIT Cluster Analysis Script

This script processes a single **cluster file** together with an **Excel file containing annotated reads**, generating multiple output files for downstream visualization and reporting.

It is designed for clustering-based taxonomic pipelines and provides a detailed overview of cluster composition, similarity metrics, and taxonomic consistency within and between clusters.

## Usage

The script performs the following main tasks:

1. Parse command-line arguments.
2. Load the CD-HIT cluster results and annotated read information.
3. Group reads per cluster and compute similarity statistics (e.g., identity, alignment coverage).
4. Resolve taxonomic inconsistencies within clusters using uncertainty and minimum-count thresholds.
5. Generate visual and tabular summaries of cluster composition, similarity distribution, and annotation quality.
   
### Command Line Interface
The CD-HIT cluster analysis tool can be run as a Python script:

```bash
python cdhit_analysis.py [options]
```

Below is a detailed example of a common use case.

#### General use case
This example demonstrates the general usage of the tool for analyzing CD-HIT clustering results.

**Requirements**:

Requirements as listed in the cdhit_analysis.xml file:

- Python version = 3.12.3
- Matplotlib version = 3.12.3
- Pandas version = 2.3.2
- Openpyxl version = 3.1.5

**Input requirements**

- CD-HIT cluster file (.clstr) containing sequence clusters with similarity information.
- Excel file containing annotated reads with corresponding taxonomic or metadata columns.
- The read identifiers in both files must match; the script merges cluster membership with read annotations using these IDs.
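
The ID matching described above can be illustrated with a minimal sketch. It is not taken from `cdhit_analysis.py`: the helper names `normalize_read_id` and `merge_by_read_id` are hypothetical, and the sketch assumes that a trailing `(N)` on a cluster-file read ID (as in `..._CONS(59577)`) is a count suffix that does not appear in the annotation headers and can be stripped before matching.

```python
import re

# Assumption: a trailing "(N)" on a cluster-file read ID is a count suffix
# that is absent from the annotation headers, so it is stripped before matching.
_COUNT_SUFFIX = re.compile(r"\(\d+\)$")


def normalize_read_id(read_id):
    """Return the read ID without a trailing "(N)" count suffix."""
    return _COUNT_SUFFIX.sub("", read_id.strip())


def merge_by_read_id(cluster_of_read, annotations):
    """Attach the annotation row (or None) to every clustered read.

    cluster_of_read: dict mapping raw read ID -> cluster number
    annotations:     dict mapping annotation header -> annotation row (dict)
    """
    merged = []
    for raw_id, cluster in cluster_of_read.items():
        read_id = normalize_read_id(raw_id)
        merged.append({
            "read_id": read_id,
            "cluster": cluster,
            # Reads without a matching header stay unannotated (None).
            "annotation": annotations.get(read_id),
        })
    return merged
```

Reads whose ID has no counterpart in the annotation table end up with `annotation` set to `None`, which corresponds to the "unannotated" category in the outputs.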

**Example: Analyzing CD-HIT clusters with taxonomic annotations**

```bash
process_clusters_tool/cdhit_analysis.sh \
  --input_cluster 'clusters.txt' \
  --input_annotation 'annotations.xlsx' \
  --output_similarity_txt 'similarity_summary.txt' \
  --output_similarity_plot 'similarity_plot.png' \
  --output_evalue_txt 'evalue_summary.txt' \
  --output_evalue_plot 'evalue_plot.png' \
  --output_count 'cluster_count.txt' \
  --output_taxa_clusters 'taxa_clusters.xlsx' \
  --output_taxa_processed 'taxa_processed.xlsx' \
  --simi_plot_y_min '95' \
  --simi_plot_y_max '100' \
  --uncertain_taxa_use_ratio '0.5' \
  --min_to_split '0.45' \
  --min_count_to_split '10' \
  --show_unannotated_clusters \
  --make_taxa_in_cluster_split \
  --print_empty_files
```

**Example Input (`clusters.txt`)**


```
>Cluster 0
0	357nt, >M01687:476:000000000-LL5F5:1:2113:18579:17490_CONS(1)... *
>Cluster 1
0	85nt, >M01687:476:000000000-LL5F5:1:1102:21316:1191_CONS(59577)... at 1:85:1:85/+/98.82%
1	85nt, >M01687:476:000000000-LL5F5:1:1102:19793:1302_CONS(106)... at 1:85:1:85/+/97.65%
2	84nt, >M01687:476:000000000-LL5F5:1:1102:18943:1430_CONS(15)... at 1:84:1:85/+/98.81%
3	85nt, >M01687:476:000000000-LL5F5:1:1102:9619:1460_CONS(38)... at 1:85:1:85/+/97.65%
4	85nt, >M01687:476:000000000-LL5F5:1:1102:8280:1614_CONS(1)... at 1:85:1:85/+/97.65%
    ...
1	39nt, >M01687:476:000000000-LL5F5:1:1116:4266:19390_CONS(1)... at 1:39:1:38/+/97.44%
>Cluster 530
0	39nt, >M01687:476:000000000-LL5F5:1:2112:21268:1323_CONS(1)... *
>Cluster 531
0	38nt, >M01687:476:000000000-LL5F5:1:2103:25634:11346_CONS(1)... *
>Cluster 532
0	33nt, >M01687:476:000000000-LL5F5:1:2106:13260:18932_CONS(1)... *
>Cluster 533
0	31nt, >M01687:476:000000000-LL5F5:1:1110:28179:10205_CONS(1)... *
>Cluster 534
0	30nt, >M01687:476:000000000-LL5F5:1:1110:23278:23216_CONS(1)... *
>Cluster 535
0	29nt, >M01687:476:000000000-LL5F5:1:2117:17691:6487_CONS(1)... *
>Cluster 536
0	28nt, >M01687:476:000000000-LL5F5:1:1104:7756:22829_CONS(1)... *

```
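
As an illustration of how a file in this format can be read, the following is a minimal parsing sketch and not the tool's actual code. The function name `parse_clstr` and the field names of the returned records are hypothetical. The sketch assumes every member line follows the layout shown above: an index, a length in `nt`, a `>`-prefixed read ID ending in `...`, and then either `*` for the representative sequence or an `at ...` field whose last `/`-separated part is the similarity percentage.

```python
import re

# index, length in nt, read ID (up to the literal "..."), then "*" or "at <alignment>"
MEMBER = re.compile(r"^(\d+)\s+(\d+)nt,\s+>(.+?)\.\.\.\s+(\*|at\s+\S+)$")


def parse_clstr(lines):
    """Yield one record (dict) per cluster member from CD-HIT .clstr lines."""
    cluster = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None or cluster is None:
            continue  # skip blank lines and anything unrecognised
        _, length, read_id, tail = match.groups()
        representative = tail == "*"
        # The similarity is the text after the last "/" without the "%" sign.
        similarity = None if representative else float(tail.rsplit("/", 1)[1].rstrip("%"))
        yield {
            "cluster": cluster,
            "read_id": read_id,
            "length": int(length),
            "representative": representative,
            "similarity": similarity,
        }
```

Calling `list(parse_clstr(open("clusters.txt")))` would give one record per read, which can then be grouped by the `cluster` field.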

**Example annotation file (`annotations.xlsx`)**


```
header	e_value	identity percentage	coverage	bitscore	count	source	taxa	kingdom	phylum	class	order	family	genus	species
M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS	2.33E-41	98.889	100	161	12	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS	1.08E-39	97.778	100	156	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS	1.63E-37	98.795	100	148	29	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
...
M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS	1.07E-39	100	94	156	13	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS	4.96E-38	98.81	94	150	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
```
**Outputs**

| Output Type | Format | Description |
|--------------|--------|-------------|
| **Similarity summary** | `.txt` | Text file listing average and per-cluster similarity statistics, derived from the CD-HIT `.clstr` file. |
| **Similarity plot** | `.png` | Histogram or density plot showing sequence similarity distribution across all clusters; useful for identifying thresholds or anomalies. |
| **E-value summary** | `.txt` | Text file containing aggregated E-value statistics for all clusters (if available from annotation data). |
| **E-value plot** | `.png` | Visualization of E-value distribution, helping to identify potential low-confidence clusters. |
| **Cluster count summary** | `.txt` | Summary of the number of clusters, total reads per cluster, and counts of annotated vs. unannotated reads. |
| **Taxa per cluster** | `.xlsx` | Excel file showing the dominant or representative taxon assigned to each cluster, including uncertainty ratios. |
| **Processed taxa summary** | `.xlsx` | Aggregated view of taxonomic composition after filtering and cluster-based reassignment. |
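
The similarity summary reports an average and a standard deviation together with a similarity/count table. A minimal sketch of how such figures can be derived from a value-to-count table is given here. The function name `weighted_mean_std` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption; the script itself may compute these values differently.

```python
import math


def weighted_mean_std(histogram):
    """Return (mean, population standard deviation) of the keys of
    `histogram`, each weighted by its count."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram is empty")
    mean = sum(value * count for value, count in histogram.items()) / total
    variance = sum(
        count * (value - mean) ** 2 for value, count in histogram.items()
    ) / total
    return mean, math.sqrt(variance)
```

For example, `weighted_mean_std({100.0: 3, 98.0: 1})` gives a mean of `99.5`, because the value `100.0` is counted three times.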


**Output files (example)**

outputs/
├── similarity_plot.png
<img width="3570" height="1765" alt="Similarity plot" src="https://github.com/user-attachments/assets/f1ad5105-fcd1-4c2d-a5aa-7e8419b46281" />

├── similarity_summary.txt
```
# Average similarity: 98.94
# Standard deviation: 0.68
similarity	count
100.0	23803
99.47	1
99.46	1
...
97.18	1
97.17	2
97.14	11
97.12	2
97.1	1
97.03	5
97.0	946
```
├── evalue_plot.png
<img width="3565" height="1765" alt="E-value plot" src="https://github.com/user-attachments/assets/278fdfe3-882e-4f0e-901b-a2acbbcace24" />

├── evalue_summary.txt
```
evalue	count
unannotated	11754.0
2.8e-40	59691
2.16e-52	6595
1.3e-38	6105
2.57e-35	3332
...
7.3e-13	1
2.06e-12	1
5.4e-12	1
8.73e-11	1
```
├── cluster_count.txt
```
cluster	unannotated	annotated	total	perc_unannotated	perc_annotated
0	1.0	0	1.0	100.00	0.00
1	16.0	68214	68230.0	0.02	99.98
...
535	1.0	0	1.0	100.00	0.00
536	1.0	0	1.0	100.00	0.00
TOTAL	11754.0	99826	111580.0	10.53	89.47
```
├── taxa_clusters.xlsx
```
cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
0	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
1	16	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
...
534	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
535	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
536	1	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read	Unannotated read
```
└── taxa_processed.xlsx
```
cluster	count	taxa_full	kingdom	phylum	class	order	family	genus	species
1	68189	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Ulmaceae / Ulmus / Uncertain taxa	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Ulmaceae	Ulmus	Uncertain taxa
2	7781	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Salicaceae	Populus	Populus tremula
...
518	1	Viridiplantae / Streptophyta / Magnoliopsida / Myrtales / Onagraceae / Circaea / Circaea lutetiana	Viridiplantae	Streptophyta	Magnoliopsida	Myrtales	Onagraceae	Circaea	Circaea lutetiana
522	1	Viridiplantae / Streptophyta / Magnoliopsida / Rosales / Rosaceae / Rubus / Rubus idaeus	Viridiplantae	Streptophyta	Magnoliopsida	Rosales	Rosaceae	Rubus	Rubus idaeus
532	1	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Euphorbiaceae / Euphorbia / Euphorbia myrsinites	Viridiplantae	Streptophyta	Magnoliopsida	Malpighiales	Euphorbiaceae	Euphorbia	Euphorbia myrsinites
```
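
The percentage columns in the `cluster_count.txt` example follow directly from the unannotated and annotated counts. The sketch below reproduces that arithmetic; the function name `annotation_percentages` is hypothetical, and rounding to two decimals is inferred from the example rather than from the script.

```python
def annotation_percentages(unannotated, annotated):
    """Return (perc_unannotated, perc_annotated), rounded to two decimals."""
    total = unannotated + annotated
    if total == 0:
        return 0.0, 0.0  # assumption: an empty cluster reports 0 % for both
    return (
        round(100 * unannotated / total, 2),
        round(100 * annotated / total, 2),
    )


# TOTAL row of the example: 11754 unannotated and 99826 annotated reads
print(annotation_percentages(11754, 99826))  # (10.53, 89.47)
```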

#### **CLI Arguments (common)**

| Argument | Description |
|-----------|--------------|
| `--input_cluster` | Path to the input CD-HIT cluster file (`.clstr`). |
| `--input_annotation` | Path to the annotation file (optional); in the example above this is an Excel file (`.xlsx`) of annotated reads. |
| `--output_similarity_txt` | Output path for similarity summary text file. |
| `--output_similarity_plot` | Output path for similarity plot image (`.png`). |
| `--output_evalue_txt` | Output path for E-value summary text file. |
| `--output_evalue_plot` | Output path for E-value plot image (`.png`). |
| `--output_count` | Output path for cluster count summary file. |
| `--output_taxa_clusters` | Output path for taxa-per-cluster file. |
| `--output_taxa_processed` | Output path for processed taxa summary file. |
| `--simi_plot_y_min` | Minimum value for the Y-axis in the similarity plot (default: `95.0`). |
| `--simi_plot_y_max` | Maximum value for the Y-axis in the similarity plot (default: `100.0`). |
| `--uncertain_taxa_use_ratio` | Ratio (0–1) determining how uncertain taxa contribute to the dominant taxon (default: `0.5`). |
| `--min_to_split` | Minimum taxon proportion, given as a fraction (0–1), for splitting multi-taxon clusters (default: `0.45`). |
| `--min_count_to_split` | Minimum number of reads required to split a cluster by taxonomy (default: `10`). |
| `--show_unannotated_clusters` | Include clusters without any annotation in the output when specified. |
| `--make_taxa_in_cluster_split` | Enable splitting clusters containing multiple taxa into subclusters. |
| `--print_empty_files` | Print a message if an expected output file (e.g., annotation file) is empty. |


### Galaxy integration

The tool is also available through the Galaxy platform:

- **Galaxy Toolshed**: The CD-HIT cluster analysis tool is available in the Galaxy Toolshed, 
  enabling easy installation into any Galaxy instance.
- **Web-based interface**: Users can upload annotation and cluster files, configure analysis parameters through the GUI, 
  run the analysis, and download results.
- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.

To use the tool in Galaxy:
1. Install the tool from the Galaxy Toolshed (search for "cdhit_analysis")
2. Upload your cluster file and Excel annotation file to your Galaxy history
3. Configure parameters through the GUI
4. Run the tool
5. View the results and download the output files

## License

No license yet

## Citation

If you use this software in your research, please cite this repository.

## Contact

For questions or issues:
- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
- Email: onno.gorter@naturalis.nl (until February 2026)

## Acknowledgments

This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.