diff README.md @ 2:9ca209477dfd draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:43:36 +0000 |
| parents | |
| children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Mon Dec 15 16:43:36 2025 +0000
@@ -0,0 +1,297 @@
+# BLAST Annotations Processor Script
+
+This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting.
+
+It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset.
+
+---
+
+## Usage
+
+The script performs the following main tasks:
+
+1. Parse command-line arguments.
+2. Load the annotated BLAST results and the unannotated FASTA headers.
+3. Group BLAST hits per read and filter them by the specified thresholds.
+4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules.
+5. Generate statistics and annotation outputs for downstream use.
+
+
+### Command Line Interface
+The BLAST annotations processor can be run as a Python script:
+
+```bash
+python blast_annotations_processor.py [options]
+```
+
+Below are detailed examples for common use cases.
+
+#### General use case
+This example shows the general use of the tool.
+
+**Requirements**:
+
+Requirements as listed in the blast_annotations_processor xml file:
+
+- Python version=3.12.3
+- Matplotlib version=3.12.3
+- Pandas version=2.3.2
+- Numpy version=2.3.2
+- Openpyxl version=3.1.5
+
+
+**Input requirements**
+
+- BLAST tabular file with alignment metrics, source and taxa
+- FASTA file with preprocessed reads
+- Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers.
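The header correspondence requirement can be checked before running the tool. The sketch below is not part of the script; it is a minimal, hypothetical pre-flight check. It assumes the query ID is the first tab-separated column of the BLAST file, that lines starting with `#` are column headers, and that the read ID is the first whitespace-delimited token of each FASTA header. The function names (`fasta_ids`, `blast_query_ids`, `compare_ids`) are illustrative only.

```python
import io


def fasta_ids(handle):
    """Collect read IDs: the first whitespace-delimited token of each '>' header (assumption)."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def blast_query_ids(handle):
    """Collect query IDs from the first tab-separated column, skipping '#' header lines (assumption)."""
    ids = set()
    for line in handle:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def compare_ids(blast_handle, fasta_handle):
    """Report which IDs occur in both files and which occur in only one of them."""
    blast = blast_query_ids(blast_handle)
    fasta = fasta_ids(fasta_handle)
    return {
        "matched": blast & fasta,
        "blast_only": blast - fasta,
        "fasta_only": fasta - blast,
    }


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real input files (made-up IDs).
    blast_demo = io.StringIO(
        "#Query ID\t#Subject\n"
        "read_1(10)\tsubject_a\n"
        "read_1(10)\tsubject_b\n"
        "read_3(2)\tsubject_c\n"
    )
    fasta_demo = io.StringIO(
        ">read_1(10) count=10;\nacgt\n"
        ">read_2(5) count=5;\nacgt\n"
    )
    report = compare_ids(blast_demo, fasta_demo)
    for key in ("matched", "blast_only", "fasta_only"):
        print(key, sorted(report[key]))
```

Reads that appear only in the FASTA file are expected (they are simply unannotated). A non-empty `blast_only` set would suggest that the two files do not belong together or that the IDs were altered between steps.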
+
+
+
+**Example: Analyzing BLAST annotation results using a curated database**
+
+```bash
+python annotate_blast_results.py \
+    --input-anno 'annotated_curated_results.tabular' \
+    --input-unanno 'unannotated_reads.fasta' \
+    --eval-plot 'eval_curated.png' \
+    --taxa-output 'taxa_curated.txt' \
+    --circle-data 'circle_curated.txt' \
+    --header-anno 'anno_curated.xlsx' \
+    --anno-stats 'stats_curated.txt' \
+    --eval-threshold '1e-5' \
+    --uncertain-threshold '0.9' \
+    --use-counts
+```
+
+This command will:
+
+- Parse the BLAST and FASTA files.
+- Filter hits using `E-value ≤ 1e-5`, apply an uncertainty threshold of `0.9` (90%) when resolving conflicts, and use read counts in the circular data output.
+- Resolve taxonomic conflicts and write plots, reports, and spreadsheet outputs to the given output files.
+
+
+**Example Input (`annotated_curated_results.tabular`)**
+
+
+```
+ #Query ID #Subject #Subject accession #Subject Taxonomy ID #Identity percentage #Coverage #evalue #bitscore #Source #Taxonomy
+ M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=EU382995 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
+ M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=JQ041850 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
+ M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=DQ410740 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus muricatus markercode=trnL lat=NA lon=NA source=NCBI N/A 98.780 100 5.79e-37 147 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus
+ M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=HM590330 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus tremula markercode=trnL lat=50.47 lon=-104.37 source=NCBI N/A 100.000 100 2.16e-52 198 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula
+ M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=MH573985 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus alba markercode=trnL lat=NA lon=NA source=NCBI N/A 99.074 100 1.01e-50 193 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba
+ ...
+```
+
+**Example FASTA (`unannotated_reads.fasta`)**
+
+
+```
+ >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
+ ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+ gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
+ gataggtgcagagactcaatgg
+
+ >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
+ ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+ gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
+ taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
+ ...
+```
+
+**Outputs**
+
+
+| Output Type | Format | Description |
+|-------------------------------|--------|-------------|
+| **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
+| **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
+| **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
+| **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
+| **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+
+
+**Output files (example)**
+
+
+outputs
+
+├── eval.png
+<img width="2100" height="900" alt="afbeelding" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" />
+
+
+├── taxa.txt
+```
+Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285}
+percentage_rooted number_rooted total_num taxon_level indentificatie
+100.00 3373 3373 K Viridiplantae
+100.00 3373 3373 P Streptophyta
+99.97 3372 3373 C Magnoliopsida
+1.96 66 3373 O Apiales
+1.96 66 3373 F Apiaceae
+1.22 41 3373 G Aegopodium
+1.22 41 3373 S Aegopodium podagraria
+0.27 9 3373 G Apium
+0.27 9 3373 S Apium graveolens
+0.47 16 3373 G Uncertain taxa
+4.77 161 3373 O Asterales
+4.77 161 3373 F Asteraceae
+0.06 2 3373 G Achillea
+0.06 2 3373 S Achillea millefolium
+0.15 5 3373 G Artemisia
+0.15 5 3373 S Uncertain taxa
+0.03 1 3373 G Calendula
+...
+4.57 154 3373 G Uncertain taxa
+0.12 4 3373 F Uncertain taxa
+0.53 18 3373 O Uncertain taxa
+0.03 1 3373 C Pinopsida
+0.03 1 3373 O Cupressales
+0.03 1 3373 F Taxaceae
+0.03 1 3373 G Taxus
+0.03 1 3373 S Taxus baccata
+```
+├── circle.txt
+```
+[
+  {
+    "labels": [
+      "Bacteria",
+      "Uncertain taxa",
+      "Viridiplantae"
+    ],
+    "sizes": [
+      2,
+      1,
+      29
+    ]
+  },
+  {
+    "labels": [
+      "Pseudomonadota",
+      "Uncertain taxa",
+      "Streptophyta"
+    ],
+    "sizes": [
+      2,
+      1,
+      29
+    ]
+...
+    ],
+    "sizes": [
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      3,
+      1,
+      1,
+      2,
+      2,
+      1,
+      1,
+      1,
+      1,
+      4,
+      1,
+      1,
+      1,
+      1
+    ]
+  }
+]
+```
+├── anno.xlsx
+```
+header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species
+M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
+M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
+M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
+M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS 3.51E-39 100 100 154 179 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
+M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS 5.79E-37 98.795 100 147 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
+...
+M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS 2.69E-35 97.59 100 141 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
+M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS 7.37E-36 100 95 143 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
+M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS 1.63E-37 98.795 100 148 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
+M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS 5E-43 100 100 167 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia hirsuta
+M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
+M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
+M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
+M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
+```
+
+
+└── stats.txt
+```
+metric value
+percentage_annotated 71.3862433862434
+annotated_sequences 3373
+total_sequences 4725
+percentage_unique_annotated 89.46585409571608
+unique_annotated 99826
+total_unique 111580
+```
+
+---
+
+#### CLI Arguments (common)
+
+| Argument | Description |
+|----------|-------------|
+| `--input-anno` | Path to the annotated BLAST results (tab-separated) |
+| `--input-unanno` | Path to the unannotated reads FASTA file |
+| `--eval-plot` | Output file where the E-value plot will be written |
+| `--taxa-output` | Output file where the taxa output will be written |
+| `--circle-data` | Output file where the circle data will be written |
+| `--header-anno` | Output file where header annotation results will be written |
+| `--anno-stats` | Output file where annotation statistics will be written |
+| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-5`) |
+| `--uncertain-threshold` | Threshold at which the LCA step picks the majority taxon (default: `0.9`, i.e. 90%) |
+| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
+
+---
+
+
+### Galaxy integration
+
+The tool is also available through the Galaxy platform:
+
+- **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed,
+  enabling easy installation into any Galaxy instance.
+- **Web-based interface**: Users can upload sequence files, configure parameters through the GUI,
+  run the tool, and download results.
+- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
+
+To use the tool in Galaxy:
+1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor")
+2. Upload your read (FASTA) and annotated BLAST files to your Galaxy history
+3. Configure parameters through the GUI
+4. Run the tool
+5. View the results and download the reports and tabular annotations
+
+## License
+
+No license yet
+
+## Citation
+
+If you use this software in your research, please cite this repository.
+
+## Contact
+
+For questions or issues:
+- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
+- Email: onno.gorter@naturalis.nl (until February 2026)
+
+## Acknowledgments
+
+This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
