planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:43:36 +0000 |
# BLAST Annotations Processor Script

This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting.
It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset.

---

## Usage

The script performs the following main tasks:

1. Parse command-line arguments.
2. Load the annotated BLAST results and the unannotated FASTA headers.
3. Group BLAST hits per read and filter them by the specified thresholds.
4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules.
5. Generate a variety of statistics and annotation outputs for downstream use.

### Command Line Interface

The BLAST annotations processor can be run as a Python script:

```bash
python blast_annotations_processor.py [options]
```

Below are detailed examples for common use cases.

#### General use case

This example shows the general use of the tool.

**Requirements**

Requirements as listed in the blast_annotations_processor xml file:

- Python version=3.12.3
- Matplotlib version=3.12.3
- Pandas version=2.3.2
- Numpy version=2.3.2
- Openpyxl version=3.1.5

**Input requirements**

- BLAST tabular file with alignment metrics, source and taxa
- FASTA file with preprocessed reads
- Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers.
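Steps 3 and 4 of the task list above (filtering the hits of a read, then resolving conflicting taxonomies) can be illustrated with a minimal Python sketch. This is not the script's actual implementation: the function names `resolve_lineage` and `annotate_read`, the prefix-based majority rule, and the exact way the threshold is applied are assumptions made for illustration. Only the ` / `-separated lineage format, the `Uncertain taxa` label, and the default thresholds are taken from this README.

```python
from collections import Counter

# Seven ranks, matching the kingdom..species columns of the spreadsheet output.
LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
UNCERTAIN = "Uncertain taxa"


def resolve_lineage(lineages, threshold=0.9):
    """Return one consensus lineage (dict rank -> name) for the hits of a read.

    Assumed rule: walk down the ranks and keep the most common lineage prefix
    as long as at least `threshold` of all hits share it; every rank below the
    first disagreement is labelled "Uncertain taxa".
    """
    split = [tuple(name.strip() for name in lineage.split(" / ")) for lineage in lineages]
    consensus = []
    for depth in range(1, len(LEVELS) + 1):
        prefixes = Counter(parts[:depth] for parts in split if len(parts) >= depth)
        if not prefixes:
            break
        prefix, hits = prefixes.most_common(1)[0]
        if hits / len(split) < threshold:
            break
        consensus = list(prefix)
    consensus += [UNCERTAIN] * (len(LEVELS) - len(consensus))
    return dict(zip(LEVELS, consensus))


def annotate_read(hits, max_evalue=1e-5, threshold=0.9):
    """Filter (evalue, lineage) pairs by E-value, then resolve the remainder."""
    kept = [lineage for evalue, lineage in hits if evalue <= max_evalue]
    if not kept:
        return None  # no hit passed the filter: the read stays unannotated
    return resolve_lineage(kept, threshold)


# Three hits for one read, modelled on the example input in this README.
base = "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / "
example_hits = [
    (1.24e-38, base + "Ranunculus repens"),
    (1.24e-38, base + "Ranunculus repens"),
    (5.79e-37, base + "Ranunculus muricatus"),
]

result = annotate_read(example_hits)
print(result["genus"], "|", result["species"])  # Ranunculus | Uncertain taxa
```

With the default threshold of 0.9, two out of three hits agreeing on *Ranunculus repens* is not enough, so under this assumed rule the read is resolved at genus level only; lowering the threshold to 0.6 would accept the species.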

**Example: Analyzing BLAST annotation results using a curated database**

```bash
python annotate_blast_results.py \
  --input-anno 'annotated_curated_results.tabular' \
  --input-unanno 'unannotated_reads.fasta' \
  --eval-plot 'eval_curated.png' \
  --taxa-output 'taxa_curated.txt' \
  --circle-data 'circle_curated.txt' \
  --header-anno 'anno_curated.xlsx' \
  --anno-stats 'stats_curated.txt' \
  --eval-threshold '1e-5' \
  --uncertain-threshold '0.9' \
  --use-counts
```

This command will:

- Parse the BLAST and FASTA files.
- Filter hits using `E-value ≤ 1e-5` and `uncertainty threshold ≥ 90%`, and use read counts in the circular data output.
- Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.

**Example Input (`annotated_curated_results.tabular`)**

```
#Query ID #Subject #Subject accession #Subject Taxonomy ID #Identity percentage #Coverage #evalue #bitscore #Source #Taxonomy
M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=EU382995 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=JQ041850 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=DQ410740 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus muricatus markercode=trnL lat=NA lon=NA source=NCBI N/A 98.780 100 5.79e-37 147 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus
M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=HM590330 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus tremula markercode=trnL lat=50.47 lon=-104.37 source=NCBI N/A 100.000 100 2.16e-52 198 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula
M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=MH573985 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus alba markercode=trnL lat=NA lon=NA source=NCBI N/A 99.074 100 1.01e-50 193 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba
...
```

**Example FASTA (`unannotated_reads.fasta`)**

```
>M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0; ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
gataggtgcagagactcaatgg
>M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0; ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
...
```

**Outputs**

| Output Type | Format | Description |
|-------------------------------|--------|-------------|
| **E-value distribution plots** | `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
| **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
| **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
| **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
| **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |

**Output files (example)**

outputs

├── eval.png

<img width="2100" height="900" alt="image" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" />

├── taxa.txt

```
Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285}
percentage_rooted number_rooted total_num taxon_level indentificatie
100.00 3373 3373 K Viridiplantae
100.00 3373 3373 P Streptophyta
99.97 3372 3373 C Magnoliopsida
1.96 66 3373 O Apiales
1.96 66 3373 F Apiaceae
1.22 41 3373 G Aegopodium
1.22 41 3373 S Aegopodium podagraria
0.27 9 3373 G Apium
0.27 9 3373 S Apium graveolens
0.47 16 3373 G Uncertain taxa
4.77 161 3373 O Asterales
4.77 161 3373 F Asteraceae
0.06 2 3373 G Achillea
0.06 2 3373 S Achillea millefolium
0.15 5 3373 G Artemisia
0.15 5 3373 S Uncertain taxa
0.03 1 3373 G Calendula
...
4.57 154 3373 G Uncertain taxa
0.12 4 3373 F Uncertain taxa
0.53 18 3373 O Uncertain taxa
0.03 1 3373 C Pinopsida
0.03 1 3373 O Cupressales
0.03 1 3373 F Taxaceae
0.03 1 3373 G Taxus
0.03 1 3373 S Taxus baccata
```

├── circle.txt

```
[
  {
    "labels": [ "Bacteria", "Uncertain taxa", "Viridiplantae" ],
    "sizes": [ 2, 1, 29 ]
  },
  {
    "labels": [ "Pseudomonadota", "Uncertain taxa", "Streptophyta" ],
    "sizes": [ 2, 1, 29 ]
  ...
    ],
    "sizes": [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 2, 1, 1, 1, 1, 4, 1, 1, 1, 1 ]
  }
]
```

├── anno.xlsx

```
header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species
M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS 3.51E-39 100 100 154 179 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS 5.79E-37 98.795 100 147 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
...
M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS 2.69E-35 97.59 100 141 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS 7.37E-36 100 95 143 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS 1.63E-37 98.795 100 148 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS 5E-43 100 100 167 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia hirsuta
M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
```

└── stats.txt

```
metric value
percentage_annotated 71.3862433862434
annotated_sequences 3373
total_sequences 4725
percentage_unique_annotated 89.46585409571608
unique_annotated 99826
total_unique 111580
```

---

#### CLI Arguments (common)

| Argument | Description |
|----------|-------------|
| `--input-anno` | Path to the annotated BLAST results (tab-separated) |
| `--input-unanno` | Path to the unannotated reads FASTA file |
| `--eval-plot` | Output file for the E-value plot |
| `--taxa-output` | Output file for the taxa summary |
| `--circle-data` | Output file for the circle data |
| `--header-anno` | Output file for the header annotation results |
| `--anno-stats` | Output file for the annotation statistics |
| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-5`) |
| `--uncertain-threshold` | Share at which the LCA picks the majority taxon (default: `0.9`, i.e. 90%) |
| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |

---

### Galaxy integration

The tool is also available through the Galaxy platform:

- **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed, enabling easy installation into any Galaxy instance.
- **Web-based interface**: Users can upload sequence files, configure validation parameters through the GUI, run validations, and download results.
- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.

To use the tool in Galaxy:

1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor").
2. Upload your raw read and BLAST files to your Galaxy history.
3. Configure parameters through the GUI.
4. Run the tool.
5. View results and download validation reports and tabular annotations.

## License

No license yet.

## Citation

If you use this software in your research, please cite this repository.

## Contact

For questions or issues:

- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
- Email: onno.gorter@naturalis.nl (until February 2026)

## Acknowledgments

This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
