Mercurial > repos > onnodg > blast_annotations_processor
view README.md @ 3:ca2f07b71581 draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty
| author | onnodg |
|---|---|
| date | Thu, 12 Feb 2026 13:52:07 +0000 |
| parents | 9ca209477dfd |
| children |
line wrap: on
line source
# BLAST Annotations Processor Script This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting. It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset. --- ## Usage The script performs the following main tasks: 1. Parse command-line arguments. 2. Load the annotated BLAST results and the unannotated FASTA headers. 3. Group BLAST hits per read and filter them by specified thresholds. 4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules. 5. Generate a variety of outputs of statistics and annotations for downstream use. ### Command Line Interface The BLAST annotations processor can be run as a Python script: ```bash python blast_annotations_processor.py [options] ``` Below are detailed examples for common use case #### General use case This example shows the general use of the tool. **Requirements**: Requirements as listed in the blast_annotations_processor xml file: - Python version=3.12.3 - Matplotlib version=3.12.3 - Pandas version=2.3.2 - Numpy version=2.3.2 - Openpyxl version=3.1.5 **Input requirements** - BLAST tabular file with alignment metrics, source and taxa - Fasta file with preprocessed reads - Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers. **Example: Analyzing BLAST annotation result using curated database** ```bash python annotate_blast_results.py \ --input-anno annotated_curated_results.tabular \ --input-unanno unannotated_reads.fasta \ --eval-plot eval_curated.png \ --taxa-output taxa_curated.txt \ --circle-data circle_curated.txt \ --header-anno anno_curated.xlsx \ --log run.log \ --filtered-fasta filtered_reads.fasta \ --eval-threshold 1e-10 \ --uncertain-threshold 0.9 \ --use-counts \ --min-identity 70 \ --min-coverage 69 \ --min-bitscore 40 \ --bitscore-perc-cutoff 0 \ --ignore-rank "unknown,invalid" \ --ignore-taxonomy "environmental" \ --ignore-obiclean-type singleton \ --ignore-illuminapairend-type pairend \ --min-support 10 ``` This command will: Parse the annotated BLAST results and the corresponding unannotated FASTA sequences. Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering). Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental". Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads. Use read counts when generating circular taxonomy outputs (--use-counts). Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis. **Example Input (`annotated_curated_results.tabular`)** ``` #Query ID #Subject #Subject accession #Subject Taxonomy ID #Identity percentage #Coverage #evalue #bitscore #Source #Taxonomy M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=EU382995 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=JQ041850 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=DQ410740 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus muricatus markercode=trnL lat=NA lon=NA source=NCBI N/A 98.780 100 5.79e-37 147 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=HM590330 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus tremula markercode=trnL lat=50.47 lon=-104.37 source=NCBI N/A 100.000 100 2.16e-52 198 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=MH573985 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus alba markercode=trnL lat=NA lon=NA source=NCBI N/A 99.074 100 1.01e-50 193 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba ... ``` **Example FASTA (`unannotated_reads.fasta`)** ``` >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0; ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg gataggtgcagagactcaatgg >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0; ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca taaagacagaataagaatacaaaaggataggtgcagagactcaatgg ... ``` **Outputs** | Output Type | Format | Description | |-------------------------------|--------|-------------| | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. | | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. | | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. | | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. | | **Log** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. | | **Filtered fasta** | `.fasta` | Fasta that passed the set thresholds, for use in downstream analysis (clustering) | **Output files (example)** outputs ├── eval.png <img width="2100" height="900" alt="afbeelding" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" /> ├── taxa.txt ``` Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285} percentage_rooted number_rooted total_num taxon_level indentificatie 100.00 3373 3373 K Viridiplantae 100.00 3373 3373 P Streptophyta 99.97 3372 3373 C Magnoliopsida 1.96 66 3373 O Apiales 1.96 66 3373 F Apiaceae 1.22 41 3373 G Aegopodium 1.22 41 3373 S Aegopodium podagraria 0.27 9 3373 G Apium 0.27 9 3373 S Apium graveolens 0.47 16 3373 G Uncertain taxa 4.77 161 3373 O Asterales 4.77 161 3373 F Asteraceae 0.06 2 3373 G Achillea 0.06 2 3373 S Achillea millefolium 0.15 5 3373 G Artemisia 0.15 5 3373 S Uncertain taxa 0.03 1 3373 G Calendula ... 4.57 154 3373 G Uncertain taxa 0.12 4 3373 F Uncertain taxa 0.53 18 3373 O Uncertain taxa 0.03 1 3373 C Pinopsida 0.03 1 3373 O Cupressales 0.03 1 3373 F Taxaceae 0.03 1 3373 G Taxus 0.03 1 3373 S Taxus baccata ``` ├── circle.txt ``` [ { "labels": [ "Bacteria", "Uncertain taxa", "Viridiplantae" ], "sizes": [ 2, 1, 29 ] }, { "labels": [ "Pseudomonadota", "Uncertain taxa", "Streptophyta" ], "sizes": [ 2, 1, 29 ] ... ], "sizes": [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 2, 1, 1, 1, 1, 4, 1, 1, 1, 1 ] } ] ``` ├── anno.xlsx ``` header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS 3.51E-39 100 100 154 179 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS 5.79E-37 98.795 100 147 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria ... M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS 2.69E-35 97.59 100 141 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS 7.37E-36 100 95 143 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS 1.63E-37 98.795 100 148 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS 5E-43 100 100 167 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia hirsuta M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium ``` └── log.txt ``` Starting processing for FASTA === PARAMETERS USED === uncertain_threshold: 0.9 eval_threshold: 1e-10 use_counts: True ignore_rank: unknown ignore_taxonomy: environmental bitscore_perc_cutoff: 8.0 min_bitscore: 100 ignore_obiclean_type: singleton ignore_illuminapairend_type: pairend min_identity: 80 min_coverage: 70 ignore_seqids: min_support: 1 === END PARAMETERS === Filtered FASTA written succesfully(1790 sequences) FASTA: total headers: 2156 FASTA: headers kept after filters and min_support=1: 1790 FASTA: removed due to header filters (illumina/obiclean/etc.): 366 FASTA: removed due to low dereplicated count (<1): 0 FASTA: total invalid (header filter + low support): 366 Reading BLAST annotations BLAST: total hits read: 4977 BLAST: hits kept after quality filters: 3145 BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832 BLAST: hits removed due to invalid taxon: 0 BLAST: hits removed due to ignored seqids: 0 Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)'] ANNOTATION: total FASTA headers considered: 1790 ANNOTATION: reads with BLAST hits: 622 ANNOTATION: reads without BLAST hits: 1168 ANNOTATION: unique annotated count (from header counts): 49571 ANNOTATION: total unique count (from FASTA): 66132 E-value plot written succesfully Taxa summary written succesfully Header annotations written succesfully Circle diagram JSON written succesfully === ANNOTATION STATISTICS === percentage_annotated: 28.84972170686456 annotated_sequences: 622 total_sequences: 2156 percentage_unique_annotated: 74.95766043670235 unique_annotated: 49571 total_unique: 66132 ``` └── filtered_fasta.fasta ``` >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0; ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg gataggtgcagagactcaatgg >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0; ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca taaagacagaataagaatacaaaaggataggtgcagagactcaatgg ``` --- #### CLI Arguments (common) | Argument | Description | |----------|-------------| | `--input-anno` | Path to the annotated BLAST results (tab-separated) | | `--input-unanno` | Path to the unannotated reads FASTA file | | `--eval-plot` | Output file where the E-value distribution plot will be written | | `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written | | `--circle-data` | Output file where circular taxonomy data will be written | | `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) | | `--log` | Output file where log messages will be written | | `--filtered-fasta` | Output FASTA file filtered for downstream analysis | | `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) | | `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) | | `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) | | `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit | | `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit | | `--min-bitscore` | Minimum bitscore required to retain a BLAST hit | | `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit | | `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) | | `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) | | `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) | | `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) | | `--ignore-seqids` | Ignore sequences containing this identifier substring | | `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned | --- ### Galaxy integration The tool is also available through the Galaxy platform: - **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed, enabling easy installation into any Galaxy instance. - **Web-based interface**: Users can upload sequence files, configure validation parameters through the GUI, run validations, and download results. - **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines. To use the tool in Galaxy: 1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor") 2. Upload your raw read and BLAST files to your Galaxy history 3. Configure parameters through the GUI 4. Run the tool 5. View results and download validation reports and tabular annotations ## License No license yet ## Citation If you use this software in your research, please cite this repository. ## Contact For questions or issues: - GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues - Email: onno.gorter@naturalis.nl (until Febuary 2026) ## Acknowledgments This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
