comparison README.md @ 2:9ca209477dfd draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
| author | onnodg |
|---|---|
| date | Mon, 15 Dec 2025 16:43:36 +0000 |
| parents | |
| children |
| 1:2acf82433aa4 | 2:9ca209477dfd |
|---|---|
| 1 # BLAST Annotations Processor Script | |
| 2 | |
| 3 This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting. | |
| 4 | |
| 5 It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset. | |
| 6 | |
| 7 --- | |
| 8 | |
| 9 ## Usage | |
| 10 | |
| 11 The script performs the following main tasks: | |
| 12 | |
| 13 1. Parse command-line arguments. | |
| 14 2. Load the annotated BLAST results and the unannotated FASTA headers. | |
| 15 3. Group BLAST hits per read and filter them by specified thresholds. | |
| 16 4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules. | |
| 17 5. Generate statistics and annotation outputs for downstream use. | |
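Step 3 above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `group_and_filter` and the three-field row layout are assumptions made for the example, and the real tool may apply further filters.

```python
from collections import defaultdict


def group_and_filter(rows, eval_threshold=1e-5):
    """Group hits by query ID, keeping only hits at or below the E-value cutoff.

    `rows` is an iterable of (query_id, evalue, taxonomy) tuples; this
    layout is a simplification of the real tabular BLAST file.
    """
    hits_per_read = defaultdict(list)
    for query_id, evalue, taxonomy in rows:
        if float(evalue) <= eval_threshold:
            hits_per_read[query_id].append(taxonomy)
    return dict(hits_per_read)


rows = [
    ("read_1", "1.24e-38", "Ranunculus repens"),
    ("read_1", "5.79e-37", "Ranunculus muricatus"),
    ("read_2", "0.5", "Populus alba"),  # fails the cutoff, so read_2 gets no hits
]
grouped = group_and_filter(rows)
print(grouped)
```

Reads whose hits all fail the cutoff do not appear in the result, so they can later be counted as unannotated.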
| 18 | |
| 19 | |
| 20 ### Command Line Interface | |
| 21 The BLAST annotations processor can be run as a Python script: | |
| 22 | |
| 23 ```bash | |
| 24 python blast_annotations_processor.py [options] | |
| 25 ``` | |
| 26 | |
| 27 Below are detailed examples for common use cases. | |
| 28 | |
| 29 #### General use case | |
| 30 This example shows the general use of the tool. | |
| 31 | |
| 32 **Requirements**: | |
| 33 | |
| 34 Requirements as listed in the blast_annotations_processor xml file: | |
| 35 | |
| 36 - Python version=3.12.3 | |
| 37 - Matplotlib version=3.12.3 | |
| 38 - Pandas version=2.3.2 | |
| 39 - Numpy version=2.3.2 | |
| 40 - Openpyxl version=3.1.5 | |
| 41 | |
| 42 | |
| 43 **Input requirements** | |
| 44 | |
| 45 - BLAST tabular file with alignment metrics, source and taxa | |
| 46 - FASTA file with preprocessed reads | |
| 47 - Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers. | |
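A quick pre-flight check for the header correspondence requirement might look like the sketch below. It assumes the read ID is the first whitespace-delimited token of each FASTA header, which fits the example files in this README but is not confirmed as the script's own matching rule. The helper names are hypothetical.

```python
def fasta_ids(lines):
    """Return the first whitespace-delimited token of every FASTA header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def unmatched_queries(blast_query_ids, fasta_lines):
    """List BLAST query IDs that have no corresponding FASTA header."""
    return sorted(set(blast_query_ids) - fasta_ids(fasta_lines))


fasta_lines = [
    ">M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758;",
    "gggcaatcctgagccaaatcctgctttcag",
    ">M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595;",
    "gggcaatcctgagccaaatcctatttttcg",
]
blast_ids = [
    "M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)",
    "read_not_in_fasta",  # made-up ID to show a mismatch
]
missing = unmatched_queries(blast_ids, fasta_lines)
print(missing)
```

An empty list means every BLAST query ID was found among the FASTA headers.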
| 48 | |
| 49 | |
| 50 | |
| 51 **Example: Analyzing BLAST annotation results from a curated database** | |
| 52 | |
| 53 ```bash | |
| 54 python annotate_blast_results.py \ | |
| 55 --input-anno 'annotated_curated_results.tabular' \ | |
| 56 --input-unanno 'unannotated_reads.fasta' \ | |
| 57 --eval-plot 'eval_curated.png' \ | |
| 58 --taxa-output 'taxa_curated.txt' \ | |
| 59 --circle-data 'circle_curated.txt' \ | |
| 60 --header-anno 'anno_curated.xlsx' \ | |
| 61 --anno-stats 'stats_curated.txt' \ | |
| 62 --eval-threshold '1e-5' \ | |
| 63 --uncertain-threshold '0.9' \ | |
| 64 --use-counts | |
| 65 ``` | |
| 66 | |
| 67 This command will: | |
| 68 | |
| 69 - Parse the BLAST and FASTA files. | |
| 70 - Keep hits with `E-value ≤ 1e-5`, apply an uncertainty threshold of `0.9` (90%) when resolving conflicts, and use read counts in the circular data output. | |
| 71 - Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files. | |
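One way to picture the conflict resolution is the sketch below. It is an assumption about how a majority-based lowest common ancestor could work, not the script's confirmed rule set: at each rank the most common taxon is kept if its share of hits reaches the threshold, and otherwise that rank and all lower ranks become `Uncertain taxa`. The function name `resolve_lineage` is hypothetical.

```python
from collections import Counter


def resolve_lineage(lineages, uncertain_threshold=0.9):
    """Resolve one read's hits into a single lineage, rank by rank.

    `lineages` is a list of equally long lists ordered from the highest
    rank to species. Once a rank is uncertain, lower ranks are too.
    """
    resolved = []
    uncertain = False
    for rank_taxa in zip(*lineages):
        if not uncertain:
            taxon, count = Counter(rank_taxa).most_common(1)[0]
            if count / len(rank_taxa) >= uncertain_threshold:
                resolved.append(taxon)
                continue
            uncertain = True
        resolved.append("Uncertain taxa")
    return resolved


# The three hits of the first read in the example input above.
prefix = "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / "
hits = [
    prefix + "Ranunculus repens",
    prefix + "Ranunculus repens",
    prefix + "Ranunculus muricatus",
]
resolved = resolve_lineage([hit.split(" / ") for hit in hits])
print(" / ".join(resolved))
```

Under these assumed rules, two of three hits agreeing at species level (about 67%) falls short of the 90% threshold, so the read resolves to genus *Ranunculus* with an uncertain species.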
| 72 | |
| 73 | |
| 74 **Example Input (`annotated_curated_results.tabular`)** | |
| 75 | |
| 76 | |
| 77 ``` | |
| 78 #Query ID #Subject #Subject accession #Subject Taxonomy ID #Identity percentage #Coverage #evalue #bitscore #Source #Taxonomy | |
| 79 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=EU382995 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens | |
| 80 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=JQ041850 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens | |
| 81 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=DQ410740 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus muricatus markercode=trnL lat=NA lon=NA source=NCBI N/A 98.780 100 5.79e-37 147 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus | |
| 82 M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=HM590330 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus tremula markercode=trnL lat=50.47 lon=-104.37 source=NCBI N/A 100.000 100 2.16e-52 198 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula | |
| 83 M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=MH573985 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus alba markercode=trnL lat=NA lon=NA source=NCBI N/A 99.074 100 1.01e-50 193 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba | |
| 84 ... | |
| 85 ``` | |
| 86 | |
| 87 **Example FASTA (`unannotated_reads.fasta`)** | |
| 88 | |
| 89 | |
| 90 ``` | |
| 91 >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0; | |
| 92 ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; | |
| 93 gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg | |
| 94 gataggtgcagagactcaatgg | |
| 95 | |
| 96 >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0; | |
| 97 ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; | |
| 98 gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca | |
| 99 taaagacagaataagaatacaaaaggataggtgcagagactcaatgg | |
| 100 ... | |
| 101 ``` | |
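The `count=` attribute in these headers is the kind of value that `--use-counts` could draw on. The sketch below extracts it with a regular expression. Whether the script reads the count from this attribute or from the `(1758)` suffix of the ID is not stated here, so treat the helper `read_count` as a hypothetical illustration.

```python
import re

# Matches the "count=<integer>" attribute found in the example headers.
COUNT_RE = re.compile(r"\bcount=(\d+)")


def read_count(header, default=1):
    """Return the read count stored in a FASTA header, or `default` if absent."""
    match = COUNT_RE.search(header)
    return int(match.group(1)) if match else default


header = (
    ">M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) "
    "merged_sample={}; count=1758; direction=right; sminR=40.0;"
)
count = read_count(header)
print(count)
```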
| 102 | |
| 103 **Outputs** | |
| 104 | |
| 105 | |
| 106 | Output Type | Format | Description | | |
| 107 |-------------------------------|--------|-------------| | |
| 108 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. | | |
| 109 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. | | |
| 110 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. | | |
| 111 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. | | |
| 112 | **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. | | |
| 113 | |
| 114 | |
| 115 **Output files (example)** | |
| 116 | |
| 117 | |
| 118 outputs | |
| 119 | |
| 120 ├── eval.png | |
| 121 <img width="2100" height="900" alt="image" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" /> | |
| 122 | |
| 123 | |
| 124 ├── taxa.txt | |
| 125 ``` | |
| 126 Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285} | |
| 127 percentage_rooted number_rooted total_num taxon_level indentificatie | |
| 128 100.00 3373 3373 K Viridiplantae | |
| 129 100.00 3373 3373 P Streptophyta | |
| 130 99.97 3372 3373 C Magnoliopsida | |
| 131 1.96 66 3373 O Apiales | |
| 132 1.96 66 3373 F Apiaceae | |
| 133 1.22 41 3373 G Aegopodium | |
| 134 1.22 41 3373 S Aegopodium podagraria | |
| 135 0.27 9 3373 G Apium | |
| 136 0.27 9 3373 S Apium graveolens | |
| 137 0.47 16 3373 G Uncertain taxa | |
| 138 4.77 161 3373 O Asterales | |
| 139 4.77 161 3373 F Asteraceae | |
| 140 0.06 2 3373 G Achillea | |
| 141 0.06 2 3373 S Achillea millefolium | |
| 142 0.15 5 3373 G Artemisia | |
| 143 0.15 5 3373 S Uncertain taxa | |
| 144 0.03 1 3373 G Calendula | |
| 145 ... | |
| 146 4.57 154 3373 G Uncertain taxa | |
| 147 0.12 4 3373 F Uncertain taxa | |
| 148 0.53 18 3373 O Uncertain taxa | |
| 149 0.03 1 3373 C Pinopsida | |
| 150 0.03 1 3373 O Cupressales | |
| 151 0.03 1 3373 F Taxaceae | |
| 152 0.03 1 3373 G Taxus | |
| 153 0.03 1 3373 S Taxus baccata | |
| 154 ``` | |
| 155 ├── circle.txt | |
| 156 ``` | |
| 157 [ | |
| 158 { | |
| 159 "labels": [ | |
| 160 "Bacteria", | |
| 161 "Uncertain taxa", | |
| 162 "Viridiplantae" | |
| 163 ], | |
| 164 "sizes": [ | |
| 165 2, | |
| 166 1, | |
| 167 29 | |
| 168 ] | |
| 169 }, | |
| 170 { | |
| 171 "labels": [ | |
| 172 "Pseudomonadota", | |
| 173 "Uncertain taxa", | |
| 174 "Streptophyta" | |
| 175 ], | |
| 176 "sizes": [ | |
| 177 2, | |
| 178 1, | |
| 179 29 | |
| 180 ] | |
| 181 ... | |
| 182 ], | |
| 183 "sizes": [ | |
| 184 1, | |
| 185 1, | |
| 186 1, | |
| 187 1, | |
| 188 1, | |
| 189 1, | |
| 190 1, | |
| 191 1, | |
| 192 1, | |
| 193 1, | |
| 194 1, | |
| 195 3, | |
| 196 1, | |
| 197 1, | |
| 198 2, | |
| 199 2, | |
| 200 1, | |
| 201 1, | |
| 202 1, | |
| 203 1, | |
| 204 4, | |
| 205 1, | |
| 206 1, | |
| 207 1, | |
| 208 1 | |
| 209 ] | |
| 210 } | |
| 211 ] | |
| 212 ``` | |
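The circle data can be read back with the standard `json` module. The sketch below pairs each label with its size and converts sizes to fractions. It assumes each list entry holds parallel `labels` and `sizes` arrays, as in the excerpt above. The helper name `ring_fractions` is hypothetical.

```python
import json

# First entry of the excerpt above, inlined so the example is self-contained.
circle_json = """
[
  {"labels": ["Bacteria", "Uncertain taxa", "Viridiplantae"], "sizes": [2, 1, 29]}
]
"""


def ring_fractions(ring):
    """Map each label in one entry to its share of that entry's total size."""
    total = sum(ring["sizes"])
    if total == 0:
        return {label: 0.0 for label in ring["labels"]}
    return {label: size / total for label, size in zip(ring["labels"], ring["sizes"])}


rings = json.loads(circle_json)
fractions = ring_fractions(rings[0])
print(fractions)
```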
| 213 ├── anno.xlsx | |
| 214 ``` | |
| 215 header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species | |
| 216 M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium | |
| 217 M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium | |
| 218 M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria | |
| 219 M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS 3.51E-39 100 100 154 179 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria | |
| 220 M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS 5.79E-37 98.795 100 147 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria | |
| 221 ... | |
| 222 M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS 2.69E-35 97.59 100 141 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba | |
| 223 M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS 7.37E-36 100 95 143 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba | |
| 224 M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS 1.63E-37 98.795 100 148 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba | |
| 225 M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS 5E-43 100 100 167 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia hirsuta | |
| 226 M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor | |
| 227 M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor | |
| 228 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | |
| 229 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | |
| 230 ``` | |
| 231 | |
| 232 | |
| 233 └── stats.txt | |
| 234 ``` | |
| 235 metric value | |
| 236 percentage_annotated 71.3862433862434 | |
| 237 annotated_sequences 3373 | |
| 238 total_sequences 4725 | |
| 239 percentage_unique_annotated 89.46585409571608 | |
| 240 unique_annotated 99826 | |
| 241 total_unique 111580 | |
| 242 ``` | |
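The two percentages in this example are consistent with simple ratios of the counts listed next to them. The following sketch reproduces them. The helper `percentage` is hypothetical, and how the script itself derives the underlying counts is not shown here.

```python
def percentage(part, total):
    """Return part/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


# Values copied from the example stats.txt above.
pct_annotated = percentage(3373, 4725)
pct_unique_annotated = percentage(99826, 111580)
print(pct_annotated, pct_unique_annotated)
```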
| 243 | |
| 244 --- | |
| 245 | |
| 246 #### CLI Arguments (common) | |
| 247 | |
| 248 | Argument | Description | | |
| 249 |----------|-------------| | |
| 250 | `--input-anno` | Path to the annotated BLAST results (tab-separated) | | |
| 251 | `--input-unanno` | Path to the unannotated reads FASTA file | | |
| 252 | `--eval-plot` | Output path for the E-value distribution plot | |
| 253 | `--taxa-output` | Output path for the taxonomic composition report | |
| 254 | `--circle-data` | Output path for the circular taxonomy data | |
| 255 | `--header-anno` | Output path for the header annotation workbook | |
| 256 | `--anno-stats` | Output path for the annotation statistics | |
| 257 | `--eval-threshold` | Maximum E-value to retain hits (default: `1e-5`) | |
| 258 | `--uncertain-threshold` | Fraction of agreement at which the LCA step picks the majority taxon (default: `0.9`, i.e. 90%) | |
| 259 | `--use-counts` | Use read counts in the circle data output when true (default: `True`) | | |
| 260 | |
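For orientation, the arguments above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser. The spelling `--eval-threshold` follows the example command, and `argparse.BooleanOptionalAction` is an assumed way to reconcile a bare `--use-counts` flag with a default of `True`. The `--no-use-counts` form it generates is not documented for the tool.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line arguments."""
    parser = argparse.ArgumentParser(description="Process annotated BLAST results.")
    parser.add_argument("--input-anno", required=True, help="Annotated BLAST results (tab-separated)")
    parser.add_argument("--input-unanno", required=True, help="Unannotated reads FASTA file")
    # The five output paths share the same shape, so declare them in a loop.
    for flag in ("--eval-plot", "--taxa-output", "--circle-data", "--header-anno", "--anno-stats"):
        parser.add_argument(flag, help=f"Output path for {flag[2:]}")
    parser.add_argument("--eval-threshold", type=float, default=1e-5, help="Maximum E-value to retain hits")
    parser.add_argument("--uncertain-threshold", type=float, default=0.9, help="Majority fraction for the LCA step")
    # Assumption: a boolean flag pair; requires Python 3.9 or newer.
    parser.add_argument("--use-counts", action=argparse.BooleanOptionalAction, default=True)
    return parser


args = build_parser().parse_args(
    ["--input-anno", "annotated_curated_results.tabular", "--input-unanno", "unannotated_reads.fasta"]
)
print(args.eval_threshold, args.uncertain_threshold, args.use_counts)
```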
| 261 --- | |
| 262 | |
| 263 | |
| 264 ### Galaxy integration | |
| 265 | |
| 266 The tool is also available through the Galaxy platform: | |
| 267 | |
| 268 - **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed, | |
| 269 enabling easy installation into any Galaxy instance. | |
| 270 - **Web-based interface**: Users can upload sequence files, configure parameters through the GUI, | |
| 271 run the tool, and download results. | |
| 272 - **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines. | |
| 273 | |
| 274 To use the tool in Galaxy: | |
| 275 1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor") | |
| 276 2. Upload your FASTA reads and annotated BLAST file to your Galaxy history | |
| 277 3. Configure parameters through the GUI | |
| 278 4. Run the tool | |
| 279 5. View results and download the plots, reports, and tabular annotations | |
| 280 | |
| 281 ## License | |
| 282 | |
| 283 No license yet | |
| 284 | |
| 285 ## Citation | |
| 286 | |
| 287 If you use this software in your research, please cite this repository. | |
| 288 | |
| 289 ## Contact | |
| 290 | |
| 291 For questions or issues: | |
| 292 - GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues | |
| 293 - Email: onno.gorter@naturalis.nl (until February 2026) | |
| 294 | |
| 295 ## Acknowledgments | |
| 296 | |
| 297 This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer. |
