Mercurial > repos > onnodg > blast_annotations_processor
comparison README.md @ 3:ca2f07b71581 draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty
| author | onnodg |
|---|---|
| date | Thu, 12 Feb 2026 13:52:07 +0000 |
| parents | 9ca209477dfd |
| children | |
| 2:9ca209477dfd | 3:ca2f07b71581 |
|---|---|
| 49 | 49 |
| 50 | 50 |
| 51 **Example: Analyzing BLAST annotation result using curated database** | 51 **Example: Analyzing BLAST annotation result using curated database** |
| 52 | 52 |
| 53 ```bash | 53 ```bash |
| 54 python annotate_blast_results.py | 54 python annotate_blast_results.py \ |
| 55 --input-anno 'annotated_curated_results.tabular' | 55 --input-anno annotated_curated_results.tabular \ |
| 56 --input-unanno 'unannotated_reads.fasta' | 56 --input-unanno unannotated_reads.fasta \ |
| 57 --eval-plot 'eval_curated.png' | 57 --eval-plot eval_curated.png \ |
| 58 --taxa-output 'taxa_curated.txt' | 58 --taxa-output taxa_curated.txt \ |
| 59 --circle-data 'circle_curated.txt' | 59 --circle-data circle_curated.txt \ |
| 60 --header-anno 'anno_curated.xlsx' | 60 --header-anno anno_curated.xlsx \ |
| 61 --anno-stats 'stats_curated.txt' | 61 --log run.log \ |
| 62 --eval-threshold '1e-5' | 62 --filtered-fasta filtered_reads.fasta \ |
| 63 --uncertain-threshold '0.9' | 63 --eval-threshold 1e-10 \ |
| 64 --use-counts | 64 --uncertain-threshold 0.9 \ |
| 65 ``` | 65 --use-counts \ |
| 66 | 66 --min-identity 70 \ |
| 67 --min-coverage 69 \ | |
| 68 --min-bitscore 40 \ | |
| 69 --bitscore-perc-cutoff 0 \ | |
| 70 --ignore-rank "unknown,invalid" \ | |
| 71 --ignore-taxonomy "environmental" \ | |
| 72 --ignore-obiclean-type singleton \ | |
| 73 --ignore-illuminapairend-type pairend \ | |
| 74 --min-support 10 | |
| 75 ``` | |
| 67 This command will: | 76 This command will: |
| 68 | 77 |
| 69 - Parse the BLAST and FASTA files. | 78 - Parse the annotated BLAST results and the corresponding unannotated FASTA sequences. |
| 70 - Filter hits using `E-value ≤ 1e-5`, `uncertainty threshold ≥ 90%`, and use read count in the circular data output. | 79 |
| 71 - Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files. | 80 - Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering). |
| 72 | 81 |
| 82 - Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental". | |
| 83 | |
| 84 - Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads. | |
| 85 | |
| 86 - Use read counts when generating circular taxonomy outputs (`--use-counts`). | |
| 87 | |
| 88 - Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis. | |
| 73 | 89 |
| 74 **Example Input (`annotated_curated_results.tabular`)** | 90 **Example Input (`annotated_curated_results.tabular`)** |
| 75 | 91 |
| 76 | 92 |
| 77 ``` | 93 ``` |
| 107 |-------------------------------|--------|-------------| | 123 |-------------------------------|--------|-------------| |
| 108 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. | | 124 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. | |
| 109 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. | | 125 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. | |
| 110 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. | | 126 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. | |
| 111 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. | | 127 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. | |
| 112 | **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. | | 128 | **Log** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. | |
| 129 | **Filtered fasta** | `.fasta` | FASTA sequences that passed the configured thresholds, for use in downstream analysis (e.g. clustering). | |
| 113 | 130 |
| 114 | 131 |
| 115 **Output files (example)** | 132 **Output files (example)** |
| 116 | 133 |
| 117 | 134 |
| 228 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | 245 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium |
| 229 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium | 246 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium |
| 230 ``` | 247 ``` |
| 231 | 248 |
| 232 | 249 |
| 233 └── stats.txt | 250 └── log.txt |
| 234 ``` | 251 ``` |
| 235 metric value | 252 Starting processing for FASTA |
| 236 percentage_annotated 71.3862433862434 | 253 === PARAMETERS USED === |
| 237 annotated_sequences 3373 | 254 uncertain_threshold: 0.9 |
| 238 total_sequences 4725 | 255 eval_threshold: 1e-10 |
| 239 percentage_unique_annotated 89.46585409571608 | 256 use_counts: True |
| 240 unique_annotated 99826 | 257 ignore_rank: unknown |
| 241 total_unique 111580 | 258 ignore_taxonomy: environmental |
| 242 ``` | 259 bitscore_perc_cutoff: 8.0 |
| 243 | 260 min_bitscore: 100 |
| 261 ignore_obiclean_type: singleton | |
| 262 ignore_illuminapairend_type: pairend | |
| 263 min_identity: 80 | |
| 264 min_coverage: 70 | |
| 265 ignore_seqids: | |
| 266 min_support: 1 | |
| 267 === END PARAMETERS === | |
| 268 Filtered FASTA written succesfully(1790 sequences) | |
| 269 FASTA: total headers: 2156 | |
| 270 FASTA: headers kept after filters and min_support=1: 1790 | |
| 271 FASTA: removed due to header filters (illumina/obiclean/etc.): 366 | |
| 272 FASTA: removed due to low dereplicated count (<1): 0 | |
| 273 FASTA: total invalid (header filter + low support): 366 | |
| 274 Reading BLAST annotations | |
| 275 BLAST: total hits read: 4977 | |
| 276 BLAST: hits kept after quality filters: 3145 | |
| 277 BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832 | |
| 278 BLAST: hits removed due to invalid taxon: 0 | |
| 279 BLAST: hits removed due to ignored seqids: 0 | |
| 280 Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)'] | |
| 281 ANNOTATION: total FASTA headers considered: 1790 | |
| 282 ANNOTATION: reads with BLAST hits: 622 | |
| 283 ANNOTATION: reads without BLAST hits: 1168 | |
| 284 ANNOTATION: unique annotated count (from header counts): 49571 | |
| 285 ANNOTATION: total unique count (from FASTA): 66132 | |
| 286 E-value plot written succesfully | |
| 287 Taxa summary written succesfully | |
| 288 Header annotations written succesfully | |
| 289 Circle diagram JSON written succesfully | |
| 290 === ANNOTATION STATISTICS === | |
| 291 percentage_annotated: 28.84972170686456 | |
| 292 annotated_sequences: 622 | |
| 293 total_sequences: 2156 | |
| 294 percentage_unique_annotated: 74.95766043670235 | |
| 295 unique_annotated: 49571 | |
| 296 total_unique: 66132 | |
| 297 ``` | |
| 298 | |
| 299 └── filtered_fasta.fasta | |
| 300 ``` | |
| 301 >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0; | |
| 302 ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; | |
| 303 gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg | |
| 304 gataggtgcagagactcaatgg | |
| 305 | |
| 306 >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0; | |
| 307 ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; | |
| 308 gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca | |
| 309 taaagacagaataagaatacaaaaggataggtgcagagactcaatgg | |
| 310 ``` | |
| 244 --- | 311 --- |
| 245 | 312 |
| 246 #### CLI Arguments (common) | 313 #### CLI Arguments (common) |
| 247 | 314 |
| 248 | Argument | Description | | 315 | Argument | Description | |
| 249 |----------|-------------| | 316 |----------|-------------| |
| 250 | `--input-anno` | Path to the annotated BLAST results (tab-separated) | | 317 | `--input-anno` | Path to the annotated BLAST results (tab-separated) | |
| 251 | `--input-unanno` | Path to the unannotated reads FASTA file | | 318 | `--input-unanno` | Path to the unannotated reads FASTA file | |
| 252 | `--eval-plot` | Output file where eval plot output will be written | | 319 | `--eval-plot` | Output file where the E-value distribution plot will be written | |
| 253 | `--taxa-output` | Output file where taxa output will be written | | 320 | `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written | |
| 254 | `--circle-data` | Output file where circle data output will be written | | 321 | `--circle-data` | Output file where circular taxonomy data will be written | |
| 255 | `--header-anno` | Output file where header annotation results will be written | | 322 | `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) | |
| 256 | `--anno-stats` | Output file where annotation statistics will be written | | 323 | `--log` | Output file where log messages will be written | |
| 257 | `--eval-treshold` | Maximum E-value to retain hits (default: `1e-5`) | | 324 | `--filtered-fasta` | Output FASTA file filtered for downstream analysis | |
| 258 | `--uncertain-threshold` | percentage for which lca picks the majority taxon (default: `0.9 (90%)`) | | 325 | `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) | |
| 259 | `--use-counts` | Use read counts in the circle data output when true (default: `True`) | | 326 | `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) | |
| 327 | `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) | | |
| 328 | `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit | | |
| 329 | `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit | | |
| 330 | `--min-bitscore` | Minimum bitscore required to retain a BLAST hit | | |
| 331 | `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit | | |
| 332 | `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) | | |
| 333 | `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) | | |
| 334 | `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) | | |
| 335 | `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) | | |
| 336 | `--ignore-seqids` | Ignore sequences containing this identifier substring | | |
| 337 | `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned | | |
| 260 | 338 |
| 261 --- | 339 --- |
| 262 | 340 |
| 263 | 341 |
| 264 ### Galaxy integration | 342 ### Galaxy integration |
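
The quality thresholds used in the revised example command (`--eval-threshold 1e-10`, `--min-identity 70`, `--min-coverage 69`, `--min-bitscore 40`) can be illustrated with a minimal Python sketch. This is not the code from `annotate_blast_results.py`: the `Hit` class, its field names, and the `passes_quality_filters` helper are hypothetical, the sample values are illustrative, and the sketch assumes every threshold is inclusive, as the "≤" and "≥" wording in the README suggests. The `--bitscore-perc-cutoff` option is left out because its exact semantics are not described in the diff.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit (field names are assumptions)."""
    evalue: float
    identity: float   # percent identity
    coverage: float   # percent query coverage
    bitscore: float


def passes_quality_filters(hit, max_evalue=1e-10, min_identity=70.0,
                           min_coverage=69.0, min_bitscore=40.0):
    """Return True if the hit satisfies all four thresholds (inclusive)."""
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
        and hit.bitscore >= min_bitscore
    )


# Illustrative hits: one strong hit, one with a weak E-value, one with low identity.
hits = [
    Hit(evalue=8.25e-41, identity=98.876, coverage=100.0, bitscore=159.0),
    Hit(evalue=1e-6, identity=98.0, coverage=100.0, bitscore=120.0),
    Hit(evalue=1e-30, identity=65.0, coverage=100.0, bitscore=90.0),
]

kept = [h for h in hits if passes_quality_filters(h)]
print(f"{len(kept)} of {len(hits)} hits kept")
```

Only the first hit survives here: the second fails the E-value threshold and the third fails the identity threshold.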

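The percentages in the `=== ANNOTATION STATISTICS ===` block of the example log are consistent with simple ratios of the counts printed next to them. The short Python check below reproduces them under that assumption; the `percentage` helper is hypothetical and is not taken from the tool itself.

```python
def percentage(part, total):
    """Return part/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


# Counts copied from the example log.
annotated_sequences = 622
total_sequences = 2156
unique_annotated = 49571
total_unique = 66132

pct_annotated = percentage(annotated_sequences, total_sequences)
pct_unique_annotated = percentage(unique_annotated, total_unique)

print(round(pct_annotated, 2))         # 28.85
print(round(pct_unique_annotated, 2))  # 74.96
```

Note that in this log `percentage_annotated` is relative to all 2156 FASTA headers, not to the 1790 headers kept after filtering.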