comparison README.md @ 3:ca2f07b71581 draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty
author onnodg
date Thu, 12 Feb 2026 13:52:07 +0000
parents 9ca209477dfd
children
2:9ca209477dfd 3:ca2f07b71581
49 49
50 50
51 **Example: Analyzing BLAST annotation result using curated database** 51 **Example: Analyzing BLAST annotation result using curated database**
52 52
53 ```bash 53 ```bash
54 python annotate_blast_results.py 54 python annotate_blast_results.py \
55 --input-anno 'annotated_curated_results.tabular' 55 --input-anno annotated_curated_results.tabular \
56 --input-unanno 'unannotated_reads.fasta' 56 --input-unanno unannotated_reads.fasta \
57 --eval-plot 'eval_curated.png' 57 --eval-plot eval_curated.png \
58 --taxa-output 'taxa_curated.txt' 58 --taxa-output taxa_curated.txt \
59 --circle-data 'circle_curated.txt' 59 --circle-data circle_curated.txt \
60 --header-anno 'anno_curated.xlsx' 60 --header-anno anno_curated.xlsx \
61 --anno-stats 'stats_curated.txt' 61 --log run.log \
62 --eval-threshold '1e-5' 62 --filtered-fasta filtered_reads.fasta \
63 --uncertain-threshold '0.9' 63 --eval-threshold 1e-10 \
64 --use-counts 64 --uncertain-threshold 0.9 \
65 ``` 65 --use-counts \
66 66 --min-identity 70 \
67 --min-coverage 69 \
68 --min-bitscore 40 \
69 --bitscore-perc-cutoff 0 \
70 --ignore-rank "unknown,invalid" \
71 --ignore-taxonomy "environmental" \
72 --ignore-obiclean-type singleton \
73 --ignore-illuminapairend-type pairend \
74 --min-support 10
75 ```
67 This command will: 76 This command will:
68 77
69 - Parse the BLAST and FASTA files. 78 Parse the annotated BLAST results and the corresponding unannotated FASTA sequences.
70 - Filter hits using `E-value ≤ 1e-5`, `uncertainty threshold ≥ 90%`, and use read count in the circular data output. 79
71 - Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files. 80 Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering).
72 81
82 Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental".
83
84 Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads.
85
86 Use read counts when generating circular taxonomy outputs (--use-counts).
87
88 Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis.
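
The hit-level thresholds listed above amount to a conjunction of simple comparisons. The following is a minimal Python sketch of that idea, not the code in `annotate_blast_results.py`: the function name `passes_quality_filters` and the dictionary keys `evalue`, `identity`, `coverage`, and `bitscore` are hypothetical, and the sample hits are made up for illustration.

```python
def passes_quality_filters(hit, eval_threshold=1e-10, min_identity=70.0,
                           min_coverage=69.0, min_bitscore=40.0):
    """Return True when a hit meets every threshold from the example command.

    `hit` is assumed to be a dict with hypothetical keys; the real tool may
    store BLAST hits differently.
    """
    return (
        hit["evalue"] <= eval_threshold        # --eval-threshold
        and hit["identity"] >= min_identity    # --min-identity (%)
        and hit["coverage"] >= min_coverage    # --min-coverage (%)
        and hit["bitscore"] >= min_bitscore    # --min-bitscore
    )


# Illustrative hits only; they are not taken from a real run.
hits = [
    {"evalue": 8.25e-41, "identity": 98.876, "coverage": 100.0, "bitscore": 159.0},
    {"evalue": 1e-6, "identity": 98.0, "coverage": 100.0, "bitscore": 159.0},    # E-value too high
    {"evalue": 1e-30, "identity": 65.0, "coverage": 100.0, "bitscore": 120.0},   # identity too low
]
kept = [h for h in hits if passes_quality_filters(h)]
```

Only the first hit passes here, because the other two each violate one threshold.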
73 89
74 **Example Input (`annotated_curated_results.tabular`)** 90 **Example Input (`annotated_curated_results.tabular`)**
75 91
76 92
77 ``` 93 ```
107 |-------------------------------|--------|-------------| 123 |-------------------------------|--------|-------------|
108 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. | 124 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
109 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. | 125 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
110 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. | 126 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
111 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. | 127 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
112 | **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. | 128 | **Log** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
129 | **Filtered FASTA** | `.fasta` | FASTA sequences that passed the configured filters, for use in downstream analysis (e.g. clustering). |
113 130
114 131
115 **Output files (example)** 132 **Output files (example)**
116 133
117 134
228 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium 245 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
229 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium 246 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
230 ``` 247 ```
231 248
232 249
233 └── stats.txt 250 └── log.txt
234 ``` 251 ```
235 metric value 252 Starting processing for FASTA
236 percentage_annotated 71.3862433862434 253 === PARAMETERS USED ===
237 annotated_sequences 3373 254 uncertain_threshold: 0.9
238 total_sequences 4725 255 eval_threshold: 1e-10
239 percentage_unique_annotated 89.46585409571608 256 use_counts: True
240 unique_annotated 99826 257 ignore_rank: unknown
241 total_unique 111580 258 ignore_taxonomy: environmental
242 ``` 259 bitscore_perc_cutoff: 8.0
243 260 min_bitscore: 100
261 ignore_obiclean_type: singleton
262 ignore_illuminapairend_type: pairend
263 min_identity: 80
264 min_coverage: 70
265 ignore_seqids:
266 min_support: 1
267 === END PARAMETERS ===
268 Filtered FASTA written succesfully(1790 sequences)
269 FASTA: total headers: 2156
270 FASTA: headers kept after filters and min_support=1: 1790
271 FASTA: removed due to header filters (illumina/obiclean/etc.): 366
272 FASTA: removed due to low dereplicated count (<1): 0
273 FASTA: total invalid (header filter + low support): 366
274 Reading BLAST annotations
275 BLAST: total hits read: 4977
276 BLAST: hits kept after quality filters: 3145
277 BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832
278 BLAST: hits removed due to invalid taxon: 0
279 BLAST: hits removed due to ignored seqids: 0
280 Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)']
281 ANNOTATION: total FASTA headers considered: 1790
282 ANNOTATION: reads with BLAST hits: 622
283 ANNOTATION: reads without BLAST hits: 1168
284 ANNOTATION: unique annotated count (from header counts): 49571
285 ANNOTATION: total unique count (from FASTA): 66132
286 E-value plot written succesfully
287 Taxa summary written succesfully
288 Header annotations written succesfully
289 Circle diagram JSON written succesfully
290 === ANNOTATION STATISTICS ===
291 percentage_annotated: 28.84972170686456
292 annotated_sequences: 622
293 total_sequences: 2156
294 percentage_unique_annotated: 74.95766043670235
295 unique_annotated: 49571
296 total_unique: 66132
297 ```
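
The counts in the example log above are mutually consistent, and the two percentages can be reproduced from the counts printed next to them. The check below uses only numbers from that log; the formulas are inferred from those numbers rather than read from the tool's source.

```python
# Values copied from the example log above.
total_headers = 2156          # FASTA: total headers
removed_by_filters = 366      # FASTA: total invalid (header filter + low support)
headers_kept = 1790           # FASTA: headers kept after filters
reads_with_hits = 622         # ANNOTATION: reads with BLAST hits
reads_without_hits = 1168     # ANNOTATION: reads without BLAST hits
blast_total = 4977            # BLAST: total hits read
blast_filtered = 1832         # BLAST: hits filtered
blast_kept = 3145             # BLAST: hits kept after quality filters
unique_annotated = 49571
total_unique = 66132

# Bookkeeping identities implied by the log lines.
assert total_headers - removed_by_filters == headers_kept
assert headers_kept - reads_with_hits == reads_without_hits
assert blast_total - blast_filtered == blast_kept

# Percentages as reported in the ANNOTATION STATISTICS block.
percentage_annotated = 100 * reads_with_hits / total_headers
percentage_unique_annotated = 100 * unique_annotated / total_unique

print(round(percentage_annotated, 2))         # 28.85
print(round(percentage_unique_annotated, 2))  # 74.96
```

Note that `percentage_annotated` is relative to all 2156 FASTA headers, not to the 1790 headers that remain after filtering.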
298
299 └── filtered_fasta.fasta
300 ```
301 >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
302 ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
303 gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
304 gataggtgcagagactcaatgg
305
306 >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
307 ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
308 gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
309 taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
310 ```
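
The FASTA headers above carry the dereplicated read count as a `count=<n>;` attribute. A minimal sketch of reading that value in Python follows; the helper name `read_count` and the fallback of 1 for headers without the attribute are assumptions made for this example, not behaviour confirmed for the tool.

```python
import re

# Matches the "count=<integer>;" attribute seen in the example headers.
# The leading \b avoids matching inside longer keys such as "seq_count=".
COUNT_RE = re.compile(r"\bcount=(\d+);")


def read_count(header, default=1):
    """Return the read count stored in a FASTA header, or `default` if absent."""
    match = COUNT_RE.search(header)
    return int(match.group(1)) if match else default


header = (
    ">M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) "
    "merged_sample={}; count=1758; direction=right; sminR=40.0;"
)
print(read_count(header))  # 1758
```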
244 --- 311 ---
245 312
246 #### CLI Arguments (common) 313 #### CLI Arguments (common)
247 314
248 | Argument | Description | 315 | Argument | Description |
249 |----------|-------------| 316 |----------|-------------|
250 | `--input-anno` | Path to the annotated BLAST results (tab-separated) | 317 | `--input-anno` | Path to the annotated BLAST results (tab-separated) |
251 | `--input-unanno` | Path to the unannotated reads FASTA file | 318 | `--input-unanno` | Path to the unannotated reads FASTA file |
252 | `--eval-plot` | Output file where eval plot output will be written | 319 | `--eval-plot` | Output file where the E-value distribution plot will be written |
253 | `--taxa-output` | Output file where taxa output will be written | 320 | `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written |
254 | `--circle-data` | Output file where circle data output will be written | 321 | `--circle-data` | Output file where circular taxonomy data will be written |
255 | `--header-anno` | Output file where header annotation results will be written | 322 | `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) |
256 | `--anno-stats` | Output file where annotation statistics will be written | 323 | `--log` | Output file where log messages will be written |
257 | `--eval-treshold` | Maximum E-value to retain hits (default: `1e-5`) | 324 | `--filtered-fasta` | Output FASTA file filtered for downstream analysis |
258 | `--uncertain-threshold` | percentage for which lca picks the majority taxon (default: `0.9 (90%)`) | 325 | `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) |
259 | `--use-counts` | Use read counts in the circle data output when true (default: `True`) | 326 | `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) |
327 | `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) |
328 | `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit |
329 | `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit |
330 | `--min-bitscore` | Minimum bitscore required to retain a BLAST hit |
331 | `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit |
332 | `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) |
333 | `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) |
334 | `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) |
335 | `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) |
336 | `--ignore-seqids` | Ignore sequences containing this identifier substring |
337 | `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned |
260 338
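
To illustrate how a proportion such as `--uncertain-threshold` can interact with an LCA-style assignment, here is one possible interpretation sketched in Python. It is an assumption, not the tool's confirmed algorithm: lineages are walked from the highest rank downward, and a taxon is accepted at each rank only while it is supported by at least `threshold` of all hits for the read. The function name `majority_lca` and the example lineages are hypothetical.

```python
from collections import Counter


def majority_lca(lineages, threshold=0.9):
    """Return the deepest lineage prefix supported by >= threshold of all hits.

    `lineages` is a list of tuples ordered from highest to lowest rank.
    This is an illustrative interpretation of a majority-based LCA.
    """
    total = len(lineages)
    consensus = []
    candidates = list(lineages)
    depth = 0
    while candidates:
        names = [lin[depth] for lin in candidates if len(lin) > depth]
        if not names:
            break  # no lineage extends to this rank
        taxon, votes = Counter(names).most_common(1)[0]
        if votes / total < threshold:
            break  # support too low: stop at the previous rank
        consensus.append(taxon)
        # Continue only with the lineages that agree with the accepted taxon.
        candidates = [lin for lin in candidates
                      if len(lin) > depth and lin[depth] == taxon]
        depth += 1
    return consensus


# Illustrative lineages (family, genus, species); not taken from a real run.
agree = ("Asteraceae", "Xanthium", "Xanthium strumarium")
differ = ("Asteraceae", "Xanthium", "Xanthium orientale")

print(majority_lca([agree] * 19 + [differ]))     # species level: 95% support
print(majority_lca([agree] * 8 + [differ] * 2))  # genus level: only 80% at species
```

With 19 of 20 hits agreeing, the species is kept; with 8 of 10, the assignment falls back to the genus.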
261 --- 339 ---
262 340
263 341
264 ### Galaxy integration 342 ### Galaxy integration