diff README.md @ 3:ca2f07b71581 draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty
author: onnodg
date: Thu, 12 Feb 2026 13:52:07 +0000
parents: 9ca209477dfd
--- a/README.md	Mon Dec 15 16:43:36 2025 +0000
+++ b/README.md	Thu Feb 12 13:52:07 2026 +0000
@@ -51,25 +51,41 @@
 **Example: Analyzing BLAST annotation result using curated database**
 
 ```bash
-python annotate_blast_results.py
---input-anno 'annotated_curated_results.tabular'
---input-unanno 'unannotated_reads.fasta'
---eval-plot 'eval_curated.png'
---taxa-output 'taxa_curated.txt'
---circle-data 'circle_curated.txt'
---header-anno 'anno_curated.xlsx'
---anno-stats 'stats_curated.txt' 
---eval-threshold '1e-5'
---uncertain-threshold '0.9'
---use-counts
+python annotate_blast_results.py \
+  --input-anno annotated_curated_results.tabular \
+  --input-unanno unannotated_reads.fasta \
+  --eval-plot eval_curated.png \
+  --taxa-output taxa_curated.txt \
+  --circle-data circle_curated.txt \
+  --header-anno anno_curated.xlsx \
+  --log run.log \
+  --filtered-fasta filtered_reads.fasta \
+  --eval-threshold 1e-10 \
+  --uncertain-threshold 0.9 \
+  --use-counts \
+  --min-identity 70 \
+  --min-coverage 69 \
+  --min-bitscore 40 \
+  --bitscore-perc-cutoff 0 \
+  --ignore-rank "unknown,invalid" \
+  --ignore-taxonomy "environmental" \
+  --ignore-obiclean-type singleton \
+  --ignore-illuminapairend-type pairend \
+  --min-support 10
 ```
-
 This command will:
 
-- Parse the BLAST and FASTA files.  
-- Filter hits using `E-value ≤ 1e-5`, `uncertainty threshold ≥ 90%`, and use read count in the circular data output.  
-- Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.
+- Parse the annotated BLAST results and the corresponding unannotated FASTA sequences.
+
+- Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering).
+
+- Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental".
 
+- Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads.
+
+- Use read counts when generating circular taxonomy outputs (`--use-counts`).
+
+- Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis.
 
 **Example Input (`annotated_curated_results.tabular`)**
 
@@ -109,7 +125,8 @@
 | **Taxonomic composition**     | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
 | **Circular taxonomy data**    | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
 | **Header annotations**        | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
-| **Annotation statistics**     | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+| **Log**                       | `.txt` | Run log listing the parameters used, filtering statistics, and summary metrics such as number of annotated reads, unassigned reads, and unique taxa detected. |
+| **Filtered FASTA**            | `.fasta` | FASTA file containing the sequences that passed the configured filters, for use in downstream analysis (clustering). |
 
 
 **Output files (example)**
@@ -230,17 +247,67 @@
 ```
 
 
-└── stats.txt
+└── log.txt
 ```
-metric	value
-percentage_annotated	71.3862433862434
-annotated_sequences	3373
-total_sequences	4725
-percentage_unique_annotated	89.46585409571608
-unique_annotated	99826
-total_unique	111580
+Starting processing for FASTA
+=== PARAMETERS USED ===
+uncertain_threshold: 0.9
+eval_threshold: 1e-10
+use_counts: True
+ignore_rank: unknown
+ignore_taxonomy: environmental
+bitscore_perc_cutoff: 8.0
+min_bitscore: 100
+ignore_obiclean_type: singleton
+ignore_illuminapairend_type: pairend
+min_identity: 80
+min_coverage: 70
+ignore_seqids: 
+min_support: 1
+=== END PARAMETERS ===
+Filtered FASTA written succesfully(1790 sequences)
+FASTA: total headers: 2156
+FASTA: headers kept after filters and min_support=1: 1790
+FASTA: removed due to header filters (illumina/obiclean/etc.): 366
+FASTA: removed due to low dereplicated count (<1): 0
+FASTA: total invalid (header filter + low support): 366
+Reading BLAST annotations
+BLAST: total hits read: 4977
+BLAST: hits kept after quality filters: 3145
+BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832
+BLAST: hits removed due to invalid taxon: 0
+BLAST: hits removed due to ignored seqids: 0
+Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)']
+ANNOTATION: total FASTA headers considered: 1790
+ANNOTATION: reads with BLAST hits: 622
+ANNOTATION: reads without BLAST hits: 1168
+ANNOTATION: unique annotated count (from header counts): 49571
+ANNOTATION: total unique count (from FASTA): 66132
+E-value plot written succesfully
+Taxa summary written succesfully
+Header annotations written succesfully
+Circle diagram JSON written succesfully
+=== ANNOTATION STATISTICS ===
+percentage_annotated: 28.84972170686456
+annotated_sequences: 622
+total_sequences: 2156
+percentage_unique_annotated: 74.95766043670235
+unique_annotated: 49571
+total_unique: 66132
 ```
 
+└── filtered_fasta.fasta
+```
+    >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
+    ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+    gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
+    gataggtgcagagactcaatgg
+
+    >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
+    ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; 
+    gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
+    taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
+```
 ---
 
 #### CLI Arguments (common)
@@ -249,14 +316,25 @@
 |----------|-------------|
 | `--input-anno` | Path to the annotated BLAST results (tab-separated) |
 | `--input-unanno` | Path to the unannotated reads FASTA file |
-| `--eval-plot` | Output file where eval plot output will be written |
-| `--taxa-output` | Output file where taxa output will be written |
-| `--circle-data` | Output file where circle data output will be written |
-| `--header-anno` | Output file where header annotation results will be written |
-| `--anno-stats` | Output file where annotation statistics will be written |
-| `--eval-treshold` | Maximum E-value to retain hits (default: `1e-5`) |
-| `--uncertain-threshold` | percentage for which lca picks the majority taxon (default: `0.9 (90%)`) |
-| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
+| `--eval-plot` | Output file where the E-value distribution plot will be written |
+| `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written |
+| `--circle-data` | Output file where circular taxonomy data will be written |
+| `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) |
+| `--log` | Output file where log messages will be written |
+| `--filtered-fasta` | Output FASTA file filtered for downstream analysis |
+| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) |
+| `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) |
+| `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) |
+| `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit |
+| `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit |
+| `--min-bitscore` | Minimum bitscore required to retain a BLAST hit |
+| `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit |
+| `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) |
+| `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) |
+| `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) |
+| `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) |
+| `--ignore-seqids` | Ignore sequences containing this identifier substring |
+| `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned |
 
 ---
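
The quality thresholds listed in the CLI table (`--eval-threshold`, `--min-identity`, `--min-coverage`, `--min-bitscore`) can be pictured as a simple per-hit predicate. The sketch below is only an illustration of that idea and is not taken from `annotate_blast_results.py`: the `Hit` class, its field names, and the helper functions are hypothetical, and the default values simply mirror the example command in this README. The bitscore percentage cutoff and the LCA step are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit. Field names are hypothetical, not the tool's own."""
    query_id: str
    evalue: float
    identity: float  # percent identity, 0-100
    coverage: float  # percent query coverage, 0-100
    bitscore: float


def passes_quality_filters(hit, eval_threshold=1e-10, min_identity=70.0,
                           min_coverage=69.0, min_bitscore=40.0):
    """Return True when the hit satisfies all four thresholds."""
    return (
        hit.evalue <= eval_threshold
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
        and hit.bitscore >= min_bitscore
    )


def filter_hits(hits, **thresholds):
    """Split hits into the kept list and a count of removed hits."""
    kept = [h for h in hits if passes_quality_filters(h, **thresholds)]
    return kept, len(hits) - len(kept)


if __name__ == "__main__":
    # Made-up hits, purely for illustration.
    example_hits = [
        Hit("read_1", 1e-30, 98.5, 100.0, 150.0),  # passes every threshold
        Hit("read_2", 1e-5, 99.0, 100.0, 150.0),   # E-value too high
        Hit("read_3", 1e-30, 65.0, 100.0, 150.0),  # identity too low
    ]
    kept, removed = filter_hits(example_hits)
    print(f"hits kept after quality filters: {len(kept)}")
    print(f"hits filtered: {removed}")
```

Passing different keyword arguments, for example `filter_hits(example_hits, min_identity=80.0, min_coverage=70.0, min_bitscore=100.0)`, mimics a stricter run such as the one shown in the example log.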