blast_annotations_processor: README.md comparison

comparison README.md @ 3:ca2f07b71581 draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty

author	onnodg
date	Thu, 12 Feb 2026 13:52:07 +0000
parents	9ca209477dfd
children

-:9ca209477dfd
+:ca2f07b71581
 **Example: Analyzing BLAST annotation result using curated database**
 ```bash
-python annotate_blast_results.py
+python annotate_blast_results.py \
---input-anno 'annotated_curated_results.tabular'
+--input-anno annotated_curated_results.tabular \
---input-unanno 'unannotated_reads.fasta'
+--input-unanno unannotated_reads.fasta \
---eval-plot 'eval_curated.png'
+--eval-plot eval_curated.png \
---taxa-output 'taxa_curated.txt'
+--taxa-output taxa_curated.txt \
---circle-data 'circle_curated.txt'
+--circle-data circle_curated.txt \
---header-anno 'anno_curated.xlsx'
+--header-anno anno_curated.xlsx \
---anno-stats 'stats_curated.txt'
+--log run.log \
---eval-threshold '1e-5'
+--filtered-fasta filtered_reads.fasta \
---uncertain-threshold '0.9'
+--eval-threshold 1e-10 \
---use-counts
+--uncertain-threshold 0.9 \
-```
+--use-counts \
+--min-identity 70 \
+--min-coverage 69 \
+--min-bitscore 40 \
+--bitscore-perc-cutoff 0 \
+--ignore-rank "unknown,invalid" \
+--ignore-taxonomy "environmental" \
+--ignore-obiclean-type singleton \
+--ignore-illuminapairend-type pairend \
+--min-support 10
+```
 This command will:
-- Parse the BLAST and FASTA files.
+- Parse the annotated BLAST results and the corresponding unannotated FASTA sequences.
-- Filter hits using `E-value ≤ 1e-5`, `uncertainty threshold ≥ 90%`, and use read count in the circular data output.
-- Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.
+- Filter BLAST hits using E-value ≤ 1e-10, identity ≥ 70%, coverage ≥ 69%, and bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering).
+- Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown" or "invalid" and taxonomy containing "environmental".
+- Exclude sequences flagged as obiclean type `singleton` and sequences marked as illuminapairedend type `pairend` (merge failure), and require a minimum taxonomic support of 10 reads.
+- Use read counts when generating circular taxonomy outputs (`--use-counts`).
+- Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis.
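The quality filtering described in the list above can be illustrated with a minimal sketch. This is not the tool's actual code: the `Hit` class, its field names, and `passes_quality_filters` are hypothetical, and the thresholds simply mirror the values passed in the example command. The bitscore percentage cutoff is left out because its exact semantics are not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit (hypothetical structure, for illustration only)."""
    query_id: str
    evalue: float
    identity: float  # percent identity
    coverage: float  # percent query coverage
    bitscore: float


def passes_quality_filters(hit, max_evalue=1e-10, min_identity=70.0,
                           min_coverage=69.0, min_bitscore=40.0):
    """Return True if the hit satisfies all four thresholds."""
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
        and hit.bitscore >= min_bitscore
    )


hits = [
    Hit("read_a", 8.25e-41, 98.876, 100.0, 159.0),  # passes every filter
    Hit("read_b", 1e-6, 95.0, 100.0, 80.0),         # E-value too high
    Hit("read_c", 1e-20, 65.0, 100.0, 90.0),        # identity too low
]

kept = [h for h in hits if passes_quality_filters(h)]
print([h.query_id for h in kept])
```

Only `read_a` survives; the other two hits each fail a single threshold, which is enough to drop them.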
 **Example Input (`annotated_curated_results.tabular`)**
 ```
 |-------------------------------|--------|-------------|
 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
 | **Taxonomic composition**     | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
 | **Circular taxonomy data**    | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
 | **Header annotations**        | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
-| **Annotation statistics**     | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+| **Log**     | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+| **Filtered FASTA**            | `.fasta` | FASTA sequences that passed the set thresholds, for use in downstream analysis (clustering). |
 **Output files (example)**
 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
 ```
-└── stats.txt
+└── log.txt
 ```
-metric	value
+Starting processing for FASTA
-percentage_annotated	71.3862433862434
+=== PARAMETERS USED ===
-annotated_sequences	3373
+uncertain_threshold: 0.9
-total_sequences	4725
+eval_threshold: 1e-10
-percentage_unique_annotated	89.46585409571608
+use_counts: True
-unique_annotated	99826
+ignore_rank: unknown
-total_unique	111580
+ignore_taxonomy: environmental
-```
+bitscore_perc_cutoff: 8.0
+min_bitscore: 100
+ignore_obiclean_type: singleton
+ignore_illuminapairend_type: pairend
+min_identity: 80
+min_coverage: 70
+ignore_seqids:
+min_support: 1
+=== END PARAMETERS ===
+Filtered FASTA written succesfully(1790 sequences)
+FASTA: total headers: 2156
+FASTA: headers kept after filters and min_support=1: 1790
+FASTA: removed due to header filters (illumina/obiclean/etc.): 366
+FASTA: removed due to low dereplicated count (<1): 0
+FASTA: total invalid (header filter + low support): 366
+Reading BLAST annotations
+BLAST: total hits read: 4977
+BLAST: hits kept after quality filters: 3145
+BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832
+BLAST: hits removed due to invalid taxon: 0
+BLAST: hits removed due to ignored seqids: 0
+Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)']
+ANNOTATION: total FASTA headers considered: 1790
+ANNOTATION: reads with BLAST hits: 622
+ANNOTATION: reads without BLAST hits: 1168
+ANNOTATION: unique annotated count (from header counts): 49571
+ANNOTATION: total unique count (from FASTA): 66132
+E-value plot written succesfully
+Taxa summary written succesfully
+Header annotations written succesfully
+Circle diagram JSON written succesfully
+=== ANNOTATION STATISTICS ===
+percentage_annotated: 28.84972170686456
+annotated_sequences: 622
+total_sequences: 2156
+percentage_unique_annotated: 74.95766043670235
+unique_annotated: 49571
+total_unique: 66132
+```
+└── filtered_fasta.fasta
+```
+>M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
+ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
+gataggtgcagagactcaatgg
+>M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
+ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
+taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
+```
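The headers in the filtered FASTA carry a `count=` attribute, which is presumably what `--use-counts` and `--min-support` rely on. A minimal sketch of extracting it, assuming the attribute always appears as `count=<integer>` in the header line; the `read_count` helper and its default of 1 are hypothetical and not taken from the tool:

```python
import re

# Matches the dereplicated read count attribute, e.g. "count=1758".
COUNT_RE = re.compile(r"\bcount=(\d+)")


def read_count(header, default=1):
    """Return the read count stored in a FASTA header, or `default` if absent."""
    match = COUNT_RE.search(header)
    return int(match.group(1)) if match else default


header = (
    ">M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) "
    "merged_sample={}; count=1758; direction=right; sminR=40.0;"
)
print(read_count(header))
```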
 ---
 #### CLI Arguments (common)
 | Argument | Description |
 |----------|-------------|
 | `--input-anno` | Path to the annotated BLAST results (tab-separated) |
 | `--input-unanno` | Path to the unannotated reads FASTA file |
-| `--eval-plot` | Output file where eval plot output will be written |
+| `--eval-plot` | Output file where the E-value distribution plot will be written |
-| `--taxa-output` | Output file where taxa output will be written |
+| `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written |
-| `--circle-data` | Output file where circle data output will be written |
+| `--circle-data` | Output file where circular taxonomy data will be written |
-| `--header-anno` | Output file where header annotation results will be written |
+| `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) |
-| `--anno-stats` | Output file where annotation statistics will be written |
+| `--log` | Output file where log messages will be written |
-| `--eval-treshold` | Maximum E-value to retain hits (default: `1e-5`) |
+| `--filtered-fasta` | Output FASTA file filtered for downstream analysis |
-| `--uncertain-threshold` | percentage for which lca picks the majority taxon (default: `0.9 (90%)`) |
+| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) |
-| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
+| `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) |
+| `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) |
+| `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit |
+| `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit |
+| `--min-bitscore` | Minimum bitscore required to retain a BLAST hit |
+| `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit |
+| `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) |
+| `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) |
+| `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) |
+| `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) |
+| `--ignore-seqids` | Ignore sequences containing this identifier substring |
+| `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned |
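To make `--uncertain-threshold` more concrete, here is a minimal sketch of a majority-based LCA. It assumes lineages are strings separated by ` / ` (as in the example input) and that a taxon is accepted at a rank when its share of all hits for a read reaches the threshold. The function name `majority_lca` and this exact rule are assumptions for illustration; the tool's implementation may differ.

```python
from collections import Counter


def majority_lca(lineages, threshold=0.9):
    """Return the deepest lineage prefix supported by >= threshold of all hits."""
    consensus = []
    candidates = [lineage.split(" / ") for lineage in lineages]
    depth = 0
    while candidates:
        names = [c[depth] for c in candidates if len(c) > depth]
        if not names:
            break  # no candidate lineage extends to this depth
        taxon, n = Counter(names).most_common(1)[0]
        if n / len(lineages) < threshold:
            break  # majority too weak: stop at the previous rank
        consensus.append(taxon)
        # Only lineages that agree with the consensus so far stay in play.
        candidates = [c for c in candidates if len(c) > depth and c[depth] == taxon]
        depth += 1
    return " / ".join(consensus)


hits = ["Asterales / Asteraceae / Xanthium"] * 8 + ["Asterales / Asteraceae / Ambrosia"] * 2
print(majority_lca(hits, threshold=0.9))   # stops at family level
print(majority_lca(hits, threshold=0.75))  # genus reaches the lower threshold
```

With a threshold of 0.9 the 8-to-2 split at genus level is not decisive, so the assignment falls back to the family; lowering the threshold lets the majority genus through.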
 ---
 ### Galaxy integration
