diff README.md @ 3:ca2f07b71581 draft default tip
planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 600e5a50a13a3a16a1970d6d4d31cb4f7bd549bf-dirty
| author | onnodg |
|---|---|
| date | Thu, 12 Feb 2026 13:52:07 +0000 |
| parents | 9ca209477dfd |
| children | |
--- a/README.md	Mon Dec 15 16:43:36 2025 +0000
+++ b/README.md	Thu Feb 12 13:52:07 2026 +0000
@@ -51,25 +51,41 @@
 **Example: Analyzing BLAST annotation result using curated database**

 ```bash
-python annotate_blast_results.py
---input-anno 'annotated_curated_results.tabular'
---input-unanno 'unannotated_reads.fasta'
---eval-plot 'eval_curated.png'
---taxa-output 'taxa_curated.txt'
---circle-data 'circle_curated.txt'
---header-anno 'anno_curated.xlsx'
---anno-stats 'stats_curated.txt'
---eval-threshold '1e-5'
---uncertain-threshold '0.9'
---use-counts
+python annotate_blast_results.py \
+  --input-anno annotated_curated_results.tabular \
+  --input-unanno unannotated_reads.fasta \
+  --eval-plot eval_curated.png \
+  --taxa-output taxa_curated.txt \
+  --circle-data circle_curated.txt \
+  --header-anno anno_curated.xlsx \
+  --log run.log \
+  --filtered-fasta filtered_reads.fasta \
+  --eval-threshold 1e-10 \
+  --uncertain-threshold 0.9 \
+  --use-counts \
+  --min-identity 70 \
+  --min-coverage 69 \
+  --min-bitscore 40 \
+  --bitscore-perc-cutoff 0 \
+  --ignore-rank "unknown,invalid" \
+  --ignore-taxonomy "environmental" \
+  --ignore-obiclean-type singleton \
+  --ignore-illuminapairend-type pairend \
+  --min-support 10
 ```
-
 This command will:
-- Parse the BLAST and FASTA files.
-- Filter hits using `E-value ≤ 1e-5`, `uncertainty threshold ≥ 90%`, and use read count in the circular data output.
-- Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.
+Parse the annotated BLAST results and the corresponding unannotated FASTA sequences.
+
+Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering).
+
+Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental".
+Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads.
+
+Use read counts when generating circular taxonomy outputs (--use-counts).
+
+Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis.

 **Example Input (`annotated_curated_results.tabular`)**

@@ -109,7 +125,8 @@
 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
-| **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+| **Log** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+| **Filtered fasta** | `.fasta` | Fasta that passed the set thresholds, for use in downstream analysis (clustering) |

 **Output files (example)**

@@ -230,17 +247,67 @@
 ```
-└── stats.txt
+└── log.txt
 ```
-metric value
-percentage_annotated 71.3862433862434
-annotated_sequences 3373
-total_sequences 4725
-percentage_unique_annotated 89.46585409571608
-unique_annotated 99826
-total_unique 111580
+Starting processing for FASTA
+=== PARAMETERS USED ===
+uncertain_threshold: 0.9
+eval_threshold: 1e-10
+use_counts: True
+ignore_rank: unknown
+ignore_taxonomy: environmental
+bitscore_perc_cutoff: 8.0
+min_bitscore: 100
+ignore_obiclean_type: singleton
+ignore_illuminapairend_type: pairend
+min_identity: 80
+min_coverage: 70
+ignore_seqids:
+min_support: 1
+=== END PARAMETERS ===
+Filtered FASTA written succesfully(1790 sequences)
+FASTA: total headers: 2156
+FASTA: headers kept after filters and min_support=1: 1790
+FASTA: removed due to header filters (illumina/obiclean/etc.): 366
+FASTA: removed due to low dereplicated count (<1): 0
+FASTA: total invalid (header filter + low support): 366
+Reading BLAST annotations
+BLAST: total hits read: 4977
+BLAST: hits kept after quality filters: 3145
+BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832
+BLAST: hits removed due to invalid taxon: 0
+BLAST: hits removed due to ignored seqids: 0
+Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)']
+ANNOTATION: total FASTA headers considered: 1790
+ANNOTATION: reads with BLAST hits: 622
+ANNOTATION: reads without BLAST hits: 1168
+ANNOTATION: unique annotated count (from header counts): 49571
+ANNOTATION: total unique count (from FASTA): 66132
+E-value plot written succesfully
+Taxa summary written succesfully
+Header annotations written succesfully
+Circle diagram JSON written succesfully
+=== ANNOTATION STATISTICS ===
+percentage_annotated: 28.84972170686456
+annotated_sequences: 622
+total_sequences: 2156
+percentage_unique_annotated: 74.95766043670235
+unique_annotated: 49571
+total_unique: 66132
 ```
+└── filtered_fasta.fasta
+```
+ >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
+ ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+ gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
+ gataggtgcagagactcaatgg
+
+ >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
+ ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+ gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
+ taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
+```
 ---

 #### CLI Arguments (common)
@@ -249,14 +316,25 @@
 |----------|-------------|
 | `--input-anno` | Path to the annotated BLAST results (tab-separated) |
 | `--input-unanno` | Path to the unannotated reads FASTA file |
-| `--eval-plot` | Output file where eval plot output will be written |
-| `--taxa-output` | Output file where taxa output will be written |
-| `--circle-data` | Output file where circle data output will be written |
-| `--header-anno` | Output file where header annotation results will be written |
-| `--anno-stats` | Output file where annotation statistics will be written |
-| `--eval-treshold` | Maximum E-value to retain hits (default: `1e-5`) |
-| `--uncertain-threshold` | percentage for which lca picks the majority taxon (default: `0.9 (90%)`) |
-| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
+| `--eval-plot` | Output file where the E-value distribution plot will be written |
+| `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written |
+| `--circle-data` | Output file where circular taxonomy data will be written |
+| `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) |
+| `--log` | Output file where log messages will be written |
+| `--filtered-fasta` | Output FASTA file filtered for downstream analysis |
+| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) |
+| `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) |
+| `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) |
+| `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit |
+| `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit |
+| `--min-bitscore` | Minimum bitscore required to retain a BLAST hit |
+| `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit |
+| `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) |
+| `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) |
+| `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) |
+| `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) |
+| `--ignore-seqids` | Ignore sequences containing this identifier substring |
+| `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned |

 ---
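
Reviewer note: the two percentages in the `=== ANNOTATION STATISTICS ===` block of the new example log can be cross-checked against the raw counts printed next to them. The sketch below is only a sanity check, assuming each percentage is `count / total * 100`; that formula is inferred from the numbers in the log, not taken from the tool's source, and the helper name `percentage` is hypothetical.

```python
def percentage(part: int, total: int) -> float:
    """Return part as a percentage of total (assumed formula: part / total * 100)."""
    if total == 0:
        # Guard against an empty input; how the tool handles this case is unknown.
        return 0.0
    return part / total * 100


# Counts copied from the example log in the diff above.
stats = {
    # annotated_sequences / total_sequences
    "percentage_annotated": percentage(622, 2156),
    # unique_annotated / total_unique
    "percentage_unique_annotated": percentage(49571, 66132),
}

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption the results agree with the logged values (about 28.85% and 74.96%). Note that `total_sequences` in the log is 2156, the number of FASTA headers before filtering, not the 1790 headers kept after filtering.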
