# BLAST Annotations Processor Script

This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting.

It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset.

---

## Usage

The script performs the following main tasks:

1. Parse command-line arguments.  
2. Load the annotated BLAST results and the unannotated FASTA headers.  
3. Group BLAST hits per read and filter them by specified thresholds.  
4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules.  
5. Generate a variety of outputs of statistics and annotations for downstream use.


### Command Line Interface
The BLAST annotations processor can be run as a Python script:

```bash
python blast_annotations_processor.py [options]
```

Below are detailed examples for common use case

#### General use case
This example shows the general use of the tool.

**Requirements**:
  
Requirements as listed in the blast_annotations_processor xml file:

- Python version=3.12.3
- Matplotlib version=3.12.3
- Pandas version=2.3.2
- Numpy version=2.3.2
- Openpyxl version=3.1.5


**Input requirements**

- BLAST tabular file with alignment metrics, source and taxa
- Fasta file with preprocessed reads
- Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers.  



**Example: Analyzing BLAST annotation result using curated database**

```bash
python annotate_blast_results.py \
  --input-anno annotated_curated_results.tabular \
  --input-unanno unannotated_reads.fasta \
  --eval-plot eval_curated.png \
  --taxa-output taxa_curated.txt \
  --circle-data circle_curated.txt \
  --header-anno anno_curated.xlsx \
  --log run.log \
  --filtered-fasta filtered_reads.fasta \
  --eval-threshold 1e-10 \
  --uncertain-threshold 0.9 \
  --use-counts \
  --min-identity 70 \
  --min-coverage 69 \
  --min-bitscore 40 \
  --bitscore-perc-cutoff 0 \
  --ignore-rank "unknown,invalid" \
  --ignore-taxonomy "environmental" \
  --ignore-obiclean-type singleton \
  --ignore-illuminapairend-type pairend \
  --min-support 10
```
This command will:

Parse the annotated BLAST results and the corresponding unannotated FASTA sequences.

Filter BLAST hits using E-value ≤ 1e-10, minimum identity ≥ 70%, minimum coverage ≥ 69%, and minimum bitscore ≥ 40, and apply a bitscore percentage cutoff of 0% (no additional top-bitscore filtering).

Resolve taxonomic conflicts using an LCA approach with an uncertainty threshold of 90%, while ignoring ranks containing "unknown,invalid" and taxonomy containing "environmental".

Exclude sequences flagged as obiclean type singleton and sequences marked as Illuminapairedend type pairend (merge failure), and require a minimum taxonomic support of 10 reads.

Use read counts when generating circular taxonomy outputs (--use-counts).

Produce the configured outputs (plots, Kraken-style report, circular data, per-header annotations), plus the required log file and filtered FASTA for downstream analysis.

**Example Input (`annotated_curated_results.tabular`)**


```
    #Query ID	#Subject	#Subject accession	#Subject Taxonomy ID	#Identity percentage	#Coverage	#evalue	#bitscore	#Source	#Taxonomy
    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=EU382995   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=JQ041850   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=DQ410740   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus muricatus   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	98.780	100	5.79e-37	147	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus
    M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595)	source=NCBI   sequenceID=HM590330   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Malpighiales   suborder=NA   infraorder=NA   superfamily=NA   family=Salicaceae   genus=Populus   species=Populus tremula   markercode=trnL   lat=50.47   lon=-104.37	source=NCBI	N/A	100.000	100	2.16e-52	198	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula
    M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595)	source=NCBI   sequenceID=MH573985   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Malpighiales   suborder=NA   infraorder=NA   superfamily=NA   family=Salicaceae   genus=Populus   species=Populus alba   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	99.074	100	1.01e-50	193	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba
    ...
```

**Example FASTA (`unannotated_reads.fasta`)**


```
    >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
    ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
    gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
    gataggtgcagagactcaatgg

    >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
    ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; 
    gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
    taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
    ...
```

**Outputs**


| Output Type                   | Format | Description |
|-------------------------------|--------|-------------|
| **E-value distribution plots**| `.png` | Histogram  of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
| **Taxonomic composition**     | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
| **Circular taxonomy data**    | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
| **Header annotations**        | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
| **Log**     | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
| **Filtered fasta**            | `.fasta` | Fasta that passed the set thresholds, for use in downstream analysis (clustering) |


**Output files (example)**


outputs

├── eval.png
<img width="2100" height="900" alt="afbeelding" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" />


├── taxa.txt
```
Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285}
percentage_rooted	number_rooted	total_num	taxon_level	indentificatie
100.00	3373	3373	K	Viridiplantae
100.00	3373	3373	P	  Streptophyta
99.97	3372	3373	C	    Magnoliopsida
1.96	66	3373	O	      Apiales
1.96	66	3373	F	        Apiaceae
1.22	41	3373	G	          Aegopodium
1.22	41	3373	S	            Aegopodium podagraria
0.27	9	3373	G	          Apium
0.27	9	3373	S	            Apium graveolens
0.47	16	3373	G	          Uncertain taxa
4.77	161	3373	O	      Asterales
4.77	161	3373	F	        Asteraceae
0.06	2	3373	G	          Achillea
0.06	2	3373	S	            Achillea millefolium
0.15	5	3373	G	          Artemisia
0.15	5	3373	S	            Uncertain taxa
0.03	1	3373	G	          Calendula
...
4.57	154	3373	G	          Uncertain taxa
0.12	4	3373	F	        Uncertain taxa
0.53	18	3373	O	      Uncertain taxa
0.03	1	3373	C	    Pinopsida
0.03	1	3373	O	      Cupressales
0.03	1	3373	F	        Taxaceae
0.03	1	3373	G	          Taxus
0.03	1	3373	S	            Taxus baccata
```
├── circle.txt
```
[
  {
    "labels": [
      "Bacteria",
      "Uncertain taxa",
      "Viridiplantae"
    ],
    "sizes": [
      2,
      1,
      29
    ]
  },
  {
    "labels": [
      "Pseudomonadota",
      "Uncertain taxa",
      "Streptophyta"
    ],
    "sizes": [
      2,
      1,
      29
    ]
...
    ],
    "sizes": [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      3,
      1,
      1,
      2,
      2,
      1,
      1,
      1,
      1,
      4,
      1,
      1,
      1,
      1
    ]
  }
]
```
├── anno.xlsx
```
header	e_value	identity percentage	coverage	bitscore	count	source	taxa	kingdom	phylum	class	order	family	genus	species
M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS	2.33E-41	98.889	100	161	12	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS	1.08E-39	97.778	100	156	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS	1.63E-37	98.795	100	148	29	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS	3.51E-39	100	100	154	179	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS	5.79E-37	98.795	100	147	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
...
M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS	2.69E-35	97.59	100	141	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS	7.37E-36	100	95	143	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS	1.63E-37	98.795	100	148	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS	5E-43	100	100	167	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia hirsuta
M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS	1.07E-39	100	94	156	13	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS	4.96E-38	98.81	94	150	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
```


└── log.txt
```
Starting processing for FASTA
=== PARAMETERS USED ===
uncertain_threshold: 0.9
eval_threshold: 1e-10
use_counts: True
ignore_rank: unknown
ignore_taxonomy: environmental
bitscore_perc_cutoff: 8.0
min_bitscore: 100
ignore_obiclean_type: singleton
ignore_illuminapairend_type: pairend
min_identity: 80
min_coverage: 70
ignore_seqids: 
min_support: 1
=== END PARAMETERS ===
Filtered FASTA written succesfully(1790 sequences)
FASTA: total headers: 2156
FASTA: headers kept after filters and min_support=1: 1790
FASTA: removed due to header filters (illumina/obiclean/etc.): 366
FASTA: removed due to low dereplicated count (<1): 0
FASTA: total invalid (header filter + low support): 366
Reading BLAST annotations
BLAST: total hits read: 4977
BLAST: hits kept after quality filters: 3145
BLAST: hits filtered (evalue/coverage/identity/bitscore): 1832
BLAST: hits removed due to invalid taxon: 0
BLAST: hits removed due to ignored seqids: 0
Note: 30 BLAST q_ids not in FASTA (showing up to 10): ['M01687:460:000000000-LGY9G:1:1101:11918:3518_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:12996:3690_CONS(1)', 'M01687:460:000000000-LGY9G:1:1101:11564:11468_CONS(1)', 'M01687:460:000000000-LGY9G:1:1102:19358:5472_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:4805:4734_CONS(1)', 'M01687:460:000000000-LGY9G:1:2114:7472:19038_CONS(1)', 'M01687:460:000000000-LGY9G:1:2112:26865:11154_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:29518:11119_CONS(1)', 'M01687:460:000000000-LGY9G:1:2113:14681:23251_CONS(1)', 'M01687:460:000000000-LGY9G:1:2110:17890:1754_CONS(2)']
ANNOTATION: total FASTA headers considered: 1790
ANNOTATION: reads with BLAST hits: 622
ANNOTATION: reads without BLAST hits: 1168
ANNOTATION: unique annotated count (from header counts): 49571
ANNOTATION: total unique count (from FASTA): 66132
E-value plot written succesfully
Taxa summary written succesfully
Header annotations written succesfully
Circle diagram JSON written succesfully
=== ANNOTATION STATISTICS ===
percentage_annotated: 28.84972170686456
annotated_sequences: 622
total_sequences: 2156
percentage_unique_annotated: 74.95766043670235
unique_annotated: 49571
total_unique: 66132
```

└── filtered_fasta.fasta
```
    >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
    ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
    gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
    gataggtgcagagactcaatgg

    >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
    ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; 
    gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
    taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
```
---

#### CLI Arguments (common)

| Argument | Description |
|----------|-------------|
| `--input-anno` | Path to the annotated BLAST results (tab-separated) |
| `--input-unanno` | Path to the unannotated reads FASTA file |
| `--eval-plot` | Output file where the E-value distribution plot will be written |
| `--taxa-output` | Output file where the taxonomic (Kraken-style) report will be written |
| `--circle-data` | Output file where circular taxonomy data will be written |
| `--header-anno` | Output file where per-header annotation results will be written (tabular/xlsx) |
| `--log` | Output file where log messages will be written |
| `--filtered-fasta` | Output FASTA file filtered for downstream analysis |
| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-10`) |
| `--uncertain-threshold` | Proportion required for LCA to assign a majority taxon (default: `0.9` / 90%) |
| `--use-counts` | Use read counts when generating circular taxonomy data (default: `False`) |
| `--min-identity` | Minimum sequence identity (%) to consider a BLAST hit |
| `--min-coverage` | Minimum query coverage (%) to consider a BLAST hit |
| `--min-bitscore` | Minimum bitscore required to retain a BLAST hit |
| `--bitscore-perc-cutoff` | Bitscore percentage cutoff relative to the top hit |
| `--ignore-rank` | Ignore taxonomic ranks containing this text (default: `unknown`) |
| `--ignore-taxonomy` | Ignore taxonomy strings containing this text (default: `environmental`) |
| `--ignore-obiclean-type` | Ignore sequences with this obiclean classification (default: `singleton`) |
| `--ignore-illuminapairend-type` | Ignore sequences with this paired-end merge status (default: `pairend`) |
| `--ignore-seqids` | Ignore sequences containing this identifier substring |
| `--min-support` | Retain taxa only if they (or their descendants) have at least N reads assigned |

---


### Galaxy integration

The tool is also available through the Galaxy platform:

- **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed, 
  enabling easy installation into any Galaxy instance.
- **Web-based interface**: Users can upload sequence files, configure validation parameters through the GUI, 
  run validations, and download results.
- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.

To use the tool in Galaxy:
1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor")
2. Upload your raw read and BLAST files to your Galaxy history
3. Configure parameters through the GUI
4. Run the tool
5. View results and download validation reports and tabular annotations

## License

No license yet

## Citation

If you use this software in your research, please cite this repository.

## Contact

For questions or issues:
- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
- Email: onno.gorter@naturalis.nl (until Febuary 2026)

## Acknowledgments

This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.
