diff README.md @ 2:9ca209477dfd draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author onnodg
date Mon, 15 Dec 2025 16:43:36 +0000
parents
children
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md	Mon Dec 15 16:43:36 2025 +0000
@@ -0,0 +1,297 @@
+# BLAST Annotations Processor Script
+
+This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting.
+
+It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset.
+
+---
+
+## Usage
+
+The script performs the following main tasks:
+
+1. Parse command-line arguments.  
+2. Load the annotated BLAST results and the unannotated FASTA headers.  
+3. Group BLAST hits per read and filter them by specified thresholds.  
+4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules.  
+5. Generate statistics and annotation outputs for downstream use.
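
Step 4 can be pictured with the sketch below. It is an illustration only, not the script's actual code: the function name `resolve_lineage` and the exact rule are assumptions inferred from this README. The assumed rule keeps the most common taxon at a rank when its share of the remaining hits reaches the threshold; otherwise it labels that rank `Uncertain taxa` and stops descending.

```python
from collections import Counter


def resolve_lineage(lineages, threshold=0.9):
    """Hypothetical LCA-style consensus over the lineages of one read's hits.

    lineages: list of equal-length lists, ordered from kingdom down to species.
    """
    if not lineages:
        return []
    consensus = []
    candidates = list(lineages)
    for level in range(len(candidates[0])):
        counts = Counter(lineage[level] for lineage in candidates)
        taxon, hits = counts.most_common(1)[0]
        if hits / len(candidates) < threshold:
            # No taxon reaches the threshold: label this rank and stop descending.
            consensus.append("Uncertain taxa")
            break
        consensus.append(taxon)
        # Continue only with the hits that agree with the consensus so far.
        candidates = [lineage for lineage in candidates if lineage[level] == taxon]
    return consensus


# Taxonomy strings of the three hits for the first read in the example input.
prefix = "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / "
hits = [
    prefix + "Ranunculus repens",
    prefix + "Ranunculus repens",
    prefix + "Ranunculus muricatus",
]
lineages = [hit.split(" / ") for hit in hits]
print(resolve_lineage(lineages))
```

With the default threshold of `0.9`, two out of three hits agreeing on a species is not enough, so this sketch resolves the read to genus *Ranunculus* and marks the species as `Uncertain taxa`. The real tool may apply additional rules that this sketch does not model.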
+
+
+### Command Line Interface
+The BLAST annotations processor can be run as a Python script:
+
+```bash
+python blast_annotations_processor.py [options]
+```
+
+Below is a detailed example for a common use case.
+
+#### General use case
+This example shows the general use of the tool.
+
+**Requirements**:
+  
+Dependencies, as listed in the `blast_annotations_processor` XML file:
+
+- Python version=3.12.3
+- Matplotlib version=3.12.3
+- Pandas version=2.3.2
+- Numpy version=2.3.2
+- Openpyxl version=3.1.5
+
+
+**Input requirements**
+
+- BLAST tabular file with alignment metrics, source, and taxonomy
+- FASTA file with preprocessed reads
+- Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers.  
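
The header correspondence can be checked before running the tool. The sketch below is not part of the script. It assumes that the read ID is the first whitespace-delimited token of each FASTA header and the first tab-separated column of the BLAST file, as in the example files in this README. All function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect read IDs: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line.startswith(">") and len(line) > 1:
            ids.add(line[1:].split()[0])
    return ids


def blast_query_ids(lines):
    """Collect query IDs from the first tab-separated column, skipping '#' header lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def unmatched_queries(blast_lines, fasta_lines):
    """Return BLAST query IDs that have no FASTA header with the same ID."""
    return blast_query_ids(blast_lines) - fasta_ids(fasta_lines)


# Small in-memory example; real input would come from open(path) instead.
fasta = [">read_1_CONS(5) count=5;", "acgt", ">read_2_CONS(2) count=2;", "ttga"]
blast = ["#Query ID\t#Subject", "read_1_CONS(5)\tsubject_a", "read_3_CONS(1)\tsubject_b"]
print(unmatched_queries(blast, fasta))
```

An empty result means every BLAST query has a matching FASTA header; anything else lists the IDs that would fail to merge.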
+
+
+
+**Example: Analyzing BLAST annotation results from a curated database**
+
+```bash
+python annotate_blast_results.py \
+  --input-anno 'annotated_curated_results.tabular' \
+  --input-unanno 'unannotated_reads.fasta' \
+  --eval-plot 'eval_curated.png' \
+  --taxa-output 'taxa_curated.txt' \
+  --circle-data 'circle_curated.txt' \
+  --header-anno 'anno_curated.xlsx' \
+  --anno-stats 'stats_curated.txt' \
+  --eval-threshold '1e-5' \
+  --uncertain-threshold '0.9' \
+  --use-counts
+```
+
+This command will:
+
+- Parse the BLAST and FASTA files.  
+- Keep hits with an E-value of at most `1e-5`, apply an uncertainty threshold of `0.9` (90%) when resolving taxa, and use read counts in the circular data output.  
+- Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.
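
The E-value filter and per-read grouping can be sketched as follows. This is a minimal illustration, not the script's own code: the function name `group_filtered_hits` is hypothetical, and the column positions are an assumption taken from the header of the example input (query ID first, E-value seventh, taxonomy tenth).

```python
from collections import defaultdict

# Column positions assumed from the example input header.
QUERY_COL, EVALUE_COL, TAXONOMY_COL = 0, 6, 9


def group_filtered_hits(rows, max_evalue=1e-5):
    """Group taxonomy strings per query ID, keeping hits with E-value <= max_evalue."""
    hits_per_read = defaultdict(list)
    for row in rows:
        if row[QUERY_COL].startswith("#"):
            continue  # skip the header line
        if float(row[EVALUE_COL]) <= max_evalue:
            hits_per_read[row[QUERY_COL]].append(row[TAXONOMY_COL])
    return dict(hits_per_read)


# Rows as they would come from csv.reader(handle, delimiter="\t").
rows = [
    ["#Query ID", "#Subject", "#Subject accession", "#Subject Taxonomy ID",
     "#Identity percentage", "#Coverage", "#evalue", "#bitscore", "#Source", "#Taxonomy"],
    ["read_a", "s1", "acc", "N/A", "100.000", "100", "1.24e-38", "152", "NCBI", "A / B"],
    ["read_a", "s2", "acc", "N/A", "80.000", "60", "0.5", "20", "NCBI", "A / C"],
    ["read_b", "s3", "acc", "N/A", "99.000", "100", "1e-5", "90", "NCBI", "A / D"],
]
print(group_filtered_hits(rows))
```

The second hit of `read_a` is dropped because its E-value of `0.5` exceeds the cutoff, while the hit at exactly `1e-5` is kept.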
+
+
+**Example Input (`annotated_curated_results.tabular`)**
+
+
+```
+    #Query ID	#Subject	#Subject accession	#Subject Taxonomy ID	#Identity percentage	#Coverage	#evalue	#bitscore	#Source	#Taxonomy
+    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=EU382995   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
+    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=JQ041850   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
+    M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=DQ410740   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus muricatus   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	98.780	100	5.79e-37	147	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus
+    M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595)	source=NCBI   sequenceID=HM590330   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Malpighiales   suborder=NA   infraorder=NA   superfamily=NA   family=Salicaceae   genus=Populus   species=Populus tremula   markercode=trnL   lat=50.47   lon=-104.37	source=NCBI	N/A	100.000	100	2.16e-52	198	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula
+    M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595)	source=NCBI   sequenceID=MH573985   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Malpighiales   suborder=NA   infraorder=NA   superfamily=NA   family=Salicaceae   genus=Populus   species=Populus alba   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	99.074	100	1.01e-50	193	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba
+    ...
+```
+
+**Example FASTA (`unannotated_reads.fasta`)**
+
+
+```
+    >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
+    ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
+    gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
+    gataggtgcagagactcaatgg
+
+    >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
+    ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0; 
+    gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
+    taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
+    ...
+```
+
+**Outputs**
+
+
+| Output Type                   | Format | Description |
+|-------------------------------|--------|-------------|
+| **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing E-value cutoffs or spotting anomalies. |
+| **Taxonomic composition**     | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
+| **Circular taxonomy data**    | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
+| **Header annotations**        | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
+| **Annotation statistics**     | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
+
+
+**Output files (example)**
+
+
+outputs
+
+├── eval.png
+<img width="2100" height="900" alt="Example E-value distribution plot" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" />
+
+
+├── taxa.txt
+```
+Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285}
+percentage_rooted	number_rooted	total_num	taxon_level	indentificatie
+100.00	3373	3373	K	Viridiplantae
+100.00	3373	3373	P	  Streptophyta
+99.97	3372	3373	C	    Magnoliopsida
+1.96	66	3373	O	      Apiales
+1.96	66	3373	F	        Apiaceae
+1.22	41	3373	G	          Aegopodium
+1.22	41	3373	S	            Aegopodium podagraria
+0.27	9	3373	G	          Apium
+0.27	9	3373	S	            Apium graveolens
+0.47	16	3373	G	          Uncertain taxa
+4.77	161	3373	O	      Asterales
+4.77	161	3373	F	        Asteraceae
+0.06	2	3373	G	          Achillea
+0.06	2	3373	S	            Achillea millefolium
+0.15	5	3373	G	          Artemisia
+0.15	5	3373	S	            Uncertain taxa
+0.03	1	3373	G	          Calendula
+...
+4.57	154	3373	G	          Uncertain taxa
+0.12	4	3373	F	        Uncertain taxa
+0.53	18	3373	O	      Uncertain taxa
+0.03	1	3373	C	    Pinopsida
+0.03	1	3373	O	      Cupressales
+0.03	1	3373	F	        Taxaceae
+0.03	1	3373	G	          Taxus
+0.03	1	3373	S	            Taxus baccata
+```
+├── circle.txt
+```
+[
+  {
+    "labels": [
+      "Bacteria",
+      "Uncertain taxa",
+      "Viridiplantae"
+    ],
+    "sizes": [
+      2,
+      1,
+      29
+    ]
+  },
+  {
+    "labels": [
+      "Pseudomonadota",
+      "Uncertain taxa",
+      "Streptophyta"
+    ],
+    "sizes": [
+      2,
+      1,
+      29
+    ]
+...
+    ],
+    "sizes": [
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      1,
+      3,
+      1,
+      1,
+      2,
+      2,
+      1,
+      1,
+      1,
+      1,
+      4,
+      1,
+      1,
+      1,
+      1
+    ]
+  }
+]
+```
+├── anno.xlsx
+```
+header	e_value	identity percentage	coverage	bitscore	count	source	taxa	kingdom	phylum	class	order	family	genus	species
+M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS	2.33E-41	98.889	100	161	12	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
+M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS	1.08E-39	97.778	100	156	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Achillea	Achillea millefolium
+M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS	1.63E-37	98.795	100	148	29	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
+M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS	3.51E-39	100	100	154	179	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
+M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS	5.79E-37	98.795	100	147	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria	Viridiplantae	Streptophyta	Magnoliopsida	Apiales	Apiaceae	Aegopodium	Aegopodium podagraria
+...
+M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS	2.69E-35	97.59	100	141	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
+M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS	7.37E-36	100	95	143	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
+M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS	1.63E-37	98.795	100	148	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia faba
+M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS	5E-43	100	100	167	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta	Viridiplantae	Streptophyta	Magnoliopsida	Fabales	Fabaceae	Vicia	Vicia hirsuta
+M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS	1.07E-39	100	94	156	13	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
+M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS	4.96E-38	98.81	94	150	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor	Viridiplantae	Streptophyta	Magnoliopsida	Gentianales	Apocynaceae	Vinca	Vinca minor
+M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
+M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS	8.25E-41	98.876	100	159	1	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium	Viridiplantae	Streptophyta	Magnoliopsida	Asterales	Asteraceae	Xanthium	Xanthium strumarium
+```
+
+
+└── stats.txt
+```
+metric	value
+percentage_annotated	71.3862433862434
+annotated_sequences	3373
+total_sequences	4725
+percentage_unique_annotated	89.46585409571608
+unique_annotated	99826
+total_unique	111580
+```
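
The two percentages in `stats.txt` are consistent with a plain ratio of the counts listed next to them. The sketch below reproduces that arithmetic. How the script itself computes these values, and what exactly counts as "unique", is not stated in this README, so the helper `percentage` is a hypothetical illustration.

```python
def percentage(part, total):
    """Return part as a percentage of total, or 0.0 when total is zero."""
    return 100 * part / total if total else 0.0


# Counts taken from the example stats.txt above.
print(percentage(3373, 4725))     # annotated_sequences / total_sequences
print(percentage(99826, 111580))  # unique_annotated / total_unique
```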
+
+---
+
+#### CLI Arguments (common)
+
+| Argument | Description |
+|----------|-------------|
+| `--input-anno` | Path to the annotated BLAST results (tab-separated) |
+| `--input-unanno` | Path to the unannotated reads FASTA file |
+| `--eval-plot` | Output path for the E-value distribution plot |
+| `--taxa-output` | Output path for the taxonomic composition report |
+| `--circle-data` | Output path for the circular taxonomy data |
+| `--header-anno` | Output path for the header annotation workbook |
+| `--anno-stats` | Output path for the annotation statistics |
+| `--eval-threshold` | Maximum E-value to retain hits (default: `1e-5`) |
+| `--uncertain-threshold` | Fraction of hits at which the LCA step picks the majority taxon (default: `0.9`, i.e. 90%) |
+| `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
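
For orientation, the flags above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the script's real parser: `build_parser` is a hypothetical name, the flags are spelled as in the example command, and `--use-counts` is modelled with `argparse.BooleanOptionalAction` so that it defaults to `True` yet can still be passed as a bare flag.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Process annotated BLAST results.")
    parser.add_argument("--input-anno", required=True, help="Annotated BLAST results (tab-separated)")
    parser.add_argument("--input-unanno", required=True, help="Unannotated reads (FASTA)")
    parser.add_argument("--eval-plot", required=True, help="Output: E-value plot")
    parser.add_argument("--taxa-output", required=True, help="Output: taxonomic composition")
    parser.add_argument("--circle-data", required=True, help="Output: circular taxonomy data")
    parser.add_argument("--header-anno", required=True, help="Output: header annotations")
    parser.add_argument("--anno-stats", required=True, help="Output: annotation statistics")
    parser.add_argument("--eval-threshold", type=float, default=1e-5)
    parser.add_argument("--uncertain-threshold", type=float, default=0.9)
    # Assumption: accepts --use-counts / --no-use-counts, defaulting to True.
    parser.add_argument("--use-counts", action=argparse.BooleanOptionalAction, default=True)
    return parser


example_args = [
    "--input-anno", "annotated_curated_results.tabular",
    "--input-unanno", "unannotated_reads.fasta",
    "--eval-plot", "eval_curated.png",
    "--taxa-output", "taxa_curated.txt",
    "--circle-data", "circle_curated.txt",
    "--header-anno", "anno_curated.xlsx",
    "--anno-stats", "stats_curated.txt",
    "--use-counts",
]
args = build_parser().parse_args(example_args)
print(args.eval_threshold, args.uncertain_threshold, args.use_counts)
```

Whether the real script marks the file arguments as required, or how it parses `--use-counts`, is not documented here; check the script's `--help` output for the authoritative behaviour.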
+
+---
+
+
+### Galaxy integration
+
+The tool is also available through the Galaxy platform:
+
+- **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed, 
+  enabling easy installation into any Galaxy instance.
+- **Web-based interface**: Users can upload input files, configure parameters through the GUI, 
+  run the tool, and download results.
+- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
+
+To use the tool in Galaxy:
+1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor")
+2. Upload your FASTA reads and annotated BLAST file to your Galaxy history
+3. Configure parameters through the GUI
+4. Run the tool
+5. View the results and download the generated plots, reports, and annotation tables
+
+## License
+
+No license yet
+
+## Citation
+
+If you use this software in your research, please cite this repository.
+
+## Contact
+
+For questions or issues:
+- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
+- Email: onno.gorter@naturalis.nl (until February 2026)
+
+## Acknowledgments
+
+This tool was developed to support the "New lights on old remedies" project, a PhD project by Anja Fischer.