comparison README.md @ 2:9ca209477dfd draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author onnodg
date Mon, 15 Dec 2025 16:43:36 +0000
1:2acf82433aa4 2:9ca209477dfd
1 # BLAST Annotations Processor Script
2
3 This script processes a single **annotated BLAST file** together with a **FASTA file containing the same reads but unannotated**, generating multiple output files for downstream visualization and reporting.
4
5 It is designed for BLAST-based taxonomic pipelines and provides a complete overview of annotation quality, distribution, and composition of the analyzed dataset.
6
7 ---
8
9 ## Usage
10
11 The script performs the following main tasks:
12
13 1. Parse command-line arguments.
14 2. Load the annotated BLAST results and the unannotated FASTA headers.
15 3. Group BLAST hits per read and filter them by specified thresholds.
16 4. Resolve taxonomic conflicts with the lowest common ancestor method using predefined uncertainty rules.
17 5. Generate a variety of outputs of statistics and annotations for downstream use.
18
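Step 3 above (grouping hits per read and filtering by threshold) can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `group_and_filter`, the tuple layout, and the shortened read IDs are all assumptions made for the example.

```python
from collections import defaultdict


def group_and_filter(hits, max_evalue=1e-5):
    """Group (query_id, evalue, taxonomy) hits per read, dropping weak hits."""
    grouped = defaultdict(list)
    for query_id, evalue, taxonomy in hits:
        if evalue <= max_evalue:  # keep only hits at or below the cutoff
            grouped[query_id].append((evalue, taxonomy))
    return dict(grouped)


# Illustrative hits; the read IDs are shortened placeholders.
hits = [
    ("read_A", 1.24e-38, "Ranunculus repens"),
    ("read_A", 5.79e-37, "Ranunculus muricatus"),
    ("read_B", 2.16e-52, "Populus tremula"),
    ("read_B", 0.5, "Populus alba"),  # hypothetical weak hit, filtered out
]
grouped = group_and_filter(hits)
for read_id, read_hits in grouped.items():
    print(read_id, len(read_hits))
```

Grouping before conflict resolution means every read carries its full set of retained hits into the lowest common ancestor step.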
19
20 ### Command Line Interface
21 The BLAST annotations processor can be run as a Python script:
22
23 ```bash
24 python blast_annotations_processor.py [options]
25 ```
26
27 Below is a detailed example of a common use case.
28
29 #### General use case
30 This example shows the general use of the tool.
31
32 **Requirements**:
33
34 Dependencies as listed in the `blast_annotations_processor` XML file:
35
36 - Python version=3.12.3
37 - Matplotlib version=3.12.3
38 - Pandas version=2.3.2
39 - Numpy version=2.3.2
40 - Openpyxl version=3.1.5
41
42
43 **Input requirements**
44
45 - BLAST tabular file with alignment metrics, source, and taxonomy
46 - FASTA file with preprocessed reads
47 - Header correspondence: Query identifiers in the BLAST output and FASTA headers **must match**. The script relies on matching IDs to merge annotations with read headers.
48
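The header correspondence requirement can be checked before running the tool. The sketch below assumes that the read ID is the first whitespace-delimited token of each FASTA header and the first tab-separated column of each BLAST row; the helper names and the short read IDs are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def blast_query_ids(blast_lines):
    """Return the first tab-separated column of every non-comment BLAST row."""
    return {
        line.split("\t")[0]
        for line in blast_lines
        if line.strip() and not line.startswith("#")
    }


# Placeholder data, shaped like the example files in this README.
fasta = [">read_A(1758) count=1758;", "gggcaatcc", ">read_B(6595) count=6595;", "gggcaat"]
blast = ["#Query ID\t#Subject", "read_A(1758)\tsource=NCBI", "read_C(12)\tsource=NCBI"]

# Query IDs that appear in the BLAST file but have no FASTA header.
missing = blast_query_ids(blast) - fasta_ids(fasta)
print(sorted(missing))
```

An empty `missing` set indicates that every annotated query can be merged with a read header.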
49
50
51 **Example: Analyzing BLAST annotation results using a curated database**
52
53 ```bash
54 python annotate_blast_results.py \
55 --input-anno 'annotated_curated_results.tabular' \
56 --input-unanno 'unannotated_reads.fasta' \
57 --eval-plot 'eval_curated.png' \
58 --taxa-output 'taxa_curated.txt' \
59 --circle-data 'circle_curated.txt' \
60 --header-anno 'anno_curated.xlsx' \
61 --anno-stats 'stats_curated.txt' \
62 --eval-threshold '1e-5' \
63 --uncertain-threshold '0.9' \
64 --use-counts
65 ```
66
67 This command will:
68
69 - Parse the BLAST and FASTA files.
70 - Keep hits with `E-value ≤ 1e-5`, apply an uncertainty threshold of 90% when resolving conflicts, and use read counts in the circular data output.
71 - Resolve taxonomic conflicts and generate plots, reports, and spreadsheet outputs in the given output files.
72
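How an uncertainty threshold of 0.9 might interact with lowest common ancestor resolution is sketched below. This is an assumption about the general idea, not the script's actual rules: at each rank the most common taxon is kept only if at least 90% of the hits agree, and once a rank is uncertain all lower ranks are marked `Uncertain taxa` as well. The function name and rank list are hypothetical, and the sketch assumes a threshold above 0.5.

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def resolve_lineage(lineages, threshold=0.9):
    """Pick the majority taxon per rank, or 'Uncertain taxa' below the threshold."""
    resolved = []
    for level in range(len(RANKS)):
        names = [lineage[level] for lineage in lineages]
        name, votes = Counter(names).most_common(1)[0]
        if votes / len(names) >= threshold:
            resolved.append(name)
        else:
            # Assumption: once one rank is uncertain, all lower ranks are too.
            resolved.extend(["Uncertain taxa"] * (len(RANKS) - level))
            break
    return resolved


# Taxonomy strings in the same " / "-separated layout as the example input.
taxonomy_strings = [
    "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens",
    "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens",
    "Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus",
]
lineages = [s.split(" / ") for s in taxonomy_strings]
resolved = resolve_lineage(lineages)
print(dict(zip(RANKS, resolved)))
```

In this example only two of three hits agree on the species, which is below 90%, so the read resolves to genus *Ranunculus* with an uncertain species.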
73
74 **Example Input (`annotated_curated_results.tabular`)**
75
76
77 ```
78 #Query ID #Subject #Subject accession #Subject Taxonomy ID #Identity percentage #Coverage #evalue #bitscore #Source #Taxonomy
79 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=EU382995 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
80 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=JQ041850 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus repens markercode=trnL lat=NA lon=NA source=NCBI N/A 100.000 100 1.24e-38 152 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
81 M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) source=NCBI sequenceID=DQ410740 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Ranunculales suborder=NA infraorder=NA superfamily=NA family=Ranunculaceae genus=Ranunculus species=Ranunculus muricatus markercode=trnL lat=NA lon=NA source=NCBI N/A 98.780 100 5.79e-37 147 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus muricatus
82 M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=HM590330 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus tremula markercode=trnL lat=50.47 lon=-104.37 source=NCBI N/A 100.000 100 2.16e-52 198 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus tremula
83 M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) source=NCBI sequenceID=MH573985 superkingdom=Eukaryota kingdom=Viridiplantae phylum=Streptophyta subphylum=Streptophytina class=Magnoliopsida subclass=NA infraclass=NA order=Malpighiales suborder=NA infraorder=NA superfamily=NA family=Salicaceae genus=Populus species=Populus alba markercode=trnL lat=NA lon=NA source=NCBI N/A 99.074 100 1.01e-50 193 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Malpighiales / Salicaceae / Populus / Populus alba
84 ...
85 ```
86
87 **Example FASTA (`unannotated_reads.fasta`)**
88
89
90 ```
91 >M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) merged_sample={}; count=1758; direction=right; sminR=40.0;
92 ali_length=82; seq_b_deletion=219; seq_b_insertion=0; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
93 gggcaatcctgagccaaatcctgctttcagaaaacaaaaagagggttcagaaagcaaagg
94 gataggtgcagagactcaatgg
95
96 >M01687:476:000000000-LL5F5:1:1102:14619:1181_CONS(6595) merged_sample={}; count=6595; direction=right; sminR=40.0;
97 ali_length=107; mode=alignment; sminL=40.0; seq_a_single=0; seq_b_single=0;
98 gggcaatcctgagccaaatcctatttttcgaaaacaaacaaaaaaacaaacaaaggttca
99 taaagacagaataagaatacaaaaggataggtgcagagactcaatgg
100 ...
101 ```
102
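The FASTA headers above carry a `count=` field, which is presumably what the read-count option draws on. A minimal sketch for extracting the read ID and count from such a header follows; the function name is hypothetical, and falling back to a count of 1 when the field is absent is an assumption.

```python
import re


def parse_header(header):
    """Return (read_id, count) from a FASTA header line."""
    read_id = header.lstrip(">").split()[0]
    match = re.search(r"\bcount=(\d+)", header)
    count = int(match.group(1)) if match else 1  # assumed default of 1
    return read_id, count


header = (
    ">M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758) "
    "merged_sample={}; count=1758; direction=right; sminR=40.0;"
)
read_id, count = parse_header(header)
print(read_id, count)
```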
103 **Outputs**
104
105
106 | Output Type | Format | Description |
107 |-------------------------------|--------|-------------|
108 | **E-value distribution plots**| `.png` | Histogram of BLAST E-values across all queries; useful for choosing score cutoffs or spotting anomalies. |
109 | **Taxonomic composition** | `.txt` | Summarized counts or proportions of reads assigned to each taxonomic level. |
110 | **Circular taxonomy data** | `.txt` | JSON-formatted hierarchical taxonomy structure, used to generate circular taxonomic plots. |
111 | **Header annotations** | `.xlsx` | Excel workbook with merged and per-read annotation information, and alignment statistics. |
112 | **Annotation statistics** | `.txt` | Summary metrics such as number of annotated reads, unassigned reads, unique taxa detected, and filtering statistics. |
113
114
115 **Output files (example)**
116
117
118 outputs
119
120 ├── eval.png
121 <img width="2100" height="900" alt="E-value distribution plot" src="https://github.com/user-attachments/assets/75b8fac6-da31-4980-a535-f9dd7ffd15bb" />
122
123
124 ├── taxa.txt
125 ```
126 Uncertain count per taxonomie level{'K': 0, 'P': 0, 'C': 0, 'O': 18, 'F': 10, 'G': 615, 'S': 1285}
127 percentage_rooted number_rooted total_num taxon_level indentificatie
128 100.00 3373 3373 K Viridiplantae
129 100.00 3373 3373 P Streptophyta
130 99.97 3372 3373 C Magnoliopsida
131 1.96 66 3373 O Apiales
132 1.96 66 3373 F Apiaceae
133 1.22 41 3373 G Aegopodium
134 1.22 41 3373 S Aegopodium podagraria
135 0.27 9 3373 G Apium
136 0.27 9 3373 S Apium graveolens
137 0.47 16 3373 G Uncertain taxa
138 4.77 161 3373 O Asterales
139 4.77 161 3373 F Asteraceae
140 0.06 2 3373 G Achillea
141 0.06 2 3373 S Achillea millefolium
142 0.15 5 3373 G Artemisia
143 0.15 5 3373 S Uncertain taxa
144 0.03 1 3373 G Calendula
145 ...
146 4.57 154 3373 G Uncertain taxa
147 0.12 4 3373 F Uncertain taxa
148 0.53 18 3373 O Uncertain taxa
149 0.03 1 3373 C Pinopsida
150 0.03 1 3373 O Cupressales
151 0.03 1 3373 F Taxaceae
152 0.03 1 3373 G Taxus
153 0.03 1 3373 S Taxus baccata
154 ```
155 ├── circle.txt
156 ```
157 [
158 {
159 "labels": [
160 "Bacteria",
161 "Uncertain taxa",
162 "Viridiplantae"
163 ],
164 "sizes": [
165 2,
166 1,
167 29
168 ]
169 },
170 {
171 "labels": [
172 "Pseudomonadota",
173 "Uncertain taxa",
174 "Streptophyta"
175 ],
176 "sizes": [
177 2,
178 1,
179 29
180 ]
181 ...
182 ],
183 "sizes": [
184 1,
185 1,
186 1,
187 1,
188 1,
189 1,
190 1,
191 1,
192 1,
193 1,
194 1,
195 3,
196 1,
197 1,
198 2,
199 2,
200 1,
201 1,
202 1,
203 1,
204 4,
205 1,
206 1,
207 1,
208 1
209 ]
210 }
211 ]
212 ```
213 ├── anno.xlsx
214 ```
215 header e_value identity percentage coverage bitscore count source taxa kingdom phylum class order family genus species
216 M01687:476:000000000-LL5F5:1:1102:8926:6561_CONS 2.33E-41 98.889 100 161 12 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
217 M01687:476:000000000-LL5F5:1:2114:16883:18620_CONS 1.08E-39 97.778 100 156 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Achillea / Achillea millefolium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Achillea Achillea millefolium
218 M01687:476:000000000-LL5F5:1:1102:20658:7882_CONS 1.63E-37 98.795 100 148 29 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
219 M01687:476:000000000-LL5F5:1:1102:3453:17892_CONS 3.51E-39 100 100 154 179 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
220 M01687:476:000000000-LL5F5:1:1101:16634:16511_CONS 5.79E-37 98.795 100 147 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Apiales / Apiaceae / Aegopodium / Aegopodium podagraria Viridiplantae Streptophyta Magnoliopsida Apiales Apiaceae Aegopodium Aegopodium podagraria
221 ...
222 M01687:476:000000000-LL5F5:1:1119:27044:6653_CONS 2.69E-35 97.59 100 141 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
223 M01687:476:000000000-LL5F5:1:1109:2464:14257_CONS 7.37E-36 100 95 143 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
224 M01687:476:000000000-LL5F5:1:1106:26123:11458_CONS 1.63E-37 98.795 100 148 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia faba Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia faba
225 M01687:476:000000000-LL5F5:1:1104:24402:7089_CONS 5E-43 100 100 167 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Fabales / Fabaceae / Vicia / Vicia hirsuta Viridiplantae Streptophyta Magnoliopsida Fabales Fabaceae Vicia Vicia hirsuta
226 M01687:476:000000000-LL5F5:1:2114:19155:4308_CONS 1.07E-39 100 94 156 13 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
227 M01687:476:000000000-LL5F5:1:1117:11316:6653_CONS 4.96E-38 98.81 94 150 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Gentianales / Apocynaceae / Vinca / Vinca minor Viridiplantae Streptophyta Magnoliopsida Gentianales Apocynaceae Vinca Vinca minor
228 M01687:476:000000000-LL5F5:1:1106:28052:14441_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
229 M01687:476:000000000-LL5F5:1:2118:15258:6790_CONS 8.25E-41 98.876 100 159 1 NCBI Viridiplantae / Streptophyta / Magnoliopsida / Asterales / Asteraceae / Xanthium / Xanthium strumarium Viridiplantae Streptophyta Magnoliopsida Asterales Asteraceae Xanthium Xanthium strumarium
230 ```
231
232
233 └── stats.txt
234 ```
235 metric value
236 percentage_annotated 71.3862433862434
237 annotated_sequences 3373
238 total_sequences 4725
239 percentage_unique_annotated 89.46585409571608
240 unique_annotated 99826
241 total_unique 111580
242 ```
243
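The two percentage rows in this example are consistent with a plain ratio of the annotated value to the corresponding total, multiplied by 100. The helper below is a hypothetical illustration of that arithmetic, not code taken from the script.

```python
def percentage(part, total):
    """Return part / total as a percentage, or 0.0 for an empty total."""
    return 100.0 * part / total if total else 0.0


# Values taken from the example stats.txt above.
pct_annotated = percentage(3373, 4725)
pct_unique = percentage(99826, 111580)
print(round(pct_annotated, 2), round(pct_unique, 2))
```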
244 ---
245
246 #### CLI Arguments (common)
247
248 | Argument | Description |
249 |----------|-------------|
250 | `--input-anno` | Path to the annotated BLAST results (tab-separated) |
251 | `--input-unanno` | Path to the unannotated reads FASTA file |
252 | `--eval-plot` | Output file for the E-value distribution plot |
253 | `--taxa-output` | Output file for the taxonomic composition |
254 | `--circle-data` | Output file for the circular taxonomy data |
255 | `--header-anno` | Output file for the header annotations |
256 | `--anno-stats` | Output file for the annotation statistics |
257 | `--eval-threshold` | Maximum E-value to retain hits (default: `1e-5`) |
258 | `--uncertain-threshold` | Fraction of hits that must agree for the LCA step to pick the majority taxon (default: `0.9`, i.e. 90%) |
259 | `--use-counts` | Use read counts in the circle data output when true (default: `True`) |
260
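A command-line parser matching the table above could look roughly like the sketch below. It is an assumption about how such an interface might be declared with `argparse`, not the script's actual parser; in particular, modelling `--use-counts` with `argparse.BooleanOptionalAction` (which also adds a `--no-use-counts` switch) is a choice made for this example.

```python
import argparse


def build_parser():
    """Build a parser with the arguments listed in the table (hypothetical)."""
    parser = argparse.ArgumentParser(description="Process annotated BLAST results.")
    parser.add_argument("--input-anno", required=True, help="annotated BLAST results")
    parser.add_argument("--input-unanno", required=True, help="unannotated reads FASTA")
    # All five output arguments are plain file paths.
    for name in ("eval-plot", "taxa-output", "circle-data", "header-anno", "anno-stats"):
        parser.add_argument(f"--{name}", help=f"output file for {name}")
    parser.add_argument("--eval-threshold", type=float, default=1e-5)
    parser.add_argument("--uncertain-threshold", type=float, default=0.9)
    parser.add_argument("--use-counts", action=argparse.BooleanOptionalAction, default=True)
    return parser


args = build_parser().parse_args([
    "--input-anno", "annotated_curated_results.tabular",
    "--input-unanno", "unannotated_reads.fasta",
    "--eval-threshold", "1e-5",
])
print(args.eval_threshold, args.uncertain_threshold, args.use_counts)
```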
261 ---
262
263
264 ### Galaxy integration
265
266 The tool is also available through the Galaxy platform:
267
268 - **Galaxy Toolshed**: The BLAST annotations processor tool is available in the Galaxy Toolshed,
269 enabling easy installation into any Galaxy instance.
270 - **Web-based interface**: Users can upload input files, configure parameters through the GUI,
271 run the tool, and download results.
272 - **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.
273
274 To use the tool in Galaxy:
275 1. Install the tool from the Galaxy Toolshed (search for "blast_annotations_processor")
276 2. Upload your FASTA reads and annotated BLAST file to your Galaxy history
277 3. Configure parameters through the GUI
278 4. Run the tool
279 5. View the results and download the plots, reports, and annotation tables
280
281 ## License
282
283 No license yet
284
285 ## Citation
286
287 If you use this software in your research, please cite this repository.
288
289 ## Contact
290
291 For questions or issues:
292 - GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
293 - Email: onno.gorter@naturalis.nl (until February 2026)
294
295 ## Acknowledgments
296
297 This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.