annotate blast_annotations_processor.py @ 1:2acf82433aa4 draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit d771f9fbfd42bcdeda1623d954550882a0863847-dirty
author onnodg
date Mon, 20 Oct 2025 12:26:51 +0000
parents a3989edf0a4a
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
1 """Galaxy-compatible BLAST annotation processor.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
2
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
3 This script processes a single annotated BLAST file along with a FASTA file
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
4 containing unannotated reads. It generates multiple types of outputs for
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
5 integration with Galaxy workflows:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
6
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
7 - E-value distribution plots (PNG)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
8 - Taxonomic composition reports (text)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
9 - Circular taxonomy diagram data (JSON)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
10 - Header annotations with merged and per-read information (Excel)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
11 - Annotation statistics summary (text)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
12
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
13 Main workflow:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
14 1. Parse command-line arguments.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
15 2. Load annotated BLAST results and unannotated FASTA headers.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
16 3. Group BLAST hits per query (q_id), filter by thresholds.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
17 4. Resolve taxonomic conflicts with uncertainty rules.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
18 5. Generate requested outputs.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
19
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
20 Notes:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
21 - Headers in BLAST and FASTA should correspond.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
22 - Uses matplotlib, pandas, and openpyxl for visualization and reporting.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
23 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
24
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
25 import argparse
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
26 from collections import defaultdict
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
27 import json
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
28 import os
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
29 import matplotlib.pyplot as plt
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
30 import pandas as pd
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
31 import numpy as np
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
32 import openpyxl
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
33
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
34 # Default taxonomic levels
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
35 TAXONOMIC_LEVELS = ["K", "P", "C", "O", "F", "G", "S"]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
36
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
37 def parse_arguments(arg_list=None):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
38 """Parse command line arguments for cluster processing."""
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
39 parser = argparse.ArgumentParser(
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
40 description='Process BLAST annotation results for Galaxy'
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
41 )
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
42 # Required inputs
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
43 parser.add_argument('--input-anno', required=True,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
44 help='Annotated BLAST output file (tabular format)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
45 parser.add_argument('--input-unanno', required=True,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
46 help='Unannotated sequences file (FASTA format)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
47 # Optional outputs
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
48 parser.add_argument('--eval-plot',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
49 help='Output path for E-value plot (PNG)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
50 parser.add_argument('--taxa-output',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
51 help='Output path for taxa report (tabular)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
52 parser.add_argument('--circle-data',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
53 help='Output path for circular taxonomy data (txt)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
54 parser.add_argument('--header-anno',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
55 help='Output path for header annotations (tabular/xlsx)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
56 parser.add_argument('--anno-stats',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
57 help='Output path for annotation statistics (txt)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
58 # Parameters
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
59 parser.add_argument('--uncertain-threshold', type=float, default=0.9,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
60 help='Threshold for resolving taxonomic conflicts (default: 0.9)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
61 parser.add_argument('--eval-threshold', type=float, default=1e-10,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
62 help='E-value threshold for filtering results (default: 1e-10)')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
63 parser.add_argument('--use-counts', action='store_true', default=True,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
64 help='Use read counts in circular data')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
65
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
66 return parser.parse_args(arg_list)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
67
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
68
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
69 def list_to_string(x):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
70 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
71 Convert a list, pandas Series, or numpy array to a comma-separated string."""
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
72 if isinstance(x, (list, pd.Series, np.ndarray)):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
73 return ", ".join(map(str, x))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
74 return str(x)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
75
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
76 def make_eval_plot(e_val_sets, output_path):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
77 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
78 Create a bar plot of E-values per read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
79
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
80 :param e_val_sets: List of sets containing E-values per read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
81 :type e_val_sets: list[set[str]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
82 :param output_path: Path to save the plot (PNG).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
83 :type output_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
84 :return: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
85 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
86 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
87 if not e_val_sets:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
88 print("No E-values to plot")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
89 return None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
90
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
91 # Convert string E-values to floats and sort
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
92 processed_sets = sorted([sorted(float(e_val) for e_val in e_val_set)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
93 for e_val_set in e_val_sets])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
94
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
95 bar_positions = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
96 bar_heights = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
97
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
98 for i, e_set in enumerate(processed_sets):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
99 if len(e_set) == 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
100 e_set.append(0)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
101 for j, e_val in enumerate(e_set):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
102 bar_positions.append(i + j * 0.29)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
103 if e_val != 0:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
104 bar_heights.append(1 / e_val)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
105 else:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
106 bar_heights.append(e_val)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
107
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
108 plt.figure(figsize=(21, 9))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
109 plt.bar(bar_positions, bar_heights, width=0.29, color=['blue', 'red'])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
110
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
111 plt.yscale('log')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
112 plt.grid(axis='y', linestyle='--', color='gray', alpha=0.7)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
113 plt.ylabel("e-values")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
114 plt.xticks(ticks=range(len(processed_sets)),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
115 labels=[f'{i + 1}' for i in range(len(processed_sets))])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
116
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
117 plt.tight_layout()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
118 plt.savefig(output_path, format='png')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
119 plt.close()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
120 return None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
121
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
122
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
123 def calculate_annotation_stats(anno_count, unanno_file_path, unique_anno_count, total_unique_count):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
124 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
125 Compute annotation statistics for a dataset.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
126
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
127 This function calculates summary statistics for annotated and unannotated
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
128 sequences in a dataset. It counts the total number of sequences in the
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
129 unannotated file (FASTA-style, based on lines starting with '>'), and
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
130 computes the percentage of annotated sequences and unique annotated
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
131 sequences relative to their totals.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
132
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
133 :param anno_count: Total number of annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
134 :type anno_count: int
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
135 :param unanno_file_path: Path to a FASTA file containing unannotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
136 :type unanno_file_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
137 :param unique_anno_count: Number of unique annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
138 :type unique_anno_count: int
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
139 :param total_unique_count: Total number of unique sequences in the dataset.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
140 :type total_unique_count: int
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
141 :return: Dictionary containing:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
142 - 'percentage_annotated' (float): Percentage of annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
143 - 'annotated_sequences' (int): Number of annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
144 - 'total_sequences' (int): Total number of sequences (annotated + unannotated).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
145 - 'percentage_unique_annotated' (float): Percentage of unique annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
146 - 'unique_annotated' (int): Number of unique annotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
147 - 'total_unique' (int): Total number of unique sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
148 :rtype: dict[str, float | int]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
149 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
150 # Count total sequences in unannotated file
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
151 total_sequences = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
152 with open(unanno_file_path, 'r') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
153 total_sequences = sum(1 for line in f if line.startswith('>'))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
154
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
155 percentage_annotated = (anno_count / total_sequences * 100) if total_sequences > 0 else 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
156 percentage_unique_annotated = (unique_anno_count / total_unique_count * 100) if total_unique_count > 0 else 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
157
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
158 return {
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
159 'percentage_annotated': percentage_annotated,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
160 'annotated_sequences': anno_count,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
161 'total_sequences': total_sequences,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
162 'percentage_unique_annotated': percentage_unique_annotated,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
163 'unique_annotated': unique_anno_count,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
164 'total_unique': total_unique_count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
165 }
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
166
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
167 def _choose_taxon(level_counts: dict, threshold: float):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
168 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
169 Determine the most representative taxonomic level based on counts and a threshold.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
170
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
171 This function evaluates a dictionary of taxonomic level counts and returns the
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
172 level name that holds a strict majority. If no level reaches the threshold, or if
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
173 there is a tie for the highest count, the function returns ``None``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
174
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
175 :param level_counts: A dictionary mapping taxonomic level names (e.g., "species",
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
176 "genus") to their respective counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
177 :type level_counts: dict[str, int]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
178 :param threshold: The minimum fraction (between 0 and 1) of the total counts that
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
179 the top taxonomic level must have to be considered dominant.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
180 :type threshold: float
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
181
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
182 :return: The name of the taxonomic level that meets or exceeds the threshold and
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
183 is not tied with another level. Returns ``None`` if no level qualifies or
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
184 if there is a tie.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
185 :rtype: str | None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
186
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
187 :example:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
188 _choose_taxon({"species": 6, "genus": 3, "family": 1}, 0.5)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
189 'species'
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
190 _choose_taxon({"species": 4, "genus": 4}, 0.6)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
191 None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
192 _choose_taxon({"species": 2, "genus": 3}, 0.6)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
193 'genus'
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
194 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
195 total = sum(level_counts.values())
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
196 if total == 0:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
197 return None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
198 items = sorted(level_counts.items(), key=lambda x: x[1], reverse=True)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
199 top_name, top_count = items[0]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
200 # tie -> no clear majority
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
201 if sum(1 for _, c in items if c == top_count) > 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
202 return None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
203 if top_count / total >= threshold:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
204 return top_name
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
205 return None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
206
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
207
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
208 def _resolve_conflicts_recursive(conflicting: list, level_idx: int, threshold: float, max_levels: int, prefix=None):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
209 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
210 Recursively resolve taxonomic classification conflicts across multiple levels.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
211
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
212 This function attempts to resolve conflicts between multiple taxonomic paths by
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
213 recursively evaluating each taxonomic level. At each level, it determines whether
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
214 there is a clear majority (based on a specified threshold) for a taxon name. If
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
215 a majority is found, the function continues to the next level using only the
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
216 taxa that match that name. If no majority exists, the function constructs
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
217 "uncertain" outputs to indicate ambiguous classification.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
218
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
219 :param conflicting: List of tuples containing full taxonomic paths and their
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
220 associated counts, e.g. ``[("Bacteria / Proteobacteria / Gammaproteobacteria", 10),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
221 ("Bacteria / Proteobacteria / Alphaproteobacteria", 5)]``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
222 :type conflicting: list[tuple[str, int]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
223 :param level_idx: The current taxonomic level index being evaluated (0-based).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
224 :type level_idx: int
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
225 :param threshold: The minimum fraction (between 0 and 1) required for a strict
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
226 majority decision at a given taxonomic level.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
227 :type threshold: float
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
228 :param max_levels: The maximum number of taxonomic levels to evaluate before
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
229 stopping recursion.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
230 :type max_levels: int
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
231 :param prefix: A list of already-resolved taxon names up to the previous level.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
232 Defaults to ``None``, in which case an empty list is used.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
233 :type prefix: list[str] | None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
234
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
235 :return: A tuple containing:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
236 - **short_output** (str): A simplified taxonomic path with "Uncertain taxa"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
237 appended if no consensus was reached at the current level.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
238 - **full_output** (str): The full taxonomic path, filled with additional
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
239 "Uncertain taxa" entries up to the minimal conflicting level depth.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
240 :rtype: tuple[str, str]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
241
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
242 :example:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
243 conflicts = [
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
244 ... ("Bacteria / Proteobacteria / Gammaproteobacteria", 10),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
245 ... ("Bacteria / Proteobacteria / Alphaproteobacteria", 5)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
246 ... ]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
247 _resolve_conflicts_recursive(conflicts, level_idx=0, threshold=0.6, max_levels=5)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
248 ('Bacteria / Proteobacteria / Uncertain taxa',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
249 'Bacteria / Proteobacteria / Uncertain taxa / Uncertain taxa')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
250 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
251 if prefix is None:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
252 prefix = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
253
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
254 if not conflicting:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
255 return "", ""
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
256
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
257 # split paths (keeps sync)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
258 conflicting_levels = [t.split(" / ") for t, _ in conflicting]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
259 min_conflicting_level = min(len(p) for p in conflicting_levels)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
260
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
261 # If only one full path remains -> return it (both short & full)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
262 if len(conflicting) == 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
263 return conflicting[0][0], conflicting[0][0]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
264
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
265 # If we've reached deepest comparable level or max_levels -> pick most supported full path
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
266 if level_idx >= min_conflicting_level or level_idx >= max_levels:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
267 full_counts = defaultdict(int)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
268 for t, c in conflicting:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
269 full_counts[t] += c
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
270 chosen_full = max(full_counts.items(), key=lambda x: x[1])[0]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
271 return chosen_full, chosen_full
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
272
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
273 # Count names at this level
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
274 level_counts = defaultdict(int)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
275 for taxon, count in conflicting:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
276 parts = taxon.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
277 if level_idx < len(parts):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
278 level_counts[parts[level_idx]] += count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
279
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
280 # If everyone agrees at this level, append that name and go deeper
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
281 if len(level_counts) == 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
282 name = next(iter(level_counts))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
283 return _resolve_conflicts_recursive(conflicting, level_idx + 1, threshold, max_levels, prefix + [name])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
284
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
285 # Check for a strict majority at this level
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
286 chosen_level_name = _choose_taxon(level_counts, threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
287 if chosen_level_name is not None:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
288 # Keep only taxa that match the chosen name at this level and go deeper
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
289 filtered = [(t, c) for (t, c) in conflicting
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
290 if level_idx < len(t.split(" / ")) and t.split(" / ")[level_idx] == chosen_level_name]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
291 return _resolve_conflicts_recursive(filtered, level_idx + 1, threshold, max_levels, prefix + [chosen_level_name])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
292
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
293 # No majority at this level -> construct uncertain outputs.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
294 # short: prefix + 'Uncertain taxa' (stop here)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
295 short_path = prefix + ['Uncertain taxa']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
296 short_output = " / ".join(short_path)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
297
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
298 # full: prefix + 'Uncertain taxa' + fill until min_conflicting_level
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
299 full_path = short_path + ['Uncertain taxa'] * (min_conflicting_level - (level_idx + 1))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
300 full_output = " / ".join(full_path)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
301
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
302 return short_output, full_output
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
303
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
304
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
305 def resolve_taxon_majority(taxon_counts: dict, threshold: float = 0.90, max_levels: int = 7):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
306 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
307 Resolve the majority consensus taxonomic classification from a set of annotated paths.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
308
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
309 :param taxon_counts: A dictionary mapping full taxonomic paths to their respective
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
310 occurrence counts. For example:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
311 ``{"Bacteria / Proteobacteria / Gammaproteobacteria": 10,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
312 "Bacteria / Proteobacteria / Alphaproteobacteria": 5}``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
313 :type taxon_counts: dict[str, int]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
314 :param threshold: Minimum fraction (between 0 and 1) required for a strict majority
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
315 at each taxonomic level. Defaults to ``0.90``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
316 :type threshold: float, optional
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
317 :param max_levels: Maximum number of taxonomic levels to evaluate before stopping
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
318 recursion. Defaults to ``7``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
319 :type max_levels: int, optional
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
320
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
321 :return: A tuple containing:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
322 - **short_output** (str): The consensus taxonomic path, possibly ending with
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
323 "Uncertain taxa" if ambiguity remains.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
324 - **full_output** (str): The expanded taxonomic path including placeholder
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
325 "Uncertain taxa" labels up to the minimal conflicting depth.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
326 :rtype: tuple[str, str]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
327 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
328 conflicting = list(taxon_counts.items()) # list of (path, count) keeps taxa<->count synced
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
329 return _resolve_conflicts_recursive(conflicting, level_idx=0, threshold=threshold, max_levels=max_levels)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
330
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
331
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
332 def process_taxa_output(taxa_dicts, output_path, uncertain_threshold):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
333 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
334 Generate a tabular taxa report from a list of read taxonomies.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
335
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
336 This function processes taxonomic assignments for each read in a dataset
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
337 and produces a summary report similar in style to Kraken2 output. The
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
338 report includes:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
339
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
340 - The number of reads that were marked as "Uncertain taxa" at each
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
341 taxonomic rank.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
342 - For all other taxa, the number of reads assigned and their percentage
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
343 relative to the sample total.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
344 - The taxonomic rank for each assignment, indicated in the report with
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
345 indentation proportional to the rank depth.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
346
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
347 :param taxa_dicts: List of taxon count dictionaries, one per read, where
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
348 keys are taxon names and values are counts of how often
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
349 they were assigned to the read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
350 :type taxa_dicts: list[dict[str, int]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
351 :param output_path: Path to save the tabular taxa report.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
352 :type output_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
353 :param uncertain_threshold: Threshold for resolving conflicting taxonomic
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
354 assignments (values below the threshold are
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
355 marked as "Uncertain taxa").
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
356 :type uncertain_threshold: float
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
357 :return: None. The function writes the taxa report to ``output_path``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
358 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
359 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
360 uncertain_dict = {level: 0 for level in TAXONOMIC_LEVELS}
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
361 aggregated_counts = defaultdict(int)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
362
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
363 for read_taxa in taxa_dicts:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
364 resolved_taxon, _ = resolve_taxon_majority(read_taxa, uncertain_threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
365
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
366 # Add counts for resolved taxon
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
367 levels = resolved_taxon.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
368 for i in range(1, len(levels) + 1):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
369 aggregated_counts[" / ".join(levels[:i])] += 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
370
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
371 # Calculate total count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
372 total_count = sum(value for key, value in aggregated_counts.items()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
373 if len(key.split(" / ")) == 1)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
374
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
375 report_lines = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
376
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
377 for taxonomy, count in sorted(aggregated_counts.items(), key=lambda x: x[0]):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
378 levels = taxonomy.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
379 indent = " " * (len(levels) - 1) * 2
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
380 taxon_name = levels[-1]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
381 taxon_level_code = TAXONOMIC_LEVELS[len(levels) - 1] if len(levels) <= len(TAXONOMIC_LEVELS) else "U"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
382 percentage = (count / total_count) * 100 if total_count > 0 else 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
383
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
384 report_lines.append(
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
385 f"{percentage:.2f}\t{count}\t{total_count}\t{taxon_level_code}\t{indent}{taxon_name}"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
386 )
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
387
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
388 if taxon_name == 'Uncertain taxa':
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
389 uncertain_dict[taxon_level_code] += count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
390
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
391 # Write output
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
392 with open(output_path, 'w') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
393 f.write("Uncertain count per taxonomie level" + str(uncertain_dict) + '\n')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
394 f.write('percentage_rooted\tnumber_rooted\ttotal_num\ttaxon_level\tindentificatie\n')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
395 f.write("\n".join(report_lines))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
396
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
397
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
398 def process_header_annotations(taxa_dicts, headers, e_values, output_path, uncertain_threshold,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
399 identity_list, coverage_list, source_list, bitscore_list):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
400 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
401 Generate an Excel report with per-read and aggregated taxonomic annotations.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
402
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
403 This function processes taxonomic assignments per read, along with associated
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
404 metrics (e-value, identity, coverage, bitscore, source), and generates an Excel
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
405 file with two sheets:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
406
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
407 1. **Individual_Reads** – One row per sequence, including all metrics and
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
408 the resolved taxonomic path.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
409 2. **Merged_by_Taxa** – Aggregated results per unique taxonomic path, including
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
410 merged metrics and counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
411
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
412 Internally, the function creates a temporary TSV file, processes it into
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
413 a pandas DataFrame, expands the taxonomic path into separate columns
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
414 (e.g., kingdom, phylum, class, …), and then writes the results into a
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
415 multi-sheet Excel file. Column widths are automatically adjusted for readability.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
416
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
417 :param taxa_dicts: List of taxon count dictionaries, one per read, where keys
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
418 are taxa and values are their counts in the read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
419 :type taxa_dicts: list[dict[str, int]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
420 :param headers: Sequence headers corresponding to each read. Used to extract
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
421 read identifiers and counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
422 :type headers: list[str]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
423 :param e_values: List of e-values per read. Each element is a list of values
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
424 associated with that read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
425 :type e_values: list[list[float]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
426 :param output_path: Path to the Excel file to be created.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
427 :type output_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
428 :param uncertain_threshold: Threshold for resolving conflicting taxonomic
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
429 assignments. Values below the threshold are
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
430 replaced with "Uncertain taxa".
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
431 :type uncertain_threshold: float
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
432 :param identity_list: Percent identity values per read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
433 :type identity_list: list[list[str]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
434 :param coverage_list: Coverage percentage values per read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
435 :type coverage_list: list[list[str]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
436 :param source_list: Source identifiers per read (e.g., database names).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
437 :type source_list: list[list[str]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
438 :param bitscore_list: Bitscore values per read.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
439 :type bitscore_list: list[list[str]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
440 :return: None. The function writes results to an Excel file at ``output_path``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
441 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
442 :raises PermissionError: If the temporary TSV or output Excel file cannot be written
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
443 (e.g., file is open in another program).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
444 :raises IndexError: If the input lists (headers, e_values, etc.) are not aligned
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
445 with ``taxa_dicts``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
446 :raises ValueError: If sequence headers are not formatted as expected for extracting counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
447 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
448 report_lines = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
449 for i, read_taxa in enumerate(taxa_dicts):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
450 _, resolved_taxon_long = resolve_taxon_majority(read_taxa, uncertain_threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
451 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
452 e_val = e_values[i][0] if e_values[i] else "N/A"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
453 identity = identity_list[i][0] if identity_list[i] else "N/A"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
454 coverage = coverage_list[i][0] if coverage_list[i] else "N/A"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
455 source = source_list[i][0] if source_list[i] else "N/A"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
456 bitscore = bitscore_list[i][0] if bitscore_list[i] else "N/A"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
457 except (IndexError, TypeError) as e:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
458 print(f"Mismatch while extracting values: {e}. Check that your annotated and unannotated input files correspond correctly.")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
459 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
460
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
461 header = headers[i] if i < len(headers) else f"Header missing"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
462
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
463 # Extract count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
464 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
465 header_base, count_str = header.rsplit("(", 1)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
466 count = int(count_str.rstrip(")"))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
467 except ValueError as e:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
468 print(f'Failed extracting count: {e}')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
469 header_base = header
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
470 count = 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
471
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
472 report_lines.append(
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
473 f'{header_base}\t{e_val}\t{identity}\t{coverage}\t{bitscore}\t{count}\t{source}\t{resolved_taxon_long}')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
474 # Create temporary TSV for processing
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
475 temp_tsv_path = 'temp.tsv'
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
476 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
477 with open(temp_tsv_path, 'w') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
478 f.write('header\te_value\tidentity percentage\tcoverage\tbitscore\tcount\tsource\ttaxa\n')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
479 f.write("\n".join(report_lines))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
480 except PermissionError as e:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
481 print(f"Unable to write to file, error: {e} file might be opened")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
482
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
483 # Process the data
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
484 df = pd.read_csv(temp_tsv_path, sep='\t', encoding="latin1")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
485 if not df.empty:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
486 # Split taxa into separate columns
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
487 taxa_split = df["taxa"].str.split(" / ", expand=True)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
488 max_levels = taxa_split.shape[1] if taxa_split.shape[1] is not None else 7
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
489 level_names = ["kingdom", "phylum", "class", "order", "family", "genus", "species"][:max_levels]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
490 taxa_split.columns = level_names
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
491
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
492 # Add taxa columns to main dataframe
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
493 df_individual = pd.concat([df, taxa_split], axis=1)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
494 df_individual = df_individual.sort_values(['species', 'genus', 'family'], ascending=[True, True, True])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
495 # Create merged dataframe - first add taxa columns, then aggregate
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
496 df_for_merge = df_individual.copy()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
497
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
498 group_cols = ["taxa"]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
499 agg_dict = {
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
500 "e_value": lambda x: f"{x.min()}–{x.max()}",
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
501 "identity percentage": lambda x: list_to_string(list(x.unique())),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
502 "coverage": lambda x: list_to_string(list(x.unique())),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
503 "bitscore": lambda x: list_to_string(list(x.unique())),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
504 "count": "sum",
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
505 "source": "first"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
506 }
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
507
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
508 # Add aggregation for taxa levels that actually exist
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
509 for level in level_names:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
510 if level in df_for_merge.columns:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
511 agg_dict[level] = "first"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
512
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
513 df_merged = df_for_merge.groupby(group_cols, as_index=False).agg(agg_dict)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
514
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
515 sort_columns = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
516 sort_ascending = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
517
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
518 # Add family, genus, species if they exist
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
519 for level in ['species', 'genus', 'family']:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
520 if level in df_merged.columns:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
521 sort_columns.append(level)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
522 sort_ascending.append(True)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
523
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
524 df_merged = df_merged.sort_values(sort_columns, ascending=sort_ascending)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
525
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
526 # Write both datasets to single Excel file with multiple sheets
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
527 temp_path = output_path + ".xlsx"
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
528 os.makedirs(os.path.dirname(temp_path), exist_ok=True)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
529 with pd.ExcelWriter(temp_path, engine='openpyxl', mode='w') as writer:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
530 df_individual.to_excel(writer, sheet_name='Individual_Reads', index=False)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
531 df_merged.to_excel(writer, sheet_name='Merged_by_Taxa', index=False)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
532
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
533 # Auto-adjust column widths for both sheets
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
534 for sheet_name in writer.sheets:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
535 worksheet = writer.sheets[sheet_name]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
536 for column in worksheet.columns:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
537 max_length = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
538 column_letter = column[0].column_letter
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
539 for cell in column:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
540 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
541 if len(str(cell.value)) > max_length:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
542 max_length = len(str(cell.value))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
543 except:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
544 pass
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
545 adjusted_width = min(max_length + 2, 50) # Cap at 50 characters
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
546 worksheet.column_dimensions[column_letter].width = adjusted_width
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
547 os.replace(temp_path, output_path)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
548 # Clean up temporary file
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
549 os.remove(temp_tsv_path)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
550 else:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
551 print("Dataframe empty, no annotation results")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
552
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
553
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
554 def create_circle_diagram(taxa_dicts, output_path, use_counts, uncertain_threshold):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
555 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
556 Generate hierarchical JSON data for a circular taxonomy diagram.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
557
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
558 This function aggregates taxonomic classification results from multiple reads
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
559 and prepares hierarchical count data suitable for circular visualization.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
560
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
561 :param taxa_dicts: A list of dictionaries, where each dictionary represents a
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
562 mapping from full taxonomic paths to their corresponding counts for a single
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
563 read. Each key should be a taxonomic path in the format
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
564 :type taxa_dicts: list[dict[str, int]]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
565 :param output_path: File path to write the generated JSON output containing
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
566 circle diagram data.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
567 :type output_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
568 :param use_counts: Whether to aggregate counts by total read frequency
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
569 (``True``) or to count only unique taxa once (``False``).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
570 :type use_counts: boolean
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
571 :param uncertain_threshold: Fraction (between 0 and 1) representing the minimum
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
572 majority threshold used for resolving conflicting taxonomic assignments.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
573 :type uncertain_threshold: float
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
574
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
575 :return: None. The resulting JSON file will be written to ``output_path``. The
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
576 file contains a list of dictionaries, one per taxonomic level, with:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
577 - **labels** (*list[str]*): Names of taxa at that level.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
578 - **sizes** (*list[int]*): Corresponding aggregated counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
579 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
580
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
581 :note:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
582 - Ambiguous classifications are labeled as "Uncertain taxa".
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
583 - The number of hierarchical levels is determined by the length of
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
584 ``TAXONOMIC_LEVELS``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
585 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
586 aggregated_counts = defaultdict(int)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
587 seen_taxa = set()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
588
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
589 for read_taxa in taxa_dicts:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
590 _, resolved_taxon_long = resolve_taxon_majority(read_taxa, uncertain_threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
591
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
592 levels = resolved_taxon_long.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
593
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
594 if use_counts:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
595 for i in range(1, len(levels) + 1):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
596 aggregated_counts[" / ".join(levels[:i])] += 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
597 else:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
598 if resolved_taxon_long not in seen_taxa:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
599 for i in range(1, len(levels) + 1):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
600 aggregated_counts[" / ".join(levels[:i])] += 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
601 seen_taxa.add(resolved_taxon_long)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
602
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
603 # Prepare circle data
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
604 circle_data = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
605 for level in range(len(TAXONOMIC_LEVELS)):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
606 circle_data.append({"labels": [], "sizes": []})
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
607
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
608 for taxonomy, count in sorted(aggregated_counts.items(), key=lambda x: x[0]):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
609 levels = taxonomy.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
610 if len(levels) <= len(TAXONOMIC_LEVELS):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
611 circle_data[len(levels) - 1]["labels"].append(levels[-1].strip())
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
612 circle_data[len(levels) - 1]["sizes"].append(count)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
613 with open(output_path, "w") as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
614 json.dump(circle_data, f, indent=2)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
615
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
616
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
617 def process_single_file(anno_file_path, unanno_file_path, args):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
618 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
619 Process a single annotated BLAST file and generate all requested analytical outputs.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
620
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
621 This function integrates multiple steps for analyzing annotated BLAST output
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
622 files and corresponding unannotated FASTA sequences. It organizes BLAST hits
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
623 per query, extracts the best-scoring matches, and produces various types of
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
624 output files (plots, tables, diagrams, and statistics) according to user
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
625 arguments.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
626
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
627 :param anno_file_path: Path to the annotated BLAST results file.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
628 :type anno_file_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
629 :param unanno_file_path: Path to a FASTA file containing unannotated sequences.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
630 Used to determine sequence order and total unique sequence counts.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
631 :type unanno_file_path: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
632 :param args: Parsed command-line arguments containing the following attributes:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
633 - **eval_threshold** (*float*): Maximum allowed E-value for valid hits.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
634 - **eval_plot** (*str | None*): Output path for E-value distribution plot.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
635 - **taxa_output** (*str | None*): Output path for taxonomic summary table.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
636 - **header_anno** (*str | None*): Output path for annotated header table.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
637 - **circle_data** (*str | None*): Output path for circular diagram JSON data.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
638 - **anno_stats** (*str | None*): Output path for annotation statistics file.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
639 - **use_counts** (*bool*): Whether to use read counts or unique taxa only.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
640 - **uncertain_threshold** (*float*): Majority threshold for resolving taxonomic ambiguity.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
641 :type args: argparse.Namespace
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
642
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
643 :return: None. This function writes multiple output files to the paths
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
644 specified in ``args``
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
645 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
646
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
647 :raises FileNotFoundError: If the annotated BLAST or unannotated FASTA file cannot be found.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
648 :raises ValueError: If file content is malformed or missing required columns.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
649
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
650 :notes:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
651 - The function first groups BLAST hits by query ID and filters out invalid or
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
652 low-confidence taxa.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
653 - For each query, the best BLAST hit is determined based on bitscore and E-value.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
654 - The total number of unique annotated sequences is inferred from sequence
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
655 headers containing a ``count=`` field.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
656 - Progress and warnings are printed to standard output for traceability.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
657 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
658 # Core data structures - one per read
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
659 read_data = [] # List of dicts, one per read
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
660 headers = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
661 seen_reads = set()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
662 annotated_unique_count = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
663 total_unique_count = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
664
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
665 # Current read processing
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
666 current_q_id = ''
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
667 current_read_info = None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
668
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
669 # Check if file exists and has content
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
670 if not os.path.exists(anno_file_path):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
671 print(f"Error: Input file {anno_file_path} not found")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
672 return
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
673
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
674 unanno_headers_ordered = []
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
675 if os.path.exists(unanno_file_path):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
676 with open(unanno_file_path, 'r') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
677 for line in f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
678 if line.startswith('>'):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
679 header = line.split()[0].strip('>')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
680 unanno_headers_ordered.append(header)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
681 print(f"Found {len(unanno_headers_ordered)} headers in unannotated file")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
682 else:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
683 print(f"Warning: Unannotated file {unanno_file_path} not found")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
684
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
685 # Counter to track which read we're currently expecting
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
686 unanno_set = set(unanno_headers_ordered)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
687 read_counter = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
688
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
689 # Count total unique sequences from unannotated file
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
690 if os.path.exists(unanno_file_path):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
691 with open(unanno_file_path, 'r') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
692 for line in f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
693 if "count=" in line:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
694 count = int(line.split("count=")[1].split(";")[0])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
695 total_unique_count += count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
696
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
697 # Process annotated file
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
698 from collections import OrderedDict
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
699 blast_groups = OrderedDict() # q_id -> list of hit dicts
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
700
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
701 # Group BLAST hits per q_id
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
702 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
703 with open(anno_file_path, 'r') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
704 for line in f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
705 if line.startswith("#"):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
706 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
707 parts = line.strip().split('\t')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
708 if len(parts) < 7:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
709 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
710
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
711 q_id = parts[0]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
712 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
713 e_val = float(parts[6])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
714 except Exception:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
715 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
716 if e_val > args.eval_threshold:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
717 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
718 valid_taxa = True
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
719 taxon = parts[-1].strip()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
720 if is_invalid_taxon(taxon):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
721 print(f"Skipping {q_id}: invalid taxon: {taxon}")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
722 valid_taxa = False
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
723
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
724 identity = parts[4] if len(parts) > 4 else ''
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
725 cov = parts[5] if len(parts) > 5 else ''
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
726 bitscore = float(parts[7] if len(parts) > 7 else '')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
727 source = parts[8] if len(parts) > 8 else ''
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
728 if valid_taxa:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
729 hit = {
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
730 'e_val': e_val,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
731 'identity': identity,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
732 'cov': cov,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
733 'bitscore': bitscore,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
734 'source': source,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
735 'taxon': taxon
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
736 }
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
737 blast_groups.setdefault(q_id, []).append(hit)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
738 except Exception as e:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
739 print(f"Error reading BLAST file {anno_file_path}: {e}")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
740 return
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
741
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
742 # BLAST-reads not in fasta
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
743 extra_blast_qids = [q for q in blast_groups.keys() if q not in unanno_set]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
744 if extra_blast_qids:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
745 print(f"Note: {len(extra_blast_qids)} BLAST q_ids not in FASTA: {extra_blast_qids[:10]}...")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
746
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
747 # Process fasta order
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
748 for header in unanno_headers_ordered:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
749 if header not in blast_groups:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
750 print(f"Skipping {header}: no BLAST hits")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
751 continue
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
752 current_read_info = {
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
753 'taxa_dict': {},
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
754 'header': header,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
755 'e_values': set(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
756 'identities': set(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
757 'coverages': set(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
758 'sources': set(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
759 'bitscores': set(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
760 'best_eval': float(),
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
761 'best_identity': '',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
762 'best_coverage': '',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
763 'best_source': '',
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
764 'best_bitscore': -1,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
765 'best_taxa_dict' : {}
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
766 }
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
767
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
768 # process all hits for this header
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
769 for hit in blast_groups[header]:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
770 e_val = hit['e_val']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
771 identity = hit['identity']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
772 cov = hit['cov']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
773 bitscore = hit['bitscore']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
774 source = hit['source']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
775 taxon = hit['taxon']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
776
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
777 current_read_info['e_values'].add(str(e_val))
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
778 current_read_info['identities'].add(identity)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
779 current_read_info['coverages'].add(cov)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
780 current_read_info['sources'].add(source)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
781 current_read_info['bitscores'].add(bitscore)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
782
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
783 # Keep track of best hit
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
784 if (bitscore > current_read_info['best_bitscore'] or
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
785 (bitscore == current_read_info['best_bitscore'] and e_val < current_read_info['best_eval'])):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
786 # New best set → reset
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
787 current_read_info['best_bitscore'] = bitscore
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
788 current_read_info['best_eval'] = e_val
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
789 current_read_info['best_identity'] = identity
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
790 current_read_info['best_coverage'] = cov
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
791 current_read_info['best_source'] = source
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
792 current_read_info['best_taxa_dict'] = {taxon: 1}
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
793
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
794 # If hit is of same quality: add to existing dictionary
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
795 elif bitscore == current_read_info['best_bitscore'] and e_val == current_read_info['best_eval']:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
796 td = current_read_info['best_taxa_dict']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
797 td[taxon] = td.get(taxon, 0) + 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
798 # Might be unnecesary, best_taxa_dict seems to be the same as taxa_dict
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
799 # in output. Too scared to remove.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
800 if len(current_read_info['e_values']) == 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
801 taxa_dict = current_read_info['taxa_dict']
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
802 taxa_dict[taxon] = taxa_dict.get(taxon, 0) + 1
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
803 read_data.append(current_read_info)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
804
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
805 if header not in seen_reads:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
806 seen_reads.add(header)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
807 headers.append(header)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
808 try:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
809 count = int(header.split('(')[1].split(')')[0])
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
810 except IndexError or ValueError or AttributeError:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
811 count = 0
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
812 annotated_unique_count += count
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
813
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
814 # Extract data for different functions from unified structure
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
815 taxa_dicts = [read['taxa_dict'] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
816 td = [read['best_taxa_dict'] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
817 # Generate outputs based on arguments
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
818 if args.eval_plot:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
819 e_val_sets = [read['e_values'] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
820 make_eval_plot(e_val_sets, args.eval_plot)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
821
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
822 if args.taxa_output:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
823 process_taxa_output(td, args.taxa_output, args.uncertain_threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
824
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
825 if args.header_anno:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
826 # Extract best values efficiently - single list comprehension per metric
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
827 e_values_for_headers = [[read['best_eval']] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
828 identity_for_headers = [[read['best_identity']] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
829 coverage_for_headers = [[read['best_coverage']] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
830 source_for_headers = [[read['best_source']] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
831 bitscore_for_headers= [[read['best_bitscore']] for read in read_data]
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
832 process_header_annotations(td, headers, e_values_for_headers,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
833 args.header_anno, args.uncertain_threshold,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
834 identity_for_headers, coverage_for_headers, source_for_headers, bitscore_for_headers)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
835
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
836 if args.circle_data:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
837 create_circle_diagram(td, args.circle_data, args.use_counts,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
838 args.uncertain_threshold)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
839
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
840 if args.anno_stats:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
841 stats = calculate_annotation_stats(len(taxa_dicts), unanno_file_path,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
842 annotated_unique_count, total_unique_count)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
843 with open(args.anno_stats, 'w') as f:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
844 f.write('metric\tvalue\n')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
845 for key, value in stats.items():
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
846 f.write(f'{key}\t{value}\n')
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
847
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
848 def is_invalid_taxon(taxon: str) -> bool:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
849 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
850 Determine whether a given taxonomic path should be considered invalid and excluded from analysis.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
851
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
852 This function identifies and filters out taxonomic strings that contain unreliable,
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
853 incomplete, or placeholder taxonomic information. It is used to clean BLAST or
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
854 classification outputs before aggregation and visualization.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
855
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
856 :param taxon: Full taxonomic path as a string
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
857 :type taxon: str
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
858 :return: ``True`` if the taxon is considered invalid according to the criteria below;
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
859 otherwise ``False``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
860 :rtype: bool
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
861
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
862 :criteria:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
863 A taxon is considered invalid if:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
864 - It contains the substring ``"invalid taxid"`` or ``"GenBank invalid taxon"``.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
865 - It includes the term ``"environmental sample"`` (non-specific environmental sequence).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
866 - It has any hierarchical level labeled as ``"unknown"`` followed by a more
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
867 specific level (e.g., *unknown genus / Homo sapiens*).
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
868
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
869 :note:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
870 This function performs a **case-insensitive** check and assumes the
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
871 input taxonomic path uses ``" / "`` as a level delimiter.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
872 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
873 taxon_lower = taxon.lower()
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
874
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
875 if "invalid taxid" in taxon_lower or "genbank invalid taxon" in taxon_lower:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
876 return True
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
877
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
878 if "environmental sample" in taxon_lower:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
879 return True
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
880
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
881 # check for "unknown" followed by something deeper
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
882 parts = taxon.split(" / ")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
883 for i, part in enumerate(parts):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
884 if "unknown" in part.lower() and i < len(parts) - 1:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
885 # there is something more specific after an unknown level
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
886 return True
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
887 return False
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
888
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
889
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
890 def main(arg_list=None):
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
891 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
892 Entry point for Galaxy-compatible BLAST annotation processing.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
893
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
894 :param arg_list: Optional list of command-line arguments to override
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
895 ``sys.argv``. Primarily used for testing or programmatic execution.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
896 If ``None``, arguments are read directly from the command line.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
897 :type arg_list: list[str] | None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
898 :return: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
899 :rtype: None
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
900
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
901 Notes:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
902 Calls `process_single_file` with parsed arguments.
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
903 """
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
904 args = parse_arguments(arg_list)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
905 for output_file in [args.eval_plot, args.header_anno, args.taxa_output, args.circle_data, args.anno_stats]:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
906 if output_file:
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
907 os.makedirs(os.path.dirname(output_file), exist_ok=True)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
908 process_single_file(args.input_anno, args.input_unanno, args)
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
909 print("Processing completed successfully")
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
910
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
911
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
912 if __name__ == "__main__":
a3989edf0a4a planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/process_annotations_tool commit c944fd5685f295acba06679e85b67973c173b137
onnodg
parents:
diff changeset
913 main()