view README.md @ 4:04ec86bdac32 draft default tip

planemo upload for repository https://github.com/Onnodg/Naturalis_NLOOR/tree/main/NLOOR_scripts/add_header_tool commit 4017d38cf327c48a6252e488ba792527dae97a70-dirty
author onnodg
date Mon, 15 Dec 2025 17:01:06 +0000
parents f4b8ab4ed24e

# Add Taxonomic Labels Script

This script processes BLAST output files from a **curated BLAST database** and prepares them for downstream taxonomic analysis.

In curated BLAST results, taxonomic labels are often missing or marked as “unknown,” because taxonomy information is stored only in the sequence headers.  
This script extracts that information and appends it to each BLAST result, producing a fully annotated output file.

---

## Usage

Each sequence header in the curated BLAST database stores its taxonomy metadata as `key=value` fields separated by whitespace (for example, `genus=Ranunculus`). The tool reads the source and the taxonomic annotations from this header and writes them into the tabular row, so that the source and taxonomy columns sit in the same positions as in BLAST output generated against a GenBank database.
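
As an illustration of this header layout, the following minimal Python sketch splits a header into ordered `key=value` pairs. The function name `parse_header` and the regular expression are assumptions made for this example, not the tool's actual implementation. The pattern assumes that keys contain only word characters and that a value may contain single spaces, as in `species=Ranunculus repens`.

```python
import re

# Assumption: fields look like `key=value`, are separated by whitespace,
# and a value ends where the next `key=` starts (or at the end of the header).
FIELD_PATTERN = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_header(header):
    """Return the header fields as an ordered list of (key, value) pairs."""
    matches = FIELD_PATTERN.findall(header.strip())
    return [(key, value.strip()) for key, value in matches]


# Shortened version of the header used in the example in this README.
header = (
    "source=NCBI   sequenceID=EU382995   kingdom=Viridiplantae   "
    "genus=Ranunculus   species=Ranunculus repens   markercode=trnL"
)
fields = parse_header(header)
print(fields[0])                # ('source', 'NCBI')
print(dict(fields)["species"])  # Ranunculus repens
```

Keeping the fields as an ordered list rather than only a dictionary preserves their positions, which matters when ranks are selected by position.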

Using the `--taxon_levels` argument, you can specify which header positions correspond to taxonomic ranks (e.g., kingdom, phylum, genus, species).
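
The sketch below shows one way such positions could map to ranks. It rests on an assumption inferred only from the example in this README, not from the tool's source code: positions are zero-based and counted from the first taxonomy field, so the leading `source` and `sequenceID` fields are skipped. Under that reading, the value `"1 2 4 7 11 12 13"` selects kingdom, phylum, class, order, family, genus, and species. The function `select_ranks` is hypothetical.

```python
# Taxonomy values in header order, taken from the example header in this README.
# Assumption: the leading `source` and `sequenceID` fields are not counted.
taxonomy_values = [
    "Eukaryota",          # 0  superkingdom
    "Viridiplantae",      # 1  kingdom
    "Streptophyta",       # 2  phylum
    "Streptophytina",     # 3  subphylum
    "Magnoliopsida",      # 4  class
    "NA",                 # 5  subclass
    "NA",                 # 6  infraclass
    "Ranunculales",       # 7  order
    "NA",                 # 8  suborder
    "NA",                 # 9  infraorder
    "NA",                 # 10 superfamily
    "Ranunculaceae",      # 11 family
    "Ranunculus",         # 12 genus
    "Ranunculus repens",  # 13 species
]


def select_ranks(values, taxon_levels):
    """Pick the values at the given zero-based positions and join them."""
    positions = [int(token) for token in taxon_levels.split()]
    return " / ".join(values[position] for position in positions)


lineage = select_ranks(taxonomy_values, "1 2 4 7 11 12 13")
print(lineage)
```

If your database headers contain a different set or order of fields, the same positions will point at different ranks, so check them against one of your own headers first.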

> ⚠️ **Important:**  
> The `--taxon_levels` argument is critical. Change it only if you fully understand your database’s header structure.



### When to Use

| Database Type              | Need This Script? | Reason |
|-----------------------------|-------------------|--------|
| **Curated BLAST database**  | ✅ Yes            | Taxonomy exists only in headers |
| **GenBank-based BLAST**     | ❌ No             | Taxonomy already included in tabular file |
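
If you are unsure which case applies to a file, a quick check is to see whether its taxonomy column still holds only *unknown* values. The sketch below assumes the taxonomy is the last tab-separated column and that ranks are separated by `/`, as in the example rows in this README. The function `needs_labels` and the abbreviated rows are hypothetical.

```python
def needs_labels(row):
    """Return True when every rank in the last column is still 'unknown'."""
    taxonomy = row.rstrip("\n").split("\t")[-1]
    ranks = [rank.strip() for rank in taxonomy.split("/")]
    return all(rank.startswith("unknown") for rank in ranks)


# Abbreviated, made-up rows; real BLAST rows have more columns.
curated_row = "read_1\tsource=NCBI   genus=Ranunculus\t100.000\tGenbank\tunknown kingdom / unknown genus"
genbank_row = "read_2\tsome hit\t100.000\tGenbank\tViridiplantae / Ranunculus"

print(needs_labels(curated_row))  # True
print(needs_labels(genbank_row))  # False
```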



### Command Line Interface
The add_taxonomic_labels tool can be run as a Python script:

```bash
python add_taxonomic_labels.py \
  --input blast_results.tabular \
  --output labeled_results.tabular \
  --taxon_levels "1 2 4 7 11 12 13"
```

#### General use case

The tool does one thing. In the input example below, the taxonomic information appears only in the sequence header, while the taxonomy column at the end of the row contains only *unknown* values. The tool extracts the taxonomy from the header and writes it into that column, replacing the unknown values. In this example the source column also changes from `Genbank` to `NCBI`, the value of the header's `source` field.

```text
Input
M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=EU382995   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	Genbank	unknown kingdom / unknown phylum / unknown class / unknown order / unknown family / unknown genus / unknown species

Output
M01687:476:000000000-LL5F5:1:1102:12299:1165_CONS(1758)	source=NCBI   sequenceID=EU382995   superkingdom=Eukaryota   kingdom=Viridiplantae   phylum=Streptophyta   subphylum=Streptophytina   class=Magnoliopsida   subclass=NA   infraclass=NA   order=Ranunculales   suborder=NA   infraorder=NA   superfamily=NA   family=Ranunculaceae   genus=Ranunculus   species=Ranunculus repens   markercode=trnL   lat=NA   lon=NA	source=NCBI	N/A	100.000	100	1.24e-38	152	NCBI	Viridiplantae / Streptophyta / Magnoliopsida / Ranunculales / Ranunculaceae / Ranunculus / Ranunculus repens
```
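
The column rewrite in this example can be sketched as follows. This is a simplified illustration, not the tool's code: it assumes the source is the second-to-last and the taxonomy the last tab-separated column, as in the rows above, and that the source and lineage have already been extracted from the header. The function `relabel_row` and the abbreviated row are hypothetical.

```python
def relabel_row(row, source, lineage):
    """Replace the last two columns of a tabular row with source and lineage."""
    columns = row.rstrip("\n").split("\t")
    columns[-2] = source               # e.g. "Genbank" becomes "NCBI"
    columns[-1] = " / ".join(lineage)  # replaces the "unknown ..." values
    return "\t".join(columns)


# Abbreviated, made-up row; real BLAST rows have more columns.
row = "read_1\tsource=NCBI   genus=Ranunculus\t100.000\tGenbank\tunknown genus / unknown species"
labeled = relabel_row(row, "NCBI", ["Ranunculus", "Ranunculus repens"])
print(labeled.split("\t")[-2:])  # ['NCBI', 'Ranunculus / Ranunculus repens']
```

All other columns, including the original header, are left unchanged.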

### Galaxy integration

The tool is also available through the Galaxy platform:

- **Galaxy Toolshed**: The add_taxonomic_labels tool is available in the Galaxy Toolshed, 
  enabling easy installation into any Galaxy instance.
- **Web-based interface**: Users can upload BLAST result files, configure the tool's parameters through the GUI, 
  run the tool, and download the results.
- **Workflow integration**: The tool can be incorporated into Galaxy workflows for automated processing pipelines.

To use the tool in Galaxy:
1. Install the tool from the Galaxy Toolshed (search for "add_taxonomic_labels")
2. Upload your BLAST files to your Galaxy history
3. Configure parameters through the GUI
4. Run the tool
5. View results and use the reformatted BLAST file for downstream analysis

## License

No license has been specified yet.

## Citation

If you use this software in your research, please cite this repository.

## Contact

For questions or issues:
- GitHub Issues: https://github.com/Onnodg/Naturalis_NLOOR/issues
- Email: onno.gorter@naturalis.nl (until February 2026)

## Acknowledgments

This tool was developed to support the New lights on old remedies project, a PhD project by Anja Fischer.