diff CHANGELOG.md @ 3:359232b58f6a draft
"Update Galaxy tool wrapper to follow the IUC best practices"
| author | althonos |
|---|---|
| date | Sun, 21 Nov 2021 19:47:22 +0000 |
| parents | |
| children | 169849dfb098 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/CHANGELOG.md Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,299 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...master
+
+## [v0.8.5] - 2021-11-21
+[v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5
+### Added
+- Minimal compatibility support for running GECCO inside of Galaxy workflows.
+
+## [v0.8.4] - 2021-09-26
+[v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4
+### Fixed
+- `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)).
+- `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input.
+### Changed
+- Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported.
+
+## [v0.8.3-post1] - 2021-08-23
+[v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1
+### Fixed
+- Wrong default value for `--threshold` being shown in `gecco run` help message.
+
+## [v0.8.3] - 2021-08-23
+[v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3
+### Changed
+- Default probability threshold for segmentation to 0.3 (from 0.4).
+
+## [v0.9.0] - 2021-08-10 - **YANKED**
+[v0.9.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.9.0
+### Changed
+- Retrain internal model using `--select=0.35` instead of `--select=0.25` like before.
+- Change default *p-value* filter from 1e-9 to 1e-5 to detect more features.
+
+## [v0.8.2] - 2021-07-31
+[v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2
+### Fixed
+- `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class.
+### Changed
+- `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag.
+- `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier.
+
+## [v0.8.1] - 2021-07-29
+[v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1
+### Changed
+- `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`.
+### Fixed
+- `gecco` reporting about using Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`.
+### Added
+- Missing documentation for the `strand` attribute of `gecco.model.Gene`.
+
+## [v0.8.0] - 2021-07-03
+[v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0
+### Changed
+- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
+- Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling.
+- Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms.
+- Use p-values instead of e-values to filter domains obtained with HMMER.
+- `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data.
+### Fixed
+- Extraction of BGC compositions for the type predictor while training.
+- `ClusterCRF.trained` failing to open an external model.
+### Added
+- `Domain.pvalue` attribute to access the p-value of a domain annotation.
+- Mandatory `pvalue` column to `FeatureTable` objects.
+- Support for loading several feature tables in `gecco train` and `gecco cv`.
+- Warnings to `ClusterCRF.fit` when selecting uninformative features.
+- `--correction` flag to `gecco train` and `gecco cv`, allowing to give a multiple testing correction method when computing p-values with the Fisher Exact Tests.
+### Removed
+- Outdated `gecco embed` command.
+- Unused `--truncate` flag from the `gecco train` CLI.
+- Tigrfam domains, which are not improving performance on the new training data.
+
+## [v0.7.0] - 2021-05-31
+[v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0
+### Added
+- Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow.
+- Code for converting GenBank files in BiG-SLiCE compatible format with the `gecco convert` subcommand.
+- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
+### Changed
+- Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling.
+- Internal domain composition shipped in the `gecco.types` with newer composition array obtained directly from MIBiG files.
+### Removed
+- Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command.
+
+## [v0.6.3] - 2021-05-10
+[v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3
+### Fixed
+- HMMER annotation not properly handling inputs with multiple contigs.
+- Some progress bar totals displaying as floats in the CLI.
+### Changed
+- `PyHMMER` now sets the `Z` and `domZ` values from the number of proteins given to the search pipeline.
+- `gecco.cli` delegates imports to make CLI more responsive.
+- `pkg_resources` has been replaced with `importlib.resources` and `importlib.metadata` where applicable.
+- `multiprocessing.cpu_count` has been replaced with `os.cpu_count` where applicable.
+
+## [v0.6.2] - 2021-05-04
+[v0.6.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.1...v0.6.2
+### Fixed
+- `gecco cv loto` crashing because of outdated code.
+### Changed
+- Logging-style prompt will only display if GECCO is running with `-vv` flag.
+### Added
+- GECCO bioRxiv paper reference to `Cluster.to_seq_record` output record.
+
+## [v0.6.1] - 2021-03-15
+[v0.6.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.0...v0.6.1
+### Fixed
+- Progress bar not being disabled by `-q` flag in CLI.
+- Fallback to using HMM name if accession is not available in `PyHMMER`.
+- Group genes by source contig and process them separately in `PyHMMER` to avoid bogus E-values.
+### Added
+- `psutil` dependency to get the number of physical CPU cores on the host machine.
+- Support for using an arbitrary mapping of positives to negatives in `gecco embed`.
+### Removed
+- Unused and outdated `HMMER` and `DomainRow` classes from `gecco.hmmer`.
+
+## [v0.6.0] - 2021-02-28
+[v0.6.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.5...v0.6.0
+### Changed
+- Updated internal model with a cleaned-up version of the MIBiG-2.0 + Pfam-33.1/Tigrfam-15.0 embedding.
+- Updated internal InterPro catalog.
+### Fixed
+- Features not being grouped together in `gecco cv` and `gecco train`
+  when provided with a feature table where rows were not sorted by
+  protein IDs.
+
+## [v0.5.5] - 2021-02-28
+[v0.5.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.4...v0.5.5
+### Fixed
+- `gecco cv` bug causing only the last fold to be written.
+
+## [v0.5.4] - 2021-02-28
+[v0.5.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.3...v0.5.4
+### Changed
+- Replaced `verboselogs`, `coloredlogs` and `better-exceptions` with `rich`.
+### Removed
+- `tqdm` training dependency.
+### Added
+- `gecco annotate` command to produce a feature table from a genomic file.
+- `gecco embed` to embed BGCs into non-BGC regions using feature tables.
+
+## [v0.5.3] - 2021-02-21
+[v0.5.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...v0.5.3
+### Fixed
+- Coordinates of genes in output GenBank files.
+- Potential issue with the number of CPUs in `PyHMMER.run`.
+### Changed
+- Bump required `pyrodigal` version to `v0.4.2` to fix buffer overflow.
+
+## [v0.5.2] - 2021-01-29
+[v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2
+### Added
+- Support for downloading HMM files directly from GitHub releases assets.
+- Validation of filtered HMMs with MD5 checksum.
+### Fixed
+- Invalid coordinates of protein domains in GenBank output files.
+- `gecco.interpro` module not being added to wheel distribution.
+### Changed
+- Bump required `pyhmmer` version to `v0.2.1`.
+
+## [v0.5.1] - 2021-01-15
+[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1
+### Fixed
+- `--hmm` flag being ignored in `gecco run` command.
+- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.
+
+## [v0.5.0] - 2021-01-11
+[v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0
+### Added
+- Explicit support for Python 3.9.
+### Changed
+- [`pyhmmer`](https://pypi.org/project/pyhmmer) is used to annotate protein sequences instead of HMMER3 binary `hmmsearch`.
+- HMM files are stored in binary format to speed up parsing and reduce storage size.
+- `tqdm` is now a *training*-only dependency.
+- `gecco cv` now requires *training* dependencies.
+
+## [v0.4.5] - 2020-11-23
+[v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.4...v0.4.5
+### Added
+- Additional `fold` column to cross-validation table output.
+### Changed
+- Use sequence ID instead of protein ID to extract type from cluster in `gecco cv`.
+- Install HMM data in pre-pressed format to make `hmmsearch` runs faster on short sequences.
+- `gecco.orf` was rewritten to extract genes from input sequences in parallel.
+
+## [v0.4.4] - 2020-09-30
+[v0.4.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.3...v0.4.4
+### Added
+- `gecco cv loto` command to run LOTO cross-validation using BGC types
+  for stratification.
+- `header` keyword argument to `FeatureTable.dump` and `ClusterTable.dump`
+  to write the table without the column header allowing to append to an
+  existing table.
+- `__getitem__` implementation for `FeatureTable` and `ClusterTable`
+  that returns a single row or a sub-table from a table.
+### Fixed
+- `gecco cv` command now writes results iteratively instead of holding
+  the tables for every fold in memory.
+### Changed
+- Bumped `pandas` training dependency to `v1.0`.
+
+## [v0.4.3] - 2020-09-07
+[v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3
+### Fixed
+- GenBank files being written with invalid `/cds` feature type.
+### Changed
+- Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet`
+  and breaks the current code.
+
+## [v0.4.2] - 2020-08-07
+[v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2
+### Fixed
+- `TypeClassifier.predict_types` using inverse type probabilities when
+  given several clusters to process.
+
+## [v0.4.1] - 2020-08-07
+[v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1
+### Fixed
+- `gecco run` command crashing on input sequences not containing any genes.
+
+## [v0.4.0] - 2020-08-06
+[v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0
+### Added
+- `gecco.model.ProductType` enum to model the biosynthetic class of a BGC.
+### Removed
+- `pandas` interaction from internal data model.
+- `ClusterCRF` code specific to cross-validation.
+### Changed
+- `pandas`, `fisher` and `statsmodels` dependencies are now optional.
+- `gecco train` command expects a cluster table in addition to the feature
+  table to know the types of the input BGCs.
+
+## [v0.3.0] - 2020-08-03
+[v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0
+### Changed
+- Replaced Nearest-Neighbours classifier with Random Forest to perform type
+  prediction for candidate BGCs.
+- `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`.
+### Fixed
+- Extraction of domain composition taking a long time in `gecco train` command.
+### Removed
+- `--metric` argument to the `gecco run` CLI command.
+
+## [v0.2.2] - 2020-07-31
+[v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2
+### Changed
+- `Domain` and `Gene` can now carry qualifiers that are used when they
+  are translated to a sequence feature.
+### Added
+- InterPro names, accessions, and HMMER e-value for each annotated domain
+  in GenBank output files.
+
+## [v0.2.1] - 2020-07-23
+[v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1
+### Fixed
+- Various potential crashes in `ClusterRefiner` code.
+### Removed
+- Unneeded feature dictionary filtering in `ClusterCRF` for models with
+  Fisher Exact Test feature selection.
+
+## [v0.2.0] - 2020-07-23
+[v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0
+### Fixed
+- `pandas` warning about unsorted columns in `gecco run`.
+### Removed
+- `Gene.probability` property, replaced by `Gene.maximum_probability` and
+  `Gene.average_probability` properties to be explicit.
+### Changed
+- Internal model now uses `Pfam` and `Tigrfam` with the top 35% features
+  selected with Fisher's Exact Test.
+- `ClusterRefiner` now removes genes on `Cluster` edges if they do not
+  contain any domain annotation.
+
+## [v0.1.1] - 2020-07-22
+[v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1
+### Added
+- `ClusterCRF.predict_probabilities` to annotate a list of `Gene`.
+### Changed
+- BGC probability is now stored at the `Domain` level instead of at the `Gene`
+  level, independently of the feature extraction level used by the CRF.
+- `ClusterKNN` will use the model path provided to `gecco run` if any.
+### Docs
+- Added this changelog file to document changes in the code.
+- Added documentation to `gecco` submodules missing some.
+- Included the `CHANGELOG.md` file to the generated docs.
+
+## [v0.1.0] - 2020-07-17
+[v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0
+Initial release.
+
+## [v0.0.1] - 2018-08-13
+[v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1
+Proof-of-concept.