Previous changeset 2:e618ab1c78d9 (2021-11-21) Next changeset 4:88dc16b4f583 (2021-12-11)
Commit message:
"Update Galaxy tool wrapper to follow the IUC best practices"
modified:
README.rst gecco.xml
added:
CHANGELOG.md test-data/sideload.json
diff -r e618ab1c78d9 -r 359232b58f6a CHANGELOG.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/CHANGELOG.md Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,299 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...master
+
+## [v0.8.5] - 2021-11-21
+[v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5
+### Added
+- Minimal compatibility support for running GECCO inside of Galaxy workflows.
+
+## [v0.8.4] - 2021-09-26
+[v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4
+### Fixed
+- `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)).
+- `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input.
+### Changed
+- Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported.
+
+## [v0.8.3-post1] - 2021-08-23
+[v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1
+### Fixed
+- Wrong default value for `--threshold` being shown in `gecco run` help message.
+
+## [v0.8.3] - 2021-08-23
+[v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3
+### Changed
+- Default probability threshold for segmentation to 0.3 (from 0.4).
+
+## [v0.9.0] - 2021-08-10 - **YANKED**
+[v0.9.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.9.0
+### Changed
+- Retrain internal model using `--select=0.35` instead of `--select=0.25` like before.
+- Change default *p-value* filter from 1e-9 to 1e-5 to detect more features.
+
+## [v0.8.2] - 2021-07-31
+[v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2
+### Fixed
+- `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class.
+### Changed
+- `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag.
+- `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier.
+
+## [v0.8.1] - 2021-07-29
+[v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1
+### Changed
+- `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`.
+### Fixed
+- `gecco` reporting about using Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`.
+### Added
+- Missing documentation for the `strand` attribute of `gecco.model.Gene`.
+
+## [v0.8.0] - 2021-07-03
+[v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0
+### Changed
+- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
+- Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling.
+- Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms.
+- Use p-values instead of e-values to filter domains obtained with HMMER.
+- `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data.
+### Fixed
+- Extraction of BGC compositions for the type predictor while training.
+- `ClusterCRF.trained` failing to open an external model.
+### Added
+- `Domain.pvalue` attribute to access the p-value of a domain annotation.
+- Mandatory `pvalue` column to `FeatureTable` objects.
+- Support for loading several feature tables in `gecco train` and `gecco cv`.
+- Warnings to `ClusterCRF.fit` when selecting uninformative features.
+- `--correction` flag to `gecco train` and `gecco cv`, allowing to give a multiple testing correction method when computing p-values with the Fisher Exact Tests.
+### Removed
+- Outdated `gecco embed` command.
+- Unused `--truncate` flag from the `gecco train` CLI.
+- Tigrfam domains, which is not improving performance on the new t
..
entation for `FeatureTable` and `ClusterTable`
+ that returns a single row or a sub-table from a table.
+### Fixed
+- `gecco cv` command now writes results iteratively instead of holding
+ the tables for every fold in memory.
+### Changed
+- Bumped `pandas` training dependency to `v1.0`.
+
+## [v0.4.3] - 2020-09-07
+[v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3
+### Fixed
+- GenBank files being written with invalid `/cds` feature type.
+### Changed
+- Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet`
+ and breaks the current code.
+
+## [v0.4.2] - 2020-08-07
+[v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2
+### Fixed
+- `TypeClassifier.predict_types` using inverse type probabilities when
+ given several clusters to process.
+
+## [v0.4.1] - 2020-08-07
+[v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1
+### Fixed
+- `gecco run` command crashing on input sequences not containing any genes.
+
+## [v0.4.0] - 2020-08-06
+[v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0
+### Added
+- `gecco.model.ProductType` enum to model the biosynthetic class of a BGC.
+### Removed
+- `pandas` interaction from internal data model.
+- `ClusterCRF` code specific to cross-validation.
+### Changed
+- `pandas`, `fisher` and `statsmodels` dependencies are now optional.
+- `gecco train` command expects a cluster table in addition to the feature
+ table to know the types of the input BGCs.
+
+## [v0.3.0] - 2020-08-03
+[v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0
+### Changed
+- Replaced Nearest-Neighbours classifier with Random Forest to perform type
+ prediction for candidate BGCs.
+- `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`.
+### Fixed
+- Extraction of domain composition taking a long time in `gecco train` command.
+### Removed
+- `--metric` argument to the `gecco run` CLI command.
+
+## [v0.2.2] - 2020-07-31
+[v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2
+### Changed
+- `Domain` and `Gene` can now carry qualifiers that are used when they
+ are translated to a sequence feature.
+### Added
+- InterPro names, accessions, and HMMER e-value for each annotated domain
+ in GenBank output files.
+
+## [v0.2.1] - 2020-07-23
+[v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1
+### Fixed
+- Various potential crashes in `ClusterRefiner` code.
+### Removed
+- Uneeded feature dictionary filtering in `ClusterCRF` for models with
+ Fisher Exact Test feature selection.
+
+## [v0.2.0] - 2020-07-23
+[v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0
+### Fixed
+- `pandas` warning about unsorted columns in `gecco run`.
+### Removed
+- `Gene.probability` property, replaced by `Gene.maximum_probability` and
+ `Gene.average_probability` properties to be explicit.
+### Changed
+- Internal model now uses `Pfam` and `Tigrfam` with the top 35% features
+ selected with Fisher's Exact Test.
+- `ClusterRefiner` now removes genes on `Cluster` edges if they do not
+ contain any domain annotation.
+
+## [v0.1.1] - 2020-07-22
+[v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1
+### Added
+- `ClusterCRF.predict_probabilities` to annotate a list of `Gene`.
+### Changed
+- BGC probability is now stored at the `Domain` level instead of at the `Gene`
+ level, independently of the feature extraction level used by the CRF.
+- `ClusterKNN` will use the model path provided to `gecco run` if any.
+### Docs
+- Added this changelog file to document changes in the code.
+- Added documentation to `gecco` submodules missing some.
+- Included the `CHANGELOG.md` file to the generated docs.
+
+## [v0.1.0] - 2020-07-17
+[v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0
+Initial release.
+
+## [v0.0.1] - 2018-08-13
+[v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1
+Proof-of-concept.
diff -r e618ab1c78d9 -r 359232b58f6a README.rst
--- a/README.rst Sun Nov 21 17:40:58 2021 +0000
+++ b/README.rst Sun Nov 21 19:47:22 2021 +0000
@@ -14,7 +14,7 @@
 Fields (CRFs).
 
 |GitLabCI| |License| |Coverage| |Docs| |Source| |Mirror| |Changelog|
-|Issues| |Preprint| |PyPI| |Bioconda| |Versions| |Wheel|
+|Issues| |Preprint| |PyPI| |Bioconda| |Galaxy| |Versions| |Wheel|
 
 🔧 Installing GECCO
 -------------------
@@ -132,3 +132,5 @@
    :target: https://pypi.org/project/gecco-tool/#files
 .. |Wheel| image:: https://img.shields.io/pypi/wheel/gecco-tool?style=flat-square&maxAge=3600
    :target: https://pypi.org/project/gecco-tool/#files
+.. |Galaxy| image:: https://img.shields.io/badge/Galaxy-GECCO-darkblue?style=flat-square&maxAge=3600
+   :target: https://toolshed.g2.bx.psu.edu/repository?repository_id=c29bc911b3fc5f8c
diff -r e618ab1c78d9 -r 359232b58f6a gecco.xml
--- a/gecco.xml Sun Nov 21 17:40:58 2021 +0000
+++ b/gecco.xml Sun Nov 21 19:47:22 2021 +0000
@@ -1,8 +1,8 @@
 <?xml version='1.0' encoding='utf-8'?>
-<tool id="gecco" name="GECCO" version="0.8.4" python_template_version="3.5">
- <description>GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
+<tool id="gecco" name="GECCO" version="0.8.5" python_template_version="3.5">
+ <description>is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
 <requirements>
- <requirement type="package" version="0.8.4">gecco</requirement>
+ <requirement type="package" version="0.8.5">gecco</requirement>
 </requirements>
 <version_command>gecco --version</version_command>
 <command detect_errors="aggressive"><![CDATA[
@@ -14,13 +14,37 @@
 #end if
 ln -s '$input' input_tempfile.$file_extension &&
 
- gecco -vv run -g input_tempfile.$file_extension &&
- mv input_tempfile.features.tsv $features &&
- mv input_tempfile.clusters.tsv $clusters
+ gecco -vv run
+ --format $input.ext
+ --genome input_tempfile.$file_extension
+ --postproc $postproc
+ --force-clusters-tsv
+ #if $cds:
+ --cds $cds
+ #end if
+ #if $threshold:
+ --threshold $threshold
+ #end if
+ #if $antismash_sideload:
+ --antismash-sideload
+ #end if
+
+ && mv input_tempfile.features.tsv '$features'
+ && mv input_tempfile.clusters.tsv '$clusters'
+ #if $antismash_sideload
+ && mv input_tempfile.sideload.json '$sideload'
+ #end if
 
 ]]></command>
 <inputs>
- <param name="input" type="data" format="genbank,fasta" label="Sequence file in GenBank or FASTA format"/>
+ <param name="input" type="data" format="genbank,fasta,embl" label="Sequence file in GenBank, EMBL or FASTA format"/>
+ <param argument="--cds" type="integer" min="0" value="" optional="true" label="Minimum number of genes required for a cluster"/>
+ <param argument="--threshold" type="float" min="0" max="1" value="" optional="true" label="Probability threshold for cluster detection"/>
+ <param argument="--postproc" type="select" label="Post-processing method for gene cluster validation">
+ <option value="antismash">antiSMASH</option>
+ <option value="gecco" selected="true">GECCO</option>
+ </param>
+ <param argument="--antismash-sideload" type="boolean" checked="false" label="Generate an antiSMASH v6 sideload JSON file"/>
 </inputs>
 <outputs>
 <collection name="records" type="list" label="${tool.name} detected Biosynthetic Gene Clusters on ${on_string} (GenBank)">
@@ -28,6 +52,9 @@
 </collection>
 <data name="features" format="tabular" label="${tool.name} summary of detected features on ${on_string} (TSV)"/>
 <data name="clusters" format="tabular" label="${tool.name} summary of detected BGCs on ${on_string} (TSV)"/>
+ <data name="sideload" format="json" label="antiSMASH v6 sideload file with ${tool.name} detected BGCs on ${on_string} (JSON)">
+ <filter>antismash_sideload</filter>
+ </data>
 </outputs>
 <tests>
 <test>
@@ -38,49 +65,48 @@
 <element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
 </output_collection>
 </test>
+ <test>
+ <param name="input" value="BGC0001866.fna"/>
+ <param name="antismash_sideload" value="True"/>
+ <output name="features" file="features.tsv"/>
+ <output name="clusters" file="clusters.tsv"/>
+ <output name="sideload" file="sideload.json"/>
+ <output_coll
..
ed at EMBL.
 
-**Input**
+Input
+-----
 
 GECCO works with DNA sequences, and loads them using Biopython, allowing it to support a large variety of formats, including the common FASTA and GenBank files.
 
-**Output**
+Output
+------
 
 GECCO will create the following files once done (using the same prefix as the input file):
 
-- features.tsv: The features file, containing the identified proteins and domains in the input sequences.
-- clusters.tsv: If any were found, a clusters file, containing the coordinates of the predicted clusters, along their putative biosynthetic type.
-- {sequence}_cluster_{N}.gbk: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
-
-**Contact**
+- ``features.tsv``: The features file, containing the identified proteins and domains in the input sequences.
+- ``clusters.tsv``: If any were found, a clusters file, containing the coordinates of the predicted clusters, along their putative biosynthetic type.
+- ``{sequence}_cluster_{N}.gbk``: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
 
-If you have any question about GECCO, if you run into any issue, or if you would like to make a feature request, please create an issue in the GitHub repository.
-You can also directly contact Martin Larralde via email. If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to
-open a pull request on the GitHub repository.
+Contact
+-------
 
-]]>
- </help>
+If you have any question about GECCO, if you run into any issue, or if you would like to make a feature request, please create an issue in the
+`GitHub repository <https://github.com/zellerlab/gecco>`_. You can also directly contact `Martin Larralde via email <mailto:martin.larralde@embl.de>`_.
+If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to open a pull request on the GitHub repository.
+
+ ]]></help>
 <citations>
- <citation type="bibtex">
-@article {Carroll2021.05.03.442509,
-	author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
-	title = {Accurate de novo identification of biosynthetic gene clusters with GECCO},
-	elocation-id = {2021.05.03.442509},
-	year = {2021},
-	doi = {10.1101/2021.05.03.442509},
-	publisher = {Cold Spring Harbor Laboratory},
-	abstract = {Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision.Competing Interest StatementThe authors have declared no competing interest.},
-	URL = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509},
-	eprint = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509.full.pdf},
-	journal = {bioRxiv}
-}
- </citation>
+ <citation type="doi">10.1101/2021.05.03.442509</citation>
 </citations>
 </tool>
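Note on the new `<command>` template: the sketch below mirrors, in plain Python, how the Cheetah block in this changeset assembles the `gecco run` call. Only the flags come from the diff (`--format`, `--genome`, `--postproc`, `--force-clusters-tsv`, `--cds`, `--threshold`, `--antismash-sideload`); the function name `build_gecco_args` and its parameters are hypothetical and are not part of GECCO or Galaxy.

```python
from typing import List, Optional


def build_gecco_args(
    genome: str,
    fmt: str,
    postproc: str = "gecco",
    cds: Optional[int] = None,
    threshold: Optional[float] = None,
    antismash_sideload: bool = False,
) -> List[str]:
    """Hypothetical helper: build the argv that the wrapper template would emit."""
    # Arguments that the template always passes.
    args = [
        "gecco", "-vv", "run",
        "--format", fmt,
        "--genome", genome,
        "--postproc", postproc,
        "--force-clusters-tsv",
    ]
    # Optional parameters are only added when the user set them
    # (the template uses `#if $cds:` / `#if $threshold:`; here we test for None).
    if cds is not None:
        args += ["--cds", str(cds)]
    if threshold is not None:
        args += ["--threshold", str(threshold)]
    # Boolean switch: the flag takes no value.
    if antismash_sideload:
        args.append("--antismash-sideload")
    return args


default_args = build_gecco_args("input_tempfile.gbk", "genbank")
full_args = build_gecco_args(
    "input_tempfile.fna", "fasta", cds=3, threshold=0.3, antismash_sideload=True
)
print(" ".join(full_args))
```

The explicit `is not None` checks are a choice of this sketch; they may not match the truthiness semantics of the Cheetah `#if` for a value such as `0`.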
diff -r e618ab1c78d9 -r 359232b58f6a test-data/sideload.json
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/sideload.json Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,36 @@
+{
+    "records": [
+        {
+            "name": "BGC0001866.1",
+            "subregions": [
+                {
+                    "details": {
+                        "alkaloid_probability": "0.000",
+                        "average_p": "0.997",
+                        "max_p": "1.000",
+                        "nrp_probability": "0.140",
+                        "other_probability": "0.000",
+                        "polyketide_probability": "0.980",
+                        "ripp_probability": "0.000",
+                        "saccharide_probability": "0.000",
+                        "terpene_probability": "0.000"
+                    },
+                    "end": 32979,
+                    "label": "Polyketide",
+                    "start": 347
+                }
+            ]
+        }
+    ],
+    "tool": {
+        "configuration": {
+            "cds": "3",
+            "e-filter": "None",
+            "postproc": "'gecco'",
+            "threshold": "0.3"
+        },
+        "description": "Biosynthetic Gene Cluster prediction with Conditional Random Fields.",
+        "name": "GECCO",
+        "version": "0.8.4"
+    }
+}
\ No newline at end of file
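Note on the sideload file: a minimal sketch of how the JSON added in this changeset could be read back, assuming only the keys visible in `test-data/sideload.json` (`records`, `subregions`, `details`, `start`, `end`, `label`). The helper names `iter_subregions` and `best_type` are hypothetical; the sample is abridged from the test file, and the probabilities are stored as strings, so they are converted before comparison.

```python
import json

# Abridged copy of test-data/sideload.json from this changeset.
SIDELOAD = """
{
  "records": [
    {
      "name": "BGC0001866.1",
      "subregions": [
        {
          "details": {
            "alkaloid_probability": "0.000",
            "average_p": "0.997",
            "max_p": "1.000",
            "nrp_probability": "0.140",
            "polyketide_probability": "0.980"
          },
          "end": 32979,
          "label": "Polyketide",
          "start": 347
        }
      ]
    }
  ],
  "tool": {"name": "GECCO", "version": "0.8.4"}
}
"""


def iter_subregions(doc):
    """Yield (record name, label, start, end, details) for every subregion."""
    for record in doc.get("records", []):
        for sub in record.get("subregions", []):
            yield record["name"], sub["label"], sub["start"], sub["end"], sub["details"]


def best_type(details):
    """Return (type, probability) for the highest '*_probability' entry."""
    suffix = "_probability"
    probs = {
        key[: -len(suffix)]: float(value)  # values are strings in the file
        for key, value in details.items()
        if key.endswith(suffix)
    }
    return max(probs.items(), key=lambda item: item[1])


doc = json.loads(SIDELOAD)
rows = list(iter_subregions(doc))
for name, label, start, end, details in rows:
    kind, prob = best_type(details)
    print(f"{name}\t{start}-{end}\t{label}\tbest={kind} ({prob:.3f})")
```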