comparison ncbi_egapx.xml @ 0:0ab743c8837f draft

planemo upload for repository https://github.com/richard-burhans/galaxytools/tree/main/tools/ncbi_egapx commit bd7ba5efde8e6fc5104441896d628760b6c54aa0
author richard-burhans
date Mon, 19 Aug 2024 22:53:40 +0000
parents
children e7091c5a8495
comparison
-1:000000000000 0:0ab743c8837f
1 <tool id="ncbi_egapx" name="NCBI EGAPx" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
2 <description>annotates eukaryotic genomes</description>
3 <macros>
4 <import>macros.xml</import>
5 </macros>
6 <expand macro="edam_ontology"/>
7 <expand macro="requirements"/>
8 <command detect_errors="exit_code"><![CDATA[
9 source /galaxy/env.bash &&
10 echo \${PATH} &&
11 ln -s /galaxy/egapx/egapx_config &&
12 python3 /galaxy/egapx/ui/egapx.py '$yamlconfig' -e galaxy -o 'egapx_out'
13 ]]></command>
14 <inputs>
15 <param name="yamlconfig" type="data" optional="false" label="egapx configuration yaml file to execute" help="" format="yaml,txt" multiple="false"/>
16 </inputs>
17 <outputs>
18 <collection name="egapx_out" type="list" label="Outputs from egapx">
19 <discover_datasets pattern="__name_and_ext__" directory="egapx_out" visible="false"/>
20 </collection>
21 </outputs>
22 <tests>
23 <test expect_test_failure="true">
24 <param name="yamlconfig" value="input.yaml"/>
25 <output_collection name="egapx_out" type="list" count="8"/>
26 </test>
27 </tests>
28 <help><![CDATA[
29 Galaxy tool wrapping the Eukaryotic Genome Annotation Pipeline (EGAPx)
30 =================================================================================================
31
32 .. class:: warningmark
33
34 **Proof of concept: a quick hack to run a NF workflow inside a specialised Galaxy tool wrapper**
35
36 EGAPx is a big, complicated Nextflow workflow, challenging and costly to re-implement **properly**, requiring dozens of new tools and replicating a lot of
37 complicated *groovy* workflow logic.
38
39 It is also very new and under rapid development. Investing developer effort in a native re-implementation, and then keeping it updated as EGAPx changes, may be an *inefficient use of developer resources*.
40
41 This wrapper is designed to allow measuring how *inefficient* the wrapped workflow is in terms of computing resource utilisation, compared with the developer effort
42 required to convert the Nextflow DSL into Galaxy tools and workflow logic. Balancing these competing requirements is a fundamental Galaxy challenge.
43
44
45 EGAPx requires very substantial resources to run with real data. *128GB and 32 cores* are the minimum requirement; *256GB and 64 cores* are recommended.
46
47 A special minimal example that can be run in 6GB with 4 cores is provided as a yaml configuration and is used for the tool test.
48
49 In this implementation, the user must supply a yaml configuration file as initial proof of concept.
50 History inputs and even a yaml editor might be provided in future.
51
52 The NF workflow to tool model tested here may be applicable to other NF workflows that take a single configuration yaml.
53
54 .. class:: warningmark
55
56 The computational resource cost of typing the wrong SRA identifiers into a tool form is potentially enormous with this tool!
57
58
59 Sample yaml configurations
60 ===========================
61
62 YAML sample configurations can be uploaded into your Galaxy history from the `EGAPx github repository <https://github.com/ncbi/egapx/tree/main/examples/>`_.
63 The simplest example is reproduced below.
64
65
66 *./examples/input_D_farinae_small.yaml* can be pasted into the upload form to create a yaml dataset in your history.
67 Its RNA-seq data is provided as URIs to the reads FASTA files.
68
69 input_D_farinae_small.yaml
70
71 ::
72
73 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz
74 taxid: 6954
75 reads:
76 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.1
77 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.2
78 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.1
79 - https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2
80
81
82 input_Gavia_stellata.yaml
83
84 ::
85
86 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/936/135/GCF_030936135.1_bGavSte3.hap2/GCF_030936135.1_bGavSte3.hap2_genomic.fna.gz
87 reads: txid37040[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
88 taxid: 37040
89
90 input_C_longicornis.yaml
91
92 ::
93
94 genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/603/195/GCF_029603195.1_ASM2960319v2/GCF_029603195.1_ASM2960319v2_genomic.fna.gz
95 reads: txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession]
96 taxid: 2530218
97
98 Purpose
99 ========
100
101 **This is not intended for production**
102
103 Just a proof of concept.
104 It is possibly too inefficient to be useful although it may turn out not to be a problem if run on a dedicated workstation.
105 At least the efficiency can now be more easily estimated.
106
107 This tool is not recommended for public deployment because of the resource demands.
108
109 EGAPx Overview
110 ===============
111
112 .. image:: $PATH_TO_IMAGES/Pipeline_sm_ncRNA_CAGE_80pct.png
113
114 **Warning:**
115 The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub `Issue <https://github.com/ncbi/egapx/issues>`_ if you encounter any problems with EGAPx. You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions.
116
117 EGAPx is the publicly accessible version of the updated NCBI `Eukaryotic Genome Annotation Pipeline <https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/>`_.
118
119 EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs ``miniprot`` to align protein sequences, and ``STAR`` to align RNA-seq to the assembly. Protein alignments and RNA-seq read alignments are then passed to ``Gnomon`` for gene prediction. In the first step of ``Gnomon``, the short alignments are chained together into putative gene models.
120 In the second step, these predictions are further supplemented by *ab-initio* predictions based on HMM models. The final annotation for the input assembly is produced as a ``gff`` file.
121
122 **Security Notice:**
123
124 EGAPx has dependencies in and outside of its execution path that include several thousand files from the `NCBI C++ toolkit <https://www.ncbi.nlm.nih.gov/toolkit>`_, and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team on risk and if there is concern, consider mitigating options like running via VM or cloud instance.
125
126
127 *To specify an array of NCBI SRA datasets in yaml*
128
129 ::
130
131 reads:
132 - SRR8506572
133 - SRR9005248
134
135
136 *To specify an SRA entrez query*
137
138 ::
139
140 reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )'
141
142
143 **Note:** Both the above examples will have more RNA-seq data than the ``input_D_farinae_small.yaml`` example. To make sure the Entrez query does not return a large number of SRA runs, please run it first at the `NCBI SRA page <https://www.ncbi.nlm.nih.gov/sra>`_. If there are too many SRA runs, select a few of them and list them in the input yaml.
144
145 Output
146 =======
147
148 EGAPx output will appear as a collection in the user history. The main annotation file is called *accept.gff*.
149
150 ::
151
152 accept.gff
153 annot_builder_output
154 nextflow.log
155 run.report.html
156 run.timeline.html
157 run.trace.txt
158 run_params.yaml
159
160
161 The *nextflow.log* is the log file that captures all the process information and their work directories. ``run_params.yaml`` has all the parameters that were used in the EGAPx run. More information about the process time and resources can be found in the other run* files.
162
163 Intermediate files
164 ===================
165 In the log, each line denotes the process that completed in the workflow. The first column (*e.g.* ``[96/621c4b]``) is the subdirectory where the intermediate output files and logs are found for the process in the same line, *i.e.*, ``egapx:miniprot:run_miniprot``. To see the intermediate files for that process, you can go to the work directory path that you had supplied and traverse to the subdirectory ``96/621c4b``:
166
167 ::
168
169 $ aws s3 ls s3://temp_datapath/D_farinae/96/
170 PRE 06834b76c8d7ceb8c97d2ccf75cda4/
171 PRE 621c4ba4e6e87a4d869c696fe50034/
172 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/
173 PRE output/
174 2024-03-27 11:19:18 0
175 2024-03-27 11:19:28 6 .command.begin
176 2024-03-27 11:20:24 762 .command.err
177 2024-03-27 11:20:26 762 .command.log
178 2024-03-27 11:20:23 0 .command.out
179 2024-03-27 11:19:18 13103 .command.run
180 2024-03-27 11:19:18 129 .command.sh
181 2024-03-27 11:20:24 276 .command.trace
182 2024-03-27 11:20:25 1 .exitcode
183 $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/output/
184 2024-03-27 11:20:24 17127134 aligns.paf
185
186
187 ]]></help>
188 <expand macro="citations"/>
189 </tool>
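
The help text above warns that a mistyped reads entry can waste a great deal of compute. As a rough illustration only, the sketch below checks an already-parsed configuration before it is submitted. The function `check_egapx_config` and the `SRA_RUN` pattern are hypothetical and are not part of EGAPx or of this wrapper. The check assumes that `reads` takes only the forms shown in the help: a list of run accessions or URIs, or a single Entrez query string. Parsing the yaml file itself is left out so that the example needs only the standard library.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for SRA run accessions such as SRR8506572 (an assumption,
# not taken from EGAPx).
SRA_RUN = re.compile(r"^[SED]RR\d+$")


def check_egapx_config(cfg):
    """Return a list of problems found in an already-parsed config dict."""
    problems = []

    # The three keys used by every sample configuration in the help text.
    for key in ("genome", "taxid", "reads"):
        if key not in cfg:
            problems.append(f"missing key: {key}")

    taxid = cfg.get("taxid")
    if taxid is not None and not str(taxid).isdigit():
        problems.append(f"taxid is not a plain integer: {taxid!r}")

    reads = cfg.get("reads")
    if isinstance(reads, list):
        for item in reads:
            item = str(item)
            if SRA_RUN.match(item):
                continue  # looks like a run accession
            if urlparse(item).scheme in ("http", "https", "ftp"):
                continue  # looks like a fetchable URI
            problems.append(f"unrecognised reads entry: {item}")
    # A string value is treated as an Entrez query and is not checked here.

    return problems
```

An empty list means nothing suspicious was found by these simple checks. It does not guarantee that the run will succeed, and an Entrez query should still be tried at the NCBI SRA page first.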