# HG changeset patch # User richard-burhans # Date 1724108020 0 # Node ID 0ab743c8837f98fb465d392a0e58a5f4600ba9da planemo upload for repository https://github.com/richard-burhans/galaxytools/tree/main/tools/ncbi_egapx commit bd7ba5efde8e6fc5104441896d628760b6c54aa0 diff -r 000000000000 -r 0ab743c8837f macros.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/macros.xml Mon Aug 19 22:53:40 2024 +0000 @@ -0,0 +1,20 @@ + + + + quay.io/richard-burhans/egapx:@TOOL_VERSION@ + + + 0.2-alpha + 0 + 21.05 + + + operation_0362 + + + + + 10.1109/SC41405.2020.00043 + + + diff -r 000000000000 -r 0ab743c8837f ncbi_egapx.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_egapx.xml Mon Aug 19 22:53:40 2024 +0000 @@ -0,0 +1,189 @@ + + annotates eukaryotic genomes + + macros.xml + + + + + + + + + + + + + + + + + + + `_. +The simplest possible example is shown below - can be cut/paste into a history dataset in the upload tool. + + +*./examples/input_D_farinae_small.yaml* is shown below and can be cut and pasted into the upload form to create a yaml file. +RNA-seq data is provided as URI to the reads FASTA files. 
+ +input_D_farinae_small.yaml + +:: + + genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/809/275/GCF_020809275.1_ASM2080927v1/GCF_020809275.1_ASM2080927v1_genomic.fna.gz + taxid: 6954 + reads: + - xhttps://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.1 + - xhttps://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR8506572.2 + - xhttps://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.1 + - xhttps://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/EGAP/data/Dermatophagoides_farinae_small/SRR9005248.2 + + +input_Gavia_stellata.yaml + +:: + + genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/936/135/GCF_030936135.1_bGavSte3.hap2/GCF_030936135.1_bGavSte3.hap2_genomic.fna.gz + reads: txid37040[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] + taxid: 37040 + +input_C_longicornis.yaml + +:: + + genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029//603/195/GCF_029603195.1_ASM2960319v2/GCF_029603195.1_ASM2960319v2_genomic.fna.gz + reads: txid2530218[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] + taxid: 2530218 + +Purpose +======== + +**This is not intended for production** + +Just a proof of concept. +It is possibly too inefficient to be useful although it may turn out not to be a problem if run on a dedicated workstation. +At least the efficiency can now be more easily estimated. + +This tool is not recommended for public deployment because of the resource demands. + +EGAPx Overview +=============== + +.. image:: $PATH_TO_IMAGES/Pipeline_sm_ncRNA_CAGE_80pct.png + +**Warning:** +The current version is an alpha release with limited features and organism scope to collect initial feedback on execution. Outputs are not yet complete and not intended for production use. Please open a GitHub [Issue](https://github.com/ncbi/egapx/issues) if you encounter any problems with EGAPx. 
You can also write to cgr@nlm.nih.gov to give us your feedback or if you have any questions. + +EGAPx is the publicly accessible version of the updated NCBI [Eukaryotic Genome Annotation Pipeline](https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/). + +EGAPx takes an assembly fasta file, a taxid of the organism, and RNA-seq data. Based on the taxid, EGAPx will pick protein sets and HMM models. The pipeline runs `miniprot` to align protein sequences, and `STAR` to align RNA-seq reads to the assembly. Protein alignments and RNA-seq read alignments are then passed to `Gnomon` for gene prediction. In the first step of `Gnomon`, the short alignments are chained together into putative gene models. +In the second step, these predictions are further supplemented by *ab-initio* predictions based on HMM models. The final annotation for the input assembly is produced as a `gff` file. + +**Security Notice:** + +EGAPx has dependencies in and outside of its execution path that include several thousand files from the [NCBI C++ toolkit](https://www.ncbi.nlm.nih.gov/toolkit), and more than a million total lines of code. Static Application Security Testing has shown a small number of verified buffer overrun security vulnerabilities. Users should consult with their organizational security team about the risk and, if there is concern, consider mitigating options such as running in a VM or cloud instance. + + +*To specify an array of NCBI SRA datasets in yaml* + +:: + + reads: + - SRR8506572 + - SRR9005248 + + +*To specify an SRA entrez query* + +:: + + reads: 'txid6954[Organism] AND biomol_transcript[properties] NOT SRS024887[Accession] AND (SRR8506572[Accession] OR SRR9005248[Accession] )' + + +**Note:** Both of the examples above will have more RNA-seq data than the `input_D_farinae_small.yaml` example. To make sure the entrez query does not produce a large number of SRA runs, please run it first at the [NCBI SRA page](https://www.ncbi.nlm.nih.gov/sra). 
If there are too many SRA runs, then select a few of them and list them in the input yaml. + +Output +======= + +EGAPx output will appear as a collection in the user history. The main annotation file is called *accept.gff*. + +:: + + accept.gff + annot_builder_output + nextflow.log + run.report.html + run.timeline.html + run.trace.txt + run_params.yaml + + +The *nextflow.log* is the log file that captures information about every process and its work directory. ``run_params.yaml`` has all the parameters that were used in the EGAPx run. More information about the process time and resources can be found in the other run* files. + +## Intermediate files + +In the log, each line denotes a process that completed in the workflow. The first column (_e.g._ `[96/621c4b]`) is the subdirectory where the intermediate output files and logs are found for the process on the same line, _i.e._, `egapx:miniprot:run_miniprot`. To see the intermediate files for that process, you can go to the work directory path that you supplied and traverse to the subdirectory `96/621c4b`: + +:: + + $ aws s3 ls s3://temp_datapath/D_farinae/96/ + PRE 06834b76c8d7ceb8c97d2ccf75cda4/ + PRE 621c4ba4e6e87a4d869c696fe50034/ + $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/ + PRE output/ + 2024-03-27 11:19:18 0 + 2024-03-27 11:19:28 6 .command.begin + 2024-03-27 11:20:24 762 .command.err + 2024-03-27 11:20:26 762 .command.log + 2024-03-27 11:20:23 0 .command.out + 2024-03-27 11:19:18 13103 .command.run + 2024-03-27 11:19:18 129 .command.sh + 2024-03-27 11:20:24 276 .command.trace + 2024-03-27 11:20:25 1 .exitcode + $ aws s3 ls s3://temp_datapath/D_farinae/96/621c4ba4e6e87a4d869c696fe50034/output/ + 2024-03-27 11:20:24 17127134 aligns.paf + + + ]]> + + diff -r 000000000000 -r 0ab743c8837f test-data/input.yaml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/input.yaml Mon Aug 19 22:53:40 2024 +0000 @@ -0,0 +1,1 @@ +#