# HG changeset patch
# User iuc
# Date 1552555008 14400
# Node ID 0968856c687c1b549798f22d0a481709a47c707f
planemo upload for repository https://github.com/galaxyproject/tools-iuc/blob/master/tool_collections/kraken2/kraken2/ commit 6ad48e582972ec27cdc0d401f877dfe172057231

diff -r 000000000000 -r 0968856c687c README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,42 @@
+Introduction
+============
+Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the $k$-mers within a query sequence and uses the information within those $k$-mers to query a database. That database maps $k$-mers to the lowest common ancestor (LCA) of all genomes known to contain a given $k$-mer.
+
+The first version of Kraken used a large indexed and sorted list of $k$-mer/LCA pairs as its database. While fast, the large memory requirements posed some problems for users, and so Kraken 2 was created to provide a solution to those problems.
+
+Kraken 2 differs from Kraken 1 in several important ways:
+
+1. Only minimizers of the $k$-mers in the query sequences are used as database queries. Similarly, only minimizers of the $k$-mers in the reference sequences in the database's genomic library are stored in the database. We will also refer to the minimizers as $\ell$-mers, where $\ell \leq k$. All $k$-mers are considered to have the same LCA as their minimizer's database LCA value.
+2. Kraken 2 uses a compact hash table that is a probabilistic data structure. This means that occasionally, database queries will fail by either returning the wrong LCA, or by not reporting a search failure when a queried minimizer was never actually stored in the database. By incurring the risk of these false positives in the data structure, Kraken 2 is able to achieve faster speeds and lower memory requirements. Users should be aware that database false positive errors occur in less than 1% of queries, and can be compensated for by use of confidence scoring thresholds.
+3. Kraken 2 has the ability to build a database from amino acid sequences and perform a translated search of the query sequences against that database.
+4. Kraken 2 utilizes spaced seeds in the storage and querying of minimizers to improve classification accuracy.
+5. Kraken 2 provides support for "special" databases that are not based on NCBI's taxonomy. These are currently limited to three popular 16S databases.
+
+Because Kraken 2 only stores minimizers in its hash table, and $k$ can be much larger than $\ell$, only a small percentage of the possible $\ell$-mers in a genomic library are actually deposited in the database. This creates a situation similar to the Kraken 1 "MiniKraken" databases; however, preliminary testing has shown the accuracy of a reduced Kraken 2 database to be quite similar to the full-sized Kraken 2 database, while Kraken 1's MiniKraken databases often resulted in a substantial loss of per-read sensitivity.
+
+The Kraken 2 paper is currently under preparation. Until it is released, please cite the original Kraken paper if you use Kraken 2 in your research. Thank you!
+Page: https://ccb.jhu.edu/software/kraken2/
+
+System Requirements
+===================
+- Disk space: Construction of a Kraken 2 standard database requires approximately 100 GB of disk space. A test on 01 Jan 2018 of the default installation showed 42 GB of disk space was used to store the genomic library files, 26 GB was used to store the taxonomy information from NCBI, and 29 GB was used to store the Kraken 2 compact hash table.
+
+- As with Kraken 1, we strongly advise against storing the Kraken 2 database on NFS storage if at all possible.
+
+- Memory: To run efficiently, Kraken 2 requires enough free memory to hold the database (primarily the hash table) in RAM. While this can be accomplished with a ramdisk, Kraken 2 will by default load the database into process-local RAM; the --memory-mapping switch to kraken2 will avoid doing so. The default database size is 29 GB (as of Jan. 2018), and you will need slightly more than that in RAM if you want to build the default database.
+
+- Dependencies: Kraken 2 currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++11, and must be compiled with a version of g++ recent enough to support C++11. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and rsync. Most Linux systems will have all of the above listed programs and development libraries available either by default or via package download.
+
+- Unlike Kraken 1, Kraken 2 does not use an external $k$-mer counter. However, by default, Kraken 2 will attempt to use the dustmasker or segmasker programs provided as part of NCBI's BLAST suite to mask low-complexity regions (see [Masking of Low-complexity Sequences]).
+
+- MacOS NOTE: MacOS and other non-Linux operating systems are not explicitly supported by the developers, and MacOS users should refer to the Kraken-users group for support in installing the appropriate utilities to allow for full operation of Kraken 2. We will attempt to use MacOS-compliant code when possible, but development and testing time is at a premium and we cannot guarantee that Kraken 2 will install and work to its full potential on a default installation of MacOS.
+
+- In particular, we note that the default MacOS X installation of GCC does not have support for OpenMP. Without OpenMP, Kraken 2 is limited to single-threaded operation, resulting in slower build and classification runtimes.
+
+- Network connectivity: Kraken 2's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as ftp_proxy or RSYNC_PROXY) in order to get these commands to work properly.
+
+- Kraken 2's scripts default to using rsync for most downloads; however, you may find that your network situation prevents use of rsync. In such cases, you can try the --use-ftp option to kraken2-build to force the downloads to occur via FTP.
+
+- MiniKraken: At present, users with low-memory computing environments can replicate the "MiniKraken" functionality of Kraken 1 in two ways: first, by increasing the value of $k$ with respect to $\ell$ (using the --kmer-len and --minimizer-len options to kraken2-build); and secondly, through downsampling of minimizers (from both the database and query sequences) using a hash function. This second option is performed if the --max-db-size option to kraken2-build is used; however, the two options are not mutually exclusive. In a difference from Kraken 1, Kraken 2 does not require building a full database and then shrinking it to obtain a reduced database.
+
+
diff -r 000000000000 -r 0968856c687c kraken2.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/kraken2.xml Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,147 @@
+ + + + assign taxonomic labels to sequencing reads + + + macros.xml + + + kraken2 + + kraken2 --version + + '${output}' + ]]> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ + + +
+ + + (split_reads) + + + (split_reads) + + + (report['create_report']) + + + + + + + + + + + + + + + + + + + + + +
diff -r 000000000000 -r 0968856c687c macros.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/macros.xml Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,19 @@
+ + + 2.0.7_beta + + fasta,fastq,fasta.gz,fasta.bz2,fastq.gz,fastq.bz2,fastqsanger + + + + + + + + + + + 10.1186/gb-2014-15-3-r46 + + +
diff -r 000000000000 -r 0968856c687c test-data/kraken2_databases.loc
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/kraken2_databases.loc Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,6 @@
+# Tab separated with three columns:
+# - value (Galaxy records this in the Galaxy DB)
+# - name (Galaxy shows this in the UI)
+# - path (folder name containing the Kraken DB)
+#
+test_entry	"Test Database"	${__HERE__}/test_db
diff -r 000000000000 -r 0968856c687c test-data/kraken_test1.fa
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/kraken_test1.fa Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,70 @@
+>gi|145231|gb|M33724.1|ECOALPHOA Escherichia coli K-12 truncated PhoA (phoA) gene, partial cds; and transposon Mu dI, partial sequence +CAAAGCTCCGGGCCTCACCCAGGCGCTAAATACCAAAGATGGCGCAGTGATGGTGATGAGTTACGGGAAC +TCCGAAGAGGATTCACAAGAACATACCGGCAGTCAGTTGCGTATTGCGGCGTATGGCCCGCATGCCGCCA +ATGAAGCGGCGCACGAAAAACGCGAAAGCGT + +>gi|145232|gb|M33725.1|ECOALPHOB Escherichia coli K12 phoA pseudogene and transposon Mu dl-R, partial sequence +CTGTCATAAAGTTGTCACGGCCGAGACTTATAGTCGCTTTGTTTTTATTTTTTAATGTATTTGTACATGG +AGAAAATAAAGTGAAACAAAGCACTATTGCACTGGCACTCTTACCGTTACTGTTTACCCCTGTGACAAAA +GCCCGGACACCAGTGAAGCGGCGCACGAAAAACGCGAAAGCGT + +>gi|145234|gb|M33727.1|ECOALPHOE Escherichia coli K12 upstream sequence of psiA5::Mu dI.
is identical to psiA30 upstream sequence; putative (phoA) pseudogene and transposon Mu dl-R, partial sequence +TTGTTTTTATTTTTTAATGTATTTGTACATGGAGAAAATAAAGTGAAACAAAGCACTATTGCACTGGTGA +AGCGGCGCACGAAAAACGCGAAAGCGT + +>gi|146195|gb|J01619.1|ECOGLTA Eschericia coli gltA gene, sdhCDAB operon and sucABCD operons, complete sequence +GAATTCGACCGCCATTGCGCAAGGCATCGCCATGACCAGGCAGGATACAAAAGAGAGTCGATAAATATTC +ACGGTGTCCATACCTGATAAATATTTTATGAAAGGCGGCGATGATGCCGCAAAATAATACTTATTTATAA +TCCAGCACGTAGGTTGCGTTAGCGGTTACTTCACCTGCCGTGACATCGACTGCATTATCAATTTGTTCCA +TCCAGGCGAAAAAGTTCAGCGTCTGTTCTGATGAGCTTGCATCCAGGTCAAGATCTGGCGCGGCTGAACC +TAATACGATGTTACCGTCATTTTTGTCCATCAGTCGTACACCGACCCCAGTTGCTTCGCCTGCACTGGTG +TTGCTCAACAAAGGCGTAGCACCAGTTGTCTTAGCCGTGCTATCGAAGGTTACGCCAAACTTTGGATACC +GGCATTCCGCTACCGTTGTCAGAAGCAGGCAGATCACAGTTGATCAAGCGAATGTCGACGGCCACTTTAT +TGCTATGATGCTCCCGGTTTATATGGGTTGTCGTGACTTGTCCAAGATCTATGTTTTTATCAATATCTTC +TGGATGAATTTCACAAGGTGCTTCAATAACCTCCCCCTTAAAGTGAATTTCGCCAGAACCTTCATCAGCA +GCATAAACAGGTGCAGTGAACAGCAGAGATACGGCCAGTGCGGCCAATGTTTTTTGTCCTTTAAACATAA +CAGAGTCCTTTAAGGATATAGAATAGGGGTATAGCTACGCCAGAATATCGTATTTGATTATTGCTAGTTT +TTAGTTTTGCTTAAAAAATATTGTTAGTTTTATTAAATTGGAAAACTAAATTATTGGTATCATGAATTGT +TGTATGATGATAAATATAGGGGGGATATGATAGACGTCATTTTCATAGGGTTATAAAATGCGACTACCAT +GAAGTTTTTAATTCAAAGTATTGGGTTGCTGATAATTTGAGCTGTTCTATTCTTTTTAAATATCTATATA +GGTCTGTTAATGGATTTTATTTTTACAAGTTTTTTGTGTTTAGGCATATAAAAATCAAGCCCGCCATATG +AACGGCGGGTTAAAATATTTACAACTTAGCAATCGAACCATTAACGCTTGATATCGCTTTTAAAGTCGCG +TTTTTCATATCCTGTATACAGCTGACGCGGACGGGCAATCTTCATACCGTCACTGTGCATTTCGCTCCAG +TGGGCGATCCAGCCAACGGTACGTGCCATTGCGAAAATGACGGTGAACATGGAAGACGGAATACCCATCG +CTTTCAGGATGATACCAGAGTAGAAATCGACGTTCGGGTACAGTTTCTTCTCGATAAAGTACGGGTCGTT +CAGCGCGATGTTTTCCAGCTCCATAGCCACTTCCAGCAGGTCATCCTTCGTGCCCAGCTCTTTCAGCACT +TCATGGCAGGTTTCACGCATTACGGTGGCGCGCGGGTCGTAATTTTTGTACACGCGGTGACCGAAGCCCA +TCAGGCGGAAAGAATCATTTTTGTCTTTCGCACGACGAAAAAATTCCGGAATGTGTTTAACGGAGCTGAT +TTCTTCCAGCATTTTCAGCGCCGCTTCGTTAGCACCGCCGTGCGCAGGTCCCCACAGTGAAGCAATACCT 
+GCTGCGATACAGGCAAACGGGTTCGCACCCGAAGAGCCAGCGGTACGCACGGTGGAGGTAGAGGCGTTCT +GTTCATGGTCAGCGTGCAGGATCAGAATACGGTCCATAGCACGTTCCAGAATCGGATTAACTTCATACGG +TTCGCACGGCGTGGAGAACATCATATTCAGGAAGTTACCGGCGTAGGAGAGATCGTTGCGCGGGTAAACA +AATGGCTGACCAATGGAATACTTGTAACACATCGCGGCCATGGTCGGCATTTTCGACAGCAGGCGGAACG +CGGCAATTTCACGGTGACGAGGATTGTTAACATCCAGCGAGTCGTGATAGAACGCCGCCAGCGCGCCGGT +AATACCACACATGACTGCCATTGGATGCGAGTCGCGACGGAAAGCATGGAACAGACGGGTAATCTGCTCG +TGGATCATGGTATGACGGGTCACCGTAGTTTTAAATTCGTCATACTGTTCCTGAGTCGGTTTTTCACCAT +TCAGCAGGATGTAACAAACTTCCAGGTAGTTAGAATCGGTCGCCAGCTGATCGATCGGGAAACCGCGGTG +CAGCAAAATACCTTCATCACCATCAATAAAAGTAATTTTAGATTCGCAGGATGCGGTTGAAGTGAAGCCT +GGGTCAAAGGTGAACACACCTTTTGAACCGAGAGTACGGATATCAATAACATCTTGACCCAGCGTGCCTT +TCAGCACATCCAGTTCAACAGCTGTATCCCCGTTGAGGGTGAGTTTTGCTTTTGTATCAGCCATTTAAGG +TCTCCTTAGCGCCTTATTGCGTAAGACTGCCGGAACTTAAATTTGCCTTCGCACATCAACCTGGCTTTAC +CCGTTTTTTATTTGGCTCGCCGCTCTGTGAAAGAGGGGAAAACCTGGGTACAGAGCTCTGGGCGCTTGCA +GGTAAAGGATCCATTGATGACGAATAAATGGCGAATCAAGTACTTAGCAATCCGAATTATTAAACTTGTC +TACCACTAATAACTGTCCCGAATGAATTGGTCAATACTCCACACTGTTACATAAGTTAATCTTAGGTGAA +ATACCGACTTCATAACTTTTACGCATTATATGCTTTTCCTGGTAATGTTTGTAACAACTTTGTTGAATGA +TTGTCAAATTAGATGATTAAAAATTAAATAAATGTTGTTATCGTGACCTGGATCACTGTTCAGGATAAAA +CCCGACAAACTATATGTAGGTTAATTGTAATGATTTTGTGAACAGCCTATACTGCCGCCAGTCTCCGGAA +CACCCTGCAATCCCGAGCCACCCAGCGTTGTAACGTGTCGTTTTCGCATCTGGAAGCAGTGTTTTGCATG +ACGCGCAGTTATAGAAAGGACGCTGTCTGACCCGCAAGCAGACCGGAGGAAGGAAATCCCGACGTCTCCA +GGTAACAGAAAGTTAACCTCTGTGCCCGTAGTCCCCAGGGAATAATAAGAACAGCATGTGGGCGTTATTC +ATGATAAGAAATGTGAAAAAACAAAGACCTGTTAATCTGGACCTACAGACCATCCGGTTCCCCATCACGG +CGATAGCGTCCATTCTCCATCGCGTTTCCGGTGTGATCACCTTTGTTGCAGTGGGCATCCTGCTGTGGCT +TCTGGGTACCAGCCTCTCTTCCCCTGAAGGTTTCGAGCAAGCTTCCGCGATTATGGGCAGCTTCTTCGTC +AAATTTATCATGTGGGGCATCCTTACCGCTCTGGCGTATCACGTCGTCGTAGGTATTCGCCACATGATGA +TGGATTTTGGCTATCTGGAAGAAACATTCGAAGCGGGTAAACGCTCCGCCAAAATCTCCTTTGTTATTAC +TGTCGTGCTTTCACTTCTCGCAGGAGTCCTCGTATGGTAAGCAACGCCTCCGCATTAGGACGCAATGGCG 
+TACATGATTTCATCCTCGTTCGCGCTACCGCTATCGTCCTGACGCTCTACATCATTTATATGGTCGGTTT +TTTCGCTACCAGTGGCGAGCTGACATATGAAGTCTGGATCGGTTTCTTCGCCTCTGCGTTCACCAAAGTG +TTCACCCTGCTGGCGCTGTTTTCTATCTTGATCCATGCCTGGATCGGCATGTGGCAGGTGTTGACCGACT +ACGTTAAACCGCTGGCTTTGCGCCTGATGCTGCAACTGGTGATTGTCGTTGCACTGGTGGTTTACGTGAT +TTATGGATTCGTTGTGGTGTGGGGTGTGTGATGAAATTGCCAGTCAGAGAATTTGATGCAGTTGTGATTG
diff -r 000000000000 -r 0968856c687c test-data/kraken_test1_output.tab
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/kraken_test1_output.tab Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,4 @@
+U	gi|145231|gb|M33724.1|ECOALPHOA	0	171	0:137
+U	gi|145232|gb|M33725.1|ECOALPHOB	0	183	0:149
+U	gi|145234|gb|M33727.1|ECOALPHOE	0	97	0:63
+U	gi|146195|gb|J01619.1|ECOGLTA	0	3850	0:3816
diff -r 000000000000 -r 0968856c687c test-data/test_db/hash.k2d
Binary file test-data/test_db/hash.k2d has changed
diff -r 000000000000 -r 0968856c687c test-data/test_db/opts.k2d
Binary file test-data/test_db/opts.k2d has changed
diff -r 000000000000 -r 0968856c687c test-data/test_db/taxo.k2d
Binary file test-data/test_db/taxo.k2d has changed
diff -r 000000000000 -r 0968856c687c tool-data/kraken2_databases.loc.sample
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool-data/kraken2_databases.loc.sample Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,7 @@
+# Expect three columns, tab separated, as follows:
+# - value (Galaxy records this in the Galaxy DB)
+# - name (Galaxy shows this in the UI)
+# - path with or without trailing slash (folder name containing the Kraken DB)
+#
+# e.g.
+# plants2018<tab>Plant genomes (2018)<tab>/path/to/krakenDB/plants_2018
diff -r 000000000000 -r 0968856c687c tool_data_table_conf.xml.sample
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_data_table_conf.xml.sample Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,8 @@
+ + + + + value, name, path + +
+
diff -r 000000000000 -r 0968856c687c tool_data_table_conf.xml.test
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_data_table_conf.xml.test Thu Mar 14 05:16:48 2019 -0400
@@ -0,0 +1,8 @@
+ + + + + value, name, path + +
+
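
Reviewer note: the .loc files in this changeset describe a three-column, tab-separated layout (value, name, path) with `#` comment lines. The sketch below shows one way such a file could be read outside Galaxy, for example to sanity-check an entry before deploying it. It is not part of the changeset and is not how Galaxy itself loads data tables; the names `KrakenDbEntry` and `parse_loc` are hypothetical, the sample rows are illustrative, and `${__HERE__}` is left unexpanded because the substitution is performed by Galaxy.

```python
from typing import List, NamedTuple


class KrakenDbEntry(NamedTuple):
    """One row of a kraken2_databases.loc file (hypothetical helper type)."""
    value: str  # identifier Galaxy records in its database
    name: str   # label Galaxy shows in the UI
    path: str   # folder containing the Kraken 2 database


def parse_loc(text: str) -> List[KrakenDbEntry]:
    """Parse the contents of a three-column, tab-separated .loc file.

    Blank lines and lines starting with '#' are skipped. A trailing slash
    on the path is dropped, since the sample file allows either form.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"expected 3 tab-separated columns, got {len(fields)}: {line!r}"
            )
        value, name, path = (field.strip() for field in fields)
        entries.append(KrakenDbEntry(value, name, path.rstrip("/") or "/"))
    return entries


# Illustrative input modelled on the .loc files in this changeset.
sample = (
    "# value\tname\tpath\n"
    'test_entry\t"Test Database"\t${__HERE__}/test_db\n'
    "plants2018\tPlant genomes (2018)\t/path/to/krakenDB/plants_2018/\n"
)

entries = parse_loc(sample)
for entry in entries:
    print(entry.value, entry.name, entry.path, sep=" | ")
```

Raising on a wrong column count is a deliberate choice in this sketch: a row whose tabs were replaced by spaces would otherwise be silently misread as a single field.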