
Filter with SortMeRNA (version 1.0)
The Illumina platform is more common for large-scale metatranscriptomic projects that require high throughput.
The public rRNA databases provided with SortMeRNA are already indexed. In contrast, personal databases must be indexed each time SortMeRNA is launched. Please be patient: this may take some time, depending on the size of the given database.
Generates statistics for the rRNA content of reads, as well as rRNA subunit distribution.

Overview

SortMeRNA is software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data produced by next-generation sequencers. It can handle large RNA databases and sorts out all fragments matching the database with high accuracy and specificity.

If you use this tool, please cite Kopylova E., Noé L. and Touzet H., "SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data", Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611.


Input

The input is one file of reads in FASTA or FASTQ format and any number of rRNA databases to search against. If the user has two files of forward and reverse paired-end reads, they may use the script "merge_paired_reads.sh" to interleave the reads into one file, preserving their order.
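To make the idea of interleaving concrete, here is a minimal Python sketch. It is not the "merge_paired_reads.sh" script itself, only an illustration of the same idea, and it assumes fixed-size records (4 lines per FASTQ read, 2 lines per single-line FASTA read). The function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def records(lines: Iterable[str], lines_per_record: int) -> Iterator[List[str]]:
    """Yield consecutive records made of a fixed number of lines each."""
    it = iter(lines)
    while True:
        record = list(islice(it, lines_per_record))
        if not record:
            return
        yield record


def interleave(forward: Iterable[str], reverse: Iterable[str],
               lines_per_record: int = 4) -> Iterator[str]:
    """Alternate forward and reverse records, keeping the original order.

    Assumption: lines_per_record is 4 for FASTQ, 2 for single-line FASTA.
    """
    pairs = zip(records(forward, lines_per_record),
                records(reverse, lines_per_record))
    for fwd, rev in pairs:
        yield from fwd
        yield from rev
```

Each forward read is followed directly by its mate, so the pairing survives in a single file.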

If the reads are paired-end, the user has two options under "Sequencing type" to filter the reads while preserving their order in the file. For a further example of each option, please refer to Section 4.2.3 in the SortMeRNA User Manual.
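The effect of the paired-end options can be sketched in a few lines of Python. The sketch follows the descriptions of --paired-in and --paired-out given in the parameter list. The function name and the mode strings are hypothetical, and it assumes that a pair whose two reads agree is never split.

```python
from typing import Tuple

MODES = ("default", "paired-in", "paired-out")


def route_pair(first_accepted: bool, second_accepted: bool,
               mode: str = "default") -> Tuple[str, str]:
    """Return the destination file ("accept" or "other") for each read of a pair."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if first_accepted == second_accepted:
        # Both reads agree, so the pair stays together in every mode (assumption).
        dest = "accept" if first_accepted else "other"
        return (dest, dest)
    if mode == "paired-in":
        return ("accept", "accept")   # keep the pair together in --accept
    if mode == "paired-out":
        return ("other", "other")     # keep the pair together in --other
    # Default: the two reads are separated.
    return ("accept" if first_accepted else "other",
            "accept" if second_accepted else "other")
```

Only the two paired modes keep a disagreeing pair together, which is what preserves the interleaved order in the output files.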


Output

The output will follow the same format (FASTA or FASTQ) as the reads.

In the standalone version of SortMeRNA, the user may output the matching reads in a separate file per database (--bydbs option). This option will be made available in a future version of Galaxy.


rRNA databases

SortMeRNA is distributed with 8 representative rRNA databases, all constructed from the SILVA SSU and LSU (version 111) and the RFAM 5S/5.8S (version 11.0) databases using the tool UCLUST.

Representative database   id %  average id %  # seq  Origin                  # seq   filtered to remove
SILVA 16S bacteria        85    91.6          8174   SILVA SSU Ref NR v.111  244077  23s
SILVA 16S archaea         95    96.7          3845   SILVA SSU Ref NR v.111  10919   23s
SILVA 18S eukarya         95    96.7          4512   SILVA SSU Ref NR v.111  31862   26s,28s,23s
SILVA 23S bacteria        98    99.4          3055   SILVA LSU Ref v.111     19580   16s,26s,28s
SILVA 23S archaea         98    99.5          164    SILVA LSU Ref v.111     405     16s,26s,28s
SILVA 28S eukarya         98    99.1          4578   SILVA LSU Ref v.111     9321    18s
Rfam 5S archaea/bacteria  98    99.2          59513  RFAM                    116760
Rfam 5.8S eukarya         98    98.9          13034  RFAM                    225185

id %:
every member of a cluster must have at least 'id %' identity with the representative sequence
average id %:
average identity of a cluster member to the representative sequence
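As a small worked illustration of the two definitions above, the following Python sketch checks a cluster against an 'id %' threshold and computes its average identity. The function name and the identity values are invented for the example and are not taken from the SortMeRNA databases.

```python
from typing import Sequence, Tuple


def cluster_summary(identities: Sequence[float],
                    id_threshold: float) -> Tuple[bool, float]:
    """Summarise a cluster from its members' identity (%) to the representative.

    Returns (every member meets the 'id %' threshold, average id %).
    """
    if not identities:
        raise ValueError("a cluster needs at least one member")
    meets_threshold = all(identity >= id_threshold for identity in identities)
    average = sum(identities) / len(identities)
    return meets_threshold, average


# Hypothetical cluster built at id % = 95.
ok, average_id = cluster_summary([96.0, 98.0, 100.0], 95)
```

The average is always at least as high as the threshold when every member meets it, which is why the 'average id %' column exceeds the 'id %' column in each row of the table.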

The user may also choose to use their own rRNA databases.

Note that your personal databases are indexed each time, and that this may take some time depending on the size of the given database.


SortMeRNA parameter list

The standalone, command-line version of SortMeRNA uses the following parameters.

For indexing (buildtrie):

This program builds a Burst trie from an input rRNA database file in FASTA format and stores the result in binary files under the folder '/automata':

./buildtrie --db [path to rrnas database file name {.fasta}]  {OPTIONS}

The list of OPTIONS can be left blank, in which case the default values are used:

-L  length of the sliding window (the seed)
    (default: 18)

-F  search only the forward strand
-R  search only the reverse-complementary strand
    (default: both strands are searched)

-h  help
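To illustrate what a sliding window of length L means, the Python sketch below enumerates every window of a sequence. It only shows the general idea. How buildtrie and sortmerna use these windows internally is not described here, and the function name is hypothetical.

```python
from typing import Iterator


def windows(sequence: str, length: int = 18) -> Iterator[str]:
    """Yield every window of `length` consecutive characters, shifting by one."""
    if length < 1:
        raise ValueError("window length must be positive")
    # A sequence shorter than the window yields nothing.
    for start in range(len(sequence) - length + 1):
        yield sequence[start:start + length]
```

A sequence of n characters therefore contains n - L + 1 windows when n is at least L.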

For sorting (sortmerna):

To run SortMeRNA, type in any order after 'sortmerna':

--I      [illumina reads file name {fasta/fastq}]

--454    [roche 454 reads file name {fasta/fastq}]

-n       number of databases to use (must precede --db)

--db     [rrnas database name(s)]

         One database,
         ex 1. -n 1 --db /path1/database1.fasta

         Multiple databases,
         ex 2. -n 2 --db /path2/database2.fasta /path3/database3.fasta

{OPTIONS}
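The required arguments above can be assembled programmatically, for instance when calling SortMeRNA from a wrapper script. The Python sketch below uses only the flags listed in this section. The function name and the platform keys are hypothetical, and requiring at least one database is an assumption.

```python
from typing import List, Sequence

# Reads-file flag for each sequencing platform, as listed above.
READS_FLAGS = {"illumina": "--I", "454": "--454"}


def sortmerna_args(reads: str, databases: Sequence[str],
                   platform: str = "illumina") -> List[str]:
    """Build the required part of a sortmerna argument list."""
    if platform not in READS_FLAGS:
        raise ValueError(f"unknown platform: {platform}")
    if not databases:
        raise ValueError("at least one database is required")  # assumption
    # -n must precede --db, so the two are always emitted in that order.
    return ["sortmerna", READS_FLAGS[platform], reads,
            "-n", str(len(databases)), "--db", *databases]
```

Deriving the -n value from the database list keeps the count and the paths consistent.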

The list of OPTIONS can be left blank, in which case the default values are used:

--accept      [accepted reads file name]
--other       [rejected reads file name]
              (default: no output file is created)

--bydbs       output the accepted reads by database
              (default: concatenated file of reads)

--log         [overall statistics file name]
              (default: no statistics file created)

--paired-in   put both paired-end reads into --accept file
--paired-out  put both paired-end reads into --other file
              (default: if one read is accepted and the other is not,
              separate the reads into --accept and --other files)

-r            ratio of the number of hits on the read / read length
              (default Illumina: 0.25, Roche 454: 0.15)

-F            search only the forward strand
-R            search only the reverse-complementary strand
              (default: both strands are searched)

-a            number of threads to use
              (default: 1)

-m            (m x 4096 bytes) for loading the reads into memory
              ex. '-m 4' means 4*4096 = 16384 bytes will be allocated for the reads
              note: maximum -m is 1020039
              (default: m = 262144 = 1GB)

-v            verbose
              (default: deactivated)

-h            help

--version     version number
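The -m option above is expressed in units of 4096 bytes, so it can help to convert a value to bytes before choosing it. The Python sketch below does this using the default and maximum stated in the list. The names are hypothetical, and the lower bound of 1 is an assumption.

```python
BLOCK_BYTES = 4096   # size of one -m unit
DEFAULT_M = 262144   # default value of -m
MAX_M = 1020039      # maximum value of -m


def reads_memory_bytes(m: int = DEFAULT_M) -> int:
    """Return the number of bytes allocated for the reads for a given -m."""
    if not 1 <= m <= MAX_M:   # lower bound of 1 is an assumption
        raise ValueError(f"-m must be between 1 and {MAX_M}")
    return m * BLOCK_BYTES
```

With these constants, -m 4 gives 16384 bytes and the default gives 262144 * 4096 = 1073741824 bytes, which is the 1GB quoted above.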

Bibliography

[1] Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools, Nucleic Acids Research, 41 (D1): D590-D596.

[2] Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A (2012) Rfam 11.0: 10 years of RNA families, Nucleic Acids Research, doi: 10.1093/nar/gks1005.

[3] Edgar, R.C. (2010) Search and clustering orders of magnitude faster than BLAST, Bioinformatics 26(19), 2460-2461, doi: 10.1093/bioinformatics/btq461

[4] Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, Pallen MJ (2012) Performance comparison of benchtop high-throughput sequencing platforms, Nature Biotechnology, 30 (5): 434-439.