What it does
This tool extracts data (in fastq format) from the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). It is based on the fasterq-dump utility of the SRA Toolkit.
There are three ways in which you can download data:
- Plain text input of accession number(s)
- Providing a list of accessions from file
- Extracting data from an already uploaded SRA dataset
Below we discuss each in detail.
Plain text input of accession number(s)
When you type an accession number (e.g., SRR1582967) into the Accession box and click Execute, the tool will fetch the data for you. You can also provide a list of multiple accession numbers (e.g., SRR3141592, SRR271828, SRR112358).
Providing a list of accessions from file
A more realistic scenario is when you want to download a number of datasets at once. To do this you need a list of accessions, with only one accession per line (see below for information on how to generate such a file). Once you have this file:
- Upload it into your history using Galaxy's upload tool
- Once the list of accessions is uploaded, choose List of SRA accessions, one per line from the select input type dropdown
- Choose the uploaded file in the sra accession list field
- Click Execute
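As a minimal sketch of preparing such a file outside Galaxy, the Python below writes accessions to a text file, one per line. The helper name write_accession_list and the file name accessions.txt are hypothetical and chosen only for illustration; the accessions are the example values used above.

```python
from pathlib import Path


def write_accession_list(accessions, path):
    """Write accessions to a text file, one per line (hypothetical helper).

    Surrounding whitespace is stripped and empty entries are skipped.
    Returns the list of accessions that were written.
    """
    cleaned = [acc.strip() for acc in accessions if acc.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned


if __name__ == "__main__":
    # Example accessions taken from the text above; the file name is an assumption.
    write_accession_list(["SRR3141592", "SRR271828", "SRR112358"], "accessions.txt")
```

The resulting file can then be uploaded to the history and selected as described in the steps above.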
Extracting data from an already uploaded SRA dataset
If an SRA dataset is already present in the history, the sequencing data can be extracted in a human-readable data format (fastq, sam, bam) by setting the select input type drop-down to SRA archive in current history.
In every case, the fastq datasets produced will be saved in Galaxy's history as a collection, a single history element containing multiple datasets. In fact, regardless of the experimental design, three collections will be produced: one containing paired-end data, another containing single-end data, and a third containing reads that could not be classified. Some collections may be empty if the accessions provided do not contain one of these types of data.
When you decide to dump technical reads (Dump only biological reads is set to No in Advanced Options), you will probably find your PAIRED data in the other data collection, because it is impossible to determine whether a spot holds two biological reads or one biological and one technical read.
By default, only biological reads are dumped and, for a PAIRED dataset, only the spots that have both reads will be in the paired-end collection. The remaining single reads will be in the other collection. To keep all reads, at the cost of possibly having different numbers of forward and reverse reads, use the --split-files option under Select how to split the spots in Advanced Options.
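The default rule for a PAIRED dataset can be illustrated with a toy model. This is only a sketch of the behaviour described above, not the tool's or fasterq-dump's implementation: the function sort_spots and the representation of a spot as a list of biological reads are assumptions made for the example.

```python
def sort_spots(spots):
    """Toy model of the default handling of a PAIRED dataset.

    Each spot is a list of its biological reads (plain strings here).
    Spots with exactly two reads go to the paired-end group; reads from
    all other spots go to the "other" group.
    """
    paired, other = [], []
    for reads in spots:
        if len(reads) == 2:
            paired.append((reads[0], reads[1]))
        else:
            other.extend(reads)
    return paired, other


if __name__ == "__main__":
    # Three made-up spots: two complete pairs and one orphan read.
    paired, other = sort_spots([["ACGT", "TTGA"], ["GGCC"], ["AATT", "CCGG"]])
    print(len(paired), "paired spots;", len(other), "read(s) in other")
```

In this model the forward and reverse groups always have the same number of reads, which is the property that is given up when all reads are kept.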
How to generate accession lists
- Go to SRA Run Selector by clicking this link
- Find the study you are interested in by typing a search term within the Search box. This can be a word (e.g., mitochondria) or an accession you have gotten from a paper (e.g., SRR1582967).
- Once you click on the study of interest you will see the number of datasets in this study within the Related SRA data box
- Click on the Runs number
- On the page that opens you will see the Accession List button
- Clicking this button will produce a file that you will need to upload into Galaxy and use as the input to this tool.
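Before uploading, it can be useful to check that a downloaded list really contains one accession per line. The sketch below is one possible check, under the assumption that the list holds run accessions of the form SRR, ERR or DRR followed by digits; the tool itself may accept other identifiers, and the function name clean_accession_lines is hypothetical.

```python
import re

# Assumed pattern for run accessions (SRR/ERR/DRR + digits); adjust if needed.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def clean_accession_lines(lines):
    """Return (kept, rejected) from an iterable of text lines.

    Blank lines are ignored, duplicates are dropped (first occurrence wins),
    and lines that do not look like a run accession are reported separately.
    """
    seen = set()
    kept, rejected = [], []
    for line in lines:
        accession = line.strip()
        if not accession:
            continue
        if not RUN_ACCESSION.fullmatch(accession):
            rejected.append(accession)
        elif accession not in seen:
            seen.add(accession)
            kept.append(accession)
    return kept, rejected


if __name__ == "__main__":
    sample = ["SRR1582967\n", "\n", "SRR1582967\n", "SRR3141592, SRR271828\n"]
    kept, rejected = clean_accession_lines(sample)
    print("kept:", kept)
    print("rejected:", rejected)
```

Here the last sample line is rejected because it holds two accessions on one line, which does not match the one-per-line layout the tool expects.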
For credits, information, support and bug reports, please refer to https://github.com/galaxyproject/tools-iuc.