Galaxy | Tool Preview

Extract Genomic DNA (version 2.2.3)
Only meaningful for GFF, GTF datasets.
If 'Locally cached' is selected, the tool uses a genomic reference file that matches the input file's dbkey. It first looks for corresponding *.nib files in alignseq.loc. If none are available, it searches for a corresponding *.2bit file in twobit.loc.
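
The lookup order described above can be sketched as follows. This is only an illustration, not the tool's actual code: the function name and the two dictionaries (standing in for entries already read from alignseq.loc and twobit.loc) are hypothetical, and the paths in the usage note are made up.

```python
def find_cached_reference(dbkey, nib_entries, twobit_entries):
    """Pick a cached reference for a dbkey, preferring nib over 2bit.

    nib_entries and twobit_entries are hypothetical dicts mapping a
    dbkey to a path, standing in for the contents of alignseq.loc and
    twobit.loc respectively.
    """
    if dbkey in nib_entries:        # first choice: *.nib files
        return "nib", nib_entries[dbkey]
    if dbkey in twobit_entries:     # fallback: a *.2bit file
        return "2bit", twobit_entries[dbkey]
    return None                     # no cached reference for this build
```

For example, with a made-up entry such as `{"hg17": "/data/hg17.2bit"}` in the second dictionary only, the function would return the 2bit path for `hg17`.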

This tool requires interval or GFF input (special tabular formats). If your data is not TAB delimited, first use Text Manipulation->Convert.

Make sure that the genome build is specified for the dataset from which you are extracting sequences (click the pencil icon in the history item if it is not specified).

All of the following will cause a line from the input dataset to be skipped and a warning generated. The number of warnings and skipped lines is documented in the resulting history item.
  • Any lines that do not contain at least 3 columns: a chromosome and numerical start and end coordinates.
  • Sequences that fall outside of the range of a line's start and end coordinates.
  • Chromosome, start or end coordinates that are invalid for the specified build.
  • Any lines whose data columns are not separated by a TAB character (other white-space characters are invalid).
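
As a rough sketch of how checks like these could be applied to one input line (not the tool's own implementation): the function name and the `chrom_lengths` dictionary of chromosome sizes for the build are assumptions made for illustration.

```python
def parse_interval_line(line, chrom_lengths):
    """Return (chrom, start, end), or None if the line should be skipped.

    chrom_lengths is a hypothetical dict mapping chromosome names to
    their lengths in the selected build.
    """
    # Only TAB counts as a column separator.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        return None
    chrom = fields[0]
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return None  # start or end is not numerical
    if chrom not in chrom_lengths:
        return None  # chromosome unknown for this build
    if start < 0 or start > end or end > chrom_lengths[chrom]:
        return None  # coordinates invalid for this build
    return chrom, start, end
```

A caller would count the `None` results to report the number of skipped lines.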

Extracting genomic DNA using coordinates from ASSEMBLED genomes and from UNassembled genomes was previously done by two separate tools.


What it does

This tool uses coordinate, strand, and build information to fetch genomic DNA sequences in FASTA or interval format.

If strand is not defined, the default value is "+".
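
A minimal sketch of that default, assuming the row has already been split into fields. The function name and the `strand_col` parameter are hypothetical, and treating any value other than "+" or "-" as undefined is an assumption, not something this page states.

```python
def strand_or_default(fields, strand_col=None):
    """Return the strand for one input row, falling back to "+"."""
    # No strand column configured, or the row is too short to have one.
    if strand_col is None or strand_col >= len(fields):
        return "+"
    value = fields[strand_col].strip()
    # Assumption: anything other than "+" or "-" counts as undefined.
    return value if value in ("+", "-") else "+"
```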


Example

If the input dataset is:

chr7  127475281  127475310  NM_000230  0  +
chr7  127485994  127486166  NM_000230  0  +
chr7  127486011  127486166  D49487     0  +

Extracting sequences with FASTA output data type returns:

>hg17_chr7_127475281_127475310_+ NM_000230
GTAGGAATCGCAGCGCCAGCGGTTGCAAG
>hg17_chr7_127485994_127486166_+ NM_000230
GCCCAAGAAGCCCATCCTGGGAAGGAAAATGCATTGGGGAACCCTGTGCG
GATTCTTGTGGCTTTGGCCCTATCTTTTCTATGTCCAAGCTGTGCCCATC
CAAAAAGTCCAAGATGACACCAAAACCCTCATCAAGACAATTGTCACCAG
GATCAATGACATTTCACACACG
>hg17_chr7_127486011_127486166_+ D49487
TGGGAAGGAAAATGCATTGGGGAACCCTGTGCGGATTCTTGTGGCTTTGG
CCCTATCTTTTCTATGTCCAAGCTGTGCCCATCCAAAAAGTCCAAGATGA
CACCAAAACCCTCATCAAGACAATTGTCACCAGGATCAATGACATTTCAC
ACACG
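
The layout of these records can be reproduced with a short sketch. It is inferred from the example output only: the header pattern `build_chrom_start_end_strand name` and the 50-character line width are read off the records above, and the function name is hypothetical.

```python
def fasta_record(build, chrom, start, end, strand, name, sequence, width=50):
    """Format one extracted sequence like the FASTA records shown above."""
    header = f">{build}_{chrom}_{start}_{end}_{strand} {name}"
    # Break the sequence into fixed-width lines (50 in the example).
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks)


record = fasta_record("hg17", "chr7", 127475281, 127475310, "+",
                      "NM_000230", "GTAGGAATCGCAGCGCCAGCGGTTGCAAG")
print(record)
```

In the example, each sequence length equals end minus start (29 bases for the first record), which suggests that the start coordinate is 0-based and the end coordinate is exclusive.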

Extracting sequences with Interval output data type returns:

chr7    127475281       127475310       NM_000230       0       +       GTAGGAATCGCAGCGCCAGCGGTTGCAAG
chr7    127485994       127486166       NM_000230       0       +       GCCCAAGAAGCCCATCCTGGGAAGGAAAATGCATTGGGGAACCCTGTGCGGATTCTTGTGGCTTTGGCCCTATCTTTTCTATGTCCAAGCTGTGCCCATCCAAAAAGTCCAAGATGACACCAAAACCCTCATCAAGACAATTGTCACCAGGATCAATGACATTTCACACACG
chr7    127486011       127486166       D49487  0       +       TGGGAAGGAAAATGCATTGGGGAACCCTGTGCGGATTCTTGTGGCTTTGGCCCTATCTTTTCTATGTCCAAGCTGTGCCCATCCAAAAAGTCCAAGATGACACCAAAACCCTCATCAAGACAATTGTCACCAGGATCAATGACATTTCACACACG
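
In other words, each input line comes back with the extracted sequence as one extra column. A minimal sketch, assuming the new column is TAB-separated like the rest of the row (the function name is hypothetical):

```python
def interval_with_sequence(line, sequence):
    """Append the extracted sequence to an input row as a final column."""
    return line.rstrip("\r\n") + "\t" + sequence


row = "chr7\t127475281\t127475310\tNM_000230\t0\t+\n"
print(interval_with_sequence(row, "GTAGGAATCGCAGCGCCAGCGGTTGCAAG"))
```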