
fasta_regex_finder (version 0.1.0)


Search a FASTA file for matches to a regular expression and return a BED file with the coordinates of each match and the matched sequence itself.

The output BED file has the following columns:

  1. Name of fasta sequence (e.g. chromosome)
  2. Start of the match
  3. End of the match
  4. ID of the match
  5. Length of the match
  6. Strand
  7. Matched sequence as it appears on the forward strand
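
The column layout above can be sketched for forward-strand matches as follows. This is a minimal illustration rather than the tool's actual implementation: the function name `find_forward_matches` is hypothetical, and the `<name>_<start>_<end>_for` ID pattern is inferred from the example output on this page.

```python
import re


def find_forward_matches(name, seq, regex):
    """Return one 7-column tuple per forward-strand match of `regex` in `seq`.

    Hypothetical helper, not part of fasta_regex_finder itself.
    """
    rows = []
    for m in re.finditer(regex, seq):
        start, end = m.start(), m.end()  # 0-based start, exclusive end
        # ID pattern inferred from the example output (an assumption).
        match_id = f"{name}_{start}_{end}_for"
        rows.append((name, start, end, match_id, end - start, "+", m.group()))
    return rows


if __name__ == "__main__":
    for row in find_forward_matches("mychr", "ACTGnACTGnACTGnTGAC", "ACTG"):
        print("\t".join(str(field) for field in row))
```

`re.finditer` yields non-overlapping matches from left to right, so the rows are already sorted by position within the sequence.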

For matches on the reverse strand, the reported start and end positions are forward-strand coordinates, and the reported sequence is the string as it appears on the forward strand (so the G4 'GGGAGGGT' present on the reverse strand is reported as ACCCTCCC).
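
One way to obtain this behaviour is to search the reverse complement and map the match coordinates back to the forward strand. The sketch below assumes that approach; the tool may work differently. The names `revcomp` and `find_reverse_matches`, the `_rev` ID suffix, and the toy sequence `chrT` are all hypothetical.

```python
import re

# Complement table for unambiguous bases; other characters (e.g. n) pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_reverse_matches(name, seq, regex):
    """Find `regex` on the reverse strand, reporting forward-strand coordinates."""
    n = len(seq)
    rows = []
    for m in re.finditer(regex, revcomp(seq)):
        # Position i on the reverse complement is position n - 1 - i on the
        # forward strand, so the half-open interval flips as follows.
        start, end = n - m.end(), n - m.start()
        # The "_rev" suffix is an assumption; the source only shows "_for".
        match_id = f"{name}_{start}_{end}_rev"
        rows.append((name, start, end, match_id, end - start, "-", seq[start:end]))
    rows.sort(key=lambda r: r[1])  # order by forward-strand start
    return rows


if __name__ == "__main__":
    for row in find_reverse_matches("chrT", "TTACCCTCCCAA", "GGGAGGGT"):
        print("\t".join(str(field) for field in row))
```

In this toy sequence the G4 'GGGAGGGT' exists only on the reverse strand, and the row reports the interval 2 to 10 with the forward-strand string ACCCTCCC.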

Note: FASTA sequences (chroms) are read into memory one at a time, along with the matches for that chromosome. The output is ordered by chromosome as found in the input FASTA, with matches sorted by position within each chromosome.
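
Reading one sequence at a time can be sketched with a generator that yields each record as soon as the next header is reached, so only one chromosome is held in memory at once. This is an illustrative reader under that assumption, not the tool's own parser; `read_fasta` is a hypothetical name.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs one record at a time from a FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)  # emit the finished record
            fields = line[1:].split()
            name = fields[0] if fields else ""  # keep only the first word
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)  # emit the last record


if __name__ == "__main__":
    demo = io.StringIO(">mychr\nACTGnACTGn\nACTGnTGAC\n")
    for chrom, seq in read_fasta(demo):
        print(chrom, len(seq))
```

Because records are yielded in file order, processing them in a loop and writing each chromosome's sorted matches before moving on gives the output order described above.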



Test data::

    >mychr
    ACTGnACTGnACTGnTGAC

Example1 regex=ACTG:

mychr   0       4       mychr_0_4_for   4       +       ACTG
mychr   5       9       mychr_5_9_for   4       +       ACTG
mychr   10      14      mychr_10_14_for 4       +       ACTG

Example2 regex=ACTG maxstr=3:

mychr   0       4       mychr_0_4_for   4       +       ACT[3,4]
mychr   5       9       mychr_5_9_for   4       +       ACT[3,4]
mychr   10      14      mychr_10_14_for 4       +       ACT[3,4]
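
Judging from the output above, maxstr appears to cap the reported sequence at that many characters and append `[maxstr,total length]`. The sketch below implements that reading; it is an inference from this one example, not confirmed behaviour, and `format_match` is a hypothetical name.

```python
def format_match(matched, maxstr):
    """Truncate `matched` to `maxstr` characters, noting the full length.

    The "[maxstr,length]" suffix format is inferred from Example2 (an assumption).
    """
    if len(matched) <= maxstr:
        return matched
    return f"{matched[:maxstr]}[{maxstr},{len(matched)}]"


if __name__ == "__main__":
    print(format_match("ACTG", 3))
    print(format_match("ACTG", 10))
```

Under this reading, coordinates and the length column are unaffected; only the seventh column is shortened.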

Example3 regex=AwwG:

mychr   0       5       mychr_0_5_for   5       +       ACTGn
mychr   5       10      mychr_5_10_for  5       +       ACTGn
mychr   10      15      mychr_10_15_for 5       +       ACTGn