Pairwise genome comparison software for the detection of High-scoring Segment Pairs.
GECKO (GEnome Comparison with K-mers Out-of-core) is a fast, modular application designed to identify collections of High-scoring Segment Pairs (HSPs) in pairwise genome comparisons. By employing novel filtering and data storage strategies, it can compare chromosome-sized sequences in reduced time.
How to use
To use GECKO, upload two FASTA datasets and select one as the "Query sequence" and the other as the "Reference sequence". Then choose the parameters that best suit your comparison:
- Query sequence: The sequence that will be compared against the reference. Use only FASTA format.
- Reference sequence: The sequence that is searched for matches to the query. Note that the reverse strand of the reference is also computed and matched. Use only FASTA format.
- Length: The minimum length, in nucleotides, for an HSP (similarity fragment) to be kept. Any HSP below this length is filtered out of the comparison. Around 40 bp is recommended for small organisms (e.g. bacteria such as Mycoplasma or E. coli) and around 100 bp or more for larger organisms (e.g. human chromosomes).
- Similarity: The minimum similarity for an HSP to be kept. It works like the minimum length, but uses similarity as the threshold. The similarity is calculated as the score attained by an HSP divided by the maximum possible score, expressed as a percentage. Use values above 50-60 to filter noise.
- Word length: The seed size used to find HSPs. A smaller seed size increases sensitivity but slows the comparison, whereas a larger seed size decreases sensitivity but speeds it up. Recommended values are 12 or 16 for smaller organisms (bacteria) and 32 for larger organisms (chromosomes). The value must be a multiple of 4.
- Alignment extraction: Select "Yes" if you want to generate a file containing the alignments in a format similar to BLAST.
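The effect of the Length, Similarity and Word length settings can be pictured with a small Python sketch. This is not GECKO's implementation: the function names `passes_filters` and `valid_word_length` are hypothetical, and the inclusive comparisons (an HSP exactly at the threshold is kept) are an assumption made for illustration.

```python
def passes_filters(length, similarity, min_length=40, min_similarity=50.0):
    """Return True if an HSP would survive both thresholds.

    Assumption: an HSP is kept when it is at least as long and at least as
    similar as the thresholds; GECKO's exact boundary handling may differ.
    """
    return length >= min_length and similarity >= min_similarity


def valid_word_length(word_length):
    """Check that the word length is a positive multiple of 4."""
    return word_length > 0 and word_length % 4 == 0
```

With the default thresholds above, a 100 bp HSP at 80% similarity passes, while a 30 bp HSP is dropped regardless of its similarity.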
Output data sets
GECKO produces up to two files:
- query-reference.csv: A CSV file that includes metadata about the sequences compared and about each HSP detected. See section "Interpreting the CSV" below for more information. This file can be used to visualize the comparison in the interactive sequence visualizer GECKO-MGV (available for online use or as a download).
- query-reference.txt: This file contains the alignments in a BLAST-like format (only generated if alignment extraction is selected).
Interpreting the CSV
The CSV file can be interpreted as follows. Each column represents:
- Type: the record type; currently this field only takes the value Frag.
- xStart: starting coordinates of the alignment in the query sequence.
- yStart: starting coordinates of the alignment in the reference sequence.
- xEnd: ending coordinates of the alignment in the query sequence.
- yEnd: ending coordinates of the alignment in the reference sequence.
- strand: a character f or r encoding whether the alignment is in the forward or reverse strand.
- block: currently reserved.
- length: the length in nucleotides of the alignment.
- score: the raw score of the alignment, calculated with +4 per match and -4 per mismatch.
- ident: the number of identities found in the alignment (i.e. matches).
- similarity: the similarity percentage calculated as the achieved raw score divided by the maximum possible score.
- %ident: the number of identities divided by the length of the alignment.
- SeqX: the index of the sequence in the query file to which the xStart and xEnd coordinates correspond (0 = first sequence, 1 = second sequence, etc.).
- SeqY: same as above but for the reference file.
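As a worked example of how score, similarity and %ident relate, the following Python sketch recomputes them from length and ident. It assumes every aligned position is either a match (+4) or a mismatch (-4), and that both similarity and %ident are reported as percentages; the function name `hsp_metrics` is hypothetical.

```python
MATCH = 4      # score added per matching position
MISMATCH = -4  # score added per mismatching position


def hsp_metrics(length, ident):
    """Recompute score, similarity and %ident from length and identities."""
    mismatches = length - ident
    score = MATCH * ident + MISMATCH * mismatches
    max_score = MATCH * length  # score if every position matched
    similarity = 100.0 * score / max_score
    pct_ident = 100.0 * ident / length
    return score, similarity, pct_ident


# 100 positions with 90 matches: score = 360 - 40 = 320
print(hsp_metrics(100, 90))  # (320, 80.0, 90.0)
```

Because a mismatch costs as much as a match earns, similarity falls twice as fast as %ident: 90% identity gives 80% similarity under these assumptions.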
Note that fragments in the reverse strand (marked with r in the strand column) have their yStart and yEnd coordinates switched, i.e. yEnd is smaller than yStart.
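A minimal Python sketch for reading the HSP rows and handling the switched reverse-strand coordinates is shown below. It assumes that HSP rows start with Frag and carry the 14 columns in the order listed above, and that every other line (metadata, column header) can be skipped; the exact layout of the metadata lines is not assumed. The function name `read_fragments`, the key `pct_ident`, the extra keys `yLow` and `yHigh`, and the sample row are all made up for illustration.

```python
import csv
import io

COLUMNS = ["Type", "xStart", "yStart", "xEnd", "yEnd", "strand", "block",
           "length", "score", "ident", "similarity", "pct_ident", "SeqX", "SeqY"]


def read_fragments(handle):
    """Yield one dict per HSP row, skipping metadata and header lines."""
    for row in csv.reader(handle):
        # Assumption: only rows with 14 fields whose first field is "Frag" are HSPs.
        if len(row) != len(COLUMNS) or row[0].strip() != "Frag":
            continue
        frag = dict(zip(COLUMNS, (field.strip() for field in row)))
        for key in ("xStart", "yStart", "xEnd", "yEnd", "length"):
            frag[key] = int(frag[key])
        # Reverse-strand fragments have yEnd < yStart; also store an ordered span.
        frag["yLow"] = min(frag["yStart"], frag["yEnd"])
        frag["yHigh"] = max(frag["yStart"], frag["yEnd"])
        yield frag


# Made-up sample row, not real GECKO output.
sample = io.StringIO("Frag,10,500,109,401,r,0,100,320,90,80.00,90.00,0,0\n")
frags = list(read_fragments(sample))
```

Keeping the original yStart and yEnd alongside the ordered yLow and yHigh preserves the strand information while making interval comparisons straightforward.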