CHROMEISTER
CHROMEISTER is a heuristic approach for the ultra-fast previsualization of pairwise genome comparisons. It can compare enormous genomes (up to 30 billion base pairs, or about 10 times the size of the human genome) much faster than other methods, while yielding significant, reusable and exploitable information such as synteny blocks, evolutionary events and pairwise genome similarity metrics.
What is CHROMEISTER good for?
It is especially suitable for obtaining a fast visualization of a pairwise genome comparison. Because it keeps only unique seeds, it is particularly useful for inspecting noisy, repeat-rich genome comparisons. Additionally, since it outputs a scoring metric for each comparison, it can be used for massive all-vs-all comparisons that are processed automatically based on that metric.
What is CHROMEISTER NOT good for?
It should NOT be used to obtain alignments. CHROMEISTER, as it is, does not produce a set of alignments (although this can be done using the GECKO pipeline, see the GitHub repository). It should also NOT be used to study DNA repeats, since CHROMEISTER filters these out as part of its main signal detection process.
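To illustrate why repeats do not survive unique-seed filtering, here is a minimal sketch in Python. It is not CHROMEISTER's implementation: the function name `unique_seed_hits` is hypothetical, the matching is exact, and no subsampling is applied. It only shows the general idea that a k-mer occurring more than once in either sequence produces no hit.

```python
from collections import Counter


def unique_seed_hits(query, reference, k):
    """Return (query_pos, reference_pos) pairs for k-mers that occur
    exactly once in the query AND exactly once in the reference.

    Simplified illustration only; not CHROMEISTER's actual algorithm.
    """
    def unique_kmers(seq):
        # Count every k-mer, then keep those seen exactly once.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        return {kmer for kmer, n in counts.items() if n == 1}

    shared = unique_kmers(query) & unique_kmers(reference)
    # Each shared k-mer is unique in both sequences, so find() is unambiguous.
    return sorted((query.find(kmer), reference.find(kmer)) for kmer in shared)


# "ACGT" is repeated in the query, so it yields no hit even though
# it also appears in the reference; only "TTGA" is reported.
hits = unique_seed_hits("ACGTACGTTTGA", "TTGACCACGT", k=4)
```

In this toy example the repeated seed is silently dropped, which is why a tool built on unique seeds is a poor choice for studying repeats.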
How to use
To use CHROMEISTER, upload two .fasta data sets and select them as "Query sequence" and "Reference sequence". Then choose the parameters that best suit your comparison:
Input parameters
- Output dotplot size (dimension): This parameter corresponds to the resolution of the comparison. A higher resolution is recommended for large genomes (e.g. use 2000 for more than 3 Gbp), and a lower resolution (e.g. use 1000) should be used for everything else, such as comparisons involving chromosomes or partial genomes. A value of 1000 will produce a 1,000 x 1,000 output dotplot png image.
- K-mer seed size: This parameter is the seed size used to find unique hits. The recommended value is 32 for all sequences except small ones, such as bacterial genomes, where 16 is recommended.
- Diffuse value: This parameter determines the level of heuristic subsampling employed. A level of 1 will use perfect indexing (no subsampling). The recommended level is 4, which represents a good trade-off between exact and inexact hits. Only use 1 if no similarity is found.
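The recommendations above can be summarized as a small Python helper. This is only a sketch: the function name `suggest_parameters` and the dictionary keys are hypothetical and do not correspond to actual CHROMEISTER option names, and the caller is assumed to know whether the input is a small (e.g. bacterial) sequence.

```python
def suggest_parameters(genome_length_bp, small_sequence=False):
    """Return the recommended starting values described in this help text.

    Hypothetical helper; the keys are illustrative, not real option names.
    """
    # 2000 x 2000 dotplot for genomes above 3 Gbp, 1000 x 1000 otherwise.
    dimension = 2000 if genome_length_bp > 3_000_000_000 else 1000
    # Seed size 16 for small sequences such as bacterial genomes, else 32.
    kmer = 16 if small_sequence else 32
    # Recommended diffuse value; switch to 1 only if no similarity is found.
    diffuse = 4
    return {"dimension": dimension, "kmer": kmer, "diffuse": diffuse}


# Example: a 5 Mbp bacterial genome.
params = suggest_parameters(5_000_000, small_sequence=True)
```

These values are starting points only; if a run with diffuse value 4 shows no similarity, repeat it with 1.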
Output data sets
- Comparison matrix (plain text), i.e. a scaled matrix containing the number of unique and inexact hits per resolution cell.
- A .png dotplot of the comparison with an automatic scoring distance (useful for classifying) and a grid (if enabled) separating the different sequences (chromosomes for instance) compared.
- A .csv file including the coordinates of each sequence/chromosome contained within the query and reference sequences (useful for multi-fasta inputs).
- Events file. A text file where each row is a "Computational Synteny Block". These events are Large-Scale Genome Rearrangements that are heuristically determined and classified as conserved, transposed or inverted blocks. Note that this labelling is only informative: it considers coordinates alone and does not use genes or other external information. If the option "Generate image of events" is enabled, a plot showing the automatically detected events will also be available. Each block is color-coded as follows:
- Red: conserved blocks (close to the diagonal)
- Green: inverted blocks (reverse strand)
- Blue: transposed blocks (not inverted and not on the diagonal)
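The color legend can be read as a simple decision rule on block coordinates. The Python sketch below is an illustration under stated assumptions, not the heuristic CHROMEISTER actually applies: the function name `classify_block`, its arguments, and the `tolerance` threshold for "close to the diagonal" are all hypothetical.

```python
def classify_block(x_start, y_start, reverse_strand, dimension=1000, tolerance=0.05):
    """Label a block from its dotplot coordinates and strand only.

    Hypothetical rule for illustration; the tolerance is an assumption.
    """
    if reverse_strand:
        # Found on the reverse strand.
        return ("inverted", "green")
    if abs(x_start - y_start) <= tolerance * dimension:
        # Forward strand and close to the main diagonal.
        return ("conserved", "red")
    # Forward strand but displaced from the diagonal.
    return ("transposed", "blue")


label, color = classify_block(100, 600, reverse_strand=False)
```

As in the events file itself, such a label depends on coordinates only, so it should be treated as a hint rather than a biological annotation.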