What it does
RepMatch accepts two or more input datasets (in GFF format, see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md) and starts by defining peak-pair midpoints in the first dataset. It then finds all peak-pair midpoints in the second dataset that lie within a given distance of each peak-pair midpoint coordinate in the first dataset; this distance is set by the tool's Maximum distance between peaks in different replicates to allow merging parameter. When multiple candidates could match (one-to-many), RepMatch applies the method set by the tool's Method of finding match parameter so that each match across the two datasets is at most one-to-one. The method options are:
- closest - matches only the candidate that is closest in bp distance.
- largest - matches the candidate that contains the most reads.
- all - runs both methods (closest and largest) separately.
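As a rough illustration of how the closest and largest options could resolve a one-to-many match, here is a minimal Python sketch. It is not RepMatch's actual code: the names PeakPair and select_match are hypothetical, and treating the distance limit as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PeakPair:
    """Hypothetical container for one peak pair (not a RepMatch type)."""
    midpoint: int  # peak-pair midpoint coordinate, in bp
    reads: int     # number of reads (tags) supporting the peak pair


def select_match(reference, candidates, max_distance, method="closest"):
    """Pick at most one candidate for a reference midpoint.

    Assumption: a candidate exactly max_distance away is still in range.
    """
    in_range = [c for c in candidates
                if abs(c.midpoint - reference) <= max_distance]
    if not in_range:
        return None  # nothing close enough, so no match
    if method == "closest":
        # Smallest bp distance from the reference coordinate wins.
        return min(in_range, key=lambda c: abs(c.midpoint - reference))
    if method == "largest":
        # Highest read count wins, regardless of distance within range.
        return max(in_range, key=lambda c: c.reads)
    raise ValueError(f"unknown method: {method!r}")


candidates = [PeakPair(105, 10), PeakPair(90, 50), PeakPair(200, 99)]
closest = select_match(100, candidates, max_distance=20, method="closest")
largest = select_match(100, candidates, max_distance=20, method="largest")
print(closest.midpoint, largest.midpoint)
```

Under this sketch, the all option would amount to calling select_match once per method and keeping the two result sets separate.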
RepMatch matching is an iterative process, as it attempts to find the centroid coordinate amongst all replicates. As such, the centroid is the point of reference for "distance" and "closest". This process can be sped up by increasing the tool's Step size parameter.
The number of replicates that must be matched before merging occurs is set by the tool's Minimum number of replicates that must be matched for merging to occur parameter. Additional filters can be applied using the tool's Advanced options, including lower and upper limits for the C-W (Crick-Watson) distance.
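The replicate and C-W distance filters described above can be pictured as a simple predicate. This is only a sketch under stated assumptions: keep_match and its parameter names are hypothetical, the limits are assumed to be inclusive, and an unset limit is assumed to mean "no filter".

```python
def keep_match(replicate_count, cw_distance, min_replicates=2,
               lower_limit=None, upper_limit=None):
    """Return True if a merged match passes the filters.

    Assumptions (not confirmed by the tool): both limits are inclusive,
    and None disables the corresponding limit.
    """
    if min_replicates < 2:
        # The Replicates option must be at least 2.
        raise ValueError("min_replicates must be at least 2")
    if replicate_count < min_replicates:
        return False  # too few replicates were matched
    if lower_limit is not None and cw_distance < lower_limit:
        return False  # C-W distance below the lower limit
    if upper_limit is not None and cw_distance > upper_limit:
        return False  # C-W distance above the upper limit
    return True


print(keep_match(3, 40, min_replicates=2, lower_limit=10, upper_limit=100))
```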
Options
- Distance - Maximum distance for discovering all peak-pair midpoints in a second dataset relative to the peak-pair midpoints in the first dataset.
- Method - Method to use when encountering multiple candidates to match, so that each match across the two datasets is at most one-to-one.
- Step Size - Distance for each iteration.
- Replicates - Minimum number of replicates that must be matched for merging to occur. This value must be at least 2.
- Lower Limit - Lower limit for the Crick-Watson distance filter.
- Upper Limit - Upper limit for the Crick-Watson distance filter.
Output Data Files
- Data MP - gff file consisting of only peak pairs
- Columns are chr, script, blank, peak start, peak end, blank, normalized tag counts, blank and info.
- Peak start and end are separated by one coordinate.
- Normalized tag count is the occupancy averaged across replicates.
- Attributes include C-W distance, sum total of tag counts, number of replicates merged.
- Data D - tabular file consisting of the list of all matched replicates.
- Data UP - tabular file consisting of all unmatched peak-pairs.
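To show how the Data MP column layout listed above might be read, here is a minimal Python sketch that splits one tab-separated line into named fields. The function name parse_data_mp_line, the example line, and the attribute keys in it (cw_distance, tag_sum, replicates) are hypothetical; the actual keys written by RepMatch may differ, and key=value pairs separated by semicolons is an assumption based on common GFF usage.

```python
def parse_data_mp_line(line):
    """Split one Data MP line into the nine columns described above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    # Column order: chr, script, blank, start, end, blank, tags, blank, info.
    chrom, script, _, start, end, _, norm_tags, _, info = fields
    attributes = {}
    for item in info.split(";"):
        # Assumption: attributes are written as key=value pairs.
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "chr": chrom,
        "script": script,
        "start": int(start),
        "end": int(end),
        "normalized_tags": float(norm_tags),
        "attributes": attributes,
    }


# Hypothetical example line; the values and keys are made up for illustration.
example = "chr1\trepmatch\t.\t1000\t1001\t.\t12.5\t.\tcw_distance=30;tag_sum=25;replicates=2"
record = parse_data_mp_line(example)
print(record["chr"], record["start"], record["end"], record["normalized_tags"])
```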
Output Statistics Files
- Statistics Table - tabular file providing the description key of Data D.
- Statistics Histogram - graph of the number of matched locations having the indicated replicate counts.
Comments on Replicates
Three types of replicates may be considered. Biological replicates represent independently collected biological samples. At least two biological replicates must be performed for each experiment from which a conclusion is being drawn, and the conclusion must be evident in both biological replicates when analyzed separately. Technical replicates represent a re-run of the assay on the same biological material. This is usually done when one replicate fails to produce quality data, and the re-run is used to replace that earlier replicate. Sequencing replicates represent additional sequencing of the same successful library in order to obtain more reads, should the analysis require it. The reads from individual sequencing replicates are usually merged without the need for separate analysis.