Galaxy | Tool Preview

hifive (version 0.1.0)
If a subset of chromosomes are to be used, enter a comma-separated list of chromosome names.
A value of zero indicates that finding the distance dependence function should be skipped.
The minimum interaction distance included for filtering fends and learning correction values.
The maximum interaction distance included for filtering fends and learning correction values.
The cutoff for the absolute gradient values such that learning will cease if all values fall below this threshold.

HiFive is a tool for handling, normalizing, and plotting HiC and 5C chromatin interaction data. It has numerous normalization approaches built in with a variety of options, allowing for fine-scale control and data processing. HiFive is broken down into a series of selectable commands, each with a HiC or 5C version.

COMMANDS

Complete HiC analysis / Complete 5C analysis - this command creates a genome partition file, loads data into a dataset, creates a project file, and performs data normalization on that project file.

Create HiC fend set / Create 5C fragment set - this command takes a bed file containing restriction fragment information and creates a HiFive partition file that will be used for downstream data processing.

Create HiC data set / Create 5C data set - this command loads data from BAM file pairs or a variety of other file formats, partitions reads according to the information in the HiFive genome partition file and creates a HiFive data file.

Create HiC project / Create 5C project - this command creates a HiFive project, associates a specific data set with it, constructs a distance-dependence function and filters fragments based on coverage.

Normalize HiC project / Normalize 5C project - this command provides a selection of normalization algorithms and finds correction values for data normalization.

Create HiC heatmap set / Create 5C heatmap set - this command creates a set of heatmaps, one per chromosome and, if selected, one per chromosome pair (trans), in a compact HDF5 format. Heatmaps can also be plotted at the time of creation.

Extract HiC interval / Extract 5C interval - this command returns a genomic interval file with data from a specified region. This data may also be plotted at the time of extraction.

Create HiC multi-resolution heatmap - this command returns a multi-resolution heatmap file with data heatmapped across resolutions from the smallest to largest specified binsizes in 2X steps.
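To make the "2X steps" concrete, here is a minimal sketch of how the list of resolutions could be enumerated. The function name binsize_ladder is hypothetical and not part of HiFive, and it assumes that the ladder simply stops at the last doubling that does not exceed the largest bin size.

```python
def binsize_ladder(min_binsize, max_binsize):
    """List bin sizes from smallest to largest, doubling at each step.

    Hypothetical helper for illustration; not a HiFive function.
    """
    if min_binsize <= 0 or max_binsize < min_binsize:
        raise ValueError("need 0 < min_binsize <= max_binsize")
    sizes = []
    size = min_binsize
    while size <= max_binsize:
        sizes.append(size)
        size *= 2  # each resolution is 2X the previous one
    return sizes


print(binsize_ladder(5000, 640000))
# [5000, 10000, 20000, 40000, 80000, 160000, 320000, 640000]
```

Choosing a largest bin size that is the smallest bin size times a power of two keeps both requested resolutions in the ladder.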

HiC genomic partitioning - fend file

A bed file containing either restriction enzyme cut points or fragment bounds is converted into an hdf5-type fragment file of fragment characteristics. In addition to coordinates, strand, and chromosome information, additional columns can be included containing other fragment characteristics, such as GC content. If additional columns are included, they must be labeled in the header with a label containing no spaces or commas. These names can be used with the binning algorithm to include the fragment characteristic in the model to be learned. Additional characteristics should be comma-separated pairs of values corresponding to the upstream and downstream sides of the cut site or to the two ends of the fragment, depending on whether the bed file contains cut sites or fragment coordinates, respectively.

HiC data

Reads are paired with the specified fend file, creating a HiFive dataset object. Data can be a series of paired-end bam files, a tabular format list of paired genomic positions (chromosome1 coordinate1 strand1 chromosome2 coordinate2 strand2), or a HiCPipe-style mat-formatted list of fend-pairs and observed read counts.
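As an illustration of the tabular format, the sketch below parses one line of paired genomic positions. It assumes whitespace-delimited columns in the order listed above and strands written as "+" or "-"; the ReadPair type and parse_pair_line function are hypothetical names, not part of HiFive.

```python
from collections import namedtuple

# Hypothetical record type mirroring the six tabular columns.
ReadPair = namedtuple("ReadPair", "chrom1 pos1 strand1 chrom2 pos2 strand2")


def parse_pair_line(line):
    """Parse 'chromosome1 coordinate1 strand1 chromosome2 coordinate2 strand2'."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    chrom1, pos1, strand1, chrom2, pos2, strand2 = fields
    # Coordinates are converted to integers; everything else stays as text.
    return ReadPair(chrom1, int(pos1), strand1, chrom2, int(pos2), strand2)


pair = parse_pair_line("chr1\t10500\t+\tchr1\t98200\t-")
print(pair.chrom1, pair.pos1, pair.strand1)
```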

HiC project

Fends are filtered in an iterative manner using the minimum interaction cutoff and interaction size parameters specified to ensure that all valid fends have at least the minimum number of interactions with other valid fends. Subsequently, a distance dependence approximation curve is calculated piecewise using the number of bins specified. The first bin encompasses all interactions less than or equal to the minimum bin cutoff value. The remaining bins are evenly sized between log(minimum cutoff) and log(max possible interaction size).
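The two steps above can be sketched as follows. This is an illustration under stated assumptions, not HiFive's implementation: "number of interactions" is taken to mean the total read count, the interaction size limits are omitted for brevity, and the function names filter_fends and distance_bin_edges are hypothetical.

```python
import math


def filter_fends(counts, min_interactions):
    """Iteratively drop fends with too few reads to other valid fends.

    counts maps (fend_a, fend_b) -> observed reads. Removing one fend can
    push its partners below the cutoff, so the pass repeats until stable.
    """
    valid = set()
    for a, b in counts:
        valid.update((a, b))
    while True:
        totals = dict.fromkeys(valid, 0)
        for (a, b), n in counts.items():
            if a in valid and b in valid:
                totals[a] += n
                totals[b] += n
        dropped = {f for f, total in totals.items() if total < min_interactions}
        if not dropped:
            return valid
        valid -= dropped


def distance_bin_edges(min_cutoff, max_distance, num_bins):
    """Upper edges of the distance bins.

    The first bin covers everything up to min_cutoff; the remaining
    num_bins - 1 bins are evenly sized between log(min_cutoff) and
    log(max_distance).
    """
    if num_bins < 2:
        raise ValueError("need at least 2 bins")
    log_min = math.log(min_cutoff)
    step = (math.log(max_distance) - log_min) / (num_bins - 1)
    return [math.exp(log_min + i * step) for i in range(num_bins)]


reads = {("a", "b"): 5, ("b", "c"): 2, ("c", "d"): 2}
print(sorted(filter_fends(reads, 4)))
print(distance_bin_edges(1000, 1000000, 4))
```

In the small example, fend d is removed first, which then leaves fend c below the cutoff on the next pass. This cascade is why the filtering has to be iterative.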

HiC normalization

Correction values are learned for each valid fend, for ranges of fend characteristics, or for both. The 'probability' and 'express' algorithms learn correction values associated with each fend while the 'binning' algorithm learns fend characteristic corrections. These can be chained together in either order to produce more robust corrections.

Using the probability algorithm, counts are assumed to follow a binomial distribution, with the observation probability for each interaction equal to the product of the distance-dependence signal and the two fend correction parameters. Learning is done by gradient descent with a backtracking line search. Learning proceeds for up to the maximum number of iterations but is terminated early if all of the absolute gradient values fall below the cutoff threshold. At each step, the learning rate is scaled down by the step value if the current learning rate does not produce sufficient improvement as measured by the Armijo criterion.
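The learning loop described above can be sketched generically. The example below is not HiFive's code: it minimizes a toy quadratic that stands in for the negative log-likelihood, resets the learning rate at every iteration (an assumption), and uses hypothetical names throughout. The armijo_c constant is the usual sufficient-decrease factor of the Armijo criterion.

```python
def backtracking_descent(f, grad, x, max_iterations=1000, cutoff=1e-6,
                         learning_rate=1.0, step=0.5, armijo_c=1e-4):
    """Gradient descent with a backtracking (Armijo) line search."""
    x = list(x)
    for _ in range(max_iterations):
        g = grad(x)
        # Stop early once every absolute gradient value is below the cutoff.
        if all(abs(gi) < cutoff for gi in g):
            break
        fx = f(x)
        g_norm_sq = sum(gi * gi for gi in g)
        rate = learning_rate
        candidate = [xi - rate * gi for xi, gi in zip(x, g)]
        # Armijo criterion: require a decrease proportional to rate * |g|^2.
        while f(candidate) > fx - armijo_c * rate * g_norm_sq:
            rate *= step  # scale the learning rate down by the step value
            if rate < 1e-12:
                return x  # no acceptable step found; give up
            candidate = [xi - rate * gi for xi, gi in zip(x, g)]
        x = candidate
    return x


def toy_objective(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2


def toy_gradient(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


print(backtracking_descent(toy_objective, toy_gradient, [0.0, 0.0]))
```

A smaller cutoff or a larger maximum number of iterations trades run time for a tighter fit, which mirrors the roles of the corresponding tool parameters.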

The express algorithm is a variant of matrix balancing and approximates the corrections through an iterative norm-2 adjustment to give all fragments a mean ratio of one for valid counts versus the signal predicted from distance-dependence. This can be done using intra-regional interactions, inter-regional interactions, or all interactions.
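The idea of balancing counts against predicted signal can be illustrated with a small sketch. This is a generic square-root-damped balancing update, not HiFive's exact express update: it assumes the target is a ratio of summed counts to summed predicted signal equal to one for each fend, and the name express_like_corrections is hypothetical.

```python
import math


def express_like_corrections(counts, expected, num_fends, iterations=100):
    """Iteratively rescale per-fend corrections toward a count/signal ratio of one.

    counts and expected map (i, j) with i < j to the observed reads and the
    distance-dependence signal for that pair of fends.
    """
    corrections = [1.0] * num_fends
    for _ in range(iterations):
        observed = [0.0] * num_fends
        predicted = [0.0] * num_fends
        for (i, j), n in counts.items():
            signal = expected[(i, j)] * corrections[i] * corrections[j]
            observed[i] += n
            observed[j] += n
            predicted[i] += signal
            predicted[j] += signal
        for i in range(num_fends):
            if predicted[i] > 0:
                # The square root damps the update, since each interaction
                # is shared between two corrections.
                corrections[i] *= math.sqrt(observed[i] / predicted[i])
    return corrections


true_values = [0.5, 1.0, 2.0, 1.5]
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
expected = {pair: 10.0 for pair in pairs}
counts = {(i, j): 10.0 * true_values[i] * true_values[j] for i, j in pairs}
print(express_like_corrections(counts, expected, 4))
```

Restricting the counts dictionary to intra-regional or inter-regional pairs corresponds to the choice of interaction type mentioned above.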

The binning algorithm divides each model parameter into a specified number of bins and, based on a binomial distribution, learns a correction value for each bin by maximizing the log-likelihood of the data. Model parameters can be the fend lengths ('len'), fend GC content ('gc'), and any other characteristics passed as additional columns (with header labels) in the bed file used to create the HiFive fend file. According to its type, each parameter is partitioned so that its bins contain approximately equal numbers of fends ('even') or cover equal portions of the range of parameter values ('fixed'). In addition, parameter types can include the '-const' suffix to denote a parameter that should not be optimized after seeding.
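The difference between 'even' and 'fixed' partitioning can be shown with a short sketch. The helper names below are hypothetical and the edge conventions (inclusive upper edges) are an assumption made for illustration.

```python
import bisect


def fixed_bin_edges(values, num_bins):
    """Upper bin edges covering equal portions of the value range ('fixed')."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(1, num_bins)] + [hi]


def even_bin_edges(values, num_bins):
    """Upper bin edges giving roughly equal numbers of values per bin ('even')."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, num_bins + 1):
        index = (i * n) // num_bins - 1  # last value falling in bin i
        edges.append(ordered[max(index, 0)])
    return edges


def assign_bin(value, edges):
    """Index of the first bin whose upper edge is >= value."""
    return min(bisect.bisect_left(edges, value), len(edges) - 1)


lengths = [100, 120, 150, 200, 400, 800, 1600, 6400]
for name, edges in (("fixed", fixed_bin_edges(lengths, 4)),
                    ("even", even_bin_edges(lengths, 4))):
    print(name, edges, [assign_bin(v, edges) for v in lengths])
```

With skewed values such as fragment lengths, 'fixed' bins can leave most fends in a single bin, while 'even' bins spread them out at the cost of unequal bin widths.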

HiC multi-resolution heatmapping

Multi-resolution heatmapping (MRH) allows multiple levels of resolution to be stored and accessed simultaneously using an intelligent binning scheme that only accepts bins with a number of observed reads meeting the minimum observation threshold. MRH files can be interactively explored through the MRH plugin in Galaxy.

5C genomic partitioning - fragment file

A bed file containing targeted restriction enzyme fragment boundaries is converted into an hdf5-type fragment file of fragment characteristics. In addition to coordinates, strand, and chromosome information, additional columns can be included containing other fragment characteristics, such as GC content. If additional columns are included, they must be labeled in the header with a label containing no spaces or commas. These names can be used with the binning algorithm to include the fragment characteristic in the model to be learned.

5C data

Reads are loaded and paired with the specified fragment file, creating a HiFive dataset object. Data can be a series of paired-end bam files or a tabular format list of paired fragments and their observed read count (fragment1 fragment2 count).

5C project

Fragments are filtered in an iterative manner using the minimum interaction cutoff and interaction size parameters specified to ensure that all valid fragments have at least the minimum number of interactions with other valid fragments. Subsequently, a distance dependence approximation line is calculated using a regression line to approximate the linear relationship between log(# reads) and log(distance).
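The regression step can be sketched as an ordinary least-squares fit in log-log space. This is an illustration only: the function names are hypothetical, and it assumes that only non-zero counts are passed in, since log(0) is undefined.

```python
import math


def fit_distance_line(distances, counts):
    """Least-squares line through (log(distance), log(# reads)).

    Returns (slope, intercept).
    """
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_signal(distance, slope, intercept):
    """Expected read count at a given distance from the fitted line."""
    return math.exp(intercept + slope * math.log(distance))


distances = [1000, 5000, 20000, 100000]
counts = [1e6 / d for d in distances]  # synthetic data with slope -1
slope, intercept = fit_distance_line(distances, counts)
print(slope, predict_signal(50000, slope, intercept))
```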

5C normalization

Correction values are learned for each valid fragment, for ranges of fragment characteristics, or for both. The 'probability' and 'express' algorithms learn correction values associated with each fragment while the 'binning' algorithm learns fragment characteristic corrections. These can be chained together in either order to produce more robust corrections.

The probability algorithm assumes non-zero counts to be distributed according to a log-normal distribution, with each interaction having a mean equal to the distance-dependence predicted signal times the correction parameters of the two interacting fragments, and with a universal sigma value. Learning is done by gradient descent with a backtracking line search. Learning proceeds for up to the maximum number of iterations but is terminated early if all of the absolute gradient values fall below the cutoff threshold. At each step, the learning rate is scaled down by the step value if the current learning rate does not produce sufficient improvement as measured by the Armijo criterion.
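To show what is being maximized, the sketch below evaluates a log-normal log-likelihood for a set of fragment pairs. It assumes that the "mean" in the description is the location parameter of the log-normal, taken as log(signal x correction_i x correction_j). That reading, and the function name, are assumptions made for illustration and are not taken from HiFive.

```python
import math


def lognormal_log_likelihood(counts, signal, corrections, sigma):
    """Sum of log-normal log-densities over non-zero counts.

    counts and signal map (i, j) to the observed count and the
    distance-dependence predicted signal for that fragment pair.
    """
    total = 0.0
    for (i, j), n in counts.items():
        if n <= 0:
            continue  # only non-zero counts are modeled
        mu = math.log(signal[(i, j)] * corrections[i] * corrections[j])
        z = (math.log(n) - mu) / sigma
        # log of the log-normal density: 1 / (n * sigma * sqrt(2 pi)) * exp(-z^2 / 2)
        total += -math.log(n * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return total


counts = {(0, 1): 20.0}
signal = {(0, 1): 10.0}
print(lognormal_log_likelihood(counts, signal, [1.0, 2.0], 1.0))
print(lognormal_log_likelihood(counts, signal, [1.0, 1.0], 1.0))
```

Corrections that reproduce the observed count give the higher value. A gradient-based learner would adjust the corrections to increase this quantity.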

The express algorithm is a variant of matrix balancing and approximates the corrections through an iterative norm-2 adjustment to give all fragments a mean ratio of one for valid counts versus the signal predicted from distance-dependence. This can be done using intra-regional interactions, inter-regional interactions, or all interactions.

The binning algorithm divides each model parameter into a specified number of bins and, based on a log-normal distribution, learns a correction value for each bin by maximizing the log-likelihood of the data. Model parameters can be the fragment lengths ('len') and any other characteristics passed as additional columns (with header labels) in the bed file used to create the HiFive fragment file. According to its type, each parameter is partitioned so that its bins contain approximately equal numbers of fragments ('even') or cover equal portions of the range of parameter values ('fixed'). In addition, parameter types can include the '-const' suffix to denote a parameter that should not be optimized after seeding.