
RBPBench (version 0.8.1+galaxy0)
Genomic regions (e.g. RBP binding sites) in BED format (at least 6 columns) for RBP binding motif search
Method ID which can be used to describe the peak calling method (e.g. clipper_idr). This ID (together with the data ID and the selected RBP ID(s)) defines which search results get compared in RBPBench's comparison mode (see Help below for more details).
Data ID which can be used to describe the cell type and/or CLIP-seq protocol the data originates from (e.g. k562_eclip or pum2_k562_eclip). This ID (together with the method ID and the selected RBP ID(s)) defines which search results get compared in RBPBench's comparison mode (see Help below for more details).
Motif search settings
HTML report options
Output options

What is RBPBench?

RBPBench is a multi-function tool to evaluate CLIP-seq and other genomic region data using a comprehensive collection of known RNA-binding protein (RBP) binding motifs. It can be used for a variety of purposes, ranging from RBP motif search (database or user-supplied RBPs) in genomic regions, through motif co-occurrence analysis, to benchmarking CLIP-seq peak calling methods and comparing results across cell types and CLIP-seq protocols.


RBPBench program modes

RBPBench on Galaxy provides the following main functions (choose one at the top via "Select RBPBench program mode"):

  1. Search RBP binding motifs in genomic regions
  2. Search RBP binding motifs in genomic regions (multiple inputs)
  3. Search RBP binding motifs in genomic regions (data collection input)
  4. Plot nucleotide distribution at genomic positions
  5. Compare different search results

1. Search RBP binding motifs in genomic regions

In this mode, we can select any number of RBPs of interest and search for their binding motifs in a given set of genomic regions (Genomic regions BED file). A built-in high-quality database of human RBP binding motifs (currently containing 259 RBPs and 605 motifs) is used by default. Users can also add their own motifs (Add user-supplied motifs) or define their own database (Provide a custom RBP motif database). Both sequence motifs (MEME/DREME XML format) and structure motifs (covariance models) are supported. Motif search settings can be adapted, e.g. to extend the genomic regions up- and/or downstream before the search.

Comprehensive hit statistics (on both the RBP and the single-motif level) are output as table files, together with an HTML report containing various plots and tables (see Output options to control which files are output). The hit statistics table formats are described in the RBPBench documentation. The HTML report includes, for each RBP, statistics on the enrichment of motifs in higher-scoring regions, as well as a heatmap of RBP co-occurrences in the genomic regions and an UpSet plot of the RBP combinations present (see HTML report options for fine-tuning). If a GTF file is provided (HTML report options -> GTF file), genomic region annotations are also added to the regions and plots. Furthermore, motif distances (on the RBP and motif level) can be plotted relative to a chosen reference RBP (HTML report options -> Set reference RBP ID). Motifs for the selected RBPs can also be plotted in a separate HTML file (Output options -> Plot RBP motifs).

To compare motif search results later (mode: Compare different search results), the data ID and method ID can be set accordingly (more details in sections 2, 3, and 5).
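To make the up- and/or downstream extension mentioned above concrete, here is a minimal Python sketch of a strand-aware extension of a 6-column BED region. It is an illustration only, not RBPBench's own code: the names `BedRegion` and `extend_region` are hypothetical, and the sketch assumes standard BED coordinates (0-based start, exclusive end) and does not clip at chromosome ends.

```python
from typing import NamedTuple


class BedRegion(NamedTuple):
    """One 6-column BED record (hypothetical helper type)."""
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    name: str
    score: float
    strand: str   # "+" or "-"


def extend_region(region: BedRegion, upstream: int = 0, downstream: int = 0) -> BedRegion:
    """Extend a region relative to its strand.

    On the plus strand, upstream means lower coordinates; on the minus
    strand, upstream means higher coordinates.
    """
    if region.strand == "-":
        new_start = region.start - downstream
        new_end = region.end + upstream
    else:
        new_start = region.start - upstream
        new_end = region.end + downstream
    # Coordinates cannot be negative; the chromosome length is not checked here.
    return region._replace(start=max(0, new_start), end=new_end)


plus = extend_region(BedRegion("chr1", 100, 200, "r1", 1.0, "+"), upstream=10, downstream=20)
minus = extend_region(BedRegion("chr1", 100, 200, "r2", 1.0, "-"), upstream=10, downstream=20)
print(plus.start, plus.end)
print(minus.start, minus.end)
```

The same upstream and downstream values thus move different ends of the interval depending on the strand, which is why the strand column of the BED input matters for this setting.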

2. Search RBP binding motifs in genomic regions (multiple inputs)

This mode accepts more than one set of genomic regions (added via + Insert Dataset). For each input, an RBP for motif search needs to be selected. Optionally, descriptive data and method IDs can be added so that the search results can be compared later (see also Compare different search results). For example, suppose two different peak calling methods (method1, method2) have been used to extract RBP binding regions from CLIP-seq data of the RBP RBPX, and we want to compare these two methods later on. We would then:

  1. Click + Insert Dataset, input the set (i.e., BED file) produced by method1, choose the CLIPped RBP (RBPX), and add the method ID "method1".
  2. Click + Insert Dataset again, input the set produced by method2, again choose RBPX, and add the method ID "method2".

We keep the data ID constant, ideally choosing an ID that describes the data (e.g. cell type, CLIP-seq protocol, CLIPped RBP). For example, if the cell type is K562 and the CLIP-seq protocol is eCLIP, we could specify the data ID "K562_eCLIP" or "RBPX_K562_eCLIP". We can repeat this for other proteins by adding the respective inputs. Finally, to compare the two methods, all we need to do is use the two resulting hit statistics output tables (RBP + motif hit statistics) as inputs in Compare different search results mode. The same also works the other way around, by keeping the method ID constant and changing the data ID. For example, to compare motif search results across different cell types, we can use different data IDs while keeping the method ID fixed.

3. Search RBP binding motifs in genomic regions (data collection input)

This mode is identical to the previous one (multiple inputs), except that instead of manually defining each input (dataset, RBP, method ID, data ID), we simply input a table containing all the information, as well as a dataset collection containing the datasets. It is thus the preferable mode if we want to compare a large number of datasets (concept of comparing sets via method ID and data ID described in the previous section). The input table (batch processing table file) has the following format (tab-separated columns: RBP ID, method ID, data ID, BED genomic regions file name):

PUM1 method1 K562_eCLIP PUM1.K562_eclip.method1.bed
PUM1 method2 K562_eCLIP PUM1.K562_eclip.method2.bed
PUM1 method3 K562_eCLIP PUM1.K562_eclip.method3.bed
PUM2 method1 K562_eCLIP PUM2.K562_eclip.method1.bed
PUM2 method2 K562_eCLIP PUM2.K562_eclip.method2.bed
PUM2 method3 K562_eCLIP PUM2.K562_eclip.method3.bed
SLBP method1 K562_eCLIP SLBP.K562_eclip.method1.bed
SLBP method2 K562_eCLIP SLBP.K562_eclip.method2.bed
SLBP method3 K562_eCLIP SLBP.K562_eclip.method3.bed

NOTE that each BED file name listed in the table (fourth column) needs to match the name of the corresponding dataset inside the dataset collection. Conveniently, if you upload files to Galaxy and build a dataset collection from them, the dataset names will correspond to the uploaded file names. The above table would produce search results for three different methods on three different RBPs. Likewise, if we wanted to compare motif search results across cell types, the table could look like this:

PUM1 method1 K562_eCLIP PUM1.K562_eclip.method1.bed
PUM1 method1 HepG2_eCLIP PUM1.HepG2_eclip.method1.bed
PUM2 method1 K562_eCLIP PUM2.K562_eclip.method1.bed
PUM2 method1 HepG2_eCLIP PUM2.HepG2_eclip.method1.bed
SLBP method1 K562_eCLIP SLBP.K562_eclip.method1.bed
SLBP method1 HepG2_eCLIP SLBP.HepG2_eclip.method1.bed

Here we would create motif search results across cell types K562 and HepG2, while keeping the peak calling method ID constant ("method1"). As with the two already discussed search modes, the resulting hit statistics output table files (RBP + motif hit statistics) can subsequently serve as inputs to RBPBench's comparison mode (Compare different search results, section 5).
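Before uploading, it can help to check a batch processing table locally. The following Python sketch parses a four-column tab-separated table and reports BED file names that have no matching dataset name. The function names (`parse_batch_table`, `missing_datasets`) and the example collection names are hypothetical; this is not part of RBPBench or Galaxy.

```python
import csv
import io


def parse_batch_table(text):
    """Parse tab-separated rows of (RBP ID, method ID, data ID, BED file name)."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for line_no, fields in enumerate(reader, start=1):
        if not fields or all(not f.strip() for f in fields):
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        rows.append(tuple(f.strip() for f in fields))
    return rows


def missing_datasets(rows, collection_names):
    """Return BED file names from the table that are absent from the collection."""
    names = set(collection_names)
    return [row[3] for row in rows if row[3] not in names]


table_text = (
    "PUM1\tmethod1\tK562_eCLIP\tPUM1.K562_eclip.method1.bed\n"
    "PUM1\tmethod2\tK562_eCLIP\tPUM1.K562_eclip.method2.bed\n"
)
# Assumed dataset names inside the collection (illustrative only).
collection = ["PUM1.K562_eclip.method1.bed"]

rows = parse_batch_table(table_text)
missing = missing_datasets(rows, collection)
print(missing)
```

An empty list means every table row has a matching dataset name; any name that is printed needs to be fixed either in the table or in the collection.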

4. Plot nucleotide distribution at genomic positions

In this mode, a set of genomic regions is input and the nucleotide distribution is plotted around a defined zero position (Nucleotide distribution plot settings -> Define zero position for plotting). By default, the upstream end position of each region is used (other choices are center and downstream end). For example, this enables us to look at CLIP-seq crosslink positions and potential nucleotide biases at these sites.
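The idea of a zero position and a per-position nucleotide distribution can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not RBPBench's implementation: `zero_position` and `position_counts` are hypothetical names, BED coordinates are taken as 0-based with an exclusive end, the "center" of an even-length region is taken as the lower of the two middle positions, and the sequence windows are assumed to be already extracted in strand orientation with equal length.

```python
from collections import Counter


def zero_position(start, end, strand, mode="upstream"):
    """Return the 0-based zero position of a BED region [start, end)."""
    if mode == "center":
        return (start + end - 1) // 2  # lower middle for even lengths (assumption)
    if mode not in ("upstream", "downstream"):
        raise ValueError(f"unknown mode: {mode}")
    # The upstream end is the first position on "+", the last position on "-".
    use_first = (mode == "upstream") == (strand != "-")
    return start if use_first else end - 1


def position_counts(windows):
    """Count nucleotides per column over equally long sequence windows."""
    if not windows:
        return []
    length = len(windows[0])
    counts = [Counter() for _ in range(length)]
    for seq in windows:
        if len(seq) != length:
            raise ValueError("all windows must have the same length")
        for i, nt in enumerate(seq.upper()):
            counts[i][nt] += 1
    return counts


print(zero_position(100, 200, "+"), zero_position(100, 200, "-"))
counts = position_counts(["ACG", "ATG", "ACC"])
print(counts[1])
```

Dividing each column's counts by the number of windows would give the per-position frequencies that such a plot displays.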

5. Compare different search results

This mode is used to compare different motif search results (produced by any of the three motif search modes described above). Inputs are the RBP and motif hit statistics table files output by the motif search modes. As exemplified in the previous sections, the chosen method IDs and data IDs (together with the selected RBP IDs) define what gets compared. Based on the IDs in the input tables, RBPBench looks for combinations of RBP ID + method ID + data ID, and produces method-ID-centered comparisons (with fixed RBP ID + data ID) and/or data-ID-centered comparisons (with fixed RBP ID + method ID). At least two different IDs are needed for a comparison (e.g. two different method IDs or two different data IDs, with the same RBP ID).

The comparison results are presented in an HTML report file, containing a hit statistics table and a Venn diagram plot for each combination found. Moreover, the report results are output as table files, and the combined motifs are output in BED format, e.g. for inspecting a data-ID- or method-ID-centered comparison in a genome viewer.

Comparing the numbers of unique and shared motif hits between methods also serves as a way of benchmarking different methods. Since no ground truth (i.e., a set of true / experimentally verified transcriptome-wide binding sites of an RBP) exists, one obvious way to benchmark peak calling methods is to look at the enrichment of known RBP binding motifs in the regions reported by the peak callers. RBPBench makes such evaluations easy, especially by combining modes 2, 3, and 5.
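The grouping logic described above can be illustrated with a short Python sketch. It only mirrors the description in this help text (fix RBP ID + data ID and vary the method ID, or fix RBP ID + method ID and vary the data ID, requiring at least two different IDs); it is not RBPBench's actual code, and `find_comparisons` is a hypothetical name.

```python
from collections import defaultdict


def find_comparisons(records):
    """Group (rbp_id, method_id, data_id) triples into possible comparisons.

    Returns two dicts:
      method_centered: (rbp_id, data_id)   -> sorted method IDs (at least 2)
      data_centered:   (rbp_id, method_id) -> sorted data IDs (at least 2)
    """
    by_rbp_data = defaultdict(set)
    by_rbp_method = defaultdict(set)
    for rbp_id, method_id, data_id in records:
        by_rbp_data[(rbp_id, data_id)].add(method_id)
        by_rbp_method[(rbp_id, method_id)].add(data_id)
    # A comparison needs at least two different IDs on the varying side.
    method_centered = {k: sorted(v) for k, v in by_rbp_data.items() if len(v) >= 2}
    data_centered = {k: sorted(v) for k, v in by_rbp_method.items() if len(v) >= 2}
    return method_centered, data_centered


records = [
    ("PUM1", "method1", "K562_eCLIP"),
    ("PUM1", "method2", "K562_eCLIP"),
    ("PUM1", "method1", "HepG2_eCLIP"),
]
method_centered, data_centered = find_comparisons(records)
print(method_centered)
print(data_centered)
```

With these three example records, one method-ID-centered comparison (PUM1 on K562_eCLIP, method1 vs. method2) and one data-ID-centered comparison (PUM1 with method1, HepG2_eCLIP vs. K562_eCLIP) are found.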


Tool documentation & repository

For more information (including a webserver tutorial) please visit the RBPBench website:

https://backofenlab.github.io/RBPBench

The RBPBench repository can be found at:

https://github.com/michauhl/RBPBench

The GitHub repository hosts the command line version of RBPBench and also includes a comprehensive manual with installation instructions and various usage examples.