
Use GraphProt to train a model or to predict RBP binding profiles using a pretrained RBP model.

Model training

To train a GraphProt model, you need to supply a FASTA file with positive sequences (= RBP binding sites, usually determined by CLIP-seq) and a FASTA file with negative sequences (non-binding sites, e.g. randomly selected genomic sites). By default a sequence model is trained, since sequence models often perform similarly to structure models while taking considerably less time to train. For hyperparameter optimization, a portion of the input FASTA sequences (usually n = 500) is set aside, but you can also provide separate optimization sets. After hyperparameter optimization, a model is trained with the optimized parameters on the input training sequences (minus the optimization set, unless specified otherwise). A 10-fold cross validation is then run on the training sequences to estimate the generalization performance of the model. Sequence motifs and, if structure model training is enabled, structure motifs are also output. Both cross validation and motif output can be disabled to further decrease the runtime.
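The data partitioning described above can be illustrated with a minimal Python sketch. This is not GraphProt's own code: the function names, the random shuffling, the cap on the hold-out size and the way folds are assigned are all assumptions made for illustration.

```python
import random


def split_optimization_set(seq_ids, n_opt=500, seed=0):
    """Set aside up to n_opt sequences for hyperparameter optimization.

    Returns (training_ids, optimization_ids). Hypothetical helper,
    not part of GraphProt.
    """
    rng = random.Random(seed)
    shuffled = list(seq_ids)
    rng.shuffle(shuffled)
    # Assumption: never hold out more than half of a small input set.
    n_opt = min(n_opt, len(shuffled) // 2)
    return shuffled[n_opt:], shuffled[:n_opt]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) index lists for k-fold cross validation."""
    # Item j goes to fold j % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With 2000 input sequences and the default n = 500, this leaves 1500 sequences for training, and each of the 10 folds tests on 150 of them.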

By default, the model training output files are:

a .model file storing the model parameters
a .params file storing model hyperparameters and additional information
a .cv_results file containing the cross validation results
_motif and motif.png files (sequence and / or structure)

Profile prediction

This mode computes whole-site or position-wise (= profile) binding scores for a given set of input FASTA sequences.

By default, binding profiles are calculated, followed by average profile computation and extraction of peak regions from the average profiles. The average binding profile is smoother than the initial position-wise (per nucleotide) profile that GraphProt outputs, and it is the recommended basis for extracting peaks. The degree of smoothing can be controlled in the prediction options (the lowest value, 0, reproduces the initial profile). A peak is defined as a contiguous region in the average profile with scores >= the set score threshold (0 by default, can be changed). In addition, a set of high-confidence peak regions (p50) can be output. For these, the threshold is set to the median of the scores obtained from the positive training set during model training (this information is stored in the parameters file). Moreover, the peak regions can be converted to genomic regions if the genomic regions for the input FASTA sequences are supplied.
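The following Python sketch illustrates the peak definition above. It is not GraphProt's implementation. Smoothing is modeled here as a symmetric moving average whose half-width `ext` is a hypothetical parameter (0 returns the initial profile), and all function names are chosen for illustration only.

```python
from statistics import median


def average_profile(scores, ext=5):
    """Smooth position-wise scores with a moving average.

    Assumption: each position is averaged with up to `ext` neighbors on
    either side; ext=0 leaves the profile unchanged.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        window = scores[max(0, i - ext):min(n, i + ext + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def extract_peaks(avg_scores, threshold=0.0):
    """Return (start, end, max_score) for each contiguous run of positions
    with score >= threshold. Coordinates are 0-based, end-exclusive."""
    peaks = []
    start = None
    for i, score in enumerate(avg_scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            peaks.append((start, i, max(avg_scores[start:i])))
            start = None
    if start is not None:
        peaks.append((start, len(avg_scores), max(avg_scores[start:])))
    return peaks


def p50_threshold(positive_training_scores):
    """Median score of the positive training set, used as the p50 cutoff."""
    return median(positive_training_scores)
```

For high-confidence peaks, the same `extract_peaks` call would be made with `threshold=p50_threshold(...)` instead of the default 0.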

Apart from predicting binding profiles, whole-site predictions can be output as well. In this case the output contains one score per input sequence and, optionally, the p50-filtered set, just as with the average profile peaks.
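As a rough illustration of p50 filtering for whole-site predictions, the sketch below keeps only sequences whose score reaches the median score of the positive training set. The function name, the dictionary input and the use of >= for the comparison are assumptions, not GraphProt's own interface.

```python
from statistics import median


def filter_p50(site_scores, positive_training_scores):
    """Keep sequences scoring at or above the median positive training score.

    site_scores maps a sequence ID to its whole-site score
    (hypothetical input format).
    """
    threshold = median(positive_training_scores)
    return {seq_id: s for seq_id, s in site_scores.items() if s >= threshold}
```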

Summing up, the profile prediction output files are:

an avg_profile file containing the position-wise (per nucleotide) binding profile scores
one or several BED files containing the peak regions (all peaks, p50 peaks, all genomic peaks, p50 genomic peaks)
if whole site prediction is enabled, a .predictions file and optionally a .p50.predictions file
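The genomic peak BED files require mapping sequence-relative peak coordinates back to the genome. The sketch below shows one way this could be done, assuming 0-based, end-exclusive coordinates (as in the BED format) and that minus-strand input sequences are given as the reverse complement of the genomic region. The function names are hypothetical and GraphProt's own conversion may differ.

```python
def peak_to_genomic(peak_start, peak_end, chrom, site_start, site_end, strand):
    """Map a peak within an input sequence to genomic coordinates.

    All coordinates are 0-based and end-exclusive. site_start and site_end
    describe the genomic region the input sequence was extracted from.
    """
    if strand == "+":
        return chrom, site_start + peak_start, site_start + peak_end, strand
    # Minus strand: sequence position 0 corresponds to genomic position
    # site_end - 1, so the interval is mirrored.
    return chrom, site_end - peak_end, site_end - peak_start, strand


def bed_line(chrom, start, end, name, score, strand):
    """Format one 6-column BED line (tab-separated)."""
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])
```

For example, a peak at sequence positions 1 to 3 in a site spanning chr1:100-107 maps to chr1:101-103 on the plus strand and to chr1:104-106 on the minus strand.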