Hammock - cluster peptides (version 1.1.1)
File with sequences to cluster in fasta format. See -i, --input in the manual for details.
Set Automatic to use all labels present in the data or choose a subset of labels to be used. See -l, --labels in the manual for details.

HAMMOCK version 1.1.1

Hammock overview

Hammock performs peptide sequence clustering. It can identify clusters of sequences sharing a sequence motif within large datasets. For news, documentation and other available versions, see http://www.recamo.cz/en/software/hammock-cluster-peptides/


Citation

Please cite:

Krejci, Adam, et al. "Hammock: a hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets." Bioinformatics 32.1 (2016): 9-16.


Input format

Hammock accepts fasta files. For basic work, fasta description lines (those starting with ">") may contain virtually anything. To work with sequence labels, each description line should have this form:

>id|count|label

An example of two records in this format:

>1|42|label1
RSPIVRQLPSLP
>2|58|label2
GSWVVDISNVED

For a more detailed description of the label concept and input format, see the documentation.
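The labeled description-line format above can be read with a few lines of Python. This is only a minimal sketch and not part of Hammock: the names `Record` and `parse_labeled_fasta` are hypothetical, and the sketch assumes every description line follows the `>id|count|label` form exactly, with count being an integer.

```python
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    """One fasta record with a description line of the form >id|count|label."""
    seq_id: str
    count: int
    label: str
    sequence: str


def _make_record(header: str, sequence: str) -> Record:
    # Split "id|count|label" into its three fields (assumed format).
    parts = header.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected id|count|label, got: {header!r}")
    seq_id, count, label = parts
    return Record(seq_id, int(count), label, sequence)


def parse_labeled_fasta(text: str) -> Iterator[Record]:
    """Yield one Record per fasta entry; sequence lines are concatenated."""
    header = None
    seq_parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, "".join(seq_parts))
            header = line[1:]
            seq_parts = []
        else:
            seq_parts.append(line)
    if header is not None:
        yield _make_record(header, "".join(seq_parts))


# The two example records shown above.
EXAMPLE = """>1|42|label1
RSPIVRQLPSLP
>2|58|label2
GSWVVDISNVED
"""

for record in parse_labeled_fasta(EXAMPLE):
    print(record.seq_id, record.count, record.label, record.sequence)
```

A check like this can be useful before submitting a file, since a malformed description line raises an error that names the offending header.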


Outputs

Hammock returns three files, all of them tab-separated tables.

The first is the cluster overview file. It contains one line for each resulting cluster, plus a header. Columns are:

cluster_id main_sequence sum label1 label2 label3 ...

cluster_id: Cluster's unique numeric identifier.
main_sequence: The most abundant sequence of this cluster (the one appearing in the highest number of copies)
sum: Total count of all sequences in this cluster (sum over all labels)
label1, label2, etc.: Counts of sequences with particular labels
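As an illustration of the column layout above, the overview file can be loaded with Python's standard `csv` module. This is a sketch under stated assumptions: the function name `read_cluster_overview` is hypothetical, the sample rows are made up for illustration and are not real Hammock output, and every column after the first three is assumed to be a label count.

```python
import csv
import io

# The three fixed columns described above; all others are treated as labels.
FIXED_COLUMNS = ("cluster_id", "main_sequence", "sum")


def read_cluster_overview(handle):
    """Parse the tab-separated overview table into dicts with integer counts."""
    reader = csv.DictReader(handle, delimiter="\t")
    clusters = []
    for row in reader:
        label_counts = {
            name: int(value)
            for name, value in row.items()
            if name not in FIXED_COLUMNS
        }
        clusters.append({
            "cluster_id": row["cluster_id"],
            "main_sequence": row["main_sequence"],
            "sum": int(row["sum"]),
            "labels": label_counts,
        })
    return clusters


# Made-up sample table (illustrative values only, not real Hammock output).
SAMPLE_OVERVIEW = (
    "cluster_id\tmain_sequence\tsum\tlabel1\tlabel2\n"
    "1\tRSPIVRQLPSLP\t100\t42\t58\n"
    "2\tGSWVVDISNVED\t7\t7\t0\n"
)

clusters = read_cluster_overview(io.StringIO(SAMPLE_OVERVIEW))
for cluster in clusters:
    # The sum column is described as the total over all labels.
    consistent = cluster["sum"] == sum(cluster["labels"].values())
    print(cluster["cluster_id"], cluster["main_sequence"], consistent)
```

For a real file, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` wrapper.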

The second file provides more detailed information. It contains one line for each clustered (unique) sequence, plus a header. Sequences are grouped by cluster, starting with the largest cluster; within each cluster, they are ordered from the most abundant sequence to the least abundant.

Columns are:

cluster_id sequence alignment sum label1 label2 label3 ...

cluster_id: ID of the cluster this sequence belongs to
sequence: Amino acid sequence of this peptide
alignment: Aligned amino acid sequence of this peptide (part of the cluster's multiple sequence alignment)
sum: Total count of copies of this sequence (sum over all labels)
label1, label2, etc.: Counts of copies with particular labels

The third file is the same as the second file, with two differences: it also contains sequences not belonging to any cluster (these have "NA" in the cluster_id column) and the order of the sequences in this file is the same as their order in the input fasta file.
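Because unclustered sequences are marked with "NA" in the cluster_id column, the third file can be split into clustered and unclustered sequences with a simple filter. The sketch below is hypothetical helper code, not part of Hammock: the function name `split_by_cluster` and the sample rows are made up, and the value shown in the alignment column for the unclustered row is a placeholder assumption.

```python
import csv
import io


def split_by_cluster(handle):
    """Return (clustered, unclustered) row lists, keeping the input order."""
    reader = csv.DictReader(handle, delimiter="\t")
    clustered, unclustered = [], []
    for row in reader:
        # Sequences outside any cluster carry "NA" in the cluster_id column.
        if row["cluster_id"] == "NA":
            unclustered.append(row)
        else:
            clustered.append(row)
    return clustered, unclustered


# Made-up sample (illustrative only); the alignment value of the
# unclustered row is a placeholder, not a documented Hammock value.
SAMPLE_TABLE = (
    "cluster_id\tsequence\talignment\tsum\tlabel1\tlabel2\n"
    "1\tRSPIVRQLPSLP\tRSPIVRQLPSLP\t42\t42\t0\n"
    "NA\tGSWVVDISNVED\tNA\t58\t0\t58\n"
)

clustered, unclustered = split_by_cluster(io.StringIO(SAMPLE_TABLE))
print(len(clustered), "clustered,", len(unclustered), "unclustered")
```

Since the third file preserves the order of the input fasta file, both returned lists keep that order as well.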


Parameters

Default and auto-detected parameters have been carefully tuned and tested to work well with several datasets. They are especially suited for short peptides from phage display experiments. Nevertheless, no single set of rules suits every dataset, so understanding and tuning the parameters may be needed. For a more detailed description of the parameters, see the documentation.