HAMMOCK version 1.1.1
Hammock overview
Hammock performs peptide sequence clustering. It can identify clusters of sequences that share a sequence motif within large datasets. For news, documentation and other available versions, see http://www.recamo.cz/en/software/hammock-cluster-peptides/
Citation
Please cite:
Krejci, Adam, et al. "Hammock: a hidden Markov model-based peptide clustering algorithm to identify protein-interaction consensus motifs in large datasets." Bioinformatics 32.1 (2016): 9-16.
Input format
Hammock accepts FASTA files. For basic use, FASTA description lines (those starting with ">") may contain virtually anything. To use sequence labels, each description line should have this form:
>id|count|label
An example of two records in this format:
>1|42|label1
RSPIVRQLPSLP
>2|58|label2
GSWVVDISNVED
For a more detailed description of the label concept and input format, see the documentation.
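The labeled description-line format above can be read with a few lines of Python. The following is a minimal sketch, not part of Hammock: the names `Record`, `parse_labeled_fasta` and `_make_record` are hypothetical, and the code assumes every description line has exactly the three `|`-separated fields shown above (anything else raises `ValueError`).

```python
from typing import Iterable, Iterator, List, NamedTuple


class Record(NamedTuple):
    """One FASTA record with an id|count|label description line (hypothetical helper type)."""
    seq_id: str
    count: int
    label: str
    sequence: str


def _make_record(header: str, chunks: List[str]) -> Record:
    # Assumption: the header has exactly three fields separated by "|".
    seq_id, count, label = header.split("|")
    return Record(seq_id, int(count), label, "".join(chunks))


def parse_labeled_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per FASTA entry; sequence lines of an entry are joined."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# The two example records from this README.
example = """>1|42|label1
RSPIVRQLPSLP
>2|58|label2
GSWVVDISNVED
"""
records = list(parse_labeled_fasta(example.splitlines()))
for record in records:
    print(record.seq_id, record.count, record.label, record.sequence)
```

Here the count is converted to an integer so it can be summed per label later; the label is kept as an opaque string.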
Outputs
Hammock returns three files, all of which are tab-separated tables.
The first is the cluster overview file. It contains one line for each resulting cluster, plus a header line. Columns are:
cluster_id main_sequence sum label1 label2 label3 ...
cluster_id: Cluster's unique numeric identifier.
main_sequence: The most popular (appearing in the highest number of copies) sequence of this cluster.
sum: Total count of all sequences in this cluster (sum over all labels).
label1, label2 etc.: Counts of sequences with particular labels.
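Because the overview file is a tab-separated table with a header, it can be loaded with Python's standard `csv` module. The sketch below is illustrative only: the function name `read_cluster_overview` is hypothetical, the sample rows are made up for the example, and the code assumes the header uses exactly the column names listed above, with every remaining column being a label count.

```python
import csv
import io

# Columns named in this README; every other column is treated as a label count.
FIXED_COLUMNS = ("cluster_id", "main_sequence", "sum")


def read_cluster_overview(handle):
    """Read a cluster overview table into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    label_columns = [name for name in reader.fieldnames if name not in FIXED_COLUMNS]
    clusters = []
    for row in reader:
        clusters.append({
            "cluster_id": row["cluster_id"],
            "main_sequence": row["main_sequence"],
            "sum": int(row["sum"]),
            "labels": {name: int(row[name]) for name in label_columns},
        })
    return clusters


# Made-up sample data in the documented layout (not real Hammock output).
sample = (
    "cluster_id\tmain_sequence\tsum\tlabel1\tlabel2\n"
    "1\tRSPIVRQLPSLP\t100\t42\t58\n"
    "2\tGSWVVDISNVED\t7\t7\t0\n"
)
clusters = read_cluster_overview(io.StringIO(sample))

# The sum column is described as the sum over all labels, so this should hold.
for cluster in clusters:
    assert cluster["sum"] == sum(cluster["labels"].values())
```

For a real file, replace the `io.StringIO` object with an open file handle, for example `open(path, newline="")`.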
The second file provides more detailed information. It contains one line for each clustered (unique) sequence, plus a header line. Sequences are grouped by cluster: clusters are listed from the largest, and within each cluster, sequences are listed from the most abundant.
Columns are:
cluster_id sequence alignment sum label1 label2 label3 ...
cluster_id: Id of the cluster this sequence belongs to.
sequence: Amino acid sequence of this peptide.
alignment: Aligned amino acid sequence of this peptide (part of cluster's multiple sequence alignment).
sum: Total count of copies of this sequence (sum over all labels).
label1, label2 etc.: Counts of copies with particular labels.
The third file is the same as the second file, with two differences: it also contains sequences not belonging to any cluster (these have "NA" in the cluster_id column) and the order of the sequences in this file is the same as their order in the input fasta file.
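When working with the third file, a common first step is to separate clustered sequences from those marked "NA". The following sketch shows one way to do that; the function name `group_by_cluster` is hypothetical, the sample rows are made up, and the value shown in the alignment column of the unclustered row is only a placeholder, since this README does not say what Hammock writes there.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(handle):
    """Split rows into {cluster_id: [sequences]} and a list of unclustered sequences."""
    clusters = defaultdict(list)
    unclustered = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["cluster_id"] == "NA":
            unclustered.append(row["sequence"])
        else:
            clusters[row["cluster_id"]].append(row["sequence"])
    return dict(clusters), unclustered


# Made-up sample data (not real Hammock output); rows follow input order.
sample = (
    "cluster_id\tsequence\talignment\tsum\tlabel1\n"
    "1\tRSPIVRQLPSLP\tRSPIVRQLPSLP\t42\t42\n"
    "NA\tGSWVVDISNVED\tNA\t58\t58\n"  # alignment value here is a placeholder guess
    "1\tRSPIVRQLPSLA\tRSPIVRQLPSLA\t3\t3\n"
)
clusters, unclustered = group_by_cluster(io.StringIO(sample))
print(clusters)
print(unclustered)
```

Only the cluster_id and sequence columns are read, so the sketch does not depend on the contents of the other columns.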
Parameters
Default and auto-detected parameters have been carefully tuned and tested to work well with several datasets; they are especially suited for short peptides from phage display experiments. Nevertheless, no single set of rules suits every dataset, so understanding and tuning the parameters may be needed. For a more detailed description of the parameters, see the documentation.