view peptide_indexer.xml @ 2:cf0d72c7b482 draft

Update.
author galaxyp
date Fri, 10 May 2013 17:31:05 -0400
parents
children 1183846e70a1

<tool id="openms_peptide_indexer" version="0.1.0" name="Peptide Indexer">
  <description>
    Refreshes the protein references for all peptide hits from an idXML file.
  </description>
  <macros>
    <import>macros.xml</import>
  </macros>
  <expand macro="stdio" />
  <expand macro="requires" />
  <command interpreter="python">
    openms_wrapper.py --executable 'PeptideIndexer' --config $config
  </command>
  <configfiles>
    <configfile name="config">[simple_options]
in=$input1
fasta=$database
out=$output
decoy_string=$decoy_string
#if $decoy_string_position == "prefix"
prefix=true
#end if
$extact_search
$write_protein_sequence
$keep_unreferenced_proteins
aaa_max=$aaa_max
</configfile>
  </configfiles>
  <inputs>
    <param name="input1" label="Identification Input" type="data" format="idxml" />
    <param name="database" label="Database" type="data" format="fasta" />
    <param name="decoy_string" type="text" value="_rev" label="Decoy string"/>
    <param name="decoy_string_position" type="select" label="Decoy Position">
      <option value="suffix" selected="true">Suffix</option>
      <option value="prefix">Prefix</option>
    </param>    
    <param name="extact_search" label="Exact Search" type="boolean" truevalue="" falsevalue="full_tolerant_search=true" checked="true" />
    <param name="write_protein_sequence" type="boolean" truevalue="write_protein_sequence=true" falsevalue="" checked="false" label="Store Protein Sequences" />
    <param name="keep_unreferenced_proteins" label="Keep Unreferenced Proteins" truevalue="keep_unreferenced_proteins=true" falsevalue="" type="boolean" />
    <param name="aaa_max" type="integer" value="4" label="Maximum Number of Ambiguous Amino Acids" help="Maximum number of ambiguous amino acids (AAAs) allowed when matching to a protein database that contains AAAs. AAAs are 'B', 'Z', and 'X'." />
  </inputs>
  <outputs>
    <data format="idxml" name="output" />
  </outputs>
  <help>
**What it does**

Each peptide hit is annotated with a target_decoy string indicating whether the peptide sequence is found in a 'target' protein, a 'decoy' protein, or both ('target+decoy'). This information is crucial for the FalseDiscoveryRate and IDPosteriorErrorProbability tools.

Note:
Make sure that the protein names in your database contain a correctly formatted decoy string. This can be ensured by using DecoyDatabase. If the decoy identifier is not recognized, all proteins will be assumed to stem from the target part of the query.
E.g., "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..." is invalid, since the tool has no knowledge of how SwissProt entries are built up. A correct identifier could be "rev_sw|P33354|YEHR_ECOLI Uncharacterized li ..." or "sw|P33354|YEHR_ECOLI_rev Uncharacterized li", depending on whether you are using prefix annotation or not.
This tool will also give you some target/decoy statistics when it's done. Look carefully!
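
The decoy check described in the note can be pictured with a minimal Python sketch. It is not taken from PeptideIndexer itself: the function names ``is_decoy`` and ``target_decoy_label`` are hypothetical, and the sketch assumes that the decoy string is compared case-sensitively against the very start (prefix) or very end (suffix) of the first whitespace-delimited token of the FASTA header.

```python
def is_decoy(header, decoy_string="_rev", prefix=False):
    """Return True if the protein identifier carries the decoy string.

    Assumption: the identifier is the first whitespace-delimited token
    of the FASTA header, and the comparison is case-sensitive.
    """
    identifier = header.split()[0]
    if prefix:
        return identifier.startswith(decoy_string)
    return identifier.endswith(decoy_string)


def target_decoy_label(headers, decoy_string="_rev", prefix=False):
    """Combine all proteins a peptide maps to into one annotation string."""
    if not headers:
        raise ValueError("peptide is not matched to any protein")
    kinds = set()
    for header in headers:
        kinds.add("decoy" if is_decoy(header, decoy_string, prefix) else "target")
    if kinds == {"target", "decoy"}:
        return "target+decoy"
    return kinds.pop()
```

Under these assumptions, "sw|P33354|YEHR_ECOLI_rev Uncharacterized li" is classified as a decoy with the suffix "_rev", while "sw|P33354_REV|YEHR_ECOLI Uncharacterized lipop..." is treated as a target because the decoy string sits in the middle of the identifier.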

By default, the tool fails if an unmatched peptide occurs, i.e., the database does not contain the corresponding protein. You can force the tool to return successfully in this case by using the flag 'allow_unmatched'.

Some search engines (such as Mascot) will replace ambiguous AAs in the protein database with unambiguous AAs in the reported peptides, e.g., exchange 'X' with 'H'. Such a peptide cannot be found by exactly matching its sequence. However, these cases are recovered by tolerant search (done automatically). In all cases, ambiguous AAs in the peptide sequence must match exactly in the protein DB (i.e., 'X' in a peptide only matches 'X').

Two search modes are available:

- exact: Peptide sequences require an exact match in the protein database. If no protein can be found for a peptide, tolerant matching is automatically used for that peptide. Thus, the results for such peptides are identical in both search modes.
- tolerant: Allow ambiguous AAs in the protein sequence, e.g., 'M' in the peptide will match 'X' in the protein. This mode might yield more protein hits for some peptides (even if they have exact matches as well).

The exact mode is much faster (about 10x) and consumes less memory (about 2.5x less), but might fail to report a few proteins with ambiguous AAs for some peptides. Usually these proteins are putative, however.
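
The difference between the two modes can be illustrated with a small Python sketch. It is a naive sliding-window comparison written for illustration only, not the algorithm used by PeptideIndexer. The names ``residue_matches``, ``find_hits`` and ``search`` are hypothetical, 'B' and 'Z' are given their standard meanings (D/N and E/Q), and the sketch assumes that ``aaa_max`` limits how many ambiguous protein residues a single match may use.

```python
# Residues each ambiguity code may stand for (standard IUPAC meanings).
AMBIGUOUS = {
    "B": set("DN"),
    "Z": set("EQ"),
    "X": set("ACDEFGHIKLMNPQRSTVWY"),
}


def residue_matches(pep_aa, prot_aa, tolerant):
    """Compare one peptide residue with one protein residue."""
    if pep_aa == prot_aa:
        return True
    if pep_aa in AMBIGUOUS:
        # Ambiguous codes in the peptide only match themselves.
        return False
    return tolerant and pep_aa in AMBIGUOUS.get(prot_aa, ())


def find_hits(peptide, protein, tolerant=False, aaa_max=4):
    """Return the start offsets at which the peptide matches the protein."""
    hits = []
    for start in range(len(protein) - len(peptide) + 1):
        window = protein[start:start + len(peptide)]
        pairs = list(zip(peptide, window))
        if not all(residue_matches(p, q, tolerant) for p, q in pairs):
            continue
        # Assumption: count the ambiguous protein residues this match relied on.
        used = sum(1 for p, q in pairs if p != q)
        if used > aaa_max:
            continue
        hits.append(start)
    return hits


def search(peptide, proteins, mode="exact", aaa_max=4):
    """Map protein names to hit offsets; exact mode falls back to tolerant."""
    tolerant = mode == "tolerant"
    hits = {}
    for name, sequence in proteins.items():
        offsets = find_hits(peptide, sequence, tolerant, aaa_max)
        if offsets:
            hits[name] = offsets
    if not hits and not tolerant:
        # No exact match anywhere: retry this peptide with tolerant matching.
        return search(peptide, proteins, "tolerant", aaa_max)
    return hits
```

With the hypothetical proteins ``{"P1": "AAMKAA", "P2": "AAXKAA"}``, the peptide "MK" is reported only for P1 in exact mode but for both P1 and P2 in tolerant mode, which is the "more protein hits" case described above.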

**Citation**

For the underlying tool, please cite ``Marc Sturm, Andreas Bertsch, Clemens Gröpl, Andreas Hildebrandt, Rene Hussong, Eva Lange, Nico Pfeifer, Ole Schulz-Trieglaff, Alexandra Zerck, Knut Reinert, and Oliver Kohlbacher, 2008. OpenMS – an Open-Source Software Framework for Mass Spectrometry. BMC Bioinformatics 9: 163. doi:10.1186/1471-2105-9-163.``

If you use this tool in Galaxy, please cite Chilton J, et al. https://bitbucket.org/galaxyp/galaxyp-toolshed-openms
  </help>
</tool>