view pmids_to_pubtator_matrix.xml @ 1:f55a02423699 draft

"planemo upload for repository https://github.com/dlal-group/simtext commit fd3f5b7b0506fbc460f2a281f694cb57f1c90a3c-dirty"
author dlalgroup
date Thu, 24 Sep 2020 04:28:18 +0000
parents 3f4adc85ba5d
children ada4c7b3fc39

  <tool id="pmids_to_pubtator_matrix" name="pmids_to_pubtator_matrix" version="@VERSION@">
    <description>Per row, extract scientific terms from PMIDs using PubTator and generate a binary matrix</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements">
      <requirement type="package" version="2.0.1">r-argparse</requirement>
      <requirement type="package" version="1.4.0">r-stringr</requirement>
      <requirement type="package" version="1.95_4.12">r-rcurl</requirement>
      <requirement type="package" version="1.4.3">r-stringi</requirement>
    </expand>
    <expand macro="stdio"/>
    <command><![CDATA[Rscript 
      '${__tool_directory__}/pmids_to_pubtator_matrix.R'
      --input '$input'
      --output '$output'
      --number '$number'
      $byid
      #for $category in $categories:
      --categories '$category'
      #end for
      ]]>
    </command>
    <inputs>
      <param argument="--input" label="Input file" name="input" optional="false" type="data" format="tabular" help="Tab-delimited table whose PMID columns have names starting with PMID, e.g. PMID_1, PMID_2."/>
      <param argument="--categories" name="categories" type="select" label="Categories of terms to extract" multiple="true">
        <option value="Gene">Genes</option>
        <option value="Disease">Diseases</option>
        <option value="Mutation">Mutations</option>
        <option value="Chemical">Chemicals</option>
        <option value="Species">Species</option>
      </param>
      <param argument="--byid" label="Group terms by their gene IDs / MeSH IDs instead of using the specific scientific terms." name="byid" type="boolean" truevalue="--byid" falsevalue="" help="If enabled, common gene IDs / MeSH IDs are reported instead of individual terms." checked="false"/>
      <param argument="--number" label="Number of most frequent terms/IDs to extract." name="number" optional="true" type="integer" help="Number of most frequent terms/IDs to extract per row." value="50"/>
    </inputs>
    <outputs>
      <data format="tabular" name="output" />
    </outputs>
    <tests>
        <test>
          <param name="input" value="pubmed_by_queries_output" ftype="tabular"/>
          <output name="output" value="pmids_to_pubtator_matrix_output"/>
        </test>
    </tests>
    <help><![CDATA[

The tool takes all PMIDs in each row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms from the corresponding abstracts, using PubTator annotations. The user can choose which categories terms are extracted from. The extracted terms are combined into one large binary matrix, where 0 means the term is not present in the abstracts of that row and 1 means the term is present in the abstracts of that row. The user can decide whether the scientific terms are extracted and used as they are or grouped by their gene IDs / MeSH IDs (several terms are often grouped under one ID). The user can also specify the number of most frequent terms to extract per row.

Input: Output of the 'abstracts_by_pmids' tool, or a tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.

Output: Binary matrix in which each column represents one of the extracted terms.
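
Example: a rough sketch of the expected table shapes, not real tool output. All row names, PMIDs and terms below are hypothetical placeholders, and the leading "ID_1" column is an assumption about the input layout. An input table such as::

    ID_1    PMID_1      PMID_2
    row_a   00000001    00000002
    row_b   00000003    00000004

might produce a binary matrix with one row per input row and one column per extracted term, for example::

    term_x  term_y  term_z
    1       0       1
    0       1       1

Here "term_x" was found only in the abstracts of the first row, "term_y" only in those of the second row, and "term_z" in both.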

        ]]></help>
    <expand macro="citations"/>
</tool>