view astral.xml @ 0:66ebc4b19d6c draft default tip

"planemo upload for repository https://github.com/smirarab/ASTRAL commit 0f93f327c49e93d6af057973d68ba772ba5715dc-dirty"
author padge
date Wed, 13 Apr 2022 15:03:31 +0000

<tool id="astral" name="ASTRAL-III. Tool for estimating an unrooted species tree given a set of unrooted gene trees." version="0.1.0" python_template_version="3.5">
    <requirements>
        <requirement type="package" version="5.7.8">astral-tree</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[    
        astral -i $input1 -t ${branch_annotation_level_selector} -o ./output.tre -c $lambda 2> $log_output
        &&
        mv ./output.tre $output
        #if str($branch_annotation_level_selector) in ("16", "32")
            &&
            mv freqQuad.csv $branch_annotations
        #end if
    ]]></command>
    <inputs>
        <param name="input1" type="data" format="newick" multiple="false" label="Tree file" optional="false" />
        <param name="branch_annotation_level_selector" type="select" label="Which annotations should be added to each branch (-t)?">
            <option value="3">3 (default): only the posterior probability for the main resolution.</option>
            <option value="0">0: no annotations.</option>
            <option value="1">1: only the quartet support for the main resolution.</option>
            <option value="2">2: full annotation.</option>
            <option value="4">4: three alternative posterior probabilities.</option>
            <option value="8">8: three alternative quartet scores.</option>
            <option value="16">16: export branch annotations to freqQuad.csv, with local posterior probabilities in the 4th column (see below).</option>
            <option value="32">32: export branch annotations to freqQuad.csv, with normalized quartet scores in the 4th column (see below).</option>
            <option value="10">10: p-values of a polytomy null hypothesis test.</option>
        </param>
        <param name="lambda" type="float" optional="true" value="0.5" min="0.0" max="10.0" label="lambda" help="lambda parameter for the Yule prior used in the calculations of branch lengths and posterior probabilities. Default: 0.5" multiple="false"/>
    </inputs>
    <outputs>
        <data name="output" format="newick" label="Output tree file"/>
        <data name="log_output" format="txt" label="Astral log."/>
        <data name="branch_annotations" format="tabular" label="Branch annotations file.">
            <filter>branch_annotation_level_selector == '16' or branch_annotation_level_selector == '32'</filter>
        </data>
    </outputs>
    <tests>
        <test>
            <param name="input1" value="song_mammals.424.gene.tre"/>
            <param name="branch_annotation_level_selector" value="16" />
            <param name="lambda" value="2" />
            <output name="output" file="song_mammals.tre" ftype="newick"/>
            <output name="log_output">
                <assert_contents>
                    <has_line line="Number of taxa: 37 (37 species)" />
                    <has_line line="gradient0: 1933" />
                    <has_line line="Number of Clusters after addition by distance: 1933" />
                    <has_line line="Final quartet score is: 25526915" />
                </assert_contents>
            </output>
            <output name="branch_annotations" file="freqQuad.csv" ftype="tabular"/>
        </test>
        <test>
            <param name="input1" value="song_primates.424.gene.tre"/>
            <param name="branch_annotation_level_selector" value="0" />
            <param name="lambda" value="2" />
            <output name="output" file="song_primates.tre" ftype="newick"/>
            <output name="log_output">
                <assert_contents>
                    <has_line line="Number of taxa: 14 (14 species)" />
                    <has_line line="gradient0: 339" />
                    <has_line line="Number of Clusters after addition by distance: 339" />
                    <has_line line="Final quartet score is: 389734" />
                </assert_contents>
            </output>
        </test>
    </tests>
    <help><![CDATA[         
Newick annotations
    no annotations (-t 0): This turns off calculation and reporting of posterior probabilities and branch lengths.
    Quartet support (-t 1): The percentage of quartets in your gene trees that agree with a branch (normalized quartet support) gives us a good way of measuring the amount of gene tree conflict around a branch. Note that the local posterior probabilities are computed based on a transformation of normalized quartet scores (see Figure 2 of this paper).
    Alternative quartet topologies (-t 8): Outputs q1, q2, q3; these three values show quartet support (as defined above) for the main topology (LR|SO), first alternative (RS|LO) and second alternative (RO|LS), respectively.
    Local posterior (-t 3): The default. Shows the local posterior probability for the main topology.
    Alternative posteriors (-t 4): The output includes three local posterior probabilities: one for the main topology, and one for each of the two alternatives (RS|LO and RO|LS, in that order). The posteriors of the three topologies add up to 1. This follows from our locality assumption: we assume that the four groups around the branch (L, R, S, and O) are each correct, so there are only three possible alternatives.
    Full annotation (-t 2): When you use this option, for each branch you get a lot of different measurements:
    q1,q2,q3: these three values show quartet support (as defined in the description of -t 1) for the main topology, the first alternative, and the second alternative, respectively.
    f1, f2, f3: these three values show the total number of quartet trees in all the gene trees that support the main topology, the first alternative, and the second alternative, respectively.
    pp1, pp2, pp3: these three show the local posterior probabilities (as defined in the description of -t 4) for the main topology, the first alternative, and the second alternative, respectively.
    QC: this shows the total number of quartets defined around each branch (this is what our paper calls m).
    EN: this is the effective number of genes for the branch. If you don't have any missing data, this would be the number of gene trees in your input. When there are missing data, some gene trees might have nothing to say about a branch. Thus, the effective number of genes might be smaller than the total number of genes.
    Polytomy test (-t 10): runs an experimental test to see if a null hypothesis that the branch is a polytomy could be rejected. See this paper: doi:10.3390/genes9030132.
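
    The measurements listed above can be pulled out of a branch label with a few lines of Python. This is a minimal sketch, not part of ASTRAL: it assumes the label is a bracketed, semicolon-separated list of key=value pairs (optionally wrapped in single quotes), and the example label and the function name parse_annotation are hypothetical.

```python
def parse_annotation(label):
    """Parse a label such as '[q1=0.8;q2=0.1;...]' into a dict of floats.

    The bracketed key=value;key=value layout is an assumption of this sketch.
    """
    text = label.strip().strip("'").strip("[]")
    fields = {}
    for item in text.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = float(value)
    return fields


# Hypothetical label for one branch (values invented for illustration).
label = ("'[q1=0.8;q2=0.1;q3=0.1;f1=80.0;f2=10.0;f3=10.0;"
         "pp1=0.99;pp2=0.005;pp3=0.005;QC=120;EN=100.0]'")
ann = parse_annotation(label)

# The three local posteriors should add up to 1, as described for -t 4.
pp_total = ann["pp1"] + ann["pp2"] + ann["pp3"]
print(sorted(ann), pp_total)
```

    Checking that pp1 + pp2 + pp3 is close to 1 is a quick sanity test that the label was parsed correctly.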

File export of branch annotations
    Since it is often hard to know which branch is L and which branch is R, understanding branch annotations is a bit hard for most users. 
    To help, we have added a feature for outputting some of the branch annotations into a .csv file.
    Note that we strongly suggest using DiscoVista to visualize the quartet frequencies. If you find DiscoVista hard to install and use, you can instead use these .csv files.

To get the .csv outputs, you can use -t 16 and -t 32.
    .csv output has the following format.
    The output file is always called freqQuad.csv and is written to the same directory as the input file (sorry for the ugliness!)
    The file is tab-delimited.
    1st column: a dummy name for the node. Note that every three consecutive lines share the same node name. Around each node, we have three possible unrooted topologies (NNI rearrangements). We show stats for these three rearrangements.
    2nd column: the topology name for which we are giving the scores. Here, t1 is always the main topology (observed in your species tree) and t2 and t3 are the two alternatives.
    3rd column: Gives the actual topology with the format: {A}|{B}#{C}|{D}. This means that the quartet topology being scored is putting groups A and B together on one side, and groups C and D on the other side. Please remember that quartets are unrooted trees. Each of the groups is a comma-separated list of species.
    5th column: number of gene trees that match the topology in this line.
    You will note that this number is not always an integer number. The reason is that in each gene tree, groups A, B, C, and D may not be together. Those gene trees still count as 1 unit, but they can contribute a fraction of that total of 1 to each of the tree topologies. So a gene tree may count as 0.7 for one topology, 0.2 for another, and 0.1 for the third.
    6th column: This is the total number of gene trees that had any useful information about this branch.
    If you have no missing data, this should equal the total number of gene trees.
    If you have missing data, some genes may be missing one of groups A, B, C, or D entirely. Those genes will be agnostic about this branch. This column gives the number of genes that have at least one species from each of A, B, C, and D.
    4th column: this is likely the column you are interested in. Its content depends on whether you used -t 16 or -t 32.
    -t 16: This column is the local posterior for the topology given in this line. Note that the local posterior probability is different from the normalized quartet score. See Figure 2 of this paper.
    -t 32: This column is simply the 5th column divided by the 6th column. Thus, it gives the normalized quartet score for that topology. Note that the three lines with the same node name (1st column) will add up to one in their 4th column.
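
    The column layout above can be checked with a short Python sketch. It assumes a tab-delimited file with exactly the six columns described and no header row. The sample rows, node name, and function name read_freqquad are hypothetical, and the 4th column mimics -t 32 output.

```python
import csv
import io
from collections import defaultdict


def read_freqquad(handle):
    """Group the rows of a freqQuad.csv-style file by node name (1st column)."""
    nodes = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        nodes[row[0]].append({
            "name": row[1],               # t1, t2 or t3
            "topology": row[2],           # {A}|{B}#{C}|{D}
            "value": float(row[3]),       # posterior (-t 16) or quartet score (-t 32)
            "matching": float(row[4]),    # gene trees matching this topology
            "informative": float(row[5])  # gene trees informative for this branch
        })
    return nodes


# Hypothetical rows for a single node (values invented for illustration).
sample = (
    "N1\tt1\t{A}|{B}#{C}|{D}\t0.7\t70.0\t100.0\n"
    "N1\tt2\t{A}|{C}#{B}|{D}\t0.2\t20.0\t100.0\n"
    "N1\tt3\t{A}|{D}#{B}|{C}\t0.1\t10.0\t100.0\n"
)
nodes = read_freqquad(io.StringIO(sample))

for node, rows in nodes.items():
    # With -t 32, the 4th column equals 5th / 6th and sums to one per node.
    total = sum(r["value"] for r in rows)
    ratios_ok = all(
        abs(r["value"] - r["matching"] / r["informative"]) < 1e-9 for r in rows
    )
    print(node, len(rows), total, ratios_ok)
```

    For a real run, replace io.StringIO(sample) with an open handle on the exported branch annotations file.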

Prior hyper-parameter
    Our calculations of the local posterior probabilities and branch lengths use a Yule prior model for the branch lengths of the species tree. 
    The speciation rate (in coalescent units) of the Yule process (lambda) is by default set to 0.5, which results in a flat prior for the quartet frequencies in the [1/3,1] range. 
    Using -c option one can adjust the hyper-parameter for the prior. 
    For example, you might want to estimate lambda from the data after one run and plug the estimate into the prior for a subsequent run. 
    We have not yet fully explored the impact of lambda on the posterior. 
    For branch lengths, lambda acts as a pseudocount and can have a substantial impact on the estimated branch length for very long branches. 
    More specifically, if there is no, or very little, discordance around a branch, 
    the MAP length of the branch (which is what we report) is almost fully determined by the prior.

    Note that setting lambda to 0 results in reporting ML estimates of the branch lengths instead of MAP. 
    However, for branches with no discordance, we cannot compute a branch length. 
    For these, we currently arbitrarily set ML to 10 coalescent units (we might change this in future versions).
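
    To see why a branch with no discordance has no finite ML length, the sketch below uses the standard multispecies coalescent relation for an internal branch, P(match) = 1 - (2/3) * exp(-length), and inverts it. This illustrates the lambda = 0 case only and is not ASTRAL's actual code. The function name, the clamp to 0 for frequencies at or below 1/3, and the example counts are assumptions.

```python
import math

ML_CAP = 10.0  # value the text says is used when there is no discordance


def ml_branch_length(matching, informative, cap=ML_CAP):
    """ML internal branch length in coalescent units from quartet counts.

    Inverts P(match) = 1 - (2/3) * exp(-length); a sketch, not ASTRAL's code.
    """
    freq = matching / informative
    if freq >= 1.0:
        return cap  # log(0) is undefined, so fall back to the arbitrary cap
    if freq <= 1.0 / 3.0:
        return 0.0  # assumption: no support beyond chance maps to length 0
    return min(cap, -math.log(1.5 * (1.0 - freq)))


# Hypothetical counts: 70 of 100 informative gene trees match the branch.
print(ml_branch_length(70, 100))
# No discordance at all: the estimate falls back to the cap.
print(ml_branch_length(100, 100))
```

    With a positive lambda, the prior keeps the estimate finite even when every gene tree agrees, which is why the reported MAP length for such branches is driven almost entirely by the prior.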

    ]]></help>
    <citations>
        <citation type="bibtex">
@misc{githubASTRAL,
  author = {LastTODO, FirstTODO},
  year = {TODO},
  title = {ASTRAL},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/smirarab/ASTRAL},
}</citation>
    </citations>
</tool>