What it does
Description
This tool creates a model that recommends tools in Galaxy by learning how tools are connected in workflows. The model is an HDF5 file containing the tool dictionary and the weights and configuration of the neural network. A recurrent neural network with Gated Recurrent Units (GRU) learns the higher-order dependencies in the tool connections of workflows. The tool takes two tabular files as input: one for the workflows and another for the tools' usage frequencies. Several further parameters control the search for the best configuration of the neural network, which is carried out with a Bayesian optimisation hyperparameter search. Once the best configuration is found, a model is created which can be used to recommend tools in Galaxy. The input data and network parameters are explained below.
Input files
There are two input files:
wf_id  wf_updated  in_id  in_tool  in_tool_v  out_id  out_tool  out_tool_v  published  deleted  has_error
3      2013-02-07  7      Cut1     1.0.0      5       Grep1     1.0.1       f          t        f
The first column (wf_id) is the workflow id, the second (wf_updated) is the timestamp of the last update, the third (in_id) is the id of the tool which is the input to the tool connection, the fourth (in_tool) is the name of the input tool, the fifth (in_tool_v) is the version of the input tool, the sixth (out_id) is the id of the output tool in the tool connection, the seventh (out_tool) is the name of the output tool and the eighth (out_tool_v) is the version of the output tool. The tool connections (rows) of each workflow are used to recreate the workflow as a directed acyclic graph, from which the unique tool sequences are extracted. A recurrent neural network then learns higher-order dependencies from these tool sequences in order to recommend tools. The last three columns (published, deleted, has_error) indicate whether a workflow is published, whether it has been deleted and whether it has any errors. Collectively, they help to determine whether a workflow is of good quality.
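The extraction of tool sequences described above can be sketched as follows. This is a minimal illustration and not the tool's actual implementation: the function name `extract_tool_sequences`, the edge direction (from the input tool to the output tool) and the extra example rows for `Sort1` and `Uniq1` are assumptions made for this sketch.

```python
from collections import defaultdict


def extract_tool_sequences(connections):
    """Return the unique root-to-leaf tool-name paths of one workflow.

    `connections` holds one (in_id, in_tool, out_id, out_tool) tuple per
    row of the workflow file, restricted to a single wf_id.
    Assumption: each row is an edge from the input tool to the output tool.
    """
    children = defaultdict(list)  # step id -> ids of downstream steps
    names = {}                    # step id -> tool name
    has_parent = set()            # step ids that appear as an edge target
    for in_id, in_tool, out_id, out_tool in connections:
        children[in_id].append(out_id)
        names[in_id] = in_tool
        names[out_id] = out_tool
        has_parent.add(out_id)

    roots = [step for step in names if step not in has_parent]
    paths = set()

    def walk(step, path):
        # Depth-first traversal; terminates because the graph is acyclic.
        path = path + (names[step],)
        if not children[step]:
            paths.add(path)
        for nxt in children[step]:
            walk(nxt, path)

    for root in roots:
        walk(root, ())
    return sorted(paths)


# Hypothetical workflow: the first row is taken from the example above,
# the other two are made up for illustration.
example = [
    (7, "Cut1", 5, "Grep1"),
    (5, "Grep1", 9, "Sort1"),
    (7, "Cut1", 8, "Uniq1"),
]
sequences = extract_tool_sequences(example)
```

For this hypothetical workflow, `sequences` holds the two paths `("Cut1", "Grep1", "Sort1")` and `("Cut1", "Uniq1")`. If the data treats the connection in the opposite direction, swapping the roles of the two ids in the loop is sufficient.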
The second file ("dataset containing usage frequencies of tools") is also a tabular file containing the usage frequencies of tools for a period of time. It has 3 columns:
upload1  2019-03-01  176
toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72  2019-03-01  97
toolshed.g2.bx.psu.edu/repos/bgruening/deeptools_bam_coverage/deeptools_bam_coverage/3.0.2.0  2019-03-01  67
The first column is the name of the tool, the second is the date and the last one is the number of times the tool was used in that month. For example, if this data is collected for 1 year, each tool appears in this list at most 12 times (once for every month in which its usage is > 0). This data reveals the usage pattern of a tool over the last year, i.e. whether it is used often (high frequency in recent months) or hardly at all (low frequency in recent months). These frequencies are then used as weights for the tools when training the neural network. Tools with a high frequency in recent months are more important than tools with a low frequency in recent months. This constraint makes it possible to phase out of the predictions those tools which have not been used recently.
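One way to turn monthly usage counts into a single weight per tool is sketched below. The text does not specify how the tool combines the monthly frequencies, so the exponential decay, the function name `recency_weighted_usage`, the `decay` parameter and the example rows for 2019-02 and for `old_tool` are all assumptions made purely for illustration.

```python
from collections import defaultdict


def recency_weighted_usage(rows, decay=0.5):
    """Combine monthly usage counts into one weight per tool.

    `rows` is an iterable of (tool, "YYYY-MM-DD", count) tuples.
    Assumption: the usage of each older month is multiplied by a further
    factor of `decay`, so recent usage dominates the weight.
    """
    per_tool = defaultdict(dict)  # tool -> {"YYYY-MM": count}
    months = set()
    for tool, date, count in rows:
        month = date[:7]
        per_tool[tool][month] = per_tool[tool].get(month, 0) + int(count)
        months.add(month)

    ordered = sorted(months, reverse=True)  # most recent month first
    weights = {}
    for tool, usage in per_tool.items():
        weights[tool] = sum(
            usage.get(month, 0) * decay ** age
            for age, month in enumerate(ordered)
        )
    return weights


# The first row comes from the example above; the others are made up.
rows = [
    ("upload1", "2019-03-01", 176),
    ("upload1", "2019-02-01", 100),
    ("old_tool", "2019-02-01", 176),
]
weights = recency_weighted_usage(rows)
```

Here `upload1` receives a higher weight than the hypothetical `old_tool`, although both have a month with 176 uses, because the usage of `old_tool` lies further in the past.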
Parameters
There are multiple parameters which can be set. They are divided into 3 categories:
Output file
The output file (model) is an HDF5 file (http://docs.h5py.org/en/latest/high/file.html) containing multiple attributes such as the dictionary of tools, the neural network configuration, the weights of each layer and the weights of all tools. After the tool has finished executing, the model file can be downloaded and placed at "/galaxy/database/" inside the codebase of a Galaxy instance. To see the recommended tools in Galaxy (i.e. to enable the UI integrations), the following changes should be made to the "galaxy.yml" file:
- Enable and then set the property "enable_tool_recommendation" to "true".