Repository 'create_tool_recommendation_model'
hg clone https://toolshed.g2.bx.psu.edu/repos/bgruening/create_tool_recommendation_model

Changeset 3:5b3c08710e47 (2020-05-09)
Previous changeset 2:76251d1ccdcc (2019-10-11) Next changeset 4:afec8c595124 (2020-07-07)
Commit message:
"planemo upload for repository https://github.com/bgruening/galaxytools/tree/recommendation_training/tools/tool_recommendation_model commit c635df659fe1835679438589ded43136b0e515c6"
modified:
create_tool_recommendation_model.xml
extract_workflow_connections.py
main.py
optimise_hyperparameters.py
predict_tool_usage.py
prepare_data.py
test-data/test_tool_usage
test-data/test_workflows
utils.py
diff -r 76251d1ccdcc -r 5b3c08710e47 create_tool_recommendation_model.xml
--- a/create_tool_recommendation_model.xml Fri Oct 11 18:24:54 2019 -0400
+++ b/create_tool_recommendation_model.xml Sat May 09 05:38:23 2020 -0400
@@ -1,4 +1,4 @@
-<tool id="create_tool_recommendation_model" name="Create a model to recommend tools" version="0.0.1">
+<tool id="create_tool_recommendation_model" name="Create a model to recommend tools" version="0.0.2">
     <description>using deep learning</description>
     <requirements>
         <requirement type="package" version="3.6">python</requirement>
@@ -21,7 +21,6 @@
             --optimize_n_epochs '$training_parameters.optimize_n_epochs'
             --max_evals '$training_parameters.max_evals'
             --test_share '$training_parameters.test_share'
-            --validation_share '$training_parameters.validation_share'
             --batch_size '$nn_parameters.batch_size'
             --units '$nn_parameters.units'
             --embedding_size '$nn_parameters.embedding_size'
@@ -29,8 +28,6 @@
             --spatial_dropout '$nn_parameters.spatial_dropout'
             --recurrent_dropout '$nn_parameters.recurrent_dropout'
             --learning_rate '$nn_parameters.learning_rate'
-            --activation_recurrent '$nn_parameters.activation_recurrent'
-            --activation_output '$nn_parameters.activation_output'
             --output_model '$outfile_model'
 ]]>
     </command>
@@ -45,23 +42,21 @@

         </section>
         <section name="training_parameters" title="Training parameters" expanded="False">
-            <param name="max_evals" type="integer" value="50" label="Maximum number of evaluations of different configurations of parameters" help="Provide an integer. Different combinations of parameters are sampled and optimized to find the best one. This number specifies the number of different configurations sampled and tested."/>
+            <param name="max_evals" type="integer" value="20" label="Maximum number of evaluations of different configurations of parameters" help="Provide an integer. Different combinations of parameters are sampled and optimized to find the best one. This number specifies the number of different configurations sampled and tested."/>

-            <param name="optimize_n_epochs" type="integer" value="20" label="Number of training iterations to optimize the neural network parameters" help="Provide an integer. This number specifies the number of training iterations done for each sampled configuration while optimising the parameters."/>
+            <param name="optimize_n_epochs" type="integer" value="5" label="Number of training iterations to optimize the neural network parameters" help="Provide an integer. This number specifies the number of training iterations done for each sampled configuration while optimising the parameters."/>

-            <param name="n_epochs" type="integer" value="20" label="Number of training iterations" help="Provide an integer. This specifies the number of deep learning training iterations done after finding the best/optimised configuration of neural network parameters."/>
+            <param name="n_epochs" type="integer" value="10" label="Number of training iterations" help="Provide an integer. This specifies the number of deep learning training iterations done after finding the best/optimised configuration of neural network parameters."/>

-            <param name="test_share" type="float" value="0.0" label="Share of the test data" help="Provide a real number between 0.0 and 1.0. This set of data is used to look through the prediction accuracy on unseen data after neural network training on an optimised configuration of parameters. It should be set to 0.0 while training for a model to be deployed to production. The minimum value can be 0.0 and maximum value should not be more than 0.5."/>
-
-            <param name="validation_share" type="float" value="0.2" label="Share of the validation data" help="Provide a real number between 0.0 and 1.0. This set of data is used to validate each step of learning while optimising the configurations of parameters. The minimum value can be 0.0 and maximum value should not be more than 0.5."
[...]ast 3 columns give more information about workflows if they are published, non-deleted and has any errors. Collectively, they are useful to determine if the workflows are of good quality.
+
 2. The second file ("dataset containing usage frequencies of tools") is also a tabular file containing the usage frequencies of tools for a period of time. It has 3 columns:

     ============================================================================================  ==========  ===
@@ -196,7 +189,6 @@
     - "optimize_n_epochs": This number specifies how many iterations would the neural network executes to evaluate each sampled configuration.
     - "n_epochs": Once the best configuration of hyperparameters has been found, the neural network takes this configuration and runs for "n_epochs" number of times minimising the error to produce a model at the end.
     - "test_share": It specifies the size of the test set. For example, if it is 0.5, then the test set is half of the entire data available. It should not be set to more than 0.5. This set is used for evaluating the precision on an unseen set.
-    - "validation_share": It specifies the size of the validation set. For example, if it is 0.5, then the validation set is half of the entire data available. It should not be set to more than 0.5. This set is used for computing error while training on the best configuration.

 3. Neural network parameters:
     - "batch_size": The training of the neural network is done using batch learning in this work. The training data is divided into equal batches and for each epoch (a training iteration), all batches of data are trained one after another. A higher or lower value can unsettle the training. Therefore, this parameter should be optimised.
@@ -206,17 +198,15 @@
     - "spatial_dropout": Similar to dropout, this is used to reduce overfitting in the embedding layer. This parameter should be optimised as well.
     - "recurrent_dropout": Similar to dropout and spatial dropout, this is used to reduce overfitting in the recurrent layers (hidden). This parameter should be optimised as well.
     - "learning_rate": The learning rate specifies the speed of learning. A higher value ensures fast learning (the optimiser may diverge) and a lower value causes slow learning (may not reach the optimum). This parameter should be optimised as well.
-    - "activation_recurrent": Activations are mathematical functions to transform input into output. This takes the name of an activation function from the list of Keras activations (https://keras.io/activations/) for recurrent layers.
-    - "activation_output": This takes the activation for transforming the input of the last layer to the output of the neural network. It is also taken from Keras activations (https://keras.io/activations/).

 -----

+
 **Output file**

 The output file (model) is an HDF5 file (http://docs.h5py.org/en/latest/high/file.html) containing multiple attributes like a dictionary of tools, neural network configuration and weights for each layer, weights of all tools and so on. After the tool has finished executing, it can be downloaded and placed at "/galaxy/database/" inside a Galaxy instance codebase. To see the recommended tools (enable the UI integrations) in Galaxy, the following changes should be made to "galaxy.yml" file:

     - Enable and then set the property "enable_tool_recommendation" to "true".
-    - Enable and then set the property "model_path" to "database/<<model_file_name>>".

         ]]>
     </help>
@@ -225,7 +215,7 @@
             @ARTICLE{anuprulez_galaxytools,
                 Author = {Anup Kumar and Björn Grüning},
                 keywords = {bioinformatics, recommendation system, deep learning},
-                title = {{Tool recommendation system for Galaxy workflows}},
+                title = {{Tool recommendation system for Galaxy}},
                 url = {https://github.com/bgruening/galaxytools}
             }
         </citation>
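A quick, hedged way to inspect the resulting model file described in the help text above: the sketch below only lists the top-level entries of the HDF5 output with h5py. The file name and the assumption that the saved attributes (data_dictionary, model_config, best_parameters, model_weights, compatible_tools, class_weights, standard_connections) appear as top-level keys are illustrative and not taken from this changeset.

import h5py

# Illustrative only: path and key layout are assumptions, not part of the tool.
model_path = "tool_recommendation_model.hdf5"

with h5py.File(model_path, "r") as model_file:
    # Print whatever top-level entries the file actually contains.
    for key in model_file.keys():
        print(key)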
diff -r 76251d1ccdcc -r 5b3c08710e47 extract_workflow_connections.py
--- a/extract_workflow_connections.py Fri Oct 11 18:24:54 2019 -0400
+++ b/extract_workflow_connections.py Sat May 09 05:38:23 2020 -0400
@@ -11,11 +11,17 @@
 
 class ExtractWorkflowConnections:
 
-    @classmethod
     def __init__(self):
         """ Init method. """
 
-    @classmethod
+    def collect_standard_connections(self, row):
+        published = row[8]
+        deleted = row[9]
+        has_errors = row[10]
+        if published == "t" and deleted == "f" and has_errors == "f":
+            return True
+        return False
+
     def read_tabular_file(self, raw_file_path):
         """
         Read tabular file and extract workflow connections
@@ -25,7 +31,8 @@
         workflow_paths_dup = ""
         workflow_parents = dict()
         workflow_paths = list()
-        unique_paths = list()
+        unique_paths = dict()
+        standard_connections = dict()
         with open(raw_file_path, 'rt') as workflow_connections_file:
             workflow_connections = csv.reader(workflow_connections_file, delimiter='\t')
             for index, row in enumerate(workflow_connections):
@@ -35,7 +42,15 @@
                 if wf_id not in workflows:
                     workflows[wf_id] = list()
                 if out_tool and in_tool and out_tool != in_tool:
-                    workflows[wf_id].append((in_tool, out_tool))
+                    workflows[wf_id].append((out_tool, in_tool))
+                    qc = self.collect_standard_connections(row)
+                    if qc:
+                        i_t = utils.format_tool_id(in_tool)
+                        o_t = utils.format_tool_id(out_tool)
+                        if i_t not in standard_connections:
+                            standard_connections[i_t] = list()
+                        if o_t not in standard_connections[i_t]:
+                            standard_connections[i_t].append(o_t)
         print("Processing workflows...")
         wf_ctr = 0
         for wf_id in workflows:
@@ -54,7 +69,6 @@
                     if len(paths) > 0:
                         flow_paths.extend(paths)
             workflow_paths.extend(flow_paths)
-
         print("Workflows processed: %d" % wf_ctr)
 
         # remove slashes from the tool ids
@@ -75,9 +89,8 @@
 
         print("Finding compatible next tools...")
         compatible_next_tools = self.set_compatible_next_tools(no_dup_paths)
-        return unique_paths, compatible_next_tools
+        return unique_paths, compatible_next_tools, standard_connections
 
-    @classmethod
     def set_compatible_next_tools(self, workflow_paths):
         """
         Find next tools for each tool
@@ -97,7 +110,6 @@
             next_tools[tool] = ",".join(list(set(next_tools[tool].split(","))))
         return next_tools
 
-    @classmethod
     def read_workflow(self, wf_id, workflow_rows):
         """
         Read all connections for a workflow
@@ -112,7 +124,6 @@
                 tool_parents[out_tool].append(in_tool)
         return tool_parents
 
-    @classmethod
     def get_roots_leaves(self, graph):
         roots = list()
         leaves = list()
@@ -125,7 +136,6 @@
         leaves = list(set(children).difference(set(all_parents)))
         return roots, leaves
 
-    @classmethod
     def find_tool_paths_workflow(self, graph, start, end, path=[]):
         path = path + [end]
         if start == end:
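Illustrative only: a minimal, self-contained sketch of the new published-workflow filter above, applied to one made-up row of the workflow-connections table (the 't'/'f' flags in the last three columns follow the test-data format in this changeset).

# Hypothetical row: the last three columns are published, deleted, has_errors.
row = ["3", "2013-02-07 16:48:00", "7", "Remove beginning1", "1.0.0",
       "5", "Grep1", "1.0.1", "t", "f", "f"]

def collect_standard_connections(row):
    # A connection is kept as "standard" when the workflow is published,
    # not deleted and has no errors, mirroring the method added above.
    published, deleted, has_errors = row[8], row[9], row[10]
    return published == "t" and deleted == "f" and has_errors == "f"

print(collect_standard_connections(row))  # True for this made-up row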
diff -r 76251d1ccdcc -r 5b3c08710e47 main.py
--- a/main.py Fri Oct 11 18:24:54 2019 -0400
+++ b/main.py Sat May 09 05:38:23 2020 -0400
@@ -20,7 +20,6 @@

 class PredictTool:

-    @classmethod
     def __init__(self, num_cpus):
         """ Init method. """
         # set the number of cpus
@@ -32,47 +31,47 @@
         )
         K.set_session(tf.Session(config=cpu_config))

-    @classmethod
-    def find_train_best_network(self, network_config, reverse_dictionary, train_data, train_labels, test_data, test_labels, n_epochs, class_weights, usage_pred, compatible_next_tools):
+    def find_train_best_network(self, network_config, reverse_dictionary, train_data, train_labels, test_data, test_labels, n_epochs, class_weights, usage_pred, standard_connections, l_tool_freq, l_tool_tr_samples):
         """
         Define recurrent neural network and train sequential data
         """
+        # get tools with lowest representation
+        lowest_tool_ids = utils.get_lowest_tools(l_tool_freq)
+
         print("Start hyperparameter optimisation...")
         hyper_opt = optimise_hyperparameters.HyperparameterOptimisation()
-        best_params, best_model = hyper_opt.train_model(network_config, reverse_dictionary, train_data, train_labels, class_weights)
+        best_params, best_model = hyper_opt.train_model(network_config, reverse_dictionary, train_data, train_labels, test_data, test_labels, l_tool_tr_samples, class_weights)

         # define callbacks
-        early_stopping = callbacks.EarlyStopping(monitor='loss', mode='min', verbose=1, min_delta=1e-4, restore_best_weights=True)
-        predict_callback_test = PredictCallback(test_data, test_labels, reverse_dictionary, n_epochs, compatible_next_tools, usage_pred)
+        early_stopping = callbacks.EarlyStopping(monitor='loss', mode='min', verbose=1, min_delta=1e-1, restore_best_weights=True)
+        predict_callback_test = PredictCallback(test_data, test_labels, reverse_dictionary, n_epochs, usage_pred, standard_connections, lowest_tool_ids)

         callbacks_list = [predict_callback_test, early_stopping]

+        batch_size = int(best_params["batch_size"])
+
         print("Start training on the best model...")
         train_performance = dict()
-        if len(test_data) > 0:
-            trained_model = best_model.fit(
+        trained_model = best_model.fit_generator(
+            utils.balanced_sample_generator(
                 train_data,
                 train_labels,
-                batch_size=int(best_params["batch_size"]),
-                epochs=n_epochs,
-                verbose=2,
-                callbacks=callbacks_list,
-                shuffle="batch",
-                validation_data=(test_data, test_labels)
-            )
-            train_performance["validation_loss"] = np.array(trained_model.history["val_loss"])
-            train_performance["precision"] = predict_callback_test.precision
-            train_performance["usage_weights"] = predict_callback_test.usage_weights
-        else:
-            trained_model = best_model.fit(
-                train_data,
-                train_labels,
-                batch_size=int(best_params["batch_size"]),
-                epochs=n_epochs,
-                verbose=2,
-                callbacks=callbacks_list,
-                shuffle="batch"
-            )
+                batch_size,
+                l_tool_tr_samples
+            ),
+            steps_per_epoch=len(train_data) // batch_size,
+            epochs=n_epochs,
+            callbacks=callbacks_list,
+            validation_data=(test_data, test_labels),
+            verbose=2,
+            shuffle=True
+        )
+        train_performance["validation_loss"] = np.array(trained_model.history["val_loss"])
+        train_performance["precision"] = predict_callback_test.precision
+        train_performance["usage_weights"] = predict_callback_test.usage_weights
+        train_performance["published_precision"] = predict_callback_test.published_precision
+        train_performance["lowest_pub_precision"] = predict_callback_test.lowest_pub_precision
+        train_perform
[...]required=True, help="number of hidden recurrent units")
@@ -125,8 +134,6 @@
     arg_parser.add_argument("-sd", "--spatial_dropout", required=True, help="1d dropout used for embedding layer")
     arg_parser.add_argument("-rd", "--recurrent_dropout", required=True, help="dropout for the recurrent layers")
     arg_parser.add_argument("-lr", "--learning_rate", required=True, help="learning rate")
-    arg_parser.add_argument("-ar", "--activation_recurrent", required=True, help="activation function for recurrent layers")
-    arg_parser.add_argument("-ao", "--activation_output", required=True, help="activation function for output layers")

     # get argument values
     args = vars(arg_parser.parse_args())
@@ -139,7 +146,6 @@
     optimize_n_epochs = int(args["optimize_n_epochs"])
     max_evals = int(args["max_evals"])
     test_share = float(args["test_share"])
-    validation_share = float(args["validation_share"])
     batch_size = args["batch_size"]
     units = args["units"]
     embedding_size = args["embedding_size"]
@@ -147,8 +153,6 @@
     spatial_dropout = args["spatial_dropout"]
     recurrent_dropout = args["recurrent_dropout"]
     learning_rate = args["learning_rate"]
-    activation_recurrent = args["activation_recurrent"]
-    activation_output = args["activation_output"]
     num_cpus = 16

     config = {
@@ -158,35 +162,28 @@
         'optimize_n_epochs': optimize_n_epochs,
         'max_evals': max_evals,
         'test_share': test_share,
-        'validation_share': validation_share,
         'batch_size': batch_size,
         'units': units,
         'embedding_size': embedding_size,
         'dropout': dropout,
         'spatial_dropout': spatial_dropout,
         'recurrent_dropout': recurrent_dropout,
-        'learning_rate': learning_rate,
-        'activation_recurrent': activation_recurrent,
-        'activation_output': activation_output
+        'learning_rate': learning_rate
     }

     # Extract and process workflows
     connections = extract_workflow_connections.ExtractWorkflowConnections()
-    workflow_paths, compatible_next_tools = connections.read_tabular_file(workflows_path)
+    workflow_paths, compatible_next_tools, standard_connections = connections.read_tabular_file(workflows_path)
     # Process the paths from workflows
     print("Dividing data...")
     data = prepare_data.PrepareData(maximum_path_length, test_share)
-    train_data, train_labels, test_data, test_labels, data_dictionary, reverse_dictionary, class_weights, usage_pred = data.get_data_labels_matrices(workflow_paths, tool_usage_path, cutoff_date, compatible_next_tools)
+    train_data, train_labels, test_data, test_labels, data_dictionary, reverse_dictionary, class_weights, usage_pred, l_tool_freq, l_tool_tr_samples = data.get_data_labels_matrices(workflow_paths, tool_usage_path, cutoff_date, compatible_next_tools, standard_connections)
     # find the best model and start training
     predict_tool = PredictTool(num_cpus)
     # start training with weighted classes
     print("Training with weighted classes and samples ...")
-    results_weighted = predict_tool.find_train_best_network(config, reverse_dictionary, train_data, train_labels, test_data, test_labels, n_epochs, class_weights, usage_pred, compatible_next_tools)
-    print()
-    print("Best parameters \n")
-    print(results_weighted["best_parameters"])
-    print()
-    utils.save_model(results_weighted, data_dictionary, compatible_next_tools, trained_model_path, class_weights)
+    results_weighted = predict_tool.find_train_best_network(config, reverse_dictionary, train_data, train_labels, test_data, test_labels, n_epochs, class_weights, usage_pred, standard_connections, l_tool_freq, l_tool_tr_samples)
+    utils.save_model(results_weighted, data_dictionary, compatible_next_tools, trained_model_path, class_weights, standard_connections)
     end_time = time.time()
     print()
     print("Program finished in %s seconds" % str(end_time - start_time))
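The training call above now feeds Keras fit_generator from utils.balanced_sample_generator (full version in the utils.py diff below). A stand-alone sketch of the sampling idea, with toy arrays standing in for the real padded sequences:

import random
import numpy as np

def balanced_sample_generator(train_data, train_labels, batch_size, l_tool_tr_samples):
    # Sketch of the balanced sampling used with fit_generator above: draw a
    # last-tool id uniformly at random, then one of its training samples, so
    # rarely-used tools appear in batches about as often as popular ones.
    tool_ids = list(l_tool_tr_samples.keys())
    while True:
        batch_data = np.zeros((batch_size, train_data.shape[1]))
        batch_labels = np.zeros((batch_size, train_labels.shape[1]))
        for i in range(batch_size):
            tool_id = random.choice(tool_ids)
            sample_index = random.choice(l_tool_tr_samples[tool_id])
            batch_data[i] = train_data[sample_index]
            batch_labels[i] = train_labels[sample_index]
        yield batch_data, batch_labels

# Toy data: 4 padded tool sequences, 3 label positions, two last-tool groups.
toy_data = np.arange(12).reshape(4, 3).astype(float)
toy_labels = np.eye(4, 3)
toy_samples = {"1": [0, 1], "2": [2, 3]}
batch_x, batch_y = next(balanced_sample_generator(toy_data, toy_labels, 2, toy_samples))
print(batch_x.shape, batch_y.shape)  # (2, 3) (2, 3)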
diff -r 76251d1ccdcc -r 5b3c08710e47 optimise_hyperparameters.py
--- a/optimise_hyperparameters.py Fri Oct 11 18:24:54 2019 -0400
+++ b/optimise_hyperparameters.py Sat May 09 05:38:23 2020 -0400
@@ -17,18 +17,13 @@
 
 class HyperparameterOptimisation:
 
-    @classmethod
     def __init__(self):
         """ Init method. """
 
-    @classmethod
-    def train_model(self, config, reverse_dictionary, train_data, train_labels, class_weights):
+    def train_model(self, config, reverse_dictionary, train_data, train_labels, test_data, test_labels, l_tool_tr_samples, class_weights):
         """
         Train a model and report accuracy
         """
-        l_recurrent_activations = config["activation_recurrent"].split(",")
-        l_output_activations = config["activation_output"].split(",")
-
         # convert items to integer
         l_batch_size = list(map(int, config["batch_size"].split(",")))
         l_embedding_size = list(map(int, config["embedding_size"].split(",")))
@@ -41,20 +36,17 @@
         l_recurrent_dropout = list(map(float, config["recurrent_dropout"].split(",")))
 
         optimize_n_epochs = int(config["optimize_n_epochs"])
-        validation_split = float(config["validation_share"])
 
         # get dimensions
         dimensions = len(reverse_dictionary) + 1
         best_model_params = dict()
-        early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, min_delta=1e-4)
+        early_stopping = EarlyStopping(monitor='val_loss', mode='min', verbose=1, min_delta=1e-1, restore_best_weights=True)
 
         # specify the search space for finding the best combination of parameters using Bayesian optimisation
         params = {
             "embedding_size": hp.quniform("embedding_size", l_embedding_size[0], l_embedding_size[1], 1),
             "units": hp.quniform("units", l_units[0], l_units[1], 1),
             "batch_size": hp.quniform("batch_size", l_batch_size[0], l_batch_size[1], 1),
-            "activation_recurrent": hp.choice("activation_recurrent", l_recurrent_activations),
-            "activation_output": hp.choice("activation_output", l_output_activations),
             "learning_rate": hp.loguniform("learning_rate", np.log(l_learning_rate[0]), np.log(l_learning_rate[1])),
             "dropout": hp.uniform("dropout", l_dropout[0], l_dropout[1]),
             "spatial_dropout": hp.uniform("spatial_dropout", l_spatial_dropout[0], l_spatial_dropout[1]),
@@ -65,36 +57,36 @@
             model = Sequential()
             model.add(Embedding(dimensions, int(params["embedding_size"]), mask_zero=True))
             model.add(SpatialDropout1D(params["spatial_dropout"]))
-            model.add(GRU(int(params["units"]), dropout=params["dropout"], recurrent_dropout=params["recurrent_dropout"], return_sequences=True, activation=params["activation_recurrent"]))
+            model.add(GRU(int(params["units"]), dropout=params["dropout"], recurrent_dropout=params["recurrent_dropout"], return_sequences=True, activation="elu"))
+            model.add(Dropout(params["dropout"]))
+            model.add(GRU(int(params["units"]), dropout=params["dropout"], recurrent_dropout=params["recurrent_dropout"], return_sequences=False, activation="elu"))
             model.add(Dropout(params["dropout"]))
-            model.add(GRU(int(params["units"]), dropout=params["dropout"], recurrent_dropout=params["recurrent_dropout"], return_sequences=False, activation=params["activation_recurrent"]))
-            model.add(Dropout(params["dropout"]))
-            model.add(Dense(dimensions, activation=params["activation_output"]))
+            model.add(Dense(2 * dimensions, activation="sigmoid"))
             optimizer_rms = RMSprop(lr=params["learning_rate"])
+            batch_size = int(params["batch_size"])
             model.compile(loss=utils.weighted_loss(class_weights), optimizer=optimizer_rms)
-            model_fit = model.fit(
-                train_data,
-                train_labels,
-                batch_size=int(params["batch_size"]),
+            print(model.summary())
+            model_fit = model.fit_generator(
+                utils.balanced_sample_generator(
+                    train_data,
+                    train_labels,
+                    batch_size,
+                    l_tool_tr_samples
+                ),
+                steps_per_epoch=len(train_data) // batch_size,
                 epochs=optimize_n_epochs,
-                shuffle="batch",
+                callbacks=[early_stopping],
+                validation_data=(test_data, test_labels),
                 verbose=2,
-                validation_split=validation_split,
-                callbacks=[early_stopping]
+                shuffle=True
             )
             return {'loss': model_fit.history["val_loss"][-1], 'status': STATUS_OK, 'model': model}
         # minimize the objective function using the set of parameters above
         trials = Trials()
         learned_params = fmin(create_model, params, trials=trials, algo=tpe.suggest, max_evals=int(config["max_evals"]))
         best_model = trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
-
         # set the best params with respective values
         for item in learned_params:
             item_val = learned_params[item]
-            if item == 'activation_output':
-                best_model_params[item] = l_output_activations[item_val]
-            elif item == 'activation_recurrent':
-                best_model_params[item] = l_recurrent_activations[item_val]
-            else:
-                best_model_params[item] = item_val
+            best_model_params[item] = item_val
         return best_model_params, best_model
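For context, a minimal runnable sketch of the Hyperopt pattern used in train_model above, with a toy objective in place of the Keras model; the parameter names mirror the search space in the diff, while the ranges and the objective itself are made up.

import numpy as np
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# Toy search space mirroring the one above (ranges are illustrative).
params = {
    "embedding_size": hp.quniform("embedding_size", 32, 512, 1),
    "units": hp.quniform("units", 32, 512, 1),
    "batch_size": hp.quniform("batch_size", 32, 256, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-4), np.log(1e-1)),
    "dropout": hp.uniform("dropout", 0.0, 0.5),
    "spatial_dropout": hp.uniform("spatial_dropout", 0.0, 0.5),
    "recurrent_dropout": hp.uniform("recurrent_dropout", 0.0, 0.5),
}

def objective(p):
    # Stand-in for training the GRU model and returning its validation loss.
    loss = (p["dropout"] - 0.2) ** 2 + abs(np.log(p["learning_rate"]) + 7) * 0.01
    return {"loss": loss, "status": STATUS_OK}

trials = Trials()
best = fmin(objective, params, algo=tpe.suggest, max_evals=20, trials=trials)
print(best)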
diff -r 76251d1ccdcc -r 5b3c08710e47 predict_tool_usage.py
--- a/predict_tool_usage.py Fri Oct 11 18:24:54 2019 -0400
+++ b/predict_tool_usage.py Sat May 09 05:38:23 2020 -0400
@@ -21,11 +21,9 @@
 
 class ToolPopularity:
 
-    @classmethod
     def __init__(self):
         """ Init method. """
 
-    @classmethod
     def extract_tool_usage(self, tool_usage_file, cutoff_date, dictionary):
         """
         Extract the tool usage over time for each tool
@@ -63,7 +61,6 @@
             tool_usage_dict[tool] = collections.OrderedDict(sorted(usage.items()))
         return tool_usage_dict
 
-    @classmethod
     def learn_tool_popularity(self, x_reshaped, y_reshaped):
         """
         Fit a curve for the tool usage over time to predict future tool usage
@@ -93,7 +90,6 @@
         except Exception:
             return epsilon
 
-    @classmethod
     def get_pupularity_prediction(self, tools_usage):
         """
         Get the popularity prediction for each tool
diff -r 76251d1ccdcc -r 5b3c08710e47 prepare_data.py
--- a/prepare_data.py Fri Oct 11 18:24:54 2019 -0400
+++ b/prepare_data.py Sat May 09 05:38:23 2020 -0400
@@ -10,19 +10,18 @@
 import random

 import predict_tool_usage
+import utils

 main_path = os.getcwd()


 class PrepareData:

-    @classmethod
     def __init__(self, max_seq_length, test_data_share):
         """ Init method. """
         self.max_tool_sequence_len = max_seq_length
         self.test_share = test_data_share

-    @classmethod
     def process_workflow_paths(self, workflow_paths):
         """
         Get all the tools and complete set of individual paths for each workflow
@@ -40,7 +39,6 @@
         tokens = np.reshape(tokens, [-1, ])
         return tokens, raw_paths

-    @classmethod
     def create_new_dict(self, new_data_dict):
         """
         Create new data dictionary
@@ -48,7 +46,6 @@
         reverse_dict = dict((v, k) for k, v in new_data_dict.items())
         return new_data_dict, reverse_dict

-    @classmethod
     def assemble_dictionary(self, new_data_dict, old_data_dictionary={}):
         """
         Create/update tools indices in the forward and backward dictionary
@@ -56,7 +53,6 @@
         new_data_dict, reverse_dict = self.create_new_dict(new_data_dict)
         return new_data_dict, reverse_dict

-    @classmethod
     def create_data_dictionary(self, words, old_data_dictionary={}):
         """
         Create two dictionaries having tools names and their indexes
@@ -68,7 +64,6 @@
         dictionary, reverse_dictionary = self.assemble_dictionary(dictionary, old_data_dictionary)
         return dictionary, reverse_dictionary

-    @classmethod
     def decompose_paths(self, paths, dictionary):
         """
         Decompose the paths to variable length sub-paths keeping the first tool fixed
@@ -86,7 +81,6 @@
         sub_paths_pos = list(set(sub_paths_pos))
         return sub_paths_pos

-    @classmethod
     def prepare_paths_labels_dictionary(self, dictionary, reverse_dictionary, paths, compatible_next_tools):
         """
         Create a dictionary of sequences with their labels for training and test paths
@@ -116,8 +110,7 @@
             paths_labels[item] = ",".join(list(set(paths_labels[item].split(","))))
         return paths_labels

-    @classmethod
-    def pad_paths(self, paths_dictionary, num_classes):
+    def pad_test_paths(self, paths_dictionary, num_classes):
         """
         Add padding to the tools sequences and create multi-hot encoded labels
         """
@@ -135,7 +128,35 @@
             train_counter += 1
         return data_mat, label_mat

-    @classmethod
+    def pad_paths(self, paths_dictionary, num_classes, standard_connections, reverse_dictionary):
+        """
+        Add padding to the tools sequences and create multi-hot encoded labels
+        """
+        size_data = len(paths_dictionary)
+        data_mat = np.zeros([size_data, self.max_tool_sequence_len])
+        label_mat = np.zeros([size_data, 2 * (num_classes + 1)])
+        pos_flag = 1.0
+        train_counter = 0
+        for train_seq, train_label in list(paths_dictionary.items()):
+            pub_connections = list()
+            positions = train_seq.split(",")
+            last_tool_id = positions[-1]
+            last_tool_name = reverse_dictionary[int(last_tool_id)]
+            start_pos = self.max_tool_sequence_len - len(positions)
+            for id_pos, pos in enumerate(positions):
+                data_mat[train_counter][start_pos + id_pos] = int(pos)
+            if last_tool_name in standard_connections:
+                pub_connections = standard_connections[last_tool_name]
+            for label_item in train_label.split(","):
+                label_pos = int(label_item)
+                label_row = label_mat[train_counter]
+                if reverse_dictionary[label_pos] in pub_connections:
+                    label_row[label_pos] = pos_flag
+                else:
+                    label_row[label_pos + num_classes + 1] = pos_flag
+            train_counter += 1
+        return data_mat, label_mat
+
     def split_test_train_data(self, mu
[...]ol not in last_tool_freq:
+                last_tool_freq[last_tool] = 0
+            last_tool_freq[last_tool] += 1
+        max_freq = max(last_tool_freq.values())
+        for t in last_tool_freq:
+            inv_freq[t] = int(np.round(max_freq / float(last_tool_freq[t]), 0))
+        return last_tool_freq, inv_freq

-    @classmethod
-    def get_data_labels_matrices(self, workflow_paths, tool_usage_path, cutoff_date, compatible_next_tools, old_data_dictionary={}):
+    def get_toolid_samples(self, train_data, l_tool_freq):
+        l_tool_tr_samples = dict()
+        for tool_id in l_tool_freq:
+            for index, tr_sample in enumerate(train_data):
+                last_tool_id = str(int(tr_sample[-1]))
+                if last_tool_id == tool_id:
+                    if last_tool_id not in l_tool_tr_samples:
+                        l_tool_tr_samples[last_tool_id] = list()
+                    l_tool_tr_samples[last_tool_id].append(index)
+        return l_tool_tr_samples
+
+    def get_data_labels_matrices(self, workflow_paths, tool_usage_path, cutoff_date, compatible_next_tools, standard_connections, old_data_dictionary={}):
         """
         Convert the training and test paths into corresponding numpy matrices
         """
         processed_data, raw_paths = self.process_workflow_paths(workflow_paths)
-        dictionary, reverse_dictionary = self.create_data_dictionary(processed_data, old_data_dictionary)
+        dictionary, rev_dict = self.create_data_dictionary(processed_data, old_data_dictionary)
         num_classes = len(dictionary)

         print("Raw paths: %d" % len(raw_paths))
@@ -227,25 +249,32 @@
         random.shuffle(all_unique_paths)

         print("Creating dictionaries...")
-        multilabels_paths = self.prepare_paths_labels_dictionary(dictionary, reverse_dictionary, all_unique_paths, compatible_next_tools)
+        multilabels_paths = self.prepare_paths_labels_dictionary(dictionary, rev_dict, all_unique_paths, compatible_next_tools)

         print("Complete data: %d" % len(multilabels_paths))
         train_paths_dict, test_paths_dict = self.split_test_train_data(multilabels_paths)

+        # get sample frequency
+        l_tool_freq, inv_last_tool_freq = self.get_train_last_tool_freq(train_paths_dict, rev_dict)
+
         print("Train data: %d" % len(train_paths_dict))
         print("Test data: %d" % len(test_paths_dict))

-        test_data, test_labels = self.pad_paths(test_paths_dict, num_classes)
-        train_data, train_labels = self.pad_paths(train_paths_dict, num_classes)
+        print("Padding train and test data...")
+        # pad training and test data with leading zeros
+        test_data, test_labels = self.pad_paths(test_paths_dict, num_classes, standard_connections, rev_dict)
+        train_data, train_labels = self.pad_paths(train_paths_dict, num_classes, standard_connections, rev_dict)
+
+        l_tool_tr_samples = self.get_toolid_samples(train_data, l_tool_freq)

         # Predict tools usage
         print("Predicting tools' usage...")
         usage_pred = predict_tool_usage.ToolPopularity()
         usage = usage_pred.extract_tool_usage(tool_usage_path, cutoff_date, dictionary)
         tool_usage_prediction = usage_pred.get_pupularity_prediction(usage)
-        tool_predicted_usage = self.get_predicted_usage(dictionary, tool_usage_prediction)
+        t_pred_usage = self.get_predicted_usage(dictionary, tool_usage_prediction)

         # get class weights using the predicted usage for each tool
-        class_weights = self.assign_class_weights(train_labels.shape[1], tool_predicted_usage)
+        class_weights = self.assign_class_weights(num_classes, t_pred_usage)

-        return train_data, train_labels, test_data, test_labels, dictionary, reverse_dictionary, class_weights, tool_predicted_usage
+        return train_data, train_labels, test_data, test_labels, dictionary, rev_dict, class_weights, t_pred_usage, l_tool_freq, l_tool_tr_samples
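A toy illustration of the doubled label layout built by the new pad_paths above: positions 0..num_classes mark next tools that are published ("standard") connections of the last tool, and positions num_classes+1..2*num_classes+1 mark the remaining labels. All values below are made up.

import numpy as np

num_classes = 5
label_row = np.zeros(2 * (num_classes + 1))

published_label_pos = [2]   # next tools also seen in published workflows
normal_label_pos = [3, 4]   # next tools seen only in normal workflows

for pos in published_label_pos:
    label_row[pos] = 1.0                      # first half: published block
for pos in normal_label_pos:
    label_row[pos + num_classes + 1] = 1.0    # second half: normal block

print(label_row)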
diff -r 76251d1ccdcc -r 5b3c08710e47 test-data/test_tool_usage
--- a/test-data/test_tool_usage Fri Oct 11 18:24:54 2019 -0400
+++ b/test-data/test_tool_usage Sat May 09 05:38:23 2020 -0400
@@ -1,1000 +1,500 @@
-upload1	2019-03-01	176
-toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72	2019-03-01	97
-toolshed.g2.bx.psu.edu/repos/bgruening/deeptools_bam_coverage/deeptools_bam_coverage/3.0.2.0	2019-03-01	67
-toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/featurecounts/1.6.3+galaxy2	2019-03-01	53
-toolshed.g2.bx.psu.edu/repos/iuc/sra_tools/fastq_dump/2.9.1.3	2019-03-01	51
-toolshed.g2.bx.psu.edu/repos/devteam/samtools_flagstat/samtools_flagstat/2.0.2	2019-03-01	38
[...]
+toolshed.g2.bx.psu.edu/repos/iuc/fastani/fastani/1.3	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/iuc/stacks_denovomap/stacks_denovomap/1.46.0	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/iuc/anndata_import/anndata_import/0.6.22.post1+galaxy3	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/bgruening/nanopolish_variants/nanopolish_variants/0.11.1	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/rnateam/graphprot_predict_profile/graphprot_predict_profile/1.1.7	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_tblastn_wrapper/0.3.3	2020-04-01	8
+toolshed.g2.bx.psu.edu/repos/devteam/fastq_quality_boxplot/cshl_fastq_quality_boxplot/1.0.1	2020-04-01	8
diff -r 76251d1ccdcc -r 5b3c08710e47 test-data/test_workflows
--- a/test-data/test_workflows Fri Oct 11 18:24:54 2019 -0400
+++ b/test-data/test_workflows Sat May 09 05:38:23 2020 -0400
@@ -1,529 +1,1000 @@
-wf_id	wf_updated	in_id	in_tool	in_tool_v	out_id	out_tool	out_tool_v
-3	2013-02-07 16:48:00	7	Remove beginning1	1.0.0	5	Grep1	1.0.1
-4	2013-02-07 16:48:00	16	wc_gnu	1.0.0	14	bedtools_intersectBed	
-4	2013-02-07 16:48:00	18	addValue	1.0.0	16	wc_gnu	1.0.0
-4	2013-02-07 16:48:00	13	cat1	1.0.0	18	addValue	1.0.0
-4	2013-02-07 16:48:00	21	cshl_uniq_tool	1.0.0	19	cshl_awk_tool	
[...]
+97	2013-02-20 10:11:21.312214	710			709	Extract genomic DNA 1	2.2.2	f	f	f
+97	2013-02-20 10:11:21.312214	710			709	Extract genomic DNA 1	2.2.2	t	t	f
+97	2013-02-20 10:11:21.312214	711	Extract genomic DNA 1	2.2.2	712	fasta2tab	1.1.0	f	f	f
+97	2013-02-20 10:11:21.312214	711	Extract genomic DNA 1	2.2.2	712	fasta2tab	1.1.0	f	f	f
diff -r 76251d1ccdcc -r 5b3c08710e47 utils.py
--- a/utils.py Fri Oct 11 18:24:54 2019 -0400
+++ b/utils.py Sat May 09 05:38:23 2020 -0400
@@ -2,6 +2,7 @@
 import numpy as np
 import json
 import h5py
+import random

 from keras import backend as K

@@ -15,23 +16,6 @@
     return file_content


-def write_file(file_path, content):
-    """
-    Write a file
-    """
-    remove_file(file_path)
-    with open(file_path, "w") as json_file:
-        json_file.write(json.dumps(content))
-
-
-def save_processed_workflows(file_path, unique_paths):
-    workflow_paths_unique = ""
-    for path in unique_paths:
-        workflow_paths_unique += path + "\n"
-    with open(file_path, "w") as workflows_file:
-        workflows_file.write(workflow_paths_unique)
-
-
 def format_tool_id(tool_link):
     """
     Extract tool id from tool link
@@ -63,17 +47,13 @@
     hf_file.close()


-def remove_file(file_path):
-    if os.path.exists(file_path):
-        os.remove(file_path)
-
-
 def weighted_loss(class_weights):
     """
     Create a weighted loss function. Penalise the misclassification
     of classes more with the higher usage
     """
     weight_values = list(class_weights.values())
+    weight_values.extend(weight_values)

     def weighted_binary_crossentropy(y_true, y_pred):
         # add another dimension to compute dot product
@@ -82,46 +62,101 @@
     return weighted_binary_crossentropy


-def compute_precision(model, x, y, reverse_data_dictionary, next_compatible_tools, usage_scores, actual_classes_pos, topk):
+def balanced_sample_generator(train_data, train_labels, batch_size, l_tool_tr_samples):
+    while True:
+        dimension = train_data.shape[1]
+        n_classes = train_labels.shape[1]
+        tool_ids = list(l_tool_tr_samples.keys())
+        generator_batch_data = np.zeros([batch_size, dimension])
+        generator_batch_labels = np.zeros([batch_size, n_classes])
+        for i in range(batch_size):
+            random_toolid_index = random.sample(range(0, len(tool_ids)), 1)[0]
+            random_toolid = tool_ids[random_toolid_index]
+            sample_indices = l_tool_tr_samples[str(random_toolid)]
+            random_index = random.sample(range(0, len(sample_indices)), 1)[0]
+            random_tr_index = sample_indices[random_index]
+            generator_batch_data[i] = train_data[random_tr_index]
+            generator_batch_labels[i] = train_labels[random_tr_index]
+        yield generator_batch_data, generator_batch_labels
+
+
+def compute_precision(model, x, y, reverse_data_dictionary, usage_scores, actual_classes_pos, topk, standard_conn, last_tool_id, lowest_tool_ids):
     """
     Compute absolute and compatible precision
     """
-    absolute_precision = 0.0
+    pred_t_name = ""
+    top_precision = 0.0
+    mean_usage = 0.0
+    usage_wt_score = list()
+    pub_precision = 0.0
+    lowest_pub_prec = 0.0
+    lowest_norm_prec = 0.0
+    pub_tools = list()
+    actual_next_tool_names = list()
     test_sample = np.reshape(x, (1, len(x)))

     # predict next tools for a test path
     prediction = model.predict(test_sample, verbose=0)

+    # divide the predicted vector into two halves - one for published and
+    # another for normal workflows
     nw_dimension = prediction.shape[1]
-
-    # remove the 0th position as there is no tool at this index
-    prediction = np.reshape(prediction, (nw_dimension,))
+    half_len = int(nw_dimension / 2)

-    prediction_pos = np.argsort(prediction, axis=-1)
-    topk_prediction_pos = prediction_pos[-topk:]
+    # predict tools
+    prediction = np.reshape(prediction, (nw_dimension,))
+    # get predictions of tools from published workflows
+    standard_pred = prediction[:half_len]
+    # get predictions of tools from normal workflows
+    normal_pred = prediction[half_len:]

-    # remove the wrong tool position from the predicted list of tool positions
-    topk_prediction_pos = [x for x in topk_prediction_pos if x > 0]
+    standard_prediction_pos = np.argsort(standard_pred, axis=-1)
+    standard_topk_prediction_pos = standard_prediction_pos[-topk]
+
+    normal_predicti
[...]ame in actual_next_tool_names:
+            if normal_topk_prediction_pos in usage_scores:
+                usage_wt_score.append(np.log(usage_scores[normal_topk_prediction_pos] + 1.0))
+            top_precision = 1.0
+            if last_tool_id in lowest_tool_ids:
+                lowest_norm_prec = 1.0
+    if len(usage_wt_score) > 0:
+        mean_usage = np.mean(usage_wt_score)
+    return mean_usage, top_precision, pub_precision, lowest_pub_prec, lowest_norm_prec


-def verify_model(model, x, y, reverse_data_dictionary, next_compatible_tools, usage_scores, topk_list=[1, 2, 3]):
+def get_lowest_tools(l_tool_freq, fraction=0.25):
+    l_tool_freq = dict(sorted(l_tool_freq.items(), key=lambda kv: kv[1], reverse=True))
+    tool_ids = list(l_tool_freq.keys())
+    lowest_ids = tool_ids[-int(len(tool_ids) * fraction):]
+    return lowest_ids
+
+
+def verify_model(model, x, y, reverse_data_dictionary, usage_scores, standard_conn, lowest_tool_ids, topk_list=[1, 2, 3]):
     """
     Verify the model on test data
     """
@@ -130,31 +165,49 @@
     size = y.shape[0]
     precision = np.zeros([len(y), len(topk_list)])
     usage_weights = np.zeros([len(y), len(topk_list)])
+    epo_pub_prec = np.zeros([len(y), len(topk_list)])
+    epo_lowest_tools_pub_prec = list()
+    epo_lowest_tools_norm_prec = list()
+
     # loop over all the test samples and find prediction precision
     for i in range(size):
+        lowest_pub_topk = list()
+        lowest_norm_topk = list()
         actual_classes_pos = np.where(y[i] > 0)[0]
+        test_sample = x[i, :]
+        last_tool_id = str(int(test_sample[-1]))
         for index, abs_topk in enumerate(topk_list):
-            abs_mean_usg_score, absolute_precision = compute_precision(model, x[i, :], y, reverse_data_dictionary, next_compatible_tools, usage_scores, actual_classes_pos, abs_topk)
+            usg_wt_score, absolute_precision, pub_prec, lowest_p_prec, lowest_n_prec = compute_precision(model, test_sample, y, reverse_data_dictionary, usage_scores, actual_classes_pos, abs_topk, standard_conn, last_tool_id, lowest_tool_ids)
             precision[i][index] = absolute_precision
-            usage_weights[i][index] = abs_mean_usg_score
+            usage_weights[i][index] = usg_wt_score
+            epo_pub_prec[i][index] = pub_prec
+            if last_tool_id in lowest_tool_ids:
+                lowest_pub_topk.append(lowest_p_prec)
+                lowest_norm_topk.append(lowest_n_prec)
+        if last_tool_id in lowest_tool_ids:
+            epo_lowest_tools_pub_prec.append(lowest_pub_topk)
+            epo_lowest_tools_norm_prec.append(lowest_norm_topk)
     mean_precision = np.mean(precision, axis=0)
     mean_usage = np.mean(usage_weights, axis=0)
-    return mean_precision, mean_usage
+    mean_pub_prec = np.mean(epo_pub_prec, axis=0)
+    mean_lowest_pub_prec = np.mean(epo_lowest_tools_pub_prec, axis=0)
+    mean_lowest_norm_prec = np.mean(epo_lowest_tools_norm_prec, axis=0)
+    return mean_usage, mean_precision, mean_pub_prec, mean_lowest_pub_prec, mean_lowest_norm_prec, len(epo_lowest_tools_pub_prec)


-def save_model(results, data_dictionary, compatible_next_tools, trained_model_path, class_weights):
+def save_model(results, data_dictionary, compatible_next_tools, trained_model_path, class_weights, standard_connections):
     # save files
     trained_model = results["model"]
     best_model_parameters = results["best_parameters"]
     model_config = trained_model.to_json()
     model_weights = trained_model.get_weights()
-
     model_values = {
         'data_dictionary': data_dictionary,
         'model_config': model_config,
         'best_parameters': best_model_parameters,
         'model_weights': model_weights,
         "compatible_tools": compatible_next_tools,
-        "class_weights": class_weights
+        "class_weights": class_weights,
+        "standard_connections": standard_connections
     }
     set_trained_model(trained_model_path, model_values)
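A toy sketch of how the new compute_precision above reads the doubled output vector: the first half scores labels from published workflows, the second half labels from normal workflows, and argsort picks the top-k position in each half. The prediction values here are random stand-ins for real model output.

import numpy as np

num_classes = 5
prediction = np.random.rand(2 * (num_classes + 1))  # made-up model output

half_len = prediction.shape[0] // 2
standard_pred = prediction[:half_len]   # published-workflow scores
normal_pred = prediction[half_len:]     # normal-workflow scores

topk = 3
# As in the diff, argsort ranks the scores and [-topk] reads off the
# position at rank k in each half.
standard_topk_prediction_pos = np.argsort(standard_pred, axis=-1)[-topk]
normal_topk_prediction_pos = np.argsort(normal_pred, axis=-1)[-topk]
print(standard_topk_prediction_pos, normal_topk_prediction_pos)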