Changeset 0:02e46a96e98a (2021-03-24)

Commit message:
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tools/simtext commit 63a5e13cf89cdd209d20749c582ec5b8dde4e208"

added:
README.md
abstracts_by_pmids.R
macros.xml
pmids_to_pubtator_matrix.R
pubmed_by_queries.R
pubmed_by_queries.xml
test-data/abstracts_by_pmids_output
test-data/pmids_to_pubtator_matrix_output
test-data/pmids_to_pubtator_matrix_output_byid
test-data/pmids_to_pubtator_matrix_output_number
test-data/pubmed_by_queries_output
test-data/pubmed_by_queries_output_abstracts
test-data/test_data
test-data/text_to_wordmatrix_output
test-data/text_to_wordmatrix_output_args
test/commands_tests
text_to_wordmatrix.R

diff -r 000000000000 -r 02e46a96e98a README.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,198 @@
+# SimText
+
+A text mining framework for interactive analysis and visualization of similarities among biomedical entities.
+
+## Brief overview of tools:
+
+ - pubmed_by_queries:
+
+   For each search query, PMIDs or abstracts from PubMed are saved.
+
+ - abstracts_by_pmids:
+
+   For all PMIDs in each row of a table, the corresponding abstracts are saved in additional columns.
+
+ - text_to_wordmatrix:
+
+   The most frequent words of text from each row are extracted and united in one large binary matrix.
+
+ - pmids_to_pubtator_matrix:
+
+   For PMIDs of each row, scientific words are extracted using PubTator annotations and subsequently united in one large binary matrix.
+
+ - simtext_app:
+
+   Shiny app with word clouds, dimension reduction plot, dendrogram of hierarchical clustering and table with words and their frequency among the search queries.
+
+## Set up user credentials on Galaxy
+
+To enable users to set their credentials (NCBI API Key) for this tool,
+make sure the file `config/user_preferences_extra_conf.yml` has the following section:
+
+```
+preferences:
+    ncbi_account:
+        description: NCBI account information
+        inputs:
+            - name: apikey
+              label: NCBI API Key (available from "API Key Management" at https://www.ncbi.nlm.nih.gov/account/settings/)
+              type: text
+              required: False
+
+```
+
+## Requirements command-line version
+
+ - R (version > 4.0.0)
+
+## Installation command-line version
+
+```
+$ mkdir -p <path>/simtext
+$ cd <path>/simtext
+$ git clone https://github.com/dlal-group/simtext
+```
+
+## pubmed_by_queries
+
+This tool uses a set of search queries to download a defined number of abstracts or PMIDs for each search query from PubMed. PubMed's search rules and syntax apply. Users can obtain an API key from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). If the tool is used as a command-line tool, the API key is passed as an argument. For usage in Galaxy, the API key is added to the Galaxy user-preferences (User/ Preferences/ Manage Information).
+
+Input:
+
+Tab-delimited table with a list of search queries (biomedical entities of interest) in one column. The column header should start with "ID_" (e.g., "ID_gene" if search queries are genes).
+
+Usage:
+```
+$ Rscript pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY] [--install_packages]
+```
+
+Optional arguments:
+```
+ -h, --help                  show help message
+ -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT  output file name [default "pubmed_by_queries_output"]
+ -n NUMBER, --number NUMBER  number of PMIDs or abstracts to save per ID [default "5"]
+ -a, --abstract              if abstracts instead of PMIDs should be retrieved use --abstract
+ -k KEY, --key KEY           if NCBI API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).
+ --install_packages          if you want to auto install missing required packages
+```
+
+Output:
+
+A table with additional columns containing PMIDs or abstracts from PubMed.
+
+## abstracts_by_pmids
+
+This tool retrieves abstracts for a matrix of PMIDs. The abstract text is saved in additional columns.
+
+Input:
+
+Tab-delimited table with rows representing biomedical entities and columns containing the corresponding PMIDs. The names of the PMID columns should start with “PMID_” (e.g., “PMID_1”, “PMID_2” etc.).
+
+Usage:
+```
+$ Rscript abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT]
+```
+
+Optional arguments:
+```
+ -h, --help                  show help message
+ -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT  output file name [default "abstracts_by_pmids_output"]
+ --instal
[...]
y default a set of English stopwords (e.g., 'the' or 'not') are removed. otherwise use -w
+ -s, --stemDoc               apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary
+ -p, --plurals               by default words in plural and singular are merged to the singular form. otherwise use -p
+ --install_packages          if you want to auto install missing required packages
+```
+
+Output:
+
+A binary matrix in which each column represents one of the extracted words.
+
+## pmids_to_pubtator_matrix
+
+The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted. The extracted terms are united in one large binary matrix, with 0 = term not present in abstracts of that row and 1 = term present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as they are or if they should be grouped by their geneIDs/ meshIDs (several terms are often grouped into one ID). Also, by default all terms are extracted; otherwise the user can specify a number of most frequent words to extract per row.
+
+Input:
+
+Output of 'abstracts_by_pmids' tool, or tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.
+
+Usage:
+```
+$ Rscript pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER] [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]]
+```
+
+Optional arguments:
+```
+ -h, --help                    show help message
+ -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+ -o OUTPUT, --output OUTPUT    output file name. [default "pmids_to_pubtator_matrix_output"]
+ -b, --byid                    if you want to find common gene IDs / mesh IDs instead of specific scientific terms.
+ -n NUMBER, --number NUMBER    number of most frequent terms/IDs to extract. by default all terms/IDs are extracted.
+ -c [...], --categories [...]  PubTator categories that should be considered [default "('Gene', 'Disease', 'Mutation','Chemical')"]
+ --install_packages            if you want to auto install missing required packages
+```
+
+Output:
+
+Binary matrix in which each column represents one of the extracted terms.
+
+## simtext_app
+
+The tool enables the exploration of data generated by the ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools in a local Shiny instance. The following features can be generated: 1) word clouds for each initial search query, 2) dimension reduction and hierarchical clustering of binary matrices, and 3) tables with words and their frequency in the search queries.
+
+Input:
+
+1) Input 1:
+Tab-delimited table with
+    - A column with initial search queries starting with "ID_" (e.g., "ID_gene" if initial search queries were genes).
+    - Column(s) with grouping factor(s) to compare pre-existing categories of the initial search queries with the grouping based on text. The column names should start with "GROUPING_". If the column name is "GROUPING_disorder", "disorder" will be shown as a grouping variable in the app.
+2) Input 2:
+The output of the ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools, or a binary matrix.
+
+Usage:
+```
+$ Rscript simtext_app.R [-h] [-i INPUT] [-m MATRIX] [-p PORT]
+```
+
+Optional arguments:
+```
+ -h, --help                  show help message
+ -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+ -m MATRIX, --matrix MATRIX  matrix file name. add path if file is not in working directory
+ -p PORT, --port PORT        specify port, otherwise randomly selected
+ --host                      specify host
+ --install_packages          if you want to auto install missing required packages
+```
+
+Output:
+
+SimText app
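The README's column-name conventions ("ID_", "GROUPING_", "PMID") can be illustrated with a small round trip. This is a minimal sketch only: the table values are illustrative, and the anchored patterns (`^ID_` and so on) are an assumption that follows the README's "should start with" wording, whereas the scripts in this changeset match the prefix anywhere in the name.

```r
# Build a small tab-delimited input table (values are illustrative only)
input <- data.frame(
  ID_gene = c("SCN1A", "SCN9A"),
  GROUPING_disease = c("epilepsy", "pain"),
  PMID_1 = c(33565071, 33334860),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".tsv")
write.table(input, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back the way the tools do and locate columns by name prefix
data <- read.delim(path, stringsAsFactors = FALSE)
id_col <- grep("^ID_", names(data))
grouping_cols <- grep("^GROUPING_", names(data))
pmid_cols <- grep("^PMID", names(data))
```

Anchoring the pattern avoids picking up a column that merely contains the prefix, such as a hypothetical "OLD_PMID_1".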

diff -r 000000000000 -r 02e46a96e98a abstracts_by_pmids.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/abstracts_by_pmids.R Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,142 @@
+#!/usr/bin/env Rscript
+#TOOL2 abstracts_by_pmids
+#
+#This tool retrieves the corresponding abstracts for all PMIDs in each row of a table and saves them in additional columns.
+#
+#Input: Tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with “PMID”, e.g. “PMID_1”, “PMID_2” etc.
+#
+#Output: Input table with additional columns containing abstracts corresponding to the PMIDs from PubMed.
+#The abstract columns are called "ABSTRACT_1", "ABSTRACT_2" etc.
+#
+# Usage: $ abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT]
+#
+# optional arguments:
+# -h, --help                 show help message
+# -i INPUT, --input INPUT    input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT output file name. [default "abstracts_by_pmids_output"]
+
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("reutils")) install.packages("reutils", repo = "http://cran.rstudio.com/");
+  if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/");
+  if (!require("textclean")) install.packages("textclean", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+library("reutils")
+suppressPackageStartupMessages(library("easyPubMed"))
+suppressPackageStartupMessages(library("textclean"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "abstracts_by_pmids_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+pmids_cols_index <- grep("PMID", names(data))
+
+fetch_abstracts <- function(pmids, row) {
+
+  efetch_result <- NULL
+  try_num <- 1
+  t_0 <- Sys.time()
+
+  while (is.null(efetch_result)) {
+
+    # Timing check: kill at 3 min
+    if (try_num > 1) {
+      Sys.sleep(time = 1 * try_num)
+      cat("Problem receiving PubMed data, or an error was returned. Please wait. Try number: ", try_num, "\n")
+    }
+
+    t_1 <- Sys.time()
+
+    if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) {
+      message("Killing the request! Something is not working. Please, try again later", "\n")
+      return(data)
+    }
+
+    efetch_result <- tryCatch({
+      suppressWarnings(efetch(uid = pmids, db = "pubmed", retmode = "xml"))
+    }, error = function(e) {
+      NULL
+    })
+
+    if (!is.null(as.list(efetch_result$errors)$error)) {
+      if (as.list(efetch_result$errors)$error == "HTTP error: Status 400; Bad Request") {
+        efetch_result <- NULL
+      }
+    }
+
+    try_num <- try_num + 1
+
+  } #while loop end
+
+  # articles to list
+  xml_data <- strsplit(efetch_result$content, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1]
+  xml_data <- sapply(xml_data, function(x) {
+    #trim extra stuff at the end of the record
+    if (!grepl("</PubmedArticle>$", x))
+      x <- sub("(^.*</PubmedArticle>).*$", "\\1", x)
+    # Rebuild XML structure and proceed
+    x <- paste("<PubmedArticle>", x)
+    gsub("[[:space:]]{2,}", " ", x)},
+    USE.NAMES = FALSE, simplify = TRUE)
+
+  abstract_text <- sapply(xml_data, function(x) {
+    custom_grep(x, tag = "AbstractText", format = "char")},
+    USE.NAMES = FALSE, simplify = TRUE)
+
+  abstracts <- sapply(abstract_text, function(x) {
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    } else if (length(x) < 1) {
+      x <- NA
+    } else {
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+
+  abstracts <- as.character(abstracts)
+
+  if (length(abstracts) > 0) {
+    data[row, sapply(seq(length(abstracts)), function(i) {
+      paste0("ABSTRACT_", i)
+      })] <- abstracts
+    cat(length(abstracts), " abstracts for PMIDs of row ", row, " are added in the table.", "\n")
+  }
+
+  return(data)
+}
+
+
+for (row in seq(nrow(data))) {
+  pmids <-  as.character(unique(data[row, pmids_cols_index]))
+  pmids <- pmids[!pmids == "NA"]
+
+  if (length(pmids) > 0) {
+    data <- tryCatch(fetch_abstracts(pmids, row), error = function(e) {
+      Sys.sleep(3)
+      data  # keep the table built so far; returning Sys.sleep()'s NULL would overwrite it
+    })
+  } else {
+    print(paste("No PMIDs in row", row))
+  }
+}
+write.table(data, args$output, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
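The retry loop in `fetch_abstracts` (repeat the request, wait longer after each failure, give up after a time budget) can be sketched in isolation. This is a minimal illustration, not the script's code: `retry_until`, its parameters and the stand-in `flaky_fetch` are hypothetical names, and no network call is made.

```r
# Hypothetical helper: call fetch() until it returns a non-NULL value
# or the time budget (in seconds) is exhausted.
retry_until <- function(fetch, max_secs = 180,
                        wait = function(try_num) Sys.sleep(try_num)) {
  t_0 <- Sys.time()
  try_num <- 1
  repeat {
    # Errors are converted to NULL so the loop can try again
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (as.numeric(difftime(Sys.time(), t_0, units = "secs")) > max_secs) {
      return(NULL)
    }
    wait(try_num)  # linear backoff: wait longer after each failed attempt
    try_num <- try_num + 1
  }
}

# Stand-in fetcher that fails twice before succeeding (assumption for the demo)
state <- new.env()
state$calls <- 0
flaky_fetch <- function() {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("simulated HTTP error")
  "ok"
}

result <- retry_until(flaky_fetch, max_secs = 5, wait = function(try_num) NULL)
```

Returning `NULL` on timeout leaves it to the caller to decide whether to keep the unmodified table, which is what the script does by returning `data`.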

diff -r 000000000000 -r 02e46a96e98a macros.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/macros.xml Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,11 @@
+<macros>
+    <token name="@VERSION@">0.0.2</token>
+
+    <xml name="citations">
+        <citations>
+            <citation type="doi">10.1101/2020.07.06.190629</citation>
+        </citations>
+    </xml>
+
+</macros>
+

diff -r 000000000000 -r 02e46a96e98a pmids_to_pubtator_matrix.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/pmids_to_pubtator_matrix.R Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,231 @@
+#!/usr/bin/env Rscript
+#tool: pmids_to_pubtator_matrix
+#
+#The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the
+#corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted.
+#The extracted terms are united in one large binary matrix, with 0= term not present in abstracts of that row and 1= term
+#present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as
+#they are or if they should be grouped by their geneIDs/ meshIDs (several terms can often be grouped into one ID).
+#Also, by default all terms are extracted, otherwise the user can specify a number of most frequent words to be extracted per row.
+#
+#Input: Output of abstracts_by_pmids or tab-delimited table with columns containing PMIDs.
+#The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc.
+#
+#Output: Binary matrix in which each column represents one of the extracted terms.
+#
+# usage: $ pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER]
+# [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]]
+#
+# optional arguments:
+# -h, --help                    show help message
+# -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+# -n NUMBER, --number NUMBER    Number of most frequent terms/IDs to extract. By default all terms/IDs are extracted.
+# -o OUTPUT, --output OUTPUT    output file name. [default "pmids_to_pubtator_matrix_output"]
+# -c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...], --categories {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]
+#                               Pubtator categories that should be considered. [default "('Gene', 'Disease', 'Mutation','Chemical')"]
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("stringr")) install.packages("stringr", repo = "http://cran.rstudio.com/");
+  if (!require("RCurl")) install.packages("RCurl", repo = "http://cran.rstudio.com/");
+  if (!require("stringi")) install.packages("stringi", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+library("stringr")
+library("RCurl")
+library("stringi")
+
+parser <- ArgumentParser()
+
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "pmids_to_pubtator_matrix_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("-c", "--categories", choices = c("Gene", "Disease", "Mutation", "Chemical", "Species"), nargs = "+",
+                    default = c("Gene", "Disease", "Mutation", "Chemical"),
+                    help = "Pubtator categories that should be considered. [default \"%(default)s\"]")
+parser$add_argument("-b", "--byid", action = "store_true", default = FALSE,
+                    help = "If you want to find common gene IDs / mesh IDs instead of scientific terms.")
+parser$add_argument("-n", "--number", default = NULL, type = "integer",
+                    help = "Number of most frequent terms/IDs to extract. By default all terms/IDs are extracted.")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+
+pmid_cols_index <- grep(c("PMID"), names(data))
+word_matrix <- data.frame()
+dict_table <- data.frame()
+pmids_count <- 0
+pubtator_max_ids <- 100
+
+
+merge_pubtator_table <- funct
[...]
(table) == 6) {
+    for (i in categories) {
+      tmp_index <- grep(TRUE, i == as.character(table[, 5]))
+      if (length(tmp_index) > 0) {
+        index_categories <- c(index_categories, tmp_index)
+      }
+    }
+    table <- as.data.frame(table, stringsAsFactors = FALSE)
+    table <- table[index_categories, c(4, 6)]
+    table <- table[!is.na(table[, 2]), ]
+    table <- table[!(table[, 2] == "NA"), ]
+    table <- table[!(table[, 1] == "NA"), ]
+  }else{
+    return(NULL)
+  }
+}
+
+extract_frequent_ids_or_terms <- function(table) {
+  if (is.null(table)) {
+    return(NULL)
+    break
+  }
+  if (args$byid) {
+    if (!is.null(args$number)) {
+      #retrieve top X mesh_ids
+      table_mesh <- as.data.frame(table(table[, 2]))
+      colnames(table_mesh)[1] <- "mesh_id"
+      table_mesh <- table_mesh[order(table_mesh$Freq, decreasing = TRUE), ]
+      table_mesh <- table_mesh[1:min(args$number, nrow(table_mesh)), ]
+      table_mesh$mesh_id <- as.character(table_mesh$mesh_id)
+      #subset table for top X mesh_ids
+      table <- table[which(as.character(table$V6) %in% as.character(table_mesh$mesh_id)), ]
+      table <- table[!duplicated(table[, 2]), ]
+    } else {
+      table <- table[!duplicated(table[, 2]), ]
+    }
+  } else {
+    if (!is.null(args$number)) {
+      table[, 1] <- tolower(as.character(table[, 1]))
+      table <- as.data.frame(table(table[, 1]))
+      colnames(table)[1] <- "term"
+      table <- table[order(table$Freq, decreasing = TRUE), ]
+      table <- table[1:min(args$number, nrow(table)), ]
+      table$term <- as.character(table$term)
+    } else {
+      table[, 1] <- tolower(as.character(table[, 1]))
+      table <- table[!duplicated(table[, 1]), ]
+    }
+  }
+  return(table)
+}
+
+
+#for all PMIDs of a row get PubTator terms and add them to the matrix
+for (i in seq(nrow(data))) {
+  pmids <- as.character(data[i, pmid_cols_index])
+  pmids <- pmids[!pmids == "NA"]
+  if (pmids_count > 10000) {
+    cat("Break (10s) to avoid killing of requests. Please wait.", "\n")
+    Sys.sleep(10)
+    pmids_count <- 0
+  }
+  pmids_count <- pmids_count + length(pmids)
+  #get PubTator terms and process them with functions
+  if (length(pmids) > 0) {
+    table <- get_pubtator_terms(pmids)
+    table <- extract_category_terms(table, args$categories)
+    table <- extract_frequent_ids_or_terms(table)
+    if (!is.null(table)) {
+      colnames(table) <- c("term", "mesh_id")
+      # add data in binary matrix
+      if (args$byid) {
+        mesh_ids <- as.character(table$mesh_id)
+        if (length(mesh_ids) > 0) {
+          word_matrix[i, mesh_ids] <- 1
+          cat(length(mesh_ids), " IDs for PMIDs of row", i, " were added", "\n")
+          # add data in dictionary
+          dict_table <- rbind(dict_table, table)
+          dict_table <- dict_table[!duplicated(as.character(dict_table[, 2])), ]
+        }
+      } else {
+        terms <- as.character(table[, 1])
+        if (length(terms) > 0) {
+          word_matrix[i, terms] <- 1
+          cat(length(terms), " terms for PMIDs of row", i, " were added.", "\n")
+        }
+      }
+    }
+  } else {
+    cat("No terms for PMIDs of row", i, " were found.", "\n")
+  }
+}
+
+if (args$byid) {
+  #change column names of matrix: exchange mesh ids/ids with term
+  index_names <- match(names(word_matrix), as.character(dict_table[[2]]))
+  names(word_matrix) <- dict_table[index_names, 1]
+}
+
+colnames(word_matrix) <- gsub("[^[:print:]]", "", colnames(word_matrix))
+colnames(word_matrix) <- gsub('\"', "", colnames(word_matrix), fixed = TRUE)
+
+#merge duplicated columns
+word_matrix <- as.data.frame(do.call(cbind, by(t(word_matrix), INDICES = names(word_matrix), FUN = colSums)))
+
+#save binary matrix
+word_matrix <- as.matrix(word_matrix)
+word_matrix[is.na(word_matrix)] <- 0
+cat("Matrix with ", nrow(word_matrix), " rows and ", ncol(word_matrix), " columns generated.", "\n")
+write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
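The binary matrix this script writes (one row per input row, one column per term, 1 where the term occurs) can be sketched on toy data. This is a minimal illustration under stated assumptions: `terms_per_row` is a hypothetical stand-in for the per-row PubTator results, and the matrix is preallocated here rather than grown column by column as in the script.

```r
# Hypothetical per-row term lists (stand-ins for PubTator output)
terms_per_row <- list(
  c("scn1a", "epilepsy"),
  c("scn9a", "pain", "epilepsy"),
  character(0)  # a row for which no terms were found
)

# One column per distinct term, initialised to 0 (term not present)
all_terms <- sort(unique(unlist(terms_per_row)))
word_matrix <- matrix(0L, nrow = length(terms_per_row), ncol = length(all_terms),
                      dimnames = list(NULL, all_terms))

# Mark the terms found for each row with 1 (term present)
for (i in seq_along(terms_per_row)) {
  terms <- terms_per_row[[i]]
  if (length(terms) > 0) {
    word_matrix[i, terms] <- 1L
  }
}
```

Rows without terms stay all zero, which matches the script's final step of replacing `NA` with 0 before writing the matrix.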

diff -r 000000000000 -r 02e46a96e98a pubmed_by_queries.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/pubmed_by_queries.R Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,258 @@
+#!/usr/bin/env Rscript
+#tool: pubmed_by_queries
+#
+#This tool uses a set of search queries to download a defined number of abstracts or
+#PMIDs for each search query from PubMed. PubMed's search rules and syntax apply.
+#
+#Input: Tab-delimited table with search queries in a column starting with "ID_",
+#e.g. "ID_gene" if search queries are genes.
+#
+#Output: Input table with additional columns
+#with PMIDs or abstracts (--abstract) from PubMed.
+#
+#Usage:
+#$pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY]
+#
+#optional arguments:
+# -h, --help                  show this help message and exit
+# -i INPUT, --input INPUT     input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT  output file name. [default "pubmed_by_queries_output"]
+# -n NUMBER, --number NUMBER  number of PMIDs or abstracts to save per ID [default "5"]
+# -a, --abstract              if abstracts instead of PMIDs should be retrieved use --abstract
+# -k KEY, --key KEY           if ncbi API key is available, add it to speed up the download of PubMed data.
+#                             For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/") ;
+  if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/") ;
+}
+
+suppressPackageStartupMessages(library("argparse"))
+suppressPackageStartupMessages(library("easyPubMed"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "Input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "pubmed_by_queries_output",
+                    help = "Output file name. [default \"%(default)s\"]")
+parser$add_argument("-n", "--number", type = "integer", default = 5,
+                    help = "Number of PMIDs (or abstracts) to save per ID. [default \"%(default)s\"]")
+parser$add_argument("-a", "--abstract", action = "store_true", default = FALSE,
+                    help = "If abstracts instead of PMIDs should be retrieved use --abstract ")
+parser$add_argument("-k", "--key", type = "character",
+                    help = "If ncbi API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+args <- parser$parse_args()
+
+if (!is.null(args$key)) {
+  if (file.exists(args$key)) {
+    credentials <- read.table(args$key, quote = "\"", comment.char = "")
+    args$key <- credentials[1, 1]
+  }
+}
+
+max_web_tries <- 100
+
+data <- read.delim(args$input, stringsAsFactors = FALSE)
+
+id_col_index <- grep("ID_", names(data))
+
+
+fetch_pmids <- function(data, number, pubmed_search, query, row, max_web_tries) {
+  my_pubmed_url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?",
+                         "db=pubmed&retmax=", number,
+                         "&term=", pubmed_search$OriginalQuery,
+                         "&usehistory=n", sep = "")
+  # get ids
+  idxml <- c()
+  for (i in seq(max_web_tries)) {
+    tryCatch({
+      id_connect <- suppressWarnings(url(my_pubmed_url, open = "rb", encoding = "UTF8"))
+      idxml <- suppressWarnings(readLines(id_connect, warn = FALSE, encoding = "UTF8"))
+      suppressWarnings(close(id_connect))
+      break
+    }, error = function(e) {
+      print(paste("Error getting URL, sleeping", 2 * i, "seconds."))
+      print(e)
+      Sys.sleep(time = 2 * i)
+    })
+  }
+  pmids <- c()
+  for (i in seq(length(idxml))) {
+    if (grepl("^<Id>", idxml[i])) {
+      pmid <- custom_gre
[...]
e("Killing the request! Something is not working. Please, try again later",
+            "\n")
+    return(data)
+  } else {
+    return(out_data)
+  }
+}
+
+
+process_xml_abstracts <- function(out_data) {
+  xml_data <- paste(out_data, collapse = "")
+  # articles to list
+  xml_data <- strsplit(xml_data, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1]
+  xml_data <- sapply(xml_data, function(x) {
+    #trim extra stuff at the end of the record
+    if (!grepl("</PubmedArticle>$", x))
+      x <- sub("(^.*</PubmedArticle>).*$", "\\1", x)
+    # Rebuild XML structure and proceed
+    x <- paste("<PubmedArticle>", x)
+    gsub("[[:space:]]{2,}", " ", x)
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  #titles
+  titles <- sapply(xml_data, function(x) {
+    x <- custom_grep(x, tag = "ArticleTitle", format = "char")
+    x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+    x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+    x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+    x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+    } else if (length(x) < 1) {
+      x <- NA
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  # abstracts
+  abstract_text <- sapply(xml_data, function(x) {
+    custom_grep(x, tag = "AbstractText", format = "char")
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  abstracts <- sapply(abstract_text, function(x) {
+    if (length(x) > 1) {
+      x <- paste(x, collapse = " ", sep = " ")
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    } else if (length(x) < 1) {
+      x <- NA
+    } else {
+      x <- gsub("</{0,1}i>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}b>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}sub>", "", x, ignore.case = T)
+      x <- gsub("</{0,1}exp>", "", x, ignore.case = T)
+    }
+    x
+  },
+  USE.NAMES = FALSE, simplify = TRUE)
+  #add title to abstracts
+  if (length(titles) == length(abstracts)) {
+    abstracts <- paste(titles, abstracts)
+  }
+  return(abstracts)
+}
+
+
+pubmed_data_in_table <- function(data, row, query, number, key, abstract) {
+  if (is.null(query)) {
+    print(data)
+  }
+  pubmed_search <- get_pubmed_ids(query, api_key = key)
+  if (as.numeric(pubmed_search$Count) == 0) {
+    cat("No PubMed result for the following query: ", query, "\n")
+    return(data)
+  } else if (abstract == FALSE) { # fetch PMIDs
+    data <- fetch_pmids(data, number, pubmed_search, query, row, max_web_tries)
+    return(data)
+  } else if (abstract == TRUE) { # fetch abstracts and title text
+    out_data <- fetch_abstracts(data, number, query, pubmed_search)
+    abstracts <- process_xml_abstracts(out_data)
+    #add abstracts to data frame
+    if (length(abstracts) > 0) {
+      data[row, sapply(seq(length(abstracts)),
+                       function(i) {
+                         paste0("ABSTRACT_", i)
+                       })] <- abstracts
+      cat(length(abstracts), " abstracts for ", query, " are added in the table.",
+          "\n")
+    }
+    return(data)
+  }
+}
+
+for (i in seq(nrow(data))) {
+  data <- tryCatch(pubmed_data_in_table(data = data,
+                                        row = i,
+                                        query = data[i, id_col_index],
+                                        number = args$number, key = args$key,
+                                        abstract = args$abstract), error = function(e) {
+                                          print("main error")
+                                          print(e)
+                                          Sys.sleep(5)
+                                          data  # keep the table built so far instead of NULL
+                                        })
+}
+
+write.table(data, args$output, append = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
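The tag-stripping steps that `process_xml_abstracts` repeats for titles and abstracts can be factored into one helper. The following is a sketch only: `strip_inline_tags` and `clean_abstract` are hypothetical names that do not appear in the changeset, and the sample text is made up.

```r
# Hypothetical helper: remove inline markup such as <i>, </i>, <sub>, </sub>
strip_inline_tags <- function(x, tags = c("i", "b", "sub", "exp")) {
  for (tag in tags) {
    x <- gsub(paste0("</{0,1}", tag, ">"), "", x, ignore.case = TRUE)
  }
  x
}

# Hypothetical helper: join the AbstractText parts of one article, or NA if none
clean_abstract <- function(parts) {
  if (length(parts) < 1) {
    return(NA_character_)
  }
  strip_inline_tags(paste(parts, collapse = " "))
}

example <- clean_abstract(c("Variants in <i>SCN1A</i> cause",
                            "Na<sub>v</sub>1.1 dysfunction."))
```

Keeping the tag list in one place means a new tag only has to be added once rather than in each of the three branches.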

diff -r 000000000000 -r 02e46a96e98a pubmed_by_queries.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/pubmed_by_queries.xml Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,85 @@
+<tool id="pubmed_by_queries" name="PubMed query" version="@VERSION@" license="MIT">
+    <description>download a defined number of abstracts or PMIDs from PubMed</description>
+    <macros>
+        <import>macros.xml</import>
+    </macros>
+    <requirements>
+        <requirement type="package" version="2.0.3">r-argparse</requirement>
+        <requirement type="package" version="2.13">r-easypubmed</requirement>
+    </requirements>
+
+    <command detect_errors="exit_code"><![CDATA[
+    Rscript
+      '${__tool_directory__}/pubmed_by_queries.R'
+      --input '$input'
+      --output '$output'
+      --number '$number'
+      $abstract
+      #if $__user__.extra_preferences.get('ncbi_account|apikey', ""):
+        -k '$credentials'
+      #end if
+      ]]>
+    </command>
+
+    <configfiles>
+        <configfile name="credentials"><![CDATA[
+        $__user__.extra_preferences.get('ncbi_account|apikey', "")
+        ]]></configfile>
+    </configfiles>
+
+    <inputs>
+        <param argument="--input" type="data" format="tabular" label="Input file with query terms" />
+        <param argument="--abstract" type="boolean" truevalue="--abstract" falsevalue="" checked="false" label="Save abstracts instead of PMIDs"/>
+        <param argument="--number" type="integer" value="5" min="1" label="Number of PMIDs (or abstracts) to save per ID" />
+    </inputs>
+
+    <outputs>
+        <data format="tabular" name="output" />
+    </outputs>
+
+    <tests>
+        <test>
+            <param name="input" value="test_data" ftype="tabular"/>
+            <param name="number" value="5"/>
+            <param name="abstract" value=""/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_columns n="7"/>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+    </tests>
+    <help><![CDATA[
+
+**What it does**
+
+This tool uses a set of search queries to download a defined number of abstracts or PMIDs for each search query from PubMed.
+PubMed's search rules and syntax apply.
+
+**Info**
+
+To speed up the download of PubMed data, users can obtain an API key from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/) and add it to the Galaxy user preferences (User/ Preferences/ Manage Information).
+
+-----
+
+**Example**
+
+- Input table:
+    Table with a list of search queries (biomedical entities of interest) in one column. The column header should start with
+    "ID\_" (e.g., "ID_gene" if search queries are genes).
+
+    | ID_gene
+    | SCN1A
+    | SCN9A
+
+- Output table:
+    Table with additional columns containing PMIDs or abstracts from PubMed (here: PMIDs)
+
+    | ID_gene   | PMID_1        |  PMID_2       |  PMID_3
+    | SCN1A       | 33531663  | 33528079  | 33565071
+    | SCN9A       | 33334860  | 33277917  | 33377604
+
+        ]]></help>
+    <expand macro="citations"/>
+</tool>
\ No newline at end of file
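
The test in the wrapper above asserts that the tabular output has 7 columns and 7 lines (a header plus six queries; two input columns plus five PMID columns). A minimal shell sketch of the same check for a command-line run follows. The helpers `count_lines` and `count_columns` are hypothetical names and not part of SimText, and the sample file is rebuilt here from the rows of `test-data/pubmed_by_queries_output`, assuming tab-separated fields.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SimText): count the lines of a file and
# the tab-separated columns of its header line.
count_lines() { awk 'END { print NR }' "$1"; }
count_columns() { awk -F '\t' 'NR == 1 { print NF }' "$1"; }

# Illustrative copy of test-data/pubmed_by_queries_output (tab-separated assumed).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
{
    printf 'ID_gene\tGROUPING_disease\tPMID_1\tPMID_2\tPMID_3\tPMID_4\tPMID_5\n'
    printf 'SCN1A\tepilepsy\t33565071\t33531663\t33528079\t33519675\t33478845\n'
    printf 'SCN9A\tepilepsy\t33389681\t33370834\t33278787\t33237934\t33232657\n'
    printf 'GRIN2A\tepilepsy\t33531473\t33499151\t33457012\t33420383\t33370585\n'
    printf 'ANKRD11\tautism\t33527450\t33476899\t33354850\t33262785\t33179249\n'
    printf 'SHANK2\tautism\t33547379\t33515293\t33491217\t33483523\t33383702\n'
    printf 'POGZ\tautism\t33377604\t33334860\t33277917\t33203851\t33155545\n'
} > "$sample"

# Report the shape that the Galaxy test asserts with has_n_lines / has_n_columns.
echo "lines: $(count_lines "$sample"), columns: $(count_columns "$sample")"
```

Because PubMed results change over time, a shape check like this is likely more stable than comparing the PMIDs themselves, which is presumably why the wrapper test asserts only counts.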

diff -r 000000000000 -r 02e46a96e98a test-data/abstracts_by_pmids_output
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/abstracts_by_pmids_output Wed Mar 24 08:34:22 2021 +0000

b'@@ -0,0 +1,7 @@\n+ID_gene\tGROUPING_disease\tPMID_1\tPMID_2\tPMID_3\tPMID_4\tPMID_5\tABSTRACT_1\tABSTRACT_2\tABSTRACT_3\tABSTRACT_4\tABSTRACT_5\n+SCN1A\tepilepsy\t33565071\t33531663\t33528079\t33519675\t33478845\tTo analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G>T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G>A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families.\tThe voltage-gated sodium channel \xce\xb1-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. 
Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies.\tAdvancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n\xc2\xa0=\xc2\xa011), SCN1A (n\xc2\xa0=\xc2\xa06) and TSC1 (n\xc2\xa0=\xc2\xa05) genes. Other common genes were KCNQ2 (n\xc2\xa0=\xc2\xa03), AMT (n\xc2\xa0=\xc2\xa03), CACNA1H (n\xc2\xa0=\xc2\xa03), CLCN2 (n\xc2\xa0=\xc2\xa03), MECP2 (n\xc2\xa0=\xc2\xa02), ASAH1 (n\xc2\xa0=\xc2\xa02) and SLC2A1 (n\xc2\xa0=\xc2\xa02). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved.\tBackground:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. Previous meta-analyses on this topic mainly focused on the SCN1A rs3812718 polymorphism. 
However, meta-analyses focused on SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, or SCN2A rs2304016 polymorphisms are scarce or non-existent. Objective: We aimed to conduct a meta-analysis to determine the effects of SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polym'..b'ing genetic intervention for severe developmental diseases.\tWhite-Sutton syndrome is a rare developmental disorder characterized by global developmental delay, intellectual disabilities (ID), and neurobehavioral abnormalities secondary to pathogenic pogo transposable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques.\tSeveral genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. 
Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD.\tMany genes have been linked to autism. However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. 
When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.\n'

diff -r 000000000000 -r 02e46a96e98a test-data/pmids_to_pubtator_matrix_output
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+5-httlpr adnp akt alx1 amyloid precursor protein ankk1 ankrd11 ankyrin repeat domain-containing protein 11 apoe arhgap21 arid1a arid1b asah1 atrx bdnf brain-derived neurotrophic factor c-fos cacna1a cacna1h camkii ccne1 cdh1 chd2 chd8 clcn2 cldn18 comt creb ctnnb1 ddx3x depdc5 drd2 drd4 dyrk1a egfr fbxw7 fgfr1 flt4 foxr1 glun2a gonadotropin-releasing hormone grin2 grin2a growth hormone il6r itch kcnq2 leb mapt mdm2 mecp2 med12 mefv mek nav1.5 nav1.7 nestin nf1 nlrp5 notch1 ns5 ntrk2 p21 p75 pard3 pik3ca pkhd1 pogz pomc ppp2r1a pten reln runx1 scn10a scn1a scn2a scn8a scn9a shank2 slc2a1 slc6a4 sox10 sox2 syngap1 tbx1 tcf7l2 tjp1 tnni3k tp53 trkb trpa1 trpv1 tsc1 tsc2 whsc1l1 wnt10a ythdf2
+0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0
+1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0
+0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1
+0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

diff -r 000000000000 -r 02e46a96e98a test-data/pmids_to_pubtator_matrix_output_byid
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_byid Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+ADNP AKT ALX1 ANKRD11 APOE ARID1A ARID1B ATRX BDNF CACNA1A CCNE1 CDH1 CHD2 CHD8 CLDN18 COMT CTNNB1 DDX3X DEPDC5 DRD2 DYRK1A Depdc5 EGFR FGFR1 FLT4 FOXR1 GRIN2 GRIN2A GluN2A IL6R LEB MAPT MDM2 MED12 MEFV NF1 NLRP5 NOTCH1 Nav1.5 Nav1.7 Nestin PIK3CA PKHD1 POGZ PPP2R1A PTEN Pogz RELN SCN1A SCN2A SCN9A SHANK2 SLC6A4 SYNGAP1 Shank2 Sox10 Sox2 TP53 TSC2 TrkB WHSC1L1 WNT10A YTHDF2 amyloid precursor protein c-Fos gonadotropin-releasing hormone growth hormone itch p21 p75
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1
+0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0
+1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0
+0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

diff -r 000000000000 -r 02e46a96e98a test-data/pmids_to_pubtator_matrix_output_number
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_number Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+amyloid precursor protein ankrd11 anxiety asah1 asd autism bdnf cldn18 dravet syndrome embryonic kidney epilepsy gastric srcc itch kbg syndrome learning disabilities memory impairment nav1.7 ns5 p21 pain pogz scn1a scn2a scn9a shank2 short stature tumors white-sutton syndrome
+0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0
+0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0
+1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
+0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1
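
The matrix files above have a header of terms followed by one 0/1 row per input row, with no row labels. To inspect such a file, the following sketch lists the terms flagged with 1 in each row. The function `present_terms` is a hypothetical helper, the three-term matrix is made-up illustration data rather than a copy of the test output, and tab-separated fields are assumed (terms such as "dravet syndrome" contain spaces).

```shell
#!/bin/sh
# Hypothetical helper (not part of SimText): for each data row of a
# tab-separated binary matrix, print the header terms whose value is 1.
present_terms() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) term[i] = $i; next }
        {
            line = ""
            for (i = 1; i <= NF; i++)
                if ($i == 1) line = line (line == "" ? "" : ", ") term[i]
            print "row " (NR - 1) ": " line
        }
    ' "$1"
}

# Made-up illustration data: three terms, two rows.
matrix=$(mktemp)
trap 'rm -f "$matrix"' EXIT
{
    printf 'scn1a\tdravet syndrome\tpain\n'
    printf '1\t1\t0\n'
    printf '0\t0\t1\n'
} > "$matrix"

present_terms "$matrix"
```

Rows are reported by position only; mapping them back to the search queries requires the input table, on the assumption that row order is preserved.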

diff -r 000000000000 -r 02e46a96e98a test-data/pubmed_by_queries_output
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pubmed_by_queries_output Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+ID_gene GROUPING_disease PMID_1 PMID_2 PMID_3 PMID_4 PMID_5
+SCN1A epilepsy 33565071 33531663 33528079 33519675 33478845
+SCN9A epilepsy 33389681 33370834 33278787 33237934 33232657
+GRIN2A epilepsy 33531473 33499151 33457012 33420383 33370585
+ANKRD11 autism 33527450 33476899 33354850 33262785 33179249
+SHANK2 autism 33547379 33515293 33491217 33483523 33383702
+POGZ autism 33377604 33334860 33277917 33203851 33155545

diff -r 000000000000 -r 02e46a96e98a test-data/pubmed_by_queries_output_abstracts
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pubmed_by_queries_output_abstracts Wed Mar 24 08:34:22 2021 +0000

b'@@ -0,0 +1,7 @@\n+ID_gene\tGROUPING_disease\tABSTRACT_1\tABSTRACT_2\tABSTRACT_3\tABSTRACT_4\tABSTRACT_5\n+SCN1A\tepilepsy\t[Analysis of SCN1A gene variants among patients with Dravet syndrome]. To analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G>T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G>A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families.\tSodium channelopathies in neurodevelopmental disorders. The voltage-gated sodium channel \xce\xb1-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. 
Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies.\tCustomized Targeted Massively Parallel Sequencing Enables More Precisely Diagnosis of Patients with Epilepsy. Advancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n\xc2\xa0=\xc2\xa011), SCN1A (n\xc2\xa0=\xc2\xa06) and TSC1 (n\xc2\xa0=\xc2\xa05) genes. Other common genes were KCNQ2 (n\xc2\xa0=\xc2\xa03), AMT (n\xc2\xa0=\xc2\xa03), CACNA1H (n\xc2\xa0=\xc2\xa03), CLCN2 (n\xc2\xa0=\xc2\xa03), MECP2 (n\xc2\xa0=\xc2\xa02), ASAH1 (n\xc2\xa0=\xc2\xa02) and SLC2A1 (n\xc2\xa0=\xc2\xa02). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved.\tAssociation Between SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 Polymorphisms and Responsiveness to Antiepileptic Drugs: A Meta-Analysis. 
Background:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. Previous meta-analyses on this topic mainly focused on the SCN1A r'..b'posable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques.\tPogz deficiency leads to transcription dysregulation and impaired cerebellar activity underlying autism-like behavior in mice. Several genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. 
Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD.\tAltered hippocampal-prefrontal communication during anxiety-related avoidance in mice deficient for the autism-associated gene Pogz. Many genes have been linked to autism. However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. 
When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.\n'

diff -r 000000000000 -r 02e46a96e98a test-data/test_data
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/test_data Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+ID_gene GROUPING_disease
+SCN1A epilepsy
+SCN9A epilepsy
+GRIN2A epilepsy
+ANKRD11 autism
+SHANK2 autism
+POGZ autism
\ No newline at end of file

diff -r 000000000000 -r 02e46a96e98a test-data/text_to_wordmatrix_output
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/text_to_wordmatrix_output Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+scn1a patient variant scn2a gene polymorphism genetic pathogenic aeds epilepsy rs2298771 developmental resistance result rs10188577 rs17183814 rs2304016 significant study asian association channel clinical diagnosis dravet first group identified metaanalysis model mutation neurodevelopmental n 3 sequencing significance sodium syndrome analysis antiepileptic bonferroni caucasian child correction correlation decline disorder found global homozygous however gbm neuron cell pain compound sensory associated inhibitor mouse nav17 aetiology cohort current human lebanese nociceptive pathway sample among itch new analgesic overall respectively scn9a survival 293 activity age candidate chronic da0218 decreased differentiate dorsal drg drgsgcs embryonic cldn182 concussion gastric srcc advanced expression rib behavior ketamine man elite impairment induced memory rugby signaling spatial within effect infusion level grin2a highdose hippocampus lowdose nmda reduced system treatment 180 antidepressant anxietylike bdnf brain change depression depressiveanxietylike due ankrd11 kbg two disability feature intellectual novel short stature heterozygous including outcome report adult case pleomorphic review urmss uterine alteration condition delay facial individual number primary spectrum three tumor ucss variable 16q243 abnormalities assessment autism shank2 protein social asd neuronal wnv circuit differentiation ns5 can pbm replication factor grandin interaction may pdzcontaining processing viral wiring behavioral biallelic cellular communication complex deficit domain early genome lead medical mpoa nile phenotype pogz deficient disabilities learning mechanism avoidance cerebellar input neurocognitive severe underlying vhpc whitesutton widespread aav9phpeb achieved anxietyrelated based central cognitive confidence crucial deficiency delivered development difficulties disrupted
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

diff -r 000000000000 -r 02e46a96e98a test-data/text_to_wordmatrix_output_args
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/text_to_wordmatrix_output_args Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,7 @@
+the and scna patients were with variants n for pathogenic polymorphisms genetic genes epilepsy aeds gene results this resistance developmental significant that study was between polymorphism clinical dravet sequencing syndrome are neurodevelopmental sodium diagnosis identified significance asians association metaanalysis first group analysis children found their two variant channel disorders from neurons gbm our pain mutations cells sensory nav associated inhibitor nociceptive human aetiology lebanese new pathway itch model mouse among cohort samples currents into analgesic compound compounds respectively overall survival cldn gastric srcc rib advanced expression behavior man concussion ketamine impairment induced memory signaling spatial elite rugby within levels mice infusion cell highdose hippocampus grina reduced treatment nmda lowdose anxietylike brain ankrd kbg urms novel disability intellectual short stature including features heterozygous adult pleomorphic these urmss uterine review had outcomes alterations mutation number primary three tumors ucss shank social asd autism proteins wnv neuronal protein circuit differentiation can pbm replication spectrum both during interactions may pdzcontaining viral candidate factors processing wiring grandin cellular domain nile other show targets virus west pogz deficient disabilities learning communication whitesutton have mechanisms severe widespread disorder neurocognitive phenotype cerebellar changes input underlying avoidance vhpc aavphpeb achieved
+1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+1 1 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

diff -r 000000000000 -r 02e46a96e98a test/commands_tests
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test/commands_tests Wed Mar 24 08:34:22 2021 +0000

@@ -0,0 +1,37 @@
+#commands to test the tools with "test_data"
+
+ $ cd <path>/simtext
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --output "test-data/pubmed_by_queries_output" --install_packages
+ #output: test-data/pubmed_by_queries_output
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --abstract --output "test-data/pubmed_by_queries_output_abstracts" --install_packages
+ #output: test-data/pubmed_by_queries_output_abstracts
+
+ $ Rscript abstracts_by_pmids.R --input "test-data/pubmed_by_queries_output" --output "test-data/abstracts_by_pmids_output" --install_packages
+ #output: test-data/abstracts_by_pmids_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output" --install_packages
+ #output: test-data/text_to_wordmatrix_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output_args" --remove_num --remove_stopwords --plurals --install_packages
+ #output: test-data/text_to_wordmatrix_output_args
+
+ $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output" --number 50 --categories Gene Mutation --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output
+
+  $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_byid" --number 50 --categories Gene Disease --install_packages --byid
+ #output: test-data/pmids_to_pubtator_matrix_output_byid
+
+  $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_number" --number 5 --categories Gene Disease --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output_number
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/text_to_wordmatrix_output" --install_packages
+ #output: ShinyApp
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/pmids_to_pubtator_matrix_output" --install_packages
+ #output: ShinyApp
+
+
+
+

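The matrix outputs listed above are described as binary (0 = word absent, 1 = word present). A minimal sketch of a sanity check in base R follows. The helper `is_binary_matrix_file` and the toy matrix are hypothetical and not part of the repository. The toy file stands in for a real tool output and is written with the same `write.table` settings used in `text_to_wordmatrix.R`.

```r
# Hypothetical helper (not part of simtext): TRUE if every cell of a
# tab-delimited matrix file with a header row is 0 or 1.
is_binary_matrix_file <- function(path) {
  m <- as.matrix(read.delim(path, header = TRUE, sep = "\t", check.names = FALSE))
  is.numeric(m) && all(m %in% c(0, 1))
}

# Toy stand-in for a tool output (assumed layout: one column per word,
# one row per entity, no row names).
toy <- matrix(c(1, 0, 1, 1, 0, 0), nrow = 2,
              dimnames = list(NULL, c("seizure", "sodium", "channel")))
path <- tempfile()
write.table(toy, path, row.names = FALSE, sep = "\t", quote = FALSE)

ok <- is_binary_matrix_file(path)
print(ok)
```

To check a real run, pass one of the output paths from the commands above, for example "test-data/text_to_wordmatrix_output", instead of the temporary file.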
diff -r 000000000000 -r 02e46a96e98a text_to_wordmatrix.R
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/text_to_wordmatrix.R Wed Mar 24 08:34:22 2021 +0000

[

@@ -0,0 +1,106 @@
+#!/usr/bin/env Rscript
+# tool: text_to_wordmatrix
+#
+#The tool extracts the most frequent words per entity (per row). Text in columns starting with "ABSTRACT" or "TEXT" is considered.
+#All extracted terms are used to generate a word matrix with rows = entities and columns = extracted words.
+#The resulting matrix is binary with 0= word not present in abstracts of entity and 1= word present in abstracts of entity.
+#
+#Input: Output of "pubmed_by_queries" or "abstracts_by_pmids", or tab-delimited table with entities in column called “ID_<name>”,
+#e.g. “ID_genes” and text in columns starting with "ABSTRACT" or "TEXT".
+#
+#Output: Binary matrix with rows = entities and columns = extracted words.
+#
+#usage: text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p] [--install_packages]
+#
+# optional arguments:
+# -h, --help                    show help message
+# -i INPUT, --input INPUT       input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT    output file name. [default "text_to_wordmatrix_output"]
+# -n NUMBER, --number NUMBER    number of most frequent words that should be extracted [default "50"]
+# -r, --remove_num              remove any numbers in text
+# -l, --lower_case              by default all characters are translated to lower case. otherwise use -l
+# -w, --remove_stopwords        by default a set of english stopwords (e.g., "the" or "not") are removed. otherwise use -w
+# -s, --stemDoc                 apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary
+# -p, --plurals                 by default words in plural and singular are merged to the singular form. otherwise use -p
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/");
+  if (!require("PubMedWordcloud")) install.packages("PubMedWordcloud", repo = "http://cran.rstudio.com/");
+  if (!require("SnowballC")) install.packages("SnowballC", repo = "http://cran.rstudio.com/");
+  if (!require("textclean")) install.packages("textclean", repo = "http://cran.rstudio.com/");
+  if (!require("SemNetCleaner")) install.packages("SemNetCleaner", repo = "http://cran.rstudio.com/");
+  if (!require("stringi")) install.packages("stringi", repo = "http://cran.rstudio.com/");
+  if (!require("stringr")) install.packages("stringr", repo = "http://cran.rstudio.com/");
+}
+
+suppressPackageStartupMessages(library("argparse"))
+suppressPackageStartupMessages(library("PubMedWordcloud"))
+suppressPackageStartupMessages(library("SnowballC"))
+suppressPackageStartupMessages(library("SemNetCleaner"))
+suppressPackageStartupMessages(library("textclean"))
+suppressPackageStartupMessages(library("stringi"))
+suppressPackageStartupMessages(library("stringr"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "text_to_wordmatrix_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("-n", "--number", type = "integer", default = 50, choices = seq(1, 500), metavar = "{1..500}",
+                    help = "number of most frequent words used per ID in word matrix [default \"%(default)s\"]")
+parser$add_argument("-r", "--remove_num", action = "store_true", default = FALSE,
+                    help = "remove any numbers in text")
+parser$add_argument("-l", "--lower_case", action = "store_false", default = TRUE,
+                    help = "by default all characters are translated to lower case. otherwise use -l")
+parser$add_argument("-w", "--remove_stopwords", action = "store_false", default = TRUE,
+                    help = "by default a set of English stopwords (e.g., 'the' or 'not') are removed. otherwise use -w")
+parser$add_argument("-s", "--stemDoc", action = "store_true", default = FALSE,
+                    help = "apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary")
+parser$add_argument("-p", "--plurals", action = "store_false", default = TRUE,
+                    help = "by default words in plural and singular are merged to the singular form. otherwise use -p")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+word_matrix <- data.frame()
+
+text_cols_index <- grep("^(ABSTRACT|TEXT)", names(data))
+
+for (row in seq_len(nrow(data))) {
+    top_words <- cleanAbstracts(abstracts = data[row, text_cols_index],
+                               rmNum = args$remove_num,
+                               tolw = args$lower_case,
+                               rmWords = args$remove_stopwords,
+                               stemDoc = args$stemDoc)
+
+    top_words$word <- as.character(top_words$word)
+
+    cat("Most frequent words for row", row, " are extracted.", "\n")
+
+    if (args$plurals == TRUE) {
+      top_words$word <- sapply(top_words$word, function(x) {
+        singularize(x)
+        })
+      top_words <- aggregate(freq~word, top_words, sum)
+    }
+
+    top_words <- top_words[order(top_words$freq, decreasing = TRUE), ]
+    top_words$word <- as.character(top_words$word)
+
+    number_extract <- min(args$number, nrow(top_words))
+    word_matrix[row, sapply(1:number_extract, function(x) {
+      paste0(top_words$word[x])
+      })] <- top_words$freq[1:number_extract]
+}
+
+word_matrix <- as.matrix(word_matrix)
+word_matrix[is.na(word_matrix)] <- 0
+word_matrix <- (word_matrix > 0) * 1  #binary matrix
+
+cat("A matrix with", nrow(word_matrix), "rows and", ncol(word_matrix), "columns is generated.", "\n")
+
+write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
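The matrix-building step of the script above can be illustrated with a small base R sketch. It is a simplified illustration under stated assumptions, not the script's own code. The toy frequency vectors and the helper `build_binary_matrix` are hypothetical, and the sketch marks presence directly instead of storing frequencies first and binarizing afterwards as the script does. Text cleaning with `cleanAbstracts` and singularization are left out.

```r
# Hypothetical toy input: one named vector of word frequencies per entity,
# already sorted by decreasing frequency.
word_freqs <- list(
  geneA = c(epilepsy = 5, seizure = 3, channel = 1),
  geneB = c(seizure = 4, sodium = 2)
)

# Hypothetical helper: keep the top `n` words per entity, take the union of
# all kept words as columns, and mark presence with 1 (absence stays 0).
build_binary_matrix <- function(freqs, n = 2) {
  top <- lapply(freqs, function(f) names(f)[seq_len(min(n, length(f)))])
  vocab <- unique(unlist(top, use.names = FALSE))
  m <- matrix(0, nrow = length(freqs), ncol = length(vocab),
              dimnames = list(names(freqs), vocab))
  for (id in names(top)) {
    m[id, top[[id]]] <- 1
  }
  m
}

m <- build_binary_matrix(word_freqs, n = 2)
print(m)
```

With `n = 2`, the word "channel" is dropped for geneA because it is only the third most frequent word, which mirrors the effect of the `--number` option.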