Mercurial > repos > iuc > text_to_wordmatrix
changeset 0:0692d11af909 draft default tip
"planemo upload for repository https://github.com/galaxyproject/tools-iuc/tools/simtext commit 63a5e13cf89cdd209d20749c582ec5b8dde4e208"
| author | iuc | 
|---|---|
| date | Wed, 24 Mar 2021 08:33:25 +0000 | 
| parents | |
| children | |
| files | README.md abstracts_by_pmids.R macros.xml pmids_to_pubtator_matrix.R pubmed_by_queries.R test-data/abstracts_by_pmids_output test-data/pmids_to_pubtator_matrix_output test-data/pmids_to_pubtator_matrix_output_byid test-data/pmids_to_pubtator_matrix_output_number test-data/pubmed_by_queries_output test-data/pubmed_by_queries_output_abstracts test-data/test_data test-data/text_to_wordmatrix_output test-data/text_to_wordmatrix_output_args test/commands_tests text_to_wordmatrix.R text_to_wordmatrix.xml | 
| diffstat | 17 files changed, 1139 insertions(+), 0 deletions(-) | 
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README.md Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,198 @@ +# SimText + +A text mining framework for interactive analysis and visualization of similarities among biomedical entities. + +## Brief overview of tools: + + - pubmed_by_queries: + + For each search query, PMIDs or abstracts from PubMed are saved. + + - abstracts_by_pmids: + + For all PMIDs in each row of a table the according abstracts are saved in additional columns. + + - text_to_wordmatrix: + + The most frequent words of text from each row are extracted and united in one large binary matrix. + + - pmids_to_pubtator_matrix: + + For PMIDs of each row, scientific words are extracted using PubTator annotations and subsequently united in one large binary matrix. + + - simtext_app: + + Shiny app with word clouds, dimension reduction plot, dendrogram of hierarchical clustering and table with words and their frequency among the search queries. + +## Set up user credentials on Galaxy + +To enable users to set their credentials (NCBI API Key) for this tool, +make sure the file `config/user_preferences_extra_conf.yml` has the following section: + +``` +preferences: + ncbi_account: + description: NCBI account information + inputs: + - name: apikey + label: NCBI API Key (available from "API Key Management" at https://www.ncbi.nlm.nih.gov/account/settings/) + type: text + required: False + +``` + +## Requirements command-line version + + - R (version > 4.0.0) + +## Installation command-line version + +``` +$ mkdir -p <path>/simtext +$ cd <path>/simtext +$ git clone https://github.com/dlal-group/simtext +``` + +## pubmed_by_queries + +This tool uses a set of search queries to download a defined number of abstracts or PMIDs for each search query from PubMed. PubMed's search rules and syntax apply. Users can obtain an API key from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). 
If the tool is used as a command-line tool, the API key is passed as an argument. For usage in Galaxy the API key is added to the Galaxy user-preferences (User/ Preferences/ Manage Information). + +Input: + +Tab-delimited table with a list of search queries (biomedical entities of interest) in one column. The column header should start with "ID_" (e.g., "ID_gene" if search queries are genes). + +Usage: +``` +$ Rscript pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY] [--install_packages] +``` + +Optional arguments: +``` + -h, --help show help message + -i INPUT, --input INPUT input file name. add path if file is not in working directory + -o OUTPUT, --output OUTPUT output file name [default "pubmed_by_queries_output"] + -n NUMBER, --number NUMBER number of PMIDs or abstracts to save per ID [default "5"] + -a, --abstract if abstracts instead of PMIDs should be retrieved use --abstract + -k KEY, --key KEY if NCBI API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information). + --install_packages if you want to auto install missing required packages +``` + +Output: + +A table with additional columns containing PMIDs or abstracts from PubMed. + +## abstracts_by_pmids + +This tool retrieves abstracts for a matrix of PMIDs. The abstract text is saved in additional columns. + +Input: + +Tab-delimited table with rows representing biomedical entities and columns containing the corresponding PMIDs. The names of the PMID columns should start with “PMID_” (e.g., “PMID_1”, “PMID_2” etc.). + +Usage: +``` +$ Rscript abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT] [--install_packages] +``` + +Optional arguments: +``` + -h, --help show help message + -i INPUT, --input INPUT input file name. 
add path if file is not in working directory + -o OUTPUT, --output OUTPUT output file name [default "abstracts_by_pmids_output"] + --install_packages if you want to auto install missing required packages +``` + +Output: + +A table with additional columns containing abstract texts. + +## text_to_wordmatrix + +The tool extracts for each row the most frequent words from the text in columns starting with "ABSTRACT" or "TEXT". The extracted words from each row are united in one large binary matrix, with 0 = word not frequently occurring in text of that row and 1 = word frequently present in text of that row. + +Input: + +The output of ‘pubmed_by_queries’ or ‘abstracts_by_pmids’ tools, or a tab-delimited table with text in columns starting with "ABSTRACT" or "TEXT". + +Usage: +``` +$ Rscript text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p] [--install_packages] +``` + +Optional arguments: +``` + -h, --help show help message + -i INPUT, --input INPUT input file name. add path if file is not in working directory + -o OUTPUT, --output OUTPUT output file name. [default "text_to_wordmatrix_output"] + -n NUMBER, --number NUMBER number of most frequent words that should be extracted per row [default "50"] + -r, --remove_num remove any numbers in text + -l, --lower_case by default all characters are translated to lower case. otherwise use -l + -w, --remove_stopwords by default a set of english stopwords (e.g., 'the' or 'not') are removed. otherwise use -w + -s, --stemDoc apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary + -p, --plurals by default words in plural and singular are merged to the singular form. otherwise use -p + --install_packages if you want to auto install missing required packages +``` + +Output: + +A binary matrix in which each column represents one of the extracted words. 
+ +## pmids_to_pubtator_matrix + +The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted. The extracted terms are united in one large binary matrix, with 0 = term not present in abstracts of that row and 1 = term present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as they are or if they should be grouped by their geneIDs/ meshIDs (several terms are often grouped into one ID). Also, by default all terms are extracted, otherwise the user can specify a number of most frequent words to extract per row. + +Input: + +Output of 'abstracts_by_pmids' tool, or tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc. + +Usage: +``` +$ Rscript pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-b] [-n NUMBER] [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]] [--install_packages] +``` + +Optional arguments: +``` + -h, --help show help message + -i INPUT, --input INPUT input file name. add path if file is not in working directory + -o OUTPUT, --output OUTPUT output file name. [default "pmids_to_pubtator_matrix_output"] + -b, --byid if you want to find common gene IDs / mesh IDs instead of specific scientific terms. + -n NUMBER, --number NUMBER number of most frequent terms/IDs to extract. by default all terms/IDs are extracted. + -c [...], --categories [...] PubTator categories that should be considered [default "('Gene', 'Disease', 'Mutation','Chemical')"] + --install_packages if you want to auto install missing required packages +``` + +Output: + +Binary matrix in which each column represents one of the extracted terms. 
+ +## simtext_app + +The tool enables the exploration of data generated by ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools in a local Shiny instance. The following features can be generated: 1) word clouds for each initial search query, 2) dimension reduction and hierarchical clustering of binary matrices, and 3) tables with words and their frequency in the search queries. + +Input: + +1) Input 1: +Tab-delimited table with + - A column with initial search queries starting with "ID_" (e.g., "ID_gene" if initial search queries were genes). + - Column(s) with grouping factor(s) to compare pre-existing categories of the initial search queries with the grouping based on text. The column names should start with "GROUPING_". If the column name is "GROUPING_disorder", "disorder" will be shown as a grouping variable in the app. +2) Input 2: +The output of ‘text_to_wordmatrix’ or ‘pmids_to_pubtator_matrix’ tools, or a binary matrix. + +Usage: +``` +$ Rscript simtext_app.R [-h] [-i INPUT] [-m MATRIX] [-p PORT] +``` + +Optional arguments: +``` + -h, --help show help message + -i INPUT, --input INPUT input file name. add path if file is not in working directory + -m MATRIX, --matrix MATRIX matrix file name. add path if file is not in working directory + -p PORT, --port PORT specify port, otherwise randomly selected + --host specify host + --install_packages if you want to auto install missing required packages +``` + +Output: + +SimText app
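The README describes how text_to_wordmatrix unites the most frequent words of each row in one binary matrix. The following is a minimal sketch of that idea in base R only. It is not the code of `text_to_wordmatrix.R` (which is not reproduced here): the helper names `top_words` and `build_word_matrix` are hypothetical, and stopword removal, stemming and plural merging are left out.

```r
# Hypothetical helper: the n most frequent lower-cased words of one text.
top_words <- function(text, n = 50) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  words <- words[!is.na(words) & nzchar(words)]
  counts <- sort(table(words), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# Hypothetical helper: one row per text, one column per extracted word,
# 1 = word is among the top n words of that row, 0 = it is not.
build_word_matrix <- function(texts, n = 50) {
  per_row <- lapply(texts, top_words, n = n)
  vocab <- sort(unique(unlist(per_row)))
  m <- matrix(0L, nrow = length(texts), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(per_row)) {
    if (length(per_row[[i]]) > 0) {
      m[i, per_row[[i]]] <- 1L
    }
  }
  m
}

texts <- c("sodium channel gene sodium", "epilepsy gene epilepsy")
m <- build_word_matrix(texts, n = 2)
print(m)
```

Each row keeps only its own top words, so a word that occurs in a row but is not among its `n` most frequent words is recorded as 0 for that row.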
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/abstracts_by_pmids.R Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,142 @@ +#!/usr/bin/env Rscript +#TOOL2 abstracts_by_pmids +# +#This tool retrieves for all PMIDs in each row of a table the corresponding abstracts and saves them in additional columns. +# +#Input: Tab-delimited table with columns containing PMIDs. The names of the PMID columns should start with “PMID”, e.g. “PMID_1”, “PMID_2” etc. +# +#Output: Input table with additional columns containing abstracts corresponding to the PMIDs from PubMed. +#The abstract columns are called "ABSTRACT_1", "ABSTRACT_2" etc. +# +# Usage: $ abstracts_by_pmids.R [-h] [-i INPUT] [-o OUTPUT] +# +# optional arguments: +# -h, --help show help message +# -i INPUT, --input INPUT input file name. add path if file is not in working directory +# -o OUTPUT, --output OUTPUT output file name. [default "abstracts_by_pmids_output"] + + +if ("--install_packages" %in% commandArgs()) { + print("Installing packages") + if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/"); + if (!require("reutils")) install.packages("reutils", repo = "http://cran.rstudio.com/"); + if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/"); + if (!require("textclean")) install.packages("textclean", repo = "http://cran.rstudio.com/"); +} + +suppressPackageStartupMessages(library("argparse")) +library("reutils") +suppressPackageStartupMessages(library("easyPubMed")) +suppressPackageStartupMessages(library("textclean")) + +parser <- ArgumentParser() +parser$add_argument("-i", "--input", + help = "input file name. add path if file is not in working directory") +parser$add_argument("-o", "--output", default = "abstracts_by_pmids_output", + help = "output file name. 
[default \"%(default)s\"]") +parser$add_argument("--install_packages", action = "store_true", default = FALSE, + help = "If you want to auto install missing required packages.") + +args <- parser$parse_args() + +data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t") +pmids_cols_index <- grep("PMID", names(data)) + +fetch_abstracts <- function(pmids, row) { + + efetch_result <- NULL + try_num <- 1 + t_0 <- Sys.time() + + while (is.null(efetch_result)) { + + # Timing check: kill at 3 min + if (try_num > 1) { + Sys.sleep(time = 1 * try_num) + cat("Problem to receive PubMed data or error is received. Please wait. Try number: ", try_num, "\n") + } + + t_1 <- Sys.time() + + if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) { + message("Killing the request! Something is not working. Please, try again later", "\n") + return(data) + } + + efetch_result <- tryCatch({ + suppressWarnings(efetch(uid = pmids, db = "pubmed", retmode = "xml")) + }, error = function(e) { + NULL + }) + + if (!is.null(as.list(efetch_result$errors)$error)) { + if (as.list(efetch_result$errors)$error == "HTTP error: Status 400; Bad Request") { + efetch_result <- NULL + } + } + + try_num <- try_num + 1 + + } #while loop end + + # articles to list + xml_data <- strsplit(efetch_result$content, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1] + xml_data <- sapply(xml_data, function(x) { + #trim extra stuff at the end of the record + if (!grepl("</PubmedArticle>$", x)) + x <- sub("(^.*</PubmedArticle>).*$", "\\1", x) + # Rebuid XML structure and proceed + x <- paste("<PubmedArticle>", x) + gsub("[[:space:]]{2,}", " ", x)}, + USE.NAMES = FALSE, simplify = TRUE) + + abstract_text <- sapply(xml_data, function(x) { + custom_grep(x, tag = "AbstractText", format = "char")}, + USE.NAMES = FALSE, simplify = TRUE) + + abstracts <- sapply(abstract_text, function(x) { + if (length(x) > 1) { + x <- paste(x, collapse = " ", sep = " ") + x <- gsub("</{0,1}i>", "", x, ignore.case = T) + x 
<- gsub("</{0,1}b>", "", x, ignore.case = T) + x <- gsub("</{0,1}sub>", "", x, ignore.case = T) + x <- gsub("</{0,1}exp>", "", x, ignore.case = T) + } else if (length(x) < 1) { + x <- NA + } else { + x <- gsub("</{0,1}i>", "", x, ignore.case = T) + x <- gsub("</{0,1}b>", "", x, ignore.case = T) + x <- gsub("</{0,1}sub>", "", x, ignore.case = T) + x <- gsub("</{0,1}exp>", "", x, ignore.case = T) + } + x + }, + USE.NAMES = FALSE, simplify = TRUE) + + abstracts <- as.character(abstracts) + + if (length(abstracts) > 0) { + data[row, sapply(seq(length(abstracts)), function(i) { + paste0("ABSTRACT_", i) + })] <- abstracts + cat(length(abstracts), " abstracts for PMIDs of row ", row, " are added in the table.", "\n") + } + + return(data) +} + + +for (row in seq(nrow(data))) { + pmids <- as.character(unique(data[row, pmids_cols_index])) + pmids <- pmids[!pmids == "NA"] + + if (length(pmids) > 0) { + data <- tryCatch(fetch_abstracts(pmids, row), + error = function(e) { + Sys.sleep(3) + }) + } else { + print(paste("No PMIDs in row", row)) + } +} +write.table(data, args$output, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
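A note on the error handling in the script above: `tryCatch` returns the value of its handler, and `Sys.sleep()` returns `NULL`, so a handler that only sleeps would hand `NULL` back to `data <- tryCatch(...)`. The sketch below shows one way to combine a retry loop with a time limit and a handler that returns the unchanged table. The helper names `retry_until` and `safe_update` are hypothetical and not part of the committed script; the `flaky` function only simulates a failing request.

```r
# Hypothetical helper: call `fetch` until it returns a non-NULL value or
# `max_seconds` have passed; `wait` maps the try number to a pause in seconds.
retry_until <- function(fetch, max_seconds = 180,
                        wait = function(try_num) try_num) {
  started <- Sys.time()
  try_num <- 1
  repeat {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
    if (elapsed > max_seconds) {
      return(NULL)
    }
    Sys.sleep(wait(try_num))
    try_num <- try_num + 1
  }
}

# Hypothetical helper: apply `update` to `data`, but keep `data` unchanged
# when `update` fails, instead of returning the handler's NULL.
safe_update <- function(data, update) {
  tryCatch(update(data), error = function(e) {
    message("Row skipped: ", conditionMessage(e))
    data
  })
}

# Simulated request that fails twice and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
result <- retry_until(flaky, wait = function(try_num) 0)

df <- data.frame(a = 1)
kept <- safe_update(df, function(d) stop("boom"))
```

In the row loop, `safe_update(data, function(d) ...)` would keep the rows collected so far when a single request fails.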
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/macros.xml Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,11 @@ +<macros> + <token name="@VERSION@">0.0.2</token> + + <xml name="citations"> + <citations> + <citation type="doi">10.1101/2020.07.06.190629</citation> + </citations> + </xml> + +</macros> +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/pmids_to_pubtator_matrix.R Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,231 @@ +#!/usr/bin/env Rscript +#tool: pmids_to_pubtator_matrix +# +#The tool uses all PMIDs per row and extracts "Gene", "Disease", "Mutation", "Chemical" and "Species" terms of the +#corresponding abstracts, using PubTator annotations. The user can choose from which categories terms should be extracted. +#The extracted terms are united in one large binary matrix, with 0= term not present in abstracts of that row and 1= term +#present in abstracts of that row. The user can decide if the scientific terms should be extracted and used as +#they are or if they should be grouped by their geneIDs/ meshIDs (several terms can often be grouped into one ID). +#Also, by default all terms are extracted, otherwise the user can specify a number of most frequent words to be extracted per row. +# +#Input: Output of abstracts_by_pmids or tab-delimited table with columns containing PMIDs. +#The names of the PMID columns should start with "PMID", e.g. "PMID_1", "PMID_2" etc. +# +#Output: Binary matrix in which each column represents one of the extracted terms. +# +# usage: $ pmids_to_pubtator_matrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] +# [-c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...]] +# +# optional arguments: +# -h, --help show help message +# -i INPUT, --input INPUT input file name. add path if file is not in working directory +# -n NUMBER, --number NUMBER Number of most frequent terms/IDs to extract. By default all terms/IDs are extracted. +# -o OUTPUT, --output OUTPUT output file name. [default "pmids_to_pubtator_matrix_output"] +# -c {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...], --categories {Gene,Disease,Mutation,Chemical,Species} [{Gene,Disease,Mutation,Chemical,Species} ...] +# Pubtator categories that should be considered. 
[default "('Gene', 'Disease', 'Mutation','Chemical')"] + +if ("--install_packages" %in% commandArgs()) { + print("Installing packages") + if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/"); + if (!require("stringr")) install.packages("stringr", repo = "http://cran.rstudio.com/"); + if (!require("RCurl")) install.packages("RCurl", repo = "http://cran.rstudio.com/"); + if (!require("stringi")) install.packages("stringi", repo = "http://cran.rstudio.com/"); +} + +suppressPackageStartupMessages(library("argparse")) +library("stringr") +library("RCurl") +library("stringi") + +parser <- ArgumentParser() + +parser$add_argument("-i", "--input", + help = "input file name. add path if file is not in working directory") +parser$add_argument("-o", "--output", default = "pmids_to_pubtator_matrix_output", + help = "output file name. [default \"%(default)s\"]") +parser$add_argument("-c", "--categories", choices = c("Gene", "Disease", "Mutation", "Chemical", "Species"), nargs = "+", + default = c("Gene", "Disease", "Mutation", "Chemical"), + help = "Pubtator categories that should be considered. [default \"%(default)s\"]") +parser$add_argument("-b", "--byid", action = "store_true", default = FALSE, + help = "If you want to find common gene IDs / mesh IDs instead of scientific terms.") +parser$add_argument("-n", "--number", default = NULL, type = "integer", + help = "Number of most frequent terms/IDs to extract. 
By default all terms/IDs are extracted.") +parser$add_argument("--install_packages", action = "store_true", default = FALSE, + help = "If you want to auto install missing required packages.") + +args <- parser$parse_args() + + +data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t") + +pmid_cols_index <- grep(c("PMID"), names(data)) +word_matrix <- data.frame() +dict_table <- data.frame() +pmids_count <- 0 +pubtator_max_ids <- 100 + + +merge_pubtator_table <- function(out_data, table) { + out_data <- unlist(strsplit(out_data, "\n", fixed = T)) + for (i in 3:length(out_data)) { + temps <- unlist(strsplit(out_data[i], "\t", fixed = T)) + if (length(temps) == 5) { + temps <- c(temps, NA) + } + if (length(temps) == 6) { + table <- rbind(table, temps) + } + } + return(table) +} + + +get_pubtator_terms <- function(pmids) { + table <- NULL + for (pmid_split in split(pmids, ceiling(seq_along(pmids) / pubtator_max_ids))) { + out_data <- NULL + try_num <- 1 + t_0 <- Sys.time() + while (TRUE) { + # Timing check: kill at 3 min + if (try_num > 1) { + cat("Connection problem. Please wait. Try number:", try_num, "\n") + Sys.sleep(time = 2 * try_num) + } + try_num <- try_num + 1 + t_1 <- Sys.time() + if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) { + message("Killing the request! Something is not working. 
Please, try again later", "\n") + return(table) + } + out_data <- tryCatch({ + getURL(paste("https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=", + paste(pmid_split, collapse = ","), sep = "")) + }, error = function(e) { + print(e) + NULL + }, finally = { + Sys.sleep(0) + }) + if (!is.null(out_data)) { + table <- merge_pubtator_table(out_data, table) + break + } + } + } + return(table) +} + +extract_category_terms <- function(table, categories) { + index_categories <- c() + categories <- as.character(unlist(categories)) + if (ncol(table) == 6) { + for (i in categories) { + tmp_index <- grep(TRUE, i == as.character(table[, 5])) + if (length(tmp_index) > 0) { + index_categories <- c(index_categories, tmp_index) + } + } + table <- as.data.frame(table, stringsAsFactors = FALSE) + table <- table[index_categories, c(4, 6)] + table <- table[!is.na(table[, 2]), ] + table <- table[!(table[, 2] == "NA"), ] + table <- table[!(table[, 1] == "NA"), ] + } else { + return(NULL) + } +} + +extract_frequent_ids_or_terms <- function(table) { + if (is.null(table)) { + return(NULL) + break + } + if (args$byid) { + if (!is.null(args$number)) { + #retrieve top X mesh_ids + table_mesh <- as.data.frame(table(table[, 2])) + colnames(table_mesh)[1] <- "mesh_id" + table_mesh <- table_mesh[order(table_mesh$Freq, decreasing = TRUE), ] + table_mesh <- table_mesh[1:min(args$number, nrow(table_mesh)), ] + table_mesh$mesh_id <- as.character(table_mesh$mesh_id) + #subset table for top X mesh_ids + table <- table[which(as.character(table$V6) %in% as.character(table_mesh$mesh_id)), ] + table <- table[!duplicated(table[, 2]), ] + } else { + table <- table[!duplicated(table[, 2]), ] + } + } else { + if (!is.null(args$number)) { + table[, 1] <- tolower(as.character(table[, 1])) + table <- as.data.frame(table(table[, 1])) + colnames(table)[1] <- "term" + table <- table[order(table$Freq, decreasing = TRUE), ] + table <- table[1:min(args$number, nrow(table)), ] + table$term <- 
as.character(table$term) + } else { + table[, 1] <- tolower(as.character(table[, 1])) + table <- table[!duplicated(table[, 1]), ] + } + } + return(table) +} + + +#for all PMIDs of a row get PubTator terms and add them to the matrix +for (i in seq(nrow(data))) { + pmids <- as.character(data[i, pmid_cols_index]) + pmids <- pmids[!pmids == "NA"] + if (pmids_count > 10000) { + cat("Break (10s) to avoid killing of requests. Please wait.", "\n") + Sys.sleep(10) + pmids_count <- 0 + } + pmids_count <- pmids_count + length(pmids) + #get puptator terms and process them with functions + if (length(pmids) > 0) { + table <- get_pubtator_terms(pmids) + table <- extract_category_terms(table, args$categories) + table <- extract_frequent_ids_or_terms(table) + if (!is.null(table)) { + colnames(table) <- c("term", "mesh_id") + # add data in binary matrix + if (args$byid) { + mesh_ids <- as.character(table$mesh_id) + if (length(mesh_ids) > 0) { + word_matrix[i, mesh_ids] <- 1 + cat(length(mesh_ids), " IDs for PMIDs of row", i, " were added", "\n") + # add data in dictionary + dict_table <- rbind(dict_table, table) + dict_table <- dict_table[!duplicated(as.character(dict_table[, 2])), ] + } + } else { + terms <- as.character(table[, 1]) + if (length(terms) > 0) { + word_matrix[i, terms] <- 1 + cat(length(terms), " terms for PMIDs of row", i, " were added.", "\n") + } + } + } + } else { + cat("No terms for PMIDs of row", i, " were found.", "\n") + } +} + +if (args$byid) { + #change column names of matrix: exchange mesh ids/ids with term + index_names <- match(names(word_matrix), as.character(dict_table[[2]])) + names(word_matrix) <- dict_table[index_names, 1] +} + +colnames(word_matrix) <- gsub("[^[:print:]]", "", colnames(word_matrix)) +colnames(word_matrix) <- gsub('\"', "", colnames(word_matrix), fixed = TRUE) + +#merge duplicated columns +word_matrix <- as.data.frame(do.call(cbind, by(t(word_matrix), INDICES = names(word_matrix), FUN = colSums))) + +#save binary matrix +word_matrix 
<- as.matrix(word_matrix) +word_matrix[is.na(word_matrix)] <- 0 +cat("Matrix with ", nrow(word_matrix), " rows and ", ncol(word_matrix), " columns generated.", "\n") +write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
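The script above requests PubTator annotations in batches of at most `pubtator_max_ids` PMIDs, using `split()` with `ceiling(seq_along(...) / size)`. A small standalone sketch of that batching step follows. The helper name `chunk_ids` is hypothetical; the base URL is the one used in the script, and nothing is actually downloaded here.

```r
# Hypothetical helper: split a vector of IDs into batches of at most
# `max_per_request` elements, keeping the original order.
chunk_ids <- function(ids, max_per_request = 100) {
  unname(split(ids, ceiling(seq_along(ids) / max_per_request)))
}

ids <- as.character(1:250)
chunks <- chunk_ids(ids)

# One request URL per batch (built only, not fetched).
base_url <- paste0("https://www.ncbi.nlm.nih.gov/research/pubtator-api/",
                   "publications/export/pubtator?pmids=")
urls <- vapply(chunks,
               function(chunk) paste0(base_url, paste(chunk, collapse = ",")),
               character(1))

print(lengths(chunks))
```

Because `seq_along(ids) / max_per_request` is rounded up, positions 1 to 100 fall into batch 1, 101 to 200 into batch 2, and so on, with the last batch holding the remainder.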
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/pubmed_by_queries.R Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,258 @@ +#!/usr/bin/env Rscript +#tool: pubmed_by_queries +# +#This tool uses a set of search queries to download a defined number of abstracts or +#PMIDs for search query from PubMed. PubMed's search rules and syntax apply. +# +#Input: Tab-delimited table with search queries in a column starting with "ID_", +#e.g. "ID_gene" if search queries are genes. +# +#Output: Input table with additional columns +#with PMIDs or abstracts (--abstracts) from PubMed. +# +#Usage: +#$pubmed_by_queries.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-a] [-k KEY] +# +#optional arguments: +# -h, --help show this help message and exit +# -i INPUT, --input INPUT input file name. add path if file is not in working directory +# -o OUTPUT, --output OUTPUT output file name. [default "pubmed_by_queries_output"] +# -n NUMBER, --number NUMBER number of PMIDs or abstracts to save per ID [default "5"] +# -a, --abstract if abstracts instead of PMIDs should be retrieved use --abstracts +# -k KEY, --key KEY if ncbi API key is available, add it to speed up the download of PubMed data. +# For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information). + +if ("--install_packages" %in% commandArgs()) { + print("Installing packages") + if (!require("argparse")) install.packages("argparse", repo = "http://cran.rstudio.com/") ; + if (!require("easyPubMed")) install.packages("easyPubMed", repo = "http://cran.rstudio.com/") ; +} + +suppressPackageStartupMessages(library("argparse")) +suppressPackageStartupMessages(library("easyPubMed")) + +parser <- ArgumentParser() +parser$add_argument("-i", "--input", + help = "Input fie name. add path if file is not in working directory") +parser$add_argument("-o", "--output", default = "pubmed_by_queries_output", + help = "Output file name. 
[default \"%(default)s\"]") +parser$add_argument("-n", "--number", type = "integer", default = 5, + help = "Number of PMIDs (or abstracts) to save per ID. [default \"%(default)s\"]") +parser$add_argument("-a", "--abstract", action = "store_true", default = FALSE, + help = "If abstracts instead of PMIDs should be retrieved use --abstracts ") +parser$add_argument("-k", "--key", type = "character", + help = "If ncbi API key is available, add it to speed up the download of PubMed data. For usage in Galaxy add the API key to the Galaxy user-preferences (User/ Preferences/ Manage Information).") +parser$add_argument("--install_packages", action = "store_true", default = FALSE, + help = "If you want to auto install missing required packages.") +args <- parser$parse_args() + +if (!is.null(args$key)) { + if (file.exists(args$key)) { + credentials <- read.table(args$key, quote = "\"", comment.char = "") + args$key <- credentials[1, 1] + } +} + +max_web_tries <- 100 + +data <- read.delim(args$input, stringsAsFactors = FALSE) + +id_col_index <- grep("ID_", names(data)) + + +fetch_pmids <- function(data, number, pubmed_search, query, row, max_web_tries) { + my_pubmed_url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?", + "db=pubmed&retmax=", number, + "&term=", pubmed_search$OriginalQuery, + "&usehistory=n", sep = "") + # get ids + idxml <- c() + for (i in seq(max_web_tries)) { + tryCatch({ + id_connect <- suppressWarnings(url(my_pubmed_url, open = "rb", encoding = "UTF8")) + idxml <- suppressWarnings(readLines(id_connect, warn = FALSE, encoding = "UTF8")) + suppressWarnings(close(id_connect)) + break + }, error = function(e) { + print(paste("Error getting URL, sleeping", 2 * i, "seconds.")) + print(e) + Sys.sleep(time = 2 * i) + }) + } + pmids <- c() + for (i in seq(length(idxml))) { + if (grepl("^<Id>", idxml[i])) { + pmid <- custom_grep(idxml[i], tag = "Id", format = "char") + pmids <- c(pmids, as.character(pmid[1])) + } + } + if (length(pmids) > 0) { 
+ data[row, sapply(seq(length(pmids)), function(i) { + paste0("PMID_", i) + })] <- pmids + cat(length(pmids), " PMIDs for ", query, " are added in the table.", "\n") + } + return(data) +} + + +fetch_abstracts <- function(data, number, query, pubmed_search) { + efetch_url <- paste("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?", + "db=pubmed&WebEnv=", pubmed_search$WebEnv, "&query_key=", pubmed_search$QueryKey, + "&retstart=", 0, "&retmax=", number, + "&rettype=", "null", "&retmode=", "xml", sep = "") + api_key <- pubmed_search$APIkey + if (!is.null(api_key)) { + efetch_url <- paste(efetch_url, "&api_key=", api_key, sep = "") + } + # initialize + out_data <- NULL + try_num <- 1 + t_0 <- Sys.time() + # Try to fetch results + while (is.null(out_data)) { + # Timing check: kill at 3 min + if (try_num > 1) { + Sys.sleep(time = 2 * try_num) + cat("Problem to receive PubMed data or error is received. Please wait. Try number:", + try_num, "\n") + } + t_1 <- Sys.time() + if (as.numeric(difftime(t_1, t_0, units = "mins")) > 3) { + message("Killing the request! Something is not working. Please, try again later", + "\n") + return(data) + } + # ENTREZ server connect + out_data <- tryCatch({ + tmp_connect <- suppressWarnings(url(efetch_url, + open = "rb", + encoding = "UTF8")) + suppressWarnings(readLines(tmp_connect, + warn = FALSE, + encoding = "UTF8")) + }, error = function(e) { + print(e) + }, finally = { + try(suppressWarnings(close(tmp_connect)), + silent = TRUE) + }) + # Check if error + if (!is.null(out_data) && + class(out_data) == "character" && + grepl("<ERROR>", substr(paste(utils::head(out_data, n = 100), + collapse = ""), 1, 250))) { + out_data <- NULL + } + try_num <- try_num + 1 + } + if (is.null(out_data)) { + message("Killing the request! Something is not working. 
Please, try again later", + "\n") + return(data) + } else { + return(out_data) + } +} + + +process_xml_abstracts <- function(out_data) { + xml_data <- paste(out_data, collapse = "") + # articles to list + xml_data <- strsplit(xml_data, "<PubmedArticle(>|[[:space:]]+?.*>)")[[1]][-1] + xml_data <- sapply(xml_data, function(x) { + #trim extra stuff at the end of the record + if (!grepl("</PubmedArticle>$", x)) + x <- sub("(^.*</PubmedArticle>).*$", "\\1", x) + # Rebuid XML structure and proceed + x <- paste("<PubmedArticle>", x) + gsub("[[:space:]]{2,}", " ", x) + }, + USE.NAMES = FALSE, simplify = TRUE) + #titles + titles <- sapply(xml_data, function(x) { + x <- custom_grep(x, tag = "ArticleTitle", format = "char") + x <- gsub("</{0,1}i>", "", x, ignore.case = T) + x <- gsub("</{0,1}b>", "", x, ignore.case = T) + x <- gsub("</{0,1}sub>", "", x, ignore.case = T) + x <- gsub("</{0,1}exp>", "", x, ignore.case = T) + if (length(x) > 1) { + x <- paste(x, collapse = " ", sep = " ") + } else if (length(x) < 1) { + x <- NA + } + x + }, + USE.NAMES = FALSE, simplify = TRUE) + # abstracts + abstract_text <- sapply(xml_data, function(x) { + custom_grep(x, tag = "AbstractText", format = "char") + }, + USE.NAMES = FALSE, simplify = TRUE) + abstracts <- sapply(abstract_text, function(x) { + if (length(x) > 1) { + x <- paste(x, collapse = " ", sep = " ") + x <- gsub("</{0,1}i>", "", x, ignore.case = T) + x <- gsub("</{0,1}b>", "", x, ignore.case = T) + x <- gsub("</{0,1}sub>", "", x, ignore.case = T) + x <- gsub("</{0,1}exp>", "", x, ignore.case = T) + } else if (length(x) < 1) { + x <- NA + } else { + x <- gsub("</{0,1}i>", "", x, ignore.case = T) + x <- gsub("</{0,1}b>", "", x, ignore.case = T) + x <- gsub("</{0,1}sub>", "", x, ignore.case = T) + x <- gsub("</{0,1}exp>", "", x, ignore.case = T) + } + x + }, + USE.NAMES = FALSE, simplify = TRUE) + #add title to abstracts + if (length(titles) == length(abstracts)) { + abstracts <- paste(titles, abstracts) + } + return(abstracts) +} 
+ + +pubmed_data_in_table <- function(data, row, query, number, key, abstract) { + if (is.null(query)) { + print(data) + } + pubmed_search <- get_pubmed_ids(query, api_key = key) + if (as.numeric(pubmed_search$Count) == 0) { + cat("No PubMed result for the following query: ", query, "\n") + return(data) + } else if (abstract == FALSE) { # fetch PMIDs + data <- fetch_pmids(data, number, pubmed_search, query, row, max_web_tries) + return(data) + } else if (abstract == TRUE) { # fetch abstracts and title text + out_data <- fetch_abstracts(data, number, query, pubmed_search) + abstracts <- process_xml_abstracts(out_data) + #add abstracts to data frame + if (length(abstracts) > 0) { + data[row, sapply(seq(length(abstracts)), + function(i) { + paste0("ABSTRACT_", i) + })] <- abstracts + cat(length(abstracts), " abstracts for ", query, " are added in the table.", + "\n") + } + return(data) + } +} + +for (i in seq(nrow(data))) { + data <- tryCatch(pubmed_data_in_table(data = data, + row = i, + query = data[i, id_col_index], + number = args$number, + key = args$key, + abstract = args$abstract), error = function(e) { + print("main error") + print(e) + Sys.sleep(5) + }) +} + +write.table(data, args$output, append = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
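`process_xml_abstracts` above removes inline markup with four separate `gsub()` calls, repeated in several branches. As a sketch, the same tags could be removed by one pattern inside a small helper. The names `strip_inline_tags` and `clean_abstract` are hypothetical and not part of the committed script; the tag list (`i`, `b`, `sub`, `exp`) is copied from the script as is, and for ordinary, non-nested input the result should match the four sequential calls.

```r
# Hypothetical helper: remove opening and closing <i>, <b>, <sub> and <exp>
# tags (case-insensitive), as the four gsub() calls in the script do.
strip_inline_tags <- function(x) {
  gsub("</?(i|b|sub|exp)>", "", x, ignore.case = TRUE)
}

# Hypothetical helper: join the AbstractText parts of one article and clean
# them; an article without abstract text yields NA.
clean_abstract <- function(parts) {
  if (length(parts) < 1) {
    return(NA_character_)
  }
  strip_inline_tags(paste(parts, collapse = " "))
}

cleaned <- strip_inline_tags("<i>SCN1A</i> and H<sub>2</sub>O")
joined <- clean_abstract(c("<b>Background:</b>", "text"))
empty <- clean_abstract(character(0))
```

Keeping the cleaning in one function means titles and abstracts go through the same rules, and a further tag would only need to be added in one place.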
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/abstracts_by_pmids_output Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,7 @@ +ID_gene GROUPING_disease PMID_1 PMID_2 PMID_3 PMID_4 PMID_5 ABSTRACT_1 ABSTRACT_2 ABSTRACT_3 ABSTRACT_4 ABSTRACT_5 +SCN1A epilepsy 33565071 33531663 33528079 33519675 33478845 To analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G>T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G>A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families. The voltage-gated sodium channel α-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. 
We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies. Advancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n = 11), SCN1A (n = 6) and TSC1 (n = 5) genes. Other common genes were KCNQ2 (n = 3), AMT (n = 3), CACNA1H (n = 3), CLCN2 (n = 3), MECP2 (n = 2), ASAH1 (n = 2) and SLC2A1 (n = 2). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved. Background:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. Previous meta-analyses on this topic mainly focused on the SCN1A rs3812718 polymorphism. 
However, meta-analyses focused on SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, or SCN2A rs2304016 polymorphisms are scarce or non-existent. Objective: We aimed to conduct a meta-analysis to determine the effects of SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms on resistance to antiepileptic drugs (AEDs). Methods: We searched the PubMed, Embase, Cochrane Library, WANFANG, and CNKI databases up to June 2020 to collect studies on the association of SCN1A and SCN2A polymorphisms with reactivity to AEDs. We calculated the pooled odds ratios (ORs) under the allelic, homozygous, heterozygous, dominant, and recessive genetic models to identify the association between the four single-nucleotide polymorphisms (SNPs) and resistance to AEDs. Results: Our meta-analysis included 19 eligible studies. The results showed that the SCN1A rs2298771 polymorphism was related to AED resistance in the allelic, homozygous, and recessive genetic models (G vs. A: OR = 1.20, 95% CI: 1.012-1.424; GG vs. AA: OR = 1.567, 95% CI: 1.147-2.142; GG vs. AA + AG: OR = 1.408, 95% CI: 1.053-1.882). The homozygous model remained significant after Bonferroni correction (P < 0.0125). Further subgroup analyses demonstrated the significance of the correlation in the dominant model in Caucasians (South Asians) after Bonferroni correction (GG + GA vs. AA: OR = 1.620, 95% CI: 1.165-2.252). However, no association between SCN1A rs2298771 polymorphism and resistance to AEDs was found in Asians or Caucasians (non-South Asians). For SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms, the correlations with responsiveness to AEDs were not significant in the overall population nor in any subgroup after conducting the Bonferroni correction. The results for SCN1A rs2298771, SCN1A rs10188577, and SCN2A rs2304016 polymorphisms were stable and reliable according to sensitivity analysis and Begg and Egger tests. 
However, the results for SCN2A rs17183814 polymorphism have to be treated cautiously owing to the significant publication bias revealed by Begg and Egger tests. Conclusions: The present meta-analysis indicated that SCN1A rs2298771 polymorphism significantly affects resistance to AEDs in the overall population and Caucasians (South Asians). There were no significant correlations between SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms and resistance to AEDs. The objective of this study was to identify developmental trajectories of developmental/behavioral phenotypes and possibly their relationship to epilepsy and genotype by analyzing developmental and behavioral features collected prospectively and longitudinally in a cohort of patients with Dravet syndrome (DS). Thirty-four patients from seven Italian tertiary pediatric neurology centers were enrolled in the study. All patients were examined for the SCN1A gene mutation and prospectively assessed from the first years of life with repeated full clinical observations including neurological and developmental examinations. Subjects were found to follow three neurodevelopmental trajectories. In the first group (16 patients), an initial and usually mild decline was observed between the second and the third year of life, specifically concerning visuomotor abilities, later progressing towards global involvement of all abilities. The second group (12 patients) showed an earlier onset of global developmental impairment, progressing towards a generally worse outcome. The third group of only two patients ended up with a normal neurodevelopmental quotient, but with behavioral and linguistic problems. The remaining four patients were not classifiable due to a lack of critical assessments just before developmental decline. The neurodevelopmental trajectories described in this study suggest a differential contribution of neurobiological and genetic factors. 
The profile of the first group, which included the largest fraction of patients, suggests that in the initial phase of the disease, visuomotor defects might play a major role in determining developmental decline. Early diagnosis of milder cases with initial visuomotor impairment may therefore provide new tools for a more accurate habilitation strategy. +SCN9A epilepsy 33389681 33370834 33278787 33237934 33232657 Dorsal root ganglia (DRG) sensory neurons can transmit information about noxious stimulus to cerebral cortex via spinal cord, and play an important role in the pain pathway. Alterations of the pain pathway lead to CIPA (congenital insensitivity to pain with anhidrosis) or chronic pain. Accumulating evidence demonstrates that nerve damage leads to the regeneration of neurons in DRG, which may contribute to pain modulation in feedback. Therefore, exploring the regeneration process of DRG neurons would provide a new understanding to the persistent pathological stimulation and contribute to reshape the somatosensory function. It has been reported that a subpopulation of satellite glial cells (SGCs) express Nestin and p75, and could differentiate into glial cells and neurons, suggesting that SGCs may have differentiation plasticity. Our results in the present study show that DRG-derived SGCs (DRG-SGCs) highly express neural crest cell markers Nestin, Sox2, Sox10, and p75, and differentiate into nociceptive sensory neurons in the presence of histone deacetylase inhibitor VPA, Wnt pathway activator CHIR99021, Notch pathway inhibitor RO4929097, and FGF pathway inhibitor SU5402. The nociceptive sensory neurons express multiple functionally-related genes (SCN9A, SCN10A, SP, Trpv1, and TrpA1) and are able to generate action potentials and voltage-gated Na<sup>+</sup> currents. 
Moreover, we found that these cells exhibited rapid calcium transients in response to capsaicin through binding to the Trpv1 vanilloid receptor, confirming that the DRG-SGC-derived cells are nociceptive sensory neurons. Further, we show that Wnt signaling promotes the differentiation of DRG-SGCs into nociceptive sensory neurons by regulating the expression of specific transcription factor Runx1, while Notch and FGF signaling pathways are involved in the expression of SCN9A. These results demonstrate that DRG-SGCs have stem cell characteristics and can efficiently differentiate into functional nociceptive sensory neurons, shedding light on the clinical treatment of sensory neuron-related diseases. Voltage-gated sodium channel Nav1.7 has been validated as a perspective target for selective inhibitors with analgesic and anti-itch activity. The objective of this study was to discover new candidate compounds with Nav1.7 inhibitor properties. The authors hypothesized that their approach would yield at least one new compound that inhibits sodium currents in vitro and exerts analgesic and anti-itch effects in mice. In silico structure-based similarity search of 1.5 million compounds followed by docking to the Nav1.7 voltage sensor of Domain 4 and molecular dynamics simulation was performed. Patch clamp experiments in Nav1.7-expressing human embryonic kidney 293 cells and in mouse and human dorsal root ganglion neurons were conducted to test sodium current inhibition. Formalin-induced inflammatory pain model, paclitaxel-induced neuropathic pain model, histamine-induced itch model, and mouse lymphoma model of chronic itch were used to confirm in vivo activity of the selected compound. After in silico screening, nine compounds were selected for experimental assessment in vitro. Of those, four compounds inhibited sodium currents in Nav1.7-expressing human embryonic kidney 293 cells by 29% or greater (P < 0.05). 
Compound 9 (3-(1-benzyl-1H-indol-3-yl)-3-(3-phenoxyphenyl)-N-(2-(pyrrolidin-1-yl)ethyl)propanamide, referred to as DA-0218) reduced sodium current by 80% with a 50% inhibition concentration of 0.74 μM (95% CI, 0.35 to 1.56 μM), but had no effects on Nav1.5-expressing human embryonic kidney 293 cells. In mouse and human dorsal root ganglion neurons, DA-0218 reduced sodium currents by 17% (95% CI, 6 to 28%) and 22% (95% CI, 9 to 35%), respectively. The inhibition was greatly potentiated in paclitaxel-treated mouse neurons. Intraperitoneal and intrathecal administration of the compound reduced formalin-induced phase II inflammatory pain behavior in mice by 76% (95% CI, 48 to 100%) and 80% (95% CI, 68 to 92%), respectively. Intrathecal administration of DA-0218 produced acute reduction in paclitaxel-induced mechanical allodynia, and inhibited histamine-induced acute itch and lymphoma-induced chronic itch. This study's computer-aided drug discovery approach yielded a new Nav1.7 inhibitor that shows analgesic and anti-pruritic activity in mouse models. This study aimed to investigate the genetic aetiology in Chinese children diagnosed with status epilepticus (SE). Next-generation sequencing, copy number variation (CNV) analysis, and other genetic testing methods were conducted for children with SE lacking an identifiable non-genetic aetiology. Furthermore, the phenotype and molecular data of patients with SE were retrospectively analysed. Among children with SE lacking an identifiable non-genetic aetiology, 73 out of 163 children (44.8 %) were found to have causative variants associated with SE including 66 monogenic mutations in 22 genes and 7 CNVs. Based on the American College of Medical Genetics and Genomics scoring system, the monogenic variants included 64 pathogenic/likely pathogenic and 2 uncertain significance variants. 
SCN1A gene mutations (n = 32) were the most common cause, followed by TSC2 (n = 5), CACNA1A (n = 5), SCN2A (n = 4), SCN9A (n = 2) and DEPDC5 (n = 2) gene mutations. Sixteen mutations were identified in single genes. Furthermore, 51 (77.3 %) monogenic mutations were de novo. Age at SE onset < 1 year (odds ratio [OR] = 2.70, 95 % confidence interval [CI]: 1.25-5.83, p = 0.012) and co-morbidity of intellectual disability (OR = 3.36, 95 %CI: 1.61-6.99, p = 0.001) were independently associated with pathogenic genetic variants. This study identified genetic aetiology in 44.8 % of patients with SE, which indicates a high burden of genetic aetiology among children with SE in China. Our findings highlight the importance for genetic testing of children with SE that lacks an identifiable non-genetic aetiology. Glioblastoma (GBM) is an aggressive brain tumor associated with high degree of resistance to treatment. Given its heterogeneity, it is important to understand the molecular landscape of this tumor for the development of more effective therapies. Because of the different genetic profiles of patients with GBM, we sought to identify genetic variants in Lebanese patients with GBM (LEB-GBM) and compare our findings to those in the Cancer Genome Atlas (TCGA). We performed whole exome sequencing (WES) to identify somatic variants in a cohort of 60 patient-derived GBM samples. We focused our analysis on 50 commonly mutated GBM candidate genes and compared mutation signatures between our population and publicly available GBM data from TCGA. We also cross-tabulated biological covariates to assess for associations with overall survival, time to recurrence and follow-up duration. We included 60 patient-derived GBM samples from 37 males and 23 females, with age ranging from 3 to 80 years (mean and median age at diagnosis were 51 and 56, respectively). Recurrent tumor formation was present in 94.8% of patients (n = 55/58). 
After filtering, we identified 360 somatic variants from 60 GBM patient samples. After filtering, we identified 360 somatic variants from 60 GBM patient samples. Most frequently mutated genes in our samples included ATRX, PCDHX11, PTEN, TP53, NF1, EGFR, PIK3CA, and SCN9A. Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). EGFR and NF1 mutations were associated with the frontal lobe and temporal lobe in our LEB-GBM cohort, respectively. Our WES analysis confirmed the similarity in mutation signature of the LEB-GBM population with TCGA cohorts. It showed that 1 out of the 50 commonly GBM candidate gene mutations is associated with decreased overall survival among the Lebanese cohort. This study also highlights the need for studies with larger sample sizes to inform clinicians for better prognostication and management of Lebanese patients with GBM. Voltage-gated sodium channels initiate electrical signals and are frequently targeted by deadly gating-modifier neurotoxins, including tarantula toxins, which trap the voltage sensor in its resting state. The structural basis for tarantula-toxin action remains elusive because of the difficulty of capturing the functionally relevant form of the toxin-channel complex. Here, we engineered the model sodium channel NaVAb with voltage-shifting mutations and the toxin-binding site of human NaV1.7, an attractive pain target. This mutant chimera enabled us to determine the cryoelectron microscopy (cryo-EM) structure of the channel functionally arrested by tarantula toxin. Our structure reveals a high-affinity resting-state-specific toxin-channel interaction between a key lysine residue that serves as a "stinger" and penetrates a triad of carboxyl groups in the S3-S4 linker of the voltage sensor. 
By unveiling this high-affinity binding mode, our studies establish a high-resolution channel-docking and resting-state locking mechanism for huwentoxin-IV and provide guidance for developing future resting-state-targeted analgesic drugs. +GRIN2A epilepsy 33531473 33499151 33457012 33420383 33370585 The effects of different forms of monosaccharides on the brain remain unclear, though neuropsychiatric disorders undergo changes in glucose metabolism. This study assessed cell viability responses to five commonly consumed monosaccharides-D-ribose (RIB), D-glucose, D-mannose (MAN), D-xylose and L-arabinose-in cultured neuro-2a cells. Markedly decreased cell viability was observed in cells treated with RIB and MAN. We then showed that high-dose administration of RIB induced depressive- and anxiety-like behavior as well as spatial memory impairment in mice, while high-dose administration of MAN induced anxiety-like behavior and spatial memory impairment only. Moreover, significant pathological changes were observed in the hippocampus of high-dose RIB-treated mice by hematoxylin-eosin staining. Association analysis of the metabolome and transcriptome suggested that the anxiety-like behavior and spatial memory impairment induced by RIB and MAN may be attributed to the changes in four metabolites and 81 genes in the hippocampus, which is involved in amino acid metabolism and serotonin transport. In addition, combined with previous genome-wide association studies on depression, a correlation was found between the levels of Tnni3k and Tbx1 in the hippocampus and RIB induced depressive-like behavior. Finally, metabolite-gene network, qRT-PCR and western blot analysis showed that the insulin-POMC-MEK-TCF7L2 and MAPK-CREB-GRIN2A-CaMKII signaling pathways were respectively associated with RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment. 
Our findings clarified our understanding of the biological mechanisms underlying RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment in mice and highlighted the deleterious effects of high-dose RIB and MAN as long-term energy sources. Elite rugby league and union have some of the highest reported rates of concussion (mild traumatic brain injury) in professional sport due in part to their full-contact high-velocity collision-based nature. Currently, concussions are the most commonly reported match injury during the tackle for both the ball carrier and the tackler (8-28 concussions per 1000 player match hours) and reports exist of reduced cognitive function and long-term health consequences that can end a playing career and produce continued ill health. Concussion is a complex phenotype, influenced by environmental factors and an individual's genetic predisposition. This article reviews concussion incidence within elite rugby and addresses the biomechanics and pathophysiology of concussion and how genetic predisposition may influence incidence, severity and outcome. Associations have been reported between a variety of genetic variants and traumatic brain injury. However, little effort has been devoted to the study of genetic associations with concussion within elite rugby players. Due to a growing understanding of the molecular characteristics underpinning the pathophysiology of concussion, investigating genetic variation within elite rugby is a viable and worthy proposition. Therefore, we propose from this review that several genetic variants within or near candidate genes of interest, namely APOE, MAPT, IL6R, COMT, SLC6A4, 5-HTTLPR, DRD2, DRD4, ANKK1, BDNF and GRIN2A, warrant further study within elite rugby and other sports involving high-velocity collisions. Advanced gastric signet-ring cell carcinoma (SRCC) is a specific type of malignant gastric cancer (GC) with distinct poorer survival. 
Claudin18.2 (CLDN18.2) is a promising neo-biomarker for the treatment of GC. Clinical trials of CLDN18.2-targeted antibody and T cell-based immunotherapy providing promising prospects for the treatment of GC. The effect of antibody therapy depended on the expression rate of CLDN18.2 has been found in clinical trials. This study aimed to determine the prevalence and the therapeutic value of CLDN18.2 in advanced gastric SRCC. Expression of CLDN18.2 in 105 formalin-fixed, paraffin-embedded (FFPE) tumor tissues was detected by immunohistochemistry (IHC) and evaluated according to FAST criteria. Next-generation sequencing (NGS) using 416 pan-cancer genes panel was performed to characterize the genomic landscape in 61 advanced gastric SRCC patients. Fisher's exact test was used to determine gene differences in different CLDN18.2 expression levels. A total number of 105 advanced gastric SRCC samples were analyzed, of which 95.2% (100/105) were positive stained. Moderate-to-strong CLDN18.2 expression was observed in 64.8% (68/105) of all samples. In particularly, 21.0% (22/105) samples had positive staining in more than 90% tumor cells. No significance was found between CLDN18.2 expression and overall survival (OS). NGS results showed that single nucleotide variations (SNVs) could be frequently found in TP53 (26.2%), CDH1 (19.7%), MED12 (18.0%), PKHD1 (18.0%) and ARID1A (11.5%), besides, copy number variations (CNVs) were rich in NOTCH1 (18.0%) and FLT4 (9.8%) in SRCC samples. Moreover, SNVs in GRIN2A was found in 20% of the patients who had CLDN18.2 staining in <40% of tumor cells (P=0.043), indicating CLDN18.2 expression might be related to the aberration of GRIN2A in advanced gastric SRCC. The highly expressed CLDN18.2 among advanced gastric SRCC patients that we found certified the value of CLDN18.2-targeted therapy in this specific type of GC. 
In addition, Analyses between CLDN18.2 expression and genetic abnormalities provided novel therapeutic options for advanced gastric SRCC. The NMDA receptor-mediated Ca<sup>2+</sup> signaling during simultaneous pre- and postsynaptic activity is critically involved in synaptic plasticity and thus has a key role in the nervous system. In GRIN2-variant patients alterations of this coincidence detection provoked complex clinical phenotypes, ranging from reduced muscle strength to epileptic seizures and intellectual disability. By using our gene-targeted mouse line (Grin2a<sup>N615S</sup>), we show that voltage-independent glutamate-gated signaling of GluN2A-containing NMDA receptors is associated with NMDAR-dependent audiogenic seizures due to hyperexcitable midbrain circuits. In contrast, the NMDAR antagonist MK-801-induced c-Fos expression is reduced in the hippocampus. Likewise, the synchronization of theta- and gamma oscillatory activity is lowered during exploration, demonstrating reduced hippocampal activity. This is associated with exploratory hyperactivity and aberrantly increased and dysregulated levels of attention that can interfere with associative learning, in particular when relevant cues and reward outcomes are disconnected in space and time. Together, our findings provide (i) experimental evidence that the inherent voltage-dependent Ca<sup>2+</sup> signaling of NMDA receptors is essential for maintaining appropriate responses to sensory stimuli and (ii) a mechanistic explanation for the neurological manifestations seen in the NMDAR-related human disorders with GRIN2 variant-meidiated intellectual disability and focal epilepsy. Evidence suggested the crucial roles of brain-derived neurotrophic factor (BDNF) and glutamate system functioning in the antidepressant mechanisms of low-dose ketamine infusion in treatment-resistant depression (TRD). 65 patients with TRD were genotyped for 684,616 single nucleotide polymorphisms (SNPs). 
Twelve ketamine-related genes were selected for the gene-based genome-wide association study on the antidepressant effect of ketamine infusion and the resulting serum ketamine and norketamine levels. Specific SNPs and whole genes involved in BDNF-TrkB signaling (i.e., rs2049048 in BDNF and rs10217777 in NTRK2) and the glutamatergic and GABAergic systems (i.e., rs16966731 in GRIN2A) were associated with the rapid (within 240 min) and persistent (up to 2 weeks) antidepressant effect of low-dose ketamine infusion and with serum ketamine and norketamine levels. Our findings confirmed the predictive roles of BDNF-TrkB signaling and glutamatergic and GABAergic systems in the underlying mechanisms of low-dose ketamine infusion for TRD treatment. +ANKRD11 autism 33527450 33476899 33354850 33262785 33179249 To characterize the genetic alterations in adult primary uterine rhabdomyosarcomas (uRMSs) and to investigate whether these tumors are genetically distinct from uterine carcinosarcomas (UCSs). Three tumors originally diagnosed as primary adult pleomorphic uRMS were subjected to massively parallel sequencing targeting 468 cancer-related genes and RNA-sequencing. Mutational profiles were compared to those from UCSs (n=57) obtained from The Cancer Genome Atlas. Sequencing data analyses were performed using validated bioinformatic approaches. Pathogenic TP53 mutations and high levels of genomic instability were detected in the three cases. uRMS1 harbored a likely pathogenic YTHDF2-FOXR1 fusion gene. uRMS2 displayed a PPP2R1A hotspot mutation and amplification of multiple genes, including WHSC1L1, FGFR1, MDM2 and CCNE1, whereas uRMS3 harbored an FBXW7 hotspot mutation and an ANKRD11 homozygous deletion. Hierarchical clustering of somatic mutations and copy number alterations revealed that these tumors initially diagnosed as pleomorphic uRMSs and UCSs were similar. 
Subsequent comprehensive pathologic re-review of the three uRMSs revealed previously un-identified minute pan-cytokeratin-positive atypical glands in one case (uRMS3), favoring its reclassification as UCS with extensive rhabdomyosarcomatous overgrowth. Adult pleomorphic uRMSs harbor TP53 mutations and high levels of copy number alterations. Our findings underscore the challenge in discriminating between uRMS and UCS with rhabdomyosarcomatous differentiation. NA KBG syndrome is a rare genetic disease characterized mainly by skeletal abnormalities, distinctive facial features, and intellectual disability. Heterozygous mutations in ANKRD11 gene, or deletion of 16q24.3 that includes ANKRD11 gene are the cause of KBG syndrome. We describe two patients presenting with short stature and partial facial features, whereas no intellectual disability or hearing loss was observed in them. Two ANKRD11 variants, c.4039_4041del (p. Lys1347del) and c.6427C > G (p. Leu2143Val), were identified in this study. Both of them were classified as variants of uncertain significance (VOUS) by ACMG/AMP guidelines and were inherited from their mothers. ANKRD11 could enhance the transactivation of p21 gene, which was identified to participate in chondrogenic differentiation. In this study, we demonstrated that the knockdown of ANKRD11 could reduce the p21-promoter luciferase activities while re-introduction of wild type ANKRD11, but not ANKRD11 variants (p. Lys1347del or p. Leu2143Val), could restore the p21 levels. Thus, our study report two loss-of-function ANKRD11 variants which might provide new insight on pathogenic mechanism that correlates ANKRD11 variants with the short stature phenotype of KBG syndrome. KBG syndrome (OMIM #148050) is a rare, autosomal dominant inherited genetic disorder caused by heterozygous mutations in the ankyrin repeat domain-containing protein 11 (ANKRD11) gene or by microdeletion of chromosome 16q24.3. 
It is characterized by macrodontia of the upper central incisors, distinctive facial dysmorphism, short stature, vertebral abnormalities, hand anomaly including clinodactyly, and various degrees of developmental delay. KBG syndrome presents with variable clinical feature and severity among individuals. Here, we report two KBG patients who have different novel heterozygous mutations of ANKRD11 gene with wide range of clinical manifestations. Two novel heterozygous mutations of ANKRD11 gene were identified in two unrelated Korean patients with variable clinical presentations. The first patient presented with short stature and early puberty and was treated with growth hormone and gonadotropin-releasing hormone agonist without adverse effects. He had mild intellectual disability. In targeted exome sequencing, a novel de novo frameshift variant was identified in ANKRD11, c.5889del, and p. (Ile1963MetfsX9). The second patient had severe intellectual disability with epilepsy. He had normal height and prepubertal stage at the age of 11 years. He had behavioral problems such as autism-like features, anxiety, and stereotypical movements. Whole exome sequencing (WES) was performed, and the novel heterozygous mutation, c3310dup, p. (Glu110GlyfsTer5) in ANKRD11 was identified. KBG syndrome is often underdiagnosed because of its non-specific features and phenotypic variability. Performing a next-generation sequencing panel, including the ANKRD11 gene for cases of developmental delay with/without short stature may be helpful to identify hitherto undiagnosed KBG syndrome patients. Neurodevelopmental disorders (NDDs) are a heterogeneous group of conditions including intellectual disability, global developmental delay, autism spectrum disorder, and attention deficit hyperactivity disorder. 
Advances in genetic diagnostic technology have led to the identification of a number of NDD-associated genes, but reports of cognitive and developmental outcomes in affected individuals have been variable. The objective of this scoping review is to synthesize available information pertaining to the developmental outcomes of individuals with pathogenic variants in ten emerging recurrent NDD-associated genes identified from large scale sequencing studies; ADNP, ANKRD11, ARID1B, CHD2, CHD8, CTNNB1, DDX3X, DYRK1A, SCN2A, and SYNGAP1. After a comprehensive search, 260 articles were selected that reported on neurodevelopmental measures or diagnoses. We identify the spectrum of developmental outcomes for each genetic NDD, including prevalence of intellectual disability, frequency of co-morbid NDDs such as ADHD and autism, and commonly reported medical issues that can help inform diagnosis and treatment. There are significant gaps in our understanding of the natural history of these conditions. Future research focusing on barriers to assessment, the development of modified assessment tools appropriate for long-term outcomes in genetic NDD, and collection of longitudinal data will increase understanding of prognosis in these conditions and inform evaluations of treatment. +SHANK2 autism 33547379 33515293 33491217 33483523 33383702 West Nile virus (WNV) is a Flavivirus, which can cause febrile illness in humans that may progress to encephalitis. Like any other obligate intracellular pathogens, Flaviviruses hijack cellular protein functions as a strategy for sustaining their life cycle. Many cellular proteins display globular domain known as PDZ domain that interacts with PDZ-Binding Motifs (PBM) identified in many viral proteins. Thus, cellular PDZ-containing proteins are common targets during viral infection. 
The non-structural protein 5 (NS5) from WNV provides both RNA cap methyltransferase and RNA polymerase activities and is involved in viral replication but its interactions with host proteins remain poorly known. In this study, we demonstrate that the C-terminal PBM of WNV NS5 recognizes several human PDZ-containing proteins using both in vitro and in cellulo high-throughput methods. Furthermore, we constructed and assayed in cell culture WNV replicons where the PBM within NS5 was mutated. Our results demonstrate that the PBM of WNV NS5 is important in WNV replication. Moreover, we show that knockdown of the PDZ-containing proteins TJP1, PARD3, ARHGAP21 or SHANK2 results in the decrease of WNV replication in cells. Altogether, our data reveal that interactions between the PBM of NS5 and PDZ-containing proteins affect West Nile virus replication. Olfaction supports a multitude of behaviors vital for social communication and interactions between conspecifics. Intact sensory processing is contingent upon proper circuit wiring. Disturbances in genetic factors controlling circuit assembly and synaptic wiring can lead to neurodevelopmental disorders, such as autism spectrum disorder (ASD), where impaired social interactions and communication are core symptoms. The variability in behavioral phenotype expression is also contingent upon the role environmental factors play in defining genetic expression. Considering the prevailing clinical diagnosis of ASD, research on therapeutic targets for autism is essential. Behavioral impairments may be identified along a range of increasingly complex social tasks. Hence, the assessment of social behavior and communication is progressing towards more ethologically relevant tasks. Garnering a more accurate understanding of social processing deficits in the sensory domain may greatly contribute to the development of therapeutic targets. 
With that framework, studies have found a viable link between social behaviors, circuit wiring, and altered neuronal coding related to the processing of salient social stimuli. Here, the relationship between social odor processing in rodents and humans is examined in the context of health and ASD, with special consideration for how genetic expression and neuronal connectivity may regulate behavioral phenotypes. Impairments in social relationships and awareness are features observed in autism spectrum disorders (ASDs). However, the underlying mechanisms remain poorly understood. Shank2 is a high-confidence ASD candidate gene and localizes primarily to postsynaptic densities (PSDs) of excitatory synapses in the central nervous system (CNS). We show here that loss of Shank2 in mice leads to a lack of social attachment and bonding behavior towards pubs independent of hormonal, cognitive, or sensitive deficits. Shank2<sup>-/-</sup> mice display functional changes in nuclei of the social attachment circuit that were most prominent in the medial preoptic area (MPOA) of the hypothalamus. Selective enhancement of MPOA activity by DREADD technology re-established social bonding behavior in Shank2<sup>-/-</sup> mice, providing evidence that the identified circuit might be crucial for explaining how social deficits in ASD can arise. SHANK2 mutations have been identified in individuals with neurodevelopmental disorders, including intellectual disability and autism spectrum disorders (ASD). Using CRISPR/Cas9 genome editing, we obtained SH-SY5Y cell lines with frameshift mutations on one or both SHANK2 alleles. We investigated the effects of the different SHANK2 mutations on cell morphology, cell proliferation and differentiation potential during early neuronal differentiation. All mutant cell lines showed impaired neuronal differentiation marker expression. 
Cells with bi-allelic SHANK2 mutations revealed diminished apoptosis and increased proliferation, as well as decreased neurite outgrowth during early neuronal differentiation. Bi-allelic SHANK2 mutations resulted in an increase in p-AKT levels, suggesting that SHANK2 mutations impair downstream signaling of tyrosine kinase receptors. Additionally, cells with bi-allelic SHANK2 mutations had lower amyloid precursor protein (APP) expression compared to controls, suggesting a molecular link between SHANK2 and APP. Together, we can show that frameshift mutations on one or both SHANK2 alleles lead to an alteration of neuronal differentiation in SH-SY5Y cells, characterized by changes in cell growth and pre- and postsynaptic protein expression. We also provide first evidence that downstream signaling of tyrosine kinase receptors and amyloid precursor protein expression are affected. Autism spectrum disorder (ASD) is a heterogeneous condition with a complex genetic etiology. The objective of this study is to identify the complex genetic factors that underlie the ASD phenotype and other clinical features of Professor Temple Grandin, an animal scientist and woman with high-functioning ASD. Identifying the underlying genetic cause for ASD can impact medical management, personalize services and treatment, and uncover other medical risks that are associated with the genetic diagnosis. Prof. Grandin underwent chromosomal microarray analysis, whole exome sequencing, and whole genome sequencing, as well as a comprehensive clinical and family history intake. The raw data were analyzed in order to identify possible genotype-phenotype correlations. Genetic testing identified variants in three genes (SHANK2, ALX1, and RELN) that are candidate risk factors for ASD. We identified variants in MEFV and WNT10A, reported to be disease-associated in previous studies, which are likely to contribute to some of her additional clinical features. 
Moreover, candidate variants in genes encoding metabolic enzymes and transporters were identified, some of which suggest potential therapies. This case report describes the genomic findings in Prof. Grandin and it serves as an example to discuss state-of-the-art clinical diagnostics for individuals with ASD, as well as the medical, logistical, and economic hurdles that are involved in clinical genetic testing for an individual on the autism spectrum. +POGZ autism 33377604 33334860 33277917 33203851 33155545 NA Efficient genetic manipulation in the developing central nervous system is crucial for investigating mechanisms of neurodevelopmental disorders and the development of promising therapeutics. Common approaches including transgenic mice and in utero electroporation, although powerful in many aspects, have their own limitations. In this study, we delivered vectors based on the AAV9.PHP.eB pseudo-type to the fetal mouse brain, and achieved widespread and extensive transduction of neural cells. When AAV9.PHP.eB-coding gRNA targeting PogZ or Depdc5 was delivered to Cas9 transgenic mice, widespread gene knockout was also achieved at the whole brain level. Our studies provide a useful platform for studying brain development and devising genetic intervention for severe developmental diseases. White-Sutton syndrome is a rare developmental disorder characterized by global developmental delay, intellectual disabilities (ID), and neurobehavioral abnormalities secondary to pathogenic pogo transposable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. 
Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques. Several genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. 
Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD. Many genes have been linked to autism. However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+5-httlpr adnp akt alx1 amyloid precursor protein ankk1 ankrd11 ankyrin repeat domain-containing protein 11 apoe arhgap21 arid1a arid1b asah1 atrx bdnf brain-derived neurotrophic factor c-fos cacna1a cacna1h camkii ccne1 cdh1 chd2 chd8 clcn2 cldn18 comt creb ctnnb1 ddx3x depdc5 drd2 drd4 dyrk1a egfr fbxw7 fgfr1 flt4 foxr1 glun2a gonadotropin-releasing hormone grin2 grin2a growth hormone il6r itch kcnq2 leb mapt mdm2 mecp2 med12 mefv mek nav1.5 nav1.7 nestin nf1 nlrp5 notch1 ns5 ntrk2 p21 p75 pard3 pik3ca pkhd1 pogz pomc ppp2r1a pten reln runx1 scn10a scn1a scn2a scn8a scn9a shank2 slc2a1 slc6a4 sox10 sox2 syngap1 tbx1 tcf7l2 tjp1 tnni3k tp53 trkb trpa1 trpv1 tsc1 tsc2 whsc1l1 wnt10a ythdf2
+0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0
+1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0
+0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1
+0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_byid Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+ADNP AKT ALX1 ANKRD11 APOE ARID1A ARID1B ATRX BDNF CACNA1A CCNE1 CDH1 CHD2 CHD8 CLDN18 COMT CTNNB1 DDX3X DEPDC5 DRD2 DYRK1A Depdc5 EGFR FGFR1 FLT4 FOXR1 GRIN2 GRIN2A GluN2A IL6R LEB MAPT MDM2 MED12 MEFV NF1 NLRP5 NOTCH1 Nav1.5 Nav1.7 Nestin PIK3CA PKHD1 POGZ PPP2R1A PTEN Pogz RELN SCN1A SCN2A SCN9A SHANK2 SLC6A4 SYNGAP1 Shank2 Sox10 Sox2 TP53 TSC2 TrkB WHSC1L1 WNT10A YTHDF2 amyloid precursor protein c-Fos gonadotropin-releasing hormone growth hormone itch p21 p75
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1
+0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0
+1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 1 1 0 1 0
+0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pmids_to_pubtator_matrix_output_number Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+amyloid precursor protein ankrd11 anxiety asah1 asd autism bdnf cldn18 dravet syndrome embryonic kidney epilepsy gastric srcc itch kbg syndrome learning disabilities memory impairment nav1.7 ns5 p21 pain pogz scn1a scn2a scn9a shank2 short stature tumors white-sutton syndrome
+0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0
+0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
+0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0
+1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
+0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pubmed_by_queries_output Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene GROUPING_disease PMID_1 PMID_2 PMID_3 PMID_4 PMID_5
+SCN1A epilepsy 33565071 33531663 33528079 33519675 33478845
+SCN9A epilepsy 33389681 33370834 33278787 33237934 33232657
+GRIN2A epilepsy 33531473 33499151 33457012 33420383 33370585
+ANKRD11 autism 33527450 33476899 33354850 33262785 33179249
+SHANK2 autism 33547379 33515293 33491217 33483523 33383702
+POGZ autism 33377604 33334860 33277917 33203851 33155545
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/pubmed_by_queries_output_abstracts Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,7 @@ +ID_gene GROUPING_disease ABSTRACT_1 ABSTRACT_2 ABSTRACT_3 ABSTRACT_4 ABSTRACT_5 +SCN1A epilepsy [Analysis of SCN1A gene variants among patients with Dravet syndrome]. To analyze the clinical features and genetic variants in two patients with Dravet syndrome (DS). Peripheral blood samples of the children and their parents were collected for the extraction of genomic DNA and high-throughput sequencing. Suspected variants were confirmed by Sanger sequencing. By high-throughput sequencing, the two children were found to respectively harbor a c.2135delC frameshifting variant in exon 12 and a c.1522G>T nonsense variant in exon 10 of the SCN1A gene. Both variants were predicted to be pathogenic by bioinformatic analysis. Based on the American College of Medical Genetics and Genomics standards and guidelines, the c.2135delC and c.1522G>A variants of the SCN1A gene were predicted to be pathogenic (PVS1+ PS2+ PM2+ PP3). The variants of the SCN1A gene probably underlay the DS in the patients. Above finding has enriched the variant spectrum and enabled genetic counseling for their families. Sodium channelopathies in neurodevelopmental disorders. The voltage-gated sodium channel α-subunit genes comprise a highly conserved gene family. Mutations of three of these genes, SCN1A, SCN2A and SCN8A, are responsible for a significant burden of neurological disease. Recent progress in identification and functional characterization of patient variants is generating new insights and novel approaches to therapy for these devastating disorders. Here we review the basic elements of sodium channel function that are used to characterize patient variants. We summarize a large body of work using global and conditional mouse mutants to characterize the in vivo roles of these channels. 
We provide an overview of the neurological disorders associated with mutations of the human genes and examples of the effects of patient mutations on channel function. Finally, we highlight therapeutic interventions that are emerging from new insights into mechanisms of sodium channelopathies. Customized Targeted Massively Parallel Sequencing Enables More Precisely Diagnosis of Patients with Epilepsy. Advancement in genetic technology has led to the identification of an increasing number of genes in epilepsy. This will provide a huge information in clinical practice and improve diagnosis and treatment of epilepsy. this was a single-center retrospective cohort study of 80 patients who underwent NGS testing with customize epilepsy panel. In total 54 out of 80 patients (67, 5%), pathogenic / likely pathogenic and variants of uncertain significance variants were identified according to ACMG criteria. Pathogenic or likely pathogenic variants (n=35) were identified in 29 out of 80 individuals (36.25%). Variants of uncertain significance (VOUS) (n=34) have identified in 28 out of 80 patients (35%). Pathogenic, likely pathogenic, and variants of uncertain significance (VOUS) were most frequently identified in TSC2 (n = 11), SCN1A (n = 6) and TSC1 (n = 5) genes. Other common genes were KCNQ2 (n = 3), AMT (n = 3), CACNA1H (n = 3), CLCN2 (n = 3), MECP2 (n = 2), ASAH1 (n = 2) and SLC2A1 (n = 2). NGS based testing panels contributes the diagnosis of epilepsy and may change the clinical management by preventing unnecessary and potentially harmful diagnostic procedures and management in patients. Thus, our results highlighted the benefit of genetic testing in children suffered with epilepsy. This article is protected by copyright. All rights reserved. Association Between SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 Polymorphisms and Responsiveness to Antiepileptic Drugs: A Meta-Analysis. 
Background:SCN1A and SCN2A genes have been reported to be associated with the efficacy of single and combined antiepileptic therapy, but the results remain contradictory. Previous meta-analyses on this topic mainly focused on the SCN1A rs3812718 polymorphism. However, meta-analyses focused on SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, or SCN2A rs2304016 polymorphisms are scarce or non-existent. Objective: We aimed to conduct a meta-analysis to determine the effects of SCN1A rs2298771, SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms on resistance to antiepileptic drugs (AEDs). Methods: We searched the PubMed, Embase, Cochrane Library, WANFANG, and CNKI databases up to June 2020 to collect studies on the association of SCN1A and SCN2A polymorphisms with reactivity to AEDs. We calculated the pooled odds ratios (ORs) under the allelic, homozygous, heterozygous, dominant, and recessive genetic models to identify the association between the four single-nucleotide polymorphisms (SNPs) and resistance to AEDs. Results: Our meta-analysis included 19 eligible studies. The results showed that the SCN1A rs2298771 polymorphism was related to AED resistance in the allelic, homozygous, and recessive genetic models (G vs. A: OR = 1.20, 95% CI: 1.012-1.424; GG vs. AA: OR = 1.567, 95% CI: 1.147-2.142; GG vs. AA + AG: OR = 1.408, 95% CI: 1.053-1.882). The homozygous model remained significant after Bonferroni correction (P < 0.0125). Further subgroup analyses demonstrated the significance of the correlation in the dominant model in Caucasians (South Asians) after Bonferroni correction (GG + GA vs. AA: OR = 1.620, 95% CI: 1.165-2.252). However, no association between SCN1A rs2298771 polymorphism and resistance to AEDs was found in Asians or Caucasians (non-South Asians). 
For SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms, the correlations with responsiveness to AEDs were not significant in the overall population nor in any subgroup after conducting the Bonferroni correction. The results for SCN1A rs2298771, SCN1A rs10188577, and SCN2A rs2304016 polymorphisms were stable and reliable according to sensitivity analysis and Begg and Egger tests. However, the results for SCN2A rs17183814 polymorphism have to be treated cautiously owing to the significant publication bias revealed by Begg and Egger tests. Conclusions: The present meta-analysis indicated that SCN1A rs2298771 polymorphism significantly affects resistance to AEDs in the overall population and Caucasians (South Asians). There were no significant correlations between SCN1A rs10188577, SCN2A rs17183814, and SCN2A rs2304016 polymorphisms and resistance to AEDs. Multicenter prospective longitudinal study in 34 patients with Dravet syndrome: Neuropsychological development in the first six years of life. The objective of this study was to identify developmental trajectories of developmental/behavioral phenotypes and possibly their relationship to epilepsy and genotype by analyzing developmental and behavioral features collected prospectively and longitudinally in a cohort of patients with Dravet syndrome (DS). Thirty-four patients from seven Italian tertiary pediatric neurology centers were enrolled in the study. All patients were examined for the SCN1A gene mutation and prospectively assessed from the first years of life with repeated full clinical observations including neurological and developmental examinations. Subjects were found to follow three neurodevelopmental trajectories. In the first group (16 patients), an initial and usually mild decline was observed between the second and the third year of life, specifically concerning visuomotor abilities, later progressing towards global involvement of all abilities. 
The second group (12 patients) showed an earlier onset of global developmental impairment, progressing towards a generally worse outcome. The third group of only two patients ended up with a normal neurodevelopmental quotient, but with behavioral and linguistic problems. The remaining four patients were not classifiable due to a lack of critical assessments just before developmental decline. The neurodevelopmental trajectories described in this study suggest a differential contribution of neurobiological and genetic factors. The profile of the first group, which included the largest fraction of patients, suggests that in the initial phase of the disease, visuomotor defects might play a major role in determining developmental decline. Early diagnosis of milder cases with initial visuomotor impairment may therefore provide new tools for a more accurate habilitation strategy. +SCN9A epilepsy Satellite Glial Cells Give Rise to Nociceptive Sensory Neurons. Dorsal root ganglia (DRG) sensory neurons can transmit information about noxious stimulus to cerebral cortex via spinal cord, and play an important role in the pain pathway. Alterations of the pain pathway lead to CIPA (congenital insensitivity to pain with anhidrosis) or chronic pain. Accumulating evidence demonstrates that nerve damage leads to the regeneration of neurons in DRG, which may contribute to pain modulation in feedback. Therefore, exploring the regeneration process of DRG neurons would provide a new understanding to the persistent pathological stimulation and contribute to reshape the somatosensory function. It has been reported that a subpopulation of satellite glial cells (SGCs) express Nestin and p75, and could differentiate into glial cells and neurons, suggesting that SGCs may have differentiation plasticity. 
Our results in the present study show that DRG-derived SGCs (DRG-SGCs) highly express neural crest cell markers Nestin, Sox2, Sox10, and p75, and differentiate into nociceptive sensory neurons in the presence of histone deacetylase inhibitor VPA, Wnt pathway activator CHIR99021, Notch pathway inhibitor RO4929097, and FGF pathway inhibitor SU5402. The nociceptive sensory neurons express multiple functionally-related genes (SCN9A, SCN10A, SP, Trpv1, and TrpA1) and are able to generate action potentials and voltage-gated Na<sup>+</sup> currents. Moreover, we found that these cells exhibited rapid calcium transients in response to capsaicin through binding to the Trpv1 vanilloid receptor, confirming that the DRG-SGC-derived cells are nociceptive sensory neurons. Further, we show that Wnt signaling promotes the differentiation of DRG-SGCs into nociceptive sensory neurons by regulating the expression of specific transcription factor Runx1, while Notch and FGF signaling pathways are involved in the expression of SCN9A. These results demonstrate that DRG-SGCs have stem cell characteristics and can efficiently differentiate into functional nociceptive sensory neurons, shedding light on the clinical treatment of sensory neuron-related diseases. Computer-aided Discovery of a New Nav1.7 Inhibitor for Treatment of Pain and Itch. Voltage-gated sodium channel Nav1.7 has been validated as a perspective target for selective inhibitors with analgesic and anti-itch activity. The objective of this study was to discover new candidate compounds with Nav1.7 inhibitor properties. The authors hypothesized that their approach would yield at least one new compound that inhibits sodium currents in vitro and exerts analgesic and anti-itch effects in mice. In silico structure-based similarity search of 1.5 million compounds followed by docking to the Nav1.7 voltage sensor of Domain 4 and molecular dynamics simulation was performed. 
Patch clamp experiments in Nav1.7-expressing human embryonic kidney 293 cells and in mouse and human dorsal root ganglion neurons were conducted to test sodium current inhibition. Formalin-induced inflammatory pain model, paclitaxel-induced neuropathic pain model, histamine-induced itch model, and mouse lymphoma model of chronic itch were used to confirm in vivo activity of the selected compound. After in silico screening, nine compounds were selected for experimental assessment in vitro. Of those, four compounds inhibited sodium currents in Nav1.7-expressing human embryonic kidney 293 cells by 29% or greater (P < 0.05). Compound 9 (3-(1-benzyl-1H-indol-3-yl)-3-(3-phenoxyphenyl)-N-(2-(pyrrolidin-1-yl)ethyl)propanamide, referred to as DA-0218) reduced sodium current by 80% with a 50% inhibition concentration of 0.74 μM (95% CI, 0.35 to 1.56 μM), but had no effects on Nav1.5-expressing human embryonic kidney 293 cells. In mouse and human dorsal root ganglion neurons, DA-0218 reduced sodium currents by 17% (95% CI, 6 to 28%) and 22% (95% CI, 9 to 35%), respectively. The inhibition was greatly potentiated in paclitaxel-treated mouse neurons. Intraperitoneal and intrathecal administration of the compound reduced formalin-induced phase II inflammatory pain behavior in mice by 76% (95% CI, 48 to 100%) and 80% (95% CI, 68 to 92%), respectively. Intrathecal administration of DA-0218 produced acute reduction in paclitaxel-induced mechanical allodynia, and inhibited histamine-induced acute itch and lymphoma-induced chronic itch. This study's computer-aided drug discovery approach yielded a new Nav1.7 inhibitor that shows analgesic and anti-pruritic activity in mouse models. High genetic burden in 163 Chinese children with status epilepticus. This study aimed to investigate the genetic aetiology in Chinese children diagnosed with status epilepticus (SE). 
Next-generation sequencing, copy number variation (CNV) analysis, and other genetic testing methods were conducted for children with SE lacking an identifiable non-genetic aetiology. Furthermore, the phenotype and molecular data of patients with SE were retrospectively analysed. Among children with SE lacking an identifiable non-genetic aetiology, 73 out of 163 children (44.8 %) were found to have causative variants associated with SE including 66 monogenic mutations in 22 genes and 7 CNVs. Based on the American College of Medical Genetics and Genomics scoring system, the monogenic variants included 64 pathogenic/likely pathogenic and 2 uncertain significance variants. SCN1A gene mutations (n = 32) were the most common cause, followed by TSC2 (n = 5), CACNA1A (n = 5), SCN2A (n = 4), SCN9A (n = 2) and DEPDC5 (n = 2) gene mutations. Sixteen mutations were identified in single genes. Furthermore, 51 (77.3 %) monogenic mutations were de novo. Age at SE onset < 1 year (odds ratio [OR] = 2.70, 95 % confidence interval [CI]: 1.25-5.83, p = 0.012) and co-morbidity of intellectual disability (OR = 3.36, 95 %CI: 1.61-6.99, p = 0.001) were independently associated with pathogenic genetic variants. This study identified genetic aetiology in 44.8 % of patients with SE, which indicates a high burden of genetic aetiology among children with SE in China. Our findings highlight the importance for genetic testing of children with SE that lacks an identifiable non-genetic aetiology. Correlation of genetic alterations by whole-exome sequencing with clinical outcomes of glioblastoma patients from the Lebanese population. Glioblastoma (GBM) is an aggressive brain tumor associated with high degree of resistance to treatment. Given its heterogeneity, it is important to understand the molecular landscape of this tumor for the development of more effective therapies. 
Because of the different genetic profiles of patients with GBM, we sought to identify genetic variants in Lebanese patients with GBM (LEB-GBM) and compare our findings to those in the Cancer Genome Atlas (TCGA). We performed whole exome sequencing (WES) to identify somatic variants in a cohort of 60 patient-derived GBM samples. We focused our analysis on 50 commonly mutated GBM candidate genes and compared mutation signatures between our population and publicly available GBM data from TCGA. We also cross-tabulated biological covariates to assess for associations with overall survival, time to recurrence and follow-up duration. We included 60 patient-derived GBM samples from 37 males and 23 females, with age ranging from 3 to 80 years (mean and median age at diagnosis were 51 and 56, respectively). Recurrent tumor formation was present in 94.8% of patients (n = 55/58). After filtering, we identified 360 somatic variants from 60 GBM patient samples. After filtering, we identified 360 somatic variants from 60 GBM patient samples. Most frequently mutated genes in our samples included ATRX, PCDHX11, PTEN, TP53, NF1, EGFR, PIK3CA, and SCN9A. Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). Mutations in NLRP5 were associated with decreased overall survival among the Lebanese GBM cohort (p = 0.002). EGFR and NF1 mutations were associated with the frontal lobe and temporal lobe in our LEB-GBM cohort, respectively. Our WES analysis confirmed the similarity in mutation signature of the LEB-GBM population with TCGA cohorts. It showed that 1 out of the 50 commonly GBM candidate gene mutations is associated with decreased overall survival among the Lebanese cohort. This study also highlights the need for studies with larger sample sizes to inform clinicians for better prognostication and management of Lebanese patients with GBM. 
Structural Basis for High-Affinity Trapping of the NaV1.7 Channel in Its Resting State by Tarantula Toxin. Voltage-gated sodium channels initiate electrical signals and are frequently targeted by deadly gating-modifier neurotoxins, including tarantula toxins, which trap the voltage sensor in its resting state. The structural basis for tarantula-toxin action remains elusive because of the difficulty of capturing the functionally relevant form of the toxin-channel complex. Here, we engineered the model sodium channel NaVAb with voltage-shifting mutations and the toxin-binding site of human NaV1.7, an attractive pain target. This mutant chimera enabled us to determine the cryoelectron microscopy (cryo-EM) structure of the channel functionally arrested by tarantula toxin. Our structure reveals a high-affinity resting-state-specific toxin-channel interaction between a key lysine residue that serves as a "stinger" and penetrates a triad of carboxyl groups in the S3-S4 linker of the voltage sensor. By unveiling this high-affinity binding mode, our studies establish a high-resolution channel-docking and resting-state locking mechanism for huwentoxin-IV and provide guidance for developing future resting-state-targeted analgesic drugs.
+GRIN2A epilepsy Chronic D-ribose and D-mannose overload induce depressive/anxiety-like behavior and spatial memory impairment in mice. The effects of different forms of monosaccharides on the brain remain unclear, though neuropsychiatric disorders undergo changes in glucose metabolism. This study assessed cell viability responses to five commonly consumed monosaccharides-D-ribose (RIB), D-glucose, D-mannose (MAN), D-xylose and L-arabinose-in cultured neuro-2a cells. Markedly decreased cell viability was observed in cells treated with RIB and MAN.
We then showed that high-dose administration of RIB induced depressive- and anxiety-like behavior as well as spatial memory impairment in mice, while high-dose administration of MAN induced anxiety-like behavior and spatial memory impairment only. Moreover, significant pathological changes were observed in the hippocampus of high-dose RIB-treated mice by hematoxylin-eosin staining. Association analysis of the metabolome and transcriptome suggested that the anxiety-like behavior and spatial memory impairment induced by RIB and MAN may be attributed to the changes in four metabolites and 81 genes in the hippocampus, which is involved in amino acid metabolism and serotonin transport. In addition, combined with previous genome-wide association studies on depression, a correlation was found between the levels of Tnni3k and Tbx1 in the hippocampus and RIB induced depressive-like behavior. Finally, metabolite-gene network, qRT-PCR and western blot analysis showed that the insulin-POMC-MEK-TCF7L2 and MAPK-CREB-GRIN2A-CaMKII signaling pathways were respectively associated with RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment. Our findings clarified our understanding of the biological mechanisms underlying RIB and MAN induced depressive/anxiety-like behavior and spatial memory impairment in mice and highlighted the deleterious effects of high-dose RIB and MAN as long-term energy sources. Genetic Factors That Could Affect Concussion Risk in Elite Rugby. Elite rugby league and union have some of the highest reported rates of concussion (mild traumatic brain injury) in professional sport due in part to their full-contact high-velocity collision-based nature. 
Currently, concussions are the most commonly reported match injury during the tackle for both the ball carrier and the tackler (8-28 concussions per 1000 player match hours) and reports exist of reduced cognitive function and long-term health consequences that can end a playing career and produce continued ill health. Concussion is a complex phenotype, influenced by environmental factors and an individual's genetic predisposition. This article reviews concussion incidence within elite rugby and addresses the biomechanics and pathophysiology of concussion and how genetic predisposition may influence incidence, severity and outcome. Associations have been reported between a variety of genetic variants and traumatic brain injury. However, little effort has been devoted to the study of genetic associations with concussion within elite rugby players. Due to a growing understanding of the molecular characteristics underpinning the pathophysiology of concussion, investigating genetic variation within elite rugby is a viable and worthy proposition. Therefore, we propose from this review that several genetic variants within or near candidate genes of interest, namely APOE, MAPT, IL6R, COMT, SLC6A4, 5-HTTLPR, DRD2, DRD4, ANKK1, BDNF and GRIN2A, warrant further study within elite rugby and other sports involving high-velocity collisions. Highly expressed Claudin18.2 as a potential therapeutic target in advanced gastric signet-ring cell carcinoma (SRCC). Advanced gastric signet-ring cell carcinoma (SRCC) is a specific type of malignant gastric cancer (GC) with distinct poorer survival. Claudin18.2 (CLDN18.2) is a promising neo-biomarker for the treatment of GC. Clinical trials of CLDN18.2-targeted antibody and T cell-based immunotherapy providing promising prospects for the treatment of GC. The effect of antibody therapy depended on the expression rate of CLDN18.2 has been found in clinical trials. 
This study aimed to determine the prevalence and the therapeutic value of CLDN18.2 in advanced gastric SRCC. Expression of CLDN18.2 in 105 formalin-fixed, paraffin-embedded (FFPE) tumor tissues was detected by immunohistochemistry (IHC) and evaluated according to FAST criteria. Next-generation sequencing (NGS) using 416 pan-cancer genes panel was performed to characterize the genomic landscape in 61 advanced gastric SRCC patients. Fisher's exact test was used to determine gene differences in different CLDN18.2 expression levels. A total number of 105 advanced gastric SRCC samples were analyzed, of which 95.2% (100/105) were positive stained. Moderate-to-strong CLDN18.2 expression was observed in 64.8% (68/105) of all samples. In particularly, 21.0% (22/105) samples had positive staining in more than 90% tumor cells. No significance was found between CLDN18.2 expression and overall survival (OS). NGS results showed that single nucleotide variations (SNVs) could be frequently found in TP53 (26.2%), CDH1 (19.7%), MED12 (18.0%), PKHD1 (18.0%) and ARID1A (11.5%), besides, copy number variations (CNVs) were rich in NOTCH1 (18.0%) and FLT4 (9.8%) in SRCC samples. Moreover, SNVs in GRIN2A was found in 20% of the patients who had CLDN18.2 staining in <40% of tumor cells (P=0.043), indicating CLDN18.2 expression might be related to the aberration of GRIN2A in advanced gastric SRCC. The highly expressed CLDN18.2 among advanced gastric SRCC patients that we found certified the value of CLDN18.2-targeted therapy in this specific type of GC. In addition, Analyses between CLDN18.2 expression and genetic abnormalities provided novel therapeutic options for advanced gastric SRCC. Voltage-independent GluN2A-type NMDA receptor Ca<sup>2+</sup> signaling promotes audiogenic seizures, attentional and cognitive deficits in mice. 
The NMDA receptor-mediated Ca<sup>2+</sup> signaling during simultaneous pre- and postsynaptic activity is critically involved in synaptic plasticity and thus has a key role in the nervous system. In GRIN2-variant patients alterations of this coincidence detection provoked complex clinical phenotypes, ranging from reduced muscle strength to epileptic seizures and intellectual disability. By using our gene-targeted mouse line (Grin2a<sup>N615S</sup>), we show that voltage-independent glutamate-gated signaling of GluN2A-containing NMDA receptors is associated with NMDAR-dependent audiogenic seizures due to hyperexcitable midbrain circuits. In contrast, the NMDAR antagonist MK-801-induced c-Fos expression is reduced in the hippocampus. Likewise, the synchronization of theta- and gamma oscillatory activity is lowered during exploration, demonstrating reduced hippocampal activity. This is associated with exploratory hyperactivity and aberrantly increased and dysregulated levels of attention that can interfere with associative learning, in particular when relevant cues and reward outcomes are disconnected in space and time. Together, our findings provide (i) experimental evidence that the inherent voltage-dependent Ca<sup>2+</sup> signaling of NMDA receptors is essential for maintaining appropriate responses to sensory stimuli and (ii) a mechanistic explanation for the neurological manifestations seen in the NMDAR-related human disorders with GRIN2 variant-meidiated intellectual disability and focal epilepsy. Treatment response to low-dose ketamine infusion for treatment-resistant depression: A gene-based genome-wide association study. Evidence suggested the crucial roles of brain-derived neurotrophic factor (BDNF) and glutamate system functioning in the antidepressant mechanisms of low-dose ketamine infusion in treatment-resistant depression (TRD). 65 patients with TRD were genotyped for 684,616 single nucleotide polymorphisms (SNPs). 
Twelve ketamine-related genes were selected for the gene-based genome-wide association study on the antidepressant effect of ketamine infusion and the resulting serum ketamine and norketamine levels. Specific SNPs and whole genes involved in BDNF-TrkB signaling (i.e., rs2049048 in BDNF and rs10217777 in NTRK2) and the glutamatergic and GABAergic systems (i.e., rs16966731 in GRIN2A) were associated with the rapid (within 240 min) and persistent (up to 2 weeks) antidepressant effect of low-dose ketamine infusion and with serum ketamine and norketamine levels. Our findings confirmed the predictive roles of BDNF-TrkB signaling and glutamatergic and GABAergic systems in the underlying mechanisms of low-dose ketamine infusion for TRD treatment.
+ANKRD11 autism Genetic characterization of adult primary pleomorphic uterine rhabdomyosarcoma and comparison with uterine carcinosarcoma. To characterize the genetic alterations in adult primary uterine rhabdomyosarcomas (uRMSs) and to investigate whether these tumors are genetically distinct from uterine carcinosarcomas (UCSs). Three tumors originally diagnosed as primary adult pleomorphic uRMS were subjected to massively parallel sequencing targeting 468 cancer-related genes and RNA-sequencing. Mutational profiles were compared to those from UCSs (n=57) obtained from The Cancer Genome Atlas. Sequencing data analyses were performed using validated bioinformatic approaches. Pathogenic TP53 mutations and high levels of genomic instability were detected in the three cases. uRMS1 harbored a likely pathogenic YTHDF2-FOXR1 fusion gene. uRMS2 displayed a PPP2R1A hotspot mutation and amplification of multiple genes, including WHSC1L1, FGFR1, MDM2 and CCNE1, whereas uRMS3 harbored an FBXW7 hotspot mutation and an ANKRD11 homozygous deletion. Hierarchical clustering of somatic mutations and copy number alterations revealed that these tumors initially diagnosed as pleomorphic uRMSs and UCSs were similar.
Subsequent comprehensive pathologic re-review of the three uRMSs revealed previously un-identified minute pan-cytokeratin-positive atypical glands in one case (uRMS3), favoring its reclassification as UCS with extensive rhabdomyosarcomatous overgrowth. Adult pleomorphic uRMSs harbor TP53 mutations and high levels of copy number alterations. Our findings underscore the challenge in discriminating between uRMS and UCS with rhabdomyosarcomatous differentiation. Electroclinical features and outcome of ANKRD11-related KBG syndrome: A novel report and literature review. NA Two loss-of-function ANKRD11 variants in Chinese patients with short stature and a possible molecular pathway. KBG syndrome is a rare genetic disease characterized mainly by skeletal abnormalities, distinctive facial features, and intellectual disability. Heterozygous mutations in ANKRD11 gene, or deletion of 16q24.3 that includes ANKRD11 gene are the cause of KBG syndrome. We describe two patients presenting with short stature and partial facial features, whereas no intellectual disability or hearing loss was observed in them. Two ANKRD11 variants, c.4039_4041del (p. Lys1347del) and c.6427C > G (p. Leu2143Val), were identified in this study. Both of them were classified as variants of uncertain significance (VOUS) by ACMG/AMP guidelines and were inherited from their mothers. ANKRD11 could enhance the transactivation of p21 gene, which was identified to participate in chondrogenic differentiation. In this study, we demonstrated that the knockdown of ANKRD11 could reduce the p21-promoter luciferase activities while re-introduction of wild type ANKRD11, but not ANKRD11 variants (p. Lys1347del or p. Leu2143Val), could restore the p21 levels. Thus, our study report two loss-of-function ANKRD11 variants which might provide new insight on pathogenic mechanism that correlates ANKRD11 variants with the short stature phenotype of KBG syndrome. 
Two Novel Mutations of ANKRD11 Gene and Wide Clinical Spectrum in KBG Syndrome: Case Reports and Literature Review. KBG syndrome (OMIM #148050) is a rare, autosomal dominant inherited genetic disorder caused by heterozygous mutations in the ankyrin repeat domain-containing protein 11 (ANKRD11) gene or by microdeletion of chromosome 16q24.3. It is characterized by macrodontia of the upper central incisors, distinctive facial dysmorphism, short stature, vertebral abnormalities, hand anomaly including clinodactyly, and various degrees of developmental delay. KBG syndrome presents with variable clinical feature and severity among individuals. Here, we report two KBG patients who have different novel heterozygous mutations of ANKRD11 gene with wide range of clinical manifestations. Two novel heterozygous mutations of ANKRD11 gene were identified in two unrelated Korean patients with variable clinical presentations. The first patient presented with short stature and early puberty and was treated with growth hormone and gonadotropin-releasing hormone agonist without adverse effects. He had mild intellectual disability. In targeted exome sequencing, a novel de novo frameshift variant was identified in ANKRD11, c.5889del, and p. (Ile1963MetfsX9). The second patient had severe intellectual disability with epilepsy. He had normal height and prepubertal stage at the age of 11 years. He had behavioral problems such as autism-like features, anxiety, and stereotypical movements. Whole exome sequencing (WES) was performed, and the novel heterozygous mutation, c3310dup, p. (Glu110GlyfsTer5) in ANKRD11 was identified. KBG syndrome is often underdiagnosed because of its non-specific features and phenotypic variability. Performing a next-generation sequencing panel, including the ANKRD11 gene for cases of developmental delay with/without short stature may be helpful to identify hitherto undiagnosed KBG syndrome patients. 
Description of neurodevelopmental phenotypes associated with 10 genetic neurodevelopmental disorders: A scoping review. Neurodevelopmental disorders (NDDs) are a heterogeneous group of conditions including intellectual disability, global developmental delay, autism spectrum disorder, and attention deficit hyperactivity disorder. Advances in genetic diagnostic technology have led to the identification of a number of NDD-associated genes, but reports of cognitive and developmental outcomes in affected individuals have been variable. The objective of this scoping review is to synthesize available information pertaining to the developmental outcomes of individuals with pathogenic variants in ten emerging recurrent NDD-associated genes identified from large scale sequencing studies; ADNP, ANKRD11, ARID1B, CHD2, CHD8, CTNNB1, DDX3X, DYRK1A, SCN2A, and SYNGAP1. After a comprehensive search, 260 articles were selected that reported on neurodevelopmental measures or diagnoses. We identify the spectrum of developmental outcomes for each genetic NDD, including prevalence of intellectual disability, frequency of co-morbid NDDs such as ADHD and autism, and commonly reported medical issues that can help inform diagnosis and treatment. There are significant gaps in our understanding of the natural history of these conditions. Future research focusing on barriers to assessment, the development of modified assessment tools appropriate for long-term outcomes in genetic NDD, and collection of longitudinal data will increase understanding of prognosis in these conditions and inform evaluations of treatment.
+SHANK2 autism Role of PDZ-binding motif from West Nile virus NS5 protein on viral replication. West Nile virus (WNV) is a Flavivirus, which can cause febrile illness in humans that may progress to encephalitis. Like any other obligate intracellular pathogens, Flaviviruses hijack cellular protein functions as a strategy for sustaining their life cycle.
Many cellular proteins display globular domain known as PDZ domain that interacts with PDZ-Binding Motifs (PBM) identified in many viral proteins. Thus, cellular PDZ-containing proteins are common targets during viral infection. The non-structural protein 5 (NS5) from WNV provides both RNA cap methyltransferase and RNA polymerase activities and is involved in viral replication but its interactions with host proteins remain poorly known. In this study, we demonstrate that the C-terminal PBM of WNV NS5 recognizes several human PDZ-containing proteins using both in vitro and in cellulo high-throughput methods. Furthermore, we constructed and assayed in cell culture WNV replicons where the PBM within NS5 was mutated. Our results demonstrate that the PBM of WNV NS5 is important in WNV replication. Moreover, we show that knockdown of the PDZ-containing proteins TJP1, PARD3, ARHGAP21 or SHANK2 results in the decrease of WNV replication in cells. Altogether, our data reveal that interactions between the PBM of NS5 and PDZ-containing proteins affect West Nile virus replication. Genetic influences of autism candidate genes on circuit wiring and olfactory decoding. Olfaction supports a multitude of behaviors vital for social communication and interactions between conspecifics. Intact sensory processing is contingent upon proper circuit wiring. Disturbances in genetic factors controlling circuit assembly and synaptic wiring can lead to neurodevelopmental disorders, such as autism spectrum disorder (ASD), where impaired social interactions and communication are core symptoms. The variability in behavioral phenotype expression is also contingent upon the role environmental factors play in defining genetic expression. Considering the prevailing clinical diagnosis of ASD, research on therapeutic targets for autism is essential. Behavioral impairments may be identified along a range of increasingly complex social tasks. 
Hence, the assessment of social behavior and communication is progressing towards more ethologically relevant tasks. Garnering a more accurate understanding of social processing deficits in the sensory domain may greatly contribute to the development of therapeutic targets. With that framework, studies have found a viable link between social behaviors, circuit wiring, and altered neuronal coding related to the processing of salient social stimuli. Here, the relationship between social odor processing in rodents and humans is examined in the context of health and ASD, with special consideration for how genetic expression and neuronal connectivity may regulate behavioral phenotypes. Activation of the medial preoptic area (MPOA) ameliorates loss of maternal behavior in a Shank2 mouse model for autism. Impairments in social relationships and awareness are features observed in autism spectrum disorders (ASDs). However, the underlying mechanisms remain poorly understood. Shank2 is a high-confidence ASD candidate gene and localizes primarily to postsynaptic densities (PSDs) of excitatory synapses in the central nervous system (CNS). We show here that loss of Shank2 in mice leads to a lack of social attachment and bonding behavior towards pubs independent of hormonal, cognitive, or sensitive deficits. Shank2<sup>-/-</sup> mice display functional changes in nuclei of the social attachment circuit that were most prominent in the medial preoptic area (MPOA) of the hypothalamus. Selective enhancement of MPOA activity by DREADD technology re-established social bonding behavior in Shank2<sup>-/-</sup> mice, providing evidence that the identified circuit might be crucial for explaining how social deficits in ASD can arise. SHANK2 mutations impair apoptosis, proliferation and neurite outgrowth during early neuronal differentiation in SH-SY5Y cells. 
SHANK2 mutations have been identified in individuals with neurodevelopmental disorders, including intellectual disability and autism spectrum disorders (ASD). Using CRISPR/Cas9 genome editing, we obtained SH-SY5Y cell lines with frameshift mutations on one or both SHANK2 alleles. We investigated the effects of the different SHANK2 mutations on cell morphology, cell proliferation and differentiation potential during early neuronal differentiation. All mutant cell lines showed impaired neuronal differentiation marker expression. Cells with bi-allelic SHANK2 mutations revealed diminished apoptosis and increased proliferation, as well as decreased neurite outgrowth during early neuronal differentiation. Bi-allelic SHANK2 mutations resulted in an increase in p-AKT levels, suggesting that SHANK2 mutations impair downstream signaling of tyrosine kinase receptors. Additionally, cells with bi-allelic SHANK2 mutations had lower amyloid precursor protein (APP) expression compared to controls, suggesting a molecular link between SHANK2 and APP. Together, we can show that frameshift mutations on one or both SHANK2 alleles lead to an alteration of neuronal differentiation in SH-SY5Y cells, characterized by changes in cell growth and pre- and postsynaptic protein expression. We also provide first evidence that downstream signaling of tyrosine kinase receptors and amyloid precursor protein expression are affected. The Temple Grandin Genome: Comprehensive Analysis in a Scientist with High-Functioning Autism. Autism spectrum disorder (ASD) is a heterogeneous condition with a complex genetic etiology. The objective of this study is to identify the complex genetic factors that underlie the ASD phenotype and other clinical features of Professor Temple Grandin, an animal scientist and woman with high-functioning ASD. 
Identifying the underlying genetic cause for ASD can impact medical management, personalize services and treatment, and uncover other medical risks that are associated with the genetic diagnosis. Prof. Grandin underwent chromosomal microarray analysis, whole exome sequencing, and whole genome sequencing, as well as a comprehensive clinical and family history intake. The raw data were analyzed in order to identify possible genotype-phenotype correlations. Genetic testing identified variants in three genes (SHANK2, ALX1, and RELN) that are candidate risk factors for ASD. We identified variants in MEFV and WNT10A, reported to be disease-associated in previous studies, which are likely to contribute to some of her additional clinical features. Moreover, candidate variants in genes encoding metabolic enzymes and transporters were identified, some of which suggest potential therapies. This case report describes the genomic findings in Prof. Grandin and it serves as an example to discuss state-of-the-art clinical diagnostics for individuals with ASD, as well as the medical, logistical, and economic hurdles that are involved in clinical genetic testing for an individual on the autism spectrum.
+POGZ autism A case of White-Sutton syndrome with previously described loss-of-function variant in DDE domain of POGZ (p.Arg1211*) and Kartagener syndrome. NA Widespread labeling and genomic editing of the fetal central nervous system by in utero CRISPR AAV9-PHP.eB administration. Efficient genetic manipulation in the developing central nervous system is crucial for investigating mechanisms of neurodevelopmental disorders and the development of promising therapeutics. Common approaches including transgenic mice and in utero electroporation, although powerful in many aspects, have their own limitations. In this study, we delivered vectors based on the AAV9.PHP.eB pseudo-type to the fetal mouse brain, and achieved widespread and extensive transduction of neural cells.
When AAV9.PHP.eB-coding gRNA targeting PogZ or Depdc5 was delivered to Cas9 transgenic mice, widespread gene knockout was also achieved at the whole brain level. Our studies provide a useful platform for studying brain development and devising genetic intervention for severe developmental diseases. Neuropsychological study in 19 French patients with White-Sutton syndrome and POGZ mutations. White-Sutton syndrome is a rare developmental disorder characterized by global developmental delay, intellectual disabilities (ID), and neurobehavioral abnormalities secondary to pathogenic pogo transposable element-derived protein with zinc finger domain (POGZ) variants. The purpose of our study was to describe the neurocognitive phenotype of an unbiased national cohort of patients with identified POGZ pathogenic variants. This study is based on a French collaboration through the AnDDI-Rares network, and includes 19 patients from 18 families with POGZ pathogenic variants. All clinical data and neuropsychological tests were collected from medical files. Among the 19 patients, 14 patients exhibited ID (six mild, five moderate and three severe). The five remaining patients had learning disabilities and shared a similar neurocognitive profile, including language difficulties, dysexecutive syndrome, attention disorders, slowness, and social difficulties. One patient evaluated for autism was found to have moderate autism spectrum disorder. This study reveals that the cognitive phenotype of patients with POGZ pathogenic variants can range from learning disabilities to severe ID. It highlights that pathogenic variations in the same genes can be reported in a large spectrum of neurocognitive profiles, and that children with learning disabilities could benefit from next generation sequencing techniques. Pogz deficiency leads to transcription dysregulation and impaired cerebellar activity underlying autism-like behavior in mice. 
Several genes implicated in autism spectrum disorder (ASD) are chromatin regulators, including POGZ. The cellular and molecular mechanisms leading to ASD impaired social and cognitive behavior are unclear. Animal models are crucial for studying the effects of mutations on brain function and behavior as well as unveiling the underlying mechanisms. Here, we generate a brain specific conditional knockout mouse model deficient for Pogz, an ASD risk gene. We demonstrate that Pogz deficient mice show microcephaly, growth impairment, increased sociability, learning and motor deficits, mimicking several of the human symptoms. At the molecular level, luciferase reporter assay indicates that POGZ is a negative regulator of transcription. In accordance, in Pogz deficient mice we find a significant upregulation of gene expression, most notably in the cerebellum. Gene set enrichment analysis revealed that the transcriptional changes encompass genes and pathways disrupted in ASD, including neurogenesis and synaptic processes, underlying the observed behavioral phenotype in mice. Physiologically, Pogz deficiency is associated with a reduction in the firing frequency of simple and complex spikes and an increase in amplitude of the inhibitory synaptic input in cerebellar Purkinje cells. Our findings support a mechanism linking heterochromatin dysregulation to cerebellar circuit dysfunction and behavioral abnormalities in ASD. Altered hippocampal-prefrontal communication during anxiety-related avoidance in mice deficient for the autism-associated gene Pogz. Many genes have been linked to autism. However, it remains unclear what long-term changes in neural circuitry result from disruptions in these genes, and how these circuit changes might contribute to abnormal behaviors. To address these questions, we studied behavior and physiology in mice heterozygous for Pogz, a high confidence autism gene. 
Pogz<sup>+/-</sup> mice exhibit reduced anxiety-related avoidance in the elevated plus maze (EPM). Theta-frequency communication between the ventral hippocampus (vHPC) and medial prefrontal cortex (mPFC) is known to be necessary for normal avoidance in the EPM. We found deficient theta-frequency synchronization between the vHPC and mPFC in vivo. When we examined vHPC-mPFC communication at higher resolution, vHPC input onto prefrontal GABAergic interneurons was specifically disrupted, whereas input onto pyramidal neurons remained intact. These findings illustrate how the loss of a high confidence autism gene can impair long-range communication by causing inhibitory circuit dysfunction within pathways important for specific behaviors.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/test_data Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+ID_gene GROUPING_disease
+SCN1A epilepsy
+SCN9A epilepsy
+GRIN2A epilepsy
+ANKRD11 autism
+SHANK2 autism
+POGZ autism
\ No newline at end of file
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/text_to_wordmatrix_output Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,7 @@
+scn1a patient variant scn2a gene polymorphism genetic pathogenic aeds epilepsy rs2298771 developmental resistance result rs10188577 rs17183814 rs2304016 significant study asian association channel clinical diagnosis dravet first group identified metaanalysis model mutation neurodevelopmental n 3 sequencing significance sodium syndrome analysis antiepileptic bonferroni caucasian child correction correlation decline disorder found global homozygous however gbm neuron cell pain compound sensory associated inhibitor mouse nav17 aetiology cohort current human lebanese nociceptive pathway sample among itch new analgesic overall respectively scn9a survival 293 activity age candidate chronic da0218 decreased differentiate dorsal drg drgsgcs embryonic cldn182 concussion gastric srcc advanced expression rib behavior ketamine man elite impairment induced memory rugby signaling spatial within effect infusion level grin2a highdose hippocampus lowdose nmda reduced system treatment 180 antidepressant anxietylike bdnf brain change depression depressiveanxietylike due ankrd11 kbg two disability feature intellectual novel short stature heterozygous including outcome report adult case pleomorphic review urmss uterine alteration condition delay facial individual number primary spectrum three tumor ucss variable 16q243 abnormalities assessment autism shank2 protein social asd neuronal wnv circuit differentiation can pbm replication factor grandin interaction may pdzcontaining processing viral wiring behavioral biallelic cellular communication complex deficit domain early genome lead medical mpoa nile phenotype pogz deficient disabilities learning mechanism avoidance cerebellar input neurocognitive severe underlying vhpc whitesutton widespread aav9phpeb achieved anxietyrelated based central cognitive confidence crucial deficiency delivered
development difficulties disrupted +1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/text_to_wordmatrix_output_args Wed Mar 24 08:33:25 2021 +0000 @@ -0,0 +1,7 @@ +the and scna patients were with variants n for pathogenic polymorphisms genetic genes epilepsy aeds gene results this resistance developmental significant that study was between polymorphism clinical dravet sequencing syndrome are neurodevelopmental sodium diagnosis identified significance asians association metaanalysis first group analysis children found their two variant channel disorders from neurons gbm our pain mutations cells sensory nav associated inhibitor nociceptive human aetiology lebanese new pathway itch model mouse among cohort samples currents into analgesic compound compounds respectively overall survival cldn gastric srcc rib advanced expression behavior man concussion ketamine impairment induced memory signaling spatial elite rugby within levels mice infusion cell highdose hippocampus grina reduced treatment nmda lowdose anxietylike brain ankrd kbg urms novel disability intellectual short stature including features heterozygous adult pleomorphic these urmss uterine review had outcomes alterations mutation number primary three tumors ucss shank social asd autism proteins wnv neuronal protein circuit differentiation can pbm replication spectrum both during interactions may pdzcontaining viral candidate factors processing wiring grandin cellular domain nile other show targets virus west pogz deficient disabilities learning communication whitesutton have mechanisms severe widespread disorder neurocognitive phenotype cerebellar changes input underlying avoidance vhpc aavphpeb achieved +1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +1 1 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test/commands_tests Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,37 @@
+#commands to test the tools with "test_data"
+
+ $ cd <path>/simtext
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --output "test-data/pubmed_by_queries_output" --install_packages
+ #output: test-data/pubmed_by_queries_output
+
+ $ Rscript pubmed_by_queries.R --input "test-data/test_data" --abstract --output "test-data/pubmed_by_queries_output_abstracts" --install_packages
+ #output: test-data/pubmed_by_queries_output_abstracts
+
+ $ Rscript abstracts_by_pmids.R --input "test-data/pubmed_by_queries_output" --output "test-data/abstracts_by_pmids_output" --install_packages
+ #output: test-data/abstracts_by_pmids_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output" --install_packages
+ #output: test-data/text_to_wordmatrix_output
+
+ $ Rscript text_to_wordmatrix.R --input "test-data/pubmed_by_queries_output_abstracts" --output "test-data/text_to_wordmatrix_output_args" --remove_num --remove_stopwords --plurals --install_packages
+ #output: test-data/text_to_wordmatrix_output_args
+
+ $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output" --number 50 --categories Gene Mutation --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output
+
+ $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_byid" --number 50 --categories Gene Disease --install_packages --byid
+ #output: test-data/pmids_to_pubtator_matrix_output_byid
+
+ $ Rscript pmids_to_pubtator_matrix.R --input "test-data/pubmed_by_queries_output" --output "test-data/pmids_to_pubtator_matrix_output_number" --number 5 --categories Gene Disease --install_packages
+ #output: test-data/pmids_to_pubtator_matrix_output_number
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/text_to_wordmatrix_output" --install_packages
+ #output: ShinyApp
+
+ $ Rscript simtext_app.R -i "test-data/test_data" -m "test-data/pmids_to_pubtator_matrix_output" --install_packages
+ #output: ShinyApp
+
+
+
+
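The `text_to_wordmatrix.R` call with `--remove_num --remove_stopwords --plurals` is easy to misread: in the script's argument parser, `--lower_case`, `--remove_stopwords` and `--plurals` are declared with `store_false` and default to `TRUE`, so passing them switches the corresponding step off. The sketch below mimics that behaviour in base R only. `parse_flags` is a hypothetical helper written for illustration; it is not part of the tool and does not use the argparse package.

```r
# Hypothetical helper (not part of the tool): mimics the store_true /
# store_false actions declared in text_to_wordmatrix.R using base R only.
parse_flags <- function(cli_args) {
  list(
    remove_num       = "--remove_num" %in% cli_args,           # store_true, default FALSE
    lower_case       = !("--lower_case" %in% cli_args),        # store_false, default TRUE
    remove_stopwords = !("--remove_stopwords" %in% cli_args),  # store_false, default TRUE
    stemDoc          = "--stemDoc" %in% cli_args,              # store_true, default FALSE
    plurals          = !("--plurals" %in% cli_args)            # store_false, default TRUE
  )
}

# No optional flags: numbers are kept, text is lower-cased, stopwords are
# removed and plurals are merged to the singular form.
defaults <- parse_flags(character(0))

# Flags used for "text_to_wordmatrix_output_args": numbers are removed,
# but stopword removal and plural merging are switched off.
with_args <- parse_flags(c("--remove_num", "--remove_stopwords", "--plurals"))
str(with_args)
```

Read this way, the `_args` test run keeps stopwords and plural forms in the extracted vocabulary, while the default run removes or merges them.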
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/text_to_wordmatrix.R Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,106 @@
+#!/usr/bin/env Rscript
+# tool: text_to_wordmatrix
+#
+#The tool extracts the most frequent words per entity (per row). Text in columns starting with "ABSTRACT" or "TEXT" is considered.
+#All extracted terms are used to generate a word matrix with rows = entities and columns = extracted words.
+#The resulting matrix is binary with 0 = word not present in abstracts of entity and 1 = word present in abstracts of entity.
+#
+#Input: Output of "pubmed_by_queries" or "abstracts_by_pmids", or tab-delimited table with entities in column called “ID_<name>”,
+#e.g. “ID_genes” and text in columns starting with "ABSTRACT" or "TEXT".
+#
+#Output: Binary matrix with rows = entities and columns = extracted words.
+#
+#usage: text_to_wordmatrix.R [-h] [-i INPUT] [-o OUTPUT] [-n NUMBER] [-r] [-l] [-w] [-s] [-p]
+#
+# optional arguments:
+# -h, --help                   show help message
+# -i INPUT, --input INPUT      input file name. add path if file is not in working directory
+# -o OUTPUT, --output OUTPUT   output file name. [default "text_to_wordmatrix_output"]
+# -n NUMBER, --number NUMBER   number of most frequent words that should be extracted [default "50"]
+# -r, --remove_num             remove any numbers in text
+# -l, --lower_case             by default all characters are translated to lower case. otherwise use -l
+# -w, --remove_stopwords       by default a set of English stopwords (e.g., "the" or "not") are removed. otherwise use -w
+# -s, --stemDoc                apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary
+# -p, --plurals                by default words in plural and singular are merged to the singular form. otherwise use -p
+
+if ("--install_packages" %in% commandArgs()) {
+  print("Installing packages")
+  if (!require("argparse")) install.packages("argparse", repos = "http://cran.rstudio.com/")
+  if (!require("PubMedWordcloud")) install.packages("PubMedWordcloud", repos = "http://cran.rstudio.com/")
+  if (!require("SnowballC")) install.packages("SnowballC", repos = "http://cran.rstudio.com/")
+  if (!require("textclean")) install.packages("textclean", repos = "http://cran.rstudio.com/")
+  if (!require("SemNetCleaner")) install.packages("SemNetCleaner", repos = "http://cran.rstudio.com/")
+  if (!require("stringi")) install.packages("stringi", repos = "http://cran.rstudio.com/")
+  if (!require("stringr")) install.packages("stringr", repos = "http://cran.rstudio.com/")
+}
+
+suppressPackageStartupMessages(library("argparse"))
+suppressPackageStartupMessages(library("PubMedWordcloud"))
+suppressPackageStartupMessages(library("SnowballC"))
+suppressPackageStartupMessages(library("SemNetCleaner"))
+suppressPackageStartupMessages(library("textclean"))
+suppressPackageStartupMessages(library("stringi"))
+suppressPackageStartupMessages(library("stringr"))
+
+parser <- ArgumentParser()
+parser$add_argument("-i", "--input",
+                    help = "input file name. add path if file is not in working directory")
+parser$add_argument("-o", "--output", default = "text_to_wordmatrix_output",
+                    help = "output file name. [default \"%(default)s\"]")
+parser$add_argument("-n", "--number", type = "integer", default = 50, choices = seq(1, 500), metavar = "{1..500}",
+                    help = "number of most frequent words used per ID in word matrix [default \"%(default)s\"]")
+parser$add_argument("-r", "--remove_num", action = "store_true", default = FALSE,
+                    help = "remove any numbers in text")
+parser$add_argument("-l", "--lower_case", action = "store_false", default = TRUE,
+                    help = "by default all characters are translated to lower case. otherwise use -l")
+parser$add_argument("-w", "--remove_stopwords", action = "store_false", default = TRUE,
+                    help = "by default a set of English stopwords (e.g., 'the' or 'not') are removed. otherwise use -w")
+parser$add_argument("-s", "--stemDoc", action = "store_true", default = FALSE,
+                    help = "apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary")
+parser$add_argument("-p", "--plurals", action = "store_false", default = TRUE,
+                    help = "by default words in plural and singular are merged to the singular form. otherwise use -p")
+parser$add_argument("--install_packages", action = "store_true", default = FALSE,
+                    help = "If you want to auto install missing required packages.")
+
+args <- parser$parse_args()
+
+
+data <- read.delim(args$input, stringsAsFactors = FALSE, header = TRUE, sep = "\t")
+word_matrix <- data.frame()
+
+text_cols_index <- grep("^(ABSTRACT|TEXT)", names(data)) # columns starting with ABSTRACT or TEXT
+
+for (row in seq_len(nrow(data))) {
+  top_words <- cleanAbstracts(abstracts = data[row, text_cols_index],
+                              rmNum = args$remove_num,
+                              tolw = args$lower_case,
+                              rmWords = args$remove_stopwords,
+                              stemDoc = args$stemDoc)
+
+  top_words$word <- as.character(top_words$word)
+
+  cat("Most frequent words for row", row, "are extracted.\n")
+
+  if (args$plurals) { # merge plural and singular forms, summing their frequencies
+    top_words$word <- sapply(top_words$word, function(x) {
+      singularize(x)
+    })
+    top_words <- aggregate(freq~word, top_words, sum)
+  }
+
+  top_words <- top_words[order(top_words$freq, decreasing = TRUE), ]
+  top_words$word <- as.character(top_words$word)
+
+  number_extract <- min(args$number, nrow(top_words))
+  word_matrix[row, sapply(1:number_extract, function(x) {
+    paste0(top_words$word[x])
+  })] <- top_words$freq[1:number_extract]
+}
+
+word_matrix <- as.matrix(word_matrix)
+word_matrix[is.na(word_matrix)] <- 0 # words never extracted for a row
+word_matrix <- (word_matrix > 0) * 1 #binary matrix
+
+cat("A matrix with", nrow(word_matrix), "rows and", ncol(word_matrix), "columns is generated.\n")
+
+write.table(word_matrix, args$output, row.names = FALSE, sep = "\t", quote = FALSE)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/text_to_wordmatrix.xml Wed Mar 24 08:33:25 2021 +0000
@@ -0,0 +1,93 @@
+<tool id="text_to_wordmatrix" name="Text to wordmatrix" version="@VERSION@" license="MIT">
+    <description>by extracting most frequent words</description>
+    <macros>
+        <import>macros.xml</import>
+    </macros>
+    <requirements>
+        <requirement type="package" version="2.0.3">r-argparse</requirement>
+        <requirement type="package" version="0.7.0">r-snowballc</requirement>
+        <requirement type="package" version="0.3.6">r-pubmedwordcloud</requirement>
+        <requirement type="package" version="1.2.0">r-semnetcleaner</requirement>
+        <requirement type="package" version="0.9.3">r-textclean</requirement>
+        <requirement type="package" version="1.5.3">r-stringi</requirement>
+        <requirement type="package" version="1.4.0">r-stringr</requirement>
+    </requirements>
+    <command detect_errors="exit_code"><![CDATA[
+        Rscript
+        '${__tool_directory__}/text_to_wordmatrix.R'
+        --input '$input'
+        --output '$output'
+        --number '$number'
+        $remove_num
+        $lower_case
+        $remove_stopwords
+        $stemDoc
+        $plurals
+    ]]>
+    </command>
+    <inputs>
+        <param argument="--input" type="data" format="tabular" label="Input file" />
+        <param argument="--number" type="integer" value="50" min="1" max="500" label="Number of most frequent words that should be extracted per row."/>
+        <param argument="--remove_num" type="boolean" truevalue="--remove_num" falsevalue="" checked="false" label="Remove any numbers in text." />
+        <param argument="--lower_case" type="boolean" truevalue="" falsevalue="--lower_case" checked="true" label="Translate all characters to lower case." />
+        <param argument="--remove_stopwords" type="boolean" truevalue="" falsevalue="--remove_stopwords" checked="true" label="Remove English stopwords" help="e.g. 'the' or 'not'" />
+        <param argument="--stemDoc" type="boolean" truevalue="--stemDoc" falsevalue="" checked="false" label="Apply Porter's stemming algorithm: collapsing words to a common root to aid comparison of vocabulary." />
+        <param argument="--plurals" type="boolean" truevalue="" falsevalue="--plurals" checked="true" label="Transform words in plural to their singular form." />
+    </inputs>
+    <outputs>
+        <data format="tabular" name="output" />
+    </outputs>
+    <tests>
+        <test>
+            <param name="input" value="pubmed_by_queries_output_abstracts" ftype="tabular"/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+        <test>
+            <param name="input" value="pubmed_by_queries_output_abstracts" ftype="tabular"/>
+            <param name="remove_num" value="True"/>
+            <param name="remove_stopwords" value="False"/>
+            <param name="plurals" value="False"/>
+            <output name="output">
+                <assert_contents>
+                    <has_n_lines n="7"/>
+                </assert_contents>
+            </output>
+        </test>
+    </tests>
+    <help><![CDATA[
+
+**What it does**
+
+The tool extracts for each row the most frequent words from the text in columns starting with "ABSTRACT" or "TEXT". The extracted words from each row are united in one large binary matrix, with 0 = word not frequently occurring in text of that row and 1 = word frequently present in text of that row.
+
+- Input table:
+
+    The output of "pubmed_by_queries" or "abstracts_by_pmids" tools, or a table with text in columns starting with "ABSTRACT" or "TEXT".
+
+- Output table:
+
+    A binary matrix in which each column represents one of the extracted words.
+
+-----
+
+**Example**
+
+- Input table:
+
+    | ABSTRACT_1  | ABSTRACT_2  | TEXT_1
+    | abcd def... | abcd def... | abcd def...
+    | abcd def... | abcd def... | abcd def...
+
+- Extract of output table:
+
+    | chronic | seizure | child | channel | signaling | grin2a
+    | 1       | 1       | 1     | 1       | 1         | 1
+    | 0       | 1       | 0     | 1       | 0         | 1
+
+    ]]></help>
+    <expand macro="citations"/>
+</tool>
\ No newline at end of file
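
The help text describes uniting the most frequent words of each row in one binary matrix. A minimal sketch of that step in base R follows, assuming made-up frequency tables in place of the output of `cleanAbstracts()`. The names `rows` and `top_n` are illustrative only, and `top_n` stands in for `--number`.

```r
# Made-up per-row frequency tables (assumption); the tool derives these
# from the ABSTRACT/TEXT columns via PubMedWordcloud::cleanAbstracts().
rows <- list(
  data.frame(word = c("epilepsy", "gene", "sodium"), freq = c(5, 3, 1),
             stringsAsFactors = FALSE),
  data.frame(word = c("gene", "pain"), freq = c(4, 2),
             stringsAsFactors = FALSE)
)
top_n <- 2  # illustrative stand-in for --number

word_matrix <- data.frame()
for (row in seq_along(rows)) {
  top_words <- rows[[row]]
  top_words <- top_words[order(top_words$freq, decreasing = TRUE), ]
  n <- min(top_n, nrow(top_words))
  # Assigning to column names that do not exist yet adds those columns;
  # rows that never receive a value for a column hold NA there.
  word_matrix[row, top_words$word[1:n]] <- top_words$freq[1:n]
}

word_matrix <- as.matrix(word_matrix)
word_matrix[is.na(word_matrix)] <- 0   # word not among the top words of that row
word_matrix <- (word_matrix > 0) * 1   # frequencies become 0/1
print(word_matrix)
```

Each row keeps only its own top words, so "sodium" never becomes a column here, and a word that is frequent in one row only is 0 in every other row.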
