# HG changeset patch # User george-weingart # Date 1423455769 18000 # Node ID 18774fa866d880f451b5b4bf461d54f40f9e88fc # Parent 589169d452c078c6c72fbc40171c5bcb28b687b0 Deleted selected files diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/Figure1-Overview.png Binary file maaslin-4450aa4ecc84/Figure1-Overview.png has changed diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/MaAsLin_galaxy_ReadMe.txt --- a/maaslin-4450aa4ecc84/MaAsLin_galaxy_ReadMe.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ -Installation instructions for maaslin in a galaxy environment. -These instructions require the Mercurial versioning system, galaxy, and an internet connection. - -1. In the "galaxy-dist/tools" directory install maaslin by typing in a terminal: -hg clone https://bitbucket.org/biobakery/maaslin - -2. Update member tool_conf.xml in the galaxy directory adding the following: -
- -
- -3. Update member datatypes_conf.xml in the galaxy directory adding the following: - - -4. Recycle galaxy - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/Maaslin_Output.png Binary file maaslin-4450aa4ecc84/Maaslin_Output.png has changed diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/None Binary file maaslin-4450aa4ecc84/None has changed diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/README.md --- a/maaslin-4450aa4ecc84/README.md Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,413 +0,0 @@ -MaAsLin User Guide v3.1 -======================= - -September 2013 - Updated April 2014 for Galaxy - -Timothy Tickle and Curtis Huttenhower - -Table of Contents ------------------ - -A. Introduction to MaAsLin -B. Related Projects and Scripts -C. Installing MaAsLin -D. MaAsLin Inputs -E. Process Flow Overview -D. Process Flow Detail -G. Expected Output Files -H. Troubleshooting -I. Installation as an Automated Pipeline -J. Commandline Options (Modifying Process and Figures) - -# A. Introduction to MaAsLin - -MaAsLin is a multivariate statistical framework that finds -associations between clinical metadata and potentially -high-dimensional experimental data. MaAsLin performs boosted additive -general linear models between one group of data (metadata/the -predictors) and another group (in our case relative taxonomic -abundances/the response). In our context we use it to discover -associations between clinical metadata and microbial community -relative abundance or function; however, it is applicable to other -data types. - -Metagenomic data are sparse, and boosting is used to select metadata -that show some potential to be useful in a linear model between the -metadata and abundances. In the context of metadata and community -abundance, a sample's metadata is boosted for one Operational -Taxonomic Unit (OTU) (Yi). 
The metadata that are selected by boosting -are then used in a general linear model, with each combination of -metadata (as predictors) and OTU abundance (as response -variables). This occurs for every OTU and metadata combination. Given -we work with proportional data, the Yi (abundances) are -`arcsin(sqrt(Yi))` transformed. A final formula is as follows: - -![](https://bitbucket.org/biobakery/maaslin/downloads/maaslinformula2.png) - -For more information about maaslin please visit -[http://huttenhower.sph.harvard.edu/maaslin](http://huttenhower.sph.harvard.edu/maaslin). - - -# B. Related Projects and Scripts - -Other projects exist at www.bitbucket.com that may help in your -analysis: - -* **QiimeToMaAsLin** is a project that reformats abundance files from - Qiime for MaAsLin. Several formats of Qiime consensus lineages are - supported for this project. To download please visit - [https://bitbucket.org/timothyltickle/qiimetomaaslin](https://bitbucket.org/timothyltickle/qiimetomaaslin). - -* **merge_metadata.py** is a script included in the MaAsLin project to - generically merge a metadata file with a table of microbial (or - other) measurements. This script is located in `maaslin/src` and - is documented in `maaslin/doc/ Merge_Metadata_Read_Me.txt`. - - -# C. Installing MaAsLin - -R Libraries: Several libraries need to be installed in R these are - the following: - - * agricolae, gam, gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme, optparse, outliers, penalized, pscl, robustbase, testhat, vegan - -You can install them by typing R in a terminal and using the - install.packages command: - - install.packages(c('agricolae', 'gam', 'gamlss', 'gbm', 'glmnet', 'inlinedocs', 'logging', 'MASS', 'nlme', 'optparse', 'outliers', 'penalized', 'pscl', 'robustbase', 'testthat')) - -# D. MaAsLin Inputs - -There are 3 input files for each project, the "\*.read.config" file, the "\*.pcl" file, and the "\*.R" script. 
(If using the sfle automated pipeline, the "\*" in the file names can be anything but need to be identical for all three files. All three files need to be in the `../sfle/input/maasalin/input` folder only if using sfle). Details of each file follow: - -### 1\. "\*.pcl" - -Required input file. A PCL file is the file that contains all the data -and metadata. This file is formatted so that metadata/data (otus or -bugs) are rows and samples are columns. All metadata rows should come -first before any abundance data. The file should be a tab delimited -text file with the extension ".pcl". - -### 2\. "\*.read.config" - -Required input file. A read config file allows one to indicate what data is read from a PCL file without having to change the pcl file or change code. This means one can have a pcl file which is a superset of metadata and abundances which includes data you are not interested in for the run. This file is a text file with ".read.config" as an extension. This file is later described in detail in section **F. Process Flow Overview** subsection **4. Create your read.config file**. - -### 3\. "\*.R" - -Optional input file. The R script file is using a call back programming pattern that allows one to add/modify specific code to customize analysis without touching the main MaAsLin engine. A generic R script is provided "maaslin_demo2.R" and can be renamed and used for any study. The R script can be modified to add quality control or formatting of data, add ecological measurements, or other changes to the underlying data before MaAsLin runs on it. This file is not required to run MaAsLin. - -# E. Process Flow Overview - -1. Obtain your abundance or relative function table. -2. Obtain your metadata. -3. Format and combine your abundance table and metadata as a pcl file for MaAsLin. -4. Create your read.config file. -5. Create your R script or use the default. -6. Place .pcl, .read.config, .R files in `../sfle/input/maaslin/input/` (sfle only) -7. Run -8. 
Discover amazing associations in your results! - -# F. Process Flow Detail - -### 1\. Obtain your abundance or relative function table. - -Abundance tables are normally derived from sequence data using -*Mothur*, *Qiime*, *HUMAnN*, or *MetaPhlAn*. Please refer to their documentation -for further details. - -### 2\. Obtain your metadata. - -Metadata would be information about the samples in the study. For -instance, one may analyze a case / control study. In this study, you -may have a disease and healthy group (disease state), the sex of the -patents (patient demographics), medication use (chemical treatment), -smoking (patient lifestyle) or other types of data. All aforementioned -data would be study metadata. This section can have any type of data -(factor, ordered factor, continuous, integer, or logical -variables). If a particular data is missing for a sample for a -metadata please write NA. It is preferable to write NA so that, when -looking at the data, it is understood the metadata is missing and it's -absence is intentional and not a mistake. Often investigators are -interested in genetic measurements that may also be placed in the -metadata section to associate to bugs. - -If you are not wanting to manually add metadata to your abundance -table, you may be interested in associated tools or scripts to help -combine your abundance table and metadata to create your pcl -file. Both require a specific format for your metadata file. Please -see the documentation for *QiimeToMaaslin* or *merge_metadata.py* (for -more details see section B). - -### 3\. Format and combine your abundance table and metadata as a pcl -file for *MaAsLin*. - -Please note two tools have been developed to help you! If you are -working from a Qiime OTU output and have a metadata text file try using -*QiimeToMaaslin* found at bitbucket. 
If you have a tab delimited file -which matches the below .pcl description (for instance MetaPhlAn -output) use the merge_metadata.py script provided in this project -(`maaslin/src/merge_metadata.py`) and documented in -`maaslin/doc/Merge_Metadata_Read_Me.txt`. - -###PCL format description: - -i. Row 1 is expected to be sample IDs beginning the first column with a feature name to identify the row, for example "ID". - -ii. Rows of metadata. Each row is one metadata, the first column entry -being the name of the metadata and each following column being the -metadata value for that each sample. - -iii. Row of taxa/otu abundance. Each row is one taxa/otu, the first -column entry being the name of the taxa/otu followed by abundances of -the taxa/otu per sample. - -iv. Abundances should be normalized by dividing each abundance measurement by the sum of the column (sample) abundances. - -v. Here is an example of the contents of an extremely small pcl file; -another example can be found in this project at -`maaslin/input/maaslin_demo.pcl`. - - - ID Sample1 Sample2 Sample3 Sample4 - metadata1 True True False False - metadata2 1.23 2.34 3.22 3.44 - metadata3 Male Female Male Female - taxa1 0.022 0.014 0.333 0.125 - taxa2 0.406 0.029 0.166 0.300 - taxa3 0.571 0.955 0.500 0.575 - - -### 4\. Create your read.config file. - -A *.read.config file is a structured text file used to indicate which -data in a *.pcl file should be read into MaAsLin and used for -analysis. This allows one to keep their *.pcl file intact while -varying analysis. Hopefully, this avoids errors that may be introduced -while manipulating the pcl files. - -Here is an example of the contents of a *.read.config file. - - Matrix: Metadata - Read_PCL_Columns: Sample2-Sample15 - Read_PCL_Rows: Age-Height,Weight,Sex,Cohort-Profession - - Matrix: Abundance - Read_PCL_Columns: Sample2-Sample15 - Read_PCL_Rows: Bacteria-Bug100 - -The minimal requirement for a MaAsLin .read.config file is as -follows. 
The Matrix: should be specified. Metadata needs to be named -"Metadata" for the metadata section and "Abundance" for the abundance -section. “Read\_PCL\_Rows:” is used to indicate which rows are data or -metadata to be analyzed. Rows can be identified by their metadata/data -id. Separate ids by commas. If there is a consecutive group of -metadata/data a range of rows can be defined by indicating the first -and last id separated by a “-“. If the beginning or ending id is -missing surrounding an “–“, the rows are read from the beginning or to -the end of the pcl file, respectively. - -A minimal example is shown here: - - Matrix: Metadata - Read\_PCL\_Rows: -Weight - - Matrix: Abundance - Read_PCL_Rows: Bacteria- - -With this minimal example, the delimiter of the file is assumed to be -a tab, all columns are read (since they are not indicated -here). Metadata are read as all rows from the beginning of the pcl -file (skipping the first Sample ID row) to Weight; all data are read -as all rows from Bacteria to the end of the pcl file. This example -refers to the default input files given in the MaAsLin download as -maaslin_demo2.\*. - -### 5\. Optionally, create your R script. - -The R script is used to add code that manipulates your data before -analysis, and for manipulating the multifactoral analysis figure. A -default “*.R” script is available with the default MaAsLin project at -maaslin/input/maaslin_demo2.R. This is an expert option and should -only be used by someone very comfortable with the R language. - -###6. Optional step if using the sfle analysis pipeline. Place .pcl, .read.config, and optional .R files in `../sfle/input/maasalin/input` - -###7. Run. - -By running the commandline script: -On the commandline call the Maaslin.R script. Please refer to the help (-h, --help) for command line options. If running from commandline, the PCL file will need to be transposed. A script is included in Maaslin for your convenience (src/transpose.py). 
The following example will have such a call included. An example call from the Maaslin folder for the demo data could be as follows. - -./src/transpose.py < input/maaslin_demo2.pcl > maaslin_demo2.tsv -./src/Maaslin.R -i input/maaslin_demo2.read.config demo.text maaslin_demo2.tsv - -When using sfle: -Go to ../sfle and type the following: scons output/maaslin - -###8. Discover amazing associations in your results! - - -#G. Expected Output Files - -The following files will be generated per MaAsLin run. In the -following listing the term projectname refers to what you named your "\*.pcl" file without the extension. - -###Output files that are always created: - -**projectname_log.txt** - -This file contains the detail for the statistical engine. This can be -useful for detailed troubleshooting. - -**projectname-metadata.txt** - -Each metadata will have a file of associations. Any associations -indicated to be performed after initial variable selection (boosting) -is recorded here. Included are the information from the final general -linear model (performed after the boosting) and the FDR corrected -p-value (q-value). Can be opened as a text file or spreadsheet. - -**projectname-metadata.pdf** - -Any association that had a q-value less than or equal to the given -significance threshold will be plotted here (default is 0.25; can be -changed using the commandline argument -d). If this file does not -exist, the projectname-metadata.txt should not have an entry that is -less than or equal to the threshold. Factor data is plotted as -knotched box plots; continuous data is plotted as a scatter plot with -a line of best fit. Two plots are given for MaAslin Methodology; the -left being a raw data plot, the right being a corresponding partial -residual plot. - -**projectname.pdf** - -Contains the biplot visualization. This visualization is presented as a build and can be affected by modifications in the R.script or by using commandline. 
- -**projectname.txt** - -A collection of all entries in the projectname-metadata.pdf. Can be -opened as a text file or spreadsheet. - -###Additional troubleshooting files when the commandline: - -**data.tsv** - -The data matrix that was read in (transposed). Useful for making sure -the correct data was read in. - -**data.read.config** - -Can be used to read in the data.tsv. - -**metadata.tsv** - -The metadata that was read in (transposed). Useful for making sure the -correct metadata was read in. - -**metadata.read.config** - -Can be used to read in the data.tsv. - -**read_merged.tsv** - -The data and metadata merged (transposed). Useful for making sure the -merging occurred correctly. - -**read_merged.read.config** - -Can be used to read in the read_merged.tsv. - -**read_cleaned.tsv** - -The data read in, merged, and then cleaned. After this process the -data is written to this file for reference if needed. - -**read_cleaned.read.config** - -Can be used to read in read_cleaned.tsv. - -**ProcessQC.txt** - -Contains quality control for the MaAsLin analysis. This includes -information on the magnitude of outlier removal. - -**Run_Parameters.txt** -Contains an account of all the options used when running MaAsLin so the exact methodology can be recreated if needed. - -#H. Other Analysis Flows - -###1. All verses All -The all verses all analysis flow is a way of manipulating how metadata are used. In this method there is a group of metadata that are always evaluated, as well there are a group that are added to this one at a time. To give a more concrete example: You may have metadata cage, diet, and treatment. You may always want to have the association of abundance evaluated controlling for cage but otherwise looking at the metadata one at a time. In this way the cage metadata is the \D2forced\D3 part of the evaluation while the others are not forced and evaluated in serial. 
The appropriate commandline to indicate this follows (placed in your args file if using sfle, otherwise added in the commandline call): - -> -a -F cage - --a indicates all verses all is being used, -F indicates which metadata are forced (multiple metadata can be given comma delimited as shown here -F metadata1,metadata2,metadata3). This does not bypass the feature selection method so the metadata that are not forced are subject to feature selection and may be removed before coming to the evaluation. If you want all the metadata that are not forced to be evaluated in serial you will need to turn off feature selection and will have a final combined commandline as seen here: - -> -a -F cage -s none - -#I. Troubleshooting - -###1\. (Only valid if using Sfle) ImportError: No module named sfle - -When using the command "scons output/maaslin/..." to run my projects I -get the message: - - ImportError: No module named sfle: - File "/home/user/sfle/SConstruct", line 2: - import sfle - -**Solution:** You need to update your path. On a linux or MacOS terminal -in the sfle directory type the following. - - export PATH=/usr/local/bin:`pwd`/src:$PATH - export PYTHONPATH=$PATH - - -###2\. When trying to run a script I am told I do not have permission -even though file permissions have been set for myself. - -**Solution:** Most likely, you need to set the main MaAsLin script -(Maaslin.R) to executable. - -#J. Installation as an Automated Pipeline - -SflE (pronounced souffle), is a framework for automation and -parallelization on a multiprocessor machine. MaAsLin has been -developed to be compatible with this framework. More information can -be found at -[http://huttenhower.sph.harvard.edu/sfle](http://huttenhower.sph.harvard.edu/sfle). If -interested in installing MaAsLin in a SflE environment. After -installing SflE, download or move the complete maaslin directory into -`sfle/input`. After setting up, one places all maaslin input files in -`sfle/input/maaslin/input`. 
To run the automated pipeline and analyze -all files in the `sfle/input/maaslin/input` directory, type: `scons output/maaslin` -in a terminal in the sfle directory. This will produce -output in the `sfle/output/maaslin` directory. - -#K. Commandline Options (Modifying Process and Figures) - -Although we recommend the use of default options, commandline -arguments exist to modify both MaAsLin methodology and figures. To see -an up-to-date listing of argument usage, in a terminal in the -`maaslin/src` directory type `./Maaslin.R -h`. - -An additional input file (the args file) can be used to apply -commandline arguments to a MaAsLin run. This is useful when using -MaAsLin as an automated pipeline (using SflE) and is a way to document -what commandline are used for different projects. The args file should -be named the same as the *.pcl file except using the extension .args -. This file should be placed in the `maaslin/input` directory with the -other matching project input files. In this file please have one line -of arguments and values (if needed; some arguments are logical flags -and do not require a value), each separated by a space. The contents -of this file will be directly added to the commandline call for -Maaslin.R. An example of the contents of an args file is given here. - -**Example.args:** - - -v DEBUG -d 0.1 -b 5 - -In this example MaAsLin is modified to produce verbose output for -debugging (-v DEBUG), to change the threshold for making pdfs to a -q-value equal to or less than 0.1 (-d 0.1), and to plot -5 data (bug) features in the biplot (-b 5). 
- diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/SConscript --- a/maaslin-4450aa4ecc84/SConscript Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,97 +0,0 @@ -import sfle -import csv - -Import( "*" ) -pE = DefaultEnvironment( ) - -# Extensions -sGraphlanAnnotationFileExtension = "-ann.txt" -sGraphlanCoreAnnotFileExtension = "-ann-core.txt" -sGraphlanCoreGenesFileExtension = "-core.txt" -sGraphlanFigureExtension = "-graphlan.pdf" -sMaaslinDataFileExtension = ".txt" -sMaaslinReadConfigFileExtension = ".read.config" -sMaaslinSummaryFileExtension = ".txt" - -sCustomRScriptExtension = ".R" -sPCLExtension = ".pcl" -sTransposeExtension = ".tsv" - -# Files -strMaaslinGraphlanSettings = "Graphlan_settings.txt" - -# Script -sScriptGraphlan = File(os.path.join("..","graphlan","graphlan.py")) -sScriptGraphlanAnnotate = File(os.path.join("..","graphlan","graphlan_annotate.py")) -sScriptMaaslinSummaryToGraphlanAnnotation = File(sfle.d(fileDirSrc,"MaaslinToGraphlanAnnotation.py")) -sScriptPCLToCoreGene = File(sfle.d(fileDirSrc,"PCLToGraphlanCoreGene.py")) - -sProgMaaslin = sfle.d(fileDirSrc,"Maaslin.R") - -# Settings -iGraphlanDPI = 150 -iGraphlanFigureSize = 4 -iGraphlanPad = 0.2 -strGraphlanDirectory = "graphlan" - -c_fileDirLib = sfle.d( fileDirSrc, "lib" ) -c_fileInputMaaslinR = sfle.d( pE, fileDirSrc, "Maaslin.R" ) -c_afileTestsR = [sfle.d( pE, c_fileDirLib, s ) for s in - ("IO.R", "SummarizeMaaslin.R", "Utility.R", "ValidateData.R")] - -c_afileDocsR = c_afileTestsR + [sfle.d( pE, c_fileDirLib, s ) for s in - ( "AnalysisModules.R", "scriptBiplotTSV.R", "BoostGLM.R", "Constants.R", "MaaslinPlots.R")] - -##Test scripts -for fileInputR in c_afileTestsR: - strBase = sfle.rebase( fileInputR, True ) - #Testing summary file - fileTestingSummary = sfle.d( pE, fileDirOutput, strBase +"-TestReport.txt" ) - dirTestingR = Dir( sfle.d( fileDirSrc, "test-" + strBase ) ) - Default( sfle.testthat( pE, fileInputR, dirTestingR, fileTestingSummary ) 
) - -##Inline doc -for fileProg in c_afileDocsR: - filePDF = sfle.d( pE, fileDirOutput, sfle.rebase( fileProg, sfle.c_strSufR, sfle.c_strSufPDF ) ) - Default( sfle.inlinedocs( pE, fileProg, filePDF, fileDirTmp ) ) - -##Start regression suite -execfile( "SConscript_maaslin.py" ) - -##Input pcl files -lsMaaslinInputFiles = Glob( sfle.d( fileDirInput, "*" + sfle.c_strSufPCL ) ) - -## Run MaAsLin and generate output -for strPCLFile in lsMaaslinInputFiles: - Default( MaAsLin( strPCLFile )) - -# #Graphlan figure -# #TODO Fix path dependent, better way to know it is installed? -# if(os.path.exists(sScriptGraphlan.get_abspath())): - -# ## Run Graphlan on all output projects -# strProjectName = os.path.splitext(os.path.split(strPCLFile.get_abspath())[1])[0] -# strMaaslinOutputDir = sfle.d(fileDirOutput,strProjectName) - -# ##Get maaslin data files -# strMaaslinSummaryFile = sfle.d(os.path.join(strMaaslinOutputDir, strProjectName + sMaaslinSummaryFileExtension)) - -# # Make core gene file -# sCoreGeneFile = File(sfle.d(strMaaslinOutputDir, os.path.join(strGraphlanDirectory,sfle.rebase(strMaaslinSummaryFile, sMaaslinSummaryFileExtension,sGraphlanCoreGenesFileExtension)))) -# sReadConfigFile = File(sfle.d(fileDirInput,sfle.rebase(strMaaslinSummaryFile, sMaaslinSummaryFileExtension,sMaaslinReadConfigFileExtension))) -# sfle.op(pE, sScriptPCLToCoreGene, [[False, strPCLFile],[False, sReadConfigFile],[True, sCoreGeneFile]]) - -# # Make annotation file -# sAnnotationFile = File(sfle.d(strMaaslinOutputDir, os.path.join(strGraphlanDirectory,sfle.rebase(strMaaslinSummaryFile, sMaaslinSummaryFileExtension,sGraphlanAnnotationFileExtension)))) -# sfle.op(pE, sScriptMaaslinSummaryToGraphlanAnnotation, [[False, strMaaslinSummaryFile],[False,sCoreGeneFile],[False,File(sfle.d(fileDirSrc,strMaaslinGraphlanSettings))],[True,sAnnotationFile]]) - -# # Generate core gene annotation file names -# sCoreGeneAnnotationFile = File(sfle.d(strMaaslinOutputDir, 
os.path.join(strGraphlanDirectory,sfle.rebase(strMaaslinSummaryFile, sMaaslinSummaryFileExtension,sGraphlanCoreAnnotFileExtension)))) -# sfle.op(pE, sScriptGraphlanAnnotate, ["--annot",[sAnnotationFile],[False, sCoreGeneFile],[True, sCoreGeneAnnotationFile]]) - -# # Call graphlan -# # graphlan.py --dpi 150 --size 4 --pad 0.2 core_genes.annot.xml core_genes.png -# sGraphlanFigure = File(sfle.d(strMaaslinOutputDir, os.path.join(strGraphlanDirectory, sfle.rebase(strMaaslinSummaryFile, sMaaslinSummaryFileExtension,sGraphlanFigureExtension)))) -# sfle.op(pE, sScriptGraphlan, [[False, sCoreGeneAnnotationFile],[True, sGraphlanFigure],"--dpi",iGraphlanDPI,"--size",iGraphlanFigureSize,"--pad",iGraphlanPad]) - -# Default(sCoreGeneFile,sAnnotationFile,sCoreGeneAnnotationFile,sGraphlanFigure) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/SConscript_maaslin.py --- a/maaslin-4450aa4ecc84/SConscript_maaslin.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,61 +0,0 @@ -#!/usr/bin/env python -""" -Authors: Timothy Tickle and Curtis Huttenhower -Description: Find associations in two matrices of data. 
-""" - -__author__ = "Timothy Tickle and Curtis Huttenhower" -__copyright__ = "Copyright 2012" -__credits__ = ["Timothy Tickle","Curtis Huttenhower"] -__maintainer__ = "Timothy Tickle" -__email__ = "ttickle@hsph.harvard.edu" - -import argparse -import os -import sfle -import sys - -c_strSufRC = ".read.config" - -c_fileDirSrc = Dir( sfle.d( os.path.dirname( sfle.current_file( ) ), sfle.c_strDirSrc ) ) -c_fileProgMaaslin = File( sfle.d( c_fileDirSrc, "Maaslin.R" ) ) -sArgsExt = ".args" -#Commandline to ignore -lsIgnore = ["-i","-I","--input_config","--input_process"] - -def MaAsLin( filePCL ): - #Build input file name if they exist or give "" - strBase = filePCL.get_abspath().replace( sfle.c_strSufPCL, "" ) - strR, strRC, strArgs = (( strBase + s ) for s in (sfle.c_strSufR, c_strSufRC, sArgsExt)) - fileR, fileRC, fileArgs = (( File( s ) if os.path.exists( s ) else "" ) for s in (strR, strRC, strArgs)) - - ## Read in an args file if it exists - lsArgs = [] - if fileArgs: - fReader = csv.reader(open(fileArgs.get_abspath(),'r'), delimiter = " ") - lsArgsTmp = [] - [lsArgsTmp.extend(lsLine) for lsLine in fReader] - fSkip = False - for s in lsArgsTmp: - if s in lsIgnore: - fSkip=True - continue - if fSkip: - fSkip = not fSkip - continue - lsArgs.append(s) - - lsInputArgs = ["-I",[fileR]] if fileR else [] - lsInputArgs.extend(["-i",[fileRC]] if fileRC else []) - lsArgs.extend(lsInputArgs) - - strBase = os.path.basename( strBase ) - fileTSVFile = File(sfle.d(fileDirTmp,sfle.rebase(filePCL,sfle.c_strSufPCL,sfle.c_strSufTSV))) - strT = File( sfle.d( os.path.join(fileDirOutput.get_abspath(), strBase, strBase + sfle.c_strSufTXT) ) ) - - #Transpose PCL - sfle.spipe(pE, filePCL, c_fileProgTranspose, fileTSVFile) - #Run MaAsLin - sfle.op(pE, c_fileProgMaaslin, lsArgs+[[True,strT],[False, fileTSVFile]]) - if fileArgs: Depends(c_fileProgMaaslin, fileArgs) - Default(strT) \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/datatypes_conf.xml --- 
a/maaslin-4450aa4ecc84/datatypes_conf.xml Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,6 +0,0 @@ - - - - - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/doc/MaAsLin_User_Guide_v3.docx Binary file maaslin-4450aa4ecc84/doc/MaAsLin_User_Guide_v3.docx has changed diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/doc/Merge_Metadata_Read_Me.txt --- a/maaslin-4450aa4ecc84/doc/Merge_Metadata_Read_Me.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,33 +0,0 @@ -I. Quick start. - -The merge_metadata.py script has been included in the MaAsLin package to help add metadata to otu tables (or any tab delimited file where columns are the samples). This script was used to make the maaslin_demo.pcl file found in this project. - -The generic command to run the merge_metadata.py is: -python merge_metadata.py input_metadata_file < input_measurements_file > output_pcl_file - -An example of the expected files are found in this project in the directory maaslin/input/for_merge_metadata -An example of how to run the command on the example files is as follows (when in the maaslin folder in a terminal): -python src/merge_metadata.py input/for_merge_metadata/maaslin_demo_metadata.metadata < input/for_merge_metadata/maaslin_demo_measurements.pcl > input/maaslin_demo.pcl - -II. Script overview -merge_metadata.py takes a tab delimited metadata file and adds it to a otu table. Both files have expected formats given below. Additionally, if a pipe-delimited consensus lineage is given in the IDs of the OTUs (for instance for the genus Bifidobacterium, Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium), the higher level clades in the consensus lineage are added to other otu in the same clade level generating all higher level clade information captured in the otu data*. This heirarchy is then normalized using the same heirarchical structure. 
This means, after using the script, a sample will sum to more than 1, typically somewhere around 6 but will depend on if your data is originally at genus, species, or another level of resolution. All terminal otus (or the original otus) in a sample should sum to 1. - -*To help combat multiple comparisons, additional clades are only added if they add information to the data set. This means if you have an otu Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium and no other related otus until Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales, Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae will not be added to the data set because it will be no different than the already existing and more specific Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium otu. Clades at and above Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales will be included depending on if there are other otus to add them to at those clade levels. - - -III. Description of input files - -Metadata file: -Please make the file as follows: -1. Tab delimited -2. Rows are samples, columns are metadata -3. Sample Ids in the metadata file should match the sample ids in the otu table. -4. Use NA for values which are not recorded. -5. An example file is found at input/for_merge_metadata/maaslin_demo_metadata.metadata - -OTU table: -Please make the file as follows: -1. Tab delimited. -2. Rows are otus, columns are samples (note this is transposed in comparison to the metadata file). -3. If a consensus lineage is included in the otu name, use pipes as the delimiter. -4. 
An example file is found at input/for_merge_metadata/maaslin_demo_measurements.pcl diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/for_merge_metadata/maaslin_demo_measurements.pcl --- a/maaslin-4450aa4ecc84/input/for_merge_metadata/maaslin_demo_measurements.pcl Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,14 +0,0 @@ -ID Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 -Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium|1 0.0507585 0.0861117 0.00168464 0.0011966 0.0164305 0.00592628 0.0367439 0.0663809 -Bacteria|Actinobacteria|Actinobacteria|Coriobacteriales|Coriobacteriaceae|1008 0 0.166041 0.16004 0.0984803 0.127644 0 0.00320332 0 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|101 0.0110852 0.0229631 0.019991 0.0329065 0.044465 0.020979 0 0.0450837 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Prevotellaceae|1010 0.1993 0 0.134883 0.179251 0 0.065189 0.349727 0.254737 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|1023 0.290198 0.0119232 0 0.00538471 0.351818 0.0321204 0.0192199 0 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Unclassified|1013 0.0869312 0 0.0982704 0.0971641 0.101253 0.10691 0 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Anaerostipes|1026 0 0 0.143194 0 0.131957 0.142349 0.228754 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Roseburia|1032 0.233372 0.41157 0.280773 0.329065 0.010269 0 0.0380629 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Ruminococcaceae|1156 0.0641774 0.151248 0.0791779 0.0595908 0.00133498 0 0.00471076 0 -Bacteria|Firmicutes|Erysipelotrichi|Erysipelotrichales|Erysipelotrichaceae|Coprobacillus|1179 0 0.00971517 0.0049416 0.123489 0 0.380586 0 0.380998 -Bacteria|Firmicutes|Unclassified|1232 0.0641774 0.13535 0.0701932 0.0538471 0.0667488 0.0681522 0.127191 0.0622321 
-Bacteria|Proteobacteria|Betaproteobacteria|Burkholderiales|Alcaligenaceae|Parasutterella|1344 0 0.00507838 0.0012354 0.00167524 0.0351201 0 0.00395704 0 -Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacteriales|Enterobacteriaceae|Escherichia/Shigella|1532 0 0 0.00561545 0.017949 0.11296 0.177788 0.18843 0.190568 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/for_merge_metadata/maaslin_demo_metadata.metadata --- a/maaslin-4450aa4ecc84/input/for_merge_metadata/maaslin_demo_metadata.metadata Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,9 +0,0 @@ -ID Cohort Age Height Weight Sex Smoking Star_Trek_Fan Favorite_color -Sample1 Healthy 87 60 151 0 0 1 Yellow -Sample2 Healthy 78 72 258 1 0 1 Blue -Sample3 Healthy 3 63 195 0 1 0 Green -Sample4 Healthy 2 67 172 1 0 0 Yellow -Sample5 IBD 32 71 202 1 1 1 Green -Sample6 IBD 10 65 210 0 1 0 Blue -Sample7 IBD 39 61 139 1 1 0 Green -Sample8 IBD 96 64 140 0 0 1 Blue diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/maaslin_demo2.R --- a/maaslin-4450aa4ecc84/input/maaslin_demo2.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ -processFunction = function( frmeData, aiMetadata, aiGenetics, aiData ) -{ - return( list(frmeData = frmeData, aiMetadata = aiMetadata, aiData = aiData) ) -} diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/maaslin_demo2.args --- a/maaslin-4450aa4ecc84/input/maaslin_demo2.args Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1 +0,0 @@ --v DEBUG \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/maaslin_demo2.pcl --- a/maaslin-4450aa4ecc84/input/maaslin_demo2.pcl Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,30 +0,0 @@ -sample Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 -Age 87 78 3 2 32 10 39 96 -Cohort Healthy Healthy Healthy Healthy IBD IBD IBD IBD 
-Favorite_color Yellow Blue Green Yellow Green Blue Green Blue -Height 60 72 63 67 71 65 61 64 -Sex 0 1 0 1 1 0 1 0 -Smoking 0 0 1 0 1 1 1 0 -Star_Trek_Fan 1 1 0 0 1 0 0 1 -Weight 151 258 195 172 202 210 139 140 -Bacteria 1 1 1 1 1 1 1 1 -Bacteria|Actinobacteria|Actinobacteria 0.0507585 0.252153 0.161725 0.0996769 0.144075 0.00592628 0.0399472 0.0663809 -Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium|1 0.0507585 0.0861117 0.00168464 0.0011966 0.0164305 0.00592628 0.0367439 0.0663809 -Bacteria|Actinobacteria|Actinobacteria|Coriobacteriales|Coriobacteriaceae|1008 0 0.166041 0.16004 0.0984803 0.127644 0 0.00320332 0 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales 0.210385 0.0229631 0.154874 0.212157 0.044465 0.0861681 0.349727 0.29982 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|101 0.0110852 0.0229631 0.019991 0.0329065 0.044465 0.020979 0 0.0450837 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Prevotellaceae|1010 0.1993 0 0.134883 0.179251 0 0.065189 0.349727 0.254737 -Bacteria|Firmicutes 0.738856 0.719806 0.67655 0.668541 0.663381 0.730117 0.417939 0.443231 -Bacteria|Firmicutes|Bacilli|Lactobacillales 0.37713 0.0119232 0.0982704 0.102549 0.45307 0.13903 0.0192199 0 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|1023 0.290198 0.0119232 0 0.00538471 0.351818 0.0321204 0.0192199 0 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Unclassified|1013 0.0869312 0 0.0982704 0.0971641 0.101253 0.10691 0 0 -Bacteria|Firmicutes|Clostridia|Clostridiales 0.29755 0.562817 0.503145 0.388656 0.143561 0.142349 0.271528 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae 0.233372 0.41157 0.423967 0.329065 0.142226 0.142349 0.266817 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Anaerostipes|1026 0 0 0.143194 0 0.131957 0.142349 0.228754 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Roseburia|1032 0.233372 0.41157 0.280773 0.329065 0.010269 0 
0.0380629 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Ruminococcaceae|1156 0.0641774 0.151248 0.0791779 0.0595908 0.00133498 0 0.00471076 0 -Bacteria|Firmicutes|Erysipelotrichi|Erysipelotrichales|Erysipelotrichaceae|Coprobacillus|1179 0 0.00971517 0.0049416 0.123489 0 0.380586 0 0.380998 -Bacteria|Firmicutes|Unclassified|1232 0.0641774 0.13535 0.0701932 0.0538471 0.0667488 0.0681522 0.127191 0.0622321 -Bacteria|Proteobacteria 0 0.00507838 0.00685085 0.0196243 0.14808 0.177788 0.192387 0.190568 -Bacteria|Proteobacteria|Betaproteobacteria|Burkholderiales|Alcaligenaceae|Parasutterella|1344 0 0.00507838 0.0012354 0.00167524 0.0351201 0 0.00395704 0 -Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacteriales|Enterobacteriaceae|Escherichia/Shigella|1532 0 0 0.00561545 0.017949 0.11296 0.177788 0.18843 0.190568 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/input/maaslin_demo2.read.config --- a/maaslin-4450aa4ecc84/input/maaslin_demo2.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,5 +0,0 @@ -Matrix: Metadata -Read_PCL_Rows: -Weight - -Matrix: Abundance -Read_PCL_Rows: Bacteria- diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/maaslin.xml --- a/maaslin-4450aa4ecc84/maaslin.xml Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,170 +0,0 @@ - - - -maaslin_wrapper.py ---lastmeta $cls_x ---input $inp_data ---output $out_file1 ---alpha $alpha ---min_abd $min_abd ---min_samp $min_samp ---zip_file $zip_file ---tool_option1 $tool_option1 - - - - - - - - - - - - - - - - - - tool_option1 == "2" - - - - maaslin_SCRIPT_PATH - - - - - - - - - - - - - - - - - -Feedback? Not working? Please contact us at Maaslin_google_group_ . - - -MaAsLin: Multivariate Analysis by Linear Models ------------------------------------------------ - -MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and microbial community abundance or function. 
The clinical metadata can be of any type continuous (for example age and weight), boolean (sex, stool/biopsy), or discrete/factor (cohort groupings and phenotypes). MaAsLin is best used in the case when you are associating many metadata with microbial measurements. When this is the case each metadatum can be a diffrent type. For example, you could include age, weight, sex, cohort and phenotype in the same input file to be analyzed in the same MaAsLin run. The microbial measurements are expected to be normalized before using MaAsLin and so are proportional data ranging from 0 to 1.0. - -The results of a MaAsLin run are the association of a specific microbial community member with metadata. These associations are without the influence of the other metadata in the study. There are certain factors known that can influence the microbiome (for example diet, age, geography, fecal or biopsy sample origin). MaAsLin allows one to detect the effect of a metadata, possibly a phenotype, deconfounding the effects of diet, age, sample origin or any other metadata captured in the study! - -.. image:: https://bytebucket.org/biobakery/galaxy_maaslin/wiki/Figure1-Overview.png - :height: 500 - :width: 600 - - -*Maaslin Analysis Overview* MaAsLin performs boosted, additive general linear models between one group of data (metadata/the predictors) and another group (in our case microbial abundance/the response). Given that metagenomic data is sparse, the boosting is used to select metadata that show some potential to be associated with microbial abundances. Boosting of metadata and selection of a model occurs per otu. The metadata data that is selected for use by boosting is then used in a general linear model using metadata as predictors and otu arcsin-square root transformed abundance as the response. - - - -For more information on the technical aspects to this algorithm please see the methodological evaluation of MaAsLin that compared it to multiviariate and univariate analyses. 
Please check back for paper citing. - -Process: --------- -The first step consists of uploading your data using Galaxy's **Get Data - Upload File** - -A sample file is located at: https://bytebucket.org/biobakery/maaslin/wiki/maaslin_demo_pcl.txt - - -**Important** - -Please make sure to choose **File Format: maaslin** - -Required inputs ---------------- - -MaAsLin requires an input pcl file of metadata and microbial community measurements. MaAsLin expects a PCL file as an input file. A PCL file is a text delimited file similar to an excel spread sheet with the following characteristics. - -1. **Rows** represent metadata and features (bugs), **columns** represent samples -2. The **first row** by default should be the sample ids. -3. Metadata rows should be next. -4. Lastly, rows containing features (bugs) measurements (like abundance) should be after metadata rows. -5. The **first column** should contain the ID describing the column. For metadata this may be, for example, ''Age'' for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name). -6. The file is expected to be TAB delimited. - - - - - - -Description of parameters -------------------------- -**Input file** Select a loaded data file to use in analysis. - -**Last metadata row** Metadata and microbial measurements should be rows of the pcl file. Metadata should all come before microbial measurements. This row is the last metadata row which is only followed by rows which are microbial measurements. - -**Maximum false discovery rate (Significance threshold)** Associations are found significant if thier q-value is equal to or less than this threshold. - -**Minimum for feature relative abundance filtering** The minimum relative abundance allowed in the data. Values below this are removed and imputed as the median of the sample data. 
- -**Minimum for feature prevalence filtering** The minimum percentage of samples a feature can have abudance in before being removed. - -**Type of Output** Select one of the two options for output (summary or detailed results). - -Outputs -------- - -The Run MaAsLin module will create either A) a summary text file of plotted significant associations or B) a compressed directory of associations (significant and not significant). - -A. Any association that had a q-value less than or equal to the significance threshold will be included in a tab-delimited file. - -B. The following files will be generated per MaAsLin run. In the following listing the term projectname refers to what you named your pcl file without the extension. - -**Analysis** (These files are useful for analysis): - -**projectname-metadata.txt** Each metadata will have a file of associations. Any associations indicated to be performed after initial boosting is recorded here. Included are the information from the final general linear model (performed after the boosting) and the FDR corrected p-value (q-value). Can be opened as a text file or spreadsheet. - -**projectname-metadata.pdf** Any association that had a q-value less than or equal to the significance threshold will be plotted here. If this file does not exist, the projectname-metadata.txt should not have an entry that is less than or equal to the threshold. Factor and boolean data is plotted as knotched box plots; continuous data is plotted as a scatter plot with a line of best fit. - -.. image:: https://bytebucket.org/biobakery/galaxy_maaslin/wiki/Maaslin_Output.png - :height: 500 - :width: 600 - - - -*Example of the projectname-metadata.pdf file* Significant associations are combined in files of associations per metadata. Factor and boolean data is plotted as knotched box plots; continuous data is plotted as a scatter plot with a line of best fit. 
Plots show raw data, header data show information from the reduced - -**projectname_Summary.txt** Any entry in the projectname-metadata.pdf are collected together here. Can be opened as a text file or spreadsheet. - -**Troubleshooting** (These files are typically not used for analysis but are there for documenting the process and troubleshooting): - -**projectname.txt** Contains the detail for the statistical engine. Is useful for detailed troubleshooting. - -**data.tsv** The data matrix that was read in (transposed). Useful for making sure the correct data was read in. - -**data.read.config** Can be used to read in the data.tsv . - -**metadata.tsv** The metadata that was read in (transposed). Useful for making sure the correct metadata was read in. - -**metadata.read.config** Can be used to read in the data.tsv . - -**read_merged.tsv** The data and metadata merged (transposed). Useful for making sure the merging occurred correctly. - -**read_merged.read.config** Can be used to read in the read_merged.tsv . - -**read_cleaned.tsv** The data read in, merged, and then cleaned. After this process the data is written to this file for reference if needed. - -**read_cleaned.read.config** Can be used to read in read_cleaned.tsv . - -**ProcessQC.txt** Contains quality control for the MaAsLin analysis. This includes information on the magnitude of outlier removal. - -Contacts --------- - -Please feel free to contact us at ttickle@hsph.harvard.edu for any questions or comments! - -.. 
_Maaslin_google_group: https://groups.google.com/d/forum/maaslin-users - - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/maaslin_format_input_selector.py --- a/maaslin-4450aa4ecc84/maaslin_format_input_selector.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,137 +0,0 @@ -#!/usr/bin/env python - -""" -Author: George Weingart -Description: Dynamically read columns from input file for UI -""" - -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
-##################################################################################### - -__author__ = "George Weingart" -__copyright__ = "Copyright 2012" -__credits__ = ["George Weingart"] -__license__ = "MIT" -__maintainer__ = "George Weingart" -__email__ = "george.weingart@gmail.com" -__status__ = "Development" - -import sys,string,time -from pprint import pprint - -def red(st,l): - if len(st) <= l: return st - l1,l2 = l/2,l/2 - return st[:l1]+".."+st[len(st)-l2:] - -def get_cols(data,full_names): - if data == "": return [] - max_len =32 - fname = data.dataset.file_name - input_file = open(fname) - input_lines = input_file.readlines() - input_file.close() - table_lines = [] - for x in input_lines: - first_column = x.split('\t')[0] - table_lines.append(first_column) - - opt = [] - rc = '' - lines = [] - try: - lines = [(red((rc+v.split()[0]),max_len),'%d' % (i+1),False) for i,v in enumerate(table_lines) if v] - - except: - l1 = '*ALL*' - l2 = 1 - l3 = False - MyList = [l1,l2,l3] - lines.append(MyList) - return opt+lines - -def get_cols_add_line(data,full_names,lastmeta): - if data == "": return [] - display_to = 1 - try: - display_to = int(lastmeta) - except: - pass - - max_len = 32 - fname = data.dataset.file_name - input_file = open(fname) - input_lines = input_file.readlines() - input_file.close() - table_lines = [] - for x in input_lines: - first_column = x.split('\t')[0] - table_lines.append(first_column) - table_lines.insert(0,'-') - if not display_to == 1: - del table_lines[display_to + 1:] - - - opt = [] - rc = '' - lines = [] - try: - lines = [(red((rc+v.split()[0]),max_len),'%d' % (i+1),False) for i,v in enumerate(table_lines) if v] - - except: - l1 = '*ALL*' - l2 = 1 - l3 = False - MyList = [l1,l2,l3] - lines.append(MyList) - return opt+lines - -def get_cols_features(data,full_names,lastmeta): - if data == "": return [] - display_from = 1 - try: - display_from = int(lastmeta) - except: - pass - max_len = 32 - fname = data.dataset.file_name - input_file 
= open(fname) - input_lines = input_file.readlines() - input_file.close() - table_lines = [] - for x in input_lines: - first_column = x.split('\t')[0] - table_lines.append(first_column) - - opt = [] - rc = '' - del table_lines[:display_from] - lines = [] - try: - lines = [(red((rc+v.split()[0]),max_len),'%d' % (i+1),False) for i,v in enumerate(table_lines) if v] - - except: - l1 = '*ALL*' - l2 = 1 - l3 = False - MyList = [l1,l2,l3] - lines.append(MyList) - return opt+lines diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/maaslin_wrapper.py --- a/maaslin-4450aa4ecc84/maaslin_wrapper.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,223 +0,0 @@ -#!/usr/bin/env python - -""" -Author: George Weingart -Description: Wrapper program for maaslin -""" - -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
-##################################################################################### - -__author__ = "George Weingart" -__copyright__ = "Copyright 2012" -__credits__ = ["George Weingart"] -__license__ = "MIT" -__maintainer__ = "George Weingart" -__email__ = "george.weingart@gmail.com" -__status__ = "Development" - -from cStringIO import StringIO -import sys,string -import os -import tempfile -from pprint import pprint -import argparse - -###################################################################################### -# Parse input parms # -###################################################################################### -def read_params(x): - parser = argparse.ArgumentParser(description='MaAsLin Argparser') - parser.add_argument('--lastmeta', action="store", dest='lastmeta',nargs='?') - parser.add_argument('--input', action="store", dest='input',nargs='?') - parser.add_argument('--output', action="store", dest='output',nargs='?') - parser.add_argument('--zip_file', action="store", dest='zip_file',nargs='?') - parser.add_argument('--alpha', action="store", type=float,default=0.05,dest='alpha',nargs='?') - parser.add_argument('--min_abd', action="store", type=float,default=0.0001,dest='min_abd',nargs='?') - parser.add_argument('--min_samp', action="store", type=float,default=0.01,dest='min_samp',nargs='?') - parser.add_argument('--tool_option1', action="store", dest='tool_option1',nargs='?') - return parser - - - -###################################################################################### -# Build read config file # -###################################################################################### -def build_read_config_file(strTempDir,results, DSrc, DMaaslin, root_dir): - fname = results.input - input_file = open(fname) - input_lines = input_file.readlines() - LenInput = len(input_lines) - input_file.close() - TopLimit = int(results.lastmeta) - ReadConfigFileName = os.path.join(strTempDir,"Test.read.config") - Q = "'" - - #WorkingDir = 
os.getcwd() - WorkingDir = root_dir - os.chdir(DMaaslin) - - Limit1 = Q + "2-" + str(TopLimit ) + Q - ReadConfigTb1 = [ - os.path.join(DSrc,"CreateReadConfigFile.R"), - "-c", - Limit1, - ReadConfigFileName, - "Metadata" - ">/dev/null",\ - "2>&1" - ] - - cmd_config1 = " ".join(ReadConfigTb1) - - os.system(cmd_config1) - - Limit2 = Q + str(TopLimit +1 ) + '-' + Q - ReadConfigTb2 = [ - os.path.join(DSrc,"CreateReadConfigFile.R"), - "-a", - "-c", - Limit2, - ReadConfigFileName, - "Abundance" - ">/dev/null",\ - "2>&1" - ] - - cmd_config2 = " ".join(ReadConfigTb2) - os.system(cmd_config2) - os.chdir(WorkingDir) - return ReadConfigFileName - - -###################################################################################### -# Main Program # -###################################################################################### - -# Parse commandline in -parser = read_params( sys.argv ) -results = parser.parse_args() -root_dir = os.environ.get('maaslin_SCRIPT_PATH') - - - - - - -### If option 2 is selected inform user on 2 outputs -if results.tool_option1 == "2": - print "***Please note: 2 output files are generated: Complete zipped results + Summary ***" - -### Project name -strProjectName = os.path.splitext(os.path.basename(results.input))[0] - -### Define directory locations -D = os.path.join(root_dir) -DSrc = os.path.join(root_dir,"src") -DInput = os.path.join(root_dir,"maaslin","input") -DMaaslin = os.path.join(root_dir) - -DMaaslinGalaxy = os.path.join(root_dir) - - - -### Make temporary folder to work in -### Change permissions to make useable -strTempDir = tempfile.mkdtemp() -cmd_chmod = "chmod 755 /" + strTempDir -os.system(cmd_chmod) -cmd_mkdir1 = "mkdir -m 755 " + os.path.join(strTempDir,strProjectName) -os.system(cmd_mkdir1) - -### Transpose the pcl file to a tsv file -TbCmdTranspose = [\ - "python", - DMaaslinGalaxy + "/transpose.py<" + str(results.input) + ">" + os.path.join(strTempDir,"output.tsv")\ - ] -cmd_transpose = " ".join(TbCmdTranspose) 
-os.system(cmd_transpose) - -### Make path for target output file -OutputFile = os.path.join(strTempDir,strProjectName,strProjectName+".txt") - -### Make read config file -ReadConfigFileName = build_read_config_file(strTempDir,results, DSrc, DMaaslin, root_dir) - -### Build MaAsLin comamnd -CmdsArray = [\ -os.path.join(DSrc,"Maaslin.R"), \ -"-d", str(results.alpha),\ -"-r", str(results.min_abd),\ -"-p", str(results.min_samp), \ -"-i", \ -ReadConfigFileName, \ -OutputFile, \ -os.path.join(strTempDir,"output.tsv"), \ -"-v",\ -"ERROR",\ -">/dev/null",\ -"2>&1" -] - -invoke_maaslin_cmd = " ".join(CmdsArray) - - - - - -### Write to directory cmd line used for troubleshooting -#CmdFileName = os.path.join(strTempDir,"cmdfile.txt") -#OutFile = open(CmdFileName,"w") -#OutputString = invoke_maaslin_cmd + "\n" -#OutFile.write(OutputString) -#OutFile.close() - -### Call MaAsLin -os.system(invoke_maaslin_cmd) - - -### Copy output file to make available to galaxy -cmd_copy = "cp " + os.path.join(strTempDir,strProjectName+"/output.txt") + " " + results.output -MsgFileName = os.path.join(strTempDir,strProjectName+"/output.txt") - -if not os.path.isfile(MsgFileName): - cmd_copy = "cp " + os.path.join(strTempDir,strProjectName+"/output.txt") + " " + results.output - OutFile = open(MsgFileName,"w") - OutputString = "A MaAsLin error has occurred\n" - OutputString = OutputString + "It typically happens when incorrect 'Last metadata row' was selected\n" - OutputString = OutputString + "For demo data please choose 'Weight'\n" - OutFile.write(OutputString) - OutFile.close() - -os.system(cmd_copy) - -### Zip up output folder -cmd_zip = "zip -jr " + os.path.join(strTempDir,strProjectName+".zip") + " " + os.path.join(strTempDir,strProjectName) + ">/dev/null 2>&1" - -os.system(cmd_zip) - -### Copy output folder to make available to galaxy -cmd_copy_zip = "cp " + os.path.join(strTempDir,strProjectName+".zip") + " " + results.zip_file -os.system(cmd_copy_zip) - -### Delete temp directory 
-cmd_del_tempdir = "rm -r " + strTempDir -######os.system(cmd_del_tempdir) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/CreateReadConfigFile.R --- a/maaslin-4450aa4ecc84/src/CreateReadConfigFile.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,63 +0,0 @@ -#!/usr/bin/env Rscript -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Allows read config files to be created. 
-) { return( pArgs ) } - -### Logging class -suppressMessages(library( logging, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -### Class for commandline argument processing -suppressMessages(library( optparse, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) - -### Source the IO.R for the script -source(file.path("src","lib","IO.R")) -source(file.path("src","lib","Constants.R")) - -### Create command line argument parser -### The TSV (tab seperated value (column major, samples are rows) file that will be read in -### The column that is the last metadata name -### The read.config file that will be used to read in the TSV file -pArgs <- OptionParser( usage = "%prog [optional] " ) -# Settings for Read config -## row indices -pArgs <- add_option( pArgs, c("-r", "--rows"), type="character", action="store", dest="strRows", default=NA, metavar="row_indices", help="Rows to read by index starting with 1.") -## column indices -pArgs <- add_option( pArgs, c("-c", "--columns"), type="character", action="store", dest="strColumns", default=NA, metavar="column_indices", help="Columns to read in by index starting with 1.") -## delimiter -pArgs <- add_option( pArgs, c("-d", "--delimiter"), type="character", action="store", dest="charDelimiter", default="\t", metavar="delimiter", help="Delimiter to read the matrix.") -## append to current file -pArgs <- add_option( pArgs, c("-a", "--append"), type="logical", action="store_true", dest="fAppend", default=FALSE, metavar="append", help="Append to existing data. 
Default no append.") -### Parse arguments -lsArgs <- parse_args( pArgs, positional_arguments = TRUE ) - -#Get positional arguments -if( !(length( lsArgs$args ) == 2) ) { stop( print_help( pArgs ) ) } - -### Write to file the read config script -funcWriteMatrixToReadConfigFile(strConfigureFileName=lsArgs$args[1], strMatrixName=lsArgs$args[2], strRowIndices=lsArgs$options$strRows, - strColIndices=lsArgs$options$strColumns,acharDelimiter=lsArgs$options$charDelimiter,fAppend=lsArgs$options$fAppend) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/Graphlan_settings.txt --- a/maaslin-4450aa4ecc84/src/Graphlan_settings.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,19 +0,0 @@ -title Metadata Associations -title_font_size 13 -total_plotted_degrees 280 -start_rotation 270 -internal_labels_rotation 270 -annotation_background_alpha 0.15 -clade_separation 0.35 -class_legend_font_size 12 -annotation_legend_font_size 11 -annotation_font_size 5 -annotation_font_stretch 0 -clade_marker_size 5 -branch_bracket_depth 0.5 -branch_thickness 1.2 -internal_label 1 Ph. 
-internal_label 2 Classes -internal_label 3 Orders -internal_label 4 Families -internal_label 5 Genera diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/Maaslin.R --- a/maaslin-4450aa4ecc84/src/Maaslin.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,607 +0,0 @@ -#!/usr/bin/env Rscript -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Main driver script. Should be called to perform MaAsLin Analysis. 
-) { return( pArgs ) } - - -### Install packages if not already installed -vDepLibrary = c("agricolae", "gam", "gamlss", "gbm", "glmnet", "inlinedocs", "logging", "MASS", "nlme", "optparse", "outliers", "penalized", "pscl", "robustbase", "testthat") -for(sDepLibrary in vDepLibrary) -{ - if(! require(sDepLibrary, character.only=TRUE) ) - { - install.packages(pkgs=sDepLibrary, repos="http://cran.us.r-project.org") - } -} - -### Logging class -suppressMessages(library( logging, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -### Class for commandline argument processing -suppressMessages(library( optparse, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) - - -### Create command line argument parser -pArgs <- OptionParser( usage = "%prog [options] " ) - -# Input files for MaAsLin -## Data configuration file -pArgs <- add_option( pArgs, c("-i", "--input_config"), type="character", action="store", dest="strInputConfig", metavar="data.read.config", help="Optional configuration file describing data input format.") -## Data manipulation/normalization file -pArgs <- add_option( pArgs, c("-I", "--input_process"), type="character", action="store", dest="strInputR", metavar="data.R", help="Optional configuration script normalizing or processing data.") - -# Settings for MaAsLin -## Maximum false discovery rate -pArgs <- add_option( pArgs, c("-d", "--fdr"), type="double", action="store", dest="dSignificanceLevel", default=0.25, metavar="significance", help="The threshold to use for significance for the generated q-values (BH FDR). Anything equal to or lower than this is significant. [Default %default]") -## Minimum feature relative abundance filtering -pArgs <- add_option( pArgs, c("-r", "--minRelativeAbundance"), type="double", action="store", dest="dMinAbd", default=0.0001, metavar="minRelativeAbundance", help="The minimum relative abundance allowed in the data. Values below this are removed and imputed as the median of the sample data. 
[Default %default]") -## Minimum feature prevalence filtering -pArgs <- add_option( pArgs, c("-p", "--minPrevalence"), type="double", action="store", dest="dMinSamp", default=0.1, metavar="minPrevalence", help="The minimum percentage of samples a feature can have abundance in before being removed. Also is the minimum percentage of samples a metadata can have that are not NA before being removed. [Default %default]") -## Fence for outlier, if not set Grubbs test is used -pArgs <- add_option( pArgs, c("-o", "--outlierFence"), type="double", action="store", dest="dOutlierFence", default=0, metavar="outlierFence", help="Outliers are defined as this number times the interquartile range added/subtracted from the 3rd/1st quartiles respectively. If set to 0 (default), outliers are defined by the Grubbs test. [Default %default]") -## Significance for Grubbs test -pArgs <- add_option(pArgs, c("-G","--grubbsSig"), type="double", action="store", dest="dPOutlier", default=0.05, metavar="grubbsAlpha", help="This is the significance cuttoff used to indicate an outlier or not. The closer to zero, the more significant an outlier must be to be removed. [Default %default]") -## Fixed (not random) covariates -pArgs <- add_option( pArgs, c("-R","--random"), type="character", action="store", dest="strRandomCovariates", default=NULL, metavar="fixed", help="These metadata will be treated as random covariates. Comma delimited data feature names. These features must be listed in the read.config file. Example '-R RandomMetadata1,RandomMetadata2'. [Default %default]") -## Change the type of correction fo rmultiple corrections -pArgs <- add_option( pArgs, c("-T","--testingCorrection"), type="character", action="store", dest="strMultTestCorrection", default="BH", metavar="multipleTestingCorrection", help="This indicates which multiple hypothesis testing method will be used, available are holm, hochberg, hommel, bonferroni, BH, BY. 
[Default %default]") -## Use a zero inflated model of the inference method indicate in -m -pArgs <- add_option( pArgs, c("-z","--doZeroInfated"), type="logical", action="store_true", default = FALSE, dest="fZeroInflated", metavar="fZeroInflated", help="If true, the zero inflated version of the inference model indicated in -m is used. For instance if using lm, zero-inflated regression on a gaussian distribution is used. [Default %default].") - -# Arguments used in validation of MaAsLin -## Model selection (enumerate) c("none","boost","penalized","forward","backward") -pArgs <- add_option( pArgs, c("-s", "--selection"), type="character", action="store", dest="strModelSelection", default="boost", metavar="model_selection", help="Indicates which of the variable selection techniques to use. [Default %default]") -## Argument indicating which method should be ran (enumerate) c("univariate","lm","neg_binomial","quasi") -pArgs <- add_option( pArgs, c("-m", "--method"), type="character", action="store", dest="strMethod", default="lm", metavar="analysis_method", help="Indicates which of the statistical inference methods to run. [Default %default]") -## Argument indicating which link function is used c("none","asinsqrt") -pArgs <- add_option( pArgs, c("-l", "--link"), type="character", action="store", dest="strTransform", default="asinsqrt", metavar="transform_method", help="Indicates which link or transformation to use with a glm, if glm is not selected this argument will be set to none. [Default %default]") -pArgs <- add_option( pArgs, c("-Q","--NoQC"), type="logical", action="store_true", default=FALSE, dest="fNoQC", metavar="Do_Not_Run_QC", help="Indicates if the quality control will be ran on the metadata/data. Default is true. 
[Default %default]") - -# Arguments to suppress MaAsLin actions on certain data -## Do not perform model selection on the following data -pArgs <- add_option( pArgs, c("-F","--forced"), type="character", action="store", dest="strForcedPredictors", default=NULL, metavar="forced_predictors", help="Metadata features that will be forced into the model seperated by commas. These features must be listed in the read.config file. Example '-F Metadata2,Metadata6,Metadata10'. [Default %default]") -## Do not impute the following -pArgs <- add_option( pArgs, c("-n","--noImpute"), type="character", action="store", dest="strNoImpute", default=NULL, metavar="no_impute", help="These data will not be imputed. Comma delimited data feature names. Example '-n Feature1,Feature4,Feature6'. [Default %default]") - -#Miscellaneouse arguments -### Argument to control logging (enumerate) -strDefaultLogging = "DEBUG" -pArgs <- add_option( pArgs, c("-v", "--verbosity"), type="character", action="store", dest="strVerbosity", default=strDefaultLogging, metavar="verbosity", help="Logging verbosity [Default %default]") -### Run maaslin without creating a log file -pArgs <- add_option( pArgs, c("-O","--omitLogFile"), type="logical", action="store_true", default=FALSE, dest="fOmitLogFile", metavar="omitlogfile",help="Including this flag will stop the creation of the output log file. [Default %default]") -### Argument for inverting background to black -pArgs <- add_option( pArgs, c("-t", "--invert"), type="logical", action="store_true", dest="fInvert", default=FALSE, metavar="invert", help="When given, flag indicates to invert the background of figures to black. [Default %default]") -### Selection Frequency -pArgs <- add_option( pArgs, c("-f","--selectionFrequency"), type="double", action="store", dest="dSelectionFrequency", default=NA, metavar="selectionFrequency", help="Selection Frequency for boosting (max 1 will remove almost everything). 
Interpreted as requiring boosting to select metadata 100% percent of the time (or less if given a number that is less). Value should be between 1 (100%) and 0 (0%), NA (default is determined by data size).") -### All v All -pArgs <- add_option( pArgs, c("-a","--allvall"), type="logical", action="store_true", dest="fAllvAll", default=FALSE, metavar="compare_all", help="When given, the flag indicates that each fixed covariate that is not indicated as Forced is compared once at a time per data feature (bug). Made to be used with the -F option to specify one part of the model while allowing the other to cycle through a group of covariates. Does not affect Random covariates, which are always included when specified. [Default %default]") -pArgs <- add_option( pArgs, c("-N","--PlotNA"), type="logical", action="store_true", default=FALSE, dest="fPlotNA", metavar="plotNAs",help="Plot data that was originally NA, by default they are not plotted. [Default %default]") -### Alternative methodology settings -pArgs <- add_option( pArgs, c("-A","--pAlpha"), type="double", action="store", dest="dPenalizedAlpha", default=0.95, metavar="PenalizedAlpha",help="The alpha for penalization (1.0=L1 regularization, LASSO; 0.0=L2 regularization, ridge regression. [Default %default]") -### Pass an alternative library dir -pArgs <- add_option( pArgs, c("-L", "--libdir"), action="store", dest="sAlternativeLibraryLocation", default=file.path( "","usr","share","biobakery" ), metavar="AlternativeLibraryDirectory", help="An alternative location to find the lib directory. This dir and children will be searched for the first maaslin/src/lib dir.") - -### Misc biplot arguments -pArgs <- add_option( pArgs, c("-M","--BiplotMetadataScale"), type="double", action="store", dest="dBiplotMetadataScale", default=1, metavar="scaleForMetadata", help="A real number used to scale the metadata labels on the biplot (otherwise a default will be selected from the data). 
[Default %default]") -pArgs <- add_option( pArgs, c("-C", "--BiplotColor"), type="character", action="store", dest="strBiplotColor", default=NULL, metavar="BiplotColorCovariate", help="A continuous metadata that will be used to color samples in the biplot ordination plot (otherwise a default will be selected from the data). Example Age [Default %default]") -pArgs <- add_option( pArgs, c("-S", "--BiplotShapeBy"), type="character", action="store", dest="strBiplotShapeBy", default=NULL, metavar="BiplotShapeCovariate", help="A discontinuous metadata that will be used to indicate shapes of samples in the Biplot ordination plot (otherwise a default will be selected from the data). Example Sex [Default %default]") -pArgs <- add_option( pArgs, c("-P", "--BiplotPlotFeatures"), type="character", action="store", dest="strBiplotPlotFeatures", default=NULL, metavar="BiplotFeaturesToPlot", help="Metadata and data features to plot (otherwise a default will be selected from the data). Comma Delimited.") -pArgs <- add_option( pArgs, c("-D", "--BiplotRotateMetadata"), type="character", action="store", dest="sRotateByMetadata", default=NULL, metavar="BiplotRotateMetadata", help="Metadata to use to rotate the biplot. Format 'Metadata,value'. 'Age,0.5' . [Default %default]") -pArgs <- add_option( pArgs, c("-B", "--BiplotShapes"), type="character", action="store", dest="sShapes", default=NULL, metavar="BiplotShapes", help="Specify shapes specifically for metadata or metadata values. [Default %default]") -pArgs <- add_option( pArgs, c("-b", "--BugCount"), type="integer", action="store", dest="iNumberBugs", default=3, metavar="PlottedBugCount", help="The number of bugs automatically selected from the data to plot. [Default %default]") -pArgs <- add_option( pArgs, c("-E", "--MetadataCount"), type="integer", action="store", dest="iNumberMetadata", default=NULL, metavar="PlottedMetadataCount", help="The number of metadata automatically selected from the data to plot. 
[Default all significant metadata and minimum is 1]") - -#pArgs <- add_option( pArgs, c("-c","--MFAFeatureCount"), type="integer", action="store", dest="iMFAMaxFeatures", default=3, metavar="maxMFAFeature", help="Number of features or number of bugs to plot (default=3; 3 metadata and 3 data).") - -main <- function( -### The main function manages the following: -### 1. Optparse arguments are checked -### 2. A logger is created if requested in the optional arguments -### 3. The custom R script is sourced. This is the input *.R script named -### the same as the input *.pcl file. This script contains custom formating -### of data and function calls to the MFA visualization. -### 4. Matrices are written to the project folder as they are read in seperately as metadata and data and merged together. -### 5. Data is cleaned with custom filtering if supplied in the *.R script. -### 6. Transformations occur if indicated by the optional arguments -### 7. Standard quality control is performed on data -### 8. Cleaned metadata and data are written to output project for documentation. -### 9. A regularization method is ran (boosting by default). -### 10. An analysis method is performed on the model (optionally boosted model). -### 11. Data is summarized and PDFs are created for significant associations -### (those whose q-values {BH FDR correction} are <= the threshold given in the optional arguments. 
-pArgs -### Parsed commandline arguments -){ - lsArgs <- parse_args( pArgs, positional_arguments = TRUE ) - #logdebug("lsArgs", c_logrMaaslin) - #logdebug(paste(lsArgs,sep=" "), c_logrMaaslin) - - # Parse parameters - lsForcedParameters = NULL - if(!is.null(lsArgs$options$strForcedPredictors)) - { - lsForcedParameters = unlist(strsplit(lsArgs$options$strForcedPredictors,",")) - } - xNoImpute = NULL - if(!is.null(lsArgs$options$strNoImpute)) - { - xNoImpute = unlist(strsplit(lsArgs$options$strNoImpute,"[,]")) - } - lsRandomCovariates = NULL - if(!is.null(lsArgs$options$strRandomCovariates)) - { - lsRandomCovariates = unlist(strsplit(lsArgs$options$strRandomCovariates,"[,]")) - } - lsFeaturesToPlot = NULL - if(!is.null(lsArgs$options$strBiplotPlotFeatures)) - { - lsFeaturesToPlot = unlist(strsplit(lsArgs$options$strBiplotPlotFeatures,"[,]")) - } - - #If logging is not an allowable value, inform user and set to INFO - if(length(intersect(names(loglevels), c(lsArgs$options$strVerbosity))) == 0) - { - print(paste("Maaslin::Error. Did not understand the value given for logging, please use any of the following: DEBUG,INFO,WARN,ERROR.")) - print(paste("Maaslin::Warning. 
Setting logging value to \"",strDefaultLogging,"\".")) - } - - # Do not allow mixed effect models and zero inflated models, don't have implemented - if(lsArgs$options$fZeroInflated && !is.null(lsArgs$options$strRandomCovariates)) - { - stop("MaAsLin Error:: The combination of zero inflated models and mixed effects models are not supported.") - } - - ### Create logger - c_logrMaaslin <- getLogger( "maaslin" ) - addHandler( writeToConsole, c_logrMaaslin ) - setLevel( lsArgs$options$strVerbosity, c_logrMaaslin ) - - #Get positional arguments - if( length( lsArgs$args ) != 2 ) { stop( print_help( pArgs ) ) } - ### Output file name - strOutputTXT <- lsArgs$args[1] - ### Input TSV data file - strInputTSV <- lsArgs$args[2] - - # Get analysis method options - # includes data transformations, model selection/regularization, regression models/links - lsArgs$options$strModelSelection = tolower(lsArgs$options$strModelSelection) - if(!lsArgs$options$strModelSelection %in% c("none","boost","penalized","forward","backward")) - { - logerror(paste("Received an invalid value for the selection argument, received '",lsArgs$options$strModelSelection,"'"), c_logrMaaslin) - stop( print_help( pArgs ) ) - } - lsArgs$options$strMethod = tolower(lsArgs$options$strMethod) - if(!lsArgs$options$strMethod %in% c("univariate","lm","neg_binomial","quasi")) - { - logerror(paste("Received an invalid value for the method argument, received '",lsArgs$options$strMethod,"'"), c_logrMaaslin) - stop( print_help( pArgs ) ) - } - lsArgs$options$strTransform = tolower(lsArgs$options$strTransform) - if(!lsArgs$options$strTransform %in% c("none","asinsqrt")) - { - logerror(paste("Received an invalid value for the transform/link argument, received '",lsArgs$options$strTransform,"'"), c_logrMaaslin) - stop( print_help( pArgs ) ) - } - - if(!lsArgs$options$strMultTestCorrection %in% c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY")) - { - logerror(paste("Received an invalid value for the multiple testing 
correction argument, received '",lsArgs$options$strMultTestCorrection,"'"), c_logrMaaslin) - stop( print_help( pArgs ) ) - } - - ### Necessary local import files - ### Check to make sure the lib is in the expected place (where the script is) - ### if not, then try the alternative lib location - ### This will happen if, for instance the script is linked or - ### on the path. - # Get the first choice relative path - initial.options <- commandArgs(trailingOnly = FALSE) - script.name <- sub("--file=", "", initial.options[grep("--file=", initial.options)]) - strDir = file.path( dirname( script.name ), "lib" ) - # If this does not have the lib file then go for the alt lib - if( !file.exists(strDir) ) - { - lsPotentialListLocations = dir( path = lsArgs$options$sAlternativeLibraryLocation, pattern = "lib", recursive = TRUE, include.dirs = TRUE) - if( length( lsPotentialListLocations ) > 0 ) - { - sLibraryPath = file.path( "maaslin","src","lib" ) - iLibraryPathLength = nchar( sLibraryPath ) - for( strSearchDir in lsPotentialListLocations ) - { - # Looking for the path where the end of the path is equal to the library path given earlier - # Also checks before hand to make sure the path is atleast as long as the library path so no errors occur - if ( substring( strSearchDir, 1 + nchar( strSearchDir ) - iLibraryPathLength ) == sLibraryPath ) - { - strDir = file.path( lsArgs$options$sAlternativeLibraryLocation, strSearchDir ) - break - } - } - } - } - - strSelf = basename( script.name ) - for( strR in dir( strDir, pattern = "*.R$" ) ) - { - if( strR == strSelf ) {next} - source( file.path( strDir, strR ) ) - } - - # Get analysis modules - afuncVariableAnalysis = funcGetAnalysisMethods(lsArgs$options$strModelSelection,lsArgs$options$strTransform,lsArgs$options$strMethod,lsArgs$options$fZeroInflated) - - # Set up parameters for variable selection - lxParameters = list(dFreq=lsArgs$options$dSelectionFrequency, dPAlpha=lsArgs$options$dPenalizedAlpha) - if((lsArgs$options$strMethod 
== "lm")||(lsArgs$options$strMethod == "univariate")) - { lxParameters$sFamily = "gaussian" - } else if(lsArgs$options$strMethod == "neg_binomial"){ lxParameters$sFamily = "binomial" - } else if(lsArgs$options$strMethod == "quasi"){ lxParameters$sFamily = "poisson"} - - #Indicate start - logdebug("Start MaAsLin", c_logrMaaslin) - #Log commandline arguments - logdebug("Commandline Arguments", c_logrMaaslin) - logdebug(lsArgs, c_logrMaaslin) - - ### Output directory for the study based on the requested output file - outputDirectory = dirname(strOutputTXT) - ### Base name for the project based on the read.config name - strBase <- sub("\\.[^.]*$", "", basename(strInputTSV)) - - ### Sources in the custom script - ### If the custom script is not there then - ### defaults are used and no custom scripts are ran - funcSourceScript <- function(strFunctionPath) - { - #If is specified, set up the custom func clean variable - #If the custom script is null then return - if(is.null(strFunctionPath)){return(NULL)} - - #Check to make sure the file exists - if(file.exists(strFunctionPath)) - { - #Read in the file - source(strFunctionPath) - } else { - #Handle when the file does not exist - stop(paste("MaAsLin Error: A custom data manipulation script was indicated but was not found at the file path: ",strFunctionPath,sep="")) - } - } - - #Read file - inputFileData = funcReadMatrices(lsArgs$options$strInputConfig, strInputTSV, log=TRUE) - if(is.null(inputFileData[[c_strMatrixMetadata]])) { names(inputFileData)[1] <- c_strMatrixMetadata } - if(is.null(inputFileData[[c_strMatrixData]])) { names(inputFileData)[2] <- c_strMatrixData } - - #Metadata and bug names - lsOriginalMetadataNames = names(inputFileData[[c_strMatrixMetadata]]) - lsOriginalFeatureNames = names(inputFileData[[c_strMatrixData]]) - - #Dimensions of the datasets - liMetaData = dim(inputFileData[[c_strMatrixMetadata]]) - liData = dim(inputFileData[[c_strMatrixData]]) - - #Merge data files together - frmeData = 
merge(inputFileData[[c_strMatrixMetadata]],inputFileData[[c_strMatrixData]],by.x=0,by.y=0) - #Reset rownames - row.names(frmeData) = frmeData[[1]] - frmeData = frmeData[-1] - - #Write QC files only in certain modes of verbosity - # Read in and merge files - if( c_logrMaaslin$level <= loglevels["DEBUG"] ) { - # If the QC internal file does not exist, make - strQCDir = file.path(outputDirectory,"QC") - dir.create(strQCDir, showWarnings = FALSE) - # Write metadata matrix before merge - funcWriteMatrices(dataFrameList=list(Metadata = inputFileData[[c_strMatrixMetadata]]), saveFileList=c(file.path(strQCDir,"metadata.tsv")), configureFileName=c(file.path(strQCDir,"metadata.read.config")), acharDelimiter="\t") - # Write data matrix before merge - funcWriteMatrices(dataFrameList=list(Data = inputFileData[[c_strMatrixData]]), saveFileList=c(file.path(strQCDir,"data.tsv")), configureFileName=c(file.path(strQCDir,"data.read.config")), acharDelimiter="\t") - #Record the data as it has been read - funcWriteMatrices(dataFrameList=list(Merged = frmeData), saveFileList=c(file.path(strQCDir,"read-Merged.tsv")), configureFileName=c(file.path(strQCDir,"read-Merged.read.config")), acharDelimiter="\t") - } - - #Data needed for the MaAsLin environment - #List of lists (one entry per file) - #Is contained by a container of itself - #lslsData = list() - #List - lsData = c() - - #List of metadata indicies - aiMetadata = c(1:liMetaData[2]) - lsData$aiMetadata = aiMetadata - #List of data indicies - aiData = c(1:liData[2])+liMetaData[2] - lsData$aiData = aiData - #Add a list to hold qc metrics and counts - lsData$lsQCCounts$aiDataInitial = aiData - lsData$lsQCCounts$aiMetadataInitial = aiMetadata - - #Raw data - lsData$frmeRaw = frmeData - - #Load script if it exists, stop on error - funcProcess <- NULL - if(!is.null(funcSourceScript(lsArgs$options$strInputR))){funcProcess <- get(c_strCustomProcessFunction)} - - #Clean the data and update the current data list to the cleaned data list - 
funcTransformData = afuncVariableAnalysis[[c_iTransform]] - lsQCCounts = list(aiDataCleaned = c(), aiMetadataCleaned = c()) - lsRet = list(frmeData=frmeData, aiData=aiData, aiMetadata=aiMetadata, lsQCCounts=lsQCCounts, liNaIndices=c()) - - viNotTransformedDataIndices = c() - if(!lsArgs$options$fNoQC) - { - c_logrMaaslin$info( "Running quality control." ) - lsRet = funcClean( frmeData=frmeData, funcDataProcess=funcProcess, aiMetadata=aiMetadata, aiData=aiData, lsQCCounts=lsData$lsQCCounts, astrNoImpute=xNoImpute, dMinSamp = lsArgs$options$dMinSamp, dMinAbd = lsArgs$options$dMinAbd, dFence=lsArgs$options$dOutlierFence, funcTransform=funcTransformData, dPOutlier=lsArgs$options$dPOutlier) - - viNotTransformedDataIndices = lsRet$viNotTransformedData - - #If using a count based model make sure all are integer (QCing can add in numeric values during interpolation for example) - if(lsArgs$options$strMethod %in% c_vCountBasedModels) - { - c_logrMaaslin$info( "Assuring the data matrix is integer." ) - for(iDataIndex in aiData) - { - lsRet$frmeData[ iDataIndex ] = round( lsRet$frmeData[ iDataIndex ] ) - } - } - } else { - c_logrMaaslin$info( "Not running quality control, attempting transform." ) - ### Need to do transform if the QC is not performed - iTransformed = 0 - for(iDataIndex in aiData) - { - if( ! 
funcTransformIncreasesOutliers( lsRet$frmeData[iDataIndex], funcTransformData ) ) - { - lsRet$frmeData[iDataIndex]=funcTransformData(lsRet$frmeData[iDataIndex]) - iTransformed = iTransformed + 1 - } else { - viNotTransformedDataIndices = c(viNotTransformedDataIndices, iDataIndex) - } - } - c_logrMaaslin$info(paste("Number of features transformed = ", iTransformed)) - } - - logdebug("lsRet", c_logrMaaslin) - logdebug(format(lsRet), c_logrMaaslin) - #Update the variables after cleaning - lsRet$frmeRaw = frmeData - lsRet$lsQCCounts$aiDataCleaned = lsRet$aiData - lsRet$lsQCCounts$aiMetadataCleaned = lsRet$aiMetadata - - #Add List of metadata string names - astrMetadata = colnames(lsRet$frmeData)[lsRet$aiMetadata] - lsRet$astrMetadata = astrMetadata - - # If plotting NA data reset the NA metadata indices to empty so they will not be excluded - if(lsArgs$options$fPlotNA) - { - lsRet$liNaIndices = list() - } - - #Write QC files only in certain modes of verbosity - if( c_logrMaaslin$level <= loglevels["DEBUG"] ) { - #Record the data after cleaning - funcWriteMatrices(dataFrameList=list(Cleaned = lsRet$frmeData[union(lsRet$aiMetadata,lsRet$aiData)]), saveFileList=c(file.path(strQCDir,"read_cleaned.tsv")), configureFileName=c(file.path(strQCDir,"read_cleaned.read.config")), acharDelimiter="\t") } - - #These variables will be used to count how many features get analysed - lsRet$lsQCCounts$iBoosts = 0 - lsRet$lsQCCounts$iBoostErrors = 0 - lsRet$lsQCCounts$iNoTerms = 0 - lsRet$lsQCCounts$iLms = 0 - - #Indicate if the residuals plots should occur - fDoRPlot=TRUE - #Should not occur for univariates - if(lsArgs$options$strMethod %in% c("univariate")){ fDoRPlot=FALSE } - - #Run analysis - alsRetBugs = funcBugs( frmeData=lsRet$frmeData, lsData=lsRet, aiMetadata=lsRet$aiMetadata, aiData=lsRet$aiData, aiNotTransformedData=viNotTransformedDataIndices, strData=strBase, dSig=lsArgs$options$dSignificanceLevel, fInvert=lsArgs$options$fInvert, - strDirOut=outputDirectory, 
funcReg=afuncVariableAnalysis[[c_iSelection]], funcTransform=funcTransformData, funcUnTransform=afuncVariableAnalysis[[c_iUnTransform]], lsNonPenalizedPredictors=lsForcedParameters, - funcAnalysis=afuncVariableAnalysis[[c_iAnalysis]], lsRandomCovariates=lsRandomCovariates, funcGetResults=afuncVariableAnalysis[[c_iResults]], fDoRPlot=fDoRPlot, fOmitLogFile=lsArgs$options$fOmitLogFile, - fAllvAll=lsArgs$options$fAllvAll, liNaIndices=lsRet$liNaIndices, lxParameters=lxParameters, strTestingCorrection=lsArgs$options$strMultTestCorrection, - fIsUnivariate=afuncVariableAnalysis[[c_iIsUnivariate]], fZeroInflated=lsArgs$options$fZeroInflated ) - - #Write QC files only in certain modes of verbosity - if( c_logrMaaslin$level <= loglevels["DEBUG"] ) { - funcWriteQCReport(strProcessFileName=file.path(strQCDir,"ProcessQC.txt"), lsQCData=alsRetBugs$lsQCCounts, liDataDim=liData, liMetadataDim=liMetaData) - - ### Write out the parameters used in the run - unlink(file.path(strQCDir,"Run_Parameters.txt")) - funcWrite("Parameters used in the MaAsLin run", file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Optional input read.config file=",lsArgs$options$strInputConfig), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Optional R file=",lsArgs$options$strInputR), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("FDR threshold for pdf generation=",lsArgs$options$dSignificanceLevel), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Minimum relative abundance=",lsArgs$options$dMinAbd), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Minimum percentage of samples with measurements=",lsArgs$options$dMinSamp), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("The fence used to define outliers with a quantile based analysis. If set to 0, the Grubbs test was used=",lsArgs$options$dOutlierFence), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Ignore if the Grubbs test was not used. 
The significance level used as a cut-off to define outliers=",lsArgs$options$dPOutlier), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("These covariates are treated as random covariates and not fixed covariates=",lsArgs$options$strRandomCovariates), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("The type of multiple testing correction used=",lsArgs$options$strMultTestCorrection), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Zero inflated inference models were turned on=",lsArgs$options$fZeroInflated), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Feature selection step=",lsArgs$options$strModelSelection), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Statistical inference step=",lsArgs$options$strMethod), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Numeric transform used=",lsArgs$options$strTransform), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Quality control was run=",!lsArgs$options$fNoQC), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("These covariates were forced into each model=",lsArgs$options$strForcedPredictors), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("These features' data were not changed by QC processes=",lsArgs$options$strNoImpute), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Output verbosity=",lsArgs$options$strVerbosity), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Log file was generated=",!lsArgs$options$fOmitLogFile), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Data plots were inverted=",lsArgs$options$fInvert), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Ignore unless boosting was used. 
The threshold for the rel.inf used to select features=",lsArgs$options$dSelectionFrequency), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("All verses all inference method was used=",lsArgs$options$fAllvAll), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Ignore unless penalized feature selection was used. Alpha to determine the type of penalty=",lsArgs$options$dPenalizedAlpha), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined metadata scale=",lsArgs$options$dBiplotMetadataScale), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined metadata used to color the plot=",lsArgs$options$strBiplotColor), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined metadata used to dictate the shapes of the plot markers=",lsArgs$options$strBiplotShapeBy), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined user requested features to plot=",lsArgs$options$strBiplotPlotFeatures), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined metadata used to rotate the plot ordination=",lsArgs$options$sRotateByMetadata), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined custom shapes for metadata=",lsArgs$options$sShapes), file.path(strQCDir,"Run_Parameters.txt")) - funcWrite(paste("Biplot parameter, user defined number of bugs to plot =",lsArgs$options$iNumberBugs), file.path(strQCDir,"Run_Parameters.txt")) - } - - ### Write summary table - # Summarize output files based on a keyword and a significance threshold - # Look for less than or equal to the threshold (appropriate for p-value and q-value type measurements) - # DfSummary is sorted by the q.value when it is returned - dfSummary = funcSummarizeDirectory(astrOutputDirectory=outputDirectory, - strBaseName=strBase, - 
astrSummaryFileName=file.path(outputDirectory,paste(strBase,c_sSummaryFileSuffix, sep="")), - astrKeyword=c_strKeywordEvaluatedForInclusion, - afSignificanceLevel=lsArgs$options$dSignificanceLevel) - - if( !is.null( dfSummary ) ) - { - ### Start biplot - # Get metadata of interest and reduce to default size - lsSigMetadata = unique(dfSummary[[1]]) - if( is.null( lsArgs$options$iNumberMetadata ) ) - { - lsSigMetadata = lsSigMetadata[ 1:length( lsSigMetadata ) ] - } else { - lsSigMetadata = lsSigMetadata[ 1:min( length( lsSigMetadata ), max( lsArgs$options$iNumberMetadata, 1 ) ) ] - } - - # Convert to indices (ordered numerically here) - liSigMetadata = which( colnames( lsRet$frmeData ) %in% lsSigMetadata ) - - # Get bugs of interest and reduce to default size - lsSigBugs = unique(dfSummary[[2]]) - - # Reduce the bugs to the right size - if(lsArgs$options$iNumberBugs < 1) - { - lsSigBugs = c() - } else if( is.null( lsArgs$options$iNumberBugs ) ) { - lsSigBugs = lsSigBugs[ 1 : length( lsSigBugs ) ] - } else { - lsSigBugs = lsSigBugs[ 1 : lsArgs$options$iNumberBugs ] - } - - # Set color by and shape by features if not given - # Selects the continuous (for color) and factor (for shape) data with the most significant association - if(is.null(lsArgs$options$strBiplotColor)||is.null(lsArgs$options$strBiplotShapeBy)) - { - for(sMetadata in lsSigMetadata) - { - if(is.factor(lsRet$frmeRaw[[sMetadata]])) - { - if(is.null(lsArgs$options$strBiplotShapeBy)) - { - lsArgs$options$strBiplotShapeBy = sMetadata - if(!is.null(lsArgs$options$strBiplotColor)) - { - break - } - } - } - if(is.numeric(lsRet$frmeRaw[[sMetadata]])) - { - if(is.null(lsArgs$options$strBiplotColor)) - { - lsArgs$options$strBiplotColor = sMetadata - if(!is.null(lsArgs$options$strBiplotShapeBy)) - { - break - } - } - } - } - } - - #If a user defines a feature, make sure it is in the bugs/data indices - if(!is.null(lsFeaturesToPlot) || !is.null(lsArgs$options$strBiplotColor) || 
!is.null(lsArgs$options$strBiplotShapeBy)) - { - lsCombinedFeaturesToPlot = unique(c(lsFeaturesToPlot,lsArgs$options$strBiplotColor,lsArgs$options$strBiplotShapeBy)) - lsCombinedFeaturesToPlot = lsCombinedFeaturesToPlot[!is.null(lsCombinedFeaturesToPlot)] - - # If bugs to plot were given then do not use the significant bugs from the MaAsLin output which is default - if(!is.null(lsFeaturesToPlot)) - { - lsSigBugs = c() - liSigMetadata = c() - } - liSigMetadata = unique(c(liSigMetadata,which(colnames(lsRet$frmeData) %in% setdiff(lsCombinedFeaturesToPlot, lsOriginalFeatureNames)))) - lsSigBugs = unique(c(lsSigBugs, intersect(lsCombinedFeaturesToPlot, lsOriginalFeatureNames))) - } - - # Convert bug names and metadata names to comma delimited strings - vsBugs = paste(lsSigBugs,sep=",",collapse=",") - vsMetadata = paste(colnames(lsRet$frmeData)[liSigMetadata],sep=",",collapse=",") - vsMetadataByLevel = c() - - # Possibly remove the NA levels depending on the preferences - vsRemoveNA = c(NA, "NA", "na", "Na", "nA") - if(!lsArgs$options$fPlotNA){ vsRemoveNA = c() } - for(aiMetadataIndex in liSigMetadata) - { - lxCurMetadata = lsRet$frmeData[[aiMetadataIndex]] - sCurName = names(lsRet$frmeData[aiMetadataIndex]) - if(is.factor(lxCurMetadata)) - { - vsMetadataByLevel = c(vsMetadataByLevel,paste(sCurName, setdiff( levels(lxCurMetadata), vsRemoveNA),sep="_")) - } else { - vsMetadataByLevel = c(vsMetadataByLevel,sCurName) - } - } - - # If NAs should not be plotted, make them the background color - # Unless explicitly asked to be plotted - sPlotNAColor = "white" - if(lsArgs$options$fInvert){sPlotNAColor = "black"} - if(lsArgs$options$fPlotNA){sPlotNAColor = "grey"} - sLastMetadata = lsOriginalMetadataNames[max(which(lsOriginalMetadataNames %in% names(lsRet$frmeData)))] - - # Plot biplot - logdebug("PlotBiplot:Started") - funcDoBiplot( - sBugs = vsBugs, - sMetadata = vsMetadataByLevel, - sColorBy = lsArgs$options$strBiplotColor, - sPlotNAColor = sPlotNAColor, - sShapeBy = 
lsArgs$options$strBiplotShapeBy, - sShapes = lsArgs$options$sShapes, - sDefaultMarker = "16", - sRotateByMetadata = lsArgs$options$sRotateByMetadata, - dResizeArrow = lsArgs$options$dBiplotMetadataScale, - sInputFileName = lsRet$frmeRaw, - sLastMetadata = sLastMetadata, - sOutputFileName = file.path(outputDirectory,paste(strBase,"-biplot.pdf",sep=""))) - logdebug("PlotBiplot:Stopped") - } -} - -# This is the equivalent of __name__ == "__main__" in Python. -# That is, if it's true we're being called as a command line script; -# if it's false, we're being sourced or otherwise included, such as for -# library or inlinedocs. -if( identical( environment( ), globalenv( ) ) && - !length( grep( "^source\\(", sys.calls( ) ) ) ) { - main( pArgs ) } diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/MaaslinToGraphlanAnnotation.py --- a/maaslin-4450aa4ecc84/src/MaaslinToGraphlanAnnotation.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,213 +0,0 @@ -#!/usr/bin/env python -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -__author__ = "Timothy Tickle" -__copyright__ = "Copyright 2012" -__credits__ = ["Timothy Tickle"] -__license__ = "" -__version__ = "" -__maintainer__ = "Timothy Tickle" -__email__ = "ttickle@sph.harvard.edu" -__status__ = "Development" - -import argparse -import csv -import math -from operator import itemgetter -import re -import string -import sys - -#def funcGetColor(fNumeric,fMax): -# if fNumeric>0: -# return("#"+str(int(99*fNumeric/fMax)).zfill(2)+"0000") -# if fNumeric<0: -# return("#00"+str(int(99*abs(fNumeric/fMax))).zfill(2)+"00") -# return("#000000") - -def funcGetColor(fNumeric): - if fNumeric>0: - return sRingPositiveColor - else: - return sRingNegativeColor - -def funcGetAlpha(fNumeric,fMax): - return max(abs(fNumeric/fMax),dMinAlpha) - -#Constants -sAnnotation = "annotation" -sAnnotationColor = "annotation_background_color" -sClass = "class" -sRingAlpha = "ring_alpha" -dMinAlpha = .075 -sRingColor = "ring_color" -sRingHeight = "ring_height" -#sRingHeightMin = 0.5 -sStandardizedRingHeight = "1.01" -sRingLabel = "ring_label" -sRingLabelSizeWord = "ring_label_font_size" -sRingLabelSize = 10 -sRingLineColor = "#999999" -sRingPositiveWord = "Positive" -sRingPositiveColor = "#990000" -sRingNegativeWord = "Negative" -sRingNegativeColor = "#009900" -sRingLineColorWord = "ring_separator_color" -sRingLineThickness = "0.5" -sRingLineThicknessWord = "ring_internal_separator_thickness" 
-sCladeMarkerColor = "clade_marker_color" -sCladeMarkerSize = "clade_marker_size" -sHighlightedMarkerSize = "10" -c_dMinDoubleValue = 0.00000000001 - -#Set up arguments reader -argp = argparse.ArgumentParser( prog = "MaaslinToGraphlanAnnotation.py", - description = """Converts summary files to graphlan annotation files.""" ) - -#### Read in information -#Arguments -argp.add_argument("strInputSummary", metavar = "SummaryFile", type = argparse.FileType("r"), help ="Input summary file produced by maaslin") -argp.add_argument("strInputCore", metavar = "CoreFile", type = argparse.FileType("r"), help ="Core file produced by Graphlan from the maaslin pcl") -argp.add_argument("strInputHeader", metavar = "HeaderFile", type = argparse.FileType("r"), help ="Input header file to append to the generated annotation file.") -argp.add_argument("strOutputAnnotation", metavar = "AnnotationFile", type = argparse.FileType("w"), help ="Output annotation file for graphlan") - -args = argp.parse_args( ) - -#Read in the summary file and transform to class based descriptions -csvSum = open(args.strInputSummary,'r') if isinstance(args.strInputSummary, str) else args.strInputSummary -fSum = csv.reader(csvSum, delimiter="\t") -#Skip header (until i do this a better way) -fSum.next() - -#Extract associations (Metadata,taxon,coef,qvalue) -lsAssociations = [[sLine[1],sLine[2],sLine[4],sLine[7]] for sLine in fSum] -csvSum.close() - -#### Read in default graphlan settings provided by maaslin -#Read in the annotation header file -csvHdr = open(args.strInputHeader,'r') if isinstance(args.strInputHeader, str) else args.strInputHeader -fHdr = csv.reader(csvHdr, delimiter="\t") - -#Begin writting the output -#Output annotation file -csvAnn = open(args.strOutputAnnotation,'w') if isinstance(args.strOutputAnnotation, str) else args.strOutputAnnotation -fAnn = csv.writer(csvAnn, delimiter="\t") -fAnn.writerows(fHdr) -csvHdr.close() - -#If no associatiosn were found -if(len(lsAssociations)==0): - 
csvAnn.close() - -else: - #### Fix name formats - #Manipulate names to graphlan complient names (clades seperated by .) - lsAssociations = sorted(lsAssociations, key=itemgetter(1)) - lsAssociations = [[sBug[0]]+[re.sub("^[A-Za-z]__","",sBug[1])]+sBug[2:] for sBug in lsAssociations] - lsAssociations = [[sBug[0]]+[re.sub("\|*[A-Za-z]__|\|",".",sBug[1])]+sBug[2:] for sBug in lsAssociations] - - #If this is an OTU, append the number and the genus level together for a more descriptive termal name - lsAssociationsModForOTU = [] - for sBug in lsAssociations: - lsBug = sBug[1].split(".") - if(len(lsBug))> 1: - if(lsBug[-1].isdigit()): - lsBug[-2]=lsBug[-2]+"_"+lsBug[-1] - lsBug = lsBug[0:-1] - lsAssociationsModForOTU.append([sBug[0]]+[".".join(lsBug)]+sBug[2:]) - else: - lsAssociationsModForOTU.append([sBug[0]]+[lsBug[0]]+sBug[2:]) - - #Extract just class info - #lsClassData = [[sLine[2],sClass,sLine[1]] for sLine in fSum] - - ### Make rings - #Setup rings - dictRings = dict([[enumData[1],enumData[0]] for enumData in enumerate(set([lsData[0] for lsData in lsAssociationsModForOTU]))]) - - #Ring graphlan setting: rings represent a metadata that associates with a feature - #Rings have a line to help differetiate them - lsRingSettings = [[sRingLabel,lsPair[1],lsPair[0]] for lsPair in dictRings.items()] - lsRingLineColors = [[sRingLineColorWord,lsPair[1],sRingLineColor] for lsPair in dictRings.items()] - lsRingLineThick = [[sRingLineThicknessWord,lsPair[1],sRingLineThickness] for lsPair in dictRings.items()] - lsRingLineLabelSize = [[sRingLabelSizeWord,lsPair[1], sRingLabelSize] for lsPair in dictRings.items()] - - #Create coloring for rings color represents the directionality of the relationship - dMaxCoef = max([abs(float(sAssociation[2])) for sAssociation in lsAssociationsModForOTU]) - lsRingColors = [[lsAssociation[1], sRingColor, dictRings[lsAssociation[0]], funcGetColor(float(lsAssociation[2]))] for lsAssociation in lsAssociationsModForOTU] - lsRingAlpha = 
[[lsAssociation[1], sRingAlpha, dictRings[lsAssociation[0]], funcGetAlpha(float(lsAssociation[2]), dMaxCoef)] for lsAssociation in lsAssociationsModForOTU] - - #Create height for rings representing the log tranformed q-value? - dMaxQValue = max([-1*math.log(max(float(sAssociation[3]), c_dMinDoubleValue)) for sAssociation in lsAssociationsModForOTU]) - #lsRingHeights = [[lsAssociation[1], sRingHeight, dictRings[lsAssociation[0]], ((-1*math.log(max(float(lsAssociation[3]), c_dMinDoubleValue)))/dMaxQValue)+sRingHeightMin] for lsAssociation in lsAssociationsModForOTU] - lsRingHeights = [[lsAssociation[1], sRingHeight, dictRings[lsAssociation[0]], sStandardizedRingHeight] for lsAssociation in lsAssociationsModForOTU] - - #### Marker - # Marker colors (mainly to make legend - lsMarkerColors = [[lsAssociation[1], sCladeMarkerColor, funcGetColor(float(lsAssociation[2]))] for lsAssociation in lsAssociationsModForOTU] - lsMarkerSizes = [[lsAssociation[1], sCladeMarkerSize, sHighlightedMarkerSize] for lsAssociation in lsAssociationsModForOTU] - - #### Make internal highlights - #Highlight the associated clades - lsUniqueAssociatedTaxa = sorted(list(set([lsAssociation[1] for lsAssociation in lsAssociationsModForOTU]))) - - lsHighlights = [] - sABCPrefix = "" - sListABC = string.ascii_lowercase - iListABCIndex = 0 - for lsHighlight in lsUniqueAssociatedTaxa: - lsTaxa = lsHighlight.split(".") - sLabel = sABCPrefix+sListABC[iListABCIndex]+":"+lsTaxa[-1] if len(lsTaxa) > 2 else lsTaxa[-1] - lsHighlights.append([lsHighlight, sAnnotation, sLabel]) - iListABCIndex = iListABCIndex + 1 - if iListABCIndex > 25: - iListABCIndex = 0 - sABCPrefix = sABCPrefix + sListABC[len(sABCPrefix)] - - #Read in the core file - csvCore = open(args.strInputCore,'r') if isinstance(args.strInputCore, str) else args.strInputCore - fSum = csv.reader(csvCore, delimiter="\t") - - #Add in all phylum just incase they were not already included here - lsAddSecondLevel = list(set([sUnique[0].split(".")[1] for 
sUnique in fSum if len(sUnique[0].split(".")) > 1])) - lsHighlights.extend([[sSecondLevel, sAnnotation, sSecondLevel] for sSecondLevel in lsAddSecondLevel]) - lsHighlightColor = [[lsHighlight[0], sAnnotationColor,"b"] for lsHighlight in lsHighlights] - - #### Write the remaining output annotation file - fAnn.writerows(lsRingSettings) - fAnn.writerows(lsRingLineColors) - fAnn.writerows(lsRingColors) - fAnn.writerows(lsRingAlpha) - fAnn.writerows(lsRingLineThick) - fAnn.writerows(lsRingLineLabelSize) - fAnn.writerows(lsRingHeights) - fAnn.writerows(lsMarkerColors) - fAnn.writerows(lsMarkerSizes) - fAnn.writerows([[sRingPositiveWord, sCladeMarkerColor, sRingPositiveColor]]) - fAnn.writerows([[sRingNegativeWord, sCladeMarkerColor, sRingNegativeColor]]) - fAnn.writerows(lsHighlights) - fAnn.writerows(lsHighlightColor) - csvAnn.close() diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/PCLToGraphlanCoreGene.py --- a/maaslin-4450aa4ecc84/src/PCLToGraphlanCoreGene.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,168 +0,0 @@ -#!/usr/bin/env python -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. 
-# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -__author__ = "Timothy Tickle" -__copyright__ = "Copyright 2012" -__credits__ = ["Timothy Tickle"] -__license__ = "" -__version__ = "" -__maintainer__ = "Timothy Tickle" -__email__ = "ttickle@sph.harvard.edu" -__status__ = "Development" - -import argparse -import csv -from operator import itemgetter -import re -import sys - -#Helper function which returns a boolean indicator of an input string being parsable as an int -def funcIsInt(strInt): - try: - int(strInt) - return True - except: - return False - -#Helper function that gets the index of the name and gives the last value of the list for - or the first value depending on the position -# This supports the ranging in the read.config files -#If no range is given then the result is just one index of the given name -def funcGetIndices(lsFeature, lsFunctionNames): - if(len(lsFeature)) == 1: - if(funcIsInt(lsFeature[0])): - return int(lsFeature[0])-1 - return [lsFeatureNames.index(lsFeature[0])] - if(len(lsFeature)) == 2: - iIndices = [] - iPosition = 1 - for sFeature in lsFeature: - if(sFeature==""): - if(iPosition==1): - iIndices.append(2) - elif(iPosition==2): - iIndices.append(len(lsFunctionNames)-1) - elif(funcIsInt(sFeature)): - 
iIndices.append(int(sFeature)-1) - else: - iIndices.append(lsFeatureNames.index(sFeature)) - iPosition = iPosition + 1 - return iIndices - -#Constants -#The line indicating the rows to read -c_MatrixName = "Matrix:" -c_DataMatrix = "Abundance" -c_strRows = "Read_PCL_Rows:" - -#Set up arguments reader -argp = argparse.ArgumentParser( prog = "PCLToGraphlanCoreGene.py", - description = """Converts PCL files to Graphlan core gene files.""" ) - -#Arguments -argp.add_argument("strInputPCL", metavar = "PCLFile", type = argparse.FileType("r"), help ="Input PCl file used in maaslin") -argp.add_argument("strInputRC", metavar = "RCFile", type = argparse.FileType("r"), help ="Input read config file used in maaslin") -argp.add_argument("strOutputCoreGene", metavar = "CoreGeneFile", type = argparse.FileType("w"), help ="Output core gene file for graphlan") - -args = argp.parse_args( ) - -#Read in read config table and get the rows/columns to use -#Indicates if we are reading a data matrix -fIsData = False -#Holds the indices ranges -#List of lists,each internal list hold 1 or 2 indices, if two it indicates a range from the first to the second -llsIndices = [] -csvRC = open(args.strInputRC,'r') if isinstance(args.strInputRC, str) else args.strInputRC -fRC = csv.reader(csvRC, delimiter=" ") -for sLine in fRC: - #Get the row indices or names - if len(sLine): - if sLine[0] == c_MatrixName: - fIsData = sLine[1] == c_DataMatrix - if sLine[0] == c_strRows: - if fIsData: - llsIndices = [sIndexRange.split("-") for sIndexRange in sLine[1].split(",")] - break -csvRC.close() - -# Check to make sure RC file is read -if len(llsIndices)==0: - print("PCLToGraphlanCoreGene:: Could Not find indices in RC file "+args.strInputRC+".") - -#Read in the PCL file and parse the file names to core genes format -csvPCL = open(args.strInputPCL,'r') if isinstance(args.strInputPCL, str) else args.strInputPCL -fPCL = csv.reader(csvPCL,delimiter="\t") -#The first column of the csv file -lsFeatureNames = 
[sLine[0] for sLine in fPCL] -csvPCL.close() - -# Check to make sure PCL file is read -if len(lsFeatureNames)==0: - print("PCLToGraphlanCoreGene:: Could Not find features in PCL file "+args.strInputPCL+".") - -#If the indices are names switch with numbers otherwise subtract 1 because they are ment for R -liConvertedRangedIndices = [funcGetIndices(sIndex,lsFeatureNames) for sIndex in llsIndices] if len(llsIndices)>0 else [] -llsIndices = None - -#If there are any ranges, reduce to lists of indices -liConvertedIndices = [] -for lsIndices in liConvertedRangedIndices: - lsIndices.sort() - iLenIndices = len(lsIndices) - if iLenIndices > 2: - print "Error, received more than 2 indices in a range. Stopped." - exit() - liConvertedIndices.extend(lsIndices if iLenIndices == 1 else range(lsIndices[0],lsIndices[1]+1)) -liConvertedRangedIndices = None - -#Collapse all indices to a set which is then sorted -liConvertedIndices = sorted(list(set(liConvertedIndices))) - -#Reduce name of features to just bugs indicated by indices -lsFeatureNames = itemgetter(*liConvertedIndices)(lsFeatureNames) -liConvertedIndices = None - -#Change the bug names to the correct formatting (clades seperated by .) 
-lsFeatureNames = sorted(lsFeatureNames) -lsFeatureNames = [re.sub("^[A-Za-z]__","",sBug) for sBug in lsFeatureNames] -lsFeatureNames = [[re.sub("\|*[A-Za-z]__|\|",".",sBug)] for sBug in lsFeatureNames] - -#If this is an OTU, append the number and the genus level together for a more descriptive termal name -lsFeatureNamesModForOTU = [] -for sBug in lsFeatureNames: - lsBug = sBug[0].split(".") - if(len(lsBug))> 1: - if(lsBug[-1].isdigit()): - lsBug[-2]=lsBug[-2]+"_"+lsBug[-1] - lsBug = lsBug[0:-1] - lsFeatureNamesModForOTU.append([".".join(lsBug)]) - else: - lsFeatureNamesModForOTU.append([lsBug[0]]) - -#Output core gene file -csvCG = open(args.strOutputCoreGene,'w') if isinstance(args.strOutputCoreGene, str) else args.strOutputCoreGene -fCG = csv.writer(csvCG) -fCG.writerows(lsFeatureNamesModForOTU) -csvCG.close() diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/AnalysisModules.R --- a/maaslin-4450aa4ecc84/src/lib/AnalysisModules.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1237 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Allows one to plug in new modules to perform analysis (univariate or multivariate), regularization, and data (response) transformation. -) { return( pArgs ) } - -# Libraries -suppressMessages(library( agricolae, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for the pot-hoc Kruskal wallis comparisons -suppressMessages(library( penalized, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for stepAIC -suppressMessages(library( MASS, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for na action behavior -suppressMessages(library( gam, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for boosting -suppressMessages(library( gbm, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for LASSO -suppressMessages(library( glmnet, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -# Needed for mixed models -#suppressMessages(library( lme4, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -suppressMessages(library( nlme, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) - -# Needed for zero inflated models -#suppressMessages(library( MCMCglmm, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -suppressMessages(library( pscl, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) -suppressMessages(library( gamlss, warn.conflicts=FALSE, quietly=TRUE, 
verbose=FALSE)) -# Do not use #suppressMessages(library( glmmADMB, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) - -fAddBack = TRUE -dUnevenMax = .9 - - -### Helper functions -# OK -funcMakeContrasts <- function -### Makes univariate contrasts of all predictors in the model formula with the response. -(strFormula, -### lm style string defining reponse and predictors -strRandomFormula, -### mixed model string defining the fixed covariates -frmeTmp, -### The data frame to find predictor data in -iTaxon, -### Taxon -functionContrast, -### functionContrast The univariate test to perform -lsQCCounts -### QC info -){ - #TODO are we updating the QCCounts? - lsSig = list() - ### Holds all the significance results from the tests - adP = c() - ### Holds the p-values - sCurDataName = names(frmeTmp)[iTaxon] - ### The name of the taxon (data row) that is being associated (always assumed to be numeric) - #Get test comparisons (predictor names from formula string) - asComparisons = unique(c(funcFormulaStrToList(strFormula),funcFormulaStrToList(strRandomFormula))) - - #Change metadata in formula to univariate comparisons - for(sComparison in asComparisons) - { - # Metadata values - vxTest = frmeTmp[[sComparison]] - - # Get the levels in the comparison - # Can ignore the first level because it is the reference level - asLevels = sComparison - if(is.factor(vxTest)){asLevels = levels(vxTest)[2:length(vxTest)]} - - lasComparisonResults = functionContrast(x=sComparison, adCur=frmeTmp[[sCurDataName]], dfData=frmeTmp) - for(asComparison in lasComparisonResults) - { - if( is.na( asComparison$p.value ) ) { next } - # Get pvalue - adP = c(adP, asComparison$p.value) - # Get SD, if not available, give SD of covariate - dSTD = asComparison$SD - # TODO Is using sd on factor and binary data correct? 
- if(is.na(dSTD) || is.null(dSTD)){dSTD = sd(vxTest)} - - lsSig[[length( lsSig ) + 1]] <- list( - #Current metadata name (string column name) ok - name = sComparison, - #Current metadatda name (string, as a factor level if existing as such) ok - orig = asComparison$name, - #Taxon feature name (string) ok - taxon = colnames( frmeTmp )[iTaxon], - #Taxon data / response (double vector) ok - data = frmeTmp[,iTaxon], - #Name of column ok - factors = sComparison, - #Metadata values (metadata as a factor or raw numeric) ok - metadata = vxTest, - #Current coefficient value (named coef value with level name (from coefs) ok - value = asComparison$coef, - #Standard deviation (numeric) ok - std = dSTD, - #Model coefficients (output from coefs with intercept) ok - allCoefs = asComparison$coef) - } - } - return(list(adP=adP, lsSig=lsSig, lsQCCounts=lsQCCounts)) - ### Returns a list of p-value, standard deviation, and comparison which produced the p-value -} - -#Ok -funcGetStepPredictors <- function -### Retrieve the predictors of the reduced model after stepwise model selection -(lmod, -### Linear model resulting from step-wise model selection -frmeTmp, -strLog -### File to document logging -){ - #Document - funcWrite( "#model", strLog ) - funcWrite( lmod$fit, strLog ) -#TODO funcWrite( lmod$train.error, strLog ) -#TODO funcWrite( lmod$Terms, strLog ) - funcWrite( "#summary-gbm", strLog ) - funcWrite( summary(lmod), strLog ) - - #Get Names from coefficients - asStepCoefsFactors = coefficients(lmod) - astrCoefNames = setdiff(names(asStepCoefsFactors[as.vector(!is.na(asStepCoefsFactors))==TRUE]),"(Intercept)") - asStepCoefsFactors = unique(as.vector(sapply(astrCoefNames,funcCoef2Col, frmeData=frmeTmp))) - - if(length(asStepCoefsFactors)<1){ return(NA) } - return( asStepCoefsFactors ) - ### Vector of string predictor names -} - -funcGetUnivariateResults <- function -### Reduce the list of list of results to the correct format -( llmod, -### The list of univariate models -frmeData, 
-### Data analysis is performed on -liTaxon, -### The response id -dSig, -### Significance level for q-values -adP, -### List of pvalues from all associations performed -lsSig, -### List of information from the lm containing, metadata name, metadatda name (as a factor level if existing as such), Taxon feature name, Taxon data / response, All levels, Metadata values, Current coeficient value, Standard deviation, Model coefficients -strLog, -### File to which to document logging -lsQCCounts, -### Records of counts associated with quality control -lastrCols, -### Predictors used in the association -asSuppressCovariates=c() -### Vector of covariates to suppress and not give results for -){ - adP = c() - lsSig = list() - for(lmod in llmod) - { - adP = c(adP,lmod$adP) - lsSig = c(lsSig,lmod$lsSig) - } - return(list(adP=adP, lsSig=lsSig, lsQCCounts=llmod[length(llmod)]$lsQCCounts)) -} - -# OK -funcGetLMResults <- function -### Reduce the lm object return to just the data needed for further analysis -( llmod, -### The result from a linear model -frmeData, -### Data analysis is performed on -liTaxon, -### The response id -dSig, -### Significance level for q-values -adP, -### List of pvalues from all associations performed -lsSig, -### List of information from the lm containing, metadata name, metadatda name (as a factor level if existing as such), Taxon feature name, Taxon data / response, All levels, Metadata values, Current coeficient value, Standard deviation, Model coefficients -strLog, -### File to which to document logging -lsQCCounts, -### Records of counts associated with quality control -lastrCols, -### Predictors used in the association -asSuppressCovariates=c() -### Vector of covariates to suppress and not give results for -){ - ilmodIndex = 0 - for( lmod in llmod ) - { - ilmodIndex = ilmodIndex + 1 - lmod = llmod[[ ilmodIndex ]] - iTaxon = liTaxon[[ ilmodIndex ]] - astrCols = lastrCols[[ ilmodIndex ]] - - #Exclude none and errors - if( !is.na( lmod ) && ( class( 
lmod ) != "try-error" ) ) - { - #holds the location of the pvlaues if an lm, if lmm is detected this will be changed - iPValuePosition = 4 - - #Get the column name of the iTaxon index - #frmeTmp needs to be what? - strTaxon = colnames( frmeData )[iTaxon] - #Get summary information from the linear model - lsSum = try( summary( lmod ) ) - #The following can actually happen when the stranger regressors return broken results - if( class( lsSum ) == "try-error" ) - { - next - } - - #Write summary information to log file - funcWrite( "#model", strLog ) - funcWrite( lmod, strLog ) - funcWrite( "#summary", strLog ) - #Unbelievably, some of the more unusual regression methods crash out when _printing_ their results - try( funcWrite( lsSum, strLog ) ) - - #Get the coefficients - #This should work for linear models - frmeCoefs <- try( coefficients( lsSum ) ) - - if( ( class(frmeCoefs ) == "try-error" ) || is.null( frmeCoefs ) ) - { - adCoefs = try( coefficients( lmod )) - if(class( adCoefs ) == "try-error") - { - adCoefs = coef(lmod) - } - frmeCoefs <- NA - } else { - if( class( frmeCoefs ) == "list" ) - { - frmeCoefs <- frmeCoefs$count - } - adCoefs = frmeCoefs[,1] - } - - #Go through each coefficient - astrRows <- names( adCoefs ) - - ##lmm - if( is.null( astrRows ) ) - { - astrRows = rownames( lsSum$tTable ) - frmeCoefs = lsSum$tTable - iPValuePosition = 5 - adCoefs = frmeCoefs[,1] - } - - for( iMetadata in 1:length( astrRows ) ) - { - #Current coef which is being evaluated - strOrig = astrRows[ iMetadata ] - #Skip y interscept - if( strOrig %in% c("(Intercept)", "Intercept", "Log(theta)") ) { next } - #Skip suppressed covariates - if( funcCoef2Col( strOrig, frmeData ) %in% asSuppressCovariates){ next } - - #Extract pvalue and std in standard model - dP = frmeCoefs[ strOrig, iPValuePosition ] - dStd = frmeCoefs[ strOrig, 2 ] - - #Attempt to extract the pvalue and std in mixed effects summary - #Could not get the pvalue so skip result - if( is.nan( dP ) || is.na( dP ) || 
is.null( dP ) ) { next } - - dCoef = adCoefs[ iMetadata ] - - #Setting adMetadata - #Metadata values - strMetadata = funcCoef2Col( strOrig, frmeData, astrCols ) - if( is.na( strMetadata ) ) - { - if( substring( strOrig, nchar( strOrig ) - 1 ) == "NA" ) { next } - c_logrMaaslin$error( "Unknown coefficient: %s", strOrig ) - } - if( substring( strOrig, nchar( strMetadata ) + 1 ) == "NA" ) { next } - adMetadata <- frmeData[, strMetadata ] - - # Store (factor level modified) p-value - # Store general results for each coef - adP <- c( adP, dP ) - - # Append to the list of information about associations - lsSig[[ length( lsSig ) + 1 ]] <- list( - # Current metadata name - name = strMetadata, - # Current metadatda name (as a factor level if existing as such) - orig = strOrig, - # Taxon feature name - taxon = strTaxon, - # Taxon data / response - data = frmeData[, iTaxon ], - # All levels - factors = c( strMetadata ), - # Metadata values - metadata = adMetadata, - # Current coeficient value - value = dCoef, - # Standard deviation - std = dStd, - # Model coefficients - allCoefs = adCoefs ) - } - } - } - return( list( adP = adP, lsSig = lsSig, lsQCCounts = lsQCCounts ) ) - ### List containing a list of pvalues, a list of significant data per association, and a list of QC data -} - -funcGetZeroInflatedResults <- function -### Reduce the lm object return to just the data needed for further analysis -( llmod, -### The result from a linear model -frmeData, -### Data analysis is performed on -liTaxon, -### The response id -dSig, -### Significance level for q-values -adP, -### List of pvalues from all associations performed -lsSig, -### List of information from the lm containing, metadata name, metadatda name (as a factor level if existing as such), Taxon feature name, Taxon data / response, All levels, Metadata values, Current coeficient value, Standard deviation, Model coefficients -strLog, -### File to which to document logging -lsQCCounts, -### Records of counts associated with 
quality control -lastrCols, -### Predictors used in the association -asSuppressCovariates=c() -### Vector of covariates to suppress and not give results for -){ - ilmodIndex = 0 - for(lmod in llmod) - { - ilmodIndex = ilmodIndex + 1 - lmod = llmod[[ilmodIndex]] - iTaxon = liTaxon[[ilmodIndex]] - astrCols = lastrCols[[ilmodIndex]] - - #Exclude none and errors - if( !is.na( lmod ) && ( class( lmod ) != "try-error" ) ) - { - #holds the location of the pvlaues if an lm, if lmm is detected this will be changed - iPValuePosition = 4 - - #Get the column name of the iTaxon index - #frmeTmp needs to be what? - strTaxon = colnames( frmeData )[iTaxon] - - #Write summary information to log file - funcWrite( "#model", strLog ) - funcWrite( lmod, strLog ) - - #Get the coefficients - #This should work for linear models - frmeCoefs <- summary(lmod) - if(! is.null( frmeCoefs$coefficients$count ) ) # Needed for zeroinfl - { - frmeCoefs = frmeCoefs$coefficients$count - } - adCoefs = frmeCoefs[,1] - names(adCoefs) = row.names(frmeCoefs) - - funcWrite( "#Coefs", strLog ) - funcWrite( frmeCoefs, strLog ) - - #Go through each coefficient - astrRows <- row.names( frmeCoefs ) - - for( iMetadata in 1:length( astrRows ) ) - { - #Current coef which is being evaluated - strOrig = astrRows[iMetadata] - #Skip y interscept - if( strOrig %in% c("(Intercept)", "Intercept", "Log(theta)") ) { next } - #Skip suppressed covariates - if( funcCoef2Col(strOrig,frmeData) %in% asSuppressCovariates){next} - - #Extract pvalue and std in standard model - dP = frmeCoefs[strOrig, iPValuePosition] - if(is.nan(dP)){next} - dStd = frmeCoefs[strOrig,2] - - dCoef = adCoefs[iMetadata] - - #Setting adMetadata - #Metadata values - strMetadata = funcCoef2Col( strOrig, frmeData, astrCols ) - if( is.na( strMetadata ) ) - { - if( substring( strOrig, nchar( strOrig ) - 1 ) == "NA" ) { next } - c_logrMaaslin$error( "Unknown coefficient: %s", strOrig ) - } - if( substring( strOrig, nchar( strMetadata ) + 1 ) == "NA" ) { next } 
-        adMetadata <- frmeData[,strMetadata]
-
-        #Store (factor level modified) p-value
-        #Store general results for each coef
-        adP <- c(adP, dP)
-        lsSig[[length( lsSig ) + 1]] <- list(
-          #Current metadata name
-          name = strMetadata,
-          #Current metadata name (as a factor level if existing as such)
-          orig = strOrig,#
-          #Taxon feature name
-          taxon = strTaxon,
-          #Taxon data / response
-          data = frmeData[,iTaxon],
-          #All levels
-          factors = c(strMetadata),
-          #Metadata values
-          metadata = adMetadata,
-          #Current coefficient value
-          value = dCoef,
-          #Standard deviation
-          std = dStd,
-          #Model coefficients
-          allCoefs = adCoefs)
-      }
-    }
-  }
-  return(list(adP=adP, lsSig=lsSig, lsQCCounts=lsQCCounts))
-  ### List containing a list of pvalues, a list of significant data per association, and a list of QC data
-}
-
-oldfuncGetZeroInflatedResults <- function
-### Reduce the lm object return to just the data needed for further analysis
-( llmod,
-### The result from a linear model
-frmeData,
-### Data analysis is performed on
-liTaxon,
-### The response id
-dSig,
-### Significance level for q-values
-adP,
-### List of pvalues from all associations performed
-lsSig,
-### List of information from the lm containing, metadata name, metadata name (as a factor level if existing as such), Taxon feature name, Taxon data / response, All levels, Metadata values, Current coefficient value, Standard deviation, Model coefficients
-strLog,
-### File to which to document logging
-lsQCCounts,
-### Records of counts associated with quality control
-lastrCols,
-### Predictors used in the association
-asSuppressCovariates=c()
-### Vector of covariates to suppress and not give results for
-){
-  ilmodIndex = 0
-  for(lmod in llmod)
-  {
-    ilmodIndex = ilmodIndex + 1
-    lmod = llmod[[ilmodIndex]]
-    iTaxon = liTaxon[[ilmodIndex]]
-    astrCols = lastrCols[[ilmodIndex]]
-
-    #Exclude none and errors
-    if( !is.na( lmod ) && ( class( lmod ) != "try-error" ) )
-    {
-      #holds the location of the p-values if an lm, if lmm is
detected this will be changed
-      iPValuePosition = 4
-
-      #Get the column name of the iTaxon index
-      #frmeTmp needs to be what?
-      strTaxon = colnames( frmeData )[iTaxon]
-
-      #Write summary information to log file
-      funcWrite( "#model", strLog )
-      funcWrite( lmod, strLog )
-
-      #Get the coefficients
-      #This should work for linear models
-      frmeCoefs <- summary(lmod)
-      adCoefs = frmeCoefs[,1]
-      names(adCoefs) = row.names(frmeCoefs)
-
-      #Go through each coefficient
-      astrRows <- row.names( frmeCoefs )
-
-      for( iMetadata in 1:length( astrRows ) )
-      {
-        #Current coef which is being evaluated
-        strOrig = astrRows[iMetadata]
-        #Skip y intercept
-        if( strOrig %in% c("(Intercept)", "Intercept", "Log(theta)") ) { next }
-        #Skip suppressed covariates
-        if( funcCoef2Col(strOrig,frmeData) %in% asSuppressCovariates){next}
-
-        #Extract pvalue and std in standard model
-        dP = frmeCoefs[strOrig, iPValuePosition]
-        dStd = frmeCoefs[strOrig,2]
-
-        dCoef = adCoefs[iMetadata]
-
-        #Setting adMetadata
-        #Metadata values
-        strMetadata = funcCoef2Col( strOrig, frmeData, astrCols )
-        if( is.na( strMetadata ) )
-        {
-          if( substring( strOrig, nchar( strOrig ) - 1 ) == "NA" ) { next }
-          c_logrMaaslin$error( "Unknown coefficient: %s", strOrig )
-        }
-        if( substring( strOrig, nchar( strMetadata ) + 1 ) == "NA" ) { next }
-        adMetadata <- frmeData[,strMetadata]
-
-        #Store (factor level modified) p-value
-        #Store general results for each coef
-        adP <- c(adP, dP)
-        lsSig[[length( lsSig ) + 1]] <- list(
-          #Current metadata name
-          name = strMetadata,
-          #Current metadata name (as a factor level if existing as such)
-          orig = strOrig,#
-          #Taxon feature name
-          taxon = strTaxon,
-          #Taxon data / response
-          data = frmeData[,iTaxon],
-          #All levels
-          factors = c(strMetadata),
-          #Metadata values
-          metadata = adMetadata,
-          #Current coefficient value
-          value = dCoef,
-          #Standard deviation
-          std = dStd,
-          #Model coefficients
-          allCoefs = adCoefs)
-      }
-    }
-  }
-  return(list(adP=adP, lsSig=lsSig, lsQCCounts=lsQCCounts))
-
### List containing a list of pvalues, a list of significant data per association, and a list of QC data
-}
-
-
-notfuncGetZeroInflatedResults <- function
-### Reduce the lm object return to just the data needed for further analysis
-( llmod,
-### The result from a linear model
-frmeData,
-### Data analysis is performed on
-liTaxon,
-### The response id
-dSig,
-### Significance level for q-values
-adP,
-### List of pvalues from all associations performed
-lsSig,
-### List of information from the lm containing, metadata name, metadata name (as a factor level if existing as such), Taxon feature name, Taxon data / response, All levels, Metadata values, Current coefficient value, Standard deviation, Model coefficients
-strLog,
-### File to which to document logging
-lsQCCounts,
-### Records of counts associated with quality control
-lastrCols,
-### Predictors used in the association
-asSuppressCovariates=c()
-### Vector of covariates to suppress and not give results for
-){
-  ilmodIndex = 0
-  for(lmod in llmod)
-  {
-    ilmodIndex = ilmodIndex + 1
-    lmod = llmod[[ilmodIndex]]
-    iTaxon = liTaxon[[ilmodIndex]]
-    astrCols = lastrCols[[ilmodIndex]]
-
-    #Exclude none and errors
-    if( !is.na( lmod ) && ( class( lmod ) != "try-error" ) )
-    {
-      #holds the location of the p-values if an lm, if lmm is detected this will be changed
-      iPValuePosition = 4
-
-      #Get the column name of the iTaxon index
-      #frmeTmp needs to be what?
-      strTaxon = colnames( frmeData )[iTaxon]
-
-      #Write summary information to log file
-      funcWrite( "#model", strLog )
-      funcWrite( lmod, strLog )
-
-      #Get the coefficients
-      #This should work for linear models
-      frmeCoefs <- summary(lmod)
-      frmeCoefs = frmeCoefs$coefficients$count
-      funcWrite( "#Coefs", strLog )
-      funcWrite( frmeCoefs, strLog )
-
-      adCoefs = frmeCoefs[,1]
-      names(adCoefs) = row.names(frmeCoefs)
-
-      #Go through each coefficient
-      astrRows <- row.names( frmeCoefs )
-
-      for( iMetadata in 1:length( astrRows ) )
-      {
-        #Current coef which is being evaluated
-        strOrig = astrRows[iMetadata]
-
-        #Skip y intercept
-        if( strOrig %in% c("(Intercept)", "Intercept", "Log(theta)") ) { next }
-        #Skip suppressed covariates
-        if( funcCoef2Col(strOrig,frmeData) %in% asSuppressCovariates){next}
-
-        #Extract pvalue and std in standard model
-        dP = frmeCoefs[strOrig, iPValuePosition]
-        if(is.nan(dP)){ next }
-        dStd = frmeCoefs[strOrig,2]
-        dCoef = adCoefs[iMetadata]
-
-        #Setting adMetadata
-        #Metadata values
-        strMetadata = funcCoef2Col( strOrig, frmeData, astrCols )
-        if( is.na( strMetadata ) )
-        {
-          if( substring( strOrig, nchar( strOrig ) - 1 ) == "NA" ) { next }
-          c_logrMaaslin$error( "Unknown coefficient: %s", strOrig )
-        }
-        if( substring( strOrig, nchar( strMetadata ) + 1 ) == "NA" ) { next }
-        adMetadata <- frmeData[,strMetadata]
-
-        #Store (factor level modified) p-value
-        #Store general results for each coef
-        adP <- c(adP, dP)
-        lsSig[[length( lsSig ) + 1]] <- list(
-          #Current metadata name
-          name = strMetadata,
-          #Current metadata name (as a factor level if existing as such)
-          orig = strOrig,#
-          #Taxon feature name
-          taxon = strTaxon,
-          #Taxon data / response
-          data = frmeData[,iTaxon],
-          #All levels
-          factors = c(strMetadata),
-          #Metadata values
-          metadata = adMetadata,
-          #Current coefficient value
-          value = dCoef,
-          #Standard deviation
-          std = dStd,
-          #Model coefficients
-          allCoefs = adCoefs)
-      }
-    }
-  }
-  return(list(adP=adP, lsSig=lsSig,
lsQCCounts=lsQCCounts))
-  ### List containing a list of pvalues, a list of significant data per association, and a list of QC data
-}
-
-
-### Options for variable selection
-# OK
-funcBoostModel <- function(
-### Perform model selection / regularization with boosting
-strFormula,
-### The formula of the full model before boosting
-frmeTmp,
-### The data on which to perform analysis
-adCur,
-### The response data
-lsParameters,
-### User controlled parameters needed specific to boosting
-lsForcedParameters = NULL,
-### Force these predictors to be in the model
-strLog
-### File to which to document logging
-){
-  funcWrite( c("#Boost formula", strFormula), strLog )
-  lmod = try( gbm( as.formula( strFormula ), data=frmeTmp, distribution="laplace", verbose=FALSE, n.minobsinnode=min(10, round(0.1 * nrow( frmeTmp ) ) ), n.trees=1000 ) )
-#TODO# lmod = try( gbm( as.formula( strFormula ), data=frmeTmp, distribution="gaussian", verbose=FALSE, n.minobsinnode=min(10, round(0.1 * nrow( frmeTmp ) ) ), n.trees=1000 ) )
-
-  astrTerms <- c()
-  if( !is.na( lmod ) && ( class( lmod ) != "try-error" ) )
-  {
-    #Get boosting summary results
-    lsSum <- summary( lmod, plotit = FALSE )
-
-    #Document
-    funcWrite( "#model-gbm", strLog )
-    funcWrite( lmod$fit, strLog )
-    funcWrite( lmod$train.error, strLog )
-    funcWrite( lmod$Terms, strLog )
-    funcWrite( "#summary-gbm", strLog )
-    funcWrite( lsSum, strLog )
-
-    # Uneven metadata
-    vstrUneven = c()
-    # Kept metadata
-    vstrKeepMetadata = c()
-
-    #Select model predictors
-    #Check the frequency of selection and skip if not selected more than set threshold dFreq
-    for( strMetadata in lmod$var.names )
-    {
-      #Get the name of the metadata
-      strTerm <- funcCoef2Col( strMetadata, frmeTmp, c(astrMetadata, astrGenetics) )
-
-      #Add back in uneven metadata
-      if(fAddBack)
-      {
-        ldMetadata = frmeTmp[[strMetadata]]
-        if(length(which(table(ldMetadata)/length(ldMetadata)>dUnevenMax))>0)
-        {
-          astrTerms <- c(astrTerms, strTerm)
-          vstrUneven =
c(vstrUneven,strMetadata)
-          next
-        }
-      }
-
-      #If the selprob is less than a certain frequency, skip
-      dSel <- lsSum$rel.inf[which( lsSum$var == strMetadata )] / 100
-      if( is.na(dSel) || ( dSel <= lsParameters$dFreq ) ){ next }
-#TODO# if( is.na(dSel) || ( dSel < lsParameters$dFreq ) ){ next }
-
-      #If you should ignore the metadata, continue
-      if( is.null( strTerm ) ) { next }
-
-      #If you can't find the metadata name, write
-      if( is.na( strTerm ) )
-      {
-        c_logrMaaslin$error( "Unknown coefficient: %s", strMetadata )
-        next
-      }
-
-      #Collect metadata names
-      astrTerms <- c(astrTerms, strTerm)
-      vstrKeepMetadata = c(vstrKeepMetadata,strTerm)
-    }
-  } else { astrTerms = lsForcedParameters }
-
-# funcBoostInfluencePlot(vdRelInf=lsSum$rel.inf, sFeature=lsParameters$sBugName, vsPredictorNames=lsSum$var, vstrKeepMetadata=vstrKeepMetadata, vstrUneven=vstrUneven)
-
-  return(unique(c(astrTerms,lsForcedParameters)))
-  ### Return a vector of predictor names to use in a reduced model
-}
-
-#Glmnet default is to standardize the variables.
-#used as an example for implementation
-#http://r.789695.n4.nabble.com/estimating-survival-times-with-glmnet-and-coxph-td4614225.html
-funcPenalizedModel <- function(
-### Perform penalized regularization for variable selection
-strFormula,
-### The formula of the full model before boosting
-frmeTmp,
-### The data on which to perform analysis
-adCur,
-### The response data
-lsParameters,
-### User controlled parameters needed specific to boosting
-lsForcedParameters = NULL,
-### Force these predictors to be in the model
-strLog
-### File to which to document logging
-){
-  #Convert the data frame to a model matrix
-  mtrxDesign = model.matrix(as.formula(strFormula), data=frmeTmp)
-
-  #Cross validate the lambda
-  cvRet = cv.glmnet(x=mtrxDesign,y=adCur,alpha=lsParameters$dPAlpha)
-
-  #Perform lasso
-  glmnetMod = glmnet(x=mtrxDesign,y=adCur,family=lsParameters$family,alpha=lsParameters$dPAlpha,lambda=cvRet$lambda.min)
-
-  #Get non zero beta and return column names for covariate names.
-  ldBeta = glmnetMod$beta[,which.max(glmnetMod$dev.ratio)]
-  ldBeta = names(ldBeta[which(abs(ldBeta)>0)])
-  return(sapply(ldBeta,funcCoef2Col,frmeData=frmeTmp))
-}
-
-# OK
-funcForwardModel <- function(
-### Perform model selection with forward stepwise selection
-strFormula,
-### lm style string defining response and predictors
-frmeTmp,
-### Data on which to perform analysis
-adCur,
-### Response data
-lsParameters,
-### User controlled parameters needed specific to boosting
-lsForcedParameters = NULL,
-### Force these predictors to be in the model
-strLog
-### File to which to document logging
-){
-  funcWrite( c("#Forward formula", strFormula), strLog )
-
-  strNullFormula = "adCur ~ 1"
-  if(!is.null(lsForcedParameters))
-  {
-    strNullFormula = paste( "adCur ~", paste( sprintf( "`%s`", lsForcedParameters ), collapse = " + " ))
-  }
-  lmodNull <- try( lm(as.formula( strNullFormula ), data=frmeTmp))
-  lmodFull <- try( lm(as.formula( strFormula ), data=frmeTmp ))
-  if(!("try-error" %in%
c(class( lmodNull ),class( lmodFull ))))
-  {
-    lmod = stepAIC(lmodNull, scope=list(lower=lmodNull,upper=lmodFull), direction="forward", trace=0)
-    return(funcGetStepPredictors(lmod, frmeTmp, strLog))
-  }
-  return( lsForcedParameters )
-  ### Return a vector of predictor names to use in a reduced model or NA on error
-}
-
-# OK
-# Select model with backwards selection
-funcBackwardsModel <- function(
-### Perform model selection with backwards stepwise selection
-strFormula,
-### lm style string defining response and predictors
-frmeTmp,
-### Data on which to perform analysis
-adCur,
-### Response data
-lsParameters,
-### User controlled parameters needed specific to boosting
-lsForcedParameters = NULL,
-### Force these predictors to be in the model
-strLog
-### File to which to document logging
-){
-  funcWrite( c("#Backwards formula", strFormula), strLog )
-
-  strNullFormula = "adCur ~ 1"
-  if(!is.null(lsForcedParameters))
-  {
-    strNullFormula = paste( "adCur ~", paste( sprintf( "`%s`", lsForcedParameters ), collapse = " + " ))
-  }
-
-  lmodNull <- try( lm(as.formula( strNullFormula ), data=frmeTmp))
-  lmodFull <- try( lm(as.formula( strFormula ), data=frmeTmp ))
-
-  if(! class( lmodFull ) == "try-error" )
-  {
-    lmod = stepAIC(lmodFull, scope=list(lower=lmodNull, upper=lmodFull), direction="backward")
-    return(funcGetStepPredictors(lmod, frmeTmp, strLog))
-  } else {
-    return( lsForcedParameters ) }
-  ### Return a vector of predictor names to use in a reduced model or NA on error
-}
-
-### Analysis methods
-### Univariate options
-
-# Sparse Dir.
Model
-#TODO# Implement in sfle
-
-# Tested
-# Correlation
-# NOTE: Ignores the idea of random and fixed covariates
-funcSpearman <- function(
-### Perform multiple univariate comparisons producing spearman correlations for association
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsQCCounts,
-### List recording anything important to QC
-strRandomFormula = NULL
-### Has the formula for random covariates
-){
-  return(funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon,
-    functionContrast=function(x,adCur,dfData)
-    {
-      retList = list()
-      ret = cor.test(as.formula(paste("~",x,"+ adCur")), data=dfData, method="spearman", na.action=c_strNA_Action)
-      #Returning rho for the coef in a named vector
-      vdCoef = c()
-      vdCoef[[x]]=ret$estimate
-      retList[[1]]=list(p.value=ret$p.value,SD=sd(dfData[[x]]),name=x,coef=vdCoef)
-      return(retList)
-    }, lsQCCounts))
-  ### List of contrast information, pvalue, contrast and std per univariate test
-}
-
-# Tested
-# Wilcoxon (T-Test)
-# NOTE: Ignores the idea of random and fixed covariates
-funcWilcoxon <- function(
-### Perform multiple univariate comparisons performing wilcoxon tests on discontinuous data with 2 levels
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsQCCounts,
-### List recording anything important to QC
-strRandomFormula = NULL
-### Has the formula for random covariates
-){
-  return(funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon,
-    functionContrast=function(x,adCur,dfData)
-    {
-      retList = list()
-      ret = wilcox.test(as.formula(paste("adCur",x,sep=" ~ ")), data=dfData,
na.action=c_strNA_Action)
-      #Returning the test statistic for the coef in a named vector
-      vdCoef = c()
-      vdCoef[[x]]=ret$statistic
-      retList[[1]]=list(p.value=ret$p.value,SD=sd(dfData[[x]]),name=x,coef=vdCoef)
-      return(retList)
-    }, lsQCCounts))
-  ### List of contrast information, pvalue, contrast and std per univariate test
-}
-
-# Tested
-# Kruskal.Wallis (Nonparametric anova)
-# NOTE: Ignores the idea of random and fixed covariates
-funcKruskalWallis <- function(
-### Perform multiple univariate comparisons performing Kruskal-Wallis rank sum tests on discontinuous data with more than 2 levels
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsQCCounts,
-### List recording anything important to QC
-strRandomFormula = NULL
-### Has the formula for random covariates
-){
-  return(funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon,
-    functionContrast=function(x,adCur,dfData)
-    {
-      retList = list()
-      lmodKW = kruskal(adCur,dfData[[x]],group=FALSE,p.adj="holm")
-
-      asLevels = levels(dfData[[x]])
-      # The names of the generated comparisons, sometimes the control is first sometimes it is not so
-      # We will just check which is in the names and use that
-      asComparisons = row.names(lmodKW$comparisons)
-      #Get the comparison with the control
-      for(sLevel in asLevels[2:length(asLevels)])
-      {
-        sComparison = intersect(c(paste(asLevels[1],sLevel,sep=" - "),paste(sLevel,asLevels[1],sep=" - ")),asComparisons)
-        #Returning the difference for the coef in a named vector
-        vdCoef = c()
-        vdCoef[[paste(x,sLevel,sep="")]]=lmodKW$comparisons[sComparison,"Difference"]
-#        vdCoef[[paste(x,sLevel,sep="")]]=NA
-        retList[[length(retList)+1]]=list(p.value=lmodKW$comparisons[sComparison,"p.value"],SD=1.0,name=paste(x,sLevel,sep=""),coef=vdCoef)
-      }
-      return(retList)
-    }, lsQCCounts))
-  ### List of contrast
information, pvalue, contrast and std per univariate test
-}
-
-# Tested
-# NOTE: Ignores the idea of random and fixed covariates
-funcDoUnivariate <- function(
-### Perform multiple univariate comparisons, choosing a Wilcoxon, Kruskal-Wallis, or Spearman test per covariate
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsHistory,
-### List recording p-values, association information, and QC counts
-strRandomFormula = NULL,
-### Has the formula for random covariates
-fZeroInflate = FALSE
-){
-  if(fZeroInflate)
-  {
-    throw("There are no zero-inflated univariate models to perform your analysis.")
-  }
-
-  # Get covariates
-  astrCovariates = unique(c(funcFormulaStrToList(strFormula),funcFormulaStrToList(strRandomFormula)))
-
-  # For each covariate
-  for(sCovariate in astrCovariates)
-  {
-    ## Check to see if it is discrete
-    axData = frmeTmp[[sCovariate]]
-    lsRet = NA
-    if(is.factor(axData) || is.logical(axData))
-    {
-      ## If discrete check how many levels
-      lsDataLevels = levels(axData)
-      ## If 2 levels do wilcoxon test
-      if(length(lsDataLevels) < 3)
-      {
-        lsRet = funcWilcoxon(strFormula=paste("adCur",sCovariate,sep=" ~ "), frmeTmp=frmeTmp, iTaxon=iTaxon, lsQCCounts=lsHistory$lsQCCounts)
-      } else {
-        ## If 3 or more levels do kruskal wallis test
-        lsRet = funcKruskalWallis(strFormula=paste("adCur",sCovariate,sep=" ~ "), frmeTmp=frmeTmp, iTaxon=iTaxon, lsQCCounts=lsHistory$lsQCCounts)
-      }
-    } else {
-      ## If not discrete do spearman test (list(adP=adP, lsSig=lsSig, lsQCCounts=lsQCCounts))
-      lsRet = funcSpearman(strFormula=paste("adCur",sCovariate,sep=" ~ "), frmeTmp=frmeTmp, iTaxon=iTaxon, lsQCCounts=lsHistory$lsQCCounts)
-    }
-    lsHistory[["adP"]] = c(lsHistory[["adP"]], lsRet[["adP"]])
-    lsHistory[["lsSig"]] = c(lsHistory[["lsSig"]], lsRet[["lsSig"]])
-    lsHistory[["lsQCCounts"]] = lsRet[["lsQCCounts"]]
-  }
-  return(lsHistory)
-}
-
-### Multivariate
-
-# Tested
-funcLM <- function(
-### Perform vanilla linear regression
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsHistory,
-### List recording p-values, association information, and QC counts
-strRandomFormula = NULL,
-### Has the formula for random covariates
-fZeroInflated = FALSE
-### Turns on the zero inflated model
-){
-  adCur = frmeTmp[,iTaxon]
-  if(fZeroInflated)
-  {
-    return(try(gamlss(formula=as.formula(strFormula), family=BEZI, data=frmeTmp))) # gamlss
-  } else {
-    if(!is.null(strRandomFormula))
-    {
-      return(try(glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=gaussian(link="identity"), data=frmeTmp)))
-      #lme4 package but does not have pvalues for the fixed variables (have to use a mcmcsamp/pvals.fnc function which are currently disabled)
-      #return(try( lmer(as.formula(strFormula), data=frmeTmp, na.action=c_strNA_Action) ))
-    } else {
-      return(try( lm(as.formula(strFormula), data=frmeTmp, na.action=c_strNA_Action) ))
-    }
-  }
-  ### lmod result object from lm
-}
-
-# Tested
-funcBinomialMult <- function(
-### Perform linear regression with negative binomial link
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsHistory,
-### List recording p-values, association information, and QC counts
-strRandomFormula = NULL,
-### Has the formula for random covariates
-fZeroInflated = FALSE
-### Turns on the zero inflated model
-){
-  adCur = frmeTmp[,iTaxon]
-  if(fZeroInflated)
-  {
-    return(try(zeroinfl(as.formula(strFormula), data=frmeTmp, dist="negbin"))) # pscl
-#    return(try(gamlss(as.formula(strFormula), family=ZINBI, data=frmeTmp))) # pscl
-  } else {
-
if(!is.null(strRandomFormula))
-    {
-      throw("This analysis flow is not completely developed, please choose an option other than negative binomial with random covariates")
-      #TODO need to estimate the theta
-      #return(try(glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=negative.binomial(theta = 2, link=log), data=frmeTmp)))
-      #lme4 package but does not have pvalues for the fixed variables (have to use a mcmcsamp/pvals.fnc function which are currently disabled)
-    } else {
-      return(try( glm.nb(as.formula(strFormula), data=frmeTmp, na.action=c_strNA_Action )))
-    }
-  }
-  ### lmod result object from lm
-}
-
-# Tested
-funcQuasiMult <- function(
-### Perform linear regression with quasi-poisson link
-strFormula,
-### lm style string defining response and predictors, for mixed effects models this holds the fixed variables
-frmeTmp,
-### Data on which to perform analysis
-iTaxon,
-### Index of the response data
-lsHistory,
-### List recording p-values, association information, and QC counts
-strRandomFormula = NULL,
-### Has the formula for random covariates
-fZeroInflated = FALSE
-### Turns on a zero inflated model
-){
-  adCur = frmeTmp[,iTaxon]
-  if(fZeroInflated)
-  {
-#    return(try(gamlss(formula=as.formula(strFormula), family=ZIP, data=frmeTmp))) # gamlss
-    return(try(zeroinfl(as.formula(strFormula), data=frmeTmp, dist="poisson"))) # pscl
-  } else {
-    #Check to see if | is in the model, if so use a lmm otherwise the standard glm is ok
-    if(!is.null(strRandomFormula))
-    {
-      return(try(glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family= quasipoisson, data=frmeTmp)))
-      #lme4 package but does not have pvalues for the fixed variables (have to use a mcmcsamp/pvals.fnc function which are currently disabled)
-      #return(try ( glmer(as.formula(strFormula), data=frmeTmp, family=quasipoisson, na.action=c_strNA_Action) ))
-    } else {
-      return(try( glm(as.formula(strFormula), family=quasipoisson, data=frmeTmp,
na.action=c_strNA_Action) ))
-    }
-  }
-  ### lmod result object from lm
-}
-
-### Transformations
-# Tested
-funcArcsinSqrt <- function(
-# Transform data with arcsin sqrt transformation
-aData
-### The data on which to perform the transformation
-){
-  return(asin(sqrt(aData)))
-  ### Transformed data
-}
-
-funcSquareSin <- function(
-# Transform data with square sin transformation
-# Opposite of the funcArcsinSqrt
-aData
-### The data on which to perform the transformation
-){
-  return(sin(aData)^2)
-  ### Transformed data
-}
-
-# Tested
-funcNoTransform <-function(
-### Pass data without transform
-aData
-### The data on which to perform the transformation
-### Only given here to preserve the pattern, not used.
-){
-  return(aData)
-  ### Transformed data
-}
-
-funcGetAnalysisMethods <- function(
-### Returns the appropriate functions for regularization, analysis, data transformation, and analysis object inspection.
-### This allows modular customization per analysis step.
-### To add a new method insert an entry in the switch for either the selection, transform, or method
-### Insert them by using the pattern optparse_keyword_without_quotes = function_in_AnalysisModules
-### Order in the return list is currently set and expected to be selection, transforms/links, analysis method
-### none returns null
-sModelSelectionKey,
-### Keyword defining the method of model selection
-sTransformKey,
-### Keyword defining the method of data transformation
-sMethodKey,
-### Keyword defining the method of analysis
-fZeroInflated = FALSE
-### Indicates if using zero inflated models
-){
-  lRetMethods = list()
-  #Insert selection methods here
-  lRetMethods[[c_iSelection]] = switch(sModelSelectionKey,
-    boost = funcBoostModel,
-    penalized = funcPenalizedModel,
-    forward = funcForwardModel,
-    backward = funcBackwardsModel,
-    none = NA)
-
-  #Insert transforms
-  lRetMethods[[c_iTransform]] = switch(sTransformKey,
-    asinsqrt = funcArcsinSqrt,
-    none = funcNoTransform)
-
-  #Insert
untransform
-  lRetMethods[[c_iUnTransform]] = switch(sTransformKey,
-    asinsqrt = funcNoTransform,
-    none = funcNoTransform)
-
-  #Insert analysis
-  lRetMethods[[c_iAnalysis]] = switch(sMethodKey,
-    neg_binomial = funcBinomialMult,
-    quasi = funcQuasiMult,
-    univariate = funcDoUnivariate,
-    lm = funcLM,
-    none = NA)
-
-  # If a univariate method is used it is required to set this to true
-  # For correct handling.
-  lRetMethods[[c_iIsUnivariate]]=sMethodKey=="univariate"
-
-  #Insert method to get results
-  if(fZeroInflated)
-  {
-    lRetMethods[[c_iResults]] = switch(sMethodKey,
-      neg_binomial = funcGetZeroInflatedResults,
-      quasi = funcGetZeroInflatedResults,
-      univariate = funcGetUnivariateResults,
-      lm = funcGetZeroInflatedResults,
-      none = NA)
-  } else {
-    lRetMethods[[c_iResults]] = switch(sMethodKey,
-      neg_binomial = funcGetLMResults,
-      quasi = funcGetLMResults,
-      univariate = funcGetUnivariateResults,
-      lm = funcGetLMResults,
-      none = NA)
-  }
-
-  return(lRetMethods)
-  ### Returns a list of functions to be passed for regularization, data transformation, analysis,
-  ### and custom analysis results introspection functions to pull from return objects data of interest
-}
diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/BoostGLM.R
--- a/maaslin-4450aa4ecc84/src/lib/BoostGLM.R Sun Feb 08 23:21:34 2015 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,887 +0,0 @@
-#####################################################################################
-#Copyright (C) <2012>
-#
-#Permission is hereby granted, free of charge, to any person obtaining a copy of
-#this software and associated documentation files (the "Software"), to deal in the
-#Software without restriction, including without limitation the rights to use, copy,
-#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
-#and to permit persons to whom the Software is furnished to do so, subject to
-#the following conditions:
-#
-#The above copyright notice and
this permission notice shall be included in all copies
-#or substantial portions of the Software.
-#
-#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
-#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
-#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-#
-# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models),
-# authored by the Huttenhower lab at the Harvard School of Public Health
-# (contact Timothy Tickle, ttickle@hsph.harvard.edu).
-#####################################################################################
-
-inlinedocs <- function(
-##author<< Curtis Huttenhower and Timothy Tickle
-##description<< Manages the quality control of data and the performance of analysis (univariate or multivariate), regularization, and data (response) transformation.
-) { return( pArgs ) }
-
-### Load libraries quietly
-suppressMessages(library( gam, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-suppressMessages(library( gbm, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-suppressMessages(library( logging, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-suppressMessages(library( outliers, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-suppressMessages(library( robustbase, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-suppressMessages(library( pscl, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE))
-
-### Get constants
-#source(file.path("input","maaslin","src","Constants.R"))
-#source("Constants.R")
-
-## Get logger
-c_logrMaaslin <- getLogger( "maaslin" )
-
-funcDoGrubbs <- function(
-### Use the Grubbs Test to identify outliers
-iData,
-### Column index in the data frame to test
-frmeData,
-### The data frame holding the data
-dPOutlier,
-### P-value threshold to indicate an outlier is significant
-lsQC
-### List holding the QC info of the cleaning step. Which indices are outliers is added.
-){
-  adData <- frmeData[,iData]
-
-  # Original number of NA
-  viNAOrig = which(is.na(adData))
-
-  while( TRUE )
-  {
-    lsTest <- try( grubbs.test( adData ), silent = TRUE )
-    if( ( class( lsTest ) == "try-error" ) || is.na( lsTest$p.value ) || ( lsTest$p.value > dPOutlier ) )
-    {break}
-    viOutliers = outlier( adData, logical = TRUE )
-    adData[viOutliers] <- NA
-  }
-
-  # Record removed data
-  viNAAfter = which(is.na(adData))
-
-  # If all were set to NA then ignore the filtering
-  if(length(adData)==length(viNAAfter))
-  {
-    viNAAfter = viNAOrig
-    adData = frmeData[,iData]
-    c_logrMaaslin$info( paste("Grubbs Test:: Identified all data as outliers so was inactivated for index=",iData," data=",paste(as.vector(frmeData[,iData]),collapse=","), "number zeros=", length(which(frmeData[,iData]==0)), sep = " " ))
-  } else if(mean(adData, na.rm=TRUE) == 0) {
-    viNAAfter = viNAOrig
-    adData = frmeData[,iData]
-    c_logrMaaslin$info( paste("Grubbs Test::Removed all values but 0, ignored. Index=",iData,".",sep=" " ) )
-  } else {
-    # Document removal
-    if( sum( is.na( adData ) ) )
-    {
-      c_logrMaaslin$info( "Grubbs Test::Removing %d outliers from %s", sum( is.na( adData ) ), colnames(frmeData)[iData] )
-      c_logrMaaslin$info( format( rownames( frmeData )[is.na( adData )] ))
-    }
-  }
-
-  return(list(data=adData,outliers=length(viNAAfter)-length(viNAOrig),indices=setdiff(viNAAfter,viNAOrig)))
-}
-
-funcDoFenceTest <- function(
-### Use a threshold based on the quartiles of the data to identify outliers
-iData,
-### Column index in the data frame to test
-frmeData,
-### The data frame holding the data
-dFence
-### The fence outside the first and third quartiles to use as a threshold for cutoff.
-### This many times the interquartile range +/- to the 3rd/1st quartiles
-){
-  # Establish fence
-  adData <- frmeData[,iData]
-  adQ <- quantile( adData, c(0.25, 0.5, 0.75), na.rm = TRUE )
-
-  dIQR <- adQ[3] - adQ[1]
-  if(!dIQR)
-  {
-    dIQR = sd(adData,na.rm = TRUE)
-  }
-  dUF <- adQ[3] + ( dFence * dIQR )
-  dLF <- adQ[1] - ( dFence * dIQR )
-
-  # Record indices of values outside of fence to remove and remove.
-  aiRemove <- c()
-  for( j in 1:length( adData ) )
-  {
-    d <- adData[j]
-    if( !is.na( d ) && ( ( d < dLF ) || ( d > dUF ) ) )
-    {
-      aiRemove <- c(aiRemove, j)
-    }
-  }
-
-  if(length(aiRemove)==length(adData))
-  {
-    aiRemove = c()
-    c_logrMaaslin$info( "OutliersByFence:: Identified all data as outlier so was inactivated for index=", iData,"data=", paste(as.vector(frmeData[,iData]),collapse=","), "number zeros=", length(which(frmeData[,iData]==0)), sep=" " )
-  } else {
-    adData[aiRemove] <- NA
-
-    # Document to screen
-    if( length( aiRemove ) )
-    {
-      c_logrMaaslin$info( "OutliersByFence::Removing %d outliers from %s", length( aiRemove ), colnames(frmeData)[iData] )
-      c_logrMaaslin$info( format( rownames( frmeData )[aiRemove] ))
-    }
-  }
-
-  return(list(data=adData,outliers=length(aiRemove),indices=aiRemove))
-}
-
-funcZerosAreUneven = function(
-###
-vdRawData,
-### Raw data to be checked during transformation
-funcTransform,
-### Data transform to perform
-vsStratificationFeatures,
-### Groupings to check for unevenness
-dfData
-### Data frame holding the features
-){
-  # Return indicator of unevenness
-  fUneven = FALSE
-
-  # Transform the data to compare
-  vdTransformed = funcTransform( vdRawData )
-
-  # Go through each stratification of data
-  for( sStratification in vsStratificationFeatures )
-  {
-    # Current stratification
-    vFactorStrats = dfData[[ sStratification ]]
-
-    # If the metadata is not a factor then skip
-    # Only binned data can be evaluated this way.
- if( !is.factor( vFactorStrats )){ next } - - viZerosCountsRaw = c() - for( sLevel in levels( vFactorStrats ) ) - { - vdTest = vdRawData[ which( vFactorStrats == sLevel ) ] - viZerosCountsRaw = c( viZerosCountsRaw, length(which(vdTest == 0))) - vdTest = vdTransformed[ which( vFactorStrats == sLevel ) ] - } - dExpectation = 1 / length( viZerosCountsRaw ) - dMin = dExpectation / 2 - dMax = dExpectation + dMin - viZerosCountsRaw = viZerosCountsRaw / sum( viZerosCountsRaw ) - if( ( length( which( viZerosCountsRaw <= dMin ) ) > 0 ) || ( length( which( viZerosCountsRaw >= dMax ) ) > 0 ) ) - { - return( TRUE ) - } - } - return( fUneven ) -} - -funcTransformIncreasesOutliers = function( -### Checks if a data transform increases outliers in a distribution -vdRawData, -### Raw data to check for outlier zeros -funcTransform -){ - iUnOutliers = length( boxplot( vdRawData, plot = FALSE )$out ) - iTransformedOutliers = length( boxplot( funcTransform( vdRawData ), plot = FALSE )$out ) - - return( iUnOutliers <= iTransformedOutliers ) -} - -funcClean <- function( -### Properly clean / get data ready for analysis -### Includes custom analysis from the custom R script if it exists -frmeData, -### Data frame, input data to be acted on -funcDataProcess, -### Custom script that can be given to perform specialized processing before MaAsLin does. -aiMetadata, -### Indices of columns in frmeData which are metadata for analysis. -aiData, -### Indices of column in frmeData which are (abundance) data for analysis. -lsQCCounts, -### List that will hold the quality control information which is written in the output directory. -astrNoImpute = c(), -### An array of column names of frmeData not to impute. -dMinSamp, -### Minimum number of samples -dMinAbd, -# Minimum sample abundance -dFence, -### How many quartile ranges defines the fence to define outliers. 
-funcTransform, -### The data transformation function or a dummy function that does not affect the data -dPOutlier = 0.05 -### The significance threshold for the grubbs test to identify an outlier. -){ - # Call the custom script and set current data and indicies to the processes data and indicies. - c_logrMaaslin$debug( "Start Clean") - if( !is.null( funcDataProcess ) ) - { - c_logrMaaslin$debug("Additional preprocess function attempted.") - - pTmp <- funcDataProcess( frmeData=frmeData, aiMetadata=aiMetadata, aiData=aiData) - frmeData = pTmp$frmeData - aiMetadata = pTmp$aiMetadata - aiData = pTmp$aiData - lsQCCounts$lsQCCustom = pTmp$lsQCCounts - } - # Set data indicies after custom QC process. - lsQCCounts$aiAfterPreprocess = aiData - - # Remove missing data, remove any sample that has less than dMinSamp * the number of data or low abundance - aiRemove = c() - aiRemoveLowAbundance = c() - for( iCol in aiData ) - { - adCol = frmeData[,iCol] - adCol[!is.finite( adCol )] <- NA - if( ( sum( !is.na( adCol ) ) < ( dMinSamp * length( adCol ) ) ) || - ( length( unique( na.omit( adCol ) ) ) < 2 ) ) - { - aiRemove = c(aiRemove, iCol) - } - if( sum(adCol > dMinAbd, na.rm=TRUE ) < (dMinSamp * length( adCol))) - { - aiRemoveLowAbundance = c(aiRemoveLowAbundance, iCol) - } - } - # Remove and document - aiData = setdiff( aiData, aiRemove ) - aiData = setdiff( aiData, aiRemoveLowAbundance ) - lsQCCounts$iMissingData = aiRemove - lsQCCounts$iLowAbundanceData = aiRemoveLowAbundance - if(length(aiRemove)) - { - c_logrMaaslin$info( "Removing the following for data lower bound.") - c_logrMaaslin$info( format( colnames( frmeData )[aiRemove] )) - } - if(length(aiRemoveLowAbundance)) - { - c_logrMaaslin$info( "Removing the following for too many low abundance bugs.") - c_logrMaaslin$info( format( colnames( frmeData )[aiRemoveLowAbundance] )) - } - - #Transform data - iTransformed = 0 - viNotTransformedData = c() - for(aiDatum in aiData) - { - adValues = frmeData[,aiDatum] -# if( ! 
funcTransformIncreasesOutliers( adValues, funcTransform ) ) -# { - frmeData[,aiDatum] = funcTransform( adValues ) -# iTransformed = iTransformed + 1 -# } else { -# viNotTransformedData = c( viNotTransformedData, aiDatum ) -# } - } - c_logrMaaslin$info(paste("Number of features transformed = ",iTransformed)) - - # Metadata: Properly factorize all logical data and integer and number data with less than iNonFactorLevelThreshold - # Also record which are numeric metadata - aiNumericMetadata = c() - for( i in aiMetadata ) - { - if( ( class( frmeData[,i] ) %in% c("integer", "numeric", "logical") ) && - ( length( unique( frmeData[,i] ) ) < c_iNonFactorLevelThreshold ) ) { - c_logrMaaslin$debug(paste("Changing metadatum from numeric/integer/logical to factor",colnames(frmeData)[i],sep="=")) - frmeData[,i] = factor( frmeData[,i] ) - } - if( class( frmeData[,i] ) %in% c("integer","numeric") ) - { - aiNumericMetadata = c(aiNumericMetadata,i) - } - } - - # Remove outliers - # If the dFence Value is set use the method of defining the outllier as - # dFence * the interquartile range + or - the 3rd and first quartile respectively. - # If not the gibbs test is used. 
- lsQCCounts$aiDataSumOutlierPerDatum = c() - lsQCCounts$aiMetadataSumOutlierPerDatum = c() - lsQCCounts$liOutliers = list() - - if( dFence > 0.0 ) - { - # For data - for( iData in aiData ) - { - lOutlierInfo <- funcDoFenceTest(iData=iData,frmeData=frmeData,dFence=dFence) - frmeData[,iData] <- lOutlierInfo[["data"]] - lsQCCounts$aiDataSumOutlierPerDatum <- c(lsQCCounts$aiDataSumOutlierPerDatum,lOutlierInfo[["outliers"]]) - if(lOutlierInfo[["outliers"]]>0) - { - lsQCCounts$liOutliers[[paste(iData,sep="")]] <- lOutlierInfo[["indices"]] - } - } - - # Remove outlier non-factor metadata - for( iMetadata in aiNumericMetadata ) - { - lOutlierInfo <- funcDoFenceTest(iData=iMetadata,frmeData=frmeData,dFence=dFence) - frmeData[,iMetadata] <- lOutlierInfo[["data"]] - lsQCCounts$aiMetadataSumOutlierPerDatum <- c(lsQCCounts$aiMetadataSumOutlierPerDatum,lOutlierInfo[["outliers"]]) - if(lOutlierInfo[["outliers"]]>0) - { - lsQCCounts$liOutliers[[paste(iMetadata,sep="")]] <- lOutlierInfo[["indices"]] - } - } - #Do not use the fence, use the Grubbs test - } else if(dPOutlier!=0.0){ - # For data - for( iData in aiData ) - { - lOutlierInfo <- funcDoGrubbs(iData=iData,frmeData=frmeData,dPOutlier=dPOutlier) - frmeData[,iData] <- lOutlierInfo[["data"]] - lsQCCounts$aiDataSumOutlierPerDatum <- c(lsQCCounts$aiDataSumOutlierPerDatum,lOutlierInfo[["outliers"]]) - if(lOutlierInfo[["outliers"]]>0) - { - lsQCCounts$liOutliers[[paste(iData,sep="")]] <- lOutlierInfo[["indices"]] - } - } - for( iMetadata in aiNumericMetadata ) - { - lOutlierInfo <- funcDoGrubbs(iData=iMetadata,frmeData=frmeData,dPOutlier=dPOutlier) - frmeData[,iMetadata] <- lOutlierInfo[["data"]] - lsQCCounts$aiMetadataSumOutlierPerDatum <- c(lsQCCounts$aiMetadataSumOutlierPerDatum,lOutlierInfo[["outliers"]]) - if(lOutlierInfo[["outliers"]]>0) - { - lsQCCounts$liOutliers[[paste(iMetadata,sep="")]] <- lOutlierInfo[["indices"]] - } - } - } - - # Metadata: Remove missing data - # This is defined as if there is only one non-NA value 
or - # if the number of NOT NA data is less than a percentage of the data defined by dMinSamp - aiRemove = c() - for( iCol in c(aiMetadata) ) - { - adCol = frmeData[,iCol] - if( ( sum( !is.na( adCol ) ) < ( dMinSamp * length( adCol ) ) ) || - ( length( unique( na.omit( adCol ) ) ) < 2 ) ) - { - aiRemove = c(aiRemove, iCol) - } - } - - # Remove metadata - aiMetadata = setdiff( aiMetadata, aiRemove ) - - # Update the data which was removed. - lsQCCounts$iMissingMetadata = aiRemove - if(length(aiRemove)) - { - c_logrMaaslin$info("Removing the following metadata for too much missing data or only one data value outside of NA.") - c_logrMaaslin$info(format(colnames( frmeData )[aiRemove])) - } - - # Keep track of factor levels in a list for later use - lslsFactors <- list() - for( iCol in c(aiMetadata) ) - { - aCol <- frmeData[,iCol] - if( class( aCol ) == "factor" ) - { - lslsFactors[[length( lslsFactors ) + 1]] <- list(iCol, levels( aCol )) - } - } - - # Replace missing data values by the mean of the data column. - # Remove samples that were all NA from the cleaning and so could not be imputed. 
- aiRemoveData = c() - for( iCol in aiData ) - { - adCol <- frmeData[,iCol] - adCol[is.infinite( adCol )] <- NA - adCol[is.na( adCol )] <- mean( adCol[which(adCol>0)], na.rm = TRUE ) - frmeData[,iCol] <- adCol - - if(length(which(is.na(frmeData[,iCol]))) == length(frmeData[,iCol])) - { - c_logrMaaslin$info( paste("Removing data", iCol, "for being all NA after QC")) - aiRemoveData = c(aiRemoveData,iCol) - } - } - - # Remove and document - aiData = setdiff( aiData, aiRemoveData ) - lsQCCounts$iMissingData = c(lsQCCounts$iMissingData,aiRemoveData) - if(length(aiRemoveData)) - { - c_logrMaaslin$info( "Removing the following for having only NAs after cleaning (maybe due to only having NA after outlier testing).") - c_logrMaaslin$info( format( colnames( frmeData )[aiRemoveData] )) - } - - #Use na.gam.replace to manage NA metadata - aiTmp <- setdiff( aiMetadata, which( colnames( frmeData ) %in% astrNoImpute ) ) - # Keep tack of NAs so the may not be plotted later. - liNaIndices = list() - lsNames = names(frmeData) - for( i in aiTmp) - { - liNaIndices[[lsNames[i]]] = which(is.na(frmeData[,i])) - } - frmeData[,aiTmp] <- na.gam.replace( frmeData[,aiTmp] ) - - #If NA is a value in factor data, set the NA as a level. - for( lsFactor in lslsFactors ) - { - iCol <- lsFactor[[1]] - aCol <- frmeData[,iCol] - if( "NA" %in% levels( aCol ) ) - { - if(! 
lsNames[iCol] %in% astrNoImpute) - { - liNaIndices[[lsNames[iCol]]] = union(which(is.na(frmeData[,iCol])),which(frmeData[,iCol]=="NA")) - } - frmeData[,iCol] <- factor( aCol, levels = c(lsFactor[[2]], "NA") ) - } - } - - # Make sure there is a minimum number of non-0 measurements - aiRemove = c() - for( iCol in aiData ) - { - adCol = frmeData[,iCol] - if(length( which(adCol!=0)) < ( dMinSamp * length( adCol ) ) ) - { - aiRemove = c(aiRemove, iCol) - } - } - - # Remove and document - aiData = setdiff( aiData, aiRemove) - lsQCCounts$iZeroDominantData = aiRemove - if(length(aiRemove)) - { - c_logrMaaslin$info( "Removing the following for having not enough non-zero measurments for analysis.") - c_logrMaaslin$info( format( colnames( frmeData )[aiRemove] )) - } - - c_logrMaaslin$debug("End FuncClean") - return( list(frmeData = frmeData, aiMetadata = aiMetadata, aiData = aiData, lsQCCounts = lsQCCounts, liNaIndices=liNaIndices, viNotTransformedData = viNotTransformedData) ) - ### Return list of - ### frmeData: The Data after cleaning - ### aiMetadata: The indices of the metadata still being used after filtering - ### aiData: The indices of the data still being used after filtering - ### lsQCCOunts: QC info -} - -funcBugs <- function( -### Run analysis of all data features against all metadata -frmeData, -### Cleaned data including metadata, and data -lsData, -### This list is a general container for data as the analysis occurs, think about it as a cache for the analysis -aiMetadata, -### Indices of metadata used in analysis -aiData, -### Indices of response data -aiNotTransformedData, -### Indicies of the data not transformed -strData, -### Log file name -dSig, -### Significance threshold for the qvalue cut off -fInvert=FALSE, -### Invert images to have a black background -strDirOut = NA, -### Output project directory -funcReg=NULL, -### Function for regularization -funcTransform=NULL, -### Function used to transform the data -funcUnTransform=NULL, -### If a transform is 
used the opposite of that transfor must be used on the residuals in the partial residual plots -lsNonPenalizedPredictors=NULL, -### These predictors will not be penalized in the feature (model) selection step -funcAnalysis=NULL, -### Function to perform association analysis -lsRandomCovariates=NULL, -### List of string names of metadata which will be treated as random covariates -funcGetResults=NULL, -### Function to unpack results from analysis -fDoRPlot=TRUE, -### Plot residuals -fOmitLogFile = FALSE, -### Stops the creation of the log file -fAllvAll=FALSE, -### Flag to turn on all against all comparisons -liNaIndices = list(), -### Indicies of imputed NA data -lxParameters=list(), -### List holds parameters for different variable selection techniques -strTestingCorrection = "BH", -### Correction for multiple testing -fIsUnivariate = FALSE, -### Indicates if the function is univariate -fZeroInflated = FALSE -### Indicates to use a zero infalted model -){ - c_logrMaaslin$debug("Start funcBugs") - # If no output directory is indicated - # Then make it the current directory - if( is.na( strDirOut ) || is.null( strDirOut ) ) - { - if( !is.na( strData ) ) - { - strDirOut <- paste( dirname( strData ), "/", sep = "" ) - } else { strDirOut = "" } - } - - # Make th log file and output file names based on the log file name - strLog = NA - strBase = "" - if(!is.na(strData)) - { - strBaseOut <- paste( strDirOut, sub( "\\.([^.]+)$", "", basename(strData) ), sep = "/" ) - strLog <- paste( strBaseOut,c_sLogFileSuffix, ".txt", sep = "" ) - } - - # If indicated, stop the creation of the log file - # Otherwise delete the log file if it exists and log - if(fOmitLogFile){ strLog = NA } - if(!is.na(strLog)) - { - c_logrMaaslin$info( "Outputting to: %s", strLog ) - unlink( strLog ) - } - - # Will contain pvalues - adP = c() - adPAdj = c() - - # List of lists with association information - lsSig <- list() - # Go through each data that was not previously removed and perform inference - 
for( iTaxon in aiData ) - { - # Log to screen progress per 10 associations. - # Can be thown off if iTaxon is missing a mod 10 value - # So the taxons may not be logged every 10 but not a big deal - if( !( iTaxon %% 10 ) ) - { - c_logrMaaslin$info( "Taxon %d/%d", iTaxon, max( aiData ) ) - } - - # Call analysis method - lsOne <- funcBugHybrid( iTaxon=iTaxon, frmeData=frmeData, lsData=lsData, aiMetadata=aiMetadata, dSig=dSig, adP=adP, lsSig=lsSig, funcTransform=funcTransform, funcUnTransform=funcUnTransform, strLog=strLog, funcReg=funcReg, lsNonPenalizedPredictors=lsNonPenalizedPredictors, funcAnalysis=funcAnalysis, lsRandomCovariates=lsRandomCovariates, funcGetResult=funcGetResults, fAllvAll=fAllvAll, fIsUnivariate=fIsUnivariate, lxParameters=lxParameters, fZeroInflated=fZeroInflated, fIsTransformed= ! iTaxon %in% aiNotTransformedData ) - - # If you get a NA (happens when the lmm gets all random covariates) move on - if( is.na( lsOne ) ){ next } - - # The updating of the following happens in the inference method call in the funcBugHybrid call - # New pvalue array - adP <- lsOne$adP - # New lsSig contains data about significant feature v metadata comparisons - lsSig <- lsOne$lsSig - # New qc data - lsData$lsQCCounts = lsOne$lsQCCounts - } - - # Log the QC info - c_logrMaaslin$debug("lsData$lsQCCounts") - c_logrMaaslin$debug(format(lsData$lsQCCounts)) - - if( is.null( adP ) ) { return( NULL ) } - - # Perform bonferonni corrections on factor data (for levels), calculate the number of tests performed, and FDR adjust for multiple hypotheses - # Perform Bonferonni adjustment on factor data - for( iADIndex in 1:length( adP ) ) - { - # Only perform on factor data - if( is.factor( lsSig[[ iADIndex ]]$metadata ) ) - { - adPAdj = c( adPAdj, funcBonferonniCorrectFactorData( dPvalue = adP[ iADIndex ], vsFactors = lsSig[[ iADIndex ]]$metadata, fIgnoreNAs = length(liNaIndices)>0) ) - } else { - adPAdj = c( adPAdj, adP[ iADIndex ] ) - } - } - - iTests = 
funcCalculateTestCounts(iDataCount = length(aiData), asMetadata = intersect( lsData$astrMetadata, colnames( frmeData )[aiMetadata] ), asForced = lsNonPenalizedPredictors, asRandom = lsRandomCovariates, fAllvAll = fAllvAll) - - #Get indices of sorted data after the factor correction but before the multiple hypothesis corrections. - aiSig <- sort.list( adPAdj ) - - # Perform FDR BH - adQ = p.adjust(adPAdj, method=strTestingCorrection, n=max(length(adPAdj), iTests)) - - # Find all covariates that had significant associations - astrNames <- c() - for( i in 1:length( lsSig ) ) - { - astrNames <- c(astrNames, lsSig[[i]]$name) - } - astrNames <- unique( astrNames ) - - # Sets up named label return for global plotting - lsReturnTaxa <- list() - for( j in aiSig ) - { - if( adQ[j] > dSig ) { next } - strTaxon <- lsSig[[j]]$taxon - if(strTaxon %in% names(lsReturnTaxa)) - { - lsReturnTaxa[[strTaxon]] = min(lsReturnTaxa[[strTaxon]],adQ[j]) - } else { lsReturnTaxa[[strTaxon]] = adQ[j]} - } - - # For each covariate with significant associations - # Write out a file with association information - for( strName in astrNames ) - { - strFileTXT <- NA - strFilePDF <- NA - for( j in aiSig ) - { - lsCur <- lsSig[[j]] - strCur <- lsCur$name - - if( strCur != strName ) { next } - - strTaxon <- lsCur$taxon - adData <- lsCur$data - astrFactors <- lsCur$factors - adCur <- lsCur$metadata - adY <- adData - - if( is.na( strData ) ) { next } - - ## If the text file output is not written to yet - ## make the file names, and delete any previous file output - if( is.na( strFileTXT ) ) - { - strFileTXT <- sprintf( "%s-%s.txt", strBaseOut, strName ) - unlink(strFileTXT) - funcWrite( c("Variable", "Feature", "Value", "Coefficient", "N", "N not 0", "P-value", "Q-value"), strFileTXT ) - } - - ## Write text output - funcWrite( c(strName, strTaxon, lsCur$orig, lsCur$value, length( adData ), sum( adData > 0 ), adP[j], adQ[j]), strFileTXT ) - - ## If the significance meets the threshold - ## Write PDF file 
output - if( adQ[j] > dSig ) { next } - - # Do not make residuals plots if univariate is selected - strFilePDF = funcPDF( frmeTmp=frmeData, lsCur=lsCur, curPValue=adP[j], curQValue=adQ[j], strFilePDF=strFilePDF, strBaseOut=strBaseOut, strName=strName, funcUnTransform= funcUnTransform, fDoResidualPlot=fDoRPlot, fInvert=fInvert, liNaIndices=liNaIndices ) - } - if( dev.cur( ) != 1 ) { dev.off( ) } - } - aiTmp <- aiData - - logdebug("End funcBugs", c_logMaaslin) - return(list(lsReturnBugs=lsReturnTaxa, lsQCCounts=lsData$lsQCCounts)) - ### List of data features successfully associated without error and quality control data -} - -#Lightly Tested -### Performs analysis for 1 feature -### iTaxon: integer Taxon index to be associated with data -### frmeData: Data frame The full data -### lsData: List of all associated data -### aiMetadata: Numeric vector of indices -### dSig: Numeric significance threshold for q-value cut off -### adP: List of pvalues from associations -### lsSig: List which serves as a cache of data about significant associations -### strLog: String file to log to -funcBugHybrid <- function( -### Performs analysis for 1 feature -iTaxon, -### integer Taxon index to be associated with data -frmeData, -### Data frame, the full data -lsData, -### List of all associated data -aiMetadata, -### Numeric vector of indices -dSig, -### Numeric significance threshold for q-value cut off -adP, -### List of pvalues from associations -lsSig, -### List which serves as a cache of data about significant associations -funcTransform, -### The tranform used on the data -funcUnTransform, -### The reverse transform on the data -strLog = NA, -### String, file to which to log -funcReg=NULL, -### Function to perform regularization -lsNonPenalizedPredictors=NULL, -### These predictors will not be penalized in the feature (model) selection step -funcAnalysis=NULL, -### Function to perform association analysis -lsRandomCovariates=NULL, -### List of string names of metadata which will 
be treated as random covariates -funcGetResult=NULL, -### Function to unpack results from analysis -fAllvAll=FALSE, -### Flag to turn on all against all comparisons -fIsUnivariate = FALSE, -### Indicates the analysis function is univariate -lxParameters=list(), -### List holds parameters for different variable selection techniques -fZeroInflated = FALSE, -### Indicates if to use a zero infalted model -fIsTransformed = TRUE -### Indicates that the bug is transformed -){ -#dTime00 <- proc.time()[3] - #Get metadata column names - astrMetadata = intersect( lsData$astrMetadata, colnames( frmeData )[aiMetadata] ) - - #Get data measurements that are not NA - aiRows <- which( !is.na( frmeData[,iTaxon] ) ) - - #Get the dataframe of non-na data measurements - frmeTmp <- frmeData[aiRows,] - - #Set the min boosting selection frequency to a default if not given - if( is.na( lxParameters$dFreq ) ) - { - lxParameters$dFreq <- 0.5 / length( c(astrMetadata) ) - } - - # Get the full data for the bug feature - adCur = frmeTmp[,iTaxon] - lxParameters$sBugName = names(frmeTmp[iTaxon]) - - # This can run multiple models so some of the results are held in lists and some are not - llmod = list() - liTaxon = list() - lastrTerms = list() - - # Build formula for simple mixed effects models - # Removes random covariates from variable selection - astrMetadata = setdiff(astrMetadata, lsRandomCovariates) - strFormula <- paste( "adCur ~", paste( sprintf( "`%s`", astrMetadata ), collapse = " + " ), sep = " " ) - - # Document the model - funcWrite( c("#taxon", colnames( frmeTmp )[iTaxon]), strLog ) - funcWrite( c("#metadata", astrMetadata), strLog ) - funcWrite( c("#samples", rownames( frmeTmp )), strLog ) - - #Model terms - astrTerms <- c() - - # Attempt feature (model) selection - if(!is.na(funcReg)) - { - #Count model selection method attempts - lsData$lsQCCounts$iBoosts = lsData$lsQCCounts$iBoosts + 1 - #Perform model selection - astrTerms <- funcReg(strFormula=strFormula, frmeTmp=frmeTmp, 
adCur=adCur, lsParameters=lxParameters, lsForcedParameters=lsNonPenalizedPredictors, strLog=strLog) - #If the feature selection function is set to None, set all terms of the model to all the metadata - } else { astrTerms = astrMetadata } - - # Get look through the boosting results to get a model - # Holds the predictors in the predictors in the model that were selected by the boosting - if(is.null( astrTerms )){lsData$lsQCCounts$iBoostErrors = lsData$lsQCCounts$iBoostErrors + 1} - - # Get the indices that are transformed - # Of those indices check for uneven metadata - # Untransform any of the metadata that failed - # Failed means true for uneven occurences of zeros -# if( fIsTransformed ) -# { -# vdUnevenZeroCheck = funcUnTransform( frmeData[[ iTaxon ]] ) -# if( funcZerosAreUneven( vdRawData=vdUnevenZeroCheck, funcTransform=funcTransform, vsStratificationFeatures=astrTerms, dfData=frmeData ) ) -# { -# frmeData[[ iTaxon ]] = vdUnevenZeroCheck -# c_logrMaaslin$debug( paste( "Taxon transformation reversed due to unevenness of zero distribution.", iTaxon ) ) -# } -# } - - # Run association analysis if predictors exist and an analysis function is specified - # Run analysis - if(!is.na(funcAnalysis) ) - { - #If there are selected and forced fixed covariates - if( length( astrTerms ) ) - { - #Count the association attempt - lsData$lsQCCounts$iLms = lsData$lsQCCounts$iLms + 1 - - #Make the lm formula - #Build formula for simple mixed effects models using random covariates - strRandomCovariatesFormula = NULL - #Random covariates are forced - if(length(lsRandomCovariates)>0) - { - #Format for lme - #Needed for changes to not allowing random covariates through the boosting process - strRandomCovariatesFormula <- paste( "adCur ~ ", paste( sprintf( "1|`%s`", lsRandomCovariates), collapse = " + " )) - } - - #Set up a list of formula containing selected fixed variables changing and the forced fixed covariates constant - vstrFormula = c() - #Set up suppressing forced covariates 
in a all v all scenario only - asSuppress = c() - #Enable all against all comparisons - if(fAllvAll && !fIsUnivariate) - { - lsVaryingCovariates = setdiff(astrTerms,lsNonPenalizedPredictors) - lsConstantCovariates = setdiff(lsNonPenalizedPredictors,lsRandomCovariates) - strConstantFormula = paste( sprintf( "`%s`", lsConstantCovariates ), collapse = " + " ) - asSuppress = lsConstantCovariates - - if(length(lsVaryingCovariates)==0L) - { - vstrFormula <- c( paste( "adCur ~ ", paste( sprintf( "`%s`", lsConstantCovariates ), collapse = " + " )) ) - } else { - for( sVarCov in lsVaryingCovariates ) - { - strTempFormula = paste( "adCur ~ `", sVarCov,"`",sep="") - if(length(lsConstantCovariates)>0){ strTempFormula = paste(strTempFormula,strConstantFormula,sep=" + ") } - vstrFormula <- c( vstrFormula, strTempFormula ) - } - } - } else { - #This is either the multivariate case formula for all covariates in an lm or fixed covariates in the lmm - vstrFormula <- c( paste( "adCur ~ ", paste( sprintf( "`%s`", astrTerms ), collapse = " + " )) ) - } - - #Run the association - for( strAnalysisFormula in vstrFormula ) - { - i = length(llmod)+1 - llmod[[i]] = funcAnalysis(strFormula=strAnalysisFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, lsHistory=list(adP=adP, lsSig=lsSig, lsQCCounts=lsData$lsQCCounts), strRandomFormula=strRandomCovariatesFormula, fZeroInflated=fZeroInflated) - - liTaxon[[i]] = iTaxon - lastrTerms[[i]] = funcFormulaStrToList(strAnalysisFormula) - } - } else { - #If there are no selected or forced fixed covariates - lsData$lsQCCounts$iNoTerms = lsData$lsQCCounts$iNoTerms + 1 - return(list(adP=adP, lsSig=lsSig, lsQCCounts=lsData$lsQCCounts)) - } - } - - #Call funcBugResults and return it's return - if(!is.na(funcGetResult)) - { - #Format the results to a consistent expected result. 
- return( funcGetResult( llmod=llmod, frmeData=frmeData, liTaxon=liTaxon, dSig=dSig, adP=adP, lsSig=lsSig, strLog=strLog, lsQCCounts=lsData$lsQCCounts, lastrCols=lastrTerms, asSuppressCovariates=asSuppress ) ) - } else { - return(list(adP=adP, lsSig=lsSig, lsQCCounts=lsData$lsQCCounts)) - } - ### List containing a list of pvalues, a list of significant data per association, and a list of QC data -} diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/Constants.R --- a/maaslin-4450aa4ecc84/src/lib/Constants.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,119 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
-# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Global project constants. -) { return( pArgs ) } - -#General -c_COMMA = "," -c_DASH = "-" - -#For reading IO -c_MATRIX_NAME = "Matrix:" -c_FILE_NAME = "File:" -c_DELIMITER = "Delimiter:" -c_ID_ROW = "Name_Row_Number:" -c_ID_COLUMN = "Name_Column_Number:" -c_ROWS = "Read_Rows:" -c_PCLROWS = "Read_PCL_Rows:" -c_TSVROWS = "Read_TSV_Rows:" -c_COLUMNS = "Read_Columns:" -c_PCLCOLUMNS = "Read_PCL_Columns:" -c_TSVCOLUMNS = "Read_TSV_Columns:" -c_CHARACTER_DATA_TYPE = "DT_Character:" -c_FACTOR_DATA_TYPE = "DT_Factor:" -c_INTEGER_DATA_TYPE = "DT_Integer:" -c_LOGICAL_DATA_TYPE = "DT_Logical:" -c_NUMERIC_DATA_TYPE = "DT_Numeric:" -c_ORDEREDFACTOR_DATA_TYPE = "DT_Ordered_Factor:" - -### The name of the data matrix read in using a read.config file -c_strMatrixData <- "Abundance" -### The name of the metadata matrix read in using a read.config file -c_strMatrixMetadata <- "Metadata" -# Settings for MFA visualization/ordination -c_iMFA <- 30 -c_dHeight <- 9 -c_dDefaultScale = 0.5 -# The column that is used to determine if information meets a certain significance threshold (dSignificanceLevel) to include in the Summary text file) -c_strKeywordEvaluatedForInclusion <- "Q.value" -#The name of the custom process function -c_strCustomProcessFunction = "processFunction" - -#Delimiters -#Feature name delimiter -c_cFeatureDelim = "|" -c_cFeatureDelimRex = "\\|" - -#The word used for unclassified -c_strUnclassified = "unclassified" - -#Maaslincore settings -#If a metadata does not have more than count of unique values, it is changed to factor data mode. 
-c_iNonFactorLevelThreshold = 3 - -#Extensions -c_sDetailFileSuffix = ".txt" -c_sSummaryFileSuffix = ".txt" -c_sLogFileSuffix = "_log" - -#Delimiter for output tables -c_cTableDelimiter="\t" - -#Testing Related -c_strTestingDirectory = "testing" -c_strCorrectAnswers = "answers" -c_strTemporaryFiles = "tmp" -c_strTestingInput = "input" - -#Reading matrix defaults -c_strDefaultMatrixDelimiter = "\t" -c_strDefaultMatrixRowID = "1" -c_strDefaultMatrixColID = "1" -c_strDefaultReadRows = "-" -c_strDefaultReadCols = "-" - -#Separator used when collapsing factor names -c_sFactorNameSep = "" - -#Separator used by the mfa -c_sMFANameSep1 = "_" -c_sMFANameSep2 = "." - -#Analysis Module list positioning -c_iSelection = 1 -c_iTransform = 2 -c_iAnalysis = 3 -c_iResults = 4 -c_iUnTransform = 5 -c_iIsUnivariate = 6 - -#Count based models -c_vCountBasedModels = c("neg_binomial","quasi") - -# Na action in anaylsis, placed here to standardize -c_strNA_Action = "na.omit" diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/IO.R --- a/maaslin-4450aa4ecc84/src/lib/IO.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,403 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. 
-# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Collection of functions centered on custom reading of data and some IO services. -) { return( pArgs ) } - -#Project Constants - -c_astrNA <- c(""," "," ","NA","na") - -#Do not report warnings -options(warn=-1) - -funcWriteMatrixToReadConfigFile = function( -### Writes a read config file. 
Will write over a file by default
-strConfigureFileName,
-### Matrix that will be read
-strMatrixName,
-### Name of matrix that will be read
-strRowIndices=NA,
-### Rows which will be Read (TSV) by default all will be read
-strColIndices=NA,
-### Cols which will be Read (TSV) by default all will be read
-acharDelimiter=c_strDefaultMatrixDelimiter,
-### Delimiter for the matrix that will be read in\
-fAppend=FALSE
-### Append to a current read config file
-){
- #If no append delete previous file
- if(!fAppend){unlink(strConfigureFileName)}
-
- #Make delimiter readable
- switch(acharDelimiter,
- "\t" = {acharDelimiter = "TAB"},
- " " = {acharDelimiter = "SPACE"},
- "\r" = {acharDelimiter = "RETURN"},
- "\n" = {acharDelimiter = "ENDLINE"})
-
- #Manage NAs
- if(is.na(strRowIndices)){strRowIndices="-"}
- if(is.na(strColIndices)){strColIndices="-"}
-
- #Required output
- lsDataLines = c(paste(c_MATRIX_NAME,strMatrixName,sep=" "),
- paste(c_DELIMITER,acharDelimiter,sep=" "),
- paste(c_ID_ROW,"1",sep=" "),
- paste(c_ID_COLUMN,"1",sep=" "),
- paste(c_TSVROWS,strRowIndices,sep=" "),
- paste(c_TSVCOLUMNS,strColIndices,sep=" "))
-
- lsDataLines = c(lsDataLines,"\n")
-
- #Output to file
- lapply(lsDataLines, cat, file=strConfigureFileName, sep="\n", append=TRUE)
-}
-
-funcWriteMatrices = function(
-### Write data frame data files with config files
-dataFrameList,
-### A named list of data frames (what you get directly from the read function)
-saveFileList,
-### File names to save the data matrices in (one name per data frame)
-configureFileName,
-### Name of the configure file to be written which will direct the reading of these data
-acharDelimiter=c_strDefaultMatrixDelimiter,
-### Matrix delimiter
-log = FALSE
-### Indicates if logging should occur
-){
- #Get names
- dataFrameNames = names(dataFrameList)
-
- #Get length of dataFrameList
- dataFrameListLength = length(dataFrameList)
-
- #Get length of save file list
- saveFileListLength = length(saveFileList)
-
- #If the save
file list length and data frame list length are not equal, abort
- if(!saveFileListLength == dataFrameListLength)
- {stop(paste("Received a length of save files (",saveFileListLength,") that are different from the count of data frames (",dataFrameListLength,"). Stopped and returned false."),sep="")}
-
- #Delete the old config file
- unlink(configureFileName)
-
- #For each data save
- for (dataIndex in c(1:dataFrameListLength))
- {
- #Current data frame
- data = dataFrameList[[dataIndex]]
-
- #Get column count
- columnCount = ncol(data)
-
- #Get row and column names
- rowNames = row.names(data)
- rowNamesString = paste(rowNames,sep="",collapse=",")
- if(length(rowNamesString)==0){rowNamesString = NA}
-
- columnNamesString = paste(colnames(data),sep="",collapse=",")
- if(length(columnNamesString)==0){columnNamesString = NA}
-
- #Get row indices
- rowStart = 1
- if(!is.na(rowNamesString)){rowStart = 2}
- rowEnd = nrow(data)+rowStart - 1
- rowIndices = paste(c(rowStart:rowEnd),sep="",collapse=",")
-
- #Get col indices
- colStart = 1
- if(!is.na(columnNamesString)){ colStart = 2}
- colEnd = columnCount+colStart - 1
- colIndices = paste(c(colStart:colEnd),sep="",collapse=",")
-
- #Write Data to file
- write.table(data, saveFileList[dataIndex], quote = FALSE, sep = acharDelimiter, col.names = NA, row.names = rowNames, na = "NA", append = FALSE)
-
- #Write the read config file
- funcWriteMatrixToReadConfigFile(strConfigureFileName=configureFileName, strMatrixName=dataFrameNames[dataIndex],
- strRowIndices=rowIndices, strColIndices=colIndices, acharDelimiter=acharDelimiter, fAppend=TRUE)
- }
- return(TRUE)
-}
-
-funcReadMatrices = function(
-### Dynamically Read a Matrix/Matrices from a configure file
-configureFile,
-### Read config file to guide reading in data
-defaultFile = NA,
-### Default data file to read
-log = FALSE
-){
- #Named vector to return data frames read
- returnFrames = list()
- #Holds the names of the frames as they are being added
- returnFrameNames =
c()
- returnFramesIndex = 1
-
- #Read in config file info
- #Read each data block extracted from the config file
- lsDataBlocks <- funcReadConfigFile(configureFile, defaultFile)
- if(!length(lsDataBlocks)) {
- astrMetadata <- NULL
- astrMetadata[2] <- defaultFile
- astrMetadata[5] <- "2"
- astrData <- NULL
- astrData[2] <- defaultFile
- astrData[5] <- "3-"
- lsDataBlocks <- list(astrMetadata, astrData)
- }
- for(dataBlock in lsDataBlocks)
- {
- #Read in matrix
- returnFrames[[returnFramesIndex]] = funcReadMatrix(tempMatrixName=dataBlock[1], tempFileName=dataBlock[2], tempDelimiter=dataBlock[3], tempColumns=dataBlock[5], tempRows=dataBlock[4], tempLog=log)
- returnFrameNames = c(returnFrameNames,dataBlock[1])
- returnFramesIndex = returnFramesIndex + 1
- }
- names(returnFrames) = returnFrameNames
- return(returnFrames)
-}
-
-funcReadMatrix = function(
-### Read one matrix
-### The name to give the block of data read in from file
-tempMatrixName,
-### ID rows and columns are assumed to be 1
-tempFileName=NA,
-### Data file to read
-tempDelimiter=NA,
-### Data matrix delimiter
-tempColumns=NA,
-### Data columns to read
-tempRows=NA,
-### Data rows to read
-tempLog=FALSE
-### Indicator to log
-){
- if(is.na(tempDelimiter)){tempDelimiter <- c_strDefaultMatrixDelimiter}
- if(is.na(tempColumns)){tempColumns <- c_strDefaultReadCols}
- if(is.na(tempRows)){tempRows <- c_strDefaultReadRows}
-
- #Check parameter and make sure not NA
- if(is.na(tempMatrixName)){tempMatrixName <- ""}
- if(!funcIsValid(tempMatrixName)){stop(paste("Did not receive a valid matrix name, received ",tempMatrixName,"."))}
-
- #Check to make sure there is a file name for the matrix
- if(! funcIsValidFileName(tempFileName))
- {stop(paste("No valid file name is given for the matrix ",tempMatrixName," from file: ",tempFileName,".
Please add a valid file name to read the matrix from.", sep=""))}
-
- #Read in superset matrix and give names if indicated
- #Read in matrix
- dataMatrix = read.table(tempFileName, sep = tempDelimiter, as.is = TRUE, na.strings=c_astrNA, quote = "", comment.char = "")
- dataFrameDimension = dim(dataMatrix)
-
- #Get column names
- columnNameList = as.matrix(dataMatrix[1,])
- rowNameList = dataMatrix[1][[1]]
-
- #Convert characters to vectors of indices
- tempColumns = funcParseIndexSlices(ifelse(is.na(tempColumns),"-",tempColumns), columnNameList)
- tempRows = funcParseIndexSlices(ifelse(is.na(tempRows),"-", tempRows), rowNameList)
-
- #Check indices
- #Check to make sure valid id col/rows and data col/rows
- if((!funcIsValid(tempColumns)) || (!funcIsValid(tempRows)))
- {stop(paste("Received invalid row or col. Rows=",tempRows," Cols=", tempColumns))}
-
- #Check to make sure only 1 row id is given and it is not repeated in the data rows
- if(length(intersect(1,tempColumns)) == 1)
- {stop(paste("Index indicated as an id row but was found in the data row indices, can not be both. Index=1 Data indices=",tempColumns,sep=""))}
-
- #Check to make sure only one col id is given and it is not repeated in the data columns
- #Id row/col should not be in data row/col
- if(length(intersect(1, tempRows)) == 1)
- {stop(paste("Index indicated as an id column but was found in the data column indices, can not be both. ID Index=1 Data Indices=", tempRows,".",sep=""))}
-
- #If the row names have the same length as the column count and has column names
- #it is assumed that the tempIdCol index item is associated with the column names.
- #Visa versa for rows, either way it is removed
- #Remove ids from name vector
- rowNameList = rowNameList[(-1)]
- #Remove ids from data
- dataMatrix = dataMatrix[(-1)]
- #Adjust row ids given the removal of the id row
- tempColumns=(tempColumns-1)
-
- ## Remove id rows/columns and set row/col names
- #Remove ids from vector
- columnNameList = columnNameList[(-1)]
- #Remove ids from data
- dataMatrix = dataMatrix[(-1),]
- #Adjust column ids given the removal of the id column
- tempRows =(tempRows-1)
- #Add row and column names
- row.names(dataMatrix) = as.character(rowNameList)
- colnames(dataMatrix) = as.character(columnNameList)
-
- #Reduce matrix
- #Account for when both column ranges and row ranges are given or just a column or a row range is given
- dataMatrix = dataMatrix[tempRows, tempColumns, drop=FALSE]
-
- #Set all columns data types to R guessed default
- for(i in 1:ncol(dataMatrix)){
- dataMatrix[,i] <- type.convert(dataMatrix[,i], na.strings = c_astrNA)}
-
- #Return matrix
- return(dataMatrix)
-}
-
-funcReadConfigFile = function(
-### Reads in configure file and extracts the pieces needed for reading in a matrix
-configureFile,
-### Configure file = string path to configure file
-defaultFile = NA
-### Used to set a default data file
-){
- #Read configure file
- fileDataList <- list()
- if(!is.null( configureFile ) ) {
- fileDataList <- scan( file = configureFile, what = character(), sep="\n", quiet=TRUE) }
- newList = list()
- for(sLine in fileDataList)
- {
- sLine = gsub("\\s","",sLine)
- vUnits = unlist(strsplit(sLine,":"))
- if(length(vUnits)>1)
- {
- vUnits[1] = paste(vUnits[1],":",sep="")
- newList[[length(newList)+1]] = vUnits
- }
- }
- fileDataList = unlist(newList)
-
- matrixName <- NA
- fileName <- defaultFile
-
- #Hold information on matrices to be read
- matrixInformationList = list()
- matrixInformationListCount = 1
-
- for(textIndex in c(1:length(fileDataList)))
- {
- if(textIndex > length(fileDataList)) {break}
- #Start at the Matrix
name
- #Keep this if statement first so that you scan through until you find a matrix block
- if(fileDataList[textIndex] == c_MATRIX_NAME)
- {
- #If the file name is not NA then that is sufficient for a matrix, store
- #Either way reset
- if(funcIsValid(fileName)&&funcIsValid(matrixName))
- {
- matrixInformationList[[matrixInformationListCount]] = c(matrixName,fileName,delimiter,rows,columns)
- matrixInformationListCount = matrixInformationListCount + 1
- }
-
- #Get the matrix name and store
- matrixName = fileDataList[textIndex + 1]
-
- fileName = defaultFile
- delimiter = "\t"
- rows = NA
- columns = NA
- #If is not matrix name and no matrix name is known skip until you find the matrix name
- #If matrix name is known, continue to collect information about that matrix
- } else if(is.na(matrixName)){next}
-
- #Parse different keywords
- strParseKey = fileDataList[textIndex]
- if(strParseKey == c_FILE_NAME){fileName=fileDataList[textIndex+1]}
- else if(strParseKey==c_FILE_NAME){fileName=fileDataList[textIndex+1]}
- else if(strParseKey %in% c(c_TSVROWS,c_PCLCOLUMNS,c_ROWS)){rows=fileDataList[textIndex+1]}
- else if(strParseKey %in% c(c_TSVCOLUMNS,c_PCLROWS,c_COLUMNS)){columns=fileDataList[textIndex+1]}
- else if(strParseKey==c_DELIMITER)
- {
- switch(fileDataList[textIndex + 1],
- "TAB" = {delimiter = "\t"},
- "SPACE" = {delimiter = " "},
- "RETURN" = {delimiter = "\r"},
- "ENDLINE" = {delimiter = "\n"})
- }
- }
- #If there is matrix information left
- if((!is.na(matrixName)) && (!is.na(fileName)))
- {
- matrixInformationList[[matrixInformationListCount]] = c(matrixName,fileName,delimiter,rows,columns)
- matrixInformationListCount = matrixInformationListCount + 1
- }
- return(matrixInformationList)
-}
-
-funcParseIndexSlices = function(
-### Take a string of comma or dash seperated integer strings and convert into a vector
-### of integers to use in index slicing
-strIndexString,
-### String to be parsed into indicies vector
-cstrNames
-### Column names of the data
so names can be resolved to indicies
-){
- #If the slices are NA then return
- if(is.na(strIndexString)){return(strIndexString)}
-
- #List of indices to return
- viRetIndicies = c()
-
- #Split on commas
- lIndexString = sapply(strsplit(strIndexString, c_COMMA),function(x) return(x))
- for(strIndexItem in lIndexString)
- {
- #Handle the - case
- if(strIndexItem=="-"){strIndexItem = paste("2-",length(cstrNames),sep="")}
-
- #Split on dash and make sure it makes sense
- lItemElement = strsplit(strIndexItem, c_DASH)[[1]]
- if(length(lItemElement)>2){stop("Error in index, too many dashes, only one is allowed. Index = ",strIndexItem,sep="")}
-
- #Switch names to numbers
- aiIndices = which(is.na(as.numeric(lItemElement)))
- for( iIndex in aiIndices )
- {
- lItemElement[iIndex] = which(cstrNames==lItemElement[iIndex])[1]
- }
-
- #Make numeric
- liItemElement = unlist(lapply(lItemElement, as.numeric))
-
- #If dash is at the end or the beginning add on the correct number
- if(substr(strIndexItem,1,1)==c_DASH){liItemElement[1]=2}
- if(substr(strIndexItem,nchar(strIndexItem),nchar(strIndexItem))==c_DASH){liItemElement[2]=length(cstrNames)}
-
- #If multiple numbers turn to a slice
- if(length(liItemElement)==2){liItemElement = c(liItemElement[1]:liItemElement[2])}
-
- #Update indices
- viRetIndicies = c(viRetIndicies, liItemElement)
- }
- if(length(viRetIndicies)==0){return(NA)}
- return(sort(unique(viRetIndicies)))
- ### Sorted indicies vector
-}
diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/MaaslinPlots.R
--- a/maaslin-4450aa4ecc84/src/lib/MaaslinPlots.R Sun Feb 08 23:21:34 2015 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,428 +0,0 @@
-#####################################################################################
-#Copyright (C) <2012>
-#
-#Permission is hereby granted, free of charge, to any person obtaining a copy of
-#this software and associated documentation files (the "Software"), to deal in the
-#Software without restriction,
including without limitation the rights to use, copy,
-#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
-#and to permit persons to whom the Software is furnished to do so, subject to
-#the following conditions:
-#
-#The above copyright notice and this permission notice shall be included in all copies
-#or substantial portions of the Software.
-#
-#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
-#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
-#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-#
-# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models),
-# authored by the Huttenhower lab at the Harvard School of Public Health
-# (contact Timothy Tickle, ttickle@hsph.harvard.edu).
-#####################################################################################
-
-inlinedocs <- function(
-##author<< Curtis Huttenhower and Timothy Tickle
-##description<< Holds MaAsLin related plotting
-) { return( pArgs ) }
-
-funcPDF <- function(
-### Function to plot raw data with linear model information.
-### Continuous and integer variables are plotted with a line of best fit.
-### Other data is plotted as boxplots.
-frmeTmp,
-lsCur,
-### Linear model information
-curPValue,
-### Pvalue to display
-curQValue,
-### Qvalue to display
-strFilePDF,
-### PDF file to create or to which to append
-strBaseOut,
-### Project directory to place pdf in
-strName,
-### Name of taxon
-funcUnTransform=NULL,
-### If a transform is used the appropriate of that transfor must be used on the residuals in the partial residual plots
-fDoResidualPlot = TRUE,
-### Plot the residual plots
-fInvert = FALSE,
-### Invert the figure so the background is black
-liNaIndices = c()
-### Indices of NA data that was imputed
-){
- if( is.na( strFilePDF ) )
- {
- strFilePDF <- sprintf( "%s-%s.pdf", strBaseOut, strName )
- pdf( strFilePDF, width = 11, useDingbats=FALSE )
- }
-
- #Invert plots
- adColorMin <- c(1, 0, 0)
- adColorMax <- c(0, 1, 0)
- adColorMed <- c(0, 0, 0)
- if( fInvert )
- {
- par( bg = "black", fg = "white", col.axis = "white", col.lab = "white", col.main = "white", col.sub = "white" )
- adColorMin <- c(1, 1, 0)
- adColorMax <- c(0, 1, 1)
- adColorMed <- c(1, 1, 1)
- }
-
- #Create linear model title data string
- strTitle <- sprintf( "%s (%.3g sd %.3g, p=%.3g, q=%.3g)", lsCur$orig, lsCur$value, lsCur$std, curPValue, curQValue )
- adMar <- c(5, 4, 4, 2) + 0.1
- dLine <- NA
- strTaxon <- lsCur$taxon
- if( nchar( strTaxon ) > 80 )
- {
- dCEX <- 0.75
- iLen <- nchar( strTaxon )
- if( iLen > 120 )
- {
- dLine <- 2.5
- i <- round( iLen / 2 )
- strTaxon <- paste( substring( strTaxon, 0, i ), substring( strTaxon, i + 1 ), sep = "\n" )
- adMar[2] <- adMar[2] + 1
- }
- } else { dCEX = 1 }
-
- #Plot 1x2 graphs per page
- if(fDoResidualPlot){par(mfrow=c(1,2))}
-
- # Plot factor data as boxplot if is descrete data
- # Otherwise plot as a line
- adCur <- lsCur$metadata
- adY <- lsCur$data
-
- # Remove NAs from data visualization if set to do so (if liNaIndices is not empty)
- if(lsCur$name %in% names(liNaIndices)&&(length(liNaIndices[[lsCur$name]])>0))
- {
- adY <- adY[-1*liNaIndices[[lsCur$name]]]
- adCur =
adCur[-1*liNaIndices[[lsCur$name]]]
- if(class(adCur)=="factor")
- {
- adCur = factor(adCur)
- }
- }
-
- # Set the factor levels to include NA if they still exist
- # This is so if something is not imputed, then if there are NAs they will be plotted (to show no imputing)
- if( class( lsCur$metadata ) == "factor" )
- {
- sCurLevels = levels(adCur)
- adCur = (as.character(adCur))
- aiCurNAs = which(is.na(adCur))
- if(length(aiCurNAs) > 0)
- {
- adCur[aiCurNAs]="NA"
- sCurLevels = c(sCurLevels,"NA")
- }
- adCur = factor(adCur, levels = sCurLevels)
- }
-
- if( class( lsCur$metadata ) == "factor" )
- {
- astrNames <- c()
- astrColors <- c()
- dMed <- median( adY[adCur == levels( adCur )[1]], na.rm = TRUE )
- adIQR <- quantile( adY, probs = c(0.25, 0.75), na.rm = TRUE )
- dIQR <- adIQR[2] - adIQR[1]
- if( dIQR <= 0 )
- {
- dIQR <- sd( adY, na.rm = TRUE )
- }
- dIQR <- dIQR / 2
-
- #Print boxplots/strip charts of raw data. Add model data to it.
- for( strLevel in levels( adCur ) )
- {
- c_iLen <- 32
- strLength <- strLevel
- if( nchar( strLength ) > c_iLen )
- {
- iTmp <- ( c_iLen / 2 ) - 2
- strLength <- paste( substr( strLength, 1, iTmp ), substring( strLength, nchar( strLength ) - iTmp ), sep = "..."
)
- }
- astrNames <- c(astrNames, sprintf( "%s (%d)", strLength, sum( adCur == strLevel, na.rm = TRUE ) ))
- astrColors <- c(astrColors, sprintf( "%sAA", funcColor( ( median( adY[adCur == strLevel], na.rm = TRUE ) - dMed ) /
- dIQR, dMax = 3, dMin = -3, adMax = adColorMin, adMin = adColorMax, adMed = adColorMed ) ))
- }
- #Controls boxplot labels
- #(When there are many factor levels some are skipped and not plotted
- #So this must be reduced)
- dBoxPlotLabelCex = dCEX
- if(length(astrNames)>8)
- {
- dBoxPlotLabelCex = dBoxPlotLabelCex * 1.5/(length(astrNames)/8)
- }
- par(cex.axis = dBoxPlotLabelCex)
- boxplot( adY ~ adCur, notch = TRUE, names = astrNames, mar = adMar, col = astrColors,
- main = strTitle, cex.main = 48/nchar(strTitle), xlab = lsCur$name, ylab = NA, cex.lab = dCEX, outpch = 4, outcex = 0.5 )
- par(cex.axis = dCEX)
- stripchart( adY ~ adCur, add = TRUE, col = astrColors, method = "jitter", vertical = TRUE, pch = 20 )
- title( ylab = strTaxon, cex.lab = dCEX, line = dLine )
- } else {
- #Plot continuous data
- plot( adCur, adY, mar = adMar, main = strTitle, xlab = lsCur$name, pch = 20,
- col = sprintf( "%s99", funcGetColor( ) ), ylab = NA, xaxt = "s" )
- title( ylab = strTaxon, cex.lab = dCEX )
- lmod <- lm( adY ~ adCur )
- dColor <- lmod$coefficients[2] * mean( adCur, na.rm = TRUE ) / mean( adY, na.rm = TRUE )
- strColor <- sprintf( "%sDD", funcColor( dColor, adMax = adColorMin, adMin = adColorMax, adMed = adColorMed ) )
- abline( reg = lmod, col = strColor, lwd = 3 )
- }
- ### Plot the residual plot
- if(fDoResidualPlot){funcResidualPlot(lsCur=lsCur, frmeTmp=frmeTmp, adColorMin=adColorMin, adColorMax=adColorMax, adColorMed=adColorMed, adMar, funcUnTransform=funcUnTransform, liNaIndices)}
- return(strFilePDF)
- ### File to which the pdf was written
-}
-
-### Plot 1
-# axis 1 gene expression (one bug)
-# axis 2 PC1 (bugs and metadata)(MFA)
-# over plot real data vs the residuals (real data verses the prediction)
-# remove all but 1 bug + metadata
-### Plot 2
-#residuals (y) PCL1 (1 bug + metadata)
-#Plot 3
-### Can plot the residuals against all the variables in a grid/lattic
-funcGetFactorBoxColors = function(adCur,adY,adColorMin,adColorMax,adColorMed)
-{
- astrColors = c()
-
- if( class( adCur ) == "factor" )
- {
- if( "NA" %in% levels( adCur ) )
- {
- afNA = adCur == "NA"
- adY = adY[!afNA]
- adCur = adCur[!afNA]
- adCur = factor( adCur, levels = setdiff( levels( adCur ), "NA" ) )
- }
- dMed = median( adY[adCur == levels( adCur )[1]], na.rm = TRUE )
- adIQR = quantile( adY, probs = c(0.25, 0.75), na.rm = TRUE )
- dIQR = adIQR[2] - adIQR[1]
- if( dIQR <= 0 )
- {
- dIQR <- sd( adY, na.rm = TRUE )
- }
- dIQR <- dIQR / 2
-
- for( strLevel in levels( adCur ) )
- {
- astrColors <- c(astrColors, sprintf( "%sAA", funcColor( ( median( adY[adCur == strLevel], na.rm = TRUE ) - dMed ) /
- dIQR, dMax = 3, dMin = -3, adMax = adColorMin, adMin = adColorMax, adMed = adColorMed ) ))
- }
- }
- return(astrColors)
-}
-
-funcResidualPlotHelper <- function(
-frmTmp,
-### The dataframe to plot from
-sResponseFeature,
-lsFullModelCovariateNames,
-### All covariates in lm (Column Names)
-lsCovariateToControlForNames,
-### Y Axis: All the covariate which will be plotted together respectively * their beta with the residuals of the full model added. (These should dummied names for factor data; not column names).
-sCovariateOfInterest,
-### X Axis: raw data of the covariate of interest.
(Column Name)
-adColorMin,
-### Min color in color range for markers
-adColorMax,
-### Max color in color range for markers
-adColorMed,
-### Medium color in color range for markers
-adMar,
-### Standardized margins
-funcUnTransform = NULL,
-### If a transform is used the opposite of that transfor must be used on the residuals in the partial residual plots
-liNaIndices = c()
-### Indices of NA data that was imputed
-){
- # Get model matrix (raw data)
- adCur = frmTmp[[sResponseFeature]]
-
- # Make a formula to calculate the new model to get the full model residuals
- strFormula = paste("adCur",paste(sprintf( "`%s`", lsFullModelCovariateNames ),sep="", collapse="+"),sep="~")
-
- # Calculate lm
- lmod = (lm(as.formula(strFormula),frmTmp))
-
- # Get all coefficient data in the new model
- dfdCoefs = coefficients(lmod)
-
- # Get all coefficient names in the new model
- asAllCoefNames = names(dfdCoefs)
-
- # Get Y
- # For each covariate that is being plotted on the y axis
- # Convert the coefficient name to the column name
- # If they are equal then the data is not discontinuous and you can use the raw data as is and multiply it by the coefficient in the model
- # If they are not equal than the data is discontinuous, get the value for the data, set all but the levels equal to it to zero and multiply by the ceofficient from the model.
- vY = rep(coefficients(lmod)[["(Intercept)"]],dim(frmTmp)[1])
-# vY = rep(0,dim(frmTmp)[1])
-
- #Here we are not dealing with column names but, if factor data, the coefficient names
- for(iCvIndex in 1:length(lsCovariateToControlForNames))
- {
- sCurrentCovariate = lsCovariateToControlForNames[iCvIndex]
- #Get the column name of the current covariate (this must be used to compare to other column names)
- sCurCovariateColumnName = funcCoef2Col(sCurrentCovariate, frmTmp)
-
- #This is continuous data
- if(sCurrentCovariate == sCurCovariateColumnName)
- {
- vY = vY + dfdCoefs[sCurrentCovariate]*frmTmp[[sCurCovariateColumnName]]
- } else {
- #Discontinuous data
- # Get level
- xLevel = substr(sCurrentCovariate,nchar(sCurCovariateColumnName)+1,nchar(sCurrentCovariate))
-
- # Get locations where the data = level
- aiLevelIndices = rep(0,dim(frmTmp)[1])
- aiLevelIndices[which(frmTmp[sCurCovariateColumnName] == xLevel)]=1
- sCurrentCovariateBeta = dfdCoefs[sCurrentCovariate]
- if(is.na(sCurrentCovariateBeta)){sCurrentCovariateBeta=0}
- vY = vY + sCurrentCovariateBeta * aiLevelIndices
- }
- }
- #TODO based on transform vY = vY+sin(residuals(lmod))^2
- if(!is.null(funcUnTransform))
- {
- vY = vY + funcUnTransform(residuals(lmod))
- } else {
- vY = vY + residuals(lmod) }
-
- # Plot x, raw data
- ## y label
- sYLabel = paste(paste("B",lsCovariateToControlForNames,sep="*",collapse="+"),"e",sep="+")
- sTitle = "Partial Residual Plot"
-
- adCurXValues = frmTmp[[sCovariateOfInterest]]
-
- # If not plotting metadata that was originally NA then remove the imputed values here
- if(sCovariateOfInterest %in% names(liNaIndices)&&(length(liNaIndices[[sCovariateOfInterest]])>0))
- {
- adCurXValues = adCurXValues[-1*liNaIndices[[sCovariateOfInterest]]]
- vY <- vY[-1*liNaIndices[[sCovariateOfInterest]]]
- if(is.factor(adCurXValues)){adCurXValues = factor(adCurXValues)}
- }
-
- # Set the factor levels to include NA if they still exist
- # This is so if something is not imputed, then if there
are NAs they will be plotted (to show no imputing)
- # Do not forget to keep te level order incase it was changed by the custom scripts.
- if( class( adCurXValues ) == "factor" )
- {
- vsLevels = levels(adCurXValues)
- if(sum(is.na(adCurXValues))>0)
- {
- adCurXValues = as.character(adCurXValues)
- adCurXValues[is.na(adCurXValues)]="NA"
- adCurXValues = factor(adCurXValues, levels=c(vsLevels,"NA"))
- }
- }
-
- # Scale to the original range
- if(!(class( adCurXValues ) == "factor" ))
- {
- vY = vY + mean(adCurXValues,rm.na=TRUE)
- }
-
- # Plot Partial Residual Plot
- # If we are printing discontinuous data
- # Get the color of the box plots
- # Plot box plots
- # Plot data as strip charts
- if(is.factor(adCurXValues))
- {
-# adCurXValues = factor(adCurXValues)
- astrColors = funcGetFactorBoxColors(adCurXValues,vY,adColorMin,adColorMax,adColorMed)
- asNames = c()
- for(sLevel in levels(adCurXValues))
- {
- asNames = c(asNames,sprintf( "%s (%d)", sLevel, sum( adCurXValues == sLevel, na.rm = TRUE ) ))
- }
-
- plot(adCurXValues, vY, xlab=sCovariateOfInterest, ylab=sYLabel, names=asNames, notch = TRUE,mar = adMar,col = astrColors, main=sTitle, outpch = 4, outcex = 0.5 )
- stripchart( vY ~ adCurXValues, add = TRUE, col = astrColors, method = "jitter", vertical = TRUE, pch = 20 )
-
- } else {
- plot( adCurXValues, vY, mar = adMar, main = sTitle, xlab=sCovariateOfInterest, col = sprintf( "%s99", funcGetColor( ) ), pch = 20,ylab = sYLabel, xaxt = "s" )
-
- lmodLine = lm(vY~adCurXValues)
-
- dColor <- lmodLine$coefficients[2] * mean( adCurXValues, na.rm = TRUE ) / mean( vY, na.rm = TRUE )
- strColor <- sprintf( "%sDD", funcColor( dColor, adMax = adColorMin, adMin = adColorMax, adMed = adColorMed ) )
- abline( reg =lmodLine, col = strColor, lwd = 3 )
- }
-}
-
-funcBoostInfluencePlot <- function(
-# Plot to show the rel.inf from boosting, what to know if the rank order is correct, better ranks for spiked data.
-# Show the cut off and features identified as uneven.
-vdRelInf,
-sFeature,
-vsPredictorNames,
-vstrKeepMetadata,
-vstrUneven = c()
-){
- vsCol = rep("black",length(vdRelInf))
- vsCol[which(vsPredictorNames %in% vstrKeepMetadata)]="green"
- vsCol[which(vsPredictorNames %in% vstrUneven)] = "orange"
- plot(vdRelInf, col=vsCol, main=sFeature, xlab="Index", ylab="Relative Influence")
- legend("topright", pch = paste(1:length(vsPredictorNames)), legend= vsPredictorNames, text.col=vsCol, col=vsCol)
-}
-
-funcResidualPlot <- function(
-### Plot to data after confounding.
-### That is, in a linear model with significant coefficient b1 for variable x1,
-### that's been sparsified to some subset of terms: y = b0 + b1*x1 + sum(bi*xi)
-### Plot x1 on the X axis, and instead of y on the Y axis, instead plot:
-### y' = b0 + sum(bi*xi)
-lsCur,
-### Assocation to plot
-frmeTmp,
-### Data frame of orginal data
-adColorMin,
-### Min color in color range for markers
-adColorMax,
-### Max color in color range for markers
-adColorMed,
-### Medium color in color range for markers
-adMar,
-### Standardized margins
-funcUnTransform,
-### If a transform is used the opporite of that transfor must be used on the residuals in the partial residual plots
-liNaIndices = c()
-### Indices of NA data that was imputed
-){
- #Now plot residual hat plot
- #Get coefficient names
- asAllCoefs = setdiff(names(lsCur$allCoefs),c("(Intercept)"))
- asAllColNames = c()
- for(sCoef in asAllCoefs)
- {
- asAllColNames = c(asAllColNames,funcCoef2Col(sCoef,frmeData=frmeTmp))
- }
- asAllColNames = unique(asAllColNames)
-
- # All coefficients except for the one of interest
- lsOtherCoefs = setdiff(asAllColNames, c(lsCur$name))
-
- lsCovariatesToPlot = NULL
- if(is.factor(lsCur$metadata))
- {
- lsCovariatesToPlot = paste(lsCur$name,levels(lsCur$metadata),sep="")
- }else{lsCovariatesToPlot=c(lsCur$orig)}
-
- # If there are no other coefficients then skip plot
-# if(!length(lsOtherCoefs)){return()}
-
- # Plot residuals
- funcResidualPlotHelper(frmTmp=frmeTmp,
sResponseFeature=lsCur$taxon, lsFullModelCovariateNames=asAllColNames, lsCovariateToControlForNames=lsCovariatesToPlot, sCovariateOfInterest=lsCur$name, adColorMin=adColorMin, adColorMax=adColorMax, adColorMed=adColorMed, adMar=adMar, funcUnTransform=funcUnTransform, liNaIndices=liNaIndices)
-}
diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/Misc.R
--- a/maaslin-4450aa4ecc84/src/lib/Misc.R Sun Feb 08 23:21:34 2015 -0500
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,208 +0,0 @@
-#####################################################################################
-#Copyright (C) <2012>
-#
-#Permission is hereby granted, free of charge, to any person obtaining a copy of
-#this software and associated documentation files (the "Software"), to deal in the
-#Software without restriction, including without limitation the rights to use, copy,
-#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
-#and to permit persons to whom the Software is furnished to do so, subject to
-#the following conditions:
-#
-#The above copyright notice and this permission notice shall be included in all copies
-#or substantial portions of the Software.
-#
-#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
-#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
-#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
-#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
-#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-#
-# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models),
-# authored by the Huttenhower lab at the Harvard School of Public Health
-# (contact Timothy Tickle, ttickle@hsph.harvard.edu).
-#####################################################################################
-
-### Modified Code
-### This code is from the package agricolae by Felipe de Mendiburu
-### Modifications here are minimal and allow one to use the p.values from the post hoc comparisons
-### Authors do not claim credit for this solution only needed to modify code to use the output.
-kruskal <- function (y, trt, alpha = 0.05, p.adj = c("none", "holm", "hochberg",
- "bonferroni", "BH", "BY", "fdr"), group = TRUE, main = NULL)
-{
- dfComparisons=NULL
- dfMeans=NULL
- dntStudent=NULL
- dLSD=NULL
- dHMean=NULL
- name.y <- paste(deparse(substitute(y)))
- name.t <- paste(deparse(substitute(trt)))
- p.adj <- match.arg(p.adj)
- junto <- subset(data.frame(y, trt), is.na(y) == FALSE)
- N <- nrow(junto)
- junto[, 1] <- rank(junto[, 1])
- means <- tapply.stat(junto[, 1], junto[, 2], stat = "sum")
- sds <- tapply.stat(junto[, 1], junto[, 2], stat = "sd")
- nn <- tapply.stat(junto[, 1], junto[, 2], stat = "length")
- means <- data.frame(means, replication = nn[, 2])
- names(means)[1:2] <- c(name.t, name.y)
- ntr <- nrow(means)
- nk <- choose(ntr, 2)
- DFerror <- N - ntr
- rs <- 0
- U <- 0
- for (i in 1:ntr) {
- rs <- rs + means[i, 2]^2/means[i, 3]
- U <- U + 1/means[i, 3]
- }
- S <- (sum(junto[, 1]^2) - (N * (N + 1)^2)/4)/(N - 1)
- H <- (rs - (N * (N + 1)^2)/4)/S
-# cat("\nStudy:", main)
-# cat("\nKruskal-Wallis test's\nTies or no Ties\n")
-# cat("\nValue:", H)
-# cat("\ndegrees of freedom:", ntr - 1)
- p.chisq <- 1 - pchisq(H, ntr - 1)
-# cat("\nPvalue chisq :", p.chisq, "\n\n")
- DFerror <- N - ntr
- Tprob <- qt(1 - alpha/2, DFerror)
- MSerror <- S * ((N - 1 - H)/(N - ntr))
- means[, 2] <- means[, 2]/means[, 3]
-# cat(paste(name.t, ",", sep = ""), " means of the ranks\n\n")
- dfMeans=data.frame(row.names = means[, 1], means[, -1])
- if (p.adj != "none") {
-# cat("\nP value adjustment method:", p.adj)
- a <- 1e-06
- b <- 1
- for (i in 1:100) {
- x <- (b + a)/2
- xr <- rep(x, nk)
- d <-
p.adjust(xr, p.adj)[1] - alpha - ar <- rep(a, nk) - fa <- p.adjust(ar, p.adj)[1] - alpha - if (d * fa < 0) - b <- x - if (d * fa > 0) - a <- x - } - Tprob <- qt(1 - x/2, DFerror) - } - nr <- unique(means[, 3]) - if (group) { - Tprob <- qt(1 - alpha/2, DFerror) -# cat("\nt-Student:", Tprob) -# cat("\nAlpha :", alpha) - dntStudent=Tprob - dAlpha=alpha - if (length(nr) == 1) { - LSD <- Tprob * sqrt(2 * MSerror/nr) -# cat("\nLSD :", LSD, "\n") - dLSD=LSD - } - else { - nr1 <- 1/mean(1/nn[, 2]) - LSD1 <- Tprob * sqrt(2 * MSerror/nr1) -# cat("\nLSD :", LSD1, "\n") - dLSD =LSD1 -# cat("\nHarmonic Mean of Cell Sizes ", nr1) - dHMean=nr1 - } -# cat("\nMeans with the same letter are not significantly different\n") -# cat("\nGroups, Treatments and mean of the ranks\n") - output <- order.group(means[, 1], means[, 2], means[, - 3], MSerror, Tprob, std.err = sqrt(MSerror/means[, - 3])) - dfComparisons=order.group(means[, 1], means[, 2], means[, - 3], MSerror, Tprob, std.err = sqrt(MSerror/means[, - 3])) - } - if (!group) { - comb <- combn(ntr, 2) - nn <- ncol(comb) - dif <- rep(0, nn) - LCL <- dif - UCL <- dif - pvalue <- dif - sdtdif <- dif - for (k in 1:nn) { - i <- comb[1, k] - j <- comb[2, k] - if (means[i, 2] < means[j, 2]) { - comb[1, k] <- j - comb[2, k] <- i - } - dif[k] <- abs(means[i, 2] - means[j, 2]) - sdtdif[k] <- sqrt(S * ((N - 1 - H)/(N - ntr)) * (1/means[i, - 3] + 1/means[j, 3])) - pvalue[k] <- 2 * round(1 - pt(dif[k]/sdtdif[k], DFerror), - 6) - } - if (p.adj != "none") - pvalue <- round(p.adjust(pvalue, p.adj), 6) - LCL <- dif - Tprob * sdtdif - UCL <- dif + Tprob * sdtdif - sig <- rep(" ", nn) - for (k in 1:nn) { - if (pvalue[k] <= 0.001) - sig[k] <- "***" - else if (pvalue[k] <= 0.01) - sig[k] <- "**" - else if (pvalue[k] <= 0.05) - sig[k] <- "*" - else if (pvalue[k] <= 0.1) - sig[k] <- "." 
- } - tr.i <- means[comb[1, ], 1] - tr.j <- means[comb[2, ], 1] - dfComparisons <- data.frame(Difference = dif, p.value = pvalue, - sig, LCL, UCL) - rownames(dfComparisons) <- paste(tr.i, tr.j, sep = " - ") -# cat("\nComparison between treatments mean of the ranks\n\n") -# print(output) - dfMeans <- data.frame(trt = means[, 1], means = means[, - 2], M = "", N = means[, 3]) - } -# invisible(output) - invisible(list(study=main,test="Kruskal-Wallis test",value=H,df=(ntr - 1),chisq.p.value=p.chisq,p.adj.method=p.adj,ntStudent=dntStudent,alpha=alpha,LSD=dLSD,Harmonic.mean=dHMean,comparisons=dfComparisons,means=dfMeans)) -} - -### This function is NOT original code but is from the gamlss package. -### It is written here in an effort to over write the gamlss object summary method -### so that I can return information of interest. -estimatesgamlss<-function (object, Qr, p1, coef.p, - est.disp , df.r, - digits = max(3, getOption("digits") - 3), - covmat.unscaled , ...) -{ - #covmat.unscaled <- chol2inv(Qr$qr[p1, p1, drop = FALSE]) - dimnames(covmat.unscaled) <- list(names(coef.p), names(coef.p)) - covmat <- covmat.unscaled #in glm is=dispersion * covmat.unscaled, but here is already multiplied by the dispersion - var.cf <- diag(covmat) - s.err <- sqrt(var.cf) - tvalue <- coef.p/s.err - dn <- c("Estimate", "Std. Error") - if (!est.disp) - { - pvalue <- 2 * pnorm(-abs(tvalue)) - coef.table <- cbind(coef.p, s.err, tvalue, pvalue) - dimnames(coef.table) <- list(names(coef.p), c(dn, "z value","Pr(>|z|)")) - } else if (df.r > 0) { - pvalue <- 2 * pt(-abs(tvalue), df.r) - coef.table <- cbind(coef.p, s.err, tvalue, pvalue) - dimnames(coef.table) <- list(names(coef.p), c(dn, "t value","Pr(>|t|)")) - } else { - coef.table <- cbind(coef.p, Inf) - dimnames(coef.table) <- list(names(coef.p), dn) - } - return(coef.table) -} - -### This function is NOT original code but is from the gamlss package. 
-### It is written here in an effort to over write the gamlss object summary method -### so that I can return information of interest. -summary.gamlss<- function (object, type = c("vcov", "qr"), save = FALSE, ...) -{ - return(as.data.frame(estimatesgamlss(object=object,Qr=object$mu.qr, p1=1:(object$mu.df-object$mu.nl.df), - coef.p=object$mu.coefficients[object$mu.qr$pivot[1:(object$mu.df-object$mu.nl.df)]], - est.disp =TRUE, df.r=(object$noObs - object$mu.df), - covmat.unscaled=chol2inv(object$mu.qr$qr[1:(object$mu.df-object$mu.nl.df), 1:(object$mu.df-object$mu.nl.df), drop = FALSE]) )) ) -} \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/SummarizeMaaslin.R --- a/maaslin-4450aa4ecc84/src/lib/SummarizeMaaslin.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,86 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Creates a summary of association detail files. -) { return( pArgs ) } - -#Logging class -suppressMessages(library(logging, warn.conflicts=FALSE, quietly=TRUE, verbose=FALSE)) - -# Get logger -c_logrMaaslin <- getLogger( "maaslin" ) - -funcSummarizeDirectory = function( -### Summarizes the massline detail files into one file based on significance. -astrOutputDirectory, -### The output directory to find the MaAsLin results. -strBaseName, -### The prefix string used in maaslin to start the detail files. 
-astrSummaryFileName, -### The summary file's name, should be a path not a file name -astrKeyword, -### The column name of the data to check significance before adding a detail to the summary -afSignificanceLevel -### The value of significance the data must be at or below to be included in the summary (0.0 is most significant; like p-values) -){ - #Store significant data elements - dfSignificantData = NULL - - #Get detail files in output directory - astrlsDetailFiles = list.files(astrOutputDirectory, pattern=paste(strBaseName,"-","[[:print:]]*",c_sDetailFileSuffix,sep=""), full.names=TRUE) - logdebug(format(astrlsDetailFiles),c_logrMaaslin) - - #For each file after the first file - for(astrFile in astrlsDetailFiles) - { - #Read in data and reduce to significance - dfDetails = read.table(astrFile, header=TRUE, sep=c_cTableDelimiter) - dfDetails = dfDetails[which(dfDetails[astrKeyword] <= afSignificanceLevel),] - - #Combine with other data if it exists - if(is.null(dfSignificantData)) - { - dfSignificantData = dfDetails - } else { - dfSignificantData = rbind(dfSignificantData,dfDetails) - } - } - - #Write data to file - unlink(astrSummaryFileName) - if(is.null(dfSignificantData)) - { - funcWrite("No significant data found.",astrSummaryFileName) - return( NULL ) - } else { - #Sort by metadata and then significance - dfSignificantData = dfSignificantData[order(dfSignificantData$Value, dfSignificantData$P.value, decreasing = FALSE),] - funcWriteTable( dfSignificantData, astrSummaryFileName, fAppend = FALSE ) - # Sort by q.value and return - return( dfSignificantData[ order( dfSignificantData$P.value, decreasing = FALSE ), ] ) - } -} diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/Utility.R --- a/maaslin-4450aa4ecc84/src/lib/Utility.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,503 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is 
hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). 
-##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Collection of minor utility scripts -) { return( pArgs ) } - -#source("Constants.R") - -funcRename <- function( -### Modifies labels for plotting -### If the name is not an otu collapse to the last two clades -### Otherwise use the most terminal clade -astrNames -### Names to modify for plotting -){ - astrRet <- c() - for( strName in astrNames ) - { - astrName <- strsplit( strName, c_cFeatureDelimRex )[[1]] - i <- length( astrName ) - if( ( astrName[i] == c_strUnclassified ) || !is.na( as.numeric( astrName[i] ) ) ) - { - strRet <- paste( astrName[( i - 1 ):i], collapse = c_cFeatureDelim ) - } else { - strRet <- astrName[i] - } - astrRet <- c(astrRet, strRet) - } - return( astrRet ) - ### List of modified names -} - -funcBonferonniCorrectFactorData <- function -### Bonferroni correct for factor data -(dPvalue, -### P-value to correct -vsFactors, -### Factors of the data to correct -fIgnoreNAs = TRUE -){ - vsUniqueFactors = unique( vsFactors ) - if( fIgnoreNAs ){ vsUniqueFactors = setdiff( vsUniqueFactors, c("NA","na","Na","nA") ) } - return( dPvalue * max( 1, ( length( vsUniqueFactors ) - 1 ) ) ) - ### Numeric p-value that is correct for levels (excluding NA levels) -} - -funcCalculateTestCounts <- function( -### Calculates the number of tests used in inference -iDataCount, -asMetadata, -asForced, -asRandom, -fAllvAll -){ - iMetadata = length(asMetadata) - iForced = length(setdiff(intersect( asForced, asMetadata ), asRandom)) - iRandom = length(intersect( asRandom, asMetadata )) - if(fAllvAll) - { - #AllvAll flow formula - return((iMetadata-iForced-iRandom) * iDataCount) - } - - #Normal flow formula - return((iMetadata-iRandom) * iDataCount) -} - -funcGetRandomColors=function( -#Generates a given number of random colors -tempNumberColors = 1 -### Number of colors to generate -){ - adRet = 
c() - return(sapply(1:tempNumberColors, function(x){ - adRGB <- ( runif( 3 ) * 0.66 ) + 0.33 - adRet <- c(adRet, rgb( adRGB[1], adRGB[2], adRGB[3] )) - })) -} - -funcCoef2Col <- function( -### Searches through a dataframe and looks for a column that would match the coefficient -### by the name of the column or the column name and level appended together. -strCoef, -### String coefficient name -frmeData, -### Data frame of data -astrCols = c() -### Column names of interest (if NULL is given, all column names are inspected). -){ - #If the coefficient is the intercept there is no data column to return so return null - if( strCoef %in% c("(Intercept)", "Intercept") ) { return( NULL ) } - #Remove ` from coefficient - strCoef <- gsub( "`", "", strCoef ) - - #If the coefficient name is not in the data frame - if( !( strCoef %in% colnames( frmeData ) ) ) - { - fHit <- FALSE - #If the column names are not provided, use the column names of the dataframe. - if( is.null( astrCols ) ){astrCols <- colnames( frmeData )} - - #Search through the different column names (factors) - for( strFactor in astrCols ) - { - #Select a column, if it is not a factor or does not begin with the factor's name then skip - adCur <- frmeData[,strFactor] - if( ( class( adCur ) != "factor" ) || - ( substr( strCoef, 1, nchar( strFactor ) ) != strFactor ) ) { next } - - #For the factors, create factor-level name combinations to read in factors - #Then check to see the factor-level combination is the coefficient of interest - #If it is then store that factor as the coefficient of interest - #And break - for( strValue in levels( adCur ) ) - { - strCur <- paste( strFactor, strValue, sep = c_sFactorNameSep ) - if( strCur == strCoef ) - { - strCoef <- strFactor - fHit <- TRUE - break - } - } - - #If the factor was found, return - if( fHit ){break } - } - } - - #If the original coefficient or the coefficient factor combination name are in the - #data frame, return the name. Otherwise return NA. 
- return( ifelse( ( strCoef %in% colnames( frmeData ) ), strCoef, NA ) ) - ### Coefficient name -} - -funcColToMFAValue = function( -### Given a column name, return the MFA values that could be associated with the name -lsColNames, -### String list of column names (as you would get from names(dataframe)) -dfData -### Data frame of data the column names refer to -){ - lsMFAValues = c() - - for(sColName in lsColNames) - { - axCur = dfData[[sColName]] - - if(is.logical(axCur)){axCur=as.factor(axCur)} - if(is.factor(axCur)) - { - lsLevels = levels(axCur) - if((length(lsLevels)==2) && (!is.na(as.numeric(lsLevels[1]))) && (!is.na(as.numeric(lsLevels[2])))) - { - lsMFAValues = c(lsMFAValues,paste(sColName,lsLevels[1],sep=c_sMFANameSep1),paste(sColName,lsLevels[2],sep=c_sMFANameSep1)) - }else{ - for(sLevel in levels(axCur)) - { - lsMFAValues = c(lsMFAValues,sLevel) - } - } - } else { - lsMFAValues = c(lsMFAValues,sColName) - } - } - return(setdiff(lsMFAValues,c("NA",NA))) -} - -funcMFAValue2Col = function( -### Given a value in a column, the column name is returned. 
-xValue, -dfData, -aiColumnIndicesToSearch = NULL -){ - lsColumnNames = names(dfData) - - if(is.null(aiColumnIndicesToSearch)) - { - aiColumnIndicesToSearch = c(1:dim(dfData)[2]) - } - - # Could be the column name - if(xValue %in% lsColumnNames){return(xValue)} - - # Could be the column name and value - iValueLength = length(xValue) - for( iColIndex in c(1:length(lsColumnNames) )) - { - adCur = dfData[[lsColumnNames[iColIndex]]] - if(is.factor(adCur)) - { - for(strValue in levels(adCur)) - { - strCurVersion1 <- paste( lsColumnNames[iColIndex], strValue, sep = c_sMFANameSep1 ) - strCurVersion2 <- paste( lsColumnNames[iColIndex], strValue, sep = c_sMFANameSep2 ) - if((xValue == strCurVersion1) || (xValue == strCurVersion2)){return(lsColumnNames[iColIndex])} - } - } - } - - # Could be the value - for(iColIndex in aiColumnIndicesToSearch) - { - if(xValue %in% dfData[[lsColumnNames[iColIndex]]]){return(lsColumnNames[iColIndex])} - } - return(NULL) -} - -funcColorHelper <- function( -### Makes sure the max is max and the min is min, and dmed is average -dMax = 1, -### Max number -dMin = -1, -### Min number -dMed = NA -### Average value -){ - #Make sure max is max and min is min - vSort = sort(c(dMin,dMax)) - return( list( dMin = vSort[1], dMax = vSort[2], dMed = ifelse((is.na(dMed)), (dMin+dMax)/2.0, dMed ) )) - ### List of min, max and med numbers -} - -funcColor <- function( -### Generate a color based on a number that is forced to be between a min and max range. 
-### The color is based on how far the number is from the center of the given range -### From red to green (high) are produced with default settings -dX, -### Number from which to generate the color -dMax = 1, -### Max possible value -dMin = -1, -### Min possible value -dMed = NA, -### Central value if you don't want to be the average -adMax = c(1, 1, 0), -### Is used to generate the color for the higher values in the range, this can be changed to give different colors set to green -adMin = c(0, 0, 1), -### Is used to generate the color for the lower values in the range, this can be changed to give different colors set to red -adMed = c(0, 0, 0) -### Is used to generate the color for the central values in the range, this can be changed to give different colors set to black -){ - lsTmp <- funcColorHelper( dMax, dMin, dMed ) - dMax <- lsTmp$dMax - dMin <- lsTmp$dMin - dMed <- lsTmp$dMed - if( is.na( dX ) ) - { - dX <- dMed - } - if( dX > dMax ) - { - dX <- dMax - } else if( dX < dMin ) - { - dX <- dMin } - if( dX < dMed ) - { - d <- ( dMed - dX ) / ( dMed - dMin ) - adCur <- ( adMed * ( 1 - d ) ) + ( adMin * d ) - } else { - d <- ( dMax - dX ) / ( dMax - dMed ) - adCur <- ( adMed * d ) + ( adMax * ( 1 - d ) ) - } - return( rgb( adCur[1], adCur[2], adCur[3] ) ) - ### RGB object -} - -funcColors <- function( -### Generate a range of colors -dMax = 1, -### Max possible value -dMin = -1, -### Min possible value -dMed = NA, -### Central value if you don't want to be the average -adMax = c(1, 1, 0), -### Is used to generate the color for the higher values in the range, this can be changed to give different colors set to green -adMin = c(0, 0, 1), -### Is used to generate the color for the lower values in the range, this can be changed to give different colors set to red -adMed = c(0, 0, 0), -### Is used to generate the color for the central values in the range, this can be changed to give different colors set to black -iSteps = 64 -### Number of intermediary colors made in 
the range of colors -){ - lsTmp <- funcColorHelper( dMax, dMin, dMed ) - dMax <- lsTmp$dMax - dMin <- lsTmp$dMin - dMed <- lsTmp$dMed - aRet <- c () - for( dCur in seq( dMin, dMax, ( dMax - dMin ) / ( iSteps - 1 ) ) ) - { - aRet <- c(aRet, funcColor( dCur, dMax, dMin, dMed, adMax, adMin, adMed )) - } - return( aRet ) - ### List of colors -} - -funcGetColor <- function( -### Get a color based on col parameter -) { - adCol <- col2rgb( par( "col" ) ) - return( sprintf( "#%02X%02X%02X", adCol[1], adCol[2], adCol[3] ) ) - ### Return hexadecimal color -} - -funcTrim=function( -### Remove whitespace at the beginning or the end of a string -tempString -### tempString String to be trimmed. -){ - return(gsub("^\\s+|\\s+$","",tempString)) -} - -funcWrite <- function( -### Write a string or a table of data -### This transposes a table before it is written -pOut, -### String or table to write -strFile -### File to which to write -){ - if(!is.na(strFile)) - { - if( length( intersect( class( pOut ), c("character", "numeric") ) ) ) - { - write.table( t(pOut), strFile, quote = FALSE, sep = c_cTableDelimiter, col.names = FALSE, row.names = FALSE, na = "", append = TRUE ) - } else { - capture.output( print( pOut ), file = strFile, append = TRUE ) - } - } -} - -funcWriteTable <- function( -### Log a table to a file -frmeTable, -### Table to write -strFile, -### File to which to write -fAppend = FALSE -### Append when writing -){ - if(!is.na(strFile)) - { - write.table( frmeTable, strFile, quote = FALSE, sep = c_cTableDelimiter, na = "", col.names = NA, append = fAppend ) - } -} - -funcWriteQCReport <- function( -### Write out the quality control report -strProcessFileName, -### File name -lsQCData, -### List of QC data generated by maaslin to be written -liDataDim, -### Dimensions of the data matrix -liMetadataDim -### Dimensions of the metadata matrix -){ - unlink(strProcessFileName) - funcWrite( paste("Initial Metadata Matrix Size: Rows ",liMetadataDim[1]," Columns 
",liMetadataDim[2],sep=""), strProcessFileName ) - funcWrite( paste("Initial Data Matrix Size: Rows ",liDataDim[1]," Columns ",liDataDim[2],sep=""), strProcessFileName ) - funcWrite( paste("\nInitial Data Count: ",length(lsQCData$aiDataInitial),sep=""), strProcessFileName ) - funcWrite( paste("Initial Metadata Count: ",length(lsQCData$aiMetadataInitial),sep=""), strProcessFileName ) - funcWrite( paste("Data Count after preprocess: ",length(lsQCData$aiAfterPreprocess),sep=""), strProcessFileName ) - funcWrite( paste("Removed for missing metadata: ",length(lsQCData$iMissingMetadata),sep=""), strProcessFileName ) - funcWrite( paste("Removed for missing data: ",length(lsQCData$iMissingData),sep=""), strProcessFileName ) - funcWrite( paste("Number of data with outliers: ",length(which(lsQCData$aiDataSumOutlierPerDatum>0)),sep=""), strProcessFileName ) - funcWrite( paste("Number of metadata with outliers: ",length(which(lsQCData$aiMetadataSumOutlierPerDatum>0)),sep=""), strProcessFileName ) - funcWrite( paste("Metadata count which survived clean: ",length(lsQCData$aiMetadataCleaned),sep=""), strProcessFileName ) - funcWrite( paste("Data count which survived clean: ",length(lsQCData$aiDataCleaned),sep=""), strProcessFileName ) - funcWrite( paste("\nBoostings: ",lsQCData$iBoosts,sep=""), strProcessFileName ) - funcWrite( paste("Boosting Errors: ",lsQCData$iBoostErrors,sep=""), strProcessFileName ) - funcWrite( paste("LMs with no terms suriving boosting: ",lsQCData$iNoTerms,sep=""), strProcessFileName ) - funcWrite( paste("LMs performed: ",lsQCData$iLms,sep=""), strProcessFileName ) - if(!is.null(lsQCData$lsQCCustom)) - { - funcWrite("Custom preprocess QC data: ", strProcessFileName ) - funcWrite(lsQCData$lsQCCustom, strProcessFileName ) - } else { - funcWrite("No custom preprocess QC data.", strProcessFileName ) - } - funcWrite( "\n#Details###########################", strProcessFileName ) - funcWrite("\nInitial Data Count: ", strProcessFileName ) - 
funcWrite(lsQCData$aiDataInitial, strProcessFileName ) - funcWrite("\nInitial Metadata Count: ", strProcessFileName ) - funcWrite(lsQCData$aiMetadataInitial, strProcessFileName ) - funcWrite("\nData Count after preprocess: ", strProcessFileName ) - funcWrite(lsQCData$aiAfterPreprocess, strProcessFileName ) - funcWrite("\nRemoved for missing metadata: ", strProcessFileName ) - funcWrite(lsQCData$iMissingMetadata, strProcessFileName ) - funcWrite("\nRemoved for missing data: ", strProcessFileName ) - funcWrite(lsQCData$iMissingData, strProcessFileName ) - funcWrite("\nDetailed outlier indices: ", strProcessFileName ) - for(sFeature in names(lsQCData$liOutliers)) - { - funcWrite(paste("Feature",sFeature,"Outlier indice(s):", paste(lsQCData$liOutliers[[sFeature]],collapse=",")), strProcessFileName ) - } - funcWrite("\nMetadata which survived clean: ", strProcessFileName ) - funcWrite(lsQCData$aiMetadataCleaned, strProcessFileName ) - funcWrite("\nData which survived clean: ", strProcessFileName ) - funcWrite(lsQCData$aiDataCleaned, strProcessFileName ) -} - -funcLMToNoNAFormula <-function( -lMod, -frmeTmp, -adCur -){ - dfCoef = coef(lMod) - astrCoefNames = setdiff(names(dfCoef[as.vector(!is.na(dfCoef))==TRUE]),"(Intercept)") - astrPredictors = unique(as.vector(sapply(astrCoefNames,funcCoef2Col, frmeData=frmeTmp))) - strFormula = paste( "adCur ~", paste( sprintf( "`%s`", astrPredictors ), collapse = " + " ), sep = " " ) - return(try( lm(as.formula( strFormula ), data=frmeTmp ))) -} - -funcFormulaStrToList <- function( -#Takes a lm or mixed model formula and returns a list of covariate names in the formula -strFormula -#Formula to extract covariates from -){ - #Return list - lsRetComparisons = c() - - #If you get a null or na just return - if(is.null(strFormula)||is.na(strFormula)){return(lsRetComparisons)} - - #Get test comparisons (predictor names from formula string) - asComparisons = gsub("`","",setdiff(unlist(strsplit(unlist(strsplit(strFormula,"~"))[2]," 
")),c("","+"))) - - #Change metadata in formula to univariate comparisons - for(sComparison in asComparisons) - { - #Removed random covariate formating - lsParse = unlist(strsplit(sComparison, "[\\(\\|\\)]", perl=FALSE)) - lsRetComparisons = c(lsRetComparisons,lsParse[length(lsParse)]) - } - return(lsRetComparisons) -} - -funcFormulaListToString <- function( -# Using covariate and random covariate names, creates a lm or mixed model formula -# returns a vector of c(strLM, strMixedModel), one will be NA given the existance of random covariates. -# On error c(NA,NA) is given -astrTerms, -#Fixed covariates or all covariates if using an lm -astrRandomCovariates = NULL -#Random covariates for a mixed model -){ - strRetLMFormula = NA - strRetMMFormula = NA - - #If no covariates return NA - if(is.null(astrTerms)){return(c(strRetLMFormula, strRetMMFormula))} - - #Get fixed covariates - astrFixedCovariates = setdiff(astrTerms,astrRandomCovariates) - - #If no fixed coavariates return NA - # Can not run a model with no fixed covariate, restriction of lmm - if(length(astrFixedCovariates)==0){return(c(strRetLMFormula, strRetMMFormula))} - - # Fixed Covariates - strFixedCovariates = paste( sprintf( "`%s`", astrFixedCovariates ), collapse = " + " ) - - #If random covariates, set up a formula for mixed models - if(length(astrRandomCovariates)>0) - { - #Format for lmer - #strRetFormula <- paste( "adCur ~ ", paste( sprintf( "(1|`%s`))", intersect(astrRandomCovariates, astrTerms)), collapse = " + " )) - #Format for glmmpql - strRandomCovariates = paste( sprintf( "1|`%s`", setdiff(astrRandomCovariates, astrTerms)), collapse = " + " ) - strRetMMFormula <- paste( "adCur ~ ", strFixedCovariates, " + ", strRandomCovariates, sep="") - } else { - #This is either the formula for all covariates in an lm or fixed covariates in the lmm - strRetLMFormula <- paste( "adCur ~ ", strFixedCovariates, sep="") - } - return(c(strRetLMFormula, strRetMMFormula)) -} \ No newline at end of file diff -r 
589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/ValidateData.R --- a/maaslin-4450aa4ecc84/src/lib/ValidateData.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,93 +0,0 @@ -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. -# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### - -inlinedocs <- function( -##author<< Curtis Huttenhower and Timothy Tickle -##description<< Minor validation files to check data typing when needed. 
-) { return( pArgs ) } - -funcIsValid <- function( -### Requires a data to not be NA, not be NULL -### Returns True on meeting these requirements, returns false otherwise -### Return boolean Indicator of not being empty (TRUE = not empty) -tempData = NA -### Parameter tempData Is evaluated as not empty -){ - #If the data is not na or null return true - if(!is.null(tempData)) - { - if(length(tempData)==1){ return(!is.na(tempData)) } - return(TRUE) - } - return(FALSE) - ### True (Valid) false (invalid) -} - -funcIsValidString <- function( -### Requires a data to not be NA, not be NULL, and to be of type Character -### Returns True on meeting these requirements, returns false otherwise -### Return boolean Indicator of identity as a string -tempData = NA -### Parameter tempData Is evaluated as a string -){ - #If is not a valid data return false - if(!funcIsValid(tempData)) - { - return(FALSE) - } - #If is a string return true - if((class(tempData)=="character")&&(length(tempData)==1)) - { - return(TRUE) - } - return(FALSE) - ### True (Valid) false (invalid) -} - -funcIsValidFileName <- function( -### Requires a data to not be NA, not be NULL, and to be a valid string -### which points to an existing file -### Returns True on meeting these requirements, returns false otherwise -### Return boolean Indicator of identity as a file name -tempData = NA, -### Parameter tempData Is evaluated as a file name -fVerbose=FALSE -### Verbose will print the file path when not valid. -){ - #If is not valid string return false - if(!(funcIsValidString(tempData))) - { - if(fVerbose){print(paste("FunctIsValidFileName: InvalidString. Value=",tempData,sep=""))} - return(FALSE) - } - #If is a valid string and points to a file - if(file.exists(tempData)) - { - return(TRUE) - } - if(fVerbose){print(paste("FunctIsValidFileName: Path does not exist. 
Value=",tempData,sep=""))} - return(FALSE) - ### True (Valid) false (invalid) -} diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/lib/scriptBiplotTSV.R --- a/maaslin-4450aa4ecc84/src/lib/scriptBiplotTSV.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,515 +0,0 @@ -#!/usr/bin/env Rscript - -library(vegan) -library(optparse) - -funcGetCentroidForMetadatum <- function( -### Given a binary metadatum, calculate the centroid of the samples associated with the metadata value of 1 -# 1. Get all samples that have the metadata value of 1 -# 2. Get the x and y coordinates of the selected samples -# 3. Get the median value for the x and ys -# 4. Return those coordinates as the centroid's X and Y value -vfMetadata, -### Logical or integer (0,1) vector, TRUE or 1 values indicate correspoinding samples in the -### mSamplePoints which will be used to define the centroid -mSamplePoints -### Coordinates (columns;n=2) of samples (rows) corresponding to the vfMetadata -){ - # Check the lengths which should be equal - if(length(vfMetadata)!=nrow(mSamplePoints)) - { - print(paste("funcGetCentroidForMetadata::Error: Should have received metadata and samples of the same length, received metadata length ",length(vfMetadata)," and sample ",nrow(mSamplePoints)," length.",sep="")) - return( FALSE ) - } - - # Get all the samples that have the metadata value of 1 - viMetadataSamples = which(as.integer(vfMetadata)==1) - - # Get the x and y coordinates for the selected samples - mSelectedPoints = mSamplePoints[viMetadataSamples,] - - # Get the median value for the x and the ys - if(!is.null(nrow(mSelectedPoints))) - { - return( list(x=median(mSelectedPoints[,1],na.rm = TRUE),y=median(mSelectedPoints[,2],na.rm = TRUE)) ) - } else { - return( list(x=mSelectedPoints[1],y=mSelectedPoints[2]) ) - } -} - -funcGetMaximumForMetadatum <- function( -### Given a continuous metadata -### 1. 
Use the x and ys from mSamplePoints for coordinates and the metadata value as a height (z) -### 2. Use lowess to smooth the landscape -### 3. Take the maximum of the landscape -### 4. Return the coordiantes for the maximum as the centroid -vdMetadata, -### Continuous (numeric or integer) metadata -mSamplePoints -### Coordinates (columns;n=2) of samples (rows) corresponding to the vfMetadata -){ - # Work with data frame - if(class(mSamplePoints)=="matrix") - { - mSamplePoints = data.frame(mSamplePoints) - } - # Check the lengths of the dataframes and the metadata - if(length(vdMetadata)!=nrow(mSamplePoints)) - { - print(paste("funcGetMaximumForMetadatum::Error: Should have received metadata and samples of the same length, received metadata length ",length(vdMetadata)," and sample ",nrow(mSamplePoints)," length.",sep="")) - return( FALSE ) - } - - # Add the metadata value to the points - mSamplePoints[3] = vdMetadata - names(mSamplePoints) = c("x","y","z") - - # Create lowess to smooth the surface - # And calculate the fitted heights - # x = sample coordinate 1 - # y = sample coordinate 2 - # z = metadata value - loessSamples = loess(z~x*y, data=mSamplePoints, degree = 1, normalize = FALSE, na.action=na.omit) - - # Naively get the max - vdCoordinates = loessSamples$x[which(loessSamples$y==max(loessSamples$y)),] - return(list(lsmod = loessSamples, x=vdCoordinates[1],y=vdCoordinates[2])) -} - -funcMakeShapes <- function( -### Takes care of defining shapes for the plot -dfInput, -### Data frame of metadata measurements -sShapeBy, -### The metadata to shape by -sShapes, -### List of custom metadata (per level if factor). 
-### Should correspond to the number of levels in shapeBy; the format is level:shape,level:shape for example HighLuminosity:14,LowLuminosity:2,HighPH:10,LowPH:18 -cDefaultShape -### Shape to default to if custom shapes are not used -){ - lShapes = list() - vsShapeValues = c() - vsShapeShapes = c() - vsShapes = c() - sMetadataId = sShapeBy - - # Set default shape, color, and color ranges - if(!is.null(cDefaultShape)) - { - # Default shape should be an int for the int pch options - if(!is.na(as.integer(cDefaultShape))) - { - cDefaultShape = as.integer(cDefaultShape) - } - } else { - cDefaultShape = 16 - } - - # Make shapes - vsShapes = rep(cDefaultShape,nrow(dfInput)) - - if(!is.null(sMetadataId)) - { - if(is.null(sShapes)) - { - vsShapeValues = unique(dfInput[[sMetadataId]]) - vsShapeShapes = 1:length(vsShapeValues) - } else { - # Put the markers in the order of the values) - vsShapeBy = unlist(strsplit(sShapes,",")) - for(sShapeBy in vsShapeBy) - { - vsShapeByPieces = unlist(strsplit(sShapeBy,":")) - lShapes[vsShapeByPieces[1]] = as.integer(vsShapeByPieces[2]) - } - vsShapeValues = names(lShapes) - } - - # Shapes in the correct order - if(!is.null(sShapes)) - { - vsShapeShapes = unlist(lapply(vsShapeValues,function(x) lShapes[[x]])) - } - vsShapeValues = paste(vsShapeValues) - - # Make the list of shapes - for(iShape in 1:length(vsShapeValues)) - { - vsShapes[which(paste(dfInput[[sMetadataId]])==vsShapeValues[iShape])]=vsShapeShapes[iShape] - } - - # If they are all numeric characters, make numeric - viIntNas = which(is.na(as.integer(vsShapes))) - viNas = which(is.na(vsShapes)) - if(length(setdiff(viIntNas,viNas))==0) - { - vsShapes = as.integer(vsShapes) - } else { - print("funcMakeShapes::Error: Please supply numbers 1-25 for shape in the -y,--shapeBy option") - vsShapeValues = c() - vsShapeShapes = c() - } - } - return(list(PlotShapes=vsShapes,Values=vsShapeValues,Shapes=vsShapeShapes,ID=sMetadataId,DefaultShape=cDefaultShape)) -} - -### Global defaults 
-c_sDefaultColorBy = NULL -c_sDefaultColorRange = "orange,cyan" -c_sDefaultTextColor = "black" -c_sDefaultArrowColor = "cyan" -c_sDefaultArrowTextColor = "Blue" -c_sDefaultNAColor = "grey" -c_sDefaultShapeBy = NULL -c_sDefaultShapes = NULL -c_sDefaultMarker = "16" -c_sDefaultRotateByMetadata = NULL -c_sDefaultResizeArrow = 1 -c_sDefaultTitle = "Custom Biplot of Bugs and Samples - Metadata Plotted with Centroids" -c_sDefaultOutputFile = NULL - -### Create command line argument parser -pArgs <- OptionParser( usage = "%prog last_metadata input.tsv" ) - -# Selecting features to plot -pArgs <- add_option( pArgs, c("-b", "--bugs"), type="character", action="store", default=NULL, dest="sBugs", metavar="BugsToPlot", help="Comma delimited list of data to plot as text. Bug|1,Bug|2") -pArgs <- add_option( pArgs, c("-m", "--metadata"), type="character", action="store", default=NULL, dest="sMetadata", metavar="MetadataToPlot", help="Comma delimited list of metadata to plot as arrows. metadata1,metadata2,metadata3") - -# Colors -pArgs <- add_option( pArgs, c("-c", "--colorBy"), type="character", action="store", default=c_sDefaultColorBy, dest="sColorBy", metavar="MetadataToColorBy", help="The id of the metadatum to use to make the marker colors. Expected to be a continuous metadata.") -pArgs <- add_option( pArgs, c("-r", "--colorRange"), type="character", action="store", default=c_sDefaultColorRange, dest="sColorRange", metavar="ColorRange", help=paste("Colors used to color the samples; a gradient will be formed between the color.Default=", c_sDefaultColorRange)) -pArgs <- add_option( pArgs, c("-t", "--textColor"), type="character", action="store", default=c_sDefaultTextColor, dest="sTextColor", metavar="TextColor", help=paste("The color bug features will be plotted with as text. 
Default =", c_sDefaultTextColor)) -pArgs <- add_option( pArgs, c("-a", "--arrowColor"), type="character", action="store", default=c_sDefaultArrowColor, dest="sArrowColor", metavar="ArrowColor", help=paste("The color metadata features will be plotted with as an arrow and text. Default", c_sDefaultArrowColor)) -pArgs <- add_option( pArgs, c("-w", "--arrowTextColor"), type="character", action="store", default=c_sDefaultArrowTextColor, dest="sArrowTextColor", metavar="ArrowTextColor", help=paste("The color for the metadata text ploted by the head of the metadata arrow. Default", c_sDefaultArrowTextColor)) -pArgs <- add_option(pArgs, c("-n","--plotNAColor"), type="character", action="store", default=c_sDefaultNAColor, dest="sPlotNAColor", metavar="PlotNAColor", help=paste("Plot NA values as this color. Example -n", c_sDefaultNAColor)) - -# Shapes -pArgs <- add_option( pArgs, c("-y", "--shapeby"), type="character", action="store", default=c_sDefaultShapeBy, dest="sShapeBy", metavar="MetadataToShapeBy", help="The metadata to use to make marker shapes. Expected to be a discrete metadatum. An example would be -y Environment") -pArgs <- add_option( pArgs, c("-s", "--shapes"), type="character", action="store", default=c_sDefaultShapes, dest="sShapes", metavar="ShapesForPlotting", help="This is to be used to specify the shapes to use for plotting. Can use numbers recognized by R as shapes (see pch). Should correspond to the number of levels in shapeBy; the format is level:shape,level:shape for example HighLuminosity:14,LowLuminosity:2,HighPH:10,LowPH:18 . Need to specify -y/--shapeBy for this option to work.") -pArgs <- add_option( pArgs, c("-d", "--defaultMarker"), type="character", action="store", default=c_sDefaultMarker, dest="sDefaultMarker", metavar="DefaultColorMarker", help="Default shape for markers which are not otherwise indicated in --shapes, can be used for unspecified values or NA. 
Must not be a shape in --shapes.") - -# Plot manipulations -pArgs <- add_option( pArgs, c("-e","--rotateByMetadata"), type="character", action="store", default=c_sDefaultRotateByMetadata, dest="sRotateByMetadata", metavar="RotateByMetadata", help="Rotate the ordination by a metadata. Give both the metadata and value to weight it by. The larger the weight, the more the ordination is influenced by the metadata. If the metadata is continuous, use the metadata id; if the metadata is discrete, the ordination will be by one of the levels so use the metadata ID and level seperated by a '_'. Discrete example -e Environment_HighLumninosity,100 ; Continuous example -e Environment,100 .") -pArgs <- add_option( pArgs, c("-z","--resizeArrow"), type="numeric", action="store", default=c_sDefaultResizeArrow, dest="dResizeArrow", metavar="ArrowScaleFactor", help="A constant to multiple the length of the arrow to expand or shorten all arrows together. This will not change the angle of the arrow nor the relative length of arrows to each other.") - -# Misc -pArgs <- add_option( pArgs, c("-i", "--title"), type="character", action="store", default=c_sDefaultTitle, dest="sTitle", metavar="Title", help="This is the title text to add to the plot.") -pArgs <- add_option( pArgs, c("-o", "--outputfile"), type="character", action="store", default=c_sDefaultOutputFile, dest="sOutputFileName", metavar="OutputFile", help="This is the name for the output pdf file. If an output file is not given, an output file name is made based on the input file name.") - -funcDoBiplot <- function( -### Perform biplot. Samples are markers, bugs are text, and metadata are text with arrows. Markers and bugs are dtermined usiing NMDS and Bray-Curtis dissimilarity. Metadata are placed on the ordination in one of two ways: 1. Factor data - for each level take the ordination points for the samples that have that level and plot the metadata text at the average orindation point. 2. 
For continuous data - make a landscape (x and y form ordination of the points) and z (height) as the metadata value. Use a lowess line to get the fitted values for z and take the max of the landscape. Plot the metadata text at that smoothed max. -sBugs, -### Comma delimited list of data to plot as text. Bug|1,Bug|2 -sMetadata, -### Comma delimited list of metadata to plot as arrows. metadata1,metadata2,metadata3. -sColorBy = c_sDefaultColorBy, -### The id of the metadatum to use to make the marker colors. Expected to be a continuous metadata. -sColorRange = c_sDefaultColorRange, -### Colors used to color the samples; a gradient will be formed between the color. Example orange,cyan -sTextColor = c_sDefaultTextColor, -### The color bug features will be plotted with as text. Example black -sArrowColor = c_sDefaultArrowColor, -### The color metadata features will be plotted with as an arrow and text. Example cyan -sArrowTextColor = c_sDefaultArrowTextColor, -### The color for the metadata text ploted by the head of the metadata arrow. Example Blue -sPlotNAColor = c_sDefaultNAColor, -### Plot NA values as this color. Example grey -sShapeBy = c_sDefaultShapeBy, -### The metadata to use to make marker shapes. Expected to be a discrete metadatum. -sShapes = c_sDefaultShapes, -### This is to be used to specify the shapes to use for plotting. Can use numbers recognized by R as shapes (see pch). Should correspond to the number of levels in shapeBy; the format is level:shape,level:shape for example HighLuminosity:14,LowLuminosity:2,HighPH:10,LowPH:18 . Works with sShapesBy. -sDefaultMarker = c_sDefaultMarker, -### The default marker shape to use if shapes are not otherwise indicated. -sRotateByMetadata = c_sDefaultRotateByMetadata, -### Metadata and value to rotate by. example Environment_HighLumninosity,100 -dResizeArrow = c_sDefaultResizeArrow, -### Scale factor to resize tthe metadata arrows -sTitle = c_sDefaultTitle, -### The title for the figure. 
-sInputFileName, -### File to input (tsv file: tab separated, row = sample file) -sLastMetadata, -### Last metadata that seperates data and metadata -sOutputFileName = c_sDefaultOutputFile -### The file name to save the figure. -){ - print("IN Biplot") - # Define the colors - vsColorRange = c("blue","orange") - cDefaultColor = "black" - if(!is.null(sColorRange)) - { - vsColorRange = unlist(strsplit(sColorRange,",")) - } - - # List of bugs to plot - # If there is a list it needs to be more than one. - vsBugsToPlot = c() - if(!is.null(sBugs)) - { - vsBugsToPlot = unlist(strsplit(sBugs,",")) - } - - print("vsBugsToPlot") - print(vsBugsToPlot) - # Metadata to plot - vsMetadata = c() - if(!is.null(sMetadata)) - { - vsMetadata = unlist(strsplit(sMetadata,",")) - } - - print("vsMetadata") - print(vsMetadata) - ### Load table - if(class(sInputFileName)=="character") - { - dfInput = read.table(sInputFileName, sep = "\t", header=TRUE) - names(dfInput) = unlist(lapply(names(dfInput),function(x) gsub(".","|",x,fixed=TRUE))) - row.names(dfInput) = dfInput[,1] - dfInput = dfInput[-1] - } else {dfInput = sInputFileName} - - ### Get positions of all metadata or all data - iLastMetadata = which(names(dfInput)==sLastMetadata) - viMetadata = 1:iLastMetadata - viData = (iLastMetadata+1):ncol(dfInput) - - ### Dummy the metadata if discontinuous - ### Leave the continous metadata alone but include - listMetadata = list() - vsRowNames = c() - viContinuousMetadata = c() - for(i in viMetadata) - { - print( names( dfInput )[i] ) - vCurMetadata = unlist(dfInput[i]) - if( ( is.numeric(vCurMetadata)||is.integer(vCurMetadata) ) && ( length( unique( vCurMetadata ) ) >= c_iNonFactorLevelThreshold ) ) - { - vCurMetadata[which(is.na(vCurMetadata))] = mean(vCurMetadata,na.rm=TRUE) - listMetadata[[length(listMetadata)+1]] = vCurMetadata - vsRowNames = c(vsRowNames,names(dfInput)[i]) - viContinuousMetadata = c(viContinuousMetadata,length(listMetadata)) - } else { - vCurMetadata = 
as.factor(vCurMetadata) - vsLevels = levels(vCurMetadata) - for(sLevel in vsLevels) - { - vNewMetadata = rep(0,length(vCurMetadata)) - vNewMetadata[which(vCurMetadata == sLevel)] = 1 - listMetadata[[length(listMetadata)+1]] = vNewMetadata - vsRowNames = c(vsRowNames,paste(names(dfInput)[i],sLevel,sep="_")) - } - } - } - - # Convert to data frame - dfDummyMetadata = as.data.frame(sapply(listMetadata,rbind)) - names(dfDummyMetadata) = vsRowNames - iNumberMetadata = ncol(dfDummyMetadata) - - # Data to use in ordination in NMDS - # All cleaned bug data - dfData = dfInput[viData] - - # If rotating the ordination by a metadata - # 1. Add in the metadata as a bug - # 2. Multiply the bug by the weight - # 3. Push this through the NMDS - if(!is.null(sRotateByMetadata)) - { - vsRotateMetadata = unlist(strsplit(sRotateByMetadata,",")) - sMetadata = vsRotateMetadata[1] - dWeight = as.numeric(vsRotateMetadata[2]) - sOrdinationMetadata = dfDummyMetadata[sMetadata]*dWeight - dfData[sMetadata] = sOrdinationMetadata - } - - # Run NMDS on bug data (Default B-C) - # Will have species and points because working off of raw data - mNMDSData = metaMDS(dfData,k=2) - - ## Make shapes - # Defines the shapes and the metadata they are based on - # Metadata to use as shapes - lShapeInfo = funcMakeShapes(dfInput=dfInput, sShapeBy=sShapeBy, sShapes=sShapes, cDefaultShape=sDefaultMarker) - - sMetadataShape = lShapeInfo[["ID"]] - vsShapeValues = lShapeInfo[["Values"]] - vsShapeShapes = lShapeInfo[["Shapes"]] - vsShapes = lShapeInfo[["PlotShapes"]] - cDefaultShape = lShapeInfo[["DefaultShape"]] - - # Colors - vsColors = rep(cDefaultColor,nrow(dfInput)) - vsColorValues = c() - vsColorRBG = c() - if(!is.null(sColorBy)) - { - vsColorValues = paste(sort(unique(unlist(dfInput[[sColorBy]])),na.last=TRUE)) - iLengthColorValues = length(vsColorValues) - - vsColorRBG = lapply(1:iLengthColorValues/iLengthColorValues,colorRamp(vsColorRange)) - vsColorRBG = unlist(lapply(vsColorRBG, function(x) 
rgb(x[1]/255,x[2]/255,x[3]/255))) - - for(iColor in 1:length(vsColorRBG)) - { - vsColors[which(paste(dfInput[[sColorBy]])==vsColorValues[iColor])]=vsColorRBG[iColor] - } - - #If NAs are seperately given color, then color here - if(!is.null(sPlotNAColor)) - { - vsColors[which(is.na(dfInput[[sColorBy]]))] = sPlotNAColor - vsColorRBG[which(vsColorValues=="NA")] = sPlotNAColor - } - } - - print("names(dfDummyMetadata)") - print(names(dfDummyMetadata)) - - # Reduce the bugs down to the ones in the list to be plotted - viBugsToPlot = which(row.names(mNMDSData$species) %in% vsBugsToPlot) - viMetadataDummy = which(names(dfDummyMetadata) %in% vsMetadata) - - print("viBugsToPlot") - print(viBugsToPlot) - print("viMetadataDummy") - print(names(dfDummyMetadata)[viMetadataDummy]) - - # Build the matrix of metadata coordinates - mMetadataCoordinates = matrix(rep(NA, iNumberMetadata*2),nrow=iNumberMetadata) - for( i in 1:iNumberMetadata ) - { - lxReturn = NA - if( i %in% viContinuousMetadata ) - { - lxReturn = funcGetMaximumForMetadatum(dfDummyMetadata[[i]],mNMDSData$points) - } else { - lxReturn = funcGetCentroidForMetadatum(dfDummyMetadata[[i]],mNMDSData$points) - } - mMetadataCoordinates[i,] = c(lxReturn$x,lxReturn$y) - } - row.names(mMetadataCoordinates) = vsRowNames - - # Plot the biplot with the centroid constructed metadata coordinates - if(length(viMetadataDummy)==0) - { - viMetadataDummy = 1:nrow(mMetadataCoordinates) - } - - # Plot samples - # Make output name - if(is.null(sOutputFileName)) - { - viPeriods = which(sInputFileName==".") - if(length(viPeriods)>0) - { - sOutputFileName = paste(OutputFileName[1:viPeriods[length(viPeriods)]],"pdf",sep=".") - } else { - sOutputFileName = paste(sInputFileName,"pdf",sep=".") - } - } - - pdf(sOutputFileName, useDingbats=FALSE) - plot(mNMDSData$points, xlab=paste("NMDS1","Stress=",mNMDSData$stress), ylab="NMDS2", pch=vsShapes, col=vsColors) - title(sTitle,line=3) - - # Plot Bugs - mPlotBugs = mNMDSData$species[viBugsToPlot,] - 
if(length(viBugsToPlot)==1) - { - text(x=mPlotBugs[1],y=mPlotBugs[2],labels=row.names(mNMDSData$species)[viBugsToPlot],col=sTextColor) - } else if(length(viBugsToPlot)>1){ - text(x=mPlotBugs[,1],y=mPlotBugs[,2],labels=row.names(mNMDSData$species)[viBugsToPlot],col=sTextColor) - } - - # Add alternative axes - axis(3, col=sArrowColor) - axis(4, col=sArrowColor) - box(col = "black") - - # Plot Metadata - if(length(viMetadataDummy)>0) - { - for(i in viMetadataDummy) - { - curCoordinates = mMetadataCoordinates[i,] - curCoordinates = curCoordinates * dResizeArrow - # Plot Arrow - arrows(0,0, curCoordinates[1] * 0.8, curCoordinates[2] * 0.8, col=sArrowColor, length=0.1 ) - } - # Plot text - if(length(viMetadataDummy)==1) - { - text(x=mMetadataCoordinates[viMetadataDummy,][1]*dResizeArrow*0.8, y=mMetadataCoordinates[viMetadataDummy,][2]*dResizeArrow*0.8, labels=row.names(mMetadataCoordinates)[viMetadataDummy],col=sArrowTextColor) - } else { - text(x=mMetadataCoordinates[viMetadataDummy,1]*dResizeArrow*0.8, y=mMetadataCoordinates[viMetadataDummy,2]*dResizeArrow*0.8, labels=row.names(mMetadataCoordinates)[viMetadataDummy],col=sArrowTextColor) - } - } - - # Create Legend - # The text default is the colorMetadata_level (one per level) plus the ShapeMetadata_level (one per level) - # The color default is already determined colors plus grey for shapes. - sLegendText = c(paste(vsColorValues,sColorBy,sep="_"),paste(sMetadataShape,vsShapeValues,sep="_")) - sLegendColors = c(vsColorRBG,rep(cDefaultColor,length(vsShapeValues))) - - # If the color values are numeric - # Too many values may be given in the legend (given they may be a continuous range of values) - # To reduce this they are summarized instead, given the colors and values for the extreme ends. 
- if( !sum( is.na( as.numeric( vsColorValues[ which( !is.na( vsColorValues ) ) ] ) ) ) ) - { - vdNumericColors = as.numeric( vsColorValues ) - vdNumericColors = vdNumericColors[ which( !is.na( vdNumericColors ) ) ] - vdSortedNumericColors = sort( vdNumericColors ) - sLegendText = c( paste( sColorBy, vdSortedNumericColors[ 1 ], sep="_" ), - paste( sColorBy, vdSortedNumericColors[ length(vdSortedNumericColors) ], sep="_" ), - paste( sMetadataShape, vsShapeValues, sep="_" ) ) - sLegendColors = c(vsColorRBG[ which( vdNumericColors == vdSortedNumericColors[ 1 ] )[ 1 ] ], - vsColorRBG[ which( vdNumericColors == vdSortedNumericColors[ length( vdSortedNumericColors ) ] )[ 1 ] ], - rep(cDefaultColor,length(vsShapeValues))) - } - sLegendShapes = c( rep( cDefaultShape, length( sLegendText ) - length( vsShapeShapes ) ), vsShapeShapes ) - - # If any legend text was constructed then make the legend. - if( length( sLegendText ) >0 ) - { - legend( "topright", legend = sLegendText, pch = sLegendShapes, col = sLegendColors ) - } - - # Original biplot call if you want to check the custom plotting of the script - # There will be one difference where the biplot call scales an axis, this one does not. In relation to the axes, the points, text and arrows should still match. - # Axes to the top and right are for the arrow, others are for markers and bug names. - #biplot(mNMDSData$points,mMetadataCoordinates[viMetadataDummy,],xlabs=vsShapes,xlab=paste("MDS1","Stress=",mNMDSData$stress),main="Biplot function Bugs and Sampes - Metadata Plotted with Centroids") - dev.off() -} - -# This is the equivalent of __name__ == "__main__" in Python. -# That is, if it's true we're being called as a command line script; -# if it's false, we're being sourced or otherwise included, such as for -# library or inlinedocs. 
-if( identical( environment( ), globalenv( ) ) && - !length( grep( "^source\\(", sys.calls( ) ) ) ) -{ - lsArgs <- parse_args( pArgs, positional_arguments=TRUE ) - - funcDoBiplot( - sBugs = lsArgs$options$sBugs, - sMetadata = lsArgs$options$sMetadata, - sColorBy = lsArgs$options$sColorBy, - sColorRange = lsArgs$options$sColorRange, - sTextColor = lsArgs$options$sTextColor, - sArrowColor = lsArgs$options$sArrowColor, - sArrowTextColor = lsArgs$options$sArrowTextColor, - sPlotNAColor = lsArgs$options$sPlotNAColor, - sShapeBy = lsArgs$options$sShapeBy, - sShapes = lsArgs$options$sShapes, - sDefaultMarker = lsArgs$options$sDefaultMarker, - sRotateByMetadata = lsArgs$options$sRotateByMetadata, - dResizeArrow = lsArgs$options$dResizeArrow, - sTitle = lsArgs$options$sTitle, - sInputFileName = lsArgs$args[2], - sLastMetadata = lsArgs$args[1], - sOutputFileName = lsArgs$options$sOutputFileName) -} diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/merge_metadata.py --- a/maaslin-4450aa4ecc84/src/merge_metadata.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,454 +0,0 @@ -#!/usr/bin/env python -##################################################################################### -#Copyright (C) <2012> -# -#Permission is hereby granted, free of charge, to any person obtaining a copy of -#this software and associated documentation files (the "Software"), to deal in the -#Software without restriction, including without limitation the rights to use, copy, -#modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, -#and to permit persons to whom the Software is furnished to do so, subject to -#the following conditions: -# -#The above copyright notice and this permission notice shall be included in all copies -#or substantial portions of the Software. 
-# -#THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, -#INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A -#PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT -#HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION -#OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE -#SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# This file is a component of the MaAsLin (Multivariate Associations Using Linear Models), -# authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Timothy Tickle, ttickle@hsph.harvard.edu). -##################################################################################### -""" -Examples -~~~~~~~~ - -``metadata.txt``:: - - - Y Z - a 1 x - b 0 y - c z - -``data.pcl``:: - - - a b c - A|B 1 2 3 - A|C 4 5 6 - D|E 7 8 9 - -``Examples``:: - - $ merge_metadata.py metadata.txt < data.pcl - sample a b c - Y 1 0 - Z x y z - A 0.416667 0.466667 0.5 - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - $ merge_metadata.py metadata.txt -t 0 < data.pcl - sample a b c - Y 1 0 - Z x y z - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - $ merge_metadata.py metadata.txt -t 1 < data.pcl - sample a b c - Y 1 0 - Z x y z - A 0.416667 0.466667 0.5 - D 0.583333 0.533333 0.5 - - $ merge_metadata.py metadata.txt -t 0 -n < data.pcl - sample a b c - Y 1 0 - Z x y z - A|B 1 2 3 - A|C 4 5 6 - D|E 7 8 9 - - $ merge_metadata.py metadata.txt -t 0 -m 0.8 -s "-" < data.pcl - sample b c - Y 0 - - Z y z - A|B 0.133333 0.166667 - A|C 0.333333 0.333333 - D|E 0.533333 0.5 - - $ merge_metadata.py -t 0 < data.pcl - sample a b c - A|B 1 2 3 - A|C 4 5 6 - D|E 7 8 9 - -.. 
testsetup:: - - from merge_metadata import * -""" - -import argparse -import blist -import csv -import re -import sys - -c_dTarget = 1.0 -c_fRound = False - -class CClade: - - def __init__( self ): - - self.m_hashChildren = {} - self.m_adValues = None - - def get( self, astrClade ): - - return self.m_hashChildren.setdefault( - astrClade[0], CClade( ) ).get( astrClade[1:] ) if astrClade else self - - def set( self, adValues ): - - self.m_adValues = blist.blist( [0] ) * len( adValues ) - for i, d in enumerate( adValues ): - if d: - self.m_adValues[i] = d - - def impute( self ): - - if not self.m_adValues: - for pChild in self.m_hashChildren.values( ): - adChild = pChild.impute( ) - if self.m_adValues: - for i in range( len( adChild or [] ) ): - if adChild[i]: - self.m_adValues[i] += adChild[i] - elif adChild: - self.m_adValues = adChild[:] - - return self.m_adValues - - def _freeze( self, hashValues, iTarget, astrClade, iDepth, fLeaves ): - - fHit = ( not iTarget ) or ( ( fLeaves and ( iDepth == iTarget ) ) or ( ( not fLeaves ) and ( iDepth <= iTarget ) ) ) - iDepth += 1 - setiRet = set() - if self.m_hashChildren: - for strChild, pChild in self.m_hashChildren.items( ): - setiRet |= pChild._freeze( hashValues, iTarget, astrClade + [strChild], iDepth, fLeaves ) - setiRet = set( ( i + 1 ) for i in setiRet ) - else: - setiRet.add( 0 ) - if iTarget < 0: - if fLeaves: - fHit = -( iTarget + 1 ) in setiRet - else: - fHit = -( iTarget + 1 ) <= max( setiRet ) - if astrClade and self.m_adValues and fHit: - hashValues["|".join( astrClade )] = self.m_adValues - return setiRet - - def freeze( self, hashValues, iTarget, fLeaves ): - - self._freeze( hashValues, iTarget, [], 0, fLeaves ) - - def _repr( self, strClade ): - - strRet = "<" - if strClade: - strRet += "%s %s" % (strClade, self.m_adValues) - if self.m_hashChildren: - strRet += " " - if self.m_hashChildren: - strRet += " ".join( p._repr( s ) for (s, p) in self.m_hashChildren.items( ) ) - - return ( strRet + ">" ) - - def 
__repr__( self ): - - return self._repr( "" ) - -""" -pTree = CClade( ) -pTree.get( ("A", "B") ).set( [1, 2, 3] ) -pTree.get( ("A", "C") ).set( [4, 5, 6] ) -pTree.get( ("D", "E") ).set( [7, 8, 9] ) -iTaxa = 0 -if iTaxa: - pTree.impute( ) -hashFeatures = {} -pTree.freeze( hashFeatures, iTaxa ) -print( pTree ) -print( hashFeatures ) -sys.exit( 0 ) -#""" - -def merge_metadata( aastrMetadata, aastrData, ostm, fNormalize, strMissing, astrExclude, dMin, iTaxa, fLeaves ): - """ - Joins and outputs a data matrix with a metadata matrix, optionally normalizing and filtering it. - A pipe-delimited taxonomy hierarchy can also be dynamically added or removed. - - :param aastrMetadata: Split lines from which metadata are read. - :type aastrMetadata: collection of string collections - :param aastrData: Split lines from which data are read. - :type aastrData: collection of string collections - :param ostm: Output stream to which joined rows are written. - :type ostm: output stream - :param fNormalize: If true, divide data values by column sums. - :type fNormalize: bool - :param strMissing: Representation for missing metadata values. - :type strMissing: str - :param astrExclude: Lines from which excluded IDs are read. - :type astrExclude: collection of strings - :param dMin: Minimum fraction of maximum value for per-column quality control. - :type dMin: bool - :param iTaxa: Depth of taxonomy to be computed, -1 = leaves only, 0 = no change - :type iTaxa: int - :param fLeaves: Output only leaves, not complete taxonomy; ignored if taxa = 0 - :type fLeaves: bool - - Metadata are optional; if not provided, data will be optionally normalized or its taxonomy - modified as requested. Metadata are provided one row per sample, data one column per - sample, both files tab-delimited text with one header row and one header column. - - Metadata IDs that do not match data IDs are discarded, and data IDs without corresponding - metadata IDs are given missing values. 
Missing data values are always treated (and output) - as zero. - - Per-column quality control is performed if the requested minimum fraction is greater than - zero. Specifically, for each column i, the row j containing the maximum value d is - identified. If d is less than the minimum fraction of row j's maximum value over all columns, - the entire column i is removed. - - A taxonomy hierarchy will be calculated by default if row IDs are pipe-delimited, i.e. of - the form A|B|C. All parent clades are computed by default, e.g. A|B and A, save when - they would be identical to a more specific child clade. Negative values are counted from the - bottom (right) of the hierarchy rather than the top. The special value of 0 deactivates - hierarchy calculation. - - >>> aastrMetadata = [[t.strip( ) for t in s] for s in ("-YZ", "a1x", "b0y", "c z")] - >>> aastrData = [s.split( ) for s in ( \ - "- a b c", \ - "A|B 1 2 3", \ - "A|C 4 5 6", \ - "D|E 7 8 9")] - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "", [], 0.01, -1, False ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A 0.416667 0.466667 0.5 - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "", [], 0.01, -1, True ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "", [], 0, 0, True ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "", [], 0, 1, False ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A 0.416667 0.466667 0.5 - D 0.583333 0.533333 0.5 - - >>> merge_metadata( aastrMetadata, aastrData, 
sys.stdout, True, "", [], 0, -1, True ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A|B 0.0833333 0.133333 0.166667 - A|C 0.333333 0.333333 0.333333 - D|E 0.583333 0.533333 0.5 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, False, "", [], 0, 0, True ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - Y 1 0 - Z x y z - A|B 1 2 3 - A|C 4 5 6 - D|E 7 8 9 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "-", [], 0.8, 0, True ) #doctest: +NORMALIZE_WHITESPACE - sample b c - Y 0 - - Z y z - A|B 0.133333 0.166667 - A|C 0.333333 0.333333 - D|E 0.533333 0.5 - - >>> merge_metadata( None, aastrData, sys.stdout, False, "", [], 0, 0, True ) #doctest: +NORMALIZE_WHITESPACE - sample a b c - A|B 1 2 3 - A|C 4 5 6 - D|E 7 8 9 - - >>> merge_metadata( aastrMetadata, aastrData, sys.stdout, True, "", ["b"], 0.01, -1, False ) #doctest: +NORMALIZE_WHITESPACE - sample a c - Y 1 - Z x z - A 0.416667 0.5 - A|B 0.0833333 0.166667 - A|C 0.333333 0.333333 - D|E 0.583333 0.5 - """ - - #Put metadata in a dictionary - #{"First line element",["line element 2","line element 3","line element 4"]} - #If there is no metadata then - astrMetadata = None - hashMetadata = {} - for astrLine in ( aastrMetadata or [] ): - if astrMetadata: - hashMetadata[astrLine[0]] = astrLine[1:] - else: - astrMetadata = astrLine[1:] - - astrHeaders = adSeqs = iCol = None - pTree = CClade( ) - aastrRaw = [] - for astrLine in aastrData: - if astrHeaders: - if ( astrLine[0] == "EWEIGHT" ) or ( astrLine[0] == "total" ) or \ - ( len( astrLine ) < 2 ): - continue - try: - adCounts = [( float(strCur) if len( strCur.strip( ) ) else 0 ) for - strCur in astrLine[iCol:]] - except ValueError: - aastrRaw.append( astrLine ) - continue - for i in range( len( adCounts ) ): - adSeqs[i] += adCounts[i] - if ( iCol > 1 ) and ( astrLine[0] != astrLine[1] ): - if astrLine[1].find( astrLine[0] ) >= 0: - astrLine[0] = astrLine[1] - else: - astrLine[0] += " " + astrLine[1] - pTree.get( 
astrLine[0].split( "|" ) ).set( adCounts ) - else: - iCol = 2 if ( astrLine[1].upper( ) == "NAME" ) else 1 - astrHeaders = [strCur.replace( " ", "_" ) for strCur in astrLine[iCol:]] - adSeqs = [0] * len( astrHeaders ) - - if iTaxa: - pTree.impute( ) - hashFeatures = {} - pTree.freeze( hashFeatures, iTaxa, fLeaves ) - setstrFeatures = hashFeatures.keys( ) - - afOmit = [False] * len( astrHeaders ) - if dMin > 0: - aadData = list(hashFeatures.values( )) - for i in range( len( astrHeaders ) ): - iMax = max( range( len( aadData ) ), key = lambda j: aadData[j][i] ) - dMaxUs = aadData[iMax][i] - dMaxThem = max( aadData[iMax][j] for j in ( range( i ) + range( i + 1, len( astrHeaders ) ) ) ) - if dMaxUs < ( dMin * dMaxThem ): - sys.stderr.write( "Omitting: %s\n" % astrHeaders[i] ) - afOmit[i] = True - - if astrExclude: - setstrExclude = set(s.strip( ) for s in astrExclude) - for i in range( len( astrHeaders ) ): - if ( not afOmit[i] ) and ( astrHeaders[i] in setstrExclude ): - afOmit[i] = True - - adMult = [( ( c_dTarget / d ) if ( fNormalize and ( d > 0 ) ) else 1 ) for d in adSeqs] - for strFeature, adCounts in hashFeatures.items( ): - for i in range( len( adCounts ) ): - if adCounts[i]: - adCounts[i] *= adMult[i] - if c_fRound: - adCounts[i] = round( adCounts[i] ) - for strFeature, adCounts in hashFeatures.items( ): - astrFeature = strFeature.strip( ).split( "|" ) - while len( astrFeature ) > 1: - astrFeature = astrFeature[:-1] - strParent = "|".join( astrFeature ) - adParent = hashFeatures.get( strParent ) - if adParent == adCounts: - del hashFeatures[strParent] - setstrFeatures.remove( strParent ) - - if astrMetadata: - for i in range( len( astrMetadata ) ): - hashFeatures[astrMetadata[i]] = astrCur = [] - for strSubject in astrHeaders: - astrSubject = hashMetadata.get( strSubject ) - if not astrSubject: - strSubject = re.sub( '_.*$', "", strSubject ) - astrSubject = hashMetadata.get( strSubject, [] ) - astrCur.append( astrSubject[i] if ( i < len( astrSubject ) ) else 
"" ) - - astrFeatures = sorted( astrMetadata or [] ) + sorted( setstrFeatures ) - aiHeaders = filter( lambda i: not afOmit[i], range( len( astrHeaders ) ) ) - csvw = csv.writer( sys.stdout, csv.excel_tab ) - csvw.writerow( ["sample"] + [astrHeaders[i] for i in aiHeaders] ) - for iFeature in range( len( astrFeatures ) ): - strFeature = astrFeatures[iFeature] - adFeature = hashFeatures[strFeature] - astrValues = [adFeature[i] for i in aiHeaders] - for i in range( len( astrValues ) ): - strValue = astrValues[i] - if type( strValue ) in (int, float): - astrValues[i] = "%g" % astrValues[i] - elif ( not strValue ) or ( ( type( strValue ) == str ) and - ( len( strValue ) == 0 ) ): - astrValues[i] = strMissing - csvw.writerow( [strFeature] + astrValues ) - - for astrRaw in aastrRaw: - csvw.writerow( [astrRaw[i] for i in aiHeaders] ) - -argp = argparse.ArgumentParser( prog = "merge_metadata.py", - description = "Join a data matrix with a metadata matrix, optionally normalizing and filtering it.\n\n" + - "A pipe-delimited taxonomy hierarchy can also be dynamically added or removed." 
) -argp.add_argument( "-n", dest = "fNormalize", action = "store_false", - help = "Don't normalize data values by column sums" ) -argp.add_argument( "-s", dest = "strMissing", metavar = "missing", - type = str, default = " ", - help = "String representing missing metadata values" ) -argp.add_argument( "-m", dest = "dMin", metavar = "min", - type = float, default = 0.01, - help = "Per-column quality control, minimum fraction of maximum value" ) -argp.add_argument( "-t", dest = "iTaxa", metavar = "taxa", - type = int, default = -1, - help = "Depth of taxonomy to be computed, negative = from right, 0 = no change" ) -argp.add_argument( "-l", dest = "fLeaves", action = "store_true", - help = "Output only leaves, not complete taxonomy" ) -argp.add_argument( "-x", dest = "istmExclude", metavar = "exclude.txt", - type = file, - help = "File from which sample IDs to exclude are read" ) -argp.add_argument( "istmMetadata", metavar = "metadata.txt", - type = file, nargs = "?", - help = "File from which metadata is read" ) -__doc__ = "::\n\n\t" + argp.format_help( ).replace( "\n", "\n\t" ) + __doc__ - -def _main( ): - args = argp.parse_args( ) - merge_metadata( args.istmMetadata and csv.reader( args.istmMetadata, csv.excel_tab ), - csv.reader( sys.stdin, csv.excel_tab ), sys.stdout, args.fNormalize, args.strMissing, - args.istmExclude, args.dMin, args.iTaxa, args.fLeaves ) - -if __name__ == "__main__": - _main( ) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-AnalysisModules/test-AnalysisModules.R --- a/maaslin-4450aa4ecc84/src/test-AnalysisModules/test-AnalysisModules.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,788 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","Utility.R")) - -#Test Utilities -context("Test funcGetLMResults") -context("Test funcGetStepPredictors") - -context("Test funcMakeContrasts") -covX1 = c(44.4, 45.9, 41.9, 53.3, 
44.7, 44.1, 50.7, 45.2, 60.1) -covX2 = c(144.4, 245.9, 141.9, 253.3, 144.7, 244.1, 150.7, 245.2, 160.1) -covX3 = as.factor(c(1,2,3,1,2,3,1,2,3)) -covX4 = as.factor(c(1,1,1,1,2,2,2,2,2)) -covX5 = as.factor(c(1,2,1,2,1,2,1,2,1)) -covY = c(.26, .31, .25, .50, .36, .40, .52, .28, .38) -frmeTmp = data.frame(Covariate1=covX1, Covariate2=covX2, Covariate3=covX3, Covariate4=covX4, Covariate5=covX5, adCur= covY) -iTaxon = 6 -#Add in updating QC errors -#Add in random covariates -strFormula = "adCur ~ Covariate1" -strRandomFormula = NULL -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = covY -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = covX1 -vdCoef = c(Covariate1=0.6) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(covX1) -lsSig[[1]]$allCoefs = vdCoef -ret1 = funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, - functionContrast=function(x,adCur,dfData) - { - retList = list() - ret = cor.test(as.formula(paste("~",x,"+ adCur")), data=dfData, method="spearman", na.action=c_strNA_Action) - #Returning rho for the coef in a named vector - vdCoef = c() - vdCoef[[x]]=ret$estimate - retList[[1]]=list(p.value=ret$p.value,SD=sd(dfData[[x]]),name=x,coef=vdCoef) - return(retList) - }, lsQCCounts=list()) -ret1$adP = round(ret1$adP,5) -test_that("1. 
Test that the funcMakeContrasts works on a continuous variable.",{ - expect_equal(ret1,list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list()))}) - -strFormula = "adCur ~ Covariate1 + Covariate2" -strRandomFormula = NULL -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = covY -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = covX1 -vdCoef = c(Covariate1=0.6) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(covX1) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate2" -lsSig[[2]]$orig = "Covariate2" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = covY -lsSig[[2]]$factors = "Covariate2" -lsSig[[2]]$metadata = covX2 -vdCoef = c(Covariate2=0.46666667) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(covX2) -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, - functionContrast=function(x,adCur,dfData) - { - retList = list() - ret = cor.test(as.formula(paste("~",x,"+ adCur")), data=dfData, method="spearman", na.action=c_strNA_Action) - #Returning rho for the coef in a named vector - vdCoef = c() - vdCoef[[x]]=ret$estimate - retList[[1]]=list(p.value=ret$p.value,SD=sd(dfData[[x]]),name=x,coef=vdCoef) - return(retList) - }, lsQCCounts=list()) -ret1$adP = round(ret1$adP,5) -test_that("Test that the funcMakeContrasts works on 2 continuous variables.",{ - expect_equal(ret1,list(adP=round(c(0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list()))}) - -strFormula = "adCur ~ Covariate4" -strRandomFormula = NULL -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate42" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = covY -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = covX4 #update -vdCoef = c(Covariate42=NA) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(covX4) #update -lsSig[[1]]$allCoefs = vdCoef 
-# Get return -rets = funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, - functionContrast=function(x,adCur,dfData) - { - retList = list() - lmodKW = kruskal(adCur,dfData[[x]],group=FALSE,p.adj="holm") - asLevels = levels(dfData[[x]]) - # The names of the generated comparisons, sometimes the control is first sometimes it is not so - # We will just check which is in the names and use that - asComparisons = row.names(lmodKW$comparisons) - #Get the comparison with the control - for(sLevel in asLevels[2:length(asLevels)]) - { - sComparison = intersect(c(paste(asLevels[1],sLevel,sep=" - "),paste(sLevel,asLevels[1],sep=" - ")),asComparisons) - #Returning NA for the coef in a named vector - vdCoef = c() - vdCoef[[paste(x,sLevel,sep="")]]=NA - retList[[length(retList)+1]]=list(p.value=lmodKW$comparisons[sComparison,"p.value"],SD=NA,name=paste(x,sLevel,sep=""),coef=vdCoef) - } - return(retList) - }, lsQCCounts=list()) -rets$adP=round(rets$adP,digits=5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 2 levels.",{ - expect_equal(rets,list(adP=round(c(0.24434),5),lsSig=lsSig,lsQCCounts=list()))}) - -strFormula = "adCur ~ Covariate3" -strRandomFormula = NULL -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate3" -lsSig[[1]]$orig = "Covariate32" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = covY -lsSig[[1]]$factors = "Covariate3" -lsSig[[1]]$metadata = covX3 #update -vdCoef = c(Covariate32=NA) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(covX3) #update -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate3" -lsSig[[2]]$orig = "Covariate33" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = covY -lsSig[[2]]$factors = "Covariate3" -lsSig[[2]]$metadata = covX3 #update -vdCoef = c(Covariate33=NA) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(covX3) #update -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcMakeContrasts(strFormula=strFormula, 
strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, - functionContrast=function(x,adCur,dfData) - { - retList = list() - lmodKW = kruskal(adCur,dfData[[x]],group=FALSE,p.adj="holm") - asLevels = levels(dfData[[x]]) - # The names of the generated comparisons, sometimes the control is first sometimes it is not so - # We will just check which is in the names and use that - asComparisons = row.names(lmodKW$comparisons) - #Get the comparison with the control - for(sLevel in asLevels[2:length(asLevels)]) - { - sComparison = intersect(c(paste(asLevels[1],sLevel,sep=" - "),paste(sLevel,asLevels[1],sep=" - ")),asComparisons) - #Returning NA for the coef in a named vector - vdCoef = c() - vdCoef[[paste(x,sLevel,sep="")]]=NA - retList[[length(retList)+1]]=list(p.value=lmodKW$comparisons[sComparison,"p.value"],SD=NA,name=paste(x,sLevel,sep=""),coef=vdCoef) - } - return(retList) - }, lsQCCounts=list()) -ret1$adP = round(ret1$adP,5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 3 levels.",{ - expect_equal(ret1,list(adP=c(1.0,1.0),lsSig=lsSig,lsQCCounts=list()))}) - -strFormula = "adCur ~ Covariate4 + Covariate5" -strRandomFormula = NULL -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate42" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = covY -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = covX4 #update -vdCoef = c(Covariate42=NA) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(covX4) #update -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate5" -lsSig[[2]]$orig = "Covariate52" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = covY -lsSig[[2]]$factors = "Covariate5" -lsSig[[2]]$metadata = covX5 #update -vdCoef = c(Covariate52=NA) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(covX5) #update -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcMakeContrasts(strFormula=strFormula, strRandomFormula=strRandomFormula, frmeTmp=frmeTmp, iTaxon=iTaxon, - 
functionContrast=function(x,adCur,dfData) - { - retList = list() - lmodKW = kruskal(adCur,dfData[[x]],group=FALSE,p.adj="holm") - asLevels = levels(dfData[[x]]) - # The names of the generated comparisons, sometimes the control is first sometimes it is not so - # We will just check which is in the names and use that - asComparisons = row.names(lmodKW$comparisons) - #Get the comparison with the control - for(sLevel in asLevels[2:length(asLevels)]) - { - sComparison = intersect(c(paste(asLevels[1],sLevel,sep=" - "),paste(sLevel,asLevels[1],sep=" - ")),asComparisons) - #Returning NA for the coef in a named vector - vdCoef = c() - vdCoef[[paste(x,sLevel,sep="")]]=NA - retList[[length(retList)+1]]=list(p.value=lmodKW$comparisons[sComparison,"p.value"],SD=NA,name=paste(x,sLevel,sep=""),coef=vdCoef) - } - return(retList) - }, lsQCCounts=list()) -ret1$adP = round(ret1$adP,5) -test_that("1. Test that the funcMakeContrasts works on 2 factor covariate with 2 levels.",{ - expect_equal(ret1,list(adP=round(c(0.24434,0.655852),5),lsSig=lsSig,lsQCCounts=list()))}) - - -#Test Model selection - - -context("Test funcBoostModel") -context("Test funcForwardModel") -context("Test funcBackwardsModel") - - -#Test Univariates -context("Test funcSpearman") -strFormula = "adCur ~ Covariate1" -adCur = c(.26, .31, .25, .50, .36, .40, .52, .28, .38) -x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -frmeTmp = data.frame(Covariate1=x, adCur=adCur) -iTaxon = 2 -lsQCCounts = list() -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = x -vdCoef = c(Covariate1=0.6) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(x) -lsSig[[1]]$allCoefs = vdCoef -ret1 = funcSpearman(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsQCCounts=lsQCCounts,strRandomFormula=NULL) -ret1$adP = round(ret1$adP,5) -test_that("Test that the spearman test has 
the correct results for 1 covariate.",{ - expect_equal(ret1,list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list())) -}) - -strFormula = "adCur ~ Covariate1 + Covariate2" -frmeTmp = data.frame(Covariate1=x, Covariate2=x, adCur=adCur) -iTaxon = 3 -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = x -vdCoef = c(Covariate1=0.6) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(x) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate2" -lsSig[[2]]$orig = "Covariate2" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate2" -lsSig[[2]]$metadata = x -vdCoef = c(Covariate2=0.6) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(x) -lsSig[[2]]$allCoefs = vdCoef -lsQCCounts = list() -ret1 = funcSpearman(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsQCCounts=lsQCCounts,strRandomFormula=NULL) -ret1$adP = round(ret1$adP,5) -test_that("Test that the spearman test has the correct results for 2 covariates.",{ - expect_equal(ret1,list(adP=round(c(0.09679784,0.09679784),5),lsSig=lsSig,lsQCCounts=list())) -}) - - -context("Test funcWilcoxon") -strFormula = "adCur ~ Covariate1" -x = c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE) -frmeTmp = data.frame(Covariate1=x, adCur=adCur) -iTaxon = 2 -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = x -vdCoef = c(Covariate1=13) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(x) -lsSig[[1]]$allCoefs = vdCoef -lsQCCounts = list() -ret1 = funcWilcoxon(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsQCCounts=lsQCCounts,strRandomFormula=NULL) -ret1$adP = round(ret1$adP,5) -test_that("Test that the wilcoxon test has the correct results for 1 covariate.",{ 
- expect_equal(ret1,list(adP=round(c(0.55555556),5),lsSig=lsSig,lsQCCounts=list())) -}) - - -context("Test funcKruskalWallis") -strFormula = "adCur ~ Covariate1" -x = as.factor(c("one","two","three","one","one","three","two","three","two")) -frmeTmp = data.frame(Covariate1=x, adCur=adCur) -iTaxon = 2 -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig = "Covariate1three" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = x -vdCoef = c(Covariate1three=NA) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(x) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate1" -lsSig[[2]]$orig = "Covariate1two" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate1" -lsSig[[2]]$metadata = x -vdCoef = c(Covariate1two=NA) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(x) -lsSig[[2]]$allCoefs = vdCoef -lsQCCounts = list() -ret1 = funcKruskalWallis(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsQCCounts=lsQCCounts,strRandomFormula=NULL) -ret1$adP = round(ret1$adP,5) -test_that("Test that the Kruskal Wallis (Nonparameteric anova) has the correct results for 1 covariate.",{ - expect_equal(ret1,list(adP=c(1.0,1.0),lsSig=lsSig,lsQCCounts=list())) -}) - - -context("test funcDoUnivariate") -covX1 = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -covX2 = c(144.4, 245.9, 141.9, 253.3, 144.7, 244.1, 150.7, 245.2, 160.1) -covX3 = as.factor(c(1,2,3,1,2,3,1,2,3)) -covX4 = as.factor(c(1,1,1,1,2,2,2,2,2)) -covX5 = as.factor(c(1,2,1,2,1,2,1,2,1)) -covX6 = as.factor(c("one","two","three","one","one","three","two","three","two")) -covY = c(.26, .31, .25, .50, .36, .40, .52, .28, .38) -frmeTmp = data.frame(Covariate1=covX1, Covariate2=covX2, Covariate3=covX3, Covariate4=covX4, Covariate5=covX5, Covariate6=covX6, adCur= covY) -iTaxon = 7 -# 1 cont answer -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate1" -lsSig[[1]]$orig 
= "Covariate1" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate1" -lsSig[[1]]$metadata = frmeTmp[["Covariate1"]] -vdCoef = c(Covariate1=0.6) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate1"]]) -lsSig[[1]]$allCoefs = vdCoef -lsHistory = list(adP=c(), lsSig=c(),lsQCCounts=list()) -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate1",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate1") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -print("ret1") -print(ret1) -print("list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list())") -print(list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list())) -test_that("2. Test that the funcMakeContrasts works on a continuous variable.",{ - expect_equal(ret1,list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.09679784),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate2" -lsSig[[2]]$orig = "Covariate2" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate2" -lsSig[[2]]$metadata = frmeTmp[["Covariate2"]] -vdCoef = c(Covariate2=0.46666667) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate2"]]) -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate1 + Covariate2",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory,strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate1 + 1|Covariate2") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -test_that("Test that the funcMakeContrasts works on 2 continuous variables.",{ - expect_equal(ret1,list(adP=round(c(0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - 
expect_equal(ret2,list(adP=round(c(0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate4" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = frmeTmp[["Covariate4"]] -vdCoef = c(Covariate4=5) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate4"]]) -lsSig[[1]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate4",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory,strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate4") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 2 levels.",{ - expect_equal(ret1,list(adP=round(c(0.2857143),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.2857143),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate5" -lsSig[[2]]$orig = "Covariate5" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate5" -lsSig[[2]]$metadata = frmeTmp[["Covariate5"]] -vdCoef = c(Covariate5=8) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate5"]]) -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate4 + Covariate5",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate4 + 1|Covariate5") -ret3 = funcDoUnivariate(strFormula="adCur ~ Covariate4",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate5") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -ret3$adP = round(ret3$adP,5) -test_that("2. 
Test that the funcMakeContrasts works on 2 factor covariate with 2 levels.",{ - expect_equal(ret1,list(adP=round(c(0.2857143,0.73016),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.2857143,0.73016),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret3,list(adP=round(c(0.2857143,0.73016),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate4" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = frmeTmp[["Covariate4"]] -vdCoef = c(Covariate4=5) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate4"]]) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate1" -lsSig[[2]]$orig = "Covariate1" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate1" -lsSig[[2]]$metadata = frmeTmp[["Covariate1"]] -vdCoef = c(Covariate1=0.6) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate1"]]) -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate4 + Covariate1",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate4 + 1|Covariate1") -ret3 = funcDoUnivariate(strFormula="adCur ~ Covariate4",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate1") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -ret3$adP = round(ret3$adP,5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 2 levels and a continuous variable.",{ - expect_equal(ret1,list(adP=round(c(0.2857143,0.09679784),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.2857143,0.09679784),5),lsSig=lsSig,lsQCCounts=list())) - 
expect_equal(ret3,list(adP=round(c(0.2857143,0.09679784),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate3" -lsSig[[1]]$orig = "Covariate32" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate3" -lsSig[[1]]$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate32=NA) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate3"]]) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate3" -lsSig[[2]]$orig = "Covariate33" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate3" -lsSig[[2]]$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate33=NA) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate3"]]) -lsSig[[2]]$allCoefs = vdCoef -lsSig[[3]] = list() -lsSig[[3]]$name = "Covariate1" -lsSig[[3]]$orig = "Covariate1" -lsSig[[3]]$taxon = "adCur" -lsSig[[3]]$data = adCur -lsSig[[3]]$factors = "Covariate1" -lsSig[[3]]$metadata = frmeTmp[["Covariate1"]] -vdCoef = c(Covariate1=0.6) -lsSig[[3]]$value = vdCoef -lsSig[[3]]$std = sd(frmeTmp[["Covariate1"]]) -lsSig[[3]]$allCoefs = vdCoef -lsSig[[4]] = list() -lsSig[[4]]$name = "Covariate2" -lsSig[[4]]$orig = "Covariate2" -lsSig[[4]]$taxon = "adCur" -lsSig[[4]]$data = adCur -lsSig[[4]]$factors = "Covariate2" -lsSig[[4]]$metadata = frmeTmp[["Covariate2"]] -vdCoef = c(Covariate2=0.46666667) -lsSig[[4]]$value = vdCoef -lsSig[[4]]$std = sd(frmeTmp[["Covariate2"]]) -lsSig[[4]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate3 + Covariate1 + Covariate2",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate3 + 1|Covariate1 + 1|Covariate2") -ret3 = funcDoUnivariate(strFormula="adCur ~ Covariate3 + Covariate1",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 
1|Covariate2") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -ret3$adP = round(ret3$adP,5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 3 levels and 2 continuous variables.",{ - expect_equal(ret1,list(adP=round(c(1.0,1.0,0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(1.0,1.0,0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret3,list(adP=round(c(1.0,1.0,0.09679784,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate4" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = frmeTmp[["Covariate4"]] -vdCoef = c(Covariate4=5) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate4"]]) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate2" -lsSig[[2]]$orig = "Covariate2" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate2" -lsSig[[2]]$metadata = frmeTmp[["Covariate2"]] -vdCoef = c(Covariate2=0.46666667) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate2"]]) -lsSig[[2]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate4 + Covariate2",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate4 + 1|Covariate2") -ret3 = funcDoUnivariate(strFormula= "adCur ~ Covariate4",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate2") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -ret3$adP = round(ret3$adP,5) -test_that("3. 
Test that the funcMakeContrasts works on 2 factor covariate with 2 levels and a continuous variable.",{ - expect_equal(ret1,list(adP=round(c(0.2857143,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.2857143,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret3,list(adP=round(c(0.2857143,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) -}) -lsSig = list() -lsSig[[1]] = list() -lsSig[[1]]$name = "Covariate4" -lsSig[[1]]$orig = "Covariate4" -lsSig[[1]]$taxon = "adCur" -lsSig[[1]]$data = adCur -lsSig[[1]]$factors = "Covariate4" -lsSig[[1]]$metadata = frmeTmp[["Covariate4"]] -vdCoef = c(Covariate4=5) -lsSig[[1]]$value = vdCoef -lsSig[[1]]$std = sd(frmeTmp[["Covariate4"]]) -lsSig[[1]]$allCoefs = vdCoef -lsSig[[2]] = list() -lsSig[[2]]$name = "Covariate3" -lsSig[[2]]$orig = "Covariate32" -lsSig[[2]]$taxon = "adCur" -lsSig[[2]]$data = adCur -lsSig[[2]]$factors = "Covariate3" -lsSig[[2]]$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate32=NA) -lsSig[[2]]$value = vdCoef -lsSig[[2]]$std = sd(frmeTmp[["Covariate3"]]) -lsSig[[2]]$allCoefs = vdCoef -lsSig[[3]] = list() -lsSig[[3]]$name = "Covariate3" -lsSig[[3]]$orig = "Covariate33" -lsSig[[3]]$taxon = "adCur" -lsSig[[3]]$data = adCur -lsSig[[3]]$factors = "Covariate3" -lsSig[[3]]$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate33=NA) -lsSig[[3]]$value = vdCoef -lsSig[[3]]$std = sd(frmeTmp[["Covariate3"]]) -lsSig[[3]]$allCoefs = vdCoef -lsSig[[4]] = list() -lsSig[[4]]$name = "Covariate2" -lsSig[[4]]$orig = "Covariate2" -lsSig[[4]]$taxon = "adCur" -lsSig[[4]]$data = adCur -lsSig[[4]]$factors = "Covariate2" -lsSig[[4]]$metadata = frmeTmp[["Covariate2"]] -vdCoef = c(Covariate2=0.46666667) -lsSig[[4]]$value = vdCoef -lsSig[[4]]$std = sd(frmeTmp[["Covariate2"]]) -lsSig[[4]]$allCoefs = vdCoef -ret1 = funcDoUnivariate(strFormula="adCur ~ Covariate4 + Covariate3 + Covariate2",frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula=NULL) -ret2 = 
funcDoUnivariate(strFormula=NULL,frmeTmp=frmeTmp,iTaxon=iTaxon, lsHistory=lsHistory, strRandomFormula="adCur ~ 1|Covariate4 +1|Covariate3 + 1|Covariate2") -ret1$adP = round(ret1$adP,5) -ret2$adP = round(ret2$adP,5) -test_that("Test that the funcMakeContrasts works on 1 factor covariate with 2 levels , 1 factor with 3 levels, and a continuous variable.",{ - expect_equal(ret1,list(adP=round(c(0.2857143,1.0,1.0,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) - expect_equal(ret2,list(adP=round(c(0.2857143,1.0,1.0,0.21252205),5),lsSig=lsSig,lsQCCounts=list())) -}) - -#Test multivariates -context("Test funcLasso") - - -context("Test funcLM") -#This test just makes sure the statistical method is being called correctly for one covariate with the correct return -strFormula = "adCur ~ Covariate1" -strRandomFormula = NULL -x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -x2 = c(34.2, 32.5, 22.4, 43, 3.25, 6.4, 7, 87, 9) -xf1 = c(1,1,2,2,1,2,1,1,2) -xf2 = c(1,1,1,1,2,2,2,2,2) -frmeTmp = data.frame(Covariate1=x, Covariate2=x2, FCovariate3=xf1, FCovariate4=xf2, adCur=adCur) -iTaxon = 5 -lmRet = lm(as.formula(strFormula), data=frmeTmp, na.action = c_strNA_Action) -test_that("Test that the lm has the correct results for 1 covariate.",{ - expect_equal(funcLM(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -#Test for correct call for 2 covariates -strFormula = "adCur ~ Covariate1 + Covariate2" -lmRet = lm(as.formula(strFormula), data=frmeTmp, na.action = c_strNA_Action) -test_that("Test that the lm has the correct results for 2 covariates.",{ - expect_equal(funcLM(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -##Test for correct call with 1 random and one fixed covariate -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), 
family=gaussian(link="identity"), data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and one fixed covariate.",{ -# expect_equal(funcLM(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 1 random and 2 fixed covariates -#strFormula = "adCur ~ Covariate1 + Covariate2" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=gaussian(link="identity"), data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and 2 fixed covariates.",{ -# expect_equal(funcLM(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 2 random and 1 fixed covariates -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate4+1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=gaussian(link="identity"), data=frmeTmp) -#test_that("Test that the lm has the correct results for 2 random and 1 fixed covariates.",{ -# expect_equal(funcLM(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) - - -context("Test funcBinomialMult") -strFormula = "adCur ~ Covariate1" -strRandomFormula = NULL -x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -x2 = c(34.2, 32.5, 22.4, 43, 3.25, 6.4, 7, 87, 9) -xf1 = c(1,1,2,2,1,2,1,1,2) -xf2 = c(1,1,1,1,2,2,2,2,2) -frmeTmp = data.frame(Covariate1=x, Covariate2=x2, FCovariate3=xf1, FCovariate4=xf2, adCur=adCur) -iTaxon = 5 -lmRet = glm(as.formula(strFormula), family=binomial(link=logit), data=frmeTmp, na.action=c_strNA_Action) -test_that("Test that the neg binomial regression has the correct results for 1 covariate.",{ - 
expect_equal(funcBinomialMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -#Test for correct call for 2 covariates -strFormula = "adCur ~ Covariate1 + Covariate2" -iTaxon = 5 -lmRet = glm(as.formula(strFormula), family=binomial(link=logit), data=frmeTmp, na.action=c_strNA_Action) -test_that("Test that the neg binomial regression has the correct results for 2 covariates.",{ - expect_equal(funcBinomialMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -##Test for correct call with 1 random and one fixed covariate -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=binomial(link=logit), data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and one fixed covariate.",{ -# expect_equal(funcBinomialMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 1 random and 2 fixed covariates -#strFormula = "adCur ~ Covariate1 + Covariate2" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=binomial(link=logit), data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and 2 fixed covariates.",{ -# expect_equal(funcBinomialMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 2 random and 1 fixed covariates -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate4+1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=binomial(link=logit), data=frmeTmp) -#test_that("Test that the lm has the correct results for 2 random and 1 fixed 
covariates.",{ -# expect_equal(funcBinomialMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) - - -context("Test funcQuasiMult") -strFormula = "adCur ~ Covariate1" -strRandomFormula = NULL -x = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1,44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -x2 = c(34.2, 32.5, 22.4, 43, 3.25, 6.4, 7, 87, 9,34.2, 32.5, 22.4, 43, 3.25, 6.4, 7, 87, 9) -xf1 = c(1,1,2,2,1,1,2,2,2,1,1,2,2,1,1,2,2,2) -xf2 = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2,2) -frmeTmp = data.frame(Covariate1=x, Covariate2=x2, FCovariate3=xf1, FCovariate4=xf2, adCur=adCur) -iTaxon = 5 -lmRet = glm(as.formula(strFormula), family=quasipoisson, data=frmeTmp, na.action=c_strNA_Action) -test_that("Test that the quasi poisson has the correct results for 1 covariate.",{ - expect_equal(funcQuasiMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -#Test for correct call for 2 covariates -strFormula = "adCur ~ Covariate1 + Covariate2" -iTaxon = 5 -lmRet = glm(as.formula(strFormula), family=quasipoisson, data=frmeTmp, na.action=c_strNA_Action) -test_that("Test that the quasi poisson has the correct results for 2 covariates.",{ - expect_equal(funcQuasiMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -}) -##Test for correct call with 1 random and one fixed covariate -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=quasipoisson, data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and one fixed covariate.",{ -# expect_equal(funcQuasiMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 1 random and 2 fixed covariates 
-#strFormula = "adCur ~ Covariate1 + Covariate2" -#strRandomFormula = "~1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=quasipoisson, data=frmeTmp) -#test_that("Test that the lm has the correct results for 1 random and 2 fixed covariates.",{ -# expect_equal(funcQuasiMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) -##Test for correct call with 2 random and 1 fixed covariates -#strFormula = "adCur ~ Covariate1" -#strRandomFormula = "~1|FCovariate4+1|FCovariate3" -#lmRet = glmmPQL(fixed=as.formula(strFormula), random=as.formula(strRandomFormula), family=quasipoisson, data=frmeTmp) -#test_that("Test that the lm has the correct results for 2 random and 1 fixed covariates.",{ -# expect_equal(funcQuasiMult(strFormula=strFormula,frmeTmp=frmeTmp,iTaxon=iTaxon,lsHistory=lsHistory,strRandomFormula=strRandomFormula),lmRet) -#}) - - -#Test transforms -context("Test funcNoTransform") -aTest1 = c(NA) -aTest2 = c(NULL) -aTest3 = c(0.5,1.4,2.4,3332.4,0.0,0.0000003) -aTest4 = c(0.1) -test_that("Test that no transform does not change the data.",{ - expect_equal(funcNoTransform(aTest1), aTest1) - expect_equal(funcNoTransform(aTest2), aTest2) - expect_equal(funcNoTransform(aTest3), aTest3) - expect_equal(funcNoTransform(aTest4), aTest4) -}) - - -context("Test funcArcsinSqrt") -aTest1 = c(NA) -aTest2 = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0) -aTest3 = c(0.000001) -test_that("Test that funcArcsinSqrt performs the transform correctly.",{ - expect_equal(funcArcsinSqrt(NA), as.numeric(NA)) - expect_equal(funcArcsinSqrt(aTest1), asin(sqrt(aTest1))) - expect_equal(funcArcsinSqrt(aTest2), asin(sqrt(aTest2))) - expect_equal(funcArcsinSqrt(aTest3), asin(sqrt(aTest3))) -}) \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-BoostGLM/test-BoostGLM.R --- a/maaslin-4450aa4ecc84/src/test-BoostGLM/test-BoostGLM.R Sun Feb 08 
23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,307 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","Utility.R")) -source(file.path(c_strDir,"lib","AnalysisModules.R")) - -# General setup -covX1 = c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1) -covX2 = c(144.4, 245.9, 141.9, 253.3, 144.7, 244.1, 150.7, 245.2, 160.1) -covX3 = as.factor(c(1,2,3,1,2,3,1,2,3)) -covX4 = as.factor(c(1,1,1,1,2,2,2,2,2)) -covX5 = as.factor(c(1,2,1,2,1,2,1,2,1)) -covY = c(.26, .31, .25, .50, .36, .40, .52, .28, .38) -frmeTmp = data.frame(Covariate1=covX1, Covariate2=covX2, Covariate3=covX3, Covariate4=covX4, Covariate5=covX5, adCur= covY) -iTaxon = 6 - -lsCov1 = list() -lsCov1$name = "Covariate1" -lsCov1$orig = "Covariate1" -lsCov1$taxon = "adCur" -lsCov1$data = covY -lsCov1$factors = "Covariate1" -lsCov1$metadata = covX1 -vdCoef = c() -vdCoef["(Intercept)"]=round(0.0345077486,5) -vdCoef["Covariate1"]= round(0.0052097355,5) -vdCoef["Covariate2"]= round(0.0005806568,5) -vdCoef["Covariate32"]=round(-0.1333421874,5) -vdCoef["Covariate33"]=round(-0.1072006419,5) -vdCoef["Covariate42"]=round(0.0849198280,5) -lsCov1$value = c(Covariate1=round(0.005209736,5)) -lsCov1$std = round(0.0063781728,5) -lsCov1$allCoefs = vdCoef -lsCov2 = list() -lsCov2$name = "Covariate2" -lsCov2$orig = "Covariate2" -lsCov2$taxon = "adCur" -lsCov2$data = covY -lsCov2$factors = "Covariate2" -lsCov2$metadata = covX2 -lsCov2$value = c(Covariate2=round(0.0005806568,5)) -lsCov2$std = round(0.0006598436,5) -lsCov2$allCoefs = vdCoef -lsCov3 = list() -lsCov3$name = "Covariate3" -lsCov3$orig = "Covariate32" -lsCov3$taxon = "adCur" -lsCov3$data = covY -lsCov3$factors = "Covariate3" -lsCov3$metadata = covX3 -lsCov3$value = c(Covariate32=round(-0.1333422,5)) -lsCov3$std = round(0.0895657826,5) -lsCov3$allCoefs = vdCoef -lsCov4 = list() -lsCov4$name = "Covariate3" -lsCov4$orig = "Covariate33" -lsCov4$taxon = "adCur" 
-lsCov4$data = covY -lsCov4$factors = "Covariate3" -lsCov4$metadata = covX3 -lsCov4$value = c(Covariate33=round(-0.1072006,5)) -lsCov4$std = round(0.0792209541,5) -lsCov4$allCoefs = vdCoef -lsCov5 = list() -lsCov5$name = "Covariate4" -lsCov5$orig = "Covariate42" -lsCov5$taxon = "adCur" -lsCov5$data = covY -lsCov5$factors = "Covariate4" -lsCov5$metadata = covX4 -lsCov5$value = c(Covariate42=round(0.08491983,5)) -lsCov5$std = round(0.0701018621,5) -lsCov5$allCoefs = vdCoef - -context("Test funcClean") - -context("Test funcBugHybrid") -# multiple covariates, one call lm -aiMetadata = c(1:5) -aiData = c(iTaxon) -dFreq = 0.5 / length( aiMetadata ) -dSig = 0.25 -dMinSamp = 0.1 -adP = c() -lsSig = list() -funcReg = NA -funcAnalysis = funcLM -funcGetResult = funcGetLMResults -lsData = list(frmeData=frmeTmp, aiMetadata=aiMetadata, aiData=aiData, lsQCCounts=list()) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] - -adPExpected = round(c(0.4738687,0.4436566,0.4665972,0.5378693,0.3124672),5) -QCExpected = list(iLms=numeric(0)) -lsSigExpected = list() -lsSigExpected[[1]] = lsCov1 -lsSigExpected[[2]] = lsCov2 -lsSigExpected[[3]] = lsCov3 -lsSigExpected[[4]] = lsCov4 -lsSigExpected[[5]] = lsCov5 -expectedReturn = list(adP=adPExpected,lsSig=lsSigExpected,lsQCCounts=QCExpected) -receivedReturn = funcBugHybrid(iTaxon=iTaxon,frmeData=frmeTmp,lsData=lsData,aiMetadata=aiMetadata,dFreq=dFreq,dSig=dSig,dMinSamp=dMinSamp,adP=adP,lsSig=lsSig, strLog=NA,funcReg=funcReg,lsNonPenalizedPredictors=NULL,funcAnalysis=funcAnalysis,lsRandomCovariates=NULL,funcGetResult=funcGetResult) -receivedReturn$adP = round(receivedReturn$adP,5) - -vCoefs=receivedReturn$lsSig[[1]]$allCoefs -vCoefs[1]=round(vCoefs[1],5) -vCoefs[2]=round(vCoefs[2],5) -vCoefs[3]=round(vCoefs[3],5) -vCoefs[4]=round(vCoefs[4],5) -vCoefs[5]=round(vCoefs[5],5) -vCoefs[6]=round(vCoefs[6],5) -receivedReturn$lsSig[[1]]$allCoefs=vCoefs -receivedReturn$lsSig[[2]]$allCoefs=vCoefs -receivedReturn$lsSig[[3]]$allCoefs=vCoefs 
-receivedReturn$lsSig[[4]]$allCoefs=vCoefs -receivedReturn$lsSig[[5]]$allCoefs=vCoefs -vValue=c() -vValue[receivedReturn$lsSig[[1]]$orig]=round(receivedReturn$lsSig[[1]]$value[[1]],5) -receivedReturn$lsSig[[1]]$value=vValue -vValue=c() -vValue[receivedReturn$lsSig[[2]]$orig]=round(receivedReturn$lsSig[[2]]$value[[1]],5) -receivedReturn$lsSig[[2]]$value=vValue -vValue=c() -vValue[receivedReturn$lsSig[[3]]$orig]=round(receivedReturn$lsSig[[3]]$value[[1]],5) -receivedReturn$lsSig[[3]]$value=vValue -vValue=c() -vValue[receivedReturn$lsSig[[4]]$orig]=round(receivedReturn$lsSig[[4]]$value[[1]],5) -receivedReturn$lsSig[[4]]$value=vValue -vValue=c() -vValue[receivedReturn$lsSig[[5]]$orig]=round(receivedReturn$lsSig[[5]]$value[[1]],5) -receivedReturn$lsSig[[5]]$value=vValue -receivedReturn$lsSig[[1]]$std=round(receivedReturn$lsSig[[1]]$std,5) -receivedReturn$lsSig[[2]]$std=round(receivedReturn$lsSig[[2]]$std,5) -receivedReturn$lsSig[[3]]$std=round(receivedReturn$lsSig[[3]]$std,5) -receivedReturn$lsSig[[4]]$std=round(receivedReturn$lsSig[[4]]$std,5) -receivedReturn$lsSig[[5]]$std=round(receivedReturn$lsSig[[5]]$std,5) -test_that("funcBugHybrid works with the lm option with multiple covariates.",{expect_equal(receivedReturn,expectedReturn)}) - - -# single covariate, single call lm -aiMetadata = c(1) -dFreq = 0.5 / length( aiMetadata ) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] -adPExpected = round(c(0.1081731),5) -QCExpected = list(iLms=numeric(0)) -lsSigExpected = list() -lsSigExpected[[1]] = lsCov1 -lsSigExpected[[1]]$std=round(0.005278468,5) -vdCoef = c() -vdCoef["(Intercept)"]=round(-0.102410716,5) -vdCoef["Covariate1"]= round(0.009718095,5) -lsSigExpected[[1]]$allCoefs= vdCoef -lsSigExpected[[1]]$value = c(Covariate1=round(0.009718095,5)) - -expectedReturn = list(adP=adPExpected,lsSig=lsSigExpected,lsQCCounts=QCExpected) -receivedReturn = 
funcBugHybrid(iTaxon=iTaxon,frmeData=frmeTmp,lsData=lsData,aiMetadata=aiMetadata,dFreq=dFreq,dSig=dSig,dMinSamp=dMinSamp,adP=adP,lsSig=lsSig, strLog=NA,funcReg=funcReg,lsNonPenalizedPredictors=NULL,funcAnalysis=funcAnalysis,lsRandomCovariates=NULL,funcGetResult=funcGetResult) -receivedReturn$adP = round(receivedReturn$adP,5) - -vCoefs=receivedReturn$lsSig[[1]]$allCoefs -vCoefs[1]=round(vCoefs[1],5) -vCoefs[2]=round(vCoefs[2],5) -receivedReturn$lsSig[[1]]$allCoefs=vCoefs -vValue=c() -vValue[receivedReturn$lsSig[[1]]$orig]=round(receivedReturn$lsSig[[1]]$value[[1]],5) -receivedReturn$lsSig[[1]]$value=vValue -receivedReturn$lsSig[[1]]$std=round(0.005278468,5) -test_that("funcBugHybrid works with the lm option with 1 covariates.",{expect_equal(receivedReturn,expectedReturn)}) - - -# multiple covariate, single call univariate -funcReg = NA -funcAnalysis = funcDoUnivariate -funcGetResult = NA -aiMetadata = c(3,1,2) -dFreq = 0.5 / length( aiMetadata ) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] -adPExpected = round(c(1.0,1.0,0.09679784,0.21252205),5) -QCExpected = list(iLms=numeric(0)) -lsSigExpected = list() -lsCov1 = list() -lsCov1$name = "Covariate3" -lsCov1$orig = "Covariate32" -lsCov1$taxon = "adCur" -lsCov1$data = covY -lsCov1$factors = "Covariate3" -lsCov1$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate32=NA) -lsCov1$value = vdCoef -lsCov1$std = sd(frmeTmp[["Covariate3"]]) -lsCov1$allCoefs = vdCoef -lsCov2 = list() -lsCov2$name = "Covariate3" -lsCov2$orig = "Covariate33" -lsCov2$taxon = "adCur" -lsCov2$data = covY -lsCov2$factors = "Covariate3" -lsCov2$metadata = frmeTmp[["Covariate3"]] -vdCoef = c(Covariate33=NA) -lsCov2$value = vdCoef -lsCov2$std = sd(frmeTmp[["Covariate3"]]) -lsCov2$allCoefs = vdCoef -lsCov3 = list() -lsCov3$name = "Covariate1" -lsCov3$orig = "Covariate1" -lsCov3$taxon = "adCur" -lsCov3$data = covY -lsCov3$factors = "Covariate1" -lsCov3$metadata = frmeTmp[["Covariate1"]] -vdCoef = c(Covariate1=0.6) -lsCov3$value = vdCoef 
-lsCov3$std = sd(frmeTmp[["Covariate1"]]) -lsCov3$allCoefs = vdCoef -lsCov4 = list() -lsCov4$name = "Covariate2" -lsCov4$orig = "Covariate2" -lsCov4$taxon = "adCur" -lsCov4$data = covY -lsCov4$factors = "Covariate2" -lsCov4$metadata = frmeTmp[["Covariate2"]] -vdCoef = c(Covariate2=0.46666667) -lsCov4$value = vdCoef -lsCov4$std = sd(frmeTmp[["Covariate2"]]) -lsCov4$allCoefs = vdCoef - -lsSigExpected = list() -lsSigExpected[[1]] = lsCov1 -lsSigExpected[[2]] = lsCov2 -lsSigExpected[[3]] = lsCov3 -lsSigExpected[[4]] = lsCov4 - -expectedReturn = list(adP=adPExpected,lsSig=lsSigExpected,lsQCCounts=QCExpected) -receivedReturn = funcBugHybrid(iTaxon=iTaxon,frmeData=frmeTmp,lsData=lsData,aiMetadata=aiMetadata,dFreq=dFreq,dSig=dSig,dMinSamp=dMinSamp,adP=adP,lsSig=lsSig, strLog=NA,funcReg=funcReg,lsNonPenalizedPredictors=NULL,funcAnalysis=funcAnalysis,lsRandomCovariates=NULL,funcGetResult=funcGetResult) -receivedReturn$adP = round(receivedReturn$adP,5) -test_that("funcBugHybrid works with the univariate option with 3 covariates.",{expect_equal(receivedReturn,expectedReturn)}) - - -# single covariate, single call univariate -funcReg = NA -funcAnalysis = funcDoUnivariate -funcGetResult = NA -aiMetadata = c(1) -dFreq = 0.5 / length( aiMetadata ) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] -adPExpected = round(c(0.09679784),5) -QCExpected = list(iLms=numeric(0)) -lsSigExpected = list() -lsSigExpected[[1]] = lsCov3 - -expectedReturn = list(adP=adPExpected,lsSig=lsSigExpected,lsQCCounts=QCExpected) -receivedReturn = funcBugHybrid(iTaxon=iTaxon,frmeData=frmeTmp,lsData=lsData,aiMetadata=aiMetadata,dFreq=dFreq,dSig=dSig,dMinSamp=dMinSamp,adP=adP,lsSig=lsSig, strLog=NA,funcReg=funcReg,lsNonPenalizedPredictors=NULL,funcAnalysis=funcAnalysis,lsRandomCovariates=NULL,funcGetResult=funcGetResult) -receivedReturn$adP = round(receivedReturn$adP,5) -test_that("funcBugHybrid works with the univariate option with 1 covariates.",{expect_equal(receivedReturn,expectedReturn)}) - - 
-context("Test funcBugs") -#One LM run -frmeData=frmeTmp -aiMetadata=c(1) -aiData=c(iTaxon) -strData=NA -dFreq= 0.5 / length( aiMetadata ) -dSig=0.25 -dMinSamp=0.1 -strDirOut=NA -funcReg=NA -lsNonPenalizedPredictors=NULL -lsRandomCovariates=NULL -funcAnalysis=funcLM -funcGetResults=funcGetLMResults -fDoRPlot=FALSE -lsData = list(frmeData=frmeData, aiMetadata=aiMetadata, aiData=aiData, lsQCCounts=list()) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] -QCExpected = list(iLms=numeric(0)) - -expectedReturn = list(aiReturnBugs=aiData,lsQCCounts=QCExpected) -receivedReturn = funcBugs(frmeData=frmeData, lsData=lsData, aiMetadata=aiMetadata, aiData=aiData, strData=strData, dFreq=dFreq, dSig=dSig, dMinSamp=dMinSamp,strDirOut=strDirOut, funcReg=funcReg,lsNonPenalizedPredictors=lsNonPenalizedPredictors,funcAnalysis=funcAnalysis,lsRandomCovariates=lsRandomCovariates,funcGetResults=funcGetResults,fDoRPlot=fDoRPlot) - -test_that("funcBugs works with the lm option with 1 covariate.",{expect_equal(receivedReturn,expectedReturn)}) - -#multiple LM run -frmeData=frmeTmp -aiMetadata=c(1:5) -aiData=c(iTaxon) -strData=NA -dFreq= 0.5 / length( aiMetadata ) -dSig=0.25 -dMinSamp=0.1 -strDirOut=NA -funcReg=NA -lsNonPenalizedPredictors=NULL -lsRandomCovariates=NULL -funcAnalysis=funcLM -funcGetResults=funcGetLMResults -fDoRPlot=FALSE -lsData = list(frmeData=frmeData, aiMetadata=aiMetadata, aiData=aiData, lsQCCounts=list()) -lsData$astrMetadata = names(frmeTmp)[aiMetadata] -QCExpected = list(iLms=numeric(0)) - -expectedReturn = list(aiReturnBugs=aiData,lsQCCounts=QCExpected) -receivedReturn = funcBugs(frmeData=frmeData, lsData=lsData, aiMetadata=aiMetadata, aiData=aiData, strData=strData, dFreq=dFreq, dSig=dSig, dMinSamp=dMinSamp,strDirOut=strDirOut, funcReg=funcReg,lsNonPenalizedPredictors=lsNonPenalizedPredictors,funcAnalysis=funcAnalysis,lsRandomCovariates=lsRandomCovariates,funcGetResults=funcGetResults,fDoRPlot=fDoRPlot) - -print("START START") -print(expectedReturn) 
-print("RECEIVED") -print(receivedReturn) -print("STOP STOP") - -test_that("funcBugs works with the lm option with multiple covariates.",{expect_equal(receivedReturn,expectedReturn)}) \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-IO/test-IO.R --- a/maaslin-4450aa4ecc84/src/test-IO/test-IO.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,162 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","ValidateData.R")) -strTestingDirectory = file.path(c_strDir,c_strTestingDirectory) - -expect_equal(funcParseIndexSlices("1",cNames),c(1)) - -cNames = c("One","Two","Three","Four","Five","Six","Seven","Eight","Nine","Ten","Eleven", - "Twelve","Thirteen","Fourteen","Fifteen") - -test_that("Just Numerics are parsed",{ - expect_equal(funcParseIndexSlices("1",cNames),c(1)) - expect_equal(funcParseIndexSlices("8,10",cNames),c(8,10)) - expect_equal(funcParseIndexSlices("2-6",cNames), c(2,3,4,5,6)) - expect_equal(funcParseIndexSlices("3,7,10-12",cNames), c(3,7,10,11,12)) -}) - -test_that("Missing numbers are parsed",{ - expect_equal(funcParseIndexSlices("-",cNames), c(2:15)) - expect_equal(funcParseIndexSlices("-4",cNames), c(2,3,4)) - expect_equal(funcParseIndexSlices("3-",cNames), c(3:15)) -}) - -test_that("Words are parsed correctly",{ - expect_equal(funcParseIndexSlices("One",cNames), c(1)) - expect_equal(funcParseIndexSlices("Eight,Ten",cNames), c(8,10)) - expect_equal(funcParseIndexSlices("Two-Six",cNames), c(2,3,4,5,6)) - expect_equal(funcParseIndexSlices("Three,Seven,Ten-Twelve",cNames), c(3,7,10,11,12)) -}) - -test_that("Missing words are parsed",{ - expect_equal(funcParseIndexSlices("-Four",cNames), c(2:4)) - expect_equal(funcParseIndexSlices("Three-",cNames), c(3:15)) -}) - -test_that("Words and numbers are parsed correctly",{ - expect_equal(funcParseIndexSlices("Eight,10",cNames), c(8,10)) - 
expect_equal(funcParseIndexSlices("2-Six",cNames), c(2,3,4,5,6)) - expect_equal(funcParseIndexSlices("Three,7,10-Twelve",cNames), c(3,7,10,11,12)) -}) - - -context("Test funcWriteMatrixToReadConfigFile") -# File to temporarily write to -strWriteMatrixRCTestFile = file.path(strTestingDirectory,c_strTemporaryFiles,"FuncWriteMatrixToReadConfigFileTemp.read.config") -# Files that hold answers -strFileSimpleRCFileAnswer = file.path(strTestingDirectory,c_strCorrectAnswers,"FuncWriteMatrixToReadConfigFile_SimpleAnswer.read.config") -strFileUseAllRCFileAnswer = file.path(strTestingDirectory,c_strCorrectAnswers,"FuncWriteMatrixToReadConfigFile_AllAnswer.read.config") -strFileAppendTrueRCFileAnswer = file.path(strTestingDirectory,c_strCorrectAnswers,"FuncWriteMatrixToReadConfigFile_AppendAnswer.read.config") -#Input matrix file -strFileMatrix = file.path(strTestingDirectory,c_strTestingInput,"TestMatrix.tsv") - -#Get read config files in different scenarios -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"SimpleMatrix") -strSimpleInterface = readLines(strWriteMatrixRCTestFile) -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"AllMatrix",strRowIndices="1,2,3,4,5", strColIndices="10,11,12",acharDelimiter=" ") -strUseAllParametersInterface = readLines(strWriteMatrixRCTestFile) -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"SimpleMatrix") -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"SimpleMatrix") -strAppendFalseInterface = readLines(strWriteMatrixRCTestFile) -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"SimpleMatrix") -funcWriteMatrixToReadConfigFile(strWriteMatrixRCTestFile,"SimpleMatrix",fAppend=TRUE) -strAppendTrueInterface = readLines(strWriteMatrixRCTestFile) - -test_that("Correct config file is written",{ - expect_equal(strSimpleInterface,readLines(strFileSimpleRCFileAnswer)) - expect_equal(strUseAllParametersInterface,readLines(strFileUseAllRCFileAnswer)) - 
expect_equal(strAppendFalseInterface,readLines(strFileSimpleRCFileAnswer)) - expect_equal(strAppendTrueInterface,readLines(strFileAppendTrueRCFileAnswer)) -}) - -context("Test funcReadConfigFile") -lsSimpleRC = funcReadConfigFile(strFileSimpleRCFileAnswer,strFileMatrix) -lsAllRC = funcReadConfigFile(strFileUseAllRCFileAnswer,strFileMatrix) - -lsSimpleListAnswer = list() -lsSimpleListAnswer[[1]]=c("SimpleMatrix",strFileMatrix,"\t","-","-") -lsAllListAnswer = list() -lsAllListAnswer[[1]]=c("AllMatrix",strFileMatrix," ","1,2,3,4,5","10,11,12") - -test_that("Test readConfigFile reads in files correctly.",{ - expect_equal(lsSimpleRC,lsSimpleListAnswer) - expect_equal(lsAllRC,lsAllListAnswer) -}) - - -context("Test funcReadMatrix") - -#Read in config files -dfSimpleRead = funcReadMatrix("SimpleMatrix",strFileMatrix,"\t","2,4,5","7,3,5") -dfUseAllParametersRead = funcReadMatrix("AllMatrix",strFileMatrix,"\t","2,3,4","6,2,4") - -dfSimpleReadCorrect = as.data.frame(as.matrix(rbind(c(21,23,24),c(41,43,44),c(61,63,64)))) -rownames(dfSimpleReadCorrect) = c("Feature2", "Feature4", "Feature6") -colnames(dfSimpleReadCorrect) = c("Sample1", "Sample3", "Sample4") - -dfUseAllReadCorrect = as.data.frame(as.matrix(rbind(c(11,12,13),c(31,32,33),c(51,52,53)))) -rownames(dfUseAllReadCorrect) = c("Feature1", "Feature3", "Feature5") -colnames(dfUseAllReadCorrect) = c("Sample1", "Sample2", "Sample3") - -test_that("Matrix file is read correctly.",{ - expect_equal(dfSimpleRead,dfSimpleReadCorrect) - expect_equal(dfUseAllParametersRead,dfUseAllReadCorrect) -}) - -context("Test funcReadMatrices") - -sConfigureFile1Matrix = file.path(strTestingDirectory,c_strTestingInput,"1Matrix.read.config") -mtxOne = as.data.frame(as.matrix(rbind(c(11,12,13,14,15),c(21,22,23,24,25),c(31,32,33,34,35),c(41,42,43,44,45), - c(51,52,53,54,55),c(61,62,63,64,65),c(71,72,73,74,75),c(81,82,83,84,85), - c(91,92,93,94,95),c(101,102,103,104,105),c(111,112,113,114,115),c(121,122,123,124,125), - 
c(131,132,133,134,135),c(141,142,143,144,145),c(151,152,153,154,155)))) -rownames(mtxOne) = c("Feature1","Feature2","Feature3","Feature4","Feature5","Feature6","Feature7","Feature8","Feature9","Feature10", - "Feature11","Feature12","Feature13","Feature14","Feature15") -colnames(mtxOne) = c("Sample1","Sample2","Sample3","Sample4","Sample5") -sConfigureFile2Matrix = file.path(strTestingDirectory,c_strTestingInput,"2Matrix.read.config") -mtxTwo = as.data.frame(as.matrix(rbind(c(11,12,13),c(21,22,23),c(31,32,33)))) -rownames(mtxTwo) = c("Feature1","Feature2","Feature3") -colnames(mtxTwo) = c("Sample1","Sample2","Sample3") - -sConfigureFile3Matrix = file.path(strTestingDirectory,c_strTestingInput,"3Matrix.read.config") -mtxThree = as.data.frame(as.matrix(rbind(c(11,12,14),c(21,22,24),c(31,32,34),c(41,42,44), - c(51,52,54),c(61,62,64),c(71,72,74),c(81,82,84),c(91,92,94)))) -rownames(mtxThree) = c("Feature1","Feature2","Feature3","Feature4","Feature5","Feature6","Feature7","Feature8","Feature9") -colnames(mtxThree) = c("Sample1","Sample2","Sample4") - -#Read one matrix -ldfRet1 = funcReadMatrices(configureFile=sConfigureFile1Matrix,strFileMatrix) -ldfRet1Answer = list( "Matrix1" = mtxOne) - -#Read two matrices -ldfRet2 = funcReadMatrices(configureFile=sConfigureFile2Matrix,strFileMatrix) -ldfRet2Answer = list( "Matrix1" = mtxOne, - "Matrix2" = mtxTwo) - -#Read three matrices from two different files -ldfRet3 = funcReadMatrices(configureFile=sConfigureFile3Matrix,strFileMatrix) -ldfRet3Answer = list( "Matrix1" = mtxOne, - "Matrix2" = mtxTwo, - "Matrix3" = mtxThree) - -test_that("Test funcReadMatrices read in the correct matrices not matter the number or source",{ - expect_equal(ldfRet1,ldfRet1Answer) - expect_equal(ldfRet2,ldfRet2Answer) - expect_equal(ldfRet3,ldfRet3Answer) -}) - -context("Test funcWriteMatrices") -strFuncWriteMatricesMatrix1 = file.path(strTestingDirectory,c_strTemporaryFiles,"FuncWriteMatrices1.tsv") -strFuncWriteMatricesMatrix2 = 
file.path(strTestingDirectory,c_strTemporaryFiles,"FuncWriteMatrices2.tsv") -strFuncWriteMatricesMatrix1Answer = file.path(strTestingDirectory, c_strCorrectAnswers,"FuncWriteMatrices1.tsv") -strFuncWriteMatricesMatrix2Answer = file.path(strTestingDirectory, c_strCorrectAnswers,"FuncWriteMatrices2.tsv") -strFuncWriteMatricesRCFile = file.path(strTestingDirectory,c_strTemporaryFiles,"FuncWriteMatrices.read.config") -strFuncWriteMatricesRCFileAnswer = file.path(strTestingDirectory, c_strCorrectAnswers,"FuncWriteMatrices.read.config") -funcWriteMatrices(list("1"=mtxOne, "2"=mtxThree),c(strFuncWriteMatricesMatrix1, strFuncWriteMatricesMatrix2), strFuncWriteMatricesRCFile) - -test_that("Test that writing to a file occurs correctly, for both matrix and configure file.",{ - expect_equal(readLines(strFuncWriteMatricesMatrix1Answer),readLines(strFuncWriteMatricesMatrix1)) - expect_equal(readLines(strFuncWriteMatricesMatrix2Answer),readLines(strFuncWriteMatricesMatrix2)) - expect_equal(readLines(strFuncWriteMatricesRCFileAnswer),readLines(strFuncWriteMatricesRCFile)) -}) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-Maaslin/test-Maaslin.R --- a/maaslin-4450aa4ecc84/src/test-Maaslin/test-Maaslin.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,41 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -strTestingDirectory = file.path(c_strDir,c_strTestingDirectory) -sScriptMaaslin = file.path( c_strDir, "Maaslin.R" ) - -context("Test Run From Commandline") - -#Input Files -sTestReadConfig = file.path(strTestingDirectory, c_strTestingInput, "TestMaaslin.read.config") -sTestCustomR = file.path(strTestingDirectory, c_strTestingInput, "TestMaaslin.R") -sTestMaaslinDirectory = file.path(strTestingDirectory, c_strTemporaryFiles, "testMaaslin") -sTestOutput = file.path(sTestMaaslinDirectory,"TestMaaslin_Summary.txt") -sTestTSV = file.path(strTestingDirectory, c_strTestingInput, 
"TestMaaslin.tsv") -#Test file answers -sTestOutputAnswer = file.path(strTestingDirectory, c_strCorrectAnswers, "TestMaaslin.tsv") - -#Delete Test MaAsLin output -unlink(sTestMaaslinDirectory, recursive=TRUE) -#Make neccessary directories -dir.create(sTestMaaslinDirectory) -dir.create(file.path(sTestMaaslinDirectory,"QC")) - -sCommand = paste(sScriptMaaslin, "-v", "ERROR", "-d", "0.25", "-r", "0.0001", "-p", "0.1", sTestOutput, sTestTSV, sTestReadConfig, sTestCustomR, sep=" ") -print(sCommand) -system(sCommand) - -sExpectedTitle = "\tVariable\tFeature\tValue\tCoefficient\tN\tN.not.0\tP.value\tQ.value" -iExpectedNumberOfLines = 3 -lsOutputSummaryFile = readLines(sTestOutput) - -test_that("Make sure that the summary output file is what is expected (generally).",{ - expect_equal(lsOutputSummaryFile[1], sExpectedTitle) - expect_equal(length(lsOutputSummaryFile),iExpectedNumberOfLines) -}) - -lsDirectoryStructure = list.files(sTestMaaslinDirectory) -lsDirectoryStructureAnswer = c(basename(sTestOutput),"QC","TestMaaslin-age.pdf","TestMaaslin-age.txt","TestMaaslin-dx.txt","TestMaaslin.pdf","TestMaaslin.txt") -test_that("Make sure the expected directory structure is created.",{ - expect_equal(sort(lsDirectoryStructure), sort(lsDirectoryStructureAnswer)) -}) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-SummarizeMaaslin/test-SummarizeMaaslin.R --- a/maaslin-4450aa4ecc84/src/test-SummarizeMaaslin/test-SummarizeMaaslin.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,72 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","SummarizeMaaslin.R")) -source(file.path(c_strDir,"lib","Utility.R")) - -context("Test funcSummarizeDirectory") -strDirectoryNone = file.path(c_strDir,c_strTestingDirectory,c_strTestingInput,"funcSummarizeDirectory","None") -strDirectory1 = file.path(c_strDir,c_strTestingDirectory,c_strTestingInput,"funcSummarizeDirectory","1") 
-strDirectory3 = file.path(c_strDir,c_strTestingDirectory,c_strTestingInput,"funcSummarizeDirectory","3") -strFileBase1 = "FileBase1.txt" -strFileBase2 = "FileBase2.txt" - -sKeyword = "Q.value" -sAltKeyword = "P.value" -sAltSignificance = "0.35" - -sBaseName = "FuncSummarizeDirectory" - -#Output and answer files -sNoFileResult = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-NoFileResult.txt") -sNoFileResultAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-NoFileAltKeyResult.txt") -sNoFileResultAltSig = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-NoFileAltSigResult.txt") -sNoFileResultAnswer = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-NoFileAnswer.txt") -sNoFileResultAnswerAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-NoFileAltKeyAnswer.txt") -sNoFileResultAnswerAltSig = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-NoFileAltSigAnswer.txt") -unlink(sNoFileResult) -sCorrectResults1File = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-1FileResult.txt") -sCorrectResults1FileAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-1FileAltKeyResult.txt") -sCorrectResults1FileAltSig = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-1FileAltSigResult.txt") -sCorrectResults1FileAnswer = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-1FileResult.txt") -sCorrectResults1FileAnswerAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-1FileAltKeyResult.txt") -sCorrectResults1FileAnswerAltSig = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-1FileAltSigResult.txt") 
-unlink(sCorrectResults1File) -sCorrectResults3Files = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-3FileResult.txt") -sCorrectResults3FilesAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-3FileAltKeyResult.txt") -sCorrectResults3FilesAltSig = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncSummarizeDirectory-3FileAltSigResult.txt") -sCorrectResults3FilesAnswer = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-3FileResult.txt") -sCorrectResults3FilesAnswerAltKeyword = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-3FileAltKeyResult.txt") -sCorrectResults3FilesAnswerAltSig = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncSummarizeDirectory-3FileAltSigResult.txt") -unlink(sCorrectResults3Files) - -#Run tests -funcSummarizeDirectory(astrOutputDirectory=strDirectoryNone, strBaseName=sBaseName, astrSummaryFileName=sNoFileResult, astrKeyword=sKeyword, afSignificanceLevel="0.25") -funcSummarizeDirectory(astrOutputDirectory=strDirectory1, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults1File, astrKeyword=sKeyword, afSignificanceLevel="0.25") -funcSummarizeDirectory(astrOutputDirectory=strDirectory3, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults3Files, astrKeyword=sKeyword, afSignificanceLevel="0.25") - -funcSummarizeDirectory(astrOutputDirectory=strDirectoryNone, strBaseName=sBaseName, astrSummaryFileName=sNoFileResultAltKeyword, astrKeyword=sAltKeyword, afSignificanceLevel="0.25") -funcSummarizeDirectory(astrOutputDirectory=strDirectory1, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults1FileAltKeyword, astrKeyword=sAltKeyword, afSignificanceLevel="0.25") -funcSummarizeDirectory(astrOutputDirectory=strDirectory3, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults3FilesAltKeyword, astrKeyword=sAltKeyword, afSignificanceLevel="0.25") - 
-funcSummarizeDirectory(astrOutputDirectory=strDirectoryNone, strBaseName=sBaseName, astrSummaryFileName=sNoFileResultAltSig, astrKeyword= sKeyword, afSignificanceLevel=sAltSignificance) -funcSummarizeDirectory(astrOutputDirectory=strDirectory1, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults1FileAltSig, astrKeyword= sKeyword, afSignificanceLevel=sAltSignificance) -funcSummarizeDirectory(astrOutputDirectory=strDirectory3, strBaseName=sBaseName, astrSummaryFileName=sCorrectResults3FilesAltSig, astrKeyword= sKeyword, afSignificanceLevel=sAltSignificance) - -test_that("Check the cases where no, and real summary files exist.",{ - expect_equal(readLines(sNoFileResult),readLines(sNoFileResultAnswer)) - expect_equal(readLines(sCorrectResults1File),readLines(sCorrectResults1FileAnswer)) - expect_equal(readLines(sCorrectResults3Files),readLines(sCorrectResults3FilesAnswer)) -}) - -test_that("Check changing the keyword.",{ - expect_equal(readLines(sNoFileResultAltKeyword),readLines(sNoFileResultAnswerAltKeyword)) - expect_equal(readLines(sCorrectResults1FileAltKeyword),readLines(sCorrectResults1FileAnswerAltKeyword)) - expect_equal(readLines(sCorrectResults3FilesAltKeyword),readLines(sCorrectResults3FilesAnswerAltKeyword)) -}) - -test_that("Check that changing the significance threshold effects inclusion.",{ - expect_equal(readLines(sNoFileResultAltSig),readLines(sNoFileResultAnswerAltSig)) - expect_equal(readLines(sCorrectResults1FileAltSig),readLines(sCorrectResults1FileAnswerAltSig)) - expect_equal(readLines(sCorrectResults3FilesAltSig),readLines(sCorrectResults3FilesAnswerAltSig)) -}) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-Utility/test-Utility.R --- a/maaslin-4450aa4ecc84/src/test-Utility/test-Utility.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,159 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","Utility.R")) - 
-context("Test funcRename") -test_that("Test that unclassified and none otus are represented as 2 terminal clades and others are 1",{ - expect_equal(funcRename(paste("A","B","C","D",c_strUnclassified, sep=c_cFeatureDelim)),paste("D",c_strUnclassified, sep=c_cFeatureDelim)) - expect_equal(funcRename(paste("A","B","C","D","101", sep=c_cFeatureDelim)),paste("D","101", sep=c_cFeatureDelim)) - expect_equal(funcRename(paste("A","B","C","D", sep=c_cFeatureDelim)),paste("D", sep=c_cFeatureDelim)) - expect_equal(funcRename(paste("A", sep=c_cFeatureDelim)),paste("A", sep=c_cFeatureDelim)) - expect_equal(funcRename(paste(c_strUnclassified, sep=c_cFeatureDelim)),paste(c_strUnclassified, sep=c_cFeatureDelim)) - expect_equal(funcRename(paste("101", sep=c_cFeatureDelim)),paste("101", sep=c_cFeatureDelim)) -}) - -context("Test funcColorHelper") -test_that("Test that min is min and max is max and average is average even if given as NA",{ - expect_equal(funcColorHelper( dMax = 1, dMin = 1, dMed = NA ), list( dMin = 1, dMax = 1, dMed = 1)) - expect_equal(funcColorHelper( dMax = -3, dMin = 10, dMed = NA ), list( dMin = -3, dMax = 10, dMed = 3.5)) - expect_equal(funcColorHelper( dMax = 1, dMin = 11, dMed = NA ), list( dMin = 1, dMax = 11, dMed = 6)) - expect_equal(funcColorHelper( dMax = 4, dMin = 10, dMed = 5 ), list( dMin = 4, dMax = 10, dMed = 5)) - expect_equal(funcColorHelper( dMax = 10, dMin = 4, dMed = 5 ), list( dMin = 4, dMax = 10, dMed = 5)) -}) - -context("Test funcTrim") -test_that("Test that white spaces at the beginning and end of s string are removed",{ - expect_equal(funcTrim("TRIM"),"TRIM") - expect_equal(funcTrim(" TRIM"),"TRIM") - expect_equal(funcTrim(" TRIM"),"TRIM") - expect_equal(funcTrim(" TRIM "),"TRIM") - expect_equal(funcTrim("TRIM "),"TRIM") - expect_equal(funcTrim(" TRIM "),"TRIM") - expect_equal(funcTrim("TR IM"),"TR IM") - expect_equal(funcTrim(" TR IM"),"TR IM") - expect_equal(funcTrim(" TR I M"),"TR I M") - expect_equal(funcTrim(" TR IM "),"TR IM") - 
expect_equal(funcTrim("T R IM "),"T R IM") - expect_equal(funcTrim(" T RIM "),"T RIM") -}) - -#TODO currently the capture versio of this does not produce a tabbed table (or default table delim) which is not consistent with the rest of the code base. -context("Test funcWrite") -#Answer files -c_sAnswerWriteFile1 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTemp1.txt") -c_sAnswerWriteFile2 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTemp2.txt") -print("c_sAnswerWriteFile2") -print(c_sAnswerWriteFile2) -c_sAnswerWriteDFFile1 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTempDF1.txt") -c_sAnswerWriteDFFile2 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTempDF2.txt") - -#Working files -c_sTempWriteFile1 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTemp1.txt") -c_sTempWriteFile2 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTemp2.txt") -c_sTempWriteDFFile1 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTempDF1.txt") -c_sTempWriteDFFile2 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTempDF2.txt") -dfTest = as.data.frame(as.matrix(cbind(c(1,11,111),c(2,22,222),c(3,33,333)))) -sWriteString = "Testing, 1,2,3 anything but that." 
-unlink(c_sTempWriteFile1) -unlink(c_sTempWriteFile2) -unlink(c_sTempWriteDFFile1) -unlink(c_sTempWriteDFFile2) -funcWrite(sWriteString,c_sTempWriteFile1) -funcWrite(sWriteString,c_sTempWriteFile2) -funcWrite(sWriteString,c_sTempWriteFile2) -funcWrite(dfTest,c_sTempWriteDFFile1) -funcWrite(dfTest,c_sTempWriteDFFile2) -funcWrite(dfTest,c_sTempWriteDFFile2) - -test_that("Test that a test file is written and appended to for strings and dataframes.",{ - expect_equal(readLines(c_sTempWriteFile1),readLines(c_sAnswerWriteFile1)) - expect_equal(readLines(c_sTempWriteFile2),readLines(c_sAnswerWriteFile2)) - expect_equal(readLines(c_sTempWriteDFFile1),readLines(c_sAnswerWriteDFFile1)) - expect_equal(readLines(c_sTempWriteDFFile2),readLines(c_sAnswerWriteDFFile2)) -}) - -context("Test funcWriteTable") -#Answer files -c_sAnswerWriteDFFile1 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTableTempDF1.txt") -c_sAnswerWriteDFFile2 = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteTableTempDF2.txt") - -#Working files -c_sTempWriteDFFile1 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTableTempDF1.txt") -c_sTempWriteDFFile2 = file.path(c_strDir,c_strTestingDirectory,c_strTemporaryFiles,"FuncWriteTableTempDF2.txt") -unlink(c_sTempWriteDFFile1) -unlink(c_sTempWriteDFFile2) -funcWriteTable(dfTest,c_sTempWriteDFFile1) -funcWriteTable(dfTest,c_sTempWriteDFFile2, fAppend=TRUE) -funcWriteTable(dfTest,c_sTempWriteDFFile2, fAppend=TRUE) - -test_that("Test that a test file is written and appended to for dataframes.",{ - expect_equal(readLines(c_sTempWriteDFFile1),readLines(c_sAnswerWriteDFFile1)) - expect_equal(readLines(c_sTempWriteDFFile2),readLines(c_sAnswerWriteDFFile2)) -}) - -context("Test funcCoef2Col") -dfTestWithFactors = as.data.frame(as.matrix(cbind(c(1,11,111),c(2,22,222),c(3,33,333)))) -colnames(dfTestWithFactors)=c("A","B","C") -dfTestWithFactors["B"]=as.factor(as.character(dfTestWithFactors[["B"]])) 
-test_that("Test that a coefficients are found or not given if they exist",{ - expect_equal(funcCoef2Col(strCoef="C",frmeData=dfTestWithFactors,astrCols=c()),"C") - expect_equal(funcCoef2Col(strCoef="A",frmeData=dfTestWithFactors,astrCols=c()),"A") - expect_equal(funcCoef2Col(strCoef=paste("B","2",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c()),"B") - expect_equal(funcCoef2Col(strCoef=paste("B","22",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c()),"B") - expect_equal(funcCoef2Col(strCoef=paste("B","222",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c()),"B") - expect_equal(funcCoef2Col(strCoef="C",frmeData=dfTestWithFactors,astrCols=c("A","B","C")),"C") - expect_equal(funcCoef2Col(strCoef=paste("B","2",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c("A","B")),"B") - expect_equal(funcCoef2Col(strCoef=paste("B","22",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c("B","C")),"B") - expect_equal(funcCoef2Col(strCoef=paste("B","222",sep=c_sFactorNameSep),frmeData=dfTestWithFactors,astrCols=c("B")),"B") -}) - -context("Test funcMFAValue2Col") -dfTestWithFactors = data.frame(A=c(1,3,3,4,5,6,7,8),B=c(1.0,2.0, 5.8,4.6,4.7,8.9,9.0,2.0),C=c("one","two","one","two","one","two","one","two"),D=c("1","2","1","2","1","2","1","2")) -dfTestWithFactors["three"]=as.factor(dfTestWithFactors[["three"]]) -test_that("Test that a column names is found or not given if they exist",{ - expect_equal(funcMFAValue2Col(xValue=5.8,dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"B") - expect_equal(funcMFAValue2Col(xValue=6,dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"A") - expect_equal(funcMFAValue2Col(xValue="one",dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"C") - expect_equal(funcMFAValue2Col(xValue="two",dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"C") - expect_equal(funcMFAValue2Col(xValue=paste("D","1",sep=c_sMFANameSep1),dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"D") - 
expect_equal(funcMFAValue2Col(xValue=paste("D","2",sep=c_sMFANameSep1),dfData=dfTestWithFactors, aiColumnIndicesToSearch=NULL),"D") - expect_equal(funcMFAValue2Col(xValue=2.0,dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(1,3)),NULL) - expect_equal(funcMFAValue2Col(xValue=6,dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(2,3)),NULL) - expect_equal(funcMFAValue2Col(xValue="one",dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(1,2)),NULL) - expect_equal(funcMFAValue2Col(xValue="two",dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(1,2)),NULL) - expect_equal(funcMFAValue2Col(xValue=2.0,dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(2)),"B") - expect_equal(funcMFAValue2Col(xValue=6,dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(1)),"A") - expect_equal(funcMFAValue2Col(xValue="one",dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(3)),"C") - expect_equal(funcMFAValue2Col(xValue=paste("D","2",sep=c_sMFANameSep1),dfData=dfTestWithFactors, aiColumnIndicesToSearch=c(4)),"D") -}) - -context("Test funcFormulaStrToList") -test_that("List of covariates are given, from lm or mixed model formulas",{ - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate`"),c("1Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate` + `2Covariate`"),c("1Covariate","2Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate` + `2Covariate` + `3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate`"),c("1Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate` + 1|`2Covariate`"),c("1Covariate","2Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate` + 1|`2Covariate` + 1|`3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate` + `2Covariate` + 1|`3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate` + 1|`2Covariate` + 
`3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate` + 1|`2Covariate` + 1|`3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate` + `2Covariate` + 1|`3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ 1|`1Covariate` + `2Covariate` + `3Covariate`"),c("1Covariate","2Covariate","3Covariate")) - expect_equal(funcFormulaStrToList("adCur ~ `1Covariate` + 1|`2Covariate` + `3Covariate`"),c("1Covariate","2Covariate","3Covariate")) -}) - -context("Test funcFormulaListToString") -test_that("The correct string formula for a lm or mixed model is created from a list of covariates",{ - expect_equal(funcFormulaListToString(NULL),c(NA,NA)) - expect_equal(funcFormulaListToString(c("1Covariate")),c("adCur ~ `1Covariate`",NA)) - expect_equal(funcFormulaListToString(c("1Covariate","2Covariate")),c("adCur ~ `1Covariate` + `2Covariate`",NA)) - expect_equal(funcFormulaListToString(c("1Covariate","2Covariate","3Covariate")),c("adCur ~ `1Covariate` + `2Covariate` + `3Covariate`",NA)) - expect_equal(funcFormulaListToString(c("1Covariate","2Covariate"),c("3Covariate")),c(NA,"adCur ~ `1Covariate` + `2Covariate` + 1|`3Covariate`")) - expect_equal(funcFormulaListToString(c("1Covariate","3Covariate"),c("2Covariate")),c(NA,"adCur ~ `1Covariate` + `3Covariate` + 1|`2Covariate`")) - expect_equal(funcFormulaListToString(c("2Covariate","3Covariate"),c("1Covariate")),c(NA,"adCur ~ `2Covariate` + `3Covariate` + 1|`1Covariate`")) - expect_equal(funcFormulaListToString(c("2Covariate"),c("1Covariate","3Covariate")),c(NA,"adCur ~ `2Covariate` + 1|`1Covariate` + 1|`3Covariate`")) - expect_equal(funcFormulaListToString(c("1Covariate"),c("2Covariate","3Covariate")),c(NA,"adCur ~ `1Covariate` + 1|`2Covariate` + 1|`3Covariate`")) - expect_equal(funcFormulaListToString(c("3Covariate"),c("1Covariate","2Covariate")),c(NA,"adCur ~ `3Covariate` + 
1|`1Covariate` + 1|`2Covariate`")) -}) \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/test-ValidateData/test-ValidateData.R --- a/maaslin-4450aa4ecc84/src/test-ValidateData/test-ValidateData.R Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,53 +0,0 @@ -c_strDir <- file.path(getwd( ),"..") - -source(file.path(c_strDir,"lib","Constants.R")) -source(file.path(c_strDir,"lib","ValidateData.R")) - -context("Test funcIsValid") -test_that("NA and NUll are false, all others are true",{ - expect_equal(funcIsValid(NA),FALSE) - expect_equal(funcIsValid(NULL),FALSE) - expect_equal(funcIsValid(1), TRUE) - expect_equal(funcIsValid("3"), TRUE) - expect_equal(funcIsValid(c("3","4")), TRUE) - expect_equal(funcIsValid(c(3,NA)), TRUE) - expect_equal(funcIsValid(""), TRUE) - expect_equal(funcIsValid(list()), TRUE) - expect_equal(funcIsValid(2.3), TRUE) - expect_equal(funcIsValid(TRUE), TRUE) - expect_equal(funcIsValid(FALSE), TRUE) - expect_equal(funcIsValid(as.factor(3)), TRUE) -}) - -context("Test funcIsValidString") -test_that("Test only strings are true",{ - expect_equal(funcIsValidString(NA),FALSE) - expect_equal(funcIsValidString(NULL),FALSE) - expect_equal(funcIsValidString(1), FALSE) - expect_equal(funcIsValidString("3"), TRUE) - expect_equal(funcIsValidString(c("3","4")), FALSE) - expect_equal(funcIsValidString(""), TRUE) - expect_equal(funcIsValidString(list()), FALSE) - expect_equal(funcIsValidString(2.3), FALSE) - expect_equal(funcIsValidString(TRUE), FALSE) - expect_equal(funcIsValidString(FALSE), FALSE) -}) - -context("Test funcIsValidFileName") -strFileSimpleRCFileAnswer = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteMatrixToReadConfigFile_SimpleAnswer.read.config") -strFileUseAllRCFileAnswer = file.path(c_strDir,c_strTestingDirectory,c_strCorrectAnswers,"FuncWriteMatrixToReadConfigFile_AllAnswer.read.config") - -test_that("Test only strings pointing to existing files 
are true",{ - expect_equal(funcIsValidFileName(NA),FALSE) - expect_equal(funcIsValidFileName(NULL),FALSE) - expect_equal(funcIsValidFileName(1), FALSE) - expect_equal(funcIsValidFileName("3"), FALSE) - expect_equal(funcIsValidFileName(c("3","4")), FALSE) - expect_equal(funcIsValidFileName(""), FALSE) - expect_equal(funcIsValidFileName(list()), FALSE) - expect_equal(funcIsValidFileName(2.3), FALSE) - expect_equal(funcIsValidFileName(TRUE), FALSE) - expect_equal(funcIsValidFileName(FALSE), FALSE) - expect_equal(funcIsValidFileName(strFileSimpleRCFileAnswer),TRUE) - expect_equal(funcIsValidFileName(strFileUseAllRCFileAnswer),TRUE) -}) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileAltKeyResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileAltKeyResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileAltSigResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileAltSigResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,9 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 
P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 -8 V1 Bacteria|8 V1 -0.000133233647710018 228 28 0.301159271814143 0.315830056496576 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-1FileResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileAltKeyResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileAltKeyResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 
+0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 -8 V2 Bacteria|1 V2 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -9 V2 Bacteria|2 V2 -0.000172924087271453 228 39 0.100173356878955 0.104501595020731 -10 V2 Bacteria|3 V2 0.000176541929148173 228 50 0.200213541203878 0.204974253925626 -11 V3 Bacteria|1 V3 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -12 V3 Bacteria|2 V3 -0.000172924087271453 228 39 0.0100173356878955 0.0104501595020731 -13 V3 Bacteria|3 V3 0.000176541929148173 228 50 0.0200213541203878 0.0204974253925626 -14 V3 Bacteria|4 V3 0.000233041055999211 228 54 0.0300309255350078 0.0306531472993643 -15 V3 Bacteria|5 V3 0.000170023412983991 228 28 0.140055803225588 0.144098213677034 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileAltSigResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileAltSigResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,19 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 
0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 -8 V1 Bacteria|8 V1 -0.000133233647710018 228 28 0.301159271814143 0.315830056496576 -9 V2 Bacteria|1 V2 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -10 V2 Bacteria|2 V2 -0.000172924087271453 228 39 0.100173356878955 0.104501595020731 -11 V2 Bacteria|3 V2 0.000176541929148173 228 50 0.200213541203878 0.204974253925626 -12 V2 Bacteria|4 V2 0.000233041055999211 228 54 0.300309255350078 0.306531472993643 -13 V3 Bacteria|1 V3 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -14 V3 Bacteria|2 V3 -0.000172924087271453 228 39 0.0100173356878955 0.0104501595020731 -15 V3 Bacteria|3 V3 0.000176541929148173 228 50 0.0200213541203878 0.0204974253925626 -16 V3 Bacteria|4 V3 0.000233041055999211 228 54 0.0300309255350078 0.0306531472993643 -17 V3 Bacteria|5 V3 0.000170023412983991 228 28 0.140055803225588 0.144098213677034 -18 V3 Bacteria|6 V3 -0.000129327171064622 228 29 0.250062525713049 0.251031674265311 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileResult.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-3FileResult.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -2 V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -3 V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -4 V1 Bacteria|4 
V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -5 V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -6 V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -7 V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 -8 V2 Bacteria|1 V2 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -9 V2 Bacteria|2 V2 -0.000172924087271453 228 39 0.100173356878955 0.104501595020731 -10 V2 Bacteria|3 V2 0.000176541929148173 228 50 0.200213541203878 0.204974253925626 -11 V3 Bacteria|1 V3 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -12 V3 Bacteria|2 V3 -0.000172924087271453 228 39 0.0100173356878955 0.0104501595020731 -13 V3 Bacteria|3 V3 0.000176541929148173 228 50 0.0200213541203878 0.0204974253925626 -14 V3 Bacteria|4 V3 0.000233041055999211 228 54 0.0300309255350078 0.0306531472993643 -15 V3 Bacteria|5 V3 0.000170023412983991 228 28 0.140055803225588 0.144098213677034 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAltKeyAnswer.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAltKeyAnswer.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1 +0,0 @@ -No significant data found. diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAltSigAnswer.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAltSigAnswer.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1 +0,0 @@ -No significant data found. 
diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAnswer.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncSummarizeDirectory-NoFileAnswer.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1 +0,0 @@ -No significant data found. diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices.read.config --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ -Matrix: 1 -Delimiter: TAB -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 -Read_TSV_Columns: 2,3,4,5,6 - - -Matrix: 2 -Delimiter: TAB -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: 2,3,4,5,6,7,8,9,10 -Read_TSV_Columns: 2,3,4 - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices1.tsv --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices1.tsv Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ - Sample1 Sample2 Sample3 Sample4 Sample5 -Feature1 11 12 13 14 15 -Feature2 21 22 23 24 25 -Feature3 31 32 33 34 35 -Feature4 41 42 43 44 45 -Feature5 51 52 53 54 55 -Feature6 61 62 63 64 65 -Feature7 71 72 73 74 75 -Feature8 81 82 83 84 85 -Feature9 91 92 93 94 95 -Feature10 101 102 103 104 105 -Feature11 111 112 113 114 115 -Feature12 121 122 123 124 125 -Feature13 131 132 133 134 135 -Feature14 141 142 143 144 145 -Feature15 151 152 153 154 155 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices2.tsv --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrices2.tsv Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,10 +0,0 @@ - Sample1 Sample2 Sample4 -Feature1 11 12 14 -Feature2 21 22 24 -Feature3 31 32 34 -Feature4 41 
42 44 -Feature5 51 52 54 -Feature6 61 62 64 -Feature7 71 72 74 -Feature8 81 82 84 -Feature9 91 92 94 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_AllAnswer.read.config --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_AllAnswer.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ -Matrix: AllMatrix -Delimiter: SPACE -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: 1,2,3,4,5 -Read_TSV_Columns: 10,11,12 - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_AppendAnswer.read.config --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_AppendAnswer.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ -Matrix: SimpleMatrix -Delimiter: TAB -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: - -Read_TSV_Columns: - - - -Matrix: SimpleMatrix -Delimiter: TAB -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: - -Read_TSV_Columns: - - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_SimpleAnswer.read.config --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteMatrixToReadConfigFile_SimpleAnswer.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ -Matrix: SimpleMatrix -Delimiter: TAB -Name_Row_Number: 1 -Name_Column_Number: 1 -Read_TSV_Rows: - -Read_TSV_Columns: - - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTableTempDF1.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTableTempDF1.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 diff -r 589169d452c0 -r 18774fa866d8 
maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTableTempDF2.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTableTempDF2.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTemp1.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTemp1.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,1 +0,0 @@ -Testing, 1,2,3 anything but that. \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTemp2.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTemp2.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,2 +0,0 @@ -Testing, 1,2,3 anything but that. -Testing, 1,2,3 anything but that. \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTempDF1.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTempDF1.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTempDF2.txt --- a/maaslin-4450aa4ecc84/src/testing/answers/FuncWriteTempDF2.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,8 +0,0 @@ - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 - V1 V2 V3 -1 1 2 3 -2 11 22 33 -3 111 222 333 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/1Matrix.read.config --- a/maaslin-4450aa4ecc84/src/testing/input/1Matrix.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ -Matrix: Matrix1 -Delimiter: TAB -Read_TSV_Rows: - -Read_TSV_Columns: - diff -r 
589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/2Matrix.read.config --- a/maaslin-4450aa4ecc84/src/testing/input/2Matrix.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,9 +0,0 @@ -Matrix: Matrix1 -Delimiter: TAB -Read_TSV_Rows: - -Read_TSV_Columns: - - -Matrix: Matrix2 -Delimiter: TAB -Read_TSV_Rows: 2-4 -Read_TSV_Columns: 2-4 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/3Matrix.read.config --- a/maaslin-4450aa4ecc84/src/testing/input/3Matrix.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,14 +0,0 @@ -Matrix: Matrix1 -Delimiter: TAB -Read_TSV_Rows: - -Read_TSV_Columns: - - -Matrix: Matrix2 -Delimiter: TAB -Read_TSV_Rows: 2-4 -Read_TSV_Columns: 2-4 - -Matrix: Matrix3 -Delimiter: TAB -Read_TSV_Rows: 2-10 -Read_TSV_Columns: 2,3,5 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/TestMaaslin.read.config --- a/maaslin-4450aa4ecc84/src/testing/input/TestMaaslin.read.config Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,7 +0,0 @@ -Matrix: Metadata -Delimiter: TAB -Read_TSV_Columns: 2,3,4 - -Matrix: Abundance -Delimiter: TAB -Read_TSV_Columns: 5-14 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/TestMaaslin.tsv --- a/maaslin-4450aa4ecc84/src/testing/input/TestMaaslin.tsv Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,251 +0,0 @@ -sample activity age dx Archaea|Euryarchaeota|Methanobacteria|Methanobacteriales|Methanobacteriaceae Archaea|Euryarchaeota|Methanobacteria|Methanobacteriales|Methanobacteriaceae|Methanobrevibacter Archaea|Euryarchaeota|Methanobacteria|Methanobacteriales|Methanobacteriaceae|Methanosphaera Bacteria Bacteria|Actinobacteria|Actinobacteria Bacteria|Actinobacteria|Actinobacteria|Actinomycetales Bacteria|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae 
Bacteria|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|Actinomyces Bacteria|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|Varibaculum Bacteria|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|unclassified -100001 19 CD 0 0 0 1 0.177746 0.00108382 0.00108382 0.00108382 0 0 -100003 26 CD 0 0 0 1 0.0213904 0.000822707 0.000822707 0.000822707 0 0 -100009 55 UC 0 0 0 1 0.000457666 0 0 0 0 0 -100015 57 CD 0.000457038 0.000457038 0 0.999543 0.0393053 0.000457038 0 0 0 0 -100016 46 0 0 0 1 0.0344828 0 0 0 0 0 -100043 21 0 0 0 1 0.0133779 0 0 0 0 0 -100046 61 CD 0 0 0 1 0.0552083 0.00260417 0.00260417 0.00260417 0 0 -100047 31 UC 0 0 0 1 0.0135002 0 0 0 0 0 -100048 50 UC 0 0 0 0.99944 0.242093 0.000279877 0.000279877 0.000279877 0 0 -100049 46 UC 0.00139909 0.00104932 0.000349773 0.998601 0.0255334 0 0 0 0 0 -100051 36 UC 0 0 0 1 0.0533784 0 0 0 0 0 -100052 23 CD 0 0 0 1 0.00985793 0 0 0 0 0 -100058 21 CD 0 0 0 1 0.0308166 0.000513611 0.000513611 0.000513611 0 0 -100060 60 CD 0 0 0 1 0.00132406 0 0 0 0 0 -100062 12 0 0 0 1 0.0115964 0.000386548 0.000386548 0.000386548 0 0 -100065 20 CD 0 0 0 1 0.0241899 0.00136924 0 0 0 0 -100068 33 0 0 0 1 0.0114533 0.000254518 0 0 0 0 -100070 12 UC 0 0 0 1 0.00128991 0 0 0 0 0 -100071 43 UC 0 0 0 1 0.00809444 0 0 0 0 0 -100072 18 CD 0 0 0 1 0.0171569 0.00122549 0.00122549 0.00122549 0 0 -100074 54 CD 0 0 0 1 0.0289179 0 0 0 0 0 -100075 19 CD 0 0 0 1 0.00620767 0 0 0 0 0 -100077 15 CD 0 0 0 1 0.0122675 0.00118718 0.00118718 0.00118718 0 0 -100078 31 0 0 0 1 0.00114635 0 0 0 0 0 -100080 74 UC 0 0 0 1 0.242624 0.00188324 0.00125549 0.00125549 0 0 -100083 37 CD 0 0 0 1 0.0517868 0.000302847 0 0 0 0 -100084 12 CD 0 0 0 1 0.00522718 0.000402091 0.000402091 0.000402091 0 0 -100085 34 CD 0 0 0 1 0.0114848 0 0 0 0 0 -100086 80 UC 0 0 0 1 0.0117933 0 0 0 0 0 -100087 8 CD 0 0 0 1 0.00256598 0 0 0 0 0 -100088 29 CD 0 0 0 1 0.0107383 0 0 0 0 0 -100089 22 UC 0 0 0 1 0.0128598 0 0 0 0 0 -100090 13 0 0 0 1 
0.0109111 0.000272777 0 0 0 0 -100091 23 CD 0 0 0 1 0.0398703 0.000324149 0.000324149 0.000324149 0 0 -100092 62 CD 0 0 0 1 0.061049 0.000429923 0.000429923 0 0.000429923 0 -100095 41 UC 0 0 0 1 0.105263 0 0 0 0 0 -100096 18 UC 0 0 0 1 0.0884817 0.00026178 0.00026178 0.00026178 0 0 -100099 83 UC 0 0 0 1 0.0463725 0 0 0 0 0 -100100 53 CD 0 0 0 1 0.0272953 0.00330852 0.00330852 0.00330852 0 0 -100101 58 CD 0 0 0 1 0.00897724 0.000320616 0.000320616 0.000320616 0 0 -100102 62 CD 0 0 0 1 0.0607595 0.00101266 0 0 0 0 -100104 44 UC 0 0 0 1 0.0655521 0.00307963 0.00307963 0.00307963 0 0 -100105 55 CD 0 0 0 1 0.00692259 0 0 0 0 0 -100106 26 UC 0 0 0 1 0.0195765 0 0 0 0 0 -100107 12 CD 0 0 0 1 0.0693363 0.000741565 0.000741565 0.000741565 0 0 -100109 48 UC 0 0 0 1 0.00874636 0.000416493 0.000416493 0.000416493 0 0 -100115 15 CD 0 0 0 1 0.000385802 0.000385802 0.000385802 0.000385802 0 0 -100117 33 UC 0 0 0 1 0.0198915 0 0 0 0 0 -100128 8 0 0 0 1 0.176075 0.00100806 0.00100806 0.00100806 0 0 -100143 40 UC 0 0 0 1 0.0766177 0.0023819 0.00198491 0.00198491 0 0 -100144 66 CD 0 0 0 1 0.0162602 0.000706964 0.000706964 0.000706964 0 0 -100150 9 CD 0 0 0 1 0.0476627 0 0 0 0 0 -100151 32 CD 0 0 0 1 0.00895857 0 0 0 0 0 -100154 44 CD 0 0 0 1 0.0187793 0.000521648 0.000521648 0.000521648 0 0 -100155 11 UC 0 0 0 1 0.0318149 0.000723066 0.000723066 0.000723066 0 0 -100156 20 CD 0 0 0 1 0.0827749 0.000394166 0.000394166 0.000394166 0 0 -100158 47 CD 0 0 0 1 0.0038674 0 0 0 0 0 -100160 48 CD 0 0 0 1 0.520226 0.00446163 0.000297442 0.000297442 0 0 -100161 53 CD 0 0 0 1 0.0225443 0 0 0 0 0 -100162 36 UC 0 0 0 1 0.0274194 0.000322581 0.000322581 0.000322581 0 0 -100163 20 CD 0 0 0 1 0.0137345 0.000654022 0.000654022 0.000654022 0 0 -100164 77 CD 0 0 0 1 0.0129185 0 0 0 0 0 -100165 10 CD 0 0 0 1 0.034678 0 0 0 0 0 -100166 45 UC 0 0 0 1 0.0118212 0 0 0 0 0 -100167 31 UC 0 0 0 1 0.189468 0 0 0 0 0 -100168 15 CD 0 0 0 1 0.0124306 0 0 0 0 0 -100169 23 UC 0 0 0 1 0.0734036 0 0 0 0 0 -100170 30 UC 
0 0 0 1 0.141724 0 0 0 0 0 -100171 17 CD 0 0 0 1 0.132795 0.0013459 0.0013459 0.0013459 0 0 -100172 61 UC 0 0 0 1 0.0407113 0 0 0 0 0 -100173 53 CD 0 0 0 1 0.0425093 0 0 0 0 0 -100174 10 CD 0 0 0 1 0.00385852 0.00192926 0 0 0 0 -100175 14 UC 0 0 0 1 0.000370782 0 0 0 0 0 -100176 26 CD 0 0 0 1 0.00311769 0.00155885 0.00155885 0.00155885 0 0 -100177 39 UC 0 0 0 1 0.0112423 0.000562114 0.000562114 0.000562114 0 0 -100178 51 CD 0 0 0 1 0.0013986 0 0 0 0 0 -100180 50 UC 0 0 0 1 0.138815 0.00382409 0.00305927 0.00305927 0 0 -100181 67 UC 0 0 0 1 0.0168745 0.000366838 0.000366838 0.000366838 0 0 -100182 69 UC 0 0 0 1 0.00409277 0 0 0 0 0 -100183 24 UC 0 0 0 1 0.12679 0.000298329 0.000298329 0.000298329 0 0 -100184 48 CD 0 0 0 1 0.0524005 0.00029274 0 0 0 0 -100187 15 CD 0 0 0 1 0.0101502 0.00974421 0.00933821 0.0089322 0 0.000406009 -100188 53 UC 0 0 0 1 0.011194 0 0 0 0 0 -100190 9 CD 0 0 0 1 0.106939 0.00122449 0.00122449 0.000408163 0.000816327 0 -100191 41 CD 0 0 0 1 0.00180245 0 0 0 0 0 -100192 47 CD 0 0 0 1 0.0264201 0.000880669 0.000440335 0.000440335 0 0 -100195 65 CD 0.000817996 0.000817996 0 0.999182 0.121063 0 0 0 0 0 -100196 51 0 0 0 1 0.184506 0.000832986 0.000832986 0.000832986 0 0 -100197 7 CD 0 0 0 1 0.00813743 0 0 0 0 0 -100198 40 CD 0 0 0 1 0.00676299 0.000614817 0.000614817 0.000614817 0 0 -100200 35 CD 0 0 0 1 0.0204765 0 0 0 0 0 -100202 54 UC 0 0 0 1 0.00110538 0.00036846 0.00036846 0.00036846 0 0 -100203 29 CD 0 0 0 1 0.00220022 0 0 0 0 0 -100204 18 UC 0 0 0 1 0.174194 0.00286738 0.00286738 0.00286738 0 0 -100205 16 CD 0 0 0 1 0.0508814 0.00120192 0.00120192 0.00120192 0 0 -100206 53 UC 0 0 0 1 0.00644053 0 0 0 0 0 -100207 15 CD 0.000380373 0.000380373 0 0.99962 0.0931913 0.00342335 0.00304298 0.00304298 0 0 -100209 11 CD 0 0 0 1 0.0976948 0.000365898 0 0 0 0 -100210 61 UC 0 0 0 1 0.00372439 0 0 0 0 0 -100211 74 UC 0 0 0 1 0 0 0 0 0 0 -100212 66 CD 0 0 0 1 0.0027735 0 0 0 0 0 -100214 23 UC 0 0 0 1 0.0157295 0 0 0 0 0 -100216 60 UC 0 0 0 1 0.0135952 
0.000302115 0 0 0 0 -100217 20 CD 0 0 0 1 0.0894871 0.00109131 0.00109131 0.00109131 0 0 -100219 50 UC 0 0 0 1 0.0755957 0 0 0 0 0 -100220 27 UC 0 0 0 1 0.0148871 0.00102669 0.00102669 0.00102669 0 0 -100221 30 UC 0 0 0 1 0.0252178 0.000458505 0.000458505 0.000458505 0 0 -100222 33 CD 0 0 0 1 0.0326907 0 0 0 0 0 -100224 31 UC 0 0 0 1 0.00961538 0 0 0 0 0 -100225 27 CD 0 0 0 1 0.034306 0 0 0 0 0 -100226 62 CD 0.00880214 0.00880214 0 0.991198 0.097589 0 0 0 0 0 -100227 51 UC 0 0 0 1 0.0054757 0.00239562 0.00171116 0.00171116 0 0 -100228 35 UC 0.00103663 0.00103663 0 0.998963 0.0967519 0.00310988 0.00310988 0.00310988 0 0 -100229 44 UC 0 0 0 1 0.0243665 0 0 0 0 0 -100233 19 UC 0 0 0 1 0.112604 0.0140755 0.0140755 0 0.0140755 0 -100234 7 CD 0 0 0 1 0.164645 0 0 0 0 0 -7003A 44 CD 0 0 0 0.989683 0.0140015 0.00221076 0 0 0 0 -7003B 45 CD 0 0 0 0.94 0.008 0.002 0.002 0.002 0 0 -7007 28 CD 0 0 0 0.976701 0 0 0 0 0 0 -7010 41 CD 0 0 0 0.923429 0.00457143 0.00114286 0 0 0 0 -7016 36 UC 0 0 0 1 0 0 0 0 0 0 -7018 30 UC 0 0 0 0.972956 0.0197628 0.00062409 0.00062409 0.00062409 0 0 -7021 39 CD 0 0 0 0.803437 0.00320997 0 0 0 0 0 -7022 34 CD 0 0 0 0.696739 0.00326087 0.000362319 0 0 0 0 -7035 45 CD 0 0 0 0.945148 0.0434599 0.00126582 0 0 0 0 -7037 32 CD 0 0 0 0.891882 0.0167193 0.00018577 0.00018577 0.00018577 0 0 -7039 50 CD 0 0 0 0.660474 0.000464468 0 0 0 0 0 -7043 54 UC 0 0 0 0.787327 0.00460059 0.00146382 0.00104559 0.00104559 0 0 -7049 38 CD 0 0 0 0.494214 0.0012054 0.0012054 0.00048216 0.00048216 0 0 -7052 56 CD 0 0 0 0.946339 0.00145591 0.000831947 0.00062396 0.00062396 0 0 -7059 40 CD 0 0 0 0.565217 0.00093838 0 0 0 0 0 -7061 44 CD 0 0 0 0.905832 0 0 0 0 0 0 -7064 53 CD 0 0 0 0.938795 0.00094162 0.00094162 0 0 0 0 -7072 49 CD 0 0 0 0.987552 0.00033195 0 0 0 0 0 -7077A 37 UC 0 0 0 0.831933 0.00280112 0.00280112 0.00280112 0.00280112 0 0 -7077B 38 UC 0 0 0 0.961957 0.0144928 0.00181159 0 0 0 0 -7079 35 CD 0 0 0 0.962963 0.00854701 0.002849 0 0 0 0 -7085 55 CD 0 0 0 
0.945273 0.00403438 0 0 0 0 0 -7094A 34 CD 0 0 0 0.927184 0 0 0 0 0 0 -7094B 35 CD 0 0 0 0.971014 0 0 0 0 0 0 -7094C 35 CD 0 0 0 0.931592 0 0 0 0 0 0 -7095 49 CD 0 0 0 0.449966 0.000671592 0.000671592 0 0 0 0 -7097 53 CD 0 0 0 0.973372 0.0109079 0 0 0 0 0 -7102 40 UC 0 0 0 0.987827 0.0150286 0.000150286 0.000150286 0.000150286 0 0 -7111 27 UC 0 0 0 0.942429 0.00287853 0.00172712 0 0 0 0 -7116 43 CD 0 0 0 0.622163 0.0153257 0.0126732 0.0120837 0.0120837 0 0 -7117 41 UC 0 0 0 0.941901 0.00142653 0.000129685 0.000129685 0.000129685 0 0 -7123 53 CD 0 0 0 0.99478 0.0102773 0.000326264 0.000326264 0.000326264 0 0 -7124 36 UC 0 0 0 0.736072 0.0036567 0.0004302 0.0002151 0.0002151 0 0 -7125 24 UC 0 0 0 0.518562 0.000294638 0 0 0 0 0 -7126 48 CD 0 0 0 0.894489 0.00412614 0.000442087 0 0 0 0 -7129 57 CD 0.00017337 0.00017337 0 0.822642 0.0064147 0.00017337 0.00017337 0.00017337 0 0 -7130 48 CD 0 0 0 0.948624 0.0182672 0 0 0 0 0 -7134 47 UC 0 0 0 0.614877 0.00184648 0.000527565 0.000263783 0.000263783 0 0 -7135 35 CD 0 0 0 0.861962 0 0 0 0 0 0 -7141 30 CD 0 0 0 0.971996 0.0035819 0.00097688 0.000651254 0.000651254 0 0 -7145 27 UC 0 0 0 0.973958 0.0199653 0 0 0 0 0 -7153 44 CD 0 0 0 0.871961 0.00277842 0.00046307 0 0 0 0 -7154 47 CD 0 0 0 0.978239 0.0386858 0.00583132 0.00526241 0.00526241 0 0 -7156 40 UC 0 0 0 0.760829 0.00414508 0 0 0 0 0 -7160 22 UC 0 0 0 0.712315 0.000492611 0.000246305 0 0 0 0 -7164 46 CD 0 0 0 0.980083 0.00239506 0.00201689 0.00201689 0.00201689 0 0 -7168 30 UC 0 0 0 0.679389 0.0014313 0 0 0 0 0 -7174 43 UC 0 0 0 0.976899 0.00496124 0.000155039 0.000155039 0.000155039 0 0 -7175A 36 CD 0 0 0 0.990166 0.00245851 0 0 0 0 0 -7175B 38 CD 0 0 0 0.969677 0.00129032 0 0 0 0 0 -7178A 37 CD 0 0 0 0.832765 0.00170648 0 0 0 0 0 -7178B 37 CD 0 0 0 0.794944 0.00351124 0 0 0 0 0 -7181 37 CD 0 0 0 0.817115 0.0185233 0 0 0 0 0 -7185 39 CD 0 0 0 0.933048 0.0108618 0.00623219 0.00534188 0.00534188 0 0 -7200 32 CD 0 0 0 0.930969 0.000270709 0.000270709 0.000270709 
0.000270709 0 0 -7221 36 CD 0 0 0 0.991701 0.00173418 0.000495479 0.000495479 0.000495479 0 0 -7225 75 CD 0 0 0 0.989733 0.000684463 0.000684463 0.000684463 0.000684463 0 0 -7228 30 UC 0 0 0 0.890719 0.000906892 0 0 0 0 0 -7233 33 CD 0 0 0 0.439987 0.0037087 0.00202293 0 0 0 0 -7242A 56 CD 0 0 0 0.889764 0.00787402 0.00787402 0.00787402 0.00787402 0 0 -7242B 57 CD 0 0 0 0.853061 0.0122449 0.00272109 0.00136054 0.00136054 0 0 -7250 20 CD 0 0 0 0.99722 0.0152911 0.000173762 0.000173762 0.000173762 0 0 -7251 32 CD 0 0 0 0.997058 0.00187216 0 0 0 0 0 -7265 57 UC 0.00043911 0.00043911 0 0.915544 0.00102459 0.00014637 0.00014637 0.00014637 0 0 -7267 37 CD 0 0 0 0.956056 0.00406204 0 0 0 0 0 -7284 38 CD 0 0 0 0.973013 0.0009995 0 0 0 0 0 -7287 51 UC 0 0 0 0.601478 0.00564972 0 0 0 0 0 -7295 35 CD 0 0 0 0.825761 0.0052249 0 0 0 0 0 -7298 57 Healthy 0 0 0 0.996292 0.00370828 0 0 0 0 0 -7313 42 Healthy 0 0 0 0.994539 0.00204778 0 0 0 0 0 -7325 44 CD 0 0 0 0.767925 0.00283019 0.00188679 0 0 0 0 -7347 36 UC 0 0 0 1 0.00273224 0 0 0 0 0 -7352 46 CD 0 0 0 0.834951 0.00970874 0.00970874 0 0 0 0 -7355 26 UC 0 0 0 0.935735 0.00389484 0 0 0 0 0 -7356 50 Healthy 0 0 0 0.994746 0.0140105 0 0 0 0 0 -7360 60 Healthy 0 0 0 0.871642 0.00298507 0.00298507 0 0 0 0 -7361 58 Healthy 0 0 0 0.975364 0.00447928 0 0 0 0 0 -7365 52 Healthy 0 0 0 0.992902 0.00177462 0 0 0 0 0 -7385 60 Healthy 0 0 0 0.947635 0.00168919 0 0 0 0 0 -7429 31 CD 0 0 0 0.806763 0.000536769 0 0 0 0 0 -7454 37 CD 0 0 0 0.873016 0.00907029 0.00226757 0 0 0 0 -7584 44 UC 0 0 0 0.939158 0.00234009 0 0 0 0 0 -7594 28 CD 0 0 0 0.993401 0 0 0 0 0 0 -7610 69 Healthy 0 0 0 0.982582 0.00102459 0.00102459 0 0 0 0 -7614 60 CD 0 0 0 0.996437 0 0 0 0 0 0 -7615 24 CD 0 0 0 0.870656 0 0 0 0 0 0 -7621 64 UC 0 0 0 0.987464 0.000569801 0 0 0 0 0 -7624 27 CD 0 0 0 0.965767 0.00048216 0 0 0 0 0 -7632 25 CD 0 0 0 0.977591 0.00280112 0.000933707 0 0 0 0 -7662 31 Healthy 0 0 0 1 0.00213828 0 0 0 0 0 -7664 51 Healthy 0 0 0 0.987382 0.00757098 0 0 
0 0 0 -7749 44 UC 0 0 0 0.954764 0.000962464 0 0 0 0 0 -7775 56 UC 0 0 0 0.989747 0.0150376 0 0 0 0 0 -7844 31 Healthy 0 0 0 1 0.0167845 0.000441696 0.000441696 0.000441696 0 0 -7848 24 Healthy 0 0 0 1 0.0254777 0.00106157 0.00106157 0.00106157 0 0 -7855 24 Healthy 0 0 0 1 0.00314465 0.000628931 0.000628931 0.000628931 0 0 -7858 26 Healthy 0 0 0 1 0.0264447 0 0 0 0 0 -7859 53 UC 0 0 0 0.986404 0.00407886 0 0 0 0 0 -7860 22 Healthy 0 0 0 1 0.0560376 0 0 0 0 0 -7861 23 Healthy 0 0 0 1 0.0905612 0 0 0 0 0 -7862 26 Healthy 0 0 0 1 0.0062819 0 0 0 0 0 -7870 26 Healthy 0 0 0 1 0.0276699 0 0 0 0 0 -7871 27 UC 0 0 0 0.52233 0.00194175 0.00194175 0 0 0 0 -7879 29 Healthy 0 0 0 1 0.0378979 0 0 0 0 0 -7899 23 Healthy 0 0 0 1 0.00107875 0 0 0 0 0 -7904 23 Healthy 0 0 0 1 0.0112933 0.000364299 0 0 0 0 -7906 23 Healthy 0 0 0 1 0.00332717 0 0 0 0 0 -7908 23 Healthy 0 0 0 1 0.0453906 0.00070373 0.00070373 0.00070373 0 0 -7909 24 Healthy 0 0 0 1 0.0708354 0 0 0 0 0 -7910 23 Healthy 0 0 0 1 0.0865063 0 0 0 0 0 -7911 25 Healthy 0 0 0 1 0.100825 0.00030553 0.00030553 0.00030553 0 0 -7912 24 Healthy 0 0 0 1 0.125173 0.00138568 0.000923788 0.000923788 0 0 -MGH100512 0 0 0 1 0.00465942 0.000221877 0.000221877 0.000221877 0 0 -MGH101598 0 0 0 1 0.00281796 0 0 0 0 0 -MGH101635 0 0 0 1 0.00827316 0.000300842 0.000300842 0.000300842 0 0 -MGH101746 0 0 0 1 0.00923206 0 0 0 0 0 -MGH102376 0 0 0 0.948276 0.00229885 0 0 0 0 0 -MGH102691 0 0 0 1 0.00034002 0.00034002 0 0 0 0 -MGH102692 0 0 0 1 0.00077101 0.000192753 0 0 0 0 -MGH102725 0 0 0 1 0.00114443 0.000228885 0 0 0 0 -MGH102806 0 0 0 0.815789 0.00657895 0.00657895 0.00328947 0.00328947 0 0 -MGH103070 0 0 0 1 0.0003861 0 0 0 0 0 -MGH103120 0 0 0 1 0.00148258 0.000185322 0 0 0 0 -MGH103121 0 0 0 1 0.00103869 0 0 0 0 0 -MGH103405 0.000157406 0.000157406 0 0.999843 0.00629624 0.0053518 0 0 0 0 -MGH103562 0 0 0 1 0.000701508 0 0 0 0 0 -MGH103629 0 0 0 1 0.00153846 0.00153846 0 0 0 0 -MGH103803 0 0 0 0.988064 0.00132626 0.00132626 0 0 0 0 
-MGH103909 0 0 0 1 0.00116356 0 0 0 0 0 -MGH103963 0 0 0 0.885949 0.00547445 0.00456204 0 0 0 0 -MGH104169 0 0 0 1 0.00047672 0 0 0 0 0 -MGH104504 0 0 0 1 0.000837521 0.000279174 0 0 0 0 -MGH104890 0 0 0 1 0.00117233 0.00104207 0 0 0 0 -MGH105371 NA NA NA 0 0 0 1 0.000169895 0 0 0 0 0 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/TestMatrix.tsv --- a/maaslin-4450aa4ecc84/src/testing/input/TestMatrix.tsv Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,16 +0,0 @@ - Sample1 Sample2 Sample3 Sample4 Sample5 -Feature1 11 12 13 14 15 -Feature2 21 22 23 24 25 -Feature3 31 32 33 34 35 -Feature4 41 42 43 44 45 -Feature5 51 52 53 54 55 -Feature6 61 62 63 64 65 -Feature7 71 72 73 74 75 -Feature8 81 82 83 84 85 -Feature9 91 92 93 94 95 -Feature10 101 102 103 104 105 -Feature11 111 112 113 114 115 -Feature12 121 122 123 124 125 -Feature13 131 132 133 134 135 -Feature14 141 142 143 144 145 -Feature15 151 152 153 154 155 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/1/FuncSummarizeDirectory-1.txt --- a/maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/1/FuncSummarizeDirectory-1.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,11 +0,0 @@ -Variable Feature Value Coefficient N N not 0 P-value Q-value -V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.201094523365195 0.215761136458804 -V1 Bacteria|8 V1 -0.000133233647710018 
228 28 0.30115927181414323 0.3158300564965765 -V1 Bacteria|9 V1 -0.000390699948289467 228 127 0.40153721372588625 0.4190230198578424 -V1 Bacteria|10 V1 0.000260009506485308 228 110 0.50180198778634843 0.5211433233598216 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-1.txt --- a/maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-1.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,11 +0,0 @@ -Variable Feature Value Coefficient N N not 0 P-value Q-value -V1 Bacteria|1 V1 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -V1 Bacteria|2 V1 -0.000172924087271453 228 39 0.00173356878954918 0.04501595020731 -V1 Bacteria|3 V1 0.000176541929148173 228 50 0.00213541203877865 0.0497425392562556 -V1 Bacteria|4 V1 0.000233041055999211 228 54 0.00309255350077782 0.0653147299364275 -V1 Bacteria|5 V1 0.000170023412983991 228 28 0.0055803225587723 0.0982136770343924 -V1 Bacteria|6 V1 -0.000129327171064622 228 29 0.00625257130491581 0.103167426531111 -V1 Bacteria|7 V1 -0.00246205053294096 228 227 0.20109452336519472 0.215761136458804 -V1 Bacteria|8 V1 -0.000133233647710018 228 28 0.30115927181414323 0.3158300564965765 -V1 Bacteria|9 V1 -0.000390699948289467 228 127 0.40153721372588625 0.4190230198578424 -V1 Bacteria|10 V1 0.000260009506485308 228 110 0.50180198778634843 0.5211433233598216 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-2.txt --- a/maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-2.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,11 +0,0 @@ -Variable Feature Value Coefficient N N not 0 P-value Q-value -V2 Bacteria|1 V2 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -V2 Bacteria|2 
V2 -0.000172924087271453 228 39 0.100173356878954918 0.104501595020731 -V2 Bacteria|3 V2 0.000176541929148173 228 50 0.200213541203877865 0.20497425392562556 -V2 Bacteria|4 V2 0.000233041055999211 228 54 0.300309255350077782 0.30653147299364275 -V2 Bacteria|5 V2 0.000170023412983991 228 28 0.40055803225587723 0.440982136770343924 -V2 Bacteria|6 V2 -0.000129327171064622 228 29 0.500625257130491581 0.5103167426531111 -V2 Bacteria|7 V2 -0.00246205053294096 228 227 0.620109452336519472 0.6215761136458804 -V2 Bacteria|8 V2 -0.000133233647710018 228 28 0.730115927181414323 0.73158300564965765 -V2 Bacteria|9 V2 -0.000390699948289467 228 127 0.840153721372588625 0.84190230198578424 -V2 Bacteria|10 V2 0.000260009506485308 228 110 0.950180198778634843 0.95211433233598216 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-3.txt --- a/maaslin-4450aa4ecc84/src/testing/input/funcSummarizeDirectory/3/FuncSummarizeDirectory-3.txt Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,11 +0,0 @@ -Variable Feature Value Coefficient N N not 0 P-value Q-value -V3 Bacteria|1 V3 0.000376948962983632 228 78 0.000337563617350514 0.0140710728916636 -V3 Bacteria|2 V3 -0.000172924087271453 228 39 0.0100173356878954918 0.0104501595020731 -V3 Bacteria|3 V3 0.000176541929148173 228 50 0.0200213541203877865 0.020497425392562556 -V3 Bacteria|4 V3 0.000233041055999211 228 54 0.0300309255350077782 0.030653147299364275 -V3 Bacteria|5 V3 0.000170023412983991 228 28 0.140055803225587723 0.1440982136770343924 -V3 Bacteria|6 V3 -0.000129327171064622 228 29 0.2500625257130491581 0.25103167426531111 -V3 Bacteria|7 V3 -0.00246205053294096 228 227 0.3620109452336519472 0.36215761136458804 -V3 Bacteria|8 V3 -0.000133233647710018 228 28 0.4730115927181414323 0.473158300564965765 -V3 Bacteria|9 V3 -0.000390699948289467 228 127 0.5840153721372588625 0.584190230198578424 -V3 Bacteria|10 
V3 0.000260009506485308 228 110 0.6950180198778634843 0.695211433233598216 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/testing/tmp/.keep diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/src/transpose.py --- a/maaslin-4450aa4ecc84/src/transpose.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,86 +0,0 @@ -#!/usr/bin/env python -####################################################################################### -# This file is provided under the Creative Commons Attribution 3.0 license. -# -# You are free to share, copy, distribute, transmit, or adapt this work -# PROVIDED THAT you attribute the work to the authors listed below. -# For more information, please see the following web page: -# http://creativecommons.org/licenses/by/3.0/ -# -# This file is a component of the SflE Scientific workFLow Environment for reproducible -# research, authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Curtis Huttenhower, chuttenh@hsph.harvard.edu). -# -# If you use this environment, the included scripts, or any related code in your work, -# please let us know, sign up for the SflE user's group (sfle-users@googlegroups.com), -# pass along any issues or feedback, and we'll let you know as soon as a formal citation -# is available. -####################################################################################### - -""" -Examples -~~~~~~~~ - -``data.pcl``:: - - a b - c d - e f - -``Examples``:: - - $ transpose.py < data.pcl - a c e - b d f - - $ echo "a b c" | transpose.py - a - b - c - -.. testsetup:: - - from transpose import * -""" - -import argparse -import csv -import sys - -def transpose( aastrIn, ostm ): - """ - Outputs the matrix transpose of the input tab-delimited rows. - - :param aastrIn: Split lines from which data are read. - :type aastrIn: collection of string collections - :param ostm: Output stream to which transposed rows are written. 
- :type ostm: output stream - - >>> aastrIn = [list(s) for s in ("ab", "cd", "ef")] - >>> transpose( aastrIn, sys.stdout ) #doctest: +NORMALIZE_WHITESPACE - a c e - b d f - - >>> transpose( [list("abc")], sys.stdout ) #doctest: +NORMALIZE_WHITESPACE - a - b - c - """ - - aastrLines = [a for a in aastrIn] - csvw = csv.writer( ostm, csv.excel_tab ) - for iRow in range( len( aastrLines[0] ) ): - csvw.writerow( [aastrLines[iCol][iRow] for iCol in range( len( aastrLines ) )] ) - -argp = argparse.ArgumentParser( prog = "transpose.py", - description = """Transposes a tab-delimited text matrix. - -The transposition process is robust to missing elements and rows of differing lengths.""" ) -__doc__ = "::\n\n\t" + argp.format_help( ).replace( "\n", "\n\t" ) + __doc__ - -def _main( ): - args = argp.parse_args( ) - transpose( csv.reader( sys.stdin, csv.excel_tab ), sys.stdout ) - -if __name__ == "__main__": - _main( ) diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/test-data/maaslin_input --- a/maaslin-4450aa4ecc84/test-data/maaslin_input Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,30 +0,0 @@ -sample Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 -Age 87 78 3 2 32 10 39 96 -Cohort Healthy Healthy Healthy Healthy IBD IBD IBD IBD -Favorite_color Yellow Blue Green Yellow Green Blue Green Blue -Height 60 72 63 67 71 65 61 64 -Sex 0 1 0 1 1 0 1 0 -Smoking 0 0 1 0 1 1 1 0 -Star_Trek_Fan 1 1 0 0 1 0 0 1 -Weight 151 258 195 172 202 210 139 140 -Bacteria 1 1 1 1 1 1 1 1 -Bacteria|Actinobacteria|Actinobacteria 0.0507585 0.252153 0.161725 0.0996769 0.144075 0.00592628 0.0399472 0.0663809 -Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium|1 0.0507585 0.0861117 0.00168464 0.0011966 0.0164305 0.00592628 0.0367439 0.0663809 -Bacteria|Actinobacteria|Actinobacteria|Coriobacteriales|Coriobacteriaceae|1008 0 0.166041 0.16004 0.0984803 0.127644 0 0.00320332 0 
-Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales 0.210385 0.0229631 0.154874 0.212157 0.044465 0.0861681 0.349727 0.29982 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|101 0.0110852 0.0229631 0.019991 0.0329065 0.044465 0.020979 0 0.0450837 -Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Prevotellaceae|1010 0.1993 0 0.134883 0.179251 0 0.065189 0.349727 0.254737 -Bacteria|Firmicutes 0.738856 0.719806 0.67655 0.668541 0.663381 0.730117 0.417939 0.443231 -Bacteria|Firmicutes|Bacilli|Lactobacillales 0.37713 0.0119232 0.0982704 0.102549 0.45307 0.13903 0.0192199 0 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|1023 0.290198 0.0119232 0 0.00538471 0.351818 0.0321204 0.0192199 0 -Bacteria|Firmicutes|Bacilli|Lactobacillales|Unclassified|1013 0.0869312 0 0.0982704 0.0971641 0.101253 0.10691 0 0 -Bacteria|Firmicutes|Clostridia|Clostridiales 0.29755 0.562817 0.503145 0.388656 0.143561 0.142349 0.271528 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae 0.233372 0.41157 0.423967 0.329065 0.142226 0.142349 0.266817 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Anaerostipes|1026 0 0 0.143194 0 0.131957 0.142349 0.228754 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Roseburia|1032 0.233372 0.41157 0.280773 0.329065 0.010269 0 0.0380629 0 -Bacteria|Firmicutes|Clostridia|Clostridiales|Ruminococcaceae|1156 0.0641774 0.151248 0.0791779 0.0595908 0.00133498 0 0.00471076 0 -Bacteria|Firmicutes|Erysipelotrichi|Erysipelotrichales|Erysipelotrichaceae|Coprobacillus|1179 0 0.00971517 0.0049416 0.123489 0 0.380586 0 0.380998 -Bacteria|Firmicutes|Unclassified|1232 0.0641774 0.13535 0.0701932 0.0538471 0.0667488 0.0681522 0.127191 0.0622321 -Bacteria|Proteobacteria 0 0.00507838 0.00685085 0.0196243 0.14808 0.177788 0.192387 0.190568 -Bacteria|Proteobacteria|Betaproteobacteria|Burkholderiales|Alcaligenaceae|Parasutterella|1344 0 0.00507838 0.0012354 0.00167524 0.0351201 0 0.00395704 0 
-Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacteriales|Enterobacteriaceae|Escherichia/Shigella|1532 0 0 0.00561545 0.017949 0.11296 0.177788 0.18843 0.190568 \ No newline at end of file diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/test-data/maaslin_output --- a/maaslin-4450aa4ecc84/test-data/maaslin_output Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,6 +0,0 @@ - Variable Feature Value Coefficient N N.not.0 P.value Q.value -1 Age Bacteria|Actinobacteria|Actinobacteria|Bifidobacteriales|Bifidobacteriaceae|Bifidobacterium|1 Age 0.00247925731553718 8 8 0.000443046842141386 0.0236291649142073 -2 Cohort Bacteria|Proteobacteria CohortIBD 0.361202359969779 8 7 8.29695122618112e-05 0.0132751219618898 -3 Cohort Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacteriales|Enterobacteriaceae|Escherichia/Shigella|1532 CohortIBD 0.368439847775899 8 6 0.000282569701775158 0.0226055761420126 -4 Cohort Bacteria|Firmicutes|Clostridia|Clostridiales|Lachnospiraceae|Roseburia|1032 CohortIBD -0.517733343029902 8 6 0.000628473175503113 0.0251389270201245 -5 Cohort Bacteria|Firmicutes|Clostridia|Clostridiales|Ruminococcaceae|1156 CohortIBD -0.271131332905165 8 6 0.00121369709195569 0.0388383069425819 diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/tool_dependencies.xml --- a/maaslin-4450aa4ecc84/tool_dependencies.xml Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,6 +0,0 @@ - - - - $REPOSITORY_INSTALL_DIR - - diff -r 589169d452c0 -r 18774fa866d8 maaslin-4450aa4ecc84/transpose.py --- a/maaslin-4450aa4ecc84/transpose.py Sun Feb 08 23:21:34 2015 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,86 +0,0 @@ -#!/usr/bin/env python -####################################################################################### -# This file is provided under the Creative Commons Attribution 3.0 license. 
-# -# You are free to share, copy, distribute, transmit, or adapt this work -# PROVIDED THAT you attribute the work to the authors listed below. -# For more information, please see the following web page: -# http://creativecommons.org/licenses/by/3.0/ -# -# This file is a component of the SflE Scientific workFLow Environment for reproducible -# research, authored by the Huttenhower lab at the Harvard School of Public Health -# (contact Curtis Huttenhower, chuttenh@hsph.harvard.edu). -# -# If you use this environment, the included scripts, or any related code in your work, -# please let us know, sign up for the SflE user's group (sfle-users@googlegroups.com), -# pass along any issues or feedback, and we'll let you know as soon as a formal citation -# is available. -####################################################################################### - -""" -Examples -~~~~~~~~ - -``data.pcl``:: - - a b - c d - e f - -``Examples``:: - - $ transpose.py < data.pcl - a c e - b d f - - $ echo "a b c" | transpose.py - a - b - c - -.. testsetup:: - - from transpose import * -""" - -import argparse -import csv -import sys - -def transpose( aastrIn, ostm ): - """ - Outputs the matrix transpose of the input tab-delimited rows. - - :param aastrIn: Split lines from which data are read. - :type aastrIn: collection of string collections - :param ostm: Output stream to which transposed rows are written. 
- :type ostm: output stream - - >>> aastrIn = [list(s) for s in ("ab", "cd", "ef")] - >>> transpose( aastrIn, sys.stdout ) #doctest: +NORMALIZE_WHITESPACE - a c e - b d f - - >>> transpose( [list("abc")], sys.stdout ) #doctest: +NORMALIZE_WHITESPACE - a - b - c - """ - - aastrLines = [a for a in aastrIn] - csvw = csv.writer( ostm, csv.excel_tab ) - for iRow in range( len( aastrLines[0] ) ): - csvw.writerow( [aastrLines[iCol][iRow] for iCol in range( len( aastrLines ) )] ) - -argp = argparse.ArgumentParser( prog = "transpose.py", - description = """Transposes a tab-delimited text matrix. - -The transposition process is robust to missing elements and rows of differing lengths.""" ) -__doc__ = "::\n\n\t" + argp.format_help( ).replace( "\n", "\n\t" ) + __doc__ - -def _main( ): - args = argp.parse_args( ) - transpose( csv.reader( sys.stdin, csv.excel_tab ), sys.stdout ) - -if __name__ == "__main__": - _main( ) diff -r 589169d452c0 -r 18774fa866d8 maaslin_ziped.zip Binary file maaslin_ziped.zip has changed