
diff readme.rst @ 0:1535ffddeff4 draft
planemo upload commit a7ac27de550a07fd6a3e3ea3fb0de65f3a10a0e6-dirty
author: cristian
date: Thu, 07 Sep 2017 08:51:57 -0400
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/readme.rst	Thu Sep 07 08:51:57 2017 -0400
@@ -0,0 +1,173 @@
+Notos
+=====
+
+Notos is a suite that calculates CpN o/e (observed/expected) ratios, e.g. the commonly used CpG o/e ratios, for a set of nucleotide sequences and uses Kernel Density Estimation (KDE) to model the resulting distribution.
+
+It consists of two programs: CpGoe.pl calculates the CpN o/e ratios and KDEanalysis.r estimates the model.
+In the following, these two programs are described briefly.
+
+CpGoe.pl
+--------
+
+
+This program will calculate CpN o/e ratios on nucleotide multifasta files.
+For each sequence found in the file, it writes the sequence name followed by the CpN o/e ratio, where N can be any of the nucleotides A, C, G or T, to a TAB-separated file.
+
+An example call would be::
+
+    perl CpGoe.pl -f input_species.fasta -a 1 -c CpG -o input_species_cpgoe.csv -m 200
+	
+
+The available contexts (-c) are CpG, CpA, CpC and CpT; the default is CpG.
+
+The available algorithms (-a) for calculating the CpN o/e ratio are the following (shown here for CpG o/e)::
+
+    1 => (CpG / (C * G)) * (L^2 / (L-1))
+    2 => (CpG / (C * G)) * L
+    3 => (CpG / L) / ((C + G) / L)^2
+    4 => (CpG / (C + G)/2)^2
+		
+Here L denotes the length of the sequence, CpG the number of CG dinucleotides, and C and G the counts of the respective bases.
+
+KDEanalysis.r
+-------------
+
+This program carries out two steps.
+The first is a data preparation step, which mainly removes data artifacts.
+The second is a mode detection step, which is based on a KDE modelling approach.
+
+Example basic usage on the command line::
+
+    Rscript ~/src/github/notos/KDEanalysis.r "Input species" input_species_cpgoe.csv
+
+	
+In the above case, "Input species" is used to name the generated graphs and as an identifier for each sample.
+It has to be enclosed in double quotes if the species name contains spaces.
+The input of KDEanalysis.r is of the same format as the output of CpGoe.pl.
+
+Any of the following parameters can be used:
+
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| Option | Long option         | Description                                                                             |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -o     | --frac-outl         | maximum fraction of CpGo/e ratios excluded as outliers [default 0.01]                   |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -d     | --min-dist          | minimum distance between modes, modes that are closer are joined [default 0.2]          |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -c     | --conf-level        | level of the confidence intervals of the mode positions [default 0.95]                  |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -m     | --mode-mass         | minimum probability mass of a mode [default 0.05]                                       |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -b     | --band-width        | bandwidth constant for kernels [default 1.06]                                           |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -B     | --bootstrap         | calculate confidence intervals of mode positions using bootstrap.                       |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -r     | --bootstrap-reps    | number of bootstrap repetitions [default 1500]                                          |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -p     | --peak-file         | name of the output file describing the peaks of the KDE [default modes_basic_stats.csv] |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -s     | --bootstrap-file    | name of the output file with bootstrap values [default modes_bootstrap.csv]             |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -H     | --outlier-hist-file | Outliers histogram file [default outliers_hist.pdf]                                     |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -C     | --cutoff-file       | Outliers cutoff file [default outliers_cutoff.csv]                                      |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+| -k     | --kde-file          | Kernel density estimation graph [default KDE.pdf]                                       |
++--------+---------------------+-----------------------------------------------------------------------------------------+
+
+Of special interest is the -B parameter, which triggers the bootstrap calculations.
+The default settings have been calibrated through extensive testing, so we advise modifying them only if you know what you are doing.
+
+Output: Both the data preparation and the mode detection steps return their results as CSV files and figures.
+The two figures illustrate the results of the data cleaning and mode detection steps, respectively.
+The contents of the CSV files are described in the following.
+
+1. outliers_cutoff.csv. The columns of this file contain:
+
++---------------+----------------------------------------------------------------------------------------------------------------+
+| Column        | description                                                                                                    |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| Name          | name of the file analyzed                                                                                      |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| prop.zero     | proportion of observations equal to zero excluded (relative to original sample)                                |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| prop.out.2iqr | proportion of values excluded if 2 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| prop.out.3iqr | proportion of values excluded if 3 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| prop.out.4iqr | proportion of values excluded if 4 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| prop.out.5iqr | proportion of values excluded if 5 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| used          | IQR used for exclusion of outliers / extreme values                                                            |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| no.obs.raw    | number of observations in the original sample                                                                  |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| no.obs.nozero | number of observations in sample after excluding values equal to zero                                          |
++---------------+----------------------------------------------------------------------------------------------------------------+
+| no.obs.clean  | number of observations in sample after excluding outliers / extreme values                                     |
++---------------+----------------------------------------------------------------------------------------------------------------+
+
+2. modes_basic_stats.csv. We use the following notation: sigma - standard deviation, mu - mean, nu - median, Mo - mode, Q_i - the i-th quartile, q_s - the s % quantile. The columns of this file contain:
+
++-----------------------------------+------------------------------------------------------------------------------------+
+| Column                            | description                                                                        |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Name                              | name of the file analyzed                                                          |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Number of modes                   | number of modes without applying any exclusion criterion                           |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Number of modes (5% excluded)     | number of modes after exclusion of those with less than 5% probability mass        |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Number of modes (10% excluded)    | number of modes after exclusion of those with less than 10% probability mass       |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Skewness                          | Pearson's moment coefficient of skewness E[((X-mu)/sigma)^3]                       |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Mode skewness                     | Pearson's first skewness coefficient (mu - Mo)/sigma                               |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Nonparametric skew                | (mu - nu)/sigma                                                                    |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Q50 skewness                      | Bowley's measure of skewness / Yule's coefficient (Q_3 + Q_1 - 2Q_2) / (Q_3 - Q_1) |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Absolute Q50 mode skewness        | (Q_3 + Q_1) / 2 - Mo                                                               |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Absolute Q80 mode skewness        | (q_90 + q_10) / 2 - Mo                                                             |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Peak i, i = 1,..., 10             | location of peak i                                                                 |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Probability Mass i, i = 1,..., 10 | probability mass assigned to peak i                                                |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Warning close modes               | flag indicating that modes lie too close. The default threshold is 0.2             |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Number close modes                | number of modes lying too close, given the threshold                               |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Modes (close modes excluded)      | number of modes after exclusion of modes that are too close                        |
++-----------------------------------+------------------------------------------------------------------------------------+
+| SD                                | sample standard deviation sigma                                                    |
++-----------------------------------+------------------------------------------------------------------------------------+
+| IQR 80                            | 80% distance between the 90 % and 10 % quantile                                    |
++-----------------------------------+------------------------------------------------------------------------------------+
+| IQR 90                            | 90% distance between the 95 % and 5 % quantile                                     |
++-----------------------------------+------------------------------------------------------------------------------------+
+| Total number of sequences         | total number of sequences / CpG o/e ratios used for this analysis step             |
++-----------------------------------+------------------------------------------------------------------------------------+
+
+3. modes_bootstrap.csv. The columns of this optional file, which results from the bootstrap procedure, contain:
+
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| Column                                    | description                                                                                                                                |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| Name                                      | name of the file analyzed                                                                                                                  |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| Number of modes (NM)                      | number of modes detected for the original sample                                                                                           |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| % of samples with same NM                 | proportion of bootstrap samples with the same number of modes (0 - 100)                                                                    |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| % of samples with more NM                 | proportion of bootstrap samples with a higher number of modes (0 - 100)                                                                    |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| % of samples with less NM                 | proportion of bootstrap samples with a lower number of modes (0 - 100)                                                                     |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| no. of samples with same NM               | number of bootstrap samples with the same number of modes                                                                                  |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
+| % BS samples excluded by prob.~mass crit. | proportion of bootstrap samples excluded due to strong deviations from the probability masses determined for the original sample (0 - 100) |
++-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
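
The o/e variants listed in the readme for the -a option of CpGoe.pl can be illustrated with a short Python sketch. This is not the code of CpGoe.pl. The function names `count_dinucleotide` and `cpn_oe` are hypothetical, the factor in variant 1 is assumed to mean L^2 / (L - 1), variant 4 is left out because its parenthesization is ambiguous as written, and returning `None` for sequences without the required bases is an assumption about edge-case handling.

```python
def count_dinucleotide(seq, second_base="G"):
    """Count positions where C is immediately followed by `second_base`.

    A sliding window is used so that overlapping hits (e.g. CC in CCC)
    are each counted once per position.
    """
    target = "C" + second_base
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == target)


def cpn_oe(seq, algorithm=1, second_base="G"):
    """Return the CpN o/e ratio of `seq` for variants 1 to 3 of the readme.

    Assumptions (not taken from CpGoe.pl): the sequence is upper-cased,
    L is the full sequence length, and None is returned when the ratio
    is undefined.
    """
    seq = seq.upper()
    length = len(seq)
    c = seq.count("C")
    n = seq.count(second_base)
    cpn = count_dinucleotide(seq, second_base)

    # Hypothetical edge-case handling: no ratio without both bases.
    if c == 0 or n == 0 or length < 2:
        return None

    if algorithm == 1:
        # (CpN / (C * N)) * (L^2 / (L - 1))
        return (cpn / (c * n)) * (length ** 2 / (length - 1))
    if algorithm == 2:
        # (CpN / (C * N)) * L
        return (cpn / (c * n)) * length
    if algorithm == 3:
        # (CpN / L) / ((C + N) / L)^2
        return (cpn / length) / ((c + n) / length) ** 2
    raise ValueError("only variants 1, 2 and 3 are sketched here")


if __name__ == "__main__":
    example = "ACGTCGCG"  # made-up sequence, not from any real data set
    for variant in (1, 2, 3):
        print(variant, cpn_oe(example, algorithm=variant))
```

Passing `second_base="A"`, `"C"` or `"T"` corresponds to the CpA, CpC and CpT contexts selected with -c.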