comparison readme.rst @ 0:1535ffddeff4 draft
planemo upload commit a7ac27de550a07fd6a3e3ea3fb0de65f3a10a0e6-dirty
author | cristian |
---|---|
date | Thu, 07 Sep 2017 08:51:57 -0400 |
parents | |
children |

Notos
=====

Notos is a suite that calculates CpN o/e ratios (e.g., the commonly used CpG o/e ratios) for a set of nucleotide sequences and uses Kernel Density Estimation (KDE) to model the resulting distribution.

It consists of two programs: CpGoe.pl calculates the CpN o/e ratios, and KDEanalysis.r estimates the model.
Both programs are described briefly below.

CpGoe.pl
--------

This program calculates CpN o/e ratios for nucleotide multi-FASTA files.
For each sequence found in the file, it writes the sequence name followed by the CpN o/e ratio, where N can be any of the nucleotides A, C, G or T, to a TAB-separated file.

An example call would be::

    perl CpGoe.pl -f input_species.fasta -a 1 -c CpG -o input_species_cpgoe.csv -m 200

The available contexts (-c) are CpG, CpA, CpC and CpT. The default is CpG.

The available algorithms (-a) for calculating the CpN o/e ratio are the following (shown here for CpG o/e)::

    1 => (CpG / (C * G)) * (L^2 / (L - 1))
    2 => (CpG / (C * G)) * L
    3 => (CpG / L) / ((C + G) / L)^2
    4 => (CpG / (C + G)/2)^2

Here L denotes the length of the sequence, CpG is the count of CG dinucleotides, and C and G are the counts of the respective bases.
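The formulas translate almost directly into code. The following Python sketch is not taken from CpGoe.pl. It covers algorithms 1 to 3 only, reads the length factor of algorithm 1 as L^2 / (L - 1), and leaves out algorithm 4 because its parenthesization is ambiguous as written. The function name ``cpg_oe`` and the error raised for sequences without C or G are assumptions made for illustration.

```python
def cpg_oe(seq, algorithm=1):
    """Return the CpG o/e ratio of one nucleotide sequence (algorithms 1-3)."""
    seq = seq.upper()
    length = len(seq)        # L: length of the sequence
    c = seq.count("C")       # C: count of cytosines
    g = seq.count("G")       # G: count of guanines
    cpg = seq.count("CG")    # CpG: "CG" cannot overlap itself, so count() is exact

    # Assumption: the ratio is treated as undefined when a denominator is zero.
    if c == 0 or g == 0 or length < 2:
        raise ValueError("CpG o/e is undefined for this sequence")

    if algorithm == 1:
        return (cpg / (c * g)) * (length ** 2 / (length - 1))
    if algorithm == 2:
        return (cpg / (c * g)) * length
    if algorithm == 3:
        return (cpg / length) / ((c + g) / length) ** 2
    raise ValueError("only algorithms 1, 2 and 3 are sketched here")
```

For the sequence ``ACGTCGGA`` (L = 8, C = 2, G = 3, CpG = 2), algorithm 2 gives (2 / 6) * 8, which is roughly 2.67, and algorithm 3 gives 0.25 / 0.625^2 = 0.64.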

KDEanalysis.r
-------------

This program carries out two steps.
The first is data preparation, mainly to remove data artifacts.
The second is mode detection, which is based on a KDE modelling approach.

Example of basic usage on the command line::

    Rscript ~/src/github/notos/KDEanalysis.r "Input species" input_species_cpgoe.csv

In the above case, "Input species" is used to name the generated graphs and serves as an identifier for each sample.
It has to be enclosed in double quotes (") if the species name contains spaces.
The input of KDEanalysis.r has the same format as the output of CpGoe.pl.
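When the call is scripted rather than typed, the quoting issue can be sidestepped by passing the arguments as a list. The sketch below uses only the two positional arguments shown in the example above. The helper names ``kde_command`` and ``run_kde`` and the default script path are hypothetical.

```python
import shlex
import subprocess


def kde_command(species, ratios_file, script="KDEanalysis.r"):
    """Build the argument list for the basic KDEanalysis.r call."""
    # Each list element reaches Rscript as one argument, so a species
    # name containing spaces needs no extra quoting.
    return ["Rscript", script, species, ratios_file]


def run_kde(species, ratios_file, script="KDEanalysis.r"):
    """Run the analysis; raises CalledProcessError if Rscript fails."""
    subprocess.run(kde_command(species, ratios_file, script), check=True)


# shlex.join shows the equivalent, correctly quoted shell command line.
print(shlex.join(kde_command("Input species", "input_species_cpgoe.csv")))
```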

Any of the following parameters can be used:

+--------+---------------------+-----------------------------------------------------------------------------------------+
| Option | Long option         | Description                                                                             |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -o     | --frac-outl         | maximum fraction of CpG o/e ratios excluded as outliers [default 0.01]                  |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -d     | --min-dist          | minimum distance between modes; modes that are closer are joined [default 0.2]          |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -c     | --conf-level        | level of the confidence intervals of the mode positions [default 0.95]                  |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -m     | --mode-mass         | minimum probability mass of a mode [default 0.05]                                       |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -b     | --band-width        | bandwidth constant for kernels [default 1.06]                                           |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -B     | --bootstrap         | calculate confidence intervals of mode positions using bootstrap                        |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -r     | --bootstrap-reps    | number of bootstrap repetitions [default 1500]                                          |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -p     | --peak-file         | name of the output file describing the peaks of the KDE [default modes_basic_stats.csv] |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -s     | --bootstrap-file    | name of the output file with bootstrap values [default modes_bootstrap.csv]             |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -H     | --outlier-hist-file | outliers histogram file [default outliers_hist.pdf]                                     |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -C     | --cutoff-file       | outliers cutoff file [default outliers_cutoff.csv]                                      |
+--------+---------------------+-----------------------------------------------------------------------------------------+
| -k     | --kde-file          | kernel density estimation graph [default KDE.pdf]                                       |
+--------+---------------------+-----------------------------------------------------------------------------------------+

Of special interest is the -B parameter, which triggers the bootstrap calculations.
The default settings have been calibrated through extensive testing, so we advise modifying them only if you know what you are doing.
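To give a feel for what the bandwidth constant and the minimum mode distance control, here is a minimal Python sketch of KDE-based mode detection. It is not the code of KDEanalysis.r. It assumes a Gaussian kernel, a bandwidth of the form constant * standard deviation * n^(-1/5), modes defined as local maxima on a regular grid, and close modes joined by keeping the higher one. The probability-mass criterion (-m) is not sketched. All function and parameter names are hypothetical.

```python
import math
import statistics


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def find_modes(data, band_width=1.06, min_dist=0.2, grid_size=512):
    """Return the positions of the KDE's local maxima, joining close modes."""
    n = len(data)
    # Assumed bandwidth rule: constant * standard deviation * n^(-1/5).
    h = band_width * statistics.stdev(data) * n ** (-0.2)
    lo, hi = min(data) - 3 * h, max(data) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    dens = gaussian_kde(data, grid, h)

    # A grid point is a local maximum if it rises above its left neighbour
    # and is not exceeded by its right neighbour.
    peaks = [
        (grid[i], dens[i])
        for i in range(1, grid_size - 1)
        if dens[i - 1] < dens[i] >= dens[i + 1]
    ]

    # Assumed joining rule: of two modes closer than min_dist, keep the higher.
    modes = []
    for pos, height in peaks:
        if modes and pos - modes[-1][0] < min_dist:
            if height > modes[-1][1]:
                modes[-1] = (pos, height)
        else:
            modes.append((pos, height))
    return [pos for pos, _ in modes]
```

A larger bandwidth constant smooths the density and tends to reduce the number of detected modes, while a larger minimum distance joins more neighbouring modes.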

Output: Both the data preparation step and the mode detection step return results to the user as CSV files and figures.
The two figures illustrate the results of the data cleaning and mode detection steps, respectively.
The contents of the CSV files are described below.

1. outliers_cutoff.csv. The columns of this file contain:

+---------------+----------------------------------------------------------------------------------------------------------------+
| Column        | Description                                                                                                    |
+---------------+----------------------------------------------------------------------------------------------------------------+
| Name          | name of the file analyzed                                                                                      |
+---------------+----------------------------------------------------------------------------------------------------------------+
| prop.zero     | proportion of observations equal to zero excluded (relative to original sample)                                |
+---------------+----------------------------------------------------------------------------------------------------------------+
| prop.out.2iqr | proportion of values excluded if 2 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
+---------------+----------------------------------------------------------------------------------------------------------------+
| prop.out.3iqr | proportion of values excluded if 3 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
+---------------+----------------------------------------------------------------------------------------------------------------+
| prop.out.4iqr | proportion of values excluded if 4 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
+---------------+----------------------------------------------------------------------------------------------------------------+
| prop.out.5iqr | proportion of values excluded if 5 * IQR was used, relative to sample after exclusion of zeros (0 - 100)       |
+---------------+----------------------------------------------------------------------------------------------------------------+
| used          | IQR used for exclusion of outliers / extreme values                                                            |
+---------------+----------------------------------------------------------------------------------------------------------------+
| no.obs.raw    | number of observations in the original sample                                                                  |
+---------------+----------------------------------------------------------------------------------------------------------------+
| no.obs.nozero | number of observations in sample after excluding values equal to zero                                          |
+---------------+----------------------------------------------------------------------------------------------------------------+
| no.obs.clean  | number of observations in sample after excluding outliers / extreme values                                     |
+---------------+----------------------------------------------------------------------------------------------------------------+
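The relationship between these columns can be illustrated with a short Python sketch. It is not the implementation used by KDEanalysis.r. It assumes that zeros are dropped first, that a value counts as excluded when it lies below Q1 - k * IQR or above Q3 + k * IQR, that quartiles use Python's "inclusive" method, and that ``prop.zero`` is a percentage. The function name is hypothetical.

```python
import statistics


def outlier_proportions(ratios, multipliers=(2, 3, 4, 5)):
    """Return zero and k * IQR exclusion percentages for a list of o/e ratios."""
    nonzero = [r for r in ratios if r != 0]
    result = {
        # Assumption: expressed in percent of the original sample.
        "prop.zero": 100.0 * (len(ratios) - len(nonzero)) / len(ratios),
        "no.obs.raw": len(ratios),
        "no.obs.nozero": len(nonzero),
    }
    q1, _, q3 = statistics.quantiles(nonzero, n=4, method="inclusive")
    iqr = q3 - q1
    for k in multipliers:
        # Assumed rule: values outside [Q1 - k*IQR, Q3 + k*IQR] are excluded.
        low, high = q1 - k * iqr, q3 + k * iqr
        excluded = sum(1 for r in nonzero if r < low or r > high)
        result[f"prop.out.{k}iqr"] = 100.0 * excluded / len(nonzero)
    return result
```

Larger multipliers widen the accepted range, so the excluded percentage can only stay the same or shrink from ``prop.out.2iqr`` to ``prop.out.5iqr``.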

2. modes_basic_stats.csv. We use the following notation: sigma - standard deviation, mu - mean, nu - median, Mo - mode, Q_i - the i-th quartile, q_s - the s % quantile. The columns of this file contain:

+-----------------------------------+------------------------------------------------------------------------------------+
| Column                            | Description                                                                        |
+-----------------------------------+------------------------------------------------------------------------------------+
| Name                              | name of the file analyzed                                                          |
+-----------------------------------+------------------------------------------------------------------------------------+
| Number of modes                   | number of modes without applying any exclusion criterion                           |
+-----------------------------------+------------------------------------------------------------------------------------+
| Number of modes (5% excluded)     | number of modes after exclusion of those with less than 5% probability mass        |
+-----------------------------------+------------------------------------------------------------------------------------+
| Number of modes (10% excluded)    | number of modes after exclusion of those with less than 10% probability mass       |
+-----------------------------------+------------------------------------------------------------------------------------+
| Skewness                          | Pearson's moment coefficient of skewness E[((X - mu)/sigma)^3]                     |
+-----------------------------------+------------------------------------------------------------------------------------+
| Mode skewness                     | Pearson's first skewness coefficient (mu - Mo)/sigma                               |
+-----------------------------------+------------------------------------------------------------------------------------+
| Nonparametric skew                | (mu - nu)/sigma                                                                    |
+-----------------------------------+------------------------------------------------------------------------------------+
| Q50 skewness                      | Bowley's measure of skewness / Yule's coefficient (Q_3 + Q_1 - 2Q_2) / (Q_3 - Q_1) |
+-----------------------------------+------------------------------------------------------------------------------------+
| Absolute Q50 mode skewness        | (Q_3 + Q_1) / 2 - Mo                                                               |
+-----------------------------------+------------------------------------------------------------------------------------+
| Absolute Q80 mode skewness        | (q_90 + q_10) / 2 - Mo                                                             |
+-----------------------------------+------------------------------------------------------------------------------------+
| Peak i, i = 1,..., 10             | location of peak i                                                                 |
+-----------------------------------+------------------------------------------------------------------------------------+
| Probability Mass i, i = 1,..., 10 | probability mass assigned to peak i                                                |
+-----------------------------------+------------------------------------------------------------------------------------+
| Warning close modes               | flag indicating that modes lie too close. The default threshold is 0.2             |
+-----------------------------------+------------------------------------------------------------------------------------+
| Number close modes                | number of modes lying too close, given the threshold                               |
+-----------------------------------+------------------------------------------------------------------------------------+
| Modes (close modes excluded)      | number of modes after exclusion of modes that are too close                        |
+-----------------------------------+------------------------------------------------------------------------------------+
| SD                                | sample standard deviation sigma                                                    |
+-----------------------------------+------------------------------------------------------------------------------------+
| IQR 80                            | 80% distance between the 90 % and 10 % quantile                                    |
+-----------------------------------+------------------------------------------------------------------------------------+
| IQR 90                            | 90% distance between the 95 % and 5 % quantile                                     |
+-----------------------------------+------------------------------------------------------------------------------------+
| Total number of sequences         | total number of sequences / CpG o/e ratios used for this analysis step             |
+-----------------------------------+------------------------------------------------------------------------------------+
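The skewness measures in this table can be written out in a few lines. The Python sketch below is an illustration of the formulas and not the code of KDEanalysis.r. It assumes the sample standard deviation for sigma, Python's "inclusive" quantile method, and a mode Mo that is supplied by the caller (in Notos it would come from the KDE). The function name is hypothetical.

```python
import statistics


def skewness_measures(data, mode):
    """Compute the table's skewness measures for a sample and a given mode Mo."""
    n = len(data)
    mu = statistics.fmean(data)       # mean
    sigma = statistics.stdev(data)    # assumption: sample standard deviation
    nu = statistics.median(data)      # median
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    q10, q90 = deciles[0], deciles[-1]
    return {
        "Skewness": sum(((x - mu) / sigma) ** 3 for x in data) / n,
        "Mode skewness": (mu - mode) / sigma,
        "Nonparametric skew": (mu - nu) / sigma,
        "Q50 skewness": (q3 + q1 - 2 * q2) / (q3 - q1),
        "Absolute Q50 mode skewness": (q3 + q1) / 2 - mode,
        "Absolute Q80 mode skewness": (q90 + q10) / 2 - mode,
    }
```

For a symmetric sample whose mode coincides with its mean, all six measures are zero; a long right tail makes them positive.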

3. modes_bootstrap.csv. The columns of this optional file, which results from the bootstrap procedure, contain:

+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Column                                    | Description                                                                                                                                |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Name                                      | name of the file analyzed                                                                                                                  |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Number of modes (NM)                      | number of modes detected for the original sample                                                                                           |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| % of samples with same NM                 | proportion of bootstrap samples with the same number of modes (0 - 100)                                                                    |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| % of samples with more NM                 | proportion of bootstrap samples with a higher number of modes (0 - 100)                                                                    |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| % of samples with less NM                 | proportion of bootstrap samples with a lower number of modes (0 - 100)                                                                     |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| no. of samples with same NM               | number of bootstrap samples with the same number of modes                                                                                  |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| % BS samples excluded by prob.~mass crit. | proportion of bootstrap samples excluded due to strong deviations from the probability masses determined for the original sample (0 - 100) |
+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
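How the first columns of this file relate to each other can be sketched as follows. This Python example assumes a plain nonparametric bootstrap (resampling with replacement) and takes the mode counter as a caller-supplied function, here called ``count_modes``, which is hypothetical. It is not the procedure implemented in KDEanalysis.r, and the probability-mass exclusion criterion is not covered.

```python
import random


def bootstrap_mode_counts(data, count_modes, reps=1500, seed=0):
    """Summarise how often bootstrap resamples reproduce the original mode count."""
    rng = random.Random(seed)           # fixed seed for reproducible resampling
    original = count_modes(data)
    same = more = less = 0
    for _ in range(reps):
        # Draw a resample of the same size, with replacement.
        resample = rng.choices(data, k=len(data))
        nm = count_modes(resample)
        if nm == original:
            same += 1
        elif nm > original:
            more += 1
        else:
            less += 1
    return {
        "Number of modes (NM)": original,
        "% of samples with same NM": 100.0 * same / reps,
        "% of samples with more NM": 100.0 * more / reps,
        "% of samples with less NM": 100.0 * less / reps,
        "no. of samples with same NM": same,
    }
```

The three percentages always add up to 100, because every resample falls into exactly one of the three groups.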