annotate SkewIT/README.md @ 7:2c65d4257fe6 draft

Uploaded
author dereeper
date Thu, 30 May 2024 12:30:04 +0000
parents e42d30da7a74
children
# SkewIT
SkewIT (Skew Index Test) is a tool for analyzing GC Skew in bacterial genomes. GC Skew is a phenomenon observed in many bacterial genomes wherein the two strands of the chromosome contain different proportions of guanine/cytosine nucleotides. SkewIT quantifies GC Skew using a single metric (SkewI) that can then be compared/analyzed across thousands of bacterial genomes.

More information about the method is detailed in the [SkewIT paper](https://doi.org/10.1371/journal.pcbi.1008439) (published Dec 4, 2020).

**IMPORTANT:** GC Skew/SkewIT is intended for use only with complete, fully contiguous bacterial sequences with no gaps. Sequences should be fully assembled from end to end for the calculated SkewI to be informative. Contigs/scaffolds are not expected to display GC Skew.

--------------------------------------
## README Sections
1. [Code Availability](#code-availability)
2. [Data Availability](#data-availability)
3. [SkewIT R Shiny App](#skewit-shiny-app)
4. [skewi.py](#skewipy)
5. [gcskew.py](#gcskewpy)
6. [plot\_gcskew.py](#plotgcskewpy)
---------------------------------------

## Code availability
This repository contains three main Python scripts (developed in Python 2.7.5):
1. [skewi.py](#skewipy): calculates SkewI for each genome provided
2. [gcskew.py](#gcskewpy): calculates GC Skew values across the whole genome for a single genome
3. [plot\_gcskew.py](#plotgcskewpy): plots GC Skew for each genome provided in a single multi-FASTA file

Scripts are located in the `/src/` folder. While each script can be run using `python myscript.py`, users can make each script executable by running:

    chmod +x skewi.py
    ./skewi.py -h

**DEPENDENCIES:** SkewIT scripts require biopython to read FASTA sequences.
For more information about installing biopython, see the [biopython website](https://biopython.org/wiki/Download).

The `plot_gcskew.py` script requires matplotlib and numpy. Plots will be saved in PNG format.
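
Before running the scripts, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch written for Python 3 (the scripts themselves were developed in Python 2.7.5). The helper `missing_dependencies` is hypothetical and not part of SkewIT; the only assumption about the packages is that biopython is imported under the module name `Bio`.

```python
import importlib.util

# Package name -> import name. "Bio" is the import name used by biopython.
REQUIRED = {"biopython": "Bio", "matplotlib": "matplotlib", "numpy": "numpy"}


def missing_dependencies(required=REQUIRED):
    """Return the package names whose import module cannot be located."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All SkewIT dependencies were found.")
```

Only `skewi.py` and `gcskew.py` need biopython alone; matplotlib and numpy matter only when `plot_gcskew.py` is used.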

## Data availability
In addition to the available code, we also provide SkewI values and thresholds for RefSeq release 97 in the `/data/` folder.
1. `RefSeq97_Bacteria_SkewI_incl.taxonomy.txt`: lists the SkewI value for each complete bacterial genome along with its taxonomy
2. `RefSeq97_Bacteria_GenusSkewIThresholds.txt`: lists each bacterial genus with the number of genomes, SkewI mean/standard deviation, and (for genera with >= 10 genomes) the SkewI threshold (2 standard deviations below the mean).
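
The threshold rule described above (mean minus 2 standard deviations, only for genera with at least 10 genomes) can be sketched as follows. This is an illustration, not the code used to produce the data files: the function name `genus_threshold` is hypothetical, and the use of the sample standard deviation is an assumption, since the README does not state which estimator was used.

```python
from statistics import mean, stdev

MIN_GENOMES = 10  # genera with fewer genomes receive no threshold


def genus_threshold(skewi_values, n_sd=2.0, min_genomes=MIN_GENOMES):
    """Return mean - n_sd * SD of the SkewI values, or None for small genera.

    Assumption: the sample standard deviation (statistics.stdev) is used.
    """
    if len(skewi_values) < min_genomes:
        return None
    return mean(skewi_values) - n_sd * stdev(skewi_values)


def is_below_threshold(skewi, threshold):
    """Flag a genome whose SkewI falls below its genus threshold."""
    return threshold is not None and skewi < threshold
```

A genome flagged by `is_below_threshold` is a candidate for closer inspection rather than a confirmed problem.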

## SkewIT Shiny App
For ease of analysis, we provide a Shiny app (https://jenniferlu717.shinyapps.io/SkewIT/) to visualize the SkewI distributions for the 15,067 bacterial genomes in RefSeq release 97. This app allows users to:
1. Visualize the SkewI distribution across all genomes
2. Visualize the SkewI distribution for any selected genus
3. Visualize the SkewI values as separated by species
4. Identify which genomes have SkewI values falling below the calculated SkewI threshold
5. Plot GC Skew values and calculate SkewI from a user-provided FASTA file
6. Plot GC Skew values as produced by the [gcskew.py](#gcskewpy) program provided here


---------------------------------------
## skewi.py
### 1. skewi.py Usage/Options
This program calculates SkewI values for each genome provided. Running `python skewi.py --usage` will print a full usage message to standard output.
Here, we describe how to run `skewi.py`, along with all related options.

    python skewi.py -i SEQ.FASTA

    Required parameters:
    * -i SEQ.FASTA...............fasta/multi-fasta sequence file

    Optional parameters:
    * -o SKEWI.TXT...............output file [if none is provided, the program will print to standard out]
    * -k WINDOW SIZE.............size of window to assign a gc skew value [default: 20kb]
    * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]
    * --min-seq-len LENGTH.......minimum sequence length required to analyze [default: 500kb]
    * --complete/--all...........only analyze complete sequences/analyze complete and draft sequences [default: --complete]
    * --plasmid/--no-plasmid.....include/exclude plasmid sequences [default: --no-plasmid]
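
When `skewi.py` is called from another Python program, the options listed above can be assembled into an argument list. The sketch below uses only the flags documented in this README; the helper `build_skewi_command` and its parameter names are hypothetical, and the script path is assumed to be `skewi.py` in the current directory.

```python
def build_skewi_command(fasta, output=None, window=None, frequency=None,
                        min_seq_len=None, complete_only=True,
                        include_plasmids=False, script="skewi.py"):
    """Build an argument list for skewi.py from the documented options."""
    cmd = ["python", script, "-i", fasta]
    if output is not None:
        cmd += ["-o", output]
    if window is not None:
        cmd += ["-k", str(window)]
    if frequency is not None:
        cmd += ["-f", str(frequency)]
    if min_seq_len is not None:
        cmd += ["--min-seq-len", str(min_seq_len)]
    # The two switches below restate the documented defaults when unchanged.
    cmd.append("--complete" if complete_only else "--all")
    cmd.append("--plasmid" if include_plasmids else "--no-plasmid")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_skewi_command("MYSEQ.FASTA", window=20000,
                                       frequency=10000)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; options left as `None` are omitted so that the script's own defaults apply.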

### 2. skewi.py Input Files

Currently, input sequence files must be FASTA formatted and uncompressed (not zipped). Multi-FASTA files are permitted. The program will calculate and print one SkewI value for each sequence provided.

### 3. skewi.py Output Format

If an output file is provided, the program will generate a tab-delimited, 2-column output file with headers. The first column contains the full sequence ID/description. The second column contains the calculated SkewI value.

If no output file is provided, the program will print these two columns to standard output. Users can redirect this output to a file of their choice by running:
`python skewi.py -i MYSEQ.FASTA > MYOUTPUT.TXT`
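
The two-column output described above is straightforward to load for downstream analysis. The following sketch assumes only what is stated here: a tab-delimited file with one header row, the sequence ID/description in the first column, and SkewI in the second. The function name `read_skewi_table` is hypothetical, and the header is skipped by position because its exact column names are not given in this README.

```python
import csv


def read_skewi_table(handle):
    """Return (sequence description, SkewI) pairs from skewi.py output.

    The first row is treated as the header and skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    return [(row[0], float(row[1])) for row in reader if len(row) >= 2]


if __name__ == "__main__":
    # MYOUTPUT.TXT is the file name used in the redirect example above.
    with open("MYOUTPUT.TXT", newline="") as handle:
        for description, skewi in read_skewi_table(handle):
            print(description, skewi)
```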


### 4. skewi.py Window Length/Frequency Options (-k/-f/--min-seq-len)

By default, the program calculates SkewI using non-overlapping/adjacent windows of size 20kb, and only for sequences with a minimum length of 500kb.

If users change the window size (`-k`) but do not specify a window frequency (`-f`), the program will by default use non-overlapping/adjacent windows (`k == f`).

1. For overlapping windows, specify a frequency < window length:

    `python skewi.py -i MYSEQ.FASTA -k 20000 -f 10000`

2. For no minimum sequence length, specify `--min-seq-len 0`:

    `python skewi.py -i MYSEQ.FASTA --min-seq-len 0`

3. For a smaller window size (and therefore more resolution):

    `python skewi.py -i MYSEQ.FASTA -k 10000`

The window size `-k` must always be greater than or equal to the frequency `-f`. Both values must be greater than 0.
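
To make the relationship between `-k` and `-f` concrete, the sketch below lists the window start positions for a sequence of a given length and enforces the constraints stated above. It is an illustration only: `window_starts` is a hypothetical helper, start positions are assumed to be 0-based, and the final window is assumed to be allowed to run shorter than `k`; the scripts may handle the sequence end differently.

```python
def window_starts(seq_len, k=20000, f=None):
    """Return the start position of each window (assumed 0-based).

    k is the window size; f is the number of bases between window starts.
    When f is omitted, f == k, giving adjacent, non-overlapping windows.
    """
    if f is None:
        f = k
    if k <= 0 or f <= 0:
        raise ValueError("window size and frequency must be greater than 0")
    if k < f:
        raise ValueError("window size must be >= frequency")
    return list(range(0, seq_len, f))


if __name__ == "__main__":
    # A 100kb sequence: 5 adjacent windows by default,
    # 10 overlapping windows when f is half of k.
    print(len(window_starts(100000)))
    print(len(window_starts(100000, k=20000, f=10000)))
```

Halving `f` while keeping `k` fixed doubles the number of windows without changing how many bases contribute to each skew value.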


### 5. skewi.py Complete Genome Options (--complete/--all)
As the program was designed to work with RefSeq output files, these two options allow users to specify whether only complete genomes or all genomes in the provided files should be analyzed.

Specifying `--complete` requires that "complete" is in the sequence header, while specifying `--all` allows any sequence to be analyzed.


### 6. skewi.py Plasmid Options (--plasmid/--no-plasmid)
This program was designed for analysis of bacterial chromosomes, not plasmids. We have not tested the performance of the program on plasmid sequences. Therefore, by default, the program will skip any sequence containing "plasmid" in the header.
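
Taken together with the `--complete` check and the `--min-seq-len` cutoff described earlier, the default plasmid filter amounts to a simple test on each sequence header and length. The sketch below is a minimal illustration, assuming a case-insensitive substring match (the README does not say whether matching is case-sensitive). The function `should_analyze` and its parameter names are hypothetical and are not taken from `skewi.py`.

```python
def should_analyze(header, seq_len, complete_only=True,
                   include_plasmids=False, min_seq_len=500000):
    """Decide whether a sequence passes the documented default filters."""
    text = header.lower()  # assumption: matching ignores case
    if complete_only and "complete" not in text:
        return False  # mirrors --complete (use complete_only=False for --all)
    if not include_plasmids and "plasmid" in text:
        return False  # mirrors --no-plasmid
    return seq_len >= min_seq_len  # mirrors --min-seq-len


if __name__ == "__main__":
    print(should_analyze("Example bacterium chromosome, complete genome",
                         4000000))
    print(should_analyze("Example bacterium plasmid pX, complete sequence",
                         4000000))
```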

To analyze plasmid sequences in their input files, users can specify `--plasmid` at runtime.

---------------------------------------
## gcskew.py
### 1. gcskew.py Usage/Options
This program calculates GC Skew values for each genome provided. Running `python gcskew.py --usage` will print a full usage message to standard output.
Here, we describe how to run `gcskew.py`.

    python gcskew.py -i SEQ.FASTA

    Required parameters:
    * -i SEQ.FASTA...............fasta/multi-fasta sequence file

    Optional parameters:
    * -o SKEW.TXT................output file [if none is provided, the program will print to gcskew.txt (overwrites if exists)]
    * -k WINDOW SIZE.............size of window within which to calculate gc skew [default: 20kb]
    * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]

### 2. gcskew.py Input/Output Files

Currently, input sequence files must be FASTA formatted and uncompressed (not zipped). Multi-FASTA files are permitted.

The output file is a 3-column, tab-delimited file with the following columns:
1. sequence ID = allows users to determine which GC Skew values belong to which sequence
2. index = the start index of the window for which GC Skew is calculated
3. GC Skew value = calculated by counting the guanine (G) and cytosine (C) bases in the window and computing (G-C)/(G+C)

This output file can be loaded into the [SkewIT R Shiny App](#skewit-shiny-app) (https://jenniferlu717.shinyapps.io/SkewIT/) for visualization.
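
The per-window value in the third column follows directly from the formula (G-C)/(G+C). The sketch below computes it for each window of a sequence string; it is an illustration, not the code in `gcskew.py`. The function names are hypothetical, the window index is assumed to be 0-based, and returning 0.0 for a window without any G or C is an assumption made to avoid division by zero.

```python
def gc_skew(window):
    """Return (G - C) / (G + C) for one window of sequence."""
    window = window.upper()
    g = window.count("G")
    c = window.count("C")
    if g + c == 0:
        return 0.0  # assumption: no G or C in the window gives a skew of 0
    return (g - c) / (g + c)


def gc_skew_windows(seq, k=20000, f=None):
    """Return (start index, GC Skew) for windows of size k starting every f bases."""
    if f is None:
        f = k  # adjacent, non-overlapping windows
    return [(start, gc_skew(seq[start:start + k]))
            for start in range(0, len(seq), f)]


if __name__ == "__main__":
    # Toy sequence with a tiny window size, for illustration only.
    for start, skew in gc_skew_windows("GGGCCCCG", k=4):
        print("toy_sequence", start, skew, sep="\t")
```

A positive value means the window contains more G than C on the strand given in the FASTA file; a negative value means the opposite.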

### 3. gcskew.py Window Length/Frequency Options (-k/-f)

These options are identical to those described above for the `skewi.py` script.

---------------------------------------
## plot\_gcskew.py
### 1. plot\_gcskew.py Usage/Options
This program plots GC Skew values for each genome provided. Running `python plot_gcskew.py --usage` will print a full usage message to standard output.
Here, we describe how to run `plot_gcskew.py`.

    python plot_gcskew.py -i SEQ.FASTA

    Required parameters:
    * -i SEQ.FASTA...............fasta/multi-fasta sequence file

    Optional parameters:
    * -o SKEW.PNG................output file [if none is provided, the program will produce curr.png (overwrites if exists)]
    * -k WINDOW SIZE.............size of window within which to calculate gc skew [default: 20kb]
    * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]

The `-k` and `-f` options for this script are identical to those of the other SkewIT programs provided.

### 2. plot\_gcskew.py Example Output
If a multi-FASTA file is given, one .png image is produced containing GC Skew plots for each FASTA sequence. Ideally, do not provide a multi-FASTA file with more than 5 sequences.

For a 2-genome multi-FASTA file, `plot_gcskew.py` will generate the following:
![GC Skew Example](data/example_gcskewplot.png)

# Author information
Updated: 2020/05/10

Jennifer Lu, jennifer.lu717@gmail.com