comparison SkewIT/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
parents
children
2:97e4e3e818b6 3:e42d30da7a74
1 # SkewIT
2 SkewIT (Skew Index Test) is a tool for analyzing GC Skew in bacterial genomes. GC Skew is a phenomenon observed in many bacterial genomes wherein the two strands of the chromosome contain different proportions of guanine/cytosine nucleotides. SkewIT quantifies GC Skew using a single metric that can then be compared/analyzed across thousands of bacterial genomes.
3
4 More information about the method is detailed in the [SkewIT paper](https://doi.org/10.1371/journal.pcbi.1008439) (published Dec 4, 2020).
5
6 **IMPORTANT:** GC Skew/SkewIT is intended for use with only complete, fully contiguous, bacterial sequences with no gaps. Sequences should be fully assembled from end-to-end for the calculated SkewI to be informative. Contigs/Scaffolds are not expected to display GC Skew.
7
8 --------------------------------------
9 ## README Sections
10 1. [Code Availability](#code-availability)
11 2. [Data Availability](#data-availability)
12 3. [SkewIT R Shiny App](#skewit-shiny-app)
13 4. [skewi.py](#skewipy)
14 5. [gcskew.py](#gcskewpy)
15 6. [plot\_gcskew.py](#plotgcskewpy)
16 ---------------------------------------
17
18 ## Code availability
19 This repository contains three main Python scripts (developed in Python 2.7.5):
20 1. [skewi.py](#skewipy): calculates SkewI for each genome provided
21 2. [gcskew.py](#gcskewpy): calculates GC skew values across the whole genome for a single genome
22 3. [plot\_gcskew.py](#plotgcskewpy): plots GC skew for each genome provided in a single multi-FASTA file
23
24 Scripts are located in the `/src/` folder. Each script can be run using `python myscript.py`; alternatively, users can make a script executable by running:
25
26 chmod +x skewi.py
27 ./skewi.py -h
28
29 **DEPENDENCIES:** SkewIT scripts require Biopython to read FASTA sequences.
30 For more information about installing Biopython, see the [Biopython website](https://biopython.org/wiki/Download).
31
32 The `plot_gcskew.py` script additionally requires matplotlib and numpy. Plots are saved in PNG format.
33
34 ## Data availability
35 In addition to the available code, we also provide SkewI values and thresholds for RefSeq release 97 in the `/data/` folder.
36 1. `RefSeq97_Bacteria_SkewI_incl.taxonomy.txt`: lists SkewI values for each complete bacterial genome along with their taxonomy
37 2. `RefSeq97_Bacteria_GenusSkewIThresholds.txt`: lists each bacterial genus with the number of genomes, SkewI mean/standard deviation, and (for genera with >= 10 genomes) the SkewI threshold (2 standard deviations below mean).
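
The threshold rule described above (2 standard deviations below the genus mean, only for genera with at least 10 genomes) can be sketched as follows. This is an illustrative reimplementation, not the code used to build the data files. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since this README does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def skewi_threshold(values, min_genomes=10, n_sd=2.0):
    """Return mean - n_sd * SD for one genus, or None if it has too few genomes.

    Assumption: sample standard deviation (statistics.stdev).
    """
    if len(values) < min_genomes:
        return None
    return mean(values) - n_sd * stdev(values)


def below_threshold(skewi_by_genome, threshold):
    """Return the genome IDs whose SkewI falls below the genus threshold."""
    if threshold is None:
        return []
    return sorted(g for g, v in skewi_by_genome.items() if v < threshold)
```

A genus with fewer than 10 genomes yields `None`, mirroring the files, which list a threshold only for genera with >= 10 genomes.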
38
39 ## SkewIT Shiny App
40 For ease of analysis, we provide a Shiny app to visualize the SkewI distributions for the 15,067 bacterial genomes in RefSeq release 97: https://jenniferlu717.shinyapps.io/SkewIT/. The app allows users to:
41 1. Visualize the SkewI distribution across all genomes
42 2. Visualize the SkewI distribution for any selected genus
43 3. Visualize the SkewI values as separated by species
44 4. Identify which genomes have SkewI values falling below the calculated SkewI threshold.
45 5. Plot GC skew values and calculate SkewI from a user-provided FASTA file.
46 6. Plot GC skew values as produced by the [gcskew.py](#gcskewpy) program provided here.
47
48
49 ---------------------------------------
50 ## skewi.py
51 ### 1. skewi.py Usage/Options
52 This program calculates a SkewI value for each genome provided. Running `python skewi.py --usage` prints a full usage message to standard output.
53 Here, we describe how to run `skewi.py`, along with all related options and possibilities.
54
55 python skewi.py -i SEQ.FASTA
56
57 Required parameters:
58 * -i SEQ.FASTA...............fasta/multi-fasta sequence file
59
60 Optional parameters:
61 * -o SKEWI.TXT...............output file [if none is provided, the program will print to standard out]
62 * -k WINDOW SIZE.............size of window to assign a gc skew value [default: 20kb]
63 * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]
64 * --min-seq-len LENGTH.......minimum sequence length required to analyze [default: 500kb]
65 * --complete/--all...........only analyze complete sequences/analyze complete and draft sequences [default: --complete]
66 * --plasmid/--no-plasmid.....include/exclude plasmid sequences [default: --no-plasmid]
67
68 ### 2. skewi.py Input Files
69
70 Currently, input sequence files must be FASTA formatted and uncompressed. Multi-FASTA files are permitted. The program will calculate and print one SkewI value for each sequence provided.
71
72 ### 3. skewi.py Output Format
73
74 If an output file is provided, the program will generate a tab-delimited, 2-column output file with headers. The first column will contain the full sequence ID/description. The second column will contain the calculated SkewI value.
75
76 If no output file is provided, the program will print these two columns to standard output. Users can redirect this output into a file of their choice by running:
77 `python skewi.py -i MYSEQ.FASTA > MYOUTPUT.TXT`
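
For downstream analysis, the two-column output can be loaded with the standard library. The sketch below assumes only what is stated above: tab-delimited rows, one header row, the sequence ID/description first and the SkewI value second. The function name and the sample rows in the usage note are hypothetical.

```python
import csv
import io


def read_skewi_table(handle):
    """Map each sequence description to its SkewI value.

    The first row is treated as the header and skipped, so the exact
    header names do not need to be known.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    table = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        table[row[0]] = float(row[1])
    return table
```

Usage with an in-memory example (invented values): `read_skewi_table(io.StringIO("Sequence\tSkewI\nseqA complete genome\t0.91\n"))`. For a real file, pass `open("MYOUTPUT.TXT")` instead.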
78
79
80 ### 4. skewi.py Window Length/Frequency Options (-k/-f/--min-seq-len)
81
82 By default, the program will calculate SkewI using non-overlapping/adjacent windows of size 20kb only for sequences with a minimum length of 500kb.
83
84 If users change the window size (`-k`) but do not specify a window frequency (`-f`), the program will by default use non-overlapping/adjacent windows (`k == f`).
85
86 1. For overlapping windows, specify a frequency smaller than the window length:
87
88 `python skewi.py -i MYSEQ.FASTA -k 20000 -f 10000`
89
90 2. For no minimum sequence length, specify `--min-seq-len 0`
91
92 `python skewi.py -i MYSEQ.FASTA --min-seq-len 0`
93
94 3. For a smaller window size (and therefore more resolution):
95
96 `python skewi.py -i MYSEQ.FASTA -k 10000`
97
98 The window size `-k` must always be greater than or equal to the frequency `-f`. Both values must be greater than 0.
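
To make the relationship between `-k` and `-f` concrete, the sketch below lists the start positions of the windows for a given sequence length. It is an illustration of the rules above, not the code in `skewi.py`. The function name is hypothetical, and two details are assumptions: positions are 0-based, and a trailing window shorter than `k` is dropped.

```python
def window_starts(seq_len, k=20000, f=None):
    """Return the start index of every full window of size k, stepping by f.

    f defaults to k, which gives adjacent, non-overlapping windows.
    Assumptions: 0-based positions; partial trailing windows are dropped.
    """
    if f is None:
        f = k
    if k <= 0 or f <= 0:
        raise ValueError("window size and frequency must be greater than 0")
    if k < f:
        raise ValueError("window size must be >= frequency")
    return list(range(0, seq_len - k + 1, f))
```

With `f < k`, consecutive windows share `k - f` bases; with `f == k` they share none.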
99
100
101 ### 5. skewi.py Complete Genome Options (--complete/--all)
102 As the program was designed to work with RefSeq output files, these two options are provided to allow users to specify whether complete or all genomes in the provided files should be analyzed.
103
104 Specifying `--complete` will require that "complete" is in the sequence header, while specifying `--all` will allow any sequence to be analyzed.
105
106
107 ### 6. skewi.py Plasmid Options (--plasmid/--no-plasmid)
108 This program was designed for analysis of bacterial chromosomes, not plasmids. We have not tested the performance of the program on plasmid sequences. Therefore, by default, the program will skip any sequence containing "plasmid" in the header.
109
110 To analyze plasmid sequences in the input files, specify `--plasmid` at runtime.
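
Taken together, `--min-seq-len`, `--complete/--all`, and `--plasmid/--no-plasmid` act as filters on each input sequence. The sketch below shows one way these filters could combine, using the defaults listed above. The function and parameter names are hypothetical. Case-sensitive substring matching is an assumption, since this README does not say how `skewi.py` compares headers.

```python
def should_analyze(header, seq_len, min_seq_len=500000,
                   complete_only=True, include_plasmids=False):
    """Decide whether a sequence passes the filters described above.

    Assumption: plain, case-sensitive substring checks on the header.
    """
    if seq_len < min_seq_len:
        return False  # shorter than --min-seq-len
    if complete_only and "complete" not in header:
        return False  # --complete requires "complete" in the header
    if not include_plasmids and "plasmid" in header:
        return False  # --no-plasmid skips plasmid sequences
    return True
```

For example, a 4 Mb sequence whose header contains "complete genome" passes with the defaults, while one whose header contains "plasmid" is skipped unless `include_plasmids=True`.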
111
112 ---------------------------------------
113 ## gcskew.py
114 ### 1. gcskew.py Usage/Options
115 This program calculates GC Skew values for each genome provided. Running `python gcskew.py --usage` prints a full usage message to standard output.
116 Here, we describe how to run `gcskew.py`.
117
118 python gcskew.py -i SEQ.FASTA
119
120 Required parameters:
121 * -i SEQ.FASTA...............fasta/multi-fasta sequence file
122
123 Optional parameters:
124 * -o SKEW.TXT................output file [if none is provided, the program will print to gcskew.txt (overwrites if exists)]
125 * -k WINDOW SIZE.............size of window within which to calculate gc skew [default: 20kb]
126 * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]
127
128 ### 2. gcskew.py Input/Output Files
129
130 Currently, input sequence files must be FASTA formatted and uncompressed. Multi-FASTA files are permitted.
131
132 The output file is a 3 column, tab-delimited file with the following columns:
133 1. sequence ID = allows users to sort out which GC Skew values belong to which sequences
134 2. index = designates the start index of the window for which GC Skew is calculated
135 3. GC Skew value = calculated by counting the guanine (G) and cytosine (C) bases in the window and computing (G-C)/(G+C)
136
137 This output file can be loaded into the [SkewIT R Shiny App](#skewit-shiny-app) (https://jenniferlu717.shinyapps.io/SkewIT/) for visualization.
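
The same file can also be read outside the app. The sketch below groups the rows by sequence ID using only the three-column, tab-delimited layout described above. The function name is hypothetical. Because this README does not say whether the file starts with a header row, any row whose second and third fields are not numeric is skipped.

```python
from collections import defaultdict


def read_gcskew_table(lines):
    """Group (window start, GC skew) pairs by sequence ID.

    Rows that do not have three fields, or whose index/skew fields are
    not numeric (such as a possible header row), are skipped.
    """
    per_seq = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        seq_id, index, skew = fields
        try:
            point = (int(index), float(skew))
        except ValueError:
            continue  # header or malformed row
        per_seq[seq_id].append(point)
    return dict(per_seq)
```

For a real file, pass an open file handle, for example `read_gcskew_table(open("gcskew.txt"))`.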
138
139 ### 3. gcskew.py Window Length/Frequency Options (-k/-f)
140
141 These options are identical to those described above for the `skewi.py` script.
142
143 ---------------------------------------
144 ## plot\_gcskew.py
145 ### 1. plot\_gcskew.py Usage/Options
146 This program plots GC Skew values for each genome provided. Running `python plot_gcskew.py --usage` prints a full usage message to standard output.
147 Here, we describe how to run `plot_gcskew.py`.
148
149 python plot_gcskew.py -i SEQ.FASTA
150
151 Required parameters:
152 * -i SEQ.FASTA...............fasta/multi-fasta sequence file
153
154 Optional parameters:
155 * -o SKEW.PNG................output file [if none is provided, the program will produce curr.png (overwrites if exists)]
156 * -k WINDOW SIZE.............size of window within which to calculate gc skew [default: 20kb]
157 * -f FREQUENCY...............number of bases between the start of each window [default: k == f, adjacent/non-overlapping windows]
158
159 Options for this script are identical to those of other SkewIT programs provided.
160
161 ### 2. plot\_gcskew.py Example Output
162 If a multi-FASTA file is given, one .png image is produced containing GC skew plots for each FASTA sequence. Ideally, do not provide a multi-FASTA file with more than 5 sequences.
163
164 For a 2-genome multi-FASTA file, `plot_gcskew.py` will generate the following:
165 ![GC Skew Example](data/example_gcskewplot.png)
166
167 # Author information
168 Updated: 2020/05/10
169
170 Jennifer Lu, jennifer.lu717@gmail.com