annotate COG/bac-genomics-scripts/ncbi_ftp_download/README.md @ 10:d103c41b6931 draft

Uploaded
author dereeper
date Thu, 30 May 2024 16:35:22 +0000
parents e42d30da7a74
children
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
ncbi_ftp_download
=================

**This pipeline is NOT working at the moment, as NCBI reorganized the structure of their [FTP server for genomes](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/). As an alternative way to fetch bacterial genomes from NCBI I recommend [`ncbi-genome-download`](https://github.com/kblin/ncbi-genome-download) from @kblin, or [`Bio-RetrieveAssemblies`](https://github.com/andrewjpage/Bio-RetrieveAssemblies) from @andrewjpage from the Wellcome Trust Sanger Institute.**

Scripts to batch download all bacterial genomes of a genus/species from NCBI's FTP site (RefSeq and GenBank) for easy access.

## Synopsis

    ncbi_ftp_download.sh Genus_species

## Description

These scripts download all bacterial genomes for a particular genus or species from NCBI's FTP site (http://www.ncbi.nlm.nih.gov/Ftp/ and ftp://ftp.ncbi.nlm.nih.gov/) and copy them to result folders for easy access.

`ncbi_ftp_download.sh` is a bash shell wrapper script that employs UNIX's `wget` to download microbial genomes in genbank (\*.gbk) and fasta (\*.fna) format from the GenBank and RefSeq databases (NCBI Reference Sequence Database, http://www.ncbi.nlm.nih.gov/refseq/) on NCBI's FTP server, which can be accessed anonymously. As its first argument it takes the bacterial genus or species name you want to download (the script uses that name with a glob, e.g. Escherichia_coli will be used as Escherichia_coli\*), see the examples below in [usage](#usage). Have a look at the NCBI FTP server to get the correct name (either with your browser or e.g. with FileZilla, http://filezilla-project.org/). If you want to download genomes for several distinct species, just run the script repeatedly with different arguments.
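
How one name argument maps to download targets can be sketched as follows. This is a minimal illustration, not the actual content of `ncbi_ftp_download.sh`: the function name `build_ncbi_urls` is hypothetical, and the four FTP paths are the ones listed in the usage section of this README, which no longer exist on NCBI's reorganized server.

```shell
# Hypothetical helper (not part of the original scripts): print the four
# glob URLs that correspond to one genus/species argument.
build_ncbi_urls() {
    name="$1"
    base="ftp://ftp.ncbi.nlm.nih.gov"
    # RefSeq complete/draft first, then GenBank complete/draft
    for dir in genomes/Bacteria genomes/Bacteria_DRAFT \
               genbank/genomes/Bacteria genbank/genomes/Bacteria_DRAFT; do
        # The trailing '*' is the glob appended to the given name
        printf '%s/%s/%s*\n' "$base" "$dir" "$name"
    done
}

build_ncbi_urls Escherichia_coli
```

Because of the trailing glob, a genus name alone (e.g. `Paenibacillus`) matches every species folder of that genus.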

The `wget` parameters are specified to keep the FTP server folder structure and mirror it locally below the current working directory (folder 'ftp.ncbi.nlm.nih.gov' will be the top folder of the new folder structure). If you update an already existing folder structure, `wget` will only download and replace files if a newer version exists on NCBI's FTP server. **But** be aware that NCBI shuffles files around (adding new ones, deleting old ones etc.), thus it might be useful to remove 'ftp.ncbi.nlm.nih.gov' and download everything anew.

After the download with `wget`, `ncbi_ftp_download.sh` will run the Perl script `ncbi_ftp_concat_unpack.pl`. This script unpacks (draft genomes are stored as tarballs, \*.tgz) and concatenates all complete and draft genomes that are present in the folder 'ftp.ncbi.nlm.nih.gov' in the current working directory. The script traverses the downloaded NCBI FTP folder structure and thus has to be called from the top level (containing the folder 'ftp.ncbi.nlm.nih.gov'). `ncbi_ftp_download.sh` runs `ncbi_ftp_concat_unpack.pl` with both the **genbank** and **refseq** options, as well as option **y** to overwrite the old result folders (see [options](#options) below). Both scripts have to be in the same directory (or in the path) to run `ncbi_ftp_download.sh`.

For **complete** genomes, **plasmids** are concatenated to the **chromosomes** to create multi-genbank/-fasta files (script `split_multi-seq_file.pl` can be used to split the multi-sequence file into single-sequence files).
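
Conceptually, the concatenation step for a complete genome amounts to joining the per-replicon files of one strain folder into a single file. The sketch below illustrates this with `cat`; it is not the logic of `ncbi_ftp_concat_unpack.pl` itself, and the function name `concat_replicons` as well as all file names are made up for the example.

```shell
# Hypothetical sketch: join all per-replicon files with a given extension
# in one strain folder into a single multi-sequence file.
concat_replicons() {
    strain_dir="$1"   # folder holding one file per chromosome/plasmid
    ext="$2"          # 'fna' (fasta) or 'gbk' (genbank)
    out_file="$3"     # multi-sequence result file
    # The glob expands in sorted order, so the result is reproducible
    cat "$strain_dir"/*."$ext" > "$out_file"
}

# Demo with two tiny made-up replicons
demo_dir=$(mktemp -d)
printf '>chromosome\nACGT\n' > "$demo_dir/chromosome.fna"
printf '>plasmid\nGGCC\n'    > "$demo_dir/plasmid.fna"
concat_replicons "$demo_dir" fna "$demo_dir/strain_multi.fasta"

# Count the sequences in the resulting multi-fasta file
grep -c '>' "$demo_dir/strain_multi.fasta"
```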

In **draft** genomes, **scaffold** and/or **contig** files, designated by 'draft_scaf' or 'draft_con', are checked for annotation (i.e. whether gene primary feature tags exist); usually only one of those contains annotations. The one with annotation is then used to create multi-genbank files. Multi-fasta files are created for the corresponding genbank file or, if no annotation exists, for the file which contains more sequence information (either contigs or scaffolds). If the sequence information is equal, scaffold files are preferred. If sequence size discrepancies between a genbank file and its corresponding fasta file are found, the error file 'seq_errors.txt' is created and lists the offending files.
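
The selection rule for draft genomes can be sketched as below. This only illustrates the rule as described, under two assumptions that are not taken from `ncbi_ftp_concat_unpack.pl`: "annotated" is taken to mean that the GenBank feature table has at least one `gene` feature line, and scaffolds win if both files happen to be annotated. All function and file names are hypothetical.

```shell
# Hypothetical sketch of the draft-genome selection rule described above.

# Succeeds if the genbank file has at least one 'gene' primary feature
# (feature keys start in column 6 of a GenBank flat file).
has_annotation() {
    grep -q '^     gene ' "$1"
}

# Number of sequence characters in a fasta file (header lines excluded).
fasta_length() {
    grep -v '^>' "$1" | tr -d '\n' | wc -c | tr -d ' '
}

# Print 'scaf' or 'con' depending on which draft file set should be kept.
choose_draft() {
    scaf_gbk="$1"; con_gbk="$2"; scaf_fna="$3"; con_fna="$4"
    if has_annotation "$scaf_gbk"; then
        echo scaf   # assumption: scaffolds win if both are annotated
    elif has_annotation "$con_gbk"; then
        echo con
    elif [ "$(fasta_length "$con_fna")" -gt "$(fasta_length "$scaf_fna")" ]; then
        echo con    # no annotation: contigs hold more sequence
    else
        echo scaf   # no annotation: scaffolds hold more, or a tie
    fi
}

# Demo with made-up files: only the contig genbank file is annotated
tmp=$(mktemp -d)
printf 'FEATURES             Location/Qualifiers\n' > "$tmp/scaf.gbk"
printf 'FEATURES             Location/Qualifiers\n     gene            1..4\n' > "$tmp/con.gbk"
printf '>s1\nACGT\n' > "$tmp/scaf.fna"
printf '>c1\nACGT\n' > "$tmp/con.fna"
choose_draft "$tmp/scaf.gbk" "$tmp/con.gbk" "$tmp/scaf.fna" "$tmp/con.fna"
```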

As a suggestion, pick the genomes you're looking for **first** out of './refseq' and the rest out of './genbank'. RefSeq genomes have a higher annotation quality, while GenBank includes more genomes.

Depending on the amount of data to download, the whole process can take quite a while. Also keep the space requirements in mind, e.g. all *E. coli*/*Shigella* genomes (March 2014) have a final total space requirement of ~58 GB ('ftp.ncbi.nlm.nih.gov' = ~18 GB; ./genbank = ~25 GB; ./refseq = ~16 GB)!

If you're new to the NCBI FTP site, you should read the excellent overview of microbial RefSeq genomes on NCBI's FTP site on Torsten Seemann's blog: http://thegenomefactory.blogspot.de/2012/07/navigating-microbial-genomes-on-ncbi.html.

You can also access an introductory talk on the microbial NCBI FTP resources at figshare (http://figshare.com/articles/Introduction_to_NCBI_s_FTP_server_for_bacterial_genomes/972893). It might be a good idea to read the blog post and have a look at the PDF to get a general idea of what's going on, but of course you can just run the scripts and work with the genome files.

## Usage

### 1.) Manually and consecutively

#### 1.1.) `wget`

Download RefSeq complete genomes (in fasta and genbank format):

    wget -cNrv -t 45 -A '*.gbk,*.fna' "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Genus_species*" -P .

Download RefSeq draft genomes as tarballs:

    wget -cNrv -t 45 -A '*.gbk.tgz,*.fna.tgz' "ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria_DRAFT/Genus_species*" -P .

The same procedure has to be followed for GenBank files, here complete genomes:

    wget -cNrv -t 45 -A '*.gbk,*.fna' "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Genus_species*" -P .

And finally download GenBank draft genomes:

    wget -cNrv -t 45 -A '*.gbk.tgz,*.fna.tgz' "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria_DRAFT/Genus_species*" -P .

#### 1.2.) `ncbi_ftp_concat_unpack.pl`

    perl ncbi_ftp_concat_unpack.pl refseq y
    perl ncbi_ftp_concat_unpack.pl genbank y

### 2.) With one command: `ncbi_ftp_download.sh` wrapper script

Some examples of how to use the shell script, e.g. download all *E. coli* genomes from NCBI's FTP server:

    ncbi_ftp_download.sh Escherichia_coli

Download all *B. cereus* genomes:

    ncbi_ftp_download.sh Bacillus_cereus

Download all *Paenibacillus* genomes:

    ncbi_ftp_download.sh Paenibacillus
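
Since the script takes a single name per run, several species can be fetched with a plain shell loop. A minimal sketch, assuming `ncbi_ftp_download.sh` is in the path; the wrapper function `download_all` is hypothetical, and its first argument (the command to run) exists only so that the loop can be tried out with `echo` instead of the real script.

```shell
# Hypothetical helper: run a download command once per genus/species name.
download_all() {
    cmd="$1"
    shift
    for name in "$@"; do
        # Stop at the first failing run instead of silently continuing
        "$cmd" "$name" || return 1
    done
}

# Dry run: print the names that would be passed to the script
download_all echo Escherichia_coli Bacillus_cereus Paenibacillus

# Real run (commented out, as the pipeline currently does not work):
# download_all ncbi_ftp_download.sh Escherichia_coli Bacillus_cereus Paenibacillus
```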

## Options

### *ncbi_ftp_concat_unpack.pl*

* genbank (as first argument)

    Copy GenBank genomes (from './ftp.ncbi.nlm.nih.gov/genbank/genomes/Bacteria\*') as (multi-)sequence files to the result folder './genbank'.

* refseq (as first argument)

    Copy RefSeq genomes (from './ftp.ncbi.nlm.nih.gov/genomes/Bacteria\*') as (multi-)sequence files to the result folder './refseq'.

* y (as second argument)

    Delete previous result folders and create new ones (otherwise, the script asks the user whether to proceed with overwriting).

## Output

### `ncbi_ftp_download.sh`

* './ftp.ncbi.nlm.nih.gov/'

    Mirrors NCBI's FTP server structure; the requested bacterial genome files are downloaded into this folder and its subfolders.

### `ncbi_ftp_concat_unpack.pl`

* './genbank'

    Result folder for all **GenBank** genomes

* './refseq'

    Result folder for all **RefSeq** genomes

* (seq_errors.txt)

    Lists \*.gbk and corresponding \*.fasta files with sequence size discrepancies.
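
A size check of this kind can be sketched in shell as below. This is an illustration only, not the check in `ncbi_ftp_concat_unpack.pl`: it assumes the sequence length is the third whitespace-separated field of each `LOCUS` line, and the function names, the demo files, and the line format written to the error file are all made up.

```shell
# Hypothetical sketch of a sequence size check between a (multi-)genbank
# file and its corresponding (multi-)fasta file.

# Sum of the lengths stated in all LOCUS lines (third field, in bp).
genbank_length() {
    awk '$1 == "LOCUS" { sum += $3 } END { print sum + 0 }' "$1"
}

# Number of sequence characters in the fasta file (header lines excluded).
fasta_length() {
    grep -v '^>' "$1" | tr -d '\n' | wc -c | tr -d ' '
}

# Append a line to the error file if the two sizes differ.
check_sizes() {
    gbk="$1"; fasta="$2"; err_file="$3"
    gbk_len=$(genbank_length "$gbk")
    fasta_len=$(fasta_length "$fasta")
    if [ "$gbk_len" -ne "$fasta_len" ]; then
        printf '%s (%s bp) vs. %s (%s bp)\n' \
            "$gbk" "$gbk_len" "$fasta" "$fasta_len" >> "$err_file"
    fi
}

# Demo with made-up files: the genbank file states 8 bp, the fasta holds 4
work=$(mktemp -d)
printf 'LOCUS       demo_seq   8 bp    DNA     linear   BCT 01-JAN-2000\n' > "$work/demo.gbk"
printf '>demo_seq\nACGT\n' > "$work/demo.fasta"
check_sizes "$work/demo.gbk" "$work/demo.fasta" "$work/seq_errors.txt"
cat "$work/seq_errors.txt"
```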

## Run environment

Both the Perl script and the bash-shell script run only under UNIX flavors.

## Dependencies (not in the core Perl modules)

* no extra dependencies

## Authors/contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Changelog

### *ncbi_ftp_concat_unpack.pl*

* v0.2.1 (13.07.2015)
    - Adapted all scripts to the new NCBI FTP server address: 'ftp://ftp.ncbi.nlm.nih.gov/'
* v0.2 (21.02.2013)
    - 'seq_errors.txt' error file if sequence size discrepancies between genbank and corresponding fasta file found
    - die with error if 'genbank|refseq' not given as first argument
    - print status message which genome is being processed and what file is kept for draft genomes (e.g. scaffold or contig etc.)
    - bug fixes to test for file existence before running code
    - changed usage to HERE document
* v0.1 (15.09.2012)