annotate COG/bac-genomics-scripts/prot_finder/README.md @ 10:d103c41b6931 draft

Uploaded
author dereeper
date Thu, 30 May 2024 16:35:22 +0000
parents e42d30da7a74
children

prot_finder
===========

`prot_finder.pl` is a script to search for query protein homologs in annotated bacterial genomes with **BLASTP**. A companion bash shell script pipeline is available, `prot_finder_pipe.sh`.

Also included is the script `prot_binary_matrix.pl` to create a binary presence/absence matrix (e.g. for [**iTOL**](http://itol.embl.de/)) from the `prot_finder.pl` output. Additionally, two downstream scripts are provided to wrangle these binary presence/absence matrices: `transpose_matrix.pl` to transpose a delimited TEXT matrix and `binary_group_stats.pl` to get overall presence/absence statistics for groups of columns in a delimited binary TEXT matrix (in the style of [`po2group_stats.pl`](/po2group_stats)).

* [Synopsis](#synopsis)
    * [prot_finder synopsis](#prot_finder-synopsis)
    * [prot_binary_matrix synopsis](#prot_binary_matrix-synopsis)
    * [transpose_matrix synopsis](#transpose_matrix-synopsis)
    * [binary_group_stats synopsis](#binary_group_stats-synopsis)
* [Description](#description)
* [Usage](#usage)
    * [prot_finder usage](#prot_finder-usage)
        * [Manual consecutively](#manual-consecutively)
            * [cds_extractor for subject proteins](#cds_extractor-for-subject-proteins)
            * [Legacy BLASTP](#legacy-blastp)
            * [BLASTP plus](#blastp-plus)
            * [prot_finder](#prot_finder-1)
        * [prot_finder_pipe bash script pipeline](#prot_finder_pipe-bash-script-pipeline)
    * [prot_binary_matrix usage](#prot_binary_matrix-usage)
    * [transpose_matrix usage](#transpose_matrix-usage)
    * [binary_group_stats usage](#binary_group_stats-usage)
* [Options](#options)
    * [prot_finder.pl options](#prot_finderpl-options)
        * [Mandatory prot_finder.pl options](#mandatory-prot_finderpl-options)
        * [Optional prot_finder.pl options](#optional-prot_finderpl-options)
    * [prot_finder_pipe.sh options](#prot_finder_pipesh-options)
        * [Mandatory prot_finder_pipe.sh options](#mandatory-prot_finder_pipesh-options)
        * [Optional prot_finder_pipe.sh options](#optional-prot_finder_pipesh-options)
    * [prot_binary_matrix.pl options](#prot_binary_matrixpl-options)
    * [transpose_matrix.pl options](#transpose_matrixpl-options)
    * [binary_group_stats.pl options](#binary_group_statspl-options)
        * [Mandatory binary_group_stats.pl options](#mandatory-binary_group_statspl-options)
        * [Optional binary_group_stats.pl options](#optional-binary_group_statspl-options)
* [Output](#output)
    * [cds_extractor.pl output](#cds_extractorpl-output)
    * [prot_finder.pl output](#prot_finderpl-output)
    * [prot_finder_pipe.sh output](#prot_finder_pipesh-output)
    * [prot_binary_matrix.pl output](#prot_binary_matrixpl-output)
    * [transpose_matrix.pl output](#transpose_matrixpl-output)
    * [binary_group_stats.pl output](#binary_group_statspl-output)
* [Dependencies](#dependencies)
    * [`prot_finder.pl`/`prot_finder_pipe.sh` dependencies](#prot_finderplprot_finder_pipesh-dependencies)
    * [`binary_group_stats.pl` dependencies](#binary_group_statspl-dependencies)
* [Run environment](#run-environment)
* [Author - contact](#author---contact)
* [Citation, installation, and license](#citation-installation-and-license)
* [Acknowledgements](#acknowledgements)
* [Changelog](#changelog)
    * [prot_finder changelog](#prot_finder-changelog)
    * [prot_binary_matrix changelog](#prot_binary_matrix-changelog)
    * [transpose_matrix changelog](#transpose_matrix-changelog)
    * [binary_group_stats changelog](#binary_group_stats-changelog)

## Synopsis

### prot_finder synopsis

    ./prot_finder_pipe.sh -q query.faa (-s subject.faa|-f (embl|gbk)) > blast_hits.tsv

**or**

    perl prot_finder.pl -r report.blastp -s subject.faa > blast_hits.tsv

### prot_binary_matrix synopsis

    perl prot_binary_matrix.pl blast_hits.tsv > binary_matrix.tsv

**or**

    perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl > binary_matrix.tsv

### transpose_matrix synopsis

    perl transpose_matrix.pl input_matrix.tsv > input_matrix_transposed.tsv

**or**

    perl prot_binary_matrix.pl blast_hits.tsv | perl transpose_matrix.pl > binary_matrix_transposed.tsv

### binary_group_stats synopsis

    perl binary_group_stats.pl -i binary_matrix.tsv -g group_file.tsv -p > overall_stats.tsv

## Description

The script `prot_finder.pl` is intended to search for homologous proteins in annotated bacterial genomes. For this purpose, a [**BLASTP**](http://blast.ncbi.nlm.nih.gov/Blast.cgi) search, either [legacy or plus](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download), needs to be run beforehand with query protein sequences against a **BLASTP** database of subject proteins (e.g. all proteins from several *Escherichia coli* genomes).

The script [`cds_extractor.pl`](/cds_extractor) (with options **-p -f**) can be used to create multi-FASTA protein files of all non-pseudo CDS from RichSeq genome files to create the needed subject **BLASTP** database. Present locus tags will be used as FASTA IDs, but see [`cds_extractor.pl`](/cds_extractor) for a description of the format. Query protein sequences for the **BLASTP** need a **unique** FASTA ID.

The **BLASTP** report file (option **-r**), the subject protein multi-FASTA file (option **-s**), and optionally the query protein (multi-)FASTA file (option **-q**) are then given to `prot_finder.pl`. Significant **BLASTP** subject hits are filtered according to the given cutoffs (options **-i**, **-cov_q**, and **-cov_s**) and the result is printed as an informative tab-separated result table to *STDOUT*. To apply global identity/coverage cutoffs to subject hits, their high-scoring pairs (HSPs) are tiled (see http://www.bioperl.org/wiki/HOWTO:Tiling and http://search.cpan.org/dist/BioPerl/Bio/Search/Hit/GenericHit.pm). Additionally, the subject protein sequences with significant query hits are written to result multi-FASTA files, named according to the respective query FASTA IDs (optionally including the query sequence with option **-q**).
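
The cutoff filtering can be pictured on a toy table. This is only a sketch under assumptions: the four-column layout (subject ID, identity, query coverage, subject coverage) is hypothetical and is not the real `prot_finder.pl` output format, the function name `filter_hits` is made up, and treating a hit as significant when each value is at or above its cutoff is an assumption. The fallback values mirror the defaults of **-i**, **-cov_q**, and **-cov_s** (70, 70, and 0).

```shell
# Toy hit table (hypothetical layout): subject ID, % identity,
# % query coverage, % subject coverage, tab-separated.
hits=$(printf 'locus_A\t92\t98\t95\nlocus_B\t65\t99\t97\nlocus_C\t81\t40\t88\nlocus_D\t75\t70\t12\n')

# Keep hits whose identity and coverages are at or above the cutoffs
# (assumed semantics). Arguments: identity, query coverage, subject
# coverage; omitted arguments fall back to 70, 70, and 0.
filter_hits() {
    awk -F'\t' -v ident="${1:-70}" -v cov_q="${2:-70}" -v cov_s="${3:-0}" \
        '$2 >= ident && $3 >= cov_q && $4 >= cov_s { print $1 }'
}

kept_default=$(printf '%s\n' "$hits" | filter_hits)
kept_strict=$(printf '%s\n' "$hits" | filter_hits 70 70 80)

echo "default cutoffs:" $kept_default
echo "with subject coverage 80:" $kept_strict
```

With the toy values, `locus_B` fails the identity cutoff and `locus_C` the query coverage cutoff, while `locus_D` only drops out once a subject coverage cutoff is set.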

Optionally, [**Clustal Omega**](http://www.clustal.org/omega/) can be called (option **-a** with optional **-p**) to create multiple alignments (FASTA format) for each of the resulting multi-FASTA files. These alignments can be used to calculate phylogenies e.g. with **RAxML** (http://sco.h-its.org/exelixis/software.html) or **MEGA** (http://www.megasoftware.net/).

Run the script [`cds_extractor.pl`](/cds_extractor) (with options **-p -f**) and the **BLASTP** manually or use the bash shell wrapper script `prot_finder_pipe.sh` (see below ['prot_finder_pipe bash script pipeline'](#prot_finder_pipe-bash-script-pipeline)) to execute the whole pipeline including `prot_finder.pl` (with optional option **-q**). For additional options of the pipeline shell script see below ['prot_finder_pipe.sh options'](#prot_finder_pipesh-options). Be aware that some options in `prot_finder_pipe.sh` corresponding to options in `prot_finder.pl` have different names (**-c** instead of **-cov_q**, **-k** instead of **-cov_s**, and **-o** instead of **-p**; also **-f** has a different meaning).

If [`cds_extractor.pl`](/cds_extractor) is used in the pipeline (option **-f** of the shell script), the working folder has to contain the annotated bacterial genome subject files (in RichSeq format, e.g. EMBL or GENBANK format). Also, the Perl scripts [`cds_extractor.pl`](/cds_extractor) (only for `prot_finder_pipe.sh` option **-f**) and `prot_finder.pl` have to be either contained in the current working directory or installed in the global *PATH*. **BLASTP** (legacy and/or plus) and **Clustal Omega** binaries have to be installed in the global *PATH*, or for **Clustal Omega** you can give the path to the binary with option **-o**.

In the pipeline, **BLASTP** is run with **disabled** query filtering, locally optimal Smith-Waterman alignments, and the number of database sequences to show alignments for increased to 500 for [**BioPerl**](http://www.bioperl.org) parsing (legacy: **-F F -s T -b 500**, plus: **-seg no -use_sw_tback -num_alignments 500**). The pipeline script ends with the *STDERR* message 'Pipeline finished!'; if this is not the case, have a look at the log files in the result directory for errors.
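
Because the pipeline relies on external binaries being reachable through *PATH*, a quick check before a run can save a failed attempt. The helper below is a hypothetical convenience (the name `missing_tools` is not part of these scripts); the program names are the ones used in the commands of this README.

```shell
# Print every program from the argument list that cannot be found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Programs needed for the BLAST+ route with Clustal Omega alignments;
# for legacy BLAST check formatdb and blastall instead.
missing=$(missing_tools makeblastdb blastp clustalo)
if [ -n "$missing" ]; then
    echo "Not found in PATH:" $missing >&2
fi
```

If only `clustalo` is reported, its location can still be passed explicitly (option **-o** of the pipeline, **-p** of `prot_finder.pl`).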

The resulting tab-separated table with significant **BLASTP** hits (from `prot_finder.pl` or `prot_finder_pipe.sh`) can be given to the script `prot_binary_matrix.pl`, either as *STDIN* or as a file, to create a presence/absence matrix of the results. See below ['prot_binary_matrix.pl options'](#prot_binary_matrixpl-options) for the `prot_binary_matrix.pl` options. By default a tab-delimited binary presence/absence matrix for query hits per subject organism will be printed to *STDOUT*. Use option **-t** to count all query hits per subject organism, not just the binary presence/absence. This presence/absence matrix can be given to the script `transpose_matrix.pl`, either as *STDIN* or as a file, to transpose the matrix, i.e. rows will become columns and columns rows. Actually, `transpose_matrix.pl` can be used to transpose any delimited TEXT matrix (see below ['transpose_matrix.pl options'](#transpose_matrixpl-options)).
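
To make the transposition step concrete, here is a minimal awk sketch of what "rows will become columns and columns rows" means for a tab-delimited matrix. It is an illustration only, not the implementation of `transpose_matrix.pl`; the toy matrix and its orientation (organisms in rows, queries in columns) are assumptions.

```shell
# Toy tab-delimited matrix: first row holds column headers,
# first column holds row headers (orientation is an assumption).
matrix=$(printf 'organism\tquery1\tquery2\ngenomeA\t1\t0\ngenomeB\t1\t1\n')

# Transpose: the cell at (row r, column c) moves to (row c, column r).
transpose() {
    awk -F'\t' '
        { for (c = 1; c <= NF; c++) cell[NR, c] = $c; if (NF > ncol) ncol = NF }
        END {
            for (c = 1; c <= ncol; c++) {
                line = cell[1, c]
                for (r = 2; r <= NR; r++) line = line "\t" cell[r, c]
                print line
            }
        }'
}

transposed=$(printf '%s\n' "$matrix" | transpose)
echo "$transposed"
```

After transposing, the queries are the rows and the organisms the columns.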

The presence/absence matri(x|ces) can e.g. be loaded into the Interactive Tree Of Life website ([**iTOL**](http://itol.embl.de/)) to associate the data with a phylogenetic tree. [**iTOL**](http://itol.embl.de/) likes individual comma-separated input files, thus use `prot_binary_matrix.pl` options **-s -c** for this purpose. However, the organism names have to be identical to the names of the leaves of the phylogenetic tree, thus manual adaptation, e.g. in spreadsheet software (like [**LibreOffice Calc**](https://www.libreoffice.org/discover/calc/)), might be needed. **Careful**, subject organisms without a significant **BLASTP** hit won't be included in the `prot_finder.pl` result table and hence can't be included by `prot_binary_matrix.pl`. If needed, add them manually to the result matri(x|ces).
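
Adding organisms without any hit can also be scripted instead of done by hand. The sketch below assumes a tab-delimited matrix with organisms as row headers in the first column and a separate list with one organism name per line; both file layouts and the function name `add_missing_rows` are hypothetical and not produced by the scripts themselves.

```shell
# Print the matrix ($1) and append an all-zero row for every organism
# listed in $2 (one name per line) that has no row in the matrix.
add_missing_rows() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { if (FNR == 1) ncol = NF; seen[$1] = 1; print; next }
        !($1 in seen) {
            row = $1
            for (c = 2; c <= ncol; c++) row = row OFS "0"
            print row
        }
    ' "$1" "$2"
}

# Toy input files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'organism\tquery1\tquery2\ngenomeA\t1\t0\n' > "$tmpdir/matrix.tsv"
printf 'genomeA\ngenomeB\n' > "$tmpdir/organisms.txt"

completed=$(add_missing_rows "$tmpdir/matrix.tsv" "$tmpdir/organisms.txt")
echo "$completed"
rm -r "$tmpdir"
```

Here `genomeB` had no hit, so it is appended with a **0** in every query column.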

Finally, the script `binary_group_stats.pl` can be used to categorize columns of the binary presence/absence matrix from `prot_binary_matrix.pl` according to group affiliations. `binary_group_stats.pl` is based upon [`po2group_stats.pl`](/po2group_stats), which does the same thing for genomes in an ortholog/paralog output matrix from a [Proteinortho5](http://www.bioinf.uni-leipzig.de/Software/proteinortho/) calculation. Actually, `binary_group_stats.pl` can work with any delimited TEXT **binary** matrix (option **-i**). However, all fields of the binary matrix need to be filled with either a **0** indicating absence or a **1** indicating presence, i.e. all rows need to have the same number of columns. Use option **-d** to set the delimiter of the input matrix; the default is tab-delimited/separated matrices.
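
These requirements can be checked up front with a small awk sketch. It is not part of `binary_group_stats.pl`; the function name `is_binary_matrix` is hypothetical, and the check assumes a tab-delimited matrix whose first row and first column hold headers.

```shell
# Succeeds only if every row has as many fields as the header row
# and every data cell (fields 2..n of rows 2..m) is exactly 0 or 1.
is_binary_matrix() {
    awk -F'\t' '
        NR == 1    { ncol = NF; next }
        NF != ncol { bad = 1; exit }
        { for (c = 2; c <= NF; c++) if ($c != "0" && $c != "1") { bad = 1; exit } }
        END { exit bad + 0 }'
}

printf 'id\tA\tB\nq1\t1\t0\nq2\t0\t1\n' | is_binary_matrix && echo "matrix OK"
printf 'id\tA\tB\nq1\t2\t0\n' | is_binary_matrix || echo "not a binary matrix"
```

A count matrix created with `prot_binary_matrix.pl` option **-t** would fail this check as soon as a count above 1 occurs.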

Also, column headers in the first row and row headers in the first column are **mandatory** for the input binary matrix. Only alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters are allowed for the **column headers** and **group names** in the group file (option **-g**) to avoid downstream problems with the operating/file system. As a consequence, whitespace is not allowed in these either! Additionally, **column headers**, **row headers**, and **group names** need to be **unique**.
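
A quick way to spot offending column headers before running `binary_group_stats.pl` is sketched below. The function name `bad_column_headers` is hypothetical, and the sketch assumes a tab-delimited matrix in which the first field of the header row labels the row-header column and is therefore skipped.

```shell
# Print column headers (first row, fields 2..n) that contain characters
# other than a-z, A-Z, 0-9, underscore, dash, and period, or that repeat.
bad_column_headers() {
    head -n 1 | tr '\t' '\n' | tail -n +2 |
        LC_ALL=C awk '/[^A-Za-z0-9_.-]/ || seen[$0]++ { print }'
}

offending=$(printf 'id\tgenome_A\tE coli\tgenome_A\tK-12.v2\n' | bad_column_headers)
echo "$offending"
```

In the toy header row, `E coli` is reported because of the space and the second `genome_A` because it is a duplicate.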

The group affiliations of the columns are intended to get overall presence/absence statistics for groups of columns and not simply single columns of the matrix. Percentage inclusion (option **-cut_i**) and exclusion (option **-cut_e**) cutoffs can be set to define how strictly the presence/absence of column groups within a row is defined. Of course groups can also hold only single column headers to get single column statistics. Group affiliations are defined in a mandatory **tab-delimited** group input file (option **-g**), including the column headers from the input binary matrix, with a **minimum of two** and a **maximum of four** groups.
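
The idea behind the two cutoffs can be sketched for a single group in a single row. This is a simplified illustration under assumptions, not the categorization implemented in `binary_group_stats.pl` (see the *README.md* of `po2group_stats.pl` for the authoritative rules): it assumes the cutoffs are fractions between 0 and 1, as in the usage example with **-cut_i 0.7 -cut_e 0.2**, and the function name and the three labels are made up.

```shell
# Classify one group within one matrix row.
# $1 = number of group columns holding a 1, $2 = group size,
# $3 = inclusion cutoff, $4 = exclusion cutoff (both as fractions).
classify_group() {
    awk -v present="$1" -v size="$2" -v cut_i="$3" -v cut_e="$4" 'BEGIN {
        frac = present / size
        if (frac >= cut_i)      print "present"
        else if (frac <= cut_e) print "absent"
        else                    print "intermediate"
    }'
}

classify_group 8 10 0.7 0.2
classify_group 1 10 0.7 0.2
classify_group 5 10 0.7 0.2
```

Stricter settings (a higher inclusion and a lower exclusion cutoff) leave more rows in the intermediate zone, where the group is called neither present nor absent.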

See the *README.md* of [`po2group_stats.pl`](/po2group_stats) for an explanation of the logic behind the categorization and the resulting group binary matrix and Venn diagram of `binary_group_stats.pl` (and of course its [options](#binary_group_statspl-options) and [output](#binary_group_statspl-output)).

## Usage

### prot_finder usage

#### Manual consecutively

##### cds_extractor for subject proteins

    shopt -s extglob    # bash only: enables the @(...) and !(...) patterns used below
    for file in *.@(gbk|embl); do perl cds_extractor.pl -i "$file" -p -f; done
    cat *.faa > subject.faa
    rm !(subject).faa

##### Legacy **BLASTP**

    formatdb -p T -i subject.faa -n prot_finder_db
    blastall -p blastp -d prot_finder_db -i query.faa -o prot_finder.blastp -e 1e-10 -F F -s T -b 500

**or**

##### **BLASTP** plus

    makeblastdb -dbtype prot -in subject.faa -out prot_finder_db
    blastp -db prot_finder_db -query query.faa -out prot_finder.blastp -evalue 1e-10 -seg no -use_sw_tback -num_alignments 500

##### prot_finder

    perl prot_finder.pl -r prot_finder.blastp -s subject.faa -cov_s 80 > blast_hits.tsv

**or**

    perl prot_finder.pl -r prot_finder.blastp -s subject.faa -d result_dir -f -q query.faa -i 50 -cov_q 50 -b -a -p ~/bin/clustalo -t 6 > result_dir/blast_hits.tsv

#### prot_finder_pipe bash script pipeline

    ./prot_finder_pipe.sh -q query.faa -s subject.faa > blast_hits.tsv

**or**

    ./prot_finder_pipe.sh -q query.faa -f embl -d result_dir -p legacy -e 0 -t 12 -i 50 -c 50 -k 30 -b -a -o ~/bin/clustalo -m > result_dir/blast_hits.tsv

### prot_binary_matrix usage

    perl prot_binary_matrix.pl -s -d result_dir -t blast_hits.tsv

**or**

    perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c > binary_matrix.csv

**or**

    mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl > binary_matrix.tsv

### transpose_matrix usage

    perl transpose_matrix.pl -d ' ' -e NA input_matrix_space-delimit.txt > input_matrix_space-delimit_transposed.txt

**or**

    perl prot_finder.pl -r report.blastp -s subject.faa | perl prot_binary_matrix.pl -l -c | perl transpose_matrix.pl -d , > binary_matrix_transposed.csv

**or**

    mkdir result_dir && ./prot_finder_pipe.sh -q query.faa -s subject.faa -d result_dir -m | tee result_dir/blast_hits.tsv | perl prot_binary_matrix.pl | tee result_dir/binary_matrix.tsv | perl transpose_matrix.pl > result_dir/binary_matrix_transposed.tsv

Transpose all matrices in a folder:

    for matrix in *.tsv; do perl transpose_matrix.pl "$matrix" > "${matrix%.*}_transposed.tsv"; done

### binary_group_stats usage

    perl binary_group_stats.pl -i binary_matrix_transposed.csv -g group_file.tsv -d , -r result_dir -cut_i 0.7 -cut_e 0.2 -b -p -co -s -u -a > overall_stats.tsv

## Options

### `prot_finder.pl` options

#### Mandatory `prot_finder.pl` options

- **-r**=_str_, **-report**=_str_

Path to **BLASTP** report/output

- **-s**=_str_, **-subject**=_str_

Path to subject multi-FASTA protein sequence file (\*.faa) created with [`cds_extractor.pl`](/cds_extractor) (and its options **-p -f**), which was used to create the **BLASTP** database

#### Optional `prot_finder.pl` options

- **-h**, **-help**

Help (perldoc POD)

- **-d**=_str_, **-dir_result**=_str_

Path to result folder [default = query identity and coverage cutoffs, './results_i#_cq#']

- **-f**, **-force_dir**

Force output to an existing result folder, otherwise ask user to remove content of existing folder. Careful, files from a previous analysis might not be overwritten if different to current analysis.

- **-q**=_str_, **-query**=_str_

Path to query (multi-)FASTA protein sequence file (\*.faa) with **unique** FASTA IDs, which was used as query in the **BLASTP**. Will include each query protein sequence in the respective multi-FASTA 'query-ID_hits.faa' result file.

- **-b**, **-best_hit**

Give only the best hit (i.e. highest identity) for each subject sequence if a subject has several hits with different queries

- **-i**=_int_, **-ident_cutoff**=_int_

Query identity cutoff for significant hits (not including gaps), has to be an integer number >= 0 and <= 100 [default = 70]

- **-cov_q**=_int_, **-cov_query_cutoff**=_int_

Query coverage cutoff, has to be an integer number >= 0 and <= 100 [default = 70]

- **-cov_s**=_int_, **-cov_subject_cutoff**=_int_

Subject/hit coverage cutoff, has to be an integer >= 0 and <= 100 [default = 0]

- **-a**, **-align_clustalo**

Call [**Clustal Omega**](http://www.clustal.org/omega/) for multiple alignment of each 'query-ID_hits.faa' result file

- **-p**=_str_, **-path_clustalo**=_str_

Path to executable **Clustal Omega** binary if not present in global *PATH* variable; requires option **-a**

- **-t**=_int_, **-threads_clustalo**=_int_

Number of threads for **Clustal Omega** to use; requires option **-a** [default = all processors on system]

- **-v**, **-version**

Print version number to *STDERR*

### `prot_finder_pipe.sh` options

#### Mandatory `prot_finder_pipe.sh` options

- **-q**=_str_

Path to query protein (multi-)FASTA file (\*.faa) with **unique** FASTA IDs

- **-f**=_str_

File extension for files in the **current** working directory to use for [`cds_extractor.pl`](/cds_extractor) (e.g. 'embl' or 'gbk'); excludes shell script option **-s**

**or**

- **-s**=_str_

Path to subject protein multi-FASTA file (\*.faa) already created with [`cds_extractor.pl`](/cds_extractor) (and its options **-p -f**), will not run [`cds_extractor.pl`](/cds_extractor); excludes shell script option **-f**

#### Optional `prot_finder_pipe.sh` options

- **-h**

Print usage

- **-d**=_str_

Path to result folder [default = query identity and coverage cutoffs, './results_i#_cq#']

- **-p**=(legacy|plus)

**BLASTP** suite to use [default = plus]

e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
279 - **-e**=_real_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
280
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
281 E-value for **BLASTP** [default = 1e-10]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
282
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
283 - **-t**=_int_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
284
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
285 Number of threads to be used for **BLASTP** and **Clustal Omega** [default = all processors on system]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
286
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
287 - **-i**=_int_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
288
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
289 Query identity cutoff for significant hits [default = 70]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
290
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
291 - **-c**=_int_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
292
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
293 Query coverage cutoff (corresponds to [`prot_finder.pl` option](#optional-prot_finderpl-options) **-cov_q**) [default = 70]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
294
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
295 - **-k**=_int_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
296
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
297 Subject coverage cutoff (corresponds to [`prot_finder.pl` option](#optional-prot_finderpl-options) **-cov_s**) [default = 0]
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
298
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
299 - **-b**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
300
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
301 Give only the best hit (i.e. highest identity) for each subject sequence
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
302
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
303 - **-a**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
304
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
305 Multiple alignment of each multi-FASTA result file with [**Clustal Omega**](http://www.clustal.org/omega/)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
306
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
307 - **-o**=_str_
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
308
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
309 Path to executable **Clustal Omega** binary if not in global *PATH*; requires shell script option **-a** (corresponds to [`prot_finder.pl` option](#optional-prot_finderpl-options) **-p**)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
310
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
311 - **-m**
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
312
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
313 Clean up all non-essential files, see below ['`prot_finder_pipe.sh` output'](#prot_finder_pipesh-output)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
314
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
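How the cutoff options **-i**, **-c**, **-k**, and **-b** interact can be pictured with a minimal Python sketch. It is not the pipeline's Perl code: the `Hit` record, its field names, and the use of `>=` for the cutoff comparisons are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLASTP hit (field names are illustrative)."""
    query: str
    subject: str
    identity: float  # percent identity
    cov_q: float     # percent query coverage
    cov_s: float     # percent subject coverage


def filter_hits(hits, identity=70, cov_q=70, cov_s=0, best_only=False):
    """Keep hits that pass the cutoffs; optionally keep one best hit per subject."""
    # Assumption: a hit passes when it reaches a cutoff (>=), not only above it.
    kept = [h for h in hits
            if h.identity >= identity and h.cov_q >= cov_q and h.cov_s >= cov_s]
    if not best_only:
        return kept
    # Mirrors option -b: the highest identity wins for each subject sequence.
    best = {}
    for h in kept:
        if h.subject not in best or h.identity > best[h.subject].identity:
            best[h.subject] = h
    return list(best.values())


hits = [
    Hit("queryA", "subj1", 95.0, 100.0, 80.0),
    Hit("queryB", "subj1", 72.0, 90.0, 60.0),
    Hit("queryA", "subj2", 65.0, 100.0, 99.0),  # fails the identity cutoff
]
significant = filter_hits(hits)
best_hits = filter_hits(hits, best_only=True)
```

With the default cutoffs, both hits on `subj1` are significant, but with `best_only=True` only the `queryA` hit survives. In this toy data `queryB` then has a significant hit but no best hit for any subject.
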
### `prot_binary_matrix.pl` options

- **-h**, **-help**

    Help (perldoc POD)

- **-s**, **-separate**

    Separate presence/absence files for each query protein printed to the result directory [default without **-s** = *STDOUT* matrix for all query proteins combined]

- **-d**=_str_, **-dir_result**=_str_

    Path to result folder, requires option **-s** [default = './binary_matrix_results']

- **-t**, **-total**

    Count total occurrences of query proteins, not just presence/absence binary

- **-c**, **-csv**

    Output matri(x|ces) in comma-separated format (\*.csv) instead of tab-delimited format (\*.tsv)

- **-l**, **-locus_tag**

    Use the locus_tag **prefixes** in the subject_ID column of the `prot_finder.pl` output (instead of the subject_organism column) as organism IDs to associate query hits to organisms. The subject_ID column will include locus_tags if they're annotated for a genome (see the [`cds_extractor.pl`](/cds_extractor) format description). Useful if the [`cds_extractor.pl`](/cds_extractor) output doesn't include strain names for 'o=' in the FASTA IDs, because the prefix of a locus_tag should be unique for a genome (see http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation).

- **-v**, **-version**

    Print version number to *STDERR*

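The effect of options **-t**, **-c**, and **-l** can be sketched in a few lines of Python. This is only an illustration and not the script's implementation. It makes several assumptions: organisms are laid out as rows and query proteins as columns, the header cell is called `organism`, the locus_tags are made-up examples, and a locus_tag prefix is taken to be everything before the first underscore.

```python
from collections import Counter


def locus_tag_prefix(locus_tag):
    """Return the part before the first underscore (assumed prefix_number layout)."""
    return locus_tag.partition("_")[0]


def build_matrix(hits, total=False):
    """hits: iterable of (organism_id, query_id) pairs -> header row plus data rows."""
    counts = Counter(hits)
    organisms = sorted({org for org, _ in counts})
    queries = sorted({query for _, query in counts})
    rows = [["organism"] + queries]
    for org in organisms:
        cells = []
        for query in queries:
            n = counts[(org, query)]
            # With total=True report the hit count (option -t), else cap at 1.
            cells.append(n if total else min(n, 1))
        rows.append([org] + cells)
    return rows


def to_text(rows, csv=False):
    """Join cells with commas (option -c) or tabs (default)."""
    sep = "," if csv else "\t"
    return "\n".join(sep.join(str(cell) for cell in row) for row in rows)


# Organism IDs derived from locus_tags, in the spirit of option -l.
hits = [(locus_tag_prefix(tag), query) for tag, query in [
    ("ECS88_0001", "queryA"), ("ECS88_0417", "queryA"), ("ECUMN_0020", "queryB"),
]]
binary = build_matrix(hits)
totals = build_matrix(hits, total=True)
```

Here `binary` records `queryA` as present (`1`) for `ECS88`, whereas `totals` reports both occurrences (`2`).
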
### `transpose_matrix.pl` options

- **-h**, **-help**

    Help (perldoc POD)

- **-d**=_str_, **-delimiter**=_str_

    Set delimiter of input and output matrix (e.g. comma ',', single space ' ' etc.) [default = tab-delimited/separated]

- **-e**=_str_, **-empty**=_str_

    Fill empty cells of the input matrix with a value in the transposed matrix (e.g. 'NA', '0' etc.)

- **-v**, **-version**

    Print version number to *STDERR*

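What the **-d** and **-e** options do to a matrix can be shown with a short Python sketch. It is an independent illustration of transposing delimited text, not a port of `transpose_matrix.pl`. The function name and the choice to pad short rows before transposing are assumptions.

```python
def transpose(text, delimiter="\t", empty=None):
    """Transpose a delimited text matrix; optionally fill empty cells."""
    rows = [line.split(delimiter) for line in text.splitlines()]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    if empty is not None:
        padded = [[cell if cell != "" else empty for cell in row] for row in padded]
    # zip(*padded) turns rows into columns.
    return "\n".join(delimiter.join(col) for col in zip(*padded))


matrix = "id,q1,q2\ng1,1,\ng2,,0"
flipped = transpose(matrix, delimiter=",", empty="NA")
```

For the comma-separated input above, `flipped` has the rows `id,g1,g2`, `q1,1,NA`, and `q2,NA,0`.
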
### `binary_group_stats.pl` options

#### Mandatory `binary_group_stats.pl` options

- **-i**=*str*, **-input**=*str*

    Input delimited TEXT binary matrix (e.g. \*.tsv, \*.csv, or \*.txt), see option **-d**

- **-g**=*str*, **-groups_file**=*str*

    Tab-delimited file with group affiliation for the columns from the input binary matrix, with a **minimum of two** and a **maximum of four** groups (easiest to create in spreadsheet software and save in tab-separated format). **All** column headers from the input binary matrix need to be included. Column headers and group names can only include alphanumeric (a-z, A-Z, 0-9), underscore (_), dash (-), and period (.) characters (no whitespace allowed either). Example format with two column headers in group A, three each in groups B and D, and one in group C:

    group\_A&emsp;group\_B&emsp;group\_C&emsp;group\_D<br>
    column\_header1&emsp;column\_header9&emsp;column\_header3&emsp;column\_header8<br>
    column\_header7&emsp;column\_header6&emsp;&emsp;column\_header5<br>
    &emsp;column\_header4&emsp;&emsp;column\_header2

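The groups file layout described above can be checked with a small Python sketch. The parser below is hypothetical and not part of `binary_group_stats.pl`. It only encodes the stated rules: the first row holds the group names, there are two to four groups, and names are restricted to a fixed character set. The error messages are likewise invented for this example.

```python
import re

# Allowed characters per the description: a-z, A-Z, 0-9, underscore, dash, period.
NAME = re.compile(r"[A-Za-z0-9_.-]+")


def read_groups(text):
    """Parse a tab-delimited groups table: first row = group names, then members."""
    lines = text.splitlines()
    names = lines[0].split("\t")
    if not 2 <= len(names) <= 4:
        raise ValueError("need between two and four groups")
    groups = {name: [] for name in names}
    for line in lines[1:]:
        for name, cell in zip(names, line.split("\t")):
            if cell:  # an empty cell just means this group has fewer members
                groups[name].append(cell)
    labels = list(groups) + [m for members in groups.values() for m in members]
    for label in labels:
        if not NAME.fullmatch(label):
            raise ValueError(f"invalid name: {label!r}")
    return groups


example = (
    "group_A\tgroup_B\tgroup_C\tgroup_D\n"
    "column_header1\tcolumn_header9\tcolumn_header3\tcolumn_header8\n"
    "column_header7\tcolumn_header6\t\tcolumn_header5\n"
    "\tcolumn_header4\t\tcolumn_header2\n"
)
groups = read_groups(example)
```

Parsing the example yields two members for `group_A`, three each for `group_B` and `group_D`, and one for `group_C`.
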
#### Optional `binary_group_stats.pl` options

- **-h**, **-help**

    Help (perldoc POD)

- **-d**=*str*, **-delimiter**=*str*

    Set delimiter of input binary matrix (e.g. comma ',', single space ' ' etc.) [default = tab-delimited/separated]

- **-r**=*str*, **-result\_dir**=*str*

    Path to result folder \[default = named after the inclusion and exclusion percentage cutoffs, './results\_i#\_e#'\]

- **-cut\_i**=*float*, **-cut\_inclusion**=*float*

    Percentage inclusion cutoff for column presence counts in a group per row, has to be > 0 and <= 1. The cutoff will be rounded according to the number of column headers in each group and has to be > the rounded exclusion cutoff in this group. \[default = 0.9\]

- **-cut\_e**=*float*, **-cut\_exclusion**=*float*

    Percentage exclusion cutoff, has to be >= 0 and < 1. The rounded cutoff has to be < the rounded inclusion cutoff. \[default = 0.1\]

- **-b**, **-binary\_group\_matrix**

    Print a group binary matrix with the presence/absence column group results according to the cutoffs (excluding 'unspecific' category rows)

- **-p**, **-plot\_venn**

    Plot a venn diagram from the group binary matrix (except 'unspecific' and 'underrepresented' categories) with function `venn` from **R** package **gplots**, requires option **-b**

- **-co**, **-core\_strict**

    Include 'strict core' category in output for rows where **all** columns have a '1'

- **-s**, **-singletons**

    Include singleton/column-specific rows for each column header in the output; also activates overall column header presence ('1') counts in the final stats matrix for columns with singletons

- **-u**, **-unspecific**

    Include 'unspecific' category in output

- **-a**, **-all\_column\_presence\_overall**

    Report overall presence counts for all column headers (appended to the final stats matrix), also those without singletons; will include all overall column header presence counts without option **-s**

- **-v**, **-version**

    Print version number to *STDERR*

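How the fractional cutoffs **-cut\_i** and **-cut\_e** translate into column counts for a group can be worked through in Python. This is a sketch under assumptions, not the script's code. It assumes round-half-up rounding (the exact rounding rule is not stated here), and the per-group labels 'present', 'absent', and 'unspecific' with their `>=`/`<=` comparisons are assumed as well.

```python
import math


def rounded_cutoffs(group_size, cut_i=0.9, cut_e=0.1):
    """Turn the fractional cutoffs into column counts for one group."""
    if not 0 < cut_i <= 1:
        raise ValueError("inclusion cutoff has to be > 0 and <= 1")
    if not 0 <= cut_e < 1:
        raise ValueError("exclusion cutoff has to be >= 0 and < 1")
    # Assumption: round half up; the script may round differently.
    incl = math.floor(cut_i * group_size + 0.5)
    excl = math.floor(cut_e * group_size + 0.5)
    if incl <= excl:
        raise ValueError("rounded inclusion cutoff has to exceed the rounded exclusion cutoff")
    return incl, excl


def group_call(presence_count, group_size, cut_i=0.9, cut_e=0.1):
    """Classify one row within one group (labels and comparisons are assumptions)."""
    incl, excl = rounded_cutoffs(group_size, cut_i, cut_e)
    if presence_count >= incl:
        return "present"
    if presence_count <= excl:
        return "absent"
    return "unspecific"


ten_columns = rounded_cutoffs(10)   # default cutoffs, group with ten columns
three_columns = rounded_cutoffs(3)  # default cutoffs, group with three columns
```

With the default cutoffs, a group of ten columns gives counts of 9 (inclusion) and 1 (exclusion), while a group of three gives 3 and 0. In the smaller group a row must therefore be present in every column to count as included.
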
## Output

### `cds_extractor.pl` output

- \*.faa

    Multi-FASTA file(s) of subject CDS protein sequences; will be removed with [`prot_finder_pipe.sh` option](#optional-prot_finder_pipesh-options) **-m**

For optional error files from [`cds_extractor.pl`](/cds_extractor) see its documentation.

### `prot_finder.pl` output

- *STDOUT*

    The resulting tab-delimited output table with the significant subject **BLASTP** hits is printed to *STDOUT*. Redirect (e.g. to a file in the result directory, options **-d -f**) or pipe into another tool as needed (e.g. `prot_binary_matrix.pl`).

- ./results_i#_cq#

    All output files are stored in a result folder

- ./results_i#_cq#/query-ID_hits.faa

    Multi-FASTA protein files of significant subject hits for each query protein (named after the respective query FASTA ID); optionally include the respective query protein sequence (with option **-q**)

- subject.faa.idx

    Index file of the subject protein file for fast sequence retrieval (can be deleted if no further **BLASTPs** are needed with these subject sequences); will be removed with [`prot_finder_pipe.sh` option](#optional-prot_finder_pipesh-options) **-m**

- (./results_i#_cq#/queries_no_blastp-hits.txt)

    Lists all query sequence IDs without significant subject hits; with option **-b** also includes queries with significant hits but *without* a best blast hit for a subject

- (./results_i#_cq#/clustal_omega.log)

    Optional log file of verbose **Clustal Omega** *STDOUT*/*STDERR* messages; will be removed with [`prot_finder_pipe.sh` option](#optional-prot_finder_pipesh-options) **-m**

- (./results_i#_cq#/query-ID_aln.fasta)

    Optional **Clustal Omega** multiple alignment of each 'query-ID_hits.faa' result file in FASTA alignment format

- (./results_i#_cq#/query-ID_tree.nwk)

    Optional **Clustal Omega** NJ guide tree in Newick format

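The relationship between a hits file and the optional alignment outputs can be made concrete with a Python sketch that builds a **Clustal Omega** call without running it. The exact arguments `prot_finder.pl` passes are not documented here, so the argument list is an assumption. It uses only the standard `clustalo` options `-i`, `-o`, `--guidetree-out`, `--outfmt`, `--threads`, and `-v`. The function name and the query ID `queryA` are hypothetical.

```python
from pathlib import PurePosixPath


def clustalo_command(query_id, result_dir, threads=None, clustalo="clustalo"):
    """Build (but do not run) an argument list for aligning one hits file."""
    result_dir = PurePosixPath(result_dir)
    hits = result_dir / f"{query_id}_hits.faa"       # input: significant hits
    alignment = result_dir / f"{query_id}_aln.fasta"  # output: FASTA alignment
    tree = result_dir / f"{query_id}_tree.nwk"        # output: guide tree (Newick)
    cmd = [
        clustalo,
        "-i", str(hits),
        "-o", str(alignment),
        f"--guidetree-out={tree}",
        "--outfmt=fasta",
        "-v",  # verbose messages, e.g. for a log file
    ]
    if threads is not None:
        cmd.append(f"--threads={threads}")
    return cmd


cmd = clustalo_command("queryA", "results_i70_cq70", threads=4)
```

The returned list could be passed to `subprocess.run` on a system where `clustalo` is installed. Setting `clustalo` to a full path plays the role of option **-p**.
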
### `prot_finder_pipe.sh` output

In addition to the [`cds_extractor.pl` output](#cds_extractorpl-output) and the [`prot_finder.pl` output](#prot_finderpl-output) the pipeline also creates the following non-essential output files, which will be removed with [`prot_finder_pipe.sh` option](#optional-prot_finder_pipesh-options) **-m**:

- ./results_i#_cq#/cds_extractor.log

    Log file of [`cds_extractor.pl`](/cds_extractor) *STDOUT*/*STDERR* messages (with `prot_finder_pipe.sh` option **-f**)

- ./results_i#_cq#/prot_finder.faa

    Concatenated [`cds_extractor.pl`](/cds_extractor) output files to create the subject **BLASTP** database

- prot_finder_db.phr, prot_finder_db.pin, prot_finder_db.psq

    **BLASTP** database files from the concatenated subject sequences ('prot_finder.faa')

- formatdb.log, error.log **or** ./results_i#_cq#/makeblastdb.log

    Legacy **BLASTP** or **BLASTP+** log files

- ./results_i#_cq#/prot_finder.blastp

    **BLASTP** report/output

- ./results_i#_cq#/prot_finder.log

    Log file of `prot_finder.pl` *STDOUT*/*STDERR* messages

### `prot_binary_matrix.pl` output

- *STDOUT*

    The resulting presence/absence matrix is printed to *STDOUT* without option **-s**. Redirect or pipe into another tool as needed.

- (./binary_matrix_results)

    Separate query presence/absence files are stored in a result folder with option **-s**

- (./binary_matrix_results/query-ID_binary_matrix.(tsv|csv))

    Separate query presence/absence files with option **-s**

### `transpose_matrix.pl` output

- *STDOUT*

    The transposed matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed.

### `binary_group_stats.pl` output

- *STDOUT*

    The tab-delimited final stats matrix is printed to *STDOUT*. Redirect or pipe into another tool as needed.

- ./results_i#_e#

    All output files are stored in a results folder

- ./results_i#_e#/[\*_specific|\*_absent|cutoff_core|underrepresented]_rows.txt

    Files including the row headers for rows in non-optional categories

- (./results_i#_e#/[\*_singletons|strict_core|unspecific]_rows.txt)

    Optional category output files with the respective row headers

- (./results_i#_e#/binary_matrix.tsv)

    Tab-delimited binary matrix of group presence/absence results according to cutoffs (excluding 'unspecific' rows)

- (./results_i#_e#/venn_diagram.pdf)

    Venn diagram for non-optional categories (except 'unspecific' and 'underrepresented' categories)

## Dependencies

### `prot_finder.pl`/`prot_finder_pipe.sh` dependencies

- [**BioPerl**](http://www.bioperl.org) (tested version 1.006923)
- [`cds_extractor`](/cds_extractor) (tested version 0.7.1)
- [**BLASTP**](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
    - legacy **BLASTP** (tested version 2.2.26)
    - **BLASTP+** (tested version 2.2.28+)
- [**Clustal Omega**](http://www.clustal.org/omega/) (tested version 1.2.1)

### `binary_group_stats.pl` dependencies

- **Statistical computing language [R](http://www.r-project.org/)**

    `Rscript` is needed to plot the venn diagram with option **-p**, tested with version 3.2.2

- **[gplots](https://cran.r-project.org/web/packages/gplots/index.html)**

    Package needed for **R** to plot the venn diagram, includes function `venn`. Tested with **gplots** version 2.17.0.

## Run environment

The scripts run under UNIX flavors.

## Author - contact

Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)

## Citation, installation, and license

For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).

## Acknowledgements

The Perl implementation for transposing a matrix on Stack Overflow was very useful for `transpose_matrix.pl`: https://stackoverflow.com/questions/1729824/transpose-a-file-in-bash

586 ## Changelog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
587
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
588 ### prot_finder changelog
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
589
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
590 * v0.7.1 (05.04.2016)
e42d30da7a74 Uploaded
dereeper
parents:
diff changeset
591 * bug fix: significant but non-best blast hits with option **-b** now listed in *queries_no_blastp-hits.txt*
* v0.7 (23.11.2015)
    * changed script name from `blast_prot_finder.pl` to `prot_finder.pl`
    * fixed bug introduced in v0.6 with a `seek`, because option **-query** didn't pull query sequences anymore
    * included version switch
    * included 'use autodie;' pragma
    * included `pod2usage` with Pod::Usage and `pod2usage`-die for Getopt::Long call
    * tab-delimited output table with filtered **BLASTP** hits now printed to *STDOUT* instead of file and can be used as *STDIN* for `prot_binary_matrix.pl`
    * result files now created in result directory to unclutter working dir (new options **-dir_result** and **-force_dir**) and replaced subroutine 'file_exist' with 'empty_dir' (also no *STDOUT* message which files were created anymore)
    * new output file *queries_no_blastp-hits.txt* to list all queries without **BLASTP** hits
    * new output file *clustal_omega.log* to log **Clustal Omega** verbose output
    * additional new options, **-path_clustalo** and **-threads_clustalo**
    * major code changes/restructuring and additions:
        * POD syntax changes and additions for new code
        * Perl syntax changes and removing code redundancies
        * including script run status messages
        * changes to subroutine 'split_fasta_header' to work more robustly with `split` instead of a regex and adapted to `cds_extractor.pl` v0.7+ output format
        * included several additional option and input checks
    * replaced simple `blast_prot_finder_legacy.sh` with more elaborate bash script pipeline `prot_finder_pipe.sh`
    * changed some output column names and moved the subject_ID column before the subject_gene column
* v0.6 (10.06.2013)
    * included BioPerl's 'frac\*' methods, which include HSP tiling, to correct bug in query coverage and identity calculations for **whole** hits
    * corrected bug to use **whole** hit e-value and not just first HSP e-value of a significant hit
    * option **-cov_subject** for subject/hit coverage cutoff (not only query coverage cutoff)
    * more info in POD
* v0.5 (14.05.2013)
    * optionally include query proteins in result subject multi-FASTA files
    * new option **-best_hit** to include only best **BLASTP** hit (highest identity) for each subject locus_tag, if a subject protein has hits to several query proteins
    * print queries with no subject **BLASTP** hits to *STDOUT*
* v0.4 (24.01.2013)
    * corrected bug in hash structure to store **BLASTP** hits, query acc/IDs (keys) and array reference of subject locus_tags (values), as several queries can have the same subject locus_tag as hit
    * include 'subject protein_function' in 'blast_hits.txt' output
* v0.3 (12.09.2012)
    * **original** script name `blast_prot_finder.pl`
    * **BLASTP** hits are stored in a hash with subject locus_tags (keys) and query accessions/IDs (values)

### prot_binary_matrix changelog

* v0.6 (23.11.2015)
    * adapted to `prot_finder.pl` v0.7 output
    * included a POD, `pod2usage` with Pod::Usage and `pod2usage`-die for Getopt::Long call
    * Perl syntax changes and some simpler loop structures
    * Removed option **-input** and instead accept `prot_finder.pl` output as *STDIN* or file as argument
    * Removed option **-c|-collective** (actually repurposed, see below), option **-separate** to indicate separate output is enough
    * **Without** option **-separate** result matrix is now printed to *STDOUT*
    * **With** option **-separate** result matrices now printed to result directory (name optionally given with new option **-dir_result**)
    * New option **-locus_tag** to use the locus_tag prefix in column subject_ID of the `prot_finder.pl` output as ID, which is in most cases the locus_tag, instead of the subject_organism; checks whether the locus_tags follow NCBI standards
    * Output now default tab-separated format (*tsv*), optionally with new option **-csv** comma-separated (*csv*)
    * Removed subroutine 'result' and statement which result files have been created
* v0.5 (05.03.2014)
    * option **-total** to count all occurrences of query proteins within one organism (paralogs), instead of just binary presence/absence
    * enforce mandatory options
    * changed script name to 'prot_binary_matrix.pl'
    * version switch
    * included 'use autodie'
    * options with Getopt::Long
    * changed usage to HERE document
* v0.4 (29.04.2013)
    * adapted to new *prot_finders* 'blast_hits.txt' layout, which includes an additional column for 'subject protein_function'
    * changed script name to 'blast_binary_matrix.pl'
* v0.3 (15.02.2013)
    * status of created result files in *STDOUT* for option **-s**
* v0.2 (21.12.2012)
    * prot_finder output file 'blast_hits.txt' doesn't have to be ordered by query protein accessions/IDs anymore
* v0.1 (25.10.2012)
    * **original** script name 'blastp_iTOL_binary.pl'

### transpose_matrix changelog

* v0.1 (12.04.2016)

### binary_group_stats changelog

* v0.1 (06.06.2016)