comparison PsiCLASS-1.0.2/README.md @ 0:903fc43d6227 draft default tip
Uploaded
author | lsong10 |
---|---|
date | Fri, 26 Mar 2021 16:52:45 +0000 |
parents | |
children |
-1:000000000000 | 0:903fc43d6227 |
---|---|
1 PsiCLASS | |
2 ======= | |
3 | |
4 Described in: | |
5 | |
6 Song, L., Sabunciyan, S., Yang, G. and Florea, L. [A multi-sample approach increases the accuracy of transcript assembly](https://www.nature.com/articles/s41467-019-12990-0). *Nat Commun* 10, 5000 (2019) | |
7 | |
8 Copyright (C) 2018- and GNU GPL by Li Song, Liliana Florea | |
9 | |
10 Includes portions copyright from: | |
11 | |
12 samtools - Copyright (C) 2008-, Genome Research Ltd, Heng Li | |
13 | |
14 Commands, scripts and supporting data for the paper can be found [here](https://github.com/splicebox/PsiCLASS_paper/). | |
15 | |
16 ### What is PsiCLASS? | |
17 | |
18 PsiCLASS is a reference-based transcriptome assembler for single or multiple RNA-seq samples. Unlike conventional methods that analyze each sample separately and then merge the outcomes to create a unified set of meta-annotations, PsiCLASS takes a multi-sample approach, simultaneously analyzing all RNA-seq data sets in an experiment. PsiCLASS is both a transcript assembler and a meta-assembler, producing separate transcript sets for the individual samples and a unified set of meta-annotations. The algorithmic underpinnings of PsiCLASS include a global subexon splice graph, statistical cross-sample feature (intron, subexon) selection methods, and an efficient dynamic programming algorithm that selects a subset of transcripts from among those encoded in the graph, based on the read support in each sample. Lastly, the set of meta-annotations is selected by voting from among the transcripts generated for the individual samples. While PsiCLASS is highly accurate and efficient for medium-to-large collections of RNA-seq data, its accuracy is equally high for small RNA-seq data sets (2-10 samples) and is competitive with reference methods for single samples. Additionally, its performance is robust to the aggregation method used, including the built-in voting and assembly-based approaches such as StringTie-merge and TACO. Therefore, it can be used effectively as a multi-sample or single-sample assembler, as well as in conventional assemble-and-merge protocols. | |
19 | |
20 ### Install | |
21 | |
22 1. Clone the [GitHub repo](https://github.com/splicebox/psiclass), e.g. with `git clone https://github.com/splicebox/psiclass.git` | |
23 2. Run `make` in the repo directory | |
24 | |
25 You will find the executable files in the downloaded directory. If you want to run PsiCLASS without specifying the directory, you can either add the directory of PsiCLASS to the environment variable PATH or create a soft link ("ln -s") of the file "psiclass" to a directory in PATH. | |
26 | |
27 PsiCLASS depends on [pthreads](http://en.wikipedia.org/wiki/POSIX_Threads) and samtools depends on [zlib](http://en.wikipedia.org/wiki/Zlib). | |
28 | |
29 | |
30 ### Usage | |
31 | |
32 Usage: ./psiclass [OPTIONS] | |
33 Required: | |
34 -b STRING: paths to the alignment BAM files; use comma to separate multiple BAM files | |
35 or | |
36 --lb STRING: path to the file listing the alignment BAM files | |
37 Optional: | |
38 -s STRING: path to the trusted splice file (default: not used) | |
39 -o STRING: prefix of output files (default: ./psiclass) | |
40 -p INT: number of threads (default: 1) | |
41 -c FLOAT: only use the subexons with classifier score <= the given number (default: 0.05) | |
42 --sa FLOAT: the minimum average number of supporting reads for retained introns (default: 0.5) | |
43 --vd FLOAT: the minimum average coverage depth of a transcript to be reported in voting (default: 1.0) | |
44 --maxDpConstraintSize: the number of subexons a constraint can cover in DP. (default: 7. -1 for inf) | |
45 --primaryParalog: use primary alignment to retain paralog genes (default: use unique alignments) | |
46 --version: print version and exit | |
47 --stage INT: (default: 0) | |
48 0-start from the beginning - building the splice site file for each sample | |
49 1-start from building the subexon file for each sample | |
50 2-start from combining the subexon files across samples | |
51 3-start from assembling the transcripts for each sample | |
52 4-start from voting the consensus transcripts across samples | |
53 | |
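As a small illustration of the two required input forms, the sketch below writes a BAM list file for `--lb` and assembles the comma-separated string that `-b` expects. The one-path-per-line layout of the list file and the names `bamlist.txt`, `s1.bam`, `s2.bam` and `s3.bam` are assumptions made for illustration; check `example/slist` in the distribution for the layout actually shipped. The commands are printed rather than executed.

```shell
#!/bin/sh
# Hypothetical sample alignments; replace with your own BAM paths.
set -- s1.bam s2.bam s3.bam

# Form 1: list file for --lb (assumed layout: one BAM path per line).
printf '%s\n' "$@" > bamlist.txt

# Form 2: comma-separated string for -b.
bam_csv=$(printf '%s,' "$@")
bam_csv=${bam_csv%,}   # drop the trailing comma

# Print the two equivalent invocations instead of running them.
echo "./psiclass --lb bamlist.txt -p 4 -o ./psiclass"
echo "./psiclass -b $bam_csv -p 4 -o ./psiclass"
```

The list-file form is usually easier to maintain for larger experiments, since the order of lines also fixes the numbering of the per-sample output files.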
54 ### Practical notes | |
55 | |
56 *Alignment compatibility.* PsiCLASS has been tuned to run on alignments generated with the tools [HISAT](https://ccb.jhu.edu/software/hisat/index.shtml) and [STAR](https://github.com/alexdobin/STAR). | |
57 | |
58 When running PsiCLASS with STAR alignments, run STAR with the option `--outSAMstrandField intronMotif`, which will include the XS field indicating the strand in the BAM alignments. Further, when including alignments with *non-canonical splice sites*, use the provided `addXS` executable to add the XS field: | |
59 | |
60 samtools view -h in.bam | ./addXS reference_genome.fa | samtools view -bS - > out.bam | |
61 | |
62 *Trusted introns from other sources.* By default, PsiCLASS determines a set of trusted introns from the input spliced alignments, to use in building the global subexon graph. Alternatively, the user can supply an external set of trusted introns with the `-s` option, for instance extracted from the GENCODE gene annotations or judiciously selected from the input data using a tool like [JULiP](https://github.com/Guangyu-Yang/JULiP). This file must contain three columns: | |
63 | |
64 chr_name start_site end_site | |
65 | |
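A quick sanity check can catch a malformed trusted-intron file before a long run. The sketch below assumes whitespace-separated columns, integer coordinates, and `start_site` smaller than `end_site`; none of these details is stated above, so treat them as assumptions. The file name `trusted_introns.txt` and its contents are made up for illustration.

```shell
#!/bin/sh
# Made-up example file in the three-column layout: chr_name start_site end_site
cat > trusted_introns.txt <<'EOF'
chr1 14829 14970
chr1 15038 15796
chr2 41627 42913
EOF

# Report the first line that is not "name integer integer" with start < end.
# awk exits with status 1 on the first bad line and 0 if every line passes.
if awk 'NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d is malformed: %s\n", NR, $0; bad = 1; exit
        }
        END { exit bad + 0 }' trusted_introns.txt
then
    intron_check=ok
else
    intron_check=failed
fi
echo "trusted intron file check: $intron_check"
```

A file that passes can then be supplied with `-s trusted_introns.txt`.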
66 *Voting optimization.* The default voting parameters have been calibrated and perform near-optimally on a wide variety of data, across varying levels of coverage and different library construction protocols. If further optimization is desired, one can rerun only the voting stage (`--stage 4`; see [Usage](#usage) above) with different cutoff values, such as `--vd`, and assess each result against a reference set of gene annotations, such as [GENCODE](https://www.gencodegenes.org). The program 'grader', included in the package, can be used for this purpose. Note that the per-sample sets of transcripts remain unchanged. | |
67 | |
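One way to organize such a sweep is sketched below. It assumes that the cutoff being tuned is `--vd`, that the earlier stages have already been run with the same BAM list and output prefix, and that each voting run overwrites `psiclass_vote.gtf`, so the result is copied aside after every run. The cutoff values are arbitrary, and the commands are only printed unless `DRY_RUN=0` is set. The `grader` step is left out because its arguments are not described here.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-run only the voting stage (--stage 4) for a few arbitrary --vd cutoffs.
sweep() {
    for vd in 0.5 1.0 2.0; do
        run ./psiclass --lb example/slist --stage 4 --vd "$vd"
        # Keep each meta-assembly under its own name for later comparison.
        run cp psiclass_vote.gtf "psiclass_vote_vd${vd}.gtf"
    done
}

sweep_log=$(sweep)
echo "$sweep_log"
```

Each saved `psiclass_vote_vd*.gtf` file can then be evaluated against the reference annotation, and the cutoff with the best trade-off kept.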
68 *Add gene name.* For many applications, it is desirable to associate the known (annotated) gene name with each transcript. PsiCLASS provides the program "add-genename" for this purpose. "add-genename" takes as input a GTF file containing a reference set of gene annotations and a file listing the raw GTF files, and generates a new GTF file for each raw GTF file by appending the annotated gene names. If a gene is not found in the annotation, "add-genename" uses "novel_INT" as its gene name. The program can be run as: | |
69 | |
70 ./add-genename annotation.gtf gtflist | |
71 | |
72 ### Input/Output | |
73 | |
74 The primary input to PsiCLASS is a set of BAM alignment files, one for each RNA-seq sample in the analysis. The program calculates a set of subexon files and a set of splice (intron) files, for the individual samples. (Optionally, one may specify a path to an external file of trusted introns as explained [above](#practical-notes).) The output consists of one GTF file of transcripts for each sample, and the GTF file of meta-annotations produced by voting, stored in the output directory: | |
75 | |
76 Sample-wise GTF files: (psiclass)_sample_{0,1,...,n-1}.gtf | |
77 Meta-assembly GTF file: (psiclass)_vote.gtf | |
78 | |
79 where indices 0,1,...,n-1 match the order of the input BAM files. | |
80 | |
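Because the per-sample files are numbered rather than named, it can help to print the mapping from input BAM to output GTF. The sketch below assumes a list file with one BAM path per line (the file `bamlist_demo.txt` and its entries are hypothetical) and the default output prefix `psiclass`. It only derives names from the pattern above and does not inspect any PsiCLASS output.

```shell
#!/bin/sh
prefix=psiclass   # default value of -o, written without the leading ./
printf '%s\n' liver.bam brain.bam heart.bam > bamlist_demo.txt

# Number the BAM files from 0 in input order and pair each with its expected GTF.
mapping=$(awk -v p="$prefix" '{ printf "%s\t%s_sample_%d.gtf\n", $0, p, NR - 1 }' bamlist_demo.txt)
echo "$mapping"
echo "meta-assembly: ${prefix}_vote.gtf"
```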
81 Subexon and splice (intron) files, and other auxiliary files, are in the subdirectories: | |
82 | |
83 Intron files: splice/* | |
84 Subexon graph files: subexon/* | |
85 Log file: (psiclass)_classes.log | |
86 | |
87 ### Example | |
88 | |
89 The directory './example' in this distribution contains two BAM files, along with an example of a BAM list file. Run PsiCLASS with: | |
90 | |
91 ./psiclass -b example/s1.bam,example/s2.bam | |
92 | |
93 or | |
94 | |
95 ./psiclass --lb example/slist | |
96 | |
97 The run will generate the files 'psiclass_sample_0.gtf' for 's1.bam', 'psiclass_sample_1.gtf' for 's2.bam', and the file 'psiclass_vote.gtf' containing the meta-assemblies. | |
98 | |
99 ### Terms of use | |
100 | |
101 This program is free software; you can redistribute it and/or modify it | |
102 under the terms of the GNU General Public License as published by the | |
103 Free Software Foundation; either version 2 of the License, or (at your | |
104 option) any later version. | |
105 | |
106 This program is distributed in the hope that it will be useful, | |
107 but WITHOUT ANY WARRANTY; without even the implied warranty of | |
108 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
109 GNU General Public License for more details. | |
110 | |
111 You should have received (LICENSE.txt) a copy of the GNU General | |
112 Public License along with this program; if not, you can obtain one from | |
113 http://www.gnu.org/licenses/gpl.txt or by writing to the Free Software | |
114 Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | |
115 | |
116 ### Support | |
117 | |
118 Create a [GitHub issue](https://github.com/splicebox/PsiCLASS/issues). |