comparison Manual @ 0:28d1a6f8143f draft

planemo upload for repository https://github.com/portiahollyoak/Tools commit 132bb96bba8e7aed66a102ed93b7744f36d10d37-dirty
author portiahollyoak
date Mon, 25 Apr 2016 13:08:56 -0400
1 TEMP (Transposable Element Movement in Population) Manual
2
3
4 2015.01.09
5
6
7 TEMP is a software package designed to 1) detect transposable element (TE) insertions and absences relative to the reference genome, 2) define the genome-TE junctions at up to base-pair resolution where possible, and 3) estimate the population frequency of the detected insertions and absences.
8 This document explains how to run TEMP, which options to use, and how to interpret the outputs. If you have any questions or find any bugs, please contact Jiali Zhuang at jiali.zhuang@umassmed.edu.
9
10
11
12 Requirements and installation
13
14
15 TEMP runs on Linux x86_64 systems.
16 The following software is required by TEMP and must be on the PATH:
17 Samtools (http://samtools.sourceforge.net/),
18 bedtools (http://code.google.com/p/bedtools/),
19 bwa (http://sourceforge.net/projects/bio-bwa/),
20 twoBitToFa (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa).
21 The Perl package BioPerl is also required for running TEMP (http://www.bioperl.org/wiki/Main_Page).
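
A quick way to confirm that the requirements above are met is sketched below. This is only an illustration and not part of TEMP: the function name missing_tools is hypothetical, the command names are assumed to match the tool list above, and Bio::Seq is used as a representative BioPerl module.

```shell
#!/usr/bin/env bash
# Print the name of every given command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed from the requirements list above.
missing=$(missing_tools samtools bedtools bwa twoBitToFa perl)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi

# BioPerl check: try to load Bio::Seq, one of its core modules.
if ! perl -MBio::Seq -e 1 2>/dev/null; then
    echo 'BioPerl (Bio::Seq) could not be loaded' >&2
fi
```

If nothing is printed, all listed commands were found and BioPerl could be loaded.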
22
23 To install TEMP, just unzip and untar the file.
24 The directory TEMP_v1.01/ contains two bash scripts, TEMP_Insertion.sh and TEMP_Absence.sh, for TE insertion and absence analysis, respectively.
25
26
27
28
29 Options
30
31
32 For TEMP_Insertion.sh the arguments to the options are explained below:
33
34
35 -i Input file in bam format with full path. Users need to map the reads to the reference genome using mapping software such as BWA (http://bio-bwa.sourceforge.net/). Please sort and index the bam files before calling TEMP. Sorting and indexing can be done with 'samtools sort' and 'samtools index'.
36
37
38 -s The full path to the scripts in directory TEMP_v1.0/.
39
40
41 -o The full path to output directory. Default is current directory.
42
43
44 -r Transposon consensus sequence fasta format with full path. Such files can be downloaded from Repbase (http://www.girinst.org/repbase/).
45
46
47 -t Annotated transposon positions in the genome (e.g., RepeatMasker) in bed6 format with full path.
48
49
50 -u Families of transposable elements in tab-delimited format (the first column is the name of the element and the second column is its family). Only use together with -t.
51
52
53 -x The minimum score difference between the best hit and the second-best hit for a read to be considered uniquely mapped. The higher the score, the more stringent the criterion. This option is for BWA mem, which does not produce the XT:A: tag.
54
55
56 -m Number of mismatches allowed when mapping to TE consensus sequences.
57
58
59 -f An integer specifying the length of the fragments (inserts) of the library. Default is 500.
60
61
62 -c An integer specifying the number of CPUs used. Default is 8.
63
64
65 -h Show help message.
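
Because -i expects a sorted and indexed bam file, a small pre-flight check can save a failed run. The sketch below is an illustration only, not part of TEMP: the function name bam_is_indexed is hypothetical, and it assumes the index sits next to the bam file as <file>.bam.bai, which is where 'samtools index' writes it by default.

```shell
#!/usr/bin/env bash
# Return 0 if the bam file and its index (<bam>.bai) both exist.
bam_is_indexed() {
    [ -f "$1" ] && [ -f "$1.bai" ]
}

# Typical preparation (samtools 1.x syntax; older releases of
# 'samtools sort' take an output prefix instead of -o):
#   samtools sort -o sample.sorted.bam sample.bam
#   samtools index sample.sorted.bam
#
# Usage with a hypothetical path:
#   bam_is_indexed /full/path/sample.sorted.bam || echo "sort and index first" >&2
```

The check only looks for the two files; it does not verify that the bam file is actually coordinate-sorted.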
66
67
68
69
70 For TEMP_Absence.sh the arguments to the options are explained below:
71
72
73 -i Input file in bam format with full path. Users need to map the reads to the reference genome using mapping software such as BWA (http://bio-bwa.sourceforge.net/). Please sort and index the bam files before calling TEMP. Sorting and indexing can be done with 'samtools sort' and 'samtools index'.
74
75
76 -s The full path to the scripts in directory TEMP_v1.0/.
77
78
79 -o Path to output directory. Default is current directory.
80
81
82 -r Annotated transposon positions in the genome (e.g., RepeatMasker) in bed6 format with full path. For major model organisms, such a file can be downloaded from the UCSC Genome Browser. On the Table Browser page, choose “variation and repeats” in the group tab and “RepeatMasker” in the track tab.
83
84
85 -t 2bit file for the reference genome. Such a file can be downloaded from the UCSC Genome Browser. On the Downloads page, choose the right genome, click on the “Full data set” link and download the *.2bit file.
86
87
88 -f An integer specifying the length of the fragments (inserts) of the library. Default is 500.
89
90
91 -c An integer specifying the number of CPUs used. Default is 4.
92
93
94 -h Show help message.
95
96
97
98
99 Output files
100
101
102 For TE insertion analysis, the summary output file has the suffix: .insertion.refined.bp.summary.
103
104
105 There are 14 columns in the summary file and their meanings are listed below:
106 Column 1: The chromosome where the detected insertion happens.
107 Column 2: The coordinate of the start position of the detected insertion.
108 Column 3: The coordinate of the end position of the detected insertion.
109 Column 4: The TE family that the detected insertion belongs to.
110 Column 5: The direction of the insertion. “Plus” means that the TE is integrated with the plus strand of the genome while “minus” means the TE is integrated with the minus strand.
111 Column 6: The class of the insertion. “1p1” means that the detected insertion is supported by reads on both sides. “2p” means that the detected insertion is supported by more than one read on only one side. “Singleton” means that the detected insertion is supported by only one read on one side.
112 Column 7: The total number of read pairs that support the detected insertion.
113 Column 8: The estimated population frequency of the detected insertion.
114 Columns 9 & 10: The coordinate of a junction and the number of the reads supporting it. If the junction is not found column 9 will be the arithmetic mean of the start and end coordinates and column 10 will have the value 0.
115 Columns 11 & 12: Same as Columns 9 & 10 except for the junction on the other strand.
116 Column 13: The number of reads supporting the detected insertion at the 5’ end of the TE (not including junction spanning reads).
117 Column 14: The number of reads supporting the detected insertion at the 3’ end of the TE (not including junction spanning reads).
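
The column layout above lends itself to simple filtering with awk. The following sketch keeps insertions of class “1p1” whose estimated frequency (column 8) reaches a threshold and for which both junctions have supporting reads (columns 10 and 12 greater than 0). It assumes the summary file is tab-delimited; the function name filter_insertions and the default threshold of 0.1 are arbitrary choices for illustration and do not come from TEMP.

```shell
#!/usr/bin/env bash
# Print "1p1" insertions with frequency >= threshold and both junctions found.
#   $1: insertion summary file (assumed tab-delimited)
#   $2: minimum frequency (default 0.1, chosen only for illustration)
filter_insertions() {
    awk -F '\t' -v min="${2:-0.1}" '
        # A header line, if present, fails the class test and is skipped.
        $6 == "1p1" && $8 + 0 >= min + 0 && $10 + 0 > 0 && $12 + 0 > 0
    ' "$1"
}

# Usage with a hypothetical file name:
#   filter_insertions sample.insertion.refined.bp.summary 0.2
```

Requiring columns 10 and 12 to be positive excludes rows where column 9 or 11 holds only the midpoint placeholder rather than an observed junction.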
118
119
120
121
122 For TE absence analysis, the summary output file has the suffix: .absence.refined.bp.summary.
123
124
125 There are 9 columns in the summary file and their meanings are listed below:
126 Column 1: The chromosome where the detected absence happens.
127 Column 2: The coordinate of the start position of the detected absence.
128 Column 3: The coordinate of the end position of the detected absence.
129 Column 4: The TE family that the detected absence belongs to.
130 Column 5: Junctions at 5’ of the excised TE. The two numbers are the coordinates of the junctions on the two strands.
131 Column 6: Junctions at 3’ of the excised TE. The two numbers are the coordinates of the junctions on the two strands.
132 Column 7: The number of reads supporting the absence.
133 Column 8: The number of reads supporting the reference (no absence).
134 Column 9: Estimated population frequency of the detected absence event.
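
As with the insertion summary, the absence summary can be reduced to the calls of interest with awk. The sketch below keeps absences supported by at least a given number of reads (column 7) and prints the chromosome, start, end, TE family and estimated frequency. It assumes a tab-delimited file; the function name absences_with_support and the default of 2 reads are illustrative choices only.

```shell
#!/usr/bin/env bash
# Print chr, start, end, family and frequency for well-supported absences.
#   $1: absence summary file (assumed tab-delimited)
#   $2: minimum number of supporting reads (default 2, for illustration)
absences_with_support() {
    awk -F '\t' -v OFS='\t' -v n="${2:-2}" '
        # Non-numeric values in column 7 (e.g. a header) evaluate to 0.
        $7 + 0 >= n + 0 { print $1, $2, $3, $4, $9 }
    ' "$1"
}

# Usage with a hypothetical file name:
#   absences_with_support sample.absence.refined.bp.summary 3
```

The coordinates are copied unchanged from columns 2 and 3; no conversion between coordinate conventions is attempted.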
135
136
137
138
139
140 Visualization
141
142 As of v1.01, TEMP includes a new function for visualizing the distribution of predicted TE insertions across the genome using Dr. Xiaopeng Zhu's visualization tool "circosjs".
143
144 The procedure involves two steps:
145 1) Generate the JSON objects file from the TEMP detected TE insertions.
146 This can be done by running the script "generate_density_json.pl", e.g.:
147 perl generate_density_json.pl input.insertion.bp.summary ref.chromInfo 500000
148
149 This script takes 3 parameters: (1) the TE insertions predicted by TEMP (i.e., the output file produced by TEMP_Insertion.sh);
150 (2) a file containing the sizes of all the chromosomes in the reference genome (the chromInfo files for model organism genomes can be downloaded from the UCSC Genome Browser);
151 (3) the size of the genomic bins (500kb in the above example); the total number of predicted TE insertions in each bin will be calculated and plotted later.
152
153 2) Visualization of the distribution of TE insertions across the genome.
154 Dr. Xiaopeng Zhu (https://twitter.com/nimezhu) at UMass Medical School developed a powerful web-based visualization tool that is available at: http://circos.zhu.land/
155 The user only needs to upload the JSON file generated in step 1 in the "read local file" section.
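
The per-bin counting described in step 1 can be pictured with the short awk sketch below. It only illustrates the idea of binning insertions by start coordinate and is not the implementation used by generate_density_json.pl; it does not read the chromInfo file or produce JSON. The function name bin_counts is hypothetical, and the input is assumed to be tab-delimited with the chromosome in column 1 and the start coordinate in column 2.

```shell
#!/usr/bin/env bash
# Count insertions per fixed-size genomic bin.
#   $1: insertion summary file (assumed tab-delimited)
#   $2: bin size in bp (e.g. 500000)
# Output: chromosome, bin start, number of insertions (tab-delimited).
bin_counts() {
    awk -F '\t' -v OFS='\t' -v size="$2" '
        $2 ~ /^[0-9]+$/ {                    # skip lines without a numeric start
            bin_start = int($2 / size) * size
            count[$1 OFS bin_start]++
        }
        END { for (k in count) print k, count[k] }
    ' "$1" | sort -k1,1 -k2,2n
}

# Usage with a hypothetical file name:
#   bin_counts input.insertion.bp.summary 500000
```

Bins without any insertion are simply absent from the output.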
156
157 Please forward any questions or suggestions about the website to Dr. Zhu: xiaopeng.zhu@umassmed.edu
158
159
160
161
162
163
164 Test datasets
165
166 We put together two datasets for testing TEMP.
167
168 One is a simulated set generated using Drosophila melanogaster chromosome 2L as the template. It is distributed along with this package.
169
170 The recommended command-line invocation for this test set is:
171 git clone https://github.com/JialiUMassWengLab/TEMP.git
172 cd TEMP
173 tar -xvzf test_dataset.tar.gz
174 cd test_dataset/
175 bash ../scripts/TEMP_Insertion.sh -i ./test_chromosome.sorted.bam -s ../scripts -r ./test_concensus.fa -t ./test_TE_annotation.bed -m 3 -f 500 -c 8
176 bash ../scripts/TEMP_Absence.sh -i ./test_chromosome.sorted.bam -s ../scripts -r ./test_TE_annotation.bed -t ./dm3_chr2L.2bit -f 500 -c 4
177
178 The other is derived from chromosome 11 of 8 individuals from the 1000 Genomes Project. It is available at http://zlab.umassmed.edu/~zhuangj/TEMP_resources/Human_test_dataset.tar.gz.
179 The recommended command-line invocation for this test set is:
180 git clone https://github.com/JialiUMassWengLab/TEMP.git
181 cd TEMP
182 wget http://bib.umassmed.edu/~zhuangj/TEMP_resources/Human_test_dataset.tar.gz
183 tar -zxvf Human_test_dataset.tar.gz
184 cd Human_test_dataset
185 bash ../scripts/TEMP_Insertion.sh -i ./chrom11.test.sorted.bam -s ../scripts -r ./HomoSapienRepbaseTEConcensus.fa -t ./hg19_rpmk.bed -m 3 -f 500 -c 8
186 bash ../scripts/TEMP_Absence.sh -i ./chrom11.test.sorted.bam -s ../scripts -r ./hg19_rpmk.bed -t ./hg19.2bit -f 500 -c 4