annotate lib/documentation.org @ 0:1d1b9e1b2e2f draft

Uploaded
author petr-novak
date Thu, 19 Dec 2019 10:24:45 -0500
parents
children
#+TITLE: RepeatExplorer documentation
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
#+LANGUAGE: en
#+OPTIONS: html-postamble:nil

#+begin_export html
<h1 id="clust"> Cluster annotation table </h1>
#+end_export

- Cluster :: Cluster index; contains a link to the individual cluster report.
- Supercluster :: Supercluster index; contains a link to the individual supercluster report.
- Proportion[%] :: Proportion of reads in the cluster relative to the total number of analyzed sequences.
- Proportions adjusted[%] :: The adjusted genome proportion can differ from the unadjusted value if the option "Perform automatic filtering of abundant satellite repeats" was on. In that case, sequences belonging to high-abundance satellites are partially removed from the all-to-all comparison and clustering, so the genome proportion of these satellites is underestimated. The adjusted genome proportion provides a corrected estimate of the ‘real’ genomic proportion of the particular satellite repeat.
- Number of reads :: Number of reads in the cluster.
- Graph layout :: Preview of the graph-based visualization of the sequence read cluster. A more detailed graph layout can be found in the individual cluster reports.
- Similarity hits :: Summary of the proportion of reads in the cluster with similarity to the REXdb or DNA reference databases. Only hits with a proportion above 0.1% are shown.
- LTR detection :: Shows whether an LTR with a primer binding site was detected in the contig assembly and which type of tRNA is used for priming.
- Satellite probability :: Empirical probability that the cluster represents a satellite.
- TAREAN classification :: TAREAN divides clusters into five categories described in box 9.
- Consensus length :: For clusters analyzed by the TAREAN module, the best estimate of the monomer length is shown.
- Consensus :: The best consensus estimate reconstructed by the TAREAN module.
- Kmer analysis :: If the cluster was analyzed by TAREAN, this field contains a link to the detailed TAREAN kmer analysis (box 10).
- Connected component index C, Pair completeness index P, Kmer coverage :: Statistics reported by the TAREAN module.
- |V| :: Number of vertices of the graph
- |E| :: Number of edges of the graph

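The relation between read counts, the Proportion[%] column and the 0.1% display threshold for Similarity hits can be illustrated with a minimal sketch. The function names, hit labels and read counts below are hypothetical and chosen only for illustration; they are not taken from the RepeatExplorer code.

```python
def proportion_percent(reads_in_group, reads_total):
    """Percentage of reads_total that falls into reads_in_group."""
    if reads_total <= 0:
        raise ValueError("reads_total must be positive")
    return 100.0 * reads_in_group / reads_total


def visible_hits(hit_counts, reads_in_cluster, threshold_percent=0.1):
    """Keep only hits whose share of the cluster reads exceeds the threshold."""
    shown = {}
    for label, count in hit_counts.items():
        share = proportion_percent(count, reads_in_cluster)
        if share > threshold_percent:
            shown[label] = share
    return shown


# Hypothetical numbers, for illustration only.
cluster_reads = 2000
total_reads = 100000
hits = {"lineage_A": 500, "lineage_B": 1}

cluster_proportion = proportion_percent(cluster_reads, total_reads)
shown_hits = visible_hits(hits, cluster_reads)
print(cluster_proportion)
print(shown_hits)
```

In this toy case the second hit covers only 0.05% of the cluster reads, so it falls below the threshold and would not be listed.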
#+begin_export html
<h1 id="superclust"> Supercluster annotation table </h1>
#+end_export

- Supercluster :: Supercluster index
- Reads :: Number of reads in the supercluster
- Automatic classification :: Result of the automatic supercluster classification
- Similarity hits :: The number of similarity hits against REXdb and the DNA database is shown in the classification tree structure, together with the number of reads assigned to putative satellite clusters and information about the detection of LTR/PBS. Parts of the tree without any evidence are pruned.
- TAREAN annotation :: Clusters that are part of the supercluster and classified by TAREAN as putative satellites are listed here.
- Clusters :: Hyperlinked list of clusters that are part of the supercluster.

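The pruning of the classification tree mentioned under Similarity hits can be pictured with a small sketch. It assumes a tree stored as nested dictionaries with hypothetical node names and hit counts; the data structure actually used by the pipeline may well differ.

```python
def prune(node):
    """Return a copy of the tree without branches that carry no hits.

    A node is a dict with a "name", its own "hits" count and "children".
    A branch is dropped when neither the node nor any descendant has hits.
    """
    kept_children = []
    for child in node.get("children", []):
        pruned = prune(child)
        if pruned is not None:
            kept_children.append(pruned)
    if node.get("hits", 0) == 0 and not kept_children:
        return None
    return {
        "name": node["name"],
        "hits": node.get("hits", 0),
        "children": kept_children,
    }


# Hypothetical classification tree, for illustration only.
tree = {
    "name": "All",
    "hits": 0,
    "children": [
        {"name": "class_A", "hits": 0, "children": [
            {"name": "lineage_A1", "hits": 12, "children": []},
            {"name": "lineage_A2", "hits": 0, "children": []},
        ]},
        {"name": "class_B", "hits": 0, "children": [
            {"name": "lineage_B1", "hits": 0, "children": []},
        ]},
    ],
}

pruned_tree = prune(tree)
print(pruned_tree)
```

Only the path from the root to lineage_A1 survives here, because it is the only branch with supporting hits.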
#+begin_export html
<h1 id="tra"> Tandem repeat analysis </h1>
#+end_export

TAREAN divides clusters into five categories with corresponding files in the
archive:

- High confidence satellites with consensus sequences in file ~TR_consensus_rank_1_.fasta~
- Low confidence satellites with consensus sequences in file ~TR_consensus_rank_2_.fasta~
- Putative LTR elements with consensus sequences in file ~TR_consensus_rank_3_.fasta~
- rDNA with consensus sequences in file ~TR_consensus_rank_4_.fasta~
- Other clusters: these are not reconstructed by TAREAN because no potential tandem-like structure was found.

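The consensus files listed above are FASTA files, so they can be read with a few lines of code. The following is a minimal sketch using only the standard library; the record names and sequences in the example are invented, and the header layout of the real files is not assumed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Stand-in for the contents of a file such as TR_consensus_rank_1_.fasta;
# the record names and sequences are invented for illustration.
example = io.StringIO(">CL1_example\nACGTAC\nGTTA\n>CL2_example\nTTGACA\n")

records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

To read a real result file, the ~io.StringIO~ object can be replaced by a file opened with ~open("TR_consensus_rank_1_.fasta")~.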
Summary tables from the TAREAN HTML report include the following information:

- Cluster :: Cluster identifier
- Proportion[%] :: (Number of sequences in cluster / Number of sequences in clustering) × 100%
- Proportion adjusted[%] ::
- Number of reads :: Number of reads in the cluster
- Satellite probability :: Empirical estimate of the probability that the cluster sequences are derived from a satellite repeat. This estimate is based on the analysis of manually annotated and experimentally validated satellite repeats.
- Consensus length ::
- Consensus :: The consensus sequence is the outcome of the kmer-based analysis and represents the most probable satellite monomer sequence. Alternative consensus sequences are included in the individual cluster reports.
- Graph layout :: Graph-based visualization of similarities among sequence reads
- Kmer analysis :: Hyperlink to the TAREAN kmer report of the individual cluster (fig X, box 10)
- Connected component index C :: Proportion of nodes of the graph that are part of the largest strongly connected component
- Pair completeness index P :: Proportion of reads with an available mate pair within the same cluster
- Kmer coverage :: Sum of the relative frequencies of all kmers used for consensus sequence reconstruction
- |V| :: Number of vertices of the graph
- |E| :: Number of edges of the graph
- PBS score :: Primer binding site detection score
- Similarity hits :: Similarity hits based on a blastn/blastx search against built-in databases of known sequences. By default, this contains similarity hits to the built-in database, which includes rDNA, plastid and mitochondrial sequences. If TAREAN was run within the RepeatExplorer2 pipeline, it also contains information about similarity hits against the REXdb database.

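The two graph statistics C and P follow directly from their definitions in the list above. The sketch below computes them for a toy directed graph; it uses Kosaraju's algorithm for strongly connected components, which is a choice made for this example and not necessarily what TAREAN does internally. It also assumes, purely for illustration, that mates share a read name and differ only in a final "f" or "r".

```python
from collections import defaultdict


def largest_scc_size(vertices, edges):
    """Size of the largest strongly connected component (Kosaraju)."""
    forward, backward = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    # Pass 1: record vertices in order of DFS completion on the graph.
    visited, order = set(), []
    for start in vertices:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: traverse the reversed graph in decreasing completion order;
    # each traversal collects exactly one strongly connected component.
    assigned, largest = set(), 0
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        size, stack = 0, [start]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        largest = max(largest, size)
    return largest


def connected_component_index(vertices, edges):
    """C: share of vertices in the largest strongly connected component."""
    return largest_scc_size(vertices, edges) / len(vertices)


def pair_completeness_index(read_names):
    """P: share of reads whose mate is in the same cluster.

    Assumption for this sketch: mates differ only in a trailing f/r.
    """
    names = set(read_names)
    mate_of = {"f": "r", "r": "f"}
    with_mate = 0
    for name in names:
        suffix = name[-1]
        if suffix in mate_of and name[:-1] + mate_of[suffix] in names:
            with_mate += 1
    return with_mate / len(names)


# Hypothetical toy cluster, for illustration only.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
reads = ["r1f", "r1r", "r2f", "r3r"]

c_index = connected_component_index(vertices, edges)
p_index = pair_completeness_index(reads)
print(c_index, p_index)
```

In the toy graph, vertices a, b and c form a cycle while d is only reachable from it, so three of the four vertices belong to the largest strongly connected component.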
Individual cluster TAREAN reports contain other variants of the consensus
sequence, sorted by kmer coverage score. For each consensus, the corresponding
de Bruijn graph representation and sequence logo are shown.

#+begin_export html
<h1 id="kmer"> TAREAN k-mer analysis report </h1>
#+end_export

The TAREAN module generates a k-mer analysis report for each cluster assigned to the putative satellite, rDNA or putative LTR category. Monomer sequences of putative tandem repeats are reconstructed by a k-mer based method that uses the most frequent k-mers. Several k-mer lengths are evaluated and the best estimates of the monomer consensus sequence are reported. The k-mer analysis summary contains the following information:

- k-mer length :: Length of the k-mer used for monomer reconstruction
- Variant index :: Each k-mer length can yield multiple consensus variants; the variants are indexed.
- k-mer coverage score :: Sum of the proportions of all k-mers used for the reconstruction of the particular monomer. If the value is 1, all k-mers from the corresponding cluster were used for the reconstruction of the monomer, meaning that there is no variability. The more variable the monomer, the lower the k-mer coverage score.
- Consensus length :: Length of the estimated monomer
- Consensus :: Consensus sequence extracted from the position probability matrix.
- k-mer based graph :: Visualization of the de Bruijn graph. Each vertex corresponds to a single k-mer, and the size of a vertex is proportional to the k-mer frequency. The path that was used to reconstruct the monomer sequence is greyed out.
- Sequence logo :: Visualization of the position probability matrix for the corresponding consensus variant.
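
To make the k-mer coverage score and the consensus extraction more concrete, here is a minimal sketch under simplifying assumptions: the monomer is treated as circular, k is not larger than the monomer length, and the position probability matrix is a plain list of rows in A, C, G, T order. All function names and example data are hypothetical, and the actual TAREAN implementation may differ.

```python
from collections import Counter


def kmer_proportions(reads, k):
    """Relative frequency of every k-mer observed in the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def monomer_kmers(monomer, k):
    """Set of k-mers along a monomer treated as a circular repeat unit."""
    wrapped = monomer + monomer[:k - 1]
    return {wrapped[i:i + k] for i in range(len(monomer))}


def coverage_score(reads, monomer, k):
    """Sum of the proportions of the k-mers that spell the monomer."""
    proportions = kmer_proportions(reads, k)
    return sum(proportions.get(kmer, 0.0) for kmer in monomer_kmers(monomer, k))


def consensus_from_ppm(ppm, alphabet="ACGT"):
    """Most probable base at each position of a probability matrix."""
    return "".join(
        alphabet[max(range(len(row)), key=row.__getitem__)] for row in ppm
    )


# Invented reads: two perfect copies of an ACG repeat, then one variant read.
perfect_reads = ["ACGACGACG", "CGACGACGA"]
variable_reads = perfect_reads + ["ACGATGACG"]

score_perfect = coverage_score(perfect_reads, "ACG", 2)
score_variable = coverage_score(variable_reads, "ACG", 2)

# Invented position probability matrix with columns A, C, G, T.
ppm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.70, 0.10, 0.10],
    [0.00, 0.10, 0.80, 0.10],
]
consensus = consensus_from_ppm(ppm)
print(score_perfect, score_variable, consensus)
```

With the perfect repeat, every observed k-mer lies on the monomer path and the score is 1. Adding the variant read introduces k-mers that are not used for the reconstruction, so the score drops below 1, which matches the description of the k-mer coverage score above.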