comparison lib/tarean_output_help.org @ 0:1d1b9e1b2e2f draft
Uploaded
| author | petr-novak |
|---|---|
| date | Thu, 19 Dec 2019 10:24:45 -0500 |
| parents | |
| children |
| -1:000000000000 | 0:1d1b9e1b2e2f |
|---|---|
| 1 #+TITLE: TAREAN output description | |
| 2 #+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" /> | |
| 3 #+LANGUAGE: en | |
| 4 | |
| 5 * Introduction | |
| 6 TAREAN output includes an *HTML report* with a list of all analyzed clusters; the clusters are classified into five categories: | |
| 7 + high confidence satellites | |
| 8 + low confidence satellites | |
| 9 + potential LTR elements | |
| 10 + rDNA | |
| 11 + other clusters | |
| 12 Each cluster for which a consensus sequence was reconstructed also has its own detailed report, linked from the main report. | |
| 13 | |
| 14 * Main HTML report | |
| 15 This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads). | |
| 16 ** Table legend | |
| 17 + Cluster :: Cluster identifier | |
| 18 + Genome Proportion[%] :: /(Number of sequences in cluster/Number of sequences in clustering) x 100%/ | |
| 19 + Size :: Number of reads in the cluster | |
| 20 + Satellite probability :: Empirical probability estimate that cluster sequences | |
| 21 are derived from a satellite repeat. This estimate is based on analysis of more | |
| 22 than xxx clusters including yyy manually annotated and zzz experimentally | |
| 23 validated satellite repeats | |
| 24 + Consensus :: Consensus sequence is the outcome of kmer-based | |
| 25 analysis and represents the most probable satellite monomer | |
| 26 sequence | |
| 27 + Kmer analysis :: | |
| 28 link to analysis report for individual clusters | |
| 29 + Graph layout :: Graph-based visualization of similarities among sequence | |
| 30 reads | |
| 31 + Connected component index :: Proportion of nodes of the graph which are part | |
| 32 of the largest strongly connected component | |
| 33 + Pair completeness index :: Proportion of reads with available | |
| 34 mate-pair within the same cluster | |
| 35 + Kmer coverage :: Sum of relative frequencies of all kmers used for consensus | |
| 36 sequence reconstruction | |
| 37 + |V| :: Number of vertices of the graph | |
| 38 + |E| :: Number of edges of the graph | |
| 39 + PBS score :: Primer binding site detection score | |
| 40 + The longest ORF length :: Length of the longest open reading frame found in | |
| 41 any of the six possible reading frames. The search is done on a dimer of the | |
| 42 consensus, so ORFs can be longer than the 'monomer' length | |
| 43 + Similarity-based annotation :: Annotation based on | |
| 44 similarity search using blastn/blastx against database of known | |
| 45 repeats. | |
| 46 * Detailed cluster report | |
| 47 Cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, by the proportion of k-mers they represent). | |
| 48 ** Table legend | |
| 49 - kmer :: length of kmer used for consensus reconstruction. | |
| 50 - variant :: identifier of consensus variant. | |
| 51 - total score :: measure of significance of consensus variant. Score is calculated as a sum of weights of all k-mers used for consensus reconstruction. | |
| 52 - monomer length :: length of the consensus | |
| 53 - consensus :: consensus sequence without ambiguous bases. | |
| 54 - graph image :: part of de-Bruijn graph based on the abundant k-mers. Size of | |
| 55 vertices corresponds to k-mer frequencies. Paths in the graph which were used | |
| 56 for reconstruction of consensus sequences are colored gray. | |
| 57 - logo image :: consensus sequences shown as DNA logo. Height of letters corresponds to kmer frequencies. Logo images are linked to corresponding position probability matrices. | |
| 58 | |
| 59 * Structure of the output archive | |
| 60 Complete results from TAREAN analysis can be downloaded as a zip archive which contains the following | |
| 61 files and directories: | |
| 62 | |
| 63 #+BEGIN_SRC files & directories | |
| 64 . | |
| 65 . | |
| 66 ├── clusters_info.csv <------------ list of clusters in tab delimited format | |
| 67 ├── index.html <------------ main html report | |
| 68 ├── seqclust | |
| 69 │ ├── assembly # not implemented yet | |
| 70 │ ├── blastn <------------ results of read comparison with DNA database | |
| 71 │ ├── blastx <------------ results of read comparison with protein database | |
| 72 │ ├── clustering | |
| 73 │ │ ├── clusters | |
| 74 │ │ │ ├── dir_CL0001 <----┐- detailed information about clusters | |
| 75 │ │ │ ├── dir_CL0002 <----│ | |
| 76 │ │ │ ├── dir_CL0003 <----│ | |
| 77 │ │ │ .... <----┘ | |
| 78 │ │ │ | |
| 79 │ │ └── hitsort.cls <--------- list of reads in individual clusters | |
| 80 │ ├── mgblast | |
| 81 │ ├── prerun | |
| 82 │ └── sequences <--------- input reads | |
| 83 ├── summary # not implemented yet | |
| 84 ├── TR_consensus_rank_1_.fasta <-- reconstructed monomer sequences for HIGH confidence satellites | |
| 85 ├── TR_consensus_rank_2_.fasta <-- reconstructed monomer sequences for LOW confidence satellites | |
| 86 ├── TR_consensus_rank_3_.fasta <-- reconstructed sequences of potential LTR elements | |
| 87 └── TR_consensus_rank_4_.fasta <-- reconstructed consensus for rDNA | |
| 88 | |
| 89 #+END_SRC | |
| 90 | |
| 91 The list of all clusters shown in the HTML file =index.html= is also | |
| 92 available in tab-delimited format in the file =clusters_info.csv=, which can be | |
| 93 viewed and edited in spreadsheet programs. The list of all clusters | |
| 94 and the corresponding reads is in the file =hitsort.cls=, which has the following | |
| 95 format: | |
| 96 | |
| 97 : >CL1 11 | |
| 98 : 134234r 55494f 85525f 136746r 96742f 91926f 239729r 105445f 222518r 136402r 9013 | |
| 99 : >CL2 10 | |
| 100 : 76205r 120735r 69527r 12235r 176778f 189307f 131952f 163507f 100038r 178475r | |
| 101 : >CL3 6 | |
| 102 : 99835r 222598f 29715r 102023f 99524r 30116f | |
| 103 : >CL4 6 | |
| 104 : 51723r 69073r 218774r 146425f 136314r 41744f | |
| 105 : >CL5 5 | |
| 106 : 70686f 65565f 234078r 50430r 68247r | |
| 107 | |
| 108 where =CL1 11= is the cluster ID followed by the number of reads in the cluster; | |
| 109 the next line contains the list of all read names belonging to the cluster. | |
| 110 ** Structure of cluster directories | |
| 111 | |
| 112 Detailed information for each cluster is stored in subdirectories: | |
| 113 | |
| 114 #+BEGIN_SRC folder directories | |
| 115 dir_CL0011 | |
| 116 ├── blast.csv <------------tab delimited file, all-to-all comparison of reads within cluster | |
| 117 ├── CL11_directed_graph.RData <----directed graph representation of cluster saved as R igraph object | |
| 118 ├── CL11.GL <-----------------undirected graph representation of cluster saved as R igraph object | |
| 119 ├── CL11.png <-----------┐- images with graph visualization | |
| 120 ├── CL11_tmb.png <-----------┘ | |
| 121 ├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats | |
| 122 ├── reads_all.fas <---------------- all reads included in the cluster in fasta format | |
| 123 ├── reads.fas <---------------- subset of reads used for monomer reconstruction | |
| 124 ├── reads_oriented.fas <------------ subset of reads all in the same orientation | |
| 125 └── tarean | |
| 126 ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants | |
| 127 ├── ggmin.RData | |
| 128 ├── img | |
| 129 │ ├── graph_11mer_1.png <-----┐ | |
| 130 │ ├── graph_11mer_2.png <-----│ | |
| 131 │ ├── graph_15mer_2.png <-----│ | |
| 132 │ ├── graph_15mer_3.png <-----│ | |
| 133 │ ├── graph_15mer_4.png <-----│ images of kmer-based graphs used for reconstruction of | |
| 134 │ ├── graph_19mer_2.png <-----│ monomer variants | |
| 135 │ ├── graph_19mer_4.png <-----│ | |
| 136 │ ├── graph_19mer_5.png <-----│ | |
| 137 │ ├── graph_23mer_2.png <-----│ | |
| 138 │ ├── graph_27mer_3.png <-----┘ | |
| 139 │ │ | |
| 140 │ ├── logo_11mer_1.png <-----┐ | |
| 141 │ ├── logo_11mer_2.png <-----│ | |
| 142 │ ├── logo_15mer_2.png <-----│ | |
| 143 │ ├── logo_15mer_3.png <-----│ | |
| 144 │ ├── logo_15mer_4.png <-----│ images with DNA logos representing consensus sequences | |
| 145 │ ├── logo_19mer_2.png <-----│ of monomer variants | |
| 146 │ ├── logo_19mer_4.png <-----│ | |
| 147 │ ├── logo_19mer_5.png <-----│ | |
| 148 │ ├── logo_23mer_2.png <-----│ | |
| 149 │ └── logo_27mer_3.png <-----┘ | |
| 150 │ | |
| 151 ├── ppm_11mer_1.csv <-----┐ | |
| 152 ├── ppm_11mer_2.csv <-----│ | |
| 153 ├── ppm_15mer_2.csv <-----│ | |
| 154 ├── ppm_15mer_3.csv <-----│ | |
| 155 ├── ppm_15mer_4.csv <-----│ position probability matrices for individual monomer | |
| 156 ├── ppm_19mer_2.csv <-----│ variants derived from k-mer frequencies | |
| 157 ├── ppm_19mer_4.csv <-----│ | |
| 158 ├── ppm_19mer_5.csv <-----│ | |
| 159 ├── ppm_23mer_2.csv <-----│ | |
| 160 ├── ppm_27mer_3.csv <-----┘ | |
| 161 │ | |
| 162 ├── reads_oriented.fas_11.kmers <-----┐ | |
| 163 ├── reads_oriented.fas_15.kmers <-----│ | |
| 164 ├── reads_oriented.fas_19.kmers <-----│ k-mer frequencies calculated on oriented reads | |
| 165 ├── reads_oriented.fas_23.kmers <-----│ for k-mer lengths 11 - 27 | |
| 166 ├── reads_oriented.fas_27.kmers <-----┘ | |
| 167 ├── reads_oriented.fasblast_out.cvs <---------┐results of blastn search against database of tRNA | |
| 168 ├── reads_oriented.fasblast_out.cvs_L.csv <----│for purposes of LTR detection | |
| 169 ├── reads_oriented.fasblast_out.cvs_R.csv <----┘ | |
| 170 └── report.html <--- cluster analysis HTML summary | |
| 171 #+END_SRC | |
| 172 | |
| 173 | |
| 174 |
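
The =hitsort.cls= format described above (a =>CLn count= header line followed by a line of whitespace-separated read names) could be read with a few lines of Python. The following is a minimal sketch and not part of TAREAN: the function names =parse_hitsort_cls= and =genome_proportion= are hypothetical, the parser assumes read names contain no whitespace, and the total read count of 10000 in the usage example is an invented illustration value.

```python
import io


def parse_hitsort_cls(handle):
    """Return {cluster_id: [read names]} from hitsort.cls-style text.

    Assumes each cluster starts with a '>ID COUNT' header line and that
    read names on the following line(s) are separated by whitespace.
    """
    clusters = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header such as '>CL3 6': cluster ID, then declared read count.
            current = line[1:].split()[0]
            clusters[current] = []
        elif current is not None:
            clusters[current].extend(line.split())
    return clusters


def genome_proportion(cluster_size, total_reads):
    """(Number of sequences in cluster / number in clustering) x 100%."""
    return 100.0 * cluster_size / total_reads


# Usage example with three records copied from the format listing above.
example = """>CL3 6
99835r 222598f 29715r 102023f 99524r 30116f
>CL4 6
51723r 69073r 218774r 146425f 136314r 41744f
>CL5 5
70686f 65565f 234078r 50430r 68247r
"""
clusters = parse_hitsort_cls(io.StringIO(example))
for cluster_id, reads in clusters.items():
    # total_reads=10000 is a made-up value for illustration only.
    print(cluster_id, len(reads), genome_proportion(len(reads), 10000))
```

For a real archive, the same function could be given an open file handle for =seqclust/clustering/hitsort.cls=; the total number of reads in the clustering would have to be taken from the analysis itself, since it is not stored in this file.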
