What it does
This tool creates PHYLIP-formatted files from one of two input types: coverage or genes.
If the coverage option is selected, the inputs for the program are:
- a gd_indivs table
- a gd_genotype file with the coverage information for individuals in the gd_indivs table
- a gd_genotype file with the genotype information for individuals in the gd_indivs table
- a coverage threshold (optional)
- a threshold for the percentage of individuals.
The program produces a PHYLIP-formatted file, using the sequence in the genotype file as a template. In each individual's sequence, nucleotides with coverage below the coverage threshold are replaced by "N", as are positions at which the percentage of individuals is below the selected value.
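The masking rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's actual code: the function name `mask_sequences`, the dictionary-based inputs, and the reading of the percentage as a fraction between 0 and 1 of individuals that meet the coverage threshold are all hypothetical.

```python
def mask_sequences(sequences, coverage, min_coverage=0, min_fraction=0.0):
    """Return copies of the sequences with poorly supported bases set to 'N'.

    sequences: dict mapping individual name -> sequence string (equal lengths)
    coverage:  dict mapping individual name -> list of per-position depths
    min_coverage, min_fraction: assumed meanings of the two thresholds
    """
    names = list(sequences)
    if not names:
        return {}
    length = len(sequences[names[0]])
    masked = {name: list(sequences[name]) for name in names}
    for pos in range(length):
        covered = [n for n in names if coverage[n][pos] >= min_coverage]
        if len(covered) / len(names) < min_fraction:
            # Too few individuals are covered here: mask the whole position.
            for n in names:
                masked[n][pos] = "N"
            continue
        for n in names:
            # Mask only the individuals whose own coverage is too low.
            if coverage[n][pos] < min_coverage:
                masked[n][pos] = "N"
    return {n: "".join(chars) for n, chars in masked.items()}
```

Under these assumptions, a coverage threshold of 0 together with a percentage of 0.0 leaves every sequence unchanged, because no coverage value is below 0 and no fraction is below 0.0.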
If the gene option is selected, the inputs for the program are:
- a gd_indivs table
- a gene dataset table with a gene name in the first column
- the column with transcript start in the gene dataset table
- the column with transcript end in the gene dataset table
- the column with coding start in the gene dataset table
- the column with coding end in the gene dataset table
- the column with exon starts (comma-separated) in the gene dataset table
- the column with exon ends (comma-separated) in the gene dataset table
- a FASTA-formatted file for all the genes of interest with their names as headers (NOTE: these names must match the names in the input gene dataset table).
The program produces as output one PHYLIP-formatted file for each gene in the gene dataset table.
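As a rough illustration of how the selected columns could be read from one row of the gene dataset table, the sketch below uses 1-based column numbers and comma-separated exon lists. The function names `parse_int_list` and `parse_gene_row`, the whitespace splitting, and the returned dictionary layout are assumptions made for this example; the tool itself may parse the table differently.

```python
def parse_int_list(field):
    """Parse a comma-separated list such as '1062,1068' or '1002,' into ints."""
    return [int(item) for item in field.split(",") if item.strip()]


def parse_gene_row(line, tx_start, tx_end, cds_start, cds_end,
                   exon_starts, exon_ends):
    """Extract gene coordinates from one row, given 1-based column numbers."""
    # Assumption: fields are separated by whitespace (tabs or spaces).
    fields = line.split()

    def column(number):
        return fields[number - 1]

    starts = parse_int_list(column(exon_starts))
    ends = parse_int_list(column(exon_ends))
    if len(starts) != len(ends):
        raise ValueError("exon starts and exon ends differ in length")
    return {
        "transcript": (int(column(tx_start)), int(column(tx_end))),
        "coding": (int(column(cds_start)), int(column(cds_end))),
        "exons": list(zip(starts, ends)),
    }


# A shortened row in the layout of the example gene dataset table.
row = "1 ENSLAFT00000037164 chrM - 1058 1092 1062 1073 1 1062,1068 1065,1073 0"
gene = parse_gene_row(row, 5, 6, 7, 8, 10, 11)
```

Trailing commas, as in `1002,`, are tolerated because empty items are skipped before conversion.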
Example
In a case where the coverage option is selected, with the inputs:
gd_indivs:
7 W_Java
10 E_Java
16 Pen_Ma
...
Genotype table:
chrM 15 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1
chrM 18 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1
chrM 20 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1
...
Coverage table:
chrM 0 G G 0 0 0 0 0 0 0 0 0 0 0 0 0
chrM 1 T T 0 0 3 0 0 50 0 0 0 0 0 2 0
chrM 2 T T 0 0 5 0 0 50 0 0 0 0 0 2 0
...
Coverage threshold = 0
Percentage of individuals = 0.0
The output is:
4 19 15428
W_Java GTTCATCATGTTCATCGAAT
E_Java GTTCATCATGTTCATCGAAC
Pen_Ma GTTCATCATGTTCATCGAAT
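In the example inputs above, the first field of each gd_indivs row looks like a 1-based column number in the genotype and coverage tables: columns 7, 10 and 16 are the ones holding non-missing values. The sketch below looks up per-individual values on that assumption. The function names `parse_indivs` and `values_for_row` are hypothetical, and the column interpretation is inferred from the example data rather than stated in this help text.

```python
def parse_indivs(lines):
    """Map individual name -> column number from gd_indivs rows."""
    indivs = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated rows such as '...'
        indivs[fields[1]] = int(fields[0])
    return indivs


def values_for_row(row, indivs):
    """Return the value of each individual in one genotype or coverage row."""
    fields = row.split()
    # Assumption: the gd_indivs number is a 1-based column index.
    return {name: int(fields[col - 1]) for name, col in indivs.items()}


indivs = parse_indivs(["7 W_Java", "10 E_Java", "16 Pen_Ma"])
coverage = values_for_row("chrM 1 T T 0 0 3 0 0 50 0 0 0 0 0 2 0", indivs)
```

With the second row of the example coverage table, this lookup yields 3, 50 and 2 for W_Java, E_Java and Pen_Ma respectively.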
In a case where the gene option is selected, with the inputs:
Gene dataset table input:
1 ENSLAFT00000017123 chrM + 1002 1061 1002 1061 1 1002, 1061, 0 ENSLAFG00000017122 cmpl incmpl 0, BTRC ENSLAFT00000017123 ENSLAFP00000014355
1 ENSLAFT00000037164 chrM - 1058 1092 1062 1073 1 1062,1068 1065,1073 0 ENSLAFG00000007680 cmpl cmpl 0, MYOF ENSLAFT00000037164 ENSLAFP00000025175 26509
1 ENSLAFT00000008925 chrM + 990 1000 990 1000 1 990, 1000, 0 ENSLAFG00000008924 incmpl incmpl 0, PRKG1 ENSLAFT00000008925 ENSLAFP00000007492
...
In this table:
column with transcript start = 5
column with transcript end = 6
column with coding start = 7
column with coding end = 8
column with exon starts = 10
column with exon ends = 11
gd_indivs:
7 W_Java
10 E_Java
16 Pen_Ma
...
Genotype table:
chrM 1005 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1
chrM 1060 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1
chrM 991 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1
...
The output is one PHYLIP-formatted file for each sequence in the input gene dataset table (as long as the sequence is included in the input FASTA file).