# HG changeset patch # User Richard Burhans # Date 1352137457 18000 # Node ID d6b961721037f1ddb900a86f4ab003a92635f80e # Parent 8a4b8efbc82cb5236ef5d929647904cec5214b03 Miller Lab Devshed version 4c04e35b18f6 diff -r 8a4b8efbc82c -r d6b961721037 add_fst_column.xml --- a/add_fst_column.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/add_fst_column.xml Mon Nov 05 12:44:17 2012 -0500 @@ -10,11 +10,11 @@ - + - + @@ -22,20 +22,20 @@ - - - + + + - - - + + + - - + + @@ -61,13 +61,24 @@ +**Dataset formats** + +The input datasets are in gd_snp_ and gd_indivs_ formats. +The output dataset is in gd_snp_ format. (`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. +Frequency metric. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. -After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the Fst using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. +After specifying the frequency metric, the user sets lower bounds on amount of data required at a SNP. For estimating the Fst using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. The user specifies whether the SNPs that violate the lower bound should be ignored or the Fst set to -1. @@ -81,15 +92,46 @@ Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. +Weir, B.S. and Cockerham, C. Clark (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sundand, MA. David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper. +Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the following paper. Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. +----- + +**Example** + +- input, SNP table:: + + #{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q", + #"5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist","prim","rflp"],"dbkey":"canFam2", + #"individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]], + #"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"} + Contig161_chr1_4641264_4641879 115 C T 73.5 chr1 4641382 C 6 0 2 45 8 0 2 51 15 0 2 72 5 0 2 42 6 0 2 45 10 0 2 57 Y 54 0.323 0 + Contig113_chr5_11052263_11052603 28 C T 38.2 chr5 11052280 C 1 2 1 12 3 2 1 10 5 0 2 42 2 1 2 13 3 0 2 36 8 0 2 51 Y 161 +99. 0 + Contig215_chr5_70946445_70947428 363 T G 28.2 chr5 70946809 C 4 0 2 39 0 5 0 12 9 0 2 54 6 0 2 45 3 3 2 1 9 0 2 54 N 43 0.153 0 + etc. + +- input, Population 1 individuals:: + + 9 PB1 + 13 PB2 + +- input, Population 2 individuals:: + + 17 PB3 + 21 PB4 + +- output (minimum read count of 3, discard fixed):: + + Contig113_chr5_11052263_11052603 28 C T 38.2 chr5 11052280 C 1 2 1 12 3 2 1 10 5 0 2 42 2 1 2 13 3 0 2 36 8 0 2 51 Y 161 +99. 0 0.1636 + Contig215_chr5_70946445_70947428 363 T G 28.2 chr5 70946809 C 4 0 2 39 0 5 0 12 9 0 2 54 6 0 2 45 3 3 2 1 9 0 2 54 N 43 0.153 0 0.3846 + etc. + diff -r 8a4b8efbc82c -r d6b961721037 average_fst.xml --- a/average_fst.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/average_fst.xml Mon Nov 05 12:44:17 2012 -0500 @@ -15,14 +15,14 @@ - + - - - + + + @@ -32,15 +32,15 @@ - - - + + + - - + + @@ -69,19 +69,32 @@ +**Dataset formats** + +The input datasets are in gd_snp_ and gd_indivs_ formats. +The output dataset is in text_ format. (`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. +Frequency metric. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. -After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. SMPs not meeting these lower bounds are ignored. +After specifying the frequency metric, the user sets lower bounds on amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. SNPs not meeting these lower bounds are ignored. The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded. Finally, the user decides whether to use randomizations. If so, then the user specifies how many randomly generated population pairs (retaining the numbers of individuals of the originals) to generate, as well as the "population" of additional individuals (not in the first two populations) that can be used in the randomization process. The program prints the following measures of FST for the two populations. + 1. The formulation by Sewall Wright (average over FSTs for all SNPs). 2. The Weir-Cockerham estimator (average over FSTs for all SNPs). 3. The Reich-Patterson estimator (average over FSTs for all SNPs). @@ -93,14 +106,27 @@ Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. +Weir, B.S. and Cockerham, C. Clark (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sundand, MA. David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper. +Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the following paper. Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. + +----- + +**Example** + +- output:: + + Using 37847 SNPs, we compute: + Average Wright FST is 0.22810. + Average Weir-Cockerham FST is 0.30813. + Average Reich-Patterson FST is 0.31012. + The population-based Reich-Patterson Fst is 0.33625. + diff -r 8a4b8efbc82c -r d6b961721037 calctfreq.py --- a/calctfreq.py Tue Oct 23 14:38:04 2012 -0400 +++ b/calctfreq.py Mon Nov 05 12:44:17 2012 -0500 @@ -99,7 +99,11 @@ sKEGGcPthws=dKEGGcPthws.pop(cGen) for eachP in sKEGGcPthws: if eachP!='N': - dPthContsTmp[eachP]+=1 + if eachP in dPthContsTmp: + dPthContsTmp[eachP]+=1 + else: + print >> sys.stderr, "Error: pathway not found in database: '{0}'".format(eachP) + sys.exit(1) cntGens+=1 #~ Calculate Freqs. ltfreqs=[((Decimal(dPthContsTmp[x])/Decimal(dPthContsTotls[x])),Decimal(dPthContsTmp[x]),x) for x in dPthContsTotls] diff -r 8a4b8efbc82c -r d6b961721037 commits.log --- a/commits.log Tue Oct 23 14:38:04 2012 -0400 +++ b/commits.log Mon Nov 05 12:44:17 2012 -0500 @@ -1,3 +1,7 @@ + +:f556345a4185 +cathy 2012-11-02 17:45 +Tweaked parameter labels. :8703e16fca01 cathy 2012-10-04 11:42 diff -r 8a4b8efbc82c -r d6b961721037 dpmix.xml --- a/dpmix.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/dpmix.xml Mon Nov 05 12:44:17 2012 -0500 @@ -10,19 +10,19 @@ - + - + - + @@ -71,13 +71,13 @@ chromosomes) and a set of potentially admixed individuals, and chooses between the sequence coverage or the estimated genotypes to measure the similarity of genomic intervals in admixed individuals to the two -classes of ancestral chromosomes. The user also picks a "switch penalty", +classes of ancestral chromosomes. The user also picks a "genotype switch penalty", typically between 10 and 100. For each potentially admixed individual, the program divides the genome into three "genotypes": (0) homozygous for the first ancestral population (i.e., both chromosomes from that population), (1) heterozygous, or (2) homozygous for the second ancestral population. Parts of a chromosome that are labeled as "heterochromatic" -are given the non-genotype, 3. Smaller values of the switch penalty +are given the non-genotype "3". Smaller values of the switch penalty (corresponding to more ancient admixture events) generally lead to the reconstruction of more frequent changes between genotypes. diff -r 8a4b8efbc82c -r d6b961721037 extract_flanking_dna.xml --- a/extract_flanking_dna.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/extract_flanking_dna.xml Mon Nov 05 12:44:17 2012 -0500 @@ -12,13 +12,13 @@ - - + + - + - + @@ -53,17 +53,31 @@ +**Dataset formats** + +The input dataset is in tabular_ format and must contain a scaffold or +chromosome column and a position column. The output is in fasta_ format or +Boulder-IO_ format used by Primer3. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _fasta: ./static/formatHelp.html#fasta +.. _Boulder-IO: ./static/formatHelp.html#boulder +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool reports a DNA segment containing each SNP, with up to 200 nucleotides on - either side of the SNP position, which is indicated by "n". Fewer nucleotides - are reported if the SNP is near an end of the assembled genome fragment. +This tool reports a DNA segment containing each SNP, with up to 200 nucleotides +on either side of the SNP position, which is indicated by "n". Fewer nucleotides +are reported if the SNP is near an end of the assembled genome fragment. ----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -77,7 +91,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. -- output file:: +- output (FastA format):: > chr2_75111355_75112576 314 A C TATCTTCATTTTTATTATAGACTCTCTGAACCAATTTGCCCTGAGGCAGACTTTTTAAAGTACTGTGTAATGTATGAAGTCCTTCTGCTCAAGCAAATCATTGGCATGAAAACAGTTGCAAACTTATTGTGAGAGAAGAGTCCAAGAGTTTTAACAGTCTGTAAGTATATAGCCTGTGAGTTTGATTTCCTTCTTGTTTTTnTTCCAGAAACATGATCAGGGGCAAGTTCTATTGGATATAGTCTTCAAGCATCTTGATTTGACTGAGCGTGACTATTTTGGTTTGCAGTTGACTGACGATTCCACTGATAACCCAGTAAGTTTAAGCTGTTGTCTTTCATTGTCATTGCAATTTTTCTGTCTTTATACTAGGTCCTTTCTGATTTACATTGTTCACTGATT diff -r 8a4b8efbc82c -r d6b961721037 extract_primers.xml --- a/extract_primers.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/extract_primers.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,9 +11,9 @@ - + - + @@ -46,30 +46,46 @@ +**Dataset formats** + +The input dataset is in tabular_ format and must contain a scaffold or +chromosome column and a position column. The output dataset is in text_ +format as described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool extracts primers for SNPs in the dataset using the Primer3 program. - The first line of output for a given SNP reports the name of the assembled - contig, the SNP's position in the contig, the two variant nucleotides, and - Primer3's "pair penalty". The next line, if not blank, names restriction - enzymes (from the user-adjustable list) that differentially cut at that - site, but do not cut at any other position between and including the - primer positions. The next lines show the SNP's flanking regions, with - the SNP position indicated by "n", including the primer positions and an - additional 3 nucleotides. +This tool extracts primers for SNPs in the dataset using the Primer3 program +(Steve Rozen and Helen J. Skaletsky, 2000). +The first line of output for a given SNP reports the name of the assembled +contig, the SNP's position in the contig, the two variant nucleotides, and +Primer3's "pair penalty". The next line, if not blank, names restriction +enzymes (from the user-adjustable list) that differentially cut at that +site, but do not cut at any other position between and including the +primer positions. The next lines show the SNP's flanking regions, with +the SNP position indicated by "n", including the primer positions and an +additional 3 nucleotides. + ----- **Example** -- input file:: +- input (gd_snp format):: chr5_30800874_30802049 734 G A chr5 30801606 A 24 0 99 4 11 97 Y 496 0.502 0.033 0.215 6 chr8_55117827_55119487 994 A G chr8 55118815 G 25 0 102 4 11 96 Y 22 0.502 0.025 2.365 1 chr9_100484836_100485311 355 C T chr9 100485200 T 27 0 108 6 17 100 Y 190 0.512 0.880 2.733 4 chr12_3635530_3637738 2101 T C chr12 3637630 T 25 0 102 4 13 93 Y 169 0.554 0.024 0.366 4 + etc. -- output file:: +- output:: chr5_30800874_30802049 734 G A 0.352964 BglII,MboI,Sau3AI,Tru9I,XhoII diff -r 8a4b8efbc82c -r d6b961721037 find_intervals.xml --- a/find_intervals.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/find_intervals.xml Mon Nov 05 12:44:17 2012 -0500 @@ -22,41 +22,41 @@ - + - + - + - + - - + + - - - + + + - - + + @@ -105,14 +105,14 @@ For gd_snp format the metadata can be used to specify the chromosome and position. Other inputs include -a percentage or raw score for the "cutoff" which should be greater than the +a percentage or raw score for the "score-shift" which should be greater than the average value for the scores column. A higher value will give smaller intervals in the output. If a percentage (e.g. 95%) is specified -then that percentile of the scores is used as the cutoff; +then that percentile of the scores is used as the shift; percentile may not work well if many rows or SNPs have the same score (in that case use a raw score). The program subtracts the -cutoff from every score, then finds genomic intervals (i.e., consecutive runs +shift from every score, then finds genomic intervals (i.e., consecutive runs of SNPs) whose total score cannot be increased by adding or subtracting one or more adjusted scores at the ends of the interval. Another input is the number of times the diff -r 8a4b8efbc82c -r d6b961721037 map_ensembl_transcripts.xml --- a/map_ensembl_transcripts.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/map_ensembl_transcripts.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,8 +11,10 @@ - - + + + + @@ -34,9 +36,46 @@ +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have a column with an ENSEMBL transcript ID and have +the database/build set. Even though positions are not needed the correct +database/build must be given to look up the pathways. +The output dataset will have added columns for the pathway. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** -Adds the fields KEGG gene codes and KEGG pathways to an input table of ENSEMBL transcript codes. +Adds the fields "KEGG gene ID" and "KEGG pathways" to an input table of ENSEMBL +transcript IDs. A "U" in the KEGG gene ID field indicates that the +tool cannot link the ENSEMBL transcript ID to a KEGG gene ID. +An "N" in the pathway field means the KEGG pathway is unknown. + +----- + +**Example** + +- input:: + ENSCAFT00000000001 + ENSCAFT00000000144 + ENSCAFT00000000160 + ENSCAFT00000000215 + etc. + +- output:: + + ENSCAFT00000000001 476153 cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + ENSCAFT00000000144 483960 N + ENSCAFT00000000160 610160 N + ENSCAFT00000000215 U N + etc. + diff -r 8a4b8efbc82c -r d6b961721037 pathway_image.xml --- a/pathway_image.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/pathway_image.xml Mon Nov 05 12:44:17 2012 -0500 @@ -6,15 +6,15 @@ "--input=${input}" "--output=${output}" "--KEGGpath=${pathway}" - "--posKEGGclmn=${input.metadata.kegg_path}" - "--KEGGgeneposcolmn=${input.metadata.kegg_gene}" + "--posKEGGclmn=${kpath}" + "--KEGGgeneposcolmn=${kgene}" - - - - + + + + @@ -30,6 +30,8 @@ + + @@ -37,12 +39,45 @@ +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have columns with KEGG gene ID and pathways. +The output dataset is described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** -This tool produces an image of an input KEGG pathway, highlighting the -modules representing genes in an input list. NOTE: a given gene can +This tool produces an image of a KEGG pathway, highlighting (in red) the +modules representing genes in the input dataset. Click here_ for help +with reading the pathway map. + +NOTE: a given gene can be assigned to multiple modules, and different genes can be assigned to the same module. +.. _here: http://www.genome.jp/kegg/document/help_pathway.html + +----- + +**Example** + +- input:: + + 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + 483960 probably damaging N + 610160 possibly damaging N + 403657 benign cfa04010=MAPK signaling pathway.cfa04012=ErbB signaling pathway.cfa04060=Cytokine-cytokine receptor interaction.cfa04144=Endocytosis.cfa04510=Focal adhesion.cfa04540=Gap junction.cfa04810=Regulation of actin cytoskeleton.cfa05160=Hepatitis C.cfa05200=Pathways in cancer.cfa05212=Pancreatic cancer.cfa05213=Endometrial cancer.cfa05214=Glioma.cfa05215=Prostate cancer.cfa05218=Melanoma.cfa05219=Bladder cancer.cfa05223=Non-small cell lung cancer + etc. + +output showing pathway cfa05214: + +.. image:: ${static_path}/images/gd_pathway_image.png + diff -r 8a4b8efbc82c -r d6b961721037 rank_pathways.xml --- a/rank_pathways.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/rank_pathways.xml Mon Nov 05 12:44:17 2012 -0500 @@ -7,19 +7,19 @@ #else if str($output_format) == 'b' calclenchange.py #end if - "--loc_file=${GALAXY_DATA_INDEX_DIR}/gd.rank.loc" - "--species=${input.metadata.dbkey}" - "--input=${input}" - "--output=${output}" - "--posKEGGclmn=${input.metadata.kegg_path}" - "--KEGGgeneposcolmn=${input.metadata.kegg_gene}" + "--loc_file=${GALAXY_DATA_INDEX_DIR}/gd.rank.loc" + "--species=${input.metadata.dbkey}" + "--input=${input}" + "--output=${output}" + "--posKEGGclmn=${kpath}" + "--KEGGgeneposcolmn=${kgene}" - - - - + + + + @@ -32,6 +32,8 @@ + + @@ -39,6 +41,18 @@ +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must have columns with KEGG gene ID and pathways. +The output dataset is described below. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** This tool produces a table ranking the pathways based on the percentage @@ -54,23 +68,49 @@ If pathways are ranked by percentage of genes affected, the output is a tabular dataset with the following columns: - 1. number of genes in the pathway present in the input dataset - 2. percentage of the total genes in the pathway included in the input dataset - 3. rank of the frequency (from high freq to low freq) - 4. name of the pathway +1. number of genes in the pathway present in the input dataset +2. percentage of the total genes in the pathway included in the input dataset +3. rank of the frequency (from high freq to low freq) +4. name of the pathway If pathways are ranked by change in length and number of paths, the output is a tabular dataset with the following columns: - 1. change in the mean length of paths between sources and sinks - 2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) - 3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) - 4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) - 5. change in the number of paths between sources and sinks - 6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) - 7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) - 8. rank of the change in the number of paths between sources and sinks (from high change to low change) - 9. name of the pathway +1. change in the mean length of paths between sources and sinks +2. mean length of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) +3. mean length of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, the length is assumed to be infinite (I) +4. rank of the change in the mean length of paths between sources and sinks (from high change to low change) +5. change in the number of paths between sources and sinks +6. number of paths between sources and sinks in the pathway including the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) +7. number of paths between sources and sinks in the pathway excluding the genes in the input dataset. If the pathway do not have sources/sinks, it is assumed to be a circuit (C) +8. rank of the change in the number of paths between sources and sinks (from high change to low change) +9. name of the pathway + +----- + +**Examples** + +- input (column 10 for KEGG gene ID, column 12 for KEGG pathways):: + + Contig39_chr1_3261104_3261850 414 chr1 3261546 ENSCAFT00000000001 ENSCAFP00000000001 S 667 F 476153 probably damaging cfa00230=Purine metabolism.cfa00500=Starch and sucrose metabolism.cfa00740=Riboflavin metabolism.cfa00760=Nicotinate and nicotinamide metabolism.cfa00770=Pantothenate and CoA biosynthesis.cfa01100=Metabolic pathways + Contig62_chr1_19011969_19012646 265 chr1 19012240 ENSCAFT00000000144 ENSCAFP00000000125 * 161 R 483960 probably damaging N + etc. + +- output ranked by percentage of genes affected:: + + 3 0.25 1 cfa03450=Non-homologous end-joining + 1 0.25 1 cfa00750=Vitamin B6 metabolism + 2 0.2 3 cfa00290=Valine, leucine and isoleucine biosynthesis + 3 0.18 4 cfa00770=Pantothenate and CoA biosynthesis + etc. + +- output ranked by change in length and number of paths:: + + 3.64 8.44 4.8 2 4 9 5 1 cfa00260=Glycine, serine and threonine metabolism + 7.6 9.6 2 1 3 5 2 2 cfa00240=Pyrimidine metabolism + 0.05 2.67 2.62 6 1 30 29 3 cfa00982=Drug metabolism - cytochrome P450 + -0.08 8.33 8.41 84 1 30 29 3 cfa00564=Glycerophospholipid metabolism + etc. diff -r 8a4b8efbc82c -r d6b961721037 select_snps.xml --- a/select_snps.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/select_snps.xml Mon Nov 05 12:44:17 2012 -0500 @@ -11,12 +11,12 @@ - + - + @@ -50,17 +50,27 @@ +**Dataset formats** + +The input and output datasets are in tabular_ format. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - This tool attempts to select a specified number of SNPs from the dataset, making them - approximately uniformly spaced relative to the reference genome. The number - actually selected may be slightly more than the specified number. +This tool attempts to select a specified number of SNPs from the dataset, making +them approximately uniformly spaced relative to the reference genome. The number +actually selected may be slightly more than the specified number. ----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -74,7 +84,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. -- output file:: +- output:: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 diff -r 8a4b8efbc82c -r d6b961721037 specify_restriction_enzymes.xml --- a/specify_restriction_enzymes.xml Tue Oct 23 14:38:04 2012 -0400 +++ b/specify_restriction_enzymes.xml Mon Nov 05 12:44:17 2012 -0500 @@ -12,9 +12,9 @@ - + - + @@ -54,17 +54,28 @@ +**Dataset formats** + +The input and output datasets are in tabular_ format. +The input dataset must contain columns for scaffold or chromosome and position. +(`Dataset missing?`_) + +.. _tabular: ./static/formatHelp.html#tab +.. _Dataset missing?: ./static/formatHelp.html + +----- + **What it does** - It selects the SNPs that are differentially cut by at least one of the - specified restriction enzymes. The enzymes are required to cut the amplified - segment (for the specified PCR primers) only at the SNP. +It selects the SNPs that are differentially cut by at least one of the +specified restriction enzymes. The enzymes are required to cut the amplified +segment (for the specified PCR primers) only at the SNP. ----- **Example** -- input file:: +- input (gd_snp format):: chr2_75111355_75112576 314 A C L F chr2 75111676 C F 15 4 53 2 9 48 Y 96 0.369 0.355 0.396 0 chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 @@ -78,7 +89,7 @@ chr19_39866997_39874915 3117 C T P P chr19 39870110 C P 3 7 65 14 2 32 Y 6 0.321 0.911 0.462 4 etc. -- output file:: +- output:: chr8_93901796_93905612 2471 A C A A chr8 93904264 A A 8 0 51 10 2 14 Y 961 0.016 0.534 0.114 2 chr14_80021455_80022064 138 G A H H chr14 80021593 G H 14 0 69 9 6 124 Y 377 0.118 0.997 0.195 1